Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Password auth design doc #32005

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jubrad
Copy link
Contributor

@jubrad jubrad commented Mar 25, 2025

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

WIP password auth design doc
## Open questions
1. **Where exactly are passwords stored**: While it makes sense to store passwords in the catalog, I have not yet identified exactly where they should be stored. Should it go alongside `mz_roles`, or in a new `mz_auth_id` table. The latter makes sense as it could store password metadata, and might let us store multiple passwords down the road.

2. **Audit Logging and Brute Force Detection**: Do we need to log all failed password attempts and the IPs of those attempts? Do we need to take action to lock an IP or User when an attempt threshold is met?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any action taken should be user configurable. We don't want to allow DOS attacks by a user just spamming incorrect passwords, for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @alex-hunt-materialize - there's too much risk to availability from DOS or automatic retries with an incorrect password.
Our default behavior should log all login attempts, and we should aim to replicate PostgreSQL log messages. This will allow customers to send the data to their SIEM and/or use fail2ban or another tool to act on any failures.
We'd need to add a way for customers to lock users with fail2ban or a script, but I think we can leave that out for now unless it is very simple to implement.

here's some discussion on how to configure fail2ban w/ Postgres, for example
https://gist.github.com/rc9000/fd1be13b5c8820f63d982d0bf8154db1

@SangJunBak
Copy link
Contributor

SangJunBak commented Mar 26, 2025

Regarding closing off the internal port for Cloud in the cloud sync:

  • We should consider Teleport too since i think it uses some internal port?
  • Are we able to easily disable this for the emulator?

@jubrad
Copy link
Contributor Author

jubrad commented Mar 26, 2025

Option A) - preferred
We should consider adding confimap syncing as a self-managed replacement for LD for system vars. If we do this we can probably drop the need for mz_system (in self-managed) and just close off the internal port for self-managed, perhaps not even start it up.

Option B) - backup

  • We need to evaluate other mz_system capabilities and ensure that it is not needed for any other purposes.
    If mz_system is needed then we may need to optionally allow external port login for mz_system with a password.

@alex-hunt-materialize
Copy link
Contributor

We should consider Teleport too since i think it uses some internal port?

Teleport prefers mutual TLS, and we've actually gone out of our way to disable it. It's been on the roadmap for a long time to get that working, but never got to it.

@DAlperin
Copy link
Contributor

Here are my thoughts on implementation:

  • A system table for storing sensitive role data like the (hashed and salted) password of a role
    • In the future we could transition this to secrets once those are supported for self managed
  • We will expand the adapter::Client surface area to expose an API to the protocol layer which can be used to exchange a username and password for a role
    • when running in cloud we will continue to use frontegg and this codepath will be unexercised
  • we will do the same adapter call or equivalent in tower middlewares for the http endpoint

The rest of it as laid out in the doc makes sense to me

@jubrad
Copy link
Contributor Author

jubrad commented Mar 27, 2025

On system users... We might be able to pass in a bootstrap user that is also considered internal/system. This user should be considered superuser and internal user and it's password should come from a k8s secret. The k8s secret passed in can be used as it's password secrets reference in the catalog.


### 4. Configurable admin system login:

Passwords for `mz_system` and `mz_support` roles will be settable via environment variables `MZ_MZ_<USER>_EXTERNAL_LOGIN_PASSWORD`. Orchestratord should provide a set of parameters to set these variables via a Kubernetes secret. Additionally, We should enable login of system users through the external ports when they have external login passwords set. Login through the external port must not be possible unless this flag is set, this logic should not rely on whether the internal user has a password.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Should be MZ_<USER>_EXTERNAL_LOGIN_PASSWORD given mz_system would give MZ_MZ_MZ_SYSTEM_EXTERNAL_LOGIN_PASSWORD

Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the user should be last? MZ_EXTERNAL_LOGIN_PASSWORD_<USER>

That also gives a consistent prefix when grepping.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

EXTERNAL_LOGIN_PASSWORD_<USER>

with a env variable of

MZ_EXTERNAL_LOGIN_PASSWORD_<USER>

@SangJunBak
Copy link
Contributor

SangJunBak commented Mar 27, 2025

Still not clear to me the flow of setting up auth. In my head, it looks like:

  1. As they initially set up via the helm chart, orchestratord will have the env variable to set the password of the mz_system user. This is probably configurable via the helm chart?
  2. Once the user deploys the materialize CR, orchestratord will pass that env variable in order to start the environment with an mz_system type user
  3. The user can now login via that user and add more roles

Does that sound correct? Then we just make it a limitation that a user can't use the Console unless there's a registered role. In the Console, we'll most likely need the following information

  • Is (password) auth enabled?
  • Does there exist an authenticated user?
    But I'm also confused how the Console might receive this information if we can't query the database to begin with.

Should also consider the migration path from self managed customers without auth to customers with auth.


**Password Rotation or Expiration**: While not currently in scope, the design should allow for additions in the future, such as password rotation and expiration.

**Abuse Detection**: Action should be taken action when brute force attempts are detected, either by locking the user or blocking the requesting client IP.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we implement this, it should be optional. We don't want to enable DOS attacks just by spamming broken passwords at us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should write logs and let customers use fail2ban or whatever solutions they use elsewhere.


- **Password Dating**: The creation date of passwords will be recorded along with the hashed and salted password. When the password changes a creation date should be reset. The creation date of passwords should be queryable.

- **OPTIONAL: Password Versioning**: Updates to the password hashing mechanisms may be required. As long as we are receiving passwords in plain text we should be able to take a validated password and replace the existing a new securely hashed value. This may require prefixing the password with data about the hash algorithm or parameters.
Copy link
Contributor

@alex-hunt-materialize alex-hunt-materialize Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is trivial so we should just do it non-optionally. We don't have to implement any replacement now, just store the metadata about the algorithm and parameters.

EDIT: To clarify, the trivial part is to store things like the algorithm used and parameters like number of iterations, not implementing the full replacement process.


### 4. Configurable admin system login:

Passwords for `mz_system` and `mz_support` roles will be settable via environment variables `MZ_MZ_<USER>_EXTERNAL_LOGIN_PASSWORD`. Orchestratord should provide a set of parameters to set these variables via a Kubernetes secret. Additionally, We should enable login of system users through the external ports when they have external login passwords set. Login through the external port must not be possible unless this flag is set, this logic should not rely on whether the internal user has a password.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may need an updated Materialize CRD to add a new field for the name of the secret.


3. **Store Passwords in K8s Secrets**: It would be viable to store passwords in Kubernetes secrets.

- **Reasons Not Chosen**: This may create an untenable number of Kubernetes secrets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we just discussed this as what we were going to do for our initial implementation. This is also just a detail of the implementation (where we store things).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this is from the original pass. Let's remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this some more and talking it over, I think we should just store passwords in the catalog. This is considered user metadata, it seems like it should be relatively simple to keep there may simplify security reviews.

see:
https://materializeinc.slack.com/archives/C07PN7KSB0T/p1743173985961509

@alex-hunt-materialize
Copy link
Contributor

@SangJunBak +1 to everything you said.

But I'm also confused how the Console might receive this information if we can't query the database to begin with.

Maybe we can pass that to the console also? Maybe through an env var? That gets a bit tricky, since I think we update the balancers and console immediately upon getting an updated Materialize, while only updating environmentd if we trigger a rollout. Perhaps this logic needs to change, but that may be difficult too.

@jubrad
Copy link
Contributor Author

jubrad commented Mar 27, 2025

Still not clear to me the flow of setting up auth. In my head, it looks like:

  1. As they initially set up via the helm chart, orchestratord will have the env variable to set the password of the mz_system user. This is probably configurable via the helm chart?
  2. Once the user deploys the materialize CR, orchestratord will pass that env variable in order to start the environment with an mz_system type user
  3. The user can now login via that user and add more roles

Does that sound correct? Then we just make it a limitation that a user can't use the Console unless they have . In the Console, we'll most likely need the following information

  • Is (password) auth enabled?
  • Does there exist an authenticated user?
    But I'm also confused how the Console might receive this information if we can't query the database to begin with.

Should also consider the migration path from self managed customers without auth to customers with auth.

Your initial flow is correct!

Then we just make it a limitation that a user can't use the Console unless they have .

Unless they have what?

The standard auth for postgres has login and non login users see pg_authid. For logins, one can first check that the role allows login then compare passwords. This would be done fully on the server side. I imagine we want to enable rolcanlogin if a password is set, and by that just check to see password.is_some() or something.

@SangJunBak
Copy link
Contributor

@jubrad

Unless they have what?
Ah my bad. Unless there's a registered role! Edited.

The standard auth for postgres has login and non login users see pg_authid.

But I mean in order for the Console to query pg_authid via the HTTP API, wouldn't they need to be authenticated?

@SangJunBak
Copy link
Contributor

Maybe we can pass that to the console also? Maybe through an env var? That gets a bit tricky, since I think we update the balancers and console immediately upon getting an updated Materialize

@alex-hunt-materialize and with the current design, an env variable won't do much good given the Console's already built by the time orchestratord installs the console pod. There's currently no way for orchestratord to edit build variables for the Console, which I think we should fix!

@jubrad
Copy link
Contributor Author

jubrad commented Mar 27, 2025

Unless they have what?
Ah my bad. Unless there's a registered role! Edited.

The standard auth for postgres has login and non login users see pg_authid.

But I mean in order for the Console to query pg_authid via the HTTP API, wouldn't they need to be authenticated?

Console shouldn't need to query pg_authid. It should just need to attempt to connect using username and password. If the role isn't a login role or there's a password mismatch that should all be handled server side. To determine if auth is enabled it could just attempt to run a query and, on failed login, route to the login box, or we can install a configurable endpoint on NGINX that console can hit to tell it what kind of auth it should attempt to use.

@jubrad jubrad force-pushed the password-auth-design-doc branch from 291b450 to 2c5754d Compare March 28, 2025 18:52
Copy link
Contributor

@SangJunBak SangJunBak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my end! Thanks for writing all this out!

@@ -127,7 +130,7 @@ The authentication mechanism for an environment must be configurable. A new flag

### Note on Console builds:

We currently re-use the same Console build for the emulator and self-managed. For practical purposes we can build both with password auth enabled with no option to disable.
Console does currently do not support runtime or startup configuration. Configuration is handled only at build time. To resolve this we should add a `config.json` or `config.js` file which can be mounted directly into the Nginx container assets. This file should come from a materialize-console config map which must be setup by Orchestratord. We will also need changes to the console to support reading in configuration from this map. The initial config value here should be `authentication_type: password`, in cloud we should use `authentication_type: frontegg` or `authentication_type: jwt`. The console build process can still be used to set default values for this config file.
Copy link
Contributor

@SangJunBak SangJunBak Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case where someone wants to bump the Console version or if we ever need to reinstall the console service, we can most likely just swap out the config file given the Console will get it on runtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll most likely just block the UI until the config map is loaded in (i.e. via a loading screen). This is fine given loading it in should be really quick.

@jubrad jubrad marked this pull request as ready for review April 1, 2025 01:00
Copy link
Contributor

@jasonhernandez jasonhernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My "hard requirements"

  1. Set the minimum iteration count >=200,000.
  2. Do not allow lockouts by default.
  3. Log successful and failed login attempts by default.

I've made some additional comments and recommendations for discussion / consideration. Thanks for writing this all up and the consideration in this design.

a. The coordinator layer will check the password against the stored hash. If the password is correct, the user will be authenticated.
b. If the password is incorrect, the user will be denied access.

3. **OPTIONAL: Book-Keeping**: Login and failed login requests should be tracked. Metrics should be created to monitor potential security events.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log successful and failed login attempts, as closely to what Postgres does as possible.


**Password Rotation or Expiration**: While not currently in scope, the design should allow for additions in the future, such as password rotation and expiration.

**Abuse Detection**: Action should be taken action when brute force attempts are detected, either by locking the user or blocking the requesting client IP.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should write logs and let customers use fail2ban or whatever solutions they use elsewhere.


### 3. Hashed Password Storage

- **Secure Hashing**: For future compatibility with SCRAM, passwords should be hashed using `scram-sha-256` encryption. The RFC for [SCRAM](https://datatracker.ietf.org/doc/html/rfc5802) dictates an iteration count of at least 4096. However, the RFC for SCRAM-SHA-256 dictates that "[the hash iteration-count should be such that a modern machine will take 0.1 seconds to perform the complete algorithm](https://www.rfc-editor.org/rfc/rfc7677)". This also dictates at least 4096 with a recommendation of 15000. We should aim for 15000 if it does not negatively impact connection times or load.
Copy link
Contributor

@jasonhernandez jasonhernandez Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFC is 10 years old.
I ran a quick benchmark on two computers:
t4g.nano - 200,000 iterations in 0.100s
Macbook Pro M3 Max - 450,000 iterations in 0.102s

I recommend we plan on a minimum of 400,000 iterations for now. I would also like us to store the iteration count in a column in the database. If practical, it would be very nice to randomly generate an iteration count for each password, perhaps in the range of 400,000-450,000 iterations. This would significantly reduce the probability of birthday attacks and the usefulness of rainbow tables. We will want to track the iteration count for passwords regardless, because we may need to increase the number of iterations as computers get faster.

I will try and do some more research and might adjust my recommended iteration count.

(note: edited to increase iterations given the 2021 draft RFC https://www.ietf.org/archive/id/draft-ietf-kitten-password-storage-07.txt)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (expired) draft RFC from 2021 recommends a minimum of 310,000 iterations.
https://www.ietf.org/archive/id/draft-ietf-kitten-password-storage-07.txt

We should also specify a length on salts. We should use at least 32 bytes of cryptographically random data for salts.


3. **Store Passwords in K8s Secrets**: It would be viable to store passwords in Kubernetes secrets.

- **Reasons Not Chosen**: This may create an untenable number of Kubernetes secrets, and may cause issues with passing security reviews.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Kubernetes secrets also creates some additional complexity for backup / restore / migration. There's an advantage to keeping the same mental model as Postgres where possible.


2. **Audit Logging and Brute Force Detection**: Do we need to log all failed password attempts and the IPs of those attempts? Do we need to take action to lock an IP or User when an attempt threshold is met?

3. **Password Strength Validation**: Should Materialize include built-in password strength validation, or should it be handled outside the system (e.g., via Kubernetes or other security tools)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is necessary yet. If we included the requirements, we would want to include an override setting for developer mode.
If we were to implement this, I would suggest copying the behavior of Postgres passwordcheck without cracklib + requiring an entropy score of 3+ from zxcvbn.
https://crates.io/crates/zxcvbn

Docs on Postgres passwordcheck
https://www.postgresql.org/docs/current/passwordcheck.html
https://github.com/postgres/postgres/blob/master/contrib/passwordcheck/passwordcheck.c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants