Skip to content

Conversation

@jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Oct 23, 2025

The majority of this PR fills out the CrdbScimProviderStore implementation that the scim2-rs crate's Provider object will use to implement SCIM where CRDB is the durable store for users and groups, and adds a boat load of integration tests.

The notion of a silo user being "active" or not has also been added to support users being deactivated by the SCIM client. Non-SCIM silo users should not be affected.

The majority of this PR fills out the CrdbScimProviderStore
implementation that the scim2-rs crate's Provider object will use to
implement SCIM where CRDB is the durable store for users and groups, and
adds a boat load of integration tests.

The notion of a silo user being "active" or not has also been added to
support users being deactivated by the SCIM client. Non-SCIM silo users
should not be affected.
@jmpesp jmpesp requested a review from papertigers October 23, 2025 14:07
@morlandi7 morlandi7 added this to the 17 milestone Oct 29, 2025
Comment on lines +344 to +356
for user in users {
let groups = self
.get_user_groups_for_user_in_txn(conn, user.identity.id.into())
.await?;

let SiloUser::Scim(user) = user.into() else {
// With the user provision type filter, this should never be
// another type.
unreachable!();
};

returned_users.push(convert_to_scim_user(user, groups));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be able to be done in one query with the list of user IDs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow on PR by @david-crespo in #9325

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is classified as a performance optimization I am inclined to say that we should punt on the addition of a follow on PR until after R17 is cut. The main reasoning behind this is all of the testing done so far by me, James, Angela, and Jay has been against the current implementation and the window for R17 is closing asap.

I would like to give the perf optimaztion more time to soak as we already know some of this code will have to change as we have to fix oxidecomputer/scim2-rs#7 relatively soon anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about running 1000+ queries in a single GET request — has that been exercised?

Copy link
Contributor

@david-crespo david-crespo Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this gist we compared the time to list users before and after my changes and it was 75-80% faster after the optimization, at least on my machine. However both response times were tolerable — the slower one was 260-290ms. The fast one was 62-65ms. On @papertigers machine the slow version was more like 800-900ms.

https://gist.github.com/papertigers/5aa9f341797e0a82e66c3f8188c9701f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified one of the integration tests to do the following:

  • Create 1000 users
  • Create a group with those 1000 users as member
  • Captured the time it takes to hit the GET /Users endpoint

On my machine over multiple runs it seemed to take ~700-900ms.
Again, I think this should be improved upon and I suspect many optimizations can be made here. Regardless I think some of this code will need to be modified anyways to support pagination.

Thoughts on having this go in as is?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to get this landed as is and follow up with tuning work afterwards to unblock customers waiting to validate their integration workflow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if it's tolerable for now and considered a starting point anyway then that's fine.

let SiloUser::Scim(user) = user.into() else {
// With the user provision type filter, this should never be another
// type.
unreachable!();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to 500 here? This will be a 500, I guess, but doing it explicitly lets you stick a message on it. This could be an argument against using diesel::result::Error as your error type for these functions. Most of the datastore functions return external::Error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started prototyping a new error type that can be returned however the transaction retry calls are expecting a diesel::result::Error . So I don't think we can return another type there, which is probably fine since we do have the filter on the query already.

let maybe_other_user = dsl::silo_user
.filter(dsl::silo_id.eq(self.authz_silo.id()))
.filter(dsl::user_provision_type.eq(model::UserProvisionType::Scim))
.filter(lower(dsl::user_name).eq(lower(user_request.name.clone())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the DB itself isn't enforcing case-sensitive uniqueness, is it possible that two concurrent requests could write different cases of the same name at the same time? You might want a unique index like CREATE UNIQUE INDEX ON silo_user (silo_id, LOWER(user_name)) WHERE user_provision_type = 'scim' AND time_deleted IS NULL.

Copy link
Contributor

@papertigers papertigers Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think two concurrent requests could race in this way because of the other calls to lower in the filters but a belt and suspenders approach here is the right one.

done in 6285129

// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(203, "scim-users-and-groups-lower"),
KnownVersion::new(202, "scim-actor-audit-log"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I didn't realize this PR already touched this and can probably collapse it into one version bump. Will fix this soonish when back at the keyboard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants