Disk cache failure with large db sizes #7793

marcokusa · 2025-01-14T20:16:26Z

Hi,

We have a relatively large AD deployment with provider = ldap and are severely affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1886492

We have tried to mitigate the issue with:

1. ignore_group_members = true
2. lower values of entry_cache_timeout, entry_cache_user_timeout, entry_cache_group_timeout
3. a lower value of ldap_purge_cache_timeout in conjunction with 2.
4. ldap_group_search_base filtering when possible

Despite this, certain hosts that see access from a higher number of users (NFS servers) still grow the database too quickly. This is the current performance of a user lookup once the memcache entry expires; it gets progressively worse until sssd can no longer answer queries:

id user, db 22M -> 7.0s
id user, db 43M -> 14s
id user, db 100M -> 30s
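
As a rough check of how these numbers scale, the sketch below divides each lookup time by the db size. It uses only the three measurements quoted above and suggests the latency grows roughly linearly with cache size, at about 0.3 s per MB. It says nothing about the cause.

```python
# Measurements quoted above: (cache db size in MB, seconds for `id user`).
measurements = [(22, 7.0), (43, 14.0), (100, 30.0)]

# Seconds of lookup time per MB of cache, one ratio per measurement.
per_mb = [seconds / size_mb for size_mb, seconds in measurements]

for (size_mb, seconds), ratio in zip(measurements, per_mb):
    print(f"{size_mb:>4} MB -> {seconds:>5.1f} s  ({ratio:.2f} s/MB)")

# Spread between the smallest and largest ratio, relative to the smallest.
# A small spread means the latency is close to proportional to db size.
spread = (max(per_mb) - min(per_mb)) / min(per_mb)
print(f"ratio spread: {spread:.0%}")
```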

We were counting on purging the disk cache frequently enough with ldap_purge_cache_timeout, but we found that once the db reaches a certain size, the ldap purge process is unable to complete (as if it times out; there does not seem to be any detailed information even at the highest debug level on the ldap backend). So the db does not shrink. The purge is also a blocking operation that hangs queries while it runs, so running it frequently is less than ideal.

Ultimately, as the db grows further, sssd becomes unresponsive and the only way to recover is to delete the disk cache manually and restart the service.
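
To correlate lookup latency with cache growth before a host reaches this state, the size of the cache files can be logged periodically. The following is a minimal sketch, not an SSSD tool: the helper names are hypothetical, the default directory is the /var/lib/sss/db path mentioned in this thread, and the `*.ldb` pattern is an assumption about how the cache files are named on a given host.

```python
from pathlib import Path


def cache_file_sizes(db_dir="/var/lib/sss/db", pattern="*.ldb"):
    """Return {file name: size in MiB} for files in db_dir matching pattern.

    Both defaults are assumptions; adjust them to the local layout.
    """
    sizes = {}
    for path in sorted(Path(db_dir).glob(pattern)):
        sizes[path.name] = path.stat().st_size / (1024 * 1024)
    return sizes


def report(db_dir="/var/lib/sss/db"):
    """Print one line per cache file, suitable for a cron job or timer."""
    for name, size_mib in cache_file_sizes(db_dir).items():
        print(f"{name}: {size_mib:.1f} MiB")
```

Calling `report()` usually needs root, since the cache directory is not world-readable on a typical install.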

We have had a case open with Red Hat for over a year. They opened an internal case, SSSD-5812, to which we have no access, and we have been given no updates in this time period.

We understand that the disk cache performance might be related to missing indexes as specified in https://bugzilla.redhat.com/show_bug.cgi?id=1886492 but it's not clear why this was marked as CLOSED WONTFIX or if there is a plan to resolve it.

Ultimately for us it would be acceptable to have an option to disable the disk cache completely and rely exclusively on memcache, but I understand that is not supported currently? Or if the cache purge timeout issue can be resolved that would also work for us.

Any suggestions appreciated

Thanks

alexey-tikhonov · 2025-01-15T18:28:12Z

I guess you already have /var/lib/sss/db mounted on 'tmpfs'?

SSSD-5812 isn't relevant
it is not about a missing index (it was shown in comments that were private because of customer logs)
the proper solution is that 'sysdb_add_group_member_overrides()' could be skipped completely in case of "ignore_group_members=true"

What is your platform?

marcokusa · 2025-01-17T17:50:57Z

We have tested mounting on tmpfs and it makes no difference; it's not related to storage speed. Our hosts have large amounts of RAM and the db would be in the page cache anyway.
Would skipping sysdb_add_group_member_overrides resolve the database query performance issue, or would it just cause the db to grow more slowly and eventually run into the same issue? Could you explain why the performance of the db degrades so dramatically if it's not an indexing issue?

Is it feasible to provide an option to not use the ldb disk cache at all and allow us to rely on memcache exclusively?

Our main platforms are RHEL8 & 9.

Thanks again

alexey-tikhonov · 2025-01-17T18:46:21Z

> Is it feasible to provide an option to not use the ldb disk cache at all and allow us to rely on memcache exclusively?

No. Cache is the only way a responder can get info from a backend: https://sssd.io/contrib/architecture.html

> id user, db 100M -> 30s

What is the real use case that is slow?
That's not id, right?

Could you maybe profile an issue in your env?
(1) first of all, are getent passwd user and groups user slow or fast?
(2) can you get an ltrace and strace (with timestamps) of id and see what takes the time: is it the resolution of the groups the user is a member of (getgrgid())?
(3) I guess everything is cached so all the time is spent in sssd_nss - right? Could you maybe profile this process while running id?
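
One way to get a first answer to (1) and (2) without tracing is to time the individual NSS calls that `id user` roughly corresponds to. The sketch below is only an approximation of what `id` does and is not a replacement for ltrace/strace or for profiling sssd_nss; the helper names `timed` and `profile_id` are hypothetical. Results also depend on whether an entry is still served from the memcache.

```python
import grp
import os
import pwd
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def profile_id(user):
    """Time the user lookup, the group list, and each group resolution."""
    timings = {}
    entry, timings["getpwnam"] = timed(pwd.getpwnam, user)
    gids, timings["getgrouplist"] = timed(os.getgrouplist, user, entry.pw_gid)

    per_group = []
    for gid in gids:
        try:
            _, elapsed = timed(grp.getgrgid, gid)
        except KeyError:
            # This gid has no resolvable group entry; skip it.
            continue
        per_group.append((gid, elapsed))

    timings["getgrgid_total"] = sum(t for _, t in per_group)
    # Slowest groups first, to spot a few groups dominating the run.
    timings["slowest_groups"] = sorted(
        per_group, key=lambda pair: pair[1], reverse=True
    )[:5]
    return timings
```

Running `profile_id("user")` for an affected account shows whether the time goes into the passwd lookup, the initgroups-style group list, or the per-group `getgrgid()` calls.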

alexey-tikhonov · 2025-01-17T18:48:24Z

> Could you explain why the performance of the db degrades so dramatically if it's not an indexing issue?

What I saw in that ticket is described in https://bugzilla.redhat.com/show_bug.cgi?id=1886492#c6
I don't know if your case is the same.
