Disk cache failure with large db sizes #7793

Open
marcokusa opened this issue Jan 14, 2025 · 4 comments

Comments

@marcokusa

Hi,

We have a relatively large AD deployment with provider = ldap and are severely affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1886492

We have tried to mitigate the issue with:

  1. ignore_group_members = true
  2. lower values of entry_cache_timeout, entry_cache_user_timeout, entry_cache_group_timeout
  3. lower value of ldap_purge_cache_timeout in conjunction with 2.
  4. ldap_group_search_base filtering when possible

Despite this, hosts that are accessed by a higher number of users (NFS servers) still grow the database too quickly. This is the current performance of a user lookup once the memcache entry has expired; it gets progressively worse until sssd can no longer answer queries:

id user, db 22M -> 7.0s
id user, db 43M -> 14s
id user, db 100M -> 30s
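
A minimal sketch of how figures like the ones above could be collected repeatably, so that lookup time can be tracked against cache size over time. The helper names (`dir_size_bytes`, `time_command`, `sample`) are hypothetical and not part of SSSD. The default path assumes the cache lives in /var/lib/sss/db, which is normally readable only by root, and the timing is only meaningful when the memcache entry for the user has already expired.

```python
import os
import subprocess
import time


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def time_command(argv):
    """Run a command, discard its output, and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.perf_counter() - start


def sample(user, db_dir="/var/lib/sss/db"):
    """Return (cache size in MiB, seconds taken by `id user`).

    db_dir is an assumption; an unreadable directory is reported as size 0.
    """
    size_mib = dir_size_bytes(db_dir) / (1024 * 1024)
    return size_mib, time_command(["id", user])
```

Calling `sample("user")` as root at regular intervals and logging both values gives one data point per run in the same shape as the list above.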

We were counting on purging the disk cache frequently enough with ldap_purge_cache_timeout, but we found that once the db reaches a certain size, the ldap purge process is unable to complete (it appears to time out; there is no detailed information even at the highest debug level on the ldap backend). As a result the db does not shrink. The purge is also a blocking operation that hangs queries while it runs, so running it frequently is less than ideal.

Ultimately, as the db grows further, sssd becomes unresponsive and the only way to recover is to delete the disk cache manually and restart the service.

We have had a case open with Red Hat for over a year. They opened an internal case, SSSD-5812, to which we have no access, and we have been given no updates in that time.

We understand that the disk cache performance might be related to missing indexes as specified in https://bugzilla.redhat.com/show_bug.cgi?id=1886492 but it's not clear why this was marked as CLOSED WONTFIX or if there is a plan to resolve.

Ultimately for us it would be acceptable to have an option to disable the disk cache completely and rely exclusively on memcache, but I understand that is not supported currently? Or if the cache purge timeout issue can be resolved that would also work for us.

Any suggestions appreciated

Thanks

@alexey-tikhonov
Member

I guess you already have /var/lib/sss/db mounted on 'tmpfs'?

  • SSSD-5812 isn't relevant
  • it is not about a missing index (it was shown in comments that were private because of customer logs)
  • the proper solution is that 'sysdb_add_group_member_overrides()' could be skipped completely in case "ignore_group_members=true"

What is your platform?

@marcokusa
Author

  • We have tested mounting on tmpfs and it makes no difference; the problem is not related to storage speed. Our hosts have large amounts of RAM and the db would be in the page cache anyway.
  • Would skipping sysdb_add_group_member_overrides resolve the database query performance issue, or would it just cause the db to grow more slowly and eventually run into the same issue? Could you explain why the performance of the db degrades so dramatically if it's not an indexing issue?

Is it feasible to provide an option to not use the ldb disk cache at all and allow us to rely on memcache exclusively?

Our main platforms are RHEL8 & 9.

Thanks again

@alexey-tikhonov
Member

Is it feasible to provide an option to not use the ldb disk cache at all and allow us to rely on memcache exclusively?

No. Cache is the only way a responder can get info from a backend: https://sssd.io/contrib/architecture.html

id user, db 100M -> 30s

What is the real use case that is slow?
That's not id, right?

Could you maybe profile an issue in your env?
(1) first of all, are getent passwd user and groups user slow or fast?
(2) can you get an ltrace and strace (with timestamps) of id and see what takes the time - is it resolution of the groups the user is a member of (getgrgid())?
(3) I guess everything is cached so all the time is spent in sssd_nss - right? Could you maybe profile this process while running id?
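
As a rough complement to ltrace/strace, the stages that id goes through can be timed separately from a small script. This is only a sketch: it assumes id roughly performs a user lookup, a group list request, and one getgrgid() call per group, and the helper names (`timed`, `resolve_gid`, `profile_id`) are made up for illustration. It measures client-side latency only and says nothing about where sssd_nss itself spends the time.

```python
import grp
import os
import pwd
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def resolve_gid(gid):
    """Resolve a GID via getgrgid(); return None if it has no group entry."""
    try:
        return grp.getgrgid(gid)
    except KeyError:
        return None


def profile_id(username, top=5):
    """Time the NSS stages that `id username` is assumed to go through."""
    timings = {}
    # Stage 1: user lookup, the same request `getent passwd user` makes.
    entry, timings["getpwnam"] = timed(pwd.getpwnam, username)
    # Stage 2: list of group IDs the user belongs to (getgrouplist()).
    gids, timings["getgrouplist"] = timed(
        os.getgrouplist, username, entry.pw_gid
    )
    # Stage 3: resolve each GID to a group entry, one getgrgid() per group.
    per_group = []
    for gid in gids:
        _group, elapsed = timed(resolve_gid, gid)
        per_group.append((gid, elapsed))
    timings["getgrgid_total"] = sum(elapsed for _gid, elapsed in per_group)
    # Keep the slowest few groups so a single expensive group stands out.
    per_group.sort(key=lambda item: item[1], reverse=True)
    timings["slowest_groups"] = per_group[:top]
    return timings
```

If `getgrgid_total` dominates, that would point at group resolution rather than the user lookup itself, which is the distinction questions (1) and (2) are trying to establish.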

@alexey-tikhonov
Member

  • Could you explain why the performance of the db degrades so dramatically if it's not an indexing issue?

What I saw in that ticket is described in https://bugzilla.redhat.com/show_bug.cgi?id=1886492#c6
I don't know if your case is the same.


2 participants