
Conversation

yusufozgur

Hello,

I needed to use wiki-rag with a rather large wiki. Since I knew the wiki allows bot scraping, I added an option to disable rate limiting, which is essentially a wait of a random number of seconds between requests. The new option is controlled by an environment variable called ENABLE_RATE_LIMITING. By default it is enabled (ENABLE_RATE_LIMITING=true) to discourage users from abusing servers. With rate limiting disabled, testing on a MediaWiki-based wiki such as https://naruto.fandom.com/ reduced the estimated fetching time from 8 hours to 1.5 hours.
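For context, a minimal sketch of how such a toggle can work. The helper name `rate_limit_pause` is a hypothetical one for illustration, and the 2-3 second range is inferred from the review below; the actual wiki-rag code may differ:

```python
# Minimal sketch of an env-controlled rate-limit toggle; helper name and
# delay range are illustrative assumptions, not the actual wiki-rag code.
import os
import random
import time

# Default to enabled so users don't hammer servers unintentionally.
RATE_LIMITING_ENABLED = os.getenv("ENABLE_RATE_LIMITING", "true").lower() == "true"

def rate_limit_pause(min_seconds: float = 2.0, max_seconds: float = 3.0) -> None:
    """Sleep for a random interval between requests, unless disabled."""
    if RATE_LIMITING_ENABLED:
        time.sleep(random.uniform(min_seconds, max_seconds))
```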

Best,

Yusuf


stronk7 commented Aug 30, 2025

Hi @yusufozgur,

thanks for the contribution, it looks 99% perfect. The only detail is that, for consistency, it would surely be a good idea to also apply the new setting here: https://github.com/moodlehq/wiki-rag/blob/main/wiki_rag/load/util.py#L82 (that's where all the namespace pages metadata is fetched, to build the list of target pages).

Also, it would be good to apply the same 2-3 second randomness there, instead of the current 2-5 second one, again just to keep both rate limits the same.
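For illustration only, a hypothetical before/after of what that change might look like at the linked spot; `rate_limit_pause` is the helper sketched above, and the surrounding fetch loop is omitted:

```python
# Hypothetical before/after; the actual code in wiki_rag/load/util.py
# may differ, and `rate_limit_pause` is the assumed helper from above.
import random
import time

# Before (assumed): an unconditional 2-5 second pause between requests.
time.sleep(random.uniform(2, 5))

# After: the shared, env-controlled pause with the same 2-3 second range
# used for page fetching, so both rate limits behave identically.
rate_limit_pause(min_seconds=2.0, max_seconds=3.0)
```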

With that tiny change, I think this can be applied without problems.

Thanks!

PS: Edited to add: one of the planned improvements is to allow incremental loads, so only changed/new/deleted pages are fetched and re-processed. I expect that will dramatically reduce loading times. Later, I'd like to apply the same to the indexing, also making it incremental, but that's less critical because indexing is orders of magnitude quicker than loading.
