Added a rate limiter for load reduction on the website #52
base: master
Conversation
Hi, it's a nice addition; however, `ratelimit` does not seem to be part of the standard library. Is it an additional library (https://pypi.org/project/ratelimit/)?
Thank you. Yes, that is indeed the library that is needed.
```diff
@@ -9,3 +9,6 @@
 xml_footer = "</urlset>"

 crawler_user_agent = 'Sitemap crawler'
+
+number_calls = 1  # number of requests per call period
+call_period = 15  # time in seconds per number of requests
```
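To make the two new settings concrete, here is a hedged, stdlib-only sketch (not code from this PR; `throttled` and `fetch` are hypothetical names) of the behavior they describe: at most `number_calls` requests per `call_period` seconds, enforced by sleeping between requests.

```python
import time

number_calls = 1   # requests allowed per window, as in the diff above
call_period = 15   # window length in seconds

def throttled(urls, fetch, calls=number_calls, period=call_period):
    """Call fetch(url) for each url, sleeping between calls so that
    consecutive requests are spaced at least period / calls seconds apart."""
    spacing = period / calls  # 15.0 s between consecutive requests here
    last = None
    results = []
    for url in urls:
        now = time.monotonic()
        if last is not None and now - last < spacing:
            time.sleep(spacing - (now - last))  # wait out the remainder of the window
        last = time.monotonic()
        results.append(fetch(url))
    return results
```

With the defaults above, a crawl of 100 pages would take at least 25 minutes, which is the point: spreading requests out reduces load on the target site.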
The default should be no rate limiting, right?
No limit would be in line with how the original code works, yes. I could not find a "no limit" option in the `ratelimit` package, but you could set the default to a very high number as a workaround.
One possibility would be having something like this in crawler.py:

```python
if config.number_calls is None or config.call_period is None:
    rate_limit_decorator = lambda func: func
else:
    rate_limit_decorator = limits(calls=config.number_calls, period=config.call_period)
```

I haven't dealt with `@sleep_and_retry`, but I believe it should be possible to combine that into `rate_limit_decorator`. Of course, this strategy comes at the cost of increasing the complexity of the code.
Coincidentally, I'm adding a
Without this, I reckon this PR can't be merged; otherwise, it won't work for people who don't happen to have `ratelimit` installed.
@Garrett-R I think it would still be a nice addition.
Hi, rate limiting is a good addition, but I'm not a big fan of a third-party library. Not because it's a third-party library as such, but because I really like the idea of a "bare metal" tool. What do you think, @Garrett-R @Bash-? Is having to rely on a third-party library a problem for you?
Yeah, it's an interesting point, and I wasn't sure what you'd think of introducing a dependency. A couple of factors I'd be weighing if I were you:
On the point about how tough the package is to use for folks, I actually think we should put this on PyPI (made an issue here). In that case folks would just have to do `pip install python-sitemap` and the extra dependencies would be handled automatically. I'd probably cast my vote for having dependencies, but I can definitely see benefits to both approaches, so I support whatever you think is best.
You know, in light of the […] Maybe close this one?
We found that websites considered the scraper too resource-intensive. Therefore I added this configurable rate limiter, making it possible to decrease the number of requests per time period.