Rate-limit anonymous accesses to expensive pages when the server is overloaded #34167

Draft · wxiaoguang wants to merge 1 commit into main from rate-limit-anonymous-access
Conversation

@wxiaoguang (Contributor) commented Apr 10, 2025

To help public sites like gitea.com protect themselves from crawlers:

  1. Limit the maximum number of anonymous requests in flight.
  2. When a new anonymous request arrives and the limit is exceeded:
    • Generate a rate-limit token, then respond with a 503 page.
    • Crawlers will see this 503 page and queue the request themselves, so they can still read the whole site.
    • End users will see a prompt page ("Wait a few seconds, or sign in"); the request is queued by browser JS, so they can still access the pages (see the screenshots below).

To debug (manually at the moment): set OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS=0

This is just a quick PR and some values are hard-coded; if the approach looks good, they could become config options later or in a follow-up PR.
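
For illustration, here is a minimal net/http sketch of the mechanism described above (not the PR's actual code: the cap of 100, the package name and the helper names are made up for this example; only the gitea_arlt cookie name is taken from the diff). The real page delivers the token via inline JS that sets a cookie and reloads; the sketch simply returns it in the 503 body.

```go
package overload

import (
	"crypto/rand"
	"encoding/hex"
	"net/http"
	"sync"
	"time"
)

var (
	inflight      = make(chan struct{}, 100) // illustrative cap on anonymous requests in flight
	pendingTokens sync.Map                   // token -> expiry; a multi-process setup would need a shared store
)

// issueToken creates a short-lived one-shot token for a client that was turned away.
func issueToken() string {
	buf := make([]byte, 16)
	_, _ = rand.Read(buf)
	token := hex.EncodeToString(buf)
	pendingTokens.Store(token, time.Now().Add(time.Minute))
	return token
}

// LimitAnonymous caps concurrent anonymous requests; over the cap, it answers 503
// with a token so the client (browser JS or a well-behaved crawler) can retry.
func LimitAnonymous(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if c, err := req.Cookie("gitea_arlt"); err == nil && c.Value != "" {
			if exp, ok := pendingTokens.Load(c.Value); ok && time.Now().Before(exp.(time.Time)) {
				pendingTokens.Delete(c.Value) // one-shot: the queued retry is let through
				next.ServeHTTP(w, req)
				return
			}
		}
		select {
		case inflight <- struct{}{}: // a slot is free: serve normally
			defer func() { <-inflight }()
			next.ServeHTTP(w, req)
		default: // overloaded: hand out a token and ask the client to come back
			w.Header().Set("Retry-After", "1")
			w.WriteHeader(http.StatusServiceUnavailable) // 503
			_, _ = w.Write([]byte("Server is busy; retry with token " + issueToken()))
		}
	})
}
```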


(screenshots attached: the end-user prompt page described above)

@GiteaBot GiteaBot added the lgtm/need 2 label (This PR needs two approvals by maintainers to be considered for merging) Apr 10, 2025
@github-actions github-actions bot added the modifies/go (Pull requests that update Go code), modifies/templates (This PR modifies the template files) and docs-update-needed (The document needs to be updated synchronously) labels Apr 10, 2025
@wxiaoguang wxiaoguang force-pushed the rate-limit-anonymous-access branch 5 times, most recently from ef82ced to fea3640 on April 10, 2025 02:42
{{if .IsRepo}}{{template "repo/header" .}}{{end}}
<div class="ui container">
{{/*TODO: this page could be improved*/}}
Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
Member

This should be translated.

@@ -780,6 +780,7 @@ LEVEL = Info
;; for example: block anonymous AI crawlers from accessing repo code pages.
;; The "expensive" mode is experimental and subject to change.
;REQUIRE_SIGNIN_VIEW = false
;OVERLOAD_INFLIGHT_ANONYMOUS_REQUESTS =
@silverwind (Member) Apr 11, 2025

I don't think we need to document debug-only options, at least not like that. Maybe we should add a [debug] ini section?

@silverwind (Member) commented Apr 11, 2025

As mentioned in the other PR, I prefer status code 429 over 503 for per-client rate-limited responses. Ideally including a Retry-After header in the response. 503 is fine for general server-load related responses though.
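
For reference, a tiny sketch of that suggestion (illustrative only, not code from this PR): the per-client rejection path answers 429 with a Retry-After hint, while 503 stays reserved for general overload responses.

```go
package overload

import (
	"net/http"
	"strconv"
	"time"
)

// rejectRateLimited answers a per-client rate limit with 429 and a Retry-After
// hint; 503 remains reserved for general server-load related responses.
func rejectRateLimited(w http.ResponseWriter, retryAfter time.Duration) {
	w.Header().Set("Retry-After", strconv.Itoa(int(retryAfter.Seconds())))
	http.Error(w, "too many requests, please retry later", http.StatusTooManyRequests) // 429
}
```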

@wxiaoguang (Contributor, Author) commented

To answer all these comments: this PR is a demo of one of my proposed "anti-crawler actions" in #33951 (comment); that option is not for "debug purposes only".

But there are arguments in #33951. God knows what would happen 🤷

@wxiaoguang wxiaoguang marked this pull request as draft April 13, 2025 08:02
@bohde (Contributor) left a comment

I think this approach is very interesting, and it could complement #33951. One assumption this makes, however, is that if the client runs JavaScript, it is not a crawler, and therefore does not need to be throttled. This doesn't protect in scenarios where an overload is due to a page drawing high amounts of traffic from humans operating browsers, but it also doesn't protect in the case of bots using frameworks such as Playwright to crawl.

It's difficult to distinguish between humans and bots in this scenario, since they are using real browser user agents, but I did run a quick experiment on our production traffic. I deployed a piece of JavaScript that pings a beacon endpoint on the Gitea instance with the URL of the page it is on, which is then logged. I then looked through these logs for common endpoints I see crawlers target. I was able to find traffic from data center IPs in the logs for these beacons. Zooming in on a particular IP address, it did appear to be crawling links in an iterative fashion. I think this demonstrates that there are bots out there that execute JavaScript when crawling Gitea instances.

I think that a slight shift in strategy to work with the QoS middleware would make this approach more resilient. This could change the algorithm from "if you execute JavaScript, you can access the site unrestricted" to "if you don't execute JavaScript, you are more likely to be restricted". This also has the benefit of continuing to protect the server in cases where the overload is due to humans generating traffic.

const tokenCookieName = "gitea_arlt" // gitea anonymous rate limit token
cookieToken, _ := req.Cookie(tokenCookieName)
if cookieToken != nil && cookieToken.Value != "" {
	token, exist := tokenCache.Get(cookieToken.Value)
Contributor

issue: this doesn't work when running multiple Gitea processes, for example to load balance or to enable graceful restarts. Since this code is meant for instances that serve substantial public traffic, they are more likely to be running a multi-process setup.

suggestion: leverage a signed token instead, perhaps a JWT. This avoids the need for some sort of external store for these tokens. However, it also prevents easy revocation of the token, so the token should have an expiration as well. Because of this, we should also embed the request's IP address in the JWT and check that the IP address in the token matches the request's, since large, malicious crawlers often distribute workloads over many IPs to get around IP-based rate limits.
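
A minimal sketch of such a signed token (illustrative only, not this PR's code; a proper JWT library could be used instead of the hand-rolled HMAC shown here): the payload carries the client IP and an expiry, verification recomputes the HMAC and checks both, so no shared store is needed and the token cannot be replayed from other addresses or after it expires.

```go
package overload

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
	"time"
)

var tokenSecret = []byte("replace-with-a-per-instance-secret") // assumption: loaded from config

func sign(payload string) string {
	mac := hmac.New(sha256.New, tokenSecret)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// NewToken binds the token to the requester's IP and an expiry time.
func NewToken(clientIP string, ttl time.Duration) string {
	payload := fmt.Sprintf("%s|%d", clientIP, time.Now().Add(ttl).Unix())
	return base64.RawURLEncoding.EncodeToString([]byte(payload)) + "." + sign(payload)
}

// VerifyToken checks the signature, the expiry, and that the embedded IP
// matches the current request's IP.
func VerifyToken(token, clientIP string) bool {
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return false
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return false
	}
	payload := string(raw)
	if !hmac.Equal([]byte(sign(payload)), []byte(parts[1])) {
		return false
	}
	fields := strings.SplitN(payload, "|", 2)
	if len(fields) != 2 || fields[0] != clientIP {
		return false
	}
	exp, err := strconv.ParseInt(fields[1], 10, 64)
	return err == nil && time.Now().Unix() < exp
}
```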

@wxiaoguang (Contributor, Author) Apr 14, 2025

There are far more problems with running Gitea in a cluster; see my comments in other issues (#13791).

For this problem: in a cluster deployment there must be Redis.

<script>
setTimeout(() => {
	document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
	window.location.reload();
Contributor

issue: in the event of an overload where the clients execute JavaScript, this can still cause an outage. This could be the result of something such as a repo being on the front page of a news aggregator, drawing a lot of human users, or one of the many crawler frameworks that use headless browsers to automate crawling.

suggestion: move this logic to the QoS middleware instead, so that it only activates when a low-priority request is rejected. In that case, change the request's priority to default, which gives it a higher chance of getting through when non-JavaScript-executing clients are crawling the site, but still provides a good experience for logged-in users in the event of a JavaScript-executing crawler having all of its requests classified as default.

suggestion(non-blocking): to further prioritize users, we could leverage the existing captcha support (if the user has configured it) to further segment out bots from humans. However, this might be quite a large addition.
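
A hypothetical sketch of the shift described above (the actual QoS middleware from #33951 has its own API; all names below are illustrative): a valid token does not bypass the limiter, it only upgrades the retry from low to default priority, so signed-in users keep the highest priority and untokened anonymous traffic is shed first.

```go
package overload

import "net/http"

type priority int

const (
	priorityLow priority = iota
	priorityDefault
	priorityHigh
)

// hasValidRateLimitToken stands in for checking the cookie set by the busy page.
func hasValidRateLimitToken(req *http.Request) bool {
	c, err := req.Cookie("gitea_arlt")
	return err == nil && c.Value != ""
}

// classify decides how much protection a request gets under overload.
func classify(req *http.Request, signedIn bool) priority {
	switch {
	case signedIn:
		return priorityHigh // logged-in users keep working even during an overload
	case hasValidRateLimitToken(req):
		return priorityDefault // proved it runs JS: better odds, but still sheddable
	default:
		return priorityLow // first-contact anonymous traffic is shed first
	}
}
```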

Contributor Author

Yes, I also designed that approach, but I didn't integrate the QoS middleware because that code is your authorship.

Server is busy, loading .... or <a href="{{AppSubUrl}}/user/login">click here to sign in</a>.
<script>
setTimeout(() => {
	document.cookie = "{{.RateLimitCookieName}}={{.RateLimitTokenKey}}; path=/";
Contributor

issue: the path on this cookie can cause problems when setting.AppSubURL is not a default value
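
A small sketch of the kind of fix implied here (illustrative, not this PR's code): derive the cookie path from the configured sub-URL instead of hard-coding "/".

```go
package overload

import "net/http"

// setRateLimitCookie scopes the token cookie to the instance's sub-path, so it
// still works when the instance is served under a non-default AppSubURL.
func setRateLimitCookie(w http.ResponseWriter, token, appSubURL string) {
	path := appSubURL
	if path == "" {
		path = "/" // default: instance served at the domain root
	}
	http.SetCookie(w, &http.Cookie{Name: "gitea_arlt", Value: token, Path: path})
}
```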

Contributor Author

Yes, some details still need to be improved, including the i18n and cookie management.

ctxData := middleware.GetContextData(req.Context())

tokenKey, _ := util.CryptoRandomString(32)
retryAfterDuration := 1 * time.Second
Contributor

issue: since this causes the client to retry, there is a possibility that a lot of clients are told to retry after a given time, and those retries all happen at approximately the same time, causing an outage.

suggestion: add some random jitter to this value so that clients don't all retry at the same time.
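
A possible shape of that jitter, as a sketch (illustrative, not this PR's code):

```go
package overload

import (
	"math/rand"
	"time"
)

// retryAfterWithJitter spreads retries around the base delay (up to +50%) so
// clients that were rejected at the same moment don't all come back at once.
func retryAfterWithJitter(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base/2)+1))
}
```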

Contributor Author

No need. The clients' request times are already random.

@wxiaoguang (Contributor, Author) commented

> but it also doesn't protect in the case of bots using frameworks such as Playwright to crawl.

It is still rate-limited and can be fine-tuned, and it is still better than a pure 503 page.

> I think that a slight shift in strategy to work with the QoS middleware would make this approach more resilient.

That's also my plan, but I didn't integrate the QoS middleware in this demo because that code is your authorship.

Labels: docs-update-needed · lgtm/need 2 · modifies/go · modifies/templates
Projects: none yet
5 participants