10 comments

  • maelito 2 hours ago
    I'm seeing lots of connections every day from Singapore. It's now the main country... despite the whole website being French-only. AI crawlers, for sure.

    Thanks for this tip.

    • arjie 1 hour ago
      Amazonbot does this despite my efforts in robots.txt to help it out. I look at all the Singapore requests and they’re Amazonbot trying to get various variants of the Special:RecentChanges page. You’re wasting your time, Amazonbot. I’m trying to help you.
      • reconnecting 16 minutes ago
        Did you check the IP address of this UA?
    • input_sh 1 hour ago
      Fun fact: you don't get rid of them even when you put a captcha on all visitors from Singapore. I still see a spike in traffic that perfectly matches the spike in served captchas, but this time it's geographically distributed between places like Iraq, Bangladesh and Brazil.

      Hopefully it at least costs them a little bit more.

  • Simplita 41 minutes ago
    We ran into similar issues with aggressive crawling. What helped was rate limiting combined with making intent explicit at the entry point, instead of letting requests fan out blindly. It reduced both load and unexpected edge cases.
    • cheshire_cat 16 minutes ago
      What do you mean by "making intent explicit at the entry point"?
  • andai 1 hour ago
    Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?
    • input_sh 16 minutes ago
      > How come even small sites get hammered constantly?

      Because big sites have decades of experience fighting against scrapers and have recently upped their game significantly (even when doing so carries some SEO costs) so that they're the only ones that can train AI on their own data.

      So now, when you're starting from scratch and your goal is to gather as much data as possible, targeting smaller sites with weak or non-existent scraping protection is the path of least resistance.

    • bingo-bongo 51 minutes ago
      AI companies scrape to:

      - have data to train on

      - update the data more or less continuously

      - answer queries from users on the fly

      With a lot of AI companies doing this, that adds up to a lot of scraping. Also, some of them behave terribly when scraping, or are just bad at it.

      • adastra22 36 minutes ago
        Why don’t they scrape once though?
        • blell 29 minutes ago
          1) It may be out of date 2) Storing it costs money
    • reppap 32 minutes ago
      It's not just companies either, a lot of people run crawlers for their home lab projects too.
    • devsda 54 minutes ago
      Maybe the teams developing AI crawlers are dogfooding and are using the AI itself (and its small context) to keep track of the sites that have already been scraped. /s
  • Roark66 53 minutes ago
    I'm glad the author clarified that he wants to prevent his instance from crashing, not simply "block robots and allow humans".

    I think the idea that you can block bots and allow humans is fallacious.

    We should focus on the specific behaviour that causes problems (like making a bajillion requests, one for each commit, instead of cloning the repo). To fix this, we should block clients that behave that way. If these bots learn to request at a reasonable pace, who cares whether they are bots, humans, bots under the control of an individual human, or bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, trying to limit access to only a certain class of consumers is a waste of effort.
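
    Detecting that doesn't need anything fancy; a per-client sliding-window counter on the expensive endpoints would do. A rough sketch in Python, where the "/commit/" path and the thresholds are hypothetical and not tied to any particular forge:

      import time
      from collections import defaultdict, deque

      # Hypothetical threshold: more than 60 per-commit page hits in a minute
      # from one client looks like "walk the whole history page by page"
      # rather than normal browsing.
      WINDOW_SECONDS = 60
      MAX_COMMIT_HITS = 60

      commit_hits = defaultdict(deque)  # client IP -> timestamps of commit-page requests

      def should_block(client_ip: str, path: str) -> bool:
          """Return True if this client is hammering per-commit pages."""
          if "/commit/" not in path:
              return False
          now = time.monotonic()
          hits = commit_hits[client_ip]
          hits.append(now)
          # Drop timestamps that have fallen out of the sliding window.
          while hits and now - hits[0] > WINDOW_SECONDS:
              hits.popleft()
          return len(hits) > MAX_COMMIT_HITS

    A check like that doesn't care whether the client is a human, a bot, or a bot acting for a human; only the request pattern matters.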

    Also, perhaps I'm biased, because I run SearXNG and Crawl4AI (and a few ancillaries like Jina rerank, etc.) in my homelab, so I can tell my AI to perform live internet searches and fetch almost any website. For code it has a way to clone stuff, but for things like issues, discussions, and PRs it goes mostly to GitHub.

    I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).

    The models sometimes hit sites they can't fetch. For this I use Firecrawl. I use an MCP proxy that lets me rewrite the tool descriptions, so my models get access to both my local Crawl4AI and the hosted (and rather expensive) Firecrawl, but they are told to use Firecrawl only as a last resort.

    The more people use these kinds of solutions, the more incentive there will be for sites not to block users that use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid captchas will disappear and reasonable rate limiting will prevail.

    • asfdasfsd 49 minutes ago
      And people who block AI crawlers on moral grounds?
    • szundi 37 minutes ago
      [dead]
  • KronisLV 1 hour ago
    We should just have a standard for crawlable, archived versions of pages with no back end or DB interaction behind them. For example, if there's a reverse proxy, whatever it outputs gets archived, and the archive version never passes a call through to the back end; the same goes for rendering the output of any dynamic JS into fully static HTML. Then add some proof-of-work that works without JS and is a web standard (e.g. the server sends a header, the client sends the correct response and gets access to the archive), mainstream a culture of low-cost hosting for such archives, and you're done. Also make sure this sort of feature is enabled in the most basic configuration of every web server, and logged separately.

    Obviously such a thing will never happen, because the web and culture went in a different direction. But if it were a mainstream thing, you'd get easy to consume archives (also for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs be bogged down by stupid spam.

    Or if PoW were a proper web standard with no JS, people who want to tell AI and other crawlers to fuck off could at least make it uneconomical to crawl their stuff en masse. In my view, proof of work that works through headers should, in the current-day world, be as ubiquitous as TLS.
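
    Something like this, conceptually. The header names and the difficulty are made up; it's just to show the shape of the exchange in Python:

      import hashlib
      import itertools
      import os

      DIFFICULTY_BITS = 20  # hypothetical; tuned so solving costs a visitor roughly a second of CPU

      def make_challenge() -> str:
          # Server: sent to the client in e.g. an X-PoW-Challenge response header.
          return os.urandom(16).hex()

      def leading_zero_bits(digest: bytes) -> int:
          bits = 0
          for byte in digest:
              if byte == 0:
                  bits += 8
              else:
                  bits += 8 - byte.bit_length()
                  break
          return bits

      def solve(challenge: str) -> int:
          # Client: brute-force a nonce, send it back in e.g. an X-PoW-Response header.
          for nonce in itertools.count():
              digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
              if leading_zero_bits(digest) >= DIFFICULTY_BITS:
                  return nonce

      def verify(challenge: str, nonce: int) -> bool:
          # Server: verification is a single hash, so it stays cheap.
          digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
          return leading_zero_bits(digest) >= DIFFICULTY_BITS

    The asymmetry is the point: a regular visitor pays the cost a handful of times, a crawler hammering millions of pages pays it millions of times, while the server only ever does one hash to verify.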

  • userbinator 2 hours ago
    > Unfortunately this means, my website could only be seen if you enable javascript in your browser.

    Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)

  • apples_oranges 2 hours ago
    HTTP 412 would be better, I guess...
  • reconnecting 2 hours ago
    tirreno (1) guy here.

    Our open-source system can block IP addresses based on rules triggered by specific behavior.

    Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?
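
    By that I mean something along these lines. It's a generic sketch with made-up numbers, not tirreno's actual implementation:

      import time
      from collections import defaultdict

      class LeakyBucket:
          """Allow a steady drip of requests per client and reject bursts above it."""

          def __init__(self, rate_per_minute: float, burst: int):
              self.leak_rate = rate_per_minute / 60.0  # requests drained per second
              self.burst = burst                       # burst capacity tolerated
              self.level = 0.0
              self.last = time.monotonic()

          def allow(self) -> bool:
              now = time.monotonic()
              # Drain the bucket according to the time that has passed.
              self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
              self.last = now
              if self.level + 1 > self.burst:
                  return False  # bucket would overflow: rate-limit this client
              self.level += 1
              return True

      # One bucket per client IP, e.g. 60 requests/minute with a burst of 20.
      buckets = defaultdict(lambda: LeakyBucket(60, 20))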

    1. https://github.com/tirrenotechnologies/tirreno

    • reconnecting 1 hour ago
      I believe there is a slight misunderstanding regarding the role of 'AI crawlers'.

      Bad crawlers have been around since the very beginning. Some of them look for known vulnerabilities, some scrape content for third-party services. Most of them have spoofed UAs to pretend to be legitimate bots.

      This is approximately 30–50% of traffic on any website.

    • notachatbot123 2 hours ago
      The article is about AI web crawlers. How can your tool help and how would one set it up for this specific context?
      • reconnecting 2 hours ago
        I don't see how an AI crawler is different from any other crawler.

        The simplest approach is to treat the UA as risky, or to flag multiple 404 errors or HEAD requests, and block on that. Those are rules we already have out of the box.
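
        In sketch form, that kind of rule boils down to something like this, with made-up thresholds and UA tokens rather than our real rule syntax:

          from collections import Counter

          # Per-client counters; in practice these would be scoped to a recent
          # time window rather than growing forever.
          not_found = Counter()   # client IP -> number of 404 responses seen
          head_reqs = Counter()   # client IP -> number of HEAD requests seen
          RISKY_UA_TOKENS = ("python-requests", "curl", "scrapy")

          def is_risky(client_ip: str, method: str, status: int, user_agent: str) -> bool:
              if status == 404:
                  not_found[client_ip] += 1
              if method == "HEAD":
                  head_reqs[client_ip] += 1
              ua = user_agent.lower()
              return (
                  not_found[client_ip] > 20
                  or head_reqs[client_ip] > 50
                  or any(token in ua for token in RISKY_UA_TOKENS)
              )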

        It's open source, so there's no pain in writing specific rules for rate limiting, hence my question.

        Plus, we have developed a dashboard for manually choosing UA blocks based on name, but we're still not sure if this is something that would be really helpful for website operators.

        • Roark66 1 hour ago
          > It's open source, so there's no pain in writing specific rules for rate limiting, hence my question.

          Depends on the goal.

          The author wants his instance not to get killed. Request rate limiting may achieve that easily, in a way that's transparent to normal users.

  • immibis 20 hours ago
    My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.

    It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.

    This is a clever way of doing a minimally invasive botwall though - I like it.

    • userbinator 2 hours ago
      > Each access creates a new zip file on disk which is never cleaned up.

      That sounds like a bug.

    • bob1029 3 hours ago
      > you can successfully handle many requests.

      There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.

      I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this conflation is essential to keeping the AI scraping drama going in many ways. We can definitely make the self-hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.

      • idontsee 2 hours ago
        > There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant.

        It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.

      • spockz 2 hours ago
        Maybe it is fast enough, but my objection is mostly to the gross inefficiency of crawlers: requesting downloads of whole repositories over and over wastes CPU cycles to create the archives, storage space to retain them, and bandwidth to send them over the wire. Add that to the gross power consumption of AI and the hogging of physical compute hardware, and it is easy to see “AI” as wasteful.
  • csilker 2 hours ago
    Cloudflare has a solution to protect routes from crawlers.

    https://blog.cloudflare.com/introducing-pay-per-crawl/

    • roywashere 1 hour ago
      Sure, but the whole point of self-hosting Forgejo is to not use these big cloud solutions. Introducing Cloudflare is a step back!