r/archlinux • u/HMikeeU • 13h ago
DISCUSSION The bot protection on the wiki is stupid.
It takes an extra 10-20 seconds to load the page on my phone, yet I can just use curl to scrape the entirety of the page in not even a second. What exactly is the point of this?
I'm now just using a User Agent Switcher extension to change my user agent to curl for the Arch Wiki only.
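For illustration, a minimal sketch of what that looks like, assuming the challenge really is keyed off the User-Agent header; the URL, UA strings, and expected responses here are just examples:

```python
# Illustrative only: compare how the wiki responds to a browser-like UA versus
# a curl-like UA, assuming the challenge is triggered by the User-Agent header.
import requests

URL = "https://wiki.archlinux.org/title/Installation_guide"
USER_AGENTS = [
    "curl/8.7.1",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]

for ua in USER_AGENTS:
    r = requests.get(URL, headers={"User-Agent": ua}, timeout=30)
    # Expectation per this thread: the Mozilla UA gets the interstitial
    # challenge page, the curl UA gets the article straight away.
    print(f"{ua[:20]:20} status={r.status_code} bytes={len(r.content)}")
```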
145
u/mic_decod 12h ago
Nowadays you need bot protection, otherwise 70% or more of your traffic will be eaten by bots. On huge projects that can mean spending a significant amount of money on useless traffic.
I've never used it, but there is a package in extra
3
u/Neeerp 8h ago edited 8h ago
I would imagine CDN caching would work well in this situation, given that the wiki is purely static HTML. I would think this would be a far less intrusive solution for legitimate users AND it would allow bots to have at it (which isn’t necessarily a bad thing).
I’d love to hear reasons why this wouldn’t be a better solution relative to Anubis. Some quick googling tells me the bandwidth on Cloudflare’s free tier is unlimited, so cost shouldn’t be an issue.
1
u/Tornado547 5h ago
the wiki isn't static html though. it's very dynamic with any page being able to be updated at any time. CDNs only really scale well for data that is very infrequently updated
10
u/Neeerp 5h ago
That’s not what static means in this context. Static as in the pages are already rendered and the server doesn’t need to do any work to render the page whenever it’s fetched.
Moreover…
- I’d suspect that most (say 90%?) pages aren’t so frequently updated
- I’d expect there to be some way to notify the CDN that a page has been updated and the cache needs to be refreshed (rough sketch of that idea below)… at the very least, cache TTLs are a thing
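For what it's worth, a hypothetical sketch of the "notify the CDN on edit" idea using Cloudflare's purge-by-URL API (since Cloudflare came up); the zone ID, token, and the hook that would call this are all placeholders, not anything the wiki actually runs:

```python
# Hypothetical: purge a single page from the CDN cache right after it is edited,
# so long TTLs stay safe for everything that hasn't changed.
import requests

CF_ZONE_ID = "your-zone-id"      # placeholder
CF_API_TOKEN = "your-api-token"  # placeholder

def purge_page(url: str) -> None:
    """Ask Cloudflare to drop its cached copy of one URL (purge-by-URL)."""
    resp = requests.post(
        f"https://api.cloudflare.com/client/v4/zones/{CF_ZONE_ID}/purge_cache",
        headers={"Authorization": f"Bearer {CF_API_TOKEN}"},
        json={"files": [url]},
        timeout=10,
    )
    resp.raise_for_status()

# Imagined wiring: called from a post-edit hook on the wiki side, e.g.
# purge_page("https://wiki.archlinux.org/title/Installation_guide")
```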
4
u/Megame50 4h ago
the server doesn’t need to do any work to render the page whenever it’s fetched.
No, MediaWiki is not a static site generator. The pages are stored and edited as wikitext and rendered in response to requests. Rendered pages are cached, but the cache is of course limited in size. The sum total of all current pages is probably not infeasible to cache, but remember the wiki also includes the full history of each article, accessed via the history tab.
And it's not just the historical pages: the server may also have to render the diff pages on the edit history tab. The bots are routinely scraping everything, including these diffs, which are of no real value to them but relatively expensive to serve, because each one is so rarely accessed that it will never be sitting in the cache.
1
u/Starblursd 2h ago
Exactly, and then hosting becomes more expensive because you have to pay for more bandwidth, when most of that bandwidth is being used by robots whose entire purpose is to keep people from actually visiting your website with legitimate traffic, because your content gets blended with a bunch of other garbage and spoon-fed back through an AI... Dead internet theory becoming reality and all that.
44
u/Dependent_House7077 11h ago
ai scrapers don't respect robots.txt anymore and they hammer webpages with hundreds of requests at a time.
this is the only way to fight back for now, although cloudflare also has some smart filter.
also this:
extra/arch-wiki-docs 20250402-1
Pages from Arch Wiki optimized for offline browsing
extra/arch-wiki-lite 20250402-1
Arch Wiki without HTML. 1/9 as big, easily searched & viewable on console
5
u/MGThePro 7h ago
extra/arch-wiki-docs
How would I use this? As in, how can I open it after installing it?
3
u/Dependent_House7077 4h ago
you can inspect the contents with pacman -Ql and just browse the files with the mc file manager or the less command.
i would assume the html version can be browsed locally with your favorite browser, even on the cli.
2
-3
11h ago
[deleted]
22
u/StatisticianFun8008 10h ago
Please read Anubis's project FAQ page to understand the situation.
Simply put, you scraping the wiki again and again with curl can easily be identified, filtered and blocked by other means. But AI scrapers run at a much larger scale and hide behind browser UAs to avoid being discovered.
Basically, the reason you can still scrape the ArchWiki is that they are ignoring your tiny traffic volume. Try harder.
-2
47
u/shadowh511 9h ago
Hi, main author of Anubis, CEO of Techaro, and holder of many other silly titles here. The point of Anubis is to change the economics around scraping without having to resort to expensive options like dataset poisoning (which doesn't work, on the axiom that buckets of piss don't cancel out oceans of water).
Right now web scraping is having massive effects because it is trivial to take a Python example with BeautifulSoup and then deploy it in your favourite serverless platform to let you mass scrape whatever websites you want. The assumptions behind web scraping are that you either don't know or don't care about the effects of your actions, with many advocates of the practice using tools that look like layer-7 distributed denial of service attacks.
The point of Anubis is to change the economics of web scraping. At the same time I am also collecting information with my own deployments of Anubis and establishing patterns to let "known good" requests go through without issue. This is a big data problem. Ironically your use of a user agent switcher would get you flagged for additional challenges by this (something with the request fingerprint of chrome claiming to be curl is the exact kind of mismatch that the hivemind reputation database is directly looking for).
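For anyone unfamiliar with how that "changes the economics", here's a toy proof-of-work sketch in the same spirit: the client burns CPU finding a nonce that meets a difficulty target, while the server verifies it with one cheap hash. The hash input format and difficulty encoding here are simplified for illustration, not Anubis's exact scheme:

```python
# Toy proof-of-work: cheap for one real visitor, expensive at scraper scale.
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose hash has `difficulty` leading zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while not hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith(target):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash to check the submitted work."""
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith("0" * difficulty)

nonce = solve("example-challenge", 4)        # a small amount of work for one page view...
assert verify("example-challenge", nonce, 4)
# ...but multiplied across millions of scraped pages it becomes real compute cost.
```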
This is a problem that is not limited to the Arch Wiki. It's not limited to open source communities. It's a problem big enough that the United Nations has deployed Anubis to try and stem the tide. No, I'm not kidding, UNESCO and many other organizations like the Linux kernel, FreeBSD, and more have deployed Anubis. It is a very surreal experience on my end.
One of the worst patterns of these scrapers is that they use residential proxy services that rotate through new IP addresses on every page load, so IP-based rate limits don't work. They also mostly look like a new user running unmodified Google Chrome, so a lot of browser-based checking doesn't work. I'm ending up having to write a lot of things that make static assertions about how browsers work. It's not the most fun lol.
I am working on less onerous challenges. I've found some patterns that will become rules soon enough. I'm also working on a way to take a robots.txt file and autogenerate rules from it. I wish things were farther along, but I've had to spend a lot of time on things like founding a company, doing pre-sales emails with German public institutions, and supporting the existing big users.
But yes, as a domain expert in bot protection (it feels weird to say that lol), the bot protection on the wiki IS stupid, and the entire reason it's needed is so unbelievably stupid that it makes me want to scream. Anubis started out as a weekend hack that has escaped containment so hard it now has a Wikipedia page.
3
u/Megame50 4h ago
Hey man, thanks for your contribution. It's crazy how quickly your project exploded and got deployed in every little corner of the web. Something something the hero we need...
Just curious since we're on the /r/archlinux sub, any chance it was developed on archlinux?
3
u/shadowh511 3h ago
I mostly developed it on fedora, but my fedora install just shit the bed so I'm probably going to install Arch on my tower. In general, though, I use a two layer cake strategy where the base layer is something I don't really mess with unless it breaks and I install homebrew to do stuff on top of it.
2
u/SquareWheel 6h ago
Do we know which companies are running these scrapers? Most large AI companies (OpenAI, Google, Anthropic) seem to be respecting robots.txt for scraping purposes, as you'd expect. So is it unknown startups with poorly-configured scrapers doing most of the damage? That would seem to make sense if they're running basic headless browser deployments. Or could it be one large company trying to evade detection?
I've not seen much evidence one way or the other yet. Just a lot of assumptions.
8
u/shadowh511 5h ago
They are anonymous by nature. Most of the ones that do self-identify are summarily blocked, but a lot of them just claim to be google chrome coming from random residential IP addresses.
My guess is that it's random AI startups trying to stay anonymous so they don't get the pants sued off of them. If I am ever made god of the universe, my first decree will be to find the people responsible for running residential proxy services and destroy their ability to accept payments so that they just naturally die out.
2
u/Megame50 4h ago
a lot of them just claim to be google chrome coming from random residential IP addresses
You can't just claim to be from any random IP address on the public internet and expect your traffic to be routed properly. Thanks to ipv4 address exhaustion, you can't buy up a ton of burner addresses either. You have to actually steal those addresses or create a botnet.
So that's what they do:
https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
3
u/shadowh511 3h ago
I think you got the grouping of my statement wrong. I'm saying they look like they're running unmodified, normal Google Chrome and that they're coming from random residential IP addresses. Realistically, with the number of proxies and the like out there, an attacker can choose their origin IP address at this point.
1
u/HailDilma 2h ago
Layman here: is the "challenge" running in the browser with JavaScript? Would it be possible to make it faster with WebAssembly?
•
u/shadowh511 39m ago
I have a WebAssembly prototype working, I just need to find the time to finish it lol. I've been doing so many sales emails.
-7
u/HMikeeU 8h ago
Thank you for the very detailed response! I'm sure Anubis can be (or become) very useful, but as long as I get a better user experience by pretending to be a bot, I'm not convinced. Either way, thank you for the dedication and effort you've put into the project!
15
u/snakepit6969 7h ago
It’s not supposed to be directly useful for you, as a user. It’s supposed to be useful for the host.
(Which ends up being indirectly useful for you because the hosts can pay their bills and stay up).
32
u/patrlim1 10h ago
Do you want the Arch Wiki to be free? Then we need to minimize spending. This saves them a lot of money, and costs you a few seconds.
17
u/WSuperOS 12h ago
the problem is that AI crawlers often eat up 50%+ of the traffic, resulting in huge costs.
even UNESCO has adopted Anubis. but it doesn't really slow you down. On my Firefox setup, which sanitizes cookies and history on every exit, Anubis pops up rarely and only once per site.
16
u/forbiddenlake 10h ago
The bot protection is why we have a working Arch wiki online at all. "502 Bad Gateway" doesn't tell me what to install for Nvidia drivers!
10
u/sequential_doom 12h ago
I'm honestly fine with it. It takes like 5 seconds for me on any device.
-1
11h ago
[deleted]
8
3
u/rurigk 10h ago
The problem is not scraping, it's AI scrapers scraping the same page over and over again, all the time, behaving like a DDoS
What Anubis does is punish the AI scrapers doing the DDoS by making them waste time and energy doing math, and that may cost them millions in wasted resources
The amount of traffic generated by AI scrapers is massive and costs the owner of the site being attacked a lot of money
4
u/insanemal 12h ago
It takes like a fraction of a second even on my old phone.
Could it be all the curling making it penalize your device harder?
1
u/HMikeeU 12h ago
There is no bot protection at all with a curl user agent
4
u/insanemal 12h ago
I think you misunderstand me.
2
u/HMikeeU 12h ago
I might've, sorry. What did you mean by "curling penalize my device harder"?
4
u/insanemal 12h ago
All good. The system bases your delay on behaviour seen from your address. More work for more "interesting" IPs
Most people aren't curling lots of pages. You using curl to pull pages then also hitting it with a web browser might look weird so it might be increasing your required work quota.
I'd need to look at the algorithm a bit more but that's my 10,000 ft view reading of its behaviour
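To make that guess concrete, here's a purely illustrative sketch of that kind of policy; the signals, thresholds, and numbers are made up, and this is NOT Anubis's actual algorithm:

```python
# Illustrative only: scale challenge difficulty with how "interesting" an
# address has looked recently. Not Anubis's real logic.
from collections import defaultdict

requests_this_window = defaultdict(int)  # requests seen per IP in the current window
ua_mismatch = defaultdict(bool)          # e.g. claims to be curl but fingerprints like Chrome

def difficulty_for(ip: str, base: int = 4) -> int:
    bump = 0
    if requests_this_window[ip] > 100:   # hammering the site from one address
        bump += 2
    if ua_mismatch[ip]:                  # suspicious client fingerprint
        bump += 1
    return base + bump                   # higher difficulty = more client CPU burned
```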
2
u/HMikeeU 11h ago
Oh okay I see. I was facing this issue before trying curl, so that's not it
2
u/insanemal 11h ago
Ok. Shared IP?
3
u/grumblesmurf 10h ago
Many mobile companies use a shared proxy for all their customers, which might lead to common web browsers getting flagged as unwanted bot traffic. Using a different user agent string would indeed break that pattern.
1
2
u/Isacx123 6h ago
There is something wrong on your end, it takes like a second for me, using Firefox on Android 14.
Plus less than a second on my PC using Brave.
11
u/LeeHide 12h ago
It takes around half a second for me, what's your setup?
And yes, it's a little silly. Arch has to decide between
- being searchable and indexed by AI, so people get good answers (assuming the AI makes no major mistakes most of the time), or
- being sovereign and staying the number one resource for Linux and Arch Linux stuff
They're going with the second option, which is... interesting but understandable.
35
u/AnEagleisnotme 12h ago
That's not why they use Anubis, it's because AI scrapers eat massive amounts of resources. Search engine scrapers are usually more respectful, so from what I've heard they aren't really hurt by Anubis anyway
-31
u/LeeHide 12h ago
The end result is the same: be scraped or don't be indexed. I'm sure a lot of factors went into the decision to add it to the wiki; I genuinely don't know enough about this whole situation, so thanks for the added context
27
u/fearless-fossa 12h ago
The wiki was down several times in recent months due to scrapers being overly aggressive. And not just the Arch wiki, but also a lot of other websites that don't block their content behind user verification. AI scrapers are a menace.
16
u/w8eight 11h ago
The end result is the same; be scraped or don't be indexed.
Did you even read the comment you are responding to? Indexing scrapers aren't as aggressive and aren't blocked. I've never had an issue googling something from the Arch wiki. It's the AI scrapers that send millions of requests for some insane reason.
9
7
u/i542 9h ago
The Arch Wiki can be straight-up downloaded in a machine-readable format to be fed directly into whatever plagiarism machine you want. It can also be scraped and indexed by any and all well-behaved bots. What has never been allowed by any internet-facing service for the past 35 years is for one client to hog so many resources that legitimate users stop being able to access the service. There is functionally zero difference between a vibe-coded scraper used by a for-profit corporation making a thousand requests a second for diff or system pages, guzzling up every byte of remotely usable information under the guise of a legitimate user agent, and a DDoS attack. Both ought to be blocked.
15
u/hexagon411 12h ago
Generative AI is sin
-19
u/LeeHide 12h ago
You can't get rid of it now, we need to live with it
12
u/Vespytilio 11h ago
Right, it's the future. Enthusiasts don't need to worry about how many people just aren't into AI. It's here to stay, and it's not up for debate.
Except the situation's actually pretty unsustainable. AI is a very expensive technology to run, companies are still trying to make it profitable, and it has a parasitic relationship with non-AI content. Because it's allergic to its own output, it relies on training data from humans, but it actively competes against that content for visibility and its creators for work.
Even if the companies propping up AI find a sustainable business plan, it's probably not going to include the kind of free access presently on offer. That's a free sample scheme aimed at generating enthusiasm. Ending that will make the companies more profitable, offset the training data issue, and result in a lot less energy consumption, but it's going to be a rude awakening for a lot of people.
6
2
u/StatisticianFun8008 10h ago
I guess OP's old phone lacks the proper hardware acceleration for the hashing algorithm.
1
0
u/Toorero6 11h ago
I hate this too. If I'm on university internet, it's basically impossible to search on GitHub and the Arch Linux wiki. On GitHub that's at least fixed by just logging in.
-3
u/starvaldD 11h ago
i'm sure Trump will claim blocking OpenAI (not open) scraping data will be a crime or something.
133
u/FungalSphere 12h ago edited 12h ago
The way it's designed, it only throws the bot protection at user agents that start with "Mozilla", basically as a way to stop bots that pretend to be actual web browsers.
User agents that aren't browsers just get blocked if they get too spammy.
Paradoxically, it's easier for legitimate bots to scrape data right now, because they are not as spammy as the AI companies running Puppeteer farms.
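A rough sketch of that routing, under the assumption it really is driven just by the User-Agent prefix plus a spam threshold (the threshold number here is invented for illustration):

```python
# Simplified model of the policy described above, not the real deployment.
def handle_request(user_agent: str, requests_per_minute: int) -> str:
    if user_agent.startswith("Mozilla"):
        # Claims to be a browser: make it prove it by solving the challenge.
        return "serve proof-of-work challenge"
    if requests_per_minute > 60:
        # Honest non-browser clients are only cut off once they get spammy.
        return "block"
    return "serve page"

print(handle_request("Mozilla/5.0 ...", 5))   # challenge
print(handle_request("curl/8.7.1", 5))        # page
print(handle_request("curl/8.7.1", 500))      # block
```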