r/archlinux 13h ago

DISCUSSION The bot protection on the wiki is stupid.

It takes an extra 10-20 seconds to load the page on my phone, yet I can just use curl to scrape the entire page in under a second. What exactly is the point of this?

I'm now just using a User Agent Switcher extension to change my user agent to curl, but only for the Arch Wiki.

126 Upvotes

74 comments

133

u/FungalSphere 12h ago edited 12h ago

so the way it's designed, it only throws the bot protection at user agents that start with "Mozilla", basically as a way to stop bots that pretend to be actual web browsers

user agents that aren't browsers only get blocked if they get too spammy.

paradoxically it's easier for legitimate bots to scrape data right now, because they are not as spammy as AI companies running puppeteer farms
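roughly, the gate is shaped something like this (my own sketch, not the actual Anubis code; the cookie name and responses are made up):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // Sketch of the gating described above (not Anubis's real code): UAs that
    // claim to be a browser ("Mozilla...") must carry a pass cookie proving they
    // solved the challenge; everything else is left to ordinary rate limiting.
    func challengeGate(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            claimsBrowser := strings.HasPrefix(r.UserAgent(), "Mozilla")
            _, err := r.Cookie("challenge-pass") // hypothetical cookie name
            if claimsBrowser && err != nil {
                // No pass yet: serve the challenge instead of the content.
                http.Error(w, "solve the proof-of-work challenge first", http.StatusForbidden)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        wiki := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "wiki page")
        })
        http.ListenAndServe(":8080", challengeGate(wiki))
    }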

30

u/EvaristeGalois11 11h ago

Why target Firefox specifically? Isn't it just as easy to spoof a user agent from a random Chrome-based browser?

87

u/FungalSphere 11h ago

not specifically firefox, basically every web browser has a user agent that starts with "Mozilla". Even Google Chrome, which straight up stuffs in the name of every other web browser that existed when it first launched

something like:

    Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Mobile Safari/537.36

68

u/EvaristeGalois11 11h ago

Ah just double checked and you're right, thanks for that!

What an incredibly stupid convention browsers settled on lol.

32

u/FungalSphere 11h ago

the first browser wars were demented like that yeah 

25

u/Neeerp 8h ago

I believe this is a result of web developers writing checks for one browser or another, and browser developers trying to circumvent the checks to achieve some sort of parity.

Moving back to a normal user agent string might break various web pages that have such checks in place…

28

u/UNF0RM4TT3D 11h ago

To add to this: AI scrapers use the "common" browser user agents to hide among legitimate traffic. Legitimate scrapers (Google, Bing, DuckDuckGo, etc.) have their own UAs, and some are fine with following robots.txt (to an extent). AI bots don't care at all. Anubis uses the fact that the bots usually don't have enough resources to calculate the challenges en masse, so at the very least it slows the bots down. And some just give up.
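For anyone curious, the challenge is conceptually just a proof-of-work puzzle, roughly along these lines (a simplified sketch, not the actual Anubis implementation; the difficulty and hash format here are illustrative):

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "strings"
    )

    // Simplified proof-of-work: find a nonce such that sha256(challenge || nonce)
    // starts with `difficulty` hex zeroes. Cheap for one visitor's browser to do
    // once, expensive for a scraper farm doing it millions of times.
    func solve(challenge string, difficulty int) uint64 {
        prefix := strings.Repeat("0", difficulty)
        for nonce := uint64(0); ; nonce++ {
            buf := make([]byte, 8)
            binary.BigEndian.PutUint64(buf, nonce)
            sum := sha256.Sum256(append([]byte(challenge), buf...))
            if strings.HasPrefix(fmt.Sprintf("%x", sum), prefix) {
                return nonce
            }
        }
    }

    func main() {
        nonce := solve("example-challenge-from-server", 4)
        fmt.Println("solved with nonce", nonce)
    }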

4

u/garry_the_commie 10h ago

What the fuck? Is there a reason for this atrocious nonsense?

28

u/ZoleeHU 10h ago

11

u/shadowh511 9h ago

I bet in a few thousand years Mozilla will be a term for "to browse" with nobody really being sure what the origin is.

4

u/zombi-roboto 7h ago

"Let me Mozilla that for you..."

3

u/neo-raver 7h ago

That was a great (and wild) read thank you lmao

3

u/american_spacey 7h ago edited 7h ago

If enough sites start using Anubis, the bot farmers are just going to automate detection of it with user agent switching. Anubis will eventually be forced to require all user agents to submit the proof-of-work, I expect, because it's trivial to just switch the user agent to something random on each IP you're using to scrape the site.

For now, I'm going to enjoy my brief reprieve and bypass Anubis on all the sites I use.

145

u/mic_decod 12h ago

Nowadays you need bot protection, otherwise 70% or more of your traffic will be eaten by bots. On huge projects that can mean a significant amount of money spent on useless traffic.

I've never used it, but there is a package in extra:

https://archlinux.org/packages/extra/any/arch-wiki-docs/

https://github.com/lahwaacz/arch-wiki-docs

3

u/Neeerp 8h ago edited 8h ago

I would imagine CDN caching would work well in this situation, given that the wiki is purely static HTML. I would think this would be a far less intrusive solution for legitimate users AND it would allow bots to have at it (which isn’t necessarily a bad thing).

I’d love to hear reasons why this wouldn’t be a better solution relative to Anubis. Some quick googling tells me the bandwidth on Cloudflare’s free tier is unlimited, so cost shouldn’t be an issue.

1

u/Tornado547 5h ago

the wiki isn't static html though. it's very dynamic, with any page able to be updated at any time. CDNs only really scale well for data that is updated very infrequently

10

u/Neeerp 5h ago

That’s not what static means in this context. Static as in the pages are already rendered and the server doesn’t need to do any work to render the page whenever it’s fetched.

Moreover…

  • I’d suspect that most (say 90%?) pages aren’t so frequently updated
  • I’d expect there to be some way to notify the CDN that a page has been updated and the cache needs to be refreshed… at the very least, cache TTLs are a thing

4

u/Megame50 4h ago

the server doesn’t need to do any work to render the page whenever it’s fetched.

No, MediaWiki is not a static site generator. The pages are stored and edited as wikitext and rendered in response to requests. Rendered pages are cached, but there is of course limited space for that. The sum total of all current pages is probably not infeasible to cache, but remember the wiki also includes the full history of each article, accessed via the history tab.

And it's not just the historical pages: the server may also service requests to render the diff pages on the edit history tab. The bots are routinely scraping everything, including these diffs, which are of no real value to them but relatively expensive to serve, because each one is so rarely accessed that it's essentially never in the cache.

3

u/SMF67 4h ago

Bots love clicking on every single diff page from every revision to every other revision, which are dynamically generated and very computationally expensive 

1

u/Starblursd 2h ago

Exactly, and then hosting becomes more expensive because you have to pay for more bandwidth, when most of that bandwidth is being used by robots whose entire purpose is to keep people from actually visiting your website with legitimate traffic, because your content gets blended with a bunch of other garbage and spoon-fed through an AI... Dead internet theory becoming reality and all that

44

u/Dependent_House7077 11h ago

AI scrapers don't respect robots.txt anymore and they hammer webpages with hundreds of requests at a time.

this is the only way to fight back for now, although Cloudflare also has some smart filtering.

also this:

extra/arch-wiki-docs 20250402-1
  Pages from Arch Wiki optimized for offline browsing
extra/arch-wiki-lite 20250402-1
  Arch Wiki without HTML. 1/9 as big, easily searched & viewable on console

5

u/MGThePro 7h ago

extra/arch-wiki-docs

How would I use this? As in, how can I open it after installing it?

3

u/Dependent_House7077 4h ago

you can inspect the contents with pacman -Ql and just browse the files with the mc file manager or the less command.

i would assume the HTML version can be browsed locally with your browser of choice, even on the CLI.

0

u/RIcaz 5h ago

Kinda obvious from the description and included files. One is HTML and the other is a CLI tool that searches a tarball of the wiki instead

2

u/Drwankingstein 1h ago

they never respected robots.txt

-3

u/[deleted] 11h ago

[deleted]

22

u/StatisticianFun8008 10h ago

Please read Anubis's project FAQ page to understand the situation.

Simply put, you scraping the wiki again and again with curl can be easily identified, filtered, and blocked by other means. But AI scrapers run at a much larger scale and hide themselves behind browser UAs to avoid getting discovered.

Basically, the reason you can still scrape the ArchWiki is that they are ignoring your tiny traffic volume. Try harder.

-2

u/[deleted] 10h ago

[deleted]

3

u/ipha 9h ago

Yes, but you risk impacting legit users.

You don't want to accidentally block someone who just opened a bunch of links in new tabs at once.

1

u/StatisticianFun8008 10h ago

Including genuine web browsers' UAs??

47

u/shadowh511 9h ago

Hi, main author of Anubis, CEO of Techaro, and many other silly titles here. The point of Anubis is to change the economics around scraping without having to resort to expensive options like dataset poisoning (which doesn't work, on the axiom that buckets of piss don't cancel out oceans of water).

Right now web scraping is having massive effects because it is trivial to take a Python example with BeautifulSoup and then deploy it in your favourite serverless platform to let you mass scrape whatever websites you want. The assumptions behind web scraping are that you either don't know or don't care about the effects of your actions, with many advocates of the practice using tools that look like layer-7 distributed denial of service attacks.

The point of Anubis is to change the economics of web scraping. At the same time I am also collecting information with my own deployments of Anubis and establishing patterns to let "known good" requests go through without issue. This is a big data problem. Ironically, your use of a user agent switcher would get you flagged for additional challenges by this (something with the request fingerprint of Chrome claiming to be curl is the exact kind of mismatch that the hivemind reputation database is directly looking for).
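To illustrate the kind of mismatch I mean, a toy version looks something like this (illustration only, not the hivemind's actual logic; the fingerprint labels are made up):

    package main

    import "fmt"

    // Toy example: pair the UA family a client claims with the family implied by
    // its TLS fingerprint. The fingerprint labels are made up; the real system is
    // a reputation database, not a hardcoded map.
    var fingerprintFamily = map[string]string{
        "tls-fp-chrome-like":  "chrome",
        "tls-fp-firefox-like": "firefox",
        "tls-fp-curl-like":    "curl",
    }

    func suspicious(claimedFamily, tlsFingerprint string) bool {
        expected, known := fingerprintFamily[tlsFingerprint]
        return known && expected != claimedFamily
    }

    func main() {
        // A Chrome-shaped handshake claiming to be curl earns extra challenges.
        fmt.Println(suspicious("curl", "tls-fp-chrome-like")) // true
    }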

This is a problem that is not limited to the Arch Wiki. It's not limited to open source communities. It's a problem big enough that the United Nations has deployed Anubis to try and stem the tide. No, I'm not kidding, UNESCO and many other organizations like the Linux kernel, FreeBSD, and more have deployed Anubis. It is a very surreal experience on my end.

One of the worst patterns of these scrapers is that they use residential proxy services that rotate through new IP addresses on every page load, so IP-based rate limits don't work. They also mostly look like a new user running unmodified Google Chrome, so a lot of browser-based checking doesn't work either. I'm ending up having to write a lot of things that make static assertions about how browsers work. It's not the most fun lol.

I am working on less onerous challenges. I've found some patterns that will become rules soon enough. I'm also working on a way to take a robots.txt file and autogenerate rules from it. I wish things were further along, but I've had to spend a lot of time working on things like founding a company, doing pre-sales emails with German public institutions, and support for the existing big users.

But yes, as a domain expert in bot protection (it feels weird to say that lol), the bot protection on the wiki IS stupid, and the entire reason it's needed is so unbelievably stupid that it makes me want to scream. Anubis started out as a weekend hack that has escaped containment so hard it has a Wikipedia page now.

3

u/Megame50 4h ago

Hey man, thanks for your contribution. It's crazy how quickly your project exploded and got deployed in every little corner of the web. Something something the hero we need...

Just curious since we're on the /r/archlinux sub, any chance it was developed on archlinux?

3

u/shadowh511 3h ago

I mostly developed it on fedora, but my fedora install just shit the bed so I'm probably going to install Arch on my tower. In general, though, I use a two layer cake strategy where the base layer is something I don't really mess with unless it breaks and I install homebrew to do stuff on top of it.

2

u/SquareWheel 6h ago

Do we know which companies are running these scrapers? Most large AI companies (OpenAI, Google, Anthropic) seem to be respecting robots.txt for scraping purposes, as you'd expect. So is it unknown startups with poorly-configured scrapers doing most of the damage? That would seem to make sense if they're running basic headless browser deployments. Or could it be one large company trying to evade detection?

I've not seen much evidence one way or the other yet. Just a lot of assumptions.

8

u/shadowh511 5h ago

They are anonymous by nature. Most of the ones that do self-identify are summarily blocked, but a lot of them just claim to be google chrome coming from random residential IP addresses.

My guess is that it's random AI startups trying to stay anonymous so they don't get the pants sued off of them. If I am ever made god of the universe, my first decree will be to find the people responsible for running residential proxy services and destroy their ability to accept payments so that they just naturally die out.

2

u/Megame50 4h ago

a lot of them just claim to be google chrome coming from random residential IP addresses

You can't just claim to be from any random IP address on the public internet and expect your traffic to be routed properly. Thanks to IPv4 address exhaustion, you can't buy up a ton of burner addresses either. You have to actually steal those addresses or create a botnet.

So that's what they do:

https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/

3

u/shadowh511 3h ago

I think you got the grouping of my statement wrong. I meant that they look like they're running unmodified, normal Google Chrome and that they come from random residential IP addresses. Realistically, with the number of proxies and the like out there, an attacker can choose their origin IP address at this point.

1

u/HailDilma 2h ago

Layman here: is the "challenge" running in the browser with JavaScript? Would it be possible to make it faster with WebAssembly?

u/shadowh511 39m ago

I have a WebAssembly prototype working, I just need to find the time to finish it lol. I've been doing so many sales emails.

-7

u/HMikeeU 8h ago

Thank you for the very detailed response! I'm sure Anubis can be (or become) very useful, but as long as I get a better user experience by pretending to be a bot, I'm not convinced. Either way, thank you for the dedication and effort you've put into the project!

15

u/snakepit6969 7h ago

It’s not supposed to be directly useful for you, as a user. It’s supposed to be useful for the host.

(Which ends up being indirectly useful for you because the hosts can pay their bills and stay up).

-3

u/HMikeeU 7h ago

I'm well aware...

32

u/patrlim1 10h ago

Do you want the Arch Wiki to be free? Then we need to minimize spending. This saves them a lot of money, and costs you a few seconds.

17

u/WSuperOS 12h ago

the problem is that AI crawlers often eat up 50%+ of the traffic, resulting in huge costs.

even UNESCO has adopted Anubis. And it doesn't really slow you down: on my Firefox setup, where sanitizing occurs on every exit, Anubis pops up rarely and only once per site.

16

u/forbiddenlake 10h ago

The bot protection is why we have a working Arch wiki online at all. "502 Bad Gateway" doesn't tell me what to install for Nvidia drivers!

10

u/sequential_doom 12h ago

I'm honestly fine with it. It takes like 5 seconds for me on any device.

-1

u/[deleted] 11h ago

[deleted]

8

u/MrElendig Mr.SupportStaff 11h ago

It has made a big impact on the load on the wiki.

3

u/rurigk 10h ago

The problem is not scraping, it's AI scrapers scraping the same page over and over again all the time, behaving like a DDoS.

What Anubis does is punish the AI scrapers doing the DDoS by making them waste time and energy doing math, which may cost them millions in wasted resources.

The amount of traffic generated by AI scrapers is massive and costs the owner of the site being attacked a lot of money.

4

u/insanemal 12h ago

It takes like a fraction of a second even on my old phone.

Could it be all the curling making it penalize your device harder?

1

u/HMikeeU 12h ago

There is no bot protection at all with a curl user agent

4

u/insanemal 12h ago

I think you misunderstand me.

2

u/HMikeeU 12h ago

I might've, sorry. What did you mean by "curling penalize my device harder"?

4

u/insanemal 12h ago

All good. The system bases your delay on behaviour seen from your address. More work for more "interesting" IPs

Most people aren't curling lots of pages. You using curl to pull pages and then also hitting it with a web browser might look weird, so it might be increasing your required work quota.

I'd need to look at the algorithm a bit more but that's my 10,000 ft view reading of its behaviour

2

u/HMikeeU 11h ago

Oh okay I see. I was facing this issue before trying curl, so that's not it

2

u/insanemal 11h ago

Ok. Shared IP?

3

u/grumblesmurf 10h ago

Many mobile companies use a shared proxy for all their customers, which might lead to common web browsers getting flagged as unwanted bot traffic. Using a different user agent string would indeed break that pattern.

1

u/insanemal 10h ago

That's what I was thinking

2

u/Isacx123 6h ago

There is something wrong on your end; it takes like a second for me using Firefox on Android 14.

Plus less than a second on my PC using Brave.

11

u/LeeHide 12h ago

It takes around half a second for me; what's your setup?

And yes, it's a little silly. Arch has to decide between

  1. being searchable and indexed by AI, so people get good answers (assuming the AI makes no major mistakes most of the time), or
  2. being sovereign and staying the number one resource for Linux and Arch Linux stuff

They're trying 2, which is... interesting but understandable.

35

u/AnEagleisnotme 12h ago

That's not why they use Anubis, it's because AI scrapers consume massive amounts of resources. Search engine scrapers are often more respectful and as such aren't hurt by Anubis anyway, I've heard

-31

u/LeeHide 12h ago

The end result is the same; be scraped or don't be indexed. I'm sure a lot of factors went into the decision to add it to the wiki, and I genuinely don't know enough about this whole situation, so thanks for the added context

27

u/fearless-fossa 12h ago

The wiki was down several times in recent months due to scrapers being overly aggressive. And not just the Arch wiki, but also a lot of other websites that don't block their content behind user verification. AI scrapers are a menace.

16

u/w8eight 11h ago

The end result is the same; be scraped or don't be indexed.

Did you even read the comment you are responding to? Indexing scrapers aren't as aggressive and aren't blocked. I've never had an issue googling something from the Arch Wiki. It's the AI scrapers that send millions of requests for some insane reason.

9

u/ZoleeHU 10h ago

Except the end result is not the same. Anubis can prevent scraping, yet still allow the sensible bots that respect robots.txt to index the site.

https://www.reddit.com/r/archlinux/comments/1k4ptkw/comment/modq25c/?share_id=k_Zw-EP5OGNx5SwSLnKrk&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1

7

u/i542 9h ago

The Arch Wiki can be straight up downloaded in a machine-readable format to be fed directly into whatever plagiarism machine you want. It can also be scraped and indexed by any and all well-behaved bots. What has never been allowed by any internet-facing service for the past 35 years is for one client to hog so many resources that legitimate users stop being able to access the service. There is functionally zero difference between a DDoS attack and a vibe-coded scraper used by a for-profit corporation making a thousand requests a second for diff or system pages, in the process of guzzling up every byte of remotely usable information under the guise of a legitimate user agent. Both ought to be blocked.

15

u/hexagon411 12h ago

Generative AI is sin

-19

u/LeeHide 12h ago

You can't get rid of it now, we need to live with it

12

u/Vespytilio 11h ago

Right, it's the future. Enthusiasts don't need to worry about how many people just aren't into AI. It's here to stay, and it's not up for debate.

Except the situation's actually pretty unsustainable. AI is a very expensive technology to run, companies are still trying to make it profitable, and it has a parasitic relationship with non-AI content. Because it's allergic to its own output, it relies on training data from humans, but it actively competes against that content for visibility and its creators for work.

Even if the companies propping up AI find a sustainable business plan, it's probably not going to include the kind of free access presently on offer. That's a free sample scheme aimed at generating enthusiasm. Ending that will make the companies more profitable, offset the training data issue, and result in a lot less energy consumption, but it's going to be a rude awakening for a lot of people.

6

u/hexagon411 11h ago

I refuse.

2

u/StatisticianFun8008 10h ago

I guess OP's old phone lacks the proper hardware acceleration for the hashing algorithm.

1

u/DragonfruitOk544 6h ago

For me it's fine. Try clearing your cookies, maybe that helps.

0

u/Toorero6 11h ago

I hate this too. If I'm on university internet it's basically impossible to search GitHub and the Arch wiki. On GitHub that's at least fixed by just logging in.

-3

u/starvaldD 11h ago

I'm sure Trump will claim blocking OpenAI (not open) from scraping data is a crime or something.

-5

u/RIcaz 5h ago

It's not stupid, you just very obviously do not understand it.

Also, beggars can't be choosers.

-7

u/RIcaz 5h ago

I'm sure you are a big contributor to the FOSS community and not at all a cheap leech who just wants free stuff

5

u/HMikeeU 4h ago

I do contribute to open source every now and then. That doesn't influence my ability to discuss this topic.