
Disallow webcrawlers to follow Paste line number anchor links
ClosedPublic

Authored by aklapper on Nov 10 2023, 11:57.

Details

Summary

Paste provides a line anchor link on every single line of a paste.
If webcrawlers follow these links, they index the very same Paste again.
Thus disallow these links in robots.txt to reduce unneeded traffic and indexing time.

Closes T15662

Test Plan

Go to /robots.txt in the web browser.
Cross fingers that more webcrawlers abide by RFC 9309.
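
For a check that does not rely on eyeballing the page, the Disallow lines can be pulled out of the robots.txt text. This is only a minimal sketch: `disallow_rules` is a hypothetical helper, and `SAMPLE` is made-up input, not the real output of this revision (the exact rule added here is not quoted on this page).

```python
def disallow_rules(robots_txt: str) -> list[str]:
    """Return the path patterns of all non-empty Disallow lines, in order."""
    rules = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow" and value.strip():
            rules.append(value.strip())
    return rules


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /diffusion/
Disallow: /P*%24*   # assumed shape of a Paste line anchor rule
"""

print(disallow_rules(SAMPLE))
```

In practice the text would come from the /robots.txt of the install under test, and the check would be that a rule covering Paste anchors shows up in the list.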

Diff Detail

Repository
rP Phorge
Branch
pasteAnchorsWebcrawlers (branched from master)
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 900
Build 900: arc lint + arc unit

Event Timeline

I will keep this change in my production for a while:

https://gitpull.it/robots.txt

https://gitpull.it/P22$1

Feel free to test.

Note that most online robots.txt testing tools do not work for this. For instance:

🔶 https://technicalseo.com/tools/robots-txt/

I see this change as safe since:

  • In the best case, URLs like /P123$123123 are finally ignored and /P123 is still indexed
  • In the worst case, the page /P123%24 is not indexed, but that URL is nonsense and this should not negatively impact /P123 in any way

Feel free to follow the tip. Please wait at least 10 seconds before landing, so maybe we can collect more feedback

src/applications/system/controller/robots/PhabricatorRobotsPlatformController.php
27

✅ I verified that %24 is the URL encoding of the dollar sign ($).
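
The same check can be reproduced with Python's standard library. The path `/P123$45` below is just an example value, not a URL taken from this revision.

```python
from urllib.parse import quote, unquote

# Percent-encode a lone dollar sign; safe="" means nothing is left unescaped.
encoded = quote("$", safe="")
# Decode the escape sequence back to the original character.
decoded = unquote("%24")

# Example Paste anchor path (made up); "/" is kept as-is, "$" is escaped.
anchor_path = quote("/P123$45", safe="/")

print(encoded, decoded, anchor_path)
```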

As a side note, in theory, the last * is probably not necessary.
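
To illustrate why a trailing * should be redundant: robots.txt path patterns are matched from the start of the path, so anything after the last literal part matches anyway. Below is a rough sketch of that matching, assuming the RFC 9309 special characters (* as wildcard, a trailing $ as end anchor). `rule_matches` and both patterns are hypothetical and not copied from the diff, and the sketch does no percent-encoding normalization, which real crawlers may handle differently.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern against a URL path.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the path, and everything else is matched literally from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only anchors at the start, so this is a prefix match.
    return re.match(regex, path) is not None


WITH_STAR = "/P*%24*"    # assumed rule shape, with the trailing wildcard
WITHOUT_STAR = "/P*%24"  # same rule without it

paths = ["/P123", "/P123%24", "/P123%2445", "/paste/"]
for p in paths:
    print(p, rule_matches(WITH_STAR, p), rule_matches(WITHOUT_STAR, p))
```

Under these assumptions both patterns give the same answer for every path, and /P123 itself stays crawlable either way.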

This revision is now accepted and ready to land.
Nov 11 2023, 21:45

Thanks for landing!

(As a side note, in theory, the last * was very probably not necessary)