Per old upstream's https://secure.phabricator.com/T4610 ("I suspect no installs are ever interested in spiders generating an index of Diffusion.") and per current upstream's https://we.phorge.it/robots.txt including Disallow: /diffusion/, I propose to also disallow indexing Diffusion URLs like https://we.phorge.it/rP7868ab3754fad13714640451d664d5bc71b7a02f
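For reference, a minimal robots.txt along the lines of current upstream's (only the Diffusion rule is shown; the real file at https://we.phorge.it/robots.txt may contain more):

```
User-Agent: *
Disallow: /diffusion/
```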
This is something that may not be appreciated by some people.
Not everybody likes GitHub and not everybody mirrors things there, so some public installations may want their commits indexed by search engines: it simplifies troubleshooting in general and attracts new contributors who are looking for something that is already fixed by a specific commit with a particular commit message.
Indeed, you may say that Phorge already covers this with Differential revisions, but a commit does not need to be associated with a Differential revision. So I'm inclined to say that this is not an optimal default.
It should be easy for other installations to do that, but I don't think it's a good default to de-index all commits.
Valerio: Uhm, I'm sorry, I hadn't seen your comment here before I landed the patch (I had checked my Differential page instead of my notifications).
From a product POV, I agree with @valerio.bozzolan - there is (sometimes) some information on commits that would be nice to index in a search engine - comments, mostly.
What's the motive to disallow these from being indexed?
Thinking more, I think we'd like to allow the robots to index the latest version of the code; these days the big players know how to handle that. Stopping them from crawling older versions is still important.
Anyway, I vote to revert the change: commit pages can have discussions on them.
OK. Then we can add a Task about how to easily configure robot changes without forking, in case.
In case of what?
Nobody showed a use-case for customizing it yet (or for excluding /rXXXX, for that matter).
When I was at Wikimedia I remember a lot of issues from search robots endlessly indexing dynamic pages.
https://phabricator.wikimedia.org/robots.txt is the result of many incidents with heavy traffic from one or two robots going wild, indexing every single commit hash and every individual file in every branch and tag. In Diffusion there are lots of URLs that serve essentially the same content, and it can turn into a search engine trap. If I remember correctly, one pathological case has to do with the way you can link to individual lines in a file: each line has a unique URL, and it doesn't use the #hash part of the URL for the line number (which search engines are better about ignoring).
A root problem is that the highlighted line number(s) should really be a # fragment, so as not to multiply pages exponentially.
If it's that easy, then I'm both impressed and surprised it remained this way for so long. I'm actually not quite sure I understand the reasoning for not using # to begin with.
Good work tracking that down, @aklapper! I'll attempt to test locally.
We could also add nofollow attributes to links on commit landing pages, that would stop engines from getting lost in the commit graph but they would still find links to commits elsewhere and index them. Maybe we could add a global flag that we check in phutil_link and add the attribute automatically if the flag is set. It's a bit of a dirty hack but it would achieve the needed result.
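As a rough illustration of what the rendered markup could look like with such a flag enabled (the surrounding link is invented for this example; rel="nofollow" is the standard HTML mechanism):

```html
<!-- hypothetical output of phutil_link with the proposed global flag set -->
<a href="/rP7868ab3754fad13714640451d664d5bc71b7a02f" rel="nofollow">rP7868ab37</a>
```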
Ah, also adding a small meta "noindex" HTML tag on legacy file.php$123 pages and similar ones would maybe help a little bit.
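The tag in question would be the standard robots meta tag, e.g.:

```html
<!-- in the <head> of legacy file.php$123-style pages (illustrative) -->
<meta name="robots" content="noindex">
```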
Maybe we can rephrase the title of this task a bit to "avoid indexing commit lines", since we are going in that direction.
I'm guessing $ is used instead of # because (1) a user agent might not send the # part to the server, and (2) the natural behavior of # ("scroll to this anchor") isn't the intended behavior ("highlight these lines and scroll to the first one").
The robots rule can forbid anything that has a $ in it...
If it's that easy
It's not that easy... For example, clicking on a line number and dragging down updates the URL in the address bar to phorge.localhost/P1#33-41, but accessing that URL of course will not jump to those lines and highlight them (because it uses # instead of $, and because #33-41 is not a valid single anchor like #33).
Let me put up a proper patch in Differential which should fix that problem.
The underlying question, of course, is what the balance is between giving up functionality and reducing web crawler load.
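The fragment parsing such a patch would need can be sketched as follows (a minimal sketch, assuming fragments of the form #33 or #33-41; the function name is invented and this is not Phorge's actual JS):

```javascript
// Parse a "#33" or "#33-41" style URL fragment into a line range,
// so a plain # fragment could drive line highlighting.
// Returns {start, end} or null when the fragment is not a line reference.
function parseLineFragment(hash) {
  const match = /^#(\d+)(?:-(\d+))?$/.exec(hash);
  if (!match) {
    return null;
  }
  const start = parseInt(match[1], 10);
  // A single "#33" fragment is treated as the one-line range 33-33.
  const end = match[2] ? parseInt(match[2], 10) : start;
  return { start, end };
}
```

The browser could then read `location.hash` on load, highlight the range, and scroll to the first line, which would keep the line numbers out of the crawlable URL path.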
We don't create the page$line links as hrefs in most places, so this shouldn't be an issue. Such hrefs:
- Don't exist in Diffusion
- Do exist in Paste
- Don't exist in Differential
In Diffusion/Differential the eventual link is created using JS, so a crawler can't find it.
The same can be done in Paste (i.e. stop generating href tags).
And robots.txt can have *$* in it as another level of defence.
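A sketch of that rule (note that per RFC 9309 a trailing "$" in a pattern means end-of-URL, so how crawlers match a literal mid-path "$" may vary; this would need testing against the major crawlers):

```
User-Agent: *
Disallow: /*$*
```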
My ten cents at a broader level:
Maybe simply make robots.txt a user-configurable config option? (Similar ways to do the same thing included.)
(Yeah, I can do the same thing via Apache, nginx, and so on, but I find it much more pleasurable to do stuff in the Phorge web interface…)
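For comparison, the web-server-level override mentioned here is small (nginx shown; the file path is illustrative):

```
# serve a hand-maintained robots.txt instead of the application's
location = /robots.txt {
    alias /usr/local/etc/phorge-robots.txt;
}
```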