Diffusion repository commits: avoid to be a black hole for webcrawlers
Open, LowPublic
Actions

Assigned To

Authored By

	aklapper
	Nov 16 2023, 09:55

Description

Per old upstream's https://secure.phabricator.com/T4610 ("I suspect no installs are ever interested in spiders generating an index of Diffusion.") and per current upstream's https://we.phorge.it/robots.txt including Disallow: /diffusion/, I propose to also disallow indexing Diffusion URLs like https://we.phorge.it/rP7868ab3754fad13714640451d664d5bc71b7a02f

Whenever we disallow robots from visiting these contents or not, Diffusion repository commits at the moment are black holes (every single line is a link). So still there is room for improvement.

Revisions and Commits

rP Phorge
	D25474	rP05f4d5071fdc Disallow webcrawlers to index Diffusion commits

Related Objects

Mentioned In: D25569: Avoid separate per-line URIs in line number column
Mentioned Here: P35 Replace fragments
rP7868ab3754fa: Make InterpreterBlockRule regex only match on valid interpreter names

Event Timeline

aklapper created this task.Nov 16 2023, 09:55

Herald added subscribers: Cigaryno, Matthew, chris, tobiaswiese. · View Herald TranscriptNov 16 2023, 09:55

aklapper added a revision: D25474: Disallow webcrawlers to index Diffusion commits.Nov 16 2023, 10:11

This is something that may be not appreciated by some people.

Not everybody likes GitHub and not everybody mirrors things there, so, some public installations may like to have commits indexed on search engines to simplify troubleshooting in general and attract new contributors who are looking for something that is already fixed by a specific commit that has a particular commit message.

Indeed you may say that Phorge already has Differential revisions with that, but any commit does not need to be associated to a Differential revision. So I'm inclined to say that this is not an optimal default.

It should be easy for other installations to do that, but I don't think it's a good default to de-index all commits.

aklapper closed this task as Resolved by committing rP05f4d5071fdc: Disallow webcrawlers to index Diffusion commits.Nov 17 2023, 07:45

aklapper claimed this task.

aklapper added a commit: rP05f4d5071fdc: Disallow webcrawlers to index Diffusion commits.

Valerio: Uhm, I'm sorry, I had not seen your comment here before I landed the patch (as I had checked my Differential page instead of my notifications).

From a product POV, I agree with @valerio.bozzolan - there is (sometimes) some information on commits that would be nice to index in a search engine - comments, mostly.

What's the motive to disallow these from being indexed?

avivey reopened this task as Open.Nov 19 2023, 09:14

Thinking more, I think we'd like to allow the robots to index latest version of the code - these days the big boys know how to handle that. Stopping them from crawling older versions is still important.

Anyway, I vote to revert the change - commit pages can have discussions in.

OK. Then we can add a Task about how to easily configure robot changes without forking, in case.

In case of what?

Nobody showed a use-case for customizing it yet (or for excluding /rXXXX, for that matter).

When I was at Wikimedia I remember a lot of issues from search robots endlessly indexing dynamic pages.

https://phabricator.wikimedia.org/robots.txt is the result of many incidents with heavy traffic from one or two robots going wild indexing every single commit hash and every individual file in every branch and tag. In Diffusion there are lots of URLs that serve essentially the same content. It can turn into a search engine trap. If I remember correctly, one pathological case has to do with the way you can link to individual lines in a file, each line has a unique url and it doesn't use the #hash part of the url for the line number (which search engines are better about ignoring)

A root problem is that highlighted line number(s) should be a # fragment really, to do not multiply pages exponentially.

In T15670#16064, @valerio.bozzolan wrote:

A root problem is that highlighted line number(s) should be a # fragment really, to do not multiply pages exponentially.

That would require fiddling with the getURILineRange code

In T15670#16064, @valerio.bozzolan wrote:

A root problem is that highlighted line number(s) should be a # fragment really, to do not multiply pages exponentially.

Throwing in a quick draft which does welcome some testing:

P35 Replace fragments

1	commit af19928d1565e748b35099ab8ee047de097be57b
2	Author: Andre Klapper <a9016009@gmx.de>
3	Date: Tue Apr 2 21:24:58 2024 +0200
4
5	Replace expensive $ fragment links with standard # HTML anchors
6
7	Create an HTML anchor fragment via `'id' => $line_number` that we can link to, and replace the `$` with `#` in the URIs.
8
9	The `$1-10` range syntax will still work (and might be posted across places in Phorge) but webcrawlers on Diffusion source code pages and in Pastes will only find standard # anchor links for the very same page not to follow. Which is our goal per https://we.phorge.it/T15670 when it comes to reducing distractive webcrawler behavior.
10
11	For general understanding:
12	* src/view/layout/PhabricatorSourceCodeView.php calls `Javelin::initBehavior('phabricator-line-linker')`.
13	* webroot/rsrc/js/core/behavior-line-linker.js calls `JX.DOM.alterClass(cursor, 'phabricator-source-highlight', true)` to set the background color of a highlighted line via CSS.
14	* webroot/rsrc/js/core/behavior-line-linker.js calls `JX.DOM.scrollToPosition(0, JX.$V(anchor).y - 60);` to make the browser jump to the highlighted range, but again we cannot enforce highlighting of a line.
15	"the other way round" here as the scroll position update is done locally by the browser only (and the browser jumps to the anchor line as the very first line, instead of 60px above as the server-side JS does here).
16	* We cannot highlight our anchor line in `AphrontRequest.php`'s `getURILineRange` as the fragment identifier is never sent to our server but only locally handled in the browser. So we lose highlighting with the anchor approach.
17
18	Test Plan:
19	* Go to http://phorge.localhost/P1, click on line number 47, see that highlighting still works and URL in the address bar becomes http://phorge.localhost/P1#47 without a page reload
20	* Go to http://phorge.localhost/P1#33 and see that scrolling works as expected
21	* Try http://phorge.localhost/P1$38-40 syntax and see that highlighting still works as expected
22	* Do the same test in a source code file rendered in Diffusion
23
24	diff --git a/src/view/layout/PhabricatorSourceCodeView.php b/src/view/layout/PhabricatorSourceCodeView.php
25	index ae20fe2501..fd4db19176 100644
26	--- a/src/view/layout/PhabricatorSourceCodeView.php
27	+++ b/src/view/layout/PhabricatorSourceCodeView.php
28	@@ -132,7 +132,7 @@ final class PhabricatorSourceCodeView extends AphrontView {
29
30	if ($this->canClickHighlight) {
31	if ($base_uri) {
32	- $line_href = $base_uri.'$'.$line_number;
33	+ $line_href = $base_uri.'#'.$line_number;
34	} else {
35	$line_href = null;
36	}
37	@@ -142,6 +142,7 @@ final class PhabricatorSourceCodeView extends AphrontView {
38	array(
39	'href' => $line_href,
40	'data-n' => $line_number,
41	+ 'id' => $line_number,
42	));
43	} else {
44	$tag_number = phutil_tag(
45	diff --git a/webroot/rsrc/js/core/behavior-line-linker.js b/webroot/rsrc/js/core/behavior-line-linker.js
46	index fd8510646e..611a3311c2 100644
47	--- a/webroot/rsrc/js/core/behavior-line-linker.js
48	+++ b/webroot/rsrc/js/core/behavior-line-linker.js
49	@@ -164,7 +164,7 @@ JX.behavior('phabricator-line-linker', function() {
50
51	uri = JX.$U(uri);
52	path = uri.getPath();
53	- path = path + '$' + lines;
54	+ path = path + '#' + lines;
55	uri = uri.setPath(path).toString();
56
57	JX.History.replace(uri);

If it's that easy, then I'm both impressed and surprised it remained this way for so long. I'm actually not quite sure I understand the reasoning for not using # to begin with.

Good work tracking that down, @aklapper! I'll attempt to test locally.

In T15670#15501, @avivey wrote:

Thinking more, I think we'd like to allow the robots to index latest version of the code - these days the big boys know how to handle that. Stopping them from crawling older versions is still important.

Anyway, I vote to revert the change - commit pages can have discussions in.

We could also add nofollow attributes to links on commit landing pages, that would stop engines from getting lost in the commit graph but they would still find links to commits elsewhere and index them. Maybe we could add a global flag that we check in phutil_link and add the attribute automatically if the flag is set. It's a bit of a dirty hack but it would achieve the needed result.

Ah, also adding a smalll meta "noindex" HTML tag on legacy file.php$123 pages and similar ones, would maybe help a little bit.

Maybe we can rephrase a bit the title of this task to avoid to index commit lines, since we are going in that direction.

I'm guessing $ is used instead of # because (1) a user-agent might not send the # part to the server, and (2) the natural behavior of # ("scroll to this anchor") isn't what the intended behavior ("highlight these lines and scroll to the first one").

The robots rule can forbid anything that has a $ in it...

If it's that easy

It's not that easy... For example, clicking on a line number and dragging down updates the URL in the address bar to phorge.localhost/P1#33-41 but accessing that URL of course will not jump to those lines and highlight them (because # instead of $ and because #33-41 is not valid #33).
Let me put up a proper patch in Differential which should fix that problem.
The underlying question ofc is what's the balance between giving up functionality versus reducing webcrawler load.

aklapper mentioned this in D25569: Avoid separate per-line URIs in line number column.Apr 4 2024, 21:20

We don't create the links to page$line in most places as hrefs, so this shouldn't be an issue.

Don't exist in Diffusion
Do exist in Paste
Don't exist in Differential

In Diffusion/Differential the eventual link is created using JS, so a crawler can't find it.
The same can be done in Paste (i.e. stop generating href tags).

and the robots.txt can have *$* in it as another level of defence.

Probably my somewhat 10 cents at a broader level;

Maybe simply just make robots.txt user-configurable config option? (Similar ways to do same thing included)

(Yeah, I can do same thing via apache, nginx, blahblahblah, but I find it much more pleasurable to do stuff on phorge web interface…)

(Trying to make this task a bit more about the root problem, and less about the proposed solution)

valerio.bozzolan renamed this task from Disallow webcrawlers to index Diffusion repository commits to Diffusion repository commits: avoid to be a black hoile for webcrawlers.Jul 21 2024, 08:39

valerio.bozzolan renamed this task from Diffusion repository commits: avoid to be a black hoile for webcrawlers to Diffusion repository commits: avoid to be a black hole for webcrawlers.

valerio.bozzolan triaged this task as Low priority.

valerio.bozzolan updated the task description. (Show Details)

Cigaryno added a project: Diffusion.Dec 28 2024, 18:40

Cigaryno moved this task from Backlog to Discussion Needed on the Diffusion board.Dec 29 2024, 17:34

Diffusion repository commits: avoid to be a black hole for webcrawlersOpen, LowPublicActions

Description

Revisions and Commits

Related Objects

Event Timeline

Diffusion repository commits: avoid to be a black hole for webcrawlers
Open, LowPublic
Actions