
Diffusion repository file lines: adopt web fragments (from: $123 to: #L123) to reduce the number of permalinks visited by web crawlers
Open, Low, Public

Description

Historically, to point to line 123, we have used the URL suffix '$123'. Example:

https://we.phorge.it/source/phorge/browse/master/LICENSE$123

Unfortunately, when such URIs are mentioned on public installations, these permalinks are (technically) completely different pages:

https://we.phorge.it/source/phorge/browse/master/LICENSE$123

https://we.phorge.it/source/phorge/browse/master/LICENSE$122

https://we.phorge.it/source/phorge/browse/master/LICENSE$121

https://we.phorge.it/source/phorge/browse/master/LICENSE

All the above URIs are distinct from the perspective of a web crawler, but semantically they are the same page. Merely highlighting a line (or not) does not justify a dedicated "nice permalink" for each line.

The problem with this approach is that, if human beings mention the page LICENSE$123 somewhere (e.g. in task comments), you can be sure that:

  • crawlers and users will visit both URI variants, LICENSE and LICENSE$123, even if LICENSE was already visited and indexed/cached. The consequences are:
    • from the Phorge perspective:
      • extra webserver hits that could be avoided or cached (with fragments such as #L123, no extra request is made)
      • extra CPU/RAM consumption on your Phorge server
        • this is especially true in the age of very aggressive pirate AI web-scraper bots: the more pages you have, the more pages they aggressively visit
      • probably also SEO "duplicate content" issues (not really relevant here, but still worth mentioning)
    • from an external perspective:
      • unnecessary duplicate content in search engine databases
        • you may say that Google's problems are not our problem and that we need not care, but maybe we do care about the Internet Archive
    • from the perspective of users:
      • we should not try to track which IP address visited which specific source code line, so using web fragments (e.g. #L123) is a better way to just highlight a line
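
To make the crawler's point of view concrete, here is a minimal sketch in Python. It relies on the fact that a fragment is never sent in an HTTP request, so a crawler that normalizes URIs before fetching sees only one page for all #L variants. The helper name crawl_key is hypothetical and only used for illustration.

```python
from urllib.parse import urldefrag


def crawl_key(uri: str) -> str:
    """Return the part of the URI that is actually requested from the server."""
    # urldefrag() splits off the fragment; the fragment stays on the client.
    return urldefrag(uri).url


base = "https://we.phorge.it/source/phorge/browse/master/LICENSE"

# Legacy style: the line number is part of the path.
legacy = [base, base + "$121", base + "$122", base + "$123"]

# Proposed style: the line number lives in the fragment.
fragments = [base, base + "#L121", base + "#L122", base + "#L123"]

print(len({crawl_key(u) for u in legacy}))     # 4 distinct pages to fetch
print(len({crawl_key(u) for u in fragments}))  # 1 page to fetch
```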

In short, we have 5+ reasons to stop promoting URIs such as LICENSE$123, and no good reason to keep promoting them, since patches like D25569 demonstrate that we can easily use the web fragment (via JavaScript) to highlight a line, or a line range, without relying heavily on server-side generation for this specific case.
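
As a rough illustration of the client-side logic involved, the sketch below parses a fragment into a line range. It is written in Python for brevity and is not taken from D25569; in Phorge the equivalent logic would live in JavaScript. The function name parse_line_fragment and the range syntax "L<start>-<end>" are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Assumed grammar: "L<start>" or "L<start>-<end>" (the range form is hypothetical).
_LINE_FRAGMENT = re.compile(r"L([0-9]+)(?:-([0-9]+))?")


def parse_line_fragment(fragment: str) -> Optional[Tuple[int, int]]:
    """Return (first_line, last_line) to highlight, or None if not a line fragment."""
    match = _LINE_FRAGMENT.fullmatch(fragment.lstrip("#"))
    if match is None:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if start < 1 or end < 1:
        return None  # line numbers are 1-based
    # Tolerate reversed ranges such as "L20-10".
    return (min(start, end), max(start, end))


print(parse_line_fragment("#L123"))
print(parse_line_fragment("#L10-20"))
print(parse_line_fragment("#123"))
```

A fragment that does not match, such as "#123", yields None, in which case nothing would be highlighted.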

Non-Bug

This task has, perhaps surprisingly, been given a low priority because:

NOTE: Don't panic. Phorge does NOT cause 1000 pages to be visited by crawlers just because a file has 1000 lines. At the moment, a user MUST click on line 123 to generate the $123 destination, and must share that URI somewhere before a scraper can find it. So this is a very low priority bug: stop generating new $123 URIs and use web fragments (#L123) instead, so that humans discussing source code do not expose new, redundant permalinks to web crawlers.

Upstream

Exploration

Note that web anchors should start with a letter: an anchor such as #123 is not a valid identifier according to the W3C CSS 2.1 specification, which says of identifiers:

they cannot start with a digit, two hyphens, or a hyphen followed by a digit.

https://www.w3.org/TR/CSS21/syndata.html#value-def-identifier
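
The quoted rule can be checked mechanically. The sketch below is a simplified approximation that only covers ASCII names and ignores the escape sequences and non-ASCII characters that CSS 2.1 also permits; the function name is_plain_css_identifier is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of a CSS 2.1 identifier:
# an optional single hyphen, then a letter or underscore, then
# letters, digits, hyphens or underscores.
_ASCII_IDENT = re.compile(r"-?[A-Za-z_][A-Za-z0-9_-]*")


def is_plain_css_identifier(name: str) -> bool:
    """True if `name` can be used as an identifier without escaping (ASCII subset)."""
    return _ASCII_IDENT.fullmatch(name) is not None


print(is_plain_css_identifier("123"))   # False: starts with a digit
print(is_plain_css_identifier("L123"))  # True: starts with a letter
```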

Both GitLab and GitHub already use the format #L123:

https://gitlab.com/ItalianLinuxSociety/ilsmanager/-/blob/master/README.md?ref_type=heads&plain=1#L123

https://github.com/phorgeit/phorge/blob/master/README.md?plain=1#L123

Phorge generally does not aim to follow whatever GitHub, GitLab and others do, but this web fragment format, #L<LINE>, sounds very reasonable.

So, we SHOULD encourage the use of anchors instead, moving from the first form to the second:

https://we.phorge.it/source/phorge/browse/master/LICENSE$123
https://we.phorge.it/source/phorge/browse/master/LICENSE#L123
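
The mapping between the two forms is mechanical. The following sketch shows how the URI for a clicked line could be produced in the new style; it is only an illustration, not Phorge code. The function name to_fragment_uri is hypothetical, and support for a legacy range suffix such as $10-20 is an assumption of this example.

```python
import re

# Matches a trailing "$<line>" or "$<start>-<end>" suffix (the range form is assumed).
_LEGACY_SUFFIX = re.compile(r"\$([0-9]+(?:-[0-9]+)?)$")


def to_fragment_uri(uri: str) -> str:
    """Rewrite a trailing legacy '$123' suffix into a '#L123' fragment."""
    # URIs without a legacy suffix are returned unchanged.
    return _LEGACY_SUFFIX.sub(r"#L\1", uri)


print(to_fragment_uri("https://we.phorge.it/source/phorge/browse/master/LICENSE$123"))
```

Such a helper would apply only when generating new links; links already posted by people are a separate matter, discussed under Limitations.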

Limitations

Phorge can probably improve the situation going forward, when it generates these URIs, but Phorge should probably not try to automatically "rewrite" old links that people have already posted here and there.