Diffusion refuses to show a commit in a non-UTF-8 file

Event Timeline

tinloaf asked this question. Jul 20 2023, 11:07

tinloaf created this object in space S1 Public.

Herald added subscribers: Cigaryno, Matthew. Jul 20 2023, 11:07

To be clear - it's a single file in a single commit, in the equivalent of this page: https://we.phorge.it/rARCa604548101025875de20a9c263df3790fea425b3 - is that right?

And the same file in the same version is shown correctly in the equivalent of https://we.phorge.it/source/arcanist/browse/master/src/parser/ArcanistBundle.php ?

And other commits showing different versions of the same file also work correctly?

And I'm assuming you can't share the specific file, right?

I think the "is binary" calculation only looks for some special characters and/or only in the first x bytes of the file, so it's possible a small change would take the offending character over the limit.

To be clear - it's a single file in a single commit, in the equivalent of this page: https://we.phorge.it/rARCa604548101025875de20a9c263df3790fea425b3 - is that right?

Yes, exactly.

And the same file in the same version is shown correctly in the equivalent of https://we.phorge.it/source/arcanist/browse/master/src/parser/ArcanistBundle.php ?

Yes, and additionally it is also correctly shown in the 'previous' version, i.e., the parent version of the commit.

And other commits showing different versions of the same file also work correctly?

No, it does not look like it. I did not verify all commits that touch the respective file, but ~30 of them, and the file is shown as 'binary' in all those commits.

And I'm assuming you can't share the specific file, right?

Unfortunately, yes. :(

I think the "is binary" calculation only looks for some special characters and/or only in the first x bytes of the file, so it's possible a small change would take the offending character over the limit.

Do you know off the top of your head whether the binary-detection in the commit view and the code browser view differ in some way?

Also, the file that Phorge analyzes should be binary-identical to the file I have checked out on my system, right? There is no conversion of any kind? I'll go ahead and poke around in the file to see if I see any weird bytes in the first part…

I'm using Mercurial by the way, that may complicate things…
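For poking around in a file for weird bytes, something like the following rough sketch could help. It lists the offset and value of every non-ASCII byte, optionally only within the first `limit` bytes. The function name `non_ascii_offsets` and the sample bytes are made up for illustration; nothing here is taken from Phorge.

```python
def non_ascii_offsets(data: bytes, limit=None):
    """Return (offset, byte value) pairs for every byte >= 0x80.

    If `limit` is given, only the first `limit` bytes are inspected.
    """
    window = data if limit is None else data[:limit]
    return [(i, b) for i, b in enumerate(window) if b >= 0x80]


# Made-up sample: ASCII text with two ISO-8859-1 umlaut bytes (0xe4, 0xfc).
sample = b"abc \xe4 def \xfc"

for offset, byte in non_ascii_offsets(sample):
    print(f"offset {offset}: 0x{byte:02x}")
```

For a real file, the bytes would come from `open(path, "rb").read()`, where `path` is whatever file is under suspicion.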

Do you know off the top of your head whether the binary-detection in the commit view and the code browser view differ in some way?

Well, there's obviously some difference, as they get different results...
I'll need to dig some more into the applicable code to find it.

Also, the file that Phorge analyzes should be binary-identical to the file I have checked out on my system, right? There is no conversion of any kind? I'll go ahead and poke around in the file to see if I see any weird bytes in the first part…

I'm not very familiar with hg, but as far as I understand, that's correct - unless there's an explicit conversion in the client, the files should be byte-identical. I know that in git, there's no metadata for "file type", meaning git doesn't actually know if it's "text" or "binary".
On the server, we just use hg clone (or init) to get a bare copy, and other hg commands to read the contents, so there shouldn't be any changes in the content.
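One way to check the byte-identical assumption locally is to compare the working-copy file against what `hg cat -r REV FILE` prints for the same revision. The sketch below is only an illustration: the helper names `digest`, `hg_cat` and `matches_working_copy` are made up, and it assumes `hg` is on the PATH and that no encode/decode filters or similar extensions are configured on the client.

```python
import hashlib
import subprocess
from pathlib import Path


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used to compare two byte strings."""
    return hashlib.sha256(data).hexdigest()


def hg_cat(repo: str, rev: str, path: str) -> bytes:
    """Raw bytes of `path` at revision `rev`, as printed by `hg cat`."""
    result = subprocess.run(
        ["hg", "cat", "-r", rev, path],
        cwd=repo,
        capture_output=True,
        check=True,
    )
    return result.stdout


def matches_working_copy(repo: str, rev: str, path: str) -> bool:
    """True if the checked-out file has the same bytes as the stored revision."""
    on_disk = Path(repo, path).read_bytes()
    return digest(on_disk) == digest(hg_cat(repo, rev, path))
```

Calling `matches_working_copy` with the path to a local clone, a revision and a file name should return `True` if nothing converts the file between the repository and the checkout.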

Okay, I think I created a minimal example reproducing the problem. The repository is publicly available here: https://sourceforge.net/p/tinloaf-phorge-problem/code-hg/ci/default/tree/

I just reproduced this problem on my (very recent) Phorge installation by doing this:

1. Add a new Mercurial repository, set it to observe http://hg.code.sf.net/p/tinloaf-phorge-problem/code-hg
2. Do not change the repo encoding. It should be set to utf-8 (default)
3. Wait for the repository to be cloned

There is a file texttext.txt in the repo. It should be viewable in the code browser with a warning that it was converted from ISO-8859. This is correct - the file texttext.txt contains ISO-8859 umlauts.

However, the most recent commit should not be viewable. The commit page just says 'this is a binary file', even if one manually changes the encoding to ISO-8859.

Edit: @avivey your idea that maybe only the first part of the file is inspected to determine the encoding may be correct, by the way. I had to add some lorem ipsum text (which does not contain any non-ASCII characters) before the ä, ü, etc. to trigger the problem.
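To illustrate why the ASCII filler could matter, here is a small sketch of the "only the first part is inspected" idea. It is purely hypothetical and not Phorge's actual detection code: the 1024-byte limit, the function names and the sample data are all assumptions. A check that only validates a prefix accepts the file as UTF-8, while a check over the whole file rejects it.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Strict UTF-8 validation over all of `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def prefix_is_valid_utf8(data: bytes, limit: int) -> bool:
    """Validate only the first `limit` bytes (hypothetical heuristic).

    The incremental decoder with final=False tolerates a multi-byte
    sequence that is cut off at the end of the prefix.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data[:limit], final=False)
        return True
    except UnicodeDecodeError:
        return False


# ASCII-only filler longer than the assumed limit, then ISO-8859-1 umlauts.
filler = b"Lorem ipsum dolor sit amet. " * 50
umlauts = "\u00e4\u00f6\u00fc".encode("iso-8859-1")  # b'\xe4\xf6\xfc'
sample = filler + umlauts

print(prefix_is_valid_utf8(sample, 1024))  # prefix is pure ASCII
print(is_valid_utf8(sample))               # the umlaut bytes are not valid UTF-8
```

If two code paths disagreed in this way, one could treat the file as text while the other falls back to "binary", which would be consistent with what the minimal example shows.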

Cool, I'll play with it and see what I can find.

EDIT: I'm able to reproduce!

avivey added an answer. Jul 26 2023, 17:40

MacFan4000 closed this question as resolved. Aug 26 2023, 05:21