Diffusion refuses to show a commit in a non-UTF-8 file
Closed, ResolvedPublic

Asked by tinloaf on Jul 20 2023, 11:07.

Details

Hi,

I have a sort-of-weird setup. I have a repository with several 'product version' branches. The repository is old, and historically, the source code was not encoded in UTF-8, but in Windows CP-1252. We did convert everything to UTF-8 a while ago, so all 'newer' branches, starting with a branch I'll call version_X, are UTF-8-encoded, and all older branches (up to version_X-1) are encoded in CP-1252.

I have selected UTF-8 as repository encoding. This means that when viewing one of the contained files on an older branch, Diffusion will show me a warning like this:

This document is not UTF8. It was detected as ISO-8859-1 (Latin 1) and converted to UTF8 for display.

That's nice.

Now I have a commit on such a file and would like to perform an audit. However, the commit viewer only shows "This is a binary file." I tried selecting View Options -> Change Text Encoding… from the menu in the headline of the file and choosing CP-1252 (or ISO-8859-1), but in both cases it still tells me "This is a binary file."

I verified that the Diffusion code viewer can display the whole file immediately before and immediately after the respective commit, in both cases showing the "…converted to UTF8…" warning. The changes in the commit itself do not contain any non-ASCII characters.

Is there anything else I can do to figure out why Diffusion thinks this is a binary file?

Event Timeline

To be clear - it's a single file in a single commit, in the equivalent of this page: https://we.phorge.it/rARCa604548101025875de20a9c263df3790fea425b3 - is that right?

And the same file in the same version is shown correctly in the equivalent of https://we.phorge.it/source/arcanist/browse/master/src/parser/ArcanistBundle.php ?

And other commits showing different versions of the same file also work correctly?

And I'm assuming you can't share the specific file, right?

I think the "is binary" calculation only looks for some special characters and/or only in the first x bytes of the file, so it's possible a small change would take the offending character over the limit.

To be clear - it's a single file in a single commit, in the equivalent of this page: https://we.phorge.it/rARCa604548101025875de20a9c263df3790fea425b3 - is that right?

Yes, exactly.

And the same file in the same version is shown correctly in the equivalent of https://we.phorge.it/source/arcanist/browse/master/src/parser/ArcanistBundle.php ?

Yes, and additionally it is also correctly shown in the 'previous' version, i.e., the parent version of the commit.

And other commits showing different versions of the same file also work correctly?

No, it does not look like it. I did not verify all commits that touch the respective file, but ~30 of them, and the file is shown as 'binary' in all those commits.

And I'm assuming you can't share the specific file, right?

Unfortunately, yes. :(

I think the "is binary" calculation only looks for some special characters and/or only in the first x bytes of the file, so it's possible a small change would take the offending character over the limit.

Do you know off the top of your head whether the binary-detection in the commit view and the code browser view differ in some way?

Also, the file that Phorge analyzes should be binary-identical to the file I have checked out on my system, right? There is no conversion of any kind? I'll go ahead and poke around in the file to see if I see any weird bytes in the first part…

I'm using Mercurial by the way, that may complicate things…
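One possible way to poke around for such bytes is sketched below in Python. This is only an illustration, not part of Phorge: the helper name find_suspicious_bytes and the sample text are made up for this example. It lists the offsets of NUL bytes and of bytes outside the ASCII range, which are the bytes most likely to matter for any "is binary" or encoding check.

```python
def find_suspicious_bytes(data: bytes, limit: int = 10):
    """Return up to `limit` (offset, byte value) pairs for NUL and non-ASCII bytes.

    Hypothetical helper for manual inspection; not a Phorge or Mercurial API.
    """
    hits = []
    for offset, value in enumerate(data):
        # 0x00 is the NUL byte; anything >= 0x80 is outside plain ASCII.
        if value == 0 or value >= 0x80:
            hits.append((offset, value))
            if len(hits) >= limit:
                break
    return hits


# Made-up sample: ASCII filler followed by an umlaut encoded as CP-1252.
sample = "Lorem ipsum ä".encode("cp1252")
print(find_suspicious_bytes(sample))  # [(12, 228)], i.e. byte 0xE4 at offset 12
```

For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.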

Do you know off the top of your head whether the binary-detection in the commit view and the code browser view differ in some way?

Well, there's obviously some difference, as they get different results...
I'll need to dig some more into the applicable code to find it.

Also, the file that Phorge analyzes should be binary-identical to the file I have checked out on my system, right? There is no conversion of any kind? I'll go ahead and poke around in the file to see if I see any weird bytes in the first part…

I'm not very familiar with hg, but as far as I understand, that's correct - unless there's an explicit conversion in the client, the files should be byte-identical. I know that in git, there's no metadata for "file type", meaning git doesn't actually know if it's "text" or "binary".
On the server, we just use hg clone (or init) to get a bare copy, and other hg commands to read the contents, so there shouldn't be any changes in the content.

Okay, I think I created a minimal example reproducing the problem. The repository is publicly available here: https://sourceforge.net/p/tinloaf-phorge-problem/code-hg/ci/default/tree/

I just reproduced this problem on my (very recent) Phorge installation by doing this:

Add a new Mercurial repository, set it to observe http://hg.code.sf.net/p/tinloaf-phorge-problem/code-hg
Do not change the repo encoding. It should be set to utf-8 (default)
Wait for the repository to be cloned

There is a file texttext.txt in the repo. It should be viewable in the code browser with a warning that it was converted from ISO-8859. This is correct - the file texttext.txt contains ISO-8859 'Umlaute'.

However, the most recent commit should be non-viewable, it just says 'this is a binary file' - even if one manually changes the encoding to ISO-8859.

Edit: @avivey your idea that maybe only the first part of the file is inspected to determine the encoding may be correct, by the way. I had to add some lorem ipsum text (which does not contain any non-ASCII characters) before the ä, ü, etc. to trigger the problem.

Cool, I'll play with it and see what I can find.

EDIT: I'm able to reproduce!

Answers

avivey
Updated 666 Days Ago

OK, I was able to reproduce and trace the code.
TL;DR: the views use different detection modes, but after setting the repository's encoding to the right encoding (ISO-8859-1 in this case), the commit view does show the correct thing (at least as far as my eye can tell).

Oddly enough, the repository's encoding setting doesn't seem to affect the regular browse view - I still get the message "was detected as..." at the top of the file.

There is a bug here, though: the Commit view allows you to specify an encoding for a one-off view, and then ignores this selection.

Gory details:

The "Browse" view (single file in a specific state) is using DiffusionDocumentRenderingEngine, which picks up all possible "engines" (text, json, pdf, video, hexdump....), and tries them all in some order.
The "Text" renderers check for "is binary" by searching for the NULL char (\0) in the first 1MB of the file - logic that was copied from git. NULL chars probably don't show up in any of the reasonable text encodings we use. Maybe.
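As a rough illustration of the Browse-view check described above, here is a minimal Python sketch. It is not the actual Phorge (PHP) code; the function name looks_binary_by_nul is hypothetical, and treating "1MB" as 1024 * 1024 bytes is an assumption.

```python
ONE_MB = 1024 * 1024  # assumed window size; the description only says "first 1MB"


def looks_binary_by_nul(data: bytes, window: int = ONE_MB) -> bool:
    """Treat content as binary if a NUL byte appears within the first `window` bytes.

    Hypothetical sketch of the heuristic described above, not Phorge's implementation.
    """
    return b"\x00" in data[:window]


# CP-1252 text with umlauts contains no NUL byte, so this heuristic calls it text.
print(looks_binary_by_nul("Grüße".encode("cp1252")))  # False
# Any NUL byte inside the window flips the result.
print(looks_binary_by_nul(b"\x89PNG\x00\x00"))        # True
# A NUL byte beyond the window is not seen at all.
print(looks_binary_by_nul(b"a" * 8 + b"\x00", window=8))  # False
```

Under this heuristic the file's encoding never matters, which would be consistent with the Browse view displaying the CP-1252 file without trouble.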

The "Commit" view (showing changes in a file in a commit) is using a much more primitive rendering engine, because it's doing a 2-up and marking changes.
It gets the files' content (multiple - before and after the commit) by invoking DiffusionDiffQueryConduitAPIMethod, and that is using ArcanistDiffParser (same one used in arc diff).
ArcanistDiffParser detects "is_binary" as follows:

It calls mb_check_encoding($corpus, 'UTF-8').

If the file isn't valid UTF-8 and no backup encoding (the repository encoding) is provided, that settles it - the file is binary.

If a backup encoding is defined, it searches for the NULL char anywhere in the file.

If a NULL char exists in the file, it's binary.

Otherwise, it tries to convert the file from the backup encoding to UTF-8. If the conversion fails, it throws an exception and there's no diff. If it works, then it's a text file.
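The steps above can be sketched in Python as follows. This is a paraphrase of the described behaviour, not the actual ArcanistDiffParser code (which is PHP): the function name classify_for_diff is hypothetical, Python's strict UTF-8 decoding stands in for mb_check_encoding, and the assumption that valid UTF-8 is accepted as text straight away is read off the description rather than the source code.

```python
def classify_for_diff(corpus: bytes, fallback_encoding=None) -> str:
    """Return "text" or "binary", following the steps described above.

    Hypothetical sketch; a failed fallback conversion raises, mirroring the
    "throws an exception and there's no diff" case.
    """
    # Step 1: stand-in for mb_check_encoding($corpus, 'UTF-8').
    try:
        corpus.decode("utf-8")
        return "text"  # assumption: valid UTF-8 is treated as text
    except UnicodeDecodeError:
        pass

    # Step 2: not UTF-8 and no backup (repository) encoding: binary.
    if fallback_encoding is None:
        return "binary"

    # Step 3: with a backup encoding, a NUL byte anywhere means binary.
    if b"\x00" in corpus:
        return "binary"

    # Step 4: try the conversion; a failure propagates as an exception.
    corpus.decode(fallback_encoding)
    return "text"


latin1_text = "Grüße".encode("cp1252")  # not valid UTF-8
print(classify_for_diff(latin1_text))                # binary
print(classify_for_diff(latin1_text, "iso-8859-1"))  # text
```

The two calls at the end correspond to the situation in this thread: with no usable backup encoding the commit view reports a binary file, and with ISO-8859-1 as the backup encoding the same bytes are classified as text.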
