Details
Hi,
I have a sort-of-weird setup. I have a repository with several 'product version' branches. The repository is old, and historically, the source code was not encoded in UTF-8, but in Windows CP-1252. We did convert everything to UTF-8 a while ago, so all 'newer' branches, starting with a branch I'll call version_X, are UTF-8-encoded, and all older branches (up to version_X-1) are encoded in CP-1252.
I have selected UTF-8 as repository encoding. This means that when viewing one of the contained files on an older branch, Diffusion will show me a warning like this:
This document is not UTF8. It was detected as ISO-8859-1 (Latin 1) and converted to UTF8 for display.
That's nice.
Now I have a commit on such a file and would like to perform an audit. However, the commit viewer only shows This is a binary file. I tried selecting View Options -> Change Text Encoding… from the menu in the headline of the file, and selecting CP-1252 (or ISO-8859-1), but in both cases it still tells me that This is a binary file.
I verified that the Diffusion Code Viewer can display the whole file immediately before the respective commit and immediately after the commit, both times showing the …converted to UTF8… warning. The changes in the commit itself do not contain any non-ASCII characters.
Is there anything else I can do to figure out why Diffusion thinks this is a binary file?
Answers
OK, I was able to reproduce this and trace it through the code.
TL;DR: the two views use different detection modes, but after setting the repository's encoding to the right encoding (ISO-8859-1 in this case), the commit view does show the correct thing (at least as far as my eye can tell).
Oddly enough, the repository's encoding setting doesn't seem to affect the regular browse view - I still get the message "was detected as..." at the top of the file.
There is a bug here though: The Commit view allows you to specify encoding for a one-off view, and then ignores this selection.
Gory details:
The "Browse" view (single file in a specific state) is using DiffusionDocumentRenderingEngine, which picks up all possible "engines" (text, json, pdf, video, hexdump....), and tries them all in some order.
The "Text" renderers check for "is binary" by searching for the NULL char (\0) in the first 1MB of the file - a logic that was copied from git. NULL chars probably don't show up in any of the reasonable text encodings we use. Maybe.
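To make that heuristic concrete, here is a minimal sketch in Python (not the actual Phabricator PHP code). The function name looks_binary and the limit parameter are made up for illustration; only the "NUL byte within the first 1MB" rule comes from the description above.

```python
def looks_binary(data: bytes, limit: int = 1024 * 1024) -> bool:
    """Hypothetical helper: report binary if a NUL byte occurs
    within the first `limit` bytes (1MB by default)."""
    return b"\x00" in data[:limit]


# CP-1252 text has no NUL bytes, so it passes as text.
cp1252_text = "café".encode("cp1252")
print(looks_binary(cp1252_text))

# UTF-16 encodes ASCII characters with a zero byte each,
# so this heuristic would flag UTF-16 text as binary.
utf16_text = "cafe".encode("utf-16-le")
print(looks_binary(utf16_text))
```

Note that this check never looks at the encoding at all, which would explain why a CP-1252 file is accepted as text in the Browse view.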
The "Commit" view (showing changes in a file in a commit) is using a much more primitive rendering engine, because it's doing a 2-up and marking changes.
It gets the files' content (multiple - before and after the commit) by invoking DiffusionDiffQueryConduitAPIMethod, and that is using ArcanistDiffParser (same one used in arc diff).
ArcanistDiffParser detects "is_binary" like this:
- It calls mb_check_encoding($corpus, 'UTF-8'). If the file isn't valid UTF-8 and no backup encoding (the repository encoding) is provided, that's it - the file is binary.
- If a backup encoding is defined, it searches for the NULL char anywhere in the file. If NULL exists in the file, it's binary.
- Otherwise, it tries to convert the file from the backup encoding to UTF-8. If the conversion fails, it throws an exception and there's no diff. If it works, then it's a text file.
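The steps above can be sketched roughly as follows. This is a Python approximation of the flow as I read it, not the real ArcanistDiffParser code: the function name classify, its return shape and the fallback_encoding parameter are all made up, and I'm assuming that content which is already valid UTF-8 is simply treated as text.

```python
from typing import Optional, Tuple


def classify(corpus: bytes,
             fallback_encoding: Optional[str] = None
             ) -> Tuple[bool, Optional[str]]:
    """Hypothetical sketch: return (is_binary, text_or_None)."""
    # Step 1: valid UTF-8 is treated as text (assumption, see above).
    try:
        return False, corpus.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Not UTF-8 and no backup encoding configured: binary.
    if fallback_encoding is None:
        return True, None

    # Step 2: with a backup encoding, a NUL byte anywhere means binary.
    if b"\x00" in corpus:
        return True, None

    # Step 3: convert from the backup encoding. A failure raises
    # UnicodeDecodeError here, standing in for the "exception, no diff" case.
    return False, corpus.decode(fallback_encoding)


old_branch_file = "café".encode("cp1252")

# No repository encoding set: reported as binary.
print(classify(old_branch_file))

# Repository encoding set to CP-1252: decoded as text.
print(classify(old_branch_file, "cp1252"))
```

In this sketch the one-off encoding picked in the Commit view would have to be passed in as fallback_encoding to have any effect, which matches the bug described above: the selection is offered but never reaches the parser.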