Page MenuHomePhorge

Correct manual upload of Differential patch with a leading BOM
ClosedPublic

Authored by aklapper on Jan 13 2024, 10:21.

Details

Summary

Strip a leading UTF-8 Byte Order Mark to avoid silently dropping the first file in a manually uploaded patch.

This change only strips the UTF-8 BOM as UTF-8 is acceptable input.
(Probably non-existing) handling of any other BOMs as first bytes in a diff remains unchanged.

Closes T15452

Test Plan

Go to /differential/diff/create/ and upload the test case in T15452 via Raw Diff From File.
See two files listed on resulting page /differential/diff/1/ instead of previously only one file.
Optionally, confirm that byte length of $diff is three bytes less now (via strlen($diff)).

Diff Detail

Repository
rARC Arcanist
Branch
T15452stripBOM (branched from master)
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 1020
Build 1020: arc lint + arc unit

Event Timeline

Is the input to parseDiff guaranteed to be UTF-8 encoded? I don’t have the code on-hand to look up but it’s possible encoding is handled elsewhere — or arc might have a hard requirement for all input to be UTF-8. All mercurial commands run by arcanist are done with a flag to enforce UTF-8.

This seems like a good change I’m mostly wondering if there are contracts about the input this might change. E.g. should we attempt to parse UTF-16 at all? (Pretty sure we should not). Might be worth additional comment here explaining it’s the only BOM we check for because by contract only UTF-8 is acceptable input.

Is the input to parseDiff guaranteed to be UTF-8 encoded?

Probably not but I am not aware of a way to reliably detect random encodings either - I think there's no way for *.diff files to express their character encoding (similar to why CSV is a horrible data exchange format).
US-ASCII's 128 code points and ISO-8859-1's 256 code points have a 1:1 mapping to UTF-8; any other entertaining encodings out there will expose random differences that we cannot do much about.
If encoding was UTF-16 or UTF-32 or something else with a BOM, the code would not strip the BOM and we'd still leave it to PHP and Phorge to decide what they make out of it or not - shrug. :)

Might be worth additional comment here explaining it’s the only BOM we check for because by contract only UTF-8 is acceptable input.

Indeed. For the records, AFAIK the PHP default_charset is UTF-8 since PHP 5.4.0. Phorge requires PHP 5.5 or newer.

Thanks - I did mean as an inline comment in the code but either is good.

This revision is now accepted and ready to land.Jan 13 2024, 19:58