Correct manual upload of Differential patch with a leading BOM
ClosedPublic
Actions

Authored by aklapper on Jan 13 2024, 10:21.

Details

Reviewers

speck

Group Reviewers

O1: Blessed Committers

Maniphest Tasks

T15452: Differential drops a diff's first hunk if the file starts with a Byte-order-Mark

Commits

rARC6c7caf3572f4: Correct manual upload of Differential patch with a leading BOM

Summary

Strip a leading UTF-8 Byte Order Mark to avoid silently dropping the first file in a manually uploaded patch.

This change only strips the UTF-8 BOM as UTF-8 is acceptable input.
(Probably non-existing) handling of any other BOMs as first bytes in a diff remains unchanged.

Closes T15452

Test Plan

Go to /differential/diff/create/ and upload the test case in T15452 via Raw Diff From File.
See two files listed on resulting page /differential/diff/1/ instead of previously only one file.
Optionally, confirm that byte length of $diff is three bytes less now (via strlen($diff)).

Diff Detail

Repository

rARC Arcanist

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

aklapper created this revision.Jan 13 2024, 10:21

Owners added a reviewer: O1: Blessed Committers.Jan 13 2024, 10:21

Herald added subscribers: Cigaryno, Matthew, valerio.bozzolan, tobiaswiese. · View Herald TranscriptJan 13 2024, 10:21

aklapper requested review of this revision.Jan 13 2024, 10:21

Harbormaster completed remote builds in B1020: Diff 1635.Jan 13 2024, 10:21

Is the input to parseDiff guaranteed to be UTF-8 encoded? I don’t have the code on-hand to look up but it’s possible encoding is handled elsewhere — or arc might have a hard requirement for all input to be UTF-8. All mercurial commands run by arcanist are done with a flag to enforce UTF-8.

This seems like a good change I’m mostly wondering if there are contracts about the input this might change. E.g. should we attempt to parse UTF-16 at all? (Pretty sure we should not). Might be worth additional comment here explaining it’s the only BOM we check for because by contract only UTF-8 is acceptable input.

Is the input to parseDiff guaranteed to be UTF-8 encoded?

Probably not but I am not aware of a way to reliably detect random encodings either - I think there's no way for *.diff files to express their character encoding (similar to why CSV is a horrible data exchange format).
US-ASCII's 128 code points and ISO-8859-1's 256 code points have a 1:1 mapping to UTF-8; any other entertaining encodings out there will expose random differences that we cannot do much about.
If encoding was UTF-16 or UTF-32 or something else with a BOM, the code would not strip the BOM and we'd still leave it to PHP and Phorge to decide what they make out of it or not - shrug. :)