
Differential drops a diff's first hunk if the file starts with a Byte-order-Mark
Closed, Resolved, Public

Description

I am creating a Differential revision by manually uploading a diff (which I created by running hg diff --git …). Everything is encoded in UTF-8, and the diff file starts with the UTF-8 byte order mark (BOM), i.e., the bytes EF BB BF. As a result, Differential silently ignores the first hunk of the diff.

I assume that with a BOM, Differential fails to parse the first line of the file, which is the header for the first hunk, and thus just ignores that first hunk.
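To illustrate that assumption, here is a minimal sketch in Python. It is not Phorge's actual parser: `toy_changed_files` is a hypothetical toy that recognizes a file section only when a line starts exactly with `diff --git `. With a BOM in front, the first header no longer matches, so the first file disappears just as described.

```python
import codecs
from typing import List

# The example diff from this report, as raw bytes without a BOM.
DIFF = (
    b"diff --git a/path/to/file1.cs b/path/to/file1.cs\n"
    b"new file mode 100644\n"
    b"--- /dev/null\n"
    b"+++ b/path/to/file1.cs\n"
    b"@@ -0,0 +1,1 @@\n"
    b"+foo1\n"
    b"diff --git a/path/to/file2.cs b/path/to/file2.cs\n"
    b"new file mode 100644\n"
    b"--- /dev/null\n"
    b"+++ b/path/to/file2.cs\n"
    b"@@ -0,0 +1,1 @@\n"
    b"+foo2\n"
)


def toy_changed_files(raw: bytes) -> List[str]:
    """Hypothetical toy parser: list the b/ path of every line that
    starts exactly with 'diff --git '. Lines that do not match are
    silently skipped, which is the failure mode assumed above."""
    files = []
    for line in raw.splitlines():
        if line.startswith(b"diff --git "):
            files.append(line.rsplit(b" b/", 1)[1].decode("utf-8"))
    return files


print(toy_changed_files(DIFF))                    # both files are found
print(toy_changed_files(codecs.BOM_UTF8 + DIFF))  # the first header is missed
```

The only difference between the two calls is the three leading bytes EF BB BF, which make the first line start with the BOM instead of `diff`.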

Example

Take this very simple diff that just adds two files:

diff --git a/path/to/file1.cs b/path/to/file1.cs
new file mode 100644
--- /dev/null
+++ b/path/to/file1.cs
@@ -0,0 +1,1 @@
+foo1
diff --git a/path/to/file2.cs b/path/to/file2.cs
new file mode 100644
--- /dev/null
+++ b/path/to/file2.cs
@@ -0,0 +1,1 @@
+foo2

Note that even though I copy-pasted this from a file with a BOM, the paste here of course does not carry a BOM. Here is a file that contains exactly this diff, with a UTF-8 BOM:

In a hex editor, this file should start like this (just to make sure that Phorge does not re-encode the uploaded file or strip the BOM):

00000000: efbb bf64 6966 6620 2d2d 6769 7420 612f  ...diff --git a/

If I upload this file to Differential, the resulting Differential diff shows file2.cs being added, but there is no trace of file1.cs.
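Until the parser handles this, a possible workaround is to strip the BOM before uploading. The following is a small Python sketch; the helper names `strip_utf8_bom` and `strip_bom_in_file` are made up for illustration and are not part of Phorge or Arcanist.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def strip_bom_in_file(src_path: str, dst_path: str) -> bool:
    """Copy src_path to dst_path without a leading BOM.
    Returns True if a BOM was found and removed."""
    with open(src_path, "rb") as src:
        raw = src.read()
    cleaned = strip_utf8_bom(raw)
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return len(cleaned) != len(raw)
```

Reading and writing in binary mode avoids any re-encoding, so the rest of the diff is passed through byte for byte.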

Event Timeline

avivey renamed this task from "Diffusion drops a diff's first hunk if the file starts with a Byte-order-Mark" to "Differntial drops a diff's first hunk if the file starts with a Byte-order-Mark". Jul 20 2023, 11:39
aklapper renamed this task from "Differntial drops a diff's first hunk if the file starts with a Byte-order-Mark" to "Differential drops a diff's first hunk if the file starts with a Byte-order-Mark". Jan 13 2024, 02:50

Hmm, I’ve used Mercurial and Arcanist/Phab for years at my company and don’t believe we’ve ever run into this. Any idea what’s causing the BOM to be there? We’ll apply a change to handle the UTF-8 BOM, but I am curious what may have caused it to show up. Is your hgrc configured in some way for this, or is there maybe an environment variable involved?