Exception when importing Mercurial repository with non-UTF-8 characters in filenames
Open, Needs Triage, Public

Description

(Note: this was moved to a task from this Ponder question.)

I'm trying to import a (really old…) Mercurial repository in 'observe' mode into a recent Phorge (cloned a week ago). The repository is large (~175k commits), and 60 of those commits fail to import. Not a bad ratio, but I would still like to get the repository into a 'fully imported' state. I know I can manually set it to 'imported', but I'm not sure what the consequences of that are, for example whether I should expect missing files. So ideally I would get those remaining 60 commits to import.

Under /daemon/, I can see PhabricatorRepositoryMercurialCommitChangeParserWorker jobs with high failure counts. This is what the repository tool tells me:

root@ec6149cd0a0b:/var/www/phorge/phorge# ./bin/repository importing R3
R3:f60bb794a270 Change, Publish
R3:5eeae9954ae8 Change, Publish
R3:773043ec1398 Change, Publish
[…]

They are all stuck at Change, Publish. Here is what the phd log says:

Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #13 PhutilDaemon::execute() called at [<phorge>/scripts/daemon/exec/exec_daemon.php:131]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000] [2023-04-24 14:14:05] EXCEPTION: (PhutilProxyException) Error while executing Task ID 644343. {>} (AphrontCharacterSetQueryException) Attempting to construct a query using a non-utf8 string when utf8 is expected. Use the `%B` conversion to escape binary strings data. at [<phorge>/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:418]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000] arcanist(head=master, ref.master=08dfffd5caf7), phorge(head=master, ref.master=b587865ce78a)
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #0 <#2> AphrontBaseMySQLDatabaseConnection::validateUTF8String(string) called at [<phorge>/src/infrastructure/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:12]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #1 <#2> AphrontMySQLiDatabaseConnection::escapeUTF8String(string) called at [<phorge>/src/infrastructure/storage/xsprintf/qsprintf.php:266]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #2 <#2> xsprintf_query(array, string, integer, string, integer) called at [<arcanist>/src/xsprintf/xsprintf.php:82]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #3 <#2> xsprintf(string, array, array) called at [<phorge>/src/infrastructure/storage/xsprintf/PhutilQueryString.php:31]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #4 <#2> PhutilQueryString::__construct(AphrontMySQLiDatabaseConnection, array) called at [<phorge>/src/infrastructure/storage/xsprintf/qsprintf.php:78]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #5 <#2> qsprintf(AphrontMySQLiDatabaseConnection, string, string, string) called at [<phorge>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:69]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #6 <#2> PhabricatorRepositoryCommitChangeParserWorker::lookupOrCreatePaths(array) called at [<phorge>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryMercurialCommitChangeParserWorker.php:255]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #7 <#2> PhabricatorRepositoryMercurialCommitChangeParserWorker::parseCommitChanges(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phorge>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:36]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #8 <#2> PhabricatorRepositoryCommitChangeParserWorker::parseCommit(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phorge>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:72]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #9 <#2> PhabricatorRepositoryCommitParserWorker::doWork() called at [<phorge>/src/infrastructure/daemon/workers/PhabricatorWorker.php:124]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #10 <#2> PhabricatorWorker::executeTask() called at [<phorge>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:160]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #11 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phorge>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #12 PhabricatorTaskmasterDaemon::run() called at [<phorge>/src/infrastructure/daemon/PhutilDaemon.php:219]
Daemon 26 STDE [Mon, 24 Apr 2023 14:14:05 +0000]   #13 PhutilDaemon::execute() called at [<phorge>/scripts/daemon/exec/exec_daemon.php:131]
Daemon 26 FAIL [Mon, 24 Apr 2023 14:14:05 +0000] Process exited with error 255.

To my untrained eye, this looks like it's trying to create a file path in the database and chokes on the encoding of the path. In fact, when I look at the problematic commits in my repository, I see that they contain files with German "Umlaute" ("ü", "ö", "ä", …) in their file names. I did not verify this for all 60 failed commits, only for a couple picked at random. The commits are mostly from 1999 (did I mention the repo is old?). I assume that these file names are encoded in CP-1252, because this was Windows in the year 1999.
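The encoding hypothesis above can be illustrated with a small sketch. It assumes the file names really are CP-1252 (not verified against the repository), and the file name used here is hypothetical. It shows that the same name is valid UTF-8 when encoded as UTF-8 but fails UTF-8 validation when encoded as CP-1252, which is the kind of input the `validateUTF8String` check in the trace rejects.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical file name containing umlauts; not taken from the real repository.
name = "Übersicht_für_Prüfung.txt"

as_cp1252 = name.encode("cp1252")  # how a 1999 Windows client may have stored it (assumption)
as_utf8 = name.encode("utf-8")     # what the database layer expects

print(is_valid_utf8(as_cp1252))  # False
print(is_valid_utf8(as_utf8))    # True
```

If the assumption holds, decoding the stored bytes as CP-1252 recovers the readable name, which is one way to spot-check the remaining failed commits.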

I'm not sure how to reproduce this with a test repository, because I'm honestly not sure how to create a non-UTF-8 file name and force it into Mercurial in 2023. However, I see nothing else special about these changesets.
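One possible way to build such a test repository is sketched below. It is untested against Phorge and rests on several assumptions: a Linux file system that accepts arbitrary bytes in file names, `hg` available on the PATH, and a hypothetical file name and helper functions invented for illustration. The sketch writes a file whose name is CP-1252 bytes and commits it to a fresh Mercurial repository.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical file name, encoded the way a 1999 Windows client may have stored it.
LEGACY_NAME = "Prüfung.txt".encode("cp1252")  # b"Pr\xfcfung.txt", not valid UTF-8


def make_legacy_named_file(directory: bytes) -> bytes:
    """Create a file whose name is raw CP-1252 bytes and return its path."""
    path = os.path.join(directory, LEGACY_NAME)
    with open(path, "wb") as handle:
        handle.write(b"test\n")
    return path


def commit_to_new_hg_repo(directory: bytes) -> None:
    """Initialise a Mercurial repository in `directory` and commit all files."""
    commands = (
        ["init"],
        ["add"],
        ["commit", "-m", "add file with legacy file name", "-u", "test"],
    )
    for args in commands:
        subprocess.run(["hg", *args], cwd=directory, check=True)


def build_test_repo() -> bytes:
    """Create the file in a temp directory; commit only if hg is installed."""
    directory = os.fsencode(tempfile.mkdtemp())
    make_legacy_named_file(directory)
    if shutil.which("hg"):
        commit_to_new_hg_repo(directory)
    return directory
```

Calling `build_test_repo()` returns the directory of the new repository, which could then be observed from a test Phorge install to see whether the same exception appears. Whether Mercurial stores the name exactly as the old repository did is not confirmed here.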

Event Timeline

Could you check whether your install is running with this change? https://secure.phabricator.com/D21676

That change should force Mercurial to encode everything in UTF-8.

Hi @speck,

Our currently running version is git commit 1ba5c8c26095 from 2023-04-20, so that change should already be included. In fact, I can see the --encoding utf8 argument to hg in various logs.

Please tell me if there is anything else I can check.