Page MenuHomePhorge

Fix PHP 8.0 ValueError calling mb_convert_encoding() with an invalid encoding
ClosedPublic

Authored by aklapper on May 26 2023, 21:44.

Details

Summary

Per https://www.php.net/manual/en/function.mb-convert-encoding.php, as of PHP 8.0.0, a ValueError is thrown if the value of to_encoding or from_encoding is an invalid encoding but a ValueError is not suppressed by the stfu operator ("@").

Origin of the function:

https://secure.phabricator.com/rPHU72ad8fd0f05b0d84f7d8efd7db62ad0b3ba4431f

Premising that Arcanist elevates warnings to exception, now we just try-catch.

Closes T15423

Test Plan

On /diffusion/edit/1/page/encoding/,

  • enter a valid encoding, such as "7bit", successfully changed encoding
  • enter a valid encoding with random capitalization, such as "7biT", successfully changed encoding
  • enter a valid alias encoding, such as "ISO-10646-UCS-2", successfully changed encoding
  • enter a valid alias encoding with random capitalization, such as "isO-10646-uCS-2", successfully changed encoding
  • enter an invalid encoding, such as "whatever", get error message "Repository encoding "whatever" is not valid: String conversion from encoding 'UTF-8' to encoding 'whatever' failed: mb_convert_encoding(): Argument #2 ($to_encoding) must be a valid encoding, "whatever" given"

In any case, no exception is shown anymore.

Diff Detail

Repository
rARC Arcanist
Branch
arcpatch-D25249
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 533
Build 533: arc lint + arc unit

Event Timeline

This makes a lot of sense semantically, thanks

I approve as behalf of myself since probably we can use the strict mode:

in_array($to_encoding, $available_encodings, true)

Reason:

Prior to PHP 8.0.0, a string needle will match an array value of 0 in non-strict mode, and vice versa. That may lead to undesireable results. Similar edge cases exist for other types, as well. If not absolutely certain of the types of values involved, always use the strict flag to avoid unexpected behavior.

Then, another +1 of course

I've done some fuzzy tests:

I have a Very Bad News, and a Good news.


🔶 The Good news:

Here we are just checking the $to_encoding. I think that is 98% OK. Probably to be 100% OK we should also check the $from_encoding but that would slow down things without any clear indication of bad usage around. So I would say: just check "$to", and, on the next reported crash, let's also eventually fix the "from". Premising that usually the "from" should be already assumed as meaningful and usually not user-provided, so I do not expect a crash there, so I do not expect changes.

→ So I would propose, let's keep your current "just check the 'from'" logic.

🔴 Bad news:

PHP is really more evil than thought: the damn mb_list_encodings() just returns few possible encodings. To verify if an encoding exists, we need to also check against $aliases = mb_encoding_aliases($existing).

https://www.php.net/manual/en/function.mb-encoding-aliases.php

So our nightmare is similar to the following one (that also needs strtolower stuff):

function is_existing_encoding($encoding) {
  foreach(mb_list_encodings() as $candidate) {
    // check against the canonical version
    if($candidate === $encoding) {
      return true;
    }

    // also check against any related alias
    $aliases = mb_encoding_aliases($candidate);
    if(in_array($encoding, $aliases, true)) {
      return true;
    }
  }
  return false;
}

With strtolower() extra normalization it becomes something like:

function is_existing_encoding($encoding) {
  $encoding = strtolower($encoding);

  foreach(mb_list_encodings() as $candidate) {
    // check against the canonical version
    if(strtolower($candidate) === $encoding) {
      return true;
    }

    // also check against any related alias
    $aliases = mb_encoding_aliases($candidate);
    foreach($aliases as $alias_candidate) {
      // check against the canonical version
      if(strtolower($alias_candidate) === $encoding) {
        return true;
      }
    }
  }

  return false;
}

This is becoming really overkill. Probably this deserves a dedicated function.

src/utils/utf8.php
675–678

I don't get the old description.

We can detect failures caused by invalid encoding names

But?

I would suggest to keep this in pause until a PHP God enlightens our sinful souls

This revision now requires changes to proceed.May 30 2023, 13:27

Thinking deeply about this, probably we can try a... try catch structure. That would do the job, without introducing terrible giant overhead (like in my terrible code proposal)

aklapper edited the test plan for this revision. (Show Details)

Also handle encoding aliases; Store existing encodings and encoding aliases in array keys so look up is O(1); compare new encoding against existing encodings and encoding aliases both as lower case.

Still get various PHP 8.2 deprecation messages about libxml_disable_entity_loader and mb_encoding_aliases(): Handling Base64 via mbstring is deprecated; use base64_encode/base64_decode instead but that's not to be covered here.

Wrap three lines to be less than 80 characters

Add PHP version check and manually remove deprecated encodings from array to hopefully avoid the unit test failure "Handling Base64 via mbstring is deprecated; use base64_encode/base64_decode instead" though that should only be an issue since PHP 8.2 according to https://php.watch/versions/8.2/mbstring-qprint-base64-uuencode-html-entities-deprecated ? What do I know. :)

Thinking deeply about this, probably we can try a... try catch structure. That would do the job, without introducing terrible giant overhead (like in my terrible code proposal)

Thanks for this change but just to clarify this comment, I was just suggesting something like this:

try {
  $result = mb_convert_encoding($string, $to_encoding, $from_encoding);
} catch(Throwable $e) {
  $message = $e->getMessage();
}

(Untested) or something like that, since the function mb_convert_encoding() seems to already have all the necessary validity checks, and probably we do not need to rewrite that business logic: we can just catch its errors.

I suggest that, also because this is supposed to be a low-level phutil utility and probably should be as fast and simple as possible

Simply pass the Exception from mb_convert_encoding() to the UI when trying to change repo to an invalid encoding (as proposed by valerio.bozzolan)

I suggest that, also because this is supposed to be a low-level phutil utility and probably should be as fast and simple as possible

Indeed, that works and is way more simple.

Uh, yeah, applying this patch I was able to throw our lovely exception:

$ php -a
$ require 'arcanist/support/init/init-script.php';
$ var_dump(phutil_utf8_convert('foo', 'asd', 'lol'));
Uncaught Exception: String conversion from encoding 'lol' to encoding 'asd' failed: mb_convert_encoding(): Unknown encoding "asd" in /var/www/arcanist/src/utils/utf8.php:725

Uh! Very good news! I tried this also in PHP 5.5 using:

$ cd arcanist
$ docker run -it --rm --name my-running-script -v "$PWD":/usr/src/myapp -w /usr/src/myapp php:5.5-cli php -a
$ require 'support/init/init-script.php';
$ var_dump(phutil_utf8_convert('asd', 'lol', 'dsa'));

And SURPRISINGLY this still works in PHP 5.5 and still generates an exception! 💌 I was very skeptical since only PHP >= 8.0 generates a ValueError, and just an E_WARNING before.

But.. it works because Arcanist elevates warnings to exceptions! We hate this because of PHP 8.1 warnings were massive, but, here that is useful! ✨

I'm 99% sure that this works. But let's see if anybody else is OK with this. Just to have more eyes on this.

fix linter that does not report indentation error

speck requested changes to this revision.Jun 13 2023, 00:26

I think this change is removing intentional design. The reason that @ is used instead of try/catch -- the comment on the function is saying if you have a string in e.g. Latin-1 and $to_encoding is set to something different-yet-valid such as Korean (e.g. ISO-2022-KR) -- then the encoding conversion will silently fail, not throwing an exception, but still populating the result with incorrect garbage. The exception only appears to be thrown if either of the to/from encodings are invalid like lol or asd. This is likely a breaking change.

You can repeat this issue with an example in php -a

$test = mb_convert_encoding('😄', 'UTF8', 'LATIN1');

No exception is thrown here, however doing

$test = @mb_convert_encoding('😄', 'UTF8', 'LATIN1');
$message = error_get_last();

Will show that $message['message'] is non-null.

In testing on PHP 8.2 I think the solution is to keep the existing code to use @ and error_get_last() but to try/catch around it in order to capture invalid encodings

This revision now requires changes to proceed.Jun 13 2023, 00:26

Something like this, though I'm not sure if this is actually correct - in PHP8 using @ will still throw a ValueError if passed an invalid encoding. But I'm not sure This @ thing is working as expected in php8, or I don't have a valid test case.

$message = null;
try {
  $result = @mb_convert_encoding($string, $to_encoding, $from_encoding);

  if ($result === false) {
    $message = error_get_last();
    if ($message) {
      $message = idx($message, 'message', pht('Unknown error.'));
    }
  }
} catch (Throwable $ex) {
  $message = $ex->getMessage();
}

if ($message !== null) {
  throw new Exception(...
}
In D25249#8543, @speck wrote:

The reason that @ is used instead of try/catch -- the comment on the function is saying if you have a string in e.g. Latin-1 and $to_encoding is set to something different-yet-valid such as Korean (e.g. ISO-2022-KR) -- then the encoding conversion will silently fail, not throwing an exception, but still populating the result with incorrect garbage.

My interpretation of

This function assumes that the input is in the given source encoding. [...] We can detect failures caused by invalid encoding names, but mb_convert_encoding() fails silently if the encoding name identifies a real encoding but the string is not actually encoded with that encoding.

is slightly different. If $from_encoding doesn't match the actual encoding used for the string, then mb_convert_encoding() will produce garbage, but there will be no way at all to determine that this happened. In particular, using @ will have no impact.

In D25249#8543, @speck wrote:
$test = @mb_convert_encoding('😄', 'UTF8', 'LATIN1');
$message = error_get_last();

Will show that $message['message'] is non-null.

I can't reproduce this with any version of PHP (5.6, 7.4, 8.0, 8.1, or 8.2). With or without the @ (and even if the encodings are swapped), it reports no errors and throws no exceptions, because both encodings are valid.

In D25249#8543, @speck wrote:

In testing on PHP 8.2 I think the solution is to keep the existing code to use @ and error_get_last() but to try/catch around it in order to capture invalid encodings

As far as I can tell, the only reason to keep the @ in addition to the try/catch would be to support both PHP <8 and PHP >=8. Prior to PHP 8, mb_convert_encoding() used to emit E_WARNING for invalid encodings, which can be handled using @ and error_get_last(). Starting from PHP 8, it throws a ValueError for invalid encodings, which must be caught instead. I think that's consistent with this observation:

In D25249#8546, @speck wrote:

in PHP8 using @ will still throw a ValueError if passed an invalid encoding

Despite the API change for mb_convert_encoding() in PHP 8, we might not need to include both versions of error handling:

And SURPRISINGLY this still works in PHP 5.5 and still generates an exception! [...] it works because Arcanist elevates warnings to exceptions!

This suggests that there's actually no need for @/error_get_last() here, even with old PHP. However, does the generated exception still have the "String conversion from encoding" part of the message, or does it only include the "Unknown encoding" part? If the latter, do we care? Is it fine to rely on the elevation of warnings to exceptions to deal with this, or should it still be handled explicitly in case that policy changes?

The use of error_get_last() in this very code has been unhelpful and misleading to me as it led to displaying messages entirely unrelated to invalid encodings (e.g. about libxml_disable_entity_loader having been deprecated).

Thanks - the behavior I was seeing I think mirrors your own but the error I was getting was more directly related which confused me. It seems using @ does not clear the previous error from its use? Bleh.

My guess is this was intended to catch going from one encoding to a more narrow one in which it might error. It couldn’t catch all errors of this type so agreed with the current changes.

Thank you!

This revision is now accepted and ready to land.Jun 19 2023, 23:00