arc lint: "Function utf8_decode() is deprecated" in PHP 8.2
Closed, Resolved, Public

Description

If I run arc lint on PHP 8.2, this exception is thrown:

arc lint
[2023-03-24 13:33:28] EXCEPTION: (RuntimeException) Function utf8_decode() is deprecated at [<arcanist>/src/error/PhutilErrorHandler.php:261]
arcanist(head=arcpatch-D25049_1, ref.master=9e1bb955fac9, ref.arcpatch-D25049_1=d5dc36d4024c)
  #0 PhutilErrorHandler::handleError(integer, string, string, integer) called at [<arcanist>/src/utils/utf8.php:292]
  #1 phutil_utf8_strlen(string) called at [<arcanist>/src/parser/argument/PhutilArgumentSpellingCorrector.php:158]
  #2 PhutilArgumentSpellingCorrector::correctSpelling(string, array) called at [<arcanist>/src/parser/argument/PhutilArgumentParser.php:437]
  #3 PhutilArgumentParser::parseWorkflowsFull(array) called at [<arcanist>/src/runtime/ArcanistRuntime.php:171]
  #4 ArcanistRuntime::executeCore(array) called at [<arcanist>/src/runtime/ArcanistRuntime.php:37]
  #5 ArcanistRuntime::execute(array) called at [<arcanist>/support/init/init-arcanist.php:6]
  #6 require_once(string) called at [<arcanist>/bin/arc:10]

To fix all the occurrences, there is an automatic Rector.php rule named:

Rector\Php82\Rector\FuncCall\Utf8DecodeEncodeToMbConvertEncodingRector

https://github.com/rectorphp/rector/blob/main/docs/rector_rules_overview.md#utf8decodeencodetombconvertencodingrector

https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode - deprecation

https://www.php.net/manual/en/function.utf8-decode - deprecated

https://www.php.net/manual/en/function.mb-convert-encoding.php - available since PHP 4

https://www.php.net/manual/en/function.mb-strlen.php - available since PHP 4
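
For reference, enabling the Rector rule named above might look roughly like the sketch below. The file name rector.php, the ./src path and the closure-style configuration are assumptions about the project layout and the installed Rector version, so check them against the Rector documentation before relying on this.

```php
<?php
// rector.php (hypothetical project layout; adjust the paths)

use Rector\Config\RectorConfig;
use Rector\Php82\Rector\FuncCall\Utf8DecodeEncodeToMbConvertEncodingRector;

return static function (RectorConfig $rectorConfig): void {
  // Assumption: the code to migrate lives under ./src.
  $rectorConfig->paths([__DIR__ . '/src']);

  // Apply only the rule mentioned in this task.
  $rectorConfig->rule(Utf8DecodeEncodeToMbConvertEncodingRector::class);
};
```

Running vendor/bin/rector process --dry-run should then preview the changes without writing them.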

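In short, it seems the equivalent replacement should be the following (it relies on the mbstring internal encoding being UTF-8, which is the default; passing 'UTF-8' as an explicit third argument removes that assumption):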

-utf8_decode($string)
+mb_convert_encoding($string, 'ISO-8859-1')

And this is the exact replacement for counting length:

-strlen(utf8_decode($string))
+mb_strlen($string, 'UTF-8')
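
As a minimal sketch of how the length replacement could be wrapped when the mbstring extension is not guaranteed to be loaded: the helper name example_utf8_strlen() is hypothetical and is not the Arcanist function. The fallback counts every byte that is not a UTF-8 continuation byte, which only matches the character count for valid UTF-8 input.

```php
<?php

/**
 * Hypothetical helper (not part of Arcanist): count the characters of a
 * UTF-8 string without calling the deprecated utf8_decode().
 */
function example_utf8_strlen($string) {
  // Prefer the native multi-byte implementation when mbstring is loaded.
  if (function_exists('mb_strlen')) {
    return mb_strlen($string, 'UTF-8');
  }

  // Fallback: subtract the continuation bytes (10xxxxxx) from the byte
  // length. For valid UTF-8 this equals the number of code points.
  return strlen($string) - preg_match_all('/[\x80-\xBF]/', $string);
}

var_dump(example_utf8_strlen("héllo")); // int(5)
```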

Weird Notes

This is a weird test comparing the different approaches:

<?php
# https://www.php.net/manual/en/function.strlen.php#118484
// the Arabic (Hello) string below is: 59 bytes and 32 characters
$string = "السلام علیکم ورحمة الله وبرکاته!";

# says: 32
var_dump( strlen(utf8_decode($string)) );

# says: 32
var_dump( strlen(mb_convert_encoding($string, 'ISO-8859-1') ) );

# says: 32
var_dump( mb_strlen(mb_convert_encoding($string, 'ISO-8859-1') , 'ISO-8859-1') );

# says: 32 and it's probably the most efficient
var_dump( mb_strlen($string, 'UTF-8') );

# says: 59 (every byte is counted as one character)
var_dump( mb_strlen($string, 'ISO-8859-1') );
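
The results above are easier to read once you see what the conversion actually does to the string. The short sketch below assumes the default mbstring substitute character ("?") has not been reconfigured, and it passes 'UTF-8' explicitly as the source encoding.

```php
<?php

// Arabic letters have no ISO-8859-1 equivalent, so each one is replaced
// by the substitute character, which is "?" unless reconfigured.
$converted = mb_convert_encoding("سلام", 'ISO-8859-1', 'UTF-8');
var_dump($converted); // 4 bytes, one per original character

// "é" does exist in ISO-8859-1, so it becomes the single byte 0xE9.
var_dump(bin2hex(mb_convert_encoding("é", 'ISO-8859-1', 'UTF-8')));

// Treating the UTF-8 bytes as ISO-8859-1 makes every byte one character,
// so this is the same as strlen().
var_dump(mb_strlen("سلام", 'ISO-8859-1') === strlen("سلام"));
```

So strlen() of the converted string only works as a character counter because every character ends up as exactly one byte, whether or not it survived the conversion.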

Event Timeline

valerio.bozzolan triaged this task as High priority.
valerio.bozzolan created this object in space S1 Public.

By the way, Rector.php was able to do this replacement automatically:

-utf8_decode($string)
+mb_convert_encoding($string, 'ISO-8859-1')

But it was not able to do this replacement:

-strlen(utf8_decode($string))
+mb_strlen($string, 'UTF-8')

So, Rector.php originally proposed this version, which is not very optimized:

-strlen(utf8_decode($string))
+strlen(mb_convert_encoding($string, 'ISO-8859-1'))

It looks like upstream just straight-up removed the call to utf8_decode() in the master branch: https://secure.phabricator.com/diffusion/ARC/browse/master/src/utils/utf8.php$290-292

In T15188#9488, @speck wrote:

It looks like upstream just straight-up removed the call to utf8_decode() in the master branch: https://secure.phabricator.com/diffusion/ARC/browse/master/src/utils/utf8.php$290-292

Yeah thanks. As mentioned here:

rARC08dfffd5caf7: Replace function utf8_decode() - deprecated since PHP 8.2

In my opinion the upstream change is just slow as hell, since it does not benefit in any way from the native multi-byte functions. Also, the commit does not give any reason for not using them.
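
A rough way to check a performance claim like this is to time both approaches. In the sketch below, example_pure_php_strlen() is a hypothetical stand-in for a counter that avoids mbstring; it is NOT the upstream implementation. The iteration count is arbitrary, and the timings will vary between machines and PHP versions.

```php
<?php

// Hypothetical pure-PHP counter, used only as a comparison point.
// It is NOT the upstream implementation.
function example_pure_php_strlen($string) {
  $count = 0;
  $length = strlen($string);
  for ($ii = 0; $ii < $length; $ii++) {
    // Count every byte that is not a continuation byte (10xxxxxx).
    if ((ord($string[$ii]) & 0xC0) !== 0x80) {
      $count++;
    }
  }
  return $count;
}

// Time a callable over a fixed number of iterations (arbitrary choice).
function example_time_calls($callable, $string, $iterations = 20000) {
  $start = microtime(true);
  for ($ii = 0; $ii < $iterations; $ii++) {
    $callable($string);
  }
  return microtime(true) - $start;
}

$string = "السلام علیکم ورحمة الله وبرکاته!";

$native = example_time_calls(
  function ($s) { return mb_strlen($s, 'UTF-8'); },
  $string);
$pure = example_time_calls('example_pure_php_strlen', $string);

printf("mb_strlen: %.4fs, pure PHP: %.4fs\n", $native, $pure);
```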