Replace function utf8_decode() - deprecated since PHP 8.2
ClosedPublic

Authored by valerio.bozzolan on Mar 24 2023, 14:31.

Details

Summary

The function utf8_decode() was a shortcut to convert strings
from UTF-8 to ISO-8859-1 ("Latin 1").

This function has been deprecated since PHP 8.2 and will be removed
in PHP 9:

https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

As mentioned in the RFC, if $string is a valid UTF-8 string,
this expression could be used to count its code points:

strlen(utf8_decode($string))

It works because every code point becomes exactly one byte in the
output: any unmappable code point is replaced with the single byte '?'.
However, the correct native approach is this one:

mb_strlen($string, 'UTF-8');

Another good approach is this one:

iconv_strlen($string, 'UTF-8')

Note that mb_strlen() was introduced in PHP 4, so there
are no compatibility issues in using it.

Note that the mbstring extension is already listed as required in the
installation documentation, so this should not change anything for anyone.
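
To illustrate the difference between counting bytes and counting code points, here is a minimal sketch. The helper name count_code_points() is hypothetical and is not part of Arcanist; the fallback to iconv_strlen() is only an assumption about how one might cope with a missing mbstring extension, not what this patch does.

```php
<?php

// Hypothetical helper (not part of Arcanist): count the code points
// of a UTF-8 string, preferring mbstring and falling back to iconv.
function count_code_points($string) {
  if (function_exists('mb_strlen')) {
    return mb_strlen($string, 'UTF-8');
  }
  if (function_exists('iconv_strlen')) {
    // iconv_strlen() returns false when the input is not valid UTF-8.
    $length = iconv_strlen($string, 'UTF-8');
    if ($length !== false) {
      return $length;
    }
  }
  throw new Exception('Need mbstring or iconv, and a valid UTF-8 string.');
}

$string = "السلام علیکم ورحمة الله وبرکاته!";

echo strlen($string), "\n";             // number of bytes
echo mb_strlen($string, 'UTF-8'), "\n"; // number of code points
echo count_code_points($string), "\n";  // same as the line above
```

For pure ASCII input the byte count and the code point count coincide; they only diverge once multi-byte characters appear, which is where strlen() alone gives the wrong answer.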

https://we.phorge.it/T15188

https://wiki.php.net/rfc/remove_utf8_decode_and_utf8_encode

https://www.php.net/manual/en/function.utf8-decode

https://www.php.net/manual/en/function.mb-convert-encoding.php

https://github.com/rectorphp/rector/blob/main/docs/rector_rules_overview.md#utf8decodeencodetombconvertencodingrector

Closes T15188

Test Plan
  • I was able to execute "arc lint" on PHP 8.2
  • I was able to execute this "arc diff" on PHP 8.2
  • With this patch you can still run "arc lint" with your local version

Diff Detail

Repository
rARC Arcanist
Branch
master
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 171
Build 171: arc lint + arc unit

Event Timeline

Adopt mb_strlen(), which is optimized to do exactly this when you specify UTF-8 as the encoding.

valerio.bozzolan edited the test plan for this revision.

Interestingly, this modification also brings a performance improvement when calculating the length of many strings.

Example test:

<?php

// the Arabic (Hello) string below is: 59 bytes and 32 characters
$string = "السلام علیکم ورحمة الله وبرکاته!";

$t = 100000;

// new way: count the code points directly
$start_time = microtime(true);
for ($i = 0; $i < $t; $i++) {
        $n = mb_strlen($string, 'UTF-8');
}
$end_time = microtime(true);
echo ($end_time - $start_time) . "\n";

// old way: convert to ISO-8859-1, then count the bytes
$start_time = microtime(true);
for ($i = 0; $i < $t; $i++) {
        $n = strlen(utf8_decode($string));
}
$end_time = microtime(true);
echo ($end_time - $start_time) . "\n";

On my computer I get:

# new way (shorter is better):
0.042040109634399

# old way:
0.060499906539917
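
To put those two numbers in perspective, here is a small sketch that only does arithmetic on the timings reported above. The variable names are illustrative, and the figures come from a single run on one machine, so treat the result as a rough indication.

```php
<?php

// Timings reported above, in seconds, for 100000 iterations each.
$iterations = 100000;
$new_way = 0.042040109634399; // mb_strlen($string, 'UTF-8')
$old_way = 0.060499906539917; // strlen(utf8_decode($string))

// Fraction of time saved by the new way, relative to the old one.
$saved = ($old_way - $new_way) / $old_way;

// Average cost of a single call, in microseconds.
$new_us = $new_way / $iterations * 1e6;
$old_us = $old_way / $iterations * 1e6;

printf("time saved: %.1f%%\n", $saved * 100);
printf("per call: %.2f us (new) vs %.2f us (old)\n", $new_us, $old_us);
```

With these inputs the new way takes roughly 30% less time, which is well under a microsecond of difference per call.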

This means fewer CPU cycles and, for environment lovers, it also means less waste of resources in the long term. Anyway, a single burp from me is probably enough to cancel this environmental benefit. So I will try to restrain myself.

This revision is now accepted and ready to land. Mar 25 2023, 09:47
remote: This push was rejected by Herald push rule H8.
remote:     Change: commit/
remote:       Rule: Guard Arcanist Repo with Blessed Committers
remote:     Reason: Commit is not approved by Blessed Committers
remote: Transcript: https://we.phorge.it/herald/transcript/4967/

https://we.phorge.it/herald/transcript/4967/

There are too many negatives in those conditions. I'm going to need some paper...