[BUG] Inconsistent std::regex_replace results on x64 Linux and aarch64 Android #1911

Closed
zheng-yu-yang opened this issue Aug 3, 2023 · 5 comments

@zheng-yu-yang

Description

The code to reproduce the issue:
https://gist.github.com/zheng-yu-yang/a225cc68350ae828cf68b2591730871c

With NDK r25c/r26b1 x64, the output is:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20
dst: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38

where the 2 trailing spaces (20 20) are removed from the source content.

With NDK r25c/r26b1 aarch64, the output is:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20
dst: e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38

where the 2 trailing spaces (20 20) and the first 3 bytes (e0 b8 81) of the source content are removed.
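For readers comparing the dumps by hand, here is a minimal sketch of a helper that prints bytes in the same layout as the src:/dst: lines, plus a decoder for a single three-byte UTF-8 sequence. Both `hex_dump` and `decode_utf8_3` are hypothetical names introduced for illustration; the gist's actual printing code is not reproduced here and may differ.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format every byte of `s` as two lowercase hex digits
// separated by single spaces, like the src:/dst: lines in this report.
std::string hex_dump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned char>(s[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: decode one three-byte UTF-8 sequence of the form
// 1110xxxx 10xxxxxx 10xxxxxx. It performs no validation, so the caller
// must pass a well-formed sequence.
std::uint32_t decode_utf8_3(unsigned char b0, unsigned char b1, unsigned char b2) {
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12) |
           (static_cast<std::uint32_t>(b1 & 0x3F) << 6) |
           static_cast<std::uint32_t>(b2 & 0x3F);
}
```

Calling `decode_utf8_3(0xe0, 0xb8, 0x81)` yields 0x0E01, so the three missing bytes form one complete non-whitespace character (U+0E01) rather than a partial sequence.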

I did not use a build system; I compiled the source manually into a static ELF binary with one of the following commands:
clang++ regex_test.cpp -o regex_test -static -std=c++11
g++ regex_test.cpp -o regex_test -static -std=c++11

I also tried cross gcc (arm-linux-gnueabihf-g++, 11.4.0) and native gcc (g++, 11.4.0), and the outputs are the same (only trailing spaces were removed).

I ran the test program on a MI MAX 3 (Android 9) and on Ubuntu 22.04 in WSL.

Affected versions

r25, r26

Canary version

No response

Host OS

Linux

Host OS version

Ubuntu 22.04 in WSL

Affected ABIs

arm64-v8a

Build system

Other (specify below)

Other build system

manual build from bash command line

minSdkVersion

30

Device API level

28

@zheng-yu-yang
Author

BTW, this issue is not specific to wchar_t. When I keep the source string as a UTF-8 encoded std::string and modify everything else accordingly, the results are still inconsistent, only in a different way: besides the trailing spaces being removed, the resulting string has some illegal UTF-8 bytes inserted. I can provide the source code to reproduce this if needed.

@rprichard
Collaborator

So far I'm not seeing a difference between arm64 and x86_64 behavior. On both a P and an Sv2 emulator, I see this:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20 
dst: e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 

That does seem wrong though? I reduced it to:

#include <cstdio>
#include <regex>
#include <string>
int main() {
    std::string src = "A";
    std::string dst = std::regex_replace(src, std::regex(""), "x");
    printf("[%s]\n", dst.c_str());
    return 0;
}
// libstdc++ output: [xAx]
// libc++ output:    [xx]

I see some comments in http://eel.is/c++draft/re about a "zero-length match", so I'm guessing these test cases have defined behavior, and maybe there's a libc++ bug here.
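To see where a given library reports zero-length matches, one option is to list the offset and length of every match the iterator produces. The sketch below assumes nothing beyond the standard `<regex>` interface; `match_spans` is a hypothetical helper name, and its output for patterns that can match the empty string may itself differ between standard libraries, so no expected output is given for that case.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect (offset, length) for every match that
// std::sregex_iterator reports when scanning `text` with `re`.
std::vector<std::pair<std::size_t, std::size_t>>
match_spans(const std::string& text, const std::regex& re) {
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::ssub_match& whole = (*it)[0];  // the full match
        spans.emplace_back(static_cast<std::size_t>(whole.first - text.begin()),
                           static_cast<std::size_t>(whole.length()));
    }
    return spans;
}
```

Running `match_spans("A", std::regex(""))` on each toolchain and comparing the lists shows whether the two libraries disagree on the matches themselves or only on how regex_replace copies the text between them.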

@zheng-yu-yang
Author

Prichard,

Thanks for trying this.

Your results are consistent, but they are incorrect. The first 3 bytes (e0 b8 81) should not be removed: the regular expression "^\s*|\s*$" matches only leading or trailing whitespace, and e0 b8 81 is the encoding of the Thai character 'ก'.
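A possible workaround, sketched under the assumption that the wrong output is triggered only by zero-length matches (as the reduced "" pattern suggests), is to use `+` instead of `*` so the pattern can never match the empty string. `trim_ws` is a hypothetical name, and this has not been verified here against the affected libc++ builds.

```cpp
#include <regex>
#include <string>

// Hypothetical workaround: trim leading and trailing whitespace with a
// pattern that cannot match the empty string (`+` instead of `*`).
// Assumption: only zero-length matches trigger the behavior reported above.
std::wstring trim_ws(const std::wstring& s) {
    static const std::wregex edges(L"^\\s+|\\s+$");
    return std::regex_replace(s, edges, L"");
}
```

For inputs with no leading or trailing whitespace the pattern does not match at all and the string is returned unchanged, which is the same end result "^\s*|\s*$" is meant to give.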

@rprichard
Collaborator

I reported the issue to LLVM, llvm/llvm-project#64451.

@DanAlbert DanAlbert added this to Awaiting triage in LLVM via automation Aug 7, 2023
@DanAlbert DanAlbert moved this from Awaiting triage to Needs upstream bug in LLVM Aug 7, 2023
@DanAlbert DanAlbert moved this from Needs upstream bug to Awaiting fix in LLVM Aug 7, 2023
@pirama-arumuga-nainar
Collaborator

The upstream issue was fixed in llvm/llvm-project#94550. Will try to cherry-pick it into the next prebuilt drop for r27.

@DanAlbert DanAlbert moved this from Awaiting fix to Awaiting update in LLVM Jun 14, 2024
@DanAlbert DanAlbert moved this from Awaiting update to Prebuilts submitted in LLVM Jul 16, 2024