Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV in phtread_exit after switching from gnustl_static to libc++_static #360

Closed
andreya108 opened this issue Apr 10, 2017 · 22 comments
Closed
Assignees
Milestone

Comments

@andreya108
Copy link

Description

We have a project which is for a long time compiled gnustl_static in c++ code.

Jni tasks are running async in separate thread like this (simplified for readability):

new Thread(this::asyncRunnable).start()
...
void asyncRunnable() {
try { 
   callJni(); 
} catch () {
...
} finally {
...
}
// everything is ok
} // <- sigsegv here

So, everything is fine when all code is linked with gnustl_static, but when I've tried libc++_static then I got SIGSEGV when finishing java thread (actual stacktrace points to releasing threadlocal data in pthread_exit).

Can anyone please help me figuring out what is wrong with libc++?

Environment Details

  • NDK Version: r14b
  • Build sytem: ndk-build
  • Host OS: Ubuntu 16.04
  • Compiler: Clang -std=c++14
  • ABI: arm64-v8a
  • STL: libc++
  • NDK API level: 21
  • Device API level: 25
@enh
Copy link
Contributor

enh commented Apr 10, 2017

do you have a reproduceable example we can actually build?

a stack trace of a real crash?

@andreya108
Copy link
Author

andreya108 commented Apr 10, 2017

Here is stack trace:

signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7c41fc040c
Stack frame #00 pc 0000007c41fc040c  <unknown>
Stack frame #01 pc 0000000000068d44  /system/lib64/libc.so (_Z21pthread_key_clean_allv+116)
Stack frame #02 pc 0000000000068874  /system/lib64/libc.so (pthread_exit+68)
Stack frame #03 pc 0000000000068738  /system/lib64/libc.so (_ZL15__pthread_startPv+212)
Stack frame #04 pc 000000000001da7c  /system/lib64/libc.so (__start_thread+16)

Unfortunately I have no reproducable example yet, there are tons of proprietary code with many dependencies (like boost libs).. The only definite fact that it works normally with gnustl.

Another thing may be important, inside jni call the rest of code is loaded on-demand by dlopen (libcore.so). So we have a pair of dlopen/dlclose inside the call. (preventing keeping in memory huge lib when it is not actually needed)

And std::string& argument is passing through libjni.so -> libcore.so

@andreya108
Copy link
Author

Here is complete stack trace (whithout ndk-trace shortening):

04-10 18:06:30.952 16905-16905/? A/DEBUG: *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
04-10 18:06:30.953 16905-16905/? A/DEBUG: Build fingerprint: 'google/bullhead/bullhead:7.1.2/NPG05D/3635581:user/release-keys'
04-10 18:06:30.953 16905-16905/? A/DEBUG: Revision: 'rev_1.0'
04-10 18:06:30.953 16905-16905/? A/DEBUG: ABI: 'arm64'
04-10 18:06:30.953 16905-16905/? A/DEBUG: pid: 16098, tid: 16365, name: X  >>> com.x.y <<<
04-10 18:06:30.953 16905-16905/? A/DEBUG: signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7c41fc040c
04-10 18:06:30.958 16905-16905/? A/DEBUG: Abort message: ' 
                                          ==========Log Level 5 ========
                                          '
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x0   0000007c415b1cf0  x1   0000000000000000  x2   0000007c4702e380  x3   0000000000000000
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x4   000000000000002e  x5   0000000000000000  x6   0000000000000000  x7   0000000000000001
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x8   0000007c4272e5f0  x9   0000000000000001  x10  0000007c41fc040c  x11  0000000000000001
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x12  0000007c4272da18  x13  0000000000000007  x14  00000000ffffffff  x15  2e8ba2e8ba2e8ba3
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x16  0000007c62932458  x17  0000007c628da464  x18  0000000000ffffeb  x19  0000007c4272e450
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x20  0000000000000004  x21  0000007c62939988  x22  0000000000000002  x23  00000000000001a0
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x24  0000007c4272e4d0  x25  0000000000105000  x26  725df47256723079  x27  0000007c61625854
04-10 18:06:30.958 16905-16905/? A/DEBUG:     x28  0000007c617fe170  x29  0000007c4272e390  x30  0000007c628d9d48
04-10 18:06:30.958 16905-16905/? A/DEBUG:     sp   0000007c4272e360  pc   0000007c41fc040c  pstate 0000000060000000
04-10 18:06:30.966 16905-16905/? A/DEBUG: backtrace:
04-10 18:06:30.966 16905-16905/? A/DEBUG:     #00 pc 0000007c41fc040c  <unknown>
04-10 18:06:30.966 16905-16905/? A/DEBUG:     #01 pc 0000000000068d44  /system/lib64/libc.so (_Z21pthread_key_clean_allv+116)
04-10 18:06:30.966 16905-16905/? A/DEBUG:     #02 pc 0000000000068874  /system/lib64/libc.so (pthread_exit+68)
04-10 18:06:30.966 16905-16905/? A/DEBUG:     #03 pc 0000000000068738  /system/lib64/libc.so (_ZL15__pthread_startPv+212)
04-10 18:06:30.966 16905-16905/? A/DEBUG:     #04 pc 000000000001da7c  /system/lib64/libc.so (__start_thread+16)
@andreya108
Copy link
Author

andreya108 commented Apr 10, 2017

What is the meaning of 'Abort message' here? Looks like it's a string from our logging engine...

@DanAlbert
Copy link
Member

And std::string& argument is passing through libjni.so -> libcore.so

Both libjni and libcore are your libraries, right? You don't mean libjavacore.so from the system? If you're passing STL objects between your code and the system you're gonna have a bad time since there isn't a stable C++ ABI for the platform.

@andreya108
Copy link
Author

Sorry for unclear naming.

Yes, libjni and libcore have nothing to system libraries. And passing STL objects only inside my code, not to the system.

I'll try to make an example project when I have time...

@DanAlbert
Copy link
Member

Thanks, I'll take a look once we have a test case. Unfortunately not much I can do until then.

@andreya108
Copy link
Author

Here is a test case:
https://drive.google.com/open?id=0By1bFhqM0X9PYWlXVUNEUjEzbjA

I've played some time to get the problem within the above test case, but failed to reproduce all the same.

But I've faced very similar problem concerning both libc++ and gnustl.

When I introduced thread_local object, got the same behavior as I described originally.

But original code does not contain any explicit thread_locals (though thread locals might be there since the code is multithreaded and pthread is widely used in boost, mutexes, lockers, semaphores, etc..).

@thomasjammet
Copy link

I am facing the same problems with thread_local variables with both libc++ and gnustl (ndk r15).
It seems that threads and thread_local objects are deleted too late (after static objects destruction?).

@andreya108
Copy link
Author

Unfortunately, it breaks even C++11 support (when thread_local appeared), currently in the middle of 2017...

Is everything clear with the issue and you (NDK Developers) can reproduce it?

@DanAlbert
Copy link
Member

Sorry, haven't had a chance to look back at this yet, but yes, now that we have source there should be something actionable here. I'll let you know if for some reason it doesn't repro.

@DanAlbert DanAlbert self-assigned this Oct 3, 2017
@DanAlbert DanAlbert added this to the r17 milestone Oct 3, 2017
@DanAlbert
Copy link
Member

So, for one, the platform's implementation of __cxa_thread_atexit_impl doesn't seem to be correct. This doesn't work even in AOSP (no NDK):

foo_impl.cpp, creates shared library libfoo_impl.so

#include <stdio.h>

class Foo {
 public:
  ~Foo() {
    fprintf(stderr, "dtor\n");
  }
};

extern "C" void myfunc() {
  thread_local Foo test;
}

foo.cpp, creates executable foo

#include <dlfcn.h>
#include <stdio.h>

int main(int, char**) {
  void* lib = dlopen("libfoo_impl.so", RTLD_LAZY);
  auto myfunc = reinterpret_cast<void (*)()>(dlsym(lib, "myfunc"));
  myfunc();
  dlclose(lib);
  return 0;
}

This doesn't print "dtor" unless I remove the thread_local or the dlclose. The same test case built with the NDK actually segfaults. I haven't nailed down the cause for the difference in behavior yet, but they are running slightly different versions of libc++.

If I modify libc++abi to never use the platform's __cxa_thread_atexit_impl (by default it's a weak symbol, and it falls back to the implementation in libc if running on a new enough platform to have it), this works correctly if we (incorrectly, since we have more than one binary) use libc++_static instead of libc++_shared. With libc++_shared, the destructors don't run until after dlclose. This happens because the destructors are implemented on a destructor for an object that is in libc++, so they don't run until libc++_shared.so is unloaded.

I'm going to file a platform bug for the first issue, and I'll work on a fix for libc++abi. Going to try to hit this for r17 since the plan is for gnustl to be removed in r18.

btw, from my experimentation none of these problems exist if you don't call dlclose. The problem is that the destructors don't get run too late, so the code has already been unmapped by the time they get called. If that's an option for you, that'd be a reasonable workaround.

@DanAlbert
Copy link
Member

Looks like this is broken for gnustl_shared as well, fwiw. Their builtin implementation has the same problem that libc++ does: the destructor is in a static object in libgnustl_shared.so, not in the library that actually has the thread_local object.

@DanAlbert
Copy link
Member

Also, never mind, the platform also crashes. The device I'm using just has a broken adbd that isn't reporting its exit status and was returning before the segfault message was sent to the host.

According to the glibc docs (it seems __cxa_thread_atexit isn't a part of the official C++ ABI yet), DSOs with thread_local destructors registered should not be unloaded on dlclose by the linker. Android does this, which we'll need to get fixed.

Another workaround you could use is adding -Wl,-z,nodelete to your ldflags for any libraries that use thread_local. This will have the same behavior as the default behavior when this is fixed.

@andreya108
Copy link
Author

Thanks for your investigation!
This issue is too global to use thread_local, so we don't do it for now.
Only low-level pthread api is used.
I think we will wait for proper fix rather than using workarounds.

@DanAlbert
Copy link
Member

The fix will only be able to fix devices running P or newer.

Pre-M M-O P+
No workarounds Works for static STL Broken Works
-Wl,-z,nodelete Works for static STL Works Works
No dlclose Works Works Works
@andreya108
Copy link
Author

Thank you for comprehensive explanation!
Is it worth mentioning in known issues?

@DanAlbert
Copy link
Member

Definitely. Thank you for the reminder.

hubot pushed a commit to aosp-mirror/platform_bionic that referenced this issue Oct 26, 2017
Bug: android/ndk#360
Test: N/A
Change-Id: I964a6c9abd1ae65d74c6aee1c842b2f3d24b5556
@dimitry- dimitry- self-assigned this Jan 4, 2018
hubot pushed a commit to aosp-mirror/platform_bionic that referenced this issue Jan 10, 2018
Introduce new flag to mark soinfo as TLS_NODELETE when
there are thread_local dtors associated with dso_handle
belonging to it.

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
@DanAlbert
Copy link
Member

Thanks, @dimitry-!

hubot pushed a commit to aosp-mirror/platform_bionic that referenced this issue Jan 20, 2018
Bug: android/ndk#360
Test: N/A
Change-Id: Ib9807045bd3206fa3cd300ab70ebed93c73a58e4
@gdh1995
Copy link

gdh1995 commented Apr 10, 2018

(Pre-M) (No workarounds) Works for static STL

In my tests, a .so statically linked with tensorflow v1.6 (NDK r15c, gcc, gnustl_static, armeabi-v7a) also breaks down on native Android 5.1.1 system. In fact, it breaks down on both 5.1.1, 6.0.1 and 7.1.1 (all systems are arm64-v8a updated: only 6.0 and 7.1 are arm64-v8a)

Added: on Android6.1.1, -Wl,-z,nodelete in the project above doesn't work.
Maybe other .so were unloaded during my tests.

@DanAlbert
Copy link
Member

@gdh1995: This bug was about libc++, not gnustl. gnustl did not have the issue discussed in the bug, and is no longer supported.

miodragdinic pushed a commit to MIPS/ndk that referenced this issue Apr 17, 2018
Test: None, markdown only
Bug: android/ndk#360
Change-Id: I1b1ecce8f988c9d4949079f0296518c0604e82ec
gdh1995 added a commit to evision-ai/tensorflow that referenced this issue May 21, 2018
NDK doesn't support thread_local well, so on Android it should use __thread instead.

Google Protocol Buffer and other libraries have kept away from thread_local on Android.

In Tensorflow, there's a "thread_local" in code about CUDA, which should be safe enough.

More discussions are on android/ndk#360 .
gdh1995 added a commit to evision-ai/tensorflow that referenced this issue May 22, 2018
NDK doesn't support thread_local variables which require destructors,
so on Android it should use __thread instead.

Observations:
* ProtoBuf and other libraries are not using thread_local on Android.
* In Tensorflow, there's a "thread_local" in code about CUDA,
  which should be safe enough.

More discussions are on android/ndk#360 .
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Test: bionic unit tests
Bug: android/ndk#360
Change-Id: I733fb303379828c0ac83b1b64f4cb85878e88909
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Bug: android/ndk#360
Test: N/A
Change-Id: I964a6c9abd1ae65d74c6aee1c842b2f3d24b5556
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Mark library NODELETE once first pthread_atexit handler is registered.

According to the test runs it seems that glibc loader agrees on the fact
that it is too dangerous to unload a library on pthread_exit(). It only
does it on dlclose if there are no active thread local dtors left
associated with the dso.

(At this point we just have tests to demonstrate how it works)

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Mark library NODELETE once first pthread_atexit handler is registered.

According to the test runs it seems that glibc loader agrees on the fact
that it is too dangerous to unload a library on pthread_exit(). It only
does it on dlclose if there are no active thread local dtors left
associated with the dso.

(At this point we just have tests to demonstrate how it works)

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Mark library NODELETE once first pthread_atexit handler is registered.

According to the test runs it seems that glibc loader agrees on the fact
that it is too dangerous to unload a library on pthread_exit(). It only
does it on dlclose if there are no active thread local dtors left
associated with the dso.

(At this point we just have tests to demonstrate how it works)

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Mark library NODELETE once first pthread_atexit handler is registered.

According to the test runs it seems that glibc loader agrees on the fact
that it is too dangerous to unload a library on pthread_exit(). It only
does it on dlclose if there are no active thread local dtors left
associated with the dso.

(At this point we just have tests to demonstrate how it works)

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Introduce new flag to mark soinfo as TLS_NODELETE when
there are thread_local dtors associated with dso_handle
belonging to it.

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Mark library NODELETE once first pthread_atexit handler is registered.

According to the test runs it seems that glibc loader agrees on the fact
that it is too dangerous to unload a library on pthread_exit(). It only
does it on dlclose if there are no active thread local dtors left
associated with the dso.

(At this point we just have tests to demonstrate how it works)

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
sepehrst pushed a commit to spsforks/android-bionic-libc that referenced this issue Apr 22, 2024
Introduce new flag to mark soinfo as TLS_NODELETE when
there are thread_local dtors associated with dso_handle
belonging to it.

Test: bionic-unit-tests --gtest_filter=dl*
Test: bionic-unit-tests-glibc --gtest_filter=dl*
Bug: android/ndk#360
Change-Id: I724ef89fc899788f95c47e6372c38b3313f18fed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
6 participants