Fix a race condition during the shutdown of the worker process #5943

Merged

Conversation

Smjert
Member

@Smjert Smjert commented Oct 27, 2019

When a thread other than the main thread requests a shutdown
through Initializer::requestShutdown, it should not call
waitForShutdown: there is no reason for it to wait, and despite its
name the function does not only wait; it also stops other components
and finally calls exit().

Since the main thread is already inside the waitForShutdown call,
waiting on Dispatcher::joinServices (or inside the shutdown() callable
on Windows), having a secondary thread do the same work, potentially
at the same time, is wrong. Moreover, calling exit() from a secondary
thread is incorrect in most cases.

The waitForShutdown function has been renamed to waitThenShutdown
to better represent what it actually does.

@Smjert Smjert added the bug label Oct 27, 2019
@Smjert
Member Author

Smjert commented Oct 27, 2019

The issue was initially highlighted by running the test_osqueryd.py test, specifically the test_3_daemon_with_watchdog test case.
In the logs one could sometimes see:

92: terminating with uncaught exception of type std::__1::system_error: thread::join failed: Resource deadlock avoided

An example is in this build https://dev.azure.com/trailofbits/osquery/_build/results?buildId=1254

With TSAN's help I found that when the parent process is killed, the worker notices, in a thread other than the main one, that the watchdog is dead and requests a shutdown through requestShutdown().
This function ends up calling waitForShutdown():

void Initializer::requestShutdown(int retcode) {
  if (kExitCode == 0) {
    kExitCode = retcode;
  }
  // Stop thrift services/clients/and their thread pools.
  if (std::this_thread::get_id() != kMainThreadId &&
      FLAGS_enable_signal_handler) {
    raise(SIGUSR1);
  } else {
    // The main thread is requesting a shutdown, meaning in almost every case
    // it is NOT waiting for a shutdown.
    // Exceptions include: tight request / wait in an exception handler or
    // custom signal handling.
    Dispatcher::stopServices();
    waitForShutdown();

which, despite its name, does not only wait: it also shuts down various components.

Moreover since waitForShutdown() calls Dispatcher::joinServices:

void Initializer::waitForShutdown() {
  {
    RecursiveLock lock(shutdown_mutex_);
    if (shutdown_ != nullptr) {
      // Copy the callable, then remove it, prevent callable recursion.
      auto shutdown = shutdown_;
      shutdown_ = nullptr;
      // Call the shutdown callable.
      shutdown();
    }
  }
  // Attempt to be the only place in code where a join is attempted.
  Dispatcher::joinServices();

and since WatcherWatcherRunner is itself a service added to the Dispatcher, the thread will end up joining on itself.

Another issue, beyond the self-join, is that waitForShutdown ends up calling exit() from a secondary thread.
This is a very unclean shutdown that normally leads to problems, so it is always better to have the main thread do it.

Other useful information is that the main thread calls waitForShutdown already, here:

int startDaemon(Initializer& runner) {
  runner.start();
  // Conditionally begin the distributed query service
  auto s = startDistributed();
  if (!s.ok()) {
    VLOG(1) << "Not starting the distributed query service: " << s.toString();
  }
  // Begin the schedule runloop.
  startScheduler();
  osquery::events::init_syscall_tracing();
  // Finally wait for a signal / interrupt to shutdown.
  runner.waitForShutdown();

and waits on Dispatcher::joinServices, so exit() will be called there; the main thread only needs to be woken up by something, and it will then perform the correct shutdown procedure.
This is already done by requestShutdown(), which calls Dispatcher::stopServices().

@Smjert
Member Author

Smjert commented Oct 27, 2019

Here follows the TSAN log, which shows usage and destruction of the same object from two different threads:

WARNING: ThreadSanitizer: data race (pid=6605)
  Write of size 1 at 0x000002f52738 by thread T1:
    #0 pthread_rwlock_destroy /opt/toolchain/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1325 (osqueryd+0xc5a15e)
    #1 glog_internal_namespace_::Mutex::~Mutex() logging.cc:? (osqueryd+0x20ef4bd)
    #2 cxa_at_exit_wrapper(void*) /opt/toolchain/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:393 (osqueryd+0xc91ce3)
    #3 osquery::Initializer::requestShutdown(int) /home/smjert/Development/osquery/src/osquery/core/init.cpp:771 (osqueryd+0x1efc2d7)
    #4 osquery::WatcherWatcherRunner::start() /home/smjert/Development/osquery/src/osquery/core/watcher.cpp:640 (osqueryd+0x1f0f4d4)
    #5 osquery::InternalRunnable::run() /home/smjert/Development/osquery/src/osquery/dispatcher/dispatcher.cpp:46 (osqueryd+0x1e51278)
    #6 _ZNSt3__18__invokeIRMN7osquery16InternalRunnableEFvvERPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS8_DpOS9_ /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/type_traits:4294 (osqueryd+0x1e5263d)
    #7 _ZNSt3__115__apply_functorIMN7osquery16InternalRunnableEFvvENS_5tupleIJPS2_EEEJLm0EENS5_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISA_SB_SC_EE5valueEE4typeERSA_RSB_NS_15__tuple_indicesIJXspT1_EEEEOSC_ /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/functional:2644 (osqueryd+0x1e5259a)
    #8 _ZNSt3__16__bindIMN7osquery16InternalRunnableEFvvEJPS2_EEclIJEEENS_13__bind_returnIS4_NS_5tupleIJS5_EEENS9_IJDpOT_EEEXsr22__is_valid_bind_returnIS4_SA_SE_EE5valueEE4typeESD_ /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/functional:2677 (osqueryd+0x1e5259a)
    #9 _ZNSt3__18__invokeINS_6__bindIMN7osquery16InternalRunnableEFvvEJPS3_EEEJEEEDTclclsr3std3__1E7forwardIT_Efp_Espclsr3std3__1E7forwardIT0_Efp0_EEEOS8_DpOS9_ /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/type_traits:4353 (osqueryd+0x1e524dc)
    #10 void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable*>>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable*>>&, std::__1::__tuple_indices<>) /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/thread:342 (osqueryd+0x1e524dc)
    #11 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable*> > >(void*) /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/thread:352 (osqueryd+0x1e524dc)

  Previous atomic read of size 1 at 0x000002f52738 by main thread:
    #0 pthread_rwlock_rdlock /opt/toolchain/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1334 (osqueryd+0xc5a29e)
    #1 glog_internal_namespace_::Mutex::ReaderLock() logging.cc:? (osqueryd+0x20f6e81)
    #2 google::LogDestination::WaitForSinks(google::LogMessage::LogMessageData*) logging.cc:? (osqueryd+0x20f14e6)
    #3 google::LogMessage::Flush() logging.cc:? (osqueryd+0x20f1398)
    #4 google::LogMessage::~LogMessage() logging.cc:? (osqueryd+0x20f114e)
    #5 osquery::Dispatcher::joinServices() /home/smjert/Development/osquery/src/osquery/dispatcher/dispatcher.cpp:136 (osqueryd+0x1e51cc7)
    #6 osquery::Initializer::waitForShutdown() /home/smjert/Development/osquery/src/osquery/core/init.cpp:741 (osqueryd+0x1efd0a5)
    #7 osquery::startDaemon(osquery::Initializer&) /home/smjert/Development/osquery/src/osquery/main/main.cpp:112 (osqueryd+0xe8efdd)
    #8 osquery::startOsquery(int, char**, std::__1::function<void ()>) /home/smjert/Development/osquery/src/osquery/main/main.cpp:184 (osqueryd+0xe8f46f)
    #9 main /home/smjert/Development/osquery/src/osquery/main/posix/main.cpp:30 (osqueryd+0xe8ead4)

  Location is global '<null>' at 0x000000000000 (osqueryd+0x000002f52738)

  Thread T1 'WatcherWatcherRunner' (tid=6607, running) created by main thread at:
    #0 pthread_create /opt/toolchain/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:975 (osqueryd+0xc57e55)
    #1 std::__1::__libcpp_thread_create(unsigned long*, void* (*)(void*), void*) compaction_job.cc:? (osqueryd+0xf5fdec)
    #2 thread<std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable *>, void> /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/thread:368 (osqueryd+0x1e523f0)
    #3 std::__1::__unique_if<std::__1::thread>::__unique_single std::__1::make_unique<std::__1::thread, std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable*> >(std::__1::__bind<void (osquery::InternalRunnable::*)(), osquery::InternalRunnable*>&&) /usr/local/osquery-toolchain/usr/bin/../include/c++/v1/memory:3132 (osqueryd+0x1e51927)
    #4 osquery::Dispatcher::addService(std::__1::shared_ptr<osquery::InternalRunnable>) /home/smjert/Development/osquery/src/osquery/dispatcher/dispatcher.cpp:74 (osqueryd+0x1e516c2)
    #5 osquery::Initializer::initWorker(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const /home/smjert/Development/osquery/src/osquery/core/init.cpp:559 (osqueryd+0x1efc41a)
    #6 osquery::Initializer::initWorkerWatcher(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) const /home/smjert/Development/osquery/src/osquery/core/init.cpp:565 (osqueryd+0x1efc547)
    #7 osquery::startOsquery(int, char**, std::__1::function<void ()>) /home/smjert/Development/osquery/src/osquery/main/main.cpp:181 (osqueryd+0xe8f44b)
    #8 main /home/smjert/Development/osquery/src/osquery/main/posix/main.cpp:30 (osqueryd+0xe8ead4)

SUMMARY: ThreadSanitizer: data race logging.cc:? in glog_internal_namespace_::Mutex::~Mutex()

Member

@theopolis theopolis left a comment


I don't understand the change in this PR, other than the naming clarification.

    raise(SIGUSR1);
  } else {
    // The main thread is requesting a shutdown, meaning in almost every case
    // it is NOT waiting for a shutdown.
    // Exceptions include: tight request / wait in an exception handler or
    // custom signal handling.
    Dispatcher::stopServices();
    waitForShutdown();

if (current_thread_id == kMainThreadId)
Member


Isn't this check superfluous? The else is only exercised if !(current_thread_id != kMainThreadId)

Member Author

@Smjert Smjert Oct 27, 2019


The previous if doesn't only check the thread id; it also checks whether FLAGS_enable_signal_handler is enabled, which it is not by default.
This means that other threads can end up taking the else path, which is what caused the bug.
This is also shown by the TSAN call stack, where WatcherWatcherRunner enters waitForShutdown through requestShutdown() and then calls exit().

@theopolis theopolis merged commit 62336ee into osquery:master Oct 27, 2019
@Smjert Smjert deleted the stefano/fix/race-condition-on-shutdown branch November 9, 2019 13:18