
Fix crash due to interaction between distributed and config plugin #7504

Conversation

@Smjert Smjert commented Mar 15, 2022

Move the distributed plugin RocksDB metadata to a different domain
which is not used or touched by the config plugin.

Also do not acquire the pending queries multiple times just to count
them. Acquire them once, use the obtained vector to count them,
and then use the same vector to proceed with executing the queries.


Smjert commented Mar 15, 2022

This has been reported to us; it was happening rarely, around once a week.

The crash happens here

```cpp
request.id = next.substr(kDistributedQueryPrefix.size());
```

when it's trying to substring the query SQL value, which in the crash case is a nullptr.
The immediate reason is that the queries array is empty and there is no check for whether it is empty.

Looking at the rest of the logic, though, one notices that popRequest is called only if there are pending queries in the database:

```cpp
while (getPendingQueryCount() > 0) {
  auto request = popRequest();
```

Also note that both getPendingQueryCount and popRequest re-scan the database to get all the pending queries.

Due to this, it is possible that something else deletes a key between the getPendingQueryCount and popRequest calls, causing popRequest to incorrectly assume that the pending queries cannot be empty, and therefore to crash.

As for what is deleting the key: it is the Config plugin logic.
Note first of all that the Distributed plugin writes its metadata in the queries domain, with a distributed. prefix.
At the same time, the scheduler uses that domain to store query results and other metadata; every time there is a config refresh from the Config plugin, the Config purge function is called:

```cpp
void Config::purge() {
```

This function goes through all the keys in the queries domain and checks whether each key corresponds to a query that still exists in the schedule. If not, it looks for timestamp metadata recording the last time the query was executed; if the timestamp exists, it checks whether the query has run in the last week, and if it hasn't, it deletes the key and its other metadata.
Also note that if the timestamp metadata is not found, it is automatically added with the current time.
The problem is that distributed metadata keys are also picked up by this logic.

So, gluing it all together, here's what happens:

  1. User enables distributed plugin and config plugin with a config refresh interval
  2. User, from remote, runs a distributed query named "distributed_query". This gets pulled and saved in the RocksDB database in the queries domain with key distributed.distributed_query.
  3. The distributed query doesn't run yet.
  4. A config refresh runs, which calls purge. It finds the distributed.distributed_query key and, since it's not present in any scheduled query list, checks the timestamp metadata to see when the query last ran. Since the timestamp doesn't exist, purge creates one in the database with the current time and doesn't delete the query key.
  5. One week later the stars align and the user runs the same distributed query again, so it gets pulled and saved in the RocksDB database. The distributed plugin logic then stops just after the getPendingQueryCount call, thinking there are queries to run, but a Config refresh runs in between and its purge deletes the entries older than a week, including the distributed ones.
    Finally the distributed plugin wakes up, proceeds to "pop" a request off the now-empty database, and attempts to access queries that aren't there.

I went with changing the domain so that the distributed plugin and the config/scheduler logic do not touch the same keys, because the logic behind the config purge and the scheduler doesn't make sense for the distributed plugin: the values stored there are already somewhat temporary, kept in the database only to keep track of pending queries in case osquery dies or crashes between executions.

@directionless directionless left a comment


Looks pretty simple to me. Don't know if @zwass or @theopolis want to comment

@mike-myers-tob mike-myers-tob merged commit e09e59b into osquery:master Apr 29, 2022
@mike-myers-tob mike-myers-tob deleted the stefano/fix/distributed-config-crash branch April 29, 2022 17:20