
Fix crash due to interaction between distributed and config plugin #7504

Conversation

@Smjert Smjert commented Mar 15, 2022

Move the distributed plugin RocksDB metadata to a different domain
which is not used or touched by the config plugin.

Also do not acquire the pending queries multiple times just to count
them. Acquire them once, use the obtained vector to count them,
and then use the same vector to proceed with executing the queries.


Smjert commented Mar 15, 2022

This has been reported to us; it was happening rarely, around once a week.

The crash happens here

```cpp
request.id = next.substr(kDistributedQueryPrefix.size());
```

when it's trying to substring the query SQL value, which in the crash case is a nullptr.
The immediate reason is that the queries array is empty and there is no check for whether it is empty.

Looking at the rest of the logic, though, one notices that popRequest is called only if there are pending queries in the database:

```cpp
while (getPendingQueryCount() > 0) {
  auto request = popRequest();
```

Also note that both getPendingQueryCount and popRequest re-scan the database to get all the pending queries.

Due to this, it is possible that something else deletes a key between the getPendingQueryCount and popRequest calls, causing popRequest to incorrectly assume that the pending queries cannot be empty, and therefore to crash.

As for what is deleting the key: it is the Config plugin logic.
Note first of all that the Distributed plugin writes its metadata in the queries domain, with a distributed. prefix.
At the same time, the scheduler uses that domain to store query results and other metadata; every time there is a config refresh from the Config plugin, the Config purge function is called:

```cpp
void Config::purge() {
```

This function goes through all the keys in the queries domain and checks whether each key corresponds to a query that still exists in the schedule. If not, it looks for timestamp metadata recording the last time the query was executed; if the timestamp exists, it checks whether the query has run in the last week, and if it hasn't, it deletes the key and its other metadata.
Also note that if the timestamp metadata is not found, it is automatically added with the current time.
The problem is that distributed metadata keys are also picked up by this logic.

So, gluing it all together, here's what happens:

  1. User enables distributed plugin and config plugin with a config refresh interval
  2. User, from remote, runs a distributed query named "distributed_query". This gets pulled and saved in the RocksDB database in the queries domain with key distributed.distributed_query.
  3. The distributed query doesn't run yet.
  4. A config refresh runs, which calls purge. It finds the distributed.distributed_query key and, since it's not present in any scheduled query list, checks the timestamp metadata to see when the query last ran. Since the timestamp doesn't exist, purge creates one in the database with the current time and doesn't delete the query key.
  5. One week later the stars align and the user runs the same distributed query again, so it gets pulled and saved in the RocksDB database. The distributed plugin logic then stops just after the getPendingQueryCount call, thinking there are queries to run, but a Config refresh runs in between and its purge deletes the entries older than a week, including the distributed ones.
    Finally the distributed plugin wakes up, proceeds to "pop" a request off the now-empty database, and attempts to access queries that aren't there.

I went with changing the domain so that the distributed plugin and the config/scheduler logic do not touch the same keys, because the logic behind the config purge and the scheduler doesn't make sense for the distributed plugin: the values stored there are already somewhat temporary, kept in the database only to keep track of pending queries in case osquery dies or crashes between executions.

@directionless directionless left a comment


Looks pretty simple to me. Don't know if @zwass or @theopolis want to comment

@mike-myers-tob mike-myers-tob merged commit e09e59b into osquery:master Apr 29, 2022
@mike-myers-tob mike-myers-tob deleted the stefano/fix/distributed-config-crash branch April 29, 2022 17:20