Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix watchdog not killing unhealthy worker/extension fast enough #7474

Conversation

Smjert
Copy link
Member

@Smjert Smjert commented Feb 3, 2022

Add a new flag "watchdog_forced_shutdown_delay" set to a default of 4 seconds,
which separates the timeouts for a forced shutdown due to a resource
limit being hit by a worker or extension, from the timeouts for a normal shutdown.

This flag can also be set to 0, which skips sending a gracefull kill
and directly goes for a SIGKILL.

This has currently no effect on Windows, since the normal
shutdown would issue a TerminateProcess anyway, which kills a process immediately.

The fix was necessary since the watchdog was using the normal shutdown timeouts to kill
a misbehaving worker or extension; this was not ideal,
especially since the lowest amount of time we could wait for a forced kill was 6 seconds.

@Smjert Smjert added this to the 5.2.2 milestone Feb 3, 2022
@Smjert Smjert requested review from a team as code owners February 3, 2022 11:18
@Smjert Smjert force-pushed the stefano/fix/watchdog-resource-limit-kill branch from 47c38e3 to 7493aea Compare February 3, 2022 11:20
@Smjert
Copy link
Member Author

Smjert commented Feb 3, 2022

Additional information: We kill immediately also because we have to consider that the watchdog has a poll interval of 3 seconds, so specifically for the memory limit that isn't a resource limit over time, it could've been up to 3 seconds late in noticing that the limit was hit.
Couple this with the default alarm timeout, set to 15 seconds, the previous logic would take at worst 14 seconds (3 from the poll, 11 from the stopChild logic) and at best 9 (the alarm timeout can be decreased down to 10 seconds).

Even before the alarm timeout changes, the wait to do a forced kill was the alarm timeout + 1 second, and the default for the alarm timeout was 4 seconds, so with the poll interval it was at worst 8 seconds, at best 5.

@directionless
Copy link
Member

I'm a little hesitant. As discussed on slack, I don't know if this increases the risk of corrupting the local database. It feels like there's a balance between killing an out of control process, and giving the DB a chance to flush.

@directionless directionless modified the milestones: 5.2.2, 5.3.0 Feb 4, 2022
Add a new flag "watchdog_forced_shutdown_delay" set to a default of 4 seconds,
which separates the timeouts for a forced shutdown due to a resource
limit being hit by a worker or extension, from the timeouts for a normal shutdown.

This flag can also be set to 0, which skips sending a gracefull kill
and directly goes for a SIGKILL.

This has currently no effect on Windows, since the normal
shutdown would issue a TerminateProcess anyway, which kills a process immediately.

The fix was necessary since the watchdog was using the normal shutdown timeouts to kill
a misbehaving worker or extension; this was not ideal,
especially since the lowest amount of time we could wait for a forced kill was 6 seconds.
@Smjert Smjert force-pushed the stefano/fix/watchdog-resource-limit-kill branch from 7493aea to 376a5a0 Compare February 24, 2022 16:04
@Smjert
Copy link
Member Author

Smjert commented Feb 24, 2022

I have updated the PR with the ideas discussed in the office hours. There's a flag now which controls the "forced" shutdown path amount of seconds to wait after issuing a graceful shutdown.

directionless
directionless previously approved these changes Mar 2, 2022
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to try

osquery/core/watcher.h Outdated Show resolved Hide resolved
@Smjert
Copy link
Member Author

Smjert commented Mar 2, 2022

Seems reasonable to try

@directionless Thanks for the approval!
And sorry, I've noticed last minute I didn't update the usage description of the force argument for the stopChild function.

@mike-myers-tob mike-myers-tob merged commit 939bcc3 into osquery:master Mar 2, 2022
@Smjert Smjert deleted the stefano/fix/watchdog-resource-limit-kill branch March 3, 2022 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants