-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix watchdog not killing unhealthy worker/extension fast enough #7474
Fix watchdog not killing unhealthy worker/extension fast enough #7474
Conversation
47c38e3
to
7493aea
Compare
Additional information: We kill immediately also because we have to consider that the watchdog has a poll interval of 3 seconds, so specifically for the memory limit that isn't a resource limit over time, it could've been up to 3 seconds late in noticing that the limit was hit. Even before the alarm timeout changes, the wait to do a forced kill was the alarm timeout + 1 second, and the default for the alarm timeout was 4 seconds, so with the poll interval it was at worst 8 seconds, at best 5. |
I'm a little hesitant. As discussed on slack, I don't know if this increases the risk of corrupting the local database. It feels like there's a balance between killing an out of control process, and giving the DB a chance to flush. |
Add a new flag "watchdog_forced_shutdown_delay" set to a default of 4 seconds, which separates the timeouts for a forced shutdown due to a resource limit being hit by a worker or extension, from the timeouts for a normal shutdown. This flag can also be set to 0, which skips sending a gracefull kill and directly goes for a SIGKILL. This has currently no effect on Windows, since the normal shutdown would issue a TerminateProcess anyway, which kills a process immediately. The fix was necessary since the watchdog was using the normal shutdown timeouts to kill a misbehaving worker or extension; this was not ideal, especially since the lowest amount of time we could wait for a forced kill was 6 seconds.
7493aea
to
376a5a0
Compare
I have updated the PR with the ideas discussed in the office hours. There's a flag now which controls the "forced" shutdown path amount of seconds to wait after issuing a graceful shutdown. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to try
@directionless Thanks for the approval! |
Add a new flag "watchdog_forced_shutdown_delay" set to a default of 4 seconds,
which separates the timeouts for a forced shutdown due to a resource
limit being hit by a worker or extension, from the timeouts for a normal shutdown.
This flag can also be set to 0, which skips sending a gracefull kill
and directly goes for a SIGKILL.
This has currently no effect on Windows, since the normal
shutdown would issue a TerminateProcess anyway, which kills a process immediately.
The fix was necessary since the watchdog was using the normal shutdown timeouts to kill
a misbehaving worker or extension; this was not ideal,
especially since the lowest amount of time we could wait for a forced kill was 6 seconds.