-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom cpu limit duration for the watchdog #7348
Conversation
* Add `--watchdog_latency_limit` flag to allow customizing amount of time osquery is allowed to spend over the cpu utilization limit.
@@ -102,10 +102,20 @@ If this value is >0 then the watchdog level (`--watchdog_level`) for maximum mem | |||
|
|||
`--watchdog_utilization_limit=0` | |||
|
|||
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum sustained CPU utilization is overridden. Use this if you would like to allow the `osqueryd` process to use more than 30% of a thread for more than 9 seconds of wall time. The length of sustained utilization is not independently configurable. | |||
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum sustained CPU utilization is overridden. Use this if you would like to allow the `osqueryd` process to use more than 10% of a thread for more than `--watchdog_latency_limit` seconds of wall time. The length of sustained utilization is configurable with `--watchdog_latency_limit`. | |||
|
|||
This value is a maximum number of CPU cycles counted as the `processes` table's `user_time` and `system_time`. The default is 90, meaning less 90 seconds of cpu time per 3 seconds of wall time is allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line, specifically "default is 90, meaning less 90 seconds of cpu time per 3 seconds of wall time", didn't make a lot of sense to me. I don't know where 90 comes from, or the 3 seconds. I have currently left this unchanged, but I feel like it needs to be corrected / updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@directionless - hey, since you reviewed the rest of this... do you have enough context to answer this? I didn't really understand this part of the implementation, and I think this should be clarified / corrected.
@@ -102,10 +102,20 @@ If this value is >0 then the watchdog level (`--watchdog_level`) for maximum mem | |||
|
|||
`--watchdog_utilization_limit=0` | |||
|
|||
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum sustained CPU utilization is overridden. Use this if you would like to allow the `osqueryd` process to use more than 30% of a thread for more than 9 seconds of wall time. The length of sustained utilization is not independently configurable. | |||
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum sustained CPU utilization is overridden. Use this if you would like to allow the `osqueryd` process to use more than 10% of a thread for more than `--watchdog_latency_limit` seconds of wall time. The length of sustained utilization is configurable with `--watchdog_latency_limit`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be fine, but it also seems a little weird?
If the value isn't conforming (like, if it's under 0) we should throw an error, and not silently ignore it.
And it feels weird to have a command line that defaults to 0, and then have some other default somewhere. Why not have the command line default to what we want? (there might be history behind it, but it feels byzantine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's maybe less weird in context, because there is a different default value depending on --watchdog_level
, so there isn't really a single "default" value.
For example, "normal" CPU utilization limit is 10%, restrictive is 5%, and off is 100%.
Now... maybe the concept of "watchdog level" is outdated and could be replaced by having defaults for each of these values instead? But as is, it's pretty reasonable to have this set to 0 to distinguish between "use the value set by this flag" and "use the value for the configured watchdog level".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that this is weird, but it's consistent with the other flags. The weirdness is not net new. :P
[WIP] - Took a stab at implementing #7212. This functionality had some inconsistencies in the docs (around percentages, times, etc.), so I'm still not incredibly confident that this is correct.
I'd like to get some early feedback / corrections to make from maintainers.
Fixes: #7212