Add `--enable_watchdog_debug` flag and improve error messages #8070
Conversation
```diff
@@ -90,15 +90,13 @@ The level limits are as follows:
 Memory: default 200M, restrictive 100M
 CPU: default 10% (for 12 seconds), restrictive 5% (for 6 seconds)

-The normal level allows for 10 restarts if the limits are violated. The restrictive allows for only 4, then the service will be disabled. For both there is a linear backoff of 5 seconds, doubling each retry.
```
There's no code that enforces this. (Maybe this was the case in the past and this is a residue.)
```diff
 It is better to set the level to disabled (`-1`) rather than disabling the watchdog outright, as the worker/watcher concept is used for extensions auto-loading too.

 The watchdog "profiles" can be overridden for Memory and CPU Utilization.

 `--watchdog_memory_limit=0`

-If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 1G. This memory limit is expressed as a value representing MB.
+If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 10G. This memory limit is expressed as a value representing MB.
```
Seems it's 10G:

osquery/osquery/core/watcher.cpp, lines 64 to 65 in 2e34958:

```cpp
// Maximum MB worker can privately allocate.
{WatchdogLimitType::MEMORY_LIMIT, {200, 100, 10000}},
```
I wonder which is the typo. Not sure it matters
```diff
@@ -131,7 +129,7 @@ By default the watchdog monitors extensions for improper shutdown, but NOT for p

 `--table_delay=0`

-Add a microsecond delay between multiple table calls (when a table is used in a JOIN). A `200` microsecond delay will trade about 20% additional time for a reduced 5% CPU utilization.
+Add a millisecond delay between multiple table calls (when a table is used in a JOIN). A `200` millisecond delay will trade about 20% additional time for a reduced 5% CPU utilization.
```
Seems it's milliseconds:

osquery/osquery/process/process.h, lines 217 to 220 in 2e34958:

```cpp
inline void sleepFor(uint64_t msec) {
  std::chrono::milliseconds mduration(msec);
  std::this_thread::sleep_for(mduration);
}
```
```cpp
uint64_t footprint{0};
uint64_t iv{0};
```
This is just a rename, but what does `iv` mean? From a comment it seems it was the check interval.
Thanks for taking a look and improving things! It generally seems reasonable to me. A naming nit.
@Smjert You did some watchdog work, any comments?
#10292 The query was processing *every* file under `/Applications/`, which makes it super expensive in both CPU usage and memory footprint. This query was the main culprit of triggering worker process kills by the watchdog.

On some runs it triggered CPU usage alerts:

```
7716:W0623 15:38:05.402959 221732864 watcher.cpp:415] osqueryd worker (72976) stopping: Maximum sustainable CPU utilization limit 1200ms exceeded for 12 seconds
```

And on other runs it triggered memory usage alerts:

```
4431 W0626 07:28:50.868021 147312640 watcher.cpp:424] osqueryd worker (21453) stopping: Memory limits exceeded: 214020096 bytes (limit is 200MB)
```

For the above logs I used a custom osqueryd branch to be able to print more information: osquery/osquery#8070

The metrics for the old query were CPU usage of ~4521 ms:

```
435:level=warn ts=2023-06-26T09:58:29.665712Z query=fleet_policy_query_1233 queryTime=4521 memory=12226560 msg="distributed query performance is excessive" hostID=308 platform=darwin
```

With the new query, CPU usage is ~210 ms:

```
23893:level=debug ts=2023-06-26T18:06:08.242456Z query=fleet_policy_query_1233 queryTime=210 msg=stats memory=0 hostID=308 platform=darwin
```

Basically a ~20x improvement.

- [X] Changes file added for user-visible changes in `changes/` or `orbit/changes/`. See [Changes files](https://fleetdm.com/docs/contributing/committing-changes#changes-files) for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
- For Orbit and Fleet Desktop changes:
  - ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows and Linux.~
  - ~[ ] Auto-update manual QA, from released version of component to new version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
```diff
@@ -128,8 +138,25 @@ CLI_FLAG(uint64,

 CLI_FLAG(bool, disable_watchdog, false, "Disable userland watchdog process");

+CLI_FLAG(bool,
+         enable_watchdog_debug,
```
I'd remove `enable`, since I think it's always confusing. But I don't feel strongly about it.
@lucasmrod thanks for this improvement!
A couple of corrections in the docs, but otherwise looks good!
Title changed from "`--enable_watchdog_logging` flag and improve error messages" to "`--enable_watchdog_debug` flag and improve error messages".
See #8069.

With the changes in this PR, the following is logged every check interval (3s) for every process being watched:

The following is a log from `master` when the CPU limit is exceeded:

The following is a log with this PR when the CPU limit is exceeded:

The following is a log from `master` when the memory footprint limit is exceeded:

The following is a log with this PR when the memory footprint limit is exceeded: