Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --enable_watchdog_debug flag and improve error messages #8070

Merged

Conversation

lucasmrod
Copy link
Contributor

@lucasmrod lucasmrod commented Jun 23, 2023

See #8069.

With the changes in this PR, the following is logged every check interval (3s) for every process being watched:

[...]
I0623 17:54:02.500684 81305600 watcher.cpp:537] pid: 93084, cpu: 97ms/1200ms, memory: 0.00MB/200MB
I0623 17:54:02.500970 81305600 watcher.cpp:537] pid: 93083, cpu: 0ms/1200ms, memory: 0.03MB/200MB
I0623 17:54:05.503715 81305600 watcher.cpp:537] pid: 93084, cpu: 1ms/1200ms, memory: 0.02MB/200MB
I0623 17:54:05.504016 81305600 watcher.cpp:537] pid: 93083, cpu: 2ms/1200ms, memory: 0.04MB/200MB
[...]

The following is a log from master when the CPU limit is exceeded:

2668:W0623 15:27:06.032490 221732864 watcher.cpp:415] osqueryd worker (72457) stopping:
Maximum sustainable CPU utilization limit: 12

The following is a log with this PR when the CPU limit is exceeded:

2668:W0623 15:27:06.032490 221732864 watcher.cpp:415] osqueryd worker (72457) stopping:
Maximum sustainable CPU utilization limit 1200ms exceeded for 12 seconds

The following is a log from master when the memory footprint limit is exceeded:

18277:W0623 16:21:05.841567 221732864 watcher.cpp:415] osqueryd worker (73620) stopping:
Memory limits exceeded: 212180992

The following is a log with this PR when the memory footprint limit is exceeded:

18277:W0623 16:21:05.841567 221732864 watcher.cpp:415] osqueryd worker (73620) stopping:
Memory limits exceeded: 212180992 bytes (limit is 200MB)

@lucasmrod lucasmrod requested review from a team as code owners June 23, 2023 20:46
@@ -90,15 +90,13 @@ The level limits are as follows:
Memory: default 200M, restrictive 100M
CPU: default 10% (for 12 seconds), restrictive 5% (for 6 seconds)

The normal level allows for 10 restarts if the limits are violated. The restrictive allows for only 4, then the service will be disabled. For both there is a linear backoff of 5 seconds, doubling each retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no code that enforces this. (Maybe this was the case in the past and this is a residue.)

It is better to set the level to disabled (`-1`) rather than disabling the watchdog outright, as the worker/watcher concept is used for extensions auto-loading too.

The watchdog "profiles" can be overridden for Memory and CPU Utilization.

`--watchdog_memory_limit=0`

If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 1G. This memory limit is expressed as a value representing MB.
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 10G. This memory limit is expressed as a value representing MB.
Copy link
Contributor Author

@lucasmrod lucasmrod Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's 10G:

// Maximum MB worker can privately allocate.
{WatchdogLimitType::MEMORY_LIMIT, {200, 100, 10000}},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder which is the typo. Not sure it matters

@@ -131,7 +129,7 @@ By default the watchdog monitors extensions for improper shutdown, but NOT for p

`--table_delay=0`

Add a microsecond delay between multiple table calls (when a table is used in a JOIN). A `200` microsecond delay will trade about 20% additional time for a reduced 5% CPU utilization.
Add a millisecond delay between multiple table calls (when a table is used in a JOIN). A `200` millisecond delay will trade about 20% additional time for a reduced 5% CPU utilization.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems it's milliseconds:

inline void sleepFor(uint64_t msec) {
std::chrono::milliseconds mduration(msec);
std::this_thread::sleep_for(mduration);
}

uint64_t footprint{0};
uint64_t iv{0};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a rename, what does iv mean? From a comment it seems it was the check interval.

Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look and improving things! It generally seems reasonable to me. A naming nit.

@Smjert You did some watchdog work, any comments?

It is better to set the level to disabled (`-1`) rather than disabling the watchdog outright, as the worker/watcher concept is used for extensions auto-loading too.

The watchdog "profiles" can be overridden for Memory and CPU Utilization.

`--watchdog_memory_limit=0`

If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 1G. This memory limit is expressed as a value representing MB.
If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 10G. This memory limit is expressed as a value representing MB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder which is the typo. Not sure it matters

osquery/core/watcher.cpp Outdated Show resolved Hide resolved
osquery/core/watcher.cpp Outdated Show resolved Hide resolved
lucasmrod added a commit to fleetdm/fleet that referenced this pull request Jun 27, 2023
#10292

The query was processing *every* file under `/Applications/`, which
makes it super expensive both in CPU usage and Memory footprint. This
query was the main culprit of triggering worker process kills by the
watchdog.

On some runs it triggered CPU usage alerts:
```
7716:W0623 15:38:05.402959 221732864 watcher.cpp:415] osqueryd worker (72976) stopping:
Maximum sustainable CPU utilization limit 1200ms exceeded for 12 seconds
```
And on other runs it triggered memory usage alerts:
```
4431 W0626 07:28:50.868021 147312640 watcher.cpp:424] osqueryd worker (21453) stopping:
Memory limits exceeded: 214020096 bytes (limit is 200MB)
```

For the above logs I used a custom osqueryd branch to be able to print
more information: osquery/osquery#8070

The metrics for the old query were CPU usage: ~4521 ms
```
435:level=warn ts=2023-06-26T09:58:29.665712Z query=fleet_policy_query_1233 queryTime=4521 memory=12226560 msg="distributed query performance is excessive" hostID=308 platform=darwin
```
With the new query, CPU usage: ~210 ms.
```
23893:level=debug ts=2023-06-26T18:06:08.242456Z query=fleet_policy_query_1233 queryTime=210 msg=stats memory=0 hostID=308 platform=darwin
```
Basically a ~20x improvement.

- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
@@ -128,8 +138,25 @@ CLI_FLAG(uint64,

CLI_FLAG(bool, disable_watchdog, false, "Disable userland watchdog process");

CLI_FLAG(bool,
enable_watchdog_debug,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove enable, since I think it's always confusing. But I don't feel strongly about it

Copy link
Member

@Smjert Smjert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lucasmrod thanks for this improvement!

A couple of corrections in the docs, but otherwise looks good!

docs/wiki/installation/cli-flags.md Outdated Show resolved Hide resolved
docs/wiki/installation/cli-flags.md Outdated Show resolved Hide resolved
@lucasmrod lucasmrod requested a review from Smjert July 14, 2023 14:36
@directionless directionless merged commit c9f5543 into osquery:master Jul 14, 2023
16 checks passed
@lucasmrod lucasmrod changed the title Add --enable_watchdog_logging flag and improve error messages Add --enable_watchdog_debug flag and improve error messages Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants