Add `--enable_watchdog_debug` flag and improve error messages #8070

lucasmrod · 2023-06-23T20:46:42Z

With the changes in this PR, the following is logged every check interval (3s) for every process being watched:

[...]
I0623 17:54:02.500684 81305600 watcher.cpp:537] pid: 93084, cpu: 97ms/1200ms, memory: 0.00MB/200MB
I0623 17:54:02.500970 81305600 watcher.cpp:537] pid: 93083, cpu: 0ms/1200ms, memory: 0.03MB/200MB
I0623 17:54:05.503715 81305600 watcher.cpp:537] pid: 93084, cpu: 1ms/1200ms, memory: 0.02MB/200MB
I0623 17:54:05.504016 81305600 watcher.cpp:537] pid: 93083, cpu: 2ms/1200ms, memory: 0.04MB/200MB
[...]
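
For readers who want to see how a line of this shape could be produced, here is a minimal sketch. It is not the code added to `watcher.cpp`: the helper name `formatWatchdogDebugLine` and its parameters are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not the code in watcher.cpp: builds one debug line
// for a watched process in the same shape as the log excerpt above.
std::string formatWatchdogDebugLine(int pid,
                                    std::uint64_t cpu_ms,
                                    std::uint64_t cpu_limit_ms,
                                    std::uint64_t footprint_bytes,
                                    std::uint64_t memory_limit_mb) {
  // Assumption: 1 MB is treated as 1024 * 1024 bytes.
  const double footprint_mb =
      static_cast<double>(footprint_bytes) / (1024.0 * 1024.0);
  char buf[160];
  std::snprintf(buf,
                sizeof(buf),
                "pid: %d, cpu: %llums/%llums, memory: %.2fMB/%lluMB",
                pid,
                static_cast<unsigned long long>(cpu_ms),
                static_cast<unsigned long long>(cpu_limit_ms),
                footprint_mb,
                static_cast<unsigned long long>(memory_limit_mb));
  return std::string(buf);
}
```

Calling `formatWatchdogDebugLine(93084, 97, 1200, 0, 200)` yields the text after the glog prefix in the first line of the excerpt.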

The following is a log from master when the CPU limit is exceeded:

2668:W0623 15:27:06.032490 221732864 watcher.cpp:415] osqueryd worker (72457) stopping:
Maximum sustainable CPU utilization limit: 12

The following is a log with this PR when the CPU limit is exceeded:

2668:W0623 15:27:06.032490 221732864 watcher.cpp:415] osqueryd worker (72457) stopping:
Maximum sustainable CPU utilization limit 1200ms exceeded for 12 seconds
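
To make the new wording concrete, the sketch below shows one way a "limit exceeded for N seconds" condition could be tracked across checks. It is an illustration only and not the osquery implementation: `CpuLimitTracker` and its members are hypothetical names, and resetting the counter whenever a check falls under the limit is an assumption. The values 1200ms and 12 seconds come from the log above, and the 3-second interval from the PR description.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tracker, not the osquery implementation.
struct CpuLimitTracker {
  CpuLimitTracker(std::uint64_t limit,
                  std::uint64_t interval,
                  std::uint64_t sustained_limit)
      : limit_ms(limit),
        interval_s(interval),
        sustained_limit_s(sustained_limit),
        sustained_s(0) {}

  // Records one check. Returns true once the limit has been exceeded for
  // sustained_limit_s consecutive seconds.
  bool record(std::uint64_t cpu_ms) {
    if (cpu_ms > limit_ms) {
      sustained_s += interval_s;
    } else {
      // Assumption: a single check under the limit resets the counter.
      sustained_s = 0;
    }
    return sustained_s >= sustained_limit_s;
  }

  // Builds a message in the shape of the new log line.
  std::string message() const {
    return "Maximum sustainable CPU utilization limit " +
           std::to_string(limit_ms) + "ms exceeded for " +
           std::to_string(sustained_limit_s) + " seconds";
  }

  std::uint64_t limit_ms;          // CPU time allowed per check interval
  std::uint64_t interval_s;        // seconds between checks
  std::uint64_t sustained_limit_s; // seconds the limit may be exceeded
  std::uint64_t sustained_s;       // seconds spent over the limit so far
};
```

With `CpuLimitTracker t(1200, 3, 12)`, four consecutive checks above 1200ms are needed before `record` returns true.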

The following is a log from master when the memory footprint limit is exceeded:

18277:W0623 16:21:05.841567 221732864 watcher.cpp:415] osqueryd worker (73620) stopping:
Memory limits exceeded: 212180992

The following is a log with this PR when the memory footprint limit is exceeded:

18277:W0623 16:21:05.841567 221732864 watcher.cpp:415] osqueryd worker (73620) stopping:
Memory limits exceeded: 212180992 bytes (limit is 200MB)
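
As a rough illustration of the improved message, the following sketch compares a footprint in bytes against a limit in MB and builds the same text. The function name `memoryLimitMessage` is hypothetical, and the conversion of 1 MB to 1024 * 1024 bytes is an assumption rather than something confirmed in this thread.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not the code in watcher.cpp. Returns an empty string
// while the footprint is within the limit, otherwise a message in the shape
// of the new log line.
std::string memoryLimitMessage(std::uint64_t footprint_bytes,
                               std::uint64_t limit_mb) {
  // Assumption: 1 MB is treated as 1024 * 1024 bytes.
  const std::uint64_t limit_bytes = limit_mb * 1024ULL * 1024ULL;
  if (footprint_bytes <= limit_bytes) {
    return std::string();
  }
  return "Memory limits exceeded: " + std::to_string(footprint_bytes) +
         " bytes (limit is " + std::to_string(limit_mb) + "MB)";
}
```

Printing both the measured value and the limit, each with its unit, is what lets a reader tell at a glance how far over the limit the worker was.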

lucasmrod · 2023-06-23T20:47:17Z

docs/wiki/installation/cli-flags.md

@@ -90,15 +90,13 @@ The level limits are as follows:
 Memory: default 200M, restrictive 100M
 CPU: default 10% (for 12 seconds), restrictive 5% (for 6 seconds)

-The normal level allows for 10 restarts if the limits are violated. The restrictive allows for only 4, then the service will be disabled. For both there is a linear backoff of 5 seconds, doubling each retry.
-


There's no code that enforces this. (Maybe this was the case in the past and this is a residue.)

lucasmrod · 2023-06-23T20:48:30Z

docs/wiki/installation/cli-flags.md

 It is better to set the level to disabled (`-1`) rather than disabling the watchdog outright, as the worker/watcher concept is used for extensions auto-loading too.

 The watchdog "profiles" can be overridden for Memory and CPU Utilization.

 `--watchdog_memory_limit=0`

-If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 1G. This memory limit is expressed as a value representing MB.
+If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 10G. This memory limit is expressed as a value representing MB.


Seems it's 10G:

osquery/osquery/core/watcher.cpp

Lines 64 to 65 in 2e34958

    // Maximum MB worker can privately allocate.
    {WatchdogLimitType::MEMORY_LIMIT, {200, 100, 10000}},

I wonder which is the typo. Not sure it matters
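
The override rule described in the docs can be sketched as follows. This is a simplified illustration, not osquery's code: `effectiveMemoryLimitMb` is a hypothetical name, the mapping of level `0` to the default profile and `1` to the restrictive profile is an assumption, and the third value in the table (`10000`) is deliberately not modeled because its exact role is not stated in this thread.

```cpp
#include <cstdint>

// Hypothetical helper, not the osquery implementation. Picks the memory
// limit in MB: a positive --watchdog_memory_limit value wins, otherwise the
// profile selected by --watchdog_level applies.
std::uint64_t effectiveMemoryLimitMb(int watchdog_level,
                                     std::uint64_t override_mb) {
  const std::uint64_t kDefaultMb = 200;     // first value in the table
  const std::uint64_t kRestrictiveMb = 100; // second value in the table
  if (override_mb > 0) {
    return override_mb;
  }
  // Assumption: level 1 selects the restrictive profile, anything else the
  // default one.
  return watchdog_level == 1 ? kRestrictiveMb : kDefaultMb;
}
```

For example, `effectiveMemoryLimitMb(0, 512)` returns 512, while `effectiveMemoryLimitMb(1, 0)` falls back to the restrictive 100.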

lucasmrod · 2023-06-23T20:49:38Z

docs/wiki/installation/cli-flags.md

@@ -131,7 +129,7 @@ By default the watchdog monitors extensions for improper shutdown, but NOT for p

 `--table_delay=0`

-Add a microsecond delay between multiple table calls (when a table is used in a JOIN). A `200` microsecond delay will trade about 20% additional time for a reduced 5% CPU utilization.
+Add a millisecond delay between multiple table calls (when a table is used in a JOIN). A `200` millisecond delay will trade about 20% additional time for a reduced 5% CPU utilization.


Seems it's milliseconds:

osquery/osquery/process/process.h

Lines 217 to 220 in 2e34958

    inline void sleepFor(uint64_t msec) {
      std::chrono::milliseconds mduration(msec);
      std::this_thread::sleep_for(mduration);
    }
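
To show why the unit matters, here is a small self-contained sketch that inserts a delay between consecutive calls and measures the total time. Only `sleepFor` mirrors the snippet above. `runWithTableDelay` and its parameters are hypothetical, and the loop stands in for the real table calls, which are not shown in this thread.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Same shape as the helper quoted above: the argument is in milliseconds.
inline void sleepFor(std::uint64_t msec) {
  std::chrono::milliseconds mduration(msec);
  std::this_thread::sleep_for(mduration);
}

// Hypothetical driver: performs `calls` placeholder table calls, sleeping
// `table_delay_ms` between consecutive calls, and returns the elapsed time
// in milliseconds.
std::uint64_t runWithTableDelay(unsigned calls, std::uint64_t table_delay_ms) {
  const auto start = std::chrono::steady_clock::now();
  for (unsigned i = 0; i < calls; ++i) {
    // A real table call would happen here.
    if (table_delay_ms > 0 && i + 1 < calls) {
      sleepFor(table_delay_ms);
    }
  }
  const auto elapsed = std::chrono::steady_clock::now() - start;
  return static_cast<std::uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
}
```

With three calls and a delay of 20, the run takes at least 40 milliseconds. If the value were interpreted as microseconds, the same setting would add only 40 microseconds.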

lucasmrod · 2023-06-23T20:50:35Z

osquery/core/watcher.cpp

  uint64_t footprint{0};
-  uint64_t iv{0};


This is just a rename. What does `iv` mean? From a comment, it seems it was the check interval.

directionless

Thanks for taking a look and improving things! It generally seems reasonable to me. A naming nit.

@Smjert You did some watchdog work, any comments?

directionless · 2023-06-27T06:04:27Z

docs/wiki/installation/cli-flags.md

 It is better to set the level to disabled (`-1`) rather than disabling the watchdog outright, as the worker/watcher concept is used for extensions auto-loading too.

 The watchdog "profiles" can be overridden for Memory and CPU Utilization.

 `--watchdog_memory_limit=0`

-If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 1G. This memory limit is expressed as a value representing MB.
+If this value is >0 then the watchdog level (`--watchdog_level`) for maximum memory is overridden. Use this if you would like to allow the `osqueryd` process to allocate more than 200M, but somewhere less than 10G. This memory limit is expressed as a value representing MB.


I wonder which is the typo. Not sure it matters

osquery/core/watcher.cpp

#10292

The query was processing *every* file under `/Applications/`, which makes it super expensive both in CPU usage and Memory footprint. This query was the main culprit of triggering worker process kills by the watchdog.

On some runs it triggered CPU usage alerts:

```
7716:W0623 15:38:05.402959 221732864 watcher.cpp:415] osqueryd worker (72976) stopping: Maximum sustainable CPU utilization limit 1200ms exceeded for 12 seconds
```

And on other runs it triggered memory usage alerts:

```
4431 W0626 07:28:50.868021 147312640 watcher.cpp:424] osqueryd worker (21453) stopping: Memory limits exceeded: 214020096 bytes (limit is 200MB)
```

For the above logs I used a custom osqueryd branch to be able to print more information: osquery/osquery#8070

The metrics for the old query were CPU usage: ~4521 ms

```
435:level=warn ts=2023-06-26T09:58:29.665712Z query=fleet_policy_query_1233 queryTime=4521 memory=12226560 msg="distributed query performance is excessive" hostID=308 platform=darwin
```

With the new query, CPU usage: ~210 ms.

```
23893:level=debug ts=2023-06-26T18:06:08.242456Z query=fleet_policy_query_1233 queryTime=210 msg=stats memory=0 hostID=308 platform=darwin
```

Basically a ~20x improvement.

- [X] Changes file added for user-visible changes in `changes/` or `orbit/changes/`. See [Changes files](https://fleetdm.com/docs/contributing/committing-changes#changes-files) for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
    - ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows and Linux.~
    - ~[ ] Auto-update manual QA, from released version of component to new version (see [tools/tuf/test](../tools/tuf/test/README.md)).~

directionless · 2023-06-27T21:06:12Z

osquery/core/watcher.cpp

@@ -128,8 +138,25 @@ CLI_FLAG(uint64,

 CLI_FLAG(bool, disable_watchdog, false, "Disable userland watchdog process");

+CLI_FLAG(bool,
+         enable_watchdog_debug,


I'd remove enable, since I think it's always confusing. But I don't feel strongly about it

Smjert

@lucasmrod thanks for this improvement!

A couple of corrections in the docs, but otherwise looks good!

docs/wiki/installation/cli-flags.md

Add enable_watchdog_logging flag and improve error messages

58d5b24

lucasmrod requested review from a team as code owners June 23, 2023 20:46

lucasmrod commented Jun 23, 2023

This was referenced Jun 23, 2023

Load test macOS CIS benchmarks: test macOS device running policies fleetdm/fleet#10292

Closed

Optimize macOS CIS query 5.1.5 fleetdm/fleet#12506

Merged

directionless reviewed Jun 27, 2023

Rename flag to enable_watchdog_debug

caa3075

lucasmrod mentioned this pull request Jun 27, 2023

Watchdog docs for CPU usage limits are self-contradictory and also disagree with actual behavior #4300

Closed

directionless reviewed Jun 27, 2023

directionless added this to the 5.10.0 milestone Jun 27, 2023

mike-myers-tob added user experience logging performance labels Jun 28, 2023

lucasmrod added 3 commits June 29, 2023 10:25

Update watcher test

931fd33

Fix formatting

686cb72

Explain --watchdog_utilization_limit

84e52b7

lucasmrod requested a review from directionless July 5, 2023 18:20

Smjert requested changes Jul 10, 2023

Amend docs in cli-flags.md

44e0ddc

lucasmrod requested a review from Smjert July 14, 2023 14:36

Smjert approved these changes Jul 14, 2023

directionless merged commit c9f5543 into osquery:master Jul 14, 2023
16 checks passed

lucasmrod mentioned this pull request Jul 17, 2023

Add --enable_watchdog_debug flag to provide more information around watchdog monitoring #8069

Closed

lucasmrod changed the title ~~Add --enable_watchdog_logging flag and improve error messages~~ Add --enable_watchdog_debug flag and improve error messages Oct 23, 2023
