Track uptime, and report it as well as PID in stats & telemetry #3730

ckreibich · 2024-05-09T01:13:51Z

When grepping through stats.log it's unclear when/whether a given node restarted, and inferring it from the numbers themselves is awkward and prone to error. Adding the PID would make this explicit, and also help relate the node to other system-level logging. Tracking and reporting node uptime would help identify lifetime more broadly, since the point of restart may not be available in a given log file.

These could become part of telemetry too, as gauges.

@timwoj, @bbannier, fyi

timwoj · 2024-05-17T01:11:46Z

This one is pretty simple with the new async counter mode in the new telemetry rework. Add a counter that has a callback method that sets the counter to the uptime, and Prometheus will update it whenever it needs to. Optionally you could set one of the labels on the counter to the PID, if that's something that is actually wanted.
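To make the callback idea concrete, here is a minimal standard-library sketch of a metric whose value is computed only when it is collected. It is not Zeek's telemetry API: the `CallbackMetric` class, the `example_uptime_seconds_total` name, and the `pid` label are all hypothetical and only illustrate the pattern.

```python
import os
import time
from typing import Callable, Dict, Optional


class CallbackMetric:
    """Hypothetical metric whose value is computed at collection time."""

    def __init__(self, name: str, help_text: str, kind: str,
                 callback: Callable[[], float],
                 labels: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.help_text = help_text
        self.kind = kind          # "counter" or "gauge"
        self.callback = callback  # invoked on every render()
        self.labels = labels or {}

    def render(self) -> str:
        """Return the metric in Prometheus text exposition format."""
        value = self.callback()
        label_str = ",".join(
            f'{key}="{val}"' for key, val in sorted(self.labels.items())
        )
        sample = f"{self.name}{{{label_str}}}" if label_str else self.name
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} {self.kind}\n"
            f"{sample} {value}\n"
        )


# Assumption: process start is approximated by module load time.
_START = time.monotonic()

uptime_metric = CallbackMetric(
    name="example_uptime_seconds_total",  # hypothetical metric name
    help_text="Seconds since the process started.",
    kind="counter",
    callback=lambda: time.monotonic() - _START,
    labels={"pid": str(os.getpid())},     # optional PID label
)

if __name__ == "__main__":
    print(uptime_metric.render(), end="")
```

Nothing updates the value between scrapes; each call to `render()` asks the callback for the current uptime, which is the behavior described above.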

JustinAzoff · 2024-05-24T12:50:40Z

fwiw, it can potentially be more useful to have the metric be the 'start time' vs the 'up time'.

that's how prometheus normally does it for the node_exporter:

# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.702939098e+09

If you have that you don't necessarily need the pid, since a change in that time means the node restarted, and you can easily compute the uptime based on it.
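As a rough illustration of that point, the sketch below derives uptime from a start-time gauge and flags a restart when the reported start time changes between scrapes. The function names and the tolerance value are assumptions made for this example; the start time is the sample value shown above.

```python
def uptime_seconds(start_time: float, now: float) -> float:
    """Uptime derived from a start-time gauge (both in Unix seconds)."""
    return max(0.0, now - start_time)


def restarted(prev_start: float, curr_start: float,
              tolerance: float = 1e-3) -> bool:
    """A changed start time between two scrapes implies a restart."""
    return abs(curr_start - prev_start) > tolerance


if __name__ == "__main__":
    start = 1.702939098e+09      # value from the sample output above
    scrape_time = start + 3600   # hypothetical scrape one hour later
    print(uptime_seconds(start, scrape_time))
    print(restarted(start, start))          # same process
    print(restarted(start, start + 120.0))  # start time moved: restart
```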

bbannier · 2024-05-25T18:44:54Z

> fwiw, it can potentially be more useful to have the metric be the 'start time' vs the 'up time'.

What is the benefit of boot time over uptime? In my experience uptime is more useful: it increases monotonically, so if it stops going up, that tells you something about e.g. the state of the metrics collection pipeline (it effectively acts as a timestamped health check). If uptime ever decreases, one knows there must have been a restart, and can even recover the approximate restart time via extrapolation (modulo sampling jitter).
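The reasoning above can be sketched as follows, assuming samples arrive as (scrape timestamp, reported uptime) pairs. The names `classify` and `estimate_restarts` are hypothetical, and the restart estimate is only as precise as the scrape timing.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (scrape timestamp, reported uptime), seconds


def classify(prev: Sample, curr: Sample) -> str:
    """Compare two consecutive samples of an uptime metric."""
    if curr[1] < prev[1]:
        return "restart"  # uptime went backwards
    if curr[1] == prev[1]:
        return "stalled"  # uptime did not advance: check the pipeline
    return "ok"


def estimate_restarts(samples: List[Sample]) -> List[float]:
    """Extrapolate restart times as scrape timestamp minus uptime."""
    restarts = []
    for prev, curr in zip(samples, samples[1:]):
        if classify(prev, curr) == "restart":
            restarts.append(curr[0] - curr[1])
    return restarts


if __name__ == "__main__":
    samples = [(100.0, 50.0), (110.0, 60.0), (120.0, 4.0), (130.0, 4.0)]
    for prev, curr in zip(samples, samples[1:]):
        print(curr[0], classify(prev, curr))
    print(estimate_restarts(samples))
```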

Apart from allowing exact recovery of the restart time I cannot really see any benefit of boot time over uptime, so maybe you could expand. Do you want to optimize for compression in the metrics collection backend?

I 100% agree that the pid is not very interesting. If needed, I believe one could already get it from the Prometheus system metrics anyway (along with process_start_time_seconds).

ckreibich added the labels Complexity: Modest (A cup of tea and an evening (or two) with Zeek.), Type: Enhancement, Area: Logging, Implementation: Scripts (Implementation requires Zeek scripting), and Area: Telemetry on May 9, 2024
