Track uptime, and report it as well as PID in stats & telemetry #3730

Open
ckreibich opened this issue May 9, 2024 · 3 comments
Labels
Area: Logging Area: Telemetry Complexity: Modest A cup of tea and an evening (or two) with Zeek. Implementation: Scripts Implementation requires Zeek scripting Type: Enhancement

Comments

@ckreibich
Member

When grepping through stats.log it's unclear when/whether a given node restarted, and inferring it from the numbers themselves is awkward and prone to error. Adding the PID would make this explicit, and also help relate the node to other system-level logging. Tracking and reporting node uptime would help identify lifetime more broadly, since the point of restart may not be available in a given log file.

These could become part of telemetry too, as gauges.

@timwoj, @bbannier, fyi

@timwoj
Contributor

timwoj commented May 17, 2024

This one is pretty simple with the new async counter mode in the new telemetry rework. Add a counter that has a callback method that sets the counter to the uptime, and Prometheus will update it whenever it needs to. Optionally you could set one of the labels on the counter to the PID, if that's something that is actually wanted.
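
A rough sketch of the callback idea above, written in Python rather than against Zeek's telemetry API (whose actual signatures are not shown in this thread). The class `CallbackMetric`, the metric name `example_node_uptime_seconds`, and the `pid` label are all hypothetical and chosen only for illustration. The value is computed when the metric is rendered, which mimics a callback invoked at scrape time.

```python
import os
import time
from typing import Callable, Dict, Optional

# Recorded once at import; stands in for the node's process start.
_START_MONOTONIC = time.monotonic()


class CallbackMetric:
    """Hypothetical metric whose value comes from a callback at render time."""

    def __init__(
        self,
        name: str,
        help_text: str,
        metric_type: str,
        callback: Callable[[], float],
        labels: Optional[Dict[str, str]] = None,
    ) -> None:
        self.name = name
        self.help_text = help_text
        self.metric_type = metric_type
        self.callback = callback
        self.labels = labels or {}

    def render(self) -> str:
        """Return the metric in Prometheus text exposition format."""
        label_str = ""
        if self.labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
            label_str = "{" + pairs + "}"
        # The callback runs here, so the value is fresh on every scrape.
        value = self.callback()
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} {self.metric_type}\n"
            f"{self.name}{label_str} {value}\n"
        )


def uptime_seconds() -> float:
    """Seconds elapsed since this module was loaded (monotonic clock)."""
    return time.monotonic() - _START_MONOTONIC


# Assumption: the PID is attached as a label, as suggested as an option above.
uptime_metric = CallbackMetric(
    name="example_node_uptime_seconds",
    help_text="Seconds since the process started.",
    metric_type="counter",
    callback=uptime_seconds,
    labels={"pid": str(os.getpid())},
)
```

Calling `uptime_metric.render()` returns the three exposition lines for the metric. One thing to weigh with the PID label: every restart produces a new label value and therefore a new time series.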

@JustinAzoff
Contributor

fwiw, it can potentially be more useful to have the metric be the 'start time' vs the 'up time'.

that's how prometheus normally does it for the node_exporter:

# HELP node_boot_time_seconds Node boot time, in unixtime.
# TYPE node_boot_time_seconds gauge
node_boot_time_seconds 1.702939098e+09

If you have that you don't necessarily need the pid, since a change in that time means the node restarted, and you can easily compute the uptime based on it.
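
To illustrate the point about start time, here is a small Python sketch that derives uptime from a start-time sample and flags a restart when the start time changes between scrapes. The function names and the one-second tolerance are assumptions made for this example, not part of any Prometheus or Zeek API.

```python
import time
from typing import Optional


def uptime_from_start(start_time: float, now: Optional[float] = None) -> float:
    """Uptime in seconds, given a start time in Unix seconds."""
    if now is None:
        now = time.time()
    return now - start_time


def restarted(prev_start: float, curr_start: float, tolerance: float = 1.0) -> bool:
    """True if the reported start time moved by more than `tolerance` seconds.

    The tolerance is an assumption to absorb rounding in the reported value.
    """
    return abs(curr_start - prev_start) > tolerance


# Start time taken from the node_exporter sample quoted above.
start = 1.702939098e09
one_hour_later = start + 3600.0
print(uptime_from_start(start, now=one_hour_later))
print(restarted(start, start))
print(restarted(start, one_hour_later))
```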

@bbannier
Contributor

> fwiw, it can potentially be more useful to have the metric be the 'start time' vs the 'up time'.

What is the benefit of boot time over uptime? In my experience I found uptime more useful since it always increases monotonically, so if it does not go up this is helpful information about e.g., the state of the metrics collection pipeline (it effectively acts as a timestamped health check). If uptime ever decreases one knows that there must have been a restart, and can even potentially recover the approximate time via extrapolation (modulo sampling jitter).

Apart from allowing exact recovery of the restart time I cannot really see any benefit of boot time over uptime, so maybe you could expand. Do you want to optimize for compression in the metrics collection backend?

I 100% agree that the pid is not very interesting. If needed I believe one could scrape that via Prometheus system metrics already anyway (as well as process_start_time).
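
The uptime-based reasoning above can be sketched in Python as follows. It is only an illustration under stated assumptions: samples are hypothetical `(scrape timestamp, uptime)` pairs in seconds, a drop in uptime is treated as a restart whose time is extrapolated as `timestamp - uptime`, and an uptime that fails to advance is treated as a stall in the node or the collection pipeline.

```python
from typing import List, Tuple

# (scrape timestamp, reported uptime), both in seconds.
Sample = Tuple[float, float]


def find_restarts(samples: List[Sample]) -> List[float]:
    """Estimate a restart time for every drop in uptime between samples."""
    restarts = []
    for (_, prev_up), (ts, up) in zip(samples, samples[1:]):
        if up < prev_up:
            # The process began roughly `up` seconds before this scrape;
            # the estimate is only as precise as the scrape timing.
            restarts.append(ts - up)
    return restarts


def find_stalls(samples: List[Sample]) -> List[float]:
    """Return scrape timestamps at which uptime did not advance."""
    stalls = []
    for (_, prev_up), (ts, up) in zip(samples, samples[1:]):
        if up == prev_up:
            stalls.append(ts)
    return stalls


# Made-up samples: a restart between t=160 and t=220, a stall at t=340.
samples = [
    (100.0, 50.0),
    (160.0, 110.0),
    (220.0, 20.0),
    (280.0, 80.0),
    (340.0, 80.0),
]
print(find_restarts(samples))  # [200.0]
print(find_stalls(samples))    # [340.0]
```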
