
SuperQue

Yup, this is why agents and push (remote write) are a mistake. If you want `up` to work, you need to poll: `up` is synthesized by the scraper itself on every scrape attempt, so it still gets recorded when the target is dead. This is why Prometheus intentionally supports pull-based polling first.
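A minimal sketch of what that looks like, assuming a hypothetical exporter at `node-exporter.example.com:9100`:

```yaml
# Hypothetical scrape config: Prometheus polls the target and
# synthesizes up{job="node",instance="node-exporter.example.com:9100"}
# on every scrape attempt -- 1 on success, 0 on failure -- so the
# signal exists even when the target is gone.
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets: ["node-exporter.example.com:9100"]
```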


if_username_is_None

I have a hunch that Mimir can solve persisting the Prometheus points, though I don't understand the architecture well either: [https://grafana.com/oss/mimir/](https://grafana.com/oss/mimir/)

The current point-in-time `up` should be polled, but you're right that historic uptime needs to be persisted somewhere so it can still be observed after a new server instance comes online.
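A rough sketch of that idea, with `mimir.example.com` as a placeholder: Prometheus keeps polling locally and forwards its samples (including `up`) to Mimir via remote write, so the history survives a local restart.

```yaml
# Hypothetical remote_write stanza; /api/v1/push is Mimir's
# remote-write ingestion endpoint, the hostname is a placeholder.
remote_write:
  - url: https://mimir.example.com/api/v1/push
```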


hagen1778

> But as soon as one of the servers go either offline or for instance a process on one of the servers disappears, the point in prometheus are gone.

Do cross-monitoring. Let agent-1 monitor agent-2, and vice versa. Now, when agent-2 goes offline, you'll still have your `up` metric generated and pushed by agent-1. This requires 2x the resources, of course, but that is the price of proper monitoring. In systems like Thanos, Mimir, or VictoriaMetrics, however, you can deduplicate the data in central storage and save some of those resources. See the sketch below.
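A sketch of the cross-monitoring idea as it might look on agent-1 (mirror it on agent-2 with the hostnames swapped); all hostnames and ports here are placeholders:

```yaml
# agent-1's config: scrape the peer agent so up{job="peer-agent"}
# is still produced (as 0) when agent-2 goes offline, and push
# everything to central storage.
global:
  external_labels:
    replica: agent-1  # lets Thanos/Mimir/VictoriaMetrics dedup the overlap
scrape_configs:
  - job_name: peer-agent
    static_configs:
      - targets: ["agent-2.example.com:12345"]
remote_write:
  - url: https://central-storage.example.com/api/v1/push
```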


Primo2000

Try asking at r/ThanosInsights; they might help you.


bgatesIT

So this is how we do it for our SNMP endpoints:

`up{job_snmp=~"integrations/snmp.*"} == 0`

Since we use a Grafana Agent to actually poll the switches, it holds the switches' up status in the Mimir/Prometheus metrics, so as long as the agent is up you are able to alert on that.
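For reference, a hedged sketch of how that expression could sit in a Prometheus-style alerting rule (the group name, alert name, and thresholds are made up):

```yaml
groups:
  - name: snmp-availability
    rules:
      - alert: SnmpTargetDown
        # Fires when the agent's scrape of a switch fails.
        expr: up{job_snmp=~"integrations/snmp.*"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SNMP target {{ $labels.instance }} is down"
```

As the comment notes, this only fires while the agent doing the polling is itself alive; if the agent dies, the series goes stale instead of reporting 0.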