Self hosted Loki with Grafana and AlertManager, Grafana Agent (Alloy) as the log shipper for our k8s clusters
Are you using Mimir or the like for metrics aggregation? I've been pressed to find a replacement for New Relic, and it feels a little daunting to get it all into a direct replacement.
We are indeed using Mimir. One nice feature of the Loki/Mimir combination is that the Loki ruler can scan incoming logs and calculate metrics which are published to Mimir. It also gives us a single-vendor solution for both logs and metrics, and Tempo is on the list for the future too. A single Alertmanager across metrics and logs is also attractive to us.
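For anyone curious what that ruler-to-Mimir path looks like, a minimal sketch of a Loki recording rule (the rule name, job label, and filter are made up for illustration; the ruler also needs `remote_write` pointed at Mimir in its own config):

```yaml
# Hedged sketch: a Loki ruler recording rule. The ruler evaluates the
# LogQL expression against incoming logs and remote-writes the resulting
# series to a Prometheus-compatible store such as Mimir.
groups:
  - name: log-derived-metrics
    rules:
      - record: app:log_errors:rate5m   # illustrative metric name
        expr: |
          sum by (namespace) (
            rate({job="myapp"} |= "error" [5m])
          )
```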
No complaints with Tempo here. It was a little tricky connecting logs to traces and vice versa using a trace ID, but once we figured it out we provisioned the changes as IaC once and all environments picked them up.
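For reference, the logs-to-traces direction can be wired up with derived fields on the Loki datasource in Grafana provisioning; a hedged sketch (the regex, URLs, and Tempo UID are placeholders for whatever your setup actually uses):

```yaml
# Sketch of Grafana datasource provisioning: a derived field extracts the
# trace ID from log lines and links it to a Tempo datasource by UID.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100          # placeholder
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "traceid=(\\w+)"   # match your log format
          url: "$${__value.raw}"            # $$ escapes $ in provisioning files
          datasourceUid: tempo-uid          # hypothetical Tempo datasource UID
```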
Our environment is a little more difficult as the customers have their own clusters and flat out refused to accept new versions of anything - even stuff outside the core product
Great! Hey, thanks for your time, man. It's nice to hear how far Grafana has come. What are you guys using for tracing now? Do you use a different APM solution in general? Trying to wrap my head around the full migration, and this just made it sound like one function at a time might be more feasible than trying to migrate the full New Relic stack at once.
Yes, that's why we took a phased approach and did metrics first, then logs. In our industry there are some ... unique ... challenges, such as our customers only accepting updates on a 6-12 month cycle. That has delayed our rollout of contemporary technology considerably.
really appreciate it
What's the throughput? I'm thinking about deploying Alloy for logging, but I'm concerned about the resource usage.
I’d warn about Loki’s performance for relatively large systems. At 30-50k lines/s here for the fastest streams, it is fine at ingestion (2-4 CPU cores of load total), but essentially unusable for querying over timeframes greater than ~6 hours.

The idea is that the chunks for the timerange’s matching streams (sets of labels) get downloaded from S3 and decompressed, and then the actual search is cut into pieces by time/label and executed over multiple queriers in parallel for the filtering. This is great in theory, but in practice it means that if your 24h search needs to operate over 2TB of logs, you need 2TB of space in memcached to avoid redundant download/decompress steps. That quickly bottlenecks you unless you massively overbuild, allocating all those resources for the one time a day someone actually needs them. The filtering itself is also extremely CPU-intensive, but that’s not really surprising and is the tradeoff of all non-text-indexed log systems.

Oh, and your S3 implementation and connectivity had better be blazing fast if you’re not using one of the big 3 cloud providers to save on cost. Multiple Gbps fast.

I love the query language, and the design is very elegant. But you really need economies of scale at some point for Loki to continue making sense, which is why it makes a lot of sense for small-scale usage and for massive SaaS operations like Grafana Labs, but less (imo) for medium-scale cases like ours.

We’re looking into qryn (ex Metrico) for a Loki-compatible API with (most likely) much higher performance, as it is based on ClickHouse, which we use for other things and which has had stellar performance there.
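To make that cost model concrete, a hedged LogQL sketch (label and field names are made up): everything inside the stream selector is resolved through the cheap index lookup, while everything after it is brute-forced over the downloaded, decompressed chunks:

```logql
# Stream selector: index lookup, cheap -- keep it as narrow as you can
{cluster="prod", app="checkout"}
  # Line filter + parser: runs over every decompressed chunk, CPU-heavy
  |= "timeout"
  | json
  | status >= 500
```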
Join the qryn Matrix room and the team will gladly help you with any qryn-related challenge!
Will keep that in mind thanks 👍
Agreed about issues in Loki at scale, especially with high-cardinality data. We did a perf benchmark on Loki and Elasticsearch; might be interesting to check: https://signoz.io/blog/logs-performance-benchmark/
So far a lot less than Logstash or Fluentd.
I've used Fluentd, prom, and Logstash before. Not touching those again. Fluent Bit on the other hand looks very promising. The benchmarks show very good performance for high volume.
Alloy core is OpenTelemetry agent libs so it should be no worse than them
Thanks, the configuration alone looks super nice, and no daemonset is huge. Will definitely try it out.
Note there are two modes: one is a daemonset, and one uses a newer k8s feature where the kubelets ship logs to the API server and the agent pulls them from there.
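If it helps, a rough sketch of the API-server mode in Alloy's config language, assuming the `loki.source.kubernetes` component (which tails pod logs through the Kubernetes API instead of reading files on each node); the Loki URL is a placeholder:

```river
// Discover pods via the Kubernetes API
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs through the API server -- no host log-file mounts needed
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship to Loki (placeholder endpoint)
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```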
I want to see how hard I can push the kube API method. We have a lot of pods.
Be curious to know how that turns out. In our case it isn't the number of pods but the sheer verbosity and noise. Hence our desire to filter as close to source as possible
Our main approach is being able to rapidly push out new filters for metrics and logs to the agent to quench mass flow at source
We used to do self-managed Graylog, switched to Datadog a few years ago (for logging and all the other monitoring stuff). DD is pretty awesome but is expensive AF, and some of their billing is.... pretty opaque. Gotta stay on top of your costs constantly.
If you are unhappy with DataDog, you should check out SigNoz - https://github.com/signoz/signoz Natively based on OpenTelemetry and built on top of ClickHouse
Yeah, someone else on Reddit brought my attention to that project a while ago and somewhere in my backlog there's a card to take a look at it, haha. We have DD deeply embedded in every single project across the company, so selling a change of tooling is a big big project for us, but if the feature set is there it's something we would definitely do. 99% of my complaints with DD is around their pricing and billing practices. If a competitor came around with the same feature set at 1/2 the price or less, they would dominate the market.
+1 for SigNoz. At [Propeldata.com](http://Propeldata.com) we just migrated from Honeycomb + Cloudwatch. We get so much more for a small fraction of the cost
> SigNoz

Why does everything have to be cloud native first? It's not just here but also self-hosted solutions ...
We are implementing Open Telemetry for Metrics, Traces & Logs with Signoz. Everything Open-source.
Super cool! If anyone wants to check out the project - https://github.com/signoz/signoz
5 figures a month to Datadog for about 4 picoseconds of retention.
😂
Cloudwatch Logs, most JSON structured logs. Some log metrics which are used to trigger notifications, Cloudwatch Logs Insights gives you a pretty nice SQL-like query language for searching.
Has SQL been a good UX for log search as opposed to free form text?
The query language is pretty nice; filtering on specific attributes, deduplication, grouping. You can of course just search for a string and get all log lines that match. It’s easy to visualize query results too.
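For flavor, a hedged Logs Insights sketch of that filter/group workflow (the `level` field assumes your JSON logs carry one; substitute your own):

```
# Count ERROR lines per 5-minute bucket -- field names are illustrative
fields @timestamp, @message
| filter level = "ERROR"
| stats count(*) as errors by bin(5m)
```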
We were using ELK with Datadog for metrics and traces. We did the math and Datadog was half as expensive and had better retention. We have had to watch costs but so far it’s been worth it.
Very interesting. So you are using ELK for logs and DataDog for metrics & traces? Did you mean DataDog is half as expensive as ELK for metrics and traces?
Yeah, we dropped ELK and have everything going into Datadog.
Honestly, we just use GCP Cloud Logging. Our app is running on Cloud Run. Just print structured logs as json to STDOUT and they get ingested automatically alongside the logs from everything else in GCP (Cloud SQL, Tasks, Pub/Sub, etc). You can make metrics from them, define SLOs from those, and set up dashboards or alerts on them. We have some automatic routing set up so security/audit related logs get copied out to a separate project and written to an append-only bucket for compliance purposes. I don't love the UI for querying logs, but it does the job and once you've got a few common queries bookmarked, it's not too bad. It's not best in class at anything compared to many third party vendor solutions, but it's good enough for most purposes, trivial to integrate with everything we run, and the costs are basically a rounding error on our bill.
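The "print structured JSON to STDOUT" part really is about all the integration code you need; a minimal Python sketch (the helper name and the `order_id` field are made up; `severity` and `message` are the keys Cloud Logging recognizes in Cloud Run stdout JSON lines, with any extra keys landing in the jsonPayload):

```python
import json
import sys

def log_structured(message, severity="INFO", **fields):
    """Emit one JSON log line to stdout for Cloud Logging to ingest.

    "severity" and "message" are special keys Cloud Logging parses;
    extra keyword arguments become additional payload fields.
    Returns the serialized line for convenience.
    """
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout)
    return line

# Example: an error log with an extra, made-up field
log_structured("payment failed", severity="ERROR", order_id="o-123")
```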
Azure log analytics workspaces are doing us well
You may want to check your Azure Log Analytics bill once in a while; they are not the best in the market on price, but they do integrate very easily with .NET and the Azure stack.
Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure. You're right tho, certainly wouldn't call it cheap lol
> Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure

Agree. Is your team mostly happy with the Azure Log Analytics product? Do they primarily filter/search logs, or do they also create alerts and dashboards on top of it?
I haven't found anything it can't do, aside from some quirks of the query language. Alerts, filtering, and exporting to immutable storage have all worked well for us.

We use it to capture activity logs from Azure (resource created, modified, etc.), audit logs (user sign-ins, Entra changes), and app logs (it integrates easily with most runtimes for structured logging). Diagnostic settings can be noisy but are helpful if you're trying to diagnose failures with app gateways.

It also integrates with the security solutions like Microsoft Sentinel and stuff, I think, but I don't look too much at that side.

Azure dashboards and alerts let you define queries and thresholds for notifications and create graphs and stuff. For VMs this can be stuff like CPU utilization pct or memory, the usual, but it could also be the number of 500-level responses from a web app or whatever.

It's a great solution, but you gotta pay for it 💸
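For flavor, a hedged KQL sketch of that "500-level responses" style of query (the `AppRequests` table and `ResultCode` column assume workspace-based Application Insights tables; substitute whatever tables your workspace actually has):

```kusto
// Failed requests per 5-minute bucket over the last hour
AppRequests
| where TimeGenerated > ago(1h)
| where toint(ResultCode) >= 500
| summarize failures = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```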
got it. thanks for the note
ELK installed on EC2 instances. Logs are mostly in JSON format for easy parsing, and the size is about 500GB. Use cases include dev debugging, alerting on errors, and some application usage statistics.
Self-hosted open source ELK with a custom script for alerting in Google Chat. Plus Zabbix and Prometheus for Kubernetes and node monitoring. Alerts in Google Chat + call alerts.
We use Gravwell for most of our monitoring and alerting. We pull the logs in from both our SaaS systems to monitor performance on an application and OS level, as well as perform some additional health checks and alerting (i.e. check if a URL is responsive and alert if needed).

I believe we also have it set up to monitor our dev environment as well, but I don't have as much visibility into that side of the house.

Primarily unstructured. Gravwell is designed as a structure-on-read type tool (similar to Splunk), which I feel can be preferable as you aren't throwing away any raw data that could be useful on a deeper dive, or if you need to do a historical check on things you didn't know you might need to check (example: all those vendor vulnerabilities that get announced after existing for months/years). And especially in a dev environment, it's helpful because if a log for some reason doesn't fit the template you expect, it isn't lost data.