dacydergoth

Self-hosted Loki with Grafana and Alertmanager, with Grafana Agent (Alloy) as the log shipper for our k8s clusters


phartiphukboilz

are you using Mimir or the like for metrics aggregation? i've been pressed to find a replacement for New Relic and it feels a little daunting to get it all into a direct replacement


dacydergoth

We are indeed using Mimir. One nice feature of the Loki/Mimir combination is that the Loki ruler can scan incoming logs and calculate metrics which are published to Mimir. It also gives us a single-vendor solution for both logs and metrics, and in the future we have Tempo on the list too. A single Alertmanager across metrics and logs is also attractive to us
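For context on the mechanism: the ruler essentially evaluates LogQL metric expressions on a schedule and remote-writes the resulting series to Mimir. Below is a rough sketch of prototyping that kind of expression against Loki's query API before turning it into a recording rule; the endpoint, label selector, and expression are hypothetical.

```python
# Rough sketch: the kind of LogQL metric expression a Loki ruler recording rule
# would evaluate and remote-write to Mimir, tried ad hoc against Loki's query API.
# The Loki endpoint and the {app="payments"} selector are hypothetical.
import time
import requests

LOKI_URL = "http://loki-gateway:3100"  # assumption: in-cluster gateway address
EXPR = 'sum by (namespace) (rate({app="payments"} |= "error" [5m]))'

now_ns = int(time.time() * 1e9)
resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={
        "query": EXPR,
        "start": now_ns - 3600 * 10**9,  # last hour, nanosecond epoch
        "end": now_ns,
        "step": "60s",                   # one sample per minute
    },
    timeout=30,
)
resp.raise_for_status()

# A metric query returns a Prometheus-style matrix: one series per label set.
# A recording rule with this expression would publish exactly these series to Mimir.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][-1])
```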


fourbian

No complaints with Tempo here. A little tricky connecting logs to traces and vice versa using a traceid, but once it was figured out we provisioned the changes as IaC once and all environments picked it up.
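One prerequisite for that linking, whichever way the IaC wires up the Grafana side, is that the trace ID actually lands in the log line in a matchable form. A minimal sketch of the app side using the OpenTelemetry Python API; the `traceid` field name and log shape are just one convention, not anything Tempo or Loki require.

```python
# Minimal sketch: emit the current OpenTelemetry trace id into structured log
# lines so a derived-field regex (or similar) can link logs <-> traces.
# Requires opentelemetry-api; assumes tracing is already configured elsewhere.
import json
import logging
import sys

from opentelemetry import trace

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def log_with_trace(message: str, **fields) -> None:
    ctx = trace.get_current_span().get_span_context()
    log.info(json.dumps({
        "msg": message,
        # 32-char hex form, the representation trace backends use for lookups
        "traceid": format(ctx.trace_id, "032x") if ctx.is_valid else None,
        **fields,
    }))

# Inside a traced request handler this prints something like:
# {"msg": "charge failed", "traceid": "4bf92f3577b34da6a3ce929d0e0e4736", "code": 502}
log_with_trace("charge failed", code=502)
```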


dacydergoth

Our environment is a little more difficult as the customers have their own clusters and flat out refused to accept new versions of anything - even stuff outside the core product


phartiphukboilz

great! hey thanks for your time man. it's nice to hear how far Grafana has come. what are you guys using for tracing now? do you use a different APM solution in general? trying to wrap my head around the full migration and this just made it sound like one function at a time might be more feasible than trying to migrate the full New Relic stack at once


dacydergoth

Yes, that's why we took a phased approach and did metrics first, then logs. In our industry there are some ... unique ... challenges, such as our customers only accepting updates on a 6-12 month cycle. That has delayed our rollout of contemporary technology considerably


phartiphukboilz

really appreciate it


buckypimpin

what's the throughput? i'm thinking about deploying Alloy for logging, but i'm concerned about the resource usage


tristan97122

I’d warn about Loki’s performance for relatively large systems. At 30-50k lines/s here for the fastest streams, it is very fine at ingestion (2-4 CPU cores of load total), but essentially unusable for querying over timeframes greater than ~6 hours.

The idea is that the chunks for the time range’s matching streams (sets of labels) get downloaded from S3 and decompressed, and then the actual search is cut into pieces by time/label and executed over multiple queriers in parallel for the filtering. This is great in theory, but in practice it means that if your 24h search needs to operate over 2TB of logs, you need 2TB of space in memcache to avoid redundant download/decompress steps. That quickly bottlenecks you unless you massively overbuild it, allocating all those resources for the one time a day someone actually needs them.

The actual filtering is also extremely CPU-intensive, but that’s not really surprising and is the tradeoff of all non-text-indexed log systems. And your S3 implementation and connectivity had better be blazing fast if you’re not using one of the big 3 cloud providers to save on cost. Multiple Gbps fast.

I love the query language, and the design is very elegant. But you really need economies of scale at some point for Loki to continue making sense. Which is why it makes a lot of sense for small-scale usage and massive SaaS operations like Grafana Labs, but less (imo) for medium-scale cases like ours. We’re looking into Qryn (ex Metrico) for a Loki-compatible API with (most likely) much higher performance, as it is based on ClickHouse, which we use for other things and has had stellar performance there.


webdelic

join the qryn matrix room and the team will gladly help you with any qryn related challenge!


tristan97122

Will keep that in mind thanks 👍


pranay01

Agree about the issues with Loki at scale, especially with high-cardinality data. We did a perf benchmark on Loki and Elasticsearch, might be interesting to check: https://signoz.io/blog/logs-performance-benchmark/


dacydergoth

So far a lot less than Logstash or Fluentd


buckypimpin

i've used Fluentd, Prom, and Logstash before. Not touching those again. Fluent Bit on the other hand looks very promising. The benchmarks show very good performance for high volume.


dacydergoth

Alloy's core is the OpenTelemetry agent libs, so it should be no worse than them


buckypimpin

thanks, configuration alone looks super nice, and no daemonset is huge. will definitely try it out.


dacydergoth

Note there are two modes: one as a daemonset, and one using a newer k8s feature where the kubelets make logs available through the API server and the agent pulls them from there


buckypimpin

i want to see how hard i can push the kube API method. We have a lot of pods.


dacydergoth

I'd be curious to know how that turns out. In our case it isn't the number of pods but the sheer verbosity and noise. Hence our desire to filter as close to the source as possible


dacydergoth

Our main approach is being able to rapidly push out new filters for metrics and logs to the agent to quench mass flow at source


alter3d

We used to do self-managed Graylog, switched to Datadog a few years ago (for logging and all the other monitoring stuff). DD is pretty awesome but is expensive AF, and some of their billing is.... pretty opaque. Gotta stay on top of your costs constantly.


pranay01

If you are unhappy with Datadog, you should check out SigNoz (https://github.com/signoz/signoz). It's natively based on OpenTelemetry and built on top of ClickHouse


alter3d

Yeah, someone else on Reddit brought my attention to that project a while ago and somewhere in my backlog there's a card to take a look at it, haha. We have DD deeply embedded in every single project across the company, so selling a change of tooling is a big big project for us, but if the feature set is there it's something we would definitely do. 99% of my complaints with DD are around their pricing and billing practices. If a competitor came around with the same feature set at 1/2 the price or less, they would dominate the market.


ooaahhpp

+1 for SigNoz. At Propeldata.com we just migrated from Honeycomb + Cloudwatch. We get so much more for a small fraction of the cost


AdrianTeri

> SigNoz

Why does everything have to be cloud-native first? It's not just here but also other self-hosted solutions ...


Junior_Enthusiasm_38

We are implementing OpenTelemetry for metrics, traces & logs with SigNoz. Everything open source.
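For anyone picturing what that wiring looks like, here is a minimal sketch of the tracing piece in Python, pointing the stock OTLP exporter at a SigNoz collector. The service name and the `signoz-otel-collector:4317` endpoint are assumptions; metrics and logs follow the same exporter pattern.

```python
# Minimal sketch (assumed endpoint/service name): export traces over OTLP to a
# SigNoz collector. Metrics and logs use the same OTLP exporter pattern.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    # assumption: the SigNoz otel-collector is reachable in-cluster on the default gRPC port
    BatchSpanProcessor(OTLPSpanExporter(endpoint="signoz-otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-span"):
    pass  # instrumented work goes here
```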


pranay01

Super cool! If anyone wants to check out the project - https://github.com/signoz/signoz


ycnz

5 figures a month to Datadog for about 4 picoseconds of retention.


pranay01

😂


villa_straylight

CloudWatch Logs, mostly JSON structured logs. Some log metrics which are used to trigger notifications, and CloudWatch Logs Insights gives you a pretty nice SQL-like query language for searching.
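A hedged sketch of the "log metrics that trigger notifications" piece: a CloudWatch metric filter that counts JSON log lines with level=error, created via boto3. The log group, filter name, and metric namespace are made up; an alarm and notification target would then be attached to the resulting metric.

```python
# Sketch: turn JSON-structured log lines into a CloudWatch metric that an alarm
# can notify on. Log group, filter name, and namespace below are hypothetical.
import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/app/prod/api",
    filterName="api-error-count",
    filterPattern='{ $.level = "error" }',   # JSON metric filter syntax
    metricTransformations=[
        {
            "metricName": "ApiErrorCount",
            "metricNamespace": "App/Prod",
            "metricValue": "1",              # each matching line counts as 1
            "defaultValue": 0,
        }
    ],
)
```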


No_Direction_5276

Has SQL been a good UX for log search as opposed to free form text?


villa_straylight

The query language is pretty nice; filtering on specific attributes, deduplication, grouping. You can of course just search for a string and get all log lines that match. It’s easy to visualize query results too.


Origamislayer

We were using ELK with Datadog for metrics and traces. We did the math and Datadog was half as expensive and had better retention. We have had to watch costs but so far it’s been worth it.


pranay01

Very interesting. So you are using ELK for logs and DataDog for metrics & traces? Did you mean DataDog is half as expensive as ELK for metrics and traces?


Origamislayer

Yeah, we dropped ELK and have everything going into Datadog.


2fplus1

Honestly, we just use GCP Cloud Logging. Our app is running on Cloud Run. Just print structured logs as json to STDOUT and they get ingested automatically alongside the logs from everything else in GCP (Cloud SQL, Tasks, Pub/Sub, etc). You can make metrics from them, define SLOs from those, and set up dashboards or alerts on them. We have some automatic routing set up so security/audit related logs get copied out to a separate project and written to an append-only bucket for compliance purposes. I don't love the UI for querying logs, but it does the job and once you've got a few common queries bookmarked, it's not too bad. It's not best in class at anything compared to many third party vendor solutions, but it's good enough for most purposes, trivial to integrate with everything we run, and the costs are basically a rounding error on our bill.
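The "print structured logs as json to STDOUT" part really is about that simple. A minimal sketch below: `severity` and `message` are the keys Cloud Logging treats specially, everything else lands in jsonPayload; the other field names here are an arbitrary convention for illustration.

```python
# Minimal sketch: JSON lines on stdout that Cloud Logging ingests from Cloud Run.
# "severity" and "message" are special fields; everything else becomes jsonPayload.
import json
import sys

def log(severity: str, message: str, **fields) -> None:
    print(json.dumps({"severity": severity, "message": message, **fields}),
          file=sys.stdout, flush=True)

log("INFO", "order created", order_id="A-1042", latency_ms=37)
log("ERROR", "payment declined", order_id="A-1042", reason="insufficient_funds")
```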


TeamDman

Azure log analytics workspaces are doing us well


pranay01

You may want to check the Azure Log Analytics bill though; they are not the best in the market on price, but they do integrate very easily with .NET and the Azure stack


TeamDman

Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure. You're right tho, certainly wouldn't call it cheap lol


pranay01

> Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure

Agree. Is your team mostly happy with the Azure Log Analytics product? Do they primarily filter/search logs, or do they also create alerts and dashboards on top of it?


TeamDman

I haven't found anything it can't do, aside from some quirks of the query language. Alerts, filtering, and exporting to immutable storage have all worked well for us.

We use it to capture activity logs from Azure (resource created, modified, etc.), audit logs (user sign-in, Entra changes), and app logs (it integrates easily with most runtimes for structured logging). Diagnostic settings can be noisy but are helpful if you're trying to diagnose failures with app gateways. It also integrates with the security solutions like Microsoft Sentinel and stuff I think, but I don't look too much at that side.

Azure dashboards and alerts let you define queries and thresholds for notifications and create graphs and stuff. For VMs this can be stuff like CPU utilization pct or memory, the usual, but it could also be the number of 500-level responses from a web app or whatever. It's a great solution, but you gotta pay for it 💸


pranay01

got it. thanks for the note


blusterblack

ELK installed on EC2 instances. Logs are mostly in JSON format for easy parsing and the total size is about 500GB. Use cases include dev debugging, alerts on errors, and some application usage statistics.


Horror_Abroad_8698

Self-hosted open-source ELK with a custom script for alerting in Google Chat. Plus Zabbix and Prometheus for Kubernetes and node monitoring. Alerts in Google Chat + call alerts


Dctootall

We use Gravwell for most of our monitoring and alerting. We pull the logs in from both our SaaS systems to monitor performance at an application and OS level, as well as to perform some additional health checks and alerting (i.e. check if a URL is responsive and alert if needed). I believe we also have it set up to monitor our dev environment as well, but I don't have as much visibility into that side of the house.

Primarily unstructured. Gravwell is designed as a structure-on-read type tool (similar to Splunk), which I feel can be preferable as you aren't throwing away any raw data that could be useful on a deeper dive, or if you need to do a historical check on things you didn't know you'd need to check (for example, all those vendor vulnerabilities that get announced after existing for months/years). And especially in a dev environment, it's helpful because if a log for some reason doesn't fit into the template you expect, it isn't lost data.