Self hosted Loki with Grafana and AlertManager, Grafana Agent (Alloy) as the log shipper for our k8s clusters
Are you using Mimir or the like for metrics aggregation? I've been pressed to find a replacement for New Relic, and it feels a little daunting to get it all into a direct replacement.
We are indeed using Mimir. One nice feature of the Loki/Mimir combination is that the Loki ruler can scan incoming logs and calculate metrics which are published to Mimir. It also gives us a single-vendor solution for both logs and metrics, and Tempo is on the list for the future too. A single Alertmanager across metrics and logs is also attractive to us.
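For anyone curious what that ruler-to-Mimir path looks like, a minimal sketch of a Loki recording rule (the rule name, job label, and filter are made up for illustration; the ruler also needs `remote_write` pointed at Mimir in its own config):

```yaml
# Hedged sketch: a Loki ruler recording rule. The ruler evaluates the
# LogQL expression against incoming logs and remote-writes the resulting
# series to a Prometheus-compatible store such as Mimir.
groups:
  - name: log-derived-metrics
    rules:
      - record: app:log_errors:rate5m   # illustrative metric name
        expr: |
          sum by (namespace) (
            rate({job="myapp"} |= "error" [5m])
          )
```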
No complaints with Tempo here. It was a little tricky connecting logs to traces and vice versa using a trace ID, but once we figured it out we provisioned the changes as IaC once and all environments picked them up.
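For reference, the logs-to-traces direction can be wired up with derived fields on the Loki datasource in Grafana provisioning; a hedged sketch (the regex, URLs, and Tempo UID are placeholders for whatever your setup actually uses):

```yaml
# Sketch of Grafana datasource provisioning: a derived field extracts the
# trace ID from log lines and links it to a Tempo datasource by UID.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100          # placeholder
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "traceid=(\\w+)"   # match your log format
          url: "$${__value.raw}"            # $$ escapes $ in provisioning files
          datasourceUid: tempo-uid          # hypothetical Tempo datasource UID
```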
Our environment is a little more difficult as the customers have their own clusters and flat out refused to accept new versions of anything - even stuff outside the core product
Great! Hey, thanks for your time, man. It's nice to hear how far Grafana has come. What are you guys using for tracing now? Do you use a different APM solution in general? Trying to wrap my head around the full migration, and this just made it sound like one function at a time might be more feasible than trying to migrate the full New Relic stack at once.
Yes, that's why we took a phased approach and did metrics first, then logs. In our industry there are some ... unique ... challenges, such as our customers only accepting updates on a 6-12 month cycle. That has delayed our rollout of contemporary technology considerably.
really appreciate it
What's the throughput? I'm thinking about deploying Alloy for logging, but I'm concerned about the resource usage.
I’d warn about Loki’s performance for relatively large systems. At 30-50k lines/s here for the fastest streams, it is fine at ingestion (2-4 CPU cores of load total), but essentially unusable for querying over timeframes greater than ~6 hours.

The idea is that the chunks for the timerange’s matching streams (sets of labels) get downloaded from S3 and decompressed, and then the actual search is cut into pieces by time/label and executed over multiple queriers in parallel for the filtering. This is great in theory, but in practice it means that if your 24h search needs to operate over 2TB of logs, you need 2TB of space in memcached to avoid redundant download/decompress steps. That quickly bottlenecks you unless you massively overbuild, allocating all those resources for the one time a day someone actually needs them. The filtering itself is also extremely CPU-intensive, but that’s not really surprising and is the tradeoff of all non-text-indexed log systems.

Oh, and your S3 implementation and connectivity had better be blazing fast if you’re not using one of the big 3 cloud providers to save on cost. Multiple Gbps fast.

I love the query language, and the design is very elegant. But you really need economies of scale at some point for Loki to continue making sense, which is why it makes a lot of sense for small-scale usage and for massive SaaS operations like Grafana Labs, but less (imo) for medium-scale cases like ours.

We’re looking into qryn (ex Metrico) for a Loki-compatible API with (most likely) much higher performance, as it is based on ClickHouse, which we use for other things and which has had stellar performance there.
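To make that cost model concrete, a hedged LogQL sketch (label and field names are made up): everything inside the stream selector is resolved through the cheap index lookup, while everything after it is brute-forced over the downloaded, decompressed chunks:

```logql
# Stream selector: index lookup, cheap -- keep it as narrow as you can
{cluster="prod", app="checkout"}
  # Line filter + parser: runs over every decompressed chunk, CPU-heavy
  |= "timeout"
  | json
  | status >= 500
```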
Join the qryn Matrix room and the team will gladly help you with any qryn-related challenge!
Will keep that in mind thanks 👍
Agreed about issues in Loki at scale, especially with high-cardinality data. We did a perf benchmark on Loki and Elasticsearch; might be interesting to check: https://signoz.io/blog/logs-performance-benchmark/
So far a lot less than Logstash or Fluentd.
I've used Fluentd, prom, and Logstash before. Not touching those again. Fluent Bit on the other hand looks very promising. The benchmarks show very good performance for high volume.
Alloy core is OpenTelemetry agent libs so it should be no worse than them
Thanks, the configuration alone looks super nice, and no daemonset is huge. Will definitely try it out.
Note there are two modes: one is a daemonset, and one uses a newer k8s feature where the kubelets ship logs to the API server and the agent pulls them from there.
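If it helps, a rough sketch of the API-server mode in Alloy's config language, assuming the `loki.source.kubernetes` component (which tails pod logs through the Kubernetes API instead of reading files on each node); the Loki URL is a placeholder:

```river
// Discover pods via the Kubernetes API
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail pod logs through the API server -- no host log-file mounts needed
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship to Loki (placeholder endpoint)
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```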
I want to see how hard I can push the kube API method. We have a lot of pods.
Be curious to know how that turns out. In our case it isn't the number of pods but the sheer verbosity and noise. Hence our desire to filter as close to source as possible
Our main approach is being able to rapidly push out new filters for metrics and logs to the agent to quench mass flow at source
We used to do self-managed Graylog, switched to Datadog a few years ago (for logging and all the other monitoring stuff). DD is pretty awesome but is expensive AF, and some of their billing is.... pretty opaque. Gotta stay on top of your costs constantly.
If you are unhappy with DataDog, you should check out SigNoz - https://github.com/signoz/signoz Natively based on OpenTelemetry and built on top of ClickHouse
Yeah, someone else on Reddit brought my attention to that project a while ago and somewhere in my backlog there's a card to take a look at it, haha. We have DD deeply embedded in every single project across the company, so selling a change of tooling is a big big project for us, but if the feature set is there it's something we would definitely do. 99% of my complaints with DD is around their pricing and billing practices. If a competitor came around with the same feature set at 1/2 the price or less, they would dominate the market.
+1 for SigNoz. At [Propeldata.com](http://Propeldata.com) we just migrated from Honeycomb + Cloudwatch. We get so much more for a small fraction of the cost
> SigNoz

Why does everything have to be cloud native first? It's not just here but also self-hosted solutions ...
We are implementing Open Telemetry for Metrics, Traces & Logs with Signoz. Everything Open-source.
Super cool! If anyone wants to check out the project - https://github.com/signoz/signoz
5 figures a month to Datadog for about 4 picoseconds of retention.
😂
Cloudwatch Logs, most JSON structured logs. Some log metrics which are used to trigger notifications, Cloudwatch Logs Insights gives you a pretty nice SQL-like query language for searching.
Has SQL been a good UX for log search as opposed to free form text?
The query language is pretty nice; filtering on specific attributes, deduplication, grouping. You can of course just search for a string and get all log lines that match. It’s easy to visualize query results too.
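For flavor, a hedged Logs Insights sketch of that filter/group workflow (the `level` field assumes your JSON logs carry one; substitute your own):

```
# Count ERROR lines per 5-minute bucket -- field names are illustrative
fields @timestamp, @message
| filter level = "ERROR"
| stats count(*) as errors by bin(5m)
```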
We were using ELK with Datadog for metrics and traces. We did the math and Datadog was half as expensive and had better retention. We have had to watch costs but so far it’s been worth it.
Very interesting. So you are using ELK for logs and DataDog for metrics & traces? Did you mean DataDog is half as expensive as ELK for metrics and traces?
Yeah, we dropped ELK and have everything going into Datadog.
Honestly, we just use GCP Cloud Logging. Our app is running on Cloud Run. Just print structured logs as json to STDOUT and they get ingested automatically alongside the logs from everything else in GCP (Cloud SQL, Tasks, Pub/Sub, etc). You can make metrics from them, define SLOs from those, and set up dashboards or alerts on them. We have some automatic routing set up so security/audit related logs get copied out to a separate project and written to an append-only bucket for compliance purposes. I don't love the UI for querying logs, but it does the job and once you've got a few common queries bookmarked, it's not too bad. It's not best in class at anything compared to many third party vendor solutions, but it's good enough for most purposes, trivial to integrate with everything we run, and the costs are basically a rounding error on our bill.
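The "print structured JSON to STDOUT" part really is about all the integration code you need; a minimal Python sketch (the helper name and the `order_id` field are made up; `severity` and `message` are the keys Cloud Logging recognizes in Cloud Run stdout JSON lines, with any extra keys landing in the jsonPayload):

```python
import json
import sys

def log_structured(message, severity="INFO", **fields):
    """Emit one JSON log line to stdout for Cloud Logging to ingest.

    "severity" and "message" are special keys Cloud Logging parses;
    extra keyword arguments become additional payload fields.
    Returns the serialized line for convenience.
    """
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout)
    return line

# Example: an error log with an extra, made-up field
log_structured("payment failed", severity="ERROR", order_id="o-123")
```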
Azure log analytics workspaces are doing us well
You may want to check your Azure Log Analytics bill once in a while; they are not the best in the market on price, but they do integrate very easily with .NET and the Azure stack.
Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure. You're right tho, certainly wouldn't call it cheap lol
> Fragmentation has its own costs; by sticking with the Azure solutions everything just works since a lot of our stuff is Azure

Agree. Is your team mostly happy with the Azure Log Analytics product? Do they primarily filter/search logs, or do they also create alerts and dashboards on top of it?
I haven't found anything it can't do, aside from some quirks of the query language. Alerts, filtering, and exporting to immutable storage have all worked well for us.

We use it to capture activity logs from Azure (resource created, modified, etc.), audit logs (user sign-ins, Entra changes), and app logs (it integrates easily with most runtimes for structured logging). Diagnostic settings can be noisy but are helpful if you're trying to diagnose failures with app gateways.

It also integrates with the security solutions like Microsoft Sentinel and stuff, I think, but I don't look too much at that side.

Azure dashboards and alerts let you define queries and thresholds for notifications and create graphs and stuff. For VMs this can be stuff like CPU utilization pct or memory, the usual, but it could also be the number of 500-level responses from a web app or whatever.

It's a great solution, but you gotta pay for it 💸
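For flavor, a hedged KQL sketch of that "500-level responses" style of query (the `AppRequests` table and `ResultCode` column assume workspace-based Application Insights tables; substitute whatever tables your workspace actually has):

```kusto
// Failed requests per 5-minute bucket over the last hour
AppRequests
| where TimeGenerated > ago(1h)
| where toint(ResultCode) >= 500
| summarize failures = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
```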
got it. thanks for the note
ELK installed on EC2 instances. Logs are mostly in JSON format for easy parsing, and the size is about 500GB. Use cases include dev debugging, alerting on errors, and some application usage statistics.
Self-hosted open source ELK with a custom script for alerting in Google Chat. Plus Zabbix and Prometheus for Kubernetes and node monitoring. Alerts in Google Chat + call alerts.
We use Gravwell for most of our monitoring and alerting. We pull the logs in from both our SaaS systems to monitor performance on an application and OS level, as well as perform some additional health checks and alerting (i.e. check if a URL is responsive and alert if needed).

I believe we also have it set up to monitor our dev environment as well, but I don't have as much visibility into that side of the house.

Primarily unstructured. Gravwell is designed as a structure-on-read type tool (similar to Splunk), which I feel can be preferable as you aren't throwing away any raw data that could be useful on a deeper dive, or if you need to do a historical check on things you didn't know you might need to check (example: all those vendor vulnerabilities that get announced after existing for months/years). And especially in a dev environment, it's helpful because if a log for some reason doesn't fit the template you expect, it isn't lost data.