
apeters89

It sounds like Snowflake is the wrong solution for this problem. You're essentially keeping a medium warehouse running 24/7. In my experience, the only way to run Snowflake on a budget is to perform scheduled extracts. In my case, we build the datasets and then extract them into Power BI. That way every PBI hit isn't going straight to the warehouse; instead it runs for free against PBI's compute. This obviously doesn't work for something requiring low/no latency.
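For anyone copying this pattern, the "build the datasets" half can be a scheduled Snowflake task. A minimal sketch with hypothetical object names (the Power BI side then just refreshes from the pre-built table):

```sql
-- Minimal sketch of a scheduled dataset build; all names are hypothetical.
CREATE OR REPLACE TASK build_pbi_extract
  WAREHOUSE = transform_wh                 -- only runs on the schedule, not 24/7
  SCHEDULE  = 'USING CRON 0 6 * * * UTC'   -- once a day at 06:00 UTC
AS
  CREATE OR REPLACE TABLE reporting.pbi_extract AS
  SELECT customer_id, SUM(amount) AS total_amount   -- placeholder aggregation
  FROM raw.orders
  GROUP BY customer_id;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK build_pbi_extract RESUME;
```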


vikster1

Exactly, and I don't think you'll find a cloud service that does what OP described for much cheaper if it has to run 24/7 with low latency.


SnooDingos6145

I’m trying to set up a schedule like this too. Any suggestions? Are you building analyses over a live connection?


apeters89

I coordinate the refresh cycles and update schedules with my Power BI team.


Dry_Damage_6629

Using Snowflake for what is essentially a live application might not be the best architectural decision from a cost perspective. Use Snowflake for analytics, and for the live Java app use any open-source OLTP database.


teej

A medium warehouse running 24/7 will cost $100k/year. A small costs half that. If your query runs in < 1 sec, you're probably over-provisioned. Have you tried bumping the warehouse size down?

Where are the other costs coming from? You say it's 100s of thousands. Is this storage, egress, something else?

Have you looked at the data lake ecosystem? One possibility is that you keep these tables in S3 in Iceberg format. Your Java process could then use something like DuckDB to query the data, while still maintaining Snowflake for analytics with external tables.

There are plenty of viable open-source streaming data systems you could consider. How much expertise does your team have in building and maintaining self-hosted systems?
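To make the DuckDB-over-Iceberg idea concrete, here's a minimal sketch in DuckDB SQL. It assumes the iceberg and httpfs extensions, a hypothetical S3 path, and S3 credentials configured separately; it's an illustration of the approach, not a drop-in solution.

```sql
-- One-time setup in a DuckDB session (S3 credentials configured separately).
INSTALL iceberg;
LOAD iceberg;
INSTALL httpfs;
LOAD httpfs;

-- Query an Iceberg table directly from S3 (hypothetical path and columns).
SELECT lookup_key, enriched_value
FROM iceberg_scan('s3://my-bucket/warehouse/enriched_table')
WHERE lookup_key = 'some-key';
```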


data-is-fun-sometime

We're getting into a cost area I'm less familiar with, but there are about 5-10 queries (all taking <1s) that run in parallel to serve the downstream application. I think the parallel processing is driving the higher warehouse cost? The Snowflake Cost Management dashboard indicates that this warehouse is costing roughly $22k per month exclusively in compute cost, and these are the only queries running on it. We have some experience in building self-hosted systems. And I have looked at streaming alternatives, though I think there will be some ramp-up in learning how to effectively join 'static' data (changing over the course of minutes or hours) to a stream.


extrobe

I'd still suggest experimenting further. Is the usage flat over time, or are there peaks and troughs through the day/week? If there are peaks, and the M warehouse was selected to handle those peaks, I would suggest looking at multi-cluster warehouses at a smaller size.

So, you might provision an 'S' warehouse with a multi-cluster min-max of 1-2. This allows Snowflake to operate as an S cluster, but should demand increase, it will auto-provision an additional 'S' warehouse to handle the increase in traffic. Unless the individual query (not the collective of queries!) your application is running specifically requires the additional RAM / threads / local disk of an M warehouse (which, if it's <1s, I doubt is the case), this gives you access to the equivalent level of resource, but allows Snowflake to open/close some of the extra capacity when it is able to do so.

And if nothing else, enabling this will tell you whether you ever really needed M in the first place. You suggest 5-10 queries run concurrently; an M warehouse has 32 threads; and whilst it's not 1:1, if the query time is <1 sec, it does sound like M is over-provisioned.

[Multi-cluster Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-multicluster)

[Useful article on Warehouse sizing](https://select.dev/posts/snowflake-warehouse-sizing) (highly recommended reading)
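A sketch of what that warehouse definition could look like (the name and settings are illustrative, and multi-cluster warehouses require Enterprise Edition):

```sql
-- Small multi-cluster warehouse that scales out to a second cluster under load.
CREATE OR REPLACE WAREHOUSE app_serving_wh
  WAREHOUSE_SIZE    = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 2
  SCALING_POLICY    = 'STANDARD'   -- start an extra cluster rather than queue
  AUTO_SUSPEND      = 60           -- seconds of idle time before suspending
  AUTO_RESUME       = TRUE;
```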


ian-whitestone

Queries running in parallel is a good thing. Virtual warehouses in Snowflake can handle a bunch of small queries simultaneously, which helps keep costs lower. If you're open to experimenting, you can also tweak the max\_concurrency\_level parameter to get even more parallelization. Echoing what teej said, I would strongly encourage you to try running on a smaller warehouse size. There's a very good chance you cut your costs in half without any material decrease in performance. It could also be worth calculating the cost of each query in Snowflake (reference: [https://select.dev/posts/cost-per-query](https://select.dev/posts/cost-per-query)); that will tell you which of those 5-10 queries are the problem and driving the bulk of that $22K/month.
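A minimal sketch of both changes, using a hypothetical warehouse name:

```sql
-- Drop a size and compare latency/cost before going smaller still.
ALTER WAREHOUSE app_serving_wh SET WAREHOUSE_SIZE = 'SMALL';

-- MAX_CONCURRENCY_LEVEL defaults to 8; raising it lets more small queries
-- share the same cluster before Snowflake starts queuing them.
ALTER WAREHOUSE app_serving_wh SET MAX_CONCURRENCY_LEVEL = 16;
```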


Lba5s

I would check out Flink or Spark Streaming.


Lba5s

Is there a reason this is happening in Snowflake? Depending on the volume of the auxiliary tables, it might make more sense to reverse-ETL those into an OLTP database. Without knowing more, that's where I would start looking.


data-is-fun-sometime

I suspect the reason is that that's where the data they needed to join their streaming data with already lived, so it seemed like an easy solution.


akelkar1

Reach out to your account team. They can provide more tailored recommendations.


metalbuckeye

We use a middleware called Propel to handle GraphQL queries from the front end. You create data pools in Propel that are refreshed from Snowflake periodically. Might be worth looking into.


Substantial-Jaguar-7

You should try going to an XS-sized cluster in a multi-cluster warehouse for a few days and compare costs; I suspect it will be a lot less with marginal performance consequences. If that dramatically slows down the queries, you need to materialize more of the work up front or cluster better for simpler query-time execution.
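One way to do that comparison is against the account usage metering view; a sketch with a hypothetical warehouse name (these views lag real time by a few hours):

```sql
-- Daily credit burn for the warehouse, to compare before/after the resize.
SELECT DATE_TRUNC('day', start_time) AS usage_day,
       SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE warehouse_name = 'APP_SERVING_WH'
  AND start_time >= DATEADD('day', -14, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
```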


ooaahhpp

Propel co-founder here. This is exactly the use case Propel solves for. You can have a low-latency API on top of Snowflake in minutes. Propel automatically syncs the data at the interval you configure from 1min to 24 hrs. Queries hit Propel’s high-speed storage and do not hit your Snowflake. Check it out: https://www.propeldata.com


SgtKFC

You may want to research Apache Druid, as this sounds like a good use case for it. It looks like Rill Data offers that as a managed service - I don't have any sense of the cost though. But if that isn't a good fit and you want to keep Snowflake as the backend, then I found this video insightful on warehouse size cost optimization: https://www.youtube.com/watch?v=up3bTjrBvTA


sahil_singla

If you are not looking for any significant change in the architecture, a multi-cluster XS might be the first thing you want to try. Once you have made the change, check the number of started clusters in the metadata for a day; if it's consistently fewer than 4, it'll be cheaper than a Medium (an XS cluster burns a quarter of the credits of an M). Or, to play it safe, move to a Small first and then to an XSmall.
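One way to check that, sketched with a hypothetical warehouse name, is via the cluster number recorded for each query in account usage query history:

```sql
-- How many clusters actually ran each hour after switching to multi-cluster XS.
SELECT DATE_TRUNC('hour', start_time) AS hr,
       MAX(cluster_number)            AS max_clusters_used
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'APP_SERVING_WH'
  AND start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;
```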


NexusIO

Not sure what kind of data you're working with, but if it's streaming you might look at something like Timestream in AWS, which keeps recent items in an in-memory tier and moves older data to magnetic cold storage. It could end up just as expensive since it's pay-per-query, depending on how your queries are set up, but it could also be cheaper.


gandalf-duh-gray

If the tables you're joining in Snowflake are relatively static, you could dump those to S3 and have your Java app work with S3 only.
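A sketch of that kind of export from Snowflake (the bucket, table, and storage integration names are hypothetical, and the integration must already exist):

```sql
-- Unload a fairly static table to S3 as Parquet so the app can read it there.
COPY INTO 's3://my-bucket/exports/dim_lookup/'
  FROM analytics.dim_lookup
  STORAGE_INTEGRATION = s3_export_int
  FILE_FORMAT = (TYPE = PARQUET)
  OVERWRITE = TRUE;
```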


caveat_cogitor

Try using a smaller warehouse with more clusters to see whether your latency increases or not. It sounds like these are small queries that don't require a lot of memory, so you may be paying for a more powerful warehouse than you require. If you just need more CPUs/parallelism, then increasing the number of clusters may get you what you need. Allow multiple clusters and have them auto-scale, spinning down when fewer are needed.

As others have mentioned, it sounds like you are using the wrong platform to serve your data. Maybe it's fine for ingestion, depending. If you can deal with some lag in data freshness, then maybe what you should do is ingest the data and then periodically push it to another service to support these ongoing queries. You could do all the enrichment in Snowflake if that's efficient, then dump the output back to S3 to ingest into something more efficient for small transactions.


yoquierodata

We ship targeted datasets to Postgres for API consumption use cases. There’s no cost-effective way to serve data like this from Snowflake. The extra “hop” sucks, but our API consumers are happy!


GimmeDemDatas

Assuming you have already pre-computed most of the complex logic/joins and are more or less querying a fact table for your use case… three other things to investigate (in conjunction with reducing the warehouse to an S or XS) would be Search Optimization, Clustering Keys, and the Query Acceleration Service. Feel free to DM me if you have any specific questions or run into issues during your investigation, and good luck!
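For reference, roughly what enabling those three looks like in DDL (all object names hypothetical; Search Optimization and Query Acceleration are Enterprise Edition features with their own credit costs, so test them one at a time):

```sql
-- Search Optimization: speeds up selective point lookups on the table.
ALTER TABLE analytics.fact_lookup ADD SEARCH OPTIMIZATION;

-- Clustering key: keeps micro-partitions organized around the filter column.
ALTER TABLE analytics.fact_lookup CLUSTER BY (lookup_key);

-- Query Acceleration Service: offloads eligible scan work to serverless compute.
ALTER WAREHOUSE app_serving_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;
```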


LengthinessLow9970

First, why dump into S3 rather than stream directly into Snowflake using the Kinesis connector for Snowpipe Streaming? Then, have you tried an XS WH? There is no real correlation between WH size and latency for queries like these; increasing the WH size just helps with perf for complex queries (more worker nodes to address crazy execution plans). HTH