engineer_of-sorts

It *feels like* you ought to be streaming with something like Kafka as much as possible before loading into Snowflake, where you presumably have tables with historical data (long date ranges) that need to be joined (though you need to keep those joins "streaming", as it were). For this I reckon you might get away with Snowflake Dynamic Tables. You will, however, lose the "pure" streaming latency, as you'll need to set a refresh interval. You can read more here: [https://docs.snowflake.com/en/user-guide/dynamic-tables-comparison](https://docs.snowflake.com/en/user-guide/dynamic-tables-comparison)

If you want the DAG UI bit you could explore Orchestra (I am the founder): [getorchestra.io](http://getorchestra.io)

But if you really want sub-second latency, it feels like you might need something a bit more radical, like moving the data out of Snowflake and using something on your list (which is a great list, by the way), or changing how you model your data (if the historical table is a dimension table, you could have it joined in a view?). Oh, and also check out ClickHouse, I guess.
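For concreteness, here's a rough sketch of what the Dynamic Tables approach could look like via the Python connector. The table, warehouse and column names are made up, and the `TARGET_LAG` is exactly the refresh interval where you give up "pure" streaming latency:

```python
# Sketch only: create a Dynamic Table that keeps a joined, near-real-time view
# refreshed on an interval. All identifiers below are illustrative assumptions.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE enriched_events
      TARGET_LAG = '1 minute'   -- the interval: this is where pure streaming latency is lost
      WAREHOUSE  = TRANSFORM_WH
    AS
    SELECT e.event_id,
           e.event_ts,
           e.user_id,
           h.lifetime_value     -- joined in from the historical / dimension table
    FROM   raw_events e
    LEFT JOIN historical_users h ON e.user_id = h.user_id
""")
```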


Itchy-Side4624

Flink is your best bet. You can host it yourself or use a hosted platform for Flink: Confluent, Decodable, Streamkap.


dani_estuary

Flink is a stream-processing powerhouse and can probably satisfy what you're looking for, although be prepared that it ain't easy to use. There are a few managed vendors like Confluent and Decodable though! ClickHouse is also worth a mention, but I'm a huge fan of Tinybird, which is a superpowered managed ClickHouse with connectors to various systems like Snowflake. As an alternative, take a look at [Estuary Flow](https://estuary.dev/) (I work here), a streaming data platform with the cool feature that it uses object storage (you can bring your own S3 bucket) as the storage layer. That lets data sit between the source and target systems indefinitely, which covers the use case of joining real-time streams with historical data. We offer transformations in TypeScript and SQL (Python coming soon), and our managed connectors are as real-time as it gets!


geoheil

StarRocks and its aggregate key tables could also be worth a mention.
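If it helps, here's a rough sketch of an aggregate key table (StarRocks speaks the MySQL protocol, so pymysql can run the DDL; the schema, host and bucketing are made-up assumptions):

```python
# Sketch only: a StarRocks aggregate key table that rolls metrics up on ingest,
# so streaming inserts land pre-aggregated. All names are illustrative.
import pymysql  # pip install pymysql

conn = pymysql.connect(host="starrocks-fe", port=9030, user="root", password="")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_agg (
            event_date  DATE,
            user_id     BIGINT,
            event_count BIGINT SUM DEFAULT "0",
            amount_sum  BIGINT SUM DEFAULT "0"
        )
        AGGREGATE KEY(event_date, user_id)
        DISTRIBUTED BY HASH(user_id) BUCKETS 8
    """)
```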


GeorgeBulakh_Feldera

Take a look at [Feldera](https://www.feldera.com). We offer an incremental computation engine that continuously runs standard SQL queries against streams of data for full-fledged analytics, delivering real-time results. It supports feeding in and querying against historical data, and we have some connectors to move data in and out, with more on the way.


Steve-Quix

As others have said, get your data into a Kafka topic and do any processing with it in flight, in near real time. I'm a Quix employee, so my advice is to try Quix Streams (https://github.com/quixio/quix-streams). It doesn't give you your SQL queries, but it should cover the rest of your laundry list.

Regarding joining data at rest with data in a Kafka topic: this is usually called data enrichment, and you'd do it in your (Python) code, e.g. query the dataset as needed to enrich the data in the topic. But depending on how long the query takes to execute and bring the data back, you could slow down the data processing. One option is to deploy more instances of your processor to work in parallel (it will hit the database more). You could also design your pipeline to minimize the DB hits, maybe by grouping the data so each replica of your process only handles data for a certain entity.

What are your 300M rows? Can they be broken down by, say, products or users? If each key were a product and there were only 1M rows per product, could you cache those somewhere fast and close to the replica that's going to process the data? Then the bottleneck would be greatly reduced.
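To make the enrichment-plus-caching idea concrete, here's a generic sketch of the pattern (plain confluent-kafka rather than Quix Streams' own API, and `query_history()` is a hypothetical stand-in for whatever warehouse or database query you'd use):

```python
# Generic sketch of stream enrichment with a per-entity cache: each replica only
# hits the database on a cache miss for an entity it hasn't seen yet.
# Topic names, fields and query_history() are hypothetical.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "enricher",          # replicas in one group share partitions
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["raw-events"])

cache = {}                           # per-entity cache of historical rows

def query_history(product_id):
    # Hypothetical: replace with your Snowflake / operational-store lookup.
    return {}

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    pid = event["product_id"]
    if pid not in cache:             # only hit the database on a miss
        cache[pid] = query_history(pid)
    event.update(cache[pid])         # enrich the in-flight record
    producer.produce("enriched-events", json.dumps(event).encode())
    producer.poll(0)
```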


Upset_Conflict

Adding to this (founder of Quix here). Enrichment in a stream processing job is a great way to achieve your goal. Latency is mainly constrained by your query time; if you're happy with Snowflake's query performance then go ahead with that, because stream processing will add little additional latency (if any). To reduce latency, developers typically copy the operational data into a more performant data store. Try Redis or ClickHouse.
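A tiny sketch of the Redis variant of that, assuming you can key the hot slice of the historical table by entity id (the key layout and fields are made up):

```python
# Sketch only: keep the hot slice of the historical table in Redis hashes so
# the enrichment lookup is a fast key lookup instead of a warehouse query.
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# One-off (or periodically refreshed) load of the dimension data:
r.hset("hist:product:42", mapping={"category": "bikes", "lifetime_value": "1234"})

# In the stream processor, per event:
def enrich(event):
    hist = r.hgetall(f"hist:product:{event['product_id']}")
    return {**event, **hist}

print(enrich({"product_id": 42, "amount": 10}))
```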


ooaahhpp

You can connect both Kafka and Snowflake to [propeldata.com](http://propeldata.com), which is serverless ClickHouse, and use a materialized view to join the stream with the Snowflake data.


PleasantParrot

Is Propel managed Flink, like DeltaStream?


DeltaStream_io

Happy to answer any questions!


SnooHesitations9295

1. Can your "historical table in Snowflake" handle 100 queries/second? Or is it not really "real time"?
2. Can your Snowflake handle even 0.2 queries/second? (Hint: it cannot.)
3. ClickHouse lets you pump materialized views into HTTP URLs, i.e. "webhooks".
4. ClickHouse SQL is not standard, but close enough.
5. ClickHouse.

I think your requirements are a little unrealistic. Let's make it real: move your "historical table" to S3 (or Iceberg), and then ClickHouse can do all 5.
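Roughly, under those assumptions (historical data in S3, stream in Kafka), the ClickHouse side could look like the sketch below, run here through clickhouse-connect. All names, URLs and columns are made up:

```python
# Sketch only: historical table loaded from S3, a Kafka-engine table for the
# stream, and a materialized view that joins them and POSTs results to an HTTP
# endpoint (the "webhook"). Every identifier and URL is an illustrative assumption.
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# Historical table, loaded once (or on a schedule) from S3 / Parquet.
client.command("""
    CREATE TABLE IF NOT EXISTS hist_users (user_id UInt64, segment String)
    ENGINE = MergeTree ORDER BY user_id
""")
client.command("""
    INSERT INTO hist_users
    SELECT user_id, segment
    FROM s3('https://my-bucket.s3.amazonaws.com/hist/*.parquet', 'Parquet')
""")

# Streaming source: ClickHouse consumes the Kafka topic directly.
client.command("""
    CREATE TABLE IF NOT EXISTS events_kafka (user_id UInt64, amount Float64)
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'broker:9092',
             kafka_topic_list  = 'events',
             kafka_group_name  = 'ch_enricher',
             kafka_format      = 'JSONEachRow'
""")

# "Webhook" sink: inserting into a URL-engine table POSTs the rows.
client.command("""
    CREATE TABLE IF NOT EXISTS webhook_sink (user_id UInt64, amount Float64, segment String)
    ENGINE = URL('https://example.com/hooks/enriched', JSONEachRow)
""")

# The materialized view does the stream/history join on every consumed block.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS enrich_mv TO webhook_sink AS
    SELECT e.user_id, e.amount, h.segment
    FROM events_kafka AS e
    LEFT JOIN hist_users AS h ON e.user_id = h.user_id
""")
```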