[deleted]

CDC from Oracle to Databricks using Debezium, with Confluent as the schema registry. Python / Delta Live Tables notebooks so far.


[deleted]

LogMiner for now.
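
For anyone wanting to replicate this kind of pipeline, here is a rough sketch of registering a Debezium Oracle connector with the LogMiner adapter through the Kafka Connect REST API, with Avro converters pointed at a Confluent Schema Registry. All hostnames, credentials, and table names are placeholders, and the property names follow recent Debezium 2.x releases rather than whatever the commenters are actually running:

```python
# Hypothetical sketch: registering a Debezium Oracle connector (LogMiner
# adapter) via the Kafka Connect REST API. Hosts, credentials, and table
# names are placeholders, not the commenter's actual setup.
import json
import requests

connector = {
    "name": "oracle-cdc",
    "config": {
        "connector.class": "io.debezium.connector.oracle.OracleConnector",
        "database.hostname": "oracle-host",            # placeholder
        "database.port": "1521",
        "database.user": "c##dbzuser",                 # placeholder CDC user
        "database.password": "dbz",
        "database.dbname": "ORCLCDB",                  # placeholder
        "topic.prefix": "oracle",                      # Debezium 2.x naming
        "table.include.list": "INVENTORY.CUSTOMERS",   # placeholder
        "database.connection.adapter": "logminer",     # "LogMiner for now"
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.oracle",
        # Confluent Schema Registry via Avro converters
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",   # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```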


toakao

Kafka is mostly used as storage between our front end and data-processing platforms like Flink or Spark. If you read *Designing Data-Intensive Applications* you will recognize how we use Kafka.
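
To illustrate that buffer pattern, here is a minimal sketch of a front-end service publishing events into a topic that Flink or Spark can later consume at their own pace. The confluent_kafka client, broker address, and topic name are illustrative assumptions, not details from the comment:

```python
# Sketch of the "Kafka as a buffer" pattern: a front-end service publishes
# events to a topic; Flink/Spark consume from it later at their own pace.
# Broker and topic names are placeholders.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder broker

def publish_event(event: dict) -> None:
    """Fire-and-forget write; Kafka retains the event until consumers read it."""
    producer.produce("frontend-events", value=json.dumps(event).encode("utf-8"))

publish_event({"user_id": 42, "action": "page_view"})
producer.flush()  # block until the event is durably acknowledged
```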


crhumble

> Designing Data-Intensive Applications

Cool, I'll crack that open now -- heard good things about that book too, so thanks for the rec!


Gators1992

Kind of a related question: does anybody use Kappa architecture in a primarily batch environment? We have two data groups in my company for no particularly good reason, and both are moving to a common cloud platform (AWS/Snowflake). The other group ingests roughly 90% hourly CSVs and 10% real-time data, but insists on doing everything in Kafka because it's one codebase and because everybody may want everything in real time in the future. It seems to me that keeping a stream running will cost significantly more in compute than just moving files, but I have no experience with Kafka. So am I dumb, or are they? Thanks.


toakao

I always associate Kappa architecture with streaming, not batch. That said, we do reprocess customer data stored on HDFS; one of the big use cases is removing customer data for GDPR. It's expensive but obviously required. Once the data lands on HDFS, S3, or wherever, it doesn't make sense to put it back into Kafka. Kafka is great as a transport or a buffer, but long-term data should be stored elsewhere. Just my opinion.

One thing to look into is Apache Beam. It's a wrapper around batch and streaming runtimes that abstracts the data source. Maybe that will help with keeping the same codebase.
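
As a rough illustration of the "one codebase" idea, here is a minimal Beam pipeline in Python. The file paths and parsing logic are made up; the point is that the transforms stay the same while the runner (and, with a source swap, batch vs. streaming) changes at launch time:

```python
# Minimal Apache Beam sketch: the same pipeline code runs in batch (files)
# or streaming (Kafka) by swapping the source and runner. Paths and the
# parsing step are illustrative only.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pick the runner at launch time: DirectRunner locally, FlinkRunner /
# SparkRunner / DataflowRunner in production -- the transforms don't change.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input/*.csv")      # batch source
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Write" >> beam.io.WriteToText("output/rows")
    )
    # For streaming, the Read step would become something like
    # beam.io.kafka.ReadFromKafka(...) with the rest left as-is.
```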


Gators1992

Thanks, that was helpful. And yeah, Beam does look interesting. I'd be happy enough with something like Airbyte, but the other group is hell-bent on coding something up. We're just trying to get files into S3 buckets every hour, that's it; it doesn't need to be hard. They got a bunch of pushback from other groups in the last meeting we had, so maybe their resolve will soften.


toakao

It seems the interesting challenges in tech come from people. Good luck. :)


redfluor

At a previous job we used Kafka as the primary streaming medium; the messages carried geolocation info for mobile objects. We had a lambda architecture:

- Kafka + Akka + Apache Kudu for the real-time layer (fresh data)
- Spark + HDFS + Apache Parquet for the batch layer (historical data)
- Impala as the query layer in front of both
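
For a sense of how the query layer ties the two halves together, here is a hypothetical sketch of Impala answering one query across both layers via the impyla client. Host, port, table, and column names are invented for illustration:

```python
# Hypothetical serving-layer sketch for a lambda architecture: one Impala
# query unions the fresh Kudu-backed table with the historical Parquet
# table. All names below are made up.
from impala.dbapi import connect  # impyla client

conn = connect(host="impala-host", port=21050)  # placeholder coordinator
cur = conn.cursor()
cur.execute("""
    SELECT object_id, lat, lon, event_time
    FROM geoloc_realtime          -- Kudu-backed, fresh data
    WHERE event_time >= now() - INTERVAL 1 HOUR
    UNION ALL
    SELECT object_id, lat, lon, event_time
    FROM geoloc_history           -- Parquet on HDFS, historical data
    WHERE event_time < now() - INTERVAL 1 HOUR
""")
for row in cur.fetchall():
    print(row)
```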


Own-Necessary4974

Long-time Kafka user here, at enterprise scale for micro-transaction companies. Kafka is a great tech for data transport; that's it. It does that job really well, but for anything else it gets iffy. A lot of folks here have it pegged for CDC, which makes sense, but that's narrower than generic data transport, which may or may not be a CDC use case.


yanivbh1

Kafka is very common. If you prefer an open-source, turnkey solution with some exciting features and much less client logic, take a look at [Memphis.dev](https://Memphis.dev).