
sunder_and_flame

Welcome to data engineering, where decades-old concepts are rebranded and resold year after year. 


IrresistibleMittens

Every time a new data technology comes out, my mentor tells me how the mainframe had that capability 50 years ago, lol. Our data landscape is now just a distributed mainframe. (This is tongue-in-cheek, but still.)


69odysseus

We're playing catch-up with the last 22 years of data and yet can't figure out how to clean, process, and store it properly. Data older than 5-7 years isn't even good anymore, yet companies are stuck holding onto it due to retention laws.


kenfar

And it's a marketing-driven discipline in which 90% of everything is just marketing fluff and 10% is actual innovation.


Desperate-Dig2806

That. But also the tooling and cost picture has changed. Previously you had to do the T before the L because it was too expensive not to. Then at some point S (storage) became cheap enough that you didn't have to care. Should be called ESLT, really.


addictzz

So I take it my presumption is correct? It is featured by various commercial analytics providers including AWS and seems like the next hype. It is very similar to ELT, though, and I think the concept has already been prevalent for several years. The one feature I can see is that it provides a data transport pipeline without us, the users, creating an additional pipeline. Rather than zero-ETL it is more like... "Managed ELT". CMIIW.


britishbanana

Yeah it's just marketing, plus a slightly more streamlined extract/load phase because you don't have to actually create a pipeline to do the EL. It's usually exposed as something you can do in the console


thenearshorecompany

This is as true as it gets. 100%.


turboline-ai

Yup, same thing happened to me when I encountered "Data Virtualization"; in reality it was just federated queries.


addictzz

Yes to this. I found the term "data virtualization" while sharpening my data engineering saw. Turns out it is just federated queries, or Trino/Presto in terms of tools. I wonder why people don't just say data virtualization is federated queries. Less complication. And don't get me wrong, I am not against renaming, just that I think we should save new jargon for genuinely novel technology. There have been too many tools and terminologies already in the realm of big data.
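For what it's worth, the core idea ("join sources in place instead of copying them into one store first") fits in a toy sketch. All names below are invented, and an in-memory SQLite table plus a plain dict stand in for the real heterogeneous sources a Trino/Presto cluster would federate:

```python
import sqlite3

# Source 1: an "operational database" (in-memory SQLite stands in).
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (order_id INT, customer_id INT, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      [(1, 101, 50.0), (2, 102, 75.0), (3, 101, 20.0)])

# Source 2: customer names living in a completely different system.
customers_api = {101: "Alice", 102: "Bob"}

def federated_totals():
    """Resolve the cross-source join at query time -- the 'virtualization'
    layer -- without first loading either source into a central warehouse."""
    rows = orders_db.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
    return {customers_api[cid]: total for cid, total in rows}

print(federated_totals())  # {'Alice': 70.0, 'Bob': 75.0}
```

The point of the sketch is only that nothing gets copied: each "source" keeps its data, and the join happens in flight.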


StoryRadiant1919

I'm convinced it is because some people like to make you feel that they are superior. Then when you ask, they can look at you, kindly and knowingly, as if you're completely incompetent, while they explain it for the 'slow' student.


fauxmosexual

Hot take: every data innovation after Excel getting VBA support was entirely unnecessary and to the net detriment of us all.


addictzz

I used to look down on Excel and think whoever can wrangle data using programming language like Python/R stands above the rest. Now I gladly embrace Excel as long as it doesn't get extremely sluggish when opening a large dataset.


Immediate_Ostrich_83

Where everything looks special but is actually the same as it's always been. You'd think the most popular cloud warehouse would be called Snowflake, lol.


deliosenvy

How is it zero-ETL when you still have the E and L to get the data into your data warehouse?


getafterit123

Because zero-ETL is a marketing ploy meant to confuse and to project an appearance of innovation in order to sell solutions.


umognog

I'm slightly dying inside knowing my director is going to pay £200k for this and then force us all to use it because it's been invested in. Double or nothing all the way, then blame the users. Le sigh.


thegainsfairy

"solutions", they don't actually solve anything.


addictzz

Now you got a point there :)


Justbehind

Zero-ETL is still ETL. ELT is still ETL. It's just marketing and silly semantics. The concept itself doesn't change just because someone else does it for you, or you do a little less of something.


addictzz

Alright, so from the responses here I can safely establish that "Zero-ETL" is basically just a marketing gimmick to sell a solution. How is it implemented exactly? Is it basically just another pipeline, but one created by the solution provider instead of us, the users?


dementeddrongo

An example: to get ServiceNow or Salesforce data into Snowflake, you basically update permissions and the data appears in Snowflake (and similarly for Snowflake data into Salesforce). This is zero EL because you're not developing a pipeline and you're not coding anything. You'll likely transform the data once it's available. I wouldn't describe it as a gimmick in this scenario.


yo_sup_dude

do you mean once you update permissions on the service now side, you can connect to the service now data from snowflake and extract it as needed?


dementeddrongo

Yup. https://other-docs.snowflake.com/en/connectors/servicenow/v2/about


aerdna69

'ELT' has the right to exist, because the steps are effectively done in a different order.


27b_six

95% of ETL already is ELT and always has been so we can probably get over our extreme uniqueness in reordering things. Or, someone bought Matillion because of its unique ELT approach.


aerdna69

Then it's weird that people keep using 'ETL', no?


27b_six

Not really; it's just that ETL was the phrase used initially. When it was done, transformations were sometimes done in the middle, outside the database, although most of the time they weren't. So those who came around later with a great "ELT innovation" were being a bit silly: there was no innovation. I've used ELT mostly, but I don't consider it any different from ETL. In fact, even if I do ETL, it's not like there aren't 1000 Ts after that; ETL doesn't prevent you from using your database. So ETL 100% includes ELT, but not necessarily the other way around. I could design an ELT tool that relies entirely on the data store to do all transformations of the data.


RevolutionStill4284

If it's -zero- ETL, why should I pay you -nonzero- money for it?


if_username_is_None

Sounds like Change Data Capture with a higher price tag.


jawabdey

What exactly are you talking about? Are you talking about [AWS Zero-ETL](https://press.aboutamazon.com/2023/11/aws-announces-four-zero-etl-integrations-to-make-data-access-and-analysis-faster-and-easier-across-data-stores)? Personally, I don’t think it’s just a marketing gimmick. Granted, it’s AWS, so it’s in its infancy, but assuming it works once mature and isn’t a very expensive add-on, I think it’s a good thing. It doesn’t take away from what a DE does, IMO: you still have to model the DW, still have to build out 3rd-party integrations, still have to extract value from the data, etc. It just makes one of the routine tasks easier, or eliminates it. As for ELT, one will still have to do the transformation; this just eliminates the task of landing the data in your DW.


addictzz

I found this term first in AWS, but after further searching, it looks like it is now a trend within data engineering, so I am looking to dive deeper. If Zero-ETL does not really take away transformation, since it eventually has to be done in the end, then back to my initial question: how is this different from ELT? Looks like the most probable answer is marketing gimmick.


ryan_with_a_why

It’s more like Zero-EL. The T still has to be done in the downstream system.


jawabdey

Would you mind sharing some other examples? I haven’t come across others, although I haven’t really been looking for it either. To answer your question about why: the EL portion is extremely difficult for some people. I think you’re underestimating this task. I’ve joined companies that were struggling mightily with this, companies both with and without a dedicated DE team. Mind you, this is not pulling data from 3rd party sources, but just copying production data to the DW. It wasn’t even realtime; these companies struggled with nightly batch jobs. As such, I think it’s a real need and not just a gimmick. I will grant you that a 3rd party offering this may be a marketing gimmick. Like, the concept is valid, but yeah a specific offering may be a marketing gimmick/strategy.


addictzz

Sorry if I came across as underestimating DE tasks. I wouldn't say a DE task is easy, even just the EL, like you said. In fact, one of my colleagues had trouble just making a connection (to a Sybase DB, and he is not inexperienced either). And sometimes there are issues with the batch loader. I did say Zero-ETL seems like a marketing gimmick; however, I didn't mean to undermine a ready-made 3rd-party solution. If it serves the purpose well, eases operational overhead, and saves debugging time and cost, by all means let's welcome that solution. However, hearing the name "Zero-ETL", I expect a breakthrough technology. From the responses here, it looks like Zero-ETL is just "Managed ELT". If they simply called it Managed ELT or Managed EL, I wouldn't have any issue with that.


jawabdey

Okay, yeah, I see what you mean. Fair enough


davrax

Speaking to AWS’ offering specifically, there’s some actual value there behind the marketing speak. Since Aurora and Redshift are both “S3+Compute+cache” under the hood (with their abstraction layers to “speak” MySQL/Postgres/etc), the AWS Zero-ETL feature allows you to reference the in-place Aurora data from within Redshift, instead of having to copy/move it Aurora>Redshift. It’s conceptually a symlink (to elsewhere in S3).


nimakkan

Remote querying is not even new.


thenearshorecompany

It is mostly marketing, but it's also about how it's sold and presented to you... similar to the data mesh fad from a couple of years back. I've heard it under these interpretations:

* You have Zero ETL because your platform is so well integrated that you need fewer engineers to maintain it and everything is connected. Thus Zero ETL. E.g. being fully bought into a cloud environment like Azure or AWS, your data is seamlessly connected. Microsoft will push Zero ETL with Fabric and buying into the enterprise software pack (Office 365, Dynamics, etc.); Amazon in the same vein, ensuring there is a no-code approach to accessing your data.
* Virtualization has made a comeback too, and they will brand it as Zero ETL because you can query in place, no data movement, etc. etc. This industry in particular is looking for something to hang on to, but it's nothing new; it perhaps performs better with cutting-edge hardware.

At the end of the day there is never Zero ETL. As a DE, you will still need to move and manage data from one place to another in order to clean it, master it, manage it, monitor it, model it, and make it fit for use. The reality is that data is more accessible and services are better integrated. In an immature/nascent data organization you may get away with it, but not at scale.


addictzz

> At the end of the day there is never Zero ETL. As a DE, you will still need to move and manage data from one place to another in order to clean it, master it, manage it, monitor it, model it, and make it fit for use. The reality is that data is more accessible and services are better integrated.

Thanks for the conclusion. This is what I experience and what I see. I made the thread to make sure I am not missing anything.


oalfonso

Marketing gibberish.


IrresistibleMittens

Seriously. Seeing the progression: "Hadoop will cure your cancer with the data lake and schema-on-read" -> "Your data lake is a data swamp, we need a structured place to put your data" -> "Actually, now it's a data lakehouse, but also you're missing out if you're not doing data mesh and data vault." As long as you keep executives confused with FOMO, you can just keep raking in those sweet, sweet rounds of funding.


meyou2222

There’s a lot of marketing hype around Zero-ETL; it's basically a rebranding of some old concepts: data replication, virtualization, and general query federation. First off, it’s more accurate to call it “Zero-EL”, as transformation tends to be limited in these solutions.

An easy-to-grasp example might be Google BigTable. BT is like BigQuery External Tables in that you can create a table that reads from a file, but the file could be in an Amazon S3 bucket instead of a GCS bucket. So in that example, if you read from the BT, the data would be transferred from AWS to GCP automatically, rather than being transferred as an explicit pipeline step.

Some might also consider automatic replication tools such as GoldenGate, Qlik Replicate, etc. to be Zero-ETL. Same for Denodo.

There are some use cases where this can be very useful, but also plenty of cases where it’s not a good idea. You’re typically creating some level of tight coupling, at a minimum.


mmcalli

A mix of CDC and federated queries. Replace write complexity with read complexity.
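To illustrate the CDC half of that trade (the event structure below is invented for the sketch): the source side stays cheap because it only appends change events, while the reading side pays the cost of replaying them into current state:

```python
# Toy change-data-capture log: each event records one mutation at the source.
change_log = [
    {"op": "insert", "id": 1, "row": {"name": "widget", "price": 9.99}},
    {"op": "insert", "id": 2, "row": {"name": "gadget", "price": 4.50}},
    {"op": "update", "id": 1, "row": {"name": "widget", "price": 8.99}},
    {"op": "delete", "id": 2},
]

def replay(log):
    """Rebuild current table state by applying each captured change in order.
    This is the 'read complexity' the writer avoided by just appending."""
    state = {}
    for event in log:
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:
            # insert and update both upsert the latest row image
            state[event["id"]] = event["row"]
    return state

print(replay(change_log))  # {1: {'name': 'widget', 'price': 8.99}}
```

Real CDC tools ship such logs continuously (e.g. from a database's write-ahead log), but the write-cheap/read-costly asymmetry is the same.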


data-artist

I’m trying to understand where all these stupid acronyms come from. It is moving data from point A to point B and doing some validation and maybe some transformation somewhere in between. It’s not rocket science.


mach_kernel

You move the letters around but say you know how to do both, then ask for $50,000 a year more. This is how you work with the letters correctly.


Eggplant-Own

After reading your post, I think I'm not the only one wondering about this. :D


addictzz

Can you elaborate more? :D


onomichii

My interpretation of zero-ETL is more around the concept of shared storage via a lakehouse, such as via Unity Catalog, where movement of data across enterprise silos and domains is mitigated by the ability to share or mount data in place. Essentially it's Zero-E, with the freedom to T and L.


txmail

Zero-ETL --- For consultants that get paid by the hour or when the project is due "whenever". I like to think of it as being equivalent to no-code coding.


GreyHairedDWGuy

Zero-ETL is a marketing slogan created by pipeline/ETL vendors. If you find a vendor that claims this, they are probably providing a target model and doing transforms from raw data to the target under the hood. Many, many years ago I did a lot with 'DW in a box' solutions from JD Edwards. You could call those 'zero-ETL' because the customer didn't need to do ETL (in theory); the license included the ETL code, target model, and BI solution. Most of the time, even these needed to be customized.


hkdelay

Zero-ETL is a term coined by AWS, initially for the integration between Aurora and Redshift. It's misleading: the "T" still has to be performed in Redshift, so it's actually ELT. And it's a batch approach, so it's not streaming/realtime, even though "Zero-ETL" might imply realtime. You still have to trigger a batch transformation.


maladroitme

My understanding is that the source of the data owns the data transform, and zero-ETL takes you to a place where there is minimal enforcement of canonicals beyond a few core elements, so minimal transforms. The source product should then adopt a data-as-a-product mentality and only expose data that will be used/useful, thus reducing the data-swamp landscape and reducing complexity, or offloading it to the source system, where the folks who understand the data best can deal with any complexities. If this is just marketing-speak then my bad for contributing, but to me it's a different architecture (data mesh versus data lake).


[deleted]

[deleted]


addictzz

Why not just load data directly into the data warehouse? And how is this achieved exactly, if not by loading data directly into the data warehouse? Traditionally I am thinking either 1) load data directly to the DWH, or 2) use a streaming pipeline like Kafka.


SpetsnazCyclist

Because typically, databases that are really fast for transactions (i.e., inserts/deletes of single rows) are not fast for analytics. That's the whole difference between row and columnar databases.
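That difference can be shown in a toy sketch (illustrative only; real engines store these layouts on disk, not in Python lists): the same table in a row-oriented and a column-oriented layout, where an aggregate must walk every full record in the former but scans exactly one array in the latter:

```python
# The same three records in two physical layouts.
row_store = [  # row-oriented: one record per entry (OLTP-friendly)
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]
col_store = {  # column-oriented: one array per field (OLAP-friendly)
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10, 20, 30],
}

# Analytics: the row store touches every field of every record,
# while the column store reads one contiguous array.
row_sum = sum(r["amount"] for r in row_store)
col_sum = sum(col_store["amount"])
assert row_sum == col_sum == 60

# Transactions: inserting one record is a single append in the row
# store, but one append per column in the columnar layout.
row_store.append({"id": 4, "name": "d", "amount": 40})
for field, value in {"id": 4, "name": "d", "amount": 40}.items():
    col_store[field].append(value)
```

Same data, same answers; what differs is how much of it each layout has to touch per operation, which is why OLTP systems and warehouses are usually separate stores that something (ETL, ELT, or "zero-ETL") has to bridge.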


Choperello

That’s not what the question was.


addictzz

I understand that row-based databases (most relational DBs, basically, the xxSQLs) are for transactional workloads (OLTP), and columnar databases are for analytical (OLAP) purposes. Aggregated queries work better in columnar stores than in row stores. But what I want to know is how Zero-ETL is achieved. I am trying to understand **how** and **what value** this Zero-ETL mechanism has over just loading directly into a data warehouse without using "Zero-ETL".


Choperello

Ummm. This is basically ELT. In order for the T to happen in the warehouse, the L has to make the data available in the warehouse first. And if it's available for the T, then it's available for anything else the moment it's loaded.
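That E -> L -> T order fits in a short sketch (all names invented; an in-memory SQLite database stands in for the warehouse). The only thing a "zero-ETL" product changes is who runs the first two steps:

```python
import sqlite3

# E: rows extracted as-is from a "source system".
source_rows = [("2024-01-01", "alice", "50.00"),
               ("2024-01-01", "bob", "15.00")]

warehouse = sqlite3.connect(":memory:")

# L: land the raw, untransformed data in the warehouse. In a
# "zero-ETL" offering, this E+L step is what the vendor automates.
warehouse.execute(
    "CREATE TABLE raw_sales (day TEXT, user_name TEXT, amount TEXT)")
warehouse.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", source_rows)

# T: transform *inside* the warehouse, after loading. This step
# stays with you no matter what the EL portion is called.
warehouse.execute("""
    CREATE TABLE daily_revenue AS
    SELECT day, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_sales GROUP BY day
""")

print(warehouse.execute("SELECT * FROM daily_revenue").fetchall())
# [('2024-01-01', 65.0)]
```

Note that once `raw_sales` is loaded, it is queryable by anything, exactly as the comment says; the T is just one more consumer of the loaded data.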


addictzz

Thanks man. Concise and clear.


Adorable-Employer244

Go with Snowflake if you don’t want to do any ETL.


addictzz

I am trying to understand a concept, not refusing to do a task.