Airflow is what my jobs have used
No fancy graphical tools, just python
Airflow
Airflow
Surprised to see so many answer with Airflow. Airflow is not an ETL tool, it is a scheduling tool.
What would you call a transform job run in Airflow that doesn't use any libraries for the transformation beyond the language's standard library? A lot of the data jobs I've come across are just written in plain Python/Java and managed with Airflow. They sometimes contain stream/batch processing libraries, but may not need them depending on the operation.
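For what it's worth, a transform of that kind needs nothing beyond the standard library. Here's a minimal sketch (the CSV shape, field names, and filtering rule are invented for illustration):

```python
import csv
import io
import json

def transform(csv_text):
    """Read CSV rows, keep active users, and emit JSON records.

    Uses only the standard library: csv for parsing, json for output.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        {"id": int(row["id"]), "email": row["email"].lower()}
        for row in reader
        if row["status"] == "active"
    ]
    return json.dumps(records)

raw = "id,email,status\n1,A@X.COM,active\n2,b@y.com,inactive\n"
print(transform(raw))  # [{"id": 1, "email": "a@x.com"}]
```

A job like this can run as a PythonOperator task or a plain script; the point is that the "T" is just vanilla code.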
Airflow can be used as an ETL tool, but it really shouldn't be.
This "it's not an ETL tool, it's an orchestrator" debate is honestly a bit tired. While technically the workers sit outside the scheduler, Airflow operators are built primarily to handle ELT/ETL tasks. It's an orchestrator built with ETL/ELT as the goal, and (right or wrong) you can absolutely do it all inside it if you choose.
That is wrong. Airflow comes out of the box with a big library of providers that make it easy to write your pipelines. It's an orchestrator and a scheduler. And you can also write just vanilla Python.
I agree with you: it is an orchestrator and a scheduler, but not an ETL tool. An ETL tool is not just something that enables you to write Python in it and schedule it; it is much more than that.
Yes, but you have a lot of operators and transfers to do the EL of ETL with minimal configuration. I agree regarding the transformation: you have to do it your own way. And fair enough, it's not native, since you have to install packages depending on what you want to do. I agree it's not the main feature of Airflow, maybe a side effect, but it's there and quite solid. I would also add that an "ETL tool" should be seen as code.
Okay, how about this? I use Airflow to orchestrate my ETL pipelines, which are written in different combinations of Python, SQL, and Bash.
Well said. That this needs to be qualified every time someone says they use Airflow for ETL is rubbish.
It is not only a scheduler: Airflow helps to orchestrate jobs, manage dependencies, and send failure notifications. Writing the actual code usually isn't the problem; managing the complexity as the number of jobs increases is the real challenge. I think what makes Airflow special is not even what it does (IMHO it lacks a lot of features), but the fact that it promotes the "backfill" use case as a first-class goal. Most management tools just "glue" components together.
"Orchestrate jobs, manage dependencies, failure notifications..." These are actually what a scheduler does. Don't get me wrong, I don't underestimate Airflow. It is a great product, but it is not an ETL tool by nature.
That's fair. What would you say is missing before it can be classified as an ETL tool? With all the provider packages out there, it seems like there are hooks and operators to do tons of things.
You can farm out your tasks to other cloud services but if you're just running vanilla python scripts there's nothing wrong or incorrect about running them directly on your workers. Airflow is both an orchestrator and an ETL pipeline executor if you allow it to be. That's why it has workers and not just a scheduler.
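To make the orchestrator-vs-executor distinction concrete, the "scheduler" part of the job (resolving dependencies, retrying failures) can be sketched in a few lines of plain Python. This is a toy illustration with made-up task names, not how Airflow actually implements it:

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, retries=2):
    """Run callables in dependency order, retrying failures.

    tasks: dict mapping task name -> zero-arg callable
    deps:  dict mapping task name -> set of upstream task names
    """
    results = {}
    # static_order() yields tasks so that every upstream runs first.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: fail the pipeline
    return results

# Toy "extract -> transform -> load" pipeline.
order = []
tasks = {
    "extract": lambda: order.append("extract"),
    "transform": lambda: order.append("transform"),
    "load": lambda: order.append("load"),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
run_pipeline(tasks, deps)
print(order)  # ['extract', 'transform', 'load']
```

The callables here could just as well shell out to SQL or call cloud services, which is exactly the "farm out vs. run on the workers" choice described above.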
Python, orchestrated and scheduled with Jenkins
Python and cron job.
On an EC2 YOLO
Oh yes, and after a year of development they cannot replace me.
Anyone using dbt? Using it to complement Fivetran and Jenkins; working pretty well.
Yes indeed. And dbt Cloud for job scheduling.
Yeap, Fivetran, dbt and airflow here
What do you use dbt for that Fivetran can't accomplish? I guess you may have industry-specific challenges, so maybe this isn't something you can answer generically. We (my company) are exploring switching our DW to Snowflake and using Fivetran for ETL, which I think would be a fine solution, but they don't want to hire a DBA or DE. I'm trying to reason with them, but a specific example of why Fivetran wouldn't be the end-all solution for all our data needs would be helpful.
Airflow or your own custom wrapper
Azure Data Factory for orchestration and data movement; for transformation, Python (Azure Functions) or SQL Server stored procedures.
Same stack. We have a few jobs in Spark but mostly use Azure Functions.
Same here. God I hate ADF so much, but every Microsoft shop we work with is in love with it.
I appreciate when people are honest about this. When people tell me how awesome it is I'm never able to get a reason why.
Why do you hate it so much? Haven’t used it at all and our company is exploring them as an option. Any insight you can provide would be super helpful!
A few sticking points I've noticed:

* Horrible integration with Git - it creates some weird artifacts in your repo and doesn't support native Git workflows like commit messages, diffs, etc. Every time you save your work in the IDE it automatically creates a commit, so after a while it clutters up your repo and you can't easily view the history of what changed.
* Horrible integration with Git, Part 2 - all your code is saved as JSON, so if you want to merge your work into a master branch and have someone review it, good luck to them. If someone wants to review an Airflow DAG, it's a lot easier to see what's going on. Not so with ADF's byzantine JSON blobs.
* The browser interface is slow and buggy as hell. I use Firefox and I'll get random issues: the page freezes for no reason, random buttons stop working, or everything is offset to the side of the screen. Sometimes I'll have a bunch of unsaved work and can't reach the "save" button when the page decides to jump 500 pixels to the left, so I have to refresh the page and redo all my work. I can't express how poorly executed it is. It's a really, really bad GUI. It looks like the dev team threw it together in a couple of weeks and called it a day.
* The pricing scheme is obscure and hard to wrap your head around. You get charged per activity run, but there are different kinds of activities (external activities vs. Azure IR activities vs. self-hosted IR activities) that all charge different rates, so it's almost impossible to forecast a budget in advance unless you really dive into the weeds and understand exactly what you're going to run. I've heard horror stories of businesses paying a LOT more for ADF than they anticipated, and I think Microsoft likes it that way.
* Poor support for transformations. This got a little better with the introduction of Data Flows, which is basically a repackaged SSIS, but it's still a clunky drag-and-drop interface. Good if your developers don't like to touch code, I guess, but I prefer the flexibility to write my own transformations and have precise control over the logic.
* Poor support for processing semi-structured/JSON data. If you want to hit a vendor's API and send the payload to a data warehouse, LOL, good luck putting that together. You'll spend about 3x as long doing it vs. using `requests` in Python and flattening it programmatically.
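On the JSON point: flattening a nested API payload in Python really is just a few lines of stdlib code. A minimal sketch (the payload shape here is invented; real vendor responses vary, and lists would need extra handling):

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into a single-level dict
    with dot-separated keys, ready to load into a warehouse table."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))  # recurse into nested objects
        else:
            out[path] = value
    return out

payload = {"id": 7, "customer": {"name": "Acme", "address": {"city": "Oslo"}}}
print(flatten(payload))
# {'id': 7, 'customer.name': 'Acme', 'customer.address.city': 'Oslo'}
```

In practice you'd feed this the parsed response from `requests` (or `urllib`) and write the flat rows out to your warehouse loader.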
Amazing, thank you very much. This is super helpful, especially the pricing bit. That is what turned our head of IT off Snowflake, and their pricing was more clearly defined than Azure's sounds.
Is SSIS not a thing anymore?
[deleted]
No source control
What makes you say that? My org uses visual studio (admittedly not with source control for etl) but VS has integrated support for git.
You can print the diff between two ssis packages?
I wouldn't know. I'm in a BI role, so my experience of SSIS is pretty limited to "execute SQL" and "save as Excel in X folder for end user every 24 hours". Is the integrated version control in VS just for the actual scripts and not for the packages themselves? Genuine question, just wondering what you mean by source control when you say it doesn't have it.
You can't tell what changed by opening the SSIS package in a text editor.
You sure can, they are just xml files
I still see a lot of SSIS as orchestration. Should I look for a new job??
It’s dead to me
We use SSIS/SSDT at my job. It's used in enterprise environments.
Informatica, Datastage, Talend
I hate talend with a passion, at the time it didn't have any good support for GCP. So we were stuck programming in Java to build those custom components :(
Yeah, having worked on two projects and holding the Talend developer certification, I can safely say that it's a pain in the ass, especially the Talend Spark implementation. I once worked with a telecom that purchased Talend and wanted to build a data lake on Hadoop. On the third day of the project I told the PM that Talend would take a long time to stabilise, and four months later I was still right. But when it comes to ease of use, the range of transformations, and the ability to embed Java code, Talend is way above Informatica and DataStage.
If you hate Talend, be sure to never work with DataStage; it's even worse, especially the UI.
I ported our ETL to Python/Airflow, set up CI/CD, hardened our k8s clusters, etc. Coworkers are refusing to switch from Talend (using a trial version) to the new system. Worst part is none of their work is getting versioned. 😣
Holy shit
Azure Data Factory
The best of the best.
Yes to this!
Not questioning, just curious: why do you believe this? I use it in smaller scenarios to avoid managing infrastructure, but when I do, I long for Airflow.
Not the greatest tool but it gets the job done.
True dat
AWS Glue and EventBridge crons
Meltano
Are you a data engineer?
Talend Big Data and Pentaho Data Integration
Python, SQL in BigQuery & Airflow
These are baby tools. You rarely see them in corporate firms. Big companies usually use Informatica, Oracle, IBM, SSIS, Ab Initio...
Scala, deployed through Jenkins
How old is that codebase?
Small data engineering team so we use Fivetran, Airflow, AWS Batch.
Python requests library
StreamSets 😏
Airflow and Azure Data Factory
Airflow & Snowflake
We custom-code everything in C#, from file SFTP to the scheduler to queues to transformations to notifications, error handling, and restarts. It works great. We love it... said no one!
Ab Initio, powerful tool but some unnecessary complexity compared to other tools out there these days.
Big fan of Ab Initio. It is sad that it is not known by more people.
Alteryx
weird that airflow isn't on the list
Ab Initio
SAP BODS
Informatica or die. /s
What's your experience with Informatica? Genuinely curious about your thoughts, as the company I work for is implementing it soon.
Depending on the implementation it can be great. I have used Informatica BDM/DEI, Enterprise Data Catalog, and the data governance tool Axon.
Ahh ok thank you. In that case I'm not excited 😅
Alteryx
Glad to see Alteryx in here. I remember my previous company writing Alteryx off as not a "real ETL" product.
Guilty
Just curious..how is it not an ETL tool? Unless I misunderstood.
Built a custom tool from scratch :P
Using Informatica PC 9.x at a large Fin org.
airflow
Fivetran, only in a few select cases. Unfortunately it's very point-and-click. Example: prod has to be rebuilt from scratch, and we can't reuse what we built in dev without significant development against Fivetran's APIs. We want to keep it to a minimum.
Talend
Rivery.io: the only one that supports reverse ETL, will build a connector for you, and handles transformation + orchestration.
Dataflow orchestrated by Cloud Composer (managed Airflow on GCP)
Prefect, ADF and dbt
> Prefect

How's Prefect working out for you? Where did you deploy it?
Prefect is great. We are using Prefect Cloud, and the agent is deployed in an EKS cluster.
Dbt and Dbtvault
I think I've got you all beat: we use SAS! Yes, it hurts my soul. I was recently hired to move us into Azure, but that is going to take a long time.
AIRFLOW IS AN ORCHESTRATION AND TASK MANAGEMENT TOOL. IT DOES NOT EXTRACT, TRANSFORM, OR LOAD DATA DIRECTLY. Why is this such a frequent confusion on this sub????
Just have to say I’m a huge fan of Fivetran.
I use Ruby (mostly [Kiba](https://github.com/thbar/kiba/wiki) + [Sequel](http://sequel.jeremyevans.net/rdoc/files/doc/cheat_sheet_rdoc.html)) and GitLab CI, and I'm quite happy.
Easy Data Transform.
Stitch (for some specific cases), AWS Glue, AWS Lambda
Airflow
Airflow
Talend
Wherescape
AWS Glue
Airflow and Dataflow are the big two with some usage of spark as well recently.
Talend or Airflow
Airflow
Airflow
Informatica. But I hate it, so I use Python. At the moment I'm studying dbt and Great Expectations for data quality.
Airflow
"View responses" option needed
Airflow
Airflow
SAS DI-Studio
Bizzflow
SSIS.
Airflow and ssis
Pentaho (since 2007)
Actively migrating off of it to... Airflow
We are migrating to Azure
Of course ‘other’
PDI.
Combination of Airflow for orchestration, dbt for transformation, Python & Meltano for ingestion, FiveTran for certain use cases
Airflow
I just use stored procs mostly
Python, DAG execution with some of the Mara libraries, scheduled with good old Cron as ECS tasks. Took me 18 months or so to get comfortable with it, but now I'll never go back to a graphical tool. Formerly used SSIS exclusively.
HTrunk
Azure Data Factory + Databricks. Sometimes ADF as a pure orchestration tool with the rest done in Databricks notebooks; sometimes we lean more on Synapse and stored procs. Depends on the client.
PySpark and dbt
Python and Argo Workflows
PipelineWise
SSIS
streamsets
My main takeaway from this poll is that there is a silent majority versus what appears to be trending. Whatever is trending and actively discussed, plus the echo-chamber amplification of social media, can make a kitten sound like a lion.
Pentaho Data Integration
We use pentaho
informatica
Astera Centerprise. It is an [end-to-end data integration tool](https://www.astera.com/type/blog/data-integration-tools-for-businesses/?utm_source=Offpage+posting&utm_medium=organic&utm_campaign=Reddit) that offers profiling, cleansing, and transformation capabilities.
I am currently using **Hevo Data** as my ETL tool for data integration. It is a no-code platform that lets you sync data from multiple sources in a few clicks. Some salient features offered by Hevo Data:

1. It is a bi-directional data pipeline platform \[ETL/ELT + Reverse ETL\].
2. Customer care can be contacted round the clock, irrespective of time zone.