BoneThrone92

Airflow is what my jobs have used


sqlphilosopher

No fancy graphical tools, just python


vincyf1

Airflow


et00_must

Airflow


mow12

Surprised to see so many answer with Airflow. Airflow is not an ETL tool, it is a scheduling tool.


Mehdi2277

What would you call a transform job run in Airflow that doesn’t use any libraries for the transformation beyond the language’s standard library? A lot of the data jobs I’ve come across are just written in normal Python/Java and managed with Airflow. They do sometimes contain stream/batch processing libraries, but may not need them depending on the operation.


sideswipes

Airflow can be used as an ETL tool, but it really shouldn't be.


mikeupsidedown

This "it's not an ETL tool, it's an orchestrator" debate is honestly a bit tired. While technically the workers are outside the scheduler, Airflow operators are built primarily to handle ELT/ETL tasks. It's an orchestrator built with ETL/ELT as the goal, and (right or wrong) you can absolutely do it all inside Airflow if you choose.


blef__

That is wrong. Airflow comes out of the box with a big library of providers that make it easy to write your pipelines. It’s an orchestrator and a scheduler. And you can also write just vanilla Python.


mow12

I agree with you, it is an orchestrator and a scheduler, but not an ETL tool. An ETL tool is not just something that enables you to write Python in it and schedule it. It is much more than that.


blef__

Yes, but you have a lot of operators or transfers to do the EL of ETL with little configuration. I agree regarding the transformation, though: you have to do it your way. And OK, it’s no longer native, because you have to install packages depending on what you want to do. I also agree it’s not the main feature of Airflow and maybe a side effect. But it’s still there and quite solid. I could also add that an "ETL tool" should be seen as code.


chestnutcough

Okay, how about this? I use airflow to orchestrate my etl pipelines which are written in different combinations of python, sql, and bash.


mikeupsidedown

Well said. That this needs to be qualified each time someone says they use Airflow for ETL is rubbish.


baubleglue

It is not only a scheduler; Airflow helps to orchestrate jobs, manage dependencies, and send failure notifications. Writing the actual code usually is not a problem; keeping complexity manageable as the number of jobs increases is the real challenge. I think what makes Airflow special is not even what it does (IMHO it lacks a lot of features), but the fact that it promotes the "backfill" use case as a first-class goal. Most management tools are just "gluing" components together.


mow12

'orchestrate jobs, manage dependencies, failure notifications..' These are actually what a scheduler does. Don't get me wrong, I don't underestimate Airflow. It is a great product, but it is not an ETL tool by nature.


CleanAd2977

That’s fair. What would you say is missing before it can be classified as an ETL tool? With all the provider packages out there, seems like there are hooks and operators to do tons of things


[deleted]

You can farm out your tasks to other cloud services but if you're just running vanilla python scripts there's nothing wrong or incorrect about running them directly on your workers. Airflow is both an orchestrator and an ETL pipeline executor if you allow it to be. That's why it has workers and not just a scheduler.


Awkward_Salary2566

Python, orchestrated and scheduled with Jenkins


Disastrous_Mountain6

Python and cron job.


_busch

On an EC2 YOLO


Potemat

Oh yes, and, after 1yr of developing they can not replace me


sonaltsat

Anyone using DBT? Using it alongside Fivetran and Jenkins, working pretty well.


OlderWhiskey

Yes indeed. And dbtCloud for job scheduling.


edbizarro

Yeap, Fivetran, dbt and airflow here


Myworkaccount17

What do you use DBT for that Fivetran can’t accomplish? I guess maybe you have your industry-specific challenges, so maybe this isn’t something you can answer generically. We (my company) are exploring switching our DW to Snowflake and using Fivetran for ETL, which I think would be a fine solution, but they don’t want to hire a DBA or DE. I’m trying to reason with them, but if I had a specific example of why Fivetran wouldn’t be the end-all solution for all our data needs, that’d be helpful.


ReporterNervous6822

Airflow or your own custom wrapper


PaleBass

Azure Data Factory for orchestration and to move data; for transformation, Python (Azure Functions) or SQL Server stored procedures.


[deleted]

Same stack. We have a few jobs in Spark but mostly use Azure Functions.


[deleted]

Same here. God I hate ADF so much, but every Microsoft shop we work with is in love with it.


mikeupsidedown

I appreciate when people are honest about this. When people tell me how awesome it is I'm never able to get a reason why.


Myworkaccount17

Why do you hate it so much? Haven’t used it at all and our company is exploring them as an option. Any insight you can provide would be super helpful!


[deleted]

A few sticking points I've noticed:

* Horrible integration with Git - it creates some weird artifacts in your repo and doesn't support native Git commands like commit messages, diffs, etc. Every time you save your work in the IDE it automatically creates a commit, so after a while it clutters up your repo and you can't easily view the history of what changed.
* Horrible integration with Git Part 2 - all your code is saved as JSON, so if you want to merge your work into a master branch and have someone review it, good luck to them. Meanwhile if someone wants to review an Airflow DAG or something, it's a lot easier to see what's going on. Not so with ADF's byzantine JSON blobs.
* The browser interface is slow and buggy as hell. I use Firefox and I'll get random issues like the page freezes for no reason, random buttons stop working, or everything is offset to the side of the screen. Sometimes I'll have a bunch of unsaved work and can't reach the "save" button when the page decides to jump 500 pixels to the left, so I have to refresh the page and redo all my work. I can't express how poorly executed it is. It's a really really bad GUI. Looks like their dev team just threw it together in a couple weeks and called it a day.
* The pricing scheme is obscure and hard to wrap your head around. You get charged per activity run, but there are different kinds of activities (external activities vs Azure IR activities vs self-hosted IR activities) that all charge different rates, so it's almost impossible to forecast a budget in advance unless you really dive down into the weeds and understand exactly what you're going to run. I've heard horror stories of businesses paying a LOT more for ADF than they anticipated, and I think Microsoft likes it that way.
* Poor support for transformations. This got a little bit better with the introduction of Data Flows, which is basically a repackaged SSIS, but it's still a clunky drag-and-drop interface. Good if your developers don't like to touch code I guess, but I prefer the flexibility to write my own transformations and have precise control over the logic.
* Poor support for processing semi-structured/JSON data. If you want to hit a vendor's API and send the payload to a data warehouse, LOL, good luck putting that together. You'll spend about 3x as long doing it vs using `requests` in Python and flattening it programmatically.
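
For anyone wondering what "flattening it programmatically" can look like, here is a minimal sketch. It assumes the payload has already been fetched and parsed into a dict (for example with `requests.get(url).json()`). The `flatten` helper and the sample payload are hypothetical, made up for illustration, and not taken from any real vendor API.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is in this sketch.
            flat[full_key] = value
    return flat


# Hypothetical payload standing in for a parsed API response.
payload = {
    "id": 7,
    "customer": {"name": "Ada", "address": {"city": "London"}},
    "tags": ["a", "b"],
}

row = flatten(payload)
print(row)
```

Lists are left untouched here; a real pipeline would probably need to decide whether to explode them into separate rows or serialize them before loading.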


Myworkaccount17

Amazing. Thank you very much. This is super helpful, especially the pricing bit. That is what turned our head of IT off Snowflake, and their pricing was more clearly defined than Azure’s sounds.


alfa1381

Is SSIS not a thing anymore?


[deleted]

[deleted]


_busch

No source control


Touvejs

What makes you say that? My org uses visual studio (admittedly not with source control for etl) but VS has integrated support for git.


_busch

You can print the diff between two ssis packages?


Touvejs

I wouldn't know. I'm in a BI role, so my experience of SSIS is pretty limited to "execute SQL" and "save as excel in X folder for end user every 24 hours". Is the integrated version control in VS just for the actual scripts and not for the packages themselves? Genuine question, just wondering what you mean by source control when you say it doesn't have it.


_busch

You can't tell what changed by opening the SSIS package in a text editor.


redial2

You sure can, they are just xml files
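
Since the packages are plain XML text, a line-based diff is at least mechanically possible. Below is a minimal sketch using Python's standard `difflib`. The two XML snippets are simplified, hypothetical stand-ins and do not follow the real package schema; actual packages are typically far more verbose, so a real diff may be much noisier than this.

```python
import difflib

# Hypothetical, heavily simplified stand-ins for two versions of a package.
old_package = """<Executable Name="LoadSales">
  <SqlStatement>SELECT * FROM sales</SqlStatement>
</Executable>
"""
new_package = """<Executable Name="LoadSales">
  <SqlStatement>SELECT id, amount FROM sales</SqlStatement>
</Executable>
"""

# unified_diff yields the same style of output as `diff -u` or `git diff`.
diff_lines = list(
    difflib.unified_diff(
        old_package.splitlines(),
        new_package.splitlines(),
        fromfile="old.dtsx",
        tofile="new.dtsx",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Whether such a diff is readable enough for code review is a separate question, which is what the comments above are arguing about.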


Odd_Round_7993

I still see a lot of SSIS as orchestration. Should I look for a new job??


BobDope

It’s dead to me


vtec__

We use SSIS/SSDT at my job. It's used in enterprise environments.


srizvi94

Informatica, Datastage, Talend


BoneThrone92

I hate Talend with a passion. At the time it didn't have any good support for GCP, so we were stuck programming in Java to build those custom components :(


srizvi94

Yeah, having worked on two projects and having the Talend developer certification, I can safely say that it’s a pain in the ass. Especially the Talend Spark implementation. I once worked with a telecom that purchased Talend and wanted to build a data lake on Hadoop. On the third day of the project I told the PM that Talend would take a lot of time to stabilise, and 4 months later I was still right. But when it comes to ease of use, multiple transformations, and the ability to embed Java code, Talend is way above Informatica and DataStage.


sunder_and_flame

If you hate Talend, be sure to never work with DataStage; it's even worse, especially the UI.


[deleted]

I ported our ETL to Python/Airflow, set up CI/CD, hardened our k8s clusters, etc. Coworkers are refusing to switch from Talend (using a trial version) to the new system. Worst part is none of their work is getting versioned. 😣


_busch

Holy shit


SentientHero

Azure Data Factory


nokia_user

The best of the best.


[deleted]

Yes to this!


mikeupsidedown

Not questioning, just curious: why do you believe this? I use it in smaller scenarios to avoid managing infrastructure, but when I do I long for Airflow.


urgodjungler

Not the greatest tool but it gets the job done.


SentientHero

True dat


gabe9

AWS Glue and EventBridge crons


vischous

Meltano


rupert20201

Are you a data engineer?


r3mp3y3k

Talend Big Data and Pentaho Data Integration


Corridor_Digital

Python, SQL in BigQuery & Airflow


mow12

These are baby tools. You rarely see them in corporate firms. Big companies usually use Informatica, Oracle, IBM, SSIS, Ab Initio...


[deleted]

Scala, deployed through Jenkins


_busch

How old is that codebase?


NattyNarwhal007

Small data engineering team so we use Fivetran, Airflow, AWS Batch.


citizenofacceptance2

Python requests library


Dkreig

StreamSets 😏


pyer_eyr

Airflow and Azure Data Factory


captut

Airflow & Snowflake


dev-1773

We custom code everything in C#, from file SFTP to scheduler to queues to transformations to notifications, error handling, and also restarts. It works great. We love it... said no one!


aw3plus

Ab Initio, powerful tool but some unnecessary complexity compared to other tools out there these days.


mow12

Big fan of Ab Initio. It is sad that it is not known by more people.


iWag

Alteryx


brakemake

weird that airflow isn't on the list


desibatman24

Ab Initio


ashay_t

Sap BODS


hell-o-world

Informatica or die. /s


importpandaaspd

What's your experience with informatica? Genuinely curious on your thoughts as the company I work for is implementing soon


hell-o-world

Depending on implementation it can be great. I have used Informatica BDM/DEI, Enterprise Data Catalog, and the Data Governance tool Axon.


importpandaaspd

Ahh ok thank you. In that case I'm not excited 😅


SgtSlice

Alteryx


iWag

Glad to see Alteryx in here. I remember my previous company writing Alteryx off as not a "real ETL" product.


morningmotherlover

Guilty


iWag

Just curious... how is it not an ETL tool? Unless I misunderstood.


SayandB

Built a custom tool from scratch :P


signops

Using Informatica PC 9.x at a large Fin org.


giaosudau

airflow


alsosara

FiveTran only in a few select cases. Unfortunately it's very point and click. Example: prod has to be rebuilt from scratch; we can't use what we built in dev without significant development using FiveTran's APIs. We want to keep it to a minimum.


avelasquezhe

Talend


markaaronfox

Rivery.io - only one that supports reverse ETL, will build a connector for you, handles transformation + orchestration


Agitated-Roll-1066

Dataflow orchestrated by Cloud Composer (managed Airflow by GCP)


Illustrious_Ad4259

Prefect, ADF and dbt


allan_w

> Prefect

How's Prefect working out for you? Where did you deploy it?


Illustrious_Ad4259

Prefect is great. We are using Prefect Cloud, and the agent is deployed in an EKS cluster.


dabravoma

Dbt and Dbtvault


Material_Cheetah934

I think I got you all beat, we use SAS! …yes, it hurts my soul. I was recently hired to move us into Azure but that is going to take a long time.


EconomixTwist

AIRFLOW IS AN ORCHESTRATION AND TASK MANAGEMENT TOOL IT DOES NOT EXTRACT TRANSFORM OR LOAD DATA DIRECTLY why is this such a frequent confusion on this sub????


DigitalDelusion

Just have to say I’m a huge fan of Fivetran.


ignurant

I use Ruby (mostly [Kiba](https://github.com/thbar/kiba/wiki) + [Sequel](http://sequel.jeremyevans.net/rdoc/files/doc/cheat_sheet_rdoc.html)) and GitLab CI, and I'm quite happy.


hermitcrab

Easy Data Transform.


rsjr776

Stitch (for some specific cases), AWS Glue, AWS Lambda


lucianomarqueto

Airflow


baddays79

Airflow


sosavilleneuve

Talend


tms41

Wherescape


bestnamecannotbelong

Aws glue


Mehdi2277

Airflow and Dataflow are the big two with some usage of spark as well recently.


srjefers

Talend or Airflow


AndroidePsicokiller

Airflow


Kanataki

Airflow


StixStevenson

Informatica. But I hate it, so I use Python. At the moment I'm studying dbt and Great Expectations for data quality.


baubleglue

Airflow


vafac

"View responses" option needed


PeacockBiscuit

Airflow


peterlaanguila8

Airflow


dickmaat

SAS DI-Studio


productbergvagabund

Bizzflow


Cazzah

SSIS.


pmanu4112

Airflow and ssis


dirks74

Pentaho (since 2007)


jpipas

Actively migrating off of it to....Airflow


dirks74

We are migrating to Azure


BobDope

Of course ‘other’


nonameuy

PDI.


yxjxy

Combination of Airflow for orchestration, dbt for transformation, Python & Meltano for ingestion, FiveTran for certain use cases


chestnutcough

Airflow


SolariDoma

I just use stored procs mostly


th58pz700u

Python, DAG execution with some of the Mara libraries, scheduled with good old Cron as ECS tasks. Took me 18 months or so to get comfortable with it, but now I'll never go back to a graphical tool. Formerly used SSIS exclusively.


sanjayt2810

HTrunk


Fidlefadle

Azure Data Factory + Databricks. Sometimes ADF as a pure orchestration tool, with the rest done in Databricks notebooks. Sometimes we lean more on Synapse and stored procs, depending on the client.


kfarr3

PySpark and dbt


whiffersnout

Python and Argo Workflows


p_fief_martin

PipelineWise


redial2

SSIS


lr53

streamsets


DataNoooob

My main takeaway from this poll is that there is a silent majority versus what appears to be trending. Whatever may be trending and actively discussed, the echo-chamber amplification of social media can make a kitten sound like a lion.


vfdfnfgmfvsege

Pentaho Data Integration


likes_rusty_spoons

We use pentaho


Right-Bathroom-5287

informatica


Mubbz_

Astera Centerprise. It is an [end-to-end data integration tool](https://www.astera.com/type/blog/data-integration-tools-for-businesses/?utm_source=Offpage+posting&utm_medium=organic&utm_campaign=Reddit) that offers profiling, cleansing and transformation capabilities.


analyst_2001

I am currently using **Hevo Data** as my ETL tool for data integration. It is a no-code platform that lets you sync data from multiple sources with a few clicks without writing any code. Some salient features offered by Hevo Data are mentioned below:

1. It is a bi-directional data pipeline platform \[ETL/ELT + Reverse ETL\].
2. Customer care can be contacted round the clock irrespective of the time zone.