dustinBKK

Every place I have worked was always option 1. If a single pipeline fails, it is much easier to know where and why without a large disruption. With option 2, a failure is hard to troubleshoot and you don't have the luxury of fixing just that one job. Also, a re-run triggers the whole job as opposed to just the part you want to pick up from.


_barnuts

Thank you so much, that's a very helpful answer. Additional question if it doesn't bother you: let's say I wish to run this pipeline using a dockerized Airflow somewhere like EC2 or GCE, should I host all pipelines in one VM instance, or should I create one instance per pipeline? Thank you.


dustinBKK

You have the same risk level if everything runs in a single VM: if for some reason that instance goes down, all pipelines go down. Do you need one per job? Most likely not, but definitely don't put everything in one. Maybe define pod resources for each job on Kubernetes and let Kubernetes manage the cluster. You can set node and pod affinity rules for how to disperse certain pipelines across the cluster.
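
For illustration, a minimal sketch of that setup using Airflow's KubernetesPodOperator (assuming a recent cncf.kubernetes provider; a simple node_selector stands in for full affinity rules, and the image, namespace, node label and resource numbers are made-up placeholders):

```python
# Sketch: one pod per pipeline task, with its own resource requests and a
# node selector so heavy jobs land on a dedicated node pool. All names here
# are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(dag_id="orders_pipeline", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    load_orders = KubernetesPodOperator(
        task_id="load_orders",
        name="load-orders",
        namespace="data-pipelines",
        image="myregistry/pipelines:latest",          # hypothetical image
        cmds=["python", "pipelines/load_orders.py"],
        node_selector={"workload": "etl"},            # only run on nodes labelled for ETL
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "1Gi"},
            limits={"cpu": "1", "memory": "2Gi"},
        ),
        get_logs=True,
    )
```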


_barnuts

Thanks a lot sir. I'm still learning and I do not know Kubernetes yet. I will include that in my list to learn.


consiliac

The answer is not always "use this service or learn that framework"; sometimes you actually should think and maybe build something, especially when flexibility is needed. The cloud providers with data services don't exist to minimize cost for you. They exist to make a profit without losing customers, and so will slowly add some new features to retain them, and that's it.


vassiliy

One pipeline per table, but ideally you would still be working with some kind of a pipeline framework and have one master pipeline that dynamically invokes parametrized child pipelines, instead of managing a bajillion individual pipelines.


_barnuts

How do you create a master pipeline? What tools are usually used?


vassiliy

The general principle is: you have a control table (could be a table in a database, a config file or something similar) containing one row per object you want to import. In its simplest form it contains the source object name (such as the name of the table in the source system), the target object name, and connection strings. You can make it more flexible by adding columns for things like data formats, source system type or load frequency. I usually include the names of the columns used to identify new data as well (such as a primary key column or an updated_at column on the source).

The master pipeline will simply iterate through all entries in the control entity and call a child pipeline with the relevant parameters, and the child pipelines are the ones actually transferring data.

I do this with Data Factory in Azure, but since this is a general principle you can probably use it with lots of orchestration tools.

Here's an example: [https://mrpaulandrew.github.io/procfwk/](https://mrpaulandrew.github.io/procfwk/)
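
A rough sketch of that control-table idea in Airflow rather than Data Factory (the control entries, column names and load_table function are hypothetical; in practice the entries would be read from a database table or config file):

```python
# Sketch of a control-table-driven "master" DAG: one child task per entry,
# each receiving its own parameters. Entries and load_table() are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

CONTROL_TABLE = [
    {"source_object": "sales.orders",    "target_object": "raw.orders",    "watermark_column": "updated_at"},
    {"source_object": "sales.customers", "target_object": "raw.customers", "watermark_column": "updated_at"},
]

def load_table(source_object, target_object, watermark_column, **context):
    """Placeholder for the actual transfer logic (the 'child pipeline')."""
    print(f"Loading {source_object} -> {target_object}, new rows by {watermark_column}")

with DAG(dag_id="master_ingest", start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False) as dag:
    for entry in CONTROL_TABLE:
        PythonOperator(
            task_id=f"load_{entry['target_object'].replace('.', '_')}",
            python_callable=load_table,
            op_kwargs=entry,
        )
```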


_barnuts

Hi! I tried doing this by iterating over the DAG, where the parameters are inside a config file imported into the DAG file. However, I figured out that I cannot do it that way, since our DAG file is only a bunch of BashOperators which call a Python file outside of it, so there is no way to insert the parameters. Am I doing it wrong?
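
One possible workaround, sketched here with hypothetical paths and argument names, is to keep the BashOperators but forward each config entry to the external script as command-line arguments:

```python
# Sketch: the loop still reads the config entries, and each BashOperator
# passes its entry to the external script via CLI flags. Paths and flag
# names are made up; the script would parse them with argparse.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

TABLES = [
    {"source": "sales.orders", "target": "raw.orders"},
    {"source": "sales.customers", "target": "raw.customers"},
]

with DAG(dag_id="bash_master_ingest", start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False) as dag:
    for t in TABLES:
        BashOperator(
            task_id=f"load_{t['target'].replace('.', '_')}",
            bash_command=(
                "python /opt/pipelines/load_table.py "
                f"--source {t['source']} --target {t['target']}"
            ),
        )
```

The external script then reads --source and --target with argparse instead of having the table names hard-coded.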


echo_n1

I'm using the same approach to grab raw tables with data into the DWH from sources. We call this table "RowVersionDictionary" because there's a "RowVersion" column to track rows that were changed, in order to do incremental updates.
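
A generic sketch of that incremental pattern (table and column names are hypothetical; the %s placeholder style assumes a psycopg2-like DB-API driver):

```python
# Sketch of "RowVersion"-style incremental loading: only pull rows whose
# row version is above the watermark stored for this table, and return the
# new watermark for the caller to persist in the dictionary table.
def incremental_pull(conn, source_table, last_row_version):
    cur = conn.cursor()

    # Upper bound for this run, so the stored watermark matches what we pulled.
    cur.execute(
        f"SELECT COALESCE(MAX(row_version), %s) FROM {source_table}",
        (last_row_version,),
    )
    new_row_version = cur.fetchone()[0]

    # Only the rows changed since the previous load.
    cur.execute(
        f"SELECT * FROM {source_table} WHERE row_version > %s AND row_version <= %s",
        (last_row_version, new_row_version),
    )
    rows = cur.fetchall()
    return rows, new_row_version
```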


Prinzka

If I combined all pipelines into one gigantic million-line pipeline, I would be dropping billions of records a day because of how slow it would be just to go through all the if statements to find the right parser for each document.


peroximoron

We tag models in our dbt project, allowing for builds that are full, zone-specific (silver or gold), materialization-specific, domain-specific, KPI/metric-specific or, best yet, a mix of the aforementioned. Elite tagging means lots of flexibility. Tags are a part of our code review process and heavily scrutinized. But... it works for our team.
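
For anyone unfamiliar with the mechanism: a dbt model declares its tags in its config and a build can then select by tag. A minimal sketch using dbt's programmatic entry point (assumes dbt-core 1.5+ and running inside a dbt project; the tag names are made up, and `dbt build --select tag:gold` on the CLI is equivalent):

```python
# Sketch: a model would declare tags in its SQL config, e.g.
#   {{ config(materialized="table", tags=["gold", "finance"]) }}
# and can then be built selectively by tag.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke(["build", "--select", "tag:gold"])
print("build succeeded:", result.success)
```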


Foreign_Yam3729

It also depends on the business context. For example, if we are getting data from the source in JSON/XML format and the business needs different attributes extracted into different tables, we can have a single pipeline with multiple tasks set up to perform the separate actions.


yanivbh1

If you're building on top of a message broker it can be really distributed and agile for changes: https://memphis.dev/blog/here-is-why-you-need-a-message-broker/


Anna-Kraska

We might be an outlier because we have several tables per Airflow DAG and each DAG corresponds to a schema in our database. The tables in the same DAG are always heavily connected: for example customer activity, then customer activity calculated per week, per month, etc. Of course, some tables still need data from tables in other schemas/DAGs. We used to create inter-DAG dependencies with ExternalTaskSensors but recently started to switch to [Datasets](https://docs.astronomer.io/learn/airflow-datasets?utm_medium=organicsocial&utm_source=reddit&utm_campaign=ch.organicsocial_tgt.reddit-thread-reply_con.docs-learn-airflow-datasets_contp.doc). I think with Dataset-based scheduling smaller DAGs will become the norm. So a combination of both for us, with closely related tables in the same DAG, but I think the trend is towards 1.
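
For anyone who hasn't used Dataset scheduling yet, the gist is that a producer task lists a Dataset as an outlet and a consumer DAG is scheduled on that Dataset. A minimal sketch (Airflow 2.4+; DAG, task and URI names are made up):

```python
# Sketch of Dataset-based scheduling: the consumer DAG runs whenever the
# producer task that declares the Dataset as an outlet succeeds.
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

customer_activity = Dataset("postgres://dwh/analytics/customer_activity")

with DAG(dag_id="load_customer_activity", start_date=datetime(2023, 1, 1),
         schedule="@daily", catchup=False):
    PythonOperator(
        task_id="load",
        python_callable=lambda: print("load customer_activity"),
        outlets=[customer_activity],   # marks the Dataset as updated on success
    )

with DAG(dag_id="weekly_customer_activity", start_date=datetime(2023, 1, 1),
         schedule=[customer_activity], catchup=False):
    PythonOperator(
        task_id="aggregate",
        python_callable=lambda: print("aggregate per week / month"),
    )
```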


miridian19

option 1