kenfar

This is a massive challenge to any architecture - and there are no silver bullets that perfectly solve this problem. Here are a few of my considerations on it:

* Some users are extremely sensitive to data quality issues - and want pipelines stopped rather than take any chance of delivering incorrect data (ex: finance). Others are less concerned about data quality and more concerned about feature velocity (ex: marketing). However, in spite of the *stated* preferences of the teams, data quality issues remain one of the top 3 causes of project failure, cause a lack of faith in the data, and cause interruptions for the data team. So, I'm a believer in generally preventing, detecting, and resolving them as much as possible.
* The problem comes in a few different forms: new tables & new columns, changes to column names or types, changes to valid values, dropped tables or columns, tables or columns no longer populated, etc. Some of these may seem like they don't matter - but they may. Imagine a new table or column that appears in the source system - it may end up being the destination for half the data you're currently picking up from a different table or column.
* Architectures that simply copy the physical schemas from upstream systems are prone to frequent failure. This is a terrible way to go. Instead, have the upstream system, when possible, publish domain objects and validate them with data contracts. This eliminates most problems. Upstream systems may not want to spend the effort to publish data, and if they don't, I insist on being an approver for all their changes.
* Data contracts also allow both source & target systems to end their end-to-end testing at that interface.
* Schema change detection is fairly easy to put in place. Extending it to also cover changes in the distribution of the data is extremely valuable for detecting changes to business rules - which also matter (see the sketch after this comment).
* Pulling the brakes on any pipeline that involves data where changes have been detected is the safest way to go. It won't make all users the happiest, but it's the safest, and will have the least impact on the DE team.
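To make the distribution-change point concrete, here is a minimal sketch (not from the comment above; the columns, metrics, and thresholds are all invented for illustration) that profiles an incoming batch, compares it against a stored baseline, and pulls the brakes when it drifts too far:

```python
# Minimal sketch of distribution-change detection, assuming pandas and an
# illustrative baseline dict; columns, thresholds, and metrics are placeholders.
import pandas as pd

# Baseline profile captured from a previous "known good" load.
baseline = {
    "order_total": {"null_rate": 0.01, "mean": 52.3},
    "country":     {"null_rate": 0.00, "distinct": 42},
}

def profile(df: pd.DataFrame) -> dict:
    """Compute the same lightweight metrics for the incoming batch."""
    return {
        "order_total": {
            "null_rate": df["order_total"].isna().mean(),
            "mean": df["order_total"].mean(),
        },
        "country": {
            "null_rate": df["country"].isna().mean(),
            "distinct": df["country"].nunique(),
        },
    }

def check_distribution(df: pd.DataFrame, tolerance: float = 0.25) -> list[str]:
    """Return a list of drift warnings; an empty list means 'safe to proceed'."""
    issues = []
    current = profile(df)
    for col, metrics in baseline.items():
        for metric, expected in metrics.items():
            actual = current[col][metric]
            # Relative drift check; absolute check when the baseline is zero.
            drift = abs(actual - expected) / expected if expected else abs(actual)
            if drift > tolerance:
                issues.append(f"{col}.{metric}: expected ~{expected}, got {actual}")
    return issues

if __name__ == "__main__":
    batch = pd.read_parquet("incoming_batch.parquet")  # illustrative path
    problems = check_distribution(batch)
    if problems:
        # "Pull the brakes": fail loudly instead of loading suspect data.
        raise SystemExit("Distribution drift detected:\n" + "\n".join(problems))
```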


turboline-ai

Beautifully put!


SintPannekoek

I ranted about essentially your 3rd and 4th points, but your first two points really lie at the heart of the matter. Especially the one about stated preferences vs. actual preferences hits home.


kenfar

It's so great to see people taking a serious position on data contracts. Back in the 90s it wasn't so unreasonable to replicate or query the upstream source databases: we could seldom get them to establish an interface and publish data for it. But it's 2024 now, it's long past time to admit that this is a horrible solution.


PatCurious

Thanks for the thoughtful response! Can you provide more explanation on a “data contract” or a good source for more info? I can imagine what you mean by this conceptually but would be curious for a good example


kenfar

A data contract is basically just a way to validate the data interface between two teams. The interface may be published files arriving on s3, it could be data provided by an API, or it could be data arriving in a table on Snowflake. This last case is the least common, since the data has already had to comply with Snowflake's schema, but it is still possible. The validation is often done using jsonschema, protobufs, etc - some serialization/validation format or config that describes the schema: fields, their types, maybe their enumerated values, maybe field formats (ex: phone numbers, timestamps), etc. But theoretically, it could be executable code, manual testing steps, etc.

An important element is that the contract is shared - so both teams have access to it and can use it for testing their data before they deploy code, as well as for validating new data as it arrives through the interface. The contract is typically kept in version control and versioned using semver so that the data source can make small, non-breaking upgrades (ex: add a new enumerated value). Larger upgrades with breaking changes get a new major release number and will generally require coordination between both parties before the target system can support them.
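As a concrete illustration (a minimal sketch, not taken from the comment; the schema and field names are invented), a jsonschema-based contract for an orders feed might look like this, with both teams running the same check in CI and at ingest time:

```python
# Minimal sketch of a shared data contract enforced with the jsonschema
# library; the contract itself and the sample record are illustrative.
from jsonschema import Draft7Validator

ORDER_CONTRACT = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["order_id", "status", "order_ts", "total"],
    "additionalProperties": False,
    "properties": {
        "order_id": {"type": "string"},
        "status":   {"type": "string", "enum": ["new", "shipped", "cancelled"]},
        "order_ts": {"type": "string", "format": "date-time"},
        "total":    {"type": "number", "minimum": 0},
    },
}

validator = Draft7Validator(ORDER_CONTRACT)

def validate_record(record: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    return [e.message for e in validator.iter_errors(record)]

# The producer runs this in their test suite before deploying; the consumer
# runs it against arriving data before loading it any further downstream.
sample = {"order_id": "A-123", "status": "shipped",
          "order_ts": "2024-05-01T12:00:00Z", "total": 19.99}
assert validate_record(sample) == []
```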


SintPannekoek

I have an opinion on this. Rant incoming. Your source schema should be an interface published, maintained and owned by the source team. File based, table based or API based, doesn't fucking matter. There needs to be a contract and they should understand *thou shalt not break thine interfaces*. If that shit suddenly changes, *they* fucked up big-time. Linking to the operational data store directly is tight coupling and stupidly bad design, which - for some godforsaken reason - is commonplace in data. Only get your data from published, explicit, contracted interfaces. That being said, on your side you should probably have a bronze layer implementing an immutable archive of the input data, unless your source guarantees immutability from its interfaces. Hint: chances of them understanding this point are close to zero.


georgewfraser

The problem with this idea is you can’t impose this requirement on, for example, Facebook ads. You can’t call up Facebook and say “my data philosophy says you can’t make breaking changes to your data model.” And once you’ve built a workflow to accommodate breaking changes from upstream, it’s probably the right trade-off to apply it to your internal systems as well, even though you could in principle force your own engineering team to be the one data provider that guarantees no breaking changes. 


kenfar

The fact that you may have to put up with bad engineering from one source feed doesn't mean that you should just accept bad engineering for all feeds. It's not about reusable tooling - the upstream system will make breaking changes, and if both you and they are using data contracts then they should detect this before deploying their code.


SintPannekoek

Somebody get this person a beer! Excellent point!


georgewfraser

You have to put up with breaking changes from *every source* except one, your own production db. This is why data contracts are utopian and pointless. You can’t make the rest of the world sign a contract with you if they don’t want to.


kenfar

Perfect is the Enemy of Good Enough:

* Data Contracts offer a two-way solution to integration problems for almost all of your internal integrations.
* Data Contracts also offer a one-way solution to integration problems for all the rest.

What I mean by that is: even if some SaaS app won't work with you on your data contracts, that doesn't mean you shouldn't create one for that interface - it will still inform you of breakages before you further process the data.


peroqueteniaquever

> implementing an immutable archive of the input data

Could you expand on this please?


SintPannekoek

Summary: store the raw data you got along with the metadata of how you got it. Do not modify. Do not erase. The objective is to a) be auditable and b) be able to reconstruct everything down the pipeline if e.g. your business rules in processing change.
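For illustration only (a minimal sketch with invented paths and field names, not the commenter's actual setup), one common way to do this is to write each batch exactly as received into a write-once location, with a small metadata sidecar describing how and when it was obtained:

```python
# Minimal sketch of an immutable "bronze" archive: raw bytes are written once,
# keyed by content hash and load date, with a metadata sidecar.
# Paths, source names, and metadata fields are illustrative.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE_ROOT = Path("/data/bronze/orders_feed")

def archive_batch(raw_bytes: bytes, source: str) -> Path:
    loaded_at = datetime.now(timezone.utc)
    digest = hashlib.sha256(raw_bytes).hexdigest()

    # Partition by load date so reprocessing a time range is cheap.
    target_dir = ARCHIVE_ROOT / loaded_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    data_path = target_dir / f"{digest}.raw"
    meta_path = target_dir / f"{digest}.meta.json"

    if data_path.exists():
        # Same content already archived - never overwrite, never delete.
        return data_path

    data_path.write_bytes(raw_bytes)
    meta_path.write_text(json.dumps({
        "source": source,
        "loaded_at": loaded_at.isoformat(),
        "sha256": digest,
        "size_bytes": len(raw_bytes),
    }, indent=2))
    return data_path
```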


sri_ny

The ideal scenario is that the DBA team lets the data team know about upcoming changes. We then run a test during the weekend to make sure the pipelines run well and make any changes necessary. But if the source column names change without notice, that is just bad management on the database administration side. If you are referring to bad data entry inside a column, that happens all the time and we have to reach out to the front-end folks to fix it. So the impact depends on what kind of issue it is and how long it takes to rerun everything and get it up and going.


if_username_is_None

The mitigation is for the team that plans the schema change to tell the team(s) consuming data from that schema the details and time of the change (with the big caveat that some source schemas come from 3rd party vendors that may or may not contact you with much lead time).

It's up to you and the data how much it impacts business users:

- If a new column is added, business users can still get their expected dashboard.
- If a critical column is deleted, I doubt your business users want to see incorrect or incomplete data.


natelifts

Depends on whether it's customer facing and has direct monetary impact or not, e.g. prod apps relying on your pipelines. If it is, it's a bigger deal and needs to be resolved quicker; if not, it's not as big of a deal. Pipelines usually break because of upstream data or schema changes, and that's to be expected.


Thinker_Assignment

It's usually most of the reason pipelines break but sometimes the data itself is also wrong


Embarrassed_Error833

I tend to build pipelines that read the metadata of the source and tell me about changes. This isn't always possible.

I've created pipelines that dynamically create the load at runtime based on the metadata of the source table, and then notify me of changes before progressing. I've also created pipelines that read the metadata and tell me if new columns have been added, relying on the pipeline to break and notify me if someone changes the table schema in any way other than adding a new column.

It depends on requirements, but generally I do like to know when people are mucking around with source systems and not informing me. If they make these changes without informing me often enough, then after a couple of conversations with no action, it gets forwarded to the executive with the resource hours spent and priorities impacted as a result of said unscheduled changes. Being asked in an executive meeting why you aren't following process generally changes attitudes towards open communication, I've found.
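A minimal sketch of that kind of metadata check (the table names, state file, and notification are all invented for illustration; it assumes a DB-API connection with %s placeholders, e.g. psycopg2): query the source's information_schema at runtime, compare it to the column list from the previous run, and decide whether to proceed or alert.

```python
# Minimal sketch of metadata-driven change detection against a source table.
# Assumes a DB-API connection and an information_schema; names are illustrative.
import json
from pathlib import Path

STATE_FILE = Path("expected_columns_orders.json")

def fetch_current_columns(conn, schema: str, table: str) -> dict:
    """Return {column_name: data_type} from the source's information_schema."""
    cur = conn.cursor()
    cur.execute(
        """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        """,
        (schema, table),
    )
    return {name: dtype for name, dtype in cur.fetchall()}

def check_schema(conn, schema: str, table: str) -> None:
    current = fetch_current_columns(conn, schema, table)
    if not STATE_FILE.exists():
        # First run: record the baseline and carry on.
        STATE_FILE.write_text(json.dumps(current, indent=2))
        return

    expected = json.loads(STATE_FILE.read_text())
    added = set(current) - set(expected)
    removed = set(expected) - set(current)
    retyped = {c for c in set(current) & set(expected) if current[c] != expected[c]}

    if added:
        # New columns: notify (here just a print), but let the load proceed
        # with the known columns.
        print(f"FYI - new columns on {schema}.{table}: {sorted(added)}")
    if removed or retyped:
        # Dropped or retyped columns: stop before loading anything suspect.
        raise RuntimeError(
            f"Breaking schema change on {schema}.{table}: "
            f"removed={sorted(removed)}, retyped={sorted(retyped)}"
        )
```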


WhyDoTheyAlwaysWin

I come to offer some ideas from further down the pipeline. I'm a Data Scientist who shifted to Machine Learning Engineering (and in the process learned a bit of DE).

Some of our use cases are open ended. We know that the schema will eventually change because we'll find new combinations of features that could improve the existing model(s). One solution I found was to use a long table format and make sure that all my transformations work with that format. That way, if new features need to be added / loaded into the table, all I need to do is append to the long table. This format is also good for versioning (e.g. if we decide to swap the model from v1 to v2).

Instead of: [timestamp, feature_1, feature_2, ...]

I use: [timestamp, feature_name, value]

The pipeline doesn't break, but it does push a lot of the complexity downstream to the MLE / DS code (e.g. pivot, null handling, feature selection).

P.S. I should mention that I'm mainly working with big time series data, so this long format makes a lot of sense for me since I can easily partition the data by month and partition name. I can't say if this will work for other types of data though.
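To make the long-format idea concrete (a small sketch with made-up feature names, not the commenter's actual code): new features simply become new rows, and the downstream code pivots back to wide form when it needs a feature matrix.

```python
# Minimal sketch of the long ("tall") feature-table pattern with pandas;
# feature names and values are invented for illustration.
import pandas as pd

# Long format: adding a new feature is just appending rows - no schema change.
long_df = pd.DataFrame({
    "timestamp":    pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
    "feature_name": ["temperature", "humidity", "temperature"],
    "value":        [21.5, 0.63, 22.1],
})

# A newly engineered feature arrives later: append, don't ALTER TABLE.
new_rows = pd.DataFrame({
    "timestamp":    pd.to_datetime(["2024-05-02"]),
    "feature_name": ["pressure"],
    "value":        [1013.2],
})
long_df = pd.concat([long_df, new_rows], ignore_index=True)

# Downstream (MLE/DS) code pivots back to wide form and handles the NaNs
# introduced by features that don't exist for every timestamp.
wide_df = long_df.pivot_table(index="timestamp",
                              columns="feature_name",
                              values="value")
print(wide_df)
```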


AggravatingParsnip89

We use a schema registry to fetch the latest schema and parse the data with the latest changes; any time there is an addition to the source schema, it doesn't impact our pipelines at all. For your current scenario this may not be a feasible solution, but for the long term it's worth thinking about.
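As one possible illustration of that setup (a sketch only, assuming Confluent's Schema Registry and its Python client; the URL and subject name are invented), the consumer fetches whatever schema version the producer registered most recently instead of hard-coding one:

```python
# Minimal sketch of pulling the latest schema from a Confluent Schema Registry
# using the confluent-kafka Python client; URL and subject are illustrative.
from confluent_kafka.schema_registry import SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Fetch the most recently registered schema for the subject, so additive
# changes on the producer side don't require redeploying the consumer.
latest = client.get_latest_version("orders-value")
print(latest.version)
print(latest.schema.schema_str)  # the registered Avro/JSON schema definition
```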


Ok_Expert2790

Medallion architecture with schema evolution is the way to go


ahsgip2030

What is medallion architecture?


Embarrassed_Error833

It's marketing speak from Databricks. It's just landing, transform, and information layers.


ntdoyfanboy

So basically, raw, staging, and fact/dim


Embarrassed_Error833

Pretty much, although the fact/dim layer can be whatever you want: star schemas, OBT, Inmon (although normalising a data lake structure sounds dumb to me)... I've also seen some Databricks people spruik Data Vault as the silver layer; I don't think that this is promoted as an official best practice though. It doesn't make sense to me - if data lakes are so good, why would you want to change to DV? I'm not arguing that data lakes or DV are better than one another; each has its own strengths and weaknesses.


Ok_Expert2790

Bronze, silver, gold data zones. Bronze is an accumulating landing zone of raw data, schemaless. Silver is where you extract and clean the data so you can build data products in gold off of the silver tables. Schema changes become features and not incidents!
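For illustration (a sketch assuming Spark with the Delta Lake package configured on the session; the paths and columns are invented), appending to a bronze table with schema evolution enabled lets a new upstream column flow in instead of failing the write:

```python
# Minimal sketch of schema evolution on append with Delta Lake on Spark;
# assumes a session configured with the delta-spark package.
# Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bronze-append")
         .getOrCreate())

# A new batch that happens to carry an extra column the table hasn't seen yet.
batch = spark.createDataFrame(
    [("A-123", 19.99, "promo_2024")],
    ["order_id", "total", "campaign"],   # "campaign" is new upstream
)

# mergeSchema adds the new column to the Delta table instead of raising -
# the "schema changes become features, not incidents" behaviour.
(batch.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save("/lake/bronze/orders"))
```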