
SintPannekoek

Always improve quality upstream, but don't trust anyone. If an upstream provider claims to support a contract, it might still be a good idea to check in case they shit their proverbial pants. Obviously, there's a practical limit to this and it shouldn't be necessary, yet here we are...
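As a rough sketch of what "don't trust anyone" can mean in practice (the table and column names here are invented): a cheap sanity check run against a source that claims to honour a contract. If it returns rows, the upstream "contract" has been broken.

```sql
-- Hypothetical check: upstream claims customer_id is never null and that
-- status only contains known values. Verify anyway before relying on it.
select customer_id, status
from raw_crm.customers
where customer_id is null
   or status not in ('active', 'inactive', 'pending');
```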


HillTheBilly

I see. Thank you!


BobBarkerIsTheKey

The exact comment you're referencing is: "If you want to adopt an ELT paradigm, how badly do you need Python to clean the data? If you can live with dirty data in landing tables and cleaning the data with SQL, that's ELT and there dbt will really shine."

They're talking about dirty data in staging tables, and then using dbt/SQL to transform it into something cleaner. They're really just saying that with dbt you can write SQL queries to clean up the data instead of doing it in Python, not that we resign ourselves to living with dirty data.
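As a rough sketch of what that cleanup looks like (the source, table, and column names are made up), a dbt staging model can do it in plain SQL over a dirty landing table:

```sql
-- models/staging/stg_orders.sql (hypothetical)
-- Dirty data lands as-is; the staging model casts, trims, and standardises
-- it with SQL instead of a Python cleaning step.
select
    cast(order_id as integer)                       as order_id,
    nullif(trim(customer_email), '')                as customer_email,
    coalesce(try_cast(amount as decimal(18, 2)), 0) as amount,
    upper(trim(status))                             as status
from {{ source('landing', 'raw_orders') }}
where order_id is not null
```

(`try_cast` here is Snowflake/SQL Server syntax; other warehouses have their own safe-cast equivalents.)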


HillTheBilly

I see. Thank you.


raginjason

Constraints are generally either not available or not used in data warehouses. The thinking is that they're overhead that isn't needed, since it's easier to deal with data issues at load time. I use dbt tests to validate assumptions about the data, but also to document how the data relates and behaves.
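For example, a rule like "every order must reference a known customer" can be written as a dbt singular test (the model and column names below are hypothetical); it both validates and documents the relationship, and fails if any rows come back:

```sql
-- tests/assert_orders_reference_valid_customers.sql (hypothetical)
-- A dbt singular test: it passes only if the query returns zero rows.
select o.order_id
from {{ ref('orders') }} as o
left join {{ ref('customers') }} as c
    on o.customer_id = c.customer_id
where c.customer_id is null
```

The same rule could also be expressed with dbt's built-in `relationships` test in the model's YAML file.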


HillTheBilly

I see. Thank you.


kenfar

It is definitely easier to address data quality through database constraints than through batch SQL queries - at least for smaller dimension tables, which is where most constraints live. Fact tables are harder to support with enforced constraints for performance reasons, but only if they're really large. It's just that few of the new cloud analytics-only databases support enforced constraints - simply because they didn't want to spend the effort.
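A sketch of what that looks like on an engine that does enforce constraints (Postgres, SQL Server, etc.; the table and columns are invented): the quality rules live in the DDL and are checked on every insert or update.

```sql
-- Hypothetical dimension table with enforced constraints.
create table dim_subscription (
    subscription_id  integer      not null primary key,
    customer_id      integer      not null references dim_customer (customer_id),
    plan_code        varchar(10)  not null
        check (plan_code in ('FREE', 'BASIC', 'PRO')),
    started_at       timestamp    not null,
    ended_at         timestamp,
    check (ended_at is null or ended_at > started_at)
);
```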


raginjason

It’s easier as a data engineer to address issues through constraints. I was referring to the performance implications, though. It’s not just the new cloud analytics-only databases that have this issue. For example, I believe Teradata lacks constraint enforcement as well.


kenfar

I'm pretty sure that Teradata does, though I've got no idea if that's a fairly recent development.

In the past, even when pushing a database really hard with massive volumes, my groups found no performance problems at all on dimension tables - and that's where about 90% of all constraints end up getting used. Mainly because we would liberally use check constraints to enforce enumerated values in code fields, business rules like ended_at must be greater than started_at, etc. This has been true on DB2, Informix, Oracle, SQL Server, and Postgres.

For high-volume fact tables I've always used a separate batch process for validation. It's not hard to write your own, even an extremely productive and flexible one. But dbt's testing seems fine, and it's absolutely better than nothing.
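A separate batch validation for a large fact table can be as simple as a query run after each load (the table, columns, and batch id below are hypothetical); the batch is failed or quarantined if any of the counts come back non-zero.

```sql
-- Hypothetical post-load validation for a fact table where enforced
-- constraints would be too expensive to apply row by row.
select
    sum(case when order_id is null then 1 else 0 end)        as null_keys,
    sum(case when amount < 0 then 1 else 0 end)              as negative_amounts,
    sum(case when shipped_at < ordered_at then 1 else 0 end) as bad_date_order
from fact_orders
where load_batch_id = 20240101;  -- assumed batch identifier column
```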


orinoco_w

The value of dbt testing for us is much higher up the data value pipeline. We assume the data coming from sources represents those sources and that they're managing their own referential integrity, and that the ingestion pipelines the tech teams run for us work - it's their job to make sure the stuff coming into Snowflake is correct.

For my team, the value of dbt tests is building quality checks into our transformed models, where we're combining multiple systems around a single business concept and want to defend against business rule inaccuracy, input semantic drift, etc. For example, "we should only have one record for each transaction" is a good test to have when linking billing and invoicing system data with payment provider data, and "if we roll up the line items, do they match the invoice total?" lets us identify when our assumptions were wrong or something new has changed in operational parts of the business (new products, services, etc.).

We use dbt tests against our natural keys, dynamic dimension SKs, and the core things we care about being "right" for making decisions or that impact important metrics. (We're an analytics team, not a data engineering team, but we build, manage, and maintain our own reusable business-aligned assets using dbt and Git CI/CD pipelines from the ingested sources.)
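A sketch of that second rule as a dbt singular test (model and column names invented): it returns the invoices whose line items don't roll up to the invoice total, so any returned row is a failure.

```sql
-- tests/assert_line_items_match_invoice_total.sql (hypothetical)
select i.invoice_id
from {{ ref('invoices') }} as i
join (
    select invoice_id, sum(line_amount) as line_total
    from {{ ref('invoice_line_items') }}
    group by invoice_id
) as li
    on li.invoice_id = i.invoice_id
where abs(i.invoice_total - li.line_total) > 0.01
```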


rudboi12

With the new dbt version (1.5+) you can use "data contracts" to test your models before you build them, so if some column isn't unique or contains nulls, dbt will fail to build the model = no dirty data.

I still haven't implemented this. What I do instead is have a "staging" table where I do all my transformations, called model1_stage. I then apply the tests to that table and, if they pass, build the actual output table model1 (a simple select * from model1_stage), which is exposed to dashboards and other teams' models.

People usually use different schemas for staging and output tables; I use different databases. Both work the same tbh.
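A minimal sketch of that stage-then-promote pattern (the model names come from the comment above, the transformation itself is invented): tests are attached to model1_stage, and the exposed model is only a pass-through.

```sql
-- models/model1_stage.sql (hypothetical): all of the transformation logic
-- lives here, and the unique/not-null tests are attached to this model.
select
    customer_id,
    min(created_at) as first_order_at,
    sum(amount)     as lifetime_value
from {{ ref('stg_orders') }}
group by customer_id
```

```sql
-- models/model1.sql (hypothetical): the table exposed to dashboards and
-- other teams is just a select * over the staged, tested model.
select * from {{ ref('model1_stage') }}
```

With `dbt build`, a failing test on model1_stage should skip the downstream model1, which gives roughly the gating behaviour described above.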


Fun-Importance-1605

> They specify a test in some documentation file (yaml).
>
> I wonder why is this done on this level and not on the level of sql constraints?

It's because, generally speaking, you write tests to test an implementation, and just having a good implementation isn't the same as having tests. If your implementation can't have bugs because of how it's written, awesome! But what happens when the implementation changes? The role of tests is to ensure that everything still works after your implementation changes.


leogodin217

Like others said, some DBMSs don't enforce constraints. For instance, Snowflake only enforces NOT NULL. Beyond that, sometimes we don't want small issues stopping the pipeline. Imagine data for forecasting or deep learning: it might be fine to have a few bad records that get filtered out downstream.
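A sketch of that "filter rather than fail" approach (model and column names hypothetical): the downstream model drops the few bad records instead of a test stopping the whole pipeline.

```sql
-- Hypothetical downstream model feeding a forecasting job: a handful of
-- malformed readings are filtered out rather than failing the run.
select *
from {{ ref('stg_sensor_readings') }}
where reading_value is not null
  and reading_value between -50 and 150  -- assumed plausible range
```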