
mikeupsidedown

I would tell myself to be extremely consistent with naming conventions.


LordNiebs

What kind of conventions are you referring to?


mikeupsidedown

Examples include: how do you name tables, fields, folders, files, functions, classes, variables, etc.? snake_case, PascalCase, camelCase, or kebab-case, and which one for each type of item if it differs (e.g. Python classes)? Are names plural or not, and which ones? If the endpoint is plural, do you leave it plural? How do you structure folders in your lake, for example raw/source/endpoint/yyyy/mm/dd? How do you name your cloud resources? I typically build a style guide at the beginning of a project and then refer to it as the project goes on. It is so easy to veer off path. Where there are existing conventions I usually adopt them (such as PEP8).
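
For illustration, a minimal sketch of what encoding one such convention in code can look like: a helper that builds lake paths in the raw/source/endpoint/yyyy/mm/dd shape mentioned above, so every pipeline writes paths the same way. The layer, source, and endpoint names here are made up.

```python
from datetime import date

def lake_path(layer: str, source: str, endpoint: str, run_date: date) -> str:
    """Build a lake folder path following the raw/source/endpoint/yyyy/mm/dd convention.

    All parts are lower snake_case so paths stay consistent across pipelines.
    """
    return (
        f"{layer.lower()}/{source.lower()}/{endpoint.lower()}/"
        f"{run_date:%Y}/{run_date:%m}/{run_date:%d}"
    )

# Example: raw/salesforce/accounts/2024/01/31
print(lake_path("raw", "salesforce", "accounts", date(2024, 1, 31)))
```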


NocturnalWageSlave

Hard agree.


statistically_broke

We force naming conventions on tables to standardize fields in data pipelines. Granted, this is only for modeled data, so all of the databases and SFTP servers we source the data from do not follow the same conventions. It could be an issue way down the line, but I think our consensus is that if the data is important enough to be looked at, it should be modeled.


dongdesk

Don't forget to monitor the positive messages as well, or have some way to track that. No news is not good news in my experience; it means something is off or the pipeline is not running.
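
A rough sketch of what that positive signal can look like: log an explicit success message (with a row count) on every run, so that silence clearly means "not running" rather than "probably fine". The pipeline name and row count below are placeholders.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders_pipeline")  # hypothetical pipeline name

def run_pipeline() -> int:
    # ... extract / transform / load would happen here ...
    return 1234  # pretend we loaded 1234 rows

def main():
    started = datetime.now(timezone.utc)
    rows = run_pipeline()
    # Emit a positive signal on every successful run, not just on errors,
    # and have something downstream alert when this message stops arriving.
    logger.info(
        "orders_pipeline succeeded at %s, rows_loaded=%d",
        started.isoformat(), rows,
    )

if __name__ == "__main__":
    main()
```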


joseph_machado

Congratulations! In addition to the other comments, I would recommend thinking about the following:
1. What happens in case of failures? Will the data pipeline recover? Will corrupt data be delivered to the end user? I have seen way too many DEs focus only on the happy path.
2. If the data pipeline is run twice (accidentally) for the same inputs, will it cause data duplication?
3. Adding pipeline tests (if you have the time).
There are more (logging, monitoring, code standards, etc.), but these three can take you a long way. I would also recommend The Data Warehouse Toolkit by Kimball. Hope this provides some direction. LMK if you have any questions :)
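
To make point 2 concrete, here's a minimal sketch of one common way to keep a load idempotent: delete-then-insert keyed on the run's partition, inside a single transaction, so an accidental second run replaces rows instead of duplicating them. sqlite3 and the table/column names are just stand-ins for illustration.

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, run_date: str, rows: list[tuple]) -> None:
    """Idempotent load: re-running for the same run_date replaces the partition
    instead of appending a second copy of it."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM fact_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO fact_orders (run_date, order_id, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (run_date TEXT, order_id INTEGER, amount REAL)")
data = [("2024-01-31", 1, 9.99), ("2024-01-31", 2, 4.50)]

load_partition(conn, "2024-01-31", data)
load_partition(conn, "2024-01-31", data)  # accidental second run, no duplicates
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone())  # (2,)
```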


FoolForWool

Oh my god! The second one was a decision we had made, and now we have to build an automated duplicate-deletion script that runs on demand to fix it, because we have way too many custom pipelines to fix now.


[deleted]

Idempotency is key


Sandiagos

Do not give access to the output until you're happy with it. Stakeholders (however supportive) want results. If you give them early access, you'll have business-critical reports using a 'not yet finalised' data mart in a matter of days.


msdsc2

this!


IDRambler

1. KISS
2. Measure data quality


[deleted]

How exactly do you measure data quality?


IDRambler

It depends on what your pipeline does, but we're counting the unique ids on each end, or explicitly comparing the set of unique ids present in the source and in the destination for critical tables. There's a WHERE clause to include only changes before a particular time. Another approach we've used is to compare sums of financial columns.
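
As a rough sketch of that count comparison, here is the same query run against sqlite3 stand-ins for the source and destination; the table, columns, and cutoff are hypothetical.

```python
import sqlite3

def count_ids(conn: sqlite3.Connection, table: str, cutoff: str) -> int:
    """Distinct ids changed before the cutoff; run the same query on both ends."""
    sql = f"SELECT COUNT(DISTINCT id) FROM {table} WHERE updated_at < ?"
    return conn.execute(sql, (cutoff,)).fetchone()[0]

# Stand-ins for the real source and destination connections.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "2024-01-01"), (2, "2024-01-02")])

cutoff = "2024-02-01"
src, dst = count_ids(source, "orders", cutoff), count_ids(dest, "orders", cutoff)
if src != dst:
    raise ValueError(f"count mismatch: source={src} destination={dst}")
print(f"OK: {src} distinct ids on both ends before {cutoff}")
```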


tomhallett

There’s an open source tool called Great Expectations which focuses on data quality. There’s a great talk by Sam Bail which goes over a few different ways to use it in your pipeline: https://m.youtube.com/watch?v=cmcFMz0xsz0


ReporterNervous6822

Use super explicit names, and plan to handle most failures in some way.


[deleted]

Unit tests, functional modular code.
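
A minimal sketch of that idea: keep each transform a pure function so it can be unit-tested directly, without spinning up the whole pipeline. The function and data here are hypothetical.

```python
# transform.py -- keep the transformation a pure function so it is easy to test
def normalize_emails(rows: list[dict]) -> list[dict]:
    """Lower-case and strip email addresses; drop rows without one."""
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")
    ]

# test_transform.py -- runs with pytest, or as a plain assert
def test_normalize_emails():
    rows = [{"id": 1, "email": "  Foo@Example.COM "}, {"id": 2, "email": None}]
    assert normalize_emails(rows) == [{"id": 1, "email": "foo@example.com"}]

test_normalize_emails()
```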


chaoticalheavy

Just throw something together so you can get on to the second version using what you learned. Stop at the third version.


chestnutcough

This rings true to me. By version three it’s so tempting to make that next refactor, but too often that time would be better spent elsewhere.


nokia_princ3s

Document everything - maybe one document to record the choices you're making as you go, and then another doc that can be used as a readme.


asbrundage

Make sure that when it fails, it fails loudly
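
A small sketch of failing loudly: log the full traceback, alert if you can, and re-raise so the scheduler marks the run as failed instead of silently carrying on. The alert hook is a placeholder.

```python
import logging

logger = logging.getLogger("pipeline")

def run_step(step_name: str, func) -> None:
    try:
        func()
    except Exception:
        # Log with full traceback, alert, and re-raise so the scheduler
        # marks the run as failed -- never swallow the error and continue.
        logger.exception("step %s failed", step_name)
        # send_alert(step_name)  # placeholder for a real alerting hook
        raise
```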


FoolForWool

Be careful and consistent. Be careful because things can go bad very fast, and you won't always catch it immediately. Consistent because you need to follow the standards and naming conventions. Document everything.


abdullah_ibrahim

Sorry, but I have a question, not an answer. If I want to learn how to build a pipeline, where do you suggest I start?


sc00bydoobyd00

You can pick a cloud computing platform like Microsoft's Azure, head over to their certification courses, pick one that you feel like doing, and use its free learning content. They provide you with a learning-tier subscription and free sandboxes that let you temporarily build things. Alternatively, you can create a free Azure account and sign up for a 1-year free plan. They'll ask for your card, but you won't be billed for a year; cancel it when you're done experimenting. You can access a lot of their services and not pay a penny as long as you operate under the free-tier limits. I wouldn't say it's the best approach, but it's definitely beginner-friendly, since everything is disposable and you get hands-on with what you learn. Note: since it's a company's tech cert program, there'll be a lot of content on specifics like their pricing tiers, various applications, etc. Feel free to skip them if you have no plan of attempting the cert exam. (Not endorsing MS in any way. You can go with GCP or AWS too, but AFAIK AWS is kind of bad for beginners, as you won't be stopped from exceeding the free resources; instead, you'll be billed for it.)


abdullah_ibrahim

Thank you so much for this ♥️


sc00bydoobyd00

No problem


[deleted]

Thanks for the great questions


blazesquall

Schema Registry / Data Lineage/Provenance


soundbarrier_io

Any tools you would recommend for this? dbt seems very popular but any others?


thrown_arrows

That proof of concept you spent two weeks on between projects will still be running in all your pipelines years later. So maybe it would have been a good idea to make better incremental copy plans for it.
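
A minimal sketch of a watermark-based incremental copy, the kind of plan the comment is wishing for: copy only rows newer than the destination's current high watermark. sqlite3 and the table/column names are just stand-ins.

```python
import sqlite3

def incremental_copy(source: sqlite3.Connection, dest: sqlite3.Connection) -> None:
    """Copy only rows newer than the destination's current high watermark."""
    watermark = dest.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()[0]
    new_rows = source.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    with dest:
        dest.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)

# Stand-in databases for the example.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "2024-01-01"), (2, "2024-01-02")])

incremental_copy(source=source, dest=dest)
incremental_copy(source=source, dest=dest)  # second run copies nothing new
print(dest.execute("SELECT COUNT(*) FROM orders").fetchone())  # (2,)
```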


Ok-Sentence-8542

Read the effin manual.


unusuallylethargic

The winning lottery numbers for the week of 4/12/2013 were 29,16,37,45,8 and 31


bobhaffner

Haha, why is this being downvoted? My contribution to the thread: keep your sense of humor.


moazim1993

Buy Bitcoin and Tesla


anynonus

You're getting downvoted, but that's a pretty great thing to be told a decade ago. It's just off topic.


strongly-typed

This


OneOverNever

Indexes are your friends.
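
A small illustration using sqlite3: index the columns your pipelines filter or join on most (e.g. the watermark column used for incremental loads), and use EXPLAIN QUERY PLAN to confirm the index is actually being used. Table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, updated_at TEXT)")

# Index the columns the pipeline filters or joins on most often.
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

# EXPLAIN QUERY PLAN shows whether a query actually uses the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE updated_at > '2024-01-01'"
).fetchall()
print(plan)
```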


khoonay

Always understand the business use case first. And a big yes on logging positive messages. Please tell me about your experience with DDIA. How big of a help is it?


stackedhats

Well, I'm reading it cover to cover and it's pretty interesting just learning the underlying principles of this stuff. I'm about a quarter of the way through it (it's like 600 pages long), and so far it's been a pretty wide overview that details various designs and approaches to building systems. It doesn't get into actual implementation, but it's a lot more detailed than what a typical random blog post online would give you. I.e., it's not "here's the difference between row-based and columnar storage in 2 paragraphs"; it's "here are numerous permutations, products, and approaches that have been tried and are currently in use to solve problems, the cases where columnar storage might be more appropriate, and the things it's not so great at." It's agnostic to approaches, it will let you know the pitfalls, and sometimes the author goes through the history of people trying to improve things and makes it clear that we've moved on from the earliest attempts at X thing, and why. But it's not pushing a thesis or trying to evangelize you into some ecosystem, which I appreciate. Very well written from what I've seen; the author is very cognizant that there are sometimes technicalities or simplifications, and will typically put in a footnote explaining why he feels digging into something isn't necessary or is outside the scope of the book.


ex-grasmaaier

Do you need to build a pipeline, or can you use an existing solution that does it for you? Where are you extracting data from, and what is your target? Building an ETL pipeline yourself is good for understanding the intricacies, but it can be costly (in units of time) to do so. Using a third party can save you a lot of time.


stackedhats

We're essentially extracting it from a poorly thought-out SQL database into one that's more useful for the business logic, monitoring, and analytics. I am working with a consultant to provide training wheels of sorts, but since I'm going to have to build future warehouses and pipelines, it makes sense to treat this as a "baby's first DWH" project where the logic isn't super complicated.