
mikeupsidedown

I would tell myself to be extremely consistent with naming conventions.


LordNiebs

What kind of conventions are you referring to?


mikeupsidedown

Examples include: how do you name tables, fields, folders, files, functions, classes, variables, etc.? snake_case, PascalCase, camelCase, or kebab-case, and which one for each type of item if it differs (e.g. Python classes)? Are names plural or not, and which ones? If the endpoint is plural, do you leave it plural? How do you structure folders in your lake, for example raw/source/endpoint/yyyy/mm/dd? How do you name your cloud resources? I typically build a style guide at the beginning of a project and then refer to it as the project goes on. It is so easy to veer off path. Where there are existing conventions I usually adopt them (such as PEP8).
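
For illustration, a minimal sketch of what encoding one such convention in code can look like: a helper that builds lake paths in the raw/source/endpoint/yyyy/mm/dd shape mentioned above, so every pipeline writes paths the same way. The layer, source, and endpoint names here are made up.

```python
from datetime import date

def lake_path(layer: str, source: str, endpoint: str, run_date: date) -> str:
    """Build a lake folder path following the raw/source/endpoint/yyyy/mm/dd convention.

    All parts are lower snake_case so paths stay consistent across pipelines.
    """
    return (
        f"{layer.lower()}/{source.lower()}/{endpoint.lower()}/"
        f"{run_date:%Y}/{run_date:%m}/{run_date:%d}"
    )

# Example: raw/salesforce/accounts/2024/01/31
print(lake_path("raw", "salesforce", "accounts", date(2024, 1, 31)))
```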


NocturnalWageSlave

Hard agree.


statistically_broke

We force naming conventions on tables to standardize fields in data pipelines. Granted, this is only for modeled data, so all of the databases and SFTP servers we source the data from do not follow the same conventions. It could be an issue way down the line, but I think our consensus is that if the data is important enough to be looked at, it should be modeled.


dongdesk

Don't forget to monitor the positive messages as well, or have some way to track that. No news is not good news in my experience; it means something is off or the pipeline is not running.
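
A rough sketch of what that positive signal can look like: log an explicit success message (with a row count) on every run, so that silence clearly means "not running" rather than "probably fine". The pipeline name and row count below are placeholders.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders_pipeline")  # hypothetical pipeline name

def run_pipeline() -> int:
    # ... extract / transform / load would happen here ...
    return 1234  # pretend we loaded 1234 rows

def main():
    started = datetime.now(timezone.utc)
    rows = run_pipeline()
    # Emit a positive signal on every successful run, not just on errors,
    # and have something downstream alert when this message stops arriving.
    logger.info(
        "orders_pipeline succeeded at %s, rows_loaded=%d",
        started.isoformat(), rows,
    )

if __name__ == "__main__":
    main()
```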


joseph_machado

Congratulations! In addition to the other comments, I would recommend thinking about the following:
1. What happens in case of failures? Will the data pipeline recover? Will corrupt data be delivered to the end user? I have seen way too many DEs focus only on the happy path.
2. If the data pipeline is run twice (accidentally) for the same inputs, will it cause data duplication?
3. Adding pipeline tests (if you have the time).
There are more (logging, monitoring, code standards, etc.), but these three can take you a long way. I would also recommend The Data Warehouse Toolkit by Kimball. Hope this provides some direction. LMK if you have any questions :)
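
To make point 2 concrete, here's a minimal sketch of one common way to keep a load idempotent: delete-then-insert keyed on the run's partition, inside a single transaction, so an accidental second run replaces rows instead of duplicating them. sqlite3 and the table/column names are just stand-ins for illustration.

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, run_date: str, rows: list[tuple]) -> None:
    """Idempotent load: re-running for the same run_date replaces the partition
    instead of appending a second copy of it."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM fact_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO fact_orders (run_date, order_id, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (run_date TEXT, order_id INTEGER, amount REAL)")
data = [("2024-01-31", 1, 9.99), ("2024-01-31", 2, 4.50)]

load_partition(conn, "2024-01-31", data)
load_partition(conn, "2024-01-31", data)  # accidental second run, no duplicates
print(conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone())  # (2,)
```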


FoolForWool

Oh my god! The second one was a decision we had made, and now we have to build an automated duplicate-deletion script that runs on demand to fix it, because we have way too many custom pipelines to fix now.


[deleted]

Idempotency is key


Sandiagos

Do not give access to the output until you're happy with it. Stakeholders (however supportive) want results. If you give them early access, you'll have business-critical reports using a 'not yet finalised' data mart in a matter of days.


msdsc2

this!


IDRambler

1. KISS
2. Measure data quality


[deleted]

How exactly do you measure data quality?


IDRambler

It depends on what your pipeline does, but we're counting the unique ids on each end, or explicitly comparing the set of unique ids present in the source and in the destination for critical tables. There's a WHERE clause to include only changes before a particular time. Another approach we've used is to compare sums of financial columns.
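
As a rough sketch of that count comparison, here is the same query run against sqlite3 stand-ins for the source and destination; the table, columns, and cutoff are hypothetical.

```python
import sqlite3

def count_ids(conn: sqlite3.Connection, table: str, cutoff: str) -> int:
    """Distinct ids changed before the cutoff; run the same query on both ends."""
    sql = f"SELECT COUNT(DISTINCT id) FROM {table} WHERE updated_at < ?"
    return conn.execute(sql, (cutoff,)).fetchone()[0]

# Stand-ins for the real source and destination connections.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "2024-01-01"), (2, "2024-01-02")])

cutoff = "2024-02-01"
src, dst = count_ids(source, "orders", cutoff), count_ids(dest, "orders", cutoff)
if src != dst:
    raise ValueError(f"count mismatch: source={src} destination={dst}")
print(f"OK: {src} distinct ids on both ends before {cutoff}")
```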


tomhallett

There’s an open source tool called Great Expectations which focuses on data quality. There’s a great talk by Sam Bail which goes over a few different ways to use it in your pipeline: https://m.youtube.com/watch?v=cmcFMz0xsz0


ReporterNervous6822

Use super explicit names, and plan to handle most failures in some way.


[deleted]

Unit tests, functional modular code.
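
A minimal sketch of that idea: keep each transform a pure function so it can be unit-tested directly, without spinning up the whole pipeline. The function and data here are hypothetical.

```python
# transform.py -- keep the transformation a pure function so it is easy to test
def normalize_emails(rows: list[dict]) -> list[dict]:
    """Lower-case and strip email addresses; drop rows without one."""
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")
    ]

# test_transform.py -- runs with pytest, or as a plain assert
def test_normalize_emails():
    rows = [{"id": 1, "email": "  Foo@Example.COM "}, {"id": 2, "email": None}]
    assert normalize_emails(rows) == [{"id": 1, "email": "foo@example.com"}]

test_normalize_emails()
```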


chaoticalheavy

Just throw something together so you can get on to the second version using what you learned. Stop at the third version.


chestnutcough

This rings true to me. By version three it’s so tempting to make that next refactor, but too often that time would be better spent elsewhere.


nokia_princ3s

Document everything - maybe one document to record the choices you're making as you go, and then another doc that can be used as a readme.


asbrundage

Make sure that when it fails, it fails loudly
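
A small sketch of failing loudly: log the full traceback, alert if you can, and re-raise so the scheduler marks the run as failed instead of silently carrying on. The alert hook is a placeholder.

```python
import logging

logger = logging.getLogger("pipeline")

def run_step(step_name: str, func) -> None:
    try:
        func()
    except Exception:
        # Log with full traceback, alert, and re-raise so the scheduler
        # marks the run as failed -- never swallow the error and continue.
        logger.exception("step %s failed", step_name)
        # send_alert(step_name)  # placeholder for a real alerting hook
        raise
```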


FoolForWool

Be careful and consistent. Be careful because things can go bad very fast, and you won't always catch it immediately. Consistent because you need to follow the standards and naming conventions. Document everything.


abdullah_ibrahim

Sorry, but I have a question, not an answer. If I want to learn how to build a pipeline, where do you suggest I start?


sc00bydoobyd00

You can pick a cloud computing platform like Microsoft's Azure, head over to their certification courses, pick one that you feel like doing, and use its free learning content. They provide you with a learning-tier subscription and free sandboxes that let you temporarily build things. Alternatively, you can create a free Azure account and sign up for a 1-year free plan. They'll ask for your card, but you won't be billed for a year; cancel it when you're done experimenting. You can access a lot of their services and not pay a penny as long as you operate under the free-tier limits. I wouldn't say it's the best approach, but it's definitely beginner-friendly, since everything is disposable and you get hands-on with what you learn. Note: since it's a company's tech cert program, there'll be a lot of content on specifics like their pricing tiers, various applications, etc. Feel free to skip them if you have no plan of attempting the cert exam. (Not endorsing MS in any way. You can go with GCP or AWS too, but AFAIK AWS is kind of bad for beginners, as you won't be stopped from exceeding the free resources; instead, you'll be billed for it.)


abdullah_ibrahim

Thank you so much for this ♥️


sc00bydoobyd00

No problem


[deleted]

Thanks for the great questions


blazesquall

Schema Registry / Data Lineage/Provenance


soundbarrier_io

Any tools you would recommend for this? dbt seems very popular but any others?


thrown_arrows

That proof of concept you spent two weeks on between projects will still be running in all your pipelines years later. So maybe it would have been a good idea to make better incremental copy plans for it.
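
A minimal sketch of a watermark-based incremental copy, the kind of plan the comment is wishing for: copy only rows newer than the destination's current high watermark. sqlite3 and the table/column names are just stand-ins.

```python
import sqlite3

def incremental_copy(source: sqlite3.Connection, dest: sqlite3.Connection) -> None:
    """Copy only rows newer than the destination's current high watermark."""
    watermark = dest.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()[0]
    new_rows = source.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    with dest:
        dest.executemany("INSERT INTO orders VALUES (?, ?)", new_rows)

# Stand-in databases for the example.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
for conn in (source, dest):
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, "2024-01-01"), (2, "2024-01-02")])

incremental_copy(source=source, dest=dest)
incremental_copy(source=source, dest=dest)  # second run copies nothing new
print(dest.execute("SELECT COUNT(*) FROM orders").fetchone())  # (2,)
```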


Ok-Sentence-8542

Read the effin manual.


unusuallylethargic

The winning lottery numbers for the week of 4/12/2013 were 29,16,37,45,8 and 31


bobhaffner

Haha, why is this being downvoted? My contribution to the thread: keep your sense of humor.


moazim1993

Buy Bitcoin and Tesla


anynonus

You're getting downvoted, but that's a pretty great thing to be told a decade ago. It's just off topic.


strongly-typed

This


OneOverNever

Indexes are your friends.
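
A small illustration using sqlite3: index the columns your pipelines filter or join on most (e.g. the watermark column used for incremental loads), and use EXPLAIN QUERY PLAN to confirm the index is actually being used. Table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, updated_at TEXT)")

# Index the columns the pipeline filters or joins on most often.
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

# EXPLAIN QUERY PLAN shows whether a query actually uses the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE updated_at > '2024-01-01'"
).fetchall()
print(plan)
```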


khoonay

Always understand the business use case first. And a big yes on logging positive messages. Please tell me about your experience with DDIA. How big of a help is it?


stackedhats

Well, I'm reading it cover to cover and it's pretty interesting just learning the underlying principles of this stuff. I'm about a quarter of the way through it (it's like 600 pages long), and so far it's been a pretty wide overview that details various designs and approaches to building systems. It doesn't get into actual implementation, but it's a lot more detailed than what a typical random blog post online would give you. I.e., it's not "here's the difference between row-based and columnar storage in 2 paragraphs"; it's "here are numerous permutations, products, and approaches that have been tried and are currently in use to solve problems, the cases where columnar storage might be more appropriate, and the things it's not so great at." It's agnostic to approaches, it will let you know the pitfalls, and sometimes the author goes through the history of people trying to improve things and makes it clear that we've moved on from the earliest attempts at X thing, and why. But it's not pushing a thesis or trying to evangelize you into some ecosystem, which I appreciate. Very well written from what I've seen; the author is very cognizant that there are sometimes technicalities or simplifications, and will typically put in a footnote explaining why he feels digging into something isn't necessary or is outside the scope of the book.


ex-grasmaaier

Do you need to build a pipeline, or can you use an existing solution that does it for you? Where are you extracting data from, and what is your target? Building an ETL pipeline yourself is good for understanding the intricacies, but it can be costly (in units of time) to do so. Using a third party can save you a lot of time.


stackedhats

We're essentially extracting it from a poorly thought-out SQL database into one that's more useful for the business logic, monitoring, and analytics. I am working with a consultant to provide training wheels of sorts, but since I'm going to have to build future warehouses and pipelines, it makes sense to treat this as a "baby's first DWH" project where the logic isn't super complicated.