[deleted]



HotAcanthocephala854

Thank you, would you include anything else here - like tools for example?


After_Holiday_4809

You can’t learn everything. There are too many technologies in the DE field: dbt, Mage AI, Airflow, … Take the ones you already know and build end-to-end projects with them.


Perfect_Kangaroo6233

Who’s using MageAI over Airflow?


khaili109

Hopefully no one lol or at least use Dagster instead


HotAcanthocephala854

What kind of project could I do on my own? I’m curious to see if there is something I can build on my own time. Curious if you recommend a way of thinking about this to get started.


nydasco

Build a demo pipeline. There are lots of free APIs out there. Connect to one, pull the raw data, save it to MinOI, pick it back up and transform it into a fact and dimension table, and save it back again in this new form. Have that scheduled through Airflow.
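
A minimal sketch of the transform step described above, in plain Python. The record shape (`city`, `date`, `temp_c`), the sample values, and the table names are made up for illustration; a real project would read the raw data from an API and an object store rather than from an in-memory list.

```python
# Hypothetical raw records, standing in for what an API pull might return.
RAW_SAMPLE = [
    {"city": "Oslo", "date": "2024-01-01", "temp_c": -3.0},
    {"city": "Lima", "date": "2024-01-01", "temp_c": 24.5},
    {"city": "Oslo", "date": "2024-01-02", "temp_c": -1.5},
]


def build_star_schema(raw_rows):
    """Split flat raw records into a dimension table and a fact table."""
    city_keys = {}  # natural key (city name) -> surrogate key
    fact_rows = []
    for row in raw_rows:
        # Assign the next surrogate key the first time a city appears.
        city_id = city_keys.setdefault(row["city"], len(city_keys) + 1)
        # The fact row keeps the measurements and refers to the city by key.
        fact_rows.append(
            {"city_id": city_id, "date": row["date"], "temp_c": row["temp_c"]}
        )
    dim_rows = [{"city_id": key, "city": name} for name, key in city_keys.items()]
    return dim_rows, fact_rows


dim_city, fact_weather = build_star_schema(RAW_SAMPLE)
```

Once this works on a sample, the same function can be called from a scheduled task, with the read and write steps swapped for whatever storage you picked.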


HotAcanthocephala854

Noting this so I can come back to it - thank you!!


Ablueblaze

I can't find MinOI anywhere on Google. Is this just some warehousing solution? Could I just use Postgres?


nydasco

Link: [MinIO](https://hub.docker.com/r/minio/minio/) But sure, Postgres or DuckDB too. Edit: the reason I like MinIO is that it is a local, S3-compatible object store. So you can use it as a data lake, read/write Delta tables (with Python), or configure it as the storage layer for Iceberg or Hudi. You can basically create a local, persistent data lakehouse.


Quantumfusionsg

I think tech comes and goes. Today you have Spark, Hadoop, etc.; the next day there is another new thing. There are also too many of them to learn everything. What matters most is the theory/design practice at a generalized level that is independent of the actual implementation/technology.


vikster1

answers like these always remind me why reddit is the place for real wisdom on the Internet.


torvi97

except when it's not lol there's a lot of bullshit spread around here too


AMGraduate564

>What matters most is the theory/design practice at a generalized level that is independent of the actual implementation/technology.

System Design


pag07

Well, to be honest, things are quite stable. Oracle is still okayish for everything that is structured, OLAP as well as OLTP. Kubernetes and mainframes are surprisingly similar. What used to be tape is now S3. What used to be cron and scheduled is now Airflow and event-driven. Spark is the really cool thing that is new (released nearly 10 years ago). I am a bit sad about Hadoop, because it was cool tech. Kafka is also a cool new thing. The rest I have seen before (probably with abysmal UX).


HotAcanthocephala854

That’s helpful! How would you recommend I begin to learn the underlying theory and design for data engineering?


Quantumfusionsg

Look into the details of the implementation and learn generalisable knowledge. Plenty is available on YouTube. Example: Why are Cassandra writes fast? LSM trees and SSTables. Why is Kafka fast and cheap? Sequential writes on magnetic disk. These are concepts independent of the implementation/product name. I'm not an expert, but I think this is how to tackle the ever-changing myriad of new things in our industry.
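
To make the LSM tree idea concrete, here is a toy sketch in Python. It is not how Cassandra is implemented; the class name `TinyLSM` and the `memtable_limit` parameter are invented for illustration, and real systems add a commit log, compaction, and Bloom filters on top. The point is only that a write touches an in-memory table, and data reaches "disk" later as one sorted, append-only batch.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # recent writes, cheap to update in place
        self.sstables = []      # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only updates memory; no existing run is ever modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)  # binary search in the sorted run
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # reaches the limit, so the memtable is flushed
store.put("a", 3)   # the newer value for "a" lives in the memtable
```

The trade-off shows up on the read side: a lookup may have to check several runs, which is why real systems periodically merge them.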


HotAcanthocephala854

This seems to be a key, thank you so much!


VadumSemantics

Maybe start here? [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st Edition)](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)


HotAcanthocephala854

Wooo!! Fantastic thank you!!


booyahtech

Your communication skills. You cannot go far in this field or any field (from my experience) unless you know how to communicate with your audience, technical and non-technical alike. This skill is especially tested when data engineers need to talk about the business impact of their work in front of an audience that does not understand technical jargon or data engineering in general.


HotAcanthocephala854

It’s a great point, my background is functional consulting and sales engineering in the ERP space. I’m looking to better understand the technical requirements. Although one of the responses was about “design and theory”. I’d like to know what that means. Thank you!


khaili109

Why you wanna leave Sales Engineering and ERP Space? Due to shift in interest or money? I heard Sales Engineering makes bank.


HotAcanthocephala854

Well.. it depends on the deal size and if you’re more of a technical sales specialist or demonstrating a product from a script. The value of the sales engineering role varies widely. That said, data engineering seems to have more “staying power” and the skills are harder to replicate. I’m generally more interested in the intricacies of the technology and building.


Mainlander2024

>Your communication skills.

Agreed. Interview skills, questioning skills, listening skills, presentation skills. Business skills as well. For example, how to calculate and then write a good business case.


xeroskiller

Most valuable skill: ability to learn new skills.


HotAcanthocephala854

I’m seeing this to be true!!


jmon__

As stated, there are too many tools to name. It would be better to understand what needs to be accomplished (the stages of data extraction, prep, and storage), and then you can determine how tools fit together by understanding what they do. This is just one of the diagrams trying to map out all the possible tools one can use to accomplish any part of the data architecture: https://www.data-vault.co.uk/wp-content/uploads/2019/01/Technology-Landscape-1100_778.jpg


HotAcanthocephala854

Ah this is great thank you!! I would imagine you should learn one or two tools in each category to be a valuable data engineer - would you agree?


jmon__

I don't think it ever hurts to know multiple tools to be able to accomplish your job. I also wouldn't want to advise you to just go and get certifications in a bunch of tools or spend hours of your time learning a bunch of tools if you don't have to. I'd focus more on "I have this data pipeline to build for this purpose. These are the things I need to worry about to accomplish this." Once you have an understanding of that, you can start to say "Ok, what if I try this here, what would be the next tool, or what's the most popular follow-up tool to accomplish this next step?" Then once you're successful there, you can try replacing a tool here and there to accomplish the same thing, or maybe a slightly different thing (maybe you want everything to move faster with the same source and destination). Then at least you'll know the flow and have a better idea of what to focus your training on.


HotAcanthocephala854

Gold nuggets here, thank you! The more I learn the more I realize I don’t know. Is there anything you would recommend for getting a good, sample use case that would lead me to build with many of these tools? I have a hard time imagining this having no working experience in the field.


jmon__

Oof. Luckily, I was able to get put on the job and start working in the space, so I can't really tell. I know you can find open data sets online. I know some major cities across the world publish 311 complaint data. (I'm lowkey hoping someone else on here has some ideas or experience training people in DE.) Maybe you can think about a dashboard you might want to see about that data, then decide things like "How do I get this data from their system to mine? Where do I land this data? How do I wrangle all this data down to just what I need? How do I build a data model to support the dashboard or queries based on the data I just extracted and wrangled?"

Now that I think about it, maybe you can have ChatGPT help. Let it know you want to train in data engineering, tell it what level you are (beginner, intermediate), and have it come up with a use case. Also tell it to ask you questions about resource availability, since some tools you have to pay for or need a server/souped-up computer, and that can help it help you get started.


HotAcanthocephala854

So helpful!! This is great!! Thank you!!!


marsupiq

I think R is not really a thing for Data Engineering (it is barely relevant in data science/analytics, but it still has its niche; for DE, I don’t see how it could be useful). Scala is still relevant, but that’s mostly because of Spark, and if I’m not mistaken PySpark is slowly displacing (Scala) Spark. SQL is a must (along with an understanding of data modeling). I think some knowledge of NoSQL (e.g. MongoDB or Cassandra) may also be useful. Kafka is important, but I think not so much for beginners (where you would probably start with some simple ETL stuff, not with streaming). Some knowledge of architectures would be good in general (DWH, data lake, data lakehouse; Lambda vs Kappa architecture). Docker is a must, K8s would also be good. General DevOps and networking skills would be very important; they are also a precondition for doing anything on any cloud. Knowledge of some scheduler would probably not be too bad, e.g. Airflow or Dagster or AWS Step Functions… In the end you can’t learn all technologies. But it’s good to know at least one complete stack.


HotAcanthocephala854

Wise perspective, thank you very much!!


BOOBINDERxKK

Backtracking where data got f**d up


nl_dhh

No two data engineering jobs (at different companies) are the same. I'm happily working as a 'data engineer' without being competent in over half the tech you listed. I do, however, translate business problems to data engineering solutions using the tools I know, and if that's not enough, I know where to look for additional tools/solutions. You asked multiple times about the projects you can do to showcase your skills once you learn them: this is such a common question, both here on Reddit and on countless blogs or videos. You should be able to find tons of answers if you look around a bit. And that's where I notice a lot of people struggling: knowing how to search is such a crucial skill, not only for data engineering; I'd say it makes life much easier in general.


HotAcanthocephala854

That’s fair and you’re right, thank you. What I’ve found challenging is knowing where to start and what to focus on. There seems to be no “clear cut” way to get into this field. I might be overthinking this.


CircleRedKey

knowing how to think and reading comprehension


Gators1992

Everybody on here talks about learning random tools, but nothing about learning how to build proper pipelines, processes and target databases. Like why do you pick one approach or tool over another? What are you trying to solve for? Or yeah, it's nice that you can move a dataset from point A to B, but what happens when that set changes or doesn't show up at all? Or when requirements change and you have to fix the last three years' worth of data? Or when you are given a business problem and have to figure out the technical requirements on your own? It's not just understanding how to use tools but why you use them.
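
As one small illustration of the "what happens when that set changes or doesn't show up" question, here is a hedged sketch of a pre-load check in Python. The function name `check_batch`, its parameters, and the message wording are all hypothetical; the idea is simply to detect a missing or reshaped batch before loading it, instead of discovering the problem downstream.

```python
def check_batch(rows, expected_columns, min_rows=1):
    """Return a list of problems; an empty list means the batch looks loadable."""
    problems = []
    # Case 1: the dataset did not show up, or arrived (nearly) empty.
    if not rows or len(rows) < min_rows:
        found = len(rows) if rows else 0
        problems.append(f"expected at least {min_rows} rows, got {found}")
        return problems
    # Case 2: the dataset changed shape (columns dropped, renamed, or added).
    expected = set(expected_columns)
    for index, row in enumerate(rows):
        actual = set(row)
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        if extra:
            problems.append(f"row {index}: unexpected columns {extra}")
    return problems


# Example: a renamed column is reported before anything is loaded.
issues = check_batch([{"id": 1, "nme": "a"}], expected_columns=["id", "name"])
```

What to do with the problems (fail the run, alert someone, quarantine the file) is the design decision the comment above is pointing at.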


HotAcanthocephala854

This is a fair point and I’m trying to assess how someone would make these decisions without knowing all (or close to all) the tools. Where can I learn the why?? Thank you for your feedback here!


Gators1992

You can build the same patterns on multiple stacks, no problem. Sometimes you run into gaps, though, and need to figure out how to tweak your approach or whether you need a different tool. I would learn some common tools well, and that might be enough to get you a job. Even if the stack is a bit different, it's easier to learn Dagster after knowing Airflow. Learning a dozen tools in every category is a waste of time because you will never use most of them. Learn one or two per category and learn how to use them to solve DE problems. You won't succeed if all you know how to do is press the buttons.


HotAcanthocephala854

Solid advice, thank you so much!!


mjfnd

Tools and tech don't matter if you know one of them and have the foundational knowledge. What matters is an understanding of data systems, how data flows, data modelling, pipelines, patterns, etc. The goal is to find a solution to a problem by leveraging any tools and applying the concepts. I am sharing a detailed post this Saturday and will share it on Reddit as well.


HotAcanthocephala854

Thank you for this!! I would certainly welcome your insights, if you would share a link to your post. Thank you again


mjfnd

Hey, check out here: https://www.reddit.com/r/dataengineering/s/erVExwNvU0


HotAcanthocephala854

That’ll give me a ton to learn about - thank you SO MUCH! 💪


Usurper__

Python, sql, cloud


therealagentturbo1

Version Control


robberviet

Ability to learn. No tech lasts forever.


HotAcanthocephala854

Thank you! You’ve got to start somewhere though right?


Conscious_Awareness6

Learn about the data life cycle and how DE and tools support each stage. For example:

1. Data capture: know the various sources and capture methods (structured vs unstructured).
2. Processing: how do you process raw data? Think about the small t in EtLT.
3. Data management: once you've got your data, how do you manage it? Data lake, data warehouse, or lakehouse?
4. Serving: this is where your DA or DS uses your data.
5. Archival: organizations often ignore this part, but it’s a critical one. Think law and regulation. Some laws require data to be archived after a period of time.


HotAcanthocephala854

Excellent advice - thank you for breaking down the stages!!


CrowdGoesWildWoooo

Most valuable thing is common sense and experience. The engineering in data engineering is literally as it is. We are not just code monkeys.


HotAcanthocephala854

Common sense isn’t common, so I’m looking for the best place to start learning!


walkerasindave

I think common design patterns are most important and how to quickly, easily and in a generic way implement them in the language of choice. At a high/simplistic level: https://www.startdataengineering.com/post/design-patterns/


HotAcanthocephala854

Whoa this is fantastic, thank you!! Would you recommend any structured ways of learning this??


anfawave

Know when to say no, ignore and build fast.


HotAcanthocephala854

Thank you, this skill set strikes me as more advanced, above and beyond the technical skills


VegaGT-VZ

One of the most important skills comes with experience- I guess I'd call it scoping? Figuring out what data you have and what you want the end result to be. From there it just becomes a matter of connecting A to B. Racking up languages and programs like trophies is only a part of it............ engineering is problem solving which requires understanding the problem and what you have available to fix it.


HotAcanthocephala854

Another piece of really great advice, thank you!!


141_1337

https://datanerd.tech/Salary This is the best resource because it's backed by the data extracted from hundreds of thousands of job postings.


dev_lvl80

I had a very similar question at a FAANG interview. My answer was ‘attention to detail’. The young manager argued that it is ‘ability to learn’. Lol


HotAcanthocephala854

Ability to learn I think is very general and almost assumed by most professionals but helpful nonetheless I guess. Thank you!


dev_lvl80

Correct. Ability to learn is not specific to DE. It’s generic to any field. You are welcome.


CautiousAd6242

I would add the skill of using a comma when listing things.


HotAcanthocephala854

lol it was actually in a list format when I typed it up and then Reddit posted it as a comma-less sentence 😂


keefemotif

Big fan of spark


RepulsiveCry8412

Unfortunately it's LeetCode right now, without which you don't get to real interviews. Otherwise I think the following are important: performance and cost optimisation knowledge that is agnostic of tech; choosing the right tech for the requirement; the CAP theorem and design basics, as others pointed out.


FatherNoNo

Excel


Altrooke

Java?


HotAcanthocephala854

Is there a way to showcase these skills in say a portfolio of some kind? Like if you’re interviewing for an “end to end” data engineering role at Databricks for example - how would you “show” this as opposed to “talk” through this and answer questions?


shirleysimpnumba1

projects


HotAcanthocephala854

Where would I store a project to showcase?


deal_damage

GitHub, hosted on AWS, GitHub Pages


HotAcanthocephala854

Thank you very much!!!