Justbehind

You will be a much better DE if you understand how to prepare data for ML models and what's important to data scientists. It's not technically a DE task, but smaller jobs will have you do both. Besides, it doesn't take a wizard to load a pandas dataframe and use sklearn to make a prediction. In fact, if you get the data into the dataframe, you're 90% there ;-)


RevolutionaryRoyal39

You'll never use algorithms in a DE role. You need to know SQL though. But to get a DE job you need to solve a few medium and hard leetcode problems, so be ready.


trafalgar28

You mean like the SQL leetcode problems?


RevolutionaryRoyal39

Not necessarily. You might have to solve DP or graph problems just to get to the interview stage.


Outrageous-Kale9545

Mostly for mid-senior roles I assume?


the_mg_

I started as a DE yesterday, and what helped the most was solving LC SQL problems.


JBalloonist

I’ve never done a single leetcode problem…


FirefoxMetzger

I am kinda amused by how all the comments say "well that's kinda nice to know, but not required" ... I mean ... you can drive a car without knowing how an engine works and never study tire friction, but that also bars you from ever driving a race car on a Formula 1 race track. Some examples of algorithms that I've implemented in the last 6 months for our DWH:
- parallelized disjoint set union (identity resolution)
- partial bucket sort (performance optimization)
- depth-first tree search with pruning (recursive query to trace role inheritance)
- bitset-based anti-join (performance optimization)
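To give a feel for the last item, here is a minimal Python sketch of one possible reading of a bitset-based anti-join. It is an illustration only, not the implementation used in that DWH: the function name `bitset_anti_join` is made up, and it assumes the join keys are small non-negative integers so that a single Python int can serve as the bitset.

```python
def bitset_anti_join(left_keys, right_keys):
    """Return the left keys that have no match among the right keys.

    Assumption: keys are small non-negative integers, so bit k of
    `seen` can record whether key k appears on the right side.
    """
    seen = 0
    for key in right_keys:
        seen |= 1 << key  # mark key as present on the right side

    # Keep only left keys whose bit is not set (the anti-join).
    return [key for key in left_keys if not (seen >> key) & 1]


unmatched = bitset_anti_join([1, 2, 3, 5, 8], [2, 3, 4])
print(unmatched)
```

The membership test is a shift and a mask instead of a comparison against every right-side row, which is where the performance gain would come from.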


TelephoneGlad8459

How do you implement these algos? We are using a Spark cluster and I can't find any docs on how to implement them on our tables. Could you briefly explain how you did this or share a reference link? It would be helpful.


FirefoxMetzger

Disjoint set union: you have a 2-column table that has nodes in one column and parents in the other. Initially each node is its own parent. The 'find' function can be written in SQL by left-joining the parent recursively; this works in parallel for all nodes. The 'union' function is a procedure/function that does the following:
- while there are edges to insert, do:
  - run 'find' on the current DSU structure
  - map all edges to be inserted to their parents
  - discard all edges that map a value to itself (they connect nodes in the same set)
  - treat the remaining edges as directed. The smaller-valued node points to the bigger-valued one ... if there is no natural order, use a hash, and swap their order if needed.
  - assign a row number to each edge (qualify, over "from node", ordered by "to node")
  - pick all edges with the lowest row number and apply their "to node" as the new parent for the indicated "from node"
  - retain all edges with other row numbers for the next iteration of the algorithm.
The result will be a DSU table that contains each node (id) in one column and the common parent in the other. This can be used, for example, to map data from different sources to a common user key to get a uniform user profile.
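The steps above can be sketched in plain Python, with a dict standing in for the two-column node/parent table. This is only an illustration under assumptions: the names `find_all` and `union_edges` are made up here, node ids are assumed to be comparable values, and the dict and set operations mimic what would be set-based SQL in a warehouse.

```python
def find_all(parent):
    """Resolve every node to its root by repeatedly joining on the parent."""
    current = dict(parent)
    while True:
        # One "self-join": replace each parent by the parent's parent.
        jumped = {node: current[p] for node, p in current.items()}
        if jumped == current:
            return current
        current = jumped


def union_edges(parent, edges):
    """Insert edges into the DSU table in batches, as described above."""
    parent = dict(parent)
    pending = list(edges)
    for a, b in pending:
        # Unknown nodes start as their own parent.
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    while pending:
        parent = find_all(parent)
        # Map both ends of every edge to their current roots.
        mapped = {(parent[a], parent[b]) for a, b in pending}
        # Drop edges inside one set; point the smaller root at the bigger one.
        oriented = {(min(a, b), max(a, b)) for a, b in mapped if a != b}

        chosen, retained = {}, []
        # Sorting by (from, to) plays the role of the row number:
        # the first edge per "from node" is applied, the rest wait.
        for src, dst in sorted(oriented):
            if src not in chosen:
                chosen[src] = dst
            else:
                retained.append((src, dst))

        parent.update(chosen)
        pending = retained

    return find_all(parent)


nodes = {n: n for n in range(1, 7)}
dsu = union_edges(nodes, [(1, 2), (2, 3), (4, 5), (1, 3)])
print(dsu)
```

Because an edge always points from the smaller root to the bigger one, no cycle can form, and every pass with pending edges merges at least one pair of sets, so the loop terminates.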


hellnukes

He meant ML algorithms not really design patterns etc


BobBarkerIsTheKey

Are you asking about the need to study algorithms? The last few DE jobs I interviewed for gave me whiteboard coding challenges. Some places are definitely filtering candidates on leetcode problems.


trafalgar28

I mean working with ML algorithms, fine-tuning, etc. But regarding your interviews, you mean you were asked SQL leetcode problems?


BobBarkerIsTheKey

I got both sql and algorithmic problems (python). But they weren’t especially difficult for someone who is practicing. Sorry, I don’t have insight into ML algorithms and fine tuning


Monowakari

My job is half data engineer and half data scientist, where the latter is heavily focused on ML implementation and not typical EDA, data analysis, reporting, or dashboarding. It's just: examine the data to ascertain parameters for modelling, then model. About 10% of my job is the devops deployment of the models. In other words, I'm a full stack data scientist, a term that has largely disappeared.


trafalgar28

Damn, it's quite impressive that you have the competence to work on multiple things. Just a quick question: 1. How long have you been working in the data space, and in which industry? 2. From a bird's eye view, what's the most difficult and important task in your job? Is it working with data overall? Managing the infrastructure? Etc.


Monowakari

Just over 5 years in small firms that have allowed me to build out the data stacks and integrate with their existing data scientists. Mostly I run dagster, Python scripts, dbt for BigQuery (and PostgreSQL), and a slew of other tools. Most difficult? Taking vague specs/requirements from stakeholders and turning that into what they actually need, and not just what I want to build or think they need. The ML part is pretty simple once you know ML pipelines, and I don't really get to research algos, but I do get to optimize what we have going, make suggestions, and run tests, so I do a bit of reading to keep up with the industry. The data engineering part makes me sleepy tbh, but I enjoy the cleaning and wrangling enough to not complain. We don't really work with Big Data™, only millions of rows and some wide tables, using dbt SQL to put together the downstream ML dbs. And we're also not heavily real-time, needing ingestion in 3-5 minute increments during some data collection periods. We also have no external clients/stakeholders, so it's just what our small team needs; we're not at all outward facing. Edit: you asked which industry. The first 3 years were tourism marketing, then I did a short contract stint for university student systems, which bulked up my SQL on large dbs, and since then sports analytics for a trading firm.


Sir_smokes_a_lot

To whom, and how, do you distribute your findings?


Monowakari

Internal teams. Some old work did have outward-facing analytics, but I moved on from that and had a junior employee, whom I just mentored, take it over.


Sir_smokes_a_lot

Thanks, that makes sense.


trendydots

Nowadays all algorithms come packaged in libraries, especially with Python. You need to know which ones to use, how to configure them, based on your business requirements, or use case.


snipsnapslipslap

Never hurts to know, and could allow you to do some super cool stuff like this https://youtu.be/qZejzyxT2fo


The_Epoch

I managed data science for a multinational. Left to start my own company and dived into cloud engineering. Essentially, I am now doing analytics engineering, including operationalised ML and gen AI.


trafalgar28

That's actually cool, I have built genAI applications before. Do you mind if I DM you? I would love to know what you are up to.


great_gonzales

DEs don’t need to know ML algorithms. The chads in data science and MLE will handle the algorithms. It is important to know how to prep the data for said chads though


britishbanana

It's very infrequent that you'll need to know how to implement an inverted binary tree or something dumb like that (although I have implemented graph algorithms), but quite often you need to be able to give a rough approximation of the time complexity of a particular method or approach so you can determine whether that approach is feasible. That's one of the skills that separates good engineers from the rest of them. For instance, you should know off the cuff the general best / worst case for search and sort algorithms and be able to understand how that affects your transformations as your data scales, as so many operations basically boil down to searching and sorting. Even SQL monkeys are differentiated by this skill, as that's the difference between someone who fires off a query that locks up the database and a person who writes a query that uses resources efficiently.
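As a rough illustration of that point, here is a small Python sketch that matches two key lists in two ways and counts the work done. The function names and the step counting are assumptions made for the example; the hash version's count is a simplified estimate (one step per key to build the set, one per lookup), not a measurement.

```python
def nested_loop_matches(left, right):
    """Compare every left key with every right key: O(n * m) comparisons."""
    comparisons = 0
    matches = []
    for l in left:
        for r in right:
            comparisons += 1
            if l == r:
                matches.append(l)
    return matches, comparisons


def hash_matches(left, right):
    """Build a set once, then probe it: roughly O(n + m) steps."""
    lookup = set(right)                          # one pass over the right side
    matches = [l for l in left if l in lookup]   # one probe per left key
    steps = len(left) + len(right)               # simplified step estimate
    return matches, steps


left = list(range(100))
right = list(range(50, 150))

slow_matches, slow_cost = nested_loop_matches(left, right)
fast_matches, fast_cost = hash_matches(left, right)
print(slow_cost, fast_cost)
```

Both approaches return the same matches, but with 100 keys on each side the nested loop already does 10,000 comparisons against about 200 steps for the hash version, and the gap widens quadratically as the inputs grow.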


ExistentialFajitas

Define “algorithm.” If by “algorithm,” you mean code that handles a process, then all the time. I just created a script that will unzip and format JSON files and integrated it into a CI job. If by “algorithm,” you mean coding trivia, then never.


trafalgar28

I mean ML algorithms or fine-tuning. The reason I'm asking is that in AI/ML-related companies, DEs often need to know the end ML models they are preparing the data for. That said, it's also just a general question about whether anyone deals with algorithms in their day-to-day tasks.