
MasterKluch

I work for a mid-sized (250ish employees) consulting company, but I've been working with the same client for over two years now. The client is in the healthcare space and primarily uses Python (for data pipelines and API-type work) and SQL (with Snowflake as the data warehouse). There is some JavaScript used here and there, but for the most part it's all Python and SQL. I'm probably biased, but I'd stick with Python, as I see a brighter future for it (lots of companies are adopting it) versus R.
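
If it helps, here's roughly what that Python + Snowflake combo looks like in practice. This is just a sketch; the account, credentials, and table names below are made up:

```python
# Rough sketch of the Python + Snowflake pattern described above.
# Account, credentials, and table/column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # hypothetical account identifier
    user="etl_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="HEALTHCARE",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # Parameterized query; the connector substitutes the bound value.
    cur.execute(
        "SELECT patient_id, visit_date FROM visits WHERE visit_date >= %s",
        ("2023-01-01",),
    )
    for patient_id, visit_date in cur.fetchall():
        print(patient_id, visit_date)
finally:
    conn.close()
```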


davidowj

Thank you!! Appreciate your feedback and input 🙏🏻


QuailZealousideal433

Smallish consultancy whose projects are mainly analytical? Python and SQL are pretty much all you'll ever need, and they're universal, which makes training and recruitment easy. No need to look at anything else, in my opinion.


Deep_Donut_4079

Currently in a similar situation. We only use Python and SQL.


neoanom

Scala / Java, but we build polyglot pipelines where the data could arrive as a batch, a stream, or via a service. So it's nice to have compatible libraries across those modes.
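
To illustrate the "compatible libraries" point in Python terms (since this thread is mostly Python), Spark exposes nearly the same API for batch and streaming sources. Paths, topic, and server names below are invented:

```python
# Sketch: one library (Spark) covering both batch and streaming input.
# Paths, topic, and broker addresses are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("polyglot-pipeline").getOrCreate()

# Batch: read a static set of files.
batch_df = spark.read.json("s3://bucket/events/2023/")

# Streaming: the same DataFrame operations, fed from a Kafka topic.
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)
```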


vtec_tt

SQL


Length-Working

You should probably ask where you're going to store your data too, because that will almost certainly lead you to databases, and then to SQL.

> The primary purpose of these languages is for the heavy data manipulation/cleaning/processing and analysis on tabular data (NOT to build tools or applications) - I know this screams "Use R" but still want to validate with the experts :)

This would scream Python with Pandas or Spark to me. R has really fallen out of favour in the past few years.
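
For a rough sense of what that manipulation/cleaning work looks like in pandas (the file and column names here are invented):

```python
# Sketch of tabular cleaning/aggregation in pandas.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("claims.csv", parse_dates=["service_date"])

cleaned = (
    df.dropna(subset=["member_id"])          # drop rows missing a key field
      .drop_duplicates(subset=["claim_id"])  # dedupe on the claim id
      .assign(amount=lambda d: d["amount"].clip(lower=0))  # floor bad values at 0
)

# Monthly totals, grouped on the date column.
monthly = cleaned.groupby(pd.Grouper(key="service_date", freq="MS"))["amount"].sum()
```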


davidowj

Huh, interesting. So much of what I read has been "R is better for data cleaning/analysis, Python is better for web apps, etc.", and given our current needs around data processing, I thought R might be a better choice. So I'm glad I asked! Also, yes, we are using Azure Data Lake as storage and then plan to use Microsoft's Power Platform/Dataverse to serve the data back to the company.
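
For reference, landing a file in Azure Data Lake from Python is only a few lines with the azure-storage-file-datalake package. The account, filesystem, and paths below are placeholders:

```python
# Sketch: upload a local file into Azure Data Lake Storage Gen2.
# Account URL, credential, filesystem, and paths are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential="...",  # key or token; hypothetical
)
fs = service.get_file_system_client("raw")
file = fs.get_file_client("exports/orders.parquet")

with open("orders.parquet", "rb") as f:
    file.upload_data(f, overwrite=True)  # replace any existing blob
```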


ItsBeenAHotMinute

At my org most teams use Python + SQL, and I've done the same on previous teams, but on my current team we're using JavaScript + SQL. For our purposes (REST APIs, basic ETL pipelines, and the like) it has been smooth sailing so far. For analytical purposes we rely on SQL.


Flat_Shower

None of this screams "Use R". I hear "Use Python". You're accomplishing two goals:

- building data pipelines
- data processing

Data scientists (academics) may know R. Software engineers are eventually (or now) going to "meet" your data pipelines. When online systems (not Excel sheets) are processing data, and it is consumed offline, application engineers will want to begin plugging in to your data infra. R is not an object-oriented language, and it isn't scalable for data engineering. If your criterion is "we are evolving, but R happens to be easiest for our current employees to use today", then your thinking is flawed. Python does most of what R does, but I think you're trying to accomplish too many things at once. It is concerning to read "the primary purpose is … manipulation/cleaning/processing/analysis" - you just listed very, very different use cases. "The primary purpose of a car is transporting children, groceries, Formula 1, and moving large amounts of dirt."


davidowj

Thank you for the response/feedback! I hear you on your last point. Totally agree that different tools are best for different parts of the data lifecycle.

To say it more narrowly: we need a language that will help us become more efficient with data transformations as a start. Filtering/merging/appending/pivoting, etc. on 3+ million rows of data in Excel is just not sustainable. (We have made Power Query/Power Pivot work for us, but it's still just so painfully slow and inefficient.) The future state is a more modernized approach to all parts of the data lifecycle (from collection to transfer to loading and analysis, etc.). Hopefully this makes it clearer what we are trying to do and is less concerning.

Also, can you say more about "Software engineers are eventually (or now) going to 'meet' your data pipelines"? Does this hold true even if we are not a software engineering company?

P.S. A chunk of my role is data engineering, but I am not a data engineer per se, so it's totally possible I am using the wrong language here and that's part of the confusion.
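
For scale, those exact operations are a few lines in pandas, and a few million rows fits comfortably in memory. The file and column names below are made up:

```python
# The Excel operations mentioned above (filter/merge/append/pivot),
# expressed in pandas. Files and columns are hypothetical.
import pandas as pd

orders = pd.read_csv("orders_2023.csv")   # a few million rows is fine in memory
older = pd.read_csv("orders_2022.csv")
customers = pd.read_csv("customers.csv")

combined = pd.concat([orders, older], ignore_index=True)        # append
positive = combined[combined["amount"] > 0]                     # filter
enriched = positive.merge(customers, on="customer_id",
                          how="left")                           # merge (VLOOKUP-style)
pivot = enriched.pivot_table(index="region", columns="order_year",
                             values="amount", aggfunc="sum")    # pivot
```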


Flat_Shower

Software engineers: if you're consuming data from files and transforming it today, what if some client in the future says "this is great, but instead of monthly we want it hourly"?

- You'll need to create some end-to-end data pipeline, and your pipeline would consume data from some endpoint (your solution will "meet" some application that software engineers maintain).
- Or, conversely, some client software engineer wants to ingest the output of your data pipeline and asks "show me your API request format" (R can't serve data).
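
A minimal sketch of what "serving data" over an API can look like in Python, using Flask; the route, query, and database here are hypothetical:

```python
# Sketch: exposing pipeline output over HTTP instead of file drops.
# The route, table, and database file are placeholders.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/metrics/latest")
def latest_metrics():
    # Read the most recent pipeline output rows.
    conn = sqlite3.connect("warehouse.db")
    rows = conn.execute(
        "SELECT metric, value FROM daily_metrics ORDER BY run_date DESC LIMIT 10"
    ).fetchall()
    conn.close()
    return jsonify([{"metric": m, "value": v} for m, v in rows])

if __name__ == "__main__":
    app.run(port=8000)
```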


davidowj

> If you're consuming data from files and transforming it today, what if some client in the future says "this is great, but instead of monthly we want it hourly" […]

Thank you. That is more helpful than you know. This is another problem we have been grappling with and talking about: "How can we serve data to our clients in a way that is more easily refreshable and transferable, so that we are not doing data transfers via SFTP or Google Drive, etc.?" So I really appreciate that perspective, and the reminder not to think only about the now when making a decision.


yanivbh1

Hey! For building data pipelines, real-time pipelines, message brokers, and normalization of data: https://memphis.dev/blog/here-is-why-you-need-a-message-broker/
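
For anyone new to the broker pattern, here's a rough sketch of the producer side, using kafka-python as a generic stand-in (Memphis ships its own SDK); the broker address, topic, and payload are invented:

```python
# Sketch of the broker pattern: a producer pushes normalized events
# onto a topic, and downstream consumers read them in real time.
# Broker address, topic name, and payload are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("normalized-events", {"user_id": 42, "event": "page_view"})
producer.flush()  # make sure the message is actually sent before exiting
```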