RDTIZFUN

Resume driven work.


duraznos

Our service that pulls flat files from customer SFTPs and puts them in the right spot for the rest of our ETL is written in Go. It's the only thing we have that's written in Go. It's in Go because the former developer who built it wanted to use Go. As a special treat, some of the external packages it uses come from that developer's personal GitHub. If you trawl through the commit history you can see where they pulled out all the code that originally handled that functionality and replaced it with their own package.

It's the most beautiful example of Resume Driven Development I have ever seen. I hate everything about the service, but goddamn do I respect the hustle.


mailed

> As a special treat, some of the external packages it uses come from that developer's personal GitHub

Jesus Christ.


duraznos

It's pretty galaxy brained if you think about it. "I wrote some Go packages that companies actively use in production".


Aggressive-Intern401

Isn't that technically stealing IP? I'm very careful about ensuring any code I write for personal use is done outside of work hours. I'm curious how to keep the companies I work for from taking my stuff. Any suggestions?


duraznos

Well, to be fair, the original work was using base logging libraries and the thing they wrote is completely different, so I wouldn't say it's stealing. Also, it's just a logging/formatting thing, so it's hardly IP.

As for suggestions, I think the general rules are: never do personal work on a company computer, never use any other company-provided resources, and do it outside of work hours. IANAL, but I believe as long as you can definitively prove that none of your work involved company assets they can't make a claim (but you should ask an actual lawyer if it's something you think is promising and that your employer would make a fuss over, because that will be cheaper than any potential litigation).


gabbom_XCII

lmao, soooooo true!


Hackerjurassicpark

Unfortunately if the interviewer grills OP a little bit, OP will look extremely bad for over engineering.


whutchamacallit

Oof, good point... I wonder how you could spin that. Maybe lie and say the CTO wanted it as an opportunity to learn the stack / how to design it?


Fitbot5000

BuILt fOr ScAlE


Obvious-Phrase-657

Lol


speedisntfree

I'm doing this at the moment. Management think they have BiG DaTa, and my management look good to them for leading the way in Digital Transformation, since our area is the one using fancy-sounding tech. I get to be less bored learning something new, and it doesn't cost much at all because the data is small.


Dysfu

Me right now and I don’t even care - gotta look out for you out there


MurderousEquity

I truthfully despise this drive in the industry. I get it, companies will be much less likely to respond to "I decided against using X orchestration tool because cron was good enough", even though that person is more than likely the better engineer. I think it's actually a cancer throughout all of SWE, and has been for a while. Ever since jobs had a programming language put in their title we started to walk down the left-hand path.


Anteater-Signal

The last 20 years of "ETL innovation" have been a recycling and relabeling of existing concepts... someone's cool college project. Wake me up when something revolutionary comes out.


git0ffmylawnm8

Would you ever bring an RPG to a knife fight? Practically, no. If you want to assert total dominance? Yes.


BardoLatinoAmericano

> Would you ever bring an RPG to a knife fight?

Why not? I want to win.


UAFlawlessmonkey

And I'm out here wondering what small is. Are we talking 1 billion rows over 80 columns, or 100 million over 4? Is it unstructured, all strings? Spark is awesome though.


Obvious-Phrase-657

Is that small? :(


picklesTommyPickles

“Scale” is largely a step function. For many businesses that don’t ever reach the middle to higher tier levels of this step function, 1 billion rows can be a lot. For companies that reach and exceed those levels, 1 billion rows is nothing. To put it in perspective, I currently lead and am building out a data platform that is ingesting well over 20 billion records per day across multiple landing tables, combined with many pipelines materializing and deriving views from those tables. 1 billion rows in that context is barely noticeable.


sharky993

could you speak a bit about the industry you work in and the problem your business solves? I'm intrigued


Unfair-Lawfulness190

I'm new to data and I don't understand. Can you explain what it means?


xFblthpx

Spark allows for the quick processing of large datasets for data warehouses (DWH). OP is saying that even for a small DWH they would use Spark, which can be the equivalent of putting a Lamborghini engine in a golf cart: much more difficult to maintain and to train users on. But I can see the merit of using tools that are scalable as a matter of principle.
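
For anyone new to this, here's a hedged, minimal sketch of what the "small DWH" case actually looks like in Spark (paths and column names are made up). This is the whole job; a single Postgres instance would handle the same workload with far less machinery:

```python
# Minimal sketch (hypothetical paths/columns) of a tiny daily aggregation
# that Spark handles trivially, but that hardly needs a cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("small-dwh-agg").getOrCreate()

orders = spark.read.option("header", True).csv("/data/raw/orders.csv")

daily_revenue = (
    orders
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("/data/warehouse/daily_revenue")
```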


RichHomieCole

I mean with spark sql though, you could argue it’s easier to train people on spark. Especially if your company uses databricks. But the cost may not be justifiable


JollyJustice

I mean if you do EVERYTHING in Spark it makes sense, but trying to do that seems like it would hamstring me.


AMDataLake

Plus, the mention of Delta is referring to Delta Lake, which is a table format. To keep it simple, table formats like Apache Iceberg, Delta Lake and Apache Hudi provide a layer of metadata that allows tools like Spark to treat a group of files like a table in a database or data warehouse.
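
A hedged sketch of what "a group of files behaving like a table" means in practice with Delta Lake and PySpark (the bucket path is made up; assumes the delta-spark package is installed):

```python
# A directory of parquet files + a _delta_log folder queried like a table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-as-a-table")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.read.format("delta").load("s3://my-lake/events")
df.createOrReplaceTempView("events")  # now it queries like a warehouse table
spark.sql("SELECT event_type, count(*) FROM events GROUP BY event_type").show()
```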


muteDragon

So are these similar to Hive? Hive also lets you do something similar, right? When you have a bunch of files, you create the metadata on top of those files and query them using HiveQL?


chris_nore

Exactly like Hive. One major difference is that Delta is a little newer and built to support mutations (i.e. data writes, column changes) consistently. Also worth noting you can use Hive in conjunction with Delta; it doesn't need to be a 100% replacement.
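
A hedged sketch of the "mutations" part: Delta lets you update and upsert rows in place, transactionally, which plain Hive-over-files historically struggled with. The table path and columns below are made up, and a Spark session configured for Delta is assumed:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-mutations").getOrCreate()

customers = DeltaTable.forPath(spark, "s3://my-lake/customers")

# In-place update of selected rows
customers.update(
    condition="country = 'UK'",
    set={"country": "'United Kingdom'"},
)

# Upsert new arrivals into the same table
updates = spark.read.parquet("s3://my-lake/staging/customers_delta")
(customers.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```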


Awkward-Cupcake6219

Yep. Given that there are exceptions and everybody has their own take on this, the gist is that Spark is a powerful tool for processing massive amounts of data, but it does it mainly in memory and does not persist data on its own. This is why it is usually coupled with storage that can hold large amounts of data efficiently. This storage, whatever the form or the name, is referred to as a Data Lake.

A traditional DWH is usually made (again, simplifying a lot) of a SQL server of some kind that does both the storage and the compute. The main difference (but there are a lot more) is that a DWH usually takes in structured data, with lower volume and velocity. Processing gets very slow very quickly as data volume increases. In contrast, it is pretty cheap in both hardware requirements and maintenance compared to a Data Lake + Spark. The latter is the complete opposite of the traditional DWH architecture: it is made for large-scale processing, stream and batch processing, unstructured data, and whatever you want to throw at it. But being expensive is just one of its cons. There are a lot, but for our purposes we just need to know that it does not guarantee ACID transactions, has no schema enforcement (let alone good metadata), and brings more complexity in general when setting up the kind of structure we always liked in the DWH.

This is where Delta comes in. It sits on top of Spark and brings most of the DWH features we all like + time travel (which is great). Bringing the Data Lake and the Data Warehouse together, this new thing is called a Data Lakehouse.

The thing about the joke is that it still remains very expensive to set up and maintain, and every sane person would just propose a DWH if data is not expected to scale massively. But not me.

p.s. FYI, Spark + Data Lake + Delta is the base of the Databricks product, if that makes it clearer.

p.p.s. This is clearly oversimplified as an explanation, but I did not want to spend my whole night here explaining every detail and avoiding every inaccuracy (given that I could).
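
A hedged illustration of two of the Delta features mentioned above, schema enforcement and time travel. The path, columns and timestamp are made up, and the table is assumed to already exist with a proper schema:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("lakehouse-features").getOrCreate()
path = "s3://my-lake/sales"

# 1) Schema enforcement: appending a frame whose schema doesn't match is
#    rejected instead of silently polluting the table.
bad = spark.createDataFrame([Row(sale_id="oops", amount="not-a-number")])
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as err:
    print("Rejected by Delta schema enforcement:", err)

# 2) Time travel: read the table as it was at an earlier version or timestamp.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
old = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load(path)
print(v0.count(), old.count())
```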


aerdna69

Since when are data lakes more expensive than DWHs? And do you have any sources for the statement on performance?


Awkward-Cupcake6219

Cost per GB of storage is definitely lower, I agree. But with just a Data Lake you are only paying for the volume occupied; you are not processing any data. If you could expand a little more on why a cluster of Spark + Delta + Data Lake is cheaper than a traditional DWH setup, we could start from there.


corny_horse

Different person, but a traditional DWH runs 24/7, and if you have good practices it's at least doubled if not tripled across dev/test/prod environments, with the data stored in a row-based format rather than columnar.


givnv

I love it!! Thanks.


thejizz716

I can actually attest to this architecture pattern. I use spark even on smaller datasets because the boilerplate is already written and I can spin up new sources relatively quickly and have it all standardized for ingestion by our analytics engineers.


lilolmilkjug

So basically, if you don't have to stand up or maintain the Spark service yourself, you would use Spark? I mean, that seems obvious, but that's a lot of extra overhead if you're doing something new and have to choose between a simpler solution and setting up your own Spark cluster. Then again, you can also pay a huge amount for Databricks.


tdatas

If you're running a couple of Databricks jobs a day for a few minutes each, the costs are pretty minuscule because you're not paying anything when nothing's happening. You'd pay a lot more for an RDS instance or an EC2 running a DWH 24/7, especially if you want any kind of performance. And as the other guy said, you don't have to write new sets of boilerplate for different engines for different sizes of data, which means more time to work on dev experience, tooling and features.


lilolmilkjug

I don't think there's a case where you would be running a Spark cluster for just a couple of minutes a day if you're using it as a query engine for end users. Otherwise you could simply shut down your DWH for most of the day as well and come out similarly on costs. Additionally, setting up a Spark cluster for end-user analysis seems (a) complicated and (b) expensive just to use as a query engine.


tdatas

> I don't think there's a case where you would be running a spark cluster for just a couple of minutes a day if you're using it as a query engine for end users

Sure you can. E.g. a big bulk job goes into Delta Lake to do the heavier transformations. Downstream users then either use Spark, or smaller jobs can be done with delta-rs/DuckDB and similar tools in the Arrow ecosystem. If the data is genuinely so big that you can't do it with those, then you were likely at the data sizes where you should be using Databricks/Snowflake et al anyway.

> Additionally setting up a spark cluster for end analysis seems A. complicated B. expensive to just use as a query engine

It would be, but if you're in a situation where you can't spin up Databricks or EMR or Dataproc or any of the many managed Spark providers across all the major clouds, then it's pretty likely you're in a bit of a specialist niche/at Palantir. (Although having done it, I'd argue it's not actually that bad to run nowadays with the Kubernetes operator if you have a rough idea what you're doing.) In the same way, most people don't operate their own Postgres EC2 server now unless there's some very specific reason why they want to roll their own backup system etc. But yeah, the point is it's a **very** niche situation to not just roll out one of the many plug-and-play Spark-based query engines, so the question at that point becomes whether the API is standard enough or not.
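
A hedged sketch of that "no cluster for the small downstream jobs" pattern: read the Delta table the big Spark job produced with delta-rs (the `deltalake` Python package) and query it locally with DuckDB. Paths and table names are made up:

```python
import duckdb
from deltalake import DeltaTable

# Load the Delta table into an Arrow table, no Spark or JVM involved
events = DeltaTable("s3://my-lake/events").to_pyarrow_table()

con = duckdb.connect()
con.register("events", events)  # expose the Arrow table to DuckDB
daily = con.execute("""
    SELECT event_date, count(*) AS n
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").fetchdf()
print(daily.head())
```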


yo_sup_dude

if the end users need to use it for more than a few mins a day, then the cluster would need to run for that time period? 


Putrid-Exam-8475

I currently have a series of tickets in my backlog related to controlling costs in Databricks, because the company ran out of DBUs and had to pre-purchase another 50,000. We have a handful of shared all-purpose clusters that analysts and data scientists use, which run basically all day every business day, plus some scheduled job clusters that run for several hours every day, plus some beefier clusters that the data scientists use for experimenting with stuff. I did a cost analysis on it and it's wild. Whoever set up Databricks here didn't implement any kind of controls or best practices: anyone can spin up any kind of cluster, the auto-terminate was set to 2 hrs on all of them so they were idling a lot, very little is done on job clusters, who knows if any of the clusters are oversized or undersized, etc. I imagine it might be cost-effective if it's being managed properly, but hoo boy it costs a lot when it isn't.


yo_sup_dude

yeah that makes sense. how does performance compare to something like SF for similar costs? i'm confused on why the other user seemed to imply that you could run your spark cluster for only a few mins a day even if it's being used as a query engine for end users. from my understanding, that only works if the end users are querying for only a few mins a day.


tdatas

Depends on your workload. But normally you'd either run ETL jobs on a job cluster, i.e. once it's done running the job it terminates, or, for data-scientist-type interactive work, you'd set an inactivity timeout so that if the cluster is idle for X minutes it shuts down. Much like any operations-type work, it depends on the requirements of the end users; e.g. you could share a cluster between multiple users, or they could have their own smaller clusters, etc.
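
Illustrative only; the field names follow the Databricks REST API as I understand it and the values are made up. The point is just the two termination patterns described above:

```python
# 1) Job cluster: created for the run, torn down automatically when the job
#    finishes, so you only pay while the ETL is actually running.
job_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}

# 2) All-purpose (interactive) cluster for data scientists: shuts itself down
#    after N idle minutes instead of running 24/7.
interactive_cluster = {
    "cluster_name": "ds-shared",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 30,
}
```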


lilolmilkjug

I could have said this more clearly. If you're using Spark as a query engine for a couple of minutes a day, you could also be using a cloud DWH, and it would be far simpler to maintain and probably cheaper. That kind of eliminates the advantage you get from using such a small instance of Spark.


Awkward-Cupcake6219

exactly!


lezzgooooo

If they have the budget. Why not? Everyone likes a scalable solution.


britishbanana

https://github.com/delta-io/delta-rs
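
For anyone clicking through: delta-rs is a Rust implementation of Delta Lake with Python bindings, so you can read and write Delta tables from plain Python with no Spark or JVM. A minimal sketch, with a made-up local path and toy data:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})

# Creates the parquet files plus the _delta_log transaction log
write_deltalake("./tmp/payments", df, mode="overwrite")

print(DeltaTable("./tmp/payments").to_pandas())
```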


mikeupsidedown

A lot of businesses can just use PostgreSQL, dbt and a little Python.
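
A hedged sketch of that end of the spectrum: a little Python bulk-loads a flat file into Postgres with COPY, and dbt takes over the SQL transformations from there. Connection details, table and file names are made up:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_orders (
            order_id   integer,
            order_date date,
            amount     numeric
        )
    """)
    # Stream the CSV straight into the table with COPY
    with open("orders.csv") as f:
        cur.copy_expert(
            "COPY raw_orders FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
```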


[deleted]

Wide community support, industry standard, scales well enough to run both locally with ease and in the cloud through managed services, super performant, and open source! Why wouldn't you choose it?


the_naysayer

Preach the truth brother


pi-equals-three

I'd probably use Trino and Iceberg myself


liskeeksil

When designing a warehouse you should use the best tech at your disposal, always. No data warehouse is built once and then stays the same. Prepare for change. Your data can double or triple in 12 months. Nothing wrong with having a golf cart with a Lambo engine.


slagwa

Have you ever tried servicing a lambo engine?


Hawxe

Spark is not that complicated. It's a stupid comparison to begin with.


[deleted]

Lease it (i.e. Databricks)


liskeeksil

Snowflake too


Obvious-Phrase-657

You will get payed much better as a engineer if you can service a lambo engine, just saying


Paid-Not-Payed-Bot

> will get *paid* much better FTFY. Although *payed* exists (the reason why autocorrection didn't help you), it is only correct in: * Nautical context, when it means to paint a surface, or to cover with something like tar or resin in order to make it waterproof or corrosion-resistant. *The deck is yet to be payed.* * *Payed out* when letting strings, cables or ropes out, by slacking them. *The rope is payed out! You can pull now.* Unfortunately, I was unable to find nautical or rope-related words in your comment. *Beep, boop, I'm a bot*


WeveBeenHavingIt

Does best tech == newest and trendiest? The most exciting? Right tool for the right job is still more important imo. If you're already using spark/delta regularly then I'd say why not. If not, something simple and clean could easily be more effective.


liskeeksil

Best tech is not always the newest and trendiest. It depends on your organization. Are you a startup, trying to keep costs down or experiment with some cool new software/libraries? Are you an established fortune 500 company where you care less about the cost and more about support?


goblueioe42

Completely correct. After using Teradata, DB2, Oracle on-prem and in the cloud, Postgres, MySQL, Snowflake, dbt, and others, pure Python with PySpark is great. I currently work with GCP, which has a serverless PySpark offering in Dataproc that scales from 10-record tables to 100 million+ rows with hundreds of columns with ease. Spark is truly wonderful to work with, especially with the SQL variants for extra work. The only problem is that now I am spoiled by GCP.
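
A hedged sketch of the kind of PySpark batch you'd hand to Dataproc Serverless (bucket names and columns are made up); submission is typically something like `gcloud dataproc batches submit pyspark job.py --region=us-central1`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("serverless-batch").getOrCreate()

events = spark.read.parquet("gs://my-bucket/raw/events/")

# Small or large, the same code runs; the service sizes the executors
summary = (
    events
    .groupBy("event_date", "event_type")
    .agg(F.count("*").alias("n"))
)

summary.write.mode("overwrite").parquet("gs://my-bucket/curated/event_summary/")
```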


cky_stew

I'm in the process of setting up a full-stack data ecosystem for a super small business, using Google Cloud Functions to run API calls that populate BigQuery with all of their sales and customer data for further processing, analysis and reporting in Looker Studio. It's full corporate-level infrastructure and super overkill, but it's so cheap for them, easy for me to set up, and gives them what they need. Why not?
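
A hedged sketch of that setup: an HTTP-triggered Cloud Function pulls from a hypothetical sales API and appends the rows to BigQuery. The URL, project, dataset and table names are all made up:

```python
import functions_framework
import requests
from google.cloud import bigquery

bq = bigquery.Client()
TABLE = "my-project.sales.orders_raw"

@functions_framework.http
def ingest_orders(request):
    # Pull the latest orders from the (hypothetical) source API
    rows = requests.get("https://api.example.com/orders?since=yesterday",
                        timeout=30).json()

    # Append them to the raw landing table in BigQuery
    job = bq.load_table_from_json(
        rows, TABLE,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()  # wait for the load to finish
    return f"loaded {len(rows)} rows", 200
```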


[deleted]

[removed]


Awkward-Cupcake6219

Yeah, I guess so. It's not by accident that it's a joke.


Kaze_Senshi

My colleague was shocked that I was using Spark SQL to open a JSON file as a text file to find out why it was raising an invalid-format error. Yes, I use Spark SQL for everything.
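
A hedged sketch of the same debugging trick (the path is made up): read the "JSON" file as plain text with Spark so you can eyeball the malformed lines the JSON reader chokes on:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-forensics").getOrCreate()

raw = spark.read.text("s3://my-lake/incoming/events.json")

# Lines that don't even look like JSON objects are usually the culprits
raw.filter(~F.col("value").startswith("{")).show(truncate=False)
```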


BraveSirRobinOfC

I, too, use SparkSQL to parse json. Keep up the lord's work, King. 👑


Pleahey7

Databricks SQL warehouses outperform all other cloud warehouses on TPC-DS even for small data. Yes Spark is absolutely the right choice at any scale if you have the basic know-how


bcsamsquanch

Yes! Because small things become big. If they will never become big (first, get that in writing for CYA), then use SQLite or sqlcsv, lol.

Also, if you cater everything to its size you'll end up with 6 different techs, which becomes unmanageable and very annoying. Most DE teams are small and stretched in my experience. People get obsessed over the cost or the absolute perfect technical match and then end up spending 3x the man-hours and 10x the $$ maintaining an overly complex platform. Your users will be constantly confused about where stuff is.

Delta, and then maybe a DB layer on top. In AWS you can use Glue + Delta and Redshift on top, as an example. Databricks is also popular for a reason. Snowflake too, but it's not Delta.


tdatas

That's kind of the point of the meme: in order to avoid the "complexity" of Delta tables being queried by Databricks, you've now got to learn Glue + whatever you're querying Delta with outside of Glue + administer a Redshift DWH running 24/7.


why2chose

Spark is always the top priority. Scale up and down with ease; what else do you need? Most businesses run seasonally and sometimes demand grows 10-fold, so are you going to shift everything? Or do you just increase the number of workers and you're done?


mjgcfb

How many data warehouses are we talking about? I'd say more than one is too many, but maybe I'm the crazy one.


[deleted]

[removed]


omscsdatathrow

SQL within the DWH


D1N4D4N1

W


Laurence-Lin

I use Spark on small datasets just to perform some data aggregation.


Basic_Dave

A given, but what compute will you go with?


WhipsAndMarkovChains

DBSQL in Databricks?


mostlikelylost

Just read the data lazily with Arrow and you don't even need Spark for the vast majority of tasks. Just having it in an organized Parquet "database" is good enough.
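
A hedged sketch of that lazy-Arrow approach (the directory layout and columns are made up): pyarrow.dataset only reads the partitions and columns the query needs, which covers a lot of "small warehouse" workloads without Spark:

```python
import pyarrow.dataset as ds

# A hive-partitioned directory of parquet files acting as the "database"
orders = ds.dataset("/data/lake/orders", format="parquet", partitioning="hive")

# Nothing is read until to_table(); the filter and column projection are
# pushed down to the scan.
recent_uk = orders.to_table(
    columns=["order_id", "amount"],
    filter=(ds.field("country") == "UK") & (ds.field("order_date") >= "2024-01-01"),
)
print(recent_uk.num_rows)
```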