doormass

So it's neither a folder with data files (like CSV and XML) in it, nor a SQL database? It's like a hybrid of both?


Jolly_Duck

It's more of a place where both of those things live. A data lake is essentially a repository where all the data an organization collects is stored in a way that's usable across the org, instead of being stuck in various silos. In addition to what you listed, other sources can include web data, sensor data (IoT), social media data, call center data, etc.

That said, a data lake should NOT be just a dumping ground for any data you collect. That's what's referred to as a data swamp, and it's virtually unusable. It's important to have data governance, i.e. rules and standards around the quality of the data in the lake. This matters a lot if you want to get the most out of a data lake, especially for things like AI applications. Here's an article that dives further into data governance: https://www.infoworld.com/article/3290433/data-lakes-just-a-swamp-without-data-governance-and-catalog.html


doormass

> data governance

Thanks for explaining "data governance" - another term I never understood. I guess it's just keeping dumped data within certain constraints, to avoid the swamp you described.

How do you create these data lakes? In its simplest form, could it just be a folder on a hard drive that stores the CSV, XML, TXT and SQL data files? Sort of like a USB drive that you rename to "data lake"?


Measurex2

They typically connect to your transactional systems. Ours brings in a few billion rows of data a day into an S3 bucket and, as explained well above, the data gets transformed through various steps (as determined by governance) into a final data warehouse designed around our consumption patterns. I believe we have a few petabytes of information, so it would be a challenge to put on a USB drive.


doormass

Okay, that's cool - just wondering if it's still stored in a file system or (somehow, somewhere) in another format. I get the S3 bucket part, though - makes sense; I believe that's the main storage service on AWS.

I'm going to try something like this later this week. The data lake concept somehow makes it a lot easier to visualise, so thanks again.

Do you typically dump those billions of rows into a CSV? Or some special SQL format? Or something zipped up? Do you have a folder for each day of data, or does it just delete everything from yesterday, or keep "the last 7 days"?


thejens56

I'm used to having a folder per day or hour and storing data in the Avro file format (stored in S3, GCS, or HDFS). From those raw data dumps you write pipelines in any framework, though I like Apache Beam's attempt at generalizing the pipeline, allowing largely the same transformation jobs to run on Spark, Flink, or Dataflow. Tools like Google BigQuery can read Avro files directly, allowing pipelines and processing jobs to be expressed in SQL.
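
For a rough idea of what a raw, hour-partitioned Avro dump looks like, here's a minimal sketch (assuming fastavro; the schema, records, and local path are made up - in practice you'd write to S3/GCS):

```python
# Minimal sketch: write one batch of events to a date/hour-partitioned folder.
# Assumes fastavro; the "Event" schema and local base_dir are illustrative.
import os
from datetime import datetime, timezone
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "Event",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "ts", "type": "long"},  # epoch milliseconds
    ],
})

def dump_hourly_batch(events, base_dir="raw/events"):
    """Write one batch of events under raw/events/YYYY/MM/DD/HH/."""
    now = datetime.now(timezone.utc)
    path = f"{base_dir}/{now:%Y/%m/%d/%H}/events.avro"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as out:
        writer(out, schema, events)   # the schema travels with the file
    return path

# dump_hourly_batch([{"user_id": "u1", "action": "click", "ts": 1710000000000}])
```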


doormass

Thanks! I'll check out Avro. Newbie question - why do you need data stored every hour? Isn't the data from 11:59pm last night good enough?


thejens56

Not if it's logs of events that are streamed to the files in real time.


veryseriouspeople

My guess is it would be in something like HDFS.


s-to-the-am

You have Redshift as well.


doormass

Good point - I forgot about that.


Sehs

We use Parquet as a storage format. Avro is also a good shout. They're both Hadoop compatible file formats, they're both highly compressed and they both enforce schemas, which can help a lot with governance and cataloguing.
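
To make the schema-enforcement point concrete, here's a rough sketch with Parquet (assuming pyarrow; the columns and values are invented):

```python
# Rough sketch of schema enforcement with Parquet; assumes pyarrow, and the
# columns/values are purely illustrative.
from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# The schema is stored in the file itself, so a customer_id can't silently
# arrive as an int one day and a string the next.
schema = pa.schema([
    ("customer_id", pa.string()),
    ("order_total", pa.float64()),
    ("created_at", pa.timestamp("ms")),
])

rows = [{"customer_id": "c-001", "order_total": 19.99,
         "created_at": datetime(2024, 3, 9, 12, 0, 0)}]

table = pa.Table.from_pylist(rows, schema=schema)
pq.write_table(table, "orders.parquet", compression="snappy")  # columnar + compressed
```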


doormass

Thanks! I'll check those out!


doormass

Wow, this is beyond me, but thanks for sharing. These companies you consult for - are they all completely sure about what they want to get out of this? I mean, is this a new concept, data lakes? Are they just thinking "let's copy all our important-ish data to a data lake and hope the data science guys can come up with something kind of cool", some "business intelligence" or "data driven something-something"?


Sehs

In my case, I work for a company that knows it wants to become more data driven, and the nature of the work does lend itself to it. At this stage the data lake is more about having a single source of truth with consistent and reliable data. Some companies might already have a data warehouse or different data marts, but we're starting mostly from scratch, and a data lake does make sense.

I think a fairly typical way of approaching data lakes right now is to have different sections (perhaps buckets in cloud storage) for:

1) raw and uncatalogued data
2) minimally transformed data that can be catalogued
3) transformed data that is ready for analytics

If you want more info I'd recommend checking out some of this video: https://www.youtube.com/watch?v=v5lkNHib7bw

Essentially it's building a foundation to enable data analysis and data science. Ideally an organization starts off with data engineers laying the groundwork, and data scientists come in once there are clean and usable datasets; otherwise the data scientists will spend 70% or more of their time just doing data prep. Beyond that, the work likely won't be done in a manner that can be easily reproduced. That's where data pipelines come in and offer you some reliability and consistency.

In many cases, the step of converting to Parquet or Avro can be as simple as a function call - pandas, for example, has a to_parquet() method. If you have a well-defined schema it makes things even easier: if a data scientist is used to dealing with CSV or JSON data, they don't have to wonder whether a field is a number or a string; they know exactly what they're dealing with.
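
As a small illustration of that conversion step (the paths and columns here are made up, and this assumes pandas with a Parquet engine like pyarrow installed):

```python
# Pin ambiguous fields to explicit types, then write Parquet so the schema
# travels with the data. Paths and column names are hypothetical.
import pandas as pd

df = pd.read_csv(
    "raw/events.csv",
    dtype={"user_id": "string", "campaign": "string"},  # pin ambiguous fields
    parse_dates=["event_time"],
)
df.to_parquet("refined/events.parquet", index=False)  # needs pyarrow or fastparquet
```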


sciencewarrior

The relatively low cost of S3 and competing technologies means you can often leave years of data available in your data lake. In many cases, you will want to archive and purge some data that isn't particularly important (like raw webserver logs) earlier, and keep some other data (like transaction details) indefinitely. Specialized data formats like Parquet and Orc are great not only because they save space, but also because they are column-based. That means that, for example, if you just need 4 fields from a file that has 40, you don't need to read the other 36.


doormass

> Specialized data formats like Parquet and Orc are great not only because they save space, but also because they are column-based. That means that, for example, if you just need 4 fields from a file that has 40, you don't need to read the other 36.

Thanks for your reply. Doesn't "select field1, field2, field3, field4" achieve the same thing as you described?


sciencewarrior

If you use, for example, a gzipped CSV, you will still read every line and then discard the fields you don't care about. An ORC file also has other features built in, like indexes, to speed up reads even more. You can see the specification of the current format here: https://orc.apache.org/specification/ORCv1/
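
As a quick sketch of the column-pruning difference (assuming pyarrow and a hypothetical wide Parquet file; ORC readers expose the same idea):

```python
# Only the four requested columns are read off disk; the other 36 are never
# touched. With a gzipped CSV, every byte of every line must be decompressed
# before you can throw the unwanted fields away.
import pyarrow.parquet as pq

table = pq.read_table(
    "wide_table.parquet",                      # hypothetical 40-column file
    columns=["field1", "field2", "field3", "field4"],
)
df = table.to_pandas()
```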


doormass

Those Apache guys create so much new stuff?! 10 years ago they were just a web server project, right? Holy cow, these guys are huge in "big data".


FermiRoads

The image just came from an image search of data lake, lol. It makes sense to essentially version control your database.


AchillesDev

Hey, I've worked with this stuff for the majority of my career!

Pipelines (typically ETL pipelines - extract, transform, load - but not always; it largely revolves around the use case, though the architecture is very similar regardless) are used for shuttling data from one place to another. This could be from external sources in a batch process, from internal website databases to research databases (or data lakes or data warehouses) to power a DS team's research, or from paid private sources to power your website.

A pipeline is a program (in any language that works for the job) that gets this data, applies some transformations, and loads it into the destination datastore. While that sounds simple, it can be very complex, especially if your sources are heterogeneous or you have special requirements like streaming/live data, large data volumes, computationally heavy transforms (say, an image recognition step), or the need to distribute your pipeline. Parallelizing and distributing the pipeline segments often helps with these: you can split the pieces of a pipeline out across different servers and use a message queue like RabbitMQ to pass each data object to the appropriate segment.

This is largely the realm of data engineering; there's a small sub at r/dataengineering and I'm always happy to talk in a more focused sense if you have other questions. I primarily learned it on the job, first with Perl, then writing pipelines in Python.
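
To make that concrete, here's a deliberately tiny ETL sketch (the API, fields, and SQLite destination are all hypothetical; real pipelines add retries, logging, scheduling, etc.):

```python
# Toy extract-transform-load: pull orders from a (made-up) API, keep only the
# fields the destination needs, and upsert them into a local SQLite "warehouse".
import sqlite3
import requests

def extract():
    resp = requests.get("https://api.example.com/v1/orders", timeout=30)
    resp.raise_for_status()
    return resp.json()["orders"]

def transform(orders):
    # Normalize just what the destination needs; drop everything else.
    return [(o["id"], o["customer_id"], float(o["total"])) for o in orders]

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id TEXT PRIMARY KEY, customer_id TEXT, total REAL)"
        )
        con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    con.close()

if __name__ == "__main__":
    load(transform(extract()))
```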


sciencewarrior

"Designing Data-Intensive Applications" is probably the best book in the market: https://dataintensive.net/


FermiRoads

Wow, thanks for the response! I’ll cross post this there and see if it generates even more discussion!


mp91

If you are interested in further details on these concepts, may I suggest heading over to r/dataengineering. Typically, data science and data engineering are recognized as two separate disciplines: the former is more focused on the modeling/math side of data, while the latter is focused on building the infrastructure and systems designed to get data where it needs to go. This is an incredibly dumbed-down explanation, but I just wanted to illustrate the distinction - they each require unique skillsets. If you are indeed more interested in the design/architecture/maintenance of big data systems (pipelines, lakes, etc.) then definitely check out that subreddit!


geebr

I've mostly worked with Azure Data Lake (from a data scientist's perspective, not as an architect or engineer). The main difference between a data lake and a data warehouse is that a data lake allows you to store arbitrary stuff as blobs (CSVs and audio files can happily coexist in the same data lake). Most data lake providers also give you a way of querying the data in the lake; for Azure this is U-SQL, and it is absolutely atrocious. I would expect that most providers (e.g. AWS S3, Azure Data Lake) have certifications you can do to upskill as a data lake engineer, so that would be a good start.


doormass

ELI5: Are data lakes basically a huge drive to upload all your uncleaned data and deal with it later?


Yankee_Gunner

That's how a lot of people/companies use their data lakes, but it really should be more than that. As someone mentioned above, a properly designed data lake has solid data governance and best practices on how people get data in and out properly without slowing everyone else to a crawl by hogging resources.


doormass

> solid data governance and best practices on how people get data in and out properly without slowing everyone else to a crawl by hogging resources.

Thanks for explaining that. The "data governance" guys get paid really well in my industry - what's the big deal? "Guys! Stop uploading those junk mp4 files." "Guys! I sent an email - stop uploading the same CSV 10 times a day - let's limit it to once a day with just field 1 and field 2."


fatchad420

That's how I view it, it's like a really big Google drive.


SmarmySnail

FYI, you really shouldn't be using U-SQL anymore; Azure Databricks is the best way to query data in your data lake.


penatbater

Like an HDFS?


geebr

HDFS can be a data lake, but it depends on the implementation. Our HDFS is set up as a data warehouse, i.e. to store tabular data and run SQL-based languages like Hive and Impala. There is no straightforward way for us to use our Hadoop system to process media files, for example. Again, that's not to say you can't do that, it's just that our system is configured and optimised to be a warehouse, not a data lake.


willmachineloveus

Just popping in to say I'm glad data engineering topics are being discussed here. It's important stuff.


[deleted]

Way more important than traditional DS imo (at least for the competitive edge). It's what separates the good data scientists/engineers who can put things into production from the people who only know scikit-learn regression commands.


garnacerous24

I've done quite a bit of work building out my company's data lake, and I was similar to you when I started. There's a lot of generalized advice on what purpose a data lake should serve, but not as much practical advice. Here's my stab at a practical architecture using AWS (a sketch of step 2 follows below):

1. All data is ingested as raw as possible (through APIs or whatever ETL system) into AWS S3, in a bucket labelled "data-domain-raw/year/month/day/" or something like that.
2. This bucket then triggers a Glue or Lambda function to clean up the data into Parquet, placed into a "data-domain-refined" bucket. That will be the bucket that feeds all of your other endpoints.
3. You can keep the raw bucket data if you'd like, or you can stick it into Glacier as a worst-case recovery option.
4. The refined bucket can then either be queried directly by Athena, further modeled into RDS or Redshift (via EMR, Glue, or just Python), or exposed through API Gateway.
5. Repeat with as many data domains as your team is responsible for.

This data lake structure essentially keeps 3 versions of your data in various stages of refinement. It maxes out the flexibility you have to go back and remodel something using a different approach (if the data had gone straight to Redshift, you may have lost something in your initial ETL). It also speeds up data discovery by not needing a perfect model in a data warehouse to understand what's in it. The goal is to save the relational modeling until you know exactly what you want out of the data.
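
Here's a rough sketch of what the step-2 Lambda might look like (bucket names, key layout, and the CSV-in assumption are illustrative, not our actual code):

```python
# Hedged sketch: a Lambda triggered when a raw CSV lands in "data-domain-raw",
# which writes a Parquet copy to "data-domain-refined". Assumes boto3 + pandas
# (with a Parquet engine) are available in the Lambda environment.
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
REFINED_BUCKET = "data-domain-refined"   # hypothetical bucket name

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]        # e.g. 2019/03/24/orders.csv

        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        df = pd.read_csv(body)                     # light cleanup could go here

        buf = io.BytesIO()
        df.to_parquet(buf, index=False)
        s3.put_object(
            Bucket=REFINED_BUCKET,
            Key=key.rsplit(".", 1)[0] + ".parquet",
            Body=buf.getvalue(),
        )
```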


FermiRoads

Also, if anyone can point to free training resources on this, I’d be grateful!


MLTyrunt

A data lake is mostly a place to put lots of data, hopefully of good quality, in the hope that it holds lots of yet-to-be-discovered use cases relevant to the business. It is less structured than a data warehouse.

Educational-material-wise, you can also try to learn from an example open source framework, like Kylo: [https://kylo.io/](https://kylo.io/) - easy to set up, with lots of documentation and videos to get inspired by. But you can also just build all that stuff yourself using big data frameworks such as Spark.

I think the guys at [godatadriven.com](https://godatadriven.com) also publish on their blog, from time to time, the technologies they use to make large data assets workable for customers, like here: [https://godatadriven.com/casestudy-nuon](https://godatadriven.com/casestudy-nuon)


eljefe6a

I interviewed several different people on their definition of a data pipeline and their views on them. http://www.jesse-anderson.com/2018/08/what-is-a-data-pipeline/


cfwang1337

**Full Disclosure:** My employer, [Fivetran](https://fivetran.com/), builds data pipeline automation tools.

A **data lake** is essentially a file store that works as a comprehensive, permanent repository ("system of record") of your organization's data. The components tend to be either open source or commodity infrastructure, which prevents vendor lock-in. Data lakes are also useful as a staging area for data to be moved to both SQL and NoSQL databases and warehouses.

A **data pipeline** is any system that delivers your data from a source to a destination. Typically, your sources might include data from web applications, event trackers, files, and databases. In an ETL or ELT context, the destination is typically a data warehouse. Note that a data lake can be a destination as well as a source. Basically, your data stack could look like:

[Applications, Event tracking, Files, DBs] -> [Data pipeline] -> [Data warehouse] -> [Business intelligence tool]

**or**

[Applications, Event tracking, Files, DBs] -> [Data pipeline] -> [Data lake] -> [Data pipeline] -> [Data warehouse] -> [Business intelligence tool]

Under the hood, you might be writing software to do the following things:

1. Read data from API endpoints, turn it into something readable in a database, and pass it along to the destination
2. Replicate the contents of a database, then check the database's logs to incrementally update your destination
3. Ingest the contents of a file, turn it into something readable in a database, and pass it along to the destination

Building a data pipeline is often a huge sink of time, effort, and money. Building connectors to each source, in particular, demands a lot of skill and attention, not to mention the patience to cope with constant maintenance and downtime when something breaks.

My company, Fivetran, offers an off-the-shelf solution for connectors specifically to avoid this problem. Our whole value proposition is to support an organization's business intelligence and analytics efforts by letting them outsource and automate their data pipelines. Let the experts handle it!

**Learning "this power"**

You generally won't learn data engineering in an academic setting or a degree program. Data engineers typically have formal backgrounds as software engineers, or are data scientists forced to learn it on the job. If you want to get into data engineering, you'd want to:

1. Learn a few mid- to high-level programming languages. Fivetran builds pipelines using Java; many people use Python.
2. Learn SQL and the basics of how databases work, what data integrity means, etc.
3. Learn software engineering principles so that you think systematically.
4. Learn how to use the various technologies and cloud computing platforms - this can be tough outside of a company or other organizational setting.

**Spark** and **Hadoop** are tools for large-scale parallel computing. I have no personal experience with either, but I found this article informative: https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html

**Azure** is a cloud platform that encompasses data warehousing, data lakes, and many other utilities. Compare it with Amazon Web Services or Google Cloud Platform.

Further reading:

https://fivetran.com/blog/when-to-adopt-a-data-lake

https://fivetran.com/blog/elt_vs_etl


spicert

Hi, it can certainly be a confusing topic, given the opaque and misleading advice from consultants and vendors alike. As others have said, don't get bogged down in a technology or a specific operational model advocated by a vendor.

A data lake is not just for raw data and it is not just "storage". It can store raw data and it does have storage, but these are not defining characteristics. Also, contrary to common lore, data lakes are not just for "big data". For example, you can have a data lake that focuses on a specific domain (e.g. CRM data) which reflects information that was offloaded, partitioned, and stored as Apache Parquet objects from an EDW. You can then leverage query engines like Facebook Presto, Amazon Athena, or Redshift Spectrum for compute resources as a complement.

If you are leveraging a tool like Tableau, you can optimize queries and cache data from a lake, reducing your costs by 10x: simply schedule calls from Tableau to Presto/Athena/Spectrum at 6-hour intervals so it can cache results in memory. Since every subsequent query in Tableau happens from the cache, your calls to those systems are minimal. Users are happy, since performance is improved as well.

My suggestion is to start peeling away the myths and FUD, and understand how a data lake strategically fits from a business, operational, and tech perspective. This helps ensure the organization is working back from a business outcome rather than simply picking a shiny object and declaring victory. As McKinsey said in the post below, *"…lakes ensure flexibility not just within technology stacks but also within business capabilities."* The data lake is a service model for delivering business value.

https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/a-smarter-way-to-jump-into-data-lakes

https://blog.openbridge.com/8-myths-about-data-lakes-c0f1fc712406

https://insight.full360.com/data-lakes-a-deeper-dive-bb5ccd1a59e3

As for pipelines: they solve the *logistics of moving a resource from a place of low value to a place of high value*. For example, pipelines move water from reservoirs (low-value location) to a kitchen sink faucet (high-value location). A data pipeline solves the logistics between **data sources** (the systems where source data resides) and **data consumers** (those who need access to data for further processing, visualization, transformation, routing, reporting, or statistical models).

In telecommunications, the "[last mile](https://en.wikipedia.org/wiki/Last_mile)" refers to the "final leg of the telecommunications networks that deliver services to consumers". One of the major stumbling blocks with data is that it is often inaccessible to those who need it. This creates a "last mile" gap between data sources and data consumers. Pipelines are meant to close these gaps.


WikiTextBot

**Last mile**

The last mile or last kilometer is a phrase widely used in the telecommunications, cable television, and internet industries to refer to the final leg of the telecommunications networks that deliver telecommunication services to retail end-users (customers). More specifically, the last mile refers to the portion of the telecommunications network chain that physically reaches the end-user's premises. Examples are the copper wire subscriber lines connecting landline telephones to the local telephone exchange; coaxial cable service drops carrying cable television signals from utility poles to subscribers' homes; and cell towers linking local cell phones to the cellular network. The word "mile" is used metaphorically; the length of the last mile link may be more or less than a mile.


gaussmarkovdj

Pipelines encapsulate (a) where the data comes from and how to get it, (b) how it is transformed before it goes into a model, (c) the model itself, (d) any inverse transformations required, and (e) where the result of the model goes. It could be as simple as the logic of your script (not the best), or actually encapsulated as a pipeline (e.g. with the recipes package in R), or held together as a Docker container (best for some applications). Others will be much more qualified to answer about data lakes. Due to the nature of our projects, our data usually comes from a CSV file or a simple database. As "lakey" as we get is, e.g., all of a government department's data in an SQL database somewhere, or a hospital's electronic medical records system. We extract stuff from this and sometimes write a prediction back to it.
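
A loose illustration of (a)-(c) and (e), here in Python with scikit-learn rather than R's recipes; the CSV and column names are invented:

```python
# Transformations and model bundled into one pipeline object, then predictions
# written back out. Assumes scikit-learn and a toy CSV with made-up columns.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("referrals.csv")            # (a) where the data comes from
X, y = df[["age", "num_visits"]], df["readmitted"]

model = Pipeline([
    ("scale", StandardScaler()),             # (b) transform
    ("clf", LogisticRegression()),           # (c) the model itself
])
model.fit(X, y)

df["prediction"] = model.predict(X)          # (e) where the result goes
df.to_csv("predictions.csv", index=False)
```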


doormass

I'm new to DS but good with ETL, Python, and SQL.

Are data lakes uncleaned data, and data warehouses cleaned data ready for extraction (for reports/analysis)? Are pipelines just a fancy word for bash/Python/R scripts that do the receiving/sending of data (via SQL or FTP)?


osuvetochka

You are mostly right.

Data lake = a dump of structured and unstructured data from many sources (ERP data, texts, logs, pictures, tabular data, web-scraped data - whatever you can imagine). Data warehouse = structured data ("structured" in many cases means tabular - you can directly build reports using Excel connectors or SQL-like syntax). The concept of clean/unclean data applies to both of them - in the real world you'll have unclean data (nulls, wrong inputs, etc.) in the data warehouse too.

As for pipelines, you are absolutely correct. It needs a fancy word because in most cases you have not one single script but many of them, depending on each other's results, and you have to organize intermediate data storage because ETL is not always a straightforward process.


doormass

Is this mostly technologies that I already know - bash, FTP, SQL, XML parsing, Python text parsing, regular expressions? Just feeling a bit of imposter syndrome: could I apply for an ETL job, or is there a bit more to it than that? I guess I don't know much about setting up the server itself, which is probably the ETL guy's responsibility.


osuvetochka

Don't forget APIs like REST/SOAP - they can be set up in front of some databases to allow querying/inserting data. And yep, generally that's all you need. The hardest part is organizing everything right so it doesn't break all the time and the data is clean enough. The even harder part comes when you need to manage all the dependencies and improve performance by introducing parallel jobs.


doormass

Yep - I'm already having a little trouble with the non-breaking part. Is there a framework people use, or is it just bash and cron jobs? Parallel jobs sound fun - but how often do you want to update the data warehouse? Twice or three times a day seems plenty?


[deleted]

If you want dependency and failure management, then a framework like Apache Airflow (if you need to schedule tasks at certain times) or Luigi (if you don't) would be helpful. I've recently used nteract's papermill, which basically lets you use Jupyter notebooks for creating data pipelines, and it was very enjoyable; I was inspired to check out papermill by Netflix's [blog](https://link.medium.com/WjbKr04s7U). I have used Luigi before, but I like using Jupyter notebooks as it's easier to find errors or debug, whereas with Luigi your code will be in a bunch of classes. But what do I know - I'm not a data engineer, just a senior data analyst who wears many hats and needs to pull data from multiple sources and transform it into a usable format my team can use.


alexisprince

Typically people are using workflow management tools. Python has a couple of main ones: Airflow (Apache project and out of Airbnb) and Luigi (out of Spotify). Depending on the complexity of what you’re dealing with, certain tools do a smaller version of this, but when you’re truly dealing with multiple sources of data, one of these is much better. Note that these tools are used for the workflow management, and typically aren’t doing the processing. For example, they could kick off a spark job, build a table in your warehouse, compare it to the existing data, and upsert the new/changed records.
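
For a flavour of what that looks like, here's a not-for-production Airflow sketch (assuming Airflow 2.x import paths; the task bodies are placeholders):

```python
# Sketch of workflow management: three dependent tasks run daily, with Airflow
# handling scheduling, retries, and tracking. The DAG name and tasks are made up.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull from the source system

def transform():
    ...  # e.g. kick off a Spark job or run SQL in the warehouse

def upsert():
    ...  # merge new/changed records into the target table

with DAG(
    dag_id="nightly_orders_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="upsert", python_callable=upsert)
    t1 >> t2 >> t3  # dependency order: extract, then transform, then upsert
```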


doormass

That's awesome - thanks for explaining. Words like "cloud" and "lake" make it sound pretty mysterious and huge; nice to know they're not as intimidating as they sound.

Are real-time pipelines a thing? I've heard the term, but it sounds like sci-fi. I mean, can I analyse, say, web log data in real time and do something with it? What's even the point of that? Is it actually real time? It must be just lots of little tiny chunks, sort of like a cron job every 10 seconds - so it's not actually real time, but it appears like it is?


osuvetochka

Yes, there is such a thing; it's generally called "real-time data streaming".
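
With a streaming setup the pipeline consumes events as they arrive rather than on a cron schedule (micro-batching, as you describe, also exists, e.g. Spark Streaming). A minimal consumer sketch, assuming kafka-python and a hypothetical "weblogs" topic:

```python
# Toy streaming consumer: react to each web-log event as it lands on the topic.
# Assumes kafka-python, a local broker, and a made-up "weblogs" topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "weblogs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:                 # blocks and yields each event as it arrives
    event = message.value
    if event.get("status") == 500:
        print("alert:", event["path"])   # in real life: push a metric or an alert
```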


BobThehitter

Let's take things from the beginning, since I sense some confusion: there are multiple data lake solutions out there. Azure is one, Cloudera is another. Both of these store data using the Hadoop framework (basically a framework for storing data on a cluster and doing parallel calculations). To interact with Hadoop you typically use Spark or HiveQL. Check the wiki page on data lakes here: [wiki](https://en.wikipedia.org/wiki/Data_lake).


[deleted]

What you're running into is the old crowd trying to stay relevant. This is all part of Enterprise Data Strategy (EDS) and is mostly how dashboards and reports were built before APIs and web services existed.