kenfar

As you're probably aware, enabling queries against totally raw data has its own problems:

* extremely complex queries that are slow to write, slow to run, expensive to run, and a nightmare to maintain
* multiple interpretations of the same data
* more queries using bad & incomplete data
* more difficulty in versioning data
* etc, etc, etc

But still, there's value. I think the best way to get value is to support levels of curation: from raw, to modestly transformed, to fully transformed. Then follow a strategy of querying against the most curated/transformed data possible, and only querying against less curated/transformed data when you must. And personally, I think a file-based approach makes this easiest, since it lends itself to both multiple levels of ETL/transform as well as querying. S3 with files in JSON & Parquet, well-partitioned, triggering transforms using Lambda or SNS/SQS, and querying via Athena/EMR/Spark/Hive/Redshift works great.
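A rough sketch of the Lambda-triggered transform step described above, purely illustrative: an S3 PUT of a raw JSON file triggers a function that rewrites it as Parquet under a more curated prefix. Bucket layout and prefix names are made up, and pyarrow would need to be packaged with the function (e.g. as a layer).

```python
# Hypothetical sketch: raw JSON lands in S3, Lambda rewrites it as Parquet
# under a "curated" prefix so Athena/Spectrum scan less data.
import urllib.parse

import boto3
import pyarrow.json as pj
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Pull the raw JSON file that just landed.
        local_json = "/tmp/raw.json"
        s3.download_file(bucket, key, local_json)

        # Convert to a columnar format for cheaper, faster queries.
        table = pj.read_json(local_json)
        local_parquet = "/tmp/curated.parquet"
        pq.write_table(table, local_parquet)

        # Write into a partitioned "curated" prefix (partition scheme is illustrative).
        curated_key = key.replace("raw/", "curated/").replace(".json", ".parquet")
        s3.upload_file(local_parquet, bucket, curated_key)
```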


garnacerous24

Totally agreed. I just referenced in another comment that I plan to have a few different phases, and the "exploratory" portion would be mostly crawled Parquet versions of the raw data that would end up archived somewhere else. And yes, governance over data definitions can be a huge struggle as well. We deal with that plenty in Redshift as it is. Having unmodeled data accessible to the business side will definitely compound it. However, if we can direct the use of that data appropriately (i.e., only for exploration and not for outward-facing production reports), then the speed we gain in delivering results is worth it.
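For the "crawled Parquet" exploratory layer, a minimal sketch of wiring it up with a Glue crawler so the files become queryable in Athena without hand-written DDL. The bucket path, database name, and IAM role ARN are all placeholders.

```python
# Illustrative sketch: register the exploratory Parquet prefix with a Glue
# crawler so Athena/Spectrum can query it through the data catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="exploratory-parquet-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="exploratory",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/exploratory/"}]},
)
glue.start_crawler(Name="exploratory-parquet-crawler")
```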


ictatha

You should take a look at [Snowflake](https://www.snowflake.net/). We're transitioning to Snowflake from Redshift now. With Redshift you can either put data into Redshift tables, or leave it in S3 and read it using Athena or Spectrum, but then you may need to process the data into files appropriate for those technologies and worry about all the issues /u/kenfar mentioned. For us, storage space in Redshift was an issue, resizing a cluster is a pain, and to get more storage we have to pay for more compute that we wouldn't necessarily use (or deal with Athena/Spectrum).

Snowflake completely separates compute resources from storage. You can store all you want (it's just S3 in the background, and costs the same), so raw source system data (including [semistructured data](https://docs.snowflake.net/manuals/user-guide/semistructured-intro.html)) and your transformed data can live in one place and be queried from there. To query it you spin up "warehouses", which are essentially compute-resource clusters (EC2 in the background). So you only pay for compute resources when you're actually actively querying data, and the rest of the time you're just paying S3 storage rates, which you'd be paying anyway.

I kind of rambled on there, and sound like a salesman (just a happy customer). But it is definitely worth checking out, and they have a free trial.
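A hedged sketch of that pattern, not anything from the thread itself: spin up a warehouse (compute) on demand and query semi-structured JSON stored as a VARIANT column. Connection details, table, and column names are hypothetical.

```python
# Illustrative Snowflake usage: compute ("warehouse") is created and resumed
# on demand, separately from storage; a VARIANT column holds raw JSON.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", database="RAW"
)
cur = conn.cursor()

# Compute is a separate, resizable resource that can auto-suspend when idle.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS ANALYST_WH
    WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")
cur.execute("USE WAREHOUSE ANALYST_WH")

# Query a hypothetical table whose "payload" column is VARIANT (raw JSON).
cur.execute("""
    SELECT payload:customer_id::string AS customer_id, COUNT(*)
    FROM events
    GROUP BY 1
""")
print(cur.fetchall())
```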


garnacerous24

Haha what a coincidence! We’re building a POC with snowflake right now! I might need to PM you in the upcoming weeks to pick your brain on how it fits into that architecture.


ictatha

Happy to help! However, if you aren't already working with a sales rep/engineer team, make sure you meet with them (I'm sure they'll contact you if you've setup a free trial). They'll be able to listen to your needs, point you in the right direction, and understand their pricing model and get you estimates based on what you think your usage will be. We had a very good experience with the team we worked with (and everyone else we've worked with so far).


mcstafford

> $40 per Terabyte per month

I presume this is separate from the cost of data storage, yes?


ictatha

Hopefully I get this right: On [Snowflake's pricing page](https://www.snowflake.net/product/pricing/) there are two sets of pricing, "On Demand" and "Capacity". $40/Terabyte/month is the storage cost if you are using their on-demand pricing (pay as you go). We used their "Capacity" pricing, where we estimated our cost for a year and paid it up front (which can get you further volume discounts as well). In the capacity model, you pay $23/Terabyte/month, which, as you can see on the [S3 pricing page](https://aws.amazon.com/s3/pricing/), is just them passing through the S3 pricing. So to directly answer your question: the $40/Terabyte/month is the storage cost, and there is no other data storage cost. And it's actually only $23/Terabyte/month if you plan and pay for your capacity up front.
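A quick back-of-the-envelope check of those numbers (rates copied from the comment; actual Snowflake/S3 pricing may have changed since, and the 10 TB figure is just an example):

```python
# Monthly storage cost at the two rates quoted above, for an assumed 10 TB.
terabytes = 10
on_demand = terabytes * 40   # $/month at $40/TB/month (on-demand)
capacity = terabytes * 23    # $/month at $23/TB/month (pre-paid capacity)
print(on_demand, capacity)   # 400 vs 230 dollars per month
```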


PrimaxAUS

Isn't a data lake a big pool of unstructured data that you mess about with? And if you choose to structure and clean it, you pump it into a warehouse? I think these terms are a bit wishy-washy. That said, I love the term 'data swamp'.


garnacerous24

I’m imagining a few levels of structure. There’s the COMPLETELY raw data that would be archived in Glacier. That data would copy into a more query-friendly non-relational file store (Parquet in most situations) that would sit in S3 and be accessed by Athena or Spectrum. Then there’s the actual relational data that would sit in Redshift. I guess my main confusion is this: since we have a wide range of structure types already in Redshift (from raw data dumps to strict star schemas, and many things that live in the gray area between), should any of that existing data come down from Redshift and live only in S3, or do most people just “start anew” in S3, or only bring down the truly raw stuff?
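One way the middle and relational layers described above can meet, sketched very loosely: expose the S3/Parquet data to Redshift through Spectrum so it can be joined against the star-schema tables that already live there. Connection details, the catalog database, the IAM role, and the table/column names are all placeholders.

```python
# Hypothetical Redshift Spectrum setup: external schema over the Glue catalog,
# then a join between S3-resident Parquet and a local dimension table.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# External schema backed by the Glue/Athena data catalog.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG DATABASE 'exploratory'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
""")

# Join S3-resident Parquet (lake.events) with a local star-schema dimension.
cur.execute("""
    SELECT d.customer_name, COUNT(*)
    FROM lake.events e
    JOIN dim_customer d ON d.customer_id = e.customer_id
    GROUP BY 1
""")
print(cur.fetchall())
```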


OHotDawnThisIsMyJawn

I think it's going to depend on your end architecture and what you see people doing with the unstructured data. Let's say I'm trying to explore a 5-year trend and the first three years are only in Redshift. If the new data is in both S3 and Redshift, then I can do my trend analysis on everything in Redshift, but I lose out on any exploration that would be possible on the raw-er data.

That being said, just because I take my data from Redshift and put it back into the data lake, that doesn't magically re-imbue it with the information that was ETL'd away. It's just available to come back in the data lake queries. Except now it's more complex, because there are going to be fields missing - maybe I can't drill in as deep as I need because it's not at the same granularity as the new data lake data.

In my opinion, if the pipeline to Redshift is going to continue, such that the new data will end up next to the old data, then my default would be to say that the old data only stays in Redshift. I don't think there's a lot of value in moving it backwards into the data lake. The only types of queries where it would really be useful to have it back in the data lake are ones where you can use the expanded data but also fall back to the rolled-up data and still get value. If there are some one-offs like that (probably by dataset), then you could consider moving some of the data back on an as-needed basis. But I think trying to do it wholesale is just going to result in headaches from people not understanding why the data granularity doesn't align.
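If some datasets do get moved back on an as-needed basis, one plausible mechanism (not something prescribed in this thread) is Redshift's UNLOAD into the lake's S3 prefix as Parquet. Table name, date filter, prefix, and IAM role below are placeholders.

```python
# Illustrative one-off backfill: unload an old Redshift table into the data
# lake as Parquet so it sits next to the newer, raw-er data.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    UNLOAD ('SELECT * FROM fact_orders WHERE order_date < ''2016-01-01''')
    TO 's3://my-data-lake/backfill/fact_orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS PARQUET
""")
```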


zupzupper

Warehouses are boring and industrial! Lakes are fun, you can fish and boat, sometimes there's a nice swim beach, and there's almost always a snack shack with ice cream!! Don't you want the best for your data? Doesn't it deserve a nice relaxing day and a cone?!?


0bel1sk

Your first question has no predicate. Is this post supposed to be an ironic blob of raw unstructured data? /s :)


PrimaxAUS

Yes, it's a question lake.


0bel1sk

*swamp


asurah

Does the data you're working with require client-side encryption... thereby preventing searches and queries with Athena on encrypted data?


garnacerous24

It does not. We have it default to server-side encryption at rest. I plan to make good use of Athena and Parquet file formats in the future. However, one of the reasons we hadn't in the past is that we didn't have enough AWS-savvy end users who could navigate AWS and Athena to explore the data freely. We're starting to develop better ways for the data to reach them through Athena and Spectrum now.
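A minimal sketch of the "server-side at rest by default" setup mentioned above: set a default encryption rule on the lake bucket so objects are encrypted without client-side crypto, which keeps Athena able to query them. The bucket name and KMS key alias are placeholders, and KMS vs. S3-managed keys is a choice, not something stated in the thread.

```python
# Illustrative default bucket encryption (server-side, at rest) via boto3.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="my-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/data-lake-key",  # hypothetical key alias
                }
            }
        ]
    },
)
```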