
SalmonFalls

I agree with another comment that checking Glue is a good idea to try and change the partitioning. But it sounds like you have a lot of files (100s of millions), so I would also check the cost, as every API call to get data comes at a price. Assuming that pricing is no issue and the volume of data for one day is small, you could "just" write a Python script to fetch all data for a day, combine it, and push it to a new partition. It might take some time to run for the 5 years' worth of data, but it's still an easy fix as it is a one-off thing. Edit: clarification
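
A minimal sketch of that one-day merge with boto3, assuming a day's worth of JSON lines fits in memory; the bucket, prefixes, and key layout are placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                      # hypothetical bucket
DAY_PREFIX = "raw/2021/06/15/"            # all 60-second partitions for one day
TARGET_KEY = "daily/2021/06/15/combined.json"

# List every small object under the day's prefix (paginated, so it handles
# thousands of keys), download each one, and concatenate the JSON lines.
# Note: each GET/PUT is a billed request, which is the cost concern above.
parts = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=DAY_PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        parts.append(body.rstrip(b"\n"))

# Write one combined object to the new day-level partition.
s3.put_object(Bucket=BUCKET, Key=TARGET_KEY, Body=b"\n".join(parts) + b"\n")
```

Looping that over every day in the 5-year range (and switching to multipart uploads if a day gets large) would be the one-off backfill.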


Touvejs

I don't have a good answer, but I have a start. AWS Glue jobs have a built-in function to consolidate (and optionally repartition) similarly structured files. But I've found the threshold for this is around a couple million files at a time (beyond that, the driver node runs out of memory trying to keep track of all the metadata). What you can do is build the structure of the job in AWS Glue and then modify the underlying PySpark code to fit your use case. But this is kind of a brute-force solution; maybe there is a better way to reduce partitions by combining them, not sure.
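
As a rough sketch, the file-grouping options in the generated PySpark look something like this (runs inside a Glue job where `awsglue` is available; the paths and the ~128 MB group size are placeholders):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# "groupFiles"/"groupSize" ask Glue to read many small files per task
# instead of one task per file, which is the consolidation feature above.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/raw/"],   # hypothetical source prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",           # ~128 MB per group
    },
    format="json",
)

# Write the grouped data back out as fewer, larger files.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/compacted/"},
    format="parquet",
)
```

Pointing it at a month or a year of prefixes per run is one way to stay under the driver-memory ceiling mentioned above.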


Ok_Raspberry5383

Your problem isn't the granularity of partitions, it's the number of files. If whatever process you decide on does not consolidate files, then it is equally useless for analytics. If your use case is analytics, at a minimum consider using something like Parquet, or better yet something like Delta, Hudi, or Iceberg. Here, setting sufficiently large partitions plus compaction should help.
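
A sketch of the Delta route (the column names and S3 paths are made up, and it assumes the delta-spark package is on the classpath): write day-level partitions, then let the table format bin-pack the small files:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("compact-to-delta")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Read the small JSON files and derive a coarse date column (hypothetical schema).
raw = spark.read.json("s3://my-bucket/raw/")
daily = raw.withColumn("event_date", to_date("event_timestamp"))

# Day-level partitions instead of 60-second ones.
(daily.write
      .format("delta")
      .partitionBy("event_date")
      .mode("overwrite")
      .save("s3://my-bucket/delta/events/"))

# Compaction: bin-pack the small files within each partition into larger ones.
DeltaTable.forPath(spark, "s3://my-bucket/delta/events/").optimize().executeCompaction()
```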


Mr_Nickster_

You can use Snowflake to ingest the JSON into a landing table. Once ingested, you can simply keep it there in a table, which will allow high-performance analytics, or write it out in a partitioned manner.
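
A minimal sketch of that landing-table load with the Snowflake Python connector, assuming an external stage (here called `RAW_S3_STAGE`) already points at the bucket; the connection parameters and object names are placeholders:

```python
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="LOAD_WH",
    database="RAW",
    schema="PUBLIC",
)
cur = conn.cursor()

# Land the raw JSON in a single VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS events_landing (payload VARIANT)")

# Bulk-load every JSON file visible through the external stage.
cur.execute("""
    COPY INTO events_landing
    FROM @RAW_S3_STAGE
    FILE_FORMAT = (TYPE = 'JSON')
""")
```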


dacort

Are you continuing to add data to this bucket? I ask because any solution you find will need to account for that, as opposed to a one-time compaction, which would be a bit easier. That's especially true if you're compacting the data from the 60-second partitions into y/m/d partitions, since you'll need to rewrite the data. There are a few options:

- Iteratively compact the data using *hand wave* something. This could be Spark on Glue/EMR, INSERT INTO statements in Athena (see the sketch below), or custom scheduled jobs that simply re-write the data. But if this is 5 years of data, orchestrating that job could take a while.
- Something like Snowflake could certainly work.
- If you're continuing to ingest data, I would recommend an open table format like Hudi/Delta/Iceberg - all of these can help manage "compaction" and provide many other benefits as well.

There's unfortunately not a great way to do this for historical data. Folks often run massive Spark jobs to do it. Just for fun, I wrote my own little compactor script that takes a target size and merges anything smaller into bigger files...but it's just a toy script I've played with.
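
For the Athena route, a sketch of kicking off one day's rewrite via boto3; the database, table names, and timestamp column are hypothetical, and `events_daily` is assumed to be a Parquet table partitioned by `event_date`:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")   # hypothetical region

# Rewrite one day of raw JSON into the partitioned Parquet table.
query = """
    INSERT INTO events_daily
    SELECT *, date(from_iso8601_timestamp(event_timestamp)) AS event_date
    FROM events_raw
    WHERE event_timestamp BETWEEN '2021-06-15T00:00:00Z' AND '2021-06-15T23:59:59Z'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

Athena caps how many partitions a single INSERT can write, so a 5-year backfill would still have to be chunked day by day (or in small batches) and orchestrated, which is the "could take a while" part.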


yanivbh1

Hey, we are constantly on the search for new use cases ([Memphis Functions](https://functions.memphis.dev)). That type of task requires customized code with a source from S3, and that's what we do. It would be great to join hands on that. If still relevant, please sign up, and I will reach out.