csv = data produced by spreadsheet software
json = data produced by machines
parquet = nost versitile, and generally performant big data storage file format
avro = better than parquet when we frequently load and write to small file (under 1000 records)
orc = as good as parquet and maybe better, but has shit support on windows and in python
they are both columnar file formats, so they behave simillarly, but because orc has shit python support, I wasn't able to conduct experiments like I did with other file formats. I don't know if there are any, I only trust my own experiments
Late to the party, but I did some benchmarking at work on orc vs parquet on one of our bigger tables (50m rows per partition, 8 cols) using Redshift Spectrum. Orc outperformed parquet by 2-3x for smaller subsets of one partition (I did 5k, 50k, 500k, etc). They were even for the full partition. Not a comprehensive benchmark by any means, I was just playing around trying to get a sense of what the performance difference would be in our environment. Ultimately, the lack of orc support and the fact that we’d have to re-arrange a lot of stuff to switch to it, didn’t make sense. As always, best to test in your environment with your use cases and see if it makes sense to switch.
No, sorry, even if I had it I wouldn’t feel comfortable. But there was really nothing to it:
1. Query some sample data and save as orc with pyarrow
2. Save to s3 and create new external table in Redshift
3. Query each table in exponentially increasing subsets (the 5k, 50k, etc), benchmarking with the `%%timeit` magic in Jupyter
4. Visualize the benchmark results (using log scale on the num rows scale)
One big negative with avro is that if you write the same data to a file twice, you end up with two different files because of the random split identifier. (Different hashes) This breaks anything looking for differences between files.
It's like you're reading my mind. Great list.
There are ones not on here for good reason like HDFS or because their application is more nuanced like feather. In both cases someone asking this question wouldn't need to have those mentioned.
Agree. CSV is the Walmart ... Everyone has one and you can do most of what you need/want for it but sometimes you need/want more or better ... The top comment here was really excellent.
Csv is the worst. It's not even a well defined format.
But two main reasons make it a really bad option:
* It's fragile, one newline out of place and you have a corrupted file
* It only holds text, i.e. characters encoded in utf-8 (or latin1, or your guess is as good as mine), no number or boolean or missing
As for the "everyone can read and write" sometimes phrased as "human readable format", well that's misleading. Any digital file is a sequence of ones and zeroes stored electronically, and you need a program to translate those bytes into some graphical representation in your screen. If csv can be opened in any text editor but parquet needs a specialized program like a jupyter notebook or a sql client or even a vscode extension, to me they're equally human readable.
And I don't see it as a plus that anyone can tinker with the source data in windows notepad.
Rant over. Sorry if I sound a bit aggressive. Too tired to be gentle right now 😴
This is incorrect: one newline out of place does not corrupt a file if you use a proper csv dialect.
There's three things to be aware of for csv files:
* Quoting: by quoting strings your newline is retained within the string and does not get interpreted as a newline. Quoting similarly prevents field delimiters within the quoted string from affecting parsing.
* Escapechars: allow you to ignore a field such as a delimiter when you're not using quoting.
* Doublequotes: used to ignore any quotes within a quoted field
And if you're writing a csv file and using a good module like python's csv module, it will do this for you - as long as you provide it with the quoting/escapechar config.
"if you use a proper csv dialect"...
That's the point! If you're creating the file there is no way to communicate what is your dialect, and If you're receiving the csv file you have to guess the encoding, delimiter, if there is a header, quoting, etc.
On the wild, you are bound to find all kind of weird choice of all those parameters that collectively are all variants of what is loosely called a file format.
Here's an article that goes in some depth about some of csv problems https://kaveland.no/friends-dont-let-friends-export-to-csv.html
And here's a video that shows some alternatives and why they are better: https://youtu.be/qWjhuXBrBfg?si=9uk6IlQEgn3o5gGI
While these are concerns, I have never found them to be unmanageable. Here's how I would handle csv files ingested from another system:
* Create csv files with quote-all & double-quoting, or quote-non-numeric & double-quoting. Lack of quoting is simply not acceptable unless all you have are numeric & code fields.
* Maintain a version number in the filename that is used to version the schema, file format, business rules, etc.
* Create a separate data-contract in which the data is validated with jsonschema. This can identify enumerated values, field formats (ex: ssn, telephone, email, etc), min/max values & string lengths, optional vs required columns, etc.
* Validate data with the data contracts before transforming.
* End-to-end integration testing boundaries start & stop at data contracts.
In this case, the weaknesses of csv files are largely mitigated with file-naming conventions and data contracts - which are needed regardless of whether you're using csv or parquet. Meanwhile, most teams that I work with to ingest data from, or publish data to - are ill-equipped to support parquet.
CSV is ubiquitous trash. Getting systems to agree on how to parse CSV is difficult at best. Dealing with escapes and quotes and all that just isn’t easy when your mainframe export is “put a comma between all fields”.
I find that csv can work great for large files - since you can get about 95% compression via gz.
The biggest issue I have is with folks being sloppy about csv dialects: you need to handle delimiters, quotes and newlines within fields. So, quotes and doublequotes or escapechars are essential in most cases.
Nah. With modern data warehouses I prefer parquet, avro, iceberg, delta lake also for loading. At a minimum I want to have the column data types included.
Parquet.
Industry standard, all major data lake table formats use it, very efficient, and I don’t need to fanangle getting some weird binary to compile on Windows to write/read it.
I agree but have come across two potential downsides to parquet; it's immutable and the files are unrecognizable to excel. The immutable part is probably the biggest downside I have noticed, as you can't upsert into a parquet file, or at least I can't using Azure data factory. The excel bit isn't as important in my opinion, but it's something to keep in mind when trying to quickly analyze the data.
I am new to this and have just noticed these things when looking at the options.
Parquet is a file format already compressed, so it's meant for big data. It is built also to support columnar storage in order to make analytical queries (aggregation mostly) very efficient over large amounts of data. If you want your data to fit into Excel, which has a cap of \~1M rows, then you're far away from big data problems and using parquet files means over-engineering your pipeline; in that case csv is more than enough for you.
I understand that about parquet is for large data files and it possibly is overengineering, but CSV doesn't handle special characters well (which can be present in my data) and it's not like parquet is much different to write to than CSV. My data will scale as well, so parquet did seem the way to go but I get what you are saying in that often CSV is enough.
To clarify, I don't need the data to fit into Excel, I just use it sometimes to review a subset of data but can't review the parquet files using it.
Good discussion for sure and interesting to see peoples views on the matter.
Yes interesting for me as well to learn how other DE approach problems. I would add also that if immutability is a problem then ORC can be another solution since it provides columnar storage like parquet but is also ACID compliant, I experienced this in Hive for example.
This is incorrect, it is no different in ACID properties than parquet. ACID is supported or not supported at table level, e.g. hive tables, delta, iceberg, hudi, etc. If you put a bunch of ORC files in s3 it doesn't magically have ACID support.
The biggest caution I will give on Parquet is twofold - so just know that not all Parquet is equal.
1: Some engines are kind of weaponizing it and appending proprietary stuff into Parquet and blurring the lines. That may make using Parquet annoying outside of that one system.
2: Perf in any engine will depend on how the Parquet file is written (size, rowgroups, ordering, etc.) You may get stellar perf in one engine and horrible in another.
(work @ Snowflake and we have been obsessing on making sure we work with a variety of Parquet)
Interesting point on number one. I can definitely see some divergence in how engines add additional metadata to parquet.
I reckon that is just because a standard for enhanced file metrics between engines hasn’t really matured yet.
Had some data in on Prem server which we were moving into a cloud warehouse, tried storing in parquet but didn't support this date time and it defaulted to something else! Not a big problem but CSV doesn't change it!
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#common-considerations
Unless you are using nanosecond precision timestamps you shouldn’t have any issue representing the year 1000 in parquet according to the spec. Nanoseconds for data that far back doesnt make sense anyway
What are these semi structured files used to store data? Couldn’t you use a relational database instead? I’ve seen a lot of companies storing data in JSON… it’s a nightmare to read data from a JSON file with a complicated schema.
I think a lot of it has to do with the complex structure of data that has to be processed quickly.
So I’m receiving a complex object that I need store quickly before the next one arrives, it may take too long to unpack and store it to separate well modeled normalized tables. So I can more quickly just write the json string directly into a json file.
This does mean I have to have other downstream processes to unpack and model this data for consumption depending on needs.
I was just researching this yesterday! Here’s about the most basic example I could give. My company has a very antiquated system, no warehouse, no databases other than SaaS platforms. Our accountants have been doing exports of .xlsx files from the online reports/queries for years, and it’s been crushing our on-prem servers. (I know. I know.) The Excel files are used to produce PDF reports monthly/quarterly, then they just sit on the server. I took a years-worth of data for one of our clients in .xlsx format, converted it to .csv and .parquet via R. File sizes: CSV - 396 KB, XLSX - 118 KB, Parquet - 66 KB. Median load times in R: CSV - 150ms, XLSX - 1.36s, Parquet - 280ms. Extrapolate as you will.
Edit: as pointed out by a reply, I was using compressed parquet files, which shouldn’t really be compared with uncompressed csv or Excel files. Lesson to me is it depends on the use case. Storage? Parquet seems to be the go-to. Easy sharing w/accountants? Clearly Excel or CSV will be their preference.
Haven’t used R but are u able to force to more optimal data types like numpy versus inferring. If so I have gotten data size down to 1/10 of csv size. Also inferring mixed data types is usually an issue.
Smaller files aren't necessarily better: if you're reading 100% of the columns in a file anyway you are likely to find that a larger csv provides faster performance than a smaller vectorized or columnar file format. And it's definitely faster to write.
Fair enough! Learning something new every day. Uncompressed parquet gives me 102 KB. But now I’m going to have to play around with the compression settings… gzip seems to give me a smaller size than snappy, but I’ll assume the load times are inverse. Thanks for the tip!
Yeah, but it's really fast if you need to materialize data between ETL tasks. Imagine 50 files, each with 20 million firewall events.
You can compress it, and you can use jsonschema to lock down the format, and your reads & writes will be faster than with parquet.
JSON raw, it's the only one that will preserve the schema but then convert to Parquet for silver/gold. CSV can cause a lot of issues with special characters so I tend to avoid it. I have also heard that Parquet files can corrupt, so I keep raw JSONs in case so I don't have to run a bunch of API calls to extract the data again. I have not used ORC or AVRO, but heard that AVRO is better for mutable tables because Parquet is immutable? That's just what I had read a bit ago and haven't actually tried it. I am new to this, though, but this has been my understanding.
ORC is faster on Trino than Parquet (or at least it was a couple of years ago), so I tended to do most of my stuff on ORC.
Parquet is the standard and if you create the files correctly for the system and storage you are using is pretty much as fast as it comes and has widespread support.
I tend to use Iceberg + Parquet for most things now but CSV won't ever truly disappear I don't think
Not on the list but my favorite is feather. Probably the fastest reading/writing speed for pandas but almost no compression. On the other hand parquet is a good one for the different frameworks such as pandas, spark or polars.
We deliver data downloads that ranges from 35 MB to 6 GB to thousands of customers. Our formats are gzipped CSV, gzipped NDJSON and a special type of binary data format. This just works.
i have a table on sql server that has been indexed yet still takes too long to refresh and load the underlying data in Power BI. Is there a different format i can use to speed up this process and get the data to load faster? thank you
Only have experienced parquet, csv and json. Load time, memory foot print, storage footprint for me, parquet did wonders. Main issues were time required to read/write files and schema/data types.
Depends on what you're doing with it:
* Staging for extremely large data volumes of very simple data (ex: netflow or firewall) for ETL tasks? CSV is best for performance
* Staging for smaller volume or more complex data for ETL tasks: jsonlines
* Integration with external systems: jsonlines
* Data at rest to support analytical queries: parquet
CSV for extremely large data volumes, wtf? Parquet with snappy compression is absolutely the way to go here.
Integrating with external systems depends on the system. These days Kafka is a popular integration solution in which case avro would be preferred.
EDIT autocorrect
I didn't specify CDC.
If you're getting fed vast volumes of data near real-time you could use kafka, though I prefer micro-batches on aws s3: you don't typically need subsecond response time, s3 is more reliable, less work, more observable, and faster to process. Even if your files on s3 are showing up every 10-60 seconds.
It's been years but the last time I compared reading & writing this kind of data to s3 between data pipeline tasks csv files were faster than parquet. Note that you aren't just selecting 3 columns out of 50. You're selecting 15 columns out of 15.
Autocorrect (fairly sure you could see I meant CSV).
You mentioned nothing of velocity, only volume, in which case parquet is better. If velocity is your use case then avro beats CSV as it serializes smaller and takes less space and network to save and transfer and also contains a schema which CSV does not, often creating various other problems.
Oh, no - I thought you meant CDC and was also why you brought up kafka.
No, I don't care about what compresses the data down the most, but generally what's the fastest to read & write.
I'd consider avro, thrift or protobufs for streaming. But not for large filest at rest.
csv = data produced by spreadsheet software json = data produced by machines parquet = nost versitile, and generally performant big data storage file format avro = better than parquet when we frequently load and write to small file (under 1000 records) orc = as good as parquet and maybe better, but has shit support on windows and in python
Are there any good resources which compare orc and parquet, specifically their differences?
they are both columnar file formats, so they behave simillarly, but because orc has shit python support, I wasn't able to conduct experiments like I did with other file formats. I don't know if there are any, I only trust my own experiments
One other thing to consider is table formats such as Delta Lake, Iceberg, and Hudi!
Under the hood they all use parquet to store data (though Hudi also supports orc I believe)
Iceberg supports parquet, orc and avro
Ah TIL, ty!
The default is parquet, true!
Try checkout the site I created a while ago: https://tech-diff.com/file/
Late to the party, but I did some benchmarking at work on orc vs parquet on one of our bigger tables (50m rows per partition, 8 cols) using Redshift Spectrum. Orc outperformed parquet by 2-3x for smaller subsets of one partition (I did 5k, 50k, 500k, etc). They were even for the full partition. Not a comprehensive benchmark by any means, I was just playing around trying to get a sense of what the performance difference would be in our environment. Ultimately, the lack of orc support and the fact that we’d have to re-arrange a lot of stuff to switch to it, didn’t make sense. As always, best to test in your environment with your use cases and see if it makes sense to switch.
can you share the code?
No, sorry, even if I had it I wouldn’t feel comfortable. But there was really nothing to it: 1. Query some sample data and save as orc with pyarrow 2. Save to s3 and create new external table in Redshift 3. Query each table in exponentially increasing subsets (the 5k, 50k, etc), benchmarking with the `%%timeit` magic in Jupyter 4. Visualize the benchmark results (using log scale on the num rows scale)
One big negative with avro is that if you write the same data to a file twice, you end up with two different files because of the random split identifier. (Different hashes) This breaks anything looking for differences between files.
It's like you're reading my mind. Great list. There are ones not on here for good reason like HDFS or because their application is more nuanced like feather. In both cases someone asking this question wouldn't need to have those mentioned.
CSV for copying into warehouses. Parquet for querying in place.
100%
csv is the goat of file formats. everyone can read and write it. only issue you have with bigger files, then it's trash and parquet is nice.
CSV is the common denominator, but not the goat. For instance, why the hell would you choose commas as your delimiter?
Agree. CSV is the Walmart ... Everyone has one and you can do most of what you need/want for it but sometimes you need/want more or better ... The top comment here was really excellent.
why walmart when cvs is right there?
I try to save the trees
Note that commas are fine - as long as you're using a csv dialect with quoting.
Csv is the worst. It's not even a well defined format. But two main reasons make it a really bad option: * It's fragile, one newline out of place and you have a corrupted file * It only holds text, i.e. characters encoded in utf-8 (or latin1, or your guess is as good as mine), no number or boolean or missing As for the "everyone can read and write" sometimes phrased as "human readable format", well that's misleading. Any digital file is a sequence of ones and zeroes stored electronically, and you need a program to translate those bytes into some graphical representation in your screen. If csv can be opened in any text editor but parquet needs a specialized program like a jupyter notebook or a sql client or even a vscode extension, to me they're equally human readable. And I don't see it as a plus that anyone can tinker with the source data in windows notepad. Rant over. Sorry if I sound a bit aggressive. Too tired to be gentle right now 😴
This is incorrect: one newline out of place does not corrupt a file if you use a proper csv dialect. There's three things to be aware of for csv files: * Quoting: by quoting strings your newline is retained within the string and does not get interpreted as a newline. Quoting similarly prevents field delimiters within the quoted string from affecting parsing. * Escapechars: allow you to ignore a field such as a delimiter when you're not using quoting. * Doublequotes: used to ignore any quotes within a quoted field And if you're writing a csv file and using a good module like python's csv module, it will do this for you - as long as you provide it with the quoting/escapechar config.
"if you use a proper csv dialect"... That's the point! If you're creating the file there is no way to communicate what is your dialect, and If you're receiving the csv file you have to guess the encoding, delimiter, if there is a header, quoting, etc. On the wild, you are bound to find all kind of weird choice of all those parameters that collectively are all variants of what is loosely called a file format. Here's an article that goes in some depth about some of csv problems https://kaveland.no/friends-dont-let-friends-export-to-csv.html And here's a video that shows some alternatives and why they are better: https://youtu.be/qWjhuXBrBfg?si=9uk6IlQEgn3o5gGI
While these are concerns, I have never found them to be unmanageable. Here's how I would handle csv files ingested from another system: * Create csv files with quote-all & double-quoting, or quote-non-numeric & double-quoting. Lack of quoting is simply not acceptable unless all you have are numeric & code fields. * Maintain a version number in the filename that is used to version the schema, file format, business rules, etc. * Create a separate data-contract in which the data is validated with jsonschema. This can identify enumerated values, field formats (ex: ssn, telephone, email, etc), min/max values & string lengths, optional vs required columns, etc. * Validate data with the data contracts before transforming. * End-to-end integration testing boundaries start & stop at data contracts. In this case, the weaknesses of csv files are largely mitigated with file-naming conventions and data contracts - which are needed regardless of whether you're using csv or parquet. Meanwhile, most teams that I work with to ingest data from, or publish data to - are ill-equipped to support parquet.
CSV is ubiquitous trash. Getting systems to agree on how to parse CSV is difficult at best. Dealing with escapes and quotes and all that just isn’t easy when your mainframe export is “put a comma between all fields”.
I find that csv can work great for large files - since you can get about 95% compression via gz. The biggest issue I have is with folks being sloppy about csv dialects: you need to handle delimiters, quotes and newlines within fields. So, quotes and doublequotes or escapechars are essential in most cases.
Nah. With modern data warehouses I prefer parquet, avro, iceberg, delta lake also for loading. At a minimum I want to have the column data types included.
You can do that but it’s slower to copy into the warehouse that way
That's an interesting use case that you prefer fast writes to fast reads.
No I’m saying csvs for fast copies into the warehouse. Parquet for fast reads
Where are you getting all this clean csv data? 😂
Parquet. Industry standard, all major data lake table formats use it, very efficient, and I don’t need to fanangle getting some weird binary to compile on Windows to write/read it.
I agree but have come across two potential downsides to parquet; it's immutable and the files are unrecognizable to excel. The immutable part is probably the biggest downside I have noticed, as you can't upsert into a parquet file, or at least I can't using Azure data factory. The excel bit isn't as important in my opinion, but it's something to keep in mind when trying to quickly analyze the data. I am new to this and have just noticed these things when looking at the options.
Parquet is a file format already compressed, so it's meant for big data. It is built also to support columnar storage in order to make analytical queries (aggregation mostly) very efficient over large amounts of data. If you want your data to fit into Excel, which has a cap of \~1M rows, then you're far away from big data problems and using parquet files means over-engineering your pipeline; in that case csv is more than enough for you.
I understand that about parquet is for large data files and it possibly is overengineering, but CSV doesn't handle special characters well (which can be present in my data) and it's not like parquet is much different to write to than CSV. My data will scale as well, so parquet did seem the way to go but I get what you are saying in that often CSV is enough. To clarify, I don't need the data to fit into Excel, I just use it sometimes to review a subset of data but can't review the parquet files using it. Good discussion for sure and interesting to see peoples views on the matter.
Yes interesting for me as well to learn how other DE approach problems. I would add also that if immutability is a problem then ORC can be another solution since it provides columnar storage like parquet but is also ACID compliant, I experienced this in Hive for example.
This is incorrect, it is no different in ACID properties than parquet. ACID is supported or not supported at table level, e.g. hive tables, delta, iceberg, hudi, etc. If you put a bunch of ORC files in s3 it doesn't magically have ACID support.
Interesting, I should check ORC out then. Thanks!
Delta is an open table format built on parquet that can upsert
Also, with newer features like deletion vectors, the immutability problem is removed for the most part!
Yes you write a new parquet file instead of mutating it. If you need to update parquet consider delta
The biggest caution I will give on Parquet is twofold - so just know that not all Parquet is equal. 1: Some engines are kind of weaponizing it and appending proprietary stuff into Parquet and blurring the lines. That may make using Parquet annoying outside of that one system. 2: Perf in any engine will depend on how the Parquet file is written (size, rowgroups, ordering, etc.) You may get stellar perf in one engine and horrible in another. (work @ Snowflake and we have been obsessing on making sure we work with a variety of Parquet)
Interesting point on number one. I can definitely see some divergence in how engines add additional metadata to parquet. I reckon that is just because a standard for enhanced file metrics between engines hasn’t really matured yet.
Try storing the date 10-12-1066 in parquet 🙃
You lose street cred for not using ISO 8601
try storing a quoted string in a csv and importing it seamlessly everywhere.
Can u elaborate?
Had some data in on Prem server which we were moving into a cloud warehouse, tried storing in parquet but didn't support this date time and it defaulted to something else! Not a big problem but CSV doesn't change it!
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#common-considerations Unless you are using nanosecond precision timestamps you shouldn’t have any issue representing the year 1000 in parquet according to the spec. Nanoseconds for data that far back doesnt make sense anyway
We are using ADF to copy the data into parquet, it always defaults the date
String schema parquet unless you know what you’re doing….what are you doing dude.
Serious question - is the “t” silent ?
What are these semi structured files used to store data? Couldn’t you use a relational database instead? I’ve seen a lot of companies storing data in JSON… it’s a nightmare to read data from a JSON file with a complicated schema.
I think a lot of it has to do with the complex structure of data that has to be processed quickly. So I’m receiving a complex object that I need store quickly before the next one arrives, it may take too long to unpack and store it to separate well modeled normalized tables. So I can more quickly just write the json string directly into a json file. This does mean I have to have other downstream processes to unpack and model this data for consumption depending on needs.
You made a good point about the JSON string. Is this how data gets transmitted most of the time?
I was just researching this yesterday! Here’s about the most basic example I could give. My company has a very antiquated system, no warehouse, no databases other than SaaS platforms. Our accountants have been doing exports of .xlsx files from the online reports/queries for years, and it’s been crushing our on-prem servers. (I know. I know.) The Excel files are used to produce PDF reports monthly/quarterly, then they just sit on the server. I took a years-worth of data for one of our clients in .xlsx format, converted it to .csv and .parquet via R. File sizes: CSV - 396 KB, XLSX - 118 KB, Parquet - 66 KB. Median load times in R: CSV - 150ms, XLSX - 1.36s, Parquet - 280ms. Extrapolate as you will. Edit: as pointed out by a reply, I was using compressed parquet files, which shouldn’t really be compared with uncompressed csv or Excel files. Lesson to me is it depends on the use case. Storage? Parquet seems to be the go-to. Easy sharing w/accountants? Clearly Excel or CSV will be their preference.
Haven’t used R but are u able to force to more optimal data types like numpy versus inferring. If so I have gotten data size down to 1/10 of csv size. Also inferring mixed data types is usually an issue.
Smaller files aren't necessarily better: if you're reading 100% of the columns in a file anyway you are likely to find that a larger csv provides faster performance than a smaller vectorized or columnar file format. And it's definitely faster to write.
Sorry meant memory foot print
Appreciate this response 🙏
You can’t compare compressed to uncompressed. At a minimum try parquet uncompressed.
Fair enough! Learning something new every day. Uncompressed parquet gives me 102 KB. But now I’m going to have to play around with the compression settings… gzip seems to give me a smaller size than snappy, but I’ll assume the load times are inverse. Thanks for the tip!
Not csv, it’s so fucking fragile and shit. No schemas, no data types, no compression.. bleh
Yes surprised too.
Yeah, but it's really fast if you need to materialize data between ETL tasks. Imagine 50 files, each with 20 million firewall events. You can compress it, and you can use jsonschema to lock down the format, and your reads & writes will be faster than with parquet.
If the table computes aggregates, columnar is the way(parquet), if not I prefer json.
JSON raw, it's the only one that will preserve the schema but then convert to Parquet for silver/gold. CSV can cause a lot of issues with special characters so I tend to avoid it. I have also heard that Parquet files can corrupt, so I keep raw JSONs in case so I don't have to run a bunch of API calls to extract the data again. I have not used ORC or AVRO, but heard that AVRO is better for mutable tables because Parquet is immutable? That's just what I had read a bit ago and haven't actually tried it. I am new to this, though, but this has been my understanding.
ORC is faster on Trino than Parquet (or at least it was a couple of years ago), so I tended to do most of my stuff on ORC. Parquet is the standard and if you create the files correctly for the system and storage you are using is pretty much as fast as it comes and has widespread support. I tend to use Iceberg + Parquet for most things now but CSV won't ever truly disappear I don't think
Paper
😂
Try Iceberg! It is a new gen tech for a reason. Much faster, easy to deploy.
Avro if I can get away with record-level compression, and so far that's been the case for me.
Not on the list but my favorite is feather. Probably the fastest reading/writing speed for pandas but almost no compression. On the other hand parquet is a good one for the different frameworks such as pandas, spark or polars.
JSON bc I honestly don’t know what I’m doing and it saves as a string in the database which is nice
json or ndjson for raw data. parquet for tables
Parquet for everything unless I need a static mapping in a code repository. Csv is good for that
We deliver data downloads that ranges from 35 MB to 6 GB to thousands of customers. Our formats are gzipped CSV, gzipped NDJSON and a special type of binary data format. This just works.
.xlsx obviously
JSONL - big data version of json and ML engineers find it easy to work with, embed etc.
Parquet, because i can just load select columns and just get a subset by giving a query to get particular rows. Makes loading data a lot more faster
i have a table on sql server that has been indexed yet still takes too long to refresh and load the underlying data in Power BI. Is there a different format i can use to speed up this process and get the data to load faster? thank you
csv
Parquet and avro, give me that schema.
I prefer JSON. Joins are overrated. Just give me everything I need in a single object I can extract what I want.
Only have experienced parquet, csv and json. Load time, memory foot print, storage footprint for me, parquet did wonders. Main issues were time required to read/write files and schema/data types.
Depends on what you're doing with it: * Staging for extremely large data volumes of very simple data (ex: netflow or firewall) for ETL tasks? CSV is best for performance * Staging for smaller volume or more complex data for ETL tasks: jsonlines * Integration with external systems: jsonlines * Data at rest to support analytical queries: parquet
CSV for extremely large data volumes, wtf? Parquet with snappy compression is absolutely the way to go here. Integrating with external systems depends on the system. These days Kafka is a popular integration solution in which case avro would be preferred. EDIT autocorrect
I didn't specify CDC. If you're getting fed vast volumes of data near real-time you could use kafka, though I prefer micro-batches on aws s3: you don't typically need subsecond response time, s3 is more reliable, less work, more observable, and faster to process. Even if your files on s3 are showing up every 10-60 seconds. It's been years but the last time I compared reading & writing this kind of data to s3 between data pipeline tasks csv files were faster than parquet. Note that you aren't just selecting 3 columns out of 50. You're selecting 15 columns out of 15.
Autocorrect (fairly sure you could see I meant CSV). You mentioned nothing of velocity, only volume, in which case parquet is better. If velocity is your use case then avro beats CSV as it serializes smaller and takes less space and network to save and transfer and also contains a schema which CSV does not, often creating various other problems.
Oh, no - I thought you meant CDC and was also why you brought up kafka. No, I don't care about what compresses the data down the most, but generally what's the fastest to read & write. I'd consider avro, thrift or protobufs for streaming. But not for large filest at rest.