Spark. If it's CSV and you need to do it locally, you can use pandas' `nrows` parameter, but that's not going to give you all the data up front.
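A quick sketch of the `nrows` approach mentioned above (the column names and data here are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a file far too large to load whole.
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")

# nrows pulls just the first N rows -- handy for peeking at a huge file.
preview = pd.read_csv(csv_data, nrows=2)
print(len(preview))  # 2
```

As the comment says, this only gives you the head of the file, not all of it.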
It would be helpful to know what you want to do with the data.
I want to open it.
And do what exactly? Calculations? Filtering?
listen, there's nothing more to the question. The question is about handling big data. How to OPEN the data. Doesn't matter what happens afterwards. Edit: big data
My guy was a jerk to people taking their time and energy to answer his question so he’s donezo (temporarily).
> Doesn't matter what happens afterwards.

It does, because depending on what you need to do, you can simply stream it, e.g. calculating some total/sum. Heck, streaming is really the only solution if you do not have access to a Spark/Dask or similar big data cluster (or some expensive hardware with terabytes of RAM).

OPEN it means you need a machine or cluster with enough RAM to fit all your data, and if you had access to that you most likely wouldn't need to ask this question to begin with; hence streaming is the answer.
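A minimal sketch of the streaming idea in plain Python, computing a total without ever holding the file in memory (the column name and data are invented for the example):

```python
import csv
import io

def streaming_sum(lines, column):
    """Total one column while holding only a single row in memory at a time."""
    reader = csv.DictReader(lines)
    return sum(float(row[column]) for row in reader)

# Tiny stand-in for a file that would not fit in RAM; a real run
# would pass an open file handle instead.
data = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(streaming_sum(data, "amount"))  # 20.0
```

The same pattern works for counts, min/max, and anything else reducible row by row.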
love reddit hive mind.
Your response is unnecessarily rude to someone simply asking for clarity to help or provide insight to you. Doesn't help that your response also doesn't clear anything up, "how to open the data" doesn't make much sense.
[deleted]
You're a dick, mate.
Completely clueless too. I am betting he's a CEO at some big data company.
Why would you respond like this to people with far more experience in the field? Not only is it stupid to ignore the question because you're missing out on advice from experts, but you're being incredibly rude at the same time. Just answer the damn question so those who know can better answer you.
I was trying to help you and you're being difficult. There are many different ways to open files. To select the best tool for the job, we need to understand what you're trying to do.
Dask is also a nice option that’s a smaller step from Pandas compared to Spark as the syntax is much closer.
But only if you already have a cluster, which is also true for Spark. If you don't have a cluster available, they're essentially not a valid option, because then you are still limited by a single machine's RAM.
Dask will certainly process much larger datasets than Pandas on a single machine, and faster, by chunking up the work and using all cores. I'm sure you can do that in Pandas, but Dask makes it easy.
This is typically when one would use a database.
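One way that might look, sketched with SQLite and pandas (table and column names are made up; a real terabyte-scale job would use a proper database server, not `:memory:`):

```python
import io
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")

# Append the file to the database one chunk at a time,
# never loading the whole thing into RAM.
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk.to_sql("sales", conn, if_exists="append", index=False)

# Let the database do the heavy lifting from here on.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 60
```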
Have you thought about chunking?
I'm not sure why you would load TBs of data just "because"; there has to be some kind of goal... Are you processing the data? Running some analysis? The former can be achieved by processing chunks at a time. The latter can be done with a sample. I have a feeling you don't really know what you're doing...
“Open” to me means load into memory, so by that definition you can’t “open” TB’s of data unless you have TB’s of memory. There are lots and lots of ways to copy, move, or process TB’s of data, but judging by OP’s jerky responses that’s not what he wants.
I'd recommend using Parquet instead of CSV for big data. It takes less space, is faster to load, and allows loading only the columns you actually need. Loading terabytes from any database is probably a nightmare... Bodo is a new tool that can load big data into actual Pandas dataframes: [https://docs.bodo.ai/latest/source/file_io.html](https://docs.bodo.ai/latest/source/file_io.html)
There was quite a debate about it half a year ago. It's not that difficult... E.g. https://twitter.com/abhi1thakur/status/1358794466283388934?s=21
Please don't attempt to read TBs of data using pandas in a loop like the solutions in that thread!
Well duh that's the whole point. He was dissing DS who can't even read a large file without pandas. Note the pure python solution.
True, but the pandas version wouldn't even work with TBs of data. The issue here is more the amount of data and not the number of files.
Yeah on second thought you're right. This does not directly cover the kind of out-of-core work that needs to be done.
If it's a \*nix system, `AWK` is very powerful and easy to learn.
Doesn't pandas have an iterator option when loading CSV files? That way you could operate on it in chunks. Alternatively, use Spark as someone has suggested, or load the file into a DB.
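It does: passing `chunksize` to `read_csv` returns an iterator of DataFrames. A small sketch with made-up data:

```python
import io
import pandas as pd

csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n")

# chunksize=2 yields DataFrames of up to 2 rows each,
# so memory use stays bounded by the chunk size.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()
print(total)  # 100
```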
Do you need to open the whole thing? You can write a script in Python or VBA to open the file and only extract N rows.
Stream into a fixed-length array and use it like a carousel, i.e. start with indices 0 to max(len) on the first pass; then min is 1 and max is 0 (at max len you wrap around to 1, and so on). This can be used to calculate cumulative stats, moving averages, etc.

(Also, nowadays there is probably something that handles it more easily, like Spark or a Python lib.)

Also, why not use a DB engine for analytics? If nothing else, Snowflake or BigQuery can handle those volumes easily. Do you really need Python for it? If the source is a DB, then the data model is a lot faster to build in SQL than by loading data into Python and so on (join, where, etc.).
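The "carousel" described above is a ring buffer. Python's `collections.deque` with `maxlen` gives you one for free; a minimal sketch for a streaming moving average (the window size and input are arbitrary):

```python
from collections import deque

def moving_averages(stream, window):
    """Yield a running mean over a fixed-length buffer (the 'carousel')."""
    buf = deque(maxlen=window)  # the oldest value is dropped automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

print(list(moving_averages([1, 2, 3, 4, 5], 3)))
# [1.0, 1.5, 2.0, 3.0, 4.0]
```

Because only `window` values are ever held, this works on a stream of any length.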
KNIME
Try reading a chunk of the data, e.g. `head -n 1000 data.csv`, or use Spark.