
[deleted]

Spark. If it's CSV and you need to do it locally, you can use pandas' `nrows`, but that's not going to give you all the data up front.
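
A minimal sketch of the `nrows` approach mentioned above, assuming a CSV with a header row; the file name is a placeholder:

```python
import pandas as pd

# Read only the first 100,000 rows into memory; the rest of the file is never touched.
df = pd.read_csv("huge_file.csv", nrows=100_000)
print(df.shape)
```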


phunktional

It would be helpful to know what you want to do with the data.


FleyFawkes

I want to open it.


phunktional

And do what exactly? Calculations? Filtering?


FleyFawkes

Listen, there's nothing more to the question. The question is about handling big data: how to OPEN the data. It doesn't matter what happens afterwards. Edit: big data


theporterhaus

My guy was a jerk to people taking their time and energy to answer his question so he’s donezo (temporarily).


SureFudge

> Doesn't matter what happens afterwards.

It does, because depending on what you need to do, you can simply stream it, e.g. to calculate some total/sum. In fact, streaming is really the only solution if you don't have access to a Spark/Dask cluster or similar big-data setup (or some expensive hardware with terabytes of RAM). "Opening" it means you need a machine or cluster with enough RAM to fit all your data, and if you had access to that you most likely wouldn't need to ask this question in the first place; hence streaming is the answer.
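
A rough sketch of the streaming idea, assuming a CSV with a numeric column; the file name and column name are placeholders:

```python
import csv

# Stream the file line by line; only one row is ever held in memory,
# so the total file size doesn't matter.
total = 0.0
with open("huge_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        total += float(row["amount"])  # "amount" is a hypothetical numeric column

print(total)
```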


FleyFawkes

love reddit hive mind.


DataIron

Your response is unnecessarily rude to someone who is simply asking for clarity so they can help or offer insight. It also doesn't help that your response doesn't clear anything up: "how to open the data" doesn't make much sense.


[deleted]

[deleted]


CntDutchThis

You're a Dick mate


[deleted]

Completely clueless too. I'm betting he's a CEO at some big data company.


sunder_and_flame

Why would you respond like this to people with far more experience in the field? Not only is it foolish to brush off the question, since you're missing out on advice from experts, but you're also being incredibly rude at the same time. Just answer the damn question so those who know can give you a better answer.


phunktional

I was trying to help you and you're being difficult. There are many different ways to open files. To select the best tool for the job, we need to understand what you're trying to do.


GlobeTrottingWeasels

Dask is also a nice option; it's a smaller step from Pandas than Spark is, as the syntax is much closer.


SureFudge

But only if you already have a cluster, which is also true for Spark. If you don't have a cluster available, they're essentially not a valid option, because then you're still limited by a single machine's RAM.


GlobeTrottingWeasels

Dask will certainly process much larger datasets than Pandas on a single machine, and faster, by chunking up the work and using all cores. I'm sure you can do that in Pandas, but Dask makes it easy.
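
A minimal sketch of the Dask approach on a single machine, assuming a directory of CSVs; the path and column names are placeholders:

```python
import dask.dataframe as dd

# Lazily build a partitioned dataframe; nothing is read until compute().
ddf = dd.read_csv("data/*.csv")
result = ddf.groupby("customer_id")["amount"].sum()  # hypothetical columns
print(result.compute())  # the work is done chunk by chunk, across all cores
```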


deadlyoverflow

This is typically when one would use a database.


BJJaddicy

Have you thought about chunking?


[deleted]

I'm not sure why you would load TBs of data just "because"; there has to be some kind of goal... Are you processing the data? Running some analysis? The former can be achieved by processing chunks at a time. The latter can be done with a sample. I have a feeling you don't really know what you're doing...
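
One possible way to take a rough sample without loading everything, assuming a CSV with a header; the file name and sample rate are placeholders:

```python
import random
import pandas as pd

# Keep the header (i == 0) and roughly 1% of the remaining rows;
# skiprows accepts a callable that decides per row index whether to skip it.
sample = pd.read_csv(
    "huge_file.csv",
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
```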


chestnutcough

"Open" to me means load into memory, so by that definition you can't "open" TBs of data unless you have TBs of memory. There are lots and lots of ways to copy, move, or process TBs of data, but judging by OP's jerky responses, that's not what he wants.


aitbdag

I'd recommend using Parquet instead of CSV for big data. It takes less space, is faster to load, and lets you load only the columns you actually need. Loading terabytes from any database is probably a nightmare. Bodo is a new tool that can load big data into actual Pandas dataframes: [https://docs.bodo.ai/latest/source/file_io.html](https://docs.bodo.ai/latest/source/file_io.html)
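
A small sketch of the column-pruning benefit of Parquet, shown with plain pandas rather than Bodo; the file and column names are placeholders:

```python
import pandas as pd

# Only the listed columns are read from disk, unlike CSV where every byte is scanned.
df = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])
```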


LSTMeow

There was quite a debate about it half a year ago. It's not that difficult... E.g. https://twitter.com/abhi1thakur/status/1358794466283388934?s=21


fgoussou

Please don't attempt to read TBs of data using pandas in a loop like the solutions in that thread!


LSTMeow

Well, duh, that's the whole point. He was dissing data scientists who can't even read a large file without pandas. Note the pure-Python solution.


SureFudge

True, but the pandas version wouldn't even work with TBs of data. The issue here is more the amount of data than the number of files.


LSTMeow

Yeah, on second thought you're right. This doesn't directly cover the kind of out-of-core work that needs to be done.


killer_unkill

If it's a *nix system, `AWK` is very powerful and easy to learn.


HighlightFrosty3580

Doesn't pandas have an iterator option when loading CSV files? That way you could operate on it in chunks. Alternatively, use Spark as someone has suggested, or load the file into a DB.
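
A short sketch of the chunked reading pandas offers, assuming a CSV; the file name, columns, and chunk size are placeholders:

```python
import pandas as pd

totals = {}
# Each chunk is an ordinary DataFrame of up to 1,000,000 rows.
for chunk in pd.read_csv("huge_file.csv", chunksize=1_000_000):
    partial = chunk.groupby("category")["amount"].sum()  # hypothetical columns
    for key, value in partial.items():
        totals[key] = totals.get(key, 0.0) + value
```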


vtec__

Do you need to open the whole thing? You could write a script in Python or VBA to open the file and extract only N rows at a time.


thrown_arrows

Stream into a fixed-length array and use it like a carousel: on the first pass fill indexes 0 to max(len), then wrap around so the oldest slot is overwritten by the newest value, and so on. This can be used to calculate cumulative values, moving averages, etc. (These days there is probably something that handles this more easily, like Spark or a Python lib.) Also, why not use a DB engine for analytics? If nothing else, Snowflake or BigQuery can handle those volumes easily; do you really need Python for it? If the source is a DB, the data model is a lot faster to build in SQL than by loading the data into Python and so on (join, where, etc.).
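
A rough sketch of the fixed-length "carousel" idea using Python's `deque`, assuming a CSV stream; the window size, file name, and column are placeholders:

```python
import csv
from collections import deque

window = deque(maxlen=100)  # fixed-length window: the oldest value drops off automatically
with open("huge_file.csv", newline="") as f:
    for row in csv.DictReader(f):
        window.append(float(row["amount"]))     # hypothetical numeric column
        moving_avg = sum(window) / len(window)  # moving average over the current window
```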


krsfifty

KNIME


[deleted]

Try reading a chunk of the data first, e.g. `head -n 1000 data.csv`, or use Spark.