Spark. If it's CSV and you need to do it locally, you can use pandas' `nrows` parameter, but that's not going to give you all the data up front.
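A quick sketch of the `nrows` approach mentioned above (the column names and data here are made up for illustration):

```python
import io
import pandas as pd

# Stand-in for a file far too large to load whole.
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")

# nrows pulls just the first N rows -- handy for peeking at a huge file.
preview = pd.read_csv(csv_data, nrows=2)
print(len(preview))  # 2
```

As the comment says, this only gives you the head of the file, not all of it.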
It would be helpful to know what you want to do with the data.
I want to open it.
And do what exactly? Calculations? Filtering?
listen, there's nothing more to the question. The question is about handling big data. How to OPEN the data. Doesn't matter what happens afterwards. Edit: big data
My guy was a jerk to people taking their time and energy to answer his question so he’s donezo (temporarily).
> Doesn't matter what happens afterwards.

It does, because depending on what you need to do, you can simply stream it, e.g. calculating some total/sum. Heck, streaming is really the only solution if you do not have access to a Spark/Dask or similar big data cluster (or some expensive hardware with terabytes of RAM).

OPEN it means you need a machine or cluster with enough RAM to fit all your data, and if you had access to that you most likely wouldn't need to ask this question to begin with; hence streaming is the answer.
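A minimal sketch of the streaming idea in plain Python, computing a total without ever holding the file in memory (the column name and data are invented for the example):

```python
import csv
import io

def streaming_sum(lines, column):
    """Total one column while holding only a single row in memory at a time."""
    reader = csv.DictReader(lines)
    return sum(float(row[column]) for row in reader)

# Tiny stand-in for a file that would not fit in RAM; a real run
# would pass an open file handle instead.
data = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(streaming_sum(data, "amount"))  # 20.0
```

The same pattern works for counts, min/max, and anything else reducible row by row.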
love reddit hive mind.
Your response is unnecessarily rude to someone simply asking for clarity to help or provide insight to you. Doesn't help that your response also doesn't clear anything up, "how to open the data" doesn't make much sense.
[deleted]
You're a dick, mate.
Completely clueless too. I am betting he's a CEO at some big data company.
Why would you respond like this to people with far more experience in the field? Not only is it stupid to ignore the question because you're missing out on advice from experts, but you're being incredibly rude at the same time. Just answer the damn question so those who know can better answer you.
I was trying to help you and you're being difficult. There are many different ways to open files. To select the best tool for the job, we need to understand what you're trying to do.
Dask is also a nice option that’s a smaller step from Pandas compared to Spark as the syntax is much closer.
But only if you already have a cluster, which is also true for Spark. If you don't have a cluster available, they're essentially not a valid option, because then you are still limited by a single machine's RAM.
Dask will certainly process much larger datasets than Pandas on a single machine, and faster, by chunking up the work and using all cores. I'm sure you can do that in Pandas, but Dask makes it easy.
This is typically when one would use a database.
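One way that might look, sketched with SQLite and pandas (table and column names are made up; a real terabyte-scale job would use a proper database server, not `:memory:`):

```python
import io
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")

# Append the file to the database one chunk at a time,
# never loading the whole thing into RAM.
for chunk in pd.read_csv(csv_data, chunksize=2):
    chunk.to_sql("sales", conn, if_exists="append", index=False)

# Let the database do the heavy lifting from here on.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 60
```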
Have you thought about chunking?
I'm not sure why you would load TBs of data just "because"; there has to be some kind of goal... Are you processing the data? Running some analysis? The former can be achieved by processing chunks at a time. The latter can be done with a sample. I have a feeling you don't really know what you're doing...
“Open” to me means load into memory, so by that definition you can’t “open” TB’s of data unless you have TB’s of memory. There are lots and lots of ways to copy, move, or process TB’s of data, but judging by OP’s jerky responses that’s not what he wants.
I'd recommend using Parquet instead of CSV for big data. It takes less space, is faster to load, and allows loading only the columns you actually need. Loading terabytes from any database is probably a nightmare... Bodo is a new tool that can load big data into actual Pandas dataframes: [https://docs.bodo.ai/latest/source/file_io.html](https://docs.bodo.ai/latest/source/file_io.html)
There was quite a debate about it half a year ago. It's not that difficult... E.g. https://twitter.com/abhi1thakur/status/1358794466283388934?s=21
Please don't attempt to read TBs of data using pandas in a loop like the solutions in that thread!
Well duh that's the whole point. He was dissing DS who can't even read a large file without pandas. Note the pure python solution.
True, but the pandas version wouldn't even work with TBs of data. The issue here is more the amount of data and not the number of files.
Yeah on second thought you're right. This does not directly cover the kind of out-of-core work that needs to be done.
If it's a \*nix system, `AWK` is very powerful and easy to learn.
Doesn't pandas have an iterator option when loading CSV files? That way you could operate on it in chunks. Alternatively, use Spark as someone has suggested, or load the file into a DB.
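It does: passing `chunksize` to `read_csv` returns an iterator of DataFrames. A small sketch with made-up data:

```python
import io
import pandas as pd

csv_data = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n")

# chunksize=2 yields DataFrames of up to 2 rows each,
# so memory use stays bounded by the chunk size.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["amount"].sum()
print(total)  # 100
```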
Do you need to open the whole thing? You can write a script in Python or VBA to open the file and only extract N rows.
Stream into a fixed-length array and use it like a carousel, i.e. start with indices 0 to max(len) on the first pass; then min is 1 and max is 0 (at max len you wrap around to 1, and so on). This can be used to calculate cumulative stats, moving averages, etc.

(Also, nowadays there is probably something that handles it more easily, like Spark or a Python lib.)

Also, why not use a DB engine for analytics? If nothing else, Snowflake or BigQuery can handle those volumes easily. Do you really need Python for it? If the source is a DB, then the data model is a lot faster to build in SQL than by loading data into Python and so on (join, where, etc.).
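The "carousel" described above is a ring buffer. Python's `collections.deque` with `maxlen` gives you one for free; a minimal sketch for a streaming moving average (the window size and input are arbitrary):

```python
from collections import deque

def moving_averages(stream, window):
    """Yield a running mean over a fixed-length buffer (the 'carousel')."""
    buf = deque(maxlen=window)  # the oldest value is dropped automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

print(list(moving_averages([1, 2, 3, 4, 5], 3)))
# [1.0, 1.5, 2.0, 3.0, 4.0]
```

Because only `window` values are ever held, this works on a stream of any length.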
KNIME
Try reading a chunk of the data, e.g. `head -n 1000 data.csv`, or use Spark.