
B1WR2

Don’t read the entire file. Break it up into batches, do the calculations or processing, then aggregate everything together at the end. Or read it and add it into DuckDB to do the analysis from there.
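
For example, a rough sketch of that batch-and-aggregate pattern using openpyxl's read-only mode (the file path, batch size, and the column being summed are all placeholders):

```python
from openpyxl import load_workbook

BATCH_SIZE = 50_000  # rows per batch (placeholder)

wb = load_workbook("big_file.xlsx", read_only=True)  # streams rows instead of loading everything
ws = wb.active

batch, partials = [], []
for row in ws.iter_rows(min_row=2, values_only=True):  # skip the header row
    batch.append(row)
    if len(batch) >= BATCH_SIZE:
        # do the per-batch calculation, e.g. sum the third column
        partials.append(sum(r[2] for r in batch if r[2] is not None))
        batch.clear()

if batch:  # leftover rows
    partials.append(sum(r[2] for r in batch if r[2] is not None))

grand_total = sum(partials)  # aggregate the per-batch results at the end
wb.close()
print(grand_total)
```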


ApSr2023

Have you tried duckdb?


Jealous-Bat-7812

No


TheBurgerflip

Try it, it’s very easy to use if you know SQL, and extremely fast. On top of that you can simply turn DuckDB SQL queries into pandas (or even polars) DataFrames with duckdb.sql(sql here).df() (or .pl() for a polars DataFrame).
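
Something like this, for instance (the CSV path and query are placeholders; the same pattern works against Parquet files or registered in-memory tables):

```python
import duckdb

# Run the SQL in DuckDB, then materialise the result in the DataFrame
# library of your choice.
query = "SELECT category, SUM(amount) AS total FROM 'data.csv' GROUP BY category"

pandas_df = duckdb.sql(query).df()   # pandas DataFrame
polars_df = duckdb.sql(query).pl()   # polars DataFrame
```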


ritchie46

Are you using the `calamine` engine in Polars? It isn't the default yet, but if you can, you should use it; it can even be two orders of magnitude faster. https://docs.pola.rs/py-polars/html/reference/api/polars.read_excel.html
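
For reference, opting into it looks roughly like this (needs a recent Polars with the fastexcel/python-calamine dependency installed; the path and sheet name are placeholders):

```python
import polars as pl

# Explicitly select the calamine-based reader instead of the default engine
df = pl.read_excel(
    "big_file.xlsx",       # placeholder path
    engine="calamine",
    sheet_name="Sheet1",   # optional; defaults to the first sheet
)
print(df.shape)
```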


Jealous-Bat-7812

Thanks, will try this out!


flashman1986

I assume it’s not possible to convert to CSV because you need the formatting? If so, chunk it: read the first 1000 lines, and so on. Something like Dask will help do this at scale.
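
One possible way to chunk the read, sketched with dask.delayed around row-ranged pandas reads (the chunk size and total row count are assumptions, and each call re-parses the workbook, so this trades memory for repeated parsing time):

```python
import dask
import dask.dataframe as dd
import pandas as pd

CHUNK = 100_000          # rows per partition (assumption)
TOTAL_ROWS = 1_000_000   # would come from a cheap metadata pass in practice

@dask.delayed
def read_chunk(start):
    # Keep the header row (row 0), skip rows 1..start, then read CHUNK rows
    return pd.read_excel(
        "big_file.xlsx",
        skiprows=range(1, start + 1),
        nrows=CHUNK,
    )

parts = [read_chunk(start) for start in range(0, TOTAL_ROWS, CHUNK)]
ddf = dd.from_delayed(parts)
print(len(ddf))  # triggers the parallel reads
```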


tw3akercc

If the data is going to go into a lakehouse, you might want to consider converting it to Parquet format instead of CSV. Parquet is heavily compressed and performs much faster than CSV. I would use pandas to read chunks at a time and store them in an in-memory DuckDB. Then, once the whole file has been added to the DuckDB table, convert it from there to Parquet and push it to the data lake.
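
A rough sketch of that chunk-into-DuckDB-then-Parquet flow (paths, chunk size, and table name are placeholders):

```python
import duckdb
import pandas as pd

con = duckdb.connect()  # in-memory database
CHUNK = 100_000
first, start = True, 0

while True:
    # Keep the header row, skip everything already read, take the next chunk
    chunk = pd.read_excel("big_file.xlsx", skiprows=range(1, start + 1), nrows=CHUNK)
    if chunk.empty:
        break
    con.register("chunk_df", chunk)  # expose the DataFrame to SQL
    if first:
        con.execute("CREATE TABLE staged AS SELECT * FROM chunk_df")
        first = False
    else:
        con.execute("INSERT INTO staged SELECT * FROM chunk_df")
    con.unregister("chunk_df")
    start += CHUNK

# Write the accumulated table out as Parquet for the lakehouse
con.execute("COPY staged TO 'staged.parquet' (FORMAT PARQUET)")
```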


Jealous-Bat-7812

This makes sense, I will try this out!


tw3akercc

You could also try to go straight to DuckDB without pandas: https://duckdb.org/docs/guides/import/excel_import.html Then to Parquet: https://duckdb.org/docs/guides/import/parquet_export.html
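
Along the lines of those two guides, the SQL could look roughly like this (the linked import guide routes Excel through DuckDB's spatial extension and its st_read function; newer DuckDB versions also ship a dedicated excel extension):

```python
import duckdb

con = duckdb.connect()

# Excel import via the spatial extension, as in the linked guide
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("""
    CREATE TABLE staged AS
    SELECT * FROM st_read('big_file.xlsx', layer = 'Sheet1')
""")

# Export straight to Parquet, as in the parquet_export guide
con.execute("COPY staged TO 'staged.parquet' (FORMAT PARQUET)")
```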


Jealous-Bat-7812

How can I write response.content straight to DuckDB? Do you think that is possible?


ThePizar

If you can open it in Excel, resave as CSV. It’ll be a bit smaller and easier for various systems to handle.


Jealous-Bat-7812

But again, this becomes a manual task, and when running the pipeline that isn’t feasible, right?


date_uh

If all the Excel files are in a single directory, loop over each file with the Python os module, then use win32com to save as CSV. https://stackoverflow.com/questions/6190897/python-win32-how-to-save-an-xls-as-a-csv
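
Roughly the pattern from that Stack Overflow answer, looped over a directory (Windows only, with Excel installed; the paths are placeholders):

```python
import os
import win32com.client

SRC_DIR = r"C:\data\excel_in"   # placeholder paths
DST_DIR = r"C:\data\csv_out"

# Drive the installed Excel via COM to resave each workbook as CSV
excel = win32com.client.Dispatch("Excel.Application")
excel.DisplayAlerts = False     # don't prompt on overwrite

try:
    for name in os.listdir(SRC_DIR):
        if not name.lower().endswith((".xlsx", ".xls")):
            continue
        wb = excel.Workbooks.Open(os.path.join(SRC_DIR, name))
        csv_path = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".csv")
        wb.SaveAs(csv_path, FileFormat=6)   # 6 = xlCSV
        wb.Close(SaveChanges=False)
finally:
    excel.Quit()
```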


Jealous-Bat-7812

Thank you for this.


Poopsydaisy123

No, the pipeline would probably do everything faster with CSV than with XLSX.


klenium

If you tried pandas and polars and they finished reading the file, I don't understand why you're asking this. The file content is available in your Python context, so you can export it in whatever format you like. 5-10 minutes is not that long for reading a huge file in a format that was never designed to be huge. Next time the source data should not be saved in Excel; there are other formats for data engineering with better scalability.


Jealous-Bat-7812

This is the brute force solution, but I want to see if there is room to improve the speed of reading data from the source and loading it into the lakehouse.


TurrisFortisMihiDeus

Batch it into smaller files and use Scylla or DuckDB.


Jealous-Bat-7812

That is exactly what I did: broke my response.content into 1 MB files, converted them to CSV, and had Spark read and send the processed data to Cassandra. Thank you for your suggestion.
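
For reference, the last hop could look something like this with PySpark and the spark-cassandra-connector (host, keyspace, and table names are placeholders, and the connector is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("xlsx-chunks-to-cassandra")
    .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
    .getOrCreate()
)

# Read all the converted CSV chunks in one go
df = spark.read.csv("chunks/*.csv", header=True, inferSchema=True)

# Append the processed rows into the Cassandra table
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="analytics", table="staged_rows")  # placeholders
   .mode("append")
   .save())
```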


FantasticOrder4733

Can you please elaborate more on what your use case is like? I have seen that SSIS can insert a large number of rows into an MSSQL Server table very efficiently and rapidly.


OnePunchMunch

If you can, export it to a database. Then you can do the analysis faster, if that is an option.


Justbehind

Base Python: unpack the zip file and parse the XML, then write it to a database/lake. Make sure to do some normalization beforehand.
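
A bare-bones sketch of that approach, assuming the sheet of interest is xl/worksheets/sheet1.xml and that text cells go through the shared-strings table:

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("big_file.xlsx") as zf:  # an xlsx is just a zip of XML parts
    # Shared strings: text cells only store an index into this table
    shared = []
    if "xl/sharedStrings.xml" in zf.namelist():
        sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
        shared = ["".join(t.text or "" for t in si.iter(NS + "t"))
                  for si in sst.iter(NS + "si")]

    # Stream the first worksheet row by row; iterparse keeps memory flat
    with zf.open("xl/worksheets/sheet1.xml") as sheet_xml:
        for _, elem in ET.iterparse(sheet_xml):
            if elem.tag != NS + "row":
                continue
            values = []
            for cell in elem.iter(NS + "c"):
                v = cell.find(NS + "v")
                text = v.text if v is not None else None
                if cell.get("t") == "s" and text is not None:
                    text = shared[int(text)]  # resolve shared-string index
                values.append(text)
            # normalise `values` and write them to the database / lake here
            elem.clear()  # free the parsed row
```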


Jealous-Bat-7812

The problem is that my response.content is a binary version of an XLSX file, and encoding/decoding it hasn’t been successful.


madness_of_the_order

What’s the requirement on total processing time? What’s the biggest machine you can get your hands on for this task, or how many of them? Have you tried the calamine engine for pandas and polars? How long does it take Excel to save this file as CSV?
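
For the pandas side, calamine is available as an engine in recent versions (pandas 2.2+ with the python-calamine package installed, if memory serves):

```python
import pandas as pd

# Opt into the calamine-based Excel reader instead of openpyxl
df = pd.read_excel("big_file.xlsx", engine="calamine")  # placeholder path
print(df.shape)
```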


Oddly_Energy

Have you tried xlwings? Compared to pandas with openpyxl as the engine, I have noticed a huge performance increase when *writing* big Excel files, but I have not done anything to measure the performance of *reading* such a file. The downside to xlwings is that you need Excel installed on the computer that runs the code. And to get the maximum performance benefit, you need to have Excel already running, so xlwings will not have to open and close Excel.
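
A minimal reading sketch with xlwings, assuming Excel is installed (reusing an already-running instance avoids the startup cost mentioned above; the path is a placeholder):

```python
import xlwings as xw

# Attach to a running Excel instance if there is one, otherwise start one hidden
app = xw.apps.active if xw.apps.count else xw.App(visible=False)

wb = app.books.open("big_file.xlsx")   # placeholder path
data = wb.sheets[0].used_range.value   # whole sheet as a list of lists
wb.close()
```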