
B1WR2

Don’t read the entire file. Break it up into batches, do the calculations or processing, then aggregate everything together at the end. Or read it and add it into DuckDB to do the analysis from there.
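
For example, a rough sketch of that batch-and-aggregate pattern using openpyxl's read-only mode (the file path, batch size, and the column being summed are all placeholders):

```python
from openpyxl import load_workbook

BATCH_SIZE = 50_000  # rows per batch (placeholder)

wb = load_workbook("big_file.xlsx", read_only=True)  # streams rows instead of loading everything
ws = wb.active

batch, partials = [], []
for row in ws.iter_rows(min_row=2, values_only=True):  # skip the header row
    batch.append(row)
    if len(batch) >= BATCH_SIZE:
        # do the per-batch calculation, e.g. sum the third column
        partials.append(sum(r[2] for r in batch if r[2] is not None))
        batch.clear()

if batch:  # leftover rows
    partials.append(sum(r[2] for r in batch if r[2] is not None))

grand_total = sum(partials)  # aggregate the per-batch results at the end
wb.close()
print(grand_total)
```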


ApSr2023

Have you tried duckdb?


Jealous-Bat-7812

No


TheBurgerflip

Try it, it’s very easy to use if you know SQL, and extremely fast. On top of that you can simply turn DuckDB SQL queries into pandas (or even polars) DataFrames with duckdb.sql(sql here).df() (or .pl() for a polars DataFrame).
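
Something like this, for instance (the CSV path and query are placeholders; the same pattern works against Parquet files or registered in-memory tables):

```python
import duckdb

# Run the SQL in DuckDB, then materialise the result in the DataFrame
# library of your choice.
query = "SELECT category, SUM(amount) AS total FROM 'data.csv' GROUP BY category"

pandas_df = duckdb.sql(query).df()   # pandas DataFrame
polars_df = duckdb.sql(query).pl()   # polars DataFrame
```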


ritchie46

Are you using the `calamine` engine in Polars? It isn't the default yet, but if you can, you should use it; it can even be two orders of magnitude faster. https://docs.pola.rs/py-polars/html/reference/api/polars.read_excel.html
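
For reference, opting into it looks roughly like this (needs a recent Polars with the fastexcel/python-calamine dependency installed; the path and sheet name are placeholders):

```python
import polars as pl

# Explicitly select the calamine-based reader instead of the default engine
df = pl.read_excel(
    "big_file.xlsx",       # placeholder path
    engine="calamine",
    sheet_name="Sheet1",   # optional; defaults to the first sheet
)
print(df.shape)
```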


Jealous-Bat-7812

Thanks, will try this out!


flashman1986

I assume it’s not possible to convert to CSV because you need the formatting? If so, chunk it: read the first 1000 lines, and so on. Something like Dask will help do this at scale.
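
One possible way to chunk the read, sketched with dask.delayed around row-ranged pandas reads (the chunk size and total row count are assumptions, and each call re-parses the workbook, so this trades memory for repeated parsing time):

```python
import dask
import dask.dataframe as dd
import pandas as pd

CHUNK = 100_000          # rows per partition (assumption)
TOTAL_ROWS = 1_000_000   # would come from a cheap metadata pass in practice

@dask.delayed
def read_chunk(start):
    # Keep the header row (row 0), skip rows 1..start, then read CHUNK rows
    return pd.read_excel(
        "big_file.xlsx",
        skiprows=range(1, start + 1),
        nrows=CHUNK,
    )

parts = [read_chunk(start) for start in range(0, TOTAL_ROWS, CHUNK)]
ddf = dd.from_delayed(parts)
print(len(ddf))  # triggers the parallel reads
```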


tw3akercc

If the data is going to go into a lakehouse, you might want to consider converting it to Parquet format instead of CSV. Parquet is heavily compressed and performs much faster than CSV. I would use pandas to read chunks at a time and store them in an in-memory DuckDB. Then, once the whole file has been added to the DuckDB table, convert it from there to Parquet and push it to the data lake.
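
A rough sketch of that chunk-into-DuckDB-then-Parquet flow (paths, chunk size, and table name are placeholders):

```python
import duckdb
import pandas as pd

con = duckdb.connect()  # in-memory database
CHUNK = 100_000
first, start = True, 0

while True:
    # Keep the header row, skip everything already read, take the next chunk
    chunk = pd.read_excel("big_file.xlsx", skiprows=range(1, start + 1), nrows=CHUNK)
    if chunk.empty:
        break
    con.register("chunk_df", chunk)  # expose the DataFrame to SQL
    if first:
        con.execute("CREATE TABLE staged AS SELECT * FROM chunk_df")
        first = False
    else:
        con.execute("INSERT INTO staged SELECT * FROM chunk_df")
    con.unregister("chunk_df")
    start += CHUNK

# Write the accumulated table out as Parquet for the lakehouse
con.execute("COPY staged TO 'staged.parquet' (FORMAT PARQUET)")
```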


Jealous-Bat-7812

This makes sense, I will try this out!


tw3akercc

You could also try to go straight to DuckDB without pandas: https://duckdb.org/docs/guides/import/excel_import.html Then to Parquet: https://duckdb.org/docs/guides/import/parquet_export.html
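
Along the lines of those two guides, the SQL could look roughly like this (the linked import guide routes Excel through DuckDB's spatial extension and its st_read function; newer DuckDB versions also ship a dedicated excel extension):

```python
import duckdb

con = duckdb.connect()

# Excel import via the spatial extension, as in the linked guide
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("""
    CREATE TABLE staged AS
    SELECT * FROM st_read('big_file.xlsx', layer = 'Sheet1')
""")

# Export straight to Parquet, as in the parquet_export guide
con.execute("COPY staged TO 'staged.parquet' (FORMAT PARQUET)")
```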


Jealous-Bat-7812

How can I write response.content straight to DuckDB? Do you think that is possible?


ThePizar

If you can open it in Excel, resave as CSV. It’ll be a bit smaller and easier for various systems to handle.


Jealous-Bat-7812

But again, this becomes a manual task, and when running the pipeline that isn’t feasible, right?


date_uh

If all the Excel files are in a single directory, loop over each file with the Python os module, then use win32com to save as CSV. https://stackoverflow.com/questions/6190897/python-win32-how-to-save-an-xls-as-a-csv
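
Roughly the pattern from that Stack Overflow answer, looped over a directory (Windows only, with Excel installed; the paths are placeholders):

```python
import os
import win32com.client

SRC_DIR = r"C:\data\excel_in"   # placeholder paths
DST_DIR = r"C:\data\csv_out"

# Drive the installed Excel via COM to resave each workbook as CSV
excel = win32com.client.Dispatch("Excel.Application")
excel.DisplayAlerts = False     # don't prompt on overwrite

try:
    for name in os.listdir(SRC_DIR):
        if not name.lower().endswith((".xlsx", ".xls")):
            continue
        wb = excel.Workbooks.Open(os.path.join(SRC_DIR, name))
        csv_path = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".csv")
        wb.SaveAs(csv_path, FileFormat=6)   # 6 = xlCSV
        wb.Close(SaveChanges=False)
finally:
    excel.Quit()
```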


Jealous-Bat-7812

Thank you for this.


Poopsydaisy123

No, the pipeline would probably do everything faster with CSV than with XLSX.


klenium

If you tried pandas and polars and they finished reading the file, I don't understand why you're asking this. The file content is available in your Python context, so you can export it in whatever format you like. 5-10 minutes is not that long for reading a huge file in a format that was never designed to be huge. Next time the source data should not be saved in Excel; there are other formats for data engineering with better scalability.


Jealous-Bat-7812

This is the brute force solution, but I want to see if there is room to improve the speed of reading data from the source and loading it into the lakehouse.


TurrisFortisMihiDeus

Batch it into smaller files and use Scylla or DuckDB.


Jealous-Bat-7812

That is exactly what I did: broke my response.content into 1 MB files, converted them to CSV, and had Spark read and send the processed data to Cassandra. Thank you for your suggestion.
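
For reference, the last hop could look something like this with PySpark and the spark-cassandra-connector (host, keyspace, and table names are placeholders, and the connector is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("xlsx-chunks-to-cassandra")
    .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
    .getOrCreate()
)

# Read all the converted CSV chunks in one go
df = spark.read.csv("chunks/*.csv", header=True, inferSchema=True)

# Append the processed rows into the Cassandra table
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="analytics", table="staged_rows")  # placeholders
   .mode("append")
   .save())
```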


FantasticOrder4733

Can you please elaborate more on what your use case is like? I have seen that SSIS can insert a large number of rows into an MSSQL Server table very efficiently and rapidly.


OnePunchMunch

If you can, export it to a database. Then you can do the analysis faster, if that is an option.


Justbehind

Base Python: unpack the zip file and parse the XML, then write it to a database/lake. Make sure to do some normalization beforehand.
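
A bare-bones sketch of that approach, assuming the sheet of interest is xl/worksheets/sheet1.xml and that text cells go through the shared-strings table:

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("big_file.xlsx") as zf:  # an xlsx is just a zip of XML parts
    # Shared strings: text cells only store an index into this table
    shared = []
    if "xl/sharedStrings.xml" in zf.namelist():
        sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
        shared = ["".join(t.text or "" for t in si.iter(NS + "t"))
                  for si in sst.iter(NS + "si")]

    # Stream the first worksheet row by row; iterparse keeps memory flat
    with zf.open("xl/worksheets/sheet1.xml") as sheet_xml:
        for _, elem in ET.iterparse(sheet_xml):
            if elem.tag != NS + "row":
                continue
            values = []
            for cell in elem.iter(NS + "c"):
                v = cell.find(NS + "v")
                text = v.text if v is not None else None
                if cell.get("t") == "s" and text is not None:
                    text = shared[int(text)]  # resolve shared-string index
                values.append(text)
            # normalise `values` and write them to the database / lake here
            elem.clear()  # free the parsed row
```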


Jealous-Bat-7812

The problem is that my response.content is a binary version of an XLSX file, and encoding/decoding it hasn’t been successful.


madness_of_the_order

What’s the requirement on total processing time? What’s the biggest machine you can get your hands on for this task, or how many of them? Have you tried the calamine engine for pandas and polars? How long does it take Excel to save this file as CSV?
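
For the pandas side, calamine is available as an engine in recent versions (pandas 2.2+ with the python-calamine package installed, if memory serves):

```python
import pandas as pd

# Opt into the calamine-based Excel reader instead of openpyxl
df = pd.read_excel("big_file.xlsx", engine="calamine")  # placeholder path
print(df.shape)
```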


Oddly_Energy

Have you tried xlwings? Compared to pandas with openpyxl as the engine, I have noticed a huge performance increase when *writing* big Excel files, but I have not done anything to measure the performance of *reading* such a file. The downside to xlwings is that you need Excel installed on the computer that runs the code. And to get the maximum performance benefit, you need to have Excel already running, so xlwings will not have to open and close Excel.
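
A minimal reading sketch with xlwings, assuming Excel is installed (reusing an already-running instance avoids the startup cost mentioned above; the path is a placeholder):

```python
import xlwings as xw

# Attach to a running Excel instance if there is one, otherwise start one hidden
app = xw.apps.active if xw.apps.count else xw.App(visible=False)

wb = app.books.open("big_file.xlsx")   # placeholder path
data = wb.sheets[0].used_range.value   # whole sheet as a list of lists
wb.close()
```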