When you have a large excel file to handle, say 100M, it is not efficient to read it directly through pandas or openpyxl.
I have tested myself, to open one 100M size excel file and convert to pickle format, pandas takes around 8 minutes, even in readonly mode in my laptop.
Although later you can access pickle file with pandas quickly, the first 8 minutes was not acceptable for some task.
In this blog, we leverage the Excel Application itself to help us to convert excel file to csv format, which is also very easy to be handled by pandas.
Sometimes you have to admit Microsoft is still the best man to handle excel:)
Using Win32com library, you could reduce the time to around 100 seconds, which is only a fifth of time using pandas.
from time import time
import win32com.client
from os import path
import os
cwd = os.getcwd()
your_excel_file_path = path.join(cwd, "your_excel_file.xlsx")
export_csv_file_path = path.join(cwd, "new_file.csv")
start = time()
excel_app = win32com.client.Dispatch("Excel.Application")
excel_app.DisplayAlerts = False
wb = excel.Workbooks.Open(Filename=your_excel_file_path , ReadOnly=1)
# make target sheet active
ws = wb.Worksheets("active_sheet_name").Activate()
wb.SaveAs(Filename=export_csv_file_path , FileFormat="6") # 6 means csv
wb.Close(False)
excel.Application.Quit()
wb = None
excel_app = None
end = time()
print(start - end) # around 100 seconds
Top comments (0)