docx parsing time is slow compared to docx2txt tool #162

hwo411 · 2023-12-20T12:28:38Z

Hello!

We recently stress-tested the library in our app and noticed that docx parsing is much slower than in other tools on fairly large files.

Example docx file:
https://tolstoy.ru/upload/iblock/b22/voina-i-mir.docx

We compared the library to Pandoc and docx2txt.

On my laptop (Ryzen 5800H, 64GB RAM) the library parses this file in around 40 seconds.
Pandoc has similar performance.

But docx2txt parses it in under a second.

On our servers the difference is much bigger, since we're not running powerful hardware there yet.

Is there something that can be improved in the docx parsing to make it comparable to docx2txt? At first glance the output is similar, so it doesn't look like docx2txt is trading quality for speed.

I also want to mention that parsing a PDF file with the same content as this docx file takes less time (around 4 seconds), even though the PDF is larger (30MB vs 4MB).
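
For reference, here is a rough sketch of the kind of lightweight extraction that can be used as a baseline when timing. It is an assumption about the general approach (stream `word/document.xml` out of the zip and collect the text runs), not a description of how docx2txt or this library actually work. The names `extract_docx_text` and `time_call` are made up for illustration, and only the Python standard library is used.

```python
import time
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return plain text from a .docx given a path or a binary file object.

    Hypothetical helper: handles only body text, tabs and line breaks,
    and ignores headers, footers, footnotes and nested text boxes.
    """
    paragraphs = []
    current = []
    with zipfile.ZipFile(source) as archive:
        with archive.open("word/document.xml") as xml_file:
            # iterparse streams the XML, so finished paragraphs can be freed
            for _event, elem in ET.iterparse(xml_file, events=("end",)):
                if elem.tag == W_NS + "t" and elem.text:
                    current.append(elem.text)
                elif elem.tag == W_NS + "tab":
                    current.append("\t")
                elif elem.tag == W_NS + "br":
                    current.append("\n")
                elif elem.tag == W_NS + "p":
                    paragraphs.append("".join(current))
                    current = []
                    elem.clear()  # drop the finished paragraph's subtree
    return "\n".join(paragraphs)


def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

Calling `time_call(extract_docx_text, "voina-i-mir.docx")` on a local copy of the example file gives a number to compare against the library's own parse time on the same machine.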
