Hello!
We recently stress-tested the library in our app and noticed that docx parsing is much slower than other tools on fairly large files.
Example docx file:
https://tolstoy.ru/upload/iblock/b22/voina-i-mir.docx
The tools we compared the library to: Pandoc and docx2txt.
On my laptop (Ryzen 5800H, 64GB RAM) the library parses the file in around 40 seconds.
Pandoc performs similarly.
But docx2txt parses it in under a second.
On our servers the difference is much bigger, since we're not running powerful hardware yet.
Is there something that can be improved in the docx parsing to make it comparable to docx2txt? At first glance the output is similar, so docx2txt does not seem to be trading quality for speed.
I also want to mention that parsing a PDF file with the same content takes much less time (around 4 seconds), even though the PDF is larger (30MB vs 4MB).
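In case it helps with reproducing the comparison, here is a minimal timing sketch in Python. It assumes that a lightweight extractor mostly just reads `word/document.xml` from the archive and joins the text runs; this is an illustration of that approach, not docx2txt's or this library's actual code. The names `extract_text`, `time_parser`, and `make_sample_docx` are hypothetical helpers, and the sample document is synthetic.

```python
import io
import time
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_text(docx_bytes: bytes) -> str:
    """Return paragraph text from word/document.xml, one paragraph per line.

    Assumption: only the main body is read; headers, footnotes, and any
    layout or table structure are ignored.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def time_parser(parse, data, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls of parse(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(data)
        best = min(best, time.perf_counter() - start)
    return best


def make_sample_docx(paragraphs):
    """Build a tiny in-memory docx-like archive (synthetic test data)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "world"])
    print(extract_text(sample))
    print(f"best of 3: {time_parser(extract_text, sample):.6f} s")
```

To time a real file, read it with `open(path, "rb").read()` and pass the bytes to `time_parser` together with whichever parser function is being measured; the same harness can wrap the library's own entry point for a like-for-like number.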