-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Paralalize MS2Query #156
Comments
Before adding paralalization first test what optimization could further be done. |
Even if, as a computer scientist, I agree with the suggestion to profile and then optimize (following the mantra that premature optimization if the the mother of all evils), I would consider an attempt with https://github.com/dubovikmaster/parallel-pandas. Code modification is minimal (just substitute I've not looked deep enough in the code to know where the longest computations are located, so I don't know where it would be more beneficial. But if someone of you can take the time to look into this it can really help make things faster with a minimal impact on code organization. Just my $0.02 :) |
Thanks for the suggestion! Regarding the parallel pandas, I expect the long processing parts are not related to manipulations in pandas. Probably the longest steps are computing the dot product with 500.000 spectra (which is already paralalized within matchms using numba.njit) and the embedding creation. The parallelization I had in mind would be to process, multiple spectra at the same time (since they are not dependend on each other), but this is not very easy to implement currently, since we write the results iteratively to a csv file. Since this type of paralalization is a bit more involved, we will probably not be able to pick this up in the coming months, we have future plans for improving MS2Query and will probably pick this up as well, once we start on a MS2Query 2.0. |
Of course your plans are to restructure "at large", mine was just a "trick" to speed up (at almost zero cost) some computations. I agree that such tricks will not alter the overall computation time, so they are probably not worth implementing. I'll keep an eye on the rest to see if I spot some low hanging fruit of more value :) |
With the new infrastructure spectra are processed one by one. This allows for parallelization. Might be worth implementing in MS2Query.
This might not work with the csv file. However, wrongly ordered results is not really an issue since an identifier is also added.
The text was updated successfully, but these errors were encountered: