Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paralalize MS2Query #156

Open
niekdejonge opened this issue Oct 31, 2022 · 4 comments
Open

Paralalize MS2Query #156

niekdejonge opened this issue Oct 31, 2022 · 4 comments
Labels
computational performance e.g. improving speed, memory usage etc.

Comments

@niekdejonge
Copy link
Collaborator

With the new infrastructure spectra are processed one by one. This allows for parallelization. Might be worth implementing in MS2Query.

This might not work with the csv file. However, wrongly ordered results is not really an issue since an identifier is also added.

@niekdejonge
Copy link
Collaborator Author

Before adding paralalization first test what optimization could further be done.
Florian suggested using:
C profile for this:
By running:
python -m cProfile -o out.profile my_script.py
The results can be visualized afterwards by snakewis
by running snakeviz out.profile

@niekdejonge niekdejonge added the computational performance e.g. improving speed, memory usage etc. label Nov 15, 2022
@mapio
Copy link
Contributor

mapio commented Nov 25, 2023

Even if, as a computer scientist, I agree with the suggestion to profile and then optimize (following the mantra that premature optimization if the the mother of all evils), I would consider an attempt with https://github.com/dubovikmaster/parallel-pandas.

Code modification is minimal (just substitute p_apply for apply) and sometime the effect is staggering.

I've not looked deep enough in the code to know where the longest computations are located, so I don't know where it would be more beneficial.

But if someone of you can take the time to look into this it can really help make things faster with a minimal impact on code organization.

Just my $0.02 :)

@niekdejonge
Copy link
Collaborator Author

Thanks for the suggestion! Regarding the parallel pandas, I expect the long processing parts are not related to manipulations in pandas. Probably the longest steps are computing the dot product with 500.000 spectra (which is already paralalized within matchms using numba.njit) and the embedding creation. The parallelization I had in mind would be to process, multiple spectra at the same time (since they are not dependend on each other), but this is not very easy to implement currently, since we write the results iteratively to a csv file.

Since this type of paralalization is a bit more involved, we will probably not be able to pick this up in the coming months, we have future plans for improving MS2Query and will probably pick this up as well, once we start on a MS2Query 2.0.
In the meantime if you notice a place in the code where it would be easy to implement something like you mentioned, we are very happy with PR's suggesting changes!

@mapio
Copy link
Contributor

mapio commented Nov 28, 2023

Of course your plans are to restructure "at large", mine was just a "trick" to speed up (at almost zero cost) some computations. I agree that such tricks will not alter the overall computation time, so they are probably not worth implementing.

I'll keep an eye on the rest to see if I spot some low hanging fruit of more value :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
computational performance e.g. improving speed, memory usage etc.
Projects
None yet
Development

No branches or pull requests

2 participants