Parallelize MS2Query #156

niekdejonge · 2022-10-31T08:27:20Z

With the new infrastructure spectra are processed one by one. This allows for parallelization. Might be worth implementing in MS2Query.

This might not work with the CSV file. However, wrongly ordered results are not really an issue, since an identifier is also added.

niekdejonge · 2022-11-15T14:00:34Z

Before adding parallelization, first test what further optimization is possible.
Florian suggested using cProfile for this, by running:
python -m cProfile -o out.profile my_script.py
The results can then be visualized with snakeviz by running:
snakeviz out.profile
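The same profiling can also be done from inside Python, which is handy when only one part of a script is of interest. The following is a minimal sketch using the standard-library `cProfile` and `pstats` modules; `slow_step` and `run_pipeline` are hypothetical stand-ins, not MS2Query functions.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Hypothetical stand-in for an expensive step in the pipeline."""
    return sum(i * i for i in range(n))


def run_pipeline():
    """Hypothetical stand-in for the script being profiled."""
    return [slow_step(20_000) for _ in range(20)]


def profile_to_text(func, top=10):
    """Profile func() and return the top entries sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_to_text(run_pipeline))
```

To inspect the result in snakeviz instead of as text, `profiler.dump_stats("out.profile")` writes the same kind of file that the command-line invocation produces.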

mapio · 2023-11-25T06:54:57Z

Even though, as a computer scientist, I agree with the suggestion to profile first and then optimize (following the mantra that premature optimization is the mother of all evils), I would consider an attempt with https://github.com/dubovikmaster/parallel-pandas.

The code modification is minimal (just substitute p_apply for apply) and sometimes the effect is staggering.

I've not looked deeply enough into the code to know where the longest computations are located, so I don't know where it would be most beneficial.

But if one of you can take the time to look into this, it could really help make things faster with a minimal impact on code organization.

Just my $0.02 :)

niekdejonge · 2023-11-28T11:19:30Z

Thanks for the suggestion! Regarding parallel-pandas, I expect the long processing parts are not related to manipulations in pandas. Probably the longest steps are computing the dot product with 500,000 spectra (which is already parallelized within matchms using numba.njit) and the embedding creation. The parallelization I had in mind would be to process multiple spectra at the same time (since they are not dependent on each other), but this is not very easy to implement currently, since we write the results iteratively to a CSV file.

Since this type of parallelization is a bit more involved, we will probably not be able to pick this up in the coming months. We have future plans for improving MS2Query and will probably pick this up as well once we start on MS2Query 2.0.
In the meantime, if you notice a place in the code where it would be easy to implement something like you mentioned, we would be very happy with PRs suggesting changes!
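
For illustration only, the per-spectrum parallelization described above could be sketched roughly as follows. This is not MS2Query code: `analyse_spectrum`, `process_all`, and the column names are hypothetical. It assumes each spectrum can be analysed independently and that each result carries its own identifier. Workers only compute. The parent process is the only writer to the CSV file, so rows may arrive out of order but never interleave.

```python
import csv
from concurrent.futures import ProcessPoolExecutor, as_completed


def analyse_spectrum(spectrum):
    """Hypothetical stand-in for the per-spectrum work (not the MS2Query API)."""
    identifier, peaks = spectrum
    return identifier, sum(peaks)  # placeholder "score"


def process_all(spectra, csv_path, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Analyse spectra in parallel; only the parent process writes to the CSV."""
    with executor_cls(max_workers=max_workers) as pool, \
            open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Hypothetical column names, chosen for this sketch only.
        writer.writerow(["query_spectrum_nr", "score"])
        futures = [pool.submit(analyse_spectrum, s) for s in spectra]
        # Rows are written in completion order, so the identifier column
        # is what ties each row back to its query spectrum.
        for future in as_completed(futures):
            writer.writerow(future.result())
```

Whether processes or threads are the better fit depends on where the time is actually spent, so the executor class is left as a parameter. With a process pool, the call should sit under an `if __name__ == "__main__":` guard.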

mapio · 2023-11-28T14:00:50Z

Of course your plans are to restructure "at large", mine was just a "trick" to speed up (at almost zero cost) some computations. I agree that such tricks will not alter the overall computation time, so they are probably not worth implementing.

I'll keep an eye on the rest to see if I spot some low hanging fruit of more value :)

niekdejonge added the "computational performance (e.g. improving speed, memory usage etc.)" label on Nov 15, 2022
