Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rclone: fix discard thread issues #92

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

alexandru-bagu
Copy link

This PR fixes two issues:

  1. because the while in the discard thread was not checking the value of self.process it would cause an exception to be raised in a thread where no exception handling is present
  2. on Cygwin the daemon thread would sometimes not start when thread.start() is called and borg would hang. unsure if it happens on linux though I doubt it.
  • keep the discard thread alive only while we have a valid self.process
  • mark thread as non daemon - this would cause borg to hang if the rclone process is not indeed closed which should not happen anyway

@ThomasWaldmann
Copy link
Member

ThomasWaldmann commented Jan 10, 2025

@alexandru-bagu so you practically tested this on cygwin? Thanks for the PR!

@alexandru-bagu
Copy link
Author

Yes, I did more than ~30 tests on Cygwin.
Without this PR borg would randomly hang at thread.start. Once I removed daemon=True, borg would not hang and would work correctly all the time.

@alexandru-bagu
Copy link
Author

I actually did a few more tests and it looks like borg is still hanging on thread.start() even with daemon=False. What is odd is that when the debug argument is provided it does not hang...

@alexandru-bagu
Copy link
Author

For whatever random reason, when starting a second thread using threading.thread.start() the 2nd would hang. Oddly enough, when --debug is used it does not hang which means it's got to be something with a process pipe in cygwin.
To make it work I decided to bypass piping the rclone's stderr output but that meant I needed to know the port beforehand and this introduces a race condition.
To make sure that rclone ends up running (if at all) with the correct port, I am selecting an open port first, providing that to rclone and then doing a noop request to make sure rclone is up and that it responds accordingly.
Additionally I added an environment variable to override the rclone binary name as in Cygwin I had issues where I would need to override the PATH environment variable to help it find the binary.

I will test this on Linux to make sure no bugs are introduced. My only concern (when comparing Windows with Linux) is that in Linux I know one can get an error with a port being in used because Linux keeps ports for a while even when the socket is closed. I attempted to fix this by using s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) but I have yet to test it.

@alexandru-bagu alexandru-bagu force-pushed the fix-rclone-hang branch 2 times, most recently from 9b2d4b8 to df7a25f Compare January 14, 2025 18:10
@alexandru-bagu
Copy link
Author

I tested on a Oracle Linux 8.10 the changes and had to make one adjustment. On Windows urllib3 has a bigger timeout for starting a http request whilst on Linux it would instantly fail because rclone was not yet listening to the given port. As such I am waiting for the port to become available before trying the noop request. This version seems to work fine for both OS.
Additionally updated the README.rst to include information regarding the RCLONE_BINARY variable.

src/borgstore/backends/rclone.py Outdated Show resolved Hide resolved
src/borgstore/backends/rclone.py Outdated Show resolved Hide resolved
src/borgstore/backends/rclone.py Outdated Show resolved Hide resolved
src/borgstore/backends/rclone.py Outdated Show resolved Hide resolved
… on relying on rclone to pick one; add a way to specify the path to the rclone binary for custom installations
@alexandru-bagu
Copy link
Author

alexandru-bagu commented Jan 15, 2025

I opted not to catch an error now for self.noop("noop") as that now suggests there's an issue with rclone which could not be handled by _rpc's retry mechanism.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nitpicks.

src/borgstore/backends/rclone.py Outdated Show resolved Hide resolved
@ThomasWaldmann
Copy link
Member

@ncw Can you have a look, please?

@alexandru-bagu
Copy link
Author

alexandru-bagu commented Jan 19, 2025

I am actually wondering if it wouldn't be a good idea just to add an environment variable with a port to be used for rclone. If for some reason the port is already in use, then the user can decide whether to use another one or not. That way we can also get rid of the while loop.
Additionally, maybe being able to provide parameters to rclone wouldn't be a bad idea, something like a RCLONE_RC_EXTRA_ARGS. In my tests I found that some arguments might be necessary for optimizations.
@ncw Can you please chime in?

Copy link
Contributor

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable to me.

If duplicating the find an open port code in Python lets us lose a thread then that is probably a good tradeoff as Python and threads aren't my favourite sandwich filling.

Though I'll note that the detection of an open port is racy - there is opportunity for another process to take it in between python closing it and rclone opening it. I think it is unlikely though.

I probably wouldn't bother with a variable to set a port number - that will break as soon as the user runs two copies of rclone.

@@ -66,7 +67,7 @@ class Rclone(BackendBase):
"""

precreate_dirs: bool = False
HOST = "localhost"
HOST = "127.0.0.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just note here that it is possible for 127.0.0.1 not to exist on IPv6 only computers... localhost should map to ::0 in that case. But then again there are probably computers where localhost doesn't work too!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wondered about the localhost change. @alexandru-bagu was there a specific reason why you changed that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't remember the exact reason right now but I remember that when I was checking if the port was bound, the check would fail even though rclone was listening to the specified port.
It might be a Cygwin issue only and an option could be to allow the user to specify the hostname for binding the socket.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so guess we'll leave that "as is".

@alexandru-bagu
Copy link
Author

Though I'll note that the detection of an open port is racy - there is opportunity for another process to take it in between python closing it and rclone opening it. I think it is unlikely though.

I considered this and I did a noop check. If that fails another port is considered. This will ensure that rclone or a software that implements rclone rc protocol would respond with a valid response.

Copy link
Member

@ThomasWaldmann ThomasWaldmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Tell me when ready, so I can merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants