Joining often gets stuck on waiting for metadata #1903

Closed
Tracked by #1902
leblowl opened this issue Oct 3, 2023 · 29 comments

@leblowl
Contributor

leblowl commented Oct 3, 2023

On develop, with two local nodes, this happens quite often. Sorry I haven't looked into it any more than that.

@leblowl leblowl changed the title Community often get's stuck on waiting for metadata Joining often get's stuck on waiting for metadata Oct 3, 2023
@holmesworcester holmesworcester moved this to Sprint in Quiet Oct 3, 2023
@holmesworcester holmesworcester changed the title Joining often get's stuck on waiting for metadata Joining often gets stuck on waiting for metadata Oct 3, 2023
@holmesworcester holmesworcester added the bug Something isn't working label Oct 3, 2023
@vinkabuki
Contributor

@leblowl Is it stuck forever?

@holmesworcester
Contributor

I think so, or at least quite a while. The steps to reproduce would be to start a community in the latest develop branch and try to join it.

Expected: you see the general channel very quickly after Tor connects.

Actual: you get stuck on one of the last steps.

@vinkabuki
Contributor

I cannot reproduce it; I always end up joining eventually.

@holmesworcester
Contributor

@leblowl any more insights on how to reproduce this? is it a problem in the latest develop?

@leblowl
Contributor Author

leblowl commented Oct 30, 2023

Is it not possible to reproduce? If it's taking more than 30 seconds, or sometimes several minutes, then I think that's an issue. I haven't looked into this further, but I will soon.

@vinkabuki
Contributor

vinkabuki commented Oct 31, 2023

So this is a problem with a misleading message. Anything below 20 minutes is not suspicious at all. What happens here is that you are just waiting for Tor to connect to the first peer. It's a known issue that Tor needs a lot of time to "publish" addresses and make them dialable.
Some days it works better, some days it works worse.

@holmesworcester
Contributor

Are we sure that's what the problem is?

@vinkabuki
Contributor

vinkabuki commented Oct 31, 2023

Yes, if it's not stuck forever, then it's not a bug; it's just standard Tor behavior.

@leblowl
Contributor Author

leblowl commented Oct 31, 2023

If it takes 20 minutes to connect, then I think that's a pretty big issue, for internal testing and general usability. I think we should look into it more and confirm what we think is happening and document that behavior. If it is a limitation of Tor, then that's great to know as it provides another data point to consider when talking about moving away from Tor.

@holmesworcester
Contributor

holmesworcester commented Oct 31, 2023

"I think we should look into it more and confirm what we think is happening and document that behavior. If it is a limitation of Tor, then that's great to know as it provides another data point to consider when talking about moving away from Tor."

I agree that we should do this, and I don't think we know yet that it's just Tor not connecting. Just because it connects eventually does not mean we know what the problem is.

Also, we should ensure that the text under the progress bar accurately reflects what is happening and what we are waiting for. If we think it's wrong, we should make an issue for that.

@vinkabuki do you think the message is misleading? can you make an issue for that with steps to reproduce?

@EmiM
Contributor

EmiM commented Nov 2, 2023

It's always been a known behavior of Tor. You can even see it in the e2e tests: last week the multiple-client test was failing because a 15-minute timeout was not enough. Today the test passed in 8 minutes. I know that one example is not scientific evidence, but I'm just writing what we observed during 2 years of working with Tor.

We have some general data on Tor connections that we gathered (and are still gathering) on AWS; however, these tests are based on connecting to an HTTP server, i.e. the old registration mechanism with the registrar: https://s3.console.aws.amazon.com/s3/buckets/tor-connection-data?region=us-east-1&tab=objects
By the way, I think we can already stop running those.

@vinkabuki
Contributor

imo 'waiting for metadata' doesn't explain what's actually happening. we used to have a good message.

@holmesworcester
Contributor

Any step where Tor has started and we are waiting for Tor to connect to an onion address should say 'connecting with Tor'

@siepra
Contributor

siepra commented Nov 8, 2023

Are there any decisions about the steps to take for this task?
Also, does it belong in the current sprint? It sounds like a general problem, not exactly related to the changes we're about to publish.

@holmesworcester

@holmesworcester
Contributor

One very concrete thing is: every step in the joining process should be correctly described to the user, so that if a step fails or takes too long we know why.

@siepra
Contributor

siepra commented Nov 8, 2023

I guess we need specific guidelines on the descriptions then. Otherwise we'll get "lost in translation". I mean, each of us may have a different understanding of what a step is about. "Waiting for metadata" is actually accurate when you think about it.

@holmesworcester
Contributor

holmesworcester commented Nov 8, 2023

This is how we dealt with it last time: #1277 (comment)

Most important things to show:

  1. Tor bootstrapping.
  2. Tor connection process in registration (most important because this is where we get stuck; show a new message at the beginning of each fetch. This will repeat many times in most cases. 'fetching/timeout/fetching/timeout/etc')
  3. Orbitdb block download of messages. (doesn't have to be continuous or "make sense" just show what's happening.)

Notes:

  • We don't want to show "errors" that are not really errors. (maybe change them in the logs or maybe hide them)
  • Maybe debouncing is a good idea so we don't show everything, but we should show everything that takes 1s or 2s.
  • Being busy and fast is good
  • It's good if we can see what we got stuck on for debugging purposes
  • Think of yourself as an artist trying to give the user some clear information about what's happening and reassure them that something is happening.
  • We can use indeterminate progress bars in most cases (bonus if we use a steady progress bar)
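
The debouncing note above could be sketched roughly as follows. This is a hypothetical TypeScript illustration, not code from Quiet: `StepEvent`, `visibleSteps`, and the 1000 ms default are assumed names and values. It keeps only the steps that stayed current for at least the threshold, so very short steps are not flashed at the user.

```typescript
// Hypothetical record of a joining step becoming the current step.
interface StepEvent {
  label: string
  atMs: number // timestamp (ms) at which this step started
}

// Returns the labels that would be displayed if a step is shown only when it
// remained the current step for at least `minVisibleMs`.
// `events` must be ordered by `atMs`; `endMs` marks the end of the last step.
function visibleSteps(
  events: StepEvent[],
  endMs: number,
  minVisibleMs = 1000 // assumed threshold, based on the "1s or 2s" note
): string[] {
  const shown: string[] = []
  events.forEach((ev, i) => {
    // A step lasts until the next step starts (or until `endMs` for the last one).
    const nextAt = events[i + 1]?.atMs ?? endMs
    if (nextAt - ev.atMs >= minVisibleMs) shown.push(ev.label)
  })
  return shown
}
```

A real implementation would more likely use timers than a recorded list of events; the pure function is only meant to make the filtering rule easy to check.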

Since we've changed some of the steps in the process, we may need to change what is reported to the user.

Also, any time we are waiting for Tor to connect, we should say "Connecting via Tor..." until Tor has connected successfully. And any time we are waiting for something else, like block data, we should not mention Tor. This way, we and our users will be on the same page about the impact of Tor on the joining process.

In other words, if Tor slowness is really the culprit here, let's prove it by showing status messages that clearly state when we are waiting for Tor and when we are not.
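
One way to make that rule concrete is a single function that derives the status text from what the app is actually waiting on. The sketch below is hypothetical TypeScript: the `WaitingOn` type, the `statusMessage` name, and every string other than "Connecting via Tor..." are assumptions for illustration only.

```typescript
// Hypothetical description of what the joining process is blocked on.
type WaitingOn =
  | 'tor-bootstrap' // Tor itself has not finished starting
  | 'tor-connection' // Tor has started; waiting to reach an onion address
  | 'block-data' // connected; waiting for data such as message blocks

// Only the Tor-related states mention Tor, so a stalled bar tells the user
// (and us) whether Tor is what we are waiting for.
function statusMessage(waitingOn: WaitingOn): string {
  switch (waitingOn) {
    case 'tor-bootstrap':
      return 'Starting Tor...' // placeholder wording (assumption)
    case 'tor-connection':
      return 'Connecting via Tor...'
    case 'block-data':
      return 'Loading messages...' // placeholder wording (assumption)
  }
}
```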

@holmesworcester
Contributor

@siepra regarding the call we just had, I don't think there's any reason in particular why lucas needs to work on this. The workflow that @Kacper-RF did when he worked on these screens initially was:

  1. Identify stages in the process that are meaningful.
  2. Propose names for those stages that will display to the user
  3. Get approval for those names
  4. Implement it and show a screencast and confirm that it's right.

I think this approach will work again, so I think anyone on the team can do this. @Kacper-RF might be the best person since he worked on it initially.

@Kacper-RF
Contributor

Current state:

5% - Connecting process started (initial log)
20% - Connecting to community owner via Tor
20% - Registering owner certificate (only visible for owner)
30% - Launching community
40% - Spawning hidden service for community
50% - Initializing libp2p
60% - Initialized storage
70% - Initializing IPFS
75% - Loaded certificates to memory
80% - Initialized DBs
85% - Launched community
87% - Waiting for metadata
90% - Channels replicated
95% - Certificates replicated

From my observations, there are a few steps on which the user spends the most time, and I think these are the steps where we should provide the most valuable information:

20% - Connecting to community owner via Tor
85% - Launched community
87% - Waiting for metadata

@holmesworcester
Contributor

Thanks @Kacper-RF! It's super helpful to see this list. So, just to clarify we're talking about 2.x now. I have a few general suggestions about some of these so I'm going to go through them.

"20% - Connecting to community owner via Tor"

This should now say "Connecting to peers"

"40% - Spawning hidden service for community"

This happens synchronously? Why do we have to wait for this at all? We already have the ability to make outgoing connections to peers so it shouldn't block anything.

"85% - Launched community"

This should say "Launching community", right? Because it's in progress at this stage? Or if we have already launched the community and something else is in progress, what is in progress?

It's weird to be waiting on a step that is described in the past tense as having already happened. Like, if it happened, why am I waiting?

"87% - Waiting for metadata"

Okay, at this point it sounds like we have already made connections to peers over Tor, in 2.x, so we aren't waiting for any more connections. What is actually happening here?

Should this say "Downloading community metadata"?

And if all we're doing here is downloading community metadata, how do we explain "joining often gets stuck?" Above Emi says:

"It's always been a known behavior of Tor. You can even see it in the e2e tests: last week the multiple-client test was failing because a 15-minute timeout was not enough. Today the test passed in 8 minutes. I know that one example is not scientific evidence, but I'm just writing what we observed during 2 years of working with Tor."

But we've already made connections via Tor to some peers (at "20% - Connecting to community owner via Tor"), so what is our explanation for the issue leblowl is seeing?

@Kacper-RF
Contributor

After taking a closer look at the current joining flow, I find these steps very confusing; they are tied to the old master/production joining flow.

87% - Waiting for metadata - this is actually the moment when we are waiting to connect with other peers.

The previous logs are visible only very briefly and are difficult to even read.

So I will suggest to use only 4 steps

  1. Connecting process started
  2. Connecting to peers (most time consuming)
  3. Channels replication
  4. Certificates replication
  5. (optional) Waiting for messages to load (we wait for at least one message to be displayed in the channel, so we don't drop the user into an empty channel)

@Kacper-RF Kacper-RF self-assigned this Nov 16, 2023
@Kacper-RF Kacper-RF moved this from Sprint to In progress in Quiet Nov 16, 2023
@holmesworcester
Contributor

holmesworcester commented Nov 17, 2023

Previously we showed information about Tor's startup process. Can we show that here using the existing language, if Tor has not started yet? (Sometimes it will have started already, sometimes not.)

"So I will suggest to use only 4 steps"

What percentages will you display for these steps and what will you call them? I'd propose:

  1. [Tor startup steps]
  2. Connecting to community members via Tor
  3. Loading messages

Is certificates replication a step that would block other steps? I don't think it should be, so I think these steps can be enough. Is there a way to show progress on "loading messages"?

@Kacper-RF
Contributor

We can do something like that:

5% - Connecting process started (to always let the user know that something is ongoing)

From 5% - 50%: Tor bootstrapping logs, for example:

Bootstrapped 5% (conn)
Bootstrapped 10% (conn_done)
Bootstrapped 14% (handshake)
Bootstrapped 15% (handshake_done)
Bootstrapped 25% (requesting_status)
Bootstrapped 30% (loading_status)
Bootstrapped 40% (loading_keys)
Bootstrapped 45% (requesting_descriptors)
Bootstrapped 50% (loading_descriptors)
Bootstrapped 55% (loading_descriptors)
Bootstrapped 61% (loading_descriptors)
Bootstrapped 70% (loading_descriptors)
Bootstrapped 75% (enough_dirinfo)
Bootstrapped 90% (ap_handshake_done)
Bootstrapped 100% (done)

I will adjust the progress bar for them somehow.
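
A minimal sketch of how such log lines could be mapped onto the 5% to 50% range of the progress bar. `parseBootstrap` and `toBarPercent` are hypothetical helper names, and the linear scaling is an assumption, not the merged implementation.

```typescript
// Parses a Tor log line such as "Bootstrapped 45% (requesting_descriptors)".
// Returns null for lines that do not report bootstrap progress.
function parseBootstrap(line: string): { percent: number; tag: string } | null {
  const m = /Bootstrapped (\d+)% \(([^)]+)\)/.exec(line)
  return m ? { percent: Number(m[1]), tag: m[2] ?? '' } : null
}

// Linearly rescales Tor's 0-100% bootstrap progress into the slice of the
// overall progress bar reserved for it (5%-50% by default, an assumption).
function toBarPercent(torPercent: number, lo = 5, hi = 50): number {
  const clamped = Math.min(100, Math.max(0, torPercent))
  return Math.round(lo + (clamped / 100) * (hi - lo))
}

const parsed = parseBootstrap('Bootstrapped 45% (requesting_descriptors)')
if (parsed !== null) {
  console.log(parsed.tag, toBarPercent(parsed.percent))
}
```

Because Tor may start asynchronously, a caller would probably also want to keep the largest value seen so far so the bar never moves backwards.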

55% - Connecting to community members via Tor (long step; maybe add 1% to the progress bar every 3 seconds, but I don't know if that's a good idea)
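
For illustration, the "1% every 3 seconds" idea could look like the sketch below. `timedProgress` is a hypothetical name, and the cap of 79% (just below the next step at 80%) is an assumption added so the bar never overtakes the following step.

```typescript
// Progress shown while waiting on the long "connecting" step:
// starts at `start` and gains 1% every `stepMs`, but never exceeds `cap`.
function timedProgress(
  elapsedMs: number,
  start = 55, // percent at which the step begins (from the proposal)
  cap = 79, // assumed ceiling, one below the next step
  stepMs = 3000 // one tick every 3 seconds (from the proposal)
): number {
  const ticks = Math.floor(Math.max(0, elapsedMs) / stepMs)
  return Math.min(cap, start + ticks)
}
```

With these defaults the bar reaches the cap after 72 seconds and then stops moving, so a connection that takes many minutes would still look stalled for most of the wait.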

80% - Loading messages (I think I can keep the same message but change the value (80%, 85%, 90%) after receiving events from OrbitDB and some backend logic)

We want to have replicated channels and certificates before showing the channel list to the user, because as far as I remember, some logic depends on certificates.

I think I can start working on a draft solution and send you some videos of what it looks like.

Let me know what you think.

@holmesworcester
Contributor

This sounds good!

@Kacper-RF
Contributor

After several different approaches, I finally ended up with something like this:

Showing the Tor bootstrapping logs was very problematic because of the asynchronous start, and it sometimes broke the progress bar.

I tried to limit the steps to make them clearly visible and readable to the user.

I implemented an additional progress bar animation for when the user is on the most time-consuming step: Connecting to community members via Tor

The idea of adding 1% every 3 seconds was risky because sometimes the process takes a really long time if no member is online.

joining-user.mp4

@Kacper-RF
Contributor

#2093

@holmesworcester
Contributor

This looks great!

Is that new animation implemented on mobile too? (Just asking because it seems non-standard)

It might be more standard to switch the whole thing to an "infinite" progress bar that goes back and forth, if that's easier on mobile for some reason. But this looks great and, most importantly, the decisions about what to say to the user look great to me.

@Kacper-RF
Contributor

Thanks!
Yes, I implemented the animation on mobile as well. It was a bit trickier than on desktop, but it looks the same.

@Kacper-RF Kacper-RF moved this from In progress to Waiting for review in Quiet Nov 24, 2023
Kacper-RF added a commit that referenced this issue Nov 30, 2023
* feat: basic changes

* feat: better UX

* feature: state manager and desktop

* feat: mobile part

* fix: use enum instead hardcoded string

* fix: mobile channel list screen

* feat: debug log

* feat: trigger mobile e2e

* feat: isJoiningCompleted selector

* test: add isJoiningCompleted
@Kacper-RF Kacper-RF moved this from Waiting for review to Merged (develop) in Quiet Nov 30, 2023
@Kacper-RF Kacper-RF moved this from Merged (develop) to Ready for QA in Quiet Dec 1, 2023
@kingalg
Collaborator

kingalg commented Jan 11, 2024

Desktop: 2.0.3-alpha.15
Mobile: [email protected], ios

Done.

@kingalg kingalg closed this as completed Jan 11, 2024
@kingalg kingalg moved this from Ready for QA to Done in Quiet Jan 11, 2024