Long GC Suspend and hard faults #111201

Open
arekpalinski opened this issue Jan 8, 2025 · 7 comments

@arekpalinski

Description

We (RavenDB) have a customer that is occasionally experiencing very long GC pauses. It's not a new issue or a regression; it has been happening for a long time, already on .NET 6. Some time ago RavenDB was updated to .NET 8 (in order to be on an LTS version); more specifically, version 8.0.400 is deployed on the affected environment.

The pauses affect the application's requests to RavenDB. Monitoring also shows connectivity issues between the nodes (the setup is a 3-node RavenDB cluster).

Using PerfView, we have so far narrowed it down to GC Suspend events taking a very long time.

Reproduction Steps

This happens only occasionally and only in a production environment. RavenDB is configured as a 3-node (3-machine) cluster, and the issue occurs randomly on any node. The machines have 256 GB of memory and run Windows Server 2019 Datacenter (OS build: 17763.5696.amd64fre.rs5_release.180914-1434). They are hosted in Azure.

Expected behavior

GC pauses should not take this long.

Actual behavior

We'd like to share our analysis of a recent occurrence of the issue. We have three PerfView traces that were collected using the following command:

PerfView.exe /nogui /accepteula /BufferSizeMB:4096 /CircularMB:4096 /CollectMultiple:3 /StopOnGCSuspendOverMSec:5000 /KernelEvents=ThreadTime+Memory+VirtualAlloc /NoNGenRundown /Merge:true /Zip:true collect /Process:"Raven.Server" Over5SecGCSuspend

The longest GC suspend took 18,724.797 mSec (time range: 244,459.607 - 263,347.329):

Image

The analysis below is about this GC suspend event (although the other two PerfView outputs are very similar).

Events

  • The affected GC thread, the one doing the suspension, is 2908

Image

Thread Time (with Ready Threads) Stacks

Analysis of coreclr!ThreadSuspend::SuspendEE

  • There are a lot of READIED BY TID(0) Idle (0) CPU Wait < 1ms IdleCPUs events, which mostly point to ntoskrnl!??KiProcessExpiredTimerList (READIED_BY):
    Image

  • Below we can also find two RavenDB threads - 5108 and 6524:
    Image

Looking deeper, we can see that both threads are performing queries:

Image

Looking at the stacks of thread 5108, we can see that it's reading documents from disk (we use memory-mapped files), causing a page fault:

Image

This is about the following code:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/LowLevelTransaction.cs#L648-L655

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/Paging/AbstractPager.cs#L332
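
For illustration, a minimal sketch of how a managed read from a mapped view can turn into a hard page fault when the page is not resident (this is not RavenDB's actual pager code; the path and offset are hypothetical):

    using System.IO;
    using System.IO.MemoryMappedFiles;

    static class MappedReadSketch
    {
        // Reads a single byte from a memory-mapped file. If the OS has
        // trimmed that page from the working set, the dereference below is
        // a hard fault: the thread blocks in the kernel until the page is
        // read back from disk.
        static unsafe byte ReadByteAt(string path, long offset)
        {
            using var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
            using var accessor = mmf.CreateViewAccessor();

            byte* ptr = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                return *(ptr + offset); // potential hard page fault here
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }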

Events (ReadyThread)

ReadyThread events for AwakenedThreadId 2908 (the GC thread doing the suspension) show it was awakened by the previously mentioned Raven threads, 5108 and 6524, but also by the Idle (0) process, so I assume it's the System process.

Image

CPU Stacks

From the CPU Stacks we know that before the long GC suspend, we had started FlushFileBuffers() on our Raven.voron file, where documents are saved (and which threads 5108 and 6524 are reading from). We call it explicitly and periodically here:

https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L344

It was still running during the GC suspend. Under the covers we can see an MmTrimSection() call, which as we understand evicts some pages, so subsequent access to the trimmed pages results in a page fault, requiring the system to re-read the data from the file (which is what we see in threads 5108 and 6524).

Image
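
For reference, a simplified sketch of that flush path (not the actual WindowsMemoryMapPager code; handles and addresses are placeholders):

    using System;
    using System.ComponentModel;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class FlushSketch
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool FlushFileBuffers(SafeFileHandle hFile);

        // Periodic sync of a mapped data file: flush the dirty mapped pages,
        // then ask the OS to flush the file's buffers to stable storage.
        // According to the traces above, while FlushFileBuffers is running the
        // kernel may also trim pages of the section (MmTrimSection), so later
        // reads of those pages hard-fault and must be re-read from disk.
        public static void Sync(IntPtr baseAddress, UIntPtr length, SafeFileHandle fileHandle)
        {
            if (!FlushViewOfFile(baseAddress, length))
                throw new Win32Exception(Marshal.GetLastWin32Error());

            if (!FlushFileBuffers(fileHandle))
                throw new Win32Exception(Marshal.GetLastWin32Error());
        }
    }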

Events (Suspend EE and Hard Faults)

Going back to Events view and adding Hard Faults events:

Image

So from our analysis it looks like the GC suspend was caused by hard faults taking 18-19 seconds (we are not sure why) following the recent FlushFileBuffers().

Questions

  1. Can you please validate our analysis?
  2. Can you please confirm our assumption about MmTrimSection()?
  3. Is it an expected / known issue that a managed thread taking a hard page fault can block GC suspension?
  4. Is there any kind of OS / kernel lock that could affect a thread doing a read and result in 18-19 second hard faults?

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 8, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jan 8, 2025
@huoyaoyuan huoyaoyuan added area-GC-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jan 8, 2025

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@Maoni0 Maoni0 added area-VM-coreclr and removed area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Jan 8, 2025

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@Maoni0

Maoni0 commented Jan 8, 2025

looping in @kouvel for suspension issue. thanks for the detailed investigation!

@markples

markples commented Jan 8, 2025

just a few observations - I don't know details about the flush/trim part, but yes, if a normal memory operation faults, then that thread won't have been put into a "GC ready" state (as opposed to the flush itself, which the DllImport mechanism can do). (glossing over details about fully interruptible code vs not)

Presumably the file size (seeing the file offset of 815,000,000,000) is part of it.
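
To illustrate the distinction (a rough sketch, glossing over the runtime details; the names and the mapped pointer are hypothetical):

    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class SuspensionSketch
    {
        // 1) A P/Invoke call: the runtime switches the calling thread to
        //    preemptive mode for the duration of the native call, so GC
        //    suspension does not have to wait for it even if the call
        //    blocks for a long time.
        [DllImport("kernel32.dll", SetLastError = true)]
        public static extern bool FlushFileBuffers(SafeFileHandle hFile);

        // 2) A plain managed read of mapped memory runs in cooperative
        //    mode. If it hard-faults, the thread is stuck in the kernel and
        //    cannot reach a GC-safe point until the page is paged back in,
        //    so SuspendEE has to wait for it.
        public static unsafe byte ReadMappedByte(byte* mappedAddress)
        {
            return *mappedAddress; // potential hard fault in cooperative mode
        }
    }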

@arekpalinski

Yes, the file in question is pretty large - 862 GB. Here are the details of a sample sync operation on that file (FlushViewOfFile + FlushFileBuffers). Note it's from a different time, so it's not about the GC suspension in question:

Image

According to our own metrics we had 49.5 MB of written data that had not yet been synced. It's literally the measurement of the following piece of code:
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L327-L349
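
For context, a rough sketch of how such a metric could be taken around the two flush calls (hypothetical names; not the actual measurement code):

    using System;
    using System.Diagnostics;

    class SyncMetricsSketch
    {
        long _bytesWrittenSinceLastSync; // incremented by the write path (not shown)

        // Measures how long the FlushViewOfFile + FlushFileBuffers pair takes
        // and how much written-but-unsynced data it covered (e.g. the 49.5 MB
        // reported above).
        public TimeSpan SyncAndMeasure(Action flushViewOfFile, Action flushFileBuffers)
        {
            long pendingBytes = _bytesWrittenSinceLastSync;
            var sw = Stopwatch.StartNew();

            flushViewOfFile();
            flushFileBuffers();

            sw.Stop();
            _bytesWrittenSinceLastSync = 0;
            Console.WriteLine($"Synced {pendingBytes / (1024.0 * 1024.0):F1} MB in {sw.Elapsed}");
            return sw.Elapsed;
        }
    }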

@arekpalinski

We also have other PerfView traces, collected about 15 minutes earlier, where we also had a GC suspension taking 6.8 seconds. During that time we can likewise see very long hard page faults (meanwhile there are others, even with similar offsets, taking below 1 ms. I know it doesn't mean anything, but it's interesting):

Image

Dropping the HardFault events and showing more FlushInit events, which happened much earlier (since the hard fault durations were about 20 seconds, I assume earlier flushes must have caused them), shows us:

Image

@arekpalinski

Something that I have trouble understanding is that we see a lot of FlushInit events, but almost no FlushBuffers events, even though the Any Stacks view shown for any FlushInit event points to our FlushFileBuffers.

Can it be that, as shown in one of the screenshots from yesterday, most of our FlushFileBuffers time is actually spent in MmTrimSection?

Image
