Add native lock-free dynamic heap allocator #4749

Draft
wants to merge 4 commits into base: master
Conversation

Feoramund
Contributor

Native Odin-based Heap Allocator

After three months of intense development and rigorous testing, I am finally ready to share the results of my latest project.

In short, this is a lock-free dynamic heap allocator written solely in Odin, utilizing direct virtual memory access to the operating system where possible. Only system calls are used to ask the OS for virtual memory (except on operating systems where this is verboten, such as Darwin, where we use their libc API to get virtual memory), and the allocator handles everything else from there.

Rationale

Originally, I was working on porting all of core to use os2 when I found the experimental heap allocator in there. After hooking my unreleased os2 test framework up to it, I found that it suffered from more race conditions than had already been reported, as well as other synchronization issues stemming from apparent misunderstandings of how atomic operations work.

The most confusing code that stood out to me was the following block:

	idx := sync.atomic_load(&alloc.idx)
	prev := sync.atomic_load(&alloc.prev)
	next := sync.atomic_load(&alloc.next)

All three of those fields live in the same u64, so three separate atomic loads do not yield a consistent snapshot; it would make more sense to atomically load alloc once and then read each field out of that single value. I spent a few days trying to make sense of heap_linux.odin, but it was a bit much for me at the time. The previous block and the warnings listed by TSan didn't give me great hope that it would be simple to fix.
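
For illustration, the single-snapshot approach might look like this (a minimal sketch; the packed field name, offsets, and widths are assumptions, not the actual layout in heap_linux.odin):

	// Take one atomic snapshot of the packed u64, then decode each field
	// from that single consistent value.
	packed := sync.atomic_load(&alloc.packed)
	idx    := (packed >>  0) & 0xFFFF
	prev   := (packed >> 16) & 0xFFFF
	next   := (packed >> 32) & 0xFFFF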

So, I did what I think most programmers do in this situation: I decided to try writing my own from nothing, hopefully coming to a better appreciation of the problem as a whole.

I combed through the last 30 years of allocator literature, along with some papers on parallelism.

For dynamic heap allocators, I found that contention, false sharing, and heap blowup were issues often mentioned. The Hoard paper (Berger et al., 2000) was particularly helpful in figuring out an overall design for solving those issues.

There is hopefully nothing too novel about the design I've put together here. We're all on well-trodden ground. I think the most exciting feature is the free bitmap, where most allocators appear to use free lists instead.

Goals

The design of this allocator was guided by three principles.

  • Robustness: A dynamic heap allocator cannot afford to have bugs. It must be as small as possible to reduce the number of possible places something could go wrong. Locks cannot be used, as this introduces the potential for deadlocking an entire program upon unexpected termination of a thread.
  • Clarity: A dynamic heap allocator is complicated enough already, and adding lock-free as a goal only makes it more so. To this end, the internals are thoroughly commented, explicit and meaningful variable names are used, and there are plenty of tests and assertions throughout the code to indicate what the design expects.
  • Speed: Where it did not compromise the other goals, optimizations were applied iteratively throughout development to make the allocator as fast as possible in comparison to the previous libc malloc-based allocator on Linux.

Features

  • Superpage-based Allocation, which allows theoretically faster pointer access by putting less pressure on the operating system's Translation Lookaside Buffer. Most MMUs use 4KiB pages but can generally also distribute larger pages of 2MiB and above, and this is leveraged directly when possible.
  • Remote Free Bitmaps, which allow wait-free freeing from other threads through atomic bit-flips (see the sketch after this list).
  • Superpage Orphanage implemented using a lock-free algorithm to prevent heap blowup and allow reuse of already-allocated superpages, reducing how often the allocator makes requests to the virtual memory subsystem.
  • Lock-Free Guaranteed: While the majority of the synchronization methods implemented in the allocator are wait-free, the strongest overall guarantee that can be made is that it is lock-free. This means that no thread, even if it is terminated or crashes, can take down the allocator. The worst that can happen is the thread's memory is leaked. At least one thread in the system is guaranteed to make progress, as the academic definition of lock-free goes.
  • Thread-local Heaps eliminate allocator-induced false sharing and most allocation contention. Each thread need not communicate with a global structure for permission to allocate.
  • Runtime Slab Reconfiguration: Superpages are set up with a fixed number of slabs, but the details of how those slabs are configured are left to runtime determination. A superpage can have a dozen slabs of one size category, or it can have a variety of sizes; the runtime needs of the program determine how the allocator partitions each slab.
  • Slab Allocation structures individual allocations into fixed-size slabs of memory, each subdivided into fixed-size bins of a single size rank, packed adjacent to one another. This eliminates headers (including any need to store alignment information), which saves space and encourages better cache locality and higher performance when accessing objects of similar sizes. There is no need for complicated coalescing logic either, and resizing can be a no-op if the new size is within the same category.
  • Superpage & Slab-masked Allocations, which allow constant-time lookup of the superstructures housing individual allocations through simple bitwise AND operations on the addresses distributed by the allocator (also sketched after this list).
  • Heap Caching vastly increases allocation speed, as each heap has a cache of superpages with slabs available, as well as a map structure pointing to slabs of each size category available. These caches are implemented through using space that would otherwise be wasted in each superpage due to the need to align each pointer to its owning superpage and slab.
  • Slab-wide Allocation: for larger-than-bin-sized allocations, the allocator may use entire slabs to house a single allocation. These slab-wide allocations can span multiple slabs, allowing the allocator to make use of as much space as is available in a superpage.
  • Linear Allocation Strategy, which reduces tracking which bins have been used to a single counter and helps encourage packing new objects near old objects. A slab allocates linearly from the first free slot to the last.
  • Completely Native Implementation: The whole allocator is written in pure Odin which allows for the compiler to better reason about what is happening in the code. No longer is each call to malloc and free an opaque barrier. Given that the code is available right here in the runtime itself, it allows configuration to any programmer's needs. It provides uniform behavior across all platforms, as well; a programmer need not contemplate how heap allocation may impact performance on one system versus another.
  • 1k LoC: Excluding system-specific code, debug code, and assertions, this implementation comes in at just about one thousand lines of code. Dynamic allocators can be complicated, and lock-free dynamic heap allocators can be incredibly complex with subtle bugs. This implementation needed to be as simple as possible to allow ease of future maintenance.
  • Test & Debug Framework: A broken dynamic heap allocator has the potential to introduce numerous vexing bugs. As tricky as this can be to get right, it warranted the most thorough testing possible. Code coverage is tracked through calls at various points of interest made to an internal code coverage tracker when ODIN_DEBUG_HEAP is enabled.
  • Diagnostics offer programs the ability to view specific information about heap allocation statistics through the runtime.get_local_heap_info() API.
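
To illustrate the remote-free bitmap and slab-masking features sketched above, here is roughly what freeing an allocation from another thread involves. This is an illustrative sketch only: SUPERPAGE_SIZE and HEAP_SLAB_SIZE are assumed constant names, while slab.remote_free, slab.bin_size, slab.data, and INTEGER_BITS do appear in the diff.

	// Constant-time lookup of the owning structures via address masking.
	superpage := cast(^Heap_Superpage)(uintptr(ptr) &~ uintptr(SUPERPAGE_SIZE - 1))
	slab      := cast(^Heap_Slab)(uintptr(ptr) &~ uintptr(HEAP_SLAB_SIZE - 1))

	// Wait-free remote free: a single atomic OR flips the bin's bit in the
	// slab's remote-free bitmap. No lock and no retry loop are involved.
	bin    := int((uintptr(ptr) - slab.data) / uintptr(slab.bin_size))
	sector := bin / INTEGER_BITS
	intrinsics.atomic_or_explicit(&slab.remote_free[sector], uint(1) << uint(bin % INTEGER_BITS), .Release)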

Benchmark Results

These benchmarks should be taken with a grain of salt in general, since I had to search pretty far and wide to find repeatable tests that showed significant run time differences. The run times of most real programs simply do not depend significantly on malloc performance. (Evans, 2006, p. 9)
[...]
Exhaustive benchmarking of allocators is not feasible, and the benchmark results should not be interpreted as definitive in any sense. Allocator performance is highly sensitive to application allocation patterns, and it is possible to construct microbenchmarks that show any of the three allocators tested here in either a favorable or unfavorable light, at the benchmarker’s whim. (p. 11)

Because of the wisdom from the quote above, I won't spend much time here except to say that the included test bench has microbenchmarks written by me for the purpose of making sure that the allocator is at least not egregiously slow in certain made-up scenarios.

If you believe these benchmarks can align with realistic situations, then this allocator is 2-3 times faster than libc malloc, in general use case scenarios (so any allocation less than ~63KiB), on my AMD64 Linux-based system, compiled with -o:aggressive -disable-assert -no-bounds-check -microarch:alderlake.

Any speed gain drops off for allocations above 32KiB, because that is where bin allocations are no longer possible with the default configuration and the allocator has to resort to coalescing entire slabs to fit the requests. I decided to accept this consequence of the design: it's not that much slower than malloc, and I believe that rapid allocation of >=64KiB blocks is a special case rather than the usual case for most programs.

The full test suite can be run with:

odin run tests/heap_allocator/ -debug -define:ODIN_DEBUG_HEAP=true -sanitize-thread -- -vmem-tests -serial-tests -parallel-tests -allocator:feoramalloc

The benchmarks can be run with:

odin run tests/heap_allocator/ -o:aggressive -disable-assert -no-bounds-check -- -serial-benchmarks -parallel-benchmarks -allocator:feoramalloc

The -allocator command line option can be switched to libc to use the old behavior.

Memory Usage

Speed aside, there are points to be aware of with this allocator, particularly in how it uses memory. Unlike benchmark results, these are clear-cut and not as susceptible to application patterns.

For one, due to the nature of slab allocation, any allocation always consumes the full bin of its size rank: if you request 9 bytes, you will in fact consume 16, as that is the next power of two available. This continues for every power of two up to the maximum bin size of 32KiB.

This shouldn't be too surprising at lower sizes, as with a non-slab general purpose allocator, you're almost guaranteed to have some book-keeping somewhere, which would result in an allocation of 8 bytes actually using 16 or 24 bytes, depending on the header.

This begins to break down at higher sizes, however. If you allocate 257 bytes instead of 256, you're going to be placed into a bin of 512 bytes. This may seem wasteful, but there is a reason for it: all allocations of a particular size rank are tightly packed next to each other, which increases cache locality. It's a memory-for-speed tradeoff, in the end.

If the requested alignment is larger than the requested size, the alignment is used as the size instead, up to a maximum of 64 bytes by default. This was one of the design choices made to help eliminate any need for headers. Beyond a size of 64 bytes, all allocations are aligned to at least 64 bytes. Alignment beyond 64 bytes is not supported.
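
Putting those two rules together, the effective bin size can be sketched as follows (a hypothetical helper for illustration, not the PR's actual code):

	// Widen the request to max(size, alignment), then round up to the
	// next power of two. Alignment beyond 64 bytes is unsupported.
	effective_bin_size :: proc(size, alignment: int) -> int {
		assert(size > 0 && alignment <= 64)
		n := max(size, alignment)
		bin := 1
		for bin < n {
			bin <<= 1
		}
		return bin
	}

	// effective_bin_size(9, 1)   -> 16
	// effective_bin_size(257, 1) -> 512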

There is also no convoluted coalescing logic for any allocation below ~63KiB. This was done for the sake of simplicity. Beyond 64KiB, the allocator has to make decisions about which slabs to merge together, which is where memory usage and speed both take a hit.

To allocate 64KiB is to block out up to 128KiB, due to the nature of book-keeping on slab-wide allocations. That may be the weakest point of this allocator, and I'm open to feedback on possible workarounds.

The one upside of over-allocating like this is that if you resize within the same frame of memory that's already been allotted to you, it's virtually a no-op. The allocator has to do a few calculations, and it returns without touching any memory: simple and fast.

Beyond the HUGE_ALLOCATION_THRESHOLD, which is 3/4ths of a Superpage by default (1.5MiB), the allocator distributes chunks of at least a superpage in size directly through virtual memory. This is where memory waste becomes less noticeable, as we're no longer dealing with bins or slabs but whole chunks from virtual memory.

Superpages also may waste up to one slab size of memory (64KiB) for the purposes of maintaining alignment, but this space is optionally used if a heap needs more space for its cache. With the current default values, one of these 64KiB blocks is used per 20 superpages allocated to a single thread. So it's about 3% of all virtual memory allocated this way.

The values dictating the sizes of slabs and maximum bins are all configurable through the ODIN_HEAP_* defines, so if your application really does need to make binned allocations of 64KiB, or if you find speed improvements by using smaller slabs, it's easy to change.
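
For example, one of the defines visible in this diff could be overridden at build time like so (an illustrative invocation using the same -define mechanism as the test commands above; my_program is a placeholder):

odin build my_program -define:ODIN_HEAP_PANIC_ON_DOUBLE_FREE=false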

I chose the default values of 64KiB slabs with a 32KiB max bin size after some microbenchmarking, but it's possible that different values could result in better performance for different scenarios.

To summarize: this allocator does not try to squeeze out every possible byte at every possible juncture, but it does try to be as fast as possible.

There may be a case to be made for the reduction of fragmentation through slab allocation resulting in less actual memory usage at the end of the day versus a coalescing allocator, but that is probably an application-specific benefit and one I have not thoroughly investigated.

Credits

I hope to demonstrate that the design used in this allocator is not exceedingly novel (and thus, not untested) by pointing out the inspirations for each major feature based upon the literature reviewed. Each feature has been documented and in use in various implementations for over two decades now.

  • The design of the Slab allocator was originally detailed by Bonwick (1994) in The Slab Allocator: An Object-Caching Kernel Memory Allocator.
  • An observation made by Wilson et al. (1995) in Dynamic Storage Allocation: A Survey and Critical Review inspired runtime configurable size classes for slabs.
  • Kamp (1998) describes how individual objects can be so aligned as to allow bitmasking their pointers to find their owning structure in Malloc(3) revisited, otherwise known as the PHKmalloc paper. This strategy would also later be used in mimalloc.
  • Using a bitmap to track freed objects also comes from PHKmalloc. Usage of this structure predates any notion of wait-free allocator design, but it works quite well for that purpose. This strategy is contrasted mainly with free lists used in many other designs.
  • Berger et al. (2000) describe a global heap for storing unused superblocks in Hoard: A Scalable Memory Allocator for Multithreaded Applications. This was the earliest documentation of a solution to the problem of heap blowup according to the authors. This idea is the direct inspiration for the superpage orphanage. However, Hoard's specific implementation was not lock-free.
  • Using per-thread heaps also comes from the Hoard paper, where it was originally outlined as the only known solution in the literature to date for the problem of allocator-induced false sharing. Mimalloc would later use thread-local storage to accomplish the very same goal.
  • The benefits of a lock-free allocator were first described by Michael (2004) in Scalable Lock-Free Dynamic Memory Allocation.
  • This allocator may most resemble the Streamflow design described by Schneider et al. (2006), with its segregated heaps, its distinction between local and remote deallocation, and its usage of superpages to improve TLB performance.
  • The idea of using per-slab free bitmaps is partly inspired by free list sharding detailed by Leijen et al. (2019) in Mimalloc: Free List Sharding in Action where per-page free lists are used instead of a large free list per size class. It would seem to be the natural development when using bitmaps over free lists.

The following points are original ideas; original in the sense that they were realized during development and not drawn from any specific paper, not that they are wholly novel and have never been seen before.

  • The strategy of forbidding the freeing of a slab until it has been fully used at least once came about after running a benchmark where a single object was allocated and freed repeatedly. This test demonstrated a significant weak point in an earlier version of the allocator, and this strategy was the solution to keep it from becoming a sore spot for performance.
  • The usage of a single integer counter (dirty_bins) to track which bins are dirty and need zeroing upon re-allocation was an iterative optimization realized after noticing that the allocator naturally uses the bin with the lowest address possible to keep cache locality, by virtue of next_free_sector always being set to the minimum value. An earlier version of the allocator used a bitmap with the same layout of local_free to track dirty bins.
  • I do not recall seeing in any of the papers reviewed any mention of using allocator space that would otherwise be wasted by alignment needs for keeping metadata, but it should hopefully be an obvious enough usage to not qualify as groundbreaking.
  • Related to the previous point, the usage of a size class-keyed hash map to speed up finding available slabs should also be an obvious optimization.
  • I could not find any mention in the literature of the usage of a bitmap to achieve wait-free synchronization through bitwise merging. This may actually be the only novel idea in this design. Most parallel-aware allocators use free lists, and some are lock-free. Of note, StarMalloc, whose paper was published just last year, uses bitmaps instead of free lists, but it retains locks for the purpose of security according to the authors.

Quotes

The following passage inspired runtime configurable slab size classes.

A crude but possibly effective form of coalescing for simple segregated storage (used by Mike Haertel in a fast allocator [GZH93, Vo95], and in several garbage collectors [Wil95]) is to maintain a count of live objects for each page, and notice when a page is entirely empty. If a page is empty, it can be made available for allocating objects in a different size class, preserving the invariant that all objects in a page are of a single size class. (Wilson et al., 1995, p. 37)

This passage encouraged attention to optimizing the heuristics used for the bitmaps used to track free bins.

It may appear that bitmapped allocators are slow, because search times are linear, and to a first approximation this may be true. But notice that if a good heuristic is available to decide which area of the bitmap to search, searching is linear in the size of the area searched, rather than the number of free blocks. The cost of bitmapped allocation may then be proportional to the rate of allocation, rather than the number of free blocks, and may scale better than other indexing schemes. If the associated constants are low enough, bitmapped allocation may do quite well. It may also be valuable in conjunction with other indexing schemes. (Wilson et al., 1995, p. 42)

jemalloc's author, on the rich history of memory allocator design:

On the surface, memory allocation and deallocation appears to be a simple problem that merely requires a bit of bookkeeping in order to keep track of in-use versus available memory. However, decades of research and scores of allocator implementations have failed to produce a clearly superior allocator. (Evans, 2006, p. 1)

References

  1. Jeff Bonwick. (1994). The Slab Allocator: An Object-Caching Kernel Memory Allocator.
  2. Paul R. Wilson, Mark S. Johnstone, Michael Neely, & David Boles. (1995). Dynamic Storage Allocation: A Survey and Critical Review.
  3. Maged M. Michael, & Michael L. Scott. (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.
  4. Poul-Henning Kamp. (1998). Malloc(3) revisited.
  5. Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, Paul R. Wilson. (2000). Hoard: A Scalable Memory Allocator for Multithreaded Applications.
  6. Maged M. Michael. (2004). Scalable Lock-Free Dynamic Memory Allocation.
  7. Thomas Edward Hart. (2005). Comparative Performance of Memory Reclamation Strategies for Lock-Free and Concurrently-Readable Data Structures.
  8. Jason Evans. (2006). A Scalable Concurrent malloc(3) Implementation for FreeBSD.
  9. Scott Schneider, Christos D. Antonopoulos, Dimitrios S. Nikolopoulos. (2006). Scalable Locality-Conscious Multithreaded Memory Allocation.
  10. Martin Thompson, Dave Farley, Michael Barker, Patricia Gee, Andrew Stewart. (2011). Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads.
  11. Daan Leijen, Benjamin Zorn, Leonardo de Moura. (2019). Mimalloc: Free List Sharding in Action.
  12. Antonin Reitz, Aymeric Fromherz, Jonathan Protzenko. (2024). StarMalloc: A Formally Verified, Concurrent, Performant and Security-Oriented Memory Allocator.

Lectures

  1. QCon SF 2010: Martin Thompson & Michael Barker, LMAX "How to Do 100K TPS at Less than 1ms Latency"
  2. C++ and Beyond 2012: Herb Sutter "atomic<> Weapons, the C++11 Memory Model and Modern Hardware"
  3. CppCon 2014: Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades)"
  4. CppCon 2015: Fedor Pikus "Live Lock-Free or Deadlock (Practical Lock-free Programming)"

Design Differences

  1. PHKmalloc is mentioned a few times as an inspiration, and jemalloc was developed as a more parallel-aware replacement for phkmalloc. How does this allocator differ from jemalloc?

Of note, jemalloc uses multiple arenas to reduce the issue of allocator-induced false sharing. However, those arenas are shared between active threads. The strategy of giving exclusive access to an arena on a per-thread basis is more similar to Hoard than jemalloc.

  2. How does this allocator differ from Hoard?

With regard to what is called the global heap in the Hoard paper, this allocator has the superpage orphanage. They both fulfill similar duties as far as memory reuse goes. However, in Hoard, superblocks may be moved from per-processor heaps to the global heap if they cross an emptiness threshold.

In my design, this ownership transfer mechanism is forgone in favor of an overall simplified synchronization process. Superpages do not change ownership until they are either completely empty and ready to be freed or the thread cleanly exits. For a remote thread to be able to decouple a superpage belonging to another thread would require more complicated logic behind the scenes and likely slow down regular single-threaded usage with atomic operations.

This design can result in an apparent memory leak: if thread A allocates some number of bytes and thread B frees all of the allocations, but thread A never allocates again and does not exit, the memory is not reclaimed, since either event would trigger the merging of A's remote frees and the subsequent freeing of its empty superpages.

This is one behavior to be aware of when writing concurrent programs that use this allocator in producer/consumer relationships. In practice, however, it should be unusual for a thread to accumulate a significant amount of memory, hand it all off to another thread to free, and never revisit its own heap for the duration of the program.

The Name

Most allocators are either named after the author or have a fancy title. PHKmalloc represents the author's initials. Mimalloc presumably means Microsoft Malloc.

If I had to give this allocator design a name, I might call it "the lock-free bitmap slab allocator" after its key features. For the purpose of differentiating this specific implementation of a heap allocator from any others, I think "Feoramalloc" is suitable.

I've used feoramalloc in the test bench to differentiate it from libc.

Final Thoughts

In closing, I hope this allocator can improve the efficiency of programs written in Odin while standing as an example for learning about low-level concepts such as lock-free programming and heap allocators.

Obviously, it won't make all programs magically faster, and if you're already using a custom allocator, then you know your problem space better than any general-purpose allocator could guess.

I think this is a significant step towards having an independent runtime. We can get consistent behavior across all platforms too, as well as the ability to learn very specific information about the heap through the included diagnostics.

This PR is a draft for now, while I hammer out the final details and receive feedback.

Help Requests

I mainly need help with non-Linux/POSIX virtual memory access. I can test this allocator against FreeBSD and NetBSD, but I do not have a Windows or Darwin machine to verify the system-specific code there.

Windows passed the CI tests, so I'm hopeful that it works there. The Darwin tests pass on Intel, but the core test suite stalls afterwards, so something strange is going on there. Linux and FreeBSD are working.

While testing, I hit an interesting snag with NetBSD. Its mmap syscall requires 7 arguments, but we only support 6, and I haven't been able to figure out the calling convention for the seventh: is it passed in another register, pushed onto the stack, or something else? I could use some help there.

I don't have a plan for what to do on systems that do not expose virtual memory access, since I don't have any experience with them. I'm assuming Orca and WASM do not expose a virtual memory subsystem akin to mmap or VirtualAlloc. I only recently started tinkering with wasm after finding the official example. These are otherwise foreign fields to me, and I'm open to feedback. We could perhaps have a malloc-based fallback allocator in that case.

The only strong requirement the allocator has, regarding backing allocation, is some ability to request alignment or dispose of unnecessary pages. If we can do either of those, we're solid.

It would be great to hear about how this allocator impacts real-world application usage, too.

I'm also interested to hear how this could impact -no-crt. I noticed a recent commit about how Linux requires the C runtime to initialize thread-local storage. I wasn't aware of that.

API

I'm also looking to hear if anyone has better ideas about organization or the API. This allocator lived in a package of its own during all of my testing, but I had to merge it into base in order to avoid cyclic import issues while making it the default allocator. This resulted in a lot of heap_ and HEAP_ prefixing.

The same goes for the virtual_memory_* and get_current_thread_id procs added to base:runtime. If anyone has a feel for how that could be improved, or if they're good as-is, I'd like to hear.

Memory Order Consume

I'm uncertain if the Consume memory ordering really means consume. If you check under Atomic_Memory_Order in base:intrinsics, it has a comment beside Consume that says Monotonic, which I presume corresponds to this.

Based on the documentation for LLVM's Acquire memory ordering, this is the one that actually corresponds to memory_order_consume.

I'm leaning towards thinking my usage of Consume should actually be replaced with Acquire, based on this, but I've left the memory order as-is for now until someone else can review and comment about it. It's no problem to use a stronger order, but if we can get away with a weaker one and preserve the semantics, all the better.
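
Concretely, the candidate change is a one-word swap at each affected load, e.g. (illustrative, not a specific line from this PR):

	// Was: intrinsics.atomic_load_explicit(&superpage.next, .Consume)
	next := intrinsics.atomic_load_explicit(&superpage.next, .Acquire)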

I base most of my understanding of memory ordering on Herb Sutter's talks, referenced above, which I highly recommend to anyone interested in this subject.

- Add the test bench for the allocator
- Move old allocator code to the test bench
- Fix `heap_resize` usage in `os2/env_linux.odin` to fit the new API requiring `old_size`
Comment on lines +15 to +16
size, alignment: int,
old_memory: rawptr,
Contributor

Indentation here is a bit weird

HEAP_MAX_EMPTY_ORPHANED_SUPERPAGES :: #config(ODIN_HEAP_MAX_EMPTY_ORPHANED_SUPERPAGES, 3)
HEAP_SUPERPAGE_CACHE_RATIO :: #config(ODIN_HEAP_SUPERPAGE_CACHE_RATIO, 20)
HEAP_PANIC_ON_DOUBLE_FREE :: #config(ODIN_HEAP_PANIC_ON_DOUBLE_FREE, true)
HEAP_PANIC_ON_FREE_NIL :: #config(ODIN_HEAP_PANIC_ON_FREE_NIL, false)
Member

This would be an issue in practice if someone bypassed the allocator wrapper procedures, as free will just early-return if it sees nil.
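
(For reference, the early-return shape being described is roughly the following; this is a sketch, not the exact base:runtime code.)

free :: proc(ptr: rawptr, allocator := context.allocator) {
	if ptr == nil {
		return // nil never reaches the heap unless the wrapper is bypassed
	}
	// ... dispatch to the allocator's free mode ...
}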

Contributor Author

Aha, I see what you mean now. I forgot about base:runtime.mem_free. Would you suggest removing the check entirely then?

Contributor

@graphitemaster left a comment

First of all, this is absolutely amazing work. You've done an incredible job here.

I've not done a full review, since there's certainly more to check with real-world benches and because I'd have to actually run the code, but from what I've gleaned reading the changes on GitHub, I have some comments.

I also want to ask for some real-world graphs from running the myriad of heap bench tools out there, and some of the harder stress tests for memory allocators that stress their threading characteristics.

Of note, I think it should be added that while this allocator is robust in the nominal sense, it's not at all "security hardened" in the sense that a buffer overflow can't give an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening stand-point because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and is not using side-band metadata.

I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement which is a "debouncing" filter on size classes. Essentially some workloads have a lot of allocations that go like "big size, small size, big size, small size, big size, small size, ..." classes (but same bins after overflow) jumping back and forth and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.

compact_heap_orphanage :: proc "contextless" () {
// First, try to empty the orphanage so that we can evaluate each superpage.
buffer: [128]^Heap_Superpage
for i := 0; i < len(buffer); i += 1 {
Contributor

I would use a range-based for loop here, for i in 0..<128, or use for &b in buffer.
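
That is, either of:

for i in 0 ..< len(buffer) {
	// index-based body, unchanged
}

for &b in buffer {
	// element-based body; `b` is a mutable reference into `buffer`
}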

Comment on lines +493 to +498
} else if slab.bin_size > 0 {
i += 1
} else {
i += 1
continue
}
Contributor

@graphitemaster Jan 24, 2025

Micro-opt: how does

} else {
  i += 1
  if slab.bin_size <= 0 do continue
}

fare here compared to this?

// returned to the operating system, then given back to the process in a
// different thread.
//
// It is otherwise impossible for us to race on newly allocated memory.
Contributor

Yes and no. The memory model of most OSes when it comes to allocating virtual memory is that only the calling thread is made available (or notified) of the new memory in caches. It's possible for the OS to return a previous virtual address that was once valid in another thread because unmap also only flushes page table entries in the thread that unmapped memory. Consider this

// thread A
ptr = mmap(...)
dothing(ptr)
pass_to_b(ptr)
pass_to_c(ptr)

// thread B
ptr = recv_from_a()
dothing(ptr)
// interrupted here
ptr = mmap()
dothing(ptr)

// thread c
ptr = recv_from_a()
// ...
unmap(ptr)

// resume B (the mmap call)

Consider that the OS has scheduled the threads so that A runs first to completion, then B runs and gets interrupted, then C runs to completion, then B resumes and makes its mmap call. In this specific case it's possible for the mmap in B to return the same virtual memory that was unmapped in C, but the contents in the caches of A and C are incorrect and no longer coherent. A thread fence at minimum is needed here to ensure A and C are synchronized with the new allocation.

Contributor Author

I think I understand now. It would then be necessary for a full atomic_thread_fence(.Seq_Cst) to be placed after every invocation that returns virtual memory from the OS, because we could receive memory that would otherwise have an invalid cache state due to prior use, right?
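
(Sketched below; _allocate_virtual_memory stands in for whichever platform-specific routine returns the memory.)

ptr := _allocate_virtual_memory(size)
// Full fence so any stale cache state from prior use of this address
// range is synchronized before the memory is first accessed.
intrinsics.atomic_thread_fence(.Seq_Cst)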

This particular issue was discussed on Discord; the hazard illustrated here actually comes from a misuse of madvise(MADV_FREE) (not unmap) on particular Linux kernels and weakly ordered machine architectures. There's no harm in adding a thread fence here, to be safe, but I think it's unnecessary.

heap_slab_clear_data(slab)
}
slab.bin_size = size
slab.is_full = true
Contributor

This does require an atomic store for coherency reasons: it's valid for this write not to be flushed from cache to backing memory until after ptr is returned and used by other threads on E X O T I C A R C H S

Comment on lines +696 to +713
bookkeeping_bin_cost := max(1, int(size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint) + 2 * HEAP_MAX_ALIGNMENT) / rounded_size)
bins -= bookkeeping_bin_cost
sectors = bins / INTEGER_BITS
sectors += 0 if bins % INTEGER_BITS == 0 else 1

slab.sectors = sectors
slab.free_bins = bins
slab.max_bins = bins

base_alignment := uintptr(min(HEAP_MAX_ALIGNMENT, rounded_size))

pointer_padding := (uintptr(base_alignment) - (uintptr(slab) + size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint))) & uintptr(base_alignment - 1)

// These bitmaps are placed at the end of the struct, one after the other.
slab.local_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab))
slab.remote_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab) + uintptr(sectors) * size_of(uint))
// This pointer is specifically aligned.
slab.data = (uintptr(slab) + size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint) + pointer_padding)
Contributor

The repeated error-prone arithmetic should be hoisted out into some locals and reused. Specifically the size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint) part.
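
For instance (a sketch of the suggested hoisting; bitmaps_size and metadata_end are hypothetical names):

bitmaps_size := HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint)
metadata_end := uintptr(slab) + size_of(Heap_Slab) + bitmaps_size

pointer_padding := (base_alignment - metadata_end) & (base_alignment - 1)

slab.local_free  = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab))
slab.remote_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab) + uintptr(sectors) * size_of(uint))
slab.data        = metadata_end + pointer_padding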

if superpage == local_heap_tail {
assert_contextless(superpage.next == nil, "The heap allocator's tail superpage has a next link.")
assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
local_heap_tail = superpage.prev
Contributor

Likely need an atomic load of superpage.prev here as well

assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
local_heap_tail = superpage.prev
// We never unlink all superpages, so no need to check validity here.
superpage.prev.next = nil
Contributor

LLVM might optimize this code to local_heap_tail.next = nil because of the non-atomic load of superpage.prev above (and the lack of a barrier here), even though another thread can replace superpage.prev, so this code would actually be assigning nil to that other thread's superpage.prev and not the one loaded into local_heap_tail. This area seems a bit problematic, actually: do both of these operations (reading prev and assigning next to nil) need to happen atomically?

Comment on lines +907 to +912
if superpage.prev != nil {
superpage.prev.next = superpage.next
}
if superpage.next != nil {
superpage.next.prev = superpage.prev
}
Contributor

Ditto here, LLVM would optimize this to

prev := superpage.prev
next := superpage.next
if prev != nil do prev.next = next
if next != nil do next.prev = prev

Would this transformation be appropriate?

Comment on lines +1203 to +1223
for {
// NOTE: `next` is accessed atomically when pushing or popping from the
// orphanage, because this field must synchronize with other threads at
// this point.
//
// This has to do mainly with swinging the head's linking pointer.
//
// Beyond this point, the thread which owns the superpage will be the
// only one to read `next`, hence why it is not read atomically
// anywhere else.
intrinsics.atomic_store_explicit(&superpage.next, cast(^Heap_Superpage)uintptr(old_head.pointer), .Release)
new_head: Tagged_Pointer = ---
new_head.pointer = i64(uintptr(superpage))
new_head.version = old_head.version + 1

old_head_, swapped := intrinsics.atomic_compare_exchange_weak_explicit(cast(^u64)&heap_orphanage, transmute(u64)old_head, transmute(u64)new_head, .Acq_Rel, .Relaxed)
if swapped {
break
}
old_head = transmute(Tagged_Pointer)old_head_
}
Contributor

It may be of note that while the heap allocator is lock-free, it is not wait-free, as this loop potentially requires unbounded waiting.

Contributor Author

Correct. I've only said that the freeing operation (specifically freeing a single allocation which involves an atomic bit flip) is wait-free. The allocator as a whole is only guaranteed to be lock-free.

Comment on lines +1438 to +1439
intrinsics.atomic_store_explicit(&cache.superpages_with_remote_frees[i], nil, .Release)
intrinsics.atomic_store_explicit(&superpage.remote_free_set, false, .Release)
Contributor

With Release semantics it's valid for these to be reordered so that remote_free_set is set to false before the pointer is set to nil. If the thread is interrupted in between the two stores, I'm curious what would happen if the allocator sees superpages_with_remote_frees[i] != nil for a superpage that was already removed as a result of remote_free_set = false.

@gingerBill
Member

Can I just say, I fucking love your work! It's always a surprise to see it and always a pleasure to read.

}

_resize_virtual_memory :: proc "contextless" (ptr: rawptr, old_size: int, new_size: int, alignment: int) -> rawptr {
// NOTE(Feoramund): mach_vm_remap does not permit resizing, as far as I understand it.
Collaborator

While this is true, you can do this more efficiently with these steps:

  1. mach_vm_allocate a bigger region
  2. mach_vm_remap from the smaller region into the new bigger region (which does the copying, and also maps the previous step's allocation)
  3. mach_vm_deallocate the old region

Collaborator

@laytan Jan 24, 2025

Also, mach_task_self() is implemented as #define mach_task_self() mach_task_self_.

Maybe you can use that global symbol to see a minor improvement as Odin does not optimise that function call.

EDIT: it is weird that mach_task_self is also a function symbol though, as a macro that wouldn't work 🤔
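
(A sketch of what binding that global directly could look like; the framework path and the u32 type for the port are assumptions:)

foreign import libSystem "system:System.framework"

foreign libSystem {
	// The kernel port global behind the mach_task_self() macro.
	mach_task_self_: u32
}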

@Feoramund
Contributor Author

Of note, I think it should be added that while this allocator is robust in the nominal sense, it's not at all "security hardened" in the sense that a buffer overflow can't give an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening stand-point because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and is not using side-band metadata.

If preventing buffer overflows is a priority, then the StarMalloc paper is an excellent work on making a heap allocator resilient to this class of bugs, full of ideas I had not encountered in any other paper. They use out-of-band metadata as you suggest, as well as canaries to detect overflows and guard pages. When setting out on implementing this allocator, I presumed that among the audience Odin targets, speed would be the higher priority.

I put my focus on making a fast lock-free parallel allocator, because I figured this was the most complex design encompassing the most general-purpose use; anything else would be of lesser complexity and thus easier for someone with a specific problem to implement a solution for. For example, it's easy to write a single-threaded bump allocator for a specific size class if that affords greater performance, as it's very specific, has strong design constraints, and is simple.

I think if a StarMalloc-like allocator is preferred, then that might be easier to implement, since it uses mutexes for synchronization.

I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement which is a "debouncing" filter on size classes. Essentially some workloads have a lot of allocations that go like "big size, small size, big size, small size, big size, small size, ..." classes (but same bins after overflow) jumping back and forth and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.

I can't recall encountering the term "debouncing" filter, and I did a quick search through a few of my PDFs, but I can say that if you allocate, say, 4KiB, then 8 bytes, and do that repeatedly back and forth, all of the 4KiB allocations will be adjacent, and all of the 8 byte allocations will be adjacent, up to a certain number.

So with 8 byte allocations, with the current default config, you get 7907 bins to play with. All of the 0-8 byte allocations will be placed into one of those bins, linearly from left to right. When the slab runs out, the allocator will try to find a new slab (which will also be subdivided into 7907 bins) to place future allocations into of that same size.

Then with the 4KiB allocations, you'll get a slab that has only 15 bins (because the slab is 64KiB and we still need to keep some book-keeping data, so it's subdivided into 15 slots), and each 4KiB allocation will go into that slab until it runs out of space and a new slab must be found.

The allocator doesn't try to get a new slab for a bin size rank if it already has one, so it shouldn't be fragmenting in that way either.

All of the available slabs with open bins are kept in the heap's cache for quick access, each slab saves an index (next_free_sector) of where it can find its first free bin, and every superpage that has free slabs is also cached, so there shouldn't be much looping going on to find an open spot.
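
For reference, the arithmetic behind those figures (a 64KiB slab is 65,536 bytes): 65,536 / 8 = 8,192 slots, of which 8,192 - 7,907 = 285 go to the slab header and free bitmaps; 65,536 / 4,096 = 16 slots, of which one absorbs the bookkeeping, leaving 15.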

Hopefully this explanation clears up the allocation pattern for how this works.

@laytan
Collaborator

laytan commented Jan 24, 2025

To give some insight about wasm:

WASM's memory model is very simple: a page is 64KiB, and there is only an API to grow (request more pages); there is no freeing or anything like that. So if you want 128KiB, you call it with 2, for 2 pages, and you get those back. AFAIK every page you request is next to the previously requested page. You can look at the wasm_allocator.odin in runtime, it's a pretty small and basic allocator I put in based on the emscripten allocator. I am not sure if you can adapt this into this allocator, probably not. But because we already have a simple native allocator for wasm it is not entirely necessary either.

For Orca we do want to keep calling malloc, because that calls into the Orca runtime, which is always bundled in; otherwise we would have an allocator on top of their allocator, which doesn't make much sense. It's also in the interest of bundle size, which is often a bigger factor on wasm.

@laytan
Collaborator

laytan commented Jan 24, 2025

Also, I can debug the macos failures this weekend if nobody else got to it.

@Feoramund
Contributor Author

You can look at the wasm_allocator.odin in runtime, it's a pretty small and basic allocator I put in based on the emscripten allocator. I am not sure if you can adapt this into this allocator, probably not. But because we already have a simple native allocator for wasm it is not entirely necessary either.

I see no reason to displace a perfectly good allocator that's been tuned for the platform.

For Orca we do want to keep calling malloc, because that calls into the Orca runtime, which is always bundled in; otherwise we would have an allocator on top of their allocator, which doesn't make much sense. It's also in the interest of bundle size, which is often a bigger factor on wasm.

I can see to making an exception for Orca (and possibly other platforms), so that it'll be easy to have a malloc fallback.

@laytan
Collaborator

laytan commented Jan 26, 2025

Also, I can debug the macos failures this weekend if nobody else got to it.

So, the segfault on ARM macOS is because _allocate_virtual_memory_superpage is returning nil; the mach_vm_map call inside it is returning 4, for invalid argument. I think superpages aren't supported on ARM macOS.

MEMORY_OBJECT_NULL :: 0
VM_PROT_READ :: 0x01
VM_PROT_WRITE :: 0x02
VM_INHERIT_SHARE :: 0
Collaborator

@laytan Jan 26, 2025

I see you use MAP_PRIVATE on other targets, but use VM_INHERIT_SHARE here. The equivalent to MAP_PRIVATE would be VM_INHERIT_COPY afaict.

Collaborator

This could be why the Intel CI is failing once threading is involved, but I can't confirm because I don't have an Intel machine.
