Add native lock-free dynamic heap allocator #4749
Conversation
- Add the test bench for the allocator
- Move old allocator code to the test bench
- Fix `heap_resize` usage in `os2/env_linux.odin` to fit new API requiring `old_size`
size, alignment: int,
old_memory: rawptr,
Indentation here is a bit weird
HEAP_MAX_EMPTY_ORPHANED_SUPERPAGES :: #config(ODIN_HEAP_MAX_EMPTY_ORPHANED_SUPERPAGES, 3)
HEAP_SUPERPAGE_CACHE_RATIO :: #config(ODIN_HEAP_SUPERPAGE_CACHE_RATIO, 20)
HEAP_PANIC_ON_DOUBLE_FREE :: #config(ODIN_HEAP_PANIC_ON_DOUBLE_FREE, true)
HEAP_PANIC_ON_FREE_NIL :: #config(ODIN_HEAP_PANIC_ON_FREE_NIL, false)
This would only be an issue in practice if someone bypassed the allocator wrapper procedures, as `free` will just early return if it sees `nil`.
Aha, I see what you mean now. I forgot about `base:runtime.mem_free`. Would you suggest removing the check entirely then?
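For illustration, a minimal sketch (hypothetical names, not the PR's actual procedures) of the wrapper behaviour being discussed:

```odin
package example

HEAP_PANIC_ON_FREE_NIL :: #config(ODIN_HEAP_PANIC_ON_FREE_NIL, false)

// Hypothetical sketch: the public wrapper returns early on nil, so a
// heap-level nil check can only ever fire when the wrapper is bypassed.
heap_free :: proc(ptr: rawptr) {
	when HEAP_PANIC_ON_FREE_NIL {
		assert(ptr != nil, "attempted to free a nil pointer")
	}
	// ... actual deallocation would happen here ...
}

wrapper_free :: proc(ptr: rawptr) {
	if ptr == nil {
		return // early return: heap_free never sees nil through this path
	}
	heap_free(ptr)
}

main :: proc() {
	wrapper_free(nil) // no-op
}
```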
First of all, this is absolutely amazing work. You've done an incredible job here.
I've not done a full review, since there's certainly more to check with real-world benches and I'd have to actually run the code, but from what I've gleaned reading the changes on GitHub, I have some comments.
I also want to ask for some real-world graphs from running the myriad of heap benchmark tools out there, as well as some of the harder stress tests for memory allocators that exercise their threading characteristics.
Of note, I think it should be added that while this allocator is robust in the nominal sense, it is not at all "security hardened" in the sense of preventing a buffer overflow from giving an attacker access to important heap data structures to cause havoc. It's actually quite unsafe from a hardening standpoint, because the heap data structure is still stored and maintained inline with regular allocations (header/footer, what have you) and does not use side-band metadata.
I'd be curious how binning performs in practice too. This allocator lacks a useful optimization most allocators implement, which is a "debouncing" filter on size classes. Essentially, some workloads have a lot of allocations that alternate like "big size, small size, big size, small size, ..." between size classes (but the same bins after overflow), jumping back and forth, and eventually you end up with excessive internal fragmentation and scanning loops that degrade performance.
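To make that pattern concrete, a rough sketch of such an alternating workload (arbitrary sizes and counts, intended only as a starting point for a stress test):

```odin
package example

import "core:fmt"

main :: proc() {
	big:   [dynamic][]byte
	small: [dynamic][]byte
	defer {
		for b in big   { delete(b) }
		for s in small { delete(s) }
		delete(big)
		delete(small)
	}

	// Alternate between a large and a small size class, back and forth.
	for _ in 0..<10_000 {
		append(&big,   make([]byte, 4096)) // lands in a large bin/slab
		append(&small, make([]byte, 8))    // lands in a small bin/slab
	}
	fmt.println("allocated", len(big) + len(small), "blocks")
}
```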
compact_heap_orphanage :: proc "contextless" () {
	// First, try to empty the orphanage so that we can evaluate each superpage.
	buffer: [128]^Heap_Superpage
	for i := 0; i < len(buffer); i += 1 {
I would use a range-based for loop here, `for i in 0..<128`, or use `for &b in buffer`.
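For illustration, both suggested forms on a plain array (a standalone example, not the allocator's types):

```odin
package example

import "core:fmt"

main :: proc() {
	buffer: [128]int

	// Range over the indices, equivalent to `for i := 0; i < len(buffer); i += 1`:
	for i in 0..<len(buffer) {
		buffer[i] = i
	}

	// Or iterate by reference when the index itself is not needed:
	for &b in buffer {
		b += 1
	}

	fmt.println(buffer[0], buffer[127]) // 1 128
}
```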
} else if slab.bin_size > 0 {
	i += 1
} else {
	i += 1
	continue
}
Micro-opt: how does

} else {
	i += 1
	if slab.bin_size <= 0 do continue
}

fare here compared to this?
// returned to the operating system, then given back to the process in a
// different thread.
//
// It is otherwise impossible for us to race on newly allocated memory.
Yes and no. The memory model of most OSes, when it comes to allocating virtual memory, is that only the calling thread has the new memory made visible (or notified) in its caches. It's possible for the OS to return a previous virtual address that was once valid in another thread, because unmap also only flushes page table entries in the thread that unmapped the memory. Consider this:
// thread A
ptr = mmap(...)
dothing(ptr)
pass_to_b(ptr)
pass_to_c(ptr)
// thread B
ptr = recv_from_a()
dothing(ptr)
// interrupted here
ptr = mmap()
dothing(ptr)
// thread c
ptr = recv_from_a()
// ...
unmap(ptr)
// resume B (the mmap call)
Consider that the OS has scheduled the threads so A runs first to completion here, then B runs, gets interrupted, then C runs to completion, then B resumes and does an mmap
call. In this specific case it's possible for the mmap
in B to return the same virtual memory that was unmap
in C, but the contents in caches of A and C are incorrect and no longer coherent. A thread fence at minimum is needed here to ensure A and C are synchronized with the new allocation.
I think I understand now. It would then be necessary for a full `atomic_thread_fence(.Seq_Cst)` to be placed after every invocation that returns virtual memory from the OS, because we could receive memory that would otherwise have an invalid cache state due to prior use, right?
This particular issue was discussed in Discord, and the hazard illustrated here actually comes from a misuse of `madvise(MADV_FREE)` (not unmap) on particular Linux kernels and weakly-ordered machine architectures. There's no harm in adding a thread fence here, to be safe, but I think it's unnecessary.
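For completeness, a minimal sketch of where such a fence would sit if it were added, using a hypothetical wrapper name rather than the PR's actual procedure:

```odin
package example

import "base:intrinsics"

// Hypothetical stand-in for the platform call (mmap, VirtualAlloc, ...);
// not the PR's actual procedure.
request_virtual_memory :: proc "contextless" (size: int) -> rawptr {
	// Platform-specific request would go here.
	return nil
}

allocate_block :: proc "contextless" (size: int) -> rawptr {
	ptr := request_virtual_memory(size)
	// The conservative option discussed above: a full fence right after fresh
	// virtual memory is received from the OS, before the pointer is shared.
	intrinsics.atomic_thread_fence(.Seq_Cst)
	return ptr
}

main :: proc() {
	_ = allocate_block(64 * 1024)
}
```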
	heap_slab_clear_data(slab)
}
slab.bin_size = size
slab.is_full = true
This does require an atomic store for coherency reasons; it's valid for this write not to be flushed from cache to backing memory until after `ptr` is returned and used by other threads on E X O T I C A R C H S.
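A minimal sketch of what release stores on those two fields could look like, using a simplified stand-in struct rather than the PR's actual `Heap_Slab`:

```odin
package example

import "base:intrinsics"

// Simplified stand-in; not the PR's actual Heap_Slab.
Slab :: struct {
	bin_size: int,
	is_full:  bool,
}

// Release stores so that a thread which later acquires a pointer into this
// slab observes bin_size/is_full as initialized before it touches the memory.
publish_slab :: proc "contextless" (slab: ^Slab, size: int) {
	intrinsics.atomic_store_explicit(&slab.bin_size, size, .Release)
	intrinsics.atomic_store_explicit(&slab.is_full, true, .Release)
}

main :: proc() {
	s: Slab
	publish_slab(&s, 64)
	assert(s.bin_size == 64 && s.is_full)
}
```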
bookkeeping_bin_cost := max(1, int(size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint) + 2 * HEAP_MAX_ALIGNMENT) / rounded_size)
bins -= bookkeeping_bin_cost
sectors = bins / INTEGER_BITS
sectors += 0 if bins % INTEGER_BITS == 0 else 1

slab.sectors = sectors
slab.free_bins = bins
slab.max_bins = bins

base_alignment := uintptr(min(HEAP_MAX_ALIGNMENT, rounded_size))

pointer_padding := (uintptr(base_alignment) - (uintptr(slab) + size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint))) & uintptr(base_alignment - 1)

// These bitmaps are placed at the end of the struct, one after the other.
slab.local_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab))
slab.remote_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab) + uintptr(sectors) * size_of(uint))
// This pointer is specifically aligned.
slab.data = (uintptr(slab) + size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint) + pointer_padding)
The repeated, error-prone arithmetic should be hoisted out into some locals and reused, specifically the `size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint)` part.
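Something along these lines perhaps, reusing the names from the diff above (an untested sketch, not a drop-in patch):

```odin
// Untested sketch, reusing the names from the diff above:
bookkeeping_size := size_of(Heap_Slab) + HEAP_SECTOR_TYPES * uintptr(sectors) * size_of(uint)

pointer_padding := (base_alignment - (uintptr(slab) + bookkeeping_size)) & (base_alignment - 1)

// These bitmaps are placed at the end of the struct, one after the other.
slab.local_free  = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab))
slab.remote_free = cast([^]uint)(uintptr(slab) + size_of(Heap_Slab) + uintptr(sectors) * size_of(uint))
// This pointer is specifically aligned.
slab.data = uintptr(slab) + bookkeeping_size + pointer_padding
```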
if superpage == local_heap_tail {
	assert_contextless(superpage.next == nil, "The heap allocator's tail superpage has a next link.")
	assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
	local_heap_tail = superpage.prev
Likely need an atomic load of `superpage.prev` here as well.
	assert_contextless(superpage.prev != nil, "The heap allocator's tail superpage has no previous link.")
	local_heap_tail = superpage.prev
	// We never unlink all superpages, so no need to check validity here.
	superpage.prev.next = nil
LLVM might optimize this code to `local_heap_tail.next = nil` because of the non-atomic load of `superpage.prev` above (and the lack of a barrier), even though another thread can replace `superpage.prev`, so this code could actually end up assigning `nil` to that other thread's `superpage.prev` and not the one loaded into `local_heap_tail`. This area seems a bit problematic, actually; do both of these operations (reading `prev` and assigning `next` to `nil`) need to happen atomically?
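One hedged option for the re-load issue specifically is to load `prev` exactly once, atomically, and only use the local copy afterwards; this sketch does not by itself answer whether the whole unlink needs to be a single atomic operation:

```odin
prev := intrinsics.atomic_load_explicit(&superpage.prev, .Acquire)
local_heap_tail = prev
intrinsics.atomic_store_explicit(&prev.next, nil, .Release)
```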
if superpage.prev != nil {
	superpage.prev.next = superpage.next
}
if superpage.next != nil {
	superpage.next.prev = superpage.prev
}
Ditto here, LLVM would optimize this to
prev := superpage.prev
next := superpage.next
if prev != nil do prev.next = next
if next != nil do next.prev = prev
Would this transformation be appropriate?
for {
	// NOTE: `next` is accessed atomically when pushing or popping from the
	// orphanage, because this field must synchronize with other threads at
	// this point.
	//
	// This has to do mainly with swinging the head's linking pointer.
	//
	// Beyond this point, the thread which owns the superpage will be the
	// only one to read `next`, hence why it is not read atomically
	// anywhere else.
	intrinsics.atomic_store_explicit(&superpage.next, cast(^Heap_Superpage)uintptr(old_head.pointer), .Release)
	new_head: Tagged_Pointer = ---
	new_head.pointer = i64(uintptr(superpage))
	new_head.version = old_head.version + 1

	old_head_, swapped := intrinsics.atomic_compare_exchange_weak_explicit(cast(^u64)&heap_orphanage, transmute(u64)old_head, transmute(u64)new_head, .Acq_Rel, .Relaxed)
	if swapped {
		break
	}
	old_head = transmute(Tagged_Pointer)old_head_
}
May be of note: while the heap allocator is lock-free, it is not wait-free, as this loop can require a potentially unbounded number of retries.
Correct. I've only said that the freeing operation (specifically freeing a single allocation which involves an atomic bit flip) is wait-free. The allocator as a whole is only guaranteed to be lock-free.
intrinsics.atomic_store_explicit(&cache.superpages_with_remote_frees[i], nil, .Release)
intrinsics.atomic_store_explicit(&superpage.remote_free_set, false, .Release)
With `Release` semantics it's valid for these two stores to be reordered so that `remote_free_set` is set to `false` before the pointer is set to `nil`. In that case, if the thread is interrupted in between the two stores, I'm curious what would happen if the allocator sees `superpages_with_remote_frees[i] != nil` for a superpage that was already removed as a result of `remote_free_set = false`.
Can I just say, I fucking love your work! It's always a surprise to see it and always a pleasure to read.
}

_resize_virtual_memory :: proc "contextless" (ptr: rawptr, old_size: int, new_size: int, alignment: int) -> rawptr {
	// NOTE(Feoramund): mach_vm_remap does not permit resizing, as far as I understand it.
While this is true, you can do this more efficiently with these steps:
1. `mach_vm_allocate` a bigger region
2. `mach_vm_remap` from the smaller region into the new bigger region (which does the copying, and also maps the previous step's allocation)
3. `mach_vm_deallocate` the old region
Also, `mach_task_self()` is implemented as `#define mach_task_self() mach_task_self_`. Maybe you can use that global symbol to see a minor improvement, as Odin does not optimise that function call.
EDIT: it is weird that `mach_task_self` is also a function symbol though, as a macro that wouldn't work 🤔
If preventing buffer overflows is a priority, then the StarMalloc paper is an excellent work on making a heap allocator resilient to this class of bugs, full of ideas which I had not encountered in any other paper. They use out-of-band metadata as you suggest, as well as canaries to detect overflows and guard pages.

When setting out on implementing this allocator initially, I presumed that among the audience Odin targets, speed would be the higher priority. I put my focus on making a fast lock-free parallel allocator, because I figured that this was the most complex design that encompasses the most general-purpose use; anything else would be of lesser complexity and thus easy for someone with a specific problem to implement a solution for. I.e. it's easy for someone to write a single-threaded bump allocator for a specific size class if that affords them greater performance, as it's very specific, has strong design constraints, and is simple.

I think if a StarMalloc-like allocator is preferred, then that might be easier to implement, since it uses mutexes for synchronization.
I can't recall encountering the term "debouncing" filter, and I did a quick search through a few of my PDFs, but I can say that if you allocate, say, 4KiB, then 8 bytes, and do that repeatedly back and forth, all of the 4KiB allocations will be adjacent, and all of the 8-byte allocations will be adjacent, up to a certain number.

So with 8-byte allocations, under the current default config, you get 7907 bins to play with. All of the 0-8 byte allocations will be placed into one of those bins, linearly from left to right. When the slab runs out, the allocator will try to find a new slab (which will also be subdivided into 7907 bins) to place future allocations of that same size into. Then with the 4KiB allocations, you'll get a slab that has only 15 bins (because the slab is 64KiB and we still need to keep some book-keeping data, therefore it's subdivided down to 15 slots), and each 4KiB allocation will go into that, until it runs out of space and needs to find a new slab.

The allocator doesn't try to get a new slab for a bin size rank if it already has one, so it shouldn't be fragmenting in that way either. All of the available slabs with open bins are kept in the heap's cache for quick access, and each slab has an index saved.

Hopefully this explanation clears up the allocation pattern for how this works.
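As a quick sanity check on those numbers (assuming the default 64KiB slab; the per-size bookkeeping cost is what trims the nominal counts down):

```odin
package example

import "core:fmt"

// Back-of-the-envelope check of the bin counts quoted above, assuming the
// default 64KiB slab size.
main :: proc() {
	SLAB_SIZE :: 64 * 1024
	fmt.println(SLAB_SIZE / 8)    // 8192 nominal 8-byte bins (7907 after bookkeeping)
	fmt.println(SLAB_SIZE / 4096) // 16 nominal 4KiB slots (15 after bookkeeping)
}
```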
To give some insight about wasm: WASM's memory model is very simple. A page is 64KiB, and there is only an API to grow (request more pages); there is no freeing or anything like that. So if you want 128KiB, you call it with 2, for 2 pages, and you get those back. AFAIK every page you request is next to the previously requested page. You can look at the […]

For Orca we do want to keep calling […]
Also, I can debug the macOS failures this weekend if nobody else gets to it.
I see no reason to displace a perfectly good allocator that's been tuned for the platform.
I can see to making an exception for Orca (and possibly other platforms), so that it'll be easy to have a […]
So, the segfault on ARM macOS is because […]
MEMORY_OBJECT_NULL :: 0
VM_PROT_READ :: 0x01
VM_PROT_WRITE :: 0x02
VM_INHERIT_SHARE :: 0
I see you use `MAP_PRIVATE` on other targets, but use `VM_INHERIT_SHARE` here. The equivalent to `MAP_PRIVATE` would be `VM_INHERIT_COPY`, afaict.
This could be why the Intel CI is failing once threading is involved but I can't confirm because I don't have an Intel machine.
Native Odin-based Heap Allocator
After an intense development and rigorous testing process over the past three months, I am finally ready to share the results of my latest project.
In short, this is a lock-free dynamic heap allocator written solely in Odin, utilizing direct virtual memory access to the operating system where possible. Only system calls are used to ask the OS for virtual memory (except on operating systems where this is verboten, such as Darwin, where we use their libc API to get virtual memory), and the allocator handles everything else from there.
Rationale
Originally, I was working on porting all of `core` to use `os2`, when I found the experimental heap allocator in there. Having hooked my unreleased `os2` test framework up to it, I found that it suffered from more race conditions than had already been found, as well as other synchronization issues, such as apparent misunderstandings of how atomic operations work. The most confusing code that stood out to me was the following block:

All three of those fields exist in the same `u64`. It would make more sense to have atomically loaded `alloc`, then read each field individually. I spent a few days trying to make sense of `heap_linux.odin`, but it was a bit much for me at the time. The previous block and the warnings listed by TSan didn't give me great hope that it would be simple to fix.

So, I did what I think most programmers do in this situation. I decided to try writing my own from nothing and hopefully come to a better appreciation of the problem as a whole.
I combed through the last 30 years of the literature on allocators, with some reading of papers on parallelism.
For dynamic heap allocators, I found that contention, false sharing, and heap blowup were issues often mentioned. The Hoard paper (Berger et al., 2000) was particularly helpful in figuring out an overall design for solving those issues.
There is hopefully nothing too novel about the design I've put together here. We're all on well-trodden ground. I think the most exciting feature is the free bitmap, where most allocators appear to use free lists instead.
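To illustrate the general idea of a free bitmap, here is a toy sketch of the concept only; it is not this allocator's actual layout or scanning code:

```odin
package example

import "base:intrinsics"
import "core:fmt"

// Toy illustration of the free-bitmap idea: each set bit marks a free bin,
// and finding the next free bin is a count-trailing-zeros scan over the
// bitmap words.
find_free_bin :: proc(bitmap: []uint) -> (index: int, ok: bool) {
	for word, i in bitmap {
		if word != 0 {
			return i * 8 * size_of(uint) + int(intrinsics.count_trailing_zeros(word)), true
		}
	}
	return 0, false
}

main :: proc() {
	bitmap := []uint{0, 1 << 7} // first free bin is bit 7 of the second word
	idx, ok := find_free_bin(bitmap)
	fmt.println(idx, ok) // 71 true on a 64-bit target
}
```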
Goals
The design of this allocator was guided by three principles.
Features
- … `AND` operations on the addresses distributed by the allocator.
- … `malloc` and `free` an opaque barrier. Given that the code is available right here in the runtime itself, it allows configuration to any programmer's needs. It provides uniform behavior across all platforms, as well; a programmer need not contemplate how heap allocation may impact performance on one system versus another.
- … `ODIN_DEBUG_HEAP` is enabled.
- … `runtime.get_local_heap_info()` API.
Benchmark Results
Because of the wisdom from the quote above, I won't spend much time here except to say that the included test bench has microbenchmarks written by me for the purpose of making sure that the allocator is at least not egregiously slow in certain made-up scenarios.
If you believe these benchmarks can align with realistic situations, then this allocator is 2-3 times faster than libc malloc in general use case scenarios (so any allocation less than ~63KiB), on my AMD64 Linux-based system, compiled with `-o:aggressive -disable-assert -no-bounds-check -microarch:alderlake`.

Any speed gain drops off above allocations of 32KiB in size, because this is where bin allocations are no longer possible with the default configuration, and the allocator has to resort to coalescing entire slabs to fit the requests. I decided to accept the result of this design, as it's not that much slower than malloc, and I believe that rapid allocation of >=64KiB blocks is a special case and not the usual case for most programs.
The full test suite can be run with:
The benchmarks can be run with:
The `allocator` command line option can be switched to `libc` to use the old behavior.

Memory Usage
Speed aside, there are points to be aware of with this allocator, particularly in how it uses memory; these are clear-cut and not as susceptible to application patterns as benchmarks may be.
For one, due to the nature of slab allocation, any allocation always consumes the full bin of its size rank, so if you request 9 bytes, you will in actual fact consume 16, as that is the next power of two available. This continues for every power of two up to the maximum bin size of 32KiB.
This shouldn't be too surprising at lower sizes, as with a non-slab general purpose allocator, you're almost guaranteed to have some book-keeping somewhere, which would result in an allocation of 8 bytes actually using 16 or 24 bytes, depending on the header.
This begins to break down at higher sizes, however. If you allocate 257 bytes instead of 256, you're going to be placed into a bin of 512 bytes. This may seem wasteful, but there is a consideration behind it: allocations of a particular size rank are tightly packed next to one another, which increases cache locality. It's a memory-for-speed tradeoff, in the end.
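As a rough illustration of that rounding (a standalone sketch; the 8-byte minimum is assumed purely for the example):

```odin
package example

import "core:fmt"

// Rough illustration of rounding a request up to its power-of-two bin size.
bin_size_for :: proc(size: int) -> int {
	bin := 8
	for bin < size {
		bin *= 2
	}
	return bin
}

main :: proc() {
	fmt.println(bin_size_for(9))   // 16
	fmt.println(bin_size_for(257)) // 512
}
```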
Alignment is also used as the size if it is the larger of the two, up to a maximum of 64 bytes by default. This was one of the design choices made to help eliminate any need for headers. Beyond a size of 64 bytes, all allocations are aligned to at least 64 bytes. Alignment beyond 64 bytes is not supported.
There is also no convoluted coalescing logic to be had for any allocation below ~63KiB. This was done for the sake of simplicity. Beyond 64KiB, the allocator has to make decisions on which slabs to merge together, which is where memory usage and speed both take a hit.
To allocate 64KiB is to block out up to 128KiB, due to the nature of book-keeping on slab-wide allocations. That may be the weakest point of this allocator, and I'm open to feedback on possible workarounds.
The one upside of over-allocating like this is that if you resize within the same frame of memory that's already been allotted to you, it's virtually a no-op. The allocator has to do a few calculations, and it returns without touching any memory: simple and fast.
Beyond the `HUGE_ALLOCATION_THRESHOLD`, which is 3/4ths of a superpage by default (1.5MiB), the allocator distributes chunks of at least a superpage in size directly through virtual memory. This is where memory waste becomes less noticeable, as we're no longer dealing with bins or slabs but whole chunks from virtual memory.

Superpages may also waste up to one slab's worth of memory (64KiB) for the purposes of maintaining alignment, but this space is optionally used if a heap needs more space for its cache. With the current default values, one of these 64KiB blocks is used per 20 superpages allocated to a single thread, so it's about 3% of all virtual memory allocated this way.
The values dictating the sizes of slabs and maximum bins are all configurable through the `ODIN_HEAP_*` defines, so if your application really does need to make binned allocations of 64KiB, or if you find speed improvements by using smaller slabs, it's easy to change.

I chose the default values of 64KiB slabs with a 32KiB max bin size after some microbenchmarking, but it's possible that different values could result in better performance for different scenarios.
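For example, one of these can be overridden at build time with the compiler's `-define` flag, e.g. `odin build . -define:ODIN_HEAP_SUPERPAGE_CACHE_RATIO=10` (the value here is arbitrary).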
To summarize: this allocator does not try to squeeze out every possible byte at every possible juncture, but it does try to be fast as much as possible.
There may be a case to be made for the reduction of fragmentation through slab allocation resulting in less actual memory usage at the end of the day versus a coalescing allocator, but that is probably an application-specific benefit and one I have not thoroughly investigated.
Credits
I hope to demonstrate that the design used in this allocator is not exceedingly novel (and thus, not untested) by pointing out the inspirations for each major feature based upon the literature reviewed. Each feature has been documented and in use in various implementations for over two decades now.
The following points are original ideas; original in the sense that they were realized during development and not drawn from any specific paper, not that they are wholly novel and have never been seen before.
- … (`dirty_bins`) to track which bins are dirty and need zeroing upon re-allocation was an iterative optimization, realized after noticing that the allocator naturally uses the bin with the lowest address possible to keep cache locality, by virtue of `next_free_sector` always being set to the minimum value. An earlier version of the allocator used a bitmap with the same layout as `local_free` to track dirty bins.

Quotes
The following passage inspired runtime configurable slab size classes.
This passage encouraged attention to optimizing the heuristics used for the bitmaps used to track free bins.
jemalloc's author, on the rich history of memory allocator design:
References
Lectures
Design Differences
Of note, jemalloc uses multiple arenas to reduce the issue of allocator-induced false sharing. However, those arenas are shared between active threads. The strategy of giving exclusive access to an arena on a per thread basis is more similar to Hoard than jemalloc.
With regard to what is called the global heap in the Hoard paper, there is the superpage orphanage in this allocator. They both fulfill similar duties as far as memory reuse. However, in Hoard, superblocks may be moved from per-processor heaps to the global heap, if they cross an emptiness threshold.
In my design, this ownership transfer mechanism is forgone in favor of an overall simplified synchronization process. Superpages do not change ownership until they are either completely empty and ready to be freed or the thread cleanly exits. For a remote thread to be able to decouple a superpage belonging to another thread would require more complicated logic behind the scenes and likely slow down regular single-threaded usage with atomic operations.
This design can result in an apparent memory leak if thread A allocates some number of bytes, and thread B frees all of the allocations but thread A never allocates anything ever again and does not exit, as either event would trigger the merging of its remote frees and subsequent freeing of its empty superpages.
This is one behavior to be aware of when writing concurrent programs that use this allocator in producer/consumer relationships. In practice however, it should be unusual that a thread accumulates a significant amount of memory that it hands off to another thread to free and never revisits its heap for the duration of the program.
The Name
Most allocators are either named after the author or have a fancy title. PHKmalloc represents the author's initials. Mimalloc presumably means Microsoft Malloc.
If I had to give this allocator design a name, I might call it "the lock-free bitmap slab allocator" after its key features. For the purpose of differentiating this specific implementation of a heap allocator from any others, I think "Feoramalloc" is suitable.
I've used `feoramalloc` in the test bench to differentiate it from `libc`.

Final Thoughts
In closing, I want to say that I hope this allocator can improve the efficiency of programs written in Odin while standing as an example of how to learn about these low-level concepts such as lock-free programming and heap allocators.
Obviously, it won't make all programs magically faster, and if you're already using a custom allocator, then you know your problem space better than a general-purpose allocator could possibly ever guess.

I think this is a significant step towards having an independent runtime. We get consistent behavior across all platforms, as well as the ability to learn very specific information about the heap through the included diagnostics.
This PR is a draft for now, while I hammer out the final details and receive feedback.
Help Requests
I mainly need help with non-Linux/POSIX virtual memory access. I can test this allocator against FreeBSD and NetBSD, but I do not have a Windows or Darwin machine to verify the system-specific code there.
Windows passed the CI tests, so I'm hopeful that it works there. The Darwin tests pass for Intel, but it stalls on the `core` test suite afterwards, so there's something strange going on there. Linux and FreeBSD are working.

While testing, I hit an interesting snag with NetBSD. Its `mmap` syscall requires 7 arguments, but we only support 6, and I haven't been able to figure out what the calling convention for it is. That is to say, is it another register, or is it pushed to the stack, or something else? Could use some help there.

I don't have a plan for what to do for systems that do not expose virtual memory access, since I don't have any experience with those systems. I'm assuming Orca and WASM do not expose a virtual memory subsystem akin to `mmap` or `VirtualAlloc`. I only recently started tinkering with wasm after finding the official example. These are otherwise foreign fields to me, and I'm open to feedback. We could perhaps have a malloc-based fallback allocator in that case.

The only strong requirement the allocator has, regarding backing allocation, is some ability to request alignment or dispose of unnecessary pages. If we can do either of those, we're solid.
It would be great to hear about how this allocator impacts real-world application usage, too.
Also interested to hear how this could impact `-no-crt`. I noticed a commit recently about how Linux requires the C runtime to initialize thread local storage. I wasn't aware of that.

API
I'm also looking to hear if anyone has any better ideas about organization or API. This allocator used to live in a package in its own right during all of my testing, but I had to merge it into `base` in order to avoid cyclic import issues while making it the default allocator. This resulted in a lot of `heap_` and `HEAP_` prefixing.

The same goes for the `virtual_memory_*` and `get_current_thread_id` procs added to `base:runtime`. If anyone has a feel for how that could be improved, or if they're good as-is, I'd like to hear.

Memory Order Consume
I'm uncertain if the `Consume` memory ordering really means consume. If you check under `Atomic_Memory_Order` in `base:intrinsics`, it has a comment beside `Consume` that says `Monotonic`, which I presume corresponds to this.

Based on the documentation for LLVM's Acquire memory ordering, that is the one that actually corresponds to `memory_order_consume`.

I'm leaning towards thinking my usage of `Consume` should actually be replaced with `Acquire`, based on this, but I've left the memory order as-is for now until someone else can review and comment on it. It's no problem to use a stronger order, but if we can get away with a weaker one and preserve the semantics, all the better.

I base most of my understanding of memory ordering on Herb Sutter's talks, referenced above, which I highly recommend to anyone interested in this subject.
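For reference, the change being weighed would look like this at a use site (a contrived standalone example):

```odin
package example

import "base:intrinsics"

head: ^int

main :: proc() {
	value := 42
	intrinsics.atomic_store_explicit(&head, &value, .Release)

	// The change being considered: loads currently written with .Consume
	// would use .Acquire instead.
	p := intrinsics.atomic_load_explicit(&head, .Acquire)
	assert(p^ == 42)
}
```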