-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Hack to optimise based on known sizes #276
base: snmalloc1
Are you sure you want to change the base?
Conversation
If the caller knows the sizes, and knows the allocation is thread local, then we can make some significant optimisations. This is a brief hack to show where would need changing.
On x86 the code would look something like: 00000000000055a0 <free_local_small>:
55a0: 48 89 fa mov %rdi,%rdx
55a3: 48 89 fe mov %rdi,%rsi
55a6: 48 81 e6 00 00 f0 ff and $0xfffffffffff00000,%rsi
55ad: 89 d0 mov %edx,%eax
55af: c1 e8 0e shr $0xe,%eax
55b2: 83 e0 3f and $0x3f,%eax
55b5: 48 c1 e0 04 shl $0x4,%rax
55b9: 66 83 44 06 60 ff addw $0xffff,0x60(%rsi,%rax,1)
55bf: 74 0e je 55cf <free_local_small+0x2f>
55c1: 48 8b 4c 06 58 mov 0x58(%rsi,%rax,1),%rcx
55c6: 48 89 54 06 58 mov %rdx,0x58(%rsi,%rax,1)
55cb: 48 89 0a mov %rcx,(%rdx)
55ce: c3 retq
;; SLOW PATH
55cf: 48 89 d1 mov %rdx,%rcx
55d2: 48 81 e1 00 c0 ff ff and $0xffffffffffffc000,%rcx
55d9: 48 8b 3d e8 c9 20 00 mov 0x20c9e8(%rip),%rdi # 211fc8 <.got+0x10>
55e0: 64 48 8b 3f mov %fs:(%rdi),%rdi
55e4: 48 81 e1 ff 3f f0 ff and $0xfffffffffff03fff,%rcx
55eb: 0f b6 4c 01 66 movzbl 0x66(%rcx,%rax,1),%ecx
55f0: e9 4b e6 ff ff jmpq 3c40 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE27small_dealloc_offseted_slowEPNS_9SuperslabES1_m>
55f5: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
55fc: 00 00 00
55ff: 90 nop
and 0000000000005650 <malloc_small_64>:
5650: 48 8b 05 71 c9 20 00 mov 0x20c971(%rip),%rax # 211fc8 <.got+0x10>
5657: 64 48 8b 38 mov %fs:(%rax),%rdi
565b: 48 8b 47 18 mov 0x18(%rdi),%rax
565f: 48 85 c0 test %rax,%rax
5662: 74 08 je 566c <malloc_small_64+0x1c>
5664: 48 8b 08 mov (%rax),%rcx
5667: 48 89 4f 18 mov %rcx,0x18(%rdi)
566b: c3 retq
;; SLOW PATH
566c: 48 8b 87 80 0c 00 00 mov 0xc80(%rdi),%rax
5673: 48 39 87 88 0c 00 00 cmp %rax,0xc88(%rdi)
567a: 75 0f jne 568b <malloc_small_64+0x3b>
567c: be 03 00 00 00 mov $0x3,%esi
5681: ba 40 00 00 00 mov $0x40,%edx
5686: e9 25 ea ff ff jmpq 40b0 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE26small_alloc_next_free_listILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
568b: be 03 00 00 00 mov $0x3,%esi
5690: ba 40 00 00 00 mov $0x40,%edx
5695: e9 b6 ea ff ff jmpq 4150 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE19small_alloc_mq_slowILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
569a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
This is not intended for merging, but I have pushed so we can adapt the idea later. |
@@ -169,6 +169,19 @@ if(CONST_QUALIFIED_MALLOC_USABLE_SIZE) | |||
endif() | |||
|
|||
|
|||
# Build a redirection layer for all sizes that are a multiple of | |||
# 16bytes up to 1024. | |||
add_executable(generate src/redirect/generate.cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to differentiate between executables for the build host and executables for the target when we're cross-compiling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needing to build a C++ program on the host to generate things that we compile for the target is a bit painful for cross compiling (do we even guarantee that the sizes will be the same if, for example, we're compiling on a 32-bit system for a CHERI target?).
It's also going to be annoying to integrate into a libc build system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, all platforms and configurations, we build would use the same file. However, we can use this program to build a collection of headers that encode the parameters of interest. E.g. of the form
generated_[MIN_ALLOC_BITS]_[INTERMEDIATE_BITS].cc
Currently, everything would use exactly the same thing:
generated_4_2.cc
But for CHERI, we might want to up the minimum allocation size. Getting either parameter wrong in the header would "work", however, we might
- have more entry points than necessary, or
- allocate a larger object than necessary.
Perhaps, just check in the file in once, we have finished experimenting. While, we are experimenting, I think having this generated is good as it means we are less likely to make mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, we use the template parameter for s
, so the code will be specialised for this size, even if it ends up being a medium alloc, or large.
#define DEFINE_MALLOC_SIZE(a, s) \
extern "C" void* a() \
{ \
return snmalloc::ThreadAlloc::get_noncachable()->template alloc<s>(); \
}
SNMALLOC_EXPORT | ||
void SNMALLOC_NAME_MANGLE(free_local_small)(void* ptr) | ||
{ | ||
if (Alloc::small_local_dealloc(ptr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd feel slightly better if this if(!fast) slow
dance were inside Alloc
rather than having it export the fast and slow paths separately? Unless there's some reason to want to expose this for inlining?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has to at least be in ThreadAlloc, as we don't want to take the TLS lookup on the fast path. So we need that in scope. I think we probably need a better refactor if we actually want to support this design. This is more a hack to get some codegen, and see if the use case makes sense.
return (likely(slab->dealloc_fast(super, p))); | ||
} | ||
|
||
SNMALLOC_FAST_PATH void small_local_dealloc_slow(void* p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still SNMALLOC_FAST_PATH
despite _slow
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
Fix Align to be not a log size.
If the caller knows the sizes, and knows the allocation is thread local,
then we can make some significant optimisations.
This is a brief hack to show where would need changing.