WIP: Hack to optimise based on known sizes #276

mjp41 · 2021-02-05T12:06:19Z

If the caller knows the sizes, and knows the allocation is thread local,
then we can make some significant optimisations.
This is a brief hack to show what would need changing.

mjp41 · 2021-02-05T12:09:25Z

On x86 the code would look something like:

00000000000055a0 <free_local_small>:
    55a0:	48 89 fa             	mov    %rdi,%rdx
    55a3:	48 89 fe             	mov    %rdi,%rsi
    55a6:	48 81 e6 00 00 f0 ff 	and    $0xfffffffffff00000,%rsi
    55ad:	89 d0                	mov    %edx,%eax
    55af:	c1 e8 0e             	shr    $0xe,%eax
    55b2:	83 e0 3f             	and    $0x3f,%eax
    55b5:	48 c1 e0 04          	shl    $0x4,%rax
    55b9:	66 83 44 06 60 ff    	addw   $0xffff,0x60(%rsi,%rax,1)
    55bf:	74 0e                	je     55cf <free_local_small+0x2f>
    55c1:	48 8b 4c 06 58       	mov    0x58(%rsi,%rax,1),%rcx
    55c6:	48 89 54 06 58       	mov    %rdx,0x58(%rsi,%rax,1)
    55cb:	48 89 0a             	mov    %rcx,(%rdx)
    55ce:	c3                   	retq   
;; SLOW PATH
    55cf:	48 89 d1             	mov    %rdx,%rcx
    55d2:	48 81 e1 00 c0 ff ff 	and    $0xffffffffffffc000,%rcx
    55d9:	48 8b 3d e8 c9 20 00 	mov    0x20c9e8(%rip),%rdi        # 211fc8 <.got+0x10>
    55e0:	64 48 8b 3f          	mov    %fs:(%rdi),%rdi
    55e4:	48 81 e1 ff 3f f0 ff 	and    $0xfffffffffff03fff,%rcx
    55eb:	0f b6 4c 01 66       	movzbl 0x66(%rcx,%rax,1),%ecx
    55f0:	e9 4b e6 ff ff       	jmpq   3c40 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE27small_dealloc_offseted_slowEPNS_9SuperslabES1_m>
    55f5:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
    55fc:	00 00 00 
    55ff:	90                   	nop

and

0000000000005650 <malloc_small_64>:
    5650:	48 8b 05 71 c9 20 00 	mov    0x20c971(%rip),%rax        # 211fc8 <.got+0x10>
    5657:	64 48 8b 38          	mov    %fs:(%rax),%rdi
    565b:	48 8b 47 18          	mov    0x18(%rdi),%rax
    565f:	48 85 c0             	test   %rax,%rax
    5662:	74 08                	je     566c <malloc_small_64+0x1c>
    5664:	48 8b 08             	mov    (%rax),%rcx
    5667:	48 89 4f 18          	mov    %rcx,0x18(%rdi)
    566b:	c3                   	retq   
;; SLOW PATH
    566c:	48 8b 87 80 0c 00 00 	mov    0xc80(%rdi),%rax
    5673:	48 39 87 88 0c 00 00 	cmp    %rax,0xc88(%rdi)
    567a:	75 0f                	jne    568b <malloc_small_64+0x3b>
    567c:	be 03 00 00 00       	mov    $0x3,%esi
    5681:	ba 40 00 00 00       	mov    $0x40,%edx
    5686:	e9 25 ea ff ff       	jmpq   40b0 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE26small_alloc_next_free_listILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
    568b:	be 03 00 00 00       	mov    $0x3,%esi
    5690:	ba 40 00 00 00       	mov    $0x40,%edx
    5695:	e9 b6 ea ff ff       	jmpq   4150 <_ZN8snmalloc9AllocatorIXadL_ZNS_20needs_initialisationEPvEEXadL_ZNS_21init_thread_allocatorENS_12function_refIFS1_S1_EEEEENS_24MemoryProviderStateMixinINS_8PALLinuxEEENS_15DefaultChunkMapINS_21GlobalPagemapTemplateINS_11FlatPagemapILm20EhEEEEEELb1EE19small_alloc_mq_slowILNS_7ZeroMemE0ELNS_12AllowReserveE1EEES1_mm>
    569a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

This is not intended for merging, but I have pushed so we can adapt the idea later.
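The address arithmetic at the top of the free_local_small listing can be read off its immediates. The sketch below only replays that arithmetic: the constant and function names are hypothetical labels, and describing the two fields as a "free-list head" and a "counter" is an interpretation of the load/store pattern in the listing, not something taken from the snmalloc sources.

```cpp
#include <cassert>
#include <cstdint>

// Immediates copied from the free_local_small listing. The names are
// hypothetical labels for this sketch, not snmalloc identifiers.
constexpr std::uint64_t REGION_MASK = 0xfffffffffff00000ULL; // and $0xfffffffffff00000,%rsi
constexpr unsigned INDEX_SHIFT = 14;                         // shr $0xe,%eax
constexpr std::uint64_t INDEX_MASK = 0x3f;                   // and $0x3f,%eax
constexpr unsigned ENTRY_SHIFT = 4;                          // shl $0x4,%rax
constexpr std::uint64_t HEAD_OFFSET = 0x58;                  // mov 0x58(%rsi,%rax,1),%rcx
constexpr std::uint64_t COUNTER_OFFSET = 0x60;               // addw $0xffff,0x60(%rsi,%rax,1)

struct Decoded
{
  std::uint64_t region_base;  // pointer with its low 20 bits cleared (1 MiB aligned)
  std::uint64_t index;        // bits 14..19 of the pointer (one of 64 16 KiB pieces)
  std::uint64_t entry_offset; // index scaled by 16 bytes
};

// Split a pointer value the same way the first six instructions do.
inline Decoded decode(std::uint64_t p)
{
  Decoded d;
  d.region_base = p & REGION_MASK;
  d.index = (p >> INDEX_SHIFT) & INDEX_MASK;
  d.entry_offset = d.index << ENTRY_SHIFT;
  assert(d.entry_offset < (64u << ENTRY_SHIFT)); // 64 entries of 16 bytes each
  return d;
}

// Address the fast path pushes the freed object onto (assumed free-list head).
inline std::uint64_t head_address(std::uint64_t p)
{
  const Decoded d = decode(p);
  return d.region_base + d.entry_offset + HEAD_OFFSET;
}

// Address of the 16-bit field that is decremented; the listing branches to
// the slow path when the result is zero.
inline std::uint64_t counter_address(std::uint64_t p)
{
  const Decoded d = decode(p);
  return d.region_base + d.entry_offset + COUNTER_OFFSET;
}
```

Note that nothing in this arithmetic needs the thread-local allocator; in the listing, the %fs-relative load only appears after the SLOW PATH marker.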

nwf · 2021-02-05T17:45:09Z

CMakeLists.txt

@@ -169,6 +169,19 @@ if(CONST_QUALIFIED_MALLOC_USABLE_SIZE)
 endif()


+# Build a redirection layer for all sizes that are a multiple of 
+# 16bytes up to 1024.
+add_executable(generate src/redirect/generate.cc)


Is it possible to differentiate between executables for the build host and executables for the target when we're cross-compiling?

I don't know.

Looks possible:
https://gitlab.kitware.com/cmake/community/-/wikis/doc/cmake/CrossCompiling#using-executables-in-the-build-created-during-the-build

Needing to build a C++ program on the host to generate things that we compile for the target is a bit painful for cross compiling (do we even guarantee that the sizes will be the same if, for example, we're compiling on a 32-bit system for a CHERI target?).

It's also going to be annoying to integrate into a libc build system.

Currently, all the platforms and configurations we build would use the same file. However, we can use this program to build a collection of headers that encode the parameters of interest, e.g. of the form

generated_[MIN_ALLOC_BITS]_[INTERMEDIATE_BITS].cc

Currently, everything would use exactly the same thing:

generated_4_2.cc

But for CHERI, we might want to increase the minimum allocation size. Getting either parameter wrong in the header would "work"; however, we might

- have more entry points than necessary, or
- allocate a larger object than necessary.

Perhaps just check the file in once we have finished experimenting. While we are experimenting, I think having this generated is good, as it means we are less likely to make mistakes.
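To make the "more entry points than necessary" point concrete, here is a minimal sketch that counts entry points against distinct rounded sizes. The rounding rule is an assumption made for illustration (sizes are rounded up to a step of at least 2^MIN_ALLOC_BITS, with 2^INTERMEDIATE_BITS steps per power of two); it is not taken from snmalloc's size-class code, and every function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Largest power of two that is <= x.
inline std::size_t floor_pow2(std::size_t x)
{
  assert(x > 0);
  std::size_t p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

// ASSUMED rounding rule, for illustration only: within each power-of-two
// range there are 2^intermediate_bits evenly spaced classes, and no step is
// smaller than 2^min_alloc_bits.
inline std::size_t round_up_to_class(
  std::size_t size, unsigned min_alloc_bits, unsigned intermediate_bits)
{
  assert(size > 0);
  const std::size_t min_step = std::size_t(1) << min_alloc_bits;
  std::size_t step =
    (size > 1) ? (floor_pow2(size - 1) >> intermediate_bits) : 0;
  if (step < min_step)
    step = min_step;
  return (size + step - 1) / step * step;
}

// One entry point per multiple of 2^min_alloc_bits up to max_size.
inline std::size_t entry_points(std::size_t max_size, unsigned min_alloc_bits)
{
  return max_size >> min_alloc_bits;
}

// How many different rounded sizes those entry points actually reach.
inline std::size_t distinct_classes(
  std::size_t max_size, unsigned min_alloc_bits, unsigned intermediate_bits)
{
  std::set<std::size_t> classes;
  const std::size_t stride = std::size_t(1) << min_alloc_bits;
  for (std::size_t s = stride; s <= max_size; s += stride)
    classes.insert(round_up_to_class(s, min_alloc_bits, intermediate_bits));
  return classes.size();
}
```

Under this model, with the 4 and 2 of generated_4_2.cc, there are 64 multiples of 16 up to 1024 but only 20 distinct rounded sizes, so many entry points would end up sharing an implementation.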

Note that we use the template parameter for s, so the code will be specialised for this size, even if it ends up being a medium or large alloc.

#define DEFINE_MALLOC_SIZE(a, s) \
  extern "C" void* a() \
  { \
    return snmalloc::ThreadAlloc::get_noncachable()->template alloc<s>(); \
  }
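A self-contained sketch of how such a macro specialises each entry point on its size. Everything in the mock namespace is a stand-in written for this example (including the thresholds and the Path enum); only the shape of the macro and the name get_noncachable follow the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

namespace mock
{
  // Hypothetical thresholds, chosen only so the example has three paths.
  constexpr std::size_t SMALL_MAX = 1024;
  constexpr std::size_t MEDIUM_MAX = 1024 * 1024;

  enum class Path { Small, Medium, Large };

  constexpr Path classify(std::size_t size)
  {
    return size <= SMALL_MAX ? Path::Small :
      size <= MEDIUM_MAX     ? Path::Medium :
                               Path::Large;
  }

  struct Alloc
  {
    Path last_path = Path::Small;

    // The size is a template parameter, so the path is chosen at compile
    // time and each instantiation contains only the code for its own path.
    template<std::size_t size>
    void* alloc()
    {
      constexpr Path path = classify(size);
      last_path = path;
      return std::malloc(size); // stand-in for the real allocation
    }
  };

  struct ThreadAlloc
  {
    // Stand-in for the thread-local allocator lookup.
    static Alloc* get_noncachable()
    {
      static thread_local Alloc alloc;
      return &alloc;
    }
  };
}

#define DEFINE_MALLOC_SIZE(a, s) \
  extern "C" void* a() \
  { \
    return mock::ThreadAlloc::get_noncachable()->template alloc<s>(); \
  }

// Hypothetical entry points: the size is baked into the function name.
DEFINE_MALLOC_SIZE(malloc_small_64, 64)
DEFINE_MALLOC_SIZE(malloc_known_65536, 65536)

// Usage: callers pass no size argument at all.
inline void demo()
{
  void* p = malloc_small_64();
  assert(p != nullptr);
  std::free(p);
}
```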

nwf · 2021-02-05T17:48:34Z

src/override/malloc.cc

+  SNMALLOC_EXPORT
+  void SNMALLOC_NAME_MANGLE(free_local_small)(void* ptr)
+  {
+    if (Alloc::small_local_dealloc(ptr))


I think I'd feel slightly better if this if(!fast) slow dance were inside Alloc rather than having it export the fast and slow paths separately? Unless there's some reason to want to expose this for inlining?

It has to at least be in ThreadAlloc, as we don't want to take the TLS lookup on the fast path. So we need that in scope. I think we probably need a better refactor if we actually want to support this design. This is more a hack to get some codegen, and see if the use case makes sense.
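One way to picture the constraint described above is a wrapper that owns the fast/slow split and only performs the thread-local lookup once the fast path has failed. This is a toy model: SlabMeta, its fields, and the function signatures are hypothetical and much simpler than the PR's code, and the slow path is a placeholder that only records that it ran.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch
{
  // Hypothetical per-slab metadata: an intrusive free-list head plus a
  // countdown that sends the caller to the slow path when it reaches zero.
  struct SlabMeta
  {
    void* head = nullptr;
    std::uint16_t countdown = 0;
  };

  struct Alloc
  {
    int slow_calls = 0;

    // Fast path: static, because it touches only the slab metadata and never
    // the allocator instance. Returns false if the slow path must run.
    static bool small_local_dealloc(SlabMeta& meta, void* p)
    {
      if (--meta.countdown == 0)
        return false;
      *static_cast<void**>(p) = meta.head; // link the object to the old head
      meta.head = p;                       // the object becomes the new head
      return true;
    }

    // Placeholder slow path; a real allocator would do its bookkeeping here.
    void small_local_dealloc_slow(SlabMeta&, void*)
    {
      ++slow_calls;
    }
  };

  struct ThreadAlloc
  {
    // Stand-in for the TLS lookup that should stay off the fast path.
    static Alloc* get_noncachable()
    {
      static thread_local Alloc alloc;
      return &alloc;
    }
  };

  // The if(!fast) slow dance, kept in one place. The thread-local allocator
  // is fetched only after the fast path has reported failure.
  inline void free_local_small(SlabMeta& meta, void* p)
  {
    if (Alloc::small_local_dealloc(meta, p))
      return;
    ThreadAlloc::get_noncachable()->small_local_dealloc_slow(meta, p);
  }
}
```

Whether this wrapper belongs in ThreadAlloc or elsewhere is exactly the open refactoring question in the comments above; the sketch only shows why the TLS lookup needs to be in scope at the point where the two paths are joined.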

nwf · 2021-02-05T17:49:31Z

src/mem/alloc.h

+      return (likely(slab->dealloc_fast(super, p)));
+    }
+
+    SNMALLOC_FAST_PATH void small_local_dealloc_slow(void* p)


Still SNMALLOC_FAST_PATH despite _slow?

WIP: Hack to optimise based on known sizes

7944d8f

mjp41 requested review from nwf, davidchisnall and sylvanc February 5, 2021 12:06

Add a redirection layer for small alloc functions.

44374b5

nwf reviewed Feb 5, 2021

mjp41 added 6 commits February 5, 2021 20:17

Add missing files.2

8b294b2

Typo

57dfeb1

Add alignment

28c9d09

Change API

d48fd1e

Large API

f132034

Fix Windows Build

d2166fc

Fix Align to be not a log size.

mjp41 force-pushed the local_free branch from 461fed6 to d2166fc February 11, 2021 10:53

ghost mentioned this pull request Mar 21, 2022

LocalAllocator: add alloc_hinted #487

Closed

nwf Feb 5, 2021

mjp41 Feb 5, 2021

mjp41 Feb 5, 2021

davidchisnall Feb 8, 2021

mjp41 Feb 8, 2021

mjp41 Feb 8, 2021

nwf Feb 5, 2021

mjp41 Feb 5, 2021

nwf Feb 5, 2021

mjp41 Feb 5, 2021
