-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: runtime detection of CRC32 and ARM64 support. #82
base: v2.1-agentzh
Are you sure you want to change the base?
Conversation
|
||
language: c | ||
|
||
arch: | ||
- amd64 | ||
#- arm64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The luajit2-test-suite is disabled for now on arm64 since it currently has a few failing test cases.
I skimmed through the patches and overall they seem to be in the direction I wanted to take, thanks! I can do a more thorough review in some time if you like. One thing I noticed for example is the |
Ping @javierguerragiraldez. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a closer look and it looks like you've excluded the actual dynamic bits from my patch because of which this is not really runtime detection at all. You still have to build the whole binary with crc32 support and that kinda defeats the purpose, at least from the distribution point of view.
In summary, you have to make sure that lj_str_hash.c only has the optimised hashing code and build that with sse4.2/crc32 and unconditionally build it into the binary. Then at load time you look at CPU capabilities and decide to either call into those functions or not.
We do this a lot in glibc for dozens of microarchitectures in x86, ppc, s390x and arm64 and it works very well in practice.
Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project. It has been almost a year and it is about time that we move on. Only thing I need from Fedora's perspective is regular release tarballs and I'll work on merging all of my patches here :) |
Hi all, I have been away from the computer for some time.
That is one thing that I forgot to mention in this PR: this patch voluntarily leaves out the bits forcibly asking the binary to be built with SSE4.2/CRC32 support:
For the time being, this PR is safer as-is by being fully backwards-compatible (i.e. OpenResty already specifies
This is a great segway now that the topic has been mentioned in my above paragraph, and my take on it is that for the time being (and probably for the foreseeable future), we should, when possible, ensure that merge-compatibility be preserved as the upstream v2.1 branch may still receive bugfixes for as long as the project is not officially declared unmaintained. |
My turn to apologize; I'm still kinda away but stole some time for this :)
Got it. Would it be clearer if the PR title is reworded to remove the 'runtime' bit?
I can help answer that question. Other than the very old xgene1 processor (the APM Mustang processors), all other cores that gcc knows about has CRC support. So while from a theoretical point of view it is unsafe, from a practical standpoint I think the assumption is safe because I don't think anybody actually would think of using xgene to run openresty on production. It is definitely safer than assuming SSE4.2 since there are far more machines in active operation that don't have SSE4.2.
Fair point, if that is the intention I think the change is fine with an updated PR title.
I'm not holding my breath anymore on an official announcement :) |
This PR needs to rebase to the latest master. |
@siddhesh Thanks for looking at this one. You definitely know about the actual arm hardware usage better than me. We can always enable it for the aarch64 world then. We still need to be careful about merging conflicts. I just spent a lot of time a few days ago to solve the merge conflicts from upstream LuaJIT repo :) |
We'd better hurry on this PR, we're about to cut a new release. Hopefully we can have it in the next OpenResty version :) |
So with this PR the same x64 openresty binary can work on modern intel chips and amd chips at the same time now? |
My understanding is that it's not the goal (hence my request to remove the As for the rebase, the only thing to do here is to drop the |
@siddhesh I see. Thanks for the clarification. I'm really looking forward to binary-level auto-scheduling for different microarchitectures. |
That's what #75 does; you can test it in action in moonjit. I think @thibaultcha has a plan to stage the changes in though (I don't fully understand how things fit into OpenResty, etc) so y'all might want to discuss that. |
@siddhesh Please have a look at the latest commit which adds the bits we previously mentioned to unconditionally build the binary with SSE4.2/CRC32. |
* Added missing targets to .PHONY * Made the unit_test.sh script executable * Fixed minor styling issues
50b9978
to
ca7271f
Compare
@thibaultcha I'm afraid that still does not help. I just skimmed through the patchset and building the entire binary with -msse4.2 is incorrect because it will not run on CPUs that don't have SSE4.2; the compiler is free to use SSE4.2 instructions wherever it sees fit throughout the code base whereas we want to use it only for the string hash function. That is why we need to build only the SSE4.2 bits with that flag. Likewise for the aarc64 crc32 function, although that's much less of a problem since crc32 is common on aarch64 servers. |
5d3d960
to
f8a7ea0
Compare
@thibaultcha I haven't had a chance to do a thorough review, but a quick skim gives me the impression that this should work correctly. I'll sync this into moonjit this weekend (since the patchset in moonjit is slightly different, only in some minor details and tests) and let you know if I find issues. |
…'-msse4.2'. Co-authored-by: iSage <[email protected]> Co-authored-by: Siddhesh Poyarekar <[email protected]>
Only, available in ARMv8, the CRC32 instructions are enabled when LuaJIT is compiled with `-march=armv8-a+crc`. Co-authored-by: Debayan Ghosh <[email protected]>
f8a7ea0
to
8d6dca3
Compare
As a general question, in order to be "dynamic", we have to replace direct, possibly inlinable, function calls to be indirectly function calls. Unfortunately, call-sites involved are pretty frequent. Do you see any performance penalty in your stress testing? |
{ | ||
return (const uint32_t*)(void*)str; | ||
} | ||
|
||
/* hash string with len in [1, 4) */ | ||
static LJ_AINLINE uint32_t lj_str_hash_1_4(const char* str, uint32_t len) | ||
static LJ_NOINLINE uint32_t lj_str_hash_1_4(const char* str, uint32_t len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous comment somehow is gone. I try to understand this change again. How come you don't want to have this function inlined? As far as I can tell, it's perfect candidate for inlining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because previous versions of this patch have a particular treatment for LJ_AINLINE
:
luajit2/src/x64/src/lj_str_hash_x64.h
Lines 21 to 22 in 4589430
#undef LJ_AINLINE | |
#define LJ_AINLINE |
I agree that it seems like a good candidate for inlining, but my reasoning was to wait until we have some micro benchmarks to run before making this change.
This PR refines the implementation of the CRC32 runtime detection implemented by @isage in #20, and previously refined by @siddhesh in #75. It also brings in refined ARM64 support originally implemented by @debayang in #21.
The refined implementation compares itself with the original ones as such:
__attribute__((constructor))
attribute to detect/initialize CRC32 support.-DLUAJIT_DISABLE_JIT
).jit.*
APIs allowing users some introspection into this optimization.LJ_OR_*
macros (as proposed in Move openresty-specific language extensions under a special configuration? #63).lj_str_hash.c
and leaves the room open to new string hashing implementations in the future.In addition, this PR offers to greatly improve the CI tests matrix by running all 3 test suites against many different builds, on both x64 and arm64 architectures (thanks to the recent Travis-CI updates on arm64 support).
CC for reviews: @yangshuxin, @agentzh, @siddhesh, @dndx.