SIMD in the lexer #2296
I want to give it a try.
Go ahead! Please report progress so other contributors don't end up doing the same thing, i.e. say which routine and which arch you are targeting.
Thanks for opening this Boshen. Please could people hold off on tackling identifiers for now? I have a major refactor of lexing identifiers underway which is designed to make it much easier to bring in SIMD later. Should have that done within a week.

A couple of thoughts for anyone who's interested in doing this: most of these tasks relate to searching through a block of bytes to find one of a set of "end characters". The `pshufb` instructions (or I think the ARM equivalent is `tbl`) I believe will be most efficient for this. I found this article useful on this subject: http://0x80.pl/articles/simd-byte-lookup.html

Secondly, please see #2288. Hopefully that'll be merged soon, and it adds APIs to the lexer for consuming the source text as bytes instead of
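To make the linked technique concrete, here's a scalar model of the byte-lookup idea from that article: a 16-entry table indexed by a byte's low nibble, where bit N of each entry records whether the byte with high nibble N is in the needle set. With `pshufb`/`tbl` the same lookup runs on 16 bytes at once. This is only a sketch under the assumption of ASCII needles (high nibble < 8), and the function names are made up:

```rust
/// Build the nibble table for a set of ASCII needle bytes.
/// table[b & 0x0F] has bit (b >> 4) set iff `b` is a needle.
fn build_table(needles: &[u8]) -> [u8; 16] {
    let mut table = [0u8; 16];
    for &b in needles {
        assert!(b < 0x80, "this sketch only covers ASCII needles");
        table[(b & 0x0F) as usize] |= 1 << (b >> 4);
    }
    table
}

/// Find the first byte in `haystack` that is in the needle set.
/// The SIMD version does the same table lookup on a whole vector at once.
fn find_first(haystack: &[u8], table: &[u8; 16]) -> Option<usize> {
    haystack.iter().position(|&b| {
        b < 0x80 && (table[(b & 0x0F) as usize] >> (b >> 4)) & 1 != 0
    })
}
```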
I'll begin from
@dyxushuai Please see my comment above. I suggest it'd be better to start with one of the others - not identifiers. I have a major overhaul of lexing identifiers in the works, which I expect to turn into a PR within a week. Currently it's on my fork of OXC: overlookmotel#36
@overlookmotel Sorry, I didn't pay close attention to the details of your PR. I'll start with the second one. FYI: there are already some good implementations for quickly searching characters.
Thanks. And thanks for the links. Very interesting.
My PR for identifiers is a total mess at present, so not worth reading really. But the point is that it completely churns up the code that's in the main branch at present, so it'd be difficult for us both to work on that area simultaneously.

By the way, in case you didn't notice, #2288 was merged today. That provides some more efficient APIs for reading source as bytes, which may come in useful.
PS: One thing I missed in the algorithms quoted above:

```rust
// Widest first: use AVX2 to process 32 bytes in parallel.
while bytes.len() >= 32 {
    // ... check a 32-byte block with AVX2 ...
    bytes.advance(32);
}
// Fall back to SSE4.2 for 16-byte blocks.
while bytes.len() >= 16 {
    // ... check a 16-byte block with SSE4.2 ...
    bytes.advance(16);
}
// Fall back to register-width parallel processing (SWAR).
while bytes.len() >= core::mem::size_of::<usize>() {
    // ... check one usize-sized block ...
    bytes.advance(core::mem::size_of::<usize>());
}
```
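As a sketch of that final register-width tier (assumed names, single-needle case), here is the classic SWAR zero-byte trick in safe Rust - the same idea memchr's scalar fallback uses:

```rust
/// Find `needle` in `haystack`, scanning one usize-sized word at a time.
/// Illustrative sketch only; a real lexer would combine this with the
/// wider SIMD tiers discussed above.
fn find_byte_swar(haystack: &[u8], needle: u8) -> Option<usize> {
    const WORD: usize = core::mem::size_of::<usize>();
    let lo = usize::from_ne_bytes([0x01u8; WORD]); // 0x0101...01
    let hi = usize::from_ne_bytes([0x80u8; WORD]); // 0x8080...80
    let mask = (needle as usize).wrapping_mul(lo); // needle in every lane

    let mut i = 0;
    while i + WORD <= haystack.len() {
        let chunk: [u8; WORD] = haystack[i..i + WORD].try_into().unwrap();
        let x = usize::from_ne_bytes(chunk) ^ mask; // matching lanes become 0
        // Classic "has zero byte" test: nonzero iff some lane of x is 0.
        if x.wrapping_sub(lo) & !x & hi != 0 {
            // A match is in this word; locate it with a scalar scan.
            return haystack[i..i + WORD]
                .iter()
                .position(|&b| b == needle)
                .map(|p| i + p);
        }
        i += WORD;
    }
    // Scalar tail for the remaining < WORD bytes.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```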
That's really smart! I have also been wondering whether we could do something like what ratel does. At the start of parsing, it copies the entire source text into a buffer that's 1 byte longer than the source, and adds a null byte on the end to be a sentinel for EOF. Then the lexer never has to bounds-check "have I reached the end?" before reading: as long as the EOF sentinel has not been read yet, it's not at the end.

We could extend this approach: instead of adding a single null byte to the end, add 32 null bytes. Then when iterating through the source, there's always guaranteed to be enough bytes remaining to do a 32-byte read without any bounds check. So then there's no need for specialised handling of "less than 32, but at least 16 bytes remaining": there's always 32.

Side note: I'm not sure why ratel uses

My questions about this approach are:

For me at least, (4) is the biggest question mark. Anyway... perhaps that's a side-track, but just wanted to mention it. The relevant parts of ratel, if you're interested: https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs#L630-L653
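A minimal sketch of the extended-padding idea (names assumed; not ratel's or oxc's actual code):

```rust
/// Number of trailing sentinel bytes; 32 matches the widest SIMD read.
const PADDING: usize = 32;

/// Copy the source into a buffer with `PADDING` trailing zero bytes.
/// Any position `pos <= source.len()` then permits reading a full
/// 32-byte block `padded[pos..pos + 32]` without a bounds check
/// (the real lexer would use `get_unchecked` or raw pointers for that).
fn pad_source(source: &str) -> Vec<u8> {
    let mut padded = Vec::with_capacity(source.len() + PADDING);
    padded.extend_from_slice(source.as_bytes());
    padded.resize(source.len() + PADDING, 0); // zero-byte sentinels
    padded
}
```

One open question (as raised in the thread) is whether the extra copy of the whole source pays for itself; that's something only benchmarks can settle.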
Do you mean adding padding bytes at the end of the source text? And making the source text exactly a multiple of 32?
I think it's a habit that comes from C++. |
Yes, exactly that. Except padded length doesn't need to be a multiple of 32, it just needs to be true that:
If you want to do aligned reads, could add a bit more padding, to also guarantee:
Previous attempt on padding the source:
I can help contact Wojciech Muła on Twitter if we have questions to ask.
Thanks Boshen. Good to know this has been tried before. I'll read through those issues. May be a dead end, but maybe it could make more sense now, if it improves potential for SIMD.
Yeah it makes more sense now, we can give it another try.
@Boshen @overlookmotel Do you have a benchmark setup or framework?
The current setup is only observable with CodSpeed on GitHub Linux machines. I think we need another GitHub repository for a setup similar to https://github.com/BurntSushi/memchr/blob/master/.github/workflows/ci.yml

We also need to use criterion instead of CodSpeed, because CodSpeed only works on Linux. I used to do the benchmark diff like this:

oxc/.github/workflows/benchmark.yml Lines 52 to 68 in dda9c2e
I figured we should move conversations from #2327 here. @overlookmotel:
This seems pretty good. I have been playing around with implementing various SIMD and parsing optimizations for the

This uses some interesting code design (helped by orlp in the Rust Discord) to avoid panic checking and let Rust optimize the assembly for the loops instead of dancing with raw pointers. Furthermore, it implements the SIMD-based shuffling and naive compare methods for 16-byte and 32-byte widths.

Interestingly enough, the most performant algorithms on my machine have been (1x) simd_naive, (1.34x ± 0.12) memchr, and (1.73x ± 0.09) simd_shuffle. This was a benchmark that just went through the comment bodies scraped from the React development file's source. My machine also does not have AVX2, so I am unable to run the wide variants, and I would expect the shuffle variant to beat the naive versions once we have more bytes to search. Hopefully these same approaches can be adapted into the various parsing algorithms in oxc.
Nice! I'm heading for bed now so will have to look at that tomorrow (ditto dyxushuai's draft PR #2338). In the meantime, just to share where I'm up to with identifiers, as it overlaps with this effort: I've now refactored byte searching into a generic "byte table search" macro. My intent is that this same macro can be used for all the other searches discussed here (except for multi-line comments, which are different as you're looking for

I've used a macro rather than a function, as I found it's faster if all the code is in one function together (even

The macro already searches in 32-byte blocks, so I was wondering if, once it's merged, that macro could be SIMD-ized, by replacing the main "check a block of 32" loop with a SIMD implementation? Ditto

If either of you have any suggestions of how to make the macro more amenable to that, please do let me know. Macro: lexer/search.rs
Thanks, this draft is still in its early stages. There are some major refactorings that need to be done, such as using
@sno2 I couldn't resist taking a look at your experiment (bedtime reading). A few questions:
Are you saying simd_naive was the fastest? Or slowest? And how does SIMD compare to scalar? Does it move the needle significantly? Out of interest, how do scalar and scalar_lookup compare?
If you were testing the string of each comment individually, I'm not sure it'd be very representative. When parsing a large JS file, you will very rarely hit the "not enough bytes left, fall back to scalar" path. Whereas if you're feeding it each comment as an individual string, it'd hit that slow path every time for the end of the comment. (I may well have misunderstood here and that's not what you're doing.)

Another thing... You might find there's a speed-up from marking the rarely-taken branches `#[cold]`. i.e. instead of:

```rust
if rest[offset_in_chunk] == XX {
    index += skip_single_line_comment_scalar(&rest[offset_in_chunk..]);
}
```

something like:

```rust
if rest[offset_in_chunk] == XX {
    #[cold]
    fn handle_e2(rest: &[u8], offset_in_chunk: usize, index: usize) -> usize {
        index + skip_single_line_comment_scalar(&rest[offset_in_chunk..])
    }
    return handle_e2(rest, offset_in_chunk, index);
}
```

We have to handle irregular whitespace for the linter, but in practice don't actually expect to find any 99% of the time, and the above communicates that to the compiler. Ugly as hell, but I've had success with it elsewhere in OXC.

Lastly, in the scalar lookup version, I think this:

```rust
match SCALAR_LOOKUP[byte as usize] {
    1 => return idx,
    2 => match &string[idx + 1..] {
        &[0x80, 0xA8 | 0xA9, ..] => return idx,
        _ => {}
    },
    _ => {}
}
```

could be:

```rust
match SCALAR_LOOKUP[byte as usize] {
    0 => {},
    1 => return idx,
    // This is UTF-8, so there are guaranteed to be 2 more bytes after 0xE2
    2 => match unsafe { string.get_unchecked(idx + 1..idx + 3) } {
        &[0x80, 0xA8 | 0xA9] => return idx,
        _ => {}
    },
    // This may allow the compiler to turn this match into a jump table
    // (which may or may not be faster)
    _ => unsafe { std::hint::unreachable_unchecked() },
}
```

or again, use
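For anyone wanting to try that shape in isolation, here is a self-contained, safe version of the scalar lookup search. `SCALAR_LOOKUP`'s real contents aren't shown in this thread, so the table values below (1 = `\r`/`\n`, 2 = `0xE2`) are an assumption based on the discussion, and a checked slice stands in for `get_unchecked`:

```rust
/// Find the first line terminator (`\r`, `\n`, or Unicode LS/PS, which
/// are E2 80 A8 / E2 80 A9 in UTF-8) via a 256-entry lookup table.
/// Table values: 0 = keep scanning, 1 = definite match, 2 = 0xE2 lead
/// byte (needs 2 more bytes to decide). Sketch only, not oxc's code.
fn find_line_terminator(string: &[u8]) -> Option<usize> {
    let mut lookup = [0u8; 256];
    lookup[b'\r' as usize] = 1;
    lookup[b'\n' as usize] = 1;
    lookup[0xE2] = 2;

    for (idx, &byte) in string.iter().enumerate() {
        match lookup[byte as usize] {
            1 => return Some(idx),
            // LS = E2 80 A8, PS = E2 80 A9
            2 => match string.get(idx + 1..idx + 3) {
                Some(&[0x80, 0xA8 | 0xA9]) => return Some(idx),
                _ => {}
            },
            _ => {}
        }
    }
    None
}
```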
One more thing (sorry):

```rust
let offset_in_chunk = mask.trailing_zeros() as usize;
assert!(offset_in_chunk < 16); // eliminate bound checks
index += offset_in_chunk;
if rest[offset_in_chunk] == XX {
```

How does the assertion eliminate the bounds check? Doesn't it just turn into a branch with a panic on it?
simd_naive was the fastest. SIMD was around 3-5x faster than scalar. lookup was consistently faster than scalar by a bit. I will get the CI/CD benchmarking thing working with more test cases so that we can see fair results.
It goes through the source as a whole (like an iterator), so no.
I was unable to reproduce a speedup on my machine, so I took that out. More accurate benchmarking will allow me to see better how it affects the code.
Nice, I was also able to get the lookup table optimization for the

https://godbolt.org/z/d6f3xd3fK
I don't exactly know how, but I think the optimizer does back-checking whenever you assert something such as an integer range check. Then, it notices that
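A toy function reproducing the pattern in question. The names are made up, and whether the bounds check is actually elided depends on the optimizer, so it's worth confirming in the generated assembly (e.g. on godbolt):

```rust
/// The asserts give the optimizer a proven range for `offset` and a
/// proven minimum length for `chunk`, so `chunk[offset]` below can
/// reuse those up-front checks instead of emitting its own per-access
/// bounds check. One branch-to-panic at the top, none in the hot path.
fn value_at_match(chunk: &[u8], mask: u16) -> u8 {
    let offset = mask.trailing_zeros() as usize;
    assert!(offset < 16); // panics if mask == 0; also bounds the index
    assert!(chunk.len() >= 16); // one check covering the whole chunk
    chunk[offset]
}
```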
Interesting! I had no idea you could get it to do that.
This is optimizing the wrong part, because
That's not what I expected at all. So maybe I was wrong about shuffle tables, at least for such a small number of "needles". But good to hear the gain from SIMD is significant. Sounds like this effort will be worthwhile. Please do post again when you have more numbers.
@Boshen I just thought: Does the irregular whitespace linter rule need to catch irregular whitespace in comments too? Or can they be ignored? What about in strings? |
I can trade that for performance by removing irregular whitespace collection from the lexer. I can implement the irregular whitespace rule differently.
The new byte-search macro is now merged into main (#2352) and used to lex identifiers, and also some whitespace. PR to apply it to strings too is pending (#2357). If anyone wants to SIMD-ize identifiers, please go ahead now. I'd also be interested to hear if you think the
PR #2338 already supports
Saw a relevant PR:
No bandwidth to work on this, close as stale.
As discussed in #2285, we'd like to embark on the journey of SIMD.
The current state of SIMD in Rust is functional; successful projects include https://github.com/rusticstuff/simdutf8 and https://github.com/BurntSushi/memchr.
`portable-simd` is not an option due to its instability.

In this task, you will implement one of the tokenization methods for the following tokens in the lexer.
oxc/crates/oxc_parser/src/lexer/identifier.rs
Line 8 in 6002560
oxc/crates/oxc_parser/src/lexer/string.rs
Line 6 in 6002560
oxc/crates/oxc_parser/src/lexer/byte_handlers.rs
Line 101 in 6002560
oxc/crates/oxc_parser/src/lexer/comment.rs
Line 9 in 6002560
oxc/crates/oxc_parser/src/lexer/comment.rs
Line 25 in 6002560
oxc/crates/oxc_parser/src/lexer/numeric.rs
Line 30 in 6002560
Specification of these tokens can be found at https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar
To get things working, you only need to implement for the architecture of your development machine.
For example
Some of these are easy, @overlookmotel drafted some algorithms:
Multi-line comments:

- Search for `*/`, `\r`, `\n`, or `0xE2` (1st byte of either LS or PS).
- Once a line break has been found, search only for `*/` or `0xE2` (as it's not relevant whether there's further line breaks now).
- Handle `0xE2` on a `#[cold]` branch, as irregular line breaks (and other Unicode characters starting with `0xE2`) are rare.

Whitespace:

- Search for space, `\t`, or bytes `>= 128` (Unicode).
- Handle the Unicode case on a `#[cold]` branch which checks if the char is irregular whitespace or not.

i.e. in both, assume input will be ASCII and chew through it as fast as possible. De-opt to the slow path for Unicode cases which do need to be handled, but should be rare in practice.
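A scalar sketch of the multi-line comment algorithm above, before any SIMD. The function name and return shape are made up for illustration; oxc's real lexer differs:

```rust
/// Scan the body of a `/* ... */` comment (input starts just after the
/// opening `/*`). Looks for `*/`, `\r`, `\n`, or `0xE2` (lead byte of
/// LS/PS). Returns (bytes consumed including the closing `*/`, whether
/// a line break was seen), or None if the comment is unterminated.
fn skip_multi_line_comment(src: &[u8]) -> Option<(usize, bool)> {
    let mut newline = false;
    let mut i = 0;
    while i < src.len() {
        match src[i] {
            b'*' if src.get(i + 1) == Some(&b'/') => return Some((i + 2, newline)),
            b'\r' | b'\n' => newline = true,
            // LS (E2 80 A8) / PS (E2 80 A9) count as line breaks too;
            // in the real lexer this case would go on a #[cold] path.
            0xE2 => {
                if let Some(&[0x80, 0xA8 | 0xA9]) = src.get(i + 1..i + 3) {
                    newline = true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    None // unterminated comment
}
```

In a SIMD version, only the search for those four byte values is vectorized; this scalar logic remains as the handler once a candidate byte is found.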
Please leave a comment if you'd like to participate.