Skip to content

Commit

Permalink
refactor(parser): catch all illegal UTF-8 bytes (#2415)
Browse files Browse the repository at this point in the history
Catch all illegal UTF-8 bytes with the `UER` byte handler.

From https://datatracker.ietf.org/doc/html/rfc3629:

> The octet values C0, C1, F5 to FF never appear.

This change *should* make no difference at all, as a valid `&str` may not contain any of these byte values anyway. But it's possible if user has e.g. created the string with `str::from_utf8_unchecked` and not obeyed the safety contraints. This will at least contain the damage if that's happened, and panic rather than lead to UB. And since we're already catching other error conditions, may as well catch them all.
  • Loading branch information
overlookmotel authored Feb 16, 2024
1 parent c6767fa commit cc2ddbe
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions crates/oxc_parser/src/lexer/byte_handlers.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,10 @@ static BYTE_HANDLERS: [ByteHandler; 256] = [
UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, // 9
UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, // A
UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, // B
UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, // C
UER, UER, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, // C
UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, // D
UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, // E
UNI, UNI, UNI, UNI, UNI, UNI, UNI, UNI, UER, UER, UER, UER, UER, UER, UER, UER, // F
UNI, UNI, UNI, UNI, UNI, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, UER, // F
];

/// Macro for defining a byte handler.
Expand Down Expand Up @@ -679,11 +679,11 @@ byte_handler!(UNI(lexer) {
lexer.unicode_char_handler()
});

// UTF-8 continuation bytes (128-191) (i.e. middle of a multi-byte UTF-8 sequence)
// + and byte values which are not legal in UTF-8 strings (248-255).
// `handle_byte()` should only be called with 1st byte of a valid UTF-8 char,
// UTF-8 continuation bytes (0x80 - 0xBF) (i.e. middle of a multi-byte UTF-8 sequence)
// + and byte values which are not legal in UTF-8 strings (0xC0, 0xC1, 0xF5 - 0xFF).
// `handle_byte()` should only be called with 1st byte of a valid UTF-8 character,
// so something has gone wrong if we get here.
// https://en.wikipedia.org/wiki/UTF-8
// https://datatracker.ietf.org/doc/html/rfc3629
// NB: Must not use `ascii_byte_handler!` macro, as this handler is for non-ASCII bytes.
byte_handler!(UER(_lexer) {
unreachable!();
Expand Down

0 comments on commit cc2ddbe

Please sign in to comment.