Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dmesg parsing non-UTF-tolerant #78

Merged
merged 1 commit into from
Aug 9, 2020
Merged

Conversation

AzureCrimson
Copy link
Contributor

@AzureCrimson AzureCrimson commented Apr 28, 2018

When Python's bytes.decode() method encounters encounters a byte sequence
that cannot be decoded, it will take an action dependent on its second argument:
'strict': raise UnicodeDecodeError exception (default)
'replace': insert U+FFFD
'ignore': skip to next character

While most inputs appear to be (mostly) sanitized, dmesg output is passed to
_parse_dmesg() as is, and can contain data that escapes to invalid Unicode.
When the parser attempts to decode this data it immediately raises an exception
and dies, as seen in #77.

To prevent this issue, I set the error handling method to 'replace', as 'ignore' can
hide decoding errors from developers working with really broken dmesg logs.
The parts of dmesg the parser looks at should be in a standard format anyway,
so a U+FFFD (Replacement Character) or two after the timestamp shouldn't be
too harmful.

When Python's bytes.decode() method encounters encounters a byte sequence
that cannot be decoded, it will take an action dependent on its second argument:
'strict': raise UnicodeDecodeError exception (default)
'replace': insert U+FFFD
'ignore': skip to next character

While most inputs appear to be sanitized, dmesg output is passed to
_parse_dmesg() as is, and can contain data that escapes to invalid Unicode.
When the parser attempts to decode this data it immediately raises an exception
and dies, as seen in xrmx#77.

To prevent this issue, I set the error handling method to 'replace', as 'ignore' can
hide decoding errors from developers working with *really* broken dmesg logs.
The parts of dmesg the parser looks at should be in a standard format anyway,
so a U+FFFD (Replacement Character) or two after the timestamp shouldn't be
too harmful.
@xrmx xrmx merged commit 0ee59b5 into xrmx:master Aug 9, 2020
@xrmx
Copy link
Owner

xrmx commented Aug 9, 2020

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants