Show HN: High-speed UTF-8 validation in Rust (github.com)
Process the input (as quickly as possible) and never fail, but replace each invalid sequence of bytes with U+FFFD (bytes 0xEF 0xBF 0xBD).
Edit: It looks like Rust's std already does this, except that it does not preallocate a result buffer of exactly the right size: https://doc.rust-lang.org/src/alloc/string.rs.html#538
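For what it's worth, here is a minimal sketch of the "exactly correct size" idea using only std. The names `lossy_len` and `to_string_lossy_exact` are made up for illustration and are not part of std or of the linked crate. It assumes that a second pass over the input is an acceptable price for a single allocation, which may not hold for every workload.

```rust
/// Hypothetical helper: number of bytes the lossy result will need.
/// Every invalid sequence becomes U+FFFD, which is 3 bytes in UTF-8.
fn lossy_len(mut input: &[u8]) -> usize {
    let mut len = 0;
    loop {
        match std::str::from_utf8(input) {
            Ok(valid) => return len + valid.len(),
            Err(e) => {
                len += e.valid_up_to() + 3;
                match e.error_len() {
                    // Skip the invalid sequence and keep scanning.
                    Some(n) => input = &input[e.valid_up_to() + n..],
                    // Truncated sequence at the end of the input.
                    None => return len,
                }
            }
        }
    }
}

/// Hypothetical helper: lossy conversion into a buffer sized up front.
fn to_string_lossy_exact(input: &[u8]) -> String {
    let mut out = String::with_capacity(lossy_len(input));
    let mut rest: &[u8] = input;
    loop {
        match std::str::from_utf8(rest) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(e) => {
                let (valid, after) = rest.split_at(e.valid_up_to());
                // The prefix was just validated; re-checking keeps the sketch free of unsafe.
                out.push_str(std::str::from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match e.error_len() {
                    Some(n) => rest = &after[n..],
                    None => return out,
                }
            }
        }
    }
}
```

For the input `b"foobar\xcc"` this requests 9 bytes of capacity up front, so the string never has to grow while it is being filled.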
[0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...
The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4
http://openjdk.java.net/jeps/8261663
(I'm completely ignorant on the subject)
I really think the general push right now is towards stdsimd being the future for portability.
It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!
- https://doc.rust-lang.org/reference/linkage.html
- https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#call...
I have a SQL parsing library (shameless plug) that is 50x faster than any other Python implementation; it is just a super simple wrapper around a Rust crate.
The https://cxx.rs/ project is also a major crate for C++ interoperability.
https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...
tl;dr: it's not straightforward, mostly because the standard functionality lives in `core` which doesn't have access to OS support for CPU feature detection.
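To make the `core` vs `std` point concrete: the runtime check is exposed through the `is_x86_feature_detected!` macro, which lives in `std`, not `core`. Below is a rough sketch of what dispatching on it could look like. `best_simd_level` and `validate` are hypothetical names, and the SIMD kernels are stood in for by `std::str::from_utf8`, so this only shows the shape of the dispatch, not how any particular crate does it.

```rust
/// Hypothetical helper: the widest x86 SIMD extension detected at runtime,
/// or "scalar" on other architectures or older CPUs.
fn best_simd_level() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // These macros need std because detection relies on OS/CPU support.
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    "scalar"
}

/// Hypothetical entry point: pick an implementation once per call.
fn validate(input: &[u8]) -> bool {
    match best_simd_level() {
        // A real crate would call #[target_feature]-annotated kernels in
        // the "avx2" and "sse4.2" arms; this sketch falls back to std everywhere.
        _ => std::str::from_utf8(input).is_ok(),
    }
}
```

Code inside `core` cannot make this kind of check itself, which is the difficulty described above.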
The algorithm is the one from simdjson; the main difference is that it uses an extra step at the beginning to align reads to the SIMD block size.
I didn't really understand this part. Aligned to what? To the cache line? SIMD always reads the block size, unless I am missing something here.
The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library to reduce the need for CPU-specific code.
Looks like it just uses the size of the original slice. If the average broken chunk is less than three bytes (maybe quite common?) then it'll have to grow the buffer, at least doubling it.
>> let bytestring = b"foobar\xcc";
>> bytestring.len()
7
>> let cleaned = String::from_utf8_lossy(bytestring).into_owned();
>> cleaned.len()
9
>> cleaned.capacity()
14
It's less of an issue than it used to be, the penalty for unaligned access has steadily been reduced by newer CPU architectures, but it's still there.
[1] https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
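As an illustration of what "aligning reads to the SIMD block size" can mean, here is a small sketch that splits a byte slice into an unaligned head, a body that starts on a block boundary, and a tail. `BLOCK` and `split_aligned` are assumptions for the example (32 bytes would match an AVX2 register); the crate under discussion may well do this differently.

```rust
/// Assumed SIMD block size in bytes (32 would match an AVX2 register).
const BLOCK: usize = 32;

/// Hypothetical helper: returns (head, body, tail) where `body` starts at an
/// address that is a multiple of BLOCK and its length is a multiple of BLOCK.
fn split_aligned(input: &[u8]) -> (&[u8], &[u8], &[u8]) {
    // Bytes to skip before the address becomes a multiple of BLOCK.
    // Clamped so short inputs end up entirely in `head`.
    let offset = input.as_ptr().align_offset(BLOCK).min(input.len());
    let (head, rest) = input.split_at(offset);
    // Keep only whole blocks in the body; the remainder is the tail.
    let body_len = rest.len() - rest.len() % BLOCK;
    let (body, tail) = rest.split_at(body_len);
    (head, body, tail)
}
```

The head and tail would go through a slower path (scalar code or unaligned loads), while every block in the body can be read with aligned loads.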
It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.
As a result, if gibberish sneaks into your system somehow and gets turned into U+FFFD, it's much less likely to cause something important to break elsewhere.
And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.