Show HN: High-speed UTF-8 validation in Rust

Show HN: High-speed UTF-8 validation in Rust(github.com)

127 points by kryps 5 years ago | 45 comments

tialaramex 5 years ago |

One flavour I would expect to be valuable that isn't present here is this:

Process the input (as quickly as possible) and never fail, but replace each invalid sequence of bytes with U+FFFD (bytes 0xEF 0xBF 0xBD).

chrismorgan 5 years ago | |

One major problem with this: you can’t do it in place. Invalid byte sequences could be 1–4 bytes long, but U+FFFD is exactly three bytes long.

AjtevhihBudhirv 5 years ago | | |

There're still faster approaches than naïve and probably common "copy valid byte sequences one by one into a resizable result buffer". For instance, scan through the input bytes all at once, keeping track of position and length of valid sequences, then memcopy each valid sequence into a preallocated buffer.

Edit: Although, it looks like Rust's std already does this, except for preallocating an exactly correct size result buffer: https://doc.rust-lang.org/src/alloc/string.rs.html#538

cryptonector 5 years ago | | |

Use an ASCII character like `?` or, even better, ASCII SUB (0x1A, substitute), which is specifically intended for this sort of thing. Failing that, there are a number of other unused ASCII control characters like VT (vertical tab), FS (file separator), GS (group separator), US (unit separator), CAN (cancel), etc.

nezirus 5 years ago | |

Yeah, I'd like to see that as well, something like to_string_lossy(), but really fast, so I can call it whenever I expect occasional broken UTF-8 string (and don't care about few replaced characters).

nerdponx 5 years ago | |

I don't know, you are throwing away the length of the invalid sequence that way. Maybe useful in certain situations, but probably not in most.

loa_in_ 5 years ago | |

I bet one could modify this to save indexes of invalid chars and do what you say with just one additional copy

celeritascelery 5 years ago |

I find it interesting that with the state of rust SIMD many implementations just opted for reimplementing the code multiple times for each intrinsic (SSE, Neon, AVX, etc). One would think that one of the generic libraries (faster, simdeez, packed_simd) would start to see widespread use, but they all seem to have issues that have prevented adoption.

mooman219 5 years ago | |

I work on a SIMD optimized font library [0] and have stumbled into the same situation of hand writing SIMD intrinsics. Some things are just kinda hard to make sure they get optimized correctly, and there is enough difference between the platforms where that matters when fiddling with bits. I also kinda have fun writing SIMD code like this too.

[0]: https://github.com/mooman219/fontdue/blob/master/src/platfor...

monocasa 5 years ago | |

The various SIMD ISAs are different enough from eachother that you really want a custom implementation per ISA for perf reasons. Particularly if you're doing something off the beaten path like this rather than cranking through some vector math.

celeritascelery 5 years ago | | |

I don't know. Looking in their code, they have one for SSE and another for AVX2. The algorithm is identical expect for one is 128 and the other is 256 bits. It seems like something that would be easy to abstract over so you only had a single implementation (at least for x86 SIMD). But it is obviously a hard problem or it would be solved already.

Fronzie 5 years ago | | |

Even for vector math, there are subtle differences, such as the numerical precision of the hardware reciprocal, leading to different implementations for x/y.

The Eigen library did recently combine most SSE and AVX code paths, however: https://eigen.tuxfamily.org/index.php?title=3.4

pgwhalen 5 years ago | | |

Can you comment on the Vector API that Java is set to include? Do you think it's entirely misguided, or more of a best effort given what one can reasonably expect from Java?

http://openjdk.java.net/jeps/8261663

(I'm completely ignorant on the subject)

BelenusMordred 5 years ago | |

packed_simd is now packed_simd2, the original author is MIA and no one else has crate or repo permissions

I really think the general push right now is towards stdsimd being the future for portability.

jfk13 5 years ago |

I'd be curious how this compares with the encoding_rs crate (https://github.com/hsivonen/encoding_rs), which also offers UTF-8 validation (among many other things).

cryptonector 5 years ago | |

This is SIMD based. If the encoding_rs crate isn't SIMD based, then this thing will beat its pants off.

burntsushi 5 years ago | | |

It does use SIMD (when enabled). But also, the encoding_rs crate is doing more. It isn't just checking for validity. It's also doing transcoding and error detection.

TwoBit 5 years ago |

Given the current ubiquity of UTF-8, is there value in having native CPU instructions for it? Or do they exist somewhere already?

dahfizz 5 years ago | |

What would these cpu instructions do?

TwoBit 5 years ago | | |

Read or write code points quickly, with validation.

nerdponx 5 years ago |

Does Rust compile code that can be used/called from other languages with an FFI? That to me is always one of the persistent advantages of C, I don't have to fully understand how compile machine code works on a technical level, but there are lots of different languages that let me use C libraries in their own language runtimes.

It would be great to have things like high-performance unicode handling with consistent semantics across multiple languages!

hardwaresofton 5 years ago | |

Yes it does -- one of the great parts about Rust. There are lots of projects out there slotting rust into other languages and getting crazy speedups.

- https://doc.rust-lang.org/reference/linkage.html

- https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html#call...

jhgb 5 years ago | |

A few days ago I read https://dev.to/veer66/calling-rust-from-common-lisp-45c5 , which illustrates presumably what you're talking about.

Disp4tch 5 years ago | |

Yes, from the python side of things there are tools like py03 that make integrating rust into python code really painless.

I have a sql parsing library (shameless plug) that is 50x faster than any other python implementation, it is just a super simple wrapper around a rust crate.

https://github.com/wseaton/sqloxide

gliptic 5 years ago | |

Yes, you can expose things with the C ABI. This is essential for how it's used in Firefox.

kzrdude 5 years ago | |

Yes, Rust natively can expose C ffi functions (for example).

The https://cxx.rs/ project is also a major crate for C++ interoperability.

kouteiheika 5 years ago |

So are there any plans to eventually contribute this to the standard library, e.g. just like hashbrown was?

mrec 5 years ago | |

The author discusses this in the corresponding /r/rust thread:

https://www.reddit.com/r/rust/comments/mvc6o5/incredibly_fas...

tl;dr: it's not straightforward, mostly because the standard functionality lives in `core` which doesn't have access to OS support for CPU feature detection.

bombcar 5 years ago | | |

Does Rust allow for libraries to “override” core functionality when available?

meisel 5 years ago |

How does this compare, speed-wise, to the UTF-8 validation done in simdjson?

kryps 5 years ago | |

Check the benchmarks section (https://github.com/rusticstuff/simdutf8#Benchmarks), second table. simdutf8 is up to 28 % faster on my Comet Lake CPU. However with pure ASCII clang does something magical with simdjson and it beats my implementation by a lot. GCC-compiled simdjson is slower all around except for a few outliers with short byte sequences.

The algorithm is the one from simdjson, the main difference is that it uses an extra step in the beginning to align reads to the SIMD block size.

tyingq 5 years ago | |

If someone is wanting to make a code comparison, most of the UTF-8 validation in simdjson seems to be nicely summed up in this pull request that sped it up: https://github.com/simdjson/simdjson/pull/993/files

celeritascelery 5 years ago |

> The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson.

I didn't really understand this part. Aligned to what? to the cache line? SIMD always reads the block size. Unless I am missing something here.

unwind 5 years ago | |

I read it as "to the width of the SIMD registers" which I have seen in other quick scanners, but did not read the code here.

celeritascelery 5 years ago | | |

Aren’t all reads aligned to the width of the SIMD register? If I do an AVX512 command it will read 512 bits right?

parhamn 5 years ago |

I was wonder about what the ARM/M1 support for simd like instructions are. It seems like it will be a while for these simd packages to be as performant on ARM. Is this correct?

jsheard 5 years ago | |

ARM has it's own SIMD instruction sets (NEON and SVE) but the intrinsics have yet to be stabilized in Rust, so the Rust SIMD ecosystem is x86-centric for now.

The plan is for Rust to eventually have a portable SIMD abstraction built into the standard library to reduce the need for CPU-specific code.