I wrote a forgettable llama implementation for https://github.com/LaurentMazare/tch-rs (Rust bindings for PyTorch's libtorch). Still not ideal, but at least you get the same GPU performance you would get with PyTorch.
...And then I spotted Candle, a new ML framework by the same author: https://github.com/huggingface/candle
It's all in Rust and self-contained. A huge undertaking, but it looks very promising. They already have a llama2 example!
For timing benchmarks, use Instant or a similar monotonic clock instead of SystemTime.
The original C code makes the same mistake, using CLOCK_REALTIME instead of CLOCK_MONOTONIC.
This means the benchmarks will be wrong if the program runs while NTP is adjusting the clock. This can happen right after the system gets internet access, or periodically when it checks for skew. Some systems also blend NTP corrections in slowly, which means 1 second of calendar time is not 1 second of monotonic time over a long period.
At least the realtime clock won't be affected by daylight saving. But it's not airtight.
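To make the Instant suggestion concrete, here is a minimal sketch of a timing helper. The names `time_it` and `tokens_per_second` and the stand-in workload are made up for illustration, they are not from the project being discussed.

```rust
use std::time::{Duration, Instant};

/// Runs `work` once and returns its result together with the elapsed time,
/// measured on the monotonic clock behind `Instant`.
fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}

/// Converts a token count and an elapsed duration into tokens per second.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

fn main() {
    // Stand-in workload; a real benchmark would run the model's forward pass here.
    let (sum, elapsed) = time_it(|| (0..1_000_000u64).sum::<u64>());
    println!("sum = {sum}, elapsed = {elapsed:?}");

    // 256 is a hypothetical token count, only there to show the call.
    println!("{:.1} tok/s", tokens_per_second(256, elapsed));
}
```

Because `Instant` never goes backwards, `start.elapsed()` can't come out negative or jump when the wall clock gets stepped, which is the failure mode with SystemTime.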
like the file formats, all the extra files like the tokenizer.bin file, the terminology in the source comments, logits, transformers, etc.
Seeing a few uses of `unsafe`, a few of `expect`. Wonder if you can mmap the binary model in without unsafe??
Some operating systems do provide the proper guarantees to make mmap safe, but Rust decided it would be best to assume it's unsafe and maintain a uniform API. That is probably a good call: it is notoriously difficult to get a read-only mmap to be entirely safe on Linux.
https://docs.rs/mmap-rs/latest/mmap_rs/struct.MmapOptions.ht...
If you map the file into read-only memory and then get references to it, the underlying memory can still mutate (e.g. when something modifies the file itself).
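If the model fits in RAM, one way to avoid `unsafe` entirely is to not mmap at all and just read the file into an owned buffer. This is a different technique from mmap (it copies the data, so you lose lazy paging), and the sketch below assumes the weights are stored as raw little-endian f32 values. The function names and the demo file name are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Decodes a byte slice as little-endian f32 values.
/// A trailing partial chunk (fewer than 4 bytes) is ignored.
fn decode_f32_le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Reads the whole file into memory and decodes it. No `unsafe` involved:
/// the buffer is owned by this process, so later changes to the file
/// cannot alter data we already hold references to.
fn load_weights(path: &Path) -> io::Result<Vec<f32>> {
    let bytes = fs::read(path)?;
    Ok(decode_f32_le(&bytes))
}

fn main() -> io::Result<()> {
    // Hypothetical demo file, written here so the example runs on its own.
    let path = std::env::temp_dir().join("weights_demo.bin");
    let raw: Vec<u8> = [1.0f32, -2.5, 3.25]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    fs::write(&path, &raw)?;

    let weights = load_weights(&path)?;
    println!("{weights:?}");

    fs::remove_file(&path)?;
    Ok(())
}
```

The trade-off is startup time and memory: the whole file is copied up front instead of being paged in on demand, which is exactly what mmap buys you in exchange for the `unsafe` block.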
StackOverflow| Benefits of header-only libraries: https://stackoverflow.com/questions/12671383/benefits-of-hea...