Chrome and curl both report it takes about 1100ms to load the linked page's HTML, split about 50/50 between establishing a connection and fetching content. I'm not sure how the implementation works internally, but that seems like a long time for a site served from memory and aiming to be "high-performance". The images bring the total time up to around 5.7s.
As a point of comparison, my site (nginx serving static content, on the 0.25 CPU GCP instance) serves the index page in 250ms. Of that, ~140ms is connection setup (DNS, TCP, TLS). The whole page loads in < 1000ms.
https://i.imgur.com/X4LDbWj.png
https://i.imgur.com/Ccwzmgz.png
One thing to remember is that when a server like nginx serves static content, it's often serving it from the page cache (memory). The author of Varnish has written at some length about the benefits of using the OS page cache, for example <https://varnish-cache.org/docs/trunk/phk/notes.html>. Some of the same principles can be applied even for servers that render dynamically (by caching expensive fragments).
You removed the CDN and the site got slower?
How do you know your site was the one that was fast and not just the CDN? I.e., the CDN should have added a lot of extra hops and made things slower.
To me, this implies the Rust code is very poor at opening and closing connections, so the CDN's keep-alive is papering over that issue.
edit: web.dev measure gave this blog post url a performance score of 30/100 which is quite poor.
I would have liked to see the actual results from this comparison: "I compared my site to Nginx, openresty, tengine, Apache, Go's standard library, Warp in Rust, Axum in Rust, and finally a Go standard library HTTP server that had the site data compiled into ram."
Everyone thought it was amazing even though it was just a dumb http server returning pages[req.path] :-) Latency was under 10ms which was pretty amazing for a 2012 KVM VPS.
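For anyone who hasn't seen one, the "pages[req.path]" style of server fits in a few lines. This is only a sketch in Python using the standard library; the `PAGES` contents, the `lookup` helper and the port are made up for illustration and have nothing to do with the original 2012 server.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical site content, held entirely in memory.
PAGES = {
    "/": b"<h1>home</h1>",
    "/about": b"<h1>about</h1>",
}


def lookup(pages, path):
    """Return (status, body) for a request path, ignoring any query string."""
    body = pages.get(path.split("?", 1)[0])
    if body is None:
        return 404, b"not found"
    return 200, body


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The whole "routing layer" is one dictionary lookup.
        status, body = lookup(PAGES, self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Block forever serving PAGES on localhost (port is an arbitrary choice)."""
    ThreadingHTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` starts it. There is no disk access on the request path at all, which is the whole trick.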
> And when I say fast, I mean that I have tried so hard to find some static file server that could beat what my site does. I tried really hard. I compared my site to Nginx, openresty, tengine, Apache, Go's standard library, Warp in Rust, Axum in Rust, and finally a Go standard library HTTP server that had the site data compiled into ram. None of them were faster, save the precompiled Go binary (which was like 200 MB and not viable for my needs). It was hilarious. I have accidentally created something so efficient that it's hard to really express how fast it is.
In the real world, use Go, Node, etc.
There has to be a point of diminishing returns. And again, I'm not discarding the dev side of things, but it seems like a lot of extra tooling and complexity for not much gain.
I am too much of an OCD perfectionist and don't have the guts to ship this often.
I have CDO too but I work around it by sheer trolling with infrastructure, like my hacked up to hell CDN: https://xeiaso.net/blog/xedn
A lot of our (and in particular, my) best features come from relocating the boundaries between things, to make space for features that weren't considered in the original design. With monolithic systems we see this late in the lifecycle in the form of Conway's Law. If you stick this problem in front of the CI/CD mirror, it's painful to face. CI/CD argues that if something is difficult we should do it all the time so that it's routine (or stop doing it entirely).
However there's a conspicuous lack of tools and techniques to make that practical. The only one I really know of is service retirement (replace 2-3 services with 2 new, refactored services), and we don't have static analysis tools that can tell us deterministically when we can remove an API. We have to do it empirically, which is fundamentally on par with println debugging.
Seeing the initial comments here I think it would be better to go with the original title.
Great blog by the way :)
I'm not a guy, I'd prefer if you used they to refer to me, but she works too.
The PAM one was a really fun talk to write. I need to finish that postmortem on how that talk went wrong.
http://cppcms.com/wikipp/en/page/main
https://github.com/Tatoeba/tatowiki the wiki of tatoeba.org ( https://en.wiki.tatoeba.org/articles/show/main# ) is written in it
If it's amd64, long mode requires a page table. Otherwise, a page table is handy so you can get page faults for null pointer dereferencing. Of course, you could do that only for development, and let production run without a page table.
My hobby OS can almost fill your needs though, but the TCP stack isn't really good enough yet (I'm pretty sure I haven't fixed retransmits after I broke them, no selective ack, probably icmp path mtu discovery is broken, certainly no path mtu blackhole detection, ipv4 only, etc), and I only support one realtek nic, cause it's what I could put in my test machine. Performance probably isn't great, but it's not far enough along to make a fair test.
I remember working in 2008 on a project for some geothermal devices that served IoT data on an HTML page "hardcoded" directly in the program's C code. The device used a Chinese 8051-like CPU, so there was no OS per se.
I don't think the author is claiming it is faster than a static site stored in memory, they're saying it is faster than a traditional static site that loads files from the disk. At least that's how I read it.
It can be a tiny amount more efficient, since an async disk IO implementation might dispatch the file read() call to a thread pool, wait for the result, and then send the data back to the client. That makes 2 extra context switches compared to sending data from memory. If the user is super confident that the data is hot and in the page cache, a synchronous disk read avoids the problem. Another option is trying a read with RWF_NOWAIT and only falling back to a thread pool if necessary.
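A rough sketch of the RWF_NOWAIT idea in Python, assuming Linux and a Python build that exposes `os.preadv` and `os.RWF_NOWAIT`. The names `read_cached` and `blocking_read` are made up for illustration, and in a real server the fallback would be submitted to a thread pool rather than called inline.

```python
import os


def blocking_read(path):
    """Slow path: an ordinary blocking read (a server would run this on a thread pool)."""
    with open(path, "rb") as f:
        return f.read()


def read_cached(path, fallback=blocking_read):
    """Try to read the whole file without blocking; use fallback if that is not possible."""
    nowait = getattr(os, "RWF_NOWAIT", None)
    if nowait is None or not hasattr(os, "preadv"):
        return fallback(path)  # platform does not offer the flag at all

    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        buf = bytearray(size)
        try:
            # Fails with EAGAIN if the data is not in the page cache, or with
            # another error if the filesystem does not support the flag.
            n = os.preadv(fd, [buf], 0, nowait)
        except OSError:
            n = -1
    finally:
        os.close(fd)

    if n == size:
        return bytes(buf)  # fully served from the page cache
    return fallback(path)  # cold, partially cached, or unsupported
```

A short read is treated the same as a miss here, which keeps the fast path simple at the cost of redoing a little work.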
On the other hand rendering a template on each request also requires CPU, which might be either more or less expensive than doing a syscall.
All in all, the efficiency differences are likely negligible unless you run a CDN which does thousands of requests per second.
In terms of throughput to the end user it will make zero measurable difference unless the box ran out of CPU.
Keeping everything in user space buffers might just be faster.
On the other hand, you're sending that sucker over the network, and what you save doing this is most likely best counted in microseconds/request. It's piss in the ocean compared to the delay introduced even over a local network.
I wonder if io_uring could be used to issue a single syscall that would read data from disk (actually using page cache) and send it on the network.
Of course, you could use DPDK or similar technologies to do the opposite - read the data from disk once and keep it in user-space buffers, then write it directly to NIC memory without another syscall. That should still theoretically be faster, since there would be 0 syscalls per request, where the other approach would require 1 per request.
200MB of pages and assets, sure. Code? No. If you compile it into the binary then the storage is no worse than having a small binary and all the resources separate.
Taking a statically generated site and returning the raw bytes is 100% faster. The author said so themselves.
If you did it that way, now all your content is basically mmapped into memory, which means (probably) fewer syscalls.
So it might've shaved off half a microsecond, maybe?
The individual site may be constructed individually (maybe) but it can only work if the society of people-who-use-the-internet all agree to follow a series of conventions about how websites work; you can't start using \<soul\> instead of \<body\> and expect everything to work as normal, because the reason the \<body\> tag is used to define the body of a page is because we needed a way to make sure people can use a webpage without having to define an entire new language for each one.
But no, a website is not a social construct because you don't have to have a society to have a website. I can have two machines connected and host an html file on one of them and stare at it on the other one all by myself and it will still be a website on a web! No contractual agreement is necessary!
But anyway, it's amazing that you posted on my comment! I am a huge fan!
Only if you don't care about HTTP/2 and TLS. And if you don't care about those, you can as well do sendfile() from a thread.
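For reference, the plain sendfile() path looks roughly like this in Python. It's a sketch assuming Linux and an unencrypted, already connected socket; `send_file` is a made-up helper name. The bytes never pass through user space, which is exactly why it doesn't combine with TLS done in the process.

```python
import os
import socket


def send_file(sock, path):
    """Send a whole file over a connected socket using os.sendfile.

    Returns the number of bytes sent. The kernel copies from the page cache
    to the socket, so the data is never read into this process.
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # file shrank underneath us; stop rather than spin
            offset += sent
    return offset
```

Run from a worker thread per connection, that is about all a static file server needs on the send side. Python sockets also have a higher-level `socket.sendfile()` method that wraps the same call.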
Cool article though. Agree on the ructe part, and I dislike how whitespace is handled. I wish Jade/Pug templates could be done in Rust, but I will check out Maud.
Is the Internet not connected internationally (US -> Europe for example) via cables underneath the ocean? Wouldn't "speed of light" apply to satellite links, i.e. actual light, and not to electric current?
Or is electricity flowing through a wire also "speed of light"?
Second of all, electrical signals in cables move at speeds slightly lower than c, but very close to it, so the speed of light is still a very good approximation of the possible upper bound.
Third of all, intercontinental cables are normally fiber optic, for several reasons. That is, they directly transmit light through the cable.
Fourth, it should be noted that electricity is closely related to light, since photons are the carrier particles of the electromagnetic field (when two charged particles interact, they are actually exchanging a photon). It's of course not visible light, but satellite communication also uses radio waves normally, which are not visible light either.
Finally, either through cables or through satellite communication, the distance/c minimum theoretical one-way latency is usually a significant under-estimation of the actually possible minimal latency, since the straight-line distance is significantly shorter than the actual cable/satellite-and-back distance that the signals must travel. The difference between straight-line and physical path distance is typically much larger than the difference between the theoretical speed of light and the actual speed of the electrical signal propagation.
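To put rough numbers on the distance/c bound, here is a small Python sketch. The two distances are hypothetical round figures and not measurements of any real route, and `velocity_factor` is an assumption you can tune (1.0 is vacuum; roughly two-thirds of c is a commonly quoted figure for light in glass fibre).

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum


def min_one_way_latency_ms(path_km, velocity_factor=1.0):
    """Lower bound on one-way latency for a signal travelling path_km.

    velocity_factor is the fraction of c at which the signal propagates.
    """
    return path_km / (C_KM_PER_S * velocity_factor) * 1000.0


# Hypothetical figures for illustration only.
straight_line_km = 6000.0  # great-circle distance between two cities
cable_km = 7500.0          # longer physical route the cable actually takes

for label, km, vf in [
    ("straight line at c", straight_line_km, 1.0),
    ("cable route at c", cable_km, 1.0),
    ("cable route in fibre", cable_km, 0.67),
]:
    print(f"{label}: {min_one_way_latency_ms(km, vf):.1f} ms one-way")
```

Doubling any of these gives the round-trip floor, before any routing, queuing or server time is added.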
Does Apache/Nginx/IIS load static files in memory ahead of time? I would assume no, unless someone went through and did some optimizations. Even so, there is always a point where memory runs out, and in that case a templating engine is essentially compression. I would assume if the author outputted his whole website as static files and stored them in memory it would be even faster, but that would require quite a bit more memory.
Linux loads them on the first usage. If you have enough memory, they'll just stay there. It doesn't take that much memory; most sites are pretty small.
But the article's way does use less memory and fewer system calls, and is completely optimized for that one site only. So yeah, it will surely be faster. Besides, his site appears to not be static.
Yes, but.
The problem with OS file caches has always been that people look at a box, see that the programs aren't consuming all of the available memory, and argue that they should be able to cram more shit on the box because it's 'underutilized'.
There are very reasonable and sane system architectures that let the OS handle caching, but you need a way to defend against these sorts of situations.
The performance falloff for this failure mode is exponential, so people try it a few times, and not getting any negative feedback, they add it to their toolbox only to get lectured months later once the bad behavior has not only become standard for them but also spread to other people.
It almost begs for a different system call that can earmark the memory usage by the app in a way that's easier for people to see.
Thanks!
You can also easily preload things into memory at boot yourself, so static websites usually don't serve files from disk.
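One way to read "preload" is simply reading every file once at startup. A minimal Python sketch, where `preload` and the URL-path convention are my own assumptions:

```python
import os


def preload(root):
    """Read every regular file under root into a dict keyed by URL-style path."""
    pages = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Turn the on-disk path into something like "/css/site.css".
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                pages["/" + rel] = f.read()
    return pages
```

Reading the files this way also pulls them into the OS page cache as a side effect, so even a server that keeps using regular file reads afterwards starts out warm.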
On an intuitive level, think of swap as being a place the kernel can put memory the program has written. When you malloc(4096) and write some bytes into it, the kernel can't evict that page to disk unless there's some swap space to stick it in. However, executables are different because they're already on disk -- the in-memory version is just a cache (everything is cache (computers have too many caches)). The kernel is allowed to drop the copy of the program it has in memory, because it can always read it back from the original executable.
[0] https://man7.org/linux/man-pages/man2/mlock.2.html
[1] https://ftp.gnu.org/old-gnu/Manuals/glibc-2.2.3/html_chapter...
There are a few relevant bits to this. You can pass MAP_POPULATE to mmap to prefault the page table entries for the file, and you can pass MAP_LOCKED to get MAP_POPULATE plus locking the pages in memory (unreliably). As mentioned in the man page for mmap, MAP_LOCKED has some failure modes that you don't get with mlock.
https://www.man7.org/linux/man-pages/man2/mmap.2.html
I also found this page: https://eklitzke.org/mlock-and-mlockall
Oh, and this: https://access.redhat.com/documentation/en-us/red_hat_enterp...
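A small Python sketch of the MAP_POPULATE part, assuming Linux and Python 3.10+ (older versions don't expose `mmap.MAP_POPULATE`, in which case this degrades to a plain mapping). `map_prefaulted` is a made-up name. It does not lock anything: the kernel is still free to drop the pages later, so this is not a substitute for mlock.

```python
import mmap
import os


def map_prefaulted(path):
    """Map a non-empty file read-only, asking the kernel to prefault its pages."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # MAP_POPULATE is only defined on Linux builds of Python 3.10+;
        # fall back to no extra flag elsewhere.
        flags = mmap.MAP_SHARED | getattr(mmap, "MAP_POPULATE", 0)
        # Length 0 maps the whole file; mmap keeps its own duplicate of fd.
        return mmap.mmap(fd, 0, flags=flags, prot=mmap.PROT_READ)
    finally:
        os.close(fd)
```

The returned object supports slicing like `bytes`, so a handler can write `m[start:end]` straight to a socket without a read() call per request.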
My examples are in Ruby, which is super slow compared to what you’re doing. Now I’m super curious what kind of performance you’d get globally on Fly if you deployed to a bunch of different regions.
Beyond that https://community.fly.io is the best place to get help with Nix on Fly since my abilities are exceeded. There's a Rust thread at https://community.fly.io/t/running-reproducible-rust-a-fly-a... that touches on nix flakes.