Disk I/O bottlenecks in GitHub Actions (depot.dev)
- Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
- Benchmarked EC2 instance types (m7a is the best x86 today, m8g is the best arm64)
- "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]
- Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)
- Configuring Docker with containerd/estargz support
- Just generally turning off kernel options and unit files that aren't needed
[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial...
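For readers wondering what "warming" looks like mechanically, here is a rough Python sketch of the idea of reading a set of priority blocks before the job starts. It is only an illustration under assumptions: `warm_blocks` and the offset list are hypothetical names, the comment above does not say how the priority blocks are chosen, and a real setup would read from the block device itself (possibly bypassing the page cache) rather than a regular file.

```python
import os
import tempfile


def warm_blocks(path: str, offsets: list[int], block_size: int = 1 << 20) -> int:
    """Read `block_size` bytes at each offset so those blocks are fetched up front."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset in offsets:
            # pread reads at an absolute offset without moving a file cursor.
            total += len(os.pread(fd, block_size, offset))
    finally:
        os.close(fd)
    return total


# Demo on an ordinary temp file standing in for the volume (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as demo:
    demo.write(b"\0" * 4096)
bytes_read = warm_blocks(demo.name, offsets=[0, 2048], block_size=1024)
os.unlink(demo.name)
```

The return value is just the number of bytes touched, which is handy for checking that the warm-up covered what you expected.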
Are you not using a caching registry mirror, instead pulling the same image from Hub for each runner...? If so that seems like it would be an easy win to add, unless you specifically do mostly hot/unique pulls.
The more efficient answer to those rate limits is almost always to pull fewer times for the same work rather than scaling in a way that circumvents them.
From a performance / efficiency perspective, we generally recommend using ECR Public images[0], since AWS hosts mirrors of all the "Docker official" images, and throughput to ECR Public is great from inside AWS.
I'm slightly old; is that the same thing as a ramdisk? https://en.wikipedia.org/wiki/RAM_drive
Every Linux kernel does that already. I currently have 20 GB of disk cached in RAM on this laptop.
If you want to truly speed up builds by optimizing disk performance, there are no shortcuts to physically attaching NVMe storage with high throughput and high IOPS to your compute directly.
That's what we do at WarpBuild[0] and we outperform Depot runners handily. This is because we do not use network attached disks which come with relatively higher latency. Our runners are also coupled with faster processors.
I love the Depot content team though, it does a lot of heavy lifting.
Trading Strategy looks super cool, by the way.
[1]: https://runs-on.com/benchmarks/github-actions-disk-performan...
Are there any reasonable alternatives for a really tiny FOSS project?
you can check us out at https://yeet.cx
we also have an anonymous guest sandbox you can play with
If you corrupt a CI node, whatever. Just rerun the step
edit: Or, even easier, just use the pre-built fail_function infrastructure (with retval = 0 instead of an error): https://docs.kernel.org/fault-injection/fault-injection.html
Actually in my experience with pulling very large images to run with docker it turns out that Docker doesn't really do any fsync-ing itself. The sync happens when it creates an overlayfs mount while creating a container because the overlayfs driver in the kernel does it.
A volatile flag to the kernel driver was added a while back, but I don't think Docker uses it yet https://www.redhat.com/en/blog/container-volatile-overlay-mo...
Other options are to use an overlay mount with volatile or ext4 with nobarrier and writeback.
Why in the world does it do that ????
Ok I googled (kagi). Same reason anyone ever does: pure voodoo.
If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
Kernel-level crashes, the only kind of crash that risks half-written files, are no more likely during dpkg than any other time. A bad update is the same bad update regardless, no better, no worse.
> If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
https://man7.org/linux/man-pages/man2/close.2.html
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel uses the buffer cache to
defer writes. Typically, filesystems do not flush buffers when a
file is closed. If you need to be sure that the data is
physically stored on the underlying disk, use fsync(2). (It will
depend on the disk hardware at this point.)
So if you want to wait until it's been saved to disk, you have to do an fsync first. If you even just want to know if it succeeded or failed, you have to do an fsync first.
Of course none of this matters much on an ephemeral GitHub Actions VM. There's no "on next boot or whatever". So this is one environment where it makes sense to bypass all this careful durability work that I'd normally be totally behind. It seems reasonable enough to say it's reached the page cache, it should continue being visible in the current boot, and tomorrow will never come.
You can get half-written files in many other circumstances, eg on power outages, storage failures, hw caused crashes, dirty shutdowns, and filesystem corruption/bugs.
(Nitpick: trusting the kernel to close() doesn't have anything to do with this, like a sibling comment says)
And about kernel-level crashes: yes, but you see, dpkg creates a new file on the disk, makes sure it is written correctly with fsync(), and then calls rename() (or something like that) to atomically replace the old file with the new one.
So there is never a possibility of a given file being corrupt during an update.
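A minimal Python sketch of the write, fsync(), rename() pattern described above. This is not dpkg's actual code (dpkg is written in C and the comment itself hedges with "or something like that"); `atomic_write` is a hypothetical helper name, and the directory fsync at the end assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's userspace buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to the device
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Dropping the two `os.fsync` calls keeps the replacement atomic from the point of view of other processes, but gives up the guarantee that the new contents survive a power loss. That is roughly the trade-off being argued about in this thread.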
Maybe they know something you don't ?????
Imagine this scenario; you're writing a CI pipeline:
1. You write some script to `apt-get install` blah blah
2. As soon as the script is done, your CI job finishes.
3. Your job is finished, so the VM is powered off.
4. The hypervisor hits the power button but, oops, the VM still had dirty disk cache/pending writes.
The hypervisor may immediately pull the power (chaos monkey style; developers don't have patience), in which case those writes are now corrupted. Or, it may use ACPI shutdown which then should also have an ultimate timeout before pulling power (otherwise stalled IO might prevent resources from ever being cleaned up).
If you rely on the sync happening at step 4 while the kernel shuts down gracefully, how long does the kernel wait before it decides that some shutdown timeout has occurred? How long does the hypervisor wait, and is it longer than the kernel would wait? Are you even sure that the VM shutdown command you're sending is the graceful one?
How would you fsync at step 3?
For step 2, perhaps you might have an exit script that calls `fsync`.
For step 1, perhaps you might call `fsync` after `apt-get install` is done.
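One way to read the step 1 and step 2 suggestions is "flush everything before the step reports success". Here is a rough Python sketch under that assumption: `run_step_then_sync` is a hypothetical wrapper name, and it uses `os.sync()` (a whole-system flush, like the `sync` command) rather than a per-file `fsync`, since the script does not know which files `apt-get` touched.

```python
import os
import subprocess
import sys


def run_step_then_sync(command: list[str]) -> int:
    """Run one CI step, then flush dirty pages before reporting completion."""
    result = subprocess.run(command, check=False)
    # os.sync() asks the kernel to write all dirty buffers to disk (Unix only).
    os.sync()
    return result.returncode


# Stand-in for something like `apt-get install ...` (hypothetical step).
exit_code = run_step_then_sync([sys.executable, "-c", "print('step done')"])
```

This moves the flush to a point the job controls, instead of depending on whatever the hypervisor does at step 4.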
(to be clear: my comment is sarcasm and web scale is a reference to a joke about reliability [0])
Unpacking the Docker image tarballs can be a bit expensive--especially with things like nodejs where you have tons of tiny files
Tearing down overlayfs is a huge issue, though
Any pulls doing this become zero cost for docker hub
Any sort of cache you put between docker hub and your own infra would probably be S3 backed anyway, so adding another cache in between could be mostly a waste
Trusting close() does not mean that the data is written all the way to disk. You don't care if or when it's all the way written to disk during dpkg ops any more than at any of the other infinite seconds of run time that aren't a dpkg op.
close() just means that any other thing that expects to use that data may do so. And you can absolutely bank on that. And if you think you can't bank on that, that means you don't trust the kernel to be keeping track of file handles, and if you don't trust the kernel to close(), then why do you trust it to fsync()?
A power rug pull does not worsen this. That can happen at any time, and there is nothing special about dpkg.
Until you answer how you've solved the "I want to make sure my data is written to disk before the hypervisor powers off the virtual machine at the end of the successful run" problem, then I claim that this is absolutely not voodoo.
I suggest you actually read the documentation of all of these things before you start claiming that `fsync()` is exclusively the purpose of kernel, driver, or bootloader developers.
I've had good success with machines that have NVMe storage (especially on cloud providers) but you still are paying the cost of fsync there even if it's a lot faster
The last bit (emphasis added) sounds novel to me, I don't think I've heard before of anybody doing that. It sounds like an almost-"free" way to get a ton of performance ("almost" because somebody has to figure out the sizing. Though, I bet you could automate that by having your tool export a "desired size" metric that's equal to the high watermark of tmpfs-like storage used during the CI run)
Linux page cache exists to speed up access to the durable store which is the underlying block device (NVMe, SSD, HDD, etc).
The RAM-backed block device in question here is more like tmpfs, but with an ability to use the disk if, and only if, it overflows. There's no intention or need to store its whole contents on the durable "disk" device.
Hence you can do things entirely in RAM as long as your CI/CD job can fit all the data there, but if it can't fit, the job just gets slower instead of failing.
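To make the "RAM first, disk only on overflow" behaviour concrete, here is a toy user-space analogue in Python. It is not how the block-level accelerator is implemented (that is not described here); `OverflowStore`, `ram_budget`, and `spill_dir` are hypothetical names. It also tracks the high watermark suggested upthread as a possible "desired size" metric.

```python
import os
import tempfile


class OverflowStore:
    """Keep blobs in RAM up to a budget; spill further blobs to a directory."""

    def __init__(self, ram_budget: int, spill_dir: str) -> None:
        self.ram_budget = ram_budget
        self.spill_dir = spill_dir
        self.in_ram: dict[str, bytes] = {}
        self.ram_used = 0
        self.total_stored = 0    # bytes currently held, RAM plus disk
        self.high_watermark = 0  # peak of total_stored over the store's lifetime

    def put(self, key: str, data: bytes) -> None:
        self.delete(key)  # replace any previous value for this key
        if self.ram_used + len(data) <= self.ram_budget:
            self.in_ram[key] = data
            self.ram_used += len(data)
        else:
            # Over budget: the write still succeeds, it is just slower.
            with open(self._path(key), "wb") as f:
                f.write(data)
        self.total_stored += len(data)
        self.high_watermark = max(self.high_watermark, self.total_stored)

    def get(self, key: str) -> bytes:
        if key in self.in_ram:
            return self.in_ram[key]
        with open(self._path(key), "rb") as f:
            return f.read()

    def delete(self, key: str) -> None:
        if key in self.in_ram:
            size = len(self.in_ram.pop(key))
            self.ram_used -= size
            self.total_stored -= size
        elif os.path.exists(self._path(key)):
            self.total_stored -= os.path.getsize(self._path(key))
            os.unlink(self._path(key))

    def _path(self, key: str) -> str:
        # Keys are assumed to be plain file names (no path separators).
        return os.path.join(self.spill_dir, key)


store = OverflowStore(ram_budget=8, spill_dir=tempfile.mkdtemp())
store.put("a", b"12345")  # fits in the 8-byte RAM budget
store.put("b", b"67890")  # would exceed the budget, so it goes to disk
```

After a run, `store.high_watermark` is the number you would export to size the RAM budget for next time.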
Consider a scenario where your VM has 4GB of RAM, but your build accesses a total of 6GB worth of files. Suppose your code interacts with 16GB of data, yet at any moment, its active working set is only around 2GB. If you preload all Docker images at the start of your build, they'll initially be cached in RAM. However, as your build progresses, the kernel will begin evicting these cached images to accommodate recently accessed data, potentially even files used infrequently or just once. And that's the key bit, to force caching of files you know are accessed more than once.
By implementing your own caching layer, you gain explicit control, allowing critical data to remain persistently cached in memory. In contrast, the kernel-managed page cache treats cached pages as opportunistic, evicting the least recently used pages whenever new data must be accommodated, even if this new data isn't frequently accessed.
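A small Python simulation of that difference, using a deliberately simplified LRU model (the real page cache is more elaborate than a single LRU list). `PinnedLRUCache` and the file names are hypothetical; the point is only that a one-time scan pushes a hot entry out of a plain LRU, while a cache with explicit pinning keeps it.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """LRU cache in which pinned keys are never chosen for eviction."""

    def __init__(self, capacity: int, pinned=()) -> None:
        self.capacity = capacity
        self.pinned = set(pinned)
        self.entries: OrderedDict[str, bool] = OrderedDict()  # oldest access first

    def access(self, key: str) -> bool:
        """Touch `key`; return True on a hit, False on a miss (then cache it)."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = True
        while len(self.entries) > self.capacity:
            # Evict the least recently used entry that is not pinned.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; overshoot rather than evict
            del self.entries[victim]
        return False


def hot_file_survives_scan(cache: PinnedLRUCache) -> bool:
    cache.access("base-image")         # hot file, loaded once up front
    for i in range(10):                # one-time scan of files used only once
        cache.access(f"scratch-{i}")
    return cache.access("base-image")  # is the hot file still cached?


plain = hot_file_survives_scan(PinnedLRUCache(capacity=4))
pinned = hot_file_survives_scan(PinnedLRUCache(capacity=4, pinned={"base-image"}))
```

With a capacity of 4, `plain` ends up as a miss and `pinned` as a hit.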
I believe various RDBMSs bypass the page cache and use their own strategies for managing caching if you give them access to raw block devices, right?
And if you yank the cord before the package is fully unpacked? Wouldn't that just be the same problem? Solving that problem involves simply unpacking to a temporary location first, verifying all the files were extracted correctly, and then renaming them into existence. Which actually solves both problems.
Package management is stuck in a 1990s idea of "efficiency" which is entirely unwarranted. I have more than enough hard drive space to install the distribution several times over. Stop trying to be clever.
Not the same problem: it's a half-written file vs. half of the files still being at the older version.
> Which actually solves both problems.
It does not, and you would have to guarantee that multiple rename operations are executed in a transaction. Which you can't, unless you have a really fancy filesystem.
> Stop trying to be clever.
It's called being correct and reliable.
Not strictly. You have to guarantee that after reboot you rollback any partial package operations. This is what a filesystem journal does anyways. So it would be one fsync() per package and not one per every file in the package. The failure mode implies a reboot must occur.
> It's called being correct and reliable.
There are multiple ways to achieve this. There are different requirements among different systems which is the whole point of this post. And your version of "correct and reliable" depends on /when/ I pull the plug. So you're paying a huge price to shift the problem from one side of the line to the other in what is not clearly a useful or pragmatic way.
It makes no sense to trust that fsync() does what it promises but not that close() does what it promises. close() promises that when close() returns, the data is stored and some other process may open() and find all of it verbatim. And that's all you care about or have any business caring about unless you are the kernel yourself.
In both scenarios, yes. This is what the dpkg database is for: it keeps info about the state of each package, whether it is installed, unpacked, configured and so on. It is required to handle the interrupted update scenario, no matter if it was interrupted during package unpacking or in the configuration stage.
So far you are just describing --force-unsafe-io from dpkg. It is called unsafe because you can end up with zeroed or 0-length files long after the package has been marked as installed.
> This is what a filesystem journal does anyways.
This is incorrect. And the filesystem journal is also irrelevant here.
A filesystem journal protects you from interrupted writes on the disk layer. You set some flag, you write to some temporary space called the journal, you set another flag, then you copy that data to your primary space, then you remove both flags. If something happens during that process you'll know, and you'll be able to recover because you know in which step you were interrupted.
Without a filesystem journal every power outage could result in not being able to mount the filesystem. The journal prevents that scenario. This has nothing to do with package managers, page cache or fsync.
Under Linux you do the whole write() + fsync() + rename() dance for every file, because this is the only way to avoid the scenario where you've written the new file, renamed it, marked the package as installed, and fsynced the package manager database, but the actual new file contents never left the page cache and now you have a bunch of zeroes on the disk. You have to fsync(). These are the semantics of the layer you are working with. No fsync(), no guarantee that the data is on the disk, even if you wrote it and closed the file hours ago and fsynced the package manager database.
> There are different requirements among different systems which is the whole point of this post.
Sorry, I was under the assumption that this thread is about dpkg and fsync and your idea of "solving the problem". I just wanted to point out that, no, package managers are not "trying to be clever" and are not "stuck in the 1990s". You can't throw fsync() out of the equation, reorder a bunch of steps, and call this "a solution".
https://wiki.debian.org/Teams/Dpkg/FAQ#Q:_Why_is_dpkg_so_slo...