Disk I/O bottlenecks in GitHub Actions

Disk I/O bottlenecks in GitHub Actions (depot.dev)

106 points by jacobwg 1 year ago | 72 comments

jacobwg 1 year ago |

A list of fun things we've done for CI runners to improve CI:

- Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)

- Benchmarked EC2 instance types (m7a is the best x86 today, m8g is the best arm64)

- "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]

- Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)

- Configuring Docker with containerd/estargz support

- Just generally turning off kernel options and unit files that aren't needed

[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial...
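A minimal Python sketch of the "warming" idea from the list above, assuming that reading a block once is enough to make the storage layer fetch it ahead of the job. The names `warm_blocks` and `all_offsets`, the 1 MiB block size, and the choice of offsets are all hypothetical; this is not Depot's implementation, and on a real runner `path` would be a block device read with root privileges.

```python
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per read; an assumed value, tune for your volume


def warm_blocks(path, offsets, block_size=BLOCK_SIZE):
    """Read one block at each offset so the storage layer fetches it early.

    Returns the total number of bytes actually read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset in offsets:
            # os.pread reads at an absolute offset without moving the file position.
            data = os.pread(fd, block_size, offset)
            total += len(data)
    finally:
        os.close(fd)
    return total


def all_offsets(size, block_size=BLOCK_SIZE):
    """Offsets covering a whole file or device of `size` bytes."""
    return range(0, size, block_size)
```

Passing a short list of "priority" offsets instead of `all_offsets(...)` touches only those blocks, which is the trade-off the bullet describes: less time spent before the job starts, at the cost of cold reads for everything else.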

3np 1 year ago | |

> Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)

Are you not using a caching registry mirror, instead pulling the same image from Hub for each runner...? If so that seems like it would be an easy win to add, unless you specifically do mostly hot/unique pulls.

The more efficient answer to those rate limits is almost always to pull fewer times for the same work rather than scaling in a way that circumvents them.

jacobwg 1 year ago | | |

Today we (Depot) are not, though some of our customers configure this. For the moment at least, the ephemeral public IP architecture makes it generally unnecessary from a rate-limit perspective.

From a performance / efficiency perspective, we generally recommend using ECR Public images[0], since AWS hosts mirrors of all the "Docker official" images, and throughput to ECR Public is great from inside AWS.

[0] https://gallery.ecr.aws/

philsnow 1 year ago | |

> Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)

I'm slightly old; is that the same thing as a ramdisk? https://en.wikipedia.org/wiki/RAM_drive

jacobwg 1 year ago | | |

Exactly, a ramdisk-backed writeback cache for the root volume for Linux. For macOS we wrote a custom nbd filter to achieve the same thing.

seanlaff 1 year ago | |

The ramdisk that overflows to a real disk is a cool concept that I didn't previously consider. Is this just clever use of bcache? If you have any docs about how this was set up I'd love to read them.

yencabulator 1 year ago | |

> - Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)

Every Linux kernel does that already. I currently have 20 GB of disk cached in RAM on this laptop.
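For anyone who wants to check the same number on their own machine, here is a small Python sketch that reads the page cache size from `/proc/meminfo`-style text. The function names are hypothetical, and it assumes the usual `Name:   value kB` line layout.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def page_cache_gib(text):
    """Return the size of the page cache (the "Cached" field) in GiB."""
    return parse_meminfo(text)["Cached"] / (1024 * 1024)
```

On a Linux machine, `page_cache_gib(open("/proc/meminfo").read())` gives the figure being described here.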

jiocrag 1 year ago | |

have you tried Buildkite? https://buildkite.com

ValdikSS 1 year ago |

`apt` installation can easily be sped up with `eatmydata`: `dpkg` calls `fsync()` on all the unpacked files, which is very slow on HDDs, and `eatmydata` hacks it out.
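To get a feel for how much a per-file `fsync()` costs on a given disk, here is a rough Python sketch that writes the same small files with and without it. The helper names, file count, and file size are arbitrary assumptions, and this only imitates the write pattern; it is not what `dpkg` or `eatmydata` actually run.

```python
import os
import tempfile
import time


def write_files(directory, count, payload, sync):
    """Write `count` small files, calling fsync on each when `sync` is True.

    Returns the elapsed time in seconds.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"file{i}")
        with open(path, "wb") as f:
            f.write(payload)
            if sync:
                f.flush()             # push Python's buffer to the kernel
                os.fsync(f.fileno())  # ask the kernel to push it to the device
    return time.perf_counter() - start


def compare(count=200, size=4096):
    """Time both variants in separate temporary directories."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
        return {
            "fsync": write_files(d1, count, payload, sync=True),
            "no_fsync": write_files(d2, count, payload, sync=False),
        }
```

The gap between the two timings depends heavily on the device and filesystem, so treat the result as a local measurement rather than a general number.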

crmd 1 year ago |

This is exactly the kind of content marketing I want to see. The IO bottleneck data and the fio scripts are useful to all. Then at the end a link to their product which I’d never heard of, in case you’re dealing with the issue at hand.

kylegalbraith 1 year ago | |

Thank you for the kind words. We’re always trying to share our knowledge even if Depot isn’t a good fit for everyone. I hope the scripts get some mileage!

suryao 1 year ago |

TLDR: disk is often the bottleneck in builds. Use `fio` to measure the performance of the disk.
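As a sketch of what that can look like from a script, the Python below builds a `fio` random-read command and pulls IOPS and bandwidth out of its JSON output. The job name, block size, queue depth, and runtime are assumed values and not the parameters from the article's scripts; the field names follow fio's JSON output format, where `bw` is reported in KiB/s.

```python
import json
import shutil
import subprocess


def fio_command(filename, rw="randread", block_size="4k", size="1G", runtime_s=30):
    """Build a fio invocation; every parameter value here is an illustrative default."""
    return [
        "fio",
        "--name=ci-disk-check",
        f"--filename={filename}",
        f"--rw={rw}",
        f"--bs={block_size}",
        f"--size={size}",
        "--ioengine=libaio",
        "--iodepth=32",
        "--direct=1",            # bypass the page cache so the disk is measured
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]


def summarize(fio_json, direction="read"):
    """Extract IOPS and bandwidth for the first job from fio's JSON output."""
    job = json.loads(fio_json)["jobs"][0][direction]
    return {"iops": job["iops"], "bw_kib_s": job["bw"]}


def run(filename):
    """Run fio against `filename` and return the summary (requires fio installed)."""
    if shutil.which("fio") is None:
        raise RuntimeError("fio is not installed")
    out = subprocess.run(fio_command(filename), check=True,
                         capture_output=True, text=True)
    return summarize(out.stdout)
```

For write tests, pass `rw="randwrite"` to `fio_command` and `direction="write"` to `summarize`.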

If you want to truly speed up builds by optimizing disk performance, there is no shortcut around physically attaching NVMe storage with high throughput and high IOPS directly to your compute.

That's what we do at WarpBuild[0] and we outperform Depot runners handily. This is because we do not use network attached disks which come with relatively higher latency. Our runners are also coupled with faster processors.

I love the Depot content team though, it does a lot of heavy lifting.

[0] https://www.warpbuild.com

miohtama 1 year ago |

If you can afford it, upgrade your CI runners on GitHub to the paid offering. Highly recommended: less coffee drinking, more instant unit test results. Pay as you go.

striking 1 year ago | |

As a Depot customer, I'd say if you can afford to pay for GitHub's runners, you should pay for Depot's instead. They boot faster, run faster, and are a fraction of the price. And they are lovely people who provide amazing support.

kylegalbraith 1 year ago | |

This is what we focus on with Depot. Faster builds across the board without breaking the bank. More time to get things done and maybe go outside earlier.

Trading Strategy looks super cool, by the way.

crohr 1 year ago |

I'm maintaining a benchmark of various GitHub Actions providers regarding I/O speed [1]. Depot is not present because my account was blocked but would love to compare! The disk accelerator looks like a nice feature.

[1]: https://runs-on.com/benchmarks/github-actions-disk-performan...

larusso 1 year ago |

So I had to read to the end to realize it’s a kinda infomercial. Ok fair enough. Didn’t know what depot was though.

nodesocket 1 year ago |

I just migrated multiple ARM64 GitHub Actions Docker builds from my self-hosted runner (a Raspberry Pi in my homelab) to Blacksmith.io and I’m really impressed with the performance so far. The only downside is that there's no Docker layer and image cache like I had on my self-hosted runner, but I can’t complain on the free tier.

adityamaru 1 year ago | |

Have you checked out https://docs.blacksmith.sh/docker-builds/incremental-docker-...? This should help set up a shared, persistent Docker layer cache for your runners.

nodesocket 1 year ago | | |

Thanks for sharing. I have a custom bash script which currently does the Docker builds, and swapping to useblacksmith/build-push-action would take a bit of refactoring which I don't want to spend the time on now. :-)

kayson 1 year ago |

Bummer there's no free tier. I've been bashing my head against an intermittent CI failure problem on GitHub runners for probably a couple of years now. I think it's related to the networking stack in their runner image and the fact that I'm using Docker in Docker to unit test a Docker firewall. While I do appreciate that someone at GitHub did actually look at my issue, they totally missed the point. https://github.com/actions/runner-images/issues/11786

Are there any reasonable alternatives for a really tiny FOSS project?

r3tr0 1 year ago |

we are working on a platform that lets you measure this stuff in real time for free.

you can check us out at https://yeet.cx

we also have an anonymous guest sandbox you can play with

https://yeet.cx/play

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes. Typically, filesystems do not flush buffers when a file is closed. If you need to be sure that the data is physically stored on the underlying disk, use fsync(2). (It will depend on the disk hardware at this point.)
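A minimal Python sketch of the pattern that passage describes: flush and `fsync` before relying on the data being on disk, instead of trusting `close` alone. The function name `durable_write` is hypothetical, and the directory sync step is an addition based on the fsync(2) note that syncing a file does not necessarily persist its directory entry; it assumes a POSIX system.

```python
import os


def durable_write(path, data):
    """Write bytes to `path` and return only after fsync has been requested."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> kernel buffer cache
        os.fsync(f.fileno())  # kernel buffer cache -> storage device
    # Leaving the `with` block closes the file; close alone would not
    # have forced the write out of the kernel's buffers.

    # Sync the parent directory so the new directory entry is persisted too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

As the quoted text says, what happens after `fsync` returns still depends on the disk hardware, so this narrows the window for data loss rather than eliminating it.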