Writing a file system from scratch in Rust(blog.carlosgaldino.com) |
Writing a file system from scratch in Rust(blog.carlosgaldino.com) |
Then I implemented a little FUSE driver in Python to read the disk image from the host system and it was wonderful to mount it the first time and see the files! https://github.com/vinc/moros-fuse
PDF link: http://www.nobius.org/practical-file-system-design.pdf
Warning: Terribly written. many hacks.
I'm not asking about the structure or how it's organized. I mean... is the filesystem in a file or... how?
Background: I mostly do embedded stuff so at a glance I would have expected low level primitives (like, HW interactions, registers and stuff) but I see none. So maybe, my expectation, when tacking a problem, of interacting with the HW directly, does not stand in modern environments.
Even better, but unrelated question... how the heck does a x86 OS request data from the HDD?
As for how you actually request data from the hard drive: There's older ATA interfaces, and BIOS routines from them, which I suspect is what most hobbyist OSes would use.
A more modern interface is AHCI. The OSDev wiki has an overview, where you can see how the registers work: https://wiki.osdev.org/AHCI
Entirely too short summary: Use PCI to discover the various devices attached to the CPU. One or more of them are AHCI or NVMe devices. The AHCI and NVMe standards each describe sets of memory-mapped configuration registers and DMA engines. Eventually, you get to a point where you can describe linked lists of transactions to be executed that are semantically similar to preadv, pwritev, and so on.
There's tons of info on osdev.org, such as https://wiki.osdev.org/AHCI
There was a lot of discussion in the past around TFS https://github.com/redox-os/tfs, my understanding is that effort has kinda lost steam.
Actually everything around "Redox" looks like:
I once made a 'file system' to mount cpio archives (read-only) in an embedded system. Cpio is an extremely simple format to generate and edit (in code) and mounting it directly was very effective.
https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf
There are definitely some significant benefits you can get from managing your own storage, rather than using a filesystem.
It isn't just about performance gains, which are substantial, it also greatly simplifies the design and code by eliminating edge cases, undesirable behaviors, and variability in behavior across different deployment environments.
It's a nice ambitious goal which can really drive language and library design.
How soon until someone builds an Operating System developed in Rust? Maybe make it microkernel-based this time.
Redox[1] has been around for almost as long as Rust has. I first heard about it 4-5 years ago.
They had an interesting competition a while back challenging people to figure out how to crash it.
What’s the progress and potential of Redox?
One reason is that HDDs simply don't have a byte-wise resolution, so there's little point talking to HDDs in sub-sector units. Sectors are usually 512 bytes to 4k.
A second reason is being able to simply address the drive. Using 32b indices, if you index bytewise you're limited to 4GB which was available in the early 90s. With 512 bytes blocks, you get an addressing capacity of 2TB, and with 4k blocks (the AF format), you get 16TB. In fact I remember 'round the late 90s / early aught we'd reformat our drives using higher-size blocks because the base couldn't see the entire thing.
Sufficient explanation for code. Now why is it that disks lack byte-wise resolution?
HDDs also have small "gaps" that have headers to locate the sectors (in the distant past you could do a low-level format to correct these gaps as well).
SSDs do not have these gaps but they do need the ECC.
People would even run defragmenters, that would rearrange file blocks so they were stored continously and the whole file could be read without seeeking.
So the inode says "The data for my file starts at block 72 and is 3 blocks long" (or something like that). The disk then goes there, and reads blocks 72,73,74.
Each block is 4KiB large often, so if you have a 10KiB file, you still take up ceiling(file size/block size) blocks.
That's why there is a difference between "File size" and "Size on disk" when you look at disk usage summaries.
I think it'd be viable for an enterprise-y database to do IO directly over NVMe. Imagine the efficiency and throughput gains you could get from a database that (1) has a unified view of memory allocation in the system (2) directly performs its page-level IO on the storage devices.
Overhead.
Each disk sector is not 512 bytes (or 4096 bytes in modern disks), it's a bit more. It needs at least the sector number (to detect when the head is flying over the correct sector), a CRC or error-correcting code, a gap of several hundred bits before the next sector (so writing to one sector won't overwrite the next one), and a preamble with an alternating pattern (to synchronize the analog to digital conversion). All this overhead would be a huge waste if the addressable unit were a single byte.
You could build a HDD with byte-wise resolution, which did a whole load of clever things internally to handle error correction (modern HDDs have far more complex error correction that just block wise checksums, they’re a piece of technological magic and analogue electronics).
But sure a device would cost far more, and offer no real benefits. Especially when you consider that reading a single byte in isolation is a very rare task, and it takes so long (compared to accessing a CPU cache) to find the data and transfer it that you might as well grab a whole load of bytes at the same time.
Byte-wise resolution in a HDD is like ordering rice by the grain in individual envelopes. Sure you could do it, but why the hell would you?
The hardware has to be able to read and write data from an arbitrary location and do some kind of checksum for error detection and they have to be able to efficiently transfer this data to the cpu.
When working with RAM your PC will transfer a cache line of memory from RAM to your CPU caches. Again this is a feature of hardware limitations and trading off granularity against speed.
Your CPU operates many time faster than RAM so it makes sense to request multiple bytes at once, then your RAM can access those bytes, and line them up on it’s output pins ready for your CPU to come back and read them into cache. This give you better bandwidth. (It’s a little more complicated than this because the bytes are transferred in serial, rather than in parallel, but digging into the details here is tad tricky).
On the granularity point, the more granular your memory access pattern is, the more bits you need to address that memory. In RAM that either means more physical pins and motherboard traces, or more time to communicate that address. It also means more transistors are needed to hold that address in memory while you’re looking it up. All of those things mean more money.
And before we start looking at memory map units, that let you do magical things like map a 64 bit address space into only 16GB of physical RAM. Again the greater the granularity, the more transistors your MMU needs to store the mapping, and thus more cost.
So really the reason we don’t have bit level addressing granularity is ultimately cost. There’s no reason you could build a machine that did that, it would just cost a fortune and wouldn’t provide any practical benefits. So engineers made trade off to build something more useful.
tl;dr This new tech might do to SSDs what SSDs did to hard drives. Might.
It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.
> This isn’t the reason about block addressing at all since it existed before paging was a thing.
I didn't say "blocks exist because pages exist", I said "blocks exist for similar reasons to why pages exist". How you thought that required temporal causality is beyond me. :/
> Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks.
OK, great: now you have to answer why, as you are just punting the answer. The answer, FWIW, is similar to the answer for why memory is divided into pages.
> Disks don’t have byte access interfaces, so it doesn’t make sense to use one.
But, of course, they could have a byte-access interface; in fact, that was in some sense the entire question: why don't they? Note, BTW, that RAM also doesn't have a byte-access interface: you have to access it by page. (And hell: many modern disks are just giant flash memories!)
The super high-level API for working with memory provided by the CPU as part of instructions lets you access it by the byte, but again: that's also true of disks and blocks, where the super high-level API for working with files absolutely lets you work with it accessing specific bytes.
> It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.
This is true, and is a partial answer to the original question. Essentially, a filesystem is serving the same problem space as a page table, but instead of mapping virtual pages of processes to physical pages it is mapping block-sized regions of files to blocks on disk.
In both cases, it is thereby beneficial to save some bits in that map. However, as pages of memory are usually 4k and blocks on disk are usually 512 bytes--though you see larger for both of these (64k pages are becoming popular)--that really isn't that many bits you are saving.
The real reason is that there is a cost to random access: the more entries in your page-table / filesystem block list the more space it takes and the harder it is to find the one you want; and, while you can "compress" ranges, that also comes with its own cost on random access.
On disks made out of spinning rust, this random access cost is particularly ridiculous, as it might require rotating a heavy piece of metal and physically adjusting magnets to achieve the specific offset you are looking for. You thereby never want to load a single byte at a time: you want to pull a full block.
There is also another interesting cost to random access on writes, which applies to some kinds of memory and applies to most kinds of disks, which is that you just don't have the precision required to write a single byte at a time: think about trying to isolate one byte using a magnet!
So the way these things work is that they have to read large subsets of data at one time and then write it back. If you can align your filesystem to match the blocks of a file along these write boundaries, you thereby decrease the amount of work the disk has to do.
(And, again: this is very similar to the work that memory managers are doing with respect to attempting to align pages of large allocations to make the virtual memory areas efficient in the page table, which, again, is pretty much a filesystem for memory with processes as files.)
You thereby also often find yourself attempting to tune your block/page size to that of the underlying hardware, and if you build higher-level data structures (such as databases using B-trees or dictionaries using RB-trees) you will want to align those at the higher-level as well.
(And it is thereby then also fascinating that it took as long as it did for people to realize that if you want to implement an in-memory tree that you should always prefer a B-tree--a data structure designed around disk blocks--over an RB-tree, for all of the same reasons.)
Blocks (and pages) thereby exist to help all layers of the system most efficiently take advantage of linear access patterns to get efficient parallel loading and caching to solve a problem that is otherwise subject to high random access costs and potentially-ridiculous fragmentation issues.