Working with Files Is Hard (2019) (danluu.com)
> When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LMDB, etc., know more about filesystems than the vast majority of programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why is the file API so hard to use that even experts make mistakes?
I think the short answer is that the APIs are bad. The POSIX fs APIs and associated semantics are so deeply entrenched in the software ecosystem (both at the OS level, and at the application level) that it's hard to move away from them.
Well I think that's the actual problem. POSIX gives you an abstract interface but it essentially does not enforce any particular semantics on those interfaces.
Sounds like Worse Is Better™: operating systems that tried to present safer abstractions were at a disadvantage compared to operating systems that shipped whatever was easiest to implement.
(I'm not an expert in the history, just observing the surface similarity and hoping someone with more knowledge can substantiate it.)
What about the Windows API? Windows is a pretty successful OS with a less leaky FS abstraction. I know it's a totally different deal than POSIX (files can't be devices etc), the FS function calls require a seemingly absurd number of arguments, but it does seem safer and clearer what's going to happen.
> They report on a single "vulnerability" in LMDB, in which LMDB depends on the atomicity of a single sector 106-byte write for its transaction commit semantics. Their claim is that not all storage devices may guarantee the atomicity of such a write. While I myself filed an ITS on this very topic a year ago, http://www.openldap.org/its/index.cgi/Incoming?id=7668 the reality is that all storage devices made in the past 20+ years actually do guarantee atomicity of single-sector writes. You would have to rewind back to 30 years at least, to find a HDD where this is not true.
So this is a case where the programmers of LMDB thought about the "incorrect" use and decided that it was a calculated risk to take because the incorrectness does not manifest on any recent hardware.
This is analogous to the case where someone complains some C code has undefined behavior, and the developer responds by saying they have manually checked the generated assembler to make sure the assembler is correct at the ISA level even though the C code is wrong at the abstract C machine level, and they commit to checking this in the future.
Furthermore both the LMDB issue and the Postgres issue are noted in the paper to be previously known. The paper author states that Postgres documents this issue. The paper mentions pg_control so I'm guessing it's referring to this known issue here: https://wiki.postgresql.org/wiki/Full_page_writes
> We rely on 512 byte blocks (historical sector size of spinning disks) to be power-loss atomic, when we overwrite the "control file" at checkpoints.
Yeah, sounds about right about quite a lot of the C programmers except for the "they commit to checking this in the future" part. I've seen responses like "well, don't upgrade your compiler; I'm gonna put 'Clang >= 9.0 is unsupported' in the README as a fix".
Because it was poorly designed, and there is a high resistance to change, so those design mistakes from decades ago continue to bite.
Evaluating correctness without that consideration is too high of a bar.
Safety and correctness cannot be “impossible to misuse”
It is totally acceptable for applications to say "I do not support X conditions". Swap out the file half way through a read? Sorry don't support that. Remove power to the storage device in the middle of a sync operation? Sorry don't support that.
For vital applications, for example databases, this is a known problem and risks of the API are accounted for. Other applications don't have nearly that level of risk associated with them. My music tagging app doesn't need to be resistant to the SSD being struck by lightning.
It is perfectly acceptable to design APIs for 95% of use cases and leave extremely difficult leaks to be solved by the small number of practitioners that really need to solve those leaks.
"If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and [basically save your ass]"[0]
This is why whenever I need to persist any kind of state to disk, SQLite is the first tool I reach for. Filesystem APIs are scary, but SQLite is well-behaved.
Of course, it doesn't always make sense to do that, like the dropbox use case.
In practice I believe I've seen SQLite databases corrupted due to what I suspect are two main causes:
1. The device powering off during the middle of a write, and
2. The device running out of space during the middle of a write.
https://lists.openldap.org/hyperkitty/list/openldap-devel@op...
I'm pretty sure that's not where I originally saw his comments. I remember his criticisms being a little more pointed. Although I guess "This is a bunch of academic speculation, with a total absence of real world modeling to validate the failure scenarios they presented" is pretty pointed.
Hopefully in whichever particular mode is referenced!
I wonder what is easy.
I kinda think, and I could be wrong, that SQLite in rollback-journal mode would not have any vulnerabilities with `synchronous=EXTRA` (and `fullfsync=ON`, which makes it use F_FULLFSYNC, on macOS [2]).
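For anyone wanting to try those settings, here is a minimal sketch using Python's built-in sqlite3 module. The helper name `open_durable` is made up for illustration, and whether this configuration really closes every crash window is exactly the open question above, so treat it as a starting point rather than a guarantee.

```python
import sqlite3


def open_durable(path: str) -> sqlite3.Connection:
    """Open a SQLite database with the most conservative sync settings."""
    # isolation_level=None keeps the connection in autocommit mode, because
    # SQLite refuses to change the safety level inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=DELETE")  # rollback journal, not WAL
    conn.execute("PRAGMA synchronous=EXTRA")    # also sync the journal's directory
    conn.execute("PRAGMA fullfsync=ON")         # only has an effect on macOS
    return conn
```

Reading `PRAGMA synchronous` back afterwards should report the numeric level, which is a cheap way to confirm that the SQLite build in use actually accepted the EXTRA setting.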
The post supports its points with extensive references to prior research - research which hasn't been done in the Microsoft environment. For various reasons (NDAs, etc.) it's likely that no such research will ever be published, either. Basically it's impossible to write a post this detailed about safety issues in Microsoft file systems unless you work there. If you did, it would still take you a year or two of full-time work to do the background stuff, and when you finished, marketing and/or legal wouldn't let you actually tell anyone about it.
I can't say the Win32 File API is "pretty", but it's also an abstraction, like the .NET File Class is. And if you touch the NT API, you're naughty.
On Linux and macOS you use the same API, just the backends are different if you want async (epoll [blocking async] on Linux, kqueue on macOS).
ZFS fsync will not fail, although it could end up waiting forever when a pool faults due to hardware failures:
https://papers.freebsd.org/2024/asiabsdcon/norris_openzfs-fs...
https://github.com/openzfs/zfs/issues/9130#issuecomment-2614...
That said, there are many others who stress ZFS on a regular basis and ZFS handles the stress fine. I do not doubt that there are bugs in the code, but I feel like there are other things at play in that report. Messages saying that the txg_sync thread has hung for 120 seconds typically indicate that disk IO is running slowly due to reasons external to ZFS (and sometimes, reasons internal to ZFS, such as data deduplication).
I will try to help everyone in that issue. Thanks for bringing that to my attention. I have been less active over the past few years, so I was not aware of that mega issue.
> In conclusion, computers don't work (but I guess you already know this...
Just not all the time.
https://archive.wikiwix.com/cache/index2.php?rev_t=&url=http...
closest I come to working with files is localStorage, but that's thread safe.
it's not a real problem for most modern developers.
pwrite? wtf?
not one mention of fopen.
granted some of the fine detail discussion is interesting, but it doesn't make practical sense since about 1990.
"fopen"? That is outdated stuff from a shitty ecosystem, and how do you think it's implemented?
Meanwhile you can read plenty of stories of others having the exact opposite experience.
If you keep losing data to power losses or crashes, perhaps fix the cause of that? It doesn't make sense to try to work around it.
Ponder this notion for a moment: there are problems within one's control and problems outside of one's control.
For example, we can't control the weather. If it snows three feet overnight you simply have to deal with the fact that you're not getting to work today.
Since we can't simply stop hardware from failing, we have to deal with the fact that hardware fails. Your seventeen redundant UPSes might experience a one in a trillion cascade failure. It might take the utility ten minutes longer to restore your power than you have onsite generation.
This is not a class of problem we can control or prevent. We fix these problems by building systems which withstand failures. You can't just will electrons out of the wall socket, but you can build a better disk or FS that corrupts less data when the electrons stop.
b7/b74a/b74a56
where the digits are derived from a hash of the file name but lately I've had some NTFS volumes with a 1M file directory that seem to be OK.
Hardware problems also manifest in mysterious ways. On both Windows and MacOS I had computers that seemed to be OK until I did an OS update which caused enough IO that a failing HDD was pushed over the edge and the update failed; in one case I was able to roll back the update but not apply the update, in another case the machine was trashed. Careful investigation (like taking the disk out and inspecting it on another computer) revealed a hard drive error although there was no clear indication of this in the UI and the average person would blame the software update.
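Going back to the hash-derived directory layout (b7/b74a/b74a56): a minimal sketch of how such a path could be computed. The choice of SHA-1 and the 2/4/6 prefix widths are assumptions read off the example path, not something the comment specifies, and `sharded_path` is a hypothetical helper name.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(name: str, widths=(2, 4, 6)) -> PurePosixPath:
    """Map a file name to nested directories named by growing hash prefixes."""
    # Assumption: SHA-1 of the UTF-8 file name; any stable hash would do.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Each level repeats the previous prefix and extends it, as in b7/b74a/b74a56.
    return PurePosixPath(*(digest[:w] for w in widths))
```

A file would then live at something like `root / sharded_path(name) / name`, which keeps any single directory from accumulating millions of entries.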
I keep telling my users to make sure to plug their phones in before the battery dies, but for some reason they keep forgetting...
If you can't publish it, it's not research. If the source code is under NDA, then Microsoft gets the final say about whether you can publish or not, and if the result is embarrassing to Microsoft, I'm guessing it's "or not".
No wonder things are "hard". Because otherwise many in this godforsaken industry wouldn't need to be employed.
The reason is historical and reflects a flaw in the POSIX standards process, in my opinion, one that hopefully won't be repeated in the future. I finally tracked down why this insane behavior was standardized by the POSIX committee by talking to long-time BSD hacker and POSIX standards committee member Kirk McKusick (he of the BSD daemon artwork). As he recalls, AT&T brought the current behavior to the standards committee as a proposal for byte-range locking, as this was how their current code implementation worked. The committee asked other ISVs if this was how locking should be done. The ISVs who cared about byte range locking were the large database vendors such as Oracle, Sybase and Informix (at the time). All of these companies did their own byte range locking within their own applications, none of them depended on or needed the underlying operating system to provide locking services for them. So their unanimous answer was "we don't care". In the absence of any strong negative feedback on a proposal, the committee added it "as-is", and took as the desired behavior the specifics of the first implementation, the brain-dead one from AT&T.
[0] https://www.samba.org/samba/news/articles/low_point/tale_two...
I resisted using them in my SQLite VFS, until I partially relented for WAL locks.
I wish more platforms embraced OFD locks. macOS has them, but hidden. illumos fakes them with BSD locks (which is worse, actually). The BSDs don't add them. So it's just Linux, and Windows with sane locking. In some ways Windows is actually better (supports timeouts).
UID mapping causing read() to return -EACCES after open() succeeds breaks a lot of userland code.
Most devices write sectors atomically, and so you can build a system on top of that that does not lose committed data. (Of course if the device powers off during a write then you can lose the uncommitted data you were trying to write, but the point is you don't ever have corruption, you get either the data that was there before the write attempt or the data that is there after).
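To make the "build on atomic sector writes" idea concrete, here is a minimal sketch of a two-slot commit record. Everything about the layout (512-byte slots, a sequence number plus a root offset, a CRC32) is a made-up illustration, not the format of LMDB, Postgres, or any other real system. It uses `os.pread`/`os.pwrite`, so it assumes a Unix-like platform.

```python
import os
import struct
import zlib

SECTOR = 512                 # assumed atomic write unit
BODY = struct.Struct("<QQ")  # hypothetical layout: sequence number, root offset


def _encode(seq: int, root: int) -> bytes:
    body = BODY.pack(seq, root)
    record = body + struct.pack("<I", zlib.crc32(body))
    return record.ljust(SECTOR, b"\0")


def _decode(sector: bytes):
    if len(sector) < BODY.size + 4:
        return None
    body = sector[:BODY.size]
    (crc,) = struct.unpack_from("<I", sector, BODY.size)
    if zlib.crc32(body) != crc:
        return None  # never written, or damaged
    return BODY.unpack(body)


def read_commit(fd: int):
    """Return the newest valid (seq, root), or None if neither slot is valid."""
    slots = [_decode(os.pread(fd, SECTOR, i * SECTOR)) for i in (0, 1)]
    valid = [s for s in slots if s is not None]
    return max(valid) if valid else None


def write_commit(fd: int, root: int) -> None:
    current = read_commit(fd)
    seq = current[0] + 1 if current else 1
    # Overwrite the older slot so the newest valid commit is never touched.
    os.pwrite(fd, _encode(seq, root), (seq % 2) * SECTOR)
    os.fsync(fd)
```

If the single-sector write really is atomic, a crash leaves either the previous commit or the new one. The checksum is there as a fallback for the case where that assumption does not hold, in which case the reader drops back to the other slot.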
The correct IO elevator to use for disks given to ZFS is none/noop as ZFS has its own IO elevator. ZFS will set the Linux IO elevator to that automatically on disks where it controls the partitioning. However, when the partitioning was done externally from ZFS, the default Linux elevator is used underneath ZFS, and that is never none/noop in practice since other Linux filesystems benefit from other elevators. If proxmox is doing partitioning itself, then it is almost certainly using the wrong IO elevator with ZFS, unless it sets the elevator to noop when ZFS is using the device. That ordinarily should not cause such severe problems, but it is within the realm of possibility that the Linux IO elevator being set by proxmox has a bug.
I suspect there are multiple disparate issues causing the txg_sync thread to hang for people, rather than just one issue. Historically, things that cause the txg_sync thread to hang are external to ZFS (with the notable exception of data deduplication), so it is quite likely that the issues are external here too. I will watch the thread and see what feedback I get from people who are having the txg_sync thread hang.
Upd mq-deadline for all drives seems to be `none` for me. OS is Ubuntu 22.04
The crash-consistency problem is very different than the durability of real synchronous writes problem. There are some storage devices which will lie about sync writes, sometimes hoping that a backup battery will allow them to complete those writes.
System crashes are inevitable, use things like write ahead logs depending on need etc... No storage API will get rid of all system crashes and yes even Apple games the system by disabling real sync writes, so that will always be a battle.
There are known cases where power loss during a write can corrupt previously written data (data at rest). This is not some rare occurrence. This is why enterprise flash storage devices have power loss protection.
See also: https://serverfault.com/questions/923971/is-there-a-way-to-p...
I am not sure what you mean by that. One possibility is that the ones who reported mq-deadline did better were on either kyber or bfq, rather than none. The none elevator should be best for ZFS.
cat /sys/dev/block/8:176/queue/scheduler
[mq-deadline] none
However, this output does not mean what I thought it did - it means that mq-deadline is in use.
If I do
echo "none" | sudo tee /sys/dev/block/8:176/queue/scheduler
This changes to
cat /sys/dev/block/8:176/queue/scheduler
[none] mq-deadline
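Since the bracket convention is easy to misread, here is a small sketch that pulls the active scheduler out of such a line. `active_scheduler` is a hypothetical helper, and it assumes the line always marks exactly one entry with square brackets, as in the two outputs shown here.

```python
import re


def active_scheduler(line: str) -> str:
    """Return the scheduler shown in [brackets] in a queue/scheduler line."""
    match = re.search(r"\[([^\]]+)\]", line)
    if match is None:
        raise ValueError(f"no active scheduler marked in {line!r}")
    return match.group(1)
```

Fed the first output above it returns mq-deadline, and fed the second it returns none, which matches the before and after of the `tee` command.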
The kind of model I prefer is something based on atomicity. Most applications can get by with file-level atomicity--make whole file read/writes atomic with a copy-on-write model, and you can eliminate whole classes of filesystem bugs pretty quickly. (Note that something like writeFileAtomic is already a common primitive in many high-level filesystem APIs, and it's something that's already easily buildable with regular POSIX APIs). For cases like logging, you can extend the model slightly with atomic appends, where the only kind of write allowed is to atomically append a chunk of data to the file (so readers can only possibly either see no new data or the entire chunk of data at once).
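For reference, the usual way to build a writeFileAtomic-style primitive on regular POSIX APIs is write-to-temp, fsync, rename, then fsync the directory. A minimal sketch, assuming a POSIX filesystem where rename within one directory is atomic; `write_file_atomic` is an illustrative name, not a standard function.

```python
import os
import tempfile


def write_file_atomic(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) so that
    # the final rename is a single atomic step rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Best-effort cleanup; the destination is untouched on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Persist the rename itself by syncing the containing directory (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Two caveats: the new file gets the temp file's restrictive permissions rather than the old file's, and the whole thing needs free space for both copies at once.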
I'm less knowledgeable about the way DBs interact with the filesystem, but there the solution is probably ditching the concept of the file stream entirely and just treating files as a sparse map of offsets to blocks, which can be atomically updated. (My understanding is that DBs basically do this already, except that "atomically updated" is difficult with the current APIs).
// O_SYNC_ON_CLOSE is a proposed flag; it does not exist today.
int fd = open(".config", O_RDWR | O_CREAT | O_SYNC_ON_CLOSE, 0666);
// effects of calls to write(2)/etc. are invisible through any other file description
// until the close(2) is called on all descriptors to this file description.
close(fd);
So now you can watch for e.g. either IN_MODIFY or IN_CLOSE_WRITE (and you don't need to balance it with IN_OPEN), it doesn't matter, you'll never see partial updates... would be nice!
What happens when a lot of data is written and exceeds the dirty threshold?
Database developers don’t want the complexity or poor performance of POSIX. It’s wild to me that we still don’t have any alternative to fsync in Linux that can act as a barrier without also flushing caches at the same time.
https://github.com/openzfs/zfs/blob/34205715e1544d343f9a6414...
Writes on ZFS cease to be atomic around approximately 32MB in size if I read the code correctly.
I have many files that are several GB. Are you sure this is a good idea? What if my application only requires best effort?
> eliminate whole classes of filesystem bugs pretty quickly.
Block level deduplication is notoriously difficult.
> where the only kind of write allowed is to atomically append a chunk of data to the file
Which sounds good until you think about the complications involved in block-oriented storage media. You're stuck with RMW whether you think you're strictly appending or not.
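Since the storage layer can still tear an append, one common application-level mitigation is to make each appended chunk self-validating. A minimal sketch, assuming a made-up frame format (4-byte length, 4-byte CRC32, then the payload). It does not make the append atomic; it only lets a reader treat a torn tail as "no new data".

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # hypothetical framing: payload length, CRC32


def append_record(path: str, payload: bytes) -> None:
    frame = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Try to write the whole frame at once; loop in case of a short write.
        view = memoryview(frame)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)


def read_records(path: str) -> list:
    """Return every complete record, stopping at the first torn or bad frame."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break
            records.append(payload)
    return records
```

This is roughly the trade the log-style designs make: the medium still does read-modify-write underneath, but the reader never acts on a chunk it cannot verify.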
But even then, doing atomic writes of multi gigabyte files doesn’t sound that hard to implement efficiently. Just write to disk first and update the metadata atomically at the end. Or whenever you choose to as a programmer.
The downside is that, when overwriting, you’ll need enough free space to store both the old and new versions of your data. But I think that’s usually a good trade off.
It would allow all sorts of useful programs to be written easily - like an atomic mode for apt, where packages either get installed or not installed. But they can’t be half installed.
Maybe also add a pSLC formatting mode for a namespace so one can be explicit about that capability...
It just has to be a drive that's useable as a generic gaming SSD so people can just buy it and have casual fun with it, like they did with Nvidia GTX GPUs and CUDA.
That said, ZNS is actually something specifically about being able to extract more value out of the same hardware (as the firmware no longer causes write amplification behind your back), which in turns means that the value for such a ZNS-capable drive ought to be strictly higher than for the traditional-only version with the same hardware.
And given that enterprise SSDs seem to only really get value from an OEM's holographic sticker on them (compare almost-new-grade used prices for those with the sticker on them vs. the just plain SSD/HDD original model number, missing the premium sticker), besides the common write-back-emergency capacitors that allow a physical write-back cache in the drive to ("safely") claim write-through semantics to the host, it should IMO be in the interest of the manufacturers to push ZNS:
ZNS makes, for ZNS-appropriate applications, the exact same hardware perform better despite requiring less fancy firmware. Also, especially, there's much less need for write-back cache as the drive doesn't sort individual random writes into something less prone to write amplification: the host software is responsible for sorting data together for minimizing write amplification (usually, arranging for data that will likely be deleted together to be physically in the same erasure block).
Also, I'm not sure how exactly "bad" bins of flash behave, but I'd not be surprised if ZNS's support for zones having less usable space than LBA/address range occupied (which can btw. change upon recycling/erasing the zone!) would allow rather poor quality flash to still be effectively utilized, as even rather unpredictable degradation can be handled this way. Basically, due to Copy-on-Write storage systems (like, Btrfs or many modern database backends (specifically, LSM-Tree ones)) inherently needing some slack/empty space, it's rather easy to cope with this space decreasing as a result of write operations, regardless of if the application/user data has actually grown from the writes: you just buy and add another drive/cluster-node when you run out of space, and until then, you can use 100% of the SSDs flash capacity, instead of up-front wasting capacity just to never have to decrease the drive's usable capacity over the warranty period.
That said: https://priceblaze.com/0TS2109-WesternDigital-Solid-State-Dr... claims (by part number) to be this model: https://www.westerndigital.com/en-ae/products/internal-drive... . That's about 150 $/TB. Refurbished; doesn't say how much life has been sucked out of them.
Give me, say, a Samsung 990 Pro 2 TB for 250 EUR but with firmware for ZNS-reformatting, instead of the 200 EUR MSRP/173 EUR Amazon.de price for the normal version.
Oh, and please let me use a decent portion of that 2 GB LPDDR4 as controller memory buffer at least if I'm in a ZNS-only formatting situation. It's after all not needed for keeping large block translation tables around, as ZNS only needs to track where physically a logical zone is currently located (wear leveling), and which individual blocks are marked dead in that physical zone (easy linear mapping between the non-contiguous usable physical blocks and the contiguous usable logical blocks). Beyond that, I guess technically it needs to keep track of open/closed zones and write pointers and filled/valid lengths.
Furthermore, I don't even need them to warranty the device lifespan in ZNS, only that it isn't bricked from activating ZNS mode. It would be nice to get as many drive-writes warranty as the non-ZNS version gets, though.
- This has not already happened with a lot of the old C standard library. The only function that has ever been removed from the C standard library, to my knowledge, is gets(). In particular, strcpy() has not been removed. Current popular compilers still support gets() with the right options, so it hasn't been removed from the actual library, just the standard.
- strncpy() is not a suitable replacement for strcpy(), certainly not a safer one. It can produce strings missing the terminating null, and it can be slower by orders of magnitude. This has been true since it was introduced in the 01970s. Nearly every call to strncpy() is a bug, and in many cases an exploitable security hole. You are propagating dangerous misinformation. (This is a sign of how difficult it is to make these transitions.)
You also seem to imply that Linux cannot add system calls that are not specified in POSIX, but of course it can and does; openat() and the other 12 related functions, epoll_*(), io_uring_*(), futex_*(), kexec_load(), add_key(), and many others are Linux-specific. The reason barrier() hasn't been added is evidently that the kernel developers haven't been convinced it's worthwhile in the 15+ years since it was proposed, not that POSIX ties their hands.
The nearest equivalents in C for the kind of "staged transition" you are proposing might be things like the 16-bit near/far/huge qualifiers and the Win16 and pre-X MacOS programming models. In each of these cases, a large body of pre-existing software was essentially abandoned and replaced by newly written software.
I don’t understand the reticence of kernel developers to implement a barrier syscall. I know they could do it. And as this article points out, it would dramatically improve database performance for databases which make use of it. Why hasn’t it happened?
Another commenter says NVMe doesn’t support it natively but I bet hardware vendors would add hardware support if Linux supported it and adding barrier support to their hardware would measurably improve the performance of their devices.
The reason it hadn't yet been supported btw. is that they explicitly wanted to allow fully parallel processing of commands in a queue, at least for submissions that concurrently exist in the command queue. In practice I don't see why this would have to be enforced to such an extent, as the only reason for out-of-order processing I can think of is that the auxiliary data of a command is physically located in host memory and the DMA reads across PCIe from the NVMe controller to the host memory happen to complete out-of-order for host DRAM controller/pattern reasons. Thus it might be something you'd not want to turn on without using controller memory buffer (where you can mmap some of the DRAM on the NVMe device into host memory, write your full-detail commands directly to this across the PCIe, and keep the NVMe controller from having to first send a read request across PCIe in response to you ringing its doorbell: instead it can directly read from its local DRAM when you ring the doorbell).
Unless you mean memcpy(), there is in fact no safer alternative function in the C standard for strcpy(); software has not largely moved to not using strcpy() (plenty of new C code uses it); and most validators and sanitizers do not emit warnings for strcpy(). There is a more extensive explanation of this at https://software.codidact.com/posts/281518. GCC has warnings for some uses of strcpy(), but only those that can be statically guaranteed to be incorrect: https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html
Newer, safer alternatives to strcpy() include strlcpy() and strscpy() (see https://lwn.net/Articles/659818/), neither of which is in Standard C yet. Presumably OpenBSD has some sort of validator that recommends replacing strcpy() with strlcpy(), which is licensed such that you can bundle it with your program. Visual C++ will invite you to replace your strcpy() calls with the nonstandard Microsoft extension strcpy_s(), thus making your code nonportable and, as it happens, also buggy. An incompatible version of strcpy_s() has been added as an optional annex to the C11 standard. https://nullprogram.com/blog/2021/07/30/ gives extensive details, summarized as "there are no correct or practical implementations". The Linux kernel's checkpatch.pl will invite you to replace calls to strcpy() with calls to the nonstandard Linux/BSD extension strscpy(), but it's a kernel-specific linter.
So there are not literally zero validators and sanitizers that will warn on all uses of strcpy() in C, but most of them don't.
— ⁂ —
I don't know enough about the barrier()/osync() proposal to know why it hasn't been adopted, and obviously neither do you, since you can't know anything significant about Linux kernel internals if you think that C has methods or that strncpy() is a safer alternative to strcpy().
But I can speculate! I think we can exclude the following possibilities:
- That the paper, which I haven't read much of, just went unnoticed and nobody thought of the barrier() idea again. Luu points out that it's a sort of obvious idea for kernel developers; Chidambaram et al. ("Optimistic Crash Consistency") weren't even the first ones to propose it (and it wasn't even the main topic of their paper); and their paper has been cited in hundreds of other papers, largely in systems software research on SSDs: https://scholar.google.com/scholar?cites=1238063331053768604...
- That it's a good idea in theory, but implementing even a research prototype is too much work. Chidambaram et al.'s code is available at https://github.com/utsaslab/optfs, and it is of course GPLv2, so that work is already done for you. You can download a VM image from https://research.cs.wisc.edu/adsl/Software/optfs/ for testing.
- That authors of databases don't care about performance. The authors of SQLite, which is what Chidambaram et al. used in their paper, dedicate a lot of effort to continuously improving its performance: https://www.sqlite.org/cpu.html and it's also a major consideration for MariaDB and PostgreSQL.
- That there's an existing production-ready implementation that Linus is just rejecting because he's stubborn. If that were true, you'd see an active community around the OptFS patch, Red Hat applying it to their kernels (as they do with so many other non-Linus-accepted patches), etc.
- That it relies on asynchronous barrier support in the hardware interface, as the other commenter suggested. It doesn't.
So what does that leave?
Maybe the paper was wrong, which seems unlikely, or applicable only to niche cases. You should be able to build and run their benchmarks.
Maybe it was right at the time on spinning rust ("a Hitachi DeskStar 7K1000.B 1 TB drive") but wrong on SSDs, whose "seek time" is two to three orders of magnitude faster.
In particular, maybe it uses too much CPU.
Maybe it was right then and is still right but the interface has other drawbacks, for example being more bug-prone, which also seems unlikely, or undesirably constrains the architecture of other aspects of the kernel, such as the filesystem, in order to work well enough. (You could implement osync() as a filesystem-wide fsync() as a fallback, so this would just reduce the benefits, not increase the costs.)
Maybe it's obviously the right thing to do but nobody cares enough about it to step up and take responsibility for bringing the new system call up to Linus's standards and committing to maintain it over time.
If it was really a big win for database performance, you'd think one of the developers of MariaDB, PostgreSQL, or SQLite would have offered, or maybe one of the financial sponsors of the paper, which included Facebook and EMC. Luu doesn't say Twitter used the OptFS patch when he was on the Linux kernel team there; perhaps they used it secretly, but more likely they didn't find its advantages compelling enough to use.
Out of all these unlikely cases, my best guess is either "applicable only to niche cases", "wrong on SSDs", or "undesirably constrains filesystem implementation".
As a note on tone, some people may find it offputting when you speak so authoritatively about things you don't know anything about.
You could probably abuse Force Unit Access to make it work by marking all IOs as Force Unit Access, but a number of buggy devices do not implement FUA properly, which defeats the purpose of using it. That would be why Microsoft disabled the NTFS feature that uses FUA on commodity hardware:
https://learn.microsoft.com/en-us/windows/win32/fileio/deplo...
What you seem to want is FreeBSD’s UFS2 Softupdates that uses force unit access to avoid the need for flushes for metadata updates. It has the downside that it is unreliable on hardware that does not implement FUA properly. Also, UFS2 softupdates does not actually do anything to protect data when fsync(2) is called if this mailing list email is accurate:
https://lists.freebsd.org/pipermail/freebsd-fs/2011-November...
As pjd said:
> Synchronous writes (or BIO_FLUSH) are needed to handle O_SYNC/fsync(2) properly, which UFS currently doesn't care about.
That said, avoiding flushes for a fsync(2) would require doing FUA on all IOs. Presumably, this is not done because it would make all requests take longer all the time, raising queue depths and causing things to have to wait for queue limits more often, killing performance. Raising the OS queue depth to compensate would not work since SATA has a maximum queue depth of 32, although it might work for NVMe where the maximum queue depth is 65536, if keeping track of an increased number of inflight IOs does not cause additional issues at the storage devices (such as IOs that never complete as long as the device is kept busy because the device will keep reordering them to the end of the queue).
Using FUA only on metadata as is done in UFS2 soft updates improves performance by eliminating the need for journalling in all cases but the case of space usage, which still needs journalling (or fsck after power loss if you choose to forgo it).
Databases implemented atomic transactions in the 70s. Let’s stop pretending like this is an unsolvable CS problem. It’s not.
If you want atomic updates with APT, you could look into doing prestaged updates on ZFS. It should be possible to retrofit it into APT. Have it update a clone of the filesystem and create a new boot environment after it is done. The boot environment either is created or is not created. Then reboot into the updated OS and you can promote the clone and delete the old boot environment afterward. OpenSolaris had this capability over a decade ago.
And they have deadlocks as a result, which there is no good easy solution to (generally we work around by having only one program access a given database at a time, and even that is not 100% reliable).
Provided the underlying VFS has implemented them. They may not. Hence the point in the article that some developers only choose to support 'ext4' and nothing else.
> you’ll need enough free space to store both the old and new versions of your data.
The sacrifice is increased write wear on solid state devices.
> It would allow all sorts of useful programs to be written easily
Sure. As long as you don't need multiple processes to access the same file simultaneously. I think the article misses this point, too, in that, every FS on a multi user system is effectively a "distributed system." It's not distributed for _redundancy_ but it doesn't eliminate the attendant challenges.
https://help.dropbox.com/installs/system-requirements
They say ecryptfs is only supported when it is backed by ext4, which is a bit strange. I wonder if that is documented just to be able to close support cases when ecryptfs is used on top of a filesystem that is missing extended attribute support and their actual code does not actually check what is below ecryptfs. Usually the application above would not know what is below ecryptfs, so they would need to go out of their way to check this in order to enforce that. I do not use Dropbox, so someone else would need to test to see if they actually enforce that if curious enough.
As for wear on SSDs, I don’t think it would increase wear. You’re writing the same number of sectors on the drive. A 2 GB write would still write 2 GB (+ negligible metadata overhead). Why would the drive wear out faster in this scheme?
And I think it would work way better with multiple processes than the existing system. Right now the semantics when multiple processes edit the same file at once are somewhat undefined. With this approach, files would have database like semantics where any reader would either see the state before a write or the state after. It’s much cleaner - since it would become impossible for skewed reads or writes to corrupt a shared file.
Would you argue against the existence of database transactions? Of course not. Nobody does. They’re a great idea, and they’re way easier to reason about and use correctly compared to the POSIX filesystem api. I’m saying we should have the same integrity guarantees on the filesystem. I think if we had those guarantees already, you’d agree too.
This is an overly pedantic, ungenerous interpretation of what I wrote.
First, fine - you can argue that C has functions, not methods. But eh.
Second, for all practical purposes, C on Linux does have a standard library. It’s just - as you mentioned - not quite the same on every platform. We wouldn’t be talking about strcpy if C had no standard library equivalent.
Third, thank you for the suggestion that there are even better examples than strcpy -> strncpy that I could have used to make my point more strongly. I should have chosen sprintf, gets or scanf.
I’ve been out of the game of writing C professionally for 15 years or so. I know a whole lot more about C than most. But memories fade with time. Thanks for the corrections. Likewise, no need to get snarky with them.
It works great in practice, even with a lot of concurrent clients. (iCloud is all built on foundationdb).
Hold & lock is what causes deadlocks. I agree with you - that would be a bad way to implement filesystem transactions. But we have a lot of other options.
That said, IO barriers in storage are typically synonymous with flushes. For example, the ext4 nobarrier mount option disables flushes.