Linux file write patterns: So you want to write to a file fast (blog.plenz.com)

120 points by noqqe 12 years ago | 50 comments

tytso 12 years ago |

It's 2014; why was the author using ext3 instead of ext4? Ext4 does have fallocate support.

Also, if you use fallocate(2) instead of posix_fallocate(3), you don't have to worry about glibc trying to emulate fallocate() for those file systems which don't support it.
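A minimal sketch of the difference described above, assuming Linux with glibc. The helper name `preallocate` is made up for illustration. Calling fallocate(2) directly means an unsupported file system shows up as an error (typically EOPNOTSUPP) rather than being papered over by emulation:

```c
#define _GNU_SOURCE          /* exposes fallocate(2) in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: reserve `len` bytes for `fd`, starting at offset 0.
 * Returns 0 on success or an errno value on failure. EOPNOTSUPP means the
 * file system has no native preallocation; nothing is emulated in that case. */
int preallocate(int fd, off_t len)
{
    if (fallocate(fd, 0, 0, len) == 0)   /* mode 0: allocate and extend the file size */
        return 0;
    return errno;
}
```

A caller that sees EOPNOTSUPP can then decide for itself whether to skip preallocation or fall back to something slower.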

Finally, it's a little surprising the author didn't try using O_DIRECT writes.

dekhn 12 years ago | |

Most people who use O_DIRECT writes stop quickly, thinking it's "slow". What's actually happening is you're seeing what the system is actually capable of in terms of write bandwidth, without any of the 'clever' optimizations like write caching.
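For readers who have not tried it, here is a rough sketch of an O_DIRECT write loop. The names `write_direct` and `DIO_ALIGN` are invented for this example, and the 4096-byte alignment is an assumption: the real requirement depends on the device and file system, and some file systems reject O_DIRECT at open time with EINVAL.

```c
#define _GNU_SOURCE          /* O_DIRECT is a Linux extension */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096       /* assumed alignment for buffer, offset and length */

/* Hypothetical helper: write `nblocks` blocks of DIO_ALIGN bytes to `path`,
 * bypassing the page cache. Returns 0 on success or an errno value. */
int write_direct(const char *path, size_t nblocks)
{
    void *buf;
    int rc = posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN);
    if (rc != 0)
        return rc;
    memset(buf, 'x', DIO_ALIGN);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        rc = errno;              /* EINVAL here usually means no O_DIRECT support */
        free(buf);
        return rc;
    }
    for (size_t i = 0; i < nblocks && rc == 0; i++) {
        ssize_t n = write(fd, buf, DIO_ALIGN);
        if (n < 0)
            rc = errno;
        else if (n != DIO_ALIGN)
            rc = EIO;            /* short write: simply treated as an error in this sketch */
    }
    if (close(fd) < 0 && rc == 0)
        rc = errno;
    free(buf);
    return rc;
}
```

Timing this against the same loop without O_DIRECT is one way to see how much of the "fast" number comes from the write cache.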

StillBored 12 years ago | | |

I don't think this is accurate. We have a kernel bypass for disk operations. We use our own memory buffers, and bypass the filesystem, block, and SCSI midlayers. Our stuff is basically what O_DIRECT should be.

There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high bandwidth applications (>4GB/sec) without O_DIRECT it's easy to become CPU limited in the blk/midlayer so again we win.

Now that said, I haven't tried the latest blk-mq, scsi-mq, etc patches which are tuned for higher IOP rates. These patches were driven by people plugging in high performance flash arrays and discovering huge performance issues in the kernel. Still, I expect if you plug in a couple high end flash arrays the kernel is going to be the limit rather than the IO subsystem on a modern xeon.

zobzu 12 years ago | | |

but that's also why his tests are unreliable in this case

Nican 12 years ago | |

From a Boston Linux Usergroup discussion: https://www.mail-archive.com/discuss@blu.org/msg08490.html

tlb 12 years ago |

The code in the second example is wrong. If a write partially succeeds, instead of writing the remaining part it writes again from the beginning of the buffer. The resulting file will be incorrect. That doesn't normally happen on disk writes, but it does when writing to a pipe.
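A sketch of what a corrected loop could look like. This is not the article's code, and the name `write_all` is made up here. The key point is that the buffer pointer advances by the number of bytes already written, so a partial write resumes where it stopped instead of restarting from the beginning:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write all `len` bytes of `buf` to `fd`.
 * Returns 0 on success, -1 with errno set on failure. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before any byte was written: retry */
            return -1;           /* ENOSPC, EIO, EPIPE, ...: report it, do not loop */
        }
        p += n;                  /* skip the part that was already written */
        len -= (size_t)n;
    }
    return 0;
}
```

Returning on every error other than EINTR also avoids spinning forever when the disk is full.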

mtdewcmu 12 years ago |

It's a little bit reassuring that there weren't any clear winners and losers. In a perfect world, the OS and hardware would figure out what your intent is and carry it out the fastest way possible, right? Ideally, you'd write the code the most convenient way and it would run the most performant way. Maybe the future is now.

rwmj 12 years ago |

A bit surprising (considering he started off talking about coredumps) that he doesn't mention sparse files. Core dumps can be very sparse, and you might save time and definitely will save space by not writing out the all-zeroes parts.

lukesandberg 12 years ago | |

To do that, wouldn't you have to look at every byte just to detect the runs of 0s? That would mean pulling the whole file through the memory hierarchy of your system (rather than just passing chunks from syscall to syscall). Wouldn't that alone slow you down significantly?

rwmj 12 years ago | | |

It depends. If the data is coming from a pipe (like core_pattern) then yes you have to check for runs of zeroes. If it's coming from a filesystem, then there are various system calls that let you skip them (specifically SEEK_HOLE and SEEK_DATA flags of lseek(2)).
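As a rough illustration of the lseek(2) approach, assuming Linux with glibc: the function below (the name `count_data_bytes` is invented for this sketch) walks a file extent by extent and never reads the holes. On a file system without hole support, the whole file is simply reported as one data extent.

```c
#define _GNU_SOURCE          /* glibc exposes SEEK_DATA and SEEK_HOLE under this macro */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the number of bytes that live in data extents
 * of `fd`, or -1 on error. Holes are skipped without being read. */
off_t count_data_bytes(int fd)
{
    off_t total = 0, pos = 0;
    for (;;) {
        off_t data = lseek(fd, pos, SEEK_DATA);    /* next offset that has data */
        if (data < 0)
            return errno == ENXIO ? total : -1;    /* ENXIO: no data at or after pos */
        off_t hole = lseek(fd, data, SEEK_HOLE);   /* end of file counts as a hole */
        if (hole < 0)
            return -1;
        total += hole - data;
        pos = hole;
    }
}
```

A sparse-aware copy would use the same two calls, copying only the `[data, hole)` ranges and seeking over the rest in the destination.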

Also if the data is being copied into userspace anyway, then it's quite fast to check that memory is zero. There's no C "primitive" for this, but all C compilers can turn a simple loop into relatively efficient assembler[1].

If you're using an API that never copies the data into userspace and you have to read from a pipe, then yes sparse detection will be much more expensive.

In either case it should save disk space for core files which are highly sparse.

[1] https://stackoverflow.com/a/1494021

dekhn 12 years ago | | |

Right. Sparse files are normally written by applications or kernel threads that specifically know the defined byte ranges, and define new allocated parts of the file. Further, file allocations are probably block-sized, so you would need to ensure the byte regions of blocks were all zero.

This could be done quickly in the kernel. RAID (which does pass the data through multiple transformations) subsystem metrics printed at boot demonstrate that.

flogic 12 years ago |

The first hit for "SSD write speeds" says ~500Mb/s (I hope I got the right b). I didn't bother clicking the links; that's just the blurb under it. He's dumping 128Mb in ~200 ms. I'm not sure there is much room for improvement.

dekhn 12 years ago | |

MB. Bytes. nobody quotes disk speeds in bits (or if they do I typically ignore them).

I've frequently observed sustained 500MB/sec writes and reads on my cheap ($250) 250GB SSDs. One of my favorite instances was running out of RAM while assembling a gigapan in Hugin. I added a swap file on my SSD and continued- it ran over night with nearly 500MB/sec reads and writes more or less continuously, but the job finished fine.

Phlarp 12 years ago | | |

I weep for the memory sectors that got re-written continuously for an entire night.

rsynnott 12 years ago | |

Depends on the SSD. The PCIe SSD in a 2013 Retina MBP can approach 1GB/sec, and of course high-end PCIe server stuff can do better again; you may also have a striped RAID setup.

masklinn 12 years ago | |

> The first hit for "SSD write speeds" says ~500Mb/s (I hope I got the right b).

Nope, it's MB not Mb.

dekhn 12 years ago |

Am I correct in noting that none of his benchmarks actually timed how long it took for the data to be durably committed to disk, to the limit that the OS can report that?

I would never do XFS benchmarks because in my experience if XFS is writing during powerdown, it trashes the FS (maybe this was fixed in the past 6 years, but after it happened 3 times I haven't touched the OS again).

Wilya 12 years ago | |

One of the most catastrophic failure modes of XFS was fixed in 2007 [0]. Or at least that's what is said. I never dared touch XFS again after losing a fs to it, so I can't really confirm what it looks like today.

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux....

maxhou 12 years ago |

actually, since none of his tests ask to sync the data to disk, he might just be measuring each method's efficiency at creating dirty pages.

of course that depends on the amount of RAM the system has, and how the kernel VM parameters are tuned (sysctl vm.dirty_*)

just add a fdatasync() call and you will take into account the time it takes to flush all dirty pages into the disk.

asveikau 12 years ago |

Assuming all I/O failures are EINTR is really, really odd, as if to say disks never fill up or fail and sockets never disconnect.

He does say:

> in a real program you’d have to do real error handling instead of assertions, of course

But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program. Especially when contrasting it with a "wrong way" I think it wouldn't hurt to include real error handling. And that means something that doesn't fall into an infinite loop when the disk fills up.

masklinn 12 years ago | |

> Assuming all I/O failures are EINTR is really, really odd, as if to say disks never fill up or fail and sockets never disconnect.

The point is to retry on EINTR and to abort completely in case of other IO failures.

    assert(errno == EINTR);
    continue;

is equivalent to

    if (errno == EINTR)
        continue;
    abort();

> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

Even if they do, it likely will not actually do any harm; it'll just kill the program instead of gracefully handling the error.

asveikau 12 years ago | | |

You are wrong. assert is a no-op when NDEBUG is defined. Some compilers will set that for you in an optimized build.

Using an assert in place of real error checking or otherwise relying on its side effects is consequently a huge wtf in C.
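A small demonstration of the NDEBUG behavior, with invented function names. Defining NDEBUG before including <assert.h> is what a `-DNDEBUG` release build effectively does, and the asserted expression is then never evaluated:

```c
#define NDEBUG               /* stands in for -DNDEBUG on the compiler command line */
#include <assert.h>

static int checks_run = 0;

/* Hypothetical check with a side effect: counts how often it is called. */
int check_and_count(void)
{
    checks_run++;
    return 1;
}

/* Returns how many times the asserted expression was actually evaluated. */
int count_assert_evaluations(void)
{
    assert(check_and_count());   /* expands to ((void)0): the call never happens */
    return checks_run;
}
```

With NDEBUG defined, `count_assert_evaluations()` returns 0, so an `assert(errno == EINTR)` in the same build would not stop a retry loop from spinning on ENOSPC.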

maxhou 12 years ago | | |

ENOSPC ?

Niten 12 years ago | |

> But somebody somewhere is reading this and thinking this is a "semantically correct pattern" (as it is introduced) and may just copy-paste it into their program.

I dare say that would be their fault for blindly copying and pasting without taking the time to understand the context. (He even gives an explicit disclaimer!) Robust error handling would just be more noise to filter through for people actually reading the article, and I don't think it's the author's responsibility to childproof things for people who aren't.

cjensen 12 years ago |

The author has failed to account for command latency. If you write some bytes, there are a bunch of hardware buffering delays in getting bytes to disk including seek and rotational latency.

Async I/O avoids this. You can tell the I/O subsystem what you want to read next even while doing a write. The I/O is posted to the disk in modern systems, and the disk will begin seeking to the read site in parallel with informing the OS that the write has completed. Posting I/O even helps for SSDs to avoid the idle time on the SSD media between write done and read start.

mtdewcmu 12 years ago | |

I think there is some amount of write back caching in the kernel so that the application doesn't have to wait for each individual chunk to go to disk before it can submit the next chunk. I believe there's a sync on either file close or process termination, or some combination.

angry_octet 12 years ago |

"For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each" The whole thing is completely invalid for measuring actual I/O hierarchy efficiencies because of (a) write sizes too small, would be in buffer cache of unknown hotness, (b) dmcrypt introduces a whole layer of indirection and timing variability and (c) on an SSD, almost anything could be happening regarding cache and syncs. Also, mount options, % disk used, small sample sizes, unknown contention effects, etc. This is a good example of how to convince yourself of something and yet be less accurate than a divining rod.

callesgg 12 years ago |

I would assume the encryption is such a big overhead that most optimizations in the upper levels will be useless.

petermonsson 12 years ago | |

The Intel AES instructions help a lot. According to Wikipedia we use 3.5 cycles/byte. That gives us 128*3.5/3 = 149 ms (128 MB at 3.5 cycles/byte on a 3 GHz core). If the write speed is around 500MB/s as another poster stated, then the disk encryption is probably not a bottleneck. Still, it is better to actually measure the performance without encryption to see if there is any effect.

dar8919 12 years ago |

The second code snippet looks wrong,

write(out, buf, (r - w)) should be write(out, buf + w, r - w)

zobzu 12 years ago |

actually, despite the author's claims, the fs stuff is RAM cached a lot, hence the differences in the tests (especially for a single file write).