* Semantics of racy O_DIRECT writes @ 2025-01-08 16:33 Travis Downs 2025-01-09 4:57 ` Theodore Ts'o 0 siblings, 1 reply; 12+ messages in thread From: Travis Downs @ 2025-01-08 16:33 UTC (permalink / raw) To: linux-block Hello linux-block, We are experiencing data corruption in our storage-intensive server application and are wondering about the semantics of "racy" O_DIRECT writes. Normally we target XFS, but the question is a general one. Specifically, imagine that we are writing a single 4K aligned page, with contents AB00 (each char being 1K bytes). We only care about the first 2048 bytes (the AB part). We are using libaio writes (io_submit) with O_DIRECT semantics. While the write is in flight, i.e., after we have submitted it and before we reap it in io_getevents, the userspace application writes into the second half of the page, changing it to ABCD (let's say via memcpy). The first half is not changed. The question then is: is this safe in the sense that it would result in ABxx being written, where xx is "don't care"? Or could it do something crazier, like cause later writes to be ignored (e.g. if something in the kernel storage layer hashes the page for some purpose and this hash is out of sync with the page at the time it was captured, or something like that)? Of course, the easy answer is "don't do that", but I still want to know what happens if we do. Travis ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-08 16:33 Semantics of racy O_DIRECT writes Travis Downs @ 2025-01-09 4:57 ` Theodore Ts'o 2025-01-09 14:16 ` Travis Downs 0 siblings, 1 reply; 12+ messages in thread From: Theodore Ts'o @ 2025-01-09 4:57 UTC (permalink / raw) To: Travis Downs; +Cc: linux-block On Wed, Jan 08, 2025 at 01:33:07PM -0300, Travis Downs wrote: > Hello linux-block, > > We are experiencing data corruption in our storage intensive server > application and are wondering about the semantics of "racy" O_DIRECT > writes. > > Normally we target XFS, but the question is a general one. > > Specifically, imagine that we are writing a single 4K aligned page, > with contents AB00 (each char being 1K bytes). We only care about > the first 2048 bytes (the AB part). We are using libaio writes > (io_submit) with O_DIRECT semantics. While the write is in flight, > i.e., > after we have submitted it and before we reap it in io_getevents, the > userspace application writes into second half of the page, > changing it to ABCD (let's say via memcpy). The first half is not changed. > > The question then is: is this safe in the sense that would result in > ABxx being written where xx "is don't care"? Or could it do something > crazier, like cause later writes to be ignored (e.g. if something in > the kernel storage layer hashes the page for some purpose and > this hash is out of sync with the page at the time it was captured, or > something like that). > > Of course, the easy answer is "don't do that", but I still want to > know what happens if we do. Don't do that. Really. First of all, your program might need to run on OSes other than Linux, such as legacy Unix systems, Mac OS X, etc., and officially, there are zero guarantees about cache coherency between O_DIRECT writes and the page cache. So if you mix O_DIRECT I/O with buffered I/O or mmap access to the same file... all bets are off. 
By definition O_DIRECT I/O bypasses the page cache, so if there is a copy of the file's data block in the page cache, for some implementations of some OSes the page cache might contain the previous, stale version of the block, so buffered reads might not see the update made by the O_DIRECT write. And if the page is mmap'ed into some process's address space, and the process dirties that page, that page will get written back to the disk, potentially overwriting the O_DIRECT write. Linux will make best efforts to maintain cache coherency between O_DIRECT and the page cache. It does that by writing out the page in the page cache if it is dirty, and then evicting the page from the page cache. In practice this is good enough for programs like a database that normally uses O_DIRECT for its transactions, locks the database so it can take a consistent snapshot, and then does the backup via buffered write. That will work, since if the database wasn't locked while taking the backup, it would be a complete mess, and O_DIRECT vs. page cache coherency would be the *least* of your worries. But in general, don't mix buffered/mmap and O_DIRECT I/O to the same file. Just don't. It might work, but remember that the raison d'etre for O_DIRECT is performance in support of databases and storage systems where developers Know What They Are Doing(tm) and Follow The Rules(tm). Linux's cache coherency is best efforts only (and other OSes might not even bother), and database developers and sysadmins would be sad if we compromised O_DIRECT performance just to make things 100% safe for people who want to do things that break the rules. If you like breaking rules, don't use O_DIRECT. You'll be happier for it, as will hapless future users of your programs. :-) Remember, good programs are maintainable and portable. What if someone takes your program and tries to make it work on MacOS? Cheers, - Ted P.S. 
I commend to you the ten commandments for C programmers, especially the last one. Remember, all the world's not Linux! https://www.lysator.liu.se/c/ten-commandments.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-09 4:57 ` Theodore Ts'o @ 2025-01-09 14:16 ` Travis Downs 2025-01-09 15:01 ` Travis Downs 2025-01-09 15:51 ` Theodore Ts'o 0 siblings, 2 replies; 12+ messages in thread From: Travis Downs @ 2025-01-09 14:16 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-block On Thu, Jan 9, 2025 at 1:57 AM Theodore Ts'o <tytso@mit.edu> wrote: > > Don't do that. Really. > > First of all, your program might need to run on OS's other than Linux, > such as Legacy Unix systems, Mac OS X, etc, and officially, there is > zero guarantees about cache coherency between O_DIRECT writes and the > page cache. So if you use O_DIRECT I/O and buffered I/O or mmap > access to a file.... all bet are off. Thanks Theodore for your comprehensive reply. I probably was not very clear in the way I posed my question. To clarify: - There is only one process involved here making all the writes. - We do only O_DIRECT reads and writes, so I don't expect the page cache to be involved in the usual case (but we can't exclude it entirely). - So the question is largely about the possible outcomes of doing a zero-copy O_DIRECT write (where the block driver will ultimately be reading directly from the pages allocated by and passed to the kernel by the userspace application) in the situation where a portion of the passed pages is modified in a racy way by the userspace application by a subsequent O_DIRECT write. > By definition O_DIRECT I/O bypasses the page cache, so if there is a > copy of the file's data block in the page cache, for some > implementations of some OS's the page cache might contain the previous > stale version of the block, so buffer reads might not have the updated > copy reflected by the O_DIRECT write. And if the page is mmap'ed into > some process's address space, and the process dirties that page, that > page will get written back to the disk, potentially overwriting > O_DIRECT write. 
> > Linux will make best efforts to maintain cache coherency between > O_DIRECT and the page cache. It does that by writing out the page in > the page cache if it is dirty, and then evicting the the page from the > page cache. In practice this will be good enough to keep programs > like a database which locks the database so it can take a consistent > snapshot, and then does the backup via buffered write, when the > database normally uses O_DIRECT for its transactions, it will work --- > since if the database wasn't locked while taking the backup, it would > be completely a mess, and the O_DIRECT vs page cache coherency is the > *least* of your worries. Note that we run only on Linux and are heavily tied to the details of Linux AIO and io_uring, so a "Linux-only" response is fine. I am quite sure that after an O_DIRECT write completes, a subsequent read through any Linux API is going to return the newly written value, not a stale value from the page cache. > > But in general, don't mix bufered/mmap and O_DIRECT I/O to the same > file. Just don't. It might work, but remember that raison d'etre for > O_DIRECT is performance in support of databases and storage systems > where developers Know What They Are Doing(tm) and Follow The > Rules(tm). Linux's cache coherency is best efforts only (and other > OS's might not even bother), and database developers and sysadmins > would be sad if we compromised O_DIRECT perforance just to make things > 100% safe for people want to do things which are breaking the rules. This is us: we know what we are doing and are writing a database-like product. We are heavy users of AIO, and in fact many of the discussions of AIO and O_DIRECT behavior here on the LKML and elsewhere are driven by users of the same framework we use (Seastar), so you can consider us expert users from that point of view. > > If you like breaking rules, don't use O_DIRECT. You'll be happier for > it, as will hapless future users of your programs. 
:-) > > Remember, good programs are maintainable and portable. What if some > one attempts to take your programs and tries to make it work on MacOS? Fair enough. In our case, we are writing a high-performance, clustered event store (Redpanda) which is a piece of infrastructure with very little demand to run on anything other than Linux, except for "dev" scenarios, where emulation is suitable. We make heavy use of aio (later, io_uring) and tune for specific kernel features like RWF_NOWAIT, etc. Thanks again, Travis ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-09 14:16 ` Travis Downs @ 2025-01-09 15:01 ` Travis Downs 2025-01-09 17:32 ` Bart Van Assche 2025-01-09 15:51 ` Theodore Ts'o 1 sibling, 1 reply; 12+ messages in thread From: Travis Downs @ 2025-01-09 15:01 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-block On Thu, Jan 9, 2025 at 11:16 AM Travis Downs <travis.downs@gmail.com> wrote: > > On Thu, Jan 9, 2025 at 1:57 AM Theodore Ts'o <tytso@mit.edu> wrote: > > > > Don't do that. Really. > > > > First of all, your program might need to run on OS's other than Linux, > > such as Legacy Unix systems, Mac OS X, etc, and officially, there is > > zero guarantees about cache coherency between O_DIRECT writes and the > > page cache. So if you use O_DIRECT I/O and buffered I/O or mmap > > access to a file.... all bet are off. > > Thanks Theodore for your comprehensive reply. I probably was not very > clear in the way I posed my question. To clarify: > > - There is only one process involved here making all the writes > - We do only O_DIRECT reads and writes, so I don't expect the page > cache to be involved in the usual case (but we can't exclude it entirely). > - So the question is large about the possible outcomes of doing a zero- > copy O_DIRECT write (where the block driver will ultimately be reading > directly from the pages allocated by and passed to the kernel by the > userspace application) in the situation where a portion of the the passed > pages are modified in a racy way by the userspace application by a > subsequent O_DIRECT write. Sorry, the last point above is totally wrong (insert excuse about coffee, etc here). What it should say is "...where a portion of the passed pages are modified in-memory by the application while the write is in flight". That is, there is only one O_DIRECT write here. It is issued, and before it completes the userspace writes into the same buffers that were passed to the O_DIRECT write, modifying some (but not all) bytes. 
Do we have a guarantee that the unmodified bytes will be successfully written? Can it cause some corruption/inconsistency in the FS or block layer? TIA ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-09 15:01 ` Travis Downs @ 2025-01-09 17:32 ` Bart Van Assche 2025-01-10 9:42 ` Christoph Hellwig 0 siblings, 1 reply; 12+ messages in thread From: Bart Van Assche @ 2025-01-09 17:32 UTC (permalink / raw) To: Travis Downs, Theodore Ts'o; +Cc: linux-block On 1/9/25 7:01 AM, Travis Downs wrote: > Do we have a guarantee that the unmodified bytes will be successfully > written? That seems likely to me. > Can this cause some corruption/inconsistency in the FS or block > layer? As Ted already mentioned, if there is any driver in the storage stack that requires stable writes (e.g. a RAID 5 driver) or that assumes that data in flight is not being modified (e.g. a RAID 1 driver), you will be in trouble. Additionally, since typical storage controllers use DMA to transfer data, and since DMA may happen out of order, another pattern than AB00 or ABCD could end up on the storage device, e.g. AB0D. Bart. ^ permalink raw reply [flat|nested] 12+ messages in thread
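Bart's point about out-of-order DMA can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the device fetches each 1K chunk exactly once, at an arbitrary moment relative to the application's memcpy, so each modified chunk lands as either its old or its new value. The function name `possible_on_disk_patterns` is hypothetical, and real hardware may fetch at a finer or coarser granularity than one chunk.

```python
from itertools import product


def possible_on_disk_patterns(before: str, after: str) -> set[str]:
    """Enumerate outcomes when each chunk is read once, before or after the change."""
    if len(before) != len(after):
        raise ValueError("both snapshots must cover the same chunks")
    # Toy assumption: a chunk is fetched atomically, so it is either old or new.
    choices = [sorted({old, new}) for old, new in zip(before, after)]
    return {"".join(combo) for combo in product(*choices)}


# AB00 is the buffer at submission time, ABCD is the buffer after the memcpy.
patterns = possible_on_disk_patterns("AB00", "ABCD")
print(sorted(patterns))
```

Under this model every outcome still starts with AB, and mixed results such as AB0D appear alongside AB00 and ABCD. The model says nothing about layers that checksum or mirror the data.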
* Re: Semantics of racy O_DIRECT writes 2025-01-09 17:32 ` Bart Van Assche @ 2025-01-10 9:42 ` Christoph Hellwig 2025-01-31 19:58 ` Travis Downs 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2025-01-10 9:42 UTC (permalink / raw) To: Bart Van Assche; +Cc: Travis Downs, Theodore Ts'o, linux-block On Thu, Jan 09, 2025 at 09:32:54AM -0800, Bart Van Assche wrote: > in trouble. Additionally, since typical storage controllers use DMA to > transfer data, and since DMA may happen out of order, another pattern > than AB00 or ABCD could end up on the storage device, e.g. AB0D. Only when using an LBA size larger than each chunk, e.g. in the scenario mentioned by Travis if using a 4k LBA size and not a 512 byte LBA size. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-10 9:42 ` Christoph Hellwig @ 2025-01-31 19:58 ` Travis Downs 0 siblings, 0 replies; 12+ messages in thread From: Travis Downs @ 2025-01-31 19:58 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Bart Van Assche, Theodore Ts'o, linux-block On Fri, Jan 10, 2025 at 6:42 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Jan 09, 2025 at 09:32:54AM -0800, Bart Van Assche wrote: > > in trouble. Additionally, since typical storage controllers use DMA to > > transfer data, and since DMA may happen out of order, another pattern > > than AB00 or ABCD could end up on the storage device, e.g. AB0D. > > Only when using an LBA size larger than each chunk, e.g. in the scenario > mentioned by Travis if using a 4k LBA size and not a 512 byte LBA size. > Thanks Christoph, your reply is appreciated. In our case any pattern that is ABxx is fine. We don't have any expectations about the xx part: it's fine if it is a torn write, an out-of-order write, total garbage, etc., but we'd like the AB part to be intact (see also my coming reply on a parallel branch). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-01-09 14:16 ` Travis Downs 2025-01-09 15:01 ` Travis Downs @ 2025-01-09 15:51 ` Theodore Ts'o 2025-01-10 8:58 ` Christoph Hellwig 1 sibling, 1 reply; 12+ messages in thread From: Theodore Ts'o @ 2025-01-09 15:51 UTC (permalink / raw) To: Travis Downs; +Cc: linux-block On Thu, Jan 09, 2025 at 11:16:41AM -0300, Travis Downs wrote: > - So the question is large about the possible outcomes of doing a zero- > copy O_DIRECT write (where the block driver will ultimately be reading > directly from the pages allocated by and passed to the kernel by the > userspace application) in the situation where a portion of the the passed > pages are modified in a racy way by the userspace application by a > subsequent O_DIRECT write. Yeah, sorry, I thought "modified via memcpy()" meant a memcpy to an mmap'ed region, which would mean it's in the page cache. If what you mean is one thread modifying a block while the O_DIRECT write is underway, the answer is "it depends". For non-Linux systems, it will almost certainly be racy. For Linux, some block devices require stable writes (e.g., iSCSI writes which include a checksum, SCSI devices with DIF/DIX enabled, or some software RAID 5 block devices). On those, a racy write might lead to an I/O error on the write or, in the case of RAID 5, on the subsequent read of the block. Linux will protect against this happening by marking the page read-only while the I/O is underway, whether it's happening via buffered writeback or O_DIRECT writes, and then marking the page read/write afterwards. Doing this has performance implications, since changing the page tables and doing global interprocessor interrupts are not free. So we only do it for those block devices that require stable writes, and even if you are interested in a Linux-only answer, it's still "it depends". Cheers, - Ted ^ permalink raw reply [flat|nested] 12+ messages in thread
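Whether a particular block device asks for stable writes can be checked from userspace. The following is a minimal sketch, assuming a kernel that exposes the `queue/stable_writes` sysfs attribute for the disk. The helper name `requires_stable_writes` and its `sysfs_root` parameter are hypothetical and exist only so the function can be exercised against a fake directory tree. The attribute reports what the device requests; it does not by itself tell you how direct I/O buffers are treated.

```python
from pathlib import Path
from typing import Optional


def requires_stable_writes(disk: str, sysfs_root: str = "/sys") -> Optional[bool]:
    """Return True/False from queue/stable_writes, or None if it cannot be read."""
    attr = Path(sysfs_root) / "block" / disk / "queue" / "stable_writes"
    try:
        # The attribute holds "1" when memory must not change during a write.
        return attr.read_text().strip() == "1"
    except OSError:
        # Attribute missing (older kernel, unknown disk) or not readable.
        return None
```

For example, `requires_stable_writes("nvme0n1")` would report the setting for a disk of that name, if one exists on the machine.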
* Re: Semantics of racy O_DIRECT writes 2025-01-09 15:51 ` Theodore Ts'o @ 2025-01-10 8:58 ` Christoph Hellwig 2025-01-31 20:06 ` Travis Downs 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2025-01-10 8:58 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Travis Downs, linux-block On Thu, Jan 09, 2025 at 10:51:19AM -0500, Theodore Ts'o wrote: > For Linux, if the block device is one that requires stable writes > (e.g., for iSCSI writes which include a checksum, or SCSI devices with > DIF/DIX enabled, or some software RAID 5 block device), where a racy > write might lead to an I/O error on the write or in the case of RAID > 5, in the subsequent read of the block, Linux will protect against > this happening by marking the page read-only while the I/O is > underway, either if it's happening via buffered writeback or O_DIRECT > writes, and then marking the page read/write afterwards. This only happens for buffered I/O, and not for direct I/O. But that only matters when your operation is inside the sector (LBA) boundary that the device interface operates on, e.g. if you are using a 512-byte sector size, as long as you stay outside of that sector you're still fine. BUT: that assumes device checksums. File systems can have checksums as well and have the same problem. Because of that, for example, running Windows VM images (which tend to somehow generate this pattern) on qemu using direct I/O on btrfs files has historically caused a lot of problems. ^ permalink raw reply [flat|nested] 12+ messages in thread
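The checksum problem Christoph describes can be sketched with a toy model: some layer computes a checksum over the buffer, the application modifies the buffer, and only then is the data transferred. The code below is an illustration under that assumption, not a description of how any particular file system or device works. `racy_checksummed_write` and `append_cd` are hypothetical names, and `zlib.crc32` stands in for whatever checksum the layer actually uses.

```python
import zlib


def racy_checksummed_write(buffer: bytearray, mutate) -> tuple[int, bytes]:
    """Model a layer that checksums the buffer first and copies the data later."""
    checksum = zlib.crc32(buffer)  # step 1: checksum taken from the live buffer
    mutate(buffer)                 # the application races and modifies the buffer
    stored = bytes(buffer)         # step 2: the data is transferred afterwards
    return checksum, stored


# AB00: 1K of 'A', 1K of 'B', then 2K of zeroes.
page = bytearray(b"A" * 1024 + b"B" * 1024 + b"\0" * 2048)


def append_cd(buf: bytearray) -> None:
    """Turn AB00 into ABCD while the write is notionally in flight."""
    buf[2048:3072] = b"C" * 1024
    buf[3072:4096] = b"D" * 1024


checksum, stored = racy_checksummed_write(page, append_cd)
first_half_intact = stored[:2048] == b"A" * 1024 + b"B" * 1024
checksum_matches = zlib.crc32(stored) == checksum
print(first_half_intact, checksum_matches)
```

The AB half arrives unchanged, yet the stored block no longer matches the checksum recorded for it, so a later read of the whole block can fail verification even though the bytes the application cared about are correct.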
* Re: Semantics of racy O_DIRECT writes 2025-01-10 8:58 ` Christoph Hellwig @ 2025-01-31 20:06 ` Travis Downs 2025-02-04 5:19 ` Christoph Hellwig 0 siblings, 1 reply; 12+ messages in thread From: Travis Downs @ 2025-01-31 20:06 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Theodore Ts'o, linux-block On Fri, Jan 10, 2025 at 5:58 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Jan 09, 2025 at 10:51:19AM -0500, Theodore Ts'o wrote: > > For Linux, if the block device is one that requires stable writes > > (e.g., for iSCSI writes which include a checksum, or SCSI devices with > > DIF/DIX enabled, or some software RAID 5 block device), where a racy > > write might lead to an I/O error on the write or in the case of RAID > > 5, in the subsequent read of the block, Linux will protect against > > this happening by marking the page read-only while the I/O is > > underway, either if it's happening via buffered writeback or O_DIRECT > > writes, and then marking the page read/write afterwards. > > This only happens for buffered I/O, and not for direct I/O. Thank you. To clarify, "this" means the RO protection, right? So in direct IO there is no such protection? > > But that only matters when your operation is inside the sector (LBA) > boundary that the device interface operates on, e.g. if you using 512 > byte sector size as long your stay outside of that you're still fine. Sorry it's not clear if you are talking about the buffered or direct I/O case here. Also, my problem description was not clear enough. I made it sound as if we only concurrently write to different 1k blocks than the data we care about, but there is actually no such alignment: we might write to adjacent bytes of alignment 1. That is, we may write bytes [0, 777) of some 4K block, then send it down for direct IO via io_submit, and before that returns we may write the next region [777, 1234) or whatever. 
So we are definitely interested in the case where there are writes within the same 512-byte sector. > > BUT: that assumes device checksums. File systems can have checksums > as well and have the same problem. Because of that for example running > Windows VM images which tend to somehow generate this pattern on qemu > using direct I/O on btrfs files has historically causes a lot of > problems. So is it fair to say that for direct IO these types of racy writes are not safe? Specifically, we are looking at behavior in a 3rd party, proprietary block device (implemented as a kernel module) and are wondering if these types of racy writes break the implied or explicit semantics of safe direct IO writes. ^ permalink raw reply [flat|nested] 12+ messages in thread
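The byte ranges in this message can be mapped onto sectors with a little arithmetic. The sketch below is only an illustration: `sectors_touched` is a hypothetical helper, and the 512-byte default is the sector size used as an example in the thread.

```python
def sectors_touched(start: int, end: int, sector_size: int = 512) -> range:
    """Return the sector indices overlapped by the half-open byte range [start, end)."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return range(start // sector_size, (end - 1) // sector_size + 1)


cared_about = sectors_touched(0, 777)    # bytes that must reach the disk intact
modified = sectors_touched(777, 1234)    # bytes changed while the write is in flight
shared = sorted(set(cared_about) & set(modified))
print(list(cared_about), list(modified), shared)
```

With 512-byte sectors the first range covers sectors 0 and 1 and the second covers sectors 1 and 2, so sector 1 contains both data that must be preserved and data that is being modified. That is the "inside the sector boundary" case.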
* Re: Semantics of racy O_DIRECT writes 2025-01-31 20:06 ` Travis Downs @ 2025-02-04 5:19 ` Christoph Hellwig 2025-02-04 14:32 ` Travis Downs 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2025-02-04 5:19 UTC (permalink / raw) To: Travis Downs; +Cc: Christoph Hellwig, Theodore Ts'o, linux-block On Fri, Jan 31, 2025 at 05:06:50PM -0300, Travis Downs wrote: > On Fri, Jan 10, 2025 at 5:58 AM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Thu, Jan 09, 2025 at 10:51:19AM -0500, Theodore Ts'o wrote: > > > For Linux, if the block device is one that requires stable writes > > > (e.g., for iSCSI writes which include a checksum, or SCSI devices with > > > DIF/DIX enabled, or some software RAID 5 block device), where a racy > > > write might lead to an I/O error on the write or in the case of RAID > > > 5, in the subsequent read of the block, Linux will protect against > > > this happening by marking the page read-only while the I/O is > > > underway, either if it's happening via buffered writeback or O_DIRECT > > > writes, and then marking the page read/write afterwards. > > > > This only happens for buffered I/O, and not for direct I/O. > > Thank you. To clarify, "this" means the RO protection, right? So in direct IO > there is no such protection? Yes. > > But that only matters when your operation is inside the sector (LBA) > > boundary that the device interface operates on, e.g. if you using 512 > > byte sector size as long your stay outside of that you're still fine. > > Sorry it's not clear if you are talking about the buffered or direct > I/O case here. This is all about direct I/O. > > BUT: that assumes device checksums. File systems can have checksums > > as well and have the same problem. Because of that for example running > > Windows VM images which tend to somehow generate this pattern on qemu > > using direct I/O on btrfs files has historically causes a lot of > > problems. 
> > So is it fair to say that for direct IO these types of racy writes are not safe? In general: yes. > > Specifically, we are looking at behavior in a 3rd party, proprietary > block device > (implemented as a kernel module) and are wondering if these types of racy > writes break the implied or explicit semantics of safe direct IO writes. I have no interest in helping anyone look into proprietary drivers. But every single one I've looked at was somewhere between somewhat and totally broken in many ways. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Semantics of racy O_DIRECT writes 2025-02-04 5:19 ` Christoph Hellwig @ 2025-02-04 14:32 ` Travis Downs 0 siblings, 0 replies; 12+ messages in thread From: Travis Downs @ 2025-02-04 14:32 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Theodore Ts'o, linux-block On Tue, Feb 4, 2025 at 2:19 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Fri, Jan 31, 2025 at 05:06:50PM -0300, Travis Downs wrote: > Yes. > > This is all about direct I/O. > > > So is it fair to say that for direct IO these types of racy writes are not safe? > > In general: yes. Thanks Christoph, this is all very helpful. I appreciate your replies. > > > > Specifically, we are looking at behavior in a 3rd party, proprietary > > block device > > (implemented as a kernel module) and are wondering if these types of racy > > writes break the implied or explicit semantics of safe direct IO writes. > > I have no interest in helping anyone into looking proprietary drivers. Totally fair, and it's why for this list I would frame it as a question of what the contract is between userspace, the VFS layer, the FS, the block layer and the block driver. Having that well defined is useful for in-tree devices, out-of-tree open source devices and, unavoidably, closed-source devices. So I would summarize the current situation as: "It is allowable for any FS and block device to assume that direct IO userspace buffers are stable, and if they are not, there are few guarantees about the results of unstable writes. The generic parts of the storage layer, and common drivers like the NVMe block driver, may today provide reasonable semantics in this case (e.g., the unmodified part of the buffer is written with the expected values) but that is not a guarantee and may change in the future". ^ permalink raw reply [flat|nested] 12+ messages in thread
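One way an application might honor the "buffers are stable" contract summarized above is to hand the kernel a private copy of the page and keep appending to the live page only. The sketch below models buffer ownership and nothing else: `AppendPage` and `snapshot_for_write` are hypothetical names, the copy is a stand-in for the buffer that would be passed to io_submit, and the alignment that O_DIRECT requires is deliberately left out.

```python
class AppendPage:
    """A 4K page that is appended to while earlier snapshots may be in flight."""

    def __init__(self, size: int = 4096) -> None:
        self._live = bytearray(size)
        self._used = 0

    def append(self, data: bytes) -> None:
        """Append bytes to the live page; snapshots taken earlier are unaffected."""
        if self._used + len(data) > len(self._live):
            raise ValueError("page is full")
        self._live[self._used:self._used + len(data)] = data
        self._used += len(data)

    def snapshot_for_write(self) -> bytes:
        # Copy the whole page. The copy is what would be handed to the kernel,
        # so later append() calls never touch memory that is in flight.
        return bytes(self._live)


page = AppendPage()
page.append(b"A" * 777)
in_flight = page.snapshot_for_write()  # stand-in for the buffer given to io_submit
page.append(b"B" * 457)                # lands in the live page only
```

The trade-off is one extra copy per submission in exchange for never modifying a buffer the storage stack may still be reading.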