* O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO [not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com> @ 2025-10-20 10:00 ` Christoph Hellwig 2025-10-20 10:24 ` Qu Wenruo ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Christoph Hellwig @ 2025-10-20 10:00 UTC (permalink / raw) To: Qu Wenruo Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote: > There is a bug report about that direct IO (and even concurrent buffered > IO) can lead to different contents of md-raid. What concurrent buffered I/O? > It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7 > ("btrfs: always fallback to buffered write if the inode requires > checksum"), however we still leave a hole for nodatasum cases. > > For nodatasum cases we still reuse the bio from direct IO, making it to > cause the same problem for RAID1*/5/6 profiles, and results > unreliable data contents read from disk, depending on the load balance. > > Just do not trust any bio from direct IO, and never reuse those bios even > for nodatasum cases. Instead alloc our own bio with newly allocated > pages. > > For direct read, submit that new bio, and at end io time copy the > contents to the dio bio. > For direct write, copy the contents from the dio bio, then submit the > new one. This basically reinvents IOCB_DONTCACHE I/O with duplicate code? > Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows > modifying the page cache when it's still under writeback) can lead to > raid mirror contents mismatch, the 23% performance drop should still be > acceptable, and bcachefs is already doing this bouncing behavior. XFS (and EXT4 as well, but I've not tested it) wait for I/O to finish before allowing modifications when mapping_stable_writes returns true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that is fine. Direct I/O is broken, and at least for XFS I have patches to force DONTCACHE instead of DIRECT I/O by default in that case, but allowing for an opt-out for known applications (e.g. file or storage servers). I'll need to rebase them, but I plan to send them out soon together with other T10 PI enabling patches. Sorry, juggling a few too many things at the moment. > But still, such performance drop can be very obvious, and performance > oriented users (who are very happy running various benchmark tools) are > going to notice or even complain. I've unfortunately seen much bigger performance drops with direct I/O and PI on fast SSDs, but we still should be safe by default. > Another question is, should we push this behavior to iomap layer so that other > fses can also benefit from it? The right place is above iomap to pick the buffered I/O path instead. The real question is if we can finally get a version of pin_user_pages that prevents user modifications entirely. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig @ 2025-10-20 10:24 ` Qu Wenruo 2025-10-20 11:45 ` Christoph Hellwig 2025-10-20 11:16 ` Jan Kara 2025-10-21 3:17 ` Qu Wenruo 2 siblings, 1 reply; 27+ messages in thread From: Qu Wenruo @ 2025-10-20 10:24 UTC (permalink / raw) To: Christoph Hellwig, Qu Wenruo Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 2025/10/20 20:30, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote: >> There is a bug report about that direct IO (and even concurrent buffered >> IO) can lead to different contents of md-raid. > > What concurrent buffered I/O? filemap_get_folio(): for address spaces with STABLE_WRITES, there will be a folio_wait_stable() call to wait for writeback. But since almost no device (except md-raid56) sets that flag, if a folio is still under writeback, XFS/EXT4 can still modify that folio (since it's not locked, just under writeback) for new incoming buffered writes. > >> It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7 >> ("btrfs: always fallback to buffered write if the inode requires >> checksum"), however we still leave a hole for nodatasum cases. >> >> For nodatasum cases we still reuse the bio from direct IO, making it to >> cause the same problem for RAID1*/5/6 profiles, and results >> unreliable data contents read from disk, depending on the load balance. >> >> Just do not trust any bio from direct IO, and never reuse those bios even >> for nodatasum cases. Instead alloc our own bio with newly allocated >> pages. >> >> For direct read, submit that new bio, and at end io time copy the >> contents to the dio bio. >> For direct write, copy the contents from the dio bio, then submit the >> new one. > > This basically reinvents IOCB_DONTCACHE I/O with duplicate code? This reminds me of the problem that btrfs cannot handle DONTCACHE due to its async extents... I definitely need to address it one day. > >> Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows >> modifying the page cache when it's still under writeback) can lead to >> raid mirror contents mismatch, the 23% performance drop should still be >> acceptable, and bcachefs is already doing this bouncing behavior. > > XFS (and EXT4 as well, but I've not tested it) wait for I/O to > finish before allowing modifications when mapping_stable_writes returns > true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that > is fine. But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for the write to finish. Wouldn't that cause the two mirrors to differ from each other due to timing differences? > Direct I/O is broken, and at least for XFS I have patches > to force DONTCACHE instead of DIRECT I/O by default in that case, but > allowing for an opt-out for known applications (e.g. file or storage > servers). > > I'll need to rebase them, but I plan to send them out soon together > with other T10 PI enabling patches. Sorry, juggling a few too many > things at the moment. > >> But still, such performance drop can be very obvious, and performance >> oriented users (who are very happy running various benchmark tools) are >> going to notice or even complain. > > I've unfortunately seen much bigger performance drops with direct I/O and > PI on fast SSDs, but we still should be safe by default.
> >> Another question is, should we push this behavior to iomap layer so that other >> fses can also benefit from it? > > The right place is above iomap to pick the buffered I/O path instead. But the performance of falling back to buffered IO is so miserable that it wipes out almost a decade or more of storage performance improvements. Thanks, Qu > > The real question is if we can finally get a version of pin_user_pages > that prevents user modifications entirely. ^ permalink raw reply [flat|nested] 27+ messages in thread
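For reference, the wait Qu refers to above is tiny and only triggers when the backing device opts in. Roughly, paraphrased from current mainline (details from memory, check the tree for the authoritative version):

/*
 * mm/page-writeback.c, approximately: the buffered write path reaches
 * this via __filemap_get_folio(FGP_STABLE).  Without AS_STABLE_WRITES,
 * which is set when the device advertises BLK_FEAT_STABLE_WRITES, the
 * folio can be redirtied while an earlier writeback bio is in flight.
 */
void folio_wait_stable(struct folio *folio)
{
	if (mapping_stable_writes(folio_mapping(folio)))
		folio_wait_writeback(folio);
}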
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 10:24 ` Qu Wenruo @ 2025-10-20 11:45 ` Christoph Hellwig 0 siblings, 0 replies; 27+ messages in thread From: Christoph Hellwig @ 2025-10-20 11:45 UTC (permalink / raw) To: Qu Wenruo Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon, Oct 20, 2025 at 08:54:49PM +1030, Qu Wenruo wrote: > > > 在 2025/10/20 20:30, Christoph Hellwig 写道: > > On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote: > > > There is a bug report about that direct IO (and even concurrent buffered > > > IO) can lead to different contents of md-raid. > > > > What concurrent buffered I/O? > > filemap_get_folio(), for address spaces with STABEL_WRITES, there will be a > folio_wait_stable() call to wait for writeback. > > But since almost no device (except md-raid56) set that flag, if a folio is > still under writeback, XFS/EXT4 can still modify that folio (since it's not > locked, just under writeback) for new incoming buffered writes. All devices with protection information set it. Now for raid1 without the bitmap there is no expectation that anything is in sync. Raid 1 with a bitmap should probably set the flag as well. > > > > > > It's exactly the situation we fixed for direct IO in commit 968f19c5b1b7 > > > ("btrfs: always fallback to buffered write if the inode requires > > > checksum"), however we still leave a hole for nodatasum cases. > > > > > > For nodatasum cases we still reuse the bio from direct IO, making it to > > > cause the same problem for RAID1*/5/6 profiles, and results > > > unreliable data contents read from disk, depending on the load balance. > > > > > > Just do not trust any bio from direct IO, and never reuse those bios even > > > for nodatasum cases. Instead alloc our own bio with newly allocated > > > pages. > > > > > > For direct read, submit that new bio, and at end io time copy the > > > contents to the dio bio. > > > For direct write, copy the contents from the dio bio, then submit the > > > new one. > > > > This basically reinvents IOCB_DONTCACHE I/O with duplicate code? > > This reminds me the problem that btrfs can not handle DONTCACHE due to its > async extents... > > I definitely need to address it one day. > > > > > > Considering the zero-copy direct IO (and the fact XFS/EXT4 even allows > > > modifying the page cache when it's still under writeback) can lead to > > > raid mirror contents mismatch, the 23% performance drop should still be > > > acceptable, and bcachefs is already doing this bouncing behavior. > > > > XFS (and EXT4 as well, but I've not tested it) wait for I/O to > > finish before allowing modifications when mapping_stable_writes returns > > true, i.e., when the block device sets BLK_FEAT_STABLE_WRITES, so that > > is fine. > > But md-raid1 doesn't set STABLE_WRITES, thus XFS/EXT4 won't wait for write > to finish. > > Wouldn't that cause two mirrors to differ from each other due to timing > difference? > > > Direct I/O is broken, and at least for XFS I have patches > > to force DONTCACHE instead of DIRECT I/O by default in that case, but > > allowing for an opt-out for known applications (e.g. file or storage > > servers). > > > > I'll need to rebase them, but I plan to send them out soon together > > with other T10 PI enabling patches. Sorry, juggling a few too many > > things at the moment. 
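The mechanics of "setting the flag" are small with the queue_limits API. A hedged sketch of what opting raid1-with-bitmap in could look like (BLK_FEAT_STABLE_WRITES and queue_limits.features are real, but the md integration point and the helper below are assumptions, not a tested patch):

/*
 * Sketch only: request stable pages from the page cache when a
 * write-intent bitmap is present, since only then is there an
 * expectation that the mirrors stay bit-for-bit identical.
 */
static void raid1_stable_writes_sketch(struct mddev *mddev,
				       struct queue_limits *lim)
{
	if (mddev->bitmap)
		lim->features |= BLK_FEAT_STABLE_WRITES;
}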
> > > > > But still, such performance drop can be very obvious, and performance > > > oriented users (who are very happy running various benchmark tools) are > > > going to notice or even complain. > > > > I've unfortunately seen much bigger performance drops with direct I/O and > > PI on fast SSDs, but we still should be safe by default. > > > > > Another question is, should we push this behavior to iomap layer so that other > > > fses can also benefit from it? > > > > The right place is above iomap to pick the buffered I/O path instead. > > But falling back to buffered IO performance is so miserable that wiped out > almost one or more decades of storage performance improvement. > > Thanks, > Qu > > > > > The real question is if we can finally get a version of pin_user_pages > > that prevents user modifications entirely. ---end quoted text--- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig 2025-10-20 10:24 ` Qu Wenruo @ 2025-10-20 11:16 ` Jan Kara 2025-10-20 11:44 ` Christoph Hellwig 2025-10-21 3:17 ` Qu Wenruo 2 siblings, 1 reply; 27+ messages in thread From: Jan Kara @ 2025-10-20 11:16 UTC (permalink / raw) To: Christoph Hellwig Cc: Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon 20-10-25 03:00:43, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 07:49:50PM +1030, Qu Wenruo wrote: > > But still, such performance drop can be very obvious, and performance > > oriented users (who are very happy running various benchmark tools) are > > going to notice or even complain. > > I've unfortunately seen much bigger performance drops with direct I/O and > PI on fast SSDs, but we still should be safe by default. > > > Another question is, should we push this behavior to iomap layer so that other > > fses can also benefit from it? > > The right place is above iomap to pick the buffered I/O path instead. > > The real question is if we can finally get a version of pin_user_pages > that prevents user modifications entirely. Hmm, this is an interesting twist in the problems with pinned pages - so far I was thinking about problems where pinned page cache page gets modified (e.g. through DIO or RDMA) and this causes checksum failures if it races with writeback. If I understand you right, now you are concerned about a situation where some page is used as a buffer for direct IO write / RDMA and it gets modified while the DMA is running which causes checksum mismatch? Writeprotecting the buffer before the DIO starts isn't that hard to do (although it has a non-trivial cost) but we don't have a mechanism to make sure the page cannot be writeably mapped while it is pinned (and avoiding that without introducing deadlocks would be *fun*). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 11:16 ` Jan Kara @ 2025-10-20 11:44 ` Christoph Hellwig 2025-10-20 13:59 ` Jan Kara 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2025-10-20 11:44 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon, Oct 20, 2025 at 01:16:39PM +0200, Jan Kara wrote: > Hmm, this is an interesting twist in the problems with pinned pages - so > far I was thinking about problems where pinned page cache page gets > modified (e.g. through DIO or RDMA) and this causes checksum failures if > it races with writeback. If I understand you right, now you are concerned > about a situation where some page is used as a buffer for direct IO write > / RDMA and it gets modified while the DMA is running which causes checksum > mismatch? Really all of the above. Even worse this can also happen for reads, e.g. when the parity or checksum is calculated in the user buffer. > Writeprotecting the buffer before the DIO starts isn't that hard > to do (although it has a non-trivial cost) but we don't have a mechanism to > make sure the page cannot be writeably mapped while it is pinned (and > avoiding that without introducing deadlocks would be *fun*). Well, this goes back to the old idea of maybe bounce buffering in that case? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 11:44 ` Christoph Hellwig @ 2025-10-20 13:59 ` Jan Kara 2025-10-20 14:59 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Jan Kara @ 2025-10-20 13:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon 20-10-25 04:44:27, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 01:16:39PM +0200, Jan Kara wrote: > > Hmm, this is an interesting twist in the problems with pinned pages - so > > far I was thinking about problems where pinned page cache page gets > > modified (e.g. through DIO or RDMA) and this causes checksum failures if > > it races with writeback. If I understand you right, now you are concerned > > about a situation where some page is used as a buffer for direct IO write > > / RDMA and it gets modified while the DMA is running which causes checksum > > mismatch? > > Really all of the above. Even worse this can also happen for reads, > e.g. when the parity or checksum is calculated in the user buffer. OK. > > Writeprotecting the buffer before the DIO starts isn't that hard > > to do (although it has a non-trivial cost) but we don't have a mechanism to > > make sure the page cannot be writeably mapped while it is pinned (and > > avoiding that without introducing deadlocks would be *fun*). > > Well, this goes back to the old idea of maybe bounce buffering in that > case? The idea was to bounce buffer the page we are writing back in case we spot a long-term pin we cannot just wait for - hence bouncing should be rare. But in this more general setting it is challenging to not bounce buffer for every IO (in which case you'd basically be at the performance of RWF_DONTCACHE IO, or perhaps worse, so why bother?). Essentially if you hand out the real page underlying the buffer for the IO, all other attempts to do IO to that page have to block - bouncing is no longer an option because even with bouncing the second IO we could still corrupt data of the first IO once we copy to the final buffer. And if we'd block waiting for the first IO to complete, userspace could construct deadlock cycles - like racing IO to pages A, B with IO to pages B, A. So far I'm not sure about a sane way out of this... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
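To make the trade-off concrete, a minimal sketch of the "bounce only when pinned" variant Jan describes. folio_maybe_dma_pinned() is a real (heuristic) helper; the function itself is an illustrative assumption and, for brevity, assumes order-0 folios:

static struct page *writeback_stable_page_sketch(struct folio *folio)
{
	struct page *copy;

	/* No pin spotted: write the page cache page out in place. */
	if (!folio_maybe_dma_pinned(folio))
		return folio_page(folio, 0);

	/* Possible long-term pin: write out a private copy instead. */
	copy = alloc_page(GFP_NOFS);
	if (!copy)
		return NULL;	/* caller would wait or retry */
	memcpy_from_folio(page_address(copy), folio, 0, PAGE_SIZE);
	return copy;		/* the writeback bio targets the copy */
}

The copy itself is the easy part; the problem Jan points out is deciding, without deadlocking, what a second IO to the same page is allowed to do while the first one is still in flight.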
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 13:59 ` Jan Kara @ 2025-10-20 14:59 ` Matthew Wilcox 2025-10-20 15:58 ` Jan Kara 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2025-10-20 14:59 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: > The idea was to bounce buffer the page we are writing back in case we spot > a long-term pin we cannot just wait for - hence bouncing should be rare. > But in this more general setting it is challenging to not bounce buffer for > every IO (in which case you'd be basically at performance of RWF_DONTCACHE > IO or perhaps worse so why bother?). Essentially if you hand out the real > page underlying the buffer for the IO, all other attemps to do IO to that > page have to block - bouncing is no longer an option because even with > bouncing the second IO we could still corrupt data of the first IO once we > copy to the final buffer. And if we'd block waiting for the first IO to > complete, userspace could construct deadlock cycles - like racing IO to > pages A, B with IO to pages B, A. So far I'm not sure about a sane way out > of this... There isn't one. We might have DMA-mapped this page earlier, and so a device could write to it at any time. Even if we remove PTE write permissions ... ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 14:59 ` Matthew Wilcox @ 2025-10-20 15:58 ` Jan Kara 2025-10-20 17:55 ` John Hubbard 2025-10-20 19:00 ` David Hildenbrand 0 siblings, 2 replies; 27+ messages in thread From: Jan Kara @ 2025-10-20 15:58 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon 20-10-25 15:59:07, Matthew Wilcox wrote: > On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: > > The idea was to bounce buffer the page we are writing back in case we spot > > a long-term pin we cannot just wait for - hence bouncing should be rare. > > But in this more general setting it is challenging to not bounce buffer for > > every IO (in which case you'd be basically at performance of RWF_DONTCACHE > > IO or perhaps worse so why bother?). Essentially if you hand out the real > > page underlying the buffer for the IO, all other attemps to do IO to that > > page have to block - bouncing is no longer an option because even with > > bouncing the second IO we could still corrupt data of the first IO once we > > copy to the final buffer. And if we'd block waiting for the first IO to > > complete, userspace could construct deadlock cycles - like racing IO to > > pages A, B with IO to pages B, A. So far I'm not sure about a sane way out > > of this... > > There isn't one. We might have DMA-mapped this page earlier, and so a > device could write to it at any time. Even if we remove PTE write > permissions ... True but writes through DMA to the page are guarded by holding a page pin these days so we could in theory block getting another page pin or mapping the page writeably until the pin is released... if we can figure out a convincing story for dealing with long-term pins from RDMA and dealing with possible deadlocks created by this. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 15:58 ` Jan Kara @ 2025-10-20 17:55 ` John Hubbard 2025-10-21 8:27 ` Jan Kara 2025-10-20 19:00 ` David Hildenbrand 1 sibling, 1 reply; 27+ messages in thread From: John Hubbard @ 2025-10-20 17:55 UTC (permalink / raw) To: Jan Kara, Matthew Wilcox Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 10/20/25 8:58 AM, Jan Kara wrote: > On Mon 20-10-25 15:59:07, Matthew Wilcox wrote: >> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: >>> The idea was to bounce buffer the page we are writing back in case we spot >>> a long-term pin we cannot just wait for - hence bouncing should be rare. >>> But in this more general setting it is challenging to not bounce buffer for >>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE >>> IO or perhaps worse so why bother?). Essentially if you hand out the real >>> page underlying the buffer for the IO, all other attemps to do IO to that >>> page have to block - bouncing is no longer an option because even with >>> bouncing the second IO we could still corrupt data of the first IO once we >>> copy to the final buffer. And if we'd block waiting for the first IO to >>> complete, userspace could construct deadlock cycles - like racing IO to >>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out >>> of this... >> >> There isn't one. We might have DMA-mapped this page earlier, and so a >> device could write to it at any time. Even if we remove PTE write >> permissions ... > > True but writes through DMA to the page are guarded by holding a page pin > these days so we could in theory block getting another page pin or mapping Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM page pin"? Or something else (that's new to me)? thanks, John Hubbard > the page writeably until the pin is released... if we can figure out a > convincing story for dealing with long-term pins from RDMA and dealing with > possible deadlocks created by this. > > Honza ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 17:55 ` John Hubbard @ 2025-10-21 8:27 ` Jan Kara 2025-10-21 16:56 ` John Hubbard 0 siblings, 1 reply; 27+ messages in thread From: Jan Kara @ 2025-10-21 8:27 UTC (permalink / raw) To: John Hubbard Cc: Jan Kara, Matthew Wilcox, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon 20-10-25 10:55:06, John Hubbard wrote: > On 10/20/25 8:58 AM, Jan Kara wrote: > > On Mon 20-10-25 15:59:07, Matthew Wilcox wrote: > > > On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: > > > > The idea was to bounce buffer the page we are writing back in case we spot > > > > a long-term pin we cannot just wait for - hence bouncing should be rare. > > > > But in this more general setting it is challenging to not bounce buffer for > > > > every IO (in which case you'd be basically at performance of RWF_DONTCACHE > > > > IO or perhaps worse so why bother?). Essentially if you hand out the real > > > > page underlying the buffer for the IO, all other attemps to do IO to that > > > > page have to block - bouncing is no longer an option because even with > > > > bouncing the second IO we could still corrupt data of the first IO once we > > > > copy to the final buffer. And if we'd block waiting for the first IO to > > > > complete, userspace could construct deadlock cycles - like racing IO to > > > > pages A, B with IO to pages B, A. So far I'm not sure about a sane way out > > > > of this... > > > > > > There isn't one. We might have DMA-mapped this page earlier, and so a > > > device could write to it at any time. Even if we remove PTE write > > > permissions ... > > > > True but writes through DMA to the page are guarded by holding a page pin > > these days so we could in theory block getting another page pin or mapping > > Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM > page pin"? Or something else (that's new to me)? I meant to say that users that end up setting up DMA to a page also hold a page pin (either longterm for RDMA and similar users or shortterm for direct IO). Do you disagree? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 8:27 ` Jan Kara @ 2025-10-21 16:56 ` John Hubbard 0 siblings, 0 replies; 27+ messages in thread From: John Hubbard @ 2025-10-21 16:56 UTC (permalink / raw) To: Jan Kara Cc: Matthew Wilcox, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 10/21/25 1:27 AM, Jan Kara wrote: > On Mon 20-10-25 10:55:06, John Hubbard wrote: >> On 10/20/25 8:58 AM, Jan Kara wrote: >>> On Mon 20-10-25 15:59:07, Matthew Wilcox wrote: >>>> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: >>>>> The idea was to bounce buffer the page we are writing back in case we spot >>>>> a long-term pin we cannot just wait for - hence bouncing should be rare. >>>>> But in this more general setting it is challenging to not bounce buffer for >>>>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE >>>>> IO or perhaps worse so why bother?). Essentially if you hand out the real >>>>> page underlying the buffer for the IO, all other attemps to do IO to that >>>>> page have to block - bouncing is no longer an option because even with >>>>> bouncing the second IO we could still corrupt data of the first IO once we >>>>> copy to the final buffer. And if we'd block waiting for the first IO to >>>>> complete, userspace could construct deadlock cycles - like racing IO to >>>>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out >>>>> of this... >>>> >>>> There isn't one. We might have DMA-mapped this page earlier, and so a >>>> device could write to it at any time. Even if we remove PTE write >>>> permissions ... >>> >>> True but writes through DMA to the page are guarded by holding a page pin >>> these days so we could in theory block getting another page pin or mapping >> >> Do you mean, "setting up to do DMA is guarded by holding a FOLL_LONGTERM >> page pin"? Or something else (that's new to me)? > > I meant to say that users that end up setting up DMA to a page also hold a > page pin (either longterm for RDMA and similar users or shortterm for > direct IO). Do you disagree? > > Honza Completely agree. I see what you have in mind now. I was hung up on the "page pins won't prevent DMA from happening, once it is set up" point, but your idea is to detect that that is already set up, and make decisions from there...that part I get now. thanks, -- John Hubbard ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 15:58 ` Jan Kara 2025-10-20 17:55 ` John Hubbard @ 2025-10-20 19:00 ` David Hildenbrand 2025-10-21 7:49 ` Christoph Hellwig 1 sibling, 1 reply; 27+ messages in thread From: David Hildenbrand @ 2025-10-20 19:00 UTC (permalink / raw) To: Jan Kara, Matthew Wilcox Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 20.10.25 17:58, Jan Kara wrote: > On Mon 20-10-25 15:59:07, Matthew Wilcox wrote: >> On Mon, Oct 20, 2025 at 03:59:33PM +0200, Jan Kara wrote: >>> The idea was to bounce buffer the page we are writing back in case we spot >>> a long-term pin we cannot just wait for - hence bouncing should be rare. >>> But in this more general setting it is challenging to not bounce buffer for >>> every IO (in which case you'd be basically at performance of RWF_DONTCACHE >>> IO or perhaps worse so why bother?). Essentially if you hand out the real >>> page underlying the buffer for the IO, all other attemps to do IO to that >>> page have to block - bouncing is no longer an option because even with >>> bouncing the second IO we could still corrupt data of the first IO once we >>> copy to the final buffer. And if we'd block waiting for the first IO to >>> complete, userspace could construct deadlock cycles - like racing IO to >>> pages A, B with IO to pages B, A. So far I'm not sure about a sane way out >>> of this... >> >> There isn't one. We might have DMA-mapped this page earlier, and so a >> device could write to it at any time. Even if we remove PTE write >> permissions ... > > True but writes through DMA to the page are guarded by holding a page pin > these days so we could in theory block getting another page pin or mapping > the page writeably until the pin is released... if we can figure out a > convincing story for dealing with long-term pins from RDMA and dealing with > possible deadlocks created by this. Just FYI, because it might be interesting in this context. For anonymous memory we have this working by only writing the folio out if it is completely unmapped and there are no unexpected folio references/pins (see pageout()), and only allowing to write to such a folio ("reuse") if SWP_STABLE_WRITES is not set (see do_swap_page()). So once we start writeback the folio has no writable page table mappings (unmapped) and no GUP pins. Consequently, when trying to write to it we can just fallback to creating a page copy without causing trouble with GUP pins. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 19:00 ` David Hildenbrand @ 2025-10-21 7:49 ` Christoph Hellwig 2025-10-21 7:57 ` David Hildenbrand 2025-10-21 9:22 ` Jan Kara 0 siblings, 2 replies; 27+ messages in thread From: Christoph Hellwig @ 2025-10-21 7:49 UTC (permalink / raw) To: David Hildenbrand Cc: Jan Kara, Matthew Wilcox, Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: > Just FYI, because it might be interesting in this context. > > For anonymous memory we have this working by only writing the folio out if > it is completely unmapped and there are no unexpected folio references/pins > (see pageout()), and only allowing to write to such a folio ("reuse") if > SWP_STABLE_WRITES is not set (see do_swap_page()). > > So once we start writeback the folio has no writable page table mappings > (unmapped) and no GUP pins. Consequently, when trying to write to it we can > just fallback to creating a page copy without causing trouble with GUP pins. Yeah. But anonymous is the easy case, the pain is direct I/O to file mappings. Maybe the right answer is to just fail pinning them and fall back to (dontcache) buffered I/O. ^ permalink raw reply [flat|nested] 27+ messages in thread
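A rough sketch of the shape of that fallback, assuming pinning can report "not safe for zero-copy". IOCB_DONTCACHE and the -ENOTBLK fallback convention are real; fs_direct_write() and fs_buffered_write() are hypothetical placeholders:

/*
 * Hypothetical wrapper: try zero-copy direct I/O first, and if the user
 * pages cannot be pinned safely, redo the write as uncached buffered I/O.
 */
static ssize_t fs_write_iter_sketch(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t ret = fs_direct_write(iocb, from);	/* placeholder */

	if (ret != -ENOTBLK)
		return ret;

	/* Fall back: buffered, but dropped from the cache after writeback. */
	iocb->ki_flags &= ~IOCB_DIRECT;
	iocb->ki_flags |= IOCB_DONTCACHE;
	return fs_buffered_write(iocb, from);		/* placeholder */
}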
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 7:49 ` Christoph Hellwig @ 2025-10-21 7:57 ` David Hildenbrand 2025-10-21 9:33 ` Jan Kara 2025-10-21 9:22 ` Jan Kara 1 sibling, 1 reply; 27+ messages in thread From: David Hildenbrand @ 2025-10-21 7:57 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 21.10.25 09:49, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: >> Just FYI, because it might be interesting in this context. >> >> For anonymous memory we have this working by only writing the folio out if >> it is completely unmapped and there are no unexpected folio references/pins >> (see pageout()), and only allowing to write to such a folio ("reuse") if >> SWP_STABLE_WRITES is not set (see do_swap_page()). >> >> So once we start writeback the folio has no writable page table mappings >> (unmapped) and no GUP pins. Consequently, when trying to write to it we can >> just fallback to creating a page copy without causing trouble with GUP pins. > > Yeah. But anonymous is the easy case, the pain is direct I/O to file > mappings. Mapping the right answer is to just fail pinning them and fall > back to (dontcache) buffered I/O. Right, I think the rules could likely be a) Don't start writeback to such devices if there may be GUP pins (o writeble PTEs) b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device Regarding b), I would have thought that GUP would find the PTE to not be writable and consequently trigger a page fault first to make it writable? And I'd have thought that we cannot make such a PTE writable while there is writeback to such a device going on (otherwise the CPU could just cause trouble). -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 7:57 ` David Hildenbrand @ 2025-10-21 9:33 ` Jan Kara 2025-10-21 9:43 ` David Hildenbrand 0 siblings, 1 reply; 27+ messages in thread From: Jan Kara @ 2025-10-21 9:33 UTC (permalink / raw) To: David Hildenbrand Cc: Christoph Hellwig, Jan Kara, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Tue 21-10-25 09:57:08, David Hildenbrand wrote: > On 21.10.25 09:49, Christoph Hellwig wrote: > > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: > > > Just FYI, because it might be interesting in this context. > > > > > > For anonymous memory we have this working by only writing the folio out if > > > it is completely unmapped and there are no unexpected folio references/pins > > > (see pageout()), and only allowing to write to such a folio ("reuse") if > > > SWP_STABLE_WRITES is not set (see do_swap_page()). > > > > > > So once we start writeback the folio has no writable page table mappings > > > (unmapped) and no GUP pins. Consequently, when trying to write to it we can > > > just fallback to creating a page copy without causing trouble with GUP pins. > > > > Yeah. But anonymous is the easy case, the pain is direct I/O to file > > mappings. Mapping the right answer is to just fail pinning them and fall > > back to (dontcache) buffered I/O. > > Right, I think the rules could likely be > > a) Don't start writeback to such devices if there may be GUP pins (o > writeble PTEs) > > b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device > > Regarding b), I would have thought that GUP would find the PTE to not be > writable and consequently trigger a page fault first to make it writable? > And I'd have thought that we cannot make such a PTE writable while there is > writeback to such a device going on (otherwise the CPU could just cause > trouble). See some of the cases in my reply to Christoph. It is also stuff like: c) Don't allow FOLL_WRITE GUP pins or writeable mapping if there are *any* pins to the page. And we'd have to write-protect the page in the page tables at the moment we obtain the FOLL_WRITE GUP pin to make sure the pin owner is the only thread able to modify that page contents while the DIO is running. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 9:33 ` Jan Kara @ 2025-10-21 9:43 ` David Hildenbrand 0 siblings, 0 replies; 27+ messages in thread From: David Hildenbrand @ 2025-10-21 9:43 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 21.10.25 11:33, Jan Kara wrote: > On Tue 21-10-25 09:57:08, David Hildenbrand wrote: >> On 21.10.25 09:49, Christoph Hellwig wrote: >>> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: >>>> Just FYI, because it might be interesting in this context. >>>> >>>> For anonymous memory we have this working by only writing the folio out if >>>> it is completely unmapped and there are no unexpected folio references/pins >>>> (see pageout()), and only allowing to write to such a folio ("reuse") if >>>> SWP_STABLE_WRITES is not set (see do_swap_page()). >>>> >>>> So once we start writeback the folio has no writable page table mappings >>>> (unmapped) and no GUP pins. Consequently, when trying to write to it we can >>>> just fallback to creating a page copy without causing trouble with GUP pins. >>> >>> Yeah. But anonymous is the easy case, the pain is direct I/O to file >>> mappings. Mapping the right answer is to just fail pinning them and fall >>> back to (dontcache) buffered I/O. >> >> Right, I think the rules could likely be >> >> a) Don't start writeback to such devices if there may be GUP pins (o >> writeble PTEs) >> >> b) Don't allow FOLL_WRITE GUP pins if there is writeback to such a device >> >> Regarding b), I would have thought that GUP would find the PTE to not be >> writable and consequently trigger a page fault first to make it writable? >> And I'd have thought that we cannot make such a PTE writable while there is >> writeback to such a device going on (otherwise the CPU could just cause >> trouble). > > See some of the cases in my reply to Christoph. It is also stuff like: > > c) Don't allow FOLL_WRITE GUP pins or writeable mapping if there are *any* > pins to the page. > > And we'd have to write-protect the page in the page tables at the moment we > obtain the FOLL_WRITE GUP pin to make sure the pin owner is the only thread > able to modify that page contents while the DIO is running. Oh that's nasty, but yeah I understood the problem now, thanks. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 7:49 ` Christoph Hellwig 2025-10-21 7:57 ` David Hildenbrand @ 2025-10-21 9:22 ` Jan Kara 2025-10-21 9:37 ` David Hildenbrand 1 sibling, 1 reply; 27+ messages in thread From: Jan Kara @ 2025-10-21 9:22 UTC (permalink / raw) To: Christoph Hellwig Cc: David Hildenbrand, Jan Kara, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Tue 21-10-25 00:49:49, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: > > Just FYI, because it might be interesting in this context. > > > > For anonymous memory we have this working by only writing the folio out if > > it is completely unmapped and there are no unexpected folio references/pins > > (see pageout()), and only allowing to write to such a folio ("reuse") if > > SWP_STABLE_WRITES is not set (see do_swap_page()). > > > > So once we start writeback the folio has no writable page table mappings > > (unmapped) and no GUP pins. Consequently, when trying to write to it we can > > just fallback to creating a page copy without causing trouble with GUP pins. > > Yeah. But anonymous is the easy case, the pain is direct I/O to file > mappings. Mapping the right answer is to just fail pinning them and fall > back to (dontcache) buffered I/O. I agree file mappings are more painful but we can also have interesting cases with anon pages: P - anon page Thread 1 Thread 2 setup DIO read to P setup DIO write from P And now you can get checksum failures for the write unless the write is bounced (falling back to dontcache). Similarly with reads: Thread 1 Thread 2 setup DIO read to P setup DIO read to P you can get read checksum mismatch unless both reads are bounced (bouncing one of the reads is not enough because the memcpy from the bounce page to the final buffer may break checksum computation of the IO going directly). So to avoid checksum failures even if user screws up and buffers overlap we need to bounce every IO even to/from anon memory. Or we need to block one of the IOs until the other one completes - a scheme that could work is we'd try to acquire kind of exclusive pin to all the pages (page lock?). If we succeed, we run the IO directly. If we don't succeed, we wait for the exclusive pins to be released, acquire standard pin (to block exclusive pinning) and *then* submit uncached IO. But it is all rather complex and I'm not sure it's worth it... For file mappings things get even more complex because you can do: P - file mapping page Thread 1 Thread 2 setup DIO write from P setup buffered write from Q to P and you get checksum failures for the DIO write. So if we don't bounce the DIO, we'd also have to teach buffered IO to avoid corrupting buffers of DIO in flight. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 9:22 ` Jan Kara @ 2025-10-21 9:37 ` David Hildenbrand 2025-10-21 9:52 ` Jan Kara 0 siblings, 1 reply; 27+ messages in thread From: David Hildenbrand @ 2025-10-21 9:37 UTC (permalink / raw) To: Jan Kara, Christoph Hellwig Cc: Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 21.10.25 11:22, Jan Kara wrote: > On Tue 21-10-25 00:49:49, Christoph Hellwig wrote: >> On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: >>> Just FYI, because it might be interesting in this context. >>> >>> For anonymous memory we have this working by only writing the folio out if >>> it is completely unmapped and there are no unexpected folio references/pins >>> (see pageout()), and only allowing to write to such a folio ("reuse") if >>> SWP_STABLE_WRITES is not set (see do_swap_page()). >>> >>> So once we start writeback the folio has no writable page table mappings >>> (unmapped) and no GUP pins. Consequently, when trying to write to it we can >>> just fallback to creating a page copy without causing trouble with GUP pins. >> >> Yeah. But anonymous is the easy case, the pain is direct I/O to file >> mappings. Mapping the right answer is to just fail pinning them and fall >> back to (dontcache) buffered I/O. > > I agree file mappings are more painful but we can also have interesting > cases with anon pages: > > P - anon page > > Thread 1 Thread 2 > setup DIO read to P setup DIO write from P Ah, I was talking about the interaction between GUP and having BLK_FEAT_STABLE_WRITES set on the swap backend. I guess what you mean here is: GUP from/to anon pages to/from a device that has BLK_FEAT_STABLE_WRITES? So while we are writing to the device using the anon page as a source, the anon page will get modified. I did not expect that to trigger checksum failures, but I can see the problem now. -- Cheers David / dhildenb ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 9:37 ` David Hildenbrand @ 2025-10-21 9:52 ` Jan Kara 0 siblings, 0 replies; 27+ messages in thread From: Jan Kara @ 2025-10-21 9:52 UTC (permalink / raw) To: David Hildenbrand Cc: Jan Kara, Christoph Hellwig, Matthew Wilcox, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Tue 21-10-25 11:37:57, David Hildenbrand wrote: > On 21.10.25 11:22, Jan Kara wrote: > > On Tue 21-10-25 00:49:49, Christoph Hellwig wrote: > > > On Mon, Oct 20, 2025 at 09:00:50PM +0200, David Hildenbrand wrote: > > > > Just FYI, because it might be interesting in this context. > > > > > > > > For anonymous memory we have this working by only writing the folio out if > > > > it is completely unmapped and there are no unexpected folio references/pins > > > > (see pageout()), and only allowing to write to such a folio ("reuse") if > > > > SWP_STABLE_WRITES is not set (see do_swap_page()). > > > > > > > > So once we start writeback the folio has no writable page table mappings > > > > (unmapped) and no GUP pins. Consequently, when trying to write to it we can > > > > just fallback to creating a page copy without causing trouble with GUP pins. > > > > > > Yeah. But anonymous is the easy case, the pain is direct I/O to file > > > mappings. Mapping the right answer is to just fail pinning them and fall > > > back to (dontcache) buffered I/O. > > > > I agree file mappings are more painful but we can also have interesting > > cases with anon pages: > > > > P - anon page > > > > Thread 1 Thread 2 > > setup DIO read to P setup DIO write from P > > Ah, I was talking about the interaction between GUP and having > BLK_FEAT_STABLE_WRITES set on the swap backend. > > I guess what you mean here is: GUP from/to anon pages to/from a device that > has BLK_FEAT_STABLE_WRITES? Correct. > So while we are writing to the device using the anon page as a source, the > anon page will get modified. > > I did not expect that to trigger checksum failures, but I can see the > problem now. Sadly it can because the checksum computation may end up using different data than the DMA will send to the device a bit later. After all this is why BLK_FEAT_STABLE_WRITES was invented... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig 2025-10-20 10:24 ` Qu Wenruo 2025-10-20 11:16 ` Jan Kara @ 2025-10-21 3:17 ` Qu Wenruo 2025-10-21 7:48 ` Christoph Hellwig 2 siblings, 1 reply; 27+ messages in thread From: Qu Wenruo @ 2025-10-21 3:17 UTC (permalink / raw) To: Christoph Hellwig, Qu Wenruo Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On 2025/10/20 20:30, Christoph Hellwig wrote: > >> But still, such performance drop can be very obvious, and performance >> oriented users (who are very happy running various benchmark tools) are >> going to notice or even complain. > > I've unfortunately seen much bigger performance drops with direct I/O and > PI on fast SSDs, but we still should be safe by default. Off-topic a little, but would you mind sharing the performance drop with PI enabled on XFS? With this patch I'm able to enable direct IO for inodes with checksums. I thought it would easily improve the performance, but the truth is, it's not that different from the buffered IO fallback. It's still the old 200MiB/s (vs ~2GiB/s nodatasum), no matter whether it falls back to buffered IO or not (and extra traces indeed show it's really going through the direct IO path). So I'm starting to wonder if it's the checksum itself causing the miserable performance numbers. Thanks, Qu ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 3:17 ` Qu Wenruo @ 2025-10-21 7:48 ` Christoph Hellwig 2025-10-21 8:15 ` Qu Wenruo 0 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2025-10-21 7:48 UTC (permalink / raw) To: Qu Wenruo Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote: > Off-topic a little, mind to share the performance drop with PI enabled on > XFS? If the bandwith of the SSDs get close or exceeds the DRAM bandwith buffered I/O can be 50% or less of the direct I/O performance. > With this patch I'm able to enable direct IO for inodes with checksums. > I thought it would easily improve the performance, but the truth is, it's > not that different from buffered IO fall back. That's because you still copy data. > So I start wondering if it's the checksum itself causing the miserable > performance numbers. Only indirectly by touching all the cachelines. But once you copy you touch them again. Especially if not done in small chunks. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 7:48 ` Christoph Hellwig @ 2025-10-21 8:15 ` Qu Wenruo 2025-10-21 11:30 ` Johannes Thumshirn 0 siblings, 1 reply; 27+ messages in thread From: Qu Wenruo @ 2025-10-21 8:15 UTC (permalink / raw) To: Christoph Hellwig, Qu Wenruo Cc: linux-btrfs, djwong, linux-xfs, linux-fsdevel, linux-block, linux-mm, martin.petersen, jack 在 2025/10/21 18:18, Christoph Hellwig 写道: > On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote: >> Off-topic a little, mind to share the performance drop with PI enabled on >> XFS? > > If the bandwith of the SSDs get close or exceeds the DRAM bandwith > buffered I/O can be 50% or less of the direct I/O performance. In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less than 5GiB/s). > >> With this patch I'm able to enable direct IO for inodes with checksums. >> I thought it would easily improve the performance, but the truth is, it's >> not that different from buffered IO fall back. > > That's because you still copy data. Enabling the extra copy for direct IO only drops around 15~20% performance, but that's on no csum case. So far the calculation matches your estimation, but... > >> So I start wondering if it's the checksum itself causing the miserable >> performance numbers. > > Only indirectly by touching all the cachelines. But once you copy you > touch them again. Especially if not done in small chunks. As long as I enable checksum verification, even with the bouncing page direct IO, the result is not any better than buffered IO fallback, all around 10% (not by 10%, at 10%) of the direct IO speed (no matter bouncing or not). Maybe I need to check if the proper hardware accelerated CRC32 is utilized... Thanks, Qu ^ permalink raw reply [flat|nested] 27+ messages in thread
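One way to check that is to time the library crc32c() directly inside the guest; a throwaway module along these lines works (illustrative only, the buffer size and loop count are arbitrary and not from this thread):

#include <linux/module.h>
#include <linux/crc32c.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/ktime.h>

static int __init crc_bench_init(void)
{
	const size_t len = 1 << 20;	/* 1 MiB */
	void *buf = vmalloc(len);
	ktime_t t0, t1;
	u32 csum = ~0;
	int i;

	if (!buf)
		return -ENOMEM;
	memset(buf, 0x5a, len);

	t0 = ktime_get();
	for (i = 0; i < 1024; i++)	/* 1 GiB total */
		csum = crc32c(csum, buf, len);
	t1 = ktime_get();

	pr_info("crc32c: 1 GiB in %lld us (csum %08x)\n",
		ktime_us_delta(t1, t0), csum);
	vfree(buf);
	return 0;
}

static void __exit crc_bench_exit(void)
{
}

module_init(crc_bench_init);
module_exit(crc_bench_exit);
MODULE_LICENSE("GPL");

If the guest does not expose the CPU features needed by the accelerated crc32c implementation, the generic table-based one gets used, and that is drastically slower.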
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 8:15 ` Qu Wenruo @ 2025-10-21 11:30 ` Johannes Thumshirn 2025-10-22 2:27 ` Qu Wenruo 0 siblings, 1 reply; 27+ messages in thread From: Johannes Thumshirn @ 2025-10-21 11:30 UTC (permalink / raw) To: WenRuo Qu, hch@infradead.org, Qu Wenruo Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com On 10/21/25 10:15 AM, Qu Wenruo wrote: > > 在 2025/10/21 18:18, Christoph Hellwig 写道: >> On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote: >>> Off-topic a little, mind to share the performance drop with PI enabled on >>> XFS? >> If the bandwith of the SSDs get close or exceeds the DRAM bandwith >> buffered I/O can be 50% or less of the direct I/O performance. > In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less > than 5GiB/s). > >>> With this patch I'm able to enable direct IO for inodes with checksums. >>> I thought it would easily improve the performance, but the truth is, it's >>> not that different from buffered IO fall back. >> That's because you still copy data. > Enabling the extra copy for direct IO only drops around 15~20% > performance, but that's on no csum case. > > So far the calculation matches your estimation, but... > >>> So I start wondering if it's the checksum itself causing the miserable >>> performance numbers. >> Only indirectly by touching all the cachelines. But once you copy you >> touch them again. Especially if not done in small chunks. > As long as I enable checksum verification, even with the bouncing page > direct IO, the result is not any better than buffered IO fallback, all > around 10% (not by 10%, at 10%) of the direct IO speed (no matter > bouncing or not). > > Maybe I need to check if the proper hardware accelerated CRC32 is > utilized... You could also hack in a NULL-csum for testing. Something that writes a fixed value every time. This would then rule out all the cost of the csum generation and only test the affected IO paths. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-21 11:30 ` Johannes Thumshirn @ 2025-10-22 2:27 ` Qu Wenruo 2025-10-22 5:04 ` hch 0 siblings, 1 reply; 27+ messages in thread From: Qu Wenruo @ 2025-10-22 2:27 UTC (permalink / raw) To: Johannes Thumshirn, WenRuo Qu, hch@infradead.org Cc: linux-btrfs@vger.kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com 在 2025/10/21 22:00, Johannes Thumshirn 写道: > On 10/21/25 10:15 AM, Qu Wenruo wrote: >> >> 在 2025/10/21 18:18, Christoph Hellwig 写道: >>> On Tue, Oct 21, 2025 at 01:47:03PM +1030, Qu Wenruo wrote: >>>> Off-topic a little, mind to share the performance drop with PI enabled on >>>> XFS? >>> If the bandwith of the SSDs get close or exceeds the DRAM bandwith >>> buffered I/O can be 50% or less of the direct I/O performance. >> In my case, the DRAM is way faster than the SSD (tens of GiB/s vs less >> than 5GiB/s). >> >>>> With this patch I'm able to enable direct IO for inodes with checksums. >>>> I thought it would easily improve the performance, but the truth is, it's >>>> not that different from buffered IO fall back. >>> That's because you still copy data. >> Enabling the extra copy for direct IO only drops around 15~20% >> performance, but that's on no csum case. >> >> So far the calculation matches your estimation, but... >> >>>> So I start wondering if it's the checksum itself causing the miserable >>>> performance numbers. >>> Only indirectly by touching all the cachelines. But once you copy you >>> touch them again. Especially if not done in small chunks. >> As long as I enable checksum verification, even with the bouncing page >> direct IO, the result is not any better than buffered IO fallback, all >> around 10% (not by 10%, at 10%) of the direct IO speed (no matter >> bouncing or not). >> >> Maybe I need to check if the proper hardware accelerated CRC32 is >> utilized... > > > You could also hack in a NULL-csum for testing. Something that writes a > fixed value every time. This would then rule out all the cost of the > csum generation and only test the affected IO paths. > It turns out to be checksum, and particularly my VM setup. My VM is using kvm64 CPU type, which blocks quite a lot of CPU features, thus the CRC32 performance is pretty poor. I just tried a short hack to always make direct IO to fallback to buffered IO, the nodatasum performance is the same as the bouncing page solution, so the slow down is not page cache itself but really the checksum. With CPU features all passed to the VM, the falling-back-to-buffered direct IO performance is only slightly worse (10~20%) than nodatasum cases. Really sorry for the noise. Thanks, Qu ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-22 2:27 ` Qu Wenruo @ 2025-10-22 5:04 ` hch 2025-10-22 6:17 ` Qu Wenruo 0 siblings, 1 reply; 27+ messages in thread From: hch @ 2025-10-22 5:04 UTC (permalink / raw) To: Qu Wenruo Cc: Johannes Thumshirn, WenRuo Qu, hch@infradead.org, linux-btrfs@vger.kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com On Wed, Oct 22, 2025 at 12:57:51PM +1030, Qu Wenruo wrote: > My VM is using kvm64 CPU type, which blocks quite a lot of CPU features, > thus the CRC32 performance is pretty poor. Yes, unaccelerated CRC32 is a bad idea. > > I just tried a short hack to always make direct IO to fallback to buffered > IO, the nodatasum performance is the same as the bouncing page solution, so > the slow down is not page cache itself but really the checksum. > > With CPU features all passed to the VM, the falling-back-to-buffered direct > IO performance is only slightly worse (10~20%) than nodatasum cases. I'm a bit lost, what are the exact cases you are comparing here? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-22 5:04 ` hch @ 2025-10-22 6:17 ` Qu Wenruo 2025-10-22 6:24 ` hch 0 siblings, 1 reply; 27+ messages in thread From: Qu Wenruo @ 2025-10-22 6:17 UTC (permalink / raw) To: hch@infradead.org Cc: Johannes Thumshirn, WenRuo Qu, linux-btrfs@vger.kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com On 2025/10/22 15:34, hch@infradead.org wrote: > On Wed, Oct 22, 2025 at 12:57:51PM +1030, Qu Wenruo wrote: >> My VM is using kvm64 CPU type, which blocks quite a lot of CPU features, >> thus the CRC32 performance is pretty poor. > > Yes, unaccelerated CRC32 is a bad idea. > >> >> I just tried a short hack to always make direct IO to fallback to buffered >> IO, the nodatasum performance is the same as the bouncing page solution, so >> the slow down is not page cache itself but really the checksum. >> >> With CPU features all passed to the VM, the falling-back-to-buffered direct >> IO performance is only slightly worse (10~20%) than nodatasum cases. > > I'm a bit lost, what are the exact cases you are comparing here? I'm just checking the impact of btrfs checksums on data writes. I may have skipped some points, but overall the idea is:
- direct writes, no data csum: true zero-copy.
- direct writes, no data csum but with bounce pages: affected by memcpy().
- direct writes, no data csum but forced fallback to buffered: affected by memcpy().
- direct writes, data csum: falls back to buffered IO, affected by both memcpy() and csum.
So far it looks like the bounce pages solution is wasting a lot of code for almost nothing; it's not any better than falling back. And the checksum part looks like it can be improved for writes. Since we always fall back to buffered IO for checksums, the content should not change and we can do the submission and checksum calculation in parallel. Already got a prototype; it gives around a 10% improvement inside my VM. Will fix the bugs related to compression and send an RFC for it. Thanks, Qu ^ permalink raw reply [flat|nested] 27+ messages in thread
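A very rough sketch of the shape of that parallelism (not actual btrfs code: compute_block_csums() is a hypothetical placeholder, and the real series still has to deal with ordered extents and compression):

#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/bio.h>

struct csum_work {
	struct work_struct work;
	struct bio *bio;		/* data pages are stable here */
	struct completion done;
	u8 *csums;			/* preallocated, one per block */
};

static void csum_worker(struct work_struct *work)
{
	struct csum_work *cw = container_of(work, struct csum_work, work);

	compute_block_csums(cw->bio, cw->csums);	/* placeholder */
	complete(&cw->done);
}

static void submit_data_write_sketch(struct csum_work *cw)
{
	INIT_WORK(&cw->work, csum_worker);
	init_completion(&cw->done);
	queue_work(system_unbound_wq, &cw->work);	/* csum calculation... */
	submit_bio(cw->bio);				/* ...and I/O overlap */
	/*
	 * The write-completion side, which already runs in process
	 * context for ordered extents, waits for cw->done before the
	 * csum items are inserted.
	 */
}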
* Re: O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO 2025-10-22 6:17 ` Qu Wenruo @ 2025-10-22 6:24 ` hch 0 siblings, 0 replies; 27+ messages in thread From: hch @ 2025-10-22 6:24 UTC (permalink / raw) To: Qu Wenruo Cc: hch@infradead.org, Johannes Thumshirn, WenRuo Qu, linux-btrfs@vger.kernel.org, djwong@kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, martin.petersen@oracle.com, jack@suse.com On Wed, Oct 22, 2025 at 04:47:56PM +1030, Qu Wenruo wrote: > So far it looks like the bounce pages solution is wasting a lot of code for > almost nothing, it's not any better than falling back. That was my expectation, but I wasn't entirely sure from your wording if that's what you measured. > Since we always fallback to buffered IO for checksums, the content should > not change and we can do the submission and checksum calculation in > parallel. Yes. > Already got a prototype, results around 10% improvement inside my VM. > Will fix the bugs related to compression and send an RFC for it. Nice. I've also started looking at my PI backlog for the XFS version. ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread
Thread overview: 27+ messages
[not found] <1ee861df6fbd8bf45ab42154f429a31819294352.1760951886.git.wqu@suse.com>
2025-10-20 10:00 ` O_DIRECT vs BLK_FEAT_STABLE_WRITES, was Re: [PATCH] btrfs: never trust the bio from direct IO Christoph Hellwig
2025-10-20 10:24 ` Qu Wenruo
2025-10-20 11:45 ` Christoph Hellwig
2025-10-20 11:16 ` Jan Kara
2025-10-20 11:44 ` Christoph Hellwig
2025-10-20 13:59 ` Jan Kara
2025-10-20 14:59 ` Matthew Wilcox
2025-10-20 15:58 ` Jan Kara
2025-10-20 17:55 ` John Hubbard
2025-10-21 8:27 ` Jan Kara
2025-10-21 16:56 ` John Hubbard
2025-10-20 19:00 ` David Hildenbrand
2025-10-21 7:49 ` Christoph Hellwig
2025-10-21 7:57 ` David Hildenbrand
2025-10-21 9:33 ` Jan Kara
2025-10-21 9:43 ` David Hildenbrand
2025-10-21 9:22 ` Jan Kara
2025-10-21 9:37 ` David Hildenbrand
2025-10-21 9:52 ` Jan Kara
2025-10-21 3:17 ` Qu Wenruo
2025-10-21 7:48 ` Christoph Hellwig
2025-10-21 8:15 ` Qu Wenruo
2025-10-21 11:30 ` Johannes Thumshirn
2025-10-22 2:27 ` Qu Wenruo
2025-10-22 5:04 ` hch
2025-10-22 6:17 ` Qu Wenruo
2025-10-22 6:24 ` hch