* loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-03-19 17:14 UTC
To: linux-block

Hello all,

In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
it does not appear that lo_req_flush() does anything to make sure
ki_complete() has been called for pending work; it just calls vfs_fsync().

Is this a consistency problem?

For example, if the loop driver has queued async kiocb's to the
filesystem via .read_iter/.write_iter, is it the filesystem's
responsibility to complete that work before returning from vfs_fsync(),
or is it possible that vfs_fsync() completes before ->ki_complete() is
called?

Relatedly, does vfs_fsync() do anything for direct IO? Ultimately
f_op->fsync() is called, so maybe the FS is told to commit its
structures, like sparse allocations that may not be on disk yet.

--
Eric Wheeler

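For context, the loop.c path under discussion looks roughly like the
following (a condensed sketch paraphrased from drivers/block/loop.c of
that era, not a verbatim copy; helpers and fields vary by kernel
version):

	/* Sketch: REQ_OP_FLUSH handling in the loop driver. */
	static int lo_req_flush(struct loop_device *lo, struct request *rq)
	{
		/*
		 * Flush the backing file.  Note: nothing here waits for
		 * async kiocbs submitted earlier via lo_rw_aio() -- their
		 * ->ki_complete() may not have run yet.
		 */
		int ret = vfs_fsync(lo->lo_backing_file, 0);

		if (unlikely(ret && ret != -EINVAL))
			ret = -EIO;
		return ret;
	}

	static int do_req_filebacked(struct loop_device *lo, struct request *rq)
	{
		switch (req_op(rq)) {
		case REQ_OP_FLUSH:
			return lo_req_flush(lo, rq);
		/* READ/WRITE dispatch to lo_rw_aio() (async kiocbs) or to
		 * the simple buffered helpers; elided here. */
		default:
			return -EIO;
		}
	}
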
* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-03-21  0:21 UTC
To: Eric Wheeler; +Cc: linux-block

On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> Hello all,
>
> In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
> it does not appear that lo_req_flush() does anything to make sure
> ki_complete() has been called for pending work; it just calls vfs_fsync().
>
> Is this a consistency problem?

No. All the FLUSH command provides is flushing the device-side cache to
the storage medium, so it has nothing to do with pending requests.

Thanks,
Ming

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-03-23  6:14 UTC
To: Ming Lei; +Cc: linux-block

On Mon, 21 Mar 2022, Ming Lei wrote:
> On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > Hello all,
> >
> > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
> > it does not appear that lo_req_flush() does anything to make sure
> > ki_complete() has been called for pending work; it just calls vfs_fsync().
> >
> > Is this a consistency problem?
>
> No. All the FLUSH command provides is flushing the device-side cache to
> the storage medium, so it has nothing to do with pending requests.

If a flush follows a series of writes, would it be best if the flush
happened _after_ those writes complete? Then the storage medium will be
sure to flush what was intended to be written.

It seems that this series of events could lead to inconsistent data:

	loop          ->  filesystem
	write a
	write b
	flush
	                  write a
	                  flush
	                  write b
	                  crash, b is lost

If write+flush ordering is _not_ important, then can you help me
understand why?

-Eric

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-13 22:49 UTC
To: Ming Lei; +Cc: linux-block

On Tue, 22 Mar 2022, Eric Wheeler wrote:
> On Mon, 21 Mar 2022, Ming Lei wrote:
> > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > Hello all,
> > >
> > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
> > > it does not appear that lo_req_flush() does anything to make sure
> > > ki_complete() has been called for pending work; it just calls vfs_fsync().
> > >
> > > Is this a consistency problem?
> >
> > No. All the FLUSH command provides is flushing the device-side cache to
> > the storage medium, so it has nothing to do with pending requests.
>
> If a flush follows a series of writes, would it be best if the flush
> happened _after_ those writes complete? Then the storage medium will be
> sure to flush what was intended to be written.
>
> It seems that this series of events could lead to inconsistent data:
>
> 	loop          ->  filesystem
> 	write a
> 	write b
> 	flush
> 	                  write a
> 	                  flush
> 	                  write b
> 	                  crash, b is lost
>
> If write+flush ordering is _not_ important, then can you help me
> understand why?

Hi Ming, just checking in: did you see the message above?

Do you really mean to say that reordering writes around a flush is safe
in the presence of a crash?

--
Eric Wheeler

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-14  0:15 UTC
To: Eric Wheeler; +Cc: linux-block

On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > Hello all,
> > > >
> > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
> > > > it does not appear that lo_req_flush() does anything to make sure
> > > > ki_complete() has been called for pending work; it just calls vfs_fsync().
> > > >
> > > > Is this a consistency problem?
> > >
> > > No. All the FLUSH command provides is flushing the device-side cache to
> > > the storage medium, so it has nothing to do with pending requests.
> >
> > If a flush follows a series of writes, would it be best if the flush
> > happened _after_ those writes complete? Then the storage medium will be
> > sure to flush what was intended to be written.
> >
> > It seems that this series of events could lead to inconsistent data:
> >
> > 	loop          ->  filesystem
> > 	write a
> > 	write b
> > 	flush
> > 	                  write a
> > 	                  flush
> > 	                  write b
> > 	                  crash, b is lost
> >
> > If write+flush ordering is _not_ important, then can you help me
> > understand why?
>
> Hi Ming, just checking in: did you see the message above?
>
> Do you really mean to say that reordering writes around a flush is safe
> in the presence of a crash?

As I explained before, FLUSH doesn't provide an ordering guarantee; that
is supposed to be provided by the upper layer, such as the FS. Again,
what FLUSH does is just flush the device-side cache to the medium; that
is it.

Thanks,
Ming

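What Ming describes is visible in the flush plumbing itself:
blkdev_issue_flush() boils down to an empty write bio flagged
REQ_PREFLUSH, submitted and waited on synchronously, roughly as in the
following sketch (the exact helper signatures have shifted across kernel
versions, so treat this as illustrative):

	/* Sketch: a cache flush is an empty bio, nothing more. */
	int blkdev_issue_flush(struct block_device *bdev)
	{
		struct bio bio;

		bio_init(&bio, bdev, NULL, 0, REQ_OP_WRITE | REQ_PREFLUSH);
		/* Waits for *this* bio only; says nothing about other
		 * in-flight IO on the queue. */
		return submit_bio_wait(&bio);
	}
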
* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-14  0:28 UTC
To: Eric Wheeler; +Cc: linux-block

On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > Hello all,
> > > >
> > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called:
> > > > it does not appear that lo_req_flush() does anything to make sure
> > > > ki_complete() has been called for pending work; it just calls vfs_fsync().
> > > >
> > > > Is this a consistency problem?
> > >
> > > No. All the FLUSH command provides is flushing the device-side cache to
> > > the storage medium, so it has nothing to do with pending requests.
> >
> > If a flush follows a series of writes, would it be best if the flush
> > happened _after_ those writes complete? Then the storage medium will be
> > sure to flush what was intended to be written.
> >
> > It seems that this series of events could lead to inconsistent data:
> >
> > 	loop          ->  filesystem
> > 	write a
> > 	write b
> > 	flush
> > 	                  write a
> > 	                  flush
> > 	                  write b
> > 	                  crash, b is lost
> >
> > If write+flush ordering is _not_ important, then can you help me
> > understand why?
>
> Hi Ming, just checking in: did you see the message above?
>
> Do you really mean to say that reordering writes around a flush is safe
> in the presence of a crash?

Sorry, replied too quickly.

BTW, what is the actual crash? Any dmesg log? From the above
description, b is just not flushed to storage when the flush runs, and
sooner or later it will land, so what is the real issue?

Thanks,
Ming

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-15  2:10 UTC
To: Ming Lei; +Cc: linux-block

On Thu, 14 Apr 2022, Ming Lei wrote:
> On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> > On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > > Hello all,
> > > > >
> > > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is
> > > > > called: it does not appear that lo_req_flush() does anything to
> > > > > make sure ki_complete() has been called for pending work; it just
> > > > > calls vfs_fsync().
> > > > >
> > > > > Is this a consistency problem?
> > > >
> > > > No. All the FLUSH command provides is flushing the device-side
> > > > cache to the storage medium, so it has nothing to do with pending
> > > > requests.
> > >
> > > If a flush follows a series of writes, would it be best if the flush
> > > happened _after_ those writes complete? Then the storage medium will
> > > be sure to flush what was intended to be written.
> > >
> > > It seems that this series of events could lead to inconsistent data:
> > >
> > > 	loop          ->  filesystem
> > > 	write a
> > > 	write b
> > > 	flush
> > > 	                  write a
> > > 	                  flush
> > > 	                  write b
> > > 	                  crash, b is lost
> > >
> > > If write+flush ordering is _not_ important, then can you help me
> > > understand why?
> >
> > Hi Ming, just checking in: did you see the message above?
> >
> > Do you really mean to say that reordering writes around a flush is
> > safe in the presence of a crash?
>
> Sorry, replied too quickly.

That's ok, thanks for considering the issue!

> BTW, what is the actual crash? Any dmesg log? From the above
> description, b is just not flushed to storage when the flush runs, and
> sooner or later it will land, so what is the real issue?

In this case "crash" is actually a filesystem snapshot using most any
snapshot technology, such as dm-thin, btrfs, bcachefs, or zfs. From the
point of view of the filesystem inside the loop file, a snapshot is the
same as a hardware crash.

Some background:

We've already seen journal commit re-ordering caused by dm-crypt in
dm-thin snapshots:

	dm-thin -> dm-crypt -> [kvm with a disk using ext4]

This is the original email about dm-crypt ordering:

	https://listman.redhat.com/archives/dm-devel/2016-September/msg00035.html

We "fixed" the dm-crypt issue by disabling parallel dm-crypt threads
with the dm-crypt flags same_cpu_crypt+submit_from_crypt_cpus and
haven't seen the issue since. (It's noticeably slower, but I'll take
consistency over performance any day! Not sure if that old dm-crypt
ordering has been fixed.)

So back to considering loop devs:

Having seen the dm-crypt issue, I would like to verify that loop isn't
susceptible to the same issue in the presence of lower-level snapshots,
but it looks like it could be, since flushes don't enforce ordering.
Here is why:

In ext4/super.c:ext4_sync_fs(), the ext4 code calls
blkdev_issue_flush(sb->s_bdev) when barriers are enabled (which is the
default). blkdev_issue_flush() sets REQ_PREFLUSH, and according to
blk_types.h this is a "request for cache flush"; you could think of
in-flight IOs on the way through loop.ko and into the hypervisor
filesystem where the loop's backing file lives as being in a "cache" of
sorts, especially for non-DIO loopdevs hitting the pagecache.

Thus, ext4 critically expects that all IOs preceding a flush will hit
persistent storage before all future IOs. Failing that, journal replay
cannot return the filesystem to a consistent state without an `fsck`.

(Note that ext4 is just an example; really, any journaled filesystem is
at risk.)

Let's say a virtual machine uses a loopback file for a disk and the VM
issues the following to delete some file called "/b":

	unlink("/b"):
	  write: journal commit: unlink /b
	  flush: blkdev_issue_flush()
	  write: update fs datastructures (remove /b from a dentry)
	  <hypervisor snapshot>

If the flush happens out of order, then an operation like unlink("/b")
could look like this, where the guest VM's filesystem is on the left and
the hypervisor's loop filesystem operations are on the right:

	VM ext4 filesystem               | Hypervisor loop dev ordering
	---------------------------------+---------------------------------
	write: journal commit: unlink /b |
	flush: blkdev_issue_flush()      |
	write: update fs dentry's        |
	                                 | queued to loop: [journal commit: unlink /b]
	                                 | queued to loop: [update fs dentry's]
	                                 | flush: vfs_fsync() - out of order
	                                 | queued to ext4: [journal commit: unlink /b]
	                                 | queued to ext4: [update fs dentry's]
	                                 | write lands: [update fs dentry's]
	                                 | <snapshot!>
	                                 | write lands: [journal commit: unlink /b]

Notice that the last two "write lands" are out of order because the
vfs_fsync() does not separate them as expected by the VM's ext4
implementation.

Presently, loop.c will never re-order actual WRITEs: they will be
submitted to the loopdev's `file*` handle in submit order because the
workqueue will keep them ordered. This is good[*].

But REQ_FLUSH is not a write:

The vfs_fsync() in lo_req_flush() is _not_ ordered by the write queue,
and there exists no mechanism in loop.c to enforce completion of IOs
submitted to the loopdev's `file*` handle prior to completing the
vfs_fsync(), nor are subsequent IOs thereby required to complete after
the flush.

Thus, the hypervisor's snapshot-capable filesystem can re-order the last
two writes because the flush happened early.

In the re-ordered case on the hypervisor side:

If a snapshot happens after the dentry removal but _before_ the journal
commit, then a journal replay of the resulting snapshot will be
inconsistent.

Flush re-ordering creates an inconsistency in two possible cases:

	a. In the snapshot after dentry removal but before journal commit.
	b. Crash after dentry removal but before journal commit.

Really, a snapshot looks like a crash to the filesystem, so (a) and (b)
are equivalent, but (a) is easier to reason about. In either case,
mounting the out-of-order filesystem (snapshot if (a), origin if (b))
will present kernel errors in the VM when the dentry is read:

	kernel: EXT4-fs error (device dm-2): ext4_lookup:1441: inode
	#1196093: comm rsync: deleted inode referenced: 1188710

	[ https://listman.redhat.com/archives/dm-devel/2016-September/028121.html ]

Fixing flush ordering provides for two possible improvements
([*] from above about write ordering):

1. Consistency, as above.

2. Performance. Right now loopdev IOs are serialized by a single write
   queue per loopdev. Parallel work queues could be used to submit IOs
   in parallel to the filesystem serving the loopdev's `file*` handle,
   since VMs may submit IOs from different CPU cores. Special care would
   need to be taken for the following cases:

   a. Ordering of dependent IOs (ie, reads or writes of preceding
      writes).
   b. Flushes need to wait for all workqueues to quiesce (a rough
      sketch of this follows below).

W.r.t. choosing the number of WQs: certainly not a 1:1 CPU-to-workqueue
mapping, because of the old WQ issue of running out of pids with lots of
CPUs, but here are some possibilities:

   a. 1:1 per socket
   b. User-configurable as a module parameter
   c. A dedicated pool of workqueues for all loop devs that dispatches
      queued IOs to the correct `file*` handle. RCU might be useful.

What next?

Ok, so assuming consistency is an issue, #1 is the priority and #2 is
nice-to-have. This might be the right time to consider these, since
there is so much discussion about loop.c on the list right now.

According to my understanding of the research above, this appears to be
an issue, and there are other kernel developers who know this code
better than I. I want to know if this is correct:

Should others be CC'ed on this topic? If so, who?

--
Eric Wheeler

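To make the quiescing idea in 2(b) concrete, one hypothetical shape for
it is sketched below. This is untested and purely illustrative: the
lo_inflight counter and lo_inflight_wq waitqueue do not exist in loop.c
and are invented names; submission paths would also need a matching
atomic_inc() before each call_read_iter()/call_write_iter().

	/* Hypothetical: track async kiocbs in flight per loop device. */
	static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
	{
		struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
		struct request *rq = blk_mq_rq_from_pdu(cmd);
		struct loop_device *lo = rq->q->queuedata;

		/* ... existing completion handling elided ... */

		if (atomic_dec_and_test(&lo->lo_inflight))
			wake_up(&lo->lo_inflight_wq);
	}

	static int lo_req_flush(struct loop_device *lo, struct request *rq)
	{
		/* Drain IOs submitted before this flush ... */
		wait_event(lo->lo_inflight_wq,
			   atomic_read(&lo->lo_inflight) == 0);
		/* ... and only then flush the backing file. */
		return vfs_fsync(lo->lo_backing_file, 0);
	}

New IOs arriving while the flush is in progress would need additional
handling (e.g. a generation counter) for this to be airtight.
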
* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-15 14:29 UTC
To: Eric Wheeler; +Cc: linux-block, linux-ext4

On Thu, Apr 14, 2022 at 07:10:04PM -0700, Eric Wheeler wrote:
> On Thu, 14 Apr 2022, Ming Lei wrote:
> > On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> > > On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > > > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush()
> > > > > > is called: it does not appear that lo_req_flush() does anything
> > > > > > to make sure ki_complete() has been called for pending work; it
> > > > > > just calls vfs_fsync().
> > > > > >
> > > > > > Is this a consistency problem?
> > > > >
> > > > > No. All the FLUSH command provides is flushing the device-side
> > > > > cache to the storage medium, so it has nothing to do with pending
> > > > > requests.
> > > >
> > > > If a flush follows a series of writes, would it be best if the
> > > > flush happened _after_ those writes complete? Then the storage
> > > > medium will be sure to flush what was intended to be written.
> > > >
> > > > It seems that this series of events could lead to inconsistent data:
> > > >
> > > > 	loop          ->  filesystem
> > > > 	write a
> > > > 	write b
> > > > 	flush
> > > > 	                  write a
> > > > 	                  flush
> > > > 	                  write b
> > > > 	                  crash, b is lost
> > > >
> > > > If write+flush ordering is _not_ important, then can you help me
> > > > understand why?
> > >
> > > Hi Ming, just checking in: did you see the message above?
> > >
> > > Do you really mean to say that reordering writes around a flush is
> > > safe in the presence of a crash?
> >
> > Sorry, replied too quickly.
>
> That's ok, thanks for considering the issue!
>
> > BTW, what is the actual crash? Any dmesg log? From the above
> > description, b is just not flushed to storage when the flush runs, and
> > sooner or later it will land, so what is the real issue?
>
> In this case "crash" is actually a filesystem snapshot using most any
> snapshot technology, such as dm-thin, btrfs, bcachefs, or zfs. From the
> point of view of the filesystem inside the loop file, a snapshot is the
> same as a hardware crash.
>
> Some background:
>
> We've already seen journal commit re-ordering caused by dm-crypt in
> dm-thin snapshots:
>
> 	dm-thin -> dm-crypt -> [kvm with a disk using ext4]
>
> This is the original email about dm-crypt ordering:
>
> 	https://listman.redhat.com/archives/dm-devel/2016-September/msg00035.html
>
> We "fixed" the dm-crypt issue by disabling parallel dm-crypt threads
> with the dm-crypt flags same_cpu_crypt+submit_from_crypt_cpus and
> haven't seen the issue since. (It's noticeably slower, but I'll take
> consistency over performance any day! Not sure if that old dm-crypt
> ordering has been fixed.)
>
> So back to considering loop devs:
>
> Having seen the dm-crypt issue, I would like to verify that loop isn't
> susceptible to the same issue in the presence of lower-level snapshots,
> but it looks like it could be, since flushes don't enforce ordering.
> Here is why:
>
> In ext4/super.c:ext4_sync_fs(), the ext4 code calls
> blkdev_issue_flush(sb->s_bdev) when barriers are enabled (which is the
> default). blkdev_issue_flush() sets REQ_PREFLUSH, and according to
> blk_types.h this is a "request for cache flush"; you could think of
> in-flight IOs on the way through loop.ko and into the hypervisor
> filesystem where the loop's backing file lives as being in a "cache" of
> sorts, especially for non-DIO loopdevs hitting the pagecache.
>
> Thus, ext4 critically expects that all IOs preceding a flush will hit
> persistent storage before all future IOs. Failing that, journal replay
> cannot return the filesystem to a consistent state without an `fsck`.

If ext4 expects the following order, it is ext4's responsibility to
maintain that order; the block layer may re-order all of these IOs at
will, so do not expect IOs to be issued to the device in submission
order:

	1) IOs preceding the flush in 2)
	2) flush
	3) IOs after the flush

Even the device itself may re-order these IOs, as with AHCI/NCQ.

> (Note that ext4 is just an example; really, any journaled filesystem is
> at risk.)
>
> Let's say a virtual machine uses a loopback file for a disk and the VM
> issues the following to delete some file called "/b":
>
> 	unlink("/b"):
> 	  write: journal commit: unlink /b
> 	  flush: blkdev_issue_flush()
> 	  write: update fs datastructures (remove /b from a dentry)
> 	  <hypervisor snapshot>
>
> If the flush happens out of order, then an operation like unlink("/b")
> could look like this, where the guest VM's filesystem is on the left and
> the hypervisor's loop filesystem operations are on the right:
>
> 	VM ext4 filesystem               | Hypervisor loop dev ordering
> 	---------------------------------+---------------------------------
> 	write: journal commit: unlink /b |
> 	flush: blkdev_issue_flush()      |
> 	write: update fs dentry's        |
> 	                                 | queued to loop: [journal commit: unlink /b]
> 	                                 | queued to loop: [update fs dentry's]
> 	                                 | flush: vfs_fsync() - out of order
> 	                                 | queued to ext4: [journal commit: unlink /b]
> 	                                 | queued to ext4: [update fs dentry's]
> 	                                 | write lands: [update fs dentry's]
> 	                                 | <snapshot!>
> 	                                 | write lands: [journal commit: unlink /b]

If the VM's ext4 requires the above order, the ext4 FS code has to hold
the flush until write(journal commit: unlink /b) is done, or do that
write and the flush in one single command, and start write(update fs
dentry's) only after the flush is done. Once ext4 submits IOs in this
way, you will see that the issue shouldn't be related to the hypervisor.

One issue I do see is the case of a block-layer snapshot under a loop
device in buffered-IO mode: when a loop write completes, the data is
only in the FS page cache, which may be invisible to a block snapshot
(dm-snap), and even page writeback may be re-ordered. But if flush is
used correctly, this still works fine.

> Notice that the last two "write lands" are out of order because the
> vfs_fsync() does not separate them as expected by the VM's ext4
> implementation.
>
> Presently, loop.c will never re-order actual WRITEs: they will be
> submitted to the loopdev's `file*` handle in submit order because the
> workqueue will keep them ordered. This is good[*].

Again, do not expect the block layer to maintain IO order.

> But REQ_FLUSH is not a write:
>
> The vfs_fsync() in lo_req_flush() is _not_ ordered by the write queue,
> and there exists no mechanism in loop.c to enforce completion of IOs
> submitted to the loopdev's `file*` handle prior to completing the
> vfs_fsync(), nor are subsequent IOs thereby required to complete after
> the flush.

Yes, but a flush is just to flush the device cache to the medium, and
the block layer doesn't maintain IO order; it is the responsibility of
the block layer's user (the FS) to maintain IO order.

> Thus, the hypervisor's snapshot-capable filesystem can re-order the last
> two writes because the flush happened early.
>
> In the re-ordered case on the hypervisor side:
>
> If a snapshot happens after the dentry removal but _before_ the journal
> commit, then a journal replay of the resulting snapshot will be
> inconsistent.
>
> Flush re-ordering creates an inconsistency in two possible cases:
>
> 	a. In the snapshot after dentry removal but before journal commit.
> 	b. Crash after dentry removal but before journal commit.
>
> Really, a snapshot looks like a crash to the filesystem, so (a) and (b)
> are equivalent, but (a) is easier to reason about. In either case,
> mounting the out-of-order filesystem (snapshot if (a), origin if (b))
> will present kernel errors in the VM when the dentry is read:
>
> 	kernel: EXT4-fs error (device dm-2): ext4_lookup:1441: inode
> 	#1196093: comm rsync: deleted inode referenced: 1188710
>
> 	[ https://listman.redhat.com/archives/dm-devel/2016-September/028121.html ]
>
> Fixing flush ordering provides for two possible improvements
> ([*] from above about write ordering):

No, what you should fix is the IO ordering in the FS or journal code,
not in the block layer.

> 1. Consistency, as above.
>
> 2. Performance. Right now loopdev IOs are serialized by a single write
>    queue per loopdev. Parallel work queues could be used to submit IOs
>    in parallel to the filesystem serving the loopdev's `file*` handle,
>    since VMs may submit IOs from different CPU cores. Special care would
>    need to be taken for the following cases:
>
>    a. Ordering of dependent IOs (ie, reads or writes of preceding
>       writes).
>    b. Flushes need to wait for all workqueues to quiesce.
>
> W.r.t. choosing the number of WQs: certainly not a 1:1 CPU-to-workqueue
> mapping, because of the old WQ issue of running out of pids with lots of
> CPUs, but here are some possibilities:
>
>    a. 1:1 per socket
>    b. User-configurable as a module parameter
>    c. A dedicated pool of workqueues for all loop devs that dispatches
>       queued IOs to the correct `file*` handle. RCU might be useful.
>
> What next?
>
> Ok, so assuming consistency is an issue, #1 is the priority and #2 is
> nice-to-have. This might be the right time to consider these, since
> there is so much discussion about loop.c on the list right now.
>
> According to my understanding of the research above, this appears to be
> an issue, and there are other kernel developers who know this code
> better than I. I want to know if this is correct:
>
> Should others be CC'ed on this topic? If so, who?

ext4 or fsdevel guys.

Thanks,
Ming

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Christoph Hellwig @ 2022-04-16  5:18 UTC
To: Ming Lei; +Cc: Eric Wheeler, linux-block, linux-ext4

On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> If ext4 expects the following order, it is ext4's responsibility to
> maintain that order; the block layer may re-order all of these IOs at
> will, so do not expect IOs to be issued to the device in submission
> order

Yes, and it has been so since REQ_FLUSH (which later became
REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:

    commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
    Author: Tejun Heo <tj@kernel.org>
    Date:   Fri Sep 3 11:56:16 2010 +0200

        block: drop barrier ordering by queue draining

        Filesystems will take all the responsibilities for ordering
        requests around commit writes and will only indicate how the
        commit writes themselves should be handled by block layers.
        This patch drops barrier ordering by queue draining from block
        layer.

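This is also why, since that 2010 change, a journal's commit record
carries its cache-flush semantics on the commit write itself rather than
relying on any queue-level ordering, in the style of the sketch below
(illustrative of jbd2-like behavior, not verbatim jbd2 code):

	/*
	 * Sketch: a journal commit write in the flag-based world.
	 * REQ_PREFLUSH - flush the device cache before this write, making
	 *                previously *completed* journal IO durable first.
	 * REQ_FUA      - force this block itself to stable media.
	 * Ordering against still-incomplete earlier writes remains the
	 * filesystem's job: it must wait for them before submitting this.
	 */
	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA;
	submit_bio(bio);
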
* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-16 20:05 UTC
To: Christoph Hellwig; +Cc: Ming Lei, linux-block, linux-ext4

On Fri, 15 Apr 2022, Christoph Hellwig wrote:
> On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> > If ext4 expects the following order, it is ext4's responsibility to
> > maintain that order; the block layer may re-order all of these IOs at
> > will, so do not expect IOs to be issued to the device in submission
> > order
>
> Yes, and it has been so since REQ_FLUSH (which later became
> REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
>
>     commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
>     Author: Tejun Heo <tj@kernel.org>
>     Date:   Fri Sep 3 11:56:16 2010 +0200
>
>         block: drop barrier ordering by queue draining
>
>         Filesystems will take all the responsibilities for ordering
>         requests around commit writes and will only indicate how the
>         commit writes themselves should be handled by block layers.
>         This patch drops barrier ordering by queue draining from block
>         layer.

Thanks Christoph. I think this answers my original question, too.

You may have already answered this implicitly above. If you would be so
kind as to confirm or correct my understanding with a few more
questions:

1. Is the only way for a filesystem to know that one IO completed before
   a second IO to track the first IO's completion and submit the second
   IO when the first IO completes (eg a journal commit followed by the
   subsequent metadata update)? If not, then what block-layer mechanism
   should be used?

2. Are there any IO ordering flags or mechanisms in the block layer at
   this point---or---is it simply that all IOs entering the block layer
   can always be re-ordered before reaching the media?

Thanks!

--
Eric Wheeler

* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Jens Axboe @ 2022-04-17  0:59 UTC
To: Eric Wheeler, Christoph Hellwig; +Cc: Ming Lei, linux-block, linux-ext4

On 4/16/22 2:05 PM, Eric Wheeler wrote:
> On Fri, 15 Apr 2022, Christoph Hellwig wrote:
> > On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> > > If ext4 expects the following order, it is ext4's responsibility to
> > > maintain that order; the block layer may re-order all of these IOs
> > > at will, so do not expect IOs to be issued to the device in
> > > submission order
> >
> > Yes, and it has been so since REQ_FLUSH (which later became
> > REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
> >
> >     commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
> >     Author: Tejun Heo <tj@kernel.org>
> >     Date:   Fri Sep 3 11:56:16 2010 +0200
> >
> >         block: drop barrier ordering by queue draining
> >
> >         Filesystems will take all the responsibilities for ordering
> >         requests around commit writes and will only indicate how the
> >         commit writes themselves should be handled by block layers.
> >         This patch drops barrier ordering by queue draining from
> >         block layer.
>
> Thanks Christoph. I think this answers my original question, too.
>
> You may have already answered this implicitly above. If you would be so
> kind as to confirm or correct my understanding with a few more
> questions:
>
> 1. Is the only way for a filesystem to know that one IO completed before
>    a second IO to track the first IO's completion and submit the second
>    IO when the first IO completes (eg a journal commit followed by the
>    subsequent metadata update)? If not, then what block-layer mechanism
>    should be used?

You either need to have a callback or wait on the IO; there's no other
way.

> 2. Are there any IO ordering flags or mechanisms in the block layer at
>    this point---or---is it simply that all IOs entering the block layer
>    can always be re-ordered before reaching the media?

No, no ordering flags are provided for this kind of use case. Any IO can
be reordered, hence the only reliable solution is to ensure the previous
IOs have completed.

--
Jens Axboe

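In sketch form, the two mechanisms Jens names look roughly like the
following (generic kernel style; function names here are invented for
illustration, not from any particular filesystem):

	/* (1) Wait on the IO: submit_bio_wait() returns only after
	 * bio_a completes, so bio_b cannot be issued ahead of it. */
	static void ordered_by_waiting(struct bio *bio_a, struct bio *bio_b)
	{
		if (!submit_bio_wait(bio_a))
			submit_bio(bio_b);
	}

	/* (2) Callback: submit the dependent bio from the completion
	 * handler of the bio it must follow. */
	static void first_bio_endio(struct bio *bio)
	{
		struct bio *next = bio->bi_private;

		bio_put(bio);
		if (next)
			submit_bio(next);	/* only reaches the queue now */
	}

	static void ordered_by_callback(struct bio *bio_a, struct bio *bio_b)
	{
		bio_a->bi_private = bio_b;
		bio_a->bi_end_io = first_bio_endio;
		submit_bio(bio_a);
	}
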
* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-17 16:32 UTC
To: Jens Axboe; +Cc: Christoph Hellwig, Ming Lei, linux-block, linux-ext4

On Sat, 16 Apr 2022, Jens Axboe wrote:
> On 4/16/22 2:05 PM, Eric Wheeler wrote:
> > On Fri, 15 Apr 2022, Christoph Hellwig wrote:
> > > On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> > > > If ext4 expects the following order, it is ext4's responsibility
> > > > to maintain that order; the block layer may re-order all of these
> > > > IOs at will, so do not expect IOs to be issued to the device in
> > > > submission order
> > >
> > > Yes, and it has been so since REQ_FLUSH (which later became
> > > REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
> > >
> > >     commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
> > >     Author: Tejun Heo <tj@kernel.org>
> > >     Date:   Fri Sep 3 11:56:16 2010 +0200
> > >
> > >         block: drop barrier ordering by queue draining
> > >
> > >         Filesystems will take all the responsibilities for ordering
> > >         requests around commit writes and will only indicate how the
> > >         commit writes themselves should be handled by block layers.
> > >         This patch drops barrier ordering by queue draining from
> > >         block layer.
> >
> > Thanks Christoph. I think this answers my original question, too.
> >
> > You may have already answered this implicitly above. If you would be
> > so kind as to confirm or correct my understanding with a few more
> > questions:
> >
> > 1. Is the only way for a filesystem to know that one IO completed
> >    before a second IO to track the first IO's completion and submit
> >    the second IO when the first IO completes (eg a journal commit
> >    followed by the subsequent metadata update)? If not, then what
> >    block-layer mechanism should be used?
>
> You either need to have a callback or wait on the IO; there's no other
> way.
>
> > 2. Are there any IO ordering flags or mechanisms in the block layer at
> >    this point---or---is it simply that all IOs entering the block
> >    layer can always be re-ordered before reaching the media?
>
> No, no ordering flags are provided for this kind of use case. Any IO
> can be reordered, hence the only reliable solution is to ensure the
> previous IOs have completed.

Perfect, thanks Jens!

--
Eric Wheeler
