* [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-03 18:03 Hannes Reinecke
2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
0 siblings, 2 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke
Hi all,
I'm have a testcase which repeatedly deletes namespaces on the target
and creates new namespaces, and aggressively re-using NSIDs for the
new namespaces.
To throw in more fun these namespaces are created on different nodes
in the cluster, where only the paths local to the cluster node are
active, and all other paths are inaccessible.
Essentially it's doing something like:
echo 0 > ${ns}/enable
rm ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable
repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()' waiting for I/O which will never
come.
There are two parts to fixing this:
We need to ensure the NSHEAD_DISK_LIVE is properly set when the
ns_head is live, and unset when the last path is gone.
And we need to trigger the requeue list after NSHEAD_DISK_LIVE
has been cleared to flush all outstanding I/O.
With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I see to get the testcase added to blktests.
As usual, comments and reviews are welcome.
Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared
Hannes Reinecke (2):
nvme-multipath: fixup typo when clearing DISK_LIVE
nvme-multipath: fix I/O stall when remapping namespaces
drivers/nvme/host/multipath.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
--
2.35.3
^ permalink raw reply [flat|nested] 15+ messages in thread* [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE 2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke @ 2024-09-03 18:03 ` Hannes Reinecke 2024-09-04 4:09 ` Christoph Hellwig 2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke 1 sibling, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not 'struct nvme_ns'. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> --- drivers/nvme/host/multipath.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 91d9eb3c22ef..c9d23b1b8efc 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) rc = device_add_disk(&head->subsys->dev, head->disk, nvme_ns_attr_groups); if (rc) { - clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags); + clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags); return; } nvme_add_ns_head_cdev(head); -- 2.35.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE 2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke @ 2024-09-04 4:09 ` Christoph Hellwig 0 siblings, 0 replies; 15+ messages in thread From: Christoph Hellwig @ 2024-09-04 4:09 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, linux-nvme On Tue, Sep 03, 2024 at 08:03:44PM +0200, Hannes Reinecke wrote: > NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not > 'struct nvme_ns'. > > Signed-off-by: Hannes Reinecke <hare@kernel.org> > Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Looks good: Reviewed-by: Christoph Hellwig <hch@lst.de> Although a sentence on how this was found and if this causes real issues wouldn't hurt. ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke 2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke @ 2024-09-03 18:03 ` Hannes Reinecke 2024-09-03 19:38 ` Sagi Grimberg 2024-09-04 14:52 ` Nilay Shroff 1 sibling, 2 replies; 15+ messages in thread From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke During repetitive namespace remapping operations (ie removing a namespace and provision a different namespace with the same NSID) on the target the namespace might have changed between the time the initial scan was performed, and partition scan was invoked by device_add_disk() in nvme_mpath_set_live(). We then end up with a stuck scanning process: [<0>] folio_wait_bit_common+0x12a/0x310 [<0>] filemap_read_folio+0x97/0xd0 [<0>] do_read_cache_folio+0x108/0x390 [<0>] read_part_sector+0x31/0xa0 [<0>] read_lba+0xc5/0x160 [<0>] efi_partition+0xd9/0x8f0 [<0>] bdev_disk_changed+0x23d/0x6d0 [<0>] blkdev_get_whole+0x78/0xc0 [<0>] bdev_open+0x2c6/0x3b0 [<0>] bdev_file_open_by_dev+0xcb/0x120 [<0>] disk_scan_partitions+0x5d/0x100 [<0>] device_add_disk+0x402/0x420 [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] This happens when we have several paths, some of which are inaccessible, and the active paths are removed first. Then nvme_find_path() will requeue I/O in the ns_head (as paths are present), but the requeue list is never triggered as all remaining paths are inactive. This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once the last path has been removed to properly terminate pending I/O. Signed-off-by: Hannes Reinecke <hare@kernel.org> --- drivers/nvme/host/multipath.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index c9d23b1b8efc..1b1deb0450ab 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) { + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) + return NULL; + switch (READ_ONCE(head->subsys->iopolicy)) { case NVME_IOPOLICY_QD: return nvme_queue_depth_path(head); @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) { struct nvme_ns *ns; + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) + return NULL; + list_for_each_entry_rcu(ns, &head->list, siblings) { if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) continue; @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) { if (!head->disk) return; - kblockd_schedule_work(&head->requeue_work); - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { nvme_cdev_del(&head->cdev, &head->cdev_device); del_gendisk(head->disk); } + /* + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared + * to allow multipath to fail all I/O. + */ + kblockd_schedule_work(&head->requeue_work); } void nvme_mpath_remove_disk(struct nvme_ns_head *head) -- 2.35.3 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke @ 2024-09-03 19:38 ` Sagi Grimberg 2024-09-04 8:20 ` Hannes Reinecke 2024-09-04 14:52 ` Nilay Shroff 1 sibling, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2024-09-03 19:38 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 03/09/2024 21:03, Hannes Reinecke wrote: > During repetitive namespace remapping operations (ie removing a namespace and > provision a different namespace with the same NSID) on the target the > namespace might have changed between the time the initial scan > was performed, and partition scan was invoked by device_add_disk() > in nvme_mpath_set_live(). We then end up with a stuck scanning process: > > [<0>] folio_wait_bit_common+0x12a/0x310 > [<0>] filemap_read_folio+0x97/0xd0 > [<0>] do_read_cache_folio+0x108/0x390 > [<0>] read_part_sector+0x31/0xa0 > [<0>] read_lba+0xc5/0x160 > [<0>] efi_partition+0xd9/0x8f0 > [<0>] bdev_disk_changed+0x23d/0x6d0 > [<0>] blkdev_get_whole+0x78/0xc0 > [<0>] bdev_open+0x2c6/0x3b0 > [<0>] bdev_file_open_by_dev+0xcb/0x120 > [<0>] disk_scan_partitions+0x5d/0x100 > [<0>] device_add_disk+0x402/0x420 > [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] > [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] > [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] > [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] > [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] > > This happens when we have several paths, some of which are inaccessible, > and the active paths are removed first. Then nvme_find_path() will requeue > I/O in the ns_head (as paths are present), but the requeue list is never > triggered as all remaining paths are inactive. > This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, > and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once > the last path has been removed to properly terminate pending I/O. > > Signed-off-by: Hannes Reinecke <hare@kernel.org> > --- > drivers/nvme/host/multipath.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index c9d23b1b8efc..1b1deb0450ab 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) > > inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) > { > + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) > + return NULL; > + > switch (READ_ONCE(head->subsys->iopolicy)) { > case NVME_IOPOLICY_QD: > return nvme_queue_depth_path(head); > @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) > { > struct nvme_ns *ns; > > + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) > + return NULL; > + > list_for_each_entry_rcu(ns, &head->list, siblings) { > if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) > continue; > @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) > { > if (!head->disk) > return; > - kblockd_schedule_work(&head->requeue_work); > - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > nvme_cdev_del(&head->cdev, &head->cdev_device); > del_gendisk(head->disk); > } > + /* > + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared > + * to allow multipath to fail all I/O. > + */ > + kblockd_schedule_work(&head->requeue_work); Not sure how this helps given that you don't wait for srcu to synchronize before you kick the requeue. > } > > void nvme_mpath_remove_disk(struct nvme_ns_head *head) Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you mentioned that ns_remove is stuck on srcu_synchronize? Can you explain why nvme_find_path is able to find a path given that it is already cleared NVME_NS_READY ? oris it nvme_available_path that is missing a check? Maybe can do with checking NVME_NS_READY instead? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-03 19:38 ` Sagi Grimberg @ 2024-09-04 8:20 ` Hannes Reinecke 2024-09-04 8:59 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-09-04 8:20 UTC (permalink / raw) To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 9/3/24 21:38, Sagi Grimberg wrote: > > > > On 03/09/2024 21:03, Hannes Reinecke wrote: >> During repetitive namespace remapping operations (ie removing a >> namespace and >> provision a different namespace with the same NSID) on the target the >> namespace might have changed between the time the initial scan >> was performed, and partition scan was invoked by device_add_disk() >> in nvme_mpath_set_live(). We then end up with a stuck scanning process: >> >> [<0>] folio_wait_bit_common+0x12a/0x310 >> [<0>] filemap_read_folio+0x97/0xd0 >> [<0>] do_read_cache_folio+0x108/0x390 >> [<0>] read_part_sector+0x31/0xa0 >> [<0>] read_lba+0xc5/0x160 >> [<0>] efi_partition+0xd9/0x8f0 >> [<0>] bdev_disk_changed+0x23d/0x6d0 >> [<0>] blkdev_get_whole+0x78/0xc0 >> [<0>] bdev_open+0x2c6/0x3b0 >> [<0>] bdev_file_open_by_dev+0xcb/0x120 >> [<0>] disk_scan_partitions+0x5d/0x100 >> [<0>] device_add_disk+0x402/0x420 >> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >> >> This happens when we have several paths, some of which are inaccessible, >> and the active paths are removed first. Then nvme_find_path() will >> requeue >> I/O in the ns_head (as paths are present), but the requeue list is never >> triggered as all remaining paths are inactive. >> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >> the last path has been removed to properly terminate pending I/O. >> >> Signed-off-by: Hannes Reinecke <hare@kernel.org> >> --- >> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/multipath.c >> b/drivers/nvme/host/multipath.c >> index c9d23b1b8efc..1b1deb0450ab 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct >> nvme_ns_head *head) >> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >> { >> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >> + return NULL; >> + >> switch (READ_ONCE(head->subsys->iopolicy)) { >> case NVME_IOPOLICY_QD: >> return nvme_queue_depth_path(head); >> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct >> nvme_ns_head *head) >> { >> struct nvme_ns *ns; >> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >> + return NULL; >> + >> list_for_each_entry_rcu(ns, &head->list, siblings) { >> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >> continue; >> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct >> nvme_ns_head *head) >> { >> if (!head->disk) >> return; >> - kblockd_schedule_work(&head->requeue_work); >> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> nvme_cdev_del(&head->cdev, &head->cdev_device); >> del_gendisk(head->disk); >> } >> + /* >> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >> + * to allow multipath to fail all I/O. >> + */ >> + kblockd_schedule_work(&head->requeue_work); > > Not sure how this helps given that you don't wait for srcu to synchronize > before you kick the requeue. > It certainly is helping in my testcase. But having a synchronize_srcu here is probably not a bad idea. >> } >> void nvme_mpath_remove_disk(struct nvme_ns_head *head) > > Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you > mentioned that ns_remove is stuck on srcu_synchronize? Can you explain > why nvme_find_path is able to find a path given that it is already > cleared NVME_NS_READY ? oris it nvme_available_path that is missing a > check? Maybe can do with checking NVME_NS_READY instead? Turned out that the reasoning in the previous revision wasn't quite correct; since then I have seen several test-run where the above stack trace was the _only_ one in the system, so the stall in removing namespaces is more a side-effect. The ns_head was still visible in sysfs while in that state, with exactly one path left: # ls /sys/block nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2 nvme0n5 (whereas there had been 6 controllers with 6 namespaces). So we fail to trigger a requeue to restart I/O on the stuck scanning process; the actual path state really don't matter as never get this far. This can happen when the partition scan triggered by device_add_disk() (from one controller) interleaves with nvme_ns_remove() from another controller. Both processes are running lockless wrt to ns_head at that time, so if the partition scan issues I/O after the schedule_work in nvme_mpath_shutdown_disk(): void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) { if (!head->disk) return; kblockd_schedule_work(&head->requeue_work); if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { nvme_cdev_del(&head->cdev, &head->cdev_device); del_gendisk(head->disk); } } _and_ that last path happens to be an 'inaccessible' one, I/O will be requeued in the ns_head but never restarted, leaving to a hung process. Note, that I/O might also be triggered by userspace (eg udev); the device node is still present at that time. And that's also what I see in my test runs; occasionally I get additional stuck udev processes: [<0>] __folio_lock+0x114/0x1f0 [<0>] truncate_inode_pages_range+0x3c0/0x3e0 [<0>] blkdev_flush_mapping+0x45/0xe0 [<0>] blkdev_put_whole+0x2e/0x40 [<0>] bdev_release+0x129/0x1b0 [<0>] blkdev_release+0xd/0x20 [<0>] __fput+0xf7/0x2d0 also waiting on I/O. You might be right checking for NS_READY might be sufficient, I'll be checking. But we definitely need to requeue I/O after we called del_gendisk(). Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-04 8:20 ` Hannes Reinecke @ 2024-09-04 8:59 ` Hannes Reinecke 2024-09-05 6:43 ` Nilay Shroff 2024-09-05 7:06 ` Sagi Grimberg 0 siblings, 2 replies; 15+ messages in thread From: Hannes Reinecke @ 2024-09-04 8:59 UTC (permalink / raw) To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 9/4/24 10:20, Hannes Reinecke wrote: > On 9/3/24 21:38, Sagi Grimberg wrote: >> >> >> >> On 03/09/2024 21:03, Hannes Reinecke wrote: >>> During repetitive namespace remapping operations (ie removing a >>> namespace and >>> provision a different namespace with the same NSID) on the target the >>> namespace might have changed between the time the initial scan >>> was performed, and partition scan was invoked by device_add_disk() >>> in nvme_mpath_set_live(). We then end up with a stuck scanning process: >>> >>> [<0>] folio_wait_bit_common+0x12a/0x310 >>> [<0>] filemap_read_folio+0x97/0xd0 >>> [<0>] do_read_cache_folio+0x108/0x390 >>> [<0>] read_part_sector+0x31/0xa0 >>> [<0>] read_lba+0xc5/0x160 >>> [<0>] efi_partition+0xd9/0x8f0 >>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>> [<0>] blkdev_get_whole+0x78/0xc0 >>> [<0>] bdev_open+0x2c6/0x3b0 >>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>> [<0>] disk_scan_partitions+0x5d/0x100 >>> [<0>] device_add_disk+0x402/0x420 >>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>> >>> This happens when we have several paths, some of which are inaccessible, >>> and the active paths are removed first. Then nvme_find_path() will >>> requeue >>> I/O in the ns_head (as paths are present), but the requeue list is never >>> triggered as all remaining paths are inactive. >>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>> the last path has been removed to properly terminate pending I/O. >>> >>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>> --- >>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>> 1 file changed, 12 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/nvme/host/multipath.c >>> b/drivers/nvme/host/multipath.c >>> index c9d23b1b8efc..1b1deb0450ab 100644 >>> --- a/drivers/nvme/host/multipath.c >>> +++ b/drivers/nvme/host/multipath.c >>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct >>> nvme_ns_head *head) >>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>> { >>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>> + return NULL; >>> + >>> switch (READ_ONCE(head->subsys->iopolicy)) { >>> case NVME_IOPOLICY_QD: >>> return nvme_queue_depth_path(head); >>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct >>> nvme_ns_head *head) >>> { >>> struct nvme_ns *ns; >>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>> + return NULL; >>> + >>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >>> continue; >>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct >>> nvme_ns_head *head) >>> { >>> if (!head->disk) >>> return; >>> - kblockd_schedule_work(&head->requeue_work); >>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>> del_gendisk(head->disk); >>> } >>> + /* >>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>> + * to allow multipath to fail all I/O. >>> + */ >>> + kblockd_schedule_work(&head->requeue_work); >> >> Not sure how this helps given that you don't wait for srcu to synchronize >> before you kick the requeue. >> > It certainly is helping in my testcase. But having a synchronize_srcu > here is probably not a bad idea. > >>> } >>> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >> >> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting >> you mentioned that ns_remove is stuck on srcu_synchronize? Can you >> explain why nvme_find_path is able to find a path given that it is >> already cleared NVME_NS_READY ? oris it nvme_available_path that is >> missing a check? Maybe can do with checking NVME_NS_READY instead? > > Turned out that the reasoning in the previous revision wasn't quite > correct; since then I have seen several test-run where the above stack > trace was the _only_ one in the system, so the stall in removing > namespaces is more a side-effect. The ns_head was still visible > in sysfs while in that state, with exactly one path left: > > # ls /sys/block > nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2 > nvme0n5 > > (whereas there had been 6 controllers with 6 namespaces). > So we fail to trigger a requeue to restart I/O on the stuck scanning > process; the actual path state really don't matter as never get this far. > This can happen when the partition scan triggered by device_add_disk() > (from one controller) interleaves with nvme_ns_remove() from another > controller. Both processes are running lockless wrt to ns_head at that > time, so if the partition scan issues I/O after the schedule_work > in nvme_mpath_shutdown_disk(): > > void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) > { > if (!head->disk) > return; > kblockd_schedule_work(&head->requeue_work); > if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > nvme_cdev_del(&head->cdev, &head->cdev_device); > del_gendisk(head->disk); > } > } > > _and_ that last path happens to be an 'inaccessible' one, I/O will be > requeued in the ns_head but never restarted, leaving to a hung process. > Note, that I/O might also be triggered by userspace (eg udev); the > device node is still present at that time. And that's also what I see > in my test runs; occasionally I get additional stuck udev processes: > [<0>] __folio_lock+0x114/0x1f0 > [<0>] truncate_inode_pages_range+0x3c0/0x3e0 > [<0>] blkdev_flush_mapping+0x45/0xe0 > [<0>] blkdev_put_whole+0x2e/0x40 > [<0>] bdev_release+0x129/0x1b0 > [<0>] blkdev_release+0xd/0x20 > [<0>] __fput+0xf7/0x2d0 > also waiting on I/O. > > You might be right checking for NS_READY might be sufficient, I'll be > checking. But we definitely need to requeue I/O after we called > del_gendisk(). > Turns out that we don't check NVME_NS_READY in all places; we would need this patch: diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index c9d23b1b8efc..d8a6f51896fd 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head *head) list_for_each_entry_rcu(ns, &head->list, siblings) { if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) continue; + if (test_bit(NVME_NS_READY, &ns->flags)) + continue; switch (nvme_ctrl_state(ns->ctrl)) { case NVME_CTRL_LIVE: case NVME_CTRL_RESETTING: in addition to moving of kblockd_schedule. So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-04 8:59 ` Hannes Reinecke @ 2024-09-05 6:43 ` Nilay Shroff 2024-09-05 7:06 ` Sagi Grimberg 1 sibling, 0 replies; 15+ messages in thread From: Nilay Shroff @ 2024-09-05 6:43 UTC (permalink / raw) To: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke, Christoph Hellwig Cc: Keith Busch, linux-nvme On 9/4/24 14:29, Hannes Reinecke wrote: > On 9/4/24 10:20, Hannes Reinecke wrote: >> On 9/3/24 21:38, Sagi Grimberg wrote: >>> >>> >>> >>> On 03/09/2024 21:03, Hannes Reinecke wrote: >>>> During repetitive namespace remapping operations (ie removing a namespace and >>>> provision a different namespace with the same NSID) on the target the >>>> namespace might have changed between the time the initial scan >>>> was performed, and partition scan was invoked by device_add_disk() >>>> in nvme_mpath_set_live(). We then end up with a stuck scanning process: >>>> >>>> [<0>] folio_wait_bit_common+0x12a/0x310 >>>> [<0>] filemap_read_folio+0x97/0xd0 >>>> [<0>] do_read_cache_folio+0x108/0x390 >>>> [<0>] read_part_sector+0x31/0xa0 >>>> [<0>] read_lba+0xc5/0x160 >>>> [<0>] efi_partition+0xd9/0x8f0 >>>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>>> [<0>] blkdev_get_whole+0x78/0xc0 >>>> [<0>] bdev_open+0x2c6/0x3b0 >>>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>>> [<0>] disk_scan_partitions+0x5d/0x100 >>>> [<0>] device_add_disk+0x402/0x420 >>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>>> >>>> This happens when we have several paths, some of which are inaccessible, >>>> and the active paths are removed first. Then nvme_find_path() will requeue >>>> I/O in the ns_head (as paths are present), but the requeue list is never >>>> triggered as all remaining paths are inactive. >>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>>> the last path has been removed to properly terminate pending I/O. >>>> >>>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>>> --- >>>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >>>> index c9d23b1b8efc..1b1deb0450ab 100644 >>>> --- a/drivers/nvme/host/multipath.c >>>> +++ b/drivers/nvme/host/multipath.c >>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) >>>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>>> { >>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>> + return NULL; >>>> + >>>> switch (READ_ONCE(head->subsys->iopolicy)) { >>>> case NVME_IOPOLICY_QD: >>>> return nvme_queue_depth_path(head); >>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) >>>> { >>>> struct nvme_ns *ns; >>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>> + return NULL; >>>> + >>>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >>>> continue; >>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >>>> { >>>> if (!head->disk) >>>> return; >>>> - kblockd_schedule_work(&head->requeue_work); >>>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>>> del_gendisk(head->disk); >>>> } >>>> + /* >>>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>>> + * to allow multipath to fail all I/O. >>>> + */ >>>> + kblockd_schedule_work(&head->requeue_work); >>> >>> Not sure how this helps given that you don't wait for srcu to synchronize >>> before you kick the requeue. >>> >> It certainly is helping in my testcase. But having a synchronize_srcu here is probably not a bad idea. >> >>>> } >>>> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >>> >>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you mentioned that ns_remove is stuck on srcu_synchronize? Can you explain why nvme_find_path is able to find a path given that it is already cleared NVME_NS_READY ? oris it nvme_available_path that is missing a check? Maybe can do with checking NVME_NS_READY instead? >> >> Turned out that the reasoning in the previous revision wasn't quite correct; since then I have seen several test-run where the above stack >> trace was the _only_ one in the system, so the stall in removing namespaces is more a side-effect. The ns_head was still visible >> in sysfs while in that state, with exactly one path left: >> >> # ls /sys/block >> nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2 nvme0n5 >> >> (whereas there had been 6 controllers with 6 namespaces). >> So we fail to trigger a requeue to restart I/O on the stuck scanning >> process; the actual path state really don't matter as never get this far. >> This can happen when the partition scan triggered by device_add_disk() (from one controller) interleaves with nvme_ns_remove() from another controller. Both processes are running lockless wrt to ns_head at that >> time, so if the partition scan issues I/O after the schedule_work >> in nvme_mpath_shutdown_disk(): >> >> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >> { >> if (!head->disk) >> return; >> kblockd_schedule_work(&head->requeue_work); >> if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> nvme_cdev_del(&head->cdev, &head->cdev_device); >> del_gendisk(head->disk); >> } >> } >> >> _and_ that last path happens to be an 'inaccessible' one, I/O will be requeued in the ns_head but never restarted, leaving to a hung process. >> Note, that I/O might also be triggered by userspace (eg udev); the device node is still present at that time. And that's also what I see >> in my test runs; occasionally I get additional stuck udev processes: >> [<0>] __folio_lock+0x114/0x1f0 >> [<0>] truncate_inode_pages_range+0x3c0/0x3e0 >> [<0>] blkdev_flush_mapping+0x45/0xe0 >> [<0>] blkdev_put_whole+0x2e/0x40 >> [<0>] bdev_release+0x129/0x1b0 >> [<0>] blkdev_release+0xd/0x20 >> [<0>] __fput+0xf7/0x2d0 >> also waiting on I/O. >> >> You might be right checking for NS_READY might be sufficient, I'll be >> checking. But we definitely need to requeue I/O after we called del_gendisk(). >> > Turns out that we don't check NVME_NS_READY in all places; we would need this patch: > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index c9d23b1b8efc..d8a6f51896fd 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head *head) > list_for_each_entry_rcu(ns, &head->list, siblings) { > if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) > continue; > + if (test_bit(NVME_NS_READY, &ns->flags)) > + continue; > switch (nvme_ctrl_state(ns->ctrl)) { > case NVME_CTRL_LIVE: > case NVME_CTRL_RESETTING: > > in addition to moving of kblockd_schedule. > I think you may want to test here, if (!test_bit(NVME_NS_READY, &ns->flags)) continue; > So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? > > Cheers, > > Hannes Thanks, --Nilay ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-04 8:59 ` Hannes Reinecke 2024-09-05 6:43 ` Nilay Shroff @ 2024-09-05 7:06 ` Sagi Grimberg 2024-09-05 8:39 ` Hannes Reinecke 1 sibling, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2024-09-05 7:06 UTC (permalink / raw) To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig Cc: Keith Busch, linux-nvme On 04/09/2024 11:59, Hannes Reinecke wrote: > On 9/4/24 10:20, Hannes Reinecke wrote: >> On 9/3/24 21:38, Sagi Grimberg wrote: >>> >>> >>> >>> On 03/09/2024 21:03, Hannes Reinecke wrote: >>>> During repetitive namespace remapping operations (ie removing a >>>> namespace and >>>> provision a different namespace with the same NSID) on the target the >>>> namespace might have changed between the time the initial scan >>>> was performed, and partition scan was invoked by device_add_disk() >>>> in nvme_mpath_set_live(). We then end up with a stuck scanning >>>> process: >>>> >>>> [<0>] folio_wait_bit_common+0x12a/0x310 >>>> [<0>] filemap_read_folio+0x97/0xd0 >>>> [<0>] do_read_cache_folio+0x108/0x390 >>>> [<0>] read_part_sector+0x31/0xa0 >>>> [<0>] read_lba+0xc5/0x160 >>>> [<0>] efi_partition+0xd9/0x8f0 >>>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>>> [<0>] blkdev_get_whole+0x78/0xc0 >>>> [<0>] bdev_open+0x2c6/0x3b0 >>>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>>> [<0>] disk_scan_partitions+0x5d/0x100 >>>> [<0>] device_add_disk+0x402/0x420 >>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>>> >>>> This happens when we have several paths, some of which are >>>> inaccessible, >>>> and the active paths are removed first. Then nvme_find_path() will >>>> requeue >>>> I/O in the ns_head (as paths are present), but the requeue list is >>>> never >>>> triggered as all remaining paths are inactive. >>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>>> the last path has been removed to properly terminate pending I/O. >>>> >>>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>>> --- >>>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/nvme/host/multipath.c >>>> b/drivers/nvme/host/multipath.c >>>> index c9d23b1b8efc..1b1deb0450ab 100644 >>>> --- a/drivers/nvme/host/multipath.c >>>> +++ b/drivers/nvme/host/multipath.c >>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct >>>> nvme_ns_head *head) >>>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>>> { >>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>> + return NULL; >>>> + >>>> switch (READ_ONCE(head->subsys->iopolicy)) { >>>> case NVME_IOPOLICY_QD: >>>> return nvme_queue_depth_path(head); >>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct >>>> nvme_ns_head *head) >>>> { >>>> struct nvme_ns *ns; >>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>> + return NULL; >>>> + >>>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >>>> continue; >>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct >>>> nvme_ns_head *head) >>>> { >>>> if (!head->disk) >>>> return; >>>> - kblockd_schedule_work(&head->requeue_work); >>>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>>> del_gendisk(head->disk); >>>> } >>>> + /* >>>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>>> + * to allow multipath to fail all I/O. >>>> + */ >>>> + kblockd_schedule_work(&head->requeue_work); >>> >>> Not sure how this helps given that you don't wait for srcu to >>> synchronize >>> before you kick the requeue. >>> >> It certainly is helping in my testcase. But having a synchronize_srcu >> here is probably not a bad idea. >> >>>> } >>>> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >>> >>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting >>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you >>> explain why nvme_find_path is able to find a path given that it is >>> already cleared NVME_NS_READY ? oris it nvme_available_path that is >>> missing a check? Maybe can do with checking NVME_NS_READY instead? >> >> Turned out that the reasoning in the previous revision wasn't quite >> correct; since then I have seen several test-run where the above stack >> trace was the _only_ one in the system, so the stall in removing >> namespaces is more a side-effect. The ns_head was still visible >> in sysfs while in that state, with exactly one path left: >> >> # ls /sys/block >> nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2 >> nvme0n5 >> >> (whereas there had been 6 controllers with 6 namespaces). >> So we fail to trigger a requeue to restart I/O on the stuck scanning >> process; the actual path state really don't matter as never get this >> far. >> This can happen when the partition scan triggered by >> device_add_disk() (from one controller) interleaves with >> nvme_ns_remove() from another controller. Both processes are running >> lockless wrt to ns_head at that >> time, so if the partition scan issues I/O after the schedule_work >> in nvme_mpath_shutdown_disk(): >> >> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >> { >> if (!head->disk) >> return; >> kblockd_schedule_work(&head->requeue_work); >> if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> nvme_cdev_del(&head->cdev, &head->cdev_device); >> del_gendisk(head->disk); >> } >> } >> >> _and_ that last path happens to be an 'inaccessible' one, I/O will be >> requeued in the ns_head but never restarted, leaving to a hung process. >> Note, that I/O might also be triggered by userspace (eg udev); the >> device node is still present at that time. And that's also what I see >> in my test runs; occasionally I get additional stuck udev processes: >> [<0>] __folio_lock+0x114/0x1f0 >> [<0>] truncate_inode_pages_range+0x3c0/0x3e0 >> [<0>] blkdev_flush_mapping+0x45/0xe0 >> [<0>] blkdev_put_whole+0x2e/0x40 >> [<0>] bdev_release+0x129/0x1b0 >> [<0>] blkdev_release+0xd/0x20 >> [<0>] __fput+0xf7/0x2d0 >> also waiting on I/O. >> >> You might be right checking for NS_READY might be sufficient, I'll be >> checking. But we definitely need to requeue I/O after we called >> del_gendisk(). >> > Turns out that we don't check NVME_NS_READY in all places; we would > need this patch: > > diff --git a/drivers/nvme/host/multipath.c > b/drivers/nvme/host/multipath.c > index c9d23b1b8efc..d8a6f51896fd 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -424,6 +424,8 @@ static bool nvme_available_path(struct > nvme_ns_head *head) > list_for_each_entry_rcu(ns, &head->list, siblings) { > if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, > &ns->ctrl->flags)) > continue; > + if (test_bit(NVME_NS_READY, &ns->flags)) > + continue; > switch (nvme_ctrl_state(ns->ctrl)) { > case NVME_CTRL_LIVE: > case NVME_CTRL_RESETTING: > > in addition to moving of kblockd_schedule. > > So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but also in nvme_mpath_revalidate_paths() if the capacity changed, so theoretically, if the last remaining path changed its capacity, it will make nvme_available_path() fail, and in turn, mpath IO will fail and not added to the requeue_list, which would be wrong. So I think that nvme_available_path should check NVME_NSHEAD_DISK_LIVE, but nvme_find_path is fine in its current form. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-05 7:06 ` Sagi Grimberg @ 2024-09-05 8:39 ` Hannes Reinecke 2024-09-05 9:06 ` Sagi Grimberg 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-09-05 8:39 UTC (permalink / raw) To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 9/5/24 09:06, Sagi Grimberg wrote: > > > > On 04/09/2024 11:59, Hannes Reinecke wrote: >> On 9/4/24 10:20, Hannes Reinecke wrote: >>> On 9/3/24 21:38, Sagi Grimberg wrote: >>>> >>>> >>>> >>>> On 03/09/2024 21:03, Hannes Reinecke wrote: >>>>> During repetitive namespace remapping operations (ie removing a >>>>> namespace and >>>>> provision a different namespace with the same NSID) on the target the >>>>> namespace might have changed between the time the initial scan >>>>> was performed, and partition scan was invoked by device_add_disk() >>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning >>>>> process: >>>>> >>>>> [<0>] folio_wait_bit_common+0x12a/0x310 >>>>> [<0>] filemap_read_folio+0x97/0xd0 >>>>> [<0>] do_read_cache_folio+0x108/0x390 >>>>> [<0>] read_part_sector+0x31/0xa0 >>>>> [<0>] read_lba+0xc5/0x160 >>>>> [<0>] efi_partition+0xd9/0x8f0 >>>>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>>>> [<0>] blkdev_get_whole+0x78/0xc0 >>>>> [<0>] bdev_open+0x2c6/0x3b0 >>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>>>> [<0>] disk_scan_partitions+0x5d/0x100 >>>>> [<0>] device_add_disk+0x402/0x420 >>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>>>> >>>>> This happens when we have several paths, some of which are >>>>> inaccessible, >>>>> and the active paths are removed first. Then nvme_find_path() will >>>>> requeue >>>>> I/O in the ns_head (as paths are present), but the requeue list is >>>>> never >>>>> triggered as all remaining paths are inactive. >>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>>>> the last path has been removed to properly terminate pending I/O. >>>>> >>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>>>> --- >>>>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/nvme/host/multipath.c >>>>> b/drivers/nvme/host/multipath.c >>>>> index c9d23b1b8efc..1b1deb0450ab 100644 >>>>> --- a/drivers/nvme/host/multipath.c >>>>> +++ b/drivers/nvme/host/multipath.c >>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct >>>>> nvme_ns_head *head) >>>>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>>>> { >>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>>> + return NULL; >>>>> + >>>>> switch (READ_ONCE(head->subsys->iopolicy)) { >>>>> case NVME_IOPOLICY_QD: >>>>> return nvme_queue_depth_path(head); >>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct >>>>> nvme_ns_head *head) >>>>> { >>>>> struct nvme_ns *ns; >>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>>> + return NULL; >>>>> + >>>>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>>>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >>>>> continue; >>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct >>>>> nvme_ns_head *head) >>>>> { >>>>> if (!head->disk) >>>>> return; >>>>> - kblockd_schedule_work(&head->requeue_work); >>>>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>>>> del_gendisk(head->disk); >>>>> } >>>>> + /* >>>>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>>>> + * to allow multipath to fail all I/O. >>>>> + */ >>>>> + kblockd_schedule_work(&head->requeue_work); >>>> >>>> Not sure how this helps given that you don't wait for srcu to >>>> synchronize >>>> before you kick the requeue. >>>> >>> It certainly is helping in my testcase. But having a synchronize_srcu >>> here is probably not a bad idea. >>> >>>>> } >>>>> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >>>> >>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting >>>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you >>>> explain why nvme_find_path is able to find a path given that it is >>>> already cleared NVME_NS_READY ? oris it nvme_available_path that is >>>> missing a check? Maybe can do with checking NVME_NS_READY instead? >>> >>> Turned out that the reasoning in the previous revision wasn't quite >>> correct; since then I have seen several test-run where the above stack >>> trace was the _only_ one in the system, so the stall in removing >>> namespaces is more a side-effect. The ns_head was still visible >>> in sysfs while in that state, with exactly one path left: >>> >>> # ls /sys/block >>> nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2 >>> nvme0n5 >>> >>> (whereas there had been 6 controllers with 6 namespaces). >>> So we fail to trigger a requeue to restart I/O on the stuck scanning >>> process; the actual path state really don't matter as never get this >>> far. >>> This can happen when the partition scan triggered by >>> device_add_disk() (from one controller) interleaves with >>> nvme_ns_remove() from another controller. Both processes are running >>> lockless wrt to ns_head at that >>> time, so if the partition scan issues I/O after the schedule_work >>> in nvme_mpath_shutdown_disk(): >>> >>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >>> { >>> if (!head->disk) >>> return; >>> kblockd_schedule_work(&head->requeue_work); >>> if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>> del_gendisk(head->disk); >>> } >>> } >>> >>> _and_ that last path happens to be an 'inaccessible' one, I/O will be >>> requeued in the ns_head but never restarted, leaving to a hung process. >>> Note, that I/O might also be triggered by userspace (eg udev); the >>> device node is still present at that time. And that's also what I see >>> in my test runs; occasionally I get additional stuck udev processes: >>> [<0>] __folio_lock+0x114/0x1f0 >>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0 >>> [<0>] blkdev_flush_mapping+0x45/0xe0 >>> [<0>] blkdev_put_whole+0x2e/0x40 >>> [<0>] bdev_release+0x129/0x1b0 >>> [<0>] blkdev_release+0xd/0x20 >>> [<0>] __fput+0xf7/0x2d0 >>> also waiting on I/O. >>> >>> You might be right checking for NS_READY might be sufficient, I'll be >>> checking. But we definitely need to requeue I/O after we called >>> del_gendisk(). >>> >> Turns out that we don't check NVME_NS_READY in all places; we would >> need this patch: >> >> diff --git a/drivers/nvme/host/multipath.c >> b/drivers/nvme/host/multipath.c >> index c9d23b1b8efc..d8a6f51896fd 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct >> nvme_ns_head *head) >> list_for_each_entry_rcu(ns, &head->list, siblings) { >> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, >> &ns->ctrl->flags)) >> continue; >> + if (test_bit(NVME_NS_READY, &ns->flags)) >> + continue; >> switch (nvme_ctrl_state(ns->ctrl)) { >> case NVME_CTRL_LIVE: >> case NVME_CTRL_RESETTING: >> >> in addition to moving of kblockd_schedule. >> >> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? > > Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but also > in nvme_mpath_revalidate_paths() if the capacity changed, so theoretically, > if the last remaining path changed its capacity, it will make > nvme_available_path() > fail, and in turn, mpath IO will fail and not added to the requeue_list, > which would be > wrong. So I think that nvme_available_path should check > NVME_NSHEAD_DISK_LIVE, > but nvme_find_path is fine in its current form. Weelll ... not quite. I never get this far, as the only place where nvme_remove_ns() can be called is from the scanning process, yet this is stuck waiting for I/O. My testcase then simulated a namespace replacement by setting a new UUID, and then enabling the namespace again. The target will return PATH ERROR during that time, causing I/O to be retried (while disabled) or failed over to the same path (as it's the only one left). Of course we could argue that the testcase is invalid, and the target should return INVALID_NS if the UUID changed. But one could also argue that retrying on the same path is pointless, seeing that we'll always get the same error. Hmm? Guess it'll make a nice topic for ALPSS ... Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-05 8:39 ` Hannes Reinecke @ 2024-09-05 9:06 ` Sagi Grimberg 2024-09-05 9:23 ` Hannes Reinecke 0 siblings, 1 reply; 15+ messages in thread From: Sagi Grimberg @ 2024-09-05 9:06 UTC (permalink / raw) To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig Cc: Keith Busch, linux-nvme On 05/09/2024 11:39, Hannes Reinecke wrote: > On 9/5/24 09:06, Sagi Grimberg wrote: >> >> >> >> On 04/09/2024 11:59, Hannes Reinecke wrote: >>> On 9/4/24 10:20, Hannes Reinecke wrote: >>>> On 9/3/24 21:38, Sagi Grimberg wrote: >>>>> >>>>> >>>>> >>>>> On 03/09/2024 21:03, Hannes Reinecke wrote: >>>>>> During repetitive namespace remapping operations (ie removing a >>>>>> namespace and >>>>>> provision a different namespace with the same NSID) on the target >>>>>> the >>>>>> namespace might have changed between the time the initial scan >>>>>> was performed, and partition scan was invoked by device_add_disk() >>>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning >>>>>> process: >>>>>> >>>>>> [<0>] folio_wait_bit_common+0x12a/0x310 >>>>>> [<0>] filemap_read_folio+0x97/0xd0 >>>>>> [<0>] do_read_cache_folio+0x108/0x390 >>>>>> [<0>] read_part_sector+0x31/0xa0 >>>>>> [<0>] read_lba+0xc5/0x160 >>>>>> [<0>] efi_partition+0xd9/0x8f0 >>>>>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>>>>> [<0>] blkdev_get_whole+0x78/0xc0 >>>>>> [<0>] bdev_open+0x2c6/0x3b0 >>>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>>>>> [<0>] disk_scan_partitions+0x5d/0x100 >>>>>> [<0>] device_add_disk+0x402/0x420 >>>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>>>>> >>>>>> This happens when we have several paths, some of which are >>>>>> inaccessible, >>>>>> and the active paths are removed first. Then nvme_find_path() >>>>>> will requeue >>>>>> I/O in the ns_head (as paths are present), but the requeue list >>>>>> is never >>>>>> triggered as all remaining paths are inactive. >>>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>>>>> the last path has been removed to properly terminate pending I/O. >>>>>> >>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>>>>> --- >>>>>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>>>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/drivers/nvme/host/multipath.c >>>>>> b/drivers/nvme/host/multipath.c >>>>>> index c9d23b1b8efc..1b1deb0450ab 100644 >>>>>> --- a/drivers/nvme/host/multipath.c >>>>>> +++ b/drivers/nvme/host/multipath.c >>>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct >>>>>> nvme_ns_head *head) >>>>>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>>>>> { >>>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>>>> + return NULL; >>>>>> + >>>>>> switch (READ_ONCE(head->subsys->iopolicy)) { >>>>>> case NVME_IOPOLICY_QD: >>>>>> return nvme_queue_depth_path(head); >>>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct >>>>>> nvme_ns_head *head) >>>>>> { >>>>>> struct nvme_ns *ns; >>>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>>>>> + return NULL; >>>>>> + >>>>>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>>>>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, >>>>>> &ns->ctrl->flags)) >>>>>> continue; >>>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct >>>>>> nvme_ns_head *head) >>>>>> { >>>>>> if (!head->disk) >>>>>> return; >>>>>> - kblockd_schedule_work(&head->requeue_work); >>>>>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>>>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>>>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>>>>> del_gendisk(head->disk); >>>>>> } >>>>>> + /* >>>>>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>>>>> + * to allow multipath to fail all I/O. >>>>>> + */ >>>>>> + kblockd_schedule_work(&head->requeue_work); >>>>> >>>>> Not sure how this helps given that you don't wait for srcu to >>>>> synchronize >>>>> before you kick the requeue. >>>>> >>>> It certainly is helping in my testcase. But having a >>>> synchronize_srcu here is probably not a bad idea. >>>> >>>>>> } >>>>>> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >>>>> >>>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last >>>>> posting you mentioned that ns_remove is stuck on srcu_synchronize? >>>>> Can you explain why nvme_find_path is able to find a path given >>>>> that it is already cleared NVME_NS_READY ? oris it >>>>> nvme_available_path that is missing a check? Maybe can do with >>>>> checking NVME_NS_READY instead? >>>> >>>> Turned out that the reasoning in the previous revision wasn't quite >>>> correct; since then I have seen several test-run where the above stack >>>> trace was the _only_ one in the system, so the stall in removing >>>> namespaces is more a side-effect. The ns_head was still visible >>>> in sysfs while in that state, with exactly one path left: >>>> >>>> # ls /sys/block >>>> nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 >>>> nvme0n2 nvme0n5 >>>> >>>> (whereas there had been 6 controllers with 6 namespaces). >>>> So we fail to trigger a requeue to restart I/O on the stuck scanning >>>> process; the actual path state really don't matter as never get >>>> this far. >>>> This can happen when the partition scan triggered by >>>> device_add_disk() (from one controller) interleaves with >>>> nvme_ns_remove() from another controller. Both processes are >>>> running lockless wrt to ns_head at that >>>> time, so if the partition scan issues I/O after the schedule_work >>>> in nvme_mpath_shutdown_disk(): >>>> >>>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >>>> { >>>> if (!head->disk) >>>> return; >>>> kblockd_schedule_work(&head->requeue_work); >>>> if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>>> del_gendisk(head->disk); >>>> } >>>> } >>>> >>>> _and_ that last path happens to be an 'inaccessible' one, I/O will >>>> be requeued in the ns_head but never restarted, leaving to a hung >>>> process. >>>> Note, that I/O might also be triggered by userspace (eg udev); the >>>> device node is still present at that time. And that's also what I see >>>> in my test runs; occasionally I get additional stuck udev processes: >>>> [<0>] __folio_lock+0x114/0x1f0 >>>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0 >>>> [<0>] blkdev_flush_mapping+0x45/0xe0 >>>> [<0>] blkdev_put_whole+0x2e/0x40 >>>> [<0>] bdev_release+0x129/0x1b0 >>>> [<0>] blkdev_release+0xd/0x20 >>>> [<0>] __fput+0xf7/0x2d0 >>>> also waiting on I/O. >>>> >>>> You might be right checking for NS_READY might be sufficient, I'll be >>>> checking. But we definitely need to requeue I/O after we called >>>> del_gendisk(). >>>> >>> Turns out that we don't check NVME_NS_READY in all places; we would >>> need this patch: >>> >>> diff --git a/drivers/nvme/host/multipath.c >>> b/drivers/nvme/host/multipath.c >>> index c9d23b1b8efc..d8a6f51896fd 100644 >>> --- a/drivers/nvme/host/multipath.c >>> +++ b/drivers/nvme/host/multipath.c >>> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct >>> nvme_ns_head *head) >>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, >>> &ns->ctrl->flags)) >>> continue; >>> + if (test_bit(NVME_NS_READY, &ns->flags)) >>> + continue; >>> switch (nvme_ctrl_state(ns->ctrl)) { >>> case NVME_CTRL_LIVE: >>> case NVME_CTRL_RESETTING: >>> >>> in addition to moving of kblockd_schedule. >>> >>> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? >> >> Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but >> also >> in nvme_mpath_revalidate_paths() if the capacity changed, so >> theoretically, >> if the last remaining path changed its capacity, it will make >> nvme_available_path() >> fail, and in turn, mpath IO will fail and not added to the >> requeue_list, which would be >> wrong. So I think that nvme_available_path should check >> NVME_NSHEAD_DISK_LIVE, >> but nvme_find_path is fine in its current form. > > Weelll ... not quite. > I never get this far, as the only place where nvme_remove_ns() can be > called is from the scanning process, yet this is stuck waiting for I/O. > > My testcase then simulated a namespace replacement by setting a new > UUID, and then enabling the namespace again. The target will return > PATH ERROR during that time, causing I/O to be retried (while disabled) > or failed over to the same path (as it's the only one left). > > Of course we could argue that the testcase is invalid, and the target > should return INVALID_NS if the UUID changed. > But one could also argue that retrying on the same path is pointless, > seeing that we'll always get the same error. > Hmm? Hannes, I was referring to the general case where we have more than one path. What I said is that I think that if you drop the NSHEAD_DISK_LIVE check in nvme_find_path (it already checks for NS_READY) but have nvme_available_path check for NSHEAD_DISK_LIVE and both your test case should work and the general case where there is more than a single path. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-05 9:06 ` Sagi Grimberg @ 2024-09-05 9:23 ` Hannes Reinecke 0 siblings, 0 replies; 15+ messages in thread From: Hannes Reinecke @ 2024-09-05 9:23 UTC (permalink / raw) To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 9/5/24 11:06, Sagi Grimberg wrote: > > > > On 05/09/2024 11:39, Hannes Reinecke wrote: >> On 9/5/24 09:06, Sagi Grimberg wrote: >>> >>> >>> >>> On 04/09/2024 11:59, Hannes Reinecke wrote: >>>> On 9/4/24 10:20, Hannes Reinecke wrote: [ .. ] >>>> in addition to moving of kblockd_schedule. >>>> >>>> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY? >>> >>> Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but >>> also >>> in nvme_mpath_revalidate_paths() if the capacity changed, so >>> theoretically, >>> if the last remaining path changed its capacity, it will make >>> nvme_available_path() >>> fail, and in turn, mpath IO will fail and not added to the >>> requeue_list, which would be >>> wrong. So I think that nvme_available_path should check >>> NVME_NSHEAD_DISK_LIVE, >>> but nvme_find_path is fine in its current form. >> >> Weelll ... not quite. >> I never get this far, as the only place where nvme_remove_ns() can be >> called is from the scanning process, yet this is stuck waiting for I/O. >> >> My testcase then simulated a namespace replacement by setting a new >> UUID, and then enabling the namespace again. The target will return >> PATH ERROR during that time, causing I/O to be retried (while disabled) >> or failed over to the same path (as it's the only one left). >> >> Of course we could argue that the testcase is invalid, and the target >> should return INVALID_NS if the UUID changed. >> But one could also argue that retrying on the same path is pointless, >> seeing that we'll always get the same error. >> Hmm? > > Hannes, I was referring to the general case where we have more than one > path. > > What I said is that I think that if you drop the NSHEAD_DISK_LIVE check > in nvme_find_path > (it already checks for NS_READY) but have nvme_available_path check for > NSHEAD_DISK_LIVE > and both your test case should work and the general case where there is > more than a single > path. Ah, _that's_ what you mean. Sure, sounds feasible. I'll give it a spin. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald, Werner Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke 2024-09-03 19:38 ` Sagi Grimberg @ 2024-09-04 14:52 ` Nilay Shroff 2024-09-04 15:55 ` Hannes Reinecke 1 sibling, 1 reply; 15+ messages in thread From: Nilay Shroff @ 2024-09-04 14:52 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme On 9/3/24 23:33, Hannes Reinecke wrote: > During repetitive namespace remapping operations (ie removing a namespace and > provision a different namespace with the same NSID) on the target the > namespace might have changed between the time the initial scan > was performed, and partition scan was invoked by device_add_disk() > in nvme_mpath_set_live(). We then end up with a stuck scanning process: > > [<0>] folio_wait_bit_common+0x12a/0x310 > [<0>] filemap_read_folio+0x97/0xd0 > [<0>] do_read_cache_folio+0x108/0x390 > [<0>] read_part_sector+0x31/0xa0 > [<0>] read_lba+0xc5/0x160 > [<0>] efi_partition+0xd9/0x8f0 > [<0>] bdev_disk_changed+0x23d/0x6d0 > [<0>] blkdev_get_whole+0x78/0xc0 > [<0>] bdev_open+0x2c6/0x3b0 > [<0>] bdev_file_open_by_dev+0xcb/0x120 > [<0>] disk_scan_partitions+0x5d/0x100 > [<0>] device_add_disk+0x402/0x420 > [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] > [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] > [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] > [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] > [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] > > This happens when we have several paths, some of which are inaccessible, > and the active paths are removed first. Then nvme_find_path() will requeue > I/O in the ns_head (as paths are present), but the requeue list is never > triggered as all remaining paths are inactive. > This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, > and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once > the last path has been removed to properly terminate pending I/O. > > Signed-off-by: Hannes Reinecke <hare@kernel.org> > --- > drivers/nvme/host/multipath.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index c9d23b1b8efc..1b1deb0450ab 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) > > inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) > { > + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) > + return NULL; > + > switch (READ_ONCE(head->subsys->iopolicy)) { > case NVME_IOPOLICY_QD: > return nvme_queue_depth_path(head); > @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) > { > struct nvme_ns *ns; > > + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) > + return NULL; > + > list_for_each_entry_rcu(ns, &head->list, siblings) { > if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) > continue; > @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) > { > if (!head->disk) > return; > - kblockd_schedule_work(&head->requeue_work); > - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > nvme_cdev_del(&head->cdev, &head->cdev_device); > del_gendisk(head->disk); > } > + /* > + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared > + * to allow multipath to fail all I/O. > + */ > + kblockd_schedule_work(&head->requeue_work); > } Not sure, in your case, whether nvme_mpath_remove_disk() is called or not? The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list. > > void nvme_mpath_remove_disk(struct nvme_ns_head *head) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-04 14:52 ` Nilay Shroff @ 2024-09-04 15:55 ` Hannes Reinecke 2024-09-05 6:41 ` Nilay Shroff 0 siblings, 1 reply; 15+ messages in thread From: Hannes Reinecke @ 2024-09-04 15:55 UTC (permalink / raw) To: Nilay Shroff, Hannes Reinecke, Christoph Hellwig Cc: Keith Busch, Sagi Grimberg, linux-nvme On 9/4/24 16:52, Nilay Shroff wrote: > > > On 9/3/24 23:33, Hannes Reinecke wrote: >> During repetitive namespace remapping operations (ie removing a namespace and >> provision a different namespace with the same NSID) on the target the >> namespace might have changed between the time the initial scan >> was performed, and partition scan was invoked by device_add_disk() >> in nvme_mpath_set_live(). We then end up with a stuck scanning process: >> >> [<0>] folio_wait_bit_common+0x12a/0x310 >> [<0>] filemap_read_folio+0x97/0xd0 >> [<0>] do_read_cache_folio+0x108/0x390 >> [<0>] read_part_sector+0x31/0xa0 >> [<0>] read_lba+0xc5/0x160 >> [<0>] efi_partition+0xd9/0x8f0 >> [<0>] bdev_disk_changed+0x23d/0x6d0 >> [<0>] blkdev_get_whole+0x78/0xc0 >> [<0>] bdev_open+0x2c6/0x3b0 >> [<0>] bdev_file_open_by_dev+0xcb/0x120 >> [<0>] disk_scan_partitions+0x5d/0x100 >> [<0>] device_add_disk+0x402/0x420 >> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >> >> This happens when we have several paths, some of which are inaccessible, >> and the active paths are removed first. Then nvme_find_path() will requeue >> I/O in the ns_head (as paths are present), but the requeue list is never >> triggered as all remaining paths are inactive. >> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >> the last path has been removed to properly terminate pending I/O. >> >> Signed-off-by: Hannes Reinecke <hare@kernel.org> >> --- >> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >> index c9d23b1b8efc..1b1deb0450ab 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) >> >> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >> { >> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >> + return NULL; >> + >> switch (READ_ONCE(head->subsys->iopolicy)) { >> case NVME_IOPOLICY_QD: >> return nvme_queue_depth_path(head); >> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) >> { >> struct nvme_ns *ns; >> >> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >> + return NULL; >> + >> list_for_each_entry_rcu(ns, &head->list, siblings) { >> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >> continue; >> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >> { >> if (!head->disk) >> return; >> - kblockd_schedule_work(&head->requeue_work); >> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> nvme_cdev_del(&head->cdev, &head->cdev_device); >> del_gendisk(head->disk); >> } >> + /* >> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >> + * to allow multipath to fail all I/O. >> + */ >> + kblockd_schedule_work(&head->requeue_work); >> } > Not sure, in your case, whether nvme_mpath_remove_disk() is called or not? > The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list. > It would, but we're never get this far as the reference to ns_head it held by the above stack trace (and nvme_mpath_remove_disk()) is called when the last reference to ns_head drops. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces 2024-09-04 15:55 ` Hannes Reinecke @ 2024-09-05 6:41 ` Nilay Shroff 0 siblings, 0 replies; 15+ messages in thread From: Nilay Shroff @ 2024-09-05 6:41 UTC (permalink / raw) To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig Cc: Keith Busch, Sagi Grimberg, linux-nvme On 9/4/24 21:25, Hannes Reinecke wrote: > On 9/4/24 16:52, Nilay Shroff wrote: >> >> >> On 9/3/24 23:33, Hannes Reinecke wrote: >>> During repetitive namespace remapping operations (ie removing a namespace and >>> provision a different namespace with the same NSID) on the target the >>> namespace might have changed between the time the initial scan >>> was performed, and partition scan was invoked by device_add_disk() >>> in nvme_mpath_set_live(). We then end up with a stuck scanning process: >>> >>> [<0>] folio_wait_bit_common+0x12a/0x310 >>> [<0>] filemap_read_folio+0x97/0xd0 >>> [<0>] do_read_cache_folio+0x108/0x390 >>> [<0>] read_part_sector+0x31/0xa0 >>> [<0>] read_lba+0xc5/0x160 >>> [<0>] efi_partition+0xd9/0x8f0 >>> [<0>] bdev_disk_changed+0x23d/0x6d0 >>> [<0>] blkdev_get_whole+0x78/0xc0 >>> [<0>] bdev_open+0x2c6/0x3b0 >>> [<0>] bdev_file_open_by_dev+0xcb/0x120 >>> [<0>] disk_scan_partitions+0x5d/0x100 >>> [<0>] device_add_disk+0x402/0x420 >>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] >>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] >>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] >>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] >>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] >>> >>> This happens when we have several paths, some of which are inaccessible, >>> and the active paths are removed first. Then nvme_find_path() will requeue >>> I/O in the ns_head (as paths are present), but the requeue list is never >>> triggered as all remaining paths are inactive. >>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path, >>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once >>> the last path has been removed to properly terminate pending I/O. >>> >>> Signed-off-by: Hannes Reinecke <hare@kernel.org> >>> --- >>> drivers/nvme/host/multipath.c | 14 ++++++++++++-- >>> 1 file changed, 12 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >>> index c9d23b1b8efc..1b1deb0450ab 100644 >>> --- a/drivers/nvme/host/multipath.c >>> +++ b/drivers/nvme/host/multipath.c >>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head) >>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head) >>> { >>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>> + return NULL; >>> + >>> switch (READ_ONCE(head->subsys->iopolicy)) { >>> case NVME_IOPOLICY_QD: >>> return nvme_queue_depth_path(head); >>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head) >>> { >>> struct nvme_ns *ns; >>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) >>> + return NULL; >>> + >>> list_for_each_entry_rcu(ns, &head->list, siblings) { >>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) >>> continue; >>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >>> { >>> if (!head->disk) >>> return; >>> - kblockd_schedule_work(&head->requeue_work); >>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>> del_gendisk(head->disk); >>> } >>> + /* >>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>> + * to allow multipath to fail all I/O. >>> + */ >>> + kblockd_schedule_work(&head->requeue_work); >>> } >> Not sure, in your case, whether nvme_mpath_remove_disk() is called or not? >> The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list. >> > It would, but we're never get this far as the reference to ns_head it held by the above stack trace (and nvme_mpath_remove_disk()) is called > when the last reference to ns_head drops. OK, I see. In that case, I believe checking NVME_NS_READY in nvme_avalibale_path() should be a good idea. > > Cheers, > > Hannes Thanks, --Nilay ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-09-05 9:24 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke 2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke 2024-09-04 4:09 ` Christoph Hellwig 2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke 2024-09-03 19:38 ` Sagi Grimberg 2024-09-04 8:20 ` Hannes Reinecke 2024-09-04 8:59 ` Hannes Reinecke 2024-09-05 6:43 ` Nilay Shroff 2024-09-05 7:06 ` Sagi Grimberg 2024-09-05 8:39 ` Hannes Reinecke 2024-09-05 9:06 ` Sagi Grimberg 2024-09-05 9:23 ` Hannes Reinecke 2024-09-04 14:52 ` Nilay Shroff 2024-09-04 15:55 ` Hannes Reinecke 2024-09-05 6:41 ` Nilay Shroff
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.