[PATCHv2 0/2] nvme: NSHEAD_DISK

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-03 18:03 Hannes Reinecke
  2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
  2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
  0 siblings, 2 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

Hi all,

I'm have a testcase which repeatedly deletes namespaces on the target
and creates new namespaces, and aggressively re-using NSIDs for the
new namespaces.
To throw in more fun these namespaces are created on different nodes
in the cluster, where only the paths local to the cluster node are
active, and all other paths are inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
rm ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()' waiting for I/O which will never
come.
There are two parts to fixing this:
We need to ensure the NSHEAD_DISK_LIVE is properly set when the
ns_head is live, and unset when the last path is gone.
And we need to trigger the requeue list after NSHEAD_DISK_LIVE
has been cleared to flush all outstanding I/O.
With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I see to get the testcase added to blktests.

As usual, comments and reviews are welcome.

Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared

Hannes Reinecke (2):
  nvme-multipath: fixup typo when clearing DISK_LIVE
  nvme-multipath: fix I/O stall when remapping namespaces

 drivers/nvme/host/multipath.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE
  2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
@ 2024-09-03 18:03 ` Hannes Reinecke
  2024-09-04  4:09   ` Christoph Hellwig
  2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
  1 sibling, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
'struct nvme_ns'.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 91d9eb3c22ef..c9d23b1b8efc 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 		rc = device_add_disk(&head->subsys->dev, head->disk,
 				     nvme_ns_attr_groups);
 		if (rc) {
-			clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
+			clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
 			return;
 		}
 		nvme_add_ns_head_cdev(head);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE
  2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-04  4:09   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2024-09-04  4:09 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, linux-nvme

On Tue, Sep 03, 2024 at 08:03:44PM +0200, Hannes Reinecke wrote:
> NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
> 'struct nvme_ns'.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Although a sentence on how this was found and if this causes
real issues wouldn't hurt.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
  2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-03 18:03 ` Hannes Reinecke
  2024-09-03 19:38   ` Sagi Grimberg
  2024-09-04 14:52   ` Nilay Shroff
  1 sibling, 2 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-03 18:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

During repetitive namespace remapping operations (ie removing a namespace and
provision a different namespace with the same NSID) on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:

[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]

This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.
This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed to properly terminate pending I/O.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..1b1deb0450ab 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
+	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+		return NULL;
+
 	switch (READ_ONCE(head->subsys->iopolicy)) {
 	case NVME_IOPOLICY_QD:
 		return nvme_queue_depth_path(head);
@@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 {
 	struct nvme_ns *ns;
 
+	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+		return NULL;
+
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
 			continue;
@@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
 {
 	if (!head->disk)
 		return;
-	kblockd_schedule_work(&head->requeue_work);
-	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
 		nvme_cdev_del(&head->cdev, &head->cdev_device);
 		del_gendisk(head->disk);
 	}
+	/*
+	 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+	 * to allow multipath to fail all I/O.
+	 */
+	kblockd_schedule_work(&head->requeue_work);
 }
 
 void nvme_mpath_remove_disk(struct nvme_ns_head *head)
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
@ 2024-09-03 19:38   ` Sagi Grimberg
  2024-09-04  8:20     ` Hannes Reinecke
  2024-09-04 14:52   ` Nilay Shroff
  1 sibling, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2024-09-03 19:38 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme




On 03/09/2024 21:03, Hannes Reinecke wrote:
> During repetitive namespace remapping operations (ie removing a namespace and
> provision a different namespace with the same NSID) on the target the
> namespace might have changed between the time the initial scan
> was performed, and partition scan was invoked by device_add_disk()
> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>
> This happens when we have several paths, some of which are inaccessible,
> and the active paths are removed first. Then nvme_find_path() will requeue
> I/O in the ns_head (as paths are present), but the requeue list is never
> triggered as all remaining paths are inactive.
> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
> the last path has been removed to properly terminate pending I/O.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index c9d23b1b8efc..1b1deb0450ab 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>   
>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>   {
> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> +		return NULL;
> +
>   	switch (READ_ONCE(head->subsys->iopolicy)) {
>   	case NVME_IOPOLICY_QD:
>   		return nvme_queue_depth_path(head);
> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>   {
>   	struct nvme_ns *ns;
>   
> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> +		return NULL;
> +
>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>   		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>   			continue;
> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>   {
>   	if (!head->disk)
>   		return;
> -	kblockd_schedule_work(&head->requeue_work);
> -	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> +	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>   		nvme_cdev_del(&head->cdev, &head->cdev_device);
>   		del_gendisk(head->disk);
>   	}
> +	/*
> +	 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> +	 * to allow multipath to fail all I/O.
> +	 */
> +	kblockd_schedule_work(&head->requeue_work);

Not sure how this helps given that you don't wait for srcu to synchronize
before you kick the requeue.

>   }
>   
>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)

Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you 
mentioned that ns_remove is stuck on srcu_synchronize? Can you explain 
why nvme_find_path is able to find a path given that it is already 
cleared NVME_NS_READY ? oris it nvme_available_path that is missing a 
check? Maybe can do with checking NVME_NS_READY instead?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-03 19:38   ` Sagi Grimberg
@ 2024-09-04  8:20     ` Hannes Reinecke
  2024-09-04  8:59       ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-04  8:20 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/3/24 21:38, Sagi Grimberg wrote:
> 
> 
> 
> On 03/09/2024 21:03, Hannes Reinecke wrote:
>> During repetitive namespace remapping operations (ie removing a 
>> namespace and
>> provision a different namespace with the same NSID) on the target the
>> namespace might have changed between the time the initial scan
>> was performed, and partition scan was invoked by device_add_disk()
>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>
>> [<0>] folio_wait_bit_common+0x12a/0x310
>> [<0>] filemap_read_folio+0x97/0xd0
>> [<0>] do_read_cache_folio+0x108/0x390
>> [<0>] read_part_sector+0x31/0xa0
>> [<0>] read_lba+0xc5/0x160
>> [<0>] efi_partition+0xd9/0x8f0
>> [<0>] bdev_disk_changed+0x23d/0x6d0
>> [<0>] blkdev_get_whole+0x78/0xc0
>> [<0>] bdev_open+0x2c6/0x3b0
>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>> [<0>] disk_scan_partitions+0x5d/0x100
>> [<0>] device_add_disk+0x402/0x420
>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>
>> This happens when we have several paths, some of which are inaccessible,
>> and the active paths are removed first. Then nvme_find_path() will 
>> requeue
>> I/O in the ns_head (as paths are present), but the requeue list is never
>> triggered as all remaining paths are inactive.
>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>> the last path has been removed to properly terminate pending I/O.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index c9d23b1b8efc..1b1deb0450ab 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>> nvme_ns_head *head)
>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>   {
>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>> +        return NULL;
>> +
>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>       case NVME_IOPOLICY_QD:
>>           return nvme_queue_depth_path(head);
>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>> nvme_ns_head *head)
>>   {
>>       struct nvme_ns *ns;
>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>> +        return NULL;
>> +
>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>               continue;
>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>> nvme_ns_head *head)
>>   {
>>       if (!head->disk)
>>           return;
>> -    kblockd_schedule_work(&head->requeue_work);
>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>           del_gendisk(head->disk);
>>       }
>> +    /*
>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>> +     * to allow multipath to fail all I/O.
>> +     */
>> +    kblockd_schedule_work(&head->requeue_work);
> 
> Not sure how this helps given that you don't wait for srcu to synchronize
> before you kick the requeue.
> 
It certainly is helping in my testcase. But having a synchronize_srcu 
here is probably not a bad idea.

>>   }
>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
> 
> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you 
> mentioned that ns_remove is stuck on srcu_synchronize? Can you explain 
> why nvme_find_path is able to find a path given that it is already 
> cleared NVME_NS_READY ? oris it nvme_available_path that is missing a 
> check? Maybe can do with checking NVME_NS_READY instead?

Turned out that the reasoning in the previous revision wasn't quite 
correct; since then I have seen several test-run where the above stack
trace was the _only_ one in the system, so the stall in removing 
namespaces is more a side-effect. The ns_head was still visible
in sysfs while in that state, with exactly one path left:

# ls /sys/block
nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5  nvme0n2 
nvme0n5

(whereas there had been 6 controllers with 6 namespaces).
So we fail to trigger a requeue to restart I/O on the stuck scanning
process; the actual path state really don't matter as never get this far.
This can happen when the partition scan triggered by device_add_disk() 
(from one controller) interleaves with nvme_ns_remove() from another 
controller. Both processes are running lockless wrt to ns_head at that
time, so if the partition scan issues I/O after the schedule_work
in nvme_mpath_shutdown_disk():

void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
	if (!head->disk)
		return;
	kblockd_schedule_work(&head->requeue_work);
	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
		nvme_cdev_del(&head->cdev, &head->cdev_device);
		del_gendisk(head->disk);
	}
}

_and_ that last path happens to be an 'inaccessible' one, I/O will be 
requeued in the ns_head but never restarted, leaving to a hung process.
Note, that I/O might also be triggered by userspace (eg udev); the 
device node is still present at that time. And that's also what I see
in my test runs; occasionally I get additional stuck udev processes:
[<0>] __folio_lock+0x114/0x1f0
[<0>] truncate_inode_pages_range+0x3c0/0x3e0
[<0>] blkdev_flush_mapping+0x45/0xe0
[<0>] blkdev_put_whole+0x2e/0x40
[<0>] bdev_release+0x129/0x1b0
[<0>] blkdev_release+0xd/0x20
[<0>] __fput+0xf7/0x2d0
also waiting on I/O.

You might be right checking for NS_READY might be sufficient, I'll be
checking. But we definitely need to requeue I/O after we called 
del_gendisk().

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-04  8:20     ` Hannes Reinecke
@ 2024-09-04  8:59       ` Hannes Reinecke
  2024-09-05  6:43         ` Nilay Shroff
  2024-09-05  7:06         ` Sagi Grimberg
  0 siblings, 2 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-04  8:59 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/4/24 10:20, Hannes Reinecke wrote:
> On 9/3/24 21:38, Sagi Grimberg wrote:
>>
>>
>>
>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>> During repetitive namespace remapping operations (ie removing a 
>>> namespace and
>>> provision a different namespace with the same NSID) on the target the
>>> namespace might have changed between the time the initial scan
>>> was performed, and partition scan was invoked by device_add_disk()
>>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> This happens when we have several paths, some of which are inaccessible,
>>> and the active paths are removed first. Then nvme_find_path() will 
>>> requeue
>>> I/O in the ns_head (as paths are present), but the requeue list is never
>>> triggered as all remaining paths are inactive.
>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>> the last path has been removed to properly terminate pending I/O.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c 
>>> b/drivers/nvme/host/multipath.c
>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>>> nvme_ns_head *head)
>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>   {
>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>       case NVME_IOPOLICY_QD:
>>>           return nvme_queue_depth_path(head);
>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>>> nvme_ns_head *head)
>>>   {
>>>       struct nvme_ns *ns;
>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>               continue;
>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>>> nvme_ns_head *head)
>>>   {
>>>       if (!head->disk)
>>>           return;
>>> -    kblockd_schedule_work(&head->requeue_work);
>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>           del_gendisk(head->disk);
>>>       }
>>> +    /*
>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>> +     * to allow multipath to fail all I/O.
>>> +     */
>>> +    kblockd_schedule_work(&head->requeue_work);
>>
>> Not sure how this helps given that you don't wait for srcu to synchronize
>> before you kick the requeue.
>>
> It certainly is helping in my testcase. But having a synchronize_srcu 
> here is probably not a bad idea.
> 
>>>   }
>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>
>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting 
>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you 
>> explain why nvme_find_path is able to find a path given that it is 
>> already cleared NVME_NS_READY ? oris it nvme_available_path that is 
>> missing a check? Maybe can do with checking NVME_NS_READY instead?
> 
> Turned out that the reasoning in the previous revision wasn't quite 
> correct; since then I have seen several test-run where the above stack
> trace was the _only_ one in the system, so the stall in removing 
> namespaces is more a side-effect. The ns_head was still visible
> in sysfs while in that state, with exactly one path left:
> 
> # ls /sys/block
> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5  nvme0n2 
> nvme0n5
> 
> (whereas there had been 6 controllers with 6 namespaces).
> So we fail to trigger a requeue to restart I/O on the stuck scanning
> process; the actual path state really don't matter as never get this far.
> This can happen when the partition scan triggered by device_add_disk() 
> (from one controller) interleaves with nvme_ns_remove() from another 
> controller. Both processes are running lockless wrt to ns_head at that
> time, so if the partition scan issues I/O after the schedule_work
> in nvme_mpath_shutdown_disk():
> 
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
> {
>      if (!head->disk)
>          return;
>      kblockd_schedule_work(&head->requeue_work);
>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>          del_gendisk(head->disk);
>      }
> }
> 
> _and_ that last path happens to be an 'inaccessible' one, I/O will be 
> requeued in the ns_head but never restarted, leaving to a hung process.
> Note, that I/O might also be triggered by userspace (eg udev); the 
> device node is still present at that time. And that's also what I see
> in my test runs; occasionally I get additional stuck udev processes:
> [<0>] __folio_lock+0x114/0x1f0
> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
> [<0>] blkdev_flush_mapping+0x45/0xe0
> [<0>] blkdev_put_whole+0x2e/0x40
> [<0>] bdev_release+0x129/0x1b0
> [<0>] blkdev_release+0xd/0x20
> [<0>] __fput+0xf7/0x2d0
> also waiting on I/O.
> 
> You might be right checking for NS_READY might be sufficient, I'll be
> checking. But we definitely need to requeue I/O after we called 
> del_gendisk().
> 
Turns out that we don't check NVME_NS_READY in all places; we would need 
this patch:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..d8a6f51896fd 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head 
*head)
         list_for_each_entry_rcu(ns, &head->list, siblings) {
                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
                         continue;
+               if (test_bit(NVME_NS_READY, &ns->flags))
+                       continue;
                 switch (nvme_ctrl_state(ns->ctrl)) {
                 case NVME_CTRL_LIVE:
                 case NVME_CTRL_RESETTING:

in addition to moving of kblockd_schedule.

So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-04  8:59       ` Hannes Reinecke
@ 2024-09-05  6:43         ` Nilay Shroff
  2024-09-05  7:06         ` Sagi Grimberg
  1 sibling, 0 replies; 15+ messages in thread
From: Nilay Shroff @ 2024-09-05  6:43 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke,
	Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 9/4/24 14:29, Hannes Reinecke wrote:
> On 9/4/24 10:20, Hannes Reinecke wrote:
>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>> During repetitive namespace remapping operations (ie removing a namespace and
>>>> provision a different namespace with the same NSID) on the target the
>>>> namespace might have changed between the time the initial scan
>>>> was performed, and partition scan was invoked by device_add_disk()
>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>>>
>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>> [<0>] read_part_sector+0x31/0xa0
>>>> [<0>] read_lba+0xc5/0x160
>>>> [<0>] efi_partition+0xd9/0x8f0
>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>> [<0>] device_add_disk+0x402/0x420
>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>
>>>> This happens when we have several paths, some of which are inaccessible,
>>>> and the active paths are removed first. Then nvme_find_path() will requeue
>>>> I/O in the ns_head (as paths are present), but the requeue list is never
>>>> triggered as all remaining paths are inactive.
>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>> the last path has been removed to properly terminate pending I/O.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>> ---
>>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>>   {
>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +        return NULL;
>>>> +
>>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>>       case NVME_IOPOLICY_QD:
>>>>           return nvme_queue_depth_path(head);
>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>>   {
>>>>       struct nvme_ns *ns;
>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +        return NULL;
>>>> +
>>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>>               continue;
>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>>   {
>>>>       if (!head->disk)
>>>>           return;
>>>> -    kblockd_schedule_work(&head->requeue_work);
>>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>           del_gendisk(head->disk);
>>>>       }
>>>> +    /*
>>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>> +     * to allow multipath to fail all I/O.
>>>> +     */
>>>> +    kblockd_schedule_work(&head->requeue_work);
>>>
>>> Not sure how this helps given that you don't wait for srcu to synchronize
>>> before you kick the requeue.
>>>
>> It certainly is helping in my testcase. But having a synchronize_srcu here is probably not a bad idea.
>>
>>>>   }
>>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>
>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting you mentioned that ns_remove is stuck on srcu_synchronize? Can you explain why nvme_find_path is able to find a path given that it is already cleared NVME_NS_READY ? oris it nvme_available_path that is missing a check? Maybe can do with checking NVME_NS_READY instead?
>>
>> Turned out that the reasoning in the previous revision wasn't quite correct; since then I have seen several test-run where the above stack
>> trace was the _only_ one in the system, so the stall in removing namespaces is more a side-effect. The ns_head was still visible
>> in sysfs while in that state, with exactly one path left:
>>
>> # ls /sys/block
>> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5  nvme0n2 nvme0n5
>>
>> (whereas there had been 6 controllers with 6 namespaces).
>> So we fail to trigger a requeue to restart I/O on the stuck scanning
>> process; the actual path state really don't matter as never get this far.
>> This can happen when the partition scan triggered by device_add_disk() (from one controller) interleaves with nvme_ns_remove() from another controller. Both processes are running lockless wrt to ns_head at that
>> time, so if the partition scan issues I/O after the schedule_work
>> in nvme_mpath_shutdown_disk():
>>
>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>> {
>>      if (!head->disk)
>>          return;
>>      kblockd_schedule_work(&head->requeue_work);
>>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>>          del_gendisk(head->disk);
>>      }
>> }
>>
>> _and_ that last path happens to be an 'inaccessible' one, I/O will be requeued in the ns_head but never restarted, leaving to a hung process.
>> Note, that I/O might also be triggered by userspace (eg udev); the device node is still present at that time. And that's also what I see
>> in my test runs; occasionally I get additional stuck udev processes:
>> [<0>] __folio_lock+0x114/0x1f0
>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>> [<0>] blkdev_flush_mapping+0x45/0xe0
>> [<0>] blkdev_put_whole+0x2e/0x40
>> [<0>] bdev_release+0x129/0x1b0
>> [<0>] blkdev_release+0xd/0x20
>> [<0>] __fput+0xf7/0x2d0
>> also waiting on I/O.
>>
>> You might be right checking for NS_READY might be sufficient, I'll be
>> checking. But we definitely need to requeue I/O after we called del_gendisk().
>>
> Turns out that we don't check NVME_NS_READY in all places; we would need this patch:
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index c9d23b1b8efc..d8a6f51896fd 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>                         continue;
> +               if (test_bit(NVME_NS_READY, &ns->flags))
> +                       continue;
>                 switch (nvme_ctrl_state(ns->ctrl)) {
>                 case NVME_CTRL_LIVE:
>                 case NVME_CTRL_RESETTING:
> 
> in addition to moving of kblockd_schedule.
> 
I think you may want to test here, 

if (!test_bit(NVME_NS_READY, &ns->flags))
        continue;

> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?
> 
> Cheers,
> 
> Hannes

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-04  8:59       ` Hannes Reinecke
  2024-09-05  6:43         ` Nilay Shroff
@ 2024-09-05  7:06         ` Sagi Grimberg
  2024-09-05  8:39           ` Hannes Reinecke
  1 sibling, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2024-09-05  7:06 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme




On 04/09/2024 11:59, Hannes Reinecke wrote:
> On 9/4/24 10:20, Hannes Reinecke wrote:
>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>> During repetitive namespace remapping operations (ie removing a 
>>>> namespace and
>>>> provision a different namespace with the same NSID) on the target the
>>>> namespace might have changed between the time the initial scan
>>>> was performed, and partition scan was invoked by device_add_disk()
>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning 
>>>> process:
>>>>
>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>> [<0>] read_part_sector+0x31/0xa0
>>>> [<0>] read_lba+0xc5/0x160
>>>> [<0>] efi_partition+0xd9/0x8f0
>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>> [<0>] device_add_disk+0x402/0x420
>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>
>>>> This happens when we have several paths, some of which are 
>>>> inaccessible,
>>>> and the active paths are removed first. Then nvme_find_path() will 
>>>> requeue
>>>> I/O in the ns_head (as paths are present), but the requeue list is 
>>>> never
>>>> triggered as all remaining paths are inactive.
>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>> the last path has been removed to properly terminate pending I/O.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>> ---
>>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c 
>>>> b/drivers/nvme/host/multipath.c
>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>>>> nvme_ns_head *head)
>>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>>   {
>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +        return NULL;
>>>> +
>>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>>       case NVME_IOPOLICY_QD:
>>>>           return nvme_queue_depth_path(head);
>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>>>> nvme_ns_head *head)
>>>>   {
>>>>       struct nvme_ns *ns;
>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +        return NULL;
>>>> +
>>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>>               continue;
>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>>>> nvme_ns_head *head)
>>>>   {
>>>>       if (!head->disk)
>>>>           return;
>>>> -    kblockd_schedule_work(&head->requeue_work);
>>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>           del_gendisk(head->disk);
>>>>       }
>>>> +    /*
>>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>> +     * to allow multipath to fail all I/O.
>>>> +     */
>>>> +    kblockd_schedule_work(&head->requeue_work);
>>>
>>> Not sure how this helps given that you don't wait for srcu to 
>>> synchronize
>>> before you kick the requeue.
>>>
>> It certainly is helping in my testcase. But having a synchronize_srcu 
>> here is probably not a bad idea.
>>
>>>>   }
>>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>
>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting 
>>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you 
>>> explain why nvme_find_path is able to find a path given that it is 
>>> already cleared NVME_NS_READY ? oris it nvme_available_path that is 
>>> missing a check? Maybe can do with checking NVME_NS_READY instead?
>>
>> Turned out that the reasoning in the previous revision wasn't quite 
>> correct; since then I have seen several test-run where the above stack
>> trace was the _only_ one in the system, so the stall in removing 
>> namespaces is more a side-effect. The ns_head was still visible
>> in sysfs while in that state, with exactly one path left:
>>
>> # ls /sys/block
>> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5 nvme0n2 
>> nvme0n5
>>
>> (whereas there had been 6 controllers with 6 namespaces).
>> So we fail to trigger a requeue to restart I/O on the stuck scanning
>> process; the actual path state really don't matter as never get this 
>> far.
>> This can happen when the partition scan triggered by 
>> device_add_disk() (from one controller) interleaves with 
>> nvme_ns_remove() from another controller. Both processes are running 
>> lockless wrt to ns_head at that
>> time, so if the partition scan issues I/O after the schedule_work
>> in nvme_mpath_shutdown_disk():
>>
>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>> {
>>      if (!head->disk)
>>          return;
>>      kblockd_schedule_work(&head->requeue_work);
>>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>>          del_gendisk(head->disk);
>>      }
>> }
>>
>> _and_ that last path happens to be an 'inaccessible' one, I/O will be 
>> requeued in the ns_head but never restarted, leaving to a hung process.
>> Note, that I/O might also be triggered by userspace (eg udev); the 
>> device node is still present at that time. And that's also what I see
>> in my test runs; occasionally I get additional stuck udev processes:
>> [<0>] __folio_lock+0x114/0x1f0
>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>> [<0>] blkdev_flush_mapping+0x45/0xe0
>> [<0>] blkdev_put_whole+0x2e/0x40
>> [<0>] bdev_release+0x129/0x1b0
>> [<0>] blkdev_release+0xd/0x20
>> [<0>] __fput+0xf7/0x2d0
>> also waiting on I/O.
>>
>> You might be right checking for NS_READY might be sufficient, I'll be
>> checking. But we definitely need to requeue I/O after we called 
>> del_gendisk().
>>
> Turns out that we don't check NVME_NS_READY in all places; we would 
> need this patch:
>
> diff --git a/drivers/nvme/host/multipath.c 
> b/drivers/nvme/host/multipath.c
> index c9d23b1b8efc..d8a6f51896fd 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct 
> nvme_ns_head *head)
>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, 
> &ns->ctrl->flags))
>                         continue;
> +               if (test_bit(NVME_NS_READY, &ns->flags))
> +                       continue;
>                 switch (nvme_ctrl_state(ns->ctrl)) {
>                 case NVME_CTRL_LIVE:
>                 case NVME_CTRL_RESETTING:
>
> in addition to moving of kblockd_schedule.
>
> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?

Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but also
in nvme_mpath_revalidate_paths() if the capacity changed, so theoretically,
if the last remaining path changed its capacity, it will make 
nvme_available_path()
fail, and in turn, mpath IO will fail and not added to the requeue_list, 
which would be
wrong. So I think that nvme_available_path should check 
NVME_NSHEAD_DISK_LIVE,
but nvme_find_path is fine in its current form.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-05  7:06         ` Sagi Grimberg
@ 2024-09-05  8:39           ` Hannes Reinecke
  2024-09-05  9:06             ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-05  8:39 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/5/24 09:06, Sagi Grimberg wrote:
> 
> 
> 
> On 04/09/2024 11:59, Hannes Reinecke wrote:
>> On 9/4/24 10:20, Hannes Reinecke wrote:
>>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>>
>>>>
>>>>
>>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>>> During repetitive namespace remapping operations (ie removing a 
>>>>> namespace and
>>>>> provision a different namespace with the same NSID) on the target the
>>>>> namespace might have changed between the time the initial scan
>>>>> was performed, and partition scan was invoked by device_add_disk()
>>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning 
>>>>> process:
>>>>>
>>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>>> [<0>] read_part_sector+0x31/0xa0
>>>>> [<0>] read_lba+0xc5/0x160
>>>>> [<0>] efi_partition+0xd9/0x8f0
>>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>>> [<0>] device_add_disk+0x402/0x420
>>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>>
>>>>> This happens when we have several paths, some of which are 
>>>>> inaccessible,
>>>>> and the active paths are removed first. Then nvme_find_path() will 
>>>>> requeue
>>>>> I/O in the ns_head (as paths are present), but the requeue list is 
>>>>> never
>>>>> triggered as all remaining paths are inactive.
>>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>>> the last path has been removed to properly terminate pending I/O.
>>>>>
>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>>> ---
>>>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/multipath.c 
>>>>> b/drivers/nvme/host/multipath.c
>>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>>> --- a/drivers/nvme/host/multipath.c
>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>>>>> nvme_ns_head *head)
>>>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>>>   {
>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>> +        return NULL;
>>>>> +
>>>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>>>       case NVME_IOPOLICY_QD:
>>>>>           return nvme_queue_depth_path(head);
>>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>>>>> nvme_ns_head *head)
>>>>>   {
>>>>>       struct nvme_ns *ns;
>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>> +        return NULL;
>>>>> +
>>>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>>>               continue;
>>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>>>>> nvme_ns_head *head)
>>>>>   {
>>>>>       if (!head->disk)
>>>>>           return;
>>>>> -    kblockd_schedule_work(&head->requeue_work);
>>>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>>           del_gendisk(head->disk);
>>>>>       }
>>>>> +    /*
>>>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>>> +     * to allow multipath to fail all I/O.
>>>>> +     */
>>>>> +    kblockd_schedule_work(&head->requeue_work);
>>>>
>>>> Not sure how this helps given that you don't wait for srcu to 
>>>> synchronize
>>>> before you kick the requeue.
>>>>
>>> It certainly is helping in my testcase. But having a synchronize_srcu 
>>> here is probably not a bad idea.
>>>
>>>>>   }
>>>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>>
>>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting 
>>>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you 
>>>> explain why nvme_find_path is able to find a path given that it is 
>>>> already cleared NVME_NS_READY ? oris it nvme_available_path that is 
>>>> missing a check? Maybe can do with checking NVME_NS_READY instead?
>>>
>>> Turned out that the reasoning in the previous revision wasn't quite 
>>> correct; since then I have seen several test-run where the above stack
>>> trace was the _only_ one in the system, so the stall in removing 
>>> namespaces is more a side-effect. The ns_head was still visible
>>> in sysfs while in that state, with exactly one path left:
>>>
>>> # ls /sys/block
>>> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5 nvme0n2 
>>> nvme0n5
>>>
>>> (whereas there had been 6 controllers with 6 namespaces).
>>> So we fail to trigger a requeue to restart I/O on the stuck scanning
>>> process; the actual path state really don't matter as never get this 
>>> far.
>>> This can happen when the partition scan triggered by 
>>> device_add_disk() (from one controller) interleaves with 
>>> nvme_ns_remove() from another controller. Both processes are running 
>>> lockless wrt to ns_head at that
>>> time, so if the partition scan issues I/O after the schedule_work
>>> in nvme_mpath_shutdown_disk():
>>>
>>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>> {
>>>      if (!head->disk)
>>>          return;
>>>      kblockd_schedule_work(&head->requeue_work);
>>>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>          del_gendisk(head->disk);
>>>      }
>>> }
>>>
>>> _and_ that last path happens to be an 'inaccessible' one, I/O will be 
>>> requeued in the ns_head but never restarted, leaving to a hung process.
>>> Note, that I/O might also be triggered by userspace (eg udev); the 
>>> device node is still present at that time. And that's also what I see
>>> in my test runs; occasionally I get additional stuck udev processes:
>>> [<0>] __folio_lock+0x114/0x1f0
>>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>>> [<0>] blkdev_flush_mapping+0x45/0xe0
>>> [<0>] blkdev_put_whole+0x2e/0x40
>>> [<0>] bdev_release+0x129/0x1b0
>>> [<0>] blkdev_release+0xd/0x20
>>> [<0>] __fput+0xf7/0x2d0
>>> also waiting on I/O.
>>>
>>> You might be right checking for NS_READY might be sufficient, I'll be
>>> checking. But we definitely need to requeue I/O after we called 
>>> del_gendisk().
>>>
>> Turns out that we don't check NVME_NS_READY in all places; we would 
>> need this patch:
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index c9d23b1b8efc..d8a6f51896fd 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct 
>> nvme_ns_head *head)
>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, 
>> &ns->ctrl->flags))
>>                         continue;
>> +               if (test_bit(NVME_NS_READY, &ns->flags))
>> +                       continue;
>>                 switch (nvme_ctrl_state(ns->ctrl)) {
>>                 case NVME_CTRL_LIVE:
>>                 case NVME_CTRL_RESETTING:
>>
>> in addition to moving of kblockd_schedule.
>>
>> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?
> 
> Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but also
> in nvme_mpath_revalidate_paths() if the capacity changed, so theoretically,
> if the last remaining path changed its capacity, it will make 
> nvme_available_path()
> fail, and in turn, mpath IO will fail and not added to the requeue_list, 
> which would be
> wrong. So I think that nvme_available_path should check 
> NVME_NSHEAD_DISK_LIVE,
> but nvme_find_path is fine in its current form.

Weelll ... not quite.
I never get this far, as the only place where nvme_remove_ns() can be
called is from the scanning process, yet this is stuck waiting for I/O.

My testcase then simulated a namespace replacement by setting a new 
UUID, and then enabling the namespace again. The target will return
PATH ERROR during that time, causing I/O to be retried (while disabled)
or failed over to the same path (as it's the only one left).

Of course we could argue that the testcase is invalid, and the target
should return INVALID_NS if the UUID changed.
But one could also argue that retrying on the same path is pointless,
seeing that we'll always get the same error.
Hmm?

Guess it'll make a nice topic for ALPSS ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-05  8:39           ` Hannes Reinecke
@ 2024-09-05  9:06             ` Sagi Grimberg
  2024-09-05  9:23               ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2024-09-05  9:06 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme




On 05/09/2024 11:39, Hannes Reinecke wrote:
> On 9/5/24 09:06, Sagi Grimberg wrote:
>>
>>
>>
>> On 04/09/2024 11:59, Hannes Reinecke wrote:
>>> On 9/4/24 10:20, Hannes Reinecke wrote:
>>>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>>>> During repetitive namespace remapping operations (ie removing a 
>>>>>> namespace and
>>>>>> provision a different namespace with the same NSID) on the target 
>>>>>> the
>>>>>> namespace might have changed between the time the initial scan
>>>>>> was performed, and partition scan was invoked by device_add_disk()
>>>>>> in nvme_mpath_set_live(). We then end up with a stuck scanning 
>>>>>> process:
>>>>>>
>>>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>>>> [<0>] read_part_sector+0x31/0xa0
>>>>>> [<0>] read_lba+0xc5/0x160
>>>>>> [<0>] efi_partition+0xd9/0x8f0
>>>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>>>> [<0>] device_add_disk+0x402/0x420
>>>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>>>
>>>>>> This happens when we have several paths, some of which are 
>>>>>> inaccessible,
>>>>>> and the active paths are removed first. Then nvme_find_path() 
>>>>>> will requeue
>>>>>> I/O in the ns_head (as paths are present), but the requeue list 
>>>>>> is never
>>>>>> triggered as all remaining paths are inactive.
>>>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>>>> the last path has been removed to properly terminate pending I/O.
>>>>>>
>>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>>>> ---
>>>>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/multipath.c 
>>>>>> b/drivers/nvme/host/multipath.c
>>>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>>>> --- a/drivers/nvme/host/multipath.c
>>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>>>>>> nvme_ns_head *head)
>>>>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>>>>   {
>>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>>> +        return NULL;
>>>>>> +
>>>>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>>>>       case NVME_IOPOLICY_QD:
>>>>>>           return nvme_queue_depth_path(head);
>>>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>>>>>> nvme_ns_head *head)
>>>>>>   {
>>>>>>       struct nvme_ns *ns;
>>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>>> +        return NULL;
>>>>>> +
>>>>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, 
>>>>>> &ns->ctrl->flags))
>>>>>>               continue;
>>>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>>>>>> nvme_ns_head *head)
>>>>>>   {
>>>>>>       if (!head->disk)
>>>>>>           return;
>>>>>> -    kblockd_schedule_work(&head->requeue_work);
>>>>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>>>           del_gendisk(head->disk);
>>>>>>       }
>>>>>> +    /*
>>>>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>>>> +     * to allow multipath to fail all I/O.
>>>>>> +     */
>>>>>> +    kblockd_schedule_work(&head->requeue_work);
>>>>>
>>>>> Not sure how this helps given that you don't wait for srcu to 
>>>>> synchronize
>>>>> before you kick the requeue.
>>>>>
>>>> It certainly is helping in my testcase. But having a 
>>>> synchronize_srcu here is probably not a bad idea.
>>>>
>>>>>>   }
>>>>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>>>
>>>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last 
>>>>> posting you mentioned that ns_remove is stuck on srcu_synchronize? 
>>>>> Can you explain why nvme_find_path is able to find a path given 
>>>>> that it is already cleared NVME_NS_READY ? oris it 
>>>>> nvme_available_path that is missing a check? Maybe can do with 
>>>>> checking NVME_NS_READY instead?
>>>>
>>>> Turned out that the reasoning in the previous revision wasn't quite 
>>>> correct; since then I have seen several test-run where the above stack
>>>> trace was the _only_ one in the system, so the stall in removing 
>>>> namespaces is more a side-effect. The ns_head was still visible
>>>> in sysfs while in that state, with exactly one path left:
>>>>
>>>> # ls /sys/block
>>>> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5 
>>>> nvme0n2 nvme0n5
>>>>
>>>> (whereas there had been 6 controllers with 6 namespaces).
>>>> So we fail to trigger a requeue to restart I/O on the stuck scanning
>>>> process; the actual path state really don't matter as never get 
>>>> this far.
>>>> This can happen when the partition scan triggered by 
>>>> device_add_disk() (from one controller) interleaves with 
>>>> nvme_ns_remove() from another controller. Both processes are 
>>>> running lockless wrt to ns_head at that
>>>> time, so if the partition scan issues I/O after the schedule_work
>>>> in nvme_mpath_shutdown_disk():
>>>>
>>>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>> {
>>>>      if (!head->disk)
>>>>          return;
>>>>      kblockd_schedule_work(&head->requeue_work);
>>>>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>          del_gendisk(head->disk);
>>>>      }
>>>> }
>>>>
>>>> _and_ that last path happens to be an 'inaccessible' one, I/O will 
>>>> be requeued in the ns_head but never restarted, leaving to a hung 
>>>> process.
>>>> Note, that I/O might also be triggered by userspace (eg udev); the 
>>>> device node is still present at that time. And that's also what I see
>>>> in my test runs; occasionally I get additional stuck udev processes:
>>>> [<0>] __folio_lock+0x114/0x1f0
>>>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>>>> [<0>] blkdev_flush_mapping+0x45/0xe0
>>>> [<0>] blkdev_put_whole+0x2e/0x40
>>>> [<0>] bdev_release+0x129/0x1b0
>>>> [<0>] blkdev_release+0xd/0x20
>>>> [<0>] __fput+0xf7/0x2d0
>>>> also waiting on I/O.
>>>>
>>>> You might be right checking for NS_READY might be sufficient, I'll be
>>>> checking. But we definitely need to requeue I/O after we called 
>>>> del_gendisk().
>>>>
>>> Turns out that we don't check NVME_NS_READY in all places; we would 
>>> need this patch:
>>>
>>> diff --git a/drivers/nvme/host/multipath.c 
>>> b/drivers/nvme/host/multipath.c
>>> index c9d23b1b8efc..d8a6f51896fd 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct 
>>> nvme_ns_head *head)
>>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, 
>>> &ns->ctrl->flags))
>>>                         continue;
>>> +               if (test_bit(NVME_NS_READY, &ns->flags))
>>> +                       continue;
>>>                 switch (nvme_ctrl_state(ns->ctrl)) {
>>>                 case NVME_CTRL_LIVE:
>>>                 case NVME_CTRL_RESETTING:
>>>
>>> in addition to moving of kblockd_schedule.
>>>
>>> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?
>>
>> Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but 
>> also
>> in nvme_mpath_revalidate_paths() if the capacity changed, so 
>> theoretically,
>> if the last remaining path changed its capacity, it will make 
>> nvme_available_path()
>> fail, and in turn, mpath IO will fail and not added to the 
>> requeue_list, which would be
>> wrong. So I think that nvme_available_path should check 
>> NVME_NSHEAD_DISK_LIVE,
>> but nvme_find_path is fine in its current form.
>
> Weelll ... not quite.
> I never get this far, as the only place where nvme_remove_ns() can be
> called is from the scanning process, yet this is stuck waiting for I/O.
>
> My testcase then simulated a namespace replacement by setting a new 
> UUID, and then enabling the namespace again. The target will return
> PATH ERROR during that time, causing I/O to be retried (while disabled)
> or failed over to the same path (as it's the only one left).
>
> Of course we could argue that the testcase is invalid, and the target
> should return INVALID_NS if the UUID changed.
> But one could also argue that retrying on the same path is pointless,
> seeing that we'll always get the same error.
> Hmm?

Hannes, I was referring to the general case where we have more than one 
path.

What I said is that I think that if you drop the NSHEAD_DISK_LIVE check 
in nvme_find_path
(it already checks for NS_READY) but have nvme_available_path check for 
NSHEAD_DISK_LIVE
and both your test case should work and the general case where there is 
more than a single
path.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-05  9:06             ` Sagi Grimberg
@ 2024-09-05  9:23               ` Hannes Reinecke
  0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-05  9:23 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/5/24 11:06, Sagi Grimberg wrote:
> 
> 
> 
> On 05/09/2024 11:39, Hannes Reinecke wrote:
>> On 9/5/24 09:06, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 04/09/2024 11:59, Hannes Reinecke wrote:
>>>> On 9/4/24 10:20, Hannes Reinecke wrote:
[ .. ]
>>>> in addition to moving of kblockd_schedule.
>>>>
>>>> So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?
>>>
>>> Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but 
>>> also
>>> in nvme_mpath_revalidate_paths() if the capacity changed, so 
>>> theoretically,
>>> if the last remaining path changed its capacity, it will make 
>>> nvme_available_path()
>>> fail, and in turn, mpath IO will fail and not added to the 
>>> requeue_list, which would be
>>> wrong. So I think that nvme_available_path should check 
>>> NVME_NSHEAD_DISK_LIVE,
>>> but nvme_find_path is fine in its current form.
>>
>> Weelll ... not quite.
>> I never get this far, as the only place where nvme_remove_ns() can be
>> called is from the scanning process, yet this is stuck waiting for I/O.
>>
>> My testcase then simulated a namespace replacement by setting a new 
>> UUID, and then enabling the namespace again. The target will return
>> PATH ERROR during that time, causing I/O to be retried (while disabled)
>> or failed over to the same path (as it's the only one left).
>>
>> Of course we could argue that the testcase is invalid, and the target
>> should return INVALID_NS if the UUID changed.
>> But one could also argue that retrying on the same path is pointless,
>> seeing that we'll always get the same error.
>> Hmm?
> 
> Hannes, I was referring to the general case where we have more than one 
> path.
> 
> What I said is that I think that if you drop the NSHEAD_DISK_LIVE check 
> in nvme_find_path
> (it already checks for NS_READY) but have nvme_available_path check for 
> NSHEAD_DISK_LIVE
> and both your test case should work and the general case where there is 
> more than a single
> path.

Ah, _that's_ what you mean. Sure, sounds feasible. I'll give it a spin.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
  2024-09-03 19:38   ` Sagi Grimberg
@ 2024-09-04 14:52   ` Nilay Shroff
  2024-09-04 15:55     ` Hannes Reinecke
  1 sibling, 1 reply; 15+ messages in thread
From: Nilay Shroff @ 2024-09-04 14:52 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme



On 9/3/24 23:33, Hannes Reinecke wrote:
> During repetitive namespace remapping operations (ie removing a namespace and
> provision a different namespace with the same NSID) on the target the
> namespace might have changed between the time the initial scan
> was performed, and partition scan was invoked by device_add_disk()
> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
> 
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
> 
> This happens when we have several paths, some of which are inaccessible,
> and the active paths are removed first. Then nvme_find_path() will requeue
> I/O in the ns_head (as paths are present), but the requeue list is never
> triggered as all remaining paths are inactive.
> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
> the last path has been removed to properly terminate pending I/O.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/multipath.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index c9d23b1b8efc..1b1deb0450ab 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>  
>  inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>  {
> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> +		return NULL;
> +
>  	switch (READ_ONCE(head->subsys->iopolicy)) {
>  	case NVME_IOPOLICY_QD:
>  		return nvme_queue_depth_path(head);
> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>  {
>  	struct nvme_ns *ns;
>  
> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> +		return NULL;
> +
>  	list_for_each_entry_rcu(ns, &head->list, siblings) {
>  		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>  			continue;
> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>  {
>  	if (!head->disk)
>  		return;
> -	kblockd_schedule_work(&head->requeue_work);
> -	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> +	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>  		nvme_cdev_del(&head->cdev, &head->cdev_device);
>  		del_gendisk(head->disk);
>  	}
> +	/*
> +	 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> +	 * to allow multipath to fail all I/O.
> +	 */
> +	kblockd_schedule_work(&head->requeue_work);
>  }
Not sure, in your case, whether nvme_mpath_remove_disk() is called or not? 
The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list. 

>  
>  void nvme_mpath_remove_disk(struct nvme_ns_head *head)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-04 14:52   ` Nilay Shroff
@ 2024-09-04 15:55     ` Hannes Reinecke
  2024-09-05  6:41       ` Nilay Shroff
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-09-04 15:55 UTC (permalink / raw)
  To: Nilay Shroff, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme

On 9/4/24 16:52, Nilay Shroff wrote:
> 
> 
> On 9/3/24 23:33, Hannes Reinecke wrote:
>> During repetitive namespace remapping operations (ie removing a namespace and
>> provision a different namespace with the same NSID) on the target the
>> namespace might have changed between the time the initial scan
>> was performed, and partition scan was invoked by device_add_disk()
>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>
>> [<0>] folio_wait_bit_common+0x12a/0x310
>> [<0>] filemap_read_folio+0x97/0xd0
>> [<0>] do_read_cache_folio+0x108/0x390
>> [<0>] read_part_sector+0x31/0xa0
>> [<0>] read_lba+0xc5/0x160
>> [<0>] efi_partition+0xd9/0x8f0
>> [<0>] bdev_disk_changed+0x23d/0x6d0
>> [<0>] blkdev_get_whole+0x78/0xc0
>> [<0>] bdev_open+0x2c6/0x3b0
>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>> [<0>] disk_scan_partitions+0x5d/0x100
>> [<0>] device_add_disk+0x402/0x420
>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>
>> This happens when we have several paths, some of which are inaccessible,
>> and the active paths are removed first. Then nvme_find_path() will requeue
>> I/O in the ns_head (as paths are present), but the requeue list is never
>> triggered as all remaining paths are inactive.
>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>> the last path has been removed to properly terminate pending I/O.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index c9d23b1b8efc..1b1deb0450ab 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>   
>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>   {
>> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>> +		return NULL;
>> +
>>   	switch (READ_ONCE(head->subsys->iopolicy)) {
>>   	case NVME_IOPOLICY_QD:
>>   		return nvme_queue_depth_path(head);
>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>   {
>>   	struct nvme_ns *ns;
>>   
>> +	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>> +		return NULL;
>> +
>>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>   		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>   			continue;
>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>   {
>>   	if (!head->disk)
>>   		return;
>> -	kblockd_schedule_work(&head->requeue_work);
>> -	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>> +	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>   		nvme_cdev_del(&head->cdev, &head->cdev_device);
>>   		del_gendisk(head->disk);
>>   	}
>> +	/*
>> +	 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>> +	 * to allow multipath to fail all I/O.
>> +	 */
>> +	kblockd_schedule_work(&head->requeue_work);
>>   }
> Not sure, in your case, whether nvme_mpath_remove_disk() is called or not?
> The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list.
> 
It would, but we're never get this far as the reference to ns_head it 
held by the above stack trace (and nvme_mpath_remove_disk()) is called
when the last reference to ns_head drops.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
  2024-09-04 15:55     ` Hannes Reinecke
@ 2024-09-05  6:41       ` Nilay Shroff
  0 siblings, 0 replies; 15+ messages in thread
From: Nilay Shroff @ 2024-09-05  6:41 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme



On 9/4/24 21:25, Hannes Reinecke wrote:
> On 9/4/24 16:52, Nilay Shroff wrote:
>>
>>
>> On 9/3/24 23:33, Hannes Reinecke wrote:
>>> During repetitive namespace remapping operations (ie removing a namespace and
>>> provision a different namespace with the same NSID) on the target the
>>> namespace might have changed between the time the initial scan
>>> was performed, and partition scan was invoked by device_add_disk()
>>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> This happens when we have several paths, some of which are inaccessible,
>>> and the active paths are removed first. Then nvme_find_path() will requeue
>>> I/O in the ns_head (as paths are present), but the requeue list is never
>>> triggered as all remaining paths are inactive.
>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>> the last path has been removed to properly terminate pending I/O.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>     inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>   {
>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>       case NVME_IOPOLICY_QD:
>>>           return nvme_queue_depth_path(head);
>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>   {
>>>       struct nvme_ns *ns;
>>>   +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>               continue;
>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>   {
>>>       if (!head->disk)
>>>           return;
>>> -    kblockd_schedule_work(&head->requeue_work);
>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>           del_gendisk(head->disk);
>>>       }
>>> +    /*
>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>> +     * to allow multipath to fail all I/O.
>>> +     */
>>> +    kblockd_schedule_work(&head->requeue_work);
>>>   }
>> Not sure, in your case, whether nvme_mpath_remove_disk() is called or not?
>> The nvme_mpath_remove_disk() shall trigger the ns_head reqiest list.
>>
> It would, but we're never get this far as the reference to ns_head it held by the above stack trace (and nvme_mpath_remove_disk()) is called
> when the last reference to ns_head drops.
OK, I see. In that case, I believe checking NVME_NS_READY in nvme_avalibale_path() should 
be a  good idea. 
> 
> Cheers,
> 
> Hannes

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-09-05  9:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03 18:03 [PATCHv2 0/2] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-03 18:03 ` [PATCH 1/2] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-04  4:09   ` Christoph Hellwig
2024-09-03 18:03 ` [PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces Hannes Reinecke
2024-09-03 19:38   ` Sagi Grimberg
2024-09-04  8:20     ` Hannes Reinecke
2024-09-04  8:59       ` Hannes Reinecke
2024-09-05  6:43         ` Nilay Shroff
2024-09-05  7:06         ` Sagi Grimberg
2024-09-05  8:39           ` Hannes Reinecke
2024-09-05  9:06             ` Sagi Grimberg
2024-09-05  9:23               ` Hannes Reinecke
2024-09-04 14:52   ` Nilay Shroff
2024-09-04 15:55     ` Hannes Reinecke
2024-09-05  6:41       ` Nilay Shroff

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.