* [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-09 7:19 Hannes Reinecke
2024-09-09 7:19 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 7:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the
new namespaces.
To throw in more fun these namespaces are created on different nodes
in the cluster, where only the paths local to the cluster node are
active, and all other paths are inaccessible.
Essentially it's doing something like:
echo 0 > ${ns}/enable
rm ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable
repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()' waiting for I/O which will never
come.
There are two parts to fixing this:
We need to ensure the NSHEAD_DISK_LIVE is properly set when the
ns_head is live, and unset when the last path is gone.
And we need to trigger the requeue list after NSHEAD_DISK_LIVE
has been cleared to flush all outstanding I/O.
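To illustrate why the ordering matters, here is a minimal userspace sketch in C. It is not the driver code: all names (model_head, model_kick_requeue and so on) are hypothetical, and the model simply assumes that a bio is parked again whenever the head is still marked live and only inactive paths remain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of an ns_head; not the kernel structure. */
struct model_head {
	bool disk_live;       /* stands in for NSHEAD_DISK_LIVE */
	bool inactive_paths;  /* paths exist, but none can carry I/O */
	size_t requeued;      /* bios parked on the requeue list */
	size_t failed;        /* bios completed with an error */
};

/* One pass over the requeue list; returns the number of bios still parked. */
static size_t model_kick_requeue(struct model_head *head)
{
	size_t pending = head->requeued;

	head->requeued = 0;
	for (size_t i = 0; i < pending; i++) {
		if (head->disk_live && head->inactive_paths)
			head->requeued++;  /* parked again, waiting for a path */
		else
			head->failed++;    /* nothing to wait for: fail the bio */
	}
	return head->requeued;
}

/* Kick the requeue list first, clear the live flag afterwards. */
static size_t model_shutdown_kick_first(struct model_head *head)
{
	size_t left = model_kick_requeue(head);

	head->disk_live = false;
	return left;
}

/* Clear the live flag first, then kick the requeue list. */
static size_t model_shutdown_clear_first(struct model_head *head)
{
	head->disk_live = false;
	return model_kick_requeue(head);
}
```

In this model, kicking first leaves every bio parked with nobody left to kick the list again, while clearing first lets the same pass fail them all.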
Turns out there's another corner case; when running the same test
but not removing the namespaces while changing the UUID we end up
with I/Os constantly being retried, and we are unable to even
disconnect the controller. To fix this we should disable retries
for failed commands when the controller state is 'DELETING' as there
really is no point in trying to retry the command when the controller
is shut down.
With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.
As usual, comments and reviews are welcome.
Changes to v4:
- Disabled command retries when the controller is removed instead of
(ab-)using the failfast flag
Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes
Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers
Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared
Hannes Reinecke (3):
nvme-multipath: fixup typo when clearing DISK_LIVE
nvme-multipath: avoid hang on inaccessible namespaces
nvme: 'nvme disconnect' hangs after remapping namespaces
drivers/nvme/host/core.c | 7 ++++++-
drivers/nvme/host/multipath.c | 14 +++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)
--
2.35.3
* [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE
2024-09-09 7:19 [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
@ 2024-09-09 7:19 ` Hannes Reinecke
2024-09-09 7:19 ` [PATCH 2/3] nvme-multipath: avoid hang on inaccessible namespaces Hannes Reinecke
2024-09-09 7:19 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
2 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 7:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
'struct nvme_ns'.
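
A small userspace sketch of why the wrong flags word matters. The helper names and bit numbers below are hypothetical stand-ins, and the helpers are plain, non-atomic versions of the kernel bitops; the point is only that a bit number has meaning solely relative to the word it is applied to.

```c
#include <stdbool.h>

/* Hypothetical bit numbers, chosen so that both words use bit 0. */
enum { MODEL_NSHEAD_DISK_LIVE = 0 };  /* bit in the head's flags word */
enum { MODEL_NS_REMOVING = 0 };       /* bit in the per-path flags word */

struct model_head { unsigned long flags; };
struct model_ns { unsigned long flags; struct model_head *head; };

/* Non-atomic stand-ins for set_bit(), clear_bit() and test_bit(). */
static void model_set_bit(int nr, unsigned long *word)
{
	*word |= 1UL << nr;
}

static void model_clear_bit(int nr, unsigned long *word)
{
	*word &= ~(1UL << nr);
}

static bool model_test_bit(int nr, const unsigned long *word)
{
	return (*word >> nr) & 1UL;
}

/* Error path before the fix: the bit is cleared in the path's flags. */
static void model_add_disk_failed_buggy(struct model_ns *ns)
{
	model_clear_bit(MODEL_NSHEAD_DISK_LIVE, &ns->flags);
}

/* Error path after the fix: the bit is cleared in the head's flags. */
static void model_add_disk_failed_fixed(struct model_ns *ns)
{
	model_clear_bit(MODEL_NSHEAD_DISK_LIVE, &ns->head->flags);
}
```

With the buggy variant the head stays marked live and an unrelated per-path bit is cleared instead; the fixed variant touches only the head.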
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/multipath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 91d9eb3c22ef..c9d23b1b8efc 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
rc = device_add_disk(&head->subsys->dev, head->disk,
nvme_ns_attr_groups);
if (rc) {
- clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
+ clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
return;
}
nvme_add_ns_head_cdev(head);
--
2.35.3
* [PATCH 2/3] nvme-multipath: avoid hang on inaccessible namespaces
2024-09-09 7:19 [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-09 7:19 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-09 7:19 ` Hannes Reinecke
2024-09-09 7:19 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
2 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 7:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.
This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed, to properly terminate pending I/O.
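
The effect of the new check can be sketched in userspace C as follows. This is a simplified model, not the driver code: the types and function names are hypothetical, and the three-way submit/requeue/fail decision is an assumption about how the submission path combines nvme_find_path() and nvme_available_path().

```c
#include <stdbool.h>
#include <stddef.h>

enum model_outcome { MODEL_SUBMIT, MODEL_REQUEUE, MODEL_FAIL };

/* Hypothetical model of one path to a namespace. */
struct model_path {
	bool usable;            /* I/O can be sent on this path right now */
	bool ctrl_may_recover;  /* controller could serve I/O again later */
};

/* Hypothetical model of the shared ns_head. */
struct model_head {
	bool disk_live;         /* stands in for NVME_NSHEAD_DISK_LIVE */
	const struct model_path *paths;
	size_t npaths;
};

/* True if some path can carry I/O right now. */
static bool model_find_path(const struct model_head *head)
{
	for (size_t i = 0; i < head->npaths; i++)
		if (head->paths[i].usable)
			return true;
	return false;
}

/* True if waiting for a path is worthwhile. */
static bool model_available_path(const struct model_head *head,
				 bool check_live)
{
	if (check_live && !head->disk_live)
		return false;
	for (size_t i = 0; i < head->npaths; i++)
		if (head->paths[i].ctrl_may_recover)
			return true;
	return false;
}

/* Decide what happens to a bio submitted to the head. */
static enum model_outcome model_submit(const struct model_head *head,
					bool check_live)
{
	if (model_find_path(head))
		return MODEL_SUBMIT;
	if (model_available_path(head, check_live))
		return MODEL_REQUEUE;
	return MODEL_FAIL;
}
```

With only an inaccessible path left and the live flag cleared, the model requeues the bio when the check is absent and fails it when the check is present.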
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/multipath.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..f72c5a6a2d8e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -421,6 +421,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ return false;
+
list_for_each_entry_rcu(ns, &head->list, siblings) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
continue;
@@ -967,11 +970,16 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
- kblockd_schedule_work(&head->requeue_work);
- if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
nvme_cdev_del(&head->cdev, &head->cdev_device);
del_gendisk(head->disk);
}
+ /*
+ * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+ * to allow multipath to fail all I/O.
+ */
+ synchronize_srcu(&head->srcu);
+ kblockd_schedule_work(&head->requeue_work);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
--
2.35.3
* [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-09 7:19 [PATCHv5 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-09 7:19 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-09 7:19 ` [PATCH 2/3] nvme-multipath: avoid hang on inaccessible namespaces Hannes Reinecke
@ 2024-09-09 7:19 ` Hannes Reinecke
2024-09-10 7:57 ` Sagi Grimberg
2 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 7:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace map and unmap operations on the target
(disabling the namespace, changing the UUID, enabling it again)
the initial scan will hang as the target will be returning
PATH_ERROR and the I/O is constantly retried:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
Calling 'nvme disconnect' on controllers with these namespaces
will hang as the disconnect operation tries to flush scan_work:
[<0>] __flush_work+0x389/0x4b0
[<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
[<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
[<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
[<0>] kernfs_fop_write_iter+0x13d/0x1b0
[<0>] vfs_write+0x404/0x510
before the namespaces are removed, and the controller state
DELETING_NOIO (which would abort any pending I/O) is set only
afterwards.
This patch calls 'nvme_kick_requeue_lists()' when entering
DELETING state for a controller to ensure all pending I/O
is flushed, and also disables failover for any commands which
are completed with an error afterwards, breaking the infinite
retry loop.
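
As a rough userspace illustration of the disposition change (a simplified sketch, not the driver function: the names are hypothetical and several checks of the real function, such as authentication and retry limits, are left out):

```c
#include <stdbool.h>

enum model_disposition { MODEL_COMPLETE, MODEL_RETRY, MODEL_FAILOVER };
enum model_ctrl_state { MODEL_CTRL_LIVE, MODEL_CTRL_DELETING };

/* Hypothetical model of a completed request. */
struct model_req {
	bool status_ok;   /* command completed without error */
	bool mpath;       /* request was issued through the multipath head */
	bool path_error;  /* status is a path-related error */
	enum model_ctrl_state ctrl_state;
};

/* Simplified disposition decision with the DELETING check added. */
static enum model_disposition model_decide(const struct model_req *req)
{
	if (req->status_ok)
		return MODEL_COMPLETE;

	if (req->mpath) {
		/* New: never fail over once the controller is being deleted. */
		if (req->ctrl_state == MODEL_CTRL_DELETING)
			return MODEL_COMPLETE;
		if (req->path_error)
			return MODEL_FAILOVER;
	}
	return MODEL_RETRY;
}
```

In this model a path error on a live controller still fails over, while the same error on a controller in DELETING state completes the request and so ends the retry loop.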
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 651073280f6f..142babce1963 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -381,6 +381,8 @@ enum nvme_disposition {
static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
{
+ struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
+
if (likely(nvme_req(req)->status == 0))
return COMPLETE;
@@ -393,6 +395,8 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
return AUTHENTICATE;
if (req->cmd_flags & REQ_NVME_MPATH) {
+ if (nvme_ctrl_state(ctrl) == NVME_CTRL_DELETING)
+ return COMPLETE;
if (nvme_is_path_error(nvme_req(req)->status) ||
blk_queue_dying(req->q))
return FAILOVER;
@@ -629,7 +633,8 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
} else if (new_state == NVME_CTRL_CONNECTING &&
old_state == NVME_CTRL_RESETTING) {
nvme_start_failfast_work(ctrl);
- }
+ } else if (new_state == NVME_CTRL_DELETING)
+ nvme_kick_requeue_lists(ctrl);
return changed;
}
EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
--
2.35.3
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-09 7:19 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
@ 2024-09-10 7:57 ` Sagi Grimberg
2024-09-10 8:23 ` Hannes Reinecke
0 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2024-09-10 7:57 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 09/09/2024 10:19, Hannes Reinecke wrote:
> During repetitive namespace map and unmap operations on the target
> (disabling the namespace, changing the UUID, enabling it again)
> the initial scan will hang as the target will be returning
> PATH_ERROR and the I/O is constantly retried:
>
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>
> Calling 'nvme disconnect' on controllers with these namespaces
> will hang as the disconnect operation tries to flush scan_work:
>
> [<0>] __flush_work+0x389/0x4b0
> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
> [<0>] vfs_write+0x404/0x510
>
> before the namespaces are removed, and the controller state
> DELETING_NOIO (which would abort any pending I/O) is set only
> afterwards.
>
> This patch calls 'nvme_kick_requeue_lists()' when entering
> DELETING state for a controller to ensure all pending I/O
> is flushed, and also disables failover for any commands which
> are completed with an error afterwards, breaking the infinite
> retry loop.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/core.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 651073280f6f..142babce1963 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -381,6 +381,8 @@ enum nvme_disposition {
>
> static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
> {
> + struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
> +
> if (likely(nvme_req(req)->status == 0))
> return COMPLETE;
>
> @@ -393,6 +395,8 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
> return AUTHENTICATE;
>
> if (req->cmd_flags & REQ_NVME_MPATH) {
> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_DELETING)
> + return COMPLETE;
This looks wrong. What if I disconnect from one path and there are other
eligible paths?
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-10 7:57 ` Sagi Grimberg
@ 2024-09-10 8:23 ` Hannes Reinecke
0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-10 8:23 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/10/24 09:57, Sagi Grimberg wrote:
>
>
>
> On 09/09/2024 10:19, Hannes Reinecke wrote:
>> During repetitive namespace map and unmap operations on the target
>> (disabling the namespace, changing the UUID, enabling it again)
>> the initial scan will hang as the target will be returning
>> PATH_ERROR and the I/O is constantly retried:
>>
>> [<0>] folio_wait_bit_common+0x12a/0x310
>> [<0>] filemap_read_folio+0x97/0xd0
>> [<0>] do_read_cache_folio+0x108/0x390
>> [<0>] read_part_sector+0x31/0xa0
>> [<0>] read_lba+0xc5/0x160
>> [<0>] efi_partition+0xd9/0x8f0
>> [<0>] bdev_disk_changed+0x23d/0x6d0
>> [<0>] blkdev_get_whole+0x78/0xc0
>> [<0>] bdev_open+0x2c6/0x3b0
>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>> [<0>] disk_scan_partitions+0x5d/0x100
>> [<0>] device_add_disk+0x402/0x420
>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>
>> Calling 'nvme disconnect' on controllers with these namespaces
>> will hang as the disconnect operation tries to flush scan_work:
>>
>> [<0>] __flush_work+0x389/0x4b0
>> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
>> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
>> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
>> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
>> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
>> [<0>] vfs_write+0x404/0x510
>>
>> before the namespaces are removed, and the controller state
>> DELETING_NOIO (which would abort any pending I/O) is set only
>> afterwards.
>>
>> This patch calls 'nvme_kick_requeue_lists()' when entering
>> DELETING state for a controller to ensure all pending I/O
>> is flushed, and also disables failover for any commands which
>> are completed with an error afterwards, breaking the infinite
>> retry loop.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/core.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 651073280f6f..142babce1963 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -381,6 +381,8 @@ enum nvme_disposition {
>> static inline enum nvme_disposition nvme_decide_disposition(struct
>> request *req)
>> {
>> + struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
>> +
>> if (likely(nvme_req(req)->status == 0))
>> return COMPLETE;
>> @@ -393,6 +395,8 @@ static inline enum nvme_disposition
>> nvme_decide_disposition(struct request *req)
>> return AUTHENTICATE;
>> if (req->cmd_flags & REQ_NVME_MPATH) {
>> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_DELETING)
>> + return COMPLETE;
>
> This looks wrong. What if I disconnect from one path and there are other
> eligible paths?
Yes, correct. We should check if there are paths left, and failover
if we still have paths, but complete the requests if this is the last
path and we had a path error:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 651073280f6f..73b21c01b165 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -394,8 +394,17 @@ static inline enum nvme_disposition
nvme_decide_disposition(struct request *req)
if (req->cmd_flags & REQ_NVME_MPATH) {
if (nvme_is_path_error(nvme_req(req)->status) ||
- blk_queue_dying(req->q))
+ blk_queue_dying(req->q)) {
+ struct nvme_ns *ns = req->q->queuedata;
+ /*
+ * Always complete if this is the last path
+ * and the controller is deleted.
+ */
+ if (list_is_singular(&ns->head->list) &&
+ nvme_ctrl_state(ns->ctrl) == NVME_CTRL_DELETING)
+ return COMPLETE;
return FAILOVER;
+ }
} else {
if (blk_queue_dying(req->q))
return COMPLETE;
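
The intended behaviour of the revised check can be modelled in userspace C as below. This is only a sketch under stated assumptions: the names are hypothetical, the path count stands in for the list_is_singular() test on the head's path list, and the other checks of the real function are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_disposition { MODEL_COMPLETE, MODEL_RETRY, MODEL_FAILOVER };
enum model_ctrl_state { MODEL_CTRL_LIVE, MODEL_CTRL_DELETING };

/* Hypothetical model of a completed request and its surroundings. */
struct model_req {
	bool status_ok;        /* command completed without error */
	bool mpath;            /* request was issued through the multipath head */
	bool path_error;       /* status is a path-related error */
	bool queue_dying;      /* the request queue is being torn down */
	size_t paths_in_head;  /* number of paths attached to the head */
	enum model_ctrl_state ctrl_state;
};

/* Fail over unless this is the last path of a controller being deleted. */
static enum model_disposition model_decide(const struct model_req *req)
{
	if (req->status_ok)
		return MODEL_COMPLETE;

	if (req->mpath) {
		if (req->path_error || req->queue_dying) {
			if (req->paths_in_head == 1 &&
			    req->ctrl_state == MODEL_CTRL_DELETING)
				return MODEL_COMPLETE;
			return MODEL_FAILOVER;
		}
	} else if (req->queue_dying) {
		return MODEL_COMPLETE;
	}
	return MODEL_RETRY;
}
```

Disconnecting one of two paths still fails over to the remaining path in this model; only a path error on the single remaining path of a controller in DELETING state completes the request.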
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* [PATCHv4 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-06 10:16 Hannes Reinecke
2024-09-06 10:16 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
0 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-06 10:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the
new namespaces.
To throw in more fun these namespaces are created on different nodes
in the cluster, where only the paths local to the cluster node are
active, and all other paths are inaccessible.
Essentially it's doing something like:
echo 0 > ${ns}/enable
rm ${ns}
<random delay>
mkdir ${ns}
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable
repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()' waiting for I/O which will never
come.
There are two parts to fixing this:
We need to ensure the NSHEAD_DISK_LIVE is properly set when the
ns_head is live, and unset when the last path is gone.
And we need to trigger the requeue list after NSHEAD_DISK_LIVE
has been cleared to flush all outstanding I/O.
Turns out there's another corner case; when running the same test
but not removing the namespaces while changing the UUID we end up
with I/Os constantly being retried, and we are unable to even
disconnect the controller. To fix this we should set the
'failfast' flag for the controller when disconnecting to ensure
that all I/O is aborted.
With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.
As usual, comments and reviews are welcome.
Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes
Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers
Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared
Hannes Reinecke (3):
nvme-multipath: fixup typo when clearing DISK_LIVE
nvme-multipath: avoid hang on inaccessible namespaces
nvme: 'nvme disconnect' hangs after remapping namespaces
drivers/nvme/host/core.c | 7 +++++++
drivers/nvme/host/multipath.c | 14 +++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)
--
2.35.3
* [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-06 10:16 [PATCHv4 0/3] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
@ 2024-09-06 10:16 ` Hannes Reinecke
2024-09-06 10:35 ` Damien Le Moal
2024-09-08 7:21 ` Sagi Grimberg
0 siblings, 2 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-06 10:16 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace map and unmap operations on the target
(disabling the namespace, changing the UUID, enabling it again)
the initial scan will hang as the target will be returning
PATH_ERROR and the I/O is constantly retried:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
Calling 'nvme disconnect' on controllers with these namespaces
will hang as the disconnect operation tries to flush scan_work:
[<0>] __flush_work+0x389/0x4b0
[<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
[<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
[<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
[<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
[<0>] kernfs_fop_write_iter+0x13d/0x1b0
[<0>] vfs_write+0x404/0x510
before the namespaces are removed.
This patch sets the 'failfast_expired' bit for the controller
to cause all pending I/O to be failed, and the disconnect process
to complete.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 651073280f6f..b968b672dcf8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
*/
nvme_mpath_clear_ctrl_paths(ctrl);
+ /*
+ * Mark the controller as 'failfast' to ensure all pending I/O
+ * is killed.
+ */
+ set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+ nvme_kick_requeue_lists(ctrl);
+
/*
* Unquiesce io queues so any pending IO won't hang, especially
* those submitted from scan work
--
2.35.3
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-06 10:16 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
@ 2024-09-06 10:35 ` Damien Le Moal
2024-09-06 10:42 ` Hannes Reinecke
2024-09-08 7:21 ` Sagi Grimberg
1 sibling, 1 reply; 12+ messages in thread
From: Damien Le Moal @ 2024-09-06 10:35 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme
On 9/6/24 19:16, Hannes Reinecke wrote:
> During repetitive namespace map and unmap operations on the target
> (disabling the namespace, changing the UUID, enabling it again)
> the initial scan will hang as the target will be returning
> PATH_ERROR and the I/O is constantly retried:
>
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>
> Calling 'nvme disconnect' on controllers with these namespaces
> will hang as the disconnect operation tries to flush scan_work:
>
> [<0>] __flush_work+0x389/0x4b0
> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
> [<0>] vfs_write+0x404/0x510
>
> before the namespaces are removed.
>
> This patch sets the 'failfast_expired' bit for the controller
> to cause all pending I/O to be failed, and the disconnect process
> to complete.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/core.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 651073280f6f..b968b672dcf8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> */
> nvme_mpath_clear_ctrl_paths(ctrl);
>
> + /*
> + * Mark the controller as 'failfast' to ensure all pending I/O
> + * is killed.
s/is/are ?
> + */
> + set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
> + nvme_kick_requeue_lists(ctrl);
> +
> /*
> * Unquiesce io queues so any pending IO won't hang, especially
> * those submitted from scan work
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-06 10:35 ` Damien Le Moal
@ 2024-09-06 10:42 ` Hannes Reinecke
0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-06 10:42 UTC (permalink / raw)
To: Damien Le Moal, Hannes Reinecke, Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme
On 9/6/24 12:35, Damien Le Moal wrote:
> On 9/6/24 19:16, Hannes Reinecke wrote:
>> During repetitive namespace map and unmap operations on the target
>> (disabling the namespace, changing the UUID, enabling it again)
>> the initial scan will hang as the target will be returning
>> PATH_ERROR and the I/O is constantly retried:
>>
>> [<0>] folio_wait_bit_common+0x12a/0x310
>> [<0>] filemap_read_folio+0x97/0xd0
>> [<0>] do_read_cache_folio+0x108/0x390
>> [<0>] read_part_sector+0x31/0xa0
>> [<0>] read_lba+0xc5/0x160
>> [<0>] efi_partition+0xd9/0x8f0
>> [<0>] bdev_disk_changed+0x23d/0x6d0
>> [<0>] blkdev_get_whole+0x78/0xc0
>> [<0>] bdev_open+0x2c6/0x3b0
>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>> [<0>] disk_scan_partitions+0x5d/0x100
>> [<0>] device_add_disk+0x402/0x420
>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>
>> Calling 'nvme disconnect' on controllers with these namespaces
>> will hang as the disconnect operation tries to flush scan_work:
>>
>> [<0>] __flush_work+0x389/0x4b0
>> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
>> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
>> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
>> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
>> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
>> [<0>] vfs_write+0x404/0x510
>>
>> before the namespaces are removed.
>>
>> This patch sets the 'failfast_expired' bit for the controller
>> to cause all pending I/O to be failed, and the disconnect process
>> to complete.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/core.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 651073280f6f..b968b672dcf8 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>> */
>> nvme_mpath_clear_ctrl_paths(ctrl);
>>
>> + /*
>> + * Mark the controller as 'failfast' to ensure all pending I/O
>> + * is killed.
>
> s/is/are ?
>
Depends on whether you consider I/O a stream or individual commands.
'.. ensure all pending commands are killed.' ?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-06 10:16 ` [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces Hannes Reinecke
2024-09-06 10:35 ` Damien Le Moal
@ 2024-09-08 7:21 ` Sagi Grimberg
2024-09-09 6:22 ` Hannes Reinecke
1 sibling, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2024-09-08 7:21 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 06/09/2024 13:16, Hannes Reinecke wrote:
> During repetitive namespace map and unmap operations on the target
> (disabling the namespace, changing the UUID, enabling it again)
> the initial scan will hang as the target will be returning
> PATH_ERROR and the I/O is constantly retried:
>
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>
> Calling 'nvme disconnect' on controllers with these namespaces
> will hang as the disconnect operation tries to flush scan_work:
>
> [<0>] __flush_work+0x389/0x4b0
> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
> [<0>] vfs_write+0x404/0x510
>
> before the namespaces are removed.
>
> This patch sets the 'failfast_expired' bit for the controller
> to cause all pending I/O to be failed, and the disconnect process
> to complete.
I don't know if I agree with this approach. Seems too indirect.
Can you please explain (with tracing) what is preventing the scan_work
from completing? The controller state should be DELETING, and perhaps
we are missing a check for this somewhere?
nvme_remove_namespaces() at the point you are injecting here is designed
to allow any writeback to complete (if possible).
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-08 7:21 ` Sagi Grimberg
@ 2024-09-09 6:22 ` Hannes Reinecke
2024-09-09 6:59 ` Hannes Reinecke
0 siblings, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 6:22 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/8/24 09:21, Sagi Grimberg wrote:
>
>
>
> On 06/09/2024 13:16, Hannes Reinecke wrote:
>> During repetitive namespace map and unmap operations on the target
>> (disabling the namespace, changing the UUID, enabling it again)
>> the initial scan will hang as the target will be returning
>> PATH_ERROR and the I/O is constantly retried:
>>
>> [<0>] folio_wait_bit_common+0x12a/0x310
>> [<0>] filemap_read_folio+0x97/0xd0
>> [<0>] do_read_cache_folio+0x108/0x390
>> [<0>] read_part_sector+0x31/0xa0
>> [<0>] read_lba+0xc5/0x160
>> [<0>] efi_partition+0xd9/0x8f0
>> [<0>] bdev_disk_changed+0x23d/0x6d0
>> [<0>] blkdev_get_whole+0x78/0xc0
>> [<0>] bdev_open+0x2c6/0x3b0
>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>> [<0>] disk_scan_partitions+0x5d/0x100
>> [<0>] device_add_disk+0x402/0x420
>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>
>> Calling 'nvme disconnect' on controllers with these namespaces
>> will hang as the disconnect operation tries to flush scan_work:
>>
>> [<0>] __flush_work+0x389/0x4b0
>> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
>> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
>> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
>> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
>> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
>> [<0>] vfs_write+0x404/0x510
>>
>> before the namespaces are removed.
>>
>> This patch sets the 'failfast_expired' bit for the controller
>> to cause all pending I/O to be failed, and the disconnect process
>> to complete.
>
> I don't know if I agree with this approach. Seems too indirect.
> Can you please explain (with tracing) what is preventing the scan_work
> from completing? The controller state should be DELETING, and perhaps
> we are missing a check for this somewhere?
>
> nvme_remove_namespaces() at the point you are injecting here is designed
> to allow any writeback to complete (if possible).
Looks like we're forgetting a 'nvme_kick_requeue_lists()' when setting
the state to DELETING. So requeue is never triggered, causing stuck I/O.
I'll be checking.
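To make the suspicion concrete, here is a toy user-space model in C of a requeue list that only makes progress when it is kicked. All `model_*` names are hypothetical; this is not the driver's data structure, just a sketch assuming that parked I/O is resubmitted only from the kick path.

```c
#include <stddef.h>

#define MODEL_MAX_REQUEUED 16

/* Hypothetical stand-in for a namespace head's requeue list. */
struct model_head {
	int requeued[MODEL_MAX_REQUEUED];	/* ids of parked I/O */
	size_t nr_requeued;
	size_t nr_completed;
};

/* Park an I/O; it makes no progress until the list is kicked. */
static int model_requeue(struct model_head *h, int io_id)
{
	if (h->nr_requeued == MODEL_MAX_REQUEUED)
		return -1;
	h->requeued[h->nr_requeued++] = io_id;
	return 0;
}

/*
 * Kick: resubmit everything that is parked. In this model every
 * resubmitted I/O is simply counted as completed.
 */
static size_t model_kick_requeue_list(struct model_head *h)
{
	size_t flushed = h->nr_requeued;

	h->nr_completed += flushed;
	h->nr_requeued = 0;
	return flushed;
}

/* State change to "deleting", with or without the kick. */
static void model_set_deleting(struct model_head *h, int kick)
{
	if (kick)
		model_kick_requeue_list(h);
}
```

Calling model_set_deleting() with kick == 0 leaves the parked I/O untouched, which is the situation suspected here; with kick != 0 the list is drained.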
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces
2024-09-09 6:22 ` Hannes Reinecke
@ 2024-09-09 6:59 ` Hannes Reinecke
0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-09-09 6:59 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/9/24 08:22, Hannes Reinecke wrote:
> On 9/8/24 09:21, Sagi Grimberg wrote:
>>
>>
>>
>> On 06/09/2024 13:16, Hannes Reinecke wrote:
>>> During repetitive namespace map and unmap operations on the target
>>> (disabling the namespace, changing the UUID, enabling it again)
>>> the initial scan will hang as the target will be returning
>>> PATH_ERROR and the I/O is constantly retried:
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> Calling 'nvme disconnect' on controllers with these namespaces
>>> will hang as the disconnect operation tries to flush scan_work:
>>>
>>> [<0>] __flush_work+0x389/0x4b0
>>> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
>>> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
>>> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
>>> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
>>> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
>>> [<0>] vfs_write+0x404/0x510
>>>
>>> before the namespaces are removed.
>>>
>>> This patch sets the 'failfast_expired' bit for the controller
>>> to cause all pending I/O to be failed, and the disconnect process
>>> to complete.
>>
>> I don't know if I agree with this approach. Seems too indirect.
>> Can you please explain (with tracing) what is preventing the scan_work
>> from completing? The controller state should be DELETING, and perhaps
>> we are missing a check for this somewhere?
>>
>> nvme_remove_namespaces() at the point you are injecting here is designed
>> to allow any writeback to complete (if possible).
>
> Looks like we're forgetting a 'nvme_kick_requeue_lists()' when setting
> the state to DELETING. So requeue is never triggered, causing stuck I/O.
> I'll be checking.
>
Turns out that is not sufficient.
The problem here is that nvmet will always return PATH_ERROR while the
namespace is disabled. So that I/O will _continue_ to be retried,
as nvme_find_path() calls nvme_path_is_disabled(), which
_deliberately_ forwards I/O even if the controller is in DELETING:

	/*
	 * We don't treat NVME_CTRL_DELETING as a disabled path as I/O should
	 * still be able to complete assuming that the controller is connected.
	 * Otherwise it will fail immediately and return to the requeue list.
	 */

So there is no way out, and the process is stuck. We can surely try to
fail over to another path, but if this is the last path we're stuck forever.
Seems like we need a 'fail_if_no_path' mechanism here...
Hmm.
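The loop described above can be illustrated with a small stand-alone C model. This is only a sketch: the `model_*` names, the three-state enum and the `fail_if_no_path` flag are hypothetical stand-ins, not the kernel's types and not a proposed patch. It assumes a single remaining path whose target keeps answering with a path error, and it bounds the retries so that the model itself terminates.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified controller states. */
enum model_ctrl_state { MODEL_LIVE, MODEL_DELETING, MODEL_DEAD };

/* Outcome of submitting one I/O in this model. */
enum model_result { MODEL_IO_OK, MODEL_IO_FAILED, MODEL_IO_STUCK };

struct model_path {
	enum model_ctrl_state state;
	bool ns_enabled;	/* false: target answers with a path error */
};

/*
 * Mirrors the idea in the comment quoted above: DELETING is
 * deliberately not treated as a disabled path, so I/O is still
 * forwarded to it.
 */
static bool model_path_is_disabled(const struct model_path *p)
{
	return p->state != MODEL_LIVE && p->state != MODEL_DELETING;
}

/*
 * Submit one I/O on the only remaining path. max_retries bounds the
 * loop; MODEL_IO_STUCK stands for "would have been retried forever".
 * fail_if_no_path is an assumed policy knob, not an existing option.
 */
static enum model_result model_submit(const struct model_path *p,
				      bool fail_if_no_path,
				      unsigned int max_retries)
{
	unsigned int attempt;

	for (attempt = 0; attempt < max_retries; attempt++) {
		if (model_path_is_disabled(p))
			return MODEL_IO_FAILED;	/* no usable path left */
		if (p->ns_enabled)
			return MODEL_IO_OK;
		/* Path error on the last path of a controller being deleted. */
		if (fail_if_no_path && p->state == MODEL_DELETING)
			return MODEL_IO_FAILED;
		/* Otherwise: requeue and retry on the same path. */
	}
	return MODEL_IO_STUCK;
}
```

With a path in MODEL_DELETING and ns_enabled == false, model_submit() only leaves the retry loop when fail_if_no_path is set, which is the gap described here.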
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich