* [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE
2024-09-06 7:18 [PATCHv3 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
@ 2024-09-06 7:18 ` Hannes Reinecke
2024-09-06 7:29 ` Sagi Grimberg
2024-09-06 7:18 ` [PATCH 2/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-06 7:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
'struct nvme_ns'.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/multipath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 91d9eb3c22ef..c9d23b1b8efc 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
rc = device_add_disk(&head->subsys->dev, head->disk,
nvme_ns_attr_groups);
if (rc) {
- clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
+ clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
return;
}
nvme_add_ns_head_cdev(head);
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 2/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths
2024-09-06 7:18 [PATCHv3 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-06 7:18 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-06 7:18 ` Hannes Reinecke
2024-09-06 7:30 ` Sagi Grimberg
2024-09-06 7:18 ` [PATCH 3/4] nvme-multipath: always requeue I/O when updating the ANA state Hannes Reinecke
2024-09-06 7:18 ` [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces() Hannes Reinecke
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-06 7:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.
This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed to properly terminate pending I/O.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/multipath.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..f72c5a6a2d8e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -421,6 +421,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ return NULL;
+
list_for_each_entry_rcu(ns, &head->list, siblings) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
continue;
@@ -967,11 +970,16 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
- kblockd_schedule_work(&head->requeue_work);
- if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
nvme_cdev_del(&head->cdev, &head->cdev_device);
del_gendisk(head->disk);
}
+ /*
+ * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+ * to allow multipath to fail all I/O.
+ */
+ synchronize_srcu(&head->srcu);
+ kblockd_schedule_work(&head->requeue_work);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 3/4] nvme-multipath: always requeue I/O when updating the ANA state
2024-09-06 7:18 [PATCHv3 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-06 7:18 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-06 7:18 ` [PATCH 2/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
@ 2024-09-06 7:18 ` Hannes Reinecke
2024-09-06 7:33 ` Sagi Grimberg
2024-09-06 7:18 ` [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces() Hannes Reinecke
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-06 7:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When we've changed the ANA state we always need to kiek the requeue
list to ensure any pending I/O is retried with the latest changes.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/multipath.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f72c5a6a2d8e..143c9d64dfa7 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -734,9 +734,12 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
* will reprocess the ANA log page in nvme_mpath_update() once the
* controller is ready.
*/
- if (nvme_state_is_live(ns->ana_state) &&
- nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
- nvme_mpath_set_live(ns);
+ if (nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE) {
+ if (nvme_state_is_live(ns->ana_state))
+ nvme_mpath_set_live(ns);
+ else
+ kblockd_schedule_work(&ns->head->requeue_work);
+ }
}
static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces()
2024-09-06 7:18 [PATCHv3 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
` (2 preceding siblings ...)
2024-09-06 7:18 ` [PATCH 3/4] nvme-multipath: always requeue I/O when updating the ANA state Hannes Reinecke
@ 2024-09-06 7:18 ` Hannes Reinecke
2024-09-06 7:38 ` Sagi Grimberg
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-06 7:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
nvme_remove_namespaces() is only called when the controller is
being removed. If there is a scan process still pending and
the I/O from that process cannot make progress (eg if all paths
are in ANA state 'inaccessible') we cannot disconnect the
controller as the 'nvme disconnect' process will hang in
flush_work(&ctrl->scan_work).
This patch sets the 'failfast_expired' bit for the controller
to cause all pending I/O to be failed, and the disconnect process
to complete.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 651073280f6f..b968b672dcf8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
*/
nvme_mpath_clear_ctrl_paths(ctrl);
+ /*
+ * Mark the controller as 'failfast' to ensure all pending I/O
+ * is killed.
+ */
+ set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+ nvme_kick_requeue_lists(ctrl);
+
/*
* Unquiesce io queues so any pending IO won't hang, especially
* those submitted from scan work
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces()
2024-09-06 7:18 ` [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces() Hannes Reinecke
@ 2024-09-06 7:38 ` Sagi Grimberg
2024-09-06 8:32 ` Hannes Reinecke
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2024-09-06 7:38 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
Please describe what the patch fixes in the title, not what it does.
On 06/09/2024 10:18, Hannes Reinecke wrote:
> nvme_remove_namespaces() is only called when the controller is
> being removed. If there is a scan process still pending and
> the I/O from that process cannot make progress (eg if all paths
> are in ANA state 'inaccessible') we cannot disconnect the
> controller as the 'nvme disconnect' process will hang in
> flush_work(&ctrl->scan_work).
Please include the hang process stack.
>
> This patch sets the 'failfast_expired' bit for the controller
> to cause all pending I/O to be failed, and the disconnect process
> to complete.
How did you reproduce it? trigger namespace scanning and disconnect-all
in a loop? Can we get a blktest for it?
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/core.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 651073280f6f..b968b672dcf8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> */
> nvme_mpath_clear_ctrl_paths(ctrl);
>
> + /*
> + * Mark the controller as 'failfast' to ensure all pending I/O
> + * is killed.
> + */
> + set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
What about nvme_stop_failfast_work() ?
> + nvme_kick_requeue_lists(ctrl);
> +
> /*
> * Unquiesce io queues so any pending IO won't hang, especially
> * those submitted from scan work
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] nvme: set 'failfast_expired' in nvme_remove_namespaces()
2024-09-06 7:38 ` Sagi Grimberg
@ 2024-09-06 8:32 ` Hannes Reinecke
0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-06 8:32 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/6/24 09:38, Sagi Grimberg wrote:
> Please describe what the patch fixes in the title, not what it does.
>
>
Yeah, will do.
> On 06/09/2024 10:18, Hannes Reinecke wrote:
>> nvme_remove_namespaces() is only called when the controller is
>> being removed. If there is a scan process still pending and
>> the I/O from that process cannot make progress (eg if all paths
>> are in ANA state 'inaccessible') we cannot disconnect the
>> controller as the 'nvme disconnect' process will hang in
>> flush_work(&ctrl->scan_work).
>
> Please include the hang process stack.
>
Ok.
>>
>> This patch sets the 'failfast_expired' bit for the controller
>> to cause all pending I/O to be failed, and the disconnect process
>> to complete.
>
> How did you reproduce it? trigger namespace scanning and disconnect-all
> in a loop? Can we get a blktest for it?
>
The script is quite easy:
for i in $(seq 0 8); do
for j in $(seq 1 6 | shuf); do
ns="${CONFIGFS}/subsystems/${NQN}/namespaces/${j}"
grpid=$(( ((($j / 3) + $i) % 3) + 1 ))
echo "$i: enable $j (grpid $grpid)"
echo $grpid > $ns/ana_grpid
uuidgen > $ns/device_uuid
echo 1 > $ns/enable
done
ls /sys/block | grep -v ram
for j in $(seq 1 6 | shuf); do
echo "$i: disable $j"
ns="${CONFIGFS}/subsystems/${NQN}/namespaces/${j}"
echo 0 > $ns/enable
done
ls /sys/block | grep -v ram
done
but in order to recreate that you'd need several paths, each with its
own ANA group id, and at least one in 'inaccessible'.
On my testing I've been using 6 paths, four 'inaccessible', one
'optimized', and one 'non-optimized'.
And having one namespace for each ANA group ID.
I'm planning to do a blktest for it, but that requires to first
implement ANA support there.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/core.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 651073280f6f..b968b672dcf8 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4222,6 +4222,13 @@ void nvme_remove_namespaces(struct nvme_ctrl
>> *ctrl)
>> */
>> nvme_mpath_clear_ctrl_paths(ctrl);
>> + /*
>> + * Mark the controller as 'failfast' to ensure all pending I/O
>> + * is killed.
>> + */
>> + set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
>
> What about nvme_stop_failfast_work() ?
>
That is already done; nvme_remove_namespaces() is called after
nvme_stop_ctrl() which already called nvme_stop_failfast_work().
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 10+ messages in thread