* [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-02 11:15 Hannes Reinecke
2024-09-02 11:15 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-02 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I'm have a testcase which does a repeated map/unmap of namespaces and
changing the UUID for each namespace after unmapping.
This leads to an unrecoverable system where the scanning processes
are stuck in 'device_add_disk()' waiting for I/O which will never come
as the I/O is doing ping-pong between the namespace (which cannot
do I/O as the path is inaccessible) and the ns_head (which is not
live and hence can't do I/O, either).
With this patchset (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I see to get the testcase added to blktests.
As usual, comments and reviews are welcome.
Hannes Reinecke (4):
nvme-multipath: fixup typo when clearing DISK_LIVE
nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown
nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths
nvme: remove existing namespace on ID mismatch
drivers/nvme/host/core.c | 4 +++-
drivers/nvme/host/multipath.c | 10 ++++++++--
2 files changed, 11 insertions(+), 3 deletions(-)
--
2.35.3
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE
2024-09-02 11:15 [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
@ 2024-09-02 11:15 ` Hannes Reinecke
2024-09-02 16:48 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown Hannes Reinecke
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-02 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
'struct nvme_ns'.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/multipath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 91d9eb3c22ef..c9d23b1b8efc 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
rc = device_add_disk(&head->subsys->dev, head->disk,
nvme_ns_attr_groups);
if (rc) {
- clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
+ clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
return;
}
nvme_add_ns_head_cdev(head);
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown
2024-09-02 11:15 [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-02 11:15 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-02 11:15 ` Hannes Reinecke
2024-09-02 16:51 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
2024-09-02 11:15 ` [PATCH 4/4] nvme: remove existing namespace on ID mismatch Hannes Reinecke
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-02 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When calling 'nvme_mpath_shutdown_disk()' we should be clearing
the 'NVME_NSHEAD_DISK_LIVE' bit to be consistent with
nvme_mpath_set_disk_live().
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/multipath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..01dbbe866d61 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -968,7 +968,7 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
if (!head->disk)
return;
kblockd_schedule_work(&head->requeue_work);
- if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
nvme_cdev_del(&head->cdev, &head->cdev_device);
del_gendisk(head->disk);
}
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths
2024-09-02 11:15 [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-02 11:15 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-02 11:15 ` [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown Hannes Reinecke
@ 2024-09-02 11:15 ` Hannes Reinecke
2024-09-02 16:53 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 4/4] nvme: remove existing namespace on ID mismatch Hannes Reinecke
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-02 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
and another scanning process (triggered by AEN) trying to remove the namespaces:
[<0>] __synchronize_srcu+0x17c/0x1b0
[<0>] nvme_ns_remove+0x12e/0x210 [nvme_core]
[<0>] nvme_ns_remove_by_nsid+0x21/0x70 [nvme_core]
[<0>] nvme_scan_work+0x1b7/0x490 [nvme_core]
[<0>] process_scheduled_works+0x37d/0x6d0
none of which is able to make progress as the first process cannot submit I/O
(as all the namespace inforation is stale), and the second process is stalled
waiting on the srcu to be released by the first process.
This patch checks for the NVME_NSHEAD_DISK_LIVE bit when selecting a path,
causing I/O to be aborted and allowing the first process to make progress.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/multipath.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 01dbbe866d61..f5f2e1667c64 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
{
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ return NULL;
+
switch (READ_ONCE(head->subsys->iopolicy)) {
case NVME_IOPOLICY_QD:
return nvme_queue_depth_path(head);
@@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ return NULL;
+
list_for_each_entry_rcu(ns, &head->list, siblings) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
continue;
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/4] nvme: remove existing namespace on ID mismatch
2024-09-02 11:15 [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
` (2 preceding siblings ...)
2024-09-02 11:15 ` [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
@ 2024-09-02 11:15 ` Hannes Reinecke
2024-09-02 17:01 ` Sagi Grimberg
3 siblings, 1 reply; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-02 11:15 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When the namespace IDs mismatch during rescan we clearly ran out of
sync with the target, but we really can't know whether our local copy
of the NSID is correct, and that one found by scanning is not.
But clearly something is wrong with the namespaces, so we should remove
the existing one, too, to avoid accidental data corruption.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0dc8bcc664f2..626718bbf3a5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -136,6 +136,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
unsigned nsid);
static void nvme_update_keep_alive(struct nvme_ctrl *ctrl,
struct nvme_command *cmd);
+static void nvme_ns_remove_by_nsid(struct nvme_ctrl *ctrl, u32 nsid);
void nvme_queue_scan(struct nvme_ctrl *ctrl)
{
@@ -3746,8 +3747,9 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
}
if (!nvme_ns_ids_equal(&head->ids, &info->ids)) {
dev_err(ctrl->device,
- "IDs don't match for shared namespace %d\n",
+ "IDs don't match for shared namespace %d, removing\n",
info->nsid);
+ nvme_ns_remove_by_nsid(ctrl, info->nsid);
goto out_put_ns_head;
}
--
2.35.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE
2024-09-02 11:15 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
@ 2024-09-02 16:48 ` Sagi Grimberg
0 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2024-09-02 16:48 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
This was already sent wasn't it?
On 02/09/2024 14:15, Hannes Reinecke wrote:
> NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
> 'struct nvme_ns'.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
And again,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown
2024-09-02 11:15 ` [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown Hannes Reinecke
@ 2024-09-02 16:51 ` Sagi Grimberg
0 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2024-09-02 16:51 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 02/09/2024 14:15, Hannes Reinecke wrote:
> When calling 'nvme_mpath_shutdown_disk()' we should be clearing
> the 'NVME_NSHEAD_DISK_LIVE' bit to be consistent with
> nvme_mpath_set_disk_live().
I have to say, this looks redundant on its own. The nshead is
going away so it shouldn't matter.
Your cover letter suggest that there is a hang, which I assume
this is essential to, please describe how and say what is this fixing.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths
2024-09-02 11:15 ` [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
@ 2024-09-02 16:53 ` Sagi Grimberg
0 siblings, 0 replies; 10+ messages in thread
From: Sagi Grimberg @ 2024-09-02 16:53 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
The patch title should describe what it is fixing rather than what it is
doing.
On 02/09/2024 14:15, Hannes Reinecke wrote:
> During repetitive namespace remapping operations on the target the
> namespace might have changed between the time the initial scan
> was performed, and partition scan was invoked by device_add_disk()
> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
I'm not sure I understand what "remapping operations" are?
> [<0>] folio_wait_bit_common+0x12a/0x310
> [<0>] filemap_read_folio+0x97/0xd0
> [<0>] do_read_cache_folio+0x108/0x390
> [<0>] read_part_sector+0x31/0xa0
> [<0>] read_lba+0xc5/0x160
> [<0>] efi_partition+0xd9/0x8f0
> [<0>] bdev_disk_changed+0x23d/0x6d0
> [<0>] blkdev_get_whole+0x78/0xc0
> [<0>] bdev_open+0x2c6/0x3b0
> [<0>] bdev_file_open_by_dev+0xcb/0x120
> [<0>] disk_scan_partitions+0x5d/0x100
> [<0>] device_add_disk+0x402/0x420
> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>
> and another scanning process (triggered by AEN) trying to remove the namespaces:
>
> [<0>] __synchronize_srcu+0x17c/0x1b0
> [<0>] nvme_ns_remove+0x12e/0x210 [nvme_core]
> [<0>] nvme_ns_remove_by_nsid+0x21/0x70 [nvme_core]
> [<0>] nvme_scan_work+0x1b7/0x490 [nvme_core]
> [<0>] process_scheduled_works+0x37d/0x6d0
>
> none of which is able to make progress as the first process cannot submit I/O
> (as all the namespace inforation is stale), and the second process is stalled
> waiting on the srcu to be released by the first process.
>
> This patch checks for the NVME_NSHEAD_DISK_LIVE bit when selecting a path,
> causing I/O to be aborted and allowing the first process to make progress.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/multipath.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 01dbbe866d61..f5f2e1667c64 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>
> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> {
> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> + return NULL;
> +
> switch (READ_ONCE(head->subsys->iopolicy)) {
> case NVME_IOPOLICY_QD:
> return nvme_queue_depth_path(head);
> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
> {
> struct nvme_ns *ns;
>
> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
> + return NULL;
> +
> list_for_each_entry_rcu(ns, &head->list, siblings) {
> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
> continue;
This looks like together with the last patch address the hang, so they
need to be squashed together.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] nvme: remove existing namespace on ID mismatch
2024-09-02 11:15 ` [PATCH 4/4] nvme: remove existing namespace on ID mismatch Hannes Reinecke
@ 2024-09-02 17:01 ` Sagi Grimberg
2024-09-03 12:30 ` Hannes Reinecke
0 siblings, 1 reply; 10+ messages in thread
From: Sagi Grimberg @ 2024-09-02 17:01 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
Again, patch title should describe what it is fixing.
On 02/09/2024 14:15, Hannes Reinecke wrote:
> When the namespace IDs mismatch during rescan we clearly ran out of
> sync with the target,
Not clear at all. When can this happen and how?
> but we really can't know whether our local copy
> of the NSID is correct, and that one found by scanning is not.
> But clearly something is wrong with the namespaces, so we should remove
> the existing one, too, to avoid accidental data corruption.
Umm. So you see a bogus ns and conflicting with another ns which is
seemingly fine and you remove it?
I feel like you are describing something "generically useful/correct"
while you address
a very specific problem but you don't describe it to us...
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] nvme: remove existing namespace on ID mismatch
2024-09-02 17:01 ` Sagi Grimberg
@ 2024-09-03 12:30 ` Hannes Reinecke
0 siblings, 0 replies; 10+ messages in thread
From: Hannes Reinecke @ 2024-09-03 12:30 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/2/24 19:01, Sagi Grimberg wrote:
> Again, patch title should describe what it is fixing.
>
>
> On 02/09/2024 14:15, Hannes Reinecke wrote:
>> When the namespace IDs mismatch during rescan we clearly ran out of
>> sync with the target,
>
> Not clear at all. When can this happen and how?
>
>> but we really can't know whether our local copy
>> of the NSID is correct, and that one found by scanning is not.
>> But clearly something is wrong with the namespaces, so we should remove
>> the existing one, too, to avoid accidental data corruption.
>
> Umm. So you see a bogus ns and conflicting with another ns which is
> seemingly fine and you remove it?
>
> I feel like you are describing something "generically useful/correct"
> while you address
> a very specific problem but you don't describe it to us...
>
Turns out that this patch was actually pointless, as the code in
question can only be reached from the 'nvme_alloc_ns()' flow, which
means that no namespace for this nsid on that controller is present
(and hence nothing to remove).
But I seem to have found the root cause now (I/O stall as we don't
kick the requeue list in all cases), will be sending a new patchset.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-09-03 12:30 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-02 11:15 [PATCH 0/4] nvme: NSHEAD_DISK_LIVE fixes Hannes Reinecke
2024-09-02 11:15 ` [PATCH 1/4] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
2024-09-02 16:48 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 2/4] nvme-multipath: clear 'NVME_NSHEAD_DISK_LIVE' bit on shutdown Hannes Reinecke
2024-09-02 16:51 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 3/4] nvme-multipath: check for NVME_NSHEAD_DISK_LIVE when selecting paths Hannes Reinecke
2024-09-02 16:53 ` Sagi Grimberg
2024-09-02 11:15 ` [PATCH 4/4] nvme: remove existing namespace on ID mismatch Hannes Reinecke
2024-09-02 17:01 ` Sagi Grimberg
2024-09-03 12:30 ` Hannes Reinecke
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.