* [PATCHv6 0/3] nvme: NSHEAD_DISK_LIVE fixes
@ 2024-09-11 9:51 Hannes Reinecke
2024-09-11 9:51 ` [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Hannes Reinecke @ 2024-09-11 9:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I have a testcase which repeatedly deletes namespaces on the target
and creates new ones, aggressively re-using NSIDs for the new
namespaces.
To throw in more fun these namespaces are created on different nodes
in the cluster, where only the paths local to the cluster node are
active, and all other paths are inaccessible.
Essentially it's doing something like:
    echo 0 > ${ns}/enable
    rm ${ns}
    <random delay>
    mkdir ${ns}
    echo "<dev>" > ${ns}/device_path
    echo "<grpid>" > ${ns}/ana_grpid
    uuidgen > ${ns}/device_uuid
    echo 1 > ${ns}/enable
repeatedly with several namespaces and several ANA groups.
This leads to an unrecoverable system where the scanning processes
are stuck in the partition scanning code triggered via
'device_add_disk()' waiting for I/O which will never
come.
There are two parts to fixing this:
We need to ensure the NSHEAD_DISK_LIVE is properly set when the
ns_head is live, and unset when the last path is gone.
And we need to trigger the requeue list after NSHEAD_DISK_LIVE
has been cleared to flush all outstanding I/O.
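To illustrate why the flag matters, here is a minimal userspace sketch of the three-way decision made for each bio, loosely modelled on the multipath submission path (nvme_ns_head_submit_bio()). It is not kernel code: the struct, the enum values and every model_* name are assumptions made up for this illustration, and the path check is reduced to "is any path attached".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; all names here are hypothetical. */
enum path_state { PATH_OPTIMIZED, PATH_INACCESSIBLE };

struct model_head {
	bool disk_live;			/* stands in for NVME_NSHEAD_DISK_LIVE */
	size_t nr_paths;		/* paths still attached to the head */
	enum path_state paths[4];
};

enum io_verdict { IO_SUBMIT, IO_REQUEUE, IO_FAIL };

/* Is there a path that can take I/O right now? */
bool model_find_path(const struct model_head *h)
{
	for (size_t i = 0; i < h->nr_paths; i++)
		if (h->paths[i] == PATH_OPTIMIZED)
			return true;
	return false;
}

/* Is it worth parking the bio because a path might come back? */
bool model_available_path(const struct model_head *h)
{
	if (!h->disk_live)
		return false;	/* head is going away: do not queue */
	return h->nr_paths > 0;	/* simplification of the real check */
}

enum io_verdict model_submit(const struct model_head *h)
{
	if (model_find_path(h))
		return IO_SUBMIT;
	if (model_available_path(h))
		return IO_REQUEUE;
	return IO_FAIL;
}
```

With only inaccessible paths attached and the flag still set, model_submit() keeps answering IO_REQUEUE, which matches the stuck scan described above. Once the flag is cleared the same call answers IO_FAIL, so a waiter gets an error instead of blocking.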
Turns out there's another corner case: when running the same test
but not removing the namespaces while changing the UUID we end up
with I/Os constantly being retried. If this happens during the partition
scan kicked off from device_add_disk() the system is stuck as the
scan_mutex will never be released. To fix this I've introduced a
NVME_NSHEAD_DISABLE_QUEUEING flag to inhibit queueing during scan,
such that device_add_disk() can make progress.
With these patches (and the queue freeze patchset from hch) the problem
is resolved and the testcase runs without issues.
I'll see to getting the testcase added to blktests.
As usual, comments and reviews are welcome.
Changes to v5:
- Introduce NVME_NSHEAD_DISABLE_QUEUEING flag instead of disabling
command retries
Changes to v4:
- Disabled command retries when the controller is removed instead of
(ab-)using the failfast flag
Changes to v3:
- Update patch description as suggested by Sagi
- Drop patch to requeue I/O after ANA state changes
Changes to v2:
- Include reviews from Sagi
- Drop the check for NSHEAD_DISK_LIVE in nvme_available_path()
- Add a patch to requeue I/O if the ANA state changed
- Set the 'failfast' flag when removing controllers
Changes to the original submission:
- Drop patch to remove existing namespaces on ID mismatch
- Combine patches updating NSHEAD_DISK_LIVE handling
- requeue I/O after NSHEAD_DISK_LIVE has been cleared
Hannes Reinecke (3):
nvme-multipath: fixup typo when clearing DISK_LIVE
nvme-multipath: avoid hang on inaccessible namespaces
nvme-multipath: skip inaccessible paths during partition scan
drivers/nvme/host/multipath.c | 22 +++++++++++++++++++---
drivers/nvme/host/nvme.h | 1 +
2 files changed, 20 insertions(+), 3 deletions(-)
--
2.35.3
* [PATCH 1/3] nvme-multipath: fixup typo when clearing DISK_LIVE
From: Hannes Reinecke @ 2024-09-11 9:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
NVME_NSHEAD_DISK_LIVE is a flag for 'struct nvme_ns_head', not
'struct nvme_ns'.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/multipath.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 91d9eb3c22ef..c9d23b1b8efc 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -646,7 +646,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
rc = device_add_disk(&head->subsys->dev, head->disk,
nvme_ns_attr_groups);
if (rc) {
- clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
+ clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
return;
}
nvme_add_ns_head_cdev(head);
--
2.35.3
* [PATCH 2/3] nvme-multipath: avoid hang on inaccessible namespaces
From: Hannes Reinecke @ 2024-09-11 9:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
During repetitive namespace remapping operations on the target the
namespace might have changed between the time the initial scan
was performed, and partition scan was invoked by device_add_disk()
in nvme_mpath_set_live(). We then end up with a stuck scanning process:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
This happens when we have several paths, some of which are inaccessible,
and the active paths are removed first. Then nvme_find_path() will requeue
I/O in the ns_head (as paths are present), but the requeue list is never
triggered as all remaining paths are inactive.
This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(),
and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
the last path has been removed, to properly terminate pending I/O.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
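Note (illustration only, not part of the patch): the switch from test_bit() to test_and_clear_bit() in nvme_mpath_shutdown_disk() can be sketched in userspace with C11 atomics. The model_* helpers and the counters below are assumptions invented for this sketch. They only mimic the "atomically clear and report the old value" behaviour, and the synchronize_srcu() step is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_DISK_LIVE 0	/* bit number, mirroring NVME_NSHEAD_DISK_LIVE */

/* Userspace model only; all names here are hypothetical. */
struct model_head {
	atomic_ulong flags;
	int teardowns;		/* how often the cdev/gendisk teardown ran */
	int requeue_kicks;	/* how often the requeue work was scheduled */
};

bool model_test_bit(unsigned int nr, atomic_ulong *flags)
{
	return (atomic_load(flags) >> nr) & 1UL;
}

/* Clear the bit and report whether it was set beforehand. */
bool model_test_and_clear_bit(unsigned int nr, atomic_ulong *flags)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_and(flags, ~mask) & mask) != 0;
}

void model_shutdown_disk(struct model_head *h)
{
	/* Only the caller that observes the 1 -> 0 transition tears down. */
	if (model_test_and_clear_bit(MODEL_DISK_LIVE, &h->flags))
		h->teardowns++;
	/* The requeue kick now happens after the flag is gone. */
	h->requeue_kicks++;
}
```

In this model the flag is guaranteed to be clear by the time the requeue work is kicked, so anything the work resubmits sees a head that is no longer live.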
drivers/nvme/host/multipath.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..f72c5a6a2d8e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -421,6 +421,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ return NULL;
+
list_for_each_entry_rcu(ns, &head->list, siblings) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
continue;
@@ -967,11 +970,16 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
if (!head->disk)
return;
- kblockd_schedule_work(&head->requeue_work);
- if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
nvme_cdev_del(&head->cdev, &head->cdev_device);
del_gendisk(head->disk);
}
+ /*
+ * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+ * to allow multipath to fail all I/O.
+ */
+ synchronize_srcu(&head->srcu);
+ kblockd_schedule_work(&head->requeue_work);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
--
2.35.3
* [PATCH 3/3] nvme-multipath: skip inaccessible paths during partition scan
From: Hannes Reinecke @ 2024-09-11 9:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When a path is switched to 'inaccessible' during partition scan
triggered via device_add_disk() and we only have one path the
system will be stuck as nvme_available_path() will always return
'true'. So I/O will never be completed and the system is stuck
in device_add_disk().
This patch introduces a flag NVME_NSHEAD_DISABLE_QUEUEING to
cause nvme_available_path() to always return NULL, and with
that I/O to be failed if all paths are unavailable.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
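Note (illustration only, not part of the patch): a small userspace sketch of the window this patch opens around device_add_disk(). All model_* names, the counters and the "scan issues exactly one read" behaviour are assumptions for the sake of the example; the real partition scan and path selection are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DISK_LIVE		(1UL << 0)
#define MODEL_DISABLE_QUEUEING	(1UL << 1)

/* Userspace model only; all names here are hypothetical. */
struct model_head {
	unsigned long flags;
	bool path_usable;	/* false: the only path went inaccessible */
	int completed;
	int requeued;
	int failed;
};

bool model_available_path(const struct model_head *h)
{
	if (!(h->flags & MODEL_DISK_LIVE) ||
	    (h->flags & MODEL_DISABLE_QUEUEING))
		return false;
	return true;	/* assumption: a path is still attached */
}

void model_submit(struct model_head *h)
{
	if (h->path_usable)
		h->completed++;
	else if (model_available_path(h))
		h->requeued++;	/* parked, waiting for a path */
	else
		h->failed++;	/* completed with an error */
}

/* Stand-in for device_add_disk(): the scan issues one read. */
int model_add_disk(struct model_head *h)
{
	int parked_before = h->requeued;

	model_submit(h);
	/* -1 marks the case where the scan would wait forever */
	return h->requeued == parked_before ? 0 : -1;
}

int model_set_live(struct model_head *h)
{
	int rc;

	h->flags |= MODEL_DISK_LIVE;
	h->flags |= MODEL_DISABLE_QUEUEING;
	rc = model_add_disk(h);
	h->flags &= ~MODEL_DISABLE_QUEUEING;
	return rc;
}
```

With path_usable set to false, model_set_live() returns 0 because the scan's read is failed rather than parked. Any later model_submit() is queued again, since the flag is only set for the duration of the scan.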
drivers/nvme/host/multipath.c | 10 +++++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f72c5a6a2d8e..bcd70755c663 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -421,7 +421,8 @@ static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
- if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
+ if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags) ||
+ test_bit(NVME_NSHEAD_DISABLE_QUEUEING, &head->flags))
return NULL;
list_for_each_entry_rcu(ns, &head->list, siblings) {
@@ -646,8 +647,15 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
* head.
*/
if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ /*
+ * Disable queueing to ensure I/O is not retried on unusable
+ * paths, which would cause the system to be stuck during
+ * partition scan.
+ */
+ set_bit(NVME_NSHEAD_DISABLE_QUEUEING, &head->flags);
rc = device_add_disk(&head->subsys->dev, head->disk,
nvme_ns_attr_groups);
+ clear_bit(NVME_NSHEAD_DISABLE_QUEUEING, &head->flags);
if (rc) {
clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 50515ad0f9d6..f45ca7c45fd2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -493,6 +493,7 @@ struct nvme_ns_head {
struct mutex lock;
unsigned long flags;
#define NVME_NSHEAD_DISK_LIVE 0
+#define NVME_NSHEAD_DISABLE_QUEUEING 1
struct nvme_ns __rcu *current_path[];
#endif
};
--
2.35.3
* Re: [PATCH 3/3] nvme-multipath: skip inaccessible paths during partition scan
From: Sagi Grimberg @ 2024-09-11 11:24 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
Hannes, if this patch fixes a bug, please phrase the title to reflect that.
On 11/09/2024 12:51, Hannes Reinecke wrote:
> When a path is switched to 'inaccessible' during partition scan
> triggered via device_add_disk() and we only have one path the
> system will be stuck as nvme_available_path() will always return
> 'true'. So I/O will never be completed and the system is stuck
> in device_add_disk().
> This patch introduces a flag NVME_NSHEAD_DISABLE_QUEUEING to
> cause nvme_available_path() to always return NULL, and with
> that I/O to be failed if all paths are unavailable.
So what will happen if a new device comes along, and the first
path it connects to is 'inaccessible'? What will scan partitions?
I don't think that this approach is the right one.
Effectively, there is a semantic decision here, is 'inaccessible' a
temporary state or not, if it is we should queue IO knowing that
it may change in the future, and if not, we should fail it.
ANA inaccessible is semantically temporary. It's just that your test
treats it as permanent. For this case you have fast_io_fail_tmo, which is
designed to give up also in cases where there is hope that any path will
become online in the future.
I do agree that if the user disconnects the last path, should it be
inaccessible or not, it should succeed, and cause the queued IO to fail
immediately.
* Re: [PATCH 3/3] nvme-multipath: skip inaccessible paths during partition scan
From: Hannes Reinecke @ 2024-09-11 12:04 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/11/24 13:24, Sagi Grimberg wrote:
> Hannes, if this patch fixes a bug, please phrase the title to reflect that.
>
>
> On 11/09/2024 12:51, Hannes Reinecke wrote:
>> When a path is switched to 'inaccessible' during partition scan
>> triggered via device_add_disk() and we only have one path the
>> system will be stuck as nvme_available_path() will always return
>> 'true'. So I/O will never be completed and the system is stuck
>> in device_add_disk().
>> This patch introduces a flag NVME_NSHEAD_DISABLE_QUEUEING to
>> cause nvme_available_path() to always return NULL, and with
>> that I/O to be failed if all paths are unavailable.
>
> So what will happen if a new device comes along, and the first
> path it connects to is 'inaccessible'? What will scan partitions?
>
Nothing will. But nothing does today either, as we never call
nvme_mpath_set_live() in that case:

nvme_mpath_add_disk()
-> nvme_update_ns_ana_state():

	if (nvme_state_is_live(ns->ana_state) &&
	    nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
		nvme_mpath_set_live(ns);

> I don't think that this approach is the right one.
> Effectively, there is a semantic decision here, is 'inaccessible' a
> temporary state or not, if it is we should queue IO knowing that
> it may change in the future, and if not, we should fail it.
>
Yes.
I would argue that we should never requeue I/O if it's triggered
from partition scan, i.e. from within device_add_disk().
Reasoning is as follows:
We never call nvme_mpath_set_live() during initial scan if all paths are
inaccessible.
So when a path is 'optimized' / 'non-optimized' and the path state
changes to 'inaccessible' while device_add_disk() is running, we are
perfectly fine to disable the device (and kill all I/O), as this is
what would have happened if the path had been inaccessible originally.
> ANA inaccessible is semantically temporary. It's just that your test
> treats it as permanent. For this case you have fast_io_fail_tmo, which is
> designed to give up also in cases where there is hope that any path will
> become online in the future.
>
It's not just ANA inaccessible. We're facing a similar problem if the
target returns PATH_ERROR during scanning; then we're constantly failing
over I/O to the next path, returning PATH_ERROR, failing over to the
next path, ...
> I do agree that if the user disconnects the last path, should it be
> inaccessible or not, it should succeed, and cause the queued IO to fail
> immediately.
I'll check if that would work for my testcases.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich