* [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk()
@ 2024-10-07 10:01 Hannes Reinecke
2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I have a testcase which repeatedly disables a namespace on the target,
assigns a new UUID (to simulate namespace remapping), and enables that
namespace again.
To throw in more fun, these namespaces have their ANA group IDs changed
to simulate namespaces moving around in a cluster, where only the paths
local to the cluster node are active, and all other paths are inaccessible.
Essentially it's doing something like:
echo 0 > ${ns}/enable
<random delay>
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable
i.e. a testcase similar to the one in the previous patchset, only this
time I'm just toggling 'enable'/'disable' without removing the namespace
from the target.
This is causing lockups in device_add_disk(), as the partition scan is
constantly retrying I/O and never completes.
Funnily enough the very same issue should have been fixed with
ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work
and/or ana_work"), but that fix seems to be imperfect.
As usual, comments and reviews are welcome.
Hannes Reinecke (3):
nvme-multipath: simplify loop in nvme_update_ana_state()
nvme-multipath: cannot disconnect controller on stuck partition scan
nvme-multipath: skip failed paths during partition scan
drivers/nvme/host/multipath.c | 51 ++++++++++++++++++++++++++---------
drivers/nvme/host/nvme.h | 1 +
2 files changed, 40 insertions(+), 12 deletions(-)
--
2.35.3
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() 2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke @ 2024-10-07 10:01 ` Hannes Reinecke 2024-10-07 15:46 ` Keith Busch 2024-10-20 23:33 ` Sagi Grimberg 2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke 2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke 2 siblings, 2 replies; 25+ messages in thread From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke Currently nvme_update_ana_state() iterates over all namespaces with the srcu lock held, and calls the 'cb' function if a match on the list of namespaces from the ANA log is found. This assumes that the iteration itself is relatively quick, and the namespace state itself doesn't change during iteration. For more complex functions (e.g. if the 'cb' function triggers mpath_set_live() which then kicks off a partition scan) the namespace state might change while iterating, and the rcu-protected area becomes really long. So change the loop to iterate over the entries in the ANA log, and call nvme_find_get_ns() on each entry. With that the 'cb' function is executed outside of the RCU protected area, and normal reference counting rules apply. 
Signed-off-by: Hannes Reinecke <hare@kernel.org> --- drivers/nvme/host/multipath.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f72c5a6a2d8e..61f8ae199288 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -745,7 +745,6 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0; unsigned *nr_change_groups = data; struct nvme_ns *ns; - int srcu_idx; dev_dbg(ctrl->device, "ANA group %d: %s.\n", le32_to_cpu(desc->grpid), @@ -757,21 +756,16 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, if (!nr_nsids) return 0; - srcu_idx = srcu_read_lock(&ctrl->srcu); - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) { + for(n = 0; n < nr_nsids; n++) { unsigned nsid; -again: + nsid = le32_to_cpu(desc->nsids[n]); - if (ns->head->ns_id < nsid) - continue; - if (ns->head->ns_id == nsid) + ns = nvme_find_get_ns(ctrl, nsid); + if (ns) { nvme_update_ns_ana_state(desc, ns); - if (++n == nr_nsids) - break; - if (ns->head->ns_id > nsid) - goto again; + nvme_put_ns(ns); + } } - srcu_read_unlock(&ctrl->srcu, srcu_idx); return 0; } -- 2.35.3 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() 2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke @ 2024-10-07 15:46 ` Keith Busch 2024-10-08 6:39 ` Christoph Hellwig 2024-10-20 23:33 ` Sagi Grimberg 1 sibling, 1 reply; 25+ messages in thread From: Keith Busch @ 2024-10-07 15:46 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme On Mon, Oct 07, 2024 at 12:01:32PM +0200, Hannes Reinecke wrote: > Currently nvme_update_ana_state() iterates over all namespaces > with the srcu lock held, and calls the 'cb' function if a > match on the list of namespaces from the ANA log is found. > This assumes that the iteration itself is relatively quick, > and the namespace state itself doesn't change during iteration. > For more complex functions (eg if the 'cb' function triggers > mpath_set_live() which then kicks off a partition scan) the > namespace state might change while iterating, and the rcu-protected > area becomes really long. > So change the loop to iterated over the entries in the ANA log, > and call nvme_find_get_ns() on each entry. > With that the 'cb' function is executed outside of the RCU > protected area, and normal reference counting rules apply. This patch looks good to me. Your commit log is word wrapping a bit short, though. We standardize on 72 characters. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() 2024-10-07 15:46 ` Keith Busch @ 2024-10-08 6:39 ` Christoph Hellwig 0 siblings, 0 replies; 25+ messages in thread From: Christoph Hellwig @ 2024-10-08 6:39 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Mon, Oct 07, 2024 at 09:46:16AM -0600, Keith Busch wrote: > This patch looks good to me. > > Your commit log is word wrapping a bit short, though. We standardize on > 72 characters. The for loop is also missing a space before the opening brace while we're nitpicking. The reason the code was written this way is that when there are a lot of namespaces in an update we now need to do a full list scan for each one, which can add a lot of overhead. So I guess we'll need to see why it is important to change it to counter that tradeoff. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() 2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke 2024-10-07 15:46 ` Keith Busch @ 2024-10-20 23:33 ` Sagi Grimberg 1 sibling, 0 replies; 25+ messages in thread From: Sagi Grimberg @ 2024-10-20 23:33 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme On 07/10/2024 13:01, Hannes Reinecke wrote: > Currently nvme_update_ana_state() iterates over all namespaces > with the srcu lock held, and calls the 'cb' function if a > match on the list of namespaces from the ANA log is found. > This assumes that the iteration itself is relatively quick, > and the namespace state itself doesn't change during iteration. > For more complex functions (eg if the 'cb' function triggers > mpath_set_live() which then kicks off a partition scan) the > namespace state might change while iterating, and the rcu-protected > area becomes really long. > So change the loop to iterated over the entries in the ANA log, > and call nvme_find_get_ns() on each entry. > With that the 'cb' function is executed outside of the RCU > protected area, and normal reference counting rules apply. 
> > Signed-off-by: Hannes Reinecke <hare@kernel.org> > --- > drivers/nvme/host/multipath.c | 18 ++++++------------ > 1 file changed, 6 insertions(+), 12 deletions(-) > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index f72c5a6a2d8e..61f8ae199288 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -745,7 +745,6 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, > u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0; > unsigned *nr_change_groups = data; > struct nvme_ns *ns; > - int srcu_idx; > > dev_dbg(ctrl->device, "ANA group %d: %s.\n", > le32_to_cpu(desc->grpid), > @@ -757,21 +756,16 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl, > if (!nr_nsids) > return 0; > > - srcu_idx = srcu_read_lock(&ctrl->srcu); > - list_for_each_entry_rcu(ns, &ctrl->namespaces, list) { > + for(n = 0; n < nr_nsids; n++) { > unsigned nsid; > -again: > + > nsid = le32_to_cpu(desc->nsids[n]); > - if (ns->head->ns_id < nsid) > - continue; > - if (ns->head->ns_id == nsid) > + ns = nvme_find_get_ns(ctrl, nsid); > + if (ns) { > nvme_update_ns_ana_state(desc, ns); > - if (++n == nr_nsids) > - break; > - if (ns->head->ns_id > nsid) > - goto again; > + nvme_put_ns(ns); > + } > } > - srcu_read_unlock(&ctrl->srcu, srcu_idx); You also made the loop now n^2... which *may* be an issue for an environment with a lot of namespaces. It is simpler though... Reviewed-by: Sagi Grimberg <sagi@grimberg.me> ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke 2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke @ 2024-10-07 10:01 ` Hannes Reinecke 2024-10-07 18:19 ` Keith Busch 2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke 2 siblings, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke When a namespace state is changed during partition scan triggered via nvme_scan_ns->nvme_mpath_set_live()->device_add_disk() I/O might be returned with a path error, causing it to be retried on other paths. But if this happens to be the last path the process will be stuck. Trying to disconnect this controller will call nvme_unquiesce_io_queues() and flush_work(&ctrl->scan_work), where the first should abort/retry all I/O pending during scan such that the following 'flush_work' can succeed. However, we explicitly do _not_ ignore paths from deleted controllers in nvme_path_is_disabled(), so that I/O on these devices will be _retried_, not aborted, and the scanning process continues to be stuck. So the process to disconnect the controller will be stuck in flush_work(), and that controller and all namespaces become unusable until the system is rebooted. 
Fixes: ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work") Signed-off-by: Hannes Reinecke <hare@kernel.org> --- drivers/nvme/host/multipath.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 61f8ae199288..f03ef983a75f 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns) { enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl); + /* + * Skip deleted controllers for I/O from partition scan + */ + if (state == NVME_CTRL_DELETING && + mutex_is_locked(&ns->ctrl->scan_lock)) + return true; + /* * We don't treat NVME_CTRL_DELETING as a disabled path as I/O should * still be able to complete assuming that the controller is connected. -- 2.35.3 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke @ 2024-10-07 18:19 ` Keith Busch 2024-10-08 6:43 ` Christoph Hellwig 2024-10-08 7:17 ` Hannes Reinecke 0 siblings, 2 replies; 25+ messages in thread From: Keith Busch @ 2024-10-07 18:19 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote: > @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns) > { > enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl); > > + /* > + * Skip deleted controllers for I/O from partition scan > + */ > + if (state == NVME_CTRL_DELETING && > + mutex_is_locked(&ns->ctrl->scan_lock)) > + return true; This feels off to me, using these seemingly unrelated dependencies to make these kinds of decisions. We talked a couple weeks ago about suppressing the partition scanning during nvme scan_work. I know you said there was some reason it wouldn't work, but could you check the below? It seems okay to me. 
--- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 48e7a8906d012..82cb1eb3a773b 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work) container_of(work, struct nvme_ns_head, requeue_work); struct bio *bio, *next; + if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) { + mutex_lock(&head->disk->open_mutex); + bdev_disk_changed(head->disk, false); + mutex_unlock(&head->disk->open_mutex); + } + spin_lock_irq(&head->requeue_lock); next = bio_list_get(&head->requeue_list); spin_unlock_irq(&head->requeue_lock); @@ -629,6 +635,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) return PTR_ERR(head->disk); head->disk->fops = &nvme_ns_head_ops; head->disk->private_data = head; + set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); sprintf(head->disk->disk_name, "nvme%dn%d", ctrl->subsys->instance, head->instance); return 0; -- ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-07 18:19 ` Keith Busch @ 2024-10-08 6:43 ` Christoph Hellwig 2024-10-08 7:17 ` Hannes Reinecke 0 siblings, 0 replies; 25+ messages in thread From: Christoph Hellwig @ 2024-10-08 6:43 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Mon, Oct 07, 2024 at 12:19:23PM -0600, Keith Busch wrote: > On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote: > > @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns) > > { > > enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl); > > > > + /* > > + * Skip deleted controllers for I/O from partition scan > > + */ > > + if (state == NVME_CTRL_DELETING && > > + mutex_is_locked(&ns->ctrl->scan_lock)) > > + return true; > > This feels off to me, using these seemingly unrelated dependencies to > make these kinds of decisions. > > We talked a couple weeks ago about suppressing the parition scanning > during nvme scan_work. I know you said there was some reason it wouldn't > work, but could you check the below? It seems okay to me. > > --- > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 48e7a8906d012..82cb1eb3a773b 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work) > container_of(work, struct nvme_ns_head, requeue_work); > struct bio *bio, *next; > > + if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) { > + mutex_lock(&head->disk->open_mutex); > + bdev_disk_changed(head->disk, false); > + mutex_unlock(&head->disk->open_mutex); > + } > + Requeue_work feels a little counterintuitive for the partition scanning. I guess this should work because requeue_work is scheduled from the end of nvme_mpath_set_live and gives us a context outside of scan_lock. It'll need really good comments to explain this. 
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-07 18:19 ` Keith Busch 2024-10-08 6:43 ` Christoph Hellwig @ 2024-10-08 7:17 ` Hannes Reinecke 2024-10-08 20:41 ` Keith Busch 1 sibling, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-08 7:17 UTC (permalink / raw) To: Keith Busch, Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/7/24 20:19, Keith Busch wrote: > On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote: >> @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns) >> { >> enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl); >> >> + /* >> + * Skip deleted controllers for I/O from partition scan >> + */ >> + if (state == NVME_CTRL_DELETING && >> + mutex_is_locked(&ns->ctrl->scan_lock)) >> + return true; > > This feels off to me, using these seemingly unrelated dependencies to > make these kinds of decisions. > > We talked a couple weeks ago about suppressing the parition scanning > during nvme scan_work. I know you said there was some reason it wouldn't > work, but could you check the below? It seems okay to me. 
> > --- > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 48e7a8906d012..82cb1eb3a773b 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work) > container_of(work, struct nvme_ns_head, requeue_work); > struct bio *bio, *next; > > + if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) { > + mutex_lock(&head->disk->open_mutex); > + bdev_disk_changed(head->disk, false); > + mutex_unlock(&head->disk->open_mutex); > + } > + > spin_lock_irq(&head->requeue_lock); > next = bio_list_get(&head->requeue_list); > spin_unlock_irq(&head->requeue_lock); > @@ -629,6 +635,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) > return PTR_ERR(head->disk); > head->disk->fops = &nvme_ns_head_ops; > head->disk->private_data = head; > + set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); > sprintf(head->disk->disk_name, "nvme%dn%d", > ctrl->subsys->instance, head->instance); > return 0; > -- Hmm. Sadly, not really. Still stuck in partition scanning. Guess we'll need to check if we have paths available before triggering partition scan; let me check ... Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-08 7:17 ` Hannes Reinecke @ 2024-10-08 20:41 ` Keith Busch 2024-10-09 6:23 ` Hannes Reinecke 0 siblings, 1 reply; 25+ messages in thread From: Keith Busch @ 2024-10-08 20:41 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote: > Hmm. Sadly, not really. Still stuck in partition scanning. > > Guess we'll need to check if we have paths available before triggering > partition scan; let me check ... I think you must have two paths, and the 2nd path cleared the suppress bit before the 1st one finished its call to device_add_disk(). How about this one? --- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 48e7a8906d012..755495f387168 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -586,6 +586,13 @@ static void nvme_requeue_work(struct work_struct *work) container_of(work, struct nvme_ns_head, requeue_work); struct bio *bio, *next; + if (disk_live(head->disk) && + test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) { + mutex_lock(&head->disk->open_mutex); + bdev_disk_changed(head->disk, false); + mutex_unlock(&head->disk->open_mutex); + } + spin_lock_irq(&head->requeue_lock); next = bio_list_get(&head->requeue_list); spin_unlock_irq(&head->requeue_lock); @@ -629,6 +636,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) return PTR_ERR(head->disk); head->disk->fops = &nvme_ns_head_ops; head->disk->private_data = head; + set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); sprintf(head->disk->disk_name, "nvme%dn%d", ctrl->subsys->instance, head->instance); return 0; -- ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-08 20:41 ` Keith Busch @ 2024-10-09 6:23 ` Hannes Reinecke 2024-10-09 16:33 ` Keith Busch ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Hannes Reinecke @ 2024-10-09 6:23 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/8/24 22:41, Keith Busch wrote: > On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote: >> Hmm. Sadly, not really. Still stuck in partition scanning. >> >> Guess we'll need to check if we have paths available before triggering >> partition scan; let me check ... > > I think you must have two paths, and the 2nd path cleared the suppress > bit before the 1st one finished its called to device_add_disk(). How > about this one? > Nope. Still stuck, this time in bdev_disk_changed(). With my testcase _all_ paths return NS_NOT_READY during partition scan, so I/O is constantly bounced between paths, and partition scan never returns. Doesn't matter where you call it, it's stuck. The only chance we have is to modify I/O handling during scanning (cf my new patchset). Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 6:23 ` Hannes Reinecke @ 2024-10-09 16:33 ` Keith Busch 2024-10-09 17:10 ` Keith Busch 2024-10-09 17:32 ` Keith Busch 2 siblings, 0 replies; 25+ messages in thread From: Keith Busch @ 2024-10-09 16:33 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: > On 10/8/24 22:41, Keith Busch wrote: > > On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote: > > > Hmm. Sadly, not really. Still stuck in partition scanning. > > > > > > Guess we'll need to check if we have paths available before triggering > > > partition scan; let me check ... > > > > I think you must have two paths, and the 2nd path cleared the suppress > > bit before the 1st one finished its called to device_add_disk(). How > > about this one? > > > Nope. Still stuck, this time in bdev_disc_changed(). > With my testcase _all_ paths return NS_NOT_READY during partition scan, so > I/O is constantly bounced between paths, and partition scan never returns. > Doesn't matter where you call it, it's stuck. Oh right... We need the requeue_work to end the bio's when the disk is dead, but the work can't proceed because it's waiting on its own bio's. Darn, I guess this scheme would need yet another work_queue to do it. > The only chance we have is to modify I/O handling during scanning > (cf my new patchset). I hoped we could avoid special error handling paths, but I'll take another look at that approach. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 6:23 ` Hannes Reinecke 2024-10-09 16:33 ` Keith Busch @ 2024-10-09 17:10 ` Keith Busch 2024-10-09 17:32 ` Keith Busch 2 siblings, 0 replies; 25+ messages in thread From: Keith Busch @ 2024-10-09 17:10 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: > With my testcase _all_ paths return NS_NOT_READY during partition scan, so > I/O is constantly bounced between paths, and partition scan never returns. > Doesn't matter where you call it, it's stuck. I'm trying to recreate this so just want to clarify the return status. NS_NOT_READY isn't a path error, so doesn't use failover. It eventually exhausts its retries, so I don't see a problem with that status. If I synthesize a pathing error though, then yeah, I can recreate a problem. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 6:23 ` Hannes Reinecke 2024-10-09 16:33 ` Keith Busch 2024-10-09 17:10 ` Keith Busch @ 2024-10-09 17:32 ` Keith Busch 2024-10-10 6:16 ` Hannes Reinecke ` (2 more replies) 2 siblings, 3 replies; 25+ messages in thread From: Keith Busch @ 2024-10-09 17:32 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: > With my testcase _all_ paths return NS_NOT_READY during partition scan, so > I/O is constantly bounced between paths, and partition scan never returns. > Doesn't matter where you call it, it's stuck. Assuming you mean a different status, like NVME_SC_INTERNAL_PATH_ERROR, I can recreate a stuck scan. Let me get one more shot at fixing this by suppressing the scan to another context. This one is successful in my tests: --- diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 48e7a8906d012..e1fd53ff2c0ca 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head) return ret; } +static void nvme_scan_work(struct work_struct *work) +{ + struct nvme_ns_head *head = + container_of(work, struct nvme_ns_head, scan_work); + + if (WARN_ON_ONCE(!test_and_clear_bit(GD_SUPPRESS_PART_SCAN, + &head->disk->state))) + return; + + mutex_lock(&head->disk->open_mutex); + bdev_disk_changed(head->disk, false); + mutex_unlock(&head->disk->open_mutex); +} + static void nvme_requeue_work(struct work_struct *work) { struct nvme_ns_head *head = @@ -606,6 +620,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) bio_list_init(&head->requeue_list); spin_lock_init(&head->requeue_lock); INIT_WORK(&head->requeue_work, nvme_requeue_work); + INIT_WORK(&head->scan_work, nvme_scan_work); /* * Add a multipath node if the subsystems 
supports multiple controllers. @@ -629,6 +644,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) return PTR_ERR(head->disk); head->disk->fops = &nvme_ns_head_ops; head->disk->private_data = head; + set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); sprintf(head->disk->disk_name, "nvme%dn%d", ctrl->subsys->instance, head->instance); return 0; @@ -655,6 +671,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) return; } nvme_add_ns_head_cdev(head); + kblockd_schedule_work(&head->scan_work); } mutex_lock(&head->lock); @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) return; if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { nvme_cdev_del(&head->cdev, &head->cdev_device); + /* + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared + * to allow multipath to fail all I/O. + */ + synchronize_srcu(&head->srcu); + kblockd_schedule_work(&head->requeue_work); del_gendisk(head->disk); } - /* - * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared - * to allow multipath to fail all I/O. - */ - synchronize_srcu(&head->srcu); - kblockd_schedule_work(&head->requeue_work); } void nvme_mpath_remove_disk(struct nvme_ns_head *head) @@ -991,6 +1008,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) /* make sure all pending bios are cleaned up */ kblockd_schedule_work(&head->requeue_work); flush_work(&head->requeue_work); + flush_work(&head->scan_work); put_disk(head->disk); } diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 313a4f978a2cf..ff3d092dbdb6d 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -494,6 +494,7 @@ struct nvme_ns_head { struct bio_list requeue_list; spinlock_t requeue_lock; struct work_struct requeue_work; + struct work_struct scan_work; struct mutex lock; unsigned long flags; #define NVME_NSHEAD_DISK_LIVE 0 -- ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 17:32 ` Keith Busch @ 2024-10-10 6:16 ` Hannes Reinecke 2024-10-10 7:18 ` Hannes Reinecke 2024-10-10 8:17 ` Christoph Hellwig 2024-10-10 8:57 ` Hannes Reinecke 2 siblings, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-10 6:16 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/9/24 19:32, Keith Busch wrote: > On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: >> With my testcase _all_ paths return NS_NOT_READY during partition scan, so >> I/O is constantly bounced between paths, and partition scan never returns. >> Doesn't matter where you call it, it's stuck. > > Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I > can recreate a stuck scan. Let me get one more shot at fixing this by > suppressing the scan to another context. This one is successful in my > tests: > > --- > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 48e7a8906d012..e1fd53ff2c0ca 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head) > return ret; [ .. ] Piling workqueues on workqueues ... but anyway, I'll give it a spin. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-10 6:16 ` Hannes Reinecke @ 2024-10-10 7:18 ` Hannes Reinecke 0 siblings, 0 replies; 25+ messages in thread From: Hannes Reinecke @ 2024-10-10 7:18 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/10/24 08:16, Hannes Reinecke wrote: > On 10/9/24 19:32, Keith Busch wrote: >> On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: >>> With my testcase _all_ paths return NS_NOT_READY during partition >>> scan, so >>> I/O is constantly bounced between paths, and partition scan never >>> returns. >>> Doesn't matter where you call it, it's stuck. >> >> Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I >> can recreate a stuck scan. Let me get one more shot at fixing this by >> suppressing the scan to another context. This one is successful in my >> tests: >> >> --- >> diff --git a/drivers/nvme/host/multipath.c >> b/drivers/nvme/host/multipath.c >> index 48e7a8906d012..e1fd53ff2c0ca 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct >> nvme_ns_head *head) >> return ret; > [ .. ] > > Piling workqueues on workqueues ... but anyway, I'll give it a spin. > And it even works. Care to send a proper patch? You can add my 'Tested-by:' to it, too. Thanks for doing this. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 17:32 ` Keith Busch 2024-10-10 6:16 ` Hannes Reinecke @ 2024-10-10 8:17 ` Christoph Hellwig 2024-10-10 8:57 ` Hannes Reinecke 2 siblings, 0 replies; 25+ messages in thread From: Christoph Hellwig @ 2024-10-10 8:17 UTC (permalink / raw) To: Keith Busch Cc: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Wed, Oct 09, 2024 at 11:32:32AM -0600, Keith Busch wrote: > On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: > > With my testcase _all_ paths return NS_NOT_READY during partition scan, so > > I/O is constantly bounced between paths, and partition scan never returns. > > Doesn't matter where you call it, it's stuck. > > Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I > can recreate a stuck scan. Let me get one more shot at fixing this by > suppressing the scan to another context. This one is successful in my > tests: > > --- > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 48e7a8906d012..e1fd53ff2c0ca 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head) > return ret; > } > > +static void nvme_scan_work(struct work_struct *work) scan_work feels a little too generic. Maybe part_scan_work? Otherwise not pretty, but by far the best idea so far, so I guess should go with it. Can you also add a big fat comment explaining the rationale for the actual submission? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-09 17:32 ` Keith Busch 2024-10-10 6:16 ` Hannes Reinecke 2024-10-10 8:17 ` Christoph Hellwig @ 2024-10-10 8:57 ` Hannes Reinecke 2024-10-15 14:33 ` Keith Busch 2 siblings, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-10 8:57 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/9/24 19:32, Keith Busch wrote: > On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote: >> With my testcase _all_ paths return NS_NOT_READY during partition scan, so >> I/O is constantly bounced between paths, and partition scan never returns. >> Doesn't matter where you call it, it's stuck. > > Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I > can recreate a stuck scan. Let me get one more shot at fixing this by > suppressing the scan to another context. This one is successful in my > tests: > > --- > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 48e7a8906d012..e1fd53ff2c0ca 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head) > return ret; > } > > +static void nvme_scan_work(struct work_struct *work) > +{ > + struct nvme_ns_head *head = > + container_of(work, struct nvme_ns_head, scan_work); > + > + if (WARN_ON_ONCE(!test_and_clear_bit(GD_SUPPRESS_PART_SCAN, > + &head->disk->state))) > + return; > + > + mutex_lock(&head->disk->open_mutex); > + bdev_disk_changed(head->disk, false); > + mutex_unlock(&head->disk->open_mutex); > +} > + > static void nvme_requeue_work(struct work_struct *work) > { > struct nvme_ns_head *head = > @@ -606,6 +620,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) > bio_list_init(&head->requeue_list); > spin_lock_init(&head->requeue_lock); > INIT_WORK(&head->requeue_work, 
nvme_requeue_work); > + INIT_WORK(&head->scan_work, nvme_scan_work); > > /* > * Add a multipath node if the subsystems supports multiple controllers. > @@ -629,6 +644,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) > return PTR_ERR(head->disk); > head->disk->fops = &nvme_ns_head_ops; > head->disk->private_data = head; > + set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); > sprintf(head->disk->disk_name, "nvme%dn%d", > ctrl->subsys->instance, head->instance); > return 0; > @@ -655,6 +671,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) > return; > } > nvme_add_ns_head_cdev(head); > + kblockd_schedule_work(&head->scan_work); > } > > mutex_lock(&head->lock); > @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) > return; > if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > nvme_cdev_del(&head->cdev, &head->cdev_device); > + /* > + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared > + * to allow multipath to fail all I/O. > + */ > + synchronize_srcu(&head->srcu); > + kblockd_schedule_work(&head->requeue_work); > del_gendisk(head->disk); > } I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when testing for the condition and a 'clear_bit()' after del_gendisk(). Otherwise we're having a race condition with nvme_mpath_set_live: if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { rc = device_add_disk(&head->subsys->dev, head->disk, nvme_ns_attr_groups); which could sneak in from another controller just after we cleared the NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the same name is already registered. Or nvme_cdev_del() to display nice kernel warnings as the cdev was not registered. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. 
Knoblich
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-10 8:57 ` Hannes Reinecke @ 2024-10-15 14:33 ` Keith Busch 2024-10-15 14:56 ` Hannes Reinecke 0 siblings, 1 reply; 25+ messages in thread From: Keith Busch @ 2024-10-15 14:33 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Thu, Oct 10, 2024 at 10:57:23AM +0200, Hannes Reinecke wrote: > On 10/9/24 19:32, Keith Busch wrote: > > @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) > > return; > > if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > > nvme_cdev_del(&head->cdev, &head->cdev_device); > > + /* > > + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared > > + * to allow multipath to fail all I/O. > > + */ > > + synchronize_srcu(&head->srcu); > > + kblockd_schedule_work(&head->requeue_work); > > del_gendisk(head->disk); > > } > > I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when > testing for the condition and a 'clear_bit()' after del_gendisk(). > > Otherwise we're having a race condition with nvme_mpath_set_live: > > if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { > rc = device_add_disk(&head->subsys->dev, head->disk, > nvme_ns_attr_groups); > > which could sneak in from another controller just after we cleared the > NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the > same name is already registered. > Or nvme_cdev_del() to display nice kernel warnings as the cdev was > not registered. Is that actually happening for you? I don't think it's supposed to happen because nvme_mpath_shutdown_disk only happens if the head is detached from the subsystem, so no other thread should be able to possibly attempt to access that head's disk for a different controller path. That new controller should have to allocate an entirely new head. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-15 14:33 ` Keith Busch @ 2024-10-15 14:56 ` Hannes Reinecke 2024-10-15 15:10 ` Keith Busch 0 siblings, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-15 14:56 UTC (permalink / raw) To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On 10/15/24 16:33, Keith Busch wrote: > On Thu, Oct 10, 2024 at 10:57:23AM +0200, Hannes Reinecke wrote: >> On 10/9/24 19:32, Keith Busch wrote: >>> @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) >>> return; >>> if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >>> nvme_cdev_del(&head->cdev, &head->cdev_device); >>> + /* >>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared >>> + * to allow multipath to fail all I/O. >>> + */ >>> + synchronize_srcu(&head->srcu); >>> + kblockd_schedule_work(&head->requeue_work); >>> del_gendisk(head->disk); >>> } >> >> I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when >> testing for the condition and a 'clear_bit()' after del_gendisk(). >> >> Otherwise we're having a race condition with nvme_mpath_set_live: >> >> if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { >> rc = device_add_disk(&head->subsys->dev, head->disk, >> nvme_ns_attr_groups); >> >> which could sneak in from another controller just after we cleared the >> NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the >> same name is already registered. >> Or nvme_cdev_del() to display nice kernel warnings as the cdev was >> not registered. > > Is that actually happening for you? I don't think it's supposed to > happen because mpath_shutdwon_disk only happens if the head is detached > from the subsystem, so no other thread should be able to possibly > attempt to access that head's disk for a different controller path. That > new controller should have to allocate an entirely new head. Na, forget it. 
Not only doesn't it happen, but we need the 'test_and_set_bit()' at the start otherwise we can't abort I/O. So all's fine with your patch, and it even survives my testcase. (The first to do so even after several runs, I might add :-) Care to send an official patch? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-15 14:56 ` Hannes Reinecke @ 2024-10-15 15:10 ` Keith Busch 2024-10-20 23:37 ` Sagi Grimberg 0 siblings, 1 reply; 25+ messages in thread From: Keith Busch @ 2024-10-15 15:10 UTC (permalink / raw) To: Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme On Tue, Oct 15, 2024 at 04:56:08PM +0200, Hannes Reinecke wrote: > On 10/15/24 16:33, Keith Busch wrote: > > Is that actually happening for you? I don't think it's supposed to > > happen because mpath_shutdwon_disk only happens if the head is detached > > from the subsystem, so no other thread should be able to possibly > > attempt to access that head's disk for a different controller path. That > > new controller should have to allocate an entirely new head. > > Na, forget it. Not only doesn't it happen, but we need the > 'test_and_set_bit()' at the start otherwise we can't abort I/O. > So all's fine with your patch, and it even survives my testcase. > > (The first to do so even after several runs, I might add :-) > > Care to send an official patch? Should be in your mailbox. Here's the link: https://lore.kernel.org/linux-nvme/20241015143136.810779-1-kbusch@meta.com/T/#u ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan 2024-10-15 15:10 ` Keith Busch @ 2024-10-20 23:37 ` Sagi Grimberg 0 siblings, 0 replies; 25+ messages in thread From: Sagi Grimberg @ 2024-10-20 23:37 UTC (permalink / raw) To: Keith Busch, Hannes Reinecke Cc: Hannes Reinecke, Christoph Hellwig, linux-nvme On 15/10/2024 18:10, Keith Busch wrote: > On Tue, Oct 15, 2024 at 04:56:08PM +0200, Hannes Reinecke wrote: >> On 10/15/24 16:33, Keith Busch wrote: >>> Is that actually happening for you? I don't think it's supposed to >>> happen because mpath_shutdwon_disk only happens if the head is detached >>> from the subsystem, so no other thread should be able to possibly >>> attempt to access that head's disk for a different controller path. That >>> new controller should have to allocate an entirely new head. >> Na, forget it. Not only doesn't it happen, but we need the >> 'test_and_set_bit()' at the start otherwise we can't abort I/O. >> So all's fine with your patch, and it even survives my testcase. >> >> (The first to do so even after several runs, I might add :-) >> >> Care to send an official patch? > Should be in your mailbox. Here's the link: > > https://lore.kernel.org/linux-nvme/20241015143136.810779-1-kbusch@meta.com/T/#u I'm glad we didn't take the original version of this patch :) ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 3/3] nvme-multipath: skip failed paths during partition scan 2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke 2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke 2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke @ 2024-10-07 10:01 ` Hannes Reinecke 2024-10-08 6:40 ` Christoph Hellwig 2 siblings, 1 reply; 25+ messages in thread From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke, Hannes Reinecke From: Hannes Reinecke <hare@suse.de> When an I/O error is encountered during scanning (ie when the scan_lock is held) we should avoid using this path until scanning is finished to avoid deadlocks with device_add_disk(). So set a new flag NVME_NS_SCAN_FAILED if a failover happened during scanning, and skip this path in nvme_available_paths(). Then we can check if that bit is set after device_add_disk() returned, and remove the disk again if no available paths are found. That allows the device to be recreated via the 'rescan' sysfs attribute once no I/O errors occur anymore. 
Signed-off-by: Hannes Reinecke <hare@kernel.org> --- drivers/nvme/host/multipath.c | 26 ++++++++++++++++++++++++++ drivers/nvme/host/nvme.h | 1 + 2 files changed, 27 insertions(+) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f03ef983a75f..4113d38606a4 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req) queue_work(nvme_wq, &ns->ctrl->ana_work); } + /* + * Do not use this path during scanning + * to avoid deadlocks in device_add_disk() + */ + if (mutex_is_locked(&ns->ctrl->scan_lock)) + set_bit(NVME_NS_SCAN_FAILED, &ns->flags); + spin_lock_irqsave(&ns->head->requeue_lock, flags); for (bio = req->bio; bio; bio = bio->bi_next) { bio_set_dev(bio, ns->head->disk->part0); @@ -434,6 +441,10 @@ static bool nvme_available_path(struct nvme_ns_head *head) list_for_each_entry_rcu(ns, &head->list, siblings) { if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags)) continue; + if (test_bit(NVME_NS_SCAN_FAILED, &ns->flags) && + mutex_is_locked(&ns->ctrl->scan_lock)) + continue; + switch (nvme_ctrl_state(ns->ctrl)) { case NVME_CTRL_LIVE: case NVME_CTRL_RESETTING: @@ -659,6 +670,20 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags); return; } + /* + * If there is no available path and NVME_NS_SCAN_FAILED is + * set an error occurred during partition scan triggered + * by device_add_disk(), and the disk is most certainly + * not live. 
+ */ + if (!nvme_available_path(head) && + test_and_clear_bit(NVME_NS_SCAN_FAILED, &ns->flags)) { + dev_dbg(ns->ctrl->device, "delete gendisk for nsid %d\n", + head->ns_id); + clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags); + del_gendisk(head->disk); + return; + } nvme_add_ns_head_cdev(head); } @@ -732,6 +757,7 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc, ns->ana_grpid = le32_to_cpu(desc->grpid); ns->ana_state = desc->state; clear_bit(NVME_NS_ANA_PENDING, &ns->flags); + clear_bit(NVME_NS_SCAN_FAILED, &ns->flags); /* * nvme_mpath_set_live() will trigger I/O to the multipath path device * and in turn to this path device. However we cannot accept this I/O diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 50515ad0f9d6..a4f99873ecb7 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -527,6 +527,7 @@ struct nvme_ns { #define NVME_NS_ANA_PENDING 2 #define NVME_NS_FORCE_RO 3 #define NVME_NS_READY 4 +#define NVME_NS_SCAN_FAILED 5 struct cdev cdev; struct device cdev_device; -- 2.35.3 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: skip failed paths during partition scan 2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke @ 2024-10-08 6:40 ` Christoph Hellwig 2024-10-20 23:38 ` Sagi Grimberg 0 siblings, 1 reply; 25+ messages in thread From: Christoph Hellwig @ 2024-10-08 6:40 UTC (permalink / raw) To: Hannes Reinecke Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke On Mon, Oct 07, 2024 at 12:01:34PM +0200, Hannes Reinecke wrote: > From: Hannes Reinecke <hare@suse.de> > > When an I/O error is encountered during scanning (ie when the > scan_lock is held) we should avoid using this path until scanning > is finished to avoid deadlocks with device_add_disk(). > So set a new flag NVME_NS_SCAN_FAILED if a failover happened during > scanning, and skip this path in nvme_available_paths(). > Then we can check if that bit is set after device_add_disk() returned, > and remove the disk again if no available paths are found. > That allows the device to be recreated via the 'rescan' sysfs attribute > once no I/O errors occur anymore. > > Signed-off-by: Hannes Reinecke <hare@kernel.org> > --- > drivers/nvme/host/multipath.c | 26 ++++++++++++++++++++++++++ > drivers/nvme/host/nvme.h | 1 + > 2 files changed, 27 insertions(+) > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index f03ef983a75f..4113d38606a4 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req) > queue_work(nvme_wq, &ns->ctrl->ana_work); > } > > + /* > + * Do not use this path during scanning > + * to avoid deadlocks in device_add_disk() > + */ > + if (mutex_is_locked(&ns->ctrl->scan_lock)) > + set_bit(NVME_NS_SCAN_FAILED, &ns->flags); Err, no. mutex_is_locked is never a valid way to detect the calling context - obviously someone else could also be holding it. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] nvme-multipath: skip failed paths during partition scan 2024-10-08 6:40 ` Christoph Hellwig @ 2024-10-20 23:38 ` Sagi Grimberg 0 siblings, 0 replies; 25+ messages in thread From: Sagi Grimberg @ 2024-10-20 23:38 UTC (permalink / raw) To: Christoph Hellwig, Hannes Reinecke Cc: Keith Busch, linux-nvme, Hannes Reinecke >> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >> index f03ef983a75f..4113d38606a4 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req) >> queue_work(nvme_wq, &ns->ctrl->ana_work); >> } >> >> + /* >> + * Do not use this path during scanning >> + * to avoid deadlocks in device_add_disk() >> + */ >> + if (mutex_is_locked(&ns->ctrl->scan_lock)) >> + set_bit(NVME_NS_SCAN_FAILED, &ns->flags); > Err, no. mutex_is_locked is never a valid way to detect the calling > context - obviously someone else could also be holding it. > Strongly agree here. ^ permalink raw reply [flat|nested] 25+ messages in thread