* [RFC] nvme-mpath: delete disk after last connection
From: Keith Busch @ 2020-09-25 21:38 UTC
To: linux-nvme, hch, sagi; +Cc: Keith Busch, Hannes Reinecke
I have this tagged "RFC" because I'm not sure if there's a reason why the code
is done the way that it is today.
The multipath code currently deletes the disk only after all references
to it are dropped rather than when the last path to that disk is lost.
This has been reported to cause problems with some usage, like MD RAID.
Delete the disk when the last path is gone. This is the same behavior we
currently have with non-multipathed nvme devices.
The following is a simple example that demonstrates what is currently
observed using an nvme loopback target (loop setup file not shown):
# nvmetcli restore loop.json
[ 31.156452] nvmet: adding nsid 1 to subsystem testnqn1
[ 31.159140] nvmet: adding nsid 1 to subsystem testnqn2
# nvme connect -t loop -n testnqn1 -q hostnqn
[ 36.866302] nvmet: creating controller 1 for subsystem testnqn1 for NQN hostnqn.
[ 36.872926] nvme nvme3: new ctrl: "testnqn1"
# nvme connect -t loop -n testnqn1 -q hostnqn
[ 38.227186] nvmet: creating controller 2 for subsystem testnqn1 for NQN hostnqn.
[ 38.234450] nvme nvme4: new ctrl: "testnqn1"
# nvme connect -t loop -n testnqn2 -q hostnqn
[ 43.902761] nvmet: creating controller 3 for subsystem testnqn2 for NQN hostnqn.
[ 43.907401] nvme nvme5: new ctrl: "testnqn2"
# nvme connect -t loop -n testnqn2 -q hostnqn
[ 44.627689] nvmet: creating controller 4 for subsystem testnqn2 for NQN hostnqn.
[ 44.641773] nvme nvme6: new ctrl: "testnqn2"
# mdadm --create /dev/md0 --level=mirror --raid-devices=2 /dev/nvme3n1 /dev/nvme5n1
[ 53.497038] md/raid1:md0: active with 2 out of 2 mirrors
[ 53.501717] md0: detected capacity change from 0 to 66060288
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 nvme5n1[1] nvme3n1[0]
64512 blocks super 1.2 [2/2] [UU]
Now delete all paths to one of the namespaces:
# echo 1 > /sys/class/nvme/nvme3/delete_controller
# echo 1 > /sys/class/nvme/nvme4/delete_controller
We have no path, but mdstat says:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 nvme5n1[1] nvme3n1[0]
      64512 blocks super 1.2 [2/2] [UU]
And this is reported to cause a problem, since md is never notified that
the member device is gone.
With the proposed patch, the following messages appear:
[ 227.516807] md/raid1:md0: Disk failure on nvme3n1, disabling device.
[ 227.516807] md/raid1:md0: Operation continuing on 1 devices.
And mdstat shows only the viable members:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active (auto-read-only) raid1 nvme5n1[1]
64512 blocks super 1.2 [2/1] [_U]
Reported-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c      | 3 ++-
 drivers/nvme/host/multipath.c | 1 -
 drivers/nvme/host/nvme.h      | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4857168f71f2..a2faa3625e39 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -475,7 +475,8 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
-	nvme_mpath_remove_disk(head);
+	if (head->disk)
+		put_disk(head->disk);
 	ida_simple_remove(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 74896be40c17..55045291b4de 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -697,7 +697,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 		 */
 		head->disk->queue = NULL;
 	}
-	put_disk(head->disk);
 }
 
 int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a42b75869213..745cda1a63fd 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -670,7 +670,7 @@ static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
 	struct nvme_ns_head *head = ns->head;
 
 	if (head->disk && list_empty(&head->list))
-		kblockd_schedule_work(&head->requeue_work);
+		nvme_mpath_remove_disk(head);
 }
 
 static inline void nvme_trace_bio_complete(struct request *req,
--
2.24.1

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Hannes Reinecke @ 2020-09-26 11:16 UTC
To: Keith Busch, linux-nvme, hch, sagi
On 9/25/20 11:38 PM, Keith Busch wrote:
> I have this tagged "RFC" because I'm not sure if there's a reason why the code
> is done the way that it is today.
>
> The multipath code currently deletes the disk only after all references
> to it are dropped rather than when the last path to that disk is lost.
> This has been reported to cause problems with some usage, like MD RAID.
>
> Delete the disk when the last path is gone. This is the same behavior we
> currently have with non-multipathed nvme devices.
>
> [...]
I'm okay with that in general, but then again we might run into
situations where an 'all paths down' scenario is actually expected
(think of a temporary network outage on nvme-tcp).
So I guess we need to introduce an additional setting
(queue_if_no_path?) to be specified during the initial connection.
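Roughly along these lines (just a sketch; the per-head queue_if_no_path
flag and however it would get filled in from the connect options are
hypothetical, not existing code):

static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	if (!head->disk || !list_empty(&head->list))
		return;

	if (head->queue_if_no_path)	/* hypothetical per-head option */
		kblockd_schedule_work(&head->requeue_work);
	else
		nvme_mpath_remove_disk(head);
}

With queue_if_no_path set, an 'all paths down' interval would behave
exactly as today; without it we'd get the removal behaviour from this
patch.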
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Christoph Hellwig @ 2020-09-27 6:07 UTC
To: Hannes Reinecke; +Cc: Keith Busch, hch, linux-nvme, sagi
On Sat, Sep 26, 2020 at 01:16:51PM +0200, Hannes Reinecke wrote:
> I'm okay with that in general, but then again we might run into situations
> where an 'all paths down' scenario is actually expected (think of a
> temporary network outage on nvme-tcp).
> So I guess we need to introduce an additional setting (queue_if_no_path?)
> to be specified during the initial connection.
The original design was pretty much intentional, as an all-paths-down
event usually is temporary. So any change in behavior should be based
on an option instead of a change in default. And I think the right
way to implement this would be a timer for when to take the gendisk
down, with the default remaining infinity.
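A minimal sketch of that idea (the remove_work / remove_jiffies fields
and their wiring are made up here purely for illustration):

/*
 * Hypothetical: head->remove_work is a delayed_work and
 * head->remove_jiffies the configured timeout, with 0 meaning
 * "never take the disk down" (today's behaviour).
 */
static void nvme_mpath_remove_work(struct work_struct *work)
{
	struct nvme_ns_head *head =
		container_of(to_delayed_work(work), struct nvme_ns_head,
			     remove_work);

	if (list_empty(&head->list))
		nvme_mpath_remove_disk(head);
}

static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	if (head->disk && list_empty(&head->list)) {
		kblockd_schedule_work(&head->requeue_work);
		if (head->remove_jiffies)
			schedule_delayed_work(&head->remove_work,
					      head->remove_jiffies);
	}
}

A path coming back before the timer fires would simply
cancel_delayed_work() on it.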

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Keith Busch @ 2020-09-28 14:11 UTC
To: Christoph Hellwig; +Cc: Hannes Reinecke, linux-nvme, sagi
On Sun, Sep 27, 2020 at 08:07:09AM +0200, Christoph Hellwig wrote:
> On Sat, Sep 26, 2020 at 01:16:51PM +0200, Hannes Reinecke wrote:
> > I'm okay with that in general, but then again we might run into situations
> > where an 'all paths down' scenario is actually expected (think of a
> > temporary network outage on nvme-tcp).
> > So I guess we need to introduce an additional setting (queue_if_no_path?)
> > to be specified during the initial connection.
>
> The original design was pretty much intentional, as an all-paths-down
> event usually is temporary. So any change in behavior should be based
> on an option instead of a change in default. And I think the right
> way to implement this would be a timer for when to take the gendisk
> down, with the default remaining infinity.
Ok, we can let the timer be an option for fabrics so it can be specific
to the subsystem you're connecting to. The timeout for pcie targets
probably has to be a module parameter, though.
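For the non-fabrics case that could look something like this (parameter
name invented for illustration only):

static unsigned long mpath_remove_tmo_secs;	/* 0 == keep the disk forever (current default) */
module_param(mpath_remove_tmo_secs, ulong, 0644);
MODULE_PARM_DESC(mpath_remove_tmo_secs,
	"seconds to keep a multipath disk after the last path is lost (0 = forever)");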

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Sagi Grimberg @ 2020-09-29 8:28 UTC
To: Keith Busch, Christoph Hellwig; +Cc: Hannes Reinecke, linux-nvme
>>> I'm okay with that in general, but then again we might run into situations
>>> where an 'all paths down' scenario is actually expected (think of a
>>> temporary network outage on nvme-tcp).
>>> So I guess we need to introduce an additional setting (queue_if_no_path?)
>>> to be specified during the initial connection.
>>
>> The original design was pretty much intentional, as an all-paths-down
>> event usually is temporary. So any change in behavior should be based
>> on an option instead of a change in default. And I think the right
>> way to implement this would be a timer for when to take the gendisk
>> down, with the default remaining infinity.
>
> Ok, we can let the timer be an option for fabrics so it can be specific
> to the subsystem you're connecting to. The timeout for pcie targets
> probably has to be a module parameter, though.
Something here is not clear to me, we are not really talking about "all
paths down" but rather "all paths lost", which should take the gendisk
down AFAIK.
Keith, shouldn't we modify the gendisk reference manipulation in
nvme_mpath_remove_disk instead of moving it to the call sites?
I'm still unsure why we need a timer for this. If a path is removed
(e.g. disconnected), we shouldn't keep the gendisk up.

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Christoph Hellwig @ 2020-09-29 18:08 UTC
To: Sagi Grimberg; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, Hannes Reinecke
On Tue, Sep 29, 2020 at 01:28:51AM -0700, Sagi Grimberg wrote:
> Something here is not clear to me, we are not really talking about "all
> paths down" but rather "all paths lost", which should take the gendisk
> down AFAIK.
>
> Keith, shouldn't we modify the gendisk reference manipulation in
> nvme_mpath_remove_disk instead of moving it to the call sites?
>
> I'm still unsure why we need a timer for this. If a path is removed
> (e.g. disconnected), we shouldn't keep the gendisk up.
True. And I guess the right place to deal with timers is how long
we are going to keep retrying (or at least keep the underlying disks alive).

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Sagi Grimberg @ 2020-09-29 18:16 UTC
To: Christoph Hellwig; +Cc: Keith Busch, Hannes Reinecke, linux-nvme
>> Something here is not clear to me, we are not really talking about "all
>> paths down" but rather "all paths lost", which should take the gendisk
>> down AFAIK.
>>
>> Keith, shouldn't we modify the gendisk reference manipulation in
>> nvme_mpath_remove_disk instead of moving it to the call sites?
>>
>> I'm still unsure why we need a timer for this. If a path is removed
>> (e.g. disconnected), we shouldn't keep the gendisk up.
>
> True. And I guess the right place to deal with timers is how long
> we are going to keep retrying (or at least keep the underlying disks alive).
Which is orthogonal to this specific patch, right?
As for timers, I wouldn't want to introduce a modparam for that. Perhaps
the correct way to do this is with what Victor suggested where mpath
fastfail is really the longest controller fastfail...
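In other words something like taking the maximum over the sibling
controllers, roughly (a sketch only; fast_fail_tmo names a hypothetical
per-controller option, not an existing field):

/*
 * Sketch: use the longest per-controller fast-fail timeout as the point
 * at which the multipath gendisk itself gives up.
 */
static unsigned long nvme_mpath_fail_tmo(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;
	unsigned long tmo = 0;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings)
		tmo = max(tmo, (unsigned long)ns->ctrl->fast_fail_tmo);
	srcu_read_unlock(&head->srcu, srcu_idx);

	return tmo;
}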

* Re: [RFC] nvme-mpath: delete disk after last connection
From: Hannes Reinecke @ 2020-09-30 5:36 UTC
To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 9/29/20 8:16 PM, Sagi Grimberg wrote:
>
>>> Something here is not clear to me, we are not really talking about "all
>>> paths down" but rather "all paths lost", which should take the gendisk
>>> down AFAIK.
>>>
>>> Keith, shouldn't we modify the gendisk reference manipulation in
>>> nvme_mpath_remove_disk instead of moving it to the call sites?
>>>
>>> I'm still unsure why we need a timer for this. If a path is removed
>>> (e.g. disconnected), we shouldn't keep the gendisk up.
>>
>> True. And I guess the right place to deal with timers is how long
>> we are going to keep retrying (or at least keep the underlying disks alive).
>
> Which is orthogonal to this specific patch right?
>
> As for timers, I wouldn't want to introduce a modparam for that. Perhaps
> the correct way to do this is with what Victor suggested where mpath
> fastfail is really the longest controller fastfail...
Yes. If we're talking timers we should be looking at fastfail, as this is
really what we want here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer