* [PATCH 0/2] nvme: fix regression with MD RAID
@ 2021-02-23 11:59 Hannes Reinecke
From: Hannes Reinecke @ 2021-02-23 11:59 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-nvme, Sagi Grimberg, Keith Busch, Hannes Reinecke
Hi all,
ever since its implementation, NVMe-oF has not worked together with MD RAID.
MD RAID expects the device to return an I/O error on failure, and to remove
the block device if the underlying hardware is removed.
This is contrary to the implementation of NVMe-oF, which will keep on retrying
I/O while the controller is being reset, and will only remove the block device
once the last _user_ is gone.
These patches fix up this situation by adding a new sysfs attribute
'fail_if_no_path'. When this attribute is set, we will return I/O errors
as soon as no paths are available anymore, and will remove the block device
once the last controller holding a path to the namespace is removed (ie after
all reconnect attempts for that controller are exhausted).
This is a rework of the earlier patch by Keith Busch ('nvme-mpath: delete disk
after last connection'). Kudos to him for suggesting this approach.
Hannes Reinecke (2):
nvme: add 'fail_if_no_path' sysfs attribute
nvme: delete disk when last path is gone
drivers/nvme/host/core.c | 6 +++++
drivers/nvme/host/multipath.c | 46 ++++++++++++++++++++++++++++++++---
drivers/nvme/host/nvme.h | 19 +++++++++++++--
3 files changed, 66 insertions(+), 5 deletions(-)
--
2.29.2
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute 2021-02-23 11:59 [PATCH 0/2] nvme: fix regression with MD RAID Hannes Reinecke @ 2021-02-23 11:59 ` Hannes Reinecke 2021-02-23 12:41 ` Minwoo Im 2021-02-24 22:47 ` Sagi Grimberg 2021-02-23 11:59 ` [PATCH 2/2] nvme: delete disk when last path is gone Hannes Reinecke 2021-02-24 16:25 ` [PATCH 0/2] nvme: fix regression with MD RAID Christoph Hellwig 2 siblings, 2 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-23 11:59 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-nvme, Sagi Grimberg, Keith Busch, Hannes Reinecke In some setups like RAID or cluster we need to return an I/O error once all paths are unavailable to allow the upper layers to start their own error recovery (like redirecting I/O to other mirrors). This patch adds a sysfs attribute 'fail_if_no_path' to allow the admin to enable that behaviour instead of the current 'queue until a path becomes available' policy. 
Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/core.c | 5 ++++ drivers/nvme/host/multipath.c | 43 +++++++++++++++++++++++++++++++++-- drivers/nvme/host/nvme.h | 2 ++ 3 files changed, 48 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4de6a3a13575..2fb3ecc0c53b 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3464,6 +3464,7 @@ static struct attribute *nvme_ns_id_attrs[] = { #ifdef CONFIG_NVME_MULTIPATH &dev_attr_ana_grpid.attr, &dev_attr_ana_state.attr, + &dev_attr_fail_if_no_path.attr, #endif NULL, }; @@ -3494,6 +3495,10 @@ static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj, if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl)) return 0; } + if (a == &dev_attr_fail_if_no_path.attr) { + if (dev_to_disk(dev)->fops == &nvme_bdev_ops) + return 0; + } #endif return a->mode; } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0696319adaf6..d5773ea105b1 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -283,10 +283,18 @@ static bool nvme_available_path(struct nvme_ns_head *head) continue; switch (ns->ctrl->state) { case NVME_CTRL_LIVE: + if (!test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, + &head->flags)) + return true; + if (ns->ana_state != NVME_ANA_INACCESSIBLE && + ns->ana_state != NVME_ANA_PERSISTENT_LOSS) + return true; case NVME_CTRL_RESETTING: - case NVME_CTRL_CONNECTING: /* fallthru */ - return true; + case NVME_CTRL_CONNECTING: + if (!test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, + &head->flags)) + return true; default: break; } @@ -641,6 +649,37 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr, } DEVICE_ATTR_RO(ana_state); +static ssize_t fail_if_no_path_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct nvme_ns_head *head = disk->private_data; + + return sprintf(buf, "%d\n", + 
test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags) ? + 1 : 0); +} + +static ssize_t fail_if_no_path_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct nvme_ns_head *head = disk->private_data; + int fail_if_no_path, err; + + err = kstrtoint(buf, 10, &fail_if_no_path); + if (err) + return -EINVAL; + + if (fail_if_no_path <= 0) + clear_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags); + else + set_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags); + return count; +} +DEVICE_ATTR(fail_if_no_path, S_IRUGO | S_IWUSR, + fail_if_no_path_show, fail_if_no_path_store); + static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl, struct nvme_ana_group_desc *desc, void *data) { diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 07b34175c6ce..3d2513f8194d 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -418,6 +418,7 @@ struct nvme_ns_head { struct mutex lock; unsigned long flags; #define NVME_NSHEAD_DISK_LIVE 0 +#define NVME_NSHEAD_FAIL_IF_NO_PATH 1 struct nvme_ns __rcu *current_path[]; #endif }; @@ -694,6 +695,7 @@ static inline void nvme_trace_bio_complete(struct request *req) extern struct device_attribute dev_attr_ana_grpid; extern struct device_attribute dev_attr_ana_state; +extern struct device_attribute dev_attr_fail_if_no_path; extern struct device_attribute subsys_attr_iopolicy; #else -- 2.29.2 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute 2021-02-23 11:59 ` [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute Hannes Reinecke @ 2021-02-23 12:41 ` Minwoo Im 2021-02-24 22:47 ` Sagi Grimberg 1 sibling, 0 replies; 12+ messages in thread From: Minwoo Im @ 2021-02-23 12:41 UTC (permalink / raw) To: Hannes Reinecke; +Cc: Keith Busch, Christoph Hellwig, linux-nvme, Sagi Grimberg On 21-02-23 12:59:21, Hannes Reinecke wrote: > In some setups like RAID or cluster we need to return an I/O error > once all paths are unavailable to allow the upper layers to start > their own error recovery (like redirecting I/O to other mirrors). > This patch adds a sysfs attribute 'fail_if_no_path' to allow the > admin to enable that behaviour instead of the current 'queue until > a path becomes available' policy. > > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/core.c | 5 ++++ > drivers/nvme/host/multipath.c | 43 +++++++++++++++++++++++++++++++++-- > drivers/nvme/host/nvme.h | 2 ++ > 3 files changed, 48 insertions(+), 2 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 4de6a3a13575..2fb3ecc0c53b 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -3464,6 +3464,7 @@ static struct attribute *nvme_ns_id_attrs[] = { > #ifdef CONFIG_NVME_MULTIPATH > &dev_attr_ana_grpid.attr, > &dev_attr_ana_state.attr, > + &dev_attr_fail_if_no_path.attr, > #endif > NULL, > }; > @@ -3494,6 +3495,10 @@ static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj, > if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl)) > return 0; > } > + if (a == &dev_attr_fail_if_no_path.attr) { > + if (dev_to_disk(dev)->fops == &nvme_bdev_ops) > + return 0; > + } > #endif > return a->mode; > } > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index 0696319adaf6..d5773ea105b1 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -283,10 +283,18 
@@ static bool nvme_available_path(struct nvme_ns_head *head) > continue; > switch (ns->ctrl->state) { > case NVME_CTRL_LIVE: > + if (!test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, > + &head->flags)) > + return true; > + if (ns->ana_state != NVME_ANA_INACCESSIBLE && > + ns->ana_state != NVME_ANA_PERSISTENT_LOSS) > + return true; It looks like it needs to prevent fallthru here with return false. It causes following warning: drivers/nvme/host/multipath.c: In function ‘nvme_available_path’: drivers/nvme/host/multipath.c:289:7: warning: this statement may fall through [-Wimplicit-fallthrough=] 289 | if (ns->ana_state != NVME_ANA_INACCESSIBLE && | ^ drivers/nvme/host/multipath.c:292:3: note: here 292 | case NVME_CTRL_RESETTING: | ^~~~ diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f995b8234622..d692eab3d483 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -289,6 +289,7 @@ static bool nvme_available_path(struct nvme_ns_head *head) if (ns->ana_state != NVME_ANA_INACCESSIBLE && ns->ana_state != NVME_ANA_PERSISTENT_LOSS) return true; + return false; case NVME_CTRL_RESETTING: _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
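Folding the fix above into the patch, the per-path decision made by nvme_available_path() can be summarised with a small Python model. This is a sketch of the logic only, not kernel code: the names merely mirror the C identifiers, controller states not listed fall through to "no path", and the early `return False` reproduces Minwoo's minimal fix as posted.

```python
# Sketch of the nvme_available_path() decision from PATCH 1/2 plus the
# 'return false' fix above; a model of the logic, not kernel code.
LIVE, RESETTING, CONNECTING = "live", "resetting", "connecting"
ANA_OPTIMIZED = "optimized"
ANA_INACCESSIBLE = "inaccessible"
ANA_PERSISTENT_LOSS = "persistent-loss"

def available_path(paths, fail_if_no_path):
    """paths: (ctrl_state, ana_state) pairs for every ns under one head."""
    for ctrl_state, ana_state in paths:
        if ctrl_state == LIVE:
            if not fail_if_no_path:
                return True
            if ana_state not in (ANA_INACCESSIBLE, ANA_PERSISTENT_LOSS):
                return True
            return False  # Minwoo's fix: stop the implicit fallthrough
        if ctrl_state in (RESETTING, CONNECTING):
            # The default policy treats a reconnecting controller as a
            # future path and keeps queueing; fail_if_no_path opts out.
            if not fail_if_no_path:
                return True
    return False
```

With the flag clear the model keeps today's "queue until a path shows up" policy; with the flag set, only a live, ANA-reachable path counts as available.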
* Re: [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute 2021-02-23 11:59 ` [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute Hannes Reinecke 2021-02-23 12:41 ` Minwoo Im @ 2021-02-24 22:47 ` Sagi Grimberg 2021-02-25 8:10 ` Hannes Reinecke 1 sibling, 1 reply; 12+ messages in thread From: Sagi Grimberg @ 2021-02-24 22:47 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: linux-nvme, Keith Busch On 2/23/21 3:59 AM, Hannes Reinecke wrote: > In some setups like RAID or cluster we need to return an I/O error > once all paths are unavailable to allow the upper layers to start > their own error recovery (like redirecting I/O to other mirrors). > This patch adds a sysfs attribute 'fail_if_no_path' to allow the > admin to enable that behaviour instead of the current 'queue until > a path becomes available' policy. Doesn't the same happen today if all the controllers are set with fail_io_fast_tmo=0? nvme_available_path will return false if all the paths have NVME_CTRL_FAILFAST_EXPIRED set. I think that fail_io_fast_tmo should be settable via sysfs and I think I requested that during the various submission iterations... I'm not sure this should be controlled on the individual namespace level... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute 2021-02-24 22:47 ` Sagi Grimberg @ 2021-02-25 8:10 ` Hannes Reinecke 0 siblings, 0 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-25 8:10 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: linux-nvme, Keith Busch On 2/24/21 11:47 PM, Sagi Grimberg wrote: > > > On 2/23/21 3:59 AM, Hannes Reinecke wrote: >> In some setups like RAID or cluster we need to return an I/O error >> once all paths are unavailable to allow the upper layers to start >> their own error recovery (like redirecting I/O to other mirrors). >> This patch adds a sysfs attribute 'fail_if_no_path' to allow the >> admin to enable that behaviour instead of the current 'queue until >> a path becomes available' policy. > > Doesn't the same happen today if all the controllers are set with > fail_io_fast_tmo=0? nvme_available_path will return false if all > the paths have NVME_CTRL_FAILFAST_EXPIRED set. > > I think that fail_io_fast_tmo should be settable via sysfs and I > think I requested that during the various submission iterations... > > I'm not sure this should be controlled on the individual namespace level... Indeed, you are right; using 'fast_io_fail_tmo' works as well here. So I'll be redoing the patch series and will be replacing this patch with another one adding a per-controller 'fast_io_fail_tmo' sysfs attribute. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] nvme: delete disk when last path is gone 2021-02-23 11:59 [PATCH 0/2] nvme: fix regression with MD RAID Hannes Reinecke 2021-02-23 11:59 ` [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute Hannes Reinecke @ 2021-02-23 11:59 ` Hannes Reinecke 2021-02-23 12:56 ` Minwoo Im 2021-02-24 22:40 ` Sagi Grimberg 2021-02-24 16:25 ` [PATCH 0/2] nvme: fix regression with MD RAID Christoph Hellwig 2 siblings, 2 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-23 11:59 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, linux-nvme, Sagi Grimberg, Keith Busch, Hannes Reinecke The multipath code currently deletes the disk only after all references to it are dropped rather than when the last path to that disk is lost. This has been reported to cause problems with some use cases like MD RAID. This patch implements an alternative behaviour of deleting the disk when the last path is gone, ie the same behaviour as non-multipathed nvme devices. The new behaviour will be selected with the 'fail_if_no_path' attribute, as it's arguably the same functionality. 
Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de> --- drivers/nvme/host/core.c | 1 + drivers/nvme/host/multipath.c | 3 ++- drivers/nvme/host/nvme.h | 17 +++++++++++++++-- 3 files changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 2fb3ecc0c53b..d717a6283d6e 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -542,6 +542,7 @@ static void nvme_free_ns_head(struct kref *ref) container_of(ref, struct nvme_ns_head, ref); nvme_mpath_remove_disk(head); + nvme_mpath_put_disk(head); ida_simple_remove(&head->subsys->ns_ida, head->instance); cleanup_srcu_struct(&head->srcu); nvme_put_subsystem(head->subsys); diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index d5773ea105b1..f995b8234622 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -724,6 +724,8 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id) void nvme_mpath_remove_disk(struct nvme_ns_head *head) { + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) + return; if (!head->disk) return; if (head->disk->flags & GENHD_FL_UP) @@ -741,7 +743,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) */ head->disk->queue = NULL; } - put_disk(head->disk); } int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 3d2513f8194d..e6efa085f08a 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -681,8 +681,12 @@ static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) { struct nvme_ns_head *head = ns->head; - if (head->disk && list_empty(&head->list)) - kblockd_schedule_work(&head->requeue_work); + if (head->disk && list_empty(&head->list)) { + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) + nvme_mpath_remove_disk(head); + else + kblockd_schedule_work(&head->requeue_work); + } } static inline void 
nvme_trace_bio_complete(struct request *req) @@ -693,6 +697,12 @@ static inline void nvme_trace_bio_complete(struct request *req) trace_block_bio_complete(ns->head->disk->queue, req->bio); } +static inline void nvme_mpath_put_disk(struct nvme_ns_head *head) +{ + if (head->disk) + put_disk(head->disk); +} + extern struct device_attribute dev_attr_ana_grpid; extern struct device_attribute dev_attr_ana_state; extern struct device_attribute dev_attr_fail_if_no_path; @@ -731,6 +741,9 @@ static inline void nvme_mpath_add_disk(struct nvme_ns *ns, static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head) { } +static inline void nvme_mpath_put_disk(struct nvme_ns_head *head) +{ +} static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns) { return false; -- 2.29.2 _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] nvme: delete disk when last path is gone 2021-02-23 11:59 ` [PATCH 2/2] nvme: delete disk when last path is gone Hannes Reinecke @ 2021-02-23 12:56 ` Minwoo Im 2021-02-23 14:07 ` Hannes Reinecke 2021-02-24 22:40 ` Sagi Grimberg 1 sibling, 1 reply; 12+ messages in thread From: Minwoo Im @ 2021-02-23 12:56 UTC (permalink / raw) To: Hannes Reinecke Cc: Keith Busch, Keith Busch, Christoph Hellwig, linux-nvme, Sagi Grimberg On 21-02-23 12:59:22, Hannes Reinecke wrote: > The multipath code currently deletes the disk only after all references > to it are dropped rather than when the last path to that disk is lost. > This has been reported to cause problems with some use cases like MD RAID. > > This patch implements an alternative behaviour of deleting the disk when > the last path is gone, ie the same behaviour as non-multipathed nvme > devices. The new behaviour will be selected with the 'fail_if_no_path' > attribute, as returning it's arguably the same functionality. > > Suggested-by: Keith Busch <kbusch@kernel.org> > Signed-off-by: Hannes Reinecke <hare@suse.de> > --- > drivers/nvme/host/core.c | 1 + > drivers/nvme/host/multipath.c | 3 ++- > drivers/nvme/host/nvme.h | 17 +++++++++++++++-- > 3 files changed, 18 insertions(+), 3 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 2fb3ecc0c53b..d717a6283d6e 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -542,6 +542,7 @@ static void nvme_free_ns_head(struct kref *ref) > container_of(ref, struct nvme_ns_head, ref); > > nvme_mpath_remove_disk(head); > + nvme_mpath_put_disk(head); > ida_simple_remove(&head->subsys->ns_ida, head->instance); > cleanup_srcu_struct(&head->srcu); > nvme_put_subsystem(head->subsys); > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c > index d5773ea105b1..f995b8234622 100644 > --- a/drivers/nvme/host/multipath.c > +++ b/drivers/nvme/host/multipath.c > @@ -724,6 +724,8 @@ void 
nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id) > > void nvme_mpath_remove_disk(struct nvme_ns_head *head) > { > + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) > + return; > if (!head->disk) > return; > if (head->disk->flags & GENHD_FL_UP) > @@ -741,7 +743,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) > */ > head->disk->queue = NULL; > } > - put_disk(head->disk); > } > > int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index 3d2513f8194d..e6efa085f08a 100644 > --- a/drivers/nvme/host/nvme.h > +++ b/drivers/nvme/host/nvme.h > @@ -681,8 +681,12 @@ static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) > { > struct nvme_ns_head *head = ns->head; > > - if (head->disk && list_empty(&head->list)) > - kblockd_schedule_work(&head->requeue_work); > + if (head->disk && list_empty(&head->list)) { > + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) > + nvme_mpath_remove_disk(head); Does it need to call nvme_mpath_remove_disk here ? It looks like it returns with nothing right away if NVME_NSHEAD_FAIL_IF_NO_PATH is set. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] nvme: delete disk when last path is gone 2021-02-23 12:56 ` Minwoo Im @ 2021-02-23 14:07 ` Hannes Reinecke 0 siblings, 0 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-23 14:07 UTC (permalink / raw) To: Minwoo Im Cc: Keith Busch, Keith Busch, Christoph Hellwig, linux-nvme, Sagi Grimberg On 2/23/21 1:56 PM, Minwoo Im wrote: > On 21-02-23 12:59:22, Hannes Reinecke wrote: >> The multipath code currently deletes the disk only after all references >> to it are dropped rather than when the last path to that disk is lost. >> This has been reported to cause problems with some use cases like MD RAID. >> >> This patch implements an alternative behaviour of deleting the disk when >> the last path is gone, ie the same behaviour as non-multipathed nvme >> devices. The new behaviour will be selected with the 'fail_if_no_path' >> attribute, as returning it's arguably the same functionality. >> >> Suggested-by: Keith Busch <kbusch@kernel.org> >> Signed-off-by: Hannes Reinecke <hare@suse.de> >> --- >> drivers/nvme/host/core.c | 1 + >> drivers/nvme/host/multipath.c | 3 ++- >> drivers/nvme/host/nvme.h | 17 +++++++++++++++-- >> 3 files changed, 18 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 2fb3ecc0c53b..d717a6283d6e 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -542,6 +542,7 @@ static void nvme_free_ns_head(struct kref *ref) >> container_of(ref, struct nvme_ns_head, ref); >> >> nvme_mpath_remove_disk(head); >> + nvme_mpath_put_disk(head); >> ida_simple_remove(&head->subsys->ns_ida, head->instance); >> cleanup_srcu_struct(&head->srcu); >> nvme_put_subsystem(head->subsys); >> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c >> index d5773ea105b1..f995b8234622 100644 >> --- a/drivers/nvme/host/multipath.c >> +++ b/drivers/nvme/host/multipath.c >> @@ -724,6 +724,8 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns 
*id) >> >> void nvme_mpath_remove_disk(struct nvme_ns_head *head) >> { >> + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) >> + return; >> if (!head->disk) >> return; >> if (head->disk->flags & GENHD_FL_UP) >> @@ -741,7 +743,6 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head) >> */ >> head->disk->queue = NULL; >> } >> - put_disk(head->disk); >> } >> >> int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) >> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h >> index 3d2513f8194d..e6efa085f08a 100644 >> --- a/drivers/nvme/host/nvme.h >> +++ b/drivers/nvme/host/nvme.h >> @@ -681,8 +681,12 @@ static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) >> { >> struct nvme_ns_head *head = ns->head; >> >> - if (head->disk && list_empty(&head->list)) >> - kblockd_schedule_work(&head->requeue_work); >> + if (head->disk && list_empty(&head->list)) { >> + if (test_bit(NVME_NSHEAD_FAIL_IF_NO_PATH, &head->flags)) >> + nvme_mpath_remove_disk(head); > > Does it need to call nvme_mpath_remove_disk here ? It looks like it > returns with nothing right away if NVME_NSHEAD_FAIL_IF_NO_PATH is set. > Argl. Yes, you are correct. I'll be reworking that one. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), GF: Felix Imendörffer _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
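The dead-code path Minwoo points out can be seen in a minimal sketch (plain Python, not kernel code; `Head` and the two functions are simplified stand-ins for the structures and helpers in the patch):

```python
# Simplified model of the PATCH 2/2 control flow, not kernel code.
class Head:
    """Stand-in for struct nvme_ns_head: one flag, one disk, a path list."""
    def __init__(self, fail_if_no_path):
        self.fail_if_no_path = fail_if_no_path
        self.disk_live = True   # models the nshead gendisk being registered
        self.paths = []         # models head->list (empty: last path gone)

def nvme_mpath_remove_disk(head):
    # As written in the patch: bail out immediately when the flag is set ...
    if head.fail_if_no_path:
        return
    head.disk_live = False

def nvme_mpath_check_last_path(head):
    # ... yet this is exactly the caller that runs when the flag is set,
    # so with the flag on, nothing ever tears the disk down.
    if head.disk_live and not head.paths:
        if head.fail_if_no_path:
            nvme_mpath_remove_disk(head)

head = Head(fail_if_no_path=True)
nvme_mpath_check_last_path(head)
assert head.disk_live   # the early return defeats the only caller
```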
* Re: [PATCH 2/2] nvme: delete disk when last path is gone 2021-02-23 11:59 ` [PATCH 2/2] nvme: delete disk when last path is gone Hannes Reinecke 2021-02-23 12:56 ` Minwoo Im @ 2021-02-24 22:40 ` Sagi Grimberg 2021-02-25 8:37 ` Hannes Reinecke 1 sibling, 1 reply; 12+ messages in thread From: Sagi Grimberg @ 2021-02-24 22:40 UTC (permalink / raw) To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Keith Busch > The multipath code currently deletes the disk only after all references > to it are dropped rather than when the last path to that disk is lost. > This has been reported to cause problems with some use cases like MD RAID. What is the exact problem? Can you describe the problem you see now and what you expect to see (unrelated to patch #1)? > This patch implements an alternative behaviour of deleting the disk when > the last path is gone, ie the same behaviour as non-multipathed nvme > devices. But we also don't remove the non-multipath'd nvme device until the last reference drops (e.g. if you have a mounted filesystem on top). This would be the equivalent to running raid on top of dm-mpath on top of scsi devices right? And if all the mpath device nodes go away the mpath device is deleted even if it has an open reference to it? > The new behaviour will be selected with the 'fail_if_no_path' > attribute, as returning it's arguably the same functionality. But it's not the same functionality.
* Re: [PATCH 2/2] nvme: delete disk when last path is gone 2021-02-24 22:40 ` Sagi Grimberg @ 2021-02-25 8:37 ` Hannes Reinecke 0 siblings, 0 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-25 8:37 UTC (permalink / raw) To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Keith Busch On 2/24/21 11:40 PM, Sagi Grimberg wrote: > >> The multipath code currently deletes the disk only after all references >> to it are dropped rather than when the last path to that disk is lost. >> This has been reported to cause problems with some use cases like MD >> RAID. > > What is the exact problem? > > Can you describe what the problem you see now and what you expect > to see (unrelated to patch #1)? > The problem is a difference in behaviour between multipathed and non-multipathed namespaces (ie whether 'CMIC' is set or not). If the CMIC bit is _not_ set, the disk device will be removed once the controller is gone; if the CMIC bit is set the disk device will be retained, and only removed once the last _reference_ is dropped. This is causing customer issues, as some vendors produce nearly identical PCI NVMe devices, which differ in the CMIC bit. So depending on which device the customer uses, he might be getting one or the other behaviour. And this is causing issues when said customer deploys MD RAID on them; with one set of devices PCI hotplug works, with the other set of devices it doesn't. >> This patch implements an alternative behaviour of deleting the disk when >> the last path is gone, ie the same behaviour as non-multipathed nvme >> devices. > > But we also don't remove the non-multipath'd nvme device until the > last reference drops (e.g. if you have a mounted filesystem on top). > Au contraire. When doing PCI hotplug the controller is removed (in the non-multipathed case), and put_disk() is called during nvme_free_ns(). 
When doing PCI hotplug in the multipathed case, the controller is removed, too, but put_disk() is only called on the namespace itself; the 'nshead' disk is still kept around, and put_disk() on the 'nshead' disk is only called after the last reference is dropped. > This would be the equivalent to running raid on top of dm-mpath on > top of scsi devices right? And if all the mpath device nodes go away > the mpath device is deleted even if it has an open reference to it? > See above. The prime motivator behind this patch is to get equivalent behaviour between multipathed and non-multipathed devices. It just so happens that MD RAID exercises this particular issue. >> The new behaviour will be selected with the 'fail_if_no_path' >> attribute, as returning it's arguably the same functionality. > > But its not the same functionality. Agreed. But as the first patch will be dropped (see my other mail) I'll be redoing the patchset anyway. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
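The behavioural difference described here boils down to a three-row decision table, sketched below in Python. This illustrates the semantics of the discussion only, not the kernel's actual refcounting; the function name is made up for the sketch.

```python
# Illustrative decision table for what a stacked consumer such as MD RAID
# observes when the last path to a namespace disappears; a simplification
# of the semantics described above, not kernel code.
def on_last_path_gone(cmic_multipath, fail_if_no_path):
    if not cmic_multipath:
        # CMIC clear: the disk is deleted together with its controller.
        return "disk deleted"
    if fail_if_no_path:
        # Patched behaviour: multipath now mirrors the non-CMIC case.
        return "disk deleted"
    # Current multipath behaviour: the nshead disk stays around and
    # queues I/O until a path reconnects or the last reference drops.
    return "I/O queued"

assert on_last_path_gone(cmic_multipath=False, fail_if_no_path=False) == "disk deleted"
assert on_last_path_gone(cmic_multipath=True, fail_if_no_path=False) == "I/O queued"
assert on_last_path_gone(cmic_multipath=True, fail_if_no_path=True) == "disk deleted"
```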
* Re: [PATCH 0/2] nvme: fix regression with MD RAID 2021-02-23 11:59 [PATCH 0/2] nvme: fix regression with MD RAID Hannes Reinecke 2021-02-23 11:59 ` [PATCH 1/2] nvme: add 'fail_if_no_path' sysfs attribute Hannes Reinecke 2021-02-23 11:59 ` [PATCH 2/2] nvme: delete disk when last path is gone Hannes Reinecke @ 2021-02-24 16:25 ` Christoph Hellwig 2021-02-24 17:10 ` Hannes Reinecke 2 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2021-02-24 16:25 UTC (permalink / raw) To: Hannes Reinecke; +Cc: linux-nvme, Christoph Hellwig, Keith Busch, Sagi Grimberg I don't see any regression here, even if the new features sound useful. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] nvme: fix regression with MD RAID 2021-02-24 16:25 ` [PATCH 0/2] nvme: fix regression with MD RAID Christoph Hellwig @ 2021-02-24 17:10 ` Hannes Reinecke 0 siblings, 0 replies; 12+ messages in thread From: Hannes Reinecke @ 2021-02-24 17:10 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-nvme, Sagi Grimberg, Keith Busch On 2/24/21 5:25 PM, Christoph Hellwig wrote: > I don't see any regression here, even if the new features sound useful. > Have you ever tried MD RAID on nvme-of? Without this patch MD RAID will _stop_ I/O until the controller reconnects. If it does. If it doesn't, the controller gets removed (so after some 300 seconds), and MD RAID will finally get an I/O error. But then you reconnect the failed path, and you end up with a _different_ nvme namespace device, requiring you to do manual handholding to get the MD RAID into shape again. With this patch it 'just works' without any interaction. One might argue whether that constitutes a regression (as it's been the behaviour since day 1), but it certainly is impaired functionality as compared to other drivers/subsystems like SCSI. And we can't have that, can we? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer