* [PATCH] nvme-pci: silence a lockdep complaint
From: Sagi Grimberg @ 2024-05-22 9:15 UTC
To: linux-nvme; +Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki

lockdep complains about the timeout handler running concurrently with
the reset work which is syncing the IO request queues (which in turn
flushes the timeout work).

We know it cannot be the case because the ctrl state machine prevents
the timeout handler from disabling the ctrl when the reset work is
running (changing ctrl state to RESETTING will fail, and the state is not
terminal). If this assumption happens to break in the future, we won't
have lockdep to assist, but for the time being we are simply seeing
false-positive complaints from it...

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/pci.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 710043086dff..4a85b83b78f9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2718,7 +2718,18 @@ static void nvme_reset_work(struct work_struct *work)
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
 		nvme_dev_disable(dev, false);
+	/*
+	 * lockdep complains about the timeout handler running concurrently
+	 * with this call. We know it cannot be the case because the ctrl state
+	 * machine prevents the timeout handler from disabling the ctrl when
+	 * the reset work is running (changing ctrl state to RESETTING will
+	 * fail, and the state is not terminal). If this assumption happens to
+	 * break in the future, we won't have lockdep to assist, but for the
+	 * time being we are simply seeing false-positive complaints from it...
+	 */
+	lockdep_off();
 	nvme_sync_queues(&dev->ctrl);
+	lockdep_on();
 
 	mutex_lock(&dev->shutdown_lock);
 	result = nvme_pci_enable(dev);
-- 
2.40.1
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Shinichiro Kawasaki @ 2024-05-22 12:18 UTC
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig, Keith Busch

On May 22, 2024 / 12:15, Sagi Grimberg wrote:
> lockdep complains about the timeout handler running concurrently with
> the reset work which is syncing the IO request queues (which in turn
> flushes the timeout work).
>
> We know it cannot be the case because the ctrl state machine prevents
> the timeout handler from disabling the ctrl when the reset work is
> running (changing ctrl state to RESETTING will fail, and the state is not
> terminal). If this assumption happens to break in the future, we won't
> have lockdep to assist, but for the time being we are simply seeing
> false-positive complaints from it...
>
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>

Sagi, thank you very much. I confirmed this patch avoids the lockdep WARN
observed with the blktests test case nvme/050 [1]. When I repeated the test
case around 10 times on the v6.9 kernel without this patch, the lockdep WARN
was reproduced reliably. After applying this patch, the WARN was not observed
even when the test case was repeated 100 times.

Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

[1] https://lore.kernel.org/linux-block/m6a437jvfwzq2jfytvvk62zpgu7e4bjvegr7x73pihhkp5me5c@sh6vs3s7w754/
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Keith Busch @ 2024-05-22 16:12 UTC
To: Shinichiro Kawasaki
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Christoph Hellwig

On Wed, May 22, 2024 at 12:18:57PM +0000, Shinichiro Kawasaki wrote:
> On May 22, 2024 / 12:15, Sagi Grimberg wrote:
> > lockdep complains about the timeout handler running concurrently with
> > the reset work which is syncing the IO request queues (which in turn
> > flushes the timeout work).
> >
> > We know it cannot be the case because the ctrl state machine prevents
> > the timeout handler from disabling the ctrl when the reset work is
> > running (changing ctrl state to RESETTING will fail, and the state is not
> > terminal). If this assumption happens to break in the future, we won't
> > have lockdep to assist, but for the time being we are simply seeing
> > false-positive complaints from it...
> >
> > Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Keith Busch <kbusch@kernel.org>
> > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>
> Sagi, thank you very much. I confirmed this patch avoids the lockdep WARN
> observed with the blktests test case nvme/050 [1]. When I repeated the test
> case around 10 times on the v6.9 kernel without this patch, the lockdep WARN
> was reproduced reliably. After applying this patch, the WARN was not
> observed even when the test case was repeated 100 times.
>
> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

This is not desirable, but I give up trying to appease lockdep for this
one.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Christoph Hellwig @ 2024-05-22 16:28 UTC
To: Keith Busch
Cc: Shinichiro Kawasaki, Sagi Grimberg, linux-nvme@lists.infradead.org, Christoph Hellwig

On Wed, May 22, 2024 at 10:12:19AM -0600, Keith Busch wrote:
> > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>
> This is not desirable, but I give up trying to appease lockdep for this
> one.

lockdep_off() is a cure that's worse than the disease. If we can't
explain the "can't happen" using lockdep classes, I don't see how
we could actually be sure that it can't happen.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Sagi Grimberg @ 2024-05-22 18:00 UTC
To: Christoph Hellwig, Keith Busch
Cc: Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On 22/05/2024 19:28, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 10:12:19AM -0600, Keith Busch wrote:
>> > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> This is not desirable, but I give up trying to appease lockdep for this
>> one.
> lockdep_off is a cure that's worse than the disease.

This is not ideal, for sure.

> If we can't
> explain the "can't happen" using lockdep classes I don't see how
> we could actually be sure that it can't happen.

I don't understand lockdep well enough to do that. This warning has existed
for 6 months now. The conclusion that Keith came to (and I agreed with) is
that it can't happen, because the nvme_timeout handler will not disable the
device while the ctrl reset_work is running; the ctrl state machine prevents
that from happening.

There was an attempt to convert namespaces_rwsem to SRCU, but my comment on
that was that this is an intrusive change for a complaint that we think is a
false positive.

Do you see a proper way to solve this? I would love to see a better solution
here. The original report is at:
https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Keith Busch @ 2024-05-22 21:36 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Wed, May 22, 2024 at 09:00:34PM +0300, Sagi Grimberg wrote:
> I don't understand lockdep well enough to do that. This warning
> exists for 6 months now. The conclusion that Keith came to (and I agreed)
> is that it can't happen, because nvme_timeout handler will not disable the
> device while the ctrl reset_work is running, the ctrl state machine will
> prevent that from happening.
>
> There was an attempt to convert namespaces_rwsem a srcu, but my comment to
> that was that this is an intrusive change for a complaint that we think is
> a false-positive.

I'll play devil's advocate and suggest there might be some wacky
timeout_work, scan_work, reset_work, and pciehp irq interaction that could
result in a read, write, wait, read deadlock. I don't think it can happen,
but it's harder to prove that.

The patch below is what I'm messing with. I have a CMIC PCI NVMe device and
nvme native multipathing enabled, and that setup recreates the lockdep
splat 100% of the time. It goes away with this patch. Non-multipath devices
are harder to hit the lockdep splat with for a different reason.

From 00dd7a573d3dd14be748b46e040bb5eb0f5687b1 Mon Sep 17 00:00:00 2001
From: Keith Busch <kbusch@kernel.org>
Date: Tue, 21 May 2024 06:41:45 -0700
Subject: [PATCH] nvme: switch rwsem to srcu

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c      | 94 +++++++++++++++++++++--------------
 drivers/nvme/host/ioctl.c     | 12 ++---
 drivers/nvme/host/multipath.c | 21 ++++----
 drivers/nvme/host/nvme.h     |  3 +-
 4 files changed, 76 insertions(+), 54 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7706df2373494..480af74f1606e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3684,9 +3684,10 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
 struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
 	struct nvme_ns *ns, *ret = NULL;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
 		if (ns->head->ns_id == nsid) {
 			if (!nvme_get_ns(ns))
 				continue;
@@ -3696,7 +3697,7 @@ struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		if (ns->head->ns_id > nsid)
 			break;
 	}
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	return ret;
 }
 EXPORT_SYMBOL_NS_GPL(nvme_find_get_ns, NVME_TARGET_PASSTHRU);
@@ -3710,7 +3711,7 @@ static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
 
 	list_for_each_entry_reverse(tmp, &ns->ctrl->namespaces, list) {
 		if (tmp->head->ns_id < ns->head->ns_id) {
-			list_add(&ns->list, &tmp->list);
+			list_add_rcu(&ns->list, &tmp->list);
 			return;
 		}
 	}
@@ -3776,17 +3777,17 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	if (nvme_update_ns_info(ns, info))
 		goto out_unlink_ns;
 
-	down_write(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->namespaces_lock);
 	/*
 	 * Ensure that no namespaces are added to the ctrl list after the queues
 	 * are frozen, thereby avoiding a deadlock between scan and reset.
 	 */
 	if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
-		up_write(&ctrl->namespaces_rwsem);
+		mutex_unlock(&ctrl->namespaces_lock);
 		goto out_unlink_ns;
 	}
 	nvme_ns_add_to_ctrl_list(ns);
-	up_write(&ctrl->namespaces_rwsem);
+	mutex_unlock(&ctrl->namespaces_lock);
 	nvme_get_ctrl(ctrl);
 
 	if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups))
@@ -3809,9 +3810,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 
  out_cleanup_ns_from_list:
 	nvme_put_ctrl(ctrl);
-	down_write(&ctrl->namespaces_rwsem);
-	list_del_init(&ns->list);
-	up_write(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->namespaces_lock);
+	list_del_rcu(&ns->list);
+	mutex_unlock(&ctrl->namespaces_lock);
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
@@ -3861,10 +3862,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	nvme_cdev_del(&ns->cdev, &ns->cdev_device);
 	del_gendisk(ns->disk);
 
-	down_write(&ns->ctrl->namespaces_rwsem);
-	list_del_init(&ns->list);
-	up_write(&ns->ctrl->namespaces_rwsem);
+	mutex_lock(&ns->ctrl->namespaces_lock);
+	list_del_rcu(&ns->list);
+	mutex_unlock(&ns->ctrl->namespaces_lock);
+	synchronize_srcu(&ns->ctrl->srcu);
 
 	if (last_path)
 		nvme_mpath_shutdown_disk(ns->head);
 	nvme_put_ns(ns);
@@ -3953,16 +3955,17 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 	LIST_HEAD(rm_list);
 
-	down_write(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->namespaces_lock);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
 		if (ns->head->ns_id > nsid)
-			list_move_tail(&ns->list, &rm_list);
+			list_splice_init_rcu(&ns->list, &rm_list, synchronize_rcu);
 	}
-	up_write(&ctrl->namespaces_rwsem);
+	mutex_unlock(&ctrl->namespaces_lock);
 
 	list_for_each_entry_safe(ns, next, &rm_list, list)
 		nvme_ns_remove(ns);
+	synchronize_srcu(&ctrl->srcu);
 }
 
 static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
@@ -4132,12 +4135,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	/* this is a no-op when called from the controller reset handler */
 	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
 
-	down_write(&ctrl->namespaces_rwsem);
-	list_splice_init(&ctrl->namespaces, &ns_list);
-	up_write(&ctrl->namespaces_rwsem);
+	mutex_lock(&ctrl->namespaces_lock);
+	list_splice_init_rcu(&ctrl->namespaces, &ns_list, synchronize_rcu);
+	mutex_unlock(&ctrl->namespaces_lock);
 
 	list_for_each_entry_safe(ns, next, &ns_list, list)
 		nvme_ns_remove(ns);
+
+	synchronize_srcu(&ctrl->srcu);
 }
 EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
@@ -4582,6 +4587,7 @@ static void nvme_free_ctrl(struct device *dev)
 	key_put(ctrl->tls_key);
 	nvme_free_cels(ctrl);
 	nvme_mpath_uninit(ctrl);
+	cleanup_srcu_struct(&ctrl->srcu);
 	nvme_auth_stop(ctrl);
 	nvme_auth_free(ctrl);
 	__free_page(ctrl->discard_page);
@@ -4614,10 +4620,15 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->passthru_err_log_enabled = false;
 	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
+	mutex_init(&ctrl->namespaces_lock);
+
+	ret = init_srcu_struct(&ctrl->srcu);
+	if (ret)
+		return ret;
+
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
 	xa_init(&ctrl->cels);
-	init_rwsem(&ctrl->namespaces_rwsem);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
@@ -4697,6 +4708,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 out:
 	if (ctrl->discard_page)
 		__free_page(ctrl->discard_page);
+	cleanup_srcu_struct(&ctrl->srcu);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nvme_init_ctrl);
@@ -4705,22 +4717,24 @@ EXPORT_SYMBOL_GPL(nvme_init_ctrl);
 void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
 		blk_mark_disk_dead(ns->disk);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 EXPORT_SYMBOL_GPL(nvme_mark_namespaces_dead);
 
 void nvme_unfreeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
 		blk_mq_unfreeze_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	clear_bit(NVME_CTRL_FROZEN, &ctrl->flags);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
@@ -4728,14 +4742,15 @@ EXPORT_SYMBOL_GPL(nvme_unfreeze);
 int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
 		timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
 		if (timeout <= 0)
 			break;
 	}
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	return timeout;
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
@@ -4743,23 +4758,25 @@ EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
 		blk_mq_freeze_queue_wait(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
 
 void nvme_start_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
 	set_bit(NVME_CTRL_FROZEN, &ctrl->flags);
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
 		blk_freeze_queue_start(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
@@ -4802,11 +4819,12 @@ EXPORT_SYMBOL_GPL(nvme_unquiesce_admin_queue);
 void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list)
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
 		blk_sync_queue(ns->queue);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 499a8bb7cac7d..0f05058692b55 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -789,16 +789,16 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
 		bool open_for_write)
 {
 	struct nvme_ns *ns;
-	int ret;
+	int ret, srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
 	if (list_empty(&ctrl->namespaces)) {
 		ret = -ENOTTY;
 		goto out_unlock;
 	}
 
-	ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
-	if (ns != list_last_entry(&ctrl->namespaces, struct nvme_ns, list)) {
+	ns = list_first_or_null_rcu(&ctrl->namespaces, struct nvme_ns, list);
+	if (!ns) {
 		dev_warn(ctrl->device,
 			"NVME_IOCTL_IO_CMD not supported when multiple namespaces present!\n");
 		ret = -EINVAL;
@@ -808,14 +808,14 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
 	dev_warn(ctrl->device,
 		"using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!\n");
 	kref_get(&ns->kref);
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 
 	ret = nvme_user_cmd(ctrl, ns, argp, 0, open_for_write);
 	nvme_put_ns(ns);
 	return ret;
 
 out_unlock:
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	return ret;
 }
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1bee176fd850e..d8b6b4648eaff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -151,16 +151,17 @@ void nvme_mpath_end_request(struct request *rq)
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
 		if (!ns->head->disk)
 			continue;
 		kblockd_schedule_work(&ns->head->requeue_work);
 		if (nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
 			disk_uevent(ns->head->disk, KOBJ_CHANGE);
 	}
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 
 static const char *nvme_ana_state_names[] = {
@@ -194,13 +195,14 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
+	int srcu_idx;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
 		nvme_mpath_clear_current_path(ns);
 		kblockd_schedule_work(&ns->head->requeue_work);
 	}
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
@@ -681,6 +683,7 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 	u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
 	unsigned *nr_change_groups = data;
 	struct nvme_ns *ns;
+	int srcu_idx;
 
 	dev_dbg(ctrl->device, "ANA group %d: %s.\n",
 			le32_to_cpu(desc->grpid),
@@ -692,8 +695,8 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 	if (!nr_nsids)
 		return 0;
 
-	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
 		unsigned nsid;
 again:
 		nsid = le32_to_cpu(desc->nsids[n]);
@@ -706,7 +709,7 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 		if (ns->head->ns_id > nsid)
 			goto again;
 	}
-	up_read(&ctrl->namespaces_rwsem);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	return 0;
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fc31bd340a63a..a005941a8b67e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -282,7 +282,8 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
-	struct rw_semaphore namespaces_rwsem;
+	struct mutex namespaces_lock;
+	struct srcu_struct srcu;
 	struct device ctrl_device;
 	struct device *device;	/* char device */
 #ifdef CONFIG_NVME_HWMON
-- 
2.43.0
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Christoph Hellwig @ 2024-05-23 6:54 UTC
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
> nvme native multipathing enabled, and that can recreate the lockdep
> splat 100% of the time. It goes away with this patch. Non-multipath
> devices are harder to hit the lockdep splat for a different reason.

I much prefer the srcu locking as it has real benefits as well over
a global-ish rwsem. I'll try to get to a real review in a bit.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Sagi Grimberg @ 2024-05-23 10:04 UTC
To: Christoph Hellwig, Keith Busch
Cc: Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On 23/05/2024 9:54, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
>> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
>> nvme native multipathing enabled, and that can recreate the lockdep
>> splat 100% of the time. It goes away with this patch. Non-multipath
>> devices are harder to hit the lockdep splat for a different reason.
> I much prefer the srcu locking as it has real benefits as well over
> a global-ish rwsem. I'll try to get to a real review in a bit.

Can you explain what these benefits are?
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Christoph Hellwig @ 2024-05-23 12:39 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Thu, May 23, 2024 at 01:04:13PM +0300, Sagi Grimberg wrote:
> On 23/05/2024 9:54, Christoph Hellwig wrote:
>> I much prefer the srcu locking as it has real benefits as well over
>> a global-ish rwsem. I'll try to get to a real review in a bit.
>
> Can you explain what these benefits are?

Not having a global lock taken for namespace lookups.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Sagi Grimberg @ 2024-05-23 13:02 UTC
To: Christoph Hellwig
Cc: Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On 23/05/2024 15:39, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 01:04:13PM +0300, Sagi Grimberg wrote:
>> Can you explain what these benefits are?
> Not having a global lock taken for namespace lookups.

It's a rw lock, and I'm not sure the benefit isn't marginal, especially
given the nature of the sequences where we look up ctrl namespaces.

I just want to have good testing before we touch these areas, which are all
exercised in ctrl reset and error recovery and historically manifest most of
the issues... and all this because lockdep complains about false
positives...

I'm not against addressing this, but we should be a bit more cautious...
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Christoph Hellwig @ 2024-05-23 13:19 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Thu, May 23, 2024 at 04:02:43PM +0300, Sagi Grimberg wrote:
> I just want to have good testing before we touch these areas, which are
> all exercised in ctrl reset and error recovery and historically manifest
> most of the issues...
>
> and all this because lockdep complains about false positives...

I still haven't seen any proof that it is a false positive.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Sagi Grimberg @ 2024-05-23 13:45 UTC
To: Christoph Hellwig
Cc: Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On 23/05/2024 16:19, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 04:02:43PM +0300, Sagi Grimberg wrote:
>> I just want to have good testing before we touch these areas, which are
>> all exercised in ctrl reset and error recovery and historically manifest
>> most of the issues...
>>
>> and all this because lockdep complains about false positives...
> I still haven't seen any proof that it is a false-positive.

The explanation is that the nvme-pci timeout handler will only disable the
device when either it is able to transition the ctrl state to RESETTING or
the ctrl state is terminal, and in both cases the reset_work is not running
concurrently...

It was also proposed to move the call to nvme_disable_ctrl to a different
context (returning BLK_EH_RESET_TIMER from the timeout handler) such that
blk_sync_queue() does not depend on it. But Keith said it won't solve the
issue.

Keith spent more time thinking about this, so I'll defer to him. I just want
it to stop bothering Shinichiro and others running blktests regularly...
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Keith Busch @ 2024-05-23 15:02 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Thu, May 23, 2024 at 04:45:37PM +0300, Sagi Grimberg wrote:
> On 23/05/2024 16:19, Christoph Hellwig wrote:
> > I still haven't seen any proof that it is a false-positive.
>
> The explanation is that the nvme-pci timeout handler will only disable
> the device when either it is able to transition the ctrl state to
> RESETTING or the ctrl state is terminal, and in both cases the reset_work
> is not running concurrently...
>
> It was also proposed to move the call to nvme_disable_ctrl to a different
> context (returning BLK_EH_RESET_TIMER from the timeout handler) such that
> blk_sync_queue() does not depend on it. But Keith said it won't solve the
> issue.
>
> Keith spent more time thinking about this, so I'll defer to him. I just
> want it to stop bothering Shinichiro and others running blktests
> regularly...

Lockdep is complaining about the namespaces_rwsem taken from both
timeout_work and reset_work. The only time namespaces_rwsem is taken from
timeout_work is if timeout_work successfully transitions the controller
state to RESETTING. But the controller is already in the RESETTING state
here, because reset_work wouldn't be running otherwise. So timeout_work
never locks the rwsem if reset_work is already attempting to "cancel" that
work.
Never mind the fact that both are using the read lock, so it's not even a
problem if they both attempt to lock it. The only potential problem would
be if someone requested the write lock in between.

At nearly the same time, pciehp has to transition the state to DEAD in
order for the special "terminal" condition to apply for the timeout_work to
attempt some kind of recovery, but the removal doesn't take the write lock
until after the reset_work is flushed, so how would the write lock sneak
in?

A potential interaction with scan_work while the timeout, reset, and
hotplug run concurrently might be the only part I'm not 100% sure about,
but I haven't found any concerning path so far.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
From: Christoph Hellwig @ 2024-05-23 8:06 UTC
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, Shinichiro Kawasaki, linux-nvme@lists.infradead.org

On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
> +	mutex_lock(&ctrl->namespaces_lock);
> +	list_del_rcu(&ns->list);
> +	mutex_unlock(&ctrl->namespaces_lock);

This needs a synchronize_srcu(). Maybe factor out a helper to share the
code with nvme_ns_remove?

> -	down_write(&ctrl->namespaces_rwsem);
> +	mutex_lock(&ctrl->namespaces_lock);
> 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
> 		if (ns->head->ns_id > nsid)
> -			list_move_tail(&ns->list, &rm_list);
> +			list_splice_init_rcu(&ns->list, &rm_list, synchronize_rcu);

Overly long line here.

> 	}
> -	up_write(&ctrl->namespaces_rwsem);
> +	mutex_unlock(&ctrl->namespaces_lock);
>
> 	list_for_each_entry_safe(ns, next, &rm_list, list)
> 		nvme_ns_remove(ns);
>
> +	synchronize_srcu(&ctrl->srcu);

The synchronize_srcu() should go right after the mutex_unlock() as it
synchronizes the ctrl->namespaces access.

> +	mutex_lock(&ctrl->namespaces_lock);
> +	list_splice_init_rcu(&ctrl->namespaces, &ns_list, synchronize_rcu);
> +	mutex_unlock(&ctrl->namespaces_lock);
>
> 	list_for_each_entry_safe(ns, next, &ns_list, list)
> 		nvme_ns_remove(ns);
> +
> +	synchronize_srcu(&ctrl->srcu);

Same here.

> 	kref_get(&ns->kref);
> -	up_read(&ctrl->namespaces_rwsem);
> +	srcu_read_unlock(&ctrl->srcu, srcu_idx);

The kref_get() here has to become a kref_get_unless_zero() (aka
nvme_get_ns()).