* [PATCH] nvme-pci: silence a lockdep complaint
@ 2024-05-22 9:15 Sagi Grimberg
2024-05-22 12:18 ` Shinichiro Kawasaki
From: Sagi Grimberg @ 2024-05-22 9:15 UTC (permalink / raw)
To: linux-nvme; +Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki
lockdep complains about the timeout handler running concurrently with
the reset work which is syncing the IO request queues (which in turn
flushes the timeout work).
We know it cannot be the case because the ctrl state machine prevents
the timeout handler from disabling the ctrl when the reset work is
running (changing ctrl state to RESETTING will fail, and the state is not
terminal). If this assumption happens to break in the future, we won't
have lockdep to assist, but for the time being we are simply seeing
false-positive complaints from it...
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/pci.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 710043086dff..4a85b83b78f9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2718,7 +2718,18 @@ static void nvme_reset_work(struct work_struct *work)
*/
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
+ /*
+ * lockdep complains about the timeout handler running concurrently
+ * with this call. We know it cannot be the case because the ctrl state
+ * machine prevents the timeout handler from disabling the ctrl when
+ * the reset work is running (changing ctrl state to RESETTING will
+ * fail, and the state is not terminal). If this assumption happens to
+ * break in the future, we won't have lockdep to assist, but for the
+ * time being we are simply seeing false-positive complaints from it...
+ */
+ lockdep_off();
nvme_sync_queues(&dev->ctrl);
+ lockdep_on();
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
--
2.40.1
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 9:15 [PATCH] nvme-pci: silence a lockdep complaint Sagi Grimberg
@ 2024-05-22 12:18 ` Shinichiro Kawasaki
2024-05-22 16:12 ` Keith Busch
From: Shinichiro Kawasaki @ 2024-05-22 12:18 UTC (permalink / raw)
To: Sagi Grimberg
Cc: linux-nvme@lists.infradead.org, Christoph Hellwig, Keith Busch
On May 22, 2024 / 12:15, Sagi Grimberg wrote:
> lockdep complains about the timeout handler running concurrently with
> the reset work which is syncing the IO request queues (which in turn
> flushes the timeout work).
>
> We know it cannot be the case because the ctrl state machine prevents
> the timeout handler from disabling the ctrl when the reset work is
> running (changing ctrl state to RESETTING will fail, and the state is not
> terminal). If this assumption happens to break in the future, we won't
> have lockdep to assist, but for the time being we are simply seeing
> false-positive complaints from it...
>
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Suggested-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Sagi, thank you very much. I confirmed this patch avoids the lockdep WARN
observed with the blktests test case nvme/050 [1]. When I repeated the test
case around 10 times on a v6.9 kernel without this patch, the lockdep WARN was
reproduced reliably. After applying this patch, the WARN was not observed even
after repeating the test case 100 times.
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
[1] https://lore.kernel.org/linux-block/m6a437jvfwzq2jfytvvk62zpgu7e4bjvegr7x73pihhkp5me5c@sh6vs3s7w754/
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 12:18 ` Shinichiro Kawasaki
@ 2024-05-22 16:12 ` Keith Busch
2024-05-22 16:28 ` Christoph Hellwig
From: Keith Busch @ 2024-05-22 16:12 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: Sagi Grimberg, linux-nvme@lists.infradead.org, Christoph Hellwig
On Wed, May 22, 2024 at 12:18:57PM +0000, Shinichiro Kawasaki wrote:
> On May 22, 2024 / 12:15, Sagi Grimberg wrote:
> > lockdep complains about the timeout handler running concurrently with
> > the reset work which is syncing the IO request queues (which in turn
> > flushes the timeout work).
> >
> > We know it cannot be the case because the ctrl state machine prevents
> > the timeout handler from disabling the ctrl when the reset work is
> > running (changing ctrl state to RESETTING will fail, and the state is not
> > terminal). If this assumption happens to break in the future, we won't
> > have lockdep to assist, but for the time being we are simply seeing
> > false-positive complaints from it...
> >
> > Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Suggested-by: Keith Busch <kbusch@kernel.org>
> > Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
>
> Sagi, thank you very much. I confirmed this patch avoids the lockdep WARN
> observed with the blktests test case nvme/050 [1]. When I repeated the test
> case around 10 times on a v6.9 kernel without this patch, the lockdep WARN
> was reproduced reliably. After applying this patch, the WARN was not observed
> even after repeating the test case 100 times.
>
> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
This is not desirable, but I give up trying to appease lockdep for this
one.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 16:12 ` Keith Busch
@ 2024-05-22 16:28 ` Christoph Hellwig
2024-05-22 18:00 ` Sagi Grimberg
From: Christoph Hellwig @ 2024-05-22 16:28 UTC (permalink / raw)
To: Keith Busch
Cc: Shinichiro Kawasaki, Sagi Grimberg,
linux-nvme@lists.infradead.org, Christoph Hellwig
On Wed, May 22, 2024 at 10:12:19AM -0600, Keith Busch wrote:
> > Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>
> This is not desirable, but I give up trying to appease lockdep for this
> one.
lockdep_off is a cure that's worse than the disease. If we can't
explain the "can't happen" using lockdep classes I don't see how
we could actually be sure that it can't happen.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 16:28 ` Christoph Hellwig
@ 2024-05-22 18:00 ` Sagi Grimberg
2024-05-22 21:36 ` Keith Busch
From: Sagi Grimberg @ 2024-05-22 18:00 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Shinichiro Kawasaki, linux-nvme@lists.infradead.org
On 22/05/2024 19:28, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 10:12:19AM -0600, Keith Busch wrote:
>>> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> This is not desirable, but I give up trying to appease lockdep for this
>> one.
> lockdep_off is a cure that's worse than the disease.
This is not ideal for sure.
> If we can't
> explain the "can't happen" using lockdep classes I don't see how
> we could actually be sure that it can't happen.
I don't understand lockdep well enough to do that. This warning has existed
for 6 months now. The conclusion that Keith came to (and I agreed with) is
that it can't happen, because the nvme_timeout handler will not disable the
device while the ctrl reset_work is running; the ctrl state machine prevents
that from happening.
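Roughly, the gating we rely on in the nvme-pci timeout handler looks like
this (a simplified sketch from memory, not the exact upstream code):

	/*
	 * Sketch: only disable the device if we win the transition to
	 * RESETTING (impossible while reset_work is running) or if the
	 * ctrl state is terminal.
	 */
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
		if (nvme_state_terminal(&dev->ctrl))
			nvme_dev_disable(dev, true);
		return BLK_EH_DONE;
	}
	nvme_dev_disable(dev, false);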
There was an attempt to convert namespaces_rwsem to srcu, but my comment on
that was that it is an intrusive change for a complaint that we think is a
false-positive.
Do you see a proper way to solve this? Would love to see a better
solution here.
The original report is at:
https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 18:00 ` Sagi Grimberg
@ 2024-05-22 21:36 ` Keith Busch
2024-05-23 6:54 ` Christoph Hellwig
2024-05-23 8:06 ` Christoph Hellwig
From: Keith Busch @ 2024-05-22 21:36 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Wed, May 22, 2024 at 09:00:34PM +0300, Sagi Grimberg wrote:
> I don't understand lockdep well enough to do that. This warning has existed
> for 6 months now. The conclusion that Keith came to (and I agreed with) is
> that it can't happen, because the nvme_timeout handler will not disable the
> device while the ctrl reset_work is running; the ctrl state machine prevents
> that from happening.
>
> There was an attempt to convert namespaces_rwsem to srcu, but my comment on
> that was that it is an intrusive change for a complaint that we think is a
> false-positive.
I'll play devil's advocate and suggest there might be some wacky interaction
of timeout_work, scan_work, reset_work, and a pciehp irq that could result in
a read, write, wait, read deadlock. I don't think it can happen, but it's
harder to prove that.
The below patch is what I'm messing with. I have a CMIC PCI NVMe and
nvme native multipathing enabled, and that can recreate the lockdep
splat 100% of the time. It goes away with this patch. The lockdep splat
is harder to hit with non-multipath devices, for a different reason.
From 00dd7a573d3dd14be748b46e040bb5eb0f5687b1 Mon Sep 17 00:00:00 2001
From: Keith Busch <kbusch@kernel.org>
Date: Tue, 21 May 2024 06:41:45 -0700
Subject: [PATCH] nvme: switch rwsem to srcu
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/core.c | 94 +++++++++++++++++++++--------------
drivers/nvme/host/ioctl.c | 12 ++---
drivers/nvme/host/multipath.c | 21 ++++----
drivers/nvme/host/nvme.h | 3 +-
4 files changed, 76 insertions(+), 54 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7706df2373494..480af74f1606e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3684,9 +3684,10 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
{
struct nvme_ns *ns, *ret = NULL;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list) {
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
if (ns->head->ns_id == nsid) {
if (!nvme_get_ns(ns))
continue;
@@ -3696,7 +3697,7 @@ struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
if (ns->head->ns_id > nsid)
break;
}
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
return ret;
}
EXPORT_SYMBOL_NS_GPL(nvme_find_get_ns, NVME_TARGET_PASSTHRU);
@@ -3710,7 +3711,7 @@ static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
list_for_each_entry_reverse(tmp, &ns->ctrl->namespaces, list) {
if (tmp->head->ns_id < ns->head->ns_id) {
- list_add(&ns->list, &tmp->list);
+ list_add_rcu(&ns->list, &tmp->list);
return;
}
}
@@ -3776,17 +3777,17 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
if (nvme_update_ns_info(ns, info))
goto out_unlink_ns;
- down_write(&ctrl->namespaces_rwsem);
+ mutex_lock(&ctrl->namespaces_lock);
/*
* Ensure that no namespaces are added to the ctrl list after the queues
* are frozen, thereby avoiding a deadlock between scan and reset.
*/
if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
- up_write(&ctrl->namespaces_rwsem);
+ mutex_unlock(&ctrl->namespaces_lock);
goto out_unlink_ns;
}
nvme_ns_add_to_ctrl_list(ns);
- up_write(&ctrl->namespaces_rwsem);
+ mutex_unlock(&ctrl->namespaces_lock);
nvme_get_ctrl(ctrl);
if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups))
@@ -3809,9 +3810,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
out_cleanup_ns_from_list:
nvme_put_ctrl(ctrl);
- down_write(&ctrl->namespaces_rwsem);
- list_del_init(&ns->list);
- up_write(&ctrl->namespaces_rwsem);
+ mutex_lock(&ctrl->namespaces_lock);
+ list_del_rcu(&ns->list);
+ mutex_unlock(&ctrl->namespaces_lock);
out_unlink_ns:
mutex_lock(&ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
@@ -3861,10 +3862,11 @@ static void nvme_ns_remove(struct nvme_ns *ns)
nvme_cdev_del(&ns->cdev, &ns->cdev_device);
del_gendisk(ns->disk);
- down_write(&ns->ctrl->namespaces_rwsem);
- list_del_init(&ns->list);
- up_write(&ns->ctrl->namespaces_rwsem);
+ mutex_lock(&ns->ctrl->namespaces_lock);
+ list_del_rcu(&ns->list);
+ mutex_unlock(&ns->ctrl->namespaces_lock);
+ synchronize_srcu(&ns->ctrl->srcu);
if (last_path)
nvme_mpath_shutdown_disk(ns->head);
nvme_put_ns(ns);
@@ -3953,16 +3955,17 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
struct nvme_ns *ns, *next;
LIST_HEAD(rm_list);
- down_write(&ctrl->namespaces_rwsem);
+ mutex_lock(&ctrl->namespaces_lock);
list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
if (ns->head->ns_id > nsid)
- list_move_tail(&ns->list, &rm_list);
+ list_splice_init_rcu(&ns->list, &rm_list, synchronize_rcu);
}
- up_write(&ctrl->namespaces_rwsem);
+ mutex_unlock(&ctrl->namespaces_lock);
list_for_each_entry_safe(ns, next, &rm_list, list)
nvme_ns_remove(ns);
+ synchronize_srcu(&ctrl->srcu);
}
static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)
@@ -4132,12 +4135,14 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
/* this is a no-op when called from the controller reset handler */
nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
- down_write(&ctrl->namespaces_rwsem);
- list_splice_init(&ctrl->namespaces, &ns_list);
- up_write(&ctrl->namespaces_rwsem);
+ mutex_lock(&ctrl->namespaces_lock);
+ list_splice_init_rcu(&ctrl->namespaces, &ns_list, synchronize_rcu);
+ mutex_unlock(&ctrl->namespaces_lock);
list_for_each_entry_safe(ns, next, &ns_list, list)
nvme_ns_remove(ns);
+
+ synchronize_srcu(&ctrl->srcu);
}
EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
@@ -4582,6 +4587,7 @@ static void nvme_free_ctrl(struct device *dev)
key_put(ctrl->tls_key);
nvme_free_cels(ctrl);
nvme_mpath_uninit(ctrl);
+ cleanup_srcu_struct(&ctrl->srcu);
nvme_auth_stop(ctrl);
nvme_auth_free(ctrl);
__free_page(ctrl->discard_page);
@@ -4614,10 +4620,15 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
ctrl->passthru_err_log_enabled = false;
clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
spin_lock_init(&ctrl->lock);
+ mutex_init(&ctrl->namespaces_lock);
+
+ ret = init_srcu_struct(&ctrl->srcu);
+ if (ret)
+ return ret;
+
mutex_init(&ctrl->scan_lock);
INIT_LIST_HEAD(&ctrl->namespaces);
xa_init(&ctrl->cels);
- init_rwsem(&ctrl->namespaces_rwsem);
ctrl->dev = dev;
ctrl->ops = ops;
ctrl->quirks = quirks;
@@ -4697,6 +4708,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
out:
if (ctrl->discard_page)
__free_page(ctrl->discard_page);
+ cleanup_srcu_struct(&ctrl->srcu);
return ret;
}
EXPORT_SYMBOL_GPL(nvme_init_ctrl);
@@ -4705,22 +4717,24 @@ EXPORT_SYMBOL_GPL(nvme_init_ctrl);
void nvme_mark_namespaces_dead(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
blk_mark_disk_dead(ns->disk);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
EXPORT_SYMBOL_GPL(nvme_mark_namespaces_dead);
void nvme_unfreeze(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
blk_mq_unfreeze_queue(ns->queue);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
clear_bit(NVME_CTRL_FROZEN, &ctrl->flags);
}
EXPORT_SYMBOL_GPL(nvme_unfreeze);
@@ -4728,14 +4742,15 @@ EXPORT_SYMBOL_GPL(nvme_unfreeze);
int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list) {
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
if (timeout <= 0)
break;
}
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
return timeout;
}
EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
@@ -4743,23 +4758,25 @@ EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
void nvme_wait_freeze(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
blk_mq_freeze_queue_wait(ns->queue);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
EXPORT_SYMBOL_GPL(nvme_wait_freeze);
void nvme_start_freeze(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
set_bit(NVME_CTRL_FROZEN, &ctrl->flags);
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
blk_freeze_queue_start(ns->queue);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
EXPORT_SYMBOL_GPL(nvme_start_freeze);
@@ -4802,11 +4819,12 @@ EXPORT_SYMBOL_GPL(nvme_unquiesce_admin_queue);
void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list)
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list)
blk_sync_queue(ns->queue);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 499a8bb7cac7d..0f05058692b55 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -789,16 +789,16 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
bool open_for_write)
{
struct nvme_ns *ns;
- int ret;
+ int ret, srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
if (list_empty(&ctrl->namespaces)) {
ret = -ENOTTY;
goto out_unlock;
}
- ns = list_first_entry(&ctrl->namespaces, struct nvme_ns, list);
- if (ns != list_last_entry(&ctrl->namespaces, struct nvme_ns, list)) {
+ ns = list_first_or_null_rcu(&ctrl->namespaces, struct nvme_ns, list);
+ if (!ns) {
dev_warn(ctrl->device,
"NVME_IOCTL_IO_CMD not supported when multiple namespaces present!\n");
ret = -EINVAL;
@@ -808,14 +808,14 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp,
dev_warn(ctrl->device,
"using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!\n");
kref_get(&ns->kref);
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
ret = nvme_user_cmd(ctrl, ns, argp, 0, open_for_write);
nvme_put_ns(ns);
return ret;
out_unlock:
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
return ret;
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1bee176fd850e..d8b6b4648eaff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -151,16 +151,17 @@ void nvme_mpath_end_request(struct request *rq)
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list) {
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
if (!ns->head->disk)
continue;
kblockd_schedule_work(&ns->head->requeue_work);
if (nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
disk_uevent(ns->head->disk, KOBJ_CHANGE);
}
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
static const char *nvme_ana_state_names[] = {
@@ -194,13 +195,14 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
{
struct nvme_ns *ns;
+ int srcu_idx;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list) {
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
nvme_mpath_clear_current_path(ns);
kblockd_schedule_work(&ns->head->requeue_work);
}
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
@@ -681,6 +683,7 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
unsigned *nr_change_groups = data;
struct nvme_ns *ns;
+ int srcu_idx;
dev_dbg(ctrl->device, "ANA group %d: %s.\n",
le32_to_cpu(desc->grpid),
@@ -692,8 +695,8 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
if (!nr_nsids)
return 0;
- down_read(&ctrl->namespaces_rwsem);
- list_for_each_entry(ns, &ctrl->namespaces, list) {
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
unsigned nsid;
again:
nsid = le32_to_cpu(desc->nsids[n]);
@@ -706,7 +709,7 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
if (ns->head->ns_id > nsid)
goto again;
}
- up_read(&ctrl->namespaces_rwsem);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
return 0;
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fc31bd340a63a..a005941a8b67e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -282,7 +282,8 @@ struct nvme_ctrl {
struct blk_mq_tag_set *tagset;
struct blk_mq_tag_set *admin_tagset;
struct list_head namespaces;
- struct rw_semaphore namespaces_rwsem;
+ struct mutex namespaces_lock;
+ struct srcu_struct srcu;
struct device ctrl_device;
struct device *device; /* char device */
#ifdef CONFIG_NVME_HWMON
--
2.43.0
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 21:36 ` Keith Busch
@ 2024-05-23 6:54 ` Christoph Hellwig
2024-05-23 10:04 ` Sagi Grimberg
2024-05-23 8:06 ` Christoph Hellwig
From: Christoph Hellwig @ 2024-05-23 6:54 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
> nvme native multipathing enabled, and that can recreate the lockdep
> splat 100% of the time. It goes away with this patch. The lockdep splat
> is harder to hit with non-multipath devices, for a different reason.
I much prefer the srcu locking as it has real benefits as well over
a global-ish rwsem. I'll try to get to a real review in a bit.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-22 21:36 ` Keith Busch
2024-05-23 6:54 ` Christoph Hellwig
@ 2024-05-23 8:06 ` Christoph Hellwig
From: Christoph Hellwig @ 2024-05-23 8:06 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
> + mutex_lock(&ctrl->namespaces_lock);
> + list_del_rcu(&ns->list);
> + mutex_unlock(&ctrl->namespaces_lock);
This needs a synchronize_srcu(). Maybe factor out a helper to share the
code with nvme_ns_remove?
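Something along these lines, perhaps (an untested sketch; the helper name is
made up):

	static void nvme_ns_remove_from_ctrl_list(struct nvme_ns *ns)
	{
		mutex_lock(&ns->ctrl->namespaces_lock);
		list_del_rcu(&ns->list);
		mutex_unlock(&ns->ctrl->namespaces_lock);
		synchronize_srcu(&ns->ctrl->srcu);
	}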
> - down_write(&ctrl->namespaces_rwsem);
> + mutex_lock(&ctrl->namespaces_lock);
> list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
> if (ns->head->ns_id > nsid)
> - list_move_tail(&ns->list, &rm_list);
> + list_splice_init_rcu(&ns->list, &rm_list, synchronize_rcu);
Overly long line here.
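E.g. wrapped as:

			list_splice_init_rcu(&ns->list, &rm_list,
					     synchronize_rcu);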
> }
> - up_write(&ctrl->namespaces_rwsem);
> + mutex_unlock(&ctrl->namespaces_lock);
>
> list_for_each_entry_safe(ns, next, &rm_list, list)
> nvme_ns_remove(ns);
>
> + synchronize_srcu(&ctrl->srcu);
The synchronize_srcu should go right after the mutex_unlock as it
synchronizes the ctrl->namespaces access.
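I.e. roughly (sketch):

	mutex_unlock(&ctrl->namespaces_lock);
	synchronize_srcu(&ctrl->srcu);

	list_for_each_entry_safe(ns, next, &rm_list, list)
		nvme_ns_remove(ns);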
> + mutex_lock(&ctrl->namespaces_lock);
> + list_splice_init_rcu(&ctrl->namespaces, &ns_list, synchronize_rcu);
> + mutex_unlock(&ctrl->namespaces_lock);
>
> list_for_each_entry_safe(ns, next, &ns_list, list)
> nvme_ns_remove(ns);
> +
> + synchronize_srcu(&ctrl->srcu);
Same here.
> kref_get(&ns->kref);
> - up_read(&ctrl->namespaces_rwsem);
> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
The kref_get here has to become a kref_get_unless_zero (aka nvme_get_ns).
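E.g. something like (sketch; the error value is a guess):

	if (!kref_get_unless_zero(&ns->kref)) {
		ret = -ENXIO;
		goto out_unlock;
	}
	srcu_read_unlock(&ctrl->srcu, srcu_idx);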
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 6:54 ` Christoph Hellwig
@ 2024-05-23 10:04 ` Sagi Grimberg
2024-05-23 12:39 ` Christoph Hellwig
From: Sagi Grimberg @ 2024-05-23 10:04 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Shinichiro Kawasaki, linux-nvme@lists.infradead.org
On 23/05/2024 9:54, Christoph Hellwig wrote:
> On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
>> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
>> nvme native multipathing enabled, and that can recreate the lockdep
>> splat 100% of the time. It goes away with this patch. The lockdep splat
>> is harder to hit with non-multipath devices, for a different reason.
> I much prefer the srcu locking as it has real benefits as well over
> a global-ish rwsem. I'll try to get to a real review in a bit.
>
Can you explain what these benefits are?
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 10:04 ` Sagi Grimberg
@ 2024-05-23 12:39 ` Christoph Hellwig
2024-05-23 13:02 ` Sagi Grimberg
From: Christoph Hellwig @ 2024-05-23 12:39 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Thu, May 23, 2024 at 01:04:13PM +0300, Sagi Grimberg wrote:
>
>
> On 23/05/2024 9:54, Christoph Hellwig wrote:
>> On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
>>> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
>>> nvme native multipathing enabled, and that can recreate the lockdep
>>> splat 100% of the time. It goes away with this patch. The lockdep splat
>>> is harder to hit with non-multipath devices, for a different reason.
>> I much prefer the srcu locking as it has real benefits as well over
>> a global-ish rwsem. I'll try to get to a real review in a bit.
>>
>
> Can you explain what these benefits are?
Not having a global lock taken for namespace lookups.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 12:39 ` Christoph Hellwig
@ 2024-05-23 13:02 ` Sagi Grimberg
2024-05-23 13:19 ` Christoph Hellwig
From: Sagi Grimberg @ 2024-05-23 13:02 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org
On 23/05/2024 15:39, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 01:04:13PM +0300, Sagi Grimberg wrote:
>>
>> On 23/05/2024 9:54, Christoph Hellwig wrote:
>>> On Wed, May 22, 2024 at 03:36:19PM -0600, Keith Busch wrote:
>>>> The below patch is what I'm messing with. I have a CMIC PCI NVMe and
>>>> nvme native multipathing enabled, and that can recreate the lockdep
>>>> splat 100% of the time. It goes away with this patch. The lockdep splat
>>>> is harder to hit with non-multipath devices, for a different reason.
>>> I much prefer the srcu locking as it has real benefits as well over
>>> a global-ish rwsem. I'll try to get to a real review in a bit.
>>>
>> Can you explain what these benefits are?
> Not having a global lock taken for namespace lookups.
It's a rw lock, and I'm not sure the benefit isn't marginal, especially given
the nature of the sequences in which we look up ctrl namespaces.
I just want to have good testing before we touch these areas, which are all
exercised in ctrl reset and error recovery and historically manifest most of
the issues...
and all this because lockdep complains about false-positives...
I'm not against addressing this, but we should be a bit more cautious...
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 13:02 ` Sagi Grimberg
@ 2024-05-23 13:19 ` Christoph Hellwig
2024-05-23 13:45 ` Sagi Grimberg
From: Christoph Hellwig @ 2024-05-23 13:19 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Thu, May 23, 2024 at 04:02:43PM +0300, Sagi Grimberg wrote:
>
> I just want to have good testing before we touch these areas, which are all
> exercised in ctrl reset and error recovery and historically manifest most of
> the issues...
>
> and all this because lockdep complains about false-positives...
I still haven't seen any proof that it is a false-positive.
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 13:19 ` Christoph Hellwig
@ 2024-05-23 13:45 ` Sagi Grimberg
2024-05-23 15:02 ` Keith Busch
From: Sagi Grimberg @ 2024-05-23 13:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Shinichiro Kawasaki, linux-nvme@lists.infradead.org
On 23/05/2024 16:19, Christoph Hellwig wrote:
> On Thu, May 23, 2024 at 04:02:43PM +0300, Sagi Grimberg wrote:
>> I just want to have good testing before we touch these areas, which are all
>> exercised in ctrl reset and error recovery and historically manifest most of
>> the issues...
>>
>> and all this because lockdep complains about false-positives...
> I still haven't seen any proof that it is a false-positive.
>
The explanation is that the nvme-pci timeout handler will only disable the
device when it is either able to transition the ctrl state to RESETTING or
the ctrl state is terminal, and in both cases the reset_work is not running
concurrently...
It was also proposed to move the call to nvme_disable_ctrl to a different
context (returning BLK_EH_RESET_TIMER from the timeout handler) such that
blk_sync_queue() does not depend on it. But Keith said it won't solve the
issue.
Keith spent more time thinking about this, so I'll defer to him. I just want
it to stop bothering Shinichiro and others running blktests regularly...
* Re: [PATCH] nvme-pci: silence a lockdep complaint
2024-05-23 13:45 ` Sagi Grimberg
@ 2024-05-23 15:02 ` Keith Busch
From: Keith Busch @ 2024-05-23 15:02 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, Shinichiro Kawasaki,
linux-nvme@lists.infradead.org
On Thu, May 23, 2024 at 04:45:37PM +0300, Sagi Grimberg wrote:
>
>
> On 23/05/2024 16:19, Christoph Hellwig wrote:
> > On Thu, May 23, 2024 at 04:02:43PM +0300, Sagi Grimberg wrote:
> > > I just want to have good testing before we touch these areas, which are
> > > all exercised in ctrl reset and error recovery and historically manifest
> > > most of the issues...
> > >
> > > and all this because lockdep complains about false-positives...
> > I still haven't seen any proof that it is a false-positive.
> >
>
> The explanation is that the nvme-pci timeout handler will only disable the
> device when it is either able to transition the ctrl state to RESETTING or
> the ctrl state is terminal, and in both cases the reset_work is not running
> concurrently...
>
> It was also proposed to move the call to nvme_disable_ctrl to a different
> context (returning BLK_EH_RESET_TIMER from the timeout handler) such that
> blk_sync_queue() does not depend on it. But Keith said it won't solve the
> issue.
>
> Keith spent more time thinking about this, so I'll defer to him. I just want
> it to stop bothering Shinichiro and others running blktests regularly...
Lockdep is complaining about the namespaces_rwsem being taken from both
timeout_work and reset_work.
The only time namespaces_rwsem is taken from the timeout work is when
timeout_work successfully transitions the controller state to RESETTING. But
the controller is already in the RESETTING state, since reset_work wouldn't be
running otherwise. So timeout_work never locks the rwsem while reset_work is
already attempting to "cancel" that work.
Never mind the fact that both are using the read lock, so it's not even a
problem if they both attempt to lock it. The only potential problem would be
if someone requested the write lock in between. At nearly the same time,
pciehp would have to transition the state to DEAD in order for the special
"terminal" condition to apply so that the timeout_work can attempt some kind
of recovery, but removal doesn't take the write lock until after the
reset_work is flushed, so how would the write lock sneak in?
A potential interaction with scan_work while the timeout, reset, and hotplug
run concurrently might be the only part I'm not 100% sure about, but I haven't
found any concerning path so far.