linux-block.vger.kernel.org archive mirror
* [PATCH 0/2] block: fix lock order and remove redundant locking
@ 2025-02-05 14:44 Nilay Shroff
  2025-02-05 14:44 ` [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock Nilay Shroff
  2025-02-05 14:44 ` [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes Nilay Shroff
  0 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-05 14:44 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce

Hi,

This series contains two patches. The first patch fixes the incorrect
lock ordering between the queue ->sysfs_lock and the freeze-lock, and
the second patch avoids holding q->sysfs_lock while accessing sysfs
attribute files.

After commit f1be1788a32e ("block: model freeze & enter queue as lock
for supporting lockdep") modeled the freeze & enter queue as locks for
lockdep support, we received numerous lockdep splats. One of those
splats[1] reported a potential deadlock due to incorrect lock ordering
between q->sysfs_lock and q->q_usage_counter. The first patch in the
series addresses this lockdep splat.

The second patch in the series removes the unnecessary holding of
q->sysfs_lock while accessing the block layer sysfs attribute files.
Ideally, we don't need to hold any lock while accessing sysfs attribute
files, as they are already protected by internal sysfs/kernfs locking.
For instance, reading a sysfs attribute file follows this code path:
    vfs_read
      kernfs_fop_read_iter
        seq_read_iter
          kernfs_seq_start  ==> acquires kernfs internal mutex
            kernfs_seq_show
              sysfs_kf_seq_show
                queue_attr_show

Similarly, writing a sysfs attribute file follows this code path:
    vfs_write
      kernfs_fop_write_iter  ==> acquires kernfs internal mutex
        sysfs_kf_write
          queue_attr_store

As shown in the above code paths, the kernfs internal mutex already
protects sysfs show/store operations, so we can safely get rid of
holding q->sysfs_lock while accessing sysfs attribute files.

Please note that the above changes were tested against blktests and a
quick xfstests run with lockdep enabled.

Nilay Shroff (2):
  block: fix lock ordering between the queue ->sysfs_lock and
    freeze-lock
  block: avoid acquiring q->sysfs_lock while accessing sysfs attributes

 block/blk-mq-sysfs.c |  6 +-----
 block/blk-mq.c       | 49 +++++++++++++++++++++++++++++---------------
 block/blk-sysfs.c    | 10 +++------
 block/elevator.c     |  9 ++++++++
 4 files changed, 46 insertions(+), 28 deletions(-)

-- 
2.47.1



* [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-05 14:44 [PATCH 0/2] block: fix lock order and remove redundant locking Nilay Shroff
@ 2025-02-05 14:44 ` Nilay Shroff
  2025-02-05 15:59   ` Christoph Hellwig
  2025-02-05 14:44 ` [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-02-05 14:44 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce

Lockdep reports [1] have identified inconsistent lock ordering between
q->sysfs_lock and freeze-lock at several call sites in the block layer.
This patch resolves the issue by enforcing a consistent lock acquisition
order: q->sysfs_lock is always acquired before freeze-lock. This change
eliminates the observed lockdep splats caused by the inconsistent
ordering.

Additionally, while rearranging the locking order, we ensure that no new
lock ordering issues are introduced between the global CPU hotplug (cpuhp)
lock and q->sysfs_lock, as previously reported [2]. To address this,
blk_mq_add_hw_queues_cpuhp() and blk_mq_remove_hw_queues_cpuhp() are now
called outside the critical section protected by q->sysfs_lock.

Since blk_mq_add_hw_queues_cpuhp() and blk_mq_remove_hw_queues_cpuhp()
are invoked during hardware context allocation via
blk_mq_realloc_hw_ctxs(), which runs holding q->sysfs_lock, we've
relocated the add/remove cpuhp function calls to
__blk_mq_update_nr_hw_queues() and blk_mq_init_allocated_queue(), after
q->sysfs_lock is released. This ensures proper lock ordering without
introducing regressions.

[1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/
[2] https://lore.kernel.org/all/20241206082202.949142-1-ming.lei@redhat.com/

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq.c   | 49 ++++++++++++++++++++++++++++++++----------------
 block/elevator.c |  9 +++++++++
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 40490ac88045..87200539b3cc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4467,7 +4467,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	unsigned long i, j;
 
 	/* protect against switching io scheduler  */
-	mutex_lock(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_lock);
+
 	for (i = 0; i < set->nr_hw_queues; i++) {
 		int old_node;
 		int node = blk_mq_get_hctx_node(set, i);
@@ -4500,13 +4501,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 	xa_for_each_start(&q->hctx_table, j, hctx, j)
 		blk_mq_exit_hctx(q, set, hctx, j);
-	mutex_unlock(&q->sysfs_lock);
-
-	/* unregister cpuhp callbacks for exited hctxs */
-	blk_mq_remove_hw_queues_cpuhp(q);
-
-	/* register cpuhp for new initialized hctxs */
-	blk_mq_add_hw_queues_cpuhp(q);
 }
 
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
@@ -4532,10 +4526,19 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
+	mutex_lock(&q->sysfs_lock);
 	blk_mq_realloc_hw_ctxs(set, q);
+	mutex_unlock(&q->sysfs_lock);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
+	/*
+	 * Register cpuhp for new initialized hctxs and ensure that the cpuhp
+	 * registration happens outside of q->sysfs_lock to avoid any lock
+	 * ordering issue between q->sysfs_lock and global cpuhp lock.
+	 */
+	blk_mq_add_hw_queues_cpuhp(q);
+
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -4934,12 +4937,12 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
 		return false;
 
 	/* q->elevator needs protection from ->sysfs_lock */
-	mutex_lock(&q->sysfs_lock);
+	lockdep_assert_held(&q->sysfs_lock);
 
 	/* the check has to be done with holding sysfs_lock */
 	if (!q->elevator) {
 		kfree(qe);
-		goto unlock;
+		goto out;
 	}
 
 	INIT_LIST_HEAD(&qe->node);
@@ -4949,8 +4952,7 @@ static bool blk_mq_elv_switch_none(struct list_head *head,
 	__elevator_get(qe->type);
 	list_add(&qe->node, head);
 	elevator_disable(q);
-unlock:
-	mutex_unlock(&q->sysfs_lock);
+out:
 
 	return true;
 }
@@ -4973,6 +4975,8 @@ static void blk_mq_elv_switch_back(struct list_head *head,
 	struct blk_mq_qe_pair *qe;
 	struct elevator_type *t;
 
+	lockdep_assert_held(&q->sysfs_lock);
+
 	qe = blk_lookup_qe_pair(head, q);
 	if (!qe)
 		return;
@@ -4980,11 +4984,9 @@ static void blk_mq_elv_switch_back(struct list_head *head,
 	list_del(&qe->node);
 	kfree(qe);
 
-	mutex_lock(&q->sysfs_lock);
 	elevator_switch(q, t);
 	/* drop the reference acquired in blk_mq_elv_switch_none */
 	elevator_put(t);
-	mutex_unlock(&q->sysfs_lock);
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
@@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		mutex_lock(&q->sysfs_lock);
 		blk_mq_freeze_queue_nomemsave(q);
+	}
 
 	/*
 	 * Switch IO scheduler to 'none', cleaning up the data associated
@@ -5055,8 +5059,21 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_elv_switch_back(&head, q);
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		mutex_unlock(&q->sysfs_lock);
+
+		/*
+		 * Unregister cpuhp callbacks for exited hctxs and register
+		 * cpuhp for new initialized hctxs. Ensure that unregister/
+		 * register cpuhp is called outside of q->sysfs_lock to avoid
+		 * lock ordering issue between q->sysfs_lock and  global cpuhp
+		 * lock.
+		 */
+		blk_mq_remove_hw_queues_cpuhp(q);
+		blk_mq_add_hw_queues_cpuhp(q);
+
 		blk_mq_unfreeze_queue_nomemrestore(q);
+	}
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/elevator.c b/block/elevator.c
index cd2ce4921601..596eb5c0219f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -725,7 +725,16 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	int ret;
 
 	strscpy(elevator_name, buf, sizeof(elevator_name));
+
+	/*
+	 * The elevator change/switch code expects that the q->sysfs_lock
+	 * is held while we update the iosched to protect against the
+	 * simultaneous hctx update.
+	 */
+	mutex_lock(&disk->queue->sysfs_lock);
 	ret = elevator_change(disk->queue, strstrip(elevator_name));
+	mutex_unlock(&disk->queue->sysfs_lock);
+
 	if (!ret)
 		return count;
 	return ret;
-- 
2.47.1



* [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-05 14:44 [PATCH 0/2] block: fix lock order and remove redundant locking Nilay Shroff
  2025-02-05 14:44 ` [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock Nilay Shroff
@ 2025-02-05 14:44 ` Nilay Shroff
  2025-02-05 15:53   ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-02-05 14:44 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, dlemoal, axboe, gjoyce

The sysfs attributes are already protected with sysfs/kernfs internal
locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
attribute files. This change therefore removes the q->sysfs_lock
acquisition from the sysfs attribute show/store paths.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sysfs.c |  6 +-----
 block/blk-sysfs.c    | 10 +++-------
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3feeeccf8a99..da53397d99fa 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -52,7 +52,6 @@ static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	struct blk_mq_hw_ctx_sysfs_entry *entry;
 	struct blk_mq_hw_ctx *hctx;
 	struct request_queue *q;
-	ssize_t res;
 
 	entry = container_of(attr, struct blk_mq_hw_ctx_sysfs_entry, attr);
 	hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);
@@ -61,10 +60,7 @@ static ssize_t blk_mq_hw_sysfs_show(struct kobject *kobj,
 	if (!entry->show)
 		return -EIO;
 
-	mutex_lock(&q->sysfs_lock);
-	res = entry->show(hctx, page);
-	mutex_unlock(&q->sysfs_lock);
-	return res;
+	return entry->show(hctx, page);
 }
 
 static ssize_t blk_mq_hw_sysfs_nr_tags_show(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f548a4376aa..2b8e7b311c61 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -664,14 +664,11 @@ queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 {
 	struct queue_sysfs_entry *entry = to_queue(attr);
 	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);
-	ssize_t res;
 
 	if (!entry->show)
 		return -EIO;
-	mutex_lock(&disk->queue->sysfs_lock);
-	res = entry->show(disk, page);
-	mutex_unlock(&disk->queue->sysfs_lock);
-	return res;
+
+	return entry->show(disk, page);
 }
 
 static ssize_t
@@ -710,11 +707,10 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 		return length;
 	}
 
-	mutex_lock(&q->sysfs_lock);
 	memflags = blk_mq_freeze_queue(q);
 	res = entry->store(disk, page, length);
 	blk_mq_unfreeze_queue(q, memflags);
-	mutex_unlock(&q->sysfs_lock);
+
 	return res;
 }
 
-- 
2.47.1



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-05 14:44 ` [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes Nilay Shroff
@ 2025-02-05 15:53   ` Christoph Hellwig
  2025-02-06 13:54     ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-02-05 15:53 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, dlemoal, axboe, gjoyce

On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> The sysfs attributes are already protected with sysfs/kernfs internal
> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
> attribute files. This change therefore removes the q->sysfs_lock
> acquisition from the sysfs attribute show/store paths.

the sysfs/kernfs locking only protects against other accesses using
sysfs.  But that's not really the most interesting part here.  We
also want to make sure nothing changes underneath in a way that
could cause crashes (and maybe even torn information).

We'll really need to audit what is accessed in each method and figure
out what protects it.  Chances are that sysfs_lock provides that
protection in some case right now, and chances are also very high
that a lot of this is pretty broken.



* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-05 14:44 ` [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock Nilay Shroff
@ 2025-02-05 15:59   ` Christoph Hellwig
  2025-02-06 13:22     ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-02-05 15:59 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, dlemoal, axboe, gjoyce

On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>  
>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  		return;
>  
>  	memflags = memalloc_noio_save();
> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> +		mutex_lock(&q->sysfs_lock);

This now means we hold up to one sysfs_lock per request queue
at the same time.  I doubt lockdep will be happy about this.
Did you test this patch with a multi-namespace nvme device or
a multi-LU per host SCSI setup?

I suspect the answer here is to (ab)use the tag_list_lock for
scheduler updates - while the scope is too broad for just
changing it on a single queue it is a rare operation and it
solves the mess in __blk_mq_update_nr_hw_queues.



* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-05 15:59   ` Christoph Hellwig
@ 2025-02-06 13:22     ` Nilay Shroff
  2025-02-06 14:15       ` Christoph Hellwig
  2025-02-07 11:59       ` Ming Lei
  0 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-06 13:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce



On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>  
>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  		return;
>>  
>>  	memflags = memalloc_noio_save();
>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		mutex_lock(&q->sysfs_lock);
> 
> This now means we hold up to one sysfs_lock per request queue
> at the same time.  I doubt lockdep will be happy about this.
> Did you test this patch with a multi-namespace nvme device or
> a multi-LU per host SCSI setup?
> 
Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
complain. Agreed we need to hold q->sysfs_lock for multiple
request queues at the same time and that may not be elegant, but
looking at the mess in __blk_mq_update_nr_hw_queues we may not
have any other choice that could help correct the lock order.

> I suspect the answer here is to (ab)use the tag_list_lock for
> scheduler updates - while the scope is too broad for just
> changing it on a single queue it is a rare operation and it
> solves the mess in __blk_mq_update_nr_hw_queues.
> 
Yes this is probably a good idea: instead of using q->sysfs_lock
we may depend on q->tag_set->tag_list_lock here for sched/elevator updates,
given that __blk_mq_update_nr_hw_queues already runs with tag_list_lock
held. But then it also requires using the same tag_list_lock instead of
the current sysfs_lock while we update the scheduler from sysfs. But that's
a trivial change.

Thanks,
--Nilay



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-05 15:53   ` Christoph Hellwig
@ 2025-02-06 13:54     ` Nilay Shroff
  2025-02-06 14:07       ` Christoph Hellwig
  2025-02-08 10:41       ` Ming Lei
  0 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-06 13:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce



On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>> The sysfs attributes are already protected with sysfs/kernfs internal
>> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
>> attribute files. This change therefore removes the q->sysfs_lock
>> acquisition from the sysfs attribute show/store paths.
> 
> the sysfs/kernfs locking only protects against other accesses using
> sysfs.  But that's not really the most interesting part here.  We
> also want to make sure nothing changes underneath in a way that
> could cause crashes (and maybe even torn information).
> 
> We'll really need to audit what is accessed in each method and figure
> out what protects it.  Chances are that sysfs_lock provides that
> protection in some case right now, and chances are also very high
> that a lot of this is pretty broken.
> 
Yes that's possible and so I audited all sysfs attributes which are 
currently protected using q->sysfs_lock and I found some interesting
facts. Please find below:

1. io_poll:
   Write to this attribute is ignored. So, we don't need q->sysfs_lock.

2. io_poll_delay:
   Write to this attribute is NOP, so we don't need q->sysfs_lock.

3. io_timeout:
   Write to this attribute updates q->rq_timeout and read of this attribute
   returns the value stored in q->rq_timeout. Moreover, the q->rq_timeout is
   set only once when we init the queue (under blk_mq_init_allocated_queue())
   even before disk is added. So that means that we may not need to protect
   it with q->sysfs_lock.

4. nomerges:
   Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
   and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
   anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
   updated/accessed with bitops which are atomic. So, I believe, protecting
   it with q->sysfs_lock is not necessary.

5. nr_requests:
   Write to this attribute updates the tag sets and this could potentially
   race with __blk_mq_update_nr_hw_queues(). So I think we should really 
   protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

6. read_ahead_kb:
   Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
   ra_pages is also updated under queue_limits_commit_update() which runs 
   holding q->limits_lock; so I think this attribute file should be protected
   with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
   Maybe we should move it under the same sets of attribute files which today
   runs with q->limits_lock held.

7. rq_affinity:
   Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
   and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
   using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
   protecting it with q->sysfs_lock is not necessary.

8. scheduler:
   Write to this attribute actually updates q->elevator and the elevator change/switch 
   code expects that the q->sysfs_lock is held while we update the iosched to protect 
   against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
   q->sysfs_lock protection.

   However if we're thinking of protecting sched change/update using q->tag_set->
   tag_list_lock (as discussed in another thread), then we may use q->tag_set->
   tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
   file.

9. wbt_lat_usec:
   Write to this attribute file updates the various wbt limits and state. This may race 
   with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
   blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock
   and so yes, we need to protect this attribute with q->sysfs_lock.

   However, as mentioned above, if we're thinking of protecting elevator change/update
   using q->tag_set->tag_list_lock then we may use q->tag_set->tag_list_lock instead of
   q->sysfs_lock while reading/writing to this attribute file.

So yes, as you rightly guessed, a few of the above attributes are not well protected and a few
still may require sysfs_lock protection.

From the above list, I see that the "read_ahead_kb" should be protected with q->limits_lock.
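
As an example, here's a minimal sketch of the "read_ahead_kb" store path
under q->limits_lock (it reuses the existing queue_var_store() helper
from blk-sysfs.c; the locking placement is the proposal here, not the
current code):

    static ssize_t queue_ra_store(struct gendisk *disk, const char *page,
                                  size_t count)
    {
            unsigned long ra_kb;
            ssize_t ret = queue_var_store(&ra_kb, page, count);

            if (ret < 0)
                    return ret;
            /* serialize against queue_limits_commit_update() */
            mutex_lock(&disk->queue->limits_lock);
            disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
            mutex_unlock(&disk->queue->limits_lock);
            return ret;
    }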

The "nr_requests" should be protected with q->tag_set->tag_list_lock instead of q->sysfs_lock.

The "scheduler" and "wbt_lat_usec" require protection using either q->sysfs_lock or 
q->tag_set->tag_list_lock. 

The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and
those are well protected by the sysfs/kernfs internal lock.

Thanks,
--Nilay



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-06 13:54     ` Nilay Shroff
@ 2025-02-06 14:07       ` Christoph Hellwig
  2025-02-07 11:03         ` Nilay Shroff
  2025-02-08 10:41       ` Ming Lei
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-02-06 14:07 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Christoph Hellwig, linux-block, ming.lei, dlemoal, axboe, gjoyce

On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> Yes that's possible and so I audited all sysfs attributes which are 
> currently protected using q->sysfs_lock and I found some interesting
> facts. Please find below:
> 
> 1. io_poll:
>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> 
> 2. io_poll_delay:
>    Write to this attribute is NOP, so we don't need q->sysfs_lock.

Yes, those are easy.

> 3. io_timeout:
>    Write to this attribute updates q->rq_timeout and read of this attribute
>    returns the value stored in q->rq_timeout. Moreover, the q->rq_timeout is
>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>    even before disk is added. So that means that we may not need to protect
>    it with q->sysfs_lock.

Are we sure blk_queue_rq_timeout is never called from anything but
probe?  Either way given that this is a limit that isn't directly
correlated with anything else, simply using WRITE_ONCE/READ_ONCE might
be enough.

> 
> 4. nomerges:
>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>    updated/accessed with bitops which are atomic. So, I believe, protecting
>    it with q->sysfs_lock is not necessary.

Yes.

> 5. nr_requests:
>    Write to this attribute updates the tag sets and this could potentially
>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.

Yeah.

> 6. read_ahead_kb:
>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>    ra_pages is also updated under queue_limits_commit_update() which runs 
>    holding q->limits_lock; so I think this attribute file should be protected
>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>    Maybe we should move it under the same sets of attribute files which today
>    runs with q->limits_lock held.

Yes, limits_lock sounds sensible here.

> 7. rq_affinity:
>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>    protecting it with q->sysfs_lock is not necessary.

Agreed.  Although updating both flags isn't atomic, that should be
harmless in this case (but could use a comment about that).
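
Something like the following, as a sketch of that comment (the wording
is only a suggestion):

    /*
     * QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE are updated with
     * atomic bitops, but the pair is not updated atomically: a reader
     * such as blk_mq_complete_need_ipi() may briefly see one flag
     * updated while the other is not yet.  That transient state is
     * harmless here.
     */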

> 8. scheduler:
>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>    q->sysfs_lock protection.
> 
>    However if we're thinking of protecting sched change/update using q->tag_set->
>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>    file.

Yes.

> 9. wbt_lat_usec:
>    Write to this attribute file updates the various wbt limits and state. This may race 
>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>    blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock
>    and so yes, we need to protect this attribute with q->sysfs_lock.
> 
>    However, as mentioned above, if we're thinking of protecting elevator change/update
>    using q->tag_set->tag_list_lock then we may use q->tag_set->tag_list_lock instead of
>    q->sysfs_lock while reading/writing to this attribute file.

Yes.

> The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and
> those are well protected by the sysfs/kernfs internal lock.

So I think the first step here is to remove the locking from
queue_attr_store/queue_attr_show and move it into the attributes that
still need it in some form, followed by replacing it with the more
suitable locks as defined above.  And maybe with a little bit more
work we can remove sysfs_lock entirely eventually.


* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-06 13:22     ` Nilay Shroff
@ 2025-02-06 14:15       ` Christoph Hellwig
  2025-02-07 11:59       ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-02-06 14:15 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Christoph Hellwig, linux-block, ming.lei, dlemoal, axboe, gjoyce

On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> complain. Agreed we need to hold q->sysfs_lock for multiple
> request queues at the same time and that may not be elegant, but
> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> have any other choice that could help correct the lock order.

Odd, as it's usually very unhappy about nesting locks of the
same kind unless specifically annotated.

> Yes this is probably a good idea: instead of using q->sysfs_lock
> we may depend on q->tag_set->tag_list_lock here for sched/elevator updates,
> given that __blk_mq_update_nr_hw_queues already runs with tag_list_lock
> held.

Yes.

> But then it also requires using the same tag_list_lock instead of 
> the current sysfs_lock while we update the scheduler from sysfs. But that's
> a trivial change.

Yes.  I think it's a good idea, but maybe wait a bit to see if Jens
or Ming also have opinions on this before starting the work.



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-06 14:07       ` Christoph Hellwig
@ 2025-02-07 11:03         ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-07 11:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, ming.lei, dlemoal, axboe, gjoyce



On 2/6/25 7:37 PM, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>> Yes that's possible and so I audited all sysfs attributes which are 
>> currently protected using q->sysfs_lock and I found some interesting
>> facts. Please find below:
>>
>> 1. io_poll:
>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> 
> Yes, those are easy.
> 
>> 3. io_timeout:
>>    Write to this attribute updates q->rq_timeout and read of this attribute
>>    returns the value stored in q->rq_timeout. Moreover, the q->rq_timeout is
>>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>>    even before disk is added. So that means that we may not need to protect
>>    it with q->sysfs_lock.
> 
> Are we sure blk_queue_rq_timeout is never called from anything but
> probe?  Either way given that this is a limit that isn't directly
> correlated with anything else, simply using WRITE_ONCE/READ_ONCE might
> be enough.
I just grepped the source code again and confirmed that blk_queue_rq_timeout is
mostly called from the driver probe method (barring the nbd driver, which may
set q->rq_timeout using an ioctl). And yes, agreed, q->rq_timeout can be
accessed using WRITE_ONCE/READ_ONCE.
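
As a minimal sketch (assuming the disk-based show/store signatures that
blk-sysfs.c uses today), that would look something like:

    static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
    {
            /* pairs with the WRITE_ONCE() in the store path below */
            return sysfs_emit(page, "%u\n",
                    jiffies_to_msecs(READ_ONCE(disk->queue->rq_timeout)));
    }

    static ssize_t queue_io_timeout_store(struct gendisk *disk,
                                          const char *page, size_t count)
    {
            unsigned int val;
            int err = kstrtou32(page, 10, &val);

            if (err || val == 0)
                    return -EINVAL;
            WRITE_ONCE(disk->queue->rq_timeout, msecs_to_jiffies(val));
            return count;
    }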

>>
>> 4. nomerges:
>>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>    it with q->sysfs_lock is not necessary.
> 
> Yes.
> 
>> 5. nr_requests:
>>    Write to this attribute updates the tag sets and this could potentially
>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> 
> Yeah.
> 
>> 6. read_ahead_kb:
>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>    holding q->limits_lock; so I think this attribute file should be protected
>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>    Maybe we should move it under the same sets of attribute files which today
>>    runs with q->limits_lock held.
> 
> Yes, limits_lock sounds sensible here.
> 
>> 7. rq_affinity:
>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>>    protecting it with q->sysfs_lock is not necessary.
> 
> Agreed.  Although updating both flags isn't atomic, that should be
> harmless in this case (but could use a comment about that).
Sure, will add a comment about this.
> 
>> 8. scheduler:
>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>    q->sysfs_lock protection.
>>
>>    However if we're thinking of protecting sched change/update using q->tag_set->
>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>    file.
> 
> Yes.
> 
>> 9. wbt_lat_usec:
>>    Write to this attribute file updates the various wbt limits and state. This may race 
>>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>>    blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock
>>    and so yes, we need to protect this attribute with q->sysfs_lock.
>>
>>    However, as mentioned above, if we're thinking of protecting elevator change/update
>>    using q->tag_set->tag_list_lock then we may use q->tag_set->tag_list_lock instead of
>>    q->sysfs_lock while reading/writing to this attribute file.
> 
> Yes.
> 
>> The rest of the attributes don't seem to require protection from q->sysfs_lock or any other lock, and
>> those are well protected by the sysfs/kernfs internal lock.
> 
> So I think the first step here is to remove the locking from
> queue_attr_store/queue_attr_show and move it into the attributes that
> still need it in some form, followed by replacing it with the more
> suitable locks as defined above.  And maybe with a little bit more
> work we can remove sysfs_lock entirely eventually.
Yes this sounds like a good plan! 

Thank you Christoph for your review! And as you suggested in the other thread,
I'll wait some more time in case Jens and Ming have any further comments
about this change.

Thanks,
--Nilay



* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-06 13:22     ` Nilay Shroff
  2025-02-06 14:15       ` Christoph Hellwig
@ 2025-02-07 11:59       ` Ming Lei
  2025-02-07 18:02         ` Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-07 11:59 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce

On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> > On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
> >>  
> >>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>  		return;
> >>  
> >>  	memflags = memalloc_noio_save();
> >> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> >> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >> +		mutex_lock(&q->sysfs_lock);
> > 
> > This now means we hold up to one sysfs_lock per request queue
> > at the same time.  I doubt lockdep will be happy about this.
> > Did you test this patch with a multi-namespace nvme device or
> > a multi-LU per host SCSI setup?
> > 
> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> complain. Agreed we need to hold q->sysfs_lock for multiple
> request queues at the same time and that may not be elegant, but
> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> have any other choice that could help correct the lock order.

All q->sysfs_lock instances actually share the same lock class, so this way
should have triggered a double lock warning; please see mutex_init().

The ->sysfs_lock involved in this patch looks to be only for syncing the
elevator switch with reallocating hctxs, so I am wondering why not add a
new dedicated lock for this purpose only?

Then we needn't worry about its dependency on q->q_usage_counter(io)?

Thanks,
Ming



* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-07 11:59       ` Ming Lei
@ 2025-02-07 18:02         ` Nilay Shroff
  2025-02-08  8:30           ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-02-07 18:02 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce



On 2/7/25 5:29 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>>>  
>>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>  		return;
>>>>  
>>>>  	memflags = memalloc_noio_save();
>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>> +		mutex_lock(&q->sysfs_lock);
>>>
>>> This now means we hold up to one sysfs_lock per request queue
>>> at the same time.  I doubt lockdep will be happy about this.
>>> Did you test this patch with a multi-namespace nvme device or
>>> a multi-LU per host SCSI setup?
>>>
>> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
>> complain. Agreed we need to hold q->sysfs_lock for multiple
>> request queues at the same time and that may not be elegant, but
>> looking at the mess in __blk_mq_update_nr_hw_queues we may not
>> have any other choice that could help correct the lock order.
> 
> All q->sysfs_lock instances actually share the same lock class, so this way
> should have triggered a double lock warning; please see mutex_init().
> 
Well, my understanding about lockdep is that even though all q->sysfs_lock
instances share the same lock class key, lockdep differentiates locks 
based on their memory address. Since each instance of &q->sysfs_lock has
a different memory address, lockdep treats each of them as a distinct lock
and, IMO, that avoids triggering the double lock warning.

> The ->sysfs_lock involved in this patch looks to be only for syncing the
> elevator switch with reallocating hctxs, so I am wondering why not add a
> new dedicated lock for this purpose only?
> 
> Then we needn't worry about its dependency on q->q_usage_counter(io)?
> 
Yes that should be possible, but then, as Christoph suggested,
__blk_mq_update_nr_hw_queues already runs holding tag_list_lock, so why
shouldn't we use the same tag_list_lock even for sched/elevator updates?
That way we may avoid adding another new lock.
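
I.e., roughly the following in elv_iosched_store() (just a sketch of the
idea, assuming disk->queue->tag_set is valid at this point):

    struct blk_mq_tag_set *set = disk->queue->tag_set;

    strscpy(elevator_name, buf, sizeof(elevator_name));
    /* serialize against __blk_mq_update_nr_hw_queues() */
    mutex_lock(&set->tag_list_lock);
    ret = elevator_change(disk->queue, strstrip(elevator_name));
    mutex_unlock(&set->tag_list_lock);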

Thanks,
--Nilay


* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-07 18:02         ` Nilay Shroff
@ 2025-02-08  8:30           ` Ming Lei
  2025-02-08 13:18             ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-08  8:30 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce

On Fri, Feb 07, 2025 at 11:32:37PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/7/25 5:29 PM, Ming Lei wrote:
> > On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> >>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
> >>>>  
> >>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>>>  		return;
> >>>>  
> >>>>  	memflags = memalloc_noio_save();
> >>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> >>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >>>> +		mutex_lock(&q->sysfs_lock);
> >>>
> >>> This now means we hold up to one sysfs_lock per request queue
> >>> at the same time.  I doubt lockdep will be happy about this.
> >>> Did you test this patch with a multi-namespace nvme device or
> >>> a multi-LU per host SCSI setup?
> >>>
> >> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
> >> complain. Agreed we need to hold q->sysfs_lock for multiple
> >> request queues at the same time and that may not be elegant, but
> >> looking at the mess in __blk_mq_update_nr_hw_queues we may not
> >> have any other choice that could help correct the lock order.
> > 
> > All q->sysfs_lock instances actually share the same lock class, so this way
> > should have triggered a double lock warning; please see mutex_init().
> > 
> Well, my understanding about lockdep is that even though all q->sysfs_lock
> instances share the same lock class key, lockdep differentiates locks 
> based on their memory address. Since each instance of &q->sysfs_lock has
> a different memory address, lockdep treats each of them as a distinct lock
> and, IMO, that avoids triggering the double lock warning.

That isn't correct; think about how lockdep can deal with millions of
lock instances.

Please take a look at the beginning of Documentation/locking/lockdep-design.rst

```
The validator tracks the 'usage state' of lock-classes, and it tracks
the dependencies between different lock-classes.
```

Please verify it by the following code:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e76651e786d..a4ffc6198e7b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5150,10 +5150,37 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
 		cancel_delayed_work_sync(&hctx->run_work);
 }

+struct lock_test {
+	struct mutex	lock;
+};
+
+void init_lock_test(struct lock_test *lt)
+{
+	mutex_init(&lt->lock);
+	printk("init lock: %p\n", lt);
+}
+
+static void test_lockdep(void)
+{
+	struct lock_test A, B;
+
+	init_lock_test(&A);
+	init_lock_test(&B);
+
+	printk("start lock test\n");
+	mutex_lock(&A.lock);
+	mutex_lock(&B.lock);
+	mutex_unlock(&B.lock);
+	mutex_unlock(&A.lock);
+	printk("end lock test\n");
+}
+
 static int __init blk_mq_init(void)
 {
 	int i;

+	test_lockdep();
+
 	for_each_possible_cpu(i)
 		init_llist_head(&per_cpu(blk_cpu_done, i));
 	for_each_possible_cpu(i)



Thanks,
Ming



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-06 13:54     ` Nilay Shroff
  2025-02-06 14:07       ` Christoph Hellwig
@ 2025-02-08 10:41       ` Ming Lei
  2025-02-08 12:56         ` Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-08 10:41 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce

On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> > On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> >> The sysfs attributes are already protected with sysfs/kernfs internal
> >> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
> >> attribute files. This change therefore removes the q->sysfs_lock
> >> acquisition from the sysfs attribute show/store paths.
> > 
> > the sysfs/kernfs locking only protects against other accesses using
> > sysfs.  But that's not really the most interesting part here.  We
> > also want to make sure nothing changes underneath in a way that
> > could cause crashes (and maybe even torn information).
> > 
> > We'll really need to audit what is accessed in each method and figure
> > out what protects it.  Chances are that sysfs_lock provides that
> > protection in some case right now, and chances are also very high
> > that a lot of this is pretty broken.
> > 
> Yes that's possible and so I audited all sysfs attributes which are 
> currently protected using q->sysfs_lock and I found some interesting
> facts. Please find below:
> 
> 1. io_poll:
>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> 
> 2. io_poll_delay:
>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> 
> 3. io_timeout:
>    Write to this attribute updates q->rq_timeout and read of this attribute
>    returns the value stored in q->rq_timeout. Moreover, the q->rq_timeout is
>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>    even before disk is added. So that means that we may not need to protect
>    it with q->sysfs_lock.
> 
> 4. nomerges:
>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>    updated/accessed with bitops which are atomic. So, I believe, protecting
>    it with q->sysfs_lock is not necessary.
> 
> 5. nr_requests:
>    Write to this attribute updates the tag sets and this could potentially
>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> 
> 6. read_ahead_kb:
>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>    ra_pages is also updated under queue_limits_commit_update() which runs 
>    holding q->limits_lock; so I think this attribute file should be protected
>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>    Maybe we should move it under the same sets of attribute files which today
>    runs with q->limits_lock held.
> 
> 7. rq_affinity:
>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>    protecting it with q->sysfs_lock is not necessary.
> 
> 8. scheduler:
>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>    q->sysfs_lock protection.
> 
>    However if we're thinking of protecting sched change/update using q->tag_set->
>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>    file.

This is a misuse of tag_list_lock, which is supposed to cover host-wide
changes and shouldn't be used for request-queue-level protection, which
is exactly what q->sysfs_lock provides.

Not to mention it will cause an ABBA deadlock over the freeze lock; please see
blk_mq_update_nr_hw_queues(). And it can't be used for protecting
'nr_requests' either.
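
For illustration, one possible ordering (a sketch; it assumes a driver,
such as NVMe during reset, that freezes its queues before calling
blk_mq_update_nr_hw_queues()):

    CPU0: elevator update via sysfs       CPU1: driver nr_hw_queues update
      mutex_lock(&set->tag_list_lock);      blk_mq_freeze_queue(q);
      blk_mq_freeze_queue(q); /* waits */   blk_mq_update_nr_hw_queues();
                                              /* waits on tag_list_lock: ABBA */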

> 
> 9. wbt_lat_usec:
>    Write to this attribute file updates the various wbt limits and state. This may race 
>    with blk_mq_exit_sched() or blk_register_queue(). The wbt updates through the 
>    blk_mq_exit_sched() and blk_register_queue() are currently protected with q->sysfs_lock
>    and so yes, we need to protect this attribute with q->sysfs_lock.
> 
>    However, as mentioned above, if we're thinking of protecting elevator change/update
>    using q->tag_set->tag_list_lock then we may use q->tag_set->tag_list_lock instead of
>    q->sysfs_lock while reading/writing to this attribute file.

Same ABBA deadlock with above.


Thanks,
Ming



* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-08 10:41       ` Ming Lei
@ 2025-02-08 12:56         ` Nilay Shroff
  2025-02-09 11:41           ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-02-08 12:56 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce



On 2/8/25 4:11 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>> The sysfs attributes are already protected with sysfs/kernfs internal
>>>> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
>>>> attribute files. This change therefore removes the q->sysfs_lock
>>>> acquisition from the sysfs attribute show/store paths.
>>>
>>> the sysfs/kernfs locking only protects against other accesses using
>>> sysfs.  But that's not really the most interesting part here.  We
>>> also want to make sure nothing changes underneath in a way that
>>> could cause crashes (and maybe even torn information).
>>>
>>> We'll really need to audit what is accessed in each method and figure
>>> out what protects it.  Chances are that sysfs_lock provides that
>>> protection in some case right now, and chances are also very high
>>> that a lot of this is pretty broken.
>>>
>> Yes that's possible and so I audited all sysfs attributes which are 
>> currently protected using q->sysfs_lock and I found some interesting
>> facts. Please find below:
>>
>> 1. io_poll:
>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>
>> 2. io_poll_delay:
>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
>>
>> 3. io_timeout:
>>    Write to this attribute updates q->rq_timeout and read of this attribute
>>    returns the value stored in q->rq_timeout. Moreover, the q->rq_timeout is
>>    set only once when we init the queue (under blk_mq_init_allocated_queue())
>>    even before disk is added. So that means that we may not need to protect
>>    it with q->sysfs_lock.
>>
>> 4. nomerges:
>>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>    anyways doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>    it with q->sysfs_lock is not necessary.
>>
>> 5. nr_requests:
>>    Write to this attribute updates the tag sets and this could potentially
>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>
>> 6. read_ahead_kb:
>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>    holding q->limits_lock; so I think this attribute file should be protected
>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>    Maybe we should move it under the same sets of attribute files which today
>>    runs with q->limits_lock held.
>>
>> 7. rq_affinity:
>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>    using test_bit macro. As read/write to q->flags uses bitops which are atomic, 
>>    protecting it with q->sysfs_lock is not necessary.
>>
>> 8. scheduler:
>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>    q->sysfs_lock protection.
>>
>>    However if we're thinking of protecting sched change/update using q->tag_set->
>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>    file.
> 
> This is a misuse of tag_list_lock, which is supposed to cover host-wide
> changes and shouldn't be used for request-queue-level protection, which
> is exactly what q->sysfs_lock provides.
> 
Yes, I think Christoph also pointed out the same, but then, assuming the
scheduler/elevator update would be a rare operation, it may not cause
a lot of contention. Having said that, I'm also fine with creating another
lock just to protect elevator changes and removing ->sysfs_lock from
the elevator code.

> Not to mention it will cause an ABBA deadlock over the freeze lock; please see
> blk_mq_update_nr_hw_queues(). And it can't be used for protecting
> 'nr_requests' either.
I don't see how this might cause an ABBA deadlock. The proposal here is to
use ->tag_list_lock (instead of ->sysfs_lock) while updating the scheduler
attribute from sysfs as well as while we update the elevator through
__blk_mq_update_nr_hw_queues().

In each code path (either from sysfs attribute update or from nr_hw_queues 
update), we first acquire ->tag_list_lock and then freeze-lock.

Do you see any code path where the above order might not be followed?  	

Thanks,
--Nilay


* Re: [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock
  2025-02-08  8:30           ` Ming Lei
@ 2025-02-08 13:18             ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-08 13:18 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce



On 2/8/25 2:00 PM, Ming Lei wrote:
> On Fri, Feb 07, 2025 at 11:32:37PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/7/25 5:29 PM, Ming Lei wrote:
>>> On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
>>>>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>>>>>  
>>>>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>>  		return;
>>>>>>  
>>>>>>  	memflags = memalloc_noio_save();
>>>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>>>> +		mutex_lock(&q->sysfs_lock);
>>>>>
>>>>> This now means we hold up to one sysfs_lock per request queue
>>>>> at the same time.  I doubt lockdep will be happy about this.
>>>>> Did you test this patch with a multi-namespace nvme device or
>>>>> a multi-LU per host SCSI setup?
>>>>>
>>>> Yeah I tested with a multi namespace NVMe disk and lockdep didn't 
>>>> complain. Agreed we need to hold q->sysfs_lock for multiple
>>>> request queues at the same time and that may not be elegant, but
>>>> looking at the mess in __blk_mq_update_nr_hw_queues we may not
>>>> have any other choice that could help correct the lock order.
>>>
>>> All q->sysfs_lock instances actually share the same lock class, so this way
>>> should have triggered a double lock warning; please see mutex_init().
>>>
>> Well, my understanding about lockdep is that even though all q->sysfs_lock
>> instances share the same lock class key, lockdep differentiates locks 
>> based on their memory address. Since each instance of &q->sysfs_lock has
>> a different memory address, lockdep treats each of them as a distinct lock
>> and, IMO, that avoids triggering the double lock warning.
> 
> That isn't correct; think about how lockdep can deal with millions of
> lock instances.
> 
> Please take a look at the beginning of Documentation/locking/lockdep-design.rst
> 
> ```
> The validator tracks the 'usage state' of lock-classes, and it tracks
> the dependencies between different lock-classes.
> ```
> 
> Please verify it by the following code:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4e76651e786d..a4ffc6198e7b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5150,10 +5150,37 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
>  		cancel_delayed_work_sync(&hctx->run_work);
>  }
> 
> +struct lock_test {
> +	struct mutex	lock;
> +};
> +
> +void init_lock_test(struct lock_test *lt)
> +{
> +	mutex_init(&lt->lock);
> +	printk("init lock: %p\n", lt);
> +}
> +
> +static void test_lockdep(void)
> +{
> +	struct lock_test A, B;
> +
> +	init_lock_test(&A);
> +	init_lock_test(&B);
> +
> +	printk("start lock test\n");
> +	mutex_lock(&A.lock);
> +	mutex_lock(&B.lock);
> +	mutex_unlock(&B.lock);
> +	mutex_unlock(&A.lock);
> +	printk("end lock test\n");
> +}
> +
>  static int __init blk_mq_init(void)
>  {
>  	int i;
> 
> +	test_lockdep();
> +
>  	for_each_possible_cpu(i)
>  		init_llist_head(&per_cpu(blk_cpu_done, i));
>  	for_each_possible_cpu(i)
> 
> 
> 
Thank you, Ming, for providing the patch for testing lockdep!
You and Christoph were correct: lockdep should complain about possible
recursive locking of q->sysfs_lock. After a bit of debugging, I found the
reason why lockdep was unable to complain about recursive locking on my
system: I had enabled KASAN, and KASAN reported a potential use-after-free
bug that tainted the kernel and disabled further lock debugging. Hence any
subsequent locking issues were not detected by lockdep.
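
For reference, this is roughly the mechanism that shuts lockdep off on
taint -- a simplified sketch of what kernel/panic.c and lib/debug_locks.c
do, not verbatim kernel code:

	/* lockdep only performs checks while debug_locks is set */
	int debug_locks = 1;

	int debug_locks_off(void)
	{
		if (debug_locks) {
			debug_locks = 0;
			return 1;
		}
		return 0;
	}

	void add_taint(unsigned flag, enum lockdep_ok lockdep_ok)
	{
		/* KASAN taints with LOCKDEP_NOW_UNRELIABLE, so every later
		 * lockdep check (recursive locking included) is skipped */
		if (lockdep_ok == LOCKDEP_NOW_UNRELIABLE && debug_locks_off())
			pr_warn("Disabling lock debugging due to kernel taint\n");
		set_bit(flag, &tainted_mask);
	}

So a quick way to spot this situation is to check dmesg for the
"Disabling lock debugging due to kernel taint" message after a KASAN splat.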

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-08 12:56         ` Nilay Shroff
@ 2025-02-09 11:41           ` Ming Lei
  2025-02-09 13:41             ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-02-09 11:41 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce

On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
> 
> 
> On 2/8/25 4:11 PM, Ming Lei wrote:
> > On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
> >>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
> >>>> The sysfs attributes are already protected with sysfs/kernfs internal
> >>>> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
> >>>> attribute files. This change therefore avoids holding q->sysfs_lock while
> >>>> accessing sysfs attribute files.
> >>>
> >>> the sysfs/kernfs locking only protects against other accesses using
> >>> sysfs.  But that's not really the most interesting part here.  We
> >>> also want to make sure nothing changes underneath in a way that
> >>> could cause crashes (and maybe even torn information).
> >>>
> >>> We'll really need to audit what is accessed in each method and figure
> >>> out what protects it.  Chances are that sysfs_lock provides that
> >>> protection in some cases right now, and chances are also very high
> >>> that a lot of this is pretty broken.
> >>>
> >> Yes that's possible and so I audited all sysfs attributes which are 
> >> currently protected using q->sysfs_lock and I found some interesting
> >> facts. Please find below:
> >>
> >> 1. io_poll:
> >>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
> >>
> >> 2. io_poll_delay:
> >>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
> >>
> >> 3. io_timeout:
>>    Write to this attribute updates q->rq_timeout and a read of this attribute
>>    returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is
>>    set only once when we init the queue (under blk_mq_init_allocated_queue()),
>>    even before the disk is added. So that means we may not need to protect
>>    it with q->sysfs_lock.
> >>
> >> 4. nomerges:
> >>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
> >>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>    anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
> >>    updated/accessed with bitops which are atomic. So, I believe, protecting
> >>    it with q->sysfs_lock is not necessary.
> >>
> >> 5. nr_requests:
> >>    Write to this attribute updates the tag sets and this could potentially
> >>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
> >>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
> >>
> >> 6. read_ahead_kb:
> >>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
> >>    ra_pages is also updated under queue_limits_commit_update() which runs 
> >>    holding q->limits_lock; so I think this attribute file should be protected
> >>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>    Maybe we should move it under the same set of attribute files which today
>>    run with q->limits_lock held.
> >>
> >> 7. rq_affinity:
> >>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
> >>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>    using the test_bit macro. As reads/writes of q->flags use bitops which are
>>    atomic, protecting it with q->sysfs_lock is not necessary.
> >>
> >> 8. scheduler:
> >>    Write to this attribute actually updates q->elevator and the elevator change/switch 
> >>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
> >>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
> >>    q->sysfs_lock protection.
> >>
>>    However, if we're thinking of protecting sched change/update using q->tag_set->
> >>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
> >>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
> >>    file.
> > 
> > This is one misuse of tag_list_lock, which is supposed to cover host-wide
> > changes, and shouldn't be used for request-queue-level protection, which
> > is exactly what q->sysfs_lock provides.
> > 
> Yes, I think Christoph also pointed out the same, but assuming a
> scheduler/elevator update would be a rare operation, it may not cause
> a lot of contention. Having said that, I'm also fine creating another
> lock just to protect elevator changes and removing ->sysfs_lock from
> elevator code.

Then please use a new lock.

> 
> > Not to mention it will cause an ABBA deadlock with the freeze lock; please
> > see blk_mq_update_nr_hw_queues(). And it can't be used for protecting
> > 'nr_requests' either.
> I don't know how this might cause an ABBA deadlock. The proposal here is to
> use ->tag_list_lock (instead of ->sysfs_lock) while updating scheduler 
> attribute from sysfs as well as while we update the elevator through 
> __blk_mq_update_nr_hw_queues().
> 
> In each code path (either from sysfs attribute update or from nr_hw_queues 
> update), we first acquire ->tag_list_lock and then freeze-lock.
> 
> Do you see any code path where the above order might not be followed?  	

Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk addition/removal")
added one such warning: blk_mq_sysfs_unregister() is called after the
queue freeze lock is grabbed in del_gendisk().

Also, there are many such use cases in nvme: blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset()
are called after the tagset is frozen.

More serious: a driver may grab ->tag_list_lock in its error recovery code
to provide forward progress, so you have to be careful wrt. using
->tag_list_lock, for example:

	mutex_lock(->tag_list_lock)
	blk_mq_freeze_queue()	// if an IO timeout happens, the driver's
				// timeout handler gets stuck on
				// mutex_lock(->tag_list_lock)
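
Spelled out as two contexts -- a hedged sketch with hypothetical function
names, not code from any real driver:

	/* context A: a sysfs store path using the proposed ->tag_list_lock */
	static void switch_sched(struct request_queue *q)
	{
		mutex_lock(&q->tag_set->tag_list_lock);	/* step 1: take lock */
		blk_mq_freeze_queue(q);			/* step 2: wait for IO */
		/* ... switch the elevator ... */
		blk_mq_unfreeze_queue(q);
		mutex_unlock(&q->tag_set->tag_list_lock);
	}

	/* context B: a driver's timeout/error recovery handler */
	static void recover_tagset(struct blk_mq_tag_set *set)
	{
		/* the freeze in context A completes only after this handler
		 * has dealt with the timed-out request, but the handler is
		 * itself stuck here waiting for context A to drop the lock */
		mutex_lock(&set->tag_list_lock);
		/* ... walk set->tag_list and recover each queue ... */
		mutex_unlock(&set->tag_list_lock);
	}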


Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes
  2025-02-09 11:41           ` Ming Lei
@ 2025-02-09 13:41             ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-02-09 13:41 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, linux-block, dlemoal, axboe, gjoyce



On 2/9/25 5:11 PM, Ming Lei wrote:
> On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/8/25 4:11 PM, Ming Lei wrote:
>>> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>>>> The sysfs attributes are already protected with sysfs/kernfs internal
> >>>>>> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
> >>>>>> attribute files. This change therefore avoids holding q->sysfs_lock while
> >>>>>> accessing sysfs attribute files.
>>>>>
>>>>> the sysfs/kernfs locking only protects against other accesses using
>>>>> sysfs.  But that's not really the most interesting part here.  We
>>>>> also want to make sure nothing changes underneath in a way that
>>>>> could cause crashes (and maybe even torn information).
>>>>>
>>>>> We'll really need to audit what is accessed in each method and figure
>>>>> out what protects it.  Chances are that sysfs_lock provides that
> >>>>> protection in some cases right now, and chances are also very high
>>>>> that a lot of this is pretty broken.
>>>>>
>>>> Yes that's possible and so I audited all sysfs attributes which are 
>>>> currently protected using q->sysfs_lock and I found some interesting
>>>> facts. Please find below:
>>>>
>>>> 1. io_poll:
>>>>    Write to this attribute is ignored. So, we don't need q->sysfs_lock.
>>>>
>>>> 2. io_poll_delay:
>>>>    Write to this attribute is NOP, so we don't need q->sysfs_lock.
>>>>
>>>> 3. io_timeout:
>>>>    Write to this attribute updates q->rq_timeout and a read of this attribute
>>>>    returns the value stored in q->rq_timeout. Moreover, q->rq_timeout is
>>>>    set only once when we init the queue (under blk_mq_init_allocated_queue()),
>>>>    even before the disk is added. So that means we may not need to protect
>>>>    it with q->sysfs_lock.
>>>>
>>>> 4. nomerges:
>>>>    Write to this attribute file updates two q->flags : QUEUE_FLAG_NOMERGES 
>>>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio-merge which
>>>>    anyway doesn't run with q->sysfs_lock held. Moreover, the q->flags are 
>>>>    updated/accessed with bitops which are atomic. So, I believe, protecting
>>>>    it with q->sysfs_lock is not necessary.
>>>>
>>>> 5. nr_requests:
>>>>    Write to this attribute updates the tag sets and this could potentially
>>>>    race with __blk_mq_update_nr_hw_queues(). So I think we should really 
>>>>    protect it with q->tag_set->tag_list_lock instead of q->sysfs_lock.
>>>>
>>>> 6. read_ahead_kb:
>>>>    Write to this attribute file updates disk->bdi->ra_pages. The disk->bdi->
>>>>    ra_pages is also updated under queue_limits_commit_update() which runs 
>>>>    holding q->limits_lock; so I think this attribute file should be protected
>>>>    with q->limits_lock and protecting it with q->sysfs_lock is not necessary. 
>>>>    Maybe we should move it under the same set of attribute files which today
>>>>    run with q->limits_lock held.
>>>>
>>>> 7. rq_affinity:
>>>>    Write to this attribute file makes atomic updates to q->flags: QUEUE_FLAG_SAME_COMP
>>>>    and QUEUE_FLAG_SAME_FORCE. These flags are also accessed from blk_mq_complete_need_ipi()
>>>>    using the test_bit macro. As reads/writes of q->flags use bitops which are
>>>>    atomic, protecting it with q->sysfs_lock is not necessary.
>>>>
>>>> 8. scheduler:
>>>>    Write to this attribute actually updates q->elevator and the elevator change/switch 
>>>>    code expects that the q->sysfs_lock is held while we update the iosched to protect 
>>>>    against the simultaneous __blk_mq_update_nr_hw_queues update. So yes, this field needs 
>>>>    q->sysfs_lock protection.
>>>>
>>>>    However, if we're thinking of protecting sched change/update using q->tag_set->
>>>>    tag_list_lock (as discussed in another thread), then we may use q->tag_set->
>>>>    tag_list_lock instead of q->sysfs_lock here while reading/writing to this attribute
>>>>    file.
>>>
>>> This is one misuse of tag_list_lock, which is supposed to cover host-wide
>>> changes, and shouldn't be used for request-queue-level protection, which
>>> is exactly what q->sysfs_lock provides.
>>>
>> Yes, I think Christoph also pointed out the same, but assuming a
>> scheduler/elevator update would be a rare operation, it may not cause
>> a lot of contention. Having said that, I'm also fine creating another
>> lock just to protect elevator changes and removing ->sysfs_lock from
>> elevator code.
> 
> Then please use a new lock.
Okay, I will replace q->sysfs_lock with a dedicated lock for synchronizing
elevator switches and nr_hw_queues updates; that should eliminate the
dependency between q->q_usage_counter(io) (the freeze lock) and
q->sysfs_lock. A rough sketch of the idea follows below.
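
Something along these lines -- a sketch only, where the lock name
"elevator_lock" and its exact placement are provisional:

	/* a queue-level mutex dedicated to elevator changes, taken by both
	 * the sysfs 'scheduler' store path and the nr_hw_queues update path;
	 * it would live in struct request_queue */
	struct mutex elevator_lock;

	static int elv_update_locked(struct request_queue *q,
				     struct elevator_type *new_e)
	{
		int ret;

		mutex_lock(&q->elevator_lock);	/* replaces q->sysfs_lock here */
		blk_mq_freeze_queue(q);		/* order: elevator_lock -> freeze */
		ret = elevator_switch(q, new_e);
		blk_mq_unfreeze_queue(q);
		mutex_unlock(&q->elevator_lock);
		return ret;
	}

The lock order would then consistently be elevator_lock -> freeze lock,
matching what patch 1 in this series establishes for sysfs_lock today.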
> 
>>
>>> Not to mention it will cause an ABBA deadlock with the freeze lock; please
>>> see blk_mq_update_nr_hw_queues(). And it can't be used for protecting
>>> 'nr_requests' either.
>> I don't know how this might cause an ABBA deadlock. The proposal here is to
>> use ->tag_list_lock (instead of ->sysfs_lock) while updating scheduler 
>> attribute from sysfs as well as while we update the elevator through 
>> __blk_mq_update_nr_hw_queues().
>>
>> In each code path (either from sysfs attribute update or from nr_hw_queues 
>> update), we first acquire ->tag_list_lock and then freeze-lock.
>>
>> Do you see any code path where the above order might not be followed?  	
> 
> Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk addition/removal")
> added one such warning: blk_mq_sysfs_unregister() is called after the
> queue freeze lock is grabbed in del_gendisk().
> 
> Also, there are many such use cases in nvme: blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset()
> are called after the tagset is frozen.
> 
> More serious: a driver may grab ->tag_list_lock in its error recovery code
> to provide forward progress, so you have to be careful wrt. using
> ->tag_list_lock, for example:
> 
> 	mutex_lock(->tag_list_lock)
> 	blk_mq_freeze_queue()	// if an IO timeout happens, the driver's
> 				// timeout handler gets stuck on
> 				// mutex_lock(->tag_list_lock)

OK, got it! But let's wait a bit to see whether Christoph or others have any
further comments before I start making this change.

Thanks,
--Nilay 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-02-09 13:41 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-05 14:44 [PATCH 0/2] block: fix lock order and remove redundant locking Nilay Shroff
2025-02-05 14:44 ` [PATCH 1/2] block: fix lock ordering between the queue ->sysfs_lock and freeze-lock Nilay Shroff
2025-02-05 15:59   ` Christoph Hellwig
2025-02-06 13:22     ` Nilay Shroff
2025-02-06 14:15       ` Christoph Hellwig
2025-02-07 11:59       ` Ming Lei
2025-02-07 18:02         ` Nilay Shroff
2025-02-08  8:30           ` Ming Lei
2025-02-08 13:18             ` Nilay Shroff
2025-02-05 14:44 ` [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes Nilay Shroff
2025-02-05 15:53   ` Christoph Hellwig
2025-02-06 13:54     ` Nilay Shroff
2025-02-06 14:07       ` Christoph Hellwig
2025-02-07 11:03         ` Nilay Shroff
2025-02-08 10:41       ` Ming Lei
2025-02-08 12:56         ` Nilay Shroff
2025-02-09 11:41           ` Ming Lei
2025-02-09 13:41             ` Nilay Shroff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).