public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
@ 2026-03-01 12:59 Nilay Shroff
  2026-03-02  5:11 ` Yu Kuai
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Nilay Shroff @ 2026-03-01 12:59 UTC (permalink / raw)
  To: linux-block
  Cc: axboe, hch, ming.lei, yi.zhang, yukuai, shinichiro.kawasaki,
	gjoyce

While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
after ->freeze_lock is acquired or queue is frozen. This potentially
creates a circular dependency involving ->fs_reclaim if reclaim is
triggered simultaneously in a code path which first acquires ->pcpu_
alloc_mutex. As the queue is already frozen while nr_hw_queue update
allocates tagsets, the reclaim can't forward progress and thus it could
cause a potential deadlock as reported in lockdep splat[1].

Fix this by pre-allocating tagset tags before we freeze queue during
nr_hw_queue update. Later the allocated tagset tags could be safely
installed and used after queue is frozen.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
Changes from v1:
  - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
    encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
    to distinguish between success and failure. (Yu Kuai)
---
 block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9af8c3dec3f6..a27bade45de3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 	}
 }
 
-static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
-				       int new_nr_hw_queues)
+static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
+				struct blk_mq_tag_set *set,
+				int new_nr_hw_queues)
 {
 	struct blk_mq_tags **new_tags;
 	int i;
 
 	if (set->nr_hw_queues >= new_nr_hw_queues)
-		goto done;
+		return NULL;
 
 	new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
 				GFP_KERNEL, set->numa_node);
 	if (!new_tags)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	if (set->tags)
 		memcpy(new_tags, set->tags, set->nr_hw_queues *
 		       sizeof(*set->tags));
-	kfree(set->tags);
-	set->tags = new_tags;
 
 	for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
-		if (!__blk_mq_alloc_map_and_rqs(set, i)) {
-			while (--i >= set->nr_hw_queues)
-				__blk_mq_free_map_and_rqs(set, i);
-			return -ENOMEM;
+		if (blk_mq_is_shared_tags(set->flags))
+			new_tags[i] = set->shared_tags;
+		else {
+			new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+					set->queue_depth);
+			if (!new_tags[i])
+				goto out_unwind;
 		}
 		cond_resched();
 	}
 
-done:
-	set->nr_hw_queues = new_nr_hw_queues;
-	return 0;
+	return new_tags;
+out_unwind:
+	while (--i >= set->nr_hw_queues) {
+		if (!blk_mq_is_shared_tags(set->flags))
+			blk_mq_free_map_and_rqs(set, new_tags[i], i);
+	}
+	kfree(new_tags);
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
@@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	unsigned int memflags;
 	int i;
 	struct xarray elv_tbl;
+	struct blk_mq_tags **new_tags;
 	bool queues_frozen = false;
 
 	lockdep_assert_held(&set->tag_list_lock);
@@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		if (blk_mq_elv_switch_none(q, &elv_tbl))
 			goto switch_back;
 
+	new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
+	if (IS_ERR(new_tags))
+		goto switch_back;
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue_nomemsave(q);
 	queues_frozen = true;
-	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
-		goto switch_back;
+	if (new_tags) {
+		kfree(set->tags);
+		set->tags = new_tags;
+	}
+	set->nr_hw_queues = nr_hw_queues;
 
 fallback:
 	blk_mq_update_queue_map(set);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
  2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
@ 2026-03-02  5:11 ` Yu Kuai
  2026-03-02 16:24   ` Jens Axboe
  2026-03-02  7:42 ` Yi Zhang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Yu Kuai @ 2026-03-02  5:11 UTC (permalink / raw)
  To: Nilay Shroff, linux-block
  Cc: axboe, hch, ming.lei, yi.zhang, shinichiro.kawasaki, gjoyce,
	yukuai

Hi,

在 2026/3/1 20:59, Nilay Shroff 写道:
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
>    - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
>      encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
>      to distinguish between success and failure. (Yu Kuai)
> ---
>   block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
>   1 file changed, 30 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9af8c3dec3f6..a27bade45de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
>   	}
>   }
>   
> -static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
> -				       int new_nr_hw_queues)
> +static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
> +				struct blk_mq_tag_set *set,
> +				int new_nr_hw_queues)
>   {
>   	struct blk_mq_tags **new_tags;
>   	int i;
>   
>   	if (set->nr_hw_queues >= new_nr_hw_queues)
> -		goto done;
> +		return NULL;
>   
>   	new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
>   				GFP_KERNEL, set->numa_node);
>   	if (!new_tags)
> -		return -ENOMEM;
> +		return ERR_PTR(-ENOMEM);
>   
>   	if (set->tags)
>   		memcpy(new_tags, set->tags, set->nr_hw_queues *
>   		       sizeof(*set->tags));
> -	kfree(set->tags);
> -	set->tags = new_tags;
>   
>   	for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
> -		if (!__blk_mq_alloc_map_and_rqs(set, i)) {
> -			while (--i >= set->nr_hw_queues)
> -				__blk_mq_free_map_and_rqs(set, i);
> -			return -ENOMEM;
> +		if (blk_mq_is_shared_tags(set->flags))
> +			new_tags[i] = set->shared_tags;

I think usually we'll keep the braces for if branch.

Otherwise feel free to add:

Reviewed-by: Yu Kuai <yukuai@fnnas.com>

> +		else {
> +			new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> +					set->queue_depth);
> +			if (!new_tags[i])
> +				goto out_unwind;
>   		}
>   		cond_resched();
>   	}
>   
> -done:
> -	set->nr_hw_queues = new_nr_hw_queues;
> -	return 0;
> +	return new_tags;
> +out_unwind:
> +	while (--i >= set->nr_hw_queues) {
> +		if (!blk_mq_is_shared_tags(set->flags))
> +			blk_mq_free_map_and_rqs(set, new_tags[i], i);
> +	}
> +	kfree(new_tags);
> +	return ERR_PTR(-ENOMEM);
>   }
>   
>   /*
> @@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	unsigned int memflags;
>   	int i;
>   	struct xarray elv_tbl;
> +	struct blk_mq_tags **new_tags;
>   	bool queues_frozen = false;
>   
>   	lockdep_assert_held(&set->tag_list_lock);
> @@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		if (blk_mq_elv_switch_none(q, &elv_tbl))
>   			goto switch_back;
>   
> +	new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
> +	if (IS_ERR(new_tags))
> +		goto switch_back;
> +
>   	list_for_each_entry(q, &set->tag_list, tag_set_list)
>   		blk_mq_freeze_queue_nomemsave(q);
>   	queues_frozen = true;
> -	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> -		goto switch_back;
> +	if (new_tags) {
> +		kfree(set->tags);
> +		set->tags = new_tags;
> +	}
> +	set->nr_hw_queues = nr_hw_queues;
>   
>   fallback:
>   	blk_mq_update_queue_map(set);

-- 
Thansk,
Kuai

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
  2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
  2026-03-02  5:11 ` Yu Kuai
@ 2026-03-02  7:42 ` Yi Zhang
  2026-03-02 14:05 ` Ming Lei
  2026-03-02 16:28 ` Jens Axboe
  3 siblings, 0 replies; 6+ messages in thread
From: Yi Zhang @ 2026-03-02  7:42 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-block, axboe, hch, ming.lei, yukuai, shinichiro.kawasaki,
	gjoyce

On Sun, Mar 1, 2026 at 9:00 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
>   - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
>     encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
>     to distinguish between success and failure. (Yu Kuai)
> ---
>  block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
>  1 file changed, 30 insertions(+), 15 deletions(-)

Confirmed the issue was fixed with this patch:

Tested-by: Yi Zhang <yi.zhang@redhat.com>


>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9af8c3dec3f6..a27bade45de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
>         }
>  }
>
> -static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
> -                                      int new_nr_hw_queues)
> +static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
> +                               struct blk_mq_tag_set *set,
> +                               int new_nr_hw_queues)
>  {
>         struct blk_mq_tags **new_tags;
>         int i;
>
>         if (set->nr_hw_queues >= new_nr_hw_queues)
> -               goto done;
> +               return NULL;
>
>         new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
>                                 GFP_KERNEL, set->numa_node);
>         if (!new_tags)
> -               return -ENOMEM;
> +               return ERR_PTR(-ENOMEM);
>
>         if (set->tags)
>                 memcpy(new_tags, set->tags, set->nr_hw_queues *
>                        sizeof(*set->tags));
> -       kfree(set->tags);
> -       set->tags = new_tags;
>
>         for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
> -               if (!__blk_mq_alloc_map_and_rqs(set, i)) {
> -                       while (--i >= set->nr_hw_queues)
> -                               __blk_mq_free_map_and_rqs(set, i);
> -                       return -ENOMEM;
> +               if (blk_mq_is_shared_tags(set->flags))
> +                       new_tags[i] = set->shared_tags;
> +               else {
> +                       new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> +                                       set->queue_depth);
> +                       if (!new_tags[i])
> +                               goto out_unwind;
>                 }
>                 cond_resched();
>         }
>
> -done:
> -       set->nr_hw_queues = new_nr_hw_queues;
> -       return 0;
> +       return new_tags;
> +out_unwind:
> +       while (--i >= set->nr_hw_queues) {
> +               if (!blk_mq_is_shared_tags(set->flags))
> +                       blk_mq_free_map_and_rqs(set, new_tags[i], i);
> +       }
> +       kfree(new_tags);
> +       return ERR_PTR(-ENOMEM);
>  }
>
>  /*
> @@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>         unsigned int memflags;
>         int i;
>         struct xarray elv_tbl;
> +       struct blk_mq_tags **new_tags;
>         bool queues_frozen = false;
>
>         lockdep_assert_held(&set->tag_list_lock);
> @@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>                 if (blk_mq_elv_switch_none(q, &elv_tbl))
>                         goto switch_back;
>
> +       new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
> +       if (IS_ERR(new_tags))
> +               goto switch_back;
> +
>         list_for_each_entry(q, &set->tag_list, tag_set_list)
>                 blk_mq_freeze_queue_nomemsave(q);
>         queues_frozen = true;
> -       if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> -               goto switch_back;
> +       if (new_tags) {
> +               kfree(set->tags);
> +               set->tags = new_tags;
> +       }
> +       set->nr_hw_queues = nr_hw_queues;
>
>  fallback:
>         blk_mq_update_queue_map(set);
> --
> 2.53.0
>


-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
  2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
  2026-03-02  5:11 ` Yu Kuai
  2026-03-02  7:42 ` Yi Zhang
@ 2026-03-02 14:05 ` Ming Lei
  2026-03-02 16:28 ` Jens Axboe
  3 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2026-03-02 14:05 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-block, axboe, hch, yi.zhang, yukuai, shinichiro.kawasaki,
	gjoyce

On Sun, Mar 1, 2026 at 9:00 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
>   - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
>     encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
>     to distinguish between success and failure. (Yu Kuai)

Reviewed-by: Ming Lei <ming.lei@redhat.com>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
  2026-03-02  5:11 ` Yu Kuai
@ 2026-03-02 16:24   ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-03-02 16:24 UTC (permalink / raw)
  To: yukuai, Nilay Shroff, linux-block
  Cc: hch, ming.lei, yi.zhang, shinichiro.kawasaki, gjoyce

On 3/1/26 10:11 PM, Yu Kuai wrote:
>>   	for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
>> -		if (!__blk_mq_alloc_map_and_rqs(set, i)) {
>> -			while (--i >= set->nr_hw_queues)
>> -				__blk_mq_free_map_and_rqs(set, i);
>> -			return -ENOMEM;
>> +		if (blk_mq_is_shared_tags(set->flags))
>> +			new_tags[i] = set->shared_tags;
> 
> I think usually we'll keep the braces for if branch.
> 
> Otherwise feel free to add:
> 
> Reviewed-by: Yu Kuai <yukuai@fnnas.com>

Indeed, I fixed up the style issue.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
  2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
                   ` (2 preceding siblings ...)
  2026-03-02 14:05 ` Ming Lei
@ 2026-03-02 16:28 ` Jens Axboe
  3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-03-02 16:28 UTC (permalink / raw)
  To: linux-block, Nilay Shroff
  Cc: hch, ming.lei, yi.zhang, yukuai, shinichiro.kawasaki, gjoyce


On Sun, 01 Mar 2026 18:29:43 +0530, Nilay Shroff wrote:
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
> 
> [...]

Applied, thanks!

[1/1] block: break pcpu_alloc_mutex dependency on freeze_lock
      commit: 539d1b47e935e8384977dd7e5cec370c08b7a644

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-02 16:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
2026-03-02  5:11 ` Yu Kuai
2026-03-02 16:24   ` Jens Axboe
2026-03-02  7:42 ` Yi Zhang
2026-03-02 14:05 ` Ming Lei
2026-03-02 16:28 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox