* [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
@ 2026-03-01 12:59 Nilay Shroff
2026-03-02 5:11 ` Yu Kuai
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Nilay Shroff @ 2026-03-01 12:59 UTC (permalink / raw)
To: linux-block
Cc: axboe, hch, ming.lei, yi.zhang, yukuai, shinichiro.kawasaki,
gjoyce
While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
after ->freeze_lock is acquired or queue is frozen. This potentially
creates a circular dependency involving ->fs_reclaim if reclaim is
triggered simultaneously in a code path which first acquires ->pcpu_
alloc_mutex. As the queue is already frozen while nr_hw_queue update
allocates tagsets, the reclaim can't forward progress and thus it could
cause a potential deadlock as reported in lockdep splat[1].
Fix this by pre-allocating tagset tags before we freeze queue during
nr_hw_queue update. Later the allocated tagset tags could be safely
installed and used after queue is frozen.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
Changes from v1:
- Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
to distinguish between success and failure. (Yu Kuai)
---
block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9af8c3dec3f6..a27bade45de3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
}
}
-static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
- int new_nr_hw_queues)
+static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
+ struct blk_mq_tag_set *set,
+ int new_nr_hw_queues)
{
struct blk_mq_tags **new_tags;
int i;
if (set->nr_hw_queues >= new_nr_hw_queues)
- goto done;
+ return NULL;
new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
GFP_KERNEL, set->numa_node);
if (!new_tags)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
if (set->tags)
memcpy(new_tags, set->tags, set->nr_hw_queues *
sizeof(*set->tags));
- kfree(set->tags);
- set->tags = new_tags;
for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
- if (!__blk_mq_alloc_map_and_rqs(set, i)) {
- while (--i >= set->nr_hw_queues)
- __blk_mq_free_map_and_rqs(set, i);
- return -ENOMEM;
+ if (blk_mq_is_shared_tags(set->flags))
+ new_tags[i] = set->shared_tags;
+ else {
+ new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+ set->queue_depth);
+ if (!new_tags[i])
+ goto out_unwind;
}
cond_resched();
}
-done:
- set->nr_hw_queues = new_nr_hw_queues;
- return 0;
+ return new_tags;
+out_unwind:
+ while (--i >= set->nr_hw_queues) {
+ if (!blk_mq_is_shared_tags(set->flags))
+ blk_mq_free_map_and_rqs(set, new_tags[i], i);
+ }
+ kfree(new_tags);
+ return ERR_PTR(-ENOMEM);
}
/*
@@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
unsigned int memflags;
int i;
struct xarray elv_tbl;
+ struct blk_mq_tags **new_tags;
bool queues_frozen = false;
lockdep_assert_held(&set->tag_list_lock);
@@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_elv_switch_none(q, &elv_tbl))
goto switch_back;
+ new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
+ if (IS_ERR(new_tags))
+ goto switch_back;
+
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue_nomemsave(q);
queues_frozen = true;
- if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
- goto switch_back;
+ if (new_tags) {
+ kfree(set->tags);
+ set->tags = new_tags;
+ }
+ set->nr_hw_queues = nr_hw_queues;
fallback:
blk_mq_update_queue_map(set);
--
2.53.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
@ 2026-03-02 5:11 ` Yu Kuai
2026-03-02 16:24 ` Jens Axboe
2026-03-02 7:42 ` Yi Zhang
` (2 subsequent siblings)
3 siblings, 1 reply; 6+ messages in thread
From: Yu Kuai @ 2026-03-02 5:11 UTC (permalink / raw)
To: Nilay Shroff, linux-block
Cc: axboe, hch, ming.lei, yi.zhang, shinichiro.kawasaki, gjoyce,
yukuai
Hi,
在 2026/3/1 20:59, Nilay Shroff 写道:
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
> - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
> encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
> to distinguish between success and failure. (Yu Kuai)
> ---
> block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
> 1 file changed, 30 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9af8c3dec3f6..a27bade45de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
> }
> }
>
> -static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
> - int new_nr_hw_queues)
> +static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
> + struct blk_mq_tag_set *set,
> + int new_nr_hw_queues)
> {
> struct blk_mq_tags **new_tags;
> int i;
>
> if (set->nr_hw_queues >= new_nr_hw_queues)
> - goto done;
> + return NULL;
>
> new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
> GFP_KERNEL, set->numa_node);
> if (!new_tags)
> - return -ENOMEM;
> + return ERR_PTR(-ENOMEM);
>
> if (set->tags)
> memcpy(new_tags, set->tags, set->nr_hw_queues *
> sizeof(*set->tags));
> - kfree(set->tags);
> - set->tags = new_tags;
>
> for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
> - if (!__blk_mq_alloc_map_and_rqs(set, i)) {
> - while (--i >= set->nr_hw_queues)
> - __blk_mq_free_map_and_rqs(set, i);
> - return -ENOMEM;
> + if (blk_mq_is_shared_tags(set->flags))
> + new_tags[i] = set->shared_tags;
I think usually we'll keep the braces for if branch.
Otherwise feel free to add:
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
> + else {
> + new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> + set->queue_depth);
> + if (!new_tags[i])
> + goto out_unwind;
> }
> cond_resched();
> }
>
> -done:
> - set->nr_hw_queues = new_nr_hw_queues;
> - return 0;
> + return new_tags;
> +out_unwind:
> + while (--i >= set->nr_hw_queues) {
> + if (!blk_mq_is_shared_tags(set->flags))
> + blk_mq_free_map_and_rqs(set, new_tags[i], i);
> + }
> + kfree(new_tags);
> + return ERR_PTR(-ENOMEM);
> }
>
> /*
> @@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> unsigned int memflags;
> int i;
> struct xarray elv_tbl;
> + struct blk_mq_tags **new_tags;
> bool queues_frozen = false;
>
> lockdep_assert_held(&set->tag_list_lock);
> @@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> if (blk_mq_elv_switch_none(q, &elv_tbl))
> goto switch_back;
>
> + new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
> + if (IS_ERR(new_tags))
> + goto switch_back;
> +
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_freeze_queue_nomemsave(q);
> queues_frozen = true;
> - if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> - goto switch_back;
> + if (new_tags) {
> + kfree(set->tags);
> + set->tags = new_tags;
> + }
> + set->nr_hw_queues = nr_hw_queues;
>
> fallback:
> blk_mq_update_queue_map(set);
--
Thansk,
Kuai
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
2026-03-02 5:11 ` Yu Kuai
@ 2026-03-02 7:42 ` Yi Zhang
2026-03-02 14:05 ` Ming Lei
2026-03-02 16:28 ` Jens Axboe
3 siblings, 0 replies; 6+ messages in thread
From: Yi Zhang @ 2026-03-02 7:42 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, axboe, hch, ming.lei, yukuai, shinichiro.kawasaki,
gjoyce
On Sun, Mar 1, 2026 at 9:00 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
> - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
> encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
> to distinguish between success and failure. (Yu Kuai)
> ---
> block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
> 1 file changed, 30 insertions(+), 15 deletions(-)
Confirmed the issue was fixed with this patch:
Tested-by: Yi Zhang <yi.zhang@redhat.com>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9af8c3dec3f6..a27bade45de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4793,38 +4793,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
> }
> }
>
> -static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
> - int new_nr_hw_queues)
> +static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
> + struct blk_mq_tag_set *set,
> + int new_nr_hw_queues)
> {
> struct blk_mq_tags **new_tags;
> int i;
>
> if (set->nr_hw_queues >= new_nr_hw_queues)
> - goto done;
> + return NULL;
>
> new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
> GFP_KERNEL, set->numa_node);
> if (!new_tags)
> - return -ENOMEM;
> + return ERR_PTR(-ENOMEM);
>
> if (set->tags)
> memcpy(new_tags, set->tags, set->nr_hw_queues *
> sizeof(*set->tags));
> - kfree(set->tags);
> - set->tags = new_tags;
>
> for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
> - if (!__blk_mq_alloc_map_and_rqs(set, i)) {
> - while (--i >= set->nr_hw_queues)
> - __blk_mq_free_map_and_rqs(set, i);
> - return -ENOMEM;
> + if (blk_mq_is_shared_tags(set->flags))
> + new_tags[i] = set->shared_tags;
> + else {
> + new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> + set->queue_depth);
> + if (!new_tags[i])
> + goto out_unwind;
> }
> cond_resched();
> }
>
> -done:
> - set->nr_hw_queues = new_nr_hw_queues;
> - return 0;
> + return new_tags;
> +out_unwind:
> + while (--i >= set->nr_hw_queues) {
> + if (!blk_mq_is_shared_tags(set->flags))
> + blk_mq_free_map_and_rqs(set, new_tags[i], i);
> + }
> + kfree(new_tags);
> + return ERR_PTR(-ENOMEM);
> }
>
> /*
> @@ -5113,6 +5120,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> unsigned int memflags;
> int i;
> struct xarray elv_tbl;
> + struct blk_mq_tags **new_tags;
> bool queues_frozen = false;
>
> lockdep_assert_held(&set->tag_list_lock);
> @@ -5147,11 +5155,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> if (blk_mq_elv_switch_none(q, &elv_tbl))
> goto switch_back;
>
> + new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
> + if (IS_ERR(new_tags))
> + goto switch_back;
> +
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_freeze_queue_nomemsave(q);
> queues_frozen = true;
> - if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> - goto switch_back;
> + if (new_tags) {
> + kfree(set->tags);
> + set->tags = new_tags;
> + }
> + set->nr_hw_queues = nr_hw_queues;
>
> fallback:
> blk_mq_update_queue_map(set);
> --
> 2.53.0
>
--
Best Regards,
Yi Zhang
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
2026-03-02 5:11 ` Yu Kuai
2026-03-02 7:42 ` Yi Zhang
@ 2026-03-02 14:05 ` Ming Lei
2026-03-02 16:28 ` Jens Axboe
3 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2026-03-02 14:05 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, axboe, hch, yi.zhang, yukuai, shinichiro.kawasaki,
gjoyce
On Sun, Mar 1, 2026 at 9:00 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> Fix this by pre-allocating tagset tags before we freeze queue during
> nr_hw_queue update. Later the allocated tagset tags could be safely
> installed and used after queue is frozen.
>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
> - Make blk_mq_prealloc_tag_set_tags() return the tags pointer directly,
> encoding error conditions using ERR_PTR(). Update caller to use IS_ERR()
> to distinguish between success and failure. (Yu Kuai)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
2026-03-02 5:11 ` Yu Kuai
@ 2026-03-02 16:24 ` Jens Axboe
0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-03-02 16:24 UTC (permalink / raw)
To: yukuai, Nilay Shroff, linux-block
Cc: hch, ming.lei, yi.zhang, shinichiro.kawasaki, gjoyce
On 3/1/26 10:11 PM, Yu Kuai wrote:
>> for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
>> - if (!__blk_mq_alloc_map_and_rqs(set, i)) {
>> - while (--i >= set->nr_hw_queues)
>> - __blk_mq_free_map_and_rqs(set, i);
>> - return -ENOMEM;
>> + if (blk_mq_is_shared_tags(set->flags))
>> + new_tags[i] = set->shared_tags;
>
> I think usually we'll keep the braces for if branch.
>
> Otherwise feel free to add:
>
> Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Indeed, I fixed up the style issue.
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
` (2 preceding siblings ...)
2026-03-02 14:05 ` Ming Lei
@ 2026-03-02 16:28 ` Jens Axboe
3 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-03-02 16:28 UTC (permalink / raw)
To: linux-block, Nilay Shroff
Cc: hch, ming.lei, yi.zhang, yukuai, shinichiro.kawasaki, gjoyce
On Sun, 01 Mar 2026 18:29:43 +0530, Nilay Shroff wrote:
> While nr_hw_update allocates tagset tags it acquires ->pcpu_alloc_mutex
> after ->freeze_lock is acquired or queue is frozen. This potentially
> creates a circular dependency involving ->fs_reclaim if reclaim is
> triggered simultaneously in a code path which first acquires ->pcpu_
> alloc_mutex. As the queue is already frozen while nr_hw_queue update
> allocates tagsets, the reclaim can't forward progress and thus it could
> cause a potential deadlock as reported in lockdep splat[1].
>
> [...]
Applied, thanks!
[1/1] block: break pcpu_alloc_mutex dependency on freeze_lock
commit: 539d1b47e935e8384977dd7e5cec370c08b7a644
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-03-02 16:28 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-01 12:59 [PATCHv2] block: break pcpu_alloc_mutex dependency on freeze_lock Nilay Shroff
2026-03-02 5:11 ` Yu Kuai
2026-03-02 16:24 ` Jens Axboe
2026-03-02 7:42 ` Yi Zhang
2026-03-02 14:05 ` Ming Lei
2026-03-02 16:28 ` Jens Axboe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox