From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Nilay Shroff <nilay@linux.ibm.com>,
	Yi Zhang <yi.zhang@redhat.com>, Ming Lei <ming.lei@redhat.com>,
	Yu Kuai <yukuai@fnnas.com>, Jens Axboe <axboe@kernel.dk>,
	Sasha Levin <sashal@kernel.org>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.19-6.18] block: break pcpu_alloc_mutex dependency on freeze_lock
Date: Tue, 10 Mar 2026 05:01:09 -0400
Message-ID: <20260310090145.2709021-9-sashal@kernel.org>
In-Reply-To: <20260310090145.2709021-1-sashal@kernel.org>

From: Nilay Shroff <nilay@linux.ibm.com>

[ Upstream commit 539d1b47e935e8384977dd7e5cec370c08b7a644 ]

While the nr_hw_queue update allocates tagset tags, it acquires
->pcpu_alloc_mutex after ->freeze_lock is acquired or the queue is
frozen. This potentially creates a circular dependency involving
->fs_reclaim if reclaim is triggered simultaneously in a code path
which first acquires ->pcpu_alloc_mutex. As the queue is already frozen
while the nr_hw_queue update allocates tagsets, reclaim can't make
forward progress, which could cause a deadlock as reported in the
lockdep splat[1].

Fix this by pre-allocating tagset tags before we freeze the queue
during the nr_hw_queue update. The pre-allocated tagset tags can then be
safely installed and used once the queue is frozen.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs8F=OV9s3La2kEQ34YndgfZP-B5PHS4Z8_b9euKG6J4mw@mail.gmail.com/ [1]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
[axboe: fix brace style issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit fixes

This commit fixes a **potential deadlock** in the block layer's
`__blk_mq_update_nr_hw_queues()` path. The deadlock involves a circular
lock dependency between:

1. **`freeze_lock`** (held when the queue is frozen)
2. **`pcpu_alloc_mutex`** (acquired internally by `alloc_percpu_gfp()`
   during tag allocation via `sbitmap_init_node()`)
3. **`fs_reclaim`** - if memory reclaim is triggered while
   `pcpu_alloc_mutex` is held, and the queue is frozen, forward progress
   is blocked

The fix pre-allocates tagset tags **before** freezing the queue, then
installs them after freeze. This breaks the circular dependency.
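
The two-phase shape of the fix can be sketched in userspace C. This is only
an illustration under stated assumptions: `struct toy_set`, the `toy_*`
functions, and the `frozen` flag are hypothetical stand-ins invented here,
not kernel types or APIs, and `malloc()`/`calloc()` stand in for the
kernel allocations. The point it shows is that every allocation happens
before the "freeze", and the frozen section only swaps pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct blk_mq_tag_set; not the kernel type. */
struct toy_set {
	int **tags;		/* one entry per hardware queue */
	int nr_hw_queues;
	int frozen;		/* models "queues are frozen" */
};

/*
 * Phase 1: must run while NOT frozen, so it is free to allocate.
 * Returns NULL with *err == 0 when no growth is needed, a new array on
 * success, or NULL with *err == -ENOMEM on allocation failure.
 */
static int **toy_prealloc_tags(struct toy_set *set, int new_nr, int *err)
{
	int **new_tags;
	int i;

	assert(!set->frozen);
	*err = 0;
	if (set->nr_hw_queues >= new_nr)
		return NULL;

	new_tags = calloc(new_nr, sizeof(*new_tags));
	if (!new_tags) {
		*err = -ENOMEM;
		return NULL;
	}

	/* Carry over the existing entries; the old array stays in place. */
	if (set->tags)
		memcpy(new_tags, set->tags,
		       set->nr_hw_queues * sizeof(*new_tags));

	for (i = set->nr_hw_queues; i < new_nr; i++) {
		new_tags[i] = malloc(sizeof(int));
		if (!new_tags[i])
			goto out_unwind;
		*new_tags[i] = i;
	}
	return new_tags;

out_unwind:
	/* Free only what this call allocated, then the new array itself. */
	while (--i >= set->nr_hw_queues)
		free(new_tags[i]);
	free(new_tags);
	*err = -ENOMEM;
	return NULL;
}

/* Phase 2: runs while frozen; pointer swaps only, no allocation. */
static void toy_install_tags(struct toy_set *set, int **new_tags, int new_nr)
{
	assert(set->frozen);
	if (new_tags) {
		free(set->tags);
		set->tags = new_tags;
	}
	set->nr_hw_queues = new_nr;
}

static int toy_update_nr_hw_queues(struct toy_set *set, int new_nr)
{
	int err;
	int **new_tags = toy_prealloc_tags(set, new_nr, &err);

	if (err)
		return err;	/* fail before freezing anything */

	set->frozen = 1;
	toy_install_tags(set, new_tags, new_nr);
	set->frozen = 0;
	return 0;
}
```

Calling `toy_update_nr_hw_queues(&set, 4)` on a zero-initialised
`struct toy_set` grows it to four entries. The sketch does not free entries
when shrinking, and it reports errors through an `int *err` out-parameter
rather than the `ERR_PTR()` convention the real patch uses.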

### Bug severity and evidence

- **Lockdep splat**: Reported with a concrete lockdep trace by Yi Zhang
  (Red Hat)
- **Tested-by**: The reporter confirmed the fix works
- **Multiple reviews**: Ming Lei, Yu Kuai, and Jens Axboe (block layer
  maintainer) all reviewed/signed off
- **Block layer core path**: `blk_mq_update_nr_hw_queues()` is called by
  many storage drivers (NVMe, SCSI, etc.) during hardware queue
  reconfiguration
- **Deadlock type**: Not a mere theoretical concern - lockdep actually
  fired, indicating the lock ordering violation is real
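
For readers less familiar with what a lock ordering report means, here is a
toy model of dependency-cycle detection in C. It is not lockdep's
implementation; the `toy_*` names and the three-node graph are hypothetical,
and the edges used in the usage note are a simplified reading of the commit
message rather than the actual splat. Each edge records "lock A was held
while lock B was acquired", and a new edge is rejected if it would close a
cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes named after the ones in the commit message. */
enum toy_lock {
	TOY_FREEZE_LOCK,
	TOY_PCPU_ALLOC_MUTEX,
	TOY_FS_RECLAIM,
	TOY_NR_LOCKS
};

/* edge[a][b] is true when b was acquired while a was held. */
struct toy_graph {
	bool edge[TOY_NR_LOCKS][TOY_NR_LOCKS];
};

/* Depth-first search: can 'to' be reached from 'from' along edges? */
static bool toy_reachable(const struct toy_graph *g, int from, int to,
			  bool *seen)
{
	if (from == to)
		return true;
	if (seen[from])
		return false;
	seen[from] = true;
	for (int next = 0; next < TOY_NR_LOCKS; next++) {
		if (g->edge[from][next] && toy_reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record the dependency held -> acquired.
 * Returns false, without recording it, if 'acquired' already leads back
 * to 'held', because adding the edge would create a circular dependency.
 */
static bool toy_add_dependency(struct toy_graph *g, int held, int acquired)
{
	bool seen[TOY_NR_LOCKS] = { false };

	assert(held >= 0 && held < TOY_NR_LOCKS);
	assert(acquired >= 0 && acquired < TOY_NR_LOCKS);

	if (toy_reachable(g, acquired, held, seen))
		return false;
	g->edge[held][acquired] = true;
	return true;
}
```

With the edges `TOY_PCPU_ALLOC_MUTEX -> TOY_FS_RECLAIM` and
`TOY_FS_RECLAIM -> TOY_FREEZE_LOCK` already recorded, adding
`TOY_FREEZE_LOCK -> TOY_PCPU_ALLOC_MUTEX` is rejected. Moving the allocation
ahead of the freeze means that last edge is never created.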

### Dependency analysis - CRITICAL ISSUE

The commit's diff is entangled with a **long chain of prerequisite
commits** that heavily restructured `__blk_mq_update_nr_hw_queues()`:

| Commit | Description | First in |
|--------|-------------|----------|
| `596dce110b7d` | simplify elevator reattachment | v6.16-rc1 |
| `2d8951aee844` | unfreeze queue if realloc fails | v6.16-rc1 |
| `5989bfe6ac6b` | restore two stage elevator switch | v6.17-rc1 |
| `04225d13aef1` | fix deadlock (sched_tags outside freeze) | v6.17-rc1 |
| `2d82f3bd8910` | fix lockdep warning | v6.17-rc3 |

The function in **v6.12** (latest stable tree) looks completely
different:
- v6.12: freezes queues first, then calls
  `blk_mq_realloc_tag_set_tags()` inside the frozen state
- v7.0 (this commit): has batch sched allocation, restructured elevator
  switching, new flow

The **underlying bug** (allocating tags with `GFP_KERNEL` while queues
are frozen → `pcpu_alloc_mutex` inside freeze) **exists in all stable
kernels**. In v6.12, `blk_mq_realloc_tag_set_tags()` is called at line
4818, well inside the freeze started at line 4803.

However, this exact patch **will not apply** to any current stable tree
without significant rework due to the restructured function flow (the
`blk_mq_alloc_sched_ctx_batch`, `blk_mq_alloc_sched_res_batch`, and the
new elevator switching pattern all don't exist in stable).

### Scope and risk

- **Files changed**: 1 (block/blk-mq.c)
- **Lines changed**: +30/-15 (net diff), but significant structural
  refactoring of two functions
- **Risk**: Moderate - the core logic is sound and well-reviewed, but a
  backport would require crafting a simpler version for the older code

### Verification

- `git show 539d1b47e935` confirmed the upstream commit exists and
  matches
- `git describe --contains 539d1b47e935` → v7.0-rc3 (very recent)
- `git show v6.12:block/blk-mq.c` verified the v6.12 function structure
  differs significantly (freeze before realloc at line 4803-4818)
- `git describe --contains` for prerequisites confirmed they are all
  v6.16+ or v6.17+ (not in any stable tree)
- `git show v6.12:block/blk-mq.c | sed -n '4474,4510p'` confirmed the
  old `blk_mq_realloc_tag_set_tags` function matches the "before"
  version in the diff
- The lore.kernel.org link in the commit references a real lockdep splat
  report
- Traced the allocation chain: `blk_mq_alloc_map_and_rqs` →
  `sbitmap_queue_init_node` → `alloc_percpu_gfp` → `pcpu_alloc_mutex`
  (verified via agent analysis of sbitmap.c)
- Could NOT verify whether a simplified backport has been prepared
  separately (unverified)

### Decision rationale

This is a real deadlock fix in core block layer infrastructure, reported
with concrete lockdep evidence, well-reviewed, and tested. The bug
affects all stable kernels. Despite the dependency chain making the
exact diff incompatible with current stable trees, the fix warrants
backporting - either via this commit with its prerequisite chain, or via
a simplified targeted backport. The deadlock involves the per-cpu
allocator mutex and queue freeze lock, which can cause system hangs
during NVMe/SCSI hardware reconfiguration events that are common in
production.

**YES**

 block/blk-mq.c | 45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 968699277c3d5..3b58dd5876114 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4778,38 +4778,45 @@ static void blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 	}
 }
 
-static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
-				       int new_nr_hw_queues)
+static struct blk_mq_tags **blk_mq_prealloc_tag_set_tags(
+				struct blk_mq_tag_set *set,
+				int new_nr_hw_queues)
 {
 	struct blk_mq_tags **new_tags;
 	int i;
 
 	if (set->nr_hw_queues >= new_nr_hw_queues)
-		goto done;
+		return NULL;
 
 	new_tags = kcalloc_node(new_nr_hw_queues, sizeof(struct blk_mq_tags *),
 				GFP_KERNEL, set->numa_node);
 	if (!new_tags)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	if (set->tags)
 		memcpy(new_tags, set->tags, set->nr_hw_queues *
 		       sizeof(*set->tags));
-	kfree(set->tags);
-	set->tags = new_tags;
 
 	for (i = set->nr_hw_queues; i < new_nr_hw_queues; i++) {
-		if (!__blk_mq_alloc_map_and_rqs(set, i)) {
-			while (--i >= set->nr_hw_queues)
-				__blk_mq_free_map_and_rqs(set, i);
-			return -ENOMEM;
+		if (blk_mq_is_shared_tags(set->flags)) {
+			new_tags[i] = set->shared_tags;
+		} else {
+			new_tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+					set->queue_depth);
+			if (!new_tags[i])
+				goto out_unwind;
 		}
 		cond_resched();
 	}
 
-done:
-	set->nr_hw_queues = new_nr_hw_queues;
-	return 0;
+	return new_tags;
+out_unwind:
+	while (--i >= set->nr_hw_queues) {
+		if (!blk_mq_is_shared_tags(set->flags))
+			blk_mq_free_map_and_rqs(set, new_tags[i], i);
+	}
+	kfree(new_tags);
+	return ERR_PTR(-ENOMEM);
 }
 
 /*
@@ -5093,6 +5100,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	unsigned int memflags;
 	int i;
 	struct xarray elv_tbl;
+	struct blk_mq_tags **new_tags;
 	bool queues_frozen = false;
 
 	lockdep_assert_held(&set->tag_list_lock);
@@ -5127,11 +5135,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		if (blk_mq_elv_switch_none(q, &elv_tbl))
 			goto switch_back;
 
+	new_tags = blk_mq_prealloc_tag_set_tags(set, nr_hw_queues);
+	if (IS_ERR(new_tags))
+		goto switch_back;
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue_nomemsave(q);
 	queues_frozen = true;
-	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
-		goto switch_back;
+	if (new_tags) {
+		kfree(set->tags);
+		set->tags = new_tags;
+	}
+	set->nr_hw_queues = nr_hw_queues;
 
 fallback:
 	blk_mq_update_queue_map(set);
-- 
2.51.0

