* [PATCH AUTOSEL 6.19-6.12] block: decouple secure erase size limit from discard size limit
[not found] <20260211123112.1330287-1-sashal@kernel.org>
@ 2026-02-11 12:30 ` Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] blk-mq-sched: unify elevators checking for async requests Sasha Levin
` (2 subsequent siblings)
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable
Cc: Luke Wang, Ulf Hansson, Jens Axboe, Sasha Levin, linux-block
From: Luke Wang <ziniu.wang_1@nxp.com>
[ Upstream commit ee81212f74a57c5d2b56cf504f40d528dac6faaf ]
Secure erase should use max_secure_erase_sectors instead of being limited
by max_discard_sectors. Separate the handling of REQ_OP_SECURE_ERASE from
REQ_OP_DISCARD to allow each operation to use its own size limit.
Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
One important caller is `blk_insert_cloned_request()`, which device
mapper (dm) uses to submit cloned requests. If
`blk_queue_get_max_sectors()` returns the wrong value (i.e.,
`max_discard_sectors` instead of `max_secure_erase_sectors`) and the
cloned request's sector count exceeds the wrong limit but fits within
the correct one, the request is **rejected with BLK_STS_IOERR** on
line 3291. This is a real I/O error path.
---
## Detailed Analysis
### 1. Commit Message Analysis
The commit clearly describes a **bug fix**: secure erase operations were
incorrectly limited by `max_discard_sectors` instead of
`max_secure_erase_sectors`. This is a correctness issue, not a feature.
The commit was reviewed by Ulf Hansson (MMC maintainer) and merged by
Jens Axboe (block layer maintainer).
### 2. Code Change Analysis
The patch modifies two files with two distinct changes:
**Change A: `block/blk.h` - `blk_queue_get_max_sectors()`**
Before the fix:
```211:213:block/blk.h
	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
		return min(q->limits.max_discard_sectors,
			   UINT_MAX >> SECTOR_SHIFT);
```
Both `REQ_OP_DISCARD` and `REQ_OP_SECURE_ERASE` used
`max_discard_sectors`. The fix separates them so secure erase uses
`max_secure_erase_sectors`.
**Change B: `block/blk-merge.c` - `bio_split_discard()`**
The original `bio_split_discard()` always split using
`lim->max_discard_sectors`, even for `REQ_OP_SECURE_ERASE` (which
reaches this function via `__bio_split_to_limits`, lines 407-409 of
`blk.h`). The fix refactors the function into a wrapper that selects the
correct limit and a helper `__bio_split_discard()` that does the actual
splitting.
### 3. Bug Mechanism and Impact
**The bug**: The kernel `queue_limits` structure has separate fields for
`max_discard_sectors` and `max_secure_erase_sectors`, but the block
layer core code in two critical paths ignored the secure erase field and
always used the discard field.
**Concrete impact scenarios**:
1. **virtio_blk**: This driver reads separate limits from the virtio
config for discard (`max_discard_sectors`) and secure erase
(`max_secure_erase_sectors`). The driver even documents the
workaround: *"The discard and secure erase limits are combined since
the Linux block layer uses the same limit for both commands."*
(virtio_blk.c lines 1336-1341). This means the driver had to
artificially reduce its limits to compensate for the block layer bug.
2. **xen-blkfront**: Sets `max_secure_erase_sectors = UINT_MAX` while
`max_discard_sectors = get_capacity(gd)` - different values.
3. **dm (device mapper)**: The `blk_insert_cloned_request()` function
calls `blk_queue_get_max_sectors()` and rejects requests that exceed
the limit with `BLK_STS_IOERR`. If `max_secure_erase_sectors` >
`max_discard_sectors` on the underlying device, valid secure erase
requests could be rejected with I/O errors. Conversely, if
`max_secure_erase_sectors` < `max_discard_sectors`, oversized
requests could be sent to hardware.
4. **The most dangerous case**: When `max_secure_erase_sectors <
max_discard_sectors`, the bio splitting code won't split the secure
erase bio when it should, sending a request larger than the device
can handle. This can cause **I/O errors, device failures, or data
integrity issues** with secure erase operations.
### 4. Scope and Risk Assessment
- **Lines changed**: ~25 lines of actual logic change across 2 files
- **Files touched**: `block/blk-merge.c` and `block/blk.h` (core block
layer, but well-contained)
- **Risk of regression**: **Low**. For any driver that sets
`max_secure_erase_sectors == max_discard_sectors` (like MMC), the
behavior is completely unchanged. For drivers with different values,
the behavior is corrected.
- **The change is obviously correct**: Secure erase operations should
use secure erase limits, not discard limits.
### 5. Affected Subsystems and Users
- MMC/eMMC (mobile, embedded) - same values, no visible change
- virtio_blk (VMs, cloud) - corrects a workaround-needing limitation
- xen-blkfront (Xen VMs) - corrects a real mismatch
- dm (device-mapper stacking) - fixes potential I/O errors
### 6. Backport Complexity
- For kernels 6.12+: The patch should apply nearly cleanly;
`bio_submit_split()` and the current code structure exist.
- For kernels 6.1-6.11: The `blk.h` change applies cleanly (same
pattern). The `blk-merge.c` change needs adaptation because the
splitting mechanism is different (returns split bio instead of using
`bio_submit_split`), but the conceptual fix is the same.
- For kernel 5.15: `max_secure_erase_sectors` field exists but
`blkdev_issue_secure_erase` was introduced in 6.1; older kernels have
different secure erase code paths.
### 7. Stability Indicators
- Reviewed by Ulf Hansson (MMC subsystem co-maintainer)
- Merged by Jens Axboe (block layer maintainer)
- Simple, obviously correct fix
### 8. Conclusion
This commit fixes a real bug where the block layer core was using the
wrong limit for secure erase operations, potentially causing:
- I/O errors when valid requests are rejected
- Oversized requests being sent to hardware
- Unnecessary performance limitations
The fix is small, contained, obviously correct, and was reviewed by
multiple maintainers. It affects core block layer code that is shared
across all block devices supporting secure erase. The risk of regression
is minimal since the change only affects the code path when
`max_secure_erase_sectors != max_discard_sectors`.
**YES**
block/blk-merge.c | 21 +++++++++++++++++----
block/blk.h | 6 +++++-
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d3115d7469df0..bf8faadb0bd46 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -158,8 +158,9 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
return bio;
}
-struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
- unsigned *nsegs)
+static struct bio *__bio_split_discard(struct bio *bio,
+ const struct queue_limits *lim, unsigned *nsegs,
+ unsigned int max_sectors)
{
unsigned int max_discard_sectors, granularity;
sector_t tmp;
@@ -169,8 +170,7 @@ struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
granularity = max(lim->discard_granularity >> 9, 1U);
- max_discard_sectors =
- min(lim->max_discard_sectors, bio_allowed_max_sectors(lim));
+ max_discard_sectors = min(max_sectors, bio_allowed_max_sectors(lim));
max_discard_sectors -= max_discard_sectors % granularity;
if (unlikely(!max_discard_sectors))
return bio;
@@ -194,6 +194,19 @@ struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
return bio_submit_split(bio, split_sectors);
}
+struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
+ unsigned *nsegs)
+{
+ unsigned int max_sectors;
+
+ if (bio_op(bio) == REQ_OP_SECURE_ERASE)
+ max_sectors = lim->max_secure_erase_sectors;
+ else
+ max_sectors = lim->max_discard_sectors;
+
+ return __bio_split_discard(bio, lim, nsegs, max_sectors);
+}
+
static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim,
bool is_atomic)
{
diff --git a/block/blk.h b/block/blk.h
index e4c433f62dfc7..4cd5a91346d8a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -208,10 +208,14 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
struct request_queue *q = rq->q;
enum req_op op = req_op(rq);
- if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
+ if (unlikely(op == REQ_OP_DISCARD))
return min(q->limits.max_discard_sectors,
UINT_MAX >> SECTOR_SHIFT);
+ if (unlikely(op == REQ_OP_SECURE_ERASE))
+ return min(q->limits.max_secure_erase_sectors,
+ UINT_MAX >> SECTOR_SHIFT);
+
if (unlikely(op == REQ_OP_WRITE_ZEROES))
return q->limits.max_write_zeroes_sectors;
--
2.51.0
* [PATCH AUTOSEL 6.19-6.18] blk-mq-sched: unify elevators checking for async requests
[not found] <20260211123112.1330287-1-sashal@kernel.org>
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] block: decouple secure erase size limit from discard size limit Sasha Levin
@ 2026-02-11 12:30 ` Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] rnbd-srv: Zero the rsp buffer before using it Sasha Levin
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable
Cc: Yu Kuai, Nilay Shroff, Hannes Reinecke, Jens Axboe, Sasha Levin,
linux-block
From: Yu Kuai <yukuai@fnnas.com>
[ Upstream commit 1db61b0afdd7e8aa9289c423fdff002603b520b5 ]
bfq and mq-deadline consider sync writes as async requests and only
reserve tags for sync reads by async_depth, however, kyber doesn't
consider sync writes as async requests for now.
Consider the case there are lots of dirty pages, and user use fsync to
flush dirty pages. In this case sched_tags can be exhausted by sync writes
and sync reads can stuck waiting for tag. Hence let kyber follow what
mq-deadline and bfq did, and unify async requests checking for all
elevators.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## Complete Analysis
### 1. COMMIT MESSAGE ANALYSIS
The subject "blk-mq-sched: unify elevators checking for async requests"
sounds like a cleanup/unification commit, but the message body reveals a
**real I/O starvation bug in the kyber scheduler**:
> *"Consider the case where there are lots of dirty pages and the user
runs fsync to flush them. In this case sched_tags can be exhausted by
sync writes, and sync reads can get stuck waiting for a tag."*
This describes a concrete, user-visible problem: sync read starvation
when kyber is the I/O scheduler and fsync flushes dirty pages.
### 2. CODE CHANGE ANALYSIS — The Bug Mechanism
The key to understanding this bug lies in the difference between
`op_is_sync()` and the new `blk_mq_is_sync_read()`:
```470:474:include/linux/blk_types.h
static inline bool op_is_sync(blk_opf_t op)
{
	return (op & REQ_OP_MASK) == REQ_OP_READ ||
		(op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH));
}
```
`op_is_sync()` returns **true** for both sync reads AND sync writes
(writes with `REQ_SYNC`/`REQ_FUA`/`REQ_PREFLUSH`).
When `fsync()` triggers writeback, writes get `REQ_SYNC` because
`wbc->sync_mode == WB_SYNC_ALL`:
```93:103:include/linux/writeback.h
static inline blk_opf_t wbc_to_write_flags(struct writeback_control *wbc)
{
	blk_opf_t flags = 0;

	if (wbc->sync_mode == WB_SYNC_ALL)
		flags |= REQ_SYNC;
	// ...
}
```
**Kyber's bug** — in `kyber_limit_depth()`:
```553:564:block/kyber-iosched.c
static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
{
	/*
	 * We use the scheduler tags as per-hardware queue queueing tokens.
	 * Async requests can be limited at this stage.
	 */
	if (!op_is_sync(opf)) {
		struct kyber_queue_data *kqd = data->q->elevator->elevator_data;

		data->shallow_depth = kqd->async_depth;
	}
}
```
The condition `!op_is_sync(opf)` means only truly async operations get
throttled. Sync writes (from fsync) pass `op_is_sync()` as true, so they
get **full depth** — no throttling. This means sync writes can consume
ALL sched_tags.
**mq-deadline and bfq already handle this correctly:**
```493:506:block/mq-deadline.c
static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
{
	struct deadline_data *dd = data->q->elevator->elevator_data;

	/* Do not throttle synchronous reads. */
	if (op_is_sync(opf) && !op_is_write(opf))
		return;

	/* ... throttle everything else, including sync writes ... */
	data->shallow_depth = dd->async_depth;
}
```
Both mq-deadline and bfq use `op_is_sync(opf) && !op_is_write(opf)` to
give full depth only to sync **reads**. Sync writes are properly
throttled by `async_depth`.
**The fix:** Changes kyber's condition from `!op_is_sync(opf)` to
`!blk_mq_is_sync_read(opf)`, where the new helper is defined as:
```c
static inline bool blk_mq_is_sync_read(blk_opf_t opf)
{
	return op_is_sync(opf) && !op_is_write(opf);
}
```
### 3. BEHAVIORAL IMPACT BY SCHEDULER
| Scheduler | Before | After | Behavioral Change? |
|---|---|---|---|
| **kyber** | Only async ops throttled; sync writes get full depth | Everything except sync reads throttled | **YES — this is the bug fix** |
| **bfq** | `op_is_sync(opf) && !op_is_write(opf)` | `blk_mq_is_sync_read(opf)` | **NO — semantically identical** |
| **mq-deadline** | `op_is_sync(opf) && !op_is_write(opf)` | `blk_mq_is_sync_read(opf)` | **NO — semantically identical** |
The bfq and mq-deadline changes are purely cosmetic refactoring. The
actual bug fix is exclusively in kyber.
### 4. BUG SEVERITY
The starvation scenario is concrete and reproducible:
1. System has lots of dirty pages
2. User calls `fsync()` to flush them
3. Lots of sync writes are submitted
4. Kyber gives them full depth (no throttling)
5. All sched_tags consumed by sync writes
6. Sync reads from applications **starve** — they cannot get any tags
7. Read I/O hangs until writes complete
This is a **system responsiveness issue** — applications waiting for
reads (file access, page faults) can hang when another process is doing
heavy fsyncing.
### 5. SCOPE AND RISK
- **Size:** Very small — adds a 4-line helper, changes one condition in
each of 3 files
- **Risk for bfq/mq-deadline:** Zero — semantically identical changes
- **Risk for kyber:** Low — the change aligns kyber with the well-
established and battle-tested behavior of mq-deadline and bfq. Kyber's
`async_depth` is set to 75% of `q->nr_requests` (`KYBER_ASYNC_PERCENT
= 75`), so sync writes still get generous tag allocation, just not
unlimited
- **Possible regression:** Slight reduction in sync write throughput on
kyber (now throttled to 75% of tags instead of 100%), but this is the
correct behavior to prevent read starvation
### 6. REVIEW AND TESTING
The commit has two `Reviewed-by` tags from experienced kernel
developers:
- **Nilay Shroff** (IBM) — block layer contributor
- **Hannes Reinecke** (SUSE) — longstanding storage/block subsystem
maintainer
Signed off by **Jens Axboe** — block subsystem maintainer.
### 7. DEPENDENCY ANALYSIS
The commit depends on `42e6c6ce03fd3e` ("lib/sbitmap: convert
shallow_depth from one word to the whole sbitmap") for context lines in
the bfq part (which references `bfqd->async_depths`, renamed from
`bfqd->word_depths` in the prerequisite). However:
- The **core fix** (kyber condition change) is completely independent
- The bfq and mq-deadline changes are cosmetic and could be dropped for
stable
- For stable trees, the kyber fix + helper function could be backported
alone, or the bfq/mq-deadline parts could be adapted to match older
context
### 8. BUG LONGEVITY
The bug has existed since kyber's introduction in commit `00e043936e9a1`
(April 2017, kernel v4.12). The original code already used
`!op_is_sync(op)` which had the same problem. The mq-deadline scheduler
only added tag reservation in v5.15 (commit `07757588e507`, June 2021)
and correctly used `op_is_sync(opf) && !op_is_write(opf)` from the
start. The bfq scheduler similarly had the correct check. Kyber was the
odd one out.
### 9. CONCLUSION
This commit fixes a real I/O starvation bug in the kyber scheduler where
sync writes (from fsync/sync) can exhaust all scheduler tags and starve
sync reads. The fix is:
- **Small and surgical** — one condition change in kyber, plus a trivial
helper function
- **Obviously correct** — it aligns kyber with the proven behavior of
mq-deadline and bfq
- **Low risk** — the bfq/mq-deadline parts are semantically identical;
the kyber change is well-bounded
- **Well-reviewed** — by experienced block subsystem developers and the
subsystem maintainer
- **Fixes a real user-visible bug** — read I/O starvation during fsync-
heavy workloads
The only concern is the dependency on a prerequisite for the bfq context
lines, but the core kyber fix is standalone and the bfq/mq-deadline
parts are optional cosmetic refactoring that could be adapted or dropped
for stable trees.
**YES**
block/bfq-iosched.c | 2 +-
block/blk-mq-sched.h | 5 +++++
block/kyber-iosched.c | 2 +-
block/mq-deadline.c | 2 +-
4 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e54b1d3d8bc2..9e9d081e86bb2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -697,7 +697,7 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
unsigned int limit, act_idx;
/* Sync reads have full depth available */
- if (op_is_sync(opf) && !op_is_write(opf))
+ if (blk_mq_is_sync_read(opf))
limit = data->q->nr_requests;
else
limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 02c40a72e9598..5678e15bd33c4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -137,4 +137,9 @@ static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
depth);
}
+static inline bool blk_mq_is_sync_read(blk_opf_t opf)
+{
+ return op_is_sync(opf) && !op_is_write(opf);
+}
+
#endif
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index c1b36ffd19ceb..2b3f5b8959af0 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -556,7 +556,7 @@ static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
* We use the scheduler tags as per-hardware queue queueing tokens.
* Async requests can be limited at this stage.
*/
- if (!op_is_sync(opf)) {
+ if (!blk_mq_is_sync_read(opf)) {
struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
data->shallow_depth = kqd->async_depth;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3e3719093aec7..29d00221fbea6 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -495,7 +495,7 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
struct deadline_data *dd = data->q->elevator->elevator_data;
/* Do not throttle synchronous reads. */
- if (op_is_sync(opf) && !op_is_write(opf))
+ if (blk_mq_is_sync_read(opf))
return;
/*
--
2.51.0
* [PATCH AUTOSEL 6.19-5.10] blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs()
[not found] <20260211123112.1330287-1-sashal@kernel.org>
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.12] block: decouple secure erase size limit from discard size limit Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] blk-mq-sched: unify elevators checking for async requests Sasha Levin
@ 2026-02-11 12:30 ` Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] rnbd-srv: Zero the rsp buffer before using it Sasha Levin
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable
Cc: Yu Kuai, Nilay Shroff, Ming Lei, Hannes Reinecke, Jens Axboe,
Sasha Levin, linux-block
From: Yu Kuai <yukuai@fnnas.com>
[ Upstream commit 9d20fd6ce1ba9733cd5ac96fcab32faa9fc404dd ]
In blk_mq_update_nr_hw_queues(), debugfs_mutex is not held while
creating debugfs entries for hctxs. Hence take debugfs_mutex there;
this is safe because the queue is not frozen.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## Complete Analysis
### 1. COMMIT MESSAGE ANALYSIS
The commit title is clear: "add missing debugfs_mutex in
blk_mq_debugfs_register_hctxs()". The body explicitly describes the
problem: `debugfs_mutex` is not held while creating debugfs entries for
hctxs during `blk_mq_update_nr_hw_queues()`. The author (Yu Kuai) notes
it's safe to add the mutex there because "queue is not frozen" at that
point (meaning there's no deadlock risk with freeze-related locks).
Three reviewers signed off: Nilay Shroff (IBM), Ming Lei (Red Hat), and
Hannes Reinecke (SUSE) -- all experienced block layer developers. The
merge was by Jens Axboe (block layer maintainer).
### 2. CODE CHANGE ANALYSIS
The diff is minimal -- **2 lines added**:
```c
	mutex_lock(&q->debugfs_mutex);
	queue_for_each_hw_ctx(q, hctx, i)
		blk_mq_debugfs_register_hctx(q, hctx);
	mutex_unlock(&q->debugfs_mutex);
```
#### The Bug Mechanism
The `debugfs_mutex` was introduced in commit `5cf9c91ba927` ("block:
serialize all debugfs operations using q->debugfs_mutex", v5.19) by
Christoph Hellwig. That commit explicitly stated the goal: *"Use the
existing debugfs_mutex to serialize all debugfs operations that rely on
q->debugfs_dir or the directories hanging off it."*
That commit added `lockdep_assert_held(&q->debugfs_mutex)` in many
functions:
- `blk_mq_debugfs_register_sched()` (line 706)
- `blk_mq_debugfs_unregister_sched()` (line 725)
- `blk_mq_debugfs_unregister_rqos()` (line 746)
- `blk_mq_debugfs_register_rqos()` (line 759)
- `blk_mq_debugfs_register_sched_hctx()` (line 777)
- `blk_mq_debugfs_unregister_sched_hctx()` (line 798)
But it missed `blk_mq_debugfs_register_hctxs()`, which is called from
`__blk_mq_update_nr_hw_queues()` at line 5165 of `block/blk-mq.c`
**without** holding `debugfs_mutex`.
All other callers properly hold the mutex:
- `blk_register_queue()` in `blk-sysfs.c` holds `debugfs_mutex` when
calling `blk_mq_debugfs_register()` (which internally calls
`blk_mq_debugfs_register_hctx()`)
- `blk_mq_sched_reg_debugfs()` in `blk-mq-sched.c` holds `debugfs_mutex`
when calling sched debugfs registration
- `rq_qos_add()` in `blk-rq-qos.c` holds `debugfs_mutex` when
registering rqos debugfs
#### The Race
The race is between two concurrent paths:
**Thread A** -- `__blk_mq_update_nr_hw_queues()` →
`blk_mq_debugfs_register_hctxs()` → `blk_mq_debugfs_register_hctx()`:
```656:673:block/blk-mq-debugfs.c
void blk_mq_debugfs_register_hctx(struct request_queue *q,
				  struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_ctx *ctx;
	char name[20];
	int i;

	if (!q->debugfs_dir)
		return;

	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
	// ... creates more debugfs files ...
}
```
**Thread B** -- `blk_unregister_queue()` → `blk_debugfs_remove()`:
```884:895:block/blk-sysfs.c
static void blk_debugfs_remove(struct gendisk *disk)
{
	struct request_queue *q = disk->queue;

	mutex_lock(&q->debugfs_mutex);
	blk_trace_shutdown(q);
	debugfs_remove_recursive(q->debugfs_dir);
	q->debugfs_dir = NULL;
	// ...
	mutex_unlock(&q->debugfs_mutex);
}
```
Without the mutex in Thread A:
1. Thread A checks `q->debugfs_dir` (line 663) -- not NULL, proceeds
2. Thread B acquires `debugfs_mutex`, removes `q->debugfs_dir`, sets it
to NULL
3. Thread A uses the now-stale/freed `q->debugfs_dir` to create child
entries (line 667)
This can result in orphaned debugfs entries, inconsistent debugfs state,
and potentially use of a freed dentry.
### 3. CLASSIFICATION
This is a **synchronization bug fix** -- adding a missing lock
acquisition in a path that was accidentally omitted when the locking
scheme was introduced. It completes the locking protocol established in
commit `5cf9c91ba927`.
### 4. SCOPE AND RISK
- **Size**: 2 lines added (a mutex_lock/mutex_unlock pair, consistent
with the existing pattern)
- **Files touched**: 1 file (`block/blk-mq-debugfs.c`)
- **Subsystem**: Block layer (blk-mq) -- core infrastructure used by all
block device drivers
- **Risk**: Extremely low. The mutex is already held by all other paths.
The commit message explicitly addresses deadlock safety ("queue is not
frozen")
- **Pattern match**: 100% consistent with how every other debugfs
operation in the block layer acquires this lock
### 5. USER IMPACT
The `__blk_mq_update_nr_hw_queues()` path is triggered during:
- CPU hotplug events (adding/removing CPUs changes hw queue count)
- Driver-initiated hw queue resizing (e.g., `null_blk` via configfs
`submit_queues`)
- SCSI/NVMe driver queue changes
This means any system doing CPU hotplug or block device reconfiguration
could hit this race. The impact is concurrent with disk
registration/unregistration, which happens during device discovery and
removal -- both common operations.
### 6. STABILITY INDICATORS
- **3 Reviewed-by tags** from major Linux storage developers
- **Merged by Jens Axboe** (block layer maintainer)
- Part of a series of debugfs locking improvements to the block layer
(see commit `0e94ed336814` protecting `hctx_busy_show`)
### 7. DEPENDENCY CHECK
- **Dependencies**: Requires `debugfs_mutex` in `struct request_queue`,
which has been present since v5.19 (commit `5cf9c91ba927`)
- **Self-contained**: The fix is entirely self-contained -- no other
patches required
- **Applies to**: All stable trees from 5.19 onwards (6.1.y, 6.6.y,
6.12.y, etc.)
### Summary
This commit fixes a missing lock acquisition in
`blk_mq_debugfs_register_hctxs()` that was overlooked when the
`debugfs_mutex` serialization scheme was introduced in v5.19. Every
other debugfs operation in the block layer properly holds this lock. The
fix is:
- **2 lines** of trivially correct code
- **Zero risk** of regression (follows established pattern exactly)
- **Well-reviewed** by 3 experienced block layer developers
- **Fixes a real race** between hw queue updates and debugfs teardown
- **Self-contained** with no dependencies beyond what's in all affected
stable trees
The fix is small, surgical, obviously correct, and meets all stable
kernel criteria.
**YES**
block/blk-mq-debugfs.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4896525b1c054..553d93b88e194 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -686,8 +686,10 @@ void blk_mq_debugfs_register_hctxs(struct request_queue *q)
struct blk_mq_hw_ctx *hctx;
unsigned long i;
+ mutex_lock(&q->debugfs_mutex);
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_debugfs_register_hctx(q, hctx);
+ mutex_unlock(&q->debugfs_mutex);
}
void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
--
2.51.0
* [PATCH AUTOSEL 6.19-5.10] rnbd-srv: Zero the rsp buffer before using it
[not found] <20260211123112.1330287-1-sashal@kernel.org>
` (2 preceding siblings ...)
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-5.10] blk-mq-debugfs: add missing debugfs_mutex in blk_mq_debugfs_register_hctxs() Sasha Levin
@ 2026-02-11 12:30 ` Sasha Levin
3 siblings, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable
Cc: Md Haris Iqbal, Jack Wang, Grzegorz Prajsner, Jens Axboe,
Sasha Levin, linux-block
From: Md Haris Iqbal <haris.iqbal@ionos.com>
[ Upstream commit 69d26698e4fd44935510553809007151b2fe4db5 ]
Before using the data buffer to send back the response message, zero it
completely. This prevents stray bytes from being picked up by the
client side when the message is exchanged between different protocol
versions.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## Complete Analysis
### 1. COMMIT MESSAGE ANALYSIS
The commit message states: "Before using the data buffer to send back
the response message, zero it completely. This prevents stray bytes
from being picked up by the client side when the message is exchanged
between different protocol versions."
While the commit message frames this as a protocol compatibility
concern, the underlying issue is actually an **information leak
vulnerability** — stale kernel memory data is sent over the network to a
remote client.
### 2. CODE CHANGE ANALYSIS
The fix adds two `memset(rsp, 0, sizeof(*rsp))` calls:
**First:** In `rnbd_srv_fill_msg_open_rsp()` before filling the
`rnbd_msg_open_rsp` structure (56 bytes total). Without the memset, 13
bytes are **never explicitly set**:
- `hdr.__padding` (2 bytes) — struct padding field
- `obsolete_rotational` (1 byte) — deprecated field, never written
- `reserved[10]` (10 bytes) — explicitly reserved for future use
**Second:** In `process_msg_sess_info()` before filling the
`rnbd_msg_sess_info_rsp` structure (36 bytes total). Without the memset,
33 bytes are **never explicitly set**:
- `hdr.__padding` (2 bytes) — struct padding field
- `reserved[31]` (31 bytes) — reserved bytes
### 3. THE BUG MECHANISM — CONFIRMED INFORMATION LEAK
Tracing the complete data path reveals this is a real information leak
over the network:
1. **Buffer allocation**: The RDMA chunk pages are allocated via
`alloc_pages(GFP_KERNEL, ...)` in `get_or_create_srv()` (`rtrs-
srv.c:1435`). `alloc_pages` does **not** zero memory (unlike
`__GFP_ZERO` or `get_zeroed_page()`).
2. **Buffer reuse**: The chunk pages (`srv->chunks[buf_id]`) are
allocated once at server initialization and **reused** across all
RDMA operations. Each chunk may contain leftover data from previous
block I/O operations (data read from block devices being served to
other clients).
3. **Response buffer**: The `data` pointer in `rnbd_srv_rdma_ev()` is
`page_address(srv->chunks[buf_id])`, pointing directly into these
non-zeroed, reused RDMA pages.
4. **Client request direction**: Both `send_msg_open()` and
`send_msg_sess_info()` on the client side use the `READ` direction
for RTRS. This means the server processes these via `process_read()`,
setting `id->dir = READ`.
5. **Response sent via RDMA WRITE**: In `rtrs_srv_resp_rdma()`, because
`id->dir == READ` and `sg_cnt != 0`, the `rdma_write_sg()` function
is called. This performs an `IB_WR_RDMA_WRITE` operation, sending the
contents of the server's chunk buffer directly to the client's memory
via RDMA. The DMA mapping is `DMA_BIDIRECTIONAL`, and
`ib_dma_sync_single_for_device()` syncs the full response before
transmission.
6. **What leaks**: The 13 uninitialized bytes in `rnbd_msg_open_rsp` and
33 uninitialized bytes in `rnbd_msg_sess_info_rsp` contain whatever
was previously stored in the reused RDMA chunk page. This could
include **block device data from previous I/O operations** —
potentially data belonging to other clients or other block devices.
### 4. SECURITY IMPACT
This is a **cross-client kernel memory information leak over the
network**:
- The server leaks up to 46 bytes of stale data per control message
exchange
- The stale data can include block device content from other clients'
I/O operations
- The data is transmitted over RDMA (network), reaching a potentially
different machine
- This has been present since RNBD was introduced in Linux 5.8 (commit
`2de6c8de192b9`)
### 5. FIX QUALITY
- **Size**: 2 lines added (`memset` calls), no lines removed
- **Correctness**: Obviously correct — zeroing a buffer before partially
filling it is a standard, well-understood pattern
- **Risk**: Zero regression risk. The memset zeroes the entire
structure, then the code overwrites the relevant fields. This is
strictly safer than the original code.
- **Self-contained**: No dependencies on other commits
- **Tested in mainline**: Authored and reviewed by RNBD maintainers
(IONOS team), merged by Jens Axboe (block layer maintainer)
### 6. APPLICABILITY
RNBD has been present since kernel 5.8. This fix applies to all stable
trees that include RNBD (5.10.y, 5.15.y, 6.1.y, 6.6.y, 6.12.y, etc.).
The patch should apply cleanly since the affected functions haven't
changed structurally.
### 7. CONCLUSION
This is a small, surgical fix for a confirmed information leak
vulnerability where stale kernel memory (potentially containing other
clients' block device data) is sent over the network via RDMA to remote
clients. It meets all stable kernel criteria:
- Fixes a real security/data-integrity bug (information leak over
network)
- Obviously correct (memset before partial initialization)
- Small and contained (2 lines, 1 file)
- No new features or APIs
- Zero regression risk
- Has been in mainline; authored by subsystem developers
**YES**
drivers/block/rnbd/rnbd-srv.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
index 2df8941a6b146..6afac85c110f2 100644
--- a/drivers/block/rnbd/rnbd-srv.c
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -538,6 +538,8 @@ static void rnbd_srv_fill_msg_open_rsp(struct rnbd_msg_open_rsp *rsp,
{
struct block_device *bdev = file_bdev(sess_dev->bdev_file);
+ memset(rsp, 0, sizeof(*rsp));
+
rsp->hdr.type = cpu_to_le16(RNBD_MSG_OPEN_RSP);
rsp->device_id = cpu_to_le32(sess_dev->device_id);
rsp->nsectors = cpu_to_le64(bdev_nr_sectors(bdev));
@@ -644,6 +646,7 @@ static void process_msg_sess_info(struct rnbd_srv_session *srv_sess,
trace_process_msg_sess_info(srv_sess, sess_info_msg);
+ memset(rsp, 0, sizeof(*rsp));
rsp->hdr.type = cpu_to_le16(RNBD_MSG_SESS_INFO_RSP);
rsp->ver = srv_sess->ver;
}
--
2.51.0