From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Yu Kuai <yukuai@fnnas.com>, Nilay Shroff <nilay@linux.ibm.com>,
Hannes Reinecke <hare@suse.de>, Jens Axboe <axboe@kernel.dk>,
Sasha Levin <sashal@kernel.org>,
linux-block@vger.kernel.org
Subject: [PATCH AUTOSEL 6.19-6.18] blk-mq-sched: unify elevators checking for async requests
Date: Wed, 11 Feb 2026 07:30:27 -0500 [thread overview]
Message-ID: <20260211123112.1330287-17-sashal@kernel.org> (raw)
In-Reply-To: <20260211123112.1330287-1-sashal@kernel.org>
From: Yu Kuai <yukuai@fnnas.com>
[ Upstream commit 1db61b0afdd7e8aa9289c423fdff002603b520b5 ]
bfq and mq-deadline treat sync writes as async requests and reserve
tags (via async_depth) only for sync reads; kyber, however, does not
currently treat sync writes as async requests.
Consider the case where there are lots of dirty pages and the user runs
fsync to flush them. In this case sched_tags can be exhausted by sync
writes, and sync reads can get stuck waiting for a tag. Hence let kyber
follow what mq-deadline and bfq do, and unify the async request check
across all elevators.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## Complete Analysis
### 1. COMMIT MESSAGE ANALYSIS
The subject "blk-mq-sched: unify elevators checking for async requests"
sounds like a cleanup/unification commit, but the message body reveals a
**real I/O starvation bug in the kyber scheduler**:
> *"Consider the case where there are lots of dirty pages and the user
runs fsync to flush them. In this case sched_tags can be exhausted by
sync writes, and sync reads can get stuck waiting for a tag."*
This describes a concrete, user-visible problem: sync read starvation
when kyber is the I/O scheduler and fsync flushes dirty pages.
### 2. CODE CHANGE ANALYSIS — The Bug Mechanism
The key to understanding this bug lies in the difference between
`op_is_sync()` and the new `blk_mq_is_sync_read()`:
```470:474:include/linux/blk_types.h
static inline bool op_is_sync(blk_opf_t op)
{
	return (op & REQ_OP_MASK) == REQ_OP_READ ||
		(op & (REQ_SYNC | REQ_FUA | REQ_PREFLUSH));
}
```
`op_is_sync()` returns **true** for both sync reads AND sync writes
(writes with `REQ_SYNC`/`REQ_FUA`/`REQ_PREFLUSH`).
When `fsync()` triggers writeback, writes get `REQ_SYNC` because
`wbc->sync_mode == WB_SYNC_ALL`:
```93:103:include/linux/writeback.h
static inline blk_opf_t wbc_to_write_flags(struct writeback_control *wbc)
{
	blk_opf_t flags = 0;

	if (wbc->sync_mode == WB_SYNC_ALL)
		flags |= REQ_SYNC;
	/* ... */
	return flags;
}
```
**Kyber's bug** — in `kyber_limit_depth()`:
```553:564:block/kyber-iosched.c
static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
{
	/*
	 * We use the scheduler tags as per-hardware queue queueing tokens.
	 * Async requests can be limited at this stage.
	 */
	if (!op_is_sync(opf)) {
		struct kyber_queue_data *kqd = data->q->elevator->elevator_data;

		data->shallow_depth = kqd->async_depth;
	}
}
```
The condition `!op_is_sync(opf)` means only truly async operations get
throttled. Sync writes (such as those issued during fsync writeback)
satisfy `op_is_sync()`, so they get **full depth** with no throttling
and can therefore consume ALL sched_tags.
**mq-deadline and bfq already handle this correctly:**
```493:506:block/mq-deadline.c
static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
{
	struct deadline_data *dd = data->q->elevator->elevator_data;

	/* Do not throttle synchronous reads. */
	if (op_is_sync(opf) && !op_is_write(opf))
		return;

	/* ... throttle everything else, including sync writes ... */
	data->shallow_depth = dd->async_depth;
}
```
Both mq-deadline and bfq use `op_is_sync(opf) && !op_is_write(opf)` to
give full depth only to sync **reads**. Sync writes are properly
throttled by `async_depth`.
**The fix** changes kyber's condition from `!op_is_sync(opf)` to
`!blk_mq_is_sync_read(opf)`, where the new helper is defined as:
```c
static inline bool blk_mq_is_sync_read(blk_opf_t opf)
{
	return op_is_sync(opf) && !op_is_write(opf);
}
```
### 3. BEHAVIORAL IMPACT BY SCHEDULER
| Scheduler | Before | After | Behavioral change? |
|---|---|---|---|
| **kyber** | Only async ops throttled; sync writes get full depth | Everything except sync reads throttled | **YES — this is the bug fix** |
| **bfq** | `op_is_sync(opf) && !op_is_write(opf)` | `blk_mq_is_sync_read(opf)` | **NO — semantically identical** |
| **mq-deadline** | `op_is_sync(opf) && !op_is_write(opf)` | `blk_mq_is_sync_read(opf)` | **NO — semantically identical** |
The bfq and mq-deadline changes are purely cosmetic refactoring. The
actual bug fix is exclusively in kyber.
### 4. BUG SEVERITY
The starvation scenario is concrete and reproducible:
1. System has lots of dirty pages
2. User calls `fsync()` to flush them
3. Lots of sync writes are submitted
4. Kyber gives them full depth (no throttling)
5. All sched_tags consumed by sync writes
6. Sync reads from applications **starve** — they cannot get any tags
7. Read I/O hangs until writes complete
This is a **system responsiveness issue** — applications waiting for
reads (file access, page faults) can hang when another process is doing
heavy fsyncing.
### 5. SCOPE AND RISK
- **Size:** Very small — adds a 4-line helper and changes one condition in each of 3 files
- **Risk for bfq/mq-deadline:** Zero — semantically identical changes
- **Risk for kyber:** Low — the change aligns kyber with the well-established and battle-tested behavior of mq-deadline and bfq. Kyber's `async_depth` is set to 75% of `q->nr_requests` (`KYBER_ASYNC_PERCENT = 75`), so sync writes still get a generous tag allocation, just not an unlimited one
- **Possible regression:** Slight reduction in sync write throughput on kyber (now throttled to 75% of tags instead of 100%), but this is the correct behavior to prevent read starvation
### 6. REVIEW AND TESTING
The commit has two `Reviewed-by` tags from experienced kernel
developers:
- **Nilay Shroff** (IBM) — block layer contributor
- **Hannes Reinecke** (SUSE) — longstanding storage/block subsystem
maintainer
Signed off by **Jens Axboe** — block subsystem maintainer.
### 7. DEPENDENCY ANALYSIS
The commit depends on `42e6c6ce03fd3e` ("lib/sbitmap: convert
shallow_depth from one word to the whole sbitmap") for context lines in
the bfq part (which references `bfqd->async_depths`, renamed from
`bfqd->word_depths` in the prerequisite). However:
- The **core fix** (kyber condition change) is completely independent
- The bfq and mq-deadline changes are cosmetic and could be dropped for
stable
- For stable trees, the kyber fix + helper function could be backported
alone, or the bfq/mq-deadline parts could be adapted to match older
context
### 8. BUG LONGEVITY
The bug has existed since kyber's introduction in commit `00e043936e9a1`
(April 2017, kernel v4.12). The original code already used
`!op_is_sync(op)` which had the same problem. The mq-deadline scheduler
only added tag reservation in v5.15 (commit `07757588e507`, June 2021)
and correctly used `op_is_sync(opf) && !op_is_write(opf)` from the
start. The bfq scheduler similarly had the correct check. Kyber was the
odd one out.
### 9. CONCLUSION
This commit fixes a real I/O starvation bug in the kyber scheduler where
sync writes (from fsync/sync) can exhaust all scheduler tags and starve
sync reads. The fix is:
- **Small and surgical** — one condition change in kyber, plus a trivial
helper function
- **Obviously correct** — it aligns kyber with the proven behavior of
mq-deadline and bfq
- **Low risk** — the bfq/mq-deadline parts are semantically identical;
the kyber change is well-bounded
- **Well-reviewed** — by experienced block subsystem developers and the
subsystem maintainer
- **Fixes a real user-visible bug** — read I/O starvation during fsync-heavy workloads
The only concern is the dependency on a prerequisite for the bfq context
lines, but the core kyber fix is standalone and the bfq/mq-deadline
parts are optional cosmetic refactoring that could be adapted or dropped
for stable trees.
**YES**
block/bfq-iosched.c | 2 +-
block/blk-mq-sched.h | 5 +++++
block/kyber-iosched.c | 2 +-
block/mq-deadline.c | 2 +-
4 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 6e54b1d3d8bc2..9e9d081e86bb2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -697,7 +697,7 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
unsigned int limit, act_idx;
/* Sync reads have full depth available */
- if (op_is_sync(opf) && !op_is_write(opf))
+ if (blk_mq_is_sync_read(opf))
limit = data->q->nr_requests;
else
limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 02c40a72e9598..5678e15bd33c4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -137,4 +137,9 @@ static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
depth);
}
+static inline bool blk_mq_is_sync_read(blk_opf_t opf)
+{
+ return op_is_sync(opf) && !op_is_write(opf);
+}
+
#endif
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index c1b36ffd19ceb..2b3f5b8959af0 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -556,7 +556,7 @@ static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
* We use the scheduler tags as per-hardware queue queueing tokens.
* Async requests can be limited at this stage.
*/
- if (!op_is_sync(opf)) {
+ if (!blk_mq_is_sync_read(opf)) {
struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
data->shallow_depth = kqd->async_depth;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3e3719093aec7..29d00221fbea6 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -495,7 +495,7 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
struct deadline_data *dd = data->q->elevator->elevator_data;
/* Do not throttle synchronous reads. */
- if (op_is_sync(opf) && !op_is_write(opf))
+ if (blk_mq_is_sync_read(opf))
return;
/*
--
2.51.0