From: Tal Zussman <tz2294@columbia.edu>
To: Jens Axboe <axboe@kernel.dk>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Christian Brauner <brauner@kernel.org>,
"Darrick J. Wong" <djwong@kernel.org>,
Carlos Maiolino <cem@kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <dgc@kernel.org>,
Bart Van Assche <bvanassche@acm.org>,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, Gao Xiang <xiang@kernel.org>,
Tal Zussman <tz2294@columbia.edu>
Subject: [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices
Date: Thu, 14 May 2026 17:51:13 -0400
Message-ID: <20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu>
Add support for using RWF_DONTCACHE with block devices.
Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context. To fix this, we defer
dropbehind invalidation to task context. Add infrastructure that lets
bi_end_io callbacks run from a worker, in two forms:
1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
   upfront that the callback needs task context, as in the dropbehind
   writeback paths.
2. bio_complete_in_task(), a helper that callbacks can invoke from
   bi_end_io() when the decision to defer is dynamic, as in iomap
   fserror reporting.
These queue the bio to a per-CPU batch and schedule a delayed work item
to do bio completion.
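
The two forms can be modelled in userspace roughly as follows. This is
only a sketch, not the code from patch 1: every model_* name is
hypothetical, a single batch stands in for the per-CPU batches, a plain
function call stands in for the delayed work item, and the exact
conditions under which the real flag and helper defer are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these types or names are the kernel's. */
struct model_bio {
	struct model_bio *next;              /* link in the deferred batch */
	void (*end_io)(struct model_bio *);  /* stands in for bi_end_io */
	bool complete_in_task;               /* stands in for BIO_COMPLETE_IN_TASK */
	bool failed;                         /* stands in for an I/O error status */
	int done;                            /* number of finished completions */
	bool ran_in_task;                    /* context of the final completion */
};

/* A single batch; the series keeps one per CPU. */
static struct {
	struct model_bio *head, *tail;
	bool work_scheduled;
} model_batch;

/* Stands in for an atomic-context predicate such as bio_in_atomic(). */
static bool model_in_atomic;

/* Append to the batch and note that the worker has to run. */
static void model_queue(struct model_bio *bio)
{
	bio->next = NULL;
	if (model_batch.tail)
		model_batch.tail->next = bio;
	else
		model_batch.head = bio;
	model_batch.tail = bio;
	model_batch.work_scheduled = true;
}

/* Form 2: called from end_io; true means "queued, return now". */
static bool model_complete_in_task(struct model_bio *bio)
{
	if (!model_in_atomic)
		return false;
	model_queue(bio);
	return true;
}

/* Form 1: the completion path checks the flag before calling end_io. */
static void model_bio_endio(struct model_bio *bio)
{
	if (bio->complete_in_task && model_in_atomic) {
		model_queue(bio);
		return;
	}
	bio->end_io(bio);
}

/* Worker: runs in task context and completes everything queued so far. */
static void model_run_worker(void)
{
	struct model_bio *bio = model_batch.head;

	model_batch.head = model_batch.tail = NULL;
	model_batch.work_scheduled = false;
	model_in_atomic = false;
	while (bio) {
		struct model_bio *next = bio->next;

		bio->end_io(bio);
		bio = next;
	}
}

/* A callback that only needs task context when the I/O failed. */
static void model_end_io(struct model_bio *bio)
{
	if (bio->failed && model_complete_in_task(bio))
		return;  /* the worker calls us again later */
	assert(!bio->failed || !model_in_atomic);
	bio->ran_in_task = !model_in_atomic;
	bio->done++;
}
```

In this model, a bio with complete_in_task set never reaches its
callback while model_in_atomic is true, whereas a bio without the flag
enters the callback immediately and is queued only if the callback asks
for it. Either way the callback finishes from model_run_worker().
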
Patch 1 adds the block layer task-context completion infrastructure,
with both the flag and the procedural helper. This builds on top of
suggestions by Matthew and Christoph: the procedural helper and
bio_in_atomic() come from Christoph's "bio completion in task
enhancements / experiments" series [1].
[Christoph, I put you down as Suggested-by for this patch. Let me know
if you'd like it to be Co-authored-by with your sign-off.]
Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for dropbehind
folios, removes IOMAP_IOEND_DONTCACHE, and removes the DONTCACHE
workqueue deferral from XFS.
Patch 3 sets up DONTCACHE support for buffer-head-based I/O by setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.
Patch 4 enables RWF_DONTCACHE for block devices based on the previous
support. This support is useful for databases that operate on raw block
devices, among other userspace applications.
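
From userspace, an application would request this behaviour per call
through pwritev2(2) or preadv2(2). The sketch below shows one way that
might look. write_dontcache() and dontcache_roundtrip() are hypothetical
helper names, the RWF_DONTCACHE fallback definition is only there for
older libc headers, and the demo writes to a temporary file rather than
a block device because opening a block device needs privileges. On
kernels or files without RWF_DONTCACHE support the helper falls back to
an ordinary buffered write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080 /* UAPI value; older libc headers lack it */
#endif

/*
 * Write len bytes at offset off, asking the kernel to drop the pages
 * from the page cache once writeback completes. *used_dontcache reports
 * whether the flag was accepted. Hypothetical helper, not an existing API.
 */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off,
			       int *used_dontcache)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(used_dontcache != NULL);
	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	*used_dontcache = ret >= 0;
	/* Unsupported flag, file, or syscall: retry as a normal write. */
	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL || errno == ENOSYS))
		ret = pwritev2(fd, &iov, 1, off, 0);
	return ret;
}

/* Round-trip a small buffer through a temporary file; 0 on success. */
static int dontcache_roundtrip(void)
{
	char path[] = "/tmp/dontcache-XXXXXX";
	const char msg[] = "dropbehind";
	char back[sizeof(msg)];
	int used, ok, fd = mkstemp(path);

	if (fd < 0)
		return -1;
	ok = write_dontcache(fd, msg, sizeof(msg), 0, &used) == (ssize_t)sizeof(msg) &&
	     pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	     memcmp(back, msg, sizeof(msg)) == 0;
	close(fd);
	unlink(path);
	return ok ? 0 : -1;
}
```

A database operating on a raw device would pass a file descriptor for
the block device instead of the temporary file used here.
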
I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.
Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a memcg of size
244MB to force reclaim pressure.
Results:
===== READS (/dev/nvme0n1p2) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1        1098.6          1609.0
   2        1270.3          1506.6
   3        1093.3          1576.5
   4        1141.8          2393.9
   5        1365.3          2793.8
   6        1324.6          2065.9
   7         879.6          1920.7
   8        1434.1          1662.4
   9        1184.9          1857.9
  10        1166.4          1702.8
  11        1161.4          1653.4
  12        1086.9          1555.4
  13        1198.5          1718.9
  14        1111.9          1752.2
----  ------------  --------------
 avg        1173.7          1828.8  (+56%)
==== WRITES (/dev/nvme0n1p3) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1         692.4          9297.7
   2        4810.8          9342.8
   3        5221.7          2955.2
   4         396.7          8488.3
   5        7249.2          9249.3
   6        6695.4          1376.2
   7         122.9          9125.8
   8        5486.5          9414.7
   9        6921.5          8743.5
  10          27.9          8997.8
----  ------------  --------------
 avg        3762.5          7699.1  (+105%)
[1]: https://lore.kernel.org/all/20260409160243.1008358-1-hch@lst.de/
---
Changes in v6:
- Remove RFC tag
- Rebase on v7.1-rc3.
- 1/4: Revert to using a bio_list, per Jens.
- 1/4: Restructure and simplify work function loop.
- 1/4: Expose both the flag and procedural version, in order to allow
static and dynamic deferral decisions, per conversation with Matthew
and Christoph at LSFMM.
- 1/4: Use bio_in_atomic() predicate, per Christoph.
- 1/4: Use the CPU hot-unplug protocol from mm/vmstat.c, to take into
account use of delayed_work.
- 1/4: Mark the workqueue WQ_PERCPU.
- 1/4: Add comments.
- 3/4 and 4/4: Split into two patches, per Christoph.
- 3/4: Drop the cont_write_begin() change. Block devices don't go
through cont_write_begin(), so it was out of scope and was left over
from v1.
- Link to v5: https://lore.kernel.org/r/20260408-blk-dontcache-v5-0-0f080c20a96f@columbia.edu
Changes in v5:
- 1/3: Replace local_lock + bio_list with struct llist, per Dave.
- 1/3: Use delayed_work with 1-jiffy delay, per Dave.
- 1/3: Add dedicated workqueue to avoid deadlocks, per Christoph.
- 1/3: Restructure work function as do/while loop and only schedule work
when the list was previously empty, per Jens.
- 2/3: Delete IOMAP_IOEND_DONTCACHE and its NOMERGE entry, per Matthew
and Christoph.
- Link to v4: https://lore.kernel.org/r/20260325-blk-dontcache-v4-0-c4b56db43f64@columbia.edu
Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
DONTCACHE folios, removing the need for XFS-specific workqueue
deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@columbia.edu
Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@columbia.edu
Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@columbia.edu
---
Tal Zussman (4):
      block: add task-context bio completion infrastructure
      iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
      buffer: add dropbehind writeback support
      block: enable RWF_DONTCACHE for block devices

 block/bio.c                 | 147 +++++++++++++++++++++++++++++++++++++++++++-
 block/fops.c                |   5 +-
 fs/buffer.c                 |  19 +++++-
 fs/iomap/ioend.c            |   5 +-
 fs/xfs/xfs_aops.c           |   4 --
 include/linux/bio.h         |  32 ++++++++++
 include/linux/blk_types.h   |   1 +
 include/linux/buffer_head.h |   3 +
 include/linux/iomap.h       |   5 +-
 9 files changed, 206 insertions(+), 15 deletions(-)
---
base-commit: 695fee9be55747935d0a7b58f3d1fb83397a8b4f
change-id: 20260218-blk-dontcache-338133dd045e
Best regards,
--
Tal Zussman <tz2294@columbia.edu>