* [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices
@ 2026-05-14 21:51 Tal Zussman
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-14 21:51 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang, Tal Zussman
Add support for using RWF_DONTCACHE with block devices.
Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context. To fix this, we defer
dropbehind invalidation to task context. Add infrastructure that lets
bi_end_io callbacks run from a worker, in two forms:
1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
upfront that the callback needs task context, as in the dropbehind
writeback paths.
2. bio_complete_in_task(), a helper that callbacks can invoke from
bi_end_io() when the decision to defer is dynamic, as in iomap
fserror reporting.
These queue the bio to a per-CPU batch and schedule a delayed work item
to do bio completion.
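To illustrate the queueing scheme, here is a minimal userspace sketch. It is
not the kernel code from patch 1: the model_* names are hypothetical stand-ins
for struct bio and the per-CPU batch, and locking, per-CPU placement and the
delayed work item are all left out. It only models two ideas: the producer
arms the work item on the empty to non-empty transition, and the worker
detaches the whole list before running the callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only the fields the model needs. */
struct model_bio {
	struct model_bio *next;
	void (*end_io)(struct model_bio *bio);
	int id;
};

/* Hypothetical stand-in for one CPU's completion batch. */
struct model_batch {
	struct model_bio *head;
	struct model_bio *tail;
	bool work_scheduled;	/* models "delayed work is armed" */
};

static bool model_batch_empty(const struct model_batch *b)
{
	return b->head == NULL;
}

/*
 * Producer side: append the bio and arm the work item only when the list
 * was empty before. Returns true if this call armed the work.
 */
static bool model_queue(struct model_batch *b, struct model_bio *bio)
{
	bool was_empty = model_batch_empty(b);

	bio->next = NULL;
	if (b->tail)
		b->tail->next = bio;
	else
		b->head = bio;
	b->tail = bio;

	if (was_empty)
		b->work_scheduled = true;
	return was_empty;
}

/*
 * Worker side: detach the whole list, run every callback, and repeat until
 * no new bios showed up in the meantime. Returns the number completed.
 */
static int model_drain(struct model_batch *b)
{
	int completed = 0;

	b->work_scheduled = false;
	for (;;) {
		struct model_bio *list = b->head;

		b->head = b->tail = NULL;
		if (!list)
			break;
		while (list) {
			struct model_bio *bio = list;

			list = bio->next;
			bio->end_io(bio);
			completed++;
		}
	}
	return completed;
}

/* Example callback: record completion order so callers can inspect it. */
static int model_completed_ids[8];
static int model_completed_count;

static void model_record(struct model_bio *bio)
{
	assert(model_completed_count < 8);
	model_completed_ids[model_completed_count++] = bio->id;
}
```

Queueing two bios arms the work once, and a single model_drain() call then
completes both in submission order.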
Patch 1 adds the block layer task-context completion infrastructure,
with both the flag and the procedural helper. This builds on top of
suggestions by Matthew and Christoph: the procedural helper and
bio_in_atomic() come from Christoph's "bio completion in task
enhancements / experiments" series [1].
[Christoph, I put you down as Suggested-by for this patch. Let me know
if you'd like it to be Co-authored-by with your sign-off.]
Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for dropbehind
folios, removes IOMAP_IOEND_DONTCACHE, and removes the DONTCACHE
workqueue deferral from XFS.
Patch 3 sets up DONTCACHE support for buffer-head-based I/O by setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.
Patch 4 enables RWF_DONTCACHE for block devices based on the previous
support. This support is useful for databases that operate on raw block
devices, among other userspace applications.
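For reference, a userspace caller might use the flag roughly as sketched
below. This is an illustration, not part of the series: write_dontcache() is
a hypothetical helper name, the fallback policy (retry with a plain pwrite()
when the flag is rejected) is an assumption, and the fallback define is only
for builds against older headers that lack RWF_DONTCACHE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
/* Value from include/uapi/linux/fs.h; older libc headers may lack it. */
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Hypothetical helper: buffered write that asks the kernel to drop the
 * written pages from the page cache once writeback completes. If the
 * kernel or the file does not accept the flag, fall back to a plain
 * pwrite() so the data is still written (assumed policy).
 */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL || len == 0);

	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret < 0 &&
	    (errno == EOPNOTSUPP || errno == EINVAL || errno == ENOSYS))
		ret = pwrite(fd, buf, len, off);
	return ret;
}
```

A database would call this on a file descriptor opened on the block device
(without O_DIRECT) in place of its usual pwrite().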
I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.
Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a memcg of size
244MB to force reclaim pressure.
Results:
===== READS (/dev/nvme0n1p2) =====
 sec    normal MB/s   dontcache MB/s
----   ------------   --------------
   1         1098.6           1609.0
   2         1270.3           1506.6
   3         1093.3           1576.5
   4         1141.8           2393.9
   5         1365.3           2793.8
   6         1324.6           2065.9
   7          879.6           1920.7
   8         1434.1           1662.4
   9         1184.9           1857.9
  10         1166.4           1702.8
  11         1161.4           1653.4
  12         1086.9           1555.4
  13         1198.5           1718.9
  14         1111.9           1752.2
----   ------------   --------------
 avg         1173.7           1828.8  (+56%)

==== WRITES (/dev/nvme0n1p3) =====
 sec    normal MB/s   dontcache MB/s
----   ------------   --------------
   1          692.4           9297.7
   2         4810.8           9342.8
   3         5221.7           2955.2
   4          396.7           8488.3
   5         7249.2           9249.3
   6         6695.4           1376.2
   7          122.9           9125.8
   8         5486.5           9414.7
   9         6921.5           8743.5
  10           27.9           8997.8
----   ------------   --------------
 avg         3762.5           7699.1  (+105%)
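
The percentages in the "avg" rows follow from the two averages in each table.
A small sketch of that arithmetic (percent_gain() is a hypothetical helper
name, not something from the series):

```c
#include <assert.h>

/* Relative throughput gain of "improved" over "baseline", in percent. */
static double percent_gain(double baseline, double improved)
{
	assert(baseline > 0.0);
	return (improved / baseline - 1.0) * 100.0;
}
```

With the averages above, percent_gain(1173.7, 1828.8) is about 55.8 and
percent_gain(3762.5, 7699.1) is about 104.6, which round to the +56% and
+105% shown.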
[1]: https://lore.kernel.org/all/20260409160243.1008358-1-hch@lst.de/
---
Changes in v6:
- Remove RFC tag
- Rebase on v7.1-rc3.
- 1/4: Revert to using a bio_list, per Jens.
- 1/4: Restructure and simplify work function loop.
- 1/4: Expose both the flag and procedural version, in order to allow
static and dynamic deferral decisions, per conversation with Matthew
and Christoph at LSFMM.
- 1/4: Use bio_in_atomic() predicate, per Christoph.
- 1/4: Use the CPU hot-unplug protocol from mm/vmstat.c, to take into
account use of delayed_work.
- 1/4: Mark the workqueue WQ_PERCPU.
- 1/4: Add comments.
- 3/4 and 4/4: Split into two patches, per Christoph.
- 3/4: Drop the cont_write_begin() change. Block devices don't go
through cont_write_begin(), so it was out of scope and was left over
from v1.
- Link to v5: https://lore.kernel.org/r/20260408-blk-dontcache-v5-0-0f080c20a96f@columbia.edu
Changes in v5:
- 1/3: Replace local_lock + bio_list with struct llist, per Dave.
- 1/3: Use delayed_work with 1-jiffy delay, per Dave.
- 1/3: Add dedicated workqueue to avoid deadlocks, per Christoph.
- 1/3: Restructure work function as do/while loop and only schedule work
originally when the list was previously empty, per Jens.
- 2/3: Delete IOMAP_IOEND_DONTCACHE and its NOMERGE entry, per Matthew
and Christoph.
- Link to v4: https://lore.kernel.org/r/20260325-blk-dontcache-v4-0-c4b56db43f64@columbia.edu
Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
DONTCACHE folios, removing the need for XFS-specific workqueue
deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@columbia.edu
Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@columbia.edu
Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@columbia.edu
---
Tal Zussman (4):
block: add task-context bio completion infrastructure
iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
buffer: add dropbehind writeback support
block: enable RWF_DONTCACHE for block devices
block/bio.c | 147 +++++++++++++++++++++++++++++++++++++++++++-
block/fops.c | 5 +-
fs/buffer.c | 19 +++++-
fs/iomap/ioend.c | 5 +-
fs/xfs/xfs_aops.c | 4 --
include/linux/bio.h | 32 ++++++++++
include/linux/blk_types.h | 1 +
include/linux/buffer_head.h | 3 +
include/linux/iomap.h | 5 +-
9 files changed, 206 insertions(+), 15 deletions(-)
---
base-commit: 695fee9be55747935d0a7b58f3d1fb83397a8b4f
change-id: 20260218-blk-dontcache-338133dd045e
Best regards,
--
Tal Zussman <tz2294@columbia.edu>
* [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-14 21:51 [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
@ 2026-05-14 21:51 ` Tal Zussman
2026-05-15 2:38 ` Hillf Danton
` (2 more replies)
2026-05-14 21:51 ` [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
` (2 subsequent siblings)
3 siblings, 3 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-14 21:51 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang, Tal Zussman
Some bio completion handlers need to run from preemptible task context,
but bio_endio() may be called from IRQ context (e.g., buffer_head
writeback). Callers need a way to ensure their callback eventually runs
from a sleepable context. Add infrastructure for that, in two forms:
1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
in advance that its callback needs task context (e.g., dropbehind
writeback). bio_endio() sees the flag and offloads completion to a
worker automatically.
2. bio_complete_in_task(), a helper that completion callbacks can
invoke from within bi_end_io() when the deferral decision is
dynamic (e.g., fserror reporting).
Both share a per-CPU batch list drained by a delayed work item on a
WQ_PERCPU workqueue. Producers push the bio onto the local CPU's batch
and schedule the work item, which then dispatches each bio's bi_end_io()
from task context. The delayed work item uses a 1-jiffy delay to allow
batches of completions to accumulate before processing.
Both methods are gated on bio_in_atomic(), which returns true in any
context where a sleeping bi_end_io() is unsafe, including
non-preemptible task context. This logic is copied from commit
c99fab6e80b7 ("erofs: fix atomic context detection when
!CONFIG_DEBUG_LOCK_ALLOC").
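The dynamic form relies on the callback being invoked a second time from the
worker. A minimal userspace sketch of that re-entry pattern follows. It is a
model under stated assumptions, not the kernel implementation: the dyn_*
names are hypothetical, dyn_in_atomic stands in for bio_in_atomic(), a
single pointer stands in for the per-CPU batch, and deferring only on error
is an assumed policy loosely modelled on the fserror reporting example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio. */
struct dyn_bio {
	void (*end_io)(struct dyn_bio *bio);
	int status;	/* 0 on success, negative errno on failure */
	bool reported;	/* error report (may sleep) has been issued */
	bool done;	/* completion finished */
};

static bool dyn_in_atomic;		/* models bio_in_atomic() */
static struct dyn_bio *dyn_deferred;	/* single-slot stand-in for the batch */

/* Models bio_complete_in_task(): defer only when the context is atomic. */
static bool dyn_complete_in_task(struct dyn_bio *bio)
{
	if (!dyn_in_atomic)
		return false;
	dyn_deferred = bio;
	return true;
}

/*
 * Completion callback. The success path is safe anywhere; the error path
 * needs task context, so it asks for deferral and returns early. The worker
 * calls this function again, and the second time the helper returns false.
 */
static void dyn_end_io(struct dyn_bio *bio)
{
	if (bio->status != 0 && dyn_complete_in_task(bio))
		return;

	if (bio->status != 0)
		bio->reported = true;	/* stands in for sleeping work */
	bio->done = true;
}

/* Models the worker: runs in task context and re-invokes the callback. */
static void dyn_run_worker(void)
{
	struct dyn_bio *bio = dyn_deferred;

	dyn_deferred = NULL;
	dyn_in_atomic = false;
	assert(!dyn_in_atomic);
	if (bio)
		bio->end_io(bio);
}
```

The point of the design is that the callback needs no separate "deferred"
entry point: the same function runs twice, and the context check decides
which half executes.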
Two CPU hotplug callbacks are used to drain remaining bios from the
departing CPU's batch, while maintaining the per-CPU behavior. The
CPUHP_AP_ONLINE_DYN callback disables the per-CPU delayed work while the
CPU is still online, preventing it from running on an unbound worker
later. CPUHP_BP_PREPARE_DYN then drains any bios added between disabling
the work item and CPU offline.
Link: https://lore.kernel.org/all/20260409160243.1008358-1-hch@lst.de/
Suggested-by: Matthew Wilcox <willy@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
block/bio.c | 147 +++++++++++++++++++++++++++++++++++++++++++++-
include/linux/bio.h | 32 ++++++++++
include/linux/blk_types.h | 1 +
3 files changed, 179 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index b8972dba68a0..6864ee737400 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -19,6 +19,7 @@
#include <linux/blk-crypto.h>
#include <linux/xarray.h>
#include <linux/kmemleak.h>
+#include <linux/local_lock.h>
#include <trace/events/block.h>
#include "blk.h"
@@ -1717,6 +1718,79 @@ void bio_check_pages_dirty(struct bio *bio)
}
EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
+/*
+ * Infrastructure for deferring bio completions to task-context via a per-CPU
+ * workqueue. Triggered either by the BIO_COMPLETE_IN_TASK bio flag (static
+ * decision at submit time) or by calling bio_complete_in_task() from
+ * bi_end_io() (dynamic decision at completion time).
+ */
+
+struct bio_complete_batch {
+ local_lock_t lock;
+ struct bio_list list;
+ struct delayed_work work;
+ int cpu;
+};
+
+static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};
+static struct workqueue_struct *bio_complete_wq;
+
+static void bio_complete_work_fn(struct work_struct *w)
+{
+ struct delayed_work *dw = to_delayed_work(w);
+ struct bio_complete_batch *batch =
+ container_of(dw, struct bio_complete_batch, work);
+
+ while (1) {
+ struct bio_list list;
+ struct bio *bio;
+
+ local_lock_irq(&bio_complete_batch.lock);
+ list = batch->list;
+ bio_list_init(&batch->list);
+ local_unlock_irq(&bio_complete_batch.lock);
+
+ if (bio_list_empty(&list))
+ break;
+
+ while ((bio = bio_list_pop(&list)))
+ bio->bi_end_io(bio);
+
+ if (need_resched()) {
+ bool is_empty;
+
+ local_lock_irq(&bio_complete_batch.lock);
+ is_empty = bio_list_empty(&batch->list);
+ local_unlock_irq(&bio_complete_batch.lock);
+ if (!is_empty)
+ mod_delayed_work_on(batch->cpu,
+ bio_complete_wq,
+ &batch->work, 0);
+ break;
+ }
+ }
+}
+
+void __bio_complete_in_task(struct bio *bio)
+{
+ struct bio_complete_batch *batch;
+ unsigned long flags;
+ bool was_empty;
+
+ local_lock_irqsave(&bio_complete_batch.lock, flags);
+ batch = this_cpu_ptr(&bio_complete_batch);
+ was_empty = bio_list_empty(&batch->list);
+ bio_list_add(&batch->list, bio);
+ local_unlock_irqrestore(&bio_complete_batch.lock, flags);
+
+ if (was_empty)
+ mod_delayed_work_on(batch->cpu, bio_complete_wq,
+ &batch->work, 1);
+}
+EXPORT_SYMBOL_GPL(__bio_complete_in_task);
+
static inline bool bio_remaining_done(struct bio *bio)
{
/*
@@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
}
#endif
- if (bio->bi_end_io)
+ if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
+ __bio_complete_in_task(bio);
+ else if (bio->bi_end_io)
bio->bi_end_io(bio);
}
EXPORT_SYMBOL(bio_endio);
@@ -1977,6 +2053,51 @@ int bioset_init(struct bio_set *bs,
}
EXPORT_SYMBOL(bioset_init);
+static int bio_complete_batch_cpu_online(unsigned int cpu)
+{
+ enable_delayed_work(&per_cpu(bio_complete_batch, cpu).work);
+ return 0;
+}
+
+/*
+ * Disable this CPU's delayed work so that it cannot run on an unbound worker
+ * after the CPU is offlined.
+ */
+static int bio_complete_batch_cpu_down_prep(unsigned int cpu)
+{
+ disable_delayed_work_sync(&per_cpu(bio_complete_batch, cpu).work);
+ return 0;
+}
+
+/*
+ * Drain a dead CPU's deferred bio completions. The CPU is dead and the worker
+ * is canceled so no locking is needed.
+ */
+static int bio_complete_batch_cpu_dead(unsigned int cpu)
+{
+ struct bio_complete_batch *batch =
+ per_cpu_ptr(&bio_complete_batch, cpu);
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&batch->list)))
+ bio->bi_end_io(bio);
+
+ return 0;
+}
+
+static void __init bio_complete_batch_init(int cpu)
+{
+ struct bio_complete_batch *batch =
+ per_cpu_ptr(&bio_complete_batch, cpu);
+
+ bio_list_init(&batch->list);
+ INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);
+ batch->cpu = cpu;
+
+ if (!cpu_online(cpu))
+ disable_delayed_work_sync(&batch->work);
+}
+
static int __init init_bio(void)
{
int i;
@@ -1991,6 +2112,30 @@ static int __init init_bio(void)
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
}
+ for_each_possible_cpu(i)
+ bio_complete_batch_init(i);
+
+ bio_complete_wq = alloc_workqueue("bio_complete",
+ WQ_MEM_RECLAIM | WQ_PERCPU, 0);
+ if (!bio_complete_wq)
+ panic("bio: can't allocate bio_complete workqueue\n");
+
+ /*
+ * bio task-context completion draining on hot-unplugged CPUs:
+ *
+ * 1. Stop the per-CPU delayed work while the CPU is still online, so
+ * that it cannot run on an unbound worker later.
+ * 2. Drain leftover bios added between worker disabling and CPU
+ * offlining.
+ */
+ cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "block/bio:complete:online",
+ bio_complete_batch_cpu_online,
+ bio_complete_batch_cpu_down_prep);
+ cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
+ "block/bio:complete:dead",
+ NULL, bio_complete_batch_cpu_dead);
+
cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
bio_cpu_dead);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 97d747320b35..c0214d6c28d6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -369,6 +369,38 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
void submit_bio(struct bio *bio);
+/**
+ * bio_in_atomic - check if the current context is unsafe for bio completion
+ *
+ * Return: %true in atomic contexts (e.g. hard/soft IRQ, preempt-disabled);
+ * %false when a bio can be safely completed in the current context.
+ */
+static inline bool bio_in_atomic(void)
+{
+ if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
+ return true;
+ if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
+ return true;
+ return !preemptible();
+}
+
+void __bio_complete_in_task(struct bio *bio);
+
+/**
+ * bio_complete_in_task - ensure a bio is completed in preemptible task context
+ * @bio: bio to complete
+ *
+ * If called from non-task context, offload the bio completion to a worker
+ * thread and return %true. Else return %false and do nothing.
+ */
+static inline bool bio_complete_in_task(struct bio *bio)
+{
+ if (!bio_in_atomic())
+ return false;
+ __bio_complete_in_task(bio);
+ return true;
+}
+
extern void bio_endio(struct bio *);
static inline void bio_io_error(struct bio *bio)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8808ee76e73c..d49d97a050d0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -322,6 +322,7 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+ BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
BIO_FLAG_LAST
};
--
2.39.5
* [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
2026-05-14 21:51 [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
@ 2026-05-14 21:51 ` Tal Zussman
2026-05-18 6:48 ` Christoph Hellwig
2026-05-14 21:51 ` [PATCH v6 3/4] buffer: add dropbehind writeback support Tal Zussman
2026-05-14 21:51 ` [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
3 siblings, 1 reply; 18+ messages in thread
From: Tal Zussman @ 2026-05-14 21:51 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang, Tal Zussman
Set BIO_COMPLETE_IN_TASK on iomap writeback bios when a dropbehind folio
is added. This ensures that bi_end_io runs in task context, where
folio_end_dropbehind() can safely invalidate folios.
With the bio layer now handling task-context deferral generically,
IOMAP_IOEND_DONTCACHE is no longer needed, as XFS no longer needs to
route DONTCACHE ioends through its completion workqueue. Remove the flag
and its NOMERGE entry.
Without the NOMERGE, regular I/Os that get merged with a dropbehind
folio will also have their completion deferred to task context.
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
fs/iomap/ioend.c | 5 +++--
fs/xfs/xfs_aops.c | 4 ----
include/linux/iomap.h | 5 +----
3 files changed, 4 insertions(+), 10 deletions(-)
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index acf3cf98b23a..892dbfc77ae9 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -237,8 +237,6 @@ ssize_t iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct folio *folio,
if (wpc->iomap.flags & IOMAP_F_SHARED)
ioend_flags |= IOMAP_IOEND_SHARED;
- if (folio_test_dropbehind(folio))
- ioend_flags |= IOMAP_IOEND_DONTCACHE;
if (pos == wpc->iomap.offset && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
ioend_flags |= IOMAP_IOEND_BOUNDARY;
@@ -255,6 +253,9 @@ ssize_t iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct folio *folio,
if (!bio_add_folio(&ioend->io_bio, folio, map_len, poff))
goto new_ioend;
+ if (folio_test_dropbehind(folio))
+ bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);
+
/*
* Clamp io_offset and io_size to the incore EOF so that ondisk
* file size updates in the ioend completion are byte-accurate.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f279055fcea0..0dcf78beae8a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -511,10 +511,6 @@ xfs_ioend_needs_wq_completion(
if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
return true;
- /* Page cache invalidation cannot be done in irq context. */
- if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
- return true;
-
return false;
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2c5685adf3a9..fef04e01116f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -399,16 +399,13 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
#define IOMAP_IOEND_BOUNDARY (1U << 2)
/* is direct I/O */
#define IOMAP_IOEND_DIRECT (1U << 3)
-/* is DONTCACHE I/O */
-#define IOMAP_IOEND_DONTCACHE (1U << 4)
/*
* Flags that if set on either ioend prevent the merge of two ioends.
* (IOMAP_IOEND_BOUNDARY also prevents merges, but only one-way)
*/
#define IOMAP_IOEND_NOMERGE_FLAGS \
- (IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT | \
- IOMAP_IOEND_DONTCACHE)
+ (IOMAP_IOEND_SHARED | IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_DIRECT)
/*
* Structure for writeback I/O completions.
--
2.39.5
* [PATCH v6 3/4] buffer: add dropbehind writeback support
2026-05-14 21:51 [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
2026-05-14 21:51 ` [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-05-14 21:51 ` Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:14 ` Tal Zussman
2026-05-14 21:51 ` [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
3 siblings, 2 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-14 21:51 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang, Tal Zussman
Add block_write_begin_iocb() which threads the kiocb through to
__filemap_get_folio() so that buffer_head-based I/O can use DONTCACHE
behavior. When the iocb has IOCB_DONTCACHE set, FGP_DONTCACHE is
passed to mark the folio for dropbehind. The existing
block_write_begin() is preserved as a wrapper that passes a NULL iocb.
Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() when the folio has
dropbehind set, so that buffer_head writeback completions get deferred
to task context.
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
fs/buffer.c | 19 +++++++++++++++++--
include/linux/buffer_head.h | 3 +++
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b0b3792b1496..d0abaf44d782 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2138,14 +2138,19 @@ EXPORT_SYMBOL(block_commit_write);
*
* The filesystem needs to handle block truncation upon failure.
*/
-int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+int block_write_begin_iocb(const struct kiocb *iocb,
+ struct address_space *mapping, loff_t pos, unsigned len,
struct folio **foliop, get_block_t *get_block)
{
pgoff_t index = pos >> PAGE_SHIFT;
+ fgf_t fgp_flags = FGP_WRITEBEGIN;
struct folio *folio;
int status;
- folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
+ fgp_flags |= FGP_DONTCACHE;
+
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
@@ -2160,6 +2165,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
*foliop = folio;
return status;
}
+
+int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+ struct folio **foliop, get_block_t *get_block)
+{
+ return block_write_begin_iocb(NULL, mapping, pos, len, foliop,
+ get_block);
+}
EXPORT_SYMBOL(block_write_begin);
int block_write_end(loff_t pos, unsigned len, unsigned copied,
@@ -2715,6 +2727,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
+ if (folio_test_dropbehind(bh->b_folio))
+ bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
+
if (IS_ENABLED(CONFIG_FS_ENCRYPTION))
buffer_set_crypto_ctx(bio, bh, GFP_NOIO);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index e4939e33b4b5..4ce50882d621 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -260,6 +260,9 @@ int block_read_full_folio(struct folio *, get_block_t *);
bool block_is_partially_uptodate(struct folio *, size_t from, size_t count);
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
struct folio **foliop, get_block_t *get_block);
+int block_write_begin_iocb(const struct kiocb *iocb,
+ struct address_space *mapping, loff_t pos, unsigned len,
+ struct folio **foliop, get_block_t *get_block);
int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block);
int block_write_end(loff_t pos, unsigned len, unsigned copied, struct folio *);
--
2.39.5
* [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
2026-05-14 21:51 [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
` (2 preceding siblings ...)
2026-05-14 21:51 ` [PATCH v6 3/4] buffer: add dropbehind writeback support Tal Zussman
@ 2026-05-14 21:51 ` Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:17 ` Tal Zussman
3 siblings, 2 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-14 21:51 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang, Tal Zussman
Block device buffered reads and writes already pass through
filemap_read() and iomap_file_buffered_write() respectively, both of
which handle IOCB_DONTCACHE. Enable RWF_DONTCACHE for block device files
by setting FOP_DONTCACHE in def_blk_fops.
For CONFIG_BUFFER_HEAD=y paths, use block_write_begin_iocb() in
blkdev_write_begin() to thread the kiocb through so that buffer_head
writeback gets dropbehind support.
CONFIG_BUFFER_HEAD=n paths are handled by the previously added iomap
BIO_COMPLETE_IN_TASK support.
This support is useful for databases that operate on raw block devices,
among other userspace applications.
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
block/fops.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index bb6642b45937..31b073181d87 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -504,7 +504,8 @@ static int blkdev_write_begin(const struct kiocb *iocb,
unsigned len, struct folio **foliop,
void **fsdata)
{
- return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
+ return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
+ blkdev_get_block);
}
static int blkdev_write_end(const struct kiocb *iocb,
@@ -966,7 +967,7 @@ const struct file_operations def_blk_fops = {
.splice_write = iter_file_splice_write,
.fallocate = blkdev_fallocate,
.uring_cmd = blkdev_uring_cmd,
- .fop_flags = FOP_BUFFER_RASYNC,
+ .fop_flags = FOP_BUFFER_RASYNC | FOP_DONTCACHE,
};
static __init int blkdev_init(void)
--
2.39.5
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
@ 2026-05-15 2:38 ` Hillf Danton
2026-05-18 6:48 ` Christoph Hellwig
2026-05-22 23:09 ` Tal Zussman
2 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2026-05-15 2:38 UTC (permalink / raw)
To: Tal Zussman
Cc: Matthew Wilcox (Oracle), Christoph Hellwig, linux-block,
linux-kernel
On Thu, 14 May 2026 17:51:14 -0400 Tal Zussman wrote:
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct delayed_work *dw = to_delayed_work(w);
> + struct bio_complete_batch *batch =
> + container_of(dw, struct bio_complete_batch, work);
> +
> + while (1) {
> + struct bio_list list;
> + struct bio *bio;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (bio_list_empty(&list))
> + break;
> +
> + while ((bio = bio_list_pop(&list)))
> + bio->bi_end_io(bio);
> +
> + if (need_resched()) {
> + bool is_empty;
> +
Checking resched is not needed, as a workqueue worker can be preempted
while processing bios. Given batch and delayed work, I suspect completing
more than a batch, the bios accumulated within a jiffy, makes sense.
> + local_lock_irq(&bio_complete_batch.lock);
> + is_empty = bio_list_empty(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> + if (!is_empty)
> + mod_delayed_work_on(batch->cpu,
> + bio_complete_wq,
> + &batch->work, 0);
> + break;
> + }
> + }
> +}
> +
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
2026-05-15 2:38 ` Hillf Danton
@ 2026-05-18 6:48 ` Christoph Hellwig
2026-05-22 22:47 ` Tal Zussman
2026-05-22 23:09 ` Tal Zussman
2 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-18 6:48 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On Thu, May 14, 2026 at 05:51:14PM -0400, Tal Zussman wrote:
> Some bio completion handlers need to run from preemptible task context,
> but bio_endio() may be called from IRQ context (e.g., buffer_head
> writeback). Callers need a way to ensure their callback eventually runs
> from a sleepable context. Add infrastructure for that, in two forms:
>
> 1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
> in advance that its callback needs task context (e.g., dropbehind
> writeback). bio_endio() sees the flag and offloads completion to a
> worker automatically.
>
> 2. bio_complete_in_task(), a helper that completion callbacks can
> invoke from within bi_end_io() when the deferral decision is
> dynamic (e.g., fserror reporting).
Note that method 2 is unused as of this series. I do plan to add users
ASAP, and one or two could even land through the block layer in this
merge window.
> Both share a per-CPU batch list drained by a delayed work item on a
> WQ_PERCPU workqueue. Producers push the bio onto the local CPU's batch
> and schedule the work item, which then dispatches each bio's bi_end_io()
> from task context. The delayed work item uses a 1-jiffy delay to allow
> batches of completions to accumulate before processing.
But this 1-jiffy delay also means we unconditionally increase
completion latency, which feels like a bad idea. Do you have any
measurements that show where it does benefit? Note that queueing work
already often has very measurable latency on its own. This also
directly contradicts the erofs experience that even went to an RT
thread to reduce the latency.
> Both methods are gated on bio_in_atomic(), which returns true in any
> context where a sleeping bi_end_io() is unsafe, including
> non-preemptible task context. This logic is copied from commit
> c99fab6e80b7 ("erofs: fix atomic context detection when
> !CONFIG_DEBUG_LOCK_ALLOC").
Let's not copy it, but have a prep patch that moves the erofs logic
into the block layer under the new bio_in_atomic name.
> + while ((bio = bio_list_pop(&list)))
> + bio->bi_end_io(bio);
> +
> + if (need_resched()) {
> + bool is_empty;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + is_empty = bio_list_empty(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> + if (!is_empty)
> + mod_delayed_work_on(batch->cpu,
> + bio_complete_wq,
> + &batch->work, 0);
> + break;
> + }
> + }
On all mainstream architectures we now default to lazy preempt, which
should remove the need for need_resched() calls. Do you have evidence
that we actually need this handling on recent kernels?
Otherwise this looks good to me.
* Re: [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
2026-05-14 21:51 ` [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-05-18 6:48 ` Christoph Hellwig
0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-18 6:48 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v6 3/4] buffer: add dropbehind writeback support
2026-05-14 21:51 ` [PATCH v6 3/4] buffer: add dropbehind writeback support Tal Zussman
@ 2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:14 ` Tal Zussman
1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-18 6:49 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
2026-05-14 21:51 ` [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
@ 2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:17 ` Tal Zussman
1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-18 6:49 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-18 6:48 ` Christoph Hellwig
@ 2026-05-22 22:47 ` Tal Zussman
0 siblings, 0 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-22 22:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On 5/18/26 2:48 AM, Christoph Hellwig wrote:
> On Thu, May 14, 2026 at 05:51:14PM -0400, Tal Zussman wrote:
>> Some bio completion handlers need to run from preemptible task context,
>> but bio_endio() may be called from IRQ context (e.g., buffer_head
>> writeback). Callers need a way to ensure their callback eventually runs
>> from a sleepable context. Add infrastructure for that, in two forms:
>>
>> 1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
>> in advance that its callback needs task context (e.g., dropbehind
>> writeback). bio_endio() sees the flag and offloads completion to a
>> worker automatically.
>>
>> 2. bio_complete_in_task(), a helper that completion callbacks can
>> invoke from within bi_end_io() when the deferral decision is
>> dynamic (e.g., fserror reporting).
>
> Note that method 2 is unused as of this series. I do plan to add users
> ASAP, and at least one or two could even land through the block layer in
> this merge window.
>
>> Both share a per-CPU batch list drained by a delayed work item on a
>> WQ_PERCPU workqueue. Producers push the bio onto the local CPU's batch
>> and schedule the work item, which then dispatches each bio's bi_end_io()
>> from task context. The delayed work item uses a 1-jiffie delay to allow
>> batches of completions to accumulate before processing.
>
> But this 1-jiffie delay also means we unconditionally increase
> completion latency, which feels like a bad idea. Do you have any
> measurements that show where it does benefit? Note that queueing work
> already often has very measurable latency on its own. This also
> directly contradicts the erofs experience that even went to an RT
> thread to reduce the latency.
I added this per Dave's feedback on v4, where he noted that XFS inodegc
uses a delayed work item to avoid context switch storms. There's only a
delay for the first bio in a batch to complete, as we only delay when the
list is empty. I'll run some experiments and measure context switches,
completion latency, etc. to see if this is necessary.
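The "only delay when the list is empty" rule can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the patch's code: `struct model_bio`, `struct model_batch`, `model_queue()` and `model_drain()` are hypothetical stand-ins for the bio list, the per-CPU batch, `__bio_complete_in_task()` and the work function, and a counter takes the place of `mod_delayed_work_on(..., 1)`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's struct bio or workqueue API. */
struct model_bio {
	struct model_bio *next;
};

struct model_batch {
	struct model_bio *head;
	struct model_bio *tail;
	unsigned int timers_armed;	/* how often the delayed work was scheduled */
};

/* Queue one completion; returns true if this call armed the delayed work. */
bool model_queue(struct model_batch *batch, struct model_bio *bio)
{
	bool was_empty = (batch->head == NULL);

	bio->next = NULL;
	if (was_empty)
		batch->head = bio;
	else
		batch->tail->next = bio;
	batch->tail = bio;

	/* Only the first bio of a batch pays for scheduling the work item. */
	if (was_empty)
		batch->timers_armed++;
	return was_empty;
}

/* Drain everything queued so far; returns the number of completions run. */
unsigned int model_drain(struct model_batch *batch)
{
	struct model_bio *bio = batch->head;
	unsigned int completed = 0;

	batch->head = NULL;
	batch->tail = NULL;
	while (bio) {
		struct model_bio *next = bio->next;

		completed++;	/* stands in for bio->bi_end_io(bio) */
		bio = next;
	}
	return completed;
}
```

In this model, queueing three bios back to back arms the work once, and the next bio after a drain arms it again, which is the behaviour the delay is meant to exploit.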
>> Both methods are gated on bio_in_atomic(), which returns true in any
>> context where a sleeping bi_end_io() is unsafe, including
>> non-preemptible task context. This logic is copied from commit
>> c99fab6e80b7 ("erofs: fix atomic context detection when
>> !CONFIG_DEBUG_LOCK_ALLOC").
>
> Let's not copy it, but have a prep patch that moves the erofs logic
> into the block layer under the new bio_in_atomic name.
Will do.
>> + while ((bio = bio_list_pop(&list)))
>> + bio->bi_end_io(bio);
>> +
>> + if (need_resched()) {
>> + bool is_empty;
>> +
>> + local_lock_irq(&bio_complete_batch.lock);
>> + is_empty = bio_list_empty(&batch->list);
>> + local_unlock_irq(&bio_complete_batch.lock);
>> + if (!is_empty)
>> + mod_delayed_work_on(batch->cpu,
>> + bio_complete_wq,
>> + &batch->work, 0);
>> + break;
>> + }
>> + }
>
> On all mainstream architectures we now default to lazy preempt, which
> should remove the need for need_resched() calls. Do you have evidence
> that we actually need this handling on recent kernels?
No evidence - I added this per feedback on v3, but agreed that it can be
simplified.
> Otherwise this looks good to me.
>
Thanks - AI review found a couple more small things, which I'll respond to
in a separate message.
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
2026-05-15 2:38 ` Hillf Danton
2026-05-18 6:48 ` Christoph Hellwig
@ 2026-05-22 23:09 ` Tal Zussman
2026-05-25 5:24 ` Christoph Hellwig
2 siblings, 1 reply; 18+ messages in thread
From: Tal Zussman @ 2026-05-22 23:09 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On 5/14/26 5:51 PM, Tal Zussman wrote:
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct delayed_work *dw = to_delayed_work(w);
> + struct bio_complete_batch *batch =
> + container_of(dw, struct bio_complete_batch, work);
> +
> + while (1) {
> + struct bio_list list;
> + struct bio *bio;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (bio_list_empty(&list))
> + break;
> +
> + while ((bio = bio_list_pop(&list)))
> + bio->bi_end_io(bio);
Q: "Does this code handle bios that do not have a completion callback?
If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
bi_end_io pointer (like some flush or discard bios), executing
bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
The synchronous completion path in bio_endio() checks if bio->bi_end_io is
non-NULL before calling it. Should a similar check be added here?"
A: Every bio that reaches this point should have bi_end_io set. There's no
point in deferring a bio without a completion handler.
Q: "Could this loop also monopolize the CPU and cause latency spikes?
Because the work uses a 1-jiffie delay, tens of thousands of bios might
accumulate under heavy I/O load. Processing the entire batch sequentially
without yielding or calling cond_resched() inside the inner loop could
lead to scheduler latency or watchdog warnings on non-preemptible kernels."
A: With PREEMPT_LAZY as the default, the worker is preemptible, as per
Hillf and Christoph, so this is unnecessary.
> + if (need_resched()) {
> + bool is_empty;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + is_empty = bio_list_empty(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> + if (!is_empty)
> + mod_delayed_work_on(batch->cpu,
> + bio_complete_wq,
> + &batch->work, 0);
> + break;
> + }
> + }
> +}
> +
> +void __bio_complete_in_task(struct bio *bio)
> +{
> + struct bio_complete_batch *batch;
> + unsigned long flags;
> + bool was_empty;
> +
> + local_lock_irqsave(&bio_complete_batch.lock, flags);
Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
context?
On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
called when bio_in_atomic() is true (which includes hardware interrupts or
execution under a raw_spinlock_t), attempting to acquire a sleepable lock
here would trigger an "Invalid wait context" lockdep warning.
Would a lockless list (llist) be more appropriate here to avoid sleeping
in atomic contexts?"
A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
to switch to raw_spinlock_t, as it seems like that would add unnecessary
overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
(as is done for the per-CPU bio allocation cache) should work.
> + batch = this_cpu_ptr(&bio_complete_batch);
> + was_empty = bio_list_empty(&batch->list);
> + bio_list_add(&batch->list, bio);
> + local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> + if (was_empty)
> + mod_delayed_work_on(batch->cpu, bio_complete_wq,
> + &batch->work, 1);
> +}
> +EXPORT_SYMBOL_GPL(__bio_complete_in_task);
> +
> static inline bool bio_remaining_done(struct bio *bio)
> {
> /*
> @@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
> }
> #endif
>
> - if (bio->bi_end_io)
> + if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
> + __bio_complete_in_task(bio);
> + else if (bio->bi_end_io)
> bio->bi_end_io(bio);
> }
> EXPORT_SYMBOL(bio_endio);
> @@ -1977,6 +2053,51 @@ int bioset_init(struct bio_set *bs,
> }
> EXPORT_SYMBOL(bioset_init);
>
> +static int bio_complete_batch_cpu_online(unsigned int cpu)
> +{
> + enable_delayed_work(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +
Q: "Could a CPU hotplug rollback strand deferred bios indefinitely?
During the early teardown phase (CPUHP_AP_ONLINE_DYN),
bio_complete_batch_cpu_down_prep() disables the delayed work. However, the
CPU remains active and can continue completing bios. When these bios
complete, __bio_complete_in_task() adds them to the list with
was_empty = true, but mod_delayed_work_on() fails to queue the work
because it is disabled.
If the CPU offline operation is aborted, bio_complete_batch_cpu_online()
re-enables the delayed work but does not explicitly queue it. Because the
list is no longer empty, any future bio completions will see
was_empty == false and skip calling mod_delayed_work_on(), permanently
stranding the bios."
A: This is legit. The solution is to requeue the work if the list isn't
empty on CPU online.
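The rollback scenario and the proposed requeue can be walked through with a userspace model. This is only a sketch: `struct hp_batch` and the `hp_*()` helpers are hypothetical names, a counter replaces the bio list, and `hp_schedule()` assumes, as the review describes, that queueing a disabled work item does nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counters stand in for the bio list and delayed work. */
struct hp_batch {
	unsigned int queued;	/* bios sitting on the per-CPU list */
	bool work_disabled;	/* set by the down_prep callback */
	bool work_pending;	/* the delayed work is queued */
};

/* Models mod_delayed_work_on(): queueing a disabled work item does nothing. */
bool hp_schedule(struct hp_batch *b)
{
	if (b->work_disabled)
		return false;
	b->work_pending = true;
	return true;
}

/* Models __bio_complete_in_task(): only the first bio schedules the work. */
void hp_queue_bio(struct hp_batch *b)
{
	bool was_empty = (b->queued == 0);

	b->queued++;
	if (was_empty)
		hp_schedule(b);
}

/* Models the teardown callback that disables the work. */
void hp_down_prep(struct hp_batch *b)
{
	b->work_disabled = true;
	b->work_pending = false;
}

/* Models the online callback; requeue_fix selects the proposed behaviour. */
void hp_online(struct hp_batch *b, bool requeue_fix)
{
	b->work_disabled = false;
	if (requeue_fix && b->queued != 0)
		hp_schedule(b);
}
```

Without the requeue, a bio queued between `hp_down_prep()` and `hp_online()` leaves the list non-empty with no work pending, so later bios never schedule it; with the requeue, the online callback restores the invariant that a non-empty list always has work pending.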
> +/*
> + * Disable this CPU's delayed work so that it cannot run on an unbound worker
> + * after the CPU is offlined.
> + */
> +static int bio_complete_batch_cpu_down_prep(unsigned int cpu)
> +{
> + disable_delayed_work_sync(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead and the worker
> + * is canceled so no locking is needed.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> + struct bio *bio;
> +
> + while ((bio = bio_list_pop(&batch->list)))
> + bio->bi_end_io(bio);
> +
Q: "Is it possible to hit the same NULL pointer dereference issue here if the
bio lacks a bi_end_io callback?"
A: Same as above.
> + return 0;
> +}
> +
> +static void __init bio_complete_batch_init(int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> +
> + bio_list_init(&batch->list);
> + INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);
> + batch->cpu = cpu;
> +
> + if (!cpu_online(cpu))
> + disable_delayed_work_sync(&batch->work);
> +}
> +
> static int __init init_bio(void)
> {
> int i;
> @@ -1991,6 +2112,30 @@ static int __init init_bio(void)
> SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> }
>
> + for_each_possible_cpu(i)
> + bio_complete_batch_init(i);
> +
> + bio_complete_wq = alloc_workqueue("bio_complete",
> + WQ_MEM_RECLAIM | WQ_PERCPU, 0);
Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
protection in bio_complete_work_fn()?
When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
can execute per-CPU work items during worker pool congestion. This rescuer
thread executes unbound, meaning it could run on CPU B while processing
CPU A's work item.
Since local_lock operates strictly on the currently executing CPU, the
rescuer thread on CPU B would acquire CPU B's lock, while popping elements
from CPU A's list (derived via container_of()).
If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
it will acquire CPU A's lock and modify the same list without mutual
exclusion, potentially causing list corruption."
A: The rescuer should run on the same CPU, not unbound, so this is not an
issue.
> + if (!bio_complete_wq)
> + panic("bio: can't allocate bio_complete workqueue\n");
> +
> + /*
> + * bio task-context completion draining on hot-unplugged CPUs:
> + *
> + * 1. Stop the per-CPU delayed work while the CPU is still online, so
> + * that it cannot run on an unbound worker later.
> + * 2. Drain leftover bios added between worker disabling and CPU
> + * offlining.
> + */
> + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + "block/bio:complete:online",
> + bio_complete_batch_cpu_online,
> + bio_complete_batch_cpu_down_prep);
> + cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
> + "block/bio:complete:dead",
> + NULL, bio_complete_batch_cpu_dead);
> +
> cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
> bio_cpu_dead);
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..c0214d6c28d6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -369,6 +369,38 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
>
> void submit_bio(struct bio *bio);
>
> +/**
> + * bio_in_atomic - check if the current context is unsafe for bio completion
> + *
> + * Return: %true in atomic contexts (e.g. hard/soft IRQ, preempt-disabled);
> + * %false when a bio can be safely completed in the current context.
> + */
> +static inline bool bio_in_atomic(void)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> + return true;
> + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> + return true;
Q: "Will this cause an infinite loop of bio offloading on kernels with
CONFIG_PREEMPT_COUNT disabled?
Because bio_in_atomic() unconditionally returns true without preempt count
support, a dynamic call to bio_complete_in_task() from within a bi_end_io()
callback will always offload the bio to the workqueue.
When the workqueue executes bio->bi_end_io(bio), the callback will evaluate
bio_complete_in_task() again, which will return true again, creating a
permanent offloading loop."
A: Legit issue. This can be solved by changing bio_complete_in_task() to:
static inline bool bio_complete_in_task(struct bio *bio)
{
if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
return false;
if (!bio_in_atomic())
return false;
bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
__bio_complete_in_task(bio);
return true;
}
We can use the BIO_COMPLETE_IN_TASK flag to indicate that it's already
been deferred to the workqueue and is safe to run.
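A userspace model shows why the flag check bounds the number of deferrals. This is a sketch with hypothetical names (`struct fl_bio`, `fl_*()`); `fl_in_atomic()` is hard-wired to true to mimic bio_in_atomic() on a kernel without CONFIG_PREEMPT_COUNT, and a counter replaces the actual hand-off to the workqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bio; not the kernel structure. */
struct fl_bio {
	bool complete_in_task;	/* models the BIO_COMPLETE_IN_TASK flag */
	unsigned int deferrals;	/* how often the bio was handed to the worker */
	bool done;		/* the callback ran to completion */
};

/* Models bio_in_atomic() when preempt count support is unavailable. */
bool fl_in_atomic(void)
{
	return true;
}

/* Models the proposed bio_complete_in_task() with the flag guard. */
bool fl_complete_in_task(struct fl_bio *bio)
{
	if (bio->complete_in_task)
		return false;	/* already deferred once, run inline now */
	if (!fl_in_atomic())
		return false;
	bio->complete_in_task = true;
	bio->deferrals++;	/* stands in for __bio_complete_in_task(bio) */
	return true;
}

/* Models a bi_end_io() callback that defers dynamically. */
void fl_end_io(struct fl_bio *bio)
{
	if (fl_complete_in_task(bio))
		return;
	bio->done = true;
}

/* Invoke the callback as IRQ context and then the worker would. */
unsigned int fl_run(struct fl_bio *bio, unsigned int max_calls)
{
	unsigned int calls = 0;

	while (!bio->done && calls < max_calls) {
		fl_end_io(bio);
		calls++;
	}
	return calls;
}
```

In this model the callback runs exactly twice: the first call defers and sets the flag, the second sees the flag and completes. Dropping the flag check would make `fl_run()` hit `max_calls` instead, which corresponds to the reported offloading loop.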
> + return !preemptible();
> +}
> +
> +void __bio_complete_in_task(struct bio *bio);
> +
> +/**
> + * bio_complete_in_task - ensure a bio is completed in preemptible task context
> + * @bio: bio to complete
> + *
> + * If called from non-task context, offload the bio completion to a worker
> + * thread and return %true. Else return %false and do nothing.
> + */
> +static inline bool bio_complete_in_task(struct bio *bio)
> +{
> + if (!bio_in_atomic())
> + return false;
> + __bio_complete_in_task(bio);
> + return true;
> +}
> +
> extern void bio_endio(struct bio *);
>
> static inline void bio_io_error(struct bio *bio)
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
> BIO_REMAPPED,
> BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> + BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
> BIO_FLAG_LAST
> };
>
>
* Re: [PATCH v6 3/4] buffer: add dropbehind writeback support
2026-05-14 21:51 ` [PATCH v6 3/4] buffer: add dropbehind writeback support Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
@ 2026-05-22 23:14 ` Tal Zussman
2026-05-25 5:25 ` Christoph Hellwig
1 sibling, 1 reply; 18+ messages in thread
From: Tal Zussman @ 2026-05-22 23:14 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On 5/14/26 5:51 PM, Tal Zussman wrote:
> Add block_write_begin_iocb() which threads the kiocb through to
> __filemap_get_folio() so that buffer_head-based I/O can use DONTCACHE
> behavior. When the iocb has IOCB_DONTCACHE set, FGP_DONTCACHE is
> passed to mark the folio for dropbehind. The existing
> block_write_begin() is preserved as a wrapper that passes a NULL iocb.
>
> Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() when the folio has
> dropbehind set, so that buffer_head writeback completions get deferred
> to task context.
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
> ---
> fs/buffer.c | 19 +++++++++++++++++--
> include/linux/buffer_head.h | 3 +++
> 2 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b0b3792b1496..d0abaf44d782 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2138,14 +2138,19 @@ EXPORT_SYMBOL(block_commit_write);
> *
> * The filesystem needs to handle block truncation upon failure.
> */
> -int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> +int block_write_begin_iocb(const struct kiocb *iocb,
> + struct address_space *mapping, loff_t pos, unsigned len,
> struct folio **foliop, get_block_t *get_block)
> {
> pgoff_t index = pos >> PAGE_SHIFT;
> + fgf_t fgp_flags = FGP_WRITEBEGIN;
> struct folio *folio;
> int status;
>
> - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> + if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
> + fgp_flags |= FGP_DONTCACHE;
> +
> + folio = __filemap_get_folio(mapping, index, fgp_flags,
> mapping_gfp_mask(mapping));
> if (IS_ERR(folio))
> return PTR_ERR(folio);
> @@ -2160,6 +2165,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> *foliop = folio;
> return status;
> }
Q: "Is an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() needed for
block_write_begin_iocb()?
Since the legacy wrapper block_write_begin() remains exported, it appears
filesystems compiled as loadable modules (like ext4 or fat) will encounter
linker errors if they attempt to transition to the new API."
A: Can be added once filesystems want to use it.
> +
> +int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> + struct folio **foliop, get_block_t *get_block)
> +{
> + return block_write_begin_iocb(NULL, mapping, pos, len, foliop,
> + get_block);
> +}
> EXPORT_SYMBOL(block_write_begin);
>
> int block_write_end(loff_t pos, unsigned len, unsigned copied,
> @@ -2715,6 +2727,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
>
> bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
>
> + if (folio_test_dropbehind(bh->b_folio))
> + bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> +
Q: "Does this add unnecessary workqueue overhead for read operations?
While task-context deferral is required for asynchronous writeback completions
to allow folio_end_writeback() to sleep and invalidate the folio, read
dropbehind invalidation is handled synchronously in the reader's task context
via filemap_end_dropbehind_read().
Could this be gated by an op == REQ_OP_WRITE or op_is_write(opf) check so
that read bio completions aren't forced into a workqueue?"
A: Seems worth adding an op_is_write(opf) check to avoid deferring read
dropbehind completions.
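The gating condition can be sketched in userspace as follows. The names prefixed `md_` are hypothetical; the model assumes the kernel's convention that request opcodes for writes are odd (REQ_OP_READ is 0, REQ_OP_WRITE is 1), which is what op_is_write() tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's encoding: reads are even, writes are odd. */
#define MD_REQ_OP_READ	0u
#define MD_REQ_OP_WRITE	1u

/* Models op_is_write(): the low bit of the opcode marks a data-out operation. */
bool md_op_is_write(unsigned int opf)
{
	return (opf & 1u) != 0;
}

/* Decide whether submit_bh_wbc() should request task-context completion. */
bool md_defer_completion(bool folio_dropbehind, unsigned int opf)
{
	return folio_dropbehind && md_op_is_write(opf);
}
```

With this condition, only writeback of dropbehind folios is deferred; reads of dropbehind folios complete inline as before.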
> if (IS_ENABLED(CONFIG_FS_ENCRYPTION))
> buffer_set_crypto_ctx(bio, bh, GFP_NOIO);
>
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index e4939e33b4b5..4ce50882d621 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -260,6 +260,9 @@ int block_read_full_folio(struct folio *, get_block_t *);
> bool block_is_partially_uptodate(struct folio *, size_t from, size_t count);
> int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> struct folio **foliop, get_block_t *get_block);
> +int block_write_begin_iocb(const struct kiocb *iocb,
> + struct address_space *mapping, loff_t pos, unsigned len,
> + struct folio **foliop, get_block_t *get_block);
> int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
> get_block_t *get_block);
> int block_write_end(loff_t pos, unsigned len, unsigned copied, struct folio *);
>
* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
2026-05-14 21:51 ` [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
@ 2026-05-22 23:17 ` Tal Zussman
2026-05-25 5:30 ` Christoph Hellwig
1 sibling, 1 reply; 18+ messages in thread
From: Tal Zussman @ 2026-05-22 23:17 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig
Cc: Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On 5/14/26 5:51 PM, Tal Zussman wrote:
> Block device buffered reads and writes already pass through
> filemap_read() and iomap_file_buffered_write() respectively, both of
> which handle IOCB_DONTCACHE. Enable RWF_DONTCACHE for block device files
> by setting FOP_DONTCACHE in def_blk_fops.
>
> For CONFIG_BUFFER_HEAD=y paths, use block_write_begin_iocb() in
> blkdev_write_begin() to thread the kiocb through so that buffer_head
> writeback gets dropbehind support.
>
> CONFIG_BUFFER_HEAD=n paths are handled by the previously added iomap
> BIO_COMPLETE_IN_TASK support.
>
> This support is useful for databases that operate on raw block devices,
> among other userspace applications.
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Responding to Sashiko review inline:
Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu
Q: "Could this code path be unreachable during block device writes?
Block device buffered writes use blkdev_write_iter(), which unconditionally
delegates to blkdev_buffered_write() and subsequently
iomap_file_buffered_write(). The iomap infrastructure bypasses the legacy
address_space_operations .write_begin method.
During a write, iomap_write_begin() handles buffer head allocation internally
by calling __block_write_begin_int() directly. This naturally inherits the
FGP_DONTCACHE flag passed down from the kiocb via iomap_get_folio().
If the VFS write paths were actually calling .write_begin for block devices, a
CONFIG_BUFFER_HEAD=n kernel would crash with a NULL pointer dereference since
def_blk_aops does not define .write_begin or .write_end in that configuration."
A: So this actually seems legit... doesn't look like anything actually calls
blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
block devices seem to use iomap for writes unconditionally.
> ---
> block/fops.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/fops.c b/block/fops.c
> index bb6642b45937..31b073181d87 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -504,7 +504,8 @@ static int blkdev_write_begin(const struct kiocb *iocb,
> unsigned len, struct folio **foliop,
> void **fsdata)
> {
> - return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
> + return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
> + blkdev_get_block);
> }
>
> static int blkdev_write_end(const struct kiocb *iocb,
> @@ -966,7 +967,7 @@ const struct file_operations def_blk_fops = {
> .splice_write = iter_file_splice_write,
> .fallocate = blkdev_fallocate,
> .uring_cmd = blkdev_uring_cmd,
> - .fop_flags = FOP_BUFFER_RASYNC,
> + .fop_flags = FOP_BUFFER_RASYNC | FOP_DONTCACHE,
> };
>
> static __init int blkdev_init(void)
>
* Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
2026-05-22 23:09 ` Tal Zussman
@ 2026-05-25 5:24 ` Christoph Hellwig
0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-25 5:24 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
linux-rt-devel
[adding the PREEMPT-RT maintainers and list for one and a half questions
for them a bit below]
On Fri, May 22, 2026 at 07:09:59PM -0400, Tal Zussman wrote:
> > + while ((bio = bio_list_pop(&list)))
> > + bio->bi_end_io(bio);
>
> Q: "Does this code handle bios that do not have a completion callback?
> If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
> bi_end_io pointer (like some flush or discard bios), executing
> bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
> The synchronous completion path in bio_endio() checks if bio->bi_end_io is
> non-NULL before calling it. Should a similar check be added here?"
>
> A: Every bio that reaches this point should have bi_end_io set. There's no
> point in deferring a bio without a completion handler.
Yes. The only bios without bi_end_io that can end up in bio_endio
are those attached to passthrough requests, and deferring them of
course doesn't make sense.
> Q: "Could this loop also monopolize the CPU and cause latency spikes?
> Because the work uses a 1-jiffie delay, tens of thousands of bios might
> accumulate under heavy I/O load. Processing the entire batch sequentially
> without yielding or calling cond_resched() inside the inner loop could
> lead to scheduler latency or watchdog warnings on non-preemptible kernels."
>
> A: With PREEMPT_LAZY as the default, the worker is preemptible, as per
> Hillf and Christoph, so this is unnecessary.
Yes. Sashiko tends to spew this a lot, which is odd as there were clear
recommendations to only do this if absolutely needed even before
PREEMPT_LAZY.
> > +
> > + local_lock_irqsave(&bio_complete_batch.lock, flags);
>
> Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
> context?
> On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
> which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
> called when bio_in_atomic() is true (which includes hardware interrupts or
> execution under a raw_spinlock_t), attempting to acquire a sleepable lock
> here would trigger an "Invalid wait context" lockdep warning.
> Would a lockless list (llist) be more appropriate here to avoid sleeping
> in atomic contexts?"
>
> A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
> to switch to raw_spinlock_t, as it seems like that would add unnecessary
> overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
> (as is done for the per-CPU bio allocation cache) should work.
Adding the PREEMPT_RT maintainers for this as it is above my pay grade.
> Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
> protection in bio_complete_work_fn()?
> When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
> can execute per-CPU work items during worker pool congestion. This rescuer
> thread executes unbound, meaning it could run on CPU B while processing
> CPU A's work item.
> Since local_lock operates strictly on the currently executing CPU, the
> rescuer thread on CPU B would acquire CPU B's lock, while popping elements
> from CPU A's list (derived via container_of()).
> If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
> it will acquire CPU A's lock and modify the same list without mutual
> exclusion, potentially causing list corruption."
>
> A: The rescuer should run on the same CPU, not unbound, so this is not an
> issue.
This is another area where the PREEMPT_RT/scheduler folks might be able
to help.
> static inline bool bio_complete_in_task(struct bio *bio)
> {
> if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> return false;
> if (!bio_in_atomic())
> return false;
> bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> __bio_complete_in_task(bio);
> return true;
> }
>
> We can use the BIO_COMPLETE_IN_TASK flag to indicate that it's already
> been deferred to the workqueue and is safe to run.
Would be nice to avoid this, but yes.
* Re: [PATCH v6 3/4] buffer: add dropbehind writeback support
2026-05-22 23:14 ` Tal Zussman
@ 2026-05-25 5:25 ` Christoph Hellwig
0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-25 5:25 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On Fri, May 22, 2026 at 07:14:46PM -0400, Tal Zussman wrote:
> > if (IS_ERR(folio))
> > return PTR_ERR(folio);
> > @@ -2160,6 +2165,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
> > *foliop = folio;
> > return status;
> > }
>
> Q: "Is an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() needed for
> block_write_begin_iocb()?
> Since the legacy wrapper block_write_begin() remains exported, it appears
> filesystems compiled as loadable modules (like ext4 or fat) will encounter
> linker errors if they attempt to transition to the new API."
>
> A: Can be added once filesystems want to use it.
Yeah, Sashiko is really stupid and counterproductive here.
> > @@ -2715,6 +2727,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
> >
> > bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
> >
> > + if (folio_test_dropbehind(bh->b_folio))
> > + bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> > +
>
> Q: "Does this add unnecessary workqueue overhead for read operations?
> While task-context deferral is required for asynchronous writeback completions
> to allow folio_end_writeback() to sleep and invalidate the folio, read
> dropbehind invalidation is handled synchronously in the reader's task context
> via filemap_end_dropbehind_read().
> Could this be gated by an op == REQ_OP_WRITE or op_is_write(opf) check so
> that read bio completions aren't forced into a workqueue?"
>
> A: Seems worth adding an op_is_write(opf) check to avoid deferring read
> dropbehind completions.
Yes.
* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
2026-05-22 23:17 ` Tal Zussman
@ 2026-05-25 5:30 ` Christoph Hellwig
2026-05-25 18:06 ` Tal Zussman
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2026-05-25 5:30 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, Dave Chinner, Bart Van Assche, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On Fri, May 22, 2026 at 07:17:15PM -0400, Tal Zussman wrote:
> A: So this actually seems legit... doesn't look like anything actually calls
> blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
> block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
> block devices seem to use iomap for writes unconditionally.
Yes. Maybe send a separate patch to remove these now unused methods?
Or I could do that since I forgot to remove them when I should have.
* Re: [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices
2026-05-25 5:30 ` Christoph Hellwig
@ 2026-05-25 18:06 ` Tal Zussman
0 siblings, 0 replies; 18+ messages in thread
From: Tal Zussman @ 2026-05-25 18:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Dave Chinner, Bart Van Assche, linux-block, linux-kernel,
linux-xfs, linux-fsdevel, linux-mm, Gao Xiang
On 5/25/26 1:30 AM, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 07:17:15PM -0400, Tal Zussman wrote:
>> A: So this actually seems legit... doesn't look like anything actually calls
>> blkdev_write_begin() or blkdev_write_end(), unless I'm missing something.
>> block_write_begin_iocb() usage seems necessary for bh-based filesystems, but
>> block devices seem to use iomap for writes unconditionally.
>
> Yes. Maybe send a separate patch to remove these now unused methods?
> Or I could do that since I forgot to remove them when I should have.
>
I'll send a patch. I'll also drop the block_write_begin_iocb() change from this
series, as it becomes unused.
Thread overview: 18+ messages
2026-05-14 21:51 [PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-05-14 21:51 ` [PATCH v6 1/4] block: add task-context bio completion infrastructure Tal Zussman
2026-05-15 2:38 ` Hillf Danton
2026-05-18 6:48 ` Christoph Hellwig
2026-05-22 22:47 ` Tal Zussman
2026-05-22 23:09 ` Tal Zussman
2026-05-25 5:24 ` Christoph Hellwig
2026-05-14 21:51 ` [PATCH v6 2/4] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
2026-05-18 6:48 ` Christoph Hellwig
2026-05-14 21:51 ` [PATCH v6 3/4] buffer: add dropbehind writeback support Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:14 ` Tal Zussman
2026-05-25 5:25 ` Christoph Hellwig
2026-05-14 21:51 ` [PATCH v6 4/4] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-05-18 6:49 ` Christoph Hellwig
2026-05-22 23:17 ` Tal Zussman
2026-05-25 5:30 ` Christoph Hellwig
2026-05-25 18:06 ` Tal Zussman