* [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices
@ 2026-03-25 18:42 Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Tal Zussman @ 2026-03-25 18:42 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm, Tal Zussman
Add support for using RWF_DONTCACHE with block devices.
Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context.
To fix this, we can defer dropbehind invalidation to task context. We
introduce a new BIO_COMPLETE_IN_TASK flag that allows the bio submitter
to request task-context completion of bi_end_io. When bio_endio() sees
this flag in non-task context, it queues the bio to a per-CPU list and
schedules a work item to do bio completion.
Patch 1 adds the BIO_COMPLETE_IN_TASK infrastructure in the block
layer.
Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for DONTCACHE
folios and removes the DONTCACHE workqueue deferral from XFS.
Patch 3 enables RWF_DONTCACHE for block devices, setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.
This support is useful for databases that operate on raw block devices,
among other userspace applications.
I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.
Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a memcg of size
244MB to force reclaim pressure.
Results:
===== READS (/dev/nvme0n1p2) =====
 sec   normal MB/s   dontcache MB/s
----   -----------   --------------
   1        1098.6           1609.0
   2        1270.3           1506.6
   3        1093.3           1576.5
   4        1141.8           2393.9
   5        1365.3           2793.8
   6        1324.6           2065.9
   7         879.6           1920.7
   8        1434.1           1662.4
   9        1184.9           1857.9
  10        1166.4           1702.8
  11        1161.4           1653.4
  12        1086.9           1555.4
  13        1198.5           1718.9
  14        1111.9           1752.2
----   -----------   --------------
 avg        1173.7           1828.8 (+56%)
==== WRITES (/dev/nvme0n1p3) =====
 sec   normal MB/s   dontcache MB/s
----   -----------   --------------
   1         692.4           9297.7
   2        4810.8           9342.8
   3        5221.7           2955.2
   4         396.7           8488.3
   5        7249.2           9249.3
   6        6695.4           1376.2
   7         122.9           9125.8
   8        5486.5           9414.7
   9        6921.5           8743.5
  10          27.9           8997.8
----   -----------   --------------
 avg        3762.5           7699.1 (+105%)
---
Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
DONTCACHE folios, removing the need for XFS-specific workqueue
deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@columbia.edu
Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@columbia.edu
Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@columbia.edu
---
Tal Zussman (3):
block: add BIO_COMPLETE_IN_TASK for task-context completion
iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
block: enable RWF_DONTCACHE for block devices
block/bio.c | 84 ++++++++++++++++++++++++++++++++++++++++++++-
block/fops.c | 5 +--
fs/buffer.c | 22 ++++++++++--
fs/iomap/ioend.c | 2 ++
fs/xfs/xfs_aops.c | 4 ---
include/linux/blk_types.h | 1 +
include/linux/buffer_head.h | 3 ++
7 files changed, 111 insertions(+), 10 deletions(-)
---
base-commit: 2961f841b025fb234860bac26dfb7fa7cb0fb122
change-id: 20260218-blk-dontcache-338133dd045e
Best regards,
--
Tal Zussman <tz2294@columbia.edu>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
2026-03-25 19:54 ` Matthew Wilcox
` (3 more replies)
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2 siblings, 4 replies; 13+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm, Tal Zussman
Some bio completion handlers need to run in task context, but bio_endio()
can be called from IRQ context (e.g. buffer_head writeback). Add a
BIO_COMPLETE_IN_TASK flag that bio submitters can set to request
task-context completion of their bi_end_io callback.
When bio_endio() sees this flag and is running in non-task context, it
queues the bio to a per-cpu list and schedules a work item to call
bi_end_io() from task context. A CPU hotplug dead callback drains any
remaining bios from the departing CPU's batch.
This will be used to enable RWF_DONTCACHE for block devices, and could
be used for other subsystems like fscrypt that need task-context bio
completion.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
block/bio.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/blk_types.h | 1 +
2 files changed, 84 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index 8203bb7455a9..69ee0d93041f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/blk-crypto.h>
#include <linux/xarray.h>
+#include <linux/local_lock.h>
#include <trace/events/block.h>
#include "blk.h"
@@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
}
EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
+struct bio_complete_batch {
+ local_lock_t lock;
+ struct bio_list list;
+ struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};
+
+static void bio_complete_work_fn(struct work_struct *w)
+{
+ struct bio_complete_batch *batch;
+ struct bio_list list;
+
+again:
+ local_lock_irq(&bio_complete_batch.lock);
+ batch = this_cpu_ptr(&bio_complete_batch);
+ list = batch->list;
+ bio_list_init(&batch->list);
+ local_unlock_irq(&bio_complete_batch.lock);
+
+ while (!bio_list_empty(&list)) {
+ struct bio *bio = bio_list_pop(&list);
+ bio->bi_end_io(bio);
+ }
+
+ local_lock_irq(&bio_complete_batch.lock);
+ batch = this_cpu_ptr(&bio_complete_batch);
+ if (!bio_list_empty(&batch->list)) {
+ local_unlock_irq(&bio_complete_batch.lock);
+
+ if (!need_resched())
+ goto again;
+
+ schedule_work_on(smp_processor_id(), &batch->work);
+ return;
+ }
+ local_unlock_irq(&bio_complete_batch.lock);
+}
+
+static void bio_queue_completion(struct bio *bio)
+{
+ struct bio_complete_batch *batch;
+ unsigned long flags;
+
+ local_lock_irqsave(&bio_complete_batch.lock, flags);
+ batch = this_cpu_ptr(&bio_complete_batch);
+ bio_list_add(&batch->list, bio);
+ local_unlock_irqrestore(&bio_complete_batch.lock, flags);
+
+ schedule_work_on(smp_processor_id(), &batch->work);
+}
+
static inline bool bio_remaining_done(struct bio *bio)
{
/*
@@ -1788,7 +1843,9 @@ void bio_endio(struct bio *bio)
}
#endif
- if (bio->bi_end_io)
+ if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
+ bio_queue_completion(bio);
+ else if (bio->bi_end_io)
bio->bi_end_io(bio);
}
EXPORT_SYMBOL(bio_endio);
@@ -1974,6 +2031,21 @@ int bioset_init(struct bio_set *bs,
}
EXPORT_SYMBOL(bioset_init);
+/*
+ * Drain a dead CPU's deferred bio completions. The CPU is dead so no locking
+ * is needed -- no new bios will be queued to it.
+ */
+static int bio_complete_batch_cpu_dead(unsigned int cpu)
+{
+ struct bio_complete_batch *batch = per_cpu_ptr(&bio_complete_batch, cpu);
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&batch->list)))
+ bio->bi_end_io(bio);
+
+ return 0;
+}
+
static int __init init_bio(void)
{
int i;
@@ -1988,6 +2060,16 @@ static int __init init_bio(void)
SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
}
+ for_each_possible_cpu(i) {
+ struct bio_complete_batch *batch =
+ per_cpu_ptr(&bio_complete_batch, i);
+
+ bio_list_init(&batch->list);
+ INIT_WORK(&batch->work, bio_complete_work_fn);
+ }
+
+ cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
+ NULL, bio_complete_batch_cpu_dead);
cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
bio_cpu_dead);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8808ee76e73c..d49d97a050d0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -322,6 +322,7 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+ BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
BIO_FLAG_LAST
};
--
2.39.5
* [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
2026-03-25 20:21 ` Matthew Wilcox
2026-03-25 20:34 ` Dave Chinner
2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2 siblings, 2 replies; 13+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm, Tal Zussman
Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
context, where folio_end_dropbehind() can safely invalidate folios.
With the bio layer now handling task-context deferral generically, XFS
no longer needs to route DONTCACHE ioends through its completion
workqueue for page cache invalidation. Remove the DONTCACHE check from
xfs_ioend_needs_wq_completion().
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
fs/iomap/ioend.c | 2 ++
fs/xfs/xfs_aops.c | 4 ----
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index e4d57cb969f1..6b8375d11cc0 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
GFP_NOFS, &iomap_ioend_bioset);
bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
bio->bi_write_hint = wpc->inode->i_write_hint;
+ if (ioend_flags & IOMAP_IOEND_DONTCACHE)
+ bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
wbc_init_bio(wpc->wbc, bio);
wpc->nr_folios = 0;
return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 76678814f46f..0d469b91377d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -510,10 +510,6 @@ xfs_ioend_needs_wq_completion(
if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
return true;
- /* Page cache invalidation cannot be done in irq context. */
- if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
- return true;
-
return false;
}
--
2.39.5
* [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices
2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
2 siblings, 0 replies; 13+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm, Tal Zussman
Block device buffered reads and writes already pass through
filemap_read() and iomap_file_buffered_write() respectively, both of
which handle IOCB_DONTCACHE. Enable RWF_DONTCACHE for block device files
by setting FOP_DONTCACHE in def_blk_fops.
For CONFIG_BUFFER_HEAD=y paths, add block_write_begin_iocb() which
threads the kiocb through so that buffer_head-based I/O can use
DONTCACHE behavior. The existing block_write_begin() is preserved as a
wrapper that passes a NULL iocb. Set BIO_COMPLETE_IN_TASK in
submit_bh_wbc() when the folio has dropbehind so that buffer_head
writeback completions get deferred to task context.
CONFIG_BUFFER_HEAD=n paths are handled by the previously added iomap
BIO_COMPLETE_IN_TASK support.
This support is useful for databases that operate on raw block devices,
among other userspace applications.
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
block/fops.c | 5 +++--
fs/buffer.c | 22 +++++++++++++++++++---
include/linux/buffer_head.h | 3 +++
3 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/block/fops.c b/block/fops.c
index 4d32785b31d9..d8165f6ba71c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -505,7 +505,8 @@ static int blkdev_write_begin(const struct kiocb *iocb,
unsigned len, struct folio **foliop,
void **fsdata)
{
- return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
+ return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
+ blkdev_get_block);
}
static int blkdev_write_end(const struct kiocb *iocb,
@@ -967,7 +968,7 @@ const struct file_operations def_blk_fops = {
.splice_write = iter_file_splice_write,
.fallocate = blkdev_fallocate,
.uring_cmd = blkdev_uring_cmd,
- .fop_flags = FOP_BUFFER_RASYNC,
+ .fop_flags = FOP_BUFFER_RASYNC | FOP_DONTCACHE,
};
static __init int blkdev_init(void)
diff --git a/fs/buffer.c b/fs/buffer.c
index ed724a902657..c60c0ad6cc35 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2239,14 +2239,19 @@ EXPORT_SYMBOL(block_commit_write);
*
* The filesystem needs to handle block truncation upon failure.
*/
-int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+int block_write_begin_iocb(const struct kiocb *iocb,
+ struct address_space *mapping, loff_t pos, unsigned len,
struct folio **foliop, get_block_t *get_block)
{
pgoff_t index = pos >> PAGE_SHIFT;
+ fgf_t fgp_flags = FGP_WRITEBEGIN;
struct folio *folio;
int status;
- folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
+ fgp_flags |= FGP_DONTCACHE;
+
+ folio = __filemap_get_folio(mapping, index, fgp_flags,
mapping_gfp_mask(mapping));
if (IS_ERR(folio))
return PTR_ERR(folio);
@@ -2261,6 +2266,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
*foliop = folio;
return status;
}
+
+int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+ struct folio **foliop, get_block_t *get_block)
+{
+ return block_write_begin_iocb(NULL, mapping, pos, len, foliop,
+ get_block);
+}
EXPORT_SYMBOL(block_write_begin);
int block_write_end(loff_t pos, unsigned len, unsigned copied,
@@ -2589,7 +2601,8 @@ int cont_write_begin(const struct kiocb *iocb, struct address_space *mapping,
(*bytes)++;
}
- return block_write_begin(mapping, pos, len, foliop, get_block);
+ return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
+ get_block);
}
EXPORT_SYMBOL(cont_write_begin);
@@ -2801,6 +2814,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
+ if (folio_test_dropbehind(bh->b_folio))
+ bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
+
fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index b16b88bfbc3e..ddf88ce290f2 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -260,6 +260,9 @@ int block_read_full_folio(struct folio *, get_block_t *);
bool block_is_partially_uptodate(struct folio *, size_t from, size_t count);
int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
struct folio **foliop, get_block_t *get_block);
+int block_write_begin_iocb(const struct kiocb *iocb,
+ struct address_space *mapping, loff_t pos, unsigned len,
+ struct folio **foliop, get_block_t *get_block);
int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block);
int block_write_end(loff_t pos, unsigned len, unsigned copied, struct folio *);
--
2.39.5
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
@ 2026-03-25 19:54 ` Matthew Wilcox
2026-03-25 20:14 ` Jens Axboe
` (2 subsequent siblings)
3 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2026-03-25 19:54 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Christian Brauner, Darrick J. Wong, Carlos Maiolino,
Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 02:43:00PM -0400, Tal Zussman wrote:
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct bio_complete_batch *batch;
> + struct bio_list list;
> +
> +again:
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + while (!bio_list_empty(&list)) {
> + struct bio *bio = bio_list_pop(&list);
> + bio->bi_end_io(bio);
> + }
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + if (!bio_list_empty(&batch->list)) {
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (!need_resched())
> + goto again;
> +
> + schedule_work_on(smp_processor_id(), &batch->work);
> + return;
> + }
I don't know how often we see this actually trigger, but wouldn't this
be slightly more efficient?
+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	list = batch->list;
+again:
+	bio_list_init(&batch->list);
+	local_unlock_irq(&bio_complete_batch.lock);
+
+	while (!bio_list_empty(&list)) {
+		struct bio *bio = bio_list_pop(&list);
+		bio->bi_end_io(bio);
+	}
+
+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	list = batch->list;
+	if (!bio_list_empty(&list)) {
+		if (!need_resched())
+			goto again;
+
+		local_unlock_irq(&bio_complete_batch.lock);
+		schedule_work_on(smp_processor_id(), &batch->work);
+		return;
+	}
+	local_unlock_irq(&bio_complete_batch.lock);
Overall I like this. I think this is a better approach than the earlier
patches, and I'm looking forward to the simplifications that it's going
to enable.
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
2026-03-25 19:54 ` Matthew Wilcox
@ 2026-03-25 20:14 ` Jens Axboe
2026-03-25 20:26 ` Dave Chinner
2026-03-25 21:03 ` Bart Van Assche
3 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2026-03-25 20:14 UTC (permalink / raw)
To: Tal Zussman, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm
On 3/25/26 12:43 PM, Tal Zussman wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 8203bb7455a9..69ee0d93041f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -18,6 +18,7 @@
> #include <linux/highmem.h>
> #include <linux/blk-crypto.h>
> #include <linux/xarray.h>
> +#include <linux/local_lock.h>
>
> #include <trace/events/block.h>
> #include "blk.h"
> @@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
> }
> EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
>
> +struct bio_complete_batch {
> + local_lock_t lock;
> + struct bio_list list;
> + struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
> + .lock = INIT_LOCAL_LOCK(lock),
> +};
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct bio_complete_batch *batch;
> + struct bio_list list;
> +
> +again:
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + while (!bio_list_empty(&list)) {
> + struct bio *bio = bio_list_pop(&list);
> + bio->bi_end_io(bio);
> + }
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + if (!bio_list_empty(&batch->list)) {
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (!need_resched())
> + goto again;
> +
> + schedule_work_on(smp_processor_id(), &batch->work);
> + return;
> + }
> + local_unlock_irq(&bio_complete_batch.lock);
> +}
	bool looped = false;

	do {
		if (looped && need_resched()) {
			schedule_work_on(smp_processor_id(), &batch->work);
			break;
		}

		local_lock_irq(&bio_complete_batch.lock);
		batch = this_cpu_ptr(&bio_complete_batch);
		list = batch->list;
		bio_list_init(&batch->list);
		local_unlock_irq(&bio_complete_batch.lock);

		if (bio_list_empty(&list))
			break;

		do {
			struct bio *bio = bio_list_pop(&list);

			bio->bi_end_io(bio);
		} while (!bio_list_empty(&list));

		looped = true;
	} while (1);
would be a lot easier to read, and avoid needing the list manipulation
included twice.
> +static void bio_queue_completion(struct bio *bio)
> +{
> + struct bio_complete_batch *batch;
> + unsigned long flags;
> +
> + local_lock_irqsave(&bio_complete_batch.lock, flags);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + bio_list_add(&batch->list, bio);
> + local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> + schedule_work_on(smp_processor_id(), &batch->work);
> +}
Maybe do something ala:
static void bio_queue_completion(struct bio *bio)
{
	struct bio_complete_batch *batch;
	unsigned long flags;
	bool was_empty;

	local_lock_irqsave(&bio_complete_batch.lock, flags);
	batch = this_cpu_ptr(&bio_complete_batch);
	was_empty = bio_list_empty(&batch->list);
	bio_list_add(&batch->list, bio);
	local_unlock_irqrestore(&bio_complete_batch.lock, flags);

	if (was_empty)
		schedule_work_on(smp_processor_id(), &batch->work);
}
Outside of these mostly nits, I like this approach. It avoids my main
worry with this, which was contention on the list locks. And on the
io_uring side, we'll never hit the !in_task() path anyway, as the
completions are run from the task always. The bio flag makes sense for
this.
--
Jens Axboe
* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-03-25 20:21 ` Matthew Wilcox
2026-03-25 20:34 ` Dave Chinner
1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2026-03-25 20:21 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Christian Brauner, Darrick J. Wong, Carlos Maiolino,
Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
> Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
> IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
> context, where folio_end_dropbehind() can safely invalidate folios.
>
> With the bio layer now handling task-context deferral generically, XFS
> no longer needs to route DONTCACHE ioends through its completion
> workqueue for page cache invalidation. Remove the DONTCACHE check from
> xfs_ioend_needs_wq_completion().
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
> fs/iomap/ioend.c | 2 ++
> fs/xfs/xfs_aops.c | 4 ----
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> index e4d57cb969f1..6b8375d11cc0 100644
> --- a/fs/iomap/ioend.c
> +++ b/fs/iomap/ioend.c
> @@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
> GFP_NOFS, &iomap_ioend_bioset);
> bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
> bio->bi_write_hint = wpc->inode->i_write_hint;
> + if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> + bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> wbc_init_bio(wpc->wbc, bio);
> wpc->nr_folios = 0;
> return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
Can't we delete IOMAP_IOEND_DONTCACHE, and just do:
if (folio_test_dropbehind(folio))
bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);
It'd need to move down a few lines in iomap_add_to_ioend() to after
bio_add_folio() succeeds.
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
2026-03-25 19:54 ` Matthew Wilcox
2026-03-25 20:14 ` Jens Axboe
@ 2026-03-25 20:26 ` Dave Chinner
2026-03-25 20:39 ` Matthew Wilcox
2026-03-25 21:03 ` Bart Van Assche
3 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2026-03-25 20:26 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 02:43:00PM -0400, Tal Zussman wrote:
> Some bio completion handlers need to run in task context but bio_endio()
> can be called from IRQ context (e.g. buffer_head writeback). Add a
> BIO_COMPLETE_IN_TASK flag that bio submitters can set to request
> task-context completion of their bi_end_io callback.
>
> When bio_endio() sees this flag and is running in non-task context, it
> queues the bio to a per-cpu list and schedules a work item to call
> bi_end_io() from task context. A CPU hotplug dead callback drains any
> remaining bios from the departing CPU's batch.
>
> This will be used to enable RWF_DONTCACHE for block devices, and could
> be used for other subsystems like fscrypt that need task-context bio
> completion.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
> block/bio.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/blk_types.h | 1 +
> 2 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 8203bb7455a9..69ee0d93041f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -18,6 +18,7 @@
> #include <linux/highmem.h>
> #include <linux/blk-crypto.h>
> #include <linux/xarray.h>
> +#include <linux/local_lock.h>
>
> #include <trace/events/block.h>
> #include "blk.h"
> @@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
> }
> EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
>
> +struct bio_complete_batch {
> + local_lock_t lock;
> + struct bio_list list;
> + struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
> + .lock = INIT_LOCAL_LOCK(lock),
> +};
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct bio_complete_batch *batch;
> + struct bio_list list;
> +
> +again:
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
This is just a FIFO processing queue, and it is so wanting to be a
struct llist for lockless queuing and dequeueing.
We do this lockless per-cpu queue + per-cpu workqueue in XFS for
background inode GC processing. See struct xfs_inodegc and all the
xfs_inodegc_*() functions - it may be useful to have a generic
lockless per-cpu queue processing so we don't keep open coding this
repeating pattern everywhere.
> +
> + while (!bio_list_empty(&list)) {
> + struct bio *bio = bio_list_pop(&list);
> + bio->bi_end_io(bio);
> + }
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + if (!bio_list_empty(&batch->list)) {
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (!need_resched())
> + goto again;
> +
> + schedule_work_on(smp_processor_id(), &batch->work);
We've learnt that immediately scheduling per-cpu batch processing
work can cause context switch storms, as queueing and dequeueing then
proceed one work item at a time.
Hence we use a delayed work with a scheduling delay of a single
jiffy to allow batches of queue work from a single context to
complete before (potentially) being pre-empted by the per-cpu
kworker task that will process the queue...
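Concretely, that swap is small (untested sketch against this patch's
naming; the work function would fetch the batch via to_delayed_work()):

```diff
 struct bio_complete_batch {
 	local_lock_t lock;
 	struct bio_list list;
-	struct work_struct work;
+	struct delayed_work work;
 };

-	INIT_WORK(&batch->work, bio_complete_work_fn);
+	INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);

-	schedule_work_on(smp_processor_id(), &batch->work);
+	/* A one-jiffy delay lets an IRQ burst queue several bios before
+	 * the per-cpu kworker preempts the interrupted task. */
+	schedule_delayed_work_on(smp_processor_id(), &batch->work, 1);
```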
> + return;
> + }
> + local_unlock_irq(&bio_complete_batch.lock);
> +}
> +
> +static void bio_queue_completion(struct bio *bio)
> +{
> + struct bio_complete_batch *batch;
> + unsigned long flags;
> +
> + local_lock_irqsave(&bio_complete_batch.lock, flags);
> + batch = this_cpu_ptr(&bio_complete_batch);
> + bio_list_add(&batch->list, bio);
> + local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> + schedule_work_on(smp_processor_id(), &batch->work);
> +}
Yeah, we definitely want to queue all the pending bio completions
the interrupt is delivering before we run the batch processing...
> +
> static inline bool bio_remaining_done(struct bio *bio)
> {
> /*
> @@ -1788,7 +1843,9 @@ void bio_endio(struct bio *bio)
> }
> #endif
>
> - if (bio->bi_end_io)
> + if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> + bio_queue_completion(bio);
> + else if (bio->bi_end_io)
> bio->bi_end_io(bio);
> }
> EXPORT_SYMBOL(bio_endio);
> @@ -1974,6 +2031,21 @@ int bioset_init(struct bio_set *bs,
> }
> EXPORT_SYMBOL(bioset_init);
>
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead so no locking
> + * is needed -- no new bios will be queued to it.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> + struct bio_complete_batch *batch = per_cpu_ptr(&bio_complete_batch, cpu);
> + struct bio *bio;
> +
> + while ((bio = bio_list_pop(&batch->list)))
> + bio->bi_end_io(bio);
> +
> + return 0;
> +}
If you use a llist for the queue, this code is no different to the
normal processing work.
> +
> static int __init init_bio(void)
> {
> int i;
> @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
> SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> }
>
> + for_each_possible_cpu(i) {
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, i);
> +
> + bio_list_init(&batch->list);
> + INIT_WORK(&batch->work, bio_complete_work_fn);
> + }
> +
> + cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> + NULL, bio_complete_batch_cpu_dead);
XFS inodegc tracks the CPUs with work queued via a cpumask and
iterates the CPU mask for "all CPU" iteration scans. This avoids the
need for CPU hotplug integration...
> cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
> bio_cpu_dead);
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
> BIO_REMAPPED,
> BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> + BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
Can anyone set this on a bio they submit? i.e. This needs a better
description. Who can use it, constraints, guarantees, etc.
I ask, because the higher filesystem layers often know at submission
time that we need task based IO completion. If we can tell the bio
we are submitting that it needs task completion and have the block
layer guarantee that the ->end_io completion only ever runs in task
context, then we can get rid of multiple instances of IO completion
deferral to task context in filesystem code (e.g. iomap - for both
buffered and direct IO, xfs buffer cache write completions, etc).
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
2026-03-25 20:21 ` Matthew Wilcox
@ 2026-03-25 20:34 ` Dave Chinner
1 sibling, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2026-03-25 20:34 UTC (permalink / raw)
To: Tal Zussman
Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
> Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
> IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
> context, where folio_end_dropbehind() can safely invalidate folios.
>
> With the bio layer now handling task-context deferral generically, XFS
> no longer needs to route DONTCACHE ioends through its completion
> workqueue for page cache invalidation. Remove the DONTCACHE check from
> xfs_ioend_needs_wq_completion().
>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
> fs/iomap/ioend.c | 2 ++
> fs/xfs/xfs_aops.c | 4 ----
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> index e4d57cb969f1..6b8375d11cc0 100644
> --- a/fs/iomap/ioend.c
> +++ b/fs/iomap/ioend.c
> @@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
> GFP_NOFS, &iomap_ioend_bioset);
> bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
> bio->bi_write_hint = wpc->inode->i_write_hint;
> + if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> + bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> wbc_init_bio(wpc->wbc, bio);
> wpc->nr_folios = 0;
> return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 76678814f46f..0d469b91377d 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -510,10 +510,6 @@ xfs_ioend_needs_wq_completion(
> if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
> return true;
>
> - /* Page cache invalidation cannot be done in irq context. */
> - if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
> - return true;
> -
> return false;
> }
Ok, so higher layers can set it.
At this point, I'd suggest that we should not be making random
one-off changes to the iomap and filesystem layers like this just
for one operation that needs deferred IO completion work. This needs
to be considered from the overall perspective of how we defer
completion work - there are lots of different paths through
filesystems and/or iomap that require/use task deferral for IO
completion. We want them all to use the same mechanism - splitting
deferral between multiple layers depending on IO type is not a
particularly nice thing to be doing...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 20:26 ` Dave Chinner
@ 2026-03-25 20:39 ` Matthew Wilcox
2026-03-26 2:44 ` Dave Chinner
0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2026-03-25 20:39 UTC (permalink / raw)
To: Dave Chinner
Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
Carlos Maiolino, Alexander Viro, Jan Kara, Christoph Hellwig,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm
On Thu, Mar 26, 2026 at 07:26:26AM +1100, Dave Chinner wrote:
> > @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
> > SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > }
> >
> > + for_each_possible_cpu(i) {
> > + struct bio_complete_batch *batch =
> > + per_cpu_ptr(&bio_complete_batch, i);
> > +
> > + bio_list_init(&batch->list);
> > + INIT_WORK(&batch->work, bio_complete_work_fn);
> > + }
> > +
> > + cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> > + NULL, bio_complete_batch_cpu_dead);
>
> XFS inodegc tracks the CPUs with work queued via a cpumask and
> iterates the CPU mask for "all CPU" iteration scans. This avoids the
> need for CPU hotplug integration...
Can you elaborate a bit on how this would work in this context?
I understand why inode garbage collection might do an "all CPU"
iteration, but I don't understand the circumstances under which
we'd iterate over all CPUs to complete deferred BIOs.
> > +++ b/include/linux/blk_types.h
> > @@ -322,6 +322,7 @@ enum {
> > BIO_REMAPPED,
> > BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> > BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> > + BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
>
> Can anyone set this on a bio they submit? i.e. This needs a better
> description. Who can use it, constraints, guarantees, etc.
>
> I ask, because the higher filesystem layers often know at submission
> time that we need task based IO completion. If we can tell the bio
> we are submitting that it needs task completion and have the block
> layer guarantee that the ->end_io completion only ever runs in task
> context, then we can get rid of multiple instances of IO completion
> deferral to task context in filesystem code (e.g. iomap - for both
> buffered and direct IO, xfs buffer cache write completions, etc).
Right, that's the idea, this would be entirely general. I want to do
it for all pagecache writeback so we can change i_pages.xa_lock from
being irq-safe to only taken in task context.
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
` (2 preceding siblings ...)
2026-03-25 20:26 ` Dave Chinner
@ 2026-03-25 21:03 ` Bart Van Assche
2026-03-26 3:18 ` Dave Chinner
3 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2026-03-25 21:03 UTC (permalink / raw)
To: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
Christian Brauner, Darrick J. Wong, Carlos Maiolino,
Alexander Viro, Jan Kara
Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
linux-fsdevel, linux-mm
On 3/25/26 11:43 AM, Tal Zussman wrote:
> + schedule_work_on(smp_processor_id(), &batch->work);
Since schedule_work_on() queues work on system_percpu_wq the above call
has the same effect as schedule_work(&batch->work), doesn't it? From the
workqueue implementation:
system_percpu_wq = alloc_workqueue("events", WQ_PERCPU, 0);
[ ... ]
if (req_cpu == WORK_CPU_UNBOUND) {
if (wq->flags & WQ_UNBOUND)
cpu = wq_select_unbound_cpu(raw_smp_processor_id());
else
cpu = raw_smp_processor_id();
Thanks,
Bart.
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 20:39 ` Matthew Wilcox
@ 2026-03-26 2:44 ` Dave Chinner
0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2026-03-26 2:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
Carlos Maiolino, Alexander Viro, Jan Kara, Christoph Hellwig,
linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 08:39:21PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 26, 2026 at 07:26:26AM +1100, Dave Chinner wrote:
> > > @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
> > > SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > > }
> > >
> > > + for_each_possible_cpu(i) {
> > > + struct bio_complete_batch *batch =
> > > + per_cpu_ptr(&bio_complete_batch, i);
> > > +
> > > + bio_list_init(&batch->list);
> > > + INIT_WORK(&batch->work, bio_complete_work_fn);
> > > + }
> > > +
> > > + cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> > > + NULL, bio_complete_batch_cpu_dead);
> >
> > XFS inodegc tracks the CPUs with work queued via a cpumask and
> > iterates the CPU mask for "all CPU" iteration scans. This avoids the
> > need for CPU hotplug integration...
>
> Can you elaborate a bit on how this would work in this context?
It may not even be relevant. I was just mentioning it because if
someone looks at the xfs_inodegc code (as I suggested) they might
wonder why there aren't hotplug hooks for a per-cpu queuing
algorithm and/or why it tracked CPUs with queued items via a CPU
mask...
-Dave.
--
Dave Chinner
dgc@kernel.org
* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
2026-03-25 21:03 ` Bart Van Assche
@ 2026-03-26 3:18 ` Dave Chinner
0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2026-03-26 3:18 UTC (permalink / raw)
To: Bart Van Assche
Cc: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
Christian Brauner, Darrick J. Wong, Carlos Maiolino,
Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
linux-kernel, linux-xfs, linux-fsdevel, linux-mm
On Wed, Mar 25, 2026 at 02:03:40PM -0700, Bart Van Assche wrote:
> On 3/25/26 11:43 AM, Tal Zussman wrote:
> > + schedule_work_on(smp_processor_id(), &batch->work);
>
> Since schedule_work_on() queues work on system_percpu_wq the above call
> has the same effect as schedule_work(&batch->work), doesn't it?
No. Two words: Task preemption.
And in saying this, I realise the originally proposed code is dodgy.
It might look ok because the common case is that interrupt
context processing can't be preempted. However, I don't think that
is true for PREEMPT_RT kernels (IIRC interrupt processing runs as a
task that can be preempted). Also, bio completion can naturally run
from task context because the submitter can hold the last reference
to the bio.
Hence the queueing function can be preempted and scheduled to a
different CPU like so:
local_lock_irq()
queue on CPU 0
local_unlock_irq()
<preempt>
<run on CPU 1>
schedule_work_on(smp_processor_id())
That results in bio completion being queued on CPU 0, but the
processing work is scheduled for CPU 1. Oops.
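[Editor's note: one possible way to close that window — a sketch only, not a tested patch — is to schedule the work while the local lock is still held, so the queue and the work item cannot target different CPUs. schedule_work_on() is callable with interrupts disabled, so the ordering below is legal; whether it is the right design overall is a separate question.]

```c
/* Sketch: keep the per-CPU batch and the target CPU stable across
 * both operations by scheduling under the same local lock. */
local_lock_irqsave(&bio_complete_batch.lock, flags);
batch = this_cpu_ptr(&bio_complete_batch);
bio_list_add(&batch->list, bio);
schedule_work_on(smp_processor_id(), &batch->work);
local_unlock_irqrestore(&bio_complete_batch.lock, flags);
```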
> From the
> workqueue implementation:
>
> system_percpu_wq = alloc_workqueue("events", WQ_PERCPU, 0);
>
> [ ... ]
> if (req_cpu == WORK_CPU_UNBOUND) {
> if (wq->flags & WQ_UNBOUND)
> cpu = wq_select_unbound_cpu(raw_smp_processor_id());
> else
> cpu = raw_smp_processor_id();
Same preemption problem as above.
-Dave.
--
Dave Chinner
dgc@kernel.org
end of thread, other threads:[~2026-03-26 3:18 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
2026-03-25 19:54 ` Matthew Wilcox
2026-03-25 20:14 ` Jens Axboe
2026-03-25 20:26 ` Dave Chinner
2026-03-25 20:39 ` Matthew Wilcox
2026-03-26 2:44 ` Dave Chinner
2026-03-25 21:03 ` Bart Van Assche
2026-03-26 3:18 ` Dave Chinner
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
2026-03-25 20:21 ` Matthew Wilcox
2026-03-25 20:34 ` Dave Chinner
2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman