public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices
@ 2026-03-25 18:42 Tal Zussman
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Tal Zussman @ 2026-03-25 18:42 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm, Tal Zussman

Add support for using RWF_DONTCACHE with block devices.

Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context.

To fix this, we can defer dropbehind invalidation to task context. We
introduce a new BIO_COMPLETE_IN_TASK flag that allows the bio submitter
to request task-context completion of bi_end_io. When bio_endio() sees
this flag in non-task context, it queues the bio to a per-CPU list and
schedules a work item to do bio completion.
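The deferral pattern described above can be sketched in plain userspace C, with a singly linked list standing in for the per-CPU bio_list and direct function calls standing in for the work item. This is an illustrative model only, not kernel code; all names are invented for the sketch.

```c
/* Userspace sketch (not kernel code) of the defer-to-task pattern:
 * completions are queued onto a list from "IRQ" context, then the
 * whole batch is detached and drained from task context. */
#include <assert.h>
#include <stddef.h>

struct completion {
	struct completion *next;
	void (*end_io)(struct completion *);
	int done;			/* set by the callback */
};

static struct completion *pending;	/* stands in for the per-CPU bio_list */

/* bio_queue_completion() analogue: link the entry, defer the callback. */
static void queue_completion(struct completion *c)
{
	c->next = pending;
	pending = c;
}

/* bio_complete_work_fn() analogue: detach the whole batch, then run
 * each callback outside the "lock". Returns how many were completed. */
static int drain_completions(void)
{
	struct completion *list = pending;
	int n = 0;

	pending = NULL;			/* batch detached, queue reusable */
	while (list) {
		struct completion *c = list;

		list = c->next;
		c->end_io(c);
		n++;
	}
	return n;
}

static void mark_done(struct completion *c)
{
	c->done = 1;
}
```

In the real patch the queue/detach steps are protected by a per-CPU local_lock with IRQs disabled, and the drain runs from a schedule_work_on() work item rather than a direct call.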

Patch 1 adds the BIO_COMPLETE_IN_TASK infrastructure in the block
layer.

Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for DONTCACHE
folios and removes the DONTCACHE workqueue deferral from XFS.

Patch 3 enables RWF_DONTCACHE for block devices, setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.

This support is useful for databases that operate on raw block devices,
among other userspace applications.
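From userspace, an application opts in per-I/O with the RWF_DONTCACHE flag to pwritev2()/preadv2(). A minimal hedged example follows; the fallback define of RWF_DONTCACHE is an assumption taken from recent uapi headers (check <linux/fs.h> on your system), and the helper retries without the flag on kernels that reject it.

```c
/* Userspace illustration of a dropbehind write via pwritev2(). */
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* assumed value; check <linux/fs.h> */
#endif

/* Write one buffer with dropbehind semantics; returns 0 on success. */
static int write_dontcache(const char *path, const char *data, size_t len)
{
	struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
	int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	ssize_t ret;

	if (fd < 0)
		return -1;

	/* Ask the kernel to drop these pages once writeback completes. */
	ret = pwritev2(fd, &iov, 1, 0, RWF_DONTCACHE);
	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL))
		ret = pwritev2(fd, &iov, 1, 0, 0);	/* fall back: cached write */

	close(fd);
	return ret == (ssize_t)len ? 0 : -1;
}
```

With this series applied, the same call works when `path` is a raw block device node rather than a regular file.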

I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.

Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a memcg of size
244MB to force reclaim pressure.

Results:

===== READS (/dev/nvme0n1p2) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1        1098.6          1609.0
   2        1270.3          1506.6
   3        1093.3          1576.5
   4        1141.8          2393.9
   5        1365.3          2793.8
   6        1324.6          2065.9
   7         879.6          1920.7
   8        1434.1          1662.4
   9        1184.9          1857.9
  10        1166.4          1702.8
  11        1161.4          1653.4
  12        1086.9          1555.4
  13        1198.5          1718.9
  14        1111.9          1752.2
----  ------------  --------------
 avg        1173.7          1828.8  (+56%)

==== WRITES (/dev/nvme0n1p3) =====
 sec   normal MB/s  dontcache MB/s
----  ------------  --------------
   1         692.4          9297.7
   2        4810.8          9342.8
   3        5221.7          2955.2
   4         396.7          8488.3
   5        7249.2          9249.3
   6        6695.4          1376.2
   7         122.9          9125.8
   8        5486.5          9414.7
   9        6921.5          8743.5
  10          27.9          8997.8
----  ------------  --------------
 avg        3762.5          7699.1  (+105%)

---
Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
  BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
  per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
  DONTCACHE folios, removing the need for XFS-specific workqueue
  deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
  path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@columbia.edu

Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
  local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
  folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
  batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@columbia.edu

Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
  item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@columbia.edu

---
Tal Zussman (3):
      block: add BIO_COMPLETE_IN_TASK for task-context completion
      iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
      block: enable RWF_DONTCACHE for block devices

 block/bio.c                 | 84 ++++++++++++++++++++++++++++++++++++++++++++-
 block/fops.c                |  5 +--
 fs/buffer.c                 | 22 ++++++++++--
 fs/iomap/ioend.c            |  2 ++
 fs/xfs/xfs_aops.c           |  4 ---
 include/linux/blk_types.h   |  1 +
 include/linux/buffer_head.h |  3 ++
 7 files changed, 111 insertions(+), 10 deletions(-)
---
base-commit: 2961f841b025fb234860bac26dfb7fa7cb0fb122
change-id: 20260218-blk-dontcache-338133dd045e

Best regards,
-- 
Tal Zussman <tz2294@columbia.edu>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
  2026-03-25 19:54   ` Matthew Wilcox
                     ` (4 more replies)
  2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
  2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
  2 siblings, 5 replies; 19+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm, Tal Zussman

Some bio completion handlers need to run in task context but bio_endio()
can be called from IRQ context (e.g. buffer_head writeback). Add a
BIO_COMPLETE_IN_TASK flag that bio submitters can set to request
task-context completion of their bi_end_io callback.

When bio_endio() sees this flag and is running in non-task context, it
queues the bio to a per-cpu list and schedules a work item to call
bi_end_io() from task context. A CPU hotplug dead callback drains any
remaining bios from the departing CPU's batch.

This will be used to enable RWF_DONTCACHE for block devices, and could
be used for other subsystems like fscrypt that need task-context bio
completion.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
 block/bio.c               | 84 ++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h |  1 +
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 8203bb7455a9..69ee0d93041f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -18,6 +18,7 @@
 #include <linux/highmem.h>
 #include <linux/blk-crypto.h>
 #include <linux/xarray.h>
+#include <linux/local_lock.h>
 
 #include <trace/events/block.h>
 #include "blk.h"
@@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
+struct bio_complete_batch {
+	local_lock_t lock;
+	struct bio_list list;
+	struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
+	.lock = INIT_LOCAL_LOCK(lock),
+};
+
+static void bio_complete_work_fn(struct work_struct *w)
+{
+	struct bio_complete_batch *batch;
+	struct bio_list list;
+
+again:
+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	list = batch->list;
+	bio_list_init(&batch->list);
+	local_unlock_irq(&bio_complete_batch.lock);
+
+	while (!bio_list_empty(&list)) {
+		struct bio *bio = bio_list_pop(&list);
+		bio->bi_end_io(bio);
+	}
+
+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	if (!bio_list_empty(&batch->list)) {
+		local_unlock_irq(&bio_complete_batch.lock);
+
+		if (!need_resched())
+			goto again;
+
+		schedule_work_on(smp_processor_id(), &batch->work);
+		return;
+	}
+	local_unlock_irq(&bio_complete_batch.lock);
+}
+
+static void bio_queue_completion(struct bio *bio)
+{
+	struct bio_complete_batch *batch;
+	unsigned long flags;
+
+	local_lock_irqsave(&bio_complete_batch.lock, flags);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	bio_list_add(&batch->list, bio);
+	local_unlock_irqrestore(&bio_complete_batch.lock, flags);
+
+	schedule_work_on(smp_processor_id(), &batch->work);
+}
+
 static inline bool bio_remaining_done(struct bio *bio)
 {
 	/*
@@ -1788,7 +1843,9 @@ void bio_endio(struct bio *bio)
 	}
 #endif
 
-	if (bio->bi_end_io)
+	if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
+		bio_queue_completion(bio);
+	else if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
 EXPORT_SYMBOL(bio_endio);
@@ -1974,6 +2031,21 @@ int bioset_init(struct bio_set *bs,
 }
 EXPORT_SYMBOL(bioset_init);
 
+/*
+ * Drain a dead CPU's deferred bio completions. The CPU is dead so no locking
+ * is needed -- no new bios will be queued to it.
+ */
+static int bio_complete_batch_cpu_dead(unsigned int cpu)
+{
+	struct bio_complete_batch *batch = per_cpu_ptr(&bio_complete_batch, cpu);
+	struct bio *bio;
+
+	while ((bio = bio_list_pop(&batch->list)))
+		bio->bi_end_io(bio);
+
+	return 0;
+}
+
 static int __init init_bio(void)
 {
 	int i;
@@ -1988,6 +2060,16 @@ static int __init init_bio(void)
 				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
 	}
 
+	for_each_possible_cpu(i) {
+		struct bio_complete_batch *batch =
+			per_cpu_ptr(&bio_complete_batch, i);
+
+		bio_list_init(&batch->list);
+		INIT_WORK(&batch->work, bio_complete_work_fn);
+	}
+
+	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
+				NULL, bio_complete_batch_cpu_dead);
 	cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
 					bio_cpu_dead);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8808ee76e73c..d49d97a050d0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -322,6 +322,7 @@ enum {
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+	BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
 	BIO_FLAG_LAST
 };
 

-- 
2.39.5



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
  2026-03-25 20:21   ` Matthew Wilcox
  2026-03-25 20:34   ` Dave Chinner
  2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
  2 siblings, 2 replies; 19+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm, Tal Zussman

Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
context, where folio_end_dropbehind() can safely invalidate folios.

With the bio layer now handling task-context deferral generically, XFS
no longer needs to route DONTCACHE ioends through its completion
workqueue for page cache invalidation. Remove the DONTCACHE check from
xfs_ioend_needs_wq_completion().

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
 fs/iomap/ioend.c  | 2 ++
 fs/xfs/xfs_aops.c | 4 ----
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index e4d57cb969f1..6b8375d11cc0 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
 	bio->bi_write_hint = wpc->inode->i_write_hint;
+	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
+		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
 	wbc_init_bio(wpc->wbc, bio);
 	wpc->nr_folios = 0;
 	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 76678814f46f..0d469b91377d 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -510,10 +510,6 @@ xfs_ioend_needs_wq_completion(
 	if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
 		return true;
 
-	/* Page cache invalidation cannot be done in irq context. */
-	if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
-		return true;
-
 	return false;
 }
 

-- 
2.39.5



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices
  2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
  2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-03-25 18:43 ` Tal Zussman
  2 siblings, 0 replies; 19+ messages in thread
From: Tal Zussman @ 2026-03-25 18:43 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm, Tal Zussman

Block device buffered reads and writes already pass through
filemap_read() and iomap_file_buffered_write() respectively, both of
which handle IOCB_DONTCACHE. Enable RWF_DONTCACHE for block device files
by setting FOP_DONTCACHE in def_blk_fops.

For CONFIG_BUFFER_HEAD=y paths, add block_write_begin_iocb() which
threads the kiocb through so that buffer_head-based I/O can use
DONTCACHE behavior. The existing block_write_begin() is preserved as a
wrapper that passes a NULL iocb. Set BIO_COMPLETE_IN_TASK in
submit_bh_wbc() when the folio has dropbehind so that buffer_head
writeback completions get deferred to task context.

CONFIG_BUFFER_HEAD=n paths are handled by the previously added iomap
BIO_COMPLETE_IN_TASK support.

This support is useful for databases that operate on raw block devices,
among other userspace applications.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
---
 block/fops.c                |  5 +++--
 fs/buffer.c                 | 22 +++++++++++++++++++---
 include/linux/buffer_head.h |  3 +++
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 4d32785b31d9..d8165f6ba71c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -505,7 +505,8 @@ static int blkdev_write_begin(const struct kiocb *iocb,
 			      unsigned len, struct folio **foliop,
 			      void **fsdata)
 {
-	return block_write_begin(mapping, pos, len, foliop, blkdev_get_block);
+	return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
+				     blkdev_get_block);
 }
 
 static int blkdev_write_end(const struct kiocb *iocb,
@@ -967,7 +968,7 @@ const struct file_operations def_blk_fops = {
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
 	.uring_cmd	= blkdev_uring_cmd,
-	.fop_flags	= FOP_BUFFER_RASYNC,
+	.fop_flags	= FOP_BUFFER_RASYNC | FOP_DONTCACHE,
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/buffer.c b/fs/buffer.c
index ed724a902657..c60c0ad6cc35 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2239,14 +2239,19 @@ EXPORT_SYMBOL(block_commit_write);
  *
  * The filesystem needs to handle block truncation upon failure.
  */
-int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+int block_write_begin_iocb(const struct kiocb *iocb,
+		struct address_space *mapping, loff_t pos, unsigned len,
 		struct folio **foliop, get_block_t *get_block)
 {
 	pgoff_t index = pos >> PAGE_SHIFT;
+	fgf_t fgp_flags = FGP_WRITEBEGIN;
 	struct folio *folio;
 	int status;
 
-	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+	if (iocb && iocb->ki_flags & IOCB_DONTCACHE)
+		fgp_flags |= FGP_DONTCACHE;
+
+	folio = __filemap_get_folio(mapping, index, fgp_flags,
 			mapping_gfp_mask(mapping));
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
@@ -2261,6 +2266,13 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 	*foliop = folio;
 	return status;
 }
+
+int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
+		struct folio **foliop, get_block_t *get_block)
+{
+	return block_write_begin_iocb(NULL, mapping, pos, len, foliop,
+				      get_block);
+}
 EXPORT_SYMBOL(block_write_begin);
 
 int block_write_end(loff_t pos, unsigned len, unsigned copied,
@@ -2589,7 +2601,8 @@ int cont_write_begin(const struct kiocb *iocb, struct address_space *mapping,
 		(*bytes)++;
 	}
 
-	return block_write_begin(mapping, pos, len, foliop, get_block);
+	return block_write_begin_iocb(iocb, mapping, pos, len, foliop,
+				     get_block);
 }
 EXPORT_SYMBOL(cont_write_begin);
 
@@ -2801,6 +2814,9 @@ static void submit_bh_wbc(blk_opf_t opf, struct buffer_head *bh,
 
 	bio = bio_alloc(bh->b_bdev, 1, opf, GFP_NOIO);
 
+	if (folio_test_dropbehind(bh->b_folio))
+		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
+
 	fscrypt_set_bio_crypt_ctx_bh(bio, bh, GFP_NOIO);
 
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index b16b88bfbc3e..ddf88ce290f2 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -260,6 +260,9 @@ int block_read_full_folio(struct folio *, get_block_t *);
 bool block_is_partially_uptodate(struct folio *, size_t from, size_t count);
 int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
 		struct folio **foliop, get_block_t *get_block);
+int block_write_begin_iocb(const struct kiocb *iocb,
+		struct address_space *mapping, loff_t pos, unsigned len,
+		struct folio **foliop, get_block_t *get_block);
 int __block_write_begin(struct folio *folio, loff_t pos, unsigned len,
 		get_block_t *get_block);
 int block_write_end(loff_t pos, unsigned len, unsigned copied, struct folio *);

-- 
2.39.5



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
@ 2026-03-25 19:54   ` Matthew Wilcox
  2026-03-25 20:14   ` Jens Axboe
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2026-03-25 19:54 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:43:00PM -0400, Tal Zussman wrote:
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> +	struct bio_complete_batch *batch;
> +	struct bio_list list;
> +
> +again:
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	list = batch->list;
> +	bio_list_init(&batch->list);
> +	local_unlock_irq(&bio_complete_batch.lock);
> +
> +	while (!bio_list_empty(&list)) {
> +		struct bio *bio = bio_list_pop(&list);
> +		bio->bi_end_io(bio);
> +	}
> +
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	if (!bio_list_empty(&batch->list)) {
> +		local_unlock_irq(&bio_complete_batch.lock);
> +
> +		if (!need_resched())
> +			goto again;
> +
> +		schedule_work_on(smp_processor_id(), &batch->work);
> +		return;
> +	}

I don't know how often we see this actually trigger, but wouldn't this
be slightly more efficient?

+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	list = batch->list;
+again:
+	bio_list_init(&batch->list);
+	local_unlock_irq(&bio_complete_batch.lock);
+
+	while (!bio_list_empty(&list)) {
+		struct bio *bio = bio_list_pop(&list);
+		bio->bi_end_io(bio);
+	}
+
+	local_lock_irq(&bio_complete_batch.lock);
+	batch = this_cpu_ptr(&bio_complete_batch);
+	list = batch->list;
+	if (!bio_list_empty(&list)) {
+		if (!need_resched())
+			goto again;
+
+		local_unlock_irq(&bio_complete_batch.lock);
+		schedule_work_on(smp_processor_id(), &batch->work);
+		return;
+	}


Overall I like this.  I think this is a better approach than the earlier
patches, and I'm looking forward to the simplifications that it's going
to enable.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
  2026-03-25 19:54   ` Matthew Wilcox
@ 2026-03-25 20:14   ` Jens Axboe
  2026-03-25 20:26   ` Dave Chinner
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2026-03-25 20:14 UTC (permalink / raw)
  To: Tal Zussman, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

On 3/25/26 12:43 PM, Tal Zussman wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 8203bb7455a9..69ee0d93041f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -18,6 +18,7 @@
>  #include <linux/highmem.h>
>  #include <linux/blk-crypto.h>
>  #include <linux/xarray.h>
> +#include <linux/local_lock.h>
>  
>  #include <trace/events/block.h>
>  #include "blk.h"
> @@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
>  }
>  EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
>  
> +struct bio_complete_batch {
> +	local_lock_t lock;
> +	struct bio_list list;
> +	struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
> +	.lock = INIT_LOCAL_LOCK(lock),
> +};
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> +	struct bio_complete_batch *batch;
> +	struct bio_list list;
> +
> +again:
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	list = batch->list;
> +	bio_list_init(&batch->list);
> +	local_unlock_irq(&bio_complete_batch.lock);
> +
> +	while (!bio_list_empty(&list)) {
> +		struct bio *bio = bio_list_pop(&list);
> +		bio->bi_end_io(bio);
> +	}
> +
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	if (!bio_list_empty(&batch->list)) {
> +		local_unlock_irq(&bio_complete_batch.lock);
> +
> +		if (!need_resched())
> +			goto again;
> +
> +		schedule_work_on(smp_processor_id(), &batch->work);
> +		return;
> +	}
> +	local_unlock_irq(&bio_complete_batch.lock);
> +}

bool looped = false;

do {
	if (looped && need_resched()) {
    		schedule_work_on(smp_processor_id(), &batch->work);
		break;
	}

	local_lock_irq(&bio_complete_batch.lock);
	batch = this_cpu_ptr(&bio_complete_batch);
	list = batch->list;
	bio_list_init(&batch->list);
	local_unlock_irq(&bio_complete_batch.lock);

	if (bio_list_empty(&list))
		break;

	do {
		struct bio *bio = bio_list_pop(&list);
		bio->bi_end_io(bio);
	} while (!bio_list_empty(&list));
	looped = true;
} while (1);

would be a lot easier to read, and avoid needing the list manipulation
included twice.

> +static void bio_queue_completion(struct bio *bio)
> +{
> +	struct bio_complete_batch *batch;
> +	unsigned long flags;
> +
> +	local_lock_irqsave(&bio_complete_batch.lock, flags);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	bio_list_add(&batch->list, bio);
> +	local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> +	schedule_work_on(smp_processor_id(), &batch->work);
> +}

Maybe do something ala:

static void bio_queue_completion(struct bio *bio)
{
	struct bio_complete_batch *batch;
	unsigned long flags;
	bool was_empty;

	local_lock_irqsave(&bio_complete_batch.lock, flags);
	batch = this_cpu_ptr(&bio_complete_batch);
	was_empty = bio_list_empty(&batch->list);
	bio_list_add(&batch->list, bio);
	local_unlock_irqrestore(&bio_complete_batch.lock, flags);

	if (was_empty)
		schedule_work_on(smp_processor_id(), &batch->work);
}

Outside of these mostly nits, I like this approach. It avoids my main
worry with this, which was contention on the list locks. And on the
io_uring side, we'll never hit the !in_task() path anyway, as the
completions are run from the task always. The bio flag makes sense for
this.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
@ 2026-03-25 20:21   ` Matthew Wilcox
  2026-03-27  6:03     ` Christoph Hellwig
  2026-03-25 20:34   ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2026-03-25 20:21 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
> Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
> IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
> context, where folio_end_dropbehind() can safely invalidate folios.
> 
> With the bio layer now handling task-context deferral generically, XFS
> no longer needs to route DONTCACHE ioends through its completion
> workqueue for page cache invalidation. Remove the DONTCACHE check from
> xfs_ioend_needs_wq_completion().
> 
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
>  fs/iomap/ioend.c  | 2 ++
>  fs/xfs/xfs_aops.c | 4 ----
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> index e4d57cb969f1..6b8375d11cc0 100644
> --- a/fs/iomap/ioend.c
> +++ b/fs/iomap/ioend.c
> @@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  			       GFP_NOFS, &iomap_ioend_bioset);
>  	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
>  	bio->bi_write_hint = wpc->inode->i_write_hint;
> +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
>  	wbc_init_bio(wpc->wbc, bio);
>  	wpc->nr_folios = 0;
>  	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);

Can't we delete IOMAP_IOEND_DONTCACHE, and just do:

	if (folio_test_dropbehind(folio))
		bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);

It'd need to move down a few lines in iomap_add_to_ioend() to after
bio_add_folio() succeeds.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
  2026-03-25 19:54   ` Matthew Wilcox
  2026-03-25 20:14   ` Jens Axboe
@ 2026-03-25 20:26   ` Dave Chinner
  2026-03-25 20:39     ` Matthew Wilcox
  2026-03-25 21:03   ` Bart Van Assche
  2026-03-27  6:01   ` Christoph Hellwig
  4 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2026-03-25 20:26 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:43:00PM -0400, Tal Zussman wrote:
> Some bio completion handlers need to run in task context but bio_endio()
> can be called from IRQ context (e.g. buffer_head writeback). Add a
> BIO_COMPLETE_IN_TASK flag that bio submitters can set to request
> task-context completion of their bi_end_io callback.
> 
> When bio_endio() sees this flag and is running in non-task context, it
> queues the bio to a per-cpu list and schedules a work item to call
> bi_end_io() from task context. A CPU hotplug dead callback drains any
> remaining bios from the departing CPU's batch.
> 
> This will be used to enable RWF_DONTCACHE for block devices, and could
> be used for other subsystems like fscrypt that need task-context bio
> completion.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
>  block/bio.c               | 84 ++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/blk_types.h |  1 +
>  2 files changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 8203bb7455a9..69ee0d93041f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -18,6 +18,7 @@
>  #include <linux/highmem.h>
>  #include <linux/blk-crypto.h>
>  #include <linux/xarray.h>
> +#include <linux/local_lock.h>
>  
>  #include <trace/events/block.h>
>  #include "blk.h"
> @@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
>  }
>  EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
>  
> +struct bio_complete_batch {
> +	local_lock_t lock;
> +	struct bio_list list;
> +	struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
> +	.lock = INIT_LOCAL_LOCK(lock),
> +};
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> +	struct bio_complete_batch *batch;
> +	struct bio_list list;
> +
> +again:
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	list = batch->list;
> +	bio_list_init(&batch->list);
> +	local_unlock_irq(&bio_complete_batch.lock);

This is just a FIFO processing queue, and it is so wanting to be a
struct llist for lockless queuing and dequeueing.

We do this lockless per-cpu queue + per-cpu workqueue in XFS for
background inode GC processing. See struct xfs_inodegc and all the
xfs_inodegc_*() functions - it may be useful to have a generic
lockless per-cpu queue processing so we don't keep open coding this
repeating pattern everywhere.
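[The llist pattern referred to here can be sketched in userspace C11 atomics: producers push with a CAS loop and the consumer detaches the whole chain with one atomic exchange, so neither side needs a lock. This is an illustrative model of the kernel's llist_add()/llist_del_all(), not the actual <linux/llist.h> implementation; all names are invented.]

```c
/* Userspace sketch of a lockless per-CPU-style queue: CAS push,
 * exchange-based batch detach, mirroring llist_add()/llist_del_all(). */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
};

static _Atomic(struct lnode *) lhead;

/* llist_add() analogue: retry the CAS until the push lands. */
static void llist_push(struct lnode *n)
{
	struct lnode *old = atomic_load(&lhead);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&lhead, &old, n));
}

/* llist_del_all() analogue: take the whole batch in one atomic op. */
static struct lnode *llist_take_all(void)
{
	return atomic_exchange(&lhead, NULL);
}

static int count_chain(struct lnode *n)
{
	int c = 0;

	for (; n; n = n->next)
		c++;
	return c;
}
```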

> +
> +	while (!bio_list_empty(&list)) {
> +		struct bio *bio = bio_list_pop(&list);
> +		bio->bi_end_io(bio);
> +	}
> +
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	if (!bio_list_empty(&batch->list)) {
> +		local_unlock_irq(&bio_complete_batch.lock);
> +
> +		if (!need_resched())
> +			goto again;
> +
> +		schedule_work_on(smp_processor_id(), &batch->work);

We've learnt that immediately scheduling per-cpu batch
processing work can cause context switch storms as the queue/dequeue
steps one work item at a time.

Hence we use a delayed work with a scheduling delay of a single
jiffy to allow batches of queued work from a single context to
complete before (potentially) being pre-empted by the per-cpu
kworker task that will process the queue...

> +		return;
> +	}
> +	local_unlock_irq(&bio_complete_batch.lock);
> +}
> +
> +static void bio_queue_completion(struct bio *bio)
> +{
> +	struct bio_complete_batch *batch;
> +	unsigned long flags;
> +
> +	local_lock_irqsave(&bio_complete_batch.lock, flags);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	bio_list_add(&batch->list, bio);
> +	local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> +	schedule_work_on(smp_processor_id(), &batch->work);
> +}

Yeah, we definitely want to queue all the pending bio completions
the interrupt is delivering before we run the batch processing...

> +
>  static inline bool bio_remaining_done(struct bio *bio)
>  {
>  	/*
> @@ -1788,7 +1843,9 @@ void bio_endio(struct bio *bio)
>  	}
>  #endif
>  
> -	if (bio->bi_end_io)
> +	if (!in_task() && bio_flagged(bio, BIO_COMPLETE_IN_TASK))
> +		bio_queue_completion(bio);
> +	else if (bio->bi_end_io)
>  		bio->bi_end_io(bio);
>  }
>  EXPORT_SYMBOL(bio_endio);
> @@ -1974,6 +2031,21 @@ int bioset_init(struct bio_set *bs,
>  }
>  EXPORT_SYMBOL(bioset_init);
>  
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead so no locking
> + * is needed -- no new bios will be queued to it.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> +	struct bio_complete_batch *batch = per_cpu_ptr(&bio_complete_batch, cpu);
> +	struct bio *bio;
> +
> +	while ((bio = bio_list_pop(&batch->list)))
> +		bio->bi_end_io(bio);
> +
> +	return 0;
> +}

If you use a llist for the queue, this code is no different to the
normal processing work.

> +
>  static int __init init_bio(void)
>  {
>  	int i;
> @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
>  				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>  	}
>  
> +	for_each_possible_cpu(i) {
> +		struct bio_complete_batch *batch =
> +			per_cpu_ptr(&bio_complete_batch, i);
> +
> +		bio_list_init(&batch->list);
> +		INIT_WORK(&batch->work, bio_complete_work_fn);
> +	}
> +
> +	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> +				NULL, bio_complete_batch_cpu_dead);

XFS inodegc tracks the CPUs with work queued via a cpumask and
iterates the CPU mask for "all CPU" iteration scans. This avoids the
need for CPU hotplug integration...

>  	cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
>  					bio_cpu_dead);
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
>  	BIO_REMAPPED,
>  	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
>  	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> +	BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */

Can anyone set this on a bio they submit? i.e. this needs a better
description: who can use it, what the constraints are, what is
guaranteed, etc.

I ask because the higher filesystem layers often know at submission
time that we need task-based IO completion. If we can tell the bio
we are submitting that it needs task completion and have the block
layer guarantee that the ->end_io completion only ever runs in task
context, then we can get rid of multiple instances of IO completion
deferral to task context in filesystem code (e.g. iomap - for both
buffered and direct IO, xfs buffer cache write completions, etc).

-Dave.
-- 
Dave Chinner
dgc@kernel.org


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
  2026-03-25 20:21   ` Matthew Wilcox
@ 2026-03-25 20:34   ` Dave Chinner
  2026-03-27  6:08     ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2026-03-25 20:34 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:43:01PM -0400, Tal Zussman wrote:
> Set BIO_COMPLETE_IN_TASK on iomap writeback bios when
> IOMAP_IOEND_DONTCACHE is set. This ensures that bi_end_io runs in task
> context, where folio_end_dropbehind() can safely invalidate folios.
> 
> With the bio layer now handling task-context deferral generically, XFS
> no longer needs to route DONTCACHE ioends through its completion
> workqueue for page cache invalidation. Remove the DONTCACHE check from
> xfs_ioend_needs_wq_completion().
> 
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
>  fs/iomap/ioend.c  | 2 ++
>  fs/xfs/xfs_aops.c | 4 ----
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
> index e4d57cb969f1..6b8375d11cc0 100644
> --- a/fs/iomap/ioend.c
> +++ b/fs/iomap/ioend.c
> @@ -113,6 +113,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
>  			       GFP_NOFS, &iomap_ioend_bioset);
>  	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
>  	bio->bi_write_hint = wpc->inode->i_write_hint;
> +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
>  	wbc_init_bio(wpc->wbc, bio);
>  	wpc->nr_folios = 0;
>  	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 76678814f46f..0d469b91377d 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -510,10 +510,6 @@ xfs_ioend_needs_wq_completion(
>  	if (ioend->io_flags & (IOMAP_IOEND_UNWRITTEN | IOMAP_IOEND_SHARED))
>  		return true;
>  
> -	/* Page cache invalidation cannot be done in irq context. */
> -	if (ioend->io_flags & IOMAP_IOEND_DONTCACHE)
> -		return true;
> -
>  	return false;
>  }

Ok, so higher layers can set it.

At this point, I'd suggest that we should not be making random
one-off changes to the iomap and filesystem layers like this just
for one operation that needs deferred IO completion work. This needs
to considered from the overall perspective of how we defer
completion work -  there are lots of different paths through
filesystems and/or iomap that require/use task deferal for IO
completion. We want them all to use the same mechanism - splitting
deferal between multiple layers depending on IO type is not a
particularly nice thing to be doing...

-Dave.
-- 
Dave Chinner
dgc@kernel.org



* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 20:26   ` Dave Chinner
@ 2026-03-25 20:39     ` Matthew Wilcox
  2026-03-26  2:44       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2026-03-25 20:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, Alexander Viro, Jan Kara, Christoph Hellwig,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Thu, Mar 26, 2026 at 07:26:26AM +1100, Dave Chinner wrote:
> > @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
> >  				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >  	}
> >  
> > +	for_each_possible_cpu(i) {
> > +		struct bio_complete_batch *batch =
> > +			per_cpu_ptr(&bio_complete_batch, i);
> > +
> > +		bio_list_init(&batch->list);
> > +		INIT_WORK(&batch->work, bio_complete_work_fn);
> > +	}
> > +
> > +	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> > +				NULL, bio_complete_batch_cpu_dead);
> 
> XFS inodegc tracks the CPUs with work queued via a cpumask and
> iterates the CPU mask for "all CPU" iteration scans. This avoids the
> need for CPU hotplug integration...

Can you elaborate a bit on how this would work in this context?
I understand why inode garbage collection might do an "all CPU"
iteration, but I don't understand the circumstances under which
we'd iterate over all CPUs to complete deferred BIOs.

> > +++ b/include/linux/blk_types.h
> > @@ -322,6 +322,7 @@ enum {
> >  	BIO_REMAPPED,
> >  	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> >  	BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> > +	BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
> 
> Can anyone set this on a bio they submit? i.e. This needs a better
> description. Who can use it, constraints, guarantees, etc.
> 
> I ask, because the higher filesystem layers often know at submission
> time that we need task based IO completion. If we can tell the bio
> we are submitting that it needs task completion and have the block
> layer guarantee that the ->end_io completion only ever runs in task
> context, then we can get rid of multiple instances of IO completion
> deferral to task context in filesystem code (e.g. iomap - for both
> buffered and direct IO, xfs buffer cache write completions, etc).

Right, that's the idea, this would be entirely general.  I want to do
it for all pagecache writeback so we can change i_pages.xa_lock from
being irq-safe to only taken in task context.



* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
                     ` (2 preceding siblings ...)
  2026-03-25 20:26   ` Dave Chinner
@ 2026-03-25 21:03   ` Bart Van Assche
  2026-03-26  3:18     ` Dave Chinner
  2026-03-27  6:01   ` Christoph Hellwig
  4 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2026-03-25 21:03 UTC (permalink / raw)
  To: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara
  Cc: Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

On 3/25/26 11:43 AM, Tal Zussman wrote:
> +	schedule_work_on(smp_processor_id(), &batch->work);

Since schedule_work_on() queues work on system_percpu_wq, the above call
has the same effect as schedule_work(&batch->work), doesn't it? From the
workqueue implementation:

	system_percpu_wq = alloc_workqueue("events", WQ_PERCPU, 0);

[ ... ]

	if (req_cpu == WORK_CPU_UNBOUND) {
		if (wq->flags & WQ_UNBOUND)
			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
		else
			cpu = raw_smp_processor_id();

Thanks,

Bart.



* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 20:39     ` Matthew Wilcox
@ 2026-03-26  2:44       ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2026-03-26  2:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, Alexander Viro, Jan Kara, Christoph Hellwig,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 08:39:21PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 26, 2026 at 07:26:26AM +1100, Dave Chinner wrote:
> > > @@ -1988,6 +2060,16 @@ static int __init init_bio(void)
> > >  				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > >  	}
> > >  
> > > +	for_each_possible_cpu(i) {
> > > +		struct bio_complete_batch *batch =
> > > +			per_cpu_ptr(&bio_complete_batch, i);
> > > +
> > > +		bio_list_init(&batch->list);
> > > +		INIT_WORK(&batch->work, bio_complete_work_fn);
> > > +	}
> > > +
> > > +	cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "block/bio:complete:dead",
> > > +				NULL, bio_complete_batch_cpu_dead);
> > 
> > XFS inodegc tracks the CPUs with work queued via a cpumask and
> > iterates the CPU mask for "all CPU" iteration scans. This avoids the
> > need for CPU hotplug integration...
> 
> Can you elaborate a bit on how this would work in this context?

It may not even be relevant. I was just mentioning it because if
someone looks at the xfs_inodegc code (as I suggested) they might
wonder why there aren't hotplug hooks for a per-cpu queuing
algorithm and/or why it tracked CPUs with queued items via a CPU
mask...

-Dave.
-- 
Dave Chinner
dgc@kernel.org



* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 21:03   ` Bart Van Assche
@ 2026-03-26  3:18     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2026-03-26  3:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:03:40PM -0700, Bart Van Assche wrote:
> On 3/25/26 11:43 AM, Tal Zussman wrote:
> > +	schedule_work_on(smp_processor_id(), &batch->work);
> 
> Since schedule_work_on() queues work on system_percpu_wq, the above call
> has the same effect as schedule_work(&batch->work), doesn't it?

No. Two words: Task preemption.

And in saying this, I realise the originally proposed code is dodgy.
It might look ok because in the common case interrupt context
processing can't be preempted. However, I don't think that is true
for PREEMPT_RT kernels (IIRC interrupt processing runs as a task
that can be preempted). Also, bio completion can naturally run from
task context because the submitter can hold the last reference to
the bio.

Hence the queueing function can be preempted and scheduled to a
different CPU like so:

local_lock_irq()
queue on CPU 0
local_unlock_irq()
<preempt>
<run on CPU 1>
schedule_work_on(smp_processor_id())

That results in bio completion being queued on CPU 0, but the
processing work is scheduled for CPU 1. Oops.

> From the
> workqueue implementation:
> 
> 	system_percpu_wq = alloc_workqueue("events", WQ_PERCPU, 0);
> 
> [ ... ]
> 	if (req_cpu == WORK_CPU_UNBOUND) {
> 		if (wq->flags & WQ_UNBOUND)
> 			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
> 		else
> 			cpu = raw_smp_processor_id();

Same preemption problem as above.


-Dave.
-- 
Dave Chinner
dgc@kernel.org



* Re: [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion
  2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
                     ` (3 preceding siblings ...)
  2026-03-25 21:03   ` Bart Van Assche
@ 2026-03-27  6:01   ` Christoph Hellwig
  4 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2026-03-27  6:01 UTC (permalink / raw)
  To: Tal Zussman
  Cc: Jens Axboe, Matthew Wilcox (Oracle), Christian Brauner,
	Darrick J. Wong, Carlos Maiolino, Alexander Viro, Jan Kara,
	Christoph Hellwig, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 02:43:00PM -0400, Tal Zussman wrote:
> Some bio completion handlers need to run in task context but bio_endio()
> can be called from IRQ context (e.g. buffer_head writeback). Add a
> BIO_COMPLETE_IN_TASK flag that bio submitters can set to request
> task-context completion of their bi_end_io callback.
> 
> When bio_endio() sees this flag and is running in non-task context, it
> queues the bio to a per-cpu list and schedules a work item to call
> bi_end_io() from task context. A CPU hotplug dead callback drains any
> remaining bios from the departing CPU's batch.
> 
> This will be used to enable RWF_DONTCACHE for block devices, and could
> be used for other subsystems like fscrypt that need task-context bio
> completion.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Tal Zussman <tz2294@columbia.edu>
> ---
>  block/bio.c               | 84 ++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/blk_types.h |  1 +
>  2 files changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 8203bb7455a9..69ee0d93041f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -18,6 +18,7 @@
>  #include <linux/highmem.h>
>  #include <linux/blk-crypto.h>
>  #include <linux/xarray.h>
> +#include <linux/local_lock.h>
>  
>  #include <trace/events/block.h>
>  #include "blk.h"
> @@ -1714,6 +1715,60 @@ void bio_check_pages_dirty(struct bio *bio)
>  }
>  EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
>  
> +struct bio_complete_batch {
> +	local_lock_t lock;
> +	struct bio_list list;
> +	struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct bio_complete_batch, bio_complete_batch) = {
> +	.lock = INIT_LOCAL_LOCK(lock),
> +};
> +
> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> +	struct bio_complete_batch *batch;
> +	struct bio_list list;
> +
> +again:
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	list = batch->list;
> +	bio_list_init(&batch->list);
> +	local_unlock_irq(&bio_complete_batch.lock);
> +
> +	while (!bio_list_empty(&list)) {
> +		struct bio *bio = bio_list_pop(&list);
> +		bio->bi_end_io(bio);
> +	}

bio_list_pop already does a NULL check, so this could be:

	while ((bio = bio_list_pop(&list)))
		bio->bi_end_io(bio);

In fact that same pattern is repeated later, so maybe just add a helper
for it?  But I think Dave's idea of just using a llist (and adding a
new llist member to the bio for this) seems sensible.  Just don't forget
the llist_reverse_order call to avoid reordering.

> +
> +	local_lock_irq(&bio_complete_batch.lock);
> +	batch = this_cpu_ptr(&bio_complete_batch);
> +	if (!bio_list_empty(&batch->list)) {
> +		local_unlock_irq(&bio_complete_batch.lock);
> +
> +		if (!need_resched())
> +			goto again;
> +
> +		schedule_work_on(smp_processor_id(), &batch->work);
> +		return;
> +	}
> +	local_unlock_irq(&bio_complete_batch.lock);

I don't really understand this requeue logic.  Can you explain it?

> +	schedule_work_on(smp_processor_id(), &batch->work);

We'll probably want a dedicated workqueue here to avoid deadlocks
vs other system wq uses.

> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> +	struct bio_complete_batch *batch = per_cpu_ptr(&bio_complete_batch, cpu);

Overly long line.




* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-25 20:21   ` Matthew Wilcox
@ 2026-03-27  6:03     ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2026-03-27  6:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tal Zussman, Jens Axboe, Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, Alexander Viro, Jan Kara, Christoph Hellwig,
	linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Wed, Mar 25, 2026 at 08:21:28PM +0000, Matthew Wilcox wrote:
> > +	if (ioend_flags & IOMAP_IOEND_DONTCACHE)
> > +		bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
> >  	wbc_init_bio(wpc->wbc, bio);
> >  	wpc->nr_folios = 0;
> >  	return iomap_init_ioend(wpc->inode, bio, pos, ioend_flags);
> 
> Can't we delete IOMAP_IOEND_DONTCACHE, and just do:
> 
> 	if (folio_test_dropbehind(folio))
> 		bio_set_flag(&ioend->io_bio, BIO_COMPLETE_IN_TASK);
> 
> It'd need to move down a few lines in iomap_add_to_ioend() to after
> bio_add_folio() succeeds.

Yes, that sounds sensible.




* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-25 20:34   ` Dave Chinner
@ 2026-03-27  6:08     ` Christoph Hellwig
  2026-03-27  6:24       ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2026-03-27  6:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, Christoph Hellwig, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Thu, Mar 26, 2026 at 07:34:45AM +1100, Dave Chinner wrote:
> At this point, I'd suggest that we should not be making random
> one-off changes to the iomap and filesystem layers like this just
> for one operation that needs deferred IO completion work. This needs
> to be considered from the overall perspective of how we defer
> completion work -  there are lots of different paths through
> filesystems and/or iomap that require/use task deferral for IO
> completion. We want them all to use the same mechanism - splitting
> deferral between multiple layers depending on IO type is not a
> particularly nice thing to be doing...

Yes and no.  The XFS/iomap write completions need special handling
for merging operations, using different workqueues, and also the
serialization provided by the per-inode list.

Everything that just needs a dumb user context should be the same,
though.  And this mechanism should work just fine for the T10 PI
checksums.  It does not currently work for the defer-to-task on
error used by the fserror reporting, but should be adaptable to
that by allowing an I/O completion to also be deferred from an
already running end_io handler, although that might get ugly.

It should work really well for other places that defer bio completions
like the erofs decompression handler that recently came up, and it will
be very useful to implement actually working REQ_NOWAIT support for
file system writes.  So yes, I think we need to look more at the whole
picture, and I think this is a good building block for it.  I don't
think we can converge on just a single mechanism, but having a few
generic ones is good.




* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-27  6:08     ` Christoph Hellwig
@ 2026-03-27  6:24       ` Gao Xiang
  2026-03-27  6:27         ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Gao Xiang @ 2026-03-27  6:24 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner
  Cc: Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm

Hi Christoph,

On 2026/3/27 14:08, Christoph Hellwig wrote:
> On Thu, Mar 26, 2026 at 07:34:45AM +1100, Dave Chinner wrote:
>> At this point, I'd suggest that we should not be making random
>> one-off changes to the iomap and filesystem layers like this just
>> for one operation that needs deferred IO completion work. This needs
>> to be considered from the overall perspective of how we defer
>> completion work -  there are lots of different paths through
>> filesystems and/or iomap that require/use task deferral for IO
>> completion. We want them all to use the same mechanism - splitting
>> deferral between multiple layers depending on IO type is not a
>> particularly nice thing to be doing...
> 
> Yes and no.  The XFS/iomap write completions needs special handling
> for merging operation, using different workqueues, and also the
> serialization provided by the per-inode list.
> 
> Everything that just needs a dumb user context should be the same,
> though.  And this mechanism should work just fine for the T10 PI
> checksums.  It does not currently work for the defer to user on error
> used by the fserror reporting, but should be adaptable to that by
> allowing to also defer an I/O completion from an already running
> end_io handler, although that might get ugly.
> 
> It should work really well for other places that defer bio completions
> like the erofs decompression handler that recently came up, and it will

I noticed this work, but the current EROFS decompression has
two latency-sensitive cases:

  - dm-verity calls the EROFS completion: in that case, this
    work fits well, since dm-verity already adds some merkle
    tree latency and we just don't want to add more scheduling
    latency with another workqueue;

  - EROFS used directly: in that case, we still need process
    context to decompress, but due to Android latency
    requirements they really need per-cpu RT threads instead,
    otherwise it will cause serious regressions too; but I'm
    not sure that case can be replaced by this work, since
    workqueues don't support RT threads and I guess the generic
    block layer won't want to be bothered with that either.

Thanks,
Gao Xiang

> be very useful to implement actually working REQ_NOWAIT support for
> file system writes.  So yes, I think we need to look more at the whole
> picture, and I think this is a good building block considering the
> whole picture.  I don't think we can converge on just a single mechanism,
> but having few and generic ones is good.
> 
> 




* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-27  6:24       ` Gao Xiang
@ 2026-03-27  6:27         ` Christoph Hellwig
  2026-03-27  6:45           ` Gao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2026-03-27  6:27 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Christoph Hellwig, Dave Chinner, Tal Zussman, Jens Axboe,
	Matthew Wilcox (Oracle), Christian Brauner, Darrick J. Wong,
	Carlos Maiolino, Alexander Viro, Jan Kara, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm

On Fri, Mar 27, 2026 at 02:24:02PM +0800, Gao Xiang wrote:
>  - use EROFS directly, in that case, we still need process
>    contexts to decompress, but due to Android latency
>    requirements, they really need per-cpu RT threads instead,
>    otherwise it will cause serious regression too; but I'm not
>    sure that case can be replaced by this work since workqueues
>    don't support RT threads and I guess generic block layer
>    won't be bothered with that too.

All of the I/O completions should be latency sensitive.  So I think it
would be great if you could help out here with the requirements and
implementation.




* Re: [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
  2026-03-27  6:27         ` Christoph Hellwig
@ 2026-03-27  6:45           ` Gao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Gao Xiang @ 2026-03-27  6:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Tal Zussman, Jens Axboe, Matthew Wilcox (Oracle),
	Christian Brauner, Darrick J. Wong, Carlos Maiolino,
	Alexander Viro, Jan Kara, linux-block, linux-kernel, linux-xfs,
	linux-fsdevel, linux-mm



On 2026/3/27 14:27, Christoph Hellwig wrote:
> On Fri, Mar 27, 2026 at 02:24:02PM +0800, Gao Xiang wrote:
>>   - use EROFS directly, in that case, we still need process
>>     contexts to decompress, but due to Android latency
>>     requirements, they really need per-cpu RT threads instead,
>>     otherwise it will cause serious regression too; but I'm not
>>     sure that case can be replaced by this work since workqueues
>>     don't support RT threads and I guess generic block layer
>>     won't be bothered with that too.
> 
> All of the I/O completions should be latency sensitive.  So I think it
> would be great if you could help out here with the requirements and
> implementation.

Yes, especially for sync read completion. Our requirements can
be outlined as:

   - a flag to complete the whole bio in task context, so that
     we can rely on the completion running in a task and don't
     need to worry about it;

   - another per-CPU RT thread flag (or similar) attached to a
     bio or some other object, so that bio completion can be
     handled by per-cpu RT threads instead of workqueues.

If both are met, I think that would be very helpful for
cleaning up our internal codebase, at least.

Thanks,
Gao Xiang



end of thread, other threads:[~2026-03-27  6:45 UTC | newest]

Thread overview: 19+ messages
2026-03-25 18:42 [PATCH RFC v4 0/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
2026-03-25 18:43 ` [PATCH RFC v4 1/3] block: add BIO_COMPLETE_IN_TASK for task-context completion Tal Zussman
2026-03-25 19:54   ` Matthew Wilcox
2026-03-25 20:14   ` Jens Axboe
2026-03-25 20:26   ` Dave Chinner
2026-03-25 20:39     ` Matthew Wilcox
2026-03-26  2:44       ` Dave Chinner
2026-03-25 21:03   ` Bart Van Assche
2026-03-26  3:18     ` Dave Chinner
2026-03-27  6:01   ` Christoph Hellwig
2026-03-25 18:43 ` [PATCH RFC v4 2/3] iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback Tal Zussman
2026-03-25 20:21   ` Matthew Wilcox
2026-03-27  6:03     ` Christoph Hellwig
2026-03-25 20:34   ` Dave Chinner
2026-03-27  6:08     ` Christoph Hellwig
2026-03-27  6:24       ` Gao Xiang
2026-03-27  6:27         ` Christoph Hellwig
2026-03-27  6:45           ` Gao Xiang
2026-03-25 18:43 ` [PATCH RFC v4 3/3] block: enable RWF_DONTCACHE for block devices Tal Zussman
