Linux EXT4 FS development

Linux EXT4 FS development
 help / color / mirror / Atom feed

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Gao Xiang @ 2026-06-16 16:35 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260616123443.GA21024@lst.de>

On 2026/6/16 20:34, Christoph Hellwig wrote:

> IMHO sharing devices between superblocks is a bad idea, but that ship
> has sailed, but please keep it contained inside of erofs.

I'm not sure why it's a bad idea, for example,
the immutable layer model is already applied to layered virtual
block formats (such as qcow2) and layered fs like overlayfs.

and I think device mappers may have some similar immutable
approaches as shared layers but works in a slight different
way.

The principle is that each instance uses shared blobs in a
read-only way, and that is almost a simple and safest way
to share data among filesystem instances.

Yet I don't want to argue with that since it's pretty common
for years and I've seen no practical risk using this model.

Thanks,
Gao Xiang

^ permalink raw reply

* Re: [PATCH v3 0/3] f2fs: support encrypted inline data
From: Eric Biggers @ 2026-06-16 23:02 UTC (permalink / raw)
  To: LiaoYuanhong-vivo
  Cc: chao, corbet, jaegeuk, linux-doc, linux-ext4, linux-f2fs-devel,
	linux-fscrypt, linux-kernel, skhan, tytso
In-Reply-To: <20260616094612.45505-1-liaoyuanhong@vivo.com>

On Tue, Jun 16, 2026 at 05:46:12PM +0800, LiaoYuanhong-vivo wrote:
> Could you share more about the direction you have in mind for simplifying
> f2fs/ext4 contents encryption around blk-crypto?

Currently ext4 and f2fs each have two implementations of file contents
encryption and decryption:

- One where the en/decryption is done in the filesystem layer

- One where the filesystem attaches a bio_crypt_ctx to the bios and the
  en/decryption is done either in the block layer by blk-crypto-fallback
  or by inline encryption hardware

I'd like to drop the first one, for simplicity and to reduce the burden
on ongoing developments like large folio support.

> For f2fs inline_data, there is still a real space-saving benefit on phones,
> since many encrypted files are smaller than 4K. Is there any acceptable
> future direction to support this kind of inode-resident data with
> blk-crypto or hardware-wrapped keys?

It is incompatible with inline encryption hardware.  A CPU-based
solution like Intel Key Locker or RISC-V High Assurance Cryptography
could provide similar security properties.  But there's nothing for
arm64 yet.  And I should mention that no one has wanted to use Key
Locker anyway because it's really slow.

- Eric

^ permalink raw reply

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Zhang Yi @ 2026-06-17  2:45 UTC (permalink / raw)
  To: Baokun Li, linux-ext4
  Cc: tytso, adilger.kernel, jack, yi.zhang, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <060f63e0-d64f-40df-99a7-af53862049ee@linux.alibaba.com>

Hi, Baokun!

On 6/16/2026 9:10 PM, Baokun Li wrote:
> Hi all,
> 
> Thank you for your review!
> 
> After extensive testing, I found that after merging this patch, generic/746
> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The test
> triggers a "Page cache invalidation failure on direct I/O" warning, and
> subsequent fsync returns -EIO.
> 
> The underlying race existed before this patch, but this patch appears to
> have widened the reproduction window considerably, so I thought it worth
> trying to address.  Here is my analysis:
> 
> On no-extent inodes, DIO writes that hit holes cannot use unwritten
> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
> returns 0 for a hole, and:
> 
>          if (!m_flags && !ret)
>                  ret = -ENOTBLK;
> 
> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
> I/O.  The fallback path dirties pages in the page cache, then flushes
> and invalidates them.  However, concurrent async DIO completions to
> other blocks on the same inode can run kiocb_invalidate_post_direct_write()
> without holding the inode lock.
> 
> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
> to the written extent, while thread B does DIO spanning both extents:
> 
>    kworker A (4k DIO, allocated block)    kworker B (8k DIO, hole->fallback)
>    -----------------------------------    -----------------------------------
>    inode_lock_shared()                    inode_lock_shared()
>    iomap_dio_rw():                        iomap_dio_rw():
>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>      submit_bio (async)                     dio->size = 0
>    inode_unlock_shared()                  inode_unlock_shared()
> 
>    [bio pending in block layer]           /* fallback: inode lock released */
>                                           ext4_buffered_write_iter()
>                                             inode_lock(exclusive)
>                                             generic_perform_write()
>                                               -> dirty pages [0, 8k]
>                                             inode_unlock(exclusive)
> 
>                                           /* pages still dirty here */
>    [bio completes]                        filemap_write_and_wait_range()
>    iomap_dio_complete()                     -> flush dirty pages
>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>        invalidate_inode_pages2_range()
>        -> finds dirty page!               /* window closed */
>        -> dio_warn_stale_pagecache()
>        -> errseq_set(-EIO)
> 

It looks like this issue occurs when invalidate_inode_pages2_range()
checks beyond the DIO write range, which may only happen when folio size
is larger than block size. Is that correct?

> The critical window is the gap between ext4_buffered_write_iter() dirtying
> pages and filemap_write_and_wait_range() flushing them.  In this window the
> inode lock is not held, so another thread's async DIO completion is free to
> invalidate the still-dirty pages in the page cache.
> 
> This race has always existed on ext3 because indirect-block inodes lack
> unwritten-extent support.  However, the window was extremely narrow in
> practice, because the old ext4_overwrite_io() checked every block and
> would conservatively take an exclusive lock.  This patch replaced it
> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
> making unaligned DIO more likely to take a shared lock and
> proportionally increasing the chance of hitting the race.
> 
> I tried a couple of alternatives before settling on the patch below:
> 
> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
>     This closes the window for new DIO submissions, but does not protect
>     against bio completions from previously submitted async DIO, which
>     run independently of the inode lock.
> 
> 2. Wrap the fallback dirty+flush+invalidate sequence in
>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
>     do not use this lock, so it would not serialise against DIO
>     completions.
> 

Could we add a call to inode_dio_wait() before falling back to buffered
I/O? That is, in thread B, when falling back to buffered I/O, could we
acquire the exclusive inode lock and then call inode_dio_wait() to wait
for in-flight DIO to complete? This should close the race window. Since
scenarios where DIO writes to holes on ext3 are relatively rare, the
performance impact should be minimal (I suppose).

> One straightforward approach that seems correct is to skip direct I/O
> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
>   {
>          if (fsverity_active(inode))
>                  return 0;
> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +               return 0;
>          if (ext4_should_journal_data(inode))
>                  return 0;
>          if (ext4_has_inline_data(inode))
> 
> With this, ext4_should_use_dio() returns false for no-extent inodes, and
> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
> DIO path entirely.  On ext3, DIO to a hole already falls back to buffered
> I/O, so there is essentially no performance benefit to using DIO in the
> first place.
> 
> Note that with this change, the fallback branch in ext4_dio_write_iter():
> 
>          if (ret >= 0 && iov_iter_count(from)) {
>                  /* buffered fallback */
>          }
> 
> would also become dead code for extent-based inodes (since unwritten
> extents guarantee iomap_dio_rw() never returns zero with unconsumed
> data), and could be removed in a follow-up cleanup.
> 
> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
> I'm missing?
> 

Hmm, this would also cause DIO to fall back to buffered I/O in common
extending write cases, which I think would be unacceptable.

Cheers,
Yi.

> Looking forward to your feedback.
> 
> 
> Thanks,
> Baokun
> 
> 
> 


^ permalink raw reply

* Re: [PATCH RFC v2 15/18] f2fs: open via dedicated fs bdev helpers
From: Chao Yu @ 2026-06-17  3:17 UTC (permalink / raw)
  To: Christian Brauner, Jan Kara
  Cc: chao, Christoph Hellwig, Jens Axboe, Alexander Viro, linux-block,
	linux-kernel, linux-fsdevel, Carlos Maiolino, linux-xfs,
	Chris Mason, David Sterba, linux-btrfs, Theodore Ts'o,
	linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-15-7df6b864028e@kernel.org>

On 6/16/26 22:08, Christian Brauner wrote:
> Route the extra device opens of a multi-device f2fs through
> fs_bdev_file_open_by_path() so each device is registered against the
> superblock, and convert the matching release in destroy_device_list()
> to fs_bdev_file_release(). The first device aliases the main bdev file
> opened by setup_bdev_super() and is already registered through it.
> 
> f2fs opened its extra devices without holder ops, so a freeze, sync, or
> removal of one of them was never propagated to the superblock.
> Registering them wires those events up: every device now freezes,
> thaws, syncs, and shuts down the filesystem like the main device does.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Acked-by: Chao Yu <chao@kernel.org>

Thanks,

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christoph Hellwig @ 2026-06-17  6:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Christoph Hellwig, Jan Kara, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-fragil-duktus-nachverfolgen-60f54584c206@brauner>

On Tue, Jun 16, 2026 at 04:59:53PM +0200, Christian Brauner wrote:
> > Err, no.  block devices need to have a specific owner.  If erofs wants
> > to share a device between superblock it needs to come up with an entity
> > that owns the block devices which is not a superblock.
> 
> It already did.
> 
> > IMHO sharing devices between superblocks is a bad idea, but that ship
> > has sailed, but please keep it contained inside of erofs.
> 
> We need a simple device number to superblock mapping anyway and that can
> simply be centralized in the vfs. And it can work with anon device
> numbers and block device numbers uniformly.

No, we don't need a secondary device number to sb mapping.  On the other
hand we do need the deviceloss, freeze etc upcalls to work for owners
that are not file systems like mdraid or dm, even if they have been
slow to pick this.  The whole idea of the holder ops is to abstract
away from who holds it instead of adding back the broken hard coding
of the superblock.  Otherwise you're just badly reinventing get_super.

If erofs already has an owner entity it just needs custom holder ops for
that.

^ permalink raw reply

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Baokun Li @ 2026-06-17  7:52 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, yi.zhang, ojaswin,
	ritesh.list, peng_wang
In-Reply-To: <d1adcf7c-c276-458d-9cac-68a4410f7626@gmail.com>

On 2026/6/17 10:45, Zhang Yi wrote:
> Hi, Baokun!
>
> On 6/16/2026 9:10 PM, Baokun Li wrote:
>> Hi all,
>>
>> Thank you for your review!
>>
>> After extensive testing, I found that after merging this patch,
>> generic/746
>> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The
>> test
>> triggers a "Page cache invalidation failure on direct I/O" warning, and
>> subsequent fsync returns -EIO.
>>
>> The underlying race existed before this patch, but this patch appears to
>> have widened the reproduction window considerably, so I thought it worth
>> trying to address.  Here is my analysis:
>>
>> On no-extent inodes, DIO writes that hit holes cannot use unwritten
>> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
>> returns 0 for a hole, and:
>>
>>          if (!m_flags && !ret)
>>                  ret = -ENOTBLK;
>>
>> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
>> I/O.  The fallback path dirties pages in the page cache, then flushes
>> and invalidates them.  However, concurrent async DIO completions to
>> other blocks on the same inode can run
>> kiocb_invalidate_post_direct_write()
>> without holding the inode lock.
>>
>> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
>> to the written extent, while thread B does DIO spanning both extents:
>>
>>    kworker A (4k DIO, allocated block)    kworker B (8k DIO,
>> hole->fallback)
>>    -----------------------------------   
>> -----------------------------------
>>    inode_lock_shared()                    inode_lock_shared()
>>    iomap_dio_rw():                        iomap_dio_rw():
>>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>>      submit_bio (async)                     dio->size = 0
>>    inode_unlock_shared()                  inode_unlock_shared()
>>
>>    [bio pending in block layer]           /* fallback: inode lock
>> released */
>>                                           ext4_buffered_write_iter()
>>                                             inode_lock(exclusive)
>>                                             generic_perform_write()
>>                                               -> dirty pages [0, 8k]
>>                                             inode_unlock(exclusive)
>>
>>                                           /* pages still dirty here */
>>    [bio completes]                        filemap_write_and_wait_range()
>>    iomap_dio_complete()                     -> flush dirty pages
>>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>>        invalidate_inode_pages2_range()
>>        -> finds dirty page!               /* window closed */
>>        -> dio_warn_stale_pagecache()
>>        -> errseq_set(-EIO)
>>
>
> It looks like this issue occurs when invalidate_inode_pages2_range()
> checks beyond the DIO write range, which may only happen when folio size
> is larger than block size. Is that correct?
Thanks for looking at this!

Not quite — the scenario involves an 8k file with layout

 [hole at 0-4k] [written extent at 4k-8k]

and two DIO threads. Thread A does a 4k DIO write at offset 4k; since
the target block is a written extent, no fallback occurs. Thread B
does an 8k DIO write at offset 0; since blocks 0-4k are a hole on an
indirect-block inode and ext3 does not support unwritten extents,
iomap returns -ENOTBLK and the entire 8k write falls back to buffered
I/O.

Normally the kernel would prevent concurrent BIO and DIO to
overlapping ranges on the same file. But because Thread A holds only
a shared inode lock (pure overwrite on a written extent), and
Thread B's DIO has already returned -ENOTBLK before its buffered
fallback begins, both paths can proceed concurrently:


Thread A: 4k DIO at offset 4k     Thread B: 8k DIO at offset 0
─────────────────────────────     ─────────────────────────────
kiocb_invalidate_pages            iomap_begin → -ENOTBLK
  → page index 1 already clean      (indirect inode hole, m_flags=0)
submit_bio (async)                dio->size = 0
inode_unlock_shared()             inode_unlock_shared()

                                  ext4_buffered_write_iter()
[bio pending]                       → dirty page 0 [0, 4k]
                                    → dirty page 1 [4k, 8k]
                                  inode_unlock()
                                  // pages dirty, no lock

[bio completes]
iomap_dio_complete():
  kiocb_invalidate_post_direct_write()
    start = 4096 >> 12 = 1
    end = (8191) >> 12 = 1
    invalidate_inode_pages2_range(1, 1)
      → page 1 [4k,8k] is DIRTY
      → -EBUSY → errseq_set(-EIO)

Page index 1 corresponds to file offset [4k, 8k], which is exactly
Thread A's DIO range. The invalidation is not going beyond the DIO
range — the dirty page was placed there by Thread B's buffered
fallback, which wrote to [0, 8k] and dirtied the same page.

No large folio is needed; 4k pages and 4k blocks are sufficient.

From the user's perspective, when performing concurrent DIO on a
holed ext3 file, the file contents can become corrupted with some
probability. If the file is used as a loop device's backing file,
this manifests as filesystem corruption inside the loop device.
>
>> The critical window is the gap between ext4_buffered_write_iter()
>> dirtying
>> pages and filemap_write_and_wait_range() flushing them.  In this
>> window the
>> inode lock is not held, so another thread's async DIO completion is
>> free to
>> invalidate the still-dirty pages in the page cache.
>>
>> This race has always existed on ext3 because indirect-block inodes lack
>> unwritten-extent support.  However, the window was extremely narrow in
>> practice, because the old ext4_overwrite_io() checked every block and
>> would conservatively take an exclusive lock.  This patch replaced it
>> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
>> making unaligned DIO more likely to take a shared lock and
>> proportionally increasing the chance of hitting the race.
>>
>> I tried a couple of alternatives before settling on the patch below:
>>
>> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
>>     This closes the window for new DIO submissions, but does not protect
>>     against bio completions from previously submitted async DIO, which
>>     run independently of the inode lock.
>>
>> 2. Wrap the fallback dirty+flush+invalidate sequence in
>>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
>>     do not use this lock, so it would not serialise against DIO
>>     completions.
>>
>
> Could we add a call to inode_dio_wait() before falling back to buffered
> I/O? That is, in thread B, when falling back to buffered I/O, could we
> acquire the exclusive inode lock and then call inode_dio_wait() to wait
> for in-flight DIO to complete? This should close the race window. Since
> scenarios where DIO writes to holes on ext3 are relatively rare, the
> performance impact should be minimal (I suppose).
>
That's a great idea, thank you!

I had been trying to fix this on the DIO side and didn't consider
waiting from the buffered fallback path.

I've tested the approach locally and it closes the race; I'll add a
patch using it in the next version.
>> One straightforward approach that seems correct is to skip direct I/O
>> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
>>   {
>>          if (fsverity_active(inode))
>>                  return 0;
>> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> +               return 0;
>>          if (ext4_should_journal_data(inode))
>>                  return 0;
>>          if (ext4_has_inline_data(inode))
>>
>> With this, ext4_should_use_dio() returns false for no-extent inodes, and
>> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
>> DIO path entirely.  On ext3, DIO to a hole already falls back to
>> buffered
>> I/O, so there is essentially no performance benefit to using DIO in the
>> first place.
>>
>> Note that with this change, the fallback branch in
>> ext4_dio_write_iter():
>>
>>          if (ret >= 0 && iov_iter_count(from)) {
>>                  /* buffered fallback */
>>          }
>>
>> would also become dead code for extent-based inodes (since unwritten
>> extents guarantee iomap_dio_rw() never returns zero with unconsumed
>> data), and could be removed in a follow-up cleanup.
>>
>> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
>> I'm missing?
>>
>
> Hmm, this would also cause DIO to fall back to buffered I/O in common
> extending write cases, which I think would be unacceptable.
>

Fair point, the regression on extending writes is hard to justify.  That
said, until we had a better fix, I'd argue a behavioural change was
still preferable to potential data corruption. With the inode_dio_wait()
approach above, this trade-off goes away. 


Thanks,
Baokun



^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Zhang Yi @ 2026-06-17  8:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai, Brian Foster
In-Reply-To: <c2q54d6u724xctkzwm6x7sbmg5cpvcackfz3toc47qts6iaj77@ci2czn4fqjik>

On 6/16/2026 8:28 PM, Jan Kara wrote:
> On Mon 11-05-26 15:23:34, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
>> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
>> infrastructure for ext4.
>>
>> ext4_iomap_block_zero_range() calls iomap_zero_range() with
>> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
>> out either a mapped partial block or a dirty, unwritten partial block.
>>
>> Important constraints:
>>
>> Zeroing out under an active journal handle can cause deadlock, because
>> the order of acquiring the folio lock and starting a handle is
>> inconsistent with the iomap writeback path.
>>
>> Therefore, ext4_iomap_block_zero_range():
>> - Must NOT be called under an active handle.
>> - Cannot rely on data=ordered mode to ensure zeroed data persistence
>>   before updating i_disksize (for the cases of post-EOF append write,
>>   post-EOF fallocate, and truncate up). In subsequent patches, we will
>>   address this by synchronizing commit I/O but doesn't waiting for
>>   completion, and updating i_disksize to i_size only after the zeroed
>>   data has been written back.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 92 insertions(+)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index c6fe42d012fc..e0dae2501292 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>>  	return 0;
>>  }
>>  
>> +static int ext4_iomap_zero_begin(struct inode *inode,
>> +		loff_t offset, loff_t length, unsigned int flags,
>> +		struct iomap *iomap, struct iomap *srcmap)
>> +{
>> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> 
> This looks like a layering violation to me. I don't think you can safely
> assume the iomap you're passed is a part of iomap_iter...
> 
>> +	struct ext4_map_blocks map;
>> +	u8 blkbits = inode->i_blkbits;
>> +	unsigned int iomap_flags = 0;
>> +	int ret;
>> +
>> +	ret = ext4_emergency_state(inode->i_sb);
>> +	if (unlikely(ret))
>> +		return ret;
>> +
>> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
>> +		return -EINVAL;
>> +
>> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	/*
>> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
>> +	 * this bypasses the flush iomap uses to trigger extent conversion
>> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
>> +	 */
>> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
>> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
>> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
>> +
>> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
>> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
>> +			map.m_len = (start >> blkbits) - map.m_lblk;
>> +	}
> 
> ... and you need access to iter only for this which seems to be really a
> hack that's trying to outsmart the iomap code. I have to admit I don't
> fully understand what you are trying to achieve here. Are you trying to
> avoid flushing of the range that will be zeroed out?

This logic is copied from the XFS and iomap infrastructure. Its primary
purpose is to optimize the zeroing operations on dirty written extents.
It was introduced by Brian in [1].

The history as I understand it: originally, the iomap infrastructure
could not zero dirty unwritten extents during zero range processing,
which led to stale data exposure. XFS had to flush dirty ranges itself
before zeroing — a workaround that was not generic.

In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
extents"), Brian added an unconditional flush in the iomap
infrastructure, ensuring that by the time zeroing runs the extent has
already been converted to written so the zero can proceed correctly.
However, this flush was too heavy and introduced noticeable performance
overhead.

This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
conditional on unwritten mappings"), which restricts flushing to only
dirty pagecache over unwritten or hole mappings.

Brian later proposed a different approach: rather than relying on flush
to convert the extent type, find dirty folios ahead of the zero range
and zero the dirty unwritten extents directly. In [1] he added this
lookup logic. The filesystem now supplies a folio batch (a collection of
dirty folios) via the iomap begin callback, and zero range iterates over
these dirty folios to perform zeroing. Clean regions not covered by the
batch are simply skipped. This entirely eliminates the need to flush.

[1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/

If I understand correctly, the current approach is a compromise, and
Brian is still working on this. Perhaps ext4 and XFS could work together
on improvements in the future?

> 
>> +	ret = iomap_zero_range(inode, from, length, did_zero,
>> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
>> +			       NULL);
>> +	if (ret)
>> +		return ret;
>> +
>> +	/*
>> +	 * TODO: The iomap does not distinguish between different types of
>> +	 * zeroing and always sets zero_written if a zeroing operation is
>> +	 * performed, which may result in unnecessary order operations.
>> +	 */
> 
> Is this still true after your fix to did_zero handling?

Yeah. Currently, iomap_zero_range() can only report whether a zeroing
operation has occurred through did_zero parameter, but it cannot
distinguish whether the zeroed range is a written extent that already
exists on disk. That is, even if the zeroing is performed on a delalloc
extent, did_zero will still return true.

Thanks,
Yi.

> 
>> +	if (did_zero && zero_written)
>> +		*zero_written = *did_zero;
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Zeros out a mapping of length 'length' starting from file offset
>>   * 'from'.  The range to be zero'd must be contained with in one block.
> 
> 								Honza

^ permalink raw reply

* Re: [PATCH v7 3/4] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Zhou, Yun @ 2026-06-17  8:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, linux-kernel
In-Reply-To: <20260616151558.1728881-4-yun.zhou@windriver.com>

Hi Honza,
> Add ext4_put_ea_inode() which safely releases EA inode references:
> when SB_ACTIVE, it calls iput() directly (write_inode_now cannot be
> triggered); during mount (!SB_ACTIVE), it queues the inode on a per-sb
> lock-free llist and schedules a worker to call iput() in a clean
> context without holding any ext4 locks.
>
> Convert the iput in ext4_xattr_block_set()'s "Drop the previous xattr
> block" path to use ext4_xattr_inode_array_free_deferred(), which
> releases EA inodes via ext4_put_ea_inode().  This path previously called
> ext4_xattr_inode_array_free() (synchronous iput) while holding xattr_sem
> and a jbd2 handle.
>
> The worker is flushed in ext4_put_super() before journal destruction to
> ensure all pending EA inode cleanup completes while the journal is still
> available.
>
>   
> +static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
> +				struct ext4_xattr_inode_array *array)
> +{
> +	int idx;
> +
> +	if (array == NULL)
> +		return;
> +
> +	for (idx = 0; idx < array->count; ++idx)
> +		ext4_put_ea_inode(sb, array->inodes[idx]);
> +	kfree(array);
> +}
> +
> +struct ext4_ea_iput_entry {
> +	struct llist_node node;
> +	struct inode *inode;
> +};
> +
> +/*
> + * Worker function for deferred EA inode iput.  Processes all inodes queued
> + * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
> + */
> +void ext4_ea_inode_work(struct work_struct *work)
> +{
> +	struct ext4_sb_info *sbi = container_of(work, struct ext4_sb_info,
> +						s_ea_inode_work);
> +	struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
> +	struct llist_node *next;
> +
> +	while (node) {
> +		struct ext4_ea_iput_entry *entry = container_of(node,
> +				struct ext4_ea_iput_entry, node);
> +		next = node->next;
> +		iput(entry->inode);
> +		kfree(entry);
> +		node = next;
> +	}
> +}
> +
> +/*
> + * Release a VFS reference on an EA inode after ext4_xattr_inode_dec_ref()
> + * may have set i_nlink=0.  Must be used instead of iput() in any context
> + * where xattr_sem or a jbd2 handle is held, because eviction of a nlink=0
> + * inode can acquire those same locks.
> + *
> + * When SB_ACTIVE, eviction does not call write_inode_now() so direct
> + * iput() is safe.  During mount (!SB_ACTIVE), defer to a workqueue.
> + *
> + * For EA inode references dropped without a preceding dec_ref (e.g.,
> + * lookup-only paths where nlink remains >= 1), plain iput() is safe
> + * and preferred.
> + */
> +void ext4_put_ea_inode(struct super_block *sb, struct inode *inode)
> +{
> +	struct ext4_ea_iput_entry *entry;
> +
> +	if (!inode)
> +		return;
> +	if (sb->s_flags & SB_ACTIVE) {
> +		iput(inode);
> +		return;
> +	}
> +	entry = kmalloc(sizeof(*entry), GFP_NOFS | __GFP_NOFAIL);
> +	entry->inode = inode;
> +	llist_add(&entry->node, &EXT4_SB(sb)->s_ea_inode_to_free);
> +	schedule_work(&EXT4_SB(sb)->s_ea_inode_work);
> +}
> +
>
Could you help me check if this is the way you expected?

Thanks,
Yun

^ permalink raw reply

* Re: [PATCH RFC 2/8] fs: add a global device to super block hash table
From: Christian Brauner @ 2026-06-17  9:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Jens Axboe, Alexander Viro, linux-block, linux-kernel,
	linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
	David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
	Gao Xiang, linux-erofs
In-Reply-To: <20260617062523.GA20041@lst.de>

> No, we don't need a secondary device number to sb mapping.  On the other
> hand we do need the deviceloss, freeze etc upcalls to work for owners
> that are not file systems like mdraid or dm, even if they have been
> slow to pick this.  The whole idea of the holder ops is to abstract
> away from who holds it instead of adding back the broken hard coding
> of the superblock.  Otherwise you're just badly reinventing get_super.

No, the expanded version works for all device numbers. There's also
no-hardcoding. And non-fs users may do whatever they want with their
holder ops ofc. erofs always had the non 1:1 relationship between
devices and filesystems and for that case it seems sane. I'm happy to
let the series sit for a bit to gather input and do the security
mediation patches first. The series are complementary.

^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Jan Kara @ 2026-06-17 10:50 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, ojaswin, ritesh.list, djwong, hch,
	yi.zhang, yizhang089, yangerkun, yukuai, Brian Foster
In-Reply-To: <16cccb83-cdad-4113-8182-e8ea9e3049a2@huaweicloud.com>

On Wed 17-06-26 16:14:40, Zhang Yi wrote:
> On 6/16/2026 8:28 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> >> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> >> infrastructure for ext4.
> >>
> >> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> >> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> >> out either a mapped partial block or a dirty, unwritten partial block.
> >>
> >> Important constraints:
> >>
> >> Zeroing out under an active journal handle can cause deadlock, because
> >> the order of acquiring the folio lock and starting a handle is
> >> inconsistent with the iomap writeback path.
> >>
> >> Therefore, ext4_iomap_block_zero_range():
> >> - Must NOT be called under an active handle.
> >> - Cannot rely on data=ordered mode to ensure zeroed data persistence
> >>   before updating i_disksize (for the cases of post-EOF append write,
> >>   post-EOF fallocate, and truncate up). In subsequent patches, we will
> >>   address this by synchronizing commit I/O but doesn't waiting for
> >>   completion, and updating i_disksize to i_size only after the zeroed
> >>   data has been written back.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 92 insertions(+)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c6fe42d012fc..e0dae2501292 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>  
> >> +static int ext4_iomap_zero_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> > 
> > This looks like a layering violation to me. I don't think you can safely
> > assume the iomap you're passed is a part of iomap_iter...
> > 
> >> +	struct ext4_map_blocks map;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	unsigned int iomap_flags = 0;
> >> +	int ret;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> >> +		return -EINVAL;
> >> +
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >> +	 */
> >> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> >> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> >> +
> >> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> >> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> >> +			map.m_len = (start >> blkbits) - map.m_lblk;
> >> +	}
> > 
> > ... and you need access to iter only for this which seems to be really a
> > hack that's trying to outsmart the iomap code. I have to admit I don't
> > fully understand what you are trying to achieve here. Are you trying to
> > avoid flushing of the range that will be zeroed out?
> 
> This logic is copied from the XFS and iomap infrastructure. Its primary
> purpose is to optimize the zeroing operations on dirty written extents.
> It was introduced by Brian in [1].

Ah, I see. I still find it hacky but apparently it is an established hack
in iomap :). Fair.

> The history as I understand it: originally, the iomap infrastructure
> could not zero dirty unwritten extents during zero range processing,
> which led to stale data exposure. XFS had to flush dirty ranges itself
> before zeroing — a workaround that was not generic.
> 
> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
> extents"), Brian added an unconditional flush in the iomap
> infrastructure, ensuring that by the time zeroing runs the extent has
> already been converted to written so the zero can proceed correctly.
> However, this flush was too heavy and introduced noticeable performance
> overhead.
> 
> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
> conditional on unwritten mappings"), which restricts flushing to only
> dirty pagecache over unwritten or hole mappings.
> 
> Brian later proposed a different approach: rather than relying on flush
> to convert the extent type, find dirty folios ahead of the zero range
> and zero the dirty unwritten extents directly. In [1] he added this
> lookup logic. The filesystem now supplies a folio batch (a collection of
> dirty folios) via the iomap begin callback, and zero range iterates over
> these dirty folios to perform zeroing. Clean regions not covered by the
> batch are simply skipped. This entirely eliminates the need to flush.
> 
> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/

Thanks for the summary! So I was confused because somehow I thought this is
about fallocate(FALLOC_FL_ZERO_RANGE) and so I was wondering why we just
cannot evict the page cache and be done with that. Only after reading
everything again I've realized this is about zeroing partial blocks on hole
punch etc. And we may need to really handle multiple folios because XFS
also uses this mechanism to implement FALLOC_FL_ZERO_RANGE for zoned
storage. Ugh. OK, anyway for now this looks like your patch is following
how things are expected to be done so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> >> +	/*
> >> +	 * TODO: The iomap does not distinguish between different types of
> >> +	 * zeroing and always sets zero_written if a zeroing operation is
> >> +	 * performed, which may result in unnecessary order operations.
> >> +	 */
> > 
> > Is this still true after your fix to did_zero handling?
> 
> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
> operation has occurred through did_zero parameter, but it cannot
> distinguish whether the zeroed range is a written extent that already
> exists on disk. That is, even if the zeroing is performed on a delalloc
> extent, did_zero will still return true.

So maybe write in the comment explicitely, that this may result in
unnecessary flushing of folios if zeroing happened in
delayed-not-yet-allocated blocks?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Zhang Yi @ 2026-06-17 10:54 UTC (permalink / raw)
  To: Baokun Li, Zhang Yi
  Cc: linux-ext4, tytso, adilger.kernel, jack, ojaswin, ritesh.list,
	peng_wang
In-Reply-To: <4f4800dc-0a2c-48e2-9535-f32b21628bdb@linux.alibaba.com>

On 6/17/2026 3:52 PM, Baokun Li wrote:
> On 2026/6/17 10:45, Zhang Yi wrote:
>> Hi, Baokun!
>>
>> On 6/16/2026 9:10 PM, Baokun Li wrote:
>>> Hi all,
>>>
>>> Thank you for your review!
>>>
>>> After extensive testing, I found that after merging this patch,
>>> generic/746
>>> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The
>>> test
>>> triggers a "Page cache invalidation failure on direct I/O" warning, and
>>> subsequent fsync returns -EIO.
>>>
>>> The underlying race existed before this patch, but this patch appears to
>>> have widened the reproduction window considerably, so I thought it worth
>>> trying to address.  Here is my analysis:
>>>
>>> On no-extent inodes, DIO writes that hit holes cannot use unwritten
>>> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
>>> returns 0 for a hole, and:
>>>
>>>          if (!m_flags && !ret)
>>>                  ret = -ENOTBLK;
>>>
>>> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
>>> I/O.  The fallback path dirties pages in the page cache, then flushes
>>> and invalidates them.  However, concurrent async DIO completions to
>>> other blocks on the same inode can run
>>> kiocb_invalidate_post_direct_write()
>>> without holding the inode lock.
>>>
>>> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
>>> to the written extent, while thread B does DIO spanning both extents:
>>>
>>>    kworker A (4k DIO, allocated block)    kworker B (8k DIO,
>>> hole->fallback)
>>>    -----------------------------------   
>>> -----------------------------------
>>>    inode_lock_shared()                    inode_lock_shared()
>>>    iomap_dio_rw():                        iomap_dio_rw():
>>>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
>>>      submit_bio (async)                     dio->size = 0
>>>    inode_unlock_shared()                  inode_unlock_shared()
>>>
>>>    [bio pending in block layer]           /* fallback: inode lock
>>> released */
>>>                                           ext4_buffered_write_iter()
>>>                                             inode_lock(exclusive)
>>>                                             generic_perform_write()
>>>                                               -> dirty pages [0, 8k]
>>>                                             inode_unlock(exclusive)
>>>
>>>                                           /* pages still dirty here */
>>>    [bio completes]                        filemap_write_and_wait_range()
>>>    iomap_dio_complete()                     -> flush dirty pages
>>>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
>>>        invalidate_inode_pages2_range()
>>>        -> finds dirty page!               /* window closed */
>>>        -> dio_warn_stale_pagecache()
>>>        -> errseq_set(-EIO)
>>>
>>
>> It looks like this issue occurs when invalidate_inode_pages2_range()
>> checks beyond the DIO write range, which may only happen when folio size
>> is larger than block size. Is that correct?
> Thanks for looking at this!
> 
> Not quite — the scenario involves an 8k file with layout
> 
>  [hole at 0-4k] [written extent at 4k-8k]
> 
> and two DIO threads. Thread A does a 4k DIO write at offset 4k; since
> the target block is a written extent, no fallback occurs. Thread B
> does an 8k DIO write at offset 0; since blocks 0-4k are a hole on an
> indirect-block inode and ext3 does not support unwritten extents,
> iomap returns -ENOTBLK and the entire 8k write falls back to buffered
> I/O.
> 
> Normally the kernel would prevent concurrent BIO and DIO to
> overlapping ranges on the same file. But because Thread A holds only
> a shared inode lock (pure overwrite on a written extent), and
> Thread B's DIO has already returned -ENOTBLK before its buffered
> fallback begins, both paths can proceed concurrently:
> 
> 
> Thread A: 4k DIO at offset 4k     Thread B: 8k DIO at offset 0
> ─────────────────────────────     ─────────────────────────────
> kiocb_invalidate_pages            iomap_begin → -ENOTBLK
>   → page index 1 already clean      (indirect inode hole, m_flags=0)
> submit_bio (async)                dio->size = 0
> inode_unlock_shared()             inode_unlock_shared()
> 
>                                   ext4_buffered_write_iter()
> [bio pending]                       → dirty page 0 [0, 4k]
>                                     → dirty page 1 [4k, 8k]
>                                   inode_unlock()
>                                   // pages dirty, no lock
> 
> [bio completes]
> iomap_dio_complete():
>   kiocb_invalidate_post_direct_write()
>     start = 4096 >> 12 = 1
>     end = (8191) >> 12 = 1
>     invalidate_inode_pages2_range(1, 1)
>       → page 1 [4k,8k] is DIRTY
>       → -EBUSY → errseq_set(-EIO)
> 
> Page index 1 corresponds to file offset [4k, 8k], which is exactly
> Thread A's DIO range. The invalidation is not going beyond the DIO
> range — the dirty page was placed there by Thread B's buffered
> fallback, which wrote to [0, 8k] and dirtied the same page.
> 
> No large folio is needed; 4k pages and 4k blocks are sufficient.
> 
> From the user's perspective, when performing concurrent DIO on a
> holed ext3 file, the file contents can become corrupted with some
> probability. If the file is used as a loop device's backing file,
> this manifests as filesystem corruption inside the loop device.

Ha, fair enough. Thanks for the details.

Cheers,
Yi.


^ permalink raw reply

* Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
From: Brian Foster @ 2026-06-17 10:56 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
	adilger.kernel, libaokun, ojaswin, ritesh.list, djwong, hch,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <16cccb83-cdad-4113-8182-e8ea9e3049a2@huaweicloud.com>

On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
> On 6/16/2026 8:28 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> >> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> >> infrastructure for ext4.
> >>
> >> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> >> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> >> out either a mapped partial block or a dirty, unwritten partial block.
> >>
> >> Important constraints:
> >>
> >> Zeroing out under an active journal handle can cause deadlock, because
> >> the order of acquiring the folio lock and starting a handle is
> >> inconsistent with the iomap writeback path.
> >>
> >> Therefore, ext4_iomap_block_zero_range():
> >> - Must NOT be called under an active handle.
> >> - Cannot rely on data=ordered mode to ensure zeroed data persistence
> >>   before updating i_disksize (for the cases of post-EOF append write,
> >>   post-EOF fallocate, and truncate up). In subsequent patches, we will
> >>   address this by synchronizing commit I/O but doesn't waiting for
> >>   completion, and updating i_disksize to i_size only after the zeroed
> >>   data has been written back.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 92 insertions(+)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c6fe42d012fc..e0dae2501292 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>  
> >> +static int ext4_iomap_zero_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> > 
> > This looks like a layering violation to me. I don't think you can safely
> > assume the iomap you're passed is a part of iomap_iter...
> > 
> >> +	struct ext4_map_blocks map;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	unsigned int iomap_flags = 0;
> >> +	int ret;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> >> +		return -EINVAL;
> >> +
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >> +	 */
> >> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> >> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> >> +
> >> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> >> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> >> +			map.m_len = (start >> blkbits) - map.m_lblk;
> >> +	}
> > 
> > ... and you need access to iter only for this which seems to be really a
> > hack that's trying to outsmart the iomap code. I have to admit I don't
> > fully understand what you are trying to achieve here. Are you trying to
> > avoid flushing of the range that will be zeroed out?
> 
> This logic is copied from the XFS and iomap infrastructure. Its primary
> purpose is to optimize the zeroing operations on dirty written extents.
> It was introduced by Brian in [1].
> 
> The history as I understand it: originally, the iomap infrastructure
> could not zero dirty unwritten extents during zero range processing,
> which led to stale data exposure. XFS had to flush dirty ranges itself
> before zeroing — a workaround that was not generic.
> 
> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
> extents"), Brian added an unconditional flush in the iomap
> infrastructure, ensuring that by the time zeroing runs the extent has
> already been converted to written so the zero can proceed correctly.
> However, this flush was too heavy and introduced noticeable performance
> overhead.
> 
> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
> conditional on unwritten mappings"), which restricts flushing to only
> dirty pagecache over unwritten or hole mappings.
> 
> Brian later proposed a different approach: rather than relying on flush
> to convert the extent type, find dirty folios ahead of the zero range
> and zero the dirty unwritten extents directly. In [1] he added this
> lookup logic. The filesystem now supplies a folio batch (a collection of
> dirty folios) via the iomap begin callback, and zero range iterates over
> these dirty folios to perform zeroing. Clean regions not covered by the
> batch are simply skipped. This entirely eliminates the need to flush.
> 
> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
> 
> If I understand correctly, the current approach is a compromise, and
> Brian is still working on this. Perhaps ext4 and XFS could work together
> on improvements in the future?
> 

I think that about covers it!

I do agree wrt to the iomap_iter thing in that it doesn't seem like the
most elegant thing. I considered that a bit of a roadblock when first
hacking on the batch stuff, but IIRC somebody pointed out that there was
precedent already so I didn't think too hard about it after that. Indeed
if you poke around, other filesystems use a similar pattern to access
iter->private for whatever private context is carried around.

FWIW, one of the longer term thoughts for the dirty folio stuff was to
eventually lift it out of the callback and just have iomap do it for the
fs. That would eliminate this particular pattern and probably clean
things up a bit, but there were also some other caveats with that that
aren't top of mind atm (IIRC, things like dealing with map trimming,
etc., but I haven't had a chance to think about it in a while).

Also note that this isn't necessarily a hard requirement. It's an
optional optimization. iomap will flush and retry in the dirty
pagecache+unwritten extent case if the fs hasn't otherwised provided
folios to make sure it zeroes properly, it's just that performance of
that may or may not be acceptable for your use case.

Brian

> > 
> >> +	ret = iomap_zero_range(inode, from, length, did_zero,
> >> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> >> +			       NULL);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * TODO: The iomap does not distinguish between different types of
> >> +	 * zeroing and always sets zero_written if a zeroing operation is
> >> +	 * performed, which may result in unnecessary order operations.
> >> +	 */
> > 
> > Is this still true after your fix to did_zero handling?
> 
> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
> operation has occurred through did_zero parameter, but it cannot
> distinguish whether the zeroed range is a written extent that already
> exists on disk. That is, even if the zeroing is performed on a delalloc
> extent, did_zero will still return true.
> 
> Thanks,
> Yi.
> 
> > 
> >> +	if (did_zero && zero_written)
> >> +		*zero_written = *did_zero;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  /*
> >>   * Zeros out a mapping of length 'length' starting from file offset
> >>   * 'from'.  The range to be zero'd must be contained with in one block.
> > 
> > 								Honza
> 


^ permalink raw reply

* Re: [PATCH 2/2] ext4: base unaligned DIO lock decision on partial block zeroing
From: Jan Kara @ 2026-06-17 11:08 UTC (permalink / raw)
  To: Baokun Li
  Cc: Zhang Yi, linux-ext4, tytso, adilger.kernel, jack, yi.zhang,
	ojaswin, ritesh.list, peng_wang
In-Reply-To: <4f4800dc-0a2c-48e2-9535-f32b21628bdb@linux.alibaba.com>

On Wed 17-06-26 15:52:24, Baokun Li wrote:
> On 2026/6/17 10:45, Zhang Yi wrote:
> > On 6/16/2026 9:10 PM, Baokun Li wrote:
> >> Thank you for your review!
> >>
> >> After extensive testing, I found that after merging this patch,
> >> generic/746
> >> started failing intermittently on ext3 (mkfs.ext4 -O ^extents).  The
> >> test
> >> triggers a "Page cache invalidation failure on direct I/O" warning, and
> >> subsequent fsync returns -EIO.
> >>
> >> The underlying race existed before this patch, but this patch appears to
> >> have widened the reproduction window considerably, so I thought it worth
> >> trying to address.  Here is my analysis:
> >>
> >> On no-extent inodes, DIO writes that hit holes cannot use unwritten
> >> extents.  ext4_iomap_alloc() leaves m_flags=0, so ext4_map_blocks()
> >> returns 0 for a hole, and:
> >>
> >>          if (!m_flags && !ret)
> >>                  ret = -ENOTBLK;
> >>
> >> The iomap layer returns -ENOTBLK to ext4, which falls back to buffered
> >> I/O.  The fallback path dirties pages in the page cache, then flushes
> >> and invalidates them.  However, concurrent async DIO completions to
> >> other blocks on the same inode can run
> >> kiocb_invalidate_post_direct_write()
> >> without holding the inode lock.
> >>
> >> Consider a file with two 4k extents: [hole][written].  Thread A does DIO
> >> to the written extent, while thread B does DIO spanning both extents:
> >>
> >>    kworker A (4k DIO, allocated block)    kworker B (8k DIO,
> >> hole->fallback)
> >>    -----------------------------------   
> >> -----------------------------------
> >>    inode_lock_shared()                    inode_lock_shared()
> >>    iomap_dio_rw():                        iomap_dio_rw():
> >>      kiocb_invalidate_pages -> clean        iomap_begin -> -ENOTBLK
> >>      submit_bio (async)                     dio->size = 0
> >>    inode_unlock_shared()                  inode_unlock_shared()
> >>
> >>    [bio pending in block layer]           /* fallback: inode lock
> >> released */
> >>                                           ext4_buffered_write_iter()
> >>                                             inode_lock(exclusive)
> >>                                             generic_perform_write()
> >>                                               -> dirty pages [0, 8k]
> >>                                             inode_unlock(exclusive)
> >>
> >>                                           /* pages still dirty here */
> >>    [bio completes]                        filemap_write_and_wait_range()
> >>    iomap_dio_complete()                     -> flush dirty pages
> >>      kiocb_invalidate_post_direct_write() invalidate_mapping_pages()
> >>        invalidate_inode_pages2_range()
> >>        -> finds dirty page!               /* window closed */
> >>        -> dio_warn_stale_pagecache()
> >>        -> errseq_set(-EIO)
> >>
> >
> > It looks like this issue occurs when invalidate_inode_pages2_range()
> > checks beyond the DIO write range, which may only happen when folio size
> > is larger than block size. Is that correct?
> Thanks for looking at this!
> 
> Not quite — the scenario involves an 8k file with layout
> 
>  [hole at 0-4k] [written extent at 4k-8k]
> 
> and two DIO threads. Thread A does a 4k DIO write at offset 4k; since
> the target block is a written extent, no fallback occurs. Thread B
> does an 8k DIO write at offset 0; since blocks 0-4k are a hole on an
> indirect-block inode and ext3 does not support unwritten extents,
> iomap returns -ENOTBLK and the entire 8k write falls back to buffered
> I/O.

Right, but for this to happen userspace had to submit two overlapping
direct IO writes. This always had undefined behavior so some inconsistent
content in the file is more or less acceptable. But as Zhang pointed out,
the same failure can also appear when block_size < folio_size and there we
should really strive to provide consistent data.

> >> The critical window is the gap between ext4_buffered_write_iter()
> >> dirtying
> >> pages and filemap_write_and_wait_range() flushing them.  In this
> >> window the
> >> inode lock is not held, so another thread's async DIO completion is
> >> free to
> >> invalidate the still-dirty pages in the page cache.
> >>
> >> This race has always existed on ext3 because indirect-block inodes lack
> >> unwritten-extent support.  However, the window was extremely narrow in
> >> practice, because the old ext4_overwrite_io() checked every block and
> >> would conservatively take an exclusive lock.  This patch replaced it
> >> with ext4_dio_needs_zeroing(), which only checks head and tail blocks,
> >> making unaligned DIO more likely to take a shared lock and
> >> proportionally increasing the chance of hitting the race.
> >>
> >> I tried a couple of alternatives before settling on the patch below:
> >>
> >> 1. Force exclusive lock + IOMAP_DIO_FORCE_WAIT for all no-extent DIO.
> >>     This closes the window for new DIO submissions, but does not protect
> >>     against bio completions from previously submitted async DIO, which
> >>     run independently of the inode lock.
> >>
> >> 2. Wrap the fallback dirty+flush+invalidate sequence in
> >>     filemap_invalidate_lock().  However, the ext4 DIO and iomap layers
> >>     do not use this lock, so it would not serialise against DIO
> >>     completions.
> >>
> >
> > Could we add a call to inode_dio_wait() before falling back to buffered
> > I/O? That is, in thread B, when falling back to buffered I/O, could we
> > acquire the exclusive inode lock and then call inode_dio_wait() to wait
> > for in-flight DIO to complete? This should close the race window. Since
> > scenarios where DIO writes to holes on ext3 are relatively rare, the
> > performance impact should be minimal (I suppose).
> >
> That's a great idea, thank you!
> 
> I had been trying to fix this on the DIO side and didn't consider
> waiting from the buffered fallback path.
> 
> I've tested the approach locally and it closes the race; I'll add a
> patch using it in the next version.

Yes, this looks like the best solution so far. The fallback doesn't have to
be fast. It was always - you are doing something stupid and we try to fixup
for you - kind of thing and bad performance is acceptable in that case.

> >> One straightforward approach that seems correct is to skip direct I/O
> >> for no-extent inodes entirely, by returning 0 from ext4_dio_alignment():
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -6131,6 +6131,8 @@ u32 ext4_dio_alignment(struct inode *inode)
> >>   {
> >>          if (fsverity_active(inode))
> >>                  return 0;
> >> +       if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> +               return 0;
> >>          if (ext4_should_journal_data(inode))
> >>                  return 0;
> >>          if (ext4_has_inline_data(inode))
> >>
> >> With this, ext4_should_use_dio() returns false for no-extent inodes, and
> >> all I/O goes through ext4_buffered_write_iter() directly, bypassing the
> >> DIO path entirely.  On ext3, DIO to a hole already falls back to
> >> buffered
> >> I/O, so there is essentially no performance benefit to using DIO in the
> >> first place.
> >>
> >> Note that with this change, the fallback branch in
> >> ext4_dio_write_iter():
> >>
> >>          if (ret >= 0 && iov_iter_count(from)) {
> >>                  /* buffered fallback */
> >>          }
> >>
> >> would also become dead code for extent-based inodes (since unwritten
> >> extents guarantee iomap_dio_rw() never returns zero with unconsumed
> >> data), and could be removed in a follow-up cleanup.
> >>
> >> Thoughts?  Is there a reason to preserve DIO on no-extent inodes that
> >> I'm missing?
> >>
> >
> > Hmm, this would also cause DIO to fall back to buffered I/O in common
> > extending write cases, which I think would be unacceptable.
> 
> Fair point, the regression on extending writes is hard to justify.  That
> said, until we had a better fix, I'd argue a behavioural change was
> still preferable to potential data corruption. With the inode_dio_wait()
> approach above, this trade-off goes away. 

But heavily regressing performance for overwrites or extending DIO writes
even on indirect block based files is not really acceptable. There are
still users who for whatever reasons stay with old filesystems having
indirect block based files and they'd likely notice the regression.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v7 0/11] fstests: add test coverage for cloned filesystem ids
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong

v7:
. 803, 806: Trimmed down the UUID checks to only what is required, mountinfo
   and libblkid.
. 802: Dropped the unnecessary echo statements previously used for logical
   flow in the golden output. On second thought, it looks fine without them.
. Swapped _fixed_by_kernel_commit for _fixed_by_fs_commit.
. _clone_mount_option(): Now echoes directly from the case block itself.
. _require_unique_f_fsid(): add the link to the ref. discussions.

v6:
  https://lore.kernel.org/fstests/cover.1779939330.git.asj@kernel.org

v5:
  https://lore.kernel.org/fstests/cover.1779367627.git.asj@kernel.org

v4:
  https://lore.kernel.org/fstests/cover.1777357320.git.asj@kernel.org

v3:
  https://lore.kernel.org/fstests/cover.1777281778.git.asj@kernel.org

v2:
  https://lore.kernel.org/fstests/cover.1774090817.git.asj@kernel.org

v1:
  https://lore.kernel.org/fstests/cover.1772095513.git.asj@kernel.org

This series adds fstests infrastructure and test cases to verify correct
filesystem identity when a filesystem is cloned (block-level copy).
Test covers inotify, fanotify, f_fsid, libblkid, IMA, exportfs file handles
and libblkid tools verify with metadata_uuid.
  New helpers:
   _loop_image_create_clone() and _loop_image_destroy() to help create fs and clone
   _clone_mount_option() helper to apply per-filesystem clone mount options
   _change_metadata_uuid() changes the UUID before the clone

  New tests:
  - fanotify events are isolated between cloned filesystems
  - f_fsid is unique across cloned filesystem instances
  - libblkid correctly resolves duplicate UUIDs to distinct devices
    with and without metadata_uuid
  - IMA distinct identity for each cloned filesystem
  - exportfs file handles resolve correctly on cloned filesystems

Kernel Patches:
  Requires Btrfs kernel patches for all tests to pass.
   [1] https://lore.kernel.org/linux-btrfs/cover.1777281686.git.asj@kernel.org

Anand Jain (11):
  fstests: add _loop_image_create_clone() helper
  fstests: add _clone_mount_option() helper
  fstests: add FSNOTIFYWAIT_PROG
  fstests: add _require_unique_f_fsid() helper
  fstests: verify fanotify isolation on cloned filesystems
  fstests: verify f_fsid for cloned filesystems
  fstests: verify libblkid resolution of duplicate UUIDs
  fstests: verify IMA isolation on cloned filesystems
  fstests: verify exportfs file handles on cloned filesystems
  fstests: add _change_metadata_uuid helper
  fstests: test UUID consistency for clones with metadata_uuid

 common/config         |   1 +
 common/rc             | 120 +++++++++++++++++++++++++++++++++++++
 tests/generic/801     | 135 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/801.out |   7 +++
 tests/generic/802     |  64 ++++++++++++++++++++
 tests/generic/802.out |   4 ++
 tests/generic/803     |  72 ++++++++++++++++++++++
 tests/generic/803.out |   6 ++
 tests/generic/804     | 108 +++++++++++++++++++++++++++++++++
 tests/generic/804.out |  10 ++++
 tests/generic/805     |  80 +++++++++++++++++++++++++
 tests/generic/805.out |   2 +
 tests/generic/806     |  74 +++++++++++++++++++++++
 tests/generic/806.out |   6 ++
 14 files changed, 689 insertions(+)
 create mode 100644 tests/generic/801
 create mode 100644 tests/generic/801.out
 create mode 100644 tests/generic/802
 create mode 100644 tests/generic/802.out
 create mode 100644 tests/generic/803
 create mode 100644 tests/generic/803.out
 create mode 100644 tests/generic/804
 create mode 100644 tests/generic/804.out
 create mode 100644 tests/generic/805
 create mode 100644 tests/generic/805.out
 create mode 100644 tests/generic/806
 create mode 100644 tests/generic/806.out

-- 
2.43.0


^ permalink raw reply

* [PATCH v7 01/11] fstests: add _loop_image_create_clone() helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Introduce _loop_image_create_clone() and _loop_image_destroy() to mkfs an
image file and clone it to another image file, and attach a loop device to
them. And its destroy part.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 common/rc | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/common/rc b/common/rc
index 79189e7e6e94..d7e3e0bdfb1e 100644
--- a/common/rc
+++ b/common/rc
@@ -1520,6 +1520,69 @@ _scratch_resvblks()
 	esac
 }
 
+# Create a small loop image, run an optional tuning function ($2) on it,
+# clone it, and attach both to loop devices, returned in ($1).
+# Args:
+#   $1: Nameref to return the array of allocated loop devices [base, clone].
+#   $2: Optional callback function to tune the base filesystem before cloning.
+_loop_image_create_clone()
+{
+	local -n _ret=$1
+	local pre_clone_tune_func="$2"
+	local img_file=$TEST_DIR/${seq}.img
+	local img_file_clone=$TEST_DIR/${seq}_clone.img
+	local size=$(_small_fs_size_mb 128) # Smallest possible
+	local loop_devs
+
+	# Since we copy the block device image, we keep its size small.
+	_require_fs_space $TEST_DIR $((size * 1024))
+
+	_create_file_sized $((size * 1024 * 1024)) $img_file ||
+				_fail "Failed: Create $img_file $size"
+
+	loop_devs=$(_create_loop_device $img_file)
+	_ret=($loop_devs)
+
+	case $FSTYP in
+	xfs)
+		_mkfs_dev "-s size=4096" ${loop_devs[0]}
+		;;
+	btrfs)
+		_mkfs_dev ${loop_devs[0]}
+		;;
+	*)
+		_mkfs_dev ${loop_devs[0]}
+		;;
+	esac
+
+	# Only execute if the function argument is not empty
+	if [ -n "$pre_clone_tune_func" ]; then
+		$pre_clone_tune_func ${loop_devs[0]}
+	fi
+
+	sync ${loop_devs[0]}
+	cp $img_file $img_file_clone
+
+	loop_devs="$loop_devs $(_create_loop_device $img_file_clone)"
+
+	_ret=($loop_devs)
+}
+
+# Teardown loop devices and delete their underlying backing image files.
+# Accepts a list of loop device paths (e.g., /dev/loop0 /dev/loop1).
+_loop_image_destroy()
+{
+	for d in "$@"; do
+		# Retrieve the path of the backing file
+		local f=$(losetup --noheadings --output BACK-FILE $d)
+
+		# Detach the loop device from the backing file
+		_destroy_loop_device "$d"
+
+		# Clean up the backing disk image file
+		[ -n "$f" ] && rm -f "$f"
+	done
+}
 
 # Repair scratch filesystem.  Returns 0 if the FS is good to go (either no
 # errors found or errors were fixed) and nonzero otherwise; also spits out
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 02/11] fstests: add _clone_mount_option() helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Adds _clone_mount_option() helper function to handle filesystem-specific
requirements for mounting cloned devices. Abstract the need for -o nouuid
on XFS.

Signed-off-by: Anand Jain <asj@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/common/rc b/common/rc
index d7e3e0bdfb1e..968ba33686f3 100644
--- a/common/rc
+++ b/common/rc
@@ -414,6 +414,19 @@ _scratch_mount_options()
 					$SCRATCH_DEV $SCRATCH_MNT
 }
 
+# Return filesystem-specific mount options required for mounting clone/snapshot
+# devices.
+_clone_mount_option()
+{
+	case "$FSTYP" in
+	xfs)
+		# Allow mounting a duplicate filesystem on the same host
+		echo "-o nouuid"
+		;;
+	*)
+	esac
+}
+
 _supports_filetype()
 {
 	local dir=$1
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 03/11] fstests: add FSNOTIFYWAIT_PROG
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Define `FSNOTIFYWAIT_PROG` for an upcoming test case that uses `fsnotifywait`.

Signed-off-by: Anand Jain <asj@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/common/config b/common/config
index d5299d5b926f..5661fa0ec310 100644
--- a/common/config
+++ b/common/config
@@ -242,6 +242,7 @@ export BTRFS_MAP_LOGICAL_PROG=$(type -P btrfs-map-logical)
 export PARTED_PROG="$(type -P parted)"
 export XFS_PROPERTY_PROG="$(type -P xfs_property)"
 export FSCRYPTCTL_PROG="$(type -P fscryptctl)"
+export FSNOTIFYWAIT_PROG="$(type -P fsnotifywait)"
 
 # udev wait functions.
 #
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 04/11] fstests: add _require_unique_f_fsid() helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Add a helper to check if the target filesystem supports unique f_fsid
tracking across cloned or snapshot instances.

Certain filesystems like XFS, Btrfs, and F2FS ensure unique f_fsid
identifiers per filesystem instance. However, Ext4 derives its f_fsid
directly from its superblock UUID, which leads to identical f_fsid
values on cloned images until the UUID is manually modified by userspace.

Introduce _require_unique_f_fsid() to allow test cases requiring strict
f_fsid uniqueness to skip gracefully on unsupported filesystems.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 common/rc | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/common/rc b/common/rc
index 968ba33686f3..d95eec94f7b7 100644
--- a/common/rc
+++ b/common/rc
@@ -6310,6 +6310,27 @@ _require_fanotify_ioerrors()
 	_notrun "$FSTYP does not support fanotify ioerrors"
 }
 
+# Ext4 derives f_fsid from the superblock UUID, meaning clones share the
+# same f_fsid until their UUIDs diverge. Conversely, XFS, Btrfs,
+# and F2FS ensure f_fsid remains unique per filesystem instance (often by
+# deriving it from the UUID and underlying block device.)
+#
+# Across all filesystems, a UUID collision causes libblkid tools to return
+# non-deterministic device mappings. It is ultimately the responsibility
+# of the userspace utility or use-case to enforce uniqueness when a clone
+# diverges. For details, see mailing list thread discussions:
+#   Link: https://lore.kernel.org/linux-ext4/20260409131238.GC18443@macsyma-wired.lan/
+_require_unique_f_fsid()
+{
+	# Skip the test if the filesystem does not enforce unique f_fsids
+	# natively. Checking this dynamically requires recreating a clone
+	# layout, so we use a static lookup based on FSTYP.
+	if [ "$FSTYP" == "ext4" ]; then
+		_notrun "Target filesystem ($FSTYP) does not guarantee unique f_fsid on clones."
+	fi
+}
+
+
 # Computes a percentage of the available space in a filesystem and
 # returns that quantity in MB. The percentage must not contain a percent
 # sign ("%").
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 05/11] fstests: verify fanotify isolation on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify that fanotify events are correctly routed to the appropriate
watcher when cloned filesystems are mounted.
Helps verify kernel's event notification distinguishes between devices
sharing the same FSID/UUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/801     | 135 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/801.out |   7 +++
 2 files changed, 142 insertions(+)
 create mode 100644 tests/generic/801
 create mode 100644 tests/generic/801.out

diff --git a/tests/generic/801 b/tests/generic/801
new file mode 100644
index 000000000000..3bfb87d41922
--- /dev/null
+++ b/tests/generic/801
@@ -0,0 +1,135 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 801
+# Verify fanotify FID functionality on cloned filesystems by setting up
+# watchers and making sure notifications are in the correct logs files.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+_require_command "$FSNOTIFYWAIT_PROG" fsnotifywait
+_require_unique_f_fsid
+
+_cleanup()
+{
+	cd /
+	[[ -n $pid1 ]] && { kill -TERM "$pid1" 2> /dev/null; wait $pid1; }
+	[[ -n $pid2 ]] && { kill -TERM "$pid2" 2> /dev/null; wait $pid2; }
+
+	if [ "$semanage_added" = "yes" ]; then
+		semanage permissive -d unconfined_t >/dev/null 2>&1 || true
+	fi
+
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+	rm -r -f $tmp.*
+}
+
+# Run fsnotifywait in unbuffered mode to watch filesystem-wide create events
+monitor_fanotify()
+{
+	local mmnt=$1
+	exec stdbuf -oL $FSNOTIFYWAIT_PROG -m -F -S -e create "$mmnt" 2>&1
+}
+
+# Transform f_fsid into the hi.lo format used in fanotify FID logs
+fsid_to_fid_parts()
+{
+	local fsid=$1
+	# Pad to 16 hex chars (64-bit), then split into two 32-bit halves
+	local padded=$(printf '%016x' "0x${fsid}")
+	local hi=$(printf '%x' "0x${padded:0:8}")   # strips leading zeros
+	local lo=$(printf '%x' "0x${padded:8:8}")   # strips leading zeros
+	echo "${hi}.${lo}"
+}
+
+# Create base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both base and clone filesystems using required clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+# Fetch filesystem IDs to verify the kernel can differentiate between them
+fsid1=$(stat -f -c "%i" $mnt1)
+fsid2=$(stat -f -c "%i" $mnt2)
+
+log1=$tmp.fanotify1
+log2=$tmp.fanotify2
+
+pid1=""
+pid2=""
+echo "Setup FID fanotify watchers on both mnt1 and mnt2"
+
+# Permit unconfined_t domains when SELinux is enforcing to prevent fanotify
+# blockages
+semanage_added="no"
+if [ "$(getenforce 2>/dev/null)" = "Enforcing" ]; then
+    if ! semanage permissive -l | grep -q "unconfined_t"; then
+        semanage permissive -a unconfined_t >/dev/null 2>&1 && semanage_added="yes"
+    fi
+fi
+
+# Start asynchronous fanotify monitors
+( monitor_fanotify "$mnt1" > "$log1" ) &
+pid1=$!
+( monitor_fanotify "$mnt2" > "$log2" ) &
+pid2=$!
+sleep 2
+
+echo "Trigger file creation on mnt1"
+touch $mnt1/file_on_mnt1
+sync
+sleep 1
+
+echo "Trigger file creation on mnt2"
+touch $mnt2/file_on_mnt2
+sync
+sleep 1
+
+echo "Verify fsid in the fanotify"
+kill $pid1 $pid2
+wait $pid1 $pid2 2>/dev/null
+pid1=""
+pid2=""
+
+e_fsid1=$(fsid_to_fid_parts "$fsid1")
+e_fsid2=$(fsid_to_fid_parts "$fsid2")
+
+# Dump debug details to the full log
+echo $fsid1 $e_fsid1 $fsid2 $e_fsid2 >> $seqres.full
+cat $log1 >> $seqres.full
+cat $log2 >> $seqres.full
+
+# Ensure monitor 1 only captured events belonging to mnt 1 and fsid 1
+if grep -qF "$e_fsid1" "$log1" && ! grep -qF "$e_fsid2" "$log1"; then
+	echo "SUCCESS: mnt1 events found"
+else
+	[ ! -s "$log1" ] && echo "  - mnt1 received no events."
+	grep -qF "$e_fsid2" "$log1" && echo "  - mnt1 received event from mnt2."
+fi
+
+# Ensure monitor 2 only captured events belonging to mnt 2 and fsid 2
+if grep -qF "$e_fsid2" "$log2" && ! grep -qF "$e_fsid1" "$log2"; then
+	echo "SUCCESS: mnt2 events found"
+else
+	[ ! -s "$log2" ] && echo "  - mnt2 received no events."
+	grep -qF "$e_fsid1" "$log2" && echo "  - mnt2 received event from mnt1."
+fi
+
+status=0
+exit
diff --git a/tests/generic/801.out b/tests/generic/801.out
new file mode 100644
index 000000000000..d7b318d9f27c
--- /dev/null
+++ b/tests/generic/801.out
@@ -0,0 +1,7 @@
+QA output created by 801
+Setup FID fanotify watchers on both mnt1 and mnt2
+Trigger file creation on mnt1
+Trigger file creation on mnt2
+Verify fsid in the fanotify
+SUCCESS: mnt1 events found
+SUCCESS: mnt2 events found
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 06/11] fstests: verify f_fsid for cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify that the cloned filesystem provides an f_fsid that is persistent
across mount cycles, yet unique from the original filesystem's f_fsid.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/802     | 64 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/802.out |  4 +++
 2 files changed, 68 insertions(+)
 create mode 100644 tests/generic/802
 create mode 100644 tests/generic/802.out

diff --git a/tests/generic/802 b/tests/generic/802
new file mode 100644
index 000000000000..910807c11584
--- /dev/null
+++ b/tests/generic/802
@@ -0,0 +1,64 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 802
+# Check that the cloned filesystem provides an f_fsid that is persistent
+# across mount cycles if the block device maj:min remains unchanged.
+
+. ./common/preamble
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both filesystems simultaneously using mandatory clone mount options
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+# Capture baseline filesystem IDs for comparison
+fsid_scratch=$(stat -f -c "%i" $mnt1)
+fsid_clone=$(stat -f -c "%i" $mnt2)
+
+# Verify that the fsids remain stable after a mount cycle, even when the
+# mount order is reversed.
+echo "**** fsid after mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+# Compare post mount-cycle values against the baseline
+stat -f -c "%i" $mnt1 | sed -e "s/$fsid_scratch/FSID_SCRATCH/g"
+stat -f -c "%i" $mnt2 | sed -e "s/$fsid_clone/FSID_CLONE/g"
+
+status=0
+exit
diff --git a/tests/generic/802.out b/tests/generic/802.out
new file mode 100644
index 000000000000..0202a9a2c108
--- /dev/null
+++ b/tests/generic/802.out
@@ -0,0 +1,4 @@
+QA output created by 802
+**** fsid after mount cycle ****
+FSID_SCRATCH
+FSID_CLONE
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 07/11] fstests: verify libblkid resolution of duplicate UUIDs
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Verify how findmnt, df (libblkid) resolve device paths when multiple
block devices share the same FSUUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/803     | 72 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/803.out |  6 ++++
 2 files changed, 78 insertions(+)
 create mode 100644 tests/generic/803
 create mode 100644 tests/generic/803.out

diff --git a/tests/generic/803 b/tests/generic/803
new file mode 100644
index 000000000000..77901592366c
--- /dev/null
+++ b/tests/generic/803
@@ -0,0 +1,72 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 803
+# Check if the mountinfo based findmnt would resolve to the common uuid
+# as per the blkid (libblkid based).
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize pool devices and mount points names
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Get the uuid from the source device
+fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+# Btrfs assigned a random uuid for the clone fs before the fix.
+# Cycle mounts and reverse the initialization (source and clone fs) order.
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+status=0
+exit
diff --git a/tests/generic/803.out b/tests/generic/803.out
new file mode 100644
index 000000000000..3a130c662430
--- /dev/null
+++ b/tests/generic/803.out
@@ -0,0 +1,6 @@
+QA output created by 803
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
+**** mount cycle ****
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 08/11] fstests: verify IMA isolation on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Add testcase to verify IMA measurement isolation when multiple devices
share the same FSUUID.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/804     | 108 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/804.out |  10 ++++
 2 files changed, 118 insertions(+)
 create mode 100644 tests/generic/804
 create mode 100644 tests/generic/804.out

diff --git a/tests/generic/804 b/tests/generic/804
new file mode 100644
index 000000000000..ced32e6d79dd
--- /dev/null
+++ b/tests/generic/804
@@ -0,0 +1,108 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 804
+# Verify IMA isolation on cloned filesystems:
+# . Mount two devices sharing the same FSUUID (cloned).
+# . Apply an IMA policy to measure files based on that FSUUID.
+# . Create unique files on each mount point to trigger measurements.
+# . Confirm the IMA log correctly attributes events to the respective mounts.
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: use on-disk uuid for s_uuid in temp_fsid mounts"
+_fixed_by_fs_commit btrfs xxxxxxxxxxxx \
+	"btrfs: derive f_fsid from on-disk fsuuid and dev_t"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	_unmount $mnt1 2>/dev/null
+	_unmount $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Normalize device names and mount points
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|$mnt1|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|$mnt2|MNT2|g" | _filter_spaces
+}
+
+# Core helper to set IMA policy and check measurement logs
+do_ima()
+{
+	local ima_policy="/sys/kernel/security/ima/policy"
+	local ima_log="/sys/kernel/security/ima/ascii_runtime_measurements"
+	local fsuuid
+	local mnt=$1
+	local enable=$2
+
+	# Since the in-memory IMA audit log is only cleared upon reboot,
+	# use unique random filenames to avoid log collisions.
+	local foofile=$(mktemp --dry-run foobar_XXXXX)
+
+	echo $mnt $enable | filter_pool
+
+	[ -w "$ima_policy" ] || _notrun "IMA policy not writable"
+
+	fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+	# Load IMA policy to measure file access specifically for this
+	# filesystem UUID.
+	if [[ $enable -eq 1 ]]; then
+		echo "measure func=FILE_CHECK fsuuid=$fsuuid" > "$ima_policy" || \
+			_notrun "Policy rejected"
+	fi
+
+	# Create a file to trigger measurement and verify its entry in
+	# the IMA log.
+	echo "test_data" > $mnt/$foofile
+
+	# IMA log extract
+	grep $foofile "$ima_log" | awk '{ print $5 }' | filter_pool | \
+						sed "s/$foofile/FOOBAR_FILE/"
+
+	echo "dbg: $mnt $fsuuid $foofile" >> $seqres.full
+	cat $ima_log | tail -1 >> $seqres.full
+	echo >> $seqres.full
+}
+
+# Initialize loop base and cloned instances
+devs=()
+_loop_image_create_clone devs
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Concurrently mount both clones
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+#  IMA response on baseline and clone configuration
+do_ima $mnt1 1
+do_ima $mnt2 0
+
+# Cycle mount on the second device.
+echo mount cycle
+_unmount $mnt2
+_mount $mount_opts ${devs[1]} $mnt2 || _fail "Failed to mount dev2"
+
+do_ima $mnt1 0
+do_ima $mnt2 0
+
+status=0
+exit
diff --git a/tests/generic/804.out b/tests/generic/804.out
new file mode 100644
index 000000000000..9804181d6c17
--- /dev/null
+++ b/tests/generic/804.out
@@ -0,0 +1,10 @@
+QA output created by 804
+MNT1 1
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
+mount cycle
+MNT1 0
+MNT1/FOOBAR_FILE
+MNT2 0
+MNT2/FOOBAR_FILE
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 09/11] fstests: verify exportfs file handles on cloned filesystems
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Ensure that exportfs can correctly decode file handles on a cloned
filesystem across a mount cycle, by file handles generated on a
cloned device remain valid after mount cycle.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/805     | 80 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/805.out |  2 ++
 2 files changed, 82 insertions(+)
 create mode 100644 tests/generic/805
 create mode 100644 tests/generic/805.out

diff --git a/tests/generic/805 b/tests/generic/805
new file mode 100644
index 000000000000..5827eee039df
--- /dev/null
+++ b/tests/generic/805
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test No. 805
+# Verify that file handles encoded on a cloned filesystem remain valid and
+# resolvable via open_by_handle across a mount cycle and mount order swap.
+
+. ./common/preamble
+
+_begin_fstest auto quick exportfs clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_exportfs
+_require_loop
+_require_test_program "open_by_handle"
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	_unmount $mnt1 2>/dev/null
+	_unmount $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+# Create test dir and test files, encode file handles and store to tmp file
+create_test_files()
+{
+	rm -rf $testdir
+	mkdir -p $testdir
+	$here/src/open_by_handle -cwp -o $tmp.handles_file $testdir $NUMFILES
+}
+
+# Attempt to read and decode the saved file handles on the targeted mount point.
+test_file_handles()
+{
+	local opt=$1
+	local when=$2
+
+	echo test_file_handles after $when
+	$here/src/open_by_handle $opt -i $tmp.handles_file $mnt2 $NUMFILES
+}
+
+# Setup base loop device and its clone
+devs=()
+_loop_image_create_clone devs
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Mount both identical UUID filesystems simultaneously
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+NUMFILES=1
+testdir=$mnt2/testdir
+
+# Decode file handles of files/dir after cycle mount
+create_test_files
+
+# Cycle mounts and reverse initialization sequence to check if
+# file handle lookups are okay
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+# Verify file handles can still be resolved post-mount-cycle
+test_file_handles -rp "cycle mount"
+
+status=0
+exit
diff --git a/tests/generic/805.out b/tests/generic/805.out
new file mode 100644
index 000000000000..29b11ec77ffb
--- /dev/null
+++ b/tests/generic/805.out
@@ -0,0 +1,2 @@
+QA output created by 805
+test_file_handles after cycle mount
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 10/11] fstests: add _change_metadata_uuid helper
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

_change_metadata_uuid changes the UUID of the golden filesystem before it
is cloned.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 common/rc | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/common/rc b/common/rc
index d95eec94f7b7..5cd4e025293b 100644
--- a/common/rc
+++ b/common/rc
@@ -1533,6 +1533,29 @@ _scratch_resvblks()
 	esac
 }
 
+# Change the metadata UUID of the given device to a newly generated one.
+# Args:
+#   $1: Block device path to modify.
+_change_metadata_uuid()
+{
+	local temp_mnt=$TEST_DIR/${seq}_mnt
+	local dev=$1
+
+	case $FSTYP in
+	xfs)
+		_require_command "$XFS_ADMIN_PROG" "xfs_admin"
+		$XFS_ADMIN_PROG -U generate $dev >> $seqres.full
+		;;
+	btrfs)
+		_require_command "$BTRFS_TUNE_PROG" "btrfstune"
+		$BTRFS_TUNE_PROG -m $dev
+		;;
+	*)
+		_notrun "Require filesystem with metadata_uuid feature"
+		;;
+	esac
+}
+
 # Create a small loop image, run an optional tuning function ($2) on it,
 # clone it, and attach both to loop devices, returned in ($1).
 # Args:
-- 
2.43.0


^ permalink raw reply related

* [PATCH v7 11/11] fstests: test UUID consistency for clones with metadata_uuid
From: Anand Jain @ 2026-06-17 11:20 UTC (permalink / raw)
  To: fstests
  Cc: linux-btrfs, linux-ext4, linux-xfs, linux-f2fs-devel, zlang, hch,
	djwong
In-Reply-To: <cover.1781694879.git.asj@kernel.org>

Btrfs and xfs uses the metadata_uuid superblock feature to change the
on-disk UUID without rewriting every block header. This patch adds a
sanity check to ensure UUID consistency when a filesystem with
metadata_uuid enabled is cloned.

Signed-off-by: Anand Jain <asj@kernel.org>
---
 tests/generic/806     | 74 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/806.out |  6 ++++
 2 files changed, 80 insertions(+)
 create mode 100644 tests/generic/806
 create mode 100644 tests/generic/806.out

diff --git a/tests/generic/806 b/tests/generic/806
new file mode 100644
index 000000000000..6d3166491006
--- /dev/null
+++ b/tests/generic/806
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Anand Jain <asj@kernel.org>.  All Rights Reserved.
+#
+# FS QA Test 806
+#
+# Verify that the cloned filesystem UUID remains consistent, even when the
+# `metadata_uuid` feature is enabled.
+#
+
+. ./common/preamble
+. ./common/filter
+
+_begin_fstest auto quick mount clone
+
+_require_test
+_require_block_device $TEST_DEV
+_require_loop
+
+_cleanup()
+{
+	cd /
+	rm -r -f $tmp.*
+	umount $mnt1 $mnt2 2>/dev/null
+	_loop_image_destroy "${devs[@]}" 2> /dev/null
+}
+
+filter_pool()
+{
+	sed -e "s|${devs[0]}|DEV1|g" -e "s|${mnt1}|MNT1|g" \
+	    -e "s|${devs[1]}|DEV2|g" -e "s|${mnt2}|MNT2|g" | _filter_spaces
+}
+
+# Create base loop device and its clone, applying the metadata_uuid tuning
+# callback to the base filesystem before the copy occurs.
+devs=()
+_loop_image_create_clone devs _change_metadata_uuid
+mkdir -p $TEST_DIR/$seq
+mnt1=$TEST_DIR/$seq/mnt1
+mnt2=$TEST_DIR/$seq/mnt2
+mkdir -p $mnt1
+mkdir -p $mnt2
+
+# Get the uuid from the source device
+fsuuid=$(blkid -s UUID -o value ${devs[0]})
+
+# Mount both clone and baseline
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+# Cycle mounts and reverse the initialization order to ensure UUID tracking
+# doesn't mismatch or flip when metadata_uuid optimization is active.
+echo "**** mount cycle ****"
+_unmount $mnt1
+_unmount $mnt2
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[1]} $mnt2 || \
+						_fail "Failed to mount dev2"
+_mount $(_common_dev_mount_options) $(_clone_mount_option) ${devs[0]} $mnt1 || \
+						_fail "Failed to mount dev1"
+
+findmnt -o SOURCE,TARGET,UUID "${devs[0]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+findmnt -o SOURCE,TARGET,UUID "${devs[1]}" | tail -n +2 | \
+				sed -e "s/${fsuuid}/FSUUID/g" | filter_pool
+
+status=0
+exit
diff --git a/tests/generic/806.out b/tests/generic/806.out
new file mode 100644
index 000000000000..918f422ecddf
--- /dev/null
+++ b/tests/generic/806.out
@@ -0,0 +1,6 @@
+QA output created by 806
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
+**** mount cycle ****
+DEV1 MNT1 FSUUID
+DEV2 MNT2 FSUUID
-- 
2.43.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox