Linux EXT4 FS development


* Re: [patch 33/38] powerpc: Select ARCH_HAS_RANDOM_ENTROPY
From: Mukesh Kumar Chaurasiya @ 2026-04-21 11:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Michael Ellerman, linuxppc-dev, Arnd Bergmann, x86,
	Lu Baolu, iommu, Michael Grzeschik, netdev, linux-wireless,
	Herbert Xu, linux-crypto, Vlastimil Babka, linux-mm,
	David Woodhouse, Bernie Thompson, linux-fbdev, Theodore Tso,
	linux-ext4, Andrew Morton, Uladzislau Rezki, Marco Elver,
	Dmitry Vyukov, kasan-dev, Andrey Ryabinin, Thomas Sailer,
	linux-hams, Jason A. Donenfeld, Richard Henderson, linux-alpha,
	Russell King, linux-arm-kernel, Catalin Marinas, Huacai Chen,
	loongarch, Geert Uytterhoeven, linux-m68k, Dinh Nguyen,
	Jonas Bonn, linux-openrisc, Helge Deller, linux-parisc,
	Paul Walmsley, linux-riscv, Heiko Carstens, linux-s390,
	David S. Miller, sparclinux
In-Reply-To: <20260410120319.789114053@kernel.org>

On Fri, Apr 10, 2026 at 02:21:09PM +0200, Thomas Gleixner wrote:
> The only remaining usage of get_cycles() is to provide random_get_entropy().
> 
> Switch powerpc over to the new scheme of selecting ARCH_HAS_RANDOM_ENTROPY
> and providing random_get_entropy() in asm/random.h.
> 
> Remove asm/timex.h as it has no functionality anymore.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/Kconfig              |    1 +
>  arch/powerpc/include/asm/random.h |   13 +++++++++++++
>  arch/powerpc/include/asm/timex.h  |   21 ---------------------
>  3 files changed, 14 insertions(+), 21 deletions(-)
> 
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -150,6 +150,7 @@ config PPC
>  	select ARCH_HAS_PREEMPT_LAZY
>  	select ARCH_HAS_PTDUMP
>  	select ARCH_HAS_PTE_SPECIAL
> +	select ARCH_HAS_RANDOM_ENTROPY
>  	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
>  	select ARCH_HAS_SET_MEMORY
>  	select ARCH_HAS_STRICT_KERNEL_RWX	if (PPC_BOOK3S || PPC_8xx) && !HIBERNATION
> --- /dev/null
> +++ b/arch/powerpc/include/asm/random.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_RANDOM_H
> +#define _ASM_POWERPC_RANDOM_H
> +
> +#include <asm/cputable.h>
> +#include <asm/vdso/timebase.h>
> +
> +static inline unsigned long random_get_entropy(void)
> +{
> +	return mftb();
> +}
> +
> +#endif	/* _ASM_POWERPC_RANDOM_H */
> --- a/arch/powerpc/include/asm/timex.h
> +++ b/arch/powerpc/include/asm/timex.h
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _ASM_POWERPC_TIMEX_H
> -#define _ASM_POWERPC_TIMEX_H
> -
> -#ifdef __KERNEL__
> -
> -/*
> - * PowerPC architecture timex specifications
> - */
> -
> -#include <asm/cputable.h>
> -#include <asm/vdso/timebase.h>
> -
> -static inline cycles_t get_cycles(void)
> -{
> -	return mftb();
> -}
> -#define get_cycles get_cycles
> -
> -#endif	/* __KERNEL__ */
> -#endif	/* _ASM_POWERPC_TIMEX_H */
> 
Build tested this series with allmodconfig and allyesconfig for ppc64le,
on a ppc64le machine.
tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git getcycles-v1

Boot tested this series on powernv9 qemu, powernv10 qemu and pSeries
power11 hardware.

Tested-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>


^ permalink raw reply

* Re: [PATCH] ext4: prevent out-of-bounds read in ext4_read_inline_data()
From: Jan Kara @ 2026-04-21 10:04 UTC (permalink / raw)
  To: Junjie Cao
  Cc: tytso, adilger.kernel, jack, libaokun, ojaswin, ritesh.list,
	yi.zhang, linux-ext4, linux-kernel, stable,
	syzbot+26c4a8cab92d0cda3e3b
In-Reply-To: <20260421093138.906266-1-junjie.cao@intel.com>

On Tue 21-04-26 17:31:38, Junjie Cao wrote:
> ext4_read_inline_data() reads e_value_offs from the inode buffer_head on
> each call, but the decision to enter the xattr value path depends on
> i_inline_size cached in EXT4_I(inode) at iget time. If the buffer
> contents change after the initial validation, e_value_offs can point
> beyond the inode body while i_inline_size still directs the code into
> the xattr value path, causing an out-of-bounds read in the memcpy.
> 
> Add a bounds check before the memcpy, consistent with
> ext4_xattr_ibody_get(). Also guard folio_mark_uptodate() in
> ext4_read_inline_folio() since ext4_read_inline_data() can now return
> -EFSCORRUPTED.
> 
> Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support")
> Cc: stable@vger.kernel.org
> Reported-by: syzbot+26c4a8cab92d0cda3e3b@syzkaller.appspotmail.com
> Tested-by: syzbot+26c4a8cab92d0cda3e3b@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=26c4a8cab92d0cda3e3b
> Signed-off-by: Junjie Cao <junjie.cao@intel.com>

If the buffer contents change after the initial validation, there is some
problem somewhere and this isn't going to fix it (likely the fs is
corrupted and that isn't properly detected). Please fix the real problem,
not just paper over it.

								Honza

> ---
>  fs/ext4/inline.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 408677fa8196..18c678df0a6e 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -211,6 +211,14 @@ static int ext4_read_inline_data(struct inode *inode, void *buffer,
>  	len = min_t(unsigned int, len,
>  		    (unsigned int)le32_to_cpu(entry->e_value_size));
>  
> +	if (unlikely((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) +
> +		     len > (void *)ITAIL(inode, raw_inode))) {
> +		EXT4_ERROR_INODE(inode,
> +			"inline data value out of bounds (offs %u len %u)",
> +			le16_to_cpu(entry->e_value_offs), len);
> +		return -EFSCORRUPTED;
> +	}
> +
>  	memcpy(buffer,
>  	       (void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs), len);
>  	cp_len += len;
> @@ -535,7 +543,8 @@ static int ext4_read_inline_folio(struct inode *inode, struct folio *folio)
>  	ret = ext4_read_inline_data(inode, kaddr, len, &iloc);
>  	kaddr = folio_zero_tail(folio, len, kaddr + len);
>  	kunmap_local(kaddr);
> -	folio_mark_uptodate(folio);
> +	if (ret >= 0)
> +		folio_mark_uptodate(folio);
>  	brelse(iloc.bh);
>  
>  out:
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH] ext4: prevent out-of-bounds read in ext4_read_inline_data()
From: Junjie Cao @ 2026-04-21  9:31 UTC (permalink / raw)
  To: tytso
  Cc: adilger.kernel, jack, libaokun, ojaswin, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel, stable, syzbot+26c4a8cab92d0cda3e3b,
	junjie.cao

ext4_read_inline_data() reads e_value_offs from the inode buffer_head on
each call, but the decision to enter the xattr value path depends on
i_inline_size cached in EXT4_I(inode) at iget time. If the buffer
contents change after the initial validation, e_value_offs can point
beyond the inode body while i_inline_size still directs the code into
the xattr value path, causing an out-of-bounds read in the memcpy.

Add a bounds check before the memcpy, consistent with
ext4_xattr_ibody_get(). Also guard folio_mark_uptodate() in
ext4_read_inline_folio() since ext4_read_inline_data() can now return
-EFSCORRUPTED.

Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support")
Cc: stable@vger.kernel.org
Reported-by: syzbot+26c4a8cab92d0cda3e3b@syzkaller.appspotmail.com
Tested-by: syzbot+26c4a8cab92d0cda3e3b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=26c4a8cab92d0cda3e3b
Signed-off-by: Junjie Cao <junjie.cao@intel.com>
---
 fs/ext4/inline.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 408677fa8196..18c678df0a6e 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -211,6 +211,14 @@ static int ext4_read_inline_data(struct inode *inode, void *buffer,
 	len = min_t(unsigned int, len,
 		    (unsigned int)le32_to_cpu(entry->e_value_size));
 
+	if (unlikely((void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs) +
+		     len > (void *)ITAIL(inode, raw_inode))) {
+		EXT4_ERROR_INODE(inode,
+			"inline data value out of bounds (offs %u len %u)",
+			le16_to_cpu(entry->e_value_offs), len);
+		return -EFSCORRUPTED;
+	}
+
 	memcpy(buffer,
 	       (void *)IFIRST(header) + le16_to_cpu(entry->e_value_offs), len);
 	cp_len += len;
@@ -535,7 +543,8 @@ static int ext4_read_inline_folio(struct inode *inode, struct folio *folio)
 	ret = ext4_read_inline_data(inode, kaddr, len, &iloc);
 	kaddr = folio_zero_tail(folio, len, kaddr + len);
 	kunmap_local(kaddr);
-	folio_mark_uptodate(folio);
+	if (ret >= 0)
+		folio_mark_uptodate(folio);
 	brelse(iloc.bh);
 
 out:
-- 
2.43.0
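
For readers following the thread: the check added above is the usual
"offset plus length must stay inside the region" test. A minimal userspace
sketch in C follows. It assumes a hypothetical region_read() helper working
on a plain byte buffer, not the real ext4 IFIRST()/ITAIL() macros, and it
returns -EINVAL because EFSCORRUPTED is a kernel-internal error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not ext4 code: copy len bytes starting at offs
 * out of a region of region_size bytes, refusing any range that would
 * run past the end of the region.
 */
static int region_read(void *dst, const unsigned char *region,
		       size_t region_size, size_t offs, size_t len)
{
	assert(dst != NULL && region != NULL);

	/* Compare without computing offs + len, so the sum cannot wrap. */
	if (offs > region_size || len > region_size - offs)
		return -EINVAL;

	memcpy(dst, region + offs, len);
	return 0;
}
```

Comparing len against region_size - offs, instead of comparing offs + len
against region_size, keeps the check correct even when the two values come
from untrusted on-disk fields and their sum would overflow.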


^ permalink raw reply related

* Re: [PATCH v2 v2 2/2] ext4: allow clearing mballoc stats through mb_stats
From: liubaolin @ 2026-04-21  7:07 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Ritesh Harjani (IBM), Andreas Dilger, tytso, wangguanyu, yi.zhang,
	linux-ext4, linux-kernel, Baolin Liu
In-Reply-To: <aecVCWGL3bUFFiBd@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>



On 2026/4/21 14:12, Ojaswin Mujoo wrote:
> On Tue, Apr 21, 2026 at 01:22:31PM +0800, liubaolin wrote:
>> Dear all,
>>     I noticed the discussion about where to document the ext4 proc parameter
>> mb_stats.
>>     I ran:
>>       git grep -n "/proc/fs/ext4" Documentation/
>>     and found that ext4 proc parameters are currently documented in both
>>     Documentation/admin-guide/ext4.rst and
>> Documentation/filesystems/proc.rst.
>>
>>     To be consistent with the existing documentation, I am thinking about
>> documenting mb_stats in both places.
>>     If there are no objections, I will send a v3 shortly.
>>     Compared with v2, the v3 patch will add the mb_stats documentation to both
>> ext4.rst and proc.rst.
> 
> Yes this makes sense to me. Feel free to retain the Reviewed-by since this is
> just a documentation change.

OK, in v3 I will add Reviewed-by tags from all the reviewers who helped
review my patch.


> 
> Thanks,
> Ojaswin
> 
>>
>>     Regards,
>>     Baolin
>>
>>
>>
>> On 2026/4/21 11:40, Ritesh Harjani (IBM) wrote:
>>> Andreas Dilger <adilger@dilger.ca> writes:
>>>
>>>> On Apr 20, 2026, at 03:12, Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
>>>>>
>>>>> On Sun, Apr 19, 2026 at 02:34:36PM +0800, Baolin Liu wrote:
>>>>>> From: Baolin Liu <liubaolin@kylinos.cn>
>>>>>>
>>>>>> Make /proc/fs/ext4/<dev>/mb_stats writable and clear the runtime
>>>>>> mballoc statistics when 0 is written.
>>>>>>
>>>>>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>>>>>> ---
>>>>> Hi Baolin, thanks for the changes.
>>>>>
>>>>> Seems like userspace doesn't have any way to know that writing 0 will
>>>>> clear the stats. Well, I guess if you are looking at this file you are
>>>>> anyways debugging kernel code so that should be fine
>>>>
>>>> That could be documented in Documentation/filesystems/ext4/allocators.rst,
>>>> or better would be to add a new file that covers mballoc in more detail.
>>>>
>>>
>>> I started looking for ext4's control knobs for sys-admins in kernel
>>> Documentation where we should ideally document this, and I see those
>>> are declared here..
>>>
>>> Documentation/admin-guide/ext4.rst
>>> Documentation/ABI/testing/sysfs-fs-ext4.rst
>>>
>>> Looking at this and the relevant code, I see all /proc/ entries in ext4
>>> are all readable and sysfs entries for ext4 are mostly the control knobs
>>> which are declared in above admin guide.
>>>
>>> But now this patch adds a control knob to /proc/fs/ext4/<dev>/mb_stats,
>>> to clear the stats :).
>>>
>>> I guess we could have simply documented a new control knob value (e.g.
>>> "2") for clearing the stats via /sys/fs/ext4/<dev>/mb_stats itself or
>>> maybe even having mb_stats_clear file in sysfs wasn't bad either... But
>>> either ways, clearing the stats via the same procfs mb_stats file is not
>>> totally bad and I don't have a strong preference.
>>>
>>>
>>> For documenting this, we can add mb_stats entry under /proc section in
>>> Documentation/admin-guide/ext4.rst and document this change. Something
>>> like -
>>>
>>>     mb_stats
>>>           reports runtime statistics from multiblock allocator (mballoc),
>>>           including allocation request counts, groups scanned,
>>>           per-criteria scan hits (cr_p2_aligned, cr_goal_fast,
>>>           cr_best_avail, cr_goal_slow, cr_any_free), groups / extents
>>>           scanned, goal hits, buddy bitmap generations, and preallocation
>>>           usage etc.
>>>           Writing 0 to this procfs file resets all counters to zero.
>>>
>>>
>>> -ritesh
>>


^ permalink raw reply

* Re: [patch 32/38] powerpc/spufs: Use mftb() directly
From: Mukesh Kumar Chaurasiya @ 2026-04-21  6:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Michael Ellerman, linuxppc-dev, Arnd Bergmann, x86,
	Lu Baolu, iommu, Michael Grzeschik, netdev, linux-wireless,
	Herbert Xu, linux-crypto, Vlastimil Babka, linux-mm,
	David Woodhouse, Bernie Thompson, linux-fbdev, Theodore Tso,
	linux-ext4, Andrew Morton, Uladzislau Rezki, Marco Elver,
	Dmitry Vyukov, kasan-dev, Andrey Ryabinin, Thomas Sailer,
	linux-hams, Jason A. Donenfeld, Richard Henderson, linux-alpha,
	Russell King, linux-arm-kernel, Catalin Marinas, Huacai Chen,
	loongarch, Geert Uytterhoeven, linux-m68k, Dinh Nguyen,
	Jonas Bonn, linux-openrisc, Helge Deller, linux-parisc,
	Paul Walmsley, linux-riscv, Heiko Carstens, linux-s390,
	David S. Miller, sparclinux
In-Reply-To: <20260410120319.723429844@kernel.org>

On Fri, Apr 10, 2026 at 02:21:04PM +0200, Thomas Gleixner wrote:
> There is no reason to indirect via get_cycles(), which is about to be
> removed.
> 
> Use mftb() directly.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/platforms/cell/spufs/switch.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> --- a/arch/powerpc/platforms/cell/spufs/switch.c
> +++ b/arch/powerpc/platforms/cell/spufs/switch.c
> @@ -34,6 +34,7 @@
>  #include <asm/spu_priv1.h>
>  #include <asm/spu_csa.h>
>  #include <asm/mmu_context.h>
> +#include <asm/time.h>
>  
>  #include "spufs.h"
>  
> @@ -279,7 +280,7 @@ static inline void save_timebase(struct
>  	 *    Read PPE Timebase High and Timebase low registers
>  	 *    and save in CSA.  TBD.
>  	 */
> -	csa->suspend_time = get_cycles();
> +	csa->suspend_time = mftb();
>  }
>  
>  static inline void remove_other_spu_access(struct spu_state *csa,
> @@ -1261,7 +1262,7 @@ static inline void setup_decr(struct spu
>  	 *     in LSCSA.
>  	 */
>  	if (csa->priv2.mfc_control_RW & MFC_CNTL_DECREMENTER_RUNNING) {
> -		cycles_t resume_time = get_cycles();
> +		cycles_t resume_time = mftb();
>  		cycles_t delta_time = resume_time - csa->suspend_time;
>  
>  		csa->lscsa->decr_status.slot[0] = SPU_DECR_STATUS_RUNNING;
> 
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>

^ permalink raw reply

* Re: [PATCH v2 v2 2/2] ext4: allow clearing mballoc stats through mb_stats
From: Ojaswin Mujoo @ 2026-04-21  6:12 UTC (permalink / raw)
  To: liubaolin
  Cc: Ritesh Harjani (IBM), Andreas Dilger, tytso, wangguanyu, yi.zhang,
	linux-ext4, linux-kernel, Baolin Liu
In-Reply-To: <9a2d25d3-219f-481c-afb8-083c8b9417c1@163.com>

On Tue, Apr 21, 2026 at 01:22:31PM +0800, liubaolin wrote:
> Dear all,
>    I noticed the discussion about where to document the ext4 proc parameter
> mb_stats.
>    I ran:
>      git grep -n "/proc/fs/ext4" Documentation/
>    and found that ext4 proc parameters are currently documented in both
>    Documentation/admin-guide/ext4.rst and
> Documentation/filesystems/proc.rst.
> 
>    To be consistent with the existing documentation, I am thinking about
> documenting mb_stats in both places.
>    If there are no objections, I will send a v3 shortly.
>    Compared with v2, the v3 patch will add the mb_stats documentation to both
> ext4.rst and proc.rst.

Yes this makes sense to me. Feel free to retain the Reviewed-by since this is
just a documentation change.

Thanks,
Ojaswin

> 
>    Regards,
>    Baolin
> 
> 
> 
> On 2026/4/21 11:40, Ritesh Harjani (IBM) wrote:
> > Andreas Dilger <adilger@dilger.ca> writes:
> > 
> > > On Apr 20, 2026, at 03:12, Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
> > > > 
> > > > On Sun, Apr 19, 2026 at 02:34:36PM +0800, Baolin Liu wrote:
> > > > > From: Baolin Liu <liubaolin@kylinos.cn>
> > > > > 
> > > > > Make /proc/fs/ext4/<dev>/mb_stats writable and clear the runtime
> > > > > mballoc statistics when 0 is written.
> > > > > 
> > > > > Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
> > > > > ---
> > > > Hi Baolin, thanks for the changes.
> > > > 
> > > > Seems like userspace doesn't have any way to know that writing 0 will
> > > > clear the stats. Well, I guess if you are looking at this file you are
> > > > anyways debugging kernel code so that should be fine
> > > 
> > > That could be documented in Documentation/filesystems/ext4/allocators.rst,
> > > or better would be to add a new file that covers mballoc in more detail.
> > > 
> > 
> > I started looking for ext4's control knobs for sys-admins in kernel
> > Documentation where we should ideally document this, and I see those
> > are declared here..
> > 
> > Documentation/admin-guide/ext4.rst
> > Documentation/ABI/testing/sysfs-fs-ext4.rst
> > 
> > Looking at this and the relevant code, I see all /proc/ entries in ext4
> > are all readable and sysfs entries for ext4 are mostly the control knobs
> > which are declared in above admin guide.
> > 
> > But now this patch adds a control knob to /proc/fs/ext4/<dev>/mb_stats,
> > to clear the stats :).
> > 
> > I guess we could have simply documented a new control knob value (e.g.
> > "2") for clearing the stats via /sys/fs/ext4/<dev>/mb_stats itself or
> > maybe even having mb_stats_clear file in sysfs wasn't bad either... But
> > either ways, clearing the stats via the same procfs mb_stats file is not
> > totally bad and I don't have a strong preference.
> > 
> > 
> > For documenting this, we can add mb_stats entry under /proc section in
> > Documentation/admin-guide/ext4.rst and document this change. Something
> > like -
> > 
> >    mb_stats
> >          reports runtime statistics from multiblock allocator (mballoc),
> >          including allocation request counts, groups scanned,
> >          per-criteria scan hits (cr_p2_aligned, cr_goal_fast,
> >          cr_best_avail, cr_goal_slow, cr_any_free), groups / extents
> >          scanned, goal hits, buddy bitmap generations, and preallocation
> >          usage etc.
> >          Writing 0 to this procfs file resets all counters to zero.
> > 
> > 
> > -ritesh
> 

^ permalink raw reply

* Re: [PATCH v2 v2 2/2] ext4: allow clearing mballoc stats through mb_stats
From: liubaolin @ 2026-04-21  5:22 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), Andreas Dilger, Ojaswin Mujoo
  Cc: tytso, wangguanyu, yi.zhang, linux-ext4, linux-kernel, Baolin Liu
In-Reply-To: <7bq1t1zg.ritesh.list@gmail.com>

Dear all,
    I noticed the discussion about where to document the ext4 proc
    parameter mb_stats.
    I ran:
      git grep -n "/proc/fs/ext4" Documentation/
    and found that ext4 proc parameters are currently documented in both
    Documentation/admin-guide/ext4.rst and
    Documentation/filesystems/proc.rst.

    To be consistent with the existing documentation, I am thinking
    about documenting mb_stats in both places.
    If there are no objections, I will send a v3 shortly.
    Compared with v2, the v3 patch will add the mb_stats documentation to
    both ext4.rst and proc.rst.

    Regards,
    Baolin



On 2026/4/21 11:40, Ritesh Harjani (IBM) wrote:
> Andreas Dilger <adilger@dilger.ca> writes:
> 
>> On Apr 20, 2026, at 03:12, Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
>>>
>>> On Sun, Apr 19, 2026 at 02:34:36PM +0800, Baolin Liu wrote:
>>>> From: Baolin Liu <liubaolin@kylinos.cn>
>>>>
>>>> Make /proc/fs/ext4/<dev>/mb_stats writable and clear the runtime
>>>> mballoc statistics when 0 is written.
>>>>
>>>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>>>> ---
>>> Hi Baolin, thanks for the changes.
>>>
>>> Seems like userspace doesn't have any way to know that writing 0 will
>>> clear the stats. Well, I guess if you are looking at this file you are
>>> anyways debugging kernel code so that should be fine
>>
>> That could be documented in Documentation/filesystems/ext4/allocators.rst,
>> or better would be to add a new file that covers mballoc in more detail.
>>
> 
> I started looking for ext4's control knobs for sys-admins in kernel
> Documentation where we should ideally document this, and I see those
> are declared here..
> 
> Documentation/admin-guide/ext4.rst
> Documentation/ABI/testing/sysfs-fs-ext4.rst
> 
> Looking at this and the relevant code, I see all /proc/ entries in ext4
> are all readable and sysfs entries for ext4 are mostly the control knobs
> which are declared in above admin guide.
> 
> But now this patch adds a control knob to /proc/fs/ext4/<dev>/mb_stats,
> to clear the stats :).
> 
> I guess we could have simply documented a new control knob value (e.g.
> "2") for clearing the stats via /sys/fs/ext4/<dev>/mb_stats itself or
> maybe even having mb_stats_clear file in sysfs wasn't bad either... But
> either ways, clearing the stats via the same procfs mb_stats file is not
> totally bad and I don't have a strong preference.
> 
> 
> For documenting this, we can add mb_stats entry under /proc section in
> Documentation/admin-guide/ext4.rst and document this change. Something
> like -
> 
>    mb_stats
>          reports runtime statistics from multiblock allocator (mballoc),
>          including allocation request counts, groups scanned,
>          per-criteria scan hits (cr_p2_aligned, cr_goal_fast,
>          cr_best_avail, cr_goal_slow, cr_any_free), groups / extents
>          scanned, goal hits, buddy bitmap generations, and preallocation
>          usage etc.
>          Writing 0 to this procfs file resets all counters to zero.
> 
> 
> -ritesh


^ permalink raw reply

* Re: [PATCH v2 v2 2/2] ext4: allow clearing mballoc stats through mb_stats
From: Ritesh Harjani @ 2026-04-21  3:40 UTC (permalink / raw)
  To: Andreas Dilger, Ojaswin Mujoo
  Cc: Baolin Liu, tytso, wangguanyu, yi.zhang, linux-ext4, linux-kernel,
	Baolin Liu
In-Reply-To: <4A904858-8611-42BC-B1BD-9679F284F8EE@dilger.ca>

Andreas Dilger <adilger@dilger.ca> writes:

> On Apr 20, 2026, at 03:12, Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
>> 
>> On Sun, Apr 19, 2026 at 02:34:36PM +0800, Baolin Liu wrote:
>>> From: Baolin Liu <liubaolin@kylinos.cn>
>>> 
>>> Make /proc/fs/ext4/<dev>/mb_stats writable and clear the runtime
>>> mballoc statistics when 0 is written.
>>> 
>>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>>> ---
>> Hi Baolin, thanks for the changes.
>> 
>> Seems like userspace doesn't have any way to know that writing 0 will
>> clear the stats. Well, I guess if you are looking at this file you are
>> anyways debugging kernel code so that should be fine
>
> That could be documented in Documentation/filesystems/ext4/allocators.rst,
> or better would be to add a new file that covers mballoc in more detail.
>

I started looking for ext4's control knobs for sys-admins in kernel
Documentation where we should ideally document this, and I see those
are declared here..

Documentation/admin-guide/ext4.rst
Documentation/ABI/testing/sysfs-fs-ext4.rst

Looking at this and the relevant code, I see all /proc/ entries in ext4
are all readable and sysfs entries for ext4 are mostly the control knobs
which are declared in above admin guide.

But now this patch adds a control knob to /proc/fs/ext4/<dev>/mb_stats,
to clear the stats :).

I guess we could have simply documented a new control knob value (e.g.
"2") for clearing the stats via /sys/fs/ext4/<dev>/mb_stats itself or
maybe even having mb_stats_clear file in sysfs wasn't bad either... But
either ways, clearing the stats via the same procfs mb_stats file is not
totally bad and I don't have a strong preference.

For documenting this, we can add mb_stats entry under /proc section in
Documentation/admin-guide/ext4.rst and document this change. Something
like - 

  mb_stats
        reports runtime statistics from multiblock allocator (mballoc),
        including allocation request counts, groups scanned,
        per-criteria scan hits (cr_p2_aligned, cr_goal_fast,
        cr_best_avail, cr_goal_slow, cr_any_free), groups / extents
        scanned, goal hits, buddy bitmap generations, and preallocation
        usage etc.
        Writing 0 to this procfs file resets all counters to zero.
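
From userspace, the proposed reset amounts to writing "0" to the file. A
minimal sketch in C, assuming the writable mb_stats behaviour from this
series lands as posted. reset_stats_file() is a hypothetical helper name,
and the caller supplies the path (for example /proc/fs/ext4/<dev>/mb_stats
for the device in question):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "0\n" to the given stats file. Returns 0 on success or a
 * negative errno value. That writing 0 resets the counters is what this
 * patch series proposes; it is an assumption here, not existing ABI.
 */
static int reset_stats_file(const char *path)
{
	static const char zero[] = "0\n";
	int fd, ret = 0;
	ssize_t n;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, zero, strlen(zero));
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != strlen(zero))
		ret = -EIO;	/* short write */

	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```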

-ritesh

^ permalink raw reply

* Re: [RFC PATCH] iomap: add fast read path for small direct I/O
From: Fengnan @ 2026-04-21  3:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ojaswin Mujoo, Fengnan Chang, brauner, djwong, linux-xfs,
	linux-fsdevel, linux-ext4, lidiangang
In-Reply-To: <aea96YRt2aHJsM96@dread>

[-- Attachment #1: Type: text/plain, Size: 6240 bytes --]

On 2026/4/21 07:59, Dave Chinner wrote:
> On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
>> This test is 4k randread with QD 512 in io_uring poll mode.
>> With fio the command is roughly as follows, though ./t/io_uring gets higher IOPS.
>> fio \
>>    --name=io_uring_test \
>>    --ioengine=io_uring \
>>    --filename=/mnt/testfile \
>>    --direct=1 \
>>    --rw=randread \
>>    --bs=4096 \
>>    --iodepth=512 \
>>    --iodepth_batch_submit=32 \
>>    --iodepth_batch_complete_min=32 \
>>    --hipri=1 \
>>    --fixedbufs=1 \
>>    --registerfiles=1 \
>>    --nonvectored=1 \
>>    --sqthread_poll=1
> Ok, given the way fio works, the iodepth batching will result in
> the code submitting repeated batches of 32 read IO submissions in a
> single 'syscall'.
>
> If you change the size of this batch, how does it change the
> performance of both vanilla and patched IO paths? i.e. does this
> optimisation provide a benefit over a range of IO submission
> patterns, or is it only evident when the CPU is running an io_uring
> microbenchmark and userspace is doing no real work on the IO buffers
> being submitted?
Hi Dave:

If batch size is 16, IOPS 1.84M -> 2.11M.
If batch size is 8, IOPS 1.72M -> 1.98M.
If batch size is 1, IOPS 1.09M -> 1.17M.
This is a general optimization that isn't limited to specific tests; you
can see the benefits even when using fio+libaio.
With this command, IOPS goes from 480K to 500K:
taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512
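
The relative gains behind these figures are simple arithmetic. The
following C sketch computes them from the numbers quoted above; the helper
names gain_percent() and print_gains() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Relative gain of 'after' over 'before', in percent. */
static double gain_percent(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print one line per before/after pair quoted in this mail. */
static void print_gains(void)
{
	static const struct {
		const char *label;
		double before, after;	/* IOPS, in thousands */
	} runs[] = {
		{ "batch size 16", 1840.0, 2110.0 },
		{ "batch size 8",  1720.0, 1980.0 },
		{ "batch size 1",  1090.0, 1170.0 },
		{ "fio+libaio",     480.0,  500.0 },
	};

	for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%-14s %+.1f%%\n", runs[i].label,
		       gain_percent(runs[i].before, runs[i].after));
}
```

That works out to roughly 15% at batch sizes 16 and 8, about 7% at batch
size 1, and about 4% for the libaio run.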

>
> Also, 'fixedbufs=1' leads me to believe that this is using the same
> set of buffer memory for all IOs, and hence we've probably got a
> cache-hot data set here. Hence: is userspace reading the buffers at
> IO completion (i.e. emulating the application actually consuming the
> data that is being read from the disk), or are they remaining
> untouched by userspace and immediately reused for the next IO
> submission batch?
With t/io_uring, the buffers are not touched by userspace, but I don't
think that matters. I ran some tests using fio with the refill_buffers
option, and the result is the same.

taskset -c 30 fio --name=test --ioengine=libaio --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS 478K -> 498K.
taskset -c 30 fio --name=test --ioengine=io_uring --filename=/mnt/mytest \
    --direct=1 --rw=randread --bs=4096 --iodepth=512 --refill_buffers
IOPS 542K -> 568K.

Perhaps my test cases are a bit unusual, which has raised quite a few
questions. In the upcoming patch, I’ll include more fio test results.
>
>>>> Profiling the ext4 workload reveals that a significant portion of CPU
>>>> time is spent on memory allocation and the iomap state machine
>>>> iteration:
>>>>    5.33%  [kernel]  [k] __iomap_dio_rw
>>>>    3.26%  [kernel]  [k] iomap_iter
>>>>    2.37%  [kernel]  [k] iomap_dio_bio_iter
>>>>    2.35%  [kernel]  [k] kfree
>>>>    1.33%  [kernel]  [k] iomap_dio_complete
>>>   
>>> Hmm read is usually under a shared lock for inode as well as extent
>>> lookup so we should ideally not be blocking too much there. Can you
>>> share a bit more detailed perf report. I'd be interested to see where
>>> in iomap_iter() are you seeing the regression?
>> Are flame graph images sufficient? I’ve attached them.
>> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.
> I've had a look at them, and the biggest change in CPU usage is that
> bio_alloc_bioset() disappears from the graph. In the vanilla kernel,
> that accounts for 6.05% of the cpu samples.
>
> Let's put this in a table:
>
> function		vanilla		patched		saved
> ----------		-------		-------		-----
> ext4_file_read_iter	54.75		46.85		-7.90
> iomap_dio_rw		49.21		40.69		-8.52
> ----
> bio_alloc_bioset	 6.05		1.77		-4.28
> iomap_dio_bio_iter	25.44
> iomap_iter		15.02
> iomap_dio_fast_read_async		39.82
>
> (subtotals)		46.51		41.59		-4.99
> ----
> bio_alloc_bioset	 6.05		1.77		-4.28
> bio_init		 4.52		0.00		-4.52
>
> More than 50% of the difference in CPU usage between the two code
> paths is entirely from bio_init() overhead.
>
> That makes no sense to me. The fast path still requires bios to be
> allocated and have bio_init() called on them, and we are doing many
> more of those calls every second. Why is this overhead not showing
> up in the fast path profile -at all-?
I haven't figured that out either. I ran another flame graph on the old
kernel version, and bio_alloc_bioset only accounted for 1.83%. I'm not
sure if there was something wrong with the flame graph I generated back
then. I re-captured the ext4 flame graph using the newly modified patch,
and it looks more reasonable now.

>
>>>> I attempted several incremental optimizations in the __iomap_dio_rw()
>>>> path to close the gap:
>>>> 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
>>>>     separate kmalloc. However, because `struct iomap_dio` is relatively
>>>>     large and the main path is complex, this yielded almost no
>>>>     performance improvement.
> Yet this is exactly what you do in the fast path. Why did it not
> provide any improvement for the existing code when it is an implied
> beneficial optimisation for the new fast path?
I think there might be two reasons: first, the __iomap_dio_rw path is too
complex, with too many checks; second, the dio structure has to maintain
reference counts for every I/O operation, and the operations on atomic
variables are a bit heavy.
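
To make the co-allocation point concrete, here is a rough userspace sketch
in C of embedding the bio in the per-I/O state so that one allocation
covers both. All names here (fake_bio, fast_dio, fast_dio_alloc() and so
on) are hypothetical and are not taken from the patch or from the kernel's
iomap code. The sketch also assumes exactly one bio per request, which is
why it keeps no reference count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a block-layer bio; hypothetical, not the kernel struct. */
struct fake_bio {
	unsigned int size;
	void (*end_io)(struct fake_bio *bio);
};

/*
 * Hypothetical per-I/O state for a single-bio read. The bio is embedded,
 * so one allocation covers both objects. With only one bio per request
 * (an assumption of this sketch) there is nothing to reference count.
 */
struct fast_dio {
	long long pos;
	int error;
	int completed;
	struct fake_bio bio;	/* embedded, not separately allocated */
};

/* Recover the containing fast_dio from a pointer to its embedded bio. */
static struct fast_dio *fast_dio_from_bio(struct fake_bio *bio)
{
	assert(bio != NULL);
	return (struct fast_dio *)((char *)bio -
				   offsetof(struct fast_dio, bio));
}

/* Completion handler: mark the owning request as done. */
static void fast_dio_end_io(struct fake_bio *bio)
{
	struct fast_dio *dio = fast_dio_from_bio(bio);

	dio->completed = 1;
}

/* A single calloc() yields both the state and the bio. */
static struct fast_dio *fast_dio_alloc(long long pos, unsigned int size)
{
	struct fast_dio *dio = calloc(1, sizeof(*dio));

	if (!dio)
		return NULL;
	dio->pos = pos;
	dio->bio.size = size;
	dio->bio.end_io = fast_dio_end_io;
	return dio;
}
```

In the kernel, this kind of layout is normally obtained through the
front_pad argument of a bio_set rather than by hand, so the sketch only
illustrates the shape of the idea.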

> I'm clearly missing something here. I'm trying to work out why the
> profiles show what they do, but there's differences between them
> that don't make obvious sense to me.
>
> It would also be useful to have XFS profiles, because it has a
> larger CPU cache footprint than ext4. If what the profiles are
> showing is a result of CPU cache residency artifacts, then we'll see
> different profile (and, potentially, performance) artifacts with
> XFS...
The XFS flame graph is also attached now.
IOPS: 1.92M->2.3M.

>
> -Dave.

[-- Attachment #2: xfs_patch.svg --]
[-- Type: image/svg+xml, Size: 79092 bytes --]

[-- Attachment #3: xfs_base.svg --]
[-- Type: image/svg+xml, Size: 82837 bytes --]

[-- Attachment #4: ext4_base.svg --]
[-- Type: image/svg+xml, Size: 76895 bytes --]

[-- Attachment #5: ext4_patched.svg --]
[-- Type: image/svg+xml, Size: 72805 bytes --]

^ permalink raw reply

* Re: [RFC PATCH] iomap: add fast read path for small direct I/O
From: Dave Chinner @ 2026-04-20 23:59 UTC (permalink / raw)
  To: changfengnan
  Cc: Ojaswin Mujoo, Fengnan Chang, brauner, djwong, linux-xfs,
	linux-fsdevel, linux-ext4, lidiangang
In-Reply-To: <d9210bcdf73fbe1ac8b6ec132865609a3ed68688.bd12b07f.c444.4fe0.8460.b6fed4af7332@bytedance.com>

On Thu, Apr 16, 2026 at 11:22:08AM +0800, changfengnan wrote:
> This test is 4k randread with QD 512 in io_uring poll mode.
> With fio the command is roughly as follows, though ./t/io_uring gets higher IOPS.
> fio \
>   --name=io_uring_test \
>   --ioengine=io_uring \
>   --filename=/mnt/testfile \
>   --direct=1 \
>   --rw=randread \
>   --bs=4096 \
>   --iodepth=512 \
>   --iodepth_batch_submit=32 \
>   --iodepth_batch_complete_min=32 \
>   --hipri=1 \
>   --fixedbufs=1 \
>   --registerfiles=1 \
>   --nonvectored=1 \
>   --sqthread_poll=1

Ok, given the way fio works, the iodepth batching will result in
the code submitting repeated batches of 32 read IO submissions in a
single 'syscall'.

If you change the size of this batch, how does it change the
performance of both vanilla and patched IO paths? i.e. does this
optimisation provide a benefit over a range of IO submission
patterns, or is it only evident when the CPU is running an io_uring
microbenchmark and userspace is doing no real work on the IO buffers
being submitted?

Also, 'fixedbufs=1' leads me to believe that this is using the same
set of buffer memory for all IOs, and hence we've probably got a
cache-hot data set here. Hence: is userspace reading the buffers at
IO completion (i.e. emulating the application actually consuming the
data that is being read from the disk), or are they remaining
untouched by userspace and immediately reused for the next IO
submission batch?

> > > Profiling the ext4 workload reveals that a significant portion of CPU
> > > time is spent on memory allocation and the iomap state machine
> > > iteration:
> > >   5.33%  [kernel]  [k] __iomap_dio_rw
> > >   3.26%  [kernel]  [k] iomap_iter
> > >   2.37%  [kernel]  [k] iomap_dio_bio_iter
> > >   2.35%  [kernel]  [k] kfree
> > >   1.33%  [kernel]  [k] iomap_dio_complete
> > 
> > Hmm, reads usually take only a shared lock for the inode as well as for
> > extent lookup, so we should ideally not be blocking too much there. Can
> > you share a more detailed perf report? I'd be interested to see where
> > in iomap_iter() you are seeing the regression.
> Are flame graph images enough? I've attached them.
> ext4_poll_7.svg is without this patch, iomap_fast.svg is with this patch.

I've had a look at them, and the biggest change in CPU usage is that
bio_alloc_bioset() disappears from the graph. In the vanilla kernel,
that accounts for 6.05% of the cpu samples.

Let's put this in a table:

function                    vanilla   patched   saved
-------------------------   -------   -------   -----
ext4_file_read_iter           54.75     46.85   -7.90
iomap_dio_rw                  49.21     40.69   -8.52
----
bio_alloc_bioset               6.05      1.77   -4.28
iomap_dio_bio_iter            25.44
iomap_iter                    15.02
iomap_dio_fast_read_async               39.82

(subtotals)                   46.51     41.59   -4.92
----
bio_alloc_bioset               6.05      1.77   -4.28
bio_init                       4.52      0.00   -4.52

More than 50% of the difference in CPU usage between the two code
paths is entirely from bio_init() overhead.
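
The arithmetic behind those figures can be rechecked with a small user-space sketch. It is only an illustration: the percentages are the ones quoted in the table, stored as hundredths of a percent to avoid floating-point rounding, a function missing from one profile is assumed to contribute 0 there, and the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* CPU sample shares read off the flame graphs, in hundredths of a percent. */
struct profile_row {
	const char *func;
	int vanilla;	/* share in the unpatched kernel */
	int patched;	/* share with the fast path applied */
};

/*
 * Rows that make up the iomap_dio_rw() subtotal. A function that does not
 * appear in one profile is assumed to contribute 0 there.
 */
const struct profile_row subtotal_rows[] = {
	{ "bio_alloc_bioset",           605,  177 },
	{ "iomap_dio_bio_iter",        2544,    0 },
	{ "iomap_iter",                1502,    0 },
	{ "iomap_dio_fast_read_async",    0, 3982 },
};

/* Sum one column of the subtotal rows; 'patched' selects which one. */
int column_total(int patched)
{
	int sum = 0;
	size_t n = sizeof(subtotal_rows) / sizeof(subtotal_rows[0]);

	for (size_t i = 0; i < n; i++)
		sum += patched ? subtotal_rows[i].patched
			       : subtotal_rows[i].vanilla;
	return sum;
}

/* Share of 'part' in 'whole', in whole percent, rounded down. */
int share_percent(int part, int whole)
{
	assert(whole > 0);
	return part * 100 / whole;
}
```

With these inputs, `share_percent(452, 852)` relates the 4.52 bio_init() saving to the 8.52 iomap_dio_rw() saving, which is where the "more than 50%" figure comes from.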

That makes no sense to me. The fast path still requires bios to be
allocated and have bio_init() called on them, and we are doing many
more of those calls every second. Why is this overhead not showing
up in the fast path profile -at all-?

> > > I attempted several incremental optimizations in the __iomap_dio_rw()
> > > path to close the gap:
> > > 1. Allocating the `bio` and `struct iomap_dio` together to avoid a
> > >    separate kmalloc. However, because `struct iomap_dio` is relatively
> > >    large and the main path is complex, this yielded almost no
> > >    performance improvement.

Yet this is exactly what you do in the fast path. Why did it not
provide any improvement for the existing code when it is an implied
beneficial optimisation for the new fast path?

I'm clearly missing something here. I'm trying to work out why the
profiles show what they do, but there's differences between them
that don't make obvious sense to me.

It would also be useful to have XFS profiles, because it has a
larger CPU cache footprint than ext4. If what the profiles are
showing is a result of CPU cache residency artifacts, then we'll see
different profile (and, potentially, performance) artifacts with
XFS...

-Dave.
-- 
Dave Chinner
dgc@kernel.org

^ permalink raw reply

* Re: [PATCH v2 v2 2/2] ext4: allow clearing mballoc stats through mb_stats
From: Andreas Dilger @ 2026-04-20 18:28 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: Baolin Liu, tytso, wangguanyu, yi.zhang, ritesh.list, linux-ext4,
	linux-kernel, Baolin Liu
In-Reply-To: <aeXt_qg1SprB9Gu_@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On Apr 20, 2026, at 03:12, Ojaswin Mujoo <ojaswin@linux.ibm.com> wrote:
> 
> On Sun, Apr 19, 2026 at 02:34:36PM +0800, Baolin Liu wrote:
>> From: Baolin Liu <liubaolin@kylinos.cn>
>> 
>> Make /proc/fs/ext4/<dev>/mb_stats writable and clear the runtime
>> mballoc statistics when 0 is written.
>> 
>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>> ---
> Hi Baolin, thanks for the changes.
> 
> Seems like userspace doesn't have any way to know that writing 0 will
> clear the stats. Well, I guess if you are looking at this file you are
> debugging kernel code anyway, so that should be fine.

That could be documented in Documentation/filesystems/ext4/allocators.rst,
or better would be to add a new file that covers mballoc in more detail.

Cheers, Andreas






^ permalink raw reply

* [PATCH AUTOSEL 7.0-6.1] ext4: unmap invalidated folios from page tables in mpage_release_unused_pages()
From: Sasha Levin @ 2026-04-20 13:22 UTC (permalink / raw)
  To: patches, stable
  Cc: Deepanshu Kartikey, syzbot+b0a0670332b6b3230a0a, Matthew Wilcox,
	Theodore Ts'o, Sasha Levin, linux-ext4, linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Deepanshu Kartikey <kartikey406@gmail.com>

[ Upstream commit 9b25f381de6b8942645f43735cb0a4fb0ab3a6d1 ]

When delayed block allocation fails (e.g., due to filesystem corruption
detected in ext4_map_blocks()), the writeback error handler calls
mpage_release_unused_pages(invalidate=true) which invalidates affected
folios by clearing their uptodate flag via folio_clear_uptodate().

However, these folios may still be mapped in process page tables. If a
subsequent operation (such as ftruncate calling ext4_block_truncate_page)
triggers a write fault, the existing page table entry allows access to
the now-invalidated folio. This leads to ext4_page_mkwrite() being called
with a non-uptodate folio, which then gets marked dirty, triggering:

    WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
    __folio_mark_dirty+0x578/0x880

    Call Trace:
     fault_dirty_shared_page+0x16e/0x2d0
     do_wp_page+0x38b/0xd20
     handle_pte_fault+0x1da/0x450

The sequence leading to this warning is:

1. Process writes to mmap'd file, folio becomes uptodate and dirty
2. Writeback begins, but delayed allocation fails due to corruption
3. mpage_release_unused_pages(invalidate=true) is called:
   - block_invalidate_folio() clears dirty flag
   - folio_clear_uptodate() clears uptodate flag
   - But folio remains mapped in page tables
4. Later, ftruncate triggers ext4_block_truncate_page()
5. This causes a write fault on the still-mapped folio
6. ext4_page_mkwrite() is called with folio that is !uptodate
7. block_page_mkwrite() marks buffers dirty
8. fault_dirty_shared_page() tries to mark folio dirty
9. block_dirty_folio() calls __folio_mark_dirty(warn=1)
10. WARNING triggers: WARN_ON_ONCE(warn && !uptodate && !dirty)

Fix this by unmapping folios from page tables before invalidating them
using unmap_mapping_pages(). This ensures that subsequent accesses
trigger new page faults rather than reusing invalidated folios through
stale page table entries.

Note that this results in data loss for any writes to the mmap'd region
that couldn't be written back, but this is expected behavior when
writeback fails due to filesystem corruption. The existing error message
already states "This should not happen!! Data will be lost".

Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Tested-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Link: https://patch.msgid.link/20251205055914.1393799-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have sufficient information to complete the forensic analysis.

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1 — Subject line parse**
- Record: Subsystem `ext4:`; action verb `unmap` (describes the fix
  mechanism, but semantically a fix); one-line summary: invalidated
  folios in the writeback error path must also be unmapped from user
  page tables to avoid a `__folio_mark_dirty` WARN.

**Step 1.2 — Commit message tags**
- Record:
  - `Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com`
  - `Tested-by: syzbot+...@syzkaller.appspotmail.com` (auto-test bot
    confirmed fix)
  - `Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a`
  - `Suggested-by: Matthew Wilcox <willy@infradead.org>` (MM maintainer)
  - `Signed-off-by: Deepanshu Kartikey` + `Signed-off-by: Theodore Ts'o`
    (ext4 maintainer)
  - `Link:` to lore message id
  - No Fixes:, no Cc: stable (expected – that's why this is up for
    review)

**Step 1.3 — Commit body analysis**
- Record: Very detailed 10-step reproduction flow. The author identifies
  the exact sequence: mmap write → dirty/uptodate folio → delayed-alloc
  failure (e.g., corruption) →
  `mpage_release_unused_pages(invalidate=true)` → folio invalidated but
  still mapped → later write fault (e.g., from
  `ext4_block_truncate_page()`) hits `ext4_page_mkwrite()` with
  `!uptodate` folio → `WARN_ON_ONCE(warn && !uptodate && !dirty)` fires
  in `__folio_mark_dirty()`. Author explicitly states this is not
  theoretical — syzbot has a C reproducer. Also notes data-loss is
  intentional/expected on writeback failure ("This should not happen!!
  Data will be lost" message is pre-existing).

**Step 1.4 — Hidden bug fix?**
- Record: Not hidden — the subject names the mechanism, and the body
  explicitly documents a WARN and a concrete syscall sequence. This is
  clearly a fix.

## PHASE 2: DIFF ANALYSIS

**Step 2.1 — Inventory**
- Record: 1 file changed (`fs/ext4/inode.c`), +15/-1 lines, all in
  `mpage_release_unused_pages()`. Single-file surgical fix, scope = very
  small.

**Step 2.2 — Code flow change**
- Record: Before: when `invalidate=true` and `folio_mapped(folio)` was
  true, we only `folio_clear_dirty_for_io(folio)` to clear the PTE-dirty
  bits (from 2016 commit `4e800c0359d9a`), then
  `block_invalidate_folio()` + `folio_clear_uptodate()`, and left the
  mapping in place. After: we additionally call
  `unmap_mapping_pages(folio->mapping, folio->index,
  folio_nr_pages(folio), false)` to tear the folio out of every
  process's page tables, so no stale PTE can resurface the
  now-invalidated folio.
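
The before/after behaviour can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not kernel code: `struct toy_folio` and the `toy_*` functions are hypothetical, the three booleans stand in for the folio flags and the page-table mapping, and a fresh fault is assumed to bring the folio back uptodate before it is mapped again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the three pieces of state the commit message tracks. */
struct toy_folio {
	bool uptodate;
	bool dirty;
	bool mapped;	/* a page table entry still points at the folio */
};

/* Same condition as the WARN quoted in the commit message. */
bool toy_would_warn(const struct toy_folio *f)
{
	return !f->uptodate && !f->dirty;
}

/* Model of the invalidate branch; 'unmap' selects the patched behaviour. */
void toy_invalidate(struct toy_folio *f, bool unmap)
{
	assert(f);
	if (f->mapped && unmap)
		f->mapped = false;	/* stands in for unmap_mapping_pages() */
	f->dirty = false;		/* dirty state is dropped on invalidation */
	f->uptodate = false;		/* stands in for folio_clear_uptodate() */
}

/* Model of a later write access; returns true if the WARN would fire. */
bool toy_write_access(struct toy_folio *f)
{
	bool warned;

	assert(f);
	if (!f->mapped) {
		/* Assumption: a fresh fault re-reads the folio before mapping it. */
		f->uptodate = true;
		f->mapped = true;
	}
	/* With a stale mapping the folio is dirtied in whatever state it is in. */
	warned = toy_would_warn(f);
	f->dirty = true;
	return warned;
}

/* mmap write, failed writeback, then a later write fault. */
bool toy_sequence_warns(bool unmap)
{
	struct toy_folio f = { .uptodate = true, .dirty = true, .mapped = true };

	toy_invalidate(&f, unmap);
	return toy_write_access(&f);
}
```

In this model `toy_sequence_warns(false)` reproduces the warning condition and `toy_sequence_warns(true)` does not, because the access after unmapping has to go through a new fault first.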

**Step 2.3 — Bug mechanism classification**
- Record: Memory-safety / correctness in error path. Stale PTE pointing
  at an invalidated folio → `fault_dirty_shared_page()` reaches
  `__folio_mark_dirty()` with `!uptodate && !dirty`, firing a KERNEL
  WARN. It is a bug (WARN = kernel bug signal to syzbot) and also opens
  the door to suspicious follow-on state (dirty bits on a folio the
  filesystem has already written off).

**Step 2.4 — Fix quality**
- Record: Obvious and correct. `unmap_mapping_pages()` is the standard
  MM helper for exactly this purpose (used by truncate_pagecache,
  `filemap_fault` race handling, etc.). It runs only under
  `invalidate=true` — i.e., only on the writeback-failure path — so the
  runtime cost in the non-error case is zero. Very low regression risk:
  the worst case is forcing future access to re-fault, which is benign.

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1 — Blame**
- Record: The surrounding construct (`if (folio_mapped())
  folio_clear_dirty_for_io(...)`, then `block_invalidate_folio` +
  `folio_clear_uptodate`) was added by commit `4e800c0359d9a` ("ext4:
  bugfix for mmaped pages in mpage_release_unused_pages()"), released in
  v4.9-rc1 (2016). So the incomplete handling has existed since v4.9 —
  every current stable tree is affected.

**Step 3.2 — Fixes: tag**
- Record: No `Fixes:` tag is in the commit (expected — this is a
  candidate under review). The bug is logically introduced by
  `4e800c0359d9a` (v4.9), which is present in every active stable tree.

**Step 3.3 — File history**
- Record: Recent touches to `mpage_release_unused_pages()` include
  `d8be7607de039` (ext4: Move mpage_page_done() calls after error
  handling), `fb5a5be05fb45` (convert to filemap_get_folios),
  `a297b2fcee461` (unlock unused_pages timely). None address this
  specific stale-PTE issue. This patch is self-contained; not part of a
  series.

**Step 3.4 — Author**
- Record: `Deepanshu Kartikey` is a regular syzbot-driven contributor
  (many small fixes across ext4, gfs2, netfs, mac80211). Not the
  maintainer, but the commit was reviewed and applied by ext4 maintainer
  Theodore Ts'o.

**Step 3.5 — Dependencies**
- Record: Only depends on `unmap_mapping_pages()`, which exists since
  v4.16 (mm commit `977fbdcd5986c`) — verified present in every stable
  tree checked (5.10, 5.15, 6.1, 6.6, 6.12). No patch-series dependency.

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

**Step 4.1 — Original submission**
- Record: `b4 dig -c 9b25f381...` resolved to the v3 thread at
  `https://lore.kernel.org/all/20251205055914.1393799-1-kartikey406@gmail.com/`.
  `b4 dig -a` shows this is v3 (earlier attempts v1/v2 tried to fix it
  in `ext4_page_mkwrite()` — see syzbot Discussions table linking
  `20251122015742.362444-1-...` and `20251121131305.332698-1-...`). The
  v3 approach was suggested by Matthew Wilcox and preferred by Ted Ts'o.
  Ted applied v3 directly with "Applied, thanks!" (mbox saved by b4
  shows `commit: 9b25f381de6b...`).

**Step 4.2 — Reviewers**
- Record: To/Cc from `b4 dig -w` includes `tytso@mit.edu` (ext4
  maintainer — applied), `adilger.kernel@dilger.ca` (ext4 co-
  maintainer), `willy@infradead.org` (MM maintainer — suggested the
  fix), `djwong@kernel.org`, `yi.zhang@huaweicloud.com`,
  `linux-ext4@vger.kernel.org`, `linux-kernel@vger.kernel.org`. Appropriate
  audience reviewed the change.

**Step 4.3 — Bug report**
- Record: Fetched
  https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a. Syzbot
  has a C reproducer. First crash 254 days before fetch, last 5d ago.
  Label `Fix commit: 9b25f381de6b` confirms this commit closed the
  upstream bug. The sample crash shows `__folio_mark_dirty` WARN with
  call trace `block_dirty_folio → fault_dirty_shared_page → do_wp_page →
  handle_mm_fault → do_user_addr_fault` — exact match to the commit
  message. Linux-6.6 has a sibling report labeled `origin:lts-only` and
  linux-6.1 one labeled `missing-backport`, indicating stable trees
  still need a fix.

**Step 4.4 — Related patches**
- Record: This is a single-patch series (v3); v1/v2 were alternative
  approaches to the same bug, superseded. No dependent patches.

**Step 4.5 — Stable ML**
- Record: No explicit Cc: stable in the applied patch. Syzbot label
  `missing-backport` on 6.1 is effectively a public request for stable
  coverage of this bug.

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1 — Functions in diff**
- Record: Only `mpage_release_unused_pages()` is modified.

**Step 5.2 — Callers**
- Record: Two call sites in `ext4_do_writepages()`:
  `mpage_release_unused_pages(mpd, false)` (normal completion, no
  invalidate) and `mpage_release_unused_pages(mpd, give_up_on_write)`
  (error path). The fix only triggers on the second (writeback-failure)
  path.

**Step 5.3 — Callees**
- Record: After fix adds `unmap_mapping_pages(folio->mapping,
  folio->index, folio_nr_pages(folio), false)` — standard MM helper that
  tears down PTEs for the given pgoff range (`even_cows` is false, so
  private COWed pages stay mapped). Existing
  callees: `folio_clear_dirty_for_io`, `block_invalidate_folio`,
  `folio_clear_uptodate`, `folio_unlock`.

**Step 5.4 — Call chain / reachability**
- Record: `ext4_do_writepages` is called from the ordinary writeback
  path (syscalls such as `fsync`, `sync`, `msync`, plus memory-pressure-driven
  writeback). The `give_up_on_write=true` branch is taken when
  `ext4_map_blocks()` returns an error — e.g., on corruption detected by
  the extent tree. So an unprivileged user with a mmap of a corrupt ext4
  image can trigger it, which is exactly what syzbot does.

**Step 5.5 — Similar patterns**
- Record: Related earlier fix in the same function — commit
  `4e800c0359d9a` from 2016 — covered the PTE-dirty bit but not the PTE
  itself. The new patch completes that earlier partial fix. The same
  philosophy (unmap before invalidating) is used by
  `truncate_inode_pages_range()` and `invalidate_inode_pages2_range()`
  in mm/truncate.c, so this brings ext4 in line with the mm convention.

## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

**Step 6.1 — Code exists in stable**
- Record: Verified the vulnerable pattern exists:
  - `stable/linux-6.19.y`: `folio_mapped(folio) →
    folio_clear_dirty_for_io` without unmap ✓
  - `stable/linux-6.18.y`: same ✓
  - `stable/linux-6.17.y`: same ✓
  - `stable/linux-6.12.y`: same ✓
  - `stable/linux-6.6.y`: same ✓
  - `stable/linux-6.1.y`: same ✓
  - `stable/linux-5.15.y`, `5.10.y`: same logic but pre-folio
    (`page_mapped(page) → clear_page_dirty_for_io`) — needs port to page
    API.

**Step 6.2 — Backport complications**
- Record: For 6.1..6.19 the hunk is effectively identical and should
  apply cleanly or with trivial offsets. For 5.15/5.10, the patch must
  be re-expressed using `unmap_mapping_pages(page->mapping, page->index,
  compound_nr(page), false)` or `1` for non-compound.
  `unmap_mapping_pages()` itself is available since v4.16, so available
  in all these trees.

**Step 6.3 — Already fixed?**
- Record: `git log --grep="unmap invalidated folios"` in
  `stable/linux-6.1/6.6/6.12/6.17/6.18/6.19` returned nothing. Not yet
  backported.

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

**Step 7.1 — Subsystem**
- Record: `fs/ext4/` — one of the most widely deployed filesystems.
  Criticality: IMPORTANT (affects a large population of users,
  especially enterprise and Android).

**Step 7.2 — Activity**
- Record: ext4/inode.c is very actively maintained; the specific
  `mpage_release_unused_pages()` function has had targeted fixes before
  (2016, 2024). Writeback error path is exercised any time delayed
  allocation fails.

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1 — Affected users**
- Record: Any user of ext4 who has a mmapped file where delayed block
  allocation fails (FS corruption, ENOSPC under certain delalloc
  conditions, etc.). Unprivileged users can trigger it with a
  crafted/corrupt image (syzbot proved this).

**Step 8.2 — Trigger conditions**
- Record: Mmap a file on ext4, dirty it, then force writeback to fail
  (syzbot does this with a corrupt FS image). A concrete C reproducer
  exists and still crashes unpatched 6.6.y as of ~5 days ago.

**Step 8.3 — Failure mode / severity**
- Record: Kernel WARN (`WARN_ON_ONCE(warn && !uptodate && !dirty)`),
  plus the page stays accessible via stale PTEs after invalidation. On
  systems with `panic_on_warn`, this is a kernel panic (DoS). Even
  without panic_on_warn, the invariant violation signals a genuine
  state-machine bug and can mislead subsequent writeback/truncate logic.
  Severity: MEDIUM-HIGH (WARN / potential DoS / invariant violation; a
  security-relevant WARN class that syzbot tracks specifically).

**Step 8.4 — Risk-benefit**
- Record: Benefit — closes a syzbot-tracked bug with public C
  reproducer, stops WARN/panic on corrupt FS workloads, on a core
  filesystem. Risk — fix is 15 lines, only executes in the writeback-
  error path, uses a well-understood MM API, reviewed by MM + ext4
  maintainers, and has syzbot `Tested-by`. Ratio strongly favors
  backporting.

## PHASE 9: SYNTHESIS

**Step 9.1 — Evidence**
- For: syzbot-reported + reproducible, WARN on a core filesystem, tiny
  surgical change, reviewed by the MM maintainer who suggested it and
  applied by the ext4 maintainer, Tested-by syzbot, the buggy code
  exists in every active stable tree, sibling syzbot reports on 6.1.y
  and 6.6.y explicitly labeled `missing-backport` / `origin:lts-only`,
  dependency `unmap_mapping_pages()` is present since v4.16.
- Against: No `Cc: stable` tag (not a decisive signal — this is
  precisely why it's under review). For 5.10/5.15 a minor port from
  folio to page API is needed.
- Unresolved: None that would flip the decision.

**Step 9.2 — Stable rules checklist**
1. Obviously correct and tested — YES (syzbot Tested-by; straightforward
   mm API use).
2. Fixes a real bug — YES (WARN with C reproducer).
3. Important issue — YES (kernel WARN/potential panic, invariant
   violation in core filesystem).
4. Small and contained — YES (+15/-1 in one function, error path only).
5. No new features — YES.
6. Can apply to stable — YES for 6.1+ cleanly; minor hand-port for
   5.15/5.10.

**Step 9.3 — Exception category**
- Not a device-ID/quirk/DT case; this is a straightforward bug fix that
  qualifies on its own merits.

**Step 9.4 — Decision**
- Backport.

## Verification
- [Phase 1] Parsed tags from the commit body:
  `Reported-by: syzbot+b0a0670332b6b3230a0a@...`, `Tested-by: syzbot+...`,
  `Suggested-by: Matthew Wilcox`, `Signed-off-by: Theodore Ts'o`,
  `Closes: syzkaller URL`. No Fixes: or Cc: stable (expected for candidates).
- [Phase 2] Ran `git show 9b25f381de6b...` — confirmed diff is +15/-1 in
  `fs/ext4/inode.c`, only inside `mpage_release_unused_pages()`'s `if
  (invalidate)` block, adds `unmap_mapping_pages(folio->mapping,
  folio->index, folio_nr_pages(folio), false)`.
- [Phase 3] `git log --oneline --grep="mpage_release_unused_pages"` —
  found 8 historical touches including the 2016 partial fix
  `4e800c0359d9a` ("ext4: bugfix for mmaped pages..."). `git describe
  --contains 4e800c0359d9a` → v4.9-rc1 — confirms the vulnerable
  construct has been in stable trees since v4.9.
- [Phase 3] Confirmed no Fixes: tag in commit; logical predecessor is
  `4e800c0359d9a`.
- [Phase 3] `git log --author="Deepanshu Kartikey"` — author is a
  syzbot-focused contributor with many accepted small fixes across
  subsystems.
- [Phase 4] `b4 dig -c 9b25f381de6b...` returned the v3 submission URL
  `https://lore.kernel.org/all/20251205055914.1393799-1-kartikey406@gmail.com/`.
- [Phase 4] `b4 dig -c ... -a` showed this is v3; earlier v1/v2 took a
  different (rejected) approach in `ext4_page_mkwrite()`.
- [Phase 4] `b4 dig -c ... -w` confirmed willy, tytso, adilger, djwong,
  yi.zhang, linux-ext4 were CC'd and reviewed.
- [Phase 4] `b4 dig -c ... -m` and read the mbox — Ted Ts'o applied v3
  with "Applied, thanks!", commit `9b25f381de6b`.
- [Phase 4] Fetched syzkaller URL — confirmed public C reproducer, `Fix
  commit: 9b25f381de6b`, first crashed 254 days ago and last seen
  5 days ago on unpatched trees. Sibling bugs `a92b613efd5e` (linux-6.1,
  label `missing-backport`) and `d429f1fb4bc9` (linux-6.6, label
  `origin:lts-only`) indicate stable trees still need the fix.
- [Phase 5] Manually traced: only two call sites in
  `ext4_do_writepages()`, the patched branch only hits the
  `give_up_on_write` error path. Confirmed `unmap_mapping_pages` is used
  elsewhere in mm/ with same pattern (truncate/invalidate).
- [Phase 6] Read `fs/ext4/inode.c` in stable/linux-6.1.y, 6.6.y, 6.12.y,
  6.17.y, 6.18.y, 6.19.y — all contain the unchanged vulnerable `if
  (folio_mapped(folio)) folio_clear_dirty_for_io(folio);
  block_invalidate_folio(...); folio_clear_uptodate(...);` pattern. Read
  5.10.y/5.15.y — contain the pre-folio `page_mapped(page) /
  clear_page_dirty_for_io / block_invalidatepage / ClearPageUptodate`
  pattern (needs small port).
- [Phase 6] `git log stable/linux-6.{1,6,12,17,18,19}.y --grep="unmap
  invalidated folios"` — none contain this fix yet.
- [Phase 6] Confirmed `unmap_mapping_pages` is defined in `mm/memory.c`
  of every stable tree checked; `git describe --contains 977fbdcd5986c`
  = v4.16-rc1.
- [Phase 7] `fs/ext4/` is a core, widely-used filesystem;
  `mpage_release_unused_pages()` touched by multiple targeted fixes over
  the years.
- [Phase 8] Failure mode verified from the sample syzbot crash:
  `block_dirty_folio → fault_dirty_shared_page → do_wp_page →
  handle_mm_fault → exc_page_fault`, exactly matching the commit
  message's description.
- UNVERIFIED: I did not attempt an actual patch apply against
  5.10.y/5.15.y — expect a minor manual port (folio→page API), not a
  semantic change.

The fix is small, surgical, obviously correct, syzbot-reported with a
reproducer, reviewed by ext4+MM maintainers, touches only the
writeback-error path, and the vulnerable code pattern is present in every active
stable tree. It satisfies all stable-kernel rules with very low
regression risk.

**YES**

 fs/ext4/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1123d995494b5..025ea8f0c41bd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1757,8 +1757,22 @@ static void mpage_release_unused_pages(struct mpage_da_data *mpd,
 			BUG_ON(!folio_test_locked(folio));
 			BUG_ON(folio_test_writeback(folio));
 			if (invalidate) {
-				if (folio_mapped(folio))
+				if (folio_mapped(folio)) {
 					folio_clear_dirty_for_io(folio);
+					/*
+					 * Unmap folio from page
+					 * tables to prevent
+					 * subsequent accesses through
+					 * stale PTEs. This ensures
+					 * future accesses trigger new
+					 * page faults rather than
+					 * reusing the invalidated
+					 * folio.
+					 */
+					unmap_mapping_pages(folio->mapping,
+						folio->index,
+						folio_nr_pages(folio), false);
+				}
 				block_invalidate_folio(folio, 0,
 						folio_size(folio));
 				folio_clear_uptodate(folio);
-- 
2.53.0


^ permalink raw reply related

* [PATCH AUTOSEL 7.0-5.10] ext2: replace BUG_ON with WARN_ON_ONCE in ext2_get_blocks
From: Sasha Levin @ 2026-04-20 13:18 UTC (permalink / raw)
  To: patches, stable
  Cc: Milos Nikic, Jan Kara, Sasha Levin, jack, linux-ext4,
	linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Milos Nikic <nikic.milos@gmail.com>

[ Upstream commit 0cf9c58bf654d0f27abe18005281dbf9890de401 ]

If ext2_get_blocks() is called with maxblocks == 0, it currently triggers
a BUG_ON(), causing a kernel panic.

While this condition implies a logic error in the caller, a filesystem
should not crash the system due to invalid arguments.

Replace the BUG_ON() with a WARN_ON_ONCE() to provide a stack trace for
debugging, and return -EINVAL to handle the error gracefully.

Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Link: https://patch.msgid.link/20260207010617.216675-1-nikic.milos@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have all the information needed for my analysis. Let me compile
the complete assessment.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem:** ext2
- **Action verb:** "replace" (BUG_ON with WARN_ON_ONCE)
- **Summary:** Converts a crash-inducing assertion into a graceful error
  return in ext2 block allocation

Record: [ext2] [replace] [Convert BUG_ON to WARN_ON_ONCE + return
-EINVAL for maxblocks==0 assertion]

### Step 1.2: Tags
- **Signed-off-by:** Milos Nikic (author) and Jan Kara (ext2 maintainer,
  merged)
- **Link:** `https://patch.msgid.link/20260207010617.216675-1-nikic.milos@gmail.com`
- No Fixes: tag (expected for AUTOSEL candidate)
- No Reported-by: tag
- No Cc: stable tag

Record: Signed-off by ext2 maintainer Jan Kara. No Fixes, no
Reported-by, no syzbot. No explicit stable nomination.

### Step 1.3: Commit Body
The commit explains that when `ext2_get_blocks()` is called with
`maxblocks == 0`, it triggers a `BUG_ON()` causing a kernel panic. The
author argues a filesystem should not crash the system due to invalid
arguments.

Record: [Bug: BUG_ON causes kernel panic on invalid argument] [Symptom:
kernel panic/crash] [Root cause: overly aggressive assertion for a
condition that should be handled gracefully]

### Step 1.4: Hidden Bug Fix Detection
This is a defensive hardening change. BUG_ON() is itself a bug when a
graceful recovery is possible. The kernel community has been
systematically converting such assertions.

Record: This is a fix for "BUG_ON is a bug" - the assertion behavior is
itself the problem.

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **Files:** `fs/ext2/inode.c` only
- **Lines changed:** -1 / +2 (net +1 line)
- **Function modified:** `ext2_get_blocks()`
- **Scope:** Single-file, single-function, surgical fix

### Step 2.2: Code Flow Change
**Before:** `BUG_ON(maxblocks == 0)` — triggers kernel panic, system
crashes
**After:** `if (WARN_ON_ONCE(maxblocks == 0)) return -EINVAL;` — prints
stack trace once, returns error code gracefully

The change affects the entry validation of `ext2_get_blocks()`, before
any actual work is done.
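
The difference between the two behaviours can be sketched in user space. This is a minimal illustration, not the kernel macro: `toy_warn_on_once()` and `toy_get_blocks()` are hypothetical names, the warning goes to stderr instead of producing a stack trace, and the non-error return value is just a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts emitted warnings so the once-only behaviour can be observed. */
int toy_warnings_emitted;

/*
 * Hypothetical helper: report 'what' the first time 'cond' is true for a
 * given call site, and always return 'cond' so the caller can bail out.
 */
bool toy_warn_on_once(bool cond, bool *already_warned, const char *what)
{
	assert(already_warned && what);
	if (cond && !*already_warned) {
		*already_warned = true;
		toy_warnings_emitted++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Toy entry check: reject maxblocks == 0 with -EINVAL instead of
 * aborting, otherwise return a placeholder count of one block.
 */
int toy_get_blocks(unsigned long maxblocks)
{
	static bool warned;	/* one flag per call site */

	if (toy_warn_on_once(maxblocks == 0, &warned, "maxblocks == 0"))
		return -EINVAL;
	return 1;
}
```

Calling `toy_get_blocks(0)` repeatedly returns `-EINVAL` every time but prints the warning only once, which is the property that makes this pattern safer than an unconditional abort for a condition the caller can recover from.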

### Step 2.3: Bug Mechanism
Category: **Logic/correctness fix** (defensive assertion improvement).
The BUG_ON() unconditionally panics the system for a condition that can
be handled by returning an error.

### Step 2.4: Fix Quality
- Obviously correct: YES. This is a standard, well-understood pattern.
- Minimal: YES. 2 lines.
- Regression risk: Extremely low. The only behavior change is: if
  `maxblocks == 0`, instead of crashing, return -EINVAL. Both callers
  (`ext2_get_block` and `ext2_iomap_begin`) check return values
  properly.

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: Blame
The `BUG_ON(maxblocks == 0)` was introduced by commit `7ba3ec5749ddb6`
(Jan Kara, 2013-11-05, "ext2: Fix fs corruption in ext2_get_xip_mem()").
This commit first appeared in v3.13-rc1, meaning the buggy BUG_ON has
been present in **every stable tree** since v3.13 (~12 years).

### Step 3.2: Original Commit Context
The original commit `7ba3ec5749ddb6` fixed a real bug in
`ext2_get_xip_mem()` where 0 blocks were being requested. The BUG_ON was
added as a defensive assertion to catch similar bugs. The actual XIP bug
was also fixed in the same commit. The BUG_ON was always a "shouldn't
happen" assertion.

### Step 3.3: File History
`fs/ext2/inode.c` has had moderate churn (~44 changes since v5.15, ~65
since v5.4), but the specific BUG_ON line has been untouched since 2013.
No related fixes in this area.

### Step 3.4: Author
Milos Nikic is not the subsystem maintainer, but the patch was accepted
and signed-off by Jan Kara, who is the ext2 maintainer and who
originally added the BUG_ON.

### Step 3.5: Dependencies
None. This is a standalone 2-line change with no dependencies.

## PHASE 4: MAILING LIST RESEARCH

From the mbox thread:
1. **v1 submitted:** Feb 6, 2026
2. **Author ping:** Feb 26, 2026 — "Just a friendly ping on this patch"
3. **Jan Kara reply:** Feb 27, 2026 — "Thanks merged now."

No NAKs, no objections, no explicit stable nomination. Minimal
discussion — the maintainer accepted it without requesting changes. No
one mentioned stable.

Record: Single-version patch, accepted without changes by ext2
maintainer.

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1: Functions Modified
Only `ext2_get_blocks()` is modified (static function in
`fs/ext2/inode.c`).

### Step 5.2: Callers
`ext2_get_blocks()` is called from:
1. `ext2_get_block()` (line 791) — where `max_blocks = bh_result->b_size
   >> inode->i_blkbits`
2. `ext2_iomap_begin()` (line 835) — where `max_blocks = (length + (1 <<
   blkbits) - 1) >> blkbits`

`ext2_get_block()` is called from numerous VFS paths:
`mpage_read_folio`, `mpage_readahead`, `block_write_begin`,
`generic_block_bmap`, `mpage_writepages`, `block_truncate_page`, and
`__block_write_begin`.

### Step 5.3-5.4: Reachability
The code is reachable from common filesystem operations (read, write,
truncate, bmap). In `ext2_get_block()`, `max_blocks` could theoretically
be 0 if `bh_result->b_size` is less than `(1 << i_blkbits)`. In
`ext2_iomap_begin()`, `max_blocks` would be 0 only if `length == 0`.
Both should be prevented by callers, but are not explicitly validated in
the callee.
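
The two `max_blocks` computations above can be checked with a small
userspace sketch. The helper names are hypothetical and are not kernel
functions; only the arithmetic mirrors the expressions quoted in
Step 5.2.

```c
#include <stdint.h>

/*
 * Hypothetical helper: block count derived the way ext2_get_block()
 * does it, by a truncating shift of the buffer size.
 */
static unsigned long blocks_from_bh(uint64_t b_size, unsigned int blkbits)
{
	return (unsigned long)(b_size >> blkbits);
}

/*
 * Hypothetical helper: block count derived the way ext2_iomap_begin()
 * does it, rounding the byte length up to whole blocks.
 */
static unsigned long blocks_from_length(uint64_t length, unsigned int blkbits)
{
	return (unsigned long)((length + (1ULL << blkbits) - 1) >> blkbits);
}
```

With 4096-byte blocks (`blkbits == 12`), a 1024-byte `b_size` makes the
first helper return 0, while the second returns 0 only for
`length == 0`.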

### Step 5.5: Similar Patterns
There are other BUG_ON instances in ext2 (`balloc.c`, `dir.c`, `acl.c`).
The kernel has been systematically converting such assertions across
filesystems (e.g., `ext4: convert some BUG_ON's in mballoc`, `nilfs2:
convert BUG_ON in nilfs_finish_roll_forward`, `quota: Remove BUG_ON from
dqget()`).

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Code Existence in Stable
The BUG_ON line exists in ALL active stable trees (introduced in v3.13).
The line `BUG_ON(maxblocks == 0)` at the same location is unchanged
since 2013.

### Step 6.2: Backport Complications
The patch should apply cleanly to all stable trees — the surrounding
code is identical across all branches (verified via blame: the context
lines are from 2005/2007).

### Step 6.3: Related Fixes in Stable
No related fixes for this issue are already in stable.

## PHASE 7: SUBSYSTEM CONTEXT

### Step 7.1: Subsystem Criticality
- **Subsystem:** ext2 filesystem
- **Criticality:** IMPORTANT — ext2 is still used in embedded systems,
  older systems, and as a simple/reliable FS choice
- A panic in filesystem code can cause data loss and is especially
  disruptive

### Step 7.2: Activity
ext2 is a mature, low-activity subsystem. The code being fixed has been
unchanged for about 12 years.

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Users
All ext2 users across all kernel versions since v3.13. Ext2 is still
used in embedded, IoT, and some server configurations.

### Step 8.2: Trigger Conditions
Currently, no known caller passes maxblocks == 0. However:
- A corrupted filesystem image could potentially lead to invalid
  parameters
- A future kernel bug (like the XIP bug that motivated the BUG_ON) could
  trigger it
- The condition is a "shouldn't happen" scenario, but if it does, the
  system panics

### Step 8.3: Failure Mode Severity
- **Without fix:** Kernel panic (BUG_ON) → CRITICAL (system crash,
  potential data loss)
- **With fix:** WARN_ON_ONCE + -EINVAL → LOW (warning message, graceful
  error handling)

### Step 8.4: Risk-Benefit Ratio
- **Benefit:** Prevents kernel panic if condition ever triggers.
  Converts crash to graceful error.
- **Risk:** Essentially zero. 2 lines, obviously correct,
  well-understood pattern. The -EINVAL return is properly handled by
  both callers.
- **Ratio:** Extremely favorable.
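
The behavioural difference can be illustrated with a userspace model.
This is only a sketch: `warn_once()` and `get_blocks_model()` are
hypothetical stand-ins, not the kernel's `WARN_ON_ONCE()` or
`ext2_get_blocks()`, and the "one block mapped" return value is an
assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many warnings were actually printed (for illustration). */
static int warn_count;

/*
 * Hypothetical stand-in for WARN_ON_ONCE(): report the condition the
 * first time it is true, and always return the condition's value.
 */
static bool warn_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Model of the patched entry check: a zero-block request is rejected
 * with -EINVAL instead of halting, and the caller sees an error.
 */
static int get_blocks_model(unsigned long maxblocks)
{
	static bool warned;

	if (warn_once(maxblocks == 0, &warned, "maxblocks == 0"))
		return -EINVAL;

	return 1; /* stand-in for "one block mapped" */
}
```

Calling `get_blocks_model(0)` repeatedly returns `-EINVAL` each time
but prints the warning only once, which is the property that keeps a
repeated trigger from flooding the log.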

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence Compilation

**FOR backporting:**
- Prevents kernel panic (BUG_ON → WARN_ON_ONCE)
- 2-line change, minimal regression risk
- Obviously correct pattern
- Accepted by ext2 maintainer Jan Kara (who wrote the original BUG_ON)
- Code exists in ALL stable trees since v3.13
- Will apply cleanly to all stable trees
- Filesystem code should never crash the system on input validation
- No dependencies, completely standalone

**AGAINST backporting:**
- No known trigger in current code — the condition is theoretical
- No syzbot report, no user reports of the BUG_ON firing
- Could be considered defensive hardening rather than fixing a real bug
- No explicit stable nomination by anyone

### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** — trivial, well-understood
   pattern, merged by maintainer
2. Fixes a real bug? **YES** — BUG_ON causing unnecessary kernel panic
   IS a bug in filesystem code
3. Important issue? **YES if triggered** — kernel panic is CRITICAL
   severity
4. Small and contained? **YES** — 2 lines in 1 file
5. No new features or APIs? **YES** — purely defensive improvement
6. Can apply to stable? **YES** — code is unchanged since v3.13

### Step 9.3: Exception Categories
Not strictly an exception category, but "BUG_ON to WARN_ON_ONCE
conversion in filesystem code" is a well-accepted pattern that has been
backported before.

### Step 9.4: Decision
The change is a tiny (2-line), obviously correct fix that prevents a
kernel panic in a filesystem. While the trigger condition is currently
theoretical, the risk of the backport is essentially zero, and the
benefit if triggered (preventing a crash vs. graceful error) is
significant. The ext2 maintainer accepted it. The BUG_ON has existed in
all stable trees for 11+ years.

## Verification

- [Phase 1] Parsed tags: Signed-off-by Milos Nikic and Jan Kara. Link to
  lore. No Fixes/Reported-by/Cc:stable.
- [Phase 2] Diff analysis: 1 line removed (`BUG_ON`), 2 lines added (`if
  (WARN_ON_ONCE(...)) return -EINVAL`), in `ext2_get_blocks()`.
- [Phase 3] git blame: `BUG_ON(maxblocks == 0)` introduced by commit
  `7ba3ec5749ddb6` (Jan Kara, 2013-11-05), first in v3.13-rc1. Present
  in ALL stable trees.
- [Phase 3] git show `7ba3ec5749ddb6`: Original commit was "ext2: Fix fs
  corruption in ext2_get_xip_mem()" — BUG_ON added as defensive
  assertion alongside actual XIP fix.
- [Phase 3] git log -20 on file: No related fixes or prerequisites for
  this change.
- [Phase 3] Author check: Milos Nikic also authored `bac3190a8e79b`
  (jbd2 hardening). Jan Kara is ext2 maintainer.
- [Phase 4] b4 mbox retrieved thread: 3 messages — original patch,
  author ping, Jan Kara "Thanks merged now." No review comments, no
  NAKs, no stable nomination.
- [Phase 5] Callers: `ext2_get_block()` (VFS callback, many callers) and
  `ext2_iomap_begin()` (iomap path). Both handle negative return values
  properly.
- [Phase 5] Trigger analysis: `max_blocks` computed from `b_size >>
  i_blkbits` or ceiling division of `length`. Currently prevented by
  callers but not validated in callee.
- [Phase 6] BUG_ON line unchanged since 2013, applies cleanly to all
  stable trees.
- [Phase 6] ext2/inode.c: 44 changes since v5.15 but none near the
  BUG_ON line.
- [Phase 8] Risk: essentially zero (2 lines, pattern is standard).
  Benefit: prevents panic if triggered.
- UNVERIFIED: Could not access lore.kernel.org web UI due to Anubis bot
  protection, but successfully retrieved full thread via b4 mbox.

**YES**

 fs/ext2/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index dbfe9098a1245..18bf1a91dbc24 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -638,7 +638,8 @@ static int ext2_get_blocks(struct inode *inode,
 	int count = 0;
 	ext2_fsblk_t first_block = 0;

-	BUG_ON(maxblocks == 0);
+	if (WARN_ON_ONCE(maxblocks == 0))
+		return -EINVAL;

 	depth = ext2_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

-- 
2.53.0

^ permalink raw reply related

* [PATCH AUTOSEL 7.0-6.6] ext2: avoid drop_nlink() during unlink of zero-nlink inode in ext2_unlink()
From: Sasha Levin @ 2026-04-20 13:17 UTC (permalink / raw)
  To: patches, stable
  Cc: Ziyi Guo, Jan Kara, Sasha Levin, jack, linux-ext4, linux-kernel
In-Reply-To: <20260420132314.1023554-1-sashal@kernel.org>

From: Ziyi Guo <n7l8m4@u.northwestern.edu>

[ Upstream commit 19134a133184fcc49c41cf42797cb2e7fef76065 ]

ext2_unlink() calls inode_dec_link_count() unconditionally, which
invokes drop_nlink(). If the inode was loaded from a corrupted disk
image with i_links_count == 0, drop_nlink() triggers
WARN_ON(inode->i_nlink == 0).

Follow the ext4 pattern from __ext4_unlink(): check i_nlink before
decrementing. If already zero, skip the decrement.

Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
Link: https://patch.msgid.link/20260211022052.973114-1-n7l8m4@u.northwestern.edu
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem:** ext2 (filesystem)
- **Action verb:** "avoid" - indicates preventing an incorrect behavior
- **Summary:** Prevent `drop_nlink()` WARN_ON when unlinking an inode
  that already has zero link count

### Step 1.2: Tags
- **Link:**
  `https://patch.msgid.link/20260211022052.973114-1-n7l8m4@u.northwestern.edu`
  - original patch submission
- **Signed-off-by:** Ziyi Guo (author), Jan Kara (ext2/ext4 maintainer)
- No Fixes: tag (expected for this review pipeline)
- No Reported-by: tag, but the commit describes a specific WARN_ON
  trigger from corrupted disk images
- No Cc: stable (expected)

### Step 1.3: Commit Body
- **Bug:** `ext2_unlink()` unconditionally calls
  `inode_dec_link_count()`, which calls `drop_nlink()`. If the inode was
  loaded from a corrupted disk with `i_links_count == 0`, `drop_nlink()`
  triggers `WARN_ON(inode->i_nlink == 0)`.
- **Failure mode:** Kernel WARN_ON triggered, and then `i_nlink`
  underflows to `(unsigned int)-1`.
- **Fix approach:** Follow the ext4 pattern from `__ext4_unlink()`:
  check `i_nlink` before decrementing.

### Step 1.4: Hidden Bug Fix?
This is an explicit bug fix, not disguised. It directly addresses a
WARN_ON trigger and an nlink underflow from corrupted disk images.

---

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **File:** `fs/ext2/namei.c`: 1 line removed, 4 lines added (2 of them
  blank), net +3 lines per the diffstat
- **Function modified:** `ext2_unlink()`
- **Scope:** Single-file, single-function, surgical fix

### Step 2.2: Code Flow Change
**Before:** `inode_dec_link_count(inode)` is called unconditionally
after a successful directory entry deletion.

**After:** `inode_dec_link_count(inode)` is only called if
`inode->i_nlink > 0`.

### Step 2.3: Bug Mechanism
Category: **Logic/correctness fix + defensive coding against
corruption**

The call chain is:
1. `ext2_unlink()` → `inode_dec_link_count()` (fs.h inline)
2. `inode_dec_link_count()` → `drop_nlink()` (fs/inode.c)
3. `drop_nlink()` has `WARN_ON(inode->i_nlink == 0)` followed by
   `inode->__i_nlink--`

Verified from `fs/inode.c` lines 416-422:
```c
void drop_nlink(struct inode *inode)
{
    WARN_ON(inode->i_nlink == 0);
    inode->__i_nlink--;
    ...
}
```

With a corrupted disk where `i_links_count == 0`, this triggers the WARN
and underflows the nlink counter.
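
The underflow and the effect of the guard can be reproduced in
userspace. This is a minimal sketch: `struct inode_model` and the three
functions are hypothetical models, not kernel code, and the WARN_ON is
omitted so that only the counter arithmetic is shown.

```c
#include <limits.h>

/* Hypothetical model of the one field that matters here. */
struct inode_model {
	unsigned int nlink;
};

/* Models the decrement in drop_nlink(); unsigned arithmetic wraps. */
static void drop_nlink_model(struct inode_model *inode)
{
	inode->nlink--;
}

/* Models the old ext2_unlink() behaviour: decrement unconditionally. */
static void unlink_unguarded(struct inode_model *inode)
{
	drop_nlink_model(inode);
}

/* Models the patched behaviour: skip the decrement when already zero. */
static void unlink_guarded(struct inode_model *inode)
{
	if (inode->nlink)
		drop_nlink_model(inode);
}
```

Starting from `nlink == 0`, the unguarded path leaves the counter at
`UINT_MAX`, while the guarded path leaves it at 0. For a normal inode
with `nlink >= 1` the two paths behave identically.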

### Step 2.4: Fix Quality
- **Obviously correct:** Yes — if nlink is already 0, don't decrement
  further.
- **Minimal/surgical:** Yes — 2 lines of logic added.
- **Regression risk:** Extremely low — only affects corrupted inodes
  with zero nlink. Normal inodes always have nlink >= 1 during unlink.
- **Precedent:** The ext4 filesystem has had the identical check since
  2019 (commit c7df4a1ecb857, by Theodore Ts'o, Cc: stable@kernel.org).

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: Blame
The unconditional `inode_dec_link_count(inode)` at the unlink path
traces to `a513b035eadf80` (2006, Alexey Dobriyan - introduced the
`inode_dec_link_count` wrapper) but the underlying unlink logic is from
`1da177e4c3f41` (2005, Linus Torvalds, Linux 2.6.12-rc2). **This buggy
code has been present since the very first kernel in git.**

### Step 3.2: Fixes Tag
No Fixes: tag present. This is expected for the review pipeline. The bug
has existed since the origin of the codebase.

### Step 3.3: File History
Recent changes to `fs/ext2/namei.c` are all refactoring (folio
conversion, ctime accessors, idmap). None are related to nlink handling.
The fix is standalone with no prerequisites.

### Step 3.4: Author
Ziyi Guo is not a regular ext2 contributor. However, the commit was
signed by **Jan Kara** (`jack@suse.cz`), who is the ext2/ext4
maintainer. This gives the fix high credibility.

### Step 3.5: Dependencies
The fix has **zero dependencies**. It only adds an `if` guard around an
existing function call. No new functions, no new data structures.

---

## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH

### Step 4.1-4.2: Patch Discussion
Lore was not accessible (Anubis protection). b4 dig could not match
because the HEAD SHA was incorrectly used. However, the Link: tag
confirms the patch was submitted and applied through Jan Kara's tree.

### Step 4.3: Bug Context
- The ext4 equivalent fix (c7df4a1ecb857) references bugzilla.kernel.org
  bug 205433 — a real user-reported bug from corrupted disk images.
- The minix equivalent fixes (`009a2ba40303c`, `d3e0e8661ceb4`) were
  **syzbot-reported**, confirming this class of bug is found by fuzzers
  against ext2-like filesystems.

### Step 4.4: Related Patches
Multiple filesystems have received the exact same fix:
- **ext4:** c7df4a1ecb857 (2019, Cc: stable, by Theodore Ts'o)
- **minix rename:** 009a2ba40303c (syzbot-reported, Reviewed-by: Jan
  Kara)
- **minix rmdir:** d3e0e8661ceb4 (syzbot-reported, Reviewed-by: Jan
  Kara)
- **fat:** 8cafcb881364a (parent link count underflow in rmdir)
- **f2fs:** 42cb74a92adaf (prevent kernel warning from corrupted image)

This is a well-understood class of bug, and ext2 still lacked the guard
that the filesystems listed above already have.

### Step 4.5: Stable History
The ext4 equivalent was explicitly tagged `Cc: stable@kernel.org` by
Theodore Ts'o, establishing a precedent that this class of fix belongs
in stable.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1-5.2: Functions
- **Modified function:** `ext2_unlink()`, called from:
  - VFS unlink path (`.unlink = ext2_unlink` in
    `ext2_dir_inode_operations`)
  - `ext2_rmdir()` (line 308)
- `ext2_rename()` does not call it directly.
- VFS unlink is triggered by the `unlink()` syscall, so this is a
  **common, userspace-reachable path**.

### Step 5.3-5.4: Call Chain
`unlink()` syscall → `do_unlinkat()` → `vfs_unlink()` → `ext2_unlink()`
→ `inode_dec_link_count()` → `drop_nlink()` → **WARN_ON**

The buggy path is directly reachable from userspace with any corrupted
ext2 filesystem image.

### Step 5.5: Similar Patterns
`ext2_rmdir()` also has an unprotected `inode_dec_link_count(inode)` at
line 311 (after calling `ext2_unlink`). This is a separate path that
could benefit from a similar guard, but the current fix addresses the
most direct and common case.

---

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Code Exists in Stable
Verified: the exact same code exists in the 6.19.12 stable tree —
`inode_dec_link_count(inode)` at the same location in `ext2_unlink()`.
The buggy code has been present since Linux 2.6.12 and is in **every
active stable tree**.

### Step 6.2: Backport Complications
The code in the 7.0 tree and the 6.19.12 stable tree is **identical**
around `ext2_unlink()`. The patch will apply cleanly. Older stable trees
(pre-6.6) that use page-based rather than folio-based code will have the
same surrounding logic — the fix only touches the `inode_dec_link_count`
line, which hasn't changed.

### Step 6.3: Related Fixes in Stable
No equivalent ext2 fix is already in stable. The ext4 fix
(c7df4a1ecb857) went to stable in 2019.

---

## PHASE 7: SUBSYSTEM CONTEXT

### Step 7.1: Subsystem Criticality
- **Subsystem:** ext2 filesystem (fs/ext2/)
- **Criticality:** IMPORTANT — ext2 is widely used in embedded systems,
  USB drives, small partitions, and as a simple filesystem for testing.
  Any machine that mounts an ext2 filesystem is affected.

### Step 7.2: Activity
ext2 is a mature, stable subsystem with infrequent changes. The bug has
been present for 20+ years, so every deployed stable kernel carries it,
which strengthens the case for a stable backport.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Population
All users who mount ext2 filesystems — this includes:
- Embedded systems, USB drives, legacy partitions
- Any system handling potentially corrupted or malicious ext2 images

### Step 8.2: Trigger Conditions
- **Trigger:** Mount a corrupted ext2 filesystem with an inode that has
  `i_links_count == 0`, then unlink that file.
- **Likelihood:** Uncommon in normal usage, but straightforward with
  corrupted/malicious disk images.
- **Unprivileged user:** Yes — can be triggered by any user with write
  access to the mounted filesystem (or via auto-mounted USB devices).
- **Security relevance:** Mounting crafted filesystem images is a known
  attack vector.

### Step 8.3: Failure Mode Severity
- **WARN_ON trigger:** Produces a kernel warning with full stack trace
  (log spam, potential for denial-of-service if warnings cause system
  slowdown or panic-on-warn)
- **nlink underflow:** `i_nlink` wraps to `(unsigned int)-1`, which
  corrupts the inode state in memory
- **Severity:** MEDIUM-HIGH (WARN_ON + data corruption in inode state)
- On systems with `panic_on_warn`, this becomes a **kernel panic**
  (CRITICAL).

### Step 8.4: Risk-Benefit Ratio
- **Benefit:** Prevents WARN_ON, nlink underflow, and potential panic on
  corrupted ext2 images. Established pattern across 5+ filesystems.
- **Risk:** Near-zero. The fix is a 2-line `if` guard that only
  activates on corrupted inodes. Normal operations are completely
  unaffected.
- **Ratio:** Very favorable for backporting.

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence Summary

**FOR backporting:**
- Fixes a real bug: WARN_ON + nlink underflow on corrupted disk images
- Extremely small and surgical: 2 lines of code
- Obviously correct: simple `if (inode->i_nlink)` guard
- Follows established pattern from ext4 (which was CC'd stable by Ted
  Ts'o)
- Same class of fix applied to minix (syzbot-reported), fat, f2fs
- Signed by ext2 maintainer Jan Kara
- Buggy code exists in ALL stable trees (since 2005)
- Patch applies cleanly to stable trees
- No dependencies on other commits
- Zero regression risk for normal operations
- On `panic_on_warn` systems, this prevents a kernel panic

**AGAINST backporting:**
- No explicit Fixes: tag (expected, not a negative signal)
- Only the unlink path is fixed; ext2_rmdir has a similar unprotected
  path (but this fix is still standalone and valuable)
- Trigger requires corrupted disk image (not common in normal usage)

### Step 9.2: Stable Rules Checklist
1. **Obviously correct and tested?** YES — trivial guard, same pattern
   as ext4 (in stable since 2019)
2. **Fixes a real bug?** YES — WARN_ON and nlink underflow from
   corrupted images
3. **Important issue?** YES — kernel warning, potential data corruption,
   panic-on-warn
4. **Small and contained?** YES — 2 lines, single function, single file
5. **No new features/APIs?** YES — purely defensive check
6. **Applies to stable?** YES — verified identical code in 6.19.12

### Step 9.3: Exception Categories
Not needed — this meets standard stable criteria as a bug fix.

### Step 9.4: Decision
Clear YES. This is a textbook stable backport: a tiny, obviously correct
fix that prevents a kernel warning and nlink corruption on mounted
corrupted ext2 filesystems, following an established pattern across
multiple filesystems, signed by the subsystem maintainer.

---

## Verification

- [Phase 1] Parsed tags: Link to patch.msgid.link, Signed-off-by Jan
  Kara (ext2 maintainer)
- [Phase 2] Diff analysis: 2 lines added — `if (inode->i_nlink)` guard
  around `inode_dec_link_count(inode)` in `ext2_unlink()`
- [Phase 2] Confirmed `drop_nlink()` in `fs/inode.c:416-422` has
  `WARN_ON(inode->i_nlink == 0)` followed by `inode->__i_nlink--`
- [Phase 2] Confirmed `inode_dec_link_count()` in
  `include/linux/fs.h:2251-2255` calls `drop_nlink()` then
  `mark_inode_dirty()`
- [Phase 3] git blame: buggy `inode_dec_link_count` introduced in
  a513b035eadf80 (2006), underlying logic from 1da177e4c3f41 (2005,
  original kernel)
- [Phase 3] No prerequisites found; fix is standalone
- [Phase 3] git log: no related ext2 unlink fixes in recent history
- [Phase 4] b4 dig: could not match due to commit not being in this
  tree; lore.kernel.org blocked by Anubis
- [Phase 4] ext4 equivalent fix c7df4a1ecb857 (Theodore Ts'o, 2019)
  verified — has `Cc: stable@kernel.org` and `Reviewed-by: Andreas
  Dilger`
- [Phase 4] minix equivalents 009a2ba40303c and d3e0e8661ceb4 verified —
  syzbot-reported, Reviewed-by Jan Kara
- [Phase 5] Callers: ext2_unlink is VFS `.unlink` handler, reachable
  from `unlink()` syscall — common path
- [Phase 5] Also called from ext2_rmdir (which has its own unprotected
  inode_dec_link_count)
- [Phase 6] Verified: 6.19.12 stable tree has identical unfixed code at
  same location in ext2_unlink()
- [Phase 6] Patch applies cleanly — surrounding context is identical
- [Phase 7] ext2 is a mature, widely-deployed filesystem — IMPORTANT
  criticality
- [Phase 8] Failure mode: WARN_ON + nlink underflow; CRITICAL on
  panic_on_warn systems
- [Phase 8] Risk: near-zero (2-line if guard, only activates on
  corrupted inodes)

**YES**

 fs/ext2/namei.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index bde617a66cecd..728c487308baf 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -293,7 +293,10 @@ static int ext2_unlink(struct inode *dir, struct dentry *dentry)
 		goto out;

 	inode_set_ctime_to_ts(inode, inode_get_ctime(dir));
-	inode_dec_link_count(inode);
+
+	if (inode->i_nlink)
+		inode_dec_link_count(inode);
+
 	err = 0;
 out:
 	return err;
-- 
2.53.0

^ permalink raw reply related

* [PATCH v8 22/22] xfs: enable ro-compat fs-verity flag
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Finalize fs-verity integration in XFS by making the kernel fs-verity
aware via the ro-compat flag.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add spaces]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4dff29659e40..0ce46c234b9c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -378,8 +378,9 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
-		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_VERITY)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.51.2


^ permalink raw reply related

* [PATCH v8 21/22] xfs: introduce health state for corrupted fsverity metadata
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Report corrupted fsverity descriptor through health system.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h     |  1 +
 fs/xfs/libxfs/xfs_health.h |  4 +++-
 fs/xfs/xfs_fsverity.c      | 13 ++++++++++---
 fs/xfs/xfs_health.c        |  1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index ebf17a0b0722..cece31ecee81 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -422,6 +422,7 @@ struct xfs_bulkstat {
 #define XFS_BS_SICK_SYMLINK	(1 << 6)  /* symbolic link remote target */
 #define XFS_BS_SICK_PARENT	(1 << 7)  /* parent pointers */
 #define XFS_BS_SICK_DIRTREE	(1 << 8)  /* directory tree structure */
+#define XFS_BS_SICK_FSVERITY	(1 << 9)  /* fsverity metadata */
 
 /*
  * Project quota id helpers (previously projid was 16bit only
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 1d45cf5789e8..932b447190da 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -104,6 +104,7 @@ struct xfs_rtgroup;
 /* Don't propagate sick status to ag health summary during inactivation */
 #define XFS_SICK_INO_FORGET	(1 << 12)
 #define XFS_SICK_INO_DIRTREE	(1 << 13)  /* directory tree structure */
+#define XFS_SICK_INO_FSVERITY	(1 << 14)  /* fsverity metadata */
 
 /* Primary evidence of health problems in a given group. */
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
@@ -140,7 +141,8 @@ struct xfs_rtgroup;
 				 XFS_SICK_INO_XATTR | \
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT | \
-				 XFS_SICK_INO_DIRTREE)
+				 XFS_SICK_INO_DIRTREE | \
+				 XFS_SICK_INO_FSVERITY)
 
 #define XFS_SICK_INO_ZAPPED	(XFS_SICK_INO_BMBTD_ZAPPED | \
 				 XFS_SICK_INO_BMBTA_ZAPPED | \
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index ef5cf97ad700..8ac810f0ffa1 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -84,16 +84,23 @@ xfs_fsverity_get_descriptor(
 		return error;
 
 	desc_size = be32_to_cpu(d_desc_size);
-	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE))
+	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
-	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos))
+	}
+
+	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
+	}
 
 	if (!buf_size)
 		return desc_size;
 
-	if (XFS_IS_CORRUPT(mp, desc_size > buf_size))
+	if (XFS_IS_CORRUPT(mp, desc_size > buf_size)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
+	}
 
 	desc_pos = round_down(desc_size_pos - desc_size, blocksize);
 	error = fsverity_pagecache_read(inode, buf, desc_size, desc_pos);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 239b843e83d4..be66760fb120 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -625,6 +625,7 @@ static const struct ioctl_sick_map ino_map[] = {
 	{ XFS_SICK_INO_DIR_ZAPPED,	XFS_BS_SICK_DIR },
 	{ XFS_SICK_INO_SYMLINK_ZAPPED,	XFS_BS_SICK_SYMLINK },
 	{ XFS_SICK_INO_DIRTREE,	XFS_BS_SICK_DIRTREE },
+	{ XFS_SICK_INO_FSVERITY,	XFS_BS_SICK_FSVERITY },
 };
 
 /* Fill out bulkstat health info. */
-- 
2.51.2


^ permalink raw reply related

* [PATCH v8 20/22] xfs: check and repair the verity inode flag state
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
	linux-unionfs, Andrey Albershteyn
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

From: "Darrick J. Wong" <djwong@kernel.org>

If an inode has the incore verity iflag set, make sure that we can
actually activate fsverity on that inode.  If activation fails due to
a fsverity metadata validation error, clear the flag.  The usage model
for fsverity requires any program that cares about verity state to call
statx/getflags to check that the flag is set after opening the file, so
clearing the flag will not compromise that model.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/scrub/attr.c         |  7 +++++
 fs/xfs/scrub/common.c       | 53 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h       |  2 ++
 fs/xfs/scrub/inode.c        |  7 +++++
 fs/xfs/scrub/inode_repair.c | 36 +++++++++++++++++++++++++
 5 files changed, 105 insertions(+)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 390ac2e11ee0..daf7962c2374 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -649,6 +649,13 @@ xchk_xattr(
 	if (!xfs_inode_hasattr(sc->ip))
 		return -ENOENT;
 
+	/*
+	 * If this is a verity file that won't activate, we cannot check the
+	 * merkle tree geometry.
+	 */
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_set_incomplete(sc);
+
 	/* Allocate memory for xattr checking. */
 	error = xchk_setup_xattr_buf(sc, 0);
 	if (error == -ENOMEM)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 20e63069088b..6cc6bea9c554 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -45,6 +45,8 @@
 #include "scrub/health.h"
 #include "scrub/tempfile.h"
 
+#include <linux/fsverity.h>
+
 /* Common code for the metadata scrubbers. */
 
 /*
@@ -1743,3 +1745,54 @@ xchk_inode_count_blocks(
 	return xfs_bmap_count_blocks(sc->tp, sc->ip, whichfork, nextents,
 			count);
 }
+
+/*
+ * If this inode has S_VERITY set on it, read the verity info. If the reading
+ * fails with anything other than ENOMEM, the file is corrupt, which we can
+ * detect later with fsverity_active.
+ *
+ * Callers must hold the IOLOCK and must not hold the ILOCK of sc->ip because
+ * activation reads inode data.
+ */
+int
+xchk_inode_setup_verity(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	if (!fsverity_active(VFS_I(sc->ip)))
+		return 0;
+
+	error = fsverity_ensure_verity_info(VFS_I(sc->ip));
+	switch (error) {
+	case 0:
+		/* fsverity is active */
+		break;
+	case -ENODATA:
+	case -EMSGSIZE:
+	case -EINVAL:
+	case -EFSCORRUPTED:
+	case -EFBIG:
+		/*
+		 * The nonzero errno codes above are the error codes that can
+		 * be returned from fsverity on metadata validation errors.
+		 */
+		return 0;
+	default:
+		/* runtime errors */
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Is this a verity file that failed to activate?  Callers must have tried to
+ * activate fsverity via xchk_inode_setup_verity.
+ */
+bool
+xchk_inode_verity_broken(
+	struct xfs_inode	*ip)
+{
+	return fsverity_active(VFS_I(ip)) && !fsverity_get_info(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f2ecc68538f0..aa16d310bd6d 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -264,6 +264,8 @@ int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
 		bool *inuse);
 int xchk_inode_count_blocks(struct xfs_scrub *sc, int whichfork,
 		xfs_extnum_t *nextents, xfs_filblks_t *count);
+int xchk_inode_setup_verity(struct xfs_scrub *sc);
+bool xchk_inode_verity_broken(struct xfs_inode *ip);
 
 bool xchk_inode_is_dirtree_root(const struct xfs_inode *ip);
 bool xchk_inode_is_sb_rooted(const struct xfs_inode *ip);
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 948d04dcba2a..8ce6917e22b4 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -36,6 +36,10 @@ xchk_prepare_iscrub(
 
 	xchk_ilock(sc, XFS_IOLOCK_EXCL);
 
+	error = xchk_inode_setup_verity(sc);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -833,6 +837,9 @@ xchk_inode(
 	if (S_ISREG(VFS_I(sc->ip)->i_mode))
 		xchk_inode_check_reflink_iflag(sc, sc->ip->i_ino);
 
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_ino_set_corrupt(sc, sc->sm->sm_ino);
+
 	xchk_inode_check_unlinked(sc);
 
 	xchk_inode_xref(sc, sc->ip->i_ino, &di);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 9738b9ce3f2d..3761e3922466 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -573,6 +573,8 @@ xrep_dinode_flags(
 		dip->di_nrext64_pad = 0;
 	else if (dip->di_version >= 3)
 		dip->di_v3_pad = 0;
+	if (!xfs_has_verity(mp) || !S_ISREG(mode))
+		flags2 &= ~XFS_DIFLAG2_VERITY;
 
 	if (flags2 & XFS_DIFLAG2_METADATA) {
 		xfs_failaddr_t	fa;
@@ -1613,6 +1615,10 @@ xrep_dinode_core(
 	if (iget_error)
 		return iget_error;
 
+	error = xchk_inode_setup_verity(sc);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -2032,6 +2038,27 @@ xrep_inode_unlinked(
 	return 0;
 }
 
+/*
+ * If this file is a fsverity file, xchk_prepare_iscrub or xrep_dinode_core
+ * should have activated it.  If it's still not active, then there's something
+ * wrong with the verity descriptor and we should turn it off.
+ */
+STATIC int
+xrep_inode_verity(
+	struct xfs_scrub	*sc)
+{
+	struct inode		*inode = VFS_I(sc->ip);
+
+	if (xchk_inode_verity_broken(sc->ip)) {
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_VERITY;
+		inode->i_flags &= ~S_VERITY;
+
+		xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	}
+
+	return 0;
+}
+
 /* Repair an inode's fields. */
 int
 xrep_inode(
@@ -2081,6 +2108,15 @@ xrep_inode(
 			return error;
 	}
 
+	/*
+	 * Disable fsverity if it cannot be activated.  Activation failure
+	 * prohibits the file from being opened, so there cannot be another
+	 * program with an open fd to what it thinks is a verity file.
+	 */
+	error = xrep_inode_verity(sc);
+	if (error)
+		return error;
+
 	/* Reconnect incore unlinked list */
 	error = xrep_inode_unlinked(sc);
 	if (error)
-- 
2.51.2


^ permalink raw reply related

* [PATCH v8 19/22] xfs: advertise fs-verity being available on filesystem
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
	linux-unionfs, Andrey Albershteyn, Andrey Albershteyn
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

From: "Darrick J. Wong" <djwong@kernel.org>

Advertise that this filesystem supports fsverity.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h | 1 +
 fs/xfs/libxfs/xfs_sb.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index d165de607d17..ebf17a0b0722 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -250,6 +250,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_PARENT	(1 << 25) /* linux parent pointers */
 #define XFS_FSOP_GEOM_FLAGS_METADIR	(1 << 26) /* metadata directories */
 #define XFS_FSOP_GEOM_FLAGS_ZONED	(1 << 27) /* zoned rt device */
+#define XFS_FSOP_GEOM_FLAGS_VERITY	(1 << 28) /* fs-verity */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a15510ebd2f1..222bbe5559df 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1590,6 +1590,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
 	if (xfs_has_zoned(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_ZONED;
+	if (xfs_has_verity(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_VERITY;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
-- 
2.51.2


^ permalink raw reply related
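
As a rough illustration of how a consumer of the geometry flags might test the bit added in 19/22: the bit values below are copied from the patch, while geom_advertises_verity() is a hypothetical helper name. How the flags word is obtained from the kernel is deliberately left out of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as they appear in the xfs_fs.h hunk of the patch. */
#define XFS_FSOP_GEOM_FLAGS_ZONED	(1u << 27) /* zoned rt device */
#define XFS_FSOP_GEOM_FLAGS_VERITY	(1u << 28) /* fs-verity */

/*
 * Hypothetical helper: report whether a geometry flags word advertises
 * fs-verity support.
 */
bool geom_advertises_verity(uint32_t geom_flags)
{
	return (geom_flags & XFS_FSOP_GEOM_FLAGS_VERITY) != 0;
}
```

A caller would pass the flags field of the geometry structure it already holds and skip verity-specific work when the helper returns false.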

* [PATCH v8 18/22] xfs: add fs-verity ioctls
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Add fs-verity ioctls to enable fs-verity, dump metadata (descriptor and
Merkle tree pages), and obtain the file's digest.

[djwong: remove unnecessary casting]

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_ioctl.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index facffdc8dca8..e633d56cad00 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -46,6 +46,7 @@
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
+#include <linux/fsverity.h>
 
 /* Return 0 on success or positive error */
 int
@@ -1426,6 +1427,19 @@ xfs_file_ioctl(
 	case XFS_IOC_VERIFY_MEDIA:
 		return xfs_ioc_verify_media(filp, arg);
 
+	case FS_IOC_ENABLE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_enable(filp, arg);
+	case FS_IOC_MEASURE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_measure(filp, arg);
+	case FS_IOC_READ_VERITY_METADATA:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_read_metadata(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.51.2


^ permalink raw reply related
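
A minimal userspace sketch of calling one of the ioctls wired up in 18/22, assuming the fs-verity UAPI header <linux/fsverity.h> is installed. FS_IOC_MEASURE_VERITY and struct fsverity_digest come from that header; measure_verity() is a hypothetical wrapper name. With this patch applied, the call would be expected to fail with EOPNOTSUPP on an XFS filesystem without the verity feature.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fsverity.h>

/*
 * Hypothetical wrapper: ask the kernel for the fs-verity digest of fd.
 * Returns the digest size in bytes on success, or a negative errno value.
 */
int measure_verity(int fd, uint8_t *out, size_t out_len, uint16_t *alg)
{
	struct fsverity_digest *d;
	int ret;

	/* digest_size is a 16-bit field, so clamp the advertised capacity. */
	if (out_len > UINT16_MAX)
		out_len = UINT16_MAX;

	d = calloc(1, sizeof(*d) + out_len);
	if (!d)
		return -ENOMEM;

	/* On input, digest_size tells the kernel how much room we have. */
	d->digest_size = (uint16_t)out_len;

	if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) < 0) {
		ret = -errno;
	} else {
		/* On output, digest_size is the actual digest length. */
		memcpy(out, d->digest, d->digest_size);
		*alg = d->digest_algorithm;
		ret = d->digest_size;
	}

	free(d);
	return ret;
}
```

Returning a negative errno keeps the error path explicit, so a caller can tell "filesystem does not support verity" apart from other failures.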

* [PATCH v8 17/22] xfs: remove unwritten extents after preallocations in fsverity metadata
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

XFS preallocates space during writes. In normal I/O this space, if
unused, is removed by truncate. For files with fsverity, XFS does not use
truncate because fsverity metadata is stored past EOF.

After writing the fsverity metadata, iterate over the extents in
that region and remove any unwritten ones. These are leftovers in
the holes in the Merkle tree and past the fsverity descriptor.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_fsverity.c | 67 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 68d9736d19d9..ef5cf97ad700 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -21,6 +21,8 @@
 #include "xfs_iomap.h"
 #include "xfs_error.h"
 #include "xfs_health.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
 #include <linux/fsverity.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
@@ -173,6 +175,63 @@ xfs_fsverity_delete_metadata(
 	return error;
 }
 
+static int
+xfs_fsverity_cancel_unwritten(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		start,
+	xfs_fileoff_t		end)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSB(mp, start);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end);
+	struct xfs_bmbt_irec	imap;
+	int			nimaps;
+	int			error = 0;
+	int			done;
+
+
+	while (offset_fsb < end_fsb) {
+		nimaps = 1;
+
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
+				0, &tp);
+		if (error)
+			return error;
+
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+				&imap, &nimaps, 0);
+		if (error)
+			goto out_cancel;
+
+		if (nimaps == 0)
+			goto out_cancel;
+
+		if (imap.br_state == XFS_EXT_UNWRITTEN) {
+			xfs_trans_ijoin(tp, ip, 0);
+
+			error = xfs_bunmapi(tp, ip, imap.br_startoff,
+					imap.br_blockcount, 0, 1, &done);
+			if (error)
+				goto out_cancel;
+
+			error = xfs_trans_commit(tp);
+		} else {
+			xfs_trans_cancel(tp);
+		}
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+		offset_fsb = imap.br_startoff + imap.br_blockcount;
+	}
+
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 
 /*
  * Prepare to enable fsverity by clearing old metadata.
@@ -248,6 +307,14 @@ xfs_fsverity_end_enable(
 	if (error)
 		goto out;
 
+	/*
+	 * Remove unwritten extents left by COW preallocations and write
+	 * preallocation in the merkle tree holes and past descriptor
+	 */
+	error = xfs_fsverity_cancel_unwritten(ip, range_start, LLONG_MAX);
+	if (error)
+		goto out;
+
 	/*
 	 * Proactively drop any delayed allocations in COW fork, the fsverity
 	 * files are read-only
-- 
2.51.2


^ permalink raw reply related
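
The extent walk described in 17/22 can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: struct extent, enum ext_state and cancel_unwritten() are hypothetical names, the extent array is assumed to be sorted and non-overlapping with no extent straddling the start of the range, and there are no transactions or locks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ext_state { EXT_NORM, EXT_UNWRITTEN };

/* Hypothetical in-memory stand-in for one data fork mapping. */
struct extent {
	uint64_t	startoff;	/* first file block covered */
	uint64_t	blockcount;	/* length in blocks */
	enum ext_state	state;
	bool		mapped;		/* cleared once the extent is removed */
};

/*
 * Walk the extents in [start_fsb, end_fsb) and unmap the unwritten ones.
 * Returns the number of extents removed.
 */
size_t cancel_unwritten(struct extent *ext, size_t n,
			uint64_t start_fsb, uint64_t end_fsb)
{
	uint64_t	off = start_fsb;
	size_t		removed = 0;

	for (size_t i = 0; i < n && off < end_fsb; i++) {
		struct extent *e = &ext[i];

		/* Entirely before the cursor: nothing to do. */
		if (e->startoff + e->blockcount <= off)
			continue;
		/* Starts past the range: the walk is finished. */
		if (e->startoff >= end_fsb)
			break;

		if (e->mapped && e->state == EXT_UNWRITTEN) {
			e->mapped = false;
			removed++;
		}

		/* Advance the cursor past this mapping, as the patch does. */
		off = e->startoff + e->blockcount;
	}
	return removed;
}
```

Written extents (Merkle tree blocks and the descriptor) are left in place; only the unwritten preallocation leftovers are dropped.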

* [PATCH v8 16/22] xfs: add fs-verity support
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Add integration with fs-verity. XFS stores the fs-verity descriptor and
Merkle tree in the inode data fork, starting at the first 64k-aligned
offset past EOF.

Merkle tree reading and writing is done through the iomap interface. The
data itself is read into the inode's page cache. When XFS reads from this
region, iomap doesn't call into fsverity to verify it against the Merkle
tree. For file data, verification is done at ioend completion in a workqueue.

When fs-verity is enabled on an inode, the XFS_VERITY_CONSTRUCTION
flag is set, meaning that the Merkle tree is being built. The
initialization ends with storing the verity descriptor and setting the
inode on-disk flag (XFS_DIFLAG2_VERITY). Lastly,
XFS_VERITY_CONSTRUCTION is dropped and S_VERITY is set on the inode.

The descriptor is stored in a new block aligned to 64k after the last
Merkle tree block. The size of the descriptor is stored at the end of
the last descriptor block (the descriptor can span multiple blocks).
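
The layout described above can be worked through with a small userspace sketch. The arithmetic mirrors xfs_fsverity_metadata_offset() and xfs_fsverity_write_descriptor() from this series; the 64k value for the alignment constant is taken from this commit message (its definition is not shown here), and struct verity_layout, round_up_pow2() and compute_verity_layout() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: the commit message says metadata is aligned to 64k. */
#define FSVERITY_START_ALIGN	65536ULL

/* Hypothetical container for the three offsets of interest. */
struct verity_layout {
	uint64_t	tree_pos;	/* start of the Merkle tree */
	uint64_t	desc_pos;	/* start of the descriptor */
	uint64_t	desc_size_pos;	/* where the 32-bit size is stored */
};

/* Round x up to a power-of-two alignment. */
uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

struct verity_layout compute_verity_layout(uint64_t i_size,
		uint64_t tree_size, uint32_t desc_size, uint32_t blksize)
{
	struct verity_layout l;

	/* Merkle tree: first 64k-aligned offset at or past EOF. */
	l.tree_pos = round_up_pow2(i_size, FSVERITY_START_ALIGN);
	/* Descriptor: next 64k-aligned offset after the tree. */
	l.desc_pos = round_up_pow2(l.tree_pos + tree_size,
				   FSVERITY_START_ALIGN);
	/* Size field: last four bytes of the last descriptor block. */
	l.desc_size_pos = round_up_pow2(l.desc_pos + desc_size +
			sizeof(uint32_t), blksize) - sizeof(uint32_t);
	return l;
}
```

For a 100000-byte file with an 8192-byte tree, a 256-byte descriptor and 4096-byte blocks, this places the tree at 131072, the descriptor at 196608 and the size field at 200700.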

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   8 +
 fs/xfs/xfs_fsverity.c  | 353 ++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_fsverity.h  |   2 +
 fs/xfs/xfs_message.c   |   4 +
 fs/xfs/xfs_message.h   |   1 +
 fs/xfs/xfs_mount.h     |   2 +
 fs/xfs/xfs_super.c     |   7 +
 7 files changed, 376 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..18348f4fd2aa 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -31,6 +31,7 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include <linux/fsverity.h>
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -553,6 +554,13 @@ xfs_can_free_eofblocks(
 	if (last_fsb <= end_fsb)
 		return false;
 
+	/*
+	 * Nothing to clean on fsverity inodes as they don't use prealloc and
+	 * there is no delalloc, as the only written data is fsverity metadata
+	 */
+	if (IS_VERITY(VFS_I(ip)))
+		return false;
+
 	/*
 	 * Check if there is an post-EOF extent to free.  If there are any
 	 * delalloc blocks attached to the inode (data fork delalloc
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index b983e20bb5e1..68d9736d19d9 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -4,14 +4,26 @@
  */
 #include "xfs_platform.h"
 #include "xfs_format.h"
-#include "xfs_inode.h"
 #include "xfs_shared.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_fsverity.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_bmap_util.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_trace.h"
+#include "xfs_quota.h"
 #include "xfs_fsverity.h"
+#include "xfs_iomap.h"
+#include "xfs_error.h"
+#include "xfs_health.h"
 #include <linux/fsverity.h>
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
 
 loff_t
 xfs_fsverity_metadata_offset(
@@ -28,3 +40,342 @@ xfs_fsverity_is_file_data(
 	return fsverity_active(VFS_IC(ip)) &&
 			offset < xfs_fsverity_metadata_offset(ip);
 }
+
+/*
+ * Retrieve the verity descriptor.
+ */
+static int
+xfs_fsverity_get_descriptor(
+	struct inode		*inode,
+	void			*buf,
+	size_t			buf_size)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	__be32			d_desc_size;
+	u32			desc_size;
+	u64			desc_size_pos;
+	int			error;
+	u64			desc_pos;
+	struct xfs_bmbt_irec	rec;
+	int			is_empty;
+	uint32_t		blocksize = i_blocksize(VFS_I(ip));
+	xfs_fileoff_t		last_block_offset;
+
+	ASSERT(inode->i_flags & S_VERITY);
+	error = xfs_bmap_last_extent(NULL, ip, XFS_DATA_FORK, &rec, &is_empty);
+	if (error)
+		return error;
+
+	if (is_empty)
+		return -ENODATA;
+
+	last_block_offset =
+		XFS_FSB_TO_B(mp, rec.br_startoff + rec.br_blockcount);
+	if (last_block_offset < xfs_fsverity_metadata_offset(ip))
+		return -ENODATA;
+
+	desc_size_pos = last_block_offset - sizeof(__be32);
+	error = fsverity_pagecache_read(inode, (char *)&d_desc_size,
+			sizeof(d_desc_size), desc_size_pos);
+	if (error)
+		return error;
+
+	desc_size = be32_to_cpu(d_desc_size);
+	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE))
+		return -ERANGE;
+	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos))
+		return -ERANGE;
+
+	if (!buf_size)
+		return desc_size;
+
+	if (XFS_IS_CORRUPT(mp, desc_size > buf_size))
+		return -ERANGE;
+
+	desc_pos = round_down(desc_size_pos - desc_size, blocksize);
+	error = fsverity_pagecache_read(inode, buf, desc_size, desc_pos);
+	if (error)
+		return error;
+
+	return desc_size;
+}
+
+static int
+xfs_fsverity_write_descriptor(
+	struct file		*file,
+	const void		*desc,
+	u32			desc_size,
+	u64			merkle_tree_size)
+{
+	int			error;
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	unsigned int		blksize = ip->i_mount->m_attr_geo->blksize;
+	u64			tree_last_block =
+			xfs_fsverity_metadata_offset(ip) + merkle_tree_size;
+	u64			desc_pos =
+			round_up(tree_last_block, XFS_FSVERITY_START_ALIGN);
+	u64			desc_end = desc_pos + desc_size;
+	__be32			desc_size_disk = cpu_to_be32(desc_size);
+	u64			desc_size_pos =
+			round_up(desc_end + sizeof(desc_size_disk), blksize) -
+			sizeof(desc_size_disk);
+
+	error = iomap_fsverity_write(file, desc_size_pos, sizeof(__be32),
+			(const void *)&desc_size_disk,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+	if (error)
+		return error;
+
+	return iomap_fsverity_write(file, desc_pos, desc_size, desc,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+}
+
+/*
+ * Try to remove all the fsverity metadata after a failed enablement.
+ */
+static int
+xfs_fsverity_delete_metadata(
+	struct xfs_inode	*ip)
+{
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/*
+	 * We are removing post-EOF data, no need to update i_size as fsverity
+	 * didn't move i_size in the first place
+	 */
+	error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, XFS_ISIZE(ip));
+	if (error)
+		goto err_cancel;
+
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto err_cancel;
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+
+err_cancel:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+
+/*
+ * Prepare to enable fsverity by clearing old metadata.
+ */
+static int
+xfs_fsverity_begin_enable(
+	struct file		*filp)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	if (IS_DAX(inode))
+		return -EINVAL;
+
+	if (inode->i_size > XFS_FSVERITY_LARGEST_FILE)
+		return -EFBIG;
+
+	/*
+	 * Flush pagecache before building Merkle tree. Inode is locked and no
+	 * further writes will happen to the file except fsverity metadata
+	 */
+	error = filemap_write_and_wait(inode->i_mapping);
+	if (error)
+		return error;
+
+	if (xfs_iflags_test_and_set(ip, XFS_VERITY_CONSTRUCTION))
+		return -EBUSY;
+
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
+
+	return xfs_fsverity_delete_metadata(ip);
+}
+
+/*
+ * Complete (or fail) the process of enabling fsverity.
+ */
+static int
+xfs_fsverity_end_enable(
+	struct file		*file,
+	const void		*desc,
+	size_t			desc_size,
+	u64			merkle_tree_size)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	int			error = 0;
+	loff_t			range_start = xfs_fsverity_metadata_offset(ip);
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	/* fs-verity failed, just cleanup */
+	if (desc == NULL)
+		goto out;
+
+	error = xfs_fsverity_write_descriptor(file, desc, desc_size,
+			merkle_tree_size);
+	if (error)
+		goto out;
+
+	/*
+	 * Wait for the Merkle tree to be written to disk before setting the
+	 * on-disk inode flag and clearing XFS_VERITY_CONSTRUCTION
+	 */
+	error = filemap_write_and_wait_range(inode->i_mapping, range_start,
+			LLONG_MAX);
+	if (error)
+		goto out;
+
+	/*
+	 * Proactively drop any delayed allocations in COW fork, the fsverity
+	 * files are read-only
+	 */
+	if (xfs_is_cow_inode(ip))
+		xfs_bmap_punch_delalloc_range(ip, XFS_COW_FORK, 0, LLONG_MAX,
+				NULL);
+
+	/*
+	 * Set fsverity inode flag
+	 */
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange,
+			0, 0, false, &tp);
+	if (error)
+		goto out;
+
+	/*
+	 * Ensure that we've persisted the verity information before we enable
+	 * it on the inode and tell the caller we have sealed the inode.
+	 */
+	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (!error)
+		inode->i_flags |= S_VERITY;
+
+out:
+	if (error) {
+		int	error2;
+
+		error2 = xfs_fsverity_delete_metadata(ip);
+		if (error2)
+			xfs_alert(ip->i_mount,
+"ino 0x%llx failed to clean up new fsverity metadata, err %d",
+					ip->i_ino, error2);
+	}
+
+	xfs_iflags_clear(ip, XFS_VERITY_CONSTRUCTION);
+	return error;
+}
+
+/*
+ * Retrieve a merkle tree block.
+ */
+static struct page *
+xfs_fsverity_read_merkle(
+	struct inode		*inode,
+	pgoff_t			index)
+{
+	index += xfs_fsverity_metadata_offset(XFS_I(inode)) >> PAGE_SHIFT;
+
+	return generic_read_merkle_tree_page(inode, index);
+}
+
+/*
+ * Read ahead merkle tree pages.
+ */
+static void
+xfs_fsverity_readahead_merkle_tree(
+	struct inode		*inode,
+	pgoff_t			index,
+	unsigned long		nr_pages)
+{
+	index += xfs_fsverity_metadata_offset(XFS_I(inode)) >> PAGE_SHIFT;
+
+	generic_readahead_merkle_tree(inode, index, nr_pages);
+}
+
+/*
+ * Write a merkle tree block.
+ */
+static int
+xfs_fsverity_write_merkle(
+	struct file		*file,
+	const void		*buf,
+	u64			pos,
+	unsigned int		size,
+	const u8		*zero_digest,
+	unsigned int		digest_size)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	loff_t			position = pos +
+		xfs_fsverity_metadata_offset(ip);
+
+	if (position + size > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+
+	/*
+	 * If this is a block full of hashes of zeroed blocks, don't bother
+	 * storing the block. We can synthesize them later.
+	 *
+	 * However, do this only in case Merkle tree block == fs block size.
+	 * Iomap synthesizes these blocks based on holes in the merkle tree. We
+	 * won't be able to tell if something needs to be synthesized for the
+	 * range in the fs block. For example, for 4k filesystem block
+	 *
+	 *	[ 1k | zero hashes | zero hashes | 1k ]
+	 *
+	 * Iomap won't know about these empty blocks.
+	 */
+	if (size == ip->i_mount->m_sb.sb_blocksize &&
+			/*
+			 * First digest is zero_digest
+			 */
+			memcmp(buf, zero_digest, digest_size) == 0 &&
+			/*
+			 * Every digest is same as previous, thus all are
+			 * zero_digest
+			 */
+			memcmp(buf + digest_size, buf, size - digest_size) == 0)
+		return 0;
+
+	return iomap_fsverity_write(file, position, size, buf,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+}
+
+const struct fsverity_operations xfs_fsverity_ops = {
+	.begin_enable_verity		= xfs_fsverity_begin_enable,
+	.end_enable_verity		= xfs_fsverity_end_enable,
+	.get_verity_descriptor		= xfs_fsverity_get_descriptor,
+	.read_merkle_tree_page		= xfs_fsverity_read_merkle,
+	.readahead_merkle_tree		= xfs_fsverity_readahead_merkle_tree,
+	.write_merkle_tree_block	= xfs_fsverity_write_merkle,
+};
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
index ec77ba571106..6a981e20a75b 100644
--- a/fs/xfs/xfs_fsverity.h
+++ b/fs/xfs/xfs_fsverity.h
@@ -6,8 +6,10 @@
 #define __XFS_FSVERITY_H__
 
 #include "xfs_platform.h"
+#include <linux/fsverity.h>
 
 #ifdef CONFIG_FS_VERITY
+extern const struct fsverity_operations xfs_fsverity_ops;
 loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
 bool xfs_fsverity_is_file_data(const struct xfs_inode *ip, loff_t offset);
 #else
diff --git a/fs/xfs/xfs_message.c b/fs/xfs/xfs_message.c
index fd297082aeb8..9818d8f8f239 100644
--- a/fs/xfs/xfs_message.c
+++ b/fs/xfs/xfs_message.c
@@ -153,6 +153,10 @@ xfs_warn_experimental(
 			.opstate	= XFS_OPSTATE_WARNED_ZONED,
 			.name		= "zoned RT device",
 		},
+		[XFS_EXPERIMENTAL_FSVERITY] = {
+			.opstate	= XFS_OPSTATE_WARNED_FSVERITY,
+			.name		= "fsverity",
+		},
 	};
 	ASSERT(feat >= 0 && feat < XFS_EXPERIMENTAL_MAX);
 	BUILD_BUG_ON(ARRAY_SIZE(features) != XFS_EXPERIMENTAL_MAX);
diff --git a/fs/xfs/xfs_message.h b/fs/xfs/xfs_message.h
index 49b0ef40d299..083403944f11 100644
--- a/fs/xfs/xfs_message.h
+++ b/fs/xfs/xfs_message.h
@@ -94,6 +94,7 @@ enum xfs_experimental_feat {
 	XFS_EXPERIMENTAL_SHRINK,
 	XFS_EXPERIMENTAL_LARP,
 	XFS_EXPERIMENTAL_ZONED,
+	XFS_EXPERIMENTAL_FSVERITY,
 
 	XFS_EXPERIMENTAL_MAX,
 };
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 07f6aa3c3f26..84d7cfb5e2c7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -583,6 +583,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
 #define XFS_OPSTATE_WARNED_ZONED	19
 /* (Zoned) GC is in progress */
 #define XFS_OPSTATE_ZONEGC_RUNNING	20
+/* Kernel has logged a warning about fsverity support */
+#define XFS_OPSTATE_WARNED_FSVERITY	21
 
 #define __XFS_IS_OPSTATE(name, NAME) \
 static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f8de44443e81..d9d442009610 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -30,6 +30,7 @@
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
 #include "xfs_sysfs.h"
+#include "xfs_fsverity.h"
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
@@ -1686,6 +1687,9 @@ xfs_fs_fill_super(
 	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
 	sb->s_op = &xfs_super_operations;
+#ifdef CONFIG_FS_VERITY
+	sb->s_vop = &xfs_fsverity_ops;
+#endif
 
 	/*
 	 * Delay mount work if the debug hook is set. This is debug
@@ -1939,6 +1943,9 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_filestream_unmount;
 
+	if (xfs_has_verity(mp))
+		xfs_warn_experimental(mp, XFS_EXPERIMENTAL_FSVERITY);
+
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = -ENOENT;
-- 
2.51.2


^ permalink raw reply related
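
The zero-hash shortcut in xfs_fsverity_write_merkle() from 16/22 relies on a compact comparison trick: if the first digest equals zero_digest and every byte equals the byte digest_size positions before it, then the whole block is repetitions of zero_digest. A standalone sketch of just that check follows; block_is_all_zero_digest() is a hypothetical name, and the patch additionally requires the block size to equal the filesystem block size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true if buf consists solely of repeated copies of zero_digest.
 * The second memcmp compares the buffer against itself shifted by one
 * digest, which proves every digest equals the one before it.
 */
bool block_is_all_zero_digest(const uint8_t *buf, size_t size,
			      const uint8_t *zero_digest, size_t digest_size)
{
	if (digest_size == 0 || size < digest_size)
		return false;

	return memcmp(buf, zero_digest, digest_size) == 0 &&
	       memcmp(buf + digest_size, buf, size - digest_size) == 0;
}
```

The self-comparison avoids a loop over individual digests and needs no scratch buffer.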

* [PATCH v8 15/22] xfs: use read ioend for fsverity data verification
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Use read ioends for fsverity verification. Do not issue fsverity
metadata I/O through the same workqueue, because a full workqueue could
deadlock.

Pass fsverity_info from iomap context down to the ioend as hashtable
lookups are expensive.

Add a simple helper to check that this is not fsverity metadata but file
data that needs verification.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_aops.c     | 46 ++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_fsverity.c |  9 +++++++++
 fs/xfs/xfs_fsverity.h |  6 ++++++
 3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9503252a0fa4..ecb07f250956 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -24,6 +24,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_fsverity.h"
 #include <linux/bio-integrity.h>
+#include <linux/fsverity.h>
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -171,6 +172,23 @@ xfs_end_ioend_write(
 	memalloc_nofs_restore(nofs_flag);
 }
 
+/*
+ * IO read completion.
+ */
+static void
+xfs_end_ioend_read(
+	struct iomap_ioend	*ioend)
+{
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+
+	if (!ioend->io_bio.bi_status &&
+			xfs_fsverity_is_file_data(ip, ioend->io_offset))
+		fsverity_verify_bio(ioend->io_vi,
+				    &ioend->io_bio);
+	iomap_finish_ioends(ioend,
+		blk_status_to_errno(ioend->io_bio.bi_status));
+}
+
 /*
  * Finish all pending IO completions that require transactional modifications.
  *
@@ -205,8 +223,7 @@ xfs_end_io(
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
 		if (bio_op(&ioend->io_bio) == REQ_OP_READ)
-			iomap_finish_ioends(ioend,
-				blk_status_to_errno(ioend->io_bio.bi_status));
+			xfs_end_ioend_read(ioend);
 		else
 			xfs_end_ioend_write(ioend);
 		cond_resched();
@@ -232,9 +249,14 @@ xfs_end_bio(
 	}
 
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
-	if (list_empty(&ip->i_ioend_list))
-		WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
+	if (list_empty(&ip->i_ioend_list)) {
+		if (IS_ENABLED(CONFIG_FS_VERITY) && ioend->io_vi &&
+		    ioend->io_offset < xfs_fsverity_metadata_offset(ip))
+			fsverity_enqueue_verify_work(&ip->i_ioend_work);
+		else
+			WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
 					 &ip->i_ioend_work));
+	}
 	list_add_tail(&ioend->io_list, &ip->i_ioend_list);
 	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
 }
@@ -764,9 +786,13 @@ xfs_bio_submit_read(
 	struct iomap_read_folio_ctx	*ctx)
 {
 	struct bio			*bio = ctx->read_ctx;
+	struct iomap_ioend		*ioend;
 
 	/* defer read completions to the ioend workqueue */
-	iomap_init_ioend(iter->inode, bio, ctx->read_ctx_file_offset, 0);
+	ioend = iomap_init_ioend(iter->inode, bio, ctx->read_ctx_file_offset,
+			0);
+	ioend->io_vi = ctx->vi;
+
 	bio->bi_end_io = xfs_end_bio;
 	submit_bio(bio);
 }
@@ -779,11 +805,13 @@ static const struct iomap_read_ops xfs_iomap_read_ops = {
 
 static inline const struct iomap_read_ops *
 xfs_get_iomap_read_ops(
-	const struct address_space	*mapping)
+	const struct address_space	*mapping,
+	loff_t				position)
 {
 	struct xfs_inode		*ip = XFS_I(mapping->host);
 
-	if (bdev_has_integrity_csum(xfs_inode_buftarg(ip)->bt_bdev))
+	if (bdev_has_integrity_csum(xfs_inode_buftarg(ip)->bt_bdev) ||
+			xfs_fsverity_is_file_data(ip, position))
 		return &xfs_iomap_read_ops;
 	return &iomap_bio_read_ops;
 }
@@ -795,7 +823,7 @@ xfs_vm_read_folio(
 {
 	struct iomap_read_folio_ctx	ctx = { .cur_folio = folio };
 
-	ctx.ops = xfs_get_iomap_read_ops(folio->mapping);
+	ctx.ops = xfs_get_iomap_read_ops(folio->mapping, folio_pos(folio));
 	iomap_read_folio(&xfs_read_iomap_ops, &ctx, NULL);
 	return 0;
 }
@@ -806,7 +834,7 @@ xfs_vm_readahead(
 {
 	struct iomap_read_folio_ctx	ctx = { .rac = rac };
 
-	ctx.ops = xfs_get_iomap_read_ops(rac->mapping),
+	ctx.ops = xfs_get_iomap_read_ops(rac->mapping, readahead_pos(rac));
 	iomap_readahead(&xfs_read_iomap_ops, &ctx, NULL);
 }
 
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 6e6a8636a577..b983e20bb5e1 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -19,3 +19,12 @@ xfs_fsverity_metadata_offset(
 {
 	return round_up(i_size_read(VFS_IC(ip)), XFS_FSVERITY_START_ALIGN);
 }
+
+bool
+xfs_fsverity_is_file_data(
+	const struct xfs_inode	*ip,
+	loff_t			offset)
+{
+	return fsverity_active(VFS_IC(ip)) &&
+			offset < xfs_fsverity_metadata_offset(ip);
+}
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
index 5771db2cd797..ec77ba571106 100644
--- a/fs/xfs/xfs_fsverity.h
+++ b/fs/xfs/xfs_fsverity.h
@@ -9,12 +9,18 @@
 
 #ifdef CONFIG_FS_VERITY
 loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
+bool xfs_fsverity_is_file_data(const struct xfs_inode *ip, loff_t offset);
 #else
 static inline loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip)
 {
 	WARN_ON_ONCE(1);
 	return ULLONG_MAX;
 }
+static inline bool xfs_fsverity_is_file_data(const struct xfs_inode *ip,
+					    loff_t offset)
+{
+	return false;
+}
 #endif	/* CONFIG_FS_VERITY */
 
 #endif	/* __XFS_FSVERITY_H__ */
-- 
2.51.2


^ permalink raw reply related
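
The queue selection that 15/22 adds to xfs_end_bio() can be summarised with a small userspace model. This is only a sketch: enum completion_queue and pick_completion_queue() are hypothetical names, the has_verity_info flag stands in for a non-NULL ioend->io_vi, and the 64k alignment is assumed from the series description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the series aligns fsverity metadata to 64k past EOF. */
#define FSVERITY_START_ALIGN	65536ULL

/* Hypothetical labels for the two completion paths. */
enum completion_queue {
	QUEUE_IOEND,	/* regular ioend workqueue */
	QUEUE_FSVERITY,	/* fsverity verification workqueue */
};

/*
 * File data of a verity file (anything below the metadata offset) is
 * verified on the fsverity workqueue; Merkle tree and descriptor reads
 * stay on the regular ioend workqueue so they cannot be starved by
 * pending verification work.
 */
enum completion_queue pick_completion_queue(bool has_verity_info,
		uint64_t io_offset, uint64_t i_size)
{
	uint64_t metadata_offset = (i_size + FSVERITY_START_ALIGN - 1) &
				   ~(FSVERITY_START_ALIGN - 1);

	if (has_verity_info && io_offset < metadata_offset)
		return QUEUE_FSVERITY;
	return QUEUE_IOEND;
}
```

Keeping metadata reads off the verification queue matters because verifying a data block may itself need to read Merkle tree blocks.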

* [PATCH v8 14/22] xfs: handle fsverity I/O in write/read path
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

For write/writeback, set the IOMAP_F_FSVERITY flag to tell iomap not to
update the inode size and not to skip folios beyond EOF.

Initiate fsverity writeback with IOMAP_F_FSVERITY set to tell iomap that
it should not skip folios that are dirty beyond EOF.

In the read path, let iomap know that we are reading fsverity metadata, so
that it treats holes in the tree as requests to synthesize tree blocks and
a hole after the descriptor as the end of the fsverity region.

Introduce a new inode flag meaning that the Merkle tree is being built on
the inode.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/Makefile          |  1 +
 fs/xfs/libxfs/xfs_bmap.c |  7 +++++++
 fs/xfs/xfs_aops.c        | 16 +++++++++++++++-
 fs/xfs/xfs_fsverity.c    | 21 +++++++++++++++++++++
 fs/xfs/xfs_fsverity.h    | 20 ++++++++++++++++++++
 fs/xfs/xfs_inode.h       |  6 ++++++
 fs/xfs/xfs_iomap.c       | 15 +++++++++++++--
 7 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_fsverity.c
 create mode 100644 fs/xfs/xfs_fsverity.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 9f7133e02576..38b7f51e5d84 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -149,6 +149,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_FS_VERITY)		+= xfs_fsverity.o
 
 # notify failure
 ifeq ($(CONFIG_MEMORY_FAILURE),y)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a4c8f1aa76c..931d02678d19 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -41,6 +41,8 @@
 #include "xfs_inode_util.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
@@ -4451,6 +4453,11 @@ xfs_bmapi_convert_one_delalloc(
 	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, bma.length));
 	XFS_STATS_INC(mp, xs_xstrat_quick);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION) &&
+	    XFS_FSB_TO_B(mp, bma.got.br_startoff) >=
+		    xfs_fsverity_metadata_offset(ip))
+		flags |= IOMAP_F_FSVERITY;
+
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
 	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
 				xfs_iomap_inode_sequence(ip, flags));
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f279055fcea0..9503252a0fa4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,6 +22,7 @@
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
+#include "xfs_fsverity.h"
 #include <linux/bio-integrity.h>
 
 struct xfs_writepage_ctx {
@@ -339,12 +340,16 @@ xfs_map_blocks(
 	int			retries = 0;
 	int			error = 0;
 	unsigned int		*seq;
+	unsigned int		iomap_flags = 0;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
 	XFS_ERRORTAG_DELAY(mp, XFS_ERRTAG_WB_DELAY_MS);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
@@ -432,7 +437,8 @@ xfs_map_blocks(
 	    isnullstartblock(imap.br_startblock))
 		goto allocate_blocks;
 
-	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, XFS_WPC(wpc)->data_seq);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, iomap_flags,
+			  XFS_WPC(wpc)->data_seq);
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
@@ -705,6 +711,14 @@ xfs_vm_writepages(
 			},
 		};
 
+		/*
+		 * Writeback does not work for folios past EOF, let it know that
+		 * I/O happens for fsverity metadata and this restriction needs
+		 * to be skipped
+		 */
+		if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+			wpc.ctx.iomap.flags |= IOMAP_F_FSVERITY;
+
 		return iomap_writepages(&wpc.ctx);
 	}
 }
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
new file mode 100644
index 000000000000..6e6a8636a577
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.c
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Red Hat, Inc.
+ */
+#include "xfs_platform.h"
+#include "xfs_format.h"
+#include "xfs_inode.h"
+#include "xfs_shared.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_fsverity.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
+#include <linux/iomap.h>
+
+loff_t
+xfs_fsverity_metadata_offset(
+	const struct xfs_inode	*ip)
+{
+	return round_up(i_size_read(VFS_IC(ip)), XFS_FSVERITY_START_ALIGN);
+}
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
new file mode 100644
index 000000000000..5771db2cd797
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Red Hat, Inc.
+ */
+#ifndef __XFS_FSVERITY_H__
+#define __XFS_FSVERITY_H__
+
+#include "xfs_platform.h"
+
+#ifdef CONFIG_FS_VERITY
+loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
+#else
+static inline loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip)
+{
+	WARN_ON_ONCE(1);
+	return ULLONG_MAX;
+}
+#endif	/* CONFIG_FS_VERITY */
+
+#endif	/* __XFS_FSVERITY_H__ */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..6df48d68a919 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -415,6 +415,12 @@ static inline bool xfs_inode_can_sw_atomic_write(const struct xfs_inode *ip)
  */
 #define XFS_IREMAPPING		(1U << 15)
 
+/*
+ * fs-verity's Merkle tree is under construction. The file is read-only, the
+ * only writes happening are for the fsverity metadata.
+ */
+#define XFS_VERITY_CONSTRUCTION	(1U << 16)
+
 /* All inode state flags related to inode reclaim. */
 #define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
 				 XFS_IRECLAIM | \
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 9c2f12d5fec9..71ccd4ff5f48 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -32,6 +32,8 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -1789,6 +1791,9 @@ xfs_buffered_write_iomap_begin(
 		return xfs_direct_write_iomap_begin(inode, offset, count,
 				flags, iomap, srcmap);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	error = xfs_qm_dqattach(ip);
 	if (error)
 		return error;
@@ -2113,12 +2118,17 @@ xfs_read_iomap_begin(
 	bool			shared = false;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
 	u64			seq;
+	unsigned int		iomap_flags = 0;
 
 	ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	if (fsverity_active(inode) &&
+	    (offset >= xfs_fsverity_metadata_offset(ip)))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
 	if (error)
 		return error;
@@ -2132,8 +2142,9 @@ xfs_read_iomap_begin(
 	if (error)
 		return error;
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 shared ? IOMAP_F_SHARED : 0, seq);
+	iomap_flags |= shared ? IOMAP_F_SHARED : 0;
+
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
 }
 
 const struct iomap_ops xfs_read_iomap_ops = {
-- 
2.51.2
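
For readers following the offset logic in this patch: a minimal userspace sketch of how xfs_fsverity_metadata_offset() and the read-path check in xfs_read_iomap_begin() appear to fit together. The value of XFS_FSVERITY_START_ALIGN is not visible in this hunk, so EXAMPLE_START_ALIGN below is a made-up stand-in, and all example_* names are hypothetical helpers rather than kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical alignment. The real XFS_FSVERITY_START_ALIGN is defined
 * elsewhere in the series and may differ.
 */
#define EXAMPLE_START_ALIGN	65536LL

/* Round x up to the next multiple of align; align must be a power of two. */
static int64_t example_round_up(int64_t x, int64_t align)
{
	assert(align > 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Models xfs_fsverity_metadata_offset(): first aligned offset at or past EOF. */
static int64_t example_metadata_offset(int64_t isize)
{
	return example_round_up(isize, EXAMPLE_START_ALIGN);
}

/*
 * Models the read-side test: a mapping is tagged as fsverity metadata only
 * when verity is active and the offset lies at or beyond the metadata start.
 */
static bool example_is_verity_metadata(bool verity_active, int64_t isize,
				       int64_t offset)
{
	return verity_active && offset >= example_metadata_offset(isize);
}
```

With the assumed 64 KiB alignment, a 100-byte file would have its metadata region start at offset 65536, and a file whose size is already a multiple of the alignment would have it start exactly at EOF.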


* [PATCH v8 13/22] xfs: disable direct read path for fs-verity files
From: Andrey Albershteyn @ 2026-04-20 11:47 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

Direct I/O is not supported on fs-verity files. Attempts to use the direct
I/O path on such files fall back to the buffered I/O path.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_file.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a980ac5196a8..6fa9835f9531 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -282,7 +282,8 @@ xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(iocb, to);
@@ -333,6 +334,14 @@ xfs_file_read_iter(
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	/*
+	 * If fs-verity is enabled, fall back from the direct read path to the
+	 * buffered read path. IOCB_DIRECT is still set at this point and needs
+	 * to be cleared (see generic_file_read_iter()).
+	 */
+	if (fsverity_active(inode))
+		iocb->ki_flags &= ~IOCB_DIRECT;
+
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
-- 
2.51.2
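
As an illustration of the fallback described in this patch, here is a small userspace model of the path selection in xfs_file_read_iter() after the change. It is only a sketch: EXAMPLE_IOCB_DIRECT is a stand-in bit rather than the kernel's IOCB_DIRECT value, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bit; not the kernel's actual IOCB_DIRECT value. */
#define EXAMPLE_IOCB_DIRECT	(1u << 0)

enum example_read_path {
	EXAMPLE_READ_DAX,
	EXAMPLE_READ_DIRECT,
	EXAMPLE_READ_BUFFERED,
};

/*
 * Pick the read path the way the patched function does: clear the direct
 * flag first when verity is active, then dispatch on DAX, direct, buffered.
 */
static enum example_read_path
example_pick_read_path(unsigned int *ki_flags, bool verity_active, bool is_dax)
{
	assert(ki_flags != NULL);

	/* Clearing the flag keeps later generic code from retrying direct I/O. */
	if (verity_active)
		*ki_flags &= ~EXAMPLE_IOCB_DIRECT;

	if (is_dax)
		return EXAMPLE_READ_DAX;
	if (*ki_flags & EXAMPLE_IOCB_DIRECT)
		return EXAMPLE_READ_DIRECT;
	return EXAMPLE_READ_BUFFERED;
}
```

In this model a direct read on a verity file ends up on the buffered path with the direct bit cleared, while the same request on a non-verity file still takes the direct path.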


* [PATCH v8 12/22] xfs: don't allow to enable DAX on fs-verity sealed inode
From: Andrey Albershteyn @ 2026-04-20 11:46 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260420114714.1621982-1-aalbersh@kernel.org>

fs-verity doesn't support DAX. Forbid the filesystem from enabling DAX on
inodes which already have fs-verity enabled. The opposite case is checked
when fs-verity is enabled: it won't be enabled if DAX is.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ca369eb96561..17efc83a86ed 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1387,6 +1387,8 @@ xfs_inode_should_enable_dax(
 		return false;
 	if (!xfs_inode_supports_dax(ip))
 		return false;
+	if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+		return false;
 	if (xfs_has_dax_always(ip->i_mount))
 		return true;
 	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
-- 
2.51.2
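
To make the ordering of the checks concrete, here is a rough userspace model of the part of xfs_inode_should_enable_dax() visible in this hunk. The struct and function names are hypothetical, checks that precede the shown context are omitted, and the final return is an assumption because the hunk ends before that line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the inputs the checks consult. */
struct example_inode {
	bool	supports_dax;		/* models xfs_inode_supports_dax() */
	bool	verity;			/* models XFS_DIFLAG2_VERITY */
	bool	dax_flag;		/* models XFS_DIFLAG2_DAX */
	bool	mount_dax_always;	/* models xfs_has_dax_always() */
};

static bool example_should_enable_dax(const struct example_inode *ino)
{
	assert(ino != NULL);

	if (!ino->supports_dax)
		return false;
	/* New check: verity wins before any DAX-enabling condition is tested. */
	if (ino->verity)
		return false;
	if (ino->mount_dax_always)
		return true;
	/* Assumption: the per-inode DAX flag decides the remaining cases. */
	return ino->dax_flag;
}
```

Because the verity test sits ahead of the dax=always test, this model never enables DAX on a verity inode even when the mount option asks for it everywhere.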

