Linux EXT4 FS development
* Re: [PATCH v2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
From: Guan-Chun Wu @ 2026-04-10  6:28 UTC
  To: Theodore Tso
  Cc: adilger.kernel, linux-ext4, linux-kernel, visitorckw,
	david.laight.linux
In-Reply-To: <20260409141050.GA59468@macsyma-wired.lan>

On Thu, Apr 09, 2026 at 10:10:50AM -0400, Theodore Tso wrote:
> On Sat, Nov 22, 2025 at 12:39:29PM +0800, Guan-Chun Wu wrote:
> > The original byte-by-byte implementation with modulo checks is less
> > efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
> > to process input in explicit 4-byte chunks instead of using a
> > modulus-based loop to emit words byte by byte.
> > 
> > Additionally, the use of function pointers for selecting the appropriate
> > str2hashbuf implementation has been removed. Instead, the functions are
> > directly invoked based on the hash type, eliminating the overhead of
> > dynamic function calls.
> > 
> > Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
> > runs, using kernel module for testing):
> > 
> >     len | orig_s | new_s | orig_u | new_u
> >     ----+--------+-------+--------+-------
> >       1 |   70   |   71  |   63   |   63
> >       8 |   68   |   64  |   64   |   62
> >      32 |   75   |   70  |   75   |   63
> >      64 |   96   |   71  |  100   |   68
> >     255 |  192   |  108  |  187   |   84
> > 
> > This change improves performance, especially for larger input sizes.
> > 
> > Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
> 
> Apologies for the delay in looking at this.  It fell through the
> cracks on my end.
> 
> Because of how I'm a bit late with reviewing patches before the merge
> window, I'm going to be very conservative in which patches I'm going
> to land.  So this is going to be deferred until the next cycle, but I
> wanted to let you know that I haven't forgotten about it.
> 
> If this was a comprehensive set of Kunit tests for fs/ext4/hash.c, I
> might have taken it.  And that's something that I would look at adding
> for the next cycle, but if you'd be interested in creating the kunit
> tests for hash.c, that would be great.
> 
> 						- Ted

Thanks for the update.

I'd be happy to add Kunit tests for fs/ext4/hash.c. I'll work on them and
send a v3 patchset with the tests and the optimization in the next cycle.

Best regards,
Guan-Chun
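
For readers following the thread, the refactoring described in the quoted
patch can be sketched in userspace C as below. This is a minimal
illustration, not the v2 patch itself: hashbuf_bytewise() mirrors the shape
of a byte-by-byte loop with a modulo check, and hashbuf_chunked() consumes
explicit 4-byte chunks. The function names, the <stdint.h> types and the
padding scheme in make_pad() are assumptions modelled on the unsigned
variant in fs/ext4/hash.c; the signed variant is not shown.

```c
#include <stdint.h>

/* Padding word derived from the length (assumed scheme for this sketch). */
static uint32_t make_pad(int len)
{
	uint32_t pad = (uint32_t)len | ((uint32_t)len << 8);

	return pad | (pad << 16);
}

/* Byte-by-byte variant: a word is emitted whenever i % 4 == 3. */
static void hashbuf_bytewise(const char *msg, int len, uint32_t *buf, int num)
{
	const unsigned char *ucp = (const unsigned char *)msg;
	uint32_t pad = make_pad(len);
	uint32_t val = pad;
	int i;

	if (len > num * 4)
		len = num * 4;
	for (i = 0; i < len; i++) {
		val = ucp[i] + (val << 8);
		if ((i % 4) == 3) {
			*buf++ = val;
			val = pad;
			num--;
		}
	}
	if (--num >= 0)
		*buf++ = val;
	while (--num >= 0)
		*buf++ = pad;
}

/* Chunked variant: full 4-byte groups first, then the 0..3 byte tail. */
static void hashbuf_chunked(const char *msg, int len, uint32_t *buf, int num)
{
	const unsigned char *ucp = (const unsigned char *)msg;
	uint32_t pad = make_pad(len);
	uint32_t val;
	int i;

	if (len > num * 4)
		len = num * 4;
	while (len >= 4) {
		/* Four shifts push the pad out completely, so no pad bits remain. */
		*buf++ = ((uint32_t)ucp[0] << 24) | ((uint32_t)ucp[1] << 16) |
			 ((uint32_t)ucp[2] << 8) | (uint32_t)ucp[3];
		ucp += 4;
		len -= 4;
		num--;
	}
	/* The tail starts from the pad, exactly like the bytewise loop. */
	val = pad;
	for (i = 0; i < len; i++)
		val = ucp[i] + (val << 8);
	if (--num >= 0)
		*buf++ = val;
	while (--num >= 0)
		*buf++ = pad;
}
```

Comparing the two functions over a range of lengths is the kind of
equivalence check a KUnit test for hash.c could perform before and after the
optimization.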


* [PATCH] ext4: make mballoc max prealloc size configurable
From: guzebing @ 2026-04-10  3:56 UTC
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, guzebing
  Cc: linux-kernel, linux-ext4

From: Guzebing <guzebing@bytedance.com>

Add a per-superblock sysfs knob, mb_max_prealloc_kb (minimum 8 MiB, rounded
up to a power of two), and use it in request normalization.

When multiple tasks write to different files on the same filesystem
concurrently, each file ends up with 8 MiB extents. If the preallocation
size is increased, the resulting extent size grows accordingly. Due
to the readahead mechanism on NVMe SSDs, files with larger extents
achieve higher sequential read throughput.

On an ext4 filesystem on an NVMe Gen4 data drive, dd read throughput
for a file with 8 MiB extents is 455 MB/s, while for a file with
32 MiB extents it reaches 702 MB/s.

Steps to reproduce:
1. Configure the maximum preallocation size to 8 MiB or 32 MiB:
echo 8192 > /sys/fs/ext4/nvme13n1/mb_max_prealloc_kb
echo 32768 > /sys/fs/ext4/nvme13n1/mb_max_prealloc_kb

2. Run the following commands simultaneously so that the extents of
the two files are physically interleaved, resulting in 8 MiB or 32 MiB
extents:
dd if=/dev/zero of=/mnt/store1/501.txt bs=128K count=80K oflag=direct
dd if=/dev/zero of=/mnt/store1/502.txt bs=128K count=80K oflag=direct

3. Read back the file and measure the read throughput:
dd if=/mnt/store1/501.txt of=/dev/null bs=128K count=80K iflag=direct

Signed-off-by: Guzebing <guzebing@bytedance.com>
---
 Documentation/ABI/testing/sysfs-fs-ext4 |  8 +++++++
 fs/ext4/ext4.h                          |  1 +
 fs/ext4/mballoc.c                       |  2 +-
 fs/ext4/super.c                         |  1 +
 fs/ext4/sysfs.c                         | 28 ++++++++++++++++++++++++-
 5 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-fs-ext4 b/Documentation/ABI/testing/sysfs-fs-ext4
index 2edd0a6672d3a..316ae1d1ec18b 100644
--- a/Documentation/ABI/testing/sysfs-fs-ext4
+++ b/Documentation/ABI/testing/sysfs-fs-ext4
@@ -48,6 +48,14 @@ Description:
 		will have its blocks allocated out of its own unique
 		preallocation pool.
 
+What:		/sys/fs/ext4/<disk>/mb_max_prealloc_kb
+Date:		April 2026
+Contact:	"Linux Ext4 Development List" <linux-ext4@vger.kernel.org>
+Description:
+		Maximum size (in kilobytes) used by the multiblock allocator's
+		normalized request preallocation heuristic. Values are rounded
+		up to a power of two and clamped to a minimum of 8192 (8MiB).
+
 What:		/sys/fs/ext4/<disk>/inode_readahead_blks
 Date:		March 2008
 Contact:	"Theodore Ts'o" <tytso@mit.edu>
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7617e2d454ea5..bce99740740f5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1634,6 +1634,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_best_avail_max_trim_order;
 	unsigned int s_sb_update_sec;
 	unsigned int s_sb_update_kb;
+	unsigned int s_mb_max_prealloc_kb;
 
 	/* where last allocation was done - for stream allocation */
 	ext4_group_t *s_mb_last_groups;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bb58eafb87bcd..f5f63c56fcdac 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4589,7 +4589,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 					(8<<20)>>bsbits, max, 8 * 1024)) {
 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
 							(23 - bsbits)) << 23;
-		size = 8 * 1024 * 1024;
+		size = (loff_t)sbi->s_mb_max_prealloc_kb << 10;
 	} else {
 		start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
 		size	  = (loff_t) EXT4_C2B(sbi,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a34efb44e73d7..f815e31657cc9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5447,6 +5447,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 		sbi->s_stripe = 0;
 	}
 	sbi->s_extent_max_zeroout_kb = 32;
+	sbi->s_mb_max_prealloc_kb = 8 * 1024;
 
 	/*
 	 * set up enough so that it can read an inode
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 923b375e017fa..6339492eb2fa7 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -10,6 +10,8 @@
 
 #include <linux/time.h>
 #include <linux/fs.h>
+#include <linux/log2.h>
+#include <linux/limits.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/proc_fs.h>
@@ -41,6 +43,7 @@ typedef enum {
 	attr_pointer_atomic,
 	attr_journal_task,
 	attr_err_report_sec,
+	attr_mb_max_prealloc_kb,
 } attr_id_t;
 
 typedef enum {
@@ -115,6 +118,25 @@ static ssize_t reserved_clusters_store(struct ext4_sb_info *sbi,
 	return count;
 }
 
+static ssize_t mb_max_prealloc_kb_store(struct ext4_sb_info *sbi,
+					const char *buf, size_t count)
+{
+	unsigned int v;
+	int ret;
+	unsigned long rounded;
+
+	ret = kstrtouint(skip_spaces(buf), 0, &v);
+	if (ret)
+		return ret;
+	if (v < 8192)
+		v = 8192;
+	rounded = roundup_pow_of_two((unsigned long)v);
+	if (rounded > UINT_MAX)
+		return -EINVAL;
+	sbi->s_mb_max_prealloc_kb = (unsigned int)rounded;
+	return count;
+}
+
 static ssize_t trigger_test_error(struct ext4_sb_info *sbi,
 				  const char *buf, size_t count)
 {
@@ -288,6 +310,7 @@ EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
 EXT4_RW_ATTR_SBI_UI(sb_update_sec, s_sb_update_sec);
 EXT4_RW_ATTR_SBI_UI(sb_update_kb, s_sb_update_kb);
+EXT4_ATTR_OFFSET(mb_max_prealloc_kb, 0644, mb_max_prealloc_kb, ext4_sb_info, s_mb_max_prealloc_kb);
 
 static unsigned int old_bump_val = 128;
 EXT4_ATTR_PTR(max_writeback_mb_bump, 0444, pointer_ui, &old_bump_val);
@@ -341,6 +364,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(last_trim_minblks),
 	ATTR_LIST(sb_update_sec),
 	ATTR_LIST(sb_update_kb),
+	ATTR_LIST(mb_max_prealloc_kb),
 	ATTR_LIST(err_report_sec),
 	NULL,
 };
@@ -431,6 +455,7 @@ static ssize_t ext4_generic_attr_show(struct ext4_attr *a,
 	case attr_mb_order:
 	case attr_pointer_pi:
 	case attr_pointer_ui:
+	case attr_mb_max_prealloc_kb:
 		if (a->attr_ptr == ptr_ext4_super_block_offset)
 			return sysfs_emit(buf, "%u\n", le32_to_cpup(ptr));
 		return sysfs_emit(buf, "%u\n", *((unsigned int *) ptr));
@@ -557,6 +582,8 @@ static ssize_t ext4_attr_store(struct kobject *kobj,
 		return reserved_clusters_store(sbi, buf, len);
 	case attr_inode_readahead:
 		return inode_readahead_blks_store(sbi, buf, len);
+	case attr_mb_max_prealloc_kb:
+		return mb_max_prealloc_kb_store(sbi, buf, len);
 	case attr_trigger_test_error:
 		return trigger_test_error(sbi, buf, len);
 	case attr_err_report_sec:
@@ -695,4 +722,3 @@ void ext4_exit_sysfs(void)
 	remove_proc_entry(proc_dirname, NULL);
 	ext4_proc_root = NULL;
 }
-
-- 
2.20.1
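
The clamping and rounding performed by mb_max_prealloc_kb_store() in the
patch above can be tried out in userspace with the sketch below.
normalize_prealloc_kb() is a hypothetical name, the loop stands in for the
kernel's roundup_pow_of_two(), and a 64-bit intermediate is assumed so that
a result larger than UINT_MAX can be detected (as with unsigned long on a
64-bit kernel). It is an illustration of the arithmetic only, not a
replacement for the sysfs handler.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define MIN_PREALLOC_KB 8192u	/* 8 MiB floor, as stated in the ABI text */

/*
 * Clamp the requested value to the floor, round it up to the next power of
 * two, and reject results that no longer fit in an unsigned int.
 * Returns 0 and stores the result in *out, or -EINVAL on overflow.
 */
static int normalize_prealloc_kb(unsigned int v, unsigned int *out)
{
	uint64_t rounded = 1;

	if (v < MIN_PREALLOC_KB)
		v = MIN_PREALLOC_KB;
	/* Stand-in for roundup_pow_of_two(): smallest 2^k with 2^k >= v. */
	while (rounded < v)
		rounded <<= 1;
	if (rounded > UINT_MAX)
		return -EINVAL;
	*out = (unsigned int)rounded;
	return 0;
}
```

Under this model a write of 20000 is stored as 32768, and any value below
8192 is stored as 8192, which matches the default set in
__ext4_fill_super().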



* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
From: Theodore Tso @ 2026-04-10  1:18 UTC
  To: Li Chen
  Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel
In-Reply-To: <20260120112538.132774-1-me@linux.beauty>

On Tue, Jan 20, 2026 at 07:25:29PM +0800, Li Chen wrote:
> Hi,
> 
> (This RFC v4 series is based on linux-next tag next-20260106, plus the
> prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
> https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)

Can you take a look at the Sashiko reviews here:

    https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty

There seems to be at least one legitimate concern, which is the
potential cur_lblk overflow.  There are a couple of others which I
think are real; could you please look at their review comments?

Thanks,

					- Ted


* [GIT PULL 2/2] fuse4fs: fork a low level fuse server
From: Darrick J. Wong @ 2026-04-09 20:15 UTC
  To: tytso; +Cc: amir73il, djwong, linux-ext4
In-Reply-To: <20260409201302.GD6192@frogsfrogsfrogs>

Hi Ted,

Please pull this branch with changes for ext4.

As usual, I did a test-merge with the main upstream branch as of a few
minutes ago, and didn't see any conflicts.  Please let me know if you
encounter any problems.

The following changes since commit c137760397ef6671832548bcd256778f57f49c2d:

fuse2fs: drop fuse 2.x support code (2026-04-09 13:00:40 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git tags/fuse4fs-fork_2026-04-09

for you to fetch changes up to 951a10258cada0508e185c58ebec74c47b0eb774:

fuse4fs: create incore reverse orphan list (2026-04-09 13:00:42 -0700)

----------------------------------------------------------------
fuse4fs: fork a low level fuse server [02/11]

Whilst developing the fuse2fs+iomap prototype, I discovered a
fundamental design limitation of the upper-level libfuse API: hardlinks.
The upper level fuse library really wants to communicate with the fuse
server with file paths, instead of using inode numbers.  This works
great for filesystems that don't have inodes, create files dynamically
at runtime, or lack stable inode numbers.

Unfortunately, the libfuse path abstraction assigns a unique nodeid to
every child file in the entire filesystem, without regard to hard links.
In other words, a hardlinked regular file may have one ondisk inode
number but multiple kernel inodes.  For classic fuse2fs this isn't a
problem because all file access goes through the fuse server and the big
library lock protects us from corruption.

For fuse2fs + iomap this is a disaster because we rely on the kernel to
coordinate access to inodes.  For hardlinked files, we *require* that
there only be one in-kernel inode for each ondisk inode.

The path based mechanism is also very inefficient for fuse2fs.  Every
time a file is accessed, the upper level libfuse passes a new nodeid to
the kernel, and on every file access the kernel passes that same nodeid
back to libfuse.  libfuse then walks its internal directory entry cache
to construct a path string for that nodeid and hands it to fuse2fs.
fuse2fs then walks the ondisk directory structure to find the ext2 inode
number.  Every time.

Create a new fuse4fs server from fuse2fs that uses the lowlevel fuse
API.  This affords us direct control over nodeids and eliminates the
path wrangling.  Hardlinks can be supported when iomap is turned on,
and metadata-heavy workloads run twice as fast.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

----------------------------------------------------------------
Darrick J. Wong (23):
fuse2fs: separate libfuse3 and fuse2fs detection in configure
fuse2fs: start porting fuse2fs to lowlevel libfuse API
debian: create new package for fuse4fs
fuse4fs: namespace some helpers
fuse4fs: convert to low level API
libsupport: port the kernel list.h to libsupport
libsupport: add a cache
cache: disable debugging
cache: use modern list iterator macros
cache: embed struct cache in the owner
cache: pass cache pointer to callbacks
cache: pass a private data pointer through cache_walk
cache: add a helper to grab a new refcount for a cache_node
cache: return results of a cache flush
cache: add a "get only if incore" flag to cache_node_get
cache: support gradual expansion
cache: support updating maxcount and flags
cache: support channging flags
cache: implement automatic shrinking
fuse4fs: add cache to track open files
fuse4fs: use the orphaned inode list
fuse4fs: implement FUSE_TMPFILE
fuse4fs: create incore reverse orphan list

lib/ext2fs/jfs_compat.h  |    2 +-
lib/ext2fs/kernel-list.h |  111 -
lib/support/cache.h      |  184 ++
lib/support/list.h       |  901 +++++++
lib/support/xbitops.h    |  128 +
Makefile.in              |    3 +-
configure                |  441 ++--
configure.ac             |  156 +-
debian/control           |   12 +-
debian/fuse4fs.install   |    2 +
debian/fuse4fs.links     |    3 +
debian/rules             |   11 +
debugfs/Makefile.in      |   12 +-
e2fsck/Makefile.in       |   56 +-
fuse4fs/Makefile.in      |  193 ++
fuse4fs/fuse4fs.1.in     |  118 +
fuse4fs/fuse4fs.c        | 6451 ++++++++++++++++++++++++++++++++++++++++++++++
lib/config.h.in          |    3 +
lib/e2p/Makefile.in      |    4 +-
lib/ext2fs/Makefile.in   |   14 +-
lib/support/Makefile.in  |    8 +-
lib/support/cache.c      |  882 +++++++
misc/Makefile.in         |   18 +-
misc/tune2fs.c           |    4 -
24 files changed, 9248 insertions(+), 469 deletions(-)
delete mode 100644 lib/ext2fs/kernel-list.h
create mode 100644 lib/support/cache.h
create mode 100644 lib/support/list.h
create mode 100644 lib/support/xbitops.h
create mode 100644 debian/fuse4fs.install
create mode 100644 debian/fuse4fs.links
create mode 100644 fuse4fs/Makefile.in
create mode 100644 fuse4fs/fuse4fs.1.in
create mode 100644 fuse4fs/fuse4fs.c
create mode 100644 lib/support/cache.c



* [GIT PULL 1/2] fuse2fs: upgrade to libfuse 3.17
From: Darrick J. Wong @ 2026-04-09 20:14 UTC
  To: tytso; +Cc: amir73il, djwong, linux-ext4
In-Reply-To: <20260409201302.GD6192@frogsfrogsfrogs>

Hi Ted,

Please pull this branch with changes for ext4.

As usual, I did a test-merge with the main upstream branch as of a few
minutes ago, and didn't see any conflicts.  Please let me know if you
encounter any problems.

The following changes since commit 43643a57fb2d3368fbacd181a8cd713102d52a1a:

tests/f_opt_extent: use tune2fs from the build tree in the test script (2026-04-03 11:20:06 -0400)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git tags/fuse2fs-library-upgrade_2026-04-09

for you to fetch changes up to c137760397ef6671832548bcd256778f57f49c2d:

fuse2fs: drop fuse 2.x support code (2026-04-09 13:00:40 -0700)

----------------------------------------------------------------
fuse2fs: upgrade to libfuse 3.17 [01/11]

In preparation to start hacking on fuse2fs and iomap, upgrade fuse2fs
library support to 3.17, which is the latest upstream release as of this
writing.  Drop support for libfuse2, which is now very obsolete.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

----------------------------------------------------------------
Darrick J. Wong (5):
libsupport: change get_thread_id return type to unsigned long long
fuse2fs: bump library version
fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse
fuse2fs: disable nfs exports
fuse2fs: drop fuse 2.x support code

lib/support/thread.h |   2 +-
configure            | 358 +++++----------------------------------------------
configure.ac         |  85 +++++-------
lib/support/thread.c |   4 +-
misc/fuse2fs.c       | 252 ++++++++----------------------------
5 files changed, 118 insertions(+), 583 deletions(-)



* [PULLBOMB v3 1.48] fuse4fs: drop libfuse2, add new server
From: Darrick J. Wong @ 2026-04-09 20:13 UTC
  To: Theodore Ts'o; +Cc: linux-ext4, amir73il

Hi Ted,

As promised during this morning's ext4 call, this large patchbomb
contains the remaining improvements that I'd like to make to fuse2fs
before integrating iomap.

The first pull request drops libfuse2 support from e2fsprogs.

The second pull request creates a new fuse ext* server (fuse4fs) which
uses the lowlevel FUSE API instead of the high level one.  The major
advantage of using the lowlevel API is that all file operations are
performed in terms of inodes instead of paths.  As a result, fuse4fs has
MUCH less overhead than fuse2fs because we avoid the overhead of having
libfuse translate inode numbers to paths only to have fuse2fs translate
paths back into inode numbers.

Obviously, this stuff should go into e2fsprogs 1.48, not a 1.47.x stable
release.  This patchbomb hasn't changed much since its last posting on
12 Mar 2026.  The range-diff between then and now is appended below;
there aren't really any changes other than the effects of rebasing
against today's next branch and adapting to using autoheader(1).

One curious thing: I saw in [1] that you applied my get_thread_id
change, but it isn't in -next yet.  These PRs might not merge properly.
Obviously it's no big deal for me to rebase and send new PRs.

--D

 -:  -------------- >  1:  91f5b6651553fe fuse2fs: bump library version
 1:  7e7bb349a384f7 =  2:  cf80ae2e89ba2f fuse2fs: wrap the fuse_set_feature_flag helper for older libfuse
 2:  631d3f8d14b613 =  3:  c7d947ac718bb4 fuse2fs: disable nfs exports
 3:  96ca2a29cc7dcb !  4:  c137760397ef66 fuse2fs: drop fuse 2.x support code
    @@ configure.ac: then
      	AC_DEFINE_UNQUOTED(FUSE_USE_VERSION, $FUSE_USE_VERSION,
      		[Define to the version of FUSE to use])
      fi
     
      ## misc/fuse2fs.c ##
     @@
    - #endif /* __SET_FOB_FOR_FUSE */
      #include <inttypes.h>
      #include "ext2fs/ext2fs.h"
      #include "ext2fs/ext2_fs.h"
      #include "ext2fs/ext2fsP.h"
      #include "support/bthread.h"
    + #include "support/thread.h"
     -#if FUSE_VERSION >= FUSE_MAKE_VERSION(3, 0)
     -# define FUSE_PLATFORM_OPTS	""
     -#else
     -# ifdef __linux__
     -#  define FUSE_PLATFORM_OPTS	",use_ino,big_writes"
     -# else
 4:  e048f198d7b598 =  5:  239f4b7ac05b69 fuse2fs: separate libfuse3 and fuse2fs detection in configure
 5:  3b02fa292dda7c !  6:  1536edf8a93539 fuse2fs: start porting fuse2fs to lowlevel libfuse API
    @@ fuse4fs/fuse4fs.c (new)
     +#endif /* __SET_FOB_FOR_FUSE */
     +#include <inttypes.h>
     +#include "ext2fs/ext2fs.h"
     +#include "ext2fs/ext2_fs.h"
     +#include "ext2fs/ext2fsP.h"
     +#include "support/bthread.h"
    ++#include "support/thread.h"
     +
     +#include "../version.h"
     +#include "uuid/uuid.h"
     +#include "e2p/e2p.h"
     +
     +#ifdef ENABLE_NLS
    @@ fuse4fs/fuse4fs.c (new)
     +	m = b % align;
     +	return b - m;
     +}
     +
     +#define dbg_printf(fuse4fs, format, ...) \
     +	while ((fuse4fs)->debug) { \
    -+		printf("FUSE4FS (%s): tid=%d " format, (fuse4fs)->shortdev, gettid(), ##__VA_ARGS__); \
    ++		printf("FUSE4FS (%s): tid=%llu " format, (fuse4fs)->shortdev, get_thread_id(), ##__VA_ARGS__); \
     +		fflush(stdout); \
     +		break; \
     +	}
     +
     +#define log_printf(fuse4fs, format, ...) \
     +	do { \
    @@ fuse4fs/fuse4fs.c (new)
     +		ret = -EUCLEAN;
     +		break;
     +	case EIO:
     +#ifdef EILSEQ
     +	case EILSEQ:
     +#endif
    ++#if EUCLEAN != EIO
     +	case EUCLEAN:
    ++#endif
     +		/* these errnos usually denote corruption or persistence fail */
     +		is_err = 1;
     +		ret = -err;
     +		break;
     +	default:
     +		if (err < 256) {
    @@ fuse4fs/fuse4fs.c (new)
     +
     +	return ret;
     +}
     
      ## lib/config.h.in ##
     @@
    - /* Define to 1 if you have the BSD-style 'qsort_r' function. */
    - #undef HAVE_BSD_QSORT_R
    + /* Define to 1 if fuse supports cache_readdir */
    + #undef HAVE_FUSE_CACHE_READDIR
      
    - /* Define to 1 if PR_SET_IO_FLUSHER is present */
    - #undef HAVE_PR_SET_IO_FLUSHER
    + /* Define to 1 if you have the <fuse.h> header file. */
    + #undef HAVE_FUSE_H
      
     +/* Define to 1 if fuse supports lowlevel API */
     +#undef HAVE_FUSE_LOWLEVEL
     +
    - /* Define to 1 if you have the Mac OS X function
    -    CFLocaleCopyPreferredLanguages in the CoreFoundation framework. */
    - #undef HAVE_CFLOCALECOPYPREFERREDLANGUAGES
    + /* Define to 1 if you have the 'futimes' function. */
    + #undef HAVE_FUTIMES
    + 
    + /* Define to 1 if you have the 'getcwd' function. */
    + #undef HAVE_GETCWD
      
    - /* Define to 1 if you have the Mac OS X function CFPreferencesCopyAppValue in
    -    the CoreFoundation framework. */
 6:  480ffd57c3152f =  7:  de29e11fda2d6b debian: create new package for fuse4fs
 7:  eb97e1fefbf286 =  8:  d4aa426efc0f3c fuse4fs: namespace some helpers
 8:  7a2e6084f203a4 =  9:  a53dd84ef7ce0c fuse4fs: convert to low level API
 9:  8072dafea5ba54 = 10:  37dafcc0894b89 libsupport: port the kernel list.h to libsupport
10:  df5e76db0018b3 ! 11:  30b3c80ed6bcc5 libsupport: add a cache
    @@ lib/support/xbitops.h (new)
     +#define rounddown_pow_of_two(n) __rounddown_pow_of_two(n)
     +
     +#endif
     
      ## lib/support/Makefile.in ##
     @@ lib/support/Makefile.in: OBJS=		bthread.o \
    - 		profile_helpers.o \
      		prof_err.o \
      		quotaio.o \
      		quotaio_v2.o \
      		quotaio_tree.o \
    + 		thread.o \
      		dict.o \
     -		devname.o
     +		devname.o \
     +		cache.o
      
      SRCS=		$(srcdir)/argv_parse.c \
      		$(srcdir)/bthread.c \
      		$(srcdir)/cstring.c \
      		$(srcdir)/mkquota.c \
      		$(srcdir)/parse_qtype.c \
     @@ lib/support/Makefile.in: SRCS=		$(srcdir)/argv_parse.c \
    - 		$(srcdir)/profile_helpers.c \
      		prof_err.c \
      		$(srcdir)/quotaio.c \
      		$(srcdir)/quotaio_tree.c \
      		$(srcdir)/quotaio_v2.c \
    + 		$(srcdir)/thread.c \
      		$(srcdir)/dict.c \
     -		$(srcdir)/devname.c
     +		$(srcdir)/devname.c \
     +		$(srcdir)/cache.c
      
      LIBRARY= libsupport
      LIBDIR= support
      
      @MAKEFILE_LIBRARY@
      @MAKEFILE_PROFILE@
     @@ lib/support/Makefile.in: quotaio_v2.o: $(srcdir)/quotaio_v2.c $(top_builddir)/lib/config.h \
    -  $(top_srcdir)/lib/ext2fs/bitops.h $(srcdir)/dqblk_v2.h \
    -  $(srcdir)/quotaio_tree.h
    + thread.o: $(srcdir)/thread.c $(top_builddir)/lib/config.h \
    +  $(top_builddir)/lib/dirpaths.h $(srcdir)/thread.h
      dict.o: $(srcdir)/dict.c $(top_builddir)/lib/config.h \
       $(top_builddir)/lib/dirpaths.h $(srcdir)/dict.h
      devname.o: $(srcdir)/devname.c $(top_builddir)/lib/config.h \
       $(top_builddir)/lib/dirpaths.h $(srcdir)/devname.h $(srcdir)/nls-enable.h
     +cache.o: $(srcdir)/cache.c $(top_builddir)/lib/config.h \
     + $(srcdir)/cache.h $(srcdir)/list.h $(srcdir)/xbitops.h
11:  f13414a72be32f = 12:  fb17b6521f0430 cache: disable debugging
12:  383f3dd0b56f64 = 13:  b6d5640bdf0086 cache: use modern list iterator macros
13:  a9c0181ee15128 = 14:  0be92e2d351193 cache: embed struct cache in the owner
14:  31e5763e32dfc2 = 15:  a486e89007fb55 cache: pass cache pointer to callbacks
15:  a5c773947bd56d = 16:  ae4911b183fa9d cache: pass a private data pointer through cache_walk
16:  4f017b4dc8920f = 17:  9c5af5b9d39a70 cache: add a helper to grab a new refcount for a cache_node
17:  d1dea709493fdc = 18:  c6ed3f93d43660 cache: return results of a cache flush
18:  e59f4eba2f22f8 = 19:  a13c635ef81b43 cache: add a "get only if incore" flag to cache_node_get
19:  a999b4c74344cd = 20:  cbd9dbc6943267 cache: support gradual expansion
20:  019aa0d8a8c87d = 21:  5d91f82ff155b9 cache: support updating maxcount and flags
21:  0283835cbfa2db = 22:  7134be34c002f5 cache: support channging flags
22:  0f5c83c6c88c3f = 23:  2a8ddebbfa9298 cache: implement automatic shrinking
23:  26f78774c08acc ! 24:  dd17627e00495e fuse4fs: add cache to track open files
    @@ fuse4fs/fuse4fs.c
      #ifdef __SET_FOB_FOR_FUSE
      # error Do not set magic value __SET_FOB_FOR_FUSE!!!!
      #endif
      #ifndef _FILE_OFFSET_BITS
      /*
     @@
    - #endif /* __SET_FOB_FOR_FUSE */
      #include <inttypes.h>
      #include "ext2fs/ext2fs.h"
      #include "ext2fs/ext2_fs.h"
      #include "ext2fs/ext2fsP.h"
      #include "support/bthread.h"
    + #include "support/thread.h"
     +#include "support/list.h"
     +#include "support/cache.h"
      
      #include "../version.h"
      #include "uuid/uuid.h"
      #include "e2p/e2p.h"
24:  d6a2e3be991671 = 25:  2ea53b826253b6 fuse4fs: use the orphaned inode list
25:  2a4c9c2e579556 = 26:  6bd1919297a22f fuse4fs: implement FUSE_TMPFILE
26:  38cd25631692e6 = 27:  951a10258cada0 fuse4fs: create incore reverse orphan list


* Re: [PATCH 2/2] ext4: align preallocation size to stripe width
From: David Laight @ 2026-04-09 18:59 UTC
  To: Theodore Tso; +Cc: Yu Kuai, adilger.kernel, linux-ext4, linux-kernel
In-Reply-To: <20260409142911.GB59468@macsyma-wired.lan>

On Thu, 9 Apr 2026 10:29:11 -0400
"Theodore Tso" <tytso@mit.edu> wrote:

> On Mon, Dec 08, 2025 at 04:32:46PM +0800, Yu Kuai wrote:
> > When stripe width (io_opt) is configured, align the predicted
> > preallocation size to stripe boundaries. This ensures optimal I/O
> > performance on RAID and other striped storage devices by avoiding
> > partial stripe operations.
> > 
> > The current implementation uses hardcoded size predictions (16KB, 32KB,
> > 64KB, etc.) that are not stripe-aware. This causes physical block
> > offsets on disk to be misaligned to stripe boundaries, leading to
> > read-modify-write penalties on RAID arrays and reduced performance.
> > 
> > This patch makes size prediction stripe-aware by using multiples of
> > stripe size (1x, 2x, 4x, 8x, 16x, 32x) when s_stripe is set.
> > Additionally, the start offset is aligned to stripe boundaries using
> > rounddown(), which works correctly for both power-of-2 and non-power-of-2
> > stripe sizes. For devices without stripe configuration, the original
> > behavior is preserved.
> > ...  
> 
> Hi Yu,
> 
> Did you see the build failures reported by the kernel build bot on the
> i386[1] and arm[2] platforms?  The problem appears to be using
> roundup() and rounddown() on unsigned long types.

Looks like that whole condition chain should be replaced with something
based on ilog2() (or some other bit-scan function).

	David

> 
> [1] https://lore.kernel.org/r/202512102331.yweFnVTU-lkp@intel.com
> [2] https://lore.kernel.org/r/202512120613.mM5COVWV-lkp@intel.com
> 
> We can't apply your patch until this issue is addressed.
> 
> Thanks,
> 
> 					- Ted
> 
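
One possible reading of the ilog2() suggestion, sketched in userspace C:
pick the smallest power-of-two stripe multiple from the 1x..32x ladder that
covers the request, using a bit scan instead of a chain of comparisons.
stripe_multiplier() and ilog2_u64() are hypothetical names, the 32x cap is
taken from the patch description quoted above, and returning 0 for an unset
stripe is an assumption. This is not the code from the patch or from any
follow-up.

```c
#include <stdint.h>

/* Floor of log2(v) for v > 0; a portable stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Return the stripe multiplier (1, 2, 4, 8, 16 or 32) whose product with
 * 'stripe' is the smallest one covering 'size'. Returns 0 when no stripe is
 * configured, so a caller could fall back to its non-striped behaviour.
 *
 * Plain C division is used here; kernel code handling 64-bit values on
 * 32-bit targets would need the 64-bit division helpers instead.
 */
static unsigned int stripe_multiplier(uint64_t size, uint64_t stripe)
{
	uint64_t units;
	unsigned int order;

	if (stripe == 0)
		return 0;
	/* Number of whole stripes needed, rounded up. */
	units = size / stripe + (size % stripe != 0);
	if (units <= 1)
		return 1;
	/* ceil(log2(units)) via a bit scan on units - 1. */
	order = ilog2_u64(units - 1) + 1;
	if (order > 5)
		order = 5;	/* cap at 32x */
	return 1u << order;
}
```

Because the multiplier is applied to the stripe size itself, the result
stays stripe-aligned for stripe sizes that are not powers of two.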



* Re: [PATCH 00/61] treewide: Use IS_ERR_OR_NULL over manual NULL check - refactor
From: Al Viro @ 2026-04-09 18:16 UTC
  To: Philipp Hahn
  Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Julia Lawall, Nicolas Palix, Chris Mason,
	David Sterba, Ilya Dryomov, Alex Markuze, Viacheslav Dubeyko,
	Theodore Ts'o, Andreas Dilger, Steve French, Paulo Alcantara,
	Ronnie Sahlberg, Shyam Prasad N, Tom Talpey, Bharath SM,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, Gao Xiang, Chao Yu, Yue Hu, Jeffle Xu,
	Sandeep Dhavale, Hongbo Li, Chunhai Guo, Miklos Szeredi,
	Konstantin Komarov, Andreas Gruenbacher, Kees Cook, Tony Luck,
	Guilherme G. Piccoli, Jan Kara, Phillip Lougher,
	Christian Brauner, Jan Kara, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Tejun Heo, David Vernet, Andrea Righi,
	Changwoo Min, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Luis Chamberlain, Petr Pavlu, Daniel Gomez,
	Sami Tolvanen, Aaron Tomlin, Sylwester Nawrocki, Liam Girdwood,
	Mark Brown, Jaroslav Kysela, Takashi Iwai, Max Filippov,
	Paolo Bonzini, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Andrew Morton, Alasdair Kergon, Mike Snitzer,
	Mikulas Patocka, Benjamin Marzinski, David S. Miller, David Ahern,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Marcel Holtmann, Johan Hedberg, Luiz Augusto von Dentz,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Stanislav Fomichev, Jamal Hadi Salim, Jiri Pirko,
	Marcelo Ricardo Leitner, Xin Long, Trond Myklebust,
	Anna Schumaker, Chuck Lever, Jeff Layton, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Jon Maloy, Johannes Berg,
	Catalin Marinas, Russell King, John Crispin, Thomas Bogendoerfer,
	Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	Andrzej Hajda, Neil Armstrong, Robert Foss, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Zhenyu Wang,
	Zhi Wang, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, Alex Deucher, Christian König, Sandy Huang,
	Heiko Stübner, Andy Yan, Igor Russkikh, Andrew Lunn,
	Pavan Chebbi, Michael Chan, Potnuri Bharat Teja, Tony Nguyen,
	Przemek Kitszel, Taras Chornyi, Maxime Coquelin, Alexandre Torgue,
	Iyappan Subramanian, Keyur Chudgar, Quan Nguyen, Heiner Kallweit,
	Marc Zyngier, Thomas Gleixner, Andrew Lunn, Gregory Clement,
	Sebastian Hesselbarth, Vinod Koul, Linus Walleij, Ulf Hansson,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Martin K. Petersen,
	Eduardo Valentin, Keerthy, Rafael J. Wysocki, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Alex Williamson, Mark Greer,
	Miquel Raynal, Richard Weinberger, Vignesh Raghavendra,
	Shuah Khan, Kieran Bingham, Mauro Carvalho Chehab, Joerg Roedel,
	Will Deacon, Robin Murphy, Lee Jones, Pavel Machek, Dave Penkler,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Justin Sanders, Jens Axboe, Georgi Djakov, Michael Turquette,
	Stephen Boyd, Philipp Zabel, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Pali Rohár, Dmitry Torokhov
In-Reply-To: <20260310-b4-is_err_or_null-v1-0-bd63b656022d@avm.de>

On Tue, Mar 10, 2026 at 12:48:26PM +0100, Philipp Hahn wrote:
> While doing some static code analysis I stumbled over a common pattern,
> where IS_ERR() is combined with a NULL check. For that there is
> IS_ERR_OR_NULL().

... and valid uses of IS_ERR_OR_NULL are rare as hen teeth.
Most of those are "I'm not sure how this function returns an
error, let's use that just in case".

Please, do not introduce more of that crap.
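
For context, here is a user-space sketch of the convention at stake.
The ERR_PTR()/IS_ERR()/PTR_ERR() helpers below are simplified stand-ins
for the kernel's <linux/err.h>, and demo_lookup()/demo_get() are made-up
names, so treat this as an illustration under those assumptions only.
The point is that a function should have one documented failure
convention (here: an ERR_PTR, never NULL), in which case a plain
IS_ERR() is all the caller needs:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space stand-ins for the helpers in <linux/err.h>. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int demo_table[4] = { 10, 20, 30, 40 };

/*
 * Hypothetical lookup: returns a valid pointer on success and an
 * ERR_PTR() on failure.  It never returns NULL, so callers have
 * exactly one failure check to make.
 */
static int *demo_lookup(int idx)
{
	if (idx < 0 || idx >= 4)
		return ERR_PTR(-ENOENT);
	return &demo_table[idx];
}

/* Caller: IS_ERR() alone is sufficient given the contract above. */
static long demo_get(int idx)
{
	int *p = demo_lookup(idx);

	if (IS_ERR(p))
		return PTR_ERR(p);
	return *p;
}
```

Note that IS_ERR(NULL) is false in this scheme, which is why mixing the
two conventions in one function tends to hide which one is really in
use.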

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Russell King (Oracle) @ 2026-04-09 16:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Robin Murphy, netdev, linux-arm-kernel, linux-kernel,
	iommu, linux-ext4, dmaengine, Marek Szyprowski, Theodore Ts'o,
	Andreas Dilger, Vinod Koul, Frank Li
In-Reply-To: <CAHk-=whO3F1u+nme4cnYMy5baYmb7CH=wE63dcNaPLWD0vKaew@mail.gmail.com>

On Thu, Apr 09, 2026 at 08:37:53AM -0700, Linus Torvalds wrote:
> On Thu, 9 Apr 2026 at 05:24, Will Deacon <will@kernel.org> wrote:
> >
> > On Wed, Apr 08, 2026 at 08:52:32PM +0100, Russell King (Oracle) wrote:
> > > What's the status on the iommu fix? Is it merged into mainline yet?
> > > If it isn't already, that means net-next remains unbootable going
> > > into the merge window without manually carrying the fix locally.
> >
> > I'll pick it up for 7.0 in the iommu tree.
> 
> ... and now it's in my tree.

Thanks, I see you merged it prior to the net tree, which should mean
the fix finds its way into net-next! Yay! Double thanks for that!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Linus Torvalds @ 2026-04-09 15:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: Russell King (Oracle), Robin Murphy, netdev, linux-arm-kernel,
	linux-kernel, iommu, linux-ext4, dmaengine, Marek Szyprowski,
	Theodore Ts'o, Andreas Dilger, Vinod Koul, Frank Li
In-Reply-To: <adeaiSAnkaggqPsA@willie-the-truck>

On Thu, 9 Apr 2026 at 05:24, Will Deacon <will@kernel.org> wrote:
>
> On Wed, Apr 08, 2026 at 08:52:32PM +0100, Russell King (Oracle) wrote:
> > What's the status on the iommu fix? Is it merged into mainline yet?
> > If it isn't already, that means net-next remains unbootable going
> > into the merge window without manually carrying the fix locally.
>
> I'll pick it up for 7.0 in the iommu tree.

... and now it's in my tree.

               Linus

^ permalink raw reply

* Re: [PATCH 2/2] ext4: align preallocation size to stripe width
From: Theodore Tso @ 2026-04-09 14:29 UTC (permalink / raw)
  To: Yu Kuai; +Cc: adilger.kernel, linux-ext4, linux-kernel
In-Reply-To: <20251208083246.320965-3-yukuai@fnnas.com>

On Mon, Dec 08, 2025 at 04:32:46PM +0800, Yu Kuai wrote:
> When stripe width (io_opt) is configured, align the predicted
> preallocation size to stripe boundaries. This ensures optimal I/O
> performance on RAID and other striped storage devices by avoiding
> partial stripe operations.
> 
> The current implementation uses hardcoded size predictions (16KB, 32KB,
> 64KB, etc.) that are not stripe-aware. This causes physical block
> offsets on disk to be misaligned to stripe boundaries, leading to
> read-modify-write penalties on RAID arrays and reduced performance.
> 
> This patch makes size prediction stripe-aware by using multiples of
> stripe size (1x, 2x, 4x, 8x, 16x, 32x) when s_stripe is set.
> Additionally, the start offset is aligned to stripe boundaries using
> rounddown(), which works correctly for both power-of-2 and non-power-of-2
> stripe sizes. For devices without stripe configuration, the original
> behavior is preserved.
> ...

Hi Yu,

Did you see the build failures reported by the kernel build bot on the
i386[1] and arm[2] platforms?  The problem appears to be the use of
roundup() and rounddown() on unsigned long types.

[1] https://lore.kernel.org/r/202512102331.yweFnVTU-lkp@intel.com
[2] https://lore.kernel.org/r/202512120613.mM5COVWV-lkp@intel.com

We can't apply your patch until this issue is addressed.
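
For readers following the alignment argument in the quoted commit
message, a user-space sketch of the arithmetic may help.  This is not
the patch's code and does not reproduce the kernel macros; the helper
names are made up for illustration.  It shows why a modulo-based round
down works for any stripe size, while a mask-based one is only valid
for powers of two:

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of stripe; valid for any nonzero stripe. */
static uint64_t stripe_rounddown(uint64_t x, uint64_t stripe)
{
	assert(stripe != 0);
	return x - (x % stripe);
}

/* Round x up to a multiple of stripe; valid for any nonzero stripe. */
static uint64_t stripe_roundup(uint64_t x, uint64_t stripe)
{
	return stripe_rounddown(x + stripe - 1, stripe);
}

/* Mask-based round down; only correct when stripe is a power of two. */
static uint64_t pow2_rounddown(uint64_t x, uint64_t stripe)
{
	return x & ~(stripe - 1);
}
```

With a stripe of 48 blocks, stripe_rounddown(100, 48) gives 96, whereas
the mask-based variant gives 64, which is not a stripe boundary.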

Thanks,

					- Ted

^ permalink raw reply

* Re: [PATCH v2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
From: Theodore Tso @ 2026-04-09 14:10 UTC (permalink / raw)
  To: Guan-Chun Wu
  Cc: adilger.kernel, linux-ext4, linux-kernel, visitorckw,
	david.laight.linux
In-Reply-To: <20251122043929.1908643-1-409411716@gms.tku.edu.tw>

On Sat, Nov 22, 2025 at 12:39:29PM +0800, Guan-Chun Wu wrote:
> The original byte-by-byte implementation with modulo checks is less
> efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
> to process input in explicit 4-byte chunks instead of using a
> modulus-based loop to emit words byte by byte.
> 
> Additionally, the use of function pointers for selecting the appropriate
> str2hashbuf implementation has been removed. Instead, the functions are
> directly invoked based on the hash type, eliminating the overhead of
> dynamic function calls.
> 
> Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
> runs, using kernel module for testing):
> 
>     len | orig_s | new_s | orig_u | new_u
>     ----+--------+-------+--------+-------
>       1 |   70   |   71  |   63   |   63
>       8 |   68   |   64  |   64   |   62
>      32 |   75   |   70  |   75   |   63
>      64 |   96   |   71  |  100   |   68
>     255 |  192   |  108  |  187   |   84
> 
> This change improves performance, especially for larger input sizes.
> 
> Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>

Apologies for the delay in looking at this.  It fell through the
cracks on my end.

Because of how I'm a bit late with reviewing patches before the merge
window, I'm going to be very conservative in which patches I'm going
to land.  So this is going to be deferred until the next cycle, but I
wanted to let you know that I haven't forgotten about it.

If this was a comprehensive set of Kunit tests for fs/ext4/hash.c, I
might have taken it.  And that's something that I would look at adding
for the next cycle, but if you'd be interested in creating the kunit
tests for hash.c, that would be great.
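
As a rough illustration of what such tests would check, here is a
user-space sketch comparing a byte-at-a-time packer with a 4-byte-chunk
packer.  It is modeled on my reading of the unsigned variant and is not
the code from the patch or from fs/ext4/hash.c; pack_bytewise(),
pack_chunked() and packers_agree() are made-up names.  The property of
interest is that both forms fill the output words identically for every
input length:

```c
#include <stdint.h>
#include <string.h>

#define NUM_WORDS 8

/* Byte-at-a-time form: emit a word whenever i % 4 == 3. */
static void pack_bytewise(const unsigned char *s, int len,
			  uint32_t *buf, int num)
{
	uint32_t pad, val;
	int i;

	pad = (uint32_t)len | ((uint32_t)len << 8);
	pad |= pad << 16;

	val = pad;
	if (len > num * 4)
		len = num * 4;
	for (i = 0; i < len; i++) {
		val = s[i] + (val << 8);
		if ((i % 4) == 3) {
			*buf++ = val;
			val = pad;
			num--;
		}
	}
	if (--num >= 0)
		*buf++ = val;
	while (--num >= 0)
		*buf++ = pad;
}

/* Chunked form: whole 4-byte groups first, then the tail. */
static void pack_chunked(const unsigned char *s, int len,
			 uint32_t *buf, int num)
{
	uint32_t pad, val;
	int i;

	pad = (uint32_t)len | ((uint32_t)len << 8);
	pad |= pad << 16;

	if (len > num * 4)
		len = num * 4;
	while (len >= 4) {
		/* Four shifts by 8 push the pad out of the word entirely. */
		*buf++ = ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
			 ((uint32_t)s[2] << 8) | s[3];
		s += 4;
		len -= 4;
		num--;
	}
	val = pad;
	for (i = 0; i < len; i++)
		val = s[i] + (val << 8);
	if (--num >= 0)
		*buf++ = val;
	while (--num >= 0)
		*buf++ = pad;
}

/* Returns 1 if both packers produce the same words for this input. */
static int packers_agree(const char *s, int len)
{
	uint32_t a[NUM_WORDS], b[NUM_WORDS];

	pack_bytewise((const unsigned char *)s, len, a, NUM_WORDS);
	pack_chunked((const unsigned char *)s, len, b, NUM_WORDS);
	return memcmp(a, b, sizeof(a)) == 0;
}
```

A KUnit version would presumably do the same comparison against known
hash values for the signed and unsigned variants, including names
longer than the buffer.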

						- Ted

^ permalink raw reply

* Re: [RFC PATCH v1 0/6] provenance_time (ptime): a new settable timestamp for cross-filesystem provenance
From: Christian Brauner @ 2026-04-09 13:38 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Sean Smith, linux-fsdevel, linux-ext4, linux-btrfs, dsterba,
	david, osandov, hirofumi, linkinjeon
In-Reply-To: <20260407233618.GB12536@macsyma-wired.lan>

On Tue, Apr 07, 2026 at 07:36:18PM -0400, Theodore Ts'o wrote:
> On Mon, Apr 06, 2026 at 07:05:55PM -0500, Sean Smith wrote:
> > The patches implement rename-over preservation in all 5
> > filesystem rename handlers. When rename(source, target)
> > replaces an existing file, and the source has ptime=0 (the
> > default for any newly-created temp file) while the target
> > has ptime != 0, the filesystem copies the target's ptime to
> > the source before destroying the target's inode. This runs
> > inside the rename transaction, atomic with the rename itself.
> 
> Yelch.   This is so *very* non-Unixy / non-POSIX / non-Linux.

I think you meant to type "N", "A", and "K".

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: derive f_fsid from block device to avoid collisions
From: Theodore Tso @ 2026-04-09 13:12 UTC (permalink / raw)
  To: Anand Jain
  Cc: Christoph Hellwig, Darrick J. Wong, linux-ext4, linux-btrfs,
	linux-xfs, Anand Jain, dsterba
In-Reply-To: <22cfbf8d-af9b-462e-b240-67a1de24764f@gmail.com>

On Thu, Apr 09, 2026 at 05:45:24PM +0800, Anand Jain wrote:
> 
> Got it. Do you mean that since both filesystems are identical,
> statfs(A) and statfs(B) can legitimately return the same values?

Yes.  f_fsid can legitimately always be zero (which I believe is the
case for FreeBSD), but I understand that there are some programs, like
systemd, which subscribe to the heresy "All the World's Linux", which
is a variant of the "All the World's a Vax" or "All the World's SunOS"
heresies from the beginning of my career :-).

> I'm not entirely sure what the correct expectation for f_fsid
> should be.

That's my point, there *is* no correct expectation, and I don't
believe there can or should be.  What we should be doing instead is
actively discouraging people from using f_fsid.  I suspect that's one
of the reasons why FreeBSD may have chosen to just return zero.

Which is why I don't think we should be testing this in xfstests's
generic/791, either.  (Unless we get consensus across file system
developers and are willing to make it a documented behavior as of a
particular kernel version, and we then adjust the test to skip it if
it's older than that kernel version, so it doesn't break LTS kernel
tests.  See below....)

> My initial idea was to make f_fsid behavior consistent across
> major filesystems so that user space benefits from predictable
> semantics.

I'm OK with that, so long as it's unconditional across all file system
types (ideally) or unconditionally across all major file systems (xfs,
btrfs, ext4, f2fs) as of a particular kernel version (which is
probably much more realistic), *and* it is documented in the Linux man
pages as this is the standard behavior starting with 7.1 (or
whatever), and that the man page further cautions that programs that
expect to be portable to other OS's (MacOS, FreeBSD, Solaris, etc.)
should not count on this behavior.

But given that you originally stumbled across this with Overlayfs,
because it was originally using s_uuid, and that didn't work well for
btrfs, why not change overlayfs to just use s_uuid plus kdev_t in its
xattr, and just fix the problem for overlayfs?  That has the benefit
that it will work for all file system types in Linux, not just for
those where we have changed what f_fsid does.

Cheers,

					- Ted

^ permalink raw reply

* [PATCH v7 22/22] xfs: enable ro-compat fs-verity flag
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Finalize fs-verity integration in XFS by making kernel fs-verity
aware with ro-compat flag.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add spaces]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 4dff29659e40..0ce46c234b9c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -378,8 +378,9 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
-		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_VERITY)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 21/22] xfs: introduce health state for corrupted fsverity metadata
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Report corrupted fsverity descriptor through health system.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h     |  1 +
 fs/xfs/libxfs/xfs_health.h |  4 +++-
 fs/xfs/xfs_fsverity.c      | 13 ++++++++++---
 fs/xfs/xfs_health.c        |  1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index ebf17a0b0722..cece31ecee81 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -422,6 +422,7 @@ struct xfs_bulkstat {
 #define XFS_BS_SICK_SYMLINK	(1 << 6)  /* symbolic link remote target */
 #define XFS_BS_SICK_PARENT	(1 << 7)  /* parent pointers */
 #define XFS_BS_SICK_DIRTREE	(1 << 8)  /* directory tree structure */
+#define XFS_BS_SICK_FSVERITY	(1 << 9)  /* fsverity metadata */
 
 /*
  * Project quota id helpers (previously projid was 16bit only
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 1d45cf5789e8..932b447190da 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -104,6 +104,7 @@ struct xfs_rtgroup;
 /* Don't propagate sick status to ag health summary during inactivation */
 #define XFS_SICK_INO_FORGET	(1 << 12)
 #define XFS_SICK_INO_DIRTREE	(1 << 13)  /* directory tree structure */
+#define XFS_SICK_INO_FSVERITY	(1 << 14)  /* fsverity metadata */
 
 /* Primary evidence of health problems in a given group. */
 #define XFS_SICK_FS_PRIMARY	(XFS_SICK_FS_COUNTERS | \
@@ -140,7 +141,8 @@ struct xfs_rtgroup;
 				 XFS_SICK_INO_XATTR | \
 				 XFS_SICK_INO_SYMLINK | \
 				 XFS_SICK_INO_PARENT | \
-				 XFS_SICK_INO_DIRTREE)
+				 XFS_SICK_INO_DIRTREE | \
+				 XFS_SICK_INO_FSVERITY)
 
 #define XFS_SICK_INO_ZAPPED	(XFS_SICK_INO_BMBTD_ZAPPED | \
 				 XFS_SICK_INO_BMBTA_ZAPPED | \
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index ef5cf97ad700..8ac810f0ffa1 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -84,16 +84,23 @@ xfs_fsverity_get_descriptor(
 		return error;
 
 	desc_size = be32_to_cpu(d_desc_size);
-	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE))
+	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
-	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos))
+	}
+
+	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
+	}
 
 	if (!buf_size)
 		return desc_size;
 
-	if (XFS_IS_CORRUPT(mp, desc_size > buf_size))
+	if (XFS_IS_CORRUPT(mp, desc_size > buf_size)) {
+		xfs_inode_mark_sick(XFS_I(inode), XFS_SICK_INO_FSVERITY);
 		return -ERANGE;
+	}
 
 	desc_pos = round_down(desc_size_pos - desc_size, blocksize);
 	error = fsverity_pagecache_read(inode, buf, desc_size, desc_pos);
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 239b843e83d4..be66760fb120 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -625,6 +625,7 @@ static const struct ioctl_sick_map ino_map[] = {
 	{ XFS_SICK_INO_DIR_ZAPPED,	XFS_BS_SICK_DIR },
 	{ XFS_SICK_INO_SYMLINK_ZAPPED,	XFS_BS_SICK_SYMLINK },
 	{ XFS_SICK_INO_DIRTREE,	XFS_BS_SICK_DIRTREE },
+	{ XFS_SICK_INO_FSVERITY,	XFS_BS_SICK_FSVERITY },
 };
 
 /* Fill out bulkstat health info. */
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 20/22] xfs: check and repair the verity inode flag state
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
	Andrey Albershteyn
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

From: "Darrick J. Wong" <djwong@kernel.org>

If an inode has the incore verity iflag set, make sure that we can
actually activate fsverity on that inode.  If activation fails due to
a fsverity metadata validation error, clear the flag.  The usage model
for fsverity requires that any program that cares about verity state is
required to call statx/getflags to check that the flag is set after
opening the file, so clearing the flag will not compromise that model.
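
To make the usage model concrete, a minimal user-space sketch of the
"check the flag after opening" step follows.  It uses the getflags
route (FS_IOC_GETFLAGS and FS_VERITY_FL from <linux/fs.h>); the helper
name fd_has_verity() is made up, and the fallback #define is only there
in case older headers lack the flag:

```c
#include <sys/ioctl.h>
#include <linux/fs.h>

#ifndef FS_VERITY_FL
#define FS_VERITY_FL 0x00100000	/* fallback for older kernel headers */
#endif

/*
 * Returns 1 if the open file has the verity flag set, 0 if it does
 * not, and -1 if the flags could not be read (errno is left as set by
 * ioctl()).
 */
static int fd_has_verity(int fd)
{
	int flags = 0;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0)
		return -1;
	return (flags & FS_VERITY_FL) ? 1 : 0;
}
```

A program that cares about verity would open the file, call something
like this on the resulting fd, and refuse to trust the contents unless
it returns 1.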

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/scrub/attr.c         |  7 +++++
 fs/xfs/scrub/common.c       | 53 +++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h       |  2 ++
 fs/xfs/scrub/inode.c        |  7 +++++
 fs/xfs/scrub/inode_repair.c | 36 +++++++++++++++++++++++++
 5 files changed, 105 insertions(+)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 390ac2e11ee0..daf7962c2374 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -649,6 +649,13 @@ xchk_xattr(
 	if (!xfs_inode_hasattr(sc->ip))
 		return -ENOENT;
 
+	/*
+	 * If this is a verity file that won't activate, we cannot check the
+	 * merkle tree geometry.
+	 */
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_set_incomplete(sc);
+
 	/* Allocate memory for xattr checking. */
 	error = xchk_setup_xattr_buf(sc, 0);
 	if (error == -ENOMEM)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 20e63069088b..6cc6bea9c554 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -45,6 +45,8 @@
 #include "scrub/health.h"
 #include "scrub/tempfile.h"
 
+#include <linux/fsverity.h>
+
 /* Common code for the metadata scrubbers. */
 
 /*
@@ -1743,3 +1745,54 @@ xchk_inode_count_blocks(
 	return xfs_bmap_count_blocks(sc->tp, sc->ip, whichfork, nextents,
 			count);
 }
+
+/*
+ * If this inode has S_VERITY set on it, read the verity info. If the reading
+ * fails with anything other than ENOMEM, the file is corrupt, which we can
+ * detect later with fsverity_active.
+ *
+ * Callers must hold the IOLOCK and must not hold the ILOCK of sc->ip because
+ * activation reads inode data.
+ */
+int
+xchk_inode_setup_verity(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	if (!fsverity_active(VFS_I(sc->ip)))
+		return 0;
+
+	error = fsverity_ensure_verity_info(VFS_I(sc->ip));
+	switch (error) {
+	case 0:
+		/* fsverity is active */
+		break;
+	case -ENODATA:
+	case -EMSGSIZE:
+	case -EINVAL:
+	case -EFSCORRUPTED:
+	case -EFBIG:
+		/*
+		 * The nonzero errno codes above are the error codes that can
+		 * be returned from fsverity on metadata validation errors.
+		 */
+		return 0;
+	default:
+		/* runtime errors */
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Is this a verity file that failed to activate?  Callers must have tried to
+ * activate fsverity via xchk_inode_setup_verity.
+ */
+bool
+xchk_inode_verity_broken(
+	struct xfs_inode	*ip)
+{
+	return fsverity_active(VFS_I(ip)) && !fsverity_get_info(VFS_I(ip));
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index f2ecc68538f0..aa16d310bd6d 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -264,6 +264,8 @@ int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
 		bool *inuse);
 int xchk_inode_count_blocks(struct xfs_scrub *sc, int whichfork,
 		xfs_extnum_t *nextents, xfs_filblks_t *count);
+int xchk_inode_setup_verity(struct xfs_scrub *sc);
+bool xchk_inode_verity_broken(struct xfs_inode *ip);
 
 bool xchk_inode_is_dirtree_root(const struct xfs_inode *ip);
 bool xchk_inode_is_sb_rooted(const struct xfs_inode *ip);
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 948d04dcba2a..8ce6917e22b4 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -36,6 +36,10 @@ xchk_prepare_iscrub(
 
 	xchk_ilock(sc, XFS_IOLOCK_EXCL);
 
+	error = xchk_inode_setup_verity(sc);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -833,6 +837,9 @@ xchk_inode(
 	if (S_ISREG(VFS_I(sc->ip)->i_mode))
 		xchk_inode_check_reflink_iflag(sc, sc->ip->i_ino);
 
+	if (xchk_inode_verity_broken(sc->ip))
+		xchk_ino_set_corrupt(sc, sc->sm->sm_ino);
+
 	xchk_inode_check_unlinked(sc);
 
 	xchk_inode_xref(sc, sc->ip->i_ino, &di);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 9738b9ce3f2d..3761e3922466 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -573,6 +573,8 @@ xrep_dinode_flags(
 		dip->di_nrext64_pad = 0;
 	else if (dip->di_version >= 3)
 		dip->di_v3_pad = 0;
+	if (!xfs_has_verity(mp) || !S_ISREG(mode))
+		flags2 &= ~XFS_DIFLAG2_VERITY;
 
 	if (flags2 & XFS_DIFLAG2_METADATA) {
 		xfs_failaddr_t	fa;
@@ -1613,6 +1615,10 @@ xrep_dinode_core(
 	if (iget_error)
 		return iget_error;
 
+	error = xchk_inode_setup_verity(sc);
+	if (error)
+		return error;
+
 	error = xchk_trans_alloc(sc, 0);
 	if (error)
 		return error;
@@ -2032,6 +2038,27 @@ xrep_inode_unlinked(
 	return 0;
 }
 
+/*
+ * If this file is a fsverity file, xchk_prepare_iscrub or xrep_dinode_core
+ * should have activated it.  If it's still not active, then there's something
+ * wrong with the verity descriptor and we should turn it off.
+ */
+STATIC int
+xrep_inode_verity(
+	struct xfs_scrub	*sc)
+{
+	struct inode		*inode = VFS_I(sc->ip);
+
+	if (xchk_inode_verity_broken(sc->ip)) {
+		sc->ip->i_diflags2 &= ~XFS_DIFLAG2_VERITY;
+		inode->i_flags &= ~S_VERITY;
+
+		xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	}
+
+	return 0;
+}
+
 /* Repair an inode's fields. */
 int
 xrep_inode(
@@ -2081,6 +2108,15 @@ xrep_inode(
 			return error;
 	}
 
+	/*
+	 * Disable fsverity if it cannot be activated.  Activation failure
+	 * prohibits the file from being opened, so there cannot be another
+	 * program with an open fd to what it thinks is a verity file.
+	 */
+	error = xrep_inode_verity(sc);
+	if (error)
+		return error;
+
 	/* Reconnect incore unlinked list */
 	error = xrep_inode_unlinked(sc);
 	if (error)
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 19/22] xfs: advertise fs-verity being available on filesystem
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Darrick J. Wong, hch, linux-ext4, linux-f2fs-devel, linux-btrfs,
	Andrey Albershteyn, Andrey Albershteyn
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

From: "Darrick J. Wong" <djwong@kernel.org>

Advertise that this filesystem supports fsverity.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h | 1 +
 fs/xfs/libxfs/xfs_sb.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index d165de607d17..ebf17a0b0722 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -250,6 +250,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_PARENT	(1 << 25) /* linux parent pointers */
 #define XFS_FSOP_GEOM_FLAGS_METADIR	(1 << 26) /* metadata directories */
 #define XFS_FSOP_GEOM_FLAGS_ZONED	(1 << 27) /* zoned rt device */
+#define XFS_FSOP_GEOM_FLAGS_VERITY	(1 << 28) /* fs-verity */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a15510ebd2f1..222bbe5559df 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1590,6 +1590,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
 	if (xfs_has_zoned(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_ZONED;
+	if (xfs_has_verity(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_VERITY;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 18/22] xfs: add fs-verity ioctls
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Add fs-verity ioctls to enable, dump metadata (descriptor and Merkle
tree pages) and obtain file's digest.
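
For reference, a user-space sketch of the digest ioctl as seen by a
caller.  It relies on struct fsverity_digest and FS_IOC_MEASURE_VERITY
from <linux/fsverity.h>; the helper name measure_verity() is made up
and error handling is kept minimal:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fsverity.h>

/*
 * Copy the fs-verity digest of 'path' into 'out' (capacity 'cap'
 * bytes).  Returns the digest size in bytes, or -1 on any failure,
 * including the file not having verity enabled.
 */
static int measure_verity(const char *path, unsigned char *out, size_t cap)
{
	struct fsverity_digest *d;
	int fd, ret = -1;

	if (cap > 0xffff)	/* digest_size is a 16-bit field */
		cap = 0xffff;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	d = calloc(1, sizeof(*d) + cap);
	if (d) {
		/* In: buffer capacity.  Out: actual digest size. */
		d->digest_size = cap;
		if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) == 0) {
			memcpy(out, d->digest, d->digest_size);
			ret = d->digest_size;
		}
		free(d);
	}
	close(fd);
	return ret;
}
```

On a filesystem without the verity feature, the hunk below returns
-EOPNOTSUPP for these ioctls, which this helper reports as -1.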

[djwong: remove unnecessary casting]

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_ioctl.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index facffdc8dca8..e633d56cad00 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -46,6 +46,7 @@
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
+#include <linux/fsverity.h>
 
 /* Return 0 on success or positive error */
 int
@@ -1426,6 +1427,19 @@ xfs_file_ioctl(
 	case XFS_IOC_VERIFY_MEDIA:
 		return xfs_ioc_verify_media(filp, arg);
 
+	case FS_IOC_ENABLE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_enable(filp, arg);
+	case FS_IOC_MEASURE_VERITY:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_measure(filp, arg);
+	case FS_IOC_READ_VERITY_METADATA:
+		if (!xfs_has_verity(mp))
+			return -EOPNOTSUPP;
+		return fsverity_ioctl_read_metadata(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 17/22] xfs: remove unwritten extents after preallocations in fsverity metadata
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

XFS preallocates space during writes. In normal I/O this space, if
unused, is removed by truncate. For files with fsverity, XFS does not use
truncate, as fsverity metadata is stored past EOF.

After we're done writing fsverity metadata, iterate over extents in
that region and remove any unwritten ones. These would be leftovers in
the holes in the merkle tree and past the fsverity descriptor.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_fsverity.c | 67 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 68d9736d19d9..ef5cf97ad700 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -21,6 +21,8 @@
 #include "xfs_iomap.h"
 #include "xfs_error.h"
 #include "xfs_health.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
 #include <linux/fsverity.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
@@ -173,6 +175,63 @@ xfs_fsverity_delete_metadata(
 	return error;
 }
 
+static int
+xfs_fsverity_cancel_unwritten(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		start,
+	xfs_fileoff_t		end)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSB(mp, start);
+	xfs_fileoff_t		end_fsb = XFS_B_TO_FSB(mp, end);
+	struct xfs_bmbt_irec	imap;
+	int			nimaps;
+	int			error = 0;
+	int			done;
+
+
+	while (offset_fsb < end_fsb) {
+		nimaps = 1;
+
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0,
+				0, &tp);
+		if (error)
+			return error;
+
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
+				&imap, &nimaps, 0);
+		if (error)
+			goto out_cancel;
+
+		if (nimaps == 0)
+			goto out_cancel;
+
+		if (imap.br_state == XFS_EXT_UNWRITTEN) {
+			xfs_trans_ijoin(tp, ip, 0);
+
+			error = xfs_bunmapi(tp, ip, imap.br_startoff,
+					imap.br_blockcount, 0, 1, &done);
+			if (error)
+				goto out_cancel;
+
+			error = xfs_trans_commit(tp);
+		} else {
+			xfs_trans_cancel(tp);
+		}
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+		offset_fsb = imap.br_startoff + imap.br_blockcount;
+	}
+
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 
 /*
  * Prepare to enable fsverity by clearing old metadata.
@@ -248,6 +307,14 @@ xfs_fsverity_end_enable(
 	if (error)
 		goto out;
 
+	/*
+	 * Remove unwritten extents left by COW preallocations and write
+	 * preallocation in the merkle tree holes and past descriptor
+	 */
+	error = xfs_fsverity_cancel_unwritten(ip, range_start, LLONG_MAX);
+	if (error)
+		goto out;
+
 	/*
 	 * Proactively drop any delayed allocations in COW fork, the fsverity
 	 * files are read-only
-- 
2.51.2


^ permalink raw reply related

* [PATCH v7 16/22] xfs: add fs-verity support
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Add integration with fs-verity. XFS stores fs-verity descriptor and
Merkle tree in the inode data fork at first block aligned to 64k past
EOF.

The Merkle tree reading/writing is done through iomap interface. The
data itself is read to the inode's page cache. When XFS reads from this
region iomap doesn't call into fsverity to verify it against Merkle
tree. For data, verification is done at ioend completion in a workqueue.

When fs-verity is enabled on an inode, the XFS_IVERITY_CONSTRUCTION
flag is set, meaning that the Merkle tree is being built. The
initialization ends with storing of verity descriptor and setting
inode on-disk flag (XFS_DIFLAG2_VERITY). Lastly, the
XFS_IVERITY_CONSTRUCTION is dropped and I_VERITY is set on inode.

The descriptor is stored in a new block aligned to 64k after the last
Merkle tree block. The size of the descriptor is stored at the end of
the last descriptor block (descriptor can be multiple blocks).
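
The layout described above can be sketched in user space as follows.
This is only an illustration: the struct and function names are made
up, the 64k constant stands in for XFS_FSVERITY_START_ALIGN, and the
tree start being round_up(i_size, 64k) is an assumption taken from the
description rather than from xfs_fsverity_metadata_offset() itself.
The descriptor and size positions follow the arithmetic in
xfs_fsverity_write_descriptor() in the diff below:

```c
#include <assert.h>
#include <stdint.h>

#define VERITY_ALIGN	65536ULL	/* assumed 64k start alignment */

struct verity_layout {
	uint64_t tree_pos;	/* start of the Merkle tree */
	uint64_t desc_pos;	/* start of the descriptor */
	uint64_t desc_size_pos;	/* where the 4-byte descriptor size lives */
};

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	assert(align != 0);
	return (x + align - 1) / align * align;
}

static struct verity_layout verity_layout(uint64_t isize, uint64_t tree_size,
					  uint32_t desc_size, uint32_t blksize)
{
	struct verity_layout l;
	uint64_t desc_end;

	/* Assumption: the tree starts at the first 64k boundary past EOF. */
	l.tree_pos = round_up_u64(isize, VERITY_ALIGN);

	/* Descriptor goes in a new 64k-aligned block after the tree. */
	l.desc_pos = round_up_u64(l.tree_pos + tree_size, VERITY_ALIGN);

	/* Size is stored in the last 4 bytes of the last descriptor block. */
	desc_end = l.desc_pos + desc_size;
	l.desc_size_pos = round_up_u64(desc_end + sizeof(uint32_t), blksize) -
			  sizeof(uint32_t);
	return l;
}
```

For a 100000-byte file with an 8192-byte tree, a 256-byte descriptor
and 4096-byte blocks, this puts the tree at 131072, the descriptor at
196608, and the size field at 200700.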

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   8 +
 fs/xfs/xfs_fsverity.c  | 353 ++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_fsverity.h  |   2 +
 fs/xfs/xfs_message.c   |   4 +
 fs/xfs/xfs_message.h   |   1 +
 fs/xfs/xfs_mount.h     |   2 +
 fs/xfs/xfs_super.c     |   7 +
 7 files changed, 376 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 0ab00615f1ad..18348f4fd2aa 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -31,6 +31,7 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include <linux/fsverity.h>
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -553,6 +554,13 @@ xfs_can_free_eofblocks(
 	if (last_fsb <= end_fsb)
 		return false;
 
+	/*
+	 * Nothing to clean on fsverity inodes as they don't use prealloc and
+	 * there no delalloc as only written data is fsverity metadata
+	 */
+	if (IS_VERITY(VFS_I(ip)))
+		return false;
+
 	/*
 	 * Check if there is an post-EOF extent to free.  If there are any
 	 * delalloc blocks attached to the inode (data fork delalloc
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index b983e20bb5e1..68d9736d19d9 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -4,14 +4,26 @@
  */
 #include "xfs_platform.h"
 #include "xfs_format.h"
-#include "xfs_inode.h"
 #include "xfs_shared.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_fsverity.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_inode.h"
+#include "xfs_log_format.h"
+#include "xfs_bmap_util.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_trace.h"
+#include "xfs_quota.h"
 #include "xfs_fsverity.h"
+#include "xfs_iomap.h"
+#include "xfs_error.h"
+#include "xfs_health.h"
 #include <linux/fsverity.h>
 #include <linux/iomap.h>
+#include <linux/pagemap.h>
 
 loff_t
 xfs_fsverity_metadata_offset(
@@ -28,3 +40,342 @@ xfs_fsverity_is_file_data(
 	return fsverity_active(VFS_IC(ip)) &&
 			offset < xfs_fsverity_metadata_offset(ip);
 }
+
+/*
+ * Retrieve the verity descriptor.
+ */
+static int
+xfs_fsverity_get_descriptor(
+	struct inode		*inode,
+	void			*buf,
+	size_t			buf_size)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	__be32			d_desc_size;
+	u32			desc_size;
+	u64			desc_size_pos;
+	int			error;
+	u64			desc_pos;
+	struct xfs_bmbt_irec	rec;
+	int			is_empty;
+	uint32_t		blocksize = i_blocksize(VFS_I(ip));
+	xfs_fileoff_t		last_block_offset;
+
+	ASSERT(inode->i_flags & S_VERITY);
+	error = xfs_bmap_last_extent(NULL, ip, XFS_DATA_FORK, &rec, &is_empty);
+	if (error)
+		return error;
+
+	if (is_empty)
+		return -ENODATA;
+
+	last_block_offset =
+		XFS_FSB_TO_B(mp, rec.br_startoff + rec.br_blockcount);
+	if (last_block_offset < xfs_fsverity_metadata_offset(ip))
+		return -ENODATA;
+
+	desc_size_pos = last_block_offset - sizeof(__be32);
+	error = fsverity_pagecache_read(inode, (char *)&d_desc_size,
+			sizeof(d_desc_size), desc_size_pos);
+	if (error)
+		return error;
+
+	desc_size = be32_to_cpu(d_desc_size);
+	if (XFS_IS_CORRUPT(mp, desc_size > FS_VERITY_MAX_DESCRIPTOR_SIZE))
+		return -ERANGE;
+	if (XFS_IS_CORRUPT(mp, desc_size > desc_size_pos))
+		return -ERANGE;
+
+	if (!buf_size)
+		return desc_size;
+
+	if (XFS_IS_CORRUPT(mp, desc_size > buf_size))
+		return -ERANGE;
+
+	desc_pos = round_down(desc_size_pos - desc_size, blocksize);
+	error = fsverity_pagecache_read(inode, buf, desc_size, desc_pos);
+	if (error)
+		return error;
+
+	return desc_size;
+}
+
+static int
+xfs_fsverity_write_descriptor(
+	struct file		*file,
+	const void		*desc,
+	u32			desc_size,
+	u64			merkle_tree_size)
+{
+	int			error;
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	unsigned int		blksize = ip->i_mount->m_attr_geo->blksize;
+	u64			tree_last_block =
+			xfs_fsverity_metadata_offset(ip) + merkle_tree_size;
+	u64			desc_pos =
+			round_up(tree_last_block, XFS_FSVERITY_START_ALIGN);
+	u64			desc_end = desc_pos + desc_size;
+	__be32			desc_size_disk = cpu_to_be32(desc_size);
+	u64			desc_size_pos =
+			round_up(desc_end + sizeof(desc_size_disk), blksize) -
+			sizeof(desc_size_disk);
+
+	error = iomap_fsverity_write(file, desc_size_pos, sizeof(__be32),
+			(const void *)&desc_size_disk,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+	if (error)
+		return error;
+
+	return iomap_fsverity_write(file, desc_pos, desc_size, desc,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+}
+
+/*
+ * Try to remove all the fsverity metadata after a failed enablement.
+ */
+static int
+xfs_fsverity_delete_metadata(
+	struct xfs_inode	*ip)
+{
+	struct xfs_trans	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	int			error;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	/*
+	 * We are removing post-EOF data; no need to update i_size as fsverity
+	 * didn't move i_size in the first place
+	 */
+	error = xfs_itruncate_extents(&tp, ip, XFS_DATA_FORK, XFS_ISIZE(ip));
+	if (error)
+		goto err_cancel;
+
+	/*
+	 * A failed commit already frees tp, so it must not be cancelled.
+	 */
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+
+err_cancel:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+
+/*
+ * Prepare to enable fsverity by clearing old metadata.
+ */
+static int
+xfs_fsverity_begin_enable(
+	struct file		*filp)
+{
+	struct inode		*inode = file_inode(filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	int			error;
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	if (IS_DAX(inode))
+		return -EINVAL;
+
+	if (inode->i_size > XFS_FSVERITY_LARGEST_FILE)
+		return -EFBIG;
+
+	/*
+	 * Flush pagecache before building Merkle tree. Inode is locked and no
+	 * further writes will happen to the file except fsverity metadata
+	 */
+	error = filemap_write_and_wait(inode->i_mapping);
+	if (error)
+		return error;
+
+	if (xfs_iflags_test_and_set(ip, XFS_VERITY_CONSTRUCTION))
+		return -EBUSY;
+
+	error = xfs_qm_dqattach(ip);
+	if (error)
+		return error;
+
+	return xfs_fsverity_delete_metadata(ip);
+}
+
+/*
+ * Complete (or fail) the process of enabling fsverity.
+ */
+static int
+xfs_fsverity_end_enable(
+	struct file		*file,
+	const void		*desc,
+	size_t			desc_size,
+	u64			merkle_tree_size)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	int			error = 0;
+	loff_t			range_start = xfs_fsverity_metadata_offset(ip);
+
+	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL);
+
+	/* fs-verity failed, just cleanup */
+	if (desc == NULL)
+		goto out;
+
+	error = xfs_fsverity_write_descriptor(file, desc, desc_size,
+			merkle_tree_size);
+	if (error)
+		goto out;
+
+	/*
+	 * Wait for the Merkle tree to be written to disk before setting the
+	 * on-disk inode flag and clearing XFS_VERITY_CONSTRUCTION
+	 */
+	error = filemap_write_and_wait_range(inode->i_mapping, range_start,
+			LLONG_MAX);
+	if (error)
+		goto out;
+
+	/*
+	 * Proactively drop any delayed allocations in COW fork, the fsverity
+	 * files are read-only
+	 */
+	if (xfs_is_cow_inode(ip))
+		xfs_bmap_punch_delalloc_range(ip, XFS_COW_FORK, 0, LLONG_MAX,
+				NULL);
+
+	/*
+	 * Set fsverity inode flag
+	 */
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange,
+			0, 0, false, &tp);
+	if (error)
+		goto out;
+
+	/*
+	 * Ensure that we've persisted the verity information before we enable
+	 * it on the inode and tell the caller we have sealed the inode.
+	 */
+	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (!error)
+		inode->i_flags |= S_VERITY;
+
+out:
+	if (error) {
+		int	error2;
+
+		error2 = xfs_fsverity_delete_metadata(ip);
+		if (error2)
+			xfs_alert(ip->i_mount,
+"ino 0x%llx failed to clean up new fsverity metadata, err %d",
+					ip->i_ino, error2);
+	}
+
+	xfs_iflags_clear(ip, XFS_VERITY_CONSTRUCTION);
+	return error;
+}
+
+/*
+ * Retrieve a merkle tree block.
+ */
+static struct page *
+xfs_fsverity_read_merkle(
+	struct inode		*inode,
+	pgoff_t			index)
+{
+	index += xfs_fsverity_metadata_offset(XFS_I(inode)) >> PAGE_SHIFT;
+
+	return generic_read_merkle_tree_page(inode, index);
+}
+
+/*
+ * Start readahead of Merkle tree pages.
+ */
+static void
+xfs_fsverity_readahead_merkle_tree(
+	struct inode		*inode,
+	pgoff_t			index,
+	unsigned long		nr_pages)
+{
+	index += xfs_fsverity_metadata_offset(XFS_I(inode)) >> PAGE_SHIFT;
+
+	generic_readahead_merkle_tree(inode, index, nr_pages);
+}
+
+/*
+ * Write a merkle tree block.
+ */
+static int
+xfs_fsverity_write_merkle(
+	struct file		*file,
+	const void		*buf,
+	u64			pos,
+	unsigned int		size,
+	const u8		*zero_digest,
+	unsigned int		digest_size)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	loff_t			position = pos +
+		xfs_fsverity_metadata_offset(ip);
+
+	if (position + size > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+
+	/*
+	 * If this is a block full of hashes of zeroed blocks, don't bother
+	 * storing the block. We can synthesize them later.
+	 *
+	 * However, do this only in case Merkle tree block == fs block size.
+	 * Iomap synthesizes these blocks based on holes in the merkle tree. We
+	 * won't be able to tell if something needs to be synthesized for the
+	 * range in the fs block. For example, for 4k filesystem block
+	 *
+	 *	[ 1k | zero hashes | zero hashes | 1k ]
+	 *
+	 * Iomap won't know about these empty blocks.
+	 */
+	if (size == ip->i_mount->m_sb.sb_blocksize &&
+			/*
+			 * First digest is zero_digest
+			 */
+			memcmp(buf, zero_digest, digest_size) == 0 &&
+			/*
+			 * Every digest is same as previous, thus all are
+			 * zero_digest
+			 */
+			memcmp(buf + digest_size, buf, size - digest_size) == 0)
+		return 0;
+
+	return iomap_fsverity_write(file, position, size, buf,
+			&xfs_buffered_write_iomap_ops,
+			&xfs_iomap_write_ops);
+}
+
+const struct fsverity_operations xfs_fsverity_ops = {
+	.begin_enable_verity		= xfs_fsverity_begin_enable,
+	.end_enable_verity		= xfs_fsverity_end_enable,
+	.get_verity_descriptor		= xfs_fsverity_get_descriptor,
+	.read_merkle_tree_page		= xfs_fsverity_read_merkle,
+	.readahead_merkle_tree		= xfs_fsverity_readahead_merkle_tree,
+	.write_merkle_tree_block	= xfs_fsverity_write_merkle,
+};
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
index ec77ba571106..6a981e20a75b 100644
--- a/fs/xfs/xfs_fsverity.h
+++ b/fs/xfs/xfs_fsverity.h
@@ -6,8 +6,10 @@
 #define __XFS_FSVERITY_H__
 
 #include "xfs_platform.h"
+#include <linux/fsverity.h>
 
 #ifdef CONFIG_FS_VERITY
+extern const struct fsverity_operations xfs_fsverity_ops;
 loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
 bool xfs_fsverity_is_file_data(const struct xfs_inode *ip, loff_t offset);
 #else
diff --git a/fs/xfs/xfs_message.c b/fs/xfs/xfs_message.c
index fd297082aeb8..9818d8f8f239 100644
--- a/fs/xfs/xfs_message.c
+++ b/fs/xfs/xfs_message.c
@@ -153,6 +153,10 @@ xfs_warn_experimental(
 			.opstate	= XFS_OPSTATE_WARNED_ZONED,
 			.name		= "zoned RT device",
 		},
+		[XFS_EXPERIMENTAL_FSVERITY] = {
+			.opstate	= XFS_OPSTATE_WARNED_FSVERITY,
+			.name		= "fsverity",
+		},
 	};
 	ASSERT(feat >= 0 && feat < XFS_EXPERIMENTAL_MAX);
 	BUILD_BUG_ON(ARRAY_SIZE(features) != XFS_EXPERIMENTAL_MAX);
diff --git a/fs/xfs/xfs_message.h b/fs/xfs/xfs_message.h
index 49b0ef40d299..083403944f11 100644
--- a/fs/xfs/xfs_message.h
+++ b/fs/xfs/xfs_message.h
@@ -94,6 +94,7 @@ enum xfs_experimental_feat {
 	XFS_EXPERIMENTAL_SHRINK,
 	XFS_EXPERIMENTAL_LARP,
 	XFS_EXPERIMENTAL_ZONED,
+	XFS_EXPERIMENTAL_FSVERITY,
 
 	XFS_EXPERIMENTAL_MAX,
 };
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 07f6aa3c3f26..84d7cfb5e2c7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -583,6 +583,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
 #define XFS_OPSTATE_WARNED_ZONED	19
 /* (Zoned) GC is in progress */
 #define XFS_OPSTATE_ZONEGC_RUNNING	20
+/* Kernel has logged a warning about fsverity support */
+#define XFS_OPSTATE_WARNED_FSVERITY	21
 
 #define __XFS_IS_OPSTATE(name, NAME) \
 static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f8de44443e81..d9d442009610 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -30,6 +30,7 @@
 #include "xfs_filestream.h"
 #include "xfs_quota.h"
 #include "xfs_sysfs.h"
+#include "xfs_fsverity.h"
 #include "xfs_ondisk.h"
 #include "xfs_rmap_item.h"
 #include "xfs_refcount_item.h"
@@ -1686,6 +1687,9 @@ xfs_fs_fill_super(
 	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
 	sb->s_op = &xfs_super_operations;
+#ifdef CONFIG_FS_VERITY
+	sb->s_vop = &xfs_fsverity_ops;
+#endif
 
 	/*
 	 * Delay mount work if the debug hook is set. This is debug
@@ -1939,6 +1943,9 @@ xfs_fs_fill_super(
 	if (error)
 		goto out_filestream_unmount;
 
+	if (xfs_has_verity(mp))
+		xfs_warn_experimental(mp, XFS_EXPERIMENTAL_FSVERITY);
+
 	root = igrab(VFS_I(mp->m_rootip));
 	if (!root) {
 		error = -ENOENT;
-- 
2.51.2
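
Note on the descriptor layout: the arithmetic in
xfs_fsverity_write_descriptor() and xfs_fsverity_get_descriptor() can be
checked in userspace. The following is only a sketch, not kernel code. The
helper names, struct desc_layout and the example values are made up for
illustration, and it assumes that the writer and the reader use the same
block size and that the start alignment is a multiple of that block size.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align; align must be a power of two. */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Hypothetical container for the two positions the writer computes. */
struct desc_layout {
	uint64_t	desc_pos;	/* first byte of the descriptor */
	uint64_t	desc_size_pos;	/* position of the trailing 32-bit size */
};

/* Writer side: same steps as xfs_fsverity_write_descriptor(). */
static struct desc_layout
layout_for_write(uint64_t metadata_offset, uint64_t tree_size,
		 uint32_t desc_size, uint64_t start_align, uint64_t blksize)
{
	struct desc_layout	l;
	uint64_t		desc_end;

	l.desc_pos = round_up_pow2(metadata_offset + tree_size, start_align);
	desc_end = l.desc_pos + desc_size;
	/* The size lives in the last 4 bytes of the last metadata block. */
	l.desc_size_pos = round_up_pow2(desc_end + sizeof(uint32_t), blksize) -
			  sizeof(uint32_t);
	return l;
}

/* Reader side: same steps as xfs_fsverity_get_descriptor(). */
static uint64_t
desc_pos_for_read(uint64_t last_block_end, uint32_t desc_size,
		  uint64_t blksize)
{
	uint64_t	desc_size_pos = last_block_end - sizeof(uint32_t);

	return round_down_pow2(desc_size_pos - desc_size, blksize);
}
```

With a metadata offset of 65536, a 4096-byte tree, a 256-byte descriptor, a
(hypothetical) 65536-byte start alignment and 4096-byte blocks, the writer
places the descriptor at 131072 and its size at 135164. A reader that finds
the last extent ending at 135168 rounds back down to 131072.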



* [PATCH v7 15/22] xfs: use read ioend for fsverity data verification
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Use read ioends for fsverity verification. Do not issue fsverity
metadata I/O through the same workqueue, as a full workqueue could
cause a deadlock.

Pass fsverity_info from iomap context down to the ioend as hashtable
lookups are expensive.

Add a simple helper to check that an offset is in file data, which
needs verification, and not in fsverity metadata.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
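
Note (not part of the commit): the queueing decision in xfs_end_bio() can be
modelled in userspace roughly as below. This is a sketch only. The enum and
both function names are hypothetical, and the metadata offset is passed in
as a plain number instead of being derived from the inode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two completion contexts. */
enum completion_queue {
	QUEUE_DEFAULT,		/* models mp->m_unwritten_workqueue */
	QUEUE_FSVERITY,		/* models fsverity_enqueue_verify_work() */
};

/* True when offset lies in file data of a verity-enabled file. */
static bool
is_file_data(bool verity_active, int64_t offset, int64_t metadata_offset)
{
	return verity_active && offset < metadata_offset;
}

/*
 * Reads of file data need verification and go to the fsverity queue.
 * Everything else, including reads of fsverity metadata, stays on the
 * default queue, so verification work never waits behind itself.
 */
static enum completion_queue
pick_queue(bool verity_active, int64_t offset, int64_t metadata_offset)
{
	if (is_file_data(verity_active, offset, metadata_offset))
		return QUEUE_FSVERITY;
	return QUEUE_DEFAULT;
}
```

For example, with a metadata offset of 65536, a read at offset 0 of a verity
file is routed to the fsverity queue, while a read at 65536 (the first
Merkle tree byte) completes on the default queue.
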
 fs/xfs/xfs_aops.c     | 46 ++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_fsverity.c |  9 +++++++++
 fs/xfs/xfs_fsverity.h |  6 ++++++
 3 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9503252a0fa4..ecb07f250956 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -24,6 +24,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_fsverity.h"
 #include <linux/bio-integrity.h>
+#include <linux/fsverity.h>
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -171,6 +172,23 @@ xfs_end_ioend_write(
 	memalloc_nofs_restore(nofs_flag);
 }
 
+/*
+ * IO read completion.
+ */
+static void
+xfs_end_ioend_read(
+	struct iomap_ioend	*ioend)
+{
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+
+	if (!ioend->io_bio.bi_status &&
+			xfs_fsverity_is_file_data(ip, ioend->io_offset))
+		fsverity_verify_bio(ioend->io_vi,
+				    &ioend->io_bio);
+	iomap_finish_ioends(ioend,
+		blk_status_to_errno(ioend->io_bio.bi_status));
+}
+
 /*
  * Finish all pending IO completions that require transactional modifications.
  *
@@ -205,8 +223,7 @@ xfs_end_io(
 		list_del_init(&ioend->io_list);
 		iomap_ioend_try_merge(ioend, &tmp);
 		if (bio_op(&ioend->io_bio) == REQ_OP_READ)
-			iomap_finish_ioends(ioend,
-				blk_status_to_errno(ioend->io_bio.bi_status));
+			xfs_end_ioend_read(ioend);
 		else
 			xfs_end_ioend_write(ioend);
 		cond_resched();
@@ -232,9 +249,14 @@ xfs_end_bio(
 	}
 
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
-	if (list_empty(&ip->i_ioend_list))
-		WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
+	if (list_empty(&ip->i_ioend_list)) {
+		if (IS_ENABLED(CONFIG_FS_VERITY) && ioend->io_vi &&
+		    ioend->io_offset < xfs_fsverity_metadata_offset(ip))
+			fsverity_enqueue_verify_work(&ip->i_ioend_work);
+		else
+			WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
 					 &ip->i_ioend_work));
+	}
 	list_add_tail(&ioend->io_list, &ip->i_ioend_list);
 	spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
 }
@@ -764,9 +786,13 @@ xfs_bio_submit_read(
 	struct iomap_read_folio_ctx	*ctx)
 {
 	struct bio			*bio = ctx->read_ctx;
+	struct iomap_ioend		*ioend;
 
 	/* defer read completions to the ioend workqueue */
-	iomap_init_ioend(iter->inode, bio, ctx->read_ctx_file_offset, 0);
+	ioend = iomap_init_ioend(iter->inode, bio, ctx->read_ctx_file_offset,
+			0);
+	ioend->io_vi = ctx->vi;
+
 	bio->bi_end_io = xfs_end_bio;
 	submit_bio(bio);
 }
@@ -779,11 +805,13 @@ static const struct iomap_read_ops xfs_iomap_read_ops = {
 
 static inline const struct iomap_read_ops *
 xfs_get_iomap_read_ops(
-	const struct address_space	*mapping)
+	const struct address_space	*mapping,
+	loff_t				position)
 {
 	struct xfs_inode		*ip = XFS_I(mapping->host);
 
-	if (bdev_has_integrity_csum(xfs_inode_buftarg(ip)->bt_bdev))
+	if (bdev_has_integrity_csum(xfs_inode_buftarg(ip)->bt_bdev) ||
+			xfs_fsverity_is_file_data(ip, position))
 		return &xfs_iomap_read_ops;
 	return &iomap_bio_read_ops;
 }
@@ -795,7 +823,7 @@ xfs_vm_read_folio(
 {
 	struct iomap_read_folio_ctx	ctx = { .cur_folio = folio };
 
-	ctx.ops = xfs_get_iomap_read_ops(folio->mapping);
+	ctx.ops = xfs_get_iomap_read_ops(folio->mapping, folio_pos(folio));
 	iomap_read_folio(&xfs_read_iomap_ops, &ctx, NULL);
 	return 0;
 }
@@ -806,7 +834,7 @@ xfs_vm_readahead(
 {
 	struct iomap_read_folio_ctx	ctx = { .rac = rac };
 
-	ctx.ops = xfs_get_iomap_read_ops(rac->mapping),
+	ctx.ops = xfs_get_iomap_read_ops(rac->mapping, readahead_pos(rac));
 	iomap_readahead(&xfs_read_iomap_ops, &ctx, NULL);
 }
 
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
index 6e6a8636a577..b983e20bb5e1 100644
--- a/fs/xfs/xfs_fsverity.c
+++ b/fs/xfs/xfs_fsverity.c
@@ -19,3 +19,12 @@ xfs_fsverity_metadata_offset(
 {
 	return round_up(i_size_read(VFS_IC(ip)), XFS_FSVERITY_START_ALIGN);
 }
+
+bool
+xfs_fsverity_is_file_data(
+	const struct xfs_inode	*ip,
+	loff_t			offset)
+{
+	return fsverity_active(VFS_IC(ip)) &&
+			offset < xfs_fsverity_metadata_offset(ip);
+}
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
index 5771db2cd797..ec77ba571106 100644
--- a/fs/xfs/xfs_fsverity.h
+++ b/fs/xfs/xfs_fsverity.h
@@ -9,12 +9,18 @@
 
 #ifdef CONFIG_FS_VERITY
 loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
+bool xfs_fsverity_is_file_data(const struct xfs_inode *ip, loff_t offset);
 #else
 static inline loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip)
 {
 	WARN_ON_ONCE(1);
 	return ULLONG_MAX;
 }
+static inline bool xfs_fsverity_is_file_data(const struct xfs_inode *ip,
+					    loff_t offset)
+{
+	return false;
+}
 #endif	/* CONFIG_FS_VERITY */
 
 #endif	/* __XFS_FSVERITY_H__ */
-- 
2.51.2



* [PATCH v7 14/22] xfs: handle fsverity I/O in write/read path
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

For write/writeback, set the IOMAP_F_FSVERITY flag to tell iomap not to
update the inode size and not to skip folios beyond EOF.

Initiate fsverity writeback with IOMAP_F_FSVERITY set to tell iomap that
it should not skip a folio that is dirty beyond EOF.

In the read path, let iomap know that we are reading fsverity metadata,
so that it treats holes in the tree as a request to synthesize tree
blocks and a hole after the descriptor as the end of the fsverity region.

Introduce a new inode flag meaning that the Merkle tree is being built
on the inode.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
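
Note (not part of the commit): a userspace sketch of how the metadata region
is located and how the read path decides to set IOMAP_F_FSVERITY. The value
of EXAMPLE_START_ALIGN is an assumption for illustration only, since the
definition of XFS_FSVERITY_START_ALIGN is not part of this patch, and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical alignment; the real XFS_FSVERITY_START_ALIGN may differ. */
#define EXAMPLE_START_ALIGN	65536

/* Models xfs_fsverity_metadata_offset(): i_size rounded up to the alignment. */
static int64_t
example_metadata_offset(int64_t i_size)
{
	assert(i_size >= 0);
	return (i_size + EXAMPLE_START_ALIGN - 1) &
	       ~((int64_t)EXAMPLE_START_ALIGN - 1);
}

/*
 * Models the check added to xfs_read_iomap_begin(): a read is flagged as
 * fsverity metadata when verity is active and the offset is at or past
 * the metadata offset.
 */
static bool
example_read_is_metadata(bool verity_active, int64_t i_size, int64_t offset)
{
	return verity_active && offset >= example_metadata_offset(i_size);
}
```

Under that assumed alignment, a 100-byte file has its metadata starting at
65536, so a read at offset 4096 is ordinary file data and a read at 65536 is
flagged as fsverity metadata.
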
 fs/xfs/Makefile          |  1 +
 fs/xfs/libxfs/xfs_bmap.c |  7 +++++++
 fs/xfs/xfs_aops.c        | 16 +++++++++++++++-
 fs/xfs/xfs_fsverity.c    | 21 +++++++++++++++++++++
 fs/xfs/xfs_fsverity.h    | 20 ++++++++++++++++++++
 fs/xfs/xfs_inode.h       |  6 ++++++
 fs/xfs/xfs_iomap.c       | 15 +++++++++++++--
 7 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_fsverity.c
 create mode 100644 fs/xfs/xfs_fsverity.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 9f7133e02576..38b7f51e5d84 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -149,6 +149,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
 xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
+xfs-$(CONFIG_FS_VERITY)		+= xfs_fsverity.o
 
 # notify failure
 ifeq ($(CONFIG_MEMORY_FAILURE),y)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7a4c8f1aa76c..931d02678d19 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -41,6 +41,8 @@
 #include "xfs_inode_util.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
@@ -4451,6 +4453,11 @@ xfs_bmapi_convert_one_delalloc(
 	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, bma.length));
 	XFS_STATS_INC(mp, xs_xstrat_quick);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION) &&
+	    XFS_FSB_TO_B(mp, bma.got.br_startoff) >=
+		    xfs_fsverity_metadata_offset(ip))
+		flags |= IOMAP_F_FSVERITY;
+
 	ASSERT(!isnullstartblock(bma.got.br_startblock));
 	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags,
 				xfs_iomap_inode_sequence(ip, flags));
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f279055fcea0..9503252a0fa4 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,6 +22,7 @@
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
+#include "xfs_fsverity.h"
 #include <linux/bio-integrity.h>
 
 struct xfs_writepage_ctx {
@@ -339,12 +340,16 @@ xfs_map_blocks(
 	int			retries = 0;
 	int			error = 0;
 	unsigned int		*seq;
+	unsigned int		iomap_flags = 0;
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
 	XFS_ERRORTAG_DELAY(mp, XFS_ERRTAG_WB_DELAY_MS);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	/*
 	 * COW fork blocks can overlap data fork blocks even if the blocks
 	 * aren't shared.  COW I/O always takes precedent, so we must always
@@ -432,7 +437,8 @@ xfs_map_blocks(
 	    isnullstartblock(imap.br_startblock))
 		goto allocate_blocks;
 
-	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, 0, XFS_WPC(wpc)->data_seq);
+	xfs_bmbt_to_iomap(ip, &wpc->iomap, &imap, 0, iomap_flags,
+			  XFS_WPC(wpc)->data_seq);
 	trace_xfs_map_blocks_found(ip, offset, count, whichfork, &imap);
 	return 0;
 allocate_blocks:
@@ -705,6 +711,14 @@ xfs_vm_writepages(
 			},
 		};
 
+		/*
+		 * Writeback normally skips folios past EOF. Let it know that
+		 * this I/O is for fsverity metadata, so that restriction needs
+		 * to be skipped
+		 */
+		if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+			wpc.ctx.iomap.flags |= IOMAP_F_FSVERITY;
+
 		return iomap_writepages(&wpc.ctx);
 	}
 }
diff --git a/fs/xfs/xfs_fsverity.c b/fs/xfs/xfs_fsverity.c
new file mode 100644
index 000000000000..6e6a8636a577
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.c
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Red Hat, Inc.
+ */
+#include "xfs_platform.h"
+#include "xfs_format.h"
+#include "xfs_inode.h"
+#include "xfs_shared.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_fsverity.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
+#include <linux/iomap.h>
+
+loff_t
+xfs_fsverity_metadata_offset(
+	const struct xfs_inode	*ip)
+{
+	return round_up(i_size_read(VFS_IC(ip)), XFS_FSVERITY_START_ALIGN);
+}
diff --git a/fs/xfs/xfs_fsverity.h b/fs/xfs/xfs_fsverity.h
new file mode 100644
index 000000000000..5771db2cd797
--- /dev/null
+++ b/fs/xfs/xfs_fsverity.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Red Hat, Inc.
+ */
+#ifndef __XFS_FSVERITY_H__
+#define __XFS_FSVERITY_H__
+
+#include "xfs_platform.h"
+
+#ifdef CONFIG_FS_VERITY
+loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip);
+#else
+static inline loff_t xfs_fsverity_metadata_offset(const struct xfs_inode *ip)
+{
+	WARN_ON_ONCE(1);
+	return ULLONG_MAX;
+}
+#endif	/* CONFIG_FS_VERITY */
+
+#endif	/* __XFS_FSVERITY_H__ */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..6df48d68a919 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -415,6 +415,12 @@ static inline bool xfs_inode_can_sw_atomic_write(const struct xfs_inode *ip)
  */
 #define XFS_IREMAPPING		(1U << 15)
 
+/*
+ * fs-verity's Merkle tree is under construction. The file is read-only, the
+ * only writes happening are for the fsverity metadata.
+ */
+#define XFS_VERITY_CONSTRUCTION	(1U << 16)
+
 /* All inode state flags related to inode reclaim. */
 #define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
 				 XFS_IRECLAIM | \
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 9c2f12d5fec9..71ccd4ff5f48 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -32,6 +32,8 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_fsverity.h"
+#include <linux/fsverity.h>
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -1789,6 +1791,9 @@ xfs_buffered_write_iomap_begin(
 		return xfs_direct_write_iomap_begin(inode, offset, count,
 				flags, iomap, srcmap);
 
+	if (xfs_iflags_test(ip, XFS_VERITY_CONSTRUCTION))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	error = xfs_qm_dqattach(ip);
 	if (error)
 		return error;
@@ -2113,12 +2118,17 @@ xfs_read_iomap_begin(
 	bool			shared = false;
 	unsigned int		lockmode = XFS_ILOCK_SHARED;
 	u64			seq;
+	unsigned int		iomap_flags = 0;
 
 	ASSERT(!(flags & (IOMAP_WRITE | IOMAP_ZERO)));
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	if (fsverity_active(inode) &&
+	    (offset >= xfs_fsverity_metadata_offset(ip)))
+		iomap_flags |= IOMAP_F_FSVERITY;
+
 	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
 	if (error)
 		return error;
@@ -2132,8 +2142,9 @@ xfs_read_iomap_begin(
 	if (error)
 		return error;
 	trace_xfs_iomap_found(ip, offset, length, XFS_DATA_FORK, &imap);
-	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
-				 shared ? IOMAP_F_SHARED : 0, seq);
+	iomap_flags |= shared ? IOMAP_F_SHARED : 0;
+
+	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, iomap_flags, seq);
 }
 
 const struct iomap_ops xfs_read_iomap_ops = {
-- 
2.51.2



* [PATCH v7 13/22] xfs: disable direct read path for fs-verity files
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

Direct I/O is not supported on verity files. Attempts to use the direct
I/O path on such files should fall back to the buffered I/O path.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
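
Note (not part of the commit): the fallback amounts to clearing one flag
before the existing dispatch runs. A minimal userspace model follows. The
flag value, the enum and the function name are all hypothetical, and only
the ordering of the checks follows xfs_file_read_iter().

```c
#include <stdbool.h>

/* Hypothetical flag bit; the kernel's IOCB_DIRECT has its own value. */
#define EXAMPLE_IOCB_DIRECT	(1u << 0)

enum example_read_path {
	EXAMPLE_READ_DAX,
	EXAMPLE_READ_DIRECT,
	EXAMPLE_READ_BUFFERED,
};

/*
 * Clear the direct flag for verity files first, then dispatch as before.
 * The flag is cleared in the caller's copy so later code sees a buffered
 * request.
 */
static enum example_read_path
example_pick_read_path(unsigned int *ki_flags, bool verity_active,
		       bool is_dax)
{
	if (verity_active)
		*ki_flags &= ~EXAMPLE_IOCB_DIRECT;

	if (is_dax)
		return EXAMPLE_READ_DAX;
	if (*ki_flags & EXAMPLE_IOCB_DIRECT)
		return EXAMPLE_READ_DIRECT;
	return EXAMPLE_READ_BUFFERED;
}
```

A direct read on a verity file therefore ends up in the buffered path with
the direct bit cleared, while the same request on a non-verity file is left
untouched.
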
 fs/xfs/xfs_file.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a980ac5196a8..6fa9835f9531 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -282,7 +282,8 @@ xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
+	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(iocb, to);
@@ -333,6 +334,14 @@ xfs_file_read_iter(
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	/*
+	 * If fs-verity is enabled, fall back from the direct read path to the
+	 * buffered read path. IOCB_DIRECT is set at this point and needs
+	 * to be cleared (see generic_file_read_iter())
+	 */
+	if (fsverity_active(inode))
+		iocb->ki_flags &= ~IOCB_DIRECT;
+
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
-- 
2.51.2



* [PATCH v7 12/22] xfs: don't allow to enable DAX on fs-verity sealed inode
From: Andrey Albershteyn @ 2026-04-09 13:13 UTC (permalink / raw)
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, djwong
In-Reply-To: <20260409131404.1545834-1-aalbersh@kernel.org>

fs-verity doesn't support DAX. Forbid the filesystem from enabling DAX
on inodes which already have fs-verity enabled. The opposite case is
checked when fs-verity is enabled: it won't be enabled if DAX is.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ca369eb96561..17efc83a86ed 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1387,6 +1387,8 @@ xfs_inode_should_enable_dax(
 		return false;
 	if (!xfs_inode_supports_dax(ip))
 		return false;
+	if (ip->i_diflags2 & XFS_DIFLAG2_VERITY)
+		return false;
 	if (xfs_has_dax_always(ip->i_mount))
 		return true;
 	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
-- 
2.51.2



