[PATCHBOMB v10] xfsprogs: autonomous self healing of filesystems

public inbox for linux-xfs@vger.kernel.org

* [PATCHBOMB v10] xfsprogs: autonomous self healing of filesystems
@ 2026-03-19  4:37 Darrick J. Wong
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
  0 siblings, 2 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:37 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: cem, hch, linux-xfs, Zorro Lang

Hi all,

This patchset contains the userspace and QA changes needed to put the
new kernel functionality to use: xfs_healer, which consumes the live
filesystem health events that the kernel (xfs_healthmon.c) delivers to
userspace, plus a lot of cleanups to xfs_scrub's media verification.

In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically.  This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing.  Per-mount instances of the daemon are
auto-started by a starter service that uses fanotify.

When the patchsets under this cover letter are merged, online fsck for
XFS will at long last be fully feature complete.  The passive scan parts
have been done since mid-2024; this final part adds proactive repair.

Here's what's left to review.  Thanks to Christoph for doing a bunch of
xfs_healer reviews and sharing some cleanups he wanted to see in
xfs_scrub, and to Zorro for merging the fstests.

[PATCHSET v10 1/2] xfsprogs: autonomous self healing of filesystems
  [PATCH 19/26] xfs_healer: use statmount to find moved filesystems
[PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA
  [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap
  [PATCH 02/22] mkfs: rename byte unit conversion macros
  [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h
  [PATCH 04/22] xfs_scrub: report truncated devices as media errors
  [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return
  [PATCH 07/22] xfs_scrub: move read verification scheduling to
  [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere
  [PATCH 11/22] xfs_scrub: rename nr_io_threads
  [PATCH 16/22] xfs_scrub: perform media scanning of the log region
  [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids
  [PATCH 18/22] xfs_scrub: move failmap and other outputs into
  [PATCH 19/22] xfs_scrub: clean up device-related error messages
  [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk.
  [PATCH 21/22] xfs_scrub: raise media verification IO limits
  [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO

v10: cleanups of the media verification code in xfs_scrub
v9: reorg listmount/statmount, use it to find moved mounts, improve the
    commit messages and documentation
v8: clean up userspace for merging now that the kernel part is upstream
v7: more cleanups of the media verification ioctl, improve comments, and
    reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
    and filter bio status flags better
v5: add verify-media ioctl, collapse small helper funcs with only
    one caller
v4: drop multiple client support so we can make direct calls into
    healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status

--D

* [PATCHSET v10 1/2] xfsprogs: autonomous self healing of filesystems
  2026-03-19  4:37 [PATCHBOMB v10] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
@ 2026-03-19  4:38 ` Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 01/26] libfrog: add a function to grab the path from an open fd and a file handle Darrick J. Wong
                     ` (25 more replies)
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
  1 sibling, 26 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:38 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

Hi all,

This patchset builds new functionality to deliver live information about
filesystem health events to userspace.  This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.

When an event occurs, the hook functions queue an event object to each
event anonfd for later processing.  Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption.  The events themselves can be read() from the anonfd
as C structs for the xfs_healer daemon.
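
As a rough illustration of the consumer side, the sketch below reads
fixed-size records from a file descriptor until EOF and hands each one
to a callback.  struct demo_event and the demo_* names are hypothetical
stand-ins; the real event layout is struct xfs_health_monitor_event, and
the sketch makes no claim about how the kernel signals the end of the
event stream.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical record layout, NOT struct xfs_health_monitor_event. */
struct demo_event {
	uint32_t	domain;
	uint32_t	type;
	uint64_t	payload;
};

typedef void (*demo_event_fn)(const struct demo_event *ev, void *arg);

/*
 * Read whole records from fd until EOF, calling fn on each one.
 * Returns 0 or positive errno.
 */
static int
demo_read_events(
	int			fd,
	demo_event_fn		fn,
	void			*arg)
{
	struct demo_event	ev;
	ssize_t			ret;

	assert(fn != NULL);

	for (;;) {
		ret = read(fd, &ev, sizeof(ev));
		if (ret == 0)
			return 0;
		if (ret < 0) {
			/* Retry if a signal interrupted the read. */
			if (errno == EINTR)
				continue;
			return errno;
		}
		/* A partial record means the stream is unusable. */
		if ((size_t)ret != sizeof(ev))
			return EIO;
		fn(&ev, arg);
	}
}

/* Example callback: count the events seen. */
static void
demo_count_event(
	const struct demo_event	*ev,
	void			*arg)
{
	(void)ev;
	(*(unsigned int *)arg)++;
}
```

A caller would pass the event fd plus a callback such as
demo_count_event; a real consumer would decode each record instead of
counting it.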

In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically.  This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing.  Per-mount instances of the daemon are
auto-started by a starter service that uses fanotify.

v10: move the xfs_scrub cleanups and changes to their own patchset,
     improve the commit messages to explain why we use getmntent and
     statmount
v9: move listmount/statmount to libfrog; improve documentation about why
    we dance with getmntent; enhance getmntent reconnection with
    listmount; move --svcname helpers to xfs_{healer,scrub}; improve
    commit messages; various tweaks to fstests
v8: clean up userspace for merging now that the kernel part is upstream
v7: more cleanups of the media verification ioctl, improve comments, and
    reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
v5: add verify-media ioctl, collapse small helper funcs with only
    one caller
v4: drop multiple client support so we can make direct calls into
    healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * libfrog: add a function to grab the path from an open fd and a file handle
 * libfrog: create healthmon event log library functions
 * libfrog: add support code for starting systemd services programmatically
 * libfrog: hoist a couple of service helper functions
 * libfrog: add wrappers for listmount and statmount
 * man2: document the healthmon ioctl
 * man2: document the media verification ioctl
 * xfs_io: monitor filesystem health events
 * xfs_io: add a media verify command
 * xfs_healer: create daemon to listen for health events
 * xfs_healer: enable repairing filesystems
 * xfs_healer: use getparents to look up file names
 * xfs_healer: create a per-mount background monitoring service
 * xfs_healer: create a service to start the per-mount healer service
 * xfs_healer: don't start service if kernel support unavailable
 * xfs_healer: use the autofsck fsproperty to select mode
 * xfs_healer: run full scrub after lost corruption events or targeted repair failure
 * xfs_healer: use getmntent to find moved filesystems
 * xfs_healer: use statmount to find moved filesystems even faster
 * xfs_healer: validate that repair fds point to the monitored fs
 * xfs_healer: add a manual page
 * xfs_scrub: print systemd service names
 * xfs_io: add listmount and statmount commands
 * mkfs: enable online repair if all backrefs are enabled
 * debian/control: listify the build dependencies
 * debian: enable xfs_healer on the root filesystem by default
---
 healer/xfs_healer.h                            |   92 +++
 include/linux.h                                |    8 
 io/io.h                                        |    8 
 libfrog/flagmap.h                              |   23 +
 libfrog/fsproperties.h                         |    5 
 libfrog/getparents.h                           |    4 
 libfrog/healthevent.h                          |   55 ++
 libfrog/statmount.h                            |  104 ++++
 libfrog/systemd.h                              |   55 ++
 scrub/xfs_scrub.h                              |    3 
 Makefile                                       |    5 
 configure.ac                                   |   13 
 debian/control                                 |   14 -
 debian/postinst                                |    8 
 debian/prerm                                   |   13 
 debian/rules                                   |    3 
 healer/Makefile                                |   69 ++
 healer/fsrepair.c                              |  342 ++++++++++++
 healer/system-xfs_healer.slice                 |   31 +
 healer/weakhandle.c                            |  296 +++++++++++
 healer/xfs_healer.c                            |  666 ++++++++++++++++++++++++
 healer/xfs_healer@.service.in                  |  108 ++++
 healer/xfs_healer_start.c                      |  368 +++++++++++++
 healer/xfs_healer_start.service.in             |   85 +++
 include/builddefs.in                           |   15 +
 io/Makefile                                    |    9 
 io/healthmon.c                                 |  186 +++++++
 io/init.c                                      |    3 
 io/listmount.c                                 |  361 +++++++++++++
 io/verify_media.c                              |  180 ++++++
 libfrog/Makefile                               |   19 +
 libfrog/flagmap.c                              |   79 +++
 libfrog/getparents.c                           |   93 +++
 libfrog/healthevent.c                          |  477 +++++++++++++++++
 libfrog/statmount.c                            |   76 +++
 libfrog/systemd.c                              |  177 ++++++
 m4/package_libcdev.m4                          |  129 +++++
 man/man2/ioctl_xfs_health_fd_on_monitored_fs.2 |   75 +++
 man/man2/ioctl_xfs_health_monitor.2            |  464 +++++++++++++++++
 man/man2/ioctl_xfs_verify_media.2              |  185 +++++++
 man/man8/Makefile                              |   40 +
 man/man8/xfs_healer.8                          |  109 ++++
 man/man8/xfs_healer_start.8                    |   37 +
 man/man8/xfs_io.8                              |  133 +++++
 mkfs/xfs_mkfs.c                                |    9 
 scrub/Makefile                                 |   14 -
 scrub/xfs_scrub.c                              |   82 ++-
 47 files changed, 5272 insertions(+), 58 deletions(-)
 create mode 100644 healer/xfs_healer.h
 create mode 100644 libfrog/flagmap.h
 create mode 100644 libfrog/healthevent.h
 create mode 100644 libfrog/statmount.h
 create mode 100644 libfrog/systemd.h
 create mode 100644 debian/prerm
 create mode 100644 healer/Makefile
 create mode 100644 healer/fsrepair.c
 create mode 100644 healer/system-xfs_healer.slice
 create mode 100644 healer/weakhandle.c
 create mode 100644 healer/xfs_healer.c
 create mode 100644 healer/xfs_healer@.service.in
 create mode 100644 healer/xfs_healer_start.c
 create mode 100644 healer/xfs_healer_start.service.in
 create mode 100644 io/healthmon.c
 create mode 100644 io/listmount.c
 create mode 100644 io/verify_media.c
 create mode 100644 libfrog/flagmap.c
 create mode 100644 libfrog/healthevent.c
 create mode 100644 libfrog/statmount.c
 create mode 100644 libfrog/systemd.c
 create mode 100644 man/man2/ioctl_xfs_health_fd_on_monitored_fs.2
 create mode 100644 man/man2/ioctl_xfs_health_monitor.2
 create mode 100644 man/man2/ioctl_xfs_verify_media.2
 create mode 100644 man/man8/xfs_healer.8
 create mode 100644 man/man8/xfs_healer_start.8


* [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA
  2026-03-19  4:37 [PATCHBOMB v10] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
@ 2026-03-19  4:38 ` Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer Darrick J. Wong
                     ` (21 more replies)
  1 sibling, 22 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:38 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs, linux-xfs

Hi all,

Originally, this was a single patch inside the autonomous self healing
patchset.  However, various issues brought up during review of the v9
codebase made it obvious that more cleanups were necessary prior to
shifting xfs_scrub to use the new media verification ioctl.  Therefore,
the xfs_scrub changes have been broken out into a separate patchset to
contain the refactorings and new functionality.

v10: decouple the spacemap iterator and the read-verify pool code,
     simplify the read-verify apis, use a fixed limit of 8 verify IO
     threads per device, improve the log media error reporting

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-verify-ioctl
---
Commits in this patchset:
 * libfrog: allow bitmap_free to handle a null bitmap pointer
 * mkfs: rename byte unit conversion macros
 * libfrog: lift *BYTES helpers to convert.h
 * xfs_scrub: report truncated devices as media errors
 * xfs_scrub: fix i18n of the decode_special_owner return value
 * scrub: remove the unused io_disk field in struct read_verify
 * xfs_scrub: move read verification scheduling to phase6.c
 * scrub: simplify the read_verify_pool_alloc interface
 * xfs_scrub: don't pass the io_end_arg around everywhere
 * scrub: use enum xfs_device for read verification
 * xfs_scrub: rename nr_io_threads
 * scrub: simplify verifier threads calculation
 * xfs_scrub: move disk media verification error injection
 * xfs_scrub: use the verify media ioctl during phase 6 if possible
 * scrub: don't allocate disk for ioctl-based media verify
 * xfs_scrub: perform media scanning of the log region
 * xfs_scrub: index read-verify pools by xfs_device ids
 * xfs_scrub: move failmap and other outputs into read_verify_pool
 * xfs_scrub: clean up device-related error messages
 * xfs_scrub: drop SCSI_VERIFY code from disk.
 * xfs_scrub: raise media verification IO limits
 * xfs_scrub: allow overrides of the media verification IO limits
---
 libfrog/convert.h         |    7 +
 scrub/disk.h              |    2 
 scrub/read_verify.h       |   34 ++-
 scrub/spacemap.h          |    5 
 scrub/xfs_scrub.h         |    9 -
 db/bmap_inflate.c         |    2 
 doc/README-env-vars.txt   |    4 
 libfrog/bitmap.c          |    3 
 libfrog/convert.c         |    7 -
 mdrestore/xfs_mdrestore.c |    5 
 mkfs/xfs_mkfs.c           |   38 ++--
 repair/agbtree.c          |    3 
 repair/bmap_repair.c      |    3 
 repair/rmap.c             |    3 
 repair/xfs_repair.c       |    5 
 scrub/common.c            |   19 +-
 scrub/disk.c              |  204 -------------------
 scrub/phase1.c            |   91 ++++++--
 scrub/phase5.c            |    6 -
 scrub/phase6.c            |  417 +++++++++++++++++++--------------------
 scrub/phase8.c            |    3 
 scrub/read_verify.c       |  485 +++++++++++++++++++++++++++++++++------------
 scrub/xfs_scrub.c         |    4 
 23 files changed, 722 insertions(+), 637 deletions(-)


* [PATCH 01/26] libfrog: add a function to grab the path from an open fd and a file handle
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
@ 2026-03-19  4:39   ` Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 02/26] libfrog: create healthmon event log library functions Darrick J. Wong
                     ` (24 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:39 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

handle_walk_paths operates on a file handle, but requires that the fs
has been registered with libhandle via path_to_fshandle.  For a normal
libhandle client this is the desirable behavior because the application
*should* maintain an open fd to the filesystem mount.

However, for xfs_healer this isn't going to work well because the healer
mustn't pin the mount while it's running.  It's smart enough to know how
to find and reconnect to the mountpoint, but libhandle doesn't have any
such concept.

Therefore, alter the libfrog getparents code so that xfs_healer can pass
in the mountpoint and reconnected fd without needing libhandle.  All
we're really doing here is trying to obtain a user-visible path for a
file that encountered problems for logging purposes; if it fails, we'll
fall back to logging the inode number.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
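For illustration only, the logging fallback described above might look
like the sketch below.  demo_describe_file is a hypothetical helper and
not part of libfrog; the "ino ... gen ..." form follows the healthmon
log format used elsewhere in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Describe a file for logging.  Use the full path if the caller managed
 * to resolve one; otherwise fall back to mountpoint, inode number, and
 * generation.  Output is truncated if buf is too small.
 */
static void
demo_describe_file(
	char			*buf,
	size_t			bufsize,
	const char		*mountpoint,
	const char		*path,
	unsigned long long	ino,
	unsigned int		gen)
{
	assert(buf != NULL && bufsize > 0);

	if (path && path[0] != '\0')
		snprintf(buf, bufsize, "%s", path);
	else
		snprintf(buf, bufsize, "%s ino %llu gen 0x%x",
				mountpoint, ino, gen);
}
```

Passing a NULL or empty path selects the fallback, so the caller can
attempt the parent pointer walk and pass along whatever it got.
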
 libfrog/getparents.h |    4 ++
 libfrog/getparents.c |   93 ++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 82 insertions(+), 15 deletions(-)


diff --git a/libfrog/getparents.h b/libfrog/getparents.h
index 8098d594219b4c..e1df30889c7606 100644
--- a/libfrog/getparents.h
+++ b/libfrog/getparents.h
@@ -39,4 +39,8 @@ int fd_to_path(int fd, size_t ioctl_bufsize, char *path, size_t pathlen);
 int handle_to_path(const void *hanp, size_t hlen, size_t ioctl_bufsize,
 		char *path, size_t pathlen);
 
+int handle_walk_paths_fd(const char *mntpt, int mntfd, const void *hanp,
+		size_t hanlen, size_t ioctl_bufsize, walk_path_fn fn,
+		void *arg);
+
 #endif /* __LIBFROG_GETPARENTS_H_ */
diff --git a/libfrog/getparents.c b/libfrog/getparents.c
index 9118b0ff32db0d..e8f545392634e4 100644
--- a/libfrog/getparents.c
+++ b/libfrog/getparents.c
@@ -112,9 +112,13 @@ fd_walk_parents(
 	return ret;
 }
 
-/* Walk all parent pointers of this handle.  Returns 0 or positive errno. */
-int
-handle_walk_parents(
+/*
+ * Walk all parent pointers of this handle using the given fd to query the
+ * filesystem.  Returns 0 or positive errno.
+ */
+static int
+handle_walk_parents_fd(
+	int			fd,
 	const void		*hanp,
 	size_t			hlen,
 	size_t			bufsize,
@@ -123,21 +127,11 @@ handle_walk_parents(
 {
 	struct xfs_getparents_by_handle	gph = { };
 	void			*buf;
-	char			*mntpt;
-	int			fd;
 	int			ret;
 
 	if (hlen != sizeof(struct xfs_handle))
 		return EINVAL;
 
-	/*
-	 * This function doesn't modify the handle, but we don't want to have
-	 * to bump the libhandle major version just to change that.
-	 */
-	fd = handle_to_fsfd((void *)hanp, &mntpt);
-	if (fd < 0)
-		return errno;
-
 	buf = alloc_records(&gph.gph_request, bufsize);
 	if (!buf)
 		return errno;
@@ -158,6 +152,29 @@ handle_walk_parents(
 	return ret;
 }
 
+/* Walk all parent pointers of this handle.  Returns 0 or positive errno. */
+int
+handle_walk_parents(
+	const void		*hanp,
+	size_t			hlen,
+	size_t			bufsize,
+	walk_parent_fn		fn,
+	void			*arg)
+{
+	char			*mntpt;
+	int			fd;
+
+	/*
+	 * This function doesn't modify the handle, but we don't want to have
+	 * to bump the libhandle major version just to change that.
+	 */
+	fd = handle_to_fsfd((void *)hanp, &mntpt);
+	if (fd < 0)
+		return errno;
+
+	return handle_walk_parents_fd(fd, hanp, hlen, bufsize, fn, arg);
+}
+
 struct walk_ppaths_info {
 	/* Callback */
 	walk_path_fn		fn;
@@ -169,7 +186,11 @@ struct walk_ppaths_info {
 	/* Path that we're constructing. */
 	struct path_list	*path;
 
+	/* Use this much memory per call. */
 	size_t			ioctl_bufsize;
+
+	/* Use this fd for calling the getparents ioctl. */
+	int			mntfd;
 };
 
 /*
@@ -200,8 +221,14 @@ find_parent_component(
 		return errno;
 	path_list_add_parent_component(wpi->path, pc);
 
-	ret = handle_walk_parents(&rec->p_handle, sizeof(rec->p_handle),
-			wpi->ioctl_bufsize, find_parent_component, wpi);
+	if (wpi->mntfd >= 0)
+		ret = handle_walk_parents_fd(wpi->mntfd, &rec->p_handle,
+				sizeof(rec->p_handle), wpi->ioctl_bufsize,
+				find_parent_component, wpi);
+	else
+		ret = handle_walk_parents(&rec->p_handle,
+				sizeof(rec->p_handle), wpi->ioctl_bufsize,
+				find_parent_component, wpi);
 
 	path_list_del_component(wpi->path, pc);
 	path_component_free(pc);
@@ -222,6 +249,7 @@ handle_walk_paths(
 {
 	struct walk_ppaths_info	wpi = {
 		.ioctl_bufsize	= ioctl_bufsize,
+		.mntfd		= -1,
 	};
 	int			ret;
 
@@ -246,6 +274,41 @@ handle_walk_paths(
 	return ret;
 }
 
+/*
+ * Call the given function on all known paths from the vfs root to the inode
+ * described in the handle using an already open mountpoint and fd.  Returns 0
+ * for success or positive errno.
+ */
+int
+handle_walk_paths_fd(
+	const char		*mntpt,
+	int			mntfd,
+	const void		*hanp,
+	size_t			hlen,
+	size_t			ioctl_bufsize,
+	walk_path_fn		fn,
+	void			*arg)
+{
+	struct walk_ppaths_info	wpi = {
+		.ioctl_bufsize	= ioctl_bufsize,
+		.mntfd		= mntfd,
+		.mntpt		= (char *)mntpt,
+	};
+	int			ret;
+
+	wpi.path = path_list_init();
+	if (!wpi.path)
+		return errno;
+	wpi.fn = fn;
+	wpi.arg = arg;
+
+	ret = handle_walk_parents_fd(mntfd, hanp, hlen, ioctl_bufsize,
+			find_parent_component, &wpi);
+
+	path_list_free(wpi.path);
+	return ret;
+}
+
 /*
  * Call the given function on all known paths from the vfs root to the inode
  * referred to by the file description.  Returns 0 or positive errno.


* [PATCH 02/26] libfrog: create healthmon event log library functions
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 01/26] libfrog: add a function to grab the path from an open fd and a file handle Darrick J. Wong
@ 2026-03-19  4:39   ` Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 03/26] libfrog: add support code for starting systemd services programmatically Darrick J. Wong
                     ` (23 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:39 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some helper functions to log health monitoring events so that xfs_io
and xfs_healer can share logging code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
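A minimal sketch of the mask formatting idea, assuming a table of
bit/name pairs terminated by a NULL name.  The demo_* names and bit
values are made up for illustration and are not the libfrog
definitions.  The sketch also empties the buffer up front so that a
zero mask yields an empty string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a bit-to-name table entry. */
struct demo_flag_map {
	unsigned long long	flag;
	const char		*string;
};

/* Made-up bit assignments, for illustration only. */
static const struct demo_flag_map demo_structs[] = {
	{ 1ULL << 0,	"bnobt" },
	{ 1ULL << 1,	"cntbt" },
	{ 1ULL << 2,	"rmapbt" },
	{ 0, NULL },
};

/*
 * Format the set bits of mask as a delimited list of names, followed by
 * any unnamed bits as one hexadecimal number.  Output is truncated if
 * buf is not large enough.
 */
static void
demo_mask_to_string(
	const struct demo_flag_map	*map,
	unsigned long long		mask,
	const char			*delimiter,
	char				*buf,
	size_t				bufsize)
{
	const char		*tag = "";
	unsigned long long	seen = 0;
	int			w;

	assert(bufsize > 0);
	buf[0] = '\0';

	for (; map->string; map++) {
		seen |= map->flag;
		if (!(mask & map->flag))
			continue;

		w = snprintf(buf, bufsize, "%s%s", tag, map->string);
		/* snprintf returns the untruncated length; stop if full. */
		if (w < 0 || (size_t)w >= bufsize)
			return;

		buf += w;
		bufsize -= w;
		tag = delimiter;
	}

	if (mask & ~seen)
		snprintf(buf, bufsize, "%s0x%llx", tag, mask & ~seen);
}
```

With this table, a mask of 0x5 and a ", " delimiter formats as
"bnobt, rmapbt".
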
 libfrog/flagmap.h     |   20 +++
 libfrog/healthevent.h |   43 ++++++
 libfrog/Makefile      |    4 +
 libfrog/flagmap.c     |   62 ++++++++
 libfrog/healthevent.c |  360 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 489 insertions(+)
 create mode 100644 libfrog/flagmap.h
 create mode 100644 libfrog/healthevent.h
 create mode 100644 libfrog/flagmap.c
 create mode 100644 libfrog/healthevent.c


diff --git a/libfrog/flagmap.h b/libfrog/flagmap.h
new file mode 100644
index 00000000000000..8031d75a7c02a8
--- /dev/null
+++ b/libfrog/flagmap.h
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef LIBFROG_FLAGMAP_H_
+#define LIBFROG_FLAGMAP_H_
+
+struct flag_map {
+	unsigned long long	flag;
+	const char		*string;
+};
+
+void mask_to_string(const struct flag_map *map, unsigned long long mask,
+		const char *delimiter, char *buf, size_t bufsize);
+
+const char *value_to_string(const struct flag_map *map,
+		unsigned long long value);
+
+#endif /* LIBFROG_FLAGMAP_H_ */
diff --git a/libfrog/healthevent.h b/libfrog/healthevent.h
new file mode 100644
index 00000000000000..6de41bc797100c
--- /dev/null
+++ b/libfrog/healthevent.h
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef LIBFROG_HEALTHEVENT_H_
+#define LIBFROG_HEALTHEVENT_H_
+
+struct hme_prefix {
+	/*
+	 * Format a complete file path into this buffer to prevent the logging
+	 * code from printing the mountpoint and a file handle.  Only works for
+	 * file-related events.
+	 */
+	char		path[MAXPATHLEN];
+
+	/* Set this to the mountpoint */
+	const char	*mountpoint;
+};
+
+static inline bool hme_prefix_has_path(const struct hme_prefix *pfx)
+{
+	return pfx->path[0] != 0;
+}
+
+static inline void hme_prefix_clear_path(struct hme_prefix *pfx)
+{
+	pfx->path[0] = 0;
+}
+
+static inline void
+hme_prefix_init(
+	struct hme_prefix	*pfx,
+	const char		*mountpoint)
+{
+	pfx->mountpoint = mountpoint;
+	hme_prefix_clear_path(pfx);
+}
+
+void hme_report_event(const struct hme_prefix *pfx,
+		const struct xfs_health_monitor_event *hme);
+
+#endif /* LIBFROG_HEALTHEVENT_H_ */
diff --git a/libfrog/Makefile b/libfrog/Makefile
index 927bd8d0957fab..bccd9289e5dd79 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -19,11 +19,13 @@ bulkstat.c \
 convert.c \
 crc32.c \
 file_exchange.c \
+flagmap.c \
 fsgeom.c \
 fsproperties.c \
 fsprops.c \
 getparents.c \
 histogram.c \
+healthevent.c \
 file_attr.c \
 list_sort.c \
 linux.c \
@@ -51,11 +53,13 @@ dahashselftest.h \
 div64.h \
 fakelibattr.h \
 file_exchange.h \
+flagmap.h \
 fsgeom.h \
 fsproperties.h \
 fsprops.h \
 getparents.h \
 handle_priv.h \
+healthevent.h \
 histogram.h \
 file_attr.h \
 logging.h \
diff --git a/libfrog/flagmap.c b/libfrog/flagmap.c
new file mode 100644
index 00000000000000..631c4bbc8f1dc0
--- /dev/null
+++ b/libfrog/flagmap.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include "platform_defs.h"
+#include "libfrog/flagmap.h"
+
+/*
+ * Given a mapping of bits to strings and a bitmask, format the bitmask as a
+ * list of strings and a hexadecimal number representing bits not mapped to
+ * any string.  The output will be truncated if buf is not large enough.
+ */
+void
+mask_to_string(
+	const struct flag_map	*map,
+	unsigned long long	mask,
+	const char		*delimiter,
+	char			*buf,
+	size_t			bufsize)
+{
+	const char		*tag = "";
+	unsigned long long	seen = 0;
+	int			w;
+
+	for (; map->string; map++) {
+		seen |= map->flag;
+
+		if (mask & map->flag) {
+			w = snprintf(buf, bufsize, "%s%s", tag, _(map->string));
+			if (w < 0 || (size_t)w >= bufsize)
+				return;
+
+			buf += w;
+			bufsize -= w;
+
+			tag = delimiter;
+		}
+	}
+
+	if (mask & ~seen)
+		snprintf(buf, bufsize, "%s0x%llx", tag, mask & ~seen);
+}
+
+/*
+ * Given a mapping of values to strings and a value, return the matching string
+ * or confusion.
+ */
+const char *
+value_to_string(
+	const struct flag_map	*map,
+	unsigned long long	value)
+{
+	for (; map->string; map++) {
+		if (value == map->flag)
+			return _(map->string);
+	}
+
+	return _("unknown value");
+}
diff --git a/libfrog/healthevent.c b/libfrog/healthevent.c
new file mode 100644
index 00000000000000..8520cb3218fb03
--- /dev/null
+++ b/libfrog/healthevent.c
@@ -0,0 +1,360 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include "platform_defs.h"
+#include "libfrog/healthevent.h"
+#include "libfrog/flagmap.h"
+
+/*
+ * The healthmon log string format is as follows:
+ *
+ * WHICH OBJECT: STATUS
+ *
+ * /mnt: 32 events lost
+ * /mnt agno 0x5 bnobt, rmapbt: sick
+ * /mnt rgno 0x5 bitmap: sick
+ * /mnt ino 13 gen 0x3 datafork: sick
+ * /mnt/a datafork: sick
+ * /mnt ino 13 gen 0x3 pos 4096 len 4096: directio_write failed
+ * /mnt/a pos 4096 len 4096: directio_read failed
+ * /mnt datadev daddr 0x13 bbcount 0x5: media error
+ * /mnt: filesystem shut down due to shenanigans, badness
+ */
+
+static const struct flag_map device_domains[] = {
+	{ XFS_HEALTH_MONITOR_DOMAIN_DATADEV,	N_("datadev") },
+	{ XFS_HEALTH_MONITOR_DOMAIN_RTDEV,	N_("rtdev") },
+	{ XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,	N_("logdev") },
+	{0, NULL},
+};
+
+static inline const char *
+device_domain_string(
+	uint32_t		domain)
+{
+	return value_to_string(device_domains, domain);
+}
+
+static const struct flag_map fileio_types[] = {
+	{ XFS_HEALTH_MONITOR_TYPE_BUFREAD,	N_("buffered_read") },
+	{ XFS_HEALTH_MONITOR_TYPE_BUFWRITE,	N_("buffered_write") },
+	{ XFS_HEALTH_MONITOR_TYPE_DIOREAD,	N_("directio_read") },
+	{ XFS_HEALTH_MONITOR_TYPE_DIOWRITE,	N_("directio_write") },
+	{ XFS_HEALTH_MONITOR_TYPE_DATALOST,	N_("media") },
+	{0, NULL},
+};
+
+static inline const char *
+fileio_type_string(
+	uint32_t		type)
+{
+	return value_to_string(fileio_types, type);
+}
+
+static const struct flag_map health_types[] = {
+	{ XFS_HEALTH_MONITOR_TYPE_SICK,		N_("sick") },
+	{ XFS_HEALTH_MONITOR_TYPE_CORRUPT,	N_("corrupt") },
+	{ XFS_HEALTH_MONITOR_TYPE_HEALTHY,	N_("healthy") },
+	{0, NULL},
+};
+
+static inline const char *
+health_type_string(
+	uint32_t		type)
+{
+	return value_to_string(health_types, type);
+}
+
+/* Report that the kernel lost events. */
+static void
+report_lost(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	printf("%s: %llu %s\n", pfx->mountpoint,
+			(unsigned long long)hme->e.lost.count,
+			_("events lost"));
+	fflush(stdout);
+}
+
+/* Report that the monitor is running. */
+static void
+report_running(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	printf("%s: %s\n", pfx->mountpoint, _("monitoring started"));
+	fflush(stdout);
+}
+
+/* Report that the filesystem was unmounted. */
+static void
+report_unmounted(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	printf("%s: %s\n", pfx->mountpoint, _("filesystem unmounted"));
+	fflush(stdout);
+}
+
+static const struct flag_map shutdown_reasons[] = {
+	{ XFS_HEALTH_SHUTDOWN_META_IO_ERROR,	N_("metadata I/O error") },
+	{ XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR,	N_("log I/O error") },
+	{ XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT,	N_("forced unmount") },
+	{ XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE,	N_("in-memory state corruption") },
+	{ XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK,	N_("ondisk metadata corruption") },
+	{ XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED,	N_("device removed") },
+	{0, NULL},
+};
+
+/* Report an abortive shutdown of the filesystem. */
+static void
+report_shutdown(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	char					buf[512];
+
+	mask_to_string(shutdown_reasons, hme->e.shutdown.reasons, ", ", buf,
+			sizeof(buf));
+
+	printf("%s: %s %s\n", pfx->mountpoint,
+			_("filesystem shut down due to"), buf);
+	fflush(stdout);
+}
+
+static const struct flag_map inode_structs[] = {
+	{ XFS_BS_SICK_INODE,	N_("core") },
+	{ XFS_BS_SICK_BMBTD,	N_("datafork") },
+	{ XFS_BS_SICK_BMBTA,	N_("attrfork") },
+	{ XFS_BS_SICK_BMBTC,	N_("cowfork") },
+	{ XFS_BS_SICK_DIR,	N_("directory") },
+	{ XFS_BS_SICK_XATTR,	N_("xattr") },
+	{ XFS_BS_SICK_SYMLINK,	N_("symlink") },
+	{ XFS_BS_SICK_PARENT,	N_("parent") },
+	{ XFS_BS_SICK_DIRTREE,	N_("dirtree") },
+	{0, NULL},
+};
+
+/* Report inode metadata corruption */
+static void
+report_inode(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	char					buf[512];
+
+	mask_to_string(inode_structs, hme->e.inode.mask, ", ", buf,
+			sizeof(buf));
+
+	if (hme_prefix_has_path(pfx))
+		printf("%s %s: %s\n",
+				pfx->path,
+				buf,
+				health_type_string(hme->type));
+	else
+		printf("%s %s %llu %s 0x%x %s: %s\n",
+				pfx->mountpoint,
+				_("ino"),
+				(unsigned long long)hme->e.inode.ino,
+				_("gen"),
+				hme->e.inode.gen,
+				buf,
+				health_type_string(hme->type));
+	fflush(stdout);
+}
+
+static const struct flag_map ag_structs[] = {
+	{ XFS_AG_GEOM_SICK_SB,		N_("super") },
+	{ XFS_AG_GEOM_SICK_AGF,		N_("agf") },
+	{ XFS_AG_GEOM_SICK_AGFL,	N_("agfl") },
+	{ XFS_AG_GEOM_SICK_AGI,		N_("agi") },
+	{ XFS_AG_GEOM_SICK_BNOBT,	N_("bnobt") },
+	{ XFS_AG_GEOM_SICK_CNTBT,	N_("cntbt") },
+	{ XFS_AG_GEOM_SICK_INOBT,	N_("inobt") },
+	{ XFS_AG_GEOM_SICK_FINOBT,	N_("finobt") },
+	{ XFS_AG_GEOM_SICK_RMAPBT,	N_("rmapbt") },
+	{ XFS_AG_GEOM_SICK_REFCNTBT,	N_("refcountbt") },
+	{ XFS_AG_GEOM_SICK_INODES,	N_("inodes") },
+	{0, NULL},
+};
+
+/* Report AG metadata corruption */
+static void
+report_ag(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	char					buf[512];
+
+	mask_to_string(ag_structs, hme->e.group.mask, ", ", buf,
+			sizeof(buf));
+
+	printf("%s %s 0x%x %s: %s\n",
+			pfx->mountpoint,
+			_("agno"),
+			hme->e.group.gno,
+			buf,
+			health_type_string(hme->type));
+	fflush(stdout);
+}
+
+static const struct flag_map rtgroup_structs[] = {
+	{ XFS_RTGROUP_GEOM_SICK_SUPER,		N_("super") },
+	{ XFS_RTGROUP_GEOM_SICK_BITMAP,		N_("bitmap") },
+	{ XFS_RTGROUP_GEOM_SICK_SUMMARY,	N_("summary") },
+	{ XFS_RTGROUP_GEOM_SICK_RMAPBT,		N_("rmapbt") },
+	{ XFS_RTGROUP_GEOM_SICK_REFCNTBT,	N_("refcountbt") },
+	{0, NULL},
+};
+
+/* Report rtgroup metadata corruption */
+static void
+report_rtgroup(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	char					buf[512];
+
+	mask_to_string(rtgroup_structs, hme->e.group.mask, ", ", buf,
+			sizeof(buf));
+
+	printf("%s %s 0x%x %s: %s\n",
+			pfx->mountpoint,
+			_("rgno"),
+			hme->e.group.gno,
+			buf, health_type_string(hme->type));
+	fflush(stdout);
+}
+
+static const struct flag_map fs_structs[] = {
+	{ XFS_FSOP_GEOM_SICK_COUNTERS,		N_("fscounters") },
+	{ XFS_FSOP_GEOM_SICK_UQUOTA,		N_("usrquota") },
+	{ XFS_FSOP_GEOM_SICK_GQUOTA,		N_("grpquota") },
+	{ XFS_FSOP_GEOM_SICK_PQUOTA,		N_("prjquota") },
+	{ XFS_FSOP_GEOM_SICK_RT_BITMAP,		N_("bitmap") },
+	{ XFS_FSOP_GEOM_SICK_RT_SUMMARY,	N_("summary") },
+	{ XFS_FSOP_GEOM_SICK_QUOTACHECK,	N_("quotacheck") },
+	{ XFS_FSOP_GEOM_SICK_NLINKS,		N_("nlinks") },
+	{ XFS_FSOP_GEOM_SICK_METADIR,		N_("metadir") },
+	{ XFS_FSOP_GEOM_SICK_METAPATH,		N_("metapath") },
+	{0, NULL},
+};
+
+/* Report fs-wide metadata corruption */
+static void
+report_fs(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	char					buf[512];
+
+	mask_to_string(fs_structs, hme->e.fs.mask, ", ", buf, sizeof(buf));
+
+	printf("%s %s: %s\n",
+			pfx->mountpoint,
+			buf,
+			health_type_string(hme->type));
+	fflush(stdout);
+}
+
+/* Report device media corruption */
+static void
+report_device_error(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	printf("%s %s %s 0x%llx %s 0x%llx: %s\n", pfx->mountpoint,
+			device_domain_string(hme->domain),
+			_("daddr"),
+			(unsigned long long)hme->e.media.daddr,
+			_("bbcount"),
+			(unsigned long long)hme->e.media.bbcount,
+			_("media error"));
+	fflush(stdout);
+}
+
+/* Report file range errors */
+static void
+report_file_range(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	if (hme_prefix_has_path(pfx))
+		printf("%s ", pfx->path);
+	else
+		printf("%s %s %llu %s 0x%x ",
+				pfx->mountpoint,
+				_("ino"),
+				(unsigned long long)hme->e.filerange.ino,
+				_("gen"),
+				hme->e.filerange.gen);
+	if (hme->type != XFS_HEALTH_MONITOR_TYPE_DATALOST &&
+	    hme->e.filerange.error)
+		printf("%s %llu %s %llu: %s: %s\n",
+				_("pos"),
+				(unsigned long long)hme->e.filerange.pos,
+				_("len"),
+				(unsigned long long)hme->e.filerange.len,
+				fileio_type_string(hme->type),
+				strerror(hme->e.filerange.error));
+	else
+		printf("%s %llu %s %llu: %s %s\n",
+				_("pos"),
+				(unsigned long long)hme->e.filerange.pos,
+				_("len"),
+				(unsigned long long)hme->e.filerange.len,
+				fileio_type_string(hme->type),
+				_("failed"));
+	fflush(stdout);
+}
+
+/* Log a health monitoring event to stdout. */
+void
+hme_report_event(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	switch (hme->domain) {
+	case XFS_HEALTH_MONITOR_DOMAIN_MOUNT:
+		switch (hme->type) {
+		case XFS_HEALTH_MONITOR_TYPE_LOST:
+			report_lost(pfx, hme);
+			return;
+		case XFS_HEALTH_MONITOR_TYPE_RUNNING:
+			report_running(pfx, hme);
+			return;
+		case XFS_HEALTH_MONITOR_TYPE_UNMOUNT:
+			report_unmounted(pfx, hme);
+			return;
+		case XFS_HEALTH_MONITOR_TYPE_SHUTDOWN:
+			report_shutdown(pfx, hme);
+			return;
+		}
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_INODE:
+		report_inode(pfx, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_AG:
+		report_ag(pfx, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_RTGROUP:
+		report_rtgroup(pfx, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_FS:
+		report_fs(pfx, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_DATADEV:
+	case XFS_HEALTH_MONITOR_DOMAIN_RTDEV:
+	case XFS_HEALTH_MONITOR_DOMAIN_LOGDEV:
+		report_device_error(pfx, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_FILERANGE:
+		report_file_range(pfx, hme);
+		break;
+	}
+}
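
The report functions in the patch above all funnel a bitmask through mask_to_string() and a NULL-terminated flag_map table before printing. As a rough illustration of that pattern only, here is a minimal sketch; demo_flag_map, demo_mask_to_string and demo_reasons are hypothetical stand-ins, the flag values are made up, and the gettext wrappers (N_() and _()) are left out. It is not the helper from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the patch's struct flag_map. */
struct demo_flag_map {
	uint64_t	flag;
	const char	*string;
};

/*
 * Append the name of every flag set in @mask to @buf, separated by @delim.
 * Output that does not fit is silently truncated; @buf is always
 * NUL-terminated.
 */
static void
demo_mask_to_string(
	const struct demo_flag_map	*map,
	uint64_t			mask,
	const char			*delim,
	char				*buf,
	size_t				bufsize)
{
	const char			*sep = "";
	size_t				used = 0;

	assert(bufsize > 0);
	buf[0] = 0;

	/* The table ends with an entry whose string is NULL. */
	for (; map->string != NULL; map++) {
		int			n;

		if (!(mask & map->flag))
			continue;

		n = snprintf(buf + used, bufsize - used, "%s%s", sep,
				map->string);
		if (n < 0 || (size_t)n >= bufsize - used)
			break;	/* out of space; keep what fits */
		used += n;
		sep = delim;
	}
}

/* Made-up flag values, for illustration only. */
static const struct demo_flag_map demo_reasons[] = {
	{ 0x1,	"metadata I/O error" },
	{ 0x2,	"log I/O error" },
	{ 0x4,	"forced unmount" },
	{ 0, NULL },
};
```

Called with a mask that has several bits set, the sketch joins the matching names with the delimiter, which is the shape of the "filesystem shut down due to ..." message printed by report_shutdown().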



* [PATCH 03/26] libfrog: add support code for starting systemd services programmatically
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 01/26] libfrog: add a function to grab the path from an open fd and a file handle Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 02/26] libfrog: create healthmon event log library functions Darrick J. Wong
@ 2026-03-19  4:39   ` Darrick J. Wong
  2026-03-19  4:39   ` [PATCH 04/26] libfrog: hoist a couple of service helper functions Darrick J. Wong
                     ` (22 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:39 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some simple routines for computing the name of systemd service
instances and starting systemd services.  These will be used by the
xfs_healer_start service to start per-filesystem xfs_healer service
instances.

Note that we run systemd helper programs as subprocesses for a couple of
reasons.  First, the path-escaping functionality is not a part of any
library-accessible API, which means it can only be accessed via
systemd-escape(1).  Second, although the service startup functionality
can be reached via dbus, doing so would introduce a new library
dependency.  Systemd is also undergoing a dbus -> varlink RPC transition
so we avoid that mess by calling the cli systemctl(1) program.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 libfrog/systemd.h     |   20 ++++++
 configure.ac          |    1 
 include/builddefs.in  |    1 
 libfrog/Makefile      |    6 ++
 libfrog/systemd.c     |  177 +++++++++++++++++++++++++++++++++++++++++++++++++
 m4/package_libcdev.m4 |   19 +++++
 6 files changed, 224 insertions(+)
 create mode 100644 libfrog/systemd.h
 create mode 100644 libfrog/systemd.c


diff --git a/libfrog/systemd.h b/libfrog/systemd.h
new file mode 100644
index 00000000000000..4f414bc3c1e9c3
--- /dev/null
+++ b/libfrog/systemd.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2026 Oracle.  All rights reserved.
+ * All Rights Reserved.
+ */
+#ifndef __LIBFROG_SYSTEMD_H__
+#define __LIBFROG_SYSTEMD_H__
+
+int systemd_path_instance_unit_name(const char *unit_template,
+		const char *path, char *unitname, size_t unitnamelen);
+
+enum systemd_unit_manage {
+	UM_STOP,
+	UM_START,
+	UM_RESTART,
+};
+
+int systemd_manage_unit(enum systemd_unit_manage how, const char *unitname);
+
+#endif /* __LIBFROG_SYSTEMD_H__ */
diff --git a/configure.ac b/configure.ac
index 8092b8656ef94b..a9febabc71cfc7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -182,6 +182,7 @@ AC_CONFIG_UDEV_RULE_DIR
 AC_HAVE_BLKID_TOPO
 AC_HAVE_TRIVIAL_AUTO_VAR_INIT
 AC_STRERROR_R_RETURNS_STRING
+AC_HAVE_CLOSE_RANGE
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
         AC_PACKAGE_CHECK_UBSAN
diff --git a/include/builddefs.in b/include/builddefs.in
index b38a099b7d525a..4a2cb757c0bdb3 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -118,6 +118,7 @@ HAVE_UDEV = @have_udev@
 UDEV_RULE_DIR = @udev_rule_dir@
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
 STRERROR_R_RETURNS_STRING = @strerror_r_returns_string@
+HAVE_CLOSE_RANGE = @have_close_range@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/libfrog/Makefile b/libfrog/Makefile
index bccd9289e5dd79..89a0332ae85372 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -36,6 +36,7 @@ ptvar.c \
 radix-tree.c \
 randbytes.c \
 scrub.c \
+systemd.c \
 util.c \
 workqueue.c \
 zones.c
@@ -70,6 +71,7 @@ radix-tree.h \
 randbytes.h \
 scrub.h \
 statx.h \
+systemd.h \
 workqueue.h \
 zones.h
 
@@ -90,6 +92,10 @@ ifeq ($(HAVE_GETRANDOM_NONBLOCK),yes)
 LCFLAGS += -DHAVE_GETRANDOM_NONBLOCK
 endif
 
+ifeq ($(HAVE_CLOSE_RANGE),yes)
+CFLAGS += -DHAVE_CLOSE_RANGE
+endif
+
 default: ltdepend $(LTLIBRARY) $(GETTEXT_PY)
 
 crc32table.h: gen_crc32table.c crc32defs.h
diff --git a/libfrog/systemd.c b/libfrog/systemd.c
new file mode 100644
index 00000000000000..2d2d2e9be72e6a
--- /dev/null
+++ b/libfrog/systemd.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/wait.h>
+
+#include "libfrog/systemd.h"
+
+/* Close all fds except for the three standard ones. */
+static void
+close_fds(void)
+{
+	int	max_fd = sysconf(_SC_OPEN_MAX);
+	int	fd;
+
+	if (max_fd < 1)
+		max_fd = 1024;
+
+#ifdef HAVE_CLOSE_RANGE
+	if (close_range(STDERR_FILENO + 1, max_fd, 0) == 0)
+		return;
+#endif
+
+	for (fd = STDERR_FILENO + 1; fd < max_fd; fd++)
+		close(fd);
+}
+
+/*
+ * Compute the systemd instance unit name for a given path.
+ *
+ * The escaping logic is implemented directly in systemd-escape so there's no
+ * library or dbus service that we can call.
+ */
+int
+systemd_path_instance_unit_name(
+	const char		*unit_template,
+	const char		*path,
+	char			*unitname,
+	size_t			unitnamelen)
+{
+	size_t			i;
+	ssize_t			bytes;
+	pid_t			child_pid;
+	int			pipe_fds[2];
+	int			child_status;
+	int			ret;
+
+	ret = pipe(pipe_fds);
+	if (ret)
+		return -1;
+
+	child_pid = fork();
+	if (child_pid < 0)
+		return -1;
+
+	if (!child_pid) {
+		/* child process */
+		char		*argv[] = {
+			"systemd-escape",
+			"--template",
+			(char *)unit_template,
+			"--path",
+			(char *)path,
+			NULL,
+		};
+
+		ret = dup2(pipe_fds[1], STDOUT_FILENO);
+		if (ret < 0) {
+			perror(path);
+			goto fail;
+		}
+
+		close_fds();
+
+		ret = execvp("systemd-escape", argv);
+		if (ret)
+			perror(path);
+
+fail:
+		exit(EXIT_FAILURE);
+	}
+
+	/*
+	 * Close our copy of the pipe's write end so that the read won't hang
+	 * if the child exits without writing anything to stdout.
+	 */
+	close(pipe_fds[1]);
+	bytes = read(pipe_fds[0], unitname, unitnamelen - 1);
+	close(pipe_fds[0]);
+
+	waitpid(child_pid, &child_status, 0);
+	if (!WIFEXITED(child_status) || WEXITSTATUS(child_status) != 0) {
+		errno = 0;
+		return -1;
+	}
+
+	/* Terminate string at first newline or end of the data read. */
+	if (bytes < 0)
+		bytes = 0;
+
+	for (i = 0; i < bytes; i++) {
+		if (unitname[i] == '\n')
+			break;
+	}
+	unitname[i] = 0;
+
+	return 0;
+}
+
+static const char *systemd_unit_manage_string(enum systemd_unit_manage how)
+{
+	switch (how) {
+	case UM_STOP:
+		return "stop";
+	case UM_START:
+		return "start";
+	case UM_RESTART:
+		return "restart";
+	}
+
+	/* shut up gcc */
+	return NULL;
+}
+
+/*
+ * Start/stop/restart a systemd unit and let it run in the background.
+ *
+ * systemctl start wraps a lot of logic around starting a unit, so it's less
+ * work for xfsprogs to invoke systemctl instead of calling through dbus.
+ */
+int
+systemd_manage_unit(
+	enum systemd_unit_manage	how,
+	const char			*unitname)
+{
+	pid_t				child_pid;
+	int				child_status;
+	int				ret;
+
+	child_pid = fork();
+	if (child_pid < 0)
+		return -1;
+
+	if (!child_pid) {
+		/* child starts the process */
+		char		*argv[] = {
+			"systemctl",
+			(char *)systemd_unit_manage_string(how),
+			"--no-block",
+			(char *)unitname,
+			NULL,
+		};
+
+		close_fds();
+
+		ret = execvp("systemctl", argv);
+		if (ret)
+			perror("systemctl");
+
+		exit(EXIT_FAILURE);
+	}
+
+	/* parent waits for process */
+	waitpid(child_pid, &child_status, 0);
+
+	/* systemctl (stop/start/restart) --no-block should return quickly */
+	if (WIFEXITED(child_status) && WEXITSTATUS(child_status) == 0)
+		return 0;
+
+	errno = ENOMEM;
+	return -1;
+}
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index c5538c30d2518a..b3d87229d3367a 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -347,3 +347,22 @@ puts(strerror_r(0, buf, sizeof(buf)));
     CFLAGS="$OLD_CFLAGS"
     AC_SUBST(strerror_r_returns_string)
   ])
+
+#
+# Check if close_range exists
+#
+AC_DEFUN([AC_HAVE_CLOSE_RANGE],
+  [AC_MSG_CHECKING([for close_range])
+    AC_LINK_IFELSE(
+    [AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <linux/close_range.h>
+  ]], [[
+close_range(0, 0, 0);
+  ]])
+    ], have_close_range=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_close_range)
+  ])
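
For readers who want to try the subprocess pattern from the patch above without systemd installed, the following is a minimal sketch of the same fork/pipe/exec idea. It is not the patch's code: demo_capture_first_line is a hypothetical name, it runs whatever argv it is given instead of systemd-escape, it closes the pipe if fork() fails, and it terminates the buffer at the number of bytes actually read.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with its stdout connected to a pipe and copy the first line of
 * output into @out.  Returns 0 on success or -1 on failure.
 */
static int
demo_capture_first_line(
	char *const		argv[],
	char			*out,
	size_t			outlen)
{
	size_t			used = 0;
	ssize_t			bytes;
	pid_t			child_pid;
	int			pipe_fds[2];
	int			child_status;
	char			*nl;

	assert(outlen > 0);

	if (pipe(pipe_fds))
		return -1;

	child_pid = fork();
	if (child_pid < 0) {
		/* Don't leak the pipe if we cannot fork. */
		close(pipe_fds[0]);
		close(pipe_fds[1]);
		return -1;
	}

	if (!child_pid) {
		/* child: stdout becomes the write end of the pipe */
		if (dup2(pipe_fds[1], STDOUT_FILENO) >= 0) {
			close(pipe_fds[0]);
			close(pipe_fds[1]);
			execvp(argv[0], argv);
		}
		_exit(127);
	}

	/*
	 * parent: drop our copy of the write end so that read() sees EOF
	 * once the child exits.
	 */
	close(pipe_fds[1]);
	while (used < outlen - 1) {
		bytes = read(pipe_fds[0], out + used, outlen - 1 - used);
		if (bytes < 0 && errno == EINTR)
			continue;
		if (bytes <= 0)
			break;
		used += bytes;
	}
	out[used] = 0;
	close(pipe_fds[0]);

	if (waitpid(child_pid, &child_status, 0) < 0)
		return -1;
	if (!WIFEXITED(child_status) || WEXITSTATUS(child_status) != 0)
		return -1;

	/* Keep only the first line. */
	nl = strchr(out, '\n');
	if (nl)
		*nl = 0;
	return 0;
}
```

Passing an argv of { "systemd-escape", "--template", template, "--path", path, NULL } would approximate what systemd_path_instance_unit_name() does, assuming systemd-escape is in PATH. Note that a child that writes more than the buffer holds may be killed by SIGPIPE when the read end is closed, in which case this sketch reports failure.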



* [PATCH 04/26] libfrog: hoist a couple of service helper functions
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-03-19  4:39   ` [PATCH 03/26] libfrog: add support code for starting systemd services programmatically Darrick J. Wong
@ 2026-03-19  4:39   ` Darrick J. Wong
  2026-03-19  4:40   ` [PATCH 05/26] libfrog: add wrappers for listmount and statmount Darrick J. Wong
                     ` (21 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:39 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Hoist a couple of service/daemon-related helper functions to libfrog so
that we can share the code between xfs_scrub and xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 libfrog/systemd.h |   28 ++++++++++++++++++++++++++++
 scrub/xfs_scrub.c |   32 +++++++++-----------------------
 2 files changed, 37 insertions(+), 23 deletions(-)


diff --git a/libfrog/systemd.h b/libfrog/systemd.h
index 4f414bc3c1e9c3..c96df4afa39aa6 100644
--- a/libfrog/systemd.h
+++ b/libfrog/systemd.h
@@ -17,4 +17,32 @@ enum systemd_unit_manage {
 
 int systemd_manage_unit(enum systemd_unit_manage how, const char *unitname);
 
+static inline bool systemd_is_service(void)
+{
+	return getenv("SERVICE_MODE") != NULL;
+}
+
+/* Special processing for a service/daemon program that is exiting. */
+static inline int
+systemd_service_exit(int ret)
+{
+	/*
+	 * We have to sleep 2 seconds here because journald uses the pid to
+	 * connect our log messages to the systemd service.  This is critical
+	 * for capturing all the log messages if the service fails, because
+	 * failure analysis tools use the service name to gather log messages
+	 * for reporting.
+	 */
+	sleep(2);
+
+	/*
+	 * If we're being run as a service, the return code must fit the LSB
+	 * init script action error guidelines, which is to say that we
+	 * compress all errors to 1 ("generic or unspecified error", LSB 5.0
+	 * section 22.2) and hope the admin will scan the log for what actually
+	 * happened.
+	 */
+	return ret != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+}
+
 #endif /* __LIBFROG_SYSTEMD_H__ */
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 3dba972a7e8d2a..79937aa8cce4c4 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -19,6 +19,7 @@
 #include "unicrash.h"
 #include "progress.h"
 #include "libfrog/histogram.h"
+#include "libfrog/systemd.h"
 
 /*
  * XFS Online Metadata Scrub (and Repair)
@@ -866,8 +867,7 @@ main(
 	if (stdout_isatty && !progress_fp)
 		progress_fp = fdopen(1, "w+");
 
-	if (getenv("SERVICE_MODE"))
-		is_service = true;
+	is_service = systemd_is_service();
 
 	/* Initialize overall phase stats. */
 	error = phase_start(&all_pi, 0, NULL);
@@ -960,29 +960,15 @@ main(
 	hist_free(&ctx.datadev_hist);
 	hist_free(&ctx.rtdev_hist);
 
-	/*
-	 * If we're being run as a service, the return code must fit the LSB
-	 * init script action error guidelines, which is to say that we
-	 * compress all errors to 1 ("generic or unspecified error", LSB 5.0
-	 * section 22.2) and hope the admin will scan the log for what
-	 * actually happened.
-	 *
-	 * We have to sleep 2 seconds here because journald uses the pid to
-	 * connect our log messages to the systemd service.  This is critical
-	 * for capturing all the log messages if the scrub fails, because the
-	 * fail service uses the service name to gather log messages for the
-	 * error report.
-	 *
-	 * Note: We don't count a lack of kernel support as a service failure
-	 * because we haven't determined that there's anything wrong with the
-	 * filesystem.
-	 */
 	if (is_service) {
-		sleep(2);
+		/*
+		 * Note: We don't count a lack of kernel support as a service
+		 * failure because we haven't determined that there's anything
+		 * wrong with the filesystem.
+		 */
 		if (!ctx.scrub_setup_succeeded)
-			return 0;
-		if (ret != SCRUB_RET_SUCCESS)
-			return 1;
+			ret = 0;
+		return systemd_service_exit(ret);
 	}
 
 	return ret;
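
To make the exit-status rules in the patch above easier to follow, here is a small sketch that restates them as pure functions. All demo_* names are hypothetical, and the 2 second sleep from systemd_service_exit() is deliberately omitted so the sketch returns immediately; treat it as a reading aid rather than as the patch's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical mirror of systemd_is_service(): SERVICE_MODE set => service. */
static bool
demo_is_service(void)
{
	return getenv("SERVICE_MODE") != NULL;
}

/*
 * Hypothetical mirror of the exit-code half of systemd_service_exit(): any
 * nonzero status collapses to EXIT_FAILURE.
 */
static int
demo_service_exit_code(int ret)
{
	return ret != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

/*
 * Decide the process exit status the way xfs_scrub's main() does after this
 * patch: outside service mode the raw status is returned; in service mode a
 * failed setup (e.g. no kernel support) is not counted as a service failure.
 */
static int
demo_final_status(bool is_service, bool setup_succeeded, int ret)
{
	if (!is_service)
		return ret;
	if (!setup_succeeded)
		ret = 0;
	return demo_service_exit_code(ret);
}
```

A caller would typically pass demo_is_service() as the first argument; it is kept as a parameter here so that the three cases can be exercised without touching the environment.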



* [PATCH 05/26] libfrog: add wrappers for listmount and statmount
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (3 preceding siblings ...)
  2026-03-19  4:39   ` [PATCH 04/26] libfrog: hoist a couple of service helper functions Darrick J. Wong
@ 2026-03-19  4:40   ` Darrick J. Wong
  2026-03-19  4:40   ` [PATCH 06/26] man2: document the healthmon ioctl Darrick J. Wong
                     ` (20 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:40 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some wrappers for listmount and statmount so that we don't have to
open-code the kernel ABI quirks in every utility program that uses them.
Note that glibc seems to have discussed providing a wrapper in late 2023
but took no action; and the listmount manpage says that there is no
glibc wrapper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux.h       |    8 +++-
 libfrog/statmount.h   |  104 +++++++++++++++++++++++++++++++++++++++++++++++++
 configure.ac          |    5 ++
 include/builddefs.in  |    7 +++
 libfrog/Makefile      |    9 ++++
 libfrog/statmount.c   |   76 ++++++++++++++++++++++++++++++++++++
 m4/package_libcdev.m4 |   86 +++++++++++++++++++++++++++++++++++++++++
 7 files changed, 294 insertions(+), 1 deletion(-)
 create mode 100644 libfrog/statmount.h
 create mode 100644 libfrog/statmount.c
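
The statmount.h header added below marks several struct statmount fields as [str]. In the kernel ABI those fields are byte offsets into the trailing str[] array, and a field is only meaningful if the matching bit is set in mask. The sketch below shows that lookup on a cut-down, hypothetical structure (demo_statmount and the DEMO_* names are assumptions made for illustration) so that it builds without recent kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct statmount. */
struct demo_statmount {
	uint64_t	mask;		/* which results were written */
	uint32_t	fs_type;	/* [str] offset of fs type in str[] */
	uint32_t	mnt_point;	/* [str] offset of mountpoint in str[] */
	char		str[];		/* variable size part with strings */
};

/* Illustrative request/result bits for the two string fields. */
#define DEMO_STATMOUNT_MNT_POINT	0x00000010U
#define DEMO_STATMOUNT_FS_TYPE		0x00000020U

/* Same idea as libfrog_statmount_sizeof(): header plus string space. */
static size_t
demo_statmount_sizeof(size_t strings_bytes)
{
	return sizeof(struct demo_statmount) + strings_bytes;
}

/* Return the string at @offset if @flag was reported, else NULL. */
static const char *
demo_statmount_str(
	const struct demo_statmount	*sm,
	uint64_t			flag,
	uint32_t			offset)
{
	if (!(sm->mask & flag))
		return NULL;
	return sm->str + offset;
}
```

With the real structure, a caller would allocate libfrog_statmount_sizeof(n) bytes, call libfrog_statmount(), and then check mask before following any [str] offset.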


diff --git a/include/linux.h b/include/linux.h
index 3ea9016272e688..8972c9596c75f5 100644
--- a/include/linux.h
+++ b/include/linux.h
@@ -32,7 +32,13 @@
 #ifdef OVERRIDE_SYSTEM_FSXATTR
 # define fsxattr sys_fsxattr
 #endif
-#include <linux/fs.h> /* fsxattr defintion for new kernels */
+#ifdef OVERRIDE_SYSTEM_STATMOUNT
+# define statmount sys_statmount
+#endif
+#include <linux/fs.h> /* fsxattr/statmount definition for new kernels */
+#ifdef OVERRIDE_SYSTEM_STATMOUNT
+# undef statmount
+#endif
 #ifdef OVERRIDE_SYSTEM_FSXATTR
 # undef fsxattr
 #endif
diff --git a/libfrog/statmount.h b/libfrog/statmount.h
new file mode 100644
index 00000000000000..7e281ce93029ff
--- /dev/null
+++ b/libfrog/statmount.h
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2026 Oracle.  All rights reserved.
+ * All Rights Reserved.
+ */
+#ifndef __LIBFROG_STATMOUNT_H__
+#define __LIBFROG_STATMOUNT_H__
+
+/* This is the path to the current process' mount namespace file */
+#define DEFAULT_MOUNTNS_FILE	"/proc/self/ns/mnt"
+
+/*
+ * Believe it or not, listmount and statmount treat a zero value for mnt_ns_fd
+ * as if that means "use the current process' mount namespace" even though
+ * Linus Torvalds roared about that with the BPF people.
+ */
+#define DEFAULT_MOUNTNS_FD	(0)
+
+#ifdef OVERRIDE_SYSTEM_STATMOUNT
+struct statmount {
+	__u32 size;		/* Total size, including strings */
+	__u32 mnt_opts;		/* [str] Options (comma separated, escaped) */
+	__u64 mask;		/* What results were written */
+	__u32 sb_dev_major;	/* Device ID */
+	__u32 sb_dev_minor;
+	__u64 sb_magic;		/* ..._SUPER_MAGIC */
+	__u32 sb_flags;		/* SB_{RDONLY,SYNCHRONOUS,DIRSYNC,LAZYTIME} */
+	__u32 fs_type;		/* [str] Filesystem type */
+	__u64 mnt_id;		/* Unique ID of mount */
+	__u64 mnt_parent_id;	/* Unique ID of parent (for root == mnt_id) */
+	__u32 mnt_id_old;	/* Reused IDs used in proc/.../mountinfo */
+	__u32 mnt_parent_id_old;
+	__u64 mnt_attr;		/* MOUNT_ATTR_... */
+	__u64 mnt_propagation;	/* MS_{SHARED,SLAVE,PRIVATE,UNBINDABLE} */
+	__u64 mnt_peer_group;	/* ID of shared peer group */
+	__u64 mnt_master;	/* Mount receives propagation from this ID */
+	__u64 propagate_from;	/* Propagation from in current namespace */
+	__u32 mnt_root;		/* [str] Root of mount relative to root of fs */
+	__u32 mnt_point;	/* [str] Mountpoint relative to current root */
+	__u64 mnt_ns_id;	/* ID of the mount namespace */
+	__u32 fs_subtype;	/* [str] Subtype of fs_type (if any) */
+	__u32 sb_source;	/* [str] Source string of the mount */
+	__u32 opt_num;		/* Number of fs options */
+	__u32 opt_array;	/* [str] Array of nul terminated fs options */
+	__u32 opt_sec_num;	/* Number of security options */
+	__u32 opt_sec_array;	/* [str] Array of nul terminated security options */
+	__u64 supported_mask;	/* Mask flags that this kernel supports */
+	__u64 __spare2[45];
+	char str[];		/* Variable size part containing strings */
+};
+#endif
+
+/* all the new flags added since the beginning of statmount */
+
+#ifndef STATMOUNT_MNT_NS_ID
+#define STATMOUNT_MNT_NS_ID		0x00000040U	/* Want/got mnt_ns_id */
+#endif
+
+#ifndef STATMOUNT_MNT_OPTS
+#define STATMOUNT_MNT_OPTS		0x00000080U	/* Want/got mnt_opts */
+#endif
+
+#ifndef STATMOUNT_FS_SUBTYPE
+#define STATMOUNT_FS_SUBTYPE		0x00000100U	/* Want/got fs_subtype */
+#endif
+
+#ifndef STATMOUNT_SB_SOURCE
+#define STATMOUNT_SB_SOURCE		0x00000200U	/* Want/got sb_source */
+#endif
+
+#ifndef STATMOUNT_OPT_ARRAY
+#define STATMOUNT_OPT_ARRAY		0x00000400U	/* Want/got opt_... */
+#endif
+
+#ifndef STATMOUNT_OPT_SEC_ARRAY
+#define STATMOUNT_OPT_SEC_ARRAY		0x00000800U	/* Want/got opt_sec... */
+#endif
+
+#ifndef STATMOUNT_SUPPORTED_MASK
+#define STATMOUNT_SUPPORTED_MASK	0x00001000U	/* Want/got supported mask flags */
+#endif
+
+/* flag bits for statmount */
+#ifndef STATMOUNT_BY_FD
+#define STATMOUNT_BY_FD		0x00000001U	/* want mountinfo for given fd */
+#endif
+
+#define LISTMOUNT_INIT_CURSOR		(0ULL)
+
+int libfrog_listmount(uint64_t mnt_id, int mnt_ns_fd, uint64_t *cursor,
+		uint64_t *mnt_ids, size_t nr_mnt_ids);
+
+int libfrog_statmount(uint64_t mnt_id, int mnt_ns_fd, uint64_t statmount_flags,
+		struct statmount *smbuf, size_t smbuf_size);
+
+int libfrog_fstatmount(int fd, uint64_t statmount_flags,
+		struct statmount *smbuf, size_t smbuf_size);
+
+static inline size_t libfrog_statmount_sizeof(size_t strings_bytes)
+{
+	return sizeof(struct statmount) + strings_bytes;
+}
+
+#endif /* __LIBFROG_STATMOUNT_H__ */
diff --git a/configure.ac b/configure.ac
index a9febabc71cfc7..cffcaf373cfa5e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -183,6 +183,11 @@ AC_HAVE_BLKID_TOPO
 AC_HAVE_TRIVIAL_AUTO_VAR_INIT
 AC_STRERROR_R_RETURNS_STRING
 AC_HAVE_CLOSE_RANGE
+AC_HAVE_LISTMOUNT
+if test "$have_listmount" = "yes"; then
+	AC_HAVE_LISTMOUNT_NS_FD
+	AC_HAVE_STATMOUNT_SUPPORTED_MASK
+fi
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
         AC_PACKAGE_CHECK_UBSAN
diff --git a/include/builddefs.in b/include/builddefs.in
index 4a2cb757c0bdb3..d2d25c8a0ed676 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -119,6 +119,10 @@ UDEV_RULE_DIR = @udev_rule_dir@
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
 STRERROR_R_RETURNS_STRING = @strerror_r_returns_string@
 HAVE_CLOSE_RANGE = @have_close_range@
+HAVE_LISTMOUNT = @have_listmount@
+HAVE_LISTMOUNT_NS_FD = @have_listmount_ns_fd@
+HAVE_STATMOUNT_SUPPORTED_MASK = @have_statmount_supported_mask@
+NEED_INTERNAL_STATMOUNT = @need_internal_statmount@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
@@ -141,6 +145,9 @@ endif
 ifeq ($(NEED_INTERNAL_STATX),yes)
 PCFLAGS+= -DOVERRIDE_SYSTEM_STATX
 endif
+ifeq ($(NEED_INTERNAL_STATMOUNT),yes)
+PCFLAGS+= -DOVERRIDE_SYSTEM_STATMOUNT
+endif
 ifeq ($(HAVE_GETFSMAP),yes)
 PCFLAGS+= -DHAVE_GETFSMAP
 endif
diff --git a/libfrog/Makefile b/libfrog/Makefile
index 89a0332ae85372..22668212f22b93 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -96,6 +96,15 @@ ifeq ($(HAVE_CLOSE_RANGE),yes)
 CFLAGS += -DHAVE_CLOSE_RANGE
 endif
 
+ifeq ($(HAVE_LISTMOUNT),yes)
+CFILES += statmount.c
+HFILES += statmount.h
+endif
+
+ifeq ($(HAVE_LISTMOUNT_NS_FD),yes)
+CFLAGS+=-DHAVE_LISTMOUNT_NS_FD
+endif
+
 default: ltdepend $(LTLIBRARY) $(GETTEXT_PY)
 
 crc32table.h: gen_crc32table.c crc32defs.h
diff --git a/libfrog/statmount.c b/libfrog/statmount.c
new file mode 100644
index 00000000000000..edf17d6080ea42
--- /dev/null
+++ b/libfrog/statmount.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include <libfrog/statmount.h>
+
+int
+libfrog_listmount(
+	uint64_t		mnt_id,
+	int			mnt_ns_fd,
+	uint64_t		*cursor,
+	uint64_t		*mnt_ids,
+	size_t			nr_mnt_ids)
+{
+	struct mnt_id_req	req = {
+		.size		= sizeof(req),
+		.mnt_id		= mnt_id,
+#ifdef HAVE_LISTMOUNT_NS_FD
+		.mnt_ns_fd	= mnt_ns_fd,
+#else
+		.spare		= mnt_ns_fd,
+#endif
+		.param		= *cursor,
+	};
+	int ret = syscall(SYS_listmount, &req, mnt_ids, nr_mnt_ids, 0);
+
+	if (ret > 0)
+		*cursor = mnt_ids[ret - 1];
+
+	return ret;
+}
+
+int
+libfrog_statmount(
+	uint64_t		mnt_id,
+	int			mnt_ns_fd,
+	uint64_t		statmount_flags,
+	struct statmount	*smbuf,
+	size_t			smbuf_size)
+{
+	struct mnt_id_req	req = {
+		.size		= sizeof(req),
+		.mnt_id		= mnt_id,
+#ifdef HAVE_LISTMOUNT_NS_FD
+		.mnt_ns_fd	= mnt_ns_fd,
+#else
+		.spare		= mnt_ns_fd,
+#endif
+		.param		= statmount_flags,
+	};
+
+	return syscall(SYS_statmount, &req, smbuf, smbuf_size, 0);
+}
+
+int
+libfrog_fstatmount(
+	int			fd,
+	uint64_t		statmount_flags,
+	struct statmount	*smbuf,
+	size_t			smbuf_size)
+{
+	struct mnt_id_req	req = {
+		.size		= sizeof(req),
+#ifdef HAVE_LISTMOUNT_NS_FD
+		.mnt_ns_fd	= fd,
+#else
+		.spare		= fd,
+#endif
+		.param		= statmount_flags,
+	};
+
+	return syscall(SYS_statmount, &req, smbuf, smbuf_size, STATMOUNT_BY_FD);
+}
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index b3d87229d3367a..ec4a3ef444b705 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -366,3 +366,89 @@ close_range(0, 0, 0);
        AC_MSG_RESULT(no))
     AC_SUBST(have_close_range)
   ])
+
+#
+# Check if listmount and statmount exist.  Both syscalls were added in 6.8, so
+# we'll refuse both if either is missing.
+#
+AC_DEFUN([AC_HAVE_LISTMOUNT],
+  [AC_MSG_CHECKING([for listmount])
+    AC_LINK_IFELSE(
+    [AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <linux/mount.h>
+#include <sys/syscall.h>
+#include <alloca.h>
+  ]], [[
+	struct mnt_id_req	req = {
+		.size		= sizeof(req),
+	};
+	struct statmount	smbuf;
+
+	syscall(SYS_statmount, &req, &smbuf, 0, 0);
+	syscall(SYS_listmount, &req, NULL, 0, 0);
+  ]])
+    ], have_listmount=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_listmount)
+  ])
+
+#
+# Check if mnt_id_req::mnt_ns_fd exists.  This replaced mnt_id_req::spare in
+# 6.18, though earlier kernels allowed userspace to assign to spare.
+#
+AC_DEFUN([AC_HAVE_LISTMOUNT_NS_FD],
+  [AC_MSG_CHECKING([for struct mnt_id_req::mnt_ns_fd])
+    AC_LINK_IFELSE(
+    [AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <linux/mount.h>
+#include <sys/syscall.h>
+#include <alloca.h>
+  ]], [[
+	struct mnt_id_req	req = {
+		.mnt_ns_fd	= 555,
+	};
+
+	syscall(SYS_listmount, &req, NULL, 0, 0);
+  ]])
+    ], have_listmount_ns_fd=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_listmount_ns_fd)
+  ])
+
+#
+# Check if statmount::supported_mask (and hence sb_source) exists.  We need
+# sb_source for xfs_healer_start; and supported_mask for the xfs_io wrapper.
+# sb_source was added in 6.13 and supported_mask in 6.15.
+#
+AC_DEFUN([AC_HAVE_STATMOUNT_SUPPORTED_MASK],
+  [AC_MSG_CHECKING([for struct statmount::supported_mask])
+    AC_LINK_IFELSE(
+    [AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <unistd.h>
+#include <linux/mount.h>
+#include <sys/syscall.h>
+#include <alloca.h>
+  ]], [[
+	struct mnt_id_req	req = {
+		.mnt_ns_fd	= 555,
+	};
+	struct statmount	smbuf = {
+		.supported_mask	= 1,
+	};
+
+	syscall(SYS_statmount, &req, &smbuf, 0, 0);
+  ]])
+    ], have_statmount_supported_mask=yes
+       AC_MSG_RESULT(yes),
+       need_internal_statmount=yes
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_statmount_supported_mask)
+    AC_SUBST(need_internal_statmount)
+  ])
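
libfrog_listmount() in the patch above advances *cursor to the last mount id it returned, so a caller can keep calling it with the same cursor until it returns zero. The sketch below shows one way such a loop might look. Everything named demo_* is hypothetical, and a fake backend with the same signature stands in for the real wrapper so that the example runs on kernels without listmount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as libfrog_listmount() in the patch above. */
typedef int (*demo_listmount_fn)(uint64_t mnt_id, int mnt_ns_fd,
		uint64_t *cursor, uint64_t *mnt_ids, size_t nr_mnt_ids);

/*
 * Call @visit for every mount id that @lister reports, fetching the ids in
 * small batches.  Returns 0 when the listing is exhausted or the negative
 * value from @lister on error.
 */
static int
demo_walk_mounts(
	demo_listmount_fn	lister,
	uint64_t		root_mnt_id,
	int			mnt_ns_fd,
	void			(*visit)(uint64_t mnt_id, void *priv),
	void			*priv)
{
	uint64_t		mnt_ids[4];	/* small on purpose */
	uint64_t		cursor = 0;	/* LISTMOUNT_INIT_CURSOR */
	int			i, ret;

	while ((ret = lister(root_mnt_id, mnt_ns_fd, &cursor, mnt_ids,
					4)) > 0) {
		for (i = 0; i < ret; i++)
			visit(mnt_ids[i], priv);
	}
	return ret;
}

/* Fake backend: reports ids 10..19 in order, resuming after *cursor. */
static int
demo_fake_listmount(
	uint64_t		mnt_id,
	int			mnt_ns_fd,
	uint64_t		*cursor,
	uint64_t		*mnt_ids,
	size_t			nr_mnt_ids)
{
	uint64_t		next = *cursor < 10 ? 10 : *cursor + 1;
	int			n = 0;

	(void)mnt_id;
	(void)mnt_ns_fd;

	while ((size_t)n < nr_mnt_ids && next < 20)
		mnt_ids[n++] = next++;
	if (n > 0)
		*cursor = mnt_ids[n - 1];
	return n;
}

/* Example visitor: count the ids we were shown. */
static void
demo_count_visit(uint64_t mnt_id, void *priv)
{
	(void)mnt_id;
	(*(unsigned int *)priv)++;
}
```

Swapping demo_fake_listmount for libfrog_listmount, with DEFAULT_MOUNTNS_FD as the namespace fd, should give the real listing on a kernel that supports it; the batch size of four is only there to force several round trips in the demo.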



* [PATCH 06/26] man2: document the healthmon ioctl
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (4 preceding siblings ...)
  2026-03-19  4:40   ` [PATCH 05/26] libfrog: add wrappers for listmount and statmount Darrick J. Wong
@ 2026-03-19  4:40   ` Darrick J. Wong
  2026-03-19  4:40   ` [PATCH 07/26] man2: document the media verification ioctl Darrick J. Wong
                     ` (19 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:40 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Document the XFS_IOC_HEALTH_MONITOR and
XFS_IOC_HEALTH_FD_ON_MONITORED_FS ioctls.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 man/man2/ioctl_xfs_health_fd_on_monitored_fs.2 |   75 ++++
 man/man2/ioctl_xfs_health_monitor.2            |  464 ++++++++++++++++++++++++
 2 files changed, 539 insertions(+)
 create mode 100644 man/man2/ioctl_xfs_health_fd_on_monitored_fs.2
 create mode 100644 man/man2/ioctl_xfs_health_monitor.2
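
The ioctl_xfs_health_fd_on_monitored_fs manual page added below describes three outcomes: 0 when the file is on the monitored filesystem, ESTALE when it is not, and other errno values when the check itself failed. A caller might fold those into one verdict as in the sketch below. The enum and function names are hypothetical, and the ioctl call itself is left out because it needs xfs/xfs_fs.h from an xfsprogs that carries these patches.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical verdicts for the three documented outcomes. */
enum demo_same_fs {
	DEMO_SAME_FS,		/* ioctl returned 0 */
	DEMO_OTHER_FS,		/* ioctl failed with ESTALE */
	DEMO_CHECK_FAILED,	/* any other error, e.g. EBADF or ENOTTY */
};

/*
 * Classify the result of XFS_IOC_HEALTH_FD_ON_MONITORED_FS given the ioctl
 * return value and the errno saved right after the call.
 */
static enum demo_same_fs
demo_classify(int ioctl_ret, int saved_errno)
{
	if (ioctl_ret == 0)
		return DEMO_SAME_FS;
	if (saved_errno == ESTALE)
		return DEMO_OTHER_FS;
	return DEMO_CHECK_FAILED;
}
```

In real code the arguments would come from ioctl(healthmon_fd, XFS_IOC_HEALTH_FD_ON_MONITORED_FS, &arg) and errno, with arg.flags set to zero as the manual page requires.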


diff --git a/man/man2/ioctl_xfs_health_fd_on_monitored_fs.2 b/man/man2/ioctl_xfs_health_fd_on_monitored_fs.2
new file mode 100644
index 00000000000000..bbc5ce9bbabf53
--- /dev/null
+++ b/man/man2/ioctl_xfs_health_fd_on_monitored_fs.2
@@ -0,0 +1,75 @@
+.\" Copyright (c) 2025-2026, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" SPDX-License-Identifier: GPL-2.0+
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-HEALTH-FD-ON-MONITORED-FS 2 2026-01-04 "XFS"
+.SH NAME
+ioctl_xfs_health_fd_on_monitored_fs \- check if the given fd belongs to the same fs being monitored
+.SH SYNOPSIS
+.br
+.B #include <xfs/xfs_fs.h>
+.PP
+.BI "int ioctl(int " healthmon_fd ", XFS_IOC_HEALTH_FD_ON_MONITORED_FS, struct xfs_health_file_on_monitored_fs *" arg );
+.SH DESCRIPTION
+This XFS healthmon fd ioctl asks the kernel driver if the file descriptor
+passed in via
+.I arg
+points to a file on the same filesystem that is being monitored by
+.IR healthmon_fd .
+The file descriptor is conveyed in a structure of the following form:
+.PP
+.in +4n
+.nf
+struct xfs_health_file_on_monitored_fs {
+	__s32 fd;
+	__u32 flags;
+};
+.fi
+.in
+.PP
+The field
+.I flags
+must be zero.
+.PP
+The field
+.I fd
+is a descriptor of an open file.
+.PP
+The argument
+.I healthmon_fd
+must be a file opened via the
+.B XFS_IOC_HEALTH_MONITOR
+ioctl.
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+If the file descriptor points to a file on the same filesystem that is being
+monitored, 0 is returned.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B ESTALE
+The open file is not on the same filesystem that is being monitored.
+.TP
+.B EINVAL
+One or more of the arguments specified is invalid.
+.TP
+.B EBADF
+.I arg.fd
+does not refer to an open file.
+.TP
+.B EFAULT
+The
+.I arg
+structure could not be copied into the kernel.
+.TP
+.B ENOTTY
+.I healthmon_fd
+is not an XFS health monitoring file.
+.SH CONFORMING TO
+This API is specific to the XFS filesystem on the Linux kernel.
+.SH SEE ALSO
+.BR ioctl_xfs_health_monitor (2)
diff --git a/man/man2/ioctl_xfs_health_monitor.2 b/man/man2/ioctl_xfs_health_monitor.2
new file mode 100644
index 00000000000000..269c434515d960
--- /dev/null
+++ b/man/man2/ioctl_xfs_health_monitor.2
@@ -0,0 +1,464 @@
+.\" Copyright (c) 2025-2026, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" SPDX-License-Identifier: GPL-2.0+
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-HEALTH-MONITOR 2 2026-01-04 "XFS"
+.SH NAME
+ioctl_xfs_health_monitor \- read filesystem health events from the kernel
+.SH SYNOPSIS
+.br
+.B #include <xfs/xfs_fs.h>
+.PP
+.BI "int ioctl(int " dest_fd ", XFS_IOC_HEALTH_MONITOR, struct xfs_health_monitor *" arg );
+.SH DESCRIPTION
+This XFS ioctl asks the kernel driver to create a pseudo-file from which
+information about adverse filesystem health events can be read.
+This new file will be installed into the file descriptor table of the calling
+process as a read-only file, and will have the close-on-exec flag set.
+.PP
+The specific behaviors of this health monitor file are requested via a
+structure of the following form:
+.PP
+.in +4n
+.nf
+struct xfs_health_monitor {
+	__u64 flags;
+	__u8  format;
+	__u8  pad[23];
+};
+.fi
+.in
+.PP
+The field
+.I pad
+must be zero.
+.PP
+The field
+.I format
+controls the format of the event data that can be read:
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_FMT_V0
+Event data will be presented in discrete objects of type struct
+xfs_health_monitor_event.
+See below for more information.
+.RE
+
+.PD 1
+.PP
+The field
+.I flags
+controls the behavior of the monitor.
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_VERBOSE
+Return all health events, including affirmations of healthy metadata.
+.RE
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+Otherwise, the return value is a new file descriptor.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EEXIST
+Health monitoring is already active for this filesystem.
+.TP
+.B EPERM
+The caller does not have permission to open a health monitor.
+Calling programs must have administrative capability, run in the initial user
+namespace, and the
+.I fd
+passed to ioctl must be the root directory of an XFS filesystem.
+.TP
+.B EINVAL
+One or more of the arguments specified is invalid.
+.TP
+.B EFAULT
+The argument could not be copied into the kernel.
+.TP
+.B ENOMEM
+There was not sufficient memory to construct the health monitor.
+.SH EVENT FORMAT
+Calling programs retrieve XFS health events by calling
+.BR read (2)
+on the returned file descriptor.
+The read buffer must be large enough to hold at least one event object.
+Partial objects will not be returned; instead, a short read will occur.
+
+Events will be returned in the following format:
+
+.PP
+.in +4n
+.nf
+struct xfs_health_monitor_event {
+	__u32	domain;
+	__u32	type;
+	__u64	time_ns;
+
+	union {
+		struct xfs_health_monitor_lost lost;
+		struct xfs_health_monitor_fs fs;
+		struct xfs_health_monitor_group group;
+		struct xfs_health_monitor_inode inode;
+		struct xfs_health_monitor_shutdown shutdown;
+		struct xfs_health_monitor_media media;
+		struct xfs_health_monitor_filerange filerange;
+	} e;
+
+	__u64	pad[2];
+};
+.fi
+.in
+.PP
+The field
+.I time_ns
+records the timestamp at which the health event was generated, in units of
+nanoseconds since the Unix epoch.
+.PP
+The field
+.I pad
+will be zero.
+.PP
+The field
+.I domain
+indicates the scope of the filesystem affected by the event:
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_MOUNT
+The entire filesystem is affected.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_FS
+Metadata concerning the entire filesystem is affected.
+Details are available through the
+.I fs
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_AG
+Metadata concerning a specific allocation group is affected.
+Details are available through the
+.I group
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_RTGROUP
+Metadata concerning a specific realtime allocation group is affected.
+Details are available through the
+.I group
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_INODE
+File metadata is affected.
+Details are available through the
+.I inode
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_DATADEV
+The main data volume is affected.
+Details are available through the
+.I media
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_RTDEV
+The realtime volume is affected.
+Details are available through the
+.I media
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_LOGDEV
+The external log is affected.
+Details are available through the
+.I media
+field.
+.TP
+.B XFS_HEALTH_MONITOR_DOMAIN_FILERANGE
+File data is affected.
+Details are available through the
+.I filerange
+field.
+.RE
+
+.PP
+The field
+.I type
+indicates what was affected by a health event:
+.RS 0.4i
+.PP
+The following types apply to events from the
+.B MOUNT
+domain.
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_RUNNING
+This filesystem health monitor is now running.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_LOST
+Health events were lost.
+Details are available through the
+.I lost
+field.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_UNMOUNT
+The filesystem is being unmounted.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_SHUTDOWN
+The filesystem has shut down due to problems.
+Details are available through the
+.I shutdown
+field.
+.RE
+.PP
+The following three types apply to events from the
+.BR FS ,
+.BR AG ,
+.BR RTGROUP ,
+and
+.B INODE
+domains.
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_SICK
+Filesystem metadata has been scanned by online fsck and found to be corrupt.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_CORRUPT
+A metadata corruption problem was encountered during a filesystem operation
+outside of fsck.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_HEALTHY
+Filesystem metadata has either been scanned by online fsck and found to be
+in good condition, or it has been repaired to good condition.
+.RE
+.PP
+The following type applies to events from the
+.BR DATADEV ,
+.BR RTDEV ,
+and
+.B LOGDEV
+domains.
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR
+A media error has been observed on one of the storage devices that can be
+attached to an XFS filesystem.
+.RE
+.PP
+The following types apply to events from the
+.B FILERANGE
+domain.
+.RS 0.4i
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_BUFREAD
+An attempt to read (or readahead) from a file failed with an I/O error.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_BUFWRITE
+An attempt to write dirty data to storage failed with an I/O error.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_DIOREAD
+A direct read of file data from storage failed with an I/O error.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_DIOWRITE
+A direct write of file data to storage failed with an I/O error.
+.TP
+.B XFS_HEALTH_MONITOR_TYPE_DATALOST
+A latent media error was discovered on the storage backing part of this file.
+.RE
+.RE
+
+.PP
+The union
+.I e
+contains further details about the health event:
+
+.RS 0.4i
+.PP
+The kernel will use no more than 32KiB of memory per monitoring file to queue
+health events.
+If this limit is exceeded, an event will be generated to describe how many
+events were lost:
+
+.in +4n
+.nf
+struct xfs_health_monitor_lost {
+	__u64	count;
+};
+.fi
+.in
+.PP
+The
+.I count
+field records the number of events lost.
+
+.PP
+If whole-filesystem metadata experiences a health event, the exact type of
+that metadata is recorded as follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_fs {
+	__u32	mask;
+};
+.fi
+.in
+.PP
+The
+.I mask
+field will contain
+.I XFS_FSOP_GEOM_SICK_*
+flags that are documented in the
+.BR ioctl_xfs_fsgeometry (2)
+manual page.
+
+.PP
+If an allocation group (realtime or data) experiences a health event,
+the exact type and location of the metadata is recorded as follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_group {
+	__u32	mask;
+	__u32	gno;
+};
+.fi
+.in
+.PP
+The
+.I mask
+field will contain
+.I XFS_AG_SICK_*
+flags that are documented in the
+.BR ioctl_xfs_ag_geometry (2)
+manual page, or the
+.I XFS_RTGROUP_SICK_*
+flags that are documented by the
+.BR ioctl_xfs_rtgroup_geometry (2)
+manual page.
+.PP
+The
+.I gno
+field will contain the group number.
+
+.PP
+If a file experiences a health event, the exact type and handle to the file
+is recorded as follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_inode {
+	__u32	mask;
+	__u32	gen;
+	__u64	ino;
+};
+.fi
+.in
+.PP
+The
+.I mask
+field will contain
+.I XFS_BS_SICK_*
+flags that are documented by the
+.BR ioctl_xfs_bulkstat (2)
+manual page.
+.PP
+The
+.I ino
+and
+.I gen
+fields describe a handle to the affected file.
+
+.PP
+If the filesystem shuts down abnormally, the exact reasons are recorded as
+follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_shutdown {
+	__u32	reasons;
+};
+.fi
+.in
+.PP
+The
+.I reasons
+field is a combination of the following values:
+.RS 0.4i
+.TP
+.B XFS_HEALTH_SHUTDOWN_META_IO_ERROR
+Metadata I/O errors were encountered.
+.TP
+.B XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR
+Log I/O errors were encountered.
+.TP
+.B XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT
+The filesystem was forcibly shut down by an administrator.
+.TP
+.B XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE
+In-memory metadata are corrupt.
+.TP
+.B XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK
+On-disk metadata are corrupt.
+.TP
+.B XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED
+Storage devices were removed.
+.RE
+
+.PP
+If a media error is discovered on the storage device, the exact location is
+recorded as follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_media {
+	__u64	daddr;
+	__u64	bbcount;
+};
+.fi
+.in
+.PP
+The
+.I daddr
+and
+.I bbcount
+fields describe the range of storage that was lost.
+Both are provided in units of 512-byte blocks.
+
+.PP
+If a problem is discovered with regular file data, the handle of the file
+and the exact range of the file are recorded as follows:
+
+.in +4n
+.nf
+struct xfs_health_monitor_filerange {
+	__u64	pos;
+	__u64	len;
+	__u64	ino;
+	__u32	gen;
+	__u32	error;
+};
+.fi
+.in
+.PP
+The
+.I ino
+and
+.I gen
+fields describe a handle to the affected file.
+The
+.I pos
+and
+.I len
+fields describe the range of the file data that are affected.
+Both are provided in units of bytes.
+.PP
+The
+.I error
+field describes the error that occurred.
+See the
+.BR errno (3)
+manual page for more information.
+.RE
+.SH CONFORMING TO
+This API is specific to the XFS filesystem on the Linux kernel.
+.SH SEE ALSO
+.BR ioctl_xfs_health_fd_on_monitored_fs (2)


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 07/26] man2: document the media verification ioctl
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (5 preceding siblings ...)
  2026-03-19  4:40   ` [PATCH 06/26] man2: document the healthmon ioctl Darrick J. Wong
@ 2026-03-19  4:40   ` Darrick J. Wong
  2026-03-19  4:40   ` [PATCH 08/26] xfs_io: monitor filesystem health events Darrick J. Wong
                     ` (18 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:40 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Document XFS_IOC_VERIFY_MEDIA, which is a new ioctl for xfs_scrub to
perform media scans on the disks underneath the filesystem.  This will
enable media errors to be reported to xfs_healer and fsnotify.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 man/man2/ioctl_xfs_verify_media.2 |  185 +++++++++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)
 create mode 100644 man/man2/ioctl_xfs_verify_media.2
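
The man page asks callers to follow a failed verification with secondary
calls over smaller ranges.  One possible shape for that retry logic is
sketched below, assuming the simplest strategy of retrying one 512-byte
block at a time.  demo_verify_fn is a hypothetical stand-in for a single
XFS_IOC_VERIFY_MEDIA call, and demo_fake_verify() simulates a device with
one bad block; neither exists in xfsprogs.  The sketch also assumes that
the end of the range is not clamped by the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one XFS_IOC_VERIFY_MEDIA call on [*start, end).
 * Advances *start past the blocks that verified and returns 0 or a positive
 * errno-style I/O error, mirroring me_start_daddr and me_ioerror.
 */
typedef int (*demo_verify_fn)(uint64_t *start, uint64_t end, void *priv);

/* Simulated device: one bad block, verified io_blocks at a time. */
struct demo_fake_dev {
	uint64_t	bad_block;
	uint64_t	io_blocks;
};

static int
demo_fake_verify(uint64_t *start, uint64_t end, void *priv)
{
	const struct demo_fake_dev	*dev = priv;

	assert(dev->io_blocks > 0);
	while (*start < end) {
		uint64_t	io_end = *start + dev->io_blocks;

		if (io_end > end)
			io_end = end;
		/* The whole IO fails if it touches the bad block. */
		if (dev->bad_block >= *start && dev->bad_block < io_end)
			return EIO;
		*start = io_end;
	}
	return 0;
}

/*
 * Verify [start, end).  Returns 0 if the range is clean.  Otherwise returns
 * the I/O error and stores the first block that fails on its own in *bad.
 */
static int
demo_find_first_bad(demo_verify_fn verify, void *priv, uint64_t start,
		uint64_t end, uint64_t *bad)
{
	uint64_t	pos = start;
	int		err;

	/* Primary call over the whole range. */
	err = verify(&pos, end, priv);
	if (pos >= end)
		return 0;

	/* Secondary calls: retry one 512-byte block at a time. */
	while (pos < end) {
		uint64_t	next = pos;

		err = verify(&next, pos + 1, priv);
		if (next == pos) {
			*bad = pos;
			return err ? err : EIO;
		}
		pos = next;
	}
	return 0;	/* the error did not reproduce */
}
```

A bisecting search would issue fewer calls on large ranges; the linear
retry is used here only because it is the easiest to follow.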


diff --git a/man/man2/ioctl_xfs_verify_media.2 b/man/man2/ioctl_xfs_verify_media.2
new file mode 100644
index 00000000000000..bd0d4579f5a364
--- /dev/null
+++ b/man/man2/ioctl_xfs_verify_media.2
@@ -0,0 +1,185 @@
+.\" Copyright (c) 2025-2026, Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" SPDX-License-Identifier: GPL-2.0+
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-VERIFY-MEDIA 2 2026-01-09 "XFS"
+.SH NAME
+ioctl_xfs_verify_media \- verify the media of the devices backing XFS
+.SH SYNOPSIS
+.br
+.B #include <xfs/xfs_fs.h>
+.PP
+.BI "int ioctl(int " fd ", XFS_IOC_VERIFY_MEDIA, struct xfs_verify_media *" arg );
+.SH DESCRIPTION
+Verify the media of a storage device backing an XFS filesystem.
+If errors are found, report the error to the kernel so that it can generate
+health events for the health monitoring system and fsnotify.
+The verification request is conveyed in a structure of the following form:
+.PP
+.in +4n
+.nf
+struct xfs_verify_media {
+	__u32	me_dev;
+	__u32	me_flags;
+	__u64	me_start_daddr;
+	__u64	me_end_daddr;
+	__u32	me_ioerror;
+	__u32	me_pad;
+};
+.fi
+.in
+.PP
+The field
+.I me_pad
+must be zero.
+.PP
+The field
+.I me_ioerror
+will be set if the ioctl returns success.
+.PP
+The fields
+.I me_start_daddr
+and
+.I me_end_daddr
+are the range of the storage device to verify.
+Both values must be in units of 512-byte blocks.
+The
+.I me_start_daddr
+field is inclusive, and the
+.I me_end_daddr
+field is exclusive.
+If
+.I me_end_daddr
+is larger than the size of the device, the kernel will set it to the size of
+the device.
+
+If the system call returns success and any part of the storage device range was
+successfully verified, the
+.I me_start_daddr
+field will be updated to reflect the successful verification.
+If after this update the
+.I me_start_daddr
+is equal to
+.IR me_end_daddr ,
+then the entire range was verified successfully.
+
+If not, then a media error was encountered and the caller should generate a
+series of secondary calls to this ioctl with smaller ranges to discover the
+exact location and type of media error.
+The type of media error will be written to the
+.I me_ioerror
+field.
+
+.PP
+The field
+.I me_dev
+must be one of the following values:
+.RS 0.4i
+.TP
+.B XFS_DEV_DATA
+Verify the data device.
+.TP
+.B XFS_DEV_LOG
+Verify the external log device.
+.TP
+.B XFS_DEV_RT
+Verify the realtime device.
+.RE
+.PP
+The field
+.I me_flags
+is a bitmask of zero or more of the following values:
+.RS 0.4i
+.TP
+.B XFS_VERIFY_MEDIA_REPORT
+Report all media errors to fsnotify.
+.RE
+
+The
+.IR me_max_io_size
+field, if nonzero, will be used as advice for the maximum size of the IO to
+send to the device.
+
+The
+.I me_rest_us
+field will cause the kernel to pause for this many microseconds between IO
+requests.
+
+.SH RETURN VALUE
+On runtime error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+If 0 is returned, then
+.I me_start_daddr
+or
+.I me_ioerror
+will be updated.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EPERM
+The calling process does not have sufficient privilege.
+.TP
+.B EINVAL
+One or more of the arguments specified is invalid.
+.TP
+.B EFAULT
+The
+.I arg
+structure could not be copied into the kernel.
+.TP
+.B ENODEV
+The device is not present.
+.TP
+.B ENOMEM
+There was not enough memory to perform the verification.
+
+.SH I/O ERRORS
+The
+.I me_ioerror
+field could be set to one of the following:
+.TP
+.B 0
+The verification I/O succeeded.
+.TP
+.B EOPNOTSUPP
+.TP
+.B ETIMEDOUT
+The kernel timed out the verification I/O command.
+.TP
+.B ENOLINK
+The transport link to the storage device was down temporarily.
+.TP
+.B EREMOTEIO
+The storage target controller suffered a critical error.
+.TP
+.B ENODATA
+The storage target media suffered a critical error.
+.TP
+.B EILSEQ
+Storage protection metadata did not validate successfully.
+.TP
+.B ENOMEM
+There was not enough memory to allocate an I/O request.
+.TP
+.B ENODEV
+The storage device is offline.
+.TP
+.B ETIME
+The storage device timed out the I/O command.
+.TP
+.B EINVAL
+The I/O request was rejected by the device for being invalid.
+.TP
+.B EIO
+An I/O error occurred but no specific details are available.
+.RE
+.PP
+This list is not exhaustive and may grow in the future.
+
+.SH CONFORMING TO
+This API is specific to the XFS filesystem on the Linux kernel.
+.SH SEE ALSO
+.BR ioctl_xfs_health_monitor (2)


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 08/26] xfs_io: monitor filesystem health events
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (6 preceding siblings ...)
  2026-03-19  4:40   ` [PATCH 07/26] man2: document the media verification ioctl Darrick J. Wong
@ 2026-03-19  4:40   ` Darrick J. Wong
  2026-03-19  4:41   ` [PATCH 09/26] xfs_io: add a media verify command Darrick J. Wong
                     ` (17 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:40 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 io/io.h           |    1 
 io/Makefile       |    1 
 io/healthmon.c    |  186 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 man/man8/xfs_io.8 |   25 +++++++
 5 files changed, 214 insertions(+)
 create mode 100644 io/healthmon.c


diff --git a/io/io.h b/io/io.h
index 35fb8339eeb5aa..2f5262bce6acbb 100644
--- a/io/io.h
+++ b/io/io.h
@@ -162,3 +162,4 @@ extern void		bulkstat_init(void);
 void			exchangerange_init(void);
 void			fsprops_init(void);
 void			aginfo_init(void);
+void			healthmon_init(void);
diff --git a/io/Makefile b/io/Makefile
index 444e2d6a557d5d..8e3783353a52b5 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -25,6 +25,7 @@ CFILES = \
 	fsuuid.c \
 	fsync.c \
 	getrusage.c \
+	healthmon.c \
 	imap.c \
 	init.c \
 	inject.c \
diff --git a/io/healthmon.c b/io/healthmon.c
new file mode 100644
index 00000000000000..5bf54ff6c717e6
--- /dev/null
+++ b/io/healthmon.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/paths.h"
+#include "libfrog/healthevent.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+
+static void
+healthmon_help(void)
+{
+	printf(_(
+"Monitor filesystem health events"
+"\n"
+"-c             Replace the open file with the monitor file.\n"
+"-d delay_ms    Sleep this many milliseconds between reads.\n"
+"-p             Only probe for the existence of the ioctl.\n"
+"-v             Request all events.\n"
+"\n"));
+}
+
+static inline int
+monitor_sleep(
+	int			delay_ms)
+{
+	struct timespec		ts;
+
+	if (!delay_ms)
+		return 0;
+
+	ts.tv_sec = delay_ms / 1000;
+	ts.tv_nsec = (delay_ms % 1000) * 1000000;
+
+	return nanosleep(&ts, NULL);
+}
+
+static int
+monitor(
+	size_t			bufsize,
+	bool			consume,
+	int			delay_ms,
+	bool			verbose,
+	bool			only_probe)
+{
+	struct xfs_health_monitor	hmo = {
+		.format		= XFS_HEALTH_MONITOR_FMT_V0,
+	};
+	struct hme_prefix	pfx;
+	void			*buf;
+	ssize_t			bytes_read;
+	int			mon_fd;
+	int			ret = 1;
+
+	hme_prefix_init(&pfx, file->name);
+
+	if (verbose)
+		hmo.flags |= XFS_HEALTH_MONITOR_VERBOSE;
+
+	mon_fd = ioctl(file->fd, XFS_IOC_HEALTH_MONITOR, &hmo);
+	if (mon_fd < 0) {
+		perror("XFS_IOC_HEALTH_MONITOR");
+		return 1;
+	}
+
+	if (only_probe) {
+		ret = 0;
+		goto out_mon;
+	}
+
+	buf = malloc(bufsize);
+	if (!buf) {
+		perror("malloc");
+		goto out_mon;
+	}
+
+	if (consume) {
+		close(file->fd);
+		file->fd = mon_fd;
+	}
+
+	monitor_sleep(delay_ms);
+	while ((bytes_read = read(mon_fd, buf, bufsize)) > 0) {
+		struct xfs_health_monitor_event *hme = buf;
+
+		while (bytes_read >= sizeof(*hme)) {
+			hme_report_event(&pfx, hme);
+			hme++;
+			bytes_read -= sizeof(*hme);
+		}
+		if (bytes_read > 0) {
+			printf("healthmon: %zd bytes remain?\n", bytes_read);
+			fflush(stdout);
+		}
+
+		monitor_sleep(delay_ms);
+	}
+	if (bytes_read < 0) {
+		perror("healthmon");
+		goto out_buf;
+	}
+
+	ret = 0;
+
+out_buf:
+	free(buf);
+out_mon:
+	close(mon_fd);
+	return ret;
+}
+
+static int
+healthmon_f(
+	int			argc,
+	char			**argv)
+{
+	size_t			bufsize = 4096;
+	bool			consume = false;
+	bool			verbose = false;
+	bool			only_probe = false;
+	int			delay_ms = 0;
+	int			c;
+
+	while ((c = getopt(argc, argv, "b:cd:pv")) != EOF) {
+		switch (c) {
+		case 'b':
+			errno = 0;
+			c = atoi(optarg);
+			if (c <= 0 || errno) {
+				printf("%s: bufsize must be positive\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			bufsize = c;
+			break;
+		case 'c':
+			consume = true;
+			break;
+		case 'd':
+			errno = 0;
+			delay_ms = atoi(optarg);
+			if (delay_ms < 0 || errno) {
+				printf("%s: delay must be positive msecs\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			break;
+		case 'p':
+			only_probe = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			exitcode = 1;
+			healthmon_help();
+			return 0;
+		}
+	}
+
+	return monitor(bufsize, consume, delay_ms, verbose, only_probe);
+}
+
+static struct cmdinfo healthmon_cmd = {
+	.name		= "healthmon",
+	.cfunc		= healthmon_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.args		= "[-b bufsize] [-c] [-d delay_ms] [-p] [-v]",
+	.help		= healthmon_help,
+};
+
+void
+healthmon_init(void)
+{
+	healthmon_cmd.oneline = _("monitor filesystem health events");
+
+	add_command(&healthmon_cmd);
+}
diff --git a/io/init.c b/io/init.c
index 49e9e7cb88214b..cb5573f45ccfbc 100644
--- a/io/init.c
+++ b/io/init.c
@@ -92,6 +92,7 @@ init_commands(void)
 	crc32cselftest_init();
 	exchangerange_init();
 	fsprops_init();
+	healthmon_init();
 }
 
 /*
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 0a673322fde3a1..f7f2956a54a7aa 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1356,6 +1356,31 @@ .SH FILESYSTEM COMMANDS
 .B thaw
 Undo the effects of a filesystem freeze operation.
 Only available in expert mode and requires privileges.
+.TP
+.BI "healthmon [ \-b " bufsize " ] [ \-c ] [ \-d " delay_ms " ] [ \-p ] [ \-v ]"
+Watch for filesystem health events and write them to the console.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI "\-b " bufsize
+Use a buffer of this size to read events from the kernel.
+.TP
+.BI \-c
+Close the open file and replace it with the monitor file.
+.TP
+.BI "\-d " delay_ms
+Sleep for this long between read attempts.
+.TP
+.B \-p
+Probe for the existence of the functionality by opening the monitoring fd and
+closing it immediately.
+.TP
+.BI \-v
+Request all health events, even if nothing changed.
+.PD
+.RE
+
 .TP
 .BI "inject [ " tag " ]"
 Inject errors into a filesystem to observe filesystem behavior at


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 09/26] xfs_io: add a media verify command
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (7 preceding siblings ...)
  2026-03-19  4:40   ` [PATCH 08/26] xfs_io: monitor filesystem health events Darrick J. Wong
@ 2026-03-19  4:41   ` Darrick J. Wong
  2026-03-19  4:41   ` [PATCH 10/26] xfs_healer: create daemon to listen for health events Darrick J. Wong
                     ` (16 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:41 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a subcommand to invoke the media verification ioctl to make sure
that we can actually check the storage underneath an xfs filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 io/io.h           |    1 
 io/Makefile       |    3 +
 io/init.c         |    1 
 io/verify_media.c |  180 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |   42 ++++++++++++
 5 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 io/verify_media.c
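
The verifymedia command takes byte offsets, whereas the ioctl works in
512-byte blocks.  The start is rounded down and the end is rounded up so
that every requested byte falls inside the verified range.  The sketch
below restates that conversion on its own.  The names are hypothetical,
and the round-up is written to avoid overflowing on very large offsets,
which is a small departure from the (l + 511) / 512 form in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BBSIZE	512ULL	/* bytes per basic block */

/* A [start, end) range in units of 512-byte blocks. */
struct demo_daddr_range {
	uint64_t	start_daddr;	/* inclusive */
	uint64_t	end_daddr;	/* exclusive */
};

/* Convert a byte range to the smallest block range that covers it. */
static struct demo_daddr_range
demo_bytes_to_daddr(uint64_t start_byte, uint64_t end_byte)
{
	struct demo_daddr_range	r;

	assert(start_byte <= end_byte);

	/* Round the start down to a block boundary. */
	r.start_daddr = start_byte / DEMO_BBSIZE;

	/* Round the end up without computing end_byte + 511. */
	r.end_daddr = end_byte / DEMO_BBSIZE +
			(end_byte % DEMO_BBSIZE != 0);
	return r;
}

/* Convert a block address back to a byte offset for reporting. */
static uint64_t
demo_daddr_to_bytes(uint64_t daddr)
{
	return daddr * DEMO_BBSIZE;
}
```

Because of the rounding, the range reported back to the user can be
slightly wider than the range that was requested.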


diff --git a/io/io.h b/io/io.h
index 2f5262bce6acbb..0f12b3cfed5e76 100644
--- a/io/io.h
+++ b/io/io.h
@@ -163,3 +163,4 @@ void			exchangerange_init(void);
 void			fsprops_init(void);
 void			aginfo_init(void);
 void			healthmon_init(void);
+void			verifymedia_init(void);
diff --git a/io/Makefile b/io/Makefile
index 8e3783353a52b5..79d5e172b8f31f 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -51,7 +51,8 @@ CFILES = \
 	sync.c \
 	sync_file_range.c \
 	truncate.c \
-	utimes.c
+	utimes.c \
+	verify_media.c
 
 LLDLIBS = $(LIBXCMD) $(LIBHANDLE) $(LIBFROG) $(LIBPTHREAD) $(LIBUUID)
 LTDEPENDENCIES = $(LIBXCMD) $(LIBHANDLE) $(LIBFROG)
diff --git a/io/init.c b/io/init.c
index cb5573f45ccfbc..f2a551ef559200 100644
--- a/io/init.c
+++ b/io/init.c
@@ -93,6 +93,7 @@ init_commands(void)
 	exchangerange_init();
 	fsprops_init();
 	healthmon_init();
+	verifymedia_init();
 }
 
 /*
diff --git a/io/verify_media.c b/io/verify_media.c
new file mode 100644
index 00000000000000..e67567f675abfd
--- /dev/null
+++ b/io/verify_media.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "command.h"
+#include "input.h"
+#include "init.h"
+#include "io.h"
+
+static void
+verifymedia_help(void)
+{
+	printf(_(
+"\n"
+" Verify the media of the devices backing the filesystem.\n"
+"\n"
+" -d -- Verify the data device (default).\n"
+" -l -- Verify the log device.\n"
+" -r -- Verify the realtime device.\n"
+" -R -- Report media errors to fsnotify.\n"
+" -s -- Sleep this many usecs between IOs.\n"
+"\n"
+" start is the byte offset of the start of the range to verify.  If the start\n"
+" is specified, the end may (optionally) be specified as well."
+"\n"
+" end is the byte offset of the end of the range to verify.\n"
+"\n"
+" If neither start nor end are specified, the media verification will\n"
+" check the entire device."
+"\n"));
+}
+
+static int
+verifymedia_f(
+	int			argc,
+	char			**argv)
+{
+	xfs_daddr_t		orig_start_daddr = 0;
+	struct xfs_verify_media me = {
+		.me_start_daddr	= orig_start_daddr,
+		.me_end_daddr	= ~0ULL,
+		.me_dev		= XFS_DEV_DATA,
+	};
+	struct timeval		t1, t2;
+	long long		l;
+	size_t			fsblocksize, fssectsize;
+	const char		*verifydev = _("datadev");
+	int			c, ret;
+
+	init_cvtnum(&fsblocksize, &fssectsize);
+
+	while ((c = getopt(argc, argv, "b:dlrRs:")) != EOF) {
+		switch (c) {
+		case 'd':
+			me.me_dev = XFS_DEV_DATA;
+			verifydev = _("datadev");
+			break;
+		case 'l':
+			me.me_dev = XFS_DEV_LOG;
+			verifydev = _("logdev");
+			break;
+		case 'r':
+			me.me_dev = XFS_DEV_RT;
+			verifydev = _("rtdev");
+			break;
+		case 'b':
+			l = cvtnum(fsblocksize, fssectsize, optarg);
+			if (l < 0 || l > UINT_MAX) {
+				printf("non-numeric maxio argument -- %s\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			me.me_max_io_size = l;
+			break;
+		case 'R':
+			me.me_flags |= XFS_VERIFY_MEDIA_REPORT;
+			break;
+		case 's':
+			l = atoi(optarg);
+			if (l < 0) {
+				printf("non-numeric rest_us argument -- %s\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			me.me_rest_us = l;
+			break;
+		default:
+			verifymedia_help();
+			exitcode = 1;
+			return 0;
+		}
+	}
+
+	/* Range start (optional) */
+	if (optind < argc) {
+		l = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (l < 0) {
+			printf("non-numeric start argument -- %s\n",
+					argv[optind]);
+			exitcode = 1;
+			return 0;
+		}
+
+		orig_start_daddr = l / 512;
+		me.me_start_daddr = orig_start_daddr;
+		optind++;
+	}
+
+	/* Range end (optional if range start was specified) */
+	if (optind < argc) {
+		l = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (l < 0) {
+			printf("non-numeric end argument -- %s\n",
+					argv[optind]);
+			exitcode = 1;
+			return 0;
+		}
+
+		me.me_end_daddr = ((l + 511) / 512);
+		optind++;
+	}
+
+	if (optind < argc) {
+		printf("too many arguments -- %s\n", argv[optind]);
+		exitcode = 1;
+		return 0;
+	}
+
+	gettimeofday(&t1, NULL);
+	ret = ioctl(file->fd, XFS_IOC_VERIFY_MEDIA, &me);
+	gettimeofday(&t2, NULL);
+	t2 = tsub(t2, t1);
+	if (ret < 0) {
+		fprintf(stderr,
+ "%s: ioctl(XFS_IOC_VERIFY_MEDIA) [\"%s\"]: %s\n",
+				progname, file->name, strerror(errno));
+		exitcode = 1;
+		return 0;
+	}
+
+	if (me.me_ioerror) {
+		fprintf(stderr,
+ "%s: verify error at offset %llu length %llu: %s\n",
+				verifydev,
+				BBTOB(me.me_start_daddr),
+				BBTOB(me.me_end_daddr - me.me_start_daddr),
+				strerror(me.me_ioerror));
+	} else {
+		unsigned long long	total;
+
+		if (me.me_end_daddr > orig_start_daddr)
+			total = BBTOB(me.me_end_daddr - orig_start_daddr);
+		else
+			total = 0;
+		report_io_times("verified", &t2, BBTOB(orig_start_daddr),
+				BBTOB(me.me_start_daddr - orig_start_daddr),
+				total, 1, false);
+	}
+
+	return 0;
+}
+
+static struct cmdinfo verifymedia_cmd = {
+	.name		= "verifymedia",
+	.cfunc		= verifymedia_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.args		= "[-b maxio] [-dlrR] [-s rest_us] [start [end]]",
+	.help		= verifymedia_help,
+};
+
+void
+verifymedia_init(void)
+{
+	add_command(&verifymedia_cmd);
+}
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index f7f2956a54a7aa..2090cd4c0b2641 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1389,6 +1389,48 @@ .SH FILESYSTEM COMMANDS
 argument, displays the list of error tags available.
 Only available in expert mode and requires privileges.
 
+.TP
+.BI "verifymedia [ \-bdlrsR ] [ " start " [ " end " ]]"
+Check for media errors on the storage devices backing XFS.
+The
+.I start
+and
+.I end
+parameters are the range of physical storage to verify, in bytes.
+The
+.I start
+parameter is inclusive.
+The
+.I end
+parameter is exclusive.
+If neither
+.IR start " nor " end
+are specified, the entire device will be verified.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.B \-b
+Don't issue any IOs larger than this size.
+.TP
+.B \-d
+Verify the data device.
+This is the default.
+.TP
+.B \-l
+Verify the log device instead of the data device.
+.TP
+.B \-r
+Verify the realtime device instead of the data device.
+.TP
+.B \-R
+Report media errors to fsnotify.
+.TP
+.B \-s
+Sleep this many microseconds between IO requests.
+.PD
+.RE
+
 .TP
 .BI "rginfo [ \-r " rgno " ]"
 Show information about or update the state of realtime allocation groups.


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 10/26] xfs_healer: create daemon to listen for health events
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (8 preceding siblings ...)
  2026-03-19  4:41   ` [PATCH 09/26] xfs_io: add a media verify command Darrick J. Wong
@ 2026-03-19  4:41   ` Darrick J. Wong
  2026-03-19  4:41   ` [PATCH 11/26] xfs_healer: enable repairing filesystems Darrick J. Wong
                     ` (15 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:41 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a daemon program that can listen for and log health events.
Eventually this will be used to self-heal filesystems in real time.

Because events can take a while to process (logging, repairs, etc.), the
main thread reads event objects from the healthmon fd and dispatches
them to a background workqueue as quickly as it can.  This split of
responsibilities is necessary because the kernel will drop events if its
event queue fills up, and we don't want to lose events.
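
To illustrate the reader/worker split described above, here is a minimal
sketch in plain pthreads.  It is not code from this patch: struct
event_queue, queue_start, queue_push and queue_finish are hypothetical
names, and the sketch uses an unbounded linked list where the real daemon
uses libfrog's bounded workqueue.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for one health event read from the kernel. */
struct event {
	struct event	*next;
	int		type;
};

struct event_queue {
	pthread_mutex_t	lock;
	pthread_cond_t	wake;
	pthread_t	worker;
	struct event	*head;
	struct event	*tail;
	unsigned int	handled;
	bool		closed;
};

/* Worker thread: drain events until the queue is closed and empty. */
static void *
queue_worker(void *arg)
{
	struct event_queue	*q = arg;
	struct event		*ev;

	pthread_mutex_lock(&q->lock);
	for (;;) {
		while (!q->head && !q->closed)
			pthread_cond_wait(&q->wake, &q->lock);
		ev = q->head;
		if (!ev)
			break;		/* closed and fully drained */
		q->head = ev->next;
		if (!q->head)
			q->tail = NULL;
		pthread_mutex_unlock(&q->lock);

		/* Slow work (logging, repairs) would run here, unlocked. */
		free(ev);

		pthread_mutex_lock(&q->lock);
		q->handled++;
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

/* Start the worker; returns 0 or a positive errno like pthread_create. */
static int
queue_start(struct event_queue *q)
{
	q->head = q->tail = NULL;
	q->handled = 0;
	q->closed = false;
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->wake, NULL);
	return pthread_create(&q->worker, NULL, queue_worker, q);
}

/* Reader side: hand one event to the worker and return right away. */
static int
queue_push(struct event_queue *q, int type)
{
	struct event	*ev = malloc(sizeof(*ev));

	if (!ev)
		return -1;
	ev->next = NULL;
	ev->type = type;

	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = ev;
	else
		q->head = ev;
	q->tail = ev;
	pthread_cond_signal(&q->wake);
	pthread_mutex_unlock(&q->lock);
	return 0;
}

/* Close the queue, wait for the worker, report how many events it saw. */
static unsigned int
queue_finish(struct event_queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->closed = true;
	pthread_cond_signal(&q->wake);
	pthread_mutex_unlock(&q->lock);

	pthread_join(q->worker, NULL);
	assert(q->head == NULL);
	pthread_mutex_destroy(&q->lock);
	pthread_cond_destroy(&q->wake);
	return q->handled;
}
```

The reader only holds the lock long enough to link an event into the
list, so a slow handler never delays the next read from the kernel.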

To be clear, xfs_healer and xfs_scrub are complementary tools:

Scrub walks the whole filesystem, finds stuff that needs fixing or
rebuilding, and rebuilds it.  This is sort of analogous to a patrol
scrub.

Healer listens for metadata corruption messages from the kernel and
issues a targeted repair of that structure.  This is kind of like an
on-demand scrub.

My end goal is that xfs_healer (the service) is active all the time and
can respond instantly to a corruption report, whereas xfs_scrub (the
service) gets run periodically as a cron job.

xfs_healer can decide that it's overwhelmed with problems and start
xfs_scrub to deal with the mess.  Ideally you don't crash the filesystem
and then have to use xfs_repair to smash your way back to a mountable
filesystem.

By default we run xfs_healer as a background service, which means that
we only start two threads -- one to read the events, and another to
process them.  In other words, we try not to use all available hardware
resources for repairs.  The foreground mode switch starts up a large
number of threads to try to increase parallelism, which may or may not
be useful for repairs depending on how much metadata the kernel needs to
scan.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h  |   47 ++++++
 Makefile             |    5 +
 configure.ac         |    6 +
 healer/Makefile      |   35 ++++
 healer/xfs_healer.c  |  391 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/builddefs.in |    1 
 6 files changed, 485 insertions(+)
 create mode 100644 healer/xfs_healer.h
 create mode 100644 healer/Makefile
 create mode 100644 healer/xfs_healer.c


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
new file mode 100644
index 00000000000000..bcddde5db0cc47
--- /dev/null
+++ b/healer/xfs_healer.h
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef XFS_HEALER_XFS_HEALER_H_
+#define XFS_HEALER_XFS_HEALER_H_
+
+extern char *progname;
+
+/*
+ * When running in environments with restrictive security policies, healer
+ * might not be allowed to access the global mount tree.  However, processes
+ * are usually still allowed to see their own mount tree, so use this path for
+ * all mount table queries.
+ */
+#define _PATH_PROC_MOUNTS	"/proc/self/mounts"
+
+struct healer_ctx {
+	/* CLI options, must be int */
+	int			debug;
+	int			log;
+	int			everything;
+	int			foreground;
+
+	/* fd and fs geometry for mount */
+	struct xfs_fd		mnt;
+
+	/* Shared reference to the user's mountpoint for logging */
+	const char		*mntpoint;
+
+	/* Shared reference to the getmntent fsname for reconnecting */
+	const char		*fsname;
+
+	/* file stream of monitor and buffer */
+	FILE			*mon_fp;
+	char			*mon_buf;
+
+	/* coordinates logging printfs */
+	pthread_mutex_t		conlock;
+
+	/* event queue */
+	struct workqueue	event_queue;
+	bool			queue_active;
+};
+
+#endif /* XFS_HEALER_XFS_HEALER_H_ */
diff --git a/Makefile b/Makefile
index c73aa391bc5f43..1f499c30f3457e 100644
--- a/Makefile
+++ b/Makefile
@@ -69,6 +69,10 @@ ifeq ("$(ENABLE_SCRUB)","yes")
 TOOL_SUBDIRS += scrub
 endif
 
+ifeq ("$(ENABLE_HEALER)","yes")
+TOOL_SUBDIRS += healer
+endif
+
 ifneq ("$(XGETTEXT)","")
 TOOL_SUBDIRS += po
 endif
@@ -100,6 +104,7 @@ mkfs: libxcmd
 spaceman: libxcmd libhandle
 scrub: libhandle libxcmd
 rtcp: libfrog
+healer: libhandle
 
 ifeq ($(HAVE_BUILDDEFS), yes)
 include $(BUILDRULES)
diff --git a/configure.ac b/configure.ac
index cffcaf373cfa5e..90af1f84035ee6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -110,6 +110,12 @@ AC_ARG_ENABLE(libicu,
 [  --enable-libicu=[yes/no]  Enable Unicode name scanning in xfs_scrub (libicu) [default=probe]],,
 	enable_libicu=probe)
 
+# Enable xfs_healer build
+AC_ARG_ENABLE(healer,
+[  --enable-healer=[yes/no]  Enable build of xfs_healer utility [[default=yes]]],,
+	enable_healer=yes)
+AC_SUBST(enable_healer)
+
 #
 # If the user specified a libdir ending in lib64 do not append another
 # 64 to the library names.
diff --git a/healer/Makefile b/healer/Makefile
new file mode 100644
index 00000000000000..e82c820883669a
--- /dev/null
+++ b/healer/Makefile
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2024-2026 Oracle.  All Rights Reserved.
+#
+
+TOPDIR = ..
+builddefs=$(TOPDIR)/include/builddefs
+include $(builddefs)
+
+INSTALL_HEALER = install-healer
+
+LTCOMMAND = xfs_healer
+
+CFILES = \
+xfs_healer.c
+
+HFILES = \
+xfs_healer.h
+
+LLDLIBS += $(LIBHANDLE) $(LIBFROG) $(LIBURCU) $(LIBPTHREAD)
+LTDEPENDENCIES += $(LIBHANDLE) $(LIBFROG)
+LLDFLAGS = -static
+
+default: depend $(LTCOMMAND)
+
+include $(BUILDRULES)
+
+install: $(INSTALL_HEALER)
+
+install-healer: default
+	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
+	$(INSTALL) -m 755 $(LTCOMMAND) $(PKG_LIBEXEC_DIR)
+
+install-dev:
+
+-include .dep
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
new file mode 100644
index 00000000000000..e0076fff381632
--- /dev/null
+++ b/healer/xfs_healer.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include <pthread.h>
+#include <stdlib.h>
+
+#include "platform_defs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/paths.h"
+#include "libfrog/healthevent.h"
+#include "libfrog/workqueue.h"
+#include "libfrog/systemd.h"
+#include "xfs_healer.h"
+
+/* Program name; needed for libfrog error reports. */
+char				*progname = "xfs_healer";
+
+/* Return a health monitoring fd. */
+static int
+open_health_monitor(
+	struct healer_ctx		*ctx,
+	int				mnt_fd)
+{
+	struct xfs_health_monitor	hmo = {
+		.format			= XFS_HEALTH_MONITOR_FMT_V0,
+	};
+
+	if (ctx->everything)
+		hmo.flags |= XFS_HEALTH_MONITOR_VERBOSE;
+
+	return ioctl(mnt_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
+}
+
+/* Decide if this event can only be reported upon, and not acted upon. */
+static bool
+event_not_actionable(
+	const struct xfs_health_monitor_event	*hme)
+{
+	switch (hme->type) {
+	case XFS_HEALTH_MONITOR_TYPE_LOST:
+	case XFS_HEALTH_MONITOR_TYPE_RUNNING:
+	case XFS_HEALTH_MONITOR_TYPE_UNMOUNT:
+	case XFS_HEALTH_MONITOR_TYPE_SHUTDOWN:
+		return true;
+	}
+
+	return false;
+}
+
+/* Should this event be logged? */
+static bool
+event_loggable(
+	const struct healer_ctx			*ctx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	return ctx->log || event_not_actionable(hme);
+}
+
+/* Handle an event asynchronously. */
+static void
+handle_event(
+	struct workqueue		*wq,
+	uint32_t			index,
+	void				*arg)
+{
+	struct hme_prefix		pfx;
+	struct xfs_health_monitor_event	*hme = arg;
+	struct healer_ctx		*ctx = wq->wq_ctx;
+	const bool loggable = event_loggable(ctx, hme);
+
+	hme_prefix_init(&pfx, ctx->mntpoint);
+
+	/*
+	 * Non-actionable events should always be logged, because they are 100%
+	 * informational.
+	 */
+	if (loggable) {
+		pthread_mutex_lock(&ctx->conlock);
+		hme_report_event(&pfx, hme);
+		pthread_mutex_unlock(&ctx->conlock);
+	}
+
+	free(hme);
+}
+
+/*
+ * Find the filesystem source name for the mount that we're monitoring.  We
+ * don't use the fs_table_ helpers because we might be running in a restricted
+ * environment where we cannot access device files at all.
+ */
+static int
+try_capture_fsinfo(
+	struct healer_ctx	*ctx)
+{
+	struct mntent		*mnt;
+	FILE			*mtp;
+	char			rpath[PATH_MAX], rmnt_dir[PATH_MAX];
+
+	if (!realpath(ctx->mntpoint, rpath))
+		return -1;
+
+	mtp = setmntent(_PATH_PROC_MOUNTS, "r");
+	if (mtp == NULL)
+		return -1;
+
+	while ((mnt = getmntent(mtp)) != NULL) {
+		if (strcmp(mnt->mnt_type, "xfs"))
+			continue;
+		if (!realpath(mnt->mnt_dir, rmnt_dir))
+			continue;
+
+		if (!strcmp(rpath, rmnt_dir)) {
+			ctx->fsname = strdup(mnt->mnt_fsname);
+			break;
+		}
+	}
+
+	endmntent(mtp);
+
+	return ctx->fsname ? 0 : -1;
+}
+
+static unsigned int
+healer_nproc(
+	const struct healer_ctx	*ctx)
+{
+	/*
+	 * By default, use one event handler thread.  In foreground mode,
+	 * create one thread per cpu.
+	 */
+	return ctx->foreground ? platform_nproc() : 1;
+}
+
+/* Set ourselves up to monitor the given mountpoint for health events. */
+static int
+setup_monitor(
+	struct healer_ctx	*ctx)
+{
+	const long		BUF_SIZE = sysconf(_SC_PAGE_SIZE) * 2;
+	int			mon_fd;
+	int			ret;
+
+	ret = xfd_open(&ctx->mnt, ctx->mntpoint, O_RDONLY);
+	if (ret) {
+		perror(ctx->mntpoint);
+		return -1;
+	}
+
+	ret = try_capture_fsinfo(ctx);
+	if (ret) {
+		fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+				_("Not a XFS mount point."));
+		goto out_mnt_fd;
+	}
+
+	/*
+	 * Open the health monitor, then close the mountpoint to avoid pinning
+	 * it.  We can reconnect later if need be.
+	 */
+	mon_fd = open_health_monitor(ctx, ctx->mnt.fd);
+	if (mon_fd < 0) {
+		switch (errno) {
+		case ENOTTY:
+		case EOPNOTSUPP:
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+ _("XFS health monitoring not supported."));
+			break;
+		case EEXIST:
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+ _("XFS health monitoring already running."));
+			break;
+		default:
+			perror(ctx->mntpoint);
+			break;
+		}
+
+		goto out_mnt_fd;
+	}
+	close(ctx->mnt.fd);
+	ctx->mnt.fd = -1;
+
+	/*
+	 * mon_fp consumes mon_fd.  We intentionally leave mon_fp attached to
+	 * the context so that we keep the monitoring fd open until we've torn
+	 * down all the background threads.
+	 */
+	ctx->mon_fp = fdopen(mon_fd, "r");
+	if (!ctx->mon_fp) {
+		perror(ctx->mntpoint);
+		goto out_mon_fd;
+	}
+
+	/* Increase the buffer size so that we can reduce kernel calls */
+	ctx->mon_buf = malloc(BUF_SIZE);
+	if (ctx->mon_buf)
+		setvbuf(ctx->mon_fp, ctx->mon_buf, _IOFBF, BUF_SIZE);
+
+	/*
+	 * Queue up to 1MB of events before we stop trying to read events from
+	 * the kernel as quickly as we can.  Note that the kernel won't accrue
+	 * more than 32K of internal events before it starts dropping them.
+	 */
+	ret = workqueue_create_bound(&ctx->event_queue, ctx, healer_nproc(ctx),
+			1048576 / sizeof(struct xfs_health_monitor_event));
+	if (ret) {
+		errno = ret;
+		fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
+				_("worker threadpool setup"), strerror(errno));
+		goto out_mon_fp;
+	}
+	ctx->queue_active = true;
+
+	return 0;
+
+out_mon_fp:
+	fclose(ctx->mon_fp);	/* also closes mon_fd */
+	ctx->mon_fp = NULL;
+	goto out_mnt_fd;
+out_mon_fd:
+	if (mon_fd >= 0)
+		close(mon_fd);
+out_mnt_fd:
+	if (ctx->mnt.fd >= 0)
+		close(ctx->mnt.fd);
+	ctx->mnt.fd = -1;
+	return -1;
+}
+
+/* Monitor the given mountpoint for health events. */
+static void
+monitor(
+	struct healer_ctx	*ctx)
+{
+	bool			mounted = true;
+	size_t			nr;
+
+	do {
+		struct xfs_health_monitor_event	*hme;
+		int		ret;
+
+		hme = malloc(sizeof(*hme));
+		if (!hme) {
+			pthread_mutex_lock(&ctx->conlock);
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+					_("could not allocate event object"));
+			pthread_mutex_unlock(&ctx->conlock);
+			break;
+		}
+
+		nr = fread(hme, sizeof(*hme), 1, ctx->mon_fp);
+		if (nr == 0) {
+			free(hme);
+			break;
+		}
+
+		if (hme->type == XFS_HEALTH_MONITOR_TYPE_UNMOUNT)
+			mounted = false;
+
+		/* handle_event owns hme if the workqueue_add succeeds */
+		ret = workqueue_add(&ctx->event_queue, handle_event, 0, hme);
+		if (ret) {
+			pthread_mutex_lock(&ctx->conlock);
+			fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
+					_("could not queue event object"),
+					strerror(ret));
+			pthread_mutex_unlock(&ctx->conlock);
+			free(hme);
+			break;
+		}
+	} while (nr > 0 && mounted);
+}
+
+/* Tear down all the resources that we created for monitoring */
+static void
+teardown_monitor(
+	struct healer_ctx	*ctx)
+{
+	if (ctx->queue_active) {
+		workqueue_terminate(&ctx->event_queue);
+		workqueue_destroy(&ctx->event_queue);
+	}
+	if (ctx->mon_fp) {
+		fclose(ctx->mon_fp);
+		ctx->mon_fp = NULL;
+	}
+	free(ctx->mon_buf);
+	ctx->mon_buf = NULL;
+}
+
+static void __attribute__((noreturn))
+usage(void)
+{
+	fprintf(stderr, "%s %s %s\n", _("Usage:"), progname,
+			_("[OPTIONS] mountpoint"));
+	fprintf(stderr, "\n");
+	fprintf(stderr, _("Options:\n"));
+	fprintf(stderr, _("  --debug       Enable debugging messages.\n"));
+	fprintf(stderr, _("  --everything  Capture all events.\n"));
+	fprintf(stderr, _("  --foreground  Process events as soon as possible.\n"));
+	fprintf(stderr, _("  --quiet       Do not log health events to stdout.\n"));
+	fprintf(stderr, _("  -V            Print version.\n"));
+
+	exit(EXIT_FAILURE);
+}
+
+enum long_opt_nr {
+	LOPT_DEBUG,
+	LOPT_EVERYTHING,
+	LOPT_FOREGROUND,
+	LOPT_HELP,
+	LOPT_QUIET,
+
+	LOPT_MAX,
+};
+
+int
+main(
+	int			argc,
+	char			**argv)
+{
+	struct healer_ctx	ctx = {
+		.conlock	= (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER,
+		.log		= 1,
+		.mnt.fd		= -1,
+	};
+	int			option_index;
+	int			vflag = 0;
+	int			c;
+	int			ret;
+
+	progname = basename(argv[0]);
+	setlocale(LC_ALL, "");
+	bindtextdomain(PACKAGE, LOCALEDIR);
+	textdomain(PACKAGE);
+
+	struct option long_options[] = {
+		[LOPT_DEBUG]	   = {"debug", no_argument, &ctx.debug, 1 },
+		[LOPT_EVERYTHING]  = {"everything", no_argument, &ctx.everything, 1 },
+		[LOPT_FOREGROUND]  = {"foreground", no_argument, &ctx.foreground, 1 },
+		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
+		[LOPT_QUIET]	   = {"quiet", no_argument, &ctx.log, 0 },
+
+		[LOPT_MAX]	   = {NULL, 0, NULL, 0 },
+	};
+
+	while ((c = getopt_long(argc, argv, "V", long_options, &option_index))
+			!= EOF) {
+		switch (c) {
+		case 0:
+			switch (option_index) {
+			case LOPT_HELP:
+				usage();
+				break;
+			default:
+				break;
+			}
+			break;
+		case 'V':
+			vflag++;
+			break;
+		default:
+			usage();
+			break;
+		}
+	}
+
+	if (vflag) {
+		fprintf(stdout, "%s %s %s\n", progname, _("version"), VERSION);
+		fflush(stdout);
+		return EXIT_SUCCESS;
+	}
+
+	if (optind != argc - 1)
+		usage();
+
+	ctx.mntpoint = argv[optind];
+
+	ret = setup_monitor(&ctx);
+	if (ret)
+		goto out_events;
+
+	monitor(&ctx);
+
+out_events:
+	teardown_monitor(&ctx);
+	free((char *)ctx.fsname);
+	return systemd_service_exit(ret);
+}
diff --git a/include/builddefs.in b/include/builddefs.in
index d2d25c8a0ed676..0ab2bf1702f0f0 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -91,6 +91,7 @@ ENABLE_SHARED	= @enable_shared@
 ENABLE_GETTEXT	= @enable_gettext@
 ENABLE_EDITLINE	= @enable_editline@
 ENABLE_SCRUB	= @enable_scrub@
+ENABLE_HEALER	= @enable_healer@
 
 HAVE_ZIPPED_MANPAGES = @have_zipped_manpages@
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 11/26] xfs_healer: enable repairing filesystems
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (9 preceding siblings ...)
  2026-03-19  4:41   ` [PATCH 10/26] xfs_healer: create daemon to listen for health events Darrick J. Wong
@ 2026-03-19  4:41   ` Darrick J. Wong
  2026-03-19  4:41   ` [PATCH 12/26] xfs_healer: use getparents to look up file names Darrick J. Wong
                     ` (14 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:41 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that our health monitoring daemon can initiate repairs in
response to reports of corrupt filesystem metadata.  Repairs are
initiated from the background workers as explained in the previous
patch.

Note that just like xfs_scrub, xfs_healer's ability to repair metadata
relies heavily on back references such as reverse mappings and directory
parent pointers to add redundancy to the filesystem.  Check for these
two features and whine a bit if they are missing, just like scrub.

There's a bit of trickery with the fd that is used to initiate repairs
in the kernel.  Because an open fd will pin the filesystem in memory,
xfs_healer can only hold an open fd to the target filesystem while it's
performing repairs.  Therefore, at startup xfs_healer must sample enough
information about the target filesystem to reconnect to it later on.
Currently, the fs source (aka the data device path) and the root
directory handle are sufficient to do this.

Someday we might be able to have revocable fds, which would eliminate
the need for such efforts in userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h   |   28 ++++++
 libfrog/flagmap.h     |    3 +
 libfrog/healthevent.h |   12 ++
 healer/Makefile       |    2 
 healer/fsrepair.c     |  249 +++++++++++++++++++++++++++++++++++++++++++++++++
 healer/weakhandle.c   |  115 +++++++++++++++++++++++
 healer/xfs_healer.c   |   55 +++++++++++
 libfrog/flagmap.c     |   17 +++
 libfrog/healthevent.c |  117 +++++++++++++++++++++++
 9 files changed, 598 insertions(+)
 create mode 100644 healer/fsrepair.c
 create mode 100644 healer/weakhandle.c


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index bcddde5db0cc47..a4de1ad32a408f 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -8,6 +8,9 @@
 
 extern char *progname;
 
+struct weakhandle;
+struct hme_prefix;
+
 /*
  * When running in environments with restrictive security policies, healer
  * might not be allowed to access the global mount tree.  However, processes
@@ -22,6 +25,7 @@ struct healer_ctx {
 	int			log;
 	int			everything;
 	int			foreground;
+	int			want_repair;
 
 	/* fd and fs geometry for mount */
 	struct xfs_fd		mnt;
@@ -32,6 +36,9 @@ struct healer_ctx {
 	/* Shared reference to the getmntent fsname for reconnecting */
 	const char		*fsname;
 
+	/* weak file handle so we can reattach to filesystem */
+	struct weakhandle	*wh;
+
 	/* file stream of monitor and buffer */
 	FILE			*mon_fp;
 	char			*mon_buf;
@@ -44,4 +51,25 @@ struct healer_ctx {
 	bool			queue_active;
 };
 
+static inline bool healer_has_rmapbt(const struct healer_ctx *ctx)
+{
+	return ctx->mnt.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT;
+}
+
+static inline bool healer_has_parent(const struct healer_ctx *ctx)
+{
+	return ctx->mnt.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_PARENT;
+}
+
+/* fsrepair.c */
+int repair_metadata(struct healer_ctx *ctx, const struct hme_prefix *pfx,
+		const struct xfs_health_monitor_event *hme);
+bool healer_can_repair(struct healer_ctx *ctx);
+
+/* weakhandle.c */
+int weakhandle_alloc(int fd, const char *mountpoint, const char *fsname,
+		struct weakhandle **whp);
+int weakhandle_reopen(struct weakhandle *wh, int *fd);
+void weakhandle_free(struct weakhandle **whp);
+
 #endif /* XFS_HEALER_XFS_HEALER_H_ */
diff --git a/libfrog/flagmap.h b/libfrog/flagmap.h
index 8031d75a7c02a8..05110c3544dc97 100644
--- a/libfrog/flagmap.h
+++ b/libfrog/flagmap.h
@@ -14,6 +14,9 @@ struct flag_map {
 void mask_to_string(const struct flag_map *map, unsigned long long mask,
 		const char *delimiter, char *buf, size_t bufsize);
 
+const char *lowest_set_mask_string(const struct flag_map *map,
+		unsigned long long mask);
+
 const char *value_to_string(const struct flag_map *map,
 		unsigned long long value);
 
diff --git a/libfrog/healthevent.h b/libfrog/healthevent.h
index 6de41bc797100c..4f3c8ba639ec4c 100644
--- a/libfrog/healthevent.h
+++ b/libfrog/healthevent.h
@@ -40,4 +40,16 @@ hme_prefix_init(
 void hme_report_event(const struct hme_prefix *pfx,
 		const struct xfs_health_monitor_event *hme);
 
+enum repair_outcome {
+	REPAIR_SUCCESS,
+	REPAIR_FAILED,
+	REPAIR_PROBABLY_OK,
+	REPAIR_UNNECESSARY,
+};
+
+void report_health_repair(const struct hme_prefix *pfx,
+		const struct xfs_health_monitor_event *hme,
+		uint32_t event_mask,
+		enum repair_outcome outcome);
+
 #endif /* LIBFROG_HEALTHEVENT_H_ */
diff --git a/healer/Makefile b/healer/Makefile
index e82c820883669a..981192b81af626 100644
--- a/healer/Makefile
+++ b/healer/Makefile
@@ -11,6 +11,8 @@ INSTALL_HEALER = install-healer
 LTCOMMAND = xfs_healer
 
 CFILES = \
+fsrepair.c \
+weakhandle.c \
 xfs_healer.c
 
 HFILES = \
diff --git a/healer/fsrepair.c b/healer/fsrepair.c
new file mode 100644
index 00000000000000..907afca3dba8a7
--- /dev/null
+++ b/healer/fsrepair.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include "platform_defs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/workqueue.h"
+#include "libfrog/healthevent.h"
+#include "xfs_healer.h"
+
+/* Translate scrub output flags to outcome. */
+static enum repair_outcome from_repair_oflags(uint32_t oflags)
+{
+	if (oflags & (XFS_SCRUB_OFLAG_CORRUPT | XFS_SCRUB_OFLAG_INCOMPLETE))
+		return REPAIR_FAILED;
+
+	if (oflags & XFS_SCRUB_OFLAG_XFAIL)
+		return REPAIR_PROBABLY_OK;
+
+	if (oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED)
+		return REPAIR_UNNECESSARY;
+
+	return REPAIR_SUCCESS;
+}
+
+struct u32_scrub {
+	uint32_t	event_mask;
+	uint32_t	scrub_type;
+};
+
+#define foreach_scrub_type(cur, mask, coll) \
+	for ((cur) = (coll); (cur)->scrub_type != 0; (cur)++) \
+		if ((mask) & (cur)->event_mask)
+
+/* Call the kernel to repair one piece of metadata. */
+static inline enum repair_outcome
+xfs_repair_metadata(
+	int			fd,
+	uint32_t		scrub_type,
+	uint32_t		group,
+	uint64_t		ino,
+	uint32_t		gen)
+{
+	struct xfs_scrub_metadata sm = {
+		.sm_type = scrub_type,
+		.sm_flags = XFS_SCRUB_IFLAG_REPAIR,
+		.sm_ino = ino,
+		.sm_gen = gen,
+		.sm_agno = group,
+	};
+	int			ret;
+
+	ret = ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm);
+	if (ret)
+		return REPAIR_FAILED;
+
+	return from_repair_oflags(sm.sm_flags);
+}
+
+/* React to a fs-domain corruption event by repairing it. */
+static void
+try_repair_wholefs(
+	struct healer_ctx			*ctx,
+	const struct hme_prefix			*pfx,
+	int					mnt_fd,
+	const struct xfs_health_monitor_event	*hme)
+{
+#define X(code, type) { XFS_FSOP_GEOM_SICK_ ## code, XFS_SCRUB_TYPE_ ## type }
+	static const struct u32_scrub		FS_STRUCTURES[] = {
+		X(COUNTERS,	FSCOUNTERS),
+		X(UQUOTA,	UQUOTA),
+		X(GQUOTA,	GQUOTA),
+		X(PQUOTA,	PQUOTA),
+		X(RT_BITMAP,	RTBITMAP),
+		X(RT_SUMMARY,	RTSUM),
+		X(QUOTACHECK,	QUOTACHECK),
+		X(NLINKS,	NLINKS),
+		{0,		0},
+	};
+#undef X
+	const struct u32_scrub	*f;
+
+	foreach_scrub_type(f, hme->e.fs.mask, FS_STRUCTURES) {
+		enum repair_outcome	outcome =
+			xfs_repair_metadata(mnt_fd, f->scrub_type, 0, 0, 0);
+
+		pthread_mutex_lock(&ctx->conlock);
+		report_health_repair(pfx, hme, f->event_mask, outcome);
+		pthread_mutex_unlock(&ctx->conlock);
+	}
+}
+
+/* React to an ag corruption event by repairing it. */
+static void
+try_repair_ag(
+	struct healer_ctx			*ctx,
+	const struct hme_prefix			*pfx,
+	int					mnt_fd,
+	const struct xfs_health_monitor_event	*hme)
+{
+#define X(code, type) { XFS_AG_GEOM_SICK_ ## code, XFS_SCRUB_TYPE_ ## type }
+	static const struct u32_scrub		AG_STRUCTURES[] = {
+		X(SB,		SB),
+		X(AGF,		AGF),
+		X(AGFL,		AGFL),
+		X(AGI,		AGI),
+		X(BNOBT,	BNOBT),
+		X(CNTBT,	CNTBT),
+		X(INOBT,	INOBT),
+		X(FINOBT,	FINOBT),
+		X(RMAPBT,	RMAPBT),
+		X(REFCNTBT,	REFCNTBT),
+		{0,		0},
+	};
+#undef X
+	const struct u32_scrub *f;
+
+	foreach_scrub_type(f, hme->e.group.mask, AG_STRUCTURES) {
+		enum repair_outcome	outcome =
+			xfs_repair_metadata(mnt_fd, f->scrub_type,
+					hme->e.group.gno, 0, 0);
+
+		pthread_mutex_lock(&ctx->conlock);
+		report_health_repair(pfx, hme, f->event_mask, outcome);
+		pthread_mutex_unlock(&ctx->conlock);
+	}
+}
+
+/* React to a rtgroup corruption event by repairing it. */
+static void
+try_repair_rtgroup(
+	struct healer_ctx			*ctx,
+	const struct hme_prefix			*pfx,
+	int					mnt_fd,
+	const struct xfs_health_monitor_event	*hme)
+{
+#define X(code, type) { XFS_RTGROUP_GEOM_SICK_ ## code, XFS_SCRUB_TYPE_ ## type }
+	static const struct u32_scrub		RTG_STRUCTURES[] = {
+		X(SUPER,	RGSUPER),
+		X(BITMAP,	RTBITMAP),
+		X(SUMMARY,	RTSUM),
+		X(RMAPBT,	RTRMAPBT),
+		X(REFCNTBT,	RTREFCBT),
+		{0,		0},
+	};
+#undef X
+	const struct u32_scrub *f;
+
+	foreach_scrub_type(f, hme->e.group.mask, RTG_STRUCTURES) {
+		enum repair_outcome	outcome =
+			xfs_repair_metadata(mnt_fd, f->scrub_type,
+					hme->e.group.gno, 0, 0);
+
+		pthread_mutex_lock(&ctx->conlock);
+		report_health_repair(pfx, hme, f->event_mask, outcome);
+		pthread_mutex_unlock(&ctx->conlock);
+	}
+}
+
+/* React to an inode-domain corruption event by repairing it. */
+static void
+try_repair_inode(
+	struct healer_ctx			*ctx,
+	const struct hme_prefix			*pfx,
+	int					mnt_fd,
+	const struct xfs_health_monitor_event	*hme)
+{
+#define X(code, type) { XFS_BS_SICK_ ## code, XFS_SCRUB_TYPE_ ## type }
+	static const struct u32_scrub		INODE_STRUCTURES[] = {
+		X(INODE,	INODE),
+		X(BMBTD,	BMBTD),
+		X(BMBTA,	BMBTA),
+		X(BMBTC,	BMBTC),
+		X(DIR,		DIR),
+		X(XATTR,	XATTR),
+		X(SYMLINK,	SYMLINK),
+		X(PARENT,	PARENT),
+		X(DIRTREE,	DIRTREE),
+		{0,		0},
+	};
+#undef X
+	const struct u32_scrub *f;
+
+	foreach_scrub_type(f, hme->e.inode.mask, INODE_STRUCTURES) {
+		enum repair_outcome	outcome =
+			xfs_repair_metadata(mnt_fd, f->scrub_type,
+					0, hme->e.inode.ino, hme->e.inode.gen);
+
+		pthread_mutex_lock(&ctx->conlock);
+		report_health_repair(pfx, hme, f->event_mask, outcome);
+		pthread_mutex_unlock(&ctx->conlock);
+	}
+}
+
+/* Repair a metadata corruption. */
+int
+repair_metadata(
+	struct healer_ctx			*ctx,
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	int					repair_fd;
+	int					ret;
+
+	ret = weakhandle_reopen(ctx->wh, &repair_fd);
+	if (ret) {
+		fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
+				_("cannot open filesystem to repair"),
+				strerror(errno));
+		return ret;
+	}
+
+	switch (hme->domain) {
+	case XFS_HEALTH_MONITOR_DOMAIN_FS:
+		try_repair_wholefs(ctx, pfx, repair_fd, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_AG:
+		try_repair_ag(ctx, pfx, repair_fd, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_RTGROUP:
+		try_repair_rtgroup(ctx, pfx, repair_fd, hme);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_INODE:
+		try_repair_inode(ctx, pfx, repair_fd, hme);
+		break;
+	}
+
+	close(repair_fd);
+	return 0;
+}
+
+/* Ask the kernel if it supports repairs. */
+bool
+healer_can_repair(
+	struct healer_ctx	*ctx)
+{
+	struct xfs_scrub_metadata sm = {
+		.sm_type = XFS_SCRUB_TYPE_PROBE,
+		.sm_flags = XFS_SCRUB_IFLAG_REPAIR,
+	};
+	int			ret;
+
+	/* assume any errno means not supported */
+	ret = ioctl(ctx->mnt.fd, XFS_IOC_SCRUB_METADATA, &sm);
+	return ret ? false : true;
+}
diff --git a/healer/weakhandle.c b/healer/weakhandle.c
new file mode 100644
index 00000000000000..53df43b03e16cc
--- /dev/null
+++ b/healer/weakhandle.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2025-2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include <pthread.h>
+#include <stdlib.h>
+
+#include "platform_defs.h"
+#include "handle.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/workqueue.h"
+#include "xfs_healer.h"
+
+struct weakhandle {
+	/* Shared reference to the user's mountpoint for logging */
+	const char		*mntpoint;
+
+	/* Shared reference to the getmntent fsname for reconnecting */
+	const char		*fsname;
+
+	/* handle to root dir */
+	void			*hanp;
+	size_t			hlen;
+};
+
+/* Capture a handle for a given filesystem, but don't attach to the fd. */
+int
+weakhandle_alloc(
+	int			fd,
+	const char		*mountpoint,
+	const char		*fsname,
+	struct weakhandle	**whp)
+{
+	struct weakhandle	*wh;
+	int			ret;
+
+	*whp = NULL;
+
+	if (fd < 0 || !mountpoint) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	wh = calloc(1, sizeof(struct weakhandle));
+	if (!wh)
+		return -1;
+
+	wh->mntpoint = mountpoint;
+	wh->fsname = fsname;
+
+	ret = fd_to_handle(fd, &wh->hanp, &wh->hlen);
+	if (ret)
+		goto out_wh;
+
+	*whp = wh;
+	return 0;
+
+out_wh:
+	free(wh);
+	return -1;
+}
+
+/* Reopen a file handle obtained via weak reference. */
+int
+weakhandle_reopen(
+	struct weakhandle	*wh,
+	int			*fd)
+{
+	void			*hanp;
+	size_t			hlen;
+	int			mnt_fd;
+	int			ret;
+
+	*fd = -1;
+
+	mnt_fd = open(wh->mntpoint, O_RDONLY);
+	if (mnt_fd < 0)
+		return -1;
+
+	ret = fd_to_handle(mnt_fd, &hanp, &hlen);
+	if (ret)
+		goto out_mntfd;
+
+	if (hlen != wh->hlen || memcmp(hanp, wh->hanp, hlen)) {
+		errno = ESTALE;
+		goto out_handle;
+	}
+
+	free_handle(hanp, hlen);
+	*fd = mnt_fd;
+	return 0;
+
+out_handle:
+	free_handle(hanp, hlen);
+out_mntfd:
+	close(mnt_fd);
+	return -1;
+}
+
+/* Tear down a weak handle */
+void
+weakhandle_free(
+	struct weakhandle	**whp)
+{
+	struct weakhandle	*wh = *whp;
+
+	if (wh) {
+		free_handle(wh->hanp, wh->hlen);
+		free(wh);
+	}
+
+	*whp = NULL;
+}
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index e0076fff381632..488f2a5310d0fd 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -59,6 +59,18 @@ event_loggable(
 	return ctx->log || event_not_actionable(hme);
 }
 
+/* Are we going to try a repair? */
+static inline bool
+event_repairable(
+	const struct healer_ctx			*ctx,
+	const struct xfs_health_monitor_event	*hme)
+{
+	if (event_not_actionable(hme))
+		return false;
+
+	return ctx->want_repair && hme->type == XFS_HEALTH_MONITOR_TYPE_SICK;
+}
+
 /* Handle an event asynchronously. */
 static void
 handle_event(
@@ -70,6 +82,7 @@ handle_event(
 	struct xfs_health_monitor_event	*hme = arg;
 	struct healer_ctx		*ctx = wq->wq_ctx;
 	const bool loggable = event_loggable(ctx, hme);
+	const bool will_repair = event_repairable(ctx, hme);
 
 	hme_prefix_init(&pfx, ctx->mntpoint);
 
@@ -83,6 +96,10 @@ handle_event(
 		pthread_mutex_unlock(&ctx->conlock);
 	}
 
+	/* Initiate a repair if appropriate. */
+	if (will_repair)
+		repair_metadata(ctx, &pfx, hme);
+
 	free(hme);
 }
 
@@ -156,6 +173,40 @@ setup_monitor(
 		goto out_mnt_fd;
 	}
 
+	if (ctx->want_repair) {
+		/* Check that the kernel supports repairs at all. */
+		if (!healer_can_repair(ctx)) {
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+ _("XFS online repair is not supported, exiting"));
+			goto out_mnt_fd;
+		}
+
+		/* Check for backref metadata that makes repair effective. */
+		if (!healer_has_rmapbt(ctx))
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+ _("XFS online repair is less effective without rmap btrees."));
+
+		if (!healer_has_parent(ctx))
+			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+ _("XFS online repair is less effective without parent pointers."));
+
+	}
+
+	/*
+	 * Open weak-referenced file handle to mountpoint so that we can
+	 * reconnect to the mountpoint to start repairs.
+	 */
+	if (ctx->want_repair) {
+		ret = weakhandle_alloc(ctx->mnt.fd, ctx->mntpoint,
+				ctx->fsname, &ctx->wh);
+		if (ret) {
+			fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
+					_("creating weak fshandle"),
+					strerror(errno));
+			goto out_mnt_fd;
+		}
+	}
+
 	/*
 	 * Open the health monitor, then close the mountpoint to avoid pinning
 	 * it.  We can reconnect later if need be.
@@ -287,6 +338,7 @@ teardown_monitor(
 		ctx->mon_fp = NULL;
 	}
 	free(ctx->mon_buf);
+	weakhandle_free(&ctx->wh);
 	ctx->mon_buf = NULL;
 }
 
@@ -301,6 +353,7 @@ usage(void)
 	fprintf(stderr, _("  --everything  Capture all events.\n"));
 	fprintf(stderr, _("  --foreground  Process events as soon as possible.\n"));
 	fprintf(stderr, _("  --quiet       Do not log health events to stdout.\n"));
+	fprintf(stderr, _("  --repair      Always repair corrupt metadata.\n"));
 	fprintf(stderr, _("  -V            Print version.\n"));
 
 	exit(EXIT_FAILURE);
@@ -312,6 +365,7 @@ enum long_opt_nr {
 	LOPT_FOREGROUND,
 	LOPT_HELP,
 	LOPT_QUIET,
+	LOPT_REPAIR,
 
 	LOPT_MAX,
 };
@@ -342,6 +396,7 @@ main(
 		[LOPT_FOREGROUND]  = {"foreground", no_argument, &ctx.foreground, 1 },
 		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
 		[LOPT_QUIET]	   = {"quiet", no_argument, &ctx.log, 0 },
+		[LOPT_REPAIR]	   = {"repair", no_argument, &ctx.want_repair, 1 },
 
 		[LOPT_MAX]	   = {NULL, 0, NULL, 0 },
 	};
diff --git a/libfrog/flagmap.c b/libfrog/flagmap.c
index 631c4bbc8f1dc0..ce413297780a2a 100644
--- a/libfrog/flagmap.c
+++ b/libfrog/flagmap.c
@@ -44,6 +44,23 @@ mask_to_string(
 		snprintf(buf, bufsize, "%s0x%llx", tag, mask & ~seen);
 }
 
+/*
+ * Given a mapping of bits to strings and a bitmask, return the string
+ * corresponding to the lowest set bit in the mask.
+ */
+const char *
+lowest_set_mask_string(
+	const struct flag_map	*map,
+	unsigned long long	mask)
+{
+	for (; map->string; map++) {
+		if (mask & map->flag)
+			return _(map->string);
+	}
+
+	return _("unknown flag");
+}
+
 /*
  * Given a mapping of values to strings and a value, return the matching string
  * or confusion.
diff --git a/libfrog/healthevent.c b/libfrog/healthevent.c
index 8520cb3218fb03..193738332dbd71 100644
--- a/libfrog/healthevent.c
+++ b/libfrog/healthevent.c
@@ -358,3 +358,120 @@ hme_report_event(
 		break;
 	}
 }
+
+static const char *
+repair_outcome_string(
+	enum repair_outcome	o)
+{
+	switch (o) {
+	case REPAIR_FAILED:
+		return _("Repair unsuccessful; offline repair required.");
+	case REPAIR_PROBABLY_OK:
+		return _("Seems correct but cross-referencing failed; offline repair recommended.");
+	case REPAIR_UNNECESSARY:
+		return _("No modification needed.");
+	case REPAIR_SUCCESS:
+		return _("Repairs successful.");
+	}
+
+	return NULL;
+}
+
+/* Report inode metadata repair */
+static void
+report_inode_repair(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme,
+	uint32_t				domain_mask,
+	enum repair_outcome			outcome)
+{
+	if (hme_prefix_has_path(pfx))
+		printf("%s %s: %s\n",
+				pfx->path,
+				lowest_set_mask_string(inode_structs,
+						       domain_mask),
+				repair_outcome_string(outcome));
+	else
+		printf("%s %s %llu %s 0x%x %s: %s\n",
+				pfx->mountpoint,
+				_("ino"),
+				(unsigned long long)hme->e.inode.ino,
+				_("gen"),
+				hme->e.inode.gen,
+				lowest_set_mask_string(inode_structs,
+						       domain_mask),
+				repair_outcome_string(outcome));
+	fflush(stdout);
+}
+
+/* Report AG metadata repair */
+static void
+report_ag_repair(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme,
+	uint32_t				domain_mask,
+	enum repair_outcome			outcome)
+{
+	printf("%s %s 0x%x %s: %s\n", pfx->mountpoint,
+			_("agno"),
+			hme->e.group.gno,
+			lowest_set_mask_string(ag_structs, domain_mask),
+			repair_outcome_string(outcome));
+	fflush(stdout);
+}
+
+/* Report rtgroup metadata repair */
+static void
+report_rtgroup_repair(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme,
+	uint32_t				domain_mask,
+	enum repair_outcome			outcome)
+{
+	printf("%s %s 0x%x %s: %s\n", pfx->mountpoint,
+			_("rgno"),
+			hme->e.group.gno,
+			lowest_set_mask_string(rtgroup_structs, domain_mask),
+			repair_outcome_string(outcome));
+	fflush(stdout);
+}
+
+/* Report fs-wide metadata repair */
+static void
+report_fs_repair(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme,
+	uint32_t				domain_mask,
+	enum repair_outcome			outcome)
+{
+	printf("%s %s: %s\n", pfx->mountpoint,
+			lowest_set_mask_string(fs_structs, domain_mask),
+			repair_outcome_string(outcome));
+	fflush(stdout);
+}
+
+/* Log a repair event to stdout. */
+void
+report_health_repair(
+	const struct hme_prefix			*pfx,
+	const struct xfs_health_monitor_event	*hme,
+	uint32_t				domain_mask,
+	enum repair_outcome			outcome)
+{
+	switch (hme->domain) {
+	case XFS_HEALTH_MONITOR_DOMAIN_INODE:
+		report_inode_repair(pfx, hme, domain_mask, outcome);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_AG:
+		report_ag_repair(pfx, hme, domain_mask, outcome);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_RTGROUP:
+		report_rtgroup_repair(pfx, hme, domain_mask, outcome);
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_FS:
+		report_fs_repair(pfx, hme, domain_mask, outcome);
+		break;
+	default:
+		break;
+	}
+}
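
For illustration, the table walk in lowest_set_mask_string() can be tried
on its own.  The sketch below is a minimal, self-contained approximation:
struct demo_flag_map, demo_structs and demo_first_match() are hypothetical
stand-ins rather than the libfrog definitions, and the gettext wrapper is
left out.  The loop returns the first table entry that intersects the mask,
so it yields the lowest set bit only if the table is sorted by ascending
flag value, which the sketch assumes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for libfrog's struct flag_map. */
struct demo_flag_map {
	unsigned long long	flag;
	const char		*string;
};

/* Illustrative table, sorted by ascending flag value; names are made up. */
static const struct demo_flag_map demo_structs[] = {
	{ 1ULL << 0,	"superblock" },
	{ 1ULL << 1,	"free space header" },
	{ 1ULL << 2,	"inode header" },
	{ 0,		NULL },
};

/* Return the string of the first table entry whose flag is in the mask. */
static const char *
demo_first_match(
	const struct demo_flag_map	*map,
	unsigned long long		mask)
{
	for (; map->string; map++) {
		if (mask & map->flag)
			return map->string;
	}

	return "unknown flag";
}
```

With this table, a mask of 0x6 would be reported under the entry for bit 1,
and a mask with only unlisted bits set falls through to the fallback
string.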



* [PATCH 12/26] xfs_healer: use getparents to look up file names
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (10 preceding siblings ...)
  2026-03-19  4:41   ` [PATCH 11/26] xfs_healer: enable repairing filesystems Darrick J. Wong
@ 2026-03-19  4:41   ` Darrick J. Wong
  2026-03-19  4:42   ` [PATCH 13/26] xfs_healer: create a per-mount background monitoring service Darrick J. Wong
                     ` (13 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:41 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the kernel tells us about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
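A minimal sketch of how a reportable path might be assembled from the
mountpoint plus the component names that a parent pointer walk yields.
demo_render_path() and its components array are hypothetical; the patch
itself relies on path_list_to_string() from libfrog, and the sketch merely
assumes that each component is joined with a leading '/'.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "<mountpoint>/<c0>/<c1>..." into buf.
 * Returns 0 on success or -1 if the result would not fit.
 */
static int
demo_render_path(
	const char		*mntpt,
	const char *const	*components,
	size_t			nr,
	char			*buf,
	size_t			len)
{
	int			mntpt_len = (int)strlen(mntpt);
	size_t			off;
	size_t			i;
	int			ret;

	/* Trim trailing slashes so that we don't emit "//". */
	while (mntpt_len > 0 && mntpt[mntpt_len - 1] == '/')
		mntpt_len--;

	ret = snprintf(buf, len, "%.*s", mntpt_len, mntpt);
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	off = (size_t)ret;

	/* Append each component, checking for truncation every time. */
	for (i = 0; i < nr; i++) {
		ret = snprintf(buf + off, len - off, "/%s", components[i]);
		if (ret < 0 || (size_t)ret >= len - off)
			return -1;
		off += (size_t)ret;
	}

	return 0;
}
```

Comparing the snprintf return value against the remaining space is what
detects truncation; a path that is silently cut short would be worse for
logging than falling back to the inode number and generation.
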
 healer/xfs_healer.h |    6 ++++
 healer/fsrepair.c   |   16 ++++++++-
 healer/weakhandle.c |   86 +++++++++++++++++++++++++++++++++++++++++++++++++++
 healer/xfs_healer.c |   45 ++++++++++++++++++++++++++-
 4 files changed, 149 insertions(+), 4 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index a4de1ad32a408f..6d12921245934c 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -61,6 +61,10 @@ static inline bool healer_has_parent(const struct healer_ctx *ctx)
 	return ctx->mnt.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_PARENT;
 }
 
+void lookup_path(struct healer_ctx *ctx,
+		const struct xfs_health_monitor_event *hme,
+		struct hme_prefix *pfx);
+
 /* repair.c */
 int repair_metadata(struct healer_ctx *ctx, const struct hme_prefix *pfx,
 		const struct xfs_health_monitor_event *hme);
@@ -71,5 +75,7 @@ int weakhandle_alloc(int fd, const char *mountpoint, const char *fsname,
 		struct weakhandle **whp);
 int weakhandle_reopen(struct weakhandle *wh, int *fd);
 void weakhandle_free(struct weakhandle **whp);
+int weakhandle_getpath_for(struct weakhandle *wh, uint64_t ino, uint32_t gen,
+		char *path, size_t pathlen);
 
 #endif /* XFS_HEALER_XFS_HEALER_H_ */
diff --git a/healer/fsrepair.c b/healer/fsrepair.c
index 907afca3dba8a7..4534104f8a6ac1 100644
--- a/healer/fsrepair.c
+++ b/healer/fsrepair.c
@@ -164,7 +164,7 @@ try_repair_rtgroup(
 static void
 try_repair_inode(
 	struct healer_ctx			*ctx,
-	const struct hme_prefix			*pfx,
+	const struct hme_prefix			*orig_pfx,
 	int					mnt_fd,
 	const struct xfs_health_monitor_event	*hme)
 {
@@ -182,13 +182,25 @@ try_repair_inode(
 		{0,		0},
 	};
 #undef X
-	const struct u32_scrub *f;
+	struct hme_prefix	new_pfx;
+	const struct hme_prefix	*pfx = orig_pfx;
+	const struct u32_scrub	*f;
 
 	foreach_scrub_type(f, hme->e.inode.mask, INODE_STRUCTURES) {
 		enum repair_outcome	outcome =
 			xfs_repair_metadata(mnt_fd, f->scrub_type,
 					0, hme->e.inode.ino, hme->e.inode.gen);
 
+		/*
+		 * Try again to find the file path, maybe we fixed the dir
+		 * tree.
+		 */
+		if (!hme_prefix_has_path(pfx)) {
+			lookup_path(ctx, hme, &new_pfx);
+			if (hme_prefix_has_path(&new_pfx))
+				pfx = &new_pfx;
+		}
+
 		pthread_mutex_lock(&ctx->conlock);
 		report_health_repair(pfx, hme, f->event_mask, outcome);
 		pthread_mutex_unlock(&ctx->conlock);
diff --git a/healer/weakhandle.c b/healer/weakhandle.c
index 53df43b03e16cc..8950e0eb1e5a43 100644
--- a/healer/weakhandle.c
+++ b/healer/weakhandle.c
@@ -11,6 +11,8 @@
 #include "handle.h"
 #include "libfrog/fsgeom.h"
 #include "libfrog/workqueue.h"
+#include "libfrog/getparents.h"
+#include "libfrog/paths.h"
 #include "xfs_healer.h"
 
 struct weakhandle {
@@ -113,3 +115,87 @@ weakhandle_free(
 
 	*whp = NULL;
 }
+
+struct bufvec {
+	char	*buf;
+	size_t	len;
+};
+
+static int
+render_path(
+	const char		*mntpt,
+	const struct path_list	*path,
+	void			*arg)
+{
+	struct bufvec		*args = arg;
+	int			mntpt_len = strlen(mntpt);
+	ssize_t			ret;
+
+	/* Trim trailing slashes from the mountpoint */
+	while (mntpt_len > 0 && mntpt[mntpt_len - 1] == '/')
+		mntpt_len--;
+
+	ret = snprintf(args->buf, args->len, "%.*s", mntpt_len, mntpt);
+	if (ret < 0 || ret >= args->len)
+		return 0;
+
+	ret = path_list_to_string(path, args->buf + ret, args->len - ret);
+	if (ret < 0)
+		return 0;
+
+	/* magic code that means we found one */
+	return ECANCELED;
+}
+
+/* Render any path to this weakhandle into the specified buffer. */
+int
+weakhandle_getpath_for(
+	struct weakhandle	*wh,
+	uint64_t		ino,
+	uint32_t		gen,
+	char			*path,
+	size_t			pathlen)
+{
+	struct xfs_handle	fakehandle;
+	struct bufvec		bv = {
+		.buf		= path,
+		.len		= pathlen,
+	};
+	int			mnt_fd;
+	int			ret;
+
+	if (wh->hlen != sizeof(fakehandle)) {
+		errno = EINVAL;
+		return -1;
+	}
+	memcpy(&fakehandle, wh->hanp, sizeof(fakehandle));
+	fakehandle.ha_fid.fid_ino = ino;
+	fakehandle.ha_fid.fid_gen = gen;
+
+	ret = weakhandle_reopen(wh, &mnt_fd);
+	if (ret)
+		return ret;
+
+	/*
+	 * In the common case, files only have one parent; and what's the
+	 * chance that we'll need to walk past the second parent to find *one*
+	 * path that goes to the rootdir?  With a max filename length of 255
+	 * bytes, we pick 600 for the buffer size.
+	 */
+	ret = handle_walk_paths_fd(wh->mntpoint, mnt_fd, &fakehandle,
+			sizeof(fakehandle), 600, render_path, &bv);
+	switch (ret) {
+	case ECANCELED:
+		/* found a path */
+		ret = 0;
+		break;
+	default:
+		/* didn't find one */
+		errno = ENOENT;
+		ret = -1;
+		break;
+	}
+
+	close(mnt_fd);
+	return ret;
+}
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index 488f2a5310d0fd..63baf641cb6ec6 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -34,6 +34,39 @@ open_health_monitor(
 	return ioctl(mnt_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
 }
 
+/* Look up the path of the file named in this event, if we can. */
+void
+lookup_path(
+	struct healer_ctx			*ctx,
+	const struct xfs_health_monitor_event	*hme,
+	struct hme_prefix			*pfx)
+{
+	uint64_t				ino = 0;
+	uint32_t				gen = 0;
+	int					ret;
+
+	if (!healer_has_parent(ctx))
+		return;
+
+	switch (hme->domain) {
+	case XFS_HEALTH_MONITOR_DOMAIN_INODE:
+		ino = hme->e.inode.ino;
+		gen = hme->e.inode.gen;
+		break;
+	case XFS_HEALTH_MONITOR_DOMAIN_FILERANGE:
+		ino = hme->e.filerange.ino;
+		gen = hme->e.filerange.gen;
+		break;
+	default:
+		return;
+	}
+
+	ret = weakhandle_getpath_for(ctx->wh, ino, gen, pfx->path,
+			sizeof(pfx->path));
+	if (ret)
+		hme_prefix_clear_path(pfx);
+}
+
 /* Decide if this event can only be reported upon, and not acted upon. */
 static bool
 event_not_actionable(
@@ -86,6 +119,13 @@ handle_event(
 
 	hme_prefix_init(&pfx, ctx->mntpoint);
 
+	/*
+	 * Try to look up the file name for the file we're about to log or
+	 * about to repair (which always logs).
+	 */
+	if (loggable || will_repair)
+		lookup_path(ctx, hme, &pfx);
+
 	/*
 	 * Non-actionable events should always be logged, because they are 100%
 	 * informational.
@@ -194,9 +234,10 @@ setup_monitor(
 
 	/*
 	 * Open weak-referenced file handle to mountpoint so that we can
-	 * reconnect to the mountpoint to start repairs.
+	 * reconnect to the mountpoint to start repairs or to look up file
+	 * paths for logging.
 	 */
-	if (ctx->want_repair) {
+	if (ctx->want_repair || healer_has_parent(ctx)) {
 		ret = weakhandle_alloc(ctx->mnt.fd, ctx->mntpoint,
 				ctx->fsname, &ctx->wh);
 		if (ret) {



* [PATCH 13/26] xfs_healer: create a per-mount background monitoring service
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (11 preceding siblings ...)
  2026-03-19  4:41   ` [PATCH 12/26] xfs_healer: use getparents to look up file names Darrick J. Wong
@ 2026-03-19  4:42   ` Darrick J. Wong
  2026-03-19  4:42   ` [PATCH 14/26] xfs_healer: create a service to start the per-mount healer service Darrick J. Wong
                     ` (12 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:42 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a systemd service definition for our self-healing filesystem
daemon so that we can run it for every mounted filesystem.  Add a
hidden switch so that we can print the service unit name for fstests.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
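To give a feel for what --svcname prints, here is a rough sketch of turning
a mountpoint into an instance unit name.  It approximates the path escaping
rules documented in systemd.unit(5) (slashes become dashes, other special
bytes become \xNN, the root directory becomes "-").  demo_is_plain() and
demo_healer_unit_name() are hypothetical names; this is not the libfrog
implementation of systemd_path_instance_unit_name(), and it does not
handle "." or ".." path components.

```c
#include <stdio.h>
#include <string.h>

/* Bytes that may appear unescaped in a unit name. */
static int
demo_is_plain(unsigned char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') || c == ':' || c == '_' || c == '.';
}

/* Hypothetical: build "xfs_healer@<escaped path>.service" in out. */
static int
demo_healer_unit_name(
	const char	*path,
	char		*out,
	size_t		outlen)
{
	char		esc[256];
	size_t		o = 0;
	const char	*p = path;
	int		ret;

	/* Leading slashes are dropped; an empty remainder is the rootdir. */
	while (*p == '/')
		p++;
	if (*p == '\0')
		esc[o++] = '-';

	for (; *p; p++) {
		unsigned char	c = (unsigned char)*p;

		/* Need room for a four-byte escape plus the terminator. */
		if (o + 5 > sizeof(esc))
			return -1;

		if (c == '/') {
			/* Collapse runs of slashes, drop trailing ones. */
			while (p[1] == '/')
				p++;
			if (p[1] == '\0')
				break;
			esc[o++] = '-';
		} else if (demo_is_plain(c) && !(c == '.' && o == 0)) {
			esc[o++] = (char)c;
		} else {
			o += snprintf(esc + o, sizeof(esc) - o, "\\x%02x", c);
		}
	}
	esc[o] = '\0';

	ret = snprintf(out, outlen, "xfs_healer@%s.service", esc);
	return (ret < 0 || (size_t)ret >= outlen) ? -1 : 0;
}
```

Under these rules /mnt/data maps to xfs_healer@mnt-data.service, and the
root directory maps to xfs_healer@-.service, which is the instance that
DefaultInstance=- in the unit file selects.
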
 healer/xfs_healer.h            |    1 
 healer/Makefile                |   22 ++++++++
 healer/system-xfs_healer.slice |   31 ++++++++++++
 healer/xfs_healer.c            |   16 ++++++
 healer/xfs_healer@.service.in  |  107 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 healer/system-xfs_healer.slice
 create mode 100644 healer/xfs_healer@.service.in


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index 6d12921245934c..679bdc95ae48f8 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -26,6 +26,7 @@ struct healer_ctx {
 	int			everything;
 	int			foreground;
 	int			want_repair;
+	int			print_svcname;
 
 	/* fd and fs geometry for mount */
 	struct xfs_fd		mnt;
diff --git a/healer/Makefile b/healer/Makefile
index 981192b81af626..ee44aaee461250 100644
--- a/healer/Makefile
+++ b/healer/Makefile
@@ -22,7 +22,23 @@ LLDLIBS += $(LIBHANDLE) $(LIBFROG) $(LIBURCU) $(LIBPTHREAD)
 LTDEPENDENCIES += $(LIBHANDLE) $(LIBFROG)
 LLDFLAGS = -static
 
-default: depend $(LTCOMMAND)
+XFS_HEALER_SVCNAME=xfs_healer@.service
+CFLAGS += -DXFS_HEALER_SVCNAME=\"$(XFS_HEALER_SVCNAME)\"
+
+ifeq ($(HAVE_SYSTEMD),yes)
+INSTALL_HEALER += install-systemd
+SYSTEMD_SERVICES=\
+	system-xfs_healer.slice \
+	$(XFS_HEALER_SVCNAME)
+OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
+endif
+
+default: depend $(LTCOMMAND) $(SYSTEMD_SERVICES)
+
+%.service: %.service.in $(builddefs)
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@pkg_libexec_dir@|$(PKG_LIBEXEC_DIR)|g" \
+		   < $< > $@
 
 include $(BUILDRULES)
 
@@ -32,6 +48,10 @@ install-healer: default
 	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
 	$(INSTALL) -m 755 $(LTCOMMAND) $(PKG_LIBEXEC_DIR)
 
+install-systemd: default
+	$(INSTALL) -m 755 -d $(SYSTEMD_SYSTEM_UNIT_DIR)
+	$(INSTALL) -m 644 $(SYSTEMD_SERVICES) $(SYSTEMD_SYSTEM_UNIT_DIR)
+
 install-dev:
 
 -include .dep
diff --git a/healer/system-xfs_healer.slice b/healer/system-xfs_healer.slice
new file mode 100644
index 00000000000000..b8f5bca03963ff
--- /dev/null
+++ b/healer/system-xfs_healer.slice
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=xfs_healer background service slice
+Before=slices.target
+
+[Slice]
+
+# If the CPU usage cgroup controller is available, don't use more than 2 cores
+# for all background processes.  One thread to read events, another to run
+# repairs.
+CPUQuota=200%
+CPUAccounting=true
+
+[Install]
+# As of systemd 249, the systemd cgroupv2 configuration code will drop resource
+# controllers from the root and system.slice cgroups at startup if it doesn't
+# find any direct dependencies that require a given controller.  Newly
+# activated units with resource control directives are created under the system
+# slice but do not cause a reconfiguration of the slice's resource controllers.
+# Hence we cannot put CPUQuota= into the xfs_healer service units directly.
+#
+# For the CPUQuota directive to have any effect, we must therefore create an
+# explicit definition file for the slice that systemd creates to contain the
+# xfs_healer instance units (e.g. xfs_healer@.service) and we must configure
+# this slice as a dependency of the system slice to establish the direct
+# dependency relation.
+WantedBy=system.slice
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index 63baf641cb6ec6..1a26ffe830e5fe 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -407,6 +407,7 @@ enum long_opt_nr {
 	LOPT_HELP,
 	LOPT_QUIET,
 	LOPT_REPAIR,
+	LOPT_SVCNAME,
 
 	LOPT_MAX,
 };
@@ -438,6 +439,7 @@ main(
 		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
 		[LOPT_QUIET]	   = {"quiet", no_argument, &ctx.log, 0 },
 		[LOPT_REPAIR]	   = {"repair", no_argument, &ctx.want_repair, 1 },
+		[LOPT_SVCNAME]	   = {"svcname", no_argument, &ctx.print_svcname, 1 },
 
 		[LOPT_MAX]	   = {NULL, 0, NULL, 0 },
 	};
@@ -474,6 +476,20 @@ main(
 
 	ctx.mntpoint = argv[optind];
 
+	if (ctx.print_svcname) {
+		char	unitname[PATH_MAX];
+
+		ret = systemd_path_instance_unit_name(XFS_HEALER_SVCNAME,
+				ctx.mntpoint, unitname, sizeof(unitname));
+		if (ret) {
+			perror(ctx.mntpoint);
+			return EXIT_FAILURE;
+		}
+
+		printf("%s\n", unitname);
+		return EXIT_SUCCESS;
+	}
+
 	ret = setup_monitor(&ctx);
 	if (ret)
 		goto out_events;
diff --git a/healer/xfs_healer@.service.in b/healer/xfs_healer@.service.in
new file mode 100644
index 00000000000000..385257872b0cbb
--- /dev/null
+++ b/healer/xfs_healer@.service.in
@@ -0,0 +1,107 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Self Healing of XFS Metadata for %f
+
+# Explicitly require the capabilities that this program needs
+ConditionCapability=CAP_SYS_ADMIN
+ConditionCapability=CAP_DAC_OVERRIDE
+
+# Must be a mountpoint
+ConditionPathIsMountPoint=%f
+RequiresMountsFor=%f
+
+[Service]
+Type=exec
+Environment=SERVICE_MODE=1
+ExecStart=@pkg_libexec_dir@/xfs_healer %f
+SyslogIdentifier=%N
+
+# Create the service underneath the healer background service slice so that we
+# can control resource usage.
+Slice=system-xfs_healer.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# xfs_healer avoids pinning mounted filesystems by recording the file handle
+# for the provided mountpoint (%f) before opening the health monitor, after
+# which it closes the fd for the mountpoint.  If repairs are needed, it will
+# reopen the mountpoint, resample the file handle, and proceed only if the
+# handles match.  If the filesystem is unmounted, the daemon exits.  If the
+# mountpoint moves, repairs will not be attempted against the wrong filesystem.
+#
+# Due to this resampling behavior, xfs_healer must see the same filesystem
+# mount tree inside the service container as outside, with the same ro/rw
+# state.  BindPaths doesn't work on the paths that are made readonly by
+# ProtectSystem and ProtectHome, so it is not possible to set either option.
+# DynamicUser sets ProtectSystem, so that also cannot be used.  We cannot use
+# BindPaths to bind the desired mountpoint somewhere under /tmp like xfs_scrub
+# does because that pins the mount.
+#
+# Regrettably, this leaves xfs_healer less hardened than xfs_scrub.
+# Surprisingly, this doesn't affect xfs_healer's score dramatically.
+DynamicUser=false
+ProtectSystem=false
+ProtectHome=no
+PrivateTmp=true
+PrivateDevices=true
+
+# Don't let healer complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  healer doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No memory pages that are both writable and executable
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_healer needs these privileges to open the rootdir and monitor
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+AmbientCapabilities=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+NoNewPrivileges=true
+
+# xfs_healer doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
+
+[Install]
+WantedBy=multi-user.target
+# If someone tries to enable the template itself, translate that into enabling
+# this service on the root directory at systemd startup time.  In the
+# initramfs, the udev rules in xfs_healer.rules run before systemd starts.
+DefaultInstance=-
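
The "record, close, reopen, resample, compare" pattern described in the
unit file comments can be sketched in a few lines.  This is only an
analogy: xfs_healer records an XFS file handle for the mountpoint, whereas
the sketch swaps in the device and inode numbers from stat() as the
recorded identity.  struct demo_weakref and both helpers are hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical identity sample; the real daemon stores a file handle. */
struct demo_weakref {
	dev_t	dev;
	ino_t	ino;
};

/* Record the identity of whatever is at path without keeping an fd. */
static int
demo_weakref_sample(
	const char		*path,
	struct demo_weakref	*ref)
{
	struct stat		sb;

	if (stat(path, &sb))
		return -1;

	ref->dev = sb.st_dev;
	ref->ino = sb.st_ino;
	return 0;
}

/*
 * Reopen path and return the fd only if it is still the same object.
 * Returns -1 with errno set to ESTALE if something else is there now.
 */
static int
demo_weakref_reopen(
	const char			*path,
	const struct demo_weakref	*ref)
{
	struct stat			sb;
	int				fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &sb) || sb.st_dev != ref->dev || sb.st_ino != ref->ino) {
		close(fd);
		errno = ESTALE;
		return -1;
	}

	return fd;
}
```

Because no descriptor is held between the sample and the reopen, the
mount is not pinned in the meantime; the comparison is what keeps repairs
from being aimed at a different filesystem that later appears at the same
path.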



* [PATCH 14/26] xfs_healer: create a service to start the per-mount healer service
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (12 preceding siblings ...)
  2026-03-19  4:42   ` [PATCH 13/26] xfs_healer: create a per-mount background monitoring service Darrick J. Wong
@ 2026-03-19  4:42   ` Darrick J. Wong
  2026-03-19  4:42   ` [PATCH 15/26] xfs_healer: don't start service if kernel support unavailable Darrick J. Wong
                     ` (11 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:42 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a daemon to wait for xfs mount events via fanotify and start up
the per-mount healer service.  It's important that we're running in the
same mount namespace as the mount, so we're a fanotify client to avoid
having to filter the mount namespaces ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
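Each mount notification carries variable-length info records after the
fixed event metadata, and the daemon picks the mount id out of them.  The
sketch below shows one defensive way to walk such records.  It uses
hypothetical demo_* structures that only mirror the general shape of the
fanotify info header (type, padding, length) so that it builds without
recent kernel headers; DEMO_INFO_TYPE_MNT is a made-up value, not the
kernel's constant.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define DEMO_INFO_TYPE_MNT	1	/* hypothetical record type */

/* Hypothetical mirror of a length-prefixed info record header. */
struct demo_info_header {
	uint8_t			info_type;
	uint8_t			pad;
	uint16_t		len;	/* whole record, header included */
};

struct demo_info_mnt {
	struct demo_info_header	hdr;
	uint64_t		mnt_id;
};

/* Copy up to max_ids mount ids out of buf; return how many were found. */
static size_t
demo_collect_mnt_ids(
	const void		*buf,
	size_t			buflen,
	uint64_t		*ids,
	size_t			max_ids)
{
	const char		*p = buf;
	size_t			off = 0;
	size_t			nr = 0;

	while (buflen - off >= sizeof(struct demo_info_header)) {
		struct demo_info_header	hdr;

		memcpy(&hdr, p + off, sizeof(hdr));

		/* A short or overlong record would loop or overrun. */
		if (hdr.len < sizeof(hdr) || hdr.len > buflen - off)
			break;

		if (hdr.info_type == DEMO_INFO_TYPE_MNT &&
		    hdr.len >= sizeof(struct demo_info_mnt) &&
		    nr < max_ids) {
			struct demo_info_mnt	mnt;

			memcpy(&mnt, p + off, sizeof(mnt));
			ids[nr++] = mnt.mnt_id;
		}

		off += hdr.len;
	}

	return nr;
}
```

Rejecting a record length smaller than the header guards against spinning
forever on a zero-length record, and unknown record types are skipped
rather than treated as errors.
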
 libfrog/systemd.h                  |   23 +-
 configure.ac                       |    1 
 healer/Makefile                    |   16 +-
 healer/xfs_healer_start.c          |  368 ++++++++++++++++++++++++++++++++++++
 healer/xfs_healer_start.service.in |   85 ++++++++
 include/builddefs.in               |    5 
 m4/package_libcdev.m4              |   24 ++
 7 files changed, 511 insertions(+), 11 deletions(-)
 create mode 100644 healer/xfs_healer_start.c
 create mode 100644 healer/xfs_healer_start.service.in


diff --git a/libfrog/systemd.h b/libfrog/systemd.h
index c96df4afa39aa6..8a0970282d1080 100644
--- a/libfrog/systemd.h
+++ b/libfrog/systemd.h
@@ -22,6 +22,20 @@ static inline bool systemd_is_service(void)
 	return getenv("SERVICE_MODE") != NULL;
 }
 
+/* Special processing for a service/daemon program that is exiting. */
+static inline int
+systemd_service_exit_now(int ret)
+{
+	/*
+	 * If we're being run as a service, the return code must fit the LSB
+	 * init script action error guidelines, which is to say that we
+	 * compress all errors to 1 ("generic or unspecified error", LSB 5.0
+	 * section 22.2) and hope the admin will scan the log for what actually
+	 * happened.
+	 */
+	return ret != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+}
+
 /* Special processing for a service/daemon program that is exiting. */
 static inline int
 systemd_service_exit(int ret)
@@ -35,14 +49,7 @@ systemd_service_exit(int ret)
 	 */
 	sleep(2);
 
-	/*
-	 * If we're being run as a service, the return code must fit the LSB
-	 * init script action error guidelines, which is to say that we
-	 * compress all errors to 1 ("generic or unspecified error", LSB 5.0
-	 * section 22.2) and hope the admin will scan the log for what actually
-	 * happened.
-	 */
-	return ret != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
+	return systemd_service_exit_now(ret);
 }
 
 #endif /* __LIBFROG_SYSTEMD_H__ */
diff --git a/configure.ac b/configure.ac
index 90af1f84035ee6..e098cf0530415b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -194,6 +194,7 @@ if test "$have_listmount" = "yes"; then
 	AC_HAVE_LISTMOUNT_NS_FD
 	AC_HAVE_STATMOUNT_SUPPORTED_MASK
 fi
+AC_HAVE_FANOTIFY_MOUNTINFO
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
         AC_PACKAGE_CHECK_UBSAN
diff --git a/healer/Makefile b/healer/Makefile
index ee44aaee461250..1eeb727682008b 100644
--- a/healer/Makefile
+++ b/healer/Makefile
@@ -9,6 +9,7 @@ include $(builddefs)
 INSTALL_HEALER = install-healer
 
 LTCOMMAND = xfs_healer
+BUILD_TARGETS = $(LTCOMMAND)
 
 CFILES = \
 fsrepair.c \
@@ -31,9 +32,18 @@ SYSTEMD_SERVICES=\
 	system-xfs_healer.slice \
 	$(XFS_HEALER_SVCNAME)
 OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
-endif
+endif # HAVE_SYSTEMD
 
-default: depend $(LTCOMMAND) $(SYSTEMD_SERVICES)
+ifeq ($(HAVE_HEALER_START_DEPS),yes)
+BUILD_TARGETS += xfs_healer_start
+SYSTEMD_SERVICES += xfs_healer_start.service
+endif # xfs_healer_start deps
+
+default: depend $(BUILD_TARGETS) $(SYSTEMD_SERVICES)
+
+xfs_healer_start: $(SUBDIRS) xfs_healer_start.o $(LTDEPENDENCIES)
+	@echo "    [LD]     $@"
+	$(Q)$(LTLINK) -o $@ $(LDFLAGS) xfs_healer_start.o $(LDLIBS)
 
 %.service: %.service.in $(builddefs)
 	@echo "    [SED]    $@"
@@ -46,7 +56,7 @@ install: $(INSTALL_HEALER)
 
 install-healer: default
 	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
-	$(INSTALL) -m 755 $(LTCOMMAND) $(PKG_LIBEXEC_DIR)
+	$(INSTALL) -m 755 $(BUILD_TARGETS) $(PKG_LIBEXEC_DIR)
 
 install-systemd: default
 	$(INSTALL) -m 755 -d $(SYSTEMD_SYSTEM_UNIT_DIR)
diff --git a/healer/xfs_healer_start.c b/healer/xfs_healer_start.c
new file mode 100644
index 00000000000000..c016e915da79a4
--- /dev/null
+++ b/healer/xfs_healer_start.c
@@ -0,0 +1,368 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include <errno.h>
+#include <err.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/fanotify.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <linux/mount.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <limits.h>
+
+#include "platform_defs.h"
+#include "libfrog/systemd.h"
+#include "libfrog/statmount.h"
+
+static int debug = 0;
+static const char *progname = "xfs_healer_start";
+
+/* Start the xfs_healer service for a given mountpoint. */
+static void
+start_healer(
+	const char	*mntpoint)
+{
+	char		unitname[PATH_MAX];
+	int		ret;
+
+	ret = systemd_path_instance_unit_name(XFS_HEALER_SVCNAME, mntpoint,
+			unitname, PATH_MAX);
+	if (ret) {
+		fprintf(stderr, "%s: %s\n", mntpoint,
+				_("Could not determine xfs_healer unit name."));
+		return;
+	}
+
+	/*
+	 * Restart so that we aren't foiled by an existing unit that's slowly
+	 * working its way off a cycled mount.
+	 */
+	ret = systemd_manage_unit(UM_RESTART, unitname);
+	if (ret) {
+		fprintf(stderr, "%s: %s: %s\n", mntpoint,
+				_("Could not start xfs_healer service unit"),
+				unitname);
+		return;
+	}
+
+	printf("%s: %s\n", mntpoint, _("xfs_healer service started."));
+	fflush(stdout);
+}
+
+#define REQUIRED_STATMOUNT_FIELDS (STATMOUNT_FS_TYPE | \
+				   STATMOUNT_MNT_POINT | \
+				   STATMOUNT_MNT_ROOT)
+
+/* Process a newly discovered mountpoint. */
+static void
+examine_mount(
+	int			mnt_ns_fd,
+	uint64_t		mnt_id)
+{
+	size_t			smbuf_size = libfrog_statmount_sizeof(4096);
+	struct statmount	*smbuf = alloca(smbuf_size);
+	int			ret;
+
+	ret = libfrog_statmount(mnt_id, mnt_ns_fd, REQUIRED_STATMOUNT_FIELDS,
+			smbuf, smbuf_size);
+	if (ret) {
+		perror("statmount");
+		return;
+	}
+
+	if (debug) {
+		printf("mount: id 0x%llx fstype %s mountpoint %s mntroot %s\n",
+				(unsigned long long)mnt_id,
+				(smbuf->mask & STATMOUNT_FS_TYPE) ?
+					smbuf->str + smbuf->fs_type : "null",
+				(smbuf->mask & STATMOUNT_MNT_POINT) ?
+					smbuf->str + smbuf->mnt_point : "null",
+				(smbuf->mask & STATMOUNT_MNT_ROOT) ?
+					smbuf->str + smbuf->mnt_root : "null");
+		fflush(stdout);
+	}
+
+	/* Look for mount points for the root dir of an XFS filesystem. */
+	if ((smbuf->mask & REQUIRED_STATMOUNT_FIELDS) !=
+			   REQUIRED_STATMOUNT_FIELDS)
+		return;
+
+	if (!strcmp(smbuf->str + smbuf->fs_type, "xfs") &&
+	    !strcmp(smbuf->str + smbuf->mnt_root, "/"))
+		start_healer(smbuf->str + smbuf->mnt_point);
+}
+
+/* Translate fanotify mount events into something we can process. */
+static void
+handle_mount_event(
+	const struct fanotify_event_metadata	*event,
+	int					mnt_ns_fd)
+{
+	const struct fanotify_event_info_header	*info;
+	const struct fanotify_event_info_mnt	*mnt;
+	int					off;
+
+	if (event->fd != FAN_NOFD) {
+		if (debug)
+			fprintf(stderr, "Expected FAN_NOFD, got fd=%d\n",
+					event->fd);
+		return;
+	}
+
+	switch (event->mask) {
+	case FAN_MNT_ATTACH:
+		if (debug) {
+			printf("FAN_MNT_ATTACH (len=%d)\n", event->event_len);
+			fflush(stdout);
+		}
+		break;
+	default:
+		/* should never get here */
+		return;
+	}
+
+	for (off = sizeof(*event) ; off < event->event_len;
+	     off += info->len) {
+		info = (struct fanotify_event_info_header *)
+			((char *) event + off);
+
+		switch (info->info_type) {
+		case FAN_EVENT_INFO_TYPE_MNT:
+			mnt = (struct fanotify_event_info_mnt *) info;
+
+			if (debug) {
+				printf("Mount record: len=%d mnt_id=0x%llx\n",
+						mnt->hdr.len, mnt->mnt_id);
+				fflush(stdout);
+			}
+
+			examine_mount(mnt_ns_fd, mnt->mnt_id);
+			break;
+
+		default:
+			if (debug)
+				fprintf(stderr,
+ "Unexpected fanotify event info_type=%d len=%d\n",
+						info->info_type, info->len);
+			break;
+		}
+	}
+}
+
+/* Extract mount attachment notifications from fanotify. */
+static void
+handle_notifications(
+	char				*buffer,
+	ssize_t				len,
+	int				mnt_ns_fd)
+{
+	struct fanotify_event_metadata	*event =
+		(struct fanotify_event_metadata *) buffer;
+
+	for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
+
+		switch (event->mask) {
+		case FAN_MNT_ATTACH:
+			handle_mount_event(event, mnt_ns_fd);
+			break;
+		default:
+			if (debug)
+				fprintf(stderr,
+ "Unexpected fanotify mark: 0x%llx\n",
+					(unsigned long long)event->mask);
+			break;
+		}
+	}
+}
+
+#define NR_MNT_IDS		(32)
+
+/* Start healer services for existing XFS mounts. */
+static int
+start_existing_mounts(
+	int			mnt_ns_fd)
+{
+	uint64_t		mnt_ids[NR_MNT_IDS];
+	uint64_t		cursor = LISTMOUNT_INIT_CURSOR;
+	int			i;
+	int			ret;
+
+	while ((ret = libfrog_listmount(LSMT_ROOT, mnt_ns_fd, &cursor,
+					mnt_ids, NR_MNT_IDS)) > 0) {
+		for (i = 0; i < ret; i++)
+			examine_mount(mnt_ns_fd, mnt_ids[i]);
+	}
+
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			fprintf(stderr, "%s\n",
+ _("This program requires the listmount system call."));
+		else
+			perror("listmount");
+		return -1;
+	}
+
+	return 0;
+}
+
+static void __attribute__((noreturn))
+usage(void)
+{
+	fprintf(stderr, "%s %s %s\n", _("Usage:"), progname, _("[OPTIONS]"));
+	fprintf(stderr, "\n");
+	fprintf(stderr, _("Options:\n"));
+	fprintf(stderr, _("  --debug      Enable debugging messages.\n"));
+	fprintf(stderr, _("  --mountns    Path to the mount namespace file.\n"));
+	fprintf(stderr, _("  --supported  Make sure we can actually run.\n"));
+	fprintf(stderr, _("  -V           Print version.\n"));
+
+	exit(EXIT_FAILURE);
+}
+
+enum long_opt_nr {
+	LOPT_DEBUG,
+	LOPT_HELP,
+	LOPT_MOUNTNS,
+	LOPT_SUPPORTED,
+
+	LOPT_MAX,
+};
+
+int
+main(
+	int		argc,
+	char		*argv[])
+{
+	char		buffer[BUFSIZ];
+	const char	*mntns = NULL;
+	int		mnt_ns_fd;
+	int		fan_fd;
+	int		c;
+	int		option_index;
+	int		support_check = 0;
+	int		ret = 0;
+
+	struct option long_options[] = {
+		[LOPT_SUPPORTED] = {"supported", no_argument, &support_check, 1 },
+		[LOPT_DEBUG]	 = {"debug", no_argument, &debug, 1 },
+		[LOPT_HELP]	 = {"help", no_argument, NULL, 0 },
+		[LOPT_MOUNTNS]	 = {"mountns", required_argument, NULL, 0 },
+		[LOPT_MAX]	 = {NULL, 0, NULL, 0 },
+	};
+
+	while ((c = getopt_long(argc, argv, "V", long_options, &option_index))
+			!= EOF) {
+		switch (c) {
+		case 0:
+			switch (option_index) {
+			case LOPT_MOUNTNS:
+				mntns = optarg;
+				break;
+			case LOPT_HELP:
+				usage();
+				break;
+			default:
+				break;
+			}
+			break;
+		case 'V':
+			fprintf(stdout, "%s %s %s\n", progname, _("version"),
+					VERSION);
+			fflush(stdout);
+			return EXIT_SUCCESS;
+		default:
+			usage();
+			break;
+		}
+	}
+
+	/*
+	 * Try to open the mount namespace file for the current process.
+	 * fanotify requires this mount namespace file to send mount attachment
+	 * events, so this is required for correct functionality.
+	 */
+	mnt_ns_fd = open(mntns ? mntns : DEFAULT_MOUNTNS_FILE, O_RDONLY);
+	if (mnt_ns_fd < 0) {
+		if (errno == ENOENT && !mntns) {
+			perror(DEFAULT_MOUNTNS_FILE);
+			fprintf(stderr, "%s\n",
+ _("This program requires mount namespace support."));
+		} else {
+			perror(mntns ? mntns : DEFAULT_MOUNTNS_FILE);
+		}
+		ret = 1;
+		goto out;
+	}
+	if (mnt_ns_fd == DEFAULT_MOUNTNS_FD && mntns != NULL) {
+		/*
+		 * We specified a path to a mount namespace file but got fd 0,
+		 * which (for listmount and statmount) means to use the current
+		 * process' mount namespace.  That's probably not what the user
+		 * wanted.
+		 */
+		fprintf(stderr,
+ _("%s: got bad file descriptor for mount namespace\n"),
+				mntns);
+		ret = 1;
+		goto out;
+	}
+
+	fan_fd = fanotify_init(FAN_REPORT_MNT, O_RDONLY);
+	if (fan_fd < 0) {
+		perror("fanotify_init");
+		if (errno == EINVAL)
+			fprintf(stderr, "%s\n",
+ _("This program requires fanotify mount event support."));
+		ret = 1;
+		goto out;
+	}
+
+	ret = fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MNTNS,
+			FAN_MNT_ATTACH, mnt_ns_fd, NULL);
+	if (ret) {
+		perror("fanotify_mark");
+		goto out;
+	}
+
+	if (support_check) {
+		/*
+		 * We're being run as an ExecCondition process and we've
+		 * decided to start the main service.  There is no need to wait
+		 * for journald because the ExecStart version of ourselves will
+		 * take care of the waiting for us.
+		 */
+		return systemd_service_exit_now(0);
+	}
+
+	if (debug) {
+		printf("fanotify active\n");
+		fflush(stdout);
+	}
+
+	ret = start_existing_mounts(mnt_ns_fd);
+	if (ret)
+		goto out;
+
+	while (1) {
+		ssize_t bytes_read = read(fan_fd, buffer, BUFSIZ);
+
+		if (bytes_read < 0) {
+			perror("fanotify");
+			ret = 1;
+			break;
+		}
+
+		handle_notifications(buffer, bytes_read, mnt_ns_fd);
+	}
+
+out:
+	return systemd_service_exit(ret);
+}
diff --git a/healer/xfs_healer_start.service.in b/healer/xfs_healer_start.service.in
new file mode 100644
index 00000000000000..6fd34eafa48c33
--- /dev/null
+++ b/healer/xfs_healer_start.service.in
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (c) 2026 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Start Self Healing of XFS Metadata
+
+[Service]
+Type=exec
+Environment=SERVICE_MODE=1
+ExecCondition=@pkg_libexec_dir@/xfs_healer_start --supported
+ExecStart=@pkg_libexec_dir@/xfs_healer_start
+
+# This service starts more services, so we want it to restart any time the
+# program crashes or exits with an error.
+Restart=on-failure
+
+# Create the service underneath the healer background service slice so that we
+# can control resource usage.
+Slice=system-xfs_healer.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Must run with full privileges in a shared mount namespace so that we can
+# see new mounts and tell systemd to start the per-mount healer service.
+DynamicUser=false
+ProtectSystem=false
+ProtectHome=no
+PrivateTmp=true
+PrivateDevices=true
+
+# Don't let healer complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  healer doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access except to the systemd control socket
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=AF_UNIX
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No memory pages that are both writable and executable
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and fanotify
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+SystemCallFilter=fanotify_init fanotify_mark
+
+# xfs_healer_start needs these privileges to open the rootdir and monitor
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+AmbientCapabilities=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+NoNewPrivileges=true
+
+# xfs_healer_start doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
+
+[Install]
+WantedBy=multi-user.target
diff --git a/include/builddefs.in b/include/builddefs.in
index 0ab2bf1702f0f0..bdba9cd9037900 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -124,6 +124,7 @@ HAVE_LISTMOUNT = @have_listmount@
 HAVE_LISTMOUNT_NS_FD = @have_listmount_ns_fd@
 HAVE_STATMOUNT_SUPPORTED_MASK = @have_statmount_supported_mask@
 NEED_INTERNAL_STATMOUNT = @need_internal_statmount@
+HAVE_FANOTIFY_MOUNTINFO = @have_fanotify_mountinfo@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
@@ -159,6 +160,10 @@ ifeq ($(HAVE_LIBURCU_ATOMIC64),yes)
 PCFLAGS += -DHAVE_LIBURCU_ATOMIC64
 endif
 
+ifeq ($(ENABLE_HEALER)$(HAVE_SYSTEMD)$(HAVE_LISTMOUNT)$(HAVE_FANOTIFY_MOUNTINFO),yesyesyesyes)
+HAVE_HEALER_START_DEPS = yes
+endif
+
 SANITIZER_CFLAGS += @addrsan_cflags@ @threadsan_cflags@ @ubsan_cflags@ @autovar_init_cflags@
 SANITIZER_LDFLAGS += @addrsan_ldflags@ @threadsan_ldflags@ @ubsan_ldflags@
 
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index ec4a3ef444b705..9586bc01fe0f25 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -452,3 +452,27 @@ AC_DEFUN([AC_HAVE_STATMOUNT_SUPPORTED_MASK],
     AC_SUBST(have_statmount_supported_mask)
     AC_SUBST(need_internal_statmount)
   ])
+
+#
+# Check if fanotify will give us mount notifications (6.15).
+#
+AC_DEFUN([AC_HAVE_FANOTIFY_MOUNTINFO],
+  [AC_MSG_CHECKING([for fanotify mount events])
+    AC_LINK_IFELSE(
+    [AC_LANG_PROGRAM([[
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <fcntl.h>
+#include <sys/fanotify.h>
+  ]], [[
+	struct fanotify_event_info_mnt info;
+
+	int fan_fd = fanotify_init(FAN_REPORT_MNT, 0);
+	fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MNTNS, FAN_MNT_ATTACH,
+			-1, NULL);
+  ]])
+    ], have_fanotify_mountinfo=yes
+       AC_MSG_RESULT(yes),
+       AC_MSG_RESULT(no))
+    AC_SUBST(have_fanotify_mountinfo)
+  ])



* [PATCH 15/26] xfs_healer: don't start service if kernel support unavailable
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (13 preceding siblings ...)
  2026-03-19  4:42   ` [PATCH 14/26] xfs_healer: create a service to start the per-mount healer service Darrick J. Wong
@ 2026-03-19  4:42   ` Darrick J. Wong
  2026-03-19  4:42   ` [PATCH 16/26] xfs_healer: use the autofsck fsproperty to select mode Darrick J. Wong
                     ` (10 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:42 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use ExecCondition= in the system service to check if kernel support for
the health monitor is available.  If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.
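
For reference, systemd.service(5) documents how the exit status of an
ExecCondition= command is interpreted.  The following is a minimal sketch
of those rules, not code from this patch: the enum and both function names
are made up for illustration, and the exit code that xfs_healer really
passes through systemd_service_exit_now() is not shown in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* What systemd does with a unit after its ExecCondition= command ends. */
enum cond_result {
	COND_START,	/* exit 0: continue on to ExecStart= */
	COND_SKIP,	/* exit 1-254: skip the unit, do not mark it failed */
	COND_FAIL,	/* exit 255 or abnormal exit: unit is marked failed */
};

/* Classify an ExecCondition= outcome as described in systemd.service(5). */
static enum cond_result
classify_exec_condition(
	bool	exited_normally,
	int	exit_code)
{
	if (!exited_normally)
		return COND_FAIL;

	/* A process exit code always fits in eight bits. */
	assert(exit_code >= 0 && exit_code <= 255);

	if (exit_code == 255)
		return COND_FAIL;
	if (exit_code == 0)
		return COND_START;
	return COND_SKIP;
}

/* Hypothetical helper: choose the exit code for a --supported probe. */
static int
support_check_exit_code(
	bool	kernel_has_healthmon)
{
	return kernel_has_healthmon ? 0 : 1;
}
```

Under these assumptions, a probe that exits with 1 makes systemd skip the
unit quietly instead of logging a failed start.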

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h           |    1 +
 healer/xfs_healer.c           |   47 ++++++++++++++++++++++++++++++++---------
 healer/xfs_healer@.service.in |    1 +
 3 files changed, 39 insertions(+), 10 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index 679bdc95ae48f8..7caa6c66a59c6f 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -27,6 +27,7 @@ struct healer_ctx {
 	int			foreground;
 	int			want_repair;
 	int			print_svcname;
+	int			support_check;
 
 	/* fd and fs geometry for mount */
 	struct xfs_fd		mnt;
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index 1a26ffe830e5fe..8c48d2d9ee8c2d 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -191,8 +191,14 @@ healer_nproc(
 	return ctx->foreground ? platform_nproc() : 1;
 }
 
+enum mon_state {
+	MON_START,
+	MON_EXIT,
+	MON_ERROR,
+};
+
 /* Set ourselves up to monitor the given mountpoint for health events. */
-static int
+static enum mon_state
 setup_monitor(
 	struct healer_ctx	*ctx)
 {
@@ -203,7 +209,7 @@ setup_monitor(
 	ret = xfd_open(&ctx->mnt, ctx->mntpoint, O_RDONLY);
 	if (ret) {
 		perror(ctx->mntpoint);
-		return -1;
+		return MON_ERROR;
 	}
 
 	ret = try_capture_fsinfo(ctx);
@@ -274,6 +280,16 @@ setup_monitor(
 	close(ctx->mnt.fd);
 	ctx->mnt.fd = -1;
 
+	/*
+	 * At this point, we know that the kernel is capable of repairing the
+	 * filesystem and telling us that it needs repairs.  If the user only
+	 * wanted us to check for the capability, we're done.
+	 */
+	if (ctx->support_check) {
+		close(mon_fd);
+		return MON_EXIT;
+	}
+
 	/*
 	 * mon_fp consumes mon_fd.  We intentionally leave mon_fp attached to
 	 * the context so that we keep the monitoring fd open until we've torn
@@ -305,7 +321,7 @@ setup_monitor(
 	}
 	ctx->queue_active = true;
 
-	return 0;
+	return MON_START;
 
 out_mon_fp:
 	if (ctx->mon_fp)
@@ -318,7 +334,7 @@ setup_monitor(
 	if (ctx->mnt.fd >= 0)
 		close(ctx->mnt.fd);
 	ctx->mnt.fd = -1;
-	return -1;
+	return MON_ERROR;
 }
 
 /* Monitor the given mountpoint for health events. */
@@ -395,6 +411,7 @@ usage(void)
 	fprintf(stderr, _("  --foreground  Process events as soon as possible.\n"));
 	fprintf(stderr, _("  --quiet       Do not log health events to stdout.\n"));
 	fprintf(stderr, _("  --repair      Always repair corrupt metadata.\n"));
+	fprintf(stderr, _("  --supported   Check that health monitoring is supported.\n"));
 	fprintf(stderr, _("  -V            Print version.\n"));
 
 	exit(EXIT_FAILURE);
@@ -407,6 +424,7 @@ enum long_opt_nr {
 	LOPT_HELP,
 	LOPT_QUIET,
 	LOPT_REPAIR,
+	LOPT_SUPPORTED,
 	LOPT_SVCNAME,
 
 	LOPT_MAX,
@@ -439,6 +457,7 @@ main(
 		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
 		[LOPT_QUIET]	   = {"quiet", no_argument, &ctx.log, 0 },
 		[LOPT_REPAIR]	   = {"repair", no_argument, &ctx.want_repair, 1 },
+		[LOPT_SUPPORTED]   = {"supported", no_argument, &ctx.support_check, 1 },
 		[LOPT_SVCNAME]	   = {"svcname", no_argument, &ctx.print_svcname, 1 },
 
 		[LOPT_MAX]	   = {NULL, 0, NULL, 0 },
@@ -490,14 +509,22 @@ main(
 		return EXIT_SUCCESS;
 	}
 
-	ret = setup_monitor(&ctx);
-	if (ret)
-		goto out_events;
+	switch (setup_monitor(&ctx)) {
+	case MON_ERROR:
+		ret = -1;
+		break;
+	case MON_EXIT:
+		ret = 0;
+		break;
+	case MON_START:
+		ret = 0;
+		monitor(&ctx);
+		break;
+	}
 
-	monitor(&ctx);
-
-out_events:
 	teardown_monitor(&ctx);
 	free((char *)ctx.fsname);
+	if (ctx.support_check)
+		return systemd_service_exit_now(ret);
 	return systemd_service_exit(ret);
 }
diff --git a/healer/xfs_healer@.service.in b/healer/xfs_healer@.service.in
index 385257872b0cbb..53f89cf9c4333d 100644
--- a/healer/xfs_healer@.service.in
+++ b/healer/xfs_healer@.service.in
@@ -17,6 +17,7 @@ RequiresMountsFor=%f
 [Service]
 Type=exec
 Environment=SERVICE_MODE=1
+ExecCondition=@pkg_libexec_dir@/xfs_healer --supported %f
 ExecStart=@pkg_libexec_dir@/xfs_healer %f
 SyslogIdentifier=%N
 



* [PATCH 16/26] xfs_healer: use the autofsck fsproperty to select mode
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (14 preceding siblings ...)
  2026-03-19  4:42   ` [PATCH 15/26] xfs_healer: don't start service if kernel support unavailable Darrick J. Wong
@ 2026-03-19  4:42   ` Darrick J. Wong
  2026-03-19  4:43   ` [PATCH 17/26] xfs_healer: run full scrub after lost corruption events or targeted repair failure Darrick J. Wong
                     ` (9 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:42 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make the xfs_healer background service query the autofsck filesystem
property to figure out which operating mode it should use.
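
The mapping from property value to operating mode can be summarized in a
standalone sketch.  This is an illustration of the policy in the diff
below, not the patch itself: healer_mode and mode_from_autofsck() are
hypothetical names, and the value strings ("none", "check", "optimize",
"repair") are assumed to match what fsprop_autofsck_read() accepts.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical operating modes for the daemon. */
enum healer_mode {
	HEALER_EXIT,		/* do not run at all */
	HEALER_LOG_ONLY,	/* log events, do not repair */
	HEALER_REPAIR,		/* repair corrupt metadata */
};

/*
 * Pick a mode from the autofsck property text.  An empty string stands for
 * a property that is unset or could not be read.
 */
static enum healer_mode
mode_from_autofsck(
	const char	*value,
	bool		has_backrefs)	/* rmapbt or parent pointers enabled */
{
	assert(value != NULL);

	if (!strcmp(value, "none"))
		return HEALER_EXIT;
	if (!strcmp(value, "check") || !strcmp(value, "optimize"))
		return HEALER_LOG_ONLY;
	if (!strcmp(value, "repair"))
		return HEALER_REPAIR;

	/*
	 * No usable advice: log only if backref metadata exist, otherwise
	 * treat the filesystem as too old and stay out of the way.
	 */
	return has_backrefs ? HEALER_LOG_ONLY : HEALER_EXIT;
}
```

Note that passing --repair on the command line bypasses this lookup
entirely, because main() clears ctx.autofsck when want_repair is set.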

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h    |    1 
 libfrog/fsproperties.h |    5 ++
 healer/xfs_healer.c    |  102 +++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 105 insertions(+), 3 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index 7caa6c66a59c6f..a2a46053928e33 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -28,6 +28,7 @@ struct healer_ctx {
 	int			want_repair;
 	int			print_svcname;
 	int			support_check;
+	int			autofsck;
 
 	/* fd and fs geometry for mount */
 	struct xfs_fd		mnt;
diff --git a/libfrog/fsproperties.h b/libfrog/fsproperties.h
index 11d6530bc9a6d6..1cf90d058765b2 100644
--- a/libfrog/fsproperties.h
+++ b/libfrog/fsproperties.h
@@ -52,6 +52,11 @@ bool fsprop_validate(const char *name, const char *value);
 
 #define FSPROP_AUTOFSCK_NAME		"autofsck"
 
+/* filesystem property name for fgetxattr */
+#define VFS_FSPROP_AUTOFSCK_NAME	(FSPROP_NAMESPACE \
+					 FSPROP_NAME_PREFIX \
+					 FSPROP_AUTOFSCK_NAME)
+
 enum fsprop_autofsck {
 	FSPROP_AUTOFSCK_UNSET = 0,	/* do not set property */
 	FSPROP_AUTOFSCK_NONE,		/* no background scrubs */
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index 8c48d2d9ee8c2d..f4bee495979324 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -6,6 +6,7 @@
 #include "xfs.h"
 #include <pthread.h>
 #include <stdlib.h>
+#include <sys/xattr.h>
 
 #include "platform_defs.h"
 #include "libfrog/fsgeom.h"
@@ -13,6 +14,7 @@
 #include "libfrog/healthevent.h"
 #include "libfrog/workqueue.h"
 #include "libfrog/systemd.h"
+#include "libfrog/fsproperties.h"
 #include "xfs_healer.h"
 
 /* Program name; needed for libfrog error reports. */
@@ -191,6 +193,63 @@ healer_nproc(
 	return ctx->foreground ? platform_nproc() : 1;
 }
 
+enum want_repair {
+	WR_REPAIR,
+	WR_LOG_ONLY,
+	WR_EXIT,
+};
+
+/* Determine want_repair from the autofsck filesystem property. */
+static enum want_repair
+want_repair_from_autofsck(
+	struct healer_ctx	*ctx)
+{
+	char			valuebuf[FSPROP_MAX_VALUELEN + 1] = { 0 };
+	enum fsprop_autofsck	shval;
+	ssize_t			ret;
+
+	/*
+	 * Any OS error (including ENODATA) or string parsing error is treated
+	 * the same as an unrecognized value.
+	 */
+	ret = fgetxattr(ctx->mnt.fd, VFS_FSPROP_AUTOFSCK_NAME, valuebuf,
+			FSPROP_MAX_VALUELEN);
+	if (ret < 0)
+		goto no_advice;
+
+	shval = fsprop_autofsck_read(valuebuf);
+	switch (shval) {
+	case FSPROP_AUTOFSCK_NONE:
+		/* don't run at all */
+		ret = WR_EXIT;
+		break;
+	case FSPROP_AUTOFSCK_CHECK:
+	case FSPROP_AUTOFSCK_OPTIMIZE:
+		/* log events, do not repair */
+		ret = WR_LOG_ONLY;
+		break;
+	case FSPROP_AUTOFSCK_REPAIR:
+		/* repair stuff */
+		ret = WR_REPAIR;
+		break;
+	case FSPROP_AUTOFSCK_UNSET:
+		goto no_advice;
+	}
+
+	return ret;
+
+no_advice:
+	/*
+	 * For an unrecognized value, log but do not fix runtime corruption if
+	 * backref metadata are enabled.  If no backref metadata are available,
+	 * the fs is too old so don't run at all.
+	 */
+	if (healer_has_rmapbt(ctx) || healer_has_parent(ctx))
+		return WR_LOG_ONLY;
+
+	return WR_EXIT;
+}
+
 enum mon_state {
 	MON_START,
 	MON_EXIT,
@@ -219,14 +278,45 @@ setup_monitor(
 		goto out_mnt_fd;
 	}
 
-	if (ctx->want_repair) {
-		/* Check that the kernel supports repairs at all. */
-		if (!healer_can_repair(ctx)) {
+	if (ctx->autofsck) {
+		switch (want_repair_from_autofsck(ctx)) {
+		case WR_EXIT:
+			printf("%s: %s\n", ctx->mntpoint,
+ _("Disabling daemon per autofsck directive."));
+			fflush(stdout);
+			close(ctx->mnt.fd);
+			return MON_EXIT;
+		case WR_REPAIR:
+			ctx->want_repair = 1;
+			printf("%s: %s\n", ctx->mntpoint,
+ _("Automatically repairing per autofsck directive."));
+			fflush(stdout);
+			break;
+		case WR_LOG_ONLY:
+			ctx->want_repair = 0;
+			ctx->log = 1;
+			printf("%s: %s\n", ctx->mntpoint,
+ _("Only logging errors per autofsck directive."));
+			fflush(stdout);
+			break;
+		}
+	}
+
+	/* Check that the kernel supports repairs at all. */
+	if (ctx->want_repair && !healer_can_repair(ctx)) {
+		if (!ctx->autofsck) {
 			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
  _("XFS online repair is not supported, exiting"));
 			goto out_mnt_fd;
 		}
 
+		printf("%s: %s\n", ctx->mntpoint,
+ _("XFS online repair is not supported, will report only"));
+		fflush(stdout);
+		ctx->want_repair = 0;
+	}
+
+	if (ctx->want_repair) {
 		/* Check for backref metadata that makes repair effective. */
 		if (!healer_has_rmapbt(ctx))
 			fprintf(stderr, "%s: %s\n", ctx->mntpoint,
@@ -409,6 +499,7 @@ usage(void)
 	fprintf(stderr, _("  --debug       Enable debugging messages.\n"));
 	fprintf(stderr, _("  --everything  Capture all events.\n"));
 	fprintf(stderr, _("  --foreground  Process events as soon as possible.\n"));
+	fprintf(stderr, _("  --no-autofsck Do not use the \"autofsck\" fs property to decide to repair.\n"));
 	fprintf(stderr, _("  --quiet       Do not log health events to stdout.\n"));
 	fprintf(stderr, _("  --repair      Always repair corrupt metadata.\n"));
 	fprintf(stderr, _("  --supported   Check that health monitoring is supported.\n"));
@@ -422,6 +513,7 @@ enum long_opt_nr {
 	LOPT_EVERYTHING,
 	LOPT_FOREGROUND,
 	LOPT_HELP,
+	LOPT_NO_AUTOFSCK,
 	LOPT_QUIET,
 	LOPT_REPAIR,
 	LOPT_SUPPORTED,
@@ -439,6 +531,7 @@ main(
 		.conlock	= (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER,
 		.log		= 1,
 		.mnt.fd		= -1,
+		.autofsck	= 1,
 	};
 	int			option_index;
 	int			vflag = 0;
@@ -455,6 +548,7 @@ main(
 		[LOPT_EVERYTHING]  = {"everything", no_argument, &ctx.everything, 1 },
 		[LOPT_FOREGROUND]  = {"foreground", no_argument, &ctx.foreground, 1 },
 		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
+		[LOPT_NO_AUTOFSCK] = {"no-autofsck", no_argument, &ctx.autofsck, 0 },
 		[LOPT_QUIET]	   = {"quiet", no_argument, &ctx.log, 0 },
 		[LOPT_REPAIR]	   = {"repair", no_argument, &ctx.want_repair, 1 },
 		[LOPT_SUPPORTED]   = {"supported", no_argument, &ctx.support_check, 1 },
@@ -492,6 +586,8 @@ main(
 
 	if (optind != argc - 1)
 		usage();
+	if (ctx.want_repair)
+		ctx.autofsck = 0;
 
 	ctx.mntpoint = argv[optind];
 



* [PATCH 17/26] xfs_healer: run full scrub after lost corruption events or targeted repair failure
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (15 preceding siblings ...)
  2026-03-19  4:42   ` [PATCH 16/26] xfs_healer: use the autofsck fsproperty to select mode Darrick J. Wong
@ 2026-03-19  4:43   ` Darrick J. Wong
  2026-03-19  4:43   ` [PATCH 18/26] xfs_healer: use getmntent to find moved filesystems Darrick J. Wong
                     ` (8 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we fail to perform a spot repair of metadata or the kernel tells us
that it lost corruption events due to queue limits, initiate a full run
of the online fsck service to try to fix the error.
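
The escalation rule can be sketched as a pure function.  This is only an
illustration under simplifying assumptions: event_type, spot_outcome, and
want_full_scrub() are hypothetical names, and the real code makes the two
decisions in separate places (handle_event and repair_metadata).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the event type and the spot repair outcome. */
enum event_type { EVENT_CORRUPT, EVENT_LOST };
enum spot_outcome { SPOT_REPAIRED, SPOT_FAILED };

/* Return true if the healer should start the full online fsck service. */
static bool
want_full_scrub(
	enum event_type		type,
	bool			capture_everything,
	const enum spot_outcome	*outcomes,
	size_t			nr)
{
	size_t			i;

	assert(outcomes != NULL || nr == 0);

	/* The kernel dropped events, so we no longer know what is broken. */
	if (type == EVENT_LOST)
		return !capture_everything;

	/* Any failed spot repair means the targeted fix was not enough. */
	for (i = 0; i < nr; i++) {
		if (outcomes[i] == SPOT_FAILED)
			return true;
	}
	return false;
}
```

Starting the scrub unit is safe to repeat, since the patch is content if
the unit is already running.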

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h  |    3 ++
 healer/Makefile      |    2 +
 healer/fsrepair.c    |   81 +++++++++++++++++++++++++++++++++++++++++++++-----
 healer/weakhandle.c  |   13 ++++++++
 healer/xfs_healer.c  |    7 ++++
 include/builddefs.in |    1 +
 scrub/Makefile       |    7 ++--
 7 files changed, 102 insertions(+), 12 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index a2a46053928e33..e1370323bbd66a 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -72,6 +72,7 @@ void lookup_path(struct healer_ctx *ctx,
 int repair_metadata(struct healer_ctx *ctx, const struct hme_prefix *pfx,
 		const struct xfs_health_monitor_event *hme);
 bool healer_can_repair(struct healer_ctx *ctx);
+void run_full_repair(struct healer_ctx *ctx);
 
 /* weakhandle.c */
 int weakhandle_alloc(int fd, const char *mountpoint, const char *fsname,
@@ -80,5 +81,7 @@ int weakhandle_reopen(struct weakhandle *wh, int *fd);
 void weakhandle_free(struct weakhandle **whp);
 int weakhandle_getpath_for(struct weakhandle *wh, uint64_t ino, uint32_t gen,
 		char *path, size_t pathlen);
+int weakhandle_instance_unit_name(struct weakhandle *wh, const char *template,
+		char *unitname, size_t unitnamelen);
 
 #endif /* XFS_HEALER_XFS_HEALER_H_ */
diff --git a/healer/Makefile b/healer/Makefile
index 1eeb727682008b..b8ffce33e90d18 100644
--- a/healer/Makefile
+++ b/healer/Makefile
@@ -19,6 +19,8 @@ xfs_healer.c
 HFILES = \
 xfs_healer.h
 
+CFLAGS+=-DXFS_SCRUB_SVCNAME=\"$(XFS_SCRUB_SVCNAME)\"
+
 LLDLIBS += $(LIBHANDLE) $(LIBFROG) $(LIBURCU) $(LIBPTHREAD)
 LTDEPENDENCIES += $(LIBHANDLE) $(LIBFROG)
 LLDFLAGS = -static
diff --git a/healer/fsrepair.c b/healer/fsrepair.c
index 4534104f8a6ac1..9f8c128e395ebc 100644
--- a/healer/fsrepair.c
+++ b/healer/fsrepair.c
@@ -9,8 +9,14 @@
 #include "libfrog/fsgeom.h"
 #include "libfrog/workqueue.h"
 #include "libfrog/healthevent.h"
+#include "libfrog/systemd.h"
 #include "xfs_healer.h"
 
+enum what_next {
+	NEED_FULL_REPAIR,
+	REPAIR_DONE,
+};
+
 /* Translate scrub output flags to outcome. */
 static enum repair_outcome from_repair_oflags(uint32_t oflags)
 {
@@ -61,7 +67,7 @@ xfs_repair_metadata(
 }
 
 /* React to a fs-domain corruption event by repairing it. */
-static void
+static enum what_next
 try_repair_wholefs(
 	struct healer_ctx			*ctx,
 	const struct hme_prefix			*pfx,
@@ -90,11 +96,16 @@ try_repair_wholefs(
 		pthread_mutex_lock(&ctx->conlock);
 		report_health_repair(pfx, hme, f->event_mask, outcome);
 		pthread_mutex_unlock(&ctx->conlock);
+
+		if (outcome == REPAIR_FAILED)
+			return NEED_FULL_REPAIR;
 	}
+
+	return REPAIR_DONE;
 }
 
 /* React to an ag corruption event by repairing it. */
-static void
+static enum what_next
 try_repair_ag(
 	struct healer_ctx			*ctx,
 	const struct hme_prefix			*pfx,
@@ -126,11 +137,16 @@ try_repair_ag(
 		pthread_mutex_lock(&ctx->conlock);
 		report_health_repair(pfx, hme, f->event_mask, outcome);
 		pthread_mutex_unlock(&ctx->conlock);
+
+		if (outcome == REPAIR_FAILED)
+			return NEED_FULL_REPAIR;
 	}
+
+	return REPAIR_DONE;
 }
 
 /* React to a rtgroup corruption event by repairing it. */
-static void
+static enum what_next
 try_repair_rtgroup(
 	struct healer_ctx			*ctx,
 	const struct hme_prefix			*pfx,
@@ -157,11 +173,16 @@ try_repair_rtgroup(
 		pthread_mutex_lock(&ctx->conlock);
 		report_health_repair(pfx, hme, f->event_mask, outcome);
 		pthread_mutex_unlock(&ctx->conlock);
+
+		if (outcome == REPAIR_FAILED)
+			return NEED_FULL_REPAIR;
 	}
+
+	return REPAIR_DONE;
 }
 
 /* React to a inode-domain corruption event by repairing it. */
-static void
+static enum what_next
 try_repair_inode(
 	struct healer_ctx			*ctx,
 	const struct hme_prefix			*orig_pfx,
@@ -204,7 +225,12 @@ try_repair_inode(
 		pthread_mutex_lock(&ctx->conlock);
 		report_health_repair(pfx, hme, f->event_mask, outcome);
 		pthread_mutex_unlock(&ctx->conlock);
+
+		if (outcome == REPAIR_FAILED)
+			return NEED_FULL_REPAIR;
 	}
+
+	return REPAIR_DONE;
 }
 
 /* Repair a metadata corruption. */
@@ -214,6 +240,7 @@ repair_metadata(
 	const struct hme_prefix			*pfx,
 	const struct xfs_health_monitor_event	*hme)
 {
+	enum what_next				what_next;
 	int					repair_fd;
 	int					ret;
 
@@ -227,19 +254,25 @@ repair_metadata(
 
 	switch (hme->domain) {
 	case XFS_HEALTH_MONITOR_DOMAIN_FS:
-		try_repair_wholefs(ctx, pfx, repair_fd, hme);
+		what_next = try_repair_wholefs(ctx, pfx, repair_fd, hme);
 		break;
 	case XFS_HEALTH_MONITOR_DOMAIN_AG:
-		try_repair_ag(ctx, pfx, repair_fd, hme);
+		what_next = try_repair_ag(ctx, pfx, repair_fd, hme);
 		break;
 	case XFS_HEALTH_MONITOR_DOMAIN_RTGROUP:
-		try_repair_rtgroup(ctx, pfx, repair_fd, hme);
+		what_next = try_repair_rtgroup(ctx, pfx, repair_fd, hme);
 		break;
 	case XFS_HEALTH_MONITOR_DOMAIN_INODE:
-		try_repair_inode(ctx, pfx, repair_fd, hme);
+		what_next = try_repair_inode(ctx, pfx, repair_fd, hme);
 		break;
+	default:
+		what_next = REPAIR_DONE;
 	}
 
+	/* Transform into a full repair if we failed to fix this item. */
+	if (what_next == NEED_FULL_REPAIR)
+		run_full_repair(ctx);
+
 	close(repair_fd);
 	return 0;
 }
@@ -259,3 +292,35 @@ healer_can_repair(
 	ret = ioctl(ctx->mnt.fd, XFS_IOC_SCRUB_METADATA, &sm);
 	return ret ? false : true;
 }
+
+/* Run a full repair of the filesystem using the background fsck service. */
+void
+run_full_repair(
+	struct healer_ctx	*ctx)
+{
+	char			unitname[PATH_MAX];
+	int			ret;
+
+	ret = weakhandle_instance_unit_name(ctx->wh, XFS_SCRUB_SVCNAME,
+			unitname, PATH_MAX);
+	if (ret) {
+		fprintf(stderr, "%s: %s\n", ctx->mntpoint,
+				_("Could not determine xfs_scrub unit name."));
+		return;
+	}
+
+	/*
+	 * Scrub could already be repairing something, so try to start the unit
+	 * and be content if it's already running.
+	 */
+	ret = systemd_manage_unit(UM_START, unitname);
+	if (ret) {
+		fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
+				_("Could not start xfs_scrub service unit"),
+				unitname);
+		return;
+	}
+
+	printf("%s: %s\n", ctx->mntpoint, _("Full repairs in progress."));
+	fflush(stdout);
+}
diff --git a/healer/weakhandle.c b/healer/weakhandle.c
index 8950e0eb1e5a43..849aa2882700d4 100644
--- a/healer/weakhandle.c
+++ b/healer/weakhandle.c
@@ -13,6 +13,7 @@
 #include "libfrog/workqueue.h"
 #include "libfrog/getparents.h"
 #include "libfrog/paths.h"
+#include "libfrog/systemd.h"
 #include "xfs_healer.h"
 
 struct weakhandle {
@@ -199,3 +200,15 @@ weakhandle_getpath_for(
 	close(mnt_fd);
 	return ret;
 }
+
+/* Compute the systemd instance unit name for this mountpoint. */
+int
+weakhandle_instance_unit_name(
+	struct weakhandle	*wh,
+	const char		*template,
+	char			*unitname,
+	size_t			unitnamelen)
+{
+	return systemd_path_instance_unit_name(template, wh->mntpoint,
+			unitname, unitnamelen);
+}
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index f4bee495979324..09b88c754a550c 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -138,6 +138,13 @@ handle_event(
 		pthread_mutex_unlock(&ctx->conlock);
 	}
 
+	/*
+	 * If we didn't ask for all the metadata reports (including the healthy
+	 * ones) and the kernel tells us it lost something, run the full scan.
+	 */
+	if (hme->type == XFS_HEALTH_MONITOR_TYPE_LOST && !ctx->everything)
+		run_full_repair(ctx);
+
 	/* Initiate a repair if appropriate. */
 	if (will_repair)
 		repair_metadata(ctx, &pfx, hme);
diff --git a/include/builddefs.in b/include/builddefs.in
index bdba9cd9037900..3b52d1afd7031c 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -62,6 +62,7 @@ MKFS_CFG_DIR	= @datadir@/@pkg_name@/mkfs
 PKG_STATE_DIR	= @localstatedir@/lib/@pkg_name@
 
 XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_STAMP=$(PKG_STATE_DIR)/xfs_scrub_all_media.stamp
+XFS_SCRUB_SVCNAME=xfs_scrub@.service
 
 CC		= @cc@
 BUILD_CC	= @BUILD_CC@
diff --git a/scrub/Makefile b/scrub/Makefile
index ff79a265762332..aee49bfce100e2 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -8,7 +8,6 @@ include $(builddefs)
 
 SCRUB_PREREQS=$(HAVE_GETFSMAP)
 
-scrub_svcname=xfs_scrub@.service
 scrub_media_svcname=xfs_scrub_media@.service
 
 ifeq ($(SCRUB_PREREQS),yes)
@@ -21,7 +20,7 @@ XFS_SCRUB_SERVICE_ARGS = -b -o autofsck
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
-	$(scrub_svcname) \
+	$(XFS_SCRUB_SVCNAME) \
 	xfs_scrub_fail@.service \
 	$(scrub_media_svcname) \
 	xfs_scrub_media_fail@.service \
@@ -123,7 +122,7 @@ xfs_scrub_all.timer: xfs_scrub_all.timer.in $(builddefs)
 $(XFS_SCRUB_ALL_PROG): $(XFS_SCRUB_ALL_PROG).in $(builddefs) $(TOPDIR)/libfrog/gettext.py
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
-		   -e "s|@scrub_svcname@|$(scrub_svcname)|g" \
+		   -e "s|@scrub_svcname@|$(XFS_SCRUB_SVCNAME)|g" \
 		   -e "s|@scrub_media_svcname@|$(scrub_media_svcname)|g" \
 		   -e "s|@pkg_version@|$(PKG_VERSION)|g" \
 		   -e "s|@stampfile@|$(XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_STAMP)|g" \
@@ -137,7 +136,7 @@ $(XFS_SCRUB_ALL_PROG): $(XFS_SCRUB_ALL_PROG).in $(builddefs) $(TOPDIR)/libfrog/g
 xfs_scrub_fail: xfs_scrub_fail.in $(builddefs)
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
-		   -e "s|@scrub_svcname@|$(scrub_svcname)|g" \
+		   -e "s|@scrub_svcname@|$(XFS_SCRUB_SVCNAME)|g" \
 		   -e "s|@pkg_version@|$(PKG_VERSION)|g"  < $< > $@
 	$(Q)chmod a+x $@
 



* [PATCH 18/26] xfs_healer: use getmntent to find moved filesystems
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (16 preceding siblings ...)
  2026-03-19  4:43   ` [PATCH 17/26] xfs_healer: run full scrub after lost corruption events or targeted repair failure Darrick J. Wong
@ 2026-03-19  4:43   ` Darrick J. Wong
  2026-03-19  4:43   ` [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster Darrick J. Wong
                     ` (7 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

A mounted filesystem can move to a different mountpoint between the time
of the initial mount (at which point xfs_healer starts) and the time that
xfs_healer actually wants to start a repair.  When this happens,
weakhandle::mountpoint becomes obsolete, and either opening it will fail
with ENOENT or the handle revalidation will return ESTALE.

However, we do still have a means to find the mounted filesystem -- the
fsname parameter (aka the path to the data device at mount time).  This
is recorded in /proc/mounts, which means that we can iterate getmntent to
see if we can find the mount elsewhere.

As documented a few patches ago, this would be easier if we had
revocable fds that didn't pin mounts, but that's a very big ask.

This getmntent code enables xfs_healer to find a filesystem that has
been bind mounted in a new place and the original mountpoint detached:

# mount /dev/sda /mnt
# xfs_healer /mnt &
# mount /mnt /opt --bind
# umount /mnt

The key here is that each bind mount gets a separate struct mount
object.
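
The lookup by fsname can be sketched on its own as follows.  This is a
simplified illustration, not the patch: find_xfs_mount_by_fsname() is a
hypothetical helper that takes the mount table path as a parameter and
returns the first match, whereas the real code tries to reopen the handle
at every matching mountpoint until one succeeds.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan a mount table in /proc/mounts format for an xfs mount whose device
 * path equals fsname, and copy its mountpoint into dir.  Returns 0 if a
 * match was found and -1 otherwise.
 */
static int
find_xfs_mount_by_fsname(
	const char	*mtab_path,
	const char	*fsname,
	char		*dir,
	size_t		dirlen)
{
	FILE		*mtab;
	struct mntent	*mnt;
	int		ret = -1;

	assert(fsname != NULL && dir != NULL && dirlen > 0);

	mtab = setmntent(mtab_path, "r");
	if (!mtab)
		return -1;

	while ((mnt = getmntent(mtab)) != NULL) {
		if (strcmp(mnt->mnt_type, "xfs") != 0)
			continue;
		if (strcmp(mnt->mnt_fsname, fsname) != 0)
			continue;
		/* Skip mountpoints that do not fit in the caller's buffer. */
		if (strlen(mnt->mnt_dir) >= dirlen)
			continue;

		snprintf(dir, dirlen, "%s", mnt->mnt_dir);
		ret = 0;
		break;
	}

	endmntent(mtab);
	return ret;
}
```

In the bind mount example above, calling this with /proc/mounts and
/dev/sda after the umount would be expected to yield /opt.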

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/weakhandle.c |   55 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)


diff --git a/healer/weakhandle.c b/healer/weakhandle.c
index 849aa2882700d4..5df5207514e38e 100644
--- a/healer/weakhandle.c
+++ b/healer/weakhandle.c
@@ -65,10 +65,14 @@ weakhandle_alloc(
 	return -1;
 }
 
-/* Reopen a file handle obtained via weak reference. */
-int
-weakhandle_reopen(
+/*
+ * Reopen a file handle obtained via weak reference, using the given path to a
+ * mount point.
+ */
+static int
+weakhandle_reopen_from(
 	struct weakhandle	*wh,
+	const char		*path,
 	int			*fd)
 {
 	void			*hanp;
@@ -78,7 +82,7 @@ weakhandle_reopen(
 
 	*fd = -1;
 
-	mnt_fd = open(wh->mntpoint, O_RDONLY);
+	mnt_fd = open(path, O_RDONLY);
 	if (mnt_fd < 0)
 		return -1;
 
@@ -102,6 +106,49 @@ weakhandle_reopen(
 	return -1;
 }
 
+/* Reopen a file handle obtained via weak reference. */
+int
+weakhandle_reopen(
+	struct weakhandle	*wh,
+	int			*fd)
+{
+	FILE			*mtab;
+	struct mntent		*mnt;
+	int			ret;
+
+	/* First try reopening using the original mountpoint */
+	ret = weakhandle_reopen_from(wh, wh->mntpoint, fd);
+	if (!ret)
+		return 0;
+
+	/*
+	 * That didn't work, so now walk /proc/mounts to find a mount with the
+	 * same fsname (aka xfs data device path) as when we started.
+	 */
+	mtab = setmntent(_PATH_PROC_MOUNTS, "r");
+	if (!mtab)
+		return -1;
+
+	while ((mnt = getmntent(mtab)) != NULL) {
+		if (strcmp(mnt->mnt_type, "xfs"))
+			continue;
+		if (strcmp(mnt->mnt_fsname, wh->fsname))
+			continue;
+
+		ret = weakhandle_reopen_from(wh, mnt->mnt_dir, fd);
+		if (!ret)
+			break;
+	}
+
+	if (*fd < 0) {
+		errno = ESTALE;
+		ret = -1;
+	}
+
+	endmntent(mtab);
+	return ret;
+}
+
 /* Tear down a weak handle */
 void
 weakhandle_free(


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (17 preceding siblings ...)
  2026-03-19  4:43   ` [PATCH 18/26] xfs_healer: use getmntent to find moved filesystems Darrick J. Wong
@ 2026-03-19  4:43   ` Darrick J. Wong
  2026-03-20  7:11     ` Christoph Hellwig
  2026-03-19  4:43   ` [PATCH 20/26] xfs_healer: validate that repair fds point to the monitored fs Darrick J. Wong
                     ` (6 subsequent siblings)
  25 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

As noted in the previous patch, it's possible that a mounted filesystem
can move mountpoints between the time of the initial mount (at which
point xfs_healer starts) and when it actually wants to start a repair.
The previous patch fixed that problem by using getmntent to walk
/proc/self/mounts to see if it finds a mount with the same "source"
name, aka data device.

However, this is really slow if there are a lot of filesystems because
we end up wading through a lot of irrelevant information.  Fortunately,
statmount() can help us here because as of Linux 7.0 we can open the
passed-in path at startup, call statmount() on it to retrieve the
mnt_id, and then call it again later with that same mnt_id to find the
mountpoint.  Luckily xfs_healthmon didn't get merged until 7.0 so it's
more or less guaranteed to be there if XFS_IOC_HEALTH_MONITOR succeeds.

Obviously if this doesn't work, we can fall back to the slow walk.

This statmount code enables xfs_healer to find a filesystem that has
had its mountpoint moved to a different place in the directory tree
without the use of bind mounts and without needing to walk the entire
mount list:

# mount -t tmpfs urk /mnt
# mount --make-rprivate /mnt
# mkdir -p /mnt/a /mnt/b
# mount /dev/sda /mnt/a
# mount --move /mnt/a /mnt/b

The key here is that the struct mount object is moved, and no new ones
are created.  Therefore, the original mnt_id is still usable.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 healer/xfs_healer.h |    7 +++++--
 healer/weakhandle.c |   24 ++++++++++++++++++++++++
 healer/xfs_healer.c |   37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 64 insertions(+), 4 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index e1370323bbd66a..96e146f266629a 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -39,6 +39,9 @@ struct healer_ctx {
 	/* Shared reference to the getmntent fsname for reconnecting */
 	const char		*fsname;
 
+	/* Mount id for faster reconnecting */
+	uint64_t		mnt_id;
+
 	/* weak file handle so we can reattach to filesystem */
 	struct weakhandle	*wh;
 
@@ -75,8 +78,8 @@ bool healer_can_repair(struct healer_ctx *ctx);
 void run_full_repair(struct healer_ctx *ctx);
 
 /* weakhandle.c */
-int weakhandle_alloc(int fd, const char *mountpoint, const char *fsname,
-		struct weakhandle **whp);
+int weakhandle_alloc(int fd, const char *mountpoint, uint64_t mnt_id,
+		const char *fsname, struct weakhandle **whp);
 int weakhandle_reopen(struct weakhandle *wh, int *fd);
 void weakhandle_free(struct weakhandle **whp);
 int weakhandle_getpath_for(struct weakhandle *wh, uint64_t ino, uint32_t gen,
diff --git a/healer/weakhandle.c b/healer/weakhandle.c
index 5df5207514e38e..358c553f883f3d 100644
--- a/healer/weakhandle.c
+++ b/healer/weakhandle.c
@@ -14,6 +14,7 @@
 #include "libfrog/getparents.h"
 #include "libfrog/paths.h"
 #include "libfrog/systemd.h"
+#include "libfrog/statmount.h"
 #include "xfs_healer.h"
 
 struct weakhandle {
@@ -23,6 +24,9 @@ struct weakhandle {
 	/* Shared reference to the getmntent fsname for reconnecting */
 	const char		*fsname;
 
+	/* Mount id for faster reconnecting */
+	uint64_t		mnt_id;
+
 	/* handle to root dir */
 	void			*hanp;
 	size_t			hlen;
@@ -33,6 +37,7 @@ int
 weakhandle_alloc(
 	int			fd,
 	const char		*mountpoint,
+	uint64_t		mnt_id,
 	const char		*fsname,
 	struct weakhandle	**whp)
 {
@@ -51,6 +56,7 @@ weakhandle_alloc(
 		return -1;
 
 	wh->mntpoint = mountpoint;
+	wh->mnt_id = mnt_id;
 	wh->fsname = fsname;
 
 	ret = fd_to_handle(fd, &wh->hanp, &wh->hlen);
@@ -112,6 +118,9 @@ weakhandle_reopen(
 	struct weakhandle	*wh,
 	int			*fd)
 {
+	const size_t		smbuf_size =
+		libfrog_statmount_sizeof(PATH_MAX);
+	struct statmount	*smbuf = alloca(smbuf_size);
 	FILE			*mtab;
 	struct mntent		*mnt;
 	int			ret;
@@ -121,6 +130,21 @@ weakhandle_reopen(
 	if (!ret)
 		return 0;
 
+	/*
+	 * The original mountpoint didn't work, which means the mount might
+	 * have been moved.  Look up the mountpoint for the mount id that we
+	 * captured earlier, which is a quick lookup if there are many mounts.
+	 * Note that @ret is nonzero here.
+	 */
+	ret = libfrog_statmount(wh->mnt_id, DEFAULT_MOUNTNS_FD,
+			STATMOUNT_MNT_POINT, smbuf, smbuf_size);
+	if (ret || !(smbuf->mask & STATMOUNT_MNT_POINT))
+		goto fallback;
+	ret = weakhandle_reopen_from(wh, smbuf->str + smbuf->mnt_point, fd);
+	if (!ret)
+		return 0;
+
+fallback:
 	/*
 	 * That didn't work, so now walk /proc/mounts to find a mount with the
 	 * same fsname (aka xfs data device path) as when we started.
diff --git a/healer/xfs_healer.c b/healer/xfs_healer.c
index 09b88c754a550c..b91d7f16774b75 100644
--- a/healer/xfs_healer.c
+++ b/healer/xfs_healer.c
@@ -15,6 +15,7 @@
 #include "libfrog/workqueue.h"
 #include "libfrog/systemd.h"
 #include "libfrog/fsproperties.h"
+#include "libfrog/statmount.h"
 #include "xfs_healer.h"
 
 /* Program name; needed for libfrog error reports. */
@@ -163,11 +164,43 @@ try_capture_fsinfo(
 {
 	struct mntent		*mnt;
 	FILE			*mtp;
-	char			rpath[PATH_MAX], rmnt_dir[PATH_MAX];
+	const size_t		smbuf_size =
+		libfrog_statmount_sizeof(PATH_MAX + 128);
+	struct statmount	*smbuf = alloca(smbuf_size);
+	char			*rmnt_dir = smbuf->str;
+	char			rpath[PATH_MAX];
+	int			ret;
 
 	if (!realpath(ctx->mntpoint, rpath))
 		return -1;
 
+	/*
+	 * In Linux 7.0 we can do statmount on an open file, which means that
+	 * we can capture the mnt_id, mount point, and fsname, which can help
+	 * us find a mount --move'd elsewhere in the directory tree.
+	 */
+	ret = libfrog_fstatmount(ctx->mnt.fd, STATMOUNT_MNT_POINT, smbuf,
+			smbuf_size);
+	if (ret || !(smbuf->mask & STATMOUNT_MNT_POINT))
+		goto fallback;
+	if (strcmp(rpath, smbuf->str + smbuf->mnt_point))
+		goto fallback;
+
+	ret = libfrog_fstatmount(ctx->mnt.fd,
+			STATMOUNT_SB_SOURCE | STATMOUNT_MNT_BASIC,
+			smbuf, smbuf_size);
+	if (ret || !(smbuf->mask & STATMOUNT_SB_SOURCE))
+		goto fallback;
+
+	ctx->fsname = strdup(smbuf->str + smbuf->sb_source);
+	ctx->mnt_id = smbuf->mnt_id;
+	return 0;
+
+fallback:
+	/*
+	 * If statmount isn't available for whatever reason, fall back to
+	 * walking the mount table via getmntent.
+	 */
 	mtp = setmntent(_PATH_PROC_MOUNTS, "r");
 	if (mtp == NULL)
 		return -1;
@@ -341,7 +374,7 @@ setup_monitor(
 	 * paths for logging.
 	 */
 	if (ctx->want_repair || healer_has_parent(ctx)) {
-		ret = weakhandle_alloc(ctx->mnt.fd, ctx->mntpoint,
+		ret = weakhandle_alloc(ctx->mnt.fd, ctx->mntpoint, ctx->mnt_id,
 				ctx->fsname, &ctx->wh);
 		if (ret) {
 			fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 20/26] xfs_healer: validate that repair fds point to the monitored fs
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (18 preceding siblings ...)
  2026-03-19  4:43   ` [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster Darrick J. Wong
@ 2026-03-19  4:43   ` Darrick J. Wong
  2026-03-19  4:44   ` [PATCH 21/26] xfs_healer: add a manual page Darrick J. Wong
                     ` (5 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When xfs_healer reopens a mountpoint to perform a repair, it should
validate that the opened fd points to a file on the same filesystem as
the one being monitored.
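
The validation hook added here is a plain callback that can veto a freshly
opened fd.  The sketch below shows the same pattern with stand-ins: the
names open_checked, fd_check_fn, and same_device are hypothetical, and the
st_dev comparison is only an assumption-laden substitute for the
XFS_IOC_HEALTH_FD_ON_MONITORED_FS ioctl that the patch actually issues on
the monitor fd.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type, modelled on weakhandle_fd_t */
typedef bool (*fd_check_fn)(int fd, void *data);

/* Stand-in check: accept @fd only if it lives on the device in @data */
static bool
same_device(
	int		fd,
	void		*data)
{
	const dev_t	*want = data;
	struct stat	sb;

	if (fstat(fd, &sb) < 0)
		return false;
	return sb.st_dev == *want;
}

/*
 * Open @path and hand the fd to @is_acceptable (if given).  A rejected fd
 * is closed and reported as ESTALE, mirroring how a stale mountpoint is
 * treated.  Returns 0 with *fd set on success, -1 with *fd == -1 otherwise.
 */
static int
open_checked(
	const char	*path,
	int		*fd,
	fd_check_fn	is_acceptable,
	void		*data)
{
	int		candidate;

	assert(path && fd);
	*fd = -1;

	candidate = open(path, O_RDONLY);
	if (candidate < 0)
		return -1;

	if (is_acceptable && !is_acceptable(candidate, data)) {
		close(candidate);
		errno = ESTALE;	/* set after close() so it is not clobbered */
		return -1;
	}

	*fd = candidate;
	return 0;
}
```

Passing a NULL callback skips the check entirely, which matches how
weakhandle_getpath_for calls weakhandle_reopen in this patch.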

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 healer/xfs_healer.h |    4 +++-
 healer/fsrepair.c   |   18 +++++++++++++++++-
 healer/weakhandle.c |   23 +++++++++++++++++------
 3 files changed, 37 insertions(+), 8 deletions(-)


diff --git a/healer/xfs_healer.h b/healer/xfs_healer.h
index 96e146f266629a..c6692375dda6bf 100644
--- a/healer/xfs_healer.h
+++ b/healer/xfs_healer.h
@@ -80,7 +80,9 @@ void run_full_repair(struct healer_ctx *ctx);
 /* weakhandle.c */
 int weakhandle_alloc(int fd, const char *mountpoint, uint64_t mnt_id,
 		const char *fsname, struct weakhandle **whp);
-int weakhandle_reopen(struct weakhandle *wh, int *fd);
+typedef bool (*weakhandle_fd_t)(int mnt_fd, void *data);
+int weakhandle_reopen(struct weakhandle *wh, int *fd,
+		weakhandle_fd_t is_acceptable, void *data);
 void weakhandle_free(struct weakhandle **whp);
 int weakhandle_getpath_for(struct weakhandle *wh, uint64_t ino, uint32_t gen,
 		char *path, size_t pathlen);
diff --git a/healer/fsrepair.c b/healer/fsrepair.c
index 9f8c128e395ebc..002e5e78fcf22e 100644
--- a/healer/fsrepair.c
+++ b/healer/fsrepair.c
@@ -233,6 +233,22 @@ try_repair_inode(
 	return REPAIR_DONE;
 }
 
+/* Make sure the reopened file is on the same fs as the monitor. */
+static bool
+is_same_fs(
+	int				mnt_fd,
+	void				*data)
+{
+	struct xfs_health_file_on_monitored_fs hms = {
+		.fd = mnt_fd,
+	};
+	FILE				*mon_fp = data;
+	int				ret;
+
+	ret = ioctl(fileno(mon_fp), XFS_IOC_HEALTH_FD_ON_MONITORED_FS, &hms);
+	return ret == 0;
+}
+
 /* Repair a metadata corruption. */
 int
 repair_metadata(
@@ -244,7 +260,7 @@ repair_metadata(
 	int					repair_fd;
 	int					ret;
 
-	ret = weakhandle_reopen(ctx->wh, &repair_fd);
+	ret = weakhandle_reopen(ctx->wh, &repair_fd, is_same_fs, ctx->mon_fp);
 	if (ret) {
 		fprintf(stderr, "%s: %s: %s\n", ctx->mntpoint,
 				_("cannot open filesystem to repair"),
diff --git a/healer/weakhandle.c b/healer/weakhandle.c
index 358c553f883f3d..7b8cef0a63f971 100644
--- a/healer/weakhandle.c
+++ b/healer/weakhandle.c
@@ -79,7 +79,9 @@ static int
 weakhandle_reopen_from(
 	struct weakhandle	*wh,
 	const char		*path,
-	int			*fd)
+	int			*fd,
+	weakhandle_fd_t		is_acceptable,
+	void			*data)
 {
 	void			*hanp;
 	size_t			hlen;
@@ -101,6 +103,11 @@ weakhandle_reopen_from(
 		goto out_handle;
 	}
 
+	if (is_acceptable && !is_acceptable(mnt_fd, data)) {
+		errno = ESTALE;
+		goto out_handle;
+	}
+
 	free_handle(hanp, hlen);
 	*fd = mnt_fd;
 	return 0;
@@ -116,7 +123,9 @@ weakhandle_reopen_from(
 int
 weakhandle_reopen(
 	struct weakhandle	*wh,
-	int			*fd)
+	int			*fd,
+	weakhandle_fd_t		is_acceptable,
+	void			*data)
 {
 	const size_t		smbuf_size =
 		libfrog_statmount_sizeof(PATH_MAX);
@@ -126,7 +135,7 @@ weakhandle_reopen(
 	int			ret;
 
 	/* First try reopening using the original mountpoint */
-	ret = weakhandle_reopen_from(wh, wh->mntpoint, fd);
+	ret = weakhandle_reopen_from(wh, wh->mntpoint, fd, is_acceptable, data);
 	if (!ret)
 		return 0;
 
@@ -140,7 +149,8 @@ weakhandle_reopen(
 			STATMOUNT_MNT_POINT, smbuf, smbuf_size);
 	if (ret || !(smbuf->mask & STATMOUNT_MNT_POINT))
 		goto fallback;
-	ret = weakhandle_reopen_from(wh, smbuf->str + smbuf->mnt_point, fd);
+	ret = weakhandle_reopen_from(wh, smbuf->str + smbuf->mnt_point, fd,
+			is_acceptable, data);
 	if (!ret)
 		return 0;
 
@@ -159,7 +169,8 @@ weakhandle_reopen(
 		if (strcmp(mnt->mnt_fsname, wh->fsname))
 			continue;
 
-		ret = weakhandle_reopen_from(wh, mnt->mnt_dir, fd);
+		ret = weakhandle_reopen_from(wh, mnt->mnt_dir, fd,
+				is_acceptable, data);
 		if (!ret)
 			break;
 	}
@@ -244,7 +255,7 @@ weakhandle_getpath_for(
 	fakehandle.ha_fid.fid_ino = ino;
 	fakehandle.ha_fid.fid_gen = gen;
 
-	ret = weakhandle_reopen(wh, &mnt_fd);
+	ret = weakhandle_reopen(wh, &mnt_fd, NULL, NULL);
 	if (ret)
 		return ret;
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 21/26] xfs_healer: add a manual page
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (19 preceding siblings ...)
  2026-03-19  4:43   ` [PATCH 20/26] xfs_healer: validate that repair fds point to the monitored fs Darrick J. Wong
@ 2026-03-19  4:44   ` Darrick J. Wong
  2026-03-19  4:44   ` [PATCH 22/26] xfs_scrub: print systemd service names Darrick J. Wong
                     ` (4 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new section 8 manpage for this service daemon so others can read
about what this program is supposed to do.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 man/man8/Makefile           |   40 +++++++++++++---
 man/man8/xfs_healer.8       |  109 +++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_healer_start.8 |   37 +++++++++++++++
 3 files changed, 180 insertions(+), 6 deletions(-)
 create mode 100644 man/man8/xfs_healer.8
 create mode 100644 man/man8/xfs_healer_start.8


diff --git a/man/man8/Makefile b/man/man8/Makefile
index 5be76ab727a1fe..05710f85ae89ad 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -7,13 +7,41 @@ include $(TOPDIR)/include/builddefs
 
 MAN_SECTION	= 8
 
-ifneq ("$(ENABLE_SCRUB)","yes")
-  MAN_PAGES = $(filter-out xfs_scrub%,$(shell echo *.$(MAN_SECTION)))
-else
-  MAN_PAGES = $(shell echo *.$(MAN_SECTION))
-  MAN_PAGES += xfs_scrub_all.8
+MAN_PAGES = \
+	fsck.xfs.8 \
+	mkfs.xfs.8 \
+	xfs_admin.8 \
+	xfs_bmap.8 \
+	xfs_copy.8 \
+	xfs_db.8 \
+	xfs_estimate.8 \
+	xfs_freeze.8 \
+	xfs_fsr.8 \
+	xfs_growfs.8 \
+	xfs_info.8 \
+	xfs_io.8 \
+	xfs_logprint.8 \
+	xfs_mdrestore.8 \
+	xfs_metadump.8 \
+	xfs_mkfile.8 \
+	xfs_ncheck.8 \
+	xfs_property.8 \
+	xfs_protofile.8 \
+	xfs_quota.8 \
+	xfs_repair.8 \
+	xfs_rtcp.8 \
+	xfs_spaceman.8
+
+ifeq ($(ENABLE_HEALER),yes)
+  MAN_PAGES += xfs_healer.8
 endif
-MAN_PAGES	+= mkfs.xfs.8
+ifeq ($(HAVE_HEALER_START_DEPS),yes)
+  MAN_PAGES += xfs_healer_start.8
+endif
+ifeq ($(ENABLE_SCRUB),yes)
+  MAN_PAGES += xfs_scrub.8 xfs_scrub_all.8
+endif
+
 MAN_DEST	= $(PKG_MAN_DIR)/man$(MAN_SECTION)
 LSRCFILES	= $(MAN_PAGES)
 DIRT		= mkfs.xfs.8 xfs_scrub_all.8
diff --git a/man/man8/xfs_healer.8 b/man/man8/xfs_healer.8
new file mode 100644
index 00000000000000..eea799f7811a4d
--- /dev/null
+++ b/man/man8/xfs_healer.8
@@ -0,0 +1,109 @@
+.TH xfs_healer 8
+.SH NAME
+xfs_healer \- automatically heal damage to XFS filesystem metadata
+.SH SYNOPSIS
+.B xfs_healer
+[
+.B OPTIONS
+]
+.I mount-point
+.br
+.B xfs_healer \-V
+.SH DESCRIPTION
+.B xfs_healer
+is a daemon that tries to automatically repair damaged XFS filesystem metadata.
+.PP
+.B WARNING!
+This program is
+.BR EXPERIMENTAL ","
+which means that its behavior and interface
+could change at any time!
+.PP
+.B xfs_healer
+asks the kernel to report all observations of corrupt metadata, media errors,
+filesystem shutdowns, and file I/O errors.
+The program can respond to runtime metadata corruption errors by initiating
+targeted repairs of the suspect metadata or a full online fsck of the
+filesystem.
+
+Normally this program runs as a systemd service.
+The service is activated via the
+.I xfs_healer_start
+service if systemd is supported.
+
+The kernel may not support repairing or optimizing the filesystem.
+If this is the case, the filesystem must be unmounted and
+.BR xfs_repair (8)
+run on the filesystem to fix the problems.
+.SH OPTIONS
+.TP
+.BI \-\-everything
+Ask the kernel to send us good metadata health events, not only events related
+to metadata corruption, media errors, shutdowns, and I/O errors.
+.TP
+.B \-\-foreground
+Start enough event handling threads to allow consumption of all online CPUs.
+If not specified, start exactly one event handling thread.
+.TP
+.B \-\-no-autofsck
+Do not use the
+.I autofsck
+filesystem property to decide whether or not to repair corrupt metadata.
+If the
+.B \-\-repair
+option is given, then all corruptions will be repaired.
+If the
+.B \-\-repair
+option is not given, then the program will never try to repair the filesystem.
+.TP
+.B \-\-quiet
+Do not print every event to standard output.
+.TP
+.B \-\-repair
+Always try to repair each piece of corrupt metadata when the kernel tells us
+about it.
+If an individual repair fails or the kernel tells us that health events were
+lost, the
+.I xfs_scrub
+service for this mount point will be launched.
+The default is not to try to repair anything.
+If this option is specified but the kernel does not support repairs, the
+program will exit.
+.TP
+.B \-\-supported
+Check if the filesystem supports sending health events.
+Exits with 0 if it does, and non-zero if not.
+.TP
+.BI \-V
+Print the version number and exit.
+
+.SH AUTOFSCK
+By default, this program will read the
+.I autofsck
+filesystem property to decide if it should try to repair corruptions.
+If the property is set to the value
+.B repair
+then corruptions will be repaired.
+If the property is not set but the filesystem supports all back-reference
+metadata (reverse mappings and parent pointers), then corruptions will be
+repaired.
+
+See the
+.BR xfs_scrub (8)
+manual page for more details on this filesystem property.
+
+.SH CAVEATS
+.B xfs_healer
+is an immature utility!
+Do not run this program unless you have backups of your data!
+This program takes advantage of in-kernel scrubbing to verify a given
+data structure with locks held and can keep the filesystem busy for a
+long time.
+The kernel must be new enough to support the SCRUB_METADATA ioctl.
+.PP
+If errors are found and cannot be repaired, the filesystem must be
+unmounted and repaired.
+.SH SEE ALSO
+.BR xfs_repair (8)
+and
+.BR xfs_scrub (8).
diff --git a/man/man8/xfs_healer_start.8 b/man/man8/xfs_healer_start.8
new file mode 100644
index 00000000000000..9e424432a513fe
--- /dev/null
+++ b/man/man8/xfs_healer_start.8
@@ -0,0 +1,37 @@
+.TH xfs_healer_start 8
+.SH NAME
+xfs_healer_start \- starts xfs_healer instances
+.SH SYNOPSIS
+.B xfs_healer_start
+[
+.B OPTIONS
+]
+.br
+.B xfs_healer_start \-V
+.SH DESCRIPTION
+.B xfs_healer_start
+starts the xfs_healer service whenever the kernel mounts an XFS filesystem in
+the current mount namespace.
+.PP
+.B WARNING!
+This program is
+.BR EXPERIMENTAL ","
+which means that its behavior and interface
+could change at any time!
+
+Normally this program runs as a systemd service.
+
+.SH OPTIONS
+.TP
+.B \-\-supported
+Check if the kernel supports listening for mount events.
+Exits with 0 if it does, and non-zero if not.
+.TP
+.BI "\-\-mountns " path
+Monitor the given mount namespace.
+Defaults to the mount namespace associated with the process itself.
+.TP
+.BI \-V
+Print the version number and exit.
+.SH SEE ALSO
+.BR xfs_healer (8).


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 22/26] xfs_scrub: print systemd service names
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (20 preceding siblings ...)
  2026-03-19  4:44   ` [PATCH 21/26] xfs_healer: add a manual page Darrick J. Wong
@ 2026-03-19  4:44   ` Darrick J. Wong
  2026-03-19  4:44   ` [PATCH 23/26] xfs_io: add listmount and statmount commands Darrick J. Wong
                     ` (3 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a hidden switch to xfs_scrub to emit systemd service names for XFS
services targeting filesystem paths instead of opencoding the
computation in things like fstests.
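
For readers unfamiliar with the getopt_long flag-pointer form used for the
new switch, here is a minimal sketch.  The struct cli_opts type and the
parse_cli function are hypothetical; only the long option names and the
enum-indexed table follow the patch.  When an option's flag member is
non-NULL, getopt_long stores val through it and returns 0, which is why
the target field must be an int.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

enum long_opt_nr {
	LOPT_HELP,
	LOPT_SVCNAME,

	LOPT_MAX,
};

/* Hypothetical option state; print_svcname must be int for getopt_long */
struct cli_opts {
	int	print_svcname;
	int	want_help;
	int	verbose;
};

/* Parse argv into @opts.  Returns 0 on success, -1 on an unknown option. */
static int
parse_cli(
	int		argc,
	char		**argv,
	struct cli_opts	*opts)
{
	struct option	long_options[] = {
		[LOPT_HELP]	= {"help", no_argument, NULL, 0 },
		[LOPT_SVCNAME]	= {"svcname", no_argument,
				   &opts->print_svcname, 1 },

		[LOPT_MAX]	= {NULL, 0, NULL, 0 },
	};
	int		option_index = 0;
	int		c;

	assert(opts != NULL);

	while ((c = getopt_long(argc, argv, "v", long_options,
				&option_index)) != -1) {
		switch (c) {
		case 0:
			/* --svcname already set its flag; only --help is left */
			if (option_index == LOPT_HELP)
				opts->want_help = 1;
			break;
		case 'v':
			opts->verbose++;
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

With this shape, adding another boolean long option only needs a new enum
entry and a table row; no extra case in the switch is required.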

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 scrub/xfs_scrub.h |    3 +++
 scrub/Makefile    |    7 +++++--
 scrub/xfs_scrub.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 54 insertions(+), 6 deletions(-)


diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 6ee359f4cebd47..041c0fadaa93c0 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -108,6 +108,9 @@ struct scrub_ctx {
 	 * this much space per volume.
 	 */
 	double			fstrim_block_pct;
+
+	/* CLI options, must be int */
+	int			print_svcname;
 };
 
 /*
diff --git a/scrub/Makefile b/scrub/Makefile
index aee49bfce100e2..4aa0a7d836c342 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -8,9 +8,12 @@ include $(builddefs)
 
 SCRUB_PREREQS=$(HAVE_GETFSMAP)
 
-scrub_media_svcname=xfs_scrub_media@.service
+XFS_SCRUB_MEDIA_SVCNAME=xfs_scrub_media@.service
 
 ifeq ($(SCRUB_PREREQS),yes)
+CFLAGS+=-DXFS_SCRUB_SVCNAME=\"$(XFS_SCRUB_SVCNAME)\"
+CFLAGS+=-DXFS_SCRUB_MEDIA_SVCNAME=\"$(XFS_SCRUB_MEDIA_SVCNAME)\"
+
 LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 XFS_SCRUB_ALL_PROG = xfs_scrub_all.py
@@ -22,7 +25,7 @@ INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
 	$(XFS_SCRUB_SVCNAME) \
 	xfs_scrub_fail@.service \
-	$(scrub_media_svcname) \
+	$(XFS_SCRUB_MEDIA_SVCNAME) \
 	xfs_scrub_media_fail@.service \
 	xfs_scrub_all.service \
 	xfs_scrub_all_fail.service \
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 79937aa8cce4c4..b74dc1635141aa 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -710,6 +710,13 @@ parse_o_opts(
 	}
 }
 
+enum long_opt_nr {
+	LOPT_HELP,
+	LOPT_SVCNAME,
+
+	LOPT_MAX,
+};
+
 int
 main(
 	int			argc,
@@ -717,11 +724,15 @@ main(
 {
 	struct scrub_ctx	ctx = {
 		.fstrim_block_pct = FSTRIM_BLOCK_PCT_DEFAULT,
+		.lock		= (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER,
+		.mode		= SCRUB_MODE_REPAIR,
+		.error_action	= ERRORS_CONTINUE,
 	};
 	struct phase_rusage	all_pi;
 	char			*mtab = NULL;
 	FILE			*progress_fp = NULL;
 	struct fs_path		*fsp;
+	int			option_index;
 	int			vflag = 0;
 	int			c;
 	int			fd;
@@ -742,11 +753,25 @@ main(
 		goto out_unicrash;
 	}
 
-	pthread_mutex_init(&ctx.lock, NULL);
-	ctx.mode = SCRUB_MODE_REPAIR;
-	ctx.error_action = ERRORS_CONTINUE;
-	while ((c = getopt(argc, argv, "a:bC:de:kM:m:no:pTvxV")) != EOF) {
+	struct option long_options[] = {
+		[LOPT_HELP]	   = {"help", no_argument, NULL, 0 },
+		[LOPT_SVCNAME]	   = {"svcname", no_argument, &ctx.print_svcname, 1 },
+
+		[LOPT_MAX]	   = {NULL, 0, NULL, 0 },
+	};
+
+	while ((c = getopt_long(argc, argv, "a:bC:de:kM:m:no:pTvxV",
+				long_options, &option_index)) != EOF) {
 		switch (c) {
+		case 0:
+			switch (option_index) {
+			case LOPT_HELP:
+				usage();
+				break;
+			default:
+				break;
+			}
+			break;
 		case 'a':
 			ctx.max_errors = cvt_u64(optarg, 10);
 			if (errno) {
@@ -860,6 +885,23 @@ main(
 	if (!ctx.actual_mntpoint)
 		ctx.actual_mntpoint = ctx.mntpoint;
 
+	if (ctx.print_svcname) {
+		char		unitname[PATH_MAX];
+		const char	*template =
+			scrub_data ? XFS_SCRUB_MEDIA_SVCNAME :
+				     XFS_SCRUB_SVCNAME;
+
+		ret = systemd_path_instance_unit_name(template,
+				ctx.mntpoint, unitname, sizeof(unitname));
+		if (ret) {
+			perror(ctx.mntpoint);
+			return EXIT_FAILURE;
+		}
+
+		printf("%s\n", unitname);
+		return EXIT_SUCCESS;
+	}
+
 	stdout_isatty = isatty(STDOUT_FILENO);
 	stderr_isatty = isatty(STDERR_FILENO);
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 23/26] xfs_io: add listmount and statmount commands
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (21 preceding siblings ...)
  2026-03-19  4:44   ` [PATCH 22/26] xfs_scrub: print systemd service names Darrick J. Wong
@ 2026-03-19  4:44   ` Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 24/26] mkfs: enable online repair if all backrefs are enabled Darrick J. Wong
                     ` (2 subsequent siblings)
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add two new commands: one to list all mounts via statmount, now that we
use this in xfs_healer_start, and another to statmount each open file.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 io/io.h           |    6 +
 io/Makefile       |    5 +
 io/init.c         |    1 
 io/listmount.c    |  361 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |   66 ++++++++++
 5 files changed, 439 insertions(+)
 create mode 100644 io/listmount.c


diff --git a/io/io.h b/io/io.h
index 0f12b3cfed5e76..5f1f278d14a033 100644
--- a/io/io.h
+++ b/io/io.h
@@ -164,3 +164,9 @@ void			fsprops_init(void);
 void			aginfo_init(void);
 void			healthmon_init(void);
 void			verifymedia_init(void);
+
+#ifdef HAVE_LISTMOUNT
+void			listmount_init(void);
+#else
+# define		listmount_init()	do { } while (0)
+#endif
diff --git a/io/Makefile b/io/Makefile
index 79d5e172b8f31f..e25742b635396e 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -90,6 +90,11 @@ ifeq ($(HAVE_GETFSMAP),yes)
 CFILES += fsmap.c
 endif
 
+ifeq ($(HAVE_LISTMOUNT),yes)
+CFILES += listmount.c
+LCFLAGS += -DHAVE_LISTMOUNT
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/io/init.c b/io/init.c
index f2a551ef559200..ba60cb2199639b 100644
--- a/io/init.c
+++ b/io/init.c
@@ -94,6 +94,7 @@ init_commands(void)
 	fsprops_init();
 	healthmon_init();
 	verifymedia_init();
+	listmount_init();
 }
 
 /*
diff --git a/io/listmount.c b/io/listmount.c
new file mode 100644
index 00000000000000..af4ebaf7861250
--- /dev/null
+++ b/io/listmount.c
@@ -0,0 +1,361 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2026 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+
+#include "libfrog/flagmap.h"
+#include "libfrog/statmount.h"
+#include "command.h"
+#include "input.h"
+#include "init.h"
+#include "io.h"
+
+static const struct flag_map statmount_funcs[] = {
+	{ STATMOUNT_SB_BASIC,		N_("sb_basic") },
+	{ STATMOUNT_MNT_BASIC,		N_("mnt_basic") },
+	{ STATMOUNT_PROPAGATE_FROM,	N_("propagate_from") },
+	{ STATMOUNT_MNT_ROOT,		N_("mnt_root") },
+	{ STATMOUNT_MNT_POINT,		N_("mnt_point") },
+	{ STATMOUNT_FS_TYPE,		N_("fs_type") },
+	{ STATMOUNT_MNT_NS_ID,		N_("mnt_ns_id") },
+	{ STATMOUNT_MNT_OPTS,		N_("mnt_opts") },
+	{ STATMOUNT_FS_SUBTYPE,		N_("fs_subtype") },
+	{ STATMOUNT_SB_SOURCE,		N_("sb_source") },
+	{ STATMOUNT_OPT_ARRAY,		N_("opt_array") },
+	{ STATMOUNT_OPT_SEC_ARRAY,	N_("opt_sec_array") },
+	{ STATMOUNT_SUPPORTED_MASK,	N_("supported_mask") },
+	{0, NULL},
+};
+
+static const struct flag_map mount_attrs[] = {
+	{ MOUNT_ATTR_RDONLY,		N_("rdonly") },
+	{ MOUNT_ATTR_NOSUID,		N_("nosuid") },
+	{ MOUNT_ATTR_NODEV,		N_("nodev") },
+	{ MOUNT_ATTR_NOEXEC,		N_("noexec") },
+	{ MOUNT_ATTR__ATIME,		N_("atime") },
+	{ MOUNT_ATTR_RELATIME,		N_("relatime") },
+	{ MOUNT_ATTR_NOATIME,		N_("noatime") },
+	{ MOUNT_ATTR_STRICTATIME,	N_("strictatime") },
+	{ MOUNT_ATTR_NODIRATIME,	N_("nodiratime") },
+	{ MOUNT_ATTR_IDMAP,		N_("idmap") },
+	{ MOUNT_ATTR_NOSYMFOLLOW,	N_("nosymfollow") },
+	{0, NULL},
+};
+
+static const struct flag_map mount_prop_flags[] = {
+	{ MS_SHARED,			N_("shared") },
+	{ MS_SLAVE,			N_("nopeer") },
+	{ MS_PRIVATE,			N_("private") },
+	{ MS_UNBINDABLE,		N_("unbindable") },
+	{0, NULL},
+};
+
+static void
+dump_mountinfo(
+	const struct statmount	*smbuf,
+	bool			rawflag)
+{
+	char			buf[4096];
+
+	if (rawflag) {
+		printf("\tmask: 0x%llx\n", (unsigned long long)smbuf->mask);
+	} else {
+		mask_to_string(statmount_funcs, smbuf->mask, ",", buf,
+				sizeof(buf));
+		printf("\tmask: {%s}\n", buf);
+	}
+
+	if (smbuf->mask & STATMOUNT_SB_BASIC) {
+		printf("\tsb_dev_major: %u\n", smbuf->sb_dev_major);
+		printf("\tsb_dev_minor: %u\n", smbuf->sb_dev_minor);
+		printf("\tsb_magic: 0x%llx\n",
+				(unsigned long long)smbuf->sb_magic);
+		printf("\tsb_flags: 0x%x\n", smbuf->sb_flags);
+	}
+
+	if (smbuf->mask & STATMOUNT_MNT_BASIC) {
+		printf("\tmnt_id: 0x%llx\n",
+				(unsigned long long)smbuf->mnt_id);
+		printf("\tmnt_parent_id: 0x%llx\n",
+				(unsigned long long)smbuf->mnt_parent_id);
+		printf("\tmnt_id_old: %u\n", smbuf->mnt_id_old);
+		printf("\tmnt_parent_id_old: %u\n", smbuf->mnt_parent_id_old);
+		if (rawflag) {
+			printf("\tmnt_attr: 0x%llx\n",
+					(unsigned long long)smbuf->mnt_attr);
+			printf("\tmnt_propagation: 0x%llx\n",
+					(unsigned long long)smbuf->mnt_propagation);
+		} else {
+			mask_to_string(mount_attrs, smbuf->mnt_attr, ",", buf,
+					sizeof(buf));
+			printf("\tmnt_attr: {%s}\n", buf);
+			mask_to_string(mount_prop_flags, smbuf->mnt_propagation,
+					",", buf, sizeof(buf));
+			printf("\tmnt_propagation: {%s}\n", buf);
+		}
+		printf("\tmnt_peer_group: 0x%llx\n",
+				(unsigned long long)smbuf->mnt_peer_group);
+		printf("\tmnt_master: 0x%llx\n",
+				(unsigned long long)smbuf->mnt_master);
+	}
+
+	if (smbuf->mask & STATMOUNT_PROPAGATE_FROM)
+		printf("\tpropagate_from: 0x%llx\n",
+				(unsigned long long)smbuf->propagate_from);
+
+	if (smbuf->mask & STATMOUNT_MNT_ROOT)
+		printf("\tmnt_root: %s\n", smbuf->str + smbuf->mnt_root);
+	if (smbuf->mask & STATMOUNT_MNT_POINT)
+		printf("\tmnt_point: %s\n", smbuf->str + smbuf->mnt_point);
+	if (smbuf->mask & STATMOUNT_FS_TYPE)
+		printf("\tfs_type: %s\n", smbuf->str + smbuf->fs_type);
+	if (smbuf->mask & STATMOUNT_FS_SUBTYPE)
+		printf("\tfs_subtype: %s\n", smbuf->str + smbuf->fs_subtype);
+
+	if (smbuf->mask & STATMOUNT_MNT_NS_ID)
+		printf("\tmnt_ns_id: 0x%llx\n",
+				(unsigned long long)smbuf->mnt_ns_id);
+
+	if (smbuf->mask & STATMOUNT_MNT_OPTS)
+		printf("\tmnt_opts: %s\n", smbuf->str + smbuf->mnt_opts);
+	if (smbuf->mask & STATMOUNT_SB_SOURCE)
+		printf("\tsb_source: %s\n", smbuf->str + smbuf->sb_source);
+
+	if (smbuf->mask & STATMOUNT_SUPPORTED_MASK) {
+		if (rawflag) {
+			printf("\tsupported_mask: 0x%llx\n",
+					(unsigned long long)smbuf->supported_mask);
+		} else {
+			mask_to_string(statmount_funcs, smbuf->supported_mask,
+					",", buf, sizeof(buf));
+			printf("\tsupported_mask: {%s}\n", buf);
+		}
+	}
+}
+
+static inline bool
+match_mount(
+	const struct statmount	*smbuf,
+	const char		*fstype)
+{
+	char			real_fstype[256];
+
+	if (!fstype)
+		return true;
+
+	if (!(smbuf->mask & STATMOUNT_FS_TYPE))
+		return false;
+
+	if (smbuf->mask & STATMOUNT_FS_SUBTYPE)
+		snprintf(real_fstype, sizeof(real_fstype), "%s.%s",
+				smbuf->str + smbuf->fs_type,
+				smbuf->str + smbuf->fs_subtype);
+	else
+		snprintf(real_fstype, sizeof(real_fstype), "%s",
+				smbuf->str + smbuf->fs_type);
+
+	return strcmp(fstype, real_fstype) == 0;
+}
+
+static void
+listmount_help(void)
+{
+	printf(_(
+"\n"
+" List all mounted filesystems.\n"
+"\n"
+" -f   -- statmount mask flags to set.  Defaults to all possible flags.\n"
+" -i   -- only list mounts below this mount id.  Defaults to the rootdir.\n"
+" -n   -- path to a procfs mount namespace file.\n"
+" -r   -- do not decode flags fields into strings.\n"
+" -t   -- only display mount info for this fs type.\n"
+));
+}
+
+#define NR_MNT_IDS		7
+
+static int
+listmount_f(
+	int			argc,
+	char			**argv)
+{
+	uint64_t		mnt_ids[NR_MNT_IDS];
+	uint64_t		cursor = LISTMOUNT_INIT_CURSOR;
+	uint64_t		statmount_flags = -1ULL;
+	uint64_t		mnt_id = LSMT_ROOT;
+	struct statmount	*smbuf;
+	const char		*fstype = NULL;
+	unsigned long long	rows = 0;
+	const size_t		smbuf_size = libfrog_statmount_sizeof(4096);
+	int			mnt_ns_fd = DEFAULT_MOUNTNS_FD;
+	int			rawflag = 0;
+	int			c;
+	int			ret;
+
+	while ((c = getopt(argc, argv, "f:i:n:rt:")) > 0) {
+		switch (c) {
+		case 'f':
+			errno = 0;
+			statmount_flags = strtoull(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				return 1;
+			}
+			break;
+		case 'i':
+			errno = 0;
+			mnt_id = strtoull(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				return 1;
+			}
+			break;
+		case 'n':
+			mnt_ns_fd = open(optarg, O_RDONLY);
+			if (mnt_ns_fd < 0) {
+				perror(optarg);
+				return 1;
+			}
+			break;
+		case 'r':
+			rawflag++;
+			break;
+		case 't':
+			fstype = optarg;
+			break;
+		default:
+			listmount_help();
+			return 1;
+		}
+	}
+
+	smbuf = malloc(smbuf_size);
+	if (!smbuf) {
+		perror("malloc");
+		return 1;
+	}
+
+	if (fstype)
+		statmount_flags |= STATMOUNT_FS_TYPE | STATMOUNT_FS_SUBTYPE;
+
+	while ((ret = libfrog_listmount(mnt_id, mnt_ns_fd, &cursor,
+					mnt_ids, NR_MNT_IDS)) > 0) {
+		for (c = 0; c < ret; c++) {
+			if (libfrog_statmount(mnt_ids[c], mnt_ns_fd,
+					statmount_flags, smbuf,
+					smbuf_size)) {
+				perror("statmount");
+				goto out_smbuf;
+			}
+
+			if (!match_mount(smbuf, fstype))
+				continue;
+
+			printf("mnt_id[%llu]: 0x%llx\n",
+					(unsigned long long)rows++,
+					(unsigned long long)mnt_ids[c]);
+
+			dump_mountinfo(smbuf, rawflag);
+		}
+	}
+
+	if (ret < 0)
+		perror("listmount");
+
+out_smbuf:
+	free(smbuf);
+	return 0;
+}
+
+static const struct cmdinfo listmount_cmd = {
+	.name		= "listmount",
+	.cfunc		= listmount_f,
+	.argmin		= -1,
+	.argmax		= -1,
+	.flags		= CMD_NOFILE_OK | CMD_FOREIGN_OK | CMD_NOMAP_OK,
+	.oneline	= N_("list mounted filesystems"),
+	.help		= listmount_help,
+};
+
+static void
+statmount_help(void)
+{
+	printf(_(
+"\n"
+" Print statmount information for the open file.\n"
+"\n"
+" -f   -- statmount mask flags to set.  Defaults to all possible flags.\n"
+" -r   -- do not decode flags fields into strings.\n"
+));
+}
+
+static int
+statmount_f(
+	int			argc,
+	char			**argv)
+{
+	uint64_t		statmount_flags = -1ULL;
+	struct statmount	*smbuf;
+	const size_t		smbuf_size = libfrog_statmount_sizeof(4096);
+	int			rawflag = 0;
+	int			c;
+	int			ret;
+
+	while ((c = getopt(argc, argv, "f:r")) > 0) {
+		switch (c) {
+		case 'f':
+			errno = 0;
+			statmount_flags = strtoull(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				return 1;
+			}
+			break;
+		case 'r':
+			rawflag++;
+			break;
+		default:
+			statmount_help();
+			return 1;
+		}
+	}
+
+	smbuf = malloc(smbuf_size);
+	if (!smbuf) {
+		perror("malloc");
+		return 1;
+	}
+
+	ret = libfrog_fstatmount(file->fd, statmount_flags, smbuf, smbuf_size);
+	if (ret) {
+		perror("statmount");
+		goto out_smbuf;
+	}
+
+	printf("path: %s\n", file->name);
+
+	dump_mountinfo(smbuf, rawflag);
+
+out_smbuf:
+	free(smbuf);
+	return 0;
+}
+
+static const struct cmdinfo statmount_cmd = {
+	.name		= "statmount",
+	.cfunc		= statmount_f,
+	.argmin		= -1,
+	.argmax		= -1,
+	.flags		= CMD_FOREIGN_OK | CMD_NOMAP_OK,
+	.oneline	= N_("statmount the open file"),
+	.help		= statmount_help,
+};
+
+void
+listmount_init(void)
+{
+	add_command(&listmount_cmd);
+	add_command(&statmount_cmd);
+}
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 2090cd4c0b2641..61defcc377163a 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1766,6 +1766,72 @@ .SH FILESYSTEM COMMANDS
 .TP
 .BI "removefsprops " name " [ " names "... ]"
 Remove the given filesystem properties.
+.TP
+.BI "listmount [ \-f " mask " ] [ \-i " mnt_id " ] [ \-n " path " ] [ \-r ] [ \-t " fstype " ]"
+Print information about the mounted filesystems in a particular mount
+namespace.
+The information returned by this call corresponds to the information returned
+by the
+.BR statmount (2)
+system call.
+
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI "\-f " mask
+Pass this numeric argument as the mask argument to
+.BR statmount (2).
+Defaults to all bits set, to retrieve all possible information.
+
+.TP
+.BI "\-i " mnt_id
+Only return information for mounts below this mount in the mount tree.
+Defaults to the root directory.
+
+.TP
+.BI "\-n " path
+Return information for the mount namespace given by this procfs path.
+For a given process, the path will most likely look like
+.BI /proc/ $pid /ns/mnt
+though any path can be provided.
+Defaults to the mount namespace of the
+.B xfs_io
+process itself.
+
+.TP
+.B \-r
+Print raw bitmasks instead of converting them to strings.
+
+.TP
+.BI "\-t " fstype
+Only return information for filesystems of this type.
+If not specified, no filtering is performed.
+.RE
+
+.TP
+.BI "statmount [ \-f " mask " ] [ \-r ]"
+Print information about the mounted filesystem containing the open file.
+The information returned by this call corresponds to the information returned
+by the
+.BR statmount (2)
+system call.
+
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI "\-f " mask
+Pass this numeric argument as the mask argument to
+.BR statmount (2).
+Defaults to all bits set, to retrieve all possible information.
+
+.TP
+.B \-r
+Print raw bitmasks instead of converting them to strings.
+
+.RE
+.PD
 
 .SH OTHER COMMANDS
 .TP


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 24/26] mkfs: enable online repair if all backrefs are enabled
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (22 preceding siblings ...)
  2026-03-19  4:44   ` [PATCH 23/26] xfs_io: add listmount and statmount commands Darrick J. Wong
@ 2026-03-19  4:45   ` Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 25/26] debian/control: listify the build dependencies Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 26/26] debian: enable xfs_healer on the root filesystem by default Darrick J. Wong
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If all backreferences are enabled in the filesystem, then enable online
repair by default if the user didn't supply any other autofsck setting.
Users might as well get full self-repair capability if they're paying
for the extra metadata.

Note that it's up to each distro to enable the systemd services
according to their own service activation policies.  Debian policy is to
enable all systemd services at package installation but they don't
enable online fsck in their Kconfig so the services won't activate.
RHEL and SUSE policy requires sysadmins to enable them explicitly unless
the OS vendor also ships a systemd preset file enabling the services.
Distros without systemd won't get any of the systemd services,
obviously.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
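Not part of the patch: a minimal sketch of the defaulting rule described
above, for readers who want to see it in isolation.  The enum and the
function name are hypothetical stand-ins for the FSPROP_AUTOFSCK_* values
and the cli fields that mkfs uses; only the decision logic mirrors the
hunk below.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the FSPROP_AUTOFSCK_* values used by mkfs. */
enum toy_autofsck {
	TOY_AUTOFSCK_UNSET,
	TOY_AUTOFSCK_NONE,
	TOY_AUTOFSCK_REPAIR,
};

/*
 * Pick the autofsck setting.  An explicit user preference always wins;
 * otherwise repair is selected only when both backreference features
 * (rmapbt and parent pointers) are enabled.
 */
static enum toy_autofsck
toy_pick_autofsck(
	enum toy_autofsck	requested,
	bool			rmapbt,
	bool			parent_pointers)
{
	if (requested != TOY_AUTOFSCK_UNSET)
		return requested;
	if (rmapbt && parent_pointers)
		return TOY_AUTOFSCK_REPAIR;
	return TOY_AUTOFSCK_UNSET;
}
```

With only one of the two features enabled the result stays unset, so no
autofsck property is written, matching the existing check that follows
the new hunk.
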
 mkfs/xfs_mkfs.c |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 527a662f3ac858..f859626afdda36 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -6296,6 +6296,15 @@ main(
 	if (mp->m_sb.sb_agcount > 1)
 		rewrite_secondary_superblocks(mp);

+	/*
+	 * If the filesystem has full backreferences and the user didn't
+	 * express an autofsck preference, enable online repair because they
+	 * might as well get some useful functionality from the extra metadata.
+	 */
+	if (cli.autofsck == FSPROP_AUTOFSCK_UNSET &&
+	    cli.sb_feat.rmapbt && cli.sb_feat.parent_pointers)
+		cli.autofsck = FSPROP_AUTOFSCK_REPAIR;
+
 	if (cli.autofsck != FSPROP_AUTOFSCK_UNSET)
 		set_autofsck(mp, &cli);

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 25/26] debian/control: listify the build dependencies
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (23 preceding siblings ...)
  2026-03-19  4:45   ` [PATCH 24/26] mkfs: enable online repair if all backrefs are enabled Darrick J. Wong
@ 2026-03-19  4:45   ` Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 26/26] debian: enable xfs_healer on the root filesystem by default Darrick J. Wong
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This will make it less gross to add more build deps later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 debian/control |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)


diff --git a/debian/control b/debian/control
index 6473c10be7f7d6..d50960fba205bb 100644
--- a/debian/control
+++ b/debian/control
@@ -3,7 +3,19 @@ Section: admin
 Priority: optional
 Maintainer: XFS Development Team <linux-xfs@vger.kernel.org>
 Uploaders: Nathan Scott <nathans@debian.org>, Anibal Monsalve Salazar <anibal@debian.org>
-Build-Depends: libinih-dev (>= 53), uuid-dev, debhelper (>= 12), gettext, libtool, libedit-dev, libblkid-dev (>= 2.17), linux-libc-dev, libdevmapper-dev, libicu-dev, pkg-config, liburcu-dev, systemd-dev | systemd (<< 253-2~)
+Build-Depends: debhelper (>= 12),
+ gettext,
+ libblkid-dev (>= 2.17),
+ libdevmapper-dev,
+ libedit-dev,
+ libicu-dev,
+ libinih-dev (>= 53),
+ libtool,
+ liburcu-dev,
+ linux-libc-dev,
+ pkg-config,
+ systemd-dev | systemd (<< 253-2~),
+ uuid-dev
 Standards-Version: 4.0.0
 Homepage: https://xfs.wiki.kernel.org/
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 26/26] debian: enable xfs_healer on the root filesystem by default
  2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
                     ` (24 preceding siblings ...)
  2026-03-19  4:45   ` [PATCH 25/26] debian/control: listify the build dependencies Darrick J. Wong
@ 2026-03-19  4:45   ` Darrick J. Wong
  25 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we're finished building autonomous repair, enable the healer
service on the root filesystem by default.  The root filesystem is
mounted by the initrd prior to starting systemd, which is why the
xfs_healer_start service cannot autostart the service for the root
filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible.  Hence we enable this explicitly via the postinst
script.

Note that Debian enables services by default upon package installation,
so this is consistent with their policies.  Their kernel doesn't enable
online fsck, so healer won't do much more than monitor for corruptions
and log them.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 debian/postinst |    8 ++++++++
 debian/prerm    |   13 +++++++++++++
 debian/rules    |    3 ++-
 3 files changed, 23 insertions(+), 1 deletion(-)
 create mode 100644 debian/prerm


diff --git a/debian/postinst b/debian/postinst
index d11c8d94a3cbe4..966dbb7626cab3 100644
--- a/debian/postinst
+++ b/debian/postinst
@@ -21,5 +21,13 @@ case "${1}" in
 esac
 
 #DEBHELPER#
+#
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll enable it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ -d /run/systemd/system ] ; then
+	if [ "$1" = "configure" ] || [ "$1" = "abort-upgrade" ] || [ "$1" = "abort-deconfigure" ] || [ "$1" = "abort-remove" ] ; then
+		/bin/systemctl enable xfs_healer@.service || true
+	fi
+fi
 
 exit 0
diff --git a/debian/prerm b/debian/prerm
new file mode 100644
index 00000000000000..c526dcdd1d7103
--- /dev/null
+++ b/debian/prerm
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+set -e
+
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll disable it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ "$1" = remove ] && [ -d /run/systemd/system ] ; then
+	/bin/systemctl disable xfs_healer@.service || true
+fi
+
+#DEBHELPER#
+
+exit 0
diff --git a/debian/rules b/debian/rules
index 7c9f90e6c483ff..aaf99a95ce3df5 100755
--- a/debian/rules
+++ b/debian/rules
@@ -97,4 +97,5 @@ override_dh_installdocs:
 	dh_installdocs -XCHANGES
 
 override_dh_installsystemd:
-	dh_installsystemd -p xfsprogs --no-restart-after-upgrade --no-stop-on-upgrade system-xfs_scrub.slice xfs_scrub_all.timer
+	dh_installsystemd -p xfsprogs --no-restart-after-upgrade --no-stop-on-upgrade system-xfs_scrub.slice xfs_scrub_all.timer system-xfs_healer.slice
+	dh_installsystemd -p xfsprogs --restart-after-upgrade xfs_healer_start.service


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
@ 2026-03-19  4:45   ` Darrick J. Wong
  2026-03-20  7:12     ` Christoph Hellwig
  2026-03-19  4:46   ` [PATCH 02/22] mkfs: rename byte unit conversion macros Darrick J. Wong
                     ` (20 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow bitmap_free() callers to pass a pointer to a NULL pointer.
This will help subsequent refactorings in xfs_scrub have cleaner
bitmap_free callsites.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
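Not part of the patch: a minimal sketch of the NULL-tolerant free pattern,
assuming a toy structure in place of struct bitmap (the real one wraps an
AVL tree).  All names here are hypothetical, and clearing the caller's
pointer at the end is an assumption of the sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct bitmap. */
struct toy_bitmap {
	unsigned long		*words;
};

/* Free *bmapp if it is set; a NULL bitmap is a no-op. */
static void
toy_bitmap_free(
	struct toy_bitmap	**bmapp)
{
	struct toy_bitmap	*bmap = *bmapp;

	if (!bmap)
		return;

	free(bmap->words);
	free(bmap);

	/* Assumption of this sketch: clear the pointer so repeat calls are safe. */
	*bmapp = NULL;
}
```

Error paths can then call the free function unconditionally, which is what
the repair/rmap.c and scrub/phase5.c hunks below do.
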
 libfrog/bitmap.c |    3 +++
 repair/rmap.c    |    3 +--
 scrub/phase5.c   |    6 ++----
 3 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/libfrog/bitmap.c b/libfrog/bitmap.c
index 0308886d446ff2..6a3d852c25a46c 100644
--- a/libfrog/bitmap.c
+++ b/libfrog/bitmap.c
@@ -109,6 +109,9 @@ bitmap_free(
 	struct bitmap_node	*ext;
 
 	bmap = *bmapp;
+	if (!bmap)
+		return;
+
 	avl_for_each_safe(bmap->bt_tree, node, n) {
 		ext = container_of(node, struct bitmap_node, btn_node);
 		free(ext);
diff --git a/repair/rmap.c b/repair/rmap.c
index e89bd32d63a953..55c2b0928c52b0 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -752,8 +752,7 @@ rmap_commit_agbtree_mappings(
 err:
 	if (agflbp)
 		libxfs_buf_relse(agflbp);
-	if (own_ag_bitmap)
-		bitmap_free(&own_ag_bitmap);
+	bitmap_free(&own_ag_bitmap);
 	return error;
 }
 
diff --git a/scrub/phase5.c b/scrub/phase5.c
index 577dda8064c3a8..52bbbca4b9f06a 100644
--- a/scrub/phase5.c
+++ b/scrub/phase5.c
@@ -897,10 +897,8 @@ _("Filesystem has errors, skipping connectivity checks."));
 	scrub_report_preen_triggers(ctx);
 out_lock:
 	pthread_mutex_destroy(&ncs.lock);
-	if (ncs.new_deferred)
-		bitmap_free(&ncs.new_deferred);
-	if (ncs.cur_deferred)
-		bitmap_free(&ncs.cur_deferred);
+	bitmap_free(&ncs.new_deferred);
+	bitmap_free(&ncs.cur_deferred);
 	return ret;
 }
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 02/22] mkfs: rename byte unit conversion macros
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer Darrick J. Wong
@ 2026-03-19  4:46   ` Darrick J. Wong
  2026-03-20  7:12     ` Christoph Hellwig
  2026-03-19  4:46   ` [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h Darrick J. Wong
                     ` (19 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Rename these macros so that we can promote the generic ones in the next
patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
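Not part of the patch: a small sketch of why the new names fit better.
The macro bodies are copied from the hunk below and yield a count of
filesystem blocks, not bytes.  The helper toy_fs_too_small is hypothetical
and loosely mirrors the 300MB check in validate_supported.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a quantity of mega/giga/terabytes into units of blocks */
#define TERABLOCKS(count, blog)	((uint64_t)(count) << (40 - (blog)))
#define GIGABLOCKS(count, blog)	((uint64_t)(count) << (30 - (blog)))
#define MEGABLOCKS(count, blog)	((uint64_t)(count) << (20 - (blog)))

/*
 * Hypothetical helper: is a filesystem of dblocks blocks, each
 * (1 << blocklog) bytes, smaller than 300MB?
 */
static inline bool
toy_fs_too_small(
	uint64_t		dblocks,
	unsigned int		blocklog)
{
	return dblocks < MEGABLOCKS(300, blocklog);
}
```

For 4096-byte blocks (blocklog 12), MEGABLOCKS(64, 12) is 64 << 8, or
16384 blocks.
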
 mkfs/xfs_mkfs.c |   33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index f859626afdda36..e0f0bb28e12159 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -18,9 +18,10 @@
 #include "proto.h"
 #include <ini.h>
 
-#define TERABYTES(count, blog)	((uint64_t)(count) << (40 - (blog)))
-#define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
-#define MEGABYTES(count, blog)	((uint64_t)(count) << (20 - (blog)))
+/* Convert a quantity of mega/giga/terabytes into units of blocks */
+#define TERABLOCKS(count, blog)	((uint64_t)(count) << (40 - (blog)))
+#define GIGABLOCKS(count, blog)	((uint64_t)(count) << (30 - (blog)))
+#define MEGABLOCKS(count, blog)	((uint64_t)(count) << (20 - (blog)))
 
 /*
  * Realistically, the log should never be smaller than 64MB.  Studies by the
@@ -28,7 +29,7 @@
  * latency of the xlog grant head waitqueue when running a heavy metadata
  * update workload when the log size is at least 64MB.
  */
-#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABYTES(64, (blog)))
+#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABLOCKS(64, (blog)))
 
 /*
  * Use this macro before we have superblock and mount structure to
@@ -3443,7 +3444,7 @@ validate_supported(
 	 *
 	 * 64MB * (8 / 7) * 4 = 293MB
 	 */
-	if (mp->m_sb.sb_dblocks < MEGABYTES(300, mp->m_sb.sb_blocklog)) {
+	if (mp->m_sb.sb_dblocks < MEGABLOCKS(300, mp->m_sb.sb_blocklog)) {
 		fprintf(stderr,
  _("Filesystem must be larger than 300MB.\n"));
 		usage();
@@ -3559,7 +3560,7 @@ _("%s: Volume reports invalid stripe unit (%d) and stripe width (%d), ignoring.\
 				BBTOB(ft->data.sunit), BBTOB(ft->data.swidth));
 			ft->data.sunit = 0;
 			ft->data.swidth = 0;
-		} else if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
+		} else if (cfg->dblocks < GIGABLOCKS(1, cfg->blocklog)) {
 			/*
 			 * Don't use automatic stripe detection if the device
 			 * size is less than 1GB because the performance gains
@@ -4056,7 +4057,7 @@ calc_concurrency_ag_geometry(
 	 */
 	try_threads = nr_threads;
 	try_agsize = cfg->dblocks / try_threads;
-	if (try_agsize < GIGABYTES(4, cfg->blocklog)) {
+	if (try_agsize < GIGABLOCKS(4, cfg->blocklog)) {
 		do {
 			try_threads--;
 			if (try_threads <= def_agcount) {
@@ -4065,7 +4066,7 @@ calc_concurrency_ag_geometry(
 			}
 
 			try_agsize = cfg->dblocks / try_threads;
-		} while (try_agsize < GIGABYTES(4, cfg->blocklog));
+		} while (try_agsize < GIGABLOCKS(4, cfg->blocklog));
 		goto out;
 	}
 
@@ -4413,7 +4414,7 @@ calc_concurrency_rtgroup_geometry(
 	 */
 	try_threads = nr_threads;
 	try_rgsize = cfg->rtblocks / try_threads;
-	if (try_rgsize < GIGABYTES(4, cfg->blocklog)) {
+	if (try_rgsize < GIGABLOCKS(4, cfg->blocklog)) {
 		do {
 			try_threads--;
 			if (try_threads <= def_rgcount) {
@@ -4422,7 +4423,7 @@ calc_concurrency_rtgroup_geometry(
 			}
 
 			try_rgsize = cfg->rtblocks / try_threads;
-		} while (try_rgsize < GIGABYTES(4, cfg->blocklog));
+		} while (try_rgsize < GIGABLOCKS(4, cfg->blocklog));
 		goto out;
 	}
 
@@ -4516,7 +4517,7 @@ _("rgsize (%s) not a multiple of fs blk size (%d)\n"),
 		 * If nobody specified a realtime device or the rtgroup size,
 		 * try 1TB, rounded down to the nearest rt extent.
 		 */
-		cfg->rgsize = TERABYTES(1, cfg->blocklog);
+		cfg->rgsize = TERABLOCKS(1, cfg->blocklog);
 		cfg->rgsize -= cfg->rgsize % cfg->rtextblocks;
 		cfg->rgcount = 0;
 	} else if (cfg->rtblocks < cfg->rtextblocks * 2) {
@@ -4651,7 +4652,7 @@ _("rgsize (%s) not a multiple of fs blk size (%d)\n"),
 					(cfg->rtblocks % cfg->rgcount != 0);
 		} else {
 			/* 256MB zones just like typical SMR HDDs */
-			cfg->rgsize = MEGABYTES(256, cfg->blocklog);
+			cfg->rgsize = MEGABLOCKS(256, cfg->blocklog);
 			cfg->rgcount = cfg->rtblocks / cfg->rgsize +
 					(cfg->rtblocks % cfg->rgsize != 0);
 		}
@@ -4692,9 +4693,9 @@ calculate_imaxpct(
 	 *  - under  1 TB, use XFS_DFL_IMAXIMUM_PCT (25%).
 	 */
 
-	if (cfg->dblocks < TERABYTES(1, cfg->blocklog))
+	if (cfg->dblocks < TERABLOCKS(1, cfg->blocklog))
 		cfg->imaxpct = XFS_DFL_IMAXIMUM_PCT;
-	else if (cfg->dblocks < TERABYTES(50, cfg->blocklog))
+	else if (cfg->dblocks < TERABLOCKS(50, cfg->blocklog))
 		cfg->imaxpct = 5;
 	else
 		cfg->imaxpct = 1;
@@ -5016,7 +5017,7 @@ calc_concurrency_logblocks(
 	 * If this filesystem is smaller than a gigabyte, there's little to be
 	 * gained from making the log larger.
 	 */
-	if (cfg->dblocks < GIGABYTES(1, cfg->blocklog))
+	if (cfg->dblocks < GIGABLOCKS(1, cfg->blocklog))
 		goto out;
 
 	/*
@@ -5279,7 +5280,7 @@ _("max log size %d smaller than min log size %d, filesystem is too small\n"),
 				XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
 
 		/* And for a tiny filesystem, use the absolute minimum size */
-		if (cfg->dblocks < MEGABYTES(300, cfg->blocklog))
+		if (cfg->dblocks < MEGABLOCKS(300, cfg->blocklog))
 			cfg->logblocks = min_logblocks;
 
 		/* Ensure the chosen size fits within log size requirements */


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
  2026-03-19  4:45   ` [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer Darrick J. Wong
  2026-03-19  4:46   ` [PATCH 02/22] mkfs: rename byte unit conversion macros Darrick J. Wong
@ 2026-03-19  4:46   ` Darrick J. Wong
  2026-03-20  7:12     ` Christoph Hellwig
  2026-03-19  4:46   ` [PATCH 04/22] xfs_scrub: report truncated devices as media errors Darrick J. Wong
                     ` (18 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move these byte unit conversion macros to convert.h and amend the macros
to cast to unsigned long long to avoid shifting issues.  Now we can use
these same macros throughout the codebase instead of opencoding shifts
and possibly suffering from integer shifting problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
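Not part of the patch: a minimal sketch of the shifting problem that the
cast avoids, assuming a platform where unsigned int is 32 bits wide.  The
macro bodies are copied from the convert.h hunk below; open_coded_gib is a
hypothetical helper that illustrates an open-coded shift.

```c
#include <stdint.h>

/* Copied from the libfrog/convert.h hunk in this patch. */
#define GIGABYTES(x)	((unsigned long long)(x) << 30)
#define MEGABYTES(x)	((unsigned long long)(x) << 20)

/*
 * Hypothetical open-coded equivalent.  The shift is evaluated in 32-bit
 * unsigned arithmetic, so bits shifted past bit 31 are discarded before
 * the result is widened to the return type.
 */
static inline unsigned long long
open_coded_gib(
	uint32_t		count)
{
	return count << 30;
}
```

Under that assumption, open_coded_gib(4) wraps to zero while GIGABYTES(4)
widens first and keeps the full 64-bit value.
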
 libfrog/convert.h         |    7 +++++++
 db/bmap_inflate.c         |    2 +-
 libfrog/convert.c         |    7 -------
 mdrestore/xfs_mdrestore.c |    5 +++--
 mkfs/xfs_mkfs.c           |    5 +++--
 repair/agbtree.c          |    3 ++-
 repair/bmap_repair.c      |    3 ++-
 repair/xfs_repair.c       |    5 +++--
 scrub/common.c            |   17 +++++++++--------
 scrub/phase8.c            |    3 ++-
 10 files changed, 32 insertions(+), 25 deletions(-)


diff --git a/libfrog/convert.h b/libfrog/convert.h
index 3e5fbe055986a4..223d3e73c9422b 100644
--- a/libfrog/convert.h
+++ b/libfrog/convert.h
@@ -22,4 +22,11 @@ extern uid_t	uid_from_string(char *user);
 extern gid_t	gid_from_string(char *group);
 extern prid_t	prid_from_string(char *project);
 
+#define EXABYTES(x)	((unsigned long long)(x) << 60)
+#define PETABYTES(x)	((unsigned long long)(x) << 50)
+#define TERABYTES(x)	((unsigned long long)(x) << 40)
+#define GIGABYTES(x)	((unsigned long long)(x) << 30)
+#define MEGABYTES(x)	((unsigned long long)(x) << 20)
+#define KILOBYTES(x)	((unsigned long long)(x) << 10)
+
 #endif	/* __LIBFROG_CONVERT_H__ */
diff --git a/db/bmap_inflate.c b/db/bmap_inflate.c
index 1de6d3439ab3d3..cc7c197e788d7f 100644
--- a/db/bmap_inflate.c
+++ b/db/bmap_inflate.c
@@ -437,7 +437,7 @@ bmapinflate_f(
 	struct xfs_trans	*tp;
 	char			*p;
 	unsigned long long	nextents = 0;
-	unsigned long long	dirty_bytes = 60U << 20; /* 60MiB */
+	unsigned long long	dirty_bytes = MEGABYTES(60);
 	unsigned long long	dirty_blocks;
 	unsigned int		resblks;
 	bool			estimate = false;
diff --git a/libfrog/convert.c b/libfrog/convert.c
index 0ceeb389682ae1..65b9e47459e234 100644
--- a/libfrog/convert.c
+++ b/libfrog/convert.c
@@ -173,13 +173,6 @@ cvt_u16(
 	return i;
 }
 
-#define EXABYTES(x)	((long long)(x) << 60)
-#define PETABYTES(x)	((long long)(x) << 50)
-#define TERABYTES(x)	((long long)(x) << 40)
-#define GIGABYTES(x)	((long long)(x) << 30)
-#define MEGABYTES(x)	((long long)(x) << 20)
-#define KILOBYTES(x)	((long long)(x) << 10)
-
 long long
 cvtnum(
 	size_t		blksize,
diff --git a/mdrestore/xfs_mdrestore.c b/mdrestore/xfs_mdrestore.c
index 8858026c87dc97..ce7f596a6184fe 100644
--- a/mdrestore/xfs_mdrestore.c
+++ b/mdrestore/xfs_mdrestore.c
@@ -8,6 +8,7 @@
 #include "xfs_metadump.h"
 #include <libfrog/platform.h>
 #include "libfrog/div64.h"
+#include "libfrog/convert.h"
 
 union mdrestore_headers {
 	__be32				magic;
@@ -92,10 +93,10 @@ final_print_progress(
 	if (!mdrestore.show_progress)
 		goto done;
 
-	if (bytes_read <= (*cursor << 20))
+	if (bytes_read <= MEGABYTES(*cursor))
 		goto done;
 
-	print_progress("%lld MB read", howmany_64(bytes_read, 1U << 20));
+	print_progress("%lld MB read", howmany_64(bytes_read, MEGABYTES(1)));
 
 done:
 	if (mdrestore.progress_since_warning)
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index e0f0bb28e12159..dd8a48c3633ef0 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -15,6 +15,7 @@
 #include "libfrog/dahashselftest.h"
 #include "libfrog/fsproperties.h"
 #include "libfrog/zones.h"
+#include "libfrog/convert.h"
 #include "proto.h"
 #include <ini.h>
 
@@ -558,7 +559,7 @@ static struct opt_params iopts = {
 		  .conflicts = { { NULL, LAST_CONFLICT } },
 		  .convert = true,
 		  .minval = 1,
-		  .maxval = 1ULL << 30, /* 1GiB */
+		  .maxval = GIGABYTES(1),
 		  .defaultval = SUBOPT_NEEDS_VAL,
 		},
 	},
@@ -1576,7 +1577,7 @@ discard_blocks(int fd, uint64_t nsectors, int quiet)
 {
 	uint64_t	offset = 0;
 	/* Discard the device 2G at a time */
-	const uint64_t	step = 2ULL << 30;
+	const uint64_t	step = GIGABYTES(2);
 	const uint64_t	count = BBTOB(nsectors);
 
 	/*
diff --git a/repair/agbtree.c b/repair/agbtree.c
index 983b645e1a35a3..fe28e5e94c575e 100644
--- a/repair/agbtree.c
+++ b/repair/agbtree.c
@@ -6,6 +6,7 @@
 #include <libxfs.h>
 #include "err_protos.h"
 #include "libfrog/bitmap.h"
+#include "libfrog/convert.h"
 #include "slab.h"
 #include "rmap.h"
 #include "incore.h"
@@ -23,7 +24,7 @@ init_rebuild(
 	memset(btr, 0, sizeof(struct bt_rebuild));
 
 	bulkload_init_ag(&btr->newbt, sc, oinfo, NULLFSBLOCK);
-	btr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
+	btr->bload.max_dirty = XFS_B_TO_FSBT(sc->mp, KILOBYTES(256));
 	bulkload_estimate_ag_slack(sc, &btr->bload, est_agfreeblocks);
 }
 
diff --git a/repair/bmap_repair.c b/repair/bmap_repair.c
index 5d1f639be81ff4..192f189de05217 100644
--- a/repair/bmap_repair.c
+++ b/repair/bmap_repair.c
@@ -15,6 +15,7 @@
 #include "bulkload.h"
 #include "bmap_repair.h"
 #include "libfrog/util.h"
+#include "libfrog/convert.h"
 
 /*
  * Inode Fork Block Mapping (BMBT) Repair
@@ -499,7 +500,7 @@ xrep_bmap_btree_load(
 	rb->bmap_bload.get_records = xrep_bmap_get_records;
 	rb->bmap_bload.claim_block = xrep_bmap_claim_block;
 	rb->bmap_bload.iroot_size = xrep_bmap_iroot_size;
-	rb->bmap_bload.max_dirty = XFS_B_TO_FSBT(sc->mp, 256U << 10); /* 256K */
+	rb->bmap_bload.max_dirty = XFS_B_TO_FSBT(sc->mp, KILOBYTES(256));
 
 	/*
 	 * Always make the btree as small as possible, since we might need the
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 7bf75c09b94542..6b97a806384096 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -28,6 +28,7 @@
 #include "quotacheck.h"
 #include "rcbag_btree.h"
 #include "rt.h"
+#include "libfrog/convert.h"
 
 /*
  * option tables for getsubopt calls
@@ -1321,8 +1322,8 @@ main(int argc, char **argv)
 		}
 
 		max_mem -= mem_used;
-		if (max_mem >= (1 << 30))
-			max_mem = 1 << 30;
+		if (max_mem >= GIGABYTES(1))
+			max_mem = GIGABYTES(1);
 		libxfs_bhash_size = max_mem / (HASH_CACHE_RATIO *
 				(igeo->inode_cluster_size >> 10));
 		if (libxfs_bhash_size < 512)
diff --git a/scrub/common.c b/scrub/common.c
index a567d2a3c2f14e..34d91525928305 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -11,6 +11,7 @@
 #include "libfrog/paths.h"
 #include "libfrog/getparents.h"
 #include "libfrog/handle_priv.h"
+#include "libfrog/convert.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "progress.h"
@@ -218,18 +219,18 @@ auto_space_units(
 {
 	if (debug > 1)
 		goto no_prefix;
-	if (bytes > (1ULL << 40)) {
+	if (bytes > TERABYTES(1)) {
 		*units = "TiB";
-		return (double)bytes / (1ULL << 40);
-	} else if (bytes > (1ULL << 30)) {
+		return (double)bytes / TERABYTES(1);
+	} else if (bytes > GIGABYTES(1)) {
 		*units = "GiB";
-		return (double)bytes / (1ULL << 30);
-	} else if (bytes > (1ULL << 20)) {
+		return (double)bytes / GIGABYTES(1);
+	} else if (bytes > MEGABYTES(1)) {
 		*units = "MiB";
-		return (double)bytes / (1ULL << 20);
-	} else if (bytes > (1ULL << 10)) {
+		return (double)bytes / MEGABYTES(1);
+	} else if (bytes > KILOBYTES(1)) {
 		*units = "KiB";
-		return (double)bytes / (1ULL << 10);
+		return (double)bytes / KILOBYTES(1);
 	}
 
 no_prefix:
diff --git a/scrub/phase8.c b/scrub/phase8.c
index e8c72d8eb851af..0967832da890e3 100644
--- a/scrub/phase8.c
+++ b/scrub/phase8.c
@@ -12,6 +12,7 @@
 #include "libfrog/paths.h"
 #include "libfrog/workqueue.h"
 #include "libfrog/histogram.h"
+#include "libfrog/convert.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "progress.h"
@@ -51,7 +52,7 @@ fstrim_ok(
  * call so that we can implement decent progress reporting and CPU resource
  * control.  Pick a prime number of gigabytes for interest.
  */
-#define FSTRIM_MAX_BYTES	(11ULL << 30)
+#define FSTRIM_MAX_BYTES	GIGABYTES(11)
 
 /* Trim a certain range of the filesystem. */
 static int


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 04/22] xfs_scrub: report truncated devices as media errors
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-03-19  4:46   ` [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h Darrick J. Wong
@ 2026-03-19  4:46   ` Darrick J. Wong
  2026-03-20  7:13     ` Christoph Hellwig
  2026-03-19  4:46   ` [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value Darrick J. Wong
                     ` (17 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we encounter a zero-length read of an xfs device but don't hit any
actual media errors, we won't report the truncated device.  Fix that.
Also fix the bug where a failure to flush the verify pools of the log/rt
devices doesn't actually cause scrub to abort.

Cc: <linux-xfs@vger.kernel.org> # v6.13.0
Fixes: a6e089903f2f58 ("xfs_scrub: tread zero-length read verify as an IO error")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase6.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index abf6f9713f1a4d..590e5d23e6b267 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -782,9 +782,13 @@ phase6_func(
 	 * If the verify flush didn't work or we found no bad blocks, we're
 	 * done!  No errors detected.
 	 */
-	if (ret || ret2 || ret3)
+	if (ret || ret2 || ret3) {
+		ret |= ret2 | ret3; /* caller only cares about non-zero/zero */
 		goto out_rbad;
-	if (bitmap_empty(vs.d_bad) && bitmap_empty(vs.r_bad))
+	}
+	if (bitmap_empty(vs.d_bad) && !vs.d_trunc &&
+	    bitmap_empty(vs.r_bad) && !vs.r_trunc &&
+	    !vs.l_trunc)
 		goto out_rbad;
 
 	/* Scan the whole dir tree to see what matches the bad extents. */



* [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (3 preceding siblings ...)
  2026-03-19  4:46   ` [PATCH 04/22] xfs_scrub: report truncated devices as media errors Darrick J. Wong
@ 2026-03-19  4:46   ` Darrick J. Wong
  2026-03-20  7:13     ` Christoph Hellwig
  2026-03-19  4:47   ` [PATCH 06/22] scrub: remove the unused io_disk field in struct read_verify Darrick J. Wong
                     ` (16 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The function decode_special_owner turns a special fsmap owner into a
printable string value.  However, it does not query the gettext catalog
for a local language translation, which leads to failure messages that
mix translated and untranslated text.  Fix that by adding the appropriate
gettext wrappers.
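
For illustration only (not part of the patch), a minimal sketch of the
usual gettext pattern for static string tables: N_() marks a string for
extraction without translating it, and _() does the catalog lookup at
runtime.  The macro definitions and the demo_* names below are
assumptions made for this sketch; xfsprogs provides its own definitions
of these macros.

```c
#include <libintl.h>
#include <stddef.h>

/* Assumed definitions for this sketch only. */
#define _(str)		gettext(str)
#define N_(str)		(str)

struct demo_label {
	int		id;
	const char	*descr;
};

/*
 * Static initializers cannot call gettext(), so the strings are only
 * marked for extraction here.
 */
static const struct demo_label demo_labels[] = {
	{ 1, N_("free space") },
	{ 2, N_("bad blocks") },
	{ 0, NULL },
};

/* Translate at lookup time, when the message catalog is available. */
static const char *
demo_decode_label(
	int				id)
{
	const struct demo_label		*l;

	for (l = demo_labels; l->descr; l++) {
		if (l->id == id)
			return _(l->descr);
	}
	return NULL;
}
```

If no catalog is loaded, gettext() hands back the original string, so
the lookup degrades to the untranslated text.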

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: b364a9c008fc04 ("xfs_scrub: scrub file data blocks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase6.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index 590e5d23e6b267..41b41aab7e2578 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -123,16 +123,16 @@ struct owner_decode {
 };
 
 static const struct owner_decode special_owners[] = {
-	{XFS_FMR_OWN_FREE,	"free space"},
-	{XFS_FMR_OWN_UNKNOWN,	"unknown owner"},
-	{XFS_FMR_OWN_FS,	"static FS metadata"},
-	{XFS_FMR_OWN_LOG,	"journalling log"},
-	{XFS_FMR_OWN_AG,	"per-AG metadata"},
-	{XFS_FMR_OWN_INOBT,	"inode btree blocks"},
-	{XFS_FMR_OWN_INODES,	"inodes"},
-	{XFS_FMR_OWN_REFC,	"refcount btree"},
-	{XFS_FMR_OWN_COW,	"CoW staging"},
-	{XFS_FMR_OWN_DEFECTIVE,	"bad blocks"},
+	{XFS_FMR_OWN_FREE,	N_("free space")},
+	{XFS_FMR_OWN_UNKNOWN,	N_("unknown owner")},
+	{XFS_FMR_OWN_FS,	N_("static FS metadata")},
+	{XFS_FMR_OWN_LOG,	N_("journalling log")},
+	{XFS_FMR_OWN_AG,	N_("per-AG metadata")},
+	{XFS_FMR_OWN_INOBT,	N_("inode btree blocks")},
+	{XFS_FMR_OWN_INODES,	N_("inodes")},
+	{XFS_FMR_OWN_REFC,	N_("refcount btree")},
+	{XFS_FMR_OWN_COW,	N_("CoW staging")},
+	{XFS_FMR_OWN_DEFECTIVE,	N_("bad blocks")},
 	{0, NULL},
 };
 
@@ -145,7 +145,7 @@ decode_special_owner(
 
 	while (od->descr) {
 		if (od->owner == owner)
-			return od->descr;
+			return _(od->descr);
 		od++;
 	}
 



* [PATCH 06/22] scrub: remove the unused io_disk field in struct read_verify
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (4 preceding siblings ...)
  2026-03-19  4:46   ` [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value Darrick J. Wong
@ 2026-03-19  4:47   ` Darrick J. Wong
  2026-03-19  4:47   ` [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c Darrick J. Wong
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.c |    1 -
 1 file changed, 1 deletion(-)


diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 1219efe2590182..56df7a40d6c07a 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -51,7 +51,6 @@ rvp_io_max_size(void)
 
 struct read_verify {
 	void			*io_end_arg;
-	struct disk		*io_disk;
 	uint64_t		io_start;	/* bytes */
 	uint64_t		io_length;	/* bytes */
 };



* [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (5 preceding siblings ...)
  2026-03-19  4:47   ` [PATCH 06/22] scrub: remove the unused io_disk field in struct read_verify Darrick J. Wong
@ 2026-03-19  4:47   ` Darrick J. Wong
  2026-03-20  7:14     ` Christoph Hellwig
  2026-03-19  4:47   ` [PATCH 08/22] scrub: simplify the read_verify_pool_alloc interface Darrick J. Wong
                     ` (14 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Right now there's a weird coupling between read_verify.c and spacemap.c:
Anyone using a read_verify_pool is required to tell the pool how many
threads it's going to use to call read_verify_schedule_io.  This is
because the read_verify_pool accumulates verification requests on a
per-thread basis to try to combine adjacent written regions for media
verification.  However, the verification requests are made from the
phase6.c callback (check_rmap) that is called from the workers created
by scrub_scan_all_spacemaps.

Yeah, that's confusing: implementation details of spacemap.c must be
inferred by phase6.c and passed to read_verify.c.

Let's fix this by moving the per-thread schedule accumulation to
phase6.c before the next patches constrain the number of IO threads
sending verification requests to the kernel.
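
For illustration only (not part of the patch), a simplified sketch of
the caller-side pattern this introduces: try to merge the new extent
into the caller's own schedule, and if that fails, issue the stashed
request and start over with an empty schedule.  All demo_* names are
hypothetical; the real code keeps one schedule per thread in a ptvar and
queues work to a read_verify_pool instead of counting bytes.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BATCH_LOCALITY	65536ULL	/* merge extents this close */

struct demo_schedule {
	bool		active;		/* is an IO stashed here? */
	uint64_t	start;		/* bytes */
	uint64_t	length;		/* bytes */
};

/* Stand-in for the verifier pool: just counts what was issued. */
struct demo_sink {
	unsigned int	nr_issued;
	uint64_t	bytes_issued;
};

/* Returns true if the extent was stashed or merged into the schedule. */
static bool
demo_try_schedule(struct demo_schedule *s, uint64_t start, uint64_t length)
{
	uint64_t	req_end = start + length;
	uint64_t	cur_end = s->start + s->length;

	if (!s->active) {
		s->active = true;
		s->start = start;
		s->length = length;
		return true;
	}

	if ((start >= s->start && start <= cur_end + DEMO_BATCH_LOCALITY) ||
	    (s->start >= start && s->start <= req_end + DEMO_BATCH_LOCALITY)) {
		uint64_t	new_start = s->start < start ? s->start : start;
		uint64_t	new_end = req_end > cur_end ? req_end : cur_end;

		s->start = new_start;
		s->length = new_end - new_start;
		return true;
	}

	return false;
}

/* Issue whatever is stashed and reset the schedule. */
static void
demo_schedule_now(struct demo_schedule *s, struct demo_sink *out)
{
	if (!s->active)
		return;
	out->nr_issued++;
	out->bytes_issued += s->length;
	s->active = false;
	s->length = 0;
}

/* What a per-extent callback would do with its own schedule. */
static void
demo_add_extent(struct demo_schedule *s, struct demo_sink *out,
		uint64_t start, uint64_t length)
{
	if (demo_try_schedule(s, start, length))
		return;
	demo_schedule_now(s, out);
	/* The schedule is empty now, so this attempt always succeeds. */
	demo_try_schedule(s, start, length);
}
```

Because the schedule belongs to the caller, the pool no longer needs to
know how many submitter threads exist.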

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/read_verify.h |   16 +++++--
 scrub/spacemap.h    |    5 ++
 scrub/phase6.c      |   63 ++++++++++++++++++++++-----
 scrub/read_verify.c |  121 ++++++++++++++++++---------------------------------
 4 files changed, 112 insertions(+), 93 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index 9d34d839c978b4..383823c3b5c640 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -10,6 +10,13 @@ struct scrub_ctx;
 struct read_verify_pool;
 struct disk;
 
+struct read_verify_schedule {
+	struct read_verify_pool	*rvp;
+	void			*io_end_arg;
+	uint64_t		io_start;	/* bytes */
+	uint64_t		io_length;	/* bytes */
+};
+
 /* Function called when an IO error happens. */
 typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
 		struct disk *disk, uint64_t start, uint64_t length,
@@ -17,15 +24,16 @@ typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
 
 int read_verify_pool_alloc(struct scrub_ctx *ctx, struct disk *disk,
 		size_t miniosz, read_verify_ioerr_fn_t ioerr_fn,
-		unsigned int submitter_threads,
 		struct read_verify_pool **prvp);
 void read_verify_pool_abort(struct read_verify_pool *rvp);
 int read_verify_pool_flush(struct read_verify_pool *rvp);
 void read_verify_pool_destroy(struct read_verify_pool *rvp);
 
-int read_verify_schedule_io(struct read_verify_pool *rvp, uint64_t start,
-		uint64_t length, void *end_arg);
-int read_verify_force_io(struct read_verify_pool *rvp);
+int read_verify_schedule_now(struct read_verify_schedule *rs);
+bool try_read_verify_schedule_io(struct read_verify_schedule *rs,
+		struct read_verify_pool *rvp, uint64_t start, uint64_t length,
+		void *end_arg);
+
 int read_verify_bytes(struct read_verify_pool *rvp, uint64_t *bytes);
 
 #endif /* XFS_SCRUB_READ_VERIFY_H_ */
diff --git a/scrub/spacemap.h b/scrub/spacemap.h
index 51975341b16d6b..759d0f89089a23 100644
--- a/scrub/spacemap.h
+++ b/scrub/spacemap.h
@@ -18,4 +18,9 @@ int scrub_iterate_fsmap(struct scrub_ctx *ctx, struct fsmap *keys,
 int scrub_scan_all_spacemaps(struct scrub_ctx *ctx, scrub_fsmap_iter_fn fn,
 		void *arg);
 
+static inline unsigned int scrub_scan_spacemaps_nproc(struct scrub_ctx *ctx)
+{
+	return scrub_nproc(ctx);
+}
+
 #endif /* XFS_SCRUB_SPACEMAP_H_ */
diff --git a/scrub/phase6.c b/scrub/phase6.c
index 41b41aab7e2578..a28ebee8e7b272 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -23,6 +23,7 @@
 #include "vfs.h"
 #include "common.h"
 #include "libfrog/bulkstat.h"
+#include "libfrog/ptvar.h"
 
 /*
  * Phase 6: Verify data file integrity.
@@ -39,6 +40,8 @@
 /* Verify disk blocks with GETFSMAP */
 
 struct media_verify_state {
+	struct ptvar		*verify_schedules;
+
 	struct read_verify_pool	*rvp_data;
 	struct read_verify_pool	*rvp_log;
 	struct read_verify_pool	*rvp_realtime;
@@ -604,6 +607,8 @@ check_rmap(
 {
 	struct media_verify_state	*vs = arg;
 	struct read_verify_pool		*rvp;
+	struct read_verify_schedule	*rs;
+	bool				scheduled;
 	int				ret;
 
 	rvp = dev_to_pool(ctx, vs, map->fmr_device);
@@ -632,17 +637,42 @@ check_rmap(
 
 	/* XXX: Filter out directory data blocks. */
 
+	rs = ptvar_get(vs->verify_schedules, &ret);
+	if (ret) {
+		str_liberror(ctx, -ret, _("grabbing media verify schedule"));
+		return -ret;
+	}
+
 	/* Schedule the read verify command for (eventual) running. */
-	ret = read_verify_schedule_io(rvp, map->fmr_physical, map->fmr_length,
-			vs);
+	scheduled = try_read_verify_schedule_io(rs, rvp, map->fmr_physical,
+			map->fmr_length, vs);
+	if (scheduled)
+		return 0;
+
+	ret = read_verify_schedule_now(rs);
 	if (ret) {
 		str_liberror(ctx, ret, _("scheduling media verify command"));
 		return ret;
 	}
 
+	scheduled = try_read_verify_schedule_io(rs, rvp, map->fmr_physical,
+			map->fmr_length, vs);
+	assert(scheduled);
 	return 0;
 }
 
+/* Initiate any scheduled verifications now. */
+static int
+force_one_verify(
+	struct ptvar			*ptv,
+	void				*data,
+	void				*foreach_arg)
+{
+	struct read_verify_schedule	*rs = data;
+
+	return read_verify_schedule_now(rs);
+}
+
 /* Wait for read/verify actions to finish, then return # bytes checked. */
 static int
 clean_pool(
@@ -655,10 +685,6 @@ clean_pool(
 	if (!rvp)
 		return 0;
 
-	ret = read_verify_force_io(rvp);
-	if (ret)
-		return ret;
-
 	ret = read_verify_pool_flush(rvp);
 	if (ret)
 		goto out_destroy;
@@ -737,7 +763,7 @@ phase6_func(
 
 	ret = read_verify_pool_alloc(ctx, ctx->datadev,
 			ctx->mnt.fsgeom.blocksize, remember_ioerr,
-			scrub_nproc(ctx), &vs.rvp_data);
+			&vs.rvp_data);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
 		goto out_rbad;
@@ -745,7 +771,7 @@ phase6_func(
 	if (ctx->logdev) {
 		ret = read_verify_pool_alloc(ctx, ctx->logdev,
 				ctx->mnt.fsgeom.blocksize, remember_ioerr,
-				scrub_nproc(ctx), &vs.rvp_log);
+				&vs.rvp_log);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating logdev media verifier"));
@@ -755,17 +781,32 @@ phase6_func(
 	if (ctx->rtdev) {
 		ret = read_verify_pool_alloc(ctx, ctx->rtdev,
 				ctx->mnt.fsgeom.blocksize, remember_ioerr,
-				scrub_nproc(ctx), &vs.rvp_realtime);
+				&vs.rvp_realtime);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating rtdev media verifier"));
 			goto out_logpool;
 		}
 	}
-	ret = scrub_scan_all_spacemaps(ctx, check_rmap, &vs);
+
+	ret = -ptvar_alloc(scrub_scan_spacemaps_nproc(ctx),
+			sizeof(struct read_verify_schedule), NULL,
+			&vs.verify_schedules);
 	if (ret)
 		goto out_rtpool;
 
+	ret = scrub_scan_all_spacemaps(ctx, check_rmap, &vs);
+	if (ret)
+		goto out_schedules;
+
+	ret = -ptvar_foreach(vs.verify_schedules, force_one_verify, NULL);
+	if (ret) {
+		str_liberror(ctx, ret, _("flushing read verify commands"));
+		goto out_schedules;
+	}
+	ptvar_free(vs.verify_schedules);
+	vs.verify_schedules = NULL;
+
 	ret = clean_pool(vs.rvp_data, &ctx->bytes_checked);
 	if (ret)
 		str_liberror(ctx, ret, _("flushing datadev verify pool"));
@@ -798,6 +839,8 @@ phase6_func(
 	bitmap_free(&vs.d_bad);
 	return ret;
 
+out_schedules:
+	ptvar_free(vs.verify_schedules);
 out_rtpool:
 	if (vs.rvp_realtime) {
 		read_verify_pool_abort(vs.rvp_realtime);
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 56df7a40d6c07a..2447ed272734be 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -7,7 +7,6 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <sys/statvfs.h>
-#include "libfrog/ptvar.h"
 #include "libfrog/workqueue.h"
 #include "libfrog/paths.h"
 #include "xfs_scrub.h"
@@ -60,7 +59,6 @@ struct read_verify_pool {
 	struct scrub_ctx	*ctx;		/* scrub context */
 	void			*readbuf;	/* read buffer */
 	struct ptcounter	*verified_bytes;
-	struct ptvar		*rvstate;	/* combines read requests */
 	struct disk		*disk;		/* which disk? */
 	read_verify_ioerr_fn_t	ioerr_fn;	/* io error callback */
 	size_t			miniosz;	/* minimum io size, bytes */
@@ -78,8 +76,6 @@ struct read_verify_pool {
  * @disk is the disk we want to verify.
  * @miniosz is the minimum size of an IO to expect (in bytes).
  * @ioerr_fn will be called when IO errors occur.
- * @submitter_threads is the number of threads that may be sending verify
- * requests at any given time.
  */
 int
 read_verify_pool_alloc(
@@ -87,7 +83,6 @@ read_verify_pool_alloc(
 	struct disk			*disk,
 	size_t				miniosz,
 	read_verify_ioerr_fn_t		ioerr_fn,
-	unsigned int			submitter_threads,
 	struct read_verify_pool		**prvp)
 {
 	struct read_verify_pool		*rvp;
@@ -118,19 +113,13 @@ read_verify_pool_alloc(
 	rvp->ctx = ctx;
 	rvp->disk = disk;
 	rvp->ioerr_fn = ioerr_fn;
-	ret = -ptvar_alloc(submitter_threads, sizeof(struct read_verify),
-			NULL, &rvp->rvstate);
-	if (ret)
-		goto out_counter;
 	ret = -workqueue_create(&rvp->wq, (struct xfs_mount *)rvp,
 			verifier_threads == 1 ? 0 : verifier_threads);
 	if (ret)
-		goto out_rvstate;
+		goto out_counter;
 	*prvp = rvp;
 	return 0;
 
-out_rvstate:
-	ptvar_free(rvp->rvstate);
 out_counter:
 	ptcounter_free(rvp->verified_bytes);
 out_buf:
@@ -164,7 +153,6 @@ read_verify_pool_destroy(
 	struct read_verify_pool		*rvp)
 {
 	workqueue_destroy(&rvp->wq);
-	ptvar_free(rvp->rvstate);
 	ptcounter_free(rvp->verified_bytes);
 	free(rvp->readbuf);
 	free(rvp);
@@ -285,17 +273,20 @@ read_verify(
 		rvp->runtime_error = ret;
 }
 
-/* Queue a read verify request. */
-static int
-read_verify_queue(
-	struct read_verify_pool		*rvp,
-	struct read_verify		*rv)
+/* Queue a read verify request immediately. */
+int
+read_verify_schedule_now(
+	struct read_verify_schedule	*rs)
 {
+	struct read_verify_pool		*rvp = rs->rvp;
 	struct read_verify		*tmp;
 	bool				ret;
 
+	if (!rvp)
+		return 0;
+
 	dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
-			rvp->disk->d_fd, rv->io_start, rv->io_length);
+			rvp->disk->d_fd, rs->io_start, rs->io_length);
 
 	/* Worker thread saw a runtime error, don't queue more. */
 	if (rvp->runtime_error)
@@ -308,7 +299,9 @@ read_verify_queue(
 		return errno;
 	}
 
-	memcpy(tmp, rv, sizeof(*tmp));
+	tmp->io_end_arg = rs->io_end_arg;
+	tmp->io_start = rs->io_start;
+	tmp->io_length = rs->io_length;
 
 	ret = -workqueue_add(&rvp->wq, read_verify, 0, tmp);
 	if (ret) {
@@ -317,25 +310,27 @@ read_verify_queue(
 		return ret;
 	}
 
-	rv->io_length = 0;
+	/* Reset the schedule */
+	rs->rvp = NULL;
+	rs->io_length = 0;
 	return 0;
 }
 
 /*
- * Issue an IO request.  We'll batch subsequent requests if they're
- * within 64k of each other
+ * Schedule a read verification request.  We'll batch subsequent requests if
+ * they're within 64k of each other.  Returns true if the schedule was updated,
+ * or false if the caller should call read_verify_schedule_now().
  */
-int
-read_verify_schedule_io(
+bool
+try_read_verify_schedule_io(
+	struct read_verify_schedule	*rs,
 	struct read_verify_pool		*rvp,
 	uint64_t			start,
 	uint64_t			length,
 	void				*end_arg)
 {
-	struct read_verify		*rv;
 	uint64_t			req_end;
 	uint64_t			rv_end;
-	int				ret;
 
 	assert(rvp->readbuf);
 
@@ -343,67 +338,35 @@ read_verify_schedule_io(
 	start &= ~(rvp->miniosz - 1);
 	length = roundup(length, rvp->miniosz);
 
-	rv = ptvar_get(rvp->rvstate, &ret);
-	if (ret)
-		return -ret;
 	req_end = start + length;
-	rv_end = rv->io_start + rv->io_length;
+	rv_end = rs->io_start + rs->io_length;
+
+	/* If the schedule is empty, stash the new IO. */
+	if (!rs->rvp) {
+		rs->rvp = rvp;
+		rs->io_start = start;
+		rs->io_length = length;
+		rs->io_end_arg = end_arg;
+
+		return true;
+	}
 
 	/*
-	 * If we have a stashed IO, we haven't changed fds, the error
+	 * If we have a stashed IO, we haven't changed pools, the error
 	 * reporting is the same, and the two extents are close,
 	 * we can combine them.
 	 */
-	if (rv->io_length > 0 &&
-	    end_arg == rv->io_end_arg &&
-	    ((start >= rv->io_start && start <= rv_end + RVP_IO_BATCH_LOCALITY) ||
-	     (rv->io_start >= start &&
-	      rv->io_start <= req_end + RVP_IO_BATCH_LOCALITY))) {
-		rv->io_start = min(rv->io_start, start);
-		rv->io_length = max(req_end, rv_end) - rv->io_start;
-	} else  {
-		/* Otherwise, issue the stashed IO (if there is one) */
-		if (rv->io_length > 0) {
-			int	res;
+	if (rs->rvp == rvp && rs->io_length > 0 && end_arg == rs->io_end_arg &&
+	    ((start >= rs->io_start && start <= rv_end + RVP_IO_BATCH_LOCALITY) ||
+	     (rs->io_start >= start &&
+	      rs->io_start <= req_end + RVP_IO_BATCH_LOCALITY))) {
+		rs->io_start = min(rs->io_start, start);
+		rs->io_length = max(req_end, rv_end) - rs->io_start;
 
-			res = read_verify_queue(rvp, rv);
-			if (res)
-				return res;
-		}
-
-		/* Stash the new IO. */
-		rv->io_start = start;
-		rv->io_length = length;
-		rv->io_end_arg = end_arg;
+		return true;
 	}
 
-	return 0;
-}
-
-/* Force any per-thread stashed IOs into the verifier. */
-static int
-force_one_io(
-	struct ptvar		*ptv,
-	void			*data,
-	void			*foreach_arg)
-{
-	struct read_verify_pool	*rvp = foreach_arg;
-	struct read_verify	*rv = data;
-
-	if (rv->io_length == 0)
-		return 0;
-
-	return -read_verify_queue(rvp, rv);
-}
-
-/* Force any stashed IOs into the verifier. */
-int
-read_verify_force_io(
-	struct read_verify_pool		*rvp)
-{
-	assert(rvp->readbuf);
-
-	return -ptvar_foreach(rvp->rvstate, force_one_io, rvp);
+	return false;
 }
 
 /* How many bytes has this process verified? */



* [PATCH 08/22] scrub: simplify the read_verify_pool_alloc interface
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (6 preceding siblings ...)
  2026-03-19  4:47   ` [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c Darrick J. Wong
@ 2026-03-19  4:47   ` Darrick J. Wong
  2026-03-19  4:47   ` [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere Darrick J. Wong
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

Don't pass miniosz, as that's always the fs block size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: rebase atop another read verify cleanup]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.h |    2 +-
 scrub/phase6.c      |    9 +++------
 scrub/read_verify.c |   12 ++----------
 3 files changed, 6 insertions(+), 17 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index 383823c3b5c640..a46244334268de 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -23,7 +23,7 @@ typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
 		int error, void *arg);
 
 int read_verify_pool_alloc(struct scrub_ctx *ctx, struct disk *disk,
-		size_t miniosz, read_verify_ioerr_fn_t ioerr_fn,
+		read_verify_ioerr_fn_t ioerr_fn,
 		struct read_verify_pool **prvp);
 void read_verify_pool_abort(struct read_verify_pool *rvp);
 int read_verify_pool_flush(struct read_verify_pool *rvp);
diff --git a/scrub/phase6.c b/scrub/phase6.c
index a28ebee8e7b272..bfd5659590d603 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -761,16 +761,14 @@ phase6_func(
 		goto out_dbad;
 	}
 
-	ret = read_verify_pool_alloc(ctx, ctx->datadev,
-			ctx->mnt.fsgeom.blocksize, remember_ioerr,
+	ret = read_verify_pool_alloc(ctx, ctx->datadev, remember_ioerr,
 			&vs.rvp_data);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
 		goto out_rbad;
 	}
 	if (ctx->logdev) {
-		ret = read_verify_pool_alloc(ctx, ctx->logdev,
-				ctx->mnt.fsgeom.blocksize, remember_ioerr,
+		ret = read_verify_pool_alloc(ctx, ctx->logdev, remember_ioerr,
 				&vs.rvp_log);
 		if (ret) {
 			str_liberror(ctx, ret,
@@ -779,8 +777,7 @@ phase6_func(
 		}
 	}
 	if (ctx->rtdev) {
-		ret = read_verify_pool_alloc(ctx, ctx->rtdev,
-				ctx->mnt.fsgeom.blocksize, remember_ioerr,
+		ret = read_verify_pool_alloc(ctx, ctx->rtdev, remember_ioerr,
 				&vs.rvp_realtime);
 		if (ret) {
 			str_liberror(ctx, ret,
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 2447ed272734be..128df2be86898b 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -74,14 +74,12 @@ struct read_verify_pool {
  * Create a thread pool to run read verifiers.
  *
  * @disk is the disk we want to verify.
- * @miniosz is the minimum size of an IO to expect (in bytes).
  * @ioerr_fn will be called when IO errors occur.
  */
 int
 read_verify_pool_alloc(
 	struct scrub_ctx		*ctx,
 	struct disk			*disk,
-	size_t				miniosz,
 	read_verify_ioerr_fn_t		ioerr_fn,
 	struct read_verify_pool		**prvp)
 {
@@ -89,13 +87,7 @@ read_verify_pool_alloc(
 	unsigned int			verifier_threads = disk_heads(disk);
 	int				ret;
 
-	/*
-	 * The minimum IO size must be a multiple of the disk sector size
-	 * and a factor of the max io size.
-	 */
-	if (miniosz % disk->d_lbasize)
-		return EINVAL;
-	if (rvp_io_max_size() % miniosz)
+	if (rvp_io_max_size() % ctx->mnt.fsgeom.blocksize)
 		return EINVAL;
 
 	rvp = calloc(1, sizeof(struct read_verify_pool));
@@ -109,7 +101,7 @@ read_verify_pool_alloc(
 	ret = ptcounter_alloc(verifier_threads, &rvp->verified_bytes);
 	if (ret)
 		goto out_buf;
-	rvp->miniosz = miniosz;
+	rvp->miniosz = ctx->mnt.fsgeom.blocksize;
 	rvp->ctx = ctx;
 	rvp->disk = disk;
 	rvp->ioerr_fn = ioerr_fn;



* [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (7 preceding siblings ...)
  2026-03-19  4:47   ` [PATCH 08/22] scrub: simplify the read_verify_pool_alloc interface Darrick J. Wong
@ 2026-03-19  4:47   ` Darrick J. Wong
  2026-03-20  7:14     ` Christoph Hellwig
  2026-03-19  4:48   ` [PATCH 10/22] scrub: use enum xfs_device for read verification Darrick J. Wong
                     ` (12 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The value is the same across all callers and read-verify pools, so let's
just pass it into read_verify_pool_alloc instead of making temporary
aliases everywhere and increasing memory usage.
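
For illustration only (not part of the patch), a minimal sketch of the
idea: the callback argument is stored once in the pool at allocation
time, so each queued request only needs to carry its byte range.  The
demo_* names are hypothetical and the callback signature is simplified
compared to read_verify_ioerr_fn_t.

```c
#include <errno.h>
#include <stdint.h>

typedef void (*demo_ioerr_fn_t)(uint64_t start, uint64_t length, int error,
		void *arg);

/* The callback and its argument are per-pool state. */
struct demo_pool {
	demo_ioerr_fn_t	ioerr_fn;
	void		*ioerr_arg;
};

/* A request no longer carries its own copy of the callback argument. */
struct demo_request {
	uint64_t	start;		/* bytes */
	uint64_t	length;		/* bytes */
};

static void
demo_pool_init(struct demo_pool *pool, demo_ioerr_fn_t fn, void *arg)
{
	pool->ioerr_fn = fn;
	pool->ioerr_arg = arg;
}

/* Report an error using the argument remembered by the pool. */
static void
demo_report_error(const struct demo_pool *pool,
		const struct demo_request *rq, int error)
{
	pool->ioerr_fn(rq->start, rq->length, error, pool->ioerr_arg);
}

/* Example callback: count errors in the caller-supplied counter. */
static void
demo_count_error(uint64_t start, uint64_t length, int error, void *arg)
{
	unsigned int	*nr_errors = arg;

	(void)start;
	(void)length;
	(void)error;
	(*nr_errors)++;
}
```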

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.h |    6 ++----
 scrub/phase6.c      |   10 +++++-----
 scrub/read_verify.c |   15 +++++++--------
 3 files changed, 14 insertions(+), 17 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index a46244334268de..165ad8f7af7b6c 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -12,7 +12,6 @@ struct disk;
 
 struct read_verify_schedule {
 	struct read_verify_pool	*rvp;
-	void			*io_end_arg;
 	uint64_t		io_start;	/* bytes */
 	uint64_t		io_length;	/* bytes */
 };
@@ -23,7 +22,7 @@ typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
 		int error, void *arg);
 
 int read_verify_pool_alloc(struct scrub_ctx *ctx, struct disk *disk,
-		read_verify_ioerr_fn_t ioerr_fn,
+		read_verify_ioerr_fn_t ioerr_fn, void *ioerr_arg,
 		struct read_verify_pool **prvp);
 void read_verify_pool_abort(struct read_verify_pool *rvp);
 int read_verify_pool_flush(struct read_verify_pool *rvp);
@@ -31,8 +30,7 @@ void read_verify_pool_destroy(struct read_verify_pool *rvp);
 
 int read_verify_schedule_now(struct read_verify_schedule *rs);
 bool try_read_verify_schedule_io(struct read_verify_schedule *rs,
-		struct read_verify_pool *rvp, uint64_t start, uint64_t length,
-		void *end_arg);
+		struct read_verify_pool *rvp, uint64_t start, uint64_t length);
 
 int read_verify_bytes(struct read_verify_pool *rvp, uint64_t *bytes);
 
diff --git a/scrub/phase6.c b/scrub/phase6.c
index bfd5659590d603..3e6a236d010a6b 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -645,7 +645,7 @@ check_rmap(
 
 	/* Schedule the read verify command for (eventual) running. */
 	scheduled = try_read_verify_schedule_io(rs, rvp, map->fmr_physical,
-			map->fmr_length, vs);
+			map->fmr_length);
 	if (scheduled)
 		return 0;
 
@@ -656,7 +656,7 @@ check_rmap(
 	}
 
 	scheduled = try_read_verify_schedule_io(rs, rvp, map->fmr_physical,
-			map->fmr_length, vs);
+			map->fmr_length);
 	assert(scheduled);
 	return 0;
 }
@@ -761,7 +761,7 @@ phase6_func(
 		goto out_dbad;
 	}
 
-	ret = read_verify_pool_alloc(ctx, ctx->datadev, remember_ioerr,
+	ret = read_verify_pool_alloc(ctx, ctx->datadev, remember_ioerr, &vs,
 			&vs.rvp_data);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
@@ -769,7 +769,7 @@ phase6_func(
 	}
 	if (ctx->logdev) {
 		ret = read_verify_pool_alloc(ctx, ctx->logdev, remember_ioerr,
-				&vs.rvp_log);
+				&vs, &vs.rvp_log);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating logdev media verifier"));
@@ -778,7 +778,7 @@ phase6_func(
 	}
 	if (ctx->rtdev) {
 		ret = read_verify_pool_alloc(ctx, ctx->rtdev, remember_ioerr,
-				&vs.rvp_realtime);
+				&vs, &vs.rvp_realtime);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating rtdev media verifier"));
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 128df2be86898b..bcd923e5f3cbc3 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -49,7 +49,6 @@ rvp_io_max_size(void)
 #define RVP_IO_BATCH_LOCALITY	(65536)
 
 struct read_verify {
-	void			*io_end_arg;
 	uint64_t		io_start;	/* bytes */
 	uint64_t		io_length;	/* bytes */
 };
@@ -60,6 +59,7 @@ struct read_verify_pool {
 	void			*readbuf;	/* read buffer */
 	struct ptcounter	*verified_bytes;
 	struct disk		*disk;		/* which disk? */
+	void			*ioerr_arg;
 	read_verify_ioerr_fn_t	ioerr_fn;	/* io error callback */
 	size_t			miniosz;	/* minimum io size, bytes */
 
@@ -81,6 +81,7 @@ read_verify_pool_alloc(
 	struct scrub_ctx		*ctx,
 	struct disk			*disk,
 	read_verify_ioerr_fn_t		ioerr_fn,
+	void				*ioerr_arg,
 	struct read_verify_pool		**prvp)
 {
 	struct read_verify_pool		*rvp;
@@ -105,6 +106,7 @@ read_verify_pool_alloc(
 	rvp->ctx = ctx;
 	rvp->disk = disk;
 	rvp->ioerr_fn = ioerr_fn;
+	rvp->ioerr_arg = ioerr_arg;
 	ret = -workqueue_create(&rvp->wq, (struct xfs_mount *)rvp,
 			verifier_threads == 1 ? 0 : verifier_threads);
 	if (ret)
@@ -223,14 +225,14 @@ read_verify(
 					rvp->disk->d_fd, rv->io_start, sz,
 					read_error);
 			rvp->ioerr_fn(rvp->ctx, rvp->disk, rv->io_start, sz,
-					read_error, rv->io_end_arg);
+					read_error, rvp->ioerr_arg);
 		} else if (sz == 0) {
 			/* No bytes at all?  Did we hit the end of the disk? */
 			dbg_printf("EOF %d @ %"PRIu64" %zu err %d\n",
 					rvp->disk->d_fd, rv->io_start, sz,
 					read_error);
 			rvp->ioerr_fn(rvp->ctx, rvp->disk, rv->io_start, sz,
-					read_error, rv->io_end_arg);
+					read_error, rvp->ioerr_arg);
 			break;
 		} else if (sz < len) {
 			/*
@@ -291,7 +293,6 @@ read_verify_schedule_now(
 		return errno;
 	}
 
-	tmp->io_end_arg = rs->io_end_arg;
 	tmp->io_start = rs->io_start;
 	tmp->io_length = rs->io_length;
 
@@ -318,8 +319,7 @@ try_read_verify_schedule_io(
 	struct read_verify_schedule	*rs,
 	struct read_verify_pool		*rvp,
 	uint64_t			start,
-	uint64_t			length,
-	void				*end_arg)
+	uint64_t			length)
 {
 	uint64_t			req_end;
 	uint64_t			rv_end;
@@ -338,7 +338,6 @@ try_read_verify_schedule_io(
 		rs->rvp = rvp;
 		rs->io_start = start;
 		rs->io_length = length;
-		rs->io_end_arg = end_arg;
 
 		return true;
 	}
@@ -348,7 +347,7 @@ try_read_verify_schedule_io(
 	 * reporting is the same, and the two extents are close,
 	 * we can combine them.
 	 */
-	if (rs->rvp == rvp && rs->io_length > 0 && end_arg == rs->io_end_arg &&
+	if (rs->rvp == rvp && rs->io_length > 0 &&
 	    ((start >= rs->io_start && start <= rv_end + RVP_IO_BATCH_LOCALITY) ||
 	     (rs->io_start >= start &&
 	      rs->io_start <= req_end + RVP_IO_BATCH_LOCALITY))) {



* [PATCH 10/22] scrub: use enum xfs_device for read verification
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (8 preceding siblings ...)
  2026-03-19  4:47   ` [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere Darrick J. Wong
@ 2026-03-19  4:48   ` Darrick J. Wong
  2026-03-19  4:48   ` [PATCH 11/22] xfs_scrub: rename nr_io_threads Darrick J. Wong
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

Passing the disk pointer and then translating back to an index is a bit
confusing.  Rewrite the read_verify and related code to pass around an
enum xfs_device and use that to identify the device.

The disks are placed in an array so that they can be trivially indexed using
the xfs device.  Because the values start at 1, this adds an unused slot, but
such a minor waste does not matter in the overall scheme of things.
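
For illustration only (not part of the patch), a minimal sketch of an
array indexed directly by an enum whose first value is 1.  The demo_*
enum and its numeric values are assumptions for this sketch; the real
enum xfs_device comes from the XFS headers.

```c
#include <stddef.h>

/* Assumed values: the first device is 1, so slot 0 is never used. */
enum demo_device {
	DEMO_DEV_DATA	= 1,
	DEMO_DEV_LOG	= 2,
	DEMO_DEV_RT	= 3,
};

struct demo_disk {
	int		fd;
};

struct demo_ctx {
	/* Sized by the largest enum value plus one; slot 0 stays NULL. */
	struct demo_disk	*disks[DEMO_DEV_RT + 1];
};

/* Look up a disk by device; no translation step is needed. */
static struct demo_disk *
demo_disk_for(
	struct demo_ctx		*ctx,
	enum demo_device	dev)
{
	if (dev < DEMO_DEV_DATA || dev > DEMO_DEV_RT)
		return NULL;
	return ctx->disks[dev];
}
```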

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: fix minor merge conflicts]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.h |    5 +--
 scrub/xfs_scrub.h   |    6 +--
 scrub/phase1.c      |   26 +++++++------
 scrub/phase6.c      |  101 ++++++++++++++++++++++++++-------------------------
 scrub/read_verify.c |   38 +++++++++----------
 5 files changed, 86 insertions(+), 90 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index 165ad8f7af7b6c..e4b36f6aaa20e1 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -8,7 +8,6 @@
 
 struct scrub_ctx;
 struct read_verify_pool;
-struct disk;
 
 struct read_verify_schedule {
 	struct read_verify_pool	*rvp;
@@ -18,10 +17,10 @@ struct read_verify_schedule {
 
 /* Function called when an IO error happens. */
 typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
-		struct disk *disk, uint64_t start, uint64_t length,
+		enum xfs_device dev, uint64_t start, uint64_t length,
 		int error, void *arg);
 
-int read_verify_pool_alloc(struct scrub_ctx *ctx, struct disk *disk,
+int read_verify_pool_alloc(struct scrub_ctx *ctx, enum xfs_device dev,
 		read_verify_ioerr_fn_t ioerr_fn, void *ioerr_arg,
 		struct read_verify_pool **prvp);
 void read_verify_pool_abort(struct read_verify_pool *rvp);
diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 041c0fadaa93c0..a7e67c37469beb 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -57,10 +57,8 @@ struct scrub_ctx {
 	struct statvfs		mnt_sv;
 	struct statfs		mnt_sf;
 
-	/* Open block devices */
-	struct disk		*datadev;
-	struct disk		*logdev;
-	struct disk		*rtdev;
+	/* Open block devices for legacy verify */
+	struct disk		*verify_disks[XFS_DEV_RT + 1];
 
 	/* What does the user want us to do? */
 	enum scrub_mode		mode;
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 10e9aa1892b701..954a62b7e8d711 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -96,12 +96,12 @@ scrub_cleanup(
 
 	if (ctx->fshandle)
 		free_handle(ctx->fshandle, ctx->fshandle_len);
-	if (ctx->rtdev)
-		disk_close(ctx->rtdev);
-	if (ctx->logdev)
-		disk_close(ctx->logdev);
-	if (ctx->datadev)
-		disk_close(ctx->datadev);
+	if (ctx->verify_disks[XFS_DEV_DATA])
+		disk_close(ctx->verify_disks[XFS_DEV_DATA]);
+	if (ctx->verify_disks[XFS_DEV_LOG])
+		disk_close(ctx->verify_disks[XFS_DEV_LOG]);
+	if (ctx->verify_disks[XFS_DEV_RT])
+		disk_close(ctx->verify_disks[XFS_DEV_RT]);
 	fshandle_destroy();
 	error = -xfd_close(&ctx->mnt);
 	if (error)
@@ -349,13 +349,13 @@ _("Unable to find realtime device path."));
 	}
 
 	/* Open the raw devices. */
-	ctx->datadev = disk_open(ctx->fsinfo.fs_name);
-	if (!ctx->datadev) {
+	ctx->verify_disks[XFS_DEV_DATA] = disk_open(ctx->fsinfo.fs_name);
+	if (!ctx->verify_disks[XFS_DEV_DATA]) {
 		str_error(ctx, ctx->mntpoint, _("Unable to open data device."));
 		return ECANCELED;
 	}
 
-	ctx->nr_io_threads = disk_heads(ctx->datadev);
+	ctx->nr_io_threads = disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
 	if (verbose) {
 		fprintf(stdout, _("%s: using %d threads to scrub.\n"),
 				ctx->mntpoint, scrub_nproc(ctx));
@@ -363,16 +363,16 @@ _("Unable to find realtime device path."));
 	}
 
 	if (ctx->fsinfo.fs_log) {
-		ctx->logdev = disk_open(ctx->fsinfo.fs_log);
-		if (!ctx->logdev) {
+		ctx->verify_disks[XFS_DEV_LOG] = disk_open(ctx->fsinfo.fs_log);
+		if (!ctx->verify_disks[XFS_DEV_LOG]) {
 			str_error(ctx, ctx->mntpoint,
 				_("Unable to open external log device."));
 			return ECANCELED;
 		}
 	}
 	if (ctx->fsinfo.fs_rt) {
-		ctx->rtdev = disk_open(ctx->fsinfo.fs_rt);
-		if (!ctx->rtdev) {
+		ctx->verify_disks[XFS_DEV_RT] = disk_open(ctx->fsinfo.fs_rt);
+		if (!ctx->verify_disks[XFS_DEV_RT]) {
 			str_error(ctx, ctx->mntpoint,
 				_("Unable to open realtime device."));
 			return ECANCELED;
diff --git a/scrub/phase6.c b/scrub/phase6.c
index 3e6a236d010a6b..02b25fe73aa656 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -77,47 +77,46 @@ dev_to_pool(
 	abort();
 }
 
-/* Find the device major/minor for a given file descriptor. */
-static dev_t
-disk_to_dev(
+/* Return fsmap device for XFS device index. */
+static uint32_t
+to_fsmap_dev(
 	struct scrub_ctx	*ctx,
-	struct disk		*disk)
+	enum xfs_device		dev)
 {
-	if (ctx->mnt.fsgeom.rtstart) {
-		if (disk == ctx->datadev)
-			return XFS_DEV_DATA;
-		if (disk == ctx->logdev)
-			return XFS_DEV_LOG;
-		if (disk == ctx->rtdev)
-			return XFS_DEV_RT;
-	} else {
-		if (disk == ctx->datadev)
-			return ctx->fsinfo.fs_datadev;
-		if (disk == ctx->logdev)
-			return ctx->fsinfo.fs_logdev;
-		if (disk == ctx->rtdev)
-			return ctx->fsinfo.fs_rtdev;
+	if (ctx->mnt.fsgeom.rtstart)
+		return dev;
+
+	switch (dev) {
+	case XFS_DEV_DATA:
+		return ctx->fsinfo.fs_datadev;
+	case XFS_DEV_LOG:
+		return ctx->fsinfo.fs_logdev;
+	case XFS_DEV_RT:
+		return ctx->fsinfo.fs_rtdev;
+	default:
+		abort();
 	}
-	abort();
 }
 
 /* Find the incore bad blocks bitmap for a given disk. */
 static struct bitmap *
 bitmap_for_disk(
-	struct scrub_ctx		*ctx,
-	struct disk			*disk,
+	enum xfs_device			dev,
 	struct media_verify_state	*vs)
 {
-	if (disk == ctx->datadev)
+	switch (dev) {
+	case XFS_DEV_DATA:
 		return vs->d_bad;
-	if (disk == ctx->rtdev)
+	case XFS_DEV_RT:
 		return vs->r_bad;
-	return NULL;
+	default:
+		return NULL;
+	}
 }
 
 struct disk_ioerr_report {
 	struct scrub_ctx	*ctx;
-	struct disk		*disk;
+	enum xfs_device		dev;
 };
 
 struct owner_decode {
@@ -522,7 +521,7 @@ report_ioerr(
 	struct disk_ioerr_report	*dioerr = arg;
 
 	/* Go figure out which blocks are bad from the fsmap. */
-	keys[0].fmr_device = disk_to_dev(dioerr->ctx, dioerr->disk);
+	keys[0].fmr_device = to_fsmap_dev(dioerr->ctx, dioerr->dev);
 	keys[0].fmr_physical = start;
 	keys[1].fmr_device = keys[0].fmr_device;
 	keys[1].fmr_physical = start + length - 1;
@@ -537,18 +536,15 @@ report_ioerr(
 static int
 report_disk_ioerrs(
 	struct scrub_ctx		*ctx,
-	struct disk			*disk,
-	struct media_verify_state	*vs)
+	struct media_verify_state	*vs,
+	enum xfs_device			dev)
 {
+	struct bitmap			*tree = bitmap_for_disk(dev, vs);
 	struct disk_ioerr_report	dioerr = {
 		.ctx			= ctx,
-		.disk			= disk,
+		.dev			= dev,
 	};
-	struct bitmap			*tree;
 
-	if (!disk)
-		return 0;
-	tree = bitmap_for_disk(ctx, disk, vs);
 	if (!tree)
 		return 0;
 	return -bitmap_iterate(tree, report_ioerr, &dioerr);
@@ -569,13 +565,13 @@ report_all_media_errors(
 	if (vs->r_trunc)
 		str_corrupt(ctx, ctx->mntpoint, _("rt device truncated"));
 
-	ret = report_disk_ioerrs(ctx, ctx->datadev, vs);
+	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_DATA);
 	if (ret) {
 		str_liberror(ctx, ret, _("walking datadev io errors"));
 		return ret;
 	}
 
-	ret = report_disk_ioerrs(ctx, ctx->rtdev, vs);
+	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_RT);
 	if (ret) {
 		str_liberror(ctx, ret, _("walking rtdev io errors"));
 		return ret;
@@ -703,7 +699,7 @@ clean_pool(
 static void
 remember_ioerr(
 	struct scrub_ctx		*ctx,
-	struct disk			*disk,
+	enum xfs_device			dev,
 	uint64_t			start,
 	uint64_t			length,
 	int				error,
@@ -714,16 +710,21 @@ remember_ioerr(
 	int				ret;
 
 	if (!length) {
-		if (disk == ctx->datadev)
+		switch (dev) {
+		case XFS_DEV_DATA:
 			vs->d_trunc = true;
-		else if (disk == ctx->logdev)
+			break;
+		case XFS_DEV_LOG:
 			vs->l_trunc = true;
-		else if (disk == ctx->rtdev)
+			break;
+		case XFS_DEV_RT:
 			vs->r_trunc = true;
+			break;
+		}
 		return;
 	}
 
-	tree = bitmap_for_disk(ctx, disk, vs);
+	tree = bitmap_for_disk(dev, vs);
 	if (!tree) {
 		str_liberror(ctx, ENOENT, _("finding bad block bitmap"));
 		return;
@@ -761,14 +762,14 @@ phase6_func(
 		goto out_dbad;
 	}
 
-	ret = read_verify_pool_alloc(ctx, ctx->datadev, remember_ioerr, &vs,
+	ret = read_verify_pool_alloc(ctx, XFS_DEV_DATA, remember_ioerr, &vs,
 			&vs.rvp_data);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
 		goto out_rbad;
 	}
-	if (ctx->logdev) {
-		ret = read_verify_pool_alloc(ctx, ctx->logdev, remember_ioerr,
+	if (ctx->fsinfo.fs_log) {
+		ret = read_verify_pool_alloc(ctx, XFS_DEV_LOG, remember_ioerr,
 				&vs, &vs.rvp_log);
 		if (ret) {
 			str_liberror(ctx, ret,
@@ -776,8 +777,8 @@ phase6_func(
 			goto out_datapool;
 		}
 	}
-	if (ctx->rtdev) {
-		ret = read_verify_pool_alloc(ctx, ctx->rtdev, remember_ioerr,
+	if (ctx->fsinfo.fs_rt) {
+		ret = read_verify_pool_alloc(ctx, XFS_DEV_RT, remember_ioerr,
 				&vs, &vs.rvp_realtime);
 		if (ret) {
 			str_liberror(ctx, ret,
@@ -888,11 +889,11 @@ phase6_estimate(
 	 * can contribute to the progress counter.  Hence we need to set
 	 * nr_threads appropriately to handle that many threads.
 	 */
-	*nr_threads = disk_heads(ctx->datadev);
-	if (ctx->rtdev)
-		*nr_threads += disk_heads(ctx->rtdev);
-	if (ctx->logdev)
-		*nr_threads += disk_heads(ctx->logdev);
+	*nr_threads = disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
+	if (ctx->verify_disks[XFS_DEV_RT])
+		*nr_threads += disk_heads(ctx->verify_disks[XFS_DEV_RT]);
+	if (ctx->verify_disks[XFS_DEV_LOG])
+		*nr_threads += disk_heads(ctx->verify_disks[XFS_DEV_LOG]);
 	*rshift = 20;
 	return 0;
 }
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index bcd923e5f3cbc3..9b692a923dfdf3 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -58,10 +58,10 @@ struct read_verify_pool {
 	struct scrub_ctx	*ctx;		/* scrub context */
 	void			*readbuf;	/* read buffer */
 	struct ptcounter	*verified_bytes;
-	struct disk		*disk;		/* which disk? */
 	void			*ioerr_arg;
 	read_verify_ioerr_fn_t	ioerr_fn;	/* io error callback */
 	size_t			miniosz;	/* minimum io size, bytes */
+	enum xfs_device		dev;		/* which device? */
 
 	/*
 	 * Store a runtime error code here so that we can stop the pool and
@@ -73,19 +73,19 @@ struct read_verify_pool {
 /*
  * Create a thread pool to run read verifiers.
  *
- * @disk is the disk we want to verify.
  * @ioerr_fn will be called when IO errors occur.
  */
 int
 read_verify_pool_alloc(
 	struct scrub_ctx		*ctx,
-	struct disk			*disk,
+	enum xfs_device			dev,
 	read_verify_ioerr_fn_t		ioerr_fn,
 	void				*ioerr_arg,
 	struct read_verify_pool		**prvp)
 {
 	struct read_verify_pool		*rvp;
-	unsigned int			verifier_threads = disk_heads(disk);
+	unsigned int			verifier_threads =
+		disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
 	int				ret;
 
 	if (rvp_io_max_size() % ctx->mnt.fsgeom.blocksize)
@@ -104,7 +104,7 @@ read_verify_pool_alloc(
 		goto out_buf;
 	rvp->miniosz = ctx->mnt.fsgeom.blocksize;
 	rvp->ctx = ctx;
-	rvp->disk = disk;
+	rvp->dev = dev;
 	rvp->ioerr_fn = ioerr_fn;
 	rvp->ioerr_arg = ioerr_arg;
 	ret = -workqueue_create(&rvp->wq, (struct xfs_mount *)rvp,
@@ -179,10 +179,10 @@ read_verify(
 	while (rv->io_length > 0) {
 		read_error = 0;
 		len = min(rv->io_length, io_max_size);
-		dbg_printf("diskverify %d %"PRIu64" %zu\n", rvp->disk->d_fd,
+		dbg_printf("diskverify %u %"PRIu64" %zu\n", rvp->dev,
 				rv->io_start, len);
-		sz = disk_read_verify(rvp->disk, rvp->readbuf, rv->io_start,
-				len);
+		sz = disk_read_verify(rvp->ctx->verify_disks[rvp->dev],
+				rvp->readbuf, rv->io_start, len);
 		if (sz == len && io_max_size < rvp->miniosz) {
 			/*
 			 * If the verify request was 100% successful and less
@@ -221,17 +221,15 @@ read_verify(
 			 * io_start to the next miniosz block.
 			 */
 			sz = rvp->miniosz - (rv->io_start % rvp->miniosz);
-			dbg_printf("IOERR %d @ %"PRIu64" %zu err %d\n",
-					rvp->disk->d_fd, rv->io_start, sz,
-					read_error);
-			rvp->ioerr_fn(rvp->ctx, rvp->disk, rv->io_start, sz,
+			dbg_printf("IOERR %u @ %"PRIu64" %zu err %d\n",
+					rvp->dev, rv->io_start, sz, read_error);
+			rvp->ioerr_fn(rvp->ctx, rvp->dev, rv->io_start, sz,
 					read_error, rvp->ioerr_arg);
 		} else if (sz == 0) {
 			/* No bytes at all?  Did we hit the end of the disk? */
-			dbg_printf("EOF %d @ %"PRIu64" %zu err %d\n",
-					rvp->disk->d_fd, rv->io_start, sz,
-					read_error);
-			rvp->ioerr_fn(rvp->ctx, rvp->disk, rv->io_start, sz,
+			dbg_printf("EOF %u @ %"PRIu64" %zu err %d\n",
+					rvp->dev, rv->io_start, sz, read_error);
+			rvp->ioerr_fn(rvp->ctx, rvp->dev, rv->io_start, sz,
 					read_error, rvp->ioerr_arg);
 			break;
 		} else if (sz < len) {
@@ -245,8 +243,8 @@ read_verify(
 			 * next full block.
 			 */
 			io_max_size = rvp->miniosz - (sz % rvp->miniosz);
-			dbg_printf("SHORT %d READ @ %"PRIu64" %zu try for %zd\n",
-					rvp->disk->d_fd, rv->io_start, sz,
+			dbg_printf("SHORT %u READ @ %"PRIu64" %zu try for %zd\n",
+					rvp->dev, rv->io_start, sz,
 					io_max_size);
 		} else {
 			/* We should never get back more bytes than we asked. */
@@ -279,8 +277,8 @@ read_verify_schedule_now(
 	if (!rvp)
 		return 0;
 
-	dbg_printf("verify fd %d start %"PRIu64" len %"PRIu64"\n",
-			rvp->disk->d_fd, rs->io_start, rs->io_length);
+	dbg_printf("verify dev %u start %"PRIu64" len %"PRIu64"\n",
+			rvp->dev, rs->io_start, rs->io_length);
 
 	/* Worker thread saw a runtime error, don't queue more. */
 	if (rvp->runtime_error)



* [PATCH 11/22] xfs_scrub: rename nr_io_threads
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (9 preceding siblings ...)
  2026-03-19  4:48   ` [PATCH 10/22] scrub: use enum xfs_device for read verification Darrick J. Wong
@ 2026-03-19  4:48   ` Darrick J. Wong
  2026-03-20  7:14     ` Christoph Hellwig
  2026-03-19  4:48   ` [PATCH 12/22] scrub: simplify verifier threads calculation Darrick J. Wong
                     ` (10 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This variable really describes the number of threads that we should
start up to scan metadata.  Rename it to reduce confusion with the media
verification code, which also initiates IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrub.h |    2 +-
 scrub/common.c    |    2 +-
 scrub/phase1.c    |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index a7e67c37469beb..3009a7cd2d97d0 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -70,7 +70,7 @@ struct scrub_ctx {
 	struct xfs_fd		mnt;
 
 	/* Number of threads for metadata scrubbing */
-	unsigned int		nr_io_threads;
+	unsigned int		nr_scan_threads;
 
 	/* XFS specific geometry */
 	struct fs_path		fsinfo;
diff --git a/scrub/common.c b/scrub/common.c
index 34d91525928305..95bc0c091eae27 100644
--- a/scrub/common.c
+++ b/scrub/common.c
@@ -275,7 +275,7 @@ scrub_nproc(
 {
 	if (force_nr_threads)
 		return force_nr_threads;
-	return ctx->nr_io_threads;
+	return ctx->nr_scan_threads;
 }
 
 /*
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 954a62b7e8d711..16607af4c5f025 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -355,7 +355,7 @@ _("Unable to find realtime device path."));
 		return ECANCELED;
 	}
 
-	ctx->nr_io_threads = disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
+	ctx->nr_scan_threads = disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
 	if (verbose) {
 		fprintf(stdout, _("%s: using %d threads to scrub.\n"),
 				ctx->mntpoint, scrub_nproc(ctx));



* [PATCH 12/22] scrub: simplify verifier threads calculation
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (10 preceding siblings ...)
  2026-03-19  4:48   ` [PATCH 11/22] xfs_scrub: rename nr_io_threads Darrick J. Wong
@ 2026-03-19  4:48   ` Darrick J. Wong
  2026-03-19  4:48   ` [PATCH 13/22] xfs_scrub: move disk media verification error injection Darrick J. Wong
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

Throwing all CPUs at verifying seems like a bad idea for foreground
scrub, as does abusing I/O opt/min as that says absolutely nothing
about parallelism.  I can't really think of a better way than manually
configuring this except maybe kernel hints.  But the current decisions
are not good defaults, and also are the only user of struct disk for
file systems using the kernel verify ioctl.

The best default seems to be 8, because verification speed increases
diminish above that level.  Note that background service mode now obeys
the thread count restrictions.
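
As a rough sketch of the policy described above (not the patch itself): each
read-verify pool gets the forced thread count if one is set, else 8, and the
progress estimate sums one pool per device.  The sketch_* names are
hypothetical; sketch_force_nr_threads stands in for the force_nr_threads
override used in the diff below.

```c
#include <stdbool.h>

/* Hypothetical stand-in for force_nr_threads; 0 means "not forced". */
static unsigned int sketch_force_nr_threads;

/* Threads per read-verify pool: the forced value, else the default of 8. */
static unsigned int
sketch_verify_nproc(void)
{
	if (sketch_force_nr_threads)
		return sketch_force_nr_threads;
	return 8;
}

/* Upper bound on verifier threads across the data, rt, and log pools. */
static unsigned int
sketch_total_verify_threads(bool has_ext_log, bool has_rt)
{
	unsigned int	nr = sketch_verify_nproc();	/* data device */

	if (has_rt)
		nr += sketch_verify_nproc();
	if (has_ext_log)
		nr += sketch_verify_nproc();
	return nr;
}
```

With no override, a filesystem with external log and realtime devices would
be sized for 24 verifier threads under these assumptions.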

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: rebase atop previous patches, leave disk_heads() alone]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.h |    2 ++
 scrub/phase6.c      |   10 +++++-----
 scrub/read_verify.c |   20 ++++++++++++++++++--
 3 files changed, 25 insertions(+), 7 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index e4b36f6aaa20e1..6a338abb896bc3 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -33,4 +33,6 @@ bool try_read_verify_schedule_io(struct read_verify_schedule *rs,
 
 int read_verify_bytes(struct read_verify_pool *rvp, uint64_t *bytes);
 
+unsigned int read_verify_nproc(struct scrub_ctx *ctx);
+
 #endif /* XFS_SCRUB_READ_VERIFY_H_ */
diff --git a/scrub/phase6.c b/scrub/phase6.c
index 02b25fe73aa656..bae4d7c5d92ce0 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -889,11 +889,11 @@ phase6_estimate(
 	 * can contribute to the progress counter.  Hence we need to set
 	 * nr_threads appropriately to handle that many threads.
 	 */
-	*nr_threads = disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
-	if (ctx->verify_disks[XFS_DEV_RT])
-		*nr_threads += disk_heads(ctx->verify_disks[XFS_DEV_RT]);
-	if (ctx->verify_disks[XFS_DEV_LOG])
-		*nr_threads += disk_heads(ctx->verify_disks[XFS_DEV_LOG]);
+	*nr_threads = read_verify_nproc(ctx);
+	if (ctx->fsinfo.fs_rt)
+		*nr_threads += read_verify_nproc(ctx);
+	if (ctx->fsinfo.fs_log)
+		*nr_threads += read_verify_nproc(ctx);
 	*rshift = 20;
 	return 0;
 }
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 9b692a923dfdf3..ba04ad3684ebd7 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -70,6 +70,22 @@ struct read_verify_pool {
 	int			runtime_error;
 };
 
+unsigned int
+read_verify_nproc(
+	struct scrub_ctx		*ctx)
+{
+	if (force_nr_threads)
+		return force_nr_threads;
+
+	/*
+	 * Throwing all CPUs at verifying seems like a bad idea for foreground
+	 * scrub, as does abusing I/O opt/min as that says absolutely nothing
+	 * about parallelism.  The authors observed diminishing returns on
+	 * verification speed past 8 IO threads, so that's the default.
+	 */
+	return 8;
+}
+
 /*
  * Create a thread pool to run read verifiers.
  *
@@ -84,8 +100,8 @@ read_verify_pool_alloc(
 	struct read_verify_pool		**prvp)
 {
 	struct read_verify_pool		*rvp;
-	unsigned int			verifier_threads =
-		disk_heads(ctx->verify_disks[XFS_DEV_DATA]);
+	const unsigned int		verifier_threads =
+		read_verify_nproc(ctx);
 	int				ret;
 
 	if (rvp_io_max_size() % ctx->mnt.fsgeom.blocksize)



* [PATCH 13/22] xfs_scrub: move disk media verification error injection
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (11 preceding siblings ...)
  2026-03-19  4:48   ` [PATCH 12/22] scrub: simplify verifier threads calculation Darrick J. Wong
@ 2026-03-19  4:48   ` Darrick J. Wong
  2026-03-19  4:49   ` [PATCH 14/22] xfs_scrub: use the verify media ioctl during phase 6 if possible Darrick J. Wong
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

This isn't really disk-related since it's a knob to make the
read_verify_pool pretend that the media is defective.  Move this code
before we add a new media verify path that doesn't require the disk
abstraction.
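
For readers unfamiliar with the knob, here is a simplified sketch of the
simulation logic being moved.  It assumes the interval has already been
parsed from the environment and is passed in directly; the sketch_* names
are hypothetical and the real code keeps the interval in a static variable.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Round the requested error interval down to a multiple of the minimum IO
 * size.  miniosz is assumed to be a power of two.  A result of zero means
 * that error simulation ends up disabled.
 */
static uint64_t
sketch_align_interval(uint64_t interval, uint64_t miniosz)
{
	return interval & ~(miniosz - 1);
}

/*
 * Pretend there is bad media every @interval bytes (must be nonzero).
 * Returns EIO if the request starts on an interval boundary.  Otherwise
 * returns 0, trimming *length if needed so that the request does not run
 * past the next boundary.
 */
static int
sketch_simulate_read_error(uint64_t interval, uint64_t start,
		uint64_t *length)
{
	uint64_t	start_interval;

	if (start % interval == 0)
		return EIO;

	start_interval = start / interval;
	if (start_interval != (start + *length) / interval)
		*length = (start_interval + 1) * interval - start;
	return 0;
}
```

A request that crosses a boundary is shortened so that the following request
starts exactly on the boundary and then fails with EIO.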

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: split off from another hch patch, create a new commit message]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/disk.c        |   71 -------------------------------------------
 scrub/read_verify.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 83 insertions(+), 73 deletions(-)


diff --git a/scrub/disk.c b/scrub/disk.c
index 2cf84d91887587..afce801d3f2076 100644
--- a/scrub/disk.c
+++ b/scrub/disk.c
@@ -266,63 +266,6 @@ disk_close(
 #define LBASIZE(d)		(1ULL << (d)->d_lbalog)
 #define BTOLBA(d, bytes)	(((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
 
-/* Simulate disk errors. */
-static int
-disk_simulate_read_error(
-	struct disk		*disk,
-	uint64_t		start,
-	uint64_t		*length)
-{
-	static int64_t		interval;
-	uint64_t		start_interval;
-
-	/* Simulated disk errors are disabled. */
-	if (interval < 0)
-		return 0;
-
-	/* Figure out the disk read error interval. */
-	if (interval == 0) {
-		char		*p;
-
-		/* Pretend there's bad media every so often, in bytes. */
-		p = getenv("XFS_SCRUB_DISK_ERROR_INTERVAL");
-		if (p == NULL) {
-			interval = -1;
-			return 0;
-		}
-		interval = strtoull(p, NULL, 10);
-		interval &= ~((1U << disk->d_lbalog) - 1);
-	}
-	if (interval <= 0) {
-		interval = -1;
-		return 0;
-	}
-
-	/*
-	 * We simulate disk errors by pretending that there are media errors at
-	 * predetermined intervals across the disk.  If a read verify request
-	 * crosses one of those intervals we shorten it so that the next read
-	 * will start on an interval threshold.  If the read verify request
-	 * starts on an interval threshold, we send back EIO as if it had
-	 * failed.
-	 */
-	if ((start % interval) == 0) {
-		dbg_printf("fd %d: simulating disk error at %"PRIu64".\n",
-				disk->d_fd, start);
-		return EIO;
-	}
-
-	start_interval = start / interval;
-	if (start_interval != (start + *length) / interval) {
-		*length = ((start_interval + 1) * interval) - start;
-		dbg_printf(
-"fd %d: simulating short read at %"PRIu64" to length %"PRIu64".\n",
-				disk->d_fd, start, *length);
-	}
-
-	return 0;
-}
-
 /* Read-verify an extent of a disk device. */
 ssize_t
 disk_read_verify(
@@ -331,20 +274,6 @@ disk_read_verify(
 	uint64_t		start,
 	uint64_t		length)
 {
-	if (debug) {
-		int		ret;
-
-		ret = disk_simulate_read_error(disk, start, &length);
-		if (ret) {
-			errno = ret;
-			return -1;
-		}
-
-		/* Don't actually issue the IO */
-		if (getenv("XFS_SCRUB_DISK_VERIFY_SKIP"))
-			return length;
-	}
-
 	/* Convert to logical block size. */
 	if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
 		return disk_scsi_verify(disk, BTOLBAT(disk, start),
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index ba04ad3684ebd7..adee67d922fc76 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -168,6 +168,88 @@ read_verify_pool_destroy(
 	free(rvp);
 }
 
+/* Simulate disk errors. */
+static int
+verify_simulate_read_error(
+	struct read_verify_pool	*rvp,
+	uint64_t		start,
+	ssize_t			*length)
+{
+	static int64_t		interval;
+	uint64_t		start_interval;
+
+	/* Simulated disk errors are disabled. */
+	if (interval < 0)
+		return 0;
+
+	/* Figure out the disk read error interval. */
+	if (interval == 0) {
+		char		*p;
+
+		/* Pretend there's bad media every so often, in bytes. */
+		p = getenv("XFS_SCRUB_DISK_ERROR_INTERVAL");
+		if (p == NULL) {
+			interval = -1;
+			return 0;
+		}
+		interval = strtoull(p, NULL, 10);
+		interval &= ~(rvp->miniosz - 1);
+	}
+	if (interval <= 0) {
+		interval = -1;
+		return 0;
+	}
+
+	/*
+	 * We simulate disk errors by pretending that there are media errors at
+	 * predetermined intervals across the disk.  If a read verify request
+	 * crosses one of those intervals we shorten it so that the next read
+	 * will start on an interval threshold.  If the read verify request
+	 * starts on an interval threshold, we send back EIO as if it had
+	 * failed.
+	 */
+	if ((start % interval) == 0) {
+		dbg_printf("dev %u: simulating disk error at %"PRIu64".\n",
+				rvp->dev, start);
+		return EIO;
+	}
+
+	start_interval = start / interval;
+	if (start_interval != (start + *length) / interval) {
+		*length = ((start_interval + 1) * interval) - start;
+		dbg_printf(
+"dev %u: simulating short read at %"PRIu64" to length %"PRIu64".\n",
+				rvp->dev, start, *length);
+	}
+
+	return 0;
+}
+
+/* Read-verify an extent of a disk device. */
+static ssize_t
+read_verify_one(
+	struct read_verify_pool	*rvp,
+	struct read_verify	*rv,
+	ssize_t			len)
+{
+	if (debug) {
+		int		ret;
+
+		ret = verify_simulate_read_error(rvp, rv->io_start, &len);
+		if (ret) {
+			errno = ret;
+			return -1;
+		}
+
+		/* Don't actually issue the IO */
+		if (getenv("XFS_SCRUB_DISK_VERIFY_SKIP"))
+			return len;
+	}
+
+	return disk_read_verify(rvp->ctx->verify_disks[rvp->dev], rvp->readbuf,
+			rv->io_start, len);
+}
+
 /*
  * Issue a read-verify IO in big batches.
  */
@@ -197,8 +279,7 @@ read_verify(
 		len = min(rv->io_length, io_max_size);
 		dbg_printf("diskverify %u %"PRIu64" %zu\n", rvp->dev,
 				rv->io_start, len);
-		sz = disk_read_verify(rvp->ctx->verify_disks[rvp->dev],
-				rvp->readbuf, rv->io_start, len);
+		sz = read_verify_one(rvp, rv, len);
 		if (sz == len && io_max_size < rvp->miniosz) {
 			/*
 			 * If the verify request was 100% successful and less



* [PATCH 14/22] xfs_scrub: use the verify media ioctl during phase 6 if possible
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (12 preceding siblings ...)
  2026-03-19  4:48   ` [PATCH 13/22] xfs_scrub: move disk media verification error injection Darrick J. Wong
@ 2026-03-19  4:49   ` Darrick J. Wong
  2026-03-19  4:49   ` [PATCH 15/22] scrub: don't allocate disk for ioctl-based media verify Darrick J. Wong
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the kernel supports the XFS_IOC_VERIFY_MEDIA ioctl, use that to
perform the phase 6 media scan instead of pread or the SCSI VERIFY
command.  This enables better integration with xfs_healer and fsnotify,
and reduces the amount of work that userspace has to do.
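
The unit conversion is the fiddly part, so here is a small sketch of it.
The ioctl takes a range in 512-byte basic blocks, and the patch below
appears to rely on the kernel advancing me_start_daddr past whatever it
verified.  The sketch_* helpers are hypothetical stand-ins for the BTOBBT,
BTOBB, and BBTOB macros used in the diff; no ioctl is issued here.

```c
#include <stdint.h>

#define SKETCH_BBSHIFT	9	/* 512-byte basic blocks */
#define SKETCH_BBSIZE	(1ULL << SKETCH_BBSHIFT)

struct sketch_range {
	uint64_t	start_daddr;
	uint64_t	end_daddr;
};

/* Convert basic blocks back to bytes. */
static uint64_t
sketch_bbtob(uint64_t bbs)
{
	return bbs << SKETCH_BBSHIFT;
}

/*
 * Convert a byte range to basic blocks: round the start down and the end
 * up so that the block range covers every byte of the request.
 */
static struct sketch_range
sketch_bytes_to_daddr_range(uint64_t start, uint64_t length)
{
	struct sketch_range	r = {
		.start_daddr	= start >> SKETCH_BBSHIFT,
		.end_daddr	= (start + length + SKETCH_BBSIZE - 1) >>
						SKETCH_BBSHIFT,
	};

	return r;
}

/* Bytes verified, given how far the start of the range was advanced. */
static uint64_t
sketch_bytes_verified(uint64_t orig_start_daddr, uint64_t new_start_daddr)
{
	return sketch_bbtob(new_start_daddr - orig_start_daddr);
}
```

For example, a request for 8192 bytes at offset 4096 maps to the block range
[8, 24), and a fully successful verify would report 8192 bytes.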

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 scrub/xfs_scrub.h   |    1 +
 scrub/phase1.c      |   18 ++++++++++++++++++
 scrub/read_verify.c |   44 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 59 insertions(+), 4 deletions(-)


diff --git a/scrub/xfs_scrub.h b/scrub/xfs_scrub.h
index 3009a7cd2d97d0..851b4f37db48e4 100644
--- a/scrub/xfs_scrub.h
+++ b/scrub/xfs_scrub.h
@@ -79,6 +79,7 @@ struct scrub_ctx {
 
 	/* Data block read verification buffer */
 	void			*readbuf;
+	bool			no_verify_ioctl;
 
 	/* Mutable scrub state; use lock. */
 	pthread_mutex_t		lock;
diff --git a/scrub/phase1.c b/scrub/phase1.c
index 16607af4c5f025..abc594898bf2c5 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -213,6 +213,22 @@ mode_from_autofsck(
 	goto summarize;
 }
 
+/* Does the XFS driver support media scanning its own disks? */
+static void
+configure_xfs_verify(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_verify_media	me = {
+		/* just probe for support using an empty range */
+		.me_start_daddr	= 0,
+		.me_end_daddr	= 0,
+		.me_dev		= XFS_DEV_DATA,
+	};
+
+	if (ioctl(ctx->mnt.fd, XFS_IOC_VERIFY_MEDIA, &me))
+		ctx->no_verify_ioctl = true;
+}
+
 /*
  * Bind to the mountpoint, read the XFS geometry, bind to the block devices.
  * Anything we've already built will be cleaned up by scrub_cleanup.
@@ -379,6 +395,8 @@ _("Unable to find realtime device path."));
 		}
 	}
 
+	configure_xfs_verify(ctx);
+
 	/*
 	 * Everything's set up, which means any failures recorded after
 	 * this point are most probably corruption errors (as opposed to
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index adee67d922fc76..f724dfd693d7ab 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -225,12 +225,45 @@ verify_simulate_read_error(
 	return 0;
 }
 
+/* Use the XFS media verification ioctl to do the media scan */
+static ssize_t
+ioctl_verify(
+	int			verify_fd,
+	enum xfs_device		dev,
+	uint64_t		start,
+	uint64_t		length,
+	bool			single_step)
+{
+	const uint64_t	orig_start_daddr = BTOBBT(start);
+	struct xfs_verify_media	me = {
+		.me_start_daddr	= orig_start_daddr,
+		.me_end_daddr	= BTOBB(start + length),
+		.me_dev		= dev,
+		.me_rest_us	= bg_mode > 1 ? bg_mode - 1 : 0,
+	};
+	int			ret;
+
+	if (single_step)
+		me.me_flags |= XFS_VERIFY_MEDIA_REPORT;
+
+	ret = ioctl(verify_fd, XFS_IOC_VERIFY_MEDIA, &me);
+	if (ret < 0)
+		return ret;
+	if (me.me_ioerror) {
+		errno = me.me_ioerror;
+		return -1;
+	}
+
+	return BBTOB(me.me_start_daddr - orig_start_daddr);
+}
+
 /* Read-verify an extent of a disk device. */
 static ssize_t
 read_verify_one(
 	struct read_verify_pool	*rvp,
 	struct read_verify	*rv,
-	ssize_t			len)
+	ssize_t			len,
+	bool			single_step)
 {
 	if (debug) {
 		int		ret;
@@ -246,8 +279,11 @@ read_verify_one(
 			return len;
 	}
 
-	return disk_read_verify(rvp->ctx->verify_disks[rvp->dev], rvp->readbuf,
-			rv->io_start, len);
+	if (rvp->ctx->no_verify_ioctl)
+		return disk_read_verify(rvp->ctx->verify_disks[rvp->dev],
+				rvp->readbuf, rv->io_start, len);
+	return ioctl_verify(rvp->ctx->mnt.fd, rvp->dev, rv->io_start, len,
+			single_step);
 }
 
 /*
@@ -279,7 +315,7 @@ read_verify(
 		len = min(rv->io_length, io_max_size);
 		dbg_printf("diskverify %u %"PRIu64" %zu\n", rvp->dev,
 				rv->io_start, len);
-		sz = read_verify_one(rvp, rv, len);
+		sz = read_verify_one(rvp, rv, len, io_max_size <= rvp->miniosz);
 		if (sz == len && io_max_size < rvp->miniosz) {
 			/*
 			 * If the verify request was 100% successful and less



* [PATCH 15/22] scrub: don't allocate disk for ioctl-based media verify
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (13 preceding siblings ...)
  2026-03-19  4:49   ` [PATCH 14/22] xfs_scrub: use the verify media ioctl during phase 6 if possible Darrick J. Wong
@ 2026-03-19  4:49   ` Darrick J. Wong
  2026-03-19  4:49   ` [PATCH 16/22] xfs_scrub: perform media scanning of the log region Darrick J. Wong
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

When the kernel provides the data verification ioctl there is no point
in allocating struct disk and thus opening the underlying devices.
Refactor the code so that we don't have to do that, with the added
benefit of keeping the read verification self-contained in
read_verify.c for the case where the kernel provides the ioctl.
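
The resulting policy can be summarized with a short sketch.  The names are
hypothetical; ioctl_ok stands for the result of probing the kernel with an
empty range, as configure_xfs_verify does in the diff below.

```c
#include <stdbool.h>

enum sketch_dev {
	SKETCH_DEV_DATA,
	SKETCH_DEV_LOG,
	SKETCH_DEV_RT,
	SKETCH_DEV_NR,
};

/*
 * Fill keep_open[] with the devices that stay open for media verification.
 * With a working ioctl nothing stays open; otherwise the data device stays
 * open along with whichever external devices the filesystem has.
 */
static void
sketch_plan_verify_disks(bool ioctl_ok, bool has_ext_log, bool has_rt,
		bool keep_open[SKETCH_DEV_NR])
{
	keep_open[SKETCH_DEV_DATA] = !ioctl_ok;
	keep_open[SKETCH_DEV_LOG] = !ioctl_ok && has_ext_log;
	keep_open[SKETCH_DEV_RT] = !ioctl_ok && has_rt;
}
```

Note that in the patch the data device is still opened briefly in both cases,
because the scan thread count is derived from it before it is closed.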

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: break up patch]
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase1.c |   65 +++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 45 insertions(+), 20 deletions(-)


diff --git a/scrub/phase1.c b/scrub/phase1.c
index abc594898bf2c5..f4d50d016c2ebd 100644
--- a/scrub/phase1.c
+++ b/scrub/phase1.c
@@ -213,8 +213,39 @@ mode_from_autofsck(
 	goto summarize;
 }
 
+/*
+ * We can't do XFS_IOC_VERIFY_MEDIA media verification, so we need to fall back
+ * to reading the disk.  We already opened the data device, now we need to open
+ * the rt and log devices for media verification.
+ */
+static int
+configure_xfs_verify_fallback(
+	struct scrub_ctx	*ctx)
+{
+	if (ctx->fsinfo.fs_log) {
+		ctx->verify_disks[XFS_DEV_LOG] = disk_open(ctx->fsinfo.fs_log);
+		if (!ctx->verify_disks[XFS_DEV_LOG]) {
+			str_error(ctx, ctx->mntpoint,
+				_("Unable to open external log device."));
+			return ECANCELED;
+		}
+	}
+
+	if (ctx->fsinfo.fs_rt) {
+		ctx->verify_disks[XFS_DEV_RT] = disk_open(ctx->fsinfo.fs_rt);
+		if (!ctx->verify_disks[XFS_DEV_RT]) {
+			str_error(ctx, ctx->mntpoint,
+				_("Unable to open realtime device."));
+			return ECANCELED;
+		}
+	}
+
+	ctx->no_verify_ioctl = true;
+	return 0;
+}
+
 /* Does the XFS driver support media scanning its own disks? */
-static void
+static bool
 configure_xfs_verify(
 	struct scrub_ctx	*ctx)
 {
@@ -225,8 +256,7 @@ configure_xfs_verify(
 		.me_dev		= XFS_DEV_DATA,
 	};
 
-	if (ioctl(ctx->mnt.fd, XFS_IOC_VERIFY_MEDIA, &me))
-		ctx->no_verify_ioctl = true;
+	return ioctl(ctx->mnt.fd, XFS_IOC_VERIFY_MEDIA, &me) == 0;
 }
 
 /*
@@ -378,24 +408,19 @@ _("Unable to find realtime device path."));
 		fflush(stdout);
 	}
 
-	if (ctx->fsinfo.fs_log) {
-		ctx->verify_disks[XFS_DEV_LOG] = disk_open(ctx->fsinfo.fs_log);
-		if (!ctx->verify_disks[XFS_DEV_LOG]) {
-			str_error(ctx, ctx->mntpoint,
-				_("Unable to open external log device."));
-			return ECANCELED;
-		}
+	if (configure_xfs_verify(ctx)) {
+		/*
+		 * ioctl-based media verification is enabled and we already set
+		 * nr_scan_threads from the data device, so we no longer need
+		 * to keep this open.
+		 */
+		disk_close(ctx->verify_disks[XFS_DEV_DATA]);
+		ctx->verify_disks[XFS_DEV_DATA] = NULL;
+	} else {
+		error = configure_xfs_verify_fallback(ctx);
+		if (error)
+			return error;
 	}
-	if (ctx->fsinfo.fs_rt) {
-		ctx->verify_disks[XFS_DEV_RT] = disk_open(ctx->fsinfo.fs_rt);
-		if (!ctx->verify_disks[XFS_DEV_RT]) {
-			str_error(ctx, ctx->mntpoint,
-				_("Unable to open realtime device."));
-			return ECANCELED;
-		}
-	}
-
-	configure_xfs_verify(ctx);
 
 	/*
 	 * Everything's set up, which means any failures recorded after



* [PATCH 16/22] xfs_scrub: perform media scanning of the log region
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (14 preceding siblings ...)
  2026-03-19  4:49   ` [PATCH 15/22] scrub: don't allocate disk for ioctl-based media verify Darrick J. Wong
@ 2026-03-19  4:49   ` Darrick J. Wong
  2026-03-20  7:15     ` Christoph Hellwig
  2026-03-19  4:49   ` [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids Darrick J. Wong
                     ` (5 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Scan the log area for media errors because a defect in that region could
prevent the user from performing log recovery.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
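
For illustration only (not part of the patch): a minimal sketch of how
the owner check in check_rmap decides which special-owner extents still
get read-verified.  The sk_* names and constant values are hypothetical
stand-ins for the fsmap definitions, and only the special-owner flag is
modelled here; the real function may look at other flags too.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in the fsmap headers. */
#define SK_OF_SPECIAL_OWNER	0x10u
enum sk_owner { SK_OWN_UNKNOWN = 1, SK_OWN_LOG = 2, SK_OWN_AG = 3 };

struct sk_map {
	uint32_t	flags;
	uint64_t	owner;
};

/*
 * Clear the special-owner flag for owners whose blocks should still be
 * read-verified, then report whether the extent should be scheduled.
 */
static bool
sk_should_verify(
	struct sk_map	*map)
{
	if (map->flags & SK_OF_SPECIAL_OWNER) {
		switch (map->owner) {
		case SK_OWN_UNKNOWN:
		case SK_OWN_LOG:
			map->flags &= ~SK_OF_SPECIAL_OWNER;
			break;
		default:
			break;
		}
	}

	/* Anything still marked special is metadata we do not read here. */
	return !(map->flags & SK_OF_SPECIAL_OWNER);
}
```

With this shape, adding another verifiable owner is a one-line change to
the switch statement.
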
 scrub/phase6.c |   37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index bae4d7c5d92ce0..c1ce222d4ce99f 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -47,6 +47,7 @@ struct media_verify_state {
 	struct read_verify_pool	*rvp_realtime;
 	struct bitmap		*d_bad;		/* bytes */
 	struct bitmap		*r_bad;		/* bytes */
+	struct bitmap		*l_bad;		/* bytes */
 	bool			d_trunc:1;
 	bool			r_trunc:1;
 	bool			l_trunc:1;
@@ -109,6 +110,8 @@ bitmap_for_disk(
 		return vs->d_bad;
 	case XFS_DEV_RT:
 		return vs->r_bad;
+	case XFS_DEV_LOG:
+		return vs->l_bad;
 	default:
 		return NULL;
 	}
@@ -571,6 +574,12 @@ report_all_media_errors(
 		return ret;
 	}
 
+	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_LOG);
+	if (ret) {
+		str_liberror(ctx, ret, _("walking log device io errors"));
+		return ret;
+	}
+
 	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_RT);
 	if (ret) {
 		str_liberror(ctx, ret, _("walking rtdev io errors"));
@@ -617,9 +626,14 @@ check_rmap(
 			map->fmr_flags);
 
 	/* "Unknown" extents should be verified; they could be data. */
-	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
-			map->fmr_owner == XFS_FMR_OWN_UNKNOWN)
-		map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER)) {
+		switch (map->fmr_owner) {
+		case XFS_FMR_OWN_UNKNOWN:
+		case XFS_FMR_OWN_LOG:
+			map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+			break;
+		}
+	}
 
 	/*
 	 * We only care about read-verifying data extents that have been
@@ -762,11 +776,17 @@ phase6_func(
 		goto out_dbad;
 	}
 
+	ret = -bitmap_alloc(&vs.l_bad);
+	if (ret) {
+		str_liberror(ctx, ret, _("creating log badblock bitmap"));
+		goto out_rbad;
+	}
+
 	ret = read_verify_pool_alloc(ctx, XFS_DEV_DATA, remember_ioerr, &vs,
 			&vs.rvp_data);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
-		goto out_rbad;
+		goto out_lbad;
 	}
 	if (ctx->fsinfo.fs_log) {
 		ret = read_verify_pool_alloc(ctx, XFS_DEV_LOG, remember_ioerr,
@@ -823,16 +843,17 @@ phase6_func(
 	 */
 	if (ret || ret2 || ret3) {
 		ret |= ret2 | ret3; /* caller only cares about non-zero/zero */
-		goto out_rbad;
+		goto out_lbad;
 	}
 	if (bitmap_empty(vs.d_bad) && !vs.d_trunc &&
 	    bitmap_empty(vs.r_bad) && !vs.r_trunc &&
-	    !vs.l_trunc)
-		goto out_rbad;
+	    bitmap_empty(vs.l_bad) && !vs.l_trunc)
+		goto out_lbad;
 
 	/* Scan the whole dir tree to see what matches the bad extents. */
 	ret = report_all_media_errors(ctx, &vs);
 
+	bitmap_free(&vs.l_bad);
 	bitmap_free(&vs.r_bad);
 	bitmap_free(&vs.d_bad);
 	return ret;
@@ -852,6 +873,8 @@ phase6_func(
 out_datapool:
 	read_verify_pool_abort(vs.rvp_data);
 	read_verify_pool_destroy(vs.rvp_data);
+out_lbad:
+	bitmap_free(&vs.l_bad);
 out_rbad:
 	bitmap_free(&vs.r_bad);
 out_dbad:



* [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (15 preceding siblings ...)
  2026-03-19  4:49   ` [PATCH 16/22] xfs_scrub: perform media scanning of the log region Darrick J. Wong
@ 2026-03-19  4:49   ` Darrick J. Wong
  2026-03-20  7:15     ` Christoph Hellwig
  2026-03-19  4:50   ` [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool Darrick J. Wong
                     ` (4 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the separate read-verify pool pointers in struct
media_verify_state into an array so that we can index the pools via
enum xfs_device.  This will enable further cleanups in the next few
patches.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
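
For illustration only (not part of the patch): a minimal sketch of the
"array indexed by device id" layout.  The sk_* names are hypothetical,
and the enum values are assumed to start at 1 as the XFS_DEV_RT + 1
sizing suggests.  Unlike the patch, which calls abort() on an unknown
device, this sketch returns NULL so that it is easy to exercise.

```c
#include <stddef.h>

/* Stand-in for enum xfs_device; the values are assumed for this sketch. */
enum sk_device { SK_DEV_DATA = 1, SK_DEV_LOG = 2, SK_DEV_RT = 3 };

struct sk_pool {
	int		id;
};

struct sk_state {
	/* Slot 0 is unused so that device ids index the array directly. */
	struct sk_pool	*pool[SK_DEV_RT + 1];
};

/* Return the pool for a device, or NULL if the id is invalid or unset. */
static struct sk_pool *
sk_pool_for(
	struct sk_state	*st,
	int		dev)
{
	if (dev < SK_DEV_DATA || dev > SK_DEV_RT)
		return NULL;
	return st->pool[dev];
}
```

The cost is one wasted array slot; the gain is that per-device helpers
can take a device id instead of a named struct member.
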
 scrub/phase6.c |  103 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 55 insertions(+), 48 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index c1ce222d4ce99f..a05bd7e1df1728 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -42,9 +42,8 @@
 struct media_verify_state {
 	struct ptvar		*verify_schedules;
 
-	struct read_verify_pool	*rvp_data;
-	struct read_verify_pool	*rvp_log;
-	struct read_verify_pool	*rvp_realtime;
+	struct read_verify_pool	*rvp[XFS_DEV_RT + 1];
+
 	struct bitmap		*d_bad;		/* bytes */
 	struct bitmap		*r_bad;		/* bytes */
 	struct bitmap		*l_bad;		/* bytes */
@@ -53,28 +52,24 @@ struct media_verify_state {
 	bool			l_trunc:1;
 };
 
-/* Find the fd for a given device identifier. */
-static struct read_verify_pool *
-dev_to_pool(
-	struct scrub_ctx		*ctx,
-	struct media_verify_state	*vs,
-	dev_t				dev)
+/* Return XFS device index from fsmap device. */
+static enum xfs_device
+from_fsmap_dev(
+	struct scrub_ctx	*ctx,
+	dev_t			dev)
 {
 	if (ctx->mnt.fsgeom.rtstart) {
-		if (dev == XFS_DEV_DATA)
-			return vs->rvp_data;
-		if (dev == XFS_DEV_LOG)
-			return vs->rvp_log;
-		if (dev == XFS_DEV_RT)
-			return vs->rvp_realtime;
-	} else {
-		if (dev == ctx->fsinfo.fs_datadev)
-			return vs->rvp_data;
-		if (dev == ctx->fsinfo.fs_logdev)
-			return vs->rvp_log;
-		if (dev == ctx->fsinfo.fs_rtdev)
-			return vs->rvp_realtime;
+		if (dev < XFS_DEV_DATA || dev > XFS_DEV_RT)
+			abort();
+		return dev;
 	}
+
+	if (dev == ctx->fsinfo.fs_datadev)
+		return XFS_DEV_DATA;
+	if (dev == ctx->fsinfo.fs_logdev)
+		return XFS_DEV_LOG;
+	if (dev == ctx->fsinfo.fs_rtdev)
+		return XFS_DEV_RT;
 	abort();
 }
 
@@ -611,13 +606,12 @@ check_rmap(
 	void				*arg)
 {
 	struct media_verify_state	*vs = arg;
-	struct read_verify_pool		*rvp;
+	struct read_verify_pool		*rvp =
+		vs->rvp[from_fsmap_dev(ctx, map->fmr_device)];
 	struct read_verify_schedule	*rs;
 	bool				scheduled;
 	int				ret;
 
-	rvp = dev_to_pool(ctx, vs, map->fmr_device);
-
 	dbg_printf("rmap dev %d:%d phys %"PRIu64" owner %"PRId64
 			" offset %"PRIu64" len %"PRIu64" flags 0x%x\n",
 			major(map->fmr_device), minor(map->fmr_device),
@@ -686,11 +680,13 @@ force_one_verify(
 /* Wait for read/verify actions to finish, then return # bytes checked. */
 static int
 clean_pool(
-	struct read_verify_pool	*rvp,
-	unsigned long long	*bytes_checked)
+	struct media_verify_state	*vs,
+	enum xfs_device			dev,
+	unsigned long long		*bytes_checked)
 {
-	uint64_t		pool_checked;
-	int			ret;
+	struct read_verify_pool		*rvp = vs->rvp[dev];
+	uint64_t			pool_checked;
+	int				ret;
 
 	if (!rvp)
 		return 0;
@@ -749,6 +745,27 @@ remember_ioerr(
 		str_liberror(ctx, ret, _("setting bad block bitmap"));
 }
 
+static inline int
+alloc_pool(
+	struct scrub_ctx		*ctx,
+	struct media_verify_state	*vs,
+	enum xfs_device			dev)
+{
+	return read_verify_pool_alloc(ctx, dev, remember_ioerr, vs,
+			&vs->rvp[dev]);
+}
+
+static inline void
+free_pool(
+	struct media_verify_state	*vs,
+	enum xfs_device			dev)
+{
+	if (vs->rvp[dev]) {
+		read_verify_pool_abort(vs->rvp[dev]);
+		read_verify_pool_destroy(vs->rvp[dev]);
+	}
+}
+
 /*
  * Read verify all the file data blocks in a filesystem.  Since XFS doesn't
  * do data checksums, we trust that the underlying storage will pass back
@@ -782,15 +799,13 @@ phase6_func(
 		goto out_rbad;
 	}
 
-	ret = read_verify_pool_alloc(ctx, XFS_DEV_DATA, remember_ioerr, &vs,
-			&vs.rvp_data);
+	ret = alloc_pool(ctx, &vs, XFS_DEV_DATA);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
 		goto out_lbad;
 	}
 	if (ctx->fsinfo.fs_log) {
-		ret = read_verify_pool_alloc(ctx, XFS_DEV_LOG, remember_ioerr,
-				&vs, &vs.rvp_log);
+		ret = alloc_pool(ctx, &vs, XFS_DEV_LOG);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating logdev media verifier"));
@@ -798,8 +813,7 @@ phase6_func(
 		}
 	}
 	if (ctx->fsinfo.fs_rt) {
-		ret = read_verify_pool_alloc(ctx, XFS_DEV_RT, remember_ioerr,
-				&vs, &vs.rvp_realtime);
+		ret = alloc_pool(ctx, &vs, XFS_DEV_RT);
 		if (ret) {
 			str_liberror(ctx, ret,
 					_("creating rtdev media verifier"));
@@ -825,15 +839,15 @@ phase6_func(
 	ptvar_free(vs.verify_schedules);
 	vs.verify_schedules = NULL;
 
-	ret = clean_pool(vs.rvp_data, &ctx->bytes_checked);
+	ret = clean_pool(&vs, XFS_DEV_DATA, &ctx->bytes_checked);
 	if (ret)
 		str_liberror(ctx, ret, _("flushing datadev verify pool"));
 
-	ret2 = clean_pool(vs.rvp_log, &ctx->bytes_checked);
+	ret2 = clean_pool(&vs, XFS_DEV_LOG, &ctx->bytes_checked);
 	if (ret2)
 		str_liberror(ctx, ret2, _("flushing logdev verify pool"));
 
-	ret3 = clean_pool(vs.rvp_realtime, &ctx->bytes_checked);
+	ret3 = clean_pool(&vs, XFS_DEV_RT, &ctx->bytes_checked);
 	if (ret3)
 		str_liberror(ctx, ret3, _("flushing rtdev verify pool"));
 
@@ -861,18 +875,11 @@ phase6_func(
 out_schedules:
 	ptvar_free(vs.verify_schedules);
 out_rtpool:
-	if (vs.rvp_realtime) {
-		read_verify_pool_abort(vs.rvp_realtime);
-		read_verify_pool_destroy(vs.rvp_realtime);
-	}
+	free_pool(&vs, XFS_DEV_RT);
 out_logpool:
-	if (vs.rvp_log) {
-		read_verify_pool_abort(vs.rvp_log);
-		read_verify_pool_destroy(vs.rvp_log);
-	}
+	free_pool(&vs, XFS_DEV_LOG);
 out_datapool:
-	read_verify_pool_abort(vs.rvp_data);
-	read_verify_pool_destroy(vs.rvp_data);
+	free_pool(&vs, XFS_DEV_DATA);
 out_lbad:
 	bitmap_free(&vs.l_bad);
 out_rbad:



* [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (16 preceding siblings ...)
  2026-03-19  4:49   ` [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids Darrick J. Wong
@ 2026-03-19  4:50   ` Darrick J. Wong
  2026-03-20  7:15     ` Christoph Hellwig
  2026-03-19  4:50   ` [PATCH 19/22] xfs_scrub: clean up device-related error messages Darrick J. Wong
                     ` (3 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Remove all the indirect verify IO error function calls and whatnot by
making the read_verify_pool track the ranges of failed media and other
problems.  Add some new helper functions that report on the outcome of
the read verification so that phase6 can report on what happened.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
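
For illustration only (not part of the patch): a minimal sketch of a
pool that owns its own failure outputs instead of calling back into the
caller.  The sk_* names are hypothetical, and a plain growable array
stands in for the libfrog bitmap.  As in the patch, the failure list is
allocated lazily and a zero-length error is treated as "device
truncated".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct sk_range {
	uint64_t	start;
	uint64_t	length;
};

struct sk_pool {
	struct sk_range	*failed;	/* NULL until the first media error */
	size_t		nr_failed;
	bool		truncated;
};

/* Record an error; a zero-length report means the device ended early. */
static int
sk_record_error(
	struct sk_pool	*p,
	uint64_t	start,
	uint64_t	length)
{
	struct sk_range	*n;

	if (!length) {
		p->truncated = true;
		return 0;
	}

	n = realloc(p->failed, (p->nr_failed + 1) * sizeof(*n));
	if (!n)
		return ENOMEM;
	n[p->nr_failed].start = start;
	n[p->nr_failed].length = length;
	p->failed = n;
	p->nr_failed++;
	return 0;
}

/* Verification succeeded if nothing failed and nothing was cut short. */
static bool
sk_ok(
	const struct sk_pool	*p)
{
	return p->failed == NULL && !p->truncated;
}

/* Does [start, start + length) overlap any recorded failure? */
static bool
sk_has_failed(
	const struct sk_pool	*p,
	uint64_t		start,
	uint64_t		length)
{
	size_t			i;

	for (i = 0; i < p->nr_failed; i++) {
		const struct sk_range	*r = &p->failed[i];

		if (start < r->start + r->length && r->start < start + length)
			return true;
	}
	return false;
}
```

Because the failure list stays NULL on a clean run, the "ok" query is a
pointer test plus a flag test, and the caller has nothing to allocate up
front.
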
 scrub/read_verify.h |   13 +++-
 scrub/phase6.c      |  184 +++++++++++++--------------------------------------
 scrub/read_verify.c |  129 +++++++++++++++++++++++++++++++-----
 3 files changed, 170 insertions(+), 156 deletions(-)


diff --git a/scrub/read_verify.h b/scrub/read_verify.h
index 6a338abb896bc3..c46ba1e76a8d03 100644
--- a/scrub/read_verify.h
+++ b/scrub/read_verify.h
@@ -21,7 +21,6 @@ typedef void (*read_verify_ioerr_fn_t)(struct scrub_ctx *ctx,
 		int error, void *arg);
 
 int read_verify_pool_alloc(struct scrub_ctx *ctx, enum xfs_device dev,
-		read_verify_ioerr_fn_t ioerr_fn, void *ioerr_arg,
 		struct read_verify_pool **prvp);
 void read_verify_pool_abort(struct read_verify_pool *rvp);
 int read_verify_pool_flush(struct read_verify_pool *rvp);
@@ -31,7 +30,17 @@ int read_verify_schedule_now(struct read_verify_schedule *rs);
 bool try_read_verify_schedule_io(struct read_verify_schedule *rs,
 		struct read_verify_pool *rvp, uint64_t start, uint64_t length);
 
-int read_verify_bytes(struct read_verify_pool *rvp, uint64_t *bytes);
+bool read_verify_ok(const struct read_verify_pool *rvp);
+bool read_verify_truncated(const struct read_verify_pool *rvp);
+uint64_t read_verify_progress(const struct read_verify_pool *rvp);
+
+int read_verify_iterate_failed(struct read_verify_pool *rvp,
+		int (*fn)(uint64_t, uint64_t, void *), void *arg);
+int read_verify_iterate_failed_range(struct read_verify_pool *rvp,
+		uint64_t start, uint64_t length,
+		int (*fn)(uint64_t, uint64_t, void *), void *arg);
+bool read_verify_has_failed(struct read_verify_pool *rvp, uint64_t start,
+		uint64_t len);
 
 unsigned int read_verify_nproc(struct scrub_ctx *ctx);
 
diff --git a/scrub/phase6.c b/scrub/phase6.c
index a05bd7e1df1728..bf9ef3d12690ab 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -43,13 +43,6 @@ struct media_verify_state {
 	struct ptvar		*verify_schedules;
 
 	struct read_verify_pool	*rvp[XFS_DEV_RT + 1];
-
-	struct bitmap		*d_bad;		/* bytes */
-	struct bitmap		*r_bad;		/* bytes */
-	struct bitmap		*l_bad;		/* bytes */
-	bool			d_trunc:1;
-	bool			r_trunc:1;
-	bool			l_trunc:1;
 };
 
 /* Return XFS device index from fsmap device. */
@@ -94,24 +87,6 @@ to_fsmap_dev(
 	}
 }
 
-/* Find the incore bad blocks bitmap for a given disk. */
-static struct bitmap *
-bitmap_for_disk(
-	enum xfs_device			dev,
-	struct media_verify_state	*vs)
-{
-	switch (dev) {
-	case XFS_DEV_DATA:
-		return vs->d_bad;
-	case XFS_DEV_RT:
-		return vs->r_bad;
-	case XFS_DEV_LOG:
-		return vs->l_bad;
-	default:
-		return NULL;
-	}
-}
-
 struct disk_ioerr_report {
 	struct scrub_ctx	*ctx;
 	enum xfs_device		dev;
@@ -190,6 +165,13 @@ _("media error at data offset %llu length %llu."),
 	return 0;
 }
 
+static inline enum xfs_device from_fsx(const struct fsxattr *fsx)
+{
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		return XFS_DEV_RT;
+	return XFS_DEV_DATA;
+}
+
 /* Report if this extent overlaps a bad region. */
 static int
 report_data_loss(
@@ -202,7 +184,6 @@ report_data_loss(
 {
 	struct badfile_report		*br = arg;
 	struct media_verify_state	*vs = br->vs;
-	struct bitmap			*bmp;
 
 	br->bmap = bmap;
 
@@ -210,13 +191,9 @@ report_data_loss(
 	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
 		return 0;
 
-	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
-		bmp = vs->r_bad;
-	else
-		bmp = vs->d_bad;
-
-	return -bitmap_iterate_range(bmp, bmap->bm_physical, bmap->bm_length,
-			report_badfile, br);
+	return read_verify_iterate_failed_range(vs->rvp[from_fsx(fsx)],
+			bmap->bm_physical, bmap->bm_length, report_badfile,
+			br);
 }
 
 /* Report if the extended attribute data overlaps a bad region. */
@@ -231,7 +208,6 @@ report_attr_loss(
 {
 	struct badfile_report		*br = arg;
 	struct media_verify_state	*vs = br->vs;
-	struct bitmap			*bmp = vs->d_bad;
 
 	/* Complain about attr fork extents that don't look right. */
 	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC)) {
@@ -246,7 +222,8 @@ _("found unexpected realtime attr fork extent."));
 		return 0;
 	}
 
-	if (bitmap_test(bmp, bmap->bm_physical, bmap->bm_length))
+	if (read_verify_has_failed(vs->rvp[XFS_DEV_DATA], bmap->bm_physical,
+				bmap->bm_length))
 		str_corrupt(ctx, br->descr,
 _("media error in extended attribute data."));
 
@@ -530,6 +507,19 @@ report_ioerr(
 			&fr);
 }
 
+static inline const char *trunc_msg(enum xfs_device dev)
+{
+	switch (dev) {
+	case XFS_DEV_DATA:
+		return _("data device truncated");
+	case XFS_DEV_LOG:
+		return _("log device truncated");
+	case XFS_DEV_RT:
+		return _("rt device truncated");
+	}
+	abort();
+}
+
 /* Report all the media errors found on a disk. */
 static int
 report_disk_ioerrs(
@@ -537,15 +527,18 @@ report_disk_ioerrs(
 	struct media_verify_state	*vs,
 	enum xfs_device			dev)
 {
-	struct bitmap			*tree = bitmap_for_disk(dev, vs);
 	struct disk_ioerr_report	dioerr = {
 		.ctx			= ctx,
 		.dev			= dev,
 	};
 
-	if (!tree)
+	if (!vs->rvp[dev])
 		return 0;
-	return -bitmap_iterate(tree, report_ioerr, &dioerr);
+
+	if (read_verify_truncated(vs->rvp[dev]))
+		str_corrupt(ctx, ctx->mntpoint, trunc_msg(dev));
+
+	return read_verify_iterate_failed(vs->rvp[dev], report_ioerr, &dioerr);
 }
 
 /* Given bad extent lists for the data & rtdev, find bad files. */
@@ -556,13 +549,6 @@ report_all_media_errors(
 {
 	int				ret;
 
-	if (vs->d_trunc)
-		str_corrupt(ctx, ctx->mntpoint, _("data device truncated"));
-	if (vs->l_trunc)
-		str_corrupt(ctx, ctx->mntpoint, _("log device truncated"));
-	if (vs->r_trunc)
-		str_corrupt(ctx, ctx->mntpoint, _("rt device truncated"));
-
 	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_DATA);
 	if (ret) {
 		str_liberror(ctx, ret, _("walking datadev io errors"));
@@ -682,10 +668,10 @@ static int
 clean_pool(
 	struct media_verify_state	*vs,
 	enum xfs_device			dev,
-	unsigned long long		*bytes_checked)
+	unsigned long long		*bytes_checked,
+	bool				*ok)
 {
 	struct read_verify_pool		*rvp = vs->rvp[dev];
-	uint64_t			pool_checked;
 	int				ret;
 
 	if (!rvp)
@@ -693,56 +679,12 @@ clean_pool(
 
 	ret = read_verify_pool_flush(rvp);
 	if (ret)
-		goto out_destroy;
+		return ret;
 
-	ret = read_verify_bytes(rvp, &pool_checked);
-	if (ret)
-		goto out_destroy;
-
-	*bytes_checked += pool_checked;
-out_destroy:
-	read_verify_pool_destroy(rvp);
-	return ret;
-}
-
-/* Remember a media error for later. */
-static void
-remember_ioerr(
-	struct scrub_ctx		*ctx,
-	enum xfs_device			dev,
-	uint64_t			start,
-	uint64_t			length,
-	int				error,
-	void				*arg)
-{
-	struct media_verify_state	*vs = arg;
-	struct bitmap			*tree;
-	int				ret;
-
-	if (!length) {
-		switch (dev) {
-		case XFS_DEV_DATA:
-			vs->d_trunc = true;
-			break;
-		case XFS_DEV_LOG:
-			vs->l_trunc = true;
-			break;
-		case XFS_DEV_RT:
-			vs->r_trunc = true;
-			break;
-		}
-		return;
-	}
-
-	tree = bitmap_for_disk(dev, vs);
-	if (!tree) {
-		str_liberror(ctx, ENOENT, _("finding bad block bitmap"));
-		return;
-	}
-
-	ret = -bitmap_set(tree, start, length);
-	if (ret)
-		str_liberror(ctx, ret, _("setting bad block bitmap"));
+	*bytes_checked += read_verify_progress(rvp);
+	if (!read_verify_ok(rvp))
+		*ok = false;
+	return 0;
 }
 
 static inline int
@@ -751,8 +693,7 @@ alloc_pool(
 	struct media_verify_state	*vs,
 	enum xfs_device			dev)
 {
-	return read_verify_pool_alloc(ctx, dev, remember_ioerr, vs,
-			&vs->rvp[dev]);
+	return read_verify_pool_alloc(ctx, dev, &vs->rvp[dev]);
 }
 
 static inline void
@@ -779,30 +720,13 @@ phase6_func(
 	struct scrub_ctx		*ctx)
 {
 	struct media_verify_state	vs = { NULL };
+	bool				ok = true;
 	int				ret, ret2, ret3;
 
-	ret = -bitmap_alloc(&vs.d_bad);
-	if (ret) {
-		str_liberror(ctx, ret, _("creating datadev badblock bitmap"));
-		return ret;
-	}
-
-	ret = -bitmap_alloc(&vs.r_bad);
-	if (ret) {
-		str_liberror(ctx, ret, _("creating realtime badblock bitmap"));
-		goto out_dbad;
-	}
-
-	ret = -bitmap_alloc(&vs.l_bad);
-	if (ret) {
-		str_liberror(ctx, ret, _("creating log badblock bitmap"));
-		goto out_rbad;
-	}
-
 	ret = alloc_pool(ctx, &vs, XFS_DEV_DATA);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating datadev media verifier"));
-		goto out_lbad;
+		return ret;
 	}
 	if (ctx->fsinfo.fs_log) {
 		ret = alloc_pool(ctx, &vs, XFS_DEV_LOG);
@@ -839,15 +763,15 @@ phase6_func(
 	ptvar_free(vs.verify_schedules);
 	vs.verify_schedules = NULL;
 
-	ret = clean_pool(&vs, XFS_DEV_DATA, &ctx->bytes_checked);
+	ret = clean_pool(&vs, XFS_DEV_DATA, &ctx->bytes_checked, &ok);
 	if (ret)
 		str_liberror(ctx, ret, _("flushing datadev verify pool"));
 
-	ret2 = clean_pool(&vs, XFS_DEV_LOG, &ctx->bytes_checked);
+	ret2 = clean_pool(&vs, XFS_DEV_LOG, &ctx->bytes_checked, &ok);
 	if (ret2)
 		str_liberror(ctx, ret2, _("flushing logdev verify pool"));
 
-	ret3 = clean_pool(&vs, XFS_DEV_RT, &ctx->bytes_checked);
+	ret3 = clean_pool(&vs, XFS_DEV_RT, &ctx->bytes_checked, &ok);
 	if (ret3)
 		str_liberror(ctx, ret3, _("flushing rtdev verify pool"));
 
@@ -855,22 +779,14 @@ phase6_func(
 	 * If the verify flush didn't work or we found no bad blocks, we're
 	 * done!  No errors detected.
 	 */
-	if (ret || ret2 || ret3) {
+	if (ret || ret2 || ret3 || ok) {
 		ret |= ret2 | ret3; /* caller only cares about non-zero/zero */
-		goto out_lbad;
+		goto out_rtpool;
 	}
-	if (bitmap_empty(vs.d_bad) && !vs.d_trunc &&
-	    bitmap_empty(vs.r_bad) && !vs.r_trunc &&
-	    bitmap_empty(vs.l_bad) && !vs.l_trunc)
-		goto out_lbad;
 
 	/* Scan the whole dir tree to see what matches the bad extents. */
 	ret = report_all_media_errors(ctx, &vs);
-
-	bitmap_free(&vs.l_bad);
-	bitmap_free(&vs.r_bad);
-	bitmap_free(&vs.d_bad);
-	return ret;
+	goto out_rtpool;
 
 out_schedules:
 	ptvar_free(vs.verify_schedules);
@@ -880,12 +796,6 @@ phase6_func(
 	free_pool(&vs, XFS_DEV_LOG);
 out_datapool:
 	free_pool(&vs, XFS_DEV_DATA);
-out_lbad:
-	bitmap_free(&vs.l_bad);
-out_rbad:
-	bitmap_free(&vs.r_bad);
-out_dbad:
-	bitmap_free(&vs.d_bad);
 	return ret;
 }
 
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index f724dfd693d7ab..4eb7484317e5ca 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -9,6 +9,7 @@
 #include <sys/statvfs.h>
 #include "libfrog/workqueue.h"
 #include "libfrog/paths.h"
+#include "libfrog/bitmap.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "counter.h"
@@ -58,8 +59,6 @@ struct read_verify_pool {
 	struct scrub_ctx	*ctx;		/* scrub context */
 	void			*readbuf;	/* read buffer */
 	struct ptcounter	*verified_bytes;
-	void			*ioerr_arg;
-	read_verify_ioerr_fn_t	ioerr_fn;	/* io error callback */
 	size_t			miniosz;	/* minimum io size, bytes */
 	enum xfs_device		dev;		/* which device? */
 
@@ -68,6 +67,10 @@ struct read_verify_pool {
 	 * return it to the caller.
 	 */
 	int			runtime_error;
+
+	/* outputs: a bad block bitmap and a truncated flag */
+	struct bitmap		*failmap;
+	bool			truncated;
 };
 
 unsigned int
@@ -88,15 +91,11 @@ read_verify_nproc(
 
 /*
  * Create a thread pool to run read verifiers.
- *
- * @ioerr_fn will be called when IO errors occur.
  */
 int
 read_verify_pool_alloc(
 	struct scrub_ctx		*ctx,
 	enum xfs_device			dev,
-	read_verify_ioerr_fn_t		ioerr_fn,
-	void				*ioerr_arg,
 	struct read_verify_pool		**prvp)
 {
 	struct read_verify_pool		*rvp;
@@ -121,8 +120,6 @@ read_verify_pool_alloc(
 	rvp->miniosz = ctx->mnt.fsgeom.blocksize;
 	rvp->ctx = ctx;
 	rvp->dev = dev;
-	rvp->ioerr_fn = ioerr_fn;
-	rvp->ioerr_arg = ioerr_arg;
 	ret = -workqueue_create(&rvp->wq, (struct xfs_mount *)rvp,
 			verifier_threads == 1 ? 0 : verifier_threads);
 	if (ret)
@@ -146,7 +143,8 @@ read_verify_pool_abort(
 {
 	if (!rvp->runtime_error)
 		rvp->runtime_error = ECANCELED;
-	workqueue_terminate(&rvp->wq);
+	if (!rvp->wq.terminated)
+		workqueue_terminate(&rvp->wq);
 }
 
 /* Finish up any read verification work. */
@@ -163,6 +161,7 @@ read_verify_pool_destroy(
 	struct read_verify_pool		*rvp)
 {
 	workqueue_destroy(&rvp->wq);
+	bitmap_free(&rvp->failmap);
 	ptcounter_free(rvp->verified_bytes);
 	free(rvp->readbuf);
 	free(rvp);
@@ -286,6 +285,39 @@ read_verify_one(
 			single_step);
 }
 
+/* Remember a media error for later. */
+static int
+read_verify_error(
+	struct read_verify_pool		*rvp,
+	uint64_t			start,
+	uint64_t			length,
+	int				error)
+{
+	int				ret;
+
+	if (!length) {
+		rvp->truncated = true;
+		return 0;
+	}
+
+	if (!rvp->failmap) {
+		ret = -bitmap_alloc(&rvp->failmap);
+		if (ret) {
+			str_liberror(rvp->ctx, ret,
+ _("allocating bad block bitmap"));
+			return ret;
+		}
+	}
+
+	ret = -bitmap_set(rvp->failmap, start, length);
+	if (ret) {
+		str_liberror(rvp->ctx, ret, _("setting bad block bitmap"));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Issue a read-verify IO in big batches.
  */
@@ -356,14 +388,18 @@ read_verify(
 			sz = rvp->miniosz - (rv->io_start % rvp->miniosz);
 			dbg_printf("IOERR %u @ %"PRIu64" %zu err %d\n",
 					rvp->dev, rv->io_start, sz, read_error);
-			rvp->ioerr_fn(rvp->ctx, rvp->dev, rv->io_start, sz,
-					read_error, rvp->ioerr_arg);
+			ret = read_verify_error(rvp, rv->io_start, sz,
+					read_error);
+			if (ret)
+				goto out_err;
 		} else if (sz == 0) {
 			/* No bytes at all?  Did we hit the end of the disk? */
 			dbg_printf("EOF %u @ %"PRIu64" %zu err %d\n",
 					rvp->dev, rv->io_start, sz, read_error);
-			rvp->ioerr_fn(rvp->ctx, rvp->dev, rv->io_start, sz,
-					read_error, rvp->ioerr_arg);
+			ret = read_verify_error(rvp, rv->io_start, sz,
+					read_error);
+			if (ret)
+				goto out_err;
 			break;
 		} else if (sz < len) {
 			/*
@@ -392,6 +428,7 @@ read_verify(
 		background_sleep();
 	}
 
+out_err:
 	free(rv);
 	ret = ptcounter_add(rvp->verified_bytes, verified);
 	if (ret)
@@ -491,11 +528,69 @@ try_read_verify_schedule_io(
 	return false;
 }
 
-/* How many bytes has this process verified? */
+/* Did read verification succeed? */
+bool
+read_verify_ok(
+	const struct read_verify_pool	*rvp)
+{
+	return rvp->failmap == NULL && !rvp->truncated;
+}
+
+/* Did the verification unexpectedly stop early due to short reads? */
+bool
+read_verify_truncated(
+	const struct read_verify_pool	*rvp)
+{
+	return rvp->truncated;
+}
+
+/* How many bytes has this pool verified? */
+uint64_t
+read_verify_progress(
+	const struct read_verify_pool	*rvp)
+{
+	uint64_t			ret = 0;
+
+	ptcounter_value(rvp->verified_bytes, &ret);
+	return ret;
+}
+
+/* Call @fn for every media failure this pool observed. */
 int
-read_verify_bytes(
+read_verify_iterate_failed(
+	struct read_verify_pool		*rvp,
+	int				(*fn)(uint64_t, uint64_t, void *),
+	void				*arg)
+{
+	if (!rvp->failmap)
+		return 0;
+
+	return -bitmap_iterate(rvp->failmap, fn, arg);
+}
+
+/* Call @fn for every media failure this pool observed in the given range. */
+int
+read_verify_iterate_failed_range(
+	struct read_verify_pool		*rvp,
+	uint64_t			start,
+	uint64_t			length,
+	int				(*fn)(uint64_t, uint64_t, void *),
+	void				*arg)
+{
+	if (!rvp->failmap)
+		return 0;
+
+	return -bitmap_iterate_range(rvp->failmap, start, length, fn, arg);
+}
+
+/* Were there any media failures within the given range? */
+bool
+read_verify_has_failed(
 	struct read_verify_pool		*rvp,
-	uint64_t			*bytes_checked)
+	uint64_t			start,
+	uint64_t			length)
 {
-	return ptcounter_value(rvp->verified_bytes, bytes_checked);
+	if (rvp->failmap)
+		return bitmap_test(rvp->failmap, start, length);
+	return false;
 }



* [PATCH 19/22] xfs_scrub: clean up device-related error messages
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (17 preceding siblings ...)
  2026-03-19  4:50   ` [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool Darrick J. Wong
@ 2026-03-19  4:50   ` Darrick J. Wong
  2026-03-20  7:15     ` Christoph Hellwig
  2026-03-19  4:50   ` [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk Darrick J. Wong
                     ` (2 subsequent siblings)
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use consistent terminology for the data/log/rt device in the error
messages.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase6.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index bf9ef3d12690ab..14f1a26ab4ff7f 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -551,7 +551,7 @@ report_all_media_errors(
 
 	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_DATA);
 	if (ret) {
-		str_liberror(ctx, ret, _("walking datadev io errors"));
+		str_liberror(ctx, ret, _("walking data device io errors"));
 		return ret;
 	}
 
@@ -563,7 +563,7 @@ report_all_media_errors(
 
 	ret = report_disk_ioerrs(ctx, vs, XFS_DEV_RT);
 	if (ret) {
-		str_liberror(ctx, ret, _("walking rtdev io errors"));
+		str_liberror(ctx, ret, _("walking rt device io errors"));
 		return ret;
 	}
 
@@ -725,14 +725,14 @@ phase6_func(
 
 	ret = alloc_pool(ctx, &vs, XFS_DEV_DATA);
 	if (ret) {
-		str_liberror(ctx, ret, _("creating datadev media verifier"));
+		str_liberror(ctx, ret, _("creating data device media verifier"));
 		return ret;
 	}
 	if (ctx->fsinfo.fs_log) {
 		ret = alloc_pool(ctx, &vs, XFS_DEV_LOG);
 		if (ret) {
 			str_liberror(ctx, ret,
-					_("creating logdev media verifier"));
+					_("creating log device media verifier"));
 			goto out_datapool;
 		}
 	}
@@ -740,7 +740,7 @@ phase6_func(
 		ret = alloc_pool(ctx, &vs, XFS_DEV_RT);
 		if (ret) {
 			str_liberror(ctx, ret,
-					_("creating rtdev media verifier"));
+					_("creating rt device media verifier"));
 			goto out_logpool;
 		}
 	}
@@ -765,15 +765,15 @@ phase6_func(
 
 	ret = clean_pool(&vs, XFS_DEV_DATA, &ctx->bytes_checked, &ok);
 	if (ret)
-		str_liberror(ctx, ret, _("flushing datadev verify pool"));
+		str_liberror(ctx, ret, _("flushing data device verify pool"));
 
 	ret2 = clean_pool(&vs, XFS_DEV_LOG, &ctx->bytes_checked, &ok);
 	if (ret2)
-		str_liberror(ctx, ret2, _("flushing logdev verify pool"));
+		str_liberror(ctx, ret2, _("flushing log device verify pool"));
 
 	ret3 = clean_pool(&vs, XFS_DEV_RT, &ctx->bytes_checked, &ok);
 	if (ret3)
-		str_liberror(ctx, ret3, _("flushing rtdev verify pool"));
+		str_liberror(ctx, ret3, _("flushing rt device verify pool"));
 
 	/*
 	 * If the verify flush didn't work or we found no bad blocks, we're



* [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk.
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (18 preceding siblings ...)
  2026-03-19  4:50   ` [PATCH 19/22] xfs_scrub: clean up device-related error messages Darrick J. Wong
@ 2026-03-19  4:50   ` Darrick J. Wong
  2026-03-20  7:16     ` Christoph Hellwig
  2026-03-19  4:51   ` [PATCH 21/22] xfs_scrub: raise media verification IO limits Darrick J. Wong
  2026-03-19  4:51   ` [PATCH 22/22] xfs_scrub: allow overrides of the " Darrick J. Wong
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have a media verification ioctl in the kernel, drop the
SCSI_VERIFY code, which enables us to drop the dependency on sg and
obviates the need to fix some unit-handling bugs in the HDIO_GETGEO
code.

A subsequent patch will enable larger verification IO sizes, which this
old code cannot handle.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/disk.h            |    2 -
 doc/README-env-vars.txt |    1 
 scrub/disk.c            |  133 -----------------------------------------------
 scrub/read_verify.c     |    5 --
 scrub/xfs_scrub.c       |    1 
 5 files changed, 1 insertion(+), 141 deletions(-)


diff --git a/scrub/disk.h b/scrub/disk.h
index 73c73ab57fb5c7..a9155bf1f75eb4 100644
--- a/scrub/disk.h
+++ b/scrub/disk.h
@@ -6,7 +6,6 @@
 #ifndef XFS_SCRUB_DISK_H_
 #define XFS_SCRUB_DISK_H_
 
-#define DISK_FLAG_SCSI_VERIFY	0x1
 struct disk {
 	struct stat	d_sb;
 	int		d_fd;
@@ -15,7 +14,6 @@ struct disk {
 	unsigned int	d_flags;
 	unsigned int	d_blksize;	/* bytes */
 	uint64_t	d_size;		/* bytes */
-	uint64_t	d_start;	/* bytes */
 };
 
 unsigned int disk_heads(struct disk *disk);
diff --git a/doc/README-env-vars.txt b/doc/README-env-vars.txt
index eec59a82513f36..6862dcad71e217 100644
--- a/doc/README-env-vars.txt
+++ b/doc/README-env-vars.txt
@@ -17,7 +17,6 @@ Known debug tweaks (pass -d and set the environment variable):
 XFS_SCRUB_FORCE_ERROR        -- pretend all metadata is corrupt
 XFS_SCRUB_FORCE_REPAIR       -- repair all metadata even if it's ok
 XFS_SCRUB_NO_KERNEL          -- pretend there is no kernel ioctl
-XFS_SCRUB_NO_SCSI_VERIFY     -- disable SCSI VERIFY (if present)
 XFS_SCRUB_PHASE              -- run only this scrub phase
 XFS_SCRUB_THREADS            -- start exactly this number of threads
 
diff --git a/scrub/disk.c b/scrub/disk.c
index afce801d3f2076..ca37724249e551 100644
--- a/scrub/disk.c
+++ b/scrub/disk.c
@@ -10,8 +10,6 @@
 #include <fcntl.h>
 #include <sys/types.h>
 #include <sys/statvfs.h>
-#include <scsi/sg.h>
-#include <linux/hdreg.h>
 #include "platform_defs.h"
 #include "libfrog/util.h"
 #include "libfrog/paths.h"
@@ -78,113 +76,12 @@ disk_heads(
 	return __disk_heads(disk);
 }
 
-/*
- * Execute a SCSI VERIFY(16) to verify disk contents.
- * For devices that support this command, this can sharply reduce the
- * runtime of the data block verification phase if the storage device's
- * internal bandwidth exceeds its link bandwidth.  However, it only
- * works if we're talking to a raw SCSI device, and only if we trust the
- * firmware.
- */
-#define SENSE_BUF_LEN		64
-#define VERIFY16_CMDLEN	16
-#define VERIFY16_CMD		0x8F
-
-#ifndef SG_FLAG_Q_AT_TAIL
-# define SG_FLAG_Q_AT_TAIL	0x10
-#endif
-static int
-disk_scsi_verify(
-	struct disk		*disk,
-	uint64_t		startblock, /* lba */
-	uint64_t		blockcount) /* lba */
-{
-	struct sg_io_hdr	iohdr;
-	unsigned char		cdb[VERIFY16_CMDLEN];
-	unsigned char		sense[SENSE_BUF_LEN];
-	uint64_t		llba;
-	uint64_t		veri_len = blockcount;
-	int			error;
-
-	assert(!debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"));
-
-	llba = startblock + (disk->d_start >> BBSHIFT);
-
-	/* Borrowed from sg_verify */
-	cdb[0] = VERIFY16_CMD;
-	cdb[1] = 0; /* skip PI, DPO, and byte check. */
-	cdb[2] = (llba >> 56) & 0xff;
-	cdb[3] = (llba >> 48) & 0xff;
-	cdb[4] = (llba >> 40) & 0xff;
-	cdb[5] = (llba >> 32) & 0xff;
-	cdb[6] = (llba >> 24) & 0xff;
-	cdb[7] = (llba >> 16) & 0xff;
-	cdb[8] = (llba >> 8) & 0xff;
-	cdb[9] = llba & 0xff;
-	cdb[10] = (veri_len >> 24) & 0xff;
-	cdb[11] = (veri_len >> 16) & 0xff;
-	cdb[12] = (veri_len >> 8) & 0xff;
-	cdb[13] = veri_len & 0xff;
-	cdb[14] = 0;
-	cdb[15] = 0;
-	memset(sense, 0, SENSE_BUF_LEN);
-
-	/* v3 SG_IO */
-	memset(&iohdr, 0, sizeof(iohdr));
-	iohdr.interface_id = 'S';
-	iohdr.dxfer_direction = SG_DXFER_NONE;
-	iohdr.cmdp = cdb;
-	iohdr.cmd_len = VERIFY16_CMDLEN;
-	iohdr.sbp = sense;
-	iohdr.mx_sb_len = SENSE_BUF_LEN;
-	iohdr.flags |= SG_FLAG_Q_AT_TAIL;
-	iohdr.timeout = 30000; /* 30s */
-
-	error = ioctl(disk->d_fd, SG_IO, &iohdr);
-	if (error < 0)
-		return error;
-
-	dbg_printf("VERIFY(16) fd %d lba %"PRIu64" len %"PRIu64" info %x "
-			"status %d masked %d msg %d host %d driver %d "
-			"duration %d resid %d\n",
-			disk->d_fd, startblock, blockcount, iohdr.info,
-			iohdr.status, iohdr.masked_status, iohdr.msg_status,
-			iohdr.host_status, iohdr.driver_status, iohdr.duration,
-			iohdr.resid);
-
-	if (iohdr.info & SG_INFO_CHECK) {
-		dbg_printf("status: msg %x host %x driver %x\n",
-				iohdr.msg_status, iohdr.host_status,
-				iohdr.driver_status);
-		errno = EIO;
-		return -1;
-	}
-
-	return blockcount << BBSHIFT;
-}
-
-/* Test the availability of the kernel scrub ioctl. */
-static bool
-disk_can_scsi_verify(
-	struct disk		*disk)
-{
-	int			error;
-
-	if (debug_tweak_on("XFS_SCRUB_NO_SCSI_VERIFY"))
-		return false;
-
-	error = disk_scsi_verify(disk, 0, 1);
-	return error == 0;
-}
-
 /* Open a disk device and discover its geometry. */
 struct disk *
 disk_open(
 	const char		*pathname)
 {
-	struct hd_geometry	bdgeo;
 	struct disk		*disk;
-	bool			suspicious_disk = false;
 	int			error;
 
 	disk = calloc(1, sizeof(struct disk));
@@ -214,32 +111,11 @@ disk_open(
 		error = ioctl(disk->d_fd, BLKBSZGET, &disk->d_blksize);
 		if (error)
 			disk->d_blksize = 0;
-		error = ioctl(disk->d_fd, HDIO_GETGEO, &bdgeo);
-		if (!error) {
-			/*
-			 * dm devices will pass through ioctls, which means
-			 * we can't use SCSI VERIFY unless the start is 0.
-			 * Most dm devices don't set geometry (unlike scsi
-			 * and nvme) so use a zeroed out CHS to screen them
-			 * out.
-			 */
-			if (bdgeo.start != 0 &&
-			    (unsigned long long)bdgeo.heads * bdgeo.sectors *
-					bdgeo.sectors == 0)
-				suspicious_disk = true;
-			disk->d_start = bdgeo.start << BBSHIFT;
-		} else
-			disk->d_start = 0;
 	} else {
 		disk->d_size = disk->d_sb.st_size;
 		disk->d_blksize = disk->d_sb.st_blksize;
-		disk->d_start = 0;
 	}
 
-	/* Can we issue SCSI VERIFY? */
-	if (!suspicious_disk && disk_can_scsi_verify(disk))
-		disk->d_flags |= DISK_FLAG_SCSI_VERIFY;
-
 	return disk;
 out_close:
 	close(disk->d_fd);
@@ -262,10 +138,6 @@ disk_close(
 	return error;
 }
 
-#define BTOLBAT(d, bytes)	((uint64_t)(bytes) >> (d)->d_lbalog)
-#define LBASIZE(d)		(1ULL << (d)->d_lbalog)
-#define BTOLBA(d, bytes)	(((uint64_t)(bytes) + LBASIZE(d) - 1) >> (d)->d_lbalog)
-
 /* Read-verify an extent of a disk device. */
 ssize_t
 disk_read_verify(
@@ -274,10 +146,5 @@ disk_read_verify(
 	uint64_t		start,
 	uint64_t		length)
 {
-	/* Convert to logical block size. */
-	if (disk->d_flags & DISK_FLAG_SCSI_VERIFY)
-		return disk_scsi_verify(disk, BTOLBAT(disk, start),
-				BTOLBA(disk, length));
-
 	return pread(disk->d_fd, buf, length, start);
 }
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 4eb7484317e5ca..6789f5f668b46c 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -27,10 +27,7 @@
  * pool takes care of issuing multiple IOs to the device, if possible.
  */
 
-/*
- * Perform all IO in 32M chunks.  This cannot exceed 65536 sectors
- * because that's the biggest SCSI VERIFY(16) we dare to send.
- */
+/* Perform all verification IO in 32M chunks. */
 #define RVP_IO_MAX_SIZE		(33554432)
 
 /*
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index b74dc1635141aa..aae24ec83c1c75 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -111,7 +111,6 @@
  * XFS_SCRUB_FORCE_ERROR	-- pretend all metadata is corrupt
  * XFS_SCRUB_FORCE_REPAIR	-- repair all metadata even if it's ok
  * XFS_SCRUB_NO_KERNEL		-- pretend there is no kernel ioctl
- * XFS_SCRUB_NO_SCSI_VERIFY	-- disable SCSI VERIFY (if present)
  * XFS_SCRUB_PHASE		-- run only this scrub phase
  * XFS_SCRUB_THREADS		-- start exactly this number of threads
  * XFS_SCRUB_DISK_ERROR_INTERVAL-- simulate a disk error every this many bytes
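
With the SCSI VERIFY path removed, disk_read_verify() is reduced to a plain
pread() into a scratch buffer.  As a rough sketch of how a caller might walk a
byte range in bounded chunks, assuming nothing about the real read_verify.c
loop: verify_range() and its parameters are hypothetical names made up for
this illustration, and error reporting is omitted.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of xfs_scrub: read [start, start + length)
 * from fd in chunks of at most max_io bytes and throw the data away.
 * Returns the number of bytes that could be read before the first error,
 * EOF, or allocation failure.
 */
static uint64_t
verify_range(
	int		fd,
	uint64_t	start,
	uint64_t	length,
	size_t		max_io)
{
	uint64_t	verified = 0;
	char		*buf = malloc(max_io);

	if (!buf)
		return 0;

	while (length > 0) {
		size_t	want = length < max_io ? (size_t)length : max_io;
		ssize_t	got = pread(fd, buf, want, (off_t)start);

		/* Stop at the first error (-1) or at EOF (0). */
		if (got <= 0)
			break;

		verified += (uint64_t)got;
		start += (uint64_t)got;
		length -= (uint64_t)got;
	}

	free(buf);
	return verified;
}
```

A short return value tells the caller where the readable region ended; a real
verifier would also record errno and the failing offset.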



* [PATCH 21/22] xfs_scrub: raise media verification IO limits
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (19 preceding siblings ...)
  2026-03-19  4:50   ` [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk Darrick J. Wong
@ 2026-03-19  4:51   ` Darrick J. Wong
  2026-03-20  7:16     ` Christoph Hellwig
  2026-03-19  4:51   ` [PATCH 22/22] xfs_scrub: allow overrides of the " Darrick J. Wong
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

To avoid starving other threads of disk IO resources, the read-verify
pool limits the size of verification IOs to limit bandwidth consumption,
and it is willing to over-verify some amount of unwritten media to
reduce the number of verification IO requests sent to the device.

However, these limits were set in 2018 when areal densities were lower
and disk bandwidth was more limited.  Increase them now to reduce scan
time on the author's system by 10% in foreground and 50% in background
mode.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/read_verify.c |   31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)


diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 6789f5f668b46c..01f9134799384f 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -10,6 +10,7 @@
 #include "libfrog/workqueue.h"
 #include "libfrog/paths.h"
 #include "libfrog/bitmap.h"
+#include "libfrog/convert.h"
 #include "xfs_scrub.h"
 #include "common.h"
 #include "counter.h"
@@ -28,23 +29,36 @@
  */
 
 /* Perform all verification IO in 32M chunks. */
-#define RVP_IO_MAX_SIZE		(33554432)
+#define RVP_IO_MAX_SIZE			MEGABYTES(32)
 
 /*
- * If we're running in the background then we perform IO in 128k chunks
+ * If we're running in the background then we perform IO in 256k chunks
  * to reduce the load on the IO subsystem.
  */
-#define RVP_BACKGROUND_IO_MAX_SIZE	(131072)
+#define RVP_BG_IO_MAX_SIZE		KILOBYTES(256)
 
 /* What's the real maximum IO size? */
 static inline unsigned int
 rvp_io_max_size(void)
 {
-	return bg_mode > 0 ? RVP_BACKGROUND_IO_MAX_SIZE : RVP_IO_MAX_SIZE;
+	return bg_mode > 0 ? RVP_BG_IO_MAX_SIZE : RVP_IO_MAX_SIZE;
 }
 
-/* Tolerate 64k holes in adjacent read verify requests. */
-#define RVP_IO_BATCH_LOCALITY	(65536)
+/* Tolerate 2M holes in adjacent read verify requests. */
+#define RVP_IO_BATCH_LOCALITY		MEGABYTES(2)
+
+/*
+ * Tolerate 256k holes in adjacent read verify requests when running in the
+ * background.
+ */
+#define RVP_BG_IO_BATCH_LOCALITY	KILOBYTES(256)
+
+/* How many holes are we willing to verify to reduce IO count? */
+static inline unsigned int
+rvp_io_batch_locality(void)
+{
+	return bg_mode > 0 ? RVP_BG_IO_BATCH_LOCALITY : RVP_IO_BATCH_LOCALITY;
+}
 
 struct read_verify {
 	uint64_t		io_start;	/* bytes */
@@ -488,6 +502,7 @@ try_read_verify_schedule_io(
 {
 	uint64_t			req_end;
 	uint64_t			rv_end;
+	const unsigned int		locality = rvp_io_batch_locality();
 
 	assert(rvp->readbuf);
 
@@ -513,9 +528,9 @@ try_read_verify_schedule_io(
 	 * we can combine them.
 	 */
 	if (rs->rvp == rvp && rs->io_length > 0 &&
-	    ((start >= rs->io_start && start <= rv_end + RVP_IO_BATCH_LOCALITY) ||
+	    ((start >= rs->io_start && start <= rv_end + locality) ||
 	     (rs->io_start >= start &&
-	      rs->io_start <= req_end + RVP_IO_BATCH_LOCALITY))) {
+	      rs->io_start <= req_end + locality))) {
 		rs->io_start = min(rs->io_start, start);
 		rs->io_length = max(req_end, rv_end) - rs->io_start;
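
The merge test in try_read_verify_schedule_io() can be looked at in isolation.
The following is a minimal sketch, not the xfs_scrub code: struct pending_io,
try_merge(), and the min/max helpers are hypothetical names, and only the
adjacency condition itself is taken from the hunk above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the request that the pool is holding back. */
struct pending_io {
	uint64_t	io_start;	/* bytes */
	uint64_t	io_length;	/* bytes */
};

static inline uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static inline uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Try to fold [start, start + length) into the pending request.  Two
 * requests are combined if the gap between them is no larger than
 * @locality bytes; the hole in between then gets verified as well.
 * Returns true if the pending request was extended.
 */
static bool
try_merge(
	struct pending_io	*rs,
	uint64_t		start,
	uint64_t		length,
	uint64_t		locality)
{
	uint64_t		req_end = start + length;
	uint64_t		rv_end = rs->io_start + rs->io_length;

	if (rs->io_length == 0)
		return false;

	if ((start >= rs->io_start && start <= rv_end + locality) ||
	    (rs->io_start >= start && rs->io_start <= req_end + locality)) {
		rs->io_start = min_u64(rs->io_start, start);
		rs->io_length = max_u64(req_end, rv_end) - rs->io_start;
		return true;
	}

	return false;
}
```

For example, with a pending 1M request at offset 0 and a new 1M request at
offset 2M, the 1M hole is tolerated under the new 2M foreground locality, so
the two become a single 3M request.  Under the old 64k limit they would have
been issued separately.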
 



* [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO limits
  2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
                     ` (20 preceding siblings ...)
  2026-03-19  4:51   ` [PATCH 21/22] xfs_scrub: raise media verification IO limits Darrick J. Wong
@ 2026-03-19  4:51   ` Darrick J. Wong
  2026-03-20  7:17     ` Christoph Hellwig
  21 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-19  4:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow power users to override the media verification IO size limits via
magic environment variables.  For the background service, this can be
done via:

[Service]
Environment=XFS_SCRUB_VERIFY_MAX_SIZE=128M

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 doc/README-env-vars.txt |    3 ++
 scrub/read_verify.c     |   62 +++++++++++++++++++++++++++++++++++++++--------
 scrub/xfs_scrub.c       |    3 ++
 3 files changed, 57 insertions(+), 11 deletions(-)


diff --git a/doc/README-env-vars.txt b/doc/README-env-vars.txt
index 6862dcad71e217..03faf0882f1ae0 100644
--- a/doc/README-env-vars.txt
+++ b/doc/README-env-vars.txt
@@ -23,3 +23,6 @@ XFS_SCRUB_THREADS            -- start exactly this number of threads
 Available even in non-debug mode:
 SERVICE_MODE                 -- compress all error codes to 1 for LSB
                                 service action compliance
+XFS_SCRUB_VERIFY_MAX_SIZE    -- maximum size of media verification requests
+XFS_SCRUB_VERIFY_BATCH_LOCALITY -- coalesce sparse areas of up to this size
+                                   to batch verification requests
diff --git a/scrub/read_verify.c b/scrub/read_verify.c
index 01f9134799384f..19024ddc711dd2 100644
--- a/scrub/read_verify.c
+++ b/scrub/read_verify.c
@@ -38,10 +38,29 @@
 #define RVP_BG_IO_MAX_SIZE		KILOBYTES(256)
 
 /* What's the real maximum IO size? */
-static inline unsigned int
-rvp_io_max_size(void)
+static unsigned int
+rvp_io_max_size(
+	const struct scrub_ctx	*ctx)
 {
-	return bg_mode > 0 ? RVP_BG_IO_MAX_SIZE : RVP_IO_MAX_SIZE;
+	static unsigned int	res;
+	char			*p;
+
+	if (res)
+		return res;
+
+	p = getenv("XFS_SCRUB_VERIFY_MAX_SIZE");
+	if (p) {
+		long long	r = cvtnum(ctx->mnt.fsgeom.blocksize,
+					   ctx->mnt.fsgeom.sectsize, p);
+
+		if (r >= ctx->mnt.fsgeom.blocksize && r <= GIGABYTES(2)) {
+			res = r;
+			return res;
+		}
+	}
+
+	res = bg_mode > 0 ? RVP_BG_IO_MAX_SIZE : RVP_IO_MAX_SIZE;
+	return res;
 }
 
 /* Tolerate 2M holes in adjacent read verify requests. */
@@ -54,10 +73,29 @@ rvp_io_max_size(void)
 #define RVP_BG_IO_BATCH_LOCALITY	KILOBYTES(256)
 
 /* How many holes are we willing to verify to reduce IO count? */
-static inline unsigned int
-rvp_io_batch_locality(void)
+static unsigned int
+rvp_io_batch_locality(
+	const struct scrub_ctx	*ctx)
 {
-	return bg_mode > 0 ? RVP_BG_IO_BATCH_LOCALITY : RVP_IO_BATCH_LOCALITY;
+	static unsigned int	res;
+	char			*p;
+
+	if (res)
+		return res;
+
+	p = getenv("XFS_SCRUB_VERIFY_BATCH_LOCALITY");
+	if (p) {
+		long long	r = cvtnum(ctx->mnt.fsgeom.blocksize,
+					   ctx->mnt.fsgeom.sectsize, p);
+
+		if (r >= ctx->mnt.fsgeom.blocksize && r <= GIGABYTES(2)) {
+			res = r;
+			return res;
+		}
+	}
+
+	res = bg_mode > 0 ? RVP_BG_IO_BATCH_LOCALITY : RVP_IO_BATCH_LOCALITY;
+	return res;
 }
 
 struct read_verify {
@@ -112,17 +150,18 @@ read_verify_pool_alloc(
 	struct read_verify_pool		*rvp;
 	const unsigned int		verifier_threads =
 		read_verify_nproc(ctx);
+	const unsigned int		maxsize =
+		rvp_io_max_size(ctx);
 	int				ret;
 
-	if (rvp_io_max_size() % ctx->mnt.fsgeom.blocksize)
+	if (maxsize % ctx->mnt.fsgeom.blocksize)
 		return EINVAL;
 
 	rvp = calloc(1, sizeof(struct read_verify_pool));
 	if (!rvp)
 		return errno;
 
-	ret = posix_memalign((void **)&rvp->readbuf, page_size,
-			rvp_io_max_size());
+	ret = posix_memalign((void **)&rvp->readbuf, page_size, maxsize);
 	if (ret)
 		goto out_free;
 	ret = ptcounter_alloc(verifier_threads, &rvp->verified_bytes);
@@ -351,7 +390,7 @@ read_verify(
 	if (rvp->runtime_error)
 		return;
 
-	io_max_size = rvp_io_max_size();
+	io_max_size = rvp_io_max_size(rvp->ctx);
 
 	while (rv->io_length > 0) {
 		read_error = 0;
@@ -502,7 +541,8 @@ try_read_verify_schedule_io(
 {
 	uint64_t			req_end;
 	uint64_t			rv_end;
-	const unsigned int		locality = rvp_io_batch_locality();
+	const unsigned int		locality =
+		rvp_io_batch_locality(rvp->ctx);
 
 	assert(rvp->readbuf);
 
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index aae24ec83c1c75..0faf49b862b932 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -120,6 +120,9 @@
  * Available even in non-debug mode:
  * SERVICE_MODE			-- compress all error codes to 1 for LSB
  *				   service action compliance
+ * XFS_SCRUB_VERIFY_MAX_SIZE    -- maximum size of media verification requests
+ * XFS_SCRUB_VERIFY_BATCH_LOCALITY -- coalesce sparse areas of up to this size
+ *                                    to batch verification requests
  */
 
 /* Program name; needed for libfrog error reports. */
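
The override logic above boils down to "parse a size, accept it only if it lies
between one fs block and 2G, otherwise fall back to the built-in default".  A
self-contained sketch of that rule follows.  It is an illustration under
assumptions: parse_size() is a hypothetical replacement for libfrog's cvtnum()
that understands only the K, M, and G suffixes, clamp_override() and
verify_max_size_from_env() are made-up names, and the static caching done by
the patch is left out.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define MAX_OVERRIDE	(1LL << 31)	/* 2G upper bound, as in the patch */

/* Hypothetical parser for sizes such as "128M"; returns -1 on bad input. */
static long long
parse_size(
	const char		*s)
{
	char			*end;
	unsigned long long	v;
	unsigned int		shift = 0;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || end == s)
		return -1;

	switch (*end) {
	case 'k': case 'K': shift = 10; end++; break;
	case 'm': case 'M': shift = 20; end++; break;
	case 'g': case 'G': shift = 30; end++; break;
	case '\0': break;
	default: return -1;
	}

	/* Reject trailing junk and values that would overflow. */
	if (*end != '\0' || v > (unsigned long long)(LLONG_MAX >> shift))
		return -1;
	return (long long)(v << shift);
}

/* Use @value if it parses and is within [blocksize, 2G], else @fallback. */
static unsigned int
clamp_override(
	const char	*value,
	unsigned int	blocksize,
	unsigned int	fallback)
{
	if (value) {
		long long	r = parse_size(value);

		if (r >= (long long)blocksize && r <= MAX_OVERRIDE)
			return (unsigned int)r;
	}
	return fallback;
}

/* Look up the environment variable named in the patch. */
static unsigned int
verify_max_size_from_env(
	unsigned int	blocksize,
	unsigned int	fallback)
{
	return clamp_override(getenv("XFS_SCRUB_VERIFY_MAX_SIZE"), blocksize,
			fallback);
}
```

Out-of-range or unparseable values are ignored silently rather than reported.
Note that read_verify_pool_alloc() in the patch still returns EINVAL if the
resulting size is not a multiple of the fs block size.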



* Re: [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster
  2026-03-19  4:43   ` [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster Darrick J. Wong
@ 2026-03-20  7:11     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer
  2026-03-19  4:45   ` [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer Darrick J. Wong
@ 2026-03-20  7:12     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 02/22] mkfs: rename byte unit conversion macros
  2026-03-19  4:46   ` [PATCH 02/22] mkfs: rename byte unit conversion macros Darrick J. Wong
@ 2026-03-20  7:12     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h
  2026-03-19  4:46   ` [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h Darrick J. Wong
@ 2026-03-20  7:12     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 04/22] xfs_scrub: report truncated devices as media errors
  2026-03-19  4:46   ` [PATCH 04/22] xfs_scrub: report truncated devices as media errors Darrick J. Wong
@ 2026-03-20  7:13     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value
  2026-03-19  4:46   ` [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value Darrick J. Wong
@ 2026-03-20  7:13     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c
  2026-03-19  4:47   ` [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c Darrick J. Wong
@ 2026-03-20  7:14     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere
  2026-03-19  4:47   ` [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere Darrick J. Wong
@ 2026-03-20  7:14     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 11/22] xfs_scrub: rename nr_io_threads
  2026-03-19  4:48   ` [PATCH 11/22] xfs_scrub: rename nr_io_threads Darrick J. Wong
@ 2026-03-20  7:14     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 16/22] xfs_scrub: perform media scanning of the log region
  2026-03-19  4:49   ` [PATCH 16/22] xfs_scrub: perform media scanning of the log region Darrick J. Wong
@ 2026-03-20  7:15     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids
  2026-03-19  4:49   ` [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids Darrick J. Wong
@ 2026-03-20  7:15     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool
  2026-03-19  4:50   ` [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool Darrick J. Wong
@ 2026-03-20  7:15     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 19/22] xfs_scrub: clean up device-related error messages
  2026-03-19  4:50   ` [PATCH 19/22] xfs_scrub: clean up device-related error messages Darrick J. Wong
@ 2026-03-20  7:15     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk.
  2026-03-19  4:50   ` [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk Darrick J. Wong
@ 2026-03-20  7:16     ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Wed, Mar 18, 2026 at 09:50:44PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have a media verification ioctl in the kernel, drop the
> SCSI_VERIFY code, which enables us to drop the dependency on sg and
> obviates the need to fix some unit-handling bugs in the HDIO_GETGEO
> code.
> 
> A subsequent patch will enable larger verification IO sizes, which this
> old code cannot handle.

Good riddance!

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 21/22] xfs_scrub: raise media verification IO limits
  2026-03-19  4:51   ` [PATCH 21/22] xfs_scrub: raise media verification IO limits Darrick J. Wong
@ 2026-03-20  7:16     ` Christoph Hellwig
  2026-03-20 15:46       ` Darrick J. Wong
  0 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Wed, Mar 18, 2026 at 09:51:00PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> To avoid starving other threads of disk IO resources, the read-verify
> pool limits the size of verification IOs to limit bandwidth consumption,
> and it is willing to over-verify some amount of unwritten media to
> reduce the number of verification IO requests sent to the device.
> 
> However, these limits were set in 2018 when areal densities were lower
> and disk bandwidth was more limited.  Increase them now to reduce scan
> time on the author's system by 10% in foreground and 50% in background
> mode.

This looks good for now.  At some point we'll need to return to
fine-tune this better for different media.

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO limits
  2026-03-19  4:51   ` [PATCH 22/22] xfs_scrub: allow overrides of the " Darrick J. Wong
@ 2026-03-20  7:17     ` Christoph Hellwig
  2026-03-20 15:44       ` Darrick J. Wong
  0 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-20  7:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

On Wed, Mar 18, 2026 at 09:51:16PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Allow power users to override the media verification IO size limits via
> magic environment variables.  For the background service, this can be
> done via:
> 
> [Service]
> Environment=XFS_SCRUB_VERIFY_MAX_SIZE=128M

So you'll need to hack the systemd unit files?  How could we set this
on a per-file system basis?

Not really arguing against this, but we might end up needing more
flexibility in the end.



* Re: [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO limits
  2026-03-20  7:17     ` Christoph Hellwig
@ 2026-03-20 15:44       ` Darrick J. Wong
  2026-03-23  6:08         ` Christoph Hellwig
  0 siblings, 1 reply; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-20 15:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aalbersh, linux-xfs

On Fri, Mar 20, 2026 at 12:17:34AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 18, 2026 at 09:51:16PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Allow power users to override the media verification IO size limits via
> > magic environment variables.  For the background service, this can be
> > done via:
> > 
> > [Service]
> > Environment=XFS_SCRUB_VERIFY_MAX_SIZE=128M
> 
> So you'll need to hack the systemd unit files?  How could we set this
> on a per-file system basis?
> 
> Not really arguing against this, but we might end up needing more
> flexibility in the end.

I'd do per-fs tweaks by defining an xfs_property and telling users to
set it, e.g.

# xfs_property /home set scrub_verify_max_size=128M

I don't think we need to define the property right now, that can wait
until someone has time to do a more in depth analysis of what settings
adjustments are needed for modern hardware.  I'm keener on figuring out
something that'd work more automagically because sysadmins are lazy. :)

--D


* Re: [PATCH 21/22] xfs_scrub: raise media verification IO limits
  2026-03-20  7:16     ` Christoph Hellwig
@ 2026-03-20 15:46       ` Darrick J. Wong
  0 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-20 15:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aalbersh, linux-xfs

On Fri, Mar 20, 2026 at 12:16:54AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 18, 2026 at 09:51:00PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > To avoid starving other threads of disk IO resources, the read-verify
> > pool limits the size of verification IOs to limit bandwidth consumption,
> > and it is willing to over-verify some amount of unwritten media to
> > reduce the number of verification IO requests sent to the device.
> > 
> > However, these limits were set in 2018 when areal densities were lower
> > and disk bandwidth was more limited.  Increase them now to reduce scan
> > time on the author's system by 10% in foreground and 50% in background
> > mode.
> 
> This looks good for now.  At some point we'll need to return to
> fine-tune this better for different media.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

<nod> Thanks for reviewing and sharing patches! :)

--D


* Re: [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO limits
  2026-03-20 15:44       ` Darrick J. Wong
@ 2026-03-23  6:08         ` Christoph Hellwig
  2026-03-23 15:18           ` Darrick J. Wong
  0 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2026-03-23  6:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, aalbersh, linux-xfs

On Fri, Mar 20, 2026 at 08:44:53AM -0700, Darrick J. Wong wrote:
> > So you'll need to hack the systemd unit files?  How could we set this
> > on a per-file system basis?
> > 
> > Not really arguing against this, but we might end up needing more
> > flexibility in the end.
> 
> I'd do per-fs tweaks by defining an xfs_property and telling users to
> set it, e.g.
> 
> # xfs_property /home set scrub_verify_max_size=128M

Sounds reasonable.

> I don't think we need to define the property right now, that can wait
> until someone has time to do a more in depth analysis of what settings
> adjustments are needed for modern hardware.  I'm keener on figuring out
> something that'd work more automagically because sysadmins are lazy. :)

Agreed.  Another thing we might want is to record the scrub progress
somewhere so that we don't always start from the beginning when
interrupted.  Also not really needed now, I'd rather land the code
first.



* Re: [PATCH 22/22] xfs_scrub: allow overrides of the media verification IO limits
  2026-03-23  6:08         ` Christoph Hellwig
@ 2026-03-23 15:18           ` Darrick J. Wong
  0 siblings, 0 replies; 71+ messages in thread
From: Darrick J. Wong @ 2026-03-23 15:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: aalbersh, linux-xfs

On Sun, Mar 22, 2026 at 11:08:53PM -0700, Christoph Hellwig wrote:
> On Fri, Mar 20, 2026 at 08:44:53AM -0700, Darrick J. Wong wrote:
> > > So you'll need to hack the systemd unit files?  How could we set this
> > > on a per-file system basis?
> > > 
> > > Not really arguing against this, but we might end up needing more
> > > flexibility in the end.
> > 
> > I'd do per-fs tweaks by defining an xfs_property and telling users to
> > set it, e.g.
> > 
> > # xfs_property /home set scrub_verify_max_size=128M
> 
> Sounds reasonable.
> 
> > I don't think we need to define the property right now, that can wait
> > until someone has time to do a more in depth analysis of what settings
> > adjustments are needed for modern hardware.  I'm keener on figuring out
> > something that'd work more automagically because sysadmins are lazy. :)
> 
> Agreed.  Another thing we might want is to record the scrub progress
> somewhere so that we don't always start from the beginning when
> interrupted.  Also not really needed now, I'd rather land the code
> first.

<nod> I'll send out the most recent libxfs-7.0 resync patches shortly,
so Andrey can get started on that.

--D


end of thread, other threads:[~2026-03-23 15:18 UTC | newest]

Thread overview: 71+ messages
2026-03-19  4:37 [PATCHBOMB v10] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
2026-03-19  4:38 ` [PATCHSET v10 1/2] " Darrick J. Wong
2026-03-19  4:39   ` [PATCH 01/26] libfrog: add a function to grab the path from an open fd and a file handle Darrick J. Wong
2026-03-19  4:39   ` [PATCH 02/26] libfrog: create healthmon event log library functions Darrick J. Wong
2026-03-19  4:39   ` [PATCH 03/26] libfrog: add support code for starting systemd services programmatically Darrick J. Wong
2026-03-19  4:39   ` [PATCH 04/26] libfrog: hoist a couple of service helper functions Darrick J. Wong
2026-03-19  4:40   ` [PATCH 05/26] libfrog: add wrappers for listmount and statmount Darrick J. Wong
2026-03-19  4:40   ` [PATCH 06/26] man2: document the healthmon ioctl Darrick J. Wong
2026-03-19  4:40   ` [PATCH 07/26] man2: document the media verification ioctl Darrick J. Wong
2026-03-19  4:40   ` [PATCH 08/26] xfs_io: monitor filesystem health events Darrick J. Wong
2026-03-19  4:41   ` [PATCH 09/26] xfs_io: add a media verify command Darrick J. Wong
2026-03-19  4:41   ` [PATCH 10/26] xfs_healer: create daemon to listen for health events Darrick J. Wong
2026-03-19  4:41   ` [PATCH 11/26] xfs_healer: enable repairing filesystems Darrick J. Wong
2026-03-19  4:41   ` [PATCH 12/26] xfs_healer: use getparents to look up file names Darrick J. Wong
2026-03-19  4:42   ` [PATCH 13/26] xfs_healer: create a per-mount background monitoring service Darrick J. Wong
2026-03-19  4:42   ` [PATCH 14/26] xfs_healer: create a service to start the per-mount healer service Darrick J. Wong
2026-03-19  4:42   ` [PATCH 15/26] xfs_healer: don't start service if kernel support unavailable Darrick J. Wong
2026-03-19  4:42   ` [PATCH 16/26] xfs_healer: use the autofsck fsproperty to select mode Darrick J. Wong
2026-03-19  4:43   ` [PATCH 17/26] xfs_healer: run full scrub after lost corruption events or targeted repair failure Darrick J. Wong
2026-03-19  4:43   ` [PATCH 18/26] xfs_healer: use getmntent to find moved filesystems Darrick J. Wong
2026-03-19  4:43   ` [PATCH 19/26] xfs_healer: use statmount to find moved filesystems even faster Darrick J. Wong
2026-03-20  7:11     ` Christoph Hellwig
2026-03-19  4:43   ` [PATCH 20/26] xfs_healer: validate that repair fds point to the monitored fs Darrick J. Wong
2026-03-19  4:44   ` [PATCH 21/26] xfs_healer: add a manual page Darrick J. Wong
2026-03-19  4:44   ` [PATCH 22/26] xfs_scrub: print systemd service names Darrick J. Wong
2026-03-19  4:44   ` [PATCH 23/26] xfs_io: add listmount and statmount commands Darrick J. Wong
2026-03-19  4:45   ` [PATCH 24/26] mkfs: enable online repair if all backrefs are enabled Darrick J. Wong
2026-03-19  4:45   ` [PATCH 25/26] debian/control: listify the build dependencies Darrick J. Wong
2026-03-19  4:45   ` [PATCH 26/26] debian: enable xfs_healer on the root filesystem by default Darrick J. Wong
2026-03-19  4:38 ` [PATCHSET v10 2/2] xfs_scrub: refactor to XFS_IOC_VERIFY_MEDIA Darrick J. Wong
2026-03-19  4:45   ` [PATCH 01/22] libfrog: allow bitmap_free to handle a null bitmap pointer Darrick J. Wong
2026-03-20  7:12     ` Christoph Hellwig
2026-03-19  4:46   ` [PATCH 02/22] mkfs: rename byte unit conversion macros Darrick J. Wong
2026-03-20  7:12     ` Christoph Hellwig
2026-03-19  4:46   ` [PATCH 03/22] libfrog: lift *BYTES helpers to convert.h Darrick J. Wong
2026-03-20  7:12     ` Christoph Hellwig
2026-03-19  4:46   ` [PATCH 04/22] xfs_scrub: report truncated devices as media errors Darrick J. Wong
2026-03-20  7:13     ` Christoph Hellwig
2026-03-19  4:46   ` [PATCH 05/22] xfs_scrub: fix i18n of the decode_special_owner return value Darrick J. Wong
2026-03-20  7:13     ` Christoph Hellwig
2026-03-19  4:47   ` [PATCH 06/22] scrub: remove the unused io_disk field in struct read_verify Darrick J. Wong
2026-03-19  4:47   ` [PATCH 07/22] xfs_scrub: move read verification scheduling to phase6.c Darrick J. Wong
2026-03-20  7:14     ` Christoph Hellwig
2026-03-19  4:47   ` [PATCH 08/22] scrub: simplify the read_verify_pool_alloc interface Darrick J. Wong
2026-03-19  4:47   ` [PATCH 09/22] xfs_scrub: don't pass the io_end_arg around everywhere Darrick J. Wong
2026-03-20  7:14     ` Christoph Hellwig
2026-03-19  4:48   ` [PATCH 10/22] scrub: use enum xfs_device for read verification Darrick J. Wong
2026-03-19  4:48   ` [PATCH 11/22] xfs_scrub: rename nr_io_threads Darrick J. Wong
2026-03-20  7:14     ` Christoph Hellwig
2026-03-19  4:48   ` [PATCH 12/22] scrub: simplify verifier threads calculation Darrick J. Wong
2026-03-19  4:48   ` [PATCH 13/22] xfs_scrub: move disk media verification error injection Darrick J. Wong
2026-03-19  4:49   ` [PATCH 14/22] xfs_scrub: use the verify media ioctl during phase 6 if possible Darrick J. Wong
2026-03-19  4:49   ` [PATCH 15/22] scrub: don't allocate disk for ioctl-based media verify Darrick J. Wong
2026-03-19  4:49   ` [PATCH 16/22] xfs_scrub: perform media scanning of the log region Darrick J. Wong
2026-03-20  7:15     ` Christoph Hellwig
2026-03-19  4:49   ` [PATCH 17/22] xfs_scrub: index read-verify pools by xfs_device ids Darrick J. Wong
2026-03-20  7:15     ` Christoph Hellwig
2026-03-19  4:50   ` [PATCH 18/22] xfs_scrub: move failmap and other outputs into read_verify_pool Darrick J. Wong
2026-03-20  7:15     ` Christoph Hellwig
2026-03-19  4:50   ` [PATCH 19/22] xfs_scrub: clean up device-related error messages Darrick J. Wong
2026-03-20  7:15     ` Christoph Hellwig
2026-03-19  4:50   ` [PATCH 20/22] xfs_scrub: drop SCSI_VERIFY code from disk Darrick J. Wong
2026-03-20  7:16     ` Christoph Hellwig
2026-03-19  4:51   ` [PATCH 21/22] xfs_scrub: raise media verification IO limits Darrick J. Wong
2026-03-20  7:16     ` Christoph Hellwig
2026-03-20 15:46       ` Darrick J. Wong
2026-03-19  4:51   ` [PATCH 22/22] xfs_scrub: allow overrides of the " Darrick J. Wong
2026-03-20  7:17     ` Christoph Hellwig
2026-03-20 15:44       ` Darrick J. Wong
2026-03-23  6:08         ` Christoph Hellwig
2026-03-23 15:18           ` Darrick J. Wong