[PATCHBOMB v9] xfsprogs: autonomous self healing of filesystems

public inbox for fstests@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCHBOMB v9] xfsprogs: autonomous self healing of filesystems
@ 2026-03-10  3:38 Darrick J. Wong
  2026-03-10  3:42 ` [PATCHSET v9 1/2] fstests: test generic file IO error reporting Darrick J. Wong
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
  0 siblings, 2 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:38 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: cem, hch, linux-xfs, fstests, Zorro Lang

Hi all,

This patchset contains the userspace and QA changes (xfs_healer) needed
to put to use all the new kernel functionality to deliver live
information about filesystem health events (xfs_healthmon.c) to
userspace.

In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically.  This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing.  They are auto-started by a starter
service that uses fanotify.

When the patchsets under this cover letter are merged, online fsck for
XFS will at long last be fully feature complete.  The passive scan parts
have been done since mid-2024, this final part adds proactive repair.

These patches still need some review, please help if you can.  I will
get to the RWF_WRITETHOUGH series tomorrow.

[PATCHSET v9] xfsprogs: autonomous self healing of filesystems
  [PATCH 03/28] libfrog: add support code for starting systemd services
  [PATCH 05/28] libfrog: add wrappers for listmount and statmount
  [PATCH 13/28] xfs_healer: create a per-mount background monitoring
  [PATCH 14/28] xfs_healer: create a service to start the per-mount
  [PATCH 19/28] xfs_healer: use statmount to find moved filesystems
  [PATCH 20/28] xfs_healer: validate that repair fds point to the
  [PATCH 22/28] xfs_scrub: use the verify media ioctl during phase 6 if
  [PATCH 24/28] xfs_scrub: print systemd service names
  [PATCH 25/28] xfs_io: add listmount and statmount commands
  [PATCH 26/28] mkfs: enable online repair if all backrefs are enabled
  [PATCH 27/28] debian/control: listify the build dependencies
  [PATCH 28/28] debian: enable xfs_healer on the root filesystem by
[PATCHSET v9 1/2] fstests: test generic file IO error reporting
  [PATCH 1/1] generic: test fsnotify filesystem error reporting
[PATCHSET v9 2/2] fstests: autonomous self healing of filesystems
  [PATCH 01/14] xfs: test health monitoring code
  [PATCH 02/14] xfs: test for metadata corruption error reporting via
  [PATCH 03/14] xfs: test io error reporting via healthmon
  [PATCH 04/14] xfs: set up common code for testing xfs_healer
  [PATCH 05/14] xfs: test xfs_healer's event handling
  [PATCH 06/14] xfs: test xfs_healer can fix a filesystem
  [PATCH 07/14] xfs: test xfs_healer can report file I/O errors
  [PATCH 08/14] xfs: test xfs_healer can report file media errors
  [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns
  [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem
  [PATCH 11/14] xfs: test xfs_healer can follow mount moves
  [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem
  [PATCH 13/14] xfs: test xfs_healer background service
  [PATCH 14/14] xfs: test xfs_healer startup service

v9: reorg listmount/statmount, use it to find moved mounts, improve the
    commit messages and documentation
v8: clean up userspace for merging now that the kernel part is upstream
v7: more cleanups of the media verification ioctl, improve comments, and
    reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
    and filter bio status flags better
v5: add verify-media ioctl, collapse small helper funcs with only
    one caller
v4: drop multiple client support so we can make direct calls into
    healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status

--D

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCHSET v9 1/2] fstests: test generic file IO error reporting
  2026-03-10  3:38 [PATCHBOMB v9] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
@ 2026-03-10  3:42 ` Darrick J. Wong
  2026-03-10  3:50   ` [PATCH 1/1] generic: test fsnotify filesystem " Darrick J. Wong
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
  1 sibling, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:42 UTC (permalink / raw)
  To: zlang, djwong
  Cc: jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs, hch

Hi all,

Refactor the iomap file I/O error handling code so that failures are
reported in a generic way to fsnotify.  Then connect the XFS health
reporting to the same fsnotify, and now XFS can notify userspace of
problems.

v9: tweak fstests per userspace changes; improve attribution of the
    generic fserror test; fix some bugs pointed out by zorro

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=filesystem-error-reporting

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=filesystem-error-reporting
---
Commits in this patchset:
 * generic: test fsnotify filesystem error reporting
---
 common/rc              |   25 ++++++
 doc/group-names.txt    |    1 
 src/Makefile           |    2 
 src/fs-monitor.c       |  153 ++++++++++++++++++++++++++++++++++
 tests/generic/1838     |  214 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1838.out |   20 ++++
 6 files changed, 414 insertions(+), 1 deletion(-)
 create mode 100644 src/fs-monitor.c
 create mode 100755 tests/generic/1838
 create mode 100644 tests/generic/1838.out


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-10  3:42 ` [PATCHSET v9 1/2] fstests: test generic file IO error reporting Darrick J. Wong
@ 2026-03-10  3:50   ` Darrick J. Wong
  2026-03-10  7:07     ` Amir Goldstein
                       ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:50 UTC (permalink / raw)
  To: zlang, djwong
  Cc: jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Test the fsnotify filesystem error reporting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc              |   25 ++++++
 doc/group-names.txt    |    1 
 src/Makefile           |    2 
 src/fs-monitor.c       |  153 ++++++++++++++++++++++++++++++++++
 tests/generic/1838     |  214 ++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/1838.out |   20 ++++
 6 files changed, 414 insertions(+), 1 deletion(-)
 create mode 100644 src/fs-monitor.c
 create mode 100755 tests/generic/1838
 create mode 100644 tests/generic/1838.out


diff --git a/common/rc b/common/rc
index fd4ca9641822cf..ccb78baf5bd41a 100644
--- a/common/rc
+++ b/common/rc
@@ -3013,6 +3013,11 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
+	"healthmon")
+		test -z "$param" && param="-p"
+		testio=`$XFS_IO_PROG -c "healthmon $param" $TEST_DIR 2>&1`
+		param_checked="$param"
+		;;
 	"label")
 		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
 		;;
@@ -6149,6 +6154,26 @@ _require_max_file_range_blocks()
 	esac
 }
 
+_require_fanotify_ioerrors()
+{
+	local x
+
+	case "$FSTYP" in
+	xfs)
+		# added as a part of xfs health monitoring
+		_require_xfs_io_command healthmon
+		return 0
+		;;
+	ext4)
+		# added at the same time as err_report_sec
+		x="$(_get_fs_sysfs_attr $TEST_DEV err_report_sec)"
+		test -n "$x" && return 0
+		;;
+	esac
+
+	_notrun "$FSTYP does not support fanotify ioerrors"
+}
+
 ################################################################################
 # make sure this script returns success
 /bin/true
diff --git a/doc/group-names.txt b/doc/group-names.txt
index 10b49e50517797..158f84d36d3154 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -117,6 +117,7 @@ samefs			overlayfs when all layers are on the same fs
 scrub			filesystem metadata scrubbers
 seed			btrfs seeded filesystems
 seek			llseek functionality
+selfhealing		self healing filesystem code
 selftest		tests with fixed results, used to validate testing setup
 send			btrfs send/receive
 shrinkfs		decreasing the size of a filesystem
diff --git a/src/Makefile b/src/Makefile
index 577d816ae859b6..1c761da0ccff20 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -36,7 +36,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	fscrypt-crypt-util bulkstat_null_ocount splice-test chprojid_fail \
 	detached_mounts_propagation ext4_resize t_readdir_3 splice2pipe \
 	uuid_ioctl t_snapshot_deleted_subvolume fiemap-fault min_dio_alignment \
-	rw_hint
+	rw_hint fs-monitor
 
 EXTRA_EXECS = dmerror fill2attr fill2fs fill2fs_check scaleread.sh \
 	      btrfs_crc32c_forged_name.py popdir.pl popattr.py \
diff --git a/src/fs-monitor.c b/src/fs-monitor.c
new file mode 100644
index 00000000000000..0cf09677a3efda
--- /dev/null
+++ b/src/fs-monitor.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2021, Collabora Ltd.
+ * Copied from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/fanotify/fs-monitor.c
+ */
+
+#include <errno.h>
+#include <err.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <fcntl.h>
+#include <sys/fanotify.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#ifndef FAN_FS_ERROR
+#define FAN_FS_ERROR		0x00008000
+#define FAN_EVENT_INFO_TYPE_ERROR	5
+
+struct fanotify_event_info_error {
+	struct fanotify_event_info_header hdr;
+	__s32 error;
+	__u32 error_count;
+};
+#endif
+
+#ifndef FILEID_INO32_GEN
+#define FILEID_INO32_GEN	1
+#endif
+
+#ifndef FILEID_INVALID
+#define	FILEID_INVALID		0xff
+#endif
+
+static void print_fh(struct file_handle *fh)
+{
+	int i;
+	uint32_t *h = (uint32_t *) fh->f_handle;
+
+	printf("\tfh: ");
+	for (i = 0; i < fh->handle_bytes; i++)
+		printf("%hhx", fh->f_handle[i]);
+	printf("\n");
+
+	printf("\tdecoded fh: ");
+	if (fh->handle_type == FILEID_INO32_GEN)
+		printf("inode=%u gen=%u\n", h[0], h[1]);
+	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
+		printf("Type %d (Superblock error)\n", fh->handle_type);
+	else
+		printf("Type %d (Unknown)\n", fh->handle_type);
+
+}
+
+static void handle_notifications(char *buffer, int len)
+{
+	struct fanotify_event_metadata *event =
+		(struct fanotify_event_metadata *) buffer;
+	struct fanotify_event_info_header *info;
+	struct fanotify_event_info_error *err;
+	struct fanotify_event_info_fid *fid;
+	int off;
+
+	for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
+
+		if (event->mask != FAN_FS_ERROR) {
+			printf("unexpected FAN MARK: %llx\n",
+							(unsigned long long)event->mask);
+			goto next_event;
+		}
+
+		if (event->fd != FAN_NOFD) {
+			printf("Unexpected fd (!= FAN_NOFD)\n");
+			goto next_event;
+		}
+
+		printf("FAN_FS_ERROR (len=%d)\n", event->event_len);
+
+		for (off = sizeof(*event) ; off < event->event_len;
+		     off += info->len) {
+			info = (struct fanotify_event_info_header *)
+				((char *) event + off);
+
+			switch (info->info_type) {
+			case FAN_EVENT_INFO_TYPE_ERROR:
+				err = (struct fanotify_event_info_error *) info;
+
+				printf("\tGeneric Error Record: len=%d\n",
+				       err->hdr.len);
+				printf("\terror: %d\n", err->error);
+				printf("\terror_count: %d\n", err->error_count);
+				break;
+
+			case FAN_EVENT_INFO_TYPE_FID:
+				fid = (struct fanotify_event_info_fid *) info;
+
+				printf("\tfsid: %x%x\n",
+#if defined(__GLIBC__)
+				       fid->fsid.val[0], fid->fsid.val[1]);
+#else
+				       fid->fsid.__val[0], fid->fsid.__val[1]);
+#endif
+				print_fh((struct file_handle *) &fid->handle);
+				break;
+
+			default:
+				printf("\tUnknown info type=%d len=%d:\n",
+				       info->info_type, info->len);
+			}
+		}
+next_event:
+		printf("---\n\n");
+		fflush(stdout);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+
+	char buffer[BUFSIZ];
+
+	if (argc < 2) {
+		printf("Missing path argument\n");
+		return 1;
+	}
+
+	fd = fanotify_init(FAN_CLASS_NOTIF|FAN_REPORT_FID, O_RDONLY);
+	if (fd < 0) {
+		perror("fanotify_init");
+		errx(1, "fanotify_init");
+	}
+
+	if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
+			  FAN_FS_ERROR, AT_FDCWD, argv[1])) {
+		perror("fanotify_mark");
+		errx(1, "fanotify_mark");
+	}
+
+	printf("fanotify active\n");
+	fflush(stdout);
+
+	while (1) {
+		int n = read(fd, buffer, BUFSIZ);
+
+		if (n < 0)
+			errx(1, "read");
+
+		handle_notifications(buffer, n);
+	}
+
+	return 0;
+}
diff --git a/tests/generic/1838 b/tests/generic/1838
new file mode 100755
index 00000000000000..940811baae3a6e
--- /dev/null
+++ b/tests/generic/1838
@@ -0,0 +1,214 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1838
+#
+# Check that fsnotify can report file IO errors.
+
+. ./common/preamble
+_begin_fstest auto quick eio selfhealing
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	test -n "$fsmonitor_pid" && kill -TERM $fsmonitor_pid
+	rm -f $tmp.*
+	_dmerror_cleanup
+}
+
+# Import common functions.
+. ./common/fuzzy
+. ./common/filter
+. ./common/dmerror
+. ./common/systemd
+
+_require_scratch
+_require_dm_target error
+_require_test_program fs-monitor
+_require_xfs_io_command "fiemap"
+_require_odirect
+_require_fanotify_ioerrors
+
+# no out of place writes
+test "$FSTYP" = "xfs" && _require_no_xfs_always_cow
+
+# fsnotify only gives us a file handle, the error number, and the number of
+# times it was seen in between event deliveries.   The handle is mostly useless
+# since we have no generic way to map that to a file path.  Therefore we can
+# only coalesce all the I/O errors into one report.
+filter_fsnotify_errors() {
+	_filter_scratch | \
+		grep -E '(FAN_FS_ERROR|Generic Error Record|error: 5)' | \
+		sed -e "s/len=[0-9]*/len=XXX/g" | \
+		sort | \
+		uniq
+}
+
+_scratch_mkfs >> $seqres.full
+
+#
+# The dm-error map added by this test doesn't work on zoned devices because
+# table sizes need to be aligned to the zone size, and even for zoned on
+# conventional this test will get confused because of the internal RT device.
+#
+# That check requires a mounted file system, so do a dummy mount before setting
+# up DM.
+#
+_scratch_mount
+test $FSTYP = xfs && _require_xfs_scratch_non_zoned
+_scratch_unmount
+
+_dmerror_init
+_dmerror_mount >> $seqres.full 2>&1
+
+test $FSTYP = xfs && _xfs_force_bdev data $SCRATCH_MNT
+
+# Write a file with 4 file blocks worth of data, figure out the LBA to target
+victim=$SCRATCH_MNT/a
+file_blksz=$(_get_file_block_size $SCRATCH_MNT)
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
+
+awk_len_prog='{print $4}'
+bmap_str="$($XFS_IO_PROG -c "fiemap -v" $victim | grep "^[[:space:]]*0:")"
+echo "$bmap_str" >> $seqres.full
+
+phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
+len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
+
+fs_blksz=$(_get_block_size $SCRATCH_MNT)
+echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
+kernel_sectors_per_fs_block=$((fs_blksz / 512))
+
+# Did we get at least 4 fs blocks worth of extent?
+min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
+test "$len" -lt $min_len_sectors && \
+	_fail "could not format a long enough extent on an empty fs??"
+
+phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
+
+echo "$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
+echo "victim file:" >> $seqres.full
+od -tx1 -Ad -c $victim >> $seqres.full
+
+# Set the dmerror table so that all IO will pass through.
+_dmerror_reset_table
+
+cat >> $seqres.full << ENDL
+dmerror before:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+# All sector numbers that we feed to the kernel must be in units of 512b, but
+# they also must be aligned to the device's logical block size.
+logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
+kernel_sectors_per_device_lba=$((logical_block_size / 512))
+
+# Mark as bad one of the device LBAs in the middle of the extent.  Target the
+# second LBA of the third block of the four-block file extent that we allocated
+# earlier, but without overflowing into the fourth file block.
+bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
+bad_len=$kernel_sectors_per_device_lba
+if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
+	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
+fi
+if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
+	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
+fi
+
+# Remount to flush the page cache, start fsnotify, and make the LBA bad
+_dmerror_unmount
+_dmerror_mount
+
+$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
+fsmonitor_pid=$!
+sleep 1
+
+_dmerror_mark_range_bad $bad_sector $bad_len
+
+cat >> $seqres.full << ENDL
+dmerror after marking bad:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+_dmerror_load_error_table
+
+# See if buffered reads pick it up
+echo "Try buffered read"
+$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio reads pick it up
+echo "Try directio read"
+$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio writes pick it up
+echo "Try directio write"
+$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# See if buffered writes pick it up
+echo "Try buffered write"
+$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# Now mark the bad range good so that unmount won't fail due to IO errors.
+echo "Fix device"
+_dmerror_mark_range_good $bad_sector $bad_len
+_dmerror_load_error_table
+
+cat >> $seqres.full << ENDL
+dmerror after marking good:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+# Unmount filesystem to start fresh
+echo "Kill fsnotify"
+_dmerror_unmount
+sleep 1
+kill -TERM $fsmonitor_pid
+unset fsmonitor_pid
+echo fsnotify log >> $seqres.full
+cat $tmp.fsmonitor >> $seqres.full
+cat $tmp.fsmonitor | filter_fsnotify_errors
+
+# Start fsnotify again so that can verify that the errors don't persist after
+# we flip back to the good dm table.
+echo "Remount and restart fsnotify"
+_dmerror_mount
+$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
+fsmonitor_pid=$!
+sleep 1
+
+# See if buffered reads pick it up
+echo "Try buffered read again"
+$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio reads pick it up
+echo "Try directio read again"
+$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio writes pick it up
+echo "Try directio write again"
+$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# See if buffered writes pick it up
+echo "Try buffered write again"
+$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# Unmount fs and kill fsnotify, then wait for it to finish
+echo "Kill fsnotify again"
+_dmerror_unmount
+sleep 1
+kill -TERM $fsmonitor_pid
+unset fsmonitor_pid
+cat $tmp.fsmonitor >> $seqres.full
+cat $tmp.fsmonitor | filter_fsnotify_errors
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/1838.out b/tests/generic/1838.out
new file mode 100644
index 00000000000000..adae590fe0b2ea
--- /dev/null
+++ b/tests/generic/1838.out
@@ -0,0 +1,20 @@
+QA output created by 1838
+Try buffered read
+pread: Input/output error
+Try directio read
+pread: Input/output error
+Try directio write
+pwrite: Input/output error
+Try buffered write
+fsync: Input/output error
+Fix device
+Kill fsnotify
+	Generic Error Record: len=XXX
+	error: 5
+FAN_FS_ERROR (len=XXX)
+Remount and restart fsnotify
+Try buffered read again
+Try directio read again
+Try directio write again
+Try buffered write again
+Kill fsnotify again


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-10  3:50   ` [PATCH 1/1] generic: test fsnotify filesystem " Darrick J. Wong
@ 2026-03-10  7:07     ` Amir Goldstein
  2026-03-13 18:01     ` Zorro Lang
  2026-03-16  9:08     ` Christoph Hellwig
  2 siblings, 0 replies; 45+ messages in thread
From: Amir Goldstein @ 2026-03-10  7:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: zlang, jack, fstests, gabriel, linux-fsdevel, linux-xfs, hch

On Tue, Mar 10, 2026 at 4:50 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Test the fsnotify filesystem error reporting.
>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

Feel free to add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  common/rc              |   25 ++++++
>  doc/group-names.txt    |    1
>  src/Makefile           |    2
>  src/fs-monitor.c       |  153 ++++++++++++++++++++++++++++++++++
>  tests/generic/1838     |  214 ++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/1838.out |   20 ++++
>  6 files changed, 414 insertions(+), 1 deletion(-)
>  create mode 100644 src/fs-monitor.c
>  create mode 100755 tests/generic/1838
>  create mode 100644 tests/generic/1838.out
>
>
> diff --git a/common/rc b/common/rc
> index fd4ca9641822cf..ccb78baf5bd41a 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3013,6 +3013,11 @@ _require_xfs_io_command()
>                 echo $testio | grep -q "Inappropriate ioctl" && \
>                         _notrun "xfs_io $command support is missing"
>                 ;;
> +       "healthmon")
> +               test -z "$param" && param="-p"
> +               testio=`$XFS_IO_PROG -c "healthmon $param" $TEST_DIR 2>&1`
> +               param_checked="$param"
> +               ;;
>         "label")
>                 testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
>                 ;;
> @@ -6149,6 +6154,26 @@ _require_max_file_range_blocks()
>         esac
>  }
>
> +_require_fanotify_ioerrors()
> +{
> +       local x
> +
> +       case "$FSTYP" in
> +       xfs)
> +               # added as a part of xfs health monitoring
> +               _require_xfs_io_command healthmon
> +               return 0
> +               ;;
> +       ext4)
> +               # added at the same time as err_report_sec
> +               x="$(_get_fs_sysfs_attr $TEST_DEV err_report_sec)"
> +               test -n "$x" && return 0
> +               ;;
> +       esac
> +
> +       _notrun "$FSTYP does not support fanotify ioerrors"
> +}
> +
>  ################################################################################
>  # make sure this script returns success
>  /bin/true
> diff --git a/doc/group-names.txt b/doc/group-names.txt
> index 10b49e50517797..158f84d36d3154 100644
> --- a/doc/group-names.txt
> +++ b/doc/group-names.txt
> @@ -117,6 +117,7 @@ samefs                      overlayfs when all layers are on the same fs
>  scrub                  filesystem metadata scrubbers
>  seed                   btrfs seeded filesystems
>  seek                   llseek functionality
> +selfhealing            self healing filesystem code
>  selftest               tests with fixed results, used to validate testing setup
>  send                   btrfs send/receive
>  shrinkfs               decreasing the size of a filesystem
> diff --git a/src/Makefile b/src/Makefile
> index 577d816ae859b6..1c761da0ccff20 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -36,7 +36,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>         fscrypt-crypt-util bulkstat_null_ocount splice-test chprojid_fail \
>         detached_mounts_propagation ext4_resize t_readdir_3 splice2pipe \
>         uuid_ioctl t_snapshot_deleted_subvolume fiemap-fault min_dio_alignment \
> -       rw_hint
> +       rw_hint fs-monitor
>
>  EXTRA_EXECS = dmerror fill2attr fill2fs fill2fs_check scaleread.sh \
>               btrfs_crc32c_forged_name.py popdir.pl popattr.py \
> diff --git a/src/fs-monitor.c b/src/fs-monitor.c
> new file mode 100644
> index 00000000000000..0cf09677a3efda
> --- /dev/null
> +++ b/src/fs-monitor.c
> @@ -0,0 +1,153 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2021, Collabora Ltd.
> + * Copied from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/fanotify/fs-monitor.c
> + */
> +
> +#include <errno.h>
> +#include <err.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <sys/fanotify.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#ifndef FAN_FS_ERROR
> +#define FAN_FS_ERROR           0x00008000
> +#define FAN_EVENT_INFO_TYPE_ERROR      5
> +
> +struct fanotify_event_info_error {
> +       struct fanotify_event_info_header hdr;
> +       __s32 error;
> +       __u32 error_count;
> +};
> +#endif
> +
> +#ifndef FILEID_INO32_GEN
> +#define FILEID_INO32_GEN       1
> +#endif
> +
> +#ifndef FILEID_INVALID
> +#define        FILEID_INVALID          0xff
> +#endif
> +
> +static void print_fh(struct file_handle *fh)
> +{
> +       int i;
> +       uint32_t *h = (uint32_t *) fh->f_handle;
> +
> +       printf("\tfh: ");
> +       for (i = 0; i < fh->handle_bytes; i++)
> +               printf("%hhx", fh->f_handle[i]);
> +       printf("\n");
> +
> +       printf("\tdecoded fh: ");
> +       if (fh->handle_type == FILEID_INO32_GEN)
> +               printf("inode=%u gen=%u\n", h[0], h[1]);
> +       else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> +               printf("Type %d (Superblock error)\n", fh->handle_type);
> +       else
> +               printf("Type %d (Unknown)\n", fh->handle_type);
> +
> +}
> +
> +static void handle_notifications(char *buffer, int len)
> +{
> +       struct fanotify_event_metadata *event =
> +               (struct fanotify_event_metadata *) buffer;
> +       struct fanotify_event_info_header *info;
> +       struct fanotify_event_info_error *err;
> +       struct fanotify_event_info_fid *fid;
> +       int off;
> +
> +       for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
> +
> +               if (event->mask != FAN_FS_ERROR) {
> +                       printf("unexpected FAN MARK: %llx\n",
> +                                                       (unsigned long long)event->mask);
> +                       goto next_event;
> +               }
> +
> +               if (event->fd != FAN_NOFD) {
> +                       printf("Unexpected fd (!= FAN_NOFD)\n");
> +                       goto next_event;
> +               }
> +
> +               printf("FAN_FS_ERROR (len=%d)\n", event->event_len);
> +
> +               for (off = sizeof(*event) ; off < event->event_len;
> +                    off += info->len) {
> +                       info = (struct fanotify_event_info_header *)
> +                               ((char *) event + off);
> +
> +                       switch (info->info_type) {
> +                       case FAN_EVENT_INFO_TYPE_ERROR:
> +                               err = (struct fanotify_event_info_error *) info;
> +
> +                               printf("\tGeneric Error Record: len=%d\n",
> +                                      err->hdr.len);
> +                               printf("\terror: %d\n", err->error);
> +                               printf("\terror_count: %d\n", err->error_count);
> +                               break;
> +
> +                       case FAN_EVENT_INFO_TYPE_FID:
> +                               fid = (struct fanotify_event_info_fid *) info;
> +
> +                               printf("\tfsid: %x%x\n",
> +#if defined(__GLIBC__)
> +                                      fid->fsid.val[0], fid->fsid.val[1]);
> +#else
> +                                      fid->fsid.__val[0], fid->fsid.__val[1]);
> +#endif
> +                               print_fh((struct file_handle *) &fid->handle);
> +                               break;
> +
> +                       default:
> +                               printf("\tUnknown info type=%d len=%d:\n",
> +                                      info->info_type, info->len);
> +                       }
> +               }
> +next_event:
> +               printf("---\n\n");
> +               fflush(stdout);
> +       }
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       int fd;
> +
> +       char buffer[BUFSIZ];
> +
> +       if (argc < 2) {
> +               printf("Missing path argument\n");
> +               return 1;
> +       }
> +
> +       fd = fanotify_init(FAN_CLASS_NOTIF|FAN_REPORT_FID, O_RDONLY);
> +       if (fd < 0) {
> +               perror("fanotify_init");
> +               errx(1, "fanotify_init");
> +       }
> +
> +       if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
> +                         FAN_FS_ERROR, AT_FDCWD, argv[1])) {
> +               perror("fanotify_mark");
> +               errx(1, "fanotify_mark");
> +       }
> +
> +       printf("fanotify active\n");
> +       fflush(stdout);
> +
> +       while (1) {
> +               int n = read(fd, buffer, BUFSIZ);
> +
> +               if (n < 0)
> +                       errx(1, "read");
> +
> +               handle_notifications(buffer, n);
> +       }
> +
> +       return 0;
> +}
> diff --git a/tests/generic/1838 b/tests/generic/1838
> new file mode 100755
> index 00000000000000..940811baae3a6e
> --- /dev/null
> +++ b/tests/generic/1838
> @@ -0,0 +1,214 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1838
> +#
> +# Check that fsnotify can report file IO errors.
> +
> +. ./common/preamble
> +_begin_fstest auto quick eio selfhealing
> +
> +# Override the default cleanup function.
> +_cleanup()
> +{
> +       cd /
> +       test -n "$fsmonitor_pid" && kill -TERM $fsmonitor_pid
> +       rm -f $tmp.*
> +       _dmerror_cleanup
> +}
> +
> +# Import common functions.
> +. ./common/fuzzy
> +. ./common/filter
> +. ./common/dmerror
> +. ./common/systemd
> +
> +_require_scratch
> +_require_dm_target error
> +_require_test_program fs-monitor
> +_require_xfs_io_command "fiemap"
> +_require_odirect
> +_require_fanotify_ioerrors
> +
> +# no out of place writes
> +test "$FSTYP" = "xfs" && _require_no_xfs_always_cow
> +
> +# fsnotify only gives us a file handle, the error number, and the number of
> +# times it was seen in between event deliveries.   The handle is mostly useless
> +# since we have no generic way to map that to a file path.  Therefore we can
> +# only coalesce all the I/O errors into one report.
> +filter_fsnotify_errors() {
> +       _filter_scratch | \
> +               grep -E '(FAN_FS_ERROR|Generic Error Record|error: 5)' | \
> +               sed -e "s/len=[0-9]*/len=XXX/g" | \
> +               sort | \
> +               uniq
> +}
> +
> +_scratch_mkfs >> $seqres.full
> +
> +#
> +# The dm-error map added by this test doesn't work on zoned devices because
> +# table sizes need to be aligned to the zone size, and even for zoned on
> +# conventional this test will get confused because of the internal RT device.
> +#
> +# That check requires a mounted file system, so do a dummy mount before setting
> +# up DM.
> +#
> +_scratch_mount
> +test $FSTYP = xfs && _require_xfs_scratch_non_zoned
> +_scratch_unmount
> +
> +_dmerror_init
> +_dmerror_mount >> $seqres.full 2>&1
> +
> +test $FSTYP = xfs && _xfs_force_bdev data $SCRATCH_MNT
> +
> +# Write a file with 4 file blocks worth of data, figure out the LBA to target
> +victim=$SCRATCH_MNT/a
> +file_blksz=$(_get_file_block_size $SCRATCH_MNT)
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
> +
> +awk_len_prog='{print $4}'
> +bmap_str="$($XFS_IO_PROG -c "fiemap -v" $victim | grep "^[[:space:]]*0:")"
> +echo "$bmap_str" >> $seqres.full
> +
> +phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
> +len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
> +
> +fs_blksz=$(_get_block_size $SCRATCH_MNT)
> +echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
> +kernel_sectors_per_fs_block=$((fs_blksz / 512))
> +
> +# Did we get at least 4 fs blocks worth of extent?
> +min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
> +test "$len" -lt $min_len_sectors && \
> +       _fail "could not format a long enough extent on an empty fs??"
> +
> +phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
> +
> +echo "$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
> +echo "victim file:" >> $seqres.full
> +od -tx1 -Ad -c $victim >> $seqres.full
> +
> +# Set the dmerror table so that all IO will pass through.
> +_dmerror_reset_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror before:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# All sector numbers that we feed to the kernel must be in units of 512b, but
> +# they also must be aligned to the device's logical block size.
> +logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> +kernel_sectors_per_device_lba=$((logical_block_size / 512))
> +
> +# Mark as bad one of the device LBAs in the middle of the extent.  Target the
> +# second LBA of the third block of the four-block file extent that we allocated
> +# earlier, but without overflowing into the fourth file block.
> +bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
> +bad_len=$kernel_sectors_per_device_lba
> +if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
> +       bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
> +fi
> +if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
> +       echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
> +fi
> +
> +# Remount to flush the page cache, start fsnotify, and make the LBA bad
> +_dmerror_unmount
> +_dmerror_mount
> +
> +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> +fsmonitor_pid=$!
> +sleep 1
> +
> +_dmerror_mark_range_bad $bad_sector $bad_len
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking bad:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +_dmerror_load_error_table
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Now mark the bad range good so that unmount won't fail due to IO errors.
> +echo "Fix device"
> +_dmerror_mark_range_good $bad_sector $bad_len
> +_dmerror_load_error_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking good:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# Unmount filesystem to start fresh
> +echo "Kill fsnotify"
> +_dmerror_unmount
> +sleep 1
> +kill -TERM $fsmonitor_pid
> +unset fsmonitor_pid
> +echo fsnotify log >> $seqres.full
> +cat $tmp.fsmonitor >> $seqres.full
> +cat $tmp.fsmonitor | filter_fsnotify_errors
> +
> +# Start fsnotify again so that can verify that the errors don't persist after
> +# we flip back to the good dm table.
> +echo "Remount and restart fsnotify"
> +_dmerror_mount
> +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> +fsmonitor_pid=$!
> +sleep 1
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read again"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read again"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write again"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write again"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Unmount fs and kill fsnotify, then wait for it to finish
> +echo "Kill fsnotify again"
> +_dmerror_unmount
> +sleep 1
> +kill -TERM $fsmonitor_pid
> +unset fsmonitor_pid
> +cat $tmp.fsmonitor >> $seqres.full
> +cat $tmp.fsmonitor | filter_fsnotify_errors
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/1838.out b/tests/generic/1838.out
> new file mode 100644
> index 00000000000000..adae590fe0b2ea
> --- /dev/null
> +++ b/tests/generic/1838.out
> @@ -0,0 +1,20 @@
> +QA output created by 1838
> +Try buffered read
> +pread: Input/output error
> +Try directio read
> +pread: Input/output error
> +Try directio write
> +pwrite: Input/output error
> +Try buffered write
> +fsync: Input/output error
> +Fix device
> +Kill fsnotify
> +       Generic Error Record: len=XXX
> +       error: 5
> +FAN_FS_ERROR (len=XXX)
> +Remount and restart fsnotify
> +Try buffered read again
> +Try directio read again
> +Try directio write again
> +Try buffered write again
> +Kill fsnotify again
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-10  3:50   ` [PATCH 1/1] generic: test fsnotify filesystem " Darrick J. Wong
  2026-03-10  7:07     ` Amir Goldstein
@ 2026-03-13 18:01     ` Zorro Lang
  2026-03-13 23:27       ` Darrick J. Wong
  2026-03-16  9:08     ` Christoph Hellwig
  2 siblings, 1 reply; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 18:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs, hch

On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Test the fsnotify filesystem error reporting.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  common/rc              |   25 ++++++
>  doc/group-names.txt    |    1 
>  src/Makefile           |    2 
>  src/fs-monitor.c       |  153 ++++++++++++++++++++++++++++++++++

I'll add "src/fs-monitor" into .gitignore file when I merge this patch.
Others look good to me.

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/generic/1838     |  214 ++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/1838.out |   20 ++++
>  6 files changed, 414 insertions(+), 1 deletion(-)
>  create mode 100644 src/fs-monitor.c
>  create mode 100755 tests/generic/1838
>  create mode 100644 tests/generic/1838.out
> 
> 
> diff --git a/common/rc b/common/rc
> index fd4ca9641822cf..ccb78baf5bd41a 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3013,6 +3013,11 @@ _require_xfs_io_command()
>  		echo $testio | grep -q "Inappropriate ioctl" && \
>  			_notrun "xfs_io $command support is missing"
>  		;;
> +	"healthmon")
> +		test -z "$param" && param="-p"
> +		testio=`$XFS_IO_PROG -c "healthmon $param" $TEST_DIR 2>&1`
> +		param_checked="$param"
> +		;;
>  	"label")
>  		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
>  		;;
> @@ -6149,6 +6154,26 @@ _require_max_file_range_blocks()
>  	esac
>  }
>  
> +_require_fanotify_ioerrors()
> +{
> +	local x
> +
> +	case "$FSTYP" in
> +	xfs)
> +		# added as a part of xfs health monitoring
> +		_require_xfs_io_command healthmon
> +		return 0
> +		;;
> +	ext4)
> +		# added at the same time as err_report_sec
> +		x="$(_get_fs_sysfs_attr $TEST_DEV err_report_sec)"
> +		test -n "$x" && return 0
> +		;;
> +	esac
> +
> +	_notrun "$FSTYP does not support fanotify ioerrors"
> +}
> +
>  ################################################################################
>  # make sure this script returns success
>  /bin/true
> diff --git a/doc/group-names.txt b/doc/group-names.txt
> index 10b49e50517797..158f84d36d3154 100644
> --- a/doc/group-names.txt
> +++ b/doc/group-names.txt
> @@ -117,6 +117,7 @@ samefs			overlayfs when all layers are on the same fs
>  scrub			filesystem metadata scrubbers
>  seed			btrfs seeded filesystems
>  seek			llseek functionality
> +selfhealing		self healing filesystem code
>  selftest		tests with fixed results, used to validate testing setup
>  send			btrfs send/receive
>  shrinkfs		decreasing the size of a filesystem
> diff --git a/src/Makefile b/src/Makefile
> index 577d816ae859b6..1c761da0ccff20 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -36,7 +36,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>  	fscrypt-crypt-util bulkstat_null_ocount splice-test chprojid_fail \
>  	detached_mounts_propagation ext4_resize t_readdir_3 splice2pipe \
>  	uuid_ioctl t_snapshot_deleted_subvolume fiemap-fault min_dio_alignment \
> -	rw_hint
> +	rw_hint fs-monitor
>  
>  EXTRA_EXECS = dmerror fill2attr fill2fs fill2fs_check scaleread.sh \
>  	      btrfs_crc32c_forged_name.py popdir.pl popattr.py \
> diff --git a/src/fs-monitor.c b/src/fs-monitor.c
> new file mode 100644
> index 00000000000000..0cf09677a3efda
> --- /dev/null
> +++ b/src/fs-monitor.c
> @@ -0,0 +1,153 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2021, Collabora Ltd.
> + * Copied from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/fanotify/fs-monitor.c
> + */
> +
> +#include <errno.h>
> +#include <err.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <fcntl.h>
> +#include <sys/fanotify.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#ifndef FAN_FS_ERROR
> +#define FAN_FS_ERROR		0x00008000
> +#define FAN_EVENT_INFO_TYPE_ERROR	5
> +
> +struct fanotify_event_info_error {
> +	struct fanotify_event_info_header hdr;
> +	__s32 error;
> +	__u32 error_count;
> +};
> +#endif
> +
> +#ifndef FILEID_INO32_GEN
> +#define FILEID_INO32_GEN	1
> +#endif
> +
> +#ifndef FILEID_INVALID
> +#define	FILEID_INVALID		0xff
> +#endif
> +
> +static void print_fh(struct file_handle *fh)
> +{
> +	int i;
> +	uint32_t *h = (uint32_t *) fh->f_handle;
> +
> +	printf("\tfh: ");
> +	for (i = 0; i < fh->handle_bytes; i++)
> +		printf("%hhx", fh->f_handle[i]);
> +	printf("\n");
> +
> +	printf("\tdecoded fh: ");
> +	if (fh->handle_type == FILEID_INO32_GEN)
> +		printf("inode=%u gen=%u\n", h[0], h[1]);
> +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> +		printf("Type %d (Superblock error)\n", fh->handle_type);
> +	else
> +		printf("Type %d (Unknown)\n", fh->handle_type);
> +
> +}
> +
> +static void handle_notifications(char *buffer, int len)
> +{
> +	struct fanotify_event_metadata *event =
> +		(struct fanotify_event_metadata *) buffer;
> +	struct fanotify_event_info_header *info;
> +	struct fanotify_event_info_error *err;
> +	struct fanotify_event_info_fid *fid;
> +	int off;
> +
> +	for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
> +
> +		if (event->mask != FAN_FS_ERROR) {
> +			printf("unexpected FAN MARK: %llx\n",
> +							(unsigned long long)event->mask);
> +			goto next_event;
> +		}
> +
> +		if (event->fd != FAN_NOFD) {
> +			printf("Unexpected fd (!= FAN_NOFD)\n");
> +			goto next_event;
> +		}
> +
> +		printf("FAN_FS_ERROR (len=%d)\n", event->event_len);
> +
> +		for (off = sizeof(*event) ; off < event->event_len;
> +		     off += info->len) {
> +			info = (struct fanotify_event_info_header *)
> +				((char *) event + off);
> +
> +			switch (info->info_type) {
> +			case FAN_EVENT_INFO_TYPE_ERROR:
> +				err = (struct fanotify_event_info_error *) info;
> +
> +				printf("\tGeneric Error Record: len=%d\n",
> +				       err->hdr.len);
> +				printf("\terror: %d\n", err->error);
> +				printf("\terror_count: %d\n", err->error_count);
> +				break;
> +
> +			case FAN_EVENT_INFO_TYPE_FID:
> +				fid = (struct fanotify_event_info_fid *) info;
> +
> +				printf("\tfsid: %x%x\n",
> +#if defined(__GLIBC__)
> +				       fid->fsid.val[0], fid->fsid.val[1]);
> +#else
> +				       fid->fsid.__val[0], fid->fsid.__val[1]);
> +#endif
> +				print_fh((struct file_handle *) &fid->handle);
> +				break;
> +
> +			default:
> +				printf("\tUnknown info type=%d len=%d:\n",
> +				       info->info_type, info->len);
> +			}
> +		}
> +next_event:
> +		printf("---\n\n");
> +		fflush(stdout);
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	int fd;
> +
> +	char buffer[BUFSIZ];
> +
> +	if (argc < 2) {
> +		printf("Missing path argument\n");
> +		return 1;
> +	}
> +
> +	fd = fanotify_init(FAN_CLASS_NOTIF|FAN_REPORT_FID, O_RDONLY);
> +	if (fd < 0) {
> +		perror("fanotify_init");
> +		errx(1, "fanotify_init");
> +	}
> +
> +	if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
> +			  FAN_FS_ERROR, AT_FDCWD, argv[1])) {
> +		perror("fanotify_mark");
> +		errx(1, "fanotify_mark");
> +	}
> +
> +	printf("fanotify active\n");
> +	fflush(stdout);
> +
> +	while (1) {
> +		int n = read(fd, buffer, BUFSIZ);
> +
> +		if (n < 0)
> +			errx(1, "read");
> +
> +		handle_notifications(buffer, n);
> +	}
> +
> +	return 0;
> +}
> diff --git a/tests/generic/1838 b/tests/generic/1838
> new file mode 100755
> index 00000000000000..940811baae3a6e
> --- /dev/null
> +++ b/tests/generic/1838
> @@ -0,0 +1,214 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1838
> +#
> +# Check that fsnotify can report file IO errors.
> +
> +. ./common/preamble
> +_begin_fstest auto quick eio selfhealing
> +
> +# Override the default cleanup function.
> +_cleanup()
> +{
> +	cd /
> +	test -n "$fsmonitor_pid" && kill -TERM $fsmonitor_pid
> +	rm -f $tmp.*
> +	_dmerror_cleanup
> +}
> +
> +# Import common functions.
> +. ./common/fuzzy
> +. ./common/filter
> +. ./common/dmerror
> +. ./common/systemd
> +
> +_require_scratch
> +_require_dm_target error
> +_require_test_program fs-monitor
> +_require_xfs_io_command "fiemap"
> +_require_odirect
> +_require_fanotify_ioerrors
> +
> +# no out of place writes
> +test "$FSTYP" = "xfs" && _require_no_xfs_always_cow
> +
> +# fsnotify only gives us a file handle, the error number, and the number of
> +# times it was seen in between event deliveries.   The handle is mostly useless
> +# since we have no generic way to map that to a file path.  Therefore we can
> +# only coalesce all the I/O errors into one report.
> +filter_fsnotify_errors() {
> +	_filter_scratch | \
> +		grep -E '(FAN_FS_ERROR|Generic Error Record|error: 5)' | \
> +		sed -e "s/len=[0-9]*/len=XXX/g" | \
> +		sort | \
> +		uniq
> +}
> +
> +_scratch_mkfs >> $seqres.full
> +
> +#
> +# The dm-error map added by this test doesn't work on zoned devices because
> +# table sizes need to be aligned to the zone size, and even for zoned on
> +# conventional this test will get confused because of the internal RT device.
> +#
> +# That check requires a mounted file system, so do a dummy mount before setting
> +# up DM.
> +#
> +_scratch_mount
> +test $FSTYP = xfs && _require_xfs_scratch_non_zoned
> +_scratch_unmount
> +
> +_dmerror_init
> +_dmerror_mount >> $seqres.full 2>&1
> +
> +test $FSTYP = xfs && _xfs_force_bdev data $SCRATCH_MNT
> +
> +# Write a file with 4 file blocks worth of data, figure out the LBA to target
> +victim=$SCRATCH_MNT/a
> +file_blksz=$(_get_file_block_size $SCRATCH_MNT)
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
> +
> +awk_len_prog='{print $4}'
> +bmap_str="$($XFS_IO_PROG -c "fiemap -v" $victim | grep "^[[:space:]]*0:")"
> +echo "$bmap_str" >> $seqres.full
> +
> +phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
> +len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
> +
> +fs_blksz=$(_get_block_size $SCRATCH_MNT)
> +echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
> +kernel_sectors_per_fs_block=$((fs_blksz / 512))
> +
> +# Did we get at least 4 fs blocks worth of extent?
> +min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
> +test "$len" -lt $min_len_sectors && \
> +	_fail "could not format a long enough extent on an empty fs??"
> +
> +phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
> +
> +echo "$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
> +echo "victim file:" >> $seqres.full
> +od -tx1 -Ad -c $victim >> $seqres.full
> +
> +# Set the dmerror table so that all IO will pass through.
> +_dmerror_reset_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror before:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# All sector numbers that we feed to the kernel must be in units of 512b, but
> +# they also must be aligned to the device's logical block size.
> +logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> +kernel_sectors_per_device_lba=$((logical_block_size / 512))
> +
> +# Mark as bad one of the device LBAs in the middle of the extent.  Target the
> +# second LBA of the third block of the four-block file extent that we allocated
> +# earlier, but without overflowing into the fourth file block.
> +bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
> +bad_len=$kernel_sectors_per_device_lba
> +if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
> +	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
> +fi
> +if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
> +	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
> +fi
> +
> +# Remount to flush the page cache, start fsnotify, and make the LBA bad
> +_dmerror_unmount
> +_dmerror_mount
> +
> +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> +fsmonitor_pid=$!
> +sleep 1
> +
> +_dmerror_mark_range_bad $bad_sector $bad_len
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking bad:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +_dmerror_load_error_table
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Now mark the bad range good so that unmount won't fail due to IO errors.
> +echo "Fix device"
> +_dmerror_mark_range_good $bad_sector $bad_len
> +_dmerror_load_error_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking good:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# Unmount filesystem to start fresh
> +echo "Kill fsnotify"
> +_dmerror_unmount
> +sleep 1
> +kill -TERM $fsmonitor_pid
> +unset fsmonitor_pid
> +echo fsnotify log >> $seqres.full
> +cat $tmp.fsmonitor >> $seqres.full
> +cat $tmp.fsmonitor | filter_fsnotify_errors
> +
> +# Start fsnotify again so that can verify that the errors don't persist after
> +# we flip back to the good dm table.
> +echo "Remount and restart fsnotify"
> +_dmerror_mount
> +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> +fsmonitor_pid=$!
> +sleep 1
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read again"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read again"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write again"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write again"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Unmount fs and kill fsnotify, then wait for it to finish
> +echo "Kill fsnotify again"
> +_dmerror_unmount
> +sleep 1
> +kill -TERM $fsmonitor_pid
> +unset fsmonitor_pid
> +cat $tmp.fsmonitor >> $seqres.full
> +cat $tmp.fsmonitor | filter_fsnotify_errors
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/1838.out b/tests/generic/1838.out
> new file mode 100644
> index 00000000000000..adae590fe0b2ea
> --- /dev/null
> +++ b/tests/generic/1838.out
> @@ -0,0 +1,20 @@
> +QA output created by 1838
> +Try buffered read
> +pread: Input/output error
> +Try directio read
> +pread: Input/output error
> +Try directio write
> +pwrite: Input/output error
> +Try buffered write
> +fsync: Input/output error
> +Fix device
> +Kill fsnotify
> +	Generic Error Record: len=XXX
> +	error: 5
> +FAN_FS_ERROR (len=XXX)
> +Remount and restart fsnotify
> +Try buffered read again
> +Try directio read again
> +Try directio write again
> +Try buffered write again
> +Kill fsnotify again
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-13 18:01     ` Zorro Lang
@ 2026-03-13 23:27       ` Darrick J. Wong
  0 siblings, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-13 23:27 UTC (permalink / raw)
  To: Zorro Lang
  Cc: jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs, hch

On Sat, Mar 14, 2026 at 02:01:02AM +0800, Zorro Lang wrote:
> On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Test the fsnotify filesystem error reporting.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  common/rc              |   25 ++++++
> >  doc/group-names.txt    |    1 
> >  src/Makefile           |    2 
> >  src/fs-monitor.c       |  153 ++++++++++++++++++++++++++++++++++
> 
> I'll add "src/fs-monitor" into .gitignore file when I merge this patch.
> Others look good to me.

Oops, thanks for fixing that for me!

--D

> Reviewed-by: Zorro Lang <zlang@redhat.com>
> 
> >  tests/generic/1838     |  214 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/generic/1838.out |   20 ++++
> >  6 files changed, 414 insertions(+), 1 deletion(-)
> >  create mode 100644 src/fs-monitor.c
> >  create mode 100755 tests/generic/1838
> >  create mode 100644 tests/generic/1838.out
> > 
> > 
> > diff --git a/common/rc b/common/rc
> > index fd4ca9641822cf..ccb78baf5bd41a 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -3013,6 +3013,11 @@ _require_xfs_io_command()
> >  		echo $testio | grep -q "Inappropriate ioctl" && \
> >  			_notrun "xfs_io $command support is missing"
> >  		;;
> > +	"healthmon")
> > +		test -z "$param" && param="-p"
> > +		testio=`$XFS_IO_PROG -c "healthmon $param" $TEST_DIR 2>&1`
> > +		param_checked="$param"
> > +		;;
> >  	"label")
> >  		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
> >  		;;
> > @@ -6149,6 +6154,26 @@ _require_max_file_range_blocks()
> >  	esac
> >  }
> >  
> > +_require_fanotify_ioerrors()
> > +{
> > +	local x
> > +
> > +	case "$FSTYP" in
> > +	xfs)
> > +		# added as a part of xfs health monitoring
> > +		_require_xfs_io_command healthmon
> > +		return 0
> > +		;;
> > +	ext4)
> > +		# added at the same time as err_report_sec
> > +		x="$(_get_fs_sysfs_attr $TEST_DEV err_report_sec)"
> > +		test -n "$x" && return 0
> > +		;;
> > +	esac
> > +
> > +	_notrun "$FSTYP does not support fanotify ioerrors"
> > +}
> > +
> >  ################################################################################
> >  # make sure this script returns success
> >  /bin/true
> > diff --git a/doc/group-names.txt b/doc/group-names.txt
> > index 10b49e50517797..158f84d36d3154 100644
> > --- a/doc/group-names.txt
> > +++ b/doc/group-names.txt
> > @@ -117,6 +117,7 @@ samefs			overlayfs when all layers are on the same fs
> >  scrub			filesystem metadata scrubbers
> >  seed			btrfs seeded filesystems
> >  seek			llseek functionality
> > +selfhealing		self healing filesystem code
> >  selftest		tests with fixed results, used to validate testing setup
> >  send			btrfs send/receive
> >  shrinkfs		decreasing the size of a filesystem
> > diff --git a/src/Makefile b/src/Makefile
> > index 577d816ae859b6..1c761da0ccff20 100644
> > --- a/src/Makefile
> > +++ b/src/Makefile
> > @@ -36,7 +36,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> >  	fscrypt-crypt-util bulkstat_null_ocount splice-test chprojid_fail \
> >  	detached_mounts_propagation ext4_resize t_readdir_3 splice2pipe \
> >  	uuid_ioctl t_snapshot_deleted_subvolume fiemap-fault min_dio_alignment \
> > -	rw_hint
> > +	rw_hint fs-monitor
> >  
> >  EXTRA_EXECS = dmerror fill2attr fill2fs fill2fs_check scaleread.sh \
> >  	      btrfs_crc32c_forged_name.py popdir.pl popattr.py \
> > diff --git a/src/fs-monitor.c b/src/fs-monitor.c
> > new file mode 100644
> > index 00000000000000..0cf09677a3efda
> > --- /dev/null
> > +++ b/src/fs-monitor.c
> > @@ -0,0 +1,153 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright 2021, Collabora Ltd.
> > + * Copied from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/fanotify/fs-monitor.c
> > + */
> > +
> > +#include <errno.h>
> > +#include <err.h>
> > +#include <stdlib.h>
> > +#include <stdio.h>
> > +#include <fcntl.h>
> > +#include <sys/fanotify.h>
> > +#include <sys/types.h>
> > +#include <unistd.h>
> > +
> > +#ifndef FAN_FS_ERROR
> > +#define FAN_FS_ERROR		0x00008000
> > +#define FAN_EVENT_INFO_TYPE_ERROR	5
> > +
> > +struct fanotify_event_info_error {
> > +	struct fanotify_event_info_header hdr;
> > +	__s32 error;
> > +	__u32 error_count;
> > +};
> > +#endif
> > +
> > +#ifndef FILEID_INO32_GEN
> > +#define FILEID_INO32_GEN	1
> > +#endif
> > +
> > +#ifndef FILEID_INVALID
> > +#define	FILEID_INVALID		0xff
> > +#endif
> > +
> > +static void print_fh(struct file_handle *fh)
> > +{
> > +	int i;
> > +	uint32_t *h = (uint32_t *) fh->f_handle;
> > +
> > +	printf("\tfh: ");
> > +	for (i = 0; i < fh->handle_bytes; i++)
> > +		printf("%hhx", fh->f_handle[i]);
> > +	printf("\n");
> > +
> > +	printf("\tdecoded fh: ");
> > +	if (fh->handle_type == FILEID_INO32_GEN)
> > +		printf("inode=%u gen=%u\n", h[0], h[1]);
> > +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> > +		printf("Type %d (Superblock error)\n", fh->handle_type);
> > +	else
> > +		printf("Type %d (Unknown)\n", fh->handle_type);
> > +
> > +}
> > +
> > +static void handle_notifications(char *buffer, int len)
> > +{
> > +	struct fanotify_event_metadata *event =
> > +		(struct fanotify_event_metadata *) buffer;
> > +	struct fanotify_event_info_header *info;
> > +	struct fanotify_event_info_error *err;
> > +	struct fanotify_event_info_fid *fid;
> > +	int off;
> > +
> > +	for (; FAN_EVENT_OK(event, len); event = FAN_EVENT_NEXT(event, len)) {
> > +
> > +		if (event->mask != FAN_FS_ERROR) {
> > +			printf("unexpected FAN MARK: %llx\n",
> > +							(unsigned long long)event->mask);
> > +			goto next_event;
> > +		}
> > +
> > +		if (event->fd != FAN_NOFD) {
> > +			printf("Unexpected fd (!= FAN_NOFD)\n");
> > +			goto next_event;
> > +		}
> > +
> > +		printf("FAN_FS_ERROR (len=%d)\n", event->event_len);
> > +
> > +		for (off = sizeof(*event) ; off < event->event_len;
> > +		     off += info->len) {
> > +			info = (struct fanotify_event_info_header *)
> > +				((char *) event + off);
> > +
> > +			switch (info->info_type) {
> > +			case FAN_EVENT_INFO_TYPE_ERROR:
> > +				err = (struct fanotify_event_info_error *) info;
> > +
> > +				printf("\tGeneric Error Record: len=%d\n",
> > +				       err->hdr.len);
> > +				printf("\terror: %d\n", err->error);
> > +				printf("\terror_count: %d\n", err->error_count);
> > +				break;
> > +
> > +			case FAN_EVENT_INFO_TYPE_FID:
> > +				fid = (struct fanotify_event_info_fid *) info;
> > +
> > +				printf("\tfsid: %x%x\n",
> > +#if defined(__GLIBC__)
> > +				       fid->fsid.val[0], fid->fsid.val[1]);
> > +#else
> > +				       fid->fsid.__val[0], fid->fsid.__val[1]);
> > +#endif
> > +				print_fh((struct file_handle *) &fid->handle);
> > +				break;
> > +
> > +			default:
> > +				printf("\tUnknown info type=%d len=%d:\n",
> > +				       info->info_type, info->len);
> > +			}
> > +		}
> > +next_event:
> > +		printf("---\n\n");
> > +		fflush(stdout);
> > +	}
> > +}
> > +
> > +int main(int argc, char **argv)
> > +{
> > +	int fd;
> > +
> > +	char buffer[BUFSIZ];
> > +
> > +	if (argc < 2) {
> > +		printf("Missing path argument\n");
> > +		return 1;
> > +	}
> > +
> > +	fd = fanotify_init(FAN_CLASS_NOTIF|FAN_REPORT_FID, O_RDONLY);
> > +	if (fd < 0) {
> > +		perror("fanotify_init");
> > +		errx(1, "fanotify_init");
> > +	}
> > +
> > +	if (fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM,
> > +			  FAN_FS_ERROR, AT_FDCWD, argv[1])) {
> > +		perror("fanotify_mark");
> > +		errx(1, "fanotify_mark");
> > +	}
> > +
> > +	printf("fanotify active\n");
> > +	fflush(stdout);
> > +
> > +	while (1) {
> > +		int n = read(fd, buffer, BUFSIZ);
> > +
> > +		if (n < 0)
> > +			errx(1, "read");
> > +
> > +		handle_notifications(buffer, n);
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/tests/generic/1838 b/tests/generic/1838
> > new file mode 100755
> > index 00000000000000..940811baae3a6e
> > --- /dev/null
> > +++ b/tests/generic/1838
> > @@ -0,0 +1,214 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0-or-later
> > +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> > +#
> > +# FS QA Test No. 1838
> > +#
> > +# Check that fsnotify can report file IO errors.
> > +
> > +. ./common/preamble
> > +_begin_fstest auto quick eio selfhealing
> > +
> > +# Override the default cleanup function.
> > +_cleanup()
> > +{
> > +	cd /
> > +	test -n "$fsmonitor_pid" && kill -TERM $fsmonitor_pid
> > +	rm -f $tmp.*
> > +	_dmerror_cleanup
> > +}
> > +
> > +# Import common functions.
> > +. ./common/fuzzy
> > +. ./common/filter
> > +. ./common/dmerror
> > +. ./common/systemd
> > +
> > +_require_scratch
> > +_require_dm_target error
> > +_require_test_program fs-monitor
> > +_require_xfs_io_command "fiemap"
> > +_require_odirect
> > +_require_fanotify_ioerrors
> > +
> > +# no out of place writes
> > +test "$FSTYP" = "xfs" && _require_no_xfs_always_cow
> > +
> > +# fsnotify only gives us a file handle, the error number, and the number of
> > +# times it was seen in between event deliveries.   The handle is mostly useless
> > +# since we have no generic way to map that to a file path.  Therefore we can
> > +# only coalesce all the I/O errors into one report.
> > +filter_fsnotify_errors() {
> > +	_filter_scratch | \
> > +		grep -E '(FAN_FS_ERROR|Generic Error Record|error: 5)' | \
> > +		sed -e "s/len=[0-9]*/len=XXX/g" | \
> > +		sort | \
> > +		uniq
> > +}
> > +
> > +_scratch_mkfs >> $seqres.full
> > +
> > +#
> > +# The dm-error map added by this test doesn't work on zoned devices because
> > +# table sizes need to be aligned to the zone size, and even for zoned on
> > +# conventional this test will get confused because of the internal RT device.
> > +#
> > +# That check requires a mounted file system, so do a dummy mount before setting
> > +# up DM.
> > +#
> > +_scratch_mount
> > +test $FSTYP = xfs && _require_xfs_scratch_non_zoned
> > +_scratch_unmount
> > +
> > +_dmerror_init
> > +_dmerror_mount >> $seqres.full 2>&1
> > +
> > +test $FSTYP = xfs && _xfs_force_bdev data $SCRATCH_MNT
> > +
> > +# Write a file with 4 file blocks worth of data, figure out the LBA to target
> > +victim=$SCRATCH_MNT/a
> > +file_blksz=$(_get_file_block_size $SCRATCH_MNT)
> > +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
> > +
> > +awk_len_prog='{print $4}'
> > +bmap_str="$($XFS_IO_PROG -c "fiemap -v" $victim | grep "^[[:space:]]*0:")"
> > +echo "$bmap_str" >> $seqres.full
> > +
> > +phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
> > +len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
> > +
> > +fs_blksz=$(_get_block_size $SCRATCH_MNT)
> > +echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
> > +kernel_sectors_per_fs_block=$((fs_blksz / 512))
> > +
> > +# Did we get at least 4 fs blocks worth of extent?
> > +min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
> > +test "$len" -lt $min_len_sectors && \
> > +	_fail "could not format a long enough extent on an empty fs??"
> > +
> > +phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
> > +
> > +echo "$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
> > +echo "victim file:" >> $seqres.full
> > +od -tx1 -Ad -c $victim >> $seqres.full
> > +
> > +# Set the dmerror table so that all IO will pass through.
> > +_dmerror_reset_table
> > +
> > +cat >> $seqres.full << ENDL
> > +dmerror before:
> > +$DMERROR_TABLE
> > +$DMERROR_RTTABLE
> > +<end table>
> > +ENDL
> > +
> > +# All sector numbers that we feed to the kernel must be in units of 512b, but
> > +# they also must be aligned to the device's logical block size.
> > +logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> > +kernel_sectors_per_device_lba=$((logical_block_size / 512))
> > +
> > +# Mark as bad one of the device LBAs in the middle of the extent.  Target the
> > +# second LBA of the third block of the four-block file extent that we allocated
> > +# earlier, but without overflowing into the fourth file block.
> > +bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
> > +bad_len=$kernel_sectors_per_device_lba
> > +if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
> > +	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
> > +fi
> > +if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
> > +	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
> > +fi
> > +
> > +# Remount to flush the page cache, start fsnotify, and make the LBA bad
> > +_dmerror_unmount
> > +_dmerror_mount
> > +
> > +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> > +fsmonitor_pid=$!
> > +sleep 1
> > +
> > +_dmerror_mark_range_bad $bad_sector $bad_len
> > +
> > +cat >> $seqres.full << ENDL
> > +dmerror after marking bad:
> > +$DMERROR_TABLE
> > +$DMERROR_RTTABLE
> > +<end table>
> > +ENDL
> > +
> > +_dmerror_load_error_table
> > +
> > +# See if buffered reads pick it up
> > +echo "Try buffered read"
> > +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> > +
> > +# See if directio reads pick it up
> > +echo "Try directio read"
> > +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> > +
> > +# See if directio writes pick it up
> > +echo "Try directio write"
> > +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> > +
> > +# See if buffered writes pick it up
> > +echo "Try buffered write"
> > +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> > +
> > +# Now mark the bad range good so that unmount won't fail due to IO errors.
> > +echo "Fix device"
> > +_dmerror_mark_range_good $bad_sector $bad_len
> > +_dmerror_load_error_table
> > +
> > +cat >> $seqres.full << ENDL
> > +dmerror after marking good:
> > +$DMERROR_TABLE
> > +$DMERROR_RTTABLE
> > +<end table>
> > +ENDL
> > +
> > +# Unmount filesystem to start fresh
> > +echo "Kill fsnotify"
> > +_dmerror_unmount
> > +sleep 1
> > +kill -TERM $fsmonitor_pid
> > +unset fsmonitor_pid
> > +echo fsnotify log >> $seqres.full
> > +cat $tmp.fsmonitor >> $seqres.full
> > +cat $tmp.fsmonitor | filter_fsnotify_errors
> > +
> > +# Start fsnotify again so that can verify that the errors don't persist after
> > +# we flip back to the good dm table.
> > +echo "Remount and restart fsnotify"
> > +_dmerror_mount
> > +$here/src/fs-monitor $SCRATCH_MNT > $tmp.fsmonitor &
> > +fsmonitor_pid=$!
> > +sleep 1
> > +
> > +# See if buffered reads pick it up
> > +echo "Try buffered read again"
> > +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> > +
> > +# See if directio reads pick it up
> > +echo "Try directio read again"
> > +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> > +
> > +# See if directio writes pick it up
> > +echo "Try directio write again"
> > +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> > +
> > +# See if buffered writes pick it up
> > +echo "Try buffered write again"
> > +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> > +
> > +# Unmount fs and kill fsnotify, then wait for it to finish
> > +echo "Kill fsnotify again"
> > +_dmerror_unmount
> > +sleep 1
> > +kill -TERM $fsmonitor_pid
> > +unset fsmonitor_pid
> > +cat $tmp.fsmonitor >> $seqres.full
> > +cat $tmp.fsmonitor | filter_fsnotify_errors
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/generic/1838.out b/tests/generic/1838.out
> > new file mode 100644
> > index 00000000000000..adae590fe0b2ea
> > --- /dev/null
> > +++ b/tests/generic/1838.out
> > @@ -0,0 +1,20 @@
> > +QA output created by 1838
> > +Try buffered read
> > +pread: Input/output error
> > +Try directio read
> > +pread: Input/output error
> > +Try directio write
> > +pwrite: Input/output error
> > +Try buffered write
> > +fsync: Input/output error
> > +Fix device
> > +Kill fsnotify
> > +	Generic Error Record: len=XXX
> > +	error: 5
> > +FAN_FS_ERROR (len=XXX)
> > +Remount and restart fsnotify
> > +Try buffered read again
> > +Try directio read again
> > +Try directio write again
> > +Try buffered write again
> > +Kill fsnotify again
> > 
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-10  3:50   ` [PATCH 1/1] generic: test fsnotify filesystem " Darrick J. Wong
  2026-03-10  7:07     ` Amir Goldstein
  2026-03-13 18:01     ` Zorro Lang
@ 2026-03-16  9:08     ` Christoph Hellwig
  2026-03-16 16:21       ` Darrick J. Wong
  2 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2026-03-16  9:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: zlang, jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs,
	hch

On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Test the fsnotify filesystem error reporting.

Still would be helpful to explain what is being tested and how and
the source of the test tool.

> +#ifndef FILEID_INO32_GEN
> +#define FILEID_INO32_GEN	1
> +#endif
> +
> +#ifndef FILEID_INVALID
> +#define	FILEID_INVALID		0xff
> +#endif
> +
> +static void print_fh(struct file_handle *fh)
> +{
> +	int i;
> +	uint32_t *h = (uint32_t *) fh->f_handle;
> +
> +	printf("\tfh: ");
> +	for (i = 0; i < fh->handle_bytes; i++)
> +		printf("%hhx", fh->f_handle[i]);
> +	printf("\n");
> +
> +	printf("\tdecoded fh: ");
> +	if (fh->handle_type == FILEID_INO32_GEN)
> +		printf("inode=%u gen=%u\n", h[0], h[1]);
> +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> +		printf("Type %d (Superblock error)\n", fh->handle_type);
> +	else
> +		printf("Type %d (Unknown)\n", fh->handle_type);


Isn't this always going to print unknown for normal xfs mounts without
inode32?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-16  9:08     ` Christoph Hellwig
@ 2026-03-16 16:21       ` Darrick J. Wong
  2026-03-16 18:40         ` Zorro Lang
  0 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-16 16:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: zlang, jack, fstests, amir73il, gabriel, linux-fsdevel, linux-xfs,
	hch

On Mon, Mar 16, 2026 at 02:08:11AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Test the fsnotify filesystem error reporting.
> 
> Still would be helpful to explain what is being tested and how and
> the source of the test tool.

"Use dmerror to inject a media error for a file's storage, then run some
IO and make sure that fanotify reports the IO errors."

(I'm not sure what the process is for updating commit messages once
something's in patches-in-queue...)

> > +#ifndef FILEID_INO32_GEN
> > +#define FILEID_INO32_GEN	1
> > +#endif
> > +
> > +#ifndef FILEID_INVALID
> > +#define	FILEID_INVALID		0xff
> > +#endif
> > +
> > +static void print_fh(struct file_handle *fh)
> > +{
> > +	int i;
> > +	uint32_t *h = (uint32_t *) fh->f_handle;
> > +
> > +	printf("\tfh: ");
> > +	for (i = 0; i < fh->handle_bytes; i++)
> > +		printf("%hhx", fh->f_handle[i]);
> > +	printf("\n");
> > +
> > +	printf("\tdecoded fh: ");
> > +	if (fh->handle_type == FILEID_INO32_GEN)
> > +		printf("inode=%u gen=%u\n", h[0], h[1]);
> > +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> > +		printf("Type %d (Superblock error)\n", fh->handle_type);
> > +	else
> > +		printf("Type %d (Unknown)\n", fh->handle_type);
> 
> 
> Isn't this always going to print unknown for normal xfs mounts without
> inode32?

Yes, though generic/791 filters out the file handle and (TBH) I don't
really want people getting ideas about cracking file handles.  Probably
we should just eliminate this whole part of the function, but that would
make future forklift upgrades from the kernel harder.

--D

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-16 16:21       ` Darrick J. Wong
@ 2026-03-16 18:40         ` Zorro Lang
  2026-03-16 22:16           ` Darrick J. Wong
  0 siblings, 1 reply; 45+ messages in thread
From: Zorro Lang @ 2026-03-16 18:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, jack, fstests, amir73il, gabriel,
	linux-fsdevel, linux-xfs, hch

On Mon, Mar 16, 2026 at 09:21:47AM -0700, Darrick J. Wong wrote:
> On Mon, Mar 16, 2026 at 02:08:11AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Test the fsnotify filesystem error reporting.
> > 
> > Still would be helpful to explain what is being tested and how and
> > the source of the test tool.
> 
> "Use dmerror to inject a media error for a file's storage, then run some
> IO and make sure that fanotify reports the IO errors."
> 
> (I'm not sure what the process is for updating commit messages once
> something's in patches-in-queue...)

patches-in-queue is a scratch branch, I always re-push(-f) to this branch.
So please feel free to tell me what do you want to update in this patch
(I'm going to make the next fstests release soon:).

Thanks,
Zorro

> 
> > > +#ifndef FILEID_INO32_GEN
> > > +#define FILEID_INO32_GEN	1
> > > +#endif
> > > +
> > > +#ifndef FILEID_INVALID
> > > +#define	FILEID_INVALID		0xff
> > > +#endif
> > > +
> > > +static void print_fh(struct file_handle *fh)
> > > +{
> > > +	int i;
> > > +	uint32_t *h = (uint32_t *) fh->f_handle;
> > > +
> > > +	printf("\tfh: ");
> > > +	for (i = 0; i < fh->handle_bytes; i++)
> > > +		printf("%hhx", fh->f_handle[i]);
> > > +	printf("\n");
> > > +
> > > +	printf("\tdecoded fh: ");
> > > +	if (fh->handle_type == FILEID_INO32_GEN)
> > > +		printf("inode=%u gen=%u\n", h[0], h[1]);
> > > +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> > > +		printf("Type %d (Superblock error)\n", fh->handle_type);
> > > +	else
> > > +		printf("Type %d (Unknown)\n", fh->handle_type);
> > 
> > 
> > Isn't this always going to print unknown for normal xfs mounts without
> > inode32?
> 
> Yes, though generic/791 filters out the file handle and (TBH) I don't
> really want people getting ideas about cracking file handles.  Probably
> we should just eliminate this whole part of the function, but that would
> make future forklift upgrades from the kernel harder.
> 
> --D
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-16 18:40         ` Zorro Lang
@ 2026-03-16 22:16           ` Darrick J. Wong
  2026-03-17  3:43             ` Zorro Lang
  0 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-16 22:16 UTC (permalink / raw)
  To: Zorro Lang
  Cc: Christoph Hellwig, jack, fstests, amir73il, gabriel,
	linux-fsdevel, linux-xfs, hch

On Tue, Mar 17, 2026 at 02:40:41AM +0800, Zorro Lang wrote:
> On Mon, Mar 16, 2026 at 09:21:47AM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 16, 2026 at 02:08:11AM -0700, Christoph Hellwig wrote:
> > > On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > Test the fsnotify filesystem error reporting.
> > > 
> > > Still would be helpful to explain what is being tested and how and
> > > the source of the test tool.
> > 
> > "Use dmerror to inject a media error for a file's storage, then run some
> > IO and make sure that fanotify reports the IO errors."
> > 
> > (I'm not sure what the process is for updating commit messages once
> > something's in patches-in-queue...)
> 
> patches-in-queue is a scratch branch, I always re-push(-f) to this branch.
> So please feel free to tell me what do you want to update in this patch
> (I'm going to make the next fstests release soon:).

Just the two changes I've mentioned so far in the threads, please.

I've more fixes to make to these tests, so I'll send those patches
shortly.

--D

> Thanks,
> Zorro
> 
> > 
> > > > +#ifndef FILEID_INO32_GEN
> > > > +#define FILEID_INO32_GEN	1
> > > > +#endif
> > > > +
> > > > +#ifndef FILEID_INVALID
> > > > +#define	FILEID_INVALID		0xff
> > > > +#endif
> > > > +
> > > > +static void print_fh(struct file_handle *fh)
> > > > +{
> > > > +	int i;
> > > > +	uint32_t *h = (uint32_t *) fh->f_handle;
> > > > +
> > > > +	printf("\tfh: ");
> > > > +	for (i = 0; i < fh->handle_bytes; i++)
> > > > +		printf("%hhx", fh->f_handle[i]);
> > > > +	printf("\n");
> > > > +
> > > > +	printf("\tdecoded fh: ");
> > > > +	if (fh->handle_type == FILEID_INO32_GEN)
> > > > +		printf("inode=%u gen=%u\n", h[0], h[1]);
> > > > +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> > > > +		printf("Type %d (Superblock error)\n", fh->handle_type);
> > > > +	else
> > > > +		printf("Type %d (Unknown)\n", fh->handle_type);
> > > 
> > > 
> > > Isn't this always going to print unknown for normal xfs mounts without
> > > inode32?
> > 
> > Yes, though generic/791 filters out the file handle and (TBH) I don't
> > really want people getting ideas about cracking file handles.  Probably
> > we should just eliminate this whole part of the function, but that would
> > make future forklift upgrades from the kernel harder.
> > 
> > --D
> > 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/1] generic: test fsnotify filesystem error reporting
  2026-03-16 22:16           ` Darrick J. Wong
@ 2026-03-17  3:43             ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-17  3:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, jack, fstests, amir73il, gabriel,
	linux-fsdevel, linux-xfs, hch

On Mon, Mar 16, 2026 at 03:16:58PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 17, 2026 at 02:40:41AM +0800, Zorro Lang wrote:
> > On Mon, Mar 16, 2026 at 09:21:47AM -0700, Darrick J. Wong wrote:
> > > On Mon, Mar 16, 2026 at 02:08:11AM -0700, Christoph Hellwig wrote:
> > > > On Mon, Mar 09, 2026 at 08:50:08PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > Test the fsnotify filesystem error reporting.
> > > > 
> > > > Still would be helpful to explain what is being tested and how and
> > > > the source of the test tool.
> > > 
> > > "Use dmerror to inject a media error for a file's storage, then run some
> > > IO and make sure that fanotify reports the IO errors."
> > > 
> > > (I'm not sure what the process is for updating commit messages once
> > > something's in patches-in-queue...)
> > 
> > patches-in-queue is a scratch branch, I always re-push(-f) to this branch.
> > So please feel free to tell me what do you want to update in this patch
> > (I'm going to make the next fstests release soon:).
> 
> Just the two changes I've mentioned so far in the threads, please.

Sure, I just changed the commit log of this patch to:

  "Use dmerror to inject a media error for a file's storage, then run some
  IO and make sure that fanotify reports the IO errors."

but what's the 2nd change? Did I miss something? ...

> 
> I've more fixes to make to these tests, so I'll send those patches
> shortly.
> 
> --D
> 
> > Thanks,
> > Zorro
> > 
> > > 
> > > > > +#ifndef FILEID_INO32_GEN
> > > > > +#define FILEID_INO32_GEN	1
> > > > > +#endif
> > > > > +
> > > > > +#ifndef FILEID_INVALID
> > > > > +#define	FILEID_INVALID		0xff
> > > > > +#endif
> > > > > +
> > > > > +static void print_fh(struct file_handle *fh)
> > > > > +{
> > > > > +	int i;
> > > > > +	uint32_t *h = (uint32_t *) fh->f_handle;
> > > > > +
> > > > > +	printf("\tfh: ");
> > > > > +	for (i = 0; i < fh->handle_bytes; i++)
> > > > > +		printf("%hhx", fh->f_handle[i]);
> > > > > +	printf("\n");
> > > > > +
> > > > > +	printf("\tdecoded fh: ");
> > > > > +	if (fh->handle_type == FILEID_INO32_GEN)
> > > > > +		printf("inode=%u gen=%u\n", h[0], h[1]);
> > > > > +	else if (fh->handle_type == FILEID_INVALID && !fh->handle_bytes)
> > > > > +		printf("Type %d (Superblock error)\n", fh->handle_type);
> > > > > +	else
> > > > > +		printf("Type %d (Unknown)\n", fh->handle_type);
> > > > 
> > > > 
> > > > Isn't this always going to print unknown for normal xfs mounts without
> > > > inode32?
> > > 
> > > Yes, though generic/791 filters out the file handle and (TBH) I don't
> > > really want people getting ideas about cracking file handles.  Probably
> > > we should just eliminate this whole part of the function, but that would
> > > make future forklift upgrades from the kernel harder.

... This looks like not a certain decision/change. Is this the 2nd change you want
to do? If so, could you please provide what's the specific changes you want to make :)

Thanks,
Zorro

> > > 
> > > --D
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems
  2026-03-10  3:38 [PATCHBOMB v9] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
  2026-03-10  3:42 ` [PATCHSET v9 1/2] fstests: test generic file IO error reporting Darrick J. Wong
@ 2026-03-10  3:42 ` Darrick J. Wong
  2026-03-10  3:50   ` [PATCH 01/14] xfs: test health monitoring code Darrick J. Wong
                     ` (14 more replies)
  1 sibling, 15 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:42 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

This series adds functionality and regression tests for the automated
self healing daemon for xfs.

v9: tweak fstests per userspace changes; improve attribution of the
    generic fserror test; fix some bugs pointed out by zorro
v8: clean up userspace for merging now that the kernel part is upstream
v7: more cleanups of the media verification ioctl, improve comments, and
    reuse the bio
v6: fix pi-breaking bugs, make verify failures trigger health reports
v5: add verify-media ioctl, collapse small helper funcs with only
    one caller
v4: drop multiple client support so we can make direct calls into
    healthmon instead of chasing pointers and doing indirect calls
v3: drag out of rfc status

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

With a bit of luck, this should all go splendidly.
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: test health monitoring code
 * xfs: test for metadata corruption error reporting via healthmon
 * xfs: test io error reporting via healthmon
 * xfs: set up common code for testing xfs_healer
 * xfs: test xfs_healer's event handling
 * xfs: test xfs_healer can fix a filesystem
 * xfs: test xfs_healer can report file I/O errors
 * xfs: test xfs_healer can report file media errors
 * xfs: test xfs_healer can report filesystem shutdowns
 * xfs: test xfs_healer can initiate full filesystem repairs
 * xfs: test xfs_healer can follow mount moves
 * xfs: test xfs_healer wont repair the wrong filesystem
 * xfs: test xfs_healer background service
 * xfs: test xfs_healer startup service
---
 common/config      |   14 +++
 common/module      |   11 +++
 common/rc          |    5 +
 common/systemd     |   39 ++++++++++
 common/xfs         |   98 ++++++++++++++++++++++++
 tests/xfs/1878     |   93 +++++++++++++++++++++++
 tests/xfs/1878.out |   10 ++
 tests/xfs/1879     |   93 +++++++++++++++++++++++
 tests/xfs/1879.out |    8 ++
 tests/xfs/1882     |   44 +++++++++++
 tests/xfs/1882.out |    2 
 tests/xfs/1884     |   89 ++++++++++++++++++++++
 tests/xfs/1884.out |    2 
 tests/xfs/1885     |   59 +++++++++++++++
 tests/xfs/1885.out |    5 +
 tests/xfs/1896     |  210 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1896.out |   21 +++++
 tests/xfs/1897     |  172 +++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1897.out |    7 ++
 tests/xfs/1898     |   37 +++++++++
 tests/xfs/1898.out |    4 +
 tests/xfs/1899     |  108 +++++++++++++++++++++++++++
 tests/xfs/1899.out |    3 +
 tests/xfs/1900     |  115 ++++++++++++++++++++++++++++
 tests/xfs/1900.out |    2 
 tests/xfs/1901     |  137 ++++++++++++++++++++++++++++++++++
 tests/xfs/1901.out |    2 
 tests/xfs/1902     |  152 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1902.out |    2 
 tests/xfs/1903     |  124 +++++++++++++++++++++++++++++++
 tests/xfs/1903.out |    6 +
 tests/xfs/802      |    4 -
 32 files changed, 1676 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/1878
 create mode 100644 tests/xfs/1878.out
 create mode 100755 tests/xfs/1879
 create mode 100644 tests/xfs/1879.out
 create mode 100755 tests/xfs/1882
 create mode 100644 tests/xfs/1882.out
 create mode 100755 tests/xfs/1884
 create mode 100644 tests/xfs/1884.out
 create mode 100755 tests/xfs/1885
 create mode 100644 tests/xfs/1885.out
 create mode 100755 tests/xfs/1896
 create mode 100644 tests/xfs/1896.out
 create mode 100755 tests/xfs/1897
 create mode 100755 tests/xfs/1897.out
 create mode 100755 tests/xfs/1898
 create mode 100755 tests/xfs/1898.out
 create mode 100755 tests/xfs/1899
 create mode 100644 tests/xfs/1899.out
 create mode 100755 tests/xfs/1900
 create mode 100755 tests/xfs/1900.out
 create mode 100755 tests/xfs/1901
 create mode 100755 tests/xfs/1901.out
 create mode 100755 tests/xfs/1902
 create mode 100755 tests/xfs/1902.out
 create mode 100755 tests/xfs/1903
 create mode 100644 tests/xfs/1903.out


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/14] xfs: test health monitoring code
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
@ 2026-03-10  3:50   ` Darrick J. Wong
  2026-03-13 18:18     ` Zorro Lang
  2026-03-10  3:50   ` [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:50 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some functionality tests for the new health monitoring code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/module      |   11 ++++++++++
 tests/xfs/1885     |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1885.out |    5 ++++
 3 files changed, 75 insertions(+)
 create mode 100755 tests/xfs/1885
 create mode 100644 tests/xfs/1885.out


diff --git a/common/module b/common/module
index 697d76ba718bbc..c0529b65ad6e2b 100644
--- a/common/module
+++ b/common/module
@@ -225,3 +225,14 @@ _optional_reload_fs_module()
 		_test_loadable_fs_module "$@" 2>&1 | \
 		sed -e '/patient module removal/d'
 }
+
+_require_module_refcount()
+{
+	local refcount_file="/sys/module/$1/refcnt"
+	test -e "$refcount_file" || _notrun "cannot find $1 module refcount"
+}
+
+_module_refcount()
+{
+	cat "/sys/module/$1/refcnt"
+}
diff --git a/tests/xfs/1885 b/tests/xfs/1885
new file mode 100755
index 00000000000000..d44b29d1c57e06
--- /dev/null
+++ b/tests/xfs/1885
@@ -0,0 +1,59 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1885
+#
+# Make sure that healthmon handles module refcount correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing quick
+
+. ./common/filter
+. ./common/module
+
+_cleanup()
+{
+	test -n "$healer_pid" && kill $healer_pid &>/dev/null
+	cd /
+	rm -r -f $tmp.*
+}
+
+_require_test
+_require_xfs_io_command healthmon
+_require_module_refcount xfs
+
+# Capture mod refcount without the test fs mounted
+_test_unmount
+init_refcount="$(_module_refcount xfs)"
+
+# Capture mod refcount with the test fs mounted
+_test_mount
+nomon_mount_refcount="$(_module_refcount xfs)"
+
+# Capture mod refcount with test fs mounted and the healthmon fd open.
+# Pause the xfs_io process so that it doesn't actually respond to events.
+$XFS_IO_PROG -c 'healthmon -c -v' $TEST_DIR >> $seqres.full &
+healer_pid=$!
+sleep 0.5
+kill -STOP $healer_pid
+mon_mount_refcount="$(_module_refcount xfs)"
+
+# Capture mod refcount with only the healthmon fd open.
+_test_unmount
+mon_nomount_refcount="$(_module_refcount xfs)"
+
+# Capture mod refcount after continuing healthmon (which should exit due to the
+# unmount) and killing it.
+kill -CONT $healer_pid
+kill $healer_pid
+wait
+nomon_nomount_refcount="$(_module_refcount xfs)"
+
+_within_tolerance "mount refcount" "$nomon_mount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "mount + healthmon refcount" "$mon_mount_refcount" "$((init_refcount + 2))" 0 -v
+_within_tolerance "healthmon refcount" "$mon_nomount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "end refcount" "$nomon_nomount_refcount" "$init_refcount" 0 -v
+
+status=0
+exit
diff --git a/tests/xfs/1885.out b/tests/xfs/1885.out
new file mode 100644
index 00000000000000..f152cef0525609
--- /dev/null
+++ b/tests/xfs/1885.out
@@ -0,0 +1,5 @@
+QA output created by 1885
+mount refcount is in range
+mount + healthmon refcount is in range
+healthmon refcount is in range
+end refcount is in range


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/14] xfs: test health monitoring code
  2026-03-10  3:50   ` [PATCH 01/14] xfs: test health monitoring code Darrick J. Wong
@ 2026-03-13 18:18     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 18:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:50:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add some functionality tests for the new health monitoring code.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  common/module      |   11 ++++++++++
>  tests/xfs/1885     |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1885.out |    5 ++++
>  3 files changed, 75 insertions(+)
>  create mode 100755 tests/xfs/1885
>  create mode 100644 tests/xfs/1885.out
> 
> 
> diff --git a/common/module b/common/module
> index 697d76ba718bbc..c0529b65ad6e2b 100644
> --- a/common/module
> +++ b/common/module
> @@ -225,3 +225,14 @@ _optional_reload_fs_module()
>  		_test_loadable_fs_module "$@" 2>&1 | \
>  		sed -e '/patient module removal/d'
>  }
> +
> +_require_module_refcount()
> +{
> +	local refcount_file="/sys/module/$1/refcnt"
> +	test -e "$refcount_file" || _notrun "cannot find $1 module refcount"
> +}
> +
> +_module_refcount()
> +{
> +	cat "/sys/module/$1/refcnt"
> +}
> diff --git a/tests/xfs/1885 b/tests/xfs/1885
> new file mode 100755
> index 00000000000000..d44b29d1c57e06
> --- /dev/null
> +++ b/tests/xfs/1885
> @@ -0,0 +1,59 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1885
> +#
> +# Make sure that healthmon handles module refcount correctly.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing quick
> +
> +. ./common/filter
> +. ./common/module
> +
> +_cleanup()
> +{
> +	test -n "$healer_pid" && kill $healer_pid &>/dev/null

I'll add a "wait" at here, and ...

> +	cd /
> +	rm -r -f $tmp.*
> +}
> +
> +_require_test
> +_require_xfs_io_command healthmon
> +_require_module_refcount xfs
> +
> +# Capture mod refcount without the test fs mounted
> +_test_unmount
> +init_refcount="$(_module_refcount xfs)"
> +
> +# Capture mod refcount with the test fs mounted
> +_test_mount
> +nomon_mount_refcount="$(_module_refcount xfs)"
> +
> +# Capture mod refcount with test fs mounted and the healthmon fd open.
> +# Pause the xfs_io process so that it doesn't actually respond to events.
> +$XFS_IO_PROG -c 'healthmon -c -v' $TEST_DIR >> $seqres.full &
> +healer_pid=$!
> +sleep 0.5
> +kill -STOP $healer_pid
> +mon_mount_refcount="$(_module_refcount xfs)"
> +
> +# Capture mod refcount with only the healthmon fd open.
> +_test_unmount
> +mon_nomount_refcount="$(_module_refcount xfs)"
> +
> +# Capture mod refcount after continuing healthmon (which should exit due to the
> +# unmount) and killing it.
> +kill -CONT $healer_pid
> +kill $healer_pid
> +wait

unset healer_pid

others look good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> +nomon_nomount_refcount="$(_module_refcount xfs)"
> +
> +_within_tolerance "mount refcount" "$nomon_mount_refcount" "$((init_refcount + 1))" 0 -v
> +_within_tolerance "mount + healthmon refcount" "$mon_mount_refcount" "$((init_refcount + 2))" 0 -v
> +_within_tolerance "healthmon refcount" "$mon_nomount_refcount" "$((init_refcount + 1))" 0 -v
> +_within_tolerance "end refcount" "$nomon_nomount_refcount" "$init_refcount" 0 -v
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1885.out b/tests/xfs/1885.out
> new file mode 100644
> index 00000000000000..f152cef0525609
> --- /dev/null
> +++ b/tests/xfs/1885.out
> @@ -0,0 +1,5 @@
> +QA output created by 1885
> +mount refcount is in range
> +mount + healthmon refcount is in range
> +healthmon refcount is in range
> +end refcount is in range
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
  2026-03-10  3:50   ` [PATCH 01/14] xfs: test health monitoring code Darrick J. Wong
@ 2026-03-10  3:50   ` Darrick J. Wong
  2026-03-13 18:35     ` Zorro Lang
  2026-03-10  3:50   ` [PATCH 03/14] xfs: test io " Darrick J. Wong
                     ` (12 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:50 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check if we can detect runtime metadata corruptions via the health
monitor.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1879     |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1879.out |    8 ++++
 2 files changed, 101 insertions(+)
 create mode 100755 tests/xfs/1879
 create mode 100644 tests/xfs/1879.out


diff --git a/tests/xfs/1879 b/tests/xfs/1879
new file mode 100755
index 00000000000000..75bc8e3b5f4316
--- /dev/null
+++ b/tests/xfs/1879
@@ -0,0 +1,93 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1879
+#
+# Corrupt some metadata and try to access it with the health monitoring program
+# running.  Check that healthmon observes a metadata error.
+#
+. ./common/preamble
+_begin_fstest auto quick eio selfhealing
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $testdir
+}
+
+. ./common/filter
+
+_require_scratch_nocheck
+_require_scratch_xfs_crc # can't detect minor corruption w/o crc
+_require_xfs_io_command healthmon
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs -d agcount=1 | _filter_mkfs 2> $tmp.mkfs >> $seqres.full
+. $tmp.mkfs
+_scratch_mount
+mkdir $SCRATCH_MNT/a/
+# Enough entries to get to a single block directory
+for ((i = 0; i < ( (isize + 255) / 256); i++)); do
+	path="$(printf "%s/a/%0255d" "$SCRATCH_MNT" "$i")"
+	touch "$path"
+done
+inum="$(stat -c %i "$SCRATCH_MNT/a")"
+_scratch_unmount
+
+# Fuzz the directory block so that the touch below will be guaranteed to trip
+# a runtime sickness report in exactly the manner we desire.
+_scratch_xfs_db -x -c "inode $inum" -c "dblock 0" -c 'fuzz bhdr.hdr.owner add' -c print &>> $seqres.full
+
+# Try to allocate space to trigger a metadata corruption event
+echo "Runtime corruption detection"
+_scratch_mount
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+sleep 1	# wait for program to start up
+touch $SCRATCH_MNT/a/farts &>> $seqres.full
+_scratch_unmount
+
+wait	# for healthmon to finish
+
+# Did we get errors?
+check_healthmon()
+{
+	cat $tmp.healthmon >> $seqres.full
+	_filter_scratch < $tmp.healthmon | \
+		grep -E '(sick|corrupt)' | \
+		sed -e 's|SCRATCH_MNT/a|VICTIM|g' \
+		    -e 's|SCRATCH_MNT ino [0-9]* gen 0x[0-9a-f]*|VICTIM|g' | \
+		sort | \
+		uniq
+}
+check_healthmon
+
+# Run scrub to trigger a health event from there too.
+echo "Scrub corruption detection"
+_scratch_mount
+if _supports_xfs_scrub $SCRATCH_MNT $SCRATCH_DEV; then
+	$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+	sleep 1	# wait for program to start up
+	$XFS_SCRUB_PROG -n $SCRATCH_MNT &>> $seqres.full
+	_scratch_unmount
+
+	wait	# for healthmon to finish
+
+	# Did we get errors?
+	check_healthmon
+else
+	# mock the output since we don't support scrub
+	_scratch_unmount
+	cat << ENDL
+VICTIM directory: corrupt
+VICTIM directory: sick
+VICTIM parent: corrupt
+ENDL
+fi
+
+status=0
+exit
diff --git a/tests/xfs/1879.out b/tests/xfs/1879.out
new file mode 100644
index 00000000000000..2f6acbe1c4fb22
--- /dev/null
+++ b/tests/xfs/1879.out
@@ -0,0 +1,8 @@
+QA output created by 1879
+Format and mount
+Runtime corruption detection
+VICTIM directory: sick
+Scrub corruption detection
+VICTIM directory: corrupt
+VICTIM directory: sick
+VICTIM parent: corrupt


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon
  2026-03-10  3:50   ` [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
@ 2026-03-13 18:35     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 18:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:50:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Check if we can detect runtime metadata corruptions via the health
> monitor.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  tests/xfs/1879     |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1879.out |    8 ++++
>  2 files changed, 101 insertions(+)
>  create mode 100755 tests/xfs/1879
>  create mode 100644 tests/xfs/1879.out
> 
> 
> diff --git a/tests/xfs/1879 b/tests/xfs/1879
> new file mode 100755
> index 00000000000000..75bc8e3b5f4316
> --- /dev/null
> +++ b/tests/xfs/1879
> @@ -0,0 +1,93 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1879
> +#
> +# Corrupt some metadata and try to access it with the health monitoring program
> +# running.  Check that healthmon observes a metadata error.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick eio selfhealing
> +
> +_cleanup()
> +{
> +	cd /

test -n "$healer_pid" && kill $healer_pid &>/dev/null
wait

> +	rm -rf $tmp.* $testdir
> +}
> +
> +. ./common/filter
> +
> +_require_scratch_nocheck
> +_require_scratch_xfs_crc # can't detect minor corruption w/o crc
> +_require_xfs_io_command healthmon
> +
> +# Disable the scratch rt device to avoid test failures relating to the rt
> +# bitmap consuming all the free space in our small data device.
> +unset SCRATCH_RTDEV
> +
> +echo "Format and mount"
> +_scratch_mkfs -d agcount=1 | _filter_mkfs 2> $tmp.mkfs >> $seqres.full
> +. $tmp.mkfs
> +_scratch_mount
> +mkdir $SCRATCH_MNT/a/
> +# Enough entries to get to a single block directory
> +for ((i = 0; i < ( (isize + 255) / 256); i++)); do
> +	path="$(printf "%s/a/%0255d" "$SCRATCH_MNT" "$i")"
> +	touch "$path"
> +done
> +inum="$(stat -c %i "$SCRATCH_MNT/a")"
> +_scratch_unmount
> +
> +# Fuzz the directory block so that the touch below will be guaranteed to trip
> +# a runtime sickness report in exactly the manner we desire.
> +_scratch_xfs_db -x -c "inode $inum" -c "dblock 0" -c 'fuzz bhdr.hdr.owner add' -c print &>> $seqres.full
> +
> +# Try to allocate space to trigger a metadata corruption event
> +echo "Runtime corruption detection"
> +_scratch_mount
> +$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &

healer_pid=$!

> +sleep 1	# wait for program to start up
> +touch $SCRATCH_MNT/a/farts &>> $seqres.full
> +_scratch_unmount
> +
> +wait	# for healthmon to finish

unset healer_pid

Others look good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> +
> +# Did we get errors?
> +check_healthmon()
> +{
> +	cat $tmp.healthmon >> $seqres.full
> +	_filter_scratch < $tmp.healthmon | \
> +		grep -E '(sick|corrupt)' | \
> +		sed -e 's|SCRATCH_MNT/a|VICTIM|g' \
> +		    -e 's|SCRATCH_MNT ino [0-9]* gen 0x[0-9a-f]*|VICTIM|g' | \
> +		sort | \
> +		uniq
> +}
> +check_healthmon
> +
> +# Run scrub to trigger a health event from there too.
> +echo "Scrub corruption detection"
> +_scratch_mount
> +if _supports_xfs_scrub $SCRATCH_MNT $SCRATCH_DEV; then
> +	$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
> +	sleep 1	# wait for program to start up
> +	$XFS_SCRUB_PROG -n $SCRATCH_MNT &>> $seqres.full
> +	_scratch_unmount
> +
> +	wait	# for healthmon to finish
> +
> +	# Did we get errors?
> +	check_healthmon
> +else
> +	# mock the output since we don't support scrub
> +	_scratch_unmount
> +	cat << ENDL
> +VICTIM directory: corrupt
> +VICTIM directory: sick
> +VICTIM parent: corrupt
> +ENDL
> +fi
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1879.out b/tests/xfs/1879.out
> new file mode 100644
> index 00000000000000..2f6acbe1c4fb22
> --- /dev/null
> +++ b/tests/xfs/1879.out
> @@ -0,0 +1,8 @@
> +QA output created by 1879
> +Format and mount
> +Runtime corruption detection
> +VICTIM directory: sick
> +Scrub corruption detection
> +VICTIM directory: corrupt
> +VICTIM directory: sick
> +VICTIM parent: corrupt
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 03/14] xfs: test io error reporting via healthmon
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
  2026-03-10  3:50   ` [PATCH 01/14] xfs: test health monitoring code Darrick J. Wong
  2026-03-10  3:50   ` [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
@ 2026-03-10  3:50   ` Darrick J. Wong
  2026-03-13 18:53     ` Zorro Lang
  2026-03-10  3:51   ` [PATCH 04/14] xfs: set up common code for testing xfs_healer Darrick J. Wong
                     ` (11 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:50 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new test to make sure the kernel can report IO errors via
health monitoring.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1878     |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1878.out |   10 ++++++
 2 files changed, 103 insertions(+)
 create mode 100755 tests/xfs/1878
 create mode 100644 tests/xfs/1878.out


diff --git a/tests/xfs/1878 b/tests/xfs/1878
new file mode 100755
index 00000000000000..1ff6ae040fb193
--- /dev/null
+++ b/tests/xfs/1878
@@ -0,0 +1,93 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1878
+#
+# Attempt to read and write a file in buffered and directio mode with the
+# health monitoring program running.  Check that healthmon observes all four
+# types of IO errors.
+#
+. ./common/preamble
+_begin_fstest auto quick eio selfhealing
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $testdir
+	_dmerror_cleanup
+}
+
+. ./common/filter
+. ./common/dmerror
+
+_require_scratch_nocheck
+_require_xfs_io_command healthmon
+_require_dm_target error
+
+filter_healer_errors() {
+	_filter_scratch | \
+		grep -E '(buffered|directio)' | \
+		sed \
+		    -e 's/ino [0-9]*/ino NUM/g' \
+		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
+		    -e 's/pos [0-9]*/pos NUM/g' \
+		    -e 's/len [0-9]*/len NUM/g' \
+		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
+		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g' | \
+		uniq
+}
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_dmerror_init no_log
+_dmerror_mount
+
+_require_fs_space $SCRATCH_MNT 65536
+
+# Create a file with written regions far enough apart that the pagecache can't
+# possibly be caching the regions with a single folio.
+testfile=$SCRATCH_MNT/fsync-err-test
+$XFS_IO_PROG -f \
+	-c 'pwrite -b 1m 0 1m' \
+	-c 'pwrite -b 1m 10g 1m' \
+	-c 'pwrite -b 1m 20g 1m' \
+	-c fsync $testfile >> $seqres.full
+
+# First we check if directio errors get reported
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for program to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -d -c 'pwrite -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -d -c 'pread -b 256k 10g 16k' $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+_dmerror_mount
+
+# Next we check if buffered io errors get reported.  We have to write something
+# before loading the error table to ensure the dquots get loaded.
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 1k' -c fsync $testfile >> $seqres.full
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for program to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -c 'pread -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 16k' -c fsync $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+
+# Did we get errors?
+cat $tmp.healthmon >> $seqres.full
+filter_healer_errors < $tmp.healthmon
+
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/xfs/1878.out b/tests/xfs/1878.out
new file mode 100644
index 00000000000000..f64c440b1a6ed1
--- /dev/null
+++ b/tests/xfs/1878.out
@@ -0,0 +1,10 @@
+QA output created by 1878
+Format and mount
+pwrite: Input/output error
+pread: Input/output error
+pread: Input/output error
+fsync: Input/output error
+VICTIM pos NUM len NUM: directio_write: Input/output error
+VICTIM pos NUM len NUM: directio_read: Input/output error
+VICTIM pos NUM len NUM: buffered_read: Input/output error
+VICTIM pos NUM len NUM: buffered_write: Input/output error


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] xfs: test io error reporting via healthmon
  2026-03-10  3:50   ` [PATCH 03/14] xfs: test io " Darrick J. Wong
@ 2026-03-13 18:53     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 18:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:50:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a new test to make sure the kernel can report IO errors via
> health monitoring.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  tests/xfs/1878     |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1878.out |   10 ++++++
>  2 files changed, 103 insertions(+)
>  create mode 100755 tests/xfs/1878
>  create mode 100644 tests/xfs/1878.out
> 
> 
> diff --git a/tests/xfs/1878 b/tests/xfs/1878
> new file mode 100755
> index 00000000000000..1ff6ae040fb193
> --- /dev/null
> +++ b/tests/xfs/1878
> @@ -0,0 +1,93 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1878
> +#
> +# Attempt to read and write a file in buffered and directio mode with the
> +# health monitoring program running.  Check that healthmon observes all four
> +# types of IO errors.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick eio selfhealing
> +
> +_cleanup()
> +{
> +	cd /

test -n "$healer_pid" && kill $healer_pid &>/dev/null
wait

> +	rm -rf $tmp.* $testdir
> +	_dmerror_cleanup
> +}
> +
> +. ./common/filter
> +. ./common/dmerror
> +
> +_require_scratch_nocheck
> +_require_xfs_io_command healthmon
> +_require_dm_target error
> +
> +filter_healer_errors() {
> +	_filter_scratch | \
> +		grep -E '(buffered|directio)' | \
> +		sed \
> +		    -e 's/ino [0-9]*/ino NUM/g' \
> +		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
> +		    -e 's/pos [0-9]*/pos NUM/g' \
> +		    -e 's/len [0-9]*/len NUM/g' \
> +		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
> +		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g' | \
> +		uniq
> +}
> +
> +# Disable the scratch rt device to avoid test failures relating to the rt
> +# bitmap consuming all the free space in our small data device.
> +unset SCRATCH_RTDEV
> +
> +echo "Format and mount"
> +_scratch_mkfs > $seqres.full 2>&1
> +_dmerror_init no_log
> +_dmerror_mount
> +
> +_require_fs_space $SCRATCH_MNT 65536
> +
> +# Create a file with written regions far enough apart that the pagecache can't
> +# possibly be caching the regions with a single folio.
> +testfile=$SCRATCH_MNT/fsync-err-test
> +$XFS_IO_PROG -f \
> +	-c 'pwrite -b 1m 0 1m' \
> +	-c 'pwrite -b 1m 10g 1m' \
> +	-c 'pwrite -b 1m 20g 1m' \
> +	-c fsync $testfile >> $seqres.full
> +
> +# First we check if directio errors get reported
> +$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &

healer_pid=$!

> +sleep 1	# wait for program to start up
> +_dmerror_load_error_table
> +$XFS_IO_PROG -d -c 'pwrite -b 256k 12k 16k' $testfile >> $seqres.full
> +$XFS_IO_PROG -d -c 'pread -b 256k 10g 16k' $testfile >> $seqres.full
> +_dmerror_load_working_table
> +
> +_dmerror_unmount
> +wait	# for healthmon to finish

unset healer_pid

> +_dmerror_mount
> +
> +# Next we check if buffered io errors get reported.  We have to write something
> +# before loading the error table to ensure the dquots get loaded.
> +$XFS_IO_PROG -c 'pwrite -b 256k 20g 1k' -c fsync $testfile >> $seqres.full
> +$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &

healer_pid=$!

> +sleep 1	# wait for program to start up
> +_dmerror_load_error_table
> +$XFS_IO_PROG -c 'pread -b 256k 12k 16k' $testfile >> $seqres.full
> +$XFS_IO_PROG -c 'pwrite -b 256k 20g 16k' -c fsync $testfile >> $seqres.full
> +_dmerror_load_working_table
> +
> +_dmerror_unmount
> +wait	# for healthmon to finish

unset healer_pid

> +
> +# Did we get errors?
> +cat $tmp.healthmon >> $seqres.full
> +filter_healer_errors < $tmp.healthmon
> +
> +_dmerror_cleanup
> +
> +status=0
> +exit

_exit 0

Others look good to me, I'll change these when I merge this patch.

Reviewed-by: Zorro Lang <zlang@redhat.com>

> diff --git a/tests/xfs/1878.out b/tests/xfs/1878.out
> new file mode 100644
> index 00000000000000..f64c440b1a6ed1
> --- /dev/null
> +++ b/tests/xfs/1878.out
> @@ -0,0 +1,10 @@
> +QA output created by 1878
> +Format and mount
> +pwrite: Input/output error
> +pread: Input/output error
> +pread: Input/output error
> +fsync: Input/output error
> +VICTIM pos NUM len NUM: directio_write: Input/output error
> +VICTIM pos NUM len NUM: directio_read: Input/output error
> +VICTIM pos NUM len NUM: buffered_read: Input/output error
> +VICTIM pos NUM len NUM: buffered_write: Input/output error
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 04/14] xfs: set up common code for testing xfs_healer
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2026-03-10  3:50   ` [PATCH 03/14] xfs: test io " Darrick J. Wong
@ 2026-03-10  3:51   ` Darrick J. Wong
  2026-03-13 19:04     ` Zorro Lang
  2026-03-14 20:37     ` Zorro Lang
  2026-03-10  3:51   ` [PATCH 05/14] xfs: test xfs_healer's event handling Darrick J. Wong
                     ` (10 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:51 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a bunch of common code so that we can test the xfs_healer daemon.
Most of the changes here are to make it easier to manage the systemd
service units for xfs_healer and xfs_scrub.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/config  |   14 ++++++++
 common/rc      |    5 +++
 common/systemd |   39 ++++++++++++++++++++++
 common/xfs     |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/802  |    4 +-
 5 files changed, 158 insertions(+), 2 deletions(-)


diff --git a/common/config b/common/config
index 1420e35ddfee42..8468a60081f50c 100644
--- a/common/config
+++ b/common/config
@@ -161,6 +161,20 @@ export XFS_ADMIN_PROG="$(type -P xfs_admin)"
 export XFS_GROWFS_PROG=$(type -P xfs_growfs)
 export XFS_SPACEMAN_PROG="$(type -P xfs_spaceman)"
 export XFS_SCRUB_PROG="$(type -P xfs_scrub)"
+
+XFS_HEALER_PROG="$(type -P xfs_healer)"
+XFS_HEALER_START_PROG="$(type -P xfs_healer_start)"
+
+# If not found, try the ones installed in libexec
+if [ ! -x "$XFS_HEALER_PROG" ] && [ -e /usr/libexec/xfsprogs/xfs_healer ]; then
+	XFS_HEALER_PROG=/usr/libexec/xfsprogs/xfs_healer
+fi
+if [ ! -x "$XFS_HEALER_START_PROG" ] && [ -e /usr/libexec/xfsprogs/xfs_healer_start ]; then
+	XFS_HEALER_START_PROG=/usr/libexec/xfsprogs/xfs_healer_start
+fi
+export XFS_HEALER_PROG
+export XFS_HEALER_START_PROG
+
 export XFS_PARALLEL_REPAIR_PROG="$(type -P xfs_prepair)"
 export XFS_PARALLEL_REPAIR64_PROG="$(type -P xfs_prepair64)"
 export __XFSDUMP_PROG="$(type -P xfsdump)"
diff --git a/common/rc b/common/rc
index ccb78baf5bd41a..0b740595d231b5 100644
--- a/common/rc
+++ b/common/rc
@@ -3021,6 +3021,11 @@ _require_xfs_io_command()
 	"label")
 		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
 		;;
+	"verifymedia")
+		testio=`$XFS_IO_PROG -x -c "verifymedia $* 0 0" 2>&1`
+		echo $testio | grep -q "invalid option" && \
+			_notrun "xfs_io $command support is missing"
+		;;
 	"open")
 		# -c "open $f" is broken in xfs_io <= 4.8. Along with the fix,
 		# a new -C flag was introduced to execute one shot commands.
diff --git a/common/systemd b/common/systemd
index b2e24f267b2d93..589aad1bef2637 100644
--- a/common/systemd
+++ b/common/systemd
@@ -44,6 +44,18 @@ _systemd_unit_active() {
 	test "$(systemctl is-active "$1")" = "active"
 }
 
+# Wait for up to a certain number of seconds for a service to reach inactive
+# state.
+_systemd_unit_wait() {
+	local svcname="$1"
+	local timeout="${2:-30}"
+
+	for ((i = 0; i < (timeout * 2); i++)); do
+		test "$(systemctl is-active "$svcname")" = "inactive" && break
+		sleep 0.5
+	done
+}
+
 _require_systemd_unit_active() {
 	_require_systemd_unit_defined "$1"
 	_systemd_unit_active "$1" || \
@@ -71,3 +83,30 @@ _systemd_unit_status() {
 	_systemd_installed || return 1
 	systemctl status "$1"
 }
+
+# Start a running systemd unit
+_systemd_unit_start() {
+	systemctl start "$1"
+}
+# Stop a running systemd unit
+_systemd_unit_stop() {
+	systemctl stop "$1"
+}
+
+# Mask or unmask a running systemd unit
+_systemd_unit_mask() {
+	systemctl mask "$1"
+}
+_systemd_unit_unmask() {
+	systemctl unmask "$1"
+}
+_systemd_unit_masked() {
+	systemctl status "$1" 2>/dev/null | grep -q 'Loaded: masked'
+}
+
+_systemd_service_unit_path() {
+	local template="$1"
+	local path="$2"
+
+	systemd-escape --template "$template" --path "$path"
+}
diff --git a/common/xfs b/common/xfs
index 7fa0db2e26b4c9..f276325df8fbac 100644
--- a/common/xfs
+++ b/common/xfs
@@ -2301,3 +2301,101 @@ _filter_bmap_gno()
 		if ($ag =~ /\d+/) {print "$ag "} ;
         '
 }
+
+# Run the xfs_healer program on some filesystem
+_xfs_healer() {
+	$XFS_HEALER_PROG "$@"
+}
+
+# Compute the xfs_healer systemd service instance name for a given path.
+# This is easy because xfs_healer has always supported --svcname.
+_xfs_healer_svcname()
+{
+	_xfs_healer --svcname "$@"
+}
+
+# Compute the xfs_scrub systemd service instance name for a given path.  This
+# is tricky because xfs_scrub only gained --svcname when xfs_healer was made.
+_xfs_scrub_svcname()
+{
+	local ret
+
+	if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then
+		echo "$ret"
+		return 0
+	fi
+
+	# ...but if not, we can fall back to brute force systemd invocations.
+	_systemd_service_unit_path "xfs_scrub@.service" "$*"
+}
+
+# Run the xfs_healer program on the scratch fs
+_scratch_xfs_healer() {
+	_xfs_healer "$@" "$SCRATCH_MNT"
+}
+
+# Turn off the background xfs_healer service if any so that it doesn't fix
+# injected metadata errors; then start a background copy of xfs_healer to
+# capture that.
+_invoke_xfs_healer() {
+	local mount="$1"
+	local logfile="$2"
+	shift; shift
+
+	if _systemd_is_running; then
+		local svc="$(_xfs_healer_svcname "$mount")"
+		_systemd_unit_stop "$svc" &>> $seqres.full
+	fi
+
+	$XFS_HEALER_PROG "$mount" "$@" &> "$logfile" &
+	XFS_HEALER_PID=$!
+
+	# Wait 30s for the healer program to really start up
+	for ((i = 0; i < 60; i++)); do
+		test -e "$logfile" && \
+			grep -q 'monitoring started' "$logfile" && \
+			break
+		sleep 0.5
+	done
+}
+
+# Run our own copy of xfs_healer against the scratch device.  Note that
+# unmounting the scratch fs causes the healer daemon to exit, so we don't need
+# to kill it explicitly from _cleanup.
+_scratch_invoke_xfs_healer() {
+	_invoke_xfs_healer "$SCRATCH_MNT" "$@"
+}
+
+# Unmount the filesystem to kill the xfs_healer instance started by
+# _invoke_xfs_healer, and wait up to a certain amount of time for it to exit.
+_kill_xfs_healer() {
+	local unmount="$1"
+	local timeout="${2:-30}"
+	local i
+
+	# Unmount fs to kill healer, then wait for it to finish
+	for ((i = 0; i < (timeout * 2); i++)); do
+		$unmount &>> $seqres.full && break
+		sleep 0.5
+	done
+
+	test -n "$XFS_HEALER_PID" && \
+		kill $XFS_HEALER_PID &>> $seqres.full
+	wait
+	unset XFS_HEALER_PID
+}
+
+# Unmount the scratch fs to kill a _scratch_invoke_xfs_healer instance.
+_scratch_kill_xfs_healer() {
+	local unmount="${1:-_scratch_unmount}"
+	shift
+
+	_kill_xfs_healer "$unmount" "$@"
+}
+
+# Does this mounted filesystem support xfs_healer?
+_require_xfs_healer()
+{
+	_xfs_healer --supported "$@" &>/dev/null || \
+		_notrun "health monitoring not supported on this kernel"
+}
diff --git a/tests/xfs/802 b/tests/xfs/802
index fc4767acb66a55..18312b15b645bd 100755
--- a/tests/xfs/802
+++ b/tests/xfs/802
@@ -105,8 +105,8 @@ run_scrub_service() {
 }
 
 echo "Scrub Scratch FS"
-scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
-run_scrub_service xfs_scrub@$scratch_path
+svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"
+run_scrub_service "$svc"
 find_scrub_trace "$SCRATCH_MNT"
 
 # Remove the xfs_scrub_all media scan stamp directory (if specified) because we


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] xfs: set up common code for testing xfs_healer
  2026-03-10  3:51   ` [PATCH 04/14] xfs: set up common code for testing xfs_healer Darrick J. Wong
@ 2026-03-13 19:04     ` Zorro Lang
  2026-03-14 20:37     ` Zorro Lang
  1 sibling, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:04 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:51:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a bunch of common code so that we can test the xfs_healer daemon.
> Most of the changes here are to make it easier to manage the systemd
> service units for xfs_healer and xfs_scrub.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  common/config  |   14 ++++++++
>  common/rc      |    5 +++
>  common/systemd |   39 ++++++++++++++++++++++
>  common/xfs     |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/802  |    4 +-
>  5 files changed, 158 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/common/config b/common/config
> index 1420e35ddfee42..8468a60081f50c 100644
> --- a/common/config
> +++ b/common/config
> @@ -161,6 +161,20 @@ export XFS_ADMIN_PROG="$(type -P xfs_admin)"
>  export XFS_GROWFS_PROG=$(type -P xfs_growfs)
>  export XFS_SPACEMAN_PROG="$(type -P xfs_spaceman)"
>  export XFS_SCRUB_PROG="$(type -P xfs_scrub)"
> +
> +XFS_HEALER_PROG="$(type -P xfs_healer)"
> +XFS_HEALER_START_PROG="$(type -P xfs_healer_start)"
> +
> +# If not found, try the ones installed in libexec
> +if [ ! -x "$XFS_HEALER_PROG" ] && [ -e /usr/libexec/xfsprogs/xfs_healer ]; then
> +	XFS_HEALER_PROG=/usr/libexec/xfsprogs/xfs_healer
> +fi
> +if [ ! -x "$XFS_HEALER_START_PROG" ] && [ -e /usr/libexec/xfsprogs/xfs_healer_start ]; then
> +	XFS_HEALER_START_PROG=/usr/libexec/xfsprogs/xfs_healer_start
> +fi
> +export XFS_HEALER_PROG
> +export XFS_HEALER_START_PROG
> +
>  export XFS_PARALLEL_REPAIR_PROG="$(type -P xfs_prepair)"
>  export XFS_PARALLEL_REPAIR64_PROG="$(type -P xfs_prepair64)"
>  export __XFSDUMP_PROG="$(type -P xfsdump)"
> diff --git a/common/rc b/common/rc
> index ccb78baf5bd41a..0b740595d231b5 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3021,6 +3021,11 @@ _require_xfs_io_command()
>  	"label")
>  		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
>  		;;
> +	"verifymedia")
> +		testio=`$XFS_IO_PROG -x -c "verifymedia $* 0 0" 2>&1`
> +		echo $testio | grep -q "invalid option" && \
> +			_notrun "xfs_io $command support is missing"
> +		;;
>  	"open")
>  		# -c "open $f" is broken in xfs_io <= 4.8. Along with the fix,
>  		# a new -C flag was introduced to execute one shot commands.
> diff --git a/common/systemd b/common/systemd
> index b2e24f267b2d93..589aad1bef2637 100644
> --- a/common/systemd
> +++ b/common/systemd
> @@ -44,6 +44,18 @@ _systemd_unit_active() {
>  	test "$(systemctl is-active "$1")" = "active"
>  }
>  
> +# Wait for up to a certain number of seconds for a service to reach inactive
> +# state.
> +_systemd_unit_wait() {
> +	local svcname="$1"
> +	local timeout="${2:-30}"
> +
> +	for ((i = 0; i < (timeout * 2); i++)); do
> +		test "$(systemctl is-active "$svcname")" = "inactive" && break
> +		sleep 0.5
> +	done
> +}
> +
>  _require_systemd_unit_active() {
>  	_require_systemd_unit_defined "$1"
>  	_systemd_unit_active "$1" || \
> @@ -71,3 +83,30 @@ _systemd_unit_status() {
>  	_systemd_installed || return 1
>  	systemctl status "$1"
>  }
> +
> +# Start a running systemd unit
> +_systemd_unit_start() {
> +	systemctl start "$1"
> +}
> +# Stop a running systemd unit
> +_systemd_unit_stop() {
> +	systemctl stop "$1"
> +}
> +
> +# Mask or unmask a running systemd unit
> +_systemd_unit_mask() {
> +	systemctl mask "$1"
> +}
> +_systemd_unit_unmask() {
> +	systemctl unmask "$1"
> +}
> +_systemd_unit_masked() {
> +	systemctl status "$1" 2>/dev/null | grep -q 'Loaded: masked'
> +}
> +
> +_systemd_service_unit_path() {
> +	local template="$1"
> +	local path="$2"
> +
> +	systemd-escape --template "$template" --path "$path"
> +}
> diff --git a/common/xfs b/common/xfs
> index 7fa0db2e26b4c9..f276325df8fbac 100644
> --- a/common/xfs
> +++ b/common/xfs
> @@ -2301,3 +2301,101 @@ _filter_bmap_gno()
>  		if ($ag =~ /\d+/) {print "$ag "} ;
>          '
>  }
> +
> +# Run the xfs_healer program on some filesystem
> +_xfs_healer() {
> +	$XFS_HEALER_PROG "$@"
> +}
> +
> +# Compute the xfs_healer systemd service instance name for a given path.
> +# This is easy because xfs_healer has always supported --svcname.
> +_xfs_healer_svcname()
> +{
> +	_xfs_healer --svcname "$@"
> +}
> +
> +# Compute the xfs_scrub systemd service instance name for a given path.  This
> +# is tricky because xfs_scrub only gained --svcname when xfs_healer was made.
> +_xfs_scrub_svcname()
> +{
> +	local ret
> +
> +	if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then
> +		echo "$ret"
> +		return 0
> +	fi
> +
> +	# ...but if not, we can fall back to brute force systemd invocations.
> +	_systemd_service_unit_path "xfs_scrub@.service" "$*"
> +}
> +
> +# Run the xfs_healer program on the scratch fs
> +_scratch_xfs_healer() {
> +	_xfs_healer "$@" "$SCRATCH_MNT"
> +}
> +
> +# Turn off the background xfs_healer service if any so that it doesn't fix
> +# injected metadata errors; then start a background copy of xfs_healer to
> +# capture that.
> +_invoke_xfs_healer() {
> +	local mount="$1"
> +	local logfile="$2"
> +	shift; shift
> +
> +	if _systemd_is_running; then
> +		local svc="$(_xfs_healer_svcname "$mount")"
> +		_systemd_unit_stop "$svc" &>> $seqres.full
> +	fi
> +
> +	$XFS_HEALER_PROG "$mount" "$@" &> "$logfile" &
> +	XFS_HEALER_PID=$!
> +
> +	# Wait 30s for the healer program to really start up
> +	for ((i = 0; i < 60; i++)); do
> +		test -e "$logfile" && \
> +			grep -q 'monitoring started' "$logfile" && \
> +			break
> +		sleep 0.5
> +	done
> +}
> +
> +# Run our own copy of xfs_healer against the scratch device.  Note that
> +# unmounting the scratch fs causes the healer daemon to exit, so we don't need
> +# to kill it explicitly from _cleanup.
> +_scratch_invoke_xfs_healer() {
> +	_invoke_xfs_healer "$SCRATCH_MNT" "$@"
> +}
> +
> +# Unmount the filesystem to kill the xfs_healer instance started by
> +# _invoke_xfs_healer, and wait up to a certain amount of time for it to exit.
> +_kill_xfs_healer() {
> +	local unmount="$1"
> +	local timeout="${2:-30}"
> +	local i
> +
> +	# Unmount fs to kill healer, then wait for it to finish
> +	for ((i = 0; i < (timeout * 2); i++)); do
> +		$unmount &>> $seqres.full && break
> +		sleep 0.5
> +	done
> +
> +	test -n "$XFS_HEALER_PID" && \
> +		kill $XFS_HEALER_PID &>> $seqres.full
> +	wait
> +	unset XFS_HEALER_PID
> +}
> +
> +# Unmount the scratch fs to kill a _scratch_invoke_xfs_healer instance.
> +_scratch_kill_xfs_healer() {
> +	local unmount="${1:-_scratch_unmount}"
> +	shift
> +
> +	_kill_xfs_healer "$unmount" "$@"
> +}
> +
> +# Does this mounted filesystem support xfs_healer?
> +_require_xfs_healer()
> +{
> +	_xfs_healer --supported "$@" &>/dev/null || \
> +		_notrun "health monitoring not supported on this kernel"
> +}
> diff --git a/tests/xfs/802 b/tests/xfs/802
> index fc4767acb66a55..18312b15b645bd 100755
> --- a/tests/xfs/802
> +++ b/tests/xfs/802
> @@ -105,8 +105,8 @@ run_scrub_service() {
>  }
>  
>  echo "Scrub Scratch FS"
> -scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
> -run_scrub_service xfs_scrub@$scratch_path
> +svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"
> +run_scrub_service "$svc"
>  find_scrub_trace "$SCRATCH_MNT"
>  
>  # Remove the xfs_scrub_all media scan stamp directory (if specified) because we
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] xfs: set up common code for testing xfs_healer
  2026-03-10  3:51   ` [PATCH 04/14] xfs: set up common code for testing xfs_healer Darrick J. Wong
  2026-03-13 19:04     ` Zorro Lang
@ 2026-03-14 20:37     ` Zorro Lang
  2026-03-15  4:51       ` Darrick J. Wong
  1 sibling, 1 reply; 45+ messages in thread
From: Zorro Lang @ 2026-03-14 20:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:51:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add a bunch of common code so that we can test the xfs_healer daemon.
> Most of the changes here are to make it easier to manage the systemd
> service units for xfs_healer and xfs_scrub.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  common/config  |   14 ++++++++
>  common/rc      |    5 +++
>  common/systemd |   39 ++++++++++++++++++++++
>  common/xfs     |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/802  |    4 +-
>  5 files changed, 158 insertions(+), 2 deletions(-)
> 

[snip]

> +# Compute the xfs_scrub systemd service instance name for a given path.  This
> +# is tricky because xfs_scrub only gained --svcname when xfs_healer was made.
> +_xfs_scrub_svcname()
> +{
> +	local ret
> +
> +	if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then

Better to be:

-       if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then
+       if ret="$($XFS_SCRUB_PROG --svcname "$@" 2>/dev/null)"; then

Or below xfs/802 will ...

> +		echo "$ret"
> +		return 0
> +	fi

[snip]

> diff --git a/tests/xfs/802 b/tests/xfs/802
> index fc4767acb66a55..18312b15b645bd 100755
> --- a/tests/xfs/802
> +++ b/tests/xfs/802
> @@ -105,8 +105,8 @@ run_scrub_service() {
>  }
>  
>  echo "Scrub Scratch FS"
> -scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
> -run_scrub_service xfs_scrub@$scratch_path
> +svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"

... fails on old xfsprogs as:

  --- /dev/fd/63	2026-03-13 19:16:15.217899866 -0400
  +++ xfs/802.out.bad	2026-03-13 19:16:15.191834546 -0400
  @@ -1,5 +1,21 @@
   QA output created by 802
   Format and populate
   Scrub Scratch FS
  +/usr/sbin/xfs_scrub: invalid option -- '-'
  +Usage: xfs_scrub [OPTIONS] mountpoint
  +
  +Options:
  +  -a count     Stop after this many errors are found.
  +  -b           Background mode.
  +  -C fd        Print progress information to this fd.
  +  -e behavior  What to do if errors are found.
  +  -k           Do not FITRIM the free space.
  +  -m path      Path to /etc/mtab.
  +  -n           Dry run.  Do not modify anything.
  +  -p           Only optimize, do not fix corruptions.
  +  -T           Display timing/usage information.
  +  -v           Verbose output.
  +  -V           Print version.
  +  -x           Scrub file data too.
   Scrub Everything
   Scrub Done

If you don't have more suggestion, I'll help to change that :)

Thanks,
Zorro

> +run_scrub_service "$svc"
>  find_scrub_trace "$SCRATCH_MNT"
>  
>  # Remove the xfs_scrub_all media scan stamp directory (if specified) because we
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] xfs: set up common code for testing xfs_healer
  2026-03-14 20:37     ` Zorro Lang
@ 2026-03-15  4:51       ` Darrick J. Wong
  0 siblings, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-15  4:51 UTC (permalink / raw)
  To: Zorro Lang; +Cc: fstests, linux-xfs

On Sun, Mar 15, 2026 at 04:37:59AM +0800, Zorro Lang wrote:
> On Mon, Mar 09, 2026 at 08:51:10PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add a bunch of common code so that we can test the xfs_healer daemon.
> > Most of the changes here are to make it easier to manage the systemd
> > service units for xfs_healer and xfs_scrub.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  common/config  |   14 ++++++++
> >  common/rc      |    5 +++
> >  common/systemd |   39 ++++++++++++++++++++++
> >  common/xfs     |   98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/802  |    4 +-
> >  5 files changed, 158 insertions(+), 2 deletions(-)
> > 
> 
> [snip]
> 
> > +# Compute the xfs_scrub systemd service instance name for a given path.  This
> > +# is tricky because xfs_scrub only gained --svcname when xfs_healer was made.
> > +_xfs_scrub_svcname()
> > +{
> > +	local ret
> > +
> > +	if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then
> 
> Better to be:
> 
> -       if ret="$($XFS_SCRUB_PROG --svcname "$@")"; then
> +       if ret="$($XFS_SCRUB_PROG --svcname "$@" 2>/dev/null)"; then
> 
> Or below xfs/802 will ...
> 
> > +		echo "$ret"
> > +		return 0
> > +	fi
> 
> [snip]
> 
> > diff --git a/tests/xfs/802 b/tests/xfs/802
> > index fc4767acb66a55..18312b15b645bd 100755
> > --- a/tests/xfs/802
> > +++ b/tests/xfs/802
> > @@ -105,8 +105,8 @@ run_scrub_service() {
> >  }
> >  
> >  echo "Scrub Scratch FS"
> > -scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
> > -run_scrub_service xfs_scrub@$scratch_path
> > +svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"
> 
> ... fails on old xfsprogs as:
> 
>   --- /dev/fd/63	2026-03-13 19:16:15.217899866 -0400
>   +++ xfs/802.out.bad	2026-03-13 19:16:15.191834546 -0400
>   @@ -1,5 +1,21 @@
>    QA output created by 802
>    Format and populate
>    Scrub Scratch FS
>   +/usr/sbin/xfs_scrub: invalid option -- '-'
>   +Usage: xfs_scrub [OPTIONS] mountpoint
>   +
>   +Options:
>   +  -a count     Stop after this many errors are found.
>   +  -b           Background mode.
>   +  -C fd        Print progress information to this fd.
>   +  -e behavior  What to do if errors are found.
>   +  -k           Do not FITRIM the free space.
>   +  -m path      Path to /etc/mtab.
>   +  -n           Dry run.  Do not modify anything.
>   +  -p           Only optimize, do not fix corruptions.
>   +  -T           Display timing/usage information.
>   +  -v           Verbose output.
>   +  -V           Print version.
>   +  -x           Scrub file data too.
>    Scrub Everything
>    Scrub Done
> 
> If you don't have more suggestion, I'll help to change that :)

That seems the proper correction to make.  Thanks for your help!

--D

> Thanks,
> Zorro
> 
> > +run_scrub_service "$svc"
> >  find_scrub_trace "$SCRATCH_MNT"
> >  
> >  # Remove the xfs_scrub_all media scan stamp directory (if specified) because we
> > 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 05/14] xfs: test xfs_healer's event handling
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2026-03-10  3:51   ` [PATCH 04/14] xfs: set up common code for testing xfs_healer Darrick J. Wong
@ 2026-03-10  3:51   ` Darrick J. Wong
  2026-03-13 19:19     ` Zorro Lang
  2026-03-10  3:51   ` [PATCH 06/14] xfs: test xfs_healer can fix a filesystem Darrick J. Wong
                     ` (9 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:51 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer can handle every type of event that the kernel
can throw at it by initiating a full scrub of a test filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1882     |   44 ++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1882.out |    2 ++
 2 files changed, 46 insertions(+)
 create mode 100755 tests/xfs/1882
 create mode 100644 tests/xfs/1882.out


diff --git a/tests/xfs/1882 b/tests/xfs/1882
new file mode 100755
index 00000000000000..2fb4589418401e
--- /dev/null
+++ b/tests/xfs/1882
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1882
+#
+# Make sure that xfs_healer correctly handles all the reports that it gets
+# from the kernel.  We simulate this by using the --everything mode so we get
+# all the events, not just the sickness reports.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+. ./common/populate
+
+_require_scrub
+_require_xfs_io_command "scrub"		# online check support
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_scratch
+
+# Does this fs support health monitoring?
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_xfs_healer $SCRATCH_MNT
+_scratch_unmount
+
+# Create a sample fs with all the goodies
+_scratch_populate_cached nofill &>> $seqres.full
+_scratch_mount
+
+_scratch_invoke_xfs_healer "$tmp.healer" --everything
+
+# Run scrub to make some noise
+_scratch_scrub -b -n >> $seqres.full
+
+_scratch_kill_xfs_healer
+cat $tmp.healer >> $seqres.full
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1882.out b/tests/xfs/1882.out
new file mode 100644
index 00000000000000..9b31ccb735cabd
--- /dev/null
+++ b/tests/xfs/1882.out
@@ -0,0 +1,2 @@
+QA output created by 1882
+Silence is golden


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/14] xfs: test xfs_healer's event handling
  2026-03-10  3:51   ` [PATCH 05/14] xfs: test xfs_healer's event handling Darrick J. Wong
@ 2026-03-13 19:19     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:51:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer can handle every type of event that the kernel
> can throw at it by initiating a full scrub of a test filesystem.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1882     |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1882.out |    2 ++
>  2 files changed, 46 insertions(+)
>  create mode 100755 tests/xfs/1882
>  create mode 100644 tests/xfs/1882.out
> 
> 
> diff --git a/tests/xfs/1882 b/tests/xfs/1882
> new file mode 100755
> index 00000000000000..2fb4589418401e
> --- /dev/null
> +++ b/tests/xfs/1882
> @@ -0,0 +1,44 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1882
> +#
> +# Make sure that xfs_healer correctly handles all the reports that it gets
> +# from the kernel.  We simulate this by using the --everything mode so we get
> +# all the events, not just the sickness reports.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +. ./common/populate
> +
> +_require_scrub
> +_require_xfs_io_command "scrub"		# online check support
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_scratch
> +
> +# Does this fs support health monitoring?
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +_require_xfs_healer $SCRATCH_MNT
> +_scratch_unmount
> +
> +# Create a sample fs with all the goodies
> +_scratch_populate_cached nofill &>> $seqres.full
> +_scratch_mount
> +
> +_scratch_invoke_xfs_healer "$tmp.healer" --everything
> +
> +# Run scrub to make some noise
> +_scratch_scrub -b -n >> $seqres.full
> +
> +_scratch_kill_xfs_healer
> +cat $tmp.healer >> $seqres.full
> +
> +echo Silence is golden
> +status=0
> +exit
> diff --git a/tests/xfs/1882.out b/tests/xfs/1882.out
> new file mode 100644
> index 00000000000000..9b31ccb735cabd
> --- /dev/null
> +++ b/tests/xfs/1882.out
> @@ -0,0 +1,2 @@
> +QA output created by 1882
> +Silence is golden
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 06/14] xfs: test xfs_healer can fix a filesystem
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2026-03-10  3:51   ` [PATCH 05/14] xfs: test xfs_healer's event handling Darrick J. Wong
@ 2026-03-10  3:51   ` Darrick J. Wong
  2026-03-13 19:28     ` Zorro Lang
  2026-03-10  3:51   ` [PATCH 07/14] xfs: test xfs_healer can report file I/O errors Darrick J. Wong
                     ` (8 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:51 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer can actually fix an injected metadata corruption.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1884     |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1884.out |    2 +
 2 files changed, 91 insertions(+)
 create mode 100755 tests/xfs/1884
 create mode 100644 tests/xfs/1884.out


diff --git a/tests/xfs/1884 b/tests/xfs/1884
new file mode 100755
index 00000000000000..1fa6457ad25203
--- /dev/null
+++ b/tests/xfs/1884
@@ -0,0 +1,89 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1884
+#
+# Ensure that autonomous self healing fixes the filesystem correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $SCRATCH_MNT --repair
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+_scratch_mount
+
+_scratch_invoke_xfs_healer "$tmp.healer" --repair
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "try $try no longer saw corruption or gave up" >> $seqres.full
+_filter_scratch < $tmp.err
+
+# List the dirents of /victimdir to see if it stops reporting corruption
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "retry $try still saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "retry $try no longer saw corruption or gave up" >> $seqres.full
+
+# Unmount to kill the healer
+_scratch_kill_xfs_healer
+cat $tmp.healer >> $seqres.full
+
+status=0
+exit
diff --git a/tests/xfs/1884.out b/tests/xfs/1884.out
new file mode 100644
index 00000000000000..929e33da01f92c
--- /dev/null
+++ b/tests/xfs/1884.out
@@ -0,0 +1,2 @@
+QA output created by 1884
+ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/14] xfs: test xfs_healer can fix a filesystem
  2026-03-10  3:51   ` [PATCH 06/14] xfs: test xfs_healer can fix a filesystem Darrick J. Wong
@ 2026-03-13 19:28     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:51:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer can actually fix an injected metadata corruption.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1884     |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1884.out |    2 +
>  2 files changed, 91 insertions(+)
>  create mode 100755 tests/xfs/1884
>  create mode 100644 tests/xfs/1884.out
> 
> 
> diff --git a/tests/xfs/1884 b/tests/xfs/1884
> new file mode 100755
> index 00000000000000..1fa6457ad25203
> --- /dev/null
> +++ b/tests/xfs/1884
> @@ -0,0 +1,89 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1884
> +#
> +# Ensure that autonomous self healing fixes the filesystem correctly.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +_require_scratch
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +
> +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $SCRATCH_MNT parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $SCRATCH_MNT --repair
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> +echo testdata > $SCRATCH_MNT/a
> +mkdir -p "$SCRATCH_MNT/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Break the directory, remount filesystem
> +_scratch_unmount
> +_scratch_xfs_db -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> +_scratch_mount
> +
> +_scratch_invoke_xfs_healer "$tmp.healer" --repair
> +
> +# Access the broken directory to trigger a repair, then poll the directory
> +# for 5 seconds to see if it gets fixed without us needing to intervene.
> +ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +_filter_scratch < $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "try $try saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "try $try no longer saw corruption or gave up" >> $seqres.full
> +_filter_scratch < $tmp.err
> +
> +# List the dirents of /victimdir to see if it stops reporting corruption
> +ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "retry $try still saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> +
> +# Unmount to kill the healer
> +_scratch_kill_xfs_healer
> +cat $tmp.healer >> $seqres.full
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1884.out b/tests/xfs/1884.out
> new file mode 100644
> index 00000000000000..929e33da01f92c
> --- /dev/null
> +++ b/tests/xfs/1884.out
> @@ -0,0 +1,2 @@
> +QA output created by 1884
> +ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 07/14] xfs: test xfs_healer can report file I/O errors
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2026-03-10  3:51   ` [PATCH 06/14] xfs: test xfs_healer can fix a filesystem Darrick J. Wong
@ 2026-03-10  3:51   ` Darrick J. Wong
  2026-03-13 19:32     ` Zorro Lang
  2026-03-10  3:52   ` [PATCH 08/14] xfs: test xfs_healer can report file media errors Darrick J. Wong
                     ` (7 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:51 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer can actually report file I/O errors.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1896     |  210 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1896.out |   21 +++++
 2 files changed, 231 insertions(+)
 create mode 100755 tests/xfs/1896
 create mode 100644 tests/xfs/1896.out


diff --git a/tests/xfs/1896 b/tests/xfs/1896
new file mode 100755
index 00000000000000..911e1d5ee8a576
--- /dev/null
+++ b/tests/xfs/1896
@@ -0,0 +1,210 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1896
+#
+# Check that xfs_healer can report file IO errors.
+
+. ./common/preamble
+_begin_fstest auto quick scrub eio selfhealing
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	_dmerror_cleanup
+}
+
+# Import common functions.
+. ./common/fuzzy
+. ./common/filter
+. ./common/dmerror
+. ./common/systemd
+
+_require_scratch
+_require_scrub
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_dm_target error
+_require_no_xfs_always_cow	# no out of place writes
+
+# Ignore everything from the healer except for the four IO error log messages.
+# Strip out file handle and range information because the blocksize can vary.
+# Writeback and readahead can trigger multiple error messages due to retries,
+# hence the uniq.
+filter_healer_errors() {
+	_filter_scratch | \
+		grep -E '(buffered|directio)' | \
+		sed \
+		    -e 's/ino [0-9]*/ino NUM/g' \
+		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
+		    -e 's/pos [0-9]*/pos NUM/g' \
+		    -e 's/len [0-9]*/len NUM/g' \
+		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
+		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g' | \
+		sort | \
+		uniq
+}
+
+_scratch_mkfs >> $seqres.full
+
+#
+# The dm-error map added by this test doesn't work on zoned devices because
+# table sizes need to be aligned to the zone size, and even for zoned on
+# conventional this test will get confused because of the internal RT device.
+#
+# That check requires a mounted file system, so do a dummy mount before setting
+# up DM.
+#
+_scratch_mount
+_require_xfs_scratch_non_zoned
+_require_xfs_healer $SCRATCH_MNT
+_scratch_unmount
+
+_dmerror_init
+_dmerror_mount >> $seqres.full 2>&1
+
+# Write a file with 4 file blocks worth of data, figure out the LBA to target
+victim=$SCRATCH_MNT/a
+file_blksz=$(_get_file_block_size $SCRATCH_MNT)
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
+unset errordev
+
+awk_len_prog='{print $6}'
+if _xfs_is_realtime_file $victim; then
+	if ! _xfs_has_feature $SCRATCH_MNT rtgroups; then
+		awk_len_prog='{print $4}'
+	fi
+	errordev="RT"
+fi
+bmap_str="$($XFS_IO_PROG -c "bmap -elpv" $victim | grep "^[[:space:]]*0:")"
+echo "$errordev:$bmap_str" >> $seqres.full
+
+phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
+len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
+
+fs_blksz=$(_get_block_size $SCRATCH_MNT)
+echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
+kernel_sectors_per_fs_block=$((fs_blksz / 512))
+
+# Did we get at least 4 fs blocks worth of extent?
+min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
+test "$len" -lt $min_len_sectors && \
+	_fail "could not format a long enough extent on an empty fs??"
+
+phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
+
+echo "$errordev:$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
+echo "victim file:" >> $seqres.full
+od -tx1 -Ad -c $victim >> $seqres.full
+
+# Set the dmerror table so that all IO will pass through.
+_dmerror_reset_table
+
+cat >> $seqres.full << ENDL
+dmerror before:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+# All sector numbers that we feed to the kernel must be in units of 512b, but
+# they also must be aligned to the device's logical block size.
+logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
+kernel_sectors_per_device_lba=$((logical_block_size / 512))
+
+# Mark as bad one of the device LBAs in the middle of the extent.  Target the
+# second LBA of the third block of the four-block file extent that we allocated
+# earlier, but without overflowing into the fourth file block.
+bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
+bad_len=$kernel_sectors_per_device_lba
+if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
+	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
+fi
+if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
+	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
+fi
+
+# Remount to flush the page cache, start the healer, and make the LBA bad
+_dmerror_unmount
+_dmerror_mount
+
+_scratch_invoke_xfs_healer "$tmp.healer"
+
+_dmerror_mark_range_bad $bad_sector $bad_len $errordev
+
+cat >> $seqres.full << ENDL
+dmerror after marking bad:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+_dmerror_load_error_table
+
+# See if buffered reads pick it up
+echo "Try buffered read"
+$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio reads pick it up
+echo "Try directio read"
+$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio writes pick it up
+echo "Try directio write"
+$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# See if buffered writes pick it up
+echo "Try buffered write"
+$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# Now mark the bad range good so that unmount won't fail due to IO errors.
+echo "Fix device"
+_dmerror_mark_range_good $bad_sector $bad_len $errordev
+_dmerror_load_error_table
+
+cat >> $seqres.full << ENDL
+dmerror after marking good:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+# Unmount filesystem to start fresh
+echo "Kill healer"
+_scratch_kill_xfs_healer _dmerror_unmount
+cat $tmp.healer >> $seqres.full
+cat $tmp.healer | filter_healer_errors
+
+# Start the healer again so that can verify that the errors don't persist after
+# we flip back to the good dm table.
+echo "Remount and restart healer"
+_dmerror_mount
+_scratch_invoke_xfs_healer "$tmp.healer"
+
+# See if buffered reads pick it up
+echo "Try buffered read again"
+$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio reads pick it up
+echo "Try directio read again"
+$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
+
+# See if directio writes pick it up
+echo "Try directio write again"
+$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# See if buffered writes pick it up
+echo "Try buffered write again"
+$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
+
+# Unmount fs to kill healer, then wait for it to finish
+echo "Kill healer again"
+_scratch_kill_xfs_healer _dmerror_unmount
+cat $tmp.healer >> $seqres.full
+cat $tmp.healer | filter_healer_errors
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1896.out b/tests/xfs/1896.out
new file mode 100644
index 00000000000000..1378d4fad44522
--- /dev/null
+++ b/tests/xfs/1896.out
@@ -0,0 +1,21 @@
+QA output created by 1896
+Try buffered read
+pread: Input/output error
+Try directio read
+pread: Input/output error
+Try directio write
+pwrite: Input/output error
+Try buffered write
+fsync: Input/output error
+Fix device
+Kill healer
+VICTIM pos NUM len NUM: buffered_read: Input/output error
+VICTIM pos NUM len NUM: buffered_write: Input/output error
+VICTIM pos NUM len NUM: directio_read: Input/output error
+VICTIM pos NUM len NUM: directio_write: Input/output error
+Remount and restart healer
+Try buffered read again
+Try directio read again
+Try directio write again
+Try buffered write again
+Kill healer again


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/14] xfs: test xfs_healer can report file I/O errors
  2026-03-10  3:51   ` [PATCH 07/14] xfs: test xfs_healer can report file I/O errors Darrick J. Wong
@ 2026-03-13 19:32     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:51:57PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer can actually report file I/O errors.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1896     |  210 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1896.out |   21 +++++
>  2 files changed, 231 insertions(+)
>  create mode 100755 tests/xfs/1896
>  create mode 100644 tests/xfs/1896.out
> 
> 
> diff --git a/tests/xfs/1896 b/tests/xfs/1896
> new file mode 100755
> index 00000000000000..911e1d5ee8a576
> --- /dev/null
> +++ b/tests/xfs/1896
> @@ -0,0 +1,210 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1896
> +#
> +# Check that xfs_healer can report file IO errors.
> +
> +. ./common/preamble
> +_begin_fstest auto quick scrub eio selfhealing
> +
> +# Override the default cleanup function.
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +	_dmerror_cleanup
> +}
> +
> +# Import common functions.
> +. ./common/fuzzy
> +. ./common/filter
> +. ./common/dmerror
> +. ./common/systemd
> +
> +_require_scratch
> +_require_scrub
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_dm_target error
> +_require_no_xfs_always_cow	# no out of place writes
> +
> +# Ignore everything from the healer except for the four IO error log messages.
> +# Strip out file handle and range information because the blocksize can vary.
> +# Writeback and readahead can trigger multiple error messages due to retries,
> +# hence the uniq.
> +filter_healer_errors() {
> +	_filter_scratch | \
> +		grep -E '(buffered|directio)' | \
> +		sed \
> +		    -e 's/ino [0-9]*/ino NUM/g' \
> +		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
> +		    -e 's/pos [0-9]*/pos NUM/g' \
> +		    -e 's/len [0-9]*/len NUM/g' \
> +		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
> +		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g' | \
> +		sort | \
> +		uniq
> +}
> +
> +_scratch_mkfs >> $seqres.full
> +
> +#
> +# The dm-error map added by this test doesn't work on zoned devices because
> +# table sizes need to be aligned to the zone size, and even for zoned on
> +# conventional this test will get confused because of the internal RT device.
> +#
> +# That check requires a mounted file system, so do a dummy mount before setting
> +# up DM.
> +#
> +_scratch_mount
> +_require_xfs_scratch_non_zoned
> +_require_xfs_healer $SCRATCH_MNT
> +_scratch_unmount
> +
> +_dmerror_init
> +_dmerror_mount >> $seqres.full 2>&1
> +
> +# Write a file with 4 file blocks worth of data, figure out the LBA to target
> +victim=$SCRATCH_MNT/a
> +file_blksz=$(_get_file_block_size $SCRATCH_MNT)
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
> +unset errordev
> +
> +awk_len_prog='{print $6}'
> +if _xfs_is_realtime_file $victim; then
> +	if ! _xfs_has_feature $SCRATCH_MNT rtgroups; then
> +		awk_len_prog='{print $4}'
> +	fi
> +	errordev="RT"
> +fi
> +bmap_str="$($XFS_IO_PROG -c "bmap -elpv" $victim | grep "^[[:space:]]*0:")"
> +echo "$errordev:$bmap_str" >> $seqres.full
> +
> +phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
> +len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
> +
> +fs_blksz=$(_get_block_size $SCRATCH_MNT)
> +echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
> +kernel_sectors_per_fs_block=$((fs_blksz / 512))
> +
> +# Did we get at least 4 fs blocks worth of extent?
> +min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
> +test "$len" -lt $min_len_sectors && \
> +	_fail "could not format a long enough extent on an empty fs??"
> +
> +phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
> +
> +echo "$errordev:$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
> +echo "victim file:" >> $seqres.full
> +od -tx1 -Ad -c $victim >> $seqres.full
> +
> +# Set the dmerror table so that all IO will pass through.
> +_dmerror_reset_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror before:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# All sector numbers that we feed to the kernel must be in units of 512b, but
> +# they also must be aligned to the device's logical block size.
> +logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> +kernel_sectors_per_device_lba=$((logical_block_size / 512))
> +
> +# Mark as bad one of the device LBAs in the middle of the extent.  Target the
> +# second LBA of the third block of the four-block file extent that we allocated
> +# earlier, but without overflowing into the fourth file block.
> +bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
> +bad_len=$kernel_sectors_per_device_lba
> +if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
> +	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
> +fi
> +if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
> +	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
> +fi
> +
> +# Remount to flush the page cache, start the healer, and make the LBA bad
> +_dmerror_unmount
> +_dmerror_mount
> +
> +_scratch_invoke_xfs_healer "$tmp.healer"
> +
> +_dmerror_mark_range_bad $bad_sector $bad_len $errordev
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking bad:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +_dmerror_load_error_table
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Now mark the bad range good so that unmount won't fail due to IO errors.
> +echo "Fix device"
> +_dmerror_mark_range_good $bad_sector $bad_len $errordev
> +_dmerror_load_error_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking good:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# Unmount filesystem to start fresh
> +echo "Kill healer"
> +_scratch_kill_xfs_healer _dmerror_unmount
> +cat $tmp.healer >> $seqres.full
> +cat $tmp.healer | filter_healer_errors
> +
> +# Start the healer again so that can verify that the errors don't persist after
> +# we flip back to the good dm table.
> +echo "Remount and restart healer"
> +_dmerror_mount
> +_scratch_invoke_xfs_healer "$tmp.healer"
> +
> +# See if buffered reads pick it up
> +echo "Try buffered read again"
> +$XFS_IO_PROG -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio reads pick it up
> +echo "Try directio read again"
> +$XFS_IO_PROG -d -c "pread 0 $((4 * file_blksz))" $victim >> $seqres.full
> +
> +# See if directio writes pick it up
> +echo "Try directio write again"
> +$XFS_IO_PROG -d -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# See if buffered writes pick it up
> +echo "Try buffered write again"
> +$XFS_IO_PROG -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c fsync $victim >> $seqres.full
> +
> +# Unmount fs to kill healer, then wait for it to finish
> +echo "Kill healer again"
> +_scratch_kill_xfs_healer _dmerror_unmount
> +cat $tmp.healer >> $seqres.full
> +cat $tmp.healer | filter_healer_errors
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/1896.out b/tests/xfs/1896.out
> new file mode 100644
> index 00000000000000..1378d4fad44522
> --- /dev/null
> +++ b/tests/xfs/1896.out
> @@ -0,0 +1,21 @@
> +QA output created by 1896
> +Try buffered read
> +pread: Input/output error
> +Try directio read
> +pread: Input/output error
> +Try directio write
> +pwrite: Input/output error
> +Try buffered write
> +fsync: Input/output error
> +Fix device
> +Kill healer
> +VICTIM pos NUM len NUM: buffered_read: Input/output error
> +VICTIM pos NUM len NUM: buffered_write: Input/output error
> +VICTIM pos NUM len NUM: directio_read: Input/output error
> +VICTIM pos NUM len NUM: directio_write: Input/output error
> +Remount and restart healer
> +Try buffered read again
> +Try directio read again
> +Try directio write again
> +Try buffered write again
> +Kill healer again
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 08/14] xfs: test xfs_healer can report file media errors
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2026-03-10  3:51   ` [PATCH 07/14] xfs: test xfs_healer can report file I/O errors Darrick J. Wong
@ 2026-03-10  3:52   ` Darrick J. Wong
  2026-03-13 19:36     ` Zorro Lang
  2026-03-10  3:52   ` [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns Darrick J. Wong
                     ` (6 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:52 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer can actually report media errors as found by the
kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1897     |  172 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1897.out |    7 ++
 2 files changed, 179 insertions(+)
 create mode 100755 tests/xfs/1897
 create mode 100755 tests/xfs/1897.out


diff --git a/tests/xfs/1897 b/tests/xfs/1897
new file mode 100755
index 00000000000000..4670c333a2d82c
--- /dev/null
+++ b/tests/xfs/1897
@@ -0,0 +1,172 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1897
+#
+# Check that xfs_healer can report media errors.
+
+. ./common/preamble
+_begin_fstest auto quick scrub eio selfhealing
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	_dmerror_cleanup
+}
+
+. ./common/fuzzy
+. ./common/filter
+. ./common/dmerror
+. ./common/systemd
+
+_require_scratch
+_require_scrub
+_require_dm_target error
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_xfs_io_command verifymedia
+
+filter_healer() {
+	_filter_scratch | \
+		grep -E '(media failed|media error)' | \
+		sed \
+		    -e 's/datadev/DEVICE/g' \
+		    -e 's/rtdev/DEVICE/g' \
+		    -e 's/ino [0-9]*/ino NUM/g' \
+		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
+		    -e 's/pos [0-9]*/pos NUM/g' \
+		    -e 's/len [0-9]*/len NUM/g' \
+		    -e 's/0x[0-9a-f]*/NUM/g' \
+		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
+		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g'
+}
+
+filter_verify() {
+	sed -e 's/\([a-z]*dev\): verify error at offset \([0-9]*\) length \([0-9]*\)/DEVICE: verify error at offset XXX length XXX/g'
+}
+
+_scratch_mkfs >> $seqres.full
+
+# The dm-error map added by this test doesn't work on zoned devices because
+# table sizes need to be aligned to the zone size, and even for zoned on
+# conventional this test will get confused because of the internal RT device.
+#
+# That check requires a mounted file system, so do a dummy mount before setting
+# up DM.
+_scratch_mount
+_require_xfs_scratch_non_zoned
+_require_xfs_healer $SCRATCH_MNT
+_scratch_unmount
+
+_dmerror_init
+_dmerror_mount
+
+# Write a file with 4 file blocks worth of data, figure out the LBA to target
+victim=$SCRATCH_MNT/a
+file_blksz=$(_get_file_block_size $SCRATCH_MNT)
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
+unset errordev
+verifymediadev="-d"
+
+awk_len_prog='{print $6}'
+if _xfs_is_realtime_file $victim; then
+	if ! _xfs_has_feature $SCRATCH_MNT rtgroups; then
+		awk_len_prog='{print $4}'
+	fi
+	errordev="RT"
+	verifymediadev="-r"
+fi
+bmap_str="$($XFS_IO_PROG -c "bmap -elpv" $victim | grep "^[[:space:]]*0:")"
+echo "$errordev:$bmap_str" >> $seqres.full
+
+phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
+len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
+
+fs_blksz=$(_get_block_size $SCRATCH_MNT)
+echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
+kernel_sectors_per_fs_block=$((fs_blksz / 512))
+
+# Did we get at least 4 fs blocks worth of extent?
+min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
+test "$len" -lt $min_len_sectors && \
+	_fail "could not format a long enough extent on an empty fs??"
+
+phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
+
+echo "$errordev:$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
+echo "victim file:" >> $seqres.full
+od -tx1 -Ad -c $victim >> $seqres.full
+
+# Set the dmerror table so that all IO will pass through.
+_dmerror_reset_table
+
+cat >> $seqres.full << ENDL
+dmerror before:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+# All sector numbers that we feed to the kernel must be in units of 512b, but
+# they also must be aligned to the device's logical block size.
+logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
+kernel_sectors_per_device_lba=$((logical_block_size / 512))
+
+# Pretend as bad one of the device LBAs in the middle of the extent.  Target
+# the second LBA of the third block of the four-block file extent that we
+# allocated earlier, but without overflowing into the fourth file block.
+bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
+bad_len=$kernel_sectors_per_device_lba
+if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
+	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
+fi
+if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
+	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
+fi
+_dmerror_mark_range_bad $bad_sector $bad_len $errordev
+
+cat >> $seqres.full << ENDL
+dmerror after marking bad:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+_dmerror_load_error_table
+
+echo "Simulate media error"
+_scratch_invoke_xfs_healer "$tmp.healer"
+echo "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" >> $seqres.full
+$XFS_IO_PROG -x -c "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" $SCRATCH_MNT 2>&1 | filter_verify
+
+# Now mark the bad range good so that a retest shows no media failure.
+_dmerror_mark_range_good $bad_sector $bad_len $errordev
+_dmerror_load_error_table
+
+cat >> $seqres.full << ENDL
+dmerror after marking good:
+$DMERROR_TABLE
+$DMERROR_RTTABLE
+<end table>
+ENDL
+
+echo "No more media error"
+echo "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" >> $seqres.full
+$XFS_IO_PROG -x -c "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" $SCRATCH_MNT >> $seqres.full
+
+# Unmount filesystem to start fresh
+echo "Kill healer"
+_scratch_kill_xfs_healer _dmerror_unmount
+
+# filesystems without rmap do not translate media errors to lost file ranges
+# so fake the output
+_xfs_has_feature "$SCRATCH_DEV" rmapbt || \
+	echo "VICTIM pos 0 len 0: media failed" >> $tmp.healer
+
+cat $tmp.healer >> $seqres.full
+cat $tmp.healer | filter_healer
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1897.out b/tests/xfs/1897.out
new file mode 100755
index 00000000000000..1bb615c3119dce
--- /dev/null
+++ b/tests/xfs/1897.out
@@ -0,0 +1,7 @@
+QA output created by 1897
+Simulate media error
+DEVICE: verify error at offset XXX length XXX: Input/output error
+No more media error
+Kill healer
+SCRATCH_MNT DEVICE daddr NUM bbcount NUM: media error
+VICTIM pos NUM len NUM: media failed


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/14] xfs: test xfs_healer can report file media errors
  2026-03-10  3:52   ` [PATCH 08/14] xfs: test xfs_healer can report file media errors Darrick J. Wong
@ 2026-03-13 19:36     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:52:13PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer can actually report media errors as found by the
> kernel.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1897     |  172 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1897.out |    7 ++
>  2 files changed, 179 insertions(+)
>  create mode 100755 tests/xfs/1897
>  create mode 100755 tests/xfs/1897.out
> 
> 
> diff --git a/tests/xfs/1897 b/tests/xfs/1897
> new file mode 100755
> index 00000000000000..4670c333a2d82c
> --- /dev/null
> +++ b/tests/xfs/1897
> @@ -0,0 +1,172 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1897
> +#
> +# Check that xfs_healer can report media errors.
> +
> +. ./common/preamble
> +_begin_fstest auto quick scrub eio selfhealing
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +	_dmerror_cleanup
> +}
> +
> +. ./common/fuzzy
> +. ./common/filter
> +. ./common/dmerror
> +. ./common/systemd
> +
> +_require_scratch
> +_require_scrub
> +_require_dm_target error
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_xfs_io_command verifymedia
> +
> +filter_healer() {
> +	_filter_scratch | \
> +		grep -E '(media failed|media error)' | \
> +		sed \
> +		    -e 's/datadev/DEVICE/g' \
> +		    -e 's/rtdev/DEVICE/g' \
> +		    -e 's/ino [0-9]*/ino NUM/g' \
> +		    -e 's/gen 0x[0-9a-f]*/gen NUM/g' \
> +		    -e 's/pos [0-9]*/pos NUM/g' \
> +		    -e 's/len [0-9]*/len NUM/g' \
> +		    -e 's/0x[0-9a-f]*/NUM/g' \
> +		    -e 's|SCRATCH_MNT/a|VICTIM|g' \
> +		    -e 's|SCRATCH_MNT ino NUM gen NUM|VICTIM|g'
> +}
> +
> +filter_verify() {
> +	sed -e 's/\([a-z]*dev\): verify error at offset \([0-9]*\) length \([0-9]*\)/DEVICE: verify error at offset XXX length XXX/g'
> +}
> +
> +_scratch_mkfs >> $seqres.full
> +
> +# The dm-error map added by this test doesn't work on zoned devices because
> +# table sizes need to be aligned to the zone size, and even for zoned on
> +# conventional this test will get confused because of the internal RT device.
> +#
> +# That check requires a mounted file system, so do a dummy mount before setting
> +# up DM.
> +_scratch_mount
> +_require_xfs_scratch_non_zoned
> +_require_xfs_healer $SCRATCH_MNT
> +_scratch_unmount
> +
> +_dmerror_init
> +_dmerror_mount
> +
> +# Write a file with 4 file blocks worth of data, figure out the LBA to target
> +victim=$SCRATCH_MNT/a
> +file_blksz=$(_get_file_block_size $SCRATCH_MNT)
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 $((4 * file_blksz))" -c "fsync" $victim >> $seqres.full
> +unset errordev
> +verifymediadev="-d"
> +
> +awk_len_prog='{print $6}'
> +if _xfs_is_realtime_file $victim; then
> +	if ! _xfs_has_feature $SCRATCH_MNT rtgroups; then
> +		awk_len_prog='{print $4}'
> +	fi
> +	errordev="RT"
> +	verifymediadev="-r"
> +fi
> +bmap_str="$($XFS_IO_PROG -c "bmap -elpv" $victim | grep "^[[:space:]]*0:")"
> +echo "$errordev:$bmap_str" >> $seqres.full
> +
> +phys="$(echo "$bmap_str" | $AWK_PROG '{print $3}')"
> +len="$(echo "$bmap_str" | $AWK_PROG "$awk_len_prog")"
> +
> +fs_blksz=$(_get_block_size $SCRATCH_MNT)
> +echo "file_blksz:$file_blksz:fs_blksz:$fs_blksz" >> $seqres.full
> +kernel_sectors_per_fs_block=$((fs_blksz / 512))
> +
> +# Did we get at least 4 fs blocks worth of extent?
> +min_len_sectors=$(( 4 * kernel_sectors_per_fs_block ))
> +test "$len" -lt $min_len_sectors && \
> +	_fail "could not format a long enough extent on an empty fs??"
> +
> +phys_start=$(echo "$phys" | sed -e 's/\.\..*//g')
> +
> +echo "$errordev:$phys:$len:$fs_blksz:$phys_start" >> $seqres.full
> +echo "victim file:" >> $seqres.full
> +od -tx1 -Ad -c $victim >> $seqres.full
> +
> +# Set the dmerror table so that all IO will pass through.
> +_dmerror_reset_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror before:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +# All sector numbers that we feed to the kernel must be in units of 512b, but
> +# they also must be aligned to the device's logical block size.
> +logical_block_size=`$here/src/min_dio_alignment $SCRATCH_MNT $SCRATCH_DEV`
> +kernel_sectors_per_device_lba=$((logical_block_size / 512))
> +
> +# Pretend as bad one of the device LBAs in the middle of the extent.  Target
> +# the second LBA of the third block of the four-block file extent that we
> +# allocated earlier, but without overflowing into the fourth file block.
> +bad_sector=$(( phys_start + (2 * kernel_sectors_per_fs_block) ))
> +bad_len=$kernel_sectors_per_device_lba
> +if (( kernel_sectors_per_device_lba < kernel_sectors_per_fs_block )); then
> +	bad_sector=$((bad_sector + kernel_sectors_per_device_lba))
> +fi
> +if (( (bad_sector % kernel_sectors_per_device_lba) != 0)); then
> +	echo "bad_sector $bad_sector not congruent with device logical block size $logical_block_size"
> +fi
> +_dmerror_mark_range_bad $bad_sector $bad_len $errordev
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking bad:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +_dmerror_load_error_table
> +
> +echo "Simulate media error"
> +_scratch_invoke_xfs_healer "$tmp.healer"
> +echo "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" >> $seqres.full
> +$XFS_IO_PROG -x -c "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" $SCRATCH_MNT 2>&1 | filter_verify
> +
> +# Now mark the bad range good so that a retest shows no media failure.
> +_dmerror_mark_range_good $bad_sector $bad_len $errordev
> +_dmerror_load_error_table
> +
> +cat >> $seqres.full << ENDL
> +dmerror after marking good:
> +$DMERROR_TABLE
> +$DMERROR_RTTABLE
> +<end table>
> +ENDL
> +
> +echo "No more media error"
> +echo "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" >> $seqres.full
> +$XFS_IO_PROG -x -c "verifymedia $verifymediadev -R $((bad_sector * 512)) $(((bad_sector + bad_len) * 512))" $SCRATCH_MNT >> $seqres.full
> +
> +# Unmount filesystem to start fresh
> +echo "Kill healer"
> +_scratch_kill_xfs_healer _dmerror_unmount
> +
> +# filesystems without rmap do not translate media errors to lost file ranges
> +# so fake the output
> +_xfs_has_feature "$SCRATCH_DEV" rmapbt || \
> +	echo "VICTIM pos 0 len 0: media failed" >> $tmp.healer
> +
> +cat $tmp.healer >> $seqres.full
> +cat $tmp.healer | filter_healer
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/1897.out b/tests/xfs/1897.out
> new file mode 100755
> index 00000000000000..1bb615c3119dce
> --- /dev/null
> +++ b/tests/xfs/1897.out
> @@ -0,0 +1,7 @@
> +QA output created by 1897
> +Simulate media error
> +DEVICE: verify error at offset XXX length XXX: Input/output error
> +No more media error
> +Kill healer
> +SCRATCH_MNT DEVICE daddr NUM bbcount NUM: media error
> +VICTIM pos NUM len NUM: media failed
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2026-03-10  3:52   ` [PATCH 08/14] xfs: test xfs_healer can report file media errors Darrick J. Wong
@ 2026-03-10  3:52   ` Darrick J. Wong
  2026-03-13 19:45     ` Zorro Lang
  2026-03-10  3:52   ` [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs Darrick J. Wong
                     ` (5 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:52 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer can actually report abnormal filesystem shutdowns.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1898     |   37 +++++++++++++++++++++++++++++++++++++
 tests/xfs/1898.out |    4 ++++
 2 files changed, 41 insertions(+)
 create mode 100755 tests/xfs/1898
 create mode 100755 tests/xfs/1898.out


diff --git a/tests/xfs/1898 b/tests/xfs/1898
new file mode 100755
index 00000000000000..2b6c72093e7021
--- /dev/null
+++ b/tests/xfs/1898
@@ -0,0 +1,37 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1898
+#
+# Check that xfs_healer can report filesystem shutdowns.
+
+. ./common/preamble
+_begin_fstest auto quick scrub eio selfhealing
+
+. ./common/fuzzy
+. ./common/filter
+. ./common/systemd
+
+_require_scratch_nocheck
+_require_scrub
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_xfs_healer $SCRATCH_MNT
+$XFS_IO_PROG -f -c "pwrite -S 0x58 0 500k" -c "fsync" $victim >> $seqres.full
+
+echo "Start healer and shut down"
+_scratch_invoke_xfs_healer "$tmp.healer"
+_scratch_shutdown -f
+
+# Unmount filesystem to start fresh
+echo "Kill healer"
+_scratch_kill_xfs_healer
+cat $tmp.healer >> $seqres.full
+cat $tmp.healer | _filter_scratch | grep 'shut down'
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1898.out b/tests/xfs/1898.out
new file mode 100755
index 00000000000000..f71f848da810ce
--- /dev/null
+++ b/tests/xfs/1898.out
@@ -0,0 +1,4 @@
+QA output created by 1898
+Start healer and shut down
+Kill healer
+SCRATCH_MNT: filesystem shut down due to forced unmount


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns
  2026-03-10  3:52   ` [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns Darrick J. Wong
@ 2026-03-13 19:45     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:52:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer can actually report abnormal filesystem shutdowns.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,
Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1898     |   37 +++++++++++++++++++++++++++++++++++++
>  tests/xfs/1898.out |    4 ++++
>  2 files changed, 41 insertions(+)
>  create mode 100755 tests/xfs/1898
>  create mode 100755 tests/xfs/1898.out
> 
> 
> diff --git a/tests/xfs/1898 b/tests/xfs/1898
> new file mode 100755
> index 00000000000000..2b6c72093e7021
> --- /dev/null
> +++ b/tests/xfs/1898
> @@ -0,0 +1,37 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1898
> +#
> +# Check that xfs_healer can report filesystem shutdowns.
> +
> +. ./common/preamble
> +_begin_fstest auto quick scrub eio selfhealing
> +
> +. ./common/fuzzy
> +. ./common/filter
> +. ./common/systemd
> +
> +_require_scratch_nocheck
> +_require_scrub
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +_require_xfs_healer $SCRATCH_MNT
> +$XFS_IO_PROG -f -c "pwrite -S 0x58 0 500k" -c "fsync" $victim >> $seqres.full
> +
> +echo "Start healer and shut down"
> +_scratch_invoke_xfs_healer "$tmp.healer"
> +_scratch_shutdown -f
> +
> +# Unmount filesystem to start fresh
> +echo "Kill healer"
> +_scratch_kill_xfs_healer
> +cat $tmp.healer >> $seqres.full
> +cat $tmp.healer | _filter_scratch | grep 'shut down'
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/1898.out b/tests/xfs/1898.out
> new file mode 100755
> index 00000000000000..f71f848da810ce
> --- /dev/null
> +++ b/tests/xfs/1898.out
> @@ -0,0 +1,4 @@
> +QA output created by 1898
> +Start healer and shut down
> +Kill healer
> +SCRATCH_MNT: filesystem shut down due to forced unmount
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2026-03-10  3:52   ` [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns Darrick J. Wong
@ 2026-03-10  3:52   ` Darrick J. Wong
  2026-03-13 19:48     ` Zorro Lang
  2026-03-10  3:52   ` [PATCH 11/14] xfs: test xfs_healer can follow mount moves Darrick J. Wong
                     ` (4 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:52 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that when xfs_healer can't perform a spot repair, it will actually
start up xfs_scrub to perform a full scan and repair.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1899     |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1899.out |    3 +
 2 files changed, 111 insertions(+)
 create mode 100755 tests/xfs/1899
 create mode 100644 tests/xfs/1899.out


diff --git a/tests/xfs/1899 b/tests/xfs/1899
new file mode 100755
index 00000000000000..5d35ca8265645f
--- /dev/null
+++ b/tests/xfs/1899
@@ -0,0 +1,108 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1899
+#
+# Ensure that autonomous self healing works fixes the filesystem correctly
+# even if the spot repair doesn't work and it falls back to a full fsck.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+_require_scratch
+_require_systemd_unit_defined "xfs_scrub@.service"
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $SCRATCH_MNT --repair
+
+filter_healer() {
+	_filter_scratch | \
+		grep 'Full repairs in progress' | \
+		uniq
+}
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' \
+	-c 'path /a' \
+	-c 'bmap -a' \
+	-c 'ablock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' \
+	>> $seqres.full
+_scratch_mount
+
+_scratch_invoke_xfs_healer "$tmp.healer" --repair
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "try $try no longer saw corruption or gave up" >> $seqres.full
+_filter_scratch < $tmp.err
+
+# Wait for the background fixer to finish
+svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"
+_systemd_unit_wait "$svc"
+
+# List the dirents of /victimdir and parent pointers of /a to see if they both
+# stop reporting corruption
+(ls $SCRATCH_MNT/some/victimdir ; $XFS_IO_PROG -c 'parent') > /dev/null 2> $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "retry $try still saw corruption" >> $seqres.full
+	sleep 0.1
+	(ls $SCRATCH_MNT/some/victimdir ; $XFS_IO_PROG -c 'parent') > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "retry $try no longer saw corruption or gave up" >> $seqres.full
+
+# Unmount to kill the healer
+_scratch_kill_xfs_healer
+cat $tmp.healer >> $seqres.full
+cat $tmp.healer | filter_healer
+
+status=0
+exit
diff --git a/tests/xfs/1899.out b/tests/xfs/1899.out
new file mode 100644
index 00000000000000..5345fd400f3627
--- /dev/null
+++ b/tests/xfs/1899.out
@@ -0,0 +1,3 @@
+QA output created by 1899
+ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning
+SCRATCH_MNT: Full repairs in progress.


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs
  2026-03-10  3:52   ` [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs Darrick J. Wong
@ 2026-03-13 19:48     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:52:44PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that when xfs_healer can't perform a spot repair, it will actually
> start up xfs_scrub to perform a full scan and repair.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1899     |  108 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1899.out |    3 +
>  2 files changed, 111 insertions(+)
>  create mode 100755 tests/xfs/1899
>  create mode 100644 tests/xfs/1899.out
> 
> 
> diff --git a/tests/xfs/1899 b/tests/xfs/1899
> new file mode 100755
> index 00000000000000..5d35ca8265645f
> --- /dev/null
> +++ b/tests/xfs/1899
> @@ -0,0 +1,108 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1899
> +#
> +# Ensure that autonomous self healing works fixes the filesystem correctly
> +# even if the spot repair doesn't work and it falls back to a full fsck.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +_require_scratch
> +_require_systemd_unit_defined "xfs_scrub@.service"
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +
> +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $SCRATCH_MNT parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $SCRATCH_MNT --repair
> +
> +filter_healer() {
> +	_filter_scratch | \
> +		grep 'Full repairs in progress' | \
> +		uniq
> +}
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> +echo testdata > $SCRATCH_MNT/a
> +mkdir -p "$SCRATCH_MNT/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Break the directory, remount filesystem
> +_scratch_unmount
> +_scratch_xfs_db -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' \
> +	-c 'path /a' \
> +	-c 'bmap -a' \
> +	-c 'ablock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' \
> +	>> $seqres.full
> +_scratch_mount
> +
> +_scratch_invoke_xfs_healer "$tmp.healer" --repair
> +
> +# Access the broken directory to trigger a repair, then poll the directory
> +# for 5 seconds to see if it gets fixed without us needing to intervene.
> +ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +_filter_scratch < $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "try $try saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "try $try no longer saw corruption or gave up" >> $seqres.full
> +_filter_scratch < $tmp.err
> +
> +# Wait for the background fixer to finish
> +svc="$(_xfs_scrub_svcname "$SCRATCH_MNT")"
> +_systemd_unit_wait "$svc"
> +
> +# List the dirents of /victimdir and parent pointers of /a to see if they both
> +# stop reporting corruption
> +(ls $SCRATCH_MNT/some/victimdir ; $XFS_IO_PROG -c 'parent') > /dev/null 2> $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "retry $try still saw corruption" >> $seqres.full
> +	sleep 0.1
> +	(ls $SCRATCH_MNT/some/victimdir ; $XFS_IO_PROG -c 'parent') > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> +
> +# Unmount to kill the healer
> +_scratch_kill_xfs_healer
> +cat $tmp.healer >> $seqres.full
> +cat $tmp.healer | filter_healer
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1899.out b/tests/xfs/1899.out
> new file mode 100644
> index 00000000000000..5345fd400f3627
> --- /dev/null
> +++ b/tests/xfs/1899.out
> @@ -0,0 +1,3 @@
> +QA output created by 1899
> +ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning
> +SCRATCH_MNT: Full repairs in progress.
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 11/14] xfs: test xfs_healer can follow mount moves
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (9 preceding siblings ...)
  2026-03-10  3:52   ` [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs Darrick J. Wong
@ 2026-03-10  3:52   ` Darrick J. Wong
  2026-03-13 19:39     ` Zorro Lang
  2026-03-10  3:53   ` [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem Darrick J. Wong
                     ` (3 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:52 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that when xfs_healer needs to reopen a filesystem to repair it,
it can still find the filesystem even if it has been mount --move'd.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1900     |  115 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1900.out |    2 +
 2 files changed, 117 insertions(+)
 create mode 100755 tests/xfs/1900
 create mode 100755 tests/xfs/1900.out


diff --git a/tests/xfs/1900 b/tests/xfs/1900
new file mode 100755
index 00000000000000..9a8f9fabd124ad
--- /dev/null
+++ b/tests/xfs/1900
@@ -0,0 +1,115 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1900
+#
+# Ensure that autonomous self healing fixes the filesystem correctly even if
+# the original mount has moved somewhere else.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_cleanup()
+{
+	command -v _kill_fsstress &>/dev/null && _kill_fsstress
+	cd /
+	rm -r -f $tmp.*
+	if [ -n "$new_dir" ]; then
+		_unmount "$new_dir" &>/dev/null
+		rm -rf "$new_dir"
+	fi
+}
+
+_require_test
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $SCRATCH_MNT --repair
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+_scratch_mount
+
+_scratch_invoke_xfs_healer "$tmp.healer" --repair
+
+# Move the scratch filesystem to a completely different mountpoint so that
+# we can test if the healer can find it again.
+new_dir=$TEST_DIR/moocow
+mkdir -p $new_dir
+_mount --bind $SCRATCH_MNT $new_dir
+_unmount $SCRATCH_MNT
+
+df -t xfs >> $seqres.full
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err | _filter_test_dir
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "try $try no longer saw corruption or gave up" >> $seqres.full
+_filter_scratch < $tmp.err | _filter_test_dir
+
+# List the dirents of /victimdir to see if it stops reporting corruption
+ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "retry $try still saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "retry $try no longer saw corruption or gave up" >> $seqres.full
+
+new_dir_unmount() {
+	_unmount $new_dir
+}
+
+# Unmount to kill the healer
+_scratch_kill_xfs_healer new_dir_unmount
+cat $tmp.healer >> $seqres.full
+
+status=0
+exit
diff --git a/tests/xfs/1900.out b/tests/xfs/1900.out
new file mode 100755
index 00000000000000..604c9eb5eb10f4
--- /dev/null
+++ b/tests/xfs/1900.out
@@ -0,0 +1,2 @@
+QA output created by 1900
+ls: reading directory 'TEST_DIR/moocow/some/victimdir': Structure needs cleaning


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/14] xfs: test xfs_healer can follow mount moves
  2026-03-10  3:52   ` [PATCH 11/14] xfs: test xfs_healer can follow mount moves Darrick J. Wong
@ 2026-03-13 19:39     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:52:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that when xfs_healer needs to reopen a filesystem to repair it,
> it can still find the filesystem even if it has been mount --move'd.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1900     |  115 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1900.out |    2 +
>  2 files changed, 117 insertions(+)
>  create mode 100755 tests/xfs/1900
>  create mode 100755 tests/xfs/1900.out
> 
> 
> diff --git a/tests/xfs/1900 b/tests/xfs/1900
> new file mode 100755
> index 00000000000000..9a8f9fabd124ad
> --- /dev/null
> +++ b/tests/xfs/1900
> @@ -0,0 +1,115 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1900
> +#
> +# Ensure that autonomous self healing fixes the filesystem correctly even if
> +# the original mount has moved somewhere else.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_cleanup()
> +{
> +	command -v _kill_fsstress &>/dev/null && _kill_fsstress
> +	cd /
> +	rm -r -f $tmp.*
> +	if [ -n "$new_dir" ]; then
> +		_unmount "$new_dir" &>/dev/null
> +		rm -rf "$new_dir"
> +	fi
> +}
> +
> +_require_test
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +_require_scratch
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +
> +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $SCRATCH_MNT parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $SCRATCH_MNT --repair
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> +echo testdata > $SCRATCH_MNT/a
> +mkdir -p "$SCRATCH_MNT/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Break the directory, remount filesystem
> +_scratch_unmount
> +_scratch_xfs_db -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> +_scratch_mount
> +
> +_scratch_invoke_xfs_healer "$tmp.healer" --repair
> +
> +# Move the scratch filesystem to a completely different mountpoint so that
> +# we can test if the healer can find it again.
> +new_dir=$TEST_DIR/moocow
> +mkdir -p $new_dir
> +_mount --bind $SCRATCH_MNT $new_dir
> +_unmount $SCRATCH_MNT
> +
> +df -t xfs >> $seqres.full
> +
> +# Access the broken directory to trigger a repair, then poll the directory
> +# for 5 seconds to see if it gets fixed without us needing to intervene.
> +ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +_filter_scratch < $tmp.err | _filter_test_dir
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "try $try saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "try $try no longer saw corruption or gave up" >> $seqres.full
> +_filter_scratch < $tmp.err | _filter_test_dir
> +
> +# List the dirents of /victimdir to see if it stops reporting corruption
> +ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "retry $try still saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> +
> +new_dir_unmount() {
> +	_unmount $new_dir
> +}
> +
> +# Unmount to kill the healer
> +_scratch_kill_xfs_healer new_dir_unmount
> +cat $tmp.healer >> $seqres.full
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1900.out b/tests/xfs/1900.out
> new file mode 100755
> index 00000000000000..604c9eb5eb10f4
> --- /dev/null
> +++ b/tests/xfs/1900.out
> @@ -0,0 +1,2 @@
> +QA output created by 1900
> +ls: reading directory 'TEST_DIR/moocow/some/victimdir': Structure needs cleaning
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (10 preceding siblings ...)
  2026-03-10  3:52   ` [PATCH 11/14] xfs: test xfs_healer can follow mount moves Darrick J. Wong
@ 2026-03-10  3:53   ` Darrick J. Wong
  2026-03-13 19:53     ` Zorro Lang
  2026-03-10  3:53   ` [PATCH 13/14] xfs: test xfs_healer background service Darrick J. Wong
                     ` (2 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:53 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that when xfs_healer needs to reopen a filesystem to repair it, it
won't latch on to another xfs filesystem that has been mounted atop the same
mountpoint.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1901     |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1901.out |    2 +
 2 files changed, 139 insertions(+)
 create mode 100755 tests/xfs/1901
 create mode 100755 tests/xfs/1901.out


diff --git a/tests/xfs/1901 b/tests/xfs/1901
new file mode 100755
index 00000000000000..c92dcf9a3b3d48
--- /dev/null
+++ b/tests/xfs/1901
@@ -0,0 +1,137 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1901
+#
+# Ensure that autonomous self healing won't fix the wrong filesystem if a
+# snapshot of the original filesystem is now mounted on the same directory as
+# the original.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_cleanup()
+{
+	command -v _kill_fsstress &>/dev/null && _kill_fsstress
+	cd /
+	rm -r -f $tmp.*
+	test -e "$mntpt" && _unmount "$mntpt" &>/dev/null
+	test -e "$mntpt" && _unmount "$mntpt" &>/dev/null
+	test -e "$loop1" && _destroy_loop_device "$loop1"
+	test -e "$loop2" && _destroy_loop_device "$loop2"
+	test -e "$testdir" && rm -r -f "$testdir"
+}
+
+_require_test
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+
+testdir=$TEST_DIR/$seq
+mntpt=$testdir/mount
+disk1=$testdir/disk1
+disk2=$testdir/disk2
+
+mkdir -p "$mntpt"
+$XFS_IO_PROG -f -c "truncate 300m" $disk1
+$XFS_IO_PROG -f -c "truncate 300m" $disk2
+loop1="$(_create_loop_device "$disk1")"
+
+filter_mntpt() {
+	sed -e "s|$mntpt|MNTPT|g"
+}
+
+_mkfs_dev "$loop1" >> $seqres.full
+_mount "$loop1" "$mntpt" || _notrun "cannot mount victim filesystem"
+
+_xfs_has_feature $mntpt rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $mntpt parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $mntpt --repair
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $mntpt set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$mntpt")
+echo testdata > $mntpt/a
+mkdir -p "$mntpt/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $mntpt/a $mntpt/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $mntpt/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Clone the fs, break the directory, remount filesystem
+_unmount "$mntpt"
+
+cp --sparse=always "$disk1" "$disk2" || _fail "cannot copy disk1"
+loop2="$(_create_loop_device_like_bdev "$disk2" "$loop1")"
+
+$XFS_DB_PROG "$loop1" -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+_mount "$loop1" "$mntpt" || _fail "cannot mount broken fs"
+
+_invoke_xfs_healer "$mntpt" "$tmp.healer" --repair
+
+# Stop the healer process so that it can't read error events while we do some
+# shenanigans.
+test -n "$XFS_HEALER_PID" || _fail "nobody set XFS_HEALER_PID?"
+kill -STOP $XFS_HEALER_PID
+
+
+echo "LOG $XFS_HEALER_PID SO FAR:" >> $seqres.full
+cat $tmp.healer >> $seqres.full
+
+# Access the broken directory to trigger a repair event, which will not yet be
+# processed.
+ls $mntpt/some/victimdir > /dev/null 2> $tmp.err
+filter_mntpt < $tmp.err
+
+ps auxfww | grep xfs_healer >> $seqres.full
+
+echo "LOG AFTER TRYING TO POKE:" >> $seqres.full
+cat $tmp.healer >> $seqres.full
+
+# Mount the clone filesystem to the same mountpoint so that the healer cannot
+# actually reopen it to perform repairs.
+_mount "$loop2" "$mntpt" -o nouuid || _fail "cannot mount decoy fs"
+
+grep -w xfs /proc/mounts >> $seqres.full
+
+# Continue the healer process so it can handle events now.  Wait a few seconds
+# while it fails to reopen disk1's mount point to repair things.
+kill -CONT $XFS_HEALER_PID
+sleep 2
+
+new_dir_unmount() {
+	_unmount "$mntpt"
+	_unmount "$mntpt"
+}
+
+# Unmount to kill the healer
+_kill_xfs_healer new_dir_unmount
+echo "LOG AFTER FAILURE" >> $seqres.full
+cat $tmp.healer >> $seqres.full
+
+# Did the healer log complaints about not being able to reopen the mountpoint
+# to enact repairs?
+grep -q 'Stale file handle' $tmp.healer || \
+	echo "Should have seen stale file handle complaints"
+
+status=0
+exit
diff --git a/tests/xfs/1901.out b/tests/xfs/1901.out
new file mode 100755
index 00000000000000..ff83e03725307a
--- /dev/null
+++ b/tests/xfs/1901.out
@@ -0,0 +1,2 @@
+QA output created by 1901
+ls: reading directory 'MNTPT/some/victimdir': Structure needs cleaning


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem
  2026-03-10  3:53   ` [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem Darrick J. Wong
@ 2026-03-13 19:53     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:53:15PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that when xfs_healer needs to reopen a filesystem to repair it, it
> won't latch on to another xfs filesystem that has been mounted atop the same
> mountpoint.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1901     |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1901.out |    2 +
>  2 files changed, 139 insertions(+)
>  create mode 100755 tests/xfs/1901
>  create mode 100755 tests/xfs/1901.out
> 
> 
> diff --git a/tests/xfs/1901 b/tests/xfs/1901
> new file mode 100755
> index 00000000000000..c92dcf9a3b3d48
> --- /dev/null
> +++ b/tests/xfs/1901
> @@ -0,0 +1,137 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2025-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1901
> +#
> +# Ensure that autonomous self healing won't fix the wrong filesystem if a
> +# snapshot of the original filesystem is now mounted on the same directory as
> +# the original.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_cleanup()
> +{
> +	command -v _kill_fsstress &>/dev/null && _kill_fsstress
> +	cd /
> +	rm -r -f $tmp.*
> +	test -e "$mntpt" && _unmount "$mntpt" &>/dev/null
> +	test -e "$mntpt" && _unmount "$mntpt" &>/dev/null
> +	test -e "$loop1" && _destroy_loop_device "$loop1"
> +	test -e "$loop2" && _destroy_loop_device "$loop2"
> +	test -e "$testdir" && rm -r -f "$testdir"
> +}
> +
> +_require_test
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +
> +testdir=$TEST_DIR/$seq
> +mntpt=$testdir/mount
> +disk1=$testdir/disk1
> +disk2=$testdir/disk2
> +
> +mkdir -p "$mntpt"
> +$XFS_IO_PROG -f -c "truncate 300m" $disk1
> +$XFS_IO_PROG -f -c "truncate 300m" $disk2
> +loop1="$(_create_loop_device "$disk1")"
> +
> +filter_mntpt() {
> +	sed -e "s|$mntpt|MNTPT|g"
> +}
> +
> +_mkfs_dev "$loop1" >> $seqres.full
> +_mount "$loop1" "$mntpt" || _notrun "cannot mount victim filesystem"
> +
> +_xfs_has_feature $mntpt rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $mntpt parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $mntpt --repair
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $mntpt set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$mntpt")
> +echo testdata > $mntpt/a
> +mkdir -p "$mntpt/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $mntpt/a $mntpt/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $mntpt/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Clone the fs, break the directory, remount filesystem
> +_unmount "$mntpt"
> +
> +cp --sparse=always "$disk1" "$disk2" || _fail "cannot copy disk1"
> +loop2="$(_create_loop_device_like_bdev "$disk2" "$loop1")"
> +
> +$XFS_DB_PROG "$loop1" -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> +_mount "$loop1" "$mntpt" || _fail "cannot mount broken fs"
> +
> +_invoke_xfs_healer "$mntpt" "$tmp.healer" --repair
> +
> +# Stop the healer process so that it can't read error events while we do some
> +# shenanigans.
> +test -n "$XFS_HEALER_PID" || _fail "nobody set XFS_HEALER_PID?"
> +kill -STOP $XFS_HEALER_PID
> +
> +
> +echo "LOG $XFS_HEALER_PID SO FAR:" >> $seqres.full
> +cat $tmp.healer >> $seqres.full
> +
> +# Access the broken directory to trigger a repair event, which will not yet be
> +# processed.
> +ls $mntpt/some/victimdir > /dev/null 2> $tmp.err
> +filter_mntpt < $tmp.err
> +
> +ps auxfww | grep xfs_healer >> $seqres.full
> +
> +echo "LOG AFTER TRYING TO POKE:" >> $seqres.full
> +cat $tmp.healer >> $seqres.full
> +
> +# Mount the clone filesystem to the same mountpoint so that the healer cannot
> +# actually reopen it to perform repairs.
> +_mount "$loop2" "$mntpt" -o nouuid || _fail "cannot mount decoy fs"
> +
> +grep -w xfs /proc/mounts >> $seqres.full
> +
> +# Continue the healer process so it can handle events now.  Wait a few seconds
> +# while it fails to reopen disk1's mount point to repair things.
> +kill -CONT $XFS_HEALER_PID
> +sleep 2
> +
> +new_dir_unmount() {
> +	_unmount "$mntpt"
> +	_unmount "$mntpt"
> +}
> +
> +# Unmount to kill the healer
> +_kill_xfs_healer new_dir_unmount
> +echo "LOG AFTER FAILURE" >> $seqres.full
> +cat $tmp.healer >> $seqres.full
> +
> +# Did the healer log complaints about not being able to reopen the mountpoint
> +# to enact repairs?
> +grep -q 'Stale file handle' $tmp.healer || \
> +	echo "Should have seen stale file handle complaints"
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1901.out b/tests/xfs/1901.out
> new file mode 100755
> index 00000000000000..ff83e03725307a
> --- /dev/null
> +++ b/tests/xfs/1901.out
> @@ -0,0 +1,2 @@
> +QA output created by 1901
> +ls: reading directory 'MNTPT/some/victimdir': Structure needs cleaning
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 13/14] xfs: test xfs_healer background service
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (11 preceding siblings ...)
  2026-03-10  3:53   ` [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem Darrick J. Wong
@ 2026-03-10  3:53   ` Darrick J. Wong
  2026-03-13 19:56     ` Zorro Lang
  2026-03-10  3:53   ` [PATCH 14/14] xfs: test xfs_healer startup service Darrick J. Wong
  2026-03-12 14:21   ` [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves Darrick J. Wong
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:53 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that when xfs_healer can monitor and repair filesystems when it's
running as a systemd service, which is the intended usage model.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1902     |  152 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1902.out |    2 +
 2 files changed, 154 insertions(+)
 create mode 100755 tests/xfs/1902
 create mode 100755 tests/xfs/1902.out


diff --git a/tests/xfs/1902 b/tests/xfs/1902
new file mode 100755
index 00000000000000..6de2d602d52cdb
--- /dev/null
+++ b/tests/xfs/1902
@@ -0,0 +1,152 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1902
+#
+# Ensure that autonomous self healing fixes the filesystem correctly when
+# running in a systemd service
+#
+# unreliable_in_parallel: this test runs the xfs_healer systemd service, which
+# cannot be isolated to a specific testcase with the way check-parallel is
+# implemented.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing unreliable_in_parallel
+
+_cleanup()
+{
+	cd /
+	if [ -n "$new_svcfile" ]; then
+		rm -f "$new_svcfile"
+		systemctl daemon-reload
+	fi
+	rm -r -f $tmp.*
+}
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_require_systemd_is_running
+_require_systemd_unit_defined xfs_healer@.service
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $SCRATCH_MNT --repair
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+
+# Find the existing xfs_healer@ service definition, figure out where we're
+# going to land our test-specific override
+orig_svcfile="$(_systemd_unit_path "xfs_healer@-.service")"
+test -f "$orig_svcfile" || \
+	_notrun "cannot find xfs_healer@ service file"
+
+new_svcdir="$(_systemd_runtime_dir)"
+test -d "$new_svcdir" || \
+	_notrun "cannot find runtime systemd service dir"
+
+# We need to make some local mods to the xfs_healer@ service definition
+# so we fork it and create a new service just for this test.
+new_healer_template="xfs_healer_fstest@.service"
+new_healer_svc="$(_systemd_service_unit_path "$new_healer_template" "$SCRATCH_MNT")"
+_systemd_unit_status "$new_healer_svc" 2>&1 | \
+	grep -E -q '(could not be found|Loaded: not-found)' || \
+	_notrun "systemd service \"$new_healer_svc\" found, will not mess with this"
+
+new_svcfile="$new_svcdir/$new_healer_template"
+cp "$orig_svcfile" "$new_svcfile"
+
+# Pick up all the CLI args except for --repair and --no-autofsck because we're
+# going to force it to --autofsck below
+execargs="$(grep '^ExecStart=' $new_svcfile | \
+	    sed -e 's/^ExecStart=\S*//g' \
+	        -e 's/--no-autofsck//g' \
+		-e 's/--repair//g')"
+sed -e '/ExecStart=/d' -e '/BindPaths=/d' -e '/ExecCondition=/d' -i $new_svcfile
+cat >> "$new_svcfile" << ENDL
+
+[Service]
+ExecCondition=$XFS_HEALER_PROG --supported %f
+ExecStart=$XFS_HEALER_PROG $execargs
+ENDL
+_systemd_reload
+
+# Emit the results of our editing to the full log.
+systemctl cat "$new_healer_svc" >> $seqres.full
+
+# Remount, with service activation
+_scratch_mount
+
+old_healer_svc="$(_xfs_healer_svcname "$SCRATCH_MNT")"
+_systemd_unit_stop "$old_healer_svc" &>> $seqres.full
+_systemd_unit_start "$new_healer_svc" &>> $seqres.full
+
+_systemd_unit_status "$new_healer_svc" 2>&1 | grep -q 'Active: active' || \
+	echo "systemd service \"$new_healer_svc\" not running??"
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "try $try no longer saw corruption or gave up" >> $seqres.full
+_filter_scratch < $tmp.err
+
+# List the dirents of /victimdir to see if it stops reporting corruption
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "retry $try still saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+echo "retry $try no longer saw corruption or gave up" >> $seqres.full
+
+# Unmount to kill the healer
+_scratch_kill_xfs_healer
+journalctl -u "$new_healer_svc" >> $seqres.full
+
+status=0
+exit
diff --git a/tests/xfs/1902.out b/tests/xfs/1902.out
new file mode 100755
index 00000000000000..84f9b9e50e1e02
--- /dev/null
+++ b/tests/xfs/1902.out
@@ -0,0 +1,2 @@
+QA output created by 1902
+ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/14] xfs: test xfs_healer background service
  2026-03-10  3:53   ` [PATCH 13/14] xfs: test xfs_healer background service Darrick J. Wong
@ 2026-03-13 19:56     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:53:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that when xfs_healer can monitor and repair filesystems when it's
> running as a systemd service, which is the intended usage model.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1902     |  152 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1902.out |    2 +
>  2 files changed, 154 insertions(+)
>  create mode 100755 tests/xfs/1902
>  create mode 100755 tests/xfs/1902.out
> 
> 
> diff --git a/tests/xfs/1902 b/tests/xfs/1902
> new file mode 100755
> index 00000000000000..6de2d602d52cdb
> --- /dev/null
> +++ b/tests/xfs/1902
> @@ -0,0 +1,152 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1902
> +#
> +# Ensure that autonomous self healing fixes the filesystem correctly when
> +# running in a systemd service
> +#
> +# unreliable_in_parallel: this test runs the xfs_healer systemd service, which
> +# cannot be isolated to a specific testcase with the way check-parallel is
> +# implemented.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing unreliable_in_parallel
> +
> +_cleanup()
> +{
> +	cd /
> +	if [ -n "$new_svcfile" ]; then
> +		rm -f "$new_svcfile"
> +		systemctl daemon-reload
> +	fi
> +	rm -r -f $tmp.*
> +}
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_require_systemd_is_running
> +_require_systemd_unit_defined xfs_healer@.service
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +_require_scratch
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +
> +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $SCRATCH_MNT parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $SCRATCH_MNT --repair
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> +echo testdata > $SCRATCH_MNT/a
> +mkdir -p "$SCRATCH_MNT/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Break the directory
> +_scratch_unmount
> +_scratch_xfs_db -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> +
> +# Find the existing xfs_healer@ service definition, figure out where we're
> +# going to land our test-specific override
> +orig_svcfile="$(_systemd_unit_path "xfs_healer@-.service")"
> +test -f "$orig_svcfile" || \
> +	_notrun "cannot find xfs_healer@ service file"
> +
> +new_svcdir="$(_systemd_runtime_dir)"
> +test -d "$new_svcdir" || \
> +	_notrun "cannot find runtime systemd service dir"
> +
> +# We need to make some local mods to the xfs_healer@ service definition
> +# so we fork it and create a new service just for this test.
> +new_healer_template="xfs_healer_fstest@.service"
> +new_healer_svc="$(_systemd_service_unit_path "$new_healer_template" "$SCRATCH_MNT")"
> +_systemd_unit_status "$new_healer_svc" 2>&1 | \
> +	grep -E -q '(could not be found|Loaded: not-found)' || \
> +	_notrun "systemd service \"$new_healer_svc\" found, will not mess with this"
> +
> +new_svcfile="$new_svcdir/$new_healer_template"
> +cp "$orig_svcfile" "$new_svcfile"
> +
> +# Pick up all the CLI args except for --repair and --no-autofsck because we're
> +# going to force it to --autofsck below
> +execargs="$(grep '^ExecStart=' $new_svcfile | \
> +	    sed -e 's/^ExecStart=\S*//g' \
> +	        -e 's/--no-autofsck//g' \
> +		-e 's/--repair//g')"
> +sed -e '/ExecStart=/d' -e '/BindPaths=/d' -e '/ExecCondition=/d' -i $new_svcfile
> +cat >> "$new_svcfile" << ENDL
> +
> +[Service]
> +ExecCondition=$XFS_HEALER_PROG --supported %f
> +ExecStart=$XFS_HEALER_PROG $execargs
> +ENDL
> +_systemd_reload
> +
> +# Emit the results of our editing to the full log.
> +systemctl cat "$new_healer_svc" >> $seqres.full
> +
> +# Remount, with service activation
> +_scratch_mount
> +
> +old_healer_svc="$(_xfs_healer_svcname "$SCRATCH_MNT")"
> +_systemd_unit_stop "$old_healer_svc" &>> $seqres.full
> +_systemd_unit_start "$new_healer_svc" &>> $seqres.full
> +
> +_systemd_unit_status "$new_healer_svc" 2>&1 | grep -q 'Active: active' || \
> +	echo "systemd service \"$new_healer_svc\" not running??"
> +
> +# Access the broken directory to trigger a repair, then poll the directory
> +# for 5 seconds to see if it gets fixed without us needing to intervene.
> +ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +_filter_scratch < $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "try $try saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "try $try no longer saw corruption or gave up" >> $seqres.full
> +_filter_scratch < $tmp.err
> +
> +# List the dirents of /victimdir to see if it stops reporting corruption
> +ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +try=0
> +while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +	echo "retry $try still saw corruption" >> $seqres.full
> +	sleep 0.1
> +	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +	try=$((try + 1))
> +done
> +echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> +
> +# Unmount to kill the healer
> +_scratch_kill_xfs_healer
> +journalctl -u "$new_healer_svc" >> $seqres.full
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1902.out b/tests/xfs/1902.out
> new file mode 100755
> index 00000000000000..84f9b9e50e1e02
> --- /dev/null
> +++ b/tests/xfs/1902.out
> @@ -0,0 +1,2 @@
> +QA output created by 1902
> +ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 14/14] xfs: test xfs_healer startup service
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (12 preceding siblings ...)
  2026-03-10  3:53   ` [PATCH 13/14] xfs: test xfs_healer background service Darrick J. Wong
@ 2026-03-10  3:53   ` Darrick J. Wong
  2026-03-13 19:58     ` Zorro Lang
  2026-03-12 14:21   ` [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves Darrick J. Wong
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-10  3:53 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that xfs_healer_start can actually start up xfs_healer service
instances when a filesystem is mounted.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1903     |  124 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1903.out |    6 +++
 2 files changed, 130 insertions(+)
 create mode 100755 tests/xfs/1903
 create mode 100644 tests/xfs/1903.out


diff --git a/tests/xfs/1903 b/tests/xfs/1903
new file mode 100755
index 00000000000000..d71d75a6af3f9d
--- /dev/null
+++ b/tests/xfs/1903
@@ -0,0 +1,124 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1903
+#
+# Check that the xfs_healer startup service starts the per-mount xfs_healer
+# service for the scratch filesystem.  IOWs, this is basic testing for the
+# xfs_healer systemd background services.
+#
+
+# unreliable_in_parallel: this appears to try to run healer services on all
+# mounted filesystems - that's a problem when there are a hundred other test
+# filesystems mounted running other tests...
+
+. ./common/preamble
+_begin_fstest auto selfhealing unreliable_in_parallel
+
+_cleanup()
+{
+	cd /
+	test -n "$new_healerstart_svc" &&
+		_systemd_unit_stop "$new_healerstart_svc"
+	test -n "$was_masked" && \
+		_systemd_unit_mask "$healer_svc" &>> $seqres.full
+	if [ -n "$new_svcfile" ]; then
+		rm -f "$new_svcfile"
+		systemctl daemon-reload
+	fi
+	rm -r -f $tmp.*
+}
+
+. ./common/filter
+. ./common/populate
+. ./common/fuzzy
+. ./common/systemd
+
+_require_systemd_is_running
+_require_systemd_unit_defined xfs_healer@.service
+_require_systemd_unit_defined xfs_healer_start.service
+_require_scratch
+_require_scrub
+_require_xfs_io_command "scrub"
+_require_xfs_spaceman_command "health"
+_require_populate_commands
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command $ATTR_PROG "attr"
+
+_xfs_skip_online_rebuild
+_xfs_skip_offline_rebuild
+
+orig_svcfile="$(_systemd_unit_path "xfs_healer_start.service")"
+test -f "$orig_svcfile" || \
+	_notrun "cannot find xfs_healer_start service file"
+
+new_svcdir="$(_systemd_runtime_dir)"
+test -d "$new_svcdir" || \
+	_notrun "cannot find runtime systemd service dir"
+
+# We need to make some local mods to the xfs_healer_start service definition
+# so we fork it and create a new service just for this test.
+new_healerstart_svc="xfs_healer_start_fstest.service"
+_systemd_unit_status "$new_healerstart_svc" 2>&1 | \
+	grep -E -q '(could not be found|Loaded: not-found)' || \
+	_notrun "systemd service \"$new_healerstart_svc\" found, will not mess with this"
+
+find_healer_trace() {
+	local path="$1"
+
+	sleep 2		# wait for delays in startup
+	$XFS_HEALER_PROG --supported "$path" 2>&1 | grep -q 'already running' || \
+		echo "cannot find evidence that xfs_healer is running for $path"
+}
+
+echo "Format and populate"
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+_require_xfs_healer $SCRATCH_MNT
+
+# Configure the filesystem for background checks of the filesystem.
+$ATTR_PROG -R -s xfs:autofsck -V check $SCRATCH_MNT >> $seqres.full
+
+was_masked=
+healer_svc="$(_xfs_healer_svcname "$SCRATCH_MNT")"
+
+# Preserve the xfs_healer@ mask state -- we don't want this permanently
+# changing global state.
+if _systemd_unit_masked "$healer_svc"; then
+	_systemd_unit_unmask "$healer_svc" &>> $seqres.full
+	was_masked=1
+fi
+
+echo "Start healer on scratch FS"
+_systemd_unit_start "$healer_svc"
+find_healer_trace "$SCRATCH_MNT"
+_systemd_unit_stop "$healer_svc"
+
+new_svcfile="$new_svcdir/$new_healerstart_svc"
+cp "$orig_svcfile" "$new_svcfile"
+
+sed -e '/ExecStart=/d' -e '/BindPaths=/d' -e '/ExecCondition=/d' -i $new_svcfile
+cat >> "$new_svcfile" << ENDL
+[Service]
+ExecCondition=$XFS_HEALER_START_PROG --supported
+ExecStart=$XFS_HEALER_START_PROG
+ENDL
+_systemd_reload
+
+# Emit the results of our editing to the full log.
+systemctl cat "$new_healerstart_svc" >> $seqres.full
+
+echo "Start healer for everything"
+_systemd_unit_start "$new_healerstart_svc"
+find_healer_trace "$SCRATCH_MNT"
+
+echo "Restart healer for scratch FS"
+_scratch_cycle_mount
+find_healer_trace "$SCRATCH_MNT"
+
+echo "Healer testing done" | tee -a $seqres.full
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1903.out b/tests/xfs/1903.out
new file mode 100644
index 00000000000000..07810f60ca10c6
--- /dev/null
+++ b/tests/xfs/1903.out
@@ -0,0 +1,6 @@
+QA output created by 1903
+Format and populate
+Start healer on scratch FS
+Start healer for everything
+Restart healer for scratch FS
+Healer testing done


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 14/14] xfs: test xfs_healer startup service
  2026-03-10  3:53   ` [PATCH 14/14] xfs: test xfs_healer startup service Darrick J. Wong
@ 2026-03-13 19:58     ` Zorro Lang
  0 siblings, 0 replies; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 19:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: fstests, linux-xfs

On Mon, Mar 09, 2026 at 08:53:46PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that xfs_healer_start can actually start up xfs_healer service
> instances when a filesystem is mounted.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Tests and looks good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/xfs/1903     |  124 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1903.out |    6 +++
>  2 files changed, 130 insertions(+)
>  create mode 100755 tests/xfs/1903
>  create mode 100644 tests/xfs/1903.out
> 
> 
> diff --git a/tests/xfs/1903 b/tests/xfs/1903
> new file mode 100755
> index 00000000000000..d71d75a6af3f9d
> --- /dev/null
> +++ b/tests/xfs/1903
> @@ -0,0 +1,124 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test No. 1903
> +#
> +# Check that the xfs_healer startup service starts the per-mount xfs_healer
> +# service for the scratch filesystem.  IOWs, this is basic testing for the
> +# xfs_healer systemd background services.
> +#
> +
> +# unreliable_in_parallel: this appears to try to run healer services on all
> +# mounted filesystems - that's a problem when there are a hundred other test
> +# filesystems mounted running other tests...
> +
> +. ./common/preamble
> +_begin_fstest auto selfhealing unreliable_in_parallel
> +
> +_cleanup()
> +{
> +	cd /
> +	test -n "$new_healerstart_svc" &&
> +		_systemd_unit_stop "$new_healerstart_svc"
> +	test -n "$was_masked" && \
> +		_systemd_unit_mask "$healer_svc" &>> $seqres.full
> +	if [ -n "$new_svcfile" ]; then
> +		rm -f "$new_svcfile"
> +		systemctl daemon-reload
> +	fi
> +	rm -r -f $tmp.*
> +}
> +
> +. ./common/filter
> +. ./common/populate
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +_require_systemd_is_running
> +_require_systemd_unit_defined xfs_healer@.service
> +_require_systemd_unit_defined xfs_healer_start.service
> +_require_scratch
> +_require_scrub
> +_require_xfs_io_command "scrub"
> +_require_xfs_spaceman_command "health"
> +_require_populate_commands
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command $ATTR_PROG "attr"
> +
> +_xfs_skip_online_rebuild
> +_xfs_skip_offline_rebuild
> +
> +orig_svcfile="$(_systemd_unit_path "xfs_healer_start.service")"
> +test -f "$orig_svcfile" || \
> +	_notrun "cannot find xfs_healer_start service file"
> +
> +new_svcdir="$(_systemd_runtime_dir)"
> +test -d "$new_svcdir" || \
> +	_notrun "cannot find runtime systemd service dir"
> +
> +# We need to make some local mods to the xfs_healer_start service definition
> +# so we fork it and create a new service just for this test.
> +new_healerstart_svc="xfs_healer_start_fstest.service"
> +_systemd_unit_status "$new_healerstart_svc" 2>&1 | \
> +	grep -E -q '(could not be found|Loaded: not-found)' || \
> +	_notrun "systemd service \"$new_healerstart_svc\" found, will not mess with this"
> +
> +find_healer_trace() {
> +	local path="$1"
> +
> +	sleep 2		# wait for delays in startup
> +	$XFS_HEALER_PROG --supported "$path" 2>&1 | grep -q 'already running' || \
> +		echo "cannot find evidence that xfs_healer is running for $path"
> +}
> +
> +echo "Format and populate"
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +_require_xfs_healer $SCRATCH_MNT
> +
> +# Configure the filesystem for background checks of the filesystem.
> +$ATTR_PROG -R -s xfs:autofsck -V check $SCRATCH_MNT >> $seqres.full
> +
> +was_masked=
> +healer_svc="$(_xfs_healer_svcname "$SCRATCH_MNT")"
> +
> +# Preserve the xfs_healer@ mask state -- we don't want this permanently
> +# changing global state.
> +if _systemd_unit_masked "$healer_svc"; then
> +	_systemd_unit_unmask "$healer_svc" &>> $seqres.full
> +	was_masked=1
> +fi
> +
> +echo "Start healer on scratch FS"
> +_systemd_unit_start "$healer_svc"
> +find_healer_trace "$SCRATCH_MNT"
> +_systemd_unit_stop "$healer_svc"
> +
> +new_svcfile="$new_svcdir/$new_healerstart_svc"
> +cp "$orig_svcfile" "$new_svcfile"
> +
> +sed -e '/ExecStart=/d' -e '/BindPaths=/d' -e '/ExecCondition=/d' -i $new_svcfile
> +cat >> "$new_svcfile" << ENDL
> +[Service]
> +ExecCondition=$XFS_HEALER_START_PROG --supported
> +ExecStart=$XFS_HEALER_START_PROG
> +ENDL
> +_systemd_reload
> +
> +# Emit the results of our editing to the full log.
> +systemctl cat "$new_healerstart_svc" >> $seqres.full
> +
> +echo "Start healer for everything"
> +_systemd_unit_start "$new_healerstart_svc"
> +find_healer_trace "$SCRATCH_MNT"
> +
> +echo "Restart healer for scratch FS"
> +_scratch_cycle_mount
> +find_healer_trace "$SCRATCH_MNT"
> +
> +echo "Healer testing done" | tee -a $seqres.full
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/1903.out b/tests/xfs/1903.out
> new file mode 100644
> index 00000000000000..07810f60ca10c6
> --- /dev/null
> +++ b/tests/xfs/1903.out
> @@ -0,0 +1,6 @@
> +QA output created by 1903
> +Format and populate
> +Start healer on scratch FS
> +Start healer for everything
> +Restart healer for scratch FS
> +Healer testing done
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves
  2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
                     ` (13 preceding siblings ...)
  2026-03-10  3:53   ` [PATCH 14/14] xfs: test xfs_healer startup service Darrick J. Wong
@ 2026-03-12 14:21   ` Darrick J. Wong
  2026-03-13 20:05     ` Zorro Lang
  14 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-12 14:21 UTC (permalink / raw)
  To: zlang, Christoph Hellwig; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure that when xfs_healer needs to reopen a filesystem to repair
it, it can still find the filesystem even if it has been mount --move'd.
This requires a bunch of private namespace magic.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1904     |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1904.out |    3 +
 2 files changed, 132 insertions(+)
 create mode 100755 tests/xfs/1904
 create mode 100755 tests/xfs/1904.out

diff --git a/tests/xfs/1904 b/tests/xfs/1904
new file mode 100755
index 00000000000000..78e8f5dcb0e834
--- /dev/null
+++ b/tests/xfs/1904
@@ -0,0 +1,129 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1904
+#
+# Ensure that autonomous self healing fixes the filesystem correctly even if
+# the original mount has moved somewhere else via --move.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+if [ -n "$IN_MOUNTNS" ]; then
+	_mount --make-rprivate /
+	findmnt -o TARGET,PROPAGATION >> $seqres.full
+
+	_scratch_mount
+	_scratch_invoke_xfs_healer "$tmp.healer" --repair
+
+	# Move the scratch filesystem to a completely different mountpoint so that
+	# we can test if the healer can find it again.
+	new_dir=$TEST_DIR/moocow
+	mkdir -p $new_dir
+	_mount --move $SCRATCH_MNT $new_dir
+
+	df -t xfs >> $seqres.full
+
+	# Access the broken directory to trigger a repair, then poll the directory
+	# for 5 seconds to see if it gets fixed without us needing to intervene.
+	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+	_filter_scratch < $tmp.err | _filter_test_dir
+	try=0
+	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+		echo "try $try saw corruption" >> $seqres.full
+		sleep 0.1
+		ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+		try=$((try + 1))
+	done
+	echo "try $try no longer saw corruption or gave up" >> $seqres.full
+	_filter_scratch < $tmp.err | _filter_test_dir
+
+	# List the dirents of /victimdir to see if it stops reporting corruption
+	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
+	try=0
+	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+		echo "retry $try still saw corruption" >> $seqres.full
+		sleep 0.1
+		ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+		try=$((try + 1))
+	done
+	echo "retry $try no longer saw corruption or gave up" >> $seqres.full
+
+	new_dir_unmount() {
+		_unmount $new_dir
+	}
+
+	# Unmount to kill the healer
+	_scratch_kill_xfs_healer new_dir_unmount
+	cat $tmp.healer >> $seqres.full
+
+	# No need to clean up, the mount ns destructor will detach the
+	# filesystems for us.
+	exit
+fi
+
+_cleanup()
+{
+	command -v _kill_fsstress &>/dev/null && _kill_fsstress
+	cd /
+	rm -r -f $tmp.*
+	if [ -n "$new_dir" ]; then
+		_unmount "$new_dir" &>/dev/null
+		rm -rf "$new_dir"
+	fi
+}
+
+_require_unshare
+_require_test
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_HEALER_PROG" "xfs_healer"
+_require_command "$XFS_PROPERTY_PROG" "xfs_property"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT rmapbt || \
+	_notrun "reverse mapping required to test directory auto-repair"
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_require_xfs_healer $SCRATCH_MNT --repair
+
+# Configure the filesystem for automatic repair of the filesystem.
+$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+
+# mount --move only works if mount propagation is disabled, so we have to start
+# a subshell with a separate mount namespace, disable propagation for the
+# entire directory tree, and only then can we run our tests.
+IN_MOUNTNS=1 unshare -m bash "$0"
+
+status=0
+exit
diff --git a/tests/xfs/1904.out b/tests/xfs/1904.out
new file mode 100755
index 00000000000000..34a46298dd439a
--- /dev/null
+++ b/tests/xfs/1904.out
@@ -0,0 +1,3 @@
+QA output created by 1904
+QA output created by 1904
+ls: reading directory 'TEST_DIR/moocow/some/victimdir': Structure needs cleaning

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves
  2026-03-12 14:21   ` [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves Darrick J. Wong
@ 2026-03-13 20:05     ` Zorro Lang
  2026-03-13 23:41       ` Darrick J. Wong
  0 siblings, 1 reply; 45+ messages in thread
From: Zorro Lang @ 2026-03-13 20:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, fstests, linux-xfs

On Thu, Mar 12, 2026 at 07:21:30AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make sure that when xfs_healer needs to reopen a filesystem to repair
> it, it can still find the filesystem even if it has been mount --move'd.
> This requires a bunch of private namespace magic.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  tests/xfs/1904     |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/1904.out |    3 +
>  2 files changed, 132 insertions(+)
>  create mode 100755 tests/xfs/1904
>  create mode 100755 tests/xfs/1904.out
> 
> diff --git a/tests/xfs/1904 b/tests/xfs/1904
> new file mode 100755
> index 00000000000000..78e8f5dcb0e834
> --- /dev/null
> +++ b/tests/xfs/1904
> @@ -0,0 +1,129 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2026 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 1904
> +#
> +# Ensure that autonomous self healing fixes the filesystem correctly even if
> +# the original mount has moved somewhere else via --move.
> +#
> +. ./common/preamble
> +_begin_fstest auto selfhealing
> +
> +. ./common/filter
> +. ./common/fuzzy
> +. ./common/systemd
> +
> +if [ -n "$IN_MOUNTNS" ]; then
> +	_mount --make-rprivate /

I'd like to add this case and other cases related with mount propagation in this
patchset to "mount" group. I'll do it when I merge this patchset. Others look and
test good to me,

Reviewed-by: Zorro Lang <zlang@redhat.com>

> +	findmnt -o TARGET,PROPAGATION >> $seqres.full
> +
> +	_scratch_mount
> +	_scratch_invoke_xfs_healer "$tmp.healer" --repair
> +
> +	# Move the scratch filesystem to a completely different mountpoint so that
> +	# we can test if the healer can find it again.
> +	new_dir=$TEST_DIR/moocow
> +	mkdir -p $new_dir
> +	_mount --move $SCRATCH_MNT $new_dir
> +
> +	df -t xfs >> $seqres.full
> +
> +	# Access the broken directory to trigger a repair, then poll the directory
> +	# for 5 seconds to see if it gets fixed without us needing to intervene.
> +	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +	_filter_scratch < $tmp.err | _filter_test_dir
> +	try=0
> +	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +		echo "try $try saw corruption" >> $seqres.full
> +		sleep 0.1
> +		ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +		try=$((try + 1))
> +	done
> +	echo "try $try no longer saw corruption or gave up" >> $seqres.full
> +	_filter_scratch < $tmp.err | _filter_test_dir
> +
> +	# List the dirents of /victimdir to see if it stops reporting corruption
> +	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> +	try=0
> +	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> +		echo "retry $try still saw corruption" >> $seqres.full
> +		sleep 0.1
> +		ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> +		try=$((try + 1))
> +	done
> +	echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> +
> +	new_dir_unmount() {
> +		_unmount $new_dir
> +	}
> +
> +	# Unmount to kill the healer
> +	_scratch_kill_xfs_healer new_dir_unmount
> +	cat $tmp.healer >> $seqres.full
> +
> +	# No need to clean up, the mount ns destructor will detach the
> +	# filesystems for us.
> +	exit
> +fi
> +
> +_cleanup()
> +{
> +	command -v _kill_fsstress &>/dev/null && _kill_fsstress
> +	cd /
> +	rm -r -f $tmp.*
> +	if [ -n "$new_dir" ]; then
> +		_unmount "$new_dir" &>/dev/null
> +		rm -rf "$new_dir"
> +	fi
> +}
> +
> +_require_unshare
> +_require_test
> +_require_scrub
> +_require_xfs_io_command "repair"	# online repair support
> +_require_xfs_db_command "blocktrash"
> +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> +_require_scratch
> +
> +_scratch_mkfs >> $seqres.full
> +_scratch_mount
> +
> +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> +	_notrun "reverse mapping required to test directory auto-repair"
> +_xfs_has_feature $SCRATCH_MNT parent || \
> +	_notrun "parent pointers required to test directory auto-repair"
> +_require_xfs_healer $SCRATCH_MNT --repair
> +
> +# Configure the filesystem for automatic repair of the filesystem.
> +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> +
> +# Create a largeish directory
> +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> +echo testdata > $SCRATCH_MNT/a
> +mkdir -p "$SCRATCH_MNT/some/victimdir"
> +for ((i = 0; i < (dblksz / 255); i++)); do
> +	fname="$(printf "%0255d" "$i")"
> +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> +done
> +
> +# Did we get at least two dir blocks?
> +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> +
> +# Break the directory, remount filesystem
> +_scratch_unmount
> +_scratch_xfs_db -x \
> +	-c 'path /some/victimdir' \
> +	-c 'bmap' \
> +	-c 'dblock 1' \
> +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> +
> +# mount --move only works if mount propagation is disabled, so we have to start
> +# a subshell with a separate mount namespace, disable propagation for the
> +# entire directory tree, and only then can we run our tests.
> +IN_MOUNTNS=1 unshare -m bash "$0"
> +
> +status=0
> +exit
> diff --git a/tests/xfs/1904.out b/tests/xfs/1904.out
> new file mode 100755
> index 00000000000000..34a46298dd439a
> --- /dev/null
> +++ b/tests/xfs/1904.out
> @@ -0,0 +1,3 @@
> +QA output created by 1904
> +QA output created by 1904
> +ls: reading directory 'TEST_DIR/moocow/some/victimdir': Structure needs cleaning
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves
  2026-03-13 20:05     ` Zorro Lang
@ 2026-03-13 23:41       ` Darrick J. Wong
  0 siblings, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2026-03-13 23:41 UTC (permalink / raw)
  To: Zorro Lang; +Cc: Christoph Hellwig, fstests, linux-xfs

On Sat, Mar 14, 2026 at 04:05:53AM +0800, Zorro Lang wrote:
> On Thu, Mar 12, 2026 at 07:21:30AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Make sure that when xfs_healer needs to reopen a filesystem to repair
> > it, it can still find the filesystem even if it has been mount --move'd.
> > This requires a bunch of private namespace magic.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  tests/xfs/1904     |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/1904.out |    3 +
> >  2 files changed, 132 insertions(+)
> >  create mode 100755 tests/xfs/1904
> >  create mode 100755 tests/xfs/1904.out
> > 
> > diff --git a/tests/xfs/1904 b/tests/xfs/1904
> > new file mode 100755
> > index 00000000000000..78e8f5dcb0e834
> > --- /dev/null
> > +++ b/tests/xfs/1904
> > @@ -0,0 +1,129 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2026 Oracle.  All Rights Reserved.
> > +#
> > +# FS QA Test 1904
> > +#
> > +# Ensure that autonomous self healing fixes the filesystem correctly even if
> > +# the original mount has moved somewhere else via --move.
> > +#
> > +. ./common/preamble
> > +_begin_fstest auto selfhealing
> > +
> > +. ./common/filter
> > +. ./common/fuzzy
> > +. ./common/systemd
> > +
> > +if [ -n "$IN_MOUNTNS" ]; then
> > +	_mount --make-rprivate /
> 
> I'd like to add this case and other cases related with mount propagation in this
> patchset to "mount" group. I'll do it when I merge this patchset. Others look and
> test good to me,

<nod> That sounds reasonable.  Thanks for all the other minor touch-ups
that you applied before merging into patches-in-queue!

--D

> 
> Reviewed-by: Zorro Lang <zlang@redhat.com>
> 
> > +	findmnt -o TARGET,PROPAGATION >> $seqres.full
> > +
> > +	_scratch_mount
> > +	_scratch_invoke_xfs_healer "$tmp.healer" --repair
> > +
> > +	# Move the scratch filesystem to a completely different mountpoint so that
> > +	# we can test if the healer can find it again.
> > +	new_dir=$TEST_DIR/moocow
> > +	mkdir -p $new_dir
> > +	_mount --move $SCRATCH_MNT $new_dir
> > +
> > +	df -t xfs >> $seqres.full
> > +
> > +	# Access the broken directory to trigger a repair, then poll the directory
> > +	# for 5 seconds to see if it gets fixed without us needing to intervene.
> > +	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> > +	_filter_scratch < $tmp.err | _filter_test_dir
> > +	try=0
> > +	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> > +		echo "try $try saw corruption" >> $seqres.full
> > +		sleep 0.1
> > +		ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> > +		try=$((try + 1))
> > +	done
> > +	echo "try $try no longer saw corruption or gave up" >> $seqres.full
> > +	_filter_scratch < $tmp.err | _filter_test_dir
> > +
> > +	# List the dirents of /victimdir to see if it stops reporting corruption
> > +	ls $new_dir/some/victimdir > /dev/null 2> $tmp.err
> > +	try=0
> > +	while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
> > +		echo "retry $try still saw corruption" >> $seqres.full
> > +		sleep 0.1
> > +		ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
> > +		try=$((try + 1))
> > +	done
> > +	echo "retry $try no longer saw corruption or gave up" >> $seqres.full
> > +
> > +	new_dir_unmount() {
> > +		_unmount $new_dir
> > +	}
> > +
> > +	# Unmount to kill the healer
> > +	_scratch_kill_xfs_healer new_dir_unmount
> > +	cat $tmp.healer >> $seqres.full
> > +
> > +	# No need to clean up, the mount ns destructor will detach the
> > +	# filesystems for us.
> > +	exit
> > +fi
> > +
> > +_cleanup()
> > +{
> > +	command -v _kill_fsstress &>/dev/null && _kill_fsstress
> > +	cd /
> > +	rm -r -f $tmp.*
> > +	if [ -n "$new_dir" ]; then
> > +		_unmount "$new_dir" &>/dev/null
> > +		rm -rf "$new_dir"
> > +	fi
> > +}
> > +
> > +_require_unshare
> > +_require_test
> > +_require_scrub
> > +_require_xfs_io_command "repair"	# online repair support
> > +_require_xfs_db_command "blocktrash"
> > +_require_command "$XFS_HEALER_PROG" "xfs_healer"
> > +_require_command "$XFS_PROPERTY_PROG" "xfs_property"
> > +_require_scratch
> > +
> > +_scratch_mkfs >> $seqres.full
> > +_scratch_mount
> > +
> > +_xfs_has_feature $SCRATCH_MNT rmapbt || \
> > +	_notrun "reverse mapping required to test directory auto-repair"
> > +_xfs_has_feature $SCRATCH_MNT parent || \
> > +	_notrun "parent pointers required to test directory auto-repair"
> > +_require_xfs_healer $SCRATCH_MNT --repair
> > +
> > +# Configure the filesystem for automatic repair of the filesystem.
> > +$XFS_PROPERTY_PROG $SCRATCH_MNT set autofsck=repair >> $seqres.full
> > +
> > +# Create a largeish directory
> > +dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
> > +echo testdata > $SCRATCH_MNT/a
> > +mkdir -p "$SCRATCH_MNT/some/victimdir"
> > +for ((i = 0; i < (dblksz / 255); i++)); do
> > +	fname="$(printf "%0255d" "$i")"
> > +	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
> > +done
> > +
> > +# Did we get at least two dir blocks?
> > +dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
> > +test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
> > +
> > +# Break the directory, remount filesystem
> > +_scratch_unmount
> > +_scratch_xfs_db -x \
> > +	-c 'path /some/victimdir' \
> > +	-c 'bmap' \
> > +	-c 'dblock 1' \
> > +	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
> > +
> > +# mount --move only works if mount propagation is disabled, so we have to start
> > +# a subshell with a separate mount namespace, disable propagation for the
> > +# entire directory tree, and only then can we run our tests.
> > +IN_MOUNTNS=1 unshare -m bash "$0"
> > +
> > +status=0
> > +exit
> > diff --git a/tests/xfs/1904.out b/tests/xfs/1904.out
> > new file mode 100755
> > index 00000000000000..34a46298dd439a
> > --- /dev/null
> > +++ b/tests/xfs/1904.out
> > @@ -0,0 +1,3 @@
> > +QA output created by 1904
> > +QA output created by 1904
> > +ls: reading directory 'TEST_DIR/moocow/some/victimdir': Structure needs cleaning
> > 
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2026-03-17  3:43 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-10  3:38 [PATCHBOMB v9] xfsprogs: autonomous self healing of filesystems Darrick J. Wong
2026-03-10  3:42 ` [PATCHSET v9 1/2] fstests: test generic file IO error reporting Darrick J. Wong
2026-03-10  3:50   ` [PATCH 1/1] generic: test fsnotify filesystem " Darrick J. Wong
2026-03-10  7:07     ` Amir Goldstein
2026-03-13 18:01     ` Zorro Lang
2026-03-13 23:27       ` Darrick J. Wong
2026-03-16  9:08     ` Christoph Hellwig
2026-03-16 16:21       ` Darrick J. Wong
2026-03-16 18:40         ` Zorro Lang
2026-03-16 22:16           ` Darrick J. Wong
2026-03-17  3:43             ` Zorro Lang
2026-03-10  3:42 ` [PATCHSET v9 2/2] fstests: autonomous self healing of filesystems Darrick J. Wong
2026-03-10  3:50   ` [PATCH 01/14] xfs: test health monitoring code Darrick J. Wong
2026-03-13 18:18     ` Zorro Lang
2026-03-10  3:50   ` [PATCH 02/14] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
2026-03-13 18:35     ` Zorro Lang
2026-03-10  3:50   ` [PATCH 03/14] xfs: test io " Darrick J. Wong
2026-03-13 18:53     ` Zorro Lang
2026-03-10  3:51   ` [PATCH 04/14] xfs: set up common code for testing xfs_healer Darrick J. Wong
2026-03-13 19:04     ` Zorro Lang
2026-03-14 20:37     ` Zorro Lang
2026-03-15  4:51       ` Darrick J. Wong
2026-03-10  3:51   ` [PATCH 05/14] xfs: test xfs_healer's event handling Darrick J. Wong
2026-03-13 19:19     ` Zorro Lang
2026-03-10  3:51   ` [PATCH 06/14] xfs: test xfs_healer can fix a filesystem Darrick J. Wong
2026-03-13 19:28     ` Zorro Lang
2026-03-10  3:51   ` [PATCH 07/14] xfs: test xfs_healer can report file I/O errors Darrick J. Wong
2026-03-13 19:32     ` Zorro Lang
2026-03-10  3:52   ` [PATCH 08/14] xfs: test xfs_healer can report file media errors Darrick J. Wong
2026-03-13 19:36     ` Zorro Lang
2026-03-10  3:52   ` [PATCH 09/14] xfs: test xfs_healer can report filesystem shutdowns Darrick J. Wong
2026-03-13 19:45     ` Zorro Lang
2026-03-10  3:52   ` [PATCH 10/14] xfs: test xfs_healer can initiate full filesystem repairs Darrick J. Wong
2026-03-13 19:48     ` Zorro Lang
2026-03-10  3:52   ` [PATCH 11/14] xfs: test xfs_healer can follow mount moves Darrick J. Wong
2026-03-13 19:39     ` Zorro Lang
2026-03-10  3:53   ` [PATCH 12/14] xfs: test xfs_healer wont repair the wrong filesystem Darrick J. Wong
2026-03-13 19:53     ` Zorro Lang
2026-03-10  3:53   ` [PATCH 13/14] xfs: test xfs_healer background service Darrick J. Wong
2026-03-13 19:56     ` Zorro Lang
2026-03-10  3:53   ` [PATCH 14/14] xfs: test xfs_healer startup service Darrick J. Wong
2026-03-13 19:58     ` Zorro Lang
2026-03-12 14:21   ` [PATCH 15/14] xfs: test xfs_healer can follow private mntns mount moves Darrick J. Wong
2026-03-13 20:05     ` Zorro Lang
2026-03-13 23:41       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox