[NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing

public inbox for linux-xfs@vger.kernel.org

* [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
@ 2024-12-31 23:25 Darrick J. Wong
  2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
                   ` (15 more replies)
  0 siblings, 16 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:25 UTC (permalink / raw)
  To: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
	Christoph Hellwig
  Cc: xfs, greg.marsden, shirley.ma, konrad.wilk, fstests

Hi everyone,

Thank you all for helping get online repair, parent pointers, metadata
directory trees, and realtime allocation groups merged this year!  We
got a lot done in 2024.

Having sent pull requests to Carlos for the last pieces of the realtime
modernization project, I have exactly two worthwhile projects left in my
development trees!  The stuff here isn't necessarily in mergeable state
yet, but I still believe everyone ought to know what I'm up to.

The first project implements (somewhat buggily; I never quite got back
to dealing with moving eof blocks) free space defragmentation so that we
can meaningfully shrink filesystems; garbage collect regions of the
filesystem; or prepare for large allocations.  There's not much new
kernel code other than exporting refcounts and gaining the ability to
map free space.

The second project initiates filesystem self healing routines whenever
problems start to crop up, which means that it can run fully
autonomously in the background.  The monitoring system uses some
pseudo-file and seqbuf tricks that I lifted from kmo last winter.

Both of these projects are largely userspace code.

Also I threw in some xfs_repair code to do dangerous fs upgrades.
Nobody should use these, ever.

Maintainers: please do not merge, this is a dog-and-pony show to attract
developer attention.

--D

PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
patches.  And finally rebase fstests to 2024-12-08.


* [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
@ 2024-12-31 23:32 ` Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 1/1] xfs: Don't free EOF blocks on close when extent size hints are set Darrick J. Wong
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:32 UTC (permalink / raw)
  To: djwong, cem; +Cc: dchinner, linux-xfs

Hi all,

Here are a few patches, mostly from Dave, to make XFS more aggressive
about keeping post-eof speculative preallocations when closing files.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reduce-eofblocks-gc-on-close
---
Commits in this patchset:
 * xfs: Don't free EOF blocks on close when extent size hints are set
---
 fs/xfs/xfs_bmap_util.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)



* [PATCH 1/1] xfs: Don't free EOF blocks on close when extent size hints are set
  2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
@ 2024-12-31 23:36   ` Darrick J. Wong
  0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:36 UTC (permalink / raw)
  To: djwong, cem; +Cc: dchinner, linux-xfs

From: Dave Chinner <david@fromorbit.com>

When we have a workload that does open/write/close on files with
extent size hints set in parallel with other allocation, the file
becomes rapidly fragmented. This is due to close() calling
xfs_release() and removing the preallocated extent beyond EOF.  This
occurs for both buffered and direct writes that append to files with
extent size hints.

The existing open/write/close heuristic in xfs_release() does not
catch this as writes to files using extent size hints do not use
delayed allocation and hence do not leave delayed allocation blocks
allocated on the inode that can be detected in xfs_release(). Hence
XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release(), we can tell whether the inode has extent size
hints set and skip EOF block truncation. We add this check to
xfs_can_free_eofblocks() so that the post-EOF preallocated extent is
treated like intentional preallocation and persists unless it is
directly removed by userspace.
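
As a rough illustration of the decision described above, here is a toy
userspace model.  It is only a sketch: struct toy_inode and
toy_can_free_eofblocks() are hypothetical names invented for this example,
and the real xfs_can_free_eofblocks() performs several other checks that
are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few inode fields the check looks at. */
struct toy_inode {
	uint32_t	extsz_hint;	/* extent size hint in blocks, 0 if unset */
	bool		prealloc;	/* persistent preallocation flag is set */
	uint64_t	delayed_blks;	/* delalloc blocks still attached */
};

/*
 * Return true if post-EOF blocks may be trimmed.  Real extents behind an
 * extent size hint or a preallocation flag are kept, unless delalloc blocks
 * remain and have to be removed.
 */
static bool
toy_can_free_eofblocks(const struct toy_inode *ip)
{
	if ((ip->extsz_hint || ip->prealloc) && ip->delayed_blks == 0)
		return false;
	return true;
}
```

With this model, an inode that has only an extent size hint set keeps its
post-EOF extent across close, which is the behaviour the numbers below are
meant to demonstrate.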

Before:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 1002
/mnt/scratch/file.1: 1002
/mnt/scratch/file.2: 1002
/mnt/scratch/file.3: 1002
/mnt/scratch/file.4: 1002
/mnt/scratch/file.5: 1002
/mnt/scratch/file.6: 1002
/mnt/scratch/file.7: 1002

After:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 4
/mnt/scratch/file.1: 4
/mnt/scratch/file.2: 4
/mnt/scratch/file.3: 4
/mnt/scratch/file.4: 4
/mnt/scratch/file.5: 4
/mnt/scratch/file.6: 4
/mnt/scratch/file.7: 4

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index b0096ff91000ce..783349f2361ad3 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -527,8 +527,9 @@ xfs_can_free_eofblocks(
 	 * Do not free real extents in preallocated files unless the file has
 	 * delalloc blocks and we are forced to remove them.
 	 */
-	if ((ip->i_diflags & XFS_DIFLAG_PREALLOC) && !ip->i_delayed_blks)
-		return false;
+	if (xfs_get_extsz_hint(ip) || (ip->i_diflags & XFS_DIFLAG_PREALLOC))
+		if (ip->i_delayed_blks == 0)
+			return false;

 	/*
 	 * Do not try to free post-EOF blocks if EOF is beyond the end of the


* [PATCHSET RFC 2/5] xfs: noalloc allocation groups
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
  2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
@ 2024-12-31 23:32 ` Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
                     ` (4 more replies)
  2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
                   ` (13 subsequent siblings)
  15 siblings, 5 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:32 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

Hi all,

This series creates a new NOALLOC flag for allocation groups that causes
the block and inode allocators to look elsewhere when trying to
allocate resources.  This is either the first part of a patchset to
implement online shrinking (set noalloc on the last AGs, run fsr to move
the files and directories) or freeze-free rmapbt rebuilding (set
noalloc to prevent creation of new mappings, then hook deletion of old
mappings).  This is still totally a research project.
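
To make the idea concrete, here is a toy userspace model of an allocator
that looks elsewhere when an AG is marked noalloc.  This is a sketch under
my own assumptions, not the kernel code: struct toy_ag and toy_pick_ag()
are hypothetical names, and the real allocators weigh far more than a free
block count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG state; the names here are illustrative only. */
struct toy_ag {
	bool		noalloc;	/* hidden from the allocators */
	uint64_t	free_blocks;	/* free space left in this AG */
};

/*
 * Pick the first AG at or after @start (wrapping around) that is not marked
 * noalloc and has at least @want free blocks.  Returns the AG index, or -1
 * if no AG qualifies.
 */
static int
toy_pick_ag(const struct toy_ag *ags, size_t nr_ags, size_t start,
	    uint64_t want)
{
	size_t		i;

	for (i = 0; i < nr_ags; i++) {
		size_t	agno = (start + i) % nr_ags;

		if (ags[agno].noalloc)
			continue;
		if (ags[agno].free_blocks >= want)
			return (int)agno;
	}
	return -1;
}
```

In the shrink scenario, marking the last AGs noalloc in such a model means
new allocations land only in the AGs that will survive, no matter how much
free space the hidden AGs report.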

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=noalloc-ags

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=noalloc-ags
---
Commits in this patchset:
 * xfs: track deferred ops statistics
 * xfs: whine to dmesg when we encounter errors
 * xfs: create a noalloc mode for allocation groups
 * xfs: enable userspace to hide an AG from allocation
 * xfs: apply noalloc mode to inode allocations too
---
 fs/xfs/Kconfig              |   13 +++++
 fs/xfs/libxfs/xfs_ag.c      |  114 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ag.h      |    8 +++
 fs/xfs/libxfs/xfs_ag_resv.c |   27 +++++++++-
 fs/xfs/libxfs/xfs_defer.c   |   18 ++++++-
 fs/xfs/libxfs/xfs_fs.h      |    5 ++
 fs/xfs/libxfs/xfs_ialloc.c  |    3 +
 fs/xfs/scrub/btree.c        |   89 +++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/common.c       |  107 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h       |    1 
 fs/xfs/scrub/dabtree.c      |   24 +++++++++
 fs/xfs/scrub/fscounters.c   |    3 +
 fs/xfs/scrub/inode.c        |    4 ++
 fs/xfs/scrub/scrub.c        |   40 +++++++++++++++
 fs/xfs/scrub/trace.c        |   22 ++++++++
 fs/xfs/scrub/trace.h        |    2 +
 fs/xfs/xfs_fsops.c          |   10 +++-
 fs/xfs/xfs_globals.c        |    5 ++
 fs/xfs/xfs_ioctl.c          |    4 +-
 fs/xfs/xfs_super.c          |    1 
 fs/xfs/xfs_sysctl.h         |    1 
 fs/xfs/xfs_sysfs.c          |   32 ++++++++++++
 fs/xfs/xfs_trace.h          |   65 +++++++++++++++++++++++++
 fs/xfs/xfs_trans.c          |    3 +
 fs/xfs/xfs_trans.h          |    7 +++
 25 files changed, 599 insertions(+), 9 deletions(-)



* [PATCH 1/5] xfs: track deferred ops statistics
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:36   ` Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 2/5] xfs: whine to dmesg when we encounter errors Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:36 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Track some basic statistics on how hard we're pushing the defer ops.
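
For readers who want the bookkeeping at a glance, here is a small userspace
model of the three counters.  It is a sketch only: struct toy_dfops_stats
and the toy_dfops_*() helpers are hypothetical names that loosely mirror
t_dfops_nr, t_dfops_nr_max and t_dfops_finished, and they leave out the
transaction machinery entirely.

```c
#include <assert.h>

/* Hypothetical mirror of the three counters the patch adds. */
struct toy_dfops_stats {
	unsigned int	nr;		/* dfops currently attached */
	unsigned int	nr_max;		/* high-water mark of nr */
	unsigned int	finished;	/* dfops completed so far */
};

/* A new pending item was queued on the transaction. */
static void
toy_dfops_add(struct toy_dfops_stats *s)
{
	s->nr++;
}

/* Record the high-water mark, as done once per pass of the finish loop. */
static void
toy_dfops_sample(struct toy_dfops_stats *s)
{
	if (s->nr > s->nr_max)
		s->nr_max = s->nr;
}

/* One pending item completed; the caller must have queued it earlier. */
static void
toy_dfops_finish_one(struct toy_dfops_stats *s)
{
	s->nr--;
	s->finished++;
}

/* Cancelling drops everything still attached. */
static void
toy_dfops_cancel(struct toy_dfops_stats *s)
{
	s->nr = 0;
}
```

Note that the maximum is only sampled at specific points, so in this model
(and, as far as I can tell, in the patch) it is a per-pass high-water mark
and not an exact peak.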

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_defer.c |   18 +++++++++++++++++-
 fs/xfs/xfs_trace.h        |   19 +++++++++++++++++++
 fs/xfs/xfs_trans.c        |    3 +++
 fs/xfs/xfs_trans.h        |    7 +++++++
 4 files changed, 46 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 5b377cbbb1f7e0..236409a3333ea6 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -618,6 +618,8 @@ xfs_defer_finish_one(
 	/* Done with the dfp, free it. */
 	list_del(&dfp->dfp_list);
 	kmem_cache_free(xfs_defer_pending_cache, dfp);
+	tp->t_dfops_nr--;
+	tp->t_dfops_finished++;
 out:
 	if (ops->finish_cleanup)
 		ops->finish_cleanup(tp, state, error);
@@ -680,6 +682,9 @@ xfs_defer_finish_noroll(
 
 		list_splice_init(&(*tp)->t_dfops, &dop_pending);
 
+		(*tp)->t_dfops_nr_max = max((*tp)->t_dfops_nr,
+					    (*tp)->t_dfops_nr_max);
+
 		if (has_intents < 0) {
 			error = has_intents;
 			goto out_shutdown;
@@ -721,6 +726,7 @@ xfs_defer_finish_noroll(
 	xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE);
 	trace_xfs_defer_finish_error(*tp, error);
 	xfs_defer_cancel_list((*tp)->t_mountp, &dop_pending);
+	(*tp)->t_dfops_nr = 0;
 	xfs_defer_cancel(*tp);
 	return error;
 }
@@ -768,6 +774,7 @@ xfs_defer_cancel(
 	trace_xfs_defer_cancel(tp, _RET_IP_);
 	xfs_defer_trans_abort(tp, &tp->t_dfops);
 	xfs_defer_cancel_list(mp, &tp->t_dfops);
+	tp->t_dfops_nr = 0;
 }
 
 /*
@@ -853,8 +860,10 @@ xfs_defer_add(
 	}
 
 	dfp = xfs_defer_find_last(tp, ops);
-	if (!dfp || !xfs_defer_can_append(dfp, ops))
+	if (!dfp || !xfs_defer_can_append(dfp, ops)) {
 		dfp = xfs_defer_alloc(&tp->t_dfops, ops);
+		tp->t_dfops_nr++;
+	}
 
 	xfs_defer_add_item(dfp, li);
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, li);
@@ -879,6 +888,7 @@ xfs_defer_add_barrier(
 		return;
 
 	xfs_defer_alloc(&tp->t_dfops, &xfs_barrier_defer_type);
+	tp->t_dfops_nr++;
 
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL);
 }
@@ -939,6 +949,12 @@ xfs_defer_move(
 	struct xfs_trans	*stp)
 {
 	list_splice_init(&stp->t_dfops, &dtp->t_dfops);
+	dtp->t_dfops_nr += stp->t_dfops_nr;
+	dtp->t_dfops_nr_max = stp->t_dfops_nr_max;
+	dtp->t_dfops_finished = stp->t_dfops_finished;
+	stp->t_dfops_nr = 0;
+	stp->t_dfops_nr_max = 0;
+	stp->t_dfops_finished = 0;
 
 	/*
 	 * Low free space mode was historically controlled by a dfops field.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8d86a1e038cd5c..0352f432421598 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2880,6 +2880,25 @@ TRACE_EVENT(xfs_btree_free_block,
 /* deferred ops */
 struct xfs_defer_pending;
 
+TRACE_EVENT(xfs_defer_stats,
+	TP_PROTO(struct xfs_trans *tp),
+	TP_ARGS(tp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, max)
+		__field(unsigned int, finished)
+	),
+	TP_fast_assign(
+		__entry->dev = tp->t_mountp->m_super->s_dev;
+		__entry->max = tp->t_dfops_nr_max;
+		__entry->finished = tp->t_dfops_finished;
+	),
+	TP_printk("dev %d:%d max %u finished %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->max,
+		  __entry->finished)
+)
+
 DECLARE_EVENT_CLASS(xfs_defer_class,
 	TP_PROTO(struct xfs_trans *tp, unsigned long caller_ip),
 	TP_ARGS(tp, caller_ip),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f53f82456288e5..269cd4583a033d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -71,6 +71,9 @@ xfs_trans_free(
 	xfs_extent_busy_sort(&tp->t_busy);
 	xfs_extent_busy_clear(&tp->t_busy, false);
 
+	if (tp->t_dfops_finished > 0)
+		trace_xfs_defer_stats(tp);
+
 	trace_xfs_trans_free(tp, _RET_IP_);
 	xfs_trans_clear_context(tp);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 71c2e82e4dadff..cb037a669754eb 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -153,6 +153,13 @@ typedef struct xfs_trans {
 	struct list_head	t_busy;		/* list of busy extents */
 	struct list_head	t_dfops;	/* deferred operations */
 	unsigned long		t_pflags;	/* saved process flags state */
+
+	/* Count of deferred ops attached to transaction. */
+	unsigned int		t_dfops_nr;
+	/* Maximum t_dfops_nr seen in a loop. */
+	unsigned int		t_dfops_nr_max;
+	/* Number of dfops finished. */
+	unsigned int		t_dfops_finished;
 } xfs_trans_t;
 
 /*



* [PATCH 2/5] xfs: whine to dmesg when we encounter errors
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
@ 2024-12-31 23:36   ` Darrick J. Wong
  2024-12-31 23:37   ` [PATCH 3/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:36 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Forward everything scrub whines about to dmesg.
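
The pattern used here is a variadic logger gated by a runtime knob.  A
minimal userspace sketch of that pattern follows; toy_scrub_whine and
toy_whine() are hypothetical names, and the sketch formats into a caller
buffer with vsnprintf() where the kernel helper hands a struct va_format
to printk().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the runtime knob; off by default. */
static bool toy_scrub_whine;

/*
 * Format a complaint into @buf if whining is enabled.  Returns the length
 * the full message needs (as snprintf does), 0 when the knob is off, or a
 * negative value on a formatting error.
 */
static int
toy_whine(char *buf, size_t len, const char *fsname, const char *fmt, ...)
{
	va_list		args;
	int		used, ret;

	if (!toy_scrub_whine)
		return 0;

	/* Prefix every message with the filesystem name. */
	used = snprintf(buf, len, "XFS (%s): ", fsname);
	if (used < 0 || (size_t)used >= len)
		return used;

	va_start(args, fmt);
	ret = vsnprintf(buf + used, len - used, fmt, args);
	va_end(args);

	return ret < 0 ? ret : used + ret;
}
```

Checking the knob before touching the argument list keeps the disabled case
cheap, which matters because the call sites sit on every scrub error path.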

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/Kconfig         |   13 ++++++
 fs/xfs/scrub/btree.c   |   89 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/common.c  |  107 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h  |    1 
 fs/xfs/scrub/dabtree.c |   24 +++++++++++
 fs/xfs/scrub/inode.c   |    4 ++
 fs/xfs/scrub/scrub.c   |   40 ++++++++++++++++++
 fs/xfs/scrub/trace.c   |   22 ++++++++++
 fs/xfs/scrub/trace.h   |    2 +
 fs/xfs/xfs_globals.c   |    5 ++
 fs/xfs/xfs_sysctl.h    |    1 
 fs/xfs/xfs_sysfs.c     |   32 ++++++++++++++
 12 files changed, 338 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index fffd6fffdce0f0..5700bc671a0e92 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -172,6 +172,19 @@ config XFS_ONLINE_SCRUB_STATS
 
 	  If unsure, say N.
 
+config XFS_ONLINE_SCRUB_WHINE
+	bool "XFS online metadata verbose logging by default"
+	default n
+	depends on XFS_ONLINE_SCRUB
+	help
+	  If you say Y here, the kernel will by default log the outcomes of all
+	  scrub and repair operations, as well as any corruptions found.  This
+	  may slow down scrub due to printk logging overhead timers.
+
+	  This value can be changed by editing /sys/fs/xfs/debug/scrub_whine
+
+	  If unsure, say N.
+
 config XFS_ONLINE_REPAIR
 	bool "XFS online metadata repair support"
 	default n
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index fe678a0438bc5c..e455eef892faec 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -11,6 +11,8 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_ag.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
@@ -18,6 +20,62 @@
 
 /* btree scrubbing */
 
+/* Figure out which block the btree cursor was pointing to. */
+static inline xfs_fsblock_t
+xchk_btree_cur_fsbno(
+	struct xfs_btree_cur		*cur,
+	int				level)
+{
+	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
+		return XFS_DADDR_TO_FSB(cur->bc_mp,
+				xfs_buf_daddr(cur->bc_levels[level].bp));
+	else if (level == cur->bc_nlevels - 1 &&
+		 cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
+		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
+	else if (cur->bc_group)
+		return xfs_gbno_to_fsb(cur->bc_group, 0);
+	return NULLFSBLOCK;
+}
+
+static inline void
+process_error_whine(
+	struct xfs_scrub	*sc,
+	struct xfs_btree_cur	*cur,
+	int			level,
+	int			*error,
+	__u32			errflag,
+	void			*ret_ip)
+{
+	xfs_fsblock_t		fsbno = xchk_btree_cur_fsbno(cur, level);
+
+	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE) {
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+				cur->bc_ino.ip->i_ino,
+				cur->bc_ino.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				*error,
+				errflag,
+				ret_ip);
+		return;
+	}
+
+	xchk_whine(sc->mp, "type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			cur->bc_ops->name,
+			level,
+			cur->bc_levels[level].ptr,
+			XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+			*error,
+			errflag,
+			ret_ip);
+}
+
 /*
  * Check for btree operation errors.  See the section about handling
  * operational errors in common.c.
@@ -44,9 +102,13 @@ __xchk_btree_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		process_error_whine(sc, cur, level, error, errflag, ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			process_error_whine(sc, cur, level, error, errflag,
+					ret_ip);
 		if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
 			trace_xchk_ifork_btree_op_error(sc, cur, level,
 					*error, ret_ip);
@@ -91,12 +153,35 @@ __xchk_btree_set_corrupt(
 {
 	sc->sm->sm_flags |= errflag;
 
-	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
+	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE) {
+		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x errflag 0x%x ret_ip %pS",
+				cur->bc_ino.ip->i_ino,
+				cur->bc_ino.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				errflag,
+				ret_ip);
 		trace_xchk_ifork_btree_error(sc, cur, level,
 				ret_ip);
-	else
+	} else {
+		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
+		xchk_whine(sc->mp, "type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x errflag 0x%x ret_ip %pS",
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				errflag,
+				ret_ip);
 		trace_xchk_btree_error(sc, cur, level,
 				ret_ip);
+	}
 }
 
 void
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 28ad341df8eede..59c368c54a23f6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -105,9 +105,23 @@ __xchk_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+				xchk_type_string(sc->sm->sm_type),
+				agno,
+				bno,
+				*error,
+				errflag,
+				ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+					xchk_type_string(sc->sm->sm_type),
+					agno,
+					bno,
+					*error,
+					ret_ip);
 		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
 		break;
 	}
@@ -179,9 +193,25 @@ __xchk_fblock_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu error %d errflag 0x%x ret_ip %pS",
+				sc->ip->i_ino,
+				whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				offset,
+				*error,
+				errflag,
+				ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu error %d ret_ip %pS",
+					sc->ip->i_ino,
+					whichfork,
+					xchk_type_string(sc->sm->sm_type),
+					offset,
+					*error,
+					ret_ip);
 		trace_xchk_file_op_error(sc, whichfork, offset, *error,
 				ret_ip);
 		break;
@@ -253,6 +283,8 @@ xchk_set_corrupt(
 	struct xfs_scrub	*sc)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "type %s ret_ip %pS", xchk_type_string(sc->sm->sm_type),
+			__return_address);
 	trace_xchk_fs_error(sc, 0, __return_address);
 }
 
@@ -264,6 +296,11 @@ xchk_block_set_corrupt(
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), __return_address);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			__return_address);
 }
 
 #ifdef CONFIG_XFS_QUOTA
@@ -275,6 +312,8 @@ xchk_qcheck_set_corrupt(
 	xfs_dqid_t		id)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "type %s dqtype %u id %u ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), dqtype, id, __return_address);
 	trace_xchk_qcheck_error(sc, dqtype, id, __return_address);
 }
 #endif
@@ -287,6 +326,11 @@ xchk_block_xref_set_corrupt(
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), __return_address);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			__return_address);
 }
 
 /*
@@ -300,6 +344,8 @@ xchk_ino_set_corrupt(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_error(sc, ino, __return_address);
 }
 
@@ -310,6 +356,8 @@ xchk_ino_xref_set_corrupt(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_error(sc, ino, __return_address);
 }
 
@@ -321,6 +369,12 @@ xchk_fblock_set_corrupt(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_error(sc, whichfork, offset, __return_address);
 }
 
@@ -332,6 +386,12 @@ xchk_fblock_xref_set_corrupt(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_error(sc, whichfork, offset, __return_address);
 }
 
@@ -345,6 +405,8 @@ xchk_ino_set_warning(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_warning(sc, ino, __return_address);
 }
 
@@ -356,6 +418,12 @@ xchk_fblock_set_warning(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_warning(sc, whichfork, offset, __return_address);
 }
 
@@ -1219,6 +1287,10 @@ xchk_iget_for_scrubbing(
 out_cancel:
 	xchk_trans_cancel(sc);
 out_error:
+	xchk_whine(mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), agno,
+			XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), error,
+			__return_address);
 	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
 			error, __return_address);
 	return error;
@@ -1352,6 +1424,10 @@ xchk_should_check_xref(
 	}
 
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XFAIL;
+	xchk_whine(sc->mp, "type %s xref error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			*error,
+			__return_address);
 	trace_xchk_xref_error(sc, *error, __return_address);
 
 	/*
@@ -1383,6 +1459,11 @@ xchk_buffer_recheck(
 		return;
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), fa);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			fa);
 }
 
 static inline int
@@ -1735,3 +1816,29 @@ xchk_inode_count_blocks(
 	return xfs_bmap_count_blocks(sc->tp, sc->ip, whichfork, nextents,
 			count);
 }
+
+/* Complain about failures... */
+void
+xchk_whine(
+	const struct xfs_mount	*mp,
+	const char		*fmt,
+	...)
+{
+	struct va_format	vaf;
+	va_list			args;
+
+	if (!xfs_globals.scrub_whine)
+		return;
+
+	va_start(args, fmt);
+
+	vaf.fmt = fmt;
+	vaf.va = &args;
+
+	printk(KERN_INFO "XFS (%s) %pS: %pV\n", mp->m_super->s_id,
+			__return_address, &vaf);
+	va_end(args);
+
+	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
+		xfs_stack_trace();
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index bdcd40f0ec742c..4dc408b530153a 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -179,6 +179,7 @@ bool xchk_ilock_nowait(struct xfs_scrub *sc, unsigned int ilock_flags);
 void xchk_iunlock(struct xfs_scrub *sc, unsigned int ilock_flags);
 
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
+void xchk_whine(const struct xfs_mount *mp, const char *fmt, ...);
 
 /*
  * Grab the inode at @inum.  The caller must have created a scrub transaction
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index 056de4819f866d..ae64db9f0bba2b 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -47,9 +47,26 @@ xchk_da_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx error %d ret_ip %pS",
+				sc->ip->i_ino,
+				ds->dargs.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				xfs_dir2_da_to_db(ds->dargs.geo,
+					ds->state->path.blk[level].blkno),
+				*error,
+				__return_address);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx error %d ret_ip %pS",
+					sc->ip->i_ino,
+					ds->dargs.whichfork,
+					xchk_type_string(sc->sm->sm_type),
+					xfs_dir2_da_to_db(ds->dargs.geo,
+						ds->state->path.blk[level].blkno),
+					*error,
+					__return_address);
 		trace_xchk_file_op_error(sc, ds->dargs.whichfork,
 				xfs_dir2_da_to_db(ds->dargs.geo,
 					ds->state->path.blk[level].blkno),
@@ -72,6 +89,13 @@ xchk_da_set_corrupt(
 
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx ret_ip %pS",
+			sc->ip->i_ino,
+			ds->dargs.whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
 	trace_xchk_fblock_error(sc, ds->dargs.whichfork,
 			xfs_dir2_da_to_db(ds->dargs.geo,
 				ds->state->path.blk[level].blkno),
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index bb3f475b63532e..a93f63b6b518ff 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -218,6 +218,10 @@ xchk_setup_inode(
 out_cancel:
 	xchk_trans_cancel(sc);
 out_error:
+	xchk_whine(mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), agno,
+			XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), error,
+			__return_address);
 	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
 			error, __return_address);
 	return error;
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1a05c27ba47197..d3a4ddd918f621 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -639,6 +639,45 @@ xchk_scrub_create_subord(
 	return sub;
 }
 
+static inline void
+repair_outcomes(struct xfs_scrub *sc, int error)
+{
+	struct xfs_scrub_metadata *sm = sc->sm;
+	const char *wut = NULL;
+
+	if (!xfs_globals.scrub_whine)
+		return;
+
+	if (sc->flags & XREP_ALREADY_FIXED) {
+		wut = "*** REPAIR SUCCESS";
+		error = 0;
+	} else if (error == -EBUSY) {
+		wut = "??? FILESYSTEM BUSY";
+	} else if (error == -EAGAIN) {
+		wut = "??? REPAIR DEFERRED";
+	} else if (error == -ECANCELED) {
+		wut = "??? REPAIR CANCELLED";
+	} else if (error == -EINTR) {
+		wut = "??? REPAIR INTERRUPTED";
+	} else if (error != -EOPNOTSUPP && error != -ENOENT) {
+		wut = "!!! REPAIR FAILED";
+		xfs_info(sc->mp,
+"%s ino 0x%llx type %s agno 0x%x inum 0x%llx gen 0x%x flags 0x%x error %d",
+				wut, XFS_I(file_inode(sc->file))->i_ino,
+				xchk_type_string(sm->sm_type), sm->sm_agno,
+				sm->sm_ino, sm->sm_gen, sm->sm_flags, error);
+		return;
+	} else {
+		return;
+	}
+
+	xfs_info_ratelimited(sc->mp,
+"%s ino 0x%llx type %s agno 0x%x inum 0x%llx gen 0x%x flags 0x%x error %d",
+			wut, XFS_I(file_inode(sc->file))->i_ino,
+			xchk_type_string(sm->sm_type), sm->sm_agno, sm->sm_ino,
+			sm->sm_gen, sm->sm_flags, error);
+}
+
 /* Dispatch metadata scrubbing. */
 STATIC int
 xfs_scrub_metadata(
@@ -735,6 +774,7 @@ xfs_scrub_metadata(
 		 * already tried to fix it, then attempt a repair.
 		 */
 		error = xrep_attempt(sc, &run);
+		repair_outcomes(sc, error);
 		if (error == -EAGAIN) {
 			/*
 			 * Either the repair function succeeded or it couldn't
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 2450e214103fed..4ea790e4063df7 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -58,3 +58,25 @@ xchk_btree_cur_fsbno(
  */
 #define CREATE_TRACE_POINTS
 #include "scrub/trace.h"
+
+/* xchk_whine stuff */
+struct xchk_tstr {
+	unsigned int	type;
+	const char	*tag;
+};
+
+static const struct xchk_tstr xchk_tstr_tags[] = { XFS_SCRUB_TYPE_STRINGS };
+
+const char *
+xchk_type_string(
+	unsigned int	type)
+{
+	unsigned int	i;
+
+	for (i = 0; i < ARRAY_SIZE(xchk_tstr_tags); i++) {
+		if (xchk_tstr_tags[i].type == type)
+			return xchk_tstr_tags[i].tag;
+	}
+
+	return "???";
+}
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d7c4ced47c1567..69d9b0a336dbc5 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -115,6 +115,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_RTREFCBT);
 	{ XFS_SCRUB_TYPE_RTRMAPBT,	"rtrmapbt" }, \
 	{ XFS_SCRUB_TYPE_RTREFCBT,	"rtrefcountbt" }
 
+const char *xchk_type_string(unsigned int type);
+
 #define XFS_SCRUB_FLAG_STRINGS \
 	{ XFS_SCRUB_IFLAG_REPAIR,		"repair" }, \
 	{ XFS_SCRUB_OFLAG_CORRUPT,		"corrupt" }, \
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f18fec0adf6662..f5fe896b9a8ec0 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -44,6 +44,11 @@ struct xfs_globals xfs_globals = {
 	.pwork_threads		=	-1,	/* automatic thread detection */
 	.larp			=	false,	/* log attribute replay */
 #endif
+#ifdef CONFIG_XFS_ONLINE_SCRUB_WHINE
+	.scrub_whine		=	true,
+#else
+	.scrub_whine		=	false,
+#endif
 
 	/*
 	 * Leave this many record slots empty when bulk loading btrees.  By
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 276696a07040c8..b0939ac370fba1 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -91,6 +91,7 @@ struct xfs_globals {
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
 	bool	always_cow;		/* use COW fork for all overwrites */
+	bool	scrub_whine;		/* noisier output from scrub */
 };
 extern struct xfs_globals	xfs_globals;
 
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 60cb5318fdae3c..0ce31517e3cd89 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -260,6 +260,37 @@ larp_show(
 }
 XFS_SYSFS_ATTR_RW(larp);
 
+/* Logging of the outcomes of everything that scrub does */
+STATIC ssize_t
+scrub_whine_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0 || val > 1)
+		return -EINVAL;
+
+	xfs_globals.scrub_whine = val;
+
+	return count;
+}
+
+STATIC ssize_t
+scrub_whine_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return sysfs_emit(buf, "%d\n", xfs_globals.scrub_whine);
+}
+XFS_SYSFS_ATTR_RW(scrub_whine);
+
 STATIC ssize_t
 bload_leaf_slack_store(
 	struct kobject	*kobject,
@@ -319,6 +350,7 @@ static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(always_cow),
 	ATTR_LIST(pwork_threads),
 	ATTR_LIST(larp),
+	ATTR_LIST(scrub_whine),
 	ATTR_LIST(bload_leaf_slack),
 	ATTR_LIST(bload_node_slack),
 	NULL,


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 3/5] xfs: create a noalloc mode for allocation groups
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
  2024-12-31 23:36   ` [PATCH 2/5] xfs: whine to dmesg when we encounter errors Darrick J. Wong
@ 2024-12-31 23:37   ` Darrick J. Wong
  2024-12-31 23:37   ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
  2024-12-31 23:37   ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG.  We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.

Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
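Illustration only, not part of the patch: the accounting described in the
commit message can be modelled in userspace roughly as below.  The model_*
names and the two structs are hypothetical stand-ins for the per-AG and
per-mount counters; only the arithmetic follows xfs_ag_fdblocks().  The -1
return merely stands in for whatever error xfs_dec_fdblocks() reports when
the filesystem cannot give up that many free blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the in-core per-AG counters. */
struct model_ag {
	uint32_t	freeblks;	/* free blocks tracked by the bnobt */
	uint32_t	flcount;	/* blocks on the AG free list */
	uint32_t	btreeblks;	/* blocks used by free space btrees */
	uint32_t	meta_resv;	/* metadata reservation */
	uint32_t	rmap_resv;	/* rmapbt reservation */
	bool		noalloc;	/* models XFS_AGSTATE_NOALLOC */
};

/* Hypothetical stand-in for the filesystem-wide free block counter. */
struct model_fs {
	int64_t		fdblocks;
};

/* Same arithmetic as xfs_ag_fdblocks() in the patch. */
static uint32_t model_ag_fdblocks(const struct model_ag *ag)
{
	/* The reservations must fit inside what the AG has free. */
	assert(ag->meta_resv + ag->rmap_resv <=
	       ag->freeblks + ag->flcount + ag->btreeblks);

	return ag->freeblks + ag->flcount + ag->btreeblks -
	       ag->meta_resv - ag->rmap_resv;
}

/* Hide the AG by removing its contribution from fdblocks. */
static int model_set_noalloc(struct model_fs *fs, struct model_ag *ag)
{
	uint32_t	hide = model_ag_fdblocks(ag);

	if (ag->noalloc)
		return 0;
	if (fs->fdblocks < (int64_t)hide)
		return -1;	/* not enough free space to hide */

	fs->fdblocks -= hide;
	ag->noalloc = true;
	return 0;
}

/* Free blocks in the AG; fdblocks only moves while the AG is visible. */
static void model_free_extent(struct model_fs *fs, struct model_ag *ag,
			      uint32_t len)
{
	ag->freeblks += len;
	if (!ag->noalloc)
		fs->fdblocks += len;
}

/* Unhide the AG by returning whatever it contributes right now. */
static void model_clear_noalloc(struct model_fs *fs, struct model_ag *ag)
{
	if (!ag->noalloc)
		return;

	fs->fdblocks += model_ag_fdblocks(ag);
	ag->noalloc = false;
}
```

The point of the model is that blocks freed while the AG is hidden are not
lost: they are folded back into fdblocks when noalloc mode is lifted, because
the contribution is recomputed at that time.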
 fs/xfs/libxfs/xfs_ag.c      |   60 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ag.h      |    8 ++++++
 fs/xfs/libxfs/xfs_ag_resv.c |   27 +++++++++++++++++--
 fs/xfs/scrub/fscounters.c   |    3 +-
 fs/xfs/xfs_fsops.c          |   10 ++++++-
 fs/xfs/xfs_super.c          |    1 +
 fs/xfs/xfs_trace.h          |   46 +++++++++++++++++++++++++++++++++
 7 files changed, 150 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index b59cb461e096ea..1e65cd981afd49 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -976,3 +976,63 @@ xfs_ag_get_geometry(
 	xfs_buf_relse(agi_bp);
 	return error;
 }
+
+/* How many blocks does this AG contribute to fdblocks? */
+xfs_extlen_t
+xfs_ag_fdblocks(
+	struct xfs_perag		*pag)
+{
+	xfs_extlen_t			ret;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+
+	ret = pag->pagf_freeblks + pag->pagf_flcount + pag->pagf_btreeblks;
+	ret -= pag->pag_meta_resv.ar_reserved;
+	ret -= pag->pag_rmapbt_resv.ar_orig_reserved;
+	return ret;
+}
+
+/*
+ * Hide all the free space in this AG.  Caller must hold both the AGI and the
+ * AGF buffers or have otherwise prevented concurrent access.
+ */
+int
+xfs_ag_set_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+	int			error;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+	ASSERT(xfs_perag_initialised_agi(pag));
+
+	if (xfs_perag_prohibits_alloc(pag))
+		return 0;
+
+	error = xfs_dec_fdblocks(mp, xfs_ag_fdblocks(pag), false);
+	if (error)
+		return error;
+
+	trace_xfs_ag_set_noalloc(pag);
+	set_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+	return 0;
+}
+
+/*
+ * Unhide all the free space in this AG.  Caller must hold both the AGI and
+ * the AGF buffers or have otherwise prevented concurrent access.
+ */
+void
+xfs_ag_clear_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+
+	if (!xfs_perag_prohibits_alloc(pag))
+		return;
+
+	xfs_add_fdblocks(mp, xfs_ag_fdblocks(pag));
+
+	trace_xfs_ag_clear_noalloc(pag);
+	clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 1f24cfa2732172..e8fae59206d929 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -120,6 +120,7 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag)
 #define XFS_AGSTATE_PREFERS_METADATA	2
 #define XFS_AGSTATE_ALLOWS_INODES	3
 #define XFS_AGSTATE_AGFL_NEEDS_RESET	4
+#define XFS_AGSTATE_NOALLOC		5
 
 #define __XFS_AG_OPSTATE(name, NAME) \
 static inline bool xfs_perag_ ## name (struct xfs_perag *pag) \
@@ -132,6 +133,7 @@ __XFS_AG_OPSTATE(initialised_agi, AGI_INIT)
 __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
 __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
 __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
+__XFS_AG_OPSTATE(prohibits_alloc, NOALLOC)
 
 int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t orig_agcount,
 		xfs_agnumber_t new_agcount, xfs_rfsblock_t dcount,
@@ -164,6 +166,7 @@ xfs_perag_put(
 	xfs_group_put(pag_group(pag));
 }
 
+
 /* Active AG references */
 static inline struct xfs_perag *
 xfs_perag_grab(
@@ -208,6 +211,11 @@ xfs_perag_next(
 	return xfs_perag_next_from(mp, pag, 0);
 }
 
+/* Enable or disable allocation from an AG */
+xfs_extlen_t xfs_ag_fdblocks(struct xfs_perag *pag);
+int xfs_ag_set_noalloc(struct xfs_perag *pag);
+void xfs_ag_clear_noalloc(struct xfs_perag *pag);
+
 /*
  * Per-ag geometry infomation and validation
  */
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index fb79215a509d21..fda3d7614838e7 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -74,6 +74,13 @@ xfs_ag_resv_critical(
 	xfs_extlen_t			avail;
 	xfs_extlen_t			orig;
 
+	/*
+	 * Pretend we're critically low on reservations in this AG to scare
+	 * everyone else away.
+	 */
+	if (xfs_perag_prohibits_alloc(pag))
+		return true;
+
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
 		avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
@@ -116,7 +123,12 @@ xfs_ag_resv_needed(
 		break;
 	case XFS_AG_RESV_METAFILE:
 	case XFS_AG_RESV_NONE:
-		/* empty */
+		/*
+		 * In noalloc mode, we pretend that all the free blocks in this
+		 * AG have been allocated.  Make this AG look full.
+		 */
+		if (xfs_perag_prohibits_alloc(pag))
+			len += xfs_ag_fdblocks(pag);
 		break;
 	default:
 		ASSERT(0);
@@ -344,6 +356,8 @@ xfs_ag_resv_alloc_extent(
 	xfs_extlen_t			len;
 	uint				field;
 
+	ASSERT(type != XFS_AG_RESV_NONE || !xfs_perag_prohibits_alloc(pag));
+
 	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
 
 	switch (type) {
@@ -401,7 +415,14 @@ xfs_ag_resv_free_extent(
 		ASSERT(0);
 		fallthrough;
 	case XFS_AG_RESV_NONE:
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		/*
+		 * Normally we put freed blocks back into fdblocks.  In noalloc
+		 * mode, however, we pretend that there are no fdblocks in the
+		 * AG, so don't put them back.
+		 */
+		if (!xfs_perag_prohibits_alloc(pag))
+			xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS,
+					(int64_t)len);
 		fallthrough;
 	case XFS_AG_RESV_IGNORE:
 		return;
@@ -414,6 +435,6 @@ xfs_ag_resv_free_extent(
 	/* Freeing into the reserved pool only requires on-disk update... */
 	xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
 	/* ...but freeing beyond that requires in-core and on-disk update. */
-	if (len > leftover)
+	if (len > leftover && !xfs_perag_prohibits_alloc(pag))
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover);
 }
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index f7258544848fcd..af69ed7733acd6 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -337,7 +337,8 @@ xchk_fscount_aggregate_agcounts(
 		 */
 		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
 		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
-
+		if (xfs_perag_prohibits_alloc(pag))
+			fsc->fdblocks -= xfs_ag_fdblocks(pag);
 	}
 	if (pag)
 		xfs_perag_rele(pag);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 8dc2b738c911ee..150979c8333530 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -592,6 +592,14 @@ xfs_fs_unreserve_ag_blocks(
 	if (xfs_has_realtime(mp))
 		xfs_rt_resv_free(mp);
 
-	while ((pag = xfs_perag_next(mp, pag)))
+	while ((pag = xfs_perag_next(mp, pag))) {
+		/*
+		 * Bring the AG back online because our AG hiding only exists
+		 * in-core and we need the superblock to be written out with
+		 * the super fdblocks reflecting the AGF freeblks.  Do this
+		 * before adding the per-AG reservations back to fdblocks.
+		 */
+		xfs_ag_clear_noalloc(pag);
 		xfs_ag_resv_free(pag);
+	}
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e1554f061376e5..099c30339e8f9d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -336,6 +336,7 @@ xfs_set_inode_alloc(
 		pag = xfs_perag_get(mp, index);
 		if (xfs_set_inode_alloc_perag(pag, ino, max_metadata))
 			maxagi++;
+		clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
 		xfs_perag_put(pag);
 	}
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0352f432421598..dc7ffc8f8e9dea 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4589,6 +4589,52 @@ DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_corrupt);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_unfixed_corruption);
 
+DECLARE_EVENT_CLASS(xfs_ag_noalloc_class,
+	TP_PROTO(struct xfs_perag *pag),
+	TP_ARGS(pag),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_extlen_t, freeblks)
+		__field(xfs_extlen_t, flcount)
+		__field(xfs_extlen_t, btreeblks)
+		__field(xfs_extlen_t, meta_resv)
+		__field(xfs_extlen_t, rmap_resv)
+
+		__field(unsigned long long, resblks)
+		__field(unsigned long long, resblks_avail)
+	),
+	TP_fast_assign(
+		__entry->dev = pag_mount(pag)->m_super->s_dev;
+		__entry->agno = pag_agno(pag);
+		__entry->freeblks = pag->pagf_freeblks;
+		__entry->flcount = pag->pagf_flcount;
+		__entry->btreeblks = pag->pagf_btreeblks;
+		__entry->meta_resv = pag->pag_meta_resv.ar_reserved;
+		__entry->rmap_resv = pag->pag_rmapbt_resv.ar_orig_reserved;
+
+		__entry->resblks = pag_mount(pag)->m_resblks[XC_FREE_BLOCKS].total;
+		__entry->resblks_avail = pag_mount(pag)->m_resblks[XC_FREE_BLOCKS].avail;
+	),
+	TP_printk("dev %d:%d agno 0x%x freeblks %u flcount %u btreeblks %u metaresv %u rmapresv %u resblks %llu resblks_avail %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->freeblks,
+		  __entry->flcount,
+		  __entry->btreeblks,
+		  __entry->meta_resv,
+		  __entry->rmap_resv,
+		  __entry->resblks,
+		  __entry->resblks_avail)
+);
+#define DEFINE_AG_NOALLOC_EVENT(name)	\
+DEFINE_EVENT(xfs_ag_noalloc_class, name,	\
+	TP_PROTO(struct xfs_perag *pag),	\
+	TP_ARGS(pag))
+
+DEFINE_AG_NOALLOC_EVENT(xfs_ag_set_noalloc);
+DEFINE_AG_NOALLOC_EVENT(xfs_ag_clear_noalloc);
+
 TRACE_EVENT(xfs_iwalk_ag_rec,
 	TP_PROTO(const struct xfs_perag *pag, \
 		 struct xfs_inobt_rec_incore *irec),


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 4/5] xfs: enable userspace to hide an AG from allocation
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:37   ` [PATCH 3/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
@ 2024-12-31 23:37   ` Darrick J. Wong
  2024-12-31 23:37   ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an administrative interface so that userspace can hide an allocation
group from block allocation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
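Illustration only, not part of the patch: a rough userspace model of how the
ag_flags field of an AG geometry request appears intended to be interpreted.
The flag values are copied from the xfs_fs.h hunk below; model_ag_flags() and
its negative return codes are hypothetical, and the model ignores the case
where hiding the AG itself fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values copied from the patch. */
#define XFS_AG_FLAG_UPDATE	(1u << 0)	/* update flags */
#define XFS_AG_FLAG_NOALLOC	(1u << 1)	/* do not allocate from AG */
#define XFS_AG_FLAG_ALL		(XFS_AG_FLAG_UPDATE | XFS_AG_FLAG_NOALLOC)

/*
 * Hypothetical model of the request handling.  *noalloc is the current AG
 * state.  Returns the flags reported back to the caller, or a negative
 * value if the request is rejected.
 */
static int model_ag_flags(uint32_t in_flags, bool is_admin, bool *noalloc)
{
	assert(noalloc != NULL);

	/* Unknown flag bits are rejected outright. */
	if (in_flags & ~XFS_AG_FLAG_ALL)
		return -1;	/* stands in for -EINVAL */

	/* Only a privileged caller may change the AG state. */
	if ((in_flags & XFS_AG_FLAG_UPDATE) && !is_admin)
		return -2;	/* stands in for -EPERM */

	/* Without UPDATE this is a pure query and the state is untouched. */
	if (in_flags & XFS_AG_FLAG_UPDATE)
		*noalloc = (in_flags & XFS_AG_FLAG_NOALLOC) != 0;

	return *noalloc ? (int)XFS_AG_FLAG_NOALLOC : 0;
}
```

In this reading, hiding an AG means passing UPDATE | NOALLOC, unhiding it
means passing UPDATE alone, and passing no flags leaves the call behaving as
a plain geometry query.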
 fs/xfs/libxfs/xfs_ag.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h |    5 ++++
 fs/xfs/xfs_ioctl.c     |    4 +++-
 3 files changed, 62 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 1e65cd981afd49..c538a5bfb4e330 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -932,6 +932,54 @@ xfs_ag_extend_space(
 	return 0;
 }
 
+/* Compute the AG geometry flags. */
+static inline uint32_t
+xfs_ag_calc_geoflags(
+	struct xfs_perag	*pag)
+{
+	uint32_t		ret = 0;
+
+	if (xfs_perag_prohibits_alloc(pag))
+		ret |= XFS_AG_FLAG_NOALLOC;
+
+	return ret;
+}
+
+/*
+ * Compare the current AG geometry flags against the flags in the AG geometry
+ * structure and update the AG state to reflect any changes, then update the
+ * struct to reflect the current status.
+ */
+static inline int
+xfs_ag_update_geoflags(
+	struct xfs_perag	*pag,
+	struct xfs_ag_geometry	*ageo,
+	uint32_t		new_flags)
+{
+	uint32_t		old_flags = xfs_ag_calc_geoflags(pag);
+	int			error;
+
+	if (!(new_flags & XFS_AG_FLAG_UPDATE)) {
+		ageo->ag_flags = old_flags;
+		return 0;
+	}
+
+	if ((old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    !(new_flags & XFS_AG_FLAG_NOALLOC)) {
+		xfs_ag_clear_noalloc(pag);
+	}
+
+	if (!(old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    (new_flags & XFS_AG_FLAG_NOALLOC)) {
+		error = xfs_ag_set_noalloc(pag);
+		if (error)
+			return error;
+	}
+
+	ageo->ag_flags = xfs_ag_calc_geoflags(pag);
+	return 0;
+}
+
 /* Retrieve AG geometry. */
 int
 xfs_ag_get_geometry(
@@ -943,6 +991,7 @@ xfs_ag_get_geometry(
 	struct xfs_agi		*agi;
 	struct xfs_agf		*agf;
 	unsigned int		freeblks;
+	uint32_t		inflags = ageo->ag_flags;
 	int			error;
 
 	/* Lock the AG headers. */
@@ -953,6 +1002,10 @@ xfs_ag_get_geometry(
 	if (error)
 		goto out_agi;
 
+	error = xfs_ag_update_geoflags(pag, ageo, inflags);
+	if (error)
+		goto out;
+
 	/* Fill out form. */
 	memset(ageo, 0, sizeof(*ageo));
 	ageo->ag_number = pag_agno(pag);
@@ -970,6 +1023,7 @@ xfs_ag_get_geometry(
 	ageo->ag_freeblks = freeblks;
 	xfs_ag_geom_health(pag, ageo);
 
+out:
 	/* Release resources. */
 	xfs_buf_relse(agf_bp);
 out_agi:
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 12463ba766da05..b391bf9de93dbf 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -307,6 +307,11 @@ struct xfs_ag_geometry {
 #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9)  /* reference counts */
 #define XFS_AG_GEOM_SICK_INODES	(1 << 10) /* bad inodes were seen */
 
+#define XFS_AG_FLAG_UPDATE	(1 << 0)  /* update flags */
+#define XFS_AG_FLAG_NOALLOC	(1 << 1)  /* do not allocate from this AG */
+#define XFS_AG_FLAG_ALL		(XFS_AG_FLAG_UPDATE | \
+				 XFS_AG_FLAG_NOALLOC)
+
 /*
  * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT
  */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d3cf62d81f0d17..874e2def3d6e63 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -385,10 +385,12 @@ xfs_ioc_ag_geometry(
 
 	if (copy_from_user(&ageo, arg, sizeof(ageo)))
 		return -EFAULT;
-	if (ageo.ag_flags)
+	if (ageo.ag_flags & ~XFS_AG_FLAG_ALL)
 		return -EINVAL;
 	if (memchr_inv(&ageo.ag_reserved, 0, sizeof(ageo.ag_reserved)))
 		return -EINVAL;
+	if ((ageo.ag_flags & XFS_AG_FLAG_UPDATE) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
 
 	pag = xfs_perag_get(mp, ageo.ag_number);
 	if (!pag)


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 5/5] xfs: apply noalloc mode to inode allocations too
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:37   ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
@ 2024-12-31 23:37   ` Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't allow inode allocations from this group if it's marked noalloc.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 57513ba19d6a71..2d2f132d4d1773 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1107,6 +1107,7 @@ xfs_dialloc_ag_inobt(
 
 	ASSERT(xfs_perag_initialised_agi(pag));
 	ASSERT(xfs_perag_allows_inodes(pag));
+	ASSERT(!xfs_perag_prohibits_alloc(pag));
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
@@ -1735,6 +1736,8 @@ xfs_dialloc_good_ag(
 		return false;
 	if (!xfs_perag_allows_inodes(pag))
 		return false;
+	if (xfs_perag_prohibits_alloc(pag))
+		return false;
 
 	if (!xfs_perag_initialised_agi(pag)) {
 		error = xfs_ialloc_read_agi(pag, tp, 0, NULL);


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHSET 3/5] xfs: report refcount information to userspace
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
  2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
  2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:32 ` Darrick J. Wong
  2024-12-31 23:37   ` [PATCH 1/1] xfs: export reference count " Darrick J. Wong
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:32 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

Hi all,

Create a new ioctl to report the number of owners of each disk block so
that reflink-aware defraggers can make better decisions about which
extents to target.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
 * xfs: export reference count information to userspace
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |   80 +++++
 fs/xfs/xfs_fsrefs.c    |  777 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsrefs.h    |   45 +++
 fs/xfs/xfs_ioctl.c     |    4 
 fs/xfs/xfs_trace.c     |    1 
 fs/xfs/xfs_trace.h     |  125 ++++++++
 7 files changed, 1033 insertions(+)
 create mode 100644 fs/xfs/xfs_fsrefs.c
 create mode 100644 fs/xfs/xfs_fsrefs.h


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH 1/1] xfs: export reference count information to userspace
  2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:37   ` Darrick J. Wong
  0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Export refcount info to userspace so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
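Illustration only, not part of the patch: the low key / high key protocol
documented in the xfs_fs.h hunk below can be sketched as a batch loop.  The
structs are trimmed copies (flags, device, and reserved fields omitted), and
model_query() is a hypothetical stand-in for the ioctl that serves records
from a sorted in-memory table; it assumes the query resumes just past the
end of the record stored in the low key.  A real caller would issue
XFS_IOC_GETFSREFCOUNTS instead and could also stop on FCR_OF_LAST.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Trimmed copy of struct xfs_getfsrefs. */
struct fsrefs_rec {
	uint64_t		physical;	/* device offset, bytes */
	uint64_t		owners;		/* number of owners */
	uint64_t		length;		/* length, bytes */
};

/* Trimmed copy of struct xfs_getfsrefs_head. */
struct fsrefs_head {
	uint32_t		count;		/* capacity of recs[] */
	uint32_t		entries;	/* records filled in */
	struct fsrefs_rec	keys[2];	/* low and high keys */
	struct fsrefs_rec	recs[];		/* returned records */
};

/* Hypothetical stand-in for the ioctl, backed by a sorted table. */
static void model_query(struct fsrefs_head *head,
			const struct fsrefs_rec *table, size_t nr)
{
	/* Resume after the extent described by the low key. */
	uint64_t	start = head->keys[0].physical + head->keys[0].length;
	size_t		i;

	head->entries = 0;
	for (i = 0; i < nr; i++) {
		if (table[i].physical < start ||
		    table[i].physical > head->keys[1].physical)
			continue;
		/* Stop when the caller's array is full. */
		if (head->count && head->entries == head->count)
			break;
		/* count == 0 means "just count the mappings". */
		if (head->count)
			head->recs[head->entries] = table[i];
		head->entries++;
	}
}

/* Walk every record in batches; returns the sum of all owner counts. */
static uint64_t model_walk(const struct fsrefs_rec *table, size_t nr,
			   uint32_t batch)
{
	struct fsrefs_head	*head;
	uint64_t		total = 0;
	uint32_t		i;

	assert(batch > 0);
	head = calloc(1, sizeof(*head) + batch * sizeof(head->recs[0]));
	if (!head)
		return 0;

	head->count = batch;
	head->keys[1].physical = UINT64_MAX;	/* all ones: no upper bound */

	for (;;) {
		model_query(head, table, nr);
		if (head->entries == 0)
			break;
		for (i = 0; i < head->entries; i++)
			total += head->recs[i].owners;
		/* Same step as xfs_getfsrefs_advance() in the patch. */
		head->keys[0] = head->recs[head->entries - 1];
	}

	free(head);
	return total;
}
```

The batch size only changes how many round trips are made; the set of
records seen is the same, which is the property a defragmenter would rely on
when sizing its buffer.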
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |   80 +++++
 fs/xfs/xfs_fsrefs.c    |  777 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsrefs.h    |   45 +++
 fs/xfs/xfs_ioctl.c     |    4 
 fs/xfs/xfs_trace.c     |    1 
 fs/xfs/xfs_trace.h     |  125 ++++++++
 7 files changed, 1033 insertions(+)
 create mode 100644 fs/xfs/xfs_fsrefs.c
 create mode 100644 fs/xfs/xfs_fsrefs.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5bf501cf827172..4c59d43c77089e 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -85,6 +85,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_filestream.o \
 				   xfs_fsmap.o \
 				   xfs_fsops.o \
+				   xfs_fsrefs.o \
 				   xfs_globals.o \
 				   xfs_handle.o \
 				   xfs_health.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b391bf9de93dbf..936f719236944f 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1008,6 +1008,85 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+/*
+ *	Structure for XFS_IOC_GETFSREFCOUNTS.
+ *
+ *	The memory layout for this call is the scalar values defined in struct
+ *	xfs_getfsrefs_head, followed by two struct xfs_getfsrefs that describe
+ *	the lower and upper bound of mappings to return, followed by an array
+ *	of struct xfs_getfsrefs mappings.
+ *
+ *	fch_iflags control the output of the call, whereas fch_oflags report
+ *	on the overall record output.  fch_count should be set to the length
+ *	of the fch_recs array, and fch_entries will be set to the number of
+ *	entries filled out during each call.  If fch_count is zero, the number
+ *	of refcount mappings will be returned in fch_entries, though no
+ *	mappings will be returned.  fch_reserved must be set to zero.
+ *
+ *	The two elements in the fch_keys array are used to constrain the
+ *	output.  The first element in the array should represent the lowest
+ *	disk mapping ("low key") that the user wants to learn about.  If this
+ *	value is all zeroes, the filesystem will return the first entry it
+ *	knows about.  For a subsequent call, the contents of
+ *	fsrefs_head.fch_recs[fsrefs_head.fch_entries - 1] should be copied into
+ *	fch_keys[0] to have the kernel start where it left off.
+ *
+ *	The second element in the fch_keys array should represent the highest
+ *	disk mapping ("high key") that the user wants to learn about.  If this
+ *	value is all ones, the filesystem will not stop until it runs out of
+ *	mappings to return or runs out of space in fch_recs.
+ *
+ *	fcr_device can be either a 32-bit cookie representing a device, or a
+ *	32-bit dev_t if the FCH_OF_DEV_T flag is set.  fcr_physical and
+ *	fcr_length are expressed in units of bytes.  fcr_owners is the number
+ *	of owners.
+ */
+struct xfs_getfsrefs {
+	__u32		fcr_device;	/* device id */
+	__u32		fcr_flags;	/* mapping flags */
+	__u64		fcr_physical;	/* device offset of segment */
+	__u64		fcr_owners;	/* number of owners */
+	__u64		fcr_length;	/* length of segment */
+	__u64		fcr_reserved[4];	/* must be zero */
+};
+
+struct xfs_getfsrefs_head {
+	__u32		fch_iflags;	/* control flags */
+	__u32		fch_oflags;	/* output flags */
+	__u32		fch_count;	/* # of entries in array incl. input */
+	__u32		fch_entries;	/* # of entries filled in (output). */
+	__u64		fch_reserved[6];	/* must be zero */
+
+	struct xfs_getfsrefs	fch_keys[2];	/* low and high keys for the mapping search */
+	struct xfs_getfsrefs	fch_recs[];	/* returned records */
+};
+
+/* Size of an fsrefs_head with room for nr records. */
+static inline unsigned long long
+xfs_getfsrefs_sizeof(
+	unsigned int	nr)
+{
+	return sizeof(struct xfs_getfsrefs_head) +
+		(nr * sizeof(struct xfs_getfsrefs));
+}
+
+/* Start the next fsrefs query at the end of the current query results. */
+static inline void
+xfs_getfsrefs_advance(
+	struct xfs_getfsrefs_head	*head)
+{
+	head->fch_keys[0] = head->fch_recs[head->fch_entries - 1];
+}
+
+/* fch_iflags values - set by XFS_IOC_GETFSREFCOUNTS caller in the header. */
+#define FCH_IF_VALID		0
+
+/* fch_oflags values - returned in the header segment only. */
+#define FCH_OF_DEV_T		(1U << 0) /* fcr_device values will be dev_t */
+
+/* fcr_flags values - returned for each non-header segment */
+#define FCR_OF_LAST		(1U << 0) /* last record in the dataset */
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1047,6 +1126,7 @@ struct xfs_rtgroup_geometry {
 #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
+#define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_fsrefs.c b/fs/xfs/xfs_fsrefs.c
new file mode 100644
index 00000000000000..85e109dba20f99
--- /dev/null
+++ b/fs/xfs/xfs_fsrefs.c
@@ -0,0 +1,777 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_btree.h"
+#include "xfs_trace.h"
+#include "xfs_alloc.h"
+#include "xfs_bit.h"
+#include "xfs_fsrefs.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_rtrefcount_btree.h"
+#include "xfs_ag.h"
+#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
+
+/* getfsrefs query state */
+struct xfs_fsrefs_info {
+	struct xfs_fsrefs_head	*head;
+	struct xfs_getfsrefs	*fsrefs_recs;	/* mapping records */
+
+	struct xfs_btree_cur	*refc_cur;	/* refcount btree cursor */
+	struct xfs_btree_cur	*bno_cur;	/* bnobt btree cursor */
+
+	struct xfs_buf		*agf_bp;	/* AGF, for refcount queries */
+	struct xfs_group	*group;
+
+	xfs_daddr_t		next_daddr;	/* next daddr we expect */
+	/* daddr of low fsrefs key when we're using the rtbitmap */
+	xfs_daddr_t		low_daddr;
+
+	/*
+	 * Low refcount key for the query.  If low.rc_blockcount is nonzero,
+	 * this is the second (or later) call to retrieve the recordset in
+	 * pieces.  xfs_fsrefs_frec_before_start will compare all records
+	 * retrieved by the refcountbt query to filter out any records that
+	 * start before the last record.
+	 */
+	struct xfs_refcount_irec low;
+	struct xfs_refcount_irec high;		/* high refcount key */
+
+	u32			dev;		/* device id */
+	bool			last;		/* last extent? */
+};
+
+/* Associate a device with a getfsrefs handler. */
+struct xfs_fsrefs_dev {
+	u32			dev;
+	int			(*fn)(struct xfs_trans *tp,
+				      const struct xfs_fsrefs *keys,
+				      struct xfs_fsrefs_info *info);
+};
+
+/* Convert an xfs_fsrefs to an fsrefs. */
+static void
+xfs_fsrefs_from_internal(
+	struct xfs_getfsrefs	*dest,
+	struct xfs_fsrefs	*src)
+{
+	dest->fcr_device = src->fcr_device;
+	dest->fcr_flags = src->fcr_flags;
+	dest->fcr_physical = BBTOB(src->fcr_physical);
+	dest->fcr_owners = src->fcr_owners;
+	dest->fcr_length = BBTOB(src->fcr_length);
+	dest->fcr_reserved[0] = 0;
+	dest->fcr_reserved[1] = 0;
+	dest->fcr_reserved[2] = 0;
+	dest->fcr_reserved[3] = 0;
+}
+
+/* Convert an fsrefs to an xfs_fsrefs. */
+static void
+xfs_fsrefs_to_internal(
+	struct xfs_fsrefs	*dest,
+	struct xfs_getfsrefs	*src)
+{
+	dest->fcr_device = src->fcr_device;
+	dest->fcr_flags = src->fcr_flags;
+	dest->fcr_physical = BTOBBT(src->fcr_physical);
+	dest->fcr_owners = src->fcr_owners;
+	dest->fcr_length = BTOBBT(src->fcr_length);
+}
+
+/* Compare two getfsrefs device handlers. */
+static int
+xfs_fsrefs_dev_compare(
+	const void			*p1,
+	const void			*p2)
+{
+	const struct xfs_fsrefs_dev	*d1 = p1;
+	const struct xfs_fsrefs_dev	*d2 = p2;
+
+	return d1->dev - d2->dev;
+}
+
+static inline bool
+xfs_fsrefs_frec_before_start(
+	struct xfs_fsrefs_info		*info,
+	const struct xfs_fsrefs_irec	*frec)
+{
+	if (info->low_daddr != XFS_BUF_DADDR_NULL)
+		return frec->start_daddr < info->low_daddr;
+	if (info->low.rc_blockcount)
+		return frec->rec_key < info->low.rc_startblock;
+	return false;
+}
+
+/*
+ * Format a refcount record for fsrefs, having translated rc_startblock into
+ * the appropriate daddr units.
+ */
+STATIC int
+xfs_fsrefs_helper(
+	struct xfs_trans		*tp,
+	struct xfs_fsrefs_info		*info,
+	const struct xfs_fsrefs_irec	*frec)
+{
+	struct xfs_fsrefs		fcr;
+	struct xfs_getfsrefs		*row;
+	struct xfs_mount		*mp = tp->t_mountp;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	/*
+	 * Filter out records that start before our startpoint, if the
+	 * caller requested that.
+	 */
+	if (xfs_fsrefs_frec_before_start(info, frec))
+		return 0;
+
+	/* Are we just counting mappings? */
+	if (info->head->fch_count == 0) {
+		if (info->head->fch_entries == UINT_MAX)
+			return -ECANCELED;
+
+		info->head->fch_entries++;
+		return 0;
+	}
+
+	/* Fill out the extent we found */
+	if (info->head->fch_entries >= info->head->fch_count)
+		return -ECANCELED;
+
+	trace_xfs_fsrefs_mapping(mp, info->dev,
+			info->group ? info->group->xg_gno : NULLAGNUMBER,
+			frec);
+
+	fcr.fcr_device = info->dev;
+	fcr.fcr_flags = 0;
+	fcr.fcr_physical = frec->start_daddr;
+	fcr.fcr_owners = frec->refcount;
+	fcr.fcr_length = frec->len_daddr;
+
+	trace_xfs_getfsrefs_mapping(mp, &fcr);
+
+	row = &info->fsrefs_recs[info->head->fch_entries++];
+	xfs_fsrefs_from_internal(row, &fcr);
+	return 0;
+}
+
+/* Synthesize fsrefs records from free space data. */
+STATIC int
+xfs_fsrefs_ddev_bnobt_helper(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_alloc_rec_incore *rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_fsrefs_info		*info = priv;
+	xfs_agnumber_t			next_agno;
+	xfs_agblock_t			next_agbno;
+
+	/*
+	 * Figure out if there's a gap between the last fsrefs record we
+	 * emitted and this free extent.  If there is, report the gap as a
+	 * refcount==1 record.
+	 */
+	next_agno = xfs_daddr_to_agno(mp, info->next_daddr);
+	next_agbno = xfs_daddr_to_agbno(mp, info->next_daddr);
+
+	ASSERT(next_agno >= cur->bc_group->xg_gno);
+	ASSERT(rec->ar_startblock >= next_agbno);
+
+	/*
+	 * If we've already moved on to the next AG, we don't have any fsrefs
+	 * records to synthesize.
+	 */
+	if (next_agno > cur->bc_group->xg_gno)
+		return 0;
+
+	info->next_daddr = xfs_gbno_to_daddr(cur->bc_group,
+				rec->ar_startblock + rec->ar_blockcount);
+
+	if (rec->ar_startblock == next_agbno)
+		return 0;
+
+	/* Emit a record for the in-use space */
+	frec.start_daddr = xfs_gbno_to_daddr(cur->bc_group, next_agbno);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec->ar_startblock - next_agbno);
+	frec.rec_key = next_agbno;
+	return xfs_fsrefs_helper(cur->bc_tp, info, &frec);
+}
+
+/* Emit records to fill a gap in the refcount btree with singly-owned blocks. */
+STATIC int
+xfs_fsrefs_ddev_fill_refcount_gap(
+	struct xfs_trans		*tp,
+	struct xfs_fsrefs_info		*info,
+	xfs_agblock_t			agbno)
+{
+	struct xfs_alloc_rec_incore	low = {0};
+	struct xfs_alloc_rec_incore	high = {0};
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_btree_cur		*cur = info->bno_cur;
+	struct xfs_agf			*agf;
+	int				error;
+
+	ASSERT(xfs_daddr_to_agno(mp, info->next_daddr) ==
+			cur->bc_group->xg_gno);
+
+	low.ar_startblock = xfs_daddr_to_agbno(mp, info->next_daddr);
+	if (low.ar_startblock >= agbno)
+		return 0;
+
+	high.ar_startblock = agbno;
+	error = xfs_alloc_query_range(cur, &low, &high,
+			xfs_fsrefs_ddev_bnobt_helper, info);
+	if (error)
+		return error;
+
+	/*
+	 * Synthesize records for single-owner extents between the last
+	 * fsrefcount record emitted and the end of the query range.
+	 */
+	agf = cur->bc_ag.agbp->b_addr;
+	low.ar_startblock = min_t(xfs_agblock_t, agbno,
+				  be32_to_cpu(agf->agf_length));
+	if (xfs_daddr_to_agbno(mp, info->next_daddr) > low.ar_startblock)
+		return 0;
+
+	info->last = true;
+	return xfs_fsrefs_ddev_bnobt_helper(cur, &low, info);
+}
+
+/* Transform a refcountbt irec into a fsrefs */
+STATIC int
+xfs_fsrefs_ddev_refcountbt_helper(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= rec->rc_refcount,
+		.rec_key		= rec->rc_startblock,
+	};
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_fsrefs_info		*info = priv;
+	int				error;
+
+	/*
+	 * Stop once we get to the CoW staging extents; they're all shoved to
+	 * the right side of the btree and were already covered by the bnobt
+	 * scan.
+	 */
+	if (rec->rc_domain != XFS_REFC_DOMAIN_SHARED)
+		return -ECANCELED;
+
+	/* Report on any gaps first */
+	error = xfs_fsrefs_ddev_fill_refcount_gap(cur->bc_tp, info,
+			rec->rc_startblock);
+	if (error)
+		return error;
+
+	/* Report the refcount record from the refcount btree. */
+	frec.start_daddr = xfs_gbno_to_daddr(cur->bc_group,
+					     rec->rc_startblock);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec->rc_blockcount);
+	info->next_daddr = xfs_gbno_to_daddr(cur->bc_group,
+			rec->rc_startblock + rec->rc_blockcount);
+	return xfs_fsrefs_helper(cur->bc_tp, info, &frec);
+}
+
+/* Execute a getfsrefs query against the regular data device. */
+STATIC int
+xfs_fsrefs_ddev(
+	struct xfs_trans	*tp,
+	const struct xfs_fsrefs	*keys,
+	struct xfs_fsrefs_info	*info)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*agf_bp = NULL;
+	struct xfs_perag	*pag = NULL;
+	xfs_fsblock_t		start_fsb;
+	xfs_fsblock_t		end_fsb;
+	xfs_agnumber_t		start_ag;
+	xfs_agnumber_t		end_ag;
+	uint64_t		eofs;
+	int			error = 0;
+
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+	if (keys[0].fcr_physical >= eofs)
+		return 0;
+	start_fsb = XFS_DADDR_TO_FSB(mp, keys[0].fcr_physical);
+	end_fsb = XFS_DADDR_TO_FSB(mp, min(eofs - 1, keys[1].fcr_physical));
+
+	info->refc_cur = info->bno_cur = NULL;
+
+	/*
+	 * Convert the fsrefs low/high keys to AG based keys.  Initialize
+	 * low to the fsrefs low key and max out the high key to the end
+	 * of the AG.
+	 */
+	info->low.rc_startblock = XFS_FSB_TO_AGBNO(mp, start_fsb);
+	info->low.rc_blockcount = XFS_BB_TO_FSBT(mp, keys[0].fcr_length);
+	info->low.rc_refcount = 0;
+	info->low.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	/* Adjust the low key if we are continuing from where we left off. */
+	if (info->low.rc_blockcount > 0) {
+		info->low.rc_startblock += info->low.rc_blockcount;
+
+		start_fsb += info->low.rc_blockcount;
+		if (XFS_FSB_TO_DADDR(mp, start_fsb) >= eofs)
+			return 0;
+	}
+
+	info->high.rc_startblock = -1U;
+	info->high.rc_refcount = 0;
+	info->high.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	start_ag = XFS_FSB_TO_AGNO(mp, start_fsb);
+	end_ag = XFS_FSB_TO_AGNO(mp, end_fsb);
+
+	/* Query each AG */
+	while ((pag = xfs_perag_next_range(mp, pag, start_ag, end_ag))) {
+		info->group = pag_group(pag);
+
+		/*
+		 * Set the AG high key from the fsrefs high key if this
+		 * is the last AG that we're querying.
+		 */
+		if (pag_agno(pag) == end_ag)
+			info->high.rc_startblock = XFS_FSB_TO_AGBNO(mp,
+					end_fsb);
+
+		if (info->refc_cur) {
+			xfs_btree_del_cursor(info->refc_cur, XFS_BTREE_NOERROR);
+			info->refc_cur = NULL;
+		}
+		if (info->bno_cur) {
+			xfs_btree_del_cursor(info->bno_cur, XFS_BTREE_NOERROR);
+			info->bno_cur = NULL;
+		}
+		if (agf_bp) {
+			xfs_trans_brelse(tp, agf_bp);
+			agf_bp = NULL;
+		}
+
+		error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+		if (error)
+			break;
+
+		trace_xfs_fsrefs_low_group_key(mp, info->dev, info->group,
+				&info->low);
+		trace_xfs_fsrefs_high_group_key(mp, info->dev, info->group,
+				&info->high);
+
+		info->bno_cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag);
+
+		if (xfs_has_reflink(mp)) {
+			info->refc_cur = xfs_refcountbt_init_cursor(mp, tp,
+							agf_bp, pag);
+
+			/*
+			 * Fill the query with refcount records and synthesize
+			 * singly-owned block records from free space data.
+			 */
+			error = xfs_refcount_query_range(info->refc_cur,
+					&info->low, &info->high,
+					xfs_fsrefs_ddev_refcountbt_helper,
+					info);
+			if (error && error != -ECANCELED)
+				break;
+		}
+
+		/*
+		 * Synthesize refcount==1 records from the free space data
+		 * between the end of the last fsrefs record reported and the
+		 * end of the range.  If we don't have refcount support, the
+		 * starting point will be the start of the query range.
+		 */
+		error = xfs_fsrefs_ddev_fill_refcount_gap(tp, info,
+				info->high.rc_startblock);
+		if (error)
+			break;
+
+		/*
+		 * Set the AG low key to the start of the AG prior to
+		 * moving on to the next AG.
+		 */
+		if (pag_agno(pag) == start_ag)
+			memset(&info->low, 0, sizeof(info->low));
+		info->group = NULL;
+	}
+
+	if (info->refc_cur) {
+		xfs_btree_del_cursor(info->refc_cur, error);
+		info->refc_cur = NULL;
+	}
+	if (info->bno_cur) {
+		xfs_btree_del_cursor(info->bno_cur, error);
+		info->bno_cur = NULL;
+	}
+	if (agf_bp)
+		xfs_trans_brelse(tp, agf_bp);
+	if (info->group) {
+		xfs_perag_rele(pag);
+		info->group = NULL;
+	} else if (pag) {
+		/* loop termination case */
+		xfs_perag_rele(pag);
+	}
+
+	return error;
+}
+
+/* Execute a getfsrefs query against the log device. */
+STATIC int
+xfs_fsrefs_logdev(
+	struct xfs_trans		*tp,
+	const struct xfs_fsrefs		*keys,
+	struct xfs_fsrefs_info		*info)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.start_daddr		= 0,
+		.rec_key		= 0,
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = tp->t_mountp;
+	xfs_fsblock_t			start_fsb, end_fsb;
+	uint64_t			eofs;
+
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
+	if (keys[0].fcr_physical >= eofs)
+		return 0;
+	start_fsb = XFS_BB_TO_FSBT(mp,
+				keys[0].fcr_physical + keys[0].fcr_length);
+	end_fsb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fcr_physical));
+
+	/* Adjust the low key if we are continuing from where we left off. */
+	if (keys[0].fcr_length > 0)
+		info->low_daddr = XFS_FSB_TO_BB(mp, start_fsb);
+
+	trace_xfs_fsrefs_low_linear_key(mp, info->dev, start_fsb);
+	trace_xfs_fsrefs_high_linear_key(mp, info->dev, end_fsb);
+
+	if (start_fsb > 0)
+		return 0;
+
+	/* Fabricate an refc entry for the external log device. */
+	frec.len_daddr = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
+	return xfs_fsrefs_helper(tp, info, &frec);
+}
+
+/* Do we recognize the device? */
+STATIC bool
+xfs_fsrefs_is_valid_device(
+	struct xfs_mount	*mp,
+	struct xfs_fsrefs	*fcr)
+{
+	if (fcr->fcr_device == 0 || fcr->fcr_device == UINT_MAX ||
+	    fcr->fcr_device == new_encode_dev(mp->m_ddev_targp->bt_dev))
+		return true;
+	if (mp->m_logdev_targp &&
+	    fcr->fcr_device == new_encode_dev(mp->m_logdev_targp->bt_dev))
+		return true;
+	if (mp->m_rtdev_targp &&
+	    fcr->fcr_device == new_encode_dev(mp->m_rtdev_targp->bt_dev))
+		return true;
+	return false;
+}
+
+/* Ensure that the low key is less than the high key. */
+STATIC bool
+xfs_fsrefs_check_keys(
+	struct xfs_fsrefs	*low_key,
+	struct xfs_fsrefs	*high_key)
+{
+	if (low_key->fcr_device > high_key->fcr_device)
+		return false;
+	if (low_key->fcr_device < high_key->fcr_device)
+		return true;
+
+	if (low_key->fcr_physical > high_key->fcr_physical)
+		return false;
+	if (low_key->fcr_physical < high_key->fcr_physical)
+		return true;
+
+	return false;
+}
+
+#define XFS_GETFSREFS_DEVS	2
+
+/*
+ * Get filesystem's extent refcounts as described in head, and format for
+ * output.  Fills in the supplied records array until there are no more
+ * refcount records to return or head.fch_entries == head.fch_count.  In the
+ * second case, this function returns -ECANCELED to indicate that more records
+ * would have been returned.
+ *
+ * Key to Confusion
+ * ----------------
+ * There are multiple levels of keys and counters at work here:
+ * xfs_fsrefs_head.fch_keys	-- low and high fsrefs keys passed in;
+ *				   these reflect fs-wide sector addrs.
+ * dkeys			-- fch_keys used to query each device;
+ *				   these are fch_keys but w/ the low key
+ *				   bumped up by fcr_length.
+ * xfs_fsrefs_info.next_daddr	-- next disk addr we expect to see; this
+ *				   is how we detect gaps in the fsrefs
+ *				   records and report them.
+ * xfs_fsrefs_info.low/high	-- per-AG low/high keys computed from
+ *				   dkeys; used to query the metadata.
+ */
+STATIC int
+xfs_getfsrefs(
+	struct xfs_mount	*mp,
+	struct xfs_fsrefs_head	*head,
+	struct xfs_getfsrefs	*fsrefs_recs)
+{
+	struct xfs_trans	*tp = NULL;
+	struct xfs_fsrefs	dkeys[2];	/* per-dev keys */
+	struct xfs_fsrefs_dev	handlers[XFS_GETFSREFS_DEVS];
+	struct xfs_fsrefs_info	info = { NULL };
+	int			i;
+	int			error = 0;
+
+	if (head->fch_iflags & ~FCH_IF_VALID)
+		return -EINVAL;
+	if (!xfs_fsrefs_is_valid_device(mp, &head->fch_keys[0]) ||
+	    !xfs_fsrefs_is_valid_device(mp, &head->fch_keys[1]))
+		return -EINVAL;
+	if (!xfs_fsrefs_check_keys(&head->fch_keys[0], &head->fch_keys[1]))
+		return -EINVAL;
+
+	head->fch_entries = 0;
+
+	/* Set up our device handlers. */
+	memset(handlers, 0, sizeof(handlers));
+	handlers[0].dev = new_encode_dev(mp->m_ddev_targp->bt_dev);
+	handlers[0].fn = xfs_fsrefs_ddev;
+	if (mp->m_logdev_targp != mp->m_ddev_targp) {
+		handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev);
+		handlers[1].fn = xfs_fsrefs_logdev;
+	}
+
+	xfs_sort(handlers, XFS_GETFSREFS_DEVS, sizeof(struct xfs_fsrefs_dev),
+			xfs_fsrefs_dev_compare);
+
+	/*
+	 * To continue where we left off, we allow userspace to use the last
+	 * mapping from a previous call as the low key of the next.  This is
+	 * identified by a non-zero length in the low key. We have to increment
+	 * the low key in this scenario to ensure we don't return the same
+	 * mapping again, and instead return the very next mapping.  Bump the
+	 * physical offset as there can be no other mapping for the same
+	 * physical block range.
+	 *
+	 * Each fsrefs backend is responsible for making this adjustment as
+	 * appropriate for the backend.
+	 */
+	dkeys[0] = head->fch_keys[0];
+	memset(&dkeys[1], 0xFF, sizeof(struct xfs_fsrefs));
+
+	info.next_daddr = head->fch_keys[0].fcr_physical +
+			  head->fch_keys[0].fcr_length;
+	info.fsrefs_recs = fsrefs_recs;
+	info.head = head;
+
+	/* For each device we support... */
+	for (i = 0; i < XFS_GETFSREFS_DEVS; i++) {
+		/* Is this device within the range the user asked for? */
+		if (!handlers[i].fn)
+			continue;
+		if (head->fch_keys[0].fcr_device > handlers[i].dev)
+			continue;
+		if (head->fch_keys[1].fcr_device < handlers[i].dev)
+			break;
+
+		/*
+		 * If this device number matches the high key, we have to pass
+		 * the high key to the handler to limit the query results.  If
+		 * the device number exceeds the low key, zero out the low key
+		 * so that we get everything from the beginning.
+		 */
+		if (handlers[i].dev == head->fch_keys[1].fcr_device)
+			dkeys[1] = head->fch_keys[1];
+		if (handlers[i].dev > head->fch_keys[0].fcr_device)
+			memset(&dkeys[0], 0, sizeof(struct xfs_fsrefs));
+
+		/*
+		 * Grab an empty transaction so that we can use its recursive
+		 * buffer locking abilities to detect cycles in the refcountbt
+		 * without deadlocking.
+		 */
+		error = xfs_trans_alloc_empty(mp, &tp);
+		if (error)
+			break;
+
+		info.dev = handlers[i].dev;
+		info.last = false;
+		info.group = NULL;
+		info.low_daddr = XFS_BUF_DADDR_NULL;
+		info.low.rc_blockcount = 0;
+		error = handlers[i].fn(tp, dkeys, &info);
+		if (error)
+			break;
+		xfs_trans_cancel(tp);
+		tp = NULL;
+		info.next_daddr = 0;
+	}
+
+	if (tp)
+		xfs_trans_cancel(tp);
+	head->fch_oflags = FCH_OF_DEV_T;
+	return error;
+}
+
+int
+xfs_ioc_getfsrefcounts(
+	struct xfs_inode		*ip,
+	struct xfs_getfsrefs_head	__user *arg)
+{
+	struct xfs_fsrefs_head		xhead = {0};
+	struct xfs_getfsrefs_head	head;
+	struct xfs_getfsrefs		*recs;
+	unsigned int			count;
+	__u32				last_flags = 0;
+	bool				done = false;
+	int				error;
+
+	if (copy_from_user(&head, arg, sizeof(struct xfs_getfsrefs_head)))
+		return -EFAULT;
+	if (memchr_inv(head.fch_reserved, 0, sizeof(head.fch_reserved)) ||
+	    memchr_inv(head.fch_keys[0].fcr_reserved, 0,
+		       sizeof(head.fch_keys[0].fcr_reserved)) ||
+	    memchr_inv(head.fch_keys[1].fcr_reserved, 0,
+		       sizeof(head.fch_keys[1].fcr_reserved)))
+		return -EINVAL;
+
+	/*
+	 * Use an internal memory buffer so that we don't have to copy fsrefs
+	 * data to userspace while holding locks.  Start by trying to allocate
+	 * up to 128k for the buffer, but fall back to a single page if needed.
+	 */
+	count = min_t(unsigned int, head.fch_count,
+			131072 / sizeof(struct xfs_getfsrefs));
+	recs = kvcalloc(count, sizeof(struct xfs_getfsrefs), GFP_KERNEL);
+	if (!recs) {
+		count = min_t(unsigned int, head.fch_count,
+				PAGE_SIZE / sizeof(struct xfs_getfsrefs));
+		recs = kvcalloc(count, sizeof(struct xfs_getfsrefs),
+				GFP_KERNEL);
+		if (!recs)
+			return -ENOMEM;
+	}
+
+	xhead.fch_iflags = head.fch_iflags;
+	xfs_fsrefs_to_internal(&xhead.fch_keys[0], &head.fch_keys[0]);
+	xfs_fsrefs_to_internal(&xhead.fch_keys[1], &head.fch_keys[1]);
+
+	trace_xfs_getfsrefs_low_key(ip->i_mount, &xhead.fch_keys[0]);
+	trace_xfs_getfsrefs_high_key(ip->i_mount, &xhead.fch_keys[1]);
+
+	head.fch_entries = 0;
+	do {
+		struct xfs_getfsrefs __user	*user_recs;
+		struct xfs_getfsrefs		*last_rec;
+		size_t				copy_bytes;
+
+		user_recs = &arg->fch_recs[head.fch_entries];
+		xhead.fch_entries = 0;
+		xhead.fch_count = min_t(unsigned int, count,
+					head.fch_count - head.fch_entries);
+
+		/* Run query, record how many entries we got. */
+		error = xfs_getfsrefs(ip->i_mount, &xhead, recs);
+		switch (error) {
+		case 0:
+			/*
+			 * There are no more records in the result set.  Copy
+			 * whatever we got to userspace and break out.
+			 */
+			done = true;
+			break;
+		case -ECANCELED:
+			/*
+			 * The internal memory buffer is full.  Copy whatever
+			 * records we got to userspace and go again if we have
+			 * not yet filled the userspace buffer.
+			 */
+			error = 0;
+			break;
+		default:
+			goto out_free;
+		}
+		head.fch_entries += xhead.fch_entries;
+		head.fch_oflags = xhead.fch_oflags;
+
+		/*
+		 * If the caller wanted a record count or there aren't any
+		 * new records to return, we're done.
+		 */
+		if (head.fch_count == 0 || xhead.fch_entries == 0)
+			break;
+
+		/* Copy all the records we got out to userspace. */
+		copy_bytes = array_size(xhead.fch_entries,
+					sizeof(struct xfs_getfsrefs));
+		if (copy_bytes == SIZE_MAX ||
+		    copy_to_user(user_recs, recs, copy_bytes)) {
+			error = -EFAULT;
+			goto out_free;
+		}
+
+		/* Remember the last record flags we copied to userspace. */
+		last_rec = &recs[xhead.fch_entries - 1];
+		last_flags = last_rec->fcr_flags;
+
+		/* Set up the low key for the next iteration. */
+		xfs_fsrefs_to_internal(&xhead.fch_keys[0], last_rec);
+		trace_xfs_getfsrefs_low_key(ip->i_mount, &xhead.fch_keys[0]);
+	} while (!done && head.fch_entries < head.fch_count);
+
+	/*
+	 * If there are no more records in the query result set and we're not
+	 * in counting mode, mark the last record returned with the LAST flag.
+	 */
+	if (done && head.fch_count > 0 && head.fch_entries > 0) {
+		struct xfs_getfsrefs __user	*user_rec;
+
+		last_flags |= FCR_OF_LAST;
+		user_rec = &arg->fch_recs[head.fch_entries - 1];
+
+		if (copy_to_user(&user_rec->fcr_flags, &last_flags,
+					sizeof(last_flags))) {
+			error = -EFAULT;
+			goto out_free;
+		}
+	}
+
+	/* copy back header */
+	if (copy_to_user(arg, &head, sizeof(struct xfs_getfsrefs_head))) {
+		error = -EFAULT;
+		goto out_free;
+	}
+
+out_free:
+	kvfree(recs);
+	return error;
+}
diff --git a/fs/xfs/xfs_fsrefs.h b/fs/xfs/xfs_fsrefs.h
new file mode 100644
index 00000000000000..6d23eaa4801e24
--- /dev/null
+++ b/fs/xfs/xfs_fsrefs.h
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_FSREFS_H__
+#define __XFS_FSREFS_H__
+
+struct xfs_getfsrefs;
+
+/* internal fsrefs representation */
+struct xfs_fsrefs {
+	dev_t		fcr_device;	/* device id */
+	uint32_t	fcr_flags;	/* mapping flags */
+	uint64_t	fcr_physical;	/* device offset of segment */
+	uint64_t	fcr_owners;	/* number of owners */
+	xfs_filblks_t	fcr_length;	/* length of segment, blocks */
+};
+
+struct xfs_fsrefs_head {
+	uint32_t	fch_iflags;	/* control flags */
+	uint32_t	fch_oflags;	/* output flags */
+	unsigned int	fch_count;	/* # of entries in array incl. input */
+	unsigned int	fch_entries;	/* # of entries filled in (output). */
+
+	struct xfs_fsrefs fch_keys[2];	/* low and high keys */
+};
+
+/* internal fsrefs record format */
+struct xfs_fsrefs_irec {
+	xfs_daddr_t	start_daddr;
+	xfs_daddr_t	len_daddr;
+	xfs_nlink_t	refcount;
+
+	/*
+	 * refcount startblock corresponding to start_daddr, if the record came
+	 * from a refcount btree.
+	 */
+	xfs_agblock_t	rec_key;
+};
+
+int xfs_ioc_getfsrefcounts(struct xfs_inode *ip,
+		struct xfs_getfsrefs_head __user *arg);
+
+#endif /* __XFS_FSREFS_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 874e2def3d6e63..20f013bd4ce653 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -29,6 +29,7 @@
 #include "xfs_btree.h"
 #include <linux/fsmap.h>
 #include "xfs_fsmap.h"
+#include "xfs_fsrefs.h"
 #include "scrub/xfs_scrub.h"
 #include "xfs_sb.h"
 #include "xfs_ag.h"
@@ -1266,6 +1267,9 @@ xfs_file_ioctl(
 	case FS_IOC_GETFSMAP:
 		return xfs_ioc_getfsmap(ip, arg);
 
+	case XFS_IOC_GETFSREFCOUNTS:
+		return xfs_ioc_getfsrefcounts(ip, arg);
+
 	case XFS_IOC_SCRUBV_METADATA:
 		return xfs_ioc_scrubv_metadata(filp, arg);
 	case XFS_IOC_SCRUB_METADATA:
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index a60556dbd172ee..555fe76b4d853c 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -51,6 +51,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_zone_priv.h"
+#include "xfs_fsrefs.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index dc7ffc8f8e9dea..7043b6481d5f97 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -103,6 +103,8 @@ struct xfs_refcount_intent;
 struct xfs_metadir_update;
 struct xfs_rtgroup;
 struct xfs_open_zone;
+struct xfs_fsrefs;
+struct xfs_fsrefs_irec;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -4297,6 +4299,129 @@ DEFINE_GETFSMAP_EVENT(xfs_getfsmap_low_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_high_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_mapping);
 
+/* fsrefs traces */
+TRACE_EVENT(xfs_fsrefs_mapping,
+	TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_agnumber_t agno,
+		 const struct xfs_fsrefs_irec *frec),
+	TP_ARGS(mp, keydev, agno, frec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, keydev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+		__field(xfs_daddr_t, start_daddr)
+		__field(xfs_daddr_t, len_daddr)
+		__field(uint64_t, owners)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->keydev = new_decode_dev(keydev);
+		__entry->agno = agno;
+		__entry->agbno = frec->rec_key;
+		__entry->start_daddr = frec->start_daddr;
+		__entry->len_daddr = frec->len_daddr;
+		__entry->owners = frec->refcount;
+	),
+	TP_printk("dev %d:%d keydev %d:%d agno 0x%x agbno 0x%x start_daddr 0x%llx len_daddr 0x%llx owners %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->keydev), MINOR(__entry->keydev),
+		  __entry->agno,
+		  __entry->agbno,
+		  __entry->start_daddr,
+		  __entry->len_daddr,
+		  __entry->owners)
+);
+
+DECLARE_EVENT_CLASS(xfs_fsrefs_linear_key_class,
+	TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_fsblock_t fsbno),
+	TP_ARGS(mp, keydev, fsbno),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, keydev)
+		__field(xfs_fsblock_t, fsbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->keydev = new_decode_dev(keydev);
+		__entry->fsbno = fsbno;
+	),
+	TP_printk("dev %d:%d keydev %d:%d fsbno 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->keydev), MINOR(__entry->keydev),
+		  __entry->fsbno)
+)
+#define DEFINE_FSREFS_LINEAR_KEY_EVENT(name) \
+DEFINE_EVENT(xfs_fsrefs_linear_key_class, name, \
+	TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_fsblock_t fsbno), \
+	TP_ARGS(mp, keydev, fsbno))
+DEFINE_FSREFS_LINEAR_KEY_EVENT(xfs_fsrefs_low_linear_key);
+DEFINE_FSREFS_LINEAR_KEY_EVENT(xfs_fsrefs_high_linear_key);
+
+DECLARE_EVENT_CLASS(xfs_fsrefs_group_key_class,
+	TP_PROTO(struct xfs_mount *mp, u32 keydev, const struct xfs_group *xg,
+		 const struct xfs_refcount_irec *refc),
+	TP_ARGS(mp, keydev, xg, refc),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, keydev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agblock_t, agbno)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->keydev = new_decode_dev(keydev);
+		__entry->agno = xg->xg_gno;
+		__entry->agbno = refc->rc_startblock;
+	),
+	TP_printk("dev %d:%d keydev %d:%d agno 0x%x refcbno 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->keydev), MINOR(__entry->keydev),
+		  __entry->agno,
+		  __entry->agbno)
+)
+#define DEFINE_FSREFS_GROUP_KEY_EVENT(name) \
+DEFINE_EVENT(xfs_fsrefs_group_key_class, name, \
+	TP_PROTO(struct xfs_mount *mp, u32 keydev, const struct xfs_group *xg, \
+		 const struct xfs_refcount_irec *refc), \
+	TP_ARGS(mp, keydev, xg, refc))
+DEFINE_FSREFS_GROUP_KEY_EVENT(xfs_fsrefs_low_group_key);
+DEFINE_FSREFS_GROUP_KEY_EVENT(xfs_fsrefs_high_group_key);
+
+DECLARE_EVENT_CLASS(xfs_getfsrefs_class,
+	TP_PROTO(struct xfs_mount *mp, struct xfs_fsrefs *fsrefs),
+	TP_ARGS(mp, fsrefs),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, keydev)
+		__field(xfs_daddr_t, block)
+		__field(xfs_daddr_t, len)
+		__field(uint64_t, owners)
+		__field(uint32_t, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->keydev = new_decode_dev(fsrefs->fcr_device);
+		__entry->block = fsrefs->fcr_physical;
+		__entry->len = fsrefs->fcr_length;
+		__entry->owners = fsrefs->fcr_owners;
+		__entry->flags = fsrefs->fcr_flags;
+	),
+	TP_printk("dev %d:%d keydev %d:%d daddr 0x%llx bbcount 0x%llx owners %llu flags 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->keydev), MINOR(__entry->keydev),
+		  __entry->block,
+		  __entry->len,
+		  __entry->owners,
+		  __entry->flags)
+)
+#define DEFINE_GETFSREFS_EVENT(name) \
+DEFINE_EVENT(xfs_getfsrefs_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_fsrefs *fsrefs), \
+	TP_ARGS(mp, fsrefs))
+DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_low_key);
+DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_high_key);
+DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_mapping);
+
 DECLARE_EVENT_CLASS(xfs_trans_resv_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int type,
 		 struct xfs_trans_res *res),

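
For anyone reading along, the gap-filling logic in the data device
handler above can be modeled in a few lines of userspace C.  This is
only a sketch under stated assumptions, not the kernel code: struct
extent, emit(), fill_gap(), and synthesize_owners() are hypothetical
names, block numbers are plain integers, and both input lists are
assumed to be sorted and non-overlapping.  It walks the shared-extent
list in order, and before each shared record it scans the free list
and reports any space that is neither free nor shared as an owners == 1
record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type: a run of blocks and its owner count. */
struct extent {
	uint32_t	start;
	uint32_t	len;
	uint32_t	owners;		/* ignored for free extents */
};

/* Count one record, and store it only if the output array has room. */
static size_t
emit(struct extent *out, size_t n, size_t max,
     uint32_t start, uint32_t len, uint32_t owners)
{
	if (n < max) {
		out[n].start = start;
		out[n].len = len;
		out[n].owners = owners;
	}
	return n + 1;
}

/* Report in-use, unshared space between *next and limit. */
static size_t
fill_gap(const struct extent *free_ext, size_t nfree, size_t *fi,
	 uint32_t *next, uint32_t limit,
	 struct extent *out, size_t n, size_t max)
{
	while (*fi < nfree && free_ext[*fi].start < limit) {
		const struct extent	*f = &free_ext[*fi];

		/* Inputs are sorted, so free space never moves backwards. */
		assert(f->start >= *next);
		if (f->start > *next)
			n = emit(out, n, max, *next, f->start - *next, 1);
		*next = f->start + f->len;
		(*fi)++;
	}

	/* Whatever remains before the limit is in use by one owner. */
	if (limit > *next) {
		n = emit(out, n, max, *next, limit - *next, 1);
		*next = limit;
	}
	return n;
}

/* Returns the number of records found; at most max are stored in out. */
size_t
synthesize_owners(const struct extent *shared, size_t nshared,
		  const struct extent *free_ext, size_t nfree,
		  uint32_t end, struct extent *out, size_t max)
{
	size_t		n = 0, fi = 0;
	uint32_t	next = 0;

	for (size_t i = 0; i < nshared; i++) {
		/* Report on any gaps first, then the shared record. */
		n = fill_gap(free_ext, nfree, &fi, &next, shared[i].start,
			     out, n, max);
		n = emit(out, n, max, shared[i].start, shared[i].len,
			 shared[i].owners);
		next = shared[i].start + shared[i].len;
	}
	return fill_gap(free_ext, nfree, &fi, &next, end, out, n, max);
}
```

With one shared extent covering blocks 20-29 (three owners) and free
extents at 5-9, 40-49, and 90-99 in a 100-block group, this reports
five records: 0-4, 10-19, 20-29 (owners 3), 30-39, and 50-89.  Passing
max == 0 only counts records, loosely like the fch_count == 0 case.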

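The continuation rule described in xfs_getfsrefs() above (feed the last
record back in as the low key, and let a nonzero length mean "resume
after this record") can also be sketched in userspace.  Everything
here is a hypothetical stand-in: struct rec, REC_LAST, query(), and
fetch_all() are illustration-only names, and query() scans a sorted
in-memory array rather than calling the ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REC_LAST	0x1	/* hypothetical stand-in for FCR_OF_LAST */

/* Hypothetical record: physical start, length, and output flags. */
struct rec {
	uint64_t	physical;
	uint64_t	length;
	uint32_t	flags;
};

/* Return up to count records that start at or after low + length. */
static size_t
query(const struct rec *db, size_t ndb, const struct rec *low,
      struct rec *out, size_t count, int *more)
{
	uint64_t	start = low->physical + low->length;
	size_t		n = 0;

	*more = 0;
	for (size_t i = 0; i < ndb; i++) {
		if (db[i].physical < start)
			continue;	/* starts before the resume point */
		if (n == count) {
			*more = 1;	/* buffer full, records remain */
			break;
		}
		out[n++] = db[i];
	}
	return n;
}

/* Fetch up to max records in batches; returns how many were stored. */
size_t
fetch_all(const struct rec *db, size_t ndb, struct rec *out,
	  size_t max, size_t batch)
{
	struct rec	low = { 0 };
	size_t		total = 0;
	int		more = 1;

	assert(batch > 0);
	while (more && total < max) {
		size_t	want = batch < max - total ? batch : max - total;
		size_t	got = query(db, ndb, &low, out + total, want, &more);

		if (got == 0)
			break;
		total += got;
		/* The last record returned becomes the next low key. */
		low = out[total - 1];
	}

	/* Flag the final record only if the result set is exhausted. */
	if (!more && total > 0)
		out[total - 1].flags |= REC_LAST;
	return total;
}
```

If the caller's buffer fills before the result set runs out, the last
record is left unflagged, so the caller knows to call again with that
record as the new low key.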

* [PATCHSET 4/5] xfs: defragment free space
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
                     ` (3 more replies)
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                   ` (11 subsequent siblings)
  15 siblings, 4 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

Hi all,

These patches contain experimental code to enable userspace to defragment
the free space in a filesystem.  Two purposes are imagined for this
functionality: clearing space at the end of a filesystem before
shrinking it, and clearing free space in anticipation of making a large
allocation.

The first patch adds a new fallocate mode that allows userspace to
allocate free space from the filesystem into a file.  The goal here is
to allow the filesystem shrink process to prevent allocation from a
certain part of the filesystem while a free space defragmenter evacuates
all the files from the doomed part of the filesystem.
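
To illustrate why pinning free space matters, here is a toy model in
userspace C.  It is a sketch under loud assumptions and does not use
the real interface: enum owner, map_free(), alloc_block(), and
evacuate() are hypothetical names, the "filesystem" is an array of
per-block states, and the allocator is a naive first-fit.  The point is
only that once the free blocks in the doomed range belong to a pin
file, relocated data cannot land back inside that range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block states for a toy filesystem. */
enum owner { FREE = 0, DATA, PINNED };

/* Claim every free block in [start, end) for the pin file. */
static size_t
map_free(enum owner *blk, size_t start, size_t end)
{
	size_t	claimed = 0;

	for (size_t i = start; i < end; i++) {
		if (blk[i] == FREE) {
			blk[i] = PINNED;
			claimed++;
		}
	}
	return claimed;
}

/* First-fit allocator; returns the block index, or (size_t)-1 if full. */
static size_t
alloc_block(enum owner *blk, size_t nblocks)
{
	for (size_t i = 0; i < nblocks; i++) {
		if (blk[i] == FREE) {
			blk[i] = DATA;
			return i;
		}
	}
	return (size_t)-1;
}

/* Empty [start, end) of data.  Returns 0 on success, -1 if out of space. */
int
evacuate(enum owner *blk, size_t nblocks, size_t start, size_t end)
{
	assert(start <= end && end <= nblocks);

	/* Fence off the doomed range so nothing new can land there. */
	map_free(blk, start, end);

	for (size_t i = start; i < end; i++) {
		if (blk[i] != DATA)
			continue;
		/* Pinned blocks are never FREE, so this lands outside. */
		if (alloc_block(blk, nblocks) == (size_t)-1)
			return -1;
		/* Release the old copy, then pin it before anyone else can. */
		blk[i] = FREE;
		map_free(blk, i, i + 1);
	}
	return 0;
}
```

When evacuate() returns 0, every block in the doomed range is owned by
the pin file, which is the state a shrink would want before cutting
that range off.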

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs: export realtime refcount information
 * xfs: capture the offset and length in fallocate tracepoints
 * xfs: add an ioctl to map free space into a file
 * xfs: implement FALLOC_FL_MAP_FREE for realtime files
---
 fs/xfs/libxfs/xfs_alloc.c |   88 ++++++++
 fs/xfs/libxfs/xfs_alloc.h |    3 
 fs/xfs/libxfs/xfs_bmap.c  |    1 
 fs/xfs/libxfs/xfs_fs.h    |   14 +
 fs/xfs/xfs_bmap_util.c    |  513 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h    |    3 
 fs/xfs/xfs_file.c         |  143 ++++++++++++-
 fs/xfs/xfs_file.h         |    2 
 fs/xfs/xfs_fsrefs.c       |  405 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c        |    5 
 fs/xfs/xfs_rtalloc.c      |  108 +++++++++
 fs/xfs/xfs_rtalloc.h      |    7 +
 fs/xfs/xfs_trace.h        |   86 +++++++-
 13 files changed, 1368 insertions(+), 10 deletions(-)



* [PATCH 1/4] xfs: export realtime refcount information
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
@ 2024-12-31 23:38   ` Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add support for reporting space refcount information from the realtime
volume.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_fsrefs.c |  405 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 405 insertions(+)


diff --git a/fs/xfs/xfs_fsrefs.c b/fs/xfs/xfs_fsrefs.c
index 85e109dba20f99..d5b77fe79b2653 100644
--- a/fs/xfs/xfs_fsrefs.c
+++ b/fs/xfs/xfs_fsrefs.c
@@ -478,6 +478,395 @@ xfs_fsrefs_logdev(
 	return xfs_fsrefs_helper(tp, info, &frec);
 }
 
+#ifdef CONFIG_XFS_RT
+/* Synthesize fsrefs records from rtbitmap records. */
+STATIC int
+xfs_fsrefs_rtdev_bitmap_helper(
+	struct xfs_rtgroup		*rtg,
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = rtg_mount(rtg);
+	struct xfs_fsrefs_info		*info = priv;
+	xfs_rtblock_t			next_rtb, rec_rtb, rtb;
+	xfs_rgnumber_t			next_rgno;
+	xfs_rgblock_t			next_rgbno;
+	xfs_rgblock_t			rec_rgbno;
+
+	/* Translate the free space record to group and block number. */
+	rec_rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+	rec_rgbno = xfs_rtb_to_rgbno(mp, rec_rtb);
+
+	/*
+	 * Figure out if there's a gap between the last fsrefs record we
+	 * emitted and this free extent.  If there is, report the gap as a
+	 * refcount==1 record.
+	 */
+	next_rtb = xfs_daddr_to_rtb(mp, info->next_daddr);
+	next_rgno = xfs_rtb_to_rgno(mp, next_rtb);
+	next_rgbno = xfs_rtb_to_rgbno(mp, next_rtb);
+
+	ASSERT(next_rgno >= info->group->xg_gno);
+	ASSERT(rec_rgbno >= next_rgbno);
+
+	/*
+	 * If we've already moved on to the next rtgroup, we don't have any
+	 * fsrefs records to synthesize.
+	 */
+	if (next_rgno > info->group->xg_gno)
+		return 0;
+
+	rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext + rec->ar_extcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rtb);
+
+	if (rec_rtb == next_rtb)
+		return 0;
+
+	/* Emit a record for the in-use space. */
+	frec.start_daddr = xfs_rtb_to_daddr(mp, next_rtb);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec_rgbno - next_rgbno);
+	frec.rec_key = next_rgbno;
+	return xfs_fsrefs_helper(tp, info, &frec);
+}
+
+/* Emit records to fill a gap in the refcount btree with singly-owned blocks. */
+STATIC int
+xfs_fsrefs_rtdev_fill_refcount_gap(
+	struct xfs_trans		*tp,
+	struct xfs_fsrefs_info		*info,
+	xfs_rgblock_t			rgbno)
+{
+	struct xfs_rtalloc_rec		high = { 0 };
+	struct xfs_mount		*mp = tp->t_mountp;
+	struct xfs_rtgroup		*rtg = to_rtg(info->group);
+	xfs_rtblock_t			start_rtbno =
+			xfs_daddr_to_rtb(mp, info->next_daddr);
+	xfs_rtblock_t			end_rtbno =
+			xfs_rgbno_to_rtb(rtg, rgbno);
+	xfs_rtxnum_t			low_rtx;
+	xfs_daddr_t			rec_daddr;
+	int				error;
+
+	ASSERT(xfs_rtb_to_rgno(mp, start_rtbno) == info->group->xg_gno);
+
+	low_rtx = xfs_rtb_to_rtx(mp, start_rtbno);
+	if (rgbno == -1U) {
+		/*
+		 * If the caller passes in an all 1s high key to signify the
+		 * end of the group, set the extent to all 1s as well.
+		 */
+		high.ar_startext = -1ULL;
+	} else {
+		high.ar_startext = xfs_rtb_to_rtx(mp,
+				end_rtbno + mp->m_sb.sb_rextsize - 1);
+	}
+	if (low_rtx >= high.ar_startext)
+		return 0;
+
+	error = xfs_rtalloc_query_range(rtg, tp, low_rtx, high.ar_startext,
+			xfs_fsrefs_rtdev_bitmap_helper, info);
+	if (error)
+		return error;
+
+	/*
+	 * Synthesize records for single-owner extents between the last
+	 * fsrefcount record emitted and the end of the query range.
+	 */
+	high.ar_startext = min(high.ar_startext, rtg->rtg_extents);
+	rec_daddr = xfs_rtb_to_daddr(mp, xfs_rtx_to_rtb(rtg, high.ar_startext));
+	if (info->next_daddr > rec_daddr)
+		return 0;
+
+	info->last = true;
+	return xfs_fsrefs_rtdev_bitmap_helper(rtg, tp, &high, info);
+}
+
+/* Transform an absolute-startblock refcount (rtdev, logdev) into a fsrefs */
+STATIC int
+xfs_fsrefs_rtdev_refcountbt_helper(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= rec->rc_refcount,
+		.rec_key		= rec->rc_startblock,
+	};
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_fsrefs_info		*info = priv;
+	struct xfs_rtgroup		*rtg = to_rtg(info->group);
+	xfs_rtblock_t			rec_rtbno;
+	int				error;
+
+	/*
+	 * Stop once we get to the CoW staging extents; they're all shoved to
+	 * the right side of the btree and were already covered by the rtbitmap
+	 * scan.
+	 */
+	if (rec->rc_domain != XFS_REFC_DOMAIN_SHARED)
+		return -ECANCELED;
+
+	/* Report on any gaps first */
+	error = xfs_fsrefs_rtdev_fill_refcount_gap(cur->bc_tp, info,
+			rec->rc_startblock);
+	if (error)
+		return error;
+
+	/* Report the refcount record from the refcount btree. */
+	rec_rtbno = xfs_rgbno_to_rtb(rtg, rec->rc_startblock);
+	frec.start_daddr = xfs_rtb_to_daddr(mp, rec_rtbno);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec->rc_blockcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rec_rtbno + rec->rc_blockcount);
+	return xfs_fsrefs_helper(cur->bc_tp, info, &frec);
+}
+
+#define XFS_RTGLOCK_FSREFS	(XFS_RTGLOCK_BITMAP | XFS_RTGLOCK_REFCOUNT)
+
+/* Execute a getfsrefs query against the realtime device. */
+STATIC int
+xfs_fsrefs_rtdev(
+	struct xfs_trans	*tp,
+	const struct xfs_fsrefs	*keys,
+	struct xfs_fsrefs_info	*info)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtgroup	*rtg = NULL, *locked_rtg = NULL;
+	xfs_rtblock_t		start_rtbno;
+	xfs_rtblock_t		end_rtbno;
+	xfs_rgnumber_t		start_rg;
+	xfs_rgnumber_t		end_rg;
+	uint64_t		eofs;
+	int			error = 0;
+
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
+	if (keys[0].fcr_physical >= eofs)
+		return 0;
+	start_rtbno = xfs_daddr_to_rtb(mp, keys[0].fcr_physical);
+	end_rtbno = xfs_daddr_to_rtb(mp, min(eofs - 1, keys[1].fcr_physical));
+
+	info->refc_cur = info->bno_cur = NULL;
+
+	/*
+	 * Convert the fsrefs low/high keys to rtgroup based keys.  Initialize
+	 * low to the fsrefs low key and max out the high key to the end of the
+	 * rtgroup.
+	 */
+	info->low.rc_startblock = xfs_rtb_to_rgbno(mp, start_rtbno);
+	info->low.rc_blockcount = XFS_BB_TO_FSBT(mp, keys[0].fcr_length);
+	info->low.rc_refcount = 0;
+	info->low.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	/* Adjust the low key if we are continuing from where we left off. */
+	if (info->low.rc_blockcount > 0) {
+		info->low.rc_startblock += info->low.rc_blockcount;
+
+		start_rtbno += info->low.rc_blockcount;
+		if (xfs_rtb_to_daddr(mp, start_rtbno) >= eofs)
+			return 0;
+	}
+
+	info->high.rc_startblock = -1U;
+	info->high.rc_blockcount = 0;
+	info->high.rc_refcount = 0;
+	info->high.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	start_rg = xfs_rtb_to_rgno(mp, start_rtbno);
+	end_rg = xfs_rtb_to_rgno(mp, end_rtbno);
+
+	/* Query each rtgroup */
+	while ((rtg = xfs_rtgroup_next_range(mp, rtg, start_rg, end_rg))) {
+		info->group = rtg_group(rtg);
+
+		/*
+		 * Set the rtgroup high key from the fsrefs high key if this
+		 * is the last rtgroup that we're querying.
+		 */
+		if (rtg_rgno(rtg) == end_rg)
+			info->high.rc_startblock = xfs_rtb_to_rgbno(mp,
+					end_rtbno);
+
+		if (info->refc_cur) {
+			xfs_btree_del_cursor(info->refc_cur, XFS_BTREE_NOERROR);
+			info->refc_cur = NULL;
+		}
+		if (locked_rtg)
+			xfs_rtgroup_unlock(locked_rtg, XFS_RTGLOCK_FSREFS);
+
+		trace_xfs_fsrefs_low_group_key(mp, info->dev, info->group,
+				&info->low);
+		trace_xfs_fsrefs_high_group_key(mp, info->dev, info->group,
+				&info->high);
+
+		xfs_rtgroup_lock(rtg, XFS_RTGLOCK_FSREFS);
+		locked_rtg = rtg;
+
+		/*
+		 * Fill the query with refcount records and synthesize
+		 * singly-owned block records from free space data.
+		 */
+		if (xfs_has_rtreflink(mp)) {
+			info->refc_cur = xfs_rtrefcountbt_init_cursor(tp, rtg);
+
+			error = xfs_refcount_query_range(info->refc_cur,
+					&info->low, &info->high,
+					xfs_fsrefs_rtdev_refcountbt_helper,
+					info);
+			if (error && error != -ECANCELED)
+				break;
+		}
+
+		/*
+		 * Synthesize refcount==1 records from the free space data
+		 * between the end of the last fsrefs record reported and the
+		 * end of the range.  If we don't have refcount support, the
+		 * starting point will be the start of the query range.
+		 */
+		error = xfs_fsrefs_rtdev_fill_refcount_gap(tp, info,
+				info->high.rc_startblock);
+		if (error)
+			break;
+
+		/*
+		 * Set the rtgroup low key to the start of the rtgroup prior to
+		 * moving on to the next rtgroup.
+		 */
+		if (rtg_rgno(rtg) == start_rg)
+			memset(&info->low, 0, sizeof(info->low));
+		info->group = NULL;
+	}
+
+	if (info->refc_cur) {
+		xfs_btree_del_cursor(info->refc_cur, error);
+		info->refc_cur = NULL;
+	}
+	if (locked_rtg)
+		xfs_rtgroup_unlock(locked_rtg, XFS_RTGLOCK_FSREFS);
+	if (info->group) {
+		xfs_rtgroup_rele(rtg);
+		info->group = NULL;
+	} else if (rtg) {
+		/* loop termination case */
+		xfs_rtgroup_rele(rtg);
+	}
+
+	return error;
+}
+
+/* Synthesize fsrefs records from 64-bit rtbitmap records. */
+STATIC int
+xfs_fsrefs_rtdev_nogroups_helper(
+	struct xfs_rtgroup		*rtg,
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = rtg_mount(rtg);
+	struct xfs_fsrefs_info		*info = priv;
+	xfs_rtblock_t			next_rtb, rec_rtb, rtb;
+
+	/* Translate the free space record to group and block number. */
+	rec_rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+
+	/*
+	 * Figure out if there's a gap between the last fsrefs record we
+	 * emitted and this free extent.  If there is, report the gap as a
+	 * refcount==1 record.
+	 */
+	next_rtb = xfs_daddr_to_rtb(mp, info->next_daddr);
+
+	ASSERT(rec_rtb >= next_rtb);
+
+	rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext + rec->ar_extcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rtb);
+
+	if (rec_rtb == next_rtb)
+		return 0;
+
+	/* Emit records for the in-use space. */
+	frec.start_daddr = xfs_rtb_to_daddr(mp, next_rtb);
+	frec.len_daddr = xfs_rtb_to_daddr(mp, rec_rtb - next_rtb);
+	return xfs_fsrefs_helper(tp, info, &frec);
+}
+
+/*
+ * Synthesize refcount information from the rtbitmap for a pre-rtgroups
+ * filesystem.
+ */
+STATIC int
+xfs_fsrefs_rtdev_nogroups(
+	struct xfs_trans	*tp,
+	const struct xfs_fsrefs	*keys,
+	struct xfs_fsrefs_info	*info)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtgroup	*rtg = NULL;
+	xfs_rtblock_t		start_rtbno;
+	xfs_rtblock_t		end_rtbno;
+	xfs_rtxnum_t		low_rtx;
+	xfs_rtxnum_t		high_rtx;
+	uint64_t		eofs;
+	int			error = 0;
+
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
+	if (keys[0].fcr_physical >= eofs)
+		return 0;
+	start_rtbno = xfs_daddr_to_rtb(mp, keys[0].fcr_physical);
+	end_rtbno = xfs_daddr_to_rtb(mp, min(eofs - 1, keys[1].fcr_physical));
+
+	info->refc_cur = info->bno_cur = NULL;
+
+	/*
+	 * There are no rtgroups on this filesystem, so track the fsrefs low
+	 * key as a linear disk address instead of converting it to a
+	 * group-relative key.
+	 */
+	info->low_daddr = keys[0].fcr_physical;
+
+	/* Adjust the low key if we are continuing from where we left off. */
+	if (keys[0].fcr_length > 0) {
+		info->low_daddr += keys[0].fcr_length;
+		if (info->low_daddr >= eofs)
+			return 0;
+	}
+
+	rtg = xfs_rtgroup_grab(mp, 0);
+	if (!rtg)
+		return -EFSCORRUPTED;
+
+	info->group = rtg_group(rtg);
+
+	trace_xfs_fsrefs_low_linear_key(mp, info->dev, start_rtbno);
+	trace_xfs_fsrefs_high_linear_key(mp, info->dev, end_rtbno);
+
+	xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP);
+
+	/*
+	 * Walk the whole rtbitmap.  Without rtgroups, the startext values can
+	 * be more than 32-bits wide, which is why we need this separate
+	 * implementation.
+	 */
+	low_rtx = xfs_rtb_to_rtx(mp, start_rtbno);
+	high_rtx = xfs_rtb_to_rtx(mp, end_rtbno + mp->m_sb.sb_rextsize - 1);
+	if (low_rtx < high_rtx)
+		error = xfs_rtalloc_query_range(rtg, tp, low_rtx, high_rtx,
+				xfs_fsrefs_rtdev_nogroups_helper, info);
+
+	info->group = NULL;
+
+	xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP);
+	xfs_rtgroup_rele(rtg);
+
+	return error;
+}
+#endif
+
 /* Do we recognize the device? */
 STATIC bool
 xfs_fsrefs_is_valid_device(
@@ -515,7 +904,14 @@ xfs_fsrefs_check_keys(
 	return false;
 }
 
+/*
+ * There are only two devices if we didn't configure RT devices at build time.
+ */
+#ifdef CONFIG_XFS_RT
+#define XFS_GETFSREFS_DEVS	3
+#else
 #define XFS_GETFSREFS_DEVS	2
+#endif /* CONFIG_XFS_RT */
 
 /*
  * Get filesystem's extent refcounts as described in head, and format for
@@ -569,6 +965,15 @@ xfs_getfsrefs(
 		handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev);
 		handlers[1].fn = xfs_fsrefs_logdev;
 	}
+#ifdef CONFIG_XFS_RT
+	if (mp->m_rtdev_targp) {
+		handlers[2].dev = new_encode_dev(mp->m_rtdev_targp->bt_dev);
+		if (xfs_has_rtgroups(mp))
+			handlers[2].fn = xfs_fsrefs_rtdev;
+		else
+			handlers[2].fn = xfs_fsrefs_rtdev_nogroups;
+	}
+#endif /* CONFIG_XFS_RT */
 
 	xfs_sort(handlers, XFS_GETFSREFS_DEVS, sizeof(struct xfs_fsrefs_dev),
 			xfs_fsrefs_dev_compare);



* [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
@ 2024-12-31 23:38   ` Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
  3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Change the class of the fallocate tracepoints to capture the offset and
length of the requested operation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    8 ++++----
 fs/xfs/xfs_file.c      |    2 +-
 fs/xfs/xfs_trace.h     |   10 +++++-----
 3 files changed, 10 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 783349f2361ad3..c9e60fb2693c9b 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -652,7 +652,7 @@ xfs_alloc_file_space(
 	if (xfs_is_always_cow_inode(ip))
 		return 0;
 
-	trace_xfs_alloc_file_space(ip);
+	trace_xfs_alloc_file_space(ip, offset, len);
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -839,7 +839,7 @@ xfs_free_file_space(
 	xfs_fileoff_t		endoffset_fsb;
 	int			done = 0, error;
 
-	trace_xfs_free_file_space(ip);
+	trace_xfs_free_file_space(ip, offset, len);
 
 	error = xfs_qm_dqattach(ip);
 	if (error)
@@ -987,7 +987,7 @@ xfs_collapse_file_space(
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
-	trace_xfs_collapse_file_space(ip);
+	trace_xfs_collapse_file_space(ip, offset, len);
 
 	error = xfs_free_file_space(ip, offset, len, ac);
 	if (error)
@@ -1056,7 +1056,7 @@ xfs_insert_file_space(
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
-	trace_xfs_insert_file_space(ip);
+	trace_xfs_insert_file_space(ip, offset, len);
 
 	error = xfs_bmap_can_insert_extents(ip, stop_fsb, shift_fsb);
 	if (error)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d31ad7bf29885d..b8f0b9a2998b9c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1362,7 +1362,7 @@ xfs_falloc_zero_range(
 	loff_t			new_size = 0;
 	int			error;
 
-	trace_xfs_zero_file_space(XFS_I(inode));
+	trace_xfs_zero_file_space(XFS_I(inode), offset, len);
 
 	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7043b6481d5f97..e81247b3024e53 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -928,11 +928,6 @@ DEFINE_INODE_EVENT(xfs_getattr);
 DEFINE_INODE_EVENT(xfs_setattr);
 DEFINE_INODE_EVENT(xfs_readlink);
 DEFINE_INODE_EVENT(xfs_inactive_symlink);
-DEFINE_INODE_EVENT(xfs_alloc_file_space);
-DEFINE_INODE_EVENT(xfs_free_file_space);
-DEFINE_INODE_EVENT(xfs_zero_file_space);
-DEFINE_INODE_EVENT(xfs_collapse_file_space);
-DEFINE_INODE_EVENT(xfs_insert_file_space);
 DEFINE_INODE_EVENT(xfs_readdir);
 #ifdef CONFIG_XFS_POSIX_ACL
 DEFINE_INODE_EVENT(xfs_get_acl);
@@ -1732,6 +1727,11 @@ DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append);
 DEFINE_SIMPLE_IO_EVENT(xfs_file_splice_read);
 DEFINE_SIMPLE_IO_EVENT(xfs_zoned_map_blocks);
+DEFINE_SIMPLE_IO_EVENT(xfs_alloc_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),



* [PATCH 3/4] xfs: add an ioctl to map free space into a file
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
@ 2024-12-31 23:38   ` Darrick J. Wong
  2024-12-31 23:38   ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
  3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new ioctl to map free physical space into a file, at the same file
offset as if the file were a sparse image of the physical device backing
the filesystem.  The intent here is to use this to prototype a free
space defragmentation tool.
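
As a rough illustration of how a userspace tool might drive this, here is a
minimal sketch.  The struct layout and ioctl number are copied from the
xfs_fs.h hunk in this patch and are not in any released header, so treat them
as provisional.  The three freesp_* helpers are hypothetical names invented
for this example; freesp_file_offset() assumes the file offset of an AG block
is simply its linear byte address on the data device, and the argument checks
only approximate what the kernel does.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Local copies of the definitions this patch adds to xfs_fs.h.  They are
 * not in any released header, so the layout and number are provisional.
 */
struct xfs_map_freesp {
	int64_t		offset;	/* disk address to map, in bytes */
	int64_t		len;	/* length in bytes */
	uint64_t	flags;	/* must be zero */
	uint64_t	pad;	/* must be zero */
};
#define XFS_IOC_MAP_FREESP	_IOW('X', 67, struct xfs_map_freesp)

/*
 * Hypothetical helper: byte offset at which AG block (agno, agbno) would
 * appear in the file, given agblocks blocks per AG and blocksize bytes per
 * block.  The file mirrors the data device, so this is the block's linear
 * byte address on that device.
 */
static inline int64_t
freesp_file_offset(uint32_t agno, uint32_t agbno, uint32_t agblocks,
		   uint32_t blocksize)
{
	return ((int64_t)agno * agblocks + agbno) * blocksize;
}

/*
 * Hypothetical helper: fill a request after rejecting negative offsets,
 * non-positive lengths, and ranges that would wrap.
 */
static inline int
freesp_fill_request(struct xfs_map_freesp *req, int64_t offset, int64_t len)
{
	if (offset < 0 || len <= 0 || offset > INT64_MAX - len)
		return -EINVAL;
	req->offset = offset;
	req->len = len;
	req->flags = 0;
	req->pad = 0;
	return 0;
}

/* Hypothetical wrapper: map free space in [offset, offset + len) into fd. */
static inline int
freesp_map_range(int fd, int64_t offset, int64_t len)
{
	struct xfs_map_freesp req;
	int error = freesp_fill_request(&req, offset, len);

	if (error)
		return error;
	if (ioctl(fd, XFS_IOC_MAP_FREESP, &req) < 0)
		return -errno;
	return 0;
}
```

A caller would open a regular non-realtime file for writing on the target
filesystem (with CAP_SYS_ADMIN) and pass its fd to freesp_map_range();
afterwards, any extent mapped at file offset X occupies physical byte address
X on the data device.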

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   88 +++++++++++++
 fs/xfs/libxfs/xfs_alloc.h |    3 
 fs/xfs/libxfs/xfs_bmap.c  |    1 
 fs/xfs/libxfs/xfs_fs.h    |   14 ++
 fs/xfs/xfs_bmap_util.c    |  303 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h    |    1 
 fs/xfs/xfs_file.c         |  139 +++++++++++++++++++++
 fs/xfs/xfs_file.h         |    2 
 fs/xfs/xfs_ioctl.c        |    5 +
 fs/xfs/xfs_trace.h        |   35 +++++
 10 files changed, 591 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3d33e17f2e5ce0..e689ec5cbccd7e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -4168,3 +4168,91 @@ xfs_extfree_intent_destroy_cache(void)
 	kmem_cache_destroy(xfs_extfree_item_cache);
 	xfs_extfree_item_cache = NULL;
 }
+
+/*
+ * Find the next chunk of free space in @pag starting at @agbno and going no
+ * higher than @end_agbno.  If we find free space, set @agbno and @len to
+ * describe it; otherwise set @agbno to @end_agbno and @len to zero.
+ */
+int
+xfs_alloc_find_freesp(
+	struct xfs_trans	*tp,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		*agbno,
+	xfs_agblock_t		end_agbno,
+	xfs_extlen_t		*len)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agf_bp = NULL;
+	xfs_agblock_t		found_agbno;
+	xfs_extlen_t		found_len;
+	int			found;
+	int			error;
+
+	trace_xfs_alloc_find_freesp(pag_group(pag), *agbno,
+			end_agbno - *agbno);
+
+	error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+	if (error)
+		return error;
+
+	cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag);
+
+	/* Try to find a free extent that starts before here. */
+	error = xfs_alloc_lookup_le(cur, *agbno, 0, &found);
+	if (error)
+		goto out_cur;
+	if (found) {
+		error = xfs_alloc_get_rec(cur, &found_agbno, &found_len,
+				&found);
+		if (error)
+			goto out_cur;
+		if (XFS_IS_CORRUPT(mp, !found)) {
+			xfs_btree_mark_sick(cur);
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		if (found_agbno + found_len > *agbno)
+			goto found;
+	}
+
+	/* No free extent covers *agbno, so examine the next record. */
+	error = xfs_btree_increment(cur, 0, &found);
+	if (error)
+		goto out_cur;
+	if (!found)
+		goto next_ag;
+
+	error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, &found);
+	if (error)
+		goto out_cur;
+	if (XFS_IS_CORRUPT(mp, !found)) {
+		xfs_btree_mark_sick(cur);
+		error = -EFSCORRUPTED;
+		goto out_cur;
+	}
+
+	if (found_agbno >= end_agbno)
+		goto next_ag;
+
+found:
+	/* Found something, so update the mapping. */
+	trace_xfs_alloc_find_freesp_done(pag_group(pag), found_agbno,
+			found_len);
+	if (found_agbno < *agbno) {
+		found_len -= *agbno - found_agbno;
+		found_agbno = *agbno;
+	}
+	*len = found_len;
+	*agbno = found_agbno;
+	goto out_cur;
+next_ag:
+	/* Found nothing, so advance the cursor beyond the end of the range. */
+	*agbno = end_agbno;
+	*len = 0;
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 50ef79a1ed41a1..069077d9ad2f8c 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -286,5 +286,8 @@ void xfs_extfree_intent_destroy_cache(void);
 
 xfs_failaddr_t xfs_validate_ag_length(struct xfs_buf *bp, uint32_t seqno,
 		uint32_t length);
+int xfs_alloc_find_freesp(struct xfs_trans *tp, struct xfs_perag *pag,
+		xfs_agblock_t *agbno, xfs_agblock_t end_agbno,
+		xfs_extlen_t *len);
 
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8c9d540c3ba91a..11dab550ca0fb6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -41,6 +41,7 @@
 #include "xfs_inode_util.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_rtalloc.h"
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 936f719236944f..f4128dbdf3b9a2 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1087,6 +1087,19 @@ xfs_getfsrefs_advance(
 /* fcr_flags values - returned for each non-header segment */
 #define FCR_OF_LAST		(1U << 0) /* last record in the dataset */
 
+/* map free space to file */
+
+/*
+ * XFS_IOC_MAP_FREESP maps free physical space in the given range into the file
+ * at the same byte offsets.  This ioctl requires CAP_SYS_ADMIN.
+ */
+struct xfs_map_freesp {
+	__s64	offset;		/* disk address to map, in bytes */
+	__s64	len;		/* length in bytes */
+	__u64	flags;		/* must be zero */
+	__u64	pad;		/* must be zero */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1127,6 +1140,7 @@ xfs_getfsrefs_advance(
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
+#define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c9e60fb2693c9b..8d5c2072bcd533 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -31,6 +31,10 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
+#include "xfs_health.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_ag.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -1916,3 +1920,302 @@ xfs_convert_rtbigalloc_file_space(
 	return 0;
 }
 #endif /* CONFIG_XFS_RT */
+
+/*
+ * Reserve space and quota for this transaction to map in as much free space
+ * as we can.  Callers should set @len to the amount of space desired; this
+ * function will shorten that quantity if it can't get space.
+ */
+STATIC int
+xfs_map_free_reserve_more(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	xfs_extlen_t		*len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	unsigned int		dblocks;
+	unsigned int		rblocks;
+	unsigned int		min_len;
+	bool			isrt = XFS_IS_REALTIME_INODE(ip);
+	int			error;
+
+	if (*len > XFS_MAX_BMBT_EXTLEN)
+		*len = XFS_MAX_BMBT_EXTLEN;
+	min_len = isrt ? mp->m_sb.sb_rextsize : 1;
+
+again:
+	if (isrt) {
+		dblocks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+		rblocks = *len;
+	} else {
+		dblocks = XFS_DIOSTRAT_SPACE_RES(mp, *len);
+		rblocks = 0;
+	}
+	error = xfs_trans_reserve_more_inode(tp, ip, dblocks, rblocks, false);
+	if (error == -ENOSPC && *len > min_len) {
+		*len >>= 1;
+		goto again;
+	}
+	if (error) {
+		trace_xfs_map_free_reserve_more_fail(ip, error, _RET_IP_);
+		return error;
+	}
+
+	return 0;
+}
+
+static inline xfs_fileoff_t
+xfs_fsblock_to_fileoff(
+	struct xfs_mount	*mp,
+	xfs_fsblock_t		fsbno)
+{
+	xfs_daddr_t		daddr = XFS_FSB_TO_DADDR(mp, fsbno);
+
+	return XFS_B_TO_FSB(mp, BBTOB(daddr));
+}
+
+/*
+ * Given a file and a free physical extent, map the extent into the file at the
+ * offset it would have if the file were a sparse image of the physical device.
+ * Set @mval to whatever mapping we added to the file.
+ */
+STATIC int
+xfs_map_free_ag_extent(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		agbno,
+	xfs_extlen_t		len,
+	struct xfs_bmbt_irec	*mval)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_alloc_arg	args = {
+		.mp		= mp,
+		.tp		= tp,
+		.pag		= pag,
+		.oinfo		= XFS_RMAP_OINFO_SKIP_UPDATE,
+		.resv		= XFS_AG_RESV_NONE,
+		.prod		= 1,
+		.datatype	= XFS_ALLOC_USERDATA,
+		.maxlen		= len,
+		.minlen		= 1,
+	};
+	struct xfs_bmbt_irec	irec;
+	xfs_fsblock_t		fsbno = xfs_gbno_to_fsb(pag_group(pag), agbno);
+	xfs_fileoff_t		startoff = xfs_fsblock_to_fileoff(mp, fsbno);
+	int			nimaps;
+	int			error;
+
+	ASSERT(!XFS_IS_REALTIME_INODE(ip));
+
+	trace_xfs_map_free_ag_extent(ip, fsbno, len);
+
+	/* Make sure the entire range is a hole. */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, startoff, len, &irec, &nimaps, 0);
+	if (error)
+		return error;
+
+	if (irec.br_startoff != startoff ||
+	    irec.br_startblock != HOLESTARTBLOCK ||
+	    irec.br_blockcount < len)
+		return -EINVAL;
+
+	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
+			XFS_IEXT_ADD_NOSPLIT_CNT);
+	if (error)
+		return error;
+
+	/*
+	 * Allocate the physical extent.  We should not have dropped the lock
+	 * since the scan of the free space metadata, so this should work,
+	 * though the length may be adjusted to play nicely with metadata space
+	 * reservations.
+	 */
+	error = xfs_alloc_vextent_exact_bno(&args, fsbno);
+	if (error)
+		return error;
+	if (args.fsbno == NULLFSBLOCK) {
+		/*
+		 * We were promised the space, but failed to get it.  This
+		 * could be because the space is reserved for metadata
+		 * expansion, or it could be because the AGFL fixup grabbed the
+		 * first block we wanted.  Either way, if the transaction is
+		 * dirty we must commit it and tell the caller to try again.
+		 */
+		if (tp->t_flags & XFS_TRANS_DIRTY)
+			return -EAGAIN;
+		return -ENOSPC;
+	}
+	if (args.fsbno != fsbno) {
+		ASSERT(0);
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
+		return -EFSCORRUPTED;
+	}
+
+	/* Map extent into file, update quota. */
+	mval->br_blockcount = args.len;
+	mval->br_startblock = fsbno;
+	mval->br_startoff = startoff;
+	mval->br_state = XFS_EXT_UNWRITTEN;
+
+	trace_xfs_map_free_ag_extent_done(ip, mval);
+
+	xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, mval);
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+			mval->br_blockcount);
+
+	return 0;
+}
+
+/* Find a free extent in this AG and map it into the file. */
+STATIC int
+xfs_map_free_extent(
+	struct xfs_inode	*ip,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		*cursor,
+	xfs_agblock_t		end_agbno,
+	xfs_agblock_t		*last_enospc_agbno)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	loff_t			endpos;
+	xfs_extlen_t		free_len, map_len;
+	int			error;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, 0, 0, false,
+			&tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_find_freesp(tp, pag, cursor, end_agbno, &free_len);
+	if (error)
+		goto out_cancel;
+
+	/* Bail out if the cursor is beyond what we asked for. */
+	if (*cursor >= end_agbno)
+		goto out_cancel;
+
+	error = xfs_map_free_reserve_more(tp, ip, &free_len);
+	if (error)
+		goto out_cancel;
+
+	map_len = free_len;
+	do {
+		error = xfs_map_free_ag_extent(tp, ip, pag, *cursor, map_len,
+				&irec);
+		if (error == -EAGAIN) {
+			/* Failed to map space but were told to try again. */
+			error = xfs_trans_commit(tp);
+			goto out;
+		}
+		if (error != -ENOSPC)
+			break;
+		/*
+		 * If we can't get the space, try asking for successively less
+		 * space in case we're bumping up against per-AG metadata
+		 * reservation limits.
+		 */
+		map_len >>= 1;
+	} while (map_len > 0);
+	if (error == -ENOSPC) {
+		if (*last_enospc_agbno != *cursor) {
+			/*
+			 * However, backing off on the size of the mapping
+			 * request might not work if an AGFL fixup allocated
+			 * the block at *cursor.  The first time this happens,
+			 * remember that we ran out of space here, and try
+			 * again.
+			 */
+			*last_enospc_agbno = *cursor;
+		} else {
+			/*
+			 * If we hit this a second time on the same extent,
+			 * then it's likely that we're bumping up against
+			 * per-AG space reservation limits.  Skip to the next
+			 * extent.
+			 */
+			*cursor += free_len;
+		}
+		error = 0;
+		goto out_cancel;
+	}
+	if (error)
+		goto out_cancel;
+
+	/* Update isize if needed. */
+	endpos = XFS_FSB_TO_B(mp, irec.br_startoff + irec.br_blockcount);
+	if (endpos > i_size_read(VFS_I(ip))) {
+		i_size_write(VFS_I(ip), endpos);
+		ip->i_disk_size = endpos;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		return error;
+
+	*cursor += irec.br_blockcount;
+	return 0;
+out_cancel:
+	xfs_trans_cancel(tp);
+out:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
+/*
+ * Allocate all free physical space in the byte range [off, off + len) and map
+ * it to this regular non-realtime file.
+ */
+int
+xfs_map_free_space(
+	struct xfs_inode	*ip,
+	xfs_off_t		off,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_perag	*pag = NULL;
+	xfs_daddr_t		off_daddr = BTOBB(off);
+	xfs_daddr_t		end_daddr = BTOBBT(off + len);
+	xfs_fsblock_t		off_fsb = XFS_DADDR_TO_FSB(mp, off_daddr);
+	xfs_fsblock_t		end_fsb = XFS_DADDR_TO_FSB(mp, end_daddr);
+	xfs_agnumber_t		off_agno = XFS_FSB_TO_AGNO(mp, off_fsb);
+	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsb);
+	int			error = 0;
+
+	trace_xfs_map_free_space(ip, off, len);
+
+	while ((pag = xfs_perag_next_range(mp, pag, off_agno,
+					   mp->m_sb.sb_agcount - 1))) {
+		xfs_agblock_t	off_agbno = 0;
+		xfs_agblock_t	end_agbno;
+		xfs_agblock_t	last_enospc_agbno = NULLAGBLOCK;
+
+		end_agbno = xfs_ag_block_count(mp, pag_agno(pag));
+
+		if (pag_agno(pag) == off_agno)
+			off_agbno = XFS_FSB_TO_AGBNO(mp, off_fsb);
+		if (pag_agno(pag) == end_agno)
+			end_agbno = XFS_FSB_TO_AGBNO(mp, end_fsb);
+
+		while (off_agbno < end_agbno) {
+			error = xfs_map_free_extent(ip, pag, &off_agbno,
+					end_agbno, &last_enospc_agbno);
+			if (error)
+				goto out;
+		}
+	}
+
+out:
+	if (pag)
+		xfs_perag_rele(pag);
+	if (error == -ENOSPC)
+		return 0;
+	return error;
+}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index c39cce66829e26..5d84b702b16326 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -63,6 +63,7 @@ int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
 		xfs_off_t len, struct xfs_zone_alloc_ctx *ac);
 int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 		xfs_off_t len);
+int	xfs_map_free_space(struct xfs_inode *ip, xfs_off_t off, xfs_off_t len);
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8f0b9a2998b9c..8bf1e96ab57a5b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -34,6 +34,7 @@
 #include <linux/mman.h>
 #include <linux/fadvise.h>
 #include <linux/mount.h>
+#include <linux/fsnotify.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1548,6 +1549,144 @@ xfs_file_fallocate(
 	return error;
 }
 
+STATIC int
+xfs_file_map_freesp(
+	struct file		*file,
+	const struct xfs_map_freesp *mf)
+{
+	struct inode		*inode = file_inode(file);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_off_t		device_size;
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	loff_t			new_size = 0;
+	int			error;
+
+	xfs_ilock(ip, iolock);
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Must wait for all AIO to complete before we continue as AIO can
+	 * change the file size on completion without holding any locks we
+	 * currently hold. We must do this first because AIO can update both
+	 * the on disk and in memory inode sizes, and the operations that follow
+	 * require the in-memory size to be fully up-to-date.
+	 */
+	inode_dio_wait(inode);
+
+	error = file_modified(file);
+	if (error)
+		goto out_unlock;
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		error = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+	device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks);
+
+	/*
+	 * Bail out now if we aren't allowed to make the file size the
+	 * same length as the device.
+	 */
+	if (device_size > i_size_read(inode)) {
+		new_size = device_size;
+		error = inode_newsize_ok(inode, new_size);
+		if (error)
+			goto out_unlock;
+	}
+
+	error = xfs_map_free_space(ip, mf->offset, mf->len);
+	if (error) {
+		if (error == -ECANCELED)
+			error = 0;
+		goto out_unlock;
+	}
+
+	/* Change file size if needed */
+	if (new_size) {
+		struct iattr iattr;
+
+		iattr.ia_valid = ATTR_SIZE;
+		iattr.ia_size = new_size;
+		error = xfs_vn_setattr_size(file_mnt_idmap(file),
+					    file_dentry(file), &iattr);
+		if (error)
+			goto out_unlock;
+	}
+
+	if (xfs_file_sync_writes(file))
+		error = xfs_log_force_inode(ip);
+
+out_unlock:
+	xfs_iunlock(ip, iolock);
+	return error;
+}
+
+long
+xfs_ioc_map_freesp(
+	struct file			*file,
+	struct xfs_map_freesp __user	*argp)
+{
+	struct xfs_map_freesp		args;
+	struct inode			*inode = file_inode(file);
+	int				error;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+
+	if (args.flags || args.pad)
+		return -EINVAL;
+
+	if (args.offset < 0 || args.len <= 0)
+		return -EINVAL;
+
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EBADF;
+
+	/*
+	 * This is not a pure fallocate, so reject append-only files.
+	 */
+	if (IS_APPEND(inode))
+		return -EPERM;
+
+	if (IS_IMMUTABLE(inode))
+		return -EPERM;
+
+	/*
+	 * We cannot allow any fallocate operation on an active swapfile
+	 */
+	if (IS_SWAPFILE(inode))
+		return -ETXTBSY;
+
+	if (S_ISFIFO(inode->i_mode))
+		return -ESPIPE;
+
+	if (S_ISDIR(inode->i_mode))
+		return -EISDIR;
+
+	if (!S_ISREG(inode->i_mode))
+		return -ENODEV;
+
+	/* Check for wrap through zero too */
+	if (args.offset + args.len > inode->i_sb->s_maxbytes)
+		return -EFBIG;
+	if (args.offset + args.len < 0)
+		return -EFBIG;
+
+	file_start_write(file);
+	error = xfs_file_map_freesp(file, &args);
+	if (!error)
+		fsnotify_modify(file);
+
+	file_end_write(file);
+	return error;
+}
+
 STATIC int
 xfs_file_fadvise(
 	struct file	*file,
diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index 24490ea49e16c6..c9d50699baba85 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -15,4 +15,6 @@ bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos,
 bool xfs_truncate_needs_cow_around(struct xfs_inode *ip, loff_t pos);
 int xfs_file_unshare_at(struct xfs_inode *ip, loff_t pos);
 
+long xfs_ioc_map_freesp(struct file *file, struct xfs_map_freesp __user	*argp);
+
 #endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 20f013bd4ce653..092a3699ff9e75 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -45,6 +45,8 @@
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
+#include <linux/security.h>
+#include <linux/fsnotify.h>
 
 /* Return 0 on success or positive error */
 int
@@ -1429,6 +1431,9 @@ xfs_file_ioctl(
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 
+	case XFS_IOC_MAP_FREESP:
+		return xfs_ioc_map_freesp(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e81247b3024e53..ebbc832db8fa1e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1732,6 +1732,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_map_free_space);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
@@ -1821,6 +1822,36 @@ TRACE_EVENT(xfs_bunmap,
 
 );
 
+DECLARE_EVENT_CLASS(xfs_map_free_extent_class,
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len),
+	TP_ARGS(ip, bno, len),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_fsize_t, size)
+		__field(xfs_fileoff_t, bno)
+		__field(xfs_extlen_t, len)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->size = ip->i_disk_size;
+		__entry->bno = bno;
+		__entry->len = len;
+	),
+	TP_printk("dev %d:%d ino 0x%llx disize 0x%llx fileoff 0x%llx fsbcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->size,
+		  __entry->bno,
+		  __entry->len)
+);
+#define DEFINE_MAP_FREE_EXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_map_free_extent_class, name, \
+	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len), \
+	TP_ARGS(ip, bno, len))
+DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_ag_extent);
+
 DECLARE_EVENT_CLASS(xfs_extent_busy_class,
 	TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno,
 		 xfs_extlen_t len),
@@ -1856,6 +1887,8 @@ DEFINE_BUSY_EVENT(xfs_extent_busy);
 DEFINE_BUSY_EVENT(xfs_extent_busy_force);
 DEFINE_BUSY_EVENT(xfs_extent_busy_reuse);
 DEFINE_BUSY_EVENT(xfs_extent_busy_clear);
+DEFINE_BUSY_EVENT(xfs_alloc_find_freesp);
+DEFINE_BUSY_EVENT(xfs_alloc_find_freesp_done);
 
 TRACE_EVENT(xfs_extent_busy_trim,
 	TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno,
@@ -3962,6 +3995,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class,
 DEFINE_EVENT(xfs_inode_irec_class, name, \
 	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \
 	TP_ARGS(ip, irec))
+DEFINE_INODE_IREC_EVENT(xfs_map_free_ag_extent_done);
 
 /* inode iomap invalidation events */
 DECLARE_EVENT_CLASS(xfs_wb_invalid_class,
@@ -4096,6 +4130,7 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src);
 DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest);
+DEFINE_INODE_ERROR_EVENT(xfs_map_free_reserve_more_fail);
 
 /* dedupe tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents);
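
The range checks at the end of xfs_ioc_map_freesp above compute
args.offset + args.len directly and then test for a negative result,
which depends on the kernel being built with wrapping signed arithmetic.
A rough userspace sketch of the same checks, rewritten so that the sum is
never formed, might look like the following.  check_map_range() and
max_bytes are hypothetical names used only for illustration; max_bytes
stands in for inode->i_sb->s_maxbytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Userspace sketch of the argument checks, not the kernel code.
 * check_map_range() and max_bytes are hypothetical names.
 */
static int check_map_range(int64_t offset, int64_t len, int64_t max_bytes)
{
	assert(max_bytes >= 0);

	/* Reject negative offsets and empty or negative lengths. */
	if (offset < 0 || len <= 0)
		return -EINVAL;

	/*
	 * Equivalent to offset + len > max_bytes, written as a subtraction
	 * so that signed overflow cannot occur.
	 */
	if (offset > max_bytes || len > max_bytes - offset)
		return -EFBIG;

	return 0;
}
```

Because offset is known to be no larger than max_bytes by the time the
subtraction runs, the second test covers both the "too big" case and the
"wraps through zero" case in one comparison.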



* [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:38   ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
@ 2024-12-31 23:38   ` Darrick J. Wong
  3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Implement mapfree for realtime space.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h |    2 
 fs/xfs/xfs_file.c      |   14 ++-
 fs/xfs/xfs_rtalloc.c   |  108 ++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h   |    7 ++
 fs/xfs/xfs_trace.h     |   41 ++++++++++
 6 files changed, 368 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8d5c2072bcd533..83e6c27f63a969 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -2219,3 +2219,205 @@ xfs_map_free_space(
 		return 0;
 	return error;
 }
+
+#ifdef CONFIG_XFS_RT
+/*
+ * Given a file and a free rt extent, map it into the file at the same offset
+ * if the file were a sparse image of the physical device.  Set @mval to
+ * whatever mapping we added to the file.
+ */
+STATIC int
+xfs_map_free_rtgroup_extent(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_rtgroup	*rtg,
+	xfs_rtxnum_t		rtx,
+	xfs_rtxlen_t		rtxlen,
+	struct xfs_bmbt_irec	*mval)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fsblock_t		fsbno = xfs_rtx_to_rtb(rtg, rtx);
+	xfs_fileoff_t		startoff = fsbno;
+	xfs_extlen_t		len = xfs_rtbxlen_to_blen(mp, rtxlen);
+	int			nimaps;
+	int			error;
+
+	ASSERT(XFS_IS_REALTIME_INODE(ip));
+
+	trace_xfs_map_free_rt_extent(ip, fsbno, len);
+
+	/* Make sure the entire range is a hole. */
+	nimaps = 1;
+	error = xfs_bmapi_read(ip, startoff, len, &irec, &nimaps, 0);
+	if (error)
+		return error;
+
+	if (irec.br_startoff != startoff ||
+	    irec.br_startblock != HOLESTARTBLOCK ||
+	    irec.br_blockcount < len)
+		return -EINVAL;
+
+	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
+			XFS_IEXT_ADD_NOSPLIT_CNT);
+	if (error)
+		return error;
+
+	/*
+	 * Allocate the physical extent.  We should not have dropped the lock
+	 * since the scan of the free space metadata, so this should work,
+	 * though the length may be adjusted to play nicely with metadata space
+	 * reservations.
+	 */
+	error = xfs_rtallocate_exact(tp, rtg, rtx, rtxlen);
+	if (error)
+		return error;
+
+	/* Map extent into file, update quota. */
+	mval->br_blockcount = len;
+	mval->br_startblock = fsbno;
+	mval->br_startoff = startoff;
+	mval->br_state = XFS_EXT_UNWRITTEN;
+
+	trace_xfs_map_free_rt_extent_done(ip, mval);
+
+	xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, mval);
+	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_RTBCOUNT,
+			mval->br_blockcount);
+
+	return 0;
+}
+
+/* Find a free extent in this rtgroup and map it into the file. */
+STATIC int
+xfs_map_free_rt_extent(
+	struct xfs_inode	*ip,
+	struct xfs_rtgroup	*rtg,
+	xfs_rtxnum_t		*cursor,
+	xfs_rtxnum_t		end_rtx)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	loff_t			endpos;
+	xfs_rtxlen_t		len_rtx;
+	xfs_extlen_t		free_len;
+	int			error;
+
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, 0, 0, false,
+			&tp);
+	if (error)
+		return error;
+
+	xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP);
+
+	error = xfs_rtallocate_find_freesp(tp, rtg, cursor, end_rtx, &len_rtx);
+	if (error)
+		goto out_rtglock;
+
+	/*
+	 * If the cursor has reached the end of the rtgroup or is past what the
+	 * user asked for, bail out.
+	 */
+	if (*cursor >= end_rtx)
+		goto out_rtglock;
+
+	free_len = xfs_rtxlen_to_extlen(mp, len_rtx);
+	error = xfs_map_free_reserve_more(tp, ip, &free_len);
+	if (error)
+		goto out_rtglock;
+
+	error = xfs_map_free_rtgroup_extent(tp, ip, rtg, *cursor, len_rtx,
+			&irec);
+	if (error == -EAGAIN) {
+		/*
+		 * The allocator was busy and told us to try again.  The
+		 * transaction could be dirty due to a nrext64 upgrade, so
+		 * commit the transaction and try again without advancing
+		 * the cursor.
+		 *
+		 * XXX do we fail to unlock something here?
+		 */
+		xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP);
+		error = xfs_trans_commit(tp);
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		return error;
+	}
+	if (error)
+		goto out_cancel;
+
+	/* Update isize if needed. */
+	endpos = XFS_FSB_TO_B(mp, irec.br_startoff + irec.br_blockcount);
+	if (endpos > i_size_read(VFS_I(ip))) {
+		i_size_write(VFS_I(ip), endpos);
+		ip->i_disk_size = endpos;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	}
+
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (error)
+		return error;
+
+	ASSERT(xfs_blen_to_rtxoff(mp, irec.br_blockcount) == 0);
+	*cursor += xfs_extlen_to_rtxlen(mp, irec.br_blockcount);
+	return 0;
+out_rtglock:
+	xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP);
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
+/*
+ * Allocate all free physical space between off and off + len and map it to
+ * this regular realtime file.
+ */
+int
+xfs_map_free_rt_space(
+	struct xfs_inode	*ip,
+	xfs_off_t		off,
+	xfs_off_t		len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_rtgroup	*rtg = NULL;
+	xfs_daddr_t		off_daddr = BTOBB(off);
+	xfs_daddr_t		end_daddr = BTOBBT(off + len);
+	xfs_rtblock_t		off_rtb = xfs_daddr_to_rtb(mp, off_daddr);
+	xfs_rtblock_t		end_rtb = xfs_daddr_to_rtb(mp, end_daddr);
+	xfs_rgnumber_t		off_rgno = xfs_rtb_to_rgno(mp, off_rtb);
+	xfs_rgnumber_t		end_rgno = xfs_rtb_to_rgno(mp, end_rtb);
+	int			error = 0;
+
+	trace_xfs_map_free_rt_space(ip, off, len);
+
+	while ((rtg = xfs_rtgroup_next_range(mp, rtg, off_rgno,
+					     mp->m_sb.sb_rgcount))) {
+		xfs_rtxnum_t	off_rtx = 0;
+		xfs_rtxnum_t	end_rtx = rtg->rtg_extents;
+
+		if (rtg_rgno(rtg) == off_rgno)
+			off_rtx = xfs_rtb_to_rtx(mp, off_rtb);
+		if (rtg_rgno(rtg) == end_rgno)
+			end_rtx = min(end_rtx, xfs_rtb_to_rtx(mp, end_rtb));
+
+		while (off_rtx < end_rtx) {
+			error = xfs_map_free_rt_extent(ip, rtg, &off_rtx,
+					end_rtx);
+			if (error)
+				goto out;
+		}
+	}
+
+out:
+	if (rtg)
+		xfs_rtgroup_rele(rtg);
+	if (error == -ENOSPC)
+		return 0;
+	return error;
+}
+#endif
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 5d84b702b16326..0e16fbfef6cd09 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -85,8 +85,10 @@ int	xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset,
 #ifdef CONFIG_XFS_RT
 int xfs_convert_rtbigalloc_file_space(struct xfs_inode *ip, loff_t pos,
 		uint64_t len);
+int xfs_map_free_rt_space(struct xfs_inode *ip, xfs_off_t off, xfs_off_t len);
 #else
 # define xfs_convert_rtbigalloc_file_space(ip, pos, len)	(-EOPNOTSUPP)
+# define xfs_map_free_rt_space(ip, off, len)			(-EOPNOTSUPP)
 #endif
 
 #endif	/* __XFS_BMAP_UTIL_H__ */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8bf1e96ab57a5b..ceb7936e5fd9a3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1580,11 +1580,10 @@ xfs_file_map_freesp(
 	if (error)
 		goto out_unlock;
 
-	if (XFS_IS_REALTIME_INODE(ip)) {
-		error = -EOPNOTSUPP;
-		goto out_unlock;
-	}
-	device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks);
+	if (XFS_IS_REALTIME_INODE(ip))
+		device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_rblocks);
+	else
+		device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks);
 
 	/*
 	 * Bail out now if we aren't allowed to make the file size the
@@ -1597,7 +1596,10 @@ xfs_file_map_freesp(
 			goto out_unlock;
 	}
 
-	error = xfs_map_free_space(ip, mf->offset, mf->len);
+	if (XFS_IS_REALTIME_INODE(ip))
+		error = xfs_map_free_rt_space(ip, mf->offset, mf->len);
+	else
+		error = xfs_map_free_space(ip, mf->offset, mf->len);
 	if (error) {
 		if (error == -ECANCELED)
 			error = 0;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 2728c568ac5a8a..0a4e087b11b60e 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -2230,3 +2230,111 @@ xfs_bmap_rtalloc(
 	xfs_bmap_alloc_account(ap);
 	return 0;
 }
+
+/*
+ * Find the next free realtime extent starting at @rtx and going no higher than
+ * @end_rtx.  Set @rtx and @len_rtx to the free extent we find, or move @rtx to
+ * @end_rtx or beyond if we find no space.
+ */
+int
+xfs_rtallocate_find_freesp(
+	struct xfs_trans	*tp,
+	struct xfs_rtgroup	*rtg,
+	xfs_rtxnum_t		*rtx,
+	xfs_rtxnum_t		end_rtx,
+	xfs_rtxlen_t		*len_rtx)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtalloc_args	args = {
+		.rtg		= rtg,
+		.mp		= mp,
+		.tp		= tp,
+	};
+	const unsigned int	max_rtxlen =
+			xfs_blen_to_rtbxlen(mp, XFS_MAX_BMBT_EXTLEN);
+	int			error;
+
+	trace_xfs_rtallocate_find_freesp(rtg, *rtx, end_rtx - *rtx);
+
+	while (*rtx < end_rtx) {
+		xfs_rtxnum_t	next_rtx;
+		int		is_free = 0;
+
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		/* Is the first rtx in the range free? */
+		error = xfs_rtcheck_range(&args, *rtx, 1, 1, &next_rtx,
+				&is_free);
+		if (error)
+			return error;
+
+		/* Free or not, how many more rtx have the same status? */
+		error = xfs_rtfind_forw(&args, *rtx, end_rtx, &next_rtx);
+		if (error)
+			return error;
+
+		if (is_free) {
+			*len_rtx = min_t(xfs_rtxlen_t, max_rtxlen,
+					 next_rtx - *rtx + 1);
+
+			trace_xfs_rtallocate_find_freesp_done(rtg, *rtx,
+					*len_rtx);
+			return 0;
+		}
+
+		*rtx = next_rtx + 1;
+	}
+
+	return 0;
+}
+
+/* Allocate exactly this space from the rt device. */
+int
+xfs_rtallocate_exact(
+	struct xfs_trans	*tp,
+	struct xfs_rtgroup	*rtg,
+	xfs_rtxnum_t		rtx,
+	xfs_rtxlen_t		len)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtalloc_args	args = {
+		.rtg		= rtg,
+		.mp		= mp,
+		.tp		= tp,
+	};
+	int			error;
+
+	trace_xfs_rtallocate_exact(rtg, rtx, len);
+
+	if (xfs_has_rtgroups(mp)) {
+		xfs_rtxnum_t	resrtx = rtx;
+		xfs_rtxlen_t	reslen = len;
+
+		/*
+		 * Never pass 0 for start here so that the busy extent code
+		 * knows that we wanted a near allocation and will flush the
+		 * log to wait for the start to become available.
+		 */
+		error = xfs_rtallocate_adjust_for_busy(&args, rtx ? rtx : 1, 1,
+				len, &reslen, 1, &resrtx);
+		if (error)
+			return error;
+
+		if (resrtx != rtx) {
+			ASSERT(resrtx == rtx);
+			return -EAGAIN;
+		}
+
+		len = reslen;
+	}
+
+	xfs_rtgroup_trans_join(tp, rtg, XFS_RTGLOCK_BITMAP);
+
+	error = xfs_rtallocate_range(&args, rtx, len);
+	if (error)
+		return error;
+
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS, -(long)len);
+	return 0;
+}
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 0d95b29092c9f3..745af8a2798d36 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -10,6 +10,7 @@
 
 struct xfs_mount;
 struct xfs_trans;
+struct xfs_rtgroup;
 
 #ifdef CONFIG_XFS_RT
 /* rtgroup superblock initialization */
@@ -48,6 +49,10 @@ xfs_growfs_rt(
 int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
 int xfs_growfs_check_rtgeom(const struct xfs_mount *mp, xfs_rfsblock_t dblocks,
 		xfs_rfsblock_t rblocks, xfs_agblock_t rextsize);
+int xfs_rtallocate_find_freesp(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
+		xfs_rtxnum_t *rtx, xfs_rtxnum_t end_rtx, xfs_rtxlen_t *len_rtx);
+int xfs_rtallocate_exact(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
+		xfs_rtxnum_t rtx, xfs_rtxlen_t rtxlen);
 #else
 # define xfs_growfs_rt(mp,in)				(-ENOSYS)
 # define xfs_rtalloc_reinit_frextents(m)		(0)
@@ -67,6 +72,8 @@ xfs_rtmount_init(
 # define xfs_rtunmount_inodes(m)
 # define xfs_rt_resv_free(mp)				((void)0)
 # define xfs_rt_resv_init(mp)				(0)
+# define xfs_rtallocate_find_freesp(...)		(-EOPNOTSUPP)
+# define xfs_rtallocate_exact(...)			(-EOPNOTSUPP)
 
 static inline int
 xfs_growfs_check_rtgeom(const struct xfs_mount *mp,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ebbc832db8fa1e..76f5d78b6a6e09 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -105,6 +105,7 @@ struct xfs_rtgroup;
 struct xfs_open_zone;
 struct xfs_fsrefs;
 struct xfs_fsrefs_irec;
+struct xfs_rtgroup;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -1732,6 +1733,9 @@ DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space);
 DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space);
+#ifdef CONFIG_XFS_RT
+DEFINE_SIMPLE_IO_EVENT(xfs_map_free_rt_space);
+#endif /* CONFIG_XFS_RT */
 DEFINE_SIMPLE_IO_EVENT(xfs_map_free_space);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
@@ -1851,6 +1855,9 @@ DEFINE_EVENT(xfs_map_free_extent_class, name, \
 	TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len), \
 	TP_ARGS(ip, bno, len))
 DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_ag_extent);
+#ifdef CONFIG_XFS_RT
+DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_rt_extent);
+#endif
 
 DECLARE_EVENT_CLASS(xfs_extent_busy_class,
 	TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno,
@@ -1995,6 +2002,37 @@ TRACE_EVENT(xfs_rtalloc_extent_busy_trim,
 		  __entry->new_rtx,
 		  __entry->new_len)
 );
+
+DECLARE_EVENT_CLASS(xfs_rtextent_class,
+	TP_PROTO(struct xfs_rtgroup *rtg, xfs_rtxnum_t off_rtx,
+		 xfs_rtxlen_t len_rtx),
+	TP_ARGS(rtg, off_rtx, len_rtx),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_rgnumber_t, rgno)
+		__field(xfs_rtxnum_t, off_rtx)
+		__field(xfs_rtxlen_t, len_rtx)
+	),
+	TP_fast_assign(
+		__entry->dev = rtg_mount(rtg)->m_super->s_dev;
+		__entry->rgno = rtg_rgno(rtg);
+		__entry->off_rtx = off_rtx;
+		__entry->len_rtx = len_rtx;
+	),
+	TP_printk("dev %d:%d rgno 0x%x rtx 0x%llx rtxcount 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->rgno,
+		  __entry->off_rtx,
+		  __entry->len_rtx)
+);
+#define DEFINE_RTEXTENT_EVENT(name) \
+DEFINE_EVENT(xfs_rtextent_class, name, \
+	TP_PROTO(struct xfs_rtgroup *rtg, xfs_rtxnum_t off_rtx, \
+		 xfs_rtxlen_t len_rtx), \
+	TP_ARGS(rtg, off_rtx, len_rtx))
+DEFINE_RTEXTENT_EVENT(xfs_rtallocate_exact);
+DEFINE_RTEXTENT_EVENT(xfs_rtallocate_find_freesp);
+DEFINE_RTEXTENT_EVENT(xfs_rtallocate_find_freesp_done);
 #endif /* CONFIG_XFS_RT */
 
 DECLARE_EVENT_CLASS(xfs_agf_class,
@@ -3996,6 +4034,9 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \
 	TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \
 	TP_ARGS(ip, irec))
 DEFINE_INODE_IREC_EVENT(xfs_map_free_ag_extent_done);
+#ifdef CONFIG_XFS_RT
+DEFINE_INODE_IREC_EVENT(xfs_map_free_rt_extent_done);
+#endif
 
 /* inode iomap invalidation events */
 DECLARE_EVENT_CLASS(xfs_wb_invalid_class,
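
The cursor-driven scan in xfs_rtallocate_find_freesp and its caller can be
hard to picture from the diff alone.  The following is a simplified
userspace model, assuming a plain array of booleans in place of the
realtime bitmap; find_free_run() and free_map are hypothetical names, and
the real code asks the bitmap helpers for the status of the first extent
and the length of the run instead of walking one extent at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace sketch: free_map[i] is true when extent i is free.
 *
 * Starting at *cursor, find the next run of free extents below end.  On
 * return *cursor is the start of the run and *len its length (capped at
 * max_len), or *cursor == end and *len == 0 if nothing is free.
 */
static void find_free_run(const bool *free_map, uint64_t *cursor,
			  uint64_t end, uint32_t max_len, uint32_t *len)
{
	assert(max_len > 0);
	*len = 0;

	/* Skip over allocated extents. */
	while (*cursor < end && !free_map[*cursor])
		(*cursor)++;

	/* Measure the free run, stopping at the cap or the end. */
	while (*cursor + *len < end && *len < max_len &&
	       free_map[*cursor + *len])
		(*len)++;
}
```

A caller would map the run it was given and then advance the cursor by
the mapped length, which is roughly what xfs_map_free_rt_extent does after
a successful commit.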



* [PATCHSET 5/5] xfs: live health monitoring of filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (3 preceding siblings ...)
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
                     ` (15 more replies)
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
                   ` (10 subsequent siblings)
  15 siblings, 16 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: create debugfs uuid aliases
 * xfs: create hooks for monitoring health updates
 * xfs: create a filesystem shutdown hook
 * xfs: create hooks for media errors
 * iomap, filemap: report buffered read and write io errors to the filesystem
 * iomap: report directio read and write errors to callers
 * xfs: create file io error hooks
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: allow reconfiguration of the health monitoring device
 * xfs: add media error reporting ioctl
 * xfs: send uevents when mounting and unmounting a filesystem
---
 Documentation/filesystems/vfs.rst       |    7 
 fs/iomap/buffered-io.c                  |   26 +
 fs/iomap/direct-io.c                    |    4 
 fs/xfs/Kconfig                          |    8 
 fs/xfs/Makefile                         |    7 
 fs/xfs/libxfs/xfs_fs.h                  |   31 +
 fs/xfs/libxfs/xfs_health.h              |   47 +
 fs/xfs/libxfs/xfs_healthmon.schema.json |  595 +++++++++++++
 fs/xfs/xfs_aops.c                       |    2 
 fs/xfs/xfs_file.c                       |  167 ++++
 fs/xfs/xfs_file.h                       |   36 +
 fs/xfs/xfs_fsops.c                      |   57 +
 fs/xfs/xfs_fsops.h                      |   14 
 fs/xfs/xfs_health.c                     |  202 +++++
 fs/xfs/xfs_healthmon.c                  | 1372 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h                  |  102 ++
 fs/xfs/xfs_ioctl.c                      |    7 
 fs/xfs/xfs_linux.h                      |    3 
 fs/xfs/xfs_mount.h                      |   13 
 fs/xfs/xfs_notify_failure.c             |  137 +++
 fs/xfs/xfs_notify_failure.h             |   44 +
 fs/xfs/xfs_super.c                      |   55 +
 fs/xfs/xfs_trace.c                      |    4 
 fs/xfs/xfs_trace.h                      |  369 ++++++++
 include/linux/fs.h                      |    4 
 include/linux/iomap.h                   |    2 
 26 files changed, 3301 insertions(+), 14 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json
 create mode 100644 fs/xfs/xfs_healthmon.c
 create mode 100644 fs/xfs/xfs_healthmon.h



* [PATCH 01/16] xfs: create debugfs uuid aliases
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:39   ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 02/16] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an alias for the debugfs dir so that we can find a filesystem by
uuid, unless it is mounted with nouuid.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   11 +++++++++++
 2 files changed, 12 insertions(+)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 87007d9de5d9d0..d73e76e36bfc10 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct xfs_zone_info	*m_zone_info;	/* zone allocator information */
 	struct dentry		*m_debugfs;	/* debugfs parent */
+	struct dentry		*m_debugfs_uuid; /* debugfs symlink */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 099c30339e8f9d..fd641853fe3595 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -780,6 +780,7 @@ xfs_mount_free(
 	if (mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_ddev_targp);
 
+	debugfs_remove(mp->m_debugfs_uuid);
 	debugfs_remove(mp->m_debugfs);
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
@@ -1893,6 +1894,16 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+		char	name[UUID_STRING_LEN + 1];
+
+		snprintf(name, UUID_STRING_LEN + 1, "%pU", &mp->m_sb.sb_uuid);
+		mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+				mp->m_super->s_id);
+	} else {
+		mp->m_debugfs_uuid = NULL;
+	}
+
 	return 0;
 
  out_filestream_unmount:
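
The symlink name in the hunk above comes from the kernel-only "%pU" format
specifier, which has no direct userspace equivalent.  A minimal userspace
stand-in, assuming the default lowercase big-endian rendering, could look
like this; format_uuid() is a hypothetical helper and not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UUID_STRING_LEN	36	/* 32 hex digits plus 4 hyphens */

/*
 * Format 16 raw bytes as lowercase 8-4-4-4-12 hex.  The caller supplies
 * a buffer with room for the trailing NUL.
 */
static void format_uuid(const uint8_t uuid[16], char name[UUID_STRING_LEN + 1])
{
	char *p = name;
	int i;

	for (i = 0; i < 16; i++) {
		/* Hyphens separate the 4-, 2-, 2-, 2- and 6-byte fields. */
		if (i == 4 || i == 6 || i == 8 || i == 10)
			*p++ = '-';
		p += sprintf(p, "%02x", uuid[i]);
	}
	assert(p - name == UUID_STRING_LEN);
}
```

The resulting string is what a monitoring tool would look up under the xfs
debugfs directory to reach the per-mount directory named after the
superblock id.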



* [PATCH 02/16] xfs: create hooks for monitoring health updates
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
@ 2024-12-31 23:39   ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 03/16] xfs: create a filesystem shutdown hook Darrick J. Wong
                     ` (13 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.
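
As a rough illustration of the pattern, here is a small userspace model of
a hook chain that skips all work when nobody is listening and only
notifies listeners when the update mask is nonzero.  Every name in it is
hypothetical.  The users counter stands in for the static key, and the
model folds the enable/disable step into add/del, which the kernel code
keeps separate.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*hook_fn)(unsigned int op, unsigned int old_mask,
		       unsigned int new_mask, void *priv);

struct hook {
	hook_fn		fn;
	void		*priv;
	struct hook	*next;
};

struct hook_chain {
	struct hook	*head;
	unsigned int	users;	/* stands in for the static key */
};

static void hook_add(struct hook_chain *c, struct hook *h)
{
	assert(h->fn != NULL);
	h->next = c->head;
	c->head = h;
	c->users++;
}

static void hook_del(struct hook_chain *c, struct hook *h)
{
	struct hook **pp;

	for (pp = &c->head; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			c->users--;
			return;
		}
	}
}

/* Notify listeners; cheap no-op when unused or when nothing changed. */
static void health_update(struct hook_chain *c, unsigned int op,
			  unsigned int old_mask, unsigned int new_mask)
{
	struct hook *h;

	if (c->users == 0 || new_mask == 0)
		return;
	for (h = c->head; h; h = h->next)
		h->fn(op, old_mask, new_mask, h->priv);
}

/* Example listener: count the calls it receives. */
static int count_events(unsigned int op, unsigned int old_mask,
			unsigned int new_mask, void *priv)
{
	(void)op;
	(void)old_mask;
	(void)new_mask;
	(*(unsigned int *)priv)++;
	return 0;
}
```

The point of the early return is that the common case, with no monitor
attached, costs one comparison here and close to nothing with a real
static branch.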

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   47 ++++++++++
 fs/xfs/xfs_health.c        |  202 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h         |    3 +
 fs/xfs/xfs_super.c         |    1 
 4 files changed, 252 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 7c541fb373d5b2..abf9460ae79953 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -20,6 +20,157 @@
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of health updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch);
+
+void
+xfs_health_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_health_hooks_switch);
+}
+
+void
+xfs_health_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_health_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem unmount health update. */
+static inline void
+xfs_health_unmount_hook(
+	struct xfs_mount		*mp)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+		};
+
+		xfs_hooks_call(&mp->m_health_update_hooks,
+				XFS_HEALTHUP_UNMOUNT, &p);
+	}
+}
+
+/* Call downstream hooks for a filesystem health update. */
+static inline void
+xfs_fs_health_update_hook(
+	struct xfs_mount		*mp,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+		};
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a group health update. */
+static inline void
+xfs_group_health_update_hook(
+	struct xfs_group		*xg,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.group		= xg->xg_gno,
+		};
+		struct xfs_mount	*mp = xg->xg_mount;
+
+		switch (xg->xg_type) {
+		case XG_TYPE_AG:
+			p.domain = XFS_HEALTHUP_AG;
+			break;
+		case XG_TYPE_RTG:
+			p.domain = XFS_HEALTHUP_RTGROUP;
+			break;
+		default:
+			ASSERT(0);
+			return;
+		}
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for an inode health update. */
+static inline void
+xfs_inode_health_update_hook(
+	struct xfs_inode		*ip,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_INODE,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.ino		= ip->i_ino,
+			.gen		= VFS_I(ip)->i_generation,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a health update. */
+int
+xfs_health_hook_add(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Stop calling the specified function during a health update. */
+void
+xfs_health_hook_del(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Configure health update hook functions. */
+void
+xfs_health_hook_setup(
+	struct xfs_health_hook	*hook,
+	notifier_fn_t		mod_fn)
+{
+	xfs_hook_setup(&hook->health_hook, mod_fn);
+}
+#else
+# define xfs_health_unmount_hook(...)			((void)0)
+# define xfs_fs_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_rt_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_group_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_inode_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 static void
 xfs_health_unmount_group(
 	struct xfs_group	*xg,
@@ -50,8 +201,10 @@ xfs_health_unmount(
 	unsigned int		checked = 0;
 	bool			warn = false;
 
-	if (xfs_is_shutdown(mp))
+	if (xfs_is_shutdown(mp)) {
+		xfs_health_unmount_hook(mp);
 		return;
+	}
 
 	/* Measure AG corruption levels. */
 	while ((pag = xfs_perag_next(mp, pag)))
@@ -97,6 +250,8 @@ xfs_health_unmount(
 		if (sick & XFS_SICK_FS_COUNTERS)
 			xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
 	}
+
+	xfs_health_unmount_hook(mp);
 }
 
 /* Mark unhealthy per-fs metadata. */
@@ -105,12 +260,17 @@ xfs_fs_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark per-fs metadata as having been checked and found unhealthy by fsck. */
@@ -119,13 +279,18 @@ xfs_fs_mark_corrupt(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_corrupt(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark a per-fs metadata healed. */
@@ -134,15 +299,20 @@ xfs_fs_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick &= ~mask;
 	if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
 		mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-fs metadata are unhealthy. */
@@ -192,12 +362,17 @@ xfs_group_mark_sick(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_sick(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /*
@@ -208,13 +383,18 @@ xfs_group_mark_corrupt(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_corrupt(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /*
@@ -225,15 +405,20 @@ xfs_group_mark_healthy(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_healthy(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick &= ~mask;
 	if (!(xg->xg_sick & XFS_SICK_AG_PRIMARY))
 		xg->xg_sick &= ~XFS_SICK_AG_SECONDARY;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-ag metadata are unhealthy. */
@@ -272,10 +457,13 @@ xfs_inode_mark_sick(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_sick(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -287,6 +475,8 @@ xfs_inode_mark_sick(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark inode metadata as having been checked and found unhealthy by fsck. */
@@ -295,10 +485,13 @@ xfs_inode_mark_corrupt(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_corrupt(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
@@ -311,6 +504,8 @@ xfs_inode_mark_corrupt(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark parts of an inode healed. */
@@ -319,15 +514,20 @@ xfs_inode_mark_healthy(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_healthy(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick &= ~mask;
 	if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
 		ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which parts of an inode are unhealthy. */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d73e76e36bfc10..df5e4a48af72b7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -340,6 +340,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed dirent updates to an active online repair. */
 	struct xfs_hooks	m_dir_update_hooks;
+
+	/* Hook to feed health events to a daemon. */
+	struct xfs_hooks	m_health_update_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fd641853fe3595..e4789dfe1a369e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2182,6 +2182,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;



* [PATCH 03/16] xfs: create a filesystem shutdown hook
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a hook so that health monitoring can report filesystem shutdown
events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
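For readers who have not met the live hooks code before, here is a
minimal userspace sketch of what this patch wires up: a per-mount chain
of callbacks that is only walked when a global switch is on.  Everything
named demo_* is hypothetical and exists only for illustration; the
kernel code uses xfs_hooks_call() and a static key rather than the plain
linked list and bool shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; this is not the kernel's xfs_hooks API. */
struct demo_hook {
	struct demo_hook	*next;
	int			(*fn)(struct demo_hook *hook, uint32_t flags);
};

struct demo_hooks {
	struct demo_hook	*head;
};

/* Models the hook switch; the kernel uses a static key instead. */
bool demo_shutdown_switch;

/* Subscribe a callback to the chain. */
void demo_hook_add(struct demo_hooks *chain, struct demo_hook *hook)
{
	assert(hook->fn != NULL);
	hook->next = chain->head;
	chain->head = hook;
}

/* Unsubscribe a callback; does nothing if it was never added. */
void demo_hook_del(struct demo_hooks *chain, struct demo_hook *hook)
{
	struct demo_hook	**pp;

	for (pp = &chain->head; *pp; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			hook->next = NULL;
			return;
		}
	}
}

/* Walk the chain on shutdown; returns how many callbacks ran. */
int demo_shutdown_hook(struct demo_hooks *chain, uint32_t flags)
{
	struct demo_hook	*hook;
	int			called = 0;

	if (!demo_shutdown_switch)
		return 0;

	for (hook = chain->head; hook; hook = hook->next) {
		hook->fn(hook, flags);
		called++;
	}
	return called;
}

/* Example subscriber that remembers the last shutdown flags it saw. */
struct demo_monitor {
	struct demo_hook	hook;	/* first member, so the cast is valid */
	uint32_t		last_flags;
	int			events;
};

int demo_monitor_fn(struct demo_hook *hook, uint32_t flags)
{
	struct demo_monitor	*mon = (struct demo_monitor *)hook;

	mon->last_flags = flags;
	mon->events++;
	return 0;
}
```

The point of the switch is that filesystems with no subscriber pay
almost nothing on the shutdown path; a subscriber enables the switch and
adds its callback, then reverses both steps when it goes away.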
 fs/xfs/xfs_fsops.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h |   14 +++++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_super.c |    1 +
 4 files changed, 75 insertions(+)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 150979c8333530..439e76f38ed42e 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -480,6 +480,61 @@ xfs_fs_goingdown(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_shutdown_hooks_switch);
+
+void
+xfs_shutdown_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_shutdown_hooks_switch);
+}
+
+void
+xfs_shutdown_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_shutdown_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+	struct xfs_mount		*mp,
+	uint32_t			flags)
+{
+	if (xfs_hooks_switched_on(&xfs_shutdown_hooks_switch))
+		xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a filesystem shutdown. */
+int
+xfs_shutdown_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a filesystem shutdown. */
+void
+xfs_shutdown_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown hook functions. */
+void
+xfs_shutdown_hook_setup(
+	struct xfs_shutdown_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -538,6 +593,8 @@ xfs_do_force_shutdown(
 		"Please unmount the filesystem and rectify the problem(s)");
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
+
+	xfs_shutdown_hook(mp, flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 9d23c361ef56e4..7f6f876de072b1 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -15,4 +15,18 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+	struct xfs_hook			shutdown_hook;
+};
+
+void xfs_shutdown_hook_disable(void);
+void xfs_shutdown_hook_enable(void);
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+		notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index df5e4a48af72b7..a8c81c4ccb2000 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,6 +343,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed health events to a daemon. */
 	struct xfs_hooks	m_health_update_hooks;
+
+	/* Hook to feed shutdown events to a daemon. */
+	struct xfs_hooks	m_shutdown_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e4789dfe1a369e..71aa97a5d1dcaa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2182,6 +2182,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;



* [PATCH 04/16] xfs: create hooks for media errors
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a media error event hook so that we can send events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
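As a rough illustration of the event payload, the userspace sketch below
fills a structure shaped like xfs_media_error_params from a byte range
and formats it for logging.  All demo_* names are hypothetical.  The
rounding is an assumption: the range is widened to whole 512-byte basic
blocks, whereas the real xfs_dax_translate_range() is not part of this
patch and may clamp or round differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_BBSHIFT	9	/* 512-byte basic blocks */

enum demo_failed_device {
	DEMO_FAILED_DATADEV,
	DEMO_FAILED_LOGDEV,
	DEMO_FAILED_RTDEV,
};

/* Hypothetical mirror of the fields carried by the media error hook. */
struct demo_media_error {
	enum demo_failed_device	fdev;
	uint64_t		daddr;		/* first bad basic block */
	uint64_t		bbcount;	/* number of bad basic blocks */
	bool			pre_remove;
};

/*
 * Convert a byte range into the smallest basic block range covering it.
 * This rounding rule is an assumption made for the example.
 */
void
demo_media_error_init(struct demo_media_error *p,
		enum demo_failed_device fdev, uint64_t offset, uint64_t len,
		bool pre_remove)
{
	uint64_t		first, last;

	assert(len > 0);
	first = offset >> DEMO_BBSHIFT;
	last = (offset + len - 1) >> DEMO_BBSHIFT;

	p->fdev = fdev;
	p->daddr = first;
	p->bbcount = last - first + 1;
	p->pre_remove = pre_remove;
}

/* Format the event as one line of text; returns snprintf()'s result. */
int
demo_media_error_format(const struct demo_media_error *p, char *buf,
		size_t buflen)
{
	static const char *const names[] = { "datadev", "logdev", "rtdev" };

	return snprintf(buf, buflen,
			"%s daddr 0x%llx bbcount 0x%llx pre_remove %d",
			names[p->fdev],
			(unsigned long long)p->daddr,
			(unsigned long long)p->bbcount,
			p->pre_remove ? 1 : 0);
}
```

Note that in the patch the hook fires before the rmapbt check, so a
subscriber hears about the failed range even when the filesystem cannot
map it back to owners.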
 fs/xfs/xfs_mount.h          |    3 ++
 fs/xfs/xfs_notify_failure.c |   86 ++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_notify_failure.h |   38 +++++++++++++++++++
 fs/xfs/xfs_super.c          |    1 +
 4 files changed, 122 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a8c81c4ccb2000..3fcfdaaf199315 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -346,6 +346,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed shutdown events to a daemon. */
 	struct xfs_hooks	m_shutdown_hooks;
+
+	/* Hook to feed media error events to a daemon. */
+	struct xfs_hooks	m_media_error_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ed8d8ed42f0a2c..ea68c7e61bb585 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -27,6 +27,73 @@
 #include <linux/dax.h>
 #include <linux/fs.h>
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_media_error_hooks_switch);
+
+void
+xfs_media_error_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_media_error_hooks_switch);
+}
+
+void
+xfs_media_error_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_media_error_hooks_switch);
+}
+
+/* Call downstream hooks for a media error. */
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	if (xfs_hooks_switched_on(&xfs_media_error_hooks_switch)) {
+		struct xfs_media_error_params p = {
+			.mp		= mp,
+			.fdev		= fdev,
+			.daddr		= daddr,
+			.bbcount	= bbcount,
+			.pre_remove	= pre_remove,
+		};
+
+		xfs_hooks_call(&mp->m_media_error_hooks, 0, &p);
+	}
+}
+
+/* Call the specified function during a media error. */
+int
+xfs_media_error_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Stop calling the specified function during a media error. */
+void
+xfs_media_error_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Configure media error hook functions. */
+void
+xfs_media_error_hook_setup(
+	struct xfs_media_error_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->error_hook, mod_fn);
+}
+#else
+# define xfs_media_error_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -215,6 +282,9 @@ xfs_dax_notify_logdev_failure(
 	if (error)
 		return error;
 
+	xfs_media_error_hook(mp, XFS_FAILED_LOGDEV, daddr, bblen,
+			mf_flags & MF_MEM_PRE_REMOVE);
+
 	/*
 	 * In the pre-remove case the failure notification is attempting to
 	 * trigger a force unmount.  The expectation is that the device is
@@ -248,17 +318,21 @@ xfs_dax_notify_dev_failure(
 	uint64_t		bblen;
 	struct xfs_group	*xg = NULL;
 
+	error = xfs_dax_translate_range(type == XG_TYPE_RTG ?
+			mp->m_rtdev_targp : mp->m_ddev_targp,
+			offset, len, &daddr, &bblen);
+	if (error)
+		return error;
+
+	xfs_media_error_hook(mp, type == XG_TYPE_RTG ?
+			XFS_FAILED_RTDEV : XFS_FAILED_DATADEV,
+			daddr, bblen, mf_flags & MF_MEM_PRE_REMOVE);
+
 	if (!xfs_has_rmapbt(mp)) {
 		xfs_debug(mp, "notify_failure() needs rmapbt enabled!");
 		return -EOPNOTSUPP;
 	}
 
-	error = xfs_dax_translate_range(type == XG_TYPE_RTG ?
-			mp->m_rtdev_targp : mp->m_ddev_targp,
-			offset, len, &daddr, &bblen);
-	if (error)
-		return error;
-
 	if (type == XG_TYPE_RTG) {
 		start_bno = xfs_daddr_to_rtb(mp, daddr);
 		end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 41108044d35d47..835d4af504d832 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -8,4 +8,42 @@
 
 extern const struct dax_holder_operations xfs_dax_holder_operations;
 
+enum xfs_failed_device {
+	XFS_FAILED_DATADEV,
+	XFS_FAILED_LOGDEV,
+	XFS_FAILED_RTDEV,
+};
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+struct xfs_media_error_params {
+	struct xfs_mount		*mp;
+	enum xfs_failed_device		fdev;
+	xfs_daddr_t			daddr;
+	uint64_t			bbcount;
+	bool				pre_remove;
+};
+
+struct xfs_media_error_hook {
+	struct xfs_hook			error_hook;
+};
+
+void xfs_media_error_hook_disable(void);
+void xfs_media_error_hook_enable(void);
+
+int xfs_media_error_hook_add(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_del(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook,
+		notifier_fn_t mod_fn);
+#else
+struct xfs_media_error_params { };
+struct xfs_media_error_hook { };
+# define xfs_media_error_hook_disable()		((void)0)
+# define xfs_media_error_hook_enable()		((void)0)
+# define xfs_media_error_hook_add(...)		(0)
+# define xfs_media_error_hook_del(...)		((void)0)
+# define xfs_media_error_hook_setup(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 71aa97a5d1dcaa..a49082159faae8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2184,6 +2184,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_dir_update_hooks);
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
+	xfs_hooks_init(&mp->m_media_error_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;



* [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem
From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Provide an address_space_operations callback so that iomap can report
buffered read and write IO errors to the filesystem that owns the
mapping.  For now this is only wired up for iomap as a testbed for XFS.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
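The calling convention is easy to model outside the kernel.  The sketch
below is a userspace approximation with hypothetical demo_* names: the
callback is optional, the helper tolerates a missing mapping or a NULL
method, and the read completion path only reports when there is an
error, passing the file position as folio position plus offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_READ = 0, DEMO_WRITE = 1 };

struct demo_mapping;

struct demo_aops {
	/* Optional; filesystems that do not care leave this NULL. */
	void	(*ioerror)(struct demo_mapping *mapping, int direction,
			int64_t pos, uint64_t len, int error);
};

struct demo_mapping {
	const struct demo_aops	*a_ops;

	/* Bookkeeping filled in by the example callback below. */
	int			nr_errors;
	int			last_direction;
	int64_t			last_pos;
	uint64_t		last_len;
	int			last_error;
};

/* Models iomap_mapping_ioerror(): call the method only if it exists. */
void
demo_mapping_ioerror(struct demo_mapping *mapping, int direction,
		int64_t pos, uint64_t len, int error)
{
	if (mapping && mapping->a_ops->ioerror)
		mapping->a_ops->ioerror(mapping, direction, pos, len, error);
}

/* Models the read completion path: only failed reads are reported. */
void
demo_finish_folio_read(struct demo_mapping *mapping, int64_t folio_pos,
		size_t off, size_t len, int error)
{
	if (error)
		demo_mapping_ioerror(mapping, DEMO_READ,
				folio_pos + (int64_t)off, len, error);
}

/* Example filesystem callback that records the most recent error. */
void
demo_record_ioerror(struct demo_mapping *mapping, int direction,
		int64_t pos, uint64_t len, int error)
{
	assert(error != 0);
	mapping->nr_errors++;
	mapping->last_direction = direction;
	mapping->last_pos = pos;
	mapping->last_len = len;
	mapping->last_error = error;
}
```

Since the real callback may run in interrupt context, a filesystem
implementation has to avoid sleeping; the example callback only touches
memory for that reason.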
 Documentation/filesystems/vfs.rst |    7 +++++++
 fs/iomap/buffered-io.c            |   26 +++++++++++++++++++++++++-
 include/linux/fs.h                |    4 ++++
 3 files changed, 36 insertions(+), 1 deletion(-)


diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 0b18af3f954eb7..2f0ef4e1a8d340 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -827,6 +827,8 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
 		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+		void (*ioerror)(struct address_space *mapping, int direction,
+				loff_t pos, u64 len, int error);
 	};
 
 ``writepage``
@@ -1056,6 +1058,11 @@ cache in your filesystem.  The following members are defined:
 ``swap_rw``
 	Called to read or write swap pages when SWP_FS_OPS is set.
 
+``ioerror``
+	Called to deal with IO errors during readahead or writeback.
+	This may be called from interrupt context, and the caller does
+	not necessarily hold any locks.
+
 The File Object
 ===============
 
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 86e30b56e8d41b..39782376895306 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -284,6 +284,14 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
 	*lenp = plen;
 }
 
+static inline void iomap_mapping_ioerror(struct address_space *mapping,
+		int direction, loff_t pos, u64 len, int error)
+{
+	if (mapping && mapping->a_ops->ioerror)
+		mapping->a_ops->ioerror(mapping, direction, pos, len,
+				error);
+}
+
 static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		size_t len, int error)
 {
@@ -302,6 +310,10 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		spin_unlock_irqrestore(&ifs->state_lock, flags);
 	}
 
+	if (error)
+		iomap_mapping_ioerror(folio->mapping, READ,
+				folio_pos(folio) + off, len, error);
+
 	if (finished)
 		folio_end_read(folio, uptodate);
 }
@@ -670,11 +682,16 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
 {
 	struct bio_vec bvec;
 	struct bio bio;
+	int ret;
 
 	bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
 	bio_add_folio_nofail(&bio, folio, plen, poff);
-	return submit_bio_wait(&bio);
+	ret = submit_bio_wait(&bio);
+	if (ret)
+		iomap_mapping_ioerror(folio->mapping, READ,
+				folio_pos(folio) + poff, plen, ret);
+	return ret;
 }
 
 static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
@@ -1573,6 +1590,11 @@ u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
 
 	/* walk all folios in bio, ending page IO on them */
 	bio_for_each_folio_all(fi, bio) {
+		if (ioend->io_error)
+			iomap_mapping_ioerror(inode->i_mapping, WRITE,
+					folio_pos(fi.folio) + fi.offset,
+					fi.length, ioend->io_error);
+
 		iomap_finish_folio_write(inode, fi.folio, fi.length);
 		folio_count++;
 	}
@@ -1881,6 +1903,8 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 
 	if (count)
 		wpc->nr_folios++;
+	if (error && !count)
+		iomap_mapping_ioerror(inode->i_mapping, WRITE, pos, 0, error);
 
 	/*
 	 * We can have dirty bits set past end of file in page_mkwrite path
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b638fb1bcbc96f..9375753577025d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -438,6 +438,10 @@ struct address_space_operations {
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
 	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+
+	/* Callback for dealing with IO errors during readahead or writeback */
+	void (*ioerror)(struct address_space *mapping, int direction,
+			loff_t pos, u64 len, int error);
 };
 
 extern const struct address_space_operations empty_aops;



* [PATCH 06/16] iomap: report directio read and write errors to callers
From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an ioerror callback to struct iomap_dio_ops so that directio read
and write errors are reported to the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/direct-io.c  |    4 ++++
 include/linux/iomap.h |    2 ++
 2 files changed, 6 insertions(+)


diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index dd521f4edf55ac..f572be18490b0a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -100,6 +100,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+	if (dio->error && dops && dops->ioerror)
+		dops->ioerror(file_inode(iocb->ki_filp),
+				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
+				offset, dio->size, dio->error);
 
 	if (likely(!ret)) {
 		ret = dio->size;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index afa0917cf43705..69c8b45bd9b935 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -439,6 +439,8 @@ struct iomap_dio_ops {
 		      unsigned flags);
 	void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
 		          loff_t file_offset);
+	void (*ioerror)(struct inode *inode, int direction, loff_t pos,
+			u64 len, int error);
 
 	/*
 	 * Filesystems wishing to attach private information to a direct io bio



* [PATCH 07/16] xfs: create file io error hooks
From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

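Create hooks within XFS to deliver buffered and directio file IO errors
to hook subscribers.  The iomap callbacks only queue a work item; the
hook functions are called later from a workqueue.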

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
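The shape of the deferral is roughly "copy the details now, deliver
them later".  The userspace sketch below models that with a plain FIFO
in place of a kernel workqueue; every demo_* name is hypothetical, and
calloc() stands in for the GFP_ATOMIC allocation.  It is an illustration
of the pattern under those assumptions, not the patch's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum demo_ioerror_type {
	DEMO_IOERROR_BUFFERED_READ,
	DEMO_IOERROR_BUFFERED_WRITE,
	DEMO_IOERROR_DIRECT_READ,
	DEMO_IOERROR_DIRECT_WRITE,
};

/* One queued report; a copy of everything the subscriber will need. */
struct demo_ioerror {
	struct demo_ioerror	*next;
	uint64_t		ino;
	int64_t			pos;
	uint64_t		len;
	uint32_t		gen;
	int			error;
	enum demo_ioerror_type	type;
};

struct demo_ioerror_queue {
	struct demo_ioerror	*head;
	struct demo_ioerror	**tailp;
};

void demo_queue_init(struct demo_ioerror_queue *q)
{
	q->head = NULL;
	q->tailp = &q->head;
}

/* "Interrupt" side: copy the details and defer the real work. */
int
demo_queue_ioerror(struct demo_ioerror_queue *q, uint64_t ino, uint32_t gen,
		int is_direct, int is_write, int64_t pos, uint64_t len,
		int error)
{
	struct demo_ioerror	*ioerr = calloc(1, sizeof(*ioerr));

	if (!ioerr)
		return -ENOMEM;	/* the patch logs a lost report here */

	ioerr->ino = ino;
	ioerr->gen = gen;
	ioerr->pos = pos;
	ioerr->len = len;
	ioerr->error = error;
	if (is_direct)
		ioerr->type = is_write ? DEMO_IOERROR_DIRECT_WRITE :
					 DEMO_IOERROR_DIRECT_READ;
	else
		ioerr->type = is_write ? DEMO_IOERROR_BUFFERED_WRITE :
					 DEMO_IOERROR_BUFFERED_READ;

	/* Append so that reports are delivered in arrival order. */
	*q->tailp = ioerr;
	q->tailp = &ioerr->next;
	return 0;
}

typedef void (*demo_ioerror_fn)(const struct demo_ioerror *ioerr, void *priv);

/* "Worker" side: deliver queued reports in order, then free them. */
int
demo_run_queue(struct demo_ioerror_queue *q, demo_ioerror_fn fn, void *priv)
{
	struct demo_ioerror	*ioerr;
	int			delivered = 0;

	assert(fn != NULL);
	while ((ioerr = q->head) != NULL) {
		q->head = ioerr->next;
		fn(ioerr, priv);
		free(ioerr);
		delivered++;
	}
	q->tailp = &q->head;
	return delivered;
}

/* Example subscriber: count how many reports were write errors. */
void demo_count_writes(const struct demo_ioerror *ioerr, void *priv)
{
	int			*nr_writes = priv;

	if (ioerr->type == DEMO_IOERROR_BUFFERED_WRITE ||
	    ioerr->type == DEMO_IOERROR_DIRECT_WRITE)
		(*nr_writes)++;
}
```

Recording the inode number and generation instead of holding an inode
reference means the report stays meaningful even if the inode is
reclaimed before the worker runs; the subscriber can use the generation
to detect a reused inode number.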
 fs/xfs/xfs_aops.c  |    2 +
 fs/xfs/xfs_file.c  |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_file.h  |   36 +++++++++++
 fs/xfs/xfs_mount.h |    3 +
 fs/xfs/xfs_super.c |    1 
 5 files changed, 208 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4319d0488f2146..7892b794085251 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -21,6 +21,7 @@
 #include "xfs_error.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
+#include "xfs_file.h"
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -722,6 +723,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
 	.swap_activate		= xfs_iomap_swapfile_activate,
+	.ioerror		= xfs_vm_ioerror,
 };
 
 const struct address_space_operations xfs_dax_aops = {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ceb7936e5fd9a3..cbeb60582cb15f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -230,6 +230,169 @@ xfs_ilock_iocb_for_write(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_file_ioerror_hooks_switch);
+
+void
+xfs_file_ioerror_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_file_ioerror_hooks_switch);
+}
+
+void
+xfs_file_ioerror_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_file_ioerror_hooks_switch);
+}
+
+struct xfs_file_ioerror {
+	struct work_struct		work;
+	struct xfs_mount		*mp;
+	xfs_ino_t			ino;
+	loff_t				pos;
+	u64				len;
+	u32				gen;
+	int				error;
+	enum xfs_file_ioerror_type	type;
+};
+
+/* Call downstream hooks for a file io error. */
+STATIC void
+xfs_file_report_ioerror(
+	struct work_struct	*work)
+{
+	struct xfs_file_ioerror	*ioerr;
+
+	ioerr = container_of(work, struct xfs_file_ioerror, work);
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		struct xfs_file_ioerror_params	p = {
+			.ino		= ioerr->ino,
+			.gen		= ioerr->gen,
+			.pos		= ioerr->pos,
+			.len		= ioerr->len,
+		};
+		struct xfs_mount	*mp = ioerr->mp;
+
+		xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p);
+	}
+
+	kfree(ioerr);
+}
+
+/* Queue a directio io error notification. */
+STATIC void
+xfs_dio_ioerror(
+	struct inode		*inode,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+		if (!ioerr) {
+			xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+					ip->i_ino,
+					direction == WRITE ? "WRITE" : "READ",
+					pos, len, error);
+			return;
+		}
+
+		INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+		ioerr->mp = mp;
+		ioerr->ino = ip->i_ino;
+		ioerr->gen = VFS_I(ip)->i_generation;
+		ioerr->pos = pos;
+		ioerr->len = len;
+		if (direction == WRITE)
+			ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE;
+		else
+			ioerr->type = XFS_FILE_IOERROR_DIRECT_READ;
+		ioerr->error = error;
+		queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+	}
+}
+
+/* Queue a buffered io error notification. */
+void
+xfs_vm_ioerror(
+	struct address_space	*mapping,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct inode		*inode = mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+		if (!ioerr) {
+			xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+					ip->i_ino,
+					direction == WRITE ? "WRITE" : "READ",
+					pos, len, error);
+			return;
+		}
+
+		INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+		ioerr->mp = mp;
+		ioerr->ino = ip->i_ino;
+		ioerr->gen = VFS_I(ip)->i_generation;
+		ioerr->pos = pos;
+		ioerr->len = len;
+		if (direction == WRITE)
+			ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE;
+		else
+			ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ;
+		ioerr->error = error;
+		queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+	}
+}
+
+/* Call the specified function after a file io error. */
+int
+xfs_file_ioerror_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Stop calling the specified function after a file io error. */
+void
+xfs_file_ioerror_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Configure file io error hook functions. */
+void
+xfs_file_ioerror_hook_setup(
+	struct xfs_file_ioerror_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->ioerror_hook, mod_fn);
+}
+#else
+# define xfs_dio_ioerror		NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
+static const struct iomap_dio_ops xfs_dio_read_ops = {
+	.ioerror	= xfs_dio_ioerror,
+};
+
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
@@ -248,7 +411,8 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops,
+			0, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -769,6 +933,7 @@ xfs_dio_write_end_io(
 
 static const struct iomap_dio_ops xfs_dio_write_ops = {
 	.end_io		= xfs_dio_write_end_io,
+	.ioerror	= xfs_dio_ioerror,
 };
 
 static void
diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index c9d50699baba85..38c546cd498a52 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -17,4 +17,40 @@ int xfs_file_unshare_at(struct xfs_inode *ip, loff_t pos);
 
 long xfs_ioc_map_freesp(struct file *file, struct xfs_map_freesp __user	*argp);
 
+enum xfs_file_ioerror_type {
+	XFS_FILE_IOERROR_BUFFERED_READ,
+	XFS_FILE_IOERROR_BUFFERED_WRITE,
+	XFS_FILE_IOERROR_DIRECT_READ,
+	XFS_FILE_IOERROR_DIRECT_WRITE,
+};
+
+struct xfs_file_ioerror_params {
+	xfs_ino_t		ino;
+	loff_t			pos;
+	u64			len;
+	u32			gen;
+	int			error;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror_hook {
+	struct xfs_hook			ioerror_hook;
+};
+
+void xfs_file_ioerror_hook_disable(void);
+void xfs_file_ioerror_hook_enable(void);
+
+int xfs_file_ioerror_hook_add(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_del(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook,
+		notifier_fn_t mod_fn);
+
+void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos,
+		u64 len, int error);
+#else
+# define xfs_vm_ioerror			NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3fcfdaaf199315..10b4ff3548601e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -349,6 +349,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed media error events to a daemon. */
 	struct xfs_hooks	m_media_error_hooks;
+
+	/* Hook to feed file io error events to a daemon. */
+	struct xfs_hooks	m_file_ioerror_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a49082159faae8..df6afcf8840948 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2185,6 +2185,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_hooks_init(&mp->m_media_error_hooks);
+	xfs_hooks_init(&mp->m_file_ioerror_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;



* [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-12-31 23:40   ` [PATCH 07/16] xfs: create file io error hooks Darrick J. Wong
@ 2024-12-31 23:40   ` Darrick J. Wong
  2024-12-31 23:41   ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/Kconfig         |    8 +++
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |    8 +++
 fs/xfs/xfs_healthmon.c |  145 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h |   16 +++++
 fs/xfs/xfs_ioctl.c     |    4 +
 6 files changed, 182 insertions(+)
 create mode 100644 fs/xfs/xfs_healthmon.c
 create mode 100644 fs/xfs/xfs_healthmon.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 5700bc671a0e92..9d061a8c2786fe 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -120,6 +120,14 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_HEALTH_MONITOR
+	bool "Report filesystem health events to userspace"
+	depends on XFS_FS
+	select XFS_LIVE_HOOKS
+	default y
+	help
+	  Report health events to userspace programs.
+
 config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4c59d43c77089e..94a9dc7aa7a1d5 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -158,6 +158,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
 xfs-$(CONFIG_XFS_BTREE_IN_MEM)	+= libxfs/xfs_btree_mem.o
+xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index f4128dbdf3b9a2..d1a81b02a1a3f3 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1100,6 +1100,13 @@ struct xfs_map_freesp {
 	__u64	pad;		/* must be zero */
 };
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1141,6 +1148,7 @@ struct xfs_map_freesp {
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 #define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
+#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
new file mode 100644
index 00000000000000..c5ce5699373c63
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trace.h"
+#include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
+
+#include <linux/anon_inodes.h>
+#include <linux/eventpoll.h>
+#include <linux/poll.h>
+
+/*
+ * Live Health Monitoring
+ * ======================
+ *
+ * Autonomous self-healing of XFS filesystems requires a means for the kernel
+ * to send filesystem health events to a monitoring daemon in userspace.  To
+ * accomplish this, we establish a thread_with_file kthread object to handle
+ * translating internal events about filesystem health into a format that can
+ * be parsed easily by userspace.  Then we hook various parts of the filesystem
+ * to supply those internal events to the kthread.  Userspace reads events
+ * from the file descriptor returned by the ioctl.
+ *
+ * The healthmon abstraction has a weak reference to the host filesystem mount
+ * so that the queueing and processing of the events do not pin the mount and
+ * cannot slow down the main filesystem.  The healthmon object can exist past
+ * the end of the filesystem mount.
+ */
+
+struct xfs_healthmon {
+	struct xfs_mount		*mp;
+};
+
+/*
+ * Convey queued event data to userspace.  First copy any remaining bytes in
+ * the outbuf, then format the oldest event into the outbuf and copy that too.
+ */
+STATIC ssize_t
+xfs_healthmon_read_iter(
+	struct kiocb		*iocb,
+	struct iov_iter		*to)
+{
+	return -EIO;
+}
+
+/* Free the health monitoring information. */
+STATIC int
+xfs_healthmon_release(
+	struct inode		*inode,
+	struct file		*file)
+{
+	struct xfs_healthmon	*hm = file->private_data;
+
+	kfree(hm);
+
+	return 0;
+}
+
+/* Validate ioctl parameters. */
+static inline bool
+xfs_healthmon_validate(
+	const struct xfs_health_monitor	*hmo)
+{
+	if (hmo->flags)
+		return false;
+	if (hmo->format)
+		return false;
+	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
+		return false;
+	if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2)))
+		return false;
+	return true;
+}
+
+static const struct file_operations xfs_healthmon_fops = {
+	.owner		= THIS_MODULE,
+	.read_iter	= xfs_healthmon_read_iter,
+	.release	= xfs_healthmon_release,
+};
+
+/*
+ * Create a health monitoring file.  Returns an index to the fd table or a
+ * negative errno.
+ */
+long
+xfs_ioc_health_monitor(
+	struct xfs_mount		*mp,
+	struct xfs_health_monitor __user *arg)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm;
+	char				*name;
+	int				fd;
+	int				ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	hm = kzalloc(sizeof(*hm), GFP_KERNEL);
+	if (!hm)
+		return -ENOMEM;
+	hm->mp = mp;
+
+	/* Set up VFS file and file descriptor. */
+	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
+	if (!name) {
+		ret = -ENOMEM;
+		goto out_hm;
+	}
+
+	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
+			O_CLOEXEC | O_RDONLY);
+	kvfree(name);
+	if (fd < 0) {
+		ret = fd;
+		goto out_hm;
+	}
+
+	return fd;
+
+out_hm:
+	kfree(hm);
+	return ret;
+}
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
new file mode 100644
index 00000000000000..07126e39281a0c
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_HEALTHMON_H__
+#define __XFS_HEALTHMON_H__
+
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+long xfs_ioc_health_monitor(struct xfs_mount *mp,
+		struct xfs_health_monitor __user *arg);
+#else
+# define xfs_ioc_health_monitor(mp, hmo)	(-ENOTTY)
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
+#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 092a3699ff9e75..6c7a30128c7bf6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -42,6 +42,7 @@
 #include "xfs_exchrange.h"
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1434,6 +1435,9 @@ xfs_file_ioctl(
 	case XFS_IOC_MAP_FREESP:
 		return xfs_ioc_map_freesp(filp, arg);
 
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_ioc_health_monitor(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}



* [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-12-31 23:40   ` [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2024-12-31 23:41   ` Darrick J. Wong
  2024-12-31 23:41   ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong
                     ` (6 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about each event and queueing those records; a means to notice that
we've lost some events; and a means to format the events into something
that userspace can handle.
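
The queueing and lost-event rules can be modelled in plain userspace C.
This is only a sketch of the idea, not the kernel code: the names
(evqueue, evqueue_push, evqueue_note_lost) and the tiny MAX_EVENTS limit
are made up for illustration, and locking is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EVENTS	4	/* tiny limit, for illustration only */

enum ev_type { EV_LOST, EV_OTHER };

struct event {
	struct event	*next;
	enum ev_type	type;
};

struct evqueue {
	struct event	*first;
	struct event	*last;
	unsigned int	count;
	bool		lost_prev;	/* dropped something since last marker */
};

/* Append @ev; refuse ordinary events once the queue is full. */
static bool
evqueue_push(struct evqueue *q, struct event *ev)
{
	if (q->count >= MAX_EVENTS && ev->type != EV_LOST) {
		q->lost_prev = true;
		return false;
	}

	ev->next = NULL;
	if (q->last)
		q->last->next = ev;
	else
		q->first = ev;
	q->last = ev;
	q->count++;
	return true;
}

/*
 * Run before queueing a new event.  If something was dropped or the
 * queue is exactly full, make sure the tail is a "lost" marker.
 * @marker is caller-provided storage; returns true if it was queued.
 */
static bool
evqueue_note_lost(struct evqueue *q, struct event *marker)
{
	if (!q->lost_prev && q->count != MAX_EVENTS)
		return false;
	if (q->last && q->last->type == EV_LOST)
		return false;

	marker->type = EV_LOST;
	q->lost_prev = false;
	evqueue_push(q, marker);
	return true;
}
```

The marker is allowed past the limit so that a full queue always ends
with exactly one record telling the reader that events went missing.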

Here, we've chosen json to export information to userspace.  The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time.  Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.
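
As a rough picture of what a reader will see, the sketch below formats
the one event type defined so far ("lost") with the same key layout that
this patch emits.  It uses snprintf instead of the kernel's seq_buf, and
format_lost_event() is a hypothetical name used only for this example.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a "lost" event with the key layout used by this patch.
 * Returns 0 on success, or -1 if @buf is too small to hold the whole
 * object (the buffer contents should then be ignored).
 */
static int
format_lost_event(char *buf, size_t len, uint64_t time_ns)
{
	int	ret;

	/* time_ns goes last so that no trailing comma is emitted. */
	ret = snprintf(buf, len,
			"{\n"
			"  \"type\":       \"lost\",\n"
			"  \"domain\":     \"mount\",\n"
			"  \"time_ns\":    %llu\n"
			"}\n",
			(unsigned long long)time_ns);
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	return 0;
}
```

Each event is a complete json object on its own, so a consumer can feed
objects to a parser one at a time and skip any keys it does not know.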

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h                  |    8 
 fs/xfs/libxfs/xfs_healthmon.schema.json |   63 ++++
 fs/xfs/xfs_healthmon.c                  |  542 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h                  |   24 +
 fs/xfs/xfs_linux.h                      |    3 
 fs/xfs/xfs_trace.c                      |    2 
 fs/xfs/xfs_trace.h                      |  152 +++++++++
 7 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index d1a81b02a1a3f3..d7404e6efd866d 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1107,6 +1107,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
new file mode 100644
index 00000000000000..9772efe25f193d
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -0,0 +1,63 @@
+{
+	"$comment": [
+		"SPDX-License-Identifier: GPL-2.0-or-later",
+		"Copyright (c) 2024-2025 Oracle.  All Rights Reserved.",
+		"Author: Darrick J. Wong <djwong@kernel.org>",
+		"",
+		"This schema file describes the format of the json objects",
+		"readable from the fd returned by the XFS_IOC_HEALTH_MONITOR",
+		"ioctl."
+	],
+
+	"$schema": "https://json-schema.org/draft/2020-12/schema",
+	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
+
+	"title": "XFS Health Monitoring Events",
+
+	"$comment": "Events must be one of the following types:",
+	"oneOf": [
+		{
+			"$ref": "#/$events/lost"
+		}
+	],
+
+	"$comment": "Simple data types are defined here.",
+	"$defs": {
+		"time_ns": {
+			"title": "Time of Event",
+			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Event types are defined here.",
+	"$events": {
+		"lost": {
+			"title": "Health Monitoring Events Lost",
+			"$comment": [
+				"Previous health monitoring events were",
+				"dropped due to memory allocation failures",
+				"or queue limits."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "lost"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		}
+	}
+}
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index c5ce5699373c63..499f6aab9bdbf3 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -40,12 +40,417 @@
  * so that the queueing and processing of the events do not pin the mount and
  * cannot slow down the main filesystem.  The healthmon object can exist past
  * the end of the filesystem mount.
+ *
+ * Please see the xfs_healthmon.schema.json file for a description of the
+ * format of the json events that are conveyed to userspace.
  */
 
+/* Allow this many events to build up in memory per healthmon fd. */
+#define XFS_HEALTHMON_MAX_EVENTS \
+		(32768 / sizeof(struct xfs_healthmon_event))
+
+struct flag_string {
+	unsigned int	mask;
+	const char	*str;
+};
+
 struct xfs_healthmon {
+	/* lock for mp and eventlist */
+	struct mutex			lock;
+
+	/* waiter for signalling the arrival of events */
+	struct wait_queue_head		wait;
+
+	/* list of event objects */
+	struct xfs_healthmon_event	*first_event;
+	struct xfs_healthmon_event	*last_event;
+
 	struct xfs_mount		*mp;
+
+	/* number of events */
+	unsigned int			events;
+
+	/*
+	 * Buffer for formatting events.  New buffer data are appended to the
+	 * end of the seqbuf, and outpos is used to determine where to start
+	 * a copy_iter.  Both are protected by inode_lock.
+	 */
+	struct seq_buf			outbuf;
+	size_t				outpos;
+
+	/* do we want all events? */
+	bool				verbose;
+
+	/* did we lose an event? */
+	bool				lost_prev_event;
 };
 
+/* Remove an event from the head of the list. */
+static inline void
+xfs_healthmon_free_head(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	struct xfs_healthmon_event	*head;
+
+	mutex_lock(&hm->lock);
+	head = hm->first_event;
+	if (head != event) {
+		ASSERT(hm->first_event == event);
+		mutex_unlock(&hm->lock);
+		return;
+	}
+
+	if (hm->last_event == head)
+		hm->last_event = NULL;
+	hm->first_event = head->next;
+	hm->events--;
+	mutex_unlock(&hm->lock);
+
+	trace_xfs_healthmon_pop(hm->mp, head);
+	kfree(event);
+}
+
+/* Push an event onto the end of the list. */
+static inline int
+xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	/*
+	 * If the queue is already full, remember the fact that we lost events.
+	 * This doesn't apply to "event lost" events; those always go through
+	 * because there should only be one at the very end of the queue.
+	 */
+	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS &&
+	    event->type != XFS_HEALTHMON_LOST) {
+		trace_xfs_healthmon_lost_event(hm->mp);
+		hm->lost_prev_event = true;
+		return -ENOMEM;
+	}
+
+	if (!hm->first_event)
+		hm->first_event = event;
+	if (hm->last_event)
+		hm->last_event->next = event;
+	hm->last_event = event;
+	event->next = NULL;
+	hm->events++;
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_push(hm->mp, event);
+
+	return 0;
+}
+
+/* Create a new event or record that we failed. */
+static struct xfs_healthmon_event *
+xfs_healthmon_alloc(
+	struct xfs_healthmon		*hm,
+	enum xfs_healthmon_type		type,
+	enum xfs_healthmon_domain	domain)
+{
+	struct timespec64		now;
+	struct xfs_healthmon_event	*event;
+
+	event = kzalloc(sizeof(*event), GFP_NOFS);
+	if (!event) {
+		trace_xfs_healthmon_lost_event(hm->mp);
+		hm->lost_prev_event = true;
+		return NULL;
+	}
+
+	event->type = type;
+	event->domain = domain;
+	ktime_get_coarse_real_ts64(&now);
+	event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
+
+	return event;
+}
+
+/*
+ * Before we accept an event notification from a live update hook, we need to
+ * clear out any previously lost events.
+ */
+static inline int
+xfs_healthmon_start_live_update(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event;
+
+	/*
+	 * If we previously lost an event or the queue is full, try to queue
+	 * a notification about lost events.
+	 */
+	if (!hm->lost_prev_event && hm->events != XFS_HEALTHMON_MAX_EVENTS)
+		return 0;
+
+	/*
+	 * A previous invocation of the live update hook could not allocate
+	 * any memory at all.  If the last event on the list is already a
+	 * notification of lost events, we're done.
+	 */
+	if (hm->last_event && hm->last_event->type == XFS_HEALTHMON_LOST)
+		return 0;
+
+	/*
+	 * There are no events or the last one wasn't about lost events.  Try
+	 * to allocate a new one to note the lost events.
+	 */
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+			XFS_HEALTHMON_MOUNT);
+	if (!event)
+		return -ENOMEM;
+
+	hm->lost_prev_event = false;
+	xfs_healthmon_push(hm, event);
+	return 0;
+}
+
+/* Render the health update type as a string. */
+STATIC const char *
+xfs_healthmon_typestring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *type_strings[] = {
+		[XFS_HEALTHMON_LOST]		= "lost",
+	};
+
+	if (event->type >= ARRAY_SIZE(type_strings))
+		return "?";
+
+	return type_strings[event->type];
+}
+
+/* Render the health domain as a string. */
+STATIC const char *
+xfs_healthmon_domstring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *dom_strings[] = {
+		[XFS_HEALTHMON_MOUNT]		= "mount",
+	};
+
+	if (event->domain >= ARRAY_SIZE(dom_strings))
+		return "?";
+
+	return dom_strings[event->domain];
+}
+
+/* Convert a flags bitmap into a jsonable string. */
+static inline int
+xfs_healthmon_format_flags(
+	struct seq_buf			*outbuf,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			flags)
+{
+	const struct flag_string	*p;
+	ssize_t				ret;
+	unsigned int			i;
+	bool				first = true;
+
+	for (i = 0, p = strings; i < nr_strings; i++, p++) {
+		if (!(p->mask & flags))
+			continue;
+
+		ret = seq_buf_printf(outbuf, "%s\"%s\"",
+				first ? "" : ", ", p->str);
+		if (ret < 0)
+			return ret;
+
+		first = false;
+		flags &= ~p->mask;
+	}
+
+	for (i = 0; flags != 0 && i < sizeof(flags) * NBBY; i++) {
+		if (!(flags & (1U << i)))
+			continue;
+
+		/* json doesn't support hexadecimal notation */
+		ret = seq_buf_printf(outbuf, "%s%u",
+				first ? "" : ", ", (1U << i));
+		if (ret < 0)
+			return ret;
+
+		first = false;
+	}
+
+	return 0;
+}
+
+/* Convert the event mask into a jsonable string. */
+static inline int
+__xfs_healthmon_format_mask(
+	struct seq_buf			*outbuf,
+	const char			*descr,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			mask)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"%s\":  [", descr);
+	if (ret < 0)
+		return ret;
+
+	ret = xfs_healthmon_format_flags(outbuf, strings, nr_strings, mask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "],\n");
+}
+
+#define xfs_healthmon_format_mask(o, d, s, m) \
+	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
+
+static inline void
+xfs_healthmon_reset_outbuf(
+	struct xfs_healthmon		*hm)
+{
+	hm->outpos = 0;
+	seq_buf_clear(&hm->outbuf);
+}
+
+/*
+ * Format an event into json.  Returns 0 if we formatted the event.  If
+ * formatting the event overflows the buffer, returns -1 with the seqbuf len
+ * unchanged.
+ */
+STATIC int
+xfs_healthmon_format(
+	struct xfs_healthmon		*hm,
+	const struct xfs_healthmon_event *event)
+{
+	struct seq_buf			*outbuf = &hm->outbuf;
+	size_t				old_seqlen = outbuf->len;
+	int				ret;
+
+	trace_xfs_healthmon_format(hm->mp, event);
+
+	ret = seq_buf_printf(outbuf, "{\n");
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "  \"type\":       \"%s\",\n",
+			xfs_healthmon_typestring(event));
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "  \"domain\":     \"%s\",\n",
+			xfs_healthmon_domstring(event));
+	if (ret < 0)
+		goto overrun;
+
+	switch (event->type) {
+	case XFS_HEALTHMON_LOST:
+		/* empty */
+		break;
+	default:
+		break;
+	}
+
+	switch (event->domain) {
+	case XFS_HEALTHMON_MOUNT:
+		/* empty */
+		break;
+	}
+	if (ret < 0)
+		goto overrun;
+
+	/* The last element in the json must not have a trailing comma. */
+	ret = seq_buf_printf(outbuf, "  \"time_ns\":    %llu\n",
+			event->time_ns);
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "}\n");
+	if (ret < 0)
+		goto overrun;
+
+	ASSERT(!seq_buf_has_overflowed(outbuf));
+	return 0;
+overrun:
+	/*
+	 * We overflowed the buffer and could not format the event.  Reset the
+	 * seqbuf and tell the caller not to delete the event.
+	 */
+	trace_xfs_healthmon_format_overflow(hm->mp, event);
+	outbuf->len = old_seqlen;
+	return -1;
+}
+
+/* How many bytes are waiting in the outbuf to be copied? */
+static inline size_t
+xfs_healthmon_outbuf_bytes(
+	struct xfs_healthmon	*hm)
+{
+	unsigned int		used = seq_buf_used(&hm->outbuf);
+
+	if (used > hm->outpos)
+		return used - hm->outpos;
+	return 0;
+}
+
+/*
+ * Do we have something for userspace to do?  This can mean unmount events,
+ * events pending in the queue, or pending bytes in the outbuf.
+ */
+static inline bool
+xfs_healthmon_has_eventdata(
+	struct xfs_healthmon	*hm)
+{
+	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+}
+
+/* Try to copy the rest of the outbuf to the iov iter. */
+STATIC ssize_t
+xfs_healthmon_copybuf(
+	struct xfs_healthmon	*hm,
+	struct iov_iter		*to)
+{
+	size_t			to_copy;
+	size_t			w = 0;
+
+	trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos);
+
+	to_copy = xfs_healthmon_outbuf_bytes(hm);
+	if (to_copy) {
+		w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to);
+		if (!w)
+			return -EFAULT;
+
+		hm->outpos += w;
+	}
+
+	/*
+	 * Nothing left to copy?  Reset the seqbuf and outpos to the start
+	 * since there's no live data in the buffer.
+	 */
+	if (xfs_healthmon_outbuf_bytes(hm) == 0)
+		xfs_healthmon_reset_outbuf(hm);
+	return w;
+}
+
+/*
+ * See if there's an event waiting for us.  If the fs is no longer mounted,
+ * don't bother sending any more events.
+ */
+static inline struct xfs_healthmon_event *
+xfs_healthmon_peek(
+	struct xfs_healthmon	*hm)
+{
+	struct xfs_healthmon_event *event;
+
+	mutex_lock(&hm->lock);
+	if (hm->mp)
+		event = hm->first_event;
+	else
+		event = NULL;
+	mutex_unlock(&hm->lock);
+	return event;
+}
+
 /*
  * Convey queued event data to userspace.  First copy any remaining bytes in
  * the outbuf, then format the oldest event into the outbuf and copy that too.
@@ -55,7 +460,112 @@ xfs_healthmon_read_iter(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	return -EIO;
+	struct file		*file = iocb->ki_filp;
+	struct inode		*inode = file_inode(file);
+	struct xfs_healthmon	*hm = file->private_data;
+	struct xfs_healthmon_event *event;
+	size_t			copied = 0;
+	ssize_t			ret = 0;
+
+	/* Wait for data to become available */
+	if (!(file->f_flags & O_NONBLOCK)) {
+		ret = wait_event_interruptible(hm->wait,
+				xfs_healthmon_has_eventdata(hm));
+		if (ret)
+			return ret;
+	} else if (!xfs_healthmon_has_eventdata(hm)) {
+		return -EAGAIN;
+	}
+
+	/* Allocate formatting buffer up to 64k if necessary */
+	if (hm->outbuf.size == 0) {
+		void		*outbuf;
+		size_t		bufsize = min(65536, max(PAGE_SIZE,
+							 iov_iter_count(to)));
+
+		outbuf = kzalloc(bufsize, GFP_KERNEL);
+		if (!outbuf) {
+			bufsize = PAGE_SIZE;
+			outbuf = kzalloc(bufsize, GFP_KERNEL);
+			if (!outbuf)
+				return -ENOMEM;
+		}
+
+		inode_lock(inode);
+		if (hm->outbuf.size == 0) {
+			seq_buf_init(&hm->outbuf, outbuf, bufsize);
+			hm->outpos = 0;
+		} else {
+			kfree(outbuf);
+		}
+	} else {
+		inode_lock(inode);
+	}
+
+	trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event);
+
+	/*
+	 * If there's anything left in the seqbuf, copy that before formatting
+	 * more events.
+	 */
+	ret = xfs_healthmon_copybuf(hm, to);
+	if (ret < 0)
+		goto out_unlock;
+	copied += ret;
+
+	while (iov_iter_count(to) > 0) {
+		/* Format the next events into the outbuf until it's full. */
+		while ((event = xfs_healthmon_peek(hm)) != NULL) {
+			ret = xfs_healthmon_format(hm, event);
+			if (ret < 0)
+				break;
+			xfs_healthmon_free_head(hm, event);
+		}
+		/* Copy it to userspace */
+		ret = xfs_healthmon_copybuf(hm, to);
+		if (ret <= 0)
+			break;
+
+		copied += ret;
+	}
+
+out_unlock:
+	trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event);
+	inode_unlock(inode);
+	return copied ?: ret;
+}
+
+/* Poll for available events. */
+STATIC __poll_t
+xfs_healthmon_poll(
+	struct file			*file,
+	struct poll_table_struct	*wait)
+{
+	struct xfs_healthmon		*hm = file->private_data;
+	__poll_t			mask = 0;
+
+	poll_wait(file, &hm->wait, wait);
+
+	if (xfs_healthmon_has_eventdata(hm))
+		mask |= EPOLLIN;
+	return mask;
+}
+
+/* Free all events */
+STATIC void
+xfs_healthmon_free_events(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event, *next;
+
+	event = hm->first_event;
+	while (event != NULL) {
+		trace_xfs_healthmon_drop(hm->mp, event);
+		next = event->next;
+		kfree(event);
+		event = next;
+	}
+	hm->first_event = hm->last_event = NULL;
 }
 
 /* Free the health monitoring information. */
@@ -66,6 +576,14 @@ xfs_healthmon_release(
 {
 	struct xfs_healthmon	*hm = file->private_data;
 
+	trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event);
+
+	wake_up_all(&hm->wait);
+
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
+	if (hm->outbuf.size)
+		kfree(hm->outbuf.buffer);
 	kfree(hm);
 
 	return 0;
@@ -76,9 +594,9 @@ static inline bool
 xfs_healthmon_validate(
 	const struct xfs_health_monitor	*hmo)
 {
-	if (hmo->flags)
+	if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
 		return false;
-	if (hmo->format)
+	if (hmo->format != XFS_HEALTH_MONITOR_FMT_JSON)
 		return false;
 	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
 		return false;
@@ -90,6 +608,7 @@ xfs_healthmon_validate(
 static const struct file_operations xfs_healthmon_fops = {
 	.owner		= THIS_MODULE,
 	.read_iter	= xfs_healthmon_read_iter,
+	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
 };
 
@@ -122,11 +641,18 @@ xfs_ioc_health_monitor(
 		return -ENOMEM;
 	hm->mp = mp;
 
+	seq_buf_init(&hm->outbuf, NULL, 0);
+	mutex_init(&hm->lock);
+	init_waitqueue_head(&hm->wait);
+
+	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
+		hm->verbose = true;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_hm;
+		goto out_mutex;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -134,12 +660,16 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_hm;
+		goto out_mutex;
 	}
 
+	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
+
 	return fd;
 
-out_hm:
+out_mutex:
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
 	kfree(hm);
 	return ret;
 }
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 07126e39281a0c..606f205074495c 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -6,6 +6,30 @@
 #ifndef __XFS_HEALTHMON_H__
 #define __XFS_HEALTHMON_H__
 
+enum xfs_healthmon_type {
+	XFS_HEALTHMON_LOST,	/* message lost */
+};
+
+enum xfs_healthmon_domain {
+	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+};
+
+struct xfs_healthmon_event {
+	struct xfs_healthmon_event	*next;
+
+	enum xfs_healthmon_type		type;
+	enum xfs_healthmon_domain	domain;
+
+	uint64_t			time_ns;
+
+	union {
+		/* mount */
+		struct {
+			unsigned int	flags;
+		};
+	};
+};
+
 #ifdef CONFIG_XFS_HEALTH_MONITOR
 long xfs_ioc_health_monitor(struct xfs_mount *mp,
 		struct xfs_health_monitor __user *arg);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 9a2221b4aa21ed..d13a5fa2d652ff 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -63,6 +63,9 @@ typedef __u32			xfs_nlink_t;
 #include <linux/xattr.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/debugfs.h>
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+# include <linux/seq_buf.h>
+#endif
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 555fe76b4d853c..41a2ac85dc5fdf 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -52,6 +52,8 @@
 #include "xfs_zone_alloc.h"
 #include "xfs_zone_priv.h"
 #include "xfs_fsrefs.h"
+#include "xfs_health.h"
+#include "xfs_healthmon.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 76f5d78b6a6e09..bd3b007d213fc6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -106,6 +106,8 @@ struct xfs_open_zone;
 struct xfs_fsrefs;
 struct xfs_fsrefs_irec;
 struct xfs_rtgroup;
+struct xfs_healthmon_event;
+struct xfs_health_update_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6077,6 +6079,156 @@ TRACE_EVENT(xfs_growfs_check_rtgeom,
 );
 #endif /* CONFIG_XFS_RT */
 
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+TRACE_EVENT(xfs_healthmon_lost_event,
+	TP_PROTO(const struct xfs_mount *mp),
+	TP_ARGS(mp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+	),
+	TP_printk("dev %d:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+#define XFS_HEALTHMON_FLAGS_STRINGS \
+	{ XFS_HEALTH_MONITOR_VERBOSE,	"verbose" }
+#define XFS_HEALTHMON_FMT_STRINGS \
+	{ XFS_HEALTH_MONITOR_FMT_JSON,	"json" }
+
+TRACE_EVENT(xfs_healthmon_create,
+	TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format),
+	TP_ARGS(mp, flags, format),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, flags)
+		__field(u8, format)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->flags = flags;
+		__entry->format = format;
+	),
+	TP_printk("dev %d:%d flags %s format %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS),
+		  __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS))
+);
+
+TRACE_EVENT(xfs_healthmon_copybuf,
+	TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov,
+		 const struct seq_buf *seqbuf, size_t outpos),
+	TP_ARGS(mp, iov, seqbuf, outpos),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(size_t, seqbuf_size)
+		__field(size_t, seqbuf_len)
+		__field(size_t, outpos)
+		__field(size_t, to_copy)
+		__field(size_t, iter_count)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->seqbuf_size = seqbuf->size;
+		__entry->seqbuf_len = seqbuf->len;
+		__entry->outpos = outpos;
+		__entry->to_copy = seqbuf->len - outpos;
+		__entry->iter_count = iov_iter_count(iov);
+	),
+	TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->seqbuf_size,
+		  __entry->seqbuf_len,
+		  __entry->outpos,
+		  __entry->to_copy,
+		  __entry->iter_count)
+);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_class,
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev),
+	TP_ARGS(mp, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#define DEFINE_HEALTHMON_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev), \
+	TP_ARGS(mp, events, lost_prev))
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+
+#define XFS_HEALTHMON_TYPE_STRINGS \
+	{ XFS_HEALTHMON_LOST,		"lost" }
+
+#define XFS_HEALTHMON_DOMAIN_STRINGS \
+	{ XFS_HEALTHMON_MOUNT,		"mount" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
+	TP_ARGS(mp, event),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = event->type;
+		__entry->domain = event->domain;
+		__entry->mask = 0;
+		__entry->group = 0;
+		__entry->ino = 0;
+		__entry->gen = 0;
+		switch (__entry->domain) {
+		case XFS_HEALTHMON_MOUNT:
+			__entry->mask = event->flags;
+			break;
+		}
+	),
+	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
+		  __entry->mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group)
+);
+#define DEFINE_HEALTHMONEVENT_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_event_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
+	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 10/16] xfs: report metadata health events through healthmon
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-12-31 23:41   ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2024-12-31 23:41   ` Darrick J. Wong
  2024-12-31 23:41   ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong
                     ` (5 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_healthmon.schema.json |  328 ++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  397 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h                  |   30 ++
 fs/xfs/xfs_trace.h                      |   97 +++++++-
 4 files changed, 846 insertions(+), 6 deletions(-)
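
For anyone skimming the filter logic, here is a minimal userspace sketch
of the decision table in xfs_healthmon_event_mask.  The enum and function
names below (update_type, event_mask) are hypothetical stand-ins for
illustration only, not kernel API, and unlike the kernel helper this
version zeroes the output mask itself instead of relying on the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's update types. */
enum update_type {
	UPDATE_SICK,		/* runtime corruption observed */
	UPDATE_CORRUPT,		/* fsck reported corruption */
	UPDATE_HEALTHY,		/* fsck reported a healthy structure */
	UPDATE_UNMOUNT,		/* filesystem is unmounting */
};

/*
 * Decide whether an update is worth reporting, and compute which bits
 * to report.  Returns true if an event should be queued.
 */
static bool
event_mask(
	bool			verbose,
	enum update_type	type,
	unsigned int		old_mask,
	unsigned int		new_mask,
	unsigned int		*mask)
{
	assert(mask != NULL);
	*mask = 0;

	/* Unmounts are always reported and carry no mask. */
	if (type == UPDATE_UNMOUNT)
		return true;

	/* Verbose monitors see every update, changed or not. */
	if (verbose) {
		*mask = new_mask;
		return true;
	}

	switch (type) {
	case UPDATE_SICK:
		/* Runtime corruption: report everything that was flagged. */
		*mask = new_mask;
		break;
	case UPDATE_CORRUPT:
		/* fsck errors: report only bits that were not set before. */
		*mask = new_mask & ~old_mask;
		break;
	case UPDATE_HEALTHY:
		/* Report only bits that were previously marked bad. */
		*mask = new_mask & old_mask;
		break;
	case UPDATE_UNMOUNT:
		break;
	}

	/* Outside of verbose mode, an empty mask means nothing to say. */
	return *mask != 0;
}
```

So a non-verbose monitor that sees a "corrupt" update with old mask 0x3
and new mask 0x7 would report only bit 0x4, and would stay silent if the
same 0x3 were reported again.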


diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 9772efe25f193d..154ea0228a3615 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -18,6 +18,18 @@
 	"oneOf": [
 		{
 			"$ref": "#/$events/lost"
+		},
+		{
+			"$ref": "#/$events/fs_metadata"
+		},
+		{
+			"$ref": "#/$events/rtgroup_metadata"
+		},
+		{
+			"$ref": "#/$events/perag_metadata"
+		},
+		{
+			"$ref": "#/$events/inode_metadata"
 		}
 	],
 
@@ -27,6 +39,169 @@
 			"title": "Time of Event",
 			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
 			"type": "integer"
+		},
+		"xfs_agnumber_t": {
+			"description": "Allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_rgnumber_t": {
+			"description": "Realtime allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_ino_t": {
+			"description": "Inode number",
+			"type": "integer",
+			"minimum": 1
+		},
+		"i_generation": {
+			"description": "Inode generation number",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Filesystem metadata event data are defined here.",
+	"$metadata": {
+		"status": {
+			"description": "Metadata health status",
+			"$comment": [
+				"One of:",
+				"",
+				" * sick:    metadata corruption discovered",
+				"            during a runtime operation.",
+				" * corrupt: corruption discovered during",
+				"            an xfs_scrub run.",
+				" * healthy: metadata object was found to be",
+				"            ok by xfs_scrub."
+			],
+			"enum": [
+				"sick",
+				"corrupt",
+				"healthy"
+			]
+		},
+		"fs": {
+			"description": [
+				"Metadata structures that affect the entire",
+				"filesystem.  Options include:",
+				"",
+				" * fscounters: summary counters",
+				" * usrquota:   user quota records",
+				" * grpquota:   group quota records",
+				" * prjquota:   project quota records",
+				" * quotacheck: quota counters",
+				" * nlinks:     file link counts",
+				" * metadir:    metadata directory",
+				" * metapath:   metadata inode paths"
+			],
+			"enum": [
+				"fscounters",
+				"grpquota",
+				"metadir",
+				"metapath",
+				"nlinks",
+				"prjquota",
+				"quotacheck",
+				"usrquota"
+			]
+		},
+		"perag": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the data device.  Options include:",
+				"",
+				" * agf:        group space header",
+				" * agfl:       per-group free block list",
+				" * agi:        group inode header",
+				" * bnobt:      free space by position btree",
+				" * cntbt:      free space by length btree",
+				" * finobt:     free inode btree",
+				" * inobt:      inode btree",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * inodes:     problems were recorded for",
+				"               this group's inodes, but the",
+				"               inodes themselves had to be",
+				"               reclaimed.",
+				" * super:      superblock"
+			],
+			"enum": [
+				"agf",
+				"agfl",
+				"agi",
+				"bnobt",
+				"cntbt",
+				"finobt",
+				"inobt",
+				"inodes",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"rtgroup": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the realtime volume.  Options",
+				"include:",
+				"",
+				" * bitmap:     free space bitmap contents",
+				"               for this group",
+				" * summary:    realtime free space summary file",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * super:      group superblock"
+			],
+			"enum": [
+				"bitmap",
+				"summary",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"inode": {
+			"description": [
+				"Metadata structures owned by file inodes.",
+				"Options include:",
+				"",
+				" * bmapbta:    attr fork",
+				" * bmapbtc:    cow fork",
+				" * bmapbtd:    data fork",
+				" * core:       inode record",
+				" * directory:  directory entries",
+				" * dirtree:    directory tree problems detected",
+				" * parent:     directory parent pointer",
+				" * symlink:    symbolic link target",
+				" * xattr:      extended attributes",
+				"",
+				"These are set when an inode record repair had",
+				"to drop the corresponding data structure to",
+				"get the inode back to a consistent state.",
+				"",
+				" * bmapbtd_zapped",
+				" * bmapbta_zapped",
+				" * directory_zapped",
+				" * symlink_zapped"
+			],
+			"enum": [
+				"bmapbta",
+				"bmapbta_zapped",
+				"bmapbtc",
+				"bmapbtd",
+				"bmapbtd_zapped",
+				"core",
+				"directory",
+				"directory_zapped",
+				"dirtree",
+				"parent",
+				"symlink",
+				"symlink_zapped",
+				"xattr"
+			]
 		}
 	},
 
@@ -58,6 +233,159 @@
 				"time_ns",
 				"domain"
 			]
+		},
+		"fs_metadata": {
+			"title": "Filesystem-wide metadata event",
+			"description": [
+				"Health status updates for filesystem-wide",
+				"metadata objects."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "fs"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/fs"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"structures"
+			]
+		},
+		"perag_metadata": {
+			"title": "Data device allocation group metadata event",
+			"description": [
+				"Health status updates for data device",
+				"allocation group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "perag"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_agnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/perag"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"rtgroup_metadata": {
+			"title": "Realtime allocation group metadata event",
+			"description": [
+				"Health status updates for realtime allocation",
+				"group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "rtgroup"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_rgnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/rtgroup"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"inode_metadata": {
+			"title": "Inode metadata event",
+			"description": [
+				"Health status updates for inode metadata.",
+				"The inode and generation number describe the",
+				"file that is affected by the change."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "inode"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/inode"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"structures"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 499f6aab9bdbf3..9d34a826726e3e 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -18,6 +18,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
+#include "xfs_health.h"
 #include "xfs_healthmon.h"
 
 #include <linux/anon_inodes.h>
@@ -65,8 +66,15 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*first_event;
 	struct xfs_healthmon_event	*last_event;
 
+	/* live update hooks */
+	struct xfs_health_hook		hhook;
+
+	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
 
+	/* filesystem type for safe cleanup of hooks; requires module_get */
+	struct file_system_type		*fstyp;
+
 	/* number of events */
 	unsigned int			events;
 
@@ -178,6 +186,10 @@ xfs_healthmon_start_live_update(
 {
 	struct xfs_healthmon_event	*event;
 
+	/* Already unmounted filesystem, do nothing. */
+	if (!hm->mp)
+		return -ESHUTDOWN;
+
 	/*
 	 * If we previously lost an event or the queue is full, try to queue
 	 * a notification about lost events.
@@ -207,6 +219,171 @@ xfs_healthmon_start_live_update(
 	return 0;
 }
 
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+	struct xfs_healthmon			*hm,
+	enum xfs_health_update_type		type,
+	const struct xfs_health_update_params	*hup,
+	unsigned int				*mask)
+{
+	/* Always report unmounts. */
+	if (type == XFS_HEALTHUP_UNMOUNT)
+		return true;
+
+	/* If we want all events, return all events. */
+	if (hm->verbose) {
+		*mask = hup->new_mask;
+		return true;
+	}
+
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		/* Always report runtime corruptions */
+		*mask = hup->new_mask;
+		break;
+	case XFS_HEALTHUP_CORRUPT:
+		/* Only report new fsck errors */
+		*mask = hup->new_mask & ~hup->old_mask;
+		break;
+	case XFS_HEALTHUP_HEALTHY:
+		/* Only report healthy metadata that got fixed */
+		*mask = hup->new_mask & hup->old_mask;
+		break;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* This is here for static enum checking */
+		break;
+	}
+
+	/* If not in verbose mode, mask state has to change. */
+	return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+	enum xfs_health_update_type	type)
+{
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		return XFS_HEALTHMON_SICK;
+	case XFS_HEALTHUP_CORRUPT:
+		return XFS_HEALTHMON_CORRUPT;
+	case XFS_HEALTHUP_HEALTHY:
+		return XFS_HEALTHMON_HEALTHY;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+	enum xfs_health_update_domain	domain)
+{
+	switch (domain) {
+	case XFS_HEALTHUP_FS:
+		return XFS_HEALTHMON_FS;
+	case XFS_HEALTHUP_AG:
+		return XFS_HEALTHMON_AG;
+	case XFS_HEALTHUP_RTGROUP:
+		return XFS_HEALTHMON_RTGROUP;
+	case XFS_HEALTHUP_INODE:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_health_update_params	*hup = data;
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	enum xfs_health_update_type	type = action;
+	unsigned int			mask = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+	/* Decode event mask and skip events we don't care about. */
+	if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+		return NOTIFY_DONE;
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (type == XFS_HEALTHUP_UNMOUNT) {
+		/*
+		 * The filesystem is unmounting, so we must detach from the
+		 * mount.  After this point, the healthmon thread has no
+		 * connection to the mounted filesystem.
+		 */
+		trace_xfs_healthmon_unmount(hm->mp, hm->events,
+				hm->lost_prev_event);
+		hm->mp = NULL;
+		wake_up(&hm->wait);
+		goto out_unlock;
+	}
+
+	event = xfs_healthmon_alloc(hm, health_update_to_type(type),
+			  health_update_to_domain(hup->domain));
+	if (!event)
+		goto out_unlock;
+
+	/* Ignore the event if it's only reporting a secondary health state. */
+	switch (event->domain) {
+	case XFS_HEALTHMON_FS:
+		event->fsmask = mask & ~XFS_SICK_FS_SECONDARY;
+		if (!event->fsmask)
+			goto out_event;
+		break;
+	case XFS_HEALTHMON_AG:
+		event->grpmask = mask & ~XFS_SICK_AG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		event->grpmask = mask & ~XFS_SICK_RG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		event->imask = mask & ~XFS_SICK_INO_SECONDARY;
+		if (!event->imask)
+			goto out_event;
+		event->ino = hup->ino;
+		event->gen = hup->gen;
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		goto out_event;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+out_event:
+	kfree(event);
+	goto out_unlock;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -214,6 +391,10 @@ xfs_healthmon_typestring(
 {
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_LOST]		= "lost",
+		[XFS_HEALTHMON_UNMOUNT]		= "unmount",
+		[XFS_HEALTHMON_SICK]		= "sick",
+		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
+		[XFS_HEALTHMON_HEALTHY]		= "healthy",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -229,6 +410,10 @@ xfs_healthmon_domstring(
 {
 	static const char *dom_strings[] = {
 		[XFS_HEALTHMON_MOUNT]		= "mount",
+		[XFS_HEALTHMON_FS]		= "fs",
+		[XFS_HEALTHMON_AG]		= "perag",
+		[XFS_HEALTHMON_INODE]		= "inode",
+		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -254,6 +439,11 @@ xfs_healthmon_format_flags(
 		if (!(p->mask & flags))
 			continue;
 
+		if (!p->str) {
+			flags &= ~p->mask;
+			continue;
+		}
+
 		ret = seq_buf_printf(outbuf, "%s\"%s\"",
 				first ? "" : ", ", p->str);
 		if (ret < 0)
@@ -304,6 +494,118 @@ __xfs_healthmon_format_mask(
 #define xfs_healthmon_format_mask(o, d, s, m) \
 	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
 
+/* Render fs sickness mask as a string set */
+static int
+xfs_healthmon_format_fs(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_FS_COUNTERS,		"fscounters" },
+		{ XFS_SICK_FS_UQUOTA,		"usrquota" },
+		{ XFS_SICK_FS_GQUOTA,		"grpquota" },
+		{ XFS_SICK_FS_PQUOTA,		"prjquota" },
+		{ XFS_SICK_FS_QUOTACHECK,	"quotacheck" },
+		{ XFS_SICK_FS_NLINKS,		"nlinks" },
+		{ XFS_SICK_FS_METADIR,		"metadir" },
+		{ XFS_SICK_FS_METAPATH,		"metapath" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->fsmask);
+}
+
+/* Render rtgroup sickness mask as a string set */
+static int
+xfs_healthmon_format_rtgroup(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_RG_SUPER,		"super" },
+		{ XFS_SICK_RG_BITMAP,		"bitmap" },
+		{ XFS_SICK_RG_SUMMARY,		"summary" },
+		{ XFS_SICK_RG_RMAPBT,		"rmapbt" },
+		{ XFS_SICK_RG_REFCNTBT,		"refcountbt" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render perag sickness mask as a string set */
+static int
+xfs_healthmon_format_ag(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_AG_SB,		"super" },
+		{ XFS_SICK_AG_AGF,		"agf" },
+		{ XFS_SICK_AG_AGFL,		"agfl" },
+		{ XFS_SICK_AG_AGI,		"agi" },
+		{ XFS_SICK_AG_BNOBT,		"bnobt" },
+		{ XFS_SICK_AG_CNTBT,		"cntbt" },
+		{ XFS_SICK_AG_INOBT,		"inobt" },
+		{ XFS_SICK_AG_FINOBT,		"finobt" },
+		{ XFS_SICK_AG_RMAPBT,		"rmapbt" },
+		{ XFS_SICK_AG_REFCNTBT,		"refcountbt" },
+		{ XFS_SICK_AG_INODES,		"inodes" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render inode sickness mask as a string set */
+static int
+xfs_healthmon_format_inode(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_INO_CORE,		"core" },
+		{ XFS_SICK_INO_BMBTD,		"bmapbtd" },
+		{ XFS_SICK_INO_BMBTA,		"bmapbta" },
+		{ XFS_SICK_INO_BMBTC,		"bmapbtc" },
+		{ XFS_SICK_INO_DIR,		"directory" },
+		{ XFS_SICK_INO_XATTR,		"xattr" },
+		{ XFS_SICK_INO_SYMLINK,		"symlink" },
+		{ XFS_SICK_INO_PARENT,		"parent" },
+		{ XFS_SICK_INO_BMBTD_ZAPPED,	"bmapbtd_zapped" },
+		{ XFS_SICK_INO_BMBTA_ZAPPED,	"bmapbta_zapped" },
+		{ XFS_SICK_INO_DIR_ZAPPED,	"directory_zapped" },
+		{ XFS_SICK_INO_SYMLINK_ZAPPED,	"symlink_zapped" },
+		{ XFS_SICK_INO_FORGET,		NULL, },
+		{ XFS_SICK_INO_DIRTREE,		"dirtree" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->imask);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"inumber\":    %llu,\n",
+			event->ino);
+	if (ret < 0)
+		return ret;
+	return seq_buf_printf(outbuf, "  \"generation\": %u,\n",
+			event->gen);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -354,6 +656,18 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_MOUNT:
 		/* empty */
 		break;
+	case XFS_HEALTHMON_FS:
+		ret = xfs_healthmon_format_fs(outbuf, event);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		ret = xfs_healthmon_format_rtgroup(outbuf, event);
+		break;
+	case XFS_HEALTHMON_AG:
+		ret = xfs_healthmon_format_ag(outbuf, event);
+		break;
+	case XFS_HEALTHMON_INODE:
+		ret = xfs_healthmon_format_inode(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -400,7 +714,7 @@ static inline bool
 xfs_healthmon_has_eventdata(
 	struct xfs_healthmon	*hm)
 {
-	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+	return !hm->mp || hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
 }
 
 /* Try to copy the rest of the outbuf to the iov iter. */
@@ -521,6 +835,7 @@ xfs_healthmon_read_iter(
 				break;
 			xfs_healthmon_free_head(hm, event);
 		}
+
 		/* Copy it to userspace */
 		ret = xfs_healthmon_copybuf(hm, to);
 		if (ret <= 0)
@@ -568,6 +883,58 @@ xfs_healthmon_free_events(
 	hm->first_event = hm->last_event = NULL;
 }
 
+/*
+ * Detach all filesystem hooks that were set up for a health monitor.  Only
+ * call this from iterate_super*.
+ */
+STATIC void
+xfs_healthmon_detach_hooks(
+	struct super_block	*sb,
+	void			*arg)
+{
+	struct xfs_healthmon	*hm = arg;
+
+	mutex_lock(&hm->lock);
+
+	/*
+	 * Because health monitors have a weak reference to the filesystem
+	 * they're monitoring, the hook deletions below must not race against
+	 * that filesystem being unmounted because that could lead to UAF
+	 * errors.
+	 *
+	 * If hm->mp is NULL, the health unmount hook already ran and the hook
+	 * chain head (contained within the xfs_mount structure) is gone.  Do
+	 * not detach any hooks; just let them get freed when the healthmon
+	 * object is torn down.
+	 */
+	if (!hm->mp)
+		goto out_unlock;
+
+	/*
+	 * Otherwise, the caller gave us a non-dying @sb with s_umount held in
+	 * shared mode, which means that @sb cannot be running through
+	 * deactivate_locked_super and cannot be freed.  It's safe to compare
+	 * @sb against the super that we snapshotted when we set up the health
+	 * monitor.
+	 */
+	if (hm->mp->m_super != sb)
+		goto out_unlock;
+
+	mutex_unlock(&hm->lock);
+
+	/*
+	 * Now we know that the filesystem @hm->mp is active and cannot be
+	 * deactivated until this function returns.  Unmount events are sent
+	 * through the health monitoring subsystem from xfs_fs_put_super, so
+	 * it is now time to detach the hooks.
+	 */
+	xfs_health_hook_del(hm->mp, &hm->hhook);
+	return;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+}
+
 /* Free the health monitoring information. */
 STATIC int
 xfs_healthmon_release(
@@ -580,6 +947,9 @@ xfs_healthmon_release(
 
 	wake_up_all(&hm->wait);
 
+	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_health_hook_disable();
+
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	if (hm->outbuf.size)
@@ -641,6 +1011,13 @@ xfs_ioc_health_monitor(
 		return -ENOMEM;
 	hm->mp = mp;
 
+	/*
+	 * Since we already hold a reference to the module, snapshot the
+	 * fstype pointer to make it easier to detach the hooks when we tear
+	 * things down later.
+	 */
+	hm->fstyp = mp->m_super->s_type;
+
 	seq_buf_init(&hm->outbuf, NULL, 0);
 	mutex_init(&hm->lock);
 	init_waitqueue_head(&hm->wait);
@@ -648,11 +1025,20 @@ xfs_ioc_health_monitor(
 	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
 		hm->verbose = true;
 
+	/* Enable hooks to receive events, generally. */
+	xfs_health_hook_enable();
+
+	/* Attach specific event hooks to this monitor. */
+	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+	ret = xfs_health_hook_add(mp, &hm->hhook);
+	if (ret)
+		goto out_hooks;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -660,14 +1046,17 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
-out_mutex:
+out_healthhook:
+	xfs_health_hook_del(mp, &hm->hhook);
+out_hooks:
+	xfs_health_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 606f205074495c..3ece61165837b2 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -8,10 +8,22 @@
 
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
+
+	/* metadata health events */
+	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
+	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
 };
 
 enum xfs_healthmon_domain {
 	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+
+	/* metadata health events */
+	XFS_HEALTHMON_FS,	/* main filesystem metadata */
+	XFS_HEALTHMON_AG,	/* allocation group metadata */
+	XFS_HEALTHMON_INODE,	/* inode metadata */
+	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
 };
 
 struct xfs_healthmon_event {
@@ -27,6 +39,24 @@ struct xfs_healthmon_event {
 		struct {
 			unsigned int	flags;
 		};
+		/* fs/rt metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	fsmask;
+		};
+		/* ag/rtgroup metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	grpmask;
+			unsigned int	group;
+		};
+		/* inode metadata */
+		struct {
+			/* XFS_SICK_INO_* flags */
+			unsigned int	imask;
+			uint32_t	gen;
+			xfs_ino_t	ino;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index bd3b007d213fc6..4a68d2ec8d0a34 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6174,14 +6174,30 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
 #define XFS_HEALTHMON_TYPE_STRINGS \
-	{ XFS_HEALTHMON_LOST,		"lost" }
+	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHMON_SICK,		"sick" }, \
+	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHMON_HEALTHY,	"healthy" }
 
 #define XFS_HEALTHMON_DOMAIN_STRINGS \
-	{ XFS_HEALTHMON_MOUNT,		"mount" }
+	{ XFS_HEALTHMON_MOUNT,		"mount" }, \
+	{ XFS_HEALTHMON_FS,		"fs" }, \
+	{ XFS_HEALTHMON_AG,		"ag" }, \
+	{ XFS_HEALTHMON_INODE,		"inode" }, \
+	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
 
 DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6207,6 +6223,19 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 		case XFS_HEALTHMON_MOUNT:
 			__entry->mask = event->flags;
 			break;
+		case XFS_HEALTHMON_FS:
+			__entry->mask = event->fsmask;
+			break;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			__entry->mask = event->grpmask;
+			__entry->group = event->group;
+			break;
+		case XFS_HEALTHMON_INODE:
+			__entry->mask = event->imask;
+			__entry->ino = event->ino;
+			__entry->gen = event->gen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x",
@@ -6227,6 +6256,70 @@ DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+
+#define XFS_HEALTHUP_TYPE_STRINGS \
+	{ XFS_HEALTHUP_UNMOUNT,		"unmount" }, \
+	{ XFS_HEALTHUP_SICK,		"sick" }, \
+	{ XFS_HEALTHUP_CORRUPT,		"corrupt" }, \
+	{ XFS_HEALTHUP_HEALTHY,		"healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+	{ XFS_HEALTHUP_FS,		"fs" }, \
+	{ XFS_HEALTHUP_AG,		"ag" }, \
+	{ XFS_HEALTHUP_INODE,		"inode" }, \
+	{ XFS_HEALTHUP_RTGROUP,		"rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+		 const struct xfs_health_update_params *update,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, type, update, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, old_mask)
+		__field(unsigned int, new_mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = type;
+		__entry->domain = update->domain;
+		__entry->old_mask = update->old_mask;
+		__entry->new_mask = update->new_mask;
+		__entry->ino = update->ino;
+		__entry->gen = update->gen;
+		__entry->group = update->group;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+		  __entry->old_mask,
+		  __entry->new_mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */



* [PATCH 11/16] xfs: report shutdown events through healthmon
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (9 preceding siblings ...)
  2024-12-31 23:41   ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2024-12-31 23:41   ` Darrick J. Wong
  2024-12-31 23:41   ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong
                     ` (4 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace
when the filesystem shuts down.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
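As a rough userspace illustration of the "reasons" array that the schema
below describes, here is a minimal sketch that turns a shutdown flag mask
into a JSON array member.  The EX_* names, their bit values, and
ex_format_reasons() are all made up for this example; it does not
reproduce the kernel's xfs_healthmon_format_mask(), and the exact
whitespace of the real output may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative flag bits only.  These EX_* names and values are assumptions
 * that stand in for the kernel's SHUTDOWN_* flags.
 */
#define EX_SHUTDOWN_META_IO_ERROR	(1u << 0)
#define EX_SHUTDOWN_LOG_IO_ERROR	(1u << 1)
#define EX_SHUTDOWN_FORCE_UMOUNT	(1u << 2)
#define EX_SHUTDOWN_CORRUPT_INCORE	(1u << 3)
#define EX_SHUTDOWN_CORRUPT_ONDISK	(1u << 4)
#define EX_SHUTDOWN_DEVICE_REMOVED	(1u << 5)

struct ex_flag_string {
	unsigned int	mask;
	const char	*str;
};

/* Same strings as the schema's shutdown reason enum. */
static const struct ex_flag_string ex_shutdown_strings[] = {
	{ EX_SHUTDOWN_META_IO_ERROR,	"meta_ioerr" },
	{ EX_SHUTDOWN_LOG_IO_ERROR,	"log_ioerr" },
	{ EX_SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
	{ EX_SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
	{ EX_SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
	{ EX_SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
};

/*
 * Render every set bit in @flags as a quoted string inside a JSON array
 * named "reasons".  Returns the number of bytes written, not counting the
 * trailing NUL, or -1 if @buf is too small.
 */
static int
ex_format_reasons(char *buf, size_t len, unsigned int flags)
{
	const size_t	nr = sizeof(ex_shutdown_strings) /
			     sizeof(ex_shutdown_strings[0]);
	const char	*sep = "";
	size_t		pos;
	size_t		i;
	int		n;

	assert(buf != NULL);

	n = snprintf(buf, len, "\"reasons\": [");
	if (n < 0 || (size_t)n >= len)
		return -1;
	pos = n;

	for (i = 0; i < nr; i++) {
		if (!(flags & ex_shutdown_strings[i].mask))
			continue;

		n = snprintf(buf + pos, len - pos, "%s\"%s\"", sep,
				ex_shutdown_strings[i].str);
		if (n < 0 || (size_t)n >= len - pos)
			return -1;
		pos += n;
		sep = ", ";
	}

	n = snprintf(buf + pos, len - pos, "]");
	if (n < 0 || (size_t)n >= len - pos)
		return -1;
	return (int)(pos + n);
}
```

Calling ex_format_reasons() with the log I/O error and forced unmount
bits set leaves "reasons": ["log_ioerr", "force_umount"] in the buffer.
Strings are emitted in table order, not in the order the bits were set.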
 fs/xfs/libxfs/xfs_healthmon.schema.json |   62 +++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |   77 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_healthmon.h                  |    3 +
 fs/xfs/xfs_trace.h                      |   25 ++++++++++
 4 files changed, 165 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 154ea0228a3615..a8bc75b0b8c4f9 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -30,6 +30,9 @@
 		},
 		{
 			"$ref": "#/$events/inode_metadata"
+		},
+		{
+			"$ref": "#/$events/shutdown"
 		}
 	],
 
@@ -205,6 +208,31 @@
 		}
 	},
 
+	"$comment": "Shutdown event data are defined here.",
+	"$shutdown": {
+		"reason": {
+			"description": [
+				"Reason for a filesystem to shut down.",
+				"Options include:",
+				"",
+				" * corrupt_incore: in-memory corruption",
+				" * corrupt_ondisk: on-disk corruption",
+				" * device_removed: device removed",
+				" * force_umount:   userspace asked for it",
+				" * log_ioerr:      log write IO error",
+				" * meta_ioerr:     metadata writeback IO error"
+			],
+			"enum": [
+				"corrupt_incore",
+				"corrupt_ondisk",
+				"device_removed",
+				"force_umount",
+				"log_ioerr",
+				"meta_ioerr"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"lost": {
@@ -386,6 +414,40 @@
 				"generation",
 				"structures"
 			]
+		},
+		"shutdown": {
+			"title": "Abnormal Shutdown Event",
+			"description": [
+				"The filesystem went offline due to",
+				"unrecoverable errors."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "shutdown"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				},
+				"reasons": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$shutdown/reason"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"reasons"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9d34a826726e3e..c7df6dad5612f8 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -20,6 +20,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_fsops.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -67,6 +68,7 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*last_event;
 
 	/* live update hooks */
+	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
@@ -384,6 +386,43 @@ xfs_healthmon_metadata_hook(
 	goto out_unlock;
 }
 
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_SHUTDOWN,
+			XFS_HEALTHMON_MOUNT);
+	if (!event)
+		goto out_unlock;
+
+	event->flags = action;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -391,6 +430,7 @@ xfs_healthmon_typestring(
 {
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_LOST]		= "lost",
+		[XFS_HEALTHMON_SHUTDOWN]	= "shutdown",
 		[XFS_HEALTHMON_UNMOUNT]		= "unmount",
 		[XFS_HEALTHMON_SICK]		= "sick",
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
@@ -606,6 +646,25 @@ xfs_healthmon_format_inode(
 			event->gen);
 }
 
+/* Render shutdown mask as a string set */
+static int
+xfs_healthmon_format_shutdown(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ SHUTDOWN_META_IO_ERROR,	"meta_ioerr" },
+		{ SHUTDOWN_LOG_IO_ERROR,	"log_ioerr" },
+		{ SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
+		{ SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
+		{ SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
+		{ SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "reasons", mask_strings,
+			event->flags);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -645,6 +704,9 @@ xfs_healthmon_format(
 		goto overrun;
 
 	switch (event->type) {
+	case XFS_HEALTHMON_SHUTDOWN:
+		ret = xfs_healthmon_format_shutdown(outbuf, event);
+		break;
 	case XFS_HEALTHMON_LOST:
 		/* empty */
 		break;
@@ -928,6 +990,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
 
@@ -948,6 +1011,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
 
 	mutex_destroy(&hm->lock);
@@ -1027,6 +1091,7 @@ xfs_ioc_health_monitor(
 
 	/* Enable hooks to receive events, generally. */
 	xfs_health_hook_enable();
+	xfs_shutdown_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1034,11 +1099,16 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_hooks;
 
+	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+	ret = xfs_shutdown_hook_add(mp, &hm->shook);
+	if (ret)
+		goto out_healthhook;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -1046,17 +1116,20 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_shutdownhook:
+	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
 	xfs_health_hook_disable();
+	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 3ece61165837b2..a7b2eaf3dd64e1 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -9,6 +9,9 @@
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
 
+	/* filesystem shutdown */
+	XFS_HEALTHMON_SHUTDOWN,
+
 	/* metadata health events */
 	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4a68d2ec8d0a34..404b857db39d0d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6173,8 +6173,32 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+	TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, shutdown_flags, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, shutdown_flags)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->shutdown_flags = shutdown_flags;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
 #define XFS_HEALTHMON_TYPE_STRINGS \
 	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }, \
 	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
 	{ XFS_HEALTHMON_SICK,		"sick" }, \
 	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
@@ -6188,6 +6212,7 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);



* [PATCH 12/16] xfs: report media errors through healthmon
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (10 preceding siblings ...)
  2024-12-31 23:41   ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong
@ 2024-12-31 23:41   ` Darrick J. Wong
  2024-12-31 23:42   ` [PATCH 13/16] xfs: report file io " Darrick J. Wong
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have hooks to report media errors, connect them to the
health monitor as well, so that userspace learns which storage device
failed and which range of that device is affected.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
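The schema below reports daddr and bbcount in units of 512-byte blocks.
A consumer that wants byte offsets has to scale both values, so here is
a minimal userspace sketch of that conversion with overflow checks.
EX_BBSHIFT, struct ex_byte_range, and ex_media_error_to_bytes() are
hypothetical names invented for this example; nothing here is part of
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 512-byte basic blocks, as stated in the schema descriptions. */
#define EX_BBSHIFT	9

/* Hypothetical container for the converted range. */
struct ex_byte_range {
	uint64_t	start;	/* byte offset into the device */
	uint64_t	len;	/* length in bytes */
};

/*
 * Convert a media error report (daddr, bbcount) into a byte range.
 * Returns false if the report is empty or if the byte values would not
 * fit in 64 bits; @out is only written on success.
 */
static bool
ex_media_error_to_bytes(uint64_t daddr, uint64_t bbcount,
		struct ex_byte_range *out)
{
	const uint64_t	max_bb = UINT64_MAX >> EX_BBSHIFT;
	uint64_t	start, len;

	assert(out != NULL);

	/* The schema requires bbcount to be at least 1. */
	if (bbcount == 0)
		return false;

	/* Shifting must not lose high bits. */
	if (daddr > max_bb || bbcount > max_bb)
		return false;

	start = daddr << EX_BBSHIFT;
	len = bbcount << EX_BBSHIFT;

	/* The end of the range must not wrap around. */
	if (start + len < start)
		return false;

	out->start = start;
	out->len = len;
	return true;
}
```

For example, daddr 8 with bbcount 16 maps to byte offset 4096 and a
length of 8192 bytes.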
 fs/xfs/libxfs/xfs_healthmon.schema.json |   65 +++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |   96 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_healthmon.h                  |   13 ++++
 fs/xfs/xfs_trace.c                      |    1 
 fs/xfs/xfs_trace.h                      |   51 ++++++++++++++++
 5 files changed, 224 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index a8bc75b0b8c4f9..006f4145faa9f5 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -33,6 +33,9 @@
 		},
 		{
 			"$ref": "#/$events/shutdown"
+		},
+		{
+			"$ref": "#/$events/media_error"
 		}
 	],
 
@@ -63,6 +66,31 @@
 		"i_generation": {
 			"description": "Inode generation number",
 			"type": "integer"
+		},
+		"storage_devs": {
+			"description": "Storage devices in a filesystem",
+			"_comment": [
+				"One of:",
+				"",
+				" * datadev: filesystem device",
+				" * logdev:  external log device",
+				" * rtdev:   realtime volume"
+			],
+			"enum": [
+				"datadev",
+				"logdev",
+				"rtdev"
+			]
+		},
+		"xfs_daddr_t": {
+			"description": "Storage device address, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 0
+		},
+		"bbcount": {
+			"description": "Storage space length, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 1
 		}
 	},
 
@@ -448,6 +476,43 @@
 				"domain",
 				"reasons"
 			]
+		},
+		"media_error": {
+			"title": "Media Error",
+			"description": [
+				"A storage device reported a media error.",
+				"The domain element tells us which storage",
+				"device reported the media failure.  The",
+				"daddr and bbcount elements tell us where",
+				"inside that device the failure was observed."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "media"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"$ref": "#/$defs/storage_devs"
+				},
+				"daddr": {
+					"$ref": "#/$defs/xfs_daddr_t"
+				},
+				"bbcount": {
+					"$ref": "#/$defs/bbcount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"daddr",
+				"bbcount"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index c7df6dad5612f8..c828ea7442e932 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -21,6 +21,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -70,6 +71,7 @@ struct xfs_healthmon {
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
+	struct xfs_media_error_hook	mhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -423,6 +425,59 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+/* Add a media error event to the reporting queue. */
+STATIC int
+xfs_healthmon_media_error_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_media_error_params	*p = data;
+	enum xfs_healthmon_domain	domain = 0; /* shut up gcc */
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_media_error_hook(p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (p->fdev) {
+	case XFS_FAILED_LOGDEV:
+		domain = XFS_HEALTHMON_LOGDEV;
+		break;
+	case XFS_FAILED_RTDEV:
+		domain = XFS_HEALTHMON_RTDEV;
+		break;
+	case XFS_FAILED_DATADEV:
+		domain = XFS_HEALTHMON_DATADEV;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_MEDIA_ERROR, domain);
+	if (!event)
+		goto out_unlock;
+
+	event->daddr = p->daddr;
+	event->bbcount = p->bbcount;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+#endif
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -435,6 +490,7 @@ xfs_healthmon_typestring(
 		[XFS_HEALTHMON_SICK]		= "sick",
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
 		[XFS_HEALTHMON_HEALTHY]		= "healthy",
+		[XFS_HEALTHMON_MEDIA_ERROR]	= "media",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -454,6 +510,9 @@ xfs_healthmon_domstring(
 		[XFS_HEALTHMON_AG]		= "perag",
 		[XFS_HEALTHMON_INODE]		= "inode",
 		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
+		[XFS_HEALTHMON_DATADEV]		= "datadev",
+		[XFS_HEALTHMON_LOGDEV]		= "logdev",
+		[XFS_HEALTHMON_RTDEV]		= "rtdev",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -665,6 +724,23 @@ xfs_healthmon_format_shutdown(
 			event->flags);
 }
 
+/* Render media error as a string set */
+static int
+xfs_healthmon_format_media_error(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"daddr\":      %llu,\n",
+			event->daddr);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"bbcount\":    %llu,\n",
+			event->bbcount);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -730,6 +806,11 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_INODE:
 		ret = xfs_healthmon_format_inode(outbuf, event);
 		break;
+	case XFS_HEALTHMON_DATADEV:
+	case XFS_HEALTHMON_LOGDEV:
+	case XFS_HEALTHMON_RTDEV:
+		ret = xfs_healthmon_format_media_error(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -990,6 +1071,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
@@ -1011,6 +1093,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_media_error_hook_disable();
 	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
 
@@ -1092,6 +1175,7 @@ xfs_ioc_health_monitor(
 	/* Enable hooks to receive events, generally. */
 	xfs_health_hook_enable();
 	xfs_shutdown_hook_enable();
+	xfs_media_error_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1104,11 +1188,16 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_healthhook;
 
+	xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook);
+	ret = xfs_media_error_hook_add(mp, &hm->mhook);
+	if (ret)
+		goto out_shutdownhook;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -1116,18 +1205,21 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_mediahook:
+	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
 	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
+	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index a7b2eaf3dd64e1..23ce320f4b086b 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -17,6 +17,9 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
 	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
 	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
+
+	/* media errors */
+	XFS_HEALTHMON_MEDIA_ERROR,
 };
 
 enum xfs_healthmon_domain {
@@ -27,6 +30,11 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_AG,	/* allocation group metadata */
 	XFS_HEALTHMON_INODE,	/* inode metadata */
 	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
+
+	/* media errors */
+	XFS_HEALTHMON_DATADEV,
+	XFS_HEALTHMON_RTDEV,
+	XFS_HEALTHMON_LOGDEV,
 };
 
 struct xfs_healthmon_event {
@@ -60,6 +68,11 @@ struct xfs_healthmon_event {
 			uint32_t	gen;
 			xfs_ino_t	ino;
 		};
+		/* media errors */
+		struct {
+			xfs_daddr_t	daddr;
+			uint64_t	bbcount;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 41a2ac85dc5fdf..23741ff36a2e14 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -54,6 +54,7 @@
 #include "xfs_fsrefs.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 404b857db39d0d..47293206400d6e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -108,6 +108,7 @@ struct xfs_fsrefs_irec;
 struct xfs_rtgroup;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
+struct xfs_media_error_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6345,6 +6346,56 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+TRACE_EVENT(xfs_healthmon_media_error_hook,
+	TP_PROTO(const struct xfs_media_error_params *p,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(uint64_t, daddr)
+		__field(uint64_t, bbcount)
+		__field(int, pre_remove)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		struct xfs_mount	*mp = p->mp;
+		struct xfs_buftarg	*btp = NULL;
+
+		switch (p->fdev) {
+		case XFS_FAILED_DATADEV:
+			btp = mp->m_ddev_targp;
+			break;
+		case XFS_FAILED_LOGDEV:
+			btp = mp->m_logdev_targp;
+			break;
+		case XFS_FAILED_RTDEV:
+			btp = mp->m_rtdev_targp;
+			break;
+		}
+
+		__entry->dev = mp->m_super->s_dev;
+		if (btp)
+			__entry->error_dev = btp->bt_dev;
+		__entry->daddr = p->daddr;
+		__entry->bbcount = p->bbcount;
+		__entry->pre_remove = p->pre_remove;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx pre_remove? %d events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->error_dev), MINOR(__entry->error_dev),
+		  __entry->daddr,
+		  __entry->bbcount,
+		  __entry->pre_remove,
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#endif
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */



* [PATCH 13/16] xfs: report file io errors through healthmon
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (11 preceding siblings ...)
  2024-12-31 23:41   ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong
@ 2024-12-31 23:42   ` Darrick J. Wong
  2024-12-31 23:42   ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
                     ` (2 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a file I/O error event hook so that we can send events about
buffered read errors, writeback errors, and direct I/O errors to
userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
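To make the shape of a file I/O error event concrete, here is a minimal
userspace sketch that builds a JSON object carrying the fields the
schema below marks as required (type, time_ns, domain, inumber,
generation, pos, len).  The ex_* names are hypothetical, and the
single-line layout is an assumption made for brevity; it is not meant
to reproduce the kernel's exact output.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the four file I/O event types. */
enum ex_fileio_type {
	EX_BUFREAD,
	EX_BUFWRITE,
	EX_DIOREAD,
	EX_DIOWRITE,
};

/* Map an event type to its schema string, or NULL if out of range. */
static const char *
ex_fileio_typestring(unsigned int type)
{
	static const char *const strings[] = {
		[EX_BUFREAD]	= "readahead",
		[EX_BUFWRITE]	= "writeback",
		[EX_DIOREAD]	= "dioread",
		[EX_DIOWRITE]	= "diowrite",
	};

	if (type >= sizeof(strings) / sizeof(strings[0]))
		return NULL;
	return strings[type];
}

/*
 * Format one file I/O error event into @buf.  Returns the number of bytes
 * written, not counting the trailing NUL, or -1 if the type is unknown,
 * the length is zero, or @buf is too small.
 */
static int
ex_format_file_ioerror(char *buf, size_t len, unsigned int type,
		uint64_t time_ns, uint64_t ino, uint32_t gen,
		uint64_t pos, uint64_t count)
{
	const char	*typestr = ex_fileio_typestring(type);
	int		n;

	assert(buf != NULL);

	/* The schema requires len to be at least 1. */
	if (!typestr || count == 0)
		return -1;

	n = snprintf(buf, len,
		"{\"type\": \"%s\", \"time_ns\": %" PRIu64 ", "
		"\"domain\": \"filerange\", \"inumber\": %" PRIu64 ", "
		"\"generation\": %" PRIu32 ", \"pos\": %" PRIu64 ", "
		"\"len\": %" PRIu64 "}",
		typestr, time_ns, ino, gen, pos, count);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The bounds check in ex_fileio_typestring() follows the same idea as the
ARRAY_SIZE() check in xfs_healthmon_typestring(): an unexpected type
never indexes past the end of the string table.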
 fs/xfs/libxfs/xfs_healthmon.schema.json |   77 ++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  120 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_healthmon.h                  |   16 ++++
 fs/xfs/xfs_trace.c                      |    1 
 fs/xfs/xfs_trace.h                      |   50 +++++++++++++
 5 files changed, 262 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 006f4145faa9f5..9c1070a629997c 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -36,6 +36,9 @@
 		},
 		{
 			"$ref": "#/$events/media_error"
+		},
+		{
+			"$ref": "#/$events/file_ioerror"
 		}
 	],
 
@@ -67,6 +70,16 @@
 			"description": "Inode generation number",
 			"type": "integer"
 		},
+		"off_t": {
+			"description": "File position, in bytes",
+			"type": "integer",
+			"minimum": 0
+		},
+		"size_t": {
+			"description": "File operation length, in bytes",
+			"type": "integer",
+			"minimum": 1
+		},
 		"storage_devs": {
 			"description": "Storage devices in a filesystem",
 			"_comment": [
@@ -261,6 +274,26 @@
 		}
 	},
 
+	"$comment": "File IO event data are defined here.",
+	"$fileio": {
+		"types": {
+			"description": [
+				"File I/O operations.  One of:",
+				"",
+				" * readahead: reads into the page cache.",
+				" * writeback: writeback of dirty page cache.",
+				" * dioread:   O_DIRECT reads.",
+				" * diowrite:  O_DIRECT writes."
+			],
+			"enum": [
+				"readahead",
+				"writeback",
+				"dioread",
+				"diowrite"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"lost": {
@@ -513,6 +546,50 @@
 				"daddr",
 				"bbcount"
 			]
+		},
+		"file_ioerror": {
+			"title": "File I/O error",
+			"description": [
+				"A read or a write to a file failed.  The",
+				"inumber, generation, pos, and len fields",
+				"describe the range of the file that is",
+				"affected."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$fileio/types"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "filerange"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"pos": {
+					"$ref": "#/$defs/off_t"
+				},
+				"len": {
+					"$ref": "#/$defs/size_t"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"pos",
+				"len"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index c828ea7442e932..9320f12b60ade9 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -22,6 +22,7 @@
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -72,6 +73,7 @@ struct xfs_healthmon {
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 	struct xfs_media_error_hook	mhook;
+	struct xfs_file_ioerror_hook	fhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -478,6 +480,73 @@ xfs_healthmon_media_error_hook(
 }
 #endif
 
+/* Add a file io error event to the reporting queue. */
+STATIC int
+xfs_healthmon_file_ioerror_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_file_ioerror_params	*p = data;
+	enum xfs_healthmon_type		type = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb);
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+	case XFS_FILE_IOERROR_DIRECT_READ:
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+		break;
+	default:
+		ASSERT(0);
+		return NOTIFY_DONE;
+	}
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+		type = XFS_HEALTHMON_BUFREAD;
+		break;
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+		type = XFS_HEALTHMON_BUFWRITE;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_READ:
+		type = XFS_HEALTHMON_DIOREAD;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+		type = XFS_HEALTHMON_DIOWRITE;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE);
+	if (!event)
+		goto out_unlock;
+
+	event->fino = p->ino;
+	event->fgen = p->gen;
+	event->fpos = p->pos;
+	event->flen = p->len;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -491,6 +560,10 @@ xfs_healthmon_typestring(
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
 		[XFS_HEALTHMON_HEALTHY]		= "healthy",
 		[XFS_HEALTHMON_MEDIA_ERROR]	= "media",
+		[XFS_HEALTHMON_BUFREAD]		= "readahead",
+		[XFS_HEALTHMON_BUFWRITE]	= "writeback",
+		[XFS_HEALTHMON_DIOREAD]		= "dioread",
+		[XFS_HEALTHMON_DIOWRITE]	= "diowrite",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -513,6 +586,7 @@ xfs_healthmon_domstring(
 		[XFS_HEALTHMON_DATADEV]		= "datadev",
 		[XFS_HEALTHMON_LOGDEV]		= "logdev",
 		[XFS_HEALTHMON_RTDEV]		= "rtdev",
+		[XFS_HEALTHMON_FILERANGE]	= "filerange",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -741,6 +815,33 @@ xfs_healthmon_format_media_error(
 			event->bbcount);
 }
 
+/* Render file range events as a string set */
+static int
+xfs_healthmon_format_filerange(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"inumber\":    %llu,\n",
+			event->fino);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"generation\": %u,\n",
+			event->fgen);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"pos\":        %llu,\n",
+			event->fpos);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"len\":        %llu,\n",
+			event->flen);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -811,6 +912,9 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_RTDEV:
 		ret = xfs_healthmon_format_media_error(outbuf, event);
 		break;
+	case XFS_HEALTHMON_FILERANGE:
+		ret = xfs_healthmon_format_filerange(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -1071,6 +1175,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_file_ioerror_hook_del(hm->mp, &hm->fhook);
 	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
@@ -1093,6 +1198,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_file_ioerror_hook_disable();
 	xfs_media_error_hook_disable();
 	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
@@ -1176,6 +1282,7 @@ xfs_ioc_health_monitor(
 	xfs_health_hook_enable();
 	xfs_shutdown_hook_enable();
 	xfs_media_error_hook_enable();
+	xfs_file_ioerror_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1193,11 +1300,17 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_shutdownhook;
 
+	xfs_file_ioerror_hook_setup(&hm->fhook,
+			xfs_healthmon_file_ioerror_hook);
+	ret = xfs_file_ioerror_hook_add(mp, &hm->fhook);
+	if (ret)
+		goto out_mediahook;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -1205,13 +1318,15 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_ioerrhook:
+	xfs_file_ioerror_hook_del(mp, &hm->fhook);
 out_mediahook:
 	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
@@ -1219,6 +1334,7 @@ xfs_ioc_health_monitor(
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
+	xfs_file_ioerror_hook_disable();
 	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 23ce320f4b086b..748173eed79660 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -20,6 +20,12 @@ enum xfs_healthmon_type {
 
 	/* media errors */
 	XFS_HEALTHMON_MEDIA_ERROR,
+
+	/* file range events */
+	XFS_HEALTHMON_BUFREAD,
+	XFS_HEALTHMON_BUFWRITE,
+	XFS_HEALTHMON_DIOREAD,
+	XFS_HEALTHMON_DIOWRITE,
 };
 
 enum xfs_healthmon_domain {
@@ -35,6 +41,9 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_DATADEV,
 	XFS_HEALTHMON_RTDEV,
 	XFS_HEALTHMON_LOGDEV,
+
+	/* file range events */
+	XFS_HEALTHMON_FILERANGE,
 };
 
 struct xfs_healthmon_event {
@@ -73,6 +82,13 @@ struct xfs_healthmon_event {
 			xfs_daddr_t	daddr;
 			uint64_t	bbcount;
 		};
+		/* file range events */
+		struct {
+			xfs_ino_t	fino;
+			loff_t		fpos;
+			uint64_t	flen;
+			uint32_t	fgen;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 23741ff36a2e14..d8e5d607b0dc6a 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -55,6 +55,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 47293206400d6e..aba32f5ccc1a3b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -109,6 +109,7 @@ struct xfs_rtgroup;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
 struct xfs_media_error_params;
+struct xfs_file_ioerror_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6396,6 +6397,55 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->lost_prev)
 );
 #endif
+
+#define XFS_FILE_IOERROR_STRINGS \
+	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
+	{ XFS_FILE_IOERROR_BUFFERED_WRITE,	"writeback" }, \
+	{ XFS_FILE_IOERROR_DIRECT_READ,		"dioread" }, \
+	{ XFS_FILE_IOERROR_DIRECT_WRITE,	"diowrite" }
+
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE);
+
+TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
+	TP_PROTO(const struct xfs_mount *mp,
+		 unsigned long action,
+		 const struct xfs_file_ioerror_params *p,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, action, p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(unsigned long, action)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(long long, pos)
+		__field(unsigned long long, len)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->action = action;
+		__entry->ino = p->ino;
+		__entry->gen = p->gen;
+		__entry->pos = p->pos;
+		__entry->len = p->len;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d ino 0x%llx gen 0x%x op %s pos 0x%llx bytecount 0x%llx events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->gen,
+		  __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS),
+		  __entry->pos,
+		  __entry->len,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (12 preceding siblings ...)
  2024-12-31 23:42   ` [PATCH 13/16] xfs: report file io " Darrick J. Wong
@ 2024-12-31 23:42   ` Darrick J. Wong
  2024-12-31 23:42   ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
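
For anyone skimming the diff, the order of checks in the new handler
can be modelled in plain userspace C.  This is only a sketch under
stated assumptions: struct model_hm, model_healthmon_ioctl and every
MODEL_* constant are hypothetical stand-ins (the real command number
and flag values come from xfs_fs.h and are not reproduced here), and
the validation step is reduced to an unknown-flag check, which is a
guess at what xfs_healthmon_validate does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in xfs_fs.h. */
#define MODEL_IOC_HEALTH_MONITOR	0x5844U
#define MODEL_MONITOR_VERBOSE		(1ULL << 0)
#define MODEL_MONITOR_ALL		(MODEL_MONITOR_VERBOSE)

/* Simplified request; the real struct has more fields. */
struct model_health_monitor {
	uint64_t	flags;
};

/* Simplified monitor state; only the reconfigurable bit is kept. */
struct model_hm {
	bool		verbose;
};

/* Assumed validation: reject unknown flag bits. */
static bool
model_validate(
	const struct model_health_monitor	*hmo)
{
	return (hmo->flags & ~MODEL_MONITOR_ALL) == 0;
}

/* Same order of checks as xfs_healthmon_ioctl in the patch. */
static long
model_healthmon_ioctl(
	struct model_hm				*hm,
	unsigned int				cmd,
	const struct model_health_monitor	*hmo)
{
	assert(hm != NULL);

	if (cmd != MODEL_IOC_HEALTH_MONITOR)
		return -ENOTTY;
	if (hmo == NULL)
		return -EFAULT;	/* stands in for copy_from_user failing */
	if (!model_validate(hmo))
		return -EINVAL;

	/* The only state the patch lets the caller change. */
	hm->verbose = !!(hmo->flags & MODEL_MONITOR_VERBOSE);
	return 0;
}
```

The point of the model is that only the verbose setting is changed by
a reconfiguration; everything else about the monitor stays as it was
when the file was created.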


diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9320f12b60ade9..67f7d4a8cc7f58 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -23,6 +23,8 @@
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
 #include "xfs_file.h"
+#include "xfs_fs.h"
+#include "xfs_ioctl.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -1228,11 +1230,38 @@ xfs_healthmon_validate(
 	return true;
 }
 
+/* Handle ioctls on the health monitoring file. */
+STATIC long
+xfs_healthmon_ioctl(
+	struct file			*file,
+	unsigned int			cmd,
+	unsigned long			p)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm = file->private_data;
+	void __user			*arg = (void __user *)p;
+
+	if (cmd != XFS_IOC_HEALTH_MONITOR)
+		return -ENOTTY;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	mutex_lock(&hm->lock);
+	hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+	mutex_unlock(&hm->lock);
+	return 0;
+}
+
 static const struct file_operations xfs_healthmon_fops = {
 	.owner		= THIS_MODULE,
 	.read_iter	= xfs_healthmon_read_iter,
 	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
+	.unlocked_ioctl	= xfs_healthmon_ioctl,
 };
 
 /*



* [PATCH 15/16] xfs: add media error reporting ioctl
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (13 preceding siblings ...)
  2024-12-31 23:42   ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
@ 2024-12-31 23:42   ` Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new privileged ioctl so that xfs_scrub can report media errors to
the kernel for further processing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/Makefile             |    6 +----
 fs/xfs/libxfs/xfs_fs.h      |   15 ++++++++++++
 fs/xfs/xfs_healthmon.c      |    2 --
 fs/xfs/xfs_ioctl.c          |    3 ++
 fs/xfs/xfs_notify_failure.c |   53 ++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_notify_failure.h |    8 ++++++
 fs/xfs/xfs_trace.h          |    2 --
 7 files changed, 78 insertions(+), 11 deletions(-)
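
The argument checking in xfs_ioc_media_error can be tried out in
userspace with a small model.  This is a sketch, not the kernel code:
the XFS_MEDIA_ERROR_* values are copied from the xfs_fs.h hunk below,
while struct model_media_error, enum model_failed_device and
model_media_error_decode are hypothetical names invented for the
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Constants copied from the xfs_fs.h hunk of this patch. */
#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)

/* 1 | 2 | 3 == 3, so only the two lowest flag bits may be set. */
#define MODEL_VALID_FLAGS	(XFS_MEDIA_ERROR_DATADEV | \
				 XFS_MEDIA_ERROR_LOGDEV | \
				 XFS_MEDIA_ERROR_RTDEV)

/* Userspace stand-in for struct xfs_media_error. */
struct model_media_error {
	uint64_t	flags;		/* device code in the bottom byte */
	uint64_t	daddr;		/* disk address of range */
	uint64_t	bbcount;	/* length, in 512b blocks */
	uint64_t	pad;		/* must be zero */
};

/* Hypothetical stand-in for enum xfs_failed_device. */
enum model_failed_device {
	MODEL_FAILED_DATADEV,
	MODEL_FAILED_LOGDEV,
	MODEL_FAILED_RTDEV,
};

/* Same checks, in the same order, as the ioctl handler. */
static int
model_media_error_decode(
	const struct model_media_error	*me,
	enum model_failed_device	*fdev)
{
	assert(me != NULL && fdev != NULL);

	if (me->pad)
		return -EINVAL;
	if (me->flags & ~(uint64_t)MODEL_VALID_FLAGS)
		return -EINVAL;

	switch (me->flags & XFS_MEDIA_ERROR_DEVMASK) {
	case XFS_MEDIA_ERROR_DATADEV:
		*fdev = MODEL_FAILED_DATADEV;
		break;
	case XFS_MEDIA_ERROR_LOGDEV:
		*fdev = MODEL_FAILED_LOGDEV;
		break;
	case XFS_MEDIA_ERROR_RTDEV:
		*fdev = MODEL_FAILED_RTDEV;
		break;
	default:
		/* A device code of zero passes the mask but lands here. */
		return -EINVAL;
	}
	return 0;
}
```

Note that the device code is a small integer rather than a bit flag,
so a request with flags == 0 passes the mask test and is rejected only
by the default case of the switch.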


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 94a9dc7aa7a1d5..71e6512899da3a 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -99,6 +99,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_notify_failure.o \
 				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
@@ -149,11 +150,6 @@ xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
 
-# notify failure
-ifeq ($(CONFIG_MEMORY_FAILURE),y)
-xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
-endif
-
 xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index d7404e6efd866d..32e552d40b1bf5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1115,6 +1115,20 @@ struct xfs_health_monitor {
 /* Return events in JSON format */
 #define XFS_HEALTH_MONITOR_FMT_JSON	(1)
 
+struct xfs_media_error {
+	__u64	flags;		/* flags */
+	__u64	daddr;		/* disk address of range */
+	__u64	bbcount;	/* length, in 512b blocks */
+	__u64	pad;		/* zero */
+};
+
+#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
+#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
+#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
+
+/* bottom byte of flags is the device code */
+#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1157,6 +1171,7 @@ struct xfs_health_monitor {
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 #define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
+#define XFS_IOC_MEDIA_ERROR	_IOW ('X', 69, struct xfs_media_error)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 67f7d4a8cc7f58..b6fdad798fae89 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -429,7 +429,6 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 /* Add a media error event to the reporting queue. */
 STATIC int
 xfs_healthmon_media_error_hook(
@@ -480,7 +479,6 @@ xfs_healthmon_media_error_hook(
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
-#endif
 
 /* Add a file io error event to the reporting queue. */
 STATIC int
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6c7a30128c7bf6..c253538c48f3b3 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -43,6 +43,7 @@
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1437,6 +1438,8 @@ xfs_file_ioctl(
 
 	case XFS_IOC_HEALTH_MONITOR:
 		return xfs_ioc_health_monitor(mp, arg);
+	case XFS_IOC_MEDIA_ERROR:
+		return xfs_ioc_media_error(mp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index ea68c7e61bb585..fcf9f0139d673c 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -91,9 +91,19 @@ xfs_media_error_hook_setup(
 	xfs_hook_setup(&hook->error_hook, mod_fn);
 }
 #else
-# define xfs_media_error_hook(...)		((void)0)
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	/* empty */
+}
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -463,3 +473,44 @@ xfs_dax_notify_failure(
 const struct dax_holder_operations xfs_dax_holder_operations = {
 	.notify_failure		= xfs_dax_notify_failure,
 };
+#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
+#define XFS_VALID_MEDIA_ERROR_FLAGS	(XFS_MEDIA_ERROR_DATADEV | \
+					 XFS_MEDIA_ERROR_LOGDEV | \
+					 XFS_MEDIA_ERROR_RTDEV)
+int
+xfs_ioc_media_error(
+	struct xfs_mount		*mp,
+	struct xfs_media_error __user	*arg)
+{
+	struct xfs_media_error		me;
+	enum xfs_failed_device		fdev;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&me, arg, sizeof(me)))
+		return -EFAULT;
+
+	if (me.pad)
+		return -EINVAL;
+	if (me.flags & ~XFS_VALID_MEDIA_ERROR_FLAGS)
+		return -EINVAL;
+
+	switch (me.flags & XFS_MEDIA_ERROR_DEVMASK) {
+	case XFS_MEDIA_ERROR_DATADEV:
+		fdev = XFS_FAILED_DATADEV;
+		break;
+	case XFS_MEDIA_ERROR_LOGDEV:
+		fdev = XFS_FAILED_LOGDEV;
+		break;
+	case XFS_MEDIA_ERROR_RTDEV:
+		fdev = XFS_FAILED_RTDEV;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	xfs_media_error_hook(mp, fdev, me.daddr, me.bbcount, false);
+	return 0;
+}
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 835d4af504d832..c23034891d99fd 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -6,7 +6,9 @@
 #ifndef __XFS_NOTIFY_FAILURE_H__
 #define __XFS_NOTIFY_FAILURE_H__
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 extern const struct dax_holder_operations xfs_dax_holder_operations;
+#endif
 
 enum xfs_failed_device {
 	XFS_FAILED_DATADEV,
@@ -14,7 +16,7 @@ enum xfs_failed_device {
 	XFS_FAILED_RTDEV,
 };
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+#if defined(CONFIG_XFS_LIVE_HOOKS)
 struct xfs_media_error_params {
 	struct xfs_mount		*mp;
 	enum xfs_failed_device		fdev;
@@ -46,4 +48,8 @@ struct xfs_media_error_hook { };
 # define xfs_media_error_hook_setup(...)	((void)0)
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+struct xfs_media_error;
+int xfs_ioc_media_error(struct xfs_mount *mp,
+		struct xfs_media_error __user *arg);
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index aba32f5ccc1a3b..3baa39a2b0a8b8 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6348,7 +6348,6 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->lost_prev)
 );
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 TRACE_EVENT(xfs_healthmon_media_error_hook,
 	TP_PROTO(const struct xfs_media_error_params *p,
 		 unsigned int events, bool lost_prev),
@@ -6396,7 +6395,6 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
-#endif
 
 #define XFS_FILE_IOERROR_STRINGS \
 	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \



* [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (14 preceding siblings ...)
  2024-12-31 23:42   ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
@ 2024-12-31 23:43   ` Darrick J. Wong
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Send uevents when we mount and unmount the filesystem, so that we can
trigger systemd services.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_super.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
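
To make the uevent payload concrete, here is a userspace sketch of how
the mount event's environment strings are assembled.  The struct and
function names (model_uevent_env, model_mount_env) and MODEL_ENV_LEN
are hypothetical; only the TYPE=, SOURCE= and SID= keys and the
256-byte buffers are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MODEL_ENV_LEN	256	/* same buffer size as the patch */

/* Holds the formatted strings plus the NULL-terminated env vector. */
struct model_uevent_env {
	char		source[MODEL_ENV_LEN];
	char		sid[MODEL_ENV_LEN];
	const char	*envp[4];
};

/*
 * Build the environment for a mount event.  Returns true if either
 * string had to be truncated to fit its buffer.
 */
static bool
model_mount_env(
	struct model_uevent_env	*ue,
	const char		*source,
	const char		*s_id)
{
	int			n1, n2;

	assert(ue != NULL && source != NULL && s_id != NULL);
	memset(ue, 0, sizeof(*ue));

	n1 = snprintf(ue->source, sizeof(ue->source), "SOURCE=%s", source);
	n2 = snprintf(ue->sid, sizeof(ue->sid), "SID=%s", s_id);

	ue->envp[0] = "TYPE=mount";
	ue->envp[1] = ue->source;
	ue->envp[2] = ue->sid;
	ue->envp[3] = NULL;

	return n1 >= (int)sizeof(ue->source) || n2 >= (int)sizeof(ue->sid);
}
```

A udev rule or service matching on these keys would see, for example,
TYPE=mount, SOURCE=/dev/sda1 and SID=sda1; those sample values are
made up for the illustration.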


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index df6afcf8840948..1d295991e08047 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1197,12 +1197,28 @@ xfs_inodegc_free_percpu(
 	free_percpu(mp->m_inodegc);
 }
 
+static void
+xfs_send_unmount_uevent(
+	struct xfs_mount	*mp)
+{
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		sid,
+		NULL,
+	};
+
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	xfs_send_unmount_uevent(mp);
 	xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1590,6 +1606,29 @@ xfs_debugfs_mkdir(
 	return child;
 }
 
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+	struct fs_context	*fc,
+	struct xfs_mount	*mp)
+{
+	char			source[256] = "";
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		source,
+		sid,
+		NULL,
+	};
+
+	snprintf(source, sizeof(source), "SOURCE=%s", fc->source);
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+}
+
 static int
 xfs_fs_fill_super(
 	struct super_block	*sb,
@@ -1904,6 +1943,7 @@ xfs_fs_fill_super(
 		mp->m_debugfs_uuid = NULL;
 	}
 
+	xfs_send_mount_uevent(fc, mp);
 	return 0;
 
  out_filestream_unmount:



* [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (4 preceding siblings ...)
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
                     ` (4 more replies)
  2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
                   ` (9 subsequent siblings)
  15 siblings, 5 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This series creates a new NOALLOC flag for allocation groups that causes
the block and inode allocators to look elsewhere when trying to
allocate resources.  This is either the first part of a patchset to
implement online shrinking (set noalloc on the last AGs, run fsr to move
the files and directories) or freeze-free rmapbt rebuilding (set
noalloc to prevent creation of new mappings, then hook deletion of old
mappings).  This is still totally a research project.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=noalloc-ags

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=noalloc-ags
---
Commits in this patchset:
 * xfs: track deferred ops statistics
 * xfs: create a noalloc mode for allocation groups
 * xfs: enable userspace to hide an AG from allocation
 * xfs: apply noalloc mode to inode allocations too
 * xfs_io: enhance the aginfo command to control the noalloc flag
---
 include/xfs_trace.h  |    2 +
 include/xfs_trans.h  |    4 ++
 io/aginfo.c          |   45 ++++++++++++++++++--
 libxfs/xfs_ag.c      |  114 ++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_ag.h      |    8 ++++
 libxfs/xfs_ag_resv.c |   28 +++++++++++-
 libxfs/xfs_defer.c   |   18 +++++++-
 libxfs/xfs_fs.h      |    5 ++
 libxfs/xfs_ialloc.c  |    3 +
 man/man8/xfs_io.8    |    6 ++-
 10 files changed, 223 insertions(+), 10 deletions(-)



* [PATCH 1/5] xfs: track deferred ops statistics
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:43   ` Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

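Track some basic per-transaction statistics on how hard we're pushing
the defer ops: how many are pending, the most we've seen pending at
once, and how many have finished.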

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/xfs_trans.h |    4 ++++
 libxfs/xfs_defer.c  |   18 +++++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)
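
The bookkeeping added here is easy to model outside of libxfs.  The
sketch below assumes a stripped-down transaction that carries only
the three new counters; struct model_trans and the model_defer_*
helpers are hypothetical names, and each helper corresponds to one of
the hunks in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Only the three counters added by this patch. */
struct model_trans {
	unsigned int	t_dfops_nr;		/* pending dfops */
	unsigned int	t_dfops_nr_max;		/* high-water mark */
	unsigned int	t_dfops_finished;	/* completed dfops */
};

/* A new pending structure was allocated (xfs_defer_add and friends). */
static void
model_defer_add(struct model_trans *tp)
{
	tp->t_dfops_nr++;
}

/* Sampled at the top of each pass of the finish loop. */
static void
model_defer_sample(struct model_trans *tp)
{
	if (tp->t_dfops_nr > tp->t_dfops_nr_max)
		tp->t_dfops_nr_max = tp->t_dfops_nr;
}

/* One pending structure was completed and freed. */
static void
model_defer_finish_one(struct model_trans *tp)
{
	assert(tp->t_dfops_nr > 0);
	tp->t_dfops_nr--;
	tp->t_dfops_finished++;
}

/* Cancelling drops everything that is still pending. */
static void
model_defer_cancel(struct model_trans *tp)
{
	tp->t_dfops_nr = 0;
}

/* Hand the pending work and the statistics from stp to dtp. */
static void
model_defer_move(struct model_trans *dtp, struct model_trans *stp)
{
	dtp->t_dfops_nr += stp->t_dfops_nr;
	dtp->t_dfops_nr_max = stp->t_dfops_nr_max;
	dtp->t_dfops_finished = stp->t_dfops_finished;
	stp->t_dfops_nr = 0;
	stp->t_dfops_nr_max = 0;
	stp->t_dfops_finished = 0;
}
```

The high-water mark is only sampled inside the finish loop, so a
transaction that queues work and is then cancelled without finishing
any of it never updates t_dfops_nr_max.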


diff --git a/include/xfs_trans.h b/include/xfs_trans.h
index 248064019a0ab5..64d73c36851b75 100644
--- a/include/xfs_trans.h
+++ b/include/xfs_trans.h
@@ -82,6 +82,10 @@ typedef struct xfs_trans {
 	long			t_frextents_delta;/* superblock freextents chg*/
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_dfops;	/* deferred operations */
+
+	unsigned int	t_dfops_nr;
+	unsigned int	t_dfops_nr_max;
+	unsigned int	t_dfops_finished;
 } xfs_trans_t;
 
 void	xfs_trans_init(struct xfs_mount *);
diff --git a/libxfs/xfs_defer.c b/libxfs/xfs_defer.c
index 8f6708c0f3bfcd..7e6167949f6509 100644
--- a/libxfs/xfs_defer.c
+++ b/libxfs/xfs_defer.c
@@ -611,6 +611,8 @@ xfs_defer_finish_one(
 	/* Done with the dfp, free it. */
 	list_del(&dfp->dfp_list);
 	kmem_cache_free(xfs_defer_pending_cache, dfp);
+	tp->t_dfops_nr--;
+	tp->t_dfops_finished++;
 out:
 	if (ops->finish_cleanup)
 		ops->finish_cleanup(tp, state, error);
@@ -673,6 +675,9 @@ xfs_defer_finish_noroll(
 
 		list_splice_init(&(*tp)->t_dfops, &dop_pending);
 
+		(*tp)->t_dfops_nr_max = max((*tp)->t_dfops_nr,
+					    (*tp)->t_dfops_nr_max);
+
 		if (has_intents < 0) {
 			error = has_intents;
 			goto out_shutdown;
@@ -714,6 +719,7 @@ xfs_defer_finish_noroll(
 	xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE);
 	trace_xfs_defer_finish_error(*tp, error);
 	xfs_defer_cancel_list((*tp)->t_mountp, &dop_pending);
+	(*tp)->t_dfops_nr = 0;
 	xfs_defer_cancel(*tp);
 	return error;
 }
@@ -761,6 +767,7 @@ xfs_defer_cancel(
 	trace_xfs_defer_cancel(tp, _RET_IP_);
 	xfs_defer_trans_abort(tp, &tp->t_dfops);
 	xfs_defer_cancel_list(mp, &tp->t_dfops);
+	tp->t_dfops_nr = 0;
 }
 
 /*
@@ -846,8 +853,10 @@ xfs_defer_add(
 	}
 
 	dfp = xfs_defer_find_last(tp, ops);
-	if (!dfp || !xfs_defer_can_append(dfp, ops))
+	if (!dfp || !xfs_defer_can_append(dfp, ops)) {
 		dfp = xfs_defer_alloc(&tp->t_dfops, ops);
+		tp->t_dfops_nr++;
+	}
 
 	xfs_defer_add_item(dfp, li);
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, li);
@@ -872,6 +881,7 @@ xfs_defer_add_barrier(
 		return;
 
 	xfs_defer_alloc(&tp->t_dfops, &xfs_barrier_defer_type);
+	tp->t_dfops_nr++;
 
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL);
 }
@@ -932,6 +942,12 @@ xfs_defer_move(
 	struct xfs_trans	*stp)
 {
 	list_splice_init(&stp->t_dfops, &dtp->t_dfops);
+	dtp->t_dfops_nr += stp->t_dfops_nr;
+	dtp->t_dfops_nr_max = stp->t_dfops_nr_max;
+	dtp->t_dfops_finished = stp->t_dfops_finished;
+	stp->t_dfops_nr = 0;
+	stp->t_dfops_nr_max = 0;
+	stp->t_dfops_finished = 0;
 
 	/*
 	 * Low free space mode was historically controlled by a dfops field.



* [PATCH 2/5] xfs: create a noalloc mode for allocation groups
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
@ 2024-12-31 23:43   ` Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG.  We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.

Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/xfs_trace.h  |    2 ++
 libxfs/xfs_ag.c      |   60 ++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_ag.h      |    8 +++++++
 libxfs/xfs_ag_resv.c |   28 +++++++++++++++++++++--
 4 files changed, 95 insertions(+), 3 deletions(-)
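
The fdblocks accounting described above can be illustrated with a toy
model.  This is a sketch under simplifying assumptions: one counter
per AG stands in for what xfs_ag_fdblocks() computes, reservations are
ignored, and the -ENOSPC return is an assumption about how the real
decrement fails.  All model_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One AG: its free block count and the noalloc state bit. */
struct model_ag {
	uint32_t	freeblks;	/* stands in for xfs_ag_fdblocks() */
	bool		noalloc;
};

/* The filesystem-wide free block counter. */
struct model_fs {
	uint64_t	fdblocks;
};

/* Hide the AG's free space from the global counter. */
static int
model_set_noalloc(struct model_fs *fs, struct model_ag *ag)
{
	assert(fs != NULL && ag != NULL);

	if (ag->noalloc)
		return 0;
	if (fs->fdblocks < ag->freeblks)
		return -ENOSPC;	/* assumed failure mode of the decrement */

	fs->fdblocks -= ag->freeblks;
	ag->noalloc = true;
	return 0;
}

/* Give back whatever is free in the AG right now. */
static void
model_clear_noalloc(struct model_fs *fs, struct model_ag *ag)
{
	assert(fs != NULL && ag != NULL);

	if (!ag->noalloc)
		return;
	fs->fdblocks += ag->freeblks;
	ag->noalloc = false;
}

/* Free an extent; fdblocks is only credited outside noalloc mode. */
static void
model_free_extent(struct model_fs *fs, struct model_ag *ag, uint32_t len)
{
	assert(fs != NULL && ag != NULL);

	ag->freeblks += len;
	if (!ag->noalloc)
		fs->fdblocks += len;
}
```

With 1000 free blocks in the filesystem and 300 of them in the AG,
setting noalloc leaves 700 visible.  Freeing 50 blocks in that AG
while noalloc is set does not change the visible count, and clearing
noalloc then returns all 350 at once.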


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 30166c11dd597b..7778366c5e3319 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -13,6 +13,8 @@
 #define trace_xfbtree_trans_cancel_buf(...)	((void) 0)
 #define trace_xfbtree_trans_commit_buf(...)	((void) 0)
 
+#define trace_xfs_ag_clear_noalloc(a)		((void) 0)
+#define trace_xfs_ag_set_noalloc(a)		((void) 0)
 #define trace_xfs_agfl_reset(a,b,c,d)		((void) 0)
 #define trace_xfs_agfl_free_defer(...)		((void) 0)
 #define trace_xfs_alloc_cur_check(...)		((void) 0)
diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c
index 095b581a116180..462d16347cadb9 100644
--- a/libxfs/xfs_ag.c
+++ b/libxfs/xfs_ag.c
@@ -974,3 +974,63 @@ xfs_ag_get_geometry(
 	xfs_buf_relse(agi_bp);
 	return error;
 }
+
+/* How many blocks does this AG contribute to fdblocks? */
+xfs_extlen_t
+xfs_ag_fdblocks(
+	struct xfs_perag		*pag)
+{
+	xfs_extlen_t			ret;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+
+	ret = pag->pagf_freeblks + pag->pagf_flcount + pag->pagf_btreeblks;
+	ret -= pag->pag_meta_resv.ar_reserved;
+	ret -= pag->pag_rmapbt_resv.ar_orig_reserved;
+	return ret;
+}
+
+/*
+ * Hide all the free space in this AG.  Caller must hold both the AGI and the
+ * AGF buffers or have otherwise prevented concurrent access.
+ */
+int
+xfs_ag_set_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+	int			error;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+	ASSERT(xfs_perag_initialised_agi(pag));
+
+	if (xfs_perag_prohibits_alloc(pag))
+		return 0;
+
+	error = xfs_dec_fdblocks(mp, xfs_ag_fdblocks(pag), false);
+	if (error)
+		return error;
+
+	trace_xfs_ag_set_noalloc(pag);
+	set_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+	return 0;
+}
+
+/*
+ * Unhide all the free space in this AG.  Caller must hold both the AGI and
+ * the AGF buffers or have otherwise prevented concurrent access.
+ */
+void
+xfs_ag_clear_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+
+	if (!xfs_perag_prohibits_alloc(pag))
+		return;
+
+	xfs_add_fdblocks(mp, xfs_ag_fdblocks(pag));
+
+	trace_xfs_ag_clear_noalloc(pag);
+	clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+}
diff --git a/libxfs/xfs_ag.h b/libxfs/xfs_ag.h
index 1f24cfa2732172..e8fae59206d929 100644
--- a/libxfs/xfs_ag.h
+++ b/libxfs/xfs_ag.h
@@ -120,6 +120,7 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag)
 #define XFS_AGSTATE_PREFERS_METADATA	2
 #define XFS_AGSTATE_ALLOWS_INODES	3
 #define XFS_AGSTATE_AGFL_NEEDS_RESET	4
+#define XFS_AGSTATE_NOALLOC		5
 
 #define __XFS_AG_OPSTATE(name, NAME) \
 static inline bool xfs_perag_ ## name (struct xfs_perag *pag) \
@@ -132,6 +133,7 @@ __XFS_AG_OPSTATE(initialised_agi, AGI_INIT)
 __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
 __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
 __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
+__XFS_AG_OPSTATE(prohibits_alloc, NOALLOC)
 
 int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t orig_agcount,
 		xfs_agnumber_t new_agcount, xfs_rfsblock_t dcount,
@@ -164,6 +166,7 @@ xfs_perag_put(
 	xfs_group_put(pag_group(pag));
 }
 
+
 /* Active AG references */
 static inline struct xfs_perag *
 xfs_perag_grab(
@@ -208,6 +211,11 @@ xfs_perag_next(
 	return xfs_perag_next_from(mp, pag, 0);
 }
 
+/* Enable or disable allocation from an AG */
+xfs_extlen_t xfs_ag_fdblocks(struct xfs_perag *pag);
+int xfs_ag_set_noalloc(struct xfs_perag *pag);
+void xfs_ag_clear_noalloc(struct xfs_perag *pag);
+
 /*
  * Per-ag geometry infomation and validation
  */
diff --git a/libxfs/xfs_ag_resv.c b/libxfs/xfs_ag_resv.c
index 83cac20331fd34..e811a6807e12ea 100644
--- a/libxfs/xfs_ag_resv.c
+++ b/libxfs/xfs_ag_resv.c
@@ -20,6 +20,7 @@
 #include "xfs_ialloc_btree.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfs_ag.h"
 
 /*
  * Per-AG Block Reservations
@@ -73,6 +74,13 @@ xfs_ag_resv_critical(
 	xfs_extlen_t			avail;
 	xfs_extlen_t			orig;
 
+	/*
+	 * Pretend we're critically low on reservations in this AG to scare
+	 * everyone else away.
+	 */
+	if (xfs_perag_prohibits_alloc(pag))
+		return true;
+
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
 		avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
@@ -115,7 +123,12 @@ xfs_ag_resv_needed(
 		break;
 	case XFS_AG_RESV_METAFILE:
 	case XFS_AG_RESV_NONE:
-		/* empty */
+		/*
+		 * In noalloc mode, we pretend that all the free blocks in this
+		 * AG have been allocated.  Make this AG look full.
+		 */
+		if (xfs_perag_prohibits_alloc(pag))
+			len += xfs_ag_fdblocks(pag);
 		break;
 	default:
 		ASSERT(0);
@@ -343,6 +356,8 @@ xfs_ag_resv_alloc_extent(
 	xfs_extlen_t			len;
 	uint				field;
 
+	ASSERT(type != XFS_AG_RESV_NONE || !xfs_perag_prohibits_alloc(pag));
+
 	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
 
 	switch (type) {
@@ -400,7 +415,14 @@ xfs_ag_resv_free_extent(
 		ASSERT(0);
 		fallthrough;
 	case XFS_AG_RESV_NONE:
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		/*
+		 * Normally we put freed blocks back into fdblocks.  In noalloc
+		 * mode, however, we pretend that there are no fdblocks in the
+		 * AG, so don't put them back.
+		 */
+		if (!xfs_perag_prohibits_alloc(pag))
+			xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS,
+					(int64_t)len);
 		fallthrough;
 	case XFS_AG_RESV_IGNORE:
 		return;
@@ -413,6 +435,6 @@ xfs_ag_resv_free_extent(
 	/* Freeing into the reserved pool only requires on-disk update... */
 	xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
 	/* ...but freeing beyond that requires in-core and on-disk update. */
-	if (len > leftover)
+	if (len > leftover && !xfs_perag_prohibits_alloc(pag))
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover);
 }



* [PATCH 3/5] xfs: enable userspace to hide an AG from allocation
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
  2024-12-31 23:43   ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
@ 2024-12-31 23:43   ` Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an administrative interface so that userspace can hide an allocation
group from block allocation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_ag.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_fs.h |    5 +++++
 2 files changed, 59 insertions(+)
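
The calling convention for ag_flags is perhaps easiest to see in a
small model.  This sketch keeps only the noalloc bit as AG state and
ignores the error return of xfs_ag_set_noalloc; the two flag values
are copied from the xfs_fs.h hunk below, and model_update_geoflags is
a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the xfs_fs.h hunk of this patch. */
#define XFS_AG_FLAG_UPDATE	(1 << 0)  /* update flags */
#define XFS_AG_FLAG_NOALLOC	(1 << 1)  /* do not allocate from this AG */

/* Report the current state as flags; UPDATE is never reported. */
static uint32_t
model_calc_geoflags(bool noalloc)
{
	return noalloc ? XFS_AG_FLAG_NOALLOC : 0;
}

/*
 * Apply the caller's flags to the AG state and return the flags that
 * would be reported back.  Without UPDATE the call is a pure query.
 */
static uint32_t
model_update_geoflags(bool *noalloc, uint32_t new_flags)
{
	assert(noalloc != NULL);

	if (!(new_flags & XFS_AG_FLAG_UPDATE))
		return model_calc_geoflags(*noalloc);

	/* With UPDATE set, NOALLOC is taken as the desired state. */
	*noalloc = !!(new_flags & XFS_AG_FLAG_NOALLOC);
	return model_calc_geoflags(*noalloc);
}
```

In other words, passing NOALLOC without UPDATE changes nothing, and
passing UPDATE alone clears noalloc mode.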


diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c
index 462d16347cadb9..b3e21e0d26a36c 100644
--- a/libxfs/xfs_ag.c
+++ b/libxfs/xfs_ag.c
@@ -930,6 +930,54 @@ xfs_ag_extend_space(
 	return 0;
 }
 
+/* Compute the AG geometry flags. */
+static inline uint32_t
+xfs_ag_calc_geoflags(
+	struct xfs_perag	*pag)
+{
+	uint32_t		ret = 0;
+
+	if (xfs_perag_prohibits_alloc(pag))
+		ret |= XFS_AG_FLAG_NOALLOC;
+
+	return ret;
+}
+
+/*
+ * Compare the current AG geometry flags against the flags in the AG geometry
+ * structure and update the AG state to reflect any changes, then update the
+ * struct to reflect the current status.
+ */
+static inline int
+xfs_ag_update_geoflags(
+	struct xfs_perag	*pag,
+	struct xfs_ag_geometry	*ageo,
+	uint32_t		new_flags)
+{
+	uint32_t		old_flags = xfs_ag_calc_geoflags(pag);
+	int			error;
+
+	if (!(new_flags & XFS_AG_FLAG_UPDATE)) {
+		ageo->ag_flags = old_flags;
+		return 0;
+	}
+
+	if ((old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    !(new_flags & XFS_AG_FLAG_NOALLOC)) {
+		xfs_ag_clear_noalloc(pag);
+	}
+
+	if (!(old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    (new_flags & XFS_AG_FLAG_NOALLOC)) {
+		error = xfs_ag_set_noalloc(pag);
+		if (error)
+			return error;
+	}
+
+	ageo->ag_flags = xfs_ag_calc_geoflags(pag);
+	return 0;
+}
+
 /* Retrieve AG geometry. */
 int
 xfs_ag_get_geometry(
@@ -941,6 +989,7 @@ xfs_ag_get_geometry(
 	struct xfs_agi		*agi;
 	struct xfs_agf		*agf;
 	unsigned int		freeblks;
+	uint32_t		inflags = ageo->ag_flags;
 	int			error;
 
 	/* Lock the AG headers. */
@@ -951,6 +1000,10 @@ xfs_ag_get_geometry(
 	if (error)
 		goto out_agi;
 
+	error = xfs_ag_update_geoflags(pag, ageo, inflags);
+	if (error)
+		goto out;
+
 	/* Fill out form. */
 	memset(ageo, 0, sizeof(*ageo));
 	ageo->ag_number = pag_agno(pag);
@@ -968,6 +1021,7 @@ xfs_ag_get_geometry(
 	ageo->ag_freeblks = freeblks;
 	xfs_ag_geom_health(pag, ageo);
 
+out:
 	/* Release resources. */
 	xfs_buf_relse(agf_bp);
 out_agi:
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 12463ba766da05..b391bf9de93dbf 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -307,6 +307,11 @@ struct xfs_ag_geometry {
 #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9)  /* reference counts */
 #define XFS_AG_GEOM_SICK_INODES	(1 << 10) /* bad inodes were seen */
 
+#define XFS_AG_FLAG_UPDATE	(1 << 0)  /* update flags */
+#define XFS_AG_FLAG_NOALLOC	(1 << 1)  /* do not allocate from this AG */
+#define XFS_AG_FLAG_ALL		(XFS_AG_FLAG_UPDATE | \
+				 XFS_AG_FLAG_NOALLOC)
+
 /*
  * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT
  */



* [PATCH 4/5] xfs: apply noalloc mode to inode allocations too
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:43   ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
@ 2024-12-31 23:44   ` Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't allow inode allocations from this group if it's marked noalloc.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_ialloc.c |    3 +++
 1 file changed, 3 insertions(+)


diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index b401299ad933f7..a086fb30b227a0 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -1102,6 +1102,7 @@ xfs_dialloc_ag_inobt(
 
 	ASSERT(xfs_perag_initialised_agi(pag));
 	ASSERT(xfs_perag_allows_inodes(pag));
+	ASSERT(!xfs_perag_prohibits_alloc(pag));
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
@@ -1730,6 +1731,8 @@ xfs_dialloc_good_ag(
 		return false;
 	if (!xfs_perag_allows_inodes(pag))
 		return false;
+	if (xfs_perag_prohibits_alloc(pag))
+		return false;
 
 	if (!xfs_perag_initialised_agi(pag)) {
 		error = xfs_ialloc_read_agi(pag, tp, 0, NULL);



* [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:44   ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
@ 2024-12-31 23:44   ` Darrick J. Wong
  4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Augment the aginfo command to be able to set and clear the noalloc
state for an AG.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
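The flag handling that -o drives can be read in isolation from the rest of
report_aginfo().  A minimal sketch, assuming the flag values from the xfs_fs.h
hunk earlier in this series; aginfo_apply_oflag() is a hypothetical helper
name used only for illustration and does not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the xfs_fs.h hunk earlier in this series. */
#define XFS_AG_FLAG_UPDATE	(1 << 0)  /* update flags */
#define XFS_AG_FLAG_NOALLOC	(1 << 1)  /* do not allocate from this AG */

/*
 * Hypothetical helper (not in the patch): compute the ag_flags value to
 * send back to the kernel.  oflag is -1 (report only), 0 (clear NOALLOC),
 * or 1 (set NOALLOC).  *update is set when a second call is needed.
 */
static uint32_t
aginfo_apply_oflag(
	uint32_t	ag_flags,
	int		oflag,
	int		*update)
{
	assert(oflag >= -1 && oflag <= 1);

	*update = 0;
	switch (oflag) {
	case 0:
		ag_flags |= XFS_AG_FLAG_UPDATE;
		ag_flags &= ~XFS_AG_FLAG_NOALLOC;
		*update = 1;
		break;
	case 1:
		ag_flags |= XFS_AG_FLAG_UPDATE | XFS_AG_FLAG_NOALLOC;
		*update = 1;
		break;
	}
	return ag_flags;
}
```

Going by the usage string in this patch, an invocation such as
"aginfo -a 3 -o 1" would take the second branch for AG 3, and omitting -o
leaves the flags untouched.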
 io/aginfo.c       |   45 ++++++++++++++++++++++++++++++++++++++++-----
 man/man8/xfs_io.8 |    6 +++++-
 2 files changed, 45 insertions(+), 6 deletions(-)


diff --git a/io/aginfo.c b/io/aginfo.c
index f81986f0df4df3..0320a98b12f981 100644
--- a/io/aginfo.c
+++ b/io/aginfo.c
@@ -19,9 +19,11 @@ static cmdinfo_t rginfo_cmd;
 static int
 report_aginfo(
 	struct xfs_fd		*xfd,
-	xfs_agnumber_t		agno)
+	xfs_agnumber_t		agno,
+	int			oflag)
 {
 	struct xfs_ag_geometry	ageo = { 0 };
+	bool			update = false;
 	int			ret;
 
 	ret = -xfrog_ag_geometry(xfd->fd, agno, &ageo);
@@ -30,6 +32,26 @@ report_aginfo(
 		return 1;
 	}
 
+	switch (oflag) {
+	case 0:
+		ageo.ag_flags |= XFS_AG_FLAG_UPDATE;
+		ageo.ag_flags &= ~XFS_AG_FLAG_NOALLOC;
+		update = true;
+		break;
+	case 1:
+		ageo.ag_flags |= (XFS_AG_FLAG_UPDATE | XFS_AG_FLAG_NOALLOC);
+		update = true;
+		break;
+	}
+
+	if (update) {
+		ret = -xfrog_ag_geometry(xfd->fd, agno, &ageo);
+		if (ret) {
+			xfrog_perror(ret, "aginfo update");
+			return 1;
+		}
+	}
+
 	printf(_("AG: %u\n"),		ageo.ag_number);
 	printf(_("Blocks: %u\n"),	ageo.ag_length);
 	printf(_("Free Blocks: %u\n"),	ageo.ag_freeblks);
@@ -51,6 +73,7 @@ aginfo_f(
 	struct xfs_fd		xfd = XFS_FD_INIT(file->fd);
 	unsigned long long	x;
 	xfs_agnumber_t		agno = NULLAGNUMBER;
+	int			oflag = -1;
 	int			c;
 	int			ret = 0;
 
@@ -61,7 +84,7 @@ aginfo_f(
 		return 1;
 	}
 
-	while ((c = getopt(argc, argv, "a:")) != EOF) {
+	while ((c = getopt(argc, argv, "a:o:")) != EOF) {
 		switch (c) {
 		case 'a':
 			errno = 0;
@@ -74,16 +97,27 @@ aginfo_f(
 			}
 			agno = x;
 			break;
+		case 'o':
+			errno = 0;
+			x = strtoll(optarg, NULL, 10);
+			if (!errno && x != 0 && x != 1)
+				errno = ERANGE;
+			if (errno) {
+				perror("aginfo");
+				return 1;
+			}
+			oflag = x;
+			break;
 		default:
 			return command_usage(&aginfo_cmd);
 		}
 	}
 
 	if (agno != NULLAGNUMBER) {
-		ret = report_aginfo(&xfd, agno);
+		ret = report_aginfo(&xfd, agno, oflag);
 	} else {
 		for (agno = 0; !ret && agno < xfd.fsgeom.agcount; agno++) {
-			ret = report_aginfo(&xfd, agno);
+			ret = report_aginfo(&xfd, agno, oflag);
 		}
 	}
 
@@ -98,6 +132,7 @@ aginfo_help(void)
 "Report allocation group geometry.\n"
 "\n"
 " -a agno  -- Report on the given allocation group.\n"
+" -o state -- Change the NOALLOC state for this allocation group.\n"
 "\n"));
 
 }
@@ -107,7 +142,7 @@ static cmdinfo_t aginfo_cmd = {
 	.cfunc = aginfo_f,
 	.argmin = 0,
 	.argmax = -1,
-	.args = "[-a agno]",
+	.args = "[-a agno] [-o state]",
 	.flags = CMD_NOMAP_OK,
 	.help = aginfo_help,
 };
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 59d5ddc54dcc66..a42ab61a0de422 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1243,7 +1243,7 @@ .SH MEMORY MAPPED I/O COMMANDS
 
 .SH FILESYSTEM COMMANDS
 .TP
-.BI "aginfo [ \-a " agno " ]"
+.BI "aginfo [ \-a " agno " ] [ \-o " state " ]"
 Show information about or update the state of allocation groups.
 .RE
 .RS 1.0i
@@ -1251,6 +1251,10 @@ .SH FILESYSTEM COMMANDS
 .TP
 .BI \-a
 Act only on a specific allocation group.
+.TP
+.BI \-o
+If 0, clear the NOALLOC flag.
+If 1, set the NOALLOC flag.
 .PD
 .RE
 



* [PATCHSET 2/5] xfsprogs: report refcount information to userspace
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (5 preceding siblings ...)
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

Create a new ioctl to report the number of owners of each disk block so
that reflink-aware defraggers can make better decisions about which
extents to target.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
 * xfs: export reference count information to userspace
 * xfs_io: dump reference count information
---
 io/Makefile                         |    1 
 io/fsrefcounts.c                    |  476 +++++++++++++++++++++++++++++++++++
 io/init.c                           |    1 
 io/io.h                             |    1 
 libxfs/xfs_fs.h                     |   80 ++++++
 man/man2/ioctl_xfs_getfsrefcounts.2 |  237 +++++++++++++++++
 man/man8/xfs_io.8                   |   88 ++++++
 7 files changed, 884 insertions(+)
 create mode 100644 io/fsrefcounts.c
 create mode 100644 man/man2/ioctl_xfs_getfsrefcounts.2



* [PATCH 1/2] xfs: export reference count information to userspace
  2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:44   ` Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong
  1 sibling, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Export refcount info to userspace so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
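To make the calling convention concrete, here is a sketch of the paging loop a
caller would run.  Assumptions are flagged loudly: the two structures are
local copies of the ones proposed below, fake_getfsrefcounts() is a
hypothetical stand-in for ioctl(fd, XFS_IOC_GETFSREFCOUNTS, head) that serves
made-up 4096-byte extents, and count_all_refs() is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local copies of the structures proposed in this patch. */
struct xfs_getfsrefs {
	uint32_t	fcr_device;
	uint32_t	fcr_flags;
	uint64_t	fcr_physical;
	uint64_t	fcr_owners;
	uint64_t	fcr_length;
	uint64_t	fcr_reserved[4];
};

struct xfs_getfsrefs_head {
	uint32_t	fch_iflags;
	uint32_t	fch_oflags;
	uint32_t	fch_count;
	uint32_t	fch_entries;
	uint64_t	fch_reserved[6];
	struct xfs_getfsrefs	fch_keys[2];
	struct xfs_getfsrefs	fch_recs[];
};

#define FCR_OF_LAST	(1U << 0)

/* Hypothetical stand-in for the ioctl: serves "total" fake extents. */
static int
fake_getfsrefcounts(struct xfs_getfsrefs_head *head, uint64_t total)
{
	/* The low key's length is added to its start to find the resume point. */
	uint64_t	next = (head->fch_keys[0].fcr_physical +
				head->fch_keys[0].fcr_length) / 4096;
	uint32_t	i;

	head->fch_entries = 0;
	for (i = 0; i < head->fch_count && next < total; i++, next++) {
		struct xfs_getfsrefs	*r = &head->fch_recs[i];

		memset(r, 0, sizeof(*r));
		r->fcr_physical = next * 4096;
		r->fcr_length = 4096;
		r->fcr_owners = 2;
		if (next == total - 1)
			r->fcr_flags |= FCR_OF_LAST;
		head->fch_entries++;
	}
	return 0;
}

/* Walk every record in batches of "batch"; return how many were seen. */
static uint64_t
count_all_refs(uint32_t batch, uint64_t total)
{
	struct xfs_getfsrefs_head	*head;
	uint64_t			seen = 0;

	head = calloc(1, sizeof(*head) + batch * sizeof(struct xfs_getfsrefs));
	assert(head != NULL);
	head->fch_count = batch;
	/* High key of all ones: do not stop early. */
	memset(&head->fch_keys[1], 0xff, sizeof(head->fch_keys[1]));

	while (fake_getfsrefcounts(head, total) == 0 && head->fch_entries > 0) {
		struct xfs_getfsrefs	*last;

		last = &head->fch_recs[head->fch_entries - 1];
		seen += head->fch_entries;
		if (last->fcr_flags & FCR_OF_LAST)
			break;
		/* Same assignment that xfs_getfsrefs_advance() performs. */
		head->fch_keys[0] = *last;
	}
	free(head);
	return seen;
}
```

Note that the resume key is taken from fch_entries - 1, the last record
actually filled in, not from fch_count - 1.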
 libxfs/xfs_fs.h                     |   80 ++++++++++++
 man/man2/ioctl_xfs_getfsrefcounts.2 |  237 +++++++++++++++++++++++++++++++++++
 2 files changed, 317 insertions(+)
 create mode 100644 man/man2/ioctl_xfs_getfsrefcounts.2


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index b391bf9de93dbf..936f719236944f 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1008,6 +1008,85 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+/*
+ *	Structure for XFS_IOC_GETFSREFCOUNTS.
+ *
+ *	The memory layout for this call is the scalar values defined in struct
+ *	xfs_getfsrefs_head, followed by two struct xfs_getfsrefs that describe
+ *	the lower and upper bound of mappings to return, followed by an array
+ *	of struct xfs_getfsrefs mappings.
+ *
+ *	fch_iflags controls the output of the call, whereas fch_oflags reports
+ *	on the overall record output.  fch_count should be set to the length
+ *	of the fch_recs array, and fch_entries will be set to the number of
+ *	entries filled out during each call.  If fch_count is zero, the number
+ *	of refcount mappings will be returned in fch_entries, though no
+ *	mappings will be returned.  fch_reserved must be set to zero.
+ *
+ *	The two elements in the fch_keys array are used to constrain the
+ *	output.  The first element in the array should represent the lowest
+ *	disk mapping ("low key") that the user wants to learn about.  If this
+ *	value is all zeroes, the filesystem will return the first entry it
+ *	knows about.  For a subsequent call, the contents of
+ *	fsrefs_head.fch_recs[fsrefs_head.fch_entries - 1] should be copied into
+ *	fch_keys[0] to have the kernel start where it left off.
+ *
+ *	The second element in the fch_keys array should represent the highest
+ *	disk mapping ("high key") that the user wants to learn about.  If this
+ *	value is all ones, the filesystem will not stop until it runs out of
+ *	mappings to return or runs out of space in fch_recs.
+ *
+ *	fcr_device can be either a 32-bit cookie representing a device, or a
+ *	32-bit dev_t if the FCH_OF_DEV_T flag is set.  fcr_physical and
+ *	fcr_length are expressed in units of bytes.  fcr_owners is the number
+ *	of owners.
+ */
+struct xfs_getfsrefs {
+	__u32		fcr_device;	/* device id */
+	__u32		fcr_flags;	/* mapping flags */
+	__u64		fcr_physical;	/* device offset of segment */
+	__u64		fcr_owners;	/* number of owners */
+	__u64		fcr_length;	/* length of segment */
+	__u64		fcr_reserved[4];	/* must be zero */
+};
+
+struct xfs_getfsrefs_head {
+	__u32		fch_iflags;	/* control flags */
+	__u32		fch_oflags;	/* output flags */
+	__u32		fch_count;	/* # of entries in array incl. input */
+	__u32		fch_entries;	/* # of entries filled in (output). */
+	__u64		fch_reserved[6];	/* must be zero */
+
+	struct xfs_getfsrefs	fch_keys[2];	/* low and high keys for the mapping search */
+	struct xfs_getfsrefs	fch_recs[];	/* returned records */
+};
+
+/* Size of an fsrefs_head with room for nr records. */
+static inline unsigned long long
+xfs_getfsrefs_sizeof(
+	unsigned int	nr)
+{
+	return sizeof(struct xfs_getfsrefs_head) +
+		(nr * sizeof(struct xfs_getfsrefs));
+}
+
+/* Start the next fsrefs query at the end of the current query results. */
+static inline void
+xfs_getfsrefs_advance(
+	struct xfs_getfsrefs_head	*head)
+{
+	head->fch_keys[0] = head->fch_recs[head->fch_entries - 1];
+}
+
+/* fch_iflags values - set by XFS_IOC_GETFSREFCOUNTS caller in the header. */
+#define FCH_IF_VALID		0
+
+/* fch_oflags values - returned in the header segment only. */
+#define FCH_OF_DEV_T		(1U << 0) /* fcr_device values will be dev_t */
+
+/* fcr_flags values - returned for each non-header segment */
+#define FCR_OF_LAST		(1U << 0) /* last record in the dataset */
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1047,6 +1126,7 @@ struct xfs_rtgroup_geometry {
 #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
+#define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/man/man2/ioctl_xfs_getfsrefcounts.2 b/man/man2/ioctl_xfs_getfsrefcounts.2
new file mode 100644
index 00000000000000..9a5e7273fcacdd
--- /dev/null
+++ b/man/man2/ioctl_xfs_getfsrefcounts.2
@@ -0,0 +1,237 @@
+.\" Copyright (c) 2021-2025 Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" This is free documentation; you can redistribute it and/or
+.\" modify it under the terms of the GNU General Public License as
+.\" published by the Free Software Foundation; either version 2 of
+.\" the License, or (at your option) any later version.
+.\"
+.\" The GNU General Public License's references to "object code"
+.\" and "executables" are to be interpreted as the output of any
+.\" document formatting or typesetting system, including
+.\" intermediate and printed output.
+.\"
+.\" This manual is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-GETFSREFCOUNTS 2 2023-05-08 "XFS"
+.SH NAME
+ioctl_xfs_getfsrefcounts \- retrieve the number of owners of space in the filesystem
+.SH SYNOPSIS
+.nf
+.B #include <sys/ioctl.h>
+.PP
+.BI "int ioctl(int " fd ", XFS_IOC_GETFSREFCOUNTS, struct xfs_getfsrefs_head * " arg );
+.fi
+.SH DESCRIPTION
+This
+.BR ioctl (2)
+operation retrieves the number of owners for space extents in a filesystem.
+This information can be used to discover the sharing factor of physical media,
+among other things.
+.PP
+The sole argument to this operation should be a pointer to a single
+.IR "struct xfs_getfsrefs_head" ":"
+.PP
+.in +4n
+.EX
+struct xfs_getfsrefs {
+    __u32 fcr_device;      /* Device ID */
+    __u32 fcr_flags;       /* Mapping flags */
+    __u64 fcr_physical;    /* Device offset of segment */
+    __u64 fcr_owners;      /* Number of Owners */
+    __u64 fcr_length;      /* Length of segment */
+    __u64 fcr_reserved[4]; /* Must be zero */
+};
+
+struct xfs_getfsrefs_head {
+    __u32 fch_iflags;       /* Control flags */
+    __u32 fch_oflags;       /* Output flags */
+    __u32 fch_count;        /* # of entries in array incl. input */
+    __u32 fch_entries;      /* # of entries filled in (output) */
+    __u64 fch_reserved[6];  /* Must be zero */
+
+    struct xfs_getfsrefs fch_keys[2];  /* Low and high keys for
+                                  the mapping search */
+    struct xfs_getfsrefs fch_recs[];   /* Returned records */
+};
+.EE
+.in
+.PP
+The two
+.I fch_keys
+array elements specify the lowest and highest reference count
+key for which the application would like physical mapping
+information.
+A reference count key consists of the tuple (device, physical, flags),
+matching the fields of
+.I struct xfs_getfsrefs
+that identify a record.
+.PP
+Filesystem mappings are copied into the
+.I fch_recs
+array, which immediately follows the header data.
+.\"
+.SS Fields of struct xfs_getfsrefs_head
+The
+.I fch_iflags
+field is a bit mask passed to the kernel to alter the output.
+No flags are currently defined, so the caller must set this value to zero.
+.PP
+The
+.I fch_oflags
+field is a bit mask of flags set by the kernel concerning the returned mappings.
+If
+.B FCH_OF_DEV_T
+is set, then the
+.I fcr_device
+field represents a
+.I dev_t
+structure containing the major and minor numbers of the block device.
+.PP
+The
+.I fch_count
+field contains the number of elements in the array being passed to the
+kernel.
+If this value is 0,
+.I fch_entries
+will be set to the number of records that would have been returned had
+the array been large enough;
+no mapping information will be returned.
+.PP
+The
+.I fch_entries
+field contains the number of elements in the
+.I fch_recs
+array that contain useful information.
+.PP
+The
+.I fch_reserved
+fields must be set to zero.
+.\"
+.SS Keys
+The two key records in
+.I fsrefs_head.fch_keys
+specify the lowest and highest extent records in the keyspace that the caller
+wants returned.
+The tuple
+.RI "(" "device" ", " "physical" ", " "flags" ")"
+can be used to index any filesystem space record.
+The format of
+.I fcr_device
+in the keys must match the format of the same field in the output records,
+as defined below.
+By convention, the field
+.I fsrefs_head.fch_keys[0]
+must contain the low key and
+.I fsrefs_head.fch_keys[1]
+must contain the high key for the request.
+.PP
+For convenience, if
+.I fcr_length
+is set in the low key, it will be added to
+.I fcr_physical
+as appropriate.
+The caller can take advantage of this subtlety to set up subsequent calls
+by copying
+.I fsrefs_head.fch_recs[fsrefs_head.fch_entries \- 1]
+into the low key.
+The function
+.I xfs_getfsrefs_advance
+(defined in
+.IR xfs/xfs_fs.h )
+provides this functionality.
+.\"
+.SS Fields of struct xfs_getfsrefs
+The
+.I fcr_device
+field uniquely identifies the underlying storage device.
+If the
+.B FCH_OF_DEV_T
+flag is set in the header's
+.I fch_oflags
+field, this field contains a
+.I dev_t
+from which major and minor numbers can be extracted.
+If the flag is not set, this field contains a value that must be unique
+for each unique storage device.
+.PP
+The
+.I fcr_physical
+field contains the disk address of the extent in bytes.
+.PP
+The
+.I fcr_owners
+field contains the number of owners of this extent.
+The actual owners can be queried with the
+.BR FS_IOC_GETFSMAP (2)
+ioctl.
+.PP
+The
+.I fcr_length
+field contains the length of the extent in bytes.
+.PP
+The
+.I fcr_flags
+field is a bit mask of extent state flags.
+The bits are:
+.RS 0.4i
+.TP
+.B FCR_OF_LAST
+This is the last record in the data set.
+.RE
+.PP
+The
+.I fcr_reserved
+field will be set to zero.
+.\"
+.RE
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+The error placed in
+.I errno
+can be one of, but is not limited to, the following:
+.TP
+.B EBADF
+.IR fd
+is not open for reading.
+.TP
+.B EBADMSG
+The filesystem has detected a checksum error in the metadata.
+.TP
+.B EFAULT
+The pointer passed in was not mapped to a valid memory address.
+.TP
+.B EINVAL
+The array is not long enough, the keys do not point to a valid part of
+the filesystem, the low key points to a higher point in the filesystem's
+physical storage address space than the high key, or a nonzero value
+was passed in one of the fields that must be zero.
+.TP
+.B ENOMEM
+Insufficient memory to process the request.
+.TP
+.B EOPNOTSUPP
+The filesystem does not support this command.
+.TP
+.B EUCLEAN
+The filesystem metadata is corrupt and needs repair.
+.SH CONFORMING TO
+This API is XFS-specific.
+.SH EXAMPLES
+See
+.I io/fsrefcounts.c
+in the
+.I xfsprogs
+distribution for a sample program.
+.SH SEE ALSO
+.BR ioctl (2)



* [PATCH 2/2] xfs_io: dump reference count information
  2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
  2024-12-31 23:44   ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong
@ 2024-12-31 23:44   ` Darrick J. Wong
  1 sibling, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Dump refcount info from the kernel so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/Makefile       |    1 
 io/fsrefcounts.c  |  476 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 io/io.h           |    1 
 man/man8/xfs_io.8 |   88 ++++++++++
 5 files changed, 567 insertions(+)
 create mode 100644 io/fsrefcounts.c


diff --git a/io/Makefile b/io/Makefile
index 8f835ec71fd768..c57594b090f70c 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -22,6 +22,7 @@ CFILES = \
 	file.c \
 	freeze.c \
 	fsproperties.c \
+	fsrefcounts.c \
 	fsuuid.c \
 	fsync.c \
 	getrusage.c \
diff --git a/io/fsrefcounts.c b/io/fsrefcounts.c
new file mode 100644
index 00000000000000..ad1f26dfde3ec3
--- /dev/null
+++ b/io/fsrefcounts.c
@@ -0,0 +1,476 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "libfrog/paths.h"
+#include "io.h"
+#include "input.h"
+#include "libfrog/fsgeom.h"
+
+static cmdinfo_t	fsrefcounts_cmd;
+static dev_t		xfs_data_dev;
+
+static void
+fsrefcounts_help(void)
+{
+	printf(_(
+"\n"
+" Prints extent owner counts for the filesystem hosting the current file"
+"\n"
+" fsrefcounts prints the number of owners of disk blocks used by the whole\n"
+" filesystem.  Each record describes one physical extent and the number of\n"
+" owners sharing it.\n"
+"\n"
+" By default, each line of the listing takes the following form:\n"
+"     extent: major:minor [startblock..endblock]: owners length\n"
+" All the disk blocks and lengths are in units of 512-byte blocks.\n"
+" -d -- query only the data device (default).\n"
+" -l -- query only the log device.\n"
+" -r -- query only the realtime device.\n"
+" -n -- query n extents at a time.\n"
+" -o -- only print extents with at least this many owners (default 1).\n"
+" -O -- only print extents with no more than this many owners (default 2^64-1).\n"
+" -m -- output machine-readable format.\n"
+" -v -- Verbose information, show AG and offsets.  Show flags legend on 2nd -v\n"
+"\n"
+"The optional start and end arguments require one of -d, -l, or -r to be set.\n"
+"\n"));
+}
+
+static void
+dump_refcounts(
+	unsigned long long		*nr,
+	const unsigned long long	min_owners,
+	const unsigned long long	max_owners,
+	struct xfs_getfsrefs_head	*head)
+{
+	unsigned long long		i;
+	struct xfs_getfsrefs		*p;
+
+	for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) {
+		if (p->fcr_owners < min_owners || p->fcr_owners > max_owners)
+			continue;
+		printf("\t%llu: %u:%u [%lld..%lld]: ", i + (*nr),
+			major(p->fcr_device), minor(p->fcr_device),
+			(long long)BTOBBT(p->fcr_physical),
+			(long long)BTOBBT(p->fcr_physical + p->fcr_length - 1));
+		printf(_("%llu %lld\n"),
+			(unsigned long long)p->fcr_owners,
+			(long long)BTOBBT(p->fcr_length));
+	}
+
+	(*nr) += head->fch_entries;
+}
+
+static void
+dump_refcounts_machine(
+	unsigned long long		*nr,
+	const unsigned long long	min_owners,
+	const unsigned long long	max_owners,
+	struct xfs_getfsrefs_head	*head)
+{
+	unsigned long long		i;
+	struct xfs_getfsrefs		*p;
+
+	if (*nr == 0)
+		printf(_("EXT,MAJOR,MINOR,PSTART,PEND,OWNERS,LENGTH\n"));
+	for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) {
+		if (p->fcr_owners < min_owners || p->fcr_owners > max_owners)
+			continue;
+		printf("%llu,%u,%u,%lld,%lld,", i + (*nr),
+			major(p->fcr_device), minor(p->fcr_device),
+			(long long)BTOBBT(p->fcr_physical),
+			(long long)BTOBBT(p->fcr_physical + p->fcr_length - 1));
+		printf("%llu,%lld\n",
+			(unsigned long long)p->fcr_owners,
+			(long long)BTOBBT(p->fcr_length));
+	}
+
+	(*nr) += head->fch_entries;
+}
+
+/*
+ * Verbose mode displays:
+ *   extent: major:minor [startblock..endblock]: owners \
+ *	ag# (agoffset..agendoffset) totalbbs flags
+ */
+#define MINRANGE_WIDTH	16
+#define MINAG_WIDTH	2
+#define MINTOT_WIDTH	5
+#define NFLG		4	/* count of flags */
+#define	FLG_NULL	00000	/* Null flag */
+#define	FLG_BSU		01000	/* Not on begin of stripe unit  */
+#define	FLG_ESU		00100	/* Not on end   of stripe unit  */
+#define	FLG_BSW		00010	/* Not on begin of stripe width */
+#define	FLG_ESW		00001	/* Not on end   of stripe width */
+static void
+dump_refcounts_verbose(
+	unsigned long long		*nr,
+	const unsigned long long	min_owners,
+	const unsigned long long	max_owners,
+	struct xfs_getfsrefs_head	*head,
+	bool				*dumped_flags,
+	struct xfs_fsop_geom		*fsgeo)
+{
+	unsigned long long		i;
+	struct xfs_getfsrefs		*p;
+	int				agno;
+	off_t				agoff, bperag;
+	int				boff_w, aoff_w, tot_w, agno_w, own_w;
+	int				nr_w, dev_w;
+	char				bbuf[40], abuf[40], obuf[40];
+	char				nbuf[40], dbuf[40], gbuf[40];
+	int				sunit, swidth;
+	int				flg = 0;
+
+	boff_w = aoff_w = own_w = MINRANGE_WIDTH;
+	dev_w = 3;
+	nr_w = 4;
+	tot_w = MINTOT_WIDTH;
+	bperag = (off_t)fsgeo->agblocks * (off_t)fsgeo->blocksize;
+	sunit = (fsgeo->sunit * fsgeo->blocksize);
+	swidth = (fsgeo->swidth * fsgeo->blocksize);
+
+	/*
+	 * Go through the extents and figure out the width
+	 * needed for all columns.
+	 */
+	for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) {
+		if (p->fcr_owners < min_owners || p->fcr_owners > max_owners)
+			continue;
+		if (sunit &&
+		    (p->fcr_physical  % sunit != 0 ||
+		     ((p->fcr_physical + p->fcr_length) % sunit) != 0 ||
+		     p->fcr_physical % swidth != 0 ||
+		     ((p->fcr_physical + p->fcr_length) % swidth) != 0))
+			flg = 1;
+		if (flg)
+			*dumped_flags = true;
+		snprintf(nbuf, sizeof(nbuf), "%llu", (*nr) + i);
+		nr_w = max(nr_w, strlen(nbuf));
+		if (head->fch_oflags & FCH_OF_DEV_T)
+			snprintf(dbuf, sizeof(dbuf), "%u:%u",
+				major(p->fcr_device),
+				minor(p->fcr_device));
+		else
+			snprintf(dbuf, sizeof(dbuf), "0x%x", p->fcr_device);
+		dev_w = max(dev_w, strlen(dbuf));
+		snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:",
+			(long long)BTOBBT(p->fcr_physical),
+			(long long)BTOBBT(p->fcr_physical + p->fcr_length - 1));
+		boff_w = max(boff_w, strlen(bbuf));
+		snprintf(obuf, sizeof(obuf), "%llu",
+				(unsigned long long)p->fcr_owners);
+		own_w = max(own_w, strlen(obuf));
+		if (p->fcr_device == xfs_data_dev) {
+			agno = p->fcr_physical / bperag;
+			agoff = p->fcr_physical - (agno * bperag);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
+		} else
+			abuf[0] = 0;
+		aoff_w = max(aoff_w, strlen(abuf));
+		tot_w = max(tot_w,
+			numlen(BTOBBT(p->fcr_length), 10));
+	}
+	agno_w = max(MINAG_WIDTH, numlen(fsgeo->agcount, 10));
+	if (*nr == 0)
+		printf("%*s: %-*s %-*s %-*s %*s %-*s %*s%s\n",
+			nr_w, _("EXT"),
+			dev_w, _("DEV"),
+			boff_w, _("BLOCK-RANGE"),
+			own_w, _("OWNERS"),
+			agno_w, _("AG"),
+			aoff_w, _("AG-OFFSET"),
+			tot_w, _("TOTAL"),
+			flg ? _(" FLAGS") : "");
+	for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) {
+		if (p->fcr_owners < min_owners || p->fcr_owners > max_owners)
+			continue;
+		flg = FLG_NULL;
+		/*
+		 * If striping enabled, determine if extent starts/ends
+		 * on a stripe unit boundary.
+		 */
+		if (sunit) {
+			if (p->fcr_physical  % sunit != 0)
+				flg |= FLG_BSU;
+			if (((p->fcr_physical +
+			      p->fcr_length ) % sunit ) != 0)
+				flg |= FLG_ESU;
+			if (p->fcr_physical % swidth != 0)
+				flg |= FLG_BSW;
+			if (((p->fcr_physical +
+			      p->fcr_length ) % swidth ) != 0)
+				flg |= FLG_ESW;
+		}
+		if (head->fch_oflags & FCH_OF_DEV_T)
+			snprintf(dbuf, sizeof(dbuf), "%u:%u",
+				major(p->fcr_device),
+				minor(p->fcr_device));
+		else
+			snprintf(dbuf, sizeof(dbuf), "0x%x", p->fcr_device);
+		snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:",
+			(long long)BTOBBT(p->fcr_physical),
+			(long long)BTOBBT(p->fcr_physical + p->fcr_length - 1));
+		snprintf(obuf, sizeof(obuf), "%llu",
+			(unsigned long long)p->fcr_owners);
+		if (p->fcr_device == xfs_data_dev) {
+			agno = p->fcr_physical / bperag;
+			agoff = p->fcr_physical - (agno * bperag);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
+			snprintf(gbuf, sizeof(gbuf),
+				"%lld",
+				(long long)agno);
+		} else {
+			abuf[0] = 0;
+			gbuf[0] = 0;
+		}
+		printf("%*llu: %-*s %-*s %-*s %-*s %-*s %*lld",
+			nr_w, (*nr) + i, dev_w, dbuf, boff_w, bbuf, own_w,
+			obuf, agno_w, gbuf, aoff_w, abuf, tot_w,
+			(long long)BTOBBT(p->fcr_length));
+		if (flg == FLG_NULL)
+			printf("\n");
+		else
+			printf(" %-*.*o\n", NFLG, NFLG, flg);
+	}
+
+	(*nr) += head->fch_entries;
+}
+
+static void
+dump_verbose_key(void)
+{
+	printf(_(" FLAG Values:\n"));
+	printf(_("    %*.*o Doesn't begin on stripe unit\n"),
+		NFLG+1, NFLG+1, FLG_BSU);
+	printf(_("    %*.*o Doesn't end   on stripe unit\n"),
+		NFLG+1, NFLG+1, FLG_ESU);
+	printf(_("    %*.*o Doesn't begin on stripe width\n"),
+		NFLG+1, NFLG+1, FLG_BSW);
+	printf(_("    %*.*o Doesn't end   on stripe width\n"),
+		NFLG+1, NFLG+1, FLG_ESW);
+}
+
+static int
+fsrefcounts_f(
+	int			argc,
+	char			**argv)
+{
+	struct xfs_getfsrefs		*p;
+	struct xfs_getfsrefs_head	*head;
+	struct xfs_getfsrefs		*l, *h;
+	struct xfs_fsop_geom	fsgeo;
+	long long		start = 0;
+	long long		end = -1;
+	unsigned long long	min_owners = 1;
+	unsigned long long	max_owners = ULLONG_MAX;
+	int			map_size;
+	int			nflag = 0;
+	int			vflag = 0;
+	int			mflag = 0;
+	int			i = 0;
+	int			c;
+	unsigned long long	nr = 0;
+	size_t			fsblocksize, fssectsize;
+	struct fs_path		*fs;
+	static bool		tab_init;
+	bool			dumped_flags = false;
+	int			dflag, lflag, rflag;
+
+	init_cvtnum(&fsblocksize, &fssectsize);
+
+	dflag = lflag = rflag = 0;
+	while ((c = getopt(argc, argv, "dlmn:o:O:rv")) != EOF) {
+		switch (c) {
+		case 'd':	/* data device */
+			dflag = 1;
+			break;
+		case 'l':	/* log device */
+			lflag = 1;
+			break;
+		case 'm':	/* machine readable format */
+			mflag++;
+			break;
+		case 'n':	/* number of extents specified */
+			nflag = cvt_u32(optarg, 10);
+			if (errno)
+				return command_usage(&fsrefcounts_cmd);
+			break;
+		case 'o':	/* minimum owners */
+			min_owners = cvt_u64(optarg, 10);
+			if (errno)
+				return command_usage(&fsrefcounts_cmd);
+			if (min_owners < 1) {
+				fprintf(stderr,
+		_("min_owners must be greater than zero.\n"));
+				exitcode = 1;
+				return 0;
+			}
+			break;
+		case 'O':	/* maximum owners */
+			max_owners = cvt_u64(optarg, 10);
+			if (errno)
+				return command_usage(&fsrefcounts_cmd);
+			if (max_owners < 1) {
+				fprintf(stderr,
+		_("max_owners must be greater than zero.\n"));
+				exitcode = 1;
+				return 0;
+			}
+			break;
+		case 'r':	/* rt device */
+			rflag = 1;
+			break;
+		case 'v':	/* Verbose output */
+			vflag++;
+			break;
+		default:
+			exitcode = 1;
+			return command_usage(&fsrefcounts_cmd);
+		}
+	}
+
+	if ((dflag + lflag + rflag > 1) || (mflag > 0 && vflag > 0) ||
+	    (argc > optind && dflag + lflag + rflag == 0)) {
+		exitcode = 1;
+		return command_usage(&fsrefcounts_cmd);
+	}
+
+	if (argc > optind) {
+		start = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (start < 0) {
+			fprintf(stderr,
+				_("Bad refcount start_bblock %s.\n"),
+				argv[optind]);
+			exitcode = 1;
+			return 0;
+		}
+		start <<= BBSHIFT;
+	}
+
+	if (argc > optind + 1) {
+		end = cvtnum(fsblocksize, fssectsize, argv[optind + 1]);
+		if (end < 0) {
+			fprintf(stderr,
+				_("Bad refcount end_bblock %s.\n"),
+				argv[optind + 1]);
+			exitcode = 1;
+			return 0;
+		}
+		end <<= BBSHIFT;
+	}
+
+	if (vflag) {
+		c = -xfrog_geometry(file->fd, &fsgeo);
+		if (c) {
+			fprintf(stderr,
+				_("%s: can't get geometry [\"%s\"]: %s\n"),
+				progname, file->name, strerror(c));
+			exitcode = 1;
+			return 0;
+		}
+	}
+
+	map_size = nflag ? nflag : 131072 / sizeof(struct xfs_getfsrefs);
+	head = malloc(xfs_getfsrefs_sizeof(map_size));
+	if (head == NULL) {
+		fprintf(stderr, _("%s: malloc of %llu bytes failed.\n"),
+				progname,
+				(unsigned long long)xfs_getfsrefs_sizeof(map_size));
+		exitcode = 1;
+		return 0;
+	}
+
+	memset(head, 0, sizeof(*head));
+	l = head->fch_keys;
+	h = head->fch_keys + 1;
+	if (dflag) {
+		l->fcr_device = h->fcr_device = file->fs_path.fs_datadev;
+	} else if (lflag) {
+		l->fcr_device = h->fcr_device = file->fs_path.fs_logdev;
+	} else if (rflag) {
+		l->fcr_device = h->fcr_device = file->fs_path.fs_rtdev;
+	} else {
+		l->fcr_device = 0;
+		h->fcr_device = UINT_MAX;
+	}
+	l->fcr_physical = start;
+	h->fcr_physical = end;
+	h->fcr_owners = ULLONG_MAX;
+	h->fcr_flags = UINT_MAX;
+
+	/*
+	 * If this is an XFS filesystem, remember the data device.
+	 * (We report AG number/block for data device extents on XFS).
+	 */
+	if (!tab_init) {
+		fs_table_initialise(0, NULL, 0, NULL);
+		tab_init = true;
+	}
+	fs = fs_table_lookup(file->name, FS_MOUNT_POINT);
+	xfs_data_dev = fs ? fs->fs_datadev : 0;
+
+	head->fch_count = map_size;
+	do {
+		/* Get some extents */
+		i = ioctl(file->fd, XFS_IOC_GETFSREFCOUNTS, head);
+		if (i < 0) {
+			fprintf(stderr, _("%s: xfsctl(XFS_IOC_GETFSREFCOUNTS)"
+				" iflags=0x%x [\"%s\"]: %s\n"),
+				progname, head->fch_iflags, file->name,
+				strerror(errno));
+			free(head);
+			exitcode = 1;
+			return 0;
+		}
+
+		if (head->fch_entries == 0)
+			break;
+
+		if (vflag)
+			dump_refcounts_verbose(&nr, min_owners, max_owners,
+					head, &dumped_flags, &fsgeo);
+		else if (mflag)
+			dump_refcounts_machine(&nr, min_owners, max_owners,
+					head);
+		else
+			dump_refcounts(&nr, min_owners, max_owners, head);
+
+		p = &head->fch_recs[head->fch_entries - 1];
+		if (p->fcr_flags & FCR_OF_LAST)
+			break;
+		xfs_getfsrefs_advance(head);
+	} while (true);
+
+	if (dumped_flags)
+		dump_verbose_key();
+
+	free(head);
+	return 0;
+}
+
+void
+fsrefcounts_init(void)
+{
+	fsrefcounts_cmd.name = "fsrefcounts";
+	fsrefcounts_cmd.cfunc = fsrefcounts_f;
+	fsrefcounts_cmd.argmin = 0;
+	fsrefcounts_cmd.argmax = -1;
+	fsrefcounts_cmd.flags = CMD_NOMAP_OK | CMD_FLAG_FOREIGN_OK;
+	fsrefcounts_cmd.args = _("[-d|-l|-r] [-m|-v] [-n nx] [-o min_owners] [-O max_owners] [start] [end]");
+	fsrefcounts_cmd.oneline = _("print filesystem owner counts for a range of blocks");
+	fsrefcounts_cmd.help = fsrefcounts_help;
+
+	add_command(&fsrefcounts_cmd);
+}
diff --git a/io/init.c b/io/init.c
index 4831deae1b2683..17b772813bc113 100644
--- a/io/init.c
+++ b/io/init.c
@@ -58,6 +58,7 @@ init_commands(void)
 	freeze_init();
 	fsmap_init();
 	fsuuid_init();
+	fsrefcounts_init();
 	fsync_init();
 	getrusage_init();
 	help_init();
diff --git a/io/io.h b/io/io.h
index d99065582057de..7ae7cf90ace323 100644
--- a/io/io.h
+++ b/io/io.h
@@ -156,3 +156,4 @@ extern void		bulkstat_init(void);
 void			exchangerange_init(void);
 void			fsprops_init(void);
 void			aginfo_init(void);
+void			fsrefcounts_init(void);
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index a42ab61a0de422..37ad497c771051 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1325,6 +1325,94 @@ .SH FILESYSTEM COMMANDS
 .B thaw
 Undo the effects of a filesystem freeze operation.
 Only available in expert mode and requires privileges.
+
+.TP
+.BI "fsrefcounts [ \-d | \-l | \-r ] [ \-m | \-v ] [ \-n " nx " ] [ \-o " min_owners " ] [ \-O " max_owners " ] [ " start " ] [ " end " ]
+Prints the number of owners of disk extents used by the filesystem hosting the
+current file.
+The listing does not include free blocks.
+Each line of the listing takes the following form:
+.PP
+.RS
+.IR extent ": " major ":" minor " [" startblock .. endblock "]: " owners " " length
+.PP
+All blocks, offsets, and lengths are specified in units of 512-byte
+blocks, no matter what the filesystem's block size is.
+The optional
+.I start
+and
+.I end
+arguments can be used to constrain the output to a particular range of
+disk blocks.
+If these two options are specified, exactly one of
+.BR "-d" ", " "-l" ", or " "-r"
+must also be set.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI \-d
+Display only extents from the data device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-l
+Display only extents from the external log device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-r
+Display only extents from the realtime device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-m
+Display results in a machine readable format (CSV).
+This option is not compatible with the
+.B \-v
+flag.
+The columns of the output are: extent number, device major, device minor,
+physical start, physical end, number of owners, length.
+The start, end, and length numbers are provided in units of 512 bytes.
+
+.TP
+.BI \-n " num_extents"
+If this option is given,
+.B fsrefcounts
+obtains the extent list of the file in groups of
+.I num_extents
+extents.
+In the absence of
+.BR "-n" ", " "fsrefcounts"
+queries the system for extents in groups of as many records as fit in 131,072 bytes.
+.TP
+.BI \-o " min_owners"
+Only print extents having at least this many owners.
+This argument must be in the range 1 to 2^64-1.
+The default value is 1.
+.TP
+.BI \-O " max_owners"
+Only print extents having this many or fewer owners.
+This argument must be in the range 1 to 2^64-1.
+There is no limit by default.
+.TP
+.B \-v
+Shows verbose information.
+When this flag is specified, additional AG specific information is
+appended to each line in the following form:
+.IP
+.RS 1.2i
+.IR agno " (" startagblock .. endagblock ") " nblocks " " flags
+.RE
+.IP
+A second
+.B \-v
+option will print out the
+.I flags
+legend.
+This option is not compatible with the
+.B \-m
+flag.
+.RE
+.PD
+
 .TP
 .BI "inject [ " tag " ]"
 Inject errors into a filesystem to observe filesystem behavior at
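
As an aside on the fsrefcounts man page text above: the start and end
arguments are given in 512-byte blocks, and -o/-O bound the owner count
of the records that get printed.  The sketch below is only a toy model
of those two rules.  struct ref_rec, bb_to_bytes() and
owners_in_range() are hypothetical names made up for illustration and
are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BB_SHIFT 9	/* 512-byte basic blocks, mirroring BBSHIFT */

/* Hypothetical record, loosely modeled on one fsrefcounts line. */
struct ref_rec {
	uint64_t physical;	/* start of the extent, in bytes */
	uint64_t length;	/* length of the extent, in bytes */
	uint64_t owners;	/* number of owners of the extent */
};

/* Convert a start/end argument given in 512-byte blocks to bytes. */
static inline uint64_t
bb_to_bytes(uint64_t bblocks)
{
	return bblocks << BB_SHIFT;
}

/* Would this record be printed for the given -o/-O limits? */
static inline bool
owners_in_range(const struct ref_rec *rec, uint64_t min_owners,
		uint64_t max_owners)
{
	/* The man page says both limits must be at least 1. */
	assert(min_owners >= 1 && max_owners >= 1);
	return rec->owners >= min_owners && rec->owners <= max_owners;
}
```

With no -O given, the upper limit is effectively UINT64_MAX, so only the
lower bound filters anything.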



* [PATCHSET 3/5] xfsprogs: defragment free space
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (6 preceding siblings ...)
  2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
                     ` (10 more replies)
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                   ` (7 subsequent siblings)
  15 siblings, 11 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: dchinner, linux-xfs

Hi all,

These patches contain experimental code to enable userspace to defragment
the free space in a filesystem.  Two purposes are imagined for this
functionality: clearing space at the end of a filesystem before
shrinking it, and clearing free space in anticipation of making a large
allocation.

The first patch adds a new ioctl, XFS_IOC_MAP_FREESP, that allows
userspace to map free space from the filesystem into a file.  The goal
here is to allow the filesystem shrink process to prevent allocation
from a certain part of the filesystem while a free space defragmenter
evacuates all the files from the doomed part of the filesystem.

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

This patchset also includes a separate inode migration tool as
prototyped by Dave Chinner in 2020.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs_io: display rtgroup number in verbose fsrefs output
 * xfs: add an ioctl to map free space into a file
 * xfs_io: support using XFS_IOC_MAP_FREESP to map free space
 * xfs_db: get and put blocks on the AGFL
 * xfs_spaceman: implement clearing free space
 * spaceman: physically move a regular inode
 * spaceman: find owners of space in an AG
 * xfs_spaceman: wrap radix tree accesses in find_owner.c
 * xfs_spaceman: port relocation structure to 32-bit systems
 * spaceman: relocate the contents of an AG
 * spaceman: move inodes with hardlinks
---
 configure.ac                    |    1 
 db/agfl.c                       |  297 +++-
 include/builddefs.in            |    1 
 include/xfs_trace.h             |    4 
 io/fsrefcounts.c                |   22 
 io/prealloc.c                   |   35 
 libfrog/Makefile                |    5 
 libfrog/clearspace.c            | 3294 +++++++++++++++++++++++++++++++++++++++
 libfrog/clearspace.h            |   79 +
 libfrog/fsgeom.h                |   29 
 libfrog/radix-tree.c            |    2 
 libfrog/radix-tree.h            |    2 
 libxfs/libxfs_api_defs.h        |    4 
 libxfs/libxfs_priv.h            |    9 
 libxfs/xfs_alloc.c              |   88 +
 libxfs/xfs_alloc.h              |    3 
 libxfs/xfs_fs.h                 |   14 
 m4/package_libcdev.m4           |   20 
 man/man2/ioctl_xfs_map_freesp.2 |   76 +
 man/man8/xfs_db.8               |   11 
 man/man8/xfs_io.8               |    8 
 man/man8/xfs_spaceman.8         |   40 
 spaceman/Makefile               |   11 
 spaceman/clearfree.c            |  171 ++
 spaceman/find_owner.c           |  442 +++++
 spaceman/init.c                 |    7 
 spaceman/move_inode.c           |  662 ++++++++
 spaceman/relocation.c           |  566 +++++++
 spaceman/relocation.h           |   53 +
 spaceman/space.h                |    6 
 30 files changed, 5953 insertions(+), 9 deletions(-)
 create mode 100644 libfrog/clearspace.c
 create mode 100644 libfrog/clearspace.h
 create mode 100644 man/man2/ioctl_xfs_map_freesp.2
 create mode 100644 spaceman/clearfree.c
 create mode 100644 spaceman/find_owner.c
 create mode 100644 spaceman/move_inode.c
 create mode 100644 spaceman/relocation.c
 create mode 100644 spaceman/relocation.h



* [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
@ 2024-12-31 23:45   ` Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Display the rtgroup number in the verbose fsrefcounts output.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/fsrefcounts.c |   22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
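
For readers skimming the diff below, this is roughly the arithmetic the
verbose output applies to each record: divide the physical byte address
by the group size to get the group number, then report the remainder in
512-byte blocks.  It is a standalone sketch, not the patch's code;
struct group_pos and locate_in_group() are hypothetical names, and the
shift by 9 stands in for the truncating BTOBBT() conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Position of an extent inside its allocation group or rtgroup. */
struct group_pos {
	uint64_t group;		/* group number */
	uint64_t first_bb;	/* first 512-byte block within the group */
	uint64_t last_bb;	/* last 512-byte block within the group */
};

static inline struct group_pos
locate_in_group(uint64_t physical, uint64_t length, uint64_t bytes_per_group)
{
	struct group_pos pos;
	uint64_t off;

	assert(bytes_per_group > 0 && length > 0);

	/* Which group does the extent start in? */
	pos.group = physical / bytes_per_group;

	/* Byte offset of the extent from the start of that group. */
	off = physical - pos.group * bytes_per_group;

	/* Truncating byte to 512-byte-block conversion. */
	pos.first_bb = off >> 9;
	pos.last_bb = (off + length - 1) >> 9;
	return pos;
}
```

For example, with 1 MiB groups, an 8192-byte extent at byte address
3 MiB + 4096 lands in group 3 and covers blocks 8 through 23 of it.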


diff --git a/io/fsrefcounts.c b/io/fsrefcounts.c
index ad1f26dfde3ec3..9127f536da382e 100644
--- a/io/fsrefcounts.c
+++ b/io/fsrefcounts.c
@@ -13,6 +13,7 @@
 
 static cmdinfo_t	fsrefcounts_cmd;
 static dev_t		xfs_data_dev;
+static dev_t		xfs_rt_dev;
 
 static void
 fsrefcounts_help(void)
@@ -119,7 +120,7 @@ dump_refcounts_verbose(
 	unsigned long long		i;
 	struct xfs_getfsrefs		*p;
 	int				agno;
-	off_t				agoff, bperag;
+	off_t				agoff, bperag, bperrtg;
 	int				boff_w, aoff_w, tot_w, agno_w, own_w;
 	int				nr_w, dev_w;
 	char				bbuf[40], abuf[40], obuf[40];
@@ -132,6 +133,7 @@ dump_refcounts_verbose(
 	nr_w = 4;
 	tot_w = MINTOT_WIDTH;
 	bperag = (off_t)fsgeo->agblocks * (off_t)fsgeo->blocksize;
+	bperrtg = bytes_per_rtgroup(fsgeo);
 	sunit = (fsgeo->sunit * fsgeo->blocksize);
 	swidth = (fsgeo->swidth * fsgeo->blocksize);
 
@@ -173,6 +175,13 @@ dump_refcounts_verbose(
 				"(%lld..%lld)",
 				(long long)BTOBBT(agoff),
 				(long long)BTOBBT(agoff + p->fcr_length - 1));
+		} else if (p->fcr_device == xfs_rt_dev && fsgeo->rgcount > 0) {
+			agno = p->fcr_physical / bperrtg;
+			agoff = p->fcr_physical - (agno * bperrtg);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
 		} else
 			abuf[0] = 0;
 		aoff_w = max(aoff_w, strlen(abuf));
@@ -231,6 +240,16 @@ dump_refcounts_verbose(
 			snprintf(gbuf, sizeof(gbuf),
 				"%lld",
 				(long long)agno);
+		} else if (p->fcr_device == xfs_rt_dev && fsgeo->rgcount > 0) {
+			agno = p->fcr_physical / bperrtg;
+			agoff = p->fcr_physical - (agno * bperrtg);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
+			snprintf(gbuf, sizeof(gbuf),
+				"%lld",
+				(long long)agno);
 		} else {
 			abuf[0] = 0;
 			gbuf[0] = 0;
@@ -420,6 +439,7 @@ fsrefcounts_f(
 	}
 	fs = fs_table_lookup(file->name, FS_MOUNT_POINT);
 	xfs_data_dev = fs ? fs->fs_datadev : 0;
+	xfs_rt_dev = fs ? fs->fs_rtdev : 0;
 
 	head->fch_count = map_size;
 	do {



* [PATCH 02/11] xfs: add an ioctl to map free space into a file
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
@ 2024-12-31 23:45   ` Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new ioctl to map free physical space into a file, at the same file
offset as if the file were a sparse image of the physical device backing
the filesystem.  The intent here is to use this to prototype a free
space defragmentation tool.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/xfs_trace.h             |    4 ++
 libxfs/libxfs_priv.h            |    9 ++++
 libxfs/xfs_alloc.c              |   88 +++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_alloc.h              |    3 +
 libxfs/xfs_fs.h                 |   14 ++++++
 man/man2/ioctl_xfs_map_freesp.2 |   76 ++++++++++++++++++++++++++++++++++
 6 files changed, 194 insertions(+)
 create mode 100644 man/man2/ioctl_xfs_map_freesp.2
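
To make the search in xfs_alloc_find_freesp() easier to follow, here is
a toy model of it.  Big assumption: a sorted array and a linear scan
stand in for the by-block free space btree and its lookup_le/increment
cursor calls, so this only illustrates the result the function computes.
struct free_ext and find_freesp() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct free_ext {
	uint32_t start;	/* first free block */
	uint32_t len;	/* number of free blocks */
};

/*
 * recs[] must be sorted by start and non-overlapping.  On return, *agbno
 * and *len describe the first free range at or after the original *agbno.
 * If no free range begins below end_agbno, *agbno is set to end_agbno and
 * *len to zero.  As in the patch, the length is not clipped to end_agbno.
 */
static inline void
find_freesp(const struct free_ext *recs, size_t nr, uint32_t *agbno,
	    uint32_t end_agbno, uint32_t *len)
{
	size_t i;

	assert(*agbno <= end_agbno);

	for (i = 0; i < nr; i++) {
		uint32_t start = recs[i].start;
		uint32_t flen = recs[i].len;

		/* Skip extents that end at or before the cursor. */
		if (start + flen <= *agbno)
			continue;
		/* The next candidate starts beyond the range: give up. */
		if (start >= end_agbno)
			break;
		/* Clip the front of an extent that straddles the cursor. */
		if (start < *agbno) {
			flen -= *agbno - start;
			start = *agbno;
		}
		*agbno = start;
		*len = flen;
		return;
	}

	/* Found nothing, so move the cursor to the end of the range. */
	*agbno = end_agbno;
	*len = 0;
}
```

A caller would loop, mapping each range it gets back and advancing the
cursor past it, until the cursor reaches end_agbno.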


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 7778366c5e3319..178497c8770d37 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -26,6 +26,8 @@
 #define trace_xfs_alloc_exact_done(a)		((void) 0)
 #define trace_xfs_alloc_exact_notfound(a)	((void) 0)
 #define trace_xfs_alloc_exact_error(a)		((void) 0)
+#define trace_xfs_alloc_find_freesp(...)	((void) 0)
+#define trace_xfs_alloc_find_freesp_done(...)	((void) 0)
 #define trace_xfs_alloc_near_first(a)		((void) 0)
 #define trace_xfs_alloc_near_greater(a)		((void) 0)
 #define trace_xfs_alloc_near_lesser(a)		((void) 0)
@@ -197,6 +199,8 @@
 
 #define trace_xfs_bmap_pre_update(a,b,c,d)	((void) 0)
 #define trace_xfs_bmap_post_update(a,b,c,d)	((void) 0)
+#define trace_xfs_bmapi_freesp(...)		((void) 0)
+#define trace_xfs_bmapi_freesp_done(...)	((void) 0)
 #define trace_xfs_bunmap(a,b,c,d,e)		((void) 0)
 #define trace_xfs_read_extent(a,b,c,d)		((void) 0)
 
diff --git a/libxfs/libxfs_priv.h b/libxfs/libxfs_priv.h
index ac2f64a9a75d82..932a45d734d460 100644
--- a/libxfs/libxfs_priv.h
+++ b/libxfs/libxfs_priv.h
@@ -446,6 +446,15 @@ xfs_buf_readahead(
 #define xfs_filestream_new_ag(ip,ag)		(0)
 #define xfs_filestream_select_ag(...)		(-ENOSYS)
 
+struct xfs_trans;
+
+static inline int
+xfs_rtallocate_extent(struct xfs_trans *tp, xfs_rtxnum_t start,
+		xfs_rtxlen_t maxlen, xfs_rtxlen_t *len, xfs_rtxnum_t *rtx)
+{
+	return -EOPNOTSUPP;
+}
+
 #define xfs_trans_inode_buf(tp, bp)		((void) 0)
 
 /* quota bits */
diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c
index 9aebe7227a6148..e21b694420e309 100644
--- a/libxfs/xfs_alloc.c
+++ b/libxfs/xfs_alloc.c
@@ -4164,3 +4164,91 @@ xfs_extfree_intent_destroy_cache(void)
 	kmem_cache_destroy(xfs_extfree_item_cache);
 	xfs_extfree_item_cache = NULL;
 }
+
+/*
+ * Find the next chunk of free space in @pag starting at @agbno and going no
+ * higher than @end_agbno.  Set @agbno and @len to whatever free space we find,
+ * or to @end_agbno if we find no space.
+ */
+int
+xfs_alloc_find_freesp(
+	struct xfs_trans	*tp,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		*agbno,
+	xfs_agblock_t		end_agbno,
+	xfs_extlen_t		*len)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+	struct xfs_btree_cur	*cur;
+	struct xfs_buf		*agf_bp = NULL;
+	xfs_agblock_t		found_agbno;
+	xfs_extlen_t		found_len;
+	int			found;
+	int			error;
+
+	trace_xfs_alloc_find_freesp(pag_group(pag), *agbno,
+			end_agbno - *agbno);
+
+	error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+	if (error)
+		return error;
+
+	cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag);
+
+	/* Try to find a free extent that starts before here. */
+	error = xfs_alloc_lookup_le(cur, *agbno, 0, &found);
+	if (error)
+		goto out_cur;
+	if (found) {
+		error = xfs_alloc_get_rec(cur, &found_agbno, &found_len,
+				&found);
+		if (error)
+			goto out_cur;
+		if (XFS_IS_CORRUPT(mp, !found)) {
+			xfs_btree_mark_sick(cur);
+			error = -EFSCORRUPTED;
+			goto out_cur;
+		}
+
+		if (found_agbno + found_len > *agbno)
+			goto found;
+	}
+
+	/* Examine the next record if free extent not in range. */
+	error = xfs_btree_increment(cur, 0, &found);
+	if (error)
+		goto out_cur;
+	if (!found)
+		goto next_ag;
+
+	error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, &found);
+	if (error)
+		goto out_cur;
+	if (XFS_IS_CORRUPT(mp, !found)) {
+		xfs_btree_mark_sick(cur);
+		error = -EFSCORRUPTED;
+		goto out_cur;
+	}
+
+	if (found_agbno >= end_agbno)
+		goto next_ag;
+
+found:
+	/* Found something, so update the mapping. */
+	trace_xfs_alloc_find_freesp_done(pag_group(pag), found_agbno,
+			found_len);
+	if (found_agbno < *agbno) {
+		found_len -= *agbno - found_agbno;
+		found_agbno = *agbno;
+	}
+	*len = found_len;
+	*agbno = found_agbno;
+	goto out_cur;
+next_ag:
+	/* Found nothing, so advance the cursor beyond the end of the range. */
+	*agbno = end_agbno;
+	*len = 0;
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h
index 50ef79a1ed41a1..069077d9ad2f8c 100644
--- a/libxfs/xfs_alloc.h
+++ b/libxfs/xfs_alloc.h
@@ -286,5 +286,8 @@ void xfs_extfree_intent_destroy_cache(void);
 
 xfs_failaddr_t xfs_validate_ag_length(struct xfs_buf *bp, uint32_t seqno,
 		uint32_t length);
+int xfs_alloc_find_freesp(struct xfs_trans *tp, struct xfs_perag *pag,
+		xfs_agblock_t *agbno, xfs_agblock_t end_agbno,
+		xfs_extlen_t *len);
 
 #endif	/* __XFS_ALLOC_H__ */
diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 936f719236944f..f4128dbdf3b9a2 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1087,6 +1087,19 @@ xfs_getfsrefs_advance(
 /* fcr_flags values - returned for each non-header segment */
 #define FCR_OF_LAST		(1U << 0) /* last record in the dataset */
 
+/* map free space to file */
+
+/*
+ * XFS_IOC_MAP_FREESP maps all the free physical space in the filesystem into
+ * the file at the same offsets.  This ioctl requires CAP_SYS_ADMIN.
+ */
+struct xfs_map_freesp {
+	__s64	offset;		/* disk address to map, in bytes */
+	__s64	len;		/* length in bytes */
+	__u64	flags;		/* must be zero */
+	__u64	pad;		/* must be zero */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1127,6 +1140,7 @@ xfs_getfsrefs_advance(
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
+#define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/man/man2/ioctl_xfs_map_freesp.2 b/man/man2/ioctl_xfs_map_freesp.2
new file mode 100644
index 00000000000000..ecd2d08f3fdeee
--- /dev/null
+++ b/man/man2/ioctl_xfs_map_freesp.2
@@ -0,0 +1,76 @@
+.\" Copyright (c) 2023-2025 Oracle.  All rights reserved.
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-MAP-FREESP 2 2023-11-17 "XFS"
+.SH NAME
+ioctl_xfs_map_freesp \- map free space into a file
+.SH SYNOPSIS
+.br
+.B #include <xfs/xfs_fs.h>
+.PP
+.BI "int ioctl(int " fd ", XFS_IOC_MAP_FREESP, struct xfs_map_freesp *" arg );
+.SH DESCRIPTION
+Maps free space into the sparse ranges of a regular file.
+This ioctl uses
+.B struct xfs_map_freesp
+to specify the range of free space to be mapped:
+.PP
+.in +4n
+.nf
+struct xfs_map_freesp {
+	__s64   offset;
+	__s64   len;
+	__u64   flags;
+	__u64   pad;
+};
+.fi
+.in
+.PP
+.I offset
+is the physical disk address, in bytes, of the start of the range to scan.
+Each free space extent in this range will be mapped to the file if the
+corresponding range of the file is sparse.
+.PP
+.I len
+is the number of bytes in the range to scan.
+.PP
+.I flags
+must be zero; there are no flags defined yet.
+.PP
+.I pad
+must be zero.
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EFAULT
+The kernel was not able to copy the argument structure from userspace.
+.TP
+.B EFSBADCRC
+Metadata checksum validation failed while performing the query.
+.TP
+.B EFSCORRUPTED
+Metadata corruption was encountered while performing the query.
+.TP
+.B EINVAL
+One of the arguments was not valid,
+or the file was not sparse.
+.TP
+.B EIO
+An I/O error was encountered while performing the query.
+.TP
+.B ENOMEM
+There was insufficient memory to perform the query.
+.TP
+.B ENOSPC
+There was insufficient disk space to commit the space mappings.
+.SH CONFORMING TO
+This API is specific to the XFS filesystem on the Linux kernel.
+.SH SEE ALSO
+.BR ioctl (2)



* [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong
@ 2024-12-31 23:45   ` Darrick J. Wong
  2024-12-31 23:45   ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a command to call XFS_IOC_MAP_FREESP.  This is experimental code to
see if we can build a free space defragmenter out of this.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/prealloc.c     |   35 +++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |    8 +++++++-
 2 files changed, 42 insertions(+), 1 deletion(-)
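
For anyone who wants to poke at the ioctl without xfs_io, a minimal
caller might look like the sketch below.  Assumptions: the structure and
ioctl number are copied by hand from the xfs_fs.h hunk in the previous
patch because they are not in any released header, and they may change
before anything is merged.  map_free_space() is a hypothetical helper
name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hand-copied from this series; not part of any released header. */
struct xfs_map_freesp {
	int64_t		offset;	/* disk address to map, in bytes */
	int64_t		len;	/* length in bytes */
	uint64_t	flags;	/* must be zero */
	uint64_t	pad;	/* must be zero */
};

/* ioctl number as proposed in this series. */
#define XFS_IOC_MAP_FREESP	_IOW('X', 67, struct xfs_map_freesp)

/*
 * Ask the filesystem to map the free space in [offset, offset + len) into
 * the sparse parts of the file open on fd.  Returns 0 on success or a
 * positive errno value on failure.
 */
static inline int
map_free_space(int fd, int64_t offset, int64_t len)
{
	struct xfs_map_freesp args;

	/* flags and pad must be zero, so clear the whole structure. */
	memset(&args, 0, sizeof(args));
	args.offset = offset;
	args.len = len;

	if (ioctl(fd, XFS_IOC_MAP_FREESP, &args) < 0)
		return errno;
	return 0;
}
```

On a kernel without this series the call should simply fail with an
errno value, which the helper hands back to the caller unchanged.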


diff --git a/io/prealloc.c b/io/prealloc.c
index 8e968c9f2455d5..b7004697a045c5 100644
--- a/io/prealloc.c
+++ b/io/prealloc.c
@@ -41,6 +41,7 @@ static cmdinfo_t fcollapse_cmd;
 static cmdinfo_t finsert_cmd;
 static cmdinfo_t fzero_cmd;
 static cmdinfo_t funshare_cmd;
+static cmdinfo_t fmapfree_cmd;
 
 static int
 offset_length(
@@ -377,6 +378,30 @@ funshare_f(
 	return 0;
 }
 
+static int
+fmapfree_f(
+	int			argc,
+	char			**argv)
+{
+	struct xfs_flock64	segment;
+	struct xfs_map_freesp	args = { };
+
+	if (!offset_length(argv[1], argv[2], &segment)) {
+		exitcode = 1;
+		return 0;
+	}
+
+	args.offset = segment.l_start;
+	args.len = segment.l_len;
+
+	if (ioctl(file->fd, XFS_IOC_MAP_FREESP, &args)) {
+		perror("XFS_IOC_MAP_FREESP");
+		exitcode = 1;
+		return 0;
+	}
+	return 0;
+}
+
 void
 prealloc_init(void)
 {
@@ -489,4 +514,14 @@ prealloc_init(void)
 	funshare_cmd.oneline =
 	_("unshares shared blocks within the range");
 	add_command(&funshare_cmd);
+
+	fmapfree_cmd.name = "fmapfree";
+	fmapfree_cmd.cfunc = fmapfree_f;
+	fmapfree_cmd.argmin = 2;
+	fmapfree_cmd.argmax = 2;
+	fmapfree_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
+	fmapfree_cmd.args = _("off len");
+	fmapfree_cmd.oneline =
+	_("maps free space into a file");
+	add_command(&fmapfree_cmd);
 }
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 37ad497c771051..c4d09ce07f597b 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -519,8 +519,14 @@ .SH FILE I/O COMMANDS
 .BR fallocate (2)
 manual page to create the hole by shifting data blocks.
 .TP
+.BI fmapfree " offset length"
+Maps free physical space into the file by calling XFS_IOC_MAP_FREESP as
+described in the
+.BR ioctl_xfs_map_freesp (2)
+manual page.
+.TP
 .BI fpunch " offset length"
-Punches (de-allocates) blocks in the file by calling fallocate with 
+Punches (de-allocates) blocks in the file by calling fallocate with
 the FALLOC_FL_PUNCH_HOLE flag as described in the
 .BR fallocate (2)
 manual page.



* [PATCH 04/11] xfs_db: get and put blocks on the AGFL
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:45   ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong
@ 2024-12-31 23:45   ` Darrick J. Wong
  2024-12-31 23:46   ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new xfs_db command to let people add and remove blocks from an
AGFL.  This isn't really related to rmap btree reconstruction, other
than enabling debugging code to mess around with the AGFL to exercise
various odd scenarios.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 db/agfl.c                |  297 ++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/libxfs_api_defs.h |    4 +
 man/man8/xfs_db.8        |   11 ++
 3 files changed, 308 insertions(+), 4 deletions(-)
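
A note on how the -g and -p counts are interpreted, since the help text
does not spell it out: a negative count selects an alternate mode (leak
the blocks for -g, allocate near the end of the AG for -p), and the
magnitude is clamped first to the AGFL size and then to what is actually
available.  The sketch below models just that clamping.  struct
agfl_request and clamp_agfl_request() are hypothetical names, not
functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>

struct agfl_request {
	unsigned int	nr;		/* blocks to actually move */
	bool		alt_mode;	/* caller passed a negative count */
};

/*
 * For -g, "available" is the number of blocks currently on the AGFL.
 * For -p, it is the number of empty AGFL slots.
 */
static inline struct agfl_request
clamp_agfl_request(int quantity, unsigned int agfl_size,
		   unsigned int available)
{
	struct agfl_request req = { 0, quantity < 0 };
	unsigned int want;

	/* Negate in unsigned arithmetic so that INT_MIN is handled too. */
	want = req.alt_mode ? 0U - (unsigned int)quantity :
			      (unsigned int)quantity;

	/* Never ask for more than the AGFL can hold... */
	if (want > agfl_size)
		want = agfl_size;
	/* ...or more than is available right now. */
	if (want > available)
		want = available;

	req.nr = want;
	return req;
}
```

So a request for -g -3 on an AGFL holding ten blocks would remove three
blocks and leak them rather than returning them to the free space
btrees.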


diff --git a/db/agfl.c b/db/agfl.c
index f0f3f21a64d12c..cf5a2407f6b6d8 100644
--- a/db/agfl.c
+++ b/db/agfl.c
@@ -15,13 +15,14 @@
 #include "output.h"
 #include "init.h"
 #include "agfl.h"
+#include "libfrog/bitmap.h"
 
 static int agfl_bno_size(void *obj, int startoff);
 static int agfl_f(int argc, char **argv);
 static void agfl_help(void);
 
 static const cmdinfo_t agfl_cmd =
-	{ "agfl", NULL, agfl_f, 0, 1, 1, N_("[agno]"),
+	{ "agfl", NULL, agfl_f, 0, -1, 1, N_("[agno] [-g nr] [-p nr]"),
 	  N_("set address to agfl block"), agfl_help };
 
 const field_t	agfl_hfld[] = { {
@@ -77,10 +78,280 @@ agfl_help(void)
 " for each allocation group.  This acts as a reserved pool of space\n"
 " separate from the general filesystem freespace (not used for user data).\n"
 "\n"
+" -g quantity\tRemove this many blocks from the AGFL.\n"
+" -p quantity\tAdd this many blocks to the AGFL.\n"
+"\n"
 ));
 
 }
 
+struct dump_info {
+	struct xfs_perag	*pag;
+	bool			leak;
+};
+
+/* Return blocks freed from the AGFL to the free space btrees. */
+static int
+free_grabbed(
+	uint64_t		start,
+	uint64_t		length,
+	void			*data)
+{
+	struct dump_info	*di = data;
+	struct xfs_perag	*pag = di->pag;
+	struct xfs_mount	*mp = pag_mount(pag);
+	struct xfs_trans	*tp;
+	struct xfs_buf		*agf_bp;
+	int			error;
+
+	error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0,
+			&tp);
+	if (error)
+		return error;
+
+	error = -libxfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+	if (error)
+		goto out_cancel;
+
+	error = -libxfs_free_extent(tp, pag, start, length, &XFS_RMAP_OINFO_AG,
+			XFS_AG_RESV_AGFL);
+	if (error)
+		goto out_cancel;
+
+	return -libxfs_trans_commit(tp);
+
+out_cancel:
+	libxfs_trans_cancel(tp);
+	return error;
+}
+
+/* Report blocks freed from the AGFL. */
+static int
+dump_grabbed(
+	uint64_t		start,
+	uint64_t		length,
+	void			*data)
+{
+	struct dump_info	*di = data;
+	const char		*fmt;
+
+	if (length == 1)
+		fmt = di->leak ? _("agfl %u: leaked agbno %u\n") :
+				 _("agfl %u: removed agbno %u\n");
+	else
+		fmt = di->leak ? _("agfl %u: leaked agbno %u-%u\n") :
+				 _("agfl %u: removed agbno %u-%u\n");
+
+	printf(fmt, pag_agno(di->pag), (unsigned int)start,
+			(unsigned int)(start + length - 1));
+	return 0;
+}
+
+/* Remove blocks from the AGFL. */
+static int
+agfl_get(
+	struct xfs_perag	*pag,
+	int			quantity)
+{
+	struct dump_info	di = {
+		.pag		= pag,
+		.leak		= quantity < 0,
+	};
+	struct xfs_agf		*agf;
+	struct xfs_buf		*agf_bp;
+	struct xfs_trans	*tp;
+	struct bitmap		*grabbed;
+	const unsigned int	agfl_size = libxfs_agfl_size(pag_mount(pag));
+	unsigned int		i;
+	int			error;
+
+	if (!quantity)
+		return 0;
+
+	if (di.leak)
+		quantity = -quantity;
+	quantity = min(quantity, agfl_size);
+
+	error = bitmap_alloc(&grabbed);
+	if (error)
+		goto out;
+
+	error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, quantity, 0,
+			0, &tp);
+	if (error)
+		goto out_bitmap;
+
+	error = -libxfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+	if (error)
+		goto out_cancel;
+
+	agf = agf_bp->b_addr;
+	quantity = min(quantity, be32_to_cpu(agf->agf_flcount));
+
+	for (i = 0; i < quantity; i++) {
+		xfs_agblock_t	agbno;
+
+		error = -libxfs_alloc_get_freelist(pag, tp, agf_bp, &agbno, 0);
+		if (error)
+			goto out_cancel;
+
+		if (agbno == NULLAGBLOCK) {
+			error = ENOSPC;
+			goto out_cancel;
+		}
+
+		error = bitmap_set(grabbed, agbno, 1);
+		if (error)
+			goto out_cancel;
+	}
+
+	error = -libxfs_trans_commit(tp);
+	if (error)
+		goto out_bitmap;
+
+	error = bitmap_iterate(grabbed, dump_grabbed, &di);
+	if (error)
+		goto out_bitmap;
+
+	if (!di.leak) {
+		error = bitmap_iterate(grabbed, free_grabbed, &di);
+		if (error)
+			goto out_bitmap;
+	}
+
+	bitmap_free(&grabbed);
+	return 0;
+
+out_cancel:
+	libxfs_trans_cancel(tp);
+out_bitmap:
+	bitmap_free(&grabbed);
+out:
+	if (error)
+		printf(_("agfl %u: %s\n"), pag_agno(pag), strerror(error));
+	return error;
+}
+
+/* Add blocks to the AGFL. */
+static int
+agfl_put(
+	struct xfs_perag	*pag,
+	int			quantity)
+{
+	struct xfs_alloc_arg	args = {
+		.mp		= pag_mount(pag),
+		.alignment	= 1,
+		.minlen		= 1,
+		.prod		= 1,
+		.resv		= XFS_AG_RESV_AGFL,
+		.oinfo		= XFS_RMAP_OINFO_AG,
+	};
+	struct xfs_buf		*agfl_bp;
+	struct xfs_agf		*agf;
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		target;
+	const unsigned int	agfl_size = libxfs_agfl_size(pag_mount(pag));
+	unsigned int		i;
+	bool			eoag = quantity < 0;
+	int			error;
+
+	if (!quantity)
+		return 0;
+
+	if (eoag)
+		quantity = -quantity;
+	quantity = min(quantity, agfl_size);
+
+	error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, quantity, 0,
+			0, &tp);
+	if (error)
+		return error;
+	args.tp = tp;
+
+	error = -libxfs_alloc_read_agf(pag, tp, 0, &args.agbp);
+	if (error)
+		goto out_cancel;
+
+	agf = args.agbp->b_addr;
+	args.maxlen = min(quantity, agfl_size - be32_to_cpu(agf->agf_flcount));
+
+	if (eoag)
+		target = xfs_agbno_to_fsb(pag,
+				be32_to_cpu(agf->agf_length) - 1);
+	else
+		target = xfs_agbno_to_fsb(pag, 0);
+
+	error = -libxfs_alloc_read_agfl(pag, tp, &agfl_bp);
+	if (error)
+		goto out_cancel;
+
+	error = -libxfs_alloc_vextent_near_bno(&args, target);
+	if (error)
+		goto out_cancel;
+
+	if (args.agbno == NULLAGBLOCK) {
+		error = ENOSPC;
+		goto out_cancel;
+	}
+
+	for (i = 0; i < args.len; i++) {
+		error = -libxfs_alloc_put_freelist(pag, tp, args.agbp,
+				agfl_bp, args.agbno + i, 0);
+		if (error)
+			goto out_cancel;
+	}
+
+	if (i == 1)
+		printf(_("agfl %u: added agbno %u\n"), pag_agno(pag),
+				args.agbno);
+	else if (i > 1)
+		printf(_("agfl %u: added agbno %u-%u\n"), pag_agno(pag),
+				args.agbno, args.agbno + i - 1);
+
+	error = -libxfs_trans_commit(tp);
+	if (error)
+		goto out;
+
+	return 0;
+
+out_cancel:
+	libxfs_trans_cancel(tp);
+out:
+	if (error)
+		printf(_("agfl %u: %s\n"), pag_agno(pag), strerror(error));
+	return error;
+}
+
+static void
+agfl_adjust(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	int			gblocks,
+	int			pblocks)
+{
+	struct xfs_perag	*pag;
+	int			error;
+
+	if (!expert_mode) {
+		printf(_("AGFL get/put only supported in expert mode.\n"));
+		exitcode = 1;
+		return;
+	}
+
+	pag = libxfs_perag_get(mp, agno);
+
+	error = agfl_get(pag, gblocks);
+	if (error)
+		goto out_pag;
+
+	error = agfl_put(pag, pblocks);
+
+out_pag:
+	libxfs_perag_put(pag);
+	if (error)
+		exitcode = 1;
+}
+
 static int
 agfl_f(
 	int		argc,
@@ -88,9 +359,25 @@ agfl_f(
 {
 	xfs_agnumber_t	agno;
 	char		*p;
+	int		c;
+	int		gblocks = 0, pblocks = 0;
 
-	if (argc > 1) {
-		agno = (xfs_agnumber_t)strtoul(argv[1], &p, 0);
+	while ((c = getopt(argc, argv, "g:p:")) != -1) {
+		switch (c) {
+		case 'g':
+			gblocks = atoi(optarg);
+			break;
+		case 'p':
+			pblocks = atoi(optarg);
+			break;
+		default:
+			agfl_help();
+			return 1;
+		}
+	}
+
+	if (argc > optind) {
+		agno = (xfs_agnumber_t)strtoul(argv[optind], &p, 0);
 		if (*p != '\0' || agno >= mp->m_sb.sb_agcount) {
-			dbprintf(_("bad allocation group number %s\n"), argv[1]);
+			dbprintf(_("bad allocation group number %s\n"), argv[optind]);
 			return 0;
@@ -98,6 +385,10 @@ agfl_f(
 		cur_agno = agno;
 	} else if (cur_agno == NULLAGNUMBER)
 		cur_agno = 0;
+
+	if (gblocks || pblocks)
+		agfl_adjust(mp, cur_agno, gblocks, pblocks);
+
 	ASSERT(typtab[TYP_AGFL].typnm == TYP_AGFL);
 	set_cur(&typtab[TYP_AGFL],
 		XFS_AG_DADDR(mp, cur_agno, XFS_AGFL_DADDR(mp)),
diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 530feef2a47db8..76f55515bb41f7 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -31,8 +31,12 @@
 #define xfs_allocbt_maxrecs		libxfs_allocbt_maxrecs
 #define xfs_allocbt_stage_cursor	libxfs_allocbt_stage_cursor
 #define xfs_alloc_fix_freelist		libxfs_alloc_fix_freelist
+#define xfs_alloc_get_freelist		libxfs_alloc_get_freelist
 #define xfs_alloc_min_freelist		libxfs_alloc_min_freelist
+#define xfs_alloc_put_freelist		libxfs_alloc_put_freelist
 #define xfs_alloc_read_agf		libxfs_alloc_read_agf
+#define xfs_alloc_read_agfl		libxfs_alloc_read_agfl
+#define xfs_alloc_vextent_near_bno	libxfs_alloc_vextent_near_bno
 #define xfs_alloc_vextent_start_ag	libxfs_alloc_vextent_start_ag
 
 #define xfs_ascii_ci_hashname		libxfs_ascii_ci_hashname
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index 553adff758bc02..4217e9932dd775 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -182,10 +182,19 @@ .SH COMMANDS
 .IR agno .
 If no argument is given, use the current allocation group.
 .TP
-.BI "agfl [" agno ]
+.BI "agfl [" agno "] [\-g " quantity "] [\-p " quantity ]
 Set current address to the AGFL block for allocation group
 .IR agno .
 If no argument is given, use the current allocation group.
+If the
+.B -g
+option is specified with a positive quantity, remove that many blocks from the
+AGFL and put them in the free space btrees.
+If the quantity is negative, remove the blocks and leak them.
+If the
+.B -p
+option is specified, add that many blocks to the AGFL.
+If the quantity is negative, the blocks are selected from the end of the AG.
 .TP
 .BI "agi [" agno ]
 Set current address to the AGI block for allocation group
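
For readers following the -g/-p semantics documented in the man page hunk
above, here is a minimal sketch of how a signed quantity splits into a block
count plus a mode flag.  It mirrors the sign handling and clamping at the top
of agfl_put(); the struct and function names are hypothetical and not part of
xfs_db, and whether agfl_get() clamps the same way is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decoded form of a signed -g/-p quantity. */
struct agfl_request {
	unsigned int	count;		/* number of blocks to move */
	bool		alternate;	/* negative: leak (-g) or end of AG (-p) */
};

/*
 * Split a signed quantity into a block count and a mode flag, clamping the
 * count to the number of AGFL slots.
 */
static struct agfl_request
agfl_decode_quantity(int quantity, unsigned int agfl_size)
{
	struct agfl_request	req = { 0, false };
	unsigned int		count;

	assert(agfl_size > 0);

	if (quantity < 0) {
		req.alternate = true;
		/* Negate in unsigned arithmetic so INT_MIN cannot overflow. */
		count = 0U - (unsigned int)quantity;
	} else {
		count = (unsigned int)quantity;
	}

	req.count = count < agfl_size ? count : agfl_size;
	return req;
}
```

With an AGFL of 118 slots (an illustrative figure only), a quantity of -5
decodes to five blocks in the alternate mode, and a quantity of 500 is clamped
to 118 blocks in the normal mode.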



* [PATCH 05/11] xfs_spaceman: implement clearing free space
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:45   ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong
@ 2024-12-31 23:46   ` Darrick J. Wong
  2024-12-31 23:46   ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

First attempt at evacuating all the used blocks from part of a
filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libfrog/Makefile        |    5 
 libfrog/clearspace.c    | 3294 +++++++++++++++++++++++++++++++++++++++++++++++
 libfrog/clearspace.h    |   79 +
 man/man8/xfs_spaceman.8 |   17 
 spaceman/Makefile       |    2 
 spaceman/clearfree.c    |  171 ++
 spaceman/init.c         |    1 
 spaceman/space.h        |    2 
 8 files changed, 3570 insertions(+), 1 deletion(-)
 create mode 100644 libfrog/clearspace.c
 create mode 100644 libfrog/clearspace.h
 create mode 100644 spaceman/clearfree.c


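The freeze phase described in the clearspace.c introduction turns a GETFSMAP
record into a FICLONERANGE request.  A rough sketch of that translation
follows.  fill_freeze_request() is a hypothetical helper, and placing the
clone at a destination offset equal to the physical byte address is an
assumption borrowed from the rule stated for the space-capture file.

```c
#include <assert.h>
#include <string.h>
#include <linux/fs.h>		/* struct file_clone_range, FICLONERANGE */
#include <linux/fsmap.h>	/* struct fsmap */

/*
 * Describe a clone of the owner's file range into a work file.  The source
 * range comes straight from the fsmap record; the destination offset is the
 * physical byte address of the extent (assumed convention).
 */
static void
fill_freeze_request(struct file_clone_range *fcr, int owner_fd,
		const struct fsmap *mrec)
{
	assert(fcr && mrec);

	memset(fcr, 0, sizeof(*fcr));
	fcr->src_fd = owner_fd;
	fcr->src_offset = mrec->fmr_offset;
	fcr->src_length = mrec->fmr_length;
	fcr->dest_offset = mrec->fmr_physical;
}
```

The caller would then issue ioctl(work_fd, FICLONERANGE, &fcr).  A real
request must also be aligned to the filesystem block size, which this sketch
does not check.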
diff --git a/libfrog/Makefile b/libfrog/Makefile
index 4da427789411a6..91c99822002347 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -65,6 +65,11 @@ workqueue.h
 
 LSRCFILES += gen_crc32table.c
 
+ifeq ($(HAVE_GETFSMAP),yes)
+CFILES+=clearspace.c
+HFILES+=clearspace.h
+endif
+
 LDIRT = gen_crc32table crc32table.h
 
 default: ltdepend $(LTLIBRARY)
diff --git a/libfrog/clearspace.c b/libfrog/clearspace.c
new file mode 100644
index 00000000000000..0b6ef8f1b15015
--- /dev/null
+++ b/libfrog/clearspace.c
@@ -0,0 +1,3294 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include <linux/fsmap.h>
+#include "paths.h"
+#include "fsgeom.h"
+#include "logging.h"
+#include "bulkstat.h"
+#include "bitmap.h"
+#include "file_exchange.h"
+#include "clearspace.h"
+#include "handle.h"
+
+/*
+ * Filesystem Space Balloons
+ * =========================
+ *
+ * NOTE: Due to the evolving identity of this code, "space_fd" and "space
+ * file" in the code refer to the balloon file described in this introduction.
+ * The introduction was written much later than the code.
+ *
+ * The goal of this code is to create a balloon file that is mapped to a range
+ * of the physical space that is managed by a filesystem.  There are several
+ * uses envisioned for balloon files:
+ *
+ * 1. Defragmenting free space.  Once the balloon is created, freeing it leaves
+ *    a large chunk of contiguous free space ready for reallocation.
+ *
+ * 2. Shrinking the filesystem.  If the balloon is inflated at the end of the
+ *    filesystem, the file can be handed to the shrink code.  The shrink code
+ *    can then reduce the filesystem size by the size of the balloon.
+ *
+ * 3. Constraining usage of underlying thin provisioning pools.  The space
+ *    assigned to a balloon can be DISCARDed, which prevents the filesystem
+ *    from using that space until the balloon is freed.  This can be done more
+ *    efficiently with the standard fallocate call, unless the balloon must
+ *    target specific LBA ranges.
+ *
+ * Inflating a balloon is performed in five phases: claiming unused space;
+ * freezing used space; migrating file mappings away from frozen space; moving
+ * inodes; and rebuilding metadata elsewhere.
+ *
+ * Claiming Unused Space
+ * ---------------------
+ *
+ * The first step of inflating a file balloon is to define the range of
+ * physical space to be added to the balloon and claim as much of the free
+ * space inside that range as possible.  Dirty data are flushed to disk and
+ * the block and inode garbage collectors are run to remove any speculative
+ * preallocations that might be occupying space in the target range.
+ *
+ * Second, the new XFS_IOC_MAP_FREESP ioctl is used to map free space in the
+ * target range to the balloon file.  This step will be repeated after every
+ * space-clearing step below to capture that cleared space.  Concurrent writer
+ * threads will (hopefully) be allocated space outside the target range.
+ *
+ * Freezing Used Space
+ * -------------------
+ *
+ * The second phase of inflating the balloon is to freeze as much of the
+ * allocated space within the target range as possible.  The purpose of this
+ * step is to grab a second reference to the used space, thereby preventing it
+ * from being reused elsewhere.
+ *
+ * Freezing of a physical space extent starts by using GETFSMAP to find the
+ * file owner of the space, and opening the file by handle.  The fsmap record
+ * is used to create a FICLONERANGE request to link the file range into a work
+ * file.  Once the reflink is made, any subsequent writes to any of the owners
+ * of that space are staged via copy on write.  The balloon file prevents the
+ * copy on write from being staged within the target range.  The frozen space
+ * mapping is moved from the work file to the balloon file, where it remains
+ * until the balloon file is freed.
+ *
+ * If reflink is not supported on the filesystem, used space cannot be frozen.
+ * This phase is skipped.
+ *
+ * Migrating File Mappings
+ * -----------------------
+ *
+ * Once the balloon file has been populated with as much of the target range as
+ * possible, it is time to remap file ranges that point to the frozen space.
+ *
+ * It is advantageous to remap as many blocks as can be done with as few system
+ * calls as possible to avoid fragmenting files.  Furthermore, it is preferable
+ * to remap heavily shared extents before lightly shared extents to preserve
+ * reflinks when possible.  The new GETFSREFCOUNTS call is used to rank
+ * physical space extents by size and sharing factor so that the library always
+ * tries to relocate the highest ranking space extent.
+ *
+ * Once a space extent has been selected for relocation, it is reflinked from
+ * the balloon file into the work file.  Next, fallocate is called with the
+ * FALLOC_FL_UNSHARE_RANGE mode to persist a new copy of the file data and
+ * update the mapping in the work file.  The GETFSMAP call is used to find the
+ * remaining owners of the target space.  For each owner, FIDEDUPERANGE is
+ * used to change the owner file's mapping to the space in the work file if the
+ * owner has not been changed.
+ *
+ * If the filesystem does not support reflink, FIDEDUPERANGE will not be
+ * available.  Fortunately, there will only be one owner of the frozen space.
+ * The file range contents are instead copied through the page cache to the
+ * work file, and EXCHANGE_RANGE is used to swap the mappings if the owner
+ * file has not been modified.
+ *
+ * When the only remaining owner of the space is the balloon file, return to
+ * the GETFSREFCOUNTS step to find a new target.  This phase is complete when
+ * there are no more targets.
+ *
+ * Moving Inodes
+ * -------------
+ *
+ * NOTE: This part is not written.
+ *
+ * When GETFSMAP tells us about an inode chunk, it is necessary to move the
+ * inodes allocated in that inode chunk to a new chunk.  The first step is to
+ * create a new donor file whose inode record is not in the target range.  This
+ * file must be created in a donor directory.  Next, the file contents should
+ * be cloned, either via FICLONE for regular files or by copying the directory
+ * entries for directories.  The caller must ensure that no programs write to
+ * the victim inode while this process is ongoing.
+ *
+ * Finally, the new inode must be mapped into the same points in the directory
+ * tree as the old inode.  For each parent pointer accessible by the file,
+ * perform a RENAME_EXCHANGE operation to update the directory entry.  One
+ * obvious flaw of this method is that we cannot specify (parent, name, child)
+ * pairs to renameat, which means that the rename does the wrong thing if
+ * either directory is updated concurrently.
+ *
+ * If parent pointers are not available, this phase could be performed slowly
+ * by iterating all directories looking for entries of interest and swapping
+ * them.
+ *
+ * The caller must guarantee that other applications cannot update the
+ * filesystem concurrently.
+ *
+ * Rebuilding Metadata
+ * -------------------
+ *
+ * The final phase identifies filesystem metadata occupying the target range
+ * and uses the online filesystem repair facility to rebuild the metadata
+ * structures.  Assuming that the balloon file now maps most of the space in
+ * the target range, the new structures should be located outside of the target
+ * range.  This phase runs in a loop until there is no more metadata to
+ * relocate or no progress can be made on relocating metadata.
+ *
+ * Limitations and Bugs
+ * --------------------
+ *
+ * - This code must be able to find the owners of a range of physical space.
+ *   If GETFSMAP does not return owner information, this code cannot succeed.
+ *   In other words, reverse mapping must be enabled.
+ *
+ * - We cannot freeze EOF blocks because the FICLONERANGE code does not allow
+ *   us to remap an EOF block into the middle of the balloon file.  I think we
+ *   actually succeed at reflinking the EOF block into the work file during the
+ *   freeze step, but we need to dedupe/exchange the real owners' mappings
+ *   without waiting for the freeze step.  OTOH, we /also/ want to freeze as
+ *   much space as quickly as we can.
+ *
+ * - Freeze cannot use FICLONERANGE to reflink unwritten extents into the work
+ *   file because FICLONERANGE ignores unwritten extents.  We could create the
+ *   work file as a sparse file and use EXCHANGE_RANGE to swap the unwritten
+ *   extent with the hole, extend EOF to be allocunit aligned, and use
+ *   EXCHANGE_RANGE to move it to the balloon file.  That first exchange must
+ *   be careful to sample the owner file's bulkstat data, re-measure the file
+ *   range to confirm that the unwritten extent is still the one we want, and
+ *   only exchange if the owner file has not changed.
+ *
+ * - csp_buffercopy seems to hang if pread returns zero bytes read.  Do we dare
+ *   use copy_file_range for this instead?
+ *
+ * - None of this code knows how to move inodes.  Phase 4 is entirely
+ *   speculative fiction rooted in Dave Chinner's earlier implementation.
+ *
+ * - Does this work for realtime files?  Even for large rt extent sizes?
+ */
+
+/* VFS helpers */
+
+/* Remap the file range described by @fcr into fd, or return an errno. */
+static inline int
+clonerange(int fd, struct file_clone_range *fcr)
+{
+	int	ret;
+
+	ret = ioctl(fd, FICLONERANGE, fcr);
+	if (ret)
+		return errno;
+
+	return 0;
+}
+
+/*
+ * Deduplicate part of fd into the file range described by fdr.  If the
+ * operation succeeded, we set @same to whether or not we deduped the data and
+ * return zero.  If not, return an errno.
+ */
+static inline int
+deduperange(int fd, struct file_dedupe_range *fdr, bool *same)
+{
+	struct file_dedupe_range_info *info = &fdr->info[0];
+	int	ret;
+
+	assert(fdr->dest_count == 1);
+	*same = false;
+
+	ret = ioctl(fd, FIDEDUPERANGE, fdr);
+	if (ret)
+		return errno;
+
+	if (info->status < 0)
+		return -info->status;
+
+	if (info->status == FILE_DEDUPE_RANGE_DIFFERS)
+		return 0;
+
+	/* The kernel should never dedupe more than it was asked. */
+	assert(fdr->src_length >= info->bytes_deduped);
+
+	*same = true;
+	return 0;
+}
+
+/* Space clearing operation control */
+
+#define QUERY_BATCH_SIZE		1024
+
+struct clearspace_tgt {
+	unsigned long long	start;
+	unsigned long long	length;
+	unsigned long long	owners;
+	unsigned long long	prio;
+	unsigned long long	evacuated;
+	bool			try_again;
+};
+
+struct clearspace_req {
+	struct xfs_fd		*xfd;
+
+	/* all the blocks that we've tried to clear */
+	struct bitmap		*visited;
+
+	/* stat buffer of the open file */
+	struct stat		statbuf;
+	struct stat		temp_statbuf;
+	struct stat		space_statbuf;
+
+	/* handle to this filesystem */
+	void			*fshandle;
+	size_t			fshandle_sz;
+
+	/* physical storage that we want to clear */
+	unsigned long long	start;
+	unsigned long long	length;
+	dev_t			dev;
+
+	/* convenience variable */
+	bool			realtime:1;
+	bool			use_reflink:1;
+	bool			can_evac_metadata:1;
+
+	/*
+	 * The "space capture" file.  Each extent in this file must be mapped
+	 * to the same byte offset as the byte address of the physical space.
+	 */
+	int			space_fd;
+
+	/* work file for migrating file data */
+	int			work_fd;
+
+	/* preallocated buffers for queries */
+	struct getbmapx		*bhead;
+	struct fsmap_head	*mhead;
+	struct xfs_getfsrefs_head	*rhead;
+
+	/* buffer for copying data */
+	char			*buf;
+
+	/* buffer for deduping data */
+	struct file_dedupe_range *fdr;
+
+	/* tracing mask and indent level */
+	unsigned int		trace_mask;
+	unsigned int		trace_indent;
+};
+
+static inline bool
+csp_is_internal_owner(
+	const struct clearspace_req	*req,
+	unsigned long long		owner)
+{
+	return owner == req->temp_statbuf.st_ino ||
+	       owner == req->space_statbuf.st_ino;
+}
+
+/* Debugging stuff */
+
+static const struct csp_errstr {
+	unsigned int		mask;
+	const char		*tag;
+} errtags[] = {
+	{ CSP_TRACE_FREEZE,	"freeze" },
+	{ CSP_TRACE_GRAB,	"grab" },
+	{ CSP_TRACE_PREP,	"prep" },
+	{ CSP_TRACE_TARGET,	"target" },
+	{ CSP_TRACE_DEDUPE,	"dedupe" },
+	{ CSP_TRACE_EXCHANGE,	"exchange_range" },
+	{ CSP_TRACE_XREBUILD,	"rebuild" },
+	{ CSP_TRACE_EFFICACY,	"efficacy" },
+	{ CSP_TRACE_SETUP,	"setup" },
+	{ CSP_TRACE_DUMPFILE,	"dumpfile" },
+	{ CSP_TRACE_BITMAP,	"bitmap" },
+
+	/* prioritize high level functions over low level queries for tagging */
+	{ CSP_TRACE_FSMAP,	"fsmap" },
+	{ CSP_TRACE_FSREFS,	"fsrefs" },
+	{ CSP_TRACE_BMAPX,	"bmapx" },
+	{ CSP_TRACE_FALLOC,	"falloc" },
+	{ CSP_TRACE_STATUS,	"status" },
+	{ 0, NULL },
+};
+
+static void
+csp_debug(
+	struct clearspace_req	*req,
+	unsigned int		mask,
+	const char		*func,
+	int			line,
+	const char		*format,
+	...)
+{
+	const struct csp_errstr	*et = errtags;
+	bool			debug = (req->trace_mask & ~CSP_TRACE_STATUS);
+	int			indent = req->trace_indent;
+	va_list			args;
+
+	if ((req->trace_mask & mask) != mask)
+		return;
+
+	if (debug) {
+		while (indent > 0) {
+			fprintf(stderr, "  ");
+			indent--;
+		}
+
+		for (; et->tag; et++) {
+			if (et->mask & mask) {
+				fprintf(stderr, "%s: ", et->tag);
+				break;
+			}
+		}
+	}
+
+	va_start(args, format);
+	vfprintf(stderr, format, args);
+	va_end(args);
+
+	if (debug)
+		fprintf(stderr, " (line %d)\n", line);
+	else
+		fprintf(stderr, "\n");
+	fflush(stderr);
+}
+
+#define trace_freeze(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_FREEZE, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_grabfree(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_GRAB, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_fsmap(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_FSMAP, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_fsmap_rec(req, mask, mrec)	\
+	while (!csp_is_internal_owner((req), (mrec)->fmr_owner)) { \
+		csp_debug((req), (mask) | CSP_TRACE_FSMAP, __func__, __LINE__, \
+"fsmap phys 0x%llx owner 0x%llx offset 0x%llx bytecount 0x%llx flags 0x%x", \
+				(unsigned long long)(mrec)->fmr_physical, \
+				(unsigned long long)(mrec)->fmr_owner, \
+				(unsigned long long)(mrec)->fmr_offset, \
+				(unsigned long long)(mrec)->fmr_length, \
+				(mrec)->fmr_flags); \
+		break; \
+	}
+
+#define trace_fsrefs(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_FSREFS, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_fsrefs_rec(req, mask, rrec)	\
+	csp_debug((req), (mask) | CSP_TRACE_FSREFS, __func__, __LINE__, \
+"fsref phys 0x%llx bytecount 0x%llx owners %llu flags 0x%x", \
+			(unsigned long long)(rrec)->fcr_physical, \
+			(unsigned long long)(rrec)->fcr_length, \
+			(unsigned long long)(rrec)->fcr_owners, \
+			(rrec)->fcr_flags)
+
+#define trace_bmapx(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_BMAPX, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_bmapx_rec(req, mask, brec)	\
+	csp_debug((req), (mask) | CSP_TRACE_BMAPX, __func__, __LINE__, \
+"bmapx pos 0x%llx bytecount 0x%llx phys 0x%llx flags 0x%x", \
+			(unsigned long long)BBTOB((brec)->bmv_offset), \
+			(unsigned long long)BBTOB((brec)->bmv_length), \
+			(unsigned long long)BBTOB((brec)->bmv_block), \
+			(brec)->bmv_oflags)
+
+#define trace_prep(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_PREP, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_target(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_TARGET, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_dedupe(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_DEDUPE, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_falloc(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_FALLOC, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_exchange(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_EXCHANGE, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_xrebuild(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_XREBUILD, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_setup(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_SETUP, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_status(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_STATUS, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_dumpfile(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_DUMPFILE, __func__, __LINE__, format, __VA_ARGS__)
+
+#define trace_bitmap(req, format, ...)	\
+	csp_debug((req), CSP_TRACE_BITMAP, __func__, __LINE__, format, __VA_ARGS__)
+
+/* VFS Iteration helpers */
+
+static inline void
+start_spacefd_iter(struct clearspace_req *req)
+{
+	req->trace_indent++;
+}
+
+static inline void
+end_spacefd_iter(struct clearspace_req *req)
+{
+	req->trace_indent--;
+}
+
+/*
+ * Iterate each hole in the space-capture file.  Returns 1 if holepos/length
+ * has been set to a hole; 0 if there aren't any holes left, or -1 for error.
+ */
+static inline int
+spacefd_hole_iter(
+	const struct clearspace_req	*req,
+	loff_t			*holepos,
+	loff_t			*length)
+{
+	loff_t			end = req->start + req->length;
+	loff_t			h;
+	loff_t			d;
+
+	if (*length == 0)
+		d = req->start;
+	else
+		d = *holepos + *length;
+	if (d >= end)
+		return 0;
+
+	h = lseek(req->space_fd, d, SEEK_HOLE);
+	if (h < 0) {
+		perror(_("finding start of hole in space capture file"));
+		return h;
+	}
+	if (h >= end)
+		return 0;
+
+	d = lseek(req->space_fd, h, SEEK_DATA);
+	if (d < 0 && errno == ENXIO)
+		d = end;
+	if (d < 0) {
+		perror(_("finding end of hole in space capture file"));
+		return d;
+	}
+	if (d > end)
+		d = end;
+
+	*holepos = h;
+	*length = d - h;
+	return 1;
+}
+
+/*
+ * Iterate each written region in the space-capture file.  Returns 1 if
+ * datapos/length have been set to a data area; 0 if there isn't any data left,
+ * or -1 for error.
+ */
+static int
+spacefd_data_iter(
+	const struct clearspace_req	*req,
+	loff_t			*datapos,
+	loff_t			*length)
+{
+	loff_t			end = req->start + req->length;
+	loff_t			d;
+	loff_t			h;
+
+	if (*length == 0)
+		h = req->start;
+	else
+		h = *datapos + *length;
+	if (h >= end)
+		return 0;
+
+	d = lseek(req->space_fd, h, SEEK_DATA);
+	if (d < 0 && errno == ENXIO)
+		return 0;
+	if (d < 0) {
+		perror(_("finding start of data in space capture file"));
+		return d;
+	}
+	if (d >= end)
+		return 0;
+
+	h = lseek(req->space_fd, d, SEEK_HOLE);
+	if (h < 0) {
+		perror(_("finding end of data in space capture file"));
+		return h;
+	}
+	if (h > end)
+		h = end;
+
+	*datapos = d;
+	*length = h - d;
+	return 1;
+}
+
+/* Filesystem space usage queries */
+
+/* Initialize the preallocated buffer for a fsmap query. */
+static void
+start_fsmap_query(
+	struct clearspace_req	*req,
+	dev_t			dev,
+	unsigned long long	physical,
+	unsigned long long	length)
+{
+	struct fsmap_head	*mhead = req->mhead;
+
+	assert(req->mhead->fmh_count == 0);
+	memset(mhead, 0, sizeof(struct fsmap_head));
+	mhead->fmh_count = QUERY_BATCH_SIZE;
+	mhead->fmh_keys[0].fmr_device = dev;
+	mhead->fmh_keys[0].fmr_physical = physical;
+	mhead->fmh_keys[1].fmr_device = dev;
+	mhead->fmh_keys[1].fmr_physical = physical + length;
+	mhead->fmh_keys[1].fmr_owner = ULLONG_MAX;
+	mhead->fmh_keys[1].fmr_flags = UINT_MAX;
+	mhead->fmh_keys[1].fmr_offset = ULLONG_MAX;
+
+	trace_fsmap(req, "dev %u:%u physical 0x%llx bytecount 0x%llx highkey 0x%llx",
+			major(dev), minor(dev),
+			(unsigned long long)physical,
+			(unsigned long long)length,
+			(unsigned long long)mhead->fmh_keys[1].fmr_physical);
+	req->trace_indent++;
+}
+
+static inline void
+end_fsmap_query(
+	struct clearspace_req	*req)
+{
+	req->trace_indent--;
+	req->mhead->fmh_count = 0;
+}
+
+/* Set us up for the next run_fsmap_query, or return false. */
+static inline bool
+advance_fsmap_cursor(struct fsmap_head *mhead)
+{
+	struct fsmap	*mrec;
+
+	mrec = &mhead->fmh_recs[mhead->fmh_entries - 1];
+	if (mrec->fmr_flags & FMR_OF_LAST)
+		return false;
+
+	fsmap_advance(mhead);
+	return true;
+}
+
+/*
+ * Run a GETFSMAP query.  Returns 1 if there are rows, 0 if there are no rows,
+ * or -1 for error.
+ */
+static inline int
+run_fsmap_query(
+	struct clearspace_req	*req)
+{
+	struct fsmap_head	*mhead = req->mhead;
+	int			ret;
+
+	if (mhead->fmh_entries > 0 && !advance_fsmap_cursor(mhead))
+		return 0;
+
+	trace_fsmap(req,
+ "ioctl dev %u:%u physical 0x%llx length 0x%llx highkey 0x%llx",
+			major(mhead->fmh_keys[0].fmr_device),
+			minor(mhead->fmh_keys[0].fmr_device),
+			(unsigned long long)mhead->fmh_keys[0].fmr_physical,
+			(unsigned long long)mhead->fmh_keys[0].fmr_length,
+			(unsigned long long)mhead->fmh_keys[1].fmr_physical);
+
+	ret = ioctl(req->xfd->fd, FS_IOC_GETFSMAP, mhead);
+	if (ret) {
+		perror(_("querying fsmap data"));
+		return -1;
+	}
+
+	if (!(mhead->fmh_oflags & FMH_OF_DEV_T)) {
+		fprintf(stderr, _("fsmap does not return dev_t.\n"));
+		return -1;
+	}
+
+	if (mhead->fmh_entries == 0)
+		return 0;
+
+	return 1;
+}
+
+#define for_each_fsmap_row(req, rec) \
+	for ((rec) = (req)->mhead->fmh_recs; \
+	     (rec) < (req)->mhead->fmh_recs + (req)->mhead->fmh_entries; \
+	     (rec)++)
+
+/* Initialize the preallocated buffer for a fsrefcounts query. */
+static void
+start_fsrefs_query(
+	struct clearspace_req	*req,
+	dev_t			dev,
+	unsigned long long	physical,
+	unsigned long long	length)
+{
+	struct xfs_getfsrefs_head	*rhead = req->rhead;
+
+	assert(req->rhead->fch_count == 0);
+	memset(rhead, 0, sizeof(struct xfs_getfsrefs_head));
+	rhead->fch_count = QUERY_BATCH_SIZE;
+	rhead->fch_keys[0].fcr_device = dev;
+	rhead->fch_keys[0].fcr_physical = physical;
+	rhead->fch_keys[1].fcr_device = dev;
+	rhead->fch_keys[1].fcr_physical = physical + length;
+	rhead->fch_keys[1].fcr_owners = ULLONG_MAX;
+	rhead->fch_keys[1].fcr_flags = UINT_MAX;
+
+	trace_fsrefs(req, "dev %u:%u physical 0x%llx bytecount 0x%llx highkey 0x%llx",
+			major(dev), minor(dev),
+			(unsigned long long)physical,
+			(unsigned long long)length,
+			(unsigned long long)rhead->fch_keys[1].fcr_physical);
+	req->trace_indent++;
+}
+
+static inline void
+end_fsrefs_query(
+	struct clearspace_req	*req)
+{
+	req->trace_indent--;
+	req->rhead->fch_count = 0;
+}
+
+/* Set us up for the next run_fsrefs_query, or return false. */
+static inline bool
+advance_fsrefs_query(struct xfs_getfsrefs_head *rhead)
+{
+	struct xfs_getfsrefs	*rrec;
+
+	rrec = &rhead->fch_recs[rhead->fch_entries - 1];
+	if (rrec->fcr_flags & FCR_OF_LAST)
+		return false;
+
+	xfs_getfsrefs_advance(rhead);
+	return true;
+}
+
+/*
+ * Run a GETFSREFCOUNTS query.  Returns 1 if there are rows, 0 if there are
+ * no rows, or -1 for error.
+ */
+static inline int
+run_fsrefs_query(
+	struct clearspace_req	*req)
+{
+	struct xfs_getfsrefs_head	*rhead = req->rhead;
+	int			ret;
+
+	if (rhead->fch_entries > 0 && !advance_fsrefs_query(rhead))
+		return 0;
+
+	trace_fsrefs(req,
+ "ioctl dev %u:%u physical 0x%llx length 0x%llx highkey 0x%llx",
+			major(rhead->fch_keys[0].fcr_device),
+			minor(rhead->fch_keys[0].fcr_device),
+			(unsigned long long)rhead->fch_keys[0].fcr_physical,
+			(unsigned long long)rhead->fch_keys[0].fcr_length,
+			(unsigned long long)rhead->fch_keys[1].fcr_physical);
+
+	ret = ioctl(req->xfd->fd, XFS_IOC_GETFSREFCOUNTS, rhead);
+	if (ret) {
+		perror(_("querying refcount data"));
+		return -1;
+	}
+
+	if (!(rhead->fch_oflags & FCH_OF_DEV_T)) {
+		fprintf(stderr, _("fsrefcounts does not return dev_t.\n"));
+		return -1;
+	}
+
+	if (rhead->fch_entries == 0)
+		return 0;
+
+	return 1;
+}
+
+#define for_each_fsref_row(req, rec) \
+	for ((rec) = (req)->rhead->fch_recs; \
+	     (rec) < (req)->rhead->fch_recs + (req)->rhead->fch_entries; \
+	     (rec)++)
+
+/* Initialize the preallocated buffer for a bmapx query. */
+static void
+start_bmapx_query(
+	struct clearspace_req	*req,
+	unsigned int		fork,
+	unsigned long long	pos,
+	unsigned long long	length)
+{
+	struct getbmapx		*bhead = req->bhead;
+
+	assert(fork == BMV_IF_ATTRFORK || fork == BMV_IF_COWFORK || !fork);
+	assert(req->bhead->bmv_count == 0);
+
+	memset(bhead, 0, sizeof(struct getbmapx));
+	bhead[0].bmv_offset = BTOBB(pos);
+	bhead[0].bmv_length = BTOBB(length);
+	bhead[0].bmv_count = QUERY_BATCH_SIZE + 1;
+	bhead[0].bmv_iflags = fork | BMV_IF_PREALLOC | BMV_IF_DELALLOC;
+
+	trace_bmapx(req, "%s pos 0x%llx bytecount 0x%llx",
+			fork == BMV_IF_COWFORK ? "cow" : fork == BMV_IF_ATTRFORK ? "attr" : "data",
+			(unsigned long long)BBTOB(bhead[0].bmv_offset),
+			(unsigned long long)BBTOB(bhead[0].bmv_length));
+	req->trace_indent++;
+}
+
+static inline void
+end_bmapx_query(
+	struct clearspace_req	*req)
+{
+	req->trace_indent--;
+	req->bhead->bmv_count = 0;
+}
+
+/* Set us up for the next run_bmapx_query, or return false. */
+static inline bool
+advance_bmapx_query(struct getbmapx *bhead)
+{
+	struct getbmapx		*brec;
+	unsigned long long	next_offset;
+	unsigned long long	end = bhead->bmv_offset + bhead->bmv_length;
+
+	brec = &bhead[bhead->bmv_entries];
+	if (brec->bmv_oflags & BMV_OF_LAST)
+		return false;
+
+	next_offset = brec->bmv_offset + brec->bmv_length;
+	if (next_offset > end)
+		return false;
+
+	bhead->bmv_offset = next_offset;
+	bhead->bmv_length = end - next_offset;
+	return true;
+}
+
+/*
+ * Run a GETBMAPX query.  Returns 1 if there are rows, 0 if there are no rows,
+ * or -1 for error.
+ */
+static inline int
+run_bmapx_query(
+	struct clearspace_req	*req,
+	int			fd)
+{
+	struct getbmapx		*bhead = req->bhead;
+	unsigned int		fork;
+	int			ret;
+
+	if (bhead->bmv_entries > 0 && !advance_bmapx_query(bhead))
+		return 0;
+
+	fork = bhead[0].bmv_iflags & (BMV_IF_COWFORK | BMV_IF_ATTRFORK);
+	trace_bmapx(req, "ioctl %s pos 0x%llx bytecount 0x%llx",
+			fork == BMV_IF_COWFORK ? "cow" : fork == BMV_IF_ATTRFORK ? "attr" : "data",
+			(unsigned long long)BBTOB(bhead[0].bmv_offset),
+			(unsigned long long)BBTOB(bhead[0].bmv_length));
+
+	ret = ioctl(fd, XFS_IOC_GETBMAPX, bhead);
+	if (ret) {
+		perror(_("querying bmapx data"));
+		return -1;
+	}
+
+	if (bhead->bmv_entries == 0)
+		return 0;
+
+	return 1;
+}
+
+#define for_each_bmapx_row(req, rec) \
+	for ((rec) = (req)->bhead + 1; \
+	     (rec) < (req)->bhead + 1 + (req)->bhead->bmv_entries; \
+	     (rec)++)
+
+static inline void
+csp_dump_bmapx_row(
+	struct clearspace_req	*req,
+	unsigned int		nr,
+	const struct getbmapx	*brec)
+{
+	if (brec->bmv_block == -1) {
+		trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx hole",
+				nr,
+				(unsigned long long)BBTOB(brec->bmv_offset),
+				(unsigned long long)BBTOB(brec->bmv_length));
+		return;
+	}
+
+	if (brec->bmv_block == -2) {
+		trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx delalloc",
+				nr,
+				(unsigned long long)BBTOB(brec->bmv_offset),
+				(unsigned long long)BBTOB(brec->bmv_length));
+		return;
+	}
+
+	trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx phys 0x%llx flags 0x%x",
+			nr,
+			(unsigned long long)BBTOB(brec->bmv_offset),
+			(unsigned long long)BBTOB(brec->bmv_length),
+			(unsigned long long)BBTOB(brec->bmv_block),
+			brec->bmv_oflags);
+}
+
+static inline void
+csp_dump_bmapx(
+	struct clearspace_req	*req,
+	int			fd,
+	unsigned int		indent,
+	const char		*tag)
+{
+	unsigned int		nr;
+	int			ret;
+
+	trace_dumpfile(req, "DUMP BMAP OF DATA FORK %s", tag);
+	start_bmapx_query(req, 0, req->start, req->length);
+	nr = 0;
+	while ((ret = run_bmapx_query(req, fd)) > 0) {
+		struct getbmapx	*brec;
+
+		for_each_bmapx_row(req, brec) {
+			csp_dump_bmapx_row(req, nr++, brec);
+			if (nr > 10)
+				goto dump_cow;
+		}
+	}
+
+dump_cow:
+	end_bmapx_query(req);
+	trace_dumpfile(req, "DUMP BMAP OF COW FORK %s", tag);
+	start_bmapx_query(req, BMV_IF_COWFORK, req->start, req->length);
+	nr = 0;
+	while ((ret = run_bmapx_query(req, fd)) > 0) {
+		struct getbmapx	*brec;
+
+		for_each_bmapx_row(req, brec) {
+			csp_dump_bmapx_row(req, nr++, brec);
+			if (nr > 10)
+				goto dump_attr;
+		}
+	}
+
+dump_attr:
+	end_bmapx_query(req);
+	trace_dumpfile(req, "DUMP BMAP OF ATTR FORK %s", tag);
+	start_bmapx_query(req, BMV_IF_ATTRFORK, req->start, req->length);
+	nr = 0;
+	while ((ret = run_bmapx_query(req, fd)) > 0) {
+		struct getbmapx	*brec;
+
+		for_each_bmapx_row(req, brec) {
+			csp_dump_bmapx_row(req, nr++, brec);
+			if (nr > 10)
+				goto stop;
+		}
+	}
+
+stop:
+	end_bmapx_query(req);
+	trace_dumpfile(req, "DONE DUMPING %s", tag);
+}
+
+/* Return the first bmapx for the given file range. */
+static int
+bmapx_one(
+	struct clearspace_req	*req,
+	int			fd,
+	unsigned long long	pos,
+	unsigned long long	length,
+	struct getbmapx		*brec)
+{
+	struct getbmapx		bhead[2];
+	int			ret;
+
+	memset(bhead, 0, sizeof(struct getbmapx) * 2);
+	bhead[0].bmv_offset = BTOBB(pos);
+	bhead[0].bmv_length = BTOBB(length);
+	bhead[0].bmv_count = 2;
+	bhead[0].bmv_iflags = BMV_IF_PREALLOC | BMV_IF_DELALLOC;
+
+	ret = ioctl(fd, XFS_IOC_GETBMAPX, bhead);
+	if (ret) {
+		perror(_("simple bmapx query"));
+		return -1;
+	}
+
+	if (bhead->bmv_entries > 0) {
+		memcpy(brec, &bhead[1], sizeof(struct getbmapx));
+		return 0;
+	}
+
+	memset(brec, 0, sizeof(struct getbmapx));
+	brec->bmv_offset = BTOBB(pos);
+	brec->bmv_block = -1;	/* hole */
+	brec->bmv_length = BTOBB(length);
+	return 0;
+}
+
+/* Constrain space map records. */
+static void
+__trim_fsmap(
+	uint64_t		start,
+	uint64_t		length,
+	struct fsmap		*fsmap)
+{
+	unsigned long long	delta, end;
+	bool			need_off;
+
+	need_off = !(fsmap->fmr_flags & (FMR_OF_EXTENT_MAP |
+					 FMR_OF_SPECIAL_OWNER));
+
+	if (fsmap->fmr_physical < start) {
+		delta = start - fsmap->fmr_physical;
+		fsmap->fmr_physical = start;
+		fsmap->fmr_length -= delta;
+		if (need_off)
+			fsmap->fmr_offset += delta;
+	}
+
+	end = fsmap->fmr_physical + fsmap->fmr_length;
+	if (end > start + length) {
+		delta = end - (start + length);
+		fsmap->fmr_length -= delta;
+	}
+}
+
+static inline void
+trim_target_fsmap(const struct clearspace_tgt *tgt, struct fsmap *fsmap)
+{
+	__trim_fsmap(tgt->start, tgt->length, fsmap);
+}
+
+static inline void
+trim_request_fsmap(const struct clearspace_req *req, struct fsmap *fsmap)
+{
+	__trim_fsmap(req->start, req->length, fsmap);
+}
+
+/* Actual space clearing code */
+
+/*
+ * Map all the free space in the region that we're clearing to the space
+ * catcher file.
+ */
+static int
+csp_grab_free_space(
+	struct clearspace_req	*req)
+{
+	struct xfs_map_freesp	args = {
+		.offset		= req->start,
+		.len		= req->length,
+	};
+	int			ret;
+
+	trace_grabfree(req, "start 0x%llx length 0x%llx",
+			(unsigned long long)req->start,
+			(unsigned long long)req->length);
+
+	ret = ioctl(req->space_fd, XFS_IOC_MAP_FREESP, &args);
+	if (ret) {
+		perror(_("map free space to space capture file"));
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Rank a refcount record.  We prefer to tackle highly shared and longer
+ * extents first.
+ */
+static inline unsigned long long
+csp_space_prio(
+	const struct xfs_fsop_geom	*g,
+	const struct xfs_getfsrefs	*p)
+{
+	unsigned long long		blocks = p->fcr_length / g->blocksize;
+	unsigned long long		ret = blocks * p->fcr_owners;
+
+	if (ret < blocks || ret < p->fcr_owners)
+		return UINT64_MAX;
+	return ret;
+}
+
+/* Make the current refcount record the clearing target if desirable. */
+static void
+csp_adjust_target(
+	struct clearspace_req		*req,
+	struct clearspace_tgt		*target,
+	const struct xfs_getfsrefs	*rec,
+	unsigned long long		prio)
+{
+	if (prio < target->prio)
+		return;
+	if (prio == target->prio &&
+	    rec->fcr_length <= target->length)
+		return;
+
+	/* Ignore results that go beyond the end of what we wanted. */
+	if (rec->fcr_physical >= req->start + req->length)
+		return;
+
+	/* Ignore regions that we already tried to clear. */
+	if (bitmap_test(req->visited, rec->fcr_physical, rec->fcr_length))
+		return;
+
+	trace_target(req,
+ "set target, prio 0x%llx -> 0x%llx phys 0x%llx bytecount 0x%llx",
+			target->prio, prio,
+			(unsigned long long)rec->fcr_physical,
+			(unsigned long long)rec->fcr_length);
+
+	target->start = rec->fcr_physical;
+	target->length = rec->fcr_length;
+	target->owners = rec->fcr_owners;
+	target->prio = prio;
+}
+
+/*
+ * Decide if this refcount record maps to extents that are sufficiently
+ * interesting to target.
+ */
+static int
+csp_evaluate_refcount(
+	struct clearspace_req		*req,
+	const struct xfs_getfsrefs	*rrec,
+	struct clearspace_tgt		*target)
+{
+	const struct xfs_fsop_geom	*fsgeom = &req->xfd->fsgeom;
+	unsigned long long		prio = csp_space_prio(fsgeom, rrec);
+	int				ret;
+
+	if (rrec->fcr_device != req->dev)
+		return 0;
+
+	if (prio < target->prio)
+		return 0;
+
+	/*
+	 * XFS only supports sharing data blocks.  If there's more than one
+	 * owner, we know that we can easily move the blocks.
+	 */
+	if (rrec->fcr_owners > 1) {
+		csp_adjust_target(req, target, rrec, prio);
+		return 0;
+	}
+
+	/*
+	 * Otherwise, this extent has a single owner.  Walk the fsmap records
+	 * to figure out if the mappings are movable or not.
+	 */
+	start_fsmap_query(req, rrec->fcr_device, rrec->fcr_physical,
+			rrec->fcr_length);
+	while ((ret = run_fsmap_query(req)) > 0) {
+		struct fsmap	*mrec;
+		uint64_t	next_phys = 0;
+
+		for_each_fsmap_row(req, mrec) {
+			struct xfs_getfsrefs	fake_rec = { };
+
+			trace_fsmap_rec(req, CSP_TRACE_TARGET, mrec);
+
+			if (mrec->fmr_device != rrec->fcr_device)
+				continue;
+			if (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER)
+				continue;
+			if (csp_is_internal_owner(req, mrec->fmr_owner))
+				continue;
+
+			/*
+			 * If the space has become shared since the fsrefs
+			 * query, just skip this record.  We might come back to
+			 * it in a later iteration.
+			 */
+			if (mrec->fmr_physical < next_phys)
+				continue;
+
+			/* Fake enough of a fsrefs to calculate the priority. */
+			fake_rec.fcr_physical = mrec->fmr_physical;
+			fake_rec.fcr_length = mrec->fmr_length;
+			fake_rec.fcr_owners = 1;
+			prio = csp_space_prio(fsgeom, &fake_rec);
+
+			/* Target unwritten extents first; they're cheap. */
+			if (mrec->fmr_flags & FMR_OF_PREALLOC)
+				prio |= (1ULL << 63);
+
+			csp_adjust_target(req, target, &fake_rec, prio);
+
+			next_phys = mrec->fmr_physical + mrec->fmr_length;
+		}
+	}
+	end_fsmap_query(req);
+
+	return ret;
+}
+
+/*
+ * Given a range of storage to search, find the most appealing target for space
+ * clearing.  If nothing suitable is found, the target will be zeroed.
+ */
+static int
+csp_find_target(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target)
+{
+	int			ret;
+
+	memset(target, 0, sizeof(struct clearspace_tgt));
+
+	start_fsrefs_query(req, req->dev, req->start, req->length);
+	while ((ret = run_fsrefs_query(req)) > 0) {
+		struct xfs_getfsrefs	*rrec;
+
+		for_each_fsref_row(req, rrec) {
+			trace_fsrefs_rec(req, CSP_TRACE_TARGET, rrec);
+			ret = csp_evaluate_refcount(req, rrec, target);
+			if (ret) {
+				end_fsrefs_query(req);
+				return ret;
+			}
+		}
+	}
+	end_fsrefs_query(req);
+
+	if (target->length != 0) {
+		/*
+		 * Mark this extent visited so that we won't try again this
+		 * round.
+		 */
+		trace_bitmap(req, "set filedata start 0x%llx length 0x%llx",
+				target->start, target->length);
+		ret = bitmap_set(req->visited, target->start, target->length);
+		if (ret) {
+			perror(_("marking file extent visited"));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/* Try to evacuate blocks by using online repair. */
+static int
+csp_evac_file_metadata(
+	struct clearspace_req		*req,
+	struct clearspace_tgt		*target,
+	const struct fsmap		*mrec,
+	int				fd,
+	const struct xfs_bulkstat	*bulkstat)
+{
+	struct xfs_scrub_metadata	scrub = {
+		.sm_type		= XFS_SCRUB_TYPE_PROBE,
+		.sm_flags		= XFS_SCRUB_IFLAG_REPAIR |
+					  XFS_SCRUB_IFLAG_FORCE_REBUILD,
+	};
+	struct xfs_fd			*xfd = req->xfd;
+	int				ret;
+
+	trace_xrebuild(req,
+ "ino 0x%llx pos 0x%llx bytecount 0x%llx phys 0x%llx flags 0x%llx",
+				(unsigned long long)mrec->fmr_owner,
+				(unsigned long long)mrec->fmr_offset,
+				(unsigned long long)mrec->fmr_length,
+				(unsigned long long)mrec->fmr_physical,
+				(unsigned long long)mrec->fmr_flags);
+
+	if (fd == -1) {
+		scrub.sm_ino = mrec->fmr_owner;
+		scrub.sm_gen = bulkstat->bs_gen;
+		fd = xfd->fd;
+	}
+
+	if (mrec->fmr_flags & FMR_OF_ATTR_FORK) {
+		if (mrec->fmr_flags & FMR_OF_EXTENT_MAP)
+			scrub.sm_type = XFS_SCRUB_TYPE_BMBTA;
+		else
+			scrub.sm_type = XFS_SCRUB_TYPE_XATTR;
+	} else if (mrec->fmr_flags & FMR_OF_EXTENT_MAP) {
+		scrub.sm_type = XFS_SCRUB_TYPE_BMBTD;
+	} else if (S_ISLNK(bulkstat->bs_mode)) {
+		scrub.sm_type = XFS_SCRUB_TYPE_SYMLINK;
+	} else if (S_ISDIR(bulkstat->bs_mode)) {
+		scrub.sm_type = XFS_SCRUB_TYPE_DIR;
+	}
+
+	if (scrub.sm_type == XFS_SCRUB_TYPE_PROBE)
+		return 0;
+
+	trace_xrebuild(req, "ino 0x%llx gen 0x%x type %u",
+			(unsigned long long)mrec->fmr_owner,
+			(unsigned int)bulkstat->bs_gen,
+			(unsigned int)scrub.sm_type);
+
+	ret = ioctl(fd, XFS_IOC_SCRUB_METADATA, &scrub);
+	if (ret) {
+		fprintf(stderr,
+	_("evacuating inode 0x%llx metadata type %u: %s\n"),
+				(unsigned long long)mrec->fmr_owner,
+				scrub.sm_type, strerror(errno));
+		return -1;
+	}
+
+	target->evacuated++;
+	return 0;
+}
+
+/*
+ * Open an inode via handle.  Returns a file descriptor, -2 if the file is
+ * gone, or -1 on error.
+ */
+static int
+csp_open_by_handle(
+	struct clearspace_req	*req,
+	int			oflags,
+	uint64_t		ino,
+	uint32_t		gen)
+{
+	struct xfs_handle	handle = { };
+	struct xfs_fsop_handlereq hreq = {
+		.oflags		= oflags | O_NOATIME | O_NOFOLLOW |
+				  O_NOCTTY | O_LARGEFILE,
+		.ihandle	= &handle,
+		.ihandlen	= sizeof(handle),
+	};
+	int			ret;
+
+	memcpy(&handle.ha_fsid, req->fshandle, sizeof(handle.ha_fsid));
+	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+			sizeof(handle.ha_fid.fid_len);
+	handle.ha_fid.fid_pad = 0;
+	handle.ha_fid.fid_ino = ino;
+	handle.ha_fid.fid_gen = gen;
+
+	/*
+	 * Since we extracted the fshandle from the open file instead of using
+	 * path_to_fshandle, the fsid cache doesn't know about the fshandle.
+	 * Construct the open by handle request manually.
+	 */
+	ret = ioctl(req->xfd->fd, XFS_IOC_OPEN_BY_HANDLE, &hreq);
+	if (ret < 0) {
+		if (errno == ENOENT || errno == EINVAL)
+			return -2;
+
+		fprintf(stderr, _("open inode 0x%llx: %s\n"),
+				(unsigned long long)ino,
+				strerror(errno));
+		return -1;
+	}
+
+	return ret;
+}
+
+/*
+ * Open a file for evacuation.  Returns a positive errno on error; a fd in @fd
+ * if the caller is supposed to do something; or @fd == -1 if there's nothing
+ * further to do.
+ */
+static int
+csp_evac_open(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target,
+	const struct fsmap	*mrec,
+	struct xfs_bulkstat	*bulkstat,
+	int			oflags,
+	int			*fd)
+{
+	struct xfs_bulkstat	__bs;
+	int			target_fd;
+	int			ret;
+
+	*fd = -1;
+
+	if (csp_is_internal_owner(req, mrec->fmr_owner) ||
+	    (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER))
+		goto nothing_to_do;
+
+	if (bulkstat == NULL)
+		bulkstat = &__bs;
+
+	/*
+	 * Snapshot this file so that we can perform a fresh-only exchange.
+	 * For other types of files we just skip to the evacuation step.
+	 */
+	ret = -xfrog_bulkstat_single(req->xfd, mrec->fmr_owner, 0, bulkstat);
+	if (ret) {
+		if (ret == ENOENT || ret == EINVAL)
+			goto nothing_to_do;
+
+		fprintf(stderr, _("bulkstat inode 0x%llx: %s\n"),
+				(unsigned long long)mrec->fmr_owner,
+				strerror(ret));
+		return ret;
+	}
+
+	/*
+	 * If we get stats for a different inode, the file may have been freed
+	 * out from under us and there's nothing to do.
+	 */
+	if (bulkstat->bs_ino != mrec->fmr_owner)
+		goto nothing_to_do;
+
+	/*
+	 * We're only allowed to open regular files and directories via handle
+	 * so jump to online rebuild for all other file types.
+	 */
+	if (!S_ISREG(bulkstat->bs_mode) && !S_ISDIR(bulkstat->bs_mode))
+		return csp_evac_file_metadata(req, target, mrec, -1,
+				bulkstat);
+
+	if (S_ISDIR(bulkstat->bs_mode))
+		oflags = O_RDONLY;
+
+	target_fd = csp_open_by_handle(req, oflags, mrec->fmr_owner,
+			bulkstat->bs_gen);
+	if (target_fd == -2)
+		goto nothing_to_do;
+	if (target_fd < 0)
+		return -target_fd;
+
+	/*
+	 * Exchange only works for regular file data blocks.  If that isn't the
+	 * case, our only recourse is online rebuild.
+	 */
+	if (S_ISDIR(bulkstat->bs_mode) ||
+	    (mrec->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP))) {
+		int	ret2;
+
+		ret = csp_evac_file_metadata(req, target, mrec, target_fd,
+				bulkstat);
+		ret2 = close(target_fd);
+		if (!ret && ret2)
+			ret = ret2;
+		return ret;
+	}
+
+	*fd = target_fd;
+	return 0;
+
+nothing_to_do:
+	target->try_again = true;
+	return 0;
+}
+
+/* Unshare the space in the work file that we're using for deduplication. */
+static int
+csp_unshare_workfile(
+	struct clearspace_req	*req,
+	unsigned long long	start,
+	unsigned long long	length)
+{
+	int			ret;
+
+	trace_falloc(req, "funshare workfd pos 0x%llx bytecount 0x%llx",
+			start, length);
+
+	ret = fallocate(req->work_fd, FALLOC_FL_UNSHARE_RANGE, start, length);
+	if (ret) {
+		perror(_("unsharing work file"));
+		return ret;
+	}
+
+	ret = fsync(req->work_fd);
+	if (ret) {
+		perror(_("syncing work file"));
+		return ret;
+	}
+
+	/* Make sure we didn't get any space within the clearing range. */
+	start_bmapx_query(req, 0, start, length);
+	while ((ret = run_bmapx_query(req, req->work_fd)) > 0) {
+		struct getbmapx	*brec;
+
+		for_each_bmapx_row(req, brec) {
+			unsigned long long	p, l;
+
+			trace_bmapx_rec(req, CSP_TRACE_FALLOC, brec);
+			p = BBTOB(brec->bmv_block);
+			l = BBTOB(brec->bmv_length);
+
+			if (p + l <= req->start ||
+			    p >= req->start + req->length)
+				continue;
+
+			trace_prep(req,
+	"workfd has extent inside clearing range, phys 0x%llx bytecount 0x%llx",
+					p, l);
+			end_bmapx_query(req);
+			return -1;
+		}
+	}
+	end_bmapx_query(req);
+
+	return 0;
+}
+
+/* Try to deduplicate every block in the fdr request, if we can. */
+static int
+csp_evac_dedupe_loop(
+	struct clearspace_req		*req,
+	struct clearspace_tgt		*target,
+	unsigned long long		ino,
+	int				max_reqlen)
+{
+	struct file_dedupe_range	*fdr = req->fdr;
+	struct file_dedupe_range_info	*info = &fdr->info[0];
+	loff_t				last_unshare_off = -1;
+	int				ret;
+
+	while (fdr->src_length > 0) {
+		struct getbmapx		brec;
+		bool			same;
+		unsigned long long	old_reqlen = fdr->src_length;
+
+		if (max_reqlen && fdr->src_length > max_reqlen)
+			fdr->src_length = max_reqlen;
+
+		trace_dedupe(req, "ino 0x%llx pos 0x%llx bytecount 0x%llx",
+				ino,
+				(unsigned long long)info->dest_offset,
+				(unsigned long long)fdr->src_length);
+
+		ret = bmapx_one(req, req->work_fd, fdr->src_offset,
+				fdr->src_length, &brec);
+		if (ret)
+			return ret;
+
+		trace_dedupe(req, "workfd pos 0x%llx phys 0x%llx",
+				(unsigned long long)fdr->src_offset,
+				(unsigned long long)BBTOB(brec.bmv_block));
+
+		ret = deduperange(req->work_fd, fdr, &same);
+		if (ret == ENOSPC && last_unshare_off < fdr->src_offset) {
+			req->trace_indent++;
+			trace_dedupe(req, "funshare workfd at phys 0x%llx",
+					(unsigned long long)fdr->src_offset);
+			/*
+			 * If we ran out of space, it's possible that we have
+			 * reached the maximum sharing factor of the blocks in
+			 * the work file.  Try unsharing the range of the work
+			 * file to get a singly-owned range and loop again.
+			 */
+			ret = csp_unshare_workfile(req, fdr->src_offset,
+					fdr->src_length);
+			req->trace_indent--;
+			if (ret)
+				return ret;
+
+			ret = fsync(req->work_fd);
+			if (ret) {
+				perror(_("sync after unshare work file"));
+				return ret;
+			}
+
+			last_unshare_off = fdr->src_offset;
+			fdr->src_length = old_reqlen;
+			continue;
+		}
+		if (ret == EINVAL) {
+			/*
+			 * If we can't dedupe the block, it's possible that
+			 * src_fd was punched or truncated out from under us.
+			 * Treat this the same way we would if the contents
+			 * didn't match.
+			 */
+			trace_dedupe(req, "cannot evac space, moving on", 0);
+			same = false;
+			ret = 0;
+		}
+		if (ret) {
+			fprintf(stderr, _("evacuating inode 0x%llx: %s\n"),
+					ino, strerror(ret));
+			return ret;
+		}
+
+		if (same) {
+			req->trace_indent++;
+			trace_dedupe(req,
+	"evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx",
+					ino,
+					(unsigned long long)info->dest_offset,
+					(unsigned long long)info->bytes_deduped);
+			req->trace_indent--;
+
+			target->evacuated++;
+		} else {
+			req->trace_indent++;
+			trace_dedupe(req,
+	"failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx",
+					ino,
+					(unsigned long long)info->dest_offset,
+					(unsigned long long)fdr->src_length);
+			req->trace_indent--;
+
+			target->try_again = true;
+
+			/*
+			 * If we aren't single-stepping the deduplication,
+			 * stop early so that the caller goes into single-step
+			 * mode.
+			 */
+			if (!max_reqlen) {
+				fdr->src_length = old_reqlen;
+				return 0;
+			}
+
+			/* Contents changed, move on to the next block. */
+			info->bytes_deduped = fdr->src_length;
+		}
+		fdr->src_length = old_reqlen;
+
+		fdr->src_offset += info->bytes_deduped;
+		info->dest_offset += info->bytes_deduped;
+		fdr->src_length -= info->bytes_deduped;
+	}
+
+	return 0;
+}
+
+/*
+ * Evacuate one fsmapping by using dedupe to remap data stored in the target
+ * range to a copy stored in the work file.
+ */
+static int
+csp_evac_dedupe_fsmap(
+	struct clearspace_req		*req,
+	struct clearspace_tgt		*target,
+	const struct fsmap		*mrec)
+{
+	struct file_dedupe_range	*fdr = req->fdr;
+	struct file_dedupe_range_info	*info = &fdr->info[0];
+	bool				can_single_step;
+	int				target_fd;
+	int				ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	ret = csp_evac_open(req, target, mrec, NULL, O_RDONLY, &target_fd);
+	if (ret || target_fd < 0)
+		return ret;
+
+	/*
+	 * Use dedupe to try to shift the target file's mappings to use the
+	 * copy of the data that's in the work file.
+	 */
+	fdr->src_offset = mrec->fmr_physical;
+	fdr->src_length = mrec->fmr_length;
+	fdr->dest_count = 1;
+	info->dest_fd = target_fd;
+	info->dest_offset = mrec->fmr_offset;
+
+	can_single_step = mrec->fmr_length > req->xfd->fsgeom.blocksize;
+
+	/* First we try to do the entire thing all at once. */
+	ret = csp_evac_dedupe_loop(req, target, mrec->fmr_owner, 0);
+	if (ret)
+		goto out_fd;
+
+	/* If there's any work left, try again one block at a time. */
+	if (can_single_step && fdr->src_length > 0) {
+		ret = csp_evac_dedupe_loop(req, target, mrec->fmr_owner,
+				req->xfd->fsgeom.blocksize);
+		if (ret)
+			goto out_fd;
+	}
+
+out_fd:
+	ret2 = close(target_fd);
+	if (!ret && ret2)
+		ret = ret2;
+	return ret;
+}
+
+/*
+ * Evacuate a prealloc fsmapping by using exchangerange to move the
+ * preallocation to the work file.
+ */
+static int
+csp_evac_exchange_prealloc(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target,
+	const struct fsmap	*mrec)
+{
+	struct xfs_bulkstat	bulkstat;
+	struct xfs_commit_range	xcr;
+	struct getbmapx		brec;
+	int			target_fd;
+	int			ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	ret = csp_evac_open(req, target, mrec, &bulkstat, O_RDWR, &target_fd);
+	if (ret || target_fd < 0)
+		return ret;
+
+	ret = xfrog_commitrange_prep(&xcr, target_fd, mrec->fmr_offset,
+			req->work_fd, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		perror(_("preparing for commit"));
+		goto out_fd;
+	}
+
+	/*
+	 * Now that we've snapshotted target_fd, check that the mapping we're
+	 * after is still one large preallocation.  If it isn't, then we tell
+	 * the caller to try again.
+	 */
+	ret = bmapx_one(req, target_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;
+
+	trace_exchange(req,
+ "targetfd pos 0x%llx offset 0x%llx phys 0x%llx len 0x%llx prealloc? %d",
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)BBTOB(brec.bmv_offset),
+			(unsigned long long)BBTOB(brec.bmv_block),
+			(unsigned long long)BBTOB(brec.bmv_length),
+			!!(brec.bmv_oflags & BMV_OF_PREALLOC));
+
+	if (BBTOB(brec.bmv_offset) > mrec->fmr_offset ||
+	    BBTOB(brec.bmv_offset + brec.bmv_length) <
+					mrec->fmr_offset + mrec->fmr_length ||
+	    !(brec.bmv_oflags & BMV_OF_PREALLOC)) {
+		req->trace_indent++;
+		trace_exchange(req,
+ "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx",
+				bulkstat.bs_ino,
+				(unsigned long long)mrec->fmr_offset,
+				(unsigned long long)mrec->fmr_length);
+		req->trace_indent--;
+		target->try_again = true;
+		goto out_fd;
+	}
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncating work file"));
+		goto out_fd;
+	}
+
+	/*
+	 * Create a preallocation in the work file to match the one in the
+	 * file that we're evacuating.
+	 */
+	ret = fallocate(req->work_fd, 0, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		fprintf(stderr,
+ _("copying target file preallocation to work file: %s\n"),
+				strerror(errno));
+		goto out_fd;
+	}
+
+	ret = bmapx_one(req, req->work_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;
+
+	trace_exchange(req, "workfd pos 0x%llx off 0x%llx phys 0x%llx",
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)BBTOB(brec.bmv_offset),
+			(unsigned long long)BBTOB(brec.bmv_block));
+
+	/*
+	 * Exchange the mappings, with the freshness check enabled.  This
+	 * should result in the target file being switched to new blocks unless
+	 * it has changed, in which case we bounce out and find a new target.
+	 */
+	ret = xfrog_commitrange(target_fd, &xcr, 0);
+	if (ret) {
+		if (ret == EBUSY) {
+			req->trace_indent++;
+			trace_exchange(req,
+ "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx",
+					bulkstat.bs_ino,
+					(unsigned long long)mrec->fmr_offset,
+					(unsigned long long)mrec->fmr_length);
+			req->trace_indent--;
+			target->try_again = true;
+		} else {
+			fprintf(stderr,
+	_("exchanging target and work file contents: %s\n"),
+					strerror(ret));
+		}
+		goto out_fd;
+	}
+
+	req->trace_indent++;
+	trace_exchange(req,
+ "evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx",
+			bulkstat.bs_ino,
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)mrec->fmr_length);
+	req->trace_indent--;
+	target->evacuated++;
+
+out_fd:
+	ret2 = close(target_fd);
+	if (!ret && ret2)
+		ret = ret2;
+	return ret;
+}
+
+/* Use deduplication to remap data extents away from where we're clearing. */
+static int
+csp_evac_dedupe(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target)
+{
+	int			ret;
+
+	start_fsmap_query(req, req->dev, target->start, target->length);
+	while ((ret = run_fsmap_query(req)) > 0) {
+		struct fsmap	*mrec;
+
+		for_each_fsmap_row(req, mrec) {
+			trace_fsmap_rec(req, CSP_TRACE_DEDUPE, mrec);
+			trim_target_fsmap(target, mrec);
+
+			req->trace_indent++;
+			if (mrec->fmr_flags & FMR_OF_PREALLOC)
+				ret = csp_evac_exchange_prealloc(req, target,
+						mrec);
+			else
+				ret = csp_evac_dedupe_fsmap(req, target, mrec);
+			req->trace_indent--;
+			if (ret)
+				goto out;
+
+			ret = csp_grab_free_space(req);
+			if (ret)
+				goto out;
+		}
+	}
+
+out:
+	end_fsmap_query(req);
+	if (ret)
+		trace_dedupe(req, "ret %d", ret);
+	return ret;
+}
+
+#define BUFFERCOPY_BUFSZ		65536
+
+/* Use a memory buffer to copy part of src_fd to dst_fd, or return an errno. */
+static int
+csp_buffercopy(
+	struct clearspace_req	*req,
+	int			src_fd,
+	loff_t			src_off,
+	int			dst_fd,
+	loff_t			dst_off,
+	loff_t			len)
+{
+	int			ret = 0;
+
+	while (len > 0) {
+		size_t count = min(BUFFERCOPY_BUFSZ, len);
+		ssize_t bytes_read, bytes_written;
+
+		bytes_read = pread(src_fd, req->buf, count, src_off);
+		if (bytes_read <= 0) {
+			/* Stop at EOF so a short source cannot spin forever. */
+			ret = bytes_read < 0 ? errno : 0;
+			break;
+		}
+
+		bytes_written = pwrite(dst_fd, req->buf, bytes_read, dst_off);
+		if (bytes_written < 0) {
+			ret = errno;
+			break;
+		}
+
+		src_off += bytes_written;
+		dst_off += bytes_written;
+		len -= bytes_written;
+	}
+
+	return ret;
+}
+
+/*
+ * Prepare the work file to assist in evacuating file data by copying the
+ * contents of the frozen space into the work file.
+ */
+static int
+csp_prepare_for_dedupe(
+	struct clearspace_req	*req)
+{
+	struct file_clone_range	fcr;
+	struct stat		statbuf;
+	loff_t			datapos = 0;
+	loff_t			length = 0;
+	int			ret;
+
+	ret = fstat(req->space_fd, &statbuf);
+	if (ret) {
+		perror(_("space capture file"));
+		return ret;
+	}
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncate work file"));
+		return ret;
+	}
+
+	ret = ftruncate(req->work_fd, statbuf.st_size);
+	if (ret) {
+		perror(_("reset work file"));
+		return ret;
+	}
+
+	/* Make a working copy of the frozen file data. */
+	start_spacefd_iter(req);
+	while ((ret = spacefd_data_iter(req, &datapos, &length)) > 0) {
+		trace_prep(req, "clone spacefd data 0x%llx length 0x%llx",
+				(long long)datapos, (long long)length);
+
+		fcr.src_fd = req->space_fd;
+		fcr.src_offset = datapos;
+		fcr.src_length = length;
+		fcr.dest_offset = datapos;
+
+		ret = clonerange(req->work_fd, &fcr);
+		if (ret == ENOSPC) {
+			req->trace_indent++;
+			trace_prep(req,
+	"falling back to buffered copy at 0x%llx",
+					(long long)datapos);
+			req->trace_indent--;
+			ret = csp_buffercopy(req, req->space_fd, datapos,
+					req->work_fd, datapos, length);
+		}
+		if (ret) {
+			perror(
+	_("copying space capture file contents to work file"));
+			return ret;
+		}
+	}
+	end_spacefd_iter(req);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Unshare the work file so that it contains an identical copy of the
+	 * contents of the space capture file but mapped to different blocks.
+	 * This is key to using dedupe to migrate file space away from the
+	 * requested region.
+	 */
+	req->trace_indent++;
+	ret = csp_unshare_workfile(req, req->start, req->length);
+	req->trace_indent--;
+	return ret;
+}
+
+/*
+ * Evacuate one fsmapping by copying the file data to the work file and
+ * exchanging the work file's mappings with the target file's.
+ */
+static int
+csp_evac_exchange_fsmap(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target,
+	const struct fsmap	*mrec)
+{
+	struct xfs_bulkstat	bulkstat;
+	struct xfs_commit_range	xcr;
+	struct getbmapx		brec;
+	int			target_fd;
+	int			ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	ret = csp_evac_open(req, target, mrec, &bulkstat, O_RDWR, &target_fd);
+	if (ret || target_fd < 0)
+		return ret;
+
+	ret = xfrog_commitrange_prep(&xcr, target_fd, mrec->fmr_offset,
+			req->work_fd, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		perror(_("preparing for commit"));
+		goto out_fd;
+	}
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncating work file"));
+		goto out_fd;
+	}
+
+	/*
+	 * Copy the data from the original file to the work file.  We assume
+	 * that the work file will end up with different data blocks and that
+	 * they're outside of the requested range.
+	 */
+	ret = csp_buffercopy(req, target_fd, mrec->fmr_offset, req->work_fd,
+			mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		fprintf(stderr, _("copying target file to work file: %s\n"),
+				strerror(ret));
+		goto out_fd;
+	}
+
+	ret = fsync(req->work_fd);
+	if (ret) {
+		perror(_("flush work file for fiexchange"));
+		goto out_fd;
+	}
+
+	ret = bmapx_one(req, req->work_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;
+
+	trace_exchange(req, "workfd pos 0x%llx phys 0x%llx",
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)BBTOB(brec.bmv_block));
+
+	/*
+	 * Exchange the mappings, with the freshness check enabled.  This
+	 * should result in the target file being switched to new blocks unless
+	 * it has changed, in which case we bounce out and find a new target.
+	 */
+	ret = xfrog_commitrange(target_fd, &xcr, 0);
+	if (ret) {
+		if (ret == EBUSY) {
+			req->trace_indent++;
+			trace_exchange(req,
+ "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx",
+					bulkstat.bs_ino,
+					(unsigned long long)mrec->fmr_offset,
+					(unsigned long long)mrec->fmr_length);
+			req->trace_indent--;
+			target->try_again = true;
+		} else {
+			fprintf(stderr,
+	_("exchanging target and work file contents: %s\n"),
+					strerror(ret));
+		}
+		goto out_fd;
+	}
+
+	req->trace_indent++;
+	trace_exchange(req,
+ "evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx",
+			bulkstat.bs_ino,
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)mrec->fmr_length);
+	req->trace_indent--;
+	target->evacuated++;
+
+out_fd:
+	ret2 = close(target_fd);
+	if (!ret && ret2)
+		ret = ret2;
+	return ret;
+}
+
+/*
+ * Try to evacuate all data blocks in the target region by copying the contents
+ * to a new file and exchanging the extents.
+ */
+static int
+csp_evac_exchange(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target)
+{
+	int			ret;
+
+	start_fsmap_query(req, req->dev, target->start, target->length);
+	while ((ret = run_fsmap_query(req)) > 0) {
+		struct fsmap	*mrec;
+
+		for_each_fsmap_row(req, mrec) {
+			trace_fsmap_rec(req, CSP_TRACE_EXCHANGE, mrec);
+			trim_target_fsmap(target, mrec);
+
+			req->trace_indent++;
+			ret = csp_evac_exchange_fsmap(req, target, mrec);
+			req->trace_indent--;
+			if (ret)
+				goto out;
+
+			ret = csp_grab_free_space(req);
+			if (ret)
+				goto out;
+		}
+	}
+out:
+	end_fsmap_query(req);
+	if (ret)
+		trace_exchange(req, "ret %d", ret);
+	return ret;
+}
+
+/* Try to evacuate blocks by using online repair to rebuild AG metadata. */
+static int
+csp_evac_ag_metadata(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target,
+	uint32_t		agno,
+	uint32_t		mask)
+{
+	struct xfs_scrub_metadata scrub = {
+		.sm_flags	= XFS_SCRUB_IFLAG_REPAIR |
+				  XFS_SCRUB_IFLAG_FORCE_REBUILD,
+	};
+	unsigned int		i;
+	int			ret;
+
+	trace_xrebuild(req, "agno 0x%x mask 0x%x",
+			(unsigned int)agno,
+			(unsigned int)mask);
+
+	for (i = XFS_SCRUB_TYPE_AGFL; i <= XFS_SCRUB_TYPE_REFCNTBT; i++) {
+		if (!(mask & (1U << i)))
+			continue;
+
+		scrub.sm_type = i;
+		scrub.sm_agno = agno;
+
+		req->trace_indent++;
+		trace_xrebuild(req, "agno %u type %u",
+				(unsigned int)agno,
+				(unsigned int)scrub.sm_type);
+		req->trace_indent--;
+
+		ret = ioctl(req->xfd->fd, XFS_IOC_SCRUB_METADATA, &scrub);
+		if (ret) {
+			if (errno == ENOENT || errno == ENOSPC)
+				continue;
+			fprintf(stderr, _("rebuilding ag %u type %u: %s\n"),
+					(unsigned int)agno, scrub.sm_type,
+					strerror(errno));
+			return -1;
+		}
+
+		target->evacuated++;
+
+		ret = csp_grab_free_space(req);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+/* Compute a scrub mask for a fsmap special owner. */
+static uint32_t
+fsmap_owner_to_scrub_mask(__u64 owner)
+{
+	switch (owner) {
+	case XFS_FMR_OWN_FREE:
+	case XFS_FMR_OWN_UNKNOWN:
+	case XFS_FMR_OWN_FS:
+	case XFS_FMR_OWN_LOG:
+		/* can't move these */
+		return 0;
+	case XFS_FMR_OWN_AG:
+		return (1U << XFS_SCRUB_TYPE_BNOBT) |
+		       (1U << XFS_SCRUB_TYPE_CNTBT) |
+		       (1U << XFS_SCRUB_TYPE_AGFL) |
+		       (1U << XFS_SCRUB_TYPE_RMAPBT);
+	case XFS_FMR_OWN_INOBT:
+		return (1U << XFS_SCRUB_TYPE_INOBT) |
+		       (1U << XFS_SCRUB_TYPE_FINOBT);
+	case XFS_FMR_OWN_REFC:
+		return (1U << XFS_SCRUB_TYPE_REFCNTBT);
+	case XFS_FMR_OWN_INODES:
+	case XFS_FMR_OWN_COW:
+		/* don't know how to get rid of these */
+		return 0;
+	case XFS_FMR_OWN_DEFECTIVE:
+		/* good, get rid of it */
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+/* Try to clear all per-AG metadata from the requested range. */
+static int
+csp_evac_fs_metadata(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target,
+	bool			*cleared_anything)
+{
+	uint32_t		curr_agno = -1U;
+	uint32_t		curr_mask = 0;
+	int			ret = 0;
+
+	if (req->realtime)
+		return 0;
+
+	start_fsmap_query(req, req->dev, target->start, target->length);
+	while ((ret = run_fsmap_query(req)) > 0) {
+		struct fsmap	*mrec;
+
+		for_each_fsmap_row(req, mrec) {
+			uint64_t	daddr;
+			uint32_t	agno;
+			uint32_t	mask;
+
+			if (mrec->fmr_device != req->dev)
+				continue;
+			if (!(mrec->fmr_flags & FMR_OF_SPECIAL_OWNER))
+				continue;
+
+			/* Ignore regions that we already tried to clear. */
+			if (bitmap_test(req->visited, mrec->fmr_physical,
+						mrec->fmr_length))
+				continue;
+
+			mask = fsmap_owner_to_scrub_mask(mrec->fmr_owner);
+			if (!mask)
+				continue;
+
+			trace_fsmap_rec(req, CSP_TRACE_XREBUILD, mrec);
+
+			daddr = BTOBB(mrec->fmr_physical);
+			agno = cvt_daddr_to_agno(req->xfd, daddr);
+
+			trace_xrebuild(req,
+	"agno 0x%x -> 0x%x mask 0x%x owner %lld",
+					curr_agno, agno, curr_mask,
+					(unsigned long long)mrec->fmr_owner);
+
+			if (curr_agno == -1U) {
+				curr_agno = agno;
+			} else if (curr_agno != agno) {
+				ret = csp_evac_ag_metadata(req, target,
+						curr_agno, curr_mask);
+				if (ret)
+					goto out;
+
+				*cleared_anything = true;
+				curr_agno = agno;
+				curr_mask = 0;
+			}
+
+			/* Put this on the list and try to clear it once. */
+			curr_mask |= mask;
+			ret = bitmap_set(req->visited, mrec->fmr_physical,
+					mrec->fmr_length);
+			if (ret) {
+				perror(_("marking metadata extent visited"));
+				goto out;
+			}
+		}
+	}
+
+	if (curr_agno != -1U && curr_mask != 0) {
+		ret = csp_evac_ag_metadata(req, target, curr_agno, curr_mask);
+		if (ret)
+			goto out;
+		*cleared_anything = true;
+	}
+
+	if (*cleared_anything)
+		trace_bitmap(req, "set metadata start 0x%llx length 0x%llx",
+				target->start, target->length);
+
+out:
+	end_fsmap_query(req);
+	if (ret)
+		trace_xrebuild(req, "ret %d", ret);
+	return ret;
+}
+
+/*
+ * Check that at least the start of the mapping was frozen into the work file
+ * at the correct offset.  Set @len to the number of bytes that were frozen.
+ * Returns FREEZE_FAILED for error, FREEZE_DONE if written extents are waiting
+ * to be mapped into the space capture file, or FREEZE_SKIP if there's nothing
+ * to transfer to the space capture file.
+ */
+enum freeze_outcome {
+	FREEZE_FAILED = -1,
+	FREEZE_DONE,
+	FREEZE_SKIP,
+};
+
+static enum freeze_outcome
+csp_freeze_check_outcome(
+	struct clearspace_req	*req,
+	const struct fsmap	*mrec,
+	unsigned long long	*len)
+{
+	struct getbmapx		brec;
+	int			ret;
+
+	*len = 0;
+
+	ret = bmapx_one(req, req->work_fd, 0, mrec->fmr_length, &brec);
+	if (ret)
+		return FREEZE_FAILED;
+
+	trace_freeze(req,
+ "check if workfd pos 0x0 phys 0x%llx len 0x%llx maps to phys 0x%llx len 0x%llx",
+			(unsigned long long)mrec->fmr_physical,
+			(unsigned long long)mrec->fmr_length,
+			(unsigned long long)BBTOB(brec.bmv_block),
+			(unsigned long long)BBTOB(brec.bmv_length));
+
+	/* freeze of an unwritten extent punches a hole in the work file. */
+	if ((mrec->fmr_flags & FMR_OF_PREALLOC) && brec.bmv_block == -1) {
+		*len = min(mrec->fmr_length, BBTOB(brec.bmv_length));
+		return FREEZE_SKIP;
+	}
+
+	/*
+	 * freeze of a written extent must result in the same physical space
+	 * being mapped into the work file.
+	 */
+	if (!(mrec->fmr_flags & FMR_OF_PREALLOC) &&
+	    BBTOB(brec.bmv_block) == mrec->fmr_physical) {
+		*len = min(mrec->fmr_length, BBTOB(brec.bmv_length));
+		return FREEZE_DONE;
+	}
+
+	/*
+	 * We didn't find what we were looking for, which implies that the
+	 * mapping changed out from under us.  Punch out everything that could
+	 * have been mapped into the work file.  Set @len to zero and return so
+	 * that we try again with the next mapping.
+	 */
+	trace_falloc(req, "reset workfd isize 0x0", 0);
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("resetting work file after failed freeze"));
+		return FREEZE_FAILED;
+	}
+
+	return FREEZE_SKIP;
+}
+
+/*
+ * Open a file to try to freeze whatever data is in the requested range.
+ *
+ * Returns nonzero on error.  Returns zero and a file descriptor in @fd if the
+ * caller is supposed to do something; or returns zero and @fd == -1 if there's
+ * nothing to freeze.
+ */
+static int
+csp_freeze_open(
+	struct clearspace_req	*req,
+	const struct fsmap	*mrec,
+	int			*fd)
+{
+	struct xfs_bulkstat	bulkstat;
+	int			oflags = O_RDWR;
+	int			target_fd;
+	int			ret;
+
+	*fd = -1;
+
+	ret = -xfrog_bulkstat_single(req->xfd, mrec->fmr_owner, 0, &bulkstat);
+	if (ret) {
+		if (ret == ENOENT || ret == EINVAL)
+			return 0;
+
+		fprintf(stderr, _("bulkstat inode 0x%llx: %s\n"),
+				(unsigned long long)mrec->fmr_owner,
+				strerror(ret));
+		return ret;
+	}
+
+	/*
+	 * If we get stats for a different inode, the file may have been freed
+	 * out from under us and there's nothing to do.
+	 */
+	if (bulkstat.bs_ino != mrec->fmr_owner)
+		return 0;
+
+	/* Skip anything we can't freeze. */
+	if (!S_ISREG(bulkstat.bs_mode) && !S_ISDIR(bulkstat.bs_mode))
+		return 0;
+
+	if (S_ISDIR(bulkstat.bs_mode))
+		oflags = O_RDONLY;
+
+	target_fd = csp_open_by_handle(req, oflags, mrec->fmr_owner,
+			bulkstat.bs_gen);
+	if (target_fd == -2)
+		return 0;
+	if (target_fd < 0)
+		return target_fd;
+
+	/*
+	 * Skip mappings for directories, xattr data, and block mapping btree
+	 * blocks.  We still have to close the file though.
+	 */
+	if (S_ISDIR(bulkstat.bs_mode) ||
+	    (mrec->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP))) {
+		return close(target_fd);
+	}
+
+	*fd = target_fd;
+	return 0;
+}
+
+static inline uint64_t rounddown_64(uint64_t x, uint64_t y)
+{
+	return (x / y) * y;
+}
+
+/*
+ * Deal with a frozen extent containing a partially written EOF block.  Either
+ * we use funshare to get src_fd to release the block, or we reduce the length
+ * of the frozen extent by one block.
+ */
+static int
+csp_freeze_unaligned_eofblock(
+	struct clearspace_req	*req,
+	int			src_fd,
+	const struct fsmap	*mrec,
+	unsigned long long	*frozen_len)
+{
+	struct getbmapx		brec;
+	struct stat		statbuf;
+	loff_t			work_offset, length;
+	int			ret;
+
+	ret = fstat(req->work_fd, &statbuf);
+	if (ret) {
+		perror(_("statting work file"));
+		return ret;
+	}
+
+	/*
+	 * The frozen extent is less than the size of the work file, which
+	 * means that we're already block aligned.
+	 */
+	if (*frozen_len <= statbuf.st_size)
+		return 0;
+
+	/* The frozen extent does not contain a partially written EOF block. */
+	if (statbuf.st_size % statbuf.st_blksize == 0)
+		return 0;
+
+	/*
+	 * Unshare what we think is a partially written EOF block of the
+	 * original file, to try to force it to release that block.
+	 */
+	work_offset = rounddown_64(statbuf.st_size, statbuf.st_blksize);
+	length = statbuf.st_size - work_offset;
+
+	trace_freeze(req,
+ "unaligned eofblock 0x%llx work_size 0x%llx blksize 0x%x work_offset 0x%llx work_length 0x%llx",
+			*frozen_len, (unsigned long long)statbuf.st_size,
+			(unsigned int)statbuf.st_blksize, work_offset, length);
+
+	ret = fallocate(src_fd, FALLOC_FL_UNSHARE_RANGE,
+			mrec->fmr_offset + work_offset, length);
+	if (ret) {
+		perror(_("unsharing original file"));
+		return ret;
+	}
+
+	ret = fsync(src_fd);
+	if (ret) {
+		perror(_("flushing original file"));
+		return ret;
+	}
+
+	ret = bmapx_one(req, req->work_fd, work_offset, length, &brec);
+	if (ret)
+		return ret;
+
+	if (BBTOB(brec.bmv_block) != mrec->fmr_physical + work_offset) {
+		fprintf(stderr,
+ _("work file offset 0x%llx maps to phys 0x%llx, expected 0x%llx\n"),
+				(unsigned long long)work_offset,
+				(unsigned long long)BBTOB(brec.bmv_block),
+				(unsigned long long)mrec->fmr_physical + work_offset);
+		return -1;
+	}
+
+	/*
+	 * If the block is still shared, there must be other owners of this
+	 * block.  Round down the frozen length and we'll come back to it
+	 * eventually.
+	 */
+	if (brec.bmv_oflags & BMV_OF_SHARED) {
+		*frozen_len = work_offset;
+		return 0;
+	}
+
+	/*
+	 * Not shared anymore, so increase the size of the file to the next
+	 * block boundary so that we can reflink it into the space capture
+	 * file.
+	 */
+	ret = ftruncate(req->work_fd,
+			BBTOB(brec.bmv_length) + BBTOB(brec.bmv_offset));
+	if (ret) {
+		perror(_("expanding work file"));
+		return ret;
+	}
+
+	/* Double-check that we didn't lose the block. */
+	ret = bmapx_one(req, req->work_fd, work_offset, length, &brec);
+	if (ret)
+		return ret;
+
+	if (BBTOB(brec.bmv_block) != mrec->fmr_physical + work_offset) {
+		fprintf(stderr,
+ _("work file offset 0x%llx maps to phys 0x%llx, should be 0x%llx\n"),
+				(unsigned long long)work_offset,
+				(unsigned long long)BBTOB(brec.bmv_block),
+				(unsigned long long)mrec->fmr_physical + work_offset);
+		return -1;
+	}
+
+	return 0;
+}
+
+/*
+ * Given a fsmap, try to reflink the physical space into the space capture
+ * file.
+ */
+static int
+csp_freeze_req_fsmap(
+	struct clearspace_req	*req,
+	unsigned long long	*cursor,
+	const struct fsmap	*mrec)
+{
+	struct fsmap		short_mrec;
+	struct file_clone_range	fcr = { };
+	unsigned long long	frozen_len;
+	enum freeze_outcome	outcome;
+	int			src_fd;
+	int			ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	/* Ignore mappings for our secret files. */
+	if (csp_is_internal_owner(req, mrec->fmr_owner))
+		return 0;
+
+	/* Ignore mappings before the cursor. */
+	if (mrec->fmr_physical + mrec->fmr_length < *cursor)
+		return 0;
+
+	/* Jump past mappings for metadata. */
+	if (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER)
+		goto skip;
+
+	/*
+	 * Open this file so that we can try to freeze its data blocks.
+	 * For other types of files we just skip to the evacuation step.
+	 */
+	ret = csp_freeze_open(req, mrec, &src_fd);
+	if (ret)
+		return ret;
+	if (src_fd < 0)
+		goto skip;
+
+	/*
+	 * If the cursor is in the middle of this mapping, increase the start
+	 * of the mapping to start at the cursor.
+	 */
+	if (mrec->fmr_physical < *cursor) {
+		unsigned long long	delta = *cursor - mrec->fmr_physical;
+
+		short_mrec = *mrec;
+		short_mrec.fmr_physical = *cursor;
+		short_mrec.fmr_offset += delta;
+		short_mrec.fmr_length -= delta;
+
+		mrec = &short_mrec;
+	}
+
+	req->trace_indent++;
+	if (mrec->fmr_length == 0) {
+		trace_freeze(req, "skipping zero-length freeze", 0);
+		goto out_fd;
+	}
+
+	/*
+	 * Reflink the mapping from the source file into the empty work file so
+	 * that a write will be written elsewhere.  The only way to reflink a
+	 * partially written EOF block is if the kernel can reset the work file
+	 * size so that the post-EOF part of the block remains post-EOF.  If we
+	 * can't do that, we're sunk.  If the mapping is unwritten, we'll leave
+	 * a hole in the work file.
+	 */
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncating work file for freeze"));
+		goto out_fd;
+	}
+
+	fcr.src_fd = src_fd;
+	fcr.src_offset = mrec->fmr_offset;
+	fcr.src_length = mrec->fmr_length;
+	fcr.dest_offset = 0;
+
+	trace_freeze(req,
+ "reflink ino 0x%llx offset 0x%llx bytecount 0x%llx into workfd",
+			(unsigned long long)mrec->fmr_owner,
+			(unsigned long long)fcr.src_offset,
+			(unsigned long long)fcr.src_length);
+
+	ret = clonerange(req->work_fd, &fcr);
+	if (ret == EINVAL) {
+		/*
+		 * If that didn't work, try reflinking to EOF and picking out
+		 * whatever pieces we want.
+		 */
+		fcr.src_length = 0;
+
+		trace_freeze(req,
+ "reflink ino 0x%llx offset 0x%llx to EOF into workfd",
+				(unsigned long long)mrec->fmr_owner,
+				(unsigned long long)fcr.src_offset);
+
+		ret = clonerange(req->work_fd, &fcr);
+	}
+	if (ret == EINVAL) {
+		/*
+		 * If we still can't get the block, it's possible that src_fd
+		 * was punched or truncated out from under us, so we just move
+		 * on to the next fsmap.
+		 */
+		trace_freeze(req, "cannot freeze space, moving on", 0);
+		ret = 0;
+		goto out_fd;
+	}
+	if (ret) {
+		fprintf(stderr, _("freezing space to work file: %s\n"),
+				strerror(ret));
+		goto out_fd;
+	}
+
+	req->trace_indent++;
+	outcome = csp_freeze_check_outcome(req, mrec, &frozen_len);
+	req->trace_indent--;
+	switch (outcome) {
+	case FREEZE_FAILED:
+		ret = -1;
+		goto out_fd;
+	case FREEZE_SKIP:
+		*cursor += frozen_len;
+		goto out_fd;
+	case FREEZE_DONE:
+		break;
+	}
+
+	/*
+	 * If we tried reflinking to EOF to capture a partially written EOF
+	 * block in the work file, we need to unshare the end of the source
+	 * file before we try to reflink the frozen space into the space
+	 * capture file.
+	 */
+	if (fcr.src_length == 0) {
+		ret = csp_freeze_unaligned_eofblock(req, src_fd, mrec,
+				&frozen_len);
+		if (ret)
+			goto out_fd;
+	}
+
+	/*
+	 * We've frozen the mapping by reflinking it into the work file and
+	 * confirmed that the work file has the space we wanted.  Now we need
+	 * to map the same extent into the space capture file.  If reflink
+	 * fails because we're out of space, fall back to EXCHANGE_RANGE.  The
+	 * end goal is to populate the space capture file; we don't care about
+	 * the contents of the work file.
+	 */
+	fcr.src_fd = req->work_fd;
+	fcr.src_offset = 0;
+	fcr.dest_offset = mrec->fmr_physical;
+	fcr.src_length = frozen_len;
+
+	trace_freeze(req, "reflink phys 0x%llx len 0x%llx to spacefd",
+			(unsigned long long)mrec->fmr_physical,
+			(unsigned long long)frozen_len);
+
+	ret = clonerange(req->space_fd, &fcr);
+	if (ret == ENOSPC) {
+		struct xfs_exchange_range	fxr;
+
+		xfrog_exchangerange_prep(&fxr, mrec->fmr_physical, req->work_fd,
+				0, frozen_len);
+		ret = -xfrog_exchangerange(req->space_fd, &fxr, 0);
+	}
+	if (ret) {
+		fprintf(stderr, _("freezing space to space capture file: %s\n"),
+				strerror(ret));
+		goto out_fd;
+	}
+
+	*cursor += frozen_len;
+out_fd:
+	ret2 = close(src_fd);
+	if (!ret && ret2)
+		ret = ret2;
+	req->trace_indent--;
+	if (ret)
+		trace_freeze(req, "ret %d", ret);
+	return ret;
+skip:
+	*cursor += mrec->fmr_length;
+	return 0;
+}
+
+/*
+ * Try to freeze all the space in the requested range against overwrites.
+ *
+ * For each file data fsmap within each hole in the part of the space capture
+ * file corresponding to the requested range, try to reflink the space into the
+ * space capture file so that any subsequent writes to the original owner are
+ * CoW and nobody else can allocate the space.  If we cannot use reflink to
+ * freeze all the space, we cannot proceed with the clearing.
+ */
+static int
+csp_freeze_req_range(
+	struct clearspace_req	*req)
+{
+	unsigned long long	cursor = req->start;
+	loff_t			holepos = 0;
+	loff_t			length = 0;
+	int			ret;
+
+	ret = ftruncate(req->space_fd, req->start + req->length);
+	if (ret) {
+		perror(_("setting up space capture file"));
+		return ret;
+	}
+
+	if (!req->use_reflink)
+		return 0;
+
+	start_spacefd_iter(req);
+	while ((ret = spacefd_hole_iter(req, &holepos, &length)) > 0) {
+		trace_freeze(req, "spacefd hole 0x%llx length 0x%llx",
+				(long long)holepos, (long long)length);
+
+		start_fsmap_query(req, req->dev, holepos, length);
+		while ((ret = run_fsmap_query(req)) > 0) {
+			struct fsmap	*mrec;
+
+			for_each_fsmap_row(req, mrec) {
+				trace_fsmap_rec(req, CSP_TRACE_FREEZE, mrec);
+				trim_request_fsmap(req, mrec);
+				ret = csp_freeze_req_fsmap(req, &cursor, mrec);
+				if (ret)
+					goto out_query;
+			}
+		}
+out_query:
+		end_fsmap_query(req);
+		if (ret)
+			break;
+	}
+	end_spacefd_iter(req);
+	return ret;
+}
+
+/*
+ * Dump all speculative preallocations, COW staging blocks, and inactive inodes
+ * to try to free up as much space as we can.
+ */
+static int
+csp_collect_garbage(
+	struct clearspace_req	*req)
+{
+	struct xfs_fs_eofblocks	eofb = {
+		.eof_version	= XFS_EOFBLOCKS_VERSION,
+		.eof_flags	= XFS_EOF_FLAGS_SYNC,
+	};
+	int			ret;
+
+	ret = ioctl(req->xfd->fd, XFS_IOC_FREE_EOFBLOCKS, &eofb);
+	if (ret) {
+		perror(_("xfs garbage collector"));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+csp_prepare(
+	struct clearspace_req	*req)
+{
+	blkcnt_t		old_blocks = 0;
+	int			ret;
+
+	/*
+	 * Empty out CoW forks and speculative post-EOF preallocations before
+	 * starting the clearing process.  This may be somewhat overkill.
+	 */
+	ret = syncfs(req->xfd->fd);
+	if (ret) {
+		perror(_("syncing filesystem"));
+		return ret;
+	}
+
+	ret = csp_collect_garbage(req);
+	if (ret)
+		return ret;
+
+	/*
+	 * Set up the space capture file as a large sparse file mirroring the
+	 * physical space that we want to defragment.
+	 */
+	ret = ftruncate(req->space_fd, req->start + req->length);
+	if (ret) {
+		perror(_("setting up space capture file"));
+		return ret;
+	}
+
+	/*
+	 * If we don't have reflink, just grab the free space and move on to
+	 * copying and exchanging file contents.
+	 */
+	if (!req->use_reflink)
+		return csp_grab_free_space(req);
+
+	/*
+	 * Try to freeze as much of the requested range as we can, grab the
+	 * free space in that range, and run freeze again to pick up anything
+	 * that may have been allocated while all that was going on.
+	 */
+	do {
+		struct stat	statbuf;
+
+		ret = csp_freeze_req_range(req);
+		if (ret)
+			return ret;
+
+		ret = csp_grab_free_space(req);
+		if (ret)
+			return ret;
+
+		ret = fstat(req->space_fd, &statbuf);
+		if (ret)
+			return ret;
+
+		if (old_blocks == statbuf.st_blocks)
+			break;
+		old_blocks = statbuf.st_blocks;
+	} while (1);
+
+	/*
+	 * If reflink is enabled, our strategy is to dedupe to free blocks in
+	 * the area that we're clearing without making any user-visible changes
+	 * to the file contents.  For all the written file data blocks in the
+	 * we're clearing, make an identical copy in the work file that is
+	 * backed by blocks that are not in the clearing area.
+	 */
+	return csp_prepare_for_dedupe(req);
+}
+
+/* Set up the target to clear all metadata from the given range. */
+static inline void
+csp_target_metadata(
+	struct clearspace_req	*req,
+	struct clearspace_tgt	*target)
+{
+	target->start = req->start;
+	target->length = req->length;
+	target->prio = 0;
+	target->evacuated = 0;
+	target->owners = 0;
+	target->try_again = false;
+}
+
+/*
+ * Loop through the space to find the most appealing part of the device to
+ * clear, then try to evacuate everything within.
+ */
+int
+clearspace_run(
+	struct clearspace_req	*req)
+{
+	struct clearspace_tgt	target;
+	const struct csp_errstr	*es;
+	bool			cleared_anything = false;
+	int			ret;
+
+	if (req->trace_mask) {
+		fprintf(stderr, "debug flags 0x%x:", req->trace_mask);
+		for (es = errtags; es->tag; es++) {
+			if (req->trace_mask & es->mask)
+				fprintf(stderr, " %s", es->tag);
+		}
+		fprintf(stderr, "\n");
+	}
+
+	req->trace_indent = 0;
+	trace_status(req,
+ _("Clearing dev %u:%u physical 0x%llx bytecount 0x%llx."),
+			major(req->dev), minor(req->dev),
+			req->start, req->length);
+
+	if (req->trace_mask & ~CSP_TRACE_STATUS)
+		trace_status(req, "reflink? %d evac_metadata? %d",
+				req->use_reflink, req->can_evac_metadata);
+
+	ret = bitmap_alloc(&req->visited);
+	if (ret) {
+		perror(_("allocating visited bitmap"));
+		return ret;
+	}
+
+	ret = csp_prepare(req);
+	if (ret)
+		goto out_bitmap;
+
+	/* Evacuate as many file blocks as we can. */
+	do {
+		ret = csp_find_target(req, &target);
+		if (ret)
+			goto out_bitmap;
+
+		if (target.length == 0)
+			break;
+
+		trace_target(req,
+	"phys 0x%llx len 0x%llx owners 0x%llx prio 0x%llx",
+				target.start, target.length,
+				target.owners, target.prio);
+
+		if (req->use_reflink)
+			ret = csp_evac_dedupe(req, &target);
+		else
+			ret = csp_evac_exchange(req, &target);
+		if (ret)
+			goto out_bitmap;
+
+		trace_status(req, _("Evacuated %llu file items."),
+				target.evacuated);
+	} while (target.evacuated > 0 || target.try_again);
+
+	if (!req->can_evac_metadata)
+		goto out_bitmap;
+
+	/* Evacuate as many AG metadata blocks as we can. */
+	do {
+		csp_target_metadata(req, &target);
+
+		ret = csp_evac_fs_metadata(req, &target, &cleared_anything);
+		if (ret)
+			goto out_bitmap;
+
+		trace_status(req, _("Evacuated %llu metadata items."),
+				target.evacuated);
+	} while (target.evacuated > 0 && cleared_anything);
+
+out_bitmap:
+	bitmap_free(&req->visited);
+	return ret;
+}
+
+/* How much space did we actually clear? */
+int
+clearspace_efficacy(
+	struct clearspace_req	*req,
+	unsigned long long	*cleared_bytes)
+{
+	unsigned long long	cleared = 0;
+	int			ret;
+
+	start_bmapx_query(req, 0, req->start, req->length);
+	while ((ret = run_bmapx_query(req, req->space_fd)) > 0) {
+		struct getbmapx	*brec;
+
+		for_each_bmapx_row(req, brec) {
+			if (brec->bmv_block == -1)
+				continue;
+
+			trace_bmapx_rec(req, CSP_TRACE_EFFICACY, brec);
+
+			if (brec->bmv_offset != brec->bmv_block) {
+				fprintf(stderr,
+	_("space capture file mapped incorrectly\n"));
+				end_bmapx_query(req);
+				return -1;
+			}
+			cleared += BBTOB(brec->bmv_length);
+		}
+	}
+	end_bmapx_query(req);
+	if (ret)
+		return ret;
+
+	*cleared_bytes = cleared;
+	return 0;
+}
+
+/*
+ * Create a temporary file on the same volume (data/rt) that we're trying to
+ * clear free space on.
+ */
+static int
+csp_open_tempfile(
+	struct clearspace_req	*req,
+	struct stat		*statbuf)
+{
+	struct fsxattr		fsx;
+	int			fd, ret;
+
+	fd = openat(req->xfd->fd, ".", O_TMPFILE | O_RDWR | O_EXCL, 0600);
+	if (fd < 0) {
+		perror(_("opening temp file"));
+		return -1;
+	}
+
+	/* Make sure we got the same filesystem as the open file. */
+	ret = fstat(fd, statbuf);
+	if (ret) {
+		perror(_("stat temp file"));
+		goto fail;
+	}
+	if (statbuf->st_dev != req->statbuf.st_dev) {
+		fprintf(stderr,
+	_("Cannot create temp file on same fs as open file.\n"));
+		goto fail;
+	}
+
+	/* Ensure this file targets the correct data/rt device. */
+	ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
+	if (ret) {
+		perror(_("FSGETXATTR temp file"));
+		goto fail;
+	}
+
+	if (!!(fsx.fsx_xflags & FS_XFLAG_REALTIME) != req->realtime) {
+		if (req->realtime)
+			fsx.fsx_xflags |= FS_XFLAG_REALTIME;
+		else
+			fsx.fsx_xflags &= ~FS_XFLAG_REALTIME;
+
+		ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
+		if (ret) {
+			perror(_("FSSETXATTR temp file"));
+			goto fail;
+		}
+	}
+
+	trace_setup(req, "opening temp inode 0x%llx as fd %d",
+			(unsigned long long)statbuf->st_ino, fd);
+
+	return fd;
+fail:
+	close(fd);
+	return -1;
+}
+
+/* Extract fshandle from the open file. */
+static int
+csp_install_file(
+	struct clearspace_req	*req,
+	struct xfs_fd		*xfd)
+{
+	void			*handle;
+	size_t			handle_sz;
+	int			ret;
+
+	ret = fstat(xfd->fd, &req->statbuf);
+	if (ret)
+		return ret;
+
+	if (!S_ISDIR(req->statbuf.st_mode)) {
+		errno = ENOTDIR;
+		return -1;
+	}
+
+	ret = fd_to_handle(xfd->fd, &handle, &handle_sz);
+	if (ret)
+		return ret;
+
+	ret = handle_to_fshandle(handle, handle_sz, &req->fshandle,
+			&req->fshandle_sz);
+	if (ret)
+		return ret;
+
+	free_handle(handle, handle_sz);
+	req->xfd = xfd;
+	return 0;
+}
+
+/* Decide if we can use online repair to evacuate metadata. */
+static void
+csp_detect_evac_metadata(
+	struct clearspace_req		*req)
+{
+	struct xfs_scrub_metadata	scrub = {
+		.sm_type		= XFS_SCRUB_TYPE_PROBE,
+		.sm_flags		= XFS_SCRUB_IFLAG_REPAIR |
+					  XFS_SCRUB_IFLAG_FORCE_REBUILD,
+	};
+	int				ret;
+
+	ret = ioctl(req->xfd->fd, XFS_IOC_SCRUB_METADATA, &scrub);
+	if (ret)
+		return;
+
+	/*
+	 * We'll try to evacuate metadata if the probe works.  This doesn't
+	 * guarantee success; it merely means that the kernel call exists.
+	 */
+	req->can_evac_metadata = true;
+}
+
+/* Detect XFS_IOC_MAP_FREESP; this is critical for grabbing free space! */
+static int
+csp_detect_map_freesp(
+	struct clearspace_req	*req)
+{
+	struct xfs_map_freesp	args = {
+		.offset		= 0,
+		.len		= 1,
+	};
+	int			ret;
+
+	/*
+	 * A single-byte XFS_IOC_MAP_FREESP request will succeed without doing
+	 * anything to the filesystem.
+	 */
+	ret = ioctl(req->work_fd, XFS_IOC_MAP_FREESP, &args);
+	if (!ret)
+		return 0;
+
+	if (errno == EOPNOTSUPP) {
+		fprintf(stderr,
+	_("Filesystem does not support XFS_IOC_MAP_FREESP\n"));
+		return -1;
+	}
+
+	perror(_("test XFS_IOC_MAP_FREESP on work file"));
+	return -1;
+}
+
+/*
+ * Assemble operation information to clear the physical space in part of a
+ * filesystem.
+ */
+int
+clearspace_init(
+	struct clearspace_req		**reqp,
+	const struct clearspace_init	*attrs)
+{
+	struct clearspace_req		*req;
+	int				ret;
+
+	req = calloc(1, sizeof(struct clearspace_req));
+	if (!req) {
+		perror(_("malloc clearspace"));
+		return -1;
+	}
+
+	req->work_fd = -1;
+	req->space_fd = -1;
+	req->trace_mask = attrs->trace_mask;
+
+	req->realtime = attrs->is_realtime;
+	req->dev = attrs->dev;
+	req->start = attrs->start;
+	req->length = attrs->length;
+
+	ret = csp_install_file(req, attrs->xfd);
+	if (ret) {
+		perror(attrs->fname);
+		goto fail;
+	}
+
+	csp_detect_evac_metadata(req);
+
+	req->work_fd = csp_open_tempfile(req, &req->temp_statbuf);
+	if (req->work_fd < 0)
+		goto fail;
+
+	req->space_fd = csp_open_tempfile(req, &req->space_statbuf);
+	if (req->space_fd < 0)
+		goto fail;
+
+	ret = csp_detect_map_freesp(req);
+	if (ret)
+		goto fail;
+
+	req->mhead = calloc(1, fsmap_sizeof(QUERY_BATCH_SIZE));
+	if (!req->mhead) {
+		perror(_("opening fs mapping query"));
+		goto fail;
+	}
+
+	req->rhead = calloc(1, xfs_getfsrefs_sizeof(QUERY_BATCH_SIZE));
+	if (!req->rhead) {
+		perror(_("opening refcount query"));
+		goto fail;
+	}
+
+	req->bhead = calloc(QUERY_BATCH_SIZE + 1, sizeof(struct getbmapx));
+	if (!req->bhead) {
+		perror(_("opening file mapping query"));
+		goto fail;
+	}
+
+	req->buf = malloc(BUFFERCOPY_BUFSZ);
+	if (!req->buf) {
+		perror(_("allocating file copy buffer"));
+		goto fail;
+	}
+
+	req->fdr = calloc(1, sizeof(struct file_dedupe_range) +
+			     sizeof(struct file_dedupe_range_info));
+	if (!req->fdr) {
+		perror(_("allocating dedupe control buffer"));
+		goto fail;
+	}
+
+	req->use_reflink = req->xfd->fsgeom.flags & XFS_FSOP_GEOM_FLAGS_REFLINK;
+
+	*reqp = req;
+	return 0;
+fail:
+	clearspace_free(&req);
+	return -1;
+}
+
+#ifdef CLEARSPACE_DEBUG
+static void
+csp_dump_fd(
+	struct clearspace_req	*req,
+	int			fd,
+	const char		*tag)
+{
+	struct stat		sb;
+	struct getbmapx		*brec;
+	unsigned long		i = 0;
+	int			ret;
+
+	ret = fstat(fd, &sb);
+	if (ret) {
+		perror("fstat");
+		return;
+	}
+
+	printf("CLEARFREE DUMP ino 0x%llx: %s\n",
+			(unsigned long long)sb.st_ino, tag);
+	start_bmapx_query(req, 0, 0, sb.st_size);
+	while ((ret = run_bmapx_query(req, fd)) > 0) {
+		for_each_bmapx_row(req, brec) {
+			char	*delim = "";
+
+			printf("[%lu]: startoff 0x%llx ",
+					i++, BBTOB(brec->bmv_offset));
+
+			if (brec->bmv_block == -1)
+				printf("startblock hole ");
+			else if (brec->bmv_block == -2)
+				printf("startblock delalloc ");
+			else
+				printf("startblock 0x%llx ",
+						BBTOB(brec->bmv_block));
+			printf("blockcount 0x%llx flags [",
+					BBTOB(brec->bmv_length));
+			if (brec->bmv_oflags & BMV_OF_PREALLOC) {
+				printf("%sprealloc", delim);
+				delim = ", ";
+			}
+			if (brec->bmv_oflags & BMV_OF_DELALLOC) {
+				printf("%sdelalloc", delim);
+				delim = ", ";
+			}
+			if (brec->bmv_oflags & BMV_OF_SHARED) {
+				printf("%sshared", delim);
+				delim = ", ";
+			}
+			printf("]\n");
+		}
+	}
+	end_bmapx_query(req);
+}
+
+/* Dump the space file and work file contents. */
+void
+clearspace_dump(
+	struct clearspace_req	*req)
+{
+	csp_dump_fd(req, req->space_fd, "space file");
+	csp_dump_fd(req, req->work_fd, "work file");
+}
+#endif /* CLEARSPACE_DEBUG */
+
+/* Free all resources associated with a space clearing request. */
+int
+clearspace_free(
+	struct clearspace_req	**reqp)
+{
+	struct clearspace_req	*req = *reqp;
+	int			ret = 0;
+
+	if (!req)
+		return 0;
+
+	*reqp = NULL;
+	free(req->fdr);
+	free(req->buf);
+	free(req->bhead);
+	free(req->rhead);
+	free(req->mhead);
+
+	if (req->space_fd >= 0) {
+		ret = close(req->space_fd);
+		if (ret)
+			perror(_("closing space capture file"));
+	}
+
+	if (req->work_fd >= 0) {
+		int	ret2 = close(req->work_fd);
+
+		if (ret2) {
+			perror(_("closing work file"));
+			if (!ret && ret2)
+				ret = ret2;
+		}
+	}
+
+	if (req->fshandle)
+		free_handle(req->fshandle, req->fshandle_sz);
+	free(req);
+	return ret;
+}
diff --git a/libfrog/clearspace.h b/libfrog/clearspace.h
new file mode 100644
index 00000000000000..d75545752b1fbf
--- /dev/null
+++ b/libfrog/clearspace.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __LIBFROG_CLEARSPACE_H__
+#define __LIBFROG_CLEARSPACE_H__
+
+#undef CLEARSPACE_DEBUG
+
+struct clearspace_req;
+
+struct clearspace_init {
+	/* Open file and its pathname */
+	struct xfs_fd		*xfd;
+	const char		*fname;
+
+	/* Which device do we want? */
+	bool			is_realtime;
+	dev_t			dev;
+
+	/* Range of device to clear. */
+	unsigned long long	start;
+	unsigned long long	length;
+
+	unsigned int		trace_mask;
+};
+
+int clearspace_init(struct clearspace_req **reqp,
+		const struct clearspace_init *init);
+int clearspace_free(struct clearspace_req **reqp);
+
+int clearspace_run(struct clearspace_req *req);
+
+#ifdef CLEARSPACE_DEBUG
+void clearspace_dump(struct clearspace_req *req);
+#else
+# define clearspace_dump(req)		((void)0)
+#endif
+int clearspace_efficacy(struct clearspace_req *req,
+		unsigned long long *cleared_bytes);
+
+/* Debugging levels */
+
+#define CSP_TRACE_FREEZE	(1U << 0)
+#define CSP_TRACE_GRAB		(1U << 1)
+#define CSP_TRACE_FSMAP		(1U << 2)
+#define CSP_TRACE_FSREFS	(1U << 3)
+#define CSP_TRACE_BMAPX		(1U << 4)
+#define CSP_TRACE_PREP		(1U << 5)
+#define CSP_TRACE_TARGET	(1U << 6)
+#define CSP_TRACE_DEDUPE	(1U << 7)
+#define CSP_TRACE_FALLOC	(1U << 8)
+#define CSP_TRACE_EXCHANGE	(1U << 9)
+#define CSP_TRACE_XREBUILD	(1U << 10)
+#define CSP_TRACE_EFFICACY	(1U << 11)
+#define CSP_TRACE_SETUP		(1U << 12)
+#define CSP_TRACE_STATUS	(1U << 13)
+#define CSP_TRACE_DUMPFILE	(1U << 14)
+#define CSP_TRACE_BITMAP	(1U << 15)
+
+#define CSP_TRACE_ALL		(CSP_TRACE_FREEZE | \
+				 CSP_TRACE_GRAB | \
+				 CSP_TRACE_FSMAP | \
+				 CSP_TRACE_FSREFS | \
+				 CSP_TRACE_BMAPX | \
+				 CSP_TRACE_PREP	 | \
+				 CSP_TRACE_TARGET | \
+				 CSP_TRACE_DEDUPE | \
+				 CSP_TRACE_FALLOC | \
+				 CSP_TRACE_EXCHANGE | \
+				 CSP_TRACE_XREBUILD | \
+				 CSP_TRACE_EFFICACY | \
+				 CSP_TRACE_SETUP | \
+				 CSP_TRACE_STATUS | \
+				 CSP_TRACE_DUMPFILE | \
+				 CSP_TRACE_BITMAP)
+
+#endif /* __LIBFROG_CLEARSPACE_H__ */
diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8
index 7d2d1ff94eeb55..a326b9a6486296 100644
--- a/man/man8/xfs_spaceman.8
+++ b/man/man8/xfs_spaceman.8
@@ -25,6 +25,23 @@ .SH OPTIONS
 
 .SH COMMANDS
 .TP
+.BI "clearfree [ \-n nr ] [ \-r ] [ \-v mask ] " start " " length
+Try to clear the specified physical range in the filesystem.
+The
+.I start
+and
+.I length
+arguments must be given in units of bytes.
+If the
+.B \-n
+option is given, run the clearing algorithm this many times.
+If the
+.B \-r
+option is given, clear the realtime device.
+If the
+.B \-v
+option is given, print the trace messages selected by the mask ("all" prints everything).
+.TP
 .BI "freesp [ \-dgrs ] [-a agno]... [ \-b | \-e bsize | \-h bsize | \-m factor ]"
 With no arguments,
 .B freesp
diff --git a/spaceman/Makefile b/spaceman/Makefile
index 358db9edf5cb73..b9eead8340cec1 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -27,7 +27,7 @@ LLDLIBS += $(LIBEDITLINE) $(LIBTERMCAP)
 endif
 
 ifeq ($(HAVE_GETFSMAP),yes)
-CFILES += freesp.c
+CFILES += freesp.c clearfree.c
 endif
 
 default: depend $(LTCOMMAND)
diff --git a/spaceman/clearfree.c b/spaceman/clearfree.c
new file mode 100644
index 00000000000000..6d686f805855dc
--- /dev/null
+++ b/spaceman/clearfree.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "platform_defs.h"
+#include "command.h"
+#include "init.h"
+#include "libfrog/paths.h"
+#include "input.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/clearspace.h"
+#include "handle.h"
+#include "space.h"
+
+static void
+clearfree_help(void)
+{
+	printf(_(
+"Evacuate the contents of the given range of physical storage in the filesystem"
+"\n"
+" -n -- Run the space clearing algorithm this many times.\n"
+" -r -- clear space on the realtime device.\n"
+" -v -- verbosity level, or \"all\" to print everything.\n"
+"\n"
+"The start and length arguments are required, and must be specified in units\n"
+"of bytes.\n"
+"\n"));
+}
+
+static int
+clearfree_f(
+	int			argc,
+	char			**argv)
+{
+	struct clearspace_init	attrs = {
+		.xfd		= &file->xfd,
+		.fname		= file->name,
+	};
+	struct clearspace_req	*req = NULL;
+	unsigned long long	cleared;
+	unsigned long		arg;
+	long long		lnum;
+	unsigned int		i, nr = 1;
+	int			c, ret;
+
+	while ((c = getopt(argc, argv, "n:rv:")) != EOF) {
+		switch (c) {
+		case 'n':
+			errno = 0;
+			arg = strtoul(optarg, NULL, 0);
+			if (errno) {
+				perror(optarg);
+				return 1;
+			}
+			if (arg > UINT_MAX)
+				arg = UINT_MAX;
+			nr = arg;
+			break;
+		case 'r':	/* rt device */
+			attrs.is_realtime = true;
+			break;
+		case 'v':	/* Verbose output */
+			if (!strcmp(optarg, "all")) {
+				attrs.trace_mask = CSP_TRACE_ALL;
+			} else {
+				errno = 0;
+				attrs.trace_mask = strtoul(optarg, NULL, 0);
+				if (errno) {
+					perror(optarg);
+					return 1;
+				}
+			}
+			break;
+		default:
+			exitcode = 1;
+			clearfree_help();
+			return 0;
+		}
+	}
+
+	if (attrs.trace_mask)
+		attrs.trace_mask |= CSP_TRACE_STATUS;
+
+	if (argc != optind + 2) {
+		clearfree_help();
+		goto fail;
+	}
+
+	if (attrs.is_realtime) {
+		if (file->xfd.fsgeom.rtblocks == 0) {
+			fprintf(stderr, _("No realtime volume present.\n"));
+			goto fail;
+		}
+		attrs.dev = file->fs_path.fs_rtdev;
+	} else {
+		attrs.dev = file->fs_path.fs_datadev;
+	}
+
+	lnum = cvtnum(file->xfd.fsgeom.blocksize, file->xfd.fsgeom.sectsize,
+			argv[optind]);
+	if (lnum < 0) {
+		fprintf(stderr, _("Bad clearfree start %s.\n"),
+				argv[optind]);
+		goto fail;
+	}
+	attrs.start = lnum;
+
+	lnum = cvtnum(file->xfd.fsgeom.blocksize, file->xfd.fsgeom.sectsize,
+			argv[optind + 1]);
+	if (lnum < 0) {
+		fprintf(stderr, _("Bad clearfree length %s.\n"),
+				argv[optind + 1]);
+		goto fail;
+	}
+	attrs.length = lnum;
+
+	ret = clearspace_init(&req, &attrs);
+	if (ret)
+		goto fail;
+
+	for (i = 0; i < nr; i++) {
+		ret = clearspace_run(req);
+		if (ret)
+			goto out_clearspace;
+	}
+
+	ret = clearspace_efficacy(req, &cleared);
+	if (ret)
+		goto out_clearspace;
+
+	printf(_("Cleared 0x%llx bytes (%.1f%%) from 0x%llx to 0x%llx.\n"),
+			cleared, 100.0 * cleared / attrs.length, attrs.start,
+			attrs.start + attrs.length);
+
+	if (!cleared)
+		clearspace_dump(req);
+
+	ret = clearspace_free(&req);
+	if (ret)
+		goto fail;
+
+	fshandle_destroy();
+	return 0;
+
+out_clearspace:
+	clearspace_dump(req);
+	clearspace_free(&req);
+fail:
+	fshandle_destroy();
+	exitcode = 1;
+	return 1;
+}
+
+static struct cmdinfo clearfree_cmd = {
+	.name		= "clearfree",
+	.cfunc		= clearfree_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT,
+	.args		= "[-n runs] [-r] [-v mask] start length",
+	.help		= clearfree_help,
+};
+
+void
+clearfree_init(void)
+{
+	clearfree_cmd.oneline = _("clear free space in the filesystem");
+
+	add_command(&clearfree_cmd);
+}
diff --git a/spaceman/init.c b/spaceman/init.c
index cf1ff3cbb0ee8d..bce62dec47f2c8 100644
--- a/spaceman/init.c
+++ b/spaceman/init.c
@@ -35,6 +35,7 @@ init_commands(void)
 	trim_init();
 	freesp_init();
 	health_init();
+	clearfree_init();
 }
 
 static int
diff --git a/spaceman/space.h b/spaceman/space.h
index 28fa35a3047957..509e923375f42f 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -31,8 +31,10 @@ extern void	quit_init(void);
 extern void	trim_init(void);
 #ifdef HAVE_GETFSMAP
 extern void	freesp_init(void);
+extern void	clearfree_init(void);
 #else
 # define freesp_init()	do { } while (0)
+# define clearfree_init()	do { } while (0)
 #endif
 extern void	info_init(void);
 extern void	health_init(void);



* [PATCH 06/11] spaceman: physically move a regular inode
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-12-31 23:46   ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong
@ 2024-12-31 23:46   ` Darrick J. Wong
  2024-12-31 23:46   ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To be able to shrink a filesystem, we need to be able to physically
move an inode and all its data and metadata from its current
location to a new AG.  Add a command to spaceman to allow an inode
to be moved to a new AG.

This new command is not intended to be a perfect solution. I am not
trying to handle atomic movement of open files - this is intended to
be run as a maintenance operation on an idle filesystem. If root
filesystems are the target, then this should be run via a rescue
environment that is not executing directly on the root fs. With
those caveats in place, we can do the entire inode move as a set of
non-destructive operations finalised by an atomic inode swap
without needing any special kernel support.

To ensure we move metadata such as BMBT blocks even if we don't need
to move data, we clone the data to a new inode that we've allocated
in the destination AG. This will result in new bmbt blocks being
allocated in the new location even though the data is not copied.
Attributes need to be copied one at a time from the original inode.

If data needs to be moved, then we use fallocate(UNSHARE) to create
a private copy of the range of data that needs to be moved in the
new inode. This will be allocated in the destination AG by normal
allocation policy.
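
The "does this extent need to move" test can be sketched roughly as
follows.  This is only an illustration, not the code in this patch: the
sketch_* names are hypothetical, and it assumes the data device is laid
out linearly, so that dividing a physical byte address by the AG size in
bytes yields the AG number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one FIEMAP record; all fields are in bytes. */
struct sketch_extent {
	uint64_t	logical;	/* offset within the file */
	uint64_t	physical;	/* offset on the data device */
	uint64_t	length;
};

/* Map a device byte address to an AG number, assuming a linear layout. */
static uint32_t
sketch_byte_to_agno(
	uint64_t	physical,
	uint64_t	ag_bytes)
{
	assert(ag_bytes != 0);
	return (uint32_t)(physical / ag_bytes);
}

/* An extent is a candidate for unsharing if it starts outside the target AG. */
static bool
sketch_needs_unshare(
	const struct sketch_extent	*ext,
	uint64_t			ag_bytes,
	uint32_t			target_agno)
{
	return sketch_byte_to_agno(ext->physical, ag_bytes) != target_agno;
}
```

Extents for which the predicate is true would then be handed to
fallocate(FALLOC_FL_UNSHARE_RANGE) using their logical offset and length.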

Once the new inode has been finalised, use RENAME_EXCHANGE to swap
it into place and unlink the original inode to free up all the
resources it still pins.

There are many optimisations still possible to speed this up, but
the goal here is "functional" rather than "optimal". Performance can
be optimised once all the parts for an "empty the tail of the
filesystem before shrink" operation are implemented and solidly
tested.

This functionality has been smoke tested by creating a 32MB data
file with 4k extents and several hundred attributes:

$ cat test.sh
fname=/mnt/scratch/foo
xfs_io -f -c "pwrite 0 32m" -c sync $fname
for (( i=0; i < 4096 ; i++ )); do
	xfs_io -c "fpunch $((i * 8))k 4k" $fname
done

for (( i=0; i < 100 ; i++ )); do
	setfattr -n user.blah.$i.$i.blah -v blah.$i.$i.blah $fname
	setfattr -n user.foo.$i.$i.foo -v $i.cantbele.$i.ve.$i.tsnotbutter $fname
done
for (( i=0; i < 100 ; i++ )); do
	setfattr -n security.baz.$i.$i.baz -v wotchul$i$iookinat $fname
done

xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
xfs_spaceman -c "move_inode -a 22" /mnt/scratch/foo
xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
$

and the output looks something like:

$ sudo ./test.sh
....
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 133
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                      8
   1: [8..15]:         208..215           0 (208..215)           8 000000
   2: [16..23]:        hole                                      8
   3: [24..31]:        224..231           0 (224..231)           8 000000
....
8189: [65512..65519]:  65712..65719       0 (65712..65719)       8 000000
8190: [65520..65527]:  hole                                      8
8191: [65528..65535]:  65728..65735       0 (65728..65735)       8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          392..399           0 (392..399)           8 000000
   1: [8..15]:         408..415           0 (408..415)           8 000000
   2: [16..23]:        424..431           0 (424..431)           8 000000
   3: [24..31]:        456..463           0 (456..463)           8 000000
move mnt /mnt/scratch, path /mnt/scratch/foo, agno 22
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 47244651475
....
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                              8
   1: [8..15]:         47244763192..47244763199  22 (123112..123119)     8 000000
   2: [16..23]:        hole                                              8
   3: [24..31]:        47244763208..47244763215  22 (123128..123135)     8 000000
....
8189: [65512..65519]:  47244828808..47244828815  22 (188728..188735)     8 000000
8190: [65520..65527]:  hole                                              8
8191: [65528..65535]:  47244828824..47244828831  22 (188744..188751)     8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          47244763176..47244763183  22 (123096..123103)     8 000000
$
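
The inode numbers above can be decoded by hand to confirm the move.  The
sketch below is illustrative only: the sketch_* helpers are hypothetical,
and the shift of 31 is an assumption inferred from the sample numbers
(it is the only power-of-two shift that maps inode 47244651475 to AG 22);
real code would take it from the aginolog field of the geometry.

```c
#include <assert.h>
#include <stdint.h>

/* The AG number lives in the bits above the AG-relative inode number. */
static uint32_t
sketch_ino_to_agno(
	uint64_t	ino,
	unsigned int	aginolog)
{
	assert(aginolog > 0 && aginolog < 64);
	return (uint32_t)(ino >> aginolog);
}

/* The low aginolog bits are the inode number within the AG. */
static uint64_t
sketch_ino_to_agino(
	uint64_t	ino,
	unsigned int	aginolog)
{
	assert(aginolog > 0 && aginolog < 64);
	return ino & ((1ULL << aginolog) - 1);
}
```

Under that assumption, inode 133 decodes to AG 0 and inode 47244651475
decodes to AG 22, matching the AG column of the bmap output.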


Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_spaceman.8 |    4 
 spaceman/Makefile       |    3 
 spaceman/init.c         |    1 
 spaceman/move_inode.c   |  562 +++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/space.h        |    1 
 5 files changed, 570 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/move_inode.c


diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8
index a326b9a6486296..f898a8bbe840ea 100644
--- a/man/man8/xfs_spaceman.8
+++ b/man/man8/xfs_spaceman.8
@@ -146,6 +146,10 @@ .SH COMMANDS
 .TP
 .BR "help [ " command " ]"
 Display a brief description of one or all commands.
+.TP
+.BI "move_inode \-a agno"
+Move the currently open file into the specified allocation group.
+
 .TP
 .BI "prealloc [ \-u id ] [ \-g id ] [ -p id ] [ \-m minlen ] [ \-s ]"
 Removes speculative preallocation.
diff --git a/spaceman/Makefile b/spaceman/Makefile
index b9eead8340cec1..9d080b67de9a22 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -14,11 +14,12 @@ CFILES = \
 	health.c \
 	info.c \
 	init.c \
+	move_inode.c \
 	prealloc.c \
 	trim.c
 LSRCFILES = xfs_info.sh
 
-LLDLIBS = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG)
+LLDLIBS = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG) $(LIBHANDLE)
 LTDEPENDENCIES = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG)
 LLDFLAGS = -static
 
diff --git a/spaceman/init.c b/spaceman/init.c
index bce62dec47f2c8..dbeebcf97b9fb2 100644
--- a/spaceman/init.c
+++ b/spaceman/init.c
@@ -36,6 +36,7 @@ init_commands(void)
 	freesp_init();
 	health_init();
 	clearfree_init();
+	move_inode_init();
 }
 
 static int
diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c
new file mode 100644
index 00000000000000..b7d71ee7a46dc6
--- /dev/null
+++ b/spaceman/move_inode.c
@@ -0,0 +1,562 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+
+#include "libxfs.h"
+#include "libfrog/fsgeom.h"
+#include "command.h"
+#include "init.h"
+#include "libfrog/paths.h"
+#include "space.h"
+#include "input.h"
+#include "handle.h"
+
+#include <linux/fiemap.h>
+#include <linux/falloc.h>
+#include <attr/attributes.h>
+
+static cmdinfo_t move_inode_cmd;
+
+/*
+ * We can't entirely use O_TMPFILE here because we want to use RENAME_EXCHANGE
+ * to swap the inode once rebuild is complete. Hence the new file has to be
+ * somewhere in the namespace for rename to act upon. Hence we use a normal
+ * open(O_CREAT) for now.
+ *
+ * This could potentially use O_TMPFILE to rebuild the entire inode, then use
+ * a linkat()/renameat2() pair to add it to the namespace then atomically
+ * replace the original.
+ */
+static int
+create_tmpfile(
+	const char	*mnt,
+	struct xfs_fd	*xfd,
+	xfs_agnumber_t	agno,
+	char		**tmpfile,
+	int		*tmpfd)
+{
+	char		name[PATH_MAX + 1];
+	mode_t		mask;
+	int		fd;
+	int		i;
+	int		ret;
+
+	/* construct tmpdir */
+	mask = umask(0);
+
+	snprintf(name, PATH_MAX, "%s/.spaceman", mnt);
+	ret = mkdir(name, 0700);
+	if (ret) {
+		if (errno != EEXIST) {
+			fprintf(stderr, _("could not create tmpdir: %s: %s\n"),
+					name, strerror(errno));
+			ret = -errno;
+			goto out_cleanup;
+		}
+	}
+
+	/* loop creating directories until we get one in the right AG */
+	for (i = 0; i < xfd->fsgeom.agcount; i++) {
+		struct stat	st;
+
+		snprintf(name, PATH_MAX, "%s/.spaceman/dir%d", mnt, i);
+		ret = mkdir(name, 0700);
+		if (ret) {
+			if (errno != EEXIST) {
+				fprintf(stderr,
+					_("cannot create tmpdir: %s: %s\n"),
+				       name, strerror(errno));
+				ret = -errno;
+				goto out_cleanup_dir;
+			}
+		}
+		ret = lstat(name, &st);
+		if (ret) {
+			fprintf(stderr, _("cannot stat tmpdir: %s: %s\n"),
+				       name, strerror(errno));
+			ret = -errno;
+			rmdir(name);
+			goto out_cleanup_dir;
+		}
+		if (cvt_ino_to_agno(xfd, st.st_ino) == agno)
+			break;
+
+		/* remove directory in wrong AG */
+		rmdir(name);
+	}
+
+	if (i == xfd->fsgeom.agcount) {
+		/*
+		 * Nothing landed in the selected AG! Must have been skipped
+		 * because the AG is out of space.
+		 */
+		fprintf(stderr, _("Cannot create AG tmpdir.\n"));
+		ret = -ENOSPC;
+		goto out_cleanup_dir;
+	}
+
+	/* create tmpfile */
+	snprintf(name, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d", mnt, i, getpid());
+	fd = open(name, O_CREAT|O_EXCL|O_RDWR, 0700);
+	if (fd < 0) {
+		ret = -errno;
+		perror(name);
+		goto out_cleanup_dir;
+	}
+
+	/* return name and fd */
+	(void)umask(mask);
+	*tmpfd = fd;
+	*tmpfile = strdup(name);
+
+	return 0;
+out_cleanup_dir:
+	snprintf(name, PATH_MAX, "%s/.spaceman", mnt);
+	rmdir(name);
+out_cleanup:
+	(void)umask(mask);
+	return ret;
+}
+
+static int
+get_attr(
+	void		*hdl,
+	size_t		hlen,
+	char		*name,
+	void		*attrbuf,
+	int		*attrlen,
+	int		attr_ns)
+{
+	struct xfs_attr_multiop	ops = {
+		.am_opcode	= ATTR_OP_GET,
+		.am_attrname	= name,
+		.am_attrvalue	= attrbuf,
+		.am_length	= *attrlen,
+		.am_flags	= attr_ns,
+	};
+	int		ret;
+
+	ret = attr_multi_by_handle(hdl, hlen, &ops, 1, 0);
+	if (ret < 0) {
+		fprintf(stderr, _("attr_multi_by_handle(GET): %s\n"),
+			strerror(errno));
+		return -errno;
+	}
+	*attrlen = ops.am_length;
+	return 0;
+}
+
+static int
+set_attr(
+	void		*hdl,
+	size_t		hlen,
+	char		*name,
+	void		*attrbuf,
+	int		attrlen,
+	int		attr_ns)
+{
+	struct xfs_attr_multiop	ops = {
+		.am_opcode	= ATTR_OP_SET,
+		.am_attrname	= name,
+		.am_attrvalue	= attrbuf,
+		.am_length	= attrlen,
+		.am_flags	= ATTR_CREATE | attr_ns,
+	};
+	int		ret;
+
+	ret = attr_multi_by_handle(hdl, hlen, &ops, 1, 0);
+	if (ret < 0) {
+		fprintf(stderr, _("attr_multi_by_handle(SET): %s\n"),
+			strerror(errno));
+		return -errno;
+	}
+	return 0;
+}
+
+/*
+ * Copy all the attributes from the original source file into the replacement
+ * destination.
+ *
+ * Oh the humanity of deprecated Irix compatible attr interfaces that are more
+ * functional and useful than their native Linux replacements!
+ */
+static int
+copy_attrs(
+	int			srcfd,
+	int			dstfd,
+	int			attr_ns)
+{
+	void			*shdl;
+	void			*dhdl;
+	size_t			shlen;
+	size_t			dhlen;
+	attrlist_cursor_t	cursor;
+	attrlist_t		*alist;
+	struct attrlist_ent	*ent;
+	char			alistbuf[XATTR_LIST_MAX];
+	char			attrbuf[XATTR_SIZE_MAX];
+	int			attrlen;
+	int			error;
+	int			i;
+
+	memset(&cursor, 0, sizeof(cursor));
+
+	/*
+	 * All this handle based stuff is hoop jumping to avoid:
+	 *
+	 * a) deprecated API warnings because attr_list, attr_get and attr_set
+	 *    have been deprecated and hence throw compiler warnings; and
+	 *
+	 * b) listxattr() failing hard if there are more than 64kB worth of attr
+	 *    names on the inode so is unusable.
+	 *
+	 * That leaves libhandle as the only usable interface for iterating all
+	 * xattrs on an inode reliably. Lucky for us, libhandle is part of
+	 * xfsprogs, so this hoop jump isn't going to get ripped out from under
+	 * us any time soon.
+	 */
+	error = fd_to_handle(srcfd, (void **)&shdl, &shlen);
+	if (error) {
+		fprintf(stderr, _("fd_to_handle(shdl): %s\n"),
+			strerror(errno));
+		return -errno;
+	}
+	error = fd_to_handle(dstfd, (void **)&dhdl, &dhlen);
+	if (error) {
+		fprintf(stderr, _("fd_to_handle(dhdl): %s\n"),
+			strerror(errno));
+		goto out_free_shdl;
+	}
+
+	/* loop to iterate all xattrs */
+	error = attr_list_by_handle(shdl, shlen, alistbuf,
+					XATTR_LIST_MAX, attr_ns, &cursor);
+	if (error) {
+		fprintf(stderr, _("attr_list_by_handle(shdl): %s\n"),
+			strerror(errno));
+	}
+	while (!error) {
+		alist = (attrlist_t *)alistbuf;
+
+		/*
+		 * We loop one attr at a time for initial implementation
+		 * simplicity. attr_multi_by_handle() can retrieve and set
+		 * multiple attrs in a single call, but that is more complex.
+		 * Get it working first, then optimise.
+		 */
+		for (i = 0; i < alist->al_count; i++) {
+			ent = ATTR_ENTRY(alist, i);
+
+			/* get xattr (val, len) from name */
+			attrlen = XATTR_SIZE_MAX;
+			error = get_attr(shdl, shlen, ent->a_name, attrbuf,
+						&attrlen, attr_ns);
+			if (error)
+				break;
+
+			/* set xattr (val, len) to name */
+			error = set_attr(dhdl, dhlen, ent->a_name, attrbuf,
+						attrlen, attr_ns);
+			if (error)
+				break;
+		}
+
+		if (error || !alist->al_more)
+			break;
+		error = attr_list_by_handle(shdl, shlen, alistbuf,
+					XATTR_LIST_MAX, attr_ns, &cursor);
+	}
+
+	free_handle(dhdl, dhlen);
+out_free_shdl:
+	free_handle(shdl, shlen);
+	return error ? -errno : 0;
+}
+
+/*
+ * scan the range of the new file for data that isn't in the destination AG
+ * and unshare it to create a new copy of it in the current target location
+ * of the new file.
+ */
+#define EXTENT_BATCH 32
+static int
+unshare_data(
+	struct xfs_fd	*xfd,
+	int		destfd,
+	xfs_agnumber_t	agno)
+{
+	int		ret;
+	struct fiemap	*fiemap;
+	int		done = 0;
+	int		fiemap_flags = FIEMAP_FLAG_SYNC;
+	int		i;
+	int		map_size;
+	__u64		last_logical = 0;	/* last extent offset handled */
+	off_t		range_end = -1LL;	/* mapping end*/
+
+	/* fiemap loop over extents */
+	map_size = sizeof(struct fiemap) +
+		(EXTENT_BATCH * sizeof(struct fiemap_extent));
+	fiemap = malloc(map_size);
+	if (!fiemap) {
+		fprintf(stderr, _("%s: malloc of %d bytes failed.\n"),
+			progname, map_size);
+		return -ENOMEM;
+	}
+
+	while (!done) {
+		memset(fiemap, 0, map_size);
+		fiemap->fm_flags = fiemap_flags;
+		fiemap->fm_start = last_logical;
+		fiemap->fm_length = range_end - last_logical;
+		fiemap->fm_extent_count = EXTENT_BATCH;
+
+		ret = ioctl(destfd, FS_IOC_FIEMAP, (unsigned long)fiemap);
+		if (ret < 0) {
+			fprintf(stderr, "%s: ioctl(FS_IOC_FIEMAP): %s\n",
+				progname, strerror(errno));
+			free(fiemap);
+			return -errno;
+		}
+
+		/* No more extents to map, exit */
+		if (!fiemap->fm_mapped_extents)
+			break;
+
+		for (i = 0; i < fiemap->fm_mapped_extents; i++) {
+			struct fiemap_extent	*extent;
+			xfs_agnumber_t		this_agno;
+
+			extent = &fiemap->fm_extents[i];
+			this_agno = cvt_daddr_to_agno(xfd,
+					cvt_btobbt(extent->fe_physical));
+
+			/*
+			 * If extent not in dst AG, unshare whole extent to
+			 * trigger reallocation of the extent to be local to
+			 * the current inode.
+			 */
+			if (this_agno != agno) {
+				ret = fallocate(destfd, FALLOC_FL_UNSHARE_RANGE,
+					extent->fe_logical, extent->fe_length);
+				if (ret) {
+					fprintf(stderr,
+						"%s: fallocate(UNSHARE): %s\n",
+						progname, strerror(errno));
+					return -errno;
+				}
+			}
+
+			last_logical = extent->fe_logical + extent->fe_length;
+
+			/* Kernel has told us there are no more extents */
+			if (extent->fe_flags & FIEMAP_EXTENT_LAST) {
+				done = 1;
+				break;
+			}
+		}
+	}
+	return 0;
+}
+
+/*
+ * Exchange the inodes at the two paths indicated after first ensuring that the
+ * owners, permissions and timestamps are set correctly in the tmpfile.
+ */
+static int
+exchange_inodes(
+	struct xfs_fd	*xfd,
+	int		tmpfd,
+	const char	*tmpfile,
+	const char	*path)
+{
+	struct timespec	ts[2];
+	struct stat	st;
+	int		ret;
+
+	ret = fstat(xfd->fd, &st);
+	if (ret)
+		return -errno;
+
+	/* set user ids */
+	ret = fchown(tmpfd, st.st_uid, st.st_gid);
+	if (ret)
+		return -errno;
+
+	/* set permissions */
+	ret = fchmod(tmpfd, st.st_mode);
+	if (ret)
+		return -errno;
+
+	/* set timestamps */
+	ts[0] = st.st_atim;
+	ts[1] = st.st_mtim;
+	ret = futimens(tmpfd, ts);
+	if (ret)
+		return -errno;
+
+	/* exchange the two inodes */
+	ret = renameat2(AT_FDCWD, tmpfile, AT_FDCWD, path, RENAME_EXCHANGE);
+	if (ret)
+		return -errno;
+	return 0;
+}
+
+static int
+move_file_to_ag(
+	const char		*mnt,
+	const char		*path,
+	struct xfs_fd		*xfd,
+	xfs_agnumber_t		agno)
+{
+	int			ret;
+	int			tmpfd = -1;
+	char			*tmpfile = NULL;
+
+	fprintf(stderr, "move mnt %s, path %s, agno %d\n", mnt, path, agno);
+
+	/* create temporary file in agno */
+	ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd);
+	if (ret)
+		return ret;
+
+	/* clone data to tempfile */
+	ret = ioctl(tmpfd, FICLONE, xfd->fd);
+	if (ret)
+		goto out_cleanup;
+
+	/* copy system attributes to tempfile */
+	ret = copy_attrs(xfd->fd, tmpfd, ATTR_ROOT);
+	if (ret)
+		goto out_cleanup;
+
+	/* copy user attributes to tempfile */
+	ret = copy_attrs(xfd->fd, tmpfd, 0);
+	if (ret)
+		goto out_cleanup;
+
+	/* unshare data to move it */
+	ret = unshare_data(xfd, tmpfd, agno);
+	if (ret)
+		goto out_cleanup;
+
+	/* swap the inodes over */
+	ret = exchange_inodes(xfd, tmpfd, tmpfile, path);
+
+out_cleanup:
+	if (ret == -1)
+		ret = -errno;
+
+	close(tmpfd);
+	if (tmpfile)
+		unlink(tmpfile);
+	free(tmpfile);
+
+	return ret;
+}
+
+static int
+move_inode_f(
+	int			argc,
+	char			**argv)
+{
+	void			*fshandle;
+	size_t			fshdlen;
+	xfs_agnumber_t		agno = 0;
+	struct stat		st;
+	int			ret;
+	int			c;
+
+	while ((c = getopt(argc, argv, "a:")) != EOF) {
+		switch (c) {
+		case 'a':
+			agno = cvt_u32(optarg, 10);
+			if (errno) {
+				fprintf(stderr, _("bad agno value %s\n"),
+					optarg);
+				return command_usage(&move_inode_cmd);
+			}
+			break;
+		default:
+			return command_usage(&move_inode_cmd);
+		}
+	}
+
+	if (optind != argc)
+		return command_usage(&move_inode_cmd);
+
+	if (agno >= file->xfd.fsgeom.agcount) {
+		fprintf(stderr,
+_("Destination AG %d does not exist. Filesystem only has %d AGs\n"),
+			agno, file->xfd.fsgeom.agcount);
+		exitcode = 1;
+		return 0;
+	}
+
+	/* this is so we can use fd_to_handle() later on */
+	ret = path_to_fshandle(file->fs_path.fs_dir, &fshandle, &fshdlen);
+	if (ret < 0) {
+		fprintf(stderr, _("Cannot get fshandle for mount %s: %s\n"),
+			file->fs_path.fs_dir, strerror(errno));
+		goto exit_fail;
+	}
+
+	ret = fstat(file->xfd.fd, &st);
+	if (ret) {
+		fprintf(stderr, _("stat(%s) failed: %s\n"),
+			file->name, strerror(errno));
+		goto exit_fail;
+	}
+
+	if (S_ISREG(st.st_mode)) {
+		ret = move_file_to_ag(file->fs_path.fs_dir, file->name,
+				&file->xfd, agno);
+	} else {
+		fprintf(stderr, _("Unsupported: %s is not a regular file.\n"),
+			file->name);
+		goto exit_fail;
+	}
+
+	if (ret) {
+		fprintf(stderr, _("Failed to move inode to AG %d: %s\n"),
+			agno, strerror(-ret));
+		goto exit_fail;
+	}
+	fshandle_destroy();
+	return 0;
+
+exit_fail:
+	fshandle_destroy();
+	exitcode = 1;
+	return 0;
+}
+
+static void
+move_inode_help(void)
+{
+	printf(_(
+"\n"
+"Physically move an inode into a new allocation group\n"
+"\n"
+" -a agno       -- destination AG agno for the current open file\n"
+"\n"));
+
+}
+
+void
+move_inode_init(void)
+{
+	move_inode_cmd.name = "move_inode";
+	move_inode_cmd.altname = "mvino";
+	move_inode_cmd.cfunc = move_inode_f;
+	move_inode_cmd.argmin = 2;
+	move_inode_cmd.argmax = 2;
+	move_inode_cmd.args = "-a agno";
+	move_inode_cmd.flags = CMD_FLAG_ONESHOT;
+	move_inode_cmd.oneline = _("Move an inode into a new AG.");
+	move_inode_cmd.help = move_inode_help;
+
+	add_command(&move_inode_cmd);
+}
diff --git a/spaceman/space.h b/spaceman/space.h
index 509e923375f42f..96c3c356f13fec 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -38,5 +38,6 @@ extern void	clearfree_init(void);
 #endif
 extern void	info_init(void);
 extern void	health_init(void);
+void		move_inode_init(void);
 
 #endif /* XFS_SPACEMAN_SPACE_H_ */



* [PATCH 07/11] spaceman: find owners of space in an AG
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-12-31 23:46   ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong
@ 2024-12-31 23:46   ` Darrick J. Wong
  2024-12-31 23:46   ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Before we can move inodes for a shrink operation, we have to find
all the inodes that own space in the AG(s) we want to empty.

This implementation uses FS_IOC_GETFSMAP on the assumption that
filesystems to be shrunk have reverse mapping enabled as it is the
only way to identify inode related metadata that userspace is unable
to see or influence (e.g. BMBT blocks) that may be located in the
specific AG. We can use GETFSMAP to identify both inodes to be moved
(via XFS_FMR_OWN_INODES records) and inodes with just data and/or
metadata to be moved.
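
The record triage described above can be sketched as a small classifier.
Everything prefixed sketch_/SKETCH_ is hypothetical: the struct is a cut
down stand-in for struct fsmap, and the two constants are placeholders
for FMR_OF_SPECIAL_OWNER and XFS_FMR_OWN_INODES, which real code would
take from the kernel and XFS headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; use the definitions from the real headers instead. */
#define SKETCH_OF_SPECIAL_OWNER	0x10u
#define SKETCH_OWN_INODES	((((uint64_t)'X') << 32) | 5)

/* Cut down stand-in for the fields of a GETFSMAP record used here. */
struct sketch_fsmap {
	uint32_t	flags;
	uint64_t	owner;
	uint64_t	physical;
	uint64_t	length;
};

enum sketch_action {
	SKETCH_IGNORE,			/* other special owners: not ours */
	SKETCH_MOVE_INODE_CHUNK,	/* the inodes themselves live here */
	SKETCH_MOVE_BLOCKS,		/* an inode owns data or metadata here */
};

static enum sketch_action
sketch_classify(
	const struct sketch_fsmap	*rec)
{
	assert(rec != NULL);

	if (rec->flags & SKETCH_OF_SPECIAL_OWNER) {
		if (rec->owner == SKETCH_OWN_INODES)
			return SKETCH_MOVE_INODE_CHUNK;
		return SKETCH_IGNORE;
	}
	/* Without the special-owner flag, the owner field is an inode number. */
	return SKETCH_MOVE_BLOCKS;
}
```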

Once we have identified all the inodes to be moved, we have to
map them to paths so that we can use renameat2() to exchange the
directory entries pointing at the moved inode atomically. We also
need to record inodes with hard links and all of the paths to the
inode so that hard links can be recreated appropriately.

This requires a directory tree walk to discover the paths (until
parent pointers are a thing). Hence for filesystems that aren't
reverse mapping enabled, we can eventually use this pass to discover
inodes with visible data and metadata that need to be moved.

As we resolve the paths to the inodes to be moved, output the
information to stdout so that it can be acted upon by other
utilities. This results in a command that acts similarly to find but
with a physical location filter rather than an inode metadata
filter.

Again, this is not meant to be an optimal implementation. It
shouldn't suck, but there is plenty of scope for performance
optimisation, especially with a multithreaded and/or async directory
traversal/parent pointer path resolution process to hide access
latencies.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libfrog/fsgeom.h        |   19 ++
 libfrog/radix-tree.c    |    2 
 libfrog/radix-tree.h    |    2 
 man/man8/xfs_spaceman.8 |   11 +
 spaceman/Makefile       |    1 
 spaceman/find_owner.c   |  481 +++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/init.c         |    4 
 spaceman/space.h        |    2 
 8 files changed, 521 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/find_owner.c


diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h
index b851b9bbf36a58..679046077cba84 100644
--- a/libfrog/fsgeom.h
+++ b/libfrog/fsgeom.h
@@ -97,6 +97,25 @@ cvt_ino_to_agino(
 	return ino & ((1ULL << xfd->aginolog) - 1);
 }
 
+/* Convert an AG block to an AG inode number. */
+static inline uint32_t
+cvt_agbno_to_agino(
+	const struct xfs_fd	*xfd,
+	xfs_agblock_t		agbno)
+{
+	return agbno << xfd->inopblog;
+}
+
+/* Calculate the number of inodes in a byte range */
+static inline uint32_t
+cvt_b_to_inode_count(
+	const struct xfs_fd	*xfd,
+	uint64_t		bytes)
+{
+	return (bytes >> xfd->blocklog) << xfd->inopblog;
+}
+
+
 /*
  * Convert a linear fs block offset number into bytes.  This is the runtime
  * equivalent of XFS_FSB_TO_B, which means that it is /not/ for segmented fsbno
diff --git a/libfrog/radix-tree.c b/libfrog/radix-tree.c
index 261fc2487de97f..788d11612e290f 100644
--- a/libfrog/radix-tree.c
+++ b/libfrog/radix-tree.c
@@ -377,6 +377,8 @@ void *radix_tree_tag_set(struct radix_tree_root *root,
 	unsigned int height, shift;
 	struct radix_tree_node *slot;
 
+	ASSERT(tag < RADIX_TREE_MAX_TAGS);
+
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
diff --git a/libfrog/radix-tree.h b/libfrog/radix-tree.h
index 0a4e3bb4f9defc..73f41a9d902a26 100644
--- a/libfrog/radix-tree.h
+++ b/libfrog/radix-tree.h
@@ -28,7 +28,7 @@ do {									\
 } while (0)
 
 #ifdef RADIX_TREE_TAGS
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
 #endif
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8
index f898a8bbe840ea..6fef6949aa6c8b 100644
--- a/man/man8/xfs_spaceman.8
+++ b/man/man8/xfs_spaceman.8
@@ -41,6 +41,14 @@ .SH COMMANDS
 If the
 .B -v
 option is given, print what's happening every step of the way.
+.TP
+.BI "find_owner \-a agno"
+Create an internal structure to map physical space in the given allocation
+group to file paths.
+This enables space reorganization on a mounted filesystem by enabling
+us to find files.
+Unclear why we can't just use FSMAP and BULKSTAT to open by handle.
+
 .TP
 .BI "freesp [ \-dgrs ] [-a agno]... [ \-b | \-e bsize | \-h bsize | \-m factor ]"
 With no arguments,
@@ -195,6 +203,9 @@ .SH COMMANDS
 .B print
 Display a list of all open files.
 .TP
+.B resolve_owner
+Resolves space in the filesystem to file paths, maybe?
+.TP
 .B quit
 Exit
 .BR xfs_spaceman .
diff --git a/spaceman/Makefile b/spaceman/Makefile
index 9d080b67de9a22..b35ab1dbd2f440 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -11,6 +11,7 @@ HFILES = \
 	space.h
 CFILES = \
 	file.c \
+	find_owner.c \
 	health.c \
 	info.c \
 	init.c \
diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c
new file mode 100644
index 00000000000000..7a656d80d21217
--- /dev/null
+++ b/spaceman/find_owner.c
@@ -0,0 +1,481 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2017 Oracle.
+ * Copyright (c) 2020 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+
+#include "libxfs.h"
+#include <linux/fiemap.h>
+#include "libfrog/fsgeom.h"
+#include "libfrog/radix-tree.h"
+#include "command.h"
+#include "init.h"
+#include "libfrog/paths.h"
+#include <linux/fsmap.h>
+#include "space.h"
+#include "input.h"
+
+static cmdinfo_t find_owner_cmd;
+static cmdinfo_t resolve_owner_cmd;
+
+#define NR_EXTENTS 128
+
+static RADIX_TREE(inode_tree, 0);
+#define MOVE_INODE	0
+#define MOVE_BLOCKS	1
+#define INODE_PATH	2
+int inode_count;
+int inode_paths;
+
+static void
+track_inode_chunks(
+	struct xfs_fd	*xfd,
+	xfs_agnumber_t	agno,
+	uint64_t	physaddr,
+	uint64_t	length)
+{
+	xfs_agblock_t	agbno = cvt_b_to_agbno(xfd, physaddr);
+	uint64_t	first_ino = cvt_agino_to_ino(xfd, agno,
+						cvt_agbno_to_agino(xfd, agbno));
+	uint64_t	num_inodes = cvt_b_to_inode_count(xfd, length);
+	int		i;
+
+	printf(_("AG %d\tInode Range to move: 0x%llx - 0x%llx (length 0x%llx)\n"),
+			agno,
+			(unsigned long long)first_ino,
+			(unsigned long long)first_ino + num_inodes - 1,
+			(unsigned long long)length);
+
+	for (i = 0; i < num_inodes; i++) {
+		if (!radix_tree_lookup(&inode_tree, first_ino + i)) {
+			radix_tree_insert(&inode_tree, first_ino + i,
+					(void *)(first_ino + i));
+			inode_count++;
+		}
+		radix_tree_tag_set(&inode_tree, first_ino + i, MOVE_INODE);
+	}
+}
+
+static void
+track_inode(
+	struct xfs_fd	*xfd,
+	xfs_agnumber_t	agno,
+	uint64_t	owner,
+	uint64_t	physaddr,
+	uint64_t	length)
+{
+	if (radix_tree_tag_get(&inode_tree, owner, MOVE_BLOCKS))
+		return;
+
+	printf(_("AG %d\tInode 0x%llx: blocks to move: 0x%llx - 0x%llx\n"),
+			agno,
+			(unsigned long long)owner,
+			(unsigned long long)physaddr,
+			(unsigned long long)physaddr + length - 1);
+	if (!radix_tree_lookup(&inode_tree, owner)) {
+		radix_tree_insert(&inode_tree, owner, (void *)owner);
+		inode_count++;
+	}
+	radix_tree_tag_set(&inode_tree, owner, MOVE_BLOCKS);
+}
+
+static void
+scan_ag(
+	xfs_agnumber_t		agno)
+{
+	struct fsmap_head	*fsmap;
+	struct fsmap		*extent;
+	struct fsmap		*l, *h;
+	struct fsmap		*p;
+	struct xfs_fd		*xfd = &file->xfd;
+	int			ret;
+	int			i;
+
+	fsmap = malloc(fsmap_sizeof(NR_EXTENTS));
+	if (!fsmap) {
+		fprintf(stderr, _("%s: fsmap malloc failed.\n"), progname);
+		exitcode = 1;
+		return;
+	}
+
+	memset(fsmap, 0, sizeof(*fsmap));
+	fsmap->fmh_count = NR_EXTENTS;
+	l = fsmap->fmh_keys;
+	h = fsmap->fmh_keys + 1;
+	l->fmr_physical = cvt_agbno_to_b(xfd, agno, 0);
+	h->fmr_physical = cvt_agbno_to_b(xfd, agno + 1, 0);
+	l->fmr_device = h->fmr_device = file->fs_path.fs_datadev;
+	h->fmr_owner = ULLONG_MAX;
+	h->fmr_flags = UINT_MAX;
+	h->fmr_offset = ULLONG_MAX;
+
+	while (true) {
+		printf("Inode count %d\n", inode_count);
+		ret = ioctl(xfd->fd, FS_IOC_GETFSMAP, fsmap);
+		if (ret < 0) {
+			fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"),
+				progname, file->name, strerror(errno));
+			free(fsmap);
+			exitcode = 1;
+			return;
+		}
+
+		/* No more extents to map, exit */
+		if (!fsmap->fmh_entries)
+			break;
+
+		/*
+		 * Walk the extents, ignore everything except inode chunks
+		 * and inode owned blocks.
+		 */
+		for (i = 0, extent = fsmap->fmh_recs;
+		     i < fsmap->fmh_entries;
+		     i++, extent++) {
+			if (extent->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+				if (extent->fmr_owner != XFS_FMR_OWN_INODES)
+					continue;
+				/*
+				 * This extent contains inodes that need to be
+				 * moved into another AG. Convert the extent to
+				 * a range of inode numbers and track them all.
+				 */
+				track_inode_chunks(xfd, agno,
+							extent->fmr_physical,
+							extent->fmr_length);
+
+				continue;
+			}
+
+			/*
+			 * Extent is owned by an inode that may be located
+			 * anywhere in the filesystem, not just this AG.
+			 */
+			track_inode(xfd, agno, extent->fmr_owner,
+					extent->fmr_physical,
+					extent->fmr_length);
+		}
+
+		p = &fsmap->fmh_recs[fsmap->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+		fsmap_advance(fsmap);
+	}
+
+	free(fsmap);
+}
+
+/*
+ * find inodes that own physical space in a given AG.
+ */
+static int
+find_owner_f(
+	int			argc,
+	char			**argv)
+{
+	xfs_agnumber_t		agno = -1;
+	int			c;
+
+	while ((c = getopt(argc, argv, "a:")) != EOF) {
+		switch (c) {
+		case 'a':
+			agno = cvt_u32(optarg, 10);
+			if (errno) {
+				fprintf(stderr, _("bad agno value %s\n"),
+					optarg);
+				return command_usage(&find_owner_cmd);
+			}
+			break;
+		default:
+			return command_usage(&find_owner_cmd);
+		}
+	}
+
+	if (optind != argc)
+		return command_usage(&find_owner_cmd);
+
+	if (agno == -1 || agno >= file->xfd.fsgeom.agcount) {
+		fprintf(stderr,
+_("Destination AG %d does not exist. Filesystem only has %d AGs\n"),
+			agno, file->xfd.fsgeom.agcount);
+		exitcode = 1;
+		return 0;
+	}
+
+	/*
+	 * Check that rmap is enabled so that GETFSMAP is actually useful.
+	 */
+	if (!(file->xfd.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT)) {
+		fprintf(stderr,
+_("Filesystem at %s does not have reverse mapping enabled. Aborting.\n"),
+			file->fs_path.fs_dir);
+		exitcode = 1;
+		return 0;
+	}
+
+	scan_ag(agno);
+	return 0;
+}
+
+static void
+find_owner_help(void)
+{
+	printf(_(
+"\n"
+"Find inodes owning physical blocks in a given AG.\n"
+"\n"
+" -a agno  -- Scan the given AG agno.\n"
+"\n"));
+
+}
+
+void
+find_owner_init(void)
+{
+	find_owner_cmd.name = "find_owner";
+	find_owner_cmd.altname = "fown";
+	find_owner_cmd.cfunc = find_owner_f;
+	find_owner_cmd.argmin = 2;
+	find_owner_cmd.argmax = 2;
+	find_owner_cmd.args = "-a agno";
+	find_owner_cmd.flags = CMD_FLAG_ONESHOT;
+	find_owner_cmd.oneline = _("Find inodes owning physical blocks in a given AG");
+	find_owner_cmd.help = find_owner_help;
+
+	add_command(&find_owner_cmd);
+}
+
+/*
+ * for each dirent we get returned, look up the inode tree to see if it is an
+ * inode we need to process. If it is, then replace the entry in the tree with
+ * a structure containing the current path and mark the entry as resolved.
+ */
+struct inode_path {
+	uint64_t		ino;
+	struct list_head	path_list;
+	uint32_t		link_count;
+	char			path[1];
+};
+
+static int
+resolve_owner_cb(
+	const char		*path,
+	const struct stat	*stat,
+	int			status,
+	struct FTW		*data)
+{
+	struct inode_path	*ipath, *slot_ipath;
+	int			pathlen;
+	void			**slot;
+
+	/*
+	 * Lookup the slot rather than the entry so we can replace the contents
+	 * without another lookup later on.
+	 */
+	slot = radix_tree_lookup_slot(&inode_tree, stat->st_ino);
+	if (!slot || *slot == NULL)
+		return 0;
+
+	/* Could not get stat data? Fail! */
+	if (status == FTW_NS) {
+		fprintf(stderr,
+_("Failed to obtain stat(2) information from path %s. Aborting\n"),
+			path);
+		return -EPERM;
+	}
+
+	/* Allocate a new inode path and record the path in it. */
+	pathlen = strlen(path);
+	ipath = calloc(1, sizeof(*ipath) + pathlen + 1);
+	if (!ipath) {
+		fprintf(stderr,
+_("Aborting: Storing path %s for inode 0x%lx failed: %s\n"),
+			path, stat->st_ino, strerror(ENOMEM));
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&ipath->path_list);
+	memcpy(&ipath->path[0], path, pathlen);
+	ipath->ino = stat->st_ino;
+
+	/*
+	 * If the slot contains the inode number we just looked up, then we
+	 * haven't recorded a path for it yet. If that is the case, we just
+	 * set the link count of the path to 1 and replace the slot contents
+	 * with our new_ipath.
+	 */
+	if (stat->st_ino == (uint64_t)*slot) {
+		ipath->link_count = 1;
+		*slot = ipath;
+		radix_tree_tag_set(&inode_tree, stat->st_ino, INODE_PATH);
+		inode_paths++;
+		return 0;
+	}
+
+	/*
+	 * Multiple hard links to this inode. The slot already contains an
+	 * ipath pointer, so we add the new ipath to the tail of the list held
+	 * by the slot's ipath and bump the link count of the slot's ipath to
+	 * keep track of how many hard links the inode has.
+	 */
+	slot_ipath = *slot;
+	slot_ipath->link_count++;
+	list_add_tail(&ipath->path_list, &slot_ipath->path_list);
+	return 0;
+}
+
+/*
+ * This should be parallelised - pass subdirs off to a work queue, have the
+ * work queue process subdirs, queueing more subdirs to work on.
+ */
+static int
+walk_mount(
+	const char	*mntpt)
+{
+	int		ret;
+
+	ret = nftw(mntpt, resolve_owner_cb,
+                        100, FTW_PHYS | FTW_MOUNT | FTW_DEPTH);
+	if (ret)
+		return -errno;
+	return 0;
+}
+
+static int
+list_inode_paths(void)
+{
+	struct inode_path	*ipath;
+	uint64_t		idx = 0;
+	int			ret;
+
+	do {
+		bool		move_blocks;
+		bool		move_inode;
+
+		ret = radix_tree_gang_lookup_tag(&inode_tree, (void **)&ipath,
+						idx, 1, INODE_PATH);
+		if (!ret)
+			break;
+		idx = ipath->ino + 1;
+
+		/* Grab status tags and remove from tree. */
+		move_blocks = radix_tree_tag_get(&inode_tree, ipath->ino,
+						MOVE_BLOCKS);
+		move_inode = radix_tree_tag_get(&inode_tree, ipath->ino,
+						MOVE_INODE);
+		radix_tree_delete(&inode_tree, ipath->ino);
+
+		/* Print the initial path with inode number and state. */
+		printf("0x%.16llx\t%s\t%s\t%8d\t%s\n",
+				(unsigned long long)ipath->ino,
+				move_blocks ? "BLOCK" : "---",
+				move_inode ? "INODE" : "---",
+				ipath->link_count, ipath->path);
+		ipath->link_count--;
+
+		/* Walk all the hard link paths and emit them. */
+		while (!list_empty(&ipath->path_list)) {
+			struct inode_path	*hpath;
+
+			hpath = list_first_entry(&ipath->path_list,
+					struct inode_path, path_list);
+			list_del(&hpath->path_list);
+			ipath->link_count--;
+
+			printf("\t\t\t\t\t%s\n", hpath->path);
+		}
+		if (ipath->link_count) {
+			printf(_("Link count anomaly: %d paths left over\n"),
+				ipath->link_count);
+		}
+		free(ipath);
+	} while (true);
+
+	/*
+	 * Any inodes remaining in the tree at this point indicate inodes whose
+	 * paths were not found. This will be unlinked but still open inodes or
+	 * lost inodes due to corruptions. Either way, a shrink will not succeed
+	 * until these inodes are removed from the filesystem.
+	 */
+	idx = 0;
+	do {
+		uint64_t	ino;
+
+
+		ret = radix_tree_gang_lookup(&inode_tree, (void **)&ino, idx, 1);
+		if (!ret) {
+			if (idx != 0)
+				ret = -EBUSY;
+			break;
+		}
+		idx = ino + 1;
+		printf(_("No path found for inode 0x%llx!\n"),
+				(unsigned long long)ino);
+		radix_tree_delete(&inode_tree, ino);
+	} while (true);
+
+	return ret;
+}
+
+/*
+ * Resolve inode numbers to paths via a directory tree walk.
+ */
+static int
+resolve_owner_f(
+	int	argc,
+	char	**argv)
+{
+	int	ret;
+
+	if (!inode_tree.rnode) {
+		fprintf(stderr,
+_("Inode list has not been populated. No inodes to resolve.\n"));
+		return 0;
+	}
+
+	ret = walk_mount(file->fs_path.fs_dir);
+	if (ret) {
+		fprintf(stderr,
+_("Failed to resolve all paths from mount point %s: %s\n"),
+			file->fs_path.fs_dir, strerror(-ret));
+		exitcode = 1;
+		return 0;
+	}
+
+	ret = list_inode_paths();
+	if (ret) {
+		fprintf(stderr,
+_("Failed to list all resolved paths from mount point %s: %s\n"),
+			file->fs_path.fs_dir, strerror(-ret));
+		exitcode = 1;
+		return 0;
+	}
+	return 0;
+}
+
+static void
+resolve_owner_help(void)
+{
+	printf(_(
+"\n"
+"Resolve inodes owning physical blocks in a given AG.\n"
+"This requires the find_owner command to be run first to populate the table\n"
+"of inodes that need to have their paths resolved.\n"
+"\n"));
+
+}
+
+void
+resolve_owner_init(void)
+{
+	resolve_owner_cmd.name = "resolve_owner";
+	resolve_owner_cmd.altname = "rown";
+	resolve_owner_cmd.cfunc = resolve_owner_f;
+	resolve_owner_cmd.argmin = 0;
+	resolve_owner_cmd.argmax = 0;
+	resolve_owner_cmd.args = "";
+	resolve_owner_cmd.flags = CMD_FLAG_ONESHOT;
+	resolve_owner_cmd.oneline = _("Resolve paths to inodes owning physical blocks in a given AG");
+	resolve_owner_cmd.help = resolve_owner_help;
+
+	add_command(&resolve_owner_cmd);
+}
diff --git a/spaceman/init.c b/spaceman/init.c
index dbeebcf97b9fb2..8b0af14e566dc8 100644
--- a/spaceman/init.c
+++ b/spaceman/init.c
@@ -10,6 +10,7 @@
 #include "input.h"
 #include "init.h"
 #include "libfrog/paths.h"
+#include "libfrog/radix-tree.h"
 #include "space.h"
 
 char	*progname;
@@ -37,6 +38,8 @@ init_commands(void)
 	health_init();
 	clearfree_init();
 	move_inode_init();
+	find_owner_init();
+	resolve_owner_init();
 }
 
 static int
@@ -71,6 +74,7 @@ init(
 	setlocale(LC_ALL, "");
 	bindtextdomain(PACKAGE, LOCALEDIR);
 	textdomain(PACKAGE);
+	radix_tree_init();
 
 	fs_table_initialise(0, NULL, 0, NULL);
 	while ((c = getopt(argc, argv, "c:p:V")) != EOF) {
diff --git a/spaceman/space.h b/spaceman/space.h
index 96c3c356f13fec..cffb1882153a18 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -39,5 +39,7 @@ extern void	clearfree_init(void);
 extern void	info_init(void);
 extern void	health_init(void);
 void		move_inode_init(void);
+void		find_owner_init(void);
+void		resolve_owner_init(void);
 
 #endif /* XFS_SPACEMAN_SPACE_H_ */



* [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-12-31 23:46   ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong
@ 2024-12-31 23:46   ` Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Wrap the raw radix tree accesses here so that we can provide an
alternate implementation on platforms where radix tree indices cannot
store a full 64-bit inode number.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 spaceman/Makefile     |    1 
 spaceman/find_owner.c |   76 +++++++++------------------------
 spaceman/relocation.c |  114 +++++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/relocation.h |   46 ++++++++++++++++++++
 4 files changed, 183 insertions(+), 54 deletions(-)
 create mode 100644 spaceman/relocation.c
 create mode 100644 spaceman/relocation.h


diff --git a/spaceman/Makefile b/spaceman/Makefile
index b35ab1dbd2f440..8980208285f610 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -17,6 +17,7 @@ CFILES = \
 	init.c \
 	move_inode.c \
 	prealloc.c \
+	relocation.c \
 	trim.c
 LSRCFILES = xfs_info.sh
 
diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c
index 7a656d80d21217..80b239f9ac5de8 100644
--- a/spaceman/find_owner.c
+++ b/spaceman/find_owner.c
@@ -15,19 +15,13 @@
 #include <linux/fsmap.h>
 #include "space.h"
 #include "input.h"
+#include "relocation.h"
 
 static cmdinfo_t find_owner_cmd;
 static cmdinfo_t resolve_owner_cmd;
 
 #define NR_EXTENTS 128
 
-static RADIX_TREE(inode_tree, 0);
-#define MOVE_INODE	0
-#define MOVE_BLOCKS	1
-#define INODE_PATH	2
-int inode_count;
-int inode_paths;
-
 static void
 track_inode_chunks(
 	struct xfs_fd	*xfd,
@@ -39,7 +33,7 @@ track_inode_chunks(
 	uint64_t	first_ino = cvt_agino_to_ino(xfd, agno,
 						cvt_agbno_to_agino(xfd, agbno));
 	uint64_t	num_inodes = cvt_b_to_inode_count(xfd, length);
-	int		i;
+	uint64_t	i;
 
 	printf(_("AG %d\tInode Range to move: 0x%llx - 0x%llx (length 0x%llx)\n"),
 			agno,
@@ -47,14 +41,8 @@ track_inode_chunks(
 			(unsigned long long)first_ino + num_inodes - 1,
 			(unsigned long long)length);
 
-	for (i = 0; i < num_inodes; i++) {
-		if (!radix_tree_lookup(&inode_tree, first_ino + i)) {
-			radix_tree_insert(&inode_tree, first_ino + i,
-					(void *)first_ino + i);
-			inode_count++;
-		}
-		radix_tree_tag_set(&inode_tree, first_ino + i, MOVE_INODE);
-	}
+	for (i = 0; i < num_inodes; i++)
+		set_reloc_iflag(first_ino + i, MOVE_INODE);
 }
 
 static void
@@ -65,7 +53,7 @@ track_inode(
 	uint64_t	physaddr,
 	uint64_t	length)
 {
-	if (radix_tree_tag_get(&inode_tree, owner, MOVE_BLOCKS))
+	if (test_reloc_iflag(owner, MOVE_BLOCKS))
 		return;
 
 	printf(_("AG %d\tInode 0x%llx: blocks to move: 0x%llx - 0x%llx\n"),
@@ -73,11 +61,8 @@ track_inode(
 			(unsigned long long)owner,
 			(unsigned long long)physaddr,
 			(unsigned long long)physaddr + length - 1);
-	if (!radix_tree_lookup(&inode_tree, owner)) {
-		radix_tree_insert(&inode_tree, owner, (void *)owner);
-		inode_count++;
-	}
-	radix_tree_tag_set(&inode_tree, owner, MOVE_BLOCKS);
+
+	set_reloc_iflag(owner, MOVE_BLOCKS);
 }
 
 static void
@@ -111,7 +96,7 @@ scan_ag(
 	h->fmr_offset = ULLONG_MAX;
 
 	while (true) {
-		printf("Inode count %d\n", inode_count);
+		printf("Inode count %llu\n", get_reloc_count());
 		ret = ioctl(xfd->fd, FS_IOC_GETFSMAP, fsmap);
 		if (ret < 0) {
 			fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"),
@@ -245,18 +230,6 @@ find_owner_init(void)
 	add_command(&find_owner_cmd);
 }
 
-/*
- * for each dirent we get returned, look up the inode tree to see if it is an
- * inode we need to process. If it is, then replace the entry in the tree with
- * a structure containing the current path and mark the entry as resolved.
- */
-struct inode_path {
-	uint64_t		ino;
-	struct list_head	path_list;
-	uint32_t		link_count;
-	char			path[1];
-};
-
 static int
 resolve_owner_cb(
 	const char		*path,
@@ -266,14 +239,14 @@ resolve_owner_cb(
 {
 	struct inode_path	*ipath, *slot_ipath;
 	int			pathlen;
-	void			**slot;
+	struct inode_path	**slot;
 
 	/*
 	 * Lookup the slot rather than the entry so we can replace the contents
 	 * without another lookup later on.
 	 */
-	slot = radix_tree_lookup_slot(&inode_tree, stat->st_ino);
-	if (!slot || *slot == NULL)
+	slot = get_reloc_ipath_slot(stat->st_ino);
+	if (!slot)
 		return 0;
 
 	/* Could not get stat data? Fail! */
@@ -303,11 +276,10 @@ _("Aborting: Storing path %s for inode 0x%lx failed: %s\n"),
 	 * set the link count of the path to 1 and replace the slot contents
 	 * with our new_ipath.
 	 */
-	if (stat->st_ino == (uint64_t)*slot) {
+	if (*slot == UNLINKED_IPATH) {
 		ipath->link_count = 1;
 		*slot = ipath;
-		radix_tree_tag_set(&inode_tree, stat->st_ino, INODE_PATH);
-		inode_paths++;
+		set_reloc_iflag(stat->st_ino, INODE_PATH);
 		return 0;
 	}
 
@@ -351,18 +323,15 @@ list_inode_paths(void)
 		bool		move_blocks;
 		bool		move_inode;
 
-		ret = radix_tree_gang_lookup_tag(&inode_tree, (void **)&ipath,
-						idx, 1, INODE_PATH);
-		if (!ret)
+		ipath = get_next_reloc_ipath(idx);
+		if (!ipath)
 			break;
 		idx = ipath->ino + 1;
 
 		/* Grab status tags and remove from tree. */
-		move_blocks = radix_tree_tag_get(&inode_tree, ipath->ino,
-						MOVE_BLOCKS);
-		move_inode = radix_tree_tag_get(&inode_tree, ipath->ino,
-						MOVE_INODE);
-		radix_tree_delete(&inode_tree, ipath->ino);
+		move_blocks = test_reloc_iflag(ipath->ino, MOVE_BLOCKS);
+		move_inode = test_reloc_iflag(ipath->ino, MOVE_INODE);
+		forget_reloc_ino(ipath->ino);
 
 		/* Print the initial path with inode number and state. */
 		printf("0x%.16llx\t%s\t%s\t%8d\t%s\n",
@@ -400,9 +369,8 @@ list_inode_paths(void)
 	do {
 		uint64_t	ino;
 
-
-		ret = radix_tree_gang_lookup(&inode_tree, (void **)&ino, idx, 1);
-		if (!ret) {
+		ino = get_next_reloc_unlinked(idx);
+		if (!ino) {
 			if (idx != 0)
 				ret = -EBUSY;
 			break;
@@ -410,7 +378,7 @@ list_inode_paths(void)
 		idx = ino + 1;
 		printf(_("No path found for inode 0x%llx!\n"),
 				(unsigned long long)ino);
-		radix_tree_delete(&inode_tree, ino);
+		forget_reloc_ino(ino);
 	} while (true);
 
 	return ret;
@@ -426,7 +394,7 @@ resolve_owner_f(
 {
 	int	ret;
 
-	if (!inode_tree.rnode) {
+	if (!is_reloc_populated()) {
 		fprintf(stderr,
 _("Inode list has not been populated. No inodes to resolve.\n"));
 		return 0;
diff --git a/spaceman/relocation.c b/spaceman/relocation.c
new file mode 100644
index 00000000000000..7c7d9a2b4b236f
--- /dev/null
+++ b/spaceman/relocation.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+
+#include "libxfs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/radix-tree.h"
+#include "libfrog/paths.h"
+#include "command.h"
+#include "init.h"
+#include "space.h"
+#include "input.h"
+#include "relocation.h"
+#include "handle.h"
+
+static unsigned long long inode_count;
+static unsigned long long inode_paths;
+
+unsigned long long
+get_reloc_count(void)
+{
+	return inode_count;
+}
+
+static RADIX_TREE(relocation_data, 0);
+
+bool
+is_reloc_populated(void)
+{
+	return relocation_data.rnode != NULL;
+}
+
+bool
+test_reloc_iflag(
+	uint64_t	ino,
+	unsigned int	flag)
+{
+	return radix_tree_tag_get(&relocation_data, ino, flag);
+}
+
+void
+set_reloc_iflag(
+	uint64_t	ino,
+	unsigned int	flag)
+{
+	if (!radix_tree_lookup(&relocation_data, ino)) {
+		radix_tree_insert(&relocation_data, ino, UNLINKED_IPATH);
+		if (flag != INODE_PATH)
+			inode_count++;
+	}
+	if (flag == INODE_PATH)
+		inode_paths++;
+
+	radix_tree_tag_set(&relocation_data, ino, flag);
+}
+
+struct inode_path *
+get_next_reloc_ipath(
+	uint64_t	ino)
+{
+	struct inode_path	*ipath;
+	int			ret;
+
+	ret = radix_tree_gang_lookup_tag(&relocation_data, (void **)&ipath,
+			ino, 1, INODE_PATH);
+	if (!ret)
+		return NULL;
+	return ipath;
+}
+
+uint64_t
+get_next_reloc_unlinked(
+	uint64_t	ino)
+{
+	uint64_t	next_ino;
+	int		ret;
+
+	ret = radix_tree_gang_lookup(&relocation_data, (void **)&next_ino, ino,
+			1);
+	if (!ret)
+		return 0;
+	return next_ino;
+}
+
+/*
+ * Return a pointer to a pointer where the caller can read or write a pointer
+ * to an inode path structure.
+ *
+ * The pointed-to pointer will be set to UNLINKED_IPATH if there is no ipath
+ * associated with this inode but the inode has been flagged for relocation.
+ *
+ * Returns NULL if the inode is not flagged for relocation.
+ */
+struct inode_path **
+get_reloc_ipath_slot(
+	uint64_t		ino)
+{
+	struct inode_path	**slot;
+
+	slot = (struct inode_path **)radix_tree_lookup_slot(&relocation_data,
+			ino);
+	if (!slot || *slot == NULL)
+		return NULL;
+	return slot;
+}
+
+void
+forget_reloc_ino(
+	uint64_t		ino)
+{
+	radix_tree_delete(&relocation_data, ino);
+}
diff --git a/spaceman/relocation.h b/spaceman/relocation.h
new file mode 100644
index 00000000000000..f05a871915da42
--- /dev/null
+++ b/spaceman/relocation.h
@@ -0,0 +1,46 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Red Hat, Inc.
+ * All Rights Reserved.
+ */
+#ifndef XFS_SPACEMAN_RELOCATION_H_
+#define XFS_SPACEMAN_RELOCATION_H_
+
+bool is_reloc_populated(void);
+unsigned long long get_reloc_count(void);
+
+/*
+ * Tags for the relocation_data tree that indicate what it contains and the
+ * discovery information that needed to be stored.
+ */
+#define MOVE_INODE	0
+#define MOVE_BLOCKS	1
+#define INODE_PATH	2
+
+bool test_reloc_iflag(uint64_t ino, unsigned int flag);
+void set_reloc_iflag(uint64_t ino, unsigned int flag);
+struct inode_path *get_next_reloc_ipath(uint64_t ino);
+uint64_t get_next_reloc_unlinked(uint64_t ino);
+struct inode_path **get_reloc_ipath_slot(uint64_t ino);
+void forget_reloc_ino(uint64_t ino);
+
+/*
+ * When the entry in the relocation_data tree is tagged with INODE_PATH, the
+ * entry contains a structure that tracks the discovered paths to the inode. If
+ * the inode has multiple hard links, then we chain each individual path found
+ * via the path_list and record the number of paths in the link_count entry.
+ */
+struct inode_path {
+	uint64_t		ino;
+	struct list_head	path_list;
+	uint32_t		link_count;
+	char			path[1];
+};
+
+/*
+ * Sentinel value for inodes that we have to move but haven't yet found a path
+ * to.
+ */
+#define UNLINKED_IPATH		((struct inode_path *)1)
+
+#endif /* XFS_SPACEMAN_RELOCATION_H_ */



* [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-12-31 23:46   ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong
@ 2024-12-31 23:47   ` Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We can't use the radix tree to store relocation information on 32-bit
systems because unsigned longs are not large enough to hold 64-bit
inodes.  Use an avl64 tree instead.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure.ac          |    1 
 include/builddefs.in  |    1 
 m4/package_libcdev.m4 |   20 +++++
 spaceman/Makefile     |    4 +
 spaceman/relocation.c |  203 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 229 insertions(+)


diff --git a/configure.ac b/configure.ac
index 224d1d3930bf2f..1f7fec838e1239 100644
--- a/configure.ac
+++ b/configure.ac
@@ -212,6 +212,7 @@ fi
 
 AC_MANUAL_FORMAT
 AC_HAVE_LIBURCU_ATOMIC64
+AC_USE_RADIX_TREE_FOR_INUMS
 
 AC_CONFIG_FILES([include/builddefs])
 AC_OUTPUT
diff --git a/include/builddefs.in b/include/builddefs.in
index ac43b6412c8cbb..bb022c36627a72 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -114,6 +114,7 @@ CROND_DIR = @crond_dir@
 HAVE_UDEV = @have_udev@
 UDEV_RULE_DIR = @udev_rule_dir@
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
+USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@
 
 GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall
 #	   -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 4ef7e8f67a3ba6..9e48273250244c 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -255,3 +255,23 @@ AC_DEFUN([AC_PACKAGE_CHECK_LTO],
     AC_SUBST(lto_cflags)
     AC_SUBST(lto_ldflags)
   ])
+
+#
+# Check if the radix tree index (unsigned long) is large enough to hold a
+# 64-bit inode number
+#
+AC_DEFUN([AC_USE_RADIX_TREE_FOR_INUMS],
+  [ AC_MSG_CHECKING([if radix tree can store XFS inums])
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([[
+#include <sys/param.h>
+#include <stdint.h>
+#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
+    ]], [[
+         typedef uint64_t    xfs_ino_t;
+
+         BUILD_BUG_ON(sizeof(unsigned long) < sizeof(xfs_ino_t));
+         return 0;
+    ]])],[use_radix_tree_for_inums=yes
+       AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)])
+    AC_SUBST(use_radix_tree_for_inums)
+  ])
diff --git a/spaceman/Makefile b/spaceman/Makefile
index 8980208285f610..d9d55245ffc47a 100644
--- a/spaceman/Makefile
+++ b/spaceman/Makefile
@@ -33,6 +33,10 @@ ifeq ($(HAVE_GETFSMAP),yes)
 CFILES += freesp.c clearfree.c
 endif
 
+ifeq ($(USE_RADIX_TREE_FOR_INUMS),yes)
+LCFLAGS += -DUSE_RADIX_TREE_FOR_INUMS
+endif
+
 default: depend $(LTCOMMAND)
 
 include $(BUILDRULES)
diff --git a/spaceman/relocation.c b/spaceman/relocation.c
index 7c7d9a2b4b236f..1c0db6a1dab465 100644
--- a/spaceman/relocation.c
+++ b/spaceman/relocation.c
@@ -6,7 +6,11 @@
 
 #include "libxfs.h"
 #include "libfrog/fsgeom.h"
+#ifdef USE_RADIX_TREE_FOR_INUMS
 #include "libfrog/radix-tree.h"
+#else
+#include "libfrog/avl64.h"
+#endif /* USE_RADIX_TREE_FOR_INUMS */
 #include "libfrog/paths.h"
 #include "command.h"
 #include "init.h"
@@ -24,6 +28,7 @@ get_reloc_count(void)
 	return inode_count;
 }
 
+#ifdef USE_RADIX_TREE_FOR_INUMS
 static RADIX_TREE(relocation_data, 0);
 
 bool
@@ -112,3 +117,201 @@ forget_reloc_ino(
 {
 	radix_tree_delete(&relocation_data, ino);
 }
+#else
+struct reloc_node {
+	struct avl64node	node;
+	uint64_t		ino;
+	struct inode_path	*ipath;
+	unsigned int		flags;
+};
+
+static uint64_t
+reloc_start(
+	struct avl64node	*node)
+{
+	struct reloc_node	*rln;
+
+	rln = container_of(node, struct reloc_node, node);
+	return rln->ino;
+}
+
+static uint64_t
+reloc_end(
+	struct avl64node	*node)
+{
+	struct reloc_node	*rln;
+
+	rln = container_of(node, struct reloc_node, node);
+	return rln->ino + 1;
+}
+
+static struct avl64ops reloc_ops = {
+	reloc_start,
+	reloc_end,
+};
+
+static struct avl64tree_desc	relocation_data = {
+	.avl_ops = &reloc_ops,
+};
+
+bool
+is_reloc_populated(void)
+{
+	return relocation_data.avl_firstino != NULL;
+}
+
+static inline struct reloc_node *
+reloc_lookup(
+	uint64_t		ino)
+{
+	avl64node_t		*node;
+
+	node = avl64_find(&relocation_data, ino);
+	if (!node)
+		return NULL;
+
+	return container_of(node, struct reloc_node, node);
+}
+
+static inline struct reloc_node *
+reloc_insert(
+	uint64_t		ino)
+{
+	struct reloc_node	*rln;
+	avl64node_t		*node;
+
+	rln = malloc(sizeof(struct reloc_node));
+	if (!rln)
+		return NULL;
+
+	rln->node.avl_nextino = NULL;
+	rln->ino = ino;
+	rln->ipath = UNLINKED_IPATH;
+	rln->flags = 0;
+
+	node = avl64_insert(&relocation_data, &rln->node);
+	if (node == NULL) {
+		free(rln);
+		return NULL;
+	}
+
+	return rln;
+}
+
+bool
+test_reloc_iflag(
+	uint64_t		ino,
+	unsigned int		flag)
+{
+	struct reloc_node	*rln;
+
+	rln = reloc_lookup(ino);
+	if (!rln)
+		return false;
+
+	return rln->flags & (1U << flag);
+}
+
+void
+set_reloc_iflag(
+	uint64_t		ino,
+	unsigned int		flag)
+{
+	struct reloc_node	*rln;
+
+	rln = reloc_lookup(ino);
+	if (!rln) {
+		rln = reloc_insert(ino);
+		if (!rln)
+			abort();
+		if (flag != INODE_PATH)
+			inode_count++;
+	}
+	if (flag == INODE_PATH)
+		inode_paths++;
+
+	rln->flags |= (1U << flag);
+}
+
+#define avl_for_each_range_safe(pos, n, l, first, last) \
+	for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; \
+			pos != (l); \
+			pos = n, n = pos ? pos->avl_nextino : NULL)
+
+struct inode_path *
+get_next_reloc_ipath(
+	uint64_t		ino)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct reloc_node	*rln;
+
+	avl64_findranges(&relocation_data, ino - 1, -1ULL, &firstn, &lastn);
+	if (firstn == NULL && lastn == NULL)
+		return NULL;
+
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		rln = container_of(pos, struct reloc_node, node);
+
+		if (rln->flags & (1U << INODE_PATH))
+			return rln->ipath;
+	}
+
+	return NULL;
+}
+
+uint64_t
+get_next_reloc_unlinked(
+	uint64_t		ino)
+{
+	struct avl64node	*firstn;
+	struct avl64node	*lastn;
+	struct avl64node	*pos;
+	struct avl64node	*n;
+	struct avl64node	*l;
+	struct reloc_node	*rln;
+
+	avl64_findranges(&relocation_data, ino - 1, -1ULL, &firstn, &lastn);
+	if (firstn == NULL && lastn == NULL)
+		return 0;
+
+	avl_for_each_range_safe(pos, n, l, firstn, lastn) {
+		rln = container_of(pos, struct reloc_node, node);
+
+		if (!(rln->flags & (1U << INODE_PATH)))
+			return rln->ino;
+	}
+
+	return 0;
+}
+
+struct inode_path **
+get_reloc_ipath_slot(
+	uint64_t		ino)
+{
+	struct reloc_node	*rln;
+
+	rln = reloc_lookup(ino);
+	if (!rln)
+		return NULL;
+
+	return &rln->ipath;
+}
+
+void
+forget_reloc_ino(
+	uint64_t		ino)
+{
+	struct reloc_node	*rln;
+
+	rln = reloc_lookup(ino);
+	if (!rln)
+		return;
+
+	avl64_delete(&relocation_data, &rln->node);
+	free(rln);
+}
+#endif /* USE_RADIX_TREE_FOR_INUMS */



* [PATCH 10/11] spaceman: relocate the contents of an AG
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-12-31 23:47   ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong
@ 2024-12-31 23:47   ` Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Shrinking a filesystem needs to first remove all the active user
data and metadata from the AGs that are going to be lopped off the
filesystem. Before we can do this, we have to relocate this
information to a region of the filesystem that is going to be
retained.

We have a function to move an inode and all its related
information to a specific AG, we have functions to find the
owners of all the information in an AG and we can find their paths.
This gives us all the information we need to relocate all the
objects in an AG we are going to remove via shrinking.

Firstly we scan the AG to be emptied to find the inodes that need to
be relocated, then we scan the directory structure to find all the
paths to those inodes that need to be moved. Then we iterate over
all the inodes to be moved, attempting to move them to the lowest
numbered AGs.

When the destination AG fills up, we'll get ENOSPC from
the moving code and this is a trigger to bump the destination AG and
retry the move. If we haven't moved all the inodes and their data by
the time the destination reaches the source AG, then the entire
operation will fail with ENOSPC - there is not enough room in the
filesystem to empty the selected AG in preparation for a shrink.

This, once again, is not intended as an optimal or even guaranteed
way of emptying an AG for shrink. It simply provides the basic
algorithm and mechanisms we need to perform a shrink operation.
Improvements and optimisations will come in time, but we can't get
to an optimal solution without first having basic functionality in
place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libfrog/fsgeom.h        |   10 ++
 man/man8/xfs_spaceman.8 |    8 ++
 spaceman/find_owner.c   |   32 +++---
 spaceman/init.c         |    1 
 spaceman/move_inode.c   |    7 +
 spaceman/relocation.c   |  234 +++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/relocation.h   |    5 +
 spaceman/space.h        |    1 
 8 files changed, 280 insertions(+), 18 deletions(-)


diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h
index 679046077cba84..3fe642be6dc9ae 100644
--- a/libfrog/fsgeom.h
+++ b/libfrog/fsgeom.h
@@ -196,6 +196,16 @@ cvt_daddr_to_agno(
 	return cvt_bb_to_off_fsbt(xfd, daddr) / xfd->fsgeom.agblocks;
 }
 
+/* Convert sparse filesystem block to AG Number */
+static inline uint32_t
+cvt_fsb_to_agno(
+	struct xfs_fd		*xfd,
+	uint64_t		fsbno)
+{
+	return fsbno >> xfd->agblklog;
+}
+
+
 /* Convert sector number to AG block number. */
 static inline uint32_t
 cvt_daddr_to_agbno(
diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8
index 6fef6949aa6c8b..b6488810cfab30 100644
--- a/man/man8/xfs_spaceman.8
+++ b/man/man8/xfs_spaceman.8
@@ -202,9 +202,17 @@ .SH COMMANDS
 .TP
 .B print
 Display a list of all open files.
+.TP
+.BI "relocate \-a agno [ \-h agno ]"
+Empty out the given allocation group by moving file data elsewhere.
+The
+.B -h
+option specifies the highest allocation group into which we can move data.
+
 .TP
 .B resolve_owner
 Resolves space in the filesystem to file paths, maybe?
+
 .TP
 .B quit
 Exit
diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c
index 80b239f9ac5de8..8e93145539a227 100644
--- a/spaceman/find_owner.c
+++ b/spaceman/find_owner.c
@@ -9,10 +9,10 @@
 #include <linux/fiemap.h>
 #include "libfrog/fsgeom.h"
 #include "libfrog/radix-tree.h"
-#include "command.h"
-#include "init.h"
 #include "libfrog/paths.h"
 #include <linux/fsmap.h>
+#include "command.h"
+#include "init.h"
 #include "space.h"
 #include "input.h"
 #include "relocation.h"
@@ -65,8 +65,8 @@ track_inode(
 	set_reloc_iflag(owner, MOVE_BLOCKS);
 }
 
-static void
-scan_ag(
+int
+find_relocation_targets(
 	xfs_agnumber_t		agno)
 {
 	struct fsmap_head	*fsmap;
@@ -80,8 +80,7 @@ scan_ag(
 	fsmap = malloc(fsmap_sizeof(NR_EXTENTS));
 	if (!fsmap) {
 		fprintf(stderr, _("%s: fsmap malloc failed.\n"), progname);
-		exitcode = 1;
-		return;
+		return -ENOMEM;
 	}
 
 	memset(fsmap, 0, sizeof(*fsmap));
@@ -102,8 +101,7 @@ scan_ag(
 			fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"),
 				progname, file->name, strerror(errno));
 			free(fsmap);
-			exitcode = 1;
-			return;
+			return -errno;
 		}
 
 		/* No more extents to map, exit */
@@ -148,6 +146,7 @@ scan_ag(
 	}
 
 	free(fsmap);
+	return 0;
 }
 
 /*
@@ -159,6 +158,7 @@ find_owner_f(
 	char			**argv)
 {
 	xfs_agnumber_t		agno = -1;
+	int			ret;
 	int			c;
 
 	while ((c = getopt(argc, argv, "a:")) != EOF) {
@@ -198,7 +198,9 @@ _("Filesystem at %s does not have reverse mapping enabled. Aborting.\n"),
 		return 0;
 	}
 
-	scan_ag(agno);
+	ret = find_relocation_targets(agno);
+	if (ret)
+		exitcode = 1;
 	return 0;
 }
 
@@ -299,8 +301,8 @@ _("Aborting: Storing path %s for inode 0x%lx failed: %s\n"),
  * This should be parallelised - pass subdirs off to a work queue, have the
  * work queue processes subdirs, queueing more subdirs to work on.
  */
-static int
-walk_mount(
+int
+resolve_target_paths(
 	const char	*mntpt)
 {
 	int		ret;
@@ -361,9 +363,9 @@ list_inode_paths(void)
 
 	/*
 	 * Any inodes remaining in the tree at this point indicate inodes whose
-	 * paths were not found. This will be unlinked but still open inodes or
-	 * lost inodes due to corruptions. Either way, a shrink will not succeed
-	 * until these inodes are removed from the filesystem.
+	 * paths were not found. This will be free inodes or unlinked but still
+	 * open inodes. Either way, a shrink will not succeed until these inodes
+	 * are removed from the filesystem.
 	 */
 	idx = 0;
 	do {
@@ -400,7 +402,7 @@ _("Inode list has not been populated. No inodes to resolve.\n"));
 		return 0;
 	}
 
-	ret = walk_mount(file->fs_path.fs_dir);
+	ret = resolve_target_paths(file->fs_path.fs_dir);
 	if (ret) {
 		fprintf(stderr,
 _("Failed to resolve all paths from mount point %s: %s\n"),
diff --git a/spaceman/init.c b/spaceman/init.c
index 8b0af14e566dc8..cfe1b96fb66cd1 100644
--- a/spaceman/init.c
+++ b/spaceman/init.c
@@ -40,6 +40,7 @@ init_commands(void)
 	move_inode_init();
 	find_owner_init();
 	resolve_owner_init();
+	relocate_init();
 }
 
 static int
diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c
index b7d71ee7a46dc6..ab3c12f5de987b 100644
--- a/spaceman/move_inode.c
+++ b/spaceman/move_inode.c
@@ -12,6 +12,7 @@
 #include "space.h"
 #include "input.h"
 #include "handle.h"
+#include "relocation.h"
 
 #include <linux/fiemap.h>
 #include <linux/falloc.h>
@@ -404,8 +405,8 @@ exchange_inodes(
 	return 0;
 }
 
-static int
-move_file_to_ag(
+int
+relocate_file_to_ag(
 	const char		*mnt,
 	const char		*path,
 	struct xfs_fd		*xfd,
@@ -511,7 +512,7 @@ _("Destination AG %d does not exist. Filesystem only has %d AGs\n"),
 	}
 
 	if (S_ISREG(st.st_mode)) {
-		ret = move_file_to_ag(file->fs_path.fs_dir, file->name,
+		ret = relocate_file_to_ag(file->fs_path.fs_dir, file->name,
 				&file->xfd, agno);
 	} else {
 		fprintf(stderr, _("Unsupported: %s is not a regular file.\n"),
diff --git a/spaceman/relocation.c b/spaceman/relocation.c
index 1c0db6a1dab465..7b125cc0ae12b0 100644
--- a/spaceman/relocation.c
+++ b/spaceman/relocation.c
@@ -315,3 +315,237 @@ forget_reloc_ino(
 	free(rln);
 }
 #endif /* USE_RADIX_TREE_FOR_INUMS */
+
+static struct cmdinfo relocate_cmd;
+
+static int
+relocate_targets_to_ag(
+	const char		*mnt,
+	xfs_agnumber_t		dst_agno)
+{
+	struct inode_path	*ipath;
+	uint64_t		idx = 0;
+	int			ret = 0;
+
+	do {
+		struct xfs_fd	xfd = {0};
+		struct stat	st;
+
+		/* lookup first relocation target */
+		ipath = get_next_reloc_ipath(idx);
+		if (!ipath)
+			break;
+
+		/* XXX: don't handle hard link cases yet */
+		if (ipath->link_count > 1) {
+			fprintf(stderr,
+		"FIXME! Skipping hardlinked inode at path %s\n",
+				ipath->path);
+			goto next;
+		}
+
+
+		ret = stat(ipath->path, &st);
+		if (ret) {
+			fprintf(stderr, _("stat(%s) failed: %s\n"),
+				ipath->path, strerror(errno));
+			goto next;
+		}
+
+		if (!S_ISREG(st.st_mode)) {
+			fprintf(stderr,
+		_("FIXME! Skipping %s: not a regular file.\n"),
+				ipath->path);
+			goto next;
+		}
+
+		ret = xfd_open(&xfd, ipath->path, O_RDONLY);
+		if (ret) {
+			fprintf(stderr, _("xfd_open(%s) failed: %s\n"),
+				ipath->path, strerror(-ret));
+			goto next;
+		}
+
+		/* move to destination AG */
+		ret = relocate_file_to_ag(mnt, ipath->path, &xfd, dst_agno);
+		xfd_close(&xfd);
+
+		/*
+		 * If the destination AG has run out of space, we do not remove
+		 * this inode from relocation data so it will be immediately
+		 * retried in the next AG. Other errors will be fatal.
+		 */
+		if (ret < 0)
+			return ret;
+next:
+		/* remove from relocation data */
+		idx = ipath->ino + 1;
+		forget_reloc_ino(ipath->ino);
+	} while (ret != -ENOSPC);
+
+	return ret;
+}
+
+static int
+relocate_targets(
+	const char		*mnt,
+	xfs_agnumber_t		highest_agno)
+{
+	xfs_agnumber_t		dst_agno = 0;
+	int			ret;
+
+	for (dst_agno = 0; dst_agno <= highest_agno; dst_agno++) {
+		ret = relocate_targets_to_ag(mnt, dst_agno);
+		if (ret == -ENOSPC)
+			continue;
+		break;
+	}
+	return ret;
+}
+
+/*
+ * Relocate all the user objects in an AG to lower numbered AGs.
+ */
+static int
+relocate_f(
+	int		argc,
+	char		**argv)
+{
+	xfs_agnumber_t	target_agno = -1;
+	xfs_agnumber_t	highest_agno = -1;
+	xfs_agnumber_t	log_agno;
+	void		*fshandle;
+	size_t		fshdlen;
+	int		c;
+	int		ret;
+
+	while ((c = getopt(argc, argv, "a:h:")) != EOF) {
+		switch (c) {
+		case 'a':
+			target_agno = cvt_u32(optarg, 10);
+			if (errno) {
+				fprintf(stderr, _("bad target agno value %s\n"),
+					optarg);
+				return command_usage(&relocate_cmd);
+			}
+			break;
+		case 'h':
+			highest_agno = cvt_u32(optarg, 10);
+			if (errno) {
+				fprintf(stderr, _("bad highest agno value %s\n"),
+					optarg);
+				return command_usage(&relocate_cmd);
+			}
+			break;
+		default:
+			return command_usage(&relocate_cmd);
+		}
+	}
+
+	if (optind != argc)
+		return command_usage(&relocate_cmd);
+
+	if (target_agno == -1) {
+		fprintf(stderr, _("Target AG must be specified!\n"));
+		return command_usage(&relocate_cmd);
+	}
+
+	log_agno = cvt_fsb_to_agno(&file->xfd, file->xfd.fsgeom.logstart);
+	if (target_agno <= log_agno) {
+		fprintf(stderr,
+_("Target AG %d must be higher than the journal AG (AG %d). Aborting.\n"),
+			target_agno, log_agno);
+		goto out_fail;
+	}
+
+	if (target_agno >= file->xfd.fsgeom.agcount) {
+		fprintf(stderr,
+_("Target AG %d does not exist. Filesystem only has %d AGs\n"),
+			target_agno, file->xfd.fsgeom.agcount);
+		goto out_fail;
+	}
+
+	if (highest_agno == -1)
+		highest_agno = target_agno - 1;
+
+	if (highest_agno >= target_agno) {
+		fprintf(stderr,
+_("Highest destination AG %d must be less than target AG %d. Aborting.\n"),
+			highest_agno, target_agno);
+		goto out_fail;
+	}
+
+	if (is_reloc_populated()) {
+		fprintf(stderr,
+_("Relocation data populated from previous commands. Aborting.\n"));
+		goto out_fail;
+	}
+
+	/* this is so we can use fd_to_handle() later on */
+	ret = path_to_fshandle(file->fs_path.fs_dir, &fshandle, &fshdlen);
+	if (ret < 0) {
+		fprintf(stderr, _("Cannot get fshandle for mount %s: %s\n"),
+			file->fs_path.fs_dir, strerror(errno));
+		goto out_fail;
+	}
+
+	ret = find_relocation_targets(target_agno);
+	if (ret) {
+		fprintf(stderr,
+_("Failure during target discovery. Aborting.\n"));
+		goto out_fail;
+	}
+
+	ret = resolve_target_paths(file->fs_path.fs_dir);
+	if (ret) {
+		fprintf(stderr,
+_("Failed to resolve all paths from mount point %s: %s\n"),
+			file->fs_path.fs_dir, strerror(-ret));
+		goto out_fail;
+	}
+
+	ret = relocate_targets(file->fs_path.fs_dir, highest_agno);
+	if (ret) {
+		fprintf(stderr,
+_("Failed to relocate all targets out of AG %d: %s\n"),
+			target_agno, strerror(-ret));
+		goto out_fail;
+	}
+
+	return 0;
+out_fail:
+	exitcode = 1;
+	return 0;
+}
+
+static void
+relocate_help(void)
+{
+	printf(_(
+"\n"
+"Relocate all the user data and metadata in an AG.\n"
+"\n"
+"This function will discover all the relocatable objects in a single AG and\n"
+"move them to a lower AG as preparation for a shrink operation.\n"
+"\n"
+"	-a <agno>	Allocation group to empty\n"
+"	-h <agno>	Highest target AG allowed to relocate into\n"
+"\n"));
+
+}
+
+void
+relocate_init(void)
+{
+	relocate_cmd.name = "relocate";
+	relocate_cmd.altname = "relocate";
+	relocate_cmd.cfunc = relocate_f;
+	relocate_cmd.argmin = 2;
+	relocate_cmd.argmax = 4;
+	relocate_cmd.args = "-a agno [-h agno]";
+	relocate_cmd.flags = CMD_FLAG_ONESHOT;
+	relocate_cmd.oneline = _("Relocate data in an AG.");
+	relocate_cmd.help = relocate_help;
+
+	add_command(&relocate_cmd);
+}
diff --git a/spaceman/relocation.h b/spaceman/relocation.h
index f05a871915da42..d4c71b7bb7f054 100644
--- a/spaceman/relocation.h
+++ b/spaceman/relocation.h
@@ -43,4 +43,9 @@ struct inode_path {
  */
 #define UNLINKED_IPATH		((struct inode_path *)1)
 
+int find_relocation_targets(xfs_agnumber_t agno);
+int relocate_file_to_ag(const char *mnt, const char *path, struct xfs_fd *xfd,
+			xfs_agnumber_t agno);
+int resolve_target_paths(const char *mntpt);
+
 #endif /* XFS_SPACEMAN_RELOCATION_H_ */
diff --git a/spaceman/space.h b/spaceman/space.h
index cffb1882153a18..8c2b3e5464dee6 100644
--- a/spaceman/space.h
+++ b/spaceman/space.h
@@ -41,5 +41,6 @@ extern void	health_init(void);
 void		move_inode_init(void);
 void		find_owner_init(void);
 void		resolve_owner_init(void);
+void		relocate_init(void);
 
 #endif /* XFS_SPACEMAN_SPACE_H_ */


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 11/11] spaceman: move inodes with hardlinks
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
                     ` (9 preceding siblings ...)
  2024-12-31 23:47   ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong
@ 2024-12-31 23:47   ` Darrick J. Wong
  10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When an inode to be moved to a different AG has multiple hard links,
we need to "move" all the hard links, too. To do this, we need to
create temporary hardlinks to the new file, and then use rename
exchange to swap all the hardlinks that point to the old inode
with new hardlinks that point to the new inode.

We already know that an inode has hard links via the path discovery,
and we can check it against the link count that is reported for the
inode before we start building the link farm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 spaceman/find_owner.c |   13 +----
 spaceman/move_inode.c |  119 +++++++++++++++++++++++++++++++++++++++++++++----
 spaceman/relocation.c |   35 ++++++++++----
 spaceman/relocation.h |    6 ++
 4 files changed, 140 insertions(+), 33 deletions(-)


diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c
index 8e93145539a227..1984d0ee7ca5f6 100644
--- a/spaceman/find_owner.c
+++ b/spaceman/find_owner.c
@@ -240,7 +240,6 @@ resolve_owner_cb(
 	struct FTW		*data)
 {
 	struct inode_path	*ipath, *slot_ipath;
-	int			pathlen;
 	struct inode_path	**slot;
 
 	/*
@@ -260,17 +259,9 @@ _("Failed to obtain stat(2) information from path %s. Aborting\n"),
 	}
 
 	/* Allocate a new inode path and record the path in it. */
-	pathlen = strlen(path);
-	ipath = calloc(1, sizeof(*ipath) + pathlen + 1);
-	if (!ipath) {
-		fprintf(stderr,
-_("Aborting: Storing path %s for inode 0x%lx failed: %s\n"),
-			path, stat->st_ino, strerror(ENOMEM));
+	ipath = ipath_alloc(path, stat);
+	if (!ipath)
 		return -ENOMEM;
-	}
-	INIT_LIST_HEAD(&ipath->path_list);
-	memcpy(&ipath->path[0], path, pathlen);
-	ipath->ino = stat->st_ino;
 
 	/*
 	 * If the slot contains the inode number we just looked up, then we
diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c
index ab3c12f5de987b..3a182929579e45 100644
--- a/spaceman/move_inode.c
+++ b/spaceman/move_inode.c
@@ -36,12 +36,14 @@ create_tmpfile(
 	struct xfs_fd	*xfd,
 	xfs_agnumber_t	agno,
 	char		**tmpfile,
-	int		*tmpfd)
+	int		*tmpfd,
+	int		link_count)
 {
 	char		name[PATH_MAX + 1];
+	char		linkname[PATH_MAX + 1];
 	mode_t		mask;
 	int		fd;
-	int		i;
+	int		i, j;
 	int		ret;
 
 	/* construct tmpdir */
@@ -105,14 +107,36 @@ create_tmpfile(
 		fprintf(stderr, _("cannot create tmpfile: %s: %s\n"),
 		       name, strerror(errno));
 		ret = -errno;
+		goto out_cleanup_dir;
 	}
 
+	/* Create hard links to temporary file. */
+	for (j = link_count; j > 1; j--) {
+		snprintf(linkname, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d.hardlink.%d", mnt, i, getpid(), j);
+		ret = link(name, linkname);
+		if (ret < 0) {
+			fprintf(stderr, _("cannot create hardlink: %s: %s\n"),
+			       linkname, strerror(errno));
+			ret = -errno;
+			goto out_cleanup_links;
+		}
+	}
+
+
 	/* return name and fd */
 	(void)umask(mask);
 	*tmpfd = fd;
 	*tmpfile = strdup(name);
 
 	return 0;
+
+out_cleanup_links:
+	for (; j <= link_count; j++) {
+		snprintf(linkname, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d.hardlink.%d", mnt, i, getpid(), j);
+		unlink(linkname);
+	}
+	close(fd);
+	unlink(name);
 out_cleanup_dir:
 	snprintf(name, PATH_MAX, "%s/.spaceman", mnt);
 	rmdir(name);
@@ -405,21 +429,53 @@ exchange_inodes(
 	return 0;
 }
 
+static int
+exchange_hardlinks(
+	struct inode_path	*ipath,
+	const char		*tmpfile)
+{
+	char			linkname[PATH_MAX];
+	struct inode_path	*linkpath;
+	int			i = 2;
+	int			ret;
+
+	list_for_each_entry(linkpath, &ipath->path_list, path_list) {
+		if (i > ipath->link_count) {
+			fprintf(stderr, "ipath link count mismatch!\n");
+			return 0;
+		}
+
+		snprintf(linkname, PATH_MAX, "%s.hardlink.%d", tmpfile, i++);
+		ret = renameat2(AT_FDCWD, linkname,
+				AT_FDCWD, linkpath->path, RENAME_EXCHANGE);
+		if (ret) {
+			fprintf(stderr,
+		"failed to exchange hard link %s with %s: %s\n",
+				linkname, linkpath->path, strerror(errno));
+			return -errno;
+		}
+	}
+	return 0;
+}
+
 int
 relocate_file_to_ag(
 	const char		*mnt,
-	const char		*path,
+	struct inode_path	*ipath,
 	struct xfs_fd		*xfd,
 	xfs_agnumber_t		agno)
 {
 	int			ret;
 	int			tmpfd = -1;
 	char			*tmpfile = NULL;
+	int			i;
 
-	fprintf(stderr, "move mnt %s, path %s, agno %d\n", mnt, path, agno);
+	fprintf(stderr, "move mnt %s, path %s, agno %d\n",
+			mnt, ipath->path, agno);
 
 	/* create temporary file in agno */
-	ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd);
+	ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd,
+				ipath->link_count);
 	if (ret)
 		return ret;
 
@@ -444,12 +500,28 @@ relocate_file_to_ag(
 		goto out_cleanup;
 
 	/* swap the inodes over */
-	ret = exchange_inodes(xfd, tmpfd, tmpfile, path);
+	ret = exchange_inodes(xfd, tmpfd, tmpfile, ipath->path);
+	if (ret)
+		goto out_cleanup;
+
+	/* swap the hard links over */
+	ret = exchange_hardlinks(ipath, tmpfile);
+	if (ret)
+		goto out_cleanup;
 
 out_cleanup:
 	if (ret == -1)
 		ret = -errno;
 
+	/* remove old hard links */
+	for (i = 2; i <= ipath->link_count; i++) {
+		char linkname[PATH_MAX + 256]; // anti-warning-crap
+
+		snprintf(linkname, PATH_MAX + 256, "%s.hardlink.%d", tmpfile, i);
+		unlink(linkname);
+	}
+
+	/* remove tmpfile */
 	close(tmpfd);
 	if (tmpfile)
 		unlink(tmpfile);
@@ -458,11 +530,32 @@ relocate_file_to_ag(
 	return ret;
 }
 
+static int
+build_ipath(
+	const char		*path,
+	struct stat		*st,
+	struct inode_path	**ipathp)
+{
+	struct inode_path	*ipath;
+
+	*ipathp = NULL;
+
+	ipath = ipath_alloc(path, st);
+	if (!ipath)
+		return -ENOMEM;
+
+	/* we only move a single path with move_inode */
+	ipath->link_count = 1;
+	*ipathp = ipath;
+	return 0;
+}
+
 static int
 move_inode_f(
 	int			argc,
 	char			**argv)
 {
+	struct inode_path	*ipath = NULL;
 	void			*fshandle;
 	size_t			fshdlen;
 	xfs_agnumber_t		agno = 0;
@@ -511,24 +604,30 @@ _("Destination AG %d does not exist. Filesystem only has %d AGs\n"),
 		goto exit_fail;
 	}
 
-	if (S_ISREG(st.st_mode)) {
-		ret = relocate_file_to_ag(file->fs_path.fs_dir, file->name,
-				&file->xfd, agno);
-	} else {
+	if (!S_ISREG(st.st_mode)) {
 		fprintf(stderr, _("Unsupported: %s is not a regular file.\n"),
 			file->name);
 		goto exit_fail;
 	}
 
+	ret = build_ipath(file->name, &st, &ipath);
+	if (ret)
+		goto exit_fail;
+
+	ret = relocate_file_to_ag(file->fs_path.fs_dir, ipath,
+				&file->xfd, agno);
 	if (ret) {
 		fprintf(stderr, _("Failed to move inode to AG %d: %s\n"),
 			agno, strerror(-ret));
 		goto exit_fail;
 	}
+	free(ipath);
 	fshandle_destroy();
 	return 0;
 
 exit_fail:
+	if (ipath)
+		free(ipath);
 	fshandle_destroy();
 	exitcode = 1;
 	return 0;
diff --git a/spaceman/relocation.c b/spaceman/relocation.c
index 7b125cc0ae12b0..b0960272168510 100644
--- a/spaceman/relocation.c
+++ b/spaceman/relocation.c
@@ -318,6 +318,30 @@ forget_reloc_ino(
 
 static struct cmdinfo relocate_cmd;
 
+struct inode_path *
+ipath_alloc(
+	const char		*path,
+	const struct stat	*stat)
+{
+	struct inode_path	*ipath;
+	int			pathlen = strlen(path);
+
+	/* Allocate a new inode path and record the path in it. */
+	ipath = calloc(1, sizeof(*ipath) + pathlen + 1);
+	if (!ipath) {
+		fprintf(stderr,
+_("Failed to allocate ipath %s for inode 0x%llx: %s\n"),
+			path, (unsigned long long)stat->st_ino,
+			strerror(ENOMEM));
+		return NULL;
+	}
+	INIT_LIST_HEAD(&ipath->path_list);
+	memcpy(&ipath->path[0], path, pathlen);
+	ipath->ino = stat->st_ino;
+
+	return ipath;
+}
+
 static int
 relocate_targets_to_ag(
 	const char		*mnt,
@@ -336,15 +360,6 @@ relocate_targets_to_ag(
 		if (!ipath)
 			break;
 
-		/* XXX: don't handle hard link cases yet */
-		if (ipath->link_count > 1) {
-			fprintf(stderr,
-		"FIXME! Skipping hardlinked inode at path %s\n",
-				ipath->path);
-			goto next;
-		}
-
-
 		ret = stat(ipath->path, &st);
 		if (ret) {
 			fprintf(stderr, _("stat(%s) failed: %s\n"),
@@ -367,7 +382,7 @@ relocate_targets_to_ag(
 		}
 
 		/* move to destination AG */
-		ret = relocate_file_to_ag(mnt, ipath->path, &xfd, dst_agno);
+		ret = relocate_file_to_ag(mnt, ipath, &xfd, dst_agno);
 		xfd_close(&xfd);
 
 		/*
diff --git a/spaceman/relocation.h b/spaceman/relocation.h
index d4c71b7bb7f054..2c807aa678ec5b 100644
--- a/spaceman/relocation.h
+++ b/spaceman/relocation.h
@@ -43,9 +43,11 @@ struct inode_path {
  */
 #define UNLINKED_IPATH		((struct inode_path *)1)
 
+struct inode_path *ipath_alloc(const char *path, const struct stat *st);
+
 int find_relocation_targets(xfs_agnumber_t agno);
-int relocate_file_to_ag(const char *mnt, const char *path, struct xfs_fd *xfd,
-			xfs_agnumber_t agno);
+int relocate_file_to_ag(const char *mnt, struct inode_path *ipath,
+			struct xfs_fd *xfd, xfs_agnumber_t agno);
 int resolve_target_paths(const char *mntpt);
 
 #endif /* XFS_SPACEMAN_RELOCATION_H_ */


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (7 preceding siblings ...)
  2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (20 more replies)
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                   ` (6 subsequent siblings)
  15 siblings, 21 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: create hooks for monitoring health updates
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: add media error reporting ioctl
 * xfs_io: monitor filesystem health events
 * xfs_io: add a media error reporting command
 * xfs_scrubbed: create daemon to listen for health events
 * xfs_scrubbed: check events against schema
 * xfs_scrubbed: enable repairing filesystems
 * xfs_scrubbed: check for fs features needed for effective repairs
 * xfs_scrubbed: use getparents to look up file names
 * builddefs: refactor udev directory specification
 * xfs_scrubbed: create a background monitoring service
 * xfs_scrubbed: don't start service if kernel support unavailable
 * xfs_scrubbed: use the autofsck fsproperty to select mode
 * xfs_scrub: report media scrub failures to the kernel
 * debian: enable xfs_scrubbed on the root filesystem by default
---
 configure.ac                     |    2 
 debian/control                   |    2 
 debian/postinst                  |    8 
 debian/prerm                     |   13 
 include/builddefs.in             |    3 
 io/Makefile                      |    1 
 io/healthmon.c                   |  183 ++++++
 io/init.c                        |    1 
 io/io.h                          |    1 
 io/shutdown.c                    |  113 ++++
 libxfs/Makefile                  |   10 
 libxfs/xfs_fs.h                  |   31 +
 libxfs/xfs_health.h              |   47 ++
 libxfs/xfs_healthmon.schema.json |  595 ++++++++++++++++++++
 m4/package_services.m4           |   30 +
 man/man8/xfs_io.8                |   46 ++
 scrub/Makefile                   |   34 +
 scrub/phase6.c                   |   25 +
 scrub/xfs_scrubbed.in            | 1106 ++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrubbed.rules         |    7 
 scrub/xfs_scrubbed@.service.in   |  104 ++++
 scrub/xfs_scrubbed_start         |   17 +
 22 files changed, 2354 insertions(+), 25 deletions(-)
 create mode 100644 debian/prerm
 create mode 100644 io/healthmon.c
 create mode 100644 libxfs/xfs_healthmon.schema.json
 create mode 100644 scrub/xfs_scrubbed.in
 create mode 100644 scrub/xfs_scrubbed.rules
 create mode 100644 scrub/xfs_scrubbed@.service.in
 create mode 100755 scrub/xfs_scrubbed_start


^ permalink raw reply	[flat|nested] 110+ messages in thread

* [PATCH 01/21] xfs: create hooks for monitoring health updates
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:47   ` Darrick J. Wong
  2024-12-31 23:48   ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
                     ` (19 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_health.h |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)


diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2024-12-31 23:48   ` Darrick J. Wong
  2024-12-31 23:48   ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
                     ` (18 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.
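
From userspace, asking for a monitor fd might look roughly like the
sketch below.  The structure and ioctl number are copied from this
series, since installed headers will not have them yet.  The
open_health_monitor() helper is hypothetical, and the sketch assumes
that the new descriptor comes back as the ioctl return value, which
this message implies but does not spell out.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Local copies of the definitions proposed in this series. */
struct xfs_health_monitor {
	__u64	flags;		/* flags */
	__u8	format;		/* output format */
	__u8	pad1[7];	/* zeroes */
	__u64	pad2[2];	/* zeroes */
};

#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)

/*
 * Hypothetical helper: ask the filesystem open at mnt_fd for a health
 * monitoring descriptor.  Returns the new fd (assumed to be the ioctl
 * return value) or a negative errno.
 */
static int
open_health_monitor(
	int				mnt_fd)
{
	struct xfs_health_monitor	hmo;
	int				mon_fd;

	/* All padding must be zero. */
	memset(&hmo, 0, sizeof(hmo));
	hmo.format = XFS_HEALTH_MONITOR_FMT_JSON;

	mon_fd = ioctl(mnt_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
	if (mon_fd < 0)
		return -errno;
	return mon_fd;
}
```

A caller would open the mount point, pass that fd in, and then read(2)
events from the returned descriptor until it is closed.  On a kernel
without this series the call is expected to fail with a negative errno.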

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_fs.h |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index f4128dbdf3b9a2..d1a81b02a1a3f3 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1100,6 +1100,13 @@ struct xfs_map_freesp {
 	__u64	pad;		/* must be zero */
 };
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1141,6 +1148,7 @@ struct xfs_map_freesp {
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 #define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
+#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s


^ permalink raw reply related	[flat|nested] 110+ messages in thread

* [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:47   ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
  2024-12-31 23:48   ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2024-12-31 23:48   ` Darrick J. Wong
  2024-12-31 23:48   ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong
                     ` (17 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.
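
To illustrate the "notice that we've lost some events" requirement,
here is a toy bounded queue that counts drops instead of blocking the
producer.  It is a sketch only and makes no claim about how the kernel
side is implemented: struct ev, struct evq, EVQ_DEPTH and the evq_*
helpers are all hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EVQ_DEPTH	4	/* hypothetical queue depth */

/* Compact record of one event; fields are illustrative only. */
struct ev {
	uint32_t	type;
	uint32_t	domain;
	uint64_t	ino;
};

struct evq {
	struct ev	slots[EVQ_DEPTH];
	size_t		head;	/* index of the oldest queued event */
	size_t		count;	/* number of queued events */
	uint64_t	lost;	/* events dropped since the last pop */
};

/* Queue an event without ever blocking; count it as lost if full. */
static bool
evq_push(
	struct evq		*q,
	const struct ev		*e)
{
	if (q->count == EVQ_DEPTH) {
		q->lost++;
		return false;
	}
	q->slots[(q->head + q->count) % EVQ_DEPTH] = *e;
	q->count++;
	return true;
}

/*
 * Dequeue the oldest event.  *lost reports how many events were dropped
 * since the previous pop so that the reader can emit a "lost" record.
 */
static bool
evq_pop(
	struct evq		*q,
	struct ev		*e,
	uint64_t		*lost)
{
	*lost = q->lost;
	q->lost = 0;
	if (q->count == 0)
		return false;
	*e = q->slots[q->head];
	q->head = (q->head + 1) % EVQ_DEPTH;
	q->count--;
	return true;
}
```

The point of the design is that a slow or stuck reader costs the
producer nothing more than an incremented counter, and the reader can
still learn that its view of events has a gap.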

Here, we've chosen json to export information to userspace.  The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time.  Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_fs.h                  |    8 +++++
 libxfs/xfs_healthmon.schema.json |   63 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)
 create mode 100644 libxfs/xfs_healthmon.schema.json
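
A note on the "will not be confused by keys they don't recognize" point
above: the sketch below shows one way a consumer could pull out only the
keys it knows about and ignore everything else.  It is an illustration,
not part of this patch.  The helper name json_get_string is hypothetical,
and the code assumes flat objects whose string values contain no escaped
quotes; a real consumer should use a proper json parser.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the string value of "key" from a flat JSON object into out.
 * Returns 0 on success, -1 if the key is absent or its value is not a
 * string.  Illustration only: no escape handling, no nesting awareness.
 */
static int
json_get_string(const char *obj, const char *key, char *out, size_t outlen)
{
	char		pat[64];
	const char	*p, *end;
	size_t		klen = strlen(key);
	size_t		len;

	if (klen + 3 > sizeof(pat) || outlen == 0)
		return -1;

	/* Build the quoted key, quotes included. */
	pat[0] = '"';
	memcpy(pat + 1, key, klen);
	pat[klen + 1] = '"';
	pat[klen + 2] = '\0';

	p = strstr(obj, pat);
	if (!p)
		return -1;
	p += klen + 2;

	/* Skip whitespace and the colon between key and value. */
	while (*p == ' ' || *p == '\t' || *p == ':')
		p++;
	if (*p != '"')
		return -1;	/* value is not a string */
	p++;

	end = strchr(p, '"');
	if (!end)
		return -1;
	len = (size_t)(end - p);
	if (len >= outlen)
		return -1;
	memcpy(out, p, len);
	out[len] = '\0';
	return 0;
}
```

A caller would look up "type" and "domain", act on the values it
understands, and skip the event otherwise; keys added to the schema
later are never examined.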


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index d1a81b02a1a3f3..d7404e6efd866d 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1107,6 +1107,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
new file mode 100644
index 00000000000000..9772efe25f193d
--- /dev/null
+++ b/libxfs/xfs_healthmon.schema.json
@@ -0,0 +1,63 @@
+{
+	"$comment": [
+		"SPDX-License-Identifier: GPL-2.0-or-later",
+		"Copyright (c) 2024-2025 Oracle.  All Rights Reserved.",
+		"Author: Darrick J. Wong <djwong@kernel.org>",
+		"",
+		"This schema file describes the format of the json objects",
+		"readable from the fd returned by the XFS_IOC_HEALTH_MONITOR",
+		"ioctl."
+	],
+
+	"$schema": "https://json-schema.org/draft/2020-12/schema",
+	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
+
+	"title": "XFS Health Monitoring Events",
+
+	"$comment": "Events must be one of the following types:",
+	"oneOf": [
+		{
+			"$ref": "#/$events/lost"
+		}
+	],
+
+	"$comment": "Simple data types are defined here.",
+	"$defs": {
+		"time_ns": {
+			"title": "Time of Event",
+			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Event types are defined here.",
+	"$events": {
+		"lost": {
+			"title": "Health Monitoring Events Lost",
+			"$comment": [
+				"Previous health monitoring events were",
+				"dropped due to memory allocation failures",
+				"or queue limits."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "lost"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		}
+	}
+}



* [PATCH 04/21] xfs: report metadata health events through healthmon
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:48   ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2024-12-31 23:48   ` Darrick J. Wong
  2024-12-31 23:49   ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong
                     ` (16 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_healthmon.schema.json |  328 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 328 insertions(+)


diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
index 9772efe25f193d..154ea0228a3615 100644
--- a/libxfs/xfs_healthmon.schema.json
+++ b/libxfs/xfs_healthmon.schema.json
@@ -18,6 +18,18 @@
 	"oneOf": [
 		{
 			"$ref": "#/$events/lost"
+		},
+		{
+			"$ref": "#/$events/fs_metadata"
+		},
+		{
+			"$ref": "#/$events/rtgroup_metadata"
+		},
+		{
+			"$ref": "#/$events/perag_metadata"
+		},
+		{
+			"$ref": "#/$events/inode_metadata"
 		}
 	],
 
@@ -27,6 +39,169 @@
 			"title": "Time of Event",
 			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
 			"type": "integer"
+		},
+		"xfs_agnumber_t": {
+			"description": "Allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_rgnumber_t": {
+			"description": "Realtime allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_ino_t": {
+			"description": "Inode number",
+			"type": "integer",
+			"minimum": 1
+		},
+		"i_generation": {
+			"description": "Inode generation number",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Filesystem metadata event data are defined here.",
+	"$metadata": {
+		"status": {
+			"description": "Metadata health status",
+			"$comment": [
+				"One of:",
+				"",
+				" * sick:    metadata corruption discovered",
+				"            during a runtime operation.",
+				" * corrupt: corruption discovered during",
+				"            an xfs_scrub run.",
+				" * healthy: metadata object was found to be",
+				"            ok by xfs_scrub."
+			],
+			"enum": [
+				"sick",
+				"corrupt",
+				"healthy"
+			]
+		},
+		"fs": {
+			"description": [
+				"Metadata structures that affect the entire",
+				"filesystem.  Options include:",
+				"",
+				" * fscounters: summary counters",
+				" * usrquota:   user quota records",
+				" * grpquota:   group quota records",
+				" * prjquota:   project quota records",
+				" * quotacheck: quota counters",
+				" * nlinks:     file link counts",
+				" * metadir:    metadata directory",
+				" * metapath:   metadata inode paths"
+			],
+			"enum": [
+				"fscounters",
+				"grpquota",
+				"metadir",
+				"metapath",
+				"nlinks",
+				"prjquota",
+				"quotacheck",
+				"usrquota"
+			]
+		},
+		"perag": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the data device.  Options include:",
+				"",
+				" * agf:        group space header",
+				" * agfl:       per-group free block list",
+				" * agi:        group inode header",
+				" * bnobt:      free space by position btree",
+				" * cntbt:      free space by length btree",
+				" * finobt:     free inode btree",
+				" * inobt:      inode btree",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * inodes:     problems were recorded for",
+				"               this group's inodes, but the",
+				"               inodes themselves had to be",
+				"               reclaimed.",
+				" * super:      superblock"
+			],
+			"enum": [
+				"agf",
+				"agfl",
+				"agi",
+				"bnobt",
+				"cntbt",
+				"finobt",
+				"inobt",
+				"inodes",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"rtgroup": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the realtime volume.  Options",
+				"include:",
+				"",
+				" * bitmap:     free space bitmap contents",
+				"               for this group",
+				" * summary:    realtime free space summary file",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * super:      group superblock"
+			],
+			"enum": [
+				"bitmap",
+				"summary",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"inode": {
+			"description": [
+				"Metadata structures owned by file inodes.",
+				"Options include:",
+				"",
+				" * bmapbta:    attr fork",
+				" * bmapbtc:    cow fork",
+				" * bmapbtd:    data fork",
+				" * core:       inode record",
+				" * directory:  directory entries",
+				" * dirtree:    directory tree problems detected",
+				" * parent:     directory parent pointer",
+				" * symlink:    symbolic link target",
+				" * xattr:      extended attributes",
+				"",
+				"These are set when an inode record repair had",
+				"to drop the corresponding data structure to",
+				"get the inode back to a consistent state.",
+				"",
+				" * bmapbtd_zapped",
+				" * bmapbta_zapped",
+				" * directory_zapped",
+				" * symlink_zapped"
+			],
+			"enum": [
+				"bmapbta",
+				"bmapbta_zapped",
+				"bmapbtc",
+				"bmapbtd",
+				"bmapbtd_zapped",
+				"core",
+				"directory",
+				"directory_zapped",
+				"dirtree",
+				"parent",
+				"symlink",
+				"symlink_zapped",
+				"xattr"
+			]
 		}
 	},
 
@@ -58,6 +233,159 @@
 				"time_ns",
 				"domain"
 			]
+		},
+		"fs_metadata": {
+			"title": "Filesystem-wide metadata event",
+			"description": [
+				"Health status updates for filesystem-wide",
+				"metadata objects."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "fs"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/fs"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"structures"
+			]
+		},
+		"perag_metadata": {
+			"title": "Data device allocation group metadata event",
+			"description": [
+				"Health status updates for data device ",
+				"allocation group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "perag"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_agnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/perag"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"rtgroup_metadata": {
+			"title": "Realtime allocation group metadata event",
+			"description": [
+				"Health status updates for realtime allocation",
+				"group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "rtgroup"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_rgnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/rtgroup"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"inode_metadata": {
+			"title": "Inode metadata event",
+			"description": [
+				"Health status updates for inode metadata.",
+				"The inode and generation number describe the",
+				"file that is affected by the change."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "inode"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/inode"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"structures"
+			]
 		}
 	}
 }



* [PATCH 05/21] xfs: report shutdown events through healthmon
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:48   ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2024-12-31 23:49   ` Darrick J. Wong
  2024-12-31 23:49   ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong
                     ` (15 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_healthmon.schema.json |   62 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)


diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
index 154ea0228a3615..a8bc75b0b8c4f9 100644
--- a/libxfs/xfs_healthmon.schema.json
+++ b/libxfs/xfs_healthmon.schema.json
@@ -30,6 +30,9 @@
 		},
 		{
 			"$ref": "#/$events/inode_metadata"
+		},
+		{
+			"$ref": "#/$events/shutdown"
 		}
 	],
 
@@ -205,6 +208,31 @@
 		}
 	},
 
+	"$comment": "Shutdown event data are defined here.",
+	"$shutdown": {
+		"reason": {
+			"description": [
+				"Reason for a filesystem to shut down.",
+				"Options include:",
+				"",
+				" * corrupt_incore: in-memory corruption",
+				" * corrupt_ondisk: on-disk corruption",
+				" * device_removed: device removed",
+				" * force_umount:   userspace asked for it",
+				" * log_ioerr:      log write IO error",
+				" * meta_ioerr:     metadata writeback IO error"
+			],
+			"enum": [
+				"corrupt_incore",
+				"corrupt_ondisk",
+				"device_removed",
+				"force_umount",
+				"log_ioerr",
+				"meta_ioerr"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"lost": {
@@ -386,6 +414,40 @@
 				"generation",
 				"structures"
 			]
+		},
+		"shutdown": {
+			"title": "Abnormal Shutdown Event",
+			"description": [
+				"The filesystem went offline due to",
+				"unrecoverable errors."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "shutdown"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				},
+				"reasons": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$shutdown/reason"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"reasons"
+			]
 		}
 	}
 }



* [PATCH 06/21] xfs: report media errors through healthmon
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-12-31 23:49   ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong
@ 2024-12-31 23:49   ` Darrick J. Wong
  2024-12-31 23:49   ` [PATCH 07/21] xfs: report file io " Darrick J. Wong
                     ` (14 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have hooks to report media errors, connect this to the
health monitor as well.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_healthmon.schema.json |   65 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
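
Since the schema reports daddr and bbcount in units of 512-byte blocks,
a consumer that wants a byte range has to scale both values.  A minimal
sketch, assuming nothing beyond the units stated in the schema; the
struct and function names are hypothetical and there is no overflow
checking:

```c
#include <stdint.h>

#define BASIC_BLOCK_SHIFT	9	/* 512-byte units, per the schema */

struct byte_range {
	uint64_t	start;	/* first affected byte on the device */
	uint64_t	length;	/* number of affected bytes */
};

/* Convert a (daddr, bbcount) pair from a media error event to bytes. */
static struct byte_range
media_error_to_bytes(uint64_t daddr, uint64_t bbcount)
{
	struct byte_range	r = {
		.start	= daddr << BASIC_BLOCK_SHIFT,
		.length	= bbcount << BASIC_BLOCK_SHIFT,
	};

	return r;
}
```

For example, daddr 8 with bbcount 16 describes 8192 bytes starting at
byte 4096 of whichever device the domain element names.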


diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
index a8bc75b0b8c4f9..006f4145faa9f5 100644
--- a/libxfs/xfs_healthmon.schema.json
+++ b/libxfs/xfs_healthmon.schema.json
@@ -33,6 +33,9 @@
 		},
 		{
 			"$ref": "#/$events/shutdown"
+		},
+		{
+			"$ref": "#/$events/media_error"
 		}
 	],
 
@@ -63,6 +66,31 @@
 		"i_generation": {
 			"description": "Inode generation number",
 			"type": "integer"
+		},
+		"storage_devs": {
+			"description": "Storage devices in a filesystem",
+			"_comment": [
+				"One of:",
+				"",
+				" * datadev: filesystem device",
+				" * logdev:  external log device",
+				" * rtdev:   realtime volume"
+			],
+			"enum": [
+				"datadev",
+				"logdev",
+				"rtdev"
+			]
+		},
+		"xfs_daddr_t": {
+			"description": "Storage device address, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 0
+		},
+		"bbcount": {
+			"description": "Storage space length, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 1
 		}
 	},
 
@@ -448,6 +476,43 @@
 				"domain",
 				"reasons"
 			]
+		},
+		"media_error": {
+			"title": "Media Error",
+			"description": [
+				"A storage device reported a media error.",
+				"The domain element tells us which storage",
+				"device reported the media failure.  The",
+				"daddr and bbcount elements tell us where",
+				"inside that device the failure was observed."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "media"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"$ref": "#/$defs/storage_devs"
+				},
+				"daddr": {
+					"$ref": "#/$defs/xfs_daddr_t"
+				},
+				"bbcount": {
+					"$ref": "#/$defs/bbcount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"daddr",
+				"bbcount"
+			]
 		}
 	}
 }



* [PATCH 07/21] xfs: report file io errors through healthmon
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-12-31 23:49   ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong
@ 2024-12-31 23:49   ` Darrick J. Wong
  2024-12-31 23:49   ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong
                     ` (13 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a file io error event hook so that we can send events about read
errors, writeback errors, and directio errors to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_healthmon.schema.json |   77 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)


diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
index 006f4145faa9f5..9c1070a629997c 100644
--- a/libxfs/xfs_healthmon.schema.json
+++ b/libxfs/xfs_healthmon.schema.json
@@ -36,6 +36,9 @@
 		},
 		{
 			"$ref": "#/$events/media_error"
+		},
+		{
+			"$ref": "#/$events/file_ioerror"
 		}
 	],
 
@@ -67,6 +70,16 @@
 			"description": "Inode generation number",
 			"type": "integer"
 		},
+		"off_t": {
+			"description": "File position, in bytes",
+			"type": "integer",
+			"minimum": 0
+		},
+		"size_t": {
+			"description": "File operation length, in bytes",
+			"type": "integer",
+			"minimum": 1
+		},
 		"storage_devs": {
 			"description": "Storage devices in a filesystem",
 			"_comment": [
@@ -261,6 +274,26 @@
 		}
 	},
 
+	"$comment": "File IO event data are defined here.",
+	"$fileio": {
+		"types": {
+			"description": [
+				"File I/O operations.  One of:",
+				"",
+				" * readahead: reads into the page cache.",
+				" * writeback: writeback of dirty page cache.",
+				" * dioread:   O_DIRECT reads.",
+				" * diowrite:  O_DIRECT writes."
+			],
+			"enum": [
+				"readahead",
+				"writeback",
+				"dioread",
+				"diowrite"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"lost": {
@@ -513,6 +546,50 @@
 				"daddr",
 				"bbcount"
 			]
+		},
+		"file_ioerror": {
+			"title": "File I/O error",
+			"description": [
+				"A read or a write to a file failed.  The",
+				"inumber, generation, pos, and len fields",
+				"describe the range of the file that is",
+				"affected."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$fileio/types"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "filerange"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"pos": {
+					"$ref": "#/$defs/off_t"
+				},
+				"len": {
+					"$ref": "#/$defs/size_t"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"pos",
+				"len"
+			]
 		}
 	}
 }



* [PATCH 08/21] xfs: add media error reporting ioctl
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-12-31 23:49   ` [PATCH 07/21] xfs: report file io " Darrick J. Wong
@ 2024-12-31 23:49   ` Darrick J. Wong
  2024-12-31 23:50   ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong
                     ` (12 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new privileged ioctl so that xfs_scrub can report media errors to
the kernel for further processing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_fs.h |   15 +++++++++++++++
 1 file changed, 15 insertions(+)
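
Because the device code shares the flags word with whatever flag bits
might be defined later, callers need to mask when they set or read it.
The sketch below shows that packing.  The macros are local copies of the
ones this patch proposes; the two helper names are hypothetical.

```c
#include <stdint.h>

/* Local copies of the definitions proposed in this patch. */
#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)	/* bottom byte is the device code */

/* Replace the device code in flags, leaving the upper bits untouched. */
static inline uint64_t
media_error_set_dev(uint64_t flags, unsigned int dev)
{
	return (flags & ~(uint64_t)XFS_MEDIA_ERROR_DEVMASK) |
	       (dev & XFS_MEDIA_ERROR_DEVMASK);
}

/* Extract the device code from flags. */
static inline unsigned int
media_error_get_dev(uint64_t flags)
{
	return (unsigned int)(flags & XFS_MEDIA_ERROR_DEVMASK);
}
```

Keeping the code in the bottom byte leaves 56 bits free for real flags
without changing the size of the structure.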


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index d7404e6efd866d..32e552d40b1bf5 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1115,6 +1115,20 @@ struct xfs_health_monitor {
 /* Return events in JSON format */
 #define XFS_HEALTH_MONITOR_FMT_JSON	(1)
 
+struct xfs_media_error {
+	__u64	flags;		/* flags */
+	__u64	daddr;		/* disk address of range */
+	__u64	bbcount;	/* length, in 512b blocks */
+	__u64	pad;		/* zero */
+};
+
+#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
+#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
+#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
+
+/* bottom byte of flags is the device code */
+#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1157,6 +1171,7 @@ struct xfs_health_monitor {
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 #define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
+#define XFS_IOC_MEDIA_ERROR	_IOW ('X', 69, struct xfs_media_error)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s



* [PATCH 09/21] xfs_io: monitor filesystem health events
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-12-31 23:49   ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong
@ 2024-12-31 23:50   ` Darrick J. Wong
  2024-12-31 23:50   ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong
                     ` (11 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/Makefile       |    1 
 io/healthmon.c    |  183 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 io/io.h           |    1 
 man/man8/xfs_io.8 |   25 +++++++
 5 files changed, 211 insertions(+)
 create mode 100644 io/healthmon.c
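
The -b and -d options below are parsed with atoi(), which is not
required to set errno and cannot distinguish "0" from garbage input.
The sketch here shows a stricter alternative built on strtol().  It is
only a suggestion; the helper name parse_positive_int is hypothetical
and does not exist in xfsprogs.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a strictly positive decimal int.  Returns 0 and stores the value
 * in *out on success, or -1 for empty input, trailing junk, out-of-range
 * values, and values below 1.
 */
static int
parse_positive_int(const char *s, int *out)
{
	char	*end;
	long	val;

	errno = 0;
	val = strtol(s, &end, 10);
	if (errno || end == s || *end != '\0')
		return -1;
	if (val < 1 || val > INT_MAX)
		return -1;

	*out = (int)val;
	return 0;
}
```

A delay of zero is meaningful for -d (no sleeping), so that option would
want a variant that accepts zero.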


diff --git a/io/Makefile b/io/Makefile
index c57594b090f70c..451d2a15b25919 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -26,6 +26,7 @@ CFILES = \
 	fsuuid.c \
 	fsync.c \
 	getrusage.c \
+	healthmon.c \
 	imap.c \
 	init.c \
 	inject.c \
diff --git a/io/healthmon.c b/io/healthmon.c
new file mode 100644
index 00000000000000..7d372d7d8c532b
--- /dev/null
+++ b/io/healthmon.c
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/paths.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+
+static void
+healthmon_help(void)
+{
+	printf(_(
+"Monitor filesystem health events"
+"\n-b bufsize     Use a read buffer of this many bytes.\n"
+"-c             Replace the open file with the monitor file.\n"
+"-d delay_ms    Sleep this many milliseconds between reads.\n"
+"-p             Only probe for the existence of the ioctl.\n"
+"-v             Request all events.\n"
+"\n"));
+}
+
+static inline int
+monitor_sleep(
+	int			delay_ms)
+{
+	struct timespec		ts;
+
+	if (!delay_ms)
+		return 0;
+
+	ts.tv_sec = delay_ms / 1000;
+	ts.tv_nsec = (delay_ms % 1000) * 1000000;
+
+	return nanosleep(&ts, NULL);
+}
+
+static int
+monitor(
+	size_t			bufsize,
+	bool			consume,
+	int			delay_ms,
+	bool			verbose,
+	bool			only_probe)
+{
+	struct xfs_health_monitor	hmo = {
+		.format		= XFS_HEALTH_MONITOR_FMT_JSON,
+	};
+	char			*buf;
+	ssize_t			bytes_read;
+	int			mon_fd;
+	int			ret = 1;
+
+	if (verbose)
+		hmo.flags |= XFS_HEALTH_MONITOR_ALL;
+
+	mon_fd = ioctl(file->fd, XFS_IOC_HEALTH_MONITOR, &hmo);
+	if (mon_fd < 0) {
+		perror("XFS_IOC_HEALTH_MONITOR");
+		return 1;
+	}
+
+	if (only_probe) {
+		ret = 0;
+		goto out_mon;
+	}
+
+	buf = malloc(bufsize);
+	if (!buf) {
+		perror("malloc");
+		goto out_mon;
+	}
+
+	if (consume) {
+		close(file->fd);
+		file->fd = mon_fd;
+	}
+
+	monitor_sleep(delay_ms);
+	while ((bytes_read = read(mon_fd, buf, bufsize)) > 0) {
+		char		*write_ptr = buf;
+		ssize_t		bytes_written;
+		size_t		to_write = bytes_read;
+
+		while ((bytes_written = write(STDOUT_FILENO, write_ptr, to_write)) > 0) {
+			write_ptr += bytes_written;
+			to_write -= bytes_written;
+		}
+		if (bytes_written < 0) {
+			perror("healthmon");
+			goto out_buf;
+		}
+
+		monitor_sleep(delay_ms);
+	}
+	if (bytes_read < 0) {
+		perror("healthmon");
+		goto out_buf;
+	}
+
+	ret = 0;
+
+out_buf:
+	free(buf);
+out_mon:
+	close(mon_fd);
+	return ret;
+}
+
+static int
+healthmon_f(
+	int			argc,
+	char			**argv)
+{
+	size_t			bufsize = 4096;
+	bool			consume = false;
+	bool			verbose = false;
+	bool			only_probe = false;
+	int			delay_ms = 0;
+	int			c;
+
+	while ((c = getopt(argc, argv, "b:cd:pv")) != EOF) {
+		switch (c) {
+		case 'b':
+			errno = 0;
+			c = atoi(optarg);
+			if (c <= 0 || errno) {
+				printf("%s: bufsize must be positive\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			bufsize = c;
+			break;
+		case 'c':
+			consume = true;
+			break;
+		case 'd':
+			errno = 0;
+			delay_ms = atoi(optarg);
+			if (delay_ms < 0 || errno) {
+				printf("%s: delay must be non-negative msecs\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			break;
+		case 'p':
+			only_probe = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			exitcode = 1;
+			healthmon_help();
+			return 0;
+		}
+	}
+
+	return monitor(bufsize, consume, delay_ms, verbose, only_probe);
+}
+
+static struct cmdinfo healthmon_cmd = {
+	.name		= "healthmon",
+	.cfunc		= healthmon_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.args		= "[-b bufsize] [-c] [-d delay_ms] [-p] [-v]",
+	.help		= healthmon_help,
+};
+
+void
+healthmon_init(void)
+{
+	healthmon_cmd.oneline = _("monitor filesystem health events");
+
+	add_command(&healthmon_cmd);
+}
diff --git a/io/init.c b/io/init.c
index 17b772813bc113..22ebd2f7522a18 100644
--- a/io/init.c
+++ b/io/init.c
@@ -92,6 +92,7 @@ init_commands(void)
 	crc32cselftest_init();
 	exchangerange_init();
 	fsprops_init();
+	healthmon_init();
 }
 
 /*
diff --git a/io/io.h b/io/io.h
index 7ae7cf90ace323..267f3ffac36924 100644
--- a/io/io.h
+++ b/io/io.h
@@ -157,3 +157,4 @@ void			exchangerange_init(void);
 void			fsprops_init(void);
 void			aginfo_init(void);
 void			fsrefcounts_init(void);
+void			healthmon_init(void);
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index c4d09ce07f597b..632d07807f44f0 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1419,6 +1419,31 @@ .SH FILESYSTEM COMMANDS
 .RE
 .PD
 
+.TP
+.BI "healthmon [ \-b " bufsize " ] [ \-c ] [ \-d " delay_ms " ] [ \-p ] [ \-v ]"
+Watch for filesystem health events and write them to the console.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI "\-b " bufsize
+Use a buffer of this size to read events from the kernel.
+.TP
+.BI \-c
+Close the open file and replace it with the monitor file.
+.TP
+.BI "\-d " delay_ms
+Sleep for this long between read attempts.
+.TP
+.B \-p
+Probe for the existence of the functionality by opening the monitoring fd and
+closing it immediately.
+.TP
+.BI \-v
+Request all health events, even if nothing changed.
+.PD
+.RE
+
 .TP
 .BI "inject [ " tag " ]"
 Inject errors into a filesystem to observe filesystem behavior at



* [PATCH 10/21] xfs_io: add a media error reporting command
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-12-31 23:50   ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong
@ 2024-12-31 23:50   ` Darrick J. Wong
  2024-12-31 23:50   ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong
                     ` (10 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a subcommand to invoke the media error ioctl to make sure it works.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/shutdown.c     |  113 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_io.8 |   21 ++++++++++
 2 files changed, 133 insertions(+), 1 deletion(-)
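
One thing to keep in mind when testing with unaligned arguments: the
command below rounds the offset down to a 512-byte block and rounds the
length up independently, so the reported range may stop short of the
last byte requested (offset 700, length 400 becomes daddr 1, bbcount 1,
which ends at byte 1023).  The sketch below computes a range that covers
every requested byte.  It is not what the patch does; the function name
is hypothetical, and it assumes len > 0 with no overflow in offset + len.

```c
#include <stdint.h>

#define SECTOR_BYTES	512ULL

/*
 * Compute the 512-byte block range that fully covers the byte range
 * [offset, offset + len).
 */
static void
bytes_to_bb_range(uint64_t offset, uint64_t len,
		  uint64_t *daddr, uint64_t *bbcount)
{
	uint64_t	end = offset + len;	/* exclusive end, in bytes */
	uint64_t	first = offset / SECTOR_BYTES;	/* round start down */
	uint64_t	last = (end + SECTOR_BYTES - 1) / SECTOR_BYTES;
						/* round end up */

	*daddr = first;
	*bbcount = last - first;
}
```

For aligned inputs this gives the same answer as dividing the offset and
length separately, so the difference only shows up for unaligned ranges.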


diff --git a/io/shutdown.c b/io/shutdown.c
index 3c29ea790643f8..b4fba7d78ba83b 100644
--- a/io/shutdown.c
+++ b/io/shutdown.c
@@ -53,6 +53,115 @@ shutdown_help(void)
 "\n"));
 }
 
+static void
+mediaerror_help(void)
+{
+	printf(_(
+"\n"
+" Report a media error on the data device to the filesystem.\n"
+"\n"
+" -l -- Report against the log device.\n"
+" -r -- Report against the realtime device.\n"
+"\n"
+" offset is the byte offset of the start of the failed range.  If offset is\n"
+" specified, length may (optionally) be specified as well."
+"\n"
+" length is the byte length of the failed range.\n"
+"\n"
+" If neither offset nor length is specified, the media error report will\n"
+" be made against the entire device."
+"\n"));
+}
+
+static int
+mediaerror_f(
+	int			argc,
+	char			**argv)
+{
+	struct xfs_media_error	me = {
+		.daddr		= 0,
+		.bbcount	= -1ULL,
+		.flags		= XFS_MEDIA_ERROR_DATADEV,
+	};
+	long long		l;
+	size_t			fsblocksize, fssectsize;
+	int			c, ret;
+
+	init_cvtnum(&fsblocksize, &fssectsize);
+
+	while ((c = getopt(argc, argv, "lr")) != EOF) {
+		switch (c) {
+		case 'l':
+			me.flags = (me.flags & ~XFS_MEDIA_ERROR_DEVMASK) |
+						XFS_MEDIA_ERROR_LOGDEV;
+			break;
+		case 'r':
+			me.flags = (me.flags & ~XFS_MEDIA_ERROR_DEVMASK) |
+						XFS_MEDIA_ERROR_RTDEV;
+			break;
+		default:
+			mediaerror_help();
+			exitcode = 1;
+			return 0;
+		}
+	}
+
+	/* Range start (optional) */
+	if (optind < argc) {
+		l = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (l < 0) {
+			printf("non-numeric offset argument -- %s\n",
+					argv[optind]);
+			exitcode = 1;
+			return 0;
+		}
+
+		me.daddr = l / 512;
+		optind++;
+	}
+
+	/* Range length (optional if range start was specified) */
+	if (optind < argc) {
+		l = cvtnum(fsblocksize, fssectsize, argv[optind]);
+		if (l < 0) {
+			printf("non-numeric len argument -- %s\n",
+					argv[optind]);
+			exitcode = 1;
+			return 0;
+		}
+
+		me.bbcount = howmany(l, 512);
+		optind++;
+	}
+
+	if (optind < argc) {
+		printf("too many arguments -- %s\n", argv[optind]);
+		exitcode = 1;
+		return 0;
+	}
+
+	ret = ioctl(file->fd, XFS_IOC_MEDIA_ERROR, &me);
+	if (ret) {
+		fprintf(stderr,
+ "%s: ioctl(XFS_IOC_MEDIA_ERROR) [\"%s\"]: %s\n",
+				progname, file->name, strerror(errno));
+		exitcode = 1;
+		return 0;
+	}
+
+	return 0;
+}
+
+static struct cmdinfo mediaerror_cmd = {
+	.name		= "mediaerror",
+	.cfunc		= mediaerror_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.args		= "[-lr] [offset [length]]",
+	.help		= mediaerror_help,
+};
+
 void
 shutdown_init(void)
 {
@@ -66,6 +175,8 @@ shutdown_init(void)
 	shutdown_cmd.oneline =
 		_("shuts down the filesystem where the current file resides");
 
-	if (expert)
+	if (expert) {
 		add_command(&shutdown_cmd);
+		add_command(&mediaerror_cmd);
+	}
 }
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 632d07807f44f0..2ca74e6ab57d4e 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1452,6 +1452,27 @@ .SH FILESYSTEM COMMANDS
 argument, displays the list of error tags available.
 Only available in expert mode and requires privileges.
 
+.TP
+.BI "mediaerror [ \-lr ] [ " offset " [ " length " ]]"
+Report a media error against the data device of an XFS filesystem.
+The
+.I offset
+and
+.I length
+parameters are specified in units of bytes.
+If neither is specified, the report covers the entire device.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI \-l
+Report against the log device instead of the data device.
+.TP
+.BI \-r
+Report against the realtime device instead of the data device.
+.PD
+.RE
+
 .TP
 .BI "rginfo [ \-r " rgno " ]"
 Show information about or update the state of realtime allocation groups.



* [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (9 preceding siblings ...)
  2024-12-31 23:50   ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong
@ 2024-12-31 23:50   ` Darrick J. Wong
  2024-12-31 23:50   ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong
                     ` (9 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a daemon program that can listen for and log health events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/Makefile        |   15 ++-
 scrub/xfs_scrubbed.in |  287 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 300 insertions(+), 2 deletions(-)
 create mode 100644 scrub/xfs_scrubbed.in
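
To make the event reader easier to review, here is a stdlib-only sketch of
the framing that health_reports() appears to assume: flat JSON objects,
possibly spread over several lines and separated by NULs, with a line
containing '}' closing each object.  parse_events() and the sample stream are
hypothetical; only the "type", "time_ns" and "reasons" keys are taken from
the patch.

```python
import io
import json


def parse_events(fp):
    """Yield one dict per JSON object read from a text stream."""
    pending = []
    for buf in fp:
        for piece in buf.split('\0'):
            piece = piece.strip()
            if not piece:
                continue

            pending.append(piece)
            # A '}' marks the end of an object; this only holds for flat
            # objects, which is the same assumption the daemon makes.
            if '}' not in piece:
                continue

            text = ''.join(pending)
            pending = []
            try:
                yield json.loads(text)
            except json.JSONDecodeError:
                # Skip fragments that do not parse; keep reading.
                continue


# Made-up sample input: one single-line event, one multi-line event.
sample = io.StringIO(
    '{"type": "lost", "time_ns": 1}\n'
    '\0{\n"type": "shutdown",\n"reasons": ["log_ioerr"]\n}\n')
events = list(parse_events(sample))
```

The '}' heuristic would split a nested object early, so this sketch (like
the loop it mirrors) is only suitable while events stay flat.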


diff --git a/scrub/Makefile b/scrub/Makefile
index 1e1109048c2a83..bd910922ceb4bb 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -18,6 +18,7 @@ XFS_SCRUB_ALL_PROG = xfs_scrub_all
 XFS_SCRUB_FAIL_PROG = xfs_scrub_fail
 XFS_SCRUB_ARGS = -p
 XFS_SCRUB_SERVICE_ARGS = -b -o autofsck
+XFS_SCRUBBED_PROG = xfs_scrubbed
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
@@ -108,9 +109,9 @@ endif
 # Automatically trigger a media scan once per month
 XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL=1mo
 
-LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) *.service *.cron
+LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) *.service *.cron
 
-default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(OPTIONAL_TARGETS)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) $(OPTIONAL_TARGETS)
 
 xfs_scrub_all: xfs_scrub_all.in $(builddefs)
 	@echo "    [SED]    $@"
@@ -123,6 +124,14 @@ xfs_scrub_all: xfs_scrub_all.in $(builddefs)
 		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
 	$(Q)chmod a+x $@
 
+xfs_scrubbed: xfs_scrubbed.in $(builddefs)
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
+		   -e "s|@scrub_svcname@|$(scrub_svcname)|g" \
+		   -e "s|@pkg_version@|$(PKG_VERSION)|g" \
+		   < $< > $@
+	$(Q)chmod a+x $@
+
 xfs_scrub_fail: xfs_scrub_fail.in $(builddefs)
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
@@ -165,6 +174,8 @@ install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_SBIN_DIR)
 	$(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_SBIN_DIR)
+	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUBBED_PROG) $(PKG_LIBEXEC_DIR)
 	$(INSTALL) -m 755 -d $(PKG_STATE_DIR)
 
 install-udev: $(UDEV_RULES)
diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
new file mode 100644
index 00000000000000..4d742a9151a082
--- /dev/null
+++ b/scrub/xfs_scrubbed.in
@@ -0,0 +1,287 @@
+#!/usr/bin/python3
+
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (c) 2024-2025 Oracle.  All rights reserved.
+#
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+# Daemon to listen for and react to filesystem health events
+
+import sys
+import os
+import argparse
+import fcntl
+import json
+import datetime
+import errno
+import ctypes
+import gc
+from concurrent.futures import ProcessPoolExecutor
+
+debug = False
+log = False
+everything = False
+debug_fast = False
+printf_prefix = ''
+
+# ioctl encoding stuff
+_IOC_NRBITS   =  8
+_IOC_TYPEBITS =  8
+_IOC_SIZEBITS = 14
+_IOC_DIRBITS  =  2
+
+_IOC_NRMASK   = (1 << _IOC_NRBITS) - 1
+_IOC_TYPEMASK = (1 << _IOC_TYPEBITS) - 1
+_IOC_SIZEMASK = (1 << _IOC_SIZEBITS) - 1
+_IOC_DIRMASK  = (1 << _IOC_DIRBITS) - 1
+
+_IOC_NRSHIFT   = 0
+_IOC_TYPESHIFT = (_IOC_NRSHIFT   + _IOC_NRBITS)
+_IOC_SIZESHIFT = (_IOC_TYPESHIFT + _IOC_TYPEBITS)
+_IOC_DIRSHIFT  = (_IOC_SIZESHIFT + _IOC_SIZEBITS)
+
+_IOC_NONE  = 0
+_IOC_WRITE = 1
+_IOC_READ  = 2
+
+def _IOC(direction, type, nr, t):
+	assert direction <= _IOC_DIRMASK, direction
+	assert type <= _IOC_TYPEMASK, type
+	assert nr <= _IOC_NRMASK, nr
+
+	size = ctypes.sizeof(t)
+	assert size <= _IOC_SIZEMASK, size
+
+	return (((direction)  << _IOC_DIRSHIFT) |
+		((type) << _IOC_TYPESHIFT) |
+		((nr)   << _IOC_NRSHIFT) |
+		((size) << _IOC_SIZESHIFT))
+
+def _IOR(type, number, size):
+	return _IOC(_IOC_READ, type, number, size)
+
+def _IOW(type, number, size):
+	return _IOC(_IOC_WRITE, type, number, size)
+
+def _IOWR(type, number, size):
+	return _IOC(_IOC_READ | _IOC_WRITE, type, number, size)
+
+# xfs health monitoring ioctl stuff
+XFS_HEALTH_MONITOR_FMT_JSON = 1
+XFS_HEALTH_MONITOR_VERBOSE = 1 << 0
+
+class xfs_health_monitor(ctypes.Structure):
+	_fields_ = [
+		('flags',	ctypes.c_ulonglong),
+		('format',	ctypes.c_ubyte),
+		('_pad0',	ctypes.c_ubyte * 7),
+		('_pad1',	ctypes.c_ulonglong * 2)
+	]
+assert ctypes.sizeof(xfs_health_monitor) == 32
+
+XFS_IOC_HEALTH_MONITOR = _IOW(0x58, 68, xfs_health_monitor)
+
+def open_health_monitor(fd, verbose = False):
+	'''Return a health monitoring fd.'''
+
+	arg = xfs_health_monitor()
+	arg.format = XFS_HEALTH_MONITOR_FMT_JSON
+
+	if verbose:
+		arg.flags |= XFS_HEALTH_MONITOR_VERBOSE
+
+	ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, arg)
+	return ret
+
+# main program
+
+def health_reports(mon_fp):
+	'''Generate python objects describing health events.'''
+	global debug
+	global printf_prefix
+
+	lines = []
+	buf = mon_fp.readline()
+	while buf != '':
+		for line in buf.split('\0'):
+			line = line.strip()
+			if debug:
+				print(f'new line: {line}')
+			if line == '':
+				continue
+
+			lines.append(line)
+			if not '}' in line:
+				continue
+
+			s = ''.join(lines)
+			if debug:
+				print(f'new event: {s}')
+			try:
+				yield json.loads(s)
+			except json.decoder.JSONDecodeError as e:
+				print(f"{printf_prefix}: {e} from {s}",
+						file = sys.stderr)
+				pass
+			lines = []
+		buf = mon_fp.readline()
+
+def log_event(event):
+	'''Log a monitoring event to stdout.'''
+	global printf_prefix
+
+	print(f"{printf_prefix}: {event}")
+	sys.stdout.flush()
+
+def report_lost(event):
+	'''Report that the kernel lost events.'''
+	global printf_prefix
+
+	print(f"{printf_prefix}: Events were lost.")
+	sys.stdout.flush()
+
+def report_shutdown(event):
+	'''Report an abortive shutdown of the filesystem.'''
+	global printf_prefix
+	REASONS = {
+		"meta_ioerr":		"metadata IO error",
+		"log_ioerr":		"log IO error",
+		"force_umount":		"forced unmount",
+		"corrupt_incore":	"in-memory state corruption",
+		"corrupt_ondisk":	"ondisk metadata corruption",
+		"device_removed":	"device removal",
+	}
+
+	reasons = []
+	for reason in event['reasons']:
+		if reason in REASONS:
+			reasons.append(REASONS[reason])
+		else:
+			reasons.append(reason)
+
+	print(f"{printf_prefix}: Filesystem shut down due to {', '.join(reasons)}.")
+	sys.stdout.flush()
+
+def handle_event(event):
+	'''Handle an event asynchronously.'''
+	def stringify_timestamp(event):
+		'''Try to convert a timestamp to something human readable.'''
+		try:
+			ts = datetime.datetime.fromtimestamp(event['time_ns'] / 1e9).astimezone()
+			event['time'] = str(ts)
+			del event['time_ns']
+		except Exception as e:
+			# Not a big deal if we can't format the timestamp, but
+			# let's yell about that loudly
+			print(f'{printf_prefix}: bad timestamp: {e}', file = sys.stderr)
+
+	global log
+
+	stringify_timestamp(event)
+	if log:
+		log_event(event)
+	if event['type'] == 'lost':
+		report_lost(event)
+	elif event['type'] == 'shutdown':
+		report_shutdown(event)
+
+def monitor(mountpoint, event_queue, **kwargs):
+	'''Monitor the given mountpoint for health events.'''
+	global everything
+
+	fd = os.open(mountpoint, os.O_RDONLY)
+	try:
+		mon_fd = open_health_monitor(fd, verbose = everything)
+	except OSError as e:
+		if e.errno != errno.ENOTTY and e.errno != errno.EOPNOTSUPP:
+			raise e
+		print(f"{mountpoint}: XFS health monitoring not supported.",
+				file = sys.stderr)
+		return 1
+	finally:
+		# Close the mountpoint if opening the health monitor fails
+		os.close(fd)
+
+	# Ownership of mon_fd (and hence responsibility for closing it) is
+	# transferred to the mon_fp object.
+	with os.fdopen(mon_fd) as mon_fp:
+		nr = 0
+		for e in health_reports(mon_fp):
+			event_queue.submit(handle_event, e)
+
+			# Periodically run the garbage collector to constrain
+			# memory usage in the main thread.  If only there was
+			# a way to submit to a queue without everything being
+			# tied up in a Future
+			if nr % 5355 == 0:
+				gc.collect()
+			nr += 1
+
+	return 0
+
+def main():
+	global debug
+	global log
+	global printf_prefix
+	global everything
+	global debug_fast
+
+	parser = argparse.ArgumentParser( \
+			description = "XFS filesystem health monitoring daemon.")
+	parser.add_argument("--debug", help = "Enable debugging messages.", \
+			action = "store_true")
+	parser.add_argument("--log", help = "Log health events to stdout.", \
+			action = "store_true")
+	parser.add_argument("--everything", help = "Capture all events.", \
+			action = "store_true")
+	parser.add_argument("-V", help = "Report version and exit.", \
+			action = "store_true")
+	parser.add_argument('mountpoint', default = None, nargs = '?',
+			help = 'XFS filesystem mountpoint to target.')
+	parser.add_argument('--debug-fast', action = 'store_true', \
+			help = argparse.SUPPRESS)
+	args = parser.parse_args()
+
+	if args.V:
+		print("xfs_scrubbed version @pkg_version@")
+		return 0
+
+	if args.mountpoint is None:
+		parser.error("the following arguments are required: mountpoint")
+		return 1
+
+	if args.debug:
+		debug = True
+	if args.log:
+		log = True
+	if args.everything:
+		everything = True
+	if args.debug_fast:
+		debug_fast = True
+
+	# Use a separate subprocess to handle the events so that the main event
+	# reading process does not block on the GIL of the event handling
+	# subprocess.  The downside is that we cannot pass function pointers
+	# and all data must be pickleable; the upside is not losing events.
+	#
+	# If the secret maximum efficiency setting is enabled, assume this is
+	# part of QA, so use all CPUs to process events.  Normally we start one
+	# background process to minimize service footprint.
+	if debug_fast:
+		args.event_queue = ProcessPoolExecutor()
+	else:
+		args.event_queue = ProcessPoolExecutor(max_workers = 1)
+
+	printf_prefix = args.mountpoint
+	ret = 0
+	try:
+		ret = monitor(**vars(args))
+	except KeyboardInterrupt:
+		# Consider SIGINT to be a clean exit.
+		pass
+
+	args.event_queue.shutdown()
+	return ret
+
+if __name__ == '__main__':
+	sys.exit(main())



* [PATCH 12/21] xfs_scrubbed: check events against schema
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (10 preceding siblings ...)
  2024-12-31 23:50   ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong
@ 2024-12-31 23:50   ` Darrick J. Wong
  2024-12-31 23:51   ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
                     ` (8 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Validate that the event objects that we get from the kernel actually
obey the schema that the kernel publishes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/Makefile       |   10 ++++++--
 scrub/Makefile        |    1 +
 scrub/xfs_scrubbed.in |   62 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+), 3 deletions(-)
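
A condensed sketch of the optional-dependency pattern used here, for
reviewers who want to try it in isolation.  DEMO_SCHEMA is a hypothetical,
heavily trimmed stand-in and is NOT the xfs_healthmon.schema.json that the
kernel publishes; make_checker() is likewise an illustrative name.  The
jsonschema calls are the same ones the patch uses.

```python
import sys

try:
    # Third-party and optional, as in the patch.
    import jsonschema
    HAVE_JSONSCHEMA = True
except ImportError:
    HAVE_JSONSCHEMA = False

# Hypothetical stand-in schema for illustration only.
DEMO_SCHEMA = {
    "type": "object",
    "required": ["type"],
    "properties": {"type": {"enum": ["lost", "shutdown", "sick"]}},
}


def make_checker(schema):
    """Return a function that says whether an event obeys the schema."""
    if not HAVE_JSONSCHEMA:
        # Without the library every event is accepted.
        return lambda event: True

    cls = jsonschema.validators.validator_for(schema)
    cls.check_schema(schema)
    validator = cls(schema)

    def check(event):
        err = jsonschema.exceptions.best_match(validator.iter_errors(event))
        if err is not None:
            print(err.message, file=sys.stderr)
            return False
        return True

    return check


check = make_checker(DEMO_SCHEMA)
```

With jsonschema installed, check({"type": "bogus"}) is rejected; without it,
the fallback accepts everything, which is why the patch adds a
--require-validation switch.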


diff --git a/libxfs/Makefile b/libxfs/Makefile
index 61c43529b532b6..f84eb5b43cdddd 100644
--- a/libxfs/Makefile
+++ b/libxfs/Makefile
@@ -151,6 +151,8 @@ EXTRA_OBJECTS=\
 
 LDIRT += $(EXTRA_OBJECTS)
 
+JSON_SCHEMAS=xfs_healthmon.schema.json
+
 #
 # Tracing flags:
 # -DMEM_DEBUG		all zone memory use
@@ -174,7 +176,7 @@ LTLIBS = $(LIBPTHREAD) $(LIBRT)
 # don't try linking xfs_repair with a debug libxfs.
 DEBUG = -DNDEBUG
 
-default: ltdepend $(LTLIBRARY) $(EXTRA_OBJECTS)
+default: ltdepend $(LTLIBRARY) $(EXTRA_OBJECTS) $(JSON_SCHEMAS)
 
 %dummy.o: %dummy.cpp
 	@echo "    [CXXD]   $@"
@@ -196,14 +198,16 @@ MAKECXXDEP := $(MAKEDEPEND) $(CXXFLAGS)
 include $(BUILDRULES)
 
 install: default
-	$(INSTALL) -m 755 -d $(PKG_INC_DIR)
+	$(INSTALL) -m 755 -d $(PKG_DATA_DIR)
+	$(INSTALL) -m 644 $(JSON_SCHEMAS) $(PKG_DATA_DIR)
 
 install-headers: $(addsuffix -hdrs, $(PKGHFILES))
 
 %-hdrs:
 	$(Q)$(LN_S) -f $(CURDIR)/$* $(TOPDIR)/include/xfs/$*
 
-install-dev: install
+install-dev: default
+	$(INSTALL) -m 755 -d $(PKG_INC_DIR)
 	$(INSTALL) -m 644 $(PKGHFILES) $(PKG_INC_DIR)
 
 # We need to install the headers before building the dependencies.  If we
diff --git a/scrub/Makefile b/scrub/Makefile
index bd910922ceb4bb..7d4fa0ddc09685 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -129,6 +129,7 @@ xfs_scrubbed: xfs_scrubbed.in $(builddefs)
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
 		   -e "s|@scrub_svcname@|$(scrub_svcname)|g" \
 		   -e "s|@pkg_version@|$(PKG_VERSION)|g" \
+		   -e "s|@pkg_data_dir@|$(PKG_DATA_DIR)|g" \
 		   < $< > $@
 	$(Q)chmod a+x $@
 
diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 4d742a9151a082..992797113d6d30 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -18,6 +18,52 @@ import ctypes
 import gc
 from concurrent.futures import ProcessPoolExecutor
 
+try:
+	# Not all systems will have this json schema validation library,
+	# so we make it optional.
+	import jsonschema
+
+	def init_validation(args):
+		'''Initialize event json validation.'''
+		try:
+			with open(args.event_schema) as fp:
+				schema_js = json.load(fp)
+		except Exception as e:
+			print(f"{args.event_schema}: {e}", file = sys.stderr)
+			return
+
+		try:
+			vcls = jsonschema.validators.validator_for(schema_js)
+			vcls.check_schema(schema_js)
+			validator = vcls(schema_js)
+		except jsonschema.exceptions.SchemaError as e:
+			print(f"{args.event_schema}: invalid event data, {e.message}",
+					file = sys.stderr)
+			return
+		except Exception as e:
+			print(f"{args.event_schema}: {e}", file = sys.stderr)
+			return
+
+		def v(i):
+			e = jsonschema.exceptions.best_match(validator.iter_errors(i))
+			if e:
+				print(f"{printf_prefix}: {e.message}",
+						file = sys.stderr)
+				return False
+			return True
+
+		return v
+
+except:
+	def init_validation(args):
+		if args.require_validation:
+			print("JSON schema validation not available.",
+					file = sys.stderr)
+			return
+
+		return lambda instance: True
+
+validator_fn = None
 debug = False
 log = False
 everything = False
@@ -177,6 +223,12 @@ def handle_event(event):
 
 	global log
 
+	# Ignore any event that doesn't pass our schema.  This program must
+	# not try to handle a newer kernel that says things that it is not
+	# prepared to handle.
+	if not validator_fn(event):
+		return
+
 	stringify_timestamp(event)
 	if log:
 		log_event(event)
@@ -225,6 +277,7 @@ def main():
 	global printf_prefix
 	global everything
 	global debug_fast
+	global validator_fn
 
 	parser = argparse.ArgumentParser( \
 			description = "XFS filesystem health monitoring daemon.")
@@ -240,6 +293,11 @@ def main():
 			help = 'XFS filesystem mountpoint to target.')
 	parser.add_argument('--debug-fast', action = 'store_true', \
 			help = argparse.SUPPRESS)
+	parser.add_argument('--require-validation', action = 'store_true', \
+			help = argparse.SUPPRESS)
+	parser.add_argument('--event-schema', type = str, \
+			default = '@pkg_data_dir@/xfs_healthmon.schema.json', \
+			help = argparse.SUPPRESS)
 	args = parser.parse_args()
 
 	if args.V:
@@ -250,6 +308,10 @@ def main():
 		parser.error("the following arguments are required: mountpoint")
 		return 1
 
+	validator_fn = init_validation(args)
+	if not validator_fn:
+		return 1
+
 	if args.debug:
 		debug = True
 	if args.log:



* [PATCH 13/21] xfs_scrubbed: enable repairing filesystems
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (11 preceding siblings ...)
  2024-12-31 23:50   ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong
@ 2024-12-31 23:51   ` Darrick J. Wong
  2024-12-31 23:51   ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong
                     ` (7 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that our health monitoring daemon can initiate repairs.
Repairs can take a while to run, so we don't want to do that work in
the event thread: the kernel queue can drop events if userspace doesn't
respond in time.

Therefore, create a subprocess executor to run the repairs in the
background, and do the repairs from there.  The subprocess executor is
similar in concept to what a libfrog workqueue does, but the workers do
not share address space, which eliminates GIL contention.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in |  366 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 360 insertions(+), 6 deletions(-)
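
For review purposes, a small pure-Python sketch of two decisions this patch
makes: which scrub ioctl fields get filled in for each event domain, and how
the output flags are turned into a verdict.  scrub_target() and outcome()
are hypothetical names, and treating XFS_SCRUB_OFLAG_XCORRUPT as a failure
is an assumption of this sketch.  The flag values and the event keys
("domain", "group", "inumber", "generation") are copied from the diff.

```python
# Output flag bits, copied from the constants in this patch.
XFS_SCRUB_OFLAG_CORRUPT = 1 << 1
XFS_SCRUB_OFLAG_XFAIL = 1 << 3
XFS_SCRUB_OFLAG_XCORRUPT = 1 << 4
XFS_SCRUB_OFLAG_INCOMPLETE = 1 << 5
XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED = 1 << 7


def scrub_target(event):
    """Return (group, ino, gen) for a 'sick' event, keyed by its domain."""
    domain = event['domain']
    if domain in ('fs', 'realtime'):
        return (0, 0, 0)
    if domain in ('perag', 'rtgroup'):
        return (event['group'], 0, 0)
    if domain == 'inode':
        return (0, event['inumber'], event['generation'])
    raise ValueError(f'unknown metadata domain "{domain}"')


def outcome(oflags):
    """Classify the flags returned by a repair call."""
    # Assumption: cross-referencing corruption counts as a failed repair.
    failed = (XFS_SCRUB_OFLAG_CORRUPT |
              XFS_SCRUB_OFLAG_XCORRUPT |
              XFS_SCRUB_OFLAG_INCOMPLETE)
    if oflags & failed:
        return 'failed'
    if oflags & XFS_SCRUB_OFLAG_XFAIL:
        return 'xref-failed'
    if oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED:
        return 'clean'
    return 'repaired'
```

Keeping both pieces free of ioctl calls makes them easy to exercise without
an XFS filesystem at hand.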


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 992797113d6d30..c626c7bd56630c 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -17,6 +17,7 @@ import errno
 import ctypes
 import gc
 from concurrent.futures import ProcessPoolExecutor
+import ctypes.util
 
 try:
 	# Not all systems will have this json schema validation library,
@@ -37,7 +38,7 @@ try:
 			vcls.check_schema(schema_js)
 			validator = vcls(schema_js)
 		except jsonschema.exceptions.SchemaError as e:
-			print(f"{args.event_schema}: invalid event data, {e.message}",
+			print(f"{args.event_schema}: invalid event data: {e.message}",
 					file = sys.stderr)
 			return
 		except Exception as e:
@@ -69,6 +70,9 @@ log = False
 everything = False
 debug_fast = False
 printf_prefix = ''
+want_repair = False
+libhandle = None
+repair_queue = None # placeholder for event queue worker
 
 # ioctl encoding stuff
 _IOC_NRBITS   =  8
@@ -112,6 +116,9 @@ def _IOW(type, number, size):
 def _IOWR(type, number, size):
 	return _IOC(_IOC_READ | _IOC_WRITE, type, number, size)
 
+def _IOWR(type, number, size):
+	return _IOC(_IOC_READ | _IOC_WRITE, type, number, size)
+
 # xfs health monitoring ioctl stuff
 XFS_HEALTH_MONITOR_FMT_JSON = 1
 XFS_HEALTH_MONITOR_VERBOSE = 1 << 0
@@ -139,9 +146,206 @@ def open_health_monitor(fd, verbose = False):
 	ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, arg)
 	return ret
 
+# libhandle stuff
+class xfs_fsid(ctypes.Structure):
+	_fields_ = [
+		("_val0",	ctypes.c_uint),
+		("_val1",	ctypes.c_uint)
+	]
+
+class xfs_fid(ctypes.Structure):
+	_fields_ = [
+		("fid_len",	ctypes.c_ushort),
+		("fid_pad",	ctypes.c_ushort),
+		("fid_gen",	ctypes.c_uint),
+		("fid_ino",	ctypes.c_ulonglong)
+	]
+
+class xfs_handle(ctypes.Structure):
+	_fields_ = [
+		("_ha_fsid",	xfs_fsid),
+		("ha_fid",	xfs_fid)
+	]
+assert ctypes.sizeof(xfs_handle) == 24
+
+class fshandle(object):
+	def __init__(self, fd, mountpoint):
+		global libhandle
+		global printf_prefix
+
+		self.handle = xfs_handle()
+
+		if mountpoint is None:
+			raise Exception('fshandle needs a mountpoint')
+
+		self.mountpoint = mountpoint
+
+		# Create the file and fs handles for the open mountpoint
+		# so that we can compare them later
+		buf = ctypes.c_void_p()
+		buflen = ctypes.c_size_t()
+		ret = libhandle.fd_to_handle(fd, buf, buflen)
+		if ret < 0:
+			errcode = ctypes.get_errno()
+			raise OSError(errcode,
+					f'cannot create handle: {os.strerror(errcode)}',
+					printf_prefix)
+		if buflen.value != ctypes.sizeof(xfs_handle):
+			libhandle.free_handle(buf, buflen.value)
+			raise Exception(f"fshandle expected {ctypes.sizeof(xfs_handle)} bytes, got {buflen.value}.")
+
+		hanp = ctypes.cast(buf, ctypes.POINTER(xfs_handle))
+		self.handle = hanp.contents
+
+	def open(self):
+		'''Reopen a file handle obtained via weak reference.'''
+		global libhandle
+		global printf_prefix
+
+		buf = ctypes.c_void_p()
+		buflen = ctypes.c_size_t()
+
+		fd = os.open(self.mountpoint, os.O_RDONLY)
+
+		# Create the file and fs handles for the open mountpoint
+		# so that we can compare them later
+		ret = libhandle.fd_to_handle(fd, buf, buflen)
+		if ret < 0:
+			errcode = ctypes.get_errno()
+			os.close(fd)
+			raise OSError(errcode,
+					f'resampling handle: {os.strerror(errcode)}',
+					printf_prefix)
+
+		hanp = ctypes.cast(buf, ctypes.POINTER(xfs_handle))
+
+		# Did we get the same handle?
+		if buflen.value != ctypes.sizeof(xfs_handle) or \
+		   bytes(hanp.contents) != bytes(self.handle):
+			os.close(fd)
+			libhandle.free_handle(buf, buflen)
+			raise OSError(errno.ESTALE,
+					os.strerror(errno.ESTALE),
+					printf_prefix)
+
+		libhandle.free_handle(buf, buflen)
+		return fd
+
+def libhandle_load():
+	'''Load libhandle and set things up.'''
+	global libhandle
+
+	soname = ctypes.util.find_library('handle')
+	if soname is None:
+		raise OSError(errno.ENOENT,
+				f'while finding library: {os.strerror(errno.ENOENT)}',
+				'libhandle')
+
+	libhandle = ctypes.CDLL(soname, use_errno = True)
+	libhandle.fd_to_handle.argtypes = (
+			ctypes.c_int,
+			ctypes.POINTER(ctypes.c_void_p),
+			ctypes.POINTER(ctypes.c_size_t))
+	libhandle.handle_to_fshandle.argtypes = (
+			ctypes.c_void_p,
+			ctypes.c_size_t,
+			ctypes.POINTER(ctypes.c_void_p),
+			ctypes.POINTER(ctypes.c_size_t))
+	libhandle.path_to_fshandle.argtypes = (
+			ctypes.c_char_p,
+			ctypes.c_void_p,
+			ctypes.c_size_t)
+	libhandle.free_handle.argtypes = (
+			ctypes.c_void_p,
+			ctypes.c_size_t)
+
+# metadata scrubbing stuff
+XFS_SCRUB_TYPE_PROBE		= 0
+XFS_SCRUB_TYPE_SB		= 1
+XFS_SCRUB_TYPE_AGF		= 2
+XFS_SCRUB_TYPE_AGFL		= 3
+XFS_SCRUB_TYPE_AGI		= 4
+XFS_SCRUB_TYPE_BNOBT		= 5
+XFS_SCRUB_TYPE_CNTBT		= 6
+XFS_SCRUB_TYPE_INOBT		= 7
+XFS_SCRUB_TYPE_FINOBT		= 8
+XFS_SCRUB_TYPE_RMAPBT		= 9
+XFS_SCRUB_TYPE_REFCNTBT		= 10
+XFS_SCRUB_TYPE_INODE		= 11
+XFS_SCRUB_TYPE_BMBTD		= 12
+XFS_SCRUB_TYPE_BMBTA		= 13
+XFS_SCRUB_TYPE_BMBTC		= 14
+XFS_SCRUB_TYPE_DIR		= 15
+XFS_SCRUB_TYPE_XATTR		= 16
+XFS_SCRUB_TYPE_SYMLINK		= 17
+XFS_SCRUB_TYPE_PARENT		= 18
+XFS_SCRUB_TYPE_RTBITMAP		= 19
+XFS_SCRUB_TYPE_RTSUM		= 20
+XFS_SCRUB_TYPE_UQUOTA		= 21
+XFS_SCRUB_TYPE_GQUOTA		= 22
+XFS_SCRUB_TYPE_PQUOTA		= 23
+XFS_SCRUB_TYPE_FSCOUNTERS	= 24
+XFS_SCRUB_TYPE_QUOTACHECK	= 25
+XFS_SCRUB_TYPE_NLINKS		= 26
+XFS_SCRUB_TYPE_HEALTHY		= 27
+XFS_SCRUB_TYPE_DIRTREE		= 28
+XFS_SCRUB_TYPE_METAPATH		= 29
+XFS_SCRUB_TYPE_RGSUPER		= 30
+XFS_SCRUB_TYPE_RGBITMAP		= 31
+XFS_SCRUB_TYPE_RTRMAPBT		= 32
+XFS_SCRUB_TYPE_RTREFCBT		= 33
+
+XFS_SCRUB_IFLAG_REPAIR			= 1 << 0
+XFS_SCRUB_OFLAG_CORRUPT			= 1 << 1
+XFS_SCRUB_OFLAG_PREEN			= 1 << 2
+XFS_SCRUB_OFLAG_XFAIL			= 1 << 3
+XFS_SCRUB_OFLAG_XCORRUPT		= 1 << 4
+XFS_SCRUB_OFLAG_INCOMPLETE		= 1 << 5
+XFS_SCRUB_OFLAG_WARNING			= 1 << 6
+XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED	= 1 << 7
+XFS_SCRUB_IFLAG_FORCE_REBUILD		= 1 << 8
+
+class xfs_scrub_metadata(ctypes.Structure):
+	_fields_ = [
+		('sm_type',	ctypes.c_uint),
+		('sm_flags',	ctypes.c_uint),
+		('sm_ino',	ctypes.c_ulonglong),
+		('sm_gen',	ctypes.c_uint),
+		('sm_agno',	ctypes.c_uint),
+		('_pad',	ctypes.c_ulonglong * 5),
+	]
+assert ctypes.sizeof(xfs_scrub_metadata) == 64
+
+XFS_IOC_SCRUB_METADATA		= _IOWR(0x58, 60, xfs_scrub_metadata)
+
+def __xfs_repair_metadata(fd, type, group, ino, gen):
+	'''Call the kernel to repair some metadata.'''
+
+	arg = xfs_scrub_metadata()
+	arg.sm_type = type
+	arg.sm_flags = XFS_SCRUB_IFLAG_REPAIR
+	arg.sm_ino = ino
+	arg.sm_gen = gen
+	arg.sm_agno = group
+
+	fcntl.ioctl(fd, XFS_IOC_SCRUB_METADATA, arg)
+	return arg.sm_flags
+
+def xfs_repair_fs_metadata(fd, type):
+	'''Call the kernel to repair some whole-fs metadata.'''
+	return __xfs_repair_metadata(fd, type, 0, 0, 0)
+
+def xfs_repair_group_metadata(fd, type, group):
+	'''Call the kernel to repair some group metadata.'''
+	return __xfs_repair_metadata(fd, type, group, 0, 0)
+
+def xfs_repair_inode_metadata(fd, type, ino, gen):
+	'''Call the kernel to repair some inode metadata.'''
+	return __xfs_repair_metadata(fd, type, 0, ino, gen)
+
 # main program
 
-def health_reports(mon_fp):
+def health_reports(mon_fp, fh):
 	'''Generate python objects describing health events.'''
 	global debug
 	global printf_prefix
@@ -164,7 +368,7 @@ def health_reports(mon_fp):
 			if debug:
 				print(f'new event: {s}')
 			try:
-				yield json.loads(s)
+				yield (json.loads(s), fh)
 			except json.decoder.JSONDecodeError as e:
 				print(f"{printf_prefix}: {e} from {s}",
 						file = sys.stderr)
@@ -208,7 +412,7 @@ def report_shutdown(event):
 	print(f"{printf_prefix}: Filesystem shut down due to {', '.join(reasons)}.")
 	sys.stdout.flush()
 
-def handle_event(event):
+def handle_event(e):
 	'''Handle an event asynchronously.'''
 	def stringify_timestamp(event):
 		'''Try to convert a timestamp to something human readable.'''
@@ -222,6 +426,17 @@ def handle_event(event):
 			print(f'{printf_prefix}: bad timestamp: {e}', file = sys.stderr)
 
 	global log
+	global repair_queue
+
+	# Use a separate subprocess to handle the repairs so that the event
+	# processing worker does not block on the GIL of the repair workers.
+	# The downside is that we cannot pass function pointers and all data
+	# must be pickleable; the upside is that we don't stall processing of
+	# non-sickness events while repairs are in progress.
+	if want_repair and not repair_queue:
+		repair_queue = ProcessPoolExecutor(max_workers = 1)
+
+	event, fh = e
 
 	# Ignore any event that doesn't pass our schema.  This program must
 	# not try to handle a newer kernel that says things that it is not
@@ -236,13 +451,21 @@ def handle_event(event):
 		report_lost(event)
 	elif event['type'] == 'shutdown':
 		report_shutdown(event)
+	elif want_repair and event['type'] == 'sick':
+		repair_queue.submit(repair_metadata, event, fh)
 
 def monitor(mountpoint, event_queue, **kwargs):
 	'''Monitor the given mountpoint for health events.'''
 	global everything
+	global log
+	global printf_prefix
+	global want_repair
 
+	fh = None
 	fd = os.open(mountpoint, os.O_RDONLY)
 	try:
+		if want_repair:
+			fh = fshandle(fd, mountpoint)
 		mon_fd = open_health_monitor(fd, verbose = everything)
 	except OSError as e:
 		if e.errno != errno.ENOTTY and e.errno != errno.EOPNOTSUPP:
@@ -251,14 +474,15 @@ def monitor(mountpoint, event_queue, **kwargs):
 				file = sys.stderr)
 		return 1
 	finally:
-		# Close the mountpoint if opening the health monitor fails
+		# Close the mountpoint if opening the health monitor fails;
+		# the handle object will free its own memory.
 		os.close(fd)
 
 	# Ownership of mon_fd (and hence responsibility for closing it) is
 	# transferred to the mon_fp object.
 	with os.fdopen(mon_fd) as mon_fp:
 		nr = 0
-		for e in health_reports(mon_fp):
+		for e in health_reports(mon_fp, fh):
 			event_queue.submit(handle_event, e)
 
 			# Periodically run the garbage collector to constrain
@@ -271,6 +495,125 @@ def monitor(mountpoint, event_queue, **kwargs):
 
 	return 0
 
+def __scrub_type(code):
+	'''Convert an entry from a "structures" json list to a scrub type code.'''
+	SCRUB_TYPES = {
+		"probe":	XFS_SCRUB_TYPE_PROBE,
+		"sb":		XFS_SCRUB_TYPE_SB,
+		"agf":		XFS_SCRUB_TYPE_AGF,
+		"agfl":		XFS_SCRUB_TYPE_AGFL,
+		"agi":		XFS_SCRUB_TYPE_AGI,
+		"bnobt":	XFS_SCRUB_TYPE_BNOBT,
+		"cntbt":	XFS_SCRUB_TYPE_CNTBT,
+		"inobt":	XFS_SCRUB_TYPE_INOBT,
+		"finobt":	XFS_SCRUB_TYPE_FINOBT,
+		"rmapbt":	XFS_SCRUB_TYPE_RMAPBT,
+		"refcountbt":	XFS_SCRUB_TYPE_REFCNTBT,
+		"inode":	XFS_SCRUB_TYPE_INODE,
+		"bmapbtd":	XFS_SCRUB_TYPE_BMBTD,
+		"bmapbta":	XFS_SCRUB_TYPE_BMBTA,
+		"bmapbtc":	XFS_SCRUB_TYPE_BMBTC,
+		"directory":	XFS_SCRUB_TYPE_DIR,
+		"xattr":	XFS_SCRUB_TYPE_XATTR,
+		"symlink":	XFS_SCRUB_TYPE_SYMLINK,
+		"parent":	XFS_SCRUB_TYPE_PARENT,
+		"rtbitmap":	XFS_SCRUB_TYPE_RTBITMAP,
+		"rtsummary":	XFS_SCRUB_TYPE_RTSUM,
+		"usrquota":	XFS_SCRUB_TYPE_UQUOTA,
+		"grpquota":	XFS_SCRUB_TYPE_GQUOTA,
+		"prjquota":	XFS_SCRUB_TYPE_PQUOTA,
+		"fscounters":	XFS_SCRUB_TYPE_FSCOUNTERS,
+		"quotacheck":	XFS_SCRUB_TYPE_QUOTACHECK,
+		"nlinks":	XFS_SCRUB_TYPE_NLINKS,
+		"healthy":	XFS_SCRUB_TYPE_HEALTHY,
+		"dirtree":	XFS_SCRUB_TYPE_DIRTREE,
+		"metapath":	XFS_SCRUB_TYPE_METAPATH,
+		"rgsuper":	XFS_SCRUB_TYPE_RGSUPER,
+		"rgbitmap":	XFS_SCRUB_TYPE_RGBITMAP,
+		"rtrmapbt":	XFS_SCRUB_TYPE_RTRMAPBT,
+		"rtrefcountbt":	XFS_SCRUB_TYPE_RTREFCBT,
+	}
+
+	if code not in SCRUB_TYPES:
+		return None
+
+	return SCRUB_TYPES[code]
+
+def report_outcome(oflags):
+	if oflags & (XFS_SCRUB_OFLAG_CORRUPT | \
+		     XFS_SCRUB_OFLAG_XCORRUPT | \
+		     XFS_SCRUB_OFLAG_INCOMPLETE):
+		return "Repair unsuccessful; offline repair required."
+
+	if oflags & XFS_SCRUB_OFLAG_XFAIL:
+		return "Seems correct but cross-referencing failed; offline repair recommended."
+
+	if oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED:
+		return "No modification needed."
+
+	return "Repairs successful."
+
+def repair_wholefs(event, fd):
+	'''React to a fs-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_fs_metadata(fd, type)
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {s}: {e}", file = sys.stderr)
+
+def repair_group(event, fd, group_type):
+	'''React to a group-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_group_metadata(fd, type, event['group'])
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {s}: {e}", file = sys.stderr)
+
+def repair_inode(event, fd):
+	'''React to an inode-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_inode_metadata(fd, type,
+				      event['inumber'], event['generation'])
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {s}: {e}", file = sys.stderr)
+
+def repair_metadata(event, fh):
+	'''Repair a metadata corruption.'''
+	global debug
+	global printf_prefix
+
+	if debug:
+		print(f'repair {event}')
+
+	fd = fh.open()
+	try:
+		if event['domain'] in ['fs', 'realtime']:
+			repair_wholefs(event, fd)
+		elif event['domain'] in ['perag', 'rtgroup']:
+			repair_group(event, fd, event['domain'])
+		elif event['domain'] == 'inode':
+			repair_inode(event, fd)
+		else:
+			raise Exception(f"{printf_prefix}: Unknown metadata domain \"{event['domain']}\".")
+	finally:
+		os.close(fd)
+
 def main():
 	global debug
 	global log
@@ -278,6 +621,7 @@ def main():
 	global everything
 	global debug_fast
 	global validator_fn
+	global want_repair
 
 	parser = argparse.ArgumentParser( \
 			description = "XFS filesystem health monitoring demon.")
@@ -287,6 +631,8 @@ def main():
 			action = "store_true")
 	parser.add_argument("--everything", help = "Capture all events.", \
 			action = "store_true")
+	parser.add_argument("--repair", help = "Automatically repair corrupt metadata.", \
+			action = "store_true")
 	parser.add_argument("-V", help = "Report version and exit.", \
 			action = "store_true")
 	parser.add_argument('mountpoint', default = None, nargs = '?',
@@ -312,6 +658,12 @@ def main():
 	if not validator_fn:
 		return 1
 
+	try:
+		libhandle_load()
+	except OSError as e:
+		print(f"libhandle: {e}", file = sys.stderr)
+		return 1
+
 	if args.debug:
 		debug = True
 	if args.log:
@@ -320,6 +672,8 @@ def main():
 		everything = True
 	if args.debug_fast:
 		debug_fast = True
+	if args.repair:
+		want_repair = True
 
 	# Use a separate subprocess to handle the events so that the main event
 	# reading process does not block on the GIL of the event handling



* [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (12 preceding siblings ...)
  2024-12-31 23:51   ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
@ 2024-12-31 23:51   ` Darrick J. Wong
  2024-12-31 23:51   ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong
                     ` (6 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Online repair relies heavily on back references such as reverse mappings
and directory parent pointers to add redundancy to the filesystem.
Check for these two features and whine a bit if they are missing.
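
As a rough illustration of the check described above, here is a minimal
sketch that decodes the two geometry flag bits and builds the warning
text.  The flag values match the ones this patch defines; the helper
names missing_backref_features() and repair_warnings() are made up for
this example and are not part of xfs_scrubbed.

```python
# Flag bits as defined in the patch below.
XFS_FSOP_GEOM_FLAGS_RMAPBT = 1 << 19
XFS_FSOP_GEOM_FLAGS_PARENT = 1 << 25

def missing_backref_features(geom_flags):
    '''Return the names of back-reference features absent from geom_flags.'''
    wanted = (
        (XFS_FSOP_GEOM_FLAGS_RMAPBT, "rmap btrees"),
        (XFS_FSOP_GEOM_FLAGS_PARENT, "parent pointers"),
    )
    return [name for bit, name in wanted if not geom_flags & bit]

def repair_warnings(mountpoint, geom_flags):
    '''Build one warning line per missing feature (hypothetical helper).'''
    return [
        f"{mountpoint}: XFS online repair is less effective without {name}."
        for name in missing_backref_features(geom_flags)
    ]
```

In the real program geom_flags would come from the flags field filled in
by the FSGEOMETRY ioctl; here it is just an integer so the logic can be
exercised without a mounted filesystem.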

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in |   72 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index c626c7bd56630c..25465128864583 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -71,6 +71,8 @@ everything = False
 debug_fast = False
 printf_prefix = ''
 want_repair = False
+has_parent = False
+has_rmapbt = False
 libhandle = None
 repair_queue = None # placeholder for event queue worker
 
@@ -343,6 +345,57 @@ def xfs_repair_inode_metadata(fd, type, ino, gen):
 	'''Call the kernel to repair some inode metadata.'''
 	return __xfs_repair_metadata(fd, type, 0, ino, gen)
 
+# fsgeometry ioctl
+class xfs_fsop_geom(ctypes.Structure):
+	_fields_ = [
+		("blocksize",		ctypes.c_uint),
+		("rtextsize",		ctypes.c_uint),
+		("agblocks",		ctypes.c_uint),
+		("agcount",		ctypes.c_uint),
+		("logblocks",		ctypes.c_uint),
+		("sectsize",		ctypes.c_uint),
+		("inodesize",		ctypes.c_uint),
+		("imaxpct",		ctypes.c_uint),
+		("datablocks",		ctypes.c_ulonglong),
+		("rtblocks",		ctypes.c_ulonglong),
+		("rtextents",		ctypes.c_ulonglong),
+		("logstart",		ctypes.c_ulonglong),
+		("uuid",		ctypes.c_ubyte * 16),
+		("sunit",		ctypes.c_uint),
+		("swidth",		ctypes.c_uint),
+		("version",		ctypes.c_uint),
+		("flags",		ctypes.c_uint),
+		("logsectsize",		ctypes.c_uint),
+		("rtsectsize",		ctypes.c_uint),
+		("dirblocksize",	ctypes.c_uint),
+		("logsunit",		ctypes.c_uint),
+		("sick",		ctypes.c_uint),
+		("checked",		ctypes.c_uint),
+		("rgblocks",		ctypes.c_uint),
+		("rgcount",		ctypes.c_uint),
+		("_pad",		ctypes.c_ulonglong * 16),
+	]
+assert ctypes.sizeof(xfs_fsop_geom) == 256
+
+XFS_FSOP_GEOM_FLAGS_RMAPBT	= 1 << 19
+XFS_FSOP_GEOM_FLAGS_PARENT	= 1 << 25
+
+XFS_IOC_FSGEOMETRY		= _IOR (0x58, 126, xfs_fsop_geom)
+
+def xfs_has_parent(fd):
+	'''Does this filesystem have parent pointers?'''
+
+	arg = xfs_fsop_geom()
+	fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg)
+	return arg.flags & XFS_FSOP_GEOM_FLAGS_PARENT != 0
+
+def xfs_has_rmapbt(fd):
+	'''Does this filesystem have reverse mapping?'''
+
+	arg = xfs_fsop_geom()
+	fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg)
+	return arg.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT != 0
+
 # main program
 
 def health_reports(mon_fp, fh):
@@ -460,9 +513,28 @@ def monitor(mountpoint, event_queue, **kwargs):
 	global log
 	global printf_prefix
 	global want_repair
+	global has_parent
+	global has_rmapbt
 
 	fh = None
 	fd = os.open(mountpoint, os.O_RDONLY)
+	try:
+		has_parent = xfs_has_parent(fd)
+		has_rmapbt = xfs_has_rmapbt(fd)
+	except Exception as e:
+		# Don't care if we can't detect parent pointers or rmap
+		print(f'{printf_prefix}: detecting fs features: {e}', file = sys.stderr)
+
+	# Check for the backref metadata that makes repair effective.
+	if want_repair:
+		if not has_rmapbt:
+			print(f"{mountpoint}: XFS online repair is less effective without rmap btrees.")
+		if not has_parent:
+			print(f"{mountpoint}: XFS online repair is less effective without parent pointers.")
+
+	# Flush anything that we may have printed about operational state.
+	sys.stdout.flush()
+
 	try:
 		if want_repair:
 			fh = fshandle(fd, mountpoint)



* [PATCH 15/21] xfs_scrubbed: use getparents to look up file names
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (13 preceding siblings ...)
  2024-12-31 23:51   ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong
@ 2024-12-31 23:51   ` Darrick J. Wong
  2024-12-31 23:51   ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong
                     ` (5 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the kernel tells us about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.
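
The path lookup boils down to repeatedly asking for a file's first
parent pointer and prepending the returned name until the root directory
is reached.  A minimal sketch of that walk follows; it assumes a plain
dict standing in for the GETPARENTS ioctl.  build_path() and the
"parents" mapping are hypothetical names for this example only, and the
loop guard is an extra precaution that the patch itself does not have.

```python
def build_path(ino, parents):
    '''Walk first-listed parent pointers from ino up to the root directory.

    parents maps an inode number to a list of (parent_ino, name) tuples
    and stands in for the kernel's answer to GETPARENTS.  The root
    directory has no entry, which ends the walk.
    '''
    components = []
    seen = set()
    while parents.get(ino) and ino not in seen:
        seen.add(ino)  # guard against cycles in damaged metadata
        parent_ino, name = parents[ino][0]
        components.insert(0, name)
        ino = parent_ino
    return "/".join(components)
```

A hard-linked file has several parent pointers; like fgetpath() in this
patch, the sketch follows only the first one, so the reported path is
one valid name for the file rather than all of them.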

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in |  235 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 230 insertions(+), 5 deletions(-)


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 25465128864583..a4e073b3098f7a 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -18,6 +18,7 @@ import ctypes
 import gc
 from concurrent.futures import ProcessPoolExecutor
 import ctypes.util
+import collections
 
 try:
 	# Not all systems will have this json schema validation library,
@@ -171,12 +172,18 @@ class xfs_handle(ctypes.Structure):
 assert ctypes.sizeof(xfs_handle) == 24
 
 class fshandle(object):
-	def __init__(self, fd, mountpoint):
+	def __init__(self, fd, mountpoint = None):
 		global libhandle
 		global printf_prefix
 
 		self.handle = xfs_handle()
 
+		if isinstance(fd, fshandle):
+			# copy an existing fshandle
+			self.mountpoint = fd.mountpoint
+			ctypes.pointer(self.handle)[0] = fd.handle
+			return
+
 		if mountpoint is None:
 			raise Exception('fshandle needs a mountpoint')
 
@@ -233,6 +240,11 @@ class fshandle(object):
 		libhandle.free_handle(buf, buflen)
 		return fd
 
+	def subst(self, ino, gen):
+		'''Substitute the inode and generation components of a handle.'''
+		self.handle.ha_fid.fid_ino = ino
+		self.handle.ha_fid.fid_gen = gen
+
 def libhandle_load():
 	'''Load libhandle and set things up.'''
 	global libhandle
@@ -396,6 +408,170 @@ def xfs_has_rmapbt(fd):
 	fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg)
 	return arg.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT != 0
 
+# getparents ioctl
+class xfs_attrlist_cursor(ctypes.Structure):
+	_fields_ = [
+		("_opaque0",		ctypes.c_uint),
+		("_opaque1",		ctypes.c_uint),
+		("_opaque2",		ctypes.c_uint),
+		("_opaque3",		ctypes.c_uint)
+	]
+
+class xfs_getparents_rec(ctypes.Structure):
+	_fields_ = [
+		("gpr_parent",		xfs_handle),
+		("gpr_reclen",		ctypes.c_uint),
+		("_gpr_reserved",	ctypes.c_uint),
+	]
+
+xfs_getparents_tuple = collections.namedtuple('xfs_getparents_tuple', \
+		['gpr_parent', 'gpr_reclen', 'gpr_name'])
+
+class xfs_getparents_rec_array(object):
+	def __init__(self, nr_bytes):
+		self.nr_bytes = nr_bytes
+		self.bytearray = (ctypes.c_byte * int(nr_bytes))()
+
+	def __slice_to_record(self, bufslice):
+		'''Decode the getparents record at the start of bufslice, trimming the name at its null terminator.'''
+		rec = ctypes.cast(bytes(bufslice), \
+				ctypes.POINTER(xfs_getparents_rec))
+		fixedlen = ctypes.sizeof(xfs_getparents_rec)
+		namelen = rec.contents.gpr_reclen - fixedlen
+
+		for i in range(0, namelen):
+			if bufslice[fixedlen + i] == 0:
+				namelen = i
+				break
+
+		if namelen == 0:
+			return
+
+		return xfs_getparents_tuple(
+				gpr_parent = rec.contents.gpr_parent,
+				gpr_reclen = rec.contents.gpr_reclen,
+				gpr_name = bufslice[fixedlen:fixedlen + namelen])
+
+	def get_buffer(self):
+		'''Return a pointer to the bytearray masquerading as an int.'''
+		return ctypes.addressof(self.bytearray)
+
+	def __iter__(self):
+		'''Walk the getparents records in this array.'''
+		off = 0
+		nr = 0
+		buf = bytes(self.bytearray)
+		while off < self.nr_bytes:
+			bufslice = buf[off:]
+			t = self.__slice_to_record(bufslice)
+			if t is None:
+				break
+			yield t
+			off += t.gpr_reclen
+			nr += 1
+
+class xfs_getparents(ctypes.Structure):
+	_fields_ = [
+		("_gp_cursor",		xfs_attrlist_cursor),
+		("gp_iflags",		ctypes.c_ushort),
+		("gp_oflags",		ctypes.c_ushort),
+		("gp_bufsize",		ctypes.c_uint),
+		("_pad",		ctypes.c_ulonglong),
+		("gp_buffer",		ctypes.c_ulonglong)
+	]
+
+	def __init__(self, fd, nr_bytes):
+		self.fd = fd
+		self.records = xfs_getparents_rec_array(nr_bytes)
+		self.gp_buffer = self.records.get_buffer()
+		self.gp_bufsize = nr_bytes
+
+	def __call_kernel(self):
+		if self.gp_oflags & XFS_GETPARENTS_OFLAG_DONE:
+			return False
+
+		ret = fcntl.ioctl(self.fd, XFS_IOC_GETPARENTS, self)
+		if ret != 0:
+			return False
+
+		return self.gp_oflags & XFS_GETPARENTS_OFLAG_ROOT == 0
+
+	def __iter__(self):
+		ctypes.memset(ctypes.pointer(self._gp_cursor), 0, \
+				ctypes.sizeof(xfs_attrlist_cursor))
+
+		while self.__call_kernel():
+			for i in self.records:
+				yield i
+
+class xfs_getparents_by_handle(ctypes.Structure):
+	_fields_ = [
+		("gph_handle",		xfs_handle),
+		("gph_request",		xfs_getparents)
+	]
+
+	def __init__(self, fd, fh, nr_bytes):
+		self.fd = fd
+		self.records = xfs_getparents_rec_array(nr_bytes)
+		self.gph_request.gp_buffer = self.records.get_buffer()
+		self.gph_request.gp_bufsize = nr_bytes
+		self.gph_handle = fh.handle
+
+	def __call_kernel(self):
+		if self.gph_request.gp_oflags & XFS_GETPARENTS_OFLAG_DONE:
+			return False
+
+		ret = fcntl.ioctl(self.fd, XFS_IOC_GETPARENTS_BY_HANDLE, self)
+		if ret != 0:
+			return False
+
+		return self.gph_request.gp_oflags & XFS_GETPARENTS_OFLAG_ROOT == 0
+
+	def __iter__(self):
+		ctypes.memset(ctypes.pointer(self.gph_request._gp_cursor), 0, \
+				ctypes.sizeof(xfs_attrlist_cursor))
+		while self.__call_kernel():
+			for i in self.records:
+				yield i
+
+assert ctypes.sizeof(xfs_getparents) == 40
+assert ctypes.sizeof(xfs_getparents_by_handle) == 64
+assert ctypes.sizeof(xfs_getparents_rec) == 32
+
+XFS_GETPARENTS_OFLAG_ROOT	= 1 << 0
+XFS_GETPARENTS_OFLAG_DONE	= 1 << 1
+
+XFS_IOC_GETPARENTS		= _IOWR(0x58, 62, xfs_getparents)
+XFS_IOC_GETPARENTS_BY_HANDLE	= _IOWR(0x58, 63, xfs_getparents_by_handle)
+
+def fgetparents(fd, fh = None, bufsize = 1024):
+	'''Return all the parent pointers for a given fd and/or handle.'''
+
+	if fh is not None:
+		return xfs_getparents_by_handle(fd, fh, bufsize)
+	return xfs_getparents(fd, bufsize)
+
+def fgetpath(fd, fh = None, mountpoint = None):
+	'''Return a list of path components up to the root dir of the filesystem for a given fd.'''
+	ret = []
+	if fh is None:
+		nfh = fshandle(fd, mountpoint)
+	else:
+		# Don't subst into the caller's handle
+		nfh = fshandle(fh)
+
+	while True:
+		added = False
+		for pptr in fgetparents(fd, nfh):
+			ret.insert(0, pptr.gpr_name)
+			nfh.subst(pptr.gpr_parent.ha_fid.fid_ino, \
+				  pptr.gpr_parent.ha_fid.fid_gen)
+			added = True
+			break
+		if not added:
+			break
+	return ret
+
 # main program
 
 def health_reports(mon_fp, fh):
@@ -429,11 +605,23 @@ def health_reports(mon_fp, fh):
 			lines = []
 		buf = mon_fp.readline()
 
+def inode_printf_prefix(event):
+	'''Compute the logging prefix for this event.'''
+	global printf_prefix
+
+	if 'path' not in event:
+		return printf_prefix
+
+	if printf_prefix.endswith(os.sep):
+		return f"{printf_prefix}{event['path']}"
+
+	return f"{printf_prefix}{os.sep}{event['path']}"
+
 def log_event(event):
 	'''Log a monitoring event to stdout.'''
 	global printf_prefix
 
-	print(f"{printf_prefix}: {event}")
+	print(f"{inode_printf_prefix(event)}: {event}")
 	sys.stdout.flush()
 
 def report_lost(event):
@@ -480,6 +668,39 @@ def handle_event(e):
 
 	global log
 	global repair_queue
+	global has_parent
+
+	def pathify_event(event, fh):
+		'''Come up with a directory tree path for a file event.'''
+		try:
+			path_fd = fh.open()
+		except Exception as e:
+			# Not the end of the world if we get nothing
+			if getattr(e, 'errno', None) not in (errno.EOPNOTSUPP, errno.ENOTTY):
+				print(f'{printf_prefix}: opening file handle: {e}', file = sys.stderr)
+			return
+
+		try:
+			fh2 = fshandle(fh)
+		except OSError as e:
+			if e.errno != errno.EOPNOTSUPP:
+				print(f'{printf_prefix}: making new file handle: {e}', file = sys.stderr)
+			os.close(path_fd)
+			return
+		except Exception as e:
+			print(f'{printf_prefix}: making new file handle: {e}', file = sys.stderr)
+			os.close(path_fd)
+			return
+
+		try:
+			fh2.subst(event['inumber'], event['generation'])
+			components = [x.decode('utf-8') for x in fgetpath(path_fd, fh2)]
+			event['path'] = os.sep.join(components)
+		except OSError as e:
+			if e.errno != errno.EOPNOTSUPP:
+				print(f'{printf_prefix}: constructing path: {e}', file = sys.stderr)
+		finally:
+			os.close(path_fd)
 
 	# Use a separate subprocess to handle the repairs so that the event
 	# processing worker does not block on the GIL of the repair workers.
@@ -498,6 +719,8 @@ def handle_event(e):
 		return
 
 	stringify_timestamp(event)
+	if event['domain'] == 'inode' and has_parent and not debug_fast:
+		pathify_event(event, fh)
 	if log:
 		log_event(event)
 	if event['type'] == 'lost':
@@ -536,7 +759,7 @@ def monitor(mountpoint, event_queue, **kwargs):
 	sys.stdout.flush()
 
 	try:
-		if want_repair:
+		if want_repair or has_parent:
 			fh = fshandle(fd, mountpoint)
 		mon_fd = open_health_monitor(fd, verbose = everything)
 	except OSError as e:
@@ -653,6 +876,8 @@ def repair_group(event, fd, group_type):
 
 def repair_inode(event, fd):
 	'''React to an inode-domain corruption event by repairing it.'''
+	ipp = inode_printf_prefix(event)
+
 	for s in event['structures']:
 		type = __scrub_type(s)
 		if type is None:
@@ -660,10 +885,10 @@ def repair_inode(event, fd):
 		try:
 			oflags = xfs_repair_inode_metadata(fd, type,
 				      event['inumber'], event['generation'])
-			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			print(f"{ipp}: {s}: {report_outcome(oflags)}")
 			sys.stdout.flush()
 		except Exception as e:
-			print(f"{printf_prefix}: {s}: {e}", file = sys.stderr)
+			print(f"{ipp}: {s}: {e}", file = sys.stderr)
 
 def repair_metadata(event, fh):
 	'''Repair a metadata corruption.'''



* [PATCH 16/21] builddefs: refactor udev directory specification
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (14 preceding siblings ...)
  2024-12-31 23:51   ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong
@ 2024-12-31 23:51   ` Darrick J. Wong
  2024-12-31 23:52   ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong
                     ` (4 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Refactor the code that finds the udev rules directory to detect the
location of the parent udev directory instead.  IOWs, we go from:

UDEV_RULE_DIR=/foo/bar/rules.d

to:

UDEV_DIR=/foo/bar
UDEV_RULE_DIR=/foo/bar/rules.d

This is needed by the next patch, which adds a helper script.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure.ac           |    2 +-
 include/builddefs.in   |    3 ++-
 m4/package_services.m4 |   30 +++++++++++++++---------------
 3 files changed, 18 insertions(+), 17 deletions(-)


diff --git a/configure.ac b/configure.ac
index 1f7fec838e1239..cabbef51068dbc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -175,7 +175,7 @@ if test "$enable_scrub" = "yes"; then
 fi
 AC_CONFIG_SYSTEMD_SYSTEM_UNIT_DIR
 AC_CONFIG_CROND_DIR
-AC_CONFIG_UDEV_RULE_DIR
+AC_CONFIG_UDEV_DIR
 AC_HAVE_BLKID_TOPO
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
diff --git a/include/builddefs.in b/include/builddefs.in
index bb022c36627a72..4a25de76d5c325 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -112,7 +112,8 @@ SYSTEMD_SYSTEM_UNIT_DIR = @systemd_system_unit_dir@
 HAVE_CROND = @have_crond@
 CROND_DIR = @crond_dir@
 HAVE_UDEV = @have_udev@
-UDEV_RULE_DIR = @udev_rule_dir@
+UDEV_DIR = @udev_dir@
+UDEV_RULE_DIR = @udev_dir@/rules.d
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
 USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@
 
diff --git a/m4/package_services.m4 b/m4/package_services.m4
index a683ddb93e0e91..de0504df0c206f 100644
--- a/m4/package_services.m4
+++ b/m4/package_services.m4
@@ -77,33 +77,33 @@ AC_DEFUN([AC_CONFIG_CROND_DIR],
 ])
 
 #
-# Figure out where to put udev rule files
+# Figure out where to put udev files
 #
-AC_DEFUN([AC_CONFIG_UDEV_RULE_DIR],
+AC_DEFUN([AC_CONFIG_UDEV_DIR],
 [
 	AC_REQUIRE([PKG_PROG_PKG_CONFIG])
-	AC_ARG_WITH([udev_rule_dir],
-	  [AS_HELP_STRING([--with-udev-rule-dir@<:@=DIR@:>@],
-		[Install udev rules into DIR.])],
+	AC_ARG_WITH([udev_dir],
+	  [AS_HELP_STRING([--with-udev-dir@<:@=DIR@:>@],
+		[Install udev files underneath DIR.])],
 	  [],
-	  [with_udev_rule_dir=yes])
-	AS_IF([test "x${with_udev_rule_dir}" != "xno"],
+	  [with_udev_dir=yes])
+	AS_IF([test "x${with_udev_dir}" != "xno"],
 	  [
-		AS_IF([test "x${with_udev_rule_dir}" = "xyes"],
+		AS_IF([test "x${with_udev_dir}" = "xyes"],
 		  [
 			PKG_CHECK_MODULES([udev], [udev],
 			  [
-				with_udev_rule_dir="$($PKG_CONFIG --variable=udev_dir udev)/rules.d"
+				with_udev_dir="$($PKG_CONFIG --variable=udev_dir udev)"
 			  ], [
-				with_udev_rule_dir=""
+				with_udev_dir=""
 			  ])
 			m4_pattern_allow([^PKG_(MAJOR|MINOR|BUILD|REVISION)$])
 		  ])
-		AC_MSG_CHECKING([for udev rule dir])
-		udev_rule_dir="${with_udev_rule_dir}"
-		AS_IF([test -n "${udev_rule_dir}"],
+		AC_MSG_CHECKING([for udev dir])
+		udev_dir="${with_udev_dir}"
+		AS_IF([test -n "${udev_dir}"],
 		  [
-			AC_MSG_RESULT(${udev_rule_dir})
+			AC_MSG_RESULT(${udev_dir})
 			have_udev="yes"
 		  ],
 		  [
@@ -115,5 +115,5 @@ AC_DEFUN([AC_CONFIG_UDEV_RULE_DIR],
 		have_udev="disabled"
 	  ])
 	AC_SUBST(have_udev)
-	AC_SUBST(udev_rule_dir)
+	AC_SUBST(udev_dir)
 ])



* [PATCH 17/21] xfs_scrubbed: create a background monitoring service
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (15 preceding siblings ...)
  2024-12-31 23:51   ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong
@ 2024-12-31 23:52   ` Darrick J. Wong
  2024-12-31 23:52   ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong
                     ` (3 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a systemd service and activate it automatically.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/Makefile                 |   18 +++++++
 scrub/xfs_scrubbed.in          |    9 +++
 scrub/xfs_scrubbed.rules       |    7 +++
 scrub/xfs_scrubbed@.service.in |  103 ++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrubbed_start       |   17 +++++++
 5 files changed, 153 insertions(+), 1 deletion(-)
 create mode 100644 scrub/xfs_scrubbed.rules
 create mode 100644 scrub/xfs_scrubbed@.service.in
 create mode 100755 scrub/xfs_scrubbed_start


diff --git a/scrub/Makefile b/scrub/Makefile
index 7d4fa0ddc09685..731810d7c7fd9a 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -29,8 +29,16 @@ SYSTEMD_SERVICES=\
 	xfs_scrub_all.service \
 	xfs_scrub_all_fail.service \
 	xfs_scrub_all.timer \
-	system-xfs_scrub.slice
+	system-xfs_scrub.slice \
+	xfs_scrubbed@.service
 OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
+
+ifeq ($(HAVE_UDEV),yes)
+	XFS_SCRUBBED_UDEV_RULES = xfs_scrubbed.rules
+	XFS_SCRUBBED_HELPER = xfs_scrubbed_start
+	INSTALL_SCRUB += install-udev-scrubbed
+	OPTIONAL_TARGETS += $(XFS_SCRUBBED_HELPER)
+endif
 endif
 ifeq ($(HAVE_CROND),yes)
 INSTALL_SCRUB += install-crond
@@ -185,6 +193,14 @@ install-udev: $(UDEV_RULES)
 		$(INSTALL) -m 644 $$i $(UDEV_RULE_DIR)/64-$$i; \
 	done
 
+install-udev-scrubbed: $(XFS_SCRUBBED_HELPER)
+	$(INSTALL) -m 755 -d $(UDEV_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUBBED_HELPER) $(UDEV_DIR)
+	$(INSTALL) -m 755 -d $(UDEV_RULE_DIR)
+	for i in $(XFS_SCRUBBED_UDEV_RULES); do \
+		$(INSTALL) -m 644 $$i $(UDEV_RULE_DIR)/64-$$i; \
+	done
+
 install-dev:
 
 -include .dep
diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index a4e073b3098f7a..9df6f45e53ad80 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -19,6 +19,7 @@ import gc
 from concurrent.futures import ProcessPoolExecutor
 import ctypes.util
 import collections
+import time
 
 try:
 	# Not all systems will have this json schema validation library,
@@ -994,6 +995,14 @@ def main():
 		pass
 
 	args.event_queue.shutdown()
+
+	# See the service mode comments in xfs_scrub.c for why we sleep and
+	# compress all nonzero exit codes to 1.
+	if 'SERVICE_MODE' in os.environ:
+		time.sleep(2)
+		if ret != 0:
+			ret = 1
+
 	return ret
 
 if __name__ == '__main__':
diff --git a/scrub/xfs_scrubbed.rules b/scrub/xfs_scrubbed.rules
new file mode 100644
index 00000000000000..c651126d5373a1
--- /dev/null
+++ b/scrub/xfs_scrubbed.rules
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (c) 2024-2025 Oracle.  All rights reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+#
+# Start autonomous self healing automatically
+ACTION=="add", SUBSYSTEM=="xfs", ENV{TYPE}=="mount", RUN+="xfs_scrubbed_start"
diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in
new file mode 100644
index 00000000000000..9656bdb3cd9a9d
--- /dev/null
+++ b/scrub/xfs_scrubbed@.service.in
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Self Healing of XFS Metadata for %f
+Documentation=man:xfs_scrubbed(8)
+
+# Explicitly require the capabilities that this program needs
+ConditionCapability=CAP_SYS_ADMIN
+ConditionCapability=CAP_DAC_OVERRIDE
+
+# Must be a mountpoint
+ConditionPathIsMountPoint=%f
+RequiresMountsFor=%f
+
+[Service]
+Type=exec
+Environment=SERVICE_MODE=1
+ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f
+SyslogIdentifier=%N
+
+# Run scrub with minimal CPU and IO priority so that nothing else will starve.
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+CPUAccounting=true
+Nice=19
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Dynamically create a user that isn't root
+DynamicUser=true
+
+# Make the entire filesystem readonly, but don't hide /home and don't use a
+# private bind mount like xfs_scrub.  We don't want to pin the filesystem,
+# because we want umount to work correctly and this service to stop
+# automatically.
+ProtectSystem=strict
+ProtectHome=no
+PrivateTmp=true
+PrivateDevices=true
+
+# Don't let scrub complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  scrub doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No memory pages that are both writable and executable
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrubbed needs these privileges to open the rootdir and monitor
+CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+AmbientCapabilities=CAP_SYS_ADMIN CAP_DAC_OVERRIDE
+NoNewPrivileges=true
+
+# xfs_scrubbed doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
+
+[Install]
+WantedBy=multi-user.target
+# If someone tries to enable the template itself, translate that into enabling
+# this service on the root directory at systemd startup time.  In the
+# initramfs, the udev rules in xfs_scrubbed.rules run before systemd starts.
+DefaultInstance=-
diff --git a/scrub/xfs_scrubbed_start b/scrub/xfs_scrubbed_start
new file mode 100755
index 00000000000000..82530cf7862717
--- /dev/null
+++ b/scrub/xfs_scrubbed_start
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+# Start the xfs_scrubbed service when the filesystem is mounted
+
+command -v systemctl || exit 0
+
+grep "^$SOURCE[[:space:]]" /proc/mounts | while read source mntpt therest; do
+	inst="$(systemd-escape --path "$mntpt")"
+	systemctl restart --no-block "xfs_scrubbed@$inst" && break
+done
+
+exit 0
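
The helper script above turns a mountpoint into a template instance name
with systemd-escape --path.  For readers who want to see what that
mapping looks like, here is a rough Python approximation, not a
replacement for the real tool.  escape_path() and instance_unit() are
hypothetical names, and the escaping rules are my reading of
systemd-escape(1), so treat corner cases with suspicion.

```python
import string

# Characters that may appear unescaped in the result.
_SAFE = set(string.ascii_letters + string.digits + ":_.")

def escape_path(path):
    '''Approximate "systemd-escape --path" for a mountpoint.'''
    # Drop empty and "." components; the root directory becomes "-".
    parts = [p for p in path.split("/") if p and p != "."]
    if not parts:
        return "-"
    out = []
    for i, byte in enumerate("/".join(parts).encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif ch not in _SAFE or (i == 0 and ch == "."):
            # Everything else, including "-" itself, is hex-escaped.
            out.append(f"\\x{byte:02x}")
        else:
            out.append(ch)
    return "".join(out)

def instance_unit(mountpoint):
    '''Name of the xfs_scrubbed template instance for a mountpoint.'''
    return f"xfs_scrubbed@{escape_path(mountpoint)}"
```

This also shows why DefaultInstance=- in the unit file refers to the
root filesystem: "/" escapes to a single dash.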



* [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (16 preceding siblings ...)
  2024-12-31 23:52   ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong
@ 2024-12-31 23:52   ` Darrick J. Wong
  2024-12-31 23:52   ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong
                     ` (2 subsequent siblings)
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use ExecCondition= in the system service to check if kernel support for
the health monitor is available.  If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.
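
For context, systemd.service(5) documents three outcomes for an
ExecCondition= command, which is why --check returning 1 makes the unit
get skipped quietly instead of being marked failed.  The sketch below
just encodes that table; exec_condition_outcome() is a hypothetical name
and systemd does not expose any such function.

```python
def exec_condition_outcome(status):
    '''Classify an ExecCondition= exit status per systemd.service(5).'''
    if status == 0:
        # Condition met: systemd goes on to run ExecStart=.
        return "start"
    if 1 <= status <= 254:
        # Condition not met: remaining commands are skipped and the
        # unit is not marked as failed.
        return "skip"
    # Exit status 255 (or an abnormal exit) marks the unit as failed.
    return "fail"
```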

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in          |   39 ++++++++++++++++++++++++++++++++++++++-
 scrub/xfs_scrubbed@.service.in |    1 +
 2 files changed, 39 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 9df6f45e53ad80..90602481f64c88 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -791,6 +791,38 @@ def monitor(mountpoint, event_queue, **kwargs):
 
 	return 0
 
+def check_monitor(mountpoint):
+	'''Check if the kernel can send us health events for the given mountpoint.'''
+	global log
+	global printf_prefix
+	global everything
+	global want_repair
+	global has_parent
+
+	try:
+		fd = os.open(mountpoint, os.O_RDONLY)
+	except OSError as e:
+		# Can't open mountpoint; monitor not available.
+		print(f"{mountpoint}: {e}", file = sys.stderr)
+		return 1
+
+	try:
+		mon_fd = open_health_monitor(fd, verbose = everything)
+	except OSError as e:
+		# Error opening monitor (or it's simply not there); monitor
+		# not available.
+		if e.errno == errno.ENOTTY or e.errno == errno.EOPNOTSUPP:
+			print(f"{mountpoint}: XFS health monitoring not supported.",
+					file = sys.stderr)
+		return 1
+	finally:
+		# Always close the mountpoint fd, whether or not the probe
+		# worked; it was only needed to open the health monitor.
+		os.close(fd)
+
+	# Monitor available; success!
+	return 0
+
 def __scrub_type(code):
 	'''Convert a "structures" json list to a scrub type code.'''
 	SCRUB_TYPES = {
@@ -923,6 +955,8 @@ def main():
 
 	parser = argparse.ArgumentParser( \
 			description = "XFS filesystem health monitoring demon.")
+	parser.add_argument("--check", help = "Check presence of health monitor.", \
+			action = "store_true")
 	parser.add_argument("--debug", help = "Enabling debugging messages.", \
 			action = "store_true")
 	parser.add_argument("--log", help = "Log health events to stdout.", \
@@ -989,7 +1023,10 @@ def main():
 	printf_prefix = args.mountpoint
 	ret = 0
 	try:
-		ret = monitor(**vars(args))
+		if args.check:
+			ret = check_monitor(args.mountpoint)
+		else:
+			ret = monitor(**vars(args))
 	except KeyboardInterrupt:
 		# Consider SIGINT to be a clean exit.
 		pass
diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in
index 9656bdb3cd9a9d..afd5c204327946 100644
--- a/scrub/xfs_scrubbed@.service.in
+++ b/scrub/xfs_scrubbed@.service.in
@@ -18,6 +18,7 @@ RequiresMountsFor=%f
 [Service]
 Type=exec
 Environment=SERVICE_MODE=1
+ExecCondition=@pkg_libexec_dir@/xfs_scrubbed --check %f
 ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f
 SyslogIdentifier=%N
 



* [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (17 preceding siblings ...)
  2024-12-31 23:52   ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong
@ 2024-12-31 23:52   ` Darrick J. Wong
  2024-12-31 23:52   ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make the xfs_scrubbed background service query the autofsck filesystem
property to figure out which operating mode it should use.
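
As a rough illustration of the mode selection, the decision can be
written as a pure function.  This is only a sketch: mode_from_autofsck
is a hypothetical name that does not appear in the patch, and the
return values mirror the three want_repair states used below (True to
repair, False to log only, None to not run at all).

```python
def mode_from_autofsck(advice, has_rmapbt=False, has_parent=False):
    """Map an autofsck property value to a want_repair-style tri-state.

    advice is the property string, or None if it could not be read.
    """
    if advice == "repair":
        return True
    if advice in ("check", "optimize"):
        return False
    if advice == "none":
        return None
    # Missing or unrecognized value: log only if backref metadata
    # (rmapbt or parent pointers) exist, otherwise do not run.
    if has_rmapbt or has_parent:
        return False
    return None
```

For example, mode_from_autofsck("bogus", has_rmapbt=True) yields False
(log only), while the same call without backref metadata yields None.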

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in |   62 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 90602481f64c88..2b34603cb361e2 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -573,6 +573,21 @@ def fgetpath(fd, fh = None, mountpoint = None):
 			break
 	return ret
 
+# Filesystem properties
+
+FSPROP_NAMESPACE = "trusted."
+FSPROP_NAME_PREFIX = "xfs:"
+FSPROP_AUTOFSCK_NAME = "autofsck"
+
+def fsprop_attrname(n):
+	'''Construct the xattr name for a filesystem property.'''
+	return f"{FSPROP_NAMESPACE}{FSPROP_NAME_PREFIX}{n}"
+
+def fsprop_getstr(fd, n):
+	'''Return the value of a filesystem property as a string.'''
+	attrname = fsprop_attrname(n)
+	return os.getxattr(fd, attrname).decode('utf-8')
+
 # main program
 
 def health_reports(mon_fp, fh):
@@ -731,6 +746,31 @@ def handle_event(e):
 	elif want_repair and event['type'] == 'sick':
 		repair_queue.submit(repair_metadata, event, fh)
 
+def want_repair_from_autofsck(fd):
+	'''Determine want_repair from the autofsck filesystem property.'''
+	global has_parent
+	global has_rmapbt
+
+	try:
+		advice = fsprop_getstr(fd, FSPROP_AUTOFSCK_NAME)
+		if advice == "repair":
+			return True
+		if advice == "check" or advice == "optimize":
+			return False
+		if advice == "none":
+			return None
+	except Exception:
+		# Any OS error (including ENODATA) or string parsing error is
+		# treated the same as an unrecognized value.
+		pass
+
+	# For an unrecognized value, log but do not fix runtime corruption if
+	# backref metadata are enabled.  If no backref metadata are available,
+	# the fs is too old so don't run at all.
+	if has_rmapbt or has_parent:
+		return False
+	return None
+
 def monitor(mountpoint, event_queue, **kwargs):
 	'''Monitor the given mountpoint for health events.'''
 	global everything
@@ -749,6 +789,20 @@ def monitor(mountpoint, event_queue, **kwargs):
 		# Don't care if we can't detect parent pointers or rmap
 		print(f'{printf_prefix}: detecting fs features: {e}', file = sys.stderr)
 
+	# Does the sysadmin have any advice for us about whether or not to
+	# background scrub?
+	if want_repair is None:
+		want_repair = want_repair_from_autofsck(fd)
+		if want_repair is None:
+			print(f"{mountpoint}: Disabling daemon per autofsck directive.")
+			os.close(fd)
+			return 0
+		elif want_repair:
+			print(f"{mountpoint}: Automatically repairing per autofsck directive.")
+		else:
+			print(f"{mountpoint}: Only logging errors per autofsck directive.")
+
+
 	# Check for the backref metadata that makes repair effective.
 	if want_repair:
 		if not has_rmapbt:
@@ -963,7 +1017,11 @@ def main():
 			action = "store_true")
 	parser.add_argument("--everything", help = "Capture all events.", \
 			action = "store_true")
-	parser.add_argument("--repair", help = "Automatically repair corrupt metadata.", \
+	action_group = parser.add_mutually_exclusive_group()
+	action_group.add_argument("--repair", \
+			help = "Automatically repair corrupt metadata.", \
+			action = "store_true")
+	action_group.add_argument("--autofsck", help = argparse.SUPPRESS, \
 			action = "store_true")
 	parser.add_argument("-V", help = "Report version and exit.", \
 			action = "store_true")
@@ -1004,6 +1062,8 @@ def main():
 		everything = True
 	if args.debug_fast:
 		debug_fast = True
+	if args.autofsck:
+		want_repair = None
 	if args.repair:
 		want_repair = True
 



* [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (18 preceding siblings ...)
  2024-12-31 23:52   ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong
@ 2024-12-31 23:52   ` Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the media scan finds that media have been lost, report this to the
kernel so that the healthmon code can pass that along to xfs_scrubbed.
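
The device classification that the new helper performs can be sketched
as follows.  This is an illustration only: the MediaErrorDev values are
made-up stand-ins for the XFS_MEDIA_ERROR_* flags, whose real values
are not shown in this patch, and media_error_flags is a hypothetical
name.

```python
from enum import Flag, auto


class MediaErrorDev(Flag):
    """Stand-ins for the XFS_MEDIA_ERROR_* device flags (values invented)."""
    NONE = 0
    DATADEV = auto()
    RTDEV = auto()
    LOGDEV = auto()


def media_error_flags(dev, datadev, rtdev=None, logdev=None):
    """Pick the flag for the device that reported the media error.

    The comparison order (data, realtime, log) follows the patch; a
    device that matches none of them gets no flag at all.
    """
    if dev == datadev:
        return MediaErrorDev.DATADEV
    if dev == rtdev:
        return MediaErrorDev.RTDEV
    if dev == logdev:
        return MediaErrorDev.LOGDEV
    return MediaErrorDev.NONE
```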

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase6.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)


diff --git a/scrub/phase6.c b/scrub/phase6.c
index 5a1f29738680e5..b5f6f3c1d4bc63 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -671,6 +671,29 @@ clean_pool(
 	return ret;
 }
 
+static void
+report_ioerr_to_kernel(
+	struct scrub_ctx		*ctx,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length)
+{
+	struct xfs_media_error		me = {
+		.daddr			= start,
+		.bbcount		= length,
+	};
+	dev_t				dev = disk_to_dev(ctx, disk);
+
+	if (dev == ctx->fsinfo.fs_datadev)
+		me.flags |= XFS_MEDIA_ERROR_DATADEV;
+	else if (dev == ctx->fsinfo.fs_rtdev)
+		me.flags |= XFS_MEDIA_ERROR_RTDEV;
+	else if (dev == ctx->fsinfo.fs_logdev)
+		me.flags |= XFS_MEDIA_ERROR_LOGDEV;
+
+	ioctl(ctx->mnt.fd, XFS_IOC_MEDIA_ERROR, &me);
+}
+
 /* Remember a media error for later. */
 static void
 remember_ioerr(
@@ -695,6 +718,8 @@ remember_ioerr(
 		return;
 	}
 
+	report_ioerr_to_kernel(ctx, disk, start, length);
+
 	tree = bitmap_for_disk(ctx, disk, vs);
 	if (!tree) {
 		str_liberror(ctx, ENOENT, _("finding bad block bitmap"));



* [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (19 preceding siblings ...)
  2024-12-31 23:52   ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong
@ 2024-12-31 23:53   ` Darrick J. Wong
  20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:53 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we're finished building autonomous repair, enable the service
on the root filesystem by default.  The root filesystem is mounted by
the initrd prior to starting systemd, which is why the udev rule cannot
autostart the service for the root filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible.  Use a fugly shim for this.
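
The conditions under which the shim acts can be summarized in a short
sketch.  The function names are hypothetical; systemd_running stands
for the "-d /run/systemd/system" test and dpkg_root for the DPKG_ROOT
environment variable used in the maintainer scripts below.

```python
# Maintainer-script actions on which postinst enables the service.
ENABLE_ACTIONS = {"configure", "abort-upgrade",
                  "abort-deconfigure", "abort-remove"}


def should_enable(action, dpkg_root="", systemd_running=True):
    """postinst: enable only on a live system that is running systemd."""
    return bool(not dpkg_root and systemd_running
                and action in ENABLE_ACTIONS)


def should_disable(action, dpkg_root="", systemd_running=True):
    """prerm: disable only when the package is actually being removed."""
    return bool(not dpkg_root and systemd_running and action == "remove")
```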

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 debian/control                 |    2 +-
 debian/postinst                |    8 ++++++++
 debian/prerm                   |   13 +++++++++++++
 scrub/xfs_scrubbed@.service.in |    2 +-
 4 files changed, 23 insertions(+), 2 deletions(-)
 create mode 100644 debian/prerm


diff --git a/debian/control b/debian/control
index 66b0a47a36ee24..31ea1e988f66be 100644
--- a/debian/control
+++ b/debian/control
@@ -10,7 +10,7 @@ Homepage: https://xfs.wiki.kernel.org/
 Package: xfsprogs
 Depends: ${shlibs:Depends}, ${misc:Depends}, python3-dbus, python3:any
 Provides: fsck-backend
-Suggests: xfsdump, acl, attr, quota
+Suggests: xfsdump, acl, attr, quota, python3-jsonschema
 Breaks: xfsdump (<< 3.0.0)
 Replaces: xfsdump (<< 3.0.0)
 Architecture: linux-any
diff --git a/debian/postinst b/debian/postinst
index 2ad9174658ceb4..4ba2e0c43b887e 100644
--- a/debian/postinst
+++ b/debian/postinst
@@ -24,5 +24,13 @@ case "${1}" in
 esac
 
 #DEBHELPER#
+#
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll install it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ -d /run/systemd/system ] ; then
+	if [ "$1" = "configure" ] || [ "$1" = "abort-upgrade" ] || [ "$1" = "abort-deconfigure" ] || [ "$1" = "abort-remove" ] ; then
+		/bin/systemctl enable xfs_scrubbed@.service || true
+	fi
+fi
 
 exit 0
diff --git a/debian/prerm b/debian/prerm
new file mode 100644
index 00000000000000..48e8e94c4fe9ac
--- /dev/null
+++ b/debian/prerm
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+set -e
+
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll disable it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ "$1" = remove ] && [ -d /run/systemd/system ] ; then
+	/bin/systemctl disable xfs_scrubbed@.service || true
+fi
+
+#DEBHELPER#
+
+exit 0
diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in
index afd5c204327946..5bf1e79031af8c 100644
--- a/scrub/xfs_scrubbed@.service.in
+++ b/scrub/xfs_scrubbed@.service.in
@@ -19,7 +19,7 @@ RequiresMountsFor=%f
 Type=exec
 Environment=SERVICE_MODE=1
 ExecCondition=@pkg_libexec_dir@/xfs_scrubbed --check %f
-ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f
+ExecStart=@pkg_libexec_dir@/xfs_scrubbed --autofsck --log %f
 SyslogIdentifier=%N
 
 # Run scrub with minimal CPU and IO priority so that nothing else will starve.



* [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (8 preceding siblings ...)
  2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
                     ` (9 more replies)
  2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
                   ` (5 subsequent siblings)
  15 siblings, 10 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This series enables xfs_repair to add select features to existing V5
filesystems.  Specifically, one can add free inode btrees, reflink
support, reverse mapping, parent pointers, and metadata directories
(including realtime groups and realtime rmap and reflink).
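
For illustration, here is a small sketch of how such an upgrade command
line could be assembled.  The "-c name=1" syntax follows the option
parsing in the patches below; the helper name, the device path, and the
use of one -c per feature are assumptions made for the example.

```python
def upgrade_argv(device, features):
    """Build an xfs_repair argument vector requesting feature upgrades.

    Each feature is passed as "-c <name>=1"; the patches reject any
    value other than 1 because only upgrades are supported.
    """
    argv = ["xfs_repair"]
    for name in features:
        argv += ["-c", f"{name}=1"]
    argv.append(device)
    return argv


# /dev/sdX is a placeholder, not a real device.
print(" ".join(upgrade_argv("/dev/sdX", ["finobt", "reflink"])))
```

The sketch only builds the argument list; it deliberately does not run
anything.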

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-newer-features

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=upgrade-newer-features
---
Commits in this patchset:
 * xfs_repair: allow sysadmins to add free inode btree indexes
 * xfs_repair: allow sysadmins to add reflink
 * xfs_repair: allow sysadmins to add reverse mapping indexes
 * xfs_repair: upgrade an existing filesystem to have parent pointers
 * xfs_repair: allow sysadmins to add metadata directories
 * xfs_repair: upgrade filesystems to support rtgroups when adding metadir
 * xfs_repair: allow sysadmins to add realtime reverse mapping indexes
 * xfs_repair: allow sysadmins to add realtime reflink
 * xfs_repair: skip free space checks when upgrading
 * xfs_repair: allow adding rmapbt to reflink filesystems
---
 libxfs/libxfs_api_defs.h |    1 
 man/man8/xfs_admin.8     |   37 +++++
 repair/dino_chunks.c     |    6 +
 repair/dinode.c          |    5 +
 repair/globals.c         |    7 +
 repair/globals.h         |    7 +
 repair/phase2.c          |  341 +++++++++++++++++++++++++++++++++++++++++++++-
 repair/phase4.c          |    5 +
 repair/pptr.c            |   15 ++
 repair/protos.h          |    6 +
 repair/rmap.c            |   12 +-
 repair/xfs_repair.c      |   77 ++++++++++
 12 files changed, 505 insertions(+), 14 deletions(-)



* [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
@ 2024-12-31 23:53   ` Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 02/10] xfs_repair: allow sysadmins to add reflink Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:53 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the free inode btree.
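
The argument handling added for "-c finobt" can be sketched like this.
It is an approximation, not a port: parse_upgrade_opt is a hypothetical
name, and int(val, 0) is stricter than strtol(val, NULL, 0) because it
rejects trailing garbage such as "1abc", which strtol would accept as 1.

```python
KNOWN_UPGRADES = ("finobt",)


def parse_upgrade_opt(arg, known=KNOWN_UPGRADES):
    """Parse one "-c name=value" suboption and return the feature name."""
    name, sep, val = arg.partition("=")
    if name not in known:
        raise ValueError(f"unknown option -c {arg}")
    if not sep:
        raise ValueError(f"-c {name} requires a parameter")
    try:
        n = int(val, 0)
    except ValueError:
        # Approximate strtol(), which returns 0 for unparseable input.
        n = 0
    if n != 1:
        raise ValueError(f"-c {name} only supports upgrades")
    return name
```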

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    7 +++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   28 +++++++++++++++++++++++++++-
 repair/xfs_repair.c  |   11 +++++++++++
 5 files changed, 47 insertions(+), 1 deletion(-)


diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 63f8ee90307b30..e07fc3ddb3fb82 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -163,6 +163,13 @@ .SH OPTIONS
 extended attributes, symbolic links, and realtime free space metadata.
 The filesystem cannot be downgraded after this feature is enabled.
 Once enabled, the filesystem will not be mountable by older kernels.
+.TP 0.4i
+.B finobt
+Track free inodes through a separate free inode btree index to speed up inode
+allocation on old filesystems.
+This upgrade can fail if any AG has less than 1% free space remaining.
+The filesystem cannot be downgraded after this feature is enabled.
+This feature was added to Linux 3.16.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index 143b4a8beb53f4..f13497c3121d6b 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -53,6 +53,7 @@ bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
 bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
+bool	add_finobt;		/* add free inode btrees */
 
 /* misc status variables */
 
diff --git a/repair/globals.h b/repair/globals.h
index 8bb9bbaeca4fb0..c5b27d9a60cf2e 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -94,6 +94,7 @@ extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
 extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
+extern bool	add_finobt;		/* add free inode btrees */
 
 /* misc status variables */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index 71576f5806e473..1bb7cd19025be7 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -123,7 +123,7 @@ set_inobtcount(
 		exit(0);
 	}
 
-	if (!xfs_has_finobt(mp)) {
+	if (!xfs_has_finobt(mp) && !add_finobt) {
 		printf(
 	_("Inode btree count feature requires free inode btree.\n"));
 		exit(0);
@@ -212,6 +212,28 @@ set_exchrange(
 	return true;
 }
 
+static bool
+set_finobt(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_finobt(mp)) {
+		printf(_("Filesystem already supports free inode btrees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Free inode btree feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding free inode btrees to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_FINOBT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -378,6 +400,8 @@ need_check_fs_free_space(
 	struct xfs_mount		*mp,
 	const struct check_state	*old)
 {
+	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
+		return true;
 	return false;
 }
 
@@ -455,6 +479,8 @@ upgrade_filesystem(
 		dirty |= set_nrext64(mp, &new_sb);
 	if (add_exchrange)
 		dirty |= set_exchrange(mp, &new_sb);
+	if (add_finobt)
+		dirty |= set_finobt(mp, &new_sb);
 	if (!dirty)
 		return;
 
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 7bf75c09b94542..d8f92b52b66f3a 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -71,6 +71,7 @@ enum c_opt_nums {
 	CONVERT_BIGTIME,
 	CONVERT_NREXT64,
 	CONVERT_EXCHRANGE,
+	CONVERT_FINOBT,
 	C_MAX_OPTS,
 };
 
@@ -80,6 +81,7 @@ static char *c_opts[] = {
 	[CONVERT_BIGTIME]	= "bigtime",
 	[CONVERT_NREXT64]	= "nrext64",
 	[CONVERT_EXCHRANGE]	= "exchange",
+	[CONVERT_FINOBT]	= "finobt",
 	[C_MAX_OPTS]		= NULL,
 };
 
@@ -372,6 +374,15 @@ process_args(int argc, char **argv)
 		_("-c exchange only supports upgrades\n"));
 					add_exchrange = true;
 					break;
+				case CONVERT_FINOBT:
+					if (!val)
+						do_abort(
+		_("-c finobt requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c finobt only supports upgrades\n"));
+					add_finobt = true;
+					break;
 				default:
 					unknown('c', val);
 					break;



* [PATCH 02/10] xfs_repair: allow sysadmins to add reflink
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
@ 2024-12-31 23:53   ` Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:53 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reference count btree, and therefore reflink.
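
One detail of the repair/rmap.c changes is that an upgrade turns on the
collection of refcount observations but turns off the comparison
against the on-disk btree, presumably because a filesystem that is
being upgraded has no refcount btree to compare against yet.  A sketch
of the two predicates, with the C globals passed as plain booleans:

```python
def rmap_needs_work(has_reflink, add_reflink, has_rmapbt=False):
    """Collect rmap/refcount observations if the fs has, or will have,
    reflink (or already has rmapbt)."""
    return has_reflink or add_reflink or has_rmapbt


def should_check_refcounts(has_reflink, add_reflink):
    """Verify the on-disk refcount btree only if it already existed."""
    return has_reflink and not add_reflink
```

So during "-c reflink=1" the data is gathered (rmap_needs_work is true)
while the verification pass is skipped.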

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    6 ++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   33 ++++++++++++++++++++++++++++++++-
 repair/rmap.c        |    6 +++---
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 54 insertions(+), 4 deletions(-)


diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index e07fc3ddb3fb82..3a9175c9f018e5 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -170,6 +170,12 @@ .SH OPTIONS
 This upgrade can fail if any AG has less than 1% free space remaining.
 The filesystem cannot be downgraded after this feature is enabled.
 This feature was added to Linux 3.16.
+.TP 0.4i
+.B reflink
+Enable sharing of file data blocks.
+This upgrade can fail if any AG has less than 2% free space remaining.
+The filesystem cannot be downgraded after this feature is enabled.
+This feature was added to Linux 4.9.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index f13497c3121d6b..cf4421e34dec84 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -54,6 +54,7 @@ bool	add_bigtime;		/* add support for timestamps up to 2486 */
 bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
+bool	add_reflink;		/* add reference count btrees */
 
 /* misc status variables */
 
diff --git a/repair/globals.h b/repair/globals.h
index c5b27d9a60cf2e..efbb8db79bc080 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -95,6 +95,7 @@ extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
 extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
+extern bool	add_reflink;		/* add reference count btrees */
 
 /* misc status variables */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index 1bb7cd19025be7..9cd841f8d05fc6 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -200,7 +200,7 @@ set_exchrange(
 		exit(0);
 	}
 
-	if (!xfs_has_reflink(mp)) {
+	if (!xfs_has_reflink(mp) && !add_reflink) {
 		printf(
 	_("File exchange-range feature cannot be added without reflink.\n"));
 		exit(0);
@@ -234,6 +234,33 @@ set_finobt(
 	return true;
 }
 
+static bool
+set_reflink(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_reflink(mp)) {
+		printf(_("Filesystem already supports reflink.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Reflink feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_realtime(mp)) {
+		printf(_("Reflink feature not supported with realtime.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding reflink support to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_REFLINK;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -402,6 +429,8 @@ need_check_fs_free_space(
 {
 	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
 		return true;
+	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
+		return true;
 	return false;
 }
 
@@ -481,6 +510,8 @@ upgrade_filesystem(
 		dirty |= set_exchrange(mp, &new_sb);
 	if (add_finobt)
 		dirty |= set_finobt(mp, &new_sb);
+	if (add_reflink)
+		dirty |= set_reflink(mp, &new_sb);
 	if (!dirty)
 		return;
 
diff --git a/repair/rmap.c b/repair/rmap.c
index 97510dd875911a..91f864351f6013 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -68,7 +68,7 @@ bool
 rmap_needs_work(
 	struct xfs_mount	*mp)
 {
-	return xfs_has_reflink(mp) ||
+	return xfs_has_reflink(mp) || add_reflink ||
 	       xfs_has_rmapbt(mp);
 }
 
@@ -1800,7 +1800,7 @@ check_refcounts(
 	struct xfs_perag		*pag = NULL;
 	int				error;
 
-	if (!xfs_has_reflink(mp))
+	if (!xfs_has_reflink(mp) || add_reflink)
 		return;
 	if (refcbt_suspect) {
 		if (no_modify && agno == 0)
@@ -1859,7 +1859,7 @@ check_rtrefcounts(
 	struct xfs_inode		*ip = NULL;
 	int				error;
 
-	if (!xfs_has_reflink(mp))
+	if (!xfs_has_reflink(mp) || add_reflink)
 		return;
 	if (refcbt_suspect) {
 		if (no_modify && rgno == 0)
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index d8f92b52b66f3a..e436dc2ef736d6 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -72,6 +72,7 @@ enum c_opt_nums {
 	CONVERT_NREXT64,
 	CONVERT_EXCHRANGE,
 	CONVERT_FINOBT,
+	CONVERT_REFLINK,
 	C_MAX_OPTS,
 };
 
@@ -82,6 +83,7 @@ static char *c_opts[] = {
 	[CONVERT_NREXT64]	= "nrext64",
 	[CONVERT_EXCHRANGE]	= "exchange",
 	[CONVERT_FINOBT]	= "finobt",
+	[CONVERT_REFLINK]	= "reflink",
 	[C_MAX_OPTS]		= NULL,
 };
 
@@ -383,6 +385,15 @@ process_args(int argc, char **argv)
 		_("-c finobt only supports upgrades\n"));
 					add_finobt = true;
 					break;
+				case CONVERT_REFLINK:
+					if (!val)
+						do_abort(
+		_("-c reflink requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c reflink only supports upgrades\n"));
+					add_reflink = true;
+					break;
 				default:
 					unknown('c', val);
 					break;



* [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
  2024-12-31 23:53   ` [PATCH 02/10] xfs_repair: allow sysadmins to add reflink Darrick J. Wong
@ 2024-12-31 23:53   ` Darrick J. Wong
  2024-12-31 23:54   ` [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:53 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index.  This is needed for online
fsck.
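
The preconditions that set_rmapbt() checks before flipping the feature
bit can be sketched as an ordered list of blockers.  The function name
is hypothetical and the xfs_has_*() tests are passed in as booleans;
the messages are the ones printed by the patch.

```python
def rmapbt_upgrade_blocker(has_rmapbt, has_crc, has_realtime,
                           has_reflink, add_reflink=False):
    """Return the reason rmapbt cannot be added, or None if it can."""
    if has_rmapbt:
        return "Filesystem already supports reverse mapping btrees."
    if not has_crc:
        return ("Reverse mapping btree feature only supported on V5 "
                "filesystems.")
    if has_realtime:
        return "Reverse mapping btree feature not supported with realtime."
    if has_reflink and not add_reflink:
        return ("Reverse mapping btrees cannot be added when reflink is "
                "enabled.")
    return None
```

The order matters only for which message is reported when several
conditions hold at once.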

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   38 ++++++++++++++++++++++++++++++++++++++
 repair/rmap.c        |    6 +++---
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 62 insertions(+), 3 deletions(-)


diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 3a9175c9f018e5..74a400dcfeb557 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -176,6 +176,14 @@ .SH OPTIONS
 This upgrade can fail if any AG has less than 2% free space remaining.
 The filesystem cannot be downgraded after this feature is enabled.
 This feature was added to Linux 4.9.
+.TP 0.4i
+.B rmapbt
+Store an index of the owners of on-disk blocks.
+This enables much stronger cross-referencing of various metadata structures
+and online repairs to space usage metadata.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if any AG has less than 5% free space remaining.
+This feature was added to Linux 4.8.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index cf4421e34dec84..dd7c422bb922e4 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -55,6 +55,7 @@ bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
+bool	add_rmapbt;		/* add reverse mapping btrees */
 
 /* misc status variables */
 
diff --git a/repair/globals.h b/repair/globals.h
index efbb8db79bc080..d8c2aae23d8f0a 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -96,6 +96,7 @@ extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
+extern bool	add_rmapbt;		/* add reverse mapping btrees */
 
 /* misc status variables */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index 9cd841f8d05fc6..9dd37e7fc5c111 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -261,6 +261,40 @@ set_reflink(
 	return true;
 }
 
+static bool
+set_rmapbt(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_rmapbt(mp)) {
+		printf(_("Filesystem already supports reverse mapping btrees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Reverse mapping btree feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_realtime(mp)) {
+		printf(
+	_("Reverse mapping btree feature not supported with realtime.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_reflink(mp) && !add_reflink) {
+		printf(
+	_("Reverse mapping btrees cannot be added when reflink is enabled.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding reverse mapping btrees to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_RMAPBT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -431,6 +465,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
 		return true;
+	if (xfs_has_rmapbt(mp) && !(old->features & XFS_FEAT_RMAPBT))
+		return true;
 	return false;
 }
 
@@ -512,6 +548,8 @@ upgrade_filesystem(
 		dirty |= set_finobt(mp, &new_sb);
 	if (add_reflink)
 		dirty |= set_reflink(mp, &new_sb);
+	if (add_rmapbt)
+		dirty |= set_rmapbt(mp, &new_sb);
 	if (!dirty)
 		return;
 
diff --git a/repair/rmap.c b/repair/rmap.c
index 91f864351f6013..f1f837d33ea4f4 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -69,7 +69,7 @@ rmap_needs_work(
 	struct xfs_mount	*mp)
 {
 	return xfs_has_reflink(mp) || add_reflink ||
-	       xfs_has_rmapbt(mp);
+	       xfs_has_rmapbt(mp) || add_rmapbt;
 }
 
 static inline bool rmaps_has_observations(const struct xfs_ag_rmap *ag_rmap)
@@ -1339,7 +1339,7 @@ rmaps_verify_btree(
 	struct xfs_perag	*pag = NULL;
 	int			error;
 
-	if (!xfs_has_rmapbt(mp))
+	if (!xfs_has_rmapbt(mp) || add_rmapbt)
 		return;
 	if (rmapbt_suspect) {
 		if (no_modify && agno == 0)
@@ -1398,7 +1398,7 @@ rtrmaps_verify_btree(
 	struct xfs_inode	*ip = NULL;
 	int			error;
 
-	if (!xfs_has_rmapbt(mp))
+	if (!xfs_has_rmapbt(mp) || add_rmapbt)
 		return;
 	if (rmapbt_suspect) {
 		if (no_modify && rgno == 0)
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index e436dc2ef736d6..ca72c65f9d772a 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -73,6 +73,7 @@ enum c_opt_nums {
 	CONVERT_EXCHRANGE,
 	CONVERT_FINOBT,
 	CONVERT_REFLINK,
+	CONVERT_RMAPBT,
 	C_MAX_OPTS,
 };
 
@@ -84,6 +85,7 @@ static char *c_opts[] = {
 	[CONVERT_EXCHRANGE]	= "exchange",
 	[CONVERT_FINOBT]	= "finobt",
 	[CONVERT_REFLINK]	= "reflink",
+	[CONVERT_RMAPBT]	= "rmapbt",
 	[C_MAX_OPTS]		= NULL,
 };
 
@@ -394,6 +396,15 @@ process_args(int argc, char **argv)
 		_("-c reflink only supports upgrades\n"));
 					add_reflink = true;
 					break;
+				case CONVERT_RMAPBT:
+					if (!val)
+						do_abort(
+		_("-c rmapbt requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c rmapbt only supports upgrades\n"));
+					add_rmapbt = true;
+					break;
 				default:
 					unknown('c', val);
 					break;



* [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:53   ` [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes Darrick J. Wong
@ 2024-12-31 23:54   ` Darrick J. Wong
  2024-12-31 23:54   ` [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:54 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Upgrade an existing filesystem to have parent pointers.
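
The free space requirements documented in the xfs_admin manpage hunks
of this series (1%, 2% and 5% per AG for finobt, reflink and rmapbt;
25% of the whole filesystem for parent pointers) can be expressed as a
quick pre-flight check.  This is a sketch with hypothetical names that
takes percentages as plain numbers; it does not query a filesystem.

```python
# Minimum free space, in percent, below which an upgrade can fail.
PER_AG_MIN_FREE_PCT = {"finobt": 1, "reflink": 2, "rmapbt": 5}
FS_MIN_FREE_PCT = {"parent": 25}


def upgrade_at_risk(feature, ag_free_pcts, fs_free_pct):
    """Return True if the documented free space threshold is not met.

    ag_free_pcts is a list of per-AG free percentages; fs_free_pct is
    the free percentage of the filesystem as a whole.
    """
    if feature in PER_AG_MIN_FREE_PCT:
        return min(ag_free_pcts) < PER_AG_MIN_FREE_PCT[feature]
    if feature in FS_MIN_FREE_PCT:
        return fs_free_pct < FS_MIN_FREE_PCT[feature]
    raise KeyError(feature)
```

Note that the per-AG thresholds depend on the fullest AG, so a mostly
empty filesystem can still be at risk if a single AG is nearly full.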

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   39 +++++++++++++++++++++++++++++++++++++++
 repair/pptr.c        |   15 ++++++++++++++-
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 74 insertions(+), 1 deletion(-)


diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 74a400dcfeb557..a25e599e5f8e2c 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -184,6 +184,14 @@ .SH OPTIONS
 The filesystem cannot be downgraded after this feature is enabled.
 This upgrade can fail if any AG has less than 5% free space remaining.
 This feature was added to Linux 4.8.
+.TP 0.4i
+.B parent
+Store in each child file a mirror pointing back to the parent directory.
+This enables much stronger cross-referencing and online repairs of the
+directory tree.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if the filesystem has less than 25% free space remaining.
+This feature is not upstream yet.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index dd7c422bb922e4..320fcf6cfd701e 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -56,6 +56,7 @@ bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
 bool	add_rmapbt;		/* add reverse mapping btrees */
+bool	add_parent;		/* add parent pointers */
 
 /* misc status variables */
 
diff --git a/repair/globals.h b/repair/globals.h
index d8c2aae23d8f0a..77d5d110048713 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -97,6 +97,7 @@ extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
 extern bool	add_rmapbt;		/* add reverse mapping btrees */
+extern bool	add_parent;		/* add parent pointers */
 
 /* misc status variables */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index 9dd37e7fc5c111..763cffdfe9d8d2 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -295,6 +295,28 @@ set_rmapbt(
 	return true;
 }
 
+static bool
+set_parent(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_parent(mp)) {
+		printf(_("Filesystem already supports parent pointers.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Parent pointer feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding parent pointers to filesystem.\n"));
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_PARENT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -435,6 +457,19 @@ check_fs_free_space(
 		libxfs_trans_cancel(tp);
 	}
 
+	/*
+	 * If we're adding parent pointers, we need at least 25% free since
+	 * scanning the entire filesystem to guesstimate the overhead is
+	 * prohibitively expensive.
+	 */
+	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT)) {
+		if (mp->m_sb.sb_fdblocks < mp->m_sb.sb_dblocks / 4) {
+			printf(
+ _("Filesystem does not have enough space to add parent pointers.\n"));
+			exit(1);
+		}
+	}
+
 	/*
 	 * Would the post-upgrade filesystem have enough free space on the data
 	 * device after making per-AG reservations?
@@ -467,6 +502,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_rmapbt(mp) && !(old->features & XFS_FEAT_RMAPBT))
 		return true;
+	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT))
+		return true;
 	return false;
 }
 
@@ -550,6 +587,8 @@ upgrade_filesystem(
 		dirty |= set_reflink(mp, &new_sb);
 	if (add_rmapbt)
 		dirty |= set_rmapbt(mp, &new_sb);
+	if (add_parent)
+		dirty |= set_parent(mp, &new_sb);
 	if (!dirty)
 		return;
 
diff --git a/repair/pptr.c b/repair/pptr.c
index ac0a9c618bc87d..a8156e55f1fdfc 100644
--- a/repair/pptr.c
+++ b/repair/pptr.c
@@ -793,7 +793,7 @@ add_missing_parent_ptr(
 				ag_pptr->namelen,
 				name);
 		return;
-	} else {
+	} else if (!add_parent) {
 		do_warn(
  _("adding missing ino %llu parent pointer (ino %llu gen 0x%x name '%.*s')\n"),
 				(unsigned long long)ip->i_ino,
@@ -801,6 +801,19 @@ add_missing_parent_ptr(
 				ag_pptr->parent_gen,
 				ag_pptr->namelen,
 				name);
+	} else {
+		static bool		warned = false;
+		static pthread_mutex_t	lock = PTHREAD_MUTEX_INITIALIZER;
+
+		if (!warned) {
+			pthread_mutex_lock(&lock);
+			if (!warned) {
+				do_warn(
+ _("setting parent pointers to upgrade filesystem\n"));
+				warned = true;
+			}
+			pthread_mutex_unlock(&lock);
+		}
 	}
 
 	error = add_file_pptr(ip, ag_pptr, name);
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index ca72c65f9d772a..189665a07d6892 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -74,6 +74,7 @@ enum c_opt_nums {
 	CONVERT_FINOBT,
 	CONVERT_REFLINK,
 	CONVERT_RMAPBT,
+	CONVERT_PARENT,
 	C_MAX_OPTS,
 };
 
@@ -86,6 +87,7 @@ static char *c_opts[] = {
 	[CONVERT_FINOBT]	= "finobt",
 	[CONVERT_REFLINK]	= "reflink",
 	[CONVERT_RMAPBT]	= "rmapbt",
+	[CONVERT_PARENT]	= "parent",
 	[C_MAX_OPTS]		= NULL,
 };
 
@@ -405,6 +407,15 @@ process_args(int argc, char **argv)
 		_("-c rmapbt only supports upgrades\n"));
 					add_rmapbt = true;
 					break;
+				case CONVERT_PARENT:
+					if (!val)
+						do_abort(
+		_("-c parent requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c parent only supports upgrades\n"));
+					add_parent = true;
+					break;
 				default:
 					unknown('c', val);
 					break;



* [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:54   ` [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers Darrick J. Wong
@ 2024-12-31 23:54   ` Darrick J. Wong
  2024-12-31 23:54   ` [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:54 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support metadata directories.  This will be needed to upgrade
filesystems to support realtime rmap and reflink.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++
 repair/dino_chunks.c |    6 ++++
 repair/dinode.c      |    5 +++-
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++
 repair/phase4.c      |    5 +++-
 repair/protos.h      |    6 ++++
 repair/xfs_repair.c  |   11 ++++++++
 9 files changed, 109 insertions(+), 3 deletions(-)
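
Reviewer note: a minimal standalone sketch of the "doomed inode" bookkeeping
that set_metadir() and wipe_pre_metadir_file() use.  All sketch_* names and
the uint64_t inode type are hypothetical; SKETCH_NULLFSINO stands in for
NULLFSINO.  The real code keeps five separate static variables rather than
an array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for NULLFSINO: a value that no real inode number takes. */
#define SKETCH_NULLFSINO	((uint64_t)-1)
#define SKETCH_NR_DOOMED	5

/* Old rbm, rsum, uquota, gquota and pquota inode numbers, in that order. */
uint64_t sketch_doomed[SKETCH_NR_DOOMED] = {
	SKETCH_NULLFSINO, SKETCH_NULLFSINO, SKETCH_NULLFSINO,
	SKETCH_NULLFSINO, SKETCH_NULLFSINO,
};

/* Remember the pre-upgrade superblock inode pointers. */
void
sketch_doom_inodes(
	uint64_t	rbmino,
	uint64_t	rsumino,
	uint64_t	uquotino,
	uint64_t	gquotino,
	uint64_t	pquotino)
{
	sketch_doomed[0] = rbmino;
	sketch_doomed[1] = rsumino;
	sketch_doomed[2] = uquotino;
	sketch_doomed[3] = gquotino;
	sketch_doomed[4] = pquotino;
}

/* Should the inode scan zap @ino instead of processing it normally? */
bool
sketch_wipe_pre_metadir_file(
	uint64_t	ino)
{
	size_t		i;

	for (i = 0; i < SKETCH_NR_DOOMED; i++)
		if (ino == sketch_doomed[i])
			return true;
	return false;
}
```

Until the upgrade records the old inode numbers, every slot holds the
sentinel, so no real inode matches and the scan behaves as before.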


diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index a25e599e5f8e2c..e55dee6070e460 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -191,6 +191,14 @@ .SH OPTIONS
 directory tree.
 The filesystem cannot be downgraded after this feature is enabled.
 This upgrade can fail if the filesystem has less than 25% free space remaining.
+.TP 0.4i
+.B metadir
+Create a directory tree of metadata inodes instead of storing them all in the
+superblock.
+This is required for reverse mapping btrees and reflink support on the realtime
+device.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if any AG has less than 5% free space remaining.
 This feature is not upstream yet.
 .RE
 .TP
diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index 250985ec264ead..120c490b1d8324 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -955,7 +955,11 @@ process_inode_chunk(
 		}
 
 		if (status)  {
-			if (mp->m_sb.sb_rootino == ino) {
+			if (wipe_pre_metadir_file(ino)) {
+				if (!ino_discovery)
+					do_warn(
+	_("wiping pre-metadir metadata inode %"PRIu64".\n"), ino);
+			} else if (mp->m_sb.sb_rootino == ino) {
 				need_root_inode = 1;
 
 				if (!no_modify)  {
diff --git a/repair/dinode.c b/repair/dinode.c
index 0c559c40808588..42c7e9fa5cc5e7 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -3068,6 +3068,9 @@ process_dinode_int(
 	ASSERT(uncertain == 0 || verify_mode != 0);
 	ASSERT(ino_bpp != NULL || verify_mode != 0);
 
+	if (wipe_pre_metadir_file(lino))
+		goto clear_bad_out;
+
 	/*
 	 * This is the only valid point to check the CRC; after this we may have
 	 * made changes which invalidate it, and the CRC is only updated again
@@ -3278,7 +3281,7 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
 		if (flags & XFS_DIFLAG_NEWRTBM) {
 			/* must be a rt bitmap inode */
 			if (lino != mp->m_sb.sb_rbmino) {
-				if (!uncertain) {
+				if (!uncertain && !add_metadir) {
 					do_warn(
 	_("inode %" PRIu64 " not rt bitmap\n"),
 						lino);
diff --git a/repair/globals.c b/repair/globals.c
index 320fcf6cfd701e..603fea73da1654 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -57,6 +57,7 @@ bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
 bool	add_rmapbt;		/* add reverse mapping btrees */
 bool	add_parent;		/* add parent pointers */
+bool	add_metadir;		/* add metadata directory tree */
 
 /* misc status variables */
 
diff --git a/repair/globals.h b/repair/globals.h
index 77d5d110048713..9211e5e2432c9a 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -98,6 +98,7 @@ extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
 extern bool	add_rmapbt;		/* add reverse mapping btrees */
 extern bool	add_parent;		/* add parent pointers */
+extern bool	add_metadir;		/* add metadata directory tree */
 
 /* misc status variables */
 
diff --git a/repair/phase2.c b/repair/phase2.c
index 763cffdfe9d8d2..35f4c19de0555c 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -317,6 +317,71 @@ set_parent(
 	return true;
 }
 
+static xfs_ino_t doomed_rbmino = NULLFSINO;
+static xfs_ino_t doomed_rsumino = NULLFSINO;
+static xfs_ino_t doomed_uquotino = NULLFSINO;
+static xfs_ino_t doomed_gquotino = NULLFSINO;
+static xfs_ino_t doomed_pquotino = NULLFSINO;
+
+bool
+wipe_pre_metadir_file(
+	xfs_ino_t	ino)
+{
+	if (ino == doomed_rbmino ||
+	    ino == doomed_rsumino ||
+	    ino == doomed_uquotino ||
+	    ino == doomed_gquotino ||
+	    ino == doomed_pquotino)
+		return true;
+	return false;
+}
+
+static bool
+set_metadir(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_metadir(mp)) {
+		printf(_("Filesystem already supports metadata directory trees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Metadata directory trees only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding metadata directory trees to filesystem.\n"));
+	new_sb->sb_features_incompat |= (XFS_SB_FEAT_INCOMPAT_METADIR |
+					 XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
+
+	/* Blow out all the old metadata inodes; we'll rebuild in phase6. */
+	new_sb->sb_metadirino = new_sb->sb_rootino + 1;
+	doomed_rbmino = mp->m_sb.sb_rbmino;
+	doomed_rsumino = mp->m_sb.sb_rsumino;
+	doomed_uquotino = mp->m_sb.sb_uquotino;
+	doomed_gquotino = mp->m_sb.sb_gquotino;
+	doomed_pquotino = mp->m_sb.sb_pquotino;
+
+	new_sb->sb_rbmino = new_sb->sb_metadirino + 1;
+	new_sb->sb_rsumino = new_sb->sb_rbmino + 1;
+	new_sb->sb_uquotino = NULLFSINO;
+	new_sb->sb_gquotino = NULLFSINO;
+	new_sb->sb_pquotino = NULLFSINO;
+
+	/* Indicate that we need a rebuild. */
+	need_metadir_inode = 1;
+	need_rbmino = 1;
+	need_rsumino = 1;
+	have_uquotino = 0;
+	have_gquotino = 0;
+	have_pquotino = 0;
+	quotacheck_skip();
+
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -504,6 +569,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT))
 		return true;
+	if (xfs_has_metadir(mp) && !(old->features & XFS_FEAT_METADIR))
+		return true;
 	return false;
 }
 
@@ -589,6 +656,8 @@ upgrade_filesystem(
 		dirty |= set_rmapbt(mp, &new_sb);
 	if (add_parent)
 		dirty |= set_parent(mp, &new_sb);
+	if (add_metadir)
+		dirty |= set_metadir(mp, &new_sb);
 	if (!dirty)
 		return;
 
diff --git a/repair/phase4.c b/repair/phase4.c
index b752b4c871ea83..6d3c7857c6c343 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -431,7 +431,10 @@ phase4(xfs_mount_t *mp)
 	if (xfs_has_metadir(mp) &&
 	    (is_inode_free(irec, 1) || !inode_isadir(irec, 1))) {
 		need_metadir_inode = true;
-		if (no_modify)
+		if (add_metadir)
+			do_warn(
+	_("metadata directory root inode needs to be initialized\n"));
+		else if (no_modify)
 			do_warn(
 	_("metadata directory root inode would be lost\n"));
 		else
diff --git a/repair/protos.h b/repair/protos.h
index e2f39f1d6e8aa3..ce171f3dd87cb6 100644
--- a/repair/protos.h
+++ b/repair/protos.h
@@ -3,6 +3,8 @@
  * Copyright (c) 2000-2001,2005 Silicon Graphics, Inc.
  * All Rights Reserved.
  */
+#ifndef __XFS_REPAIR_PROTOS_H__
+#define __XFS_REPAIR_PROTOS_H__
 
 void	xfs_init(struct libxfs_init *args);
 
@@ -45,3 +47,7 @@ void	phase7(struct xfs_mount *, int);
 int	verify_set_agheader(struct xfs_mount *, struct xfs_buf *,
 		struct xfs_sb *, struct xfs_agf *, struct xfs_agi *,
 		xfs_agnumber_t);
+
+bool wipe_pre_metadir_file(xfs_ino_t ino);
+
+#endif  /* __XFS_REPAIR_PROTOS_H__ */
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 189665a07d6892..d4101f7d2297d7 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -75,6 +75,7 @@ enum c_opt_nums {
 	CONVERT_REFLINK,
 	CONVERT_RMAPBT,
 	CONVERT_PARENT,
+	CONVERT_METADIR,
 	C_MAX_OPTS,
 };
 
@@ -88,6 +89,7 @@ static char *c_opts[] = {
 	[CONVERT_REFLINK]	= "reflink",
 	[CONVERT_RMAPBT]	= "rmapbt",
 	[CONVERT_PARENT]	= "parent",
+	[CONVERT_METADIR]	= "metadir",
 	[C_MAX_OPTS]		= NULL,
 };
 
@@ -416,6 +418,15 @@ process_args(int argc, char **argv)
 		_("-c parent only supports upgrades\n"));
 					add_parent = true;
 					break;
+				case CONVERT_METADIR:
+					if (!val)
+						do_abort(
+		_("-c metadir requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c metadir only supports upgrades\n"));
+					add_metadir = true;
+					break;
 				default:
 					unknown('c', val);
 					break;



* [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-12-31 23:54   ` [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories Darrick J. Wong
@ 2024-12-31 23:54   ` Darrick J. Wong
  2024-12-31 23:55   ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:54 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Amend the metadir upgrade code to initialize the rtgroups-related fields
in the superblock.  This means that we cannot add metadir to a
filesystem that already has a realtime section.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/phase2.c |   36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)
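
Reviewer note: a minimal standalone sketch of the group size arithmetic in
set_metadir(): take 1TB expressed in filesystem blocks and round it down to
a whole number of realtime extents.  The function name and parameters are
hypothetical; @blocklog stands in for sb_blocklog and @rextsize for
sb_rextsize, and the shift is assumed to match what XFS_B_TO_FSBT does.

```c
#include <stdint.h>

/*
 * Hypothetical helper: size of a 1TB realtime group in filesystem blocks,
 * rounded down to a multiple of the rt extent size.  @blocklog is log2 of
 * the filesystem block size; @rextsize is the rt extent size in blocks and
 * must be nonzero.
 */
uint64_t
sketch_rgsize_blocks(
	unsigned int	blocklog,
	uint32_t	rextsize)
{
	uint64_t	rgsize = (1ULL << 40) >> blocklog;	/* 1TB */

	return rgsize - (rgsize % rextsize);
}
```

With 4k blocks (blocklog 12) and a one-block rt extent this gives 2^28
blocks; with a three-block rt extent the result drops by one block so that
it divides evenly.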


diff --git a/repair/phase2.c b/repair/phase2.c
index 35f4c19de0555c..fa6ea91711557c 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -341,6 +341,9 @@ set_metadir(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*new_sb)
 {
+	struct xfs_rtgroup	*rtg;
+	unsigned int		rgsize;
+
 	if (xfs_has_metadir(mp)) {
 		printf(_("Filesystem already supports metadata directory trees.\n"));
 		exit(0);
@@ -352,6 +355,15 @@ set_metadir(
 		exit(0);
 	}
 
+	if (xfs_has_realtime(mp)) {
+		printf(
+	_("Realtime groups cannot be added to an existing realtime section.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_exchange_range(mp))
+		set_exchrange(mp, new_sb);
+
 	printf(_("Adding metadata directory trees to filesystem.\n"));
 	new_sb->sb_features_incompat |= (XFS_SB_FEAT_INCOMPAT_METADIR |
 					 XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
@@ -364,21 +376,35 @@ set_metadir(
 	doomed_gquotino = mp->m_sb.sb_gquotino;
 	doomed_pquotino = mp->m_sb.sb_pquotino;
 
-	new_sb->sb_rbmino = new_sb->sb_metadirino + 1;
-	new_sb->sb_rsumino = new_sb->sb_rbmino + 1;
+	new_sb->sb_rbmino = NULLFSINO;
+	new_sb->sb_rsumino = NULLFSINO;
 	new_sb->sb_uquotino = NULLFSINO;
 	new_sb->sb_gquotino = NULLFSINO;
 	new_sb->sb_pquotino = NULLFSINO;
+	rgsize = XFS_B_TO_FSBT(mp, 1ULL << 40); /* 1TB */
+	rgsize -= rgsize % new_sb->sb_rextsize;
+	new_sb->sb_rgextents = rgsize;
+	new_sb->sb_rgcount = 0;
+	new_sb->sb_rgblklog = libxfs_compute_rgblklog(new_sb->sb_rgextents,
+						      new_sb->sb_rextsize);
 
 	/* Indicate that we need a rebuild. */
 	need_metadir_inode = 1;
 	need_rbmino = 1;
 	need_rsumino = 1;
-	have_uquotino = 0;
-	have_gquotino = 0;
-	have_pquotino = 0;
+	clear_quota_inode(XFS_DQTYPE_USER);
+	clear_quota_inode(XFS_DQTYPE_GROUP);
+	clear_quota_inode(XFS_DQTYPE_PROJ);
 	quotacheck_skip();
 
+	/* Dump incore rt freespace inodes. */
+	rtg = libxfs_rtgroup_grab(mp, 0);
+	if (rtg) {
+		libxfs_rtginode_irele(&rtg->rtg_inodes[XFS_RTGI_BITMAP]);
+		libxfs_rtginode_irele(&rtg->rtg_inodes[XFS_RTGI_SUMMARY]);
+		libxfs_rtgroup_rele(rtg);
+	}
+
 	return true;
 }
 



* [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-12-31 23:54   ` [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir Darrick J. Wong
@ 2024-12-31 23:55   ` Darrick J. Wong
  2024-12-31 23:55   ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes.  This
is needed for online fsck.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    1 +
 repair/phase2.c          |   64 ++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 60 insertions(+), 5 deletions(-)
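
Reviewer note: a minimal standalone sketch of how the new reservation total
is accumulated and then compared against the free block count.  The sketch_*
names, the boolean array, and the single @ask value are hypothetical.  In
the patch, @ask comes from libxfs_rtrmapbt_calc_reserves(), and groups whose
rmap inode already exists go through libxfs_metafile_resv_init() instead of
being added to the total.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: @has_inode[i] says whether rtgroup i already has an
 * rmap btree inode.  Each group without one adds @ask blocks, the space
 * needed for a new maximally sized btree, to the reservation total.
 */
uint64_t
sketch_new_rt_resv(
	const bool	*has_inode,
	size_t		nr_groups,
	uint64_t	ask)
{
	uint64_t	new_resv = 0;
	size_t		i;

	for (i = 0; i < nr_groups; i++)
		if (!has_inode[i])
			new_resv += ask;
	return new_resv;
}

/* The upgrade is refused when the new reservations exceed the free blocks. */
bool
sketch_resv_fits(
	uint64_t	new_resv,
	uint64_t	fdblocks)
{
	return new_resv <= fdblocks;
}
```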


diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 76f55515bb41f7..2502a7736d1670 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -78,6 +78,7 @@
 #define xfs_btree_bload			libxfs_btree_bload
 #define xfs_btree_bload_compute_geometry libxfs_btree_bload_compute_geometry
 #define xfs_btree_calc_size		libxfs_btree_calc_size
+#define xfs_btree_compute_maxlevels	libxfs_btree_compute_maxlevels
 #define xfs_btree_decrement		libxfs_btree_decrement
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
 #define xfs_btree_delete		libxfs_btree_delete
diff --git a/repair/phase2.c b/repair/phase2.c
index fa6ea91711557c..b1288bf3dd90cd 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -277,9 +277,8 @@ set_rmapbt(
 		exit(0);
 	}
 
-	if (xfs_has_realtime(mp)) {
-		printf(
-	_("Reverse mapping btree feature not supported with realtime.\n"));
+	if (xfs_has_realtime(mp) && !xfs_has_rtgroups(mp)) {
+		printf(_("Reverse mapping btree requires realtime groups.\n"));
 		exit(0);
 	}
 
@@ -292,6 +291,7 @@ set_rmapbt(
 	printf(_("Adding reverse mapping btrees to filesystem.\n"));
 	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_RMAPBT;
 	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+
 	return true;
 }
 
@@ -466,6 +466,37 @@ check_free_space(
 	return avail > GIGABYTES(10, mp->m_sb.sb_blocklog);
 }
 
+/*
+ * Reserve space to handle rt rmap btree expansion.
+ *
+ * If the rmap inode for this group already exists, we assume that we're adding
+ * some other feature.  Note that we have not validated the metadata directory
+ * tree, so we must perform the lookup by hand and abort the upgrade if there
+ * are errors.  Otherwise, the amount of space needed to handle a new maximally
+ * sized rmap btree is added to @new_resv.
+ */
+static int
+reserve_rtrmap_inode(
+	struct xfs_rtgroup	*rtg,
+	xfs_rfsblock_t		*new_resv)
+{
+	struct xfs_mount	*mp = rtg_mount(rtg);
+	struct xfs_inode	*ip = rtg_rmap(rtg);
+	xfs_filblks_t		ask;
+
+	if (!xfs_has_rtrmapbt(mp))
+		return 0;
+
+	ask = libxfs_rtrmapbt_calc_reserves(mp);
+
+	/* failed to load the rtdir inode? */
+	if (!ip) {
+		*new_resv += ask;
+		return 0;
+	}
+	return -libxfs_metafile_resv_init(ip, ask);
+}
+
 static void
 check_fs_free_space(
 	struct xfs_mount		*mp,
@@ -473,6 +504,8 @@ check_fs_free_space(
 	struct xfs_sb			*new_sb)
 {
 	struct xfs_perag		*pag = NULL;
+	struct xfs_rtgroup		*rtg = NULL;
+	xfs_rfsblock_t			new_resv = 0;
 	int				error;
 
 	/* Make sure we have enough space for per-AG reservations. */
@@ -548,6 +581,21 @@ check_fs_free_space(
 		libxfs_trans_cancel(tp);
 	}
 
+	/* Realtime metadata btree inodes */
+	while ((rtg = xfs_rtgroup_next(mp, rtg))) {
+		error = reserve_rtrmap_inode(rtg, &new_resv);
+		if (error == ENOSPC) {
+			printf(
+_("Not enough free space would remain for rtgroup %u rmap inode.\n"),
+					rtg_rgno(rtg));
+			exit(0);
+		}
+		if (error)
+			do_error(
+_("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
+					error, rtg_rgno(rtg));
+	}
+
 	/*
 	 * If we're adding parent pointers, we need at least 25% free since
 	 * scanning the entire filesystem to guesstimate the overhead is
@@ -563,13 +611,19 @@ check_fs_free_space(
 
 	/*
 	 * Would the post-upgrade filesystem have enough free space on the data
-	 * device after making per-AG reservations?
+	 * device after making per-AG reservations and reserving rt metadata
+	 * inode blocks?
 	 */
-	if (!check_free_space(mp, mp->m_sb.sb_fdblocks, mp->m_sb.sb_dblocks)) {
+	if (new_resv > mp->m_sb.sb_fdblocks ||
+	    !check_free_space(mp, mp->m_sb.sb_fdblocks, mp->m_sb.sb_dblocks)) {
 		printf(_("Filesystem will be low on space after upgrade.\n"));
 		exit(1);
 	}
 
+	/* Unreserve the realtime metadata reservations. */
+	while ((rtg = xfs_rtgroup_next(mp, rtg)))
+		libxfs_metafile_resv_free(rtg_rmap(rtg));
+
 	/*
 	 * Release the per-AG reservations and mark the per-AG structure as
 	 * uninitialized so that we don't trip over stale cached counters



* [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-12-31 23:55   ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong
@ 2024-12-31 23:55   ` Darrick J. Wong
  2024-12-31 23:55   ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong
  2024-12-31 23:55   ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/phase2.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)
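
Reviewer note: a minimal standalone sketch of the relaxed feature gate in
set_reflink().  The function name and boolean parameters are hypothetical
stand-ins for xfs_has_realtime(mp) and xfs_has_rtgroups(mp).

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: a filesystem with a realtime section may gain
 * reflink only if it also has realtime groups.  Filesystems without a
 * realtime section always pass this particular check.
 */
bool
sketch_reflink_upgrade_allowed(
	bool	has_realtime,
	bool	has_rtgroups)
{
	return !(has_realtime && !has_rtgroups);
}
```

Before this patch the equivalent check rejected every filesystem with a
realtime section, regardless of realtime groups.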


diff --git a/repair/phase2.c b/repair/phase2.c
index b1288bf3dd90cd..8dc936b572196e 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -250,14 +250,15 @@ set_reflink(
 		exit(0);
 	}
 
-	if (xfs_has_realtime(mp)) {
-		printf(_("Reflink feature not supported with realtime.\n"));
+	if (xfs_has_realtime(mp) && !xfs_has_rtgroups(mp)) {
+		printf(_("Reference count btree requires realtime groups.\n"));
 		exit(0);
 	}
 
 	printf(_("Adding reflink support to filesystem.\n"));
 	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_REFLINK;
 	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+
 	return true;
 }
 
@@ -497,6 +498,38 @@ reserve_rtrmap_inode(
 	return -libxfs_metafile_resv_init(ip, ask);
 }
 
+/*
+ * Reserve space to handle rt refcount btree expansion.
+ *
+ * If the refcount inode for this group already exists, we assume that we're
+ * adding some other feature.  Note that we have not validated the metadata
+ * directory tree, so we must perform the lookup by hand and abort the upgrade
+ * if there are errors.  If the inode does not exist, the amount of space
+ * needed to handle a new maximally sized refcount btree is added to @new_resv.
+ */
+static int
+reserve_rtrefcount_inode(
+	struct xfs_rtgroup	*rtg,
+	xfs_rfsblock_t		*new_resv)
+{
+	struct xfs_mount	*mp = rtg_mount(rtg);
+	struct xfs_inode	*ip = rtg_refcount(rtg);
+	xfs_filblks_t		ask;
+
+	if (!xfs_has_rtreflink(mp))
+		return 0;
+
+	ask = libxfs_rtrefcountbt_calc_reserves(mp);
+
+	/* failed to load the rtdir inode? */
+	if (!ip) {
+		*new_resv += ask;
+		return 0;
+	}
+
+	return -libxfs_metafile_resv_init(ip, ask);
+}
+
 static void
 check_fs_free_space(
 	struct xfs_mount		*mp,
@@ -594,6 +627,18 @@ _("Not enough free space would remain for rtgroup %u rmap inode.\n"),
 			do_error(
 _("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
 					error, rtg_rgno(rtg));
+
+		error = reserve_rtrefcount_inode(rtg, &new_resv);
+		if (error == ENOSPC) {
+			printf(
+_("Not enough free space would remain for rtgroup %u refcount inode.\n"),
+					rtg_rgno(rtg));
+			exit(0);
+		}
+		if (error)
+			do_error(
+_("Error %d while checking rtgroup %u refcount inode space reservation.\n"),
+					error, rtg_rgno(rtg));
 	}
 
 	/*
@@ -621,8 +666,10 @@ _("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
 	}
 
 	/* Unreserve the realtime metadata reservations. */
-	while ((rtg = xfs_rtgroup_next(mp, rtg)))
+	while ((rtg = xfs_rtgroup_next(mp, rtg))) {
 		libxfs_metafile_resv_free(rtg_rmap(rtg));
+		libxfs_metafile_resv_free(rtg_refcount(rtg));
+	}
 
 	/*
 	 * Release the per-AG reservations and mark the per-AG structure as



* [PATCH 09/10] xfs_repair: skip free space checks when upgrading
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-12-31 23:55   ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong
@ 2024-12-31 23:55   ` Darrick J. Wong
  2024-12-31 23:55   ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a debug knob to disable the free space checks when upgrading a
filesystem.  This is extremely risky and will cause severe tire damage!!!

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/globals.c    |    1 +
 repair/globals.h    |    1 +
 repair/phase2.c     |    2 ++
 repair/xfs_repair.c |   11 +++++++++++
 4 files changed, 15 insertions(+)
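
Reviewer note: a minimal standalone sketch of how the knob's value is
interpreted.  The function name is hypothetical.  The patch aborts when no
value is given; the sketch just returns false in that case.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a suboption value with strtol() in base 0,
 * as the patch does, and treat any nonzero result as "enabled".  Base 0
 * means "1", "0x1" and "01" all enable the knob, while text that does not
 * start with a number parses as 0 and leaves it disabled.
 */
bool
sketch_parse_debug_knob(
	const char	*val)
{
	if (!val)
		return false;
	return strtol(val, NULL, 0) != 0;
}
```

Once the knob is enabled, need_check_fs_free_space() returns false before
looking at any feature, so none of the space checks run.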


diff --git a/repair/globals.c b/repair/globals.c
index 603fea73da1654..fe9f9ac5914bb0 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -48,6 +48,7 @@ char	*rt_name;		/* Name of realtime device */
 int	rt_spec;		/* Realtime dev specified as option */
 int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 int	lazy_count;		/* What to set if to if converting */
+bool	skip_freesp_check_on_upgrade; /* do not enable */
 bool	features_changed;	/* did we change superblock feature bits? */
 bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/globals.h b/repair/globals.h
index 9211e5e2432c9a..c660971080f7e4 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -89,6 +89,7 @@ extern char	*rt_name;		/* Name of realtime device */
 extern int	rt_spec;		/* Realtime dev specified as option */
 extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 extern int	lazy_count;		/* What to set if to if converting */
+extern bool	skip_freesp_check_on_upgrade; /* do not enable */
 extern bool	features_changed;	/* did we change superblock feature bits? */
 extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/phase2.c b/repair/phase2.c
index 8dc936b572196e..780294d24c9900 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -688,6 +688,8 @@ need_check_fs_free_space(
 	struct xfs_mount		*mp,
 	const struct check_state	*old)
 {
+	if (skip_freesp_check_on_upgrade)
+		return false;
 	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
 		return true;
 	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index d4101f7d2297d7..55e417201b34f7 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -46,6 +46,7 @@ enum o_opt_nums {
 	BLOAD_LEAF_SLACK,
 	BLOAD_NODE_SLACK,
 	NOQUOTA,
+	SKIP_FREESP_CHECK,
 	O_MAX_OPTS,
 };
 
@@ -59,6 +60,7 @@ static char *o_opts[] = {
 	[BLOAD_LEAF_SLACK]	= "debug_bload_leaf_slack",
 	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
 	[NOQUOTA]		= "noquota",
+	[SKIP_FREESP_CHECK]	= "debug_skip_freesp_check_on_upgrade",
 	[O_MAX_OPTS]		= NULL,
 };
 
@@ -323,6 +325,15 @@ process_args(int argc, char **argv)
 				case NOQUOTA:
 					quotacheck_skip();
 					break;
+				case SKIP_FREESP_CHECK:
+					if (!val)
+						do_abort(
+		_("-o debug_skip_freesp_check_on_upgrade requires a parameter\n"));
+					skip_freesp_check_on_upgrade = (int)strtol(val, NULL, 0);
+					if (skip_freesp_check_on_upgrade)
+						do_log(
+		_("WARNING: Allowing filesystem upgrades to proceed without free space check.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
+					break;
 				default:
 					unknown('o', val);
 					break;



* [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-12-31 23:55   ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong
@ 2024-12-31 23:55   ` Darrick J. Wong
  9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
  To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a debugging knob so that I can upgrade a filesystem to have rmap
btrees even if reflink was already enabled.  We cannot easily precompute
the space requirements, so this is dangerous.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/globals.c    |    1 +
 repair/globals.h    |    1 +
 repair/phase2.c     |    3 ++-
 repair/xfs_repair.c |   11 +++++++++++
 4 files changed, 15 insertions(+), 1 deletion(-)
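
Reviewer note: a minimal standalone sketch of the amended refusal condition
in set_rmapbt().  The function name and boolean parameters are hypothetical
stand-ins for xfs_has_reflink(mp), add_reflink, and
allow_rmapbt_upgrade_with_reflink.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: refuse to add rmap btrees to a filesystem that
 * already has reflink, unless reflink is being added in the same run or
 * the debug knob is set.
 */
bool
sketch_rmapbt_upgrade_refused(
	bool	has_reflink,
	bool	add_reflink,
	bool	debug_allow)
{
	return has_reflink && !add_reflink && !debug_allow;
}
```

The only input combination whose outcome changes with this patch is a
filesystem with reflink, no reflink upgrade requested, and the knob set.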


diff --git a/repair/globals.c b/repair/globals.c
index fe9f9ac5914bb0..f4f1d317917183 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -49,6 +49,7 @@ int	rt_spec;		/* Realtime dev specified as option */
 int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 int	lazy_count;		/* What to set if to if converting */
 bool	skip_freesp_check_on_upgrade; /* do not enable */
+bool	allow_rmapbt_upgrade_with_reflink; /* add rmapbt when reflink already on */
 bool	features_changed;	/* did we change superblock feature bits? */
 bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/globals.h b/repair/globals.h
index c660971080f7e4..febbbbcc81f931 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -90,6 +90,7 @@ extern int	rt_spec;		/* Realtime dev specified as option */
 extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 extern int	lazy_count;		/* What to set if to if converting */
 extern bool	skip_freesp_check_on_upgrade; /* do not enable */
+extern bool	allow_rmapbt_upgrade_with_reflink; /* add rmapbt when reflink already on */
 extern bool	features_changed;	/* did we change superblock feature bits? */
 extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/phase2.c b/repair/phase2.c
index 780294d24c9900..29a406f69ca3a1 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -283,7 +283,8 @@ set_rmapbt(
 		exit(0);
 	}
 
-	if (xfs_has_reflink(mp) && !add_reflink) {
+	if (xfs_has_reflink(mp) && !add_reflink &&
+	    !allow_rmapbt_upgrade_with_reflink) {
 		printf(
 	_("Reverse mapping btrees cannot be added when reflink is enabled.\n"));
 		exit(0);
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 55e417201b34f7..4cff11d81d6bcb 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -47,6 +47,7 @@ enum o_opt_nums {
 	BLOAD_NODE_SLACK,
 	NOQUOTA,
 	SKIP_FREESP_CHECK,
+	ALLOW_RMAPBT_UPGRADE_WITH_REFLINK,
 	O_MAX_OPTS,
 };
 
@@ -61,6 +62,7 @@ static char *o_opts[] = {
 	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
 	[NOQUOTA]		= "noquota",
 	[SKIP_FREESP_CHECK]	= "debug_skip_freesp_check_on_upgrade",
+	[ALLOW_RMAPBT_UPGRADE_WITH_REFLINK] = "debug_allow_rmapbt_upgrade_with_reflink",
 	[O_MAX_OPTS]		= NULL,
 };
 
@@ -334,6 +336,15 @@ process_args(int argc, char **argv)
 						do_log(
 		_("WARNING: Allowing filesystem upgrades to proceed without free space check.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
 					break;
+				case ALLOW_RMAPBT_UPGRADE_WITH_REFLINK:
+					if (!val)
+						do_abort(
+		_("-o debug_allow_rmapbt_upgrade_with_reflink requires a parameter\n"));
+					allow_rmapbt_upgrade_with_reflink = (int)strtol(val, NULL, 0);
+					if (allow_rmapbt_upgrade_with_reflink)
+						do_log(
+		_("WARNING: Allowing filesystem upgrade to rmapbt when reflink enabled.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
+					break;
 				default:
 					unknown('o', val);
 					break;



* [PATCHSET 1/5] fstests: functional test for refcount reporting
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (9 preceding siblings ...)
  2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl Darrick J. Wong
  2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

Add a short functional test for the new GETFSREFCOUNTS ioctl that allows
userspace to query reference count information for a given range of
physical blocks.
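
For reviewers who want the gist of the single-owner check without reading
the awk in the test: when a refcount record reports exactly one owner, the
fsmap records overlapping that physical range should tile it with no gaps.
Below is a rough sketch of that idea only, not the test itself.  The helper
name covers_range, the simplified CSV layout (ext,major,minor,pstart,pend),
the inclusive range end, and the lack of major/minor filtering are all
assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of fstests): succeed if the fsmap-style
# records on stdin contiguously cover the physical range [$1, $2].
# Assumed input: CSV lines "ext,major,minor,pstart,pend", sorted by pstart,
# with inclusive pend values.
covers_range() {
	awk -F ',' -v qstart="$1" -v qend="$2" '
	BEGIN { next_map = -1; covered = 0 }
	$5 < qstart { next }	# mapping ends before the queried range
	{
		if (next_map < 0) {
			# The first overlapping mapping must start at or
			# before the start of the queried range.
			if ($4 > qstart)
				exit 1
		} else if ($4 != next_map) {
			# Gap or overlap between consecutive mappings.
			exit 1
		}
		next_map = $5 + 1
		if (next_map > qend) {
			covered = 1
			exit 0
		}
	}
	END { exit covered ? 0 : 1 }
	'
}
```

Something like "covers_range 5 15 < records.csv" would then return zero only
if the records leave no hole between blocks 5 and 15.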

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
 * xfs: test output of new FSREFCOUNTS ioctl
---
 common/rc           |    4 +
 doc/group-names.txt |    1 
 tests/xfs/1921      |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1921.out  |    4 +
 4 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/1921
 create mode 100644 tests/xfs/1921.out



* [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl
  2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
@ 2024-12-31 23:56   ` Darrick J. Wong
  0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure the cursors work properly and that refcounts are correct.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc           |    4 +
 doc/group-names.txt |    1 
 tests/xfs/1921      |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1921.out  |    4 +
 4 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/1921
 create mode 100644 tests/xfs/1921.out


diff --git a/common/rc b/common/rc
index e04ca50e3140c0..c45a226849ce0f 100644
--- a/common/rc
+++ b/common/rc
@@ -2811,8 +2811,8 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Operation not supported" && \
 			_notrun "O_TMPFILE is not supported"
 		;;
-	"fsmap")
-		testio=`$XFS_IO_PROG -f -c "fsmap" $testfile 2>&1`
+	"fsmap"|"fsrefcounts")
+		testio=`$XFS_IO_PROG -f -c "$command" $testfile 2>&1`
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
diff --git a/doc/group-names.txt b/doc/group-names.txt
index ed886caac058c3..b04d0180e8ec02 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -58,6 +58,7 @@ fsck			general fsck tests
 fsmap			FS_IOC_GETFSMAP ioctl
 fsproperties		Filesystem properties
 fsr			XFS free space reorganizer
+fsrefcounts		FS_IOC_GETFSREFCOUNTS ioctl
 fuzzers			filesystem fuzz tests
 growfs			increasing the size of a filesystem
 hardlink		hardlinks
diff --git a/tests/xfs/1921 b/tests/xfs/1921
new file mode 100755
index 00000000000000..2d0af845767ed2
--- /dev/null
+++ b/tests/xfs/1921
@@ -0,0 +1,164 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1921
+#
+# Populate filesystem, check that fsrefcounts -n65536 matches fsrefcounts -n1,
+# then verify that the refcount information is consistent with the fsmap info.
+#
+. ./common/preamble
+_begin_fstest auto clone fsrefcounts fsmap
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $TEST_DIR/a $TEST_DIR/b
+}
+
+. ./common/filter
+
+_require_scratch
+_require_xfs_io_command "fsmap"
+_require_xfs_io_command "fsrefcounts"
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+cpus=$(( $(src/feature -o) * 4))
+
+# Use fsstress to create a directory tree with some variability
+FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 4000 $FSSTRESS_AVOID)
+$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
+
+_scratch_cycle_mount	# flush all the background gc
+
+echo "Compare fsrefcounts" | tee -a $seqres.full
+$XFS_IO_PROG -c 'fsrefcounts -m -n 65536' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/a
+$XFS_IO_PROG -c 'fsrefcounts -m -n 1' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/b
+cat $TEST_DIR/a $TEST_DIR/b >> $seqres.full
+
+diff -uw $TEST_DIR/a $TEST_DIR/b
+
+echo "Compare fsrefcounts to fsmap" | tee -a $seqres.full
+$XFS_IO_PROG -c 'fsmap -m -n 65536' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/b
+cat $TEST_DIR/b >> $seqres.full
+
+while IFS=',' read ext major minor pstart pend owners length crap; do
+	test "$ext" = "EXT" && continue
+
+	awk_args=(-'F' ',' '-v' "major=$major" '-v' "minor=$minor" \
+		  '-v' "pstart=$pstart" '-v' "pend=$pend" '-v' "owners=$owners")
+
+	if [ "$owners" -eq 1 ]; then
+		$AWK_PROG "${awk_args[@]}" \
+'
+BEGIN {
+	printf("Q:%s:%s:%s:%s:%s:\n", major, minor, pstart, pend, owners) > "/dev/stderr";
+	next_map = -1;
+}
+{
+	if ($2 != major || $3 != minor) {
+		next;
+	}
+	if ($5 <= pstart) {
+		next;
+	}
+
+	printf(" A:%s:%s:%s:%s\n", $2, $3, $4, $5) > "/dev/stderr";
+	if (next_map < 0) {
+		if ($4 > pstart) {
+			exit 1
+		}
+		next_map = $5 + 1;
+	} else {
+		if ($4 != next_map) {
+			exit 1
+		}
+		next_map = $5 + 1;
+	}
+	if (next_map >= pend) {
+		nextfile;
+	}
+}
+END {
+	exit 0;
+}
+' $TEST_DIR/b 2> $tmp.debug
+		res=$?
+	else
+		$AWK_PROG "${awk_args[@]}" \
+'
+function max(a, b) {
+	return a > b ? a : b;
+}
+function min(a, b) {
+	return a < b ? a : b;
+}
+BEGIN {
+	printf("Q:%s:%s:%s:%s:%s:\n", major, minor, pstart, pend, owners) > "/dev/stderr";
+	refcount_whole = 0;
+	aborted = 0;
+}
+{
+	if ($2 != major || $3 != minor) {
+		next;
+	}
+	if ($4 > pend) {
+		nextfile;
+	}
+	if ($5 < pstart) {
+		next;
+	}
+	if ($6 == "special_0:2") {
+		# unknown owner means we cannot distinguish separate owners
+		aborted = 1;
+		exit 0;
+	}
+
+	printf(" A:%s:%s:%s:%s -> %d\n", $2, $3, $4, $5, refcount_whole) > "/dev/stderr";
+	if ($4 <= pstart && $5 >= pend) {
+		# Account for extents that span the whole range
+		refcount_whole++;
+	} else {
+		# Otherwise track refcounts per-block as we find them
+		for (block = max($4, pstart); block <= min($5, pend); block++) {
+			refcounts[block]++;
+		}
+	}
+}
+END {
+	if (aborted) {
+		exit 0;
+	}
+	deficit = owners - refcount_whole;
+	printf(" W:%d:%d:%d\n", owners, refcount_whole, deficit) > "/dev/stderr";
+	if (deficit == 0) {
+		exit 0;
+	}
+
+	refcount_slivers = deficit;
+	for (block in refcounts) {
+		printf(" X:%s:%d\n", block, refcounts[block]) > "/dev/stderr";
+		if (refcounts[block] != deficit) {
+			refcount_slivers = 0;
+		}
+	}
+
+	refcount_whole += refcount_slivers;
+	exit owners == refcount_whole ? 0 : 1;
+}
+' $TEST_DIR/b 2> $tmp.debug
+		res=$?
+	fi
+	if [ $res -ne 0 ]; then
+		echo "$major,$minor,$pstart,$pend,$owners not found in fsmap"
+		cat $tmp.debug >> $seqres.full
+		break
+	fi
+done < $TEST_DIR/a
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1921.out b/tests/xfs/1921.out
new file mode 100644
index 00000000000000..f5ea660379bbdd
--- /dev/null
+++ b/tests/xfs/1921.out
@@ -0,0 +1,4 @@
+QA output created by 1921
+Format and mount
+Compare fsrefcounts
+Compare fsrefcounts to fsmap



* [PATCHSET 2/5] fstests: defragment free space
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (10 preceding siblings ...)
  2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/1] xfs: test clearing of " Darrick J. Wong
  2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

These patches contain experimental code to enable userspace to defragment
the free space in a filesystem.  This functionality has two intended
uses: clearing space at the end of a filesystem before shrinking it, and
clearing free space in anticipation of making a large allocation.

The first patch adds a new fallocate mode that allows userspace to
allocate free space from the filesystem into a file.  The goal here is
to allow the filesystem shrink process to prevent allocation from a
certain part of the filesystem while a free space defragmenter evacuates
all the files from the doomed part of the filesystem.
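
To make the stride concrete: the functional test in this series walks the
whole device in fixed-size steps and asks xfs_io's fmapfree command to map
whatever free space lies in each step into a single file.  Here is a minimal
dry-run sketch of that loop.  The helper name print_fmapfree_cmds is
hypothetical, and it only prints the commands instead of running them, so it
does not need a kernel or xfsprogs with the new fallocate mode.

```shell
#!/bin/bash
# Hypothetical dry-run helper (not part of fstests): print the xfs_io
# invocations that would map free space from a device of $1 bytes into
# file $3, stepping $2 bytes at a time.
print_fmapfree_cmds() {
	local length=$1 increment=$2 file=$3 start

	# A zero or negative stride would never terminate.
	[ "$increment" -gt 0 ] || return 1

	for ((start = 0; start < length; start += increment)); do
		echo "xfs_io -f -c 'fmapfree $start $increment' $file"
	done
}
```

For example, "print_fmapfree_cmds 300 100 /mnt/capture" prints three
commands covering offsets 0, 100, and 200.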

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs: test clearing of free space
---
 common/rc          |    5 ++++
 tests/xfs/1400     |   52 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1400.out |    2 +
 tests/xfs/1401     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1401.out |    2 +
 5 files changed, 131 insertions(+)
 create mode 100755 tests/xfs/1400
 create mode 100644 tests/xfs/1400.out
 create mode 100755 tests/xfs/1401
 create mode 100644 tests/xfs/1401.out



* [PATCH 1/1] xfs: test clearing of free space
  2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
@ 2024-12-31 23:56   ` Darrick J. Wong
  0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Simple regression test for the spaceman clearfree command, which tries
to free all the used space in some part of the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc          |    5 ++++
 tests/xfs/1400     |   52 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1400.out |    2 +
 tests/xfs/1401     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1401.out |    2 +
 5 files changed, 131 insertions(+)
 create mode 100755 tests/xfs/1400
 create mode 100644 tests/xfs/1400.out
 create mode 100755 tests/xfs/1401
 create mode 100644 tests/xfs/1401.out


diff --git a/common/rc b/common/rc
index c45a226849ce0f..d7dfb55bbbd7e1 100644
--- a/common/rc
+++ b/common/rc
@@ -2786,6 +2786,11 @@ _require_xfs_io_command()
 			-c "fsync" -c "$command $blocksize $((2 * $blocksize))" \
 			$testfile 2>&1`
 		;;
+	"fmapfree")
+		local blocksize=$(_get_file_block_size $TEST_DIR)
+		testio=`$XFS_IO_PROG -F -f -c "$command $blocksize $((2 * $blocksize))" \
+			$testfile 2>&1`
+		;;
 	"fiemap")
 		# If 'ranged' is passed as argument then we check to see if fiemap supports
 		# ranged query params
diff --git a/tests/xfs/1400 b/tests/xfs/1400
new file mode 100755
index 00000000000000..ec3f7aec2a318a
--- /dev/null
+++ b/tests/xfs/1400
@@ -0,0 +1,52 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1400
+#
+# Basic functionality testing for FALLOC_FL_MAP_FREE
+#
+. ./common/preamble
+_begin_fstest auto prealloc
+
+. ./common/filter
+
+_require_scratch
+_require_xfs_io_command "fmapfree"
+
+_scratch_mkfs | _filter_mkfs 2> $tmp.mkfs > /dev/null
+_scratch_mount >> $seqres.full
+. $tmp.mkfs
+
+testfile="$SCRATCH_MNT/$seq.txt"
+touch $testfile
+if $XFS_IO_PROG -c 'stat -v' $testfile | grep -q 'realtime'; then
+	# realtime
+	increment=$((dbsize * rtblocks / 100))
+	length=$((dbsize * rtblocks))
+else
+	# data
+	increment=$((dbsize * dblocks / 100))
+	length=$((dbsize * dblocks))
+fi
+
+free_bytes=$(stat -f -c '%f * %S' $testfile | bc)
+
+echo "free space: $free_bytes; increment: $increment; length: $length" >> $seqres.full
+
+# Map all the free space on that device, 1% at a time
+for ((start = 0; start < length; start += increment)); do
+	$XFS_IO_PROG -f -c "fmapfree $start $increment" $testfile
+done
+
+space_used=$(stat -c '%b * %B' $testfile | bc)
+
+echo "space captured: $space_used" >> $seqres.full
+$FILEFRAG_PROG -v $testfile >> $seqres.full
+
+# Did we get within 10% of the free space?
+_within_tolerance "mapfree space used" $space_used $free_bytes 10% -v
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1400.out b/tests/xfs/1400.out
new file mode 100644
index 00000000000000..601404d7a46856
--- /dev/null
+++ b/tests/xfs/1400.out
@@ -0,0 +1,2 @@
+QA output created by 1400
+mapfree space used is in range
diff --git a/tests/xfs/1401 b/tests/xfs/1401
new file mode 100755
index 00000000000000..14675abd8ff985
--- /dev/null
+++ b/tests/xfs/1401
@@ -0,0 +1,70 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1401
+#
+# Basic functionality testing for the free space defragmenter.
+#
+. ./common/preamble
+_begin_fstest auto defrag shrinkfs
+
+. ./common/filter
+
+_notrun "XXX test is not ready yet; you need to deal with eof blocks"
+_notrun "XXX clearfree cannot move unwritten extents; does fiexchange work for this?"
+_notrun "XXX csp_buffercopy never returns if we hit eof"
+
+_require_scratch
+_require_xfs_spaceman_command "clearfree"
+
+_scratch_mkfs | _filter_mkfs 2> $tmp.mkfs > /dev/null
+cat $tmp.mkfs >> $seqres.full
+. $tmp.mkfs
+_scratch_mount >> $seqres.full
+
+cpus=$(( $(src/feature -o) * 4))
+
+# Use fsstress to create a directory tree with some variability
+FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 4000 $FSSTRESS_AVOID)
+$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
+
+$XFS_IO_PROG -c 'stat -v' $SCRATCH_MNT >> $seqres.full
+
+if $XFS_IO_PROG -c 'stat -v' $SCRATCH_MNT | grep -q 'rt-inherit'; then
+	# realtime
+	increment=$((dbsize * rtblocks / agcount))
+	length=$((dbsize * rtblocks))
+	fsmap_devarg="-r"
+else
+	# data
+	increment=$((dbsize * agsize))
+	length=$((dbsize * dblocks))
+	fsmap_devarg="-d"
+fi
+
+echo "increment: $increment; length: $length" >> $seqres.full
+$DF_PROG $SCRATCH_MNT >> $seqres.full
+
+TRACE_PROG="strace -s99 -e fallocate,ioctl,openat -o $tmp.strace"
+
+for ((start = 0; start < length; start += increment)); do
+	echo "---------------------------" >> $seqres.full
+	echo "start: $start end: $((start + increment))" >> $seqres.full
+	echo "---------------------------" >> $seqres.full
+
+	fsmap_args="-vvvv $fsmap_devarg $((start / 512)) $((increment / 512))"
+	clearfree_args="-v all $start $increment"
+
+	$XFS_IO_PROG -c "fsmap $fsmap_args" $SCRATCH_MNT > $tmp.before
+	$TRACE_PROG $XFS_SPACEMAN_PROG -c "clearfree $clearfree_args" $SCRATCH_MNT &>> $seqres.full || break
+	cat $tmp.strace >> $seqres.full
+	$XFS_IO_PROG -c "fsmap $fsmap_args" $SCRATCH_MNT > $tmp.after
+	cat $tmp.before >> $seqres.full
+	cat $tmp.after >> $seqres.full
+done
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1401.out b/tests/xfs/1401.out
new file mode 100644
index 00000000000000..504999381ea9a8
--- /dev/null
+++ b/tests/xfs/1401.out
@@ -0,0 +1,2 @@
+QA output created by 1401
+Silence is golden



* [PATCHSET 3/5] fstests: capture logs from mount failures
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (11 preceding siblings ...)
  2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

Whenever a mount fails, we should capture the kernel logs for the last
few seconds before the failure.  If the test fails, retain the log
contents for further analysis.
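
The general shape of the idea, as a rough sketch and not the code in these
patches: wrap the mount call, and if it fails, stash the tail of the kernel
log somewhere the harness can pick it up later.  The names
mount_with_capture, MOUNT_CMD, and LOG_CMD are hypothetical, and keeping the
last 100 lines is an arbitrary stand-in for "the last few seconds".

```shell
#!/bin/bash
# Hypothetical wrapper (not the fstests _mount helper).
# Usage: mount_with_capture <logfile> <mount arguments...>
# MOUNT_CMD and LOG_CMD are overridable so the sketch can be exercised
# without root privileges.
MOUNT_CMD=${MOUNT_CMD:-mount}
LOG_CMD=${LOG_CMD:-dmesg}

mount_with_capture() {
	local logfile=$1 ret
	shift

	$MOUNT_CMD "$@"
	ret=$?
	if [ "$ret" -ne 0 ]; then
		# Keep only the most recent kernel messages for analysis.
		$LOG_CMD 2>/dev/null | tail -n 100 > "$logfile"
	fi
	return $ret
}
```

The caller still sees the original exit status, so existing "|| _fail"
style error handling keeps working.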

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=capture-mount-failures
---
Commits in this patchset:
 * treewide: convert all $MOUNT_PROG to _mount
 * check: capture dmesg of mount failures if test fails
---
 check                  |   22 +++++++++++++++++++++-
 common/btrfs           |    4 ++--
 common/dmdelay         |    2 +-
 common/dmerror         |    2 +-
 common/dmlogwrites     |    2 +-
 common/overlay         |    6 +++---
 common/rc              |   26 +++++++++++++++++++++++++-
 common/report          |    8 ++++++++
 tests/btrfs/075        |    2 +-
 tests/btrfs/208        |    2 +-
 tests/ext4/032         |    2 +-
 tests/generic/067      |    6 +++---
 tests/generic/085      |    2 +-
 tests/generic/361      |    2 +-
 tests/generic/373      |    2 +-
 tests/generic/374      |    2 +-
 tests/generic/409      |    6 +++---
 tests/generic/410      |    8 ++++----
 tests/generic/411      |    8 ++++----
 tests/generic/589      |    8 ++++----
 tests/overlay/005      |    4 ++--
 tests/overlay/025      |    2 +-
 tests/overlay/035      |    2 +-
 tests/overlay/062      |    2 +-
 tests/overlay/083      |    6 +++---
 tests/overlay/086      |   12 ++++++------
 tests/selftest/008     |   20 ++++++++++++++++++++
 tests/selftest/008.out |    1 +
 tests/xfs/078          |    2 +-
 tests/xfs/149          |    4 ++--
 tests/xfs/289          |    4 ++--
 tests/xfs/544          |    2 +-
 32 files changed, 128 insertions(+), 55 deletions(-)
 create mode 100755 tests/selftest/008
 create mode 100644 tests/selftest/008.out



* [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount
  2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
@ 2024-12-31 23:56   ` Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong
  1 sibling, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

A subsequent patch adds log scraping functionality for mount failures,
so we need everyone to use _mount instead of $MOUNT_PROG.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/btrfs       |    4 ++--
 common/dmdelay     |    2 +-
 common/dmerror     |    2 +-
 common/dmlogwrites |    2 +-
 common/overlay     |    6 +++---
 tests/btrfs/075    |    2 +-
 tests/btrfs/208    |    2 +-
 tests/ext4/032     |    2 +-
 tests/generic/067  |    6 +++---
 tests/generic/085  |    2 +-
 tests/generic/361  |    2 +-
 tests/generic/373  |    2 +-
 tests/generic/374  |    2 +-
 tests/generic/409  |    6 +++---
 tests/generic/410  |    8 ++++----
 tests/generic/411  |    8 ++++----
 tests/generic/589  |    8 ++++----
 tests/overlay/005  |    4 ++--
 tests/overlay/025  |    2 +-
 tests/overlay/035  |    2 +-
 tests/overlay/062  |    2 +-
 tests/overlay/083  |    6 +++---
 tests/overlay/086  |   12 ++++++------
 tests/xfs/078      |    2 +-
 tests/xfs/149      |    4 ++--
 tests/xfs/289      |    4 ++--
 tests/xfs/544      |    2 +-
 27 files changed, 53 insertions(+), 53 deletions(-)


diff --git a/common/btrfs b/common/btrfs
index 95a9c8e6c7f448..64f38cc240ab8b 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -351,7 +351,7 @@ _btrfs_stress_subvolume()
 	mkdir -p $subvol_mnt
 	while [ ! -e $stop_file ]; do
 		$BTRFS_UTIL_PROG subvolume create $btrfs_mnt/$subvol_name
-		$MOUNT_PROG -o subvol=$subvol_name $btrfs_dev $subvol_mnt
+		_mount -o subvol=$subvol_name $btrfs_dev $subvol_mnt
 		$UMOUNT_PROG $subvol_mnt
 		$BTRFS_UTIL_PROG subvolume delete $btrfs_mnt/$subvol_name
 	done
@@ -437,7 +437,7 @@ _btrfs_stress_remount_compress()
 	local btrfs_mnt=$1
 	while true; do
 		for algo in no zlib lzo; do
-			$MOUNT_PROG -o remount,compress=$algo $btrfs_mnt
+			_mount -o remount,compress=$algo $btrfs_mnt
 		done
 	done
 }
diff --git a/common/dmdelay b/common/dmdelay
index 66cac1a70c14c8..794ea37ba200ce 100644
--- a/common/dmdelay
+++ b/common/dmdelay
@@ -20,7 +20,7 @@ _init_delay()
 _mount_delay()
 {
 	_scratch_options mount
-	$MOUNT_PROG -t $FSTYP `_common_dev_mount_options` $SCRATCH_OPTIONS \
+	_mount -t $FSTYP `_common_dev_mount_options` $SCRATCH_OPTIONS \
 		$DELAY_DEV $SCRATCH_MNT
 }
 
diff --git a/common/dmerror b/common/dmerror
index 3494b6dd3b9479..2f006142a309fe 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -91,7 +91,7 @@ _dmerror_init()
 _dmerror_mount()
 {
 	_scratch_options mount
-	$MOUNT_PROG -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \
+	_mount -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \
 		$DMERROR_DEV $SCRATCH_MNT
 }
 
diff --git a/common/dmlogwrites b/common/dmlogwrites
index 7a8a9078cb8b65..c054acb875a384 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -139,7 +139,7 @@ _log_writes_mkfs()
 _log_writes_mount()
 {
 	_scratch_options mount
-	$MOUNT_PROG -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \
+	_mount -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \
 		$LOGWRITES_DMDEV $SCRATCH_MNT
 }
 
diff --git a/common/overlay b/common/overlay
index faa9339a6477f7..da1d8d2c3183f4 100644
--- a/common/overlay
+++ b/common/overlay
@@ -29,13 +29,13 @@ _overlay_mount_dirs()
 	[ -n "$upperdir" ] && [ "$upperdir" != "-" ] && \
 		diropts+=",upperdir=$upperdir,workdir=$workdir"
 
-	$MOUNT_PROG -t overlay $diropts `_common_dev_mount_options $*`
+	_mount -t overlay $diropts `_common_dev_mount_options $*`
 }
 
 # Mount with mnt/dev of scratch mount and custom mount options
 _overlay_scratch_mount_opts()
 {
-	$MOUNT_PROG -t overlay $OVL_BASE_SCRATCH_MNT $SCRATCH_MNT $*
+	_mount -t overlay $OVL_BASE_SCRATCH_MNT $SCRATCH_MNT $*
 }
 
 # Mount with same options/mnt/dev of scratch mount, but optionally
@@ -127,7 +127,7 @@ _overlay_base_scratch_mount()
 _overlay_scratch_mount()
 {
 	if echo "$*" | grep -q remount; then
-		$MOUNT_PROG $SCRATCH_MNT $*
+		_mount $SCRATCH_MNT $*
 		return
 	fi
 
diff --git a/tests/btrfs/075 b/tests/btrfs/075
index 917993ca2da3a6..737c4ffdd57865 100755
--- a/tests/btrfs/075
+++ b/tests/btrfs/075
@@ -37,7 +37,7 @@ _scratch_mount
 subvol_mnt=$TEST_DIR/$seq.mnt
 mkdir -p $subvol_mnt
 $BTRFS_UTIL_PROG subvolume create $SCRATCH_MNT/subvol >>$seqres.full 2>&1
-$MOUNT_PROG -o subvol=subvol $SELINUX_MOUNT_OPTIONS $SCRATCH_DEV $subvol_mnt
+_mount -o subvol=subvol $SELINUX_MOUNT_OPTIONS $SCRATCH_DEV $subvol_mnt
 status=$?
 
 exit
diff --git a/tests/btrfs/208 b/tests/btrfs/208
index 5ea732ae8f71a7..93a999541dab06 100755
--- a/tests/btrfs/208
+++ b/tests/btrfs/208
@@ -45,7 +45,7 @@ _scratch_unmount
 
 # Now we mount the subvol2, which makes subvol3 not accessible for this mount
 # point, but we should be able to delete it using it's subvolume id
-$MOUNT_PROG -o subvol=subvol2 $SCRATCH_DEV $SCRATCH_MNT
+_mount -o subvol=subvol2 $SCRATCH_DEV $SCRATCH_MNT
 _delete_and_list subvol3 "Last remaining subvolume:"
 _scratch_unmount
 
diff --git a/tests/ext4/032 b/tests/ext4/032
index 238ab178363c12..9a1b9312cc42cc 100755
--- a/tests/ext4/032
+++ b/tests/ext4/032
@@ -48,7 +48,7 @@ ext4_online_resize()
 		$seqres.full 2>&1 || _fail "mkfs failed"
 
 	echo "+++ mount image file" | tee -a $seqres.full
-	$MOUNT_PROG -t ${FSTYP} ${LOOP_DEVICE} ${IMG_MNT} > \
+	_mount -t ${FSTYP} ${LOOP_DEVICE} ${IMG_MNT} > \
 		/dev/null 2>&1 || _fail "mount failed"
 
 	echo "+++ resize fs to $final_size" | tee -a $seqres.full
diff --git a/tests/generic/067 b/tests/generic/067
index b561b7bc5946a2..b6e984f5231753 100755
--- a/tests/generic/067
+++ b/tests/generic/067
@@ -34,7 +34,7 @@ mount_nonexistent_mnt()
 {
 	echo "# mount to nonexistent mount point" >>$seqres.full
 	rm -rf $TEST_DIR/nosuchdir
-	$MOUNT_PROG $SCRATCH_DEV $TEST_DIR/nosuchdir >>$seqres.full 2>&1
+	_mount $SCRATCH_DEV $TEST_DIR/nosuchdir >>$seqres.full 2>&1
 }
 
 # fs driver should be able to handle mounting a free loop device gracefully
@@ -43,7 +43,7 @@ mount_free_loopdev()
 {
 	echo "# mount a free loop device" >>$seqres.full
 	loopdev=`losetup -f`
-	$MOUNT_PROG -t $FSTYP $loopdev $SCRATCH_MNT >>$seqres.full 2>&1
+	_mount -t $FSTYP $loopdev $SCRATCH_MNT >>$seqres.full 2>&1
 }
 
 # mount with wrong fs type specified.
@@ -55,7 +55,7 @@ mount_wrong_fstype()
 		fs=xfs
 	fi
 	echo "# mount with wrong fs type" >>$seqres.full
-	$MOUNT_PROG -t $fs $SCRATCH_DEV $SCRATCH_MNT >>$seqres.full 2>&1
+	_mount -t $fs $SCRATCH_DEV $SCRATCH_MNT >>$seqres.full 2>&1
 }
 
 # umount a symlink to device, which is not mounted.
diff --git a/tests/generic/085 b/tests/generic/085
index cfe6112d6b444d..cbabd257cad8f0 100755
--- a/tests/generic/085
+++ b/tests/generic/085
@@ -69,7 +69,7 @@ for ((i=0; i<100; i++)); do
 done &
 pid=$!
 for ((i=0; i<100; i++)); do
-	$MOUNT_PROG $lvdev $SCRATCH_MNT >/dev/null 2>&1
+	_mount $lvdev $SCRATCH_MNT >/dev/null 2>&1
 	$UMOUNT_PROG $lvdev >/dev/null 2>&1
 done &
 pid="$pid $!"
diff --git a/tests/generic/361 b/tests/generic/361
index c56157391d3209..c2ebda3c1a01ad 100755
--- a/tests/generic/361
+++ b/tests/generic/361
@@ -52,7 +52,7 @@ fi
 $XFS_IO_PROG -fc "pwrite 0 520m" $fs_mnt/testfile >>$seqres.full 2>&1
 
 # remount should not hang
-$MOUNT_PROG -o remount,ro $fs_mnt >>$seqres.full 2>&1
+_mount -o remount,ro $fs_mnt >>$seqres.full 2>&1
 
 # success, all done
 echo "Silence is golden"
diff --git a/tests/generic/373 b/tests/generic/373
index 3bd46963a76686..0d5a50cbee40b8 100755
--- a/tests/generic/373
+++ b/tests/generic/373
@@ -42,7 +42,7 @@ blksz=65536
 sz=$((blksz * blocks))
 
 echo "Mount otherdir"
-$MOUNT_PROG --bind $SCRATCH_MNT $otherdir
+_mount --bind $SCRATCH_MNT $otherdir
 
 echo "Create file"
 _pwrite_byte 0x61 0 $sz $testdir/file >> $seqres.full
diff --git a/tests/generic/374 b/tests/generic/374
index acb23d17289784..977a2b268bbc98 100755
--- a/tests/generic/374
+++ b/tests/generic/374
@@ -41,7 +41,7 @@ blksz=65536
 sz=$((blocks * blksz))
 
 echo "Mount otherdir"
-$MOUNT_PROG --bind $SCRATCH_MNT $otherdir
+_mount --bind $SCRATCH_MNT $otherdir
 
 echo "Create file"
 _pwrite_byte 0x61 0 $sz $testdir/file >> $seqres.full
diff --git a/tests/generic/409 b/tests/generic/409
index b7edc2ac664461..79468e2b0ddb41 100755
--- a/tests/generic/409
+++ b/tests/generic/409
@@ -87,7 +87,7 @@ start_test()
 
 	_scratch_mkfs >$seqres.full 2>&1
 	_get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD
-	$MOUNT_PROG --make-"${type}" $MNTHEAD
+	_mount --make-"${type}" $MNTHEAD
 	mkdir $mpA $mpB $mpC $mpD
 }
 
@@ -107,9 +107,9 @@ bind_run()
 	echo "bind $source on $dest"
 	_get_mount -t $FSTYP $SCRATCH_DEV $mpA
 	mkdir -p $mpA/dir 2>/dev/null
-	$MOUNT_PROG --make-shared $mpA
+	_mount --make-shared $mpA
 	_get_mount --bind $mpA $mpB
-	$MOUNT_PROG --make-"$source" $mpB
+	_mount --make-"$source" $mpB
 	# maybe unbindable at here
 	_get_mount --bind $mpB $mpC 2>/dev/null
 	if [ $? -ne 0 ]; then
diff --git a/tests/generic/410 b/tests/generic/410
index 902f27144285e4..db8c97dbac7701 100755
--- a/tests/generic/410
+++ b/tests/generic/410
@@ -93,7 +93,7 @@ start_test()
 
 	_scratch_mkfs >>$seqres.full 2>&1
 	_get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD
-	$MOUNT_PROG --make-"${type}" $MNTHEAD
+	_mount --make-"${type}" $MNTHEAD
 	mkdir $mpA $mpB $mpC
 }
 
@@ -117,14 +117,14 @@ run()
 	echo "make-$cmd a $orgs mount"
 	_get_mount -t $FSTYP $SCRATCH_DEV $mpA
 	mkdir -p $mpA/dir 2>/dev/null
-	$MOUNT_PROG --make-shared $mpA
+	_mount --make-shared $mpA
 
 	# prepare the original status on mpB
 	_get_mount --bind $mpA $mpB
 	# shared&slave status need to do make-slave then make-shared
 	# two operations.
 	for t in $orgs; do
-		$MOUNT_PROG --make-"$t" $mpB
+		_mount --make-"$t" $mpB
 	done
 
 	# "before" for prepare and check original status
@@ -145,7 +145,7 @@ run()
 			_put_mount # umount C
 		fi
 		if [ "$i" = "before" ];then
-			$MOUNT_PROG --make-"${cmd}" $mpB
+			_mount --make-"${cmd}" $mpB
 		fi
 	done
 
diff --git a/tests/generic/411 b/tests/generic/411
index c35436c82e988e..09a813f5d3028e 100755
--- a/tests/generic/411
+++ b/tests/generic/411
@@ -76,7 +76,7 @@ start_test()
 
 	_scratch_mkfs >$seqres.full 2>&1
 	_get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD
-	$MOUNT_PROG --make-"${type}" $MNTHEAD
+	_mount --make-"${type}" $MNTHEAD
 	mkdir $mpA $mpB $mpC
 }
 
@@ -99,11 +99,11 @@ crash_test()
 
 	_get_mount -t $FSTYP $SCRATCH_DEV $mpA
 	mkdir $mpA/mnt1
-	$MOUNT_PROG --make-shared $mpA
+	_mount --make-shared $mpA
 	_get_mount --bind $mpA $mpB
 	_get_mount --bind $mpA $mpC
-	$MOUNT_PROG --make-slave $mpB
-	$MOUNT_PROG --make-slave $mpC
+	_mount --make-slave $mpB
+	_mount --make-slave $mpC
 	_get_mount -t $FSTYP $SCRATCH_DEV $mpA/mnt1
 	mkdir $mpA/mnt1/mnt2
 
diff --git a/tests/generic/589 b/tests/generic/589
index 0ce16556a05df9..6f69abd17ab01e 100755
--- a/tests/generic/589
+++ b/tests/generic/589
@@ -80,12 +80,12 @@ start_test()
 
 	_get_mount -t $FSTYP $SCRATCH_DEV $SRCHEAD
 	# make sure $SRCHEAD is private
-	$MOUNT_PROG --make-private $SRCHEAD
+	_mount --make-private $SRCHEAD
 
 	_get_mount -t $FSTYP $SCRATCH_DEV $DSTHEAD
 	# test start with a bind, then make-shared $DSTHEAD
 	_get_mount --bind $DSTHEAD $DSTHEAD
-	$MOUNT_PROG --make-"${type}" $DSTHEAD
+	_mount --make-"${type}" $DSTHEAD
 	mkdir $mpA $mpB $mpC $mpD
 }
 
@@ -105,10 +105,10 @@ move_run()
 	echo "move $source to $dest"
 	_get_mount -t $FSTYP $SCRATCH_DEV $mpA
 	mkdir -p $mpA/dir 2>/dev/null
-	$MOUNT_PROG --make-shared $mpA
+	_mount --make-shared $mpA
 	# need a peer for slave later
 	_get_mount --bind $mpA $mpB
-	$MOUNT_PROG --make-"$source" $mpB
+	_mount --make-"$source" $mpB
 	# maybe unbindable at here
 	_get_mount --move $mpB $mpC 2>/dev/null
 	if [ $? -ne 0 ]; then
diff --git a/tests/overlay/005 b/tests/overlay/005
index 4c11d5e1b6f701..01914ee17b9a30 100755
--- a/tests/overlay/005
+++ b/tests/overlay/005
@@ -50,8 +50,8 @@ $MKFS_XFS_PROG -f -n ftype=1 $upper_loop_dev >>$seqres.full 2>&1
 # mount underlying xfs
 mkdir -p ${OVL_BASE_SCRATCH_MNT}/lowermnt
 mkdir -p ${OVL_BASE_SCRATCH_MNT}/uppermnt
-$MOUNT_PROG $fs_loop_dev ${OVL_BASE_SCRATCH_MNT}/lowermnt
-$MOUNT_PROG $upper_loop_dev ${OVL_BASE_SCRATCH_MNT}/uppermnt
+_mount $fs_loop_dev ${OVL_BASE_SCRATCH_MNT}/lowermnt
+_mount $upper_loop_dev ${OVL_BASE_SCRATCH_MNT}/uppermnt
 
 # prepare dirs
 mkdir -p ${OVL_BASE_SCRATCH_MNT}/lowermnt/lower
diff --git a/tests/overlay/025 b/tests/overlay/025
index dc819a39348b69..6ba46191b557be 100755
--- a/tests/overlay/025
+++ b/tests/overlay/025
@@ -36,7 +36,7 @@ _require_extra_fs tmpfs
 # create a tmpfs in $TEST_DIR
 tmpfsdir=$TEST_DIR/tmpfs
 mkdir -p $tmpfsdir
-$MOUNT_PROG -t tmpfs tmpfs $tmpfsdir
+_mount -t tmpfs tmpfs $tmpfsdir
 
 mkdir -p $tmpfsdir/{lower,upper,work,mnt}
 mkdir -p -m 0 $tmpfsdir/upper/testd
diff --git a/tests/overlay/035 b/tests/overlay/035
index 0b3257c4cce09e..cede58790e1b9d 100755
--- a/tests/overlay/035
+++ b/tests/overlay/035
@@ -42,7 +42,7 @@ mkdir -p $lowerdir1 $lowerdir2 $upperdir $workdir
 # Verify that overlay is mounted read-only and that it cannot be remounted rw.
 _overlay_scratch_mount_opts -o"lowerdir=$lowerdir2:$lowerdir1"
 touch $SCRATCH_MNT/foo 2>&1 | _filter_scratch
-$MOUNT_PROG -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount
+_mount -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount
 $UMOUNT_PROG $SCRATCH_MNT
 
 # Make workdir immutable to prevent workdir re-create on mount
diff --git a/tests/overlay/062 b/tests/overlay/062
index e44628b7459bfb..9a1db7419c4ca2 100755
--- a/tests/overlay/062
+++ b/tests/overlay/062
@@ -60,7 +60,7 @@ lowertestdir=$lower2/testdir
 create_test_files $lowertestdir
 
 # bind mount to pin lower test dir dentry to dcache
-$MOUNT_PROG --bind $lowertestdir $lowertestdir
+_mount --bind $lowertestdir $lowertestdir
 
 # For non-upper overlay mount, nfs_export requires disabling redirect_dir.
 _overlay_scratch_mount_opts \
diff --git a/tests/overlay/083 b/tests/overlay/083
index d037d4c858e6a6..56e02f8cc77d73 100755
--- a/tests/overlay/083
+++ b/tests/overlay/083
@@ -40,14 +40,14 @@ mkdir -p "$lowerdir_spaces" "$lowerdir_colons" "$lowerdir_commas"
 
 # _overlay_mount_* helpers do not handle special chars well, so execute mount directly.
 # if escaped colons are not parsed correctly, mount will fail.
-$MOUNT_PROG -t overlay ovl_esc_test $SCRATCH_MNT \
+_mount -t overlay ovl_esc_test $SCRATCH_MNT \
 	-o"upperdir=$upperdir,workdir=$workdir" \
 	-o"lowerdir=$lowerdir_colons_esc:$lowerdir_spaces" \
 	2>&1 | tee -a $seqres.full
 
 # if spaces are not escaped when showing mount options,
 # mount command will not show the word 'spaces' after the spaces
-$MOUNT_PROG -t overlay | grep ovl_esc_test  | tee -a $seqres.full | grep -v spaces && \
+_mount -t overlay | grep ovl_esc_test  | tee -a $seqres.full | grep -v spaces && \
 	echo "ERROR: escaped spaces truncated from lowerdir mount option"
 
 # Re-create the upper/work dirs to mount them with a different lower
@@ -65,7 +65,7 @@ mkdir -p "$upperdir" "$workdir"
 # and this test will fail, but the failure would indicate a libmount issue, not
 # a kernel issue.  Therefore, force libmount to use mount(2) syscall, so we only
 # test the kernel fix.
-LIBMOUNT_FORCE_MOUNT2=always $MOUNT_PROG -t overlay $OVL_BASE_SCRATCH_DEV $SCRATCH_MNT \
+LIBMOUNT_FORCE_MOUNT2=always _mount -t overlay $OVL_BASE_SCRATCH_DEV $SCRATCH_MNT \
 	-o"upperdir=$upperdir,workdir=$workdir,lowerdir=$lowerdir_commas_esc" 2>> $seqres.full || \
 	echo "ERROR: incorrect parsing of escaped comma in lowerdir mount option"
 
diff --git a/tests/overlay/086 b/tests/overlay/086
index 9c8a00588595f6..23c56d074ff34a 100755
--- a/tests/overlay/086
+++ b/tests/overlay/086
@@ -33,21 +33,21 @@ mkdir -p "$lowerdir_spaces" "$lowerdir_colons"
 # _overlay_mount_* helpers do not handle lowerdir+,datadir+, so execute mount directly.
 
 # check illegal combinations and order of lowerdir,lowerdir+,datadir+
-$MOUNT_PROG -t overlay none $SCRATCH_MNT \
+_mount -t overlay none $SCRATCH_MNT \
 	-o"lowerdir=$lowerdir,lowerdir+=$lowerdir_colons" \
 	2>> $seqres.full && \
 	echo "ERROR: invalid combination of lowerdir and lowerdir+ mount options"
 
 $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
 
-$MOUNT_PROG -t overlay none $SCRATCH_MNT \
+_mount -t overlay none $SCRATCH_MNT \
 	-o"lowerdir=$lowerdir,datadir+=$lowerdir_colons" \
 	-o redirect_dir=follow,metacopy=on 2>> $seqres.full && \
 	echo "ERROR: invalid combination of lowerdir and datadir+ mount options"
 
 $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
 
-$MOUNT_PROG -t overlay none $SCRATCH_MNT \
+_mount -t overlay none $SCRATCH_MNT \
 	-o"datadir+=$lowerdir,lowerdir+=$lowerdir_colons" \
 	-o redirect_dir=follow,metacopy=on 2>> $seqres.full && \
 	echo "ERROR: invalid order of lowerdir+ and datadir+ mount options"
@@ -55,7 +55,7 @@ $MOUNT_PROG -t overlay none $SCRATCH_MNT \
 $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
 
 # mount is expected to fail with escaped colons.
-$MOUNT_PROG -t overlay none $SCRATCH_MNT \
+_mount -t overlay none $SCRATCH_MNT \
 	-o"lowerdir+=$lowerdir_colons_esc" \
 	2>> $seqres.full && \
 	echo "ERROR: incorrect parsing of escaped colons in lowerdir+ mount option"
@@ -63,14 +63,14 @@ $MOUNT_PROG -t overlay none $SCRATCH_MNT \
 $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
 
 # mount is expected to succeed without escaped colons.
-$MOUNT_PROG -t overlay ovl_esc_test $SCRATCH_MNT \
+_mount -t overlay ovl_esc_test $SCRATCH_MNT \
 	-o"lowerdir+=$lowerdir_colons,datadir+=$lowerdir_spaces" \
 	-o redirect_dir=follow,metacopy=on \
 	2>&1 | tee -a $seqres.full
 
 # if spaces are not escaped when showing mount options,
 # mount command will not show the word 'spaces' after the spaces
-$MOUNT_PROG -t overlay | grep ovl_esc_test | tee -a $seqres.full | \
+_mount -t overlay | grep ovl_esc_test | tee -a $seqres.full | \
 	grep -q 'datadir+'.*spaces || \
 	echo "ERROR: escaped spaces truncated from datadir+ mount option"
 
diff --git a/tests/xfs/078 b/tests/xfs/078
index 834c99a0020153..4224fd40bc9fea 100755
--- a/tests/xfs/078
+++ b/tests/xfs/078
@@ -75,7 +75,7 @@ _grow_loop()
 	$XFS_IO_PROG -c "pwrite $new_size $bsize" $LOOP_IMG | _filter_io
 	LOOP_DEV=`_create_loop_device $LOOP_IMG`
 	echo "*** mount loop filesystem"
-	$MOUNT_PROG -t xfs $LOOP_DEV $LOOP_MNT
+	_mount -t xfs $LOOP_DEV $LOOP_MNT
 
 	echo "*** grow loop filesystem"
 	$XFS_GROWFS_PROG $LOOP_MNT 2>&1 |  _filter_growfs 2>&1
diff --git a/tests/xfs/149 b/tests/xfs/149
index f1b2405e7bff11..bbaf86132dff37 100755
--- a/tests/xfs/149
+++ b/tests/xfs/149
@@ -64,7 +64,7 @@ $XFS_GROWFS_PROG $loop_symlink 2>&1 | sed -e s:$loop_symlink:LOOPSYMLINK:
 # These mounted operations should pass
 
 echo "=== mount ==="
-$MOUNT_PROG $loop_dev $mntdir || _fail "!!! failed to loopback mount"
+_mount $loop_dev $mntdir || _fail "!!! failed to loopback mount"
 
 echo "=== xfs_growfs - check device node ==="
 $XFS_GROWFS_PROG -D 8192 $loop_dev > /dev/null
@@ -76,7 +76,7 @@ echo "=== unmount ==="
 $UMOUNT_PROG $mntdir || _fail "!!! failed to unmount"
 
 echo "=== mount device symlink ==="
-$MOUNT_PROG $loop_symlink $mntdir || _fail "!!! failed to loopback mount"
+_mount $loop_symlink $mntdir || _fail "!!! failed to loopback mount"
 
 echo "=== xfs_growfs - check device symlink ==="
 $XFS_GROWFS_PROG -D 16384 $loop_symlink > /dev/null
diff --git a/tests/xfs/289 b/tests/xfs/289
index cf0f2883c4f373..089a3f8cc14a68 100755
--- a/tests/xfs/289
+++ b/tests/xfs/289
@@ -56,7 +56,7 @@ echo "=== xfs_growfs - plain file - should be rejected ==="
 $XFS_GROWFS_PROG $tmpfile 2>&1 | _filter_test_dir
 
 echo "=== mount ==="
-$MOUNT_PROG -o loop $tmpfile $tmpdir || _fail "!!! failed to loopback mount"
+_mount -o loop $tmpfile $tmpdir || _fail "!!! failed to loopback mount"
 
 echo "=== xfs_growfs - mounted - check absolute path ==="
 $XFS_GROWFS_PROG -D 8192 $tmpdir | _filter_test_dir > /dev/null
@@ -79,7 +79,7 @@ $XFS_GROWFS_PROG -D 28672 tmpsymlink.$$ > /dev/null
 
 echo "=== xfs_growfs - bind mount ==="
 mkdir $tmpbind
-$MOUNT_PROG -o bind $tmpdir $tmpbind
+_mount -o bind $tmpdir $tmpbind
 $XFS_GROWFS_PROG -D 32768 $tmpbind | _filter_test_dir > /dev/null
 
 echo "=== xfs_growfs - bind mount - relative path ==="
diff --git a/tests/xfs/544 b/tests/xfs/544
index bd694453d5409f..a3a23c1726ca1c 100755
--- a/tests/xfs/544
+++ b/tests/xfs/544
@@ -35,7 +35,7 @@ mkdir $TEST_DIR/dest.$seq
 # Test
 echo "*** dump with bind-mounted test ***" >> $seqres.full
 
-$MOUNT_PROG --bind $TEST_DIR/src.$seq $TEST_DIR/dest.$seq || _fail "Bind mount failed"
+_mount --bind $TEST_DIR/src.$seq $TEST_DIR/dest.$seq || _fail "Bind mount failed"
 
 $XFSDUMP_PROG -L session -M test -f $tmp.dump $TEST_DIR/dest.$seq \
 	>> $seqres.full 2>&1 && echo "dump with bind-mounted should be failed, but passed."



* [PATCH 2/2] check: capture dmesg of mount failures if test fails
  2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong
@ 2024-12-31 23:56   ` Darrick J. Wong
  2025-01-06 11:18     ` Nirjhar Roy
  1 sibling, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Capture the kernel output after a mount failure occurs.  If the test
itself fails, then keep the logging output for further diagnosis.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 check                  |   22 +++++++++++++++++++++-
 common/rc              |   26 +++++++++++++++++++++++++-
 common/report          |    8 ++++++++
 tests/selftest/008     |   20 ++++++++++++++++++++
 tests/selftest/008.out |    1 +
 5 files changed, 75 insertions(+), 2 deletions(-)
 create mode 100755 tests/selftest/008
 create mode 100644 tests/selftest/008.out
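
For orientation, the way the tentative log is settled once a test finishes
can be sketched as below.  This is only an illustration under stated
assumptions: settle_mountfail is a hypothetical helper that does not exist
in fstests, and its first argument stands in for $seqres.  The real logic
lives inline in run_section() in the check diff that follows.

```shell
#!/bin/bash
# Illustrative sketch only; settle_mountfail is a hypothetical helper and
# not part of fstests.  It mimics how check treats the tentative log.

# $1: per-test result prefix (stands in for $seqres)
# $2: final test status ("pass", "fail", "notrun", ...)
settle_mountfail()
{
	local prefix="$1"
	local status="$2"
	local tentative="${prefix}.mountfail?"

	# No mount failed during the test, so there is nothing to settle.
	[ -f "$tentative" ] || return 0

	if [ "$status" = "fail" ]; then
		# Keep the log under its final name for later diagnosis.
		mv "$tentative" "${prefix}.mountfail"
	else
		# Passing and skipped tests may provoke mount failures on
		# purpose, so their logs are discarded.
		rm -f "$tentative"
	fi
}
```

The trailing "?" is a literal part of the tentative file name, which is why
every reference to it is quoted; unquoted, the shell would treat it as a
glob character.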


diff --git a/check b/check
index 9222cd7e4f8197..a46ea1a54d78bb 100755
--- a/check
+++ b/check
@@ -614,7 +614,7 @@ _stash_fail_loop_files() {
 	local seq_prefix="${REPORT_DIR}/${1}"
 	local cp_suffix="$2"
 
-	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints"; do
+	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints" ".mountfail"; do
 		rm -f "${seq_prefix}${i}${cp_suffix}"
 		if [ -f "${seq_prefix}${i}" ]; then
 			cp "${seq_prefix}${i}" "${seq_prefix}${i}${cp_suffix}"
@@ -994,6 +994,7 @@ function run_section()
 				      echo -n "	$seqnum -- "
 			cat $seqres.notrun
 			tc_status="notrun"
+			rm -f "$seqres.mountfail?"
 			_stash_test_status "$seqnum" "$tc_status"
 
 			# Unmount the scratch fs so that we can wipe the scratch
@@ -1053,6 +1054,7 @@ function run_section()
 		if [ ! -f $seq.out ]; then
 			_dump_err "no qualified output"
 			tc_status="fail"
+			rm -f "$seqres.mountfail?"
 			_stash_test_status "$seqnum" "$tc_status"
 			continue;
 		fi
@@ -1089,6 +1091,24 @@ function run_section()
 				rm -f $seqres.hints
 			fi
 		fi
+
+		if [ -f "$seqres.mountfail?" ]; then
+			if [ "$tc_status" = "fail" ]; then
+				# Let the user know if there were mount
+				# failures on a test that failed because that
+				# could be interesting.
+				mv "$seqres.mountfail?" "$seqres.mountfail"
+				_dump_err "check: possible mount failures (see $seqres.mountfail)"
+				test -f $seqres.mountfail && \
+					maybe_compress_logfile $seqres.mountfail $MAX_MOUNTFAIL_SIZE
+			else
+				# Don't retain mount failure logs for tests
+				# that pass or were skipped because some tests
+				# intentionally drive mount failures.
+				rm -f "$seqres.mountfail?"
+			fi
+		fi
+
 		_stash_test_status "$seqnum" "$tc_status"
 	done
 
diff --git a/common/rc b/common/rc
index d7dfb55bbbd7e1..0ede68eb912440 100644
--- a/common/rc
+++ b/common/rc
@@ -204,9 +204,33 @@ _get_hugepagesize()
 	awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo
 }
 
+# Does dmesg have a --since flag?
+_dmesg_detect_since()
+{
+	if [ -z "$DMESG_HAS_SINCE" ]; then
+		if dmesg --help | grep -q -- --since; then
+			DMESG_HAS_SINCE=yes
+		else
+			DMESG_HAS_SINCE=no
+		fi
+	fi
+	test "$DMESG_HAS_SINCE" = "yes"
+}
+
 _mount()
 {
-    $MOUNT_PROG $*
+	$MOUNT_PROG $*
+	ret=$?
+	if [ "$ret" -ne 0 ]; then
+		echo "\"$MOUNT_PROG $*\" failed at $(date)" >> "$seqres.mountfail?"
+		if _dmesg_detect_since; then
+			dmesg --since '30s ago' >> "$seqres.mountfail?"
+		else
+			dmesg | tail -n 100 >> "$seqres.mountfail?"
+		fi
+	fi
+
+	return $ret
 }
 
 # Call _mount to do mount operation but also save mountpoint to
diff --git a/common/report b/common/report
index 0e91e481f9725a..b57697f76dafb2 100644
--- a/common/report
+++ b/common/report
@@ -199,6 +199,7 @@ _xunit_make_testcase_report()
 		local out_src="${SRC_DIR}/${test_name}.out"
 		local full_file="${REPORT_DIR}/${test_name}.full"
 		local dmesg_file="${REPORT_DIR}/${test_name}.dmesg"
+		local mountfail_file="${REPORT_DIR}/${test_name}.mountfail"
 		local outbad_file="${REPORT_DIR}/${test_name}.out.bad"
 		if [ -z "$_err_msg" ]; then
 			_err_msg="Test $test_name failed, reason unknown"
@@ -225,6 +226,13 @@ _xunit_make_testcase_report()
 			printf ']]>\n'	>>$report
 			echo -e "\t\t</system-err>" >> $report
 		fi
+		if [ -z "$quiet" -a -f "$mountfail_file" ]; then
+			echo -e "\t\t<mount-failure>" >> $report
+			printf	'<![CDATA[\n' >>$report
+			cat "$mountfail_file" | tr -dc '[:print:][:space:]' | encode_cdata >>$report
+			printf ']]>\n'	>>$report
+			echo -e "\t\t</mount-failure>" >> $report
+		fi
 		;;
 	*)
 		echo -e "\t\t<failure message=\"Unknown test_status=$test_status\" type=\"TestFail\"/>" >> $report
diff --git a/tests/selftest/008 b/tests/selftest/008
new file mode 100755
index 00000000000000..db80ffe6f77339
--- /dev/null
+++ b/tests/selftest/008
@@ -0,0 +1,20 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 008
+#
+# Test mount failure capture.
+#
+. ./common/preamble
+_begin_fstest selftest
+
+_require_command "$WIPEFS_PROG" wipefs
+_require_scratch
+
+$WIPEFS_PROG -a $SCRATCH_DEV
+_scratch_mount &>> $seqres.full
+
+# success, all done
+status=0
+exit
diff --git a/tests/selftest/008.out b/tests/selftest/008.out
new file mode 100644
index 00000000000000..aaff95f3f48372
--- /dev/null
+++ b/tests/selftest/008.out
@@ -0,0 +1 @@
+QA output created by 008



* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails
  2024-12-31 23:56   ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong
@ 2025-01-06 11:18     ` Nirjhar Roy
  2025-01-06 23:52       ` Darrick J. Wong
  0 siblings, 1 reply; 110+ messages in thread
From: Nirjhar Roy @ 2025-01-06 11:18 UTC (permalink / raw)
  To: Darrick J. Wong, zlang; +Cc: fstests, linux-xfs

On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Capture the kernel output after a mount failure occurs.  If the test
> itself fails, then keep the logging output for further diagnosis.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  check                  |   22 +++++++++++++++++++++-
>  common/rc              |   26 +++++++++++++++++++++++++-
>  common/report          |    8 ++++++++
>  tests/selftest/008     |   20 ++++++++++++++++++++
>  tests/selftest/008.out |    1 +
>  5 files changed, 75 insertions(+), 2 deletions(-)
>  create mode 100755 tests/selftest/008
>  create mode 100644 tests/selftest/008.out
> 
> 
> diff --git a/check b/check
> index 9222cd7e4f8197..a46ea1a54d78bb 100755
> --- a/check
> +++ b/check
> @@ -614,7 +614,7 @@ _stash_fail_loop_files() {
>  	local seq_prefix="${REPORT_DIR}/${1}"
>  	local cp_suffix="$2"
>  
> -	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core"
> ".hints"; do
> +	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints"
> ".mountfail"; do
>  		rm -f "${seq_prefix}${i}${cp_suffix}"
>  		if [ -f "${seq_prefix}${i}" ]; then
>  			cp "${seq_prefix}${i}"
> "${seq_prefix}${i}${cp_suffix}"
> @@ -994,6 +994,7 @@ function run_section()
>  				      echo -n "	$seqnum -- "
>  			cat $seqres.notrun
>  			tc_status="notrun"
> +			rm -f "$seqres.mountfail?"
>  			_stash_test_status "$seqnum" "$tc_status"
>  
>  			# Unmount the scratch fs so that we can wipe
> the scratch
> @@ -1053,6 +1054,7 @@ function run_section()
>  		if [ ! -f $seq.out ]; then
>  			_dump_err "no qualified output"
>  			tc_status="fail"
> +			rm -f "$seqres.mountfail?"
>  			_stash_test_status "$seqnum" "$tc_status"
>  			continue;
>  		fi
> @@ -1089,6 +1091,24 @@ function run_section()
>  				rm -f $seqres.hints
>  			fi
>  		fi
> +
> +		if [ -f "$seqres.mountfail?" ]; then
> +			if [ "$tc_status" = "fail" ]; then
> +				# Let the user know if there were mount
> +				# failures on a test that failed
> because that
> +				# could be interesting.
> +				mv "$seqres.mountfail?"
> "$seqres.mountfail"
> +				_dump_err "check: possible mount
> failures (see $seqres.mountfail)"
> +				test -f $seqres.mountfail && \
> +					maybe_compress_logfile
> $seqres.mountfail $MAX_MOUNTFAIL_SIZE
> +			else
> +				# Don't retain mount failure logs for
> tests
> +				# that pass or were skipped because
> some tests
> +				# intentionally drive mount failures.
> +				rm -f "$seqres.mountfail?"
> +			fi
> +		fi
> +
>  		_stash_test_status "$seqnum" "$tc_status"
>  	done
>  
> diff --git a/common/rc b/common/rc
> index d7dfb55bbbd7e1..0ede68eb912440 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -204,9 +204,33 @@ _get_hugepagesize()
>  	awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo
>  }
>  
> +# Does dmesg have a --since flag?
> +_dmesg_detect_since()
> +{
> +	if [ -z "$DMESG_HAS_SINCE" ]; then
> +		if dmesg --help | grep -q -- --since; then
> +			DMESG_HAS_SINCE=yes
> +		else
> +			DMESG_HAS_SINCE=no
> +		fi
> +	fi
> +	test "$DMESG_HAS_SINCE" = "yes"
> +}
> +
>  _mount()
>  {
> -    $MOUNT_PROG $*
> +	$MOUNT_PROG $*
> +	ret=$?
> +	if [ "$ret" -ne 0 ]; then
> +		echo "\"$MOUNT_PROG $*\" failed at $(date)" >>
> "$seqres.mountfail?"
> +		if _dmesg_detect_since; then
> +			dmesg --since '30s ago' >> "$seqres.mountfail?"
> +		else
> +			dmesg | tail -n 100 >> "$seqres.mountfail?"
Is it possible to grep for a mount failure message in dmesg and then
capture the last n lines? Do you think that will be more accurate?

Also, do you think it is useful to make this 100 configurable instead
of hardcoding? 
> +		fi
> +	fi
> +
> +	return $ret
>  }
>  
>  # Call _mount to do mount operation but also save mountpoint to
> diff --git a/common/report b/common/report
> index 0e91e481f9725a..b57697f76dafb2 100644
> --- a/common/report
> +++ b/common/report
> @@ -199,6 +199,7 @@ _xunit_make_testcase_report()
>  		local out_src="${SRC_DIR}/${test_name}.out"
>  		local full_file="${REPORT_DIR}/${test_name}.full"
>  		local dmesg_file="${REPORT_DIR}/${test_name}.dmesg"
> +		local
> mountfail_file="${REPORT_DIR}/${test_name}.mountfail"
>  		local outbad_file="${REPORT_DIR}/${test_name}.out.bad"
>  		if [ -z "$_err_msg" ]; then
>  			_err_msg="Test $test_name failed, reason
> unknown"
> @@ -225,6 +226,13 @@ _xunit_make_testcase_report()
>  			printf ']]>\n'	>>$report
>  			echo -e "\t\t</system-err>" >> $report
>  		fi
> +		if [ -z "$quiet" -a -f "$mountfail_file" ]; then
> +			echo -e "\t\t<mount-failure>" >> $report
> +			printf	'<![CDATA[\n' >>$report
> +			cat "$mountfail_file" | tr -dc
> '[:print:][:space:]' | encode_cdata >>$report
> +			printf ']]>\n'	>>$report
> +			echo -e "\t\t</mount-failure>" >> $report
> +		fi
>  		;;
>  	*)
>  		echo -e "\t\t<failure message=\"Unknown
> test_status=$test_status\" type=\"TestFail\"/>" >> $report
> diff --git a/tests/selftest/008 b/tests/selftest/008
> new file mode 100755
> index 00000000000000..db80ffe6f77339
> --- /dev/null
> +++ b/tests/selftest/008
> @@ -0,0 +1,20 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
> +#
> +# FS QA Test 008
> +#
> +# Test mount failure capture.
> +#
> +. ./common/preamble
> +_begin_fstest selftest
> +
> +_require_command "$WIPEFS_PROG" wipefs
> +_require_scratch
> +
> +$WIPEFS_PROG -a $SCRATCH_DEV
> +_scratch_mount &>> $seqres.full
Minor: Do you think adding some filtered messages from the captured
dmesg logs in the output will be helpful?  
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/selftest/008.out b/tests/selftest/008.out
> new file mode 100644
> index 00000000000000..aaff95f3f48372
> --- /dev/null
> +++ b/tests/selftest/008.out
> @@ -0,0 +1 @@
> +QA output created by 008
> 



* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails
  2025-01-06 11:18     ` Nirjhar Roy
@ 2025-01-06 23:52       ` Darrick J. Wong
  2025-01-13  5:55         ` Nirjhar Roy
  0 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2025-01-06 23:52 UTC (permalink / raw)
  To: Nirjhar Roy; +Cc: zlang, fstests, linux-xfs

On Mon, Jan 06, 2025 at 04:48:34PM +0530, Nirjhar Roy wrote:
> On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Capture the kernel output after a mount failure occurs.  If the test
> > itself fails, then keep the logging output for further diagnosis.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >  check                  |   22 +++++++++++++++++++++-
> >  common/rc              |   26 +++++++++++++++++++++++++-
> >  common/report          |    8 ++++++++
> >  tests/selftest/008     |   20 ++++++++++++++++++++
> >  tests/selftest/008.out |    1 +
> >  5 files changed, 75 insertions(+), 2 deletions(-)
> >  create mode 100755 tests/selftest/008
> >  create mode 100644 tests/selftest/008.out
> > 
> > 
> > diff --git a/check b/check
> > index 9222cd7e4f8197..a46ea1a54d78bb 100755
> > --- a/check
> > +++ b/check
> > @@ -614,7 +614,7 @@ _stash_fail_loop_files() {
> >  	local seq_prefix="${REPORT_DIR}/${1}"
> >  	local cp_suffix="$2"
> >  
> > -	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core"
> > ".hints"; do
> > +	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints"
> > ".mountfail"; do
> >  		rm -f "${seq_prefix}${i}${cp_suffix}"
> >  		if [ -f "${seq_prefix}${i}" ]; then
> >  			cp "${seq_prefix}${i}"
> > "${seq_prefix}${i}${cp_suffix}"
> > @@ -994,6 +994,7 @@ function run_section()
> >  				      echo -n "	$seqnum -- "
> >  			cat $seqres.notrun
> >  			tc_status="notrun"
> > +			rm -f "$seqres.mountfail?"
> >  			_stash_test_status "$seqnum" "$tc_status"
> >  
> >  			# Unmount the scratch fs so that we can wipe
> > the scratch
> > @@ -1053,6 +1054,7 @@ function run_section()
> >  		if [ ! -f $seq.out ]; then
> >  			_dump_err "no qualified output"
> >  			tc_status="fail"
> > +			rm -f "$seqres.mountfail?"
> >  			_stash_test_status "$seqnum" "$tc_status"
> >  			continue;
> >  		fi
> > @@ -1089,6 +1091,24 @@ function run_section()
> >  				rm -f $seqres.hints
> >  			fi
> >  		fi
> > +
> > +		if [ -f "$seqres.mountfail?" ]; then
> > +			if [ "$tc_status" = "fail" ]; then
> > +				# Let the user know if there were mount
> > +				# failures on a test that failed
> > because that
> > +				# could be interesting.
> > +				mv "$seqres.mountfail?"
> > "$seqres.mountfail"
> > +				_dump_err "check: possible mount
> > failures (see $seqres.mountfail)"
> > +				test -f $seqres.mountfail && \
> > +					maybe_compress_logfile
> > $seqres.mountfail $MAX_MOUNTFAIL_SIZE
> > +			else
> > +				# Don't retain mount failure logs for
> > tests
> > +				# that pass or were skipped because
> > some tests
> > +				# intentionally drive mount failures.
> > +				rm -f "$seqres.mountfail?"
> > +			fi
> > +		fi
> > +
> >  		_stash_test_status "$seqnum" "$tc_status"
> >  	done
> >  
> > diff --git a/common/rc b/common/rc
> > index d7dfb55bbbd7e1..0ede68eb912440 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -204,9 +204,33 @@ _get_hugepagesize()
> >  	awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo
> >  }
> >  
> > +# Does dmesg have a --since flag?
> > +_dmesg_detect_since()
> > +{
> > +	if [ -z "$DMESG_HAS_SINCE" ]; then
> > +		if dmesg --help | grep -q -- --since; then
> > +			DMESG_HAS_SINCE=yes
> > +		else
> > +			DMESG_HAS_SINCE=no
> > +		fi
> > +	fi
> > +	test "$DMESG_HAS_SINCE" = "yes"
> > +}
> > +
> >  _mount()
> >  {
> > -    $MOUNT_PROG $*
> > +	$MOUNT_PROG $*
> > +	ret=$?
> > +	if [ "$ret" -ne 0 ]; then
> > +		echo "\"$MOUNT_PROG $*\" failed at $(date)" >>
> > "$seqres.mountfail?"
> > +		if _dmesg_detect_since; then
> > +			dmesg --since '30s ago' >> "$seqres.mountfail?"
> > +		else
> > +			dmesg | tail -n 100 >> "$seqres.mountfail?"
> Is it possible to grep for a mount failure message in dmesg and then
> capture the last n lines? Do you think that will be more accurate?

Alas no, because there's no standard mount failure log message for us to
latch onto.

> Also, do you think it is useful to make this 100 configurable instead
> of hardcoding? 

I suppose, but why do you need more than 100?
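
If it were made configurable, a minimal sketch might look like the
following.  MOUNTFAIL_DMESG_LINES and _tail_kernel_log are hypothetical
names used only for illustration; neither exists in fstests.

```shell
# Hypothetical knob, not an existing fstests variable: how many trailing
# lines of kernel log to keep when dmesg lacks --since.
: "${MOUNTFAIL_DMESG_LINES:=100}"

# Keep only the last $MOUNTFAIL_DMESG_LINES lines of stdin.
_tail_kernel_log()
{
	tail -n "$MOUNTFAIL_DMESG_LINES"
}

# Intended use inside the fallback branch of _mount:
#	dmesg | _tail_kernel_log >> "$seqres.mountfail?"
```

The default stays at 100, so behaviour would only change for someone who
sets the variable in their config.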

> > +		fi
> > +	fi
> > +
> > +	return $ret
> >  }
> >  
> >  # Call _mount to do mount operation but also save mountpoint to
> > diff --git a/common/report b/common/report
> > index 0e91e481f9725a..b57697f76dafb2 100644
> > --- a/common/report
> > +++ b/common/report
> > @@ -199,6 +199,7 @@ _xunit_make_testcase_report()
> >  		local out_src="${SRC_DIR}/${test_name}.out"
> >  		local full_file="${REPORT_DIR}/${test_name}.full"
> >  		local dmesg_file="${REPORT_DIR}/${test_name}.dmesg"
> > +		local
> > mountfail_file="${REPORT_DIR}/${test_name}.mountfail"
> >  		local outbad_file="${REPORT_DIR}/${test_name}.out.bad"
> >  		if [ -z "$_err_msg" ]; then
> >  			_err_msg="Test $test_name failed, reason
> > unknown"
> > @@ -225,6 +226,13 @@ _xunit_make_testcase_report()
> >  			printf ']]>\n'	>>$report
> >  			echo -e "\t\t</system-err>" >> $report
> >  		fi
> > +		if [ -z "$quiet" -a -f "$mountfail_file" ]; then
> > +			echo -e "\t\t<mount-failure>" >> $report
> > +			printf	'<![CDATA[\n' >>$report
> > +			cat "$mountfail_file" | tr -dc
> > '[:print:][:space:]' | encode_cdata >>$report
> > +			printf ']]>\n'	>>$report
> > +			echo -e "\t\t</mount-failure>" >> $report
> > +		fi
> >  		;;
> >  	*)
> >  		echo -e "\t\t<failure message=\"Unknown
> > test_status=$test_status\" type=\"TestFail\"/>" >> $report
> > diff --git a/tests/selftest/008 b/tests/selftest/008
> > new file mode 100755
> > index 00000000000000..db80ffe6f77339
> > --- /dev/null
> > +++ b/tests/selftest/008
> > @@ -0,0 +1,20 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
> > +#
> > +# FS QA Test 008
> > +#
> > +# Test mount failure capture.
> > +#
> > +. ./common/preamble
> > +_begin_fstest selftest
> > +
> > +_require_command "$WIPEFS_PROG" wipefs
> > +_require_scratch
> > +
> > +$WIPEFS_PROG -a $SCRATCH_DEV
> > +_scratch_mount &>> $seqres.full
> Minor: Do you think adding some filtered messages from the captured
> dmesg logs in the output will be helpful?  

No, this test exists to make sure that the dmesg log is captured in
$RESULT_DIR.  We don't care about the mount(8) output.

--D

> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/selftest/008.out b/tests/selftest/008.out
> > new file mode 100644
> > index 00000000000000..aaff95f3f48372
> > --- /dev/null
> > +++ b/tests/selftest/008.out
> > @@ -0,0 +1 @@
> > +QA output created by 008
> > 
> 
> 


* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails
  2025-01-06 23:52       ` Darrick J. Wong
@ 2025-01-13  5:55         ` Nirjhar Roy
  0 siblings, 0 replies; 110+ messages in thread
From: Nirjhar Roy @ 2025-01-13  5:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: zlang, fstests, linux-xfs

On Mon, 2025-01-06 at 15:52 -0800, Darrick J. Wong wrote:
> On Mon, Jan 06, 2025 at 04:48:34PM +0530, Nirjhar Roy wrote:
> > On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Capture the kernel output after a mount failure occurs.  If the
> > > test
> > > itself fails, then keep the logging output for further diagnosis.
> > > 
> > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > > ---
> > >  check                  |   22 +++++++++++++++++++++-
> > >  common/rc              |   26 +++++++++++++++++++++++++-
> > >  common/report          |    8 ++++++++
> > >  tests/selftest/008     |   20 ++++++++++++++++++++
> > >  tests/selftest/008.out |    1 +
> > >  5 files changed, 75 insertions(+), 2 deletions(-)
> > >  create mode 100755 tests/selftest/008
> > >  create mode 100644 tests/selftest/008.out
> > > 
> > > 
> > > diff --git a/check b/check
> > > index 9222cd7e4f8197..a46ea1a54d78bb 100755
> > > --- a/check
> > > +++ b/check
> > > @@ -614,7 +614,7 @@ _stash_fail_loop_files() {
> > >  	local seq_prefix="${REPORT_DIR}/${1}"
> > >  	local cp_suffix="$2"
> > >  
> > > -	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core"
> > > ".hints"; do
> > > +	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints"
> > > ".mountfail"; do
> > >  		rm -f "${seq_prefix}${i}${cp_suffix}"
> > >  		if [ -f "${seq_prefix}${i}" ]; then
> > >  			cp "${seq_prefix}${i}"
> > > "${seq_prefix}${i}${cp_suffix}"
> > > @@ -994,6 +994,7 @@ function run_section()
> > >  				      echo -n "	$seqnum -- "
> > >  			cat $seqres.notrun
> > >  			tc_status="notrun"
> > > +			rm -f "$seqres.mountfail?"
> > >  			_stash_test_status "$seqnum" "$tc_status"
> > >  
> > >  			# Unmount the scratch fs so that we can wipe
> > > the scratch
> > > @@ -1053,6 +1054,7 @@ function run_section()
> > >  		if [ ! -f $seq.out ]; then
> > >  			_dump_err "no qualified output"
> > >  			tc_status="fail"
> > > +			rm -f "$seqres.mountfail?"
> > >  			_stash_test_status "$seqnum" "$tc_status"
> > >  			continue;
> > >  		fi
> > > @@ -1089,6 +1091,24 @@ function run_section()
> > >  				rm -f $seqres.hints
> > >  			fi
> > >  		fi
> > > +
> > > +		if [ -f "$seqres.mountfail?" ]; then
> > > +			if [ "$tc_status" = "fail" ]; then
> > > +				# Let the user know if there were mount
> > > +				# failures on a test that failed because that
> > > +				# could be interesting.
> > > +				mv "$seqres.mountfail?" "$seqres.mountfail"
> > > +				_dump_err "check: possible mount failures (see $seqres.mountfail)"
> > > +				test -f $seqres.mountfail && \
> > > +					maybe_compress_logfile $seqres.mountfail $MAX_MOUNTFAIL_SIZE
> > > +			else
> > > +				# Don't retain mount failure logs for tests
> > > +				# that pass or were skipped because some tests
> > > +				# intentionally drive mount failures.
> > > +				rm -f "$seqres.mountfail?"
> > > +			fi
> > > +		fi
> > > +
> > >  		_stash_test_status "$seqnum" "$tc_status"
> > >  	done
> > >  
> > > diff --git a/common/rc b/common/rc
> > > index d7dfb55bbbd7e1..0ede68eb912440 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -204,9 +204,33 @@ _get_hugepagesize()
> > >  	awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo
> > >  }
> > >  
> > > +# Does dmesg have a --since flag?
> > > +_dmesg_detect_since()
> > > +{
> > > +	if [ -z "$DMESG_HAS_SINCE" ]; then
> > > +		if dmesg --help | grep -q -- --since; then
> > > +			DMESG_HAS_SINCE=yes
> > > +		else
> > > +			DMESG_HAS_SINCE=no
> > > +		fi
> > > +	fi
> > > +	test "$DMESG_HAS_SINCE" = "yes"
> > > +}
> > > +
> > >  _mount()
> > >  {
> > > -    $MOUNT_PROG $*
> > > +	$MOUNT_PROG $*
> > > +	ret=$?
> > > +	if [ "$ret" -ne 0 ]; then
> > > +		echo "\"$MOUNT_PROG $*\" failed at $(date)" >> "$seqres.mountfail?"
> > > +		if _dmesg_detect_since; then
> > > +			dmesg --since '30s ago' >> "$seqres.mountfail?"
> > > +		else
> > > +			dmesg | tail -n 100 >> "$seqres.mountfail?"
> > Is it possible to grep for a mount failure message in dmesg and then
> > capture the last n lines? Do you think that will be more accurate?
> 
> Alas no, because there's no standard mount failure log message for us to
> latch onto.
Okay makes sense. 
> 
> > Also, do you think it is useful to make this 100 configurable instead
> > of hardcoding?
> 
> I suppose, but why do you need more than 100?
My thought is that the dmesg could get cluttered with noisy logs from
other processes, so the last 100 lines might miss the mount failure.
No hard preference, though.
> 
> > > +		fi
> > > +	fi
> > > +
> > > +	return $ret
> > >  }
> > >  
> > >  # Call _mount to do mount operation but also save mountpoint to
> > > diff --git a/common/report b/common/report
> > > index 0e91e481f9725a..b57697f76dafb2 100644
> > > --- a/common/report
> > > +++ b/common/report
> > > @@ -199,6 +199,7 @@ _xunit_make_testcase_report()
> > >  		local out_src="${SRC_DIR}/${test_name}.out"
> > >  		local full_file="${REPORT_DIR}/${test_name}.full"
> > >  		local dmesg_file="${REPORT_DIR}/${test_name}.dmesg"
> > > +		local mountfail_file="${REPORT_DIR}/${test_name}.mountfail"
> > >  		local outbad_file="${REPORT_DIR}/${test_name}.out.bad"
> > >  		if [ -z "$_err_msg" ]; then
> > >  			_err_msg="Test $test_name failed, reason unknown"
> > > @@ -225,6 +226,13 @@ _xunit_make_testcase_report()
> > >  			printf ']]>\n'	>>$report
> > >  			echo -e "\t\t</system-err>" >> $report
> > >  		fi
> > > +		if [ -z "$quiet" -a -f "$mountfail_file" ]; then
> > > +			echo -e "\t\t<mount-failure>" >> $report
> > > +			printf	'<![CDATA[\n' >>$report
> > > +			cat "$mountfail_file" | tr -dc '[:print:][:space:]' | encode_cdata >>$report
> > > +			printf ']]>\n'	>>$report
> > > +			echo -e "\t\t</mount-failure>" >> $report
> > > +		fi
> > >  		;;
> > >  	*)
> > >  		echo -e "\t\t<failure message=\"Unknown test_status=$test_status\" type=\"TestFail\"/>" >> $report
> > > diff --git a/tests/selftest/008 b/tests/selftest/008
> > > new file mode 100755
> > > index 00000000000000..db80ffe6f77339
> > > --- /dev/null
> > > +++ b/tests/selftest/008
> > > @@ -0,0 +1,20 @@
> > > +#! /bin/bash
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
> > > +#
> > > +# FS QA Test 008
> > > +#
> > > +# Test mount failure capture.
> > > +#
> > > +. ./common/preamble
> > > +_begin_fstest selftest
> > > +
> > > +_require_command "$WIPEFS_PROG" wipefs
> > > +_require_scratch
> > > +
> > > +$WIPEFS_PROG -a $SCRATCH_DEV
> > > +_scratch_mount &>> $seqres.full
> > Minor: Do you think adding some filtered messages from the captured
> > dmesg logs in the output will be helpful?  
> 
> No, this test exists to make sure that the dmesg log is captured in
> $RESULT_DIR.  We don't care about the mount(8) output.
> 
> --D
Okay, got it.
--NR 
> 
> > > +
> > > +# success, all done
> > > +status=0
> > > +exit
> > > diff --git a/tests/selftest/008.out b/tests/selftest/008.out
> > > new file mode 100644
> > > index 00000000000000..aaff95f3f48372
> > > --- /dev/null
> > > +++ b/tests/selftest/008.out
> > > @@ -0,0 +1 @@
> > > +QA output created by 008
> > > 



* [PATCHSET 4/5] fstests: live health monitoring of filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (12 preceding siblings ...)
  2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
                     ` (5 more replies)
  2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
  2025-01-02  1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
  15 siblings, 6 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file (twf) code
to deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  Because this is a
private ioctl, events are expressed as simple json objects, which lets us
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.
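
As a rough illustration of the daemon side, here is a minimal shell sketch
of a loop that reads one json event per line and decides whether to start
a repair.  The event fields ("type", "domain") and the function names
handle_event and monitor_events are hypothetical placeholders, not the
format or the code used by the real daemon:

```shell
#!/bin/bash
# Sketch only: the event fields ("type", "domain") and the repair action
# are hypothetical; the real event format is defined by the kernel patches.

# Decide what to do with a single json event line.
handle_event() {
	local event="$1"

	case "$event" in
	*'"type":"corrupt"'*)
		# A real daemon would initiate an online repair here.
		echo "repair: $event"
		;;
	*)
		# Anything else is only logged in this sketch.
		echo "ignore: $event"
		;;
	esac
}

# Read events one per line until the stream is closed.
monitor_events() {
	local line

	while IFS= read -r line; do
		handle_event "$line"
	done
}

# Feed two made-up events through the loop.
printf '%s\n' \
	'{"type":"corrupt","domain":"inode"}' \
	'{"type":"mount"}' | monitor_events
```

Matching on substrings keeps the sketch free of external tools; a real
consumer would want a proper json parser.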

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * misc: convert all $UMOUNT_PROG to a _umount helper
 * misc: convert all umount(1) invocations to _umount
 * xfs: test health monitoring code
 * xfs: test for metadata corruption error reporting via healthmon
 * xfs: test io error reporting via healthmon
 * xfs: test new xfs_scrubbed daemon
---
 common/btrfs        |    2 +
 common/config       |    6 +++
 common/dmdelay      |    4 +-
 common/dmdust       |    4 +-
 common/dmerror      |    4 +-
 common/dmflakey     |    4 +-
 common/dmhugedisk   |    2 +
 common/dmlogwrites  |    4 +-
 common/dmthin       |    4 +-
 common/overlay      |   10 +++---
 common/populate     |    8 ++---
 common/quota        |    2 +
 common/rc           |   47 ++++++++++++++++++---------
 common/systemd      |    9 +++++
 common/xfs          |   18 ++++++++++
 doc/group-names.txt |    1 +
 tests/btrfs/012     |    2 +
 tests/btrfs/020     |    2 +
 tests/btrfs/029     |    2 +
 tests/btrfs/031     |    2 +
 tests/btrfs/060     |    2 +
 tests/btrfs/065     |    2 +
 tests/btrfs/066     |    2 +
 tests/btrfs/067     |    2 +
 tests/btrfs/068     |    2 +
 tests/btrfs/075     |    2 +
 tests/btrfs/089     |    2 +
 tests/btrfs/124     |    2 +
 tests/btrfs/125     |    2 +
 tests/btrfs/185     |    4 +-
 tests/btrfs/197     |    4 +-
 tests/btrfs/199     |    2 +
 tests/btrfs/219     |   12 +++----
 tests/btrfs/254     |    2 +
 tests/btrfs/291     |    2 +
 tests/btrfs/298     |    4 +-
 tests/ext4/006      |    4 +-
 tests/ext4/007      |    4 +-
 tests/ext4/008      |    4 +-
 tests/ext4/009      |    8 ++---
 tests/ext4/010      |    6 ++-
 tests/ext4/011      |    2 +
 tests/ext4/012      |    2 +
 tests/ext4/013      |    6 ++-
 tests/ext4/014      |    6 ++-
 tests/ext4/015      |    6 ++-
 tests/ext4/016      |    6 ++-
 tests/ext4/017      |    6 ++-
 tests/ext4/018      |    6 ++-
 tests/ext4/019      |    6 ++-
 tests/ext4/032      |    4 +-
 tests/ext4/033      |    2 +
 tests/ext4/052      |    4 +-
 tests/ext4/053      |   32 +++++++++---------
 tests/ext4/056      |    2 +
 tests/generic/042   |    4 +-
 tests/generic/067   |    6 ++-
 tests/generic/081   |    2 +
 tests/generic/085   |    4 +-
 tests/generic/108   |    2 +
 tests/generic/171   |    2 +
 tests/generic/172   |    2 +
 tests/generic/173   |    2 +
 tests/generic/174   |    2 +
 tests/generic/306   |    2 +
 tests/generic/330   |    2 +
 tests/generic/332   |    2 +
 tests/generic/361   |    2 +
 tests/generic/373   |    2 +
 tests/generic/374   |    2 +
 tests/generic/395   |    2 +
 tests/generic/459   |    2 +
 tests/generic/563   |    4 +-
 tests/generic/604   |    2 +
 tests/generic/631   |    2 +
 tests/generic/648   |    6 ++-
 tests/generic/698   |    4 +-
 tests/generic/699   |    8 ++---
 tests/generic/704   |    2 +
 tests/generic/717   |    2 +
 tests/generic/730   |    2 +
 tests/generic/731   |    2 +
 tests/generic/732   |    4 +-
 tests/generic/746   |    8 ++---
 tests/overlay/003   |    2 +
 tests/overlay/004   |    2 +
 tests/overlay/005   |    6 ++-
 tests/overlay/014   |    4 +-
 tests/overlay/022   |    2 +
 tests/overlay/025   |    4 +-
 tests/overlay/029   |    6 ++-
 tests/overlay/031   |    8 ++---
 tests/overlay/035   |    2 +
 tests/overlay/036   |    8 ++---
 tests/overlay/037   |    6 ++-
 tests/overlay/040   |    2 +
 tests/overlay/041   |    2 +
 tests/overlay/042   |    2 +
 tests/overlay/043   |    2 +
 tests/overlay/044   |    2 +
 tests/overlay/048   |    4 +-
 tests/overlay/049   |    2 +
 tests/overlay/050   |    2 +
 tests/overlay/051   |    4 +-
 tests/overlay/052   |    2 +
 tests/overlay/053   |    4 +-
 tests/overlay/054   |    2 +
 tests/overlay/055   |    4 +-
 tests/overlay/056   |    2 +
 tests/overlay/057   |    4 +-
 tests/overlay/059   |    2 +
 tests/overlay/060   |    2 +
 tests/overlay/062   |    2 +
 tests/overlay/063   |    2 +
 tests/overlay/065   |   22 ++++++-------
 tests/overlay/067   |    2 +
 tests/overlay/068   |    4 +-
 tests/overlay/069   |    6 ++-
 tests/overlay/070   |    6 ++-
 tests/overlay/071   |    6 ++-
 tests/overlay/076   |    2 +
 tests/overlay/077   |    2 +
 tests/overlay/078   |    2 +
 tests/overlay/079   |    2 +
 tests/overlay/080   |    2 +
 tests/overlay/081   |   14 ++++----
 tests/overlay/083   |    2 +
 tests/overlay/084   |   10 +++---
 tests/overlay/085   |    2 +
 tests/overlay/086   |    8 ++---
 tests/xfs/014       |    4 +-
 tests/xfs/049       |    8 ++---
 tests/xfs/073       |    8 ++---
 tests/xfs/074       |    4 +-
 tests/xfs/078       |    4 +-
 tests/xfs/083       |    6 ++-
 tests/xfs/085       |    4 +-
 tests/xfs/086       |    8 ++---
 tests/xfs/087       |    6 ++-
 tests/xfs/088       |    8 ++---
 tests/xfs/089       |    8 ++---
 tests/xfs/091       |    8 ++---
 tests/xfs/093       |    6 ++-
 tests/xfs/097       |    6 ++-
 tests/xfs/098       |    4 +-
 tests/xfs/099       |    6 ++-
 tests/xfs/100       |    6 ++-
 tests/xfs/101       |    6 ++-
 tests/xfs/102       |    6 ++-
 tests/xfs/105       |    6 ++-
 tests/xfs/112       |    8 ++---
 tests/xfs/113       |    6 ++-
 tests/xfs/117       |    6 ++-
 tests/xfs/120       |    6 ++-
 tests/xfs/123       |    6 ++-
 tests/xfs/124       |    6 ++-
 tests/xfs/125       |    6 ++-
 tests/xfs/126       |    6 ++-
 tests/xfs/130       |    2 +
 tests/xfs/148       |    6 ++-
 tests/xfs/149       |    4 +-
 tests/xfs/152       |    2 +
 tests/xfs/169       |    6 ++-
 tests/xfs/186       |    4 +-
 tests/xfs/1878      |   80 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1878.out  |   10 ++++++
 tests/xfs/1879      |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1879.out  |   12 +++++++
 tests/xfs/1882      |   64 +++++++++++++++++++++++++++++++++++++
 tests/xfs/1882.out  |    2 +
 tests/xfs/1883      |   75 +++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1883.out  |    2 +
 tests/xfs/1884      |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1884.out  |    2 +
 tests/xfs/1885      |   53 ++++++++++++++++++++++++++++++
 tests/xfs/1885.out  |    5 +++
 tests/xfs/206       |    2 +
 tests/xfs/216       |    2 +
 tests/xfs/217       |    2 +
 tests/xfs/235       |    6 ++-
 tests/xfs/236       |    6 ++-
 tests/xfs/239       |    2 +
 tests/xfs/241       |    2 +
 tests/xfs/250       |    4 +-
 tests/xfs/265       |    6 ++-
 tests/xfs/289       |    4 +-
 tests/xfs/310       |    4 +-
 tests/xfs/507       |    2 +
 tests/xfs/513       |    4 +-
 tests/xfs/544       |    2 +
 tests/xfs/716       |    4 +-
 tests/xfs/806       |    4 +-
 192 files changed, 921 insertions(+), 391 deletions(-)
 create mode 100755 tests/xfs/1878
 create mode 100644 tests/xfs/1878.out
 create mode 100755 tests/xfs/1879
 create mode 100644 tests/xfs/1879.out
 create mode 100755 tests/xfs/1882
 create mode 100644 tests/xfs/1882.out
 create mode 100755 tests/xfs/1883
 create mode 100644 tests/xfs/1883.out
 create mode 100755 tests/xfs/1884
 create mode 100644 tests/xfs/1884.out
 create mode 100755 tests/xfs/1885
 create mode 100644 tests/xfs/1885.out



* [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:57   ` Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

We're going to start collecting ephemeral(ish) filesystem stats in the
next patch, so switch all the $UMOUNT_PROG invocations to a _umount helper.

sed -e 's/$UMOUNT_PROG/_umount/g' -i $(git ls-files common tests check)
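
To see what that expression does and does not touch, here is a small
sketch that runs the same substitution on sample lines from stdin.  The
wrapper name convert_umount_calls is a hypothetical helper for this
illustration only; it is not part of fstests:

```shell
#!/bin/bash
# Sketch only: convert_umount_calls is a hypothetical wrapper around the
# sed expression above, applied to stdin instead of the git-tracked files.
convert_umount_calls() {
	# Single quotes stop the shell from expanding $UMOUNT_PROG, and a "$"
	# that is not at the end of a basic regex matches a literal dollar sign.
	sed -e 's/$UMOUNT_PROG/_umount/g'
}

# The first line is rewritten; the bare umount(1) call is left alone.
printf '%s\n' \
	'$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1' \
	'umount $SCRATCH_MNT' | convert_umount_calls
```

Bare umount(1) invocations are not matched by this pattern, so they need
a separate conversion.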

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/btrfs       |    2 +-
 common/dmdelay     |    4 ++--
 common/dmdust      |    4 ++--
 common/dmerror     |    2 +-
 common/dmflakey    |    4 ++--
 common/dmhugedisk  |    2 +-
 common/dmlogwrites |    4 ++--
 common/dmthin      |    4 ++--
 common/overlay     |   10 +++++-----
 common/rc          |   33 +++++++++++++++++++--------------
 tests/btrfs/020    |    2 +-
 tests/btrfs/029    |    2 +-
 tests/btrfs/031    |    2 +-
 tests/btrfs/060    |    2 +-
 tests/btrfs/065    |    2 +-
 tests/btrfs/066    |    2 +-
 tests/btrfs/067    |    2 +-
 tests/btrfs/068    |    2 +-
 tests/btrfs/075    |    2 +-
 tests/btrfs/089    |    2 +-
 tests/btrfs/124    |    2 +-
 tests/btrfs/125    |    2 +-
 tests/btrfs/185    |    4 ++--
 tests/btrfs/197    |    4 ++--
 tests/btrfs/219    |   12 ++++++------
 tests/btrfs/254    |    2 +-
 tests/ext4/032     |    4 ++--
 tests/ext4/052     |    4 ++--
 tests/ext4/053     |   32 ++++++++++++++++----------------
 tests/ext4/056     |    2 +-
 tests/generic/042  |    4 ++--
 tests/generic/067  |    6 +++---
 tests/generic/081  |    2 +-
 tests/generic/085  |    4 ++--
 tests/generic/108  |    2 +-
 tests/generic/361  |    2 +-
 tests/generic/373  |    2 +-
 tests/generic/374  |    2 +-
 tests/generic/459  |    2 +-
 tests/generic/604  |    2 ++
 tests/generic/648  |    6 +++---
 tests/generic/698  |    4 ++--
 tests/generic/699  |    8 ++++----
 tests/generic/704  |    2 +-
 tests/generic/730  |    2 +-
 tests/generic/731  |    2 +-
 tests/generic/732  |    4 ++--
 tests/generic/746  |    8 ++++----
 tests/overlay/003  |    2 +-
 tests/overlay/004  |    2 +-
 tests/overlay/005  |    6 +++---
 tests/overlay/014  |    4 ++--
 tests/overlay/022  |    2 +-
 tests/overlay/025  |    4 ++--
 tests/overlay/029  |    6 +++---
 tests/overlay/031  |    8 ++++----
 tests/overlay/035  |    2 +-
 tests/overlay/036  |    8 ++++----
 tests/overlay/037  |    6 +++---
 tests/overlay/040  |    2 +-
 tests/overlay/041  |    2 +-
 tests/overlay/042  |    2 +-
 tests/overlay/043  |    2 +-
 tests/overlay/044  |    2 +-
 tests/overlay/048  |    4 ++--
 tests/overlay/049  |    2 +-
 tests/overlay/050  |    2 +-
 tests/overlay/051  |    4 ++--
 tests/overlay/052  |    2 +-
 tests/overlay/053  |    4 ++--
 tests/overlay/054  |    2 +-
 tests/overlay/055  |    4 ++--
 tests/overlay/056  |    2 +-
 tests/overlay/057  |    4 ++--
 tests/overlay/059  |    2 +-
 tests/overlay/060  |    2 +-
 tests/overlay/062  |    2 +-
 tests/overlay/063  |    2 +-
 tests/overlay/065  |   22 +++++++++++-----------
 tests/overlay/067  |    2 +-
 tests/overlay/068  |    4 ++--
 tests/overlay/069  |    6 +++---
 tests/overlay/070  |    6 +++---
 tests/overlay/071  |    6 +++---
 tests/overlay/076  |    2 +-
 tests/overlay/077  |    2 +-
 tests/overlay/078  |    2 +-
 tests/overlay/079  |    2 +-
 tests/overlay/080  |    2 +-
 tests/overlay/081  |   14 +++++++-------
 tests/overlay/083  |    2 +-
 tests/overlay/084  |   10 +++++-----
 tests/overlay/085  |    2 +-
 tests/overlay/086  |    8 ++++----
 tests/xfs/078      |    4 ++--
 tests/xfs/148      |    6 +++---
 tests/xfs/149      |    4 ++--
 tests/xfs/186      |    4 ++--
 tests/xfs/289      |    4 ++--
 tests/xfs/507      |    2 +-
 tests/xfs/513      |    4 ++--
 tests/xfs/544      |    2 +-
 tests/xfs/806      |    4 ++--
 103 files changed, 226 insertions(+), 219 deletions(-)


diff --git a/common/btrfs b/common/btrfs
index 64f38cc240ab8b..b82c8f5a934cfd 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -352,7 +352,7 @@ _btrfs_stress_subvolume()
 	while [ ! -e $stop_file ]; do
 		$BTRFS_UTIL_PROG subvolume create $btrfs_mnt/$subvol_name
 		_mount -o subvol=$subvol_name $btrfs_dev $subvol_mnt
-		$UMOUNT_PROG $subvol_mnt
+		_umount $subvol_mnt
 		$BTRFS_UTIL_PROG subvolume delete $btrfs_mnt/$subvol_name
 	done
 }
diff --git a/common/dmdelay b/common/dmdelay
index 794ea37ba200ce..691e22538a622b 100644
--- a/common/dmdelay
+++ b/common/dmdelay
@@ -26,7 +26,7 @@ _mount_delay()
 
 _unmount_delay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 _cleanup_delay()
@@ -34,7 +34,7 @@ _cleanup_delay()
 	# If dmsetup load fails then we need to make sure to do resume here
 	# otherwise the umount will hang
 	$DMSETUP_PROG resume delay-test > /dev/null 2>&1
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_dmsetup_remove delay-test
 }
 
diff --git a/common/dmdust b/common/dmdust
index 56fcc0e0fffa1e..13461c2dd3a006 100644
--- a/common/dmdust
+++ b/common/dmdust
@@ -22,7 +22,7 @@ _mount_dust()
 
 _unmount_dust()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 _cleanup_dust()
@@ -30,6 +30,6 @@ _cleanup_dust()
 	# If dmsetup load fails then we need to make sure to do resume here
 	# otherwise the umount will hang
 	$DMSETUP_PROG resume dust-test > /dev/null 2>&1
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_dmsetup_remove dust-test
 }
diff --git a/common/dmerror b/common/dmerror
index 2f006142a309fe..1e6a35230f3ccb 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -106,7 +106,7 @@ _dmerror_cleanup()
 	test -n "$NON_ERROR_RTDEV" && $DMSETUP_PROG resume error-rttest &>/dev/null
 	$DMSETUP_PROG resume error-test > /dev/null 2>&1
 
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 
 	test -n "$NON_ERROR_LOGDEV" && _dmsetup_remove error-logtest
 	test -n "$NON_ERROR_RTDEV" && _dmsetup_remove error-rttest
diff --git a/common/dmflakey b/common/dmflakey
index 52da3b100fbe45..64723f983b27ec 100644
--- a/common/dmflakey
+++ b/common/dmflakey
@@ -67,7 +67,7 @@ _mount_flakey()
 
 _unmount_flakey()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 _cleanup_flakey()
@@ -78,7 +78,7 @@ _cleanup_flakey()
 	test -n "$NON_FLAKEY_RTDEV" && $DMSETUP_PROG resume flakey-rttest &> /dev/null
 	$DMSETUP_PROG resume flakey-test > /dev/null 2>&1
 
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 
 	_dmsetup_remove flakey-test
 	test -n "$NON_FLAKEY_LOGDEV" && _dmsetup_remove flakey-logtest
diff --git a/common/dmhugedisk b/common/dmhugedisk
index 502f0243772d52..a02bff4351d9be 100644
--- a/common/dmhugedisk
+++ b/common/dmhugedisk
@@ -39,7 +39,7 @@ _dmhugedisk_init()
 
 _dmhugedisk_cleanup()
 {
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_dmsetup_remove huge-test
 	_dmsetup_remove huge-test-zero
 }
diff --git a/common/dmlogwrites b/common/dmlogwrites
index c054acb875a384..a1a5c415338276 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -145,7 +145,7 @@ _log_writes_mount()
 
 _log_writes_unmount()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # _log_writes_replay_log <mark>
@@ -177,7 +177,7 @@ _log_writes_remove()
 
 _log_writes_cleanup()
 {
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_log_writes_remove
 }
 
diff --git a/common/dmthin b/common/dmthin
index 7107d50804896e..38d561c8eb25d6 100644
--- a/common/dmthin
+++ b/common/dmthin
@@ -23,7 +23,7 @@ DMTHIN_VOL_DEV="/dev/mapper/$DMTHIN_VOL_NAME"
 
 _dmthin_cleanup()
 {
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_dmsetup_remove $DMTHIN_VOL_NAME
 	_dmsetup_remove $DMTHIN_POOL_NAME
 	_dmsetup_remove $DMTHIN_META_NAME
@@ -32,7 +32,7 @@ _dmthin_cleanup()
 
 _dmthin_check_fs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_check_scratch_fs $DMTHIN_VOL_DEV
 }
 
diff --git a/common/overlay b/common/overlay
index da1d8d2c3183f4..2877f31e22ebd9 100644
--- a/common/overlay
+++ b/common/overlay
@@ -142,18 +142,18 @@ _overlay_base_unmount()
 
 	[ -n "$dev" -a -n "$mnt" ] || return 0
 
-	$UMOUNT_PROG $mnt
+	_umount $mnt
 }
 
 _overlay_test_unmount()
 {
-	$UMOUNT_PROG $TEST_DIR
+	_umount $TEST_DIR
 	_overlay_base_unmount "$OVL_BASE_TEST_DEV" "$OVL_BASE_TEST_DIR"
 }
 
 _overlay_scratch_unmount()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_overlay_base_unmount "$OVL_BASE_SCRATCH_DEV" "$OVL_BASE_SCRATCH_MNT"
 }
 
@@ -342,7 +342,7 @@ _overlay_check_scratch_dirs()
 
 	# Need to umount overlay for scratch dir check
 	local ovl_mounted=`_is_dir_mountpoint $SCRATCH_MNT`
-	[ -z "$ovl_mounted" ] || $UMOUNT_PROG $SCRATCH_MNT
+	[ -z "$ovl_mounted" ] || _umount $SCRATCH_MNT
 
 	# Check dirs with extra overlay options
 	_overlay_check_dirs $lowerdir $upperdir $workdir $*
@@ -387,7 +387,7 @@ _overlay_check_fs()
 	else
 		# Check and umount overlay for dir check
 		ovl_mounted=`_is_dir_mountpoint $ovl_mnt`
-		[ -z "$ovl_mounted" ] || $UMOUNT_PROG $ovl_mnt
+		[ -z "$ovl_mounted" ] || _umount $ovl_mnt
 	fi
 
 	_overlay_check_dirs $base_mnt/$OVL_LOWER $base_mnt/$OVL_UPPER \
diff --git a/common/rc b/common/rc
index 0ede68eb912440..d3ee76e01db892 100644
--- a/common/rc
+++ b/common/rc
@@ -233,6 +233,11 @@ _mount()
 	return $ret
 }
 
+_umount()
+{
+	$UMOUNT_PROG $*
+}
+
 # Call _mount to do mount operation but also save mountpoint to
 # MOUNTED_POINT_STACK. Note that the mount point must be the last parameter
 _get_mount()
@@ -266,7 +271,7 @@ _put_mount()
 	local last_mnt=`echo $MOUNTED_POINT_STACK | awk '{print $1}'`
 
 	if [ -n "$last_mnt" ]; then
-		$UMOUNT_PROG $last_mnt
+		_umount $last_mnt
 	fi
 	MOUNTED_POINT_STACK=`echo $MOUNTED_POINT_STACK | cut -d\  -f2-`
 }
@@ -275,7 +280,7 @@ _put_mount()
 _clear_mount_stack()
 {
 	if [ -n "$MOUNTED_POINT_STACK" ]; then
-		$UMOUNT_PROG $MOUNTED_POINT_STACK
+		_umount $MOUNTED_POINT_STACK
 	fi
 	MOUNTED_POINT_STACK=""
 }
@@ -420,20 +425,20 @@ _scratch_unmount()
 		_overlay_scratch_unmount
 		;;
 	btrfs)
-		$UMOUNT_PROG $SCRATCH_MNT
+		_umount $SCRATCH_MNT
 		;;
 	tmpfs)
 		$UMOUNT_PROG $SCRATCH_MNT
 		;;
 	*)
-		$UMOUNT_PROG $SCRATCH_DEV
+		_umount $SCRATCH_DEV
 		;;
 	esac
 }
 
 _scratch_umount_idmapped()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 _scratch_remount()
@@ -457,7 +462,7 @@ _scratch_cycle_mount()
         ;;
     overlay)
         if [ "$OVL_BASE_FSTYP" = tmpfs ]; then
-            $UMOUNT_PROG $SCRATCH_MNT
+            _umount $SCRATCH_MNT
             unmounted=true
         fi
         ;;
@@ -505,9 +510,9 @@ _move_mount()
 
 	# Replace $mnt with $tmp. Use a temporary bind-mount because
 	# mount --move will fail with certain mount propagation layouts.
-	$UMOUNT_PROG $mnt || _fail "Failed to unmount $mnt"
+	_umount $mnt || _fail "Failed to unmount $mnt"
 	_mount --bind $tmp $mnt || _fail "Failed to bind-mount $tmp to $mnt"
-	$UMOUNT_PROG $tmp || _fail "Failed to unmount $tmp"
+	_umount $tmp || _fail "Failed to unmount $tmp"
 	rmdir $tmp
 }
 
@@ -573,7 +578,7 @@ _test_unmount()
 	if [ "$FSTYP" == "overlay" ]; then
 		_overlay_test_unmount
 	else
-		$UMOUNT_PROG $TEST_DEV
+		_umount $TEST_DEV
 	fi
 }
 
@@ -587,7 +592,7 @@ _test_cycle_mount()
         ;;
     overlay)
         if [ "$OVL_BASE_FSTYP" = tmpfs ]; then
-            $UMOUNT_PROG $TEST_DIR
+            _umount $TEST_DIR
             unmounted=true
         fi
         ;;
@@ -1375,7 +1380,7 @@ _repair_scratch_fs()
 		# Fall through to repair base fs
 		dev=$OVL_BASE_SCRATCH_DEV
 		fstyp=$OVL_BASE_FSTYP
-		$UMOUNT_PROG $OVL_BASE_SCRATCH_MNT
+		_umount $OVL_BASE_SCRATCH_MNT
 	fi
 	# Let's hope fsck -y suffices...
 	fsck -t $fstyp -y $dev 2>&1
@@ -2189,7 +2194,7 @@ _require_logdev()
         _notrun "This test requires USE_EXTERNAL to be enabled"
 
     # ensure its not mounted
-    $UMOUNT_PROG $SCRATCH_LOGDEV 2>/dev/null
+    _umount $SCRATCH_LOGDEV 2>/dev/null
 }
 
 # This test requires that an external log device is not in use
@@ -3281,7 +3286,7 @@ _umount_or_remount_ro()
     local mountpoint=`_is_dev_mounted $device`
 
     if [ $USE_REMOUNT -eq 0 ]; then
-        $UMOUNT_PROG $device
+        _umount $device
     else
         _remount $device ro
     fi
@@ -3799,7 +3804,7 @@ _require_scratch_dev_pool()
 			_notrun "$i is part of TEST_DEV, this test requires unique disks"
 		fi
 		if _mount | grep -q $i; then
-			if ! $UMOUNT_PROG $i; then
+			if ! _umount $i; then
 		            echo "failed to unmount $i - aborting"
 		            exit 1
 		        fi
diff --git a/tests/btrfs/020 b/tests/btrfs/020
index 7e5c6fd7b25229..f6fadab1f00bdb 100755
--- a/tests/btrfs/020
+++ b/tests/btrfs/020
@@ -17,7 +17,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $loop_mnt
+	_umount $loop_mnt
 	_destroy_loop_device $loop_dev1
 	losetup -d $loop_dev2 >/dev/null 2>&1
 	_destroy_loop_device $loop_dev3
diff --git a/tests/btrfs/029 b/tests/btrfs/029
index c37ad63fb613db..9799b275250e5a 100755
--- a/tests/btrfs/029
+++ b/tests/btrfs/029
@@ -74,7 +74,7 @@ cp --reflink=always $orig_file $copy_file >> $seqres.full 2>&1 || echo "cp refli
 md5sum $orig_file | _filter_testdir_and_scratch
 md5sum $copy_file | _filter_testdir_and_scratch
 
-$UMOUNT_PROG $reflink_test_dir
+_umount $reflink_test_dir
 
 # success, all done
 status=0
diff --git a/tests/btrfs/031 b/tests/btrfs/031
index 8ac73d3a86e70b..92c1d26f865ba9 100755
--- a/tests/btrfs/031
+++ b/tests/btrfs/031
@@ -99,7 +99,7 @@ mv $testdir2/file* $subvol2/
 echo "Verify the file contents:"
 _checksum_files
 
-$UMOUNT_PROG $cross_mount_test_dir
+_umount $cross_mount_test_dir
 
 # success, all done
 status=0
diff --git a/tests/btrfs/060 b/tests/btrfs/060
index 75c10bd23c36f5..0bf88f86ca822b 100755
--- a/tests/btrfs/060
+++ b/tests/btrfs/060
@@ -82,7 +82,7 @@ run_test()
 	fi
 
 	# in case the subvolume is still mounted
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 	_scratch_unmount
 	# we called _require_scratch_nocheck instead of _require_scratch
 	# do check after test for each profile config
diff --git a/tests/btrfs/065 b/tests/btrfs/065
index b87c66d6e3d45e..9cd38fefe46875 100755
--- a/tests/btrfs/065
+++ b/tests/btrfs/065
@@ -90,7 +90,7 @@ run_test()
 	fi
 
 	# in case the subvolume is still mounted
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 	_scratch_unmount
 	# we called _require_scratch_nocheck instead of _require_scratch
 	# do check after test for each profile config
diff --git a/tests/btrfs/066 b/tests/btrfs/066
index cc7cd9b7273d1c..b3db57049714ad 100755
--- a/tests/btrfs/066
+++ b/tests/btrfs/066
@@ -82,7 +82,7 @@ run_test()
 	fi
 
 	# in case the subvolume is still mounted
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 	_scratch_unmount
 	# we called _require_scratch_nocheck instead of _require_scratch
 	# do check after test for each profile config
diff --git a/tests/btrfs/067 b/tests/btrfs/067
index 0b473050027a0a..ede9abbc689fe0 100755
--- a/tests/btrfs/067
+++ b/tests/btrfs/067
@@ -83,7 +83,7 @@ run_test()
 	fi
 
 	# in case the subvolume is still mounted
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 	_scratch_unmount
 	# we called _require_scratch_nocheck instead of _require_scratch
 	# do check after test for each profile config
diff --git a/tests/btrfs/068 b/tests/btrfs/068
index 83e932e8417c0d..82dac5fd90ba85 100755
--- a/tests/btrfs/068
+++ b/tests/btrfs/068
@@ -83,7 +83,7 @@ run_test()
 	fi
 
 	# in case the subvolume is still mounted
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 	_scratch_unmount
 	# we called _require_scratch_nocheck instead of _require_scratch
 	# do check after test for each profile config
diff --git a/tests/btrfs/075 b/tests/btrfs/075
index 737c4ffdd57865..8e78bd3d4b2336 100755
--- a/tests/btrfs/075
+++ b/tests/btrfs/075
@@ -15,7 +15,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+	_umount $subvol_mnt >/dev/null 2>&1
 }
 
 . ./common/filter
diff --git a/tests/btrfs/089 b/tests/btrfs/089
index 8f8e37b6fde87b..ade38a6d189eaa 100755
--- a/tests/btrfs/089
+++ b/tests/btrfs/089
@@ -35,7 +35,7 @@ mount --bind "$SCRATCH_MNT/testvol/testdir" "$SCRATCH_MNT/testvol/mnt"
 $BTRFS_UTIL_PROG subvolume delete "$SCRATCH_MNT/testvol" >>$seqres.full 2>&1
 
 # Unmount the bind mount, which should still be alive.
-$UMOUNT_PROG "$SCRATCH_MNT/testvol/mnt"
+_umount "$SCRATCH_MNT/testvol/mnt"
 
 echo "Silence is golden"
 status=0
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index af079c2864de8e..19f8bbfc6b922e 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -132,7 +132,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
 	echo "Inital sum does not match with data on dev2 written by balance"
 fi
 
-$UMOUNT_PROG $dev2
+_umount $dev2
 _scratch_dev_pool_put
 _btrfs_rescan_devices
 _test_mount
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index c8c0dd422f72b6..7acef2d38cda46 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -144,7 +144,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
 	echo "Inital sum does not match with data on dev2 written by balance"
 fi
 
-$UMOUNT_PROG $dev2
+_umount $dev2
 _scratch_dev_pool_put
 _btrfs_rescan_devices
 _test_mount
diff --git a/tests/btrfs/185 b/tests/btrfs/185
index 8d0643450f5d7d..c3b52fc2dbff66 100755
--- a/tests/btrfs/185
+++ b/tests/btrfs/185
@@ -15,7 +15,7 @@ mnt=$TEST_DIR/$seq.mnt
 # Override the default cleanup function.
 _cleanup()
 {
-	$UMOUNT_PROG $mnt > /dev/null 2>&1
+	_umount $mnt > /dev/null 2>&1
 	rm -rf $mnt > /dev/null 2>&1
 	cd /
 	rm -f $tmp.*
@@ -62,7 +62,7 @@ $BTRFS_UTIL_PROG device scan $device_1 >> $seqres.full 2>&1
 	_fail "if it fails here, then it means subvolume mount at boot may fail "\
 	      "in some configs."
 
-$UMOUNT_PROG $mnt > /dev/null 2>&1
+_umount $mnt > /dev/null 2>&1
 _scratch_dev_pool_put
 
 # success, all done
diff --git a/tests/btrfs/197 b/tests/btrfs/197
index 9f1d879a4e267a..913dbb2d3a50ef 100755
--- a/tests/btrfs/197
+++ b/tests/btrfs/197
@@ -15,7 +15,7 @@ _begin_fstest auto quick volume
 # Override the default cleanup function.
 _cleanup()
 {
-	$UMOUNT_PROG $TEST_DIR/$seq.mnt >/dev/null 2>&1
+	_umount $TEST_DIR/$seq.mnt >/dev/null 2>&1
 	rm -rf $TEST_DIR/$seq.mnt
 	cd /
 	rm -f $tmp.*
@@ -67,7 +67,7 @@ workout()
 	grep -q "${SCRATCH_DEV_NAME[1]}" $tmp.output && _fail "found stale device"
 
 	$BTRFS_UTIL_PROG device remove "${SCRATCH_DEV_NAME[1]}" "$TEST_DIR/$seq.mnt"
-	$UMOUNT_PROG $TEST_DIR/$seq.mnt
+	_umount $TEST_DIR/$seq.mnt
 	_scratch_unmount
 	_spare_dev_put
 	_scratch_dev_pool_put
diff --git a/tests/btrfs/219 b/tests/btrfs/219
index 052f61a399ae66..efe5096746652a 100755
--- a/tests/btrfs/219
+++ b/tests/btrfs/219
@@ -21,8 +21,8 @@ _cleanup()
 	rm -f $tmp.*
 
 	# The variables are set before the test case can fail.
-	$UMOUNT_PROG ${loop_mnt1} &> /dev/null
-	$UMOUNT_PROG ${loop_mnt2} &> /dev/null
+	_umount ${loop_mnt1} &> /dev/null
+	_umount ${loop_mnt2} &> /dev/null
 	rm -rf $loop_mnt1
 	rm -rf $loop_mnt2
 
@@ -66,7 +66,7 @@ loop_dev2=`_create_loop_device $fs_img2`
 # Normal single device case, should pass just fine
 _mount $loop_dev1 $loop_mnt1 > /dev/null  2>&1 || \
 	_fail "Couldn't do initial mount"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 
 _btrfs_forget_or_module_reload
 
@@ -75,15 +75,15 @@ _btrfs_forget_or_module_reload
 # measure.
 _mount $loop_dev1 $loop_mnt1 > /dev/null 2>&1 || \
 	_fail "Failed to mount the second time"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 
 _mount $loop_dev2 $loop_mnt2 > /dev/null 2>&1 || \
 	_fail "We couldn't mount the old generation"
-$UMOUNT_PROG $loop_mnt2
+_umount $loop_mnt2
 
 _mount $loop_dev1 $loop_mnt1 > /dev/null 2>&1 || \
 	_fail "Failed to mount the second time"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 
 # Now try mount them at the same time, if kernel does not support
 # temp-fsid feature then mount will fail.
diff --git a/tests/btrfs/254 b/tests/btrfs/254
index d9c9eea9c7bf23..eda32be1c2b1d1 100755
--- a/tests/btrfs/254
+++ b/tests/btrfs/254
@@ -96,7 +96,7 @@ test_add_device()
 	$BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | \
 					_filter_btrfs_filesystem_show
 
-	$UMOUNT_PROG $seq_mnt
+	_umount $seq_mnt
 	_scratch_unmount
 	cleanup_dmdev
 }
diff --git a/tests/ext4/032 b/tests/ext4/032
index 9a1b9312cc42cc..6e98f4f4ebb8de 100755
--- a/tests/ext4/032
+++ b/tests/ext4/032
@@ -63,7 +63,7 @@ ext4_online_resize()
 	fi
 	cat $tmp.resize2fs >> $seqres.full
 	echo "+++ umount fs" | tee -a $seqres.full
-	$UMOUNT_PROG ${IMG_MNT}
+	_umount ${IMG_MNT}
 
 	echo "+++ check fs" | tee -a $seqres.full
 	_check_generic_filesystem $LOOP_DEVICE >> $seqres.full 2>&1 || \
@@ -77,7 +77,7 @@ _cleanup()
 	cd /
 	[ -n "$LOOP_DEVICE" ] && _destroy_loop_device $LOOP_DEVICE > /dev/null 2>&1
 	rm -f $tmp.*
-	$UMOUNT_PROG ${IMG_MNT} > /dev/null 2>&1
+	_umount ${IMG_MNT} > /dev/null 2>&1
 	rm -f ${IMG_FILE} > /dev/null 2>&1
 }
 
diff --git a/tests/ext4/052 b/tests/ext4/052
index edcdc02515f725..ce3f90eb7e6d02 100755
--- a/tests/ext4/052
+++ b/tests/ext4/052
@@ -18,7 +18,7 @@ _cleanup()
 	cd /
 	rm -r -f $tmp.*
 	if [ ! -z "$loop_mnt" ]; then
-		$UMOUNT_PROG $loop_mnt
+		_umount $loop_mnt
 		rm -rf $loop_mnt
 	fi
 	[ ! -z "$fs_img" ] && rm -rf $fs_img
@@ -63,7 +63,7 @@ then
     status=1
 fi
 
-$UMOUNT_PROG $loop_mnt || _fail "umount failed"
+_umount $loop_mnt || _fail "umount failed"
 loop_mnt=
 
 $E2FSCK_PROG -fn $fs_img >> $seqres.full 2>&1 || _fail "file system corrupted"
diff --git a/tests/ext4/053 b/tests/ext4/053
index 4f20d217d5fd7a..0beb2201260162 100755
--- a/tests/ext4/053
+++ b/tests/ext4/053
@@ -20,7 +20,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	if [ -n "$LOOP_LOGDEV" ];then
 		_destroy_loop_device $LOOP_LOGDEV 2>/dev/null
 	fi
@@ -237,7 +237,7 @@ not_mnt() {
 	if simple_mount -o $1 $SCRATCH_DEV $SCRATCH_MNT; then
 		print_log "(mount unexpectedly succeeded)"
 		fail
-		$UMOUNT_PROG $SCRATCH_MNT
+		_umount $SCRATCH_MNT
 		return
 	fi
 	ok
@@ -248,7 +248,7 @@ not_mnt() {
 		return
 	fi
 	not_remount $1
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 mnt_only() {
@@ -270,7 +270,7 @@ mnt() {
 	fi
 
 	mnt_only $*
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
 
 	[ "$t2fs" -eq 0 ] && return
 
@@ -289,7 +289,7 @@ mnt() {
 				    -e 's/data=writeback/journal_data_writeback/')
 	$TUNE2FS_PROG -o $op_set $SCRATCH_DEV > /dev/null 2>&1
 	mnt_only "defaults" $check
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
 	if [ "$op_set" = ^* ]; then
 		op_set=${op_set#^}
 	else
@@ -309,12 +309,12 @@ remount() {
 	do_mnt remount,$2 $3
 	if [ $? -ne 0 ]; then
 		fail
-		$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+		_umount $SCRATCH_MNT 2> /dev/null
 		return
 	else
 		ok
 	fi
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
 
 	# Now just specify mnt
 	print_log "mounting $fstype \"$1\" "
@@ -328,7 +328,7 @@ remount() {
 		ok
 	fi
 
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
 }
 
 # Test that the filesystem cannot be remounted with option(s) $1 (meaning that
@@ -364,7 +364,7 @@ mnt_then_not_remount() {
 		return
 	fi
 	not_remount $2
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 
@@ -400,8 +400,8 @@ LOGDEV_DEVNUM=`echo "${majmin%:*}*2^8 + ${majmin#*:}" | bc`
 fstype=
 for fstype in ext2 ext3 ext4; do
 
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
-	$UMOUNT_PROG $SCRATCH_DEV 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_DEV 2> /dev/null
 
 	do_mkfs $SCRATCH_DEV ${SIZE}k
 
@@ -418,7 +418,7 @@ for fstype in ext2 ext3 ext4; do
 		continue
 	fi
 
-	$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+	_umount $SCRATCH_MNT 2> /dev/null
 
 	not_mnt failme
 	mnt
@@ -552,7 +552,7 @@ for fstype in ext2 ext3 ext4; do
 	# dax mount options
 	simple_mount -o dax=always $SCRATCH_DEV $SCRATCH_MNT > /dev/null 2>&1
 	if [ $? -eq 0 ]; then
-		$UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+		_umount $SCRATCH_MNT 2> /dev/null
 		mnt dax
 		mnt dax=always
 		mnt dax=never
@@ -633,7 +633,7 @@ for fstype in ext2 ext3 ext4; do
 	not_remount jqfmt=vfsv1
 	not_remount noquota
 	mnt_only remount,usrquota,grpquota ^usrquota,^grpquota
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 
 	# test clearing/changing quota when enabled
 	do_mkfs -E quotatype=^prjquota $SCRATCH_DEV ${SIZE}k
@@ -654,7 +654,7 @@ for fstype in ext2 ext3 ext4; do
 	mnt_only remount,usrquota,grpquota usrquota,grpquota
 	quotaoff -f $SCRATCH_MNT >> $seqres.full 2>&1
 	mnt_only remount,noquota ^usrquota,^grpquota,quota
-	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 
 	# Quota feature
 	echo "== Testing quota feature " >> $seqres.full
@@ -696,7 +696,7 @@ for fstype in ext2 ext3 ext4; do
 
 done #for fstype in ext2 ext3 ext4; do
 
-$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+_umount $SCRATCH_MNT > /dev/null 2>&1
 echo "$ERR errors encountered" >> $seqres.full
 
 status=$ERR
diff --git a/tests/ext4/056 b/tests/ext4/056
index 8a290b11d69772..f9cb690fdfc80b 100755
--- a/tests/ext4/056
+++ b/tests/ext4/056
@@ -70,7 +70,7 @@ do_resize()
 	# delay
 	sleep 0.2
 	_scratch_unmount >> $seqres.full 2>&1 \
-		|| _fail "$UMOUNT_PROG failed. Exiting"
+		|| _fail "_umount failed. Exiting"
 }
 
 run_test()
diff --git a/tests/generic/042 b/tests/generic/042
index fd0ef705a18c3e..bea23ce29ac327 100755
--- a/tests/generic/042
+++ b/tests/generic/042
@@ -44,7 +44,7 @@ _crashtest()
 		_filter_xfs_io
 	$here/src/godown -f $mnt
 
-	$UMOUNT_PROG $mnt
+	_umount $mnt
 	_mount $img $mnt
 
 	# We should /never/ see 0xCD in the file, because we wrote that pattern
@@ -54,7 +54,7 @@ _crashtest()
 		_hexdump $file
 	fi
 
-	$UMOUNT_PROG $mnt
+	_umount $mnt
 }
 
 # Modify as appropriate.
diff --git a/tests/generic/067 b/tests/generic/067
index b6e984f5231753..19ee28d2cd945e 100755
--- a/tests/generic/067
+++ b/tests/generic/067
@@ -66,7 +66,7 @@ umount_symlink_device()
 	rm -f $symlink
 	echo "# umount symlink to device, which is not mounted" >>$seqres.full
 	ln -s $SCRATCH_DEV $symlink
-	$UMOUNT_PROG $symlink >>$seqres.full 2>&1
+	_umount $symlink >>$seqres.full 2>&1
 }
 
 # umount a path name that is 256 bytes long, this should fail gracefully,
@@ -78,7 +78,7 @@ umount_toolong_name()
 	_scratch_mount 2>&1 | tee -a $seqres.full
 
 	echo "# umount a too-long name" >>$seqres.full
-	$UMOUNT_PROG $longname >>$seqres.full 2>&1
+	_umount $longname >>$seqres.full 2>&1
 	_scratch_unmount 2>&1 | tee -a $seqres.full
 }
 
@@ -93,7 +93,7 @@ lazy_umount_symlink()
 	rm -f $symlink
 	ln -s $SCRATCH_MNT/testdir $symlink
 
-	$UMOUNT_PROG -l $symlink >>$seqres.full 2>&1
+	_umount -l $symlink >>$seqres.full 2>&1
 	# _scratch_unmount should not be blocked
 	_scratch_unmount 2>&1 | tee -a $seqres.full
 }
diff --git a/tests/generic/081 b/tests/generic/081
index 468c87ac9a9f0a..57dc07a36395f8 100755
--- a/tests/generic/081
+++ b/tests/generic/081
@@ -32,7 +32,7 @@ _cleanup()
 	# other tests to fail.
 	while test -e /dev/mapper/$vgname-$snapname || \
 	      test -e /dev/mapper/$vgname-$lvname; do
-		$UMOUNT_PROG $mnt >> $seqres.full 2>&1
+		_umount $mnt >> $seqres.full 2>&1
 		$LVM_PROG lvremove -f $vgname/$snapname >>$seqres.full 2>&1
 		$LVM_PROG lvremove -f $vgname/$lvname >>$seqres.full 2>&1
 		$LVM_PROG vgremove -f $vgname >>$seqres.full 2>&1
diff --git a/tests/generic/085 b/tests/generic/085
index cbabd257cad8f0..8c33386b7c383e 100755
--- a/tests/generic/085
+++ b/tests/generic/085
@@ -27,7 +27,7 @@ cleanup_dmdev()
 	$DMSETUP_PROG resume $lvdev >/dev/null 2>&1
 	[ -n "$pid" ] && kill -9 $pid 2>/dev/null
 	wait $pid
-	$UMOUNT_PROG $lvdev >/dev/null 2>&1
+	_umount $lvdev >/dev/null 2>&1
 	_dmsetup_remove $node
 }
 
@@ -70,7 +70,7 @@ done &
 pid=$!
 for ((i=0; i<100; i++)); do
 	_mount $lvdev $SCRATCH_MNT >/dev/null 2>&1
-	$UMOUNT_PROG $lvdev >/dev/null 2>&1
+	_umount $lvdev >/dev/null 2>&1
 done &
 pid="$pid $!"
 
diff --git a/tests/generic/108 b/tests/generic/108
index da13715f27ac21..e1df7ee1886cde 100755
--- a/tests/generic/108
+++ b/tests/generic/108
@@ -18,7 +18,7 @@ _cleanup()
 {
 	cd /
 	echo running > /sys/block/`_short_dev $SCSI_DEBUG_DEV`/device/state
-	$UMOUNT_PROG $SCRATCH_MNT >>$seqres.full 2>&1
+	_umount $SCRATCH_MNT >>$seqres.full 2>&1
 	$LVM_PROG vgremove -f $vgname >>$seqres.full 2>&1
 	$LVM_PROG pvremove -f $SCRATCH_DEV $SCSI_DEBUG_DEV >>$seqres.full 2>&1
 	$UDEV_SETTLE_PROG
diff --git a/tests/generic/361 b/tests/generic/361
index c2ebda3c1a01ad..456271b8d80308 100755
--- a/tests/generic/361
+++ b/tests/generic/361
@@ -16,7 +16,7 @@ _begin_fstest auto quick
 # Override the default cleanup function.
 _cleanup()
 {
-	$UMOUNT_PROG $fs_mnt
+	_umount $fs_mnt
 	_destroy_loop_device $loop_dev
 	cd /
 	rm -f $tmp.*
diff --git a/tests/generic/373 b/tests/generic/373
index 0d5a50cbee40b8..6ede189ead70bd 100755
--- a/tests/generic/373
+++ b/tests/generic/373
@@ -60,7 +60,7 @@ md5sum $testdir/file | _filter_scratch
 md5sum $othertestdir/otherfile | filter_otherdir
 
 echo "Unmount otherdir"
-$UMOUNT_PROG $otherdir
+_umount $otherdir
 rm -rf $otherdir
 
 # success, all done
diff --git a/tests/generic/374 b/tests/generic/374
index 977a2b268bbc98..bbdd8e66b4897b 100755
--- a/tests/generic/374
+++ b/tests/generic/374
@@ -59,7 +59,7 @@ echo "Check output"
 md5sum $testdir/file $othertestdir/otherfile | filter_md5
 
 echo "Unmount otherdir"
-$UMOUNT_PROG $otherdir
+_umount $otherdir
 rm -rf $otherdir
 
 # success, all done
diff --git a/tests/generic/459 b/tests/generic/459
index 32ee899f929819..e8799f75bf8e05 100755
--- a/tests/generic/459
+++ b/tests/generic/459
@@ -28,7 +28,7 @@ _cleanup()
 	xfs_freeze -u $SCRATCH_MNT 2>/dev/null
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $SCRATCH_MNT >>$seqres.full 2>&1
+	_umount $SCRATCH_MNT >>$seqres.full 2>&1
 	$LVM_PROG vgremove -ff $vgname >>$seqres.full 2>&1
 	$LVM_PROG pvremove -ff $SCRATCH_DEV >>$seqres.full 2>&1
 	$UDEV_SETTLE_PROG
diff --git a/tests/generic/604 b/tests/generic/604
index c2e03c2eabb871..124eea853ecf70 100755
--- a/tests/generic/604
+++ b/tests/generic/604
@@ -26,6 +26,8 @@ done
 # mount the base fs.  Delay the mount attempt by a small amount in the hope
 # that the mount() call will try to lock s_umount /after/ umount has already
 # taken it.
+# This is the /one/ place in fstests where we need to call the umount binary
+# directly.
 $UMOUNT_PROG $SCRATCH_MNT &
 sleep 0.01s ; _scratch_mount
 wait
diff --git a/tests/generic/648 b/tests/generic/648
index 29d1b470bded4a..3e995a02983931 100755
--- a/tests/generic/648
+++ b/tests/generic/648
@@ -20,7 +20,7 @@ _cleanup()
 	$KILLALL_PROG -9 fsstress > /dev/null 2>&1
 	wait
 	if [ -n "$loopmnt" ]; then
-		$UMOUNT_PROG $loopmnt 2>/dev/null
+		_umount $loopmnt 2>/dev/null
 		rm -r -f $loopmnt
 	fi
 	rm -f $tmp.*
@@ -111,7 +111,7 @@ while _soak_loop_running $((25 * TIME_FACTOR)); do
 
 	# Mount again to replay log after loading working table, so we have a
 	# consistent fs after test.
-	$UMOUNT_PROG $loopmnt
+	_umount $loopmnt
 	is_unmounted=1
 	# We must unmount dmerror at here, or whole later testing will crash.
 	# So try to umount enough times, before we have no choice.
@@ -137,7 +137,7 @@ done
 # Make sure the fs image file is ok
 if [ -f "$loopimg" ]; then
 	if _mount $loopimg $loopmnt -o loop; then
-		$UMOUNT_PROG $loopmnt &> /dev/null
+		_umount $loopmnt &> /dev/null
 	else
 		_metadump_dev $DMERROR_DEV $seqres.scratch.final.md
 		echo "final scratch mount failed"
diff --git a/tests/generic/698 b/tests/generic/698
index 28928b2fb32532..f432837a216f82 100755
--- a/tests/generic/698
+++ b/tests/generic/698
@@ -17,8 +17,8 @@ _begin_fstest auto quick perms attr idmapped mount
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $SCRATCH_MNT/target-mnt 2>/dev/null
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT/target-mnt 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 	rm -r -f $tmp.*
 }
 
diff --git a/tests/generic/699 b/tests/generic/699
index 677307538a484b..5cff1cbaa67c4e 100755
--- a/tests/generic/699
+++ b/tests/generic/699
@@ -15,9 +15,9 @@ _begin_fstest auto quick perms attr idmapped mount
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $SCRATCH_MNT/target-mnt
-	$UMOUNT_PROG $SCRATCH_MNT/ovl-merge 2>/dev/null
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT/target-mnt
+	_umount $SCRATCH_MNT/ovl-merge 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 	rm -r -f $tmp.*
 }
 
@@ -113,7 +113,7 @@ setup_overlayfs_idmapped_lower_metacopy_on()
 
 reset_overlayfs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT/ovl-merge 2>/dev/null
+	_umount $SCRATCH_MNT/ovl-merge 2>/dev/null
 	rm -rf $upper $work $merge
 }
 
diff --git a/tests/generic/704 b/tests/generic/704
index f39d47066ccc4a..31d52a97b37f9d 100755
--- a/tests/generic/704
+++ b/tests/generic/704
@@ -14,7 +14,7 @@ _cleanup()
 {
 	cd /
 	rm -r -f $tmp.*
-	[ -d "$SCSI_DEBUG_MNT" ] && $UMOUNT_PROG $SCSI_DEBUG_MNT 2>/dev/null
+	[ -d "$SCSI_DEBUG_MNT" ] && _umount $SCSI_DEBUG_MNT 2>/dev/null
 	_put_scsi_debug_dev
 }
 
diff --git a/tests/generic/730 b/tests/generic/730
index 062314ea01e7b5..650c604d5fbefd 100755
--- a/tests/generic/730
+++ b/tests/generic/730
@@ -12,7 +12,7 @@ _begin_fstest auto quick
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $SCSI_DEBUG_MNT >>$seqres.full 2>&1
+	_umount $SCSI_DEBUG_MNT >>$seqres.full 2>&1
 	_put_scsi_debug_dev
 	rm -f $tmp.*
 }
diff --git a/tests/generic/731 b/tests/generic/731
index cd39e8b09e3906..2621f6e237741d 100755
--- a/tests/generic/731
+++ b/tests/generic/731
@@ -13,7 +13,7 @@ _begin_fstest auto quick
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $SCSI_DEBUG_MNT >>$seqres.full 2>&1
+	_umount $SCSI_DEBUG_MNT >>$seqres.full 2>&1
 	_put_scsi_debug_dev
 	rm -f $tmp.*
 }
diff --git a/tests/generic/732 b/tests/generic/732
index d08028c2333d1b..63406ddc163f2c 100755
--- a/tests/generic/732
+++ b/tests/generic/732
@@ -15,8 +15,8 @@ _begin_fstest auto quick rename
 # Override the default cleanup function.
 _cleanup()
 {
-	$UMOUNT_PROG $testdir1 2>/dev/null
-	$UMOUNT_PROG $testdir2 2>/dev/null
+	_umount $testdir1 2>/dev/null
+	_umount $testdir2 2>/dev/null
 	cd /
 	rm -r -f $tmp.*
 }
diff --git a/tests/generic/746 b/tests/generic/746
index 651affe07b40bc..2b40c964371175 100755
--- a/tests/generic/746
+++ b/tests/generic/746
@@ -38,7 +38,7 @@ esac
 # Override the default cleanup function.
 _cleanup()
 {
-	$UMOUNT_PROG $loop_dev &> /dev/null
+	_umount $loop_dev &> /dev/null
 	_destroy_loop_device $loop_dev
 	if [ $status -eq 0 ]; then
 		rm -rf $tmp
@@ -53,7 +53,7 @@ get_holes()
 	# in-core state that will perturb the free space map on umount.  Stick
 	# to established convention which requires the filesystem to be
 	# unmounted while we probe the underlying file.
-	$UMOUNT_PROG $loop_mnt
+	_umount $loop_mnt
 
 	# FIEMAP only works on regular files, so call it on the backing file
 	# and not the loop device like everything else
@@ -66,7 +66,7 @@ get_free_sectors()
 {
 	case $FSTYP in
 	ext4)
-	$UMOUNT_PROG $loop_mnt
+	_umount $loop_mnt
 	$DUMPE2FS_PROG $loop_dev  2>&1 | grep " Free blocks" | cut -d ":" -f2- | \
 		tr ',' '\n' | $SED_PROG 's/^ //' | \
 		$AWK_PROG -v spb=$sectors_per_block 'BEGIN{FS="-"};
@@ -80,7 +80,7 @@ get_free_sectors()
 	xfs)
 	agsize=`$XFS_INFO_PROG $loop_mnt | $SED_PROG -n 's/.*agsize=\(.*\) blks.*/\1/p'`
 	# Convert free space (agno, block, length) to (start sector, end sector)
-	$UMOUNT_PROG $loop_mnt
+	_umount $loop_mnt
 	$XFS_DB_PROG -r -c "freesp -d" $loop_dev | $SED_PROG '/^.*from/,$d'| \
 		 $AWK_PROG -v spb=$sectors_per_block -v agsize=$agsize \
 		'{ print spb * ($1 * agsize + $2), spb * ($1 * agsize + $2 + $3) - 1 }'
diff --git a/tests/overlay/003 b/tests/overlay/003
index 41ad99e794d8ee..0a2cb928ea5c58 100755
--- a/tests/overlay/003
+++ b/tests/overlay/003
@@ -56,7 +56,7 @@ rm -rf ${SCRATCH_MNT}/*
 ls ${SCRATCH_MNT}/
 
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "Silence is golden"
 # success, all done
diff --git a/tests/overlay/004 b/tests/overlay/004
index bea4bb543f3611..4591d4e8487ce2 100755
--- a/tests/overlay/004
+++ b/tests/overlay/004
@@ -53,7 +53,7 @@ _user_do "chmod u-X ${SCRATCH_MNT}/attr_file2 > /dev/null 2>&1"
 stat -c %a ${SCRATCH_MNT}/attr_file2
 
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # check mode bits of the file that has been copied up, and
 # the file that should not have been copied up.
diff --git a/tests/overlay/005 b/tests/overlay/005
index 01914ee17b9a30..6b382ddb50d873 100755
--- a/tests/overlay/005
+++ b/tests/overlay/005
@@ -75,14 +75,14 @@ $XFS_IO_PROG -f -c "o" ${SCRATCH_MNT}/test_file \
 	>>$seqres.full 2>&1
 
 # unmount overlayfs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # check overlayfs
 _overlay_check_scratch_dirs $lowerd $upperd $workd
 
 # unmount undelying xfs, this tiggers panic if memleak happens
-$UMOUNT_PROG ${OVL_BASE_SCRATCH_MNT}/uppermnt
-$UMOUNT_PROG ${OVL_BASE_SCRATCH_MNT}/lowermnt
+_umount ${OVL_BASE_SCRATCH_MNT}/uppermnt
+_umount ${OVL_BASE_SCRATCH_MNT}/lowermnt
 
 # success, all done
 echo "Silence is golden"
diff --git a/tests/overlay/014 b/tests/overlay/014
index f07fc685572b92..08850d489e4b49 100755
--- a/tests/overlay/014
+++ b/tests/overlay/014
@@ -46,7 +46,7 @@ _overlay_scratch_mount_dirs $lowerdir1 $lowerdir2 $workdir2
 rm -rf $SCRATCH_MNT/testdir
 mkdir -p $SCRATCH_MNT/testdir/visibledir
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # check overlayfs
 _overlay_check_scratch_dirs $lowerdir1 $lowerdir2 $workdir2
@@ -59,7 +59,7 @@ touch $SCRATCH_MNT/testdir/visiblefile
 
 # umount and mount overlay again, buggy kernel treats the copied-up dir as
 # opaque, visibledir is not seen in merged dir.
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_scratch_mount_dirs "$lowerdir2:$lowerdir1" $upperdir $workdir
 ls $SCRATCH_MNT/testdir
 
diff --git a/tests/overlay/022 b/tests/overlay/022
index d33bd29781a356..40b0dd64f6fc6c 100755
--- a/tests/overlay/022
+++ b/tests/overlay/022
@@ -17,7 +17,7 @@ _begin_fstest auto quick mount nested
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $tmp/mnt > /dev/null 2>&1
+	_umount $tmp/mnt > /dev/null 2>&1
 	rm -rf $tmp
 	rm -f $tmp.*
 }
diff --git a/tests/overlay/025 b/tests/overlay/025
index 6ba46191b557be..0abc8bf80b1716 100755
--- a/tests/overlay/025
+++ b/tests/overlay/025
@@ -19,8 +19,8 @@ _begin_fstest auto quick attr
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $tmpfsdir/mnt
-	$UMOUNT_PROG $tmpfsdir
+	_umount $tmpfsdir/mnt
+	_umount $tmpfsdir
 	rm -rf $tmpfsdir
 	rm -f $tmp.*
 }
diff --git a/tests/overlay/029 b/tests/overlay/029
index 4bade9a0e129a4..007973dc075923 100755
--- a/tests/overlay/029
+++ b/tests/overlay/029
@@ -22,7 +22,7 @@ _begin_fstest auto quick nested
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $tmp/mnt
+	_umount $tmp/mnt
 	rm -rf $tmp
 	rm -f $tmp.*
 }
@@ -56,7 +56,7 @@ _overlay_mount_dirs $SCRATCH_MNT/up $tmp/{upper,work} \
   overlay $tmp/mnt
 # accessing file in the second mount
 cat $tmp/mnt/foo
-$UMOUNT_PROG $tmp/mnt
+_umount $tmp/mnt
 
 # re-create upper/work to avoid ovl_verify_origin() mount failure
 # when index is enabled
@@ -66,7 +66,7 @@ mkdir -p $tmp/{upper,work}
 _overlay_mount_dirs $SCRATCH_MNT/low $tmp/{upper,work} \
   overlay $tmp/mnt
 cat $tmp/mnt/bar
-$UMOUNT_PROG $tmp/mnt
+_umount $tmp/mnt
 
 rm -rf $tmp/{upper,work}
 mkdir -p $tmp/{upper,work}
diff --git a/tests/overlay/031 b/tests/overlay/031
index dd9dfcdb970ac7..31d22d1cadae41 100755
--- a/tests/overlay/031
+++ b/tests/overlay/031
@@ -28,7 +28,7 @@ create_whiteout()
 
 	rm -f $SCRATCH_MNT/testdir/$file
 
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Import common functions.
@@ -68,7 +68,7 @@ rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
 
 # umount overlay again, create a new file with the same name and
 # mount overlay again.
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 touch $lowerdir1/testdir
 
 _overlay_scratch_mount_dirs $lowerdir1 $upperdir $workdir
@@ -77,7 +77,7 @@ _overlay_scratch_mount_dirs $lowerdir1 $upperdir $workdir
 # it will not clean up the dir and lead to residue.
 rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # let lower dir have invalid whiteouts, repeat ls and rmdir test again.
 rm -rf $lowerdir1/testdir
@@ -92,7 +92,7 @@ _overlay_scratch_mount_dirs "$lowerdir1:$lowerdir2" $upperdir $workdir
 ls $SCRATCH_MNT/testdir
 rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # let lower dir and upper dir both have invalid whiteouts, repeat ls and rmdir again.
 rm -rf $lowerdir1/testdir
diff --git a/tests/overlay/035 b/tests/overlay/035
index cede58790e1b9d..c6ce1318fbbb37 100755
--- a/tests/overlay/035
+++ b/tests/overlay/035
@@ -43,7 +43,7 @@ mkdir -p $lowerdir1 $lowerdir2 $upperdir $workdir
 _overlay_scratch_mount_opts -o"lowerdir=$lowerdir2:$lowerdir1"
 touch $SCRATCH_MNT/foo 2>&1 | _filter_scratch
 _mount -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Make workdir immutable to prevent workdir re-create on mount
 $CHATTR_PROG +i $workdir
diff --git a/tests/overlay/036 b/tests/overlay/036
index 19a181bbdd9361..f902617d4ab0a2 100755
--- a/tests/overlay/036
+++ b/tests/overlay/036
@@ -34,8 +34,8 @@ _cleanup()
 	cd /
 	rm -f $tmp.*
 	# unmount the two extra mounts in case they did not fail
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 }
 
 # Import common functions.
@@ -66,13 +66,13 @@ _overlay_mount_dirs $lowerdir $upperdir $workdir \
 # with index=off - expect success
 _overlay_mount_dirs $lowerdir $upperdir $workdir2 \
 		    overlay0 $SCRATCH_MNT -oindex=off && \
-		    $UMOUNT_PROG $SCRATCH_MNT
+		    _umount $SCRATCH_MNT
 
 # Try to mount another overlay with the same workdir
 # with index=off - expect success
 _overlay_mount_dirs $lowerdir2 $upperdir2 $workdir \
 		    overlay1 $SCRATCH_MNT -oindex=off && \
-		    $UMOUNT_PROG $SCRATCH_MNT
+		    _umount $SCRATCH_MNT
 
 # Try to mount another overlay with the same upperdir
 # with index=on - expect EBUSY
diff --git a/tests/overlay/037 b/tests/overlay/037
index 834e176380ebea..c278e7cab1fe05 100755
--- a/tests/overlay/037
+++ b/tests/overlay/037
@@ -39,17 +39,17 @@ mkdir -p $lowerdir $lowerdir2 $upperdir $upperdir2 $workdir
 # Mount overlay with lowerdir, upperdir, workdir and index=on
 # to store the file handles of lowerdir and upperdir in overlay.origin xattr
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -oindex=on
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Try to mount an overlay with the same upperdir and different lowerdir - expect ESTALE
 _overlay_scratch_mount_dirs $lowerdir2 $upperdir $workdir -oindex=on \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 # Try to mount an overlay with the same workdir and different upperdir - expect ESTALE
 _overlay_scratch_mount_dirs $lowerdir $upperdir2 $workdir -oindex=on \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 # Mount overlay with original lowerdir, upperdir, workdir and index=on - expect success
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -oindex=on
diff --git a/tests/overlay/040 b/tests/overlay/040
index 11c7bf129a3626..47f50eb0638da0 100755
--- a/tests/overlay/040
+++ b/tests/overlay/040
@@ -48,7 +48,7 @@ _scratch_mount
 # modify lower origin file.
 $CHATTR_PROG +i $SCRATCH_MNT/foo > /dev/null 2>&1
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # touching origin file in lower, should succeed
 touch $lowerdir/foo
diff --git a/tests/overlay/041 b/tests/overlay/041
index 36491b8fa0edf6..52ca351b66d86c 100755
--- a/tests/overlay/041
+++ b/tests/overlay/041
@@ -142,7 +142,7 @@ subdir_d=$($here/src/t_dir_type $pure_lower_dir $pure_lower_subdir_st_ino)
 [[ $subdir_d == "subdir d" ]] || \
 	echo "Merged dir: Invalid d_ino reported for subdir"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # check overlayfs
 _overlay_check_scratch_dirs $lowerdir $upperdir $workdir -o xino=on
diff --git a/tests/overlay/042 b/tests/overlay/042
index aaa10da33e0249..ddd4173abee8ce 100755
--- a/tests/overlay/042
+++ b/tests/overlay/042
@@ -45,7 +45,7 @@ _scratch_mount -o index=off
 # Copy up lower and create upper hardlink with no index
 ln $SCRATCH_MNT/0 $SCRATCH_MNT/1
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Add lower hardlinks while overlay is offline
 ln $lowerdir/0 $lowerdir/2
diff --git a/tests/overlay/043 b/tests/overlay/043
index 7325c653ab5cab..15cb9bf4bafaca 100755
--- a/tests/overlay/043
+++ b/tests/overlay/043
@@ -126,7 +126,7 @@ echo 3 > /proc/sys/vm/drop_caches
 check_inode_numbers $testdir $tmp.after_copyup $tmp.after_move
 
 # Verify that the inode numbers survive a mount cycle
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -o redirect_dir=on,xino=on
 
 # Compare inode numbers before/after mount cycle
diff --git a/tests/overlay/044 b/tests/overlay/044
index 4d04d883efd695..5f09cc31c32a1e 100755
--- a/tests/overlay/044
+++ b/tests/overlay/044
@@ -99,7 +99,7 @@ cat $FILES
 check_ino_nlink $SCRATCH_MNT $tmp.before $tmp.after_one
 
 # Verify that the hardlinks survive a mount cycle
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_check_scratch_dirs $lowerdir $upperdir $workdir -o index=on,xino=on
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -o index=on,xino=on
 
diff --git a/tests/overlay/048 b/tests/overlay/048
index 897e797e2ff549..4bd9753666bf6c 100755
--- a/tests/overlay/048
+++ b/tests/overlay/048
@@ -32,7 +32,7 @@ report_nlink()
 		_ls_l $SCRATCH_MNT/$f | awk '{ print $2, $9 }' | _filter_scratch
 	done
 
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Create lower hardlinks
@@ -101,7 +101,7 @@ touch $SCRATCH_MNT/1
 touch $SCRATCH_MNT/2
 
 # Perform the rest of the changes offline
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 test_hardlinks_offline
 
diff --git a/tests/overlay/049 b/tests/overlay/049
index 3ee500c5dd13b8..b091330ea26e2c 100755
--- a/tests/overlay/049
+++ b/tests/overlay/049
@@ -32,7 +32,7 @@ create_redirect()
 	touch $SCRATCH_MNT/origin/file
 	mv $SCRATCH_MNT/origin $SCRATCH_MNT/$redirect
 
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Import common functions.
diff --git a/tests/overlay/050 b/tests/overlay/050
index ec936e2a758f81..7c8ed1a4e96e8c 100755
--- a/tests/overlay/050
+++ b/tests/overlay/050
@@ -76,7 +76,7 @@ mount_dirs()
 # Unmount the overlay without unmounting base fs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Check non-stale file handles of lower/upper files and verify
diff --git a/tests/overlay/051 b/tests/overlay/051
index 9404dbbab90f15..2dadb5a3027180 100755
--- a/tests/overlay/051
+++ b/tests/overlay/051
@@ -28,7 +28,7 @@ _cleanup()
 	# Cleanup overlay scratch mount that is holding base test mount
 	# to prevent _check_test_fs and _test_umount from failing before
 	# _check_scratch_fs _scratch_umount
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 }
 
 # Import common functions.
@@ -103,7 +103,7 @@ mount_dirs()
 # underlying dirs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 
 	_overlay_check_scratch_dirs $middle:$lower $upper $work \
 				-o "index=on,nfs_export=on"
diff --git a/tests/overlay/052 b/tests/overlay/052
index 37402067dbe65e..e3366ea44147cb 100755
--- a/tests/overlay/052
+++ b/tests/overlay/052
@@ -73,7 +73,7 @@ mount_dirs()
 # Unmount the overlay without unmounting base fs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Check non-stale file handles of lower/upper moved files
diff --git a/tests/overlay/053 b/tests/overlay/053
index f7891aceda7246..87f748cefd3338 100755
--- a/tests/overlay/053
+++ b/tests/overlay/053
@@ -30,7 +30,7 @@ _cleanup()
 	# Cleanup overlay scratch mount that is holding base test mount
 	# to prevent _check_test_fs and _test_umount from failing before
 	# _check_scratch_fs _scratch_umount
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 }
 
 # Import common functions.
@@ -99,7 +99,7 @@ mount_dirs()
 # underlying dirs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 
 	_overlay_check_scratch_dirs $middle:$lower $upper $work \
 				-o "index=on,nfs_export=on,redirect_dir=on"
diff --git a/tests/overlay/054 b/tests/overlay/054
index 8d7f026a2d9b00..566d266a1ad788 100755
--- a/tests/overlay/054
+++ b/tests/overlay/054
@@ -87,7 +87,7 @@ mount_dirs()
 # Unmount the overlay without unmounting base fs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Check encode/decode/read file handles of dir with non-indexed ancestor
diff --git a/tests/overlay/055 b/tests/overlay/055
index 87a348c94489b8..a5b169956f4c09 100755
--- a/tests/overlay/055
+++ b/tests/overlay/055
@@ -37,7 +37,7 @@ _cleanup()
 	# Cleanup overlay scratch mount that is holding base test mount
 	# to prevent _check_test_fs and _test_umount from failing before
 	# _check_scratch_fs _scratch_umount
-	$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+	_umount $SCRATCH_MNT 2>/dev/null
 }
 
 # Import common functions.
@@ -109,7 +109,7 @@ mount_dirs()
 # underlying dirs
 unmount_dirs()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 
 	_overlay_check_scratch_dirs $middle:$lower $upper $work \
 				-o "index=on,nfs_export=on,redirect_dir=on"
diff --git a/tests/overlay/056 b/tests/overlay/056
index 158f34d05c22e9..01c319d7263f3c 100755
--- a/tests/overlay/056
+++ b/tests/overlay/056
@@ -73,7 +73,7 @@ mkdir $lowerdir/testdir2/subdir
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir
 touch $SCRATCH_MNT/testdir1/foo
 touch $SCRATCH_MNT/testdir2/subdir
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 remove_impure $upperdir/testdir1
 remove_impure $upperdir/testdir2
 
diff --git a/tests/overlay/057 b/tests/overlay/057
index da7ffda30277d9..b631d431a37b47 100755
--- a/tests/overlay/057
+++ b/tests/overlay/057
@@ -48,7 +48,7 @@ _overlay_scratch_mount_dirs $lowerdir $lowerdir2 $workdir2 -o redirect_dir=on
 # Create opaque parent with absolute redirect child in middle layer
 mkdir $SCRATCH_MNT/pure
 mv $SCRATCH_MNT/origin $SCRATCH_MNT/pure/redirect
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_scratch_mount_dirs $lowerdir2:$lowerdir $upperdir $workdir -o redirect_dir=on
 mv $SCRATCH_MNT/pure/redirect $SCRATCH_MNT/redirect
 # List content of renamed merge dir before mount cycle
@@ -56,7 +56,7 @@ ls $SCRATCH_MNT/redirect/
 
 # Verify that redirects are followed by listing content of renamed merge dir
 # after mount cycle
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_scratch_mount_dirs $lowerdir2:$lowerdir $upperdir $workdir -o redirect_dir=on
 ls $SCRATCH_MNT/redirect/
 
diff --git a/tests/overlay/059 b/tests/overlay/059
index c48d2a82c76ec4..84b5c80eb984de 100755
--- a/tests/overlay/059
+++ b/tests/overlay/059
@@ -33,7 +33,7 @@ create_origin_ref()
 	_scratch_mount -o redirect_dir=on
 	mv $SCRATCH_MNT/origin $SCRATCH_MNT/$ref
 
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Import common functions.
diff --git a/tests/overlay/060 b/tests/overlay/060
index bb61fcfa644342..3d0ea353feaa9a 100755
--- a/tests/overlay/060
+++ b/tests/overlay/060
@@ -130,7 +130,7 @@ mount_ro_overlay()
 
 umount_overlay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 # Assumes it is called with overlay mounted.
diff --git a/tests/overlay/062 b/tests/overlay/062
index 9a1db7419c4ca2..97a1bd8c12f20e 100755
--- a/tests/overlay/062
+++ b/tests/overlay/062
@@ -18,7 +18,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $lowertestdir
+	_umount $lowertestdir
 }
 
 # Import common functions.
diff --git a/tests/overlay/063 b/tests/overlay/063
index d9f30606a92d44..a50e63665202f0 100755
--- a/tests/overlay/063
+++ b/tests/overlay/063
@@ -40,7 +40,7 @@ rm ${upperdir}/file
 mkdir ${SCRATCH_MNT}/file > /dev/null 2>&1
 
 # unmount overlayfs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "Silence is golden"
 # success, all done
diff --git a/tests/overlay/065 b/tests/overlay/065
index fb6d6dd1bfcc0e..26f1c4bde4da90 100755
--- a/tests/overlay/065
+++ b/tests/overlay/065
@@ -30,7 +30,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $mnt2 2>/dev/null
+	_umount $mnt2 2>/dev/null
 }
 
 # Import common functions.
@@ -63,7 +63,7 @@ mkdir -p $lowerdir/lower $upperdir $workdir
 echo Conflicting upperdir/lowerdir
 _overlay_scratch_mount_dirs $upperdir $upperdir $workdir \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 # Use new upper/work dirs for each test to avoid ESTALE errors
 # on mismatch lowerdir/upperdir (see test overlay/037)
@@ -75,7 +75,7 @@ mkdir $upperdir $workdir
 echo Conflicting workdir/lowerdir
 _overlay_scratch_mount_dirs $workdir $upperdir $workdir \
 	-oindex=off 2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir $workdir
 mkdir -p $upperdir/lower $workdir
@@ -85,7 +85,7 @@ mkdir -p $upperdir/lower $workdir
 echo Overlapping upperdir/lowerdir
 _overlay_scratch_mount_dirs $upperdir/lower $upperdir $workdir \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir $workdir
 mkdir $upperdir $workdir
@@ -94,7 +94,7 @@ mkdir $upperdir $workdir
 echo Conflicting lower layers
 _overlay_scratch_mount_dirs $lowerdir:$lowerdir $upperdir $workdir \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir $workdir
 mkdir $upperdir $workdir
@@ -103,7 +103,7 @@ mkdir $upperdir $workdir
 echo Overlapping lower layers below
 _overlay_scratch_mount_dirs $lowerdir:$lowerdir/lower $upperdir $workdir \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir $workdir
 mkdir $upperdir $workdir
@@ -112,7 +112,7 @@ mkdir $upperdir $workdir
 echo Overlapping lower layers above
 _overlay_scratch_mount_dirs $lowerdir/lower:$lowerdir $upperdir $workdir \
 	2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir $workdir
 mkdir -p $upperdir/upper $workdir $mnt2
@@ -129,14 +129,14 @@ mkdir -p $upperdir2 $workdir2 $mnt2
 echo "Overlapping with upperdir of another instance (index=on)"
 _overlay_scratch_mount_dirs $upperdir/upper $upperdir2 $workdir2 \
 	-oindex=on 2>&1 | _filter_busy_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir2 $workdir2
 mkdir -p $upperdir2 $workdir2
 
 echo "Overlapping with upperdir of another instance (index=off)"
 _overlay_scratch_mount_dirs $upperdir/upper $upperdir2 $workdir2 \
-	-oindex=off && $UMOUNT_PROG $SCRATCH_MNT
+	-oindex=off && _umount $SCRATCH_MNT
 
 rm -rf $upperdir2 $workdir2
 mkdir -p $upperdir2 $workdir2
@@ -146,14 +146,14 @@ mkdir -p $upperdir2 $workdir2
 echo "Overlapping with workdir of another instance (index=on)"
 _overlay_scratch_mount_dirs $workdir/work $upperdir2 $workdir2 \
 	-oindex=on 2>&1 | _filter_busy_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 rm -rf $upperdir2 $workdir2
 mkdir -p $upperdir2 $workdir2
 
 echo "Overlapping with workdir of another instance (index=off)"
 _overlay_scratch_mount_dirs $workdir/work $upperdir2 $workdir2 \
-	-oindex=off && $UMOUNT_PROG $SCRATCH_MNT
+	-oindex=off && _umount $SCRATCH_MNT
 
 # Move upper layer root into lower layer after mount
 echo Overlapping upperdir/lowerdir after mount
diff --git a/tests/overlay/067 b/tests/overlay/067
index bb09a6042b275d..12a1781c149644 100755
--- a/tests/overlay/067
+++ b/tests/overlay/067
@@ -70,7 +70,7 @@ stat $testfile >>$seqres.full
 diff -q $realfile $testfile >>$seqres.full &&
 	echo "diff with middle layer file doesn't know right from wrong! (cold cache)"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # check overlayfs
 _overlay_check_scratch_dirs $middle:$lower $upper $work -o xino=off
 
diff --git a/tests/overlay/068 b/tests/overlay/068
index 0d33cf12de8550..480ba67e33ea74 100755
--- a/tests/overlay/068
+++ b/tests/overlay/068
@@ -28,7 +28,7 @@ _cleanup()
 	cd /
 	rm -f $tmp.*
 	# Unmount the nested overlay mount
-	$UMOUNT_PROG $mnt2 2>/dev/null
+	_umount $mnt2 2>/dev/null
 }
 
 # Import common functions.
@@ -100,7 +100,7 @@ mount_dirs()
 unmount_dirs()
 {
 	# unmount & check nested overlay
-	$UMOUNT_PROG $mnt2
+	_umount $mnt2
 	_overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \
 		-o "index=on,nfs_export=on,redirect_dir=on"
 
diff --git a/tests/overlay/069 b/tests/overlay/069
index 373ab1ee3dc115..67969eebbfcaa3 100755
--- a/tests/overlay/069
+++ b/tests/overlay/069
@@ -28,7 +28,7 @@ _cleanup()
 	cd /
 	rm -f $tmp.*
 	# Unmount the nested overlay mount
-	$UMOUNT_PROG $mnt2 2>/dev/null
+	_umount $mnt2 2>/dev/null
 }
 
 # Import common functions.
@@ -108,12 +108,12 @@ mount_dirs()
 unmount_dirs()
 {
 	# unmount & check nested overlay
-	$UMOUNT_PROG $mnt2
+	_umount $mnt2
 	_overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \
 		-o "index=on,nfs_export=on,redirect_dir=on"
 
 	# unmount & check underlying overlay
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_overlay_check_dirs $lower $upper $work \
 		-o "index=on,nfs_export=on,redirect_dir=on"
 }
diff --git a/tests/overlay/070 b/tests/overlay/070
index 36991229f28fe7..104b5f492088d6 100755
--- a/tests/overlay/070
+++ b/tests/overlay/070
@@ -26,7 +26,7 @@ _cleanup()
 	cd /
 	rm -f $tmp.*
 	# Unmount the nested overlay mount
-	$UMOUNT_PROG $mnt2 2>/dev/null
+	_umount $mnt2 2>/dev/null
 	[ -z "$loopdev" ] || _destroy_loop_device $loopdev
 }
 
@@ -93,12 +93,12 @@ mount_dirs()
 unmount_dirs()
 {
 	# unmount & check nested overlay
-	$UMOUNT_PROG $mnt2
+	_umount $mnt2
 	_overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \
 		-o "redirect_dir=on,index=on,xino=on"
 
 	# unmount & check underlying overlay
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_overlay_check_scratch_dirs $lower $upper $work \
 		-o "index=on,nfs_export=on"
 }
diff --git a/tests/overlay/071 b/tests/overlay/071
index 2a6313142d09d2..c58347f6cdb1c6 100755
--- a/tests/overlay/071
+++ b/tests/overlay/071
@@ -29,7 +29,7 @@ _cleanup()
 	cd /
 	rm -f $tmp.*
 	# Unmount the nested overlay mount
-	$UMOUNT_PROG $mnt2 2>/dev/null
+	_umount $mnt2 2>/dev/null
 	[ -z "$loopdev" ] || _destroy_loop_device $loopdev
 }
 
@@ -103,12 +103,12 @@ mount_dirs()
 unmount_dirs()
 {
 	# unmount & check nested overlay
-	$UMOUNT_PROG $mnt2
+	_umount $mnt2
 	_overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \
 		-o "redirect_dir=on,index=on,xino=on"
 
 	# unmount & check underlying overlay
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_overlay_check_dirs $lower $upper $work \
 		-o "index=on,nfs_export=on"
 }
diff --git a/tests/overlay/076 b/tests/overlay/076
index fb94dff685b6cc..28bf2d305b94d7 100755
--- a/tests/overlay/076
+++ b/tests/overlay/076
@@ -47,7 +47,7 @@ _scratch_mount
 # on kernel v5.10..v5.10.14.  Anything but hang is considered a test success.
 $CHATTR_PROG +i $SCRATCH_MNT/foo > /dev/null 2>&1
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # success, all done
 echo "Silence is golden"
diff --git a/tests/overlay/077 b/tests/overlay/077
index 00de0825aea6dc..cff24800469362 100755
--- a/tests/overlay/077
+++ b/tests/overlay/077
@@ -65,7 +65,7 @@ mv $SCRATCH_MNT/f100 $SCRATCH_MNT/former/
 
 # Remove the lower directory and mount overlay again to create
 # a "former merge dir"
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 rm -rf $lowerdir/former
 _scratch_mount
 
diff --git a/tests/overlay/078 b/tests/overlay/078
index d6df11f6852f45..bcc5aff1b7dc89 100755
--- a/tests/overlay/078
+++ b/tests/overlay/078
@@ -61,7 +61,7 @@ do_check()
 
 	echo "Test chattr +$1 $2" >> $seqres.full
 
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 
 	# Add attribute to lower file
 	$CHATTR_PROG +$attr $lowertestfile
diff --git a/tests/overlay/079 b/tests/overlay/079
index cfcafceea56e66..f8926e091ca137 100755
--- a/tests/overlay/079
+++ b/tests/overlay/079
@@ -156,7 +156,7 @@ mount_ro_overlay()
 
 umount_overlay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 test_no_access()
diff --git a/tests/overlay/080 b/tests/overlay/080
index ce5c2375fb3154..94fe33ae7db4d2 100755
--- a/tests/overlay/080
+++ b/tests/overlay/080
@@ -264,7 +264,7 @@ mount_overlay()
 
 umount_overlay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 
diff --git a/tests/overlay/081 b/tests/overlay/081
index 2270a04750da1f..454eea2cd96576 100755
--- a/tests/overlay/081
+++ b/tests/overlay/081
@@ -46,7 +46,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir)
 	echo "Overlayfs (uuid=null) and upper fs fsid differ"
 
 # Keep base fs mounted in case it has a volatile fsid (e.g. tmpfs)
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test legacy behavior is preserved by default for existing "impure" overlayfs
 _scratch_mount
@@ -55,7 +55,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir)
 [[ "$ovl_fsid" == "$upper_fsid" ]] || \
 	echo "Overlayfs (after uuid=null) and upper fs fsid differ"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test unique fsid on explicit opt-in for existing "impure" overlayfs
 _scratch_mount -o uuid=on
@@ -65,7 +65,7 @@ ovl_unique_fsid=$ovl_fsid
 [[ "$ovl_fsid" != "$upper_fsid" ]] || \
 	echo "Overlayfs (uuid=on) and upper fs fsid are the same"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test unique fsid is persistent by default after it was created
 _scratch_mount
@@ -74,7 +74,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir)
 [[ "$ovl_fsid" == "$ovl_unique_fsid" ]] || \
 	echo "Overlayfs (after uuid=on) unique fsid is not persistent"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test ignore existing persistent fsid on explicit opt-out
 _scratch_mount -o uuid=null
@@ -83,7 +83,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir)
 [[ "$ovl_fsid" == "$upper_fsid" ]] || \
 	echo "Overlayfs (uuid=null) and upper fs fsid differ"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test fallback to uuid=null with non-upper ovelray
 _overlay_scratch_mount_dirs "$upperdir:$lowerdir" "-" "-" -o ro,uuid=on
@@ -110,7 +110,7 @@ ovl_unique_fsid=$ovl_fsid
 [[ "$ovl_fsid" != "$upper_fsid" ]] || \
 	echo "Overlayfs (new) and upper fs fsid are the same"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 # Test unique fsid is persistent by default after it was created
 _scratch_mount -o uuid=on
@@ -119,7 +119,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir)
 [[ "$ovl_fsid" == "$ovl_unique_fsid" ]] || \
 	echo "Overlayfs (uuid=on) unique fsid is not persistent"
 
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "Silence is golden"
 status=0
diff --git a/tests/overlay/083 b/tests/overlay/083
index 56e02f8cc77d73..aaa3fdb9ad139a 100755
--- a/tests/overlay/083
+++ b/tests/overlay/083
@@ -52,7 +52,7 @@ _mount -t overlay | grep ovl_esc_test  | tee -a $seqres.full | grep -v spaces &&
 
 # Re-create the upper/work dirs to mount them with a different lower
 # This is required in case index feature is enabled
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 rm -rf "$upperdir" "$workdir"
 mkdir -p "$upperdir" "$workdir"
 
diff --git a/tests/overlay/084 b/tests/overlay/084
index 28e9a76dc734c0..67321bc7618389 100755
--- a/tests/overlay/084
+++ b/tests/overlay/084
@@ -15,7 +15,7 @@ _cleanup()
 {
 	cd /
 	# Unmount nested mounts if things fail
-	$UMOUNT_PROG $OVL_BASE_SCRATCH_MNT/nested  2>/dev/null
+	_umount $OVL_BASE_SCRATCH_MNT/nested  2>/dev/null
 	rm -rf $tmp
 }
 
@@ -44,7 +44,7 @@ nesteddir=$OVL_BASE_SCRATCH_MNT/nested
 
 umount_overlay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 test_escape()
@@ -88,12 +88,12 @@ test_escape()
 	echo "nested xattr mount with trusted.overlay"
 	_overlay_mount_dirs $SCRATCH_MNT/layer2:$SCRATCH_MNT/layer1 - - overlayfs $nesteddir
 	stat $nesteddir/dir/file  2>&1 | _filter_scratch
-	$UMOUNT_PROG $nesteddir
+	_umount $nesteddir
 
 	echo "nested xattr mount with user.overlay"
 	_overlay_mount_dirs $SCRATCH_MNT/layer2:$SCRATCH_MNT/layer1 - - -o userxattr overlayfs $nesteddir
 	stat $nesteddir/dir/file  2>&1 | _filter_scratch
-	$UMOUNT_PROG $nesteddir
+	_umount $nesteddir
 
 	# Also ensure propagate the escaped xattr when we copy-up layer2/dir
 	echo "copy-up of escaped xattrs"
@@ -164,7 +164,7 @@ test_escaped_xwhiteout()
 
 	do_test_xwhiteout $prefix $nesteddir
 
-	$UMOUNT_PROG $nesteddir
+	_umount $nesteddir
 }
 
 test_escaped_xwhiteout trusted
diff --git a/tests/overlay/085 b/tests/overlay/085
index 046d01d161d829..8396ceb7c72b90 100755
--- a/tests/overlay/085
+++ b/tests/overlay/085
@@ -157,7 +157,7 @@ mount_ro_overlay()
 
 umount_overlay()
 {
-	$UMOUNT_PROG $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 test_no_access()
diff --git a/tests/overlay/086 b/tests/overlay/086
index 23c56d074ff34a..45e5b45a279853 100755
--- a/tests/overlay/086
+++ b/tests/overlay/086
@@ -38,21 +38,21 @@ _mount -t overlay none $SCRATCH_MNT \
 	2>> $seqres.full && \
 	echo "ERROR: invalid combination of lowerdir and lowerdir+ mount options"
 
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 _mount -t overlay none $SCRATCH_MNT \
 	-o"lowerdir=$lowerdir,datadir+=$lowerdir_colons" \
 	-o redirect_dir=follow,metacopy=on 2>> $seqres.full && \
 	echo "ERROR: invalid combination of lowerdir and datadir+ mount options"
 
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 _mount -t overlay none $SCRATCH_MNT \
 	-o"datadir+=$lowerdir,lowerdir+=$lowerdir_colons" \
 	-o redirect_dir=follow,metacopy=on 2>> $seqres.full && \
 	echo "ERROR: invalid order of lowerdir+ and datadir+ mount options"
 
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 # mount is expected to fail with escaped colons.
 _mount -t overlay none $SCRATCH_MNT \
@@ -60,7 +60,7 @@ _mount -t overlay none $SCRATCH_MNT \
 	2>> $seqres.full && \
 	echo "ERROR: incorrect parsing of escaped colons in lowerdir+ mount option"
 
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 
 # mount is expected to succeed without escaped colons.
 _mount -t overlay ovl_esc_test $SCRATCH_MNT \
diff --git a/tests/xfs/078 b/tests/xfs/078
index 4224fd40bc9fea..799d8881220582 100755
--- a/tests/xfs/078
+++ b/tests/xfs/078
@@ -16,7 +16,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $LOOP_MNT 2>/dev/null
+	_umount $LOOP_MNT 2>/dev/null
 	[ -n "$LOOP_DEV" ] && _destroy_loop_device $LOOP_DEV 2>/dev/null
 	# try to keep the image file if test fails
 	[ $status -eq 0 ] && rm -f $LOOP_IMG
@@ -81,7 +81,7 @@ _grow_loop()
 	$XFS_GROWFS_PROG $LOOP_MNT 2>&1 |  _filter_growfs 2>&1
 
 	echo "*** unmount"
-	$UMOUNT_PROG -d $LOOP_MNT && LOOP_DEV=
+	_umount -d $LOOP_MNT && LOOP_DEV=
 
 	# Large grows takes forever to check..
 	if [ "$check" -gt "0" ]
diff --git a/tests/xfs/148 b/tests/xfs/148
index 9e6798f999b356..7c9badd3c1b3a0 100755
--- a/tests/xfs/148
+++ b/tests/xfs/148
@@ -14,7 +14,7 @@ _begin_fstest auto quick fuzzers
 _cleanup()
 {
 	cd /
-	$UMOUNT_PROG $mntpt > /dev/null 2>&1
+	_umount $mntpt > /dev/null 2>&1
 	_destroy_loop_device $loopdev > /dev/null 2>&1
 	rm -r -f $tmp.*
 }
@@ -90,7 +90,7 @@ cat $tmp.log >> $seqres.full
 cat $tmp.log | _filter_test_dir
 
 # Corrupt the entries
-$UMOUNT_PROG $mntpt
+_umount $mntpt
 _destroy_loop_device $loopdev
 cp $imgfile $imgfile.old
 sed -b \
@@ -121,7 +121,7 @@ fi
 echo "does repair complain?" >> $seqres.full
 
 # Does repair complain about this?
-$UMOUNT_PROG $mntpt
+_umount $mntpt
 $XFS_REPAIR_PROG -n $loopdev >> $seqres.full 2>&1
 res=$?
 test $res -eq 1 || \
diff --git a/tests/xfs/149 b/tests/xfs/149
index bbaf86132dff37..ceb80b646f5784 100755
--- a/tests/xfs/149
+++ b/tests/xfs/149
@@ -22,7 +22,7 @@ loop_symlink=$TEST_DIR/loop_symlink.$$
 # Override the default cleanup function.
 _cleanup()
 {
-    $UMOUNT_PROG $mntdir
+    _umount $mntdir
     [ -n "$loop_dev" ] && _destroy_loop_device $loop_dev
     rmdir $mntdir
     rm -f $loop_symlink
@@ -73,7 +73,7 @@ echo "=== xfs_growfs - check device symlink ==="
 $XFS_GROWFS_PROG -D 12288 $loop_symlink > /dev/null
 
 echo "=== unmount ==="
-$UMOUNT_PROG $mntdir || _fail "!!! failed to unmount"
+_umount $mntdir || _fail "!!! failed to unmount"
 
 echo "=== mount device symlink ==="
 _mount $loop_symlink $mntdir || _fail "!!! failed to loopback mount"
diff --git a/tests/xfs/186 b/tests/xfs/186
index 88f02585e7f667..2bd4fe10ab8930 100755
--- a/tests/xfs/186
+++ b/tests/xfs/186
@@ -87,7 +87,7 @@ _do_eas()
 		_create_eas $2 $3
 	fi
 	echo ""
-	cd /; $UMOUNT_PROG $SCRATCH_MNT
+	cd /; _umount $SCRATCH_MNT
 	_print_inode
 }
 
@@ -99,7 +99,7 @@ _do_dirents()
 	echo ""
 	_scratch_mount
 	_create_dirents $1 $2
-	cd /; $UMOUNT_PROG $SCRATCH_MNT
+	cd /; _umount $SCRATCH_MNT
 	_print_inode
 }
 
diff --git a/tests/xfs/289 b/tests/xfs/289
index 089a3f8cc14a68..aab5f96293b3a5 100755
--- a/tests/xfs/289
+++ b/tests/xfs/289
@@ -13,8 +13,8 @@ _begin_fstest growfs auto quick
 # Override the default cleanup function.
 _cleanup()
 {
-    $UMOUNT_PROG $tmpdir
-    $UMOUNT_PROG $tmpbind
+    _umount $tmpdir
+    _umount $tmpbind
     rmdir $tmpdir
     rm -f $tmpsymlink
     rmdir $tmpbind
diff --git a/tests/xfs/507 b/tests/xfs/507
index 75c183c07a9fce..60542112fbd5a1 100755
--- a/tests/xfs/507
+++ b/tests/xfs/507
@@ -22,7 +22,7 @@ _register_cleanup "_cleanup" BUS
 _cleanup()
 {
 	cd /
-	test -n "$loop_mount" && $UMOUNT_PROG $loop_mount > /dev/null 2>&1
+	test -n "$loop_mount" && _umount $loop_mount > /dev/null 2>&1
 	test -n "$loop_dev" && _destroy_loop_device $loop_dev
 	rm -rf $tmp.*
 }
diff --git a/tests/xfs/513 b/tests/xfs/513
index 5585a9c8e76703..cb8d0aca841530 100755
--- a/tests/xfs/513
+++ b/tests/xfs/513
@@ -14,7 +14,7 @@ _cleanup()
 {
 	cd /
 	rm -f $tmp.*
-	$UMOUNT_PROG $LOOP_MNT 2>/dev/null
+	_umount $LOOP_MNT 2>/dev/null
 	if [ -n "$LOOP_DEV" ];then
 		_destroy_loop_device $LOOP_DEV 2>/dev/null
 	fi
@@ -89,7 +89,7 @@ get_mount_info()
 
 force_unmount()
 {
-	$UMOUNT_PROG $LOOP_MNT >/dev/null 2>&1
+	_umount $LOOP_MNT >/dev/null 2>&1
 }
 
 # _do_test <mount options> <should be mounted?> [<key string> <key should be found?>]
diff --git a/tests/xfs/544 b/tests/xfs/544
index a3a23c1726ca1c..f1b5cc74983a62 100755
--- a/tests/xfs/544
+++ b/tests/xfs/544
@@ -15,7 +15,7 @@ _cleanup()
 	_cleanup_dump
 	cd /
 	rm -r -f $tmp.*
-	$UMOUNT_PROG $TEST_DIR/dest.$seq 2> /dev/null
+	_umount $TEST_DIR/dest.$seq 2> /dev/null
 	rmdir $TEST_DIR/src.$seq 2> /dev/null
 	rmdir $TEST_DIR/dest.$seq 2> /dev/null
 }
diff --git a/tests/xfs/806 b/tests/xfs/806
index 09c55332cc8800..9334d1780c6855 100755
--- a/tests/xfs/806
+++ b/tests/xfs/806
@@ -23,7 +23,7 @@ _cleanup()
 {
 	cd /
 	rm -r -f $tmp.*
-	umount $dummymnt &>/dev/null
+	_umount $dummymnt &>/dev/null
 	rmdir $dummymnt &>/dev/null
 	rm -f $dummyfile
 }
@@ -46,7 +46,7 @@ testme() {
 	XFS_SCRUB_PHASE=7 $XFS_SCRUB_PROG -d -o autofsck $dummymnt 2>&1 | \
 		grep autofsck | _filter_test_dir | \
 		sed -e 's/\(directive.\).*$/\1/g'
-	umount $dummymnt
+	_umount $dummymnt
 }
 
 # We don't test the absence of an autofsck directive because xfs_scrub behaves



* [PATCH 2/6] misc: convert all umount(1) invocations to _umount
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
@ 2024-12-31 23:57   ` Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 3/6] xfs: test health monitoring code Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Find all the places where we call umount(1) directly, rather than
through $UMOUNT_PROG, and convert those to _umount calls as well.  The
conversion was done with the following sed script:

sed \
 -e 's/\([[:space:]]\)umount\([[:space:]]*"\$\)/\1_umount\2/g' \
 -e 's/\([[:space:]]\)umount\([[:space:]]*\$\)/\1_umount\2/g' \
 -e 's/^umount\([[:space:]]*"\$\)/_umount\1/g' \
 -e 's/^umount\([[:space:]]*\$\)/_umount\1/g' \
 -i $(git ls-files tests common check)
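
For readers who want to see what the four expressions above do and do
not match, here is a minimal sketch.  It assumes a POSIX-compatible sed
and feeds sample lines through a pipe instead of rewriting files with
-i.  The convert_umount wrapper is a hypothetical name used only for
this illustration; it is not an fstests helper.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of fstests): apply the same four sed
# expressions from the commit message to stdin rather than editing
# files in place.
convert_umount() {
	sed \
	 -e 's/\([[:space:]]\)umount\([[:space:]]*"\$\)/\1_umount\2/g' \
	 -e 's/\([[:space:]]\)umount\([[:space:]]*\$\)/\1_umount\2/g' \
	 -e 's/^umount\([[:space:]]*"\$\)/_umount\1/g' \
	 -e 's/^umount\([[:space:]]*\$\)/_umount\1/g'
}

# Sample lines modelled on call sites that appear in this series.
printf '%s\n' \
	'umount $SCRATCH_MNT' \
	'	umount "${SCRATCH_MNT}"' \
	'test "$umount" = "-u" && umount "$dev" &>/dev/null' \
	'	_umount $SCRATCH_MNT' \
	'umount -d $LOOP_MNT' \
	| convert_umount
```

The first three sample lines gain the underscore.  The fourth is left
alone because the character before "umount" is "_" rather than
whitespace or start of line, so rerunning the script does not
double-convert.  The fifth is also left alone: the expressions only
match when umount is followed directly by a (possibly quoted) variable
expansion, so invocations that pass an option first are not caught by
this script.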

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/dmerror    |    2 +-
 common/populate   |    8 ++++----
 common/quota      |    2 +-
 common/rc         |    4 ++--
 common/xfs        |    2 +-
 tests/btrfs/012   |    2 +-
 tests/btrfs/199   |    2 +-
 tests/btrfs/291   |    2 +-
 tests/btrfs/298   |    4 ++--
 tests/ext4/006    |    4 ++--
 tests/ext4/007    |    4 ++--
 tests/ext4/008    |    4 ++--
 tests/ext4/009    |    8 ++++----
 tests/ext4/010    |    6 +++---
 tests/ext4/011    |    2 +-
 tests/ext4/012    |    2 +-
 tests/ext4/013    |    6 +++---
 tests/ext4/014    |    6 +++---
 tests/ext4/015    |    6 +++---
 tests/ext4/016    |    6 +++---
 tests/ext4/017    |    6 +++---
 tests/ext4/018    |    6 +++---
 tests/ext4/019    |    6 +++---
 tests/ext4/033    |    2 +-
 tests/generic/171 |    2 +-
 tests/generic/172 |    2 +-
 tests/generic/173 |    2 +-
 tests/generic/174 |    2 +-
 tests/generic/306 |    2 +-
 tests/generic/330 |    2 +-
 tests/generic/332 |    2 +-
 tests/generic/395 |    2 +-
 tests/generic/563 |    4 ++--
 tests/generic/631 |    2 +-
 tests/generic/717 |    2 +-
 tests/xfs/014     |    4 ++--
 tests/xfs/049     |    8 ++++----
 tests/xfs/073     |    8 ++++----
 tests/xfs/074     |    4 ++--
 tests/xfs/083     |    6 +++---
 tests/xfs/085     |    4 ++--
 tests/xfs/086     |    8 ++++----
 tests/xfs/087     |    6 +++---
 tests/xfs/088     |    8 ++++----
 tests/xfs/089     |    8 ++++----
 tests/xfs/091     |    8 ++++----
 tests/xfs/093     |    6 +++---
 tests/xfs/097     |    6 +++---
 tests/xfs/098     |    4 ++--
 tests/xfs/099     |    6 +++---
 tests/xfs/100     |    6 +++---
 tests/xfs/101     |    6 +++---
 tests/xfs/102     |    6 +++---
 tests/xfs/105     |    6 +++---
 tests/xfs/112     |    8 ++++----
 tests/xfs/113     |    6 +++---
 tests/xfs/117     |    6 +++---
 tests/xfs/120     |    6 +++---
 tests/xfs/123     |    6 +++---
 tests/xfs/124     |    6 +++---
 tests/xfs/125     |    6 +++---
 tests/xfs/126     |    6 +++---
 tests/xfs/130     |    2 +-
 tests/xfs/152     |    2 +-
 tests/xfs/169     |    6 +++---
 tests/xfs/206     |    2 +-
 tests/xfs/216     |    2 +-
 tests/xfs/217     |    2 +-
 tests/xfs/235     |    6 +++---
 tests/xfs/236     |    6 +++---
 tests/xfs/239     |    2 +-
 tests/xfs/241     |    2 +-
 tests/xfs/250     |    4 ++--
 tests/xfs/265     |    6 +++---
 tests/xfs/310     |    4 ++--
 tests/xfs/716     |    4 ++--
 76 files changed, 172 insertions(+), 172 deletions(-)


diff --git a/common/dmerror b/common/dmerror
index 1e6a35230f3ccb..2b6f001b8427f6 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -97,7 +97,7 @@ _dmerror_mount()
 
 _dmerror_unmount()
 {
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 }
 
 _dmerror_cleanup()
diff --git a/common/populate b/common/populate
index 96e6a0f0572f12..e6bcdf346ac4ff 100644
--- a/common/populate
+++ b/common/populate
@@ -540,7 +540,7 @@ _scratch_xfs_populate() {
 	__populate_fragment_file "${SCRATCH_MNT}/REFCOUNTBT"
 	__populate_fragment_file "${SCRATCH_MNT}/RTREFCOUNTBT"
 
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 }
 
 # Populate an ext4 on the scratch device with (we hope) all known
@@ -642,7 +642,7 @@ _scratch_ext4_populate() {
 	# Make sure we get all the fragmentation we asked for
 	__populate_fragment_file "${SCRATCH_MNT}/S_IFREG.FMT_ETREE"
 
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 }
 
 # Find the inode number of a file
@@ -831,7 +831,7 @@ _scratch_xfs_populate_check() {
 	dblksz="$(_xfs_get_dir_blocksize "$SCRATCH_MNT")"
 	leaf_lblk="$((32 * 1073741824 / blksz))"
 	node_lblk="$((64 * 1073741824 / blksz))"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 
 	__populate_check_xfs_dformat "${extents_file}" "extents"
 	__populate_check_xfs_dformat "${btree_file}" "btree"
@@ -948,7 +948,7 @@ _scratch_ext4_populate_check() {
 	extents_slink="$(__populate_find_inode "${SCRATCH_MNT}/S_IFLNK.FMT_EXTENTS")"
 	local_attr="$(__populate_find_inode "${SCRATCH_MNT}/ATTR.FMT_LOCAL")"
 	block_attr="$(__populate_find_inode "${SCRATCH_MNT}/ATTR.FMT_BLOCK")"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 
 	__populate_check_ext4_dformat "${extents_file}" "extents"
 	__populate_check_ext4_dformat "${etree_file}" "etree"
diff --git a/common/quota b/common/quota
index 344c942045e5f2..7399819bb2579b 100644
--- a/common/quota
+++ b/common/quota
@@ -92,7 +92,7 @@ _require_xfs_quota_acct_enabled()
 	if [ -z "$umount" ] && [ "$dev" = "$SCRATCH_DEV" ]; then
 		umount="-u"
 	fi
-	test "$umount" = "-u" && umount "$dev" &>/dev/null
+	test "$umount" = "-u" && _umount "$dev" &>/dev/null
 
 	case "$dev" in
 	"$TEST_DEV")	fsname="test";;
diff --git a/common/rc b/common/rc
index d3ee76e01db892..0d5c785cecc017 100644
--- a/common/rc
+++ b/common/rc
@@ -1348,7 +1348,7 @@ _repair_scratch_fs()
 			_scratch_xfs_repair -L 2>&1
 			echo "log zap returns $?"
 		else
-			umount "$SCRATCH_MNT"
+			_umount "$SCRATCH_MNT"
 		fi
 		_scratch_xfs_repair "$@" 2>&1
 		res=$?
@@ -1413,7 +1413,7 @@ _repair_test_fs()
 				_test_xfs_repair -L >>$tmp.repair 2>&1
 				echo "log zap returns $?" >> $tmp.repair
 			else
-				umount "$TEST_DEV"
+				_umount "$TEST_DEV"
 			fi
 			_test_xfs_repair "$@" >>$tmp.repair 2>&1
 			res=$?
diff --git a/common/xfs b/common/xfs
index 86654a9379cf89..b9e897e0e8839a 100644
--- a/common/xfs
+++ b/common/xfs
@@ -466,7 +466,7 @@ _require_xfs_has_feature()
 
 	_xfs_has_feature "$1" "$2" && return 0
 
-	test "$umount" = "-u" && umount "$fs" &>/dev/null
+	test "$umount" = "-u" && _umount "$fs" &>/dev/null
 
 	test -n "$message" && _notrun "$message"
 
diff --git a/tests/btrfs/012 b/tests/btrfs/012
index 5811b3b339cb3e..7bb075dc2d0e93 100755
--- a/tests/btrfs/012
+++ b/tests/btrfs/012
@@ -70,7 +70,7 @@ mount -o loop $SCRATCH_MNT/ext2_saved/image $SCRATCH_MNT/mnt || \
 
 echo "Checking saved ext2 image against the original one:"
 $FSSUM_PROG -r $tmp.original $SCRATCH_MNT/mnt/$BASENAME
-umount $SCRATCH_MNT/mnt
+_umount $SCRATCH_MNT/mnt
 
 echo "Generating new data on the converted btrfs" >> $seqres.full
 mkdir -p $SCRATCH_MNT/new 
diff --git a/tests/btrfs/199 b/tests/btrfs/199
index f161e55057ff27..bdad1cb934c91f 100755
--- a/tests/btrfs/199
+++ b/tests/btrfs/199
@@ -19,7 +19,7 @@ _begin_fstest auto quick trim fiemap
 _cleanup()
 {
 	cd /
-	umount $loop_mnt &> /dev/null
+	_umount $loop_mnt &> /dev/null
 	_destroy_loop_device $loop_dev &> /dev/null
 	rm -rf $tmp.*
 }
diff --git a/tests/btrfs/291 b/tests/btrfs/291
index c31de3a96ef1f5..f69b65114ed696 100755
--- a/tests/btrfs/291
+++ b/tests/btrfs/291
@@ -134,7 +134,7 @@ do
 	_mount $snap_dev $SCRATCH_MNT || _fail "mount failed at entry $cur"
 	fsverity measure $SCRATCH_MNT/fsv >>$seqres.full 2>&1
 	measured=$?
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	[ $state -eq 1 ] && [ $measured -eq 0 ] && state=2
 	[ $state -eq 2 ] && ([ $measured -eq 0 ] || _fail "verity done, but measurement failed at entry $cur")
 	post_mount=$(count_merkle_items $snap_dev)
diff --git a/tests/btrfs/298 b/tests/btrfs/298
index d4aee55e785a94..c5b65772d428b1 100755
--- a/tests/btrfs/298
+++ b/tests/btrfs/298
@@ -31,11 +31,11 @@ $BTRFS_UTIL_PROG device scan --forget
 echo "#Scan seed device and check using mount" >> $seqres.full
 $BTRFS_UTIL_PROG device scan $SCRATCH_DEV >> $seqres.full
 _mount $SPARE_DEV $SCRATCH_MNT
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "#check again, ensures seed device still in kernel" >> $seqres.full
 _mount $SPARE_DEV $SCRATCH_MNT
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "#Now scan of non-seed device makes kernel forget" >> $seqres.full
 $BTRFS_TUNE_PROG -f -S 0 $SCRATCH_DEV >> $seqres.full 2>&1
diff --git a/tests/ext4/006 b/tests/ext4/006
index d7862073114872..579eab55b32d26 100755
--- a/tests/ext4/006
+++ b/tests/ext4/006
@@ -97,7 +97,7 @@ echo "++ modify scratch" >> $seqres.full
 _scratch_fuzz_modify >> $seqres.full 2>&1
 
 echo "++ unmount" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 # repair in a loop...
 for p in $(seq 1 "${FSCK_PASSES}"); do
@@ -122,7 +122,7 @@ echo "++ modify scratch" >> $ROUND2_LOG
 _scratch_fuzz_modify >> $ROUND2_LOG 2>&1
 
 echo "++ unmount" >> $ROUND2_LOG
-umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
+_umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
 
 cat "$ROUND2_LOG" >> $seqres.full
 
diff --git a/tests/ext4/007 b/tests/ext4/007
index deedbd9e8fb3d8..24cc2290f79a29 100755
--- a/tests/ext4/007
+++ b/tests/ext4/007
@@ -54,7 +54,7 @@ done
 for x in `seq 2 64`; do
 	touch "${TESTFILE}.${x}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -89,7 +89,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/008 b/tests/ext4/008
index b4b20ac10d6d2a..a586bf681dfd34 100755
--- a/tests/ext4/008
+++ b/tests/ext4/008
@@ -50,7 +50,7 @@ done
 for x in `seq 2 64`; do
 	echo moo >> "${TESTFILE}.${x}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -70,7 +70,7 @@ e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
 echo "+ mount image (2)"
 _scratch_mount
 
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/009 b/tests/ext4/009
index 06a42fd77ffa0c..f6fe1e5f0d8d2a 100755
--- a/tests/ext4/009
+++ b/tests/ext4/009
@@ -45,13 +45,13 @@ done
 blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")"
 freeblks="$(stat -f -c '%a' "${SCRATCH_MNT}")"
 $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile2" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ make some files"
 _scratch_mount
 rm -rf "${SCRATCH_MNT}/bigfile2"
 touch "${SCRATCH_MNT}/bigfile"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -70,7 +70,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >>
 after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")"
 echo "$((after * b_bytes))" lt "$((blksz * freeblks / 4))" >> $seqres.full
 test "$((after * b_bytes))" -lt "$((blksz * freeblks / 4))" || _fail "falloc should fail"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -80,7 +80,7 @@ _scratch_mount
 
 echo "+ modify files (2)"
 $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/010 b/tests/ext4/010
index 1139c79e80d538..27ce20f822256f 100755
--- a/tests/ext4/010
+++ b/tests/ext4/010
@@ -46,7 +46,7 @@ echo "+ make some files"
 for i in `seq 1 $((nr_groups * 8))`; do
 	mkdir -p "${SCRATCH_MNT}/d_${i}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -61,7 +61,7 @@ _scratch_mount
 
 echo "+ modify files"
 touch "${SCRATCH_MNT}/file0" > /dev/null 2>&1 && _fail "touch should fail"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -71,7 +71,7 @@ _scratch_mount
 
 echo "+ modify files (2)"
 touch "${SCRATCH_MNT}/file1"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/011 b/tests/ext4/011
index cae4fb6b84768b..cb085c95596de1 100755
--- a/tests/ext4/011
+++ b/tests/ext4/011
@@ -39,7 +39,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")"
 
 echo "+ make some files"
 echo moo > "${SCRATCH_MNT}/file0"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/012 b/tests/ext4/012
index f7f2b0fb455762..e7adc617c4db17 100755
--- a/tests/ext4/012
+++ b/tests/ext4/012
@@ -39,7 +39,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")"
 
 echo "+ make some files"
 echo moo > "${SCRATCH_MNT}/file0"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/013 b/tests/ext4/013
index 7d2a9154a66936..4363e3d104b716 100755
--- a/tests/ext4/013
+++ b/tests/ext4/013
@@ -50,7 +50,7 @@ for x in `seq 2 64`; do
 	touch "${TESTFILE}.${x}"
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -72,7 +72,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -93,7 +93,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/014 b/tests/ext4/014
index ffed795ad4e93c..c874a62335d1f3 100755
--- a/tests/ext4/014
+++ b/tests/ext4/014
@@ -49,7 +49,7 @@ done
 for x in `seq 2 64`; do
 	touch "${TESTFILE}.${x}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -70,7 +70,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 && _fail "e2fsck should not succeed"
@@ -91,7 +91,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/015 b/tests/ext4/015
index 81feda5c9423fb..32b3884de32035 100755
--- a/tests/ext4/015
+++ b/tests/ext4/015
@@ -45,7 +45,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >>
 seq 1 2 ${freeblks} | while read lblk; do
 	$XFS_IO_PROG -f -c "fpunch $((lblk * blksz)) ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -60,7 +60,7 @@ _scratch_mount
 
 echo "+ modify files"
 echo moo >> "${SCRATCH_MNT}/bigfile" 2> /dev/null && _fail "extent tree should be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -70,7 +70,7 @@ _scratch_mount
 
 echo "+ modify files (2)"
 $XFS_IO_PROG -f -c "pwrite ${blksz} ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/016 b/tests/ext4/016
index b7db4cfda649ef..f0f1709b6c208a 100755
--- a/tests/ext4/016
+++ b/tests/ext4/016
@@ -40,7 +40,7 @@ echo "+ make some files"
 for x in `seq 1 15`; do
 	mkdir -p "${SCRATCH_MNT}/test/d_${x}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -53,7 +53,7 @@ _scratch_mount
 
 echo "+ modify dirs"
 mkdir -p "${SCRATCH_MNT}/test/newdir" 2> /dev/null && _fail "directory should be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -63,7 +63,7 @@ _scratch_mount
 
 echo "+ modify dirs (2)"
 mkdir -p "${SCRATCH_MNT}/test/newdir" || _fail "directory should be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/017 b/tests/ext4/017
index fc867442c3da3a..7fa563106d676c 100755
--- a/tests/ext4/017
+++ b/tests/ext4/017
@@ -43,7 +43,7 @@ for x in `seq 1 $((blksz * 4 / 256))`; do
 	fname="$(printf "%.255s\n" "$(perl -e "print \"${x}_\" x 500;")")"
 	touch "${SCRATCH_MNT}/test/${fname}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -56,7 +56,7 @@ _scratch_mount
 
 echo "+ modify dirs"
 mkdir -p "${SCRATCH_MNT}/test/newdir" 2> /dev/null && _fail "htree should be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -66,7 +66,7 @@ _scratch_mount
 
 echo "+ modify dirs (2)"
 mkdir -p "${SCRATCH_MNT}/test/newdir" || _fail "htree should not be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/018 b/tests/ext4/018
index f7377f059fb826..2e24fe2e82918d 100755
--- a/tests/ext4/018
+++ b/tests/ext4/018
@@ -40,7 +40,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")"
 echo "+ make some files"
 $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${SCRATCH_MNT}/attrfile" >> $seqres.full
 setfattr -n user.key -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -54,7 +54,7 @@ _scratch_mount
 
 echo "+ modify attrs"
 setfattr -n user.newkey -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "xattr should be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -64,7 +64,7 @@ _scratch_mount
 
 echo "+ modify attrs (2)"
 setfattr -n user.newkey -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile" || _fail "xattr should not be corrupt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/019 b/tests/ext4/019
index 987972a80a3704..7df7ccbed5e50d 100755
--- a/tests/ext4/019
+++ b/tests/ext4/019
@@ -43,7 +43,7 @@ echo "file contents: moo" > "${SCRATCH_MNT}/x"
 str="$(perl -e "print './' x $(( (blksz / 2) - 16));")x"
 (cd $SCRATCH_MNT; ln -s "${str}" "long_symlink")
 cat "${SCRATCH_MNT}/long_symlink"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -54,7 +54,7 @@ debugfs -w -R 'zap -f /long_symlink -p 0x62 0' "${SCRATCH_DEV}" 2> /dev/null
 echo "+ mount image"
 _scratch_mount 2> /dev/null
 cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
@@ -62,7 +62,7 @@ e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1
 echo "+ mount image (2)"
 _scratch_mount
 cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
diff --git a/tests/ext4/033 b/tests/ext4/033
index 53f7106e2c6ba4..19cd1fb6f20d4c 100755
--- a/tests/ext4/033
+++ b/tests/ext4/033
@@ -14,7 +14,7 @@ _begin_fstest auto ioctl resize
 # Override the default cleanup function.
 _cleanup()
 {
-	umount $SCRATCH_MNT >/dev/null 2>&1
+	_umount $SCRATCH_MNT >/dev/null 2>&1
 	_dmhugedisk_cleanup
 	cd /
 	rm -f $tmp.*
diff --git a/tests/generic/171 b/tests/generic/171
index dd56aa792afbd5..f51f58e9495f8e 100755
--- a/tests/generic/171
+++ b/tests/generic/171
@@ -36,7 +36,7 @@ mkdir $testdir
 echo "Reformat with appropriate size"
 blksz="$(_get_block_size $testdir)"
 nr_blks=10240
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 sz_bytes=$((nr_blks * 8 * blksz))
 if [ $sz_bytes -lt $((32 * 1048576)) ]; then
 	sz_bytes=$((32 * 1048576))
diff --git a/tests/generic/172 b/tests/generic/172
index c23a1228455464..8d32f0288b1556 100755
--- a/tests/generic/172
+++ b/tests/generic/172
@@ -35,7 +35,7 @@ mkdir $testdir
 
 echo "Reformat with appropriate size"
 blksz="$(_get_block_size $testdir)"
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 file_size=$((768 * 1024 * 1024))
 fs_size=$((1024 * 1024 * 1024))
diff --git a/tests/generic/173 b/tests/generic/173
index 8df3c6df21b29c..2f1ea96ef6238e 100755
--- a/tests/generic/173
+++ b/tests/generic/173
@@ -36,7 +36,7 @@ mkdir $testdir
 echo "Reformat with appropriate size"
 blksz="$(_get_block_size $testdir)"
 nr_blks=10240
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 sz_bytes=$((nr_blks * 8 * blksz))
 if [ $sz_bytes -lt $((32 * 1048576)) ]; then
 	sz_bytes=$((32 * 1048576))
diff --git a/tests/generic/174 b/tests/generic/174
index b9c292071445fe..d93546eeb35581 100755
--- a/tests/generic/174
+++ b/tests/generic/174
@@ -37,7 +37,7 @@ mkdir $testdir
 echo "Reformat with appropriate size"
 blksz="$(_get_block_size $testdir)"
 nr_blks=10240
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 sz_bytes=$((nr_blks * 8 * blksz))
 if [ $sz_bytes -lt $((32 * 1048576)) ]; then
 	sz_bytes=$((32 * 1048576))
diff --git a/tests/generic/306 b/tests/generic/306
index a6ea654b67d179..e6502cb881e21e 100755
--- a/tests/generic/306
+++ b/tests/generic/306
@@ -12,7 +12,7 @@ _begin_fstest auto quick rw
 # Override the default cleanup function.
 _cleanup()
 {
-    umount $BINDFILE
+    _umount $BINDFILE
     cd /
     rm -f $tmp.*
 }
diff --git a/tests/generic/330 b/tests/generic/330
index 4fa81f9913ee7e..ab9af84611d725 100755
--- a/tests/generic/330
+++ b/tests/generic/330
@@ -61,7 +61,7 @@ md5sum $testdir/file1 | _filter_scratch
 md5sum $testdir/file2 | _filter_scratch
 
 echo "Check for damage"
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _repair_scratch_fs >> $seqres.full
 
 # success, all done
diff --git a/tests/generic/332 b/tests/generic/332
index 4a61e4a02a7cdc..b15546d66a41e0 100755
--- a/tests/generic/332
+++ b/tests/generic/332
@@ -61,7 +61,7 @@ md5sum $testdir/file1 | _filter_scratch
 md5sum $testdir/file2 | _filter_scratch
 
 echo "Check for damage"
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _repair_scratch_fs >> $seqres.full
 
 # success, all done
diff --git a/tests/generic/395 b/tests/generic/395
index 45787fff06be1d..d0600d0282c6a4 100755
--- a/tests/generic/395
+++ b/tests/generic/395
@@ -75,7 +75,7 @@ mount --bind $SCRATCH_MNT $SCRATCH_MNT/ro_bind_mnt
 mount -o remount,ro,bind $SCRATCH_MNT/ro_bind_mnt
 _set_encpolicy $SCRATCH_MNT/ro_bind_mnt/ro_dir |& _filter_scratch
 _get_encpolicy $SCRATCH_MNT/ro_bind_mnt/ro_dir |& _filter_scratch
-umount $SCRATCH_MNT/ro_bind_mnt
+_umount $SCRATCH_MNT/ro_bind_mnt
 
 # success, all done
 status=0
diff --git a/tests/generic/563 b/tests/generic/563
index ade66f93fbf30b..166774653a66d6 100755
--- a/tests/generic/563
+++ b/tests/generic/563
@@ -21,7 +21,7 @@ _cleanup()
 
 	echo $$ > $cgdir/cgroup.procs
 	rmdir $cgdir/$seq-cg* > /dev/null 2>&1
-	umount $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_destroy_loop_device $LOOP_DEV > /dev/null 2>&1
 }
 
@@ -80,7 +80,7 @@ reset()
 	rmdir $cgdir/$seq-cg* > /dev/null 2>&1
 	$XFS_IO_PROG -fc "pwrite 0 $iosize" $SCRATCH_MNT/file \
 		>> $seqres.full 2>&1
-	umount $SCRATCH_MNT || _fail "umount failed"
+	_umount $SCRATCH_MNT || _fail "umount failed"
 	_mount $LOOP_DEV $SCRATCH_MNT || _fail "mount failed"
 	stat $SCRATCH_MNT/file > /dev/null
 }
diff --git a/tests/generic/631 b/tests/generic/631
index c7c95e5608b760..c9f8299c948f83 100755
--- a/tests/generic/631
+++ b/tests/generic/631
@@ -84,7 +84,7 @@ worker() {
 		touch $mergedir/etc/access.conf
 		mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak
 		touch $mergedir/etc/access.conf
-		umount $mergedir
+		_umount $mergedir
 	done
 	rm -f $SCRATCH_MNT/workers/$tag
 }
diff --git a/tests/generic/717 b/tests/generic/717
index 4378e964ab8597..7ff356e255b3d1 100755
--- a/tests/generic/717
+++ b/tests/generic/717
@@ -85,7 +85,7 @@ mkdir -p $SCRATCH_MNT/xyz
 mount --bind $dir $SCRATCH_MNT/xyz --bind
 _pwrite_byte 0x60 0 $((blksz * (nrblks + 2))) $dir/c >> $seqres.full
 $XFS_IO_PROG -c "exchangerange $SCRATCH_MNT/xyz/c" $dir/a
-umount $SCRATCH_MNT/xyz
+_umount $SCRATCH_MNT/xyz
 
 echo Swapping a file with itself
 $XFS_IO_PROG -c "exchangerange $dir/a" $dir/a
diff --git a/tests/xfs/014 b/tests/xfs/014
index 098f64186e1134..efae4efa5138f5 100755
--- a/tests/xfs/014
+++ b/tests/xfs/014
@@ -22,7 +22,7 @@ _begin_fstest auto enospc quick quota prealloc
 _cleanup()
 {
 	cd /
-	umount $LOOP_MNT 2>/dev/null
+	_umount $LOOP_MNT 2>/dev/null
 	_scratch_unmount 2>/dev/null
 	rm -f $tmp.*
 }
@@ -174,7 +174,7 @@ mount -t xfs -o loop,uquota,gquota $LOOP_FILE $LOOP_MNT || \
 _test_enospc $LOOP_MNT
 _test_edquot $LOOP_MNT
 
-umount $LOOP_MNT
+_umount $LOOP_MNT
 
 echo $orig_sp_time > /proc/sys/fs/xfs/speculative_prealloc_lifetime
 
diff --git a/tests/xfs/049 b/tests/xfs/049
index 668ac374576a69..89ee1dbdff4f10 100755
--- a/tests/xfs/049
+++ b/tests/xfs/049
@@ -13,8 +13,8 @@ _begin_fstest rw auto quick
 _cleanup()
 {
     cd /
-    umount $SCRATCH_MNT/test2 > /dev/null 2>&1
-    umount $SCRATCH_MNT/test > /dev/null 2>&1
+    _umount $SCRATCH_MNT/test2 > /dev/null 2>&1
+    _umount $SCRATCH_MNT/test > /dev/null 2>&1
     rm -f $tmp.*
 
     if [ -w $seqres.full ]
@@ -96,11 +96,11 @@ rm -rf $SCRATCH_MNT/test/* >> $seqres.full 2>&1 \
     || _fail "!!! clean failed"
 
 _log "umount ext2 on xfs"
-umount $SCRATCH_MNT/test2 >> $seqres.full 2>&1 \
+_umount $SCRATCH_MNT/test2 >> $seqres.full 2>&1 \
     || _fail "!!! umount ext2 failed"
 
 _log "umount xfs"
-umount $SCRATCH_MNT/test >> $seqres.full 2>&1 \
+_umount $SCRATCH_MNT/test >> $seqres.full 2>&1 \
     || _fail "!!! umount xfs failed"
 
 echo "--- mounts at end (before cleanup)" >> $seqres.full
diff --git a/tests/xfs/073 b/tests/xfs/073
index 28f1fad08b8c96..7d99179b7bc974 100755
--- a/tests/xfs/073
+++ b/tests/xfs/073
@@ -21,9 +21,9 @@ _cleanup()
 {
 	cd /
 	_scratch_unmount 2>/dev/null
-	umount $imgs.loop 2>/dev/null
+	_umount $imgs.loop 2>/dev/null
 	[ -d $imgs.loop ] && rmdir $imgs.loop
-	umount $imgs.source_dir 2>/dev/null
+	_umount $imgs.source_dir 2>/dev/null
 	[ -d $imgs.source_dir ] && rm -rf $imgs.source_dir
 	rm -f $imgs.* $tmp.* /var/tmp/xfs_copy.log.*
 }
@@ -98,8 +98,8 @@ _verify_copy()
 	diff -u $tmp.geometry1 $tmp.geometry2
 
 	echo unmounting and removing new image
-	umount $source_dir
-	umount $target_dir > /dev/null 2>&1
+	_umount $source_dir
+	_umount $target_dir > /dev/null 2>&1
 	rm -f $target
 }
 
diff --git a/tests/xfs/074 b/tests/xfs/074
index 278f0ade694d22..282642a8674557 100755
--- a/tests/xfs/074
+++ b/tests/xfs/074
@@ -59,7 +59,7 @@ $XFS_IO_PROG -ft \
 	-c "falloc 0 $(($BLOCK_SIZE * 2097152))" \
 	$LOOP_MNT/foo >> $seqres.full
 
-umount $LOOP_MNT
+_umount $LOOP_MNT
 _check_xfs_filesystem $LOOP_DEV none none
 
 _mkfs_dev -f $LOOP_DEV
@@ -72,7 +72,7 @@ $XFS_IO_PROG -ft \
 	-c "falloc 1023m 2g" \
 	$LOOP_MNT/foo >> $seqres.full
 
-umount $LOOP_MNT
+_umount $LOOP_MNT
 _check_xfs_filesystem $LOOP_DEV none none
 
 # success, all done
diff --git a/tests/xfs/083 b/tests/xfs/083
index 9291c8c0382489..875937e6ffe3b3 100755
--- a/tests/xfs/083
+++ b/tests/xfs/083
@@ -57,7 +57,7 @@ scratch_repair() {
 			_scratch_xfs_repair -L >> "${FSCK_LOG}" 2>&1
 			echo "+++ returns $?" >> "${FSCK_LOG}"
 		else
-			umount "${SCRATCH_MNT}" >> "${FSCK_LOG}" 2>&1
+			_umount "${SCRATCH_MNT}" >> "${FSCK_LOG}" 2>&1
 		fi
 	elif [ "${fsck_pass}" -eq "${FSCK_PASSES}" ]; then
 		echo "++ fsck did not fix in ${FSCK_PASSES} passes." >> "${FSCK_LOG}"
@@ -109,7 +109,7 @@ echo "+++ modify scratch" >> $seqres.full
 _scratch_fuzz_modify >> $seqres.full 2>&1
 
 echo "++ umount" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 # repair in a loop...
 for p in $(seq 1 "${FSCK_PASSES}"); do
@@ -134,7 +134,7 @@ echo "+++ modify scratch" >> $ROUND2_LOG
 _scratch_fuzz_modify >> $ROUND2_LOG 2>&1
 
 echo "++ umount" >> $ROUND2_LOG
-umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
+_umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
 
 cat "$ROUND2_LOG" >> $seqres.full
 
diff --git a/tests/xfs/085 b/tests/xfs/085
index d33dd199e6f9c1..9faf16fde5cdab 100755
--- a/tests/xfs/085
+++ b/tests/xfs/085
@@ -54,7 +54,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -82,7 +82,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/086 b/tests/xfs/086
index 44985f3913254d..03327cdeaf3f08 100755
--- a/tests/xfs/086
+++ b/tests/xfs/086
@@ -56,7 +56,7 @@ done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
 test "${agcount}" -gt 1 || _notrun "Single-AG XFS not supported"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 1 64`; do
 		$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -97,7 +97,7 @@ echo "+ modify files (2)"
 for x in `seq 1 64`; do
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 _repair_scratch_fs >> $seqres.full 2>&1
@@ -114,7 +114,7 @@ for x in `seq 1 64`; do
 	test -s "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/087 b/tests/xfs/087
index 3cca105685fc6a..aeef30657b9491 100755
--- a/tests/xfs/087
+++ b/tests/xfs/087
@@ -55,7 +55,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -72,7 +72,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 65 70`; do
 		touch "${TESTFILE}.${x}" 2> /dev/null && broken=0
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 echo "broken: ${broken}"
 
@@ -91,7 +91,7 @@ for x in `seq 65 70`; do
 	touch "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/088 b/tests/xfs/088
index b54a1ab7d00342..de100136014ba7 100755
--- a/tests/xfs/088
+++ b/tests/xfs/088
@@ -56,7 +56,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 1 64`; do
 		$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -97,7 +97,7 @@ echo "+ modify files (2)"
 for x in `seq 1 64`; do
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 _repair_scratch_fs >> $seqres.full 2>&1
@@ -114,7 +114,7 @@ for x in `seq 1 64`; do
 	test -s "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/089 b/tests/xfs/089
index ff3ae719326eca..f5640a46177578 100755
--- a/tests/xfs/089
+++ b/tests/xfs/089
@@ -56,7 +56,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 1 64`; do
 		$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -98,7 +98,7 @@ echo "+ modify files (2)"
 for x in `seq 1 64`; do
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 _repair_scratch_fs >> $seqres.full 2>&1
@@ -115,7 +115,7 @@ for x in `seq 1 64`; do
 	test -s "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/091 b/tests/xfs/091
index 3f606f8845797d..c7857cdf1b690b 100755
--- a/tests/xfs/091
+++ b/tests/xfs/091
@@ -56,7 +56,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 1 64`; do
 		$XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -98,7 +98,7 @@ echo "+ modify files (2)"
 for x in `seq 1 64`; do
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ repair fs"
 _repair_scratch_fs >> $seqres.full 2>&1
@@ -115,7 +115,7 @@ for x in `seq 1 64`; do
 	test -s "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/093 b/tests/xfs/093
index c4e8006063e121..cfb2a8c80c1770 100755
--- a/tests/xfs/093
+++ b/tests/xfs/093
@@ -55,7 +55,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -72,7 +72,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 65 70`; do
 		touch "${TESTFILE}.${x}" 2> /dev/null && broken=0
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 echo "broken: ${broken}"
 
@@ -94,7 +94,7 @@ for x in `seq 65 70`; do
 	touch "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/097 b/tests/xfs/097
index 384c76080ddcf4..0fcf65a2a8f65a 100755
--- a/tests/xfs/097
+++ b/tests/xfs/097
@@ -58,7 +58,7 @@ for x in `seq 2 64`; do
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
 agcount="$(_xfs_mount_agcount $SCRATCH_MNT)"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -74,7 +74,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	for x in `seq 65 70`; do
 		touch "${TESTFILE}.${x}" 2> /dev/null && broken=0
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 echo "broken: ${broken}"
 
@@ -93,7 +93,7 @@ for x in `seq 65 70`; do
 	touch "${TESTFILE}.${x}" || broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/098 b/tests/xfs/098
index a47cda67e14e29..48eb3fa2b3a753 100755
--- a/tests/xfs/098
+++ b/tests/xfs/098
@@ -56,7 +56,7 @@ for x in `seq 2 64`; do
 	touch "${TESTFILE}.${x}"
 done
 inode="$(stat -c '%i' "${TESTFILE}.1")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -98,7 +98,7 @@ for x in `seq 1 64`; do
 	test $? -ne 0 && broken=1
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/099 b/tests/xfs/099
index f5321fe3d20b1c..17e1e8df7bf751 100755
--- a/tests/xfs/099
+++ b/tests/xfs/099
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}"
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -77,7 +77,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/100 b/tests/xfs/100
index 6f465a79c926d2..dd50d984800335 100755
--- a/tests/xfs/100
+++ b/tests/xfs/100
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}"
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -82,7 +82,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/101 b/tests/xfs/101
index a926acb0bc6735..2abcd711b18703 100755
--- a/tests/xfs/101
+++ b/tests/xfs/101
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}"
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -77,7 +77,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/102 b/tests/xfs/102
index c3ddec5e432dc5..5a7c036ce55751 100755
--- a/tests/xfs/102
+++ b/tests/xfs/102
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -82,7 +82,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/105 b/tests/xfs/105
index 132aa07f8300ef..30d4dc47ec1fed 100755
--- a/tests/xfs/105
+++ b/tests/xfs/105
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -82,7 +82,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/112 b/tests/xfs/112
index f0e717cf26d8c9..267432a863a92d 100755
--- a/tests/xfs/112
+++ b/tests/xfs/112
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -65,14 +65,14 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
 _repair_scratch_fs >> $seqres.full 2>&1
 if [ $? -eq 2 ]; then
 	_scratch_mount
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 	_repair_scratch_fs >> $seqres.full 2>&1
 fi
 
@@ -86,7 +86,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/113 b/tests/xfs/113
index 22ac8c3fd51b80..2f19346aa74b3d 100755
--- a/tests/xfs/113
+++ b/tests/xfs/113
@@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))"
 echo "+ make some files"
 __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true
 inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -86,7 +86,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory"
 	mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -103,7 +103,7 @@ echo "+ modify dir (2)"
 mkdir -p "${SCRATCH_MNT}/blockdir"
 rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory"
 mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/117 b/tests/xfs/117
index 0ca8f1b96ddfd9..ae73ddbebfd53b 100755
--- a/tests/xfs/117
+++ b/tests/xfs/117
@@ -65,7 +65,7 @@ for ((i = 0; i < 64; i++)); do
 done
 echo "First victim inode is: " >> $seqres.full
 stat -c '%i' "$fname" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -85,7 +85,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 		touch "$fname" &>> $seqres.full
 		test $? -eq 0 && broken=0
 	done
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 echo "broken: ${broken}"
 
@@ -110,7 +110,7 @@ for x in `seq 1 64`; do
 	echo "${x}: broken=${broken}" >> $seqres.full
 done
 echo "broken: ${broken}"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/120 b/tests/xfs/120
index f1f047f53a351b..9d0cc12a3e8b8d 100755
--- a/tests/xfs/120
+++ b/tests/xfs/120
@@ -45,7 +45,7 @@ for i in $(seq 1 2 ${nr}); do
 	$XFS_IO_PROG -f -c "fpunch $((i * blksz)) ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full
 done
 inode="$(stat -c '%i' "${SCRATCH_MNT}/bigfile")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 	$XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" -c 'fsync' "${SCRATCH_MNT}/bigfile" >> $seqres.full 2> /dev/null
 	after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")"
 	test "${before}" -eq "${after}" || _fail "pwrite should fail on corrupt bmbt"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -78,7 +78,7 @@ before="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")"
 $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" -c 'fsync' "${SCRATCH_MNT}/bigfile" >> $seqres.full 2> /dev/null
 after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")"
 test "${before}" -ne "${after}" || _fail "pwrite failed after fixing corrupt bmbt"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/123 b/tests/xfs/123
index 6b56551374cd8f..5bd3c86372058e 100755
--- a/tests/xfs/123
+++ b/tests/xfs/123
@@ -44,7 +44,7 @@ str="$(perl -e "print './' x $reps;")x"
 (cd $SCRATCH_MNT; ln -s "${str}" "long_symlink")
 cat "${SCRATCH_MNT}/long_symlink"
 inode="$(stat -c '%i' "${SCRATCH_MNT}/long_symlink")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -55,7 +55,7 @@ _scratch_xfs_db -x -c "inode ${inode}" -c "dblock 0" -c "stack" -c "blocktrash -
 echo "+ mount image"
 if _try_scratch_mount >> $seqres.full 2>&1; then
 	cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -64,7 +64,7 @@ _repair_scratch_fs >> $seqres.full 2>&1
 echo "+ mount image (2)"
 _scratch_mount
 cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/124 b/tests/xfs/124
index fe870dc96cc783..7890434b397262 100755
--- a/tests/xfs/124
+++ b/tests/xfs/124
@@ -46,7 +46,7 @@ seq 0 "${nr}" | while read d; do
 	setfattr -n "user.x$(printf "%.08d" "$d")" -v "0000000000000000" "${SCRATCH_MNT}/attrfile"
 done
 inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -64,7 +64,7 @@ echo "+ mount image && modify xattr"
 if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -80,7 +80,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/"
 echo "+ modify xattr (2)"
 getfattr "${SCRATCH_MNT}/attrfile" -n "user.x00000000" > /dev/null 2>&1 && (setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "remove corrupt xattr")
 setfattr -n "user.x00000000" -v 'x0x0x0x0' "${SCRATCH_MNT}/attrfile" || _fail "add corrupt xattr"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/125 b/tests/xfs/125
index 89e93650556e40..c3770c185b4063 100755
--- a/tests/xfs/125
+++ b/tests/xfs/125
@@ -47,7 +47,7 @@ seq 1 2 "${nr}" | while read d; do
 	setfattr -x "user.x$(printf "%.08d" "$d")" "${SCRATCH_MNT}/attrfile"
 done
 inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -64,7 +64,7 @@ echo "+ mount image && modify xattr"
 if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -80,7 +80,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/"
 echo "+ modify xattr (2)"
 setfattr -n "user.x00000000" -v "1111111111111111" "${SCRATCH_MNT}/attrfile" || _fail "modified corrupt xattr"
 setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "delete corrupt xattr"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/126 b/tests/xfs/126
index 5614ea398c0142..14eb2a6157e141 100755
--- a/tests/xfs/126
+++ b/tests/xfs/126
@@ -47,7 +47,7 @@ seq 1 2 "${nr}" | while read d; do
 	setfattr -x "user.x$(printf "%.08d" "$d")" "${SCRATCH_MNT}/attrfile"
 done
 inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
@@ -69,7 +69,7 @@ echo "+ mount image && modify xattr"
 if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr"
-	umount "${SCRATCH_MNT}"
+	_umount "${SCRATCH_MNT}"
 fi
 
 echo "+ repair fs"
@@ -84,7 +84,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/"
 
 echo "+ modify xattr (2)"
 getfattr "${SCRATCH_MNT}/attrfile" -n "user.x00000000" 2> /dev/null && (setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "modified corrupt xattr")
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail"
diff --git a/tests/xfs/130 b/tests/xfs/130
index 3e6dd861c47851..b1792a98e57db6 100755
--- a/tests/xfs/130
+++ b/tests/xfs/130
@@ -78,7 +78,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/"
 echo "+ reflink more (2)"
 _cp_reflink "${SCRATCH_MNT}/file1" "${SCRATCH_MNT}/file5" || \
 	_fail "modified refcount tree"
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> "$seqres.full" 2>&1 || \
diff --git a/tests/xfs/152 b/tests/xfs/152
index 7ba00c4bfac9ff..66577cfb4617fc 100755
--- a/tests/xfs/152
+++ b/tests/xfs/152
@@ -15,7 +15,7 @@ _begin_fstest auto quick quota idmapped
 
 wipe_mounts()
 {
-	umount "${SCRATCH_MNT}/idmapped" >/dev/null 2>&1
+	_umount "${SCRATCH_MNT}/idmapped" >/dev/null 2>&1
 	_scratch_unmount >/dev/null 2>&1
 }
 
diff --git a/tests/xfs/169 b/tests/xfs/169
index 6400fd9e6bdc8b..16c5385cf4815a 100755
--- a/tests/xfs/169
+++ b/tests/xfs/169
@@ -15,7 +15,7 @@ _begin_fstest auto clone
 _cleanup()
 {
     cd /
-    umount $SCRATCH_MNT > /dev/null 2>&1
+    _umount $SCRATCH_MNT > /dev/null 2>&1
     rm -rf $tmp.*
 }
 
@@ -43,7 +43,7 @@ for i in 1 2 x; do
 		_reflink_range  $testdir/file1 $((nr * blksz)) \
 				$testdir/file2 $((nr * blksz)) $blksz >> $seqres.full
 	done
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 
@@ -51,7 +51,7 @@ for i in 1 2 x; do
 
 	echo "$i: Delete both files"
 	rm -rf $testdir/file1 $testdir/file2
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 done
diff --git a/tests/xfs/206 b/tests/xfs/206
index bfd2dee939ddd7..16a734c3751194 100755
--- a/tests/xfs/206
+++ b/tests/xfs/206
@@ -18,7 +18,7 @@ _begin_fstest growfs auto quick
 # Override the default cleanup function.
 _cleanup()
 {
-    umount $tmpdir
+    _umount $tmpdir
     rmdir $tmpdir
     rm -f $tmp
     rm -f $tmpfile
diff --git a/tests/xfs/216 b/tests/xfs/216
index 680239b4ef788d..149c8fdfec887d 100755
--- a/tests/xfs/216
+++ b/tests/xfs/216
@@ -52,7 +52,7 @@ _do_mkfs()
 			-d name=$LOOP_DEV,size=${i}g $loop_mkfs_opts |grep log
 		mount -o loop -t xfs $LOOP_DEV $LOOP_MNT
 		echo "test write" > $LOOP_MNT/test
-		umount $LOOP_MNT > /dev/null 2>&1
+		_umount $LOOP_MNT > /dev/null 2>&1
 	done
 }
 # make large holey file
diff --git a/tests/xfs/217 b/tests/xfs/217
index 41caaf738267d4..30a186d7294940 100755
--- a/tests/xfs/217
+++ b/tests/xfs/217
@@ -31,7 +31,7 @@ _do_mkfs()
 			-d name=$LOOP_DEV,size=${i}g |grep log
 		mount -o loop -t xfs $LOOP_DEV $LOOP_MNT
 		echo "test write" > $LOOP_MNT/test
-		umount $LOOP_MNT > /dev/null 2>&1
+		_umount $LOOP_MNT > /dev/null 2>&1
 
 		# punch out the previous blocks so that we keep the amount of
 		# disk space the test requires down to a minimum.
diff --git a/tests/xfs/235 b/tests/xfs/235
index 5b201d93076952..0184ff71f2878c 100755
--- a/tests/xfs/235
+++ b/tests/xfs/235
@@ -31,7 +31,7 @@ _pwrite_byte 0x62 0 $((blksz * 64)) ${SCRATCH_MNT}/file0 >> $seqres.full
 _pwrite_byte 0x61 0 $((blksz * 64)) ${SCRATCH_MNT}/file1 >> $seqres.full
 cp -p ${SCRATCH_MNT}/file0 ${SCRATCH_MNT}/file2
 cp -p ${SCRATCH_MNT}/file1 ${SCRATCH_MNT}/file3
-umount ${SCRATCH_MNT}
+_umount ${SCRATCH_MNT}
 
 echo "+ check fs"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || \
@@ -49,7 +49,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then
 
 	$XFS_IO_PROG -f -c "pwrite -S 0x63 0 $((blksz * 64))" -c "fsync" ${SCRATCH_MNT}/file4 >> $seqres.full 2>&1
 	test -s ${SCRATCH_MNT}/file4 && _fail "should not be able to copy with busted rmap btree"
-	umount ${SCRATCH_MNT}
+	_umount ${SCRATCH_MNT}
 fi
 
 echo "+ repair fs"
@@ -66,7 +66,7 @@ $CHATTR_PROG -R -f -i ${SCRATCH_MNT}/
 echo "+ copy more (2)"
 cp -p ${SCRATCH_MNT}/file1 ${SCRATCH_MNT}/file5 || \
 	_fail "modified rmap tree"
-umount ${SCRATCH_MNT}
+_umount ${SCRATCH_MNT}
 
 echo "+ check fs (2)"
 _scratch_xfs_repair -n >> $seqres.full 2>&1 || \
diff --git a/tests/xfs/236 b/tests/xfs/236
index a374a300d1905a..277a9a402e2e05 100755
--- a/tests/xfs/236
+++ b/tests/xfs/236
@@ -15,7 +15,7 @@ _begin_fstest auto rmap punch
 _cleanup()
 {
     cd /
-    umount $SCRATCH_MNT > /dev/null 2>&1
+    _umount $SCRATCH_MNT > /dev/null 2>&1
     rm -rf $tmp.*
 }
 
@@ -44,7 +44,7 @@ for i in 1 2 x; do
 	seq 1 2 $((nr_blks - 1)) | while read nr; do
 		$XFS_IO_PROG -c "fpunch $((nr * blksz)) $blksz" $testdir/file2 >> $seqres.full
 	done
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 
@@ -52,7 +52,7 @@ for i in 1 2 x; do
 
 	echo "$i: Delete both files"
 	rm -rf $testdir/file1 $testdir/file2
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 done
diff --git a/tests/xfs/239 b/tests/xfs/239
index bfe722c0add020..7dc9be7d2edfe0 100755
--- a/tests/xfs/239
+++ b/tests/xfs/239
@@ -66,7 +66,7 @@ md5sum $testdir/file1 | _filter_scratch
 md5sum $testdir/file2 | _filter_scratch
 
 echo "Check for damage"
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _repair_scratch_fs >> $seqres.full
 
 # success, all done
diff --git a/tests/xfs/241 b/tests/xfs/241
index 1532493979ffa7..a779e321417520 100755
--- a/tests/xfs/241
+++ b/tests/xfs/241
@@ -66,7 +66,7 @@ md5sum $testdir/file1 | _filter_scratch
 md5sum $testdir/file2 | _filter_scratch
 
 echo "Check for damage"
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _repair_scratch_fs >> $seqres.full
 
 # success, all done
diff --git a/tests/xfs/250 b/tests/xfs/250
index f8846be6e197aa..82ab08d65192e7 100755
--- a/tests/xfs/250
+++ b/tests/xfs/250
@@ -13,7 +13,7 @@ _begin_fstest auto quick rw prealloc metadata
 _cleanup()
 {
 	cd /
-	umount $LOOP_MNT 2>/dev/null
+	_umount $LOOP_MNT 2>/dev/null
 	rm -f $LOOP_DEV
 	rmdir $LOOP_MNT
 }
@@ -60,7 +60,7 @@ _test_loop()
 	$XFS_IO_PROG -f -c "resvsp 0 $fsize" $LOOP_MNT/foo | _filter_io
 
 	echo "*** unmount loop filesystem"
-	umount $LOOP_MNT > /dev/null 2>&1
+	_umount $LOOP_MNT > /dev/null 2>&1
 
 	echo "*** check loop filesystem"
 	 _check_xfs_filesystem $LOOP_DEV none none
diff --git a/tests/xfs/265 b/tests/xfs/265
index 21de4c054a573f..2ba7342d066bb6 100755
--- a/tests/xfs/265
+++ b/tests/xfs/265
@@ -16,7 +16,7 @@ _begin_fstest auto clone
 _cleanup()
 {
     cd /
-    umount $SCRATCH_MNT > /dev/null 2>&1
+    _umount $SCRATCH_MNT > /dev/null 2>&1
     rm -rf $tmp.*
 }
 
@@ -51,7 +51,7 @@ for i in 1 2 x; do
 		truncate -s $((blksz * (nr_blks - nr))) $testdir/file1.$nr >> $seqres.full
 	done
 
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 
@@ -60,7 +60,7 @@ for i in 1 2 x; do
 	echo "$i: Delete both files"
 	rm -rf $testdir
 	mkdir -p $testdir
-	umount $SCRATCH_MNT
+	_umount $SCRATCH_MNT
 	_check_scratch_fs
 	_scratch_mount
 done
diff --git a/tests/xfs/310 b/tests/xfs/310
index 34d17be97f36dd..f2a7ca50f67199 100755
--- a/tests/xfs/310
+++ b/tests/xfs/310
@@ -13,7 +13,7 @@ _begin_fstest auto clone rmap prealloc
 _cleanup()
 {
 	cd /
-	umount $SCRATCH_MNT > /dev/null 2>&1
+	_umount $SCRATCH_MNT > /dev/null 2>&1
 	_dmhugedisk_cleanup
 	rm -rf $tmp.*
 }
@@ -53,7 +53,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((nr_blks * blksz))" $testdir/file1 >> $seqres.ful
 echo "Check extent count"
 xfs_bmap -l -p -v $testdir/file1 | grep '^[[:space:]]*2:' -q && xfs_bmap -l -p -v $testdir/file1
 inum=$(stat -c '%i' $testdir/file1)
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 
 echo "Check bmap count"
 nr_bmaps=$(xfs_db -c "inode $inum" -c "bmap" $DMHUGEDISK_DEV | grep 'data offset' | wc -l)
diff --git a/tests/xfs/716 b/tests/xfs/716
index cd4fffef298d31..55c66d1cf8bb19 100755
--- a/tests/xfs/716
+++ b/tests/xfs/716
@@ -49,7 +49,7 @@ ino=$(stat -c '%i' $file)
 
 # Figure out how many extents we need to have to create a data fork that's in
 # btree format.
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 di_forkoff=$(_scratch_xfs_db -c "inode $ino" -c "p core.forkoff" | \
 	awk '{print $3}')
 _scratch_xfs_db -c "inode $ino" -c "p" >> $seqres.full
@@ -61,7 +61,7 @@ $XFS_IO_PROG -c "falloc 0 $(( (min_ext_for_btree + 1) * 2 * blksz))" $file
 $here/src/punch-alternating $file
 
 # Make sure the data fork is in btree format.
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _scratch_xfs_db -c "inode $ino" -c "p core.format" | grep -q "btree" || \
 	echo "data fork not in btree format?"
 echo "about to start test" >> $seqres.full



* [PATCH 3/6] xfs: test health monitoring code
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong
@ 2024-12-31 23:57   ` Darrick J. Wong
  2024-12-31 23:57   ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a functionality test for the new health monitoring code.  The test
checks that healthmon handles the xfs module refcount correctly.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 doc/group-names.txt |    1 +
 tests/xfs/1885      |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1885.out  |    5 +++++
 3 files changed, 59 insertions(+)
 create mode 100755 tests/xfs/1885
 create mode 100644 tests/xfs/1885.out


diff --git a/doc/group-names.txt b/doc/group-names.txt
index b04d0180e8ec02..8fbb260d8c7bb5 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -117,6 +117,7 @@ samefs			overlayfs when all layers are on the same fs
 scrub			filesystem metadata scrubbers
 seed			btrfs seeded filesystems
 seek			llseek functionality
+selfhealing		self healing filesystem code
 selftest		tests with fixed results, used to validate testing setup
 send			btrfs send/receive
 shrinkfs		decreasing the size of a filesystem
diff --git a/tests/xfs/1885 b/tests/xfs/1885
new file mode 100755
index 00000000000000..1b87af3a9178fc
--- /dev/null
+++ b/tests/xfs/1885
@@ -0,0 +1,53 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1885
+#
+# Make sure that healthmon handles module refcount correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/module
+
+refcount_file="/sys/module/xfs/refcnt"
+test -e "$refcount_file" || _notrun "cannot find xfs module refcount"
+
+_require_test
+_require_xfs_io_command healthmon
+
+# Capture mod refcount without the test fs mounted
+_test_unmount
+init_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with the test fs mounted
+_test_mount
+nomon_mount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with test fs mounted and the healthmon fd open.
+# Pause the xfs_io process so that it doesn't actually respond to events.
+$XFS_IO_PROG -c 'healthmon -c -v' $TEST_DIR &
+sleep 0.5
+kill -STOP %1
+mon_mount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with only the healthmon fd open.
+_test_unmount
+mon_nomount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount after continuing healthmon (which should exit due to the
+# unmount) and killing it.
+kill -CONT %1
+kill %1
+wait
+nomon_nomount_refcount="$(cat "$refcount_file")"
+
+_within_tolerance "mount refcount" "$nomon_mount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "mount + healthmon refcount" "$mon_mount_refcount" "$((init_refcount + 2))" 0 -v
+_within_tolerance "healthmon refcount" "$mon_nomount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "end refcount" "$nomon_nomount_refcount" "$init_refcount" 0 -v
+
+status=0
+exit
diff --git a/tests/xfs/1885.out b/tests/xfs/1885.out
new file mode 100644
index 00000000000000..f152cef0525609
--- /dev/null
+++ b/tests/xfs/1885.out
@@ -0,0 +1,5 @@
+QA output created by 1885
+mount refcount is in range
+mount + healthmon refcount is in range
+healthmon refcount is in range
+end refcount is in range
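
The four _within_tolerance checks in xfs/1885 boil down to simple
arithmetic: the mount and the open healthmon fd are each expected to pin
the xfs module exactly once.  A minimal sketch of that accounting follows.
The helper name expected_refcount and the baseline value are assumptions
for illustration only; neither exists in fstests.

```shell
#!/bin/bash
# Hypothetical helper, not part of fstests: compute the module refcount
# that xfs/1885 expects for a given state, relative to the baseline taken
# with the test fs unmounted and no healthmon fd open.
expected_refcount() {
	local init="$1"		# baseline refcount
	local mounted="$2"	# 1 if the test fs is mounted, else 0
	local monitored="$3"	# 1 if a healthmon fd is open, else 0

	echo "$((init + mounted + monitored))"
}

# Walk the four states in the order the test samples them.
init=3	# arbitrary example baseline
echo "mount:             $(expected_refcount "$init" 1 0)"
echo "mount + healthmon: $(expected_refcount "$init" 1 1)"
echo "healthmon only:    $(expected_refcount "$init" 0 1)"
echo "end:               $(expected_refcount "$init" 0 0)"
```

The test passes a tolerance of 0 to _within_tolerance, so any leaked or
missing reference in one of these states shows up as a golden output
mismatch.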



* [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-12-31 23:57   ` [PATCH 3/6] xfs: test health monitoring code Darrick J. Wong
@ 2024-12-31 23:57   ` Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 5/6] xfs: test io " Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check if we can detect runtime metadata corruptions via the health
monitor.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc          |   10 ++++++
 tests/xfs/1879     |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1879.out |   12 +++++++
 3 files changed, 111 insertions(+)
 create mode 100755 tests/xfs/1879
 create mode 100644 tests/xfs/1879.out


diff --git a/common/rc b/common/rc
index 0d5c785cecc017..dd6857461e14dd 100644
--- a/common/rc
+++ b/common/rc
@@ -2850,6 +2850,16 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
+	"healthmon")
+		testio=`$XFS_IO_PROG -c "$command -p $param" $TEST_DIR 2>&1`
+		echo $testio | grep -q "bad argument count" && \
+			_notrun "xfs_io $command $param support is missing"
+		echo $testio | grep -q "Inappropriate ioctl" && \
+			_notrun "xfs_io $command $param ioctl support is missing"
+		echo $testio | grep -q "Operation not supported" && \
+			_notrun "xfs_io $command $param kernel support is missing"
+		param_checked="$param"
+		;;
 	"label")
 		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
 		;;
diff --git a/tests/xfs/1879 b/tests/xfs/1879
new file mode 100755
index 00000000000000..aab7bf9fa1f6e4
--- /dev/null
+++ b/tests/xfs/1879
@@ -0,0 +1,89 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1879
+#
+# Corrupt some metadata and try to access it with the health monitoring program
+# running.  Check that healthmon observes a metadata error.
+#
+. ./common/preamble
+_begin_fstest auto quick eio
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.*
+}
+
+. ./common/filter
+
+_require_scratch_nocheck
+_require_xfs_io_command healthmon
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs -d agcount=1 | _filter_mkfs 2> $tmp.mkfs >> $seqres.full
+. $tmp.mkfs
+_scratch_mount
+mkdir $SCRATCH_MNT/a/
+# Enough entries to get to a single block directory
+for ((i = 0; i < ( (isize + 255) / 256); i++)); do
+	path="$(printf "%s/a/%0255d" "$SCRATCH_MNT" "$i")"
+	touch "$path"
+done
+inum="$(stat -c %i "$SCRATCH_MNT/a")"
+_scratch_unmount
+
+# Fuzz the directory block so that the touch below will be guaranteed to trip
+# a runtime sickness report in exactly the manner we desire.
+_scratch_xfs_db -x -c "inode $inum" -c "dblock 0" -c 'fuzz bhdr.hdr.owner add' -c print &>> $seqres.full
+
+# Create a file in the fuzzed directory to trigger a metadata corruption event
+echo "Runtime corruption detection"
+_scratch_mount
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+sleep 1	# wait for healthmon to start up
+touch $SCRATCH_MNT/a/farts &>> $seqres.full
+_scratch_unmount
+
+wait	# for healthmon to finish
+
+# Did we get errors?
+filter_healthmon()
+{
+	cat $tmp.healthmon >> $seqres.full
+	grep -A2 -E '(sick|corrupt)' $tmp.healthmon | grep -v -- '--' | sort | uniq
+}
+filter_healthmon
+
+# Run scrub to trigger a health event from there too.
+echo "Scrub corruption detection"
+_scratch_mount
+if _supports_xfs_scrub $SCRATCH_MNT $SCRATCH_DEV; then
+	$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+	sleep 1	# wait for healthmon to start up
+	$XFS_SCRUB_PROG -n $SCRATCH_MNT &>> $seqres.full
+	_scratch_unmount
+
+	wait	# for healthmon to finish
+
+	# Did we get errors?
+	filter_healthmon
+else
+	# mock the output since we don't support scrub
+	_scratch_unmount
+	cat << ENDL
+  "domain":     "inode",
+  "structures":  ["directory"],
+  "structures":  ["parent"],
+  "type":       "corrupt",
+  "type":       "sick",
+ENDL
+fi
+
+status=0
+exit
diff --git a/tests/xfs/1879.out b/tests/xfs/1879.out
new file mode 100644
index 00000000000000..f02eefbf58ad6c
--- /dev/null
+++ b/tests/xfs/1879.out
@@ -0,0 +1,12 @@
+QA output created by 1879
+Format and mount
+Runtime corruption detection
+  "domain":     "inode",
+  "structures":  ["directory"],
+  "type":       "sick",
+Scrub corruption detection
+  "domain":     "inode",
+  "structures":  ["directory"],
+  "structures":  ["parent"],
+  "type":       "corrupt",
+  "type":       "sick",
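
The golden output above is what is left after filter_healthmon has
reduced the monitor's output.  A minimal sketch of how that pipeline
behaves follows.  The sample records are made up: the field layout is an
assumption inferred from 1879.out, and the "inumber" field is purely
hypothetical.

```shell
#!/bin/bash
# Hypothetical healthmon output: two identical event records.  The field
# layout is an assumption based on the lines in 1879.out; "inumber" is
# made up for illustration.
sample_events() {
	cat << 'EOF'
{
  "type":       "sick",
  "domain":     "inode",
  "structures":  ["directory"],
  "inumber":    131,
}
{
  "type":       "sick",
  "domain":     "inode",
  "structures":  ["directory"],
  "inumber":    131,
}
EOF
}

# Same pipeline as filter_healthmon, minus the copy to $seqres.full:
# keep each matching line plus the two lines after it, drop the "--"
# group separators that grep -A emits, then collapse duplicates.
filter_events() {
	grep -A2 -E '(sick|corrupt)' | grep -v -- '--' | LC_ALL=C sort | uniq
}

sample_events | filter_events
```

On this sample the pipeline prints three lines (domain, structures,
type), which mirrors the "Runtime corruption detection" section of
1879.out.  Sorting and deduplicating keeps the golden output stable even
if the kernel reports the same problem more than once.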



* [PATCH 5/6] xfs: test io error reporting via healthmon
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-12-31 23:57   ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
@ 2024-12-31 23:58   ` Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new test to make sure the kernel can report IO errors via
health monitoring.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1878     |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1878.out |   10 +++++++
 2 files changed, 90 insertions(+)
 create mode 100755 tests/xfs/1878
 create mode 100644 tests/xfs/1878.out


diff --git a/tests/xfs/1878 b/tests/xfs/1878
new file mode 100755
index 00000000000000..882d0dcca03cb1
--- /dev/null
+++ b/tests/xfs/1878
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1878
+#
+# Attempt to read and write a file in buffered and directio mode with the
+# health monitoring program running.  Check that healthmon observes all four
+# types of IO errors.
+#
+. ./common/preamble
+_begin_fstest auto quick eio
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $testdir
+	_dmerror_cleanup
+}
+
+. ./common/filter
+. ./common/dmerror
+
+_require_scratch_nocheck
+_require_xfs_io_command healthmon
+_require_dm_target error
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_dmerror_init no_log
+_dmerror_mount
+
+_require_fs_space $SCRATCH_MNT 65536
+
+# Create a file with written regions far enough apart that the pagecache can't
+# possibly be caching the regions with a single folio.
+testfile=$SCRATCH_MNT/fsync-err-test
+$XFS_IO_PROG -f \
+	-c 'pwrite -b 1m 0 1m' \
+	-c 'pwrite -b 1m 10g 1m' \
+	-c 'pwrite -b 1m 20g 1m' \
+	-c fsync $testfile >> $seqres.full
+
+# First we check if directio errors get reported
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for healthmon to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -d -c 'pwrite -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -d -c 'pread -b 256k 10g 16k' $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+_dmerror_mount
+
+# Next we check if buffered io errors get reported.  We have to write something
+# before loading the error table to ensure the dquots get loaded.
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 1k' -c fsync $testfile >> $seqres.full
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for healthmon to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -c 'pread -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 16k' -c fsync $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+
+# Did we get errors?
+cat $tmp.healthmon >> $seqres.full
+grep -E '(diowrite|dioread|readahead|writeback)' $tmp.healthmon | sort | uniq
+
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/xfs/1878.out b/tests/xfs/1878.out
new file mode 100644
index 00000000000000..a8070c3c1afd23
--- /dev/null
+++ b/tests/xfs/1878.out
@@ -0,0 +1,10 @@
+QA output created by 1878
+Format and mount
+pwrite: Input/output error
+pread: Input/output error
+pread: Input/output error
+fsync: Input/output error
+  "type":       "dioread",
+  "type":       "diowrite",
+  "type":       "readahead",
+  "type":       "writeback",


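The final check in xfs/1878 reduces the healthmon log to one line per IO
error type.  A minimal sketch of that filtering step follows, run against a
hand-written sample log.  The sample lines are assumptions modelled on
1878.out, not captured kernel output, and summarize_io_errors is a
hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical healthmon output; a real log holds complete JSON events.
sample_log=$(mktemp)
cat > "$sample_log" << 'EOF'
  "type":       "diowrite",
  "type":       "dioread",
  "type":       "diowrite",
  "type":       "readahead",
  "type":       "writeback",
  "type":       "readahead",
EOF

# Same pipeline as the test: keep only the IO error event types, then
# collapse duplicates so the golden output does not depend on how many
# times each error fired.
summarize_io_errors() {
	grep -E '(diowrite|dioread|readahead|writeback)' "$1" | sort | uniq
}

summary=$(summarize_io_errors "$sample_log")
echo "$summary"
rm -f "$sample_log"
```

Sorting before uniq matters here because uniq only collapses adjacent
duplicate lines.
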

* [PATCH 6/6] xfs: test new xfs_scrubbed daemon
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-12-31 23:58   ` [PATCH 5/6] xfs: test io " Darrick J. Wong
@ 2024-12-31 23:58   ` Darrick J. Wong
  5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure the daemon in charge of self healing xfs actually does what it
says it does.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/config      |    6 ++++
 common/systemd     |    9 +++++
 common/xfs         |   16 ++++++++++
 tests/xfs/1882     |   64 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1882.out |    2 +
 tests/xfs/1883     |   75 +++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1883.out |    2 +
 tests/xfs/1884     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1884.out |    2 +
 9 files changed, 263 insertions(+)
 create mode 100755 tests/xfs/1882
 create mode 100644 tests/xfs/1882.out
 create mode 100755 tests/xfs/1883
 create mode 100644 tests/xfs/1883.out
 create mode 100755 tests/xfs/1884
 create mode 100644 tests/xfs/1884.out


diff --git a/common/config b/common/config
index fcff0660b05a97..2b3f946f3d308d 100644
--- a/common/config
+++ b/common/config
@@ -166,6 +166,12 @@ export XFS_ADMIN_PROG="$(type -P xfs_admin)"
 export XFS_GROWFS_PROG=$(type -P xfs_growfs)
 export XFS_SPACEMAN_PROG="$(type -P xfs_spaceman)"
 export XFS_SCRUB_PROG="$(type -P xfs_scrub)"
+XFS_SCRUBBED_PROG="$(type -P xfs_scrubbed)"
+# Normally the scrubbed daemon is installed in libexec
+if [ -z "$XFS_SCRUBBED_PROG" ] && [ -e /usr/libexec/xfs_scrubbed ]; then
+	XFS_SCRUBBED_PROG=/usr/libexec/xfs_scrubbed
+fi
+export XFS_SCRUBBED_PROG
 export XFS_PARALLEL_REPAIR_PROG="$(type -P xfs_prepair)"
 export XFS_PARALLEL_REPAIR64_PROG="$(type -P xfs_prepair64)"
 export __XFSDUMP_PROG="$(type -P xfsdump)"
diff --git a/common/systemd b/common/systemd
index b2e24f267b2d93..8366d4cba39d85 100644
--- a/common/systemd
+++ b/common/systemd
@@ -71,3 +71,12 @@ _systemd_unit_status() {
 	_systemd_installed || return 1
 	systemctl status "$1"
 }
+
+# Start a systemd unit
+_systemd_unit_start() {
+	systemctl start "$1"
+}
+# Stop a running systemd unit
+_systemd_unit_stop() {
+	systemctl stop "$1"
+}
diff --git a/common/xfs b/common/xfs
index b9e897e0e8839a..b4f69403e7396e 100644
--- a/common/xfs
+++ b/common/xfs
@@ -2224,3 +2224,19 @@ _scratch_find_rt_metadir_entry() {
 
 	return 1
 }
+
+# Run the xfs_scrubbed self healing daemon
+_scratch_xfs_scrubbed() {
+	local scrubbed_args=()
+	local daemon_dir
+	daemon_dir=$(dirname "$XFS_SCRUBBED_PROG")
+
+	# If we're being run from a development branch, we might need to find
+	# the schema file on our own.
+	local maybe_schema="$daemon_dir/../libxfs/xfs_healthmon.schema.json"
+	if [ -f "$maybe_schema" ]; then
+		scrubbed_args+=(--event-schema "$maybe_schema")
+	fi
+
+	$XFS_SCRUBBED_PROG "${scrubbed_args[@]}" "$@" $SCRATCH_MNT
+}
diff --git a/tests/xfs/1882 b/tests/xfs/1882
new file mode 100755
index 00000000000000..b6a8bd545dbcf5
--- /dev/null
+++ b/tests/xfs/1882
@@ -0,0 +1,64 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1882
+#
+# Make sure that xfs_scrubbed correctly handles all the reports that it gets
+# from the kernel.  We simulate this by using the --everything mode so we get
+# all the events, not just the sickness reports.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+. ./common/populate
+
+_require_scrub
+_require_xfs_io_command "scrub"		# online check support
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+# Does this fs support health monitoring?
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_scratch_xfs_scrubbed --check || \
+	_notrun "health monitoring not supported on this kernel"
+_scratch_xfs_scrubbed --require-validation --check && \
+	_notrun "skipping this test in favor of the one that does json validation"
+_scratch_unmount
+
+# Create a sample fs with all the goodies
+_scratch_populate_cached nofill &>> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
+	_systemd_unit_stop "xfs_scrubbed@${scratch_path}" &>> $seqres.full
+fi
+
+# Start the health monitor, have it log everything
+_scratch_xfs_scrubbed --everything --log > $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Run scrub to make some noise
+_scratch_scrub -b -n >> $seqres.full
+
+# Unmount fs to kill scrubbed, then wait for it to finish
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+
+cat $tmp.scrubbed >> $seqres.full
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1882.out b/tests/xfs/1882.out
new file mode 100644
index 00000000000000..9b31ccb735cabd
--- /dev/null
+++ b/tests/xfs/1882.out
@@ -0,0 +1,2 @@
+QA output created by 1882
+Silence is golden
diff --git a/tests/xfs/1883 b/tests/xfs/1883
new file mode 100755
index 00000000000000..9bba989386b37e
--- /dev/null
+++ b/tests/xfs/1883
@@ -0,0 +1,75 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1883
+#
+# Make sure that xfs_scrubbed correctly validates the json events that it gets
+# from the kernel.  We simulate this by using the --everything mode so we get
+# all the events, not just the sickness reports.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+. ./common/populate
+
+_require_scrub
+_require_xfs_io_command "scrub"		# online check support
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+# Does this fs support health monitoring?
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_scratch_xfs_scrubbed --require-validation --check || \
+	_notrun "health monitoring with validation not supported on this kernel"
+_scratch_unmount
+
+# Create a sample fs with all the goodies
+_scratch_populate_cached nofill &>> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
+	_systemd_unit_stop "xfs_scrubbed@${scratch_path}" &>> $seqres.full
+fi
+
+# Start the health monitor, have it validate everything
+_scratch_xfs_scrubbed --require-validation --everything --debug-fast --log &> $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Run scrub to make some noise
+_scratch_scrub -b -n >> $seqres.full
+
+# Wait for up to 60 seconds for the log file to stop growing
+old_logsz=
+new_logsz=$(stat -c '%s' $tmp.scrubbed)
+for ((i = 0; i < 60; i++)); do
+	test "$old_logsz" = "$new_logsz" && break
+	old_logsz="$new_logsz"
+	sleep 1
+	new_logsz=$(stat -c '%s' $tmp.scrubbed)
+done
+
+# Unmount fs to kill scrubbed, then wait for it to finish
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+
+# Look for schema validation errors
+grep -q 'not valid under any of the given schemas' $tmp.scrubbed && \
+	echo "Should not have found schema validation errors"
+cat $tmp.scrubbed >> $seqres.full
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1883.out b/tests/xfs/1883.out
new file mode 100644
index 00000000000000..bc9c390c778b6e
--- /dev/null
+++ b/tests/xfs/1883.out
@@ -0,0 +1,2 @@
+QA output created by 1883
+Silence is golden
diff --git a/tests/xfs/1884 b/tests/xfs/1884
new file mode 100755
index 00000000000000..fc6e0a48372fda
--- /dev/null
+++ b/tests/xfs/1884
@@ -0,0 +1,87 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1884
+#
+# Ensure that autonomous self healing fixes the filesystem correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_scratch_xfs_scrubbed --repair --check || \
+	_notrun "health monitoring with repair not supported on this kernel"
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	svcname="xfs_scrubbed@$(systemd-escape --path "$SCRATCH_MNT")"
+	echo "$svcname: $(systemctl is-active "$svcname")" >> $seqres.full
+	_systemd_unit_stop "$svcname" &>> $seqres.full
+fi
+
+# Start the health monitor, have it repair everything reported corrupt
+_scratch_xfs_scrubbed --repair --log > $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+_filter_scratch < $tmp.err
+
+# Unmount fs to kill scrubbed, then wait for it to finish.
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+cat $tmp.scrubbed >> $seqres.full
+
+status=0
+exit
diff --git a/tests/xfs/1884.out b/tests/xfs/1884.out
new file mode 100644
index 00000000000000..929e33da01f92c
--- /dev/null
+++ b/tests/xfs/1884.out
@@ -0,0 +1,2 @@
+QA output created by 1884
+ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning


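xfs/1883 waits for the daemon log to stop growing before it unmounts.  The
same polling idea can be sketched as a standalone helper.  The name
wait_for_quiet_file and its max_tries and interval parameters are
illustrative assumptions, and unlike the loop in the test this version
reports a timeout through its exit status.  It relies on GNU stat and a
sleep that accepts fractional seconds.

```shell
#!/bin/bash
# Poll a file's size until two consecutive samples match, or give up
# after max_tries polls.  Hypothetical helper, not part of fstests.
wait_for_quiet_file() {
	local file="$1"
	local max_tries="${2:-60}"
	local interval="${3:-1}"
	local old_sz=
	local new_sz
	local i

	new_sz=$(stat -c '%s' "$file")
	for ((i = 0; i < max_tries; i++)); do
		# Two identical samples in a row mean the writer went quiet
		test "$old_sz" = "$new_sz" && return 0
		old_sz="$new_sz"
		sleep "$interval"
		new_sz=$(stat -c '%s' "$file")
	done
	return 1
}

logfile=$(mktemp)
echo "event" > "$logfile"
if wait_for_quiet_file "$logfile" 5 0.1; then
	result="quiet"
else
	result="still growing"
fi
echo "$result"
rm -f "$logfile"
```
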

* [PATCHSET 5/5] fstests: add difficult V5 features to filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (13 preceding siblings ...)
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
                     ` (2 more replies)
  2025-01-02  1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
  15 siblings, 3 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

This series enables xfs_repair to add select features to existing V5
filesystems.  Specifically, one can add free inode btrees, reflink
support, and reverse mapping.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-newer-features

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=upgrade-newer-features
---
Commits in this patchset:
 * xfs/1856: add metadir upgrade to test matrix
 * xfs/1856: add rtrmapbt upgrade to test matrix
 * xfs/1856: add rtreflink upgrade to test matrix
---
 tests/xfs/1856 |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)



* [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix
  2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
@ 2024-12-31 23:58   ` Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong
  2024-12-31 23:59   ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong
  2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add metadata directory trees to the features that this test will try to
upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |    1 +
 1 file changed, 1 insertion(+)


diff --git a/tests/xfs/1856 b/tests/xfs/1856
index 7524a449c3af00..fedeb157dbd9bb 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -188,6 +188,7 @@ else
 	check_repair_upgrade reflink && FEATURES+=("reflink")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
+	check_repair_upgrade metadir && FEATURES+=("metadir")
 fi
 
 test "${#FEATURES[@]}" -eq 0 && \



* [PATCH 2/3] xfs/1856: add rtrmapbt upgrade to test matrix
  2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
@ 2024-12-31 23:58   ` Darrick J. Wong
  2024-12-31 23:59   ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong
  2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add realtime reverse mapping btrees to the features that this test will
try to upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |   40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)


diff --git a/tests/xfs/1856 b/tests/xfs/1856
index fedeb157dbd9bb..8e3213da752348 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -30,11 +30,47 @@ rt_configured()
 	test "$USE_EXTERNAL" = "yes" && test -n "$SCRATCH_RTDEV"
 }
 
+# Does mkfs support metadir?
+supports_metadir()
+{
+	$MKFS_XFS_PROG 2>&1 | grep -q 'metadir='
+}
+
+# Do we need to enable metadir at mkfs time to support a feature upgrade test?
+need_metadir()
+{
+	local feat="$1"
+
+	# if realtime isn't configured, we don't need metadir
+	rt_configured || return 1
+
+	# If we don't even know what realtime rmap is, we don't need rt groups
+	# and hence don't need metadir.
+	test -z "${FEATURE_STATE["rmapbt"]}" && return 1
+
+	# rt rmap btrees require metadir, but metadir cannot be added to an
+	# existing rt filesystem.  Force it on at mkfs time.
+	test "${FEATURE_STATE["rmapbt"]}" -eq 1 && return 0
+	test "$feat" = "rmapbt" && return 0
+
+	return 1
+}
+
 # Compute the MKFS_OPTIONS string for a particular feature upgrade test
 compute_mkfs_options()
 {
+	local feat="$1"
 	local m_opts=""
 	local caller_options="$MKFS_OPTIONS"
+	local metadir
+
+	need_metadir "$feat" && metadir=1
+	if echo "$caller_options" | grep -q 'metadir='; then
+		test -z "$metadir" && metadir=0
+		caller_options="$(echo "$caller_options" | sed -e 's/metadir=*[0-9]*/metadir='$metadir'/g')"
+	elif [ -n "$metadir" ]; then
+		caller_options="$caller_options -m metadir=$metadir"
+	fi
 
 	for feat in "${FEATURES[@]}"; do
 		local feat_state="${FEATURE_STATE["${feat}"]}"
@@ -179,9 +215,11 @@ MKFS_OPTIONS="$(qerase_mkfs_options)"
 # upgrade don't spread failure to the rest of the tests.
 FEATURES=()
 if rt_configured; then
+	# rmap wasn't added to rt devices until after metadir
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
+	supports_metadir && check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
 else
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
@@ -204,7 +242,7 @@ for feat in "${FEATURES[@]}"; do
 
 	upgrade_start_message "$feat" | _tee_kernlog $seqres.full > /dev/null
 
-	opts="$(compute_mkfs_options)"
+	opts="$(compute_mkfs_options "$feat")"
 	echo "mkfs.xfs $opts" >> $seqres.full
 
 	# Format filesystem


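The metadir handling added to compute_mkfs_options() either rewrites an
existing metadir= setting or appends a new one.  A minimal standalone sketch
of that logic is shown here.  force_metadir is a hypothetical helper, and it
uses the slightly stricter pattern 'metadir=[0-9]*' rather than the
'metadir=*[0-9]*' expression in the patch.

```shell
#!/bin/bash
# Hypothetical helper mirroring the metadir logic in compute_mkfs_options():
#   $1 = caller's mkfs options, $2 = desired metadir value (may be empty)
force_metadir() {
	local options="$1"
	local metadir="$2"

	if echo "$options" | grep -q 'metadir='; then
		# The caller already set metadir; override it, defaulting to
		# off when the feature under test does not need it.
		test -z "$metadir" && metadir=0
		options="$(echo "$options" | sed -e "s/metadir=[0-9]*/metadir=$metadir/g")"
	elif [ -n "$metadir" ]; then
		# The caller said nothing about metadir; append our setting.
		options="$options -m metadir=$metadir"
	fi
	echo "$options"
}

r1=$(force_metadir "-m crc=1,metadir=0" 1)
r2=$(force_metadir "-m crc=1" 1)
r3=$(force_metadir "-m crc=1,metadir=1" "")
r4=$(force_metadir "-m crc=1" "")
printf '%s\n' "$r1" "$r2" "$r3" "$r4"
```

The third call shows the case the patch guards against: a caller-supplied
metadir=1 is forced back to 0 when the upgrade under test does not need
metadir at mkfs time.
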

* [PATCH 3/3] xfs/1856: add rtreflink upgrade to test matrix
  2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
  2024-12-31 23:58   ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong
@ 2024-12-31 23:59   ` Darrick J. Wong
  2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:59 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add realtime reflink to the features that this test will try to
upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/tests/xfs/1856 b/tests/xfs/1856
index 8e3213da752348..9b776493f0486f 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -215,11 +215,12 @@ MKFS_OPTIONS="$(qerase_mkfs_options)"
 # upgrade don't spread failure to the rest of the tests.
 FEATURES=()
 if rt_configured; then
-	# rmap wasn't added to rt devices until after metadir
+	# rmap & reflink weren't added to rt devices until after metadir
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
 	supports_metadir && check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
+	supports_metadir && check_repair_upgrade reflink && FEATURES+=("reflink")
 else
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade rmapbt && FEATURES+=("rmapbt")



* Re: [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (14 preceding siblings ...)
  2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
@ 2025-01-02  1:37 ` Stephen Zhang
  2025-01-07  0:26   ` Darrick J. Wong
  15 siblings, 1 reply; 110+ messages in thread
From: Stephen Zhang @ 2025-01-02  1:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
	Christoph Hellwig, Dave Chinner, xfs, greg.marsden, shirley.ma,
	konrad.wilk, fstests

Darrick J. Wong <djwong@kernel.org> wrote on Wed, Jan 1, 2025 at 07:25:
>
> Hi everyone,
>
> Thank you all for helping get online repair, parent pointers, and
> metadata directory trees, and realtime allocation groups merged this
> year!  We got a lot done in 2024.
>
> Having sent pull requests to Carlos for the last pieces of the realtime
> modernization project, I have exactly two worthwhile projects left in my
> development trees!  The stuff here isn't necessarily in mergeable state
> yet, but I still believe everyone ought to know what I'm up to.
>
> The first project implements (somewhat buggily; I never quite got back
> to dealing with moving eof blocks) free space defragmentation so that we
> can meaningfully shrink filesystems; garbage collect regions of the
> filesystem; or prepare for large allocations.  There's not much new
> kernel code other than exporting refcounts and gaining the ability to
> map free space.
>
> The second project initiates filesystem self healing routines whenever
> problems start to crop up, which means that it can run fully
> autonomously in the background.  The monitoring system uses some
> pseudo-file and seqbuf tricks that I lifted from kmo last winter.
>
> Both of these projects are largely userspace code.
>
> Also I threw in some xfs_repair code to do dangerous fs upgrades.
> Nobody should use these, ever.
>
> Maintainers: please do not merge, this is a dog-and-pony show to attract
> developer attention.
>

[Add Dave to the list]

Hi, Darrick and all,

Recently, I have been considering implementing the XFS shrink feature based
on the AF concept, which was mentioned in this link:

https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/

That posting stated:
The rules used by AG are more about extending outwards.
whilst
The rules used by AF are more about restricting inwards.

So the AF concept implicitly and naturally carries the semantics of
compressing/shrinking (restricting).

AG (for xfs extend) and AF (for xfs shrink) are constructed symmetrically,
which makes it more elegant and easier to build more complex features on them.

To elaborate, AGs should not be seen as independent entities in the shrink
context.  Otherwise each AG requires separate management (flags or something
to indicate the state of that AG/region), which would increase the system
complexity compared to the idea behind AF.  AF views several AGs as a whole.

And when it comes to growfs, things start to get a little more complicated,
but AF can handle it easily and naturally.

However, talk is cheap.  To validate our point, we truly hope to have the
opportunity to participate in developing these features by integrating the
existing infrastructure you have already established with the AF concept.

Best regards,
Shida

> --D
>
> PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
> patches.  And finally rebase fstests to 2024-12-08.
>


* Re: [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
  2025-01-02  1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
@ 2025-01-07  0:26   ` Darrick J. Wong
  0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2025-01-07  0:26 UTC (permalink / raw)
  To: Stephen Zhang
  Cc: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
	Christoph Hellwig, Dave Chinner, xfs, greg.marsden, shirley.ma,
	konrad.wilk, fstests

On Thu, Jan 02, 2025 at 09:37:47AM +0800, Stephen Zhang wrote:
> Darrick J. Wong <djwong@kernel.org> wrote on Wed, Jan 1, 2025 at 07:25:
> >
> > Hi everyone,
> >
> > Thank you all for helping get online repair, parent pointers, and
> > metadata directory trees, and realtime allocation groups merged this
> > year!  We got a lot done in 2024.
> >
> > Having sent pull requests to Carlos for the last pieces of the realtime
> > modernization project, I have exactly two worthwhile projects left in my
> > development trees!  The stuff here isn't necessarily in mergeable state
> > yet, but I still believe everyone ought to know what I'm up to.
> >
> > The first project implements (somewhat buggily; I never quite got back
> > to dealing with moving eof blocks) free space defragmentation so that we
> > can meaningfully shrink filesystems; garbage collect regions of the
> > filesystem; or prepare for large allocations.  There's not much new
> > kernel code other than exporting refcounts and gaining the ability to
> > map free space.
> >
> > The second project initiates filesystem self healing routines whenever
> > problems start to crop up, which means that it can run fully
> > autonomously in the background.  The monitoring system uses some
> > pseudo-file and seqbuf tricks that I lifted from kmo last winter.
> >
> > Both of these projects are largely userspace code.
> >
> > Also I threw in some xfs_repair code to do dangerous fs upgrades.
> > Nobody should use these, ever.
> >
> > Maintainers: please do not merge, this is a dog-and-pony show to attract
> > developer attention.
> >
> 
> [Add Dave to the list]
> 
> Hi, Darrick and all,
> 
> Recently, I have been considering implementing the XFS shrink feature based
> on the AF concept, which was mentioned in this link:
> 
> https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/
> 
> That posting stated:
> The rules used by AG are more about extending outwards.
> whilst
> The rules used by AF are more about restricting inwards.
> 
> So the AF concept implicitly and naturally carries the semantics of
> compressing/shrinking (restricting).
> 
> AG (for xfs extend) and AF (for xfs shrink) are constructed symmetrically,
> which makes it more elegant and easier to build more complex features on them.
> 
> To elaborate, AGs should not be seen as independent entities in the shrink
> context.  Otherwise each AG requires separate management (flags or something
> to indicate the state of that AG/region), which would increase the system
> complexity compared to the idea behind AF.  AF views several AGs as a whole.
> 
> And when it comes to growfs, things start to get a little more complicated,
> but AF can handle it easily and naturally.
> 
> However, talk is cheap.  To validate our point, we truly hope to have the
> opportunity to participate in developing these features by integrating the
> existing infrastructure you have already established with the AF concept.

Hmm, now that's interesting -- using the AF ("allocation fencing"?)
capability to constrain allocations to a subset of AGs, and then slowly
rewriting files and whatnot to migrate data to other AGs.  Eventually
you end up with an AG that's empty and therefore ready for shrink.

That's definitely a different way to do that than what I did (add a
"mapfree" ioctl to pin space to a file).  I'll ponder these 2 approaches
a bit more.

--D

> Best regards,
> Shida
> 
> 
> 
> > --D
> >
> > PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
> > patches.  And finally rebase fstests to 2024-12-08.
> >
> 


end of thread, other threads:[~2025-01-13  5:55 UTC | newest]

Thread overview: 110+ messages
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
2024-12-31 23:36   ` [PATCH 1/1] xfs: Don't free EOF blocks on close when extent size hints are set Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
2024-12-31 23:36   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
2024-12-31 23:36   ` [PATCH 2/5] xfs: whine to dmesg when we encounter errors Darrick J. Wong
2024-12-31 23:37   ` [PATCH 3/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
2024-12-31 23:37   ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
2024-12-31 23:37   ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
2024-12-31 23:37   ` [PATCH 1/1] xfs: export reference count " Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
2024-12-31 23:38   ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
2024-12-31 23:38   ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
2024-12-31 23:38   ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
2024-12-31 23:38   ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
2024-12-31 23:39   ` [PATCH 02/16] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-12-31 23:39   ` [PATCH 03/16] xfs: create a filesystem shutdown hook Darrick J. Wong
2024-12-31 23:39   ` [PATCH 04/16] xfs: create hooks for media errors Darrick J. Wong
2024-12-31 23:40   ` [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem Darrick J. Wong
2024-12-31 23:40   ` [PATCH 06/16] iomap: report directio read and write errors to callers Darrick J. Wong
2024-12-31 23:40   ` [PATCH 07/16] xfs: create file io error hooks Darrick J. Wong
2024-12-31 23:40   ` [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2024-12-31 23:41   ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2024-12-31 23:41   ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong
2024-12-31 23:41   ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong
2024-12-31 23:41   ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong
2024-12-31 23:42   ` [PATCH 13/16] xfs: report file io " Darrick J. Wong
2024-12-31 23:42   ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2024-12-31 23:42   ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
2024-12-31 23:43   ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
2024-12-31 23:43   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
2024-12-31 23:43   ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
2024-12-31 23:43   ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
2024-12-31 23:44   ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
2024-12-31 23:44   ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
2024-12-31 23:44   ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong
2024-12-31 23:44   ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
2024-12-31 23:45   ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
2024-12-31 23:45   ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong
2024-12-31 23:45   ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong
2024-12-31 23:45   ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong
2024-12-31 23:46   ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong
2024-12-31 23:46   ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong
2024-12-31 23:46   ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong
2024-12-31 23:46   ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong
2024-12-31 23:47   ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong
2024-12-31 23:47   ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong
2024-12-31 23:47   ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:47   ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-12-31 23:48   ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2024-12-31 23:48   ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2024-12-31 23:48   ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong
2024-12-31 23:49   ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 07/21] xfs: report file io " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong
2024-12-31 23:50   ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong
2024-12-31 23:50   ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong
2024-12-31 23:50   ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong
2024-12-31 23:50   ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong
2024-12-31 23:51   ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
2024-12-31 23:51   ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong
2024-12-31 23:51   ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong
2024-12-31 23:51   ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong
2024-12-31 23:52   ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong
2024-12-31 23:52   ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong
2024-12-31 23:52   ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong
2024-12-31 23:52   ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong
2024-12-31 23:53   ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
2024-12-31 23:53   ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
2024-12-31 23:53   ` [PATCH 02/10] xfs_repair: allow sysadmins to add reflink Darrick J. Wong
2024-12-31 23:53   ` [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes Darrick J. Wong
2024-12-31 23:54   ` [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers Darrick J. Wong
2024-12-31 23:54   ` [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories Darrick J. Wong
2024-12-31 23:54   ` [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir Darrick J. Wong
2024-12-31 23:55   ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong
2024-12-31 23:55   ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong
2024-12-31 23:55   ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong
2024-12-31 23:55   ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
2024-12-31 23:56   ` [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl Darrick J. Wong
2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
2024-12-31 23:56   ` [PATCH 1/1] xfs: test clearing of " Darrick J. Wong
2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
2024-12-31 23:56   ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong
2024-12-31 23:56   ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong
2025-01-06 11:18     ` Nirjhar Roy
2025-01-06 23:52       ` Darrick J. Wong
2025-01-13  5:55         ` Nirjhar Roy
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:57   ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
2024-12-31 23:57   ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong
2024-12-31 23:57   ` [PATCH 3/6] xfs: test health monitoring code Darrick J. Wong
2024-12-31 23:57   ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
2024-12-31 23:58   ` [PATCH 5/6] xfs: test io " Darrick J. Wong
2024-12-31 23:58   ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
2024-12-31 23:58   ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
2024-12-31 23:58   ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong
2024-12-31 23:59   ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong
2025-01-02  1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
2025-01-07  0:26   ` Darrick J. Wong
