* [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
From: Darrick J. Wong @ 2024-12-31 23:25 UTC
To: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
Christoph Hellwig
Cc: xfs, greg.marsden, shirley.ma, konrad.wilk, fstests

Hi everyone,

Thank you all for helping get online repair, parent pointers,
metadata directory trees, and realtime allocation groups merged this
year! We got a lot done in 2024.

Having sent pull requests to Carlos for the last pieces of the realtime
modernization project, I have exactly two worthwhile projects left in my
development trees! The stuff here isn't necessarily in mergeable state
yet, but I still believe everyone ought to know what I'm up to.

The first project implements (somewhat buggily; I never quite got back
to dealing with moving eof blocks) free space defragmentation so that we
can meaningfully shrink filesystems; garbage collect regions of the
filesystem; or prepare for large allocations. There's not much new
kernel code other than exporting refcounts and gaining the ability to
map free space.

The second project initiates filesystem self healing routines whenever
problems start to crop up, which means that it can run fully
autonomously in the background. The monitoring system uses some
pseudo-file and seqbuf tricks that I lifted from kmo last winter.

Both of these projects are largely userspace code.

Also I threw in some xfs_repair code to do dangerous fs upgrades.
Nobody should use these, ever.

Maintainers: please do not merge, this is a dog-and-pony show to attract
developer attention.

--D

PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
patches. And finally rebase fstests to 2024-12-08.

* [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior
From: Darrick J. Wong @ 2024-12-31 23:32 UTC
To: djwong, cem
Cc: dchinner, linux-xfs

Hi all,

Here's a few patches mostly from Dave to make XFS more aggressive about
keeping post-eof speculative preallocations when closing files.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reduce-eofblocks-gc-on-close
---
Commits in this patchset:
 * xfs: Don't free EOF blocks on close when extent size hints are set
---
 fs/xfs/xfs_bmap_util.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

* [PATCH 1/1] xfs: Don't free EOF blocks on close when extent size hints are set
From: Darrick J. Wong @ 2024-12-31 23:36 UTC
To: djwong, cem
Cc: dchinner, linux-xfs

From: Dave Chinner <david@fromorbit.com>

When we have a workload that does open/write/close on files with
extent size hints set in parallel with other allocation, the file
becomes rapidly fragmented. This is due to close() calling
xfs_release() and removing the preallocated extent beyond EOF. This
occurs for both buffered and direct writes that append to files with
extent size hints.

The existing open/write/close heuristic in xfs_release() does not
catch this as writes to files using extent size hints do not use
delayed allocation and hence do not leave delayed allocation blocks
allocated on the inode that can be detected in xfs_release(). Hence
XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release(), we can tell whether the inode has extent size
hints set and skip EOF block truncation. We add this check to
xfs_can_free_eofblocks() so that post-EOF preallocated extents are
treated like intentional preallocation and so persist unless directly
removed by userspace.

Before:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 1002
/mnt/scratch/file.1: 1002
/mnt/scratch/file.2: 1002
/mnt/scratch/file.3: 1002
/mnt/scratch/file.4: 1002
/mnt/scratch/file.5: 1002
/mnt/scratch/file.6: 1002
/mnt/scratch/file.7: 1002

After:

Test 2: Extent size hint fragmentation counts

/mnt/scratch/file.0: 4
/mnt/scratch/file.1: 4
/mnt/scratch/file.2: 4
/mnt/scratch/file.3: 4
/mnt/scratch/file.4: 4
/mnt/scratch/file.5: 4
/mnt/scratch/file.6: 4
/mnt/scratch/file.7: 4

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index b0096ff91000ce..783349f2361ad3 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -527,8 +527,9 @@ xfs_can_free_eofblocks(
 	 * Do not free real extents in preallocated files unless the file has
 	 * delalloc blocks and we are forced to remove them.
 	 */
-	if ((ip->i_diflags & XFS_DIFLAG_PREALLOC) && !ip->i_delayed_blks)
-		return false;
+	if (xfs_get_extsz_hint(ip) || (ip->i_diflags & XFS_DIFLAG_APPEND))
+		if (ip->i_delayed_blks == 0)
+			return false;
 
 	/*
 	 * Do not try to free post-EOF blocks if EOF is beyond the end of the
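
For readers who want to poke at the access pattern described in the commit
message, here is a minimal userspace sketch of one writer: it sets an extent
size hint on a fresh file and then appends through repeated open/write/close
cycles. It is not the test that produced the numbers above. The helper names
(set_extsize_hint, append_once, run_workload, file_size) are made up for
illustration, and the sketch assumes the generic FS_IOC_FSGETXATTR and
FS_IOC_FSSETXATTR ioctls from <linux/fs.h>. A single writer like this will
probably not fragment anything on its own, since the commit message says the
problem needs other allocations running in parallel.

```c
#include <fcntl.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR, FS_XFLAG_EXTSIZE */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Set an extent size hint (in bytes) on an open, still-empty file. */
static int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = bytes;
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

/* One open/append/close cycle; close() is where post-EOF blocks may go. */
static int append_once(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_APPEND);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, buf, len);
	if (close(fd) < 0)
		return -1;
	return ret == (ssize_t)len ? 0 : -1;
}

/* Create the file, try to set the hint, then run the append cycles. */
static int run_workload(const char *path, unsigned int hint, int cycles)
{
	char buf[4096];
	int fd, i;

	memset(buf, 0x5a, sizeof(buf));
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
	if (fd < 0)
		return -1;
	/* Filesystems without extent size hints reject this; carry on anyway. */
	(void)set_extsize_hint(fd, hint);
	close(fd);

	for (i = 0; i < cycles; i++)
		if (append_once(path, buf, sizeof(buf)) < 0)
			return -1;
	return 0;
}

/* Size of the file in bytes, or -1 if it cannot be stat'ed. */
static long long file_size(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0 ? (long long)st.st_size : -1;
}
```

Running several run_workload() instances against different files on the same
XFS mount and then counting extents per file (with filefrag, for example)
would be one way to approximate the before/after comparison.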

* [PATCHSET RFC 2/5] xfs: noalloc allocation groups
From: Darrick J. Wong @ 2024-12-31 23:32 UTC
To: djwong, cem
Cc: linux-xfs

Hi all,

This series creates a new NOALLOC flag for allocation groups that causes
the block and inode allocators to look elsewhere when trying to
allocate resources. This is either the first part of a patchset to
implement online shrinking (set noalloc on the last AGs, run fsr to move
the files and directories) or freeze-free rmapbt rebuilding (set
noalloc to prevent creation of new mappings, then hook deletion of old
mappings). This is still totally a research project.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=noalloc-ags

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=noalloc-ags
---
Commits in this patchset:
 * xfs: track deferred ops statistics
 * xfs: whine to dmesg when we encounter errors
 * xfs: create a noalloc mode for allocation groups
 * xfs: enable userspace to hide an AG from allocation
 * xfs: apply noalloc mode to inode allocations too
---
 fs/xfs/Kconfig              |   13 +++++
 fs/xfs/libxfs/xfs_ag.c      |  114 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ag.h      |    8 +++
 fs/xfs/libxfs/xfs_ag_resv.c |   27 +++++++++-
 fs/xfs/libxfs/xfs_defer.c   |   18 ++++++-
 fs/xfs/libxfs/xfs_fs.h      |    5 ++
 fs/xfs/libxfs/xfs_ialloc.c  |    3 +
 fs/xfs/scrub/btree.c        |   89 +++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/common.c       |  107 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h       |    1
 fs/xfs/scrub/dabtree.c      |   24 +++++++++
 fs/xfs/scrub/fscounters.c   |    3 +
 fs/xfs/scrub/inode.c        |    4 ++
 fs/xfs/scrub/scrub.c        |   40 +++++++++++++++
 fs/xfs/scrub/trace.c        |   22 ++++++++
 fs/xfs/scrub/trace.h        |    2 +
 fs/xfs/xfs_fsops.c          |   10 +++-
 fs/xfs/xfs_globals.c        |    5 ++
 fs/xfs/xfs_ioctl.c          |    4 +-
 fs/xfs/xfs_super.c          |    1
 fs/xfs/xfs_sysctl.h         |    1
 fs/xfs/xfs_sysfs.c          |   32 ++++++++++++
 fs/xfs/xfs_trace.h          |   65 +++++++++++++++++++++++++
 fs/xfs/xfs_trans.c          |    3 +
 fs/xfs/xfs_trans.h          |    7 +++
 25 files changed, 599 insertions(+), 9 deletions(-)

* [PATCH 1/5] xfs: track deferred ops statistics
From: Darrick J. Wong @ 2024-12-31 23:36 UTC
To: djwong, cem
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Track some basic statistics on how hard we're pushing the defer ops.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_defer.c |   18 +++++++++++++++++-
 fs/xfs/xfs_trace.h        |   19 +++++++++++++++++++
 fs/xfs/xfs_trans.c        |    3 +++
 fs/xfs/xfs_trans.h        |    7 +++++++
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 5b377cbbb1f7e0..236409a3333ea6 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -618,6 +618,8 @@ xfs_defer_finish_one(
 	/* Done with the dfp, free it. */
 	list_del(&dfp->dfp_list);
 	kmem_cache_free(xfs_defer_pending_cache, dfp);
+	tp->t_dfops_nr--;
+	tp->t_dfops_finished++;
 out:
 	if (ops->finish_cleanup)
 		ops->finish_cleanup(tp, state, error);
@@ -680,6 +682,9 @@ xfs_defer_finish_noroll(
 
 		list_splice_init(&(*tp)->t_dfops, &dop_pending);
 
+		(*tp)->t_dfops_nr_max = max((*tp)->t_dfops_nr,
+				(*tp)->t_dfops_nr_max);
+
 		if (has_intents < 0) {
 			error = has_intents;
 			goto out_shutdown;
@@ -721,6 +726,7 @@ xfs_defer_finish_noroll(
 	xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE);
 	trace_xfs_defer_finish_error(*tp, error);
 	xfs_defer_cancel_list((*tp)->t_mountp, &dop_pending);
+	(*tp)->t_dfops_nr = 0;
 	xfs_defer_cancel(*tp);
 	return error;
 }
@@ -768,6 +774,7 @@ xfs_defer_cancel(
 	trace_xfs_defer_cancel(tp, _RET_IP_);
 	xfs_defer_trans_abort(tp, &tp->t_dfops);
 	xfs_defer_cancel_list(mp, &tp->t_dfops);
+	tp->t_dfops_nr = 0;
 }
 
 /*
@@ -853,8 +860,10 @@ xfs_defer_add(
 	}
 
 	dfp = xfs_defer_find_last(tp, ops);
-	if (!dfp || !xfs_defer_can_append(dfp, ops))
+	if (!dfp || !xfs_defer_can_append(dfp, ops)) {
 		dfp = xfs_defer_alloc(&tp->t_dfops, ops);
+		tp->t_dfops_nr++;
+	}
 
 	xfs_defer_add_item(dfp, li);
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, li);
@@ -879,6 +888,7 @@ xfs_defer_add_barrier(
 		return;
 
 	xfs_defer_alloc(&tp->t_dfops, &xfs_barrier_defer_type);
+	tp->t_dfops_nr++;
 
 	trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL);
 }
@@ -939,6 +949,12 @@ xfs_defer_move(
 	struct xfs_trans	*stp)
 {
 	list_splice_init(&stp->t_dfops, &dtp->t_dfops);
+	dtp->t_dfops_nr += stp->t_dfops_nr;
+	dtp->t_dfops_nr_max = stp->t_dfops_nr_max;
+	dtp->t_dfops_finished = stp->t_dfops_finished;
+	stp->t_dfops_nr = 0;
+	stp->t_dfops_nr_max = 0;
+	stp->t_dfops_finished = 0;
 
 	/*
 	 * Low free space mode was historically controlled by a dfops field.
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8d86a1e038cd5c..0352f432421598 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2880,6 +2880,25 @@ TRACE_EVENT(xfs_btree_free_block,
 /* deferred ops */
 struct xfs_defer_pending;
 
+TRACE_EVENT(xfs_defer_stats,
+	TP_PROTO(struct xfs_trans *tp),
+	TP_ARGS(tp),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, max)
+		__field(unsigned int, finished)
+	),
+	TP_fast_assign(
+		__entry->dev = tp->t_mountp->m_super->s_dev;
+		__entry->max = tp->t_dfops_nr_max;
+		__entry->finished = tp->t_dfops_finished;
+	),
+	TP_printk("dev %d:%d max %u finished %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->max,
+		  __entry->finished)
+)
+
 DECLARE_EVENT_CLASS(xfs_defer_class,
 	TP_PROTO(struct xfs_trans *tp, unsigned long caller_ip),
 	TP_ARGS(tp, caller_ip),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f53f82456288e5..269cd4583a033d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -71,6 +71,9 @@ xfs_trans_free(
 	xfs_extent_busy_sort(&tp->t_busy);
 	xfs_extent_busy_clear(&tp->t_busy, false);
 
+	if (tp->t_dfops_finished > 0)
+		trace_xfs_defer_stats(tp);
+
 	trace_xfs_trans_free(tp, _RET_IP_);
 	xfs_trans_clear_context(tp);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 71c2e82e4dadff..cb037a669754eb 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -153,6 +153,13 @@ typedef struct xfs_trans {
 	struct list_head	t_busy;		/* list of busy extents */
 	struct list_head	t_dfops;	/* deferred operations */
 	unsigned long		t_pflags;	/* saved process flags state */
+
+	/* Count of deferred ops attached to transaction. */
+	unsigned int		t_dfops_nr;
+	/* Maximum t_dfops_nr seen in a loop. */
+	unsigned int		t_dfops_nr_max;
+	/* Number of dfops finished. */
+	unsigned int		t_dfops_finished;
 } xfs_trans_t;
 
 /*
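
The three counters in the patch above can be modelled outside the kernel to
see how they relate. The following is a toy userspace sketch, not kernel
code: struct dfops_stats and the dfops_* helpers are hypothetical stand-ins
for t_dfops_nr, t_dfops_nr_max and t_dfops_finished, and each helper mirrors
one place where the patch touches those fields.

```c
/* Toy stand-in for the three counters the patch adds to struct xfs_trans. */
struct dfops_stats {
	unsigned int nr;	/* deferred ops currently attached */
	unsigned int nr_max;	/* high-water mark of nr */
	unsigned int finished;	/* deferred ops completed so far */
};

/* A new pending item was allocated (cf. xfs_defer_add). */
static void dfops_add(struct dfops_stats *s)
{
	s->nr++;
}

/* Sample the high-water mark (cf. the loop in xfs_defer_finish_noroll). */
static void dfops_sample(struct dfops_stats *s)
{
	if (s->nr > s->nr_max)
		s->nr_max = s->nr;
}

/* One pending item was finished and freed (cf. xfs_defer_finish_one). */
static void dfops_finish_one(struct dfops_stats *s)
{
	s->nr--;
	s->finished++;
}

/* Everything still pending was thrown away (cf. xfs_defer_cancel). */
static void dfops_cancel(struct dfops_stats *s)
{
	s->nr = 0;
}
```

In this model the high-water mark only moves when dfops_sample() runs, which
matches the "seen in a loop" wording in the xfs_trans.h comment: a burst of
additions that is cancelled before the next sample never shows up in nr_max.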

* [PATCH 2/5] xfs: whine to dmesg when we encounter errors
From: Darrick J. Wong @ 2024-12-31 23:36 UTC
To: djwong, cem
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Forward everything scrub whines about to dmesg.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/Kconfig         |   13 ++++++
 fs/xfs/scrub/btree.c   |   89 +++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/common.c  |  107 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.h  |    1
 fs/xfs/scrub/dabtree.c |   24 +++++++++++
 fs/xfs/scrub/inode.c   |    4 ++
 fs/xfs/scrub/scrub.c   |   40 ++++++++++++++++++
 fs/xfs/scrub/trace.c   |   22 ++++++++++
 fs/xfs/scrub/trace.h   |    2 +
 fs/xfs/xfs_globals.c   |    5 ++
 fs/xfs/xfs_sysctl.h    |    1
 fs/xfs/xfs_sysfs.c     |   32 ++++++++++++++
 12 files changed, 338 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index fffd6fffdce0f0..5700bc671a0e92 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -172,6 +172,19 @@ config XFS_ONLINE_SCRUB_STATS
 
 	  If unsure, say N.
 
+config XFS_ONLINE_SCRUB_WHINE
+	bool "XFS online metadata verbose logging by default"
+	default n
+	depends on XFS_ONLINE_SCRUB
+	help
+	  If you say Y here, the kernel will by default log the outcomes of all
+	  scrub and repair operations, as well as any corruptions found.  This
+	  may slow down scrub due to printk logging overhead timers.
+
+	  This value can be changed by editing /sys/fs/xfs/debug/scrub_whine
+
+	  If unsure, say N.
+
 config XFS_ONLINE_REPAIR
 	bool "XFS online metadata repair support"
 	default n
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index fe678a0438bc5c..e455eef892faec 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -11,6 +11,8 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_btree.h"
+#include "xfs_log_format.h"
+#include "xfs_ag.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/btree.h"
@@ -18,6 +20,62 @@
 
 /* btree scrubbing */
 
+/* Figure out which block the btree cursor was pointing to. */
+static inline xfs_fsblock_t
+xchk_btree_cur_fsbno(
+	struct xfs_btree_cur	*cur,
+	int			level)
+{
+	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
+		return XFS_DADDR_TO_FSB(cur->bc_mp,
+				xfs_buf_daddr(cur->bc_levels[level].bp));
+	else if (level == cur->bc_nlevels - 1 &&
+		 cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
+		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
+	else if (cur->bc_group)
+		return xfs_gbno_to_fsb(cur->bc_group, 0);
+	return NULLFSBLOCK;
+}
+
+static inline void
+process_error_whine(
+	struct xfs_scrub	*sc,
+	struct xfs_btree_cur	*cur,
+	int			level,
+	int			*error,
+	__u32			errflag,
+	void			*ret_ip)
+{
+	xfs_fsblock_t		fsbno = xchk_btree_cur_fsbno(cur, level);
+
+	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE) {
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+				cur->bc_ino.ip->i_ino,
+				cur->bc_ino.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				*error,
+				errflag,
+				ret_ip);
+		return;
+	}
+
+	xchk_whine(sc->mp, "type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			cur->bc_ops->name,
+			level,
+			cur->bc_levels[level].ptr,
+			XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+			*error,
+			errflag,
+			ret_ip);
+}
+
 /*
  * Check for btree operation errors.  See the section about handling
  * operational errors in common.c.
@@ -44,9 +102,13 @@ __xchk_btree_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		process_error_whine(sc, cur, level, error, errflag, ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			process_error_whine(sc, cur, level, error, errflag,
+					ret_ip);
 		if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
 			trace_xchk_ifork_btree_op_error(sc, cur, level,
 					*error, ret_ip);
@@ -91,12 +153,35 @@ __xchk_btree_set_corrupt(
 {
 	sc->sm->sm_flags |= errflag;
 
-	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE)
+	if (cur->bc_ops->type == XFS_BTREE_TYPE_INODE) {
+		xfs_fsblock_t	fsbno = xchk_btree_cur_fsbno(cur, level);
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x errflag 0x%x ret_ip %pS",
+				cur->bc_ino.ip->i_ino,
+				cur->bc_ino.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				errflag,
+				ret_ip);
 		trace_xchk_ifork_btree_error(sc, cur, level,
 				ret_ip);
-	else
+	} else {
+		xfs_fsblock_t	fsbno = xchk_btree_cur_fsbno(cur, level);
+		xchk_whine(sc->mp, "type %s %sbt level %d ptr %d agno 0x%x agbno 0x%x errflag 0x%x ret_ip %pS",
+				xchk_type_string(sc->sm->sm_type),
+				cur->bc_ops->name,
+				level,
+				cur->bc_levels[level].ptr,
+				XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
+				XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
+				errflag,
+				ret_ip);
 		trace_xchk_btree_error(sc, cur, level,
 				ret_ip);
+	}
 }
 
 void
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 28ad341df8eede..59c368c54a23f6 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -105,9 +105,23 @@ __xchk_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x error %d errflag 0x%x ret_ip %pS",
+				xchk_type_string(sc->sm->sm_type),
+				agno,
+				bno,
+				*error,
+				errflag,
+				ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+					xchk_type_string(sc->sm->sm_type),
+					agno,
+					bno,
+					*error,
+					ret_ip);
 		trace_xchk_op_error(sc, agno, bno, *error, ret_ip);
 		break;
 	}
@@ -179,9 +193,25 @@ __xchk_fblock_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= errflag;
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu error %d errflag 0x%x ret_ip %pS",
+				sc->ip->i_ino,
+				whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				offset,
+				*error,
+				errflag,
+				ret_ip);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu error %d ret_ip %pS",
+					sc->ip->i_ino,
+					whichfork,
+					xchk_type_string(sc->sm->sm_type),
+					offset,
+					*error,
+					ret_ip);
 		trace_xchk_file_op_error(sc, whichfork, offset, *error,
 				ret_ip);
 		break;
@@ -253,6 +283,8 @@ xchk_set_corrupt(
 	struct xfs_scrub	*sc)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "type %s ret_ip %pS", xchk_type_string(sc->sm->sm_type),
+			__return_address);
 	trace_xchk_fs_error(sc, 0, __return_address);
 }
 
@@ -264,6 +296,11 @@ xchk_block_set_corrupt(
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), __return_address);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			__return_address);
 }
 
 #ifdef CONFIG_XFS_QUOTA
@@ -275,6 +312,8 @@ xchk_qcheck_set_corrupt(
 	xfs_dqid_t		id)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "type %s dqtype %u id %u ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), dqtype, id, __return_address);
 	trace_xchk_qcheck_error(sc, dqtype, id, __return_address);
 }
 #endif
@@ -287,6 +326,11 @@ xchk_block_xref_set_corrupt(
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), __return_address);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			__return_address);
 }
 
 /*
@@ -300,6 +344,8 @@ xchk_ino_set_corrupt(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_error(sc, ino, __return_address);
 }
 
@@ -310,6 +356,8 @@ xchk_ino_xref_set_corrupt(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_error(sc, ino, __return_address);
 }
 
@@ -321,6 +369,12 @@ xchk_fblock_set_corrupt(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_error(sc, whichfork, offset, __return_address);
 }
 
@@ -332,6 +386,12 @@ xchk_fblock_xref_set_corrupt(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XCORRUPT;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_error(sc, whichfork, offset, __return_address);
 }
 
@@ -345,6 +405,8 @@ xchk_ino_set_warning(
 	xfs_ino_t		ino)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	xchk_whine(sc->mp, "ino 0x%llx type %s ret_ip %pS",
+			ino, xchk_type_string(sc->sm->sm_type), __return_address);
 	trace_xchk_ino_warning(sc, ino, __return_address);
 }
 
@@ -356,6 +418,12 @@ xchk_fblock_set_warning(
 	xfs_fileoff_t		offset)
 {
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_WARNING;
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s offset %llu ret_ip %pS",
+			sc->ip->i_ino,
+			whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			offset,
+			__return_address);
 	trace_xchk_fblock_warning(sc, whichfork, offset, __return_address);
 }
 
@@ -1219,6 +1287,10 @@ xchk_iget_for_scrubbing(
 out_cancel:
 	xchk_trans_cancel(sc);
 out_error:
+	xchk_whine(mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), agno,
+			XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), error,
+			__return_address);
 	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
 			error, __return_address);
 	return error;
@@ -1352,6 +1424,10 @@ xchk_should_check_xref(
 	}
 
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_XFAIL;
+	xchk_whine(sc->mp, "type %s xref error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			*error,
+			__return_address);
 	trace_xchk_xref_error(sc, *error, __return_address);
 
 	/*
@@ -1383,6 +1459,11 @@ xchk_buffer_recheck(
 		return;
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 	trace_xchk_block_error(sc, xfs_buf_daddr(bp), fa);
+	xchk_whine(sc->mp, "type %s agno 0x%x agbno 0x%x ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type),
+			xfs_daddr_to_agno(sc->mp, xfs_buf_daddr(bp)),
+			xfs_daddr_to_agbno(sc->mp, xfs_buf_daddr(bp)),
+			fa);
 }
 
 static inline int
@@ -1735,3 +1816,29 @@ xchk_inode_count_blocks(
 	return xfs_bmap_count_blocks(sc->tp, sc->ip, whichfork, nextents,
 			count);
 }
+
+/* Complain about failures... */
+void
+xchk_whine(
+	const struct xfs_mount	*mp,
+	const char		*fmt,
+	...)
+{
+	struct va_format	vaf;
+	va_list			args;
+
+	if (!xfs_globals.scrub_whine)
+		return;
+
+	va_start(args, fmt);
+
+	vaf.fmt = fmt;
+	vaf.va = &args;
+
+	printk(KERN_INFO "XFS (%s) %pS: %pV\n", mp->m_super->s_id,
+			__return_address, &vaf);
+	va_end(args);
+
+	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
+		xfs_stack_trace();
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index bdcd40f0ec742c..4dc408b530153a 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -179,6 +179,7 @@ bool xchk_ilock_nowait(struct xfs_scrub *sc, unsigned int ilock_flags);
 void xchk_iunlock(struct xfs_scrub *sc, unsigned int ilock_flags);
 
 void xchk_buffer_recheck(struct xfs_scrub *sc, struct xfs_buf *bp);
+void xchk_whine(const struct xfs_mount *mp, const char *fmt, ...);
 
 /*
  * Grab the inode at @inum.  The caller must have created a scrub transaction
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index 056de4819f866d..ae64db9f0bba2b 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -47,9 +47,26 @@ xchk_da_process_error(
 	case -EFSCORRUPTED:
 		/* Note the badness but don't abort. */
 		sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+		xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx error %d ret_ip %pS",
+				sc->ip->i_ino,
+				ds->dargs.whichfork,
+				xchk_type_string(sc->sm->sm_type),
+				xfs_dir2_da_to_db(ds->dargs.geo,
+					ds->state->path.blk[level].blkno),
+				*error,
+				__return_address);
 		*error = 0;
 		fallthrough;
 	default:
+		if (*error)
+			xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx error %d ret_ip %pS",
+					sc->ip->i_ino,
+					ds->dargs.whichfork,
+					xchk_type_string(sc->sm->sm_type),
+					xfs_dir2_da_to_db(ds->dargs.geo,
+						ds->state->path.blk[level].blkno),
+					*error,
+					__return_address);
 		trace_xchk_file_op_error(sc, ds->dargs.whichfork,
 				xfs_dir2_da_to_db(ds->dargs.geo,
 					ds->state->path.blk[level].blkno),
@@ -72,6 +89,13 @@ xchk_da_set_corrupt(
 
 	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
+	xchk_whine(sc->mp, "ino 0x%llx fork %d type %s dablk 0x%llx ret_ip %pS",
+			sc->ip->i_ino,
+			ds->dargs.whichfork,
+			xchk_type_string(sc->sm->sm_type),
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
 	trace_xchk_fblock_error(sc, ds->dargs.whichfork,
 			xfs_dir2_da_to_db(ds->dargs.geo,
 				ds->state->path.blk[level].blkno),
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index bb3f475b63532e..a93f63b6b518ff 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -218,6 +218,10 @@ xchk_setup_inode(
 out_cancel:
 	xchk_trans_cancel(sc);
 out_error:
+	xchk_whine(mp, "type %s agno 0x%x agbno 0x%x error %d ret_ip %pS",
+			xchk_type_string(sc->sm->sm_type), agno,
+			XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino), error,
+			__return_address);
 	trace_xchk_op_error(sc, agno, XFS_INO_TO_AGBNO(mp, sc->sm->sm_ino),
 			error, __return_address);
 	return error;
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1a05c27ba47197..d3a4ddd918f621 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -639,6 +639,45 @@ xchk_scrub_create_subord(
 	return sub;
 }
 
+static inline void
+repair_outcomes(struct xfs_scrub *sc, int error)
+{
+	struct xfs_scrub_metadata *sm = sc->sm;
+	const char *wut = NULL;
+
+	if (!xfs_globals.scrub_whine)
+		return;
+
+	if (sc->flags & XREP_ALREADY_FIXED) {
+		wut = "*** REPAIR SUCCESS";
+		error = 0;
+	} else if (error == -EBUSY) {
+		wut = "??? FILESYSTEM BUSY";
+	} else if (error == -EAGAIN) {
+		wut = "??? REPAIR DEFERRED";
+	} else if (error == -ECANCELED) {
+		wut = "??? REPAIR CANCELLED";
+	} else if (error == -EINTR) {
+		wut = "??? REPAIR INTERRUPTED";
+	} else if (error != -EOPNOTSUPP && error != -ENOENT) {
+		wut = "!!! REPAIR FAILED";
+		xfs_info(sc->mp,
+"%s ino 0x%llx type %s agno 0x%x inum 0x%llx gen 0x%x flags 0x%x error %d",
+				wut, XFS_I(file_inode(sc->file))->i_ino,
+				xchk_type_string(sm->sm_type), sm->sm_agno,
+				sm->sm_ino, sm->sm_gen, sm->sm_flags, error);
+		return;
+	} else {
+		return;
+	}
+
+	xfs_info_ratelimited(sc->mp,
+"%s ino 0x%llx type %s agno 0x%x inum 0x%llx gen 0x%x flags 0x%x error %d",
+			wut, XFS_I(file_inode(sc->file))->i_ino,
+			xchk_type_string(sm->sm_type), sm->sm_agno, sm->sm_ino,
+			sm->sm_gen, sm->sm_flags, error);
+}
+
 /* Dispatch metadata scrubbing. */
 STATIC int
 xfs_scrub_metadata(
@@ -735,6 +774,7 @@ xfs_scrub_metadata(
 		 * already tried to fix it, then attempt a repair.
 		 */
 		error = xrep_attempt(sc, &run);
+		repair_outcomes(sc, error);
 		if (error == -EAGAIN) {
 			/*
 			 * Either the repair function succeeded or it couldn't
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 2450e214103fed..4ea790e4063df7 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -58,3 +58,25 @@ xchk_btree_cur_fsbno(
  */
 #define CREATE_TRACE_POINTS
 #include "scrub/trace.h"
+
+/* xchk_whine stuff */
+struct xchk_tstr {
+	unsigned int	type;
+	const char	*tag;
+};
+
+static const struct xchk_tstr xchk_tstr_tags[] = { XFS_SCRUB_TYPE_STRINGS };
+
+const char *
+xchk_type_string(
+	unsigned int	type)
+{
+	unsigned int	i;
+
+	for (i = 0; i < ARRAY_SIZE(xchk_tstr_tags); i++) {
+		if (xchk_tstr_tags[i].type == type)
+			return xchk_tstr_tags[i].tag;
+	}
+
+	return "???";
+}
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d7c4ced47c1567..69d9b0a336dbc5 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -115,6 +115,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_RTREFCBT);
 	{ XFS_SCRUB_TYPE_RTRMAPBT,	"rtrmapbt" }, \
 	{ XFS_SCRUB_TYPE_RTREFCBT,	"rtrefcountbt" }
 
+const char *xchk_type_string(unsigned int type);
+
 #define XFS_SCRUB_FLAG_STRINGS \
 	{ XFS_SCRUB_IFLAG_REPAIR,		"repair" }, \
 	{ XFS_SCRUB_OFLAG_CORRUPT,		"corrupt" }, \
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index f18fec0adf6662..f5fe896b9a8ec0 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -44,6 +44,11 @@ struct xfs_globals xfs_globals = {
 	.pwork_threads		=	-1,	/* automatic thread detection */
 	.larp			=	false,	/* log attribute replay */
 #endif
+#ifdef CONFIG_XFS_ONLINE_SCRUB_WHINE
+	.scrub_whine		=	true,
+#else
+	.scrub_whine		=	false,
+#endif
 
 	/*
 	 * Leave this many record slots empty when bulk loading btrees.  By
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index 276696a07040c8..b0939ac370fba1 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -91,6 +91,7 @@ struct xfs_globals {
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
 	bool	always_cow;		/* use COW fork for all overwrites */
+	bool	scrub_whine;		/* noisier output from scrub */
 };
 extern struct xfs_globals	xfs_globals;
 
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index 60cb5318fdae3c..0ce31517e3cd89 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -260,6 +260,37 @@ larp_show(
 }
 XFS_SYSFS_ATTR_RW(larp);
 
+/* Logging of the outcomes of everything that scrub does */
+STATIC ssize_t
+scrub_whine_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < -1 || val > num_possible_cpus())
+		return -EINVAL;
+
+	xfs_globals.scrub_whine = val;
+
+	return count;
+}
+
+STATIC ssize_t
+scrub_whine_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return sysfs_emit(buf, "%d\n", xfs_globals.scrub_whine);
+}
+XFS_SYSFS_ATTR_RW(scrub_whine);
+
 STATIC ssize_t
 bload_leaf_slack_store(
 	struct kobject	*kobject,
@@ -319,6 +350,7 @@ static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(always_cow),
 	ATTR_LIST(pwork_threads),
 	ATTR_LIST(larp),
+	ATTR_LIST(scrub_whine),
 	ATTR_LIST(bload_leaf_slack),
 	ATTR_LIST(bload_node_slack),
 	NULL,
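
The logging pattern in this patch (a global on/off switch, a variadic
formatter, and a table lookup that turns a scrub type number into a short
tag) can be tried in userspace. The sketch below is only an approximation
under stated assumptions: whine_enabled, whine(), type_tags and type_string()
are hypothetical names, the type numbers 1 and 2 are invented, and vsnprintf()
stands in for the kernel's printk() with %pV, which has no userspace
equivalent.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for xfs_globals.scrub_whine. */
static bool whine_enabled;

/*
 * Format a complaint into buf when logging is enabled.  Returns the number
 * of characters the full message needs, 0 if logging is switched off.
 */
static int whine(char *buf, size_t len, const char *fsname,
		 const char *fmt, ...)
{
	va_list args;
	int off, n;

	if (!whine_enabled)
		return 0;

	off = snprintf(buf, len, "XFS (%s): ", fsname);
	if (off < 0 || (size_t)off >= len)
		return off;

	va_start(args, fmt);
	n = vsnprintf(buf + off, len - off, fmt, args);
	va_end(args);
	return n < 0 ? n : off + n;
}

/* Type-to-tag table; the numeric values here are made up for the demo. */
struct type_tag {
	unsigned int	type;
	const char	*tag;
};

static const struct type_tag type_tags[] = {
	{ 1, "rtrmapbt" },
	{ 2, "rtrefcountbt" },
};

/* Linear search, falling back to "???" for unknown types. */
static const char *type_string(unsigned int type)
{
	size_t i;

	for (i = 0; i < sizeof(type_tags) / sizeof(type_tags[0]); i++) {
		if (type_tags[i].type == type)
			return type_tags[i].tag;
	}
	return "???";
}
```

A caller would flip whine_enabled, much as writing to the sysfs knob does in
the patch, and then call something like whine(buf, sizeof(buf), "sda1",
"type %s agno 0x%x", type_string(1), 2u). Checking the switch before touching
the va_list keeps the disabled path cheap.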
* [PATCH 3/5] xfs: create a noalloc mode for allocation groups
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
2024-12-31 23:36 ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
2024-12-31 23:36 ` [PATCH 2/5] xfs: whine to dmesg when we encounter errors Darrick J. Wong
@ 2024-12-31 23:37 ` Darrick J. Wong
2024-12-31 23:37 ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
2024-12-31 23:37 ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG.  We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.

Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c      |   60 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_ag.h      |    8 ++++++
 fs/xfs/libxfs/xfs_ag_resv.c |   27 +++++++++++++++++--
 fs/xfs/scrub/fscounters.c   |    3 +-
 fs/xfs/xfs_fsops.c          |   10 ++++++-
 fs/xfs/xfs_super.c          |    1 +
 fs/xfs/xfs_trace.h          |   46 +++++++++++++++++++++++++++++++++
 7 files changed, 150 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index b59cb461e096ea..1e65cd981afd49 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -976,3 +976,63 @@ xfs_ag_get_geometry(
 	xfs_buf_relse(agi_bp);
 	return error;
 }
+
+/* How many blocks does this AG contribute to fdblocks? */
+xfs_extlen_t
+xfs_ag_fdblocks(
+	struct xfs_perag	*pag)
+{
+	xfs_extlen_t		ret;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+
+	ret = pag->pagf_freeblks + pag->pagf_flcount + pag->pagf_btreeblks;
+	ret -= pag->pag_meta_resv.ar_reserved;
+	ret -= pag->pag_rmapbt_resv.ar_orig_reserved;
+	return ret;
+}
+
+/*
+ * Hide all the free space in this AG.  Caller must hold both the AGI and the
+ * AGF buffers or have otherwise prevented concurrent access.
+ */
+int
+xfs_ag_set_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+	int			error;
+
+	ASSERT(xfs_perag_initialised_agf(pag));
+	ASSERT(xfs_perag_initialised_agi(pag));
+
+	if (xfs_perag_prohibits_alloc(pag))
+		return 0;
+
+	error = xfs_dec_fdblocks(mp, xfs_ag_fdblocks(pag), false);
+	if (error)
+		return error;
+
+	trace_xfs_ag_set_noalloc(pag);
+	set_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+	return 0;
+}
+
+/*
+ * Unhide all the free space in this AG.  Caller must hold both the AGI and
+ * the AGF buffers or have otherwise prevented concurrent access.
+ */
+void
+xfs_ag_clear_noalloc(
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = pag_mount(pag);
+
+	if (!xfs_perag_prohibits_alloc(pag))
+		return;
+
+	xfs_add_fdblocks(mp, xfs_ag_fdblocks(pag));
+
+	trace_xfs_ag_clear_noalloc(pag);
+	clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
+}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 1f24cfa2732172..e8fae59206d929 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -120,6 +120,7 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag)
 #define XFS_AGSTATE_PREFERS_METADATA	2
 #define XFS_AGSTATE_ALLOWS_INODES	3
 #define XFS_AGSTATE_AGFL_NEEDS_RESET	4
+#define XFS_AGSTATE_NOALLOC		5
 
 #define __XFS_AG_OPSTATE(name, NAME) \
 static inline bool xfs_perag_ ## name (struct xfs_perag *pag) \
@@ -132,6 +133,7 @@ __XFS_AG_OPSTATE(initialised_agi, AGI_INIT)
 __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
 __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
 __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
+__XFS_AG_OPSTATE(prohibits_alloc, NOALLOC)
 
 int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t orig_agcount,
 		xfs_agnumber_t new_agcount, xfs_rfsblock_t dcount,
@@ -164,6 +166,7 @@ xfs_perag_put(
 	xfs_group_put(pag_group(pag));
 }
 
+
 /* Active AG references */
 static inline struct xfs_perag *
 xfs_perag_grab(
@@ -208,6 +211,11 @@ xfs_perag_next(
 	return xfs_perag_next_from(mp, pag, 0);
 }
 
+/* Enable or disable allocation from an AG */
+xfs_extlen_t xfs_ag_fdblocks(struct xfs_perag *pag);
+int xfs_ag_set_noalloc(struct xfs_perag *pag);
+void xfs_ag_clear_noalloc(struct xfs_perag *pag);
+
 /*
  * Per-ag geometry infomation and validation
  */
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index fb79215a509d21..fda3d7614838e7 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -74,6 +74,13 @@ xfs_ag_resv_critical(
 	xfs_extlen_t			avail;
 	xfs_extlen_t			orig;
 
+	/*
+	 * Pretend we're critically low on reservations in this AG to scare
+	 * everyone else away.
+	 */
+	if (xfs_perag_prohibits_alloc(pag))
+		return true;
+
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
 		avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
@@ -116,7 +123,12 @@ xfs_ag_resv_needed(
 		break;
 	case XFS_AG_RESV_METAFILE:
 	case XFS_AG_RESV_NONE:
-		/* empty */
+		/*
+		 * In noalloc mode, we pretend that all the free blocks in this
+		 * AG have been allocated.  Make this AG look full.
+		 */
+		if (xfs_perag_prohibits_alloc(pag))
+			len += xfs_ag_fdblocks(pag);
 		break;
 	default:
 		ASSERT(0);
@@ -344,6 +356,8 @@ xfs_ag_resv_alloc_extent(
 	xfs_extlen_t			len;
 	uint				field;
 
+	ASSERT(type != XFS_AG_RESV_NONE || !xfs_perag_prohibits_alloc(pag));
+
 	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
 
 	switch (type) {
@@ -401,7 +415,14 @@ xfs_ag_resv_free_extent(
 		ASSERT(0);
 		fallthrough;
 	case XFS_AG_RESV_NONE:
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+		/*
+		 * Normally we put freed blocks back into fdblocks.  In noalloc
+		 * mode, however, we pretend that there are no fdblocks in the
+		 * AG, so don't put them back.
+		 */
+		if (!xfs_perag_prohibits_alloc(pag))
+			xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS,
+					(int64_t)len);
 		fallthrough;
 	case XFS_AG_RESV_IGNORE:
 		return;
@@ -414,6 +435,6 @@ xfs_ag_resv_free_extent(
 	/* Freeing into the reserved pool only requires on-disk update... */
 	xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
 	/* ...but freeing beyond that requires in-core and on-disk update. */
-	if (len > leftover)
+	if (len > leftover && !xfs_perag_prohibits_alloc(pag))
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover);
 }
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index f7258544848fcd..af69ed7733acd6 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -337,7 +337,8 @@ xchk_fscount_aggregate_agcounts(
 		 */
 		fsc->fdblocks -= pag->pag_meta_resv.ar_reserved;
 		fsc->fdblocks -= pag->pag_rmapbt_resv.ar_orig_reserved;
-
+		if (xfs_perag_prohibits_alloc(pag))
+			fsc->fdblocks -= xfs_ag_fdblocks(pag);
 	}
 	if (pag)
 		xfs_perag_rele(pag);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 8dc2b738c911ee..150979c8333530 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -592,6 +592,14 @@ xfs_fs_unreserve_ag_blocks(
 	if (xfs_has_realtime(mp))
 		xfs_rt_resv_free(mp);
 
-	while ((pag = xfs_perag_next(mp, pag)))
+	while ((pag = xfs_perag_next(mp, pag))) {
+		/*
+		 * Bring the AG back online because our AG hiding only exists
+		 * in-core and we need the superblock to be written out with
+		 * the super fdblocks reflecting the AGF freeblks.  Do this
+		 * before adding the per-AG reservations back to fdblocks.
+		 */
+		xfs_ag_clear_noalloc(pag);
 		xfs_ag_resv_free(pag);
+	}
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e1554f061376e5..099c30339e8f9d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -336,6 +336,7 @@ xfs_set_inode_alloc(
 		pag = xfs_perag_get(mp, index);
 		if (xfs_set_inode_alloc_perag(pag, ino, max_metadata))
 			maxagi++;
+		clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate);
 		xfs_perag_put(pag);
 	}
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0352f432421598..dc7ffc8f8e9dea 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4589,6 +4589,52 @@ DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_corrupt);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_unfixed_corruption);
 
+DECLARE_EVENT_CLASS(xfs_ag_noalloc_class,
+	TP_PROTO(struct xfs_perag *pag),
+	TP_ARGS(pag),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_extlen_t, freeblks)
+		__field(xfs_extlen_t, flcount)
+		__field(xfs_extlen_t, btreeblks)
+		__field(xfs_extlen_t, meta_resv)
+		__field(xfs_extlen_t, rmap_resv)
+
+		__field(unsigned long long, resblks)
+		__field(unsigned long long, resblks_avail)
+	),
+	TP_fast_assign(
+		__entry->dev = pag_mount(pag)->m_super->s_dev;
+		__entry->agno = pag_agno(pag);
+		__entry->freeblks = pag->pagf_freeblks;
+		__entry->flcount = pag->pagf_flcount;
+		__entry->btreeblks = pag->pagf_btreeblks;
+		__entry->meta_resv = pag->pag_meta_resv.ar_reserved;
+		__entry->rmap_resv = pag->pag_rmapbt_resv.ar_orig_reserved;
+
+		__entry->resblks = pag_mount(pag)->m_resblks[XC_FREE_BLOCKS].total;
+		__entry->resblks_avail = pag_mount(pag)->m_resblks[XC_FREE_BLOCKS].avail;
+	),
+	TP_printk("dev %d:%d agno 0x%x freeblks %u flcount %u btreeblks %u metaresv %u rmapresv %u resblks %llu resblks_avail %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->agno,
+		  __entry->freeblks,
+		  __entry->flcount,
+		  __entry->btreeblks,
+		  __entry->meta_resv,
+		  __entry->rmap_resv,
+		  __entry->resblks,
+		  __entry->resblks_avail)
+);
+#define DEFINE_AG_NOALLOC_EVENT(name)	\
+DEFINE_EVENT(xfs_ag_noalloc_class, name,	\
+	TP_PROTO(struct xfs_perag *pag),	\
+	TP_ARGS(pag))
+
+DEFINE_AG_NOALLOC_EVENT(xfs_ag_set_noalloc);
+DEFINE_AG_NOALLOC_EVENT(xfs_ag_clear_noalloc);
+
 TRACE_EVENT(xfs_iwalk_ag_rec,
 	TP_PROTO(const struct xfs_perag *pag, \
 		 struct xfs_inobt_rec_incore *irec),

^ permalink raw reply related [flat|nested] 110+ messages in thread
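The fdblocks bookkeeping described in the commit message above can be modelled in a few lines of userspace C. This is only a sketch: every toy_* name is hypothetical and made up for illustration, there is no locking, the filesystem-wide counter is a plain integer rather than the kernel's percpu counter, and only frees to the unreserved pool are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-AG and per-mount counters. */
struct toy_ag {
	uint32_t freeblks;	/* free blocks recorded in the AGF */
	uint32_t flcount;	/* blocks sitting on the AGFL */
	uint32_t btreeblks;	/* blocks used by the free space btrees */
	uint32_t meta_resv;	/* metadata reservation */
	uint32_t rmap_resv;	/* rmapbt reservation */
	bool noalloc;		/* models XFS_AGSTATE_NOALLOC */
};

struct toy_fs {
	int64_t fdblocks;	/* filesystem-wide free block count */
};

/* How many blocks does this AG contribute to fdblocks? */
static uint32_t toy_ag_fdblocks(const struct toy_ag *ag)
{
	return ag->freeblks + ag->flcount + ag->btreeblks -
	       ag->meta_resv - ag->rmap_resv;
}

/* Hide the AG: take its contribution out of fdblocks.  -1 models ENOSPC. */
static int toy_set_noalloc(struct toy_fs *fs, struct toy_ag *ag)
{
	uint32_t hide = toy_ag_fdblocks(ag);

	if (ag->noalloc)
		return 0;
	if (fs->fdblocks < (int64_t)hide)
		return -1;
	fs->fdblocks -= hide;
	ag->noalloc = true;
	return 0;
}

/* Unhide the AG: give back whatever it contributes right now. */
static void toy_clear_noalloc(struct toy_fs *fs, struct toy_ag *ag)
{
	if (!ag->noalloc)
		return;
	fs->fdblocks += toy_ag_fdblocks(ag);
	ag->noalloc = false;
}

/* Free an extent; a hidden AG keeps the blocks out of fdblocks. */
static void toy_free_extent(struct toy_fs *fs, struct toy_ag *ag, uint32_t len)
{
	ag->freeblks += len;
	if (!ag->noalloc)
		fs->fdblocks += len;
}
```

With fdblocks at 1000 and an AG contributing 300 blocks, hiding the AG drops fdblocks to 700. Freeing 50 blocks in the hidden AG leaves fdblocks at 700, and clearing noalloc then restores 350 blocks for a total of 1050, so nothing is lost or double counted across the round trip.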
* [PATCH 4/5] xfs: enable userspace to hide an AG from allocation
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
` (2 preceding siblings ...)
2024-12-31 23:37 ` [PATCH 3/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
@ 2024-12-31 23:37 ` Darrick J. Wong
2024-12-31 23:37 ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an administrative interface so that userspace can hide an allocation
group from block allocation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h |    5 ++++
 fs/xfs/xfs_ioctl.c     |    4 +++-
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 1e65cd981afd49..c538a5bfb4e330 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -932,6 +932,54 @@ xfs_ag_extend_space(
 	return 0;
 }
 
+/* Compute the AG geometry flags. */
+static inline uint32_t
+xfs_ag_calc_geoflags(
+	struct xfs_perag	*pag)
+{
+	uint32_t		ret = 0;
+
+	if (xfs_perag_prohibits_alloc(pag))
+		ret |= XFS_AG_FLAG_NOALLOC;
+
+	return ret;
+}
+
+/*
+ * Compare the current AG geometry flags against the flags in the AG geometry
+ * structure and update the AG state to reflect any changes, then update the
+ * struct to reflect the current status.
+ */
+static inline int
+xfs_ag_update_geoflags(
+	struct xfs_perag	*pag,
+	struct xfs_ag_geometry	*ageo,
+	uint32_t		new_flags)
+{
+	uint32_t		old_flags = xfs_ag_calc_geoflags(pag);
+	int			error;
+
+	if (!(new_flags & XFS_AG_FLAG_UPDATE)) {
+		ageo->ag_flags = old_flags;
+		return 0;
+	}
+
+	if ((old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    !(new_flags & XFS_AG_FLAG_NOALLOC)) {
+		xfs_ag_clear_noalloc(pag);
+	}
+
+	if (!(old_flags & XFS_AG_FLAG_NOALLOC) &&
+	    (new_flags & XFS_AG_FLAG_NOALLOC)) {
+		error = xfs_ag_set_noalloc(pag);
+		if (error)
+			return error;
+	}
+
+	ageo->ag_flags = xfs_ag_calc_geoflags(pag);
+	return 0;
+}
+
 /* Retrieve AG geometry. */
 int
 xfs_ag_get_geometry(
@@ -943,6 +991,7 @@ xfs_ag_get_geometry(
 	struct xfs_agi		*agi;
 	struct xfs_agf		*agf;
 	unsigned int		freeblks;
+	uint32_t		inflags = ageo->ag_flags;
 	int			error;
 
 	/* Lock the AG headers. */
@@ -953,6 +1002,10 @@ xfs_ag_get_geometry(
 	if (error)
 		goto out_agi;
 
+	error = xfs_ag_update_geoflags(pag, ageo, inflags);
+	if (error)
+		goto out;
+
 	/* Fill out form. */
 	memset(ageo, 0, sizeof(*ageo));
 	ageo->ag_number = pag_agno(pag);
@@ -970,6 +1023,7 @@ xfs_ag_get_geometry(
 	ageo->ag_freeblks = freeblks;
 	xfs_ag_geom_health(pag, ageo);
 
+out:
 	/* Release resources. */
 	xfs_buf_relse(agf_bp);
 out_agi:
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 12463ba766da05..b391bf9de93dbf 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -307,6 +307,11 @@ struct xfs_ag_geometry {
 #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9)  /* reference counts */
 #define XFS_AG_GEOM_SICK_INODES	(1 << 10) /* bad inodes were seen */
 
+#define XFS_AG_FLAG_UPDATE	(1 << 0)  /* update flags */
+#define XFS_AG_FLAG_NOALLOC	(1 << 1)  /* do not allocate from this AG */
+#define XFS_AG_FLAG_ALL		(XFS_AG_FLAG_UPDATE | \
+				 XFS_AG_FLAG_NOALLOC)
+
 /*
  * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT
  */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d3cf62d81f0d17..874e2def3d6e63 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -385,10 +385,12 @@ xfs_ioc_ag_geometry(
 
 	if (copy_from_user(&ageo, arg, sizeof(ageo)))
 		return -EFAULT;
-	if (ageo.ag_flags)
+	if (ageo.ag_flags & ~XFS_AG_FLAG_ALL)
 		return -EINVAL;
 	if (memchr_inv(&ageo.ag_reserved, 0, sizeof(ageo.ag_reserved)))
 		return -EINVAL;
+	if ((ageo.ag_flags & XFS_AG_FLAG_UPDATE) && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
 
 	pag = xfs_perag_get(mp, ageo.ag_number);
 	if (!pag)

^ permalink raw reply related [flat|nested] 110+ messages in thread
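The flag handling that this patch adds to the AG geometry ioctl can be summarised with a small userspace model. It is a sketch under stated assumptions, not the kernel code: the TOY_* macros and toy_* functions are hypothetical names that mirror the XFS_AG_FLAG_* values in the diff, the privilege check is reduced to a boolean, and setting noalloc is assumed never to fail (the real xfs_ag_set_noalloc() can return an error).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of the XFS_AG_FLAG_* values in the diff above. */
#define TOY_AG_FLAG_UPDATE	(1U << 0)	/* caller wants to change state */
#define TOY_AG_FLAG_NOALLOC	(1U << 1)	/* AG is hidden from allocation */
#define TOY_AG_FLAG_ALL		(TOY_AG_FLAG_UPDATE | TOY_AG_FLAG_NOALLOC)

struct toy_ag {
	bool noalloc;
};

/* Report the current state as geometry flags. */
static uint32_t toy_calc_geoflags(const struct toy_ag *ag)
{
	return ag->noalloc ? TOY_AG_FLAG_NOALLOC : 0;
}

/*
 * Validate the caller's flags, optionally update the AG state, and report
 * the resulting flags.  Returns 0 or a negative errno.
 */
static int toy_ag_geometry(struct toy_ag *ag, uint32_t in_flags,
			   bool is_admin, uint32_t *out_flags)
{
	/* Unknown bits are rejected outright. */
	if (in_flags & ~TOY_AG_FLAG_ALL)
		return -EINVAL;

	/* Only a privileged caller may ask for a state change. */
	if ((in_flags & TOY_AG_FLAG_UPDATE) && !is_admin)
		return -EPERM;

	/* Without UPDATE this is a pure query; NOALLOC on input is ignored. */
	if (in_flags & TOY_AG_FLAG_UPDATE)
		ag->noalloc = (in_flags & TOY_AG_FLAG_NOALLOC) != 0;

	*out_flags = toy_calc_geoflags(ag);
	return 0;
}
```

The point of the separate UPDATE bit is that a caller who passes zero flags gets the old read-only behaviour, while passing UPDATE alone clears noalloc and passing UPDATE together with NOALLOC sets it.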
* [PATCH 5/5] xfs: apply noalloc mode to inode allocations too
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
` (3 preceding siblings ...)
2024-12-31 23:37 ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
@ 2024-12-31 23:37 ` Darrick J. Wong
4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Don't allow inode allocations from this group if it's marked noalloc.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 57513ba19d6a71..2d2f132d4d1773 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1107,6 +1107,7 @@ xfs_dialloc_ag_inobt(
 
 	ASSERT(xfs_perag_initialised_agi(pag));
 	ASSERT(xfs_perag_allows_inodes(pag));
+	ASSERT(!xfs_perag_prohibits_alloc(pag));
 	ASSERT(pag->pagi_freecount > 0);
 
  restart_pagno:
@@ -1735,6 +1736,8 @@ xfs_dialloc_good_ag(
 		return false;
 	if (!xfs_perag_allows_inodes(pag))
 		return false;
+	if (xfs_perag_prohibits_alloc(pag))
+		return false;
 
 	if (!xfs_perag_initialised_agi(pag)) {
 		error = xfs_ialloc_read_agi(pag, tp, 0, NULL);

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCHSET 3/5] xfs: report refcount information to userspace
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:32 ` Darrick J. Wong
2024-12-31 23:37 ` [PATCH 1/1] xfs: export reference count " Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
` (12 subsequent siblings)
15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:32 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

Hi all,

Create a new ioctl to report the number of owners of each disk block so
that reflink-aware defraggers can make better decisions about which
extents to target.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
 * xfs: export reference count information to userspace
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |   80 +++++
 fs/xfs/xfs_fsrefs.c    |  777 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsrefs.h    |   45 +++
 fs/xfs/xfs_ioctl.c     |    4 
 fs/xfs/xfs_trace.c     |    1 
 fs/xfs/xfs_trace.h     |  125 ++++++++
 7 files changed, 1033 insertions(+)
 create mode 100644 fs/xfs/xfs_fsrefs.c
 create mode 100644 fs/xfs/xfs_fsrefs.h

^ permalink raw reply [flat|nested] 110+ messages in thread
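The patch in this set documents a low-key/high-key paging protocol for the new ioctl in its xfs_fs.h header comment. A caller's loop might look roughly like the model below. This is a sketch under loud assumptions: it does not issue any ioctl, toy_query() is a hypothetical stand-in that serves records from an in-memory array, the toy_* structures only loosely mirror struct xfs_getfsrefs and struct xfs_getfsrefs_head, and the "sum the bytes that have more than one owner" policy is made up to give the loop something to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FCR_OF_LAST	(1U << 0)	/* models FCR_OF_LAST */
#define TOY_MAX_BATCH	8		/* toy limit on records per call */

/* Loosely mirrors struct xfs_getfsrefs; all fields are in bytes. */
struct toy_rec {
	uint64_t physical;	/* device offset of segment */
	uint64_t length;	/* length of segment */
	uint64_t owners;	/* number of owners */
	uint32_t flags;
};

/* Loosely mirrors struct xfs_getfsrefs_head. */
struct toy_head {
	uint32_t count;		/* capacity of recs[] (0 = just count) */
	uint32_t entries;	/* records filled in or counted (output) */
	struct toy_rec keys[2];	/* low and high keys */
	struct toy_rec recs[TOY_MAX_BATCH];
};

/* Hypothetical stand-in for the ioctl: serve records from a sorted array. */
static void toy_query(const struct toy_rec *data, size_t n,
		      struct toy_head *head)
{
	/* A nonzero low key length means "resume after this record". */
	uint64_t start = head->keys[0].physical + head->keys[0].length;
	uint64_t end = head->keys[1].physical;
	size_t i;

	head->entries = 0;
	for (i = 0; i < n; i++) {
		if (data[i].physical < start)
			continue;
		if (data[i].physical > end)
			break;
		if (head->count == 0) {		/* counting mode */
			head->entries++;
			continue;
		}
		if (head->entries >= head->count)
			break;			/* output array is full */
		head->recs[head->entries] = data[i];
		if (i == n - 1)
			head->recs[head->entries].flags |= TOY_FCR_OF_LAST;
		head->entries++;
	}
}

/* Walk the whole dataset in batches and total the shared bytes. */
static uint64_t toy_shared_bytes(const struct toy_rec *data, size_t n,
				 uint32_t batch)
{
	struct toy_head head = { .count = batch };
	uint64_t shared = 0;
	uint32_t i;

	assert(batch > 0 && batch <= TOY_MAX_BATCH);
	head.keys[1].physical = UINT64_MAX;	/* high key: no upper bound */

	for (;;) {
		const struct toy_rec *last;

		toy_query(data, n, &head);
		if (head.entries == 0)
			break;
		for (i = 0; i < head.entries; i++)
			if (head.recs[i].owners > 1)
				shared += head.recs[i].length;

		last = &head.recs[head.entries - 1];
		if (last->flags & TOY_FCR_OF_LAST)
			break;
		head.keys[0] = *last;	/* resume where this batch ended */
	}
	return shared;
}
```

The last step of the loop plays the same role as xfs_getfsrefs_advance() in the patch: the final record of a batch becomes the next low key, and a nonzero length in that key tells the query to start just past it.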
* [PATCH 1/1] xfs: export reference count information to userspace 2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong @ 2024-12-31 23:37 ` Darrick J. Wong 0 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:37 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Export refcount info to userspace so we can prototype a sharing-aware defrag/fs rearranging tool. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_fs.h | 80 +++++ fs/xfs/xfs_fsrefs.c | 777 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_fsrefs.h | 45 +++ fs/xfs/xfs_ioctl.c | 4 fs/xfs/xfs_trace.c | 1 fs/xfs/xfs_trace.h | 125 ++++++++ 7 files changed, 1033 insertions(+) create mode 100644 fs/xfs/xfs_fsrefs.c create mode 100644 fs/xfs/xfs_fsrefs.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 5bf501cf827172..4c59d43c77089e 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -85,6 +85,7 @@ xfs-y += xfs_aops.o \ xfs_filestream.o \ xfs_fsmap.o \ xfs_fsops.o \ + xfs_fsrefs.o \ xfs_globals.o \ xfs_handle.o \ xfs_health.o \ diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index b391bf9de93dbf..936f719236944f 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -1008,6 +1008,85 @@ struct xfs_rtgroup_geometry { #define XFS_RTGROUP_GEOM_SICK_RMAPBT (1U << 3) /* reverse mappings */ #define XFS_RTGROUP_GEOM_SICK_REFCNTBT (1U << 4) /* reference counts */ +/* + * Structure for XFS_IOC_GETFSREFCOUNTS. + * + * The memory layout for this call are the scalar values defined in struct + * xfs_getfsrefs_head, followed by two struct xfs_getfsrefs that describe + * the lower and upper bound of mappings to return, followed by an array + * of struct xfs_getfsrefs mappings. + * + * fch_iflags control the output of the call, whereas fch_oflags report + * on the overall record output. 
fch_count should be set to the length + * of the fch_recs array, and fch_entries will be set to the number of + * entries filled out during each call. If fch_count is zero, the number + * of refcount mappings will be returned in fch_entries, though no + * mappings will be returned. fch_reserved must be set to zero. + * + * The two elements in the fch_keys array are used to constrain the + * output. The first element in the array should represent the lowest + * disk mapping ("low key") that the user wants to learn about. If this + * value is all zeroes, the filesystem will return the first entry it + * knows about. For a subsequent call, the contents of + * fsrefs_head.fch_recs[fsrefs_head.fch_count - 1] should be copied into + * fch_keys[0] to have the kernel start where it left off. + * + * The second element in the fch_keys array should represent the highest + * disk mapping ("high key") that the user wants to learn about. If this + * value is all ones, the filesystem will not stop until it runs out of + * mapping to return or runs out of space in fch_recs. + * + * fcr_device can be either a 32-bit cookie representing a device, or a + * 32-bit dev_t if the FCH_OF_DEV_T flag is set. fcr_physical and + * fcr_length are expressed in units of bytes. fcr_owners is the number + * of owners. + */ +struct xfs_getfsrefs { + __u32 fcr_device; /* device id */ + __u32 fcr_flags; /* mapping flags */ + __u64 fcr_physical; /* device offset of segment */ + __u64 fcr_owners; /* number of owners */ + __u64 fcr_length; /* length of segment */ + __u64 fcr_reserved[4]; /* must be zero */ +}; + +struct xfs_getfsrefs_head { + __u32 fch_iflags; /* control flags */ + __u32 fch_oflags; /* output flags */ + __u32 fch_count; /* # of entries in array incl. input */ + __u32 fch_entries; /* # of entries filled in (output). 
*/ + __u64 fch_reserved[6]; /* must be zero */ + + struct xfs_getfsrefs fch_keys[2]; /* low and high keys for the mapping search */ + struct xfs_getfsrefs fch_recs[]; /* returned records */ +}; + +/* Size of an fsrefs_head with room for nr records. */ +static inline unsigned long long +xfs_getfsrefs_sizeof( + unsigned int nr) +{ + return sizeof(struct xfs_getfsrefs_head) + + (nr * sizeof(struct xfs_getfsrefs)); +} + +/* Start the next fsrefs query at the end of the current query results. */ +static inline void +xfs_getfsrefs_advance( + struct xfs_getfsrefs_head *head) +{ + head->fch_keys[0] = head->fch_recs[head->fch_entries - 1]; +} + +/* fch_iflags values - set by XFS_IOC_GETFSREFCOUNTS caller in the header. */ +#define FCH_IF_VALID 0 + +/* fch_oflags values - returned in the header segment only. */ +#define FCH_OF_DEV_T (1U << 0) /* fcr_device values will be dev_t */ + +/* fcr_flags values - returned for each non-header segment */ +#define FCR_OF_LAST (1U << 0) /* last record in the dataset */ + /* * ioctl commands that are used by Linux filesystems */ @@ -1047,6 +1126,7 @@ struct xfs_rtgroup_geometry { #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle) #define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head) #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry) +#define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/fs/xfs/xfs_fsrefs.c b/fs/xfs/xfs_fsrefs.c new file mode 100644 index 00000000000000..85e109dba20f99 --- /dev/null +++ b/fs/xfs/xfs_fsrefs.c @@ -0,0 +1,777 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_inode.h" +#include "xfs_trans.h" +#include "xfs_btree.h" +#include "xfs_trace.h" +#include "xfs_alloc.h" +#include "xfs_bit.h" +#include "xfs_fsrefs.h" +#include "xfs_refcount.h" +#include "xfs_refcount_btree.h" +#include "xfs_alloc_btree.h" +#include "xfs_rtalloc.h" +#include "xfs_rtrefcount_btree.h" +#include "xfs_ag.h" +#include "xfs_rtbitmap.h" +#include "xfs_rtgroup.h" + +/* getfsrefs query state */ +struct xfs_fsrefs_info { + struct xfs_fsrefs_head *head; + struct xfs_getfsrefs *fsrefs_recs; /* mapping records */ + + struct xfs_btree_cur *refc_cur; /* refcount btree cursor */ + struct xfs_btree_cur *bno_cur; /* bnobt btree cursor */ + + struct xfs_buf *agf_bp; /* AGF, for refcount queries */ + struct xfs_group *group; + + xfs_daddr_t next_daddr; /* next daddr we expect */ + /* daddr of low fsrefs key when we're using the rtbitmap */ + xfs_daddr_t low_daddr; + + /* + * Low refcount key for the query. If low.rc_blockcount is nonzero, + * this is the second (or later) call to retrieve the recordset in + * pieces. xfs_getfsrefs_rec_before_start will compare all records + * retrieved by the refcountbt query to filter out any records that + * start before the last record. + */ + struct xfs_refcount_irec low; + struct xfs_refcount_irec high; /* high refcount key */ + + u32 dev; /* device id */ + bool last; /* last extent? */ +}; + +/* Associate a device with a getfsrefs handler. */ +struct xfs_fsrefs_dev { + u32 dev; + int (*fn)(struct xfs_trans *tp, + const struct xfs_fsrefs *keys, + struct xfs_fsrefs_info *info); +}; + +/* Convert an xfs_fsrefs to an fsrefs. 
*/ +static void +xfs_fsrefs_from_internal( + struct xfs_getfsrefs *dest, + struct xfs_fsrefs *src) +{ + dest->fcr_device = src->fcr_device; + dest->fcr_flags = src->fcr_flags; + dest->fcr_physical = BBTOB(src->fcr_physical); + dest->fcr_owners = src->fcr_owners; + dest->fcr_length = BBTOB(src->fcr_length); + dest->fcr_reserved[0] = 0; + dest->fcr_reserved[1] = 0; + dest->fcr_reserved[2] = 0; + dest->fcr_reserved[3] = 0; +} + +/* Convert an fsrefs to an xfs_fsrefs. */ +static void +xfs_fsrefs_to_internal( + struct xfs_fsrefs *dest, + struct xfs_getfsrefs *src) +{ + dest->fcr_device = src->fcr_device; + dest->fcr_flags = src->fcr_flags; + dest->fcr_physical = BTOBBT(src->fcr_physical); + dest->fcr_owners = src->fcr_owners; + dest->fcr_length = BTOBBT(src->fcr_length); +} + +/* Compare two getfsrefs device handlers. */ +static int +xfs_fsrefs_dev_compare( + const void *p1, + const void *p2) +{ + const struct xfs_fsrefs_dev *d1 = p1; + const struct xfs_fsrefs_dev *d2 = p2; + + return d1->dev - d2->dev; +} + +static inline bool +xfs_fsrefs_frec_before_start( + struct xfs_fsrefs_info *info, + const struct xfs_fsrefs_irec *frec) +{ + if (info->low_daddr != XFS_BUF_DADDR_NULL) + return frec->start_daddr < info->low_daddr; + if (info->low.rc_blockcount) + return frec->rec_key < info->low.rc_startblock; + return false; +} + +/* + * Format a refcount record for fsrefs, having translated rc_startblock into + * the appropriate daddr units. + */ +STATIC int +xfs_fsrefs_helper( + struct xfs_trans *tp, + struct xfs_fsrefs_info *info, + const struct xfs_fsrefs_irec *frec) +{ + struct xfs_fsrefs fcr; + struct xfs_getfsrefs *row; + struct xfs_mount *mp = tp->t_mountp; + + if (fatal_signal_pending(current)) + return -EINTR; + + /* + * Filter out records that start before our startpoint, if the + * caller requested that. + */ + if (xfs_fsrefs_frec_before_start(info, frec)) + return 0; + + /* Are we just counting mappings? 
*/ + if (info->head->fch_count == 0) { + if (info->head->fch_entries == UINT_MAX) + return -ECANCELED; + + info->head->fch_entries++; + return 0; + } + + /* Fill out the extent we found */ + if (info->head->fch_entries >= info->head->fch_count) + return -ECANCELED; + + trace_xfs_fsrefs_mapping(mp, info->dev, + info->group ? info->group->xg_gno : NULLAGNUMBER, + frec); + + fcr.fcr_device = info->dev; + fcr.fcr_flags = 0; + fcr.fcr_physical = frec->start_daddr; + fcr.fcr_owners = frec->refcount; + fcr.fcr_length = frec->len_daddr; + + trace_xfs_getfsrefs_mapping(mp, &fcr); + + row = &info->fsrefs_recs[info->head->fch_entries++]; + xfs_fsrefs_from_internal(row, &fcr); + return 0; +} + +/* Synthesize fsrefs records from free space data. */ +STATIC int +xfs_fsrefs_ddev_bnobt_helper( + struct xfs_btree_cur *cur, + const struct xfs_alloc_rec_incore *rec, + void *priv) +{ + struct xfs_fsrefs_irec frec = { + .refcount = 1, + }; + struct xfs_mount *mp = cur->bc_mp; + struct xfs_fsrefs_info *info = priv; + xfs_agnumber_t next_agno; + xfs_agblock_t next_agbno; + + /* + * Figure out if there's a gap between the last fsrefs record we + * emitted and this free extent. If there is, report the gap as a + * refcount==1 record. + */ + next_agno = xfs_daddr_to_agno(mp, info->next_daddr); + next_agbno = xfs_daddr_to_agbno(mp, info->next_daddr); + + ASSERT(next_agno >= cur->bc_group->xg_gno); + ASSERT(rec->ar_startblock >= next_agbno); + + /* + * If we've already moved on to the next AG, we don't have any fsrefs + * records to synthesize. 
+ */ + if (next_agno > cur->bc_group->xg_gno) + return 0; + + info->next_daddr = xfs_gbno_to_daddr(cur->bc_group, + rec->ar_startblock + rec->ar_blockcount); + + if (rec->ar_startblock == next_agbno) + return 0; + + /* Emit a record for the in-use space */ + frec.start_daddr = xfs_gbno_to_daddr(cur->bc_group, next_agbno); + frec.len_daddr = XFS_FSB_TO_BB(mp, rec->ar_startblock - next_agbno); + frec.rec_key = next_agbno; + return xfs_fsrefs_helper(cur->bc_tp, info, &frec); +} + +/* Emit records to fill a gap in the refcount btree with singly-owned blocks. */ +STATIC int +xfs_fsrefs_ddev_fill_refcount_gap( + struct xfs_trans *tp, + struct xfs_fsrefs_info *info, + xfs_agblock_t agbno) +{ + struct xfs_alloc_rec_incore low = {0}; + struct xfs_alloc_rec_incore high = {0}; + struct xfs_mount *mp = tp->t_mountp; + struct xfs_btree_cur *cur = info->bno_cur; + struct xfs_agf *agf; + int error; + + ASSERT(xfs_daddr_to_agno(mp, info->next_daddr) == + cur->bc_group->xg_gno); + + low.ar_startblock = xfs_daddr_to_agbno(mp, info->next_daddr); + if (low.ar_startblock >= agbno) + return 0; + + high.ar_startblock = agbno; + error = xfs_alloc_query_range(cur, &low, &high, + xfs_fsrefs_ddev_bnobt_helper, info); + if (error) + return error; + + /* + * Synthesize records for single-owner extents between the last + * fsrefcount record emitted and the end of the query range. 
+ */ + agf = cur->bc_ag.agbp->b_addr; + low.ar_startblock = min_t(xfs_agblock_t, agbno, + be32_to_cpu(agf->agf_length)); + if (xfs_daddr_to_agbno(mp, info->next_daddr) > low.ar_startblock) + return 0; + + info->last = true; + return xfs_fsrefs_ddev_bnobt_helper(cur, &low, info); +} + +/* Transform a refcountbt irec into a fsrefs */ +STATIC int +xfs_fsrefs_ddev_refcountbt_helper( + struct xfs_btree_cur *cur, + const struct xfs_refcount_irec *rec, + void *priv) +{ + struct xfs_fsrefs_irec frec = { + .refcount = rec->rc_refcount, + .rec_key = rec->rc_startblock, + }; + struct xfs_mount *mp = cur->bc_mp; + struct xfs_fsrefs_info *info = priv; + int error; + + /* + * Stop once we get to the CoW staging extents; they're all shoved to + * the right side of the btree and were already covered by the bnobt + * scan. + */ + if (rec->rc_domain != XFS_REFC_DOMAIN_SHARED) + return -ECANCELED; + + /* Report on any gaps first */ + error = xfs_fsrefs_ddev_fill_refcount_gap(cur->bc_tp, info, + rec->rc_startblock); + if (error) + return error; + + /* Report the refcount record from the refcount btree. */ + frec.start_daddr = xfs_gbno_to_daddr(cur->bc_group, + rec->rc_startblock); + frec.len_daddr = XFS_FSB_TO_BB(mp, rec->rc_blockcount); + info->next_daddr = xfs_gbno_to_daddr(cur->bc_group, + rec->rc_startblock + rec->rc_blockcount); + return xfs_fsrefs_helper(cur->bc_tp, info, &frec); +} + +/* Execute a getfsrefs query against the regular data device. 
*/ +STATIC int +xfs_fsrefs_ddev( + struct xfs_trans *tp, + const struct xfs_fsrefs *keys, + struct xfs_fsrefs_info *info) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_buf *agf_bp = NULL; + struct xfs_perag *pag = NULL; + xfs_fsblock_t start_fsb; + xfs_fsblock_t end_fsb; + xfs_agnumber_t start_ag; + xfs_agnumber_t end_ag; + uint64_t eofs; + int error = 0; + + eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks); + if (keys[0].fcr_physical >= eofs) + return 0; + start_fsb = XFS_DADDR_TO_FSB(mp, keys[0].fcr_physical); + end_fsb = XFS_DADDR_TO_FSB(mp, min(eofs - 1, keys[1].fcr_physical)); + + info->refc_cur = info->bno_cur = NULL; + + /* + * Convert the fsrefs low/high keys to AG based keys. Initialize + * low to the fsrefs low key and max out the high key to the end + * of the AG. + */ + info->low.rc_startblock = XFS_FSB_TO_AGBNO(mp, start_fsb); + info->low.rc_blockcount = XFS_BB_TO_FSBT(mp, keys[0].fcr_length); + info->low.rc_refcount = 0; + info->low.rc_domain = XFS_REFC_DOMAIN_SHARED; + + /* Adjust the low key if we are continuing from where we left off. */ + if (info->low.rc_blockcount > 0) { + info->low.rc_startblock += info->low.rc_blockcount; + + start_fsb += info->low.rc_blockcount; + if (XFS_FSB_TO_DADDR(mp, start_fsb) >= eofs) + return 0; + } + + info->high.rc_startblock = -1U; + info->high.rc_refcount = 0; + info->high.rc_domain = XFS_REFC_DOMAIN_SHARED; + + start_ag = XFS_FSB_TO_AGNO(mp, start_fsb); + end_ag = XFS_FSB_TO_AGNO(mp, end_fsb); + + /* Query each AG */ + while ((pag = xfs_perag_next_range(mp, pag, start_ag, end_ag))) { + info->group = pag_group(pag); + + /* + * Set the AG high key from the fsrefs high key if this + * is the last AG that we're querying. 
+ */ + if (pag_agno(pag) == end_ag) + info->high.rc_startblock = XFS_FSB_TO_AGBNO(mp, + end_fsb); + + if (info->refc_cur) { + xfs_btree_del_cursor(info->refc_cur, XFS_BTREE_NOERROR); + info->refc_cur = NULL; + } + if (info->bno_cur) { + xfs_btree_del_cursor(info->bno_cur, XFS_BTREE_NOERROR); + info->bno_cur = NULL; + } + if (agf_bp) { + xfs_trans_brelse(tp, agf_bp); + agf_bp = NULL; + } + + error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp); + if (error) + break; + + trace_xfs_fsrefs_low_group_key(mp, info->dev, info->group, + &info->low); + trace_xfs_fsrefs_high_group_key(mp, info->dev, info->group, + &info->high); + + info->bno_cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag); + + if (xfs_has_reflink(mp)) { + info->refc_cur = xfs_refcountbt_init_cursor(mp, tp, + agf_bp, pag); + + /* + * Fill the query with refcount records and synthesize + * singly-owned block records from free space data. + */ + error = xfs_refcount_query_range(info->refc_cur, + &info->low, &info->high, + xfs_fsrefs_ddev_refcountbt_helper, + info); + if (error && error != -ECANCELED) + break; + } + + /* + * Synthesize refcount==1 records from the free space data + * between the end of the last fsrefs record reported and the + * end of the range. If we don't have refcount support, the + * starting point will be the start of the query range. + */ + error = xfs_fsrefs_ddev_fill_refcount_gap(tp, info, + info->high.rc_startblock); + if (error) + break; + + /* + * Set the AG low key to the start of the AG prior to + * moving on to the next AG. 
+ */ + if (pag_agno(pag) == start_ag) + memset(&info->low, 0, sizeof(info->low)); + info->group = NULL; + } + + if (info->refc_cur) { + xfs_btree_del_cursor(info->refc_cur, error); + info->refc_cur = NULL; + } + if (info->bno_cur) { + xfs_btree_del_cursor(info->bno_cur, error); + info->bno_cur = NULL; + } + if (agf_bp) + xfs_trans_brelse(tp, agf_bp); + if (info->group) { + xfs_perag_rele(pag); + info->group = NULL; + } else if (pag) { + /* loop termination case */ + xfs_perag_rele(pag); + } + + return error; +} + +/* Execute a getfsrefs query against the log device. */ +STATIC int +xfs_fsrefs_logdev( + struct xfs_trans *tp, + const struct xfs_fsrefs *keys, + struct xfs_fsrefs_info *info) +{ + struct xfs_fsrefs_irec frec = { + .start_daddr = 0, + .rec_key = 0, + .refcount = 1, + }; + struct xfs_mount *mp = tp->t_mountp; + xfs_fsblock_t start_fsb, end_fsb; + uint64_t eofs; + + eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks); + if (keys[0].fcr_physical >= eofs) + return 0; + start_fsb = XFS_BB_TO_FSBT(mp, + keys[0].fcr_physical + keys[0].fcr_length); + end_fsb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fcr_physical)); + + /* Adjust the low key if we are continuing from where we left off. */ + if (keys[0].fcr_length > 0) + info->low_daddr = XFS_FSB_TO_BB(mp, start_fsb); + + trace_xfs_fsrefs_low_linear_key(mp, info->dev, start_fsb); + trace_xfs_fsrefs_high_linear_key(mp, info->dev, end_fsb); + + if (start_fsb > 0) + return 0; + + /* Fabricate a refc entry for the external log device. */ + frec.len_daddr = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks); + return xfs_fsrefs_helper(tp, info, &frec); +} + +/* Do we recognize the device? 
*/ +STATIC bool +xfs_fsrefs_is_valid_device( + struct xfs_mount *mp, + struct xfs_fsrefs *fcr) +{ + if (fcr->fcr_device == 0 || fcr->fcr_device == UINT_MAX || + fcr->fcr_device == new_encode_dev(mp->m_ddev_targp->bt_dev)) + return true; + if (mp->m_logdev_targp && + fcr->fcr_device == new_encode_dev(mp->m_logdev_targp->bt_dev)) + return true; + if (mp->m_rtdev_targp && + fcr->fcr_device == new_encode_dev(mp->m_rtdev_targp->bt_dev)) + return true; + return false; +} + +/* Ensure that the low key is less than the high key. */ +STATIC bool +xfs_fsrefs_check_keys( + struct xfs_fsrefs *low_key, + struct xfs_fsrefs *high_key) +{ + if (low_key->fcr_device > high_key->fcr_device) + return false; + if (low_key->fcr_device < high_key->fcr_device) + return true; + + if (low_key->fcr_physical > high_key->fcr_physical) + return false; + if (low_key->fcr_physical < high_key->fcr_physical) + return true; + + return false; +} + +#define XFS_GETFSREFS_DEVS 2 + +/* + * Get filesystem's extent refcounts as described in head, and format for + * output. Fills in the supplied records array until there are no more reverse + * mappings to return or head.fch_entries == head.fch_count. In the second + * case, this function returns -ECANCELED to indicate that more records would + * have been returned. + * + * Key to Confusion + * ---------------- + * There are multiple levels of keys and counters at work here: + * xfs_fsrefs_head.fch_keys -- low and high fsrefs keys passed in; + * these reflect fs-wide sector addrs. + * dkeys -- fch_keys used to query each device; + * these are fch_keys but w/ the low key + * bumped up by fcr_length. + * xfs_fsrefs_info.next_daddr-- next disk addr we expect to see; this + * is how we detect gaps in the fsrefs + * records and report them. + * xfs_fsrefs_info.low/high -- per-AG low/high keys computed from + * dkeys; used to query the metadata. 
+ */ +STATIC int +xfs_getfsrefs( + struct xfs_mount *mp, + struct xfs_fsrefs_head *head, + struct xfs_getfsrefs *fsrefs_recs) +{ + struct xfs_trans *tp = NULL; + struct xfs_fsrefs dkeys[2]; /* per-dev keys */ + struct xfs_fsrefs_dev handlers[XFS_GETFSREFS_DEVS]; + struct xfs_fsrefs_info info = { NULL }; + int i; + int error = 0; + + if (head->fch_iflags & ~FCH_IF_VALID) + return -EINVAL; + if (!xfs_fsrefs_is_valid_device(mp, &head->fch_keys[0]) || + !xfs_fsrefs_is_valid_device(mp, &head->fch_keys[1])) + return -EINVAL; + if (!xfs_fsrefs_check_keys(&head->fch_keys[0], &head->fch_keys[1])) + return -EINVAL; + + head->fch_entries = 0; + + /* Set up our device handlers. */ + memset(handlers, 0, sizeof(handlers)); + handlers[0].dev = new_encode_dev(mp->m_ddev_targp->bt_dev); + handlers[0].fn = xfs_fsrefs_ddev; + if (mp->m_logdev_targp != mp->m_ddev_targp) { + handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev); + handlers[1].fn = xfs_fsrefs_logdev; + } + + xfs_sort(handlers, XFS_GETFSREFS_DEVS, sizeof(struct xfs_fsrefs_dev), + xfs_fsrefs_dev_compare); + + /* + * To continue where we left off, we allow userspace to use the last + * mapping from a previous call as the low key of the next. This is + * identified by a non-zero length in the low key. We have to increment + * the low key in this scenario to ensure we don't return the same + * mapping again, and instead return the very next mapping. Bump the + * physical offset as there can be no other mapping for the same + * physical block range. + * + * Each fsrefs backend is responsible for making this adjustment as + * appropriate for the backend. + */ + dkeys[0] = head->fch_keys[0]; + memset(&dkeys[1], 0xFF, sizeof(struct xfs_fsrefs)); + + info.next_daddr = head->fch_keys[0].fcr_physical + + head->fch_keys[0].fcr_length; + info.fsrefs_recs = fsrefs_recs; + info.head = head; + + /* For each device we support... 
*/ + for (i = 0; i < XFS_GETFSREFS_DEVS; i++) { + /* Is this device within the range the user asked for? */ + if (!handlers[i].fn) + continue; + if (head->fch_keys[0].fcr_device > handlers[i].dev) + continue; + if (head->fch_keys[1].fcr_device < handlers[i].dev) + break; + + /* + * If this device number matches the high key, we have to pass + * the high key to the handler to limit the query results. If + * the device number exceeds the low key, zero out the low key + * so that we get everything from the beginning. + */ + if (handlers[i].dev == head->fch_keys[1].fcr_device) + dkeys[1] = head->fch_keys[1]; + if (handlers[i].dev > head->fch_keys[0].fcr_device) + memset(&dkeys[0], 0, sizeof(struct xfs_fsrefs)); + + /* + * Grab an empty transaction so that we can use its recursive + * buffer locking abilities to detect cycles in the refcountbt + * without deadlocking. + */ + error = xfs_trans_alloc_empty(mp, &tp); + if (error) + break; + + info.dev = handlers[i].dev; + info.last = false; + info.group = NULL; + info.low_daddr = XFS_BUF_DADDR_NULL; + info.low.rc_blockcount = 0; + error = handlers[i].fn(tp, dkeys, &info); + if (error) + break; + xfs_trans_cancel(tp); + tp = NULL; + info.next_daddr = 0; + } + + if (tp) + xfs_trans_cancel(tp); + head->fch_oflags = FCH_OF_DEV_T; + return error; +} + +int +xfs_ioc_getfsrefcounts( + struct xfs_inode *ip, + struct xfs_getfsrefs_head __user *arg) +{ + struct xfs_fsrefs_head xhead = {0}; + struct xfs_getfsrefs_head head; + struct xfs_getfsrefs *recs; + unsigned int count; + __u32 last_flags = 0; + bool done = false; + int error; + + if (copy_from_user(&head, arg, sizeof(struct xfs_getfsrefs_head))) + return -EFAULT; + if (memchr_inv(head.fch_reserved, 0, sizeof(head.fch_reserved)) || + memchr_inv(head.fch_keys[0].fcr_reserved, 0, + sizeof(head.fch_keys[0].fcr_reserved)) || + memchr_inv(head.fch_keys[1].fcr_reserved, 0, + sizeof(head.fch_keys[1].fcr_reserved))) + return -EINVAL; + + /* + * Use an internal memory buffer so that we 
don't have to copy fsrefs + * data to userspace while holding locks. Start by trying to allocate + * up to 128k for the buffer, but fall back to a single page if needed. + */ + count = min_t(unsigned int, head.fch_count, + 131072 / sizeof(struct xfs_getfsrefs)); + recs = kvcalloc(count, sizeof(struct xfs_getfsrefs), GFP_KERNEL); + if (!recs) { + count = min_t(unsigned int, head.fch_count, + PAGE_SIZE / sizeof(struct xfs_getfsrefs)); + recs = kvcalloc(count, sizeof(struct xfs_getfsrefs), + GFP_KERNEL); + if (!recs) + return -ENOMEM; + } + + xhead.fch_iflags = head.fch_iflags; + xfs_fsrefs_to_internal(&xhead.fch_keys[0], &head.fch_keys[0]); + xfs_fsrefs_to_internal(&xhead.fch_keys[1], &head.fch_keys[1]); + + trace_xfs_getfsrefs_low_key(ip->i_mount, &xhead.fch_keys[0]); + trace_xfs_getfsrefs_high_key(ip->i_mount, &xhead.fch_keys[1]); + + head.fch_entries = 0; + do { + struct xfs_getfsrefs __user *user_recs; + struct xfs_getfsrefs *last_rec; + size_t copy_bytes; + + user_recs = &arg->fch_recs[head.fch_entries]; + xhead.fch_entries = 0; + xhead.fch_count = min_t(unsigned int, count, + head.fch_count - head.fch_entries); + + /* Run query, record how many entries we got. */ + error = xfs_getfsrefs(ip->i_mount, &xhead, recs); + switch (error) { + case 0: + /* + * There are no more records in the result set. Copy + * whatever we got to userspace and break out. + */ + done = true; + break; + case -ECANCELED: + /* + * The internal memory buffer is full. Copy whatever + * records we got to userspace and go again if we have + * not yet filled the userspace buffer. + */ + error = 0; + break; + default: + goto out_free; + } + head.fch_entries += xhead.fch_entries; + head.fch_oflags = xhead.fch_oflags; + + /* + * If the caller wanted a record count or there aren't any + * new records to return, we're done. + */ + if (head.fch_count == 0 || xhead.fch_entries == 0) + break; + + /* Copy all the records we got out to userspace. 
*/ + copy_bytes = array_size(xhead.fch_entries, + sizeof(struct xfs_getfsrefs)); + if (copy_bytes == SIZE_MAX || + copy_to_user(user_recs, recs, copy_bytes)) { + error = -EFAULT; + goto out_free; + } + + /* Remember the last record flags we copied to userspace. */ + last_rec = &recs[xhead.fch_entries - 1]; + last_flags = last_rec->fcr_flags; + + /* Set up the low key for the next iteration. */ + xfs_fsrefs_to_internal(&xhead.fch_keys[0], last_rec); + trace_xfs_getfsrefs_low_key(ip->i_mount, &xhead.fch_keys[0]); + } while (!done && head.fch_entries < head.fch_count); + + /* + * If there are no more records in the query result set and we're not + * in counting mode, mark the last record returned with the LAST flag. + */ + if (done && head.fch_count > 0 && head.fch_entries > 0) { + struct xfs_getfsrefs __user *user_rec; + + last_flags |= FCR_OF_LAST; + user_rec = &arg->fch_recs[head.fch_entries - 1]; + + if (copy_to_user(&user_rec->fcr_flags, &last_flags, + sizeof(last_flags))) { + error = -EFAULT; + goto out_free; + } + } + + /* copy back header */ + if (copy_to_user(arg, &head, sizeof(struct xfs_getfsrefs_head))) { + error = -EFAULT; + goto out_free; + } + +out_free: + kvfree(recs); + return error; +} diff --git a/fs/xfs/xfs_fsrefs.h b/fs/xfs/xfs_fsrefs.h new file mode 100644 index 00000000000000..6d23eaa4801e24 --- /dev/null +++ b/fs/xfs/xfs_fsrefs.h @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_FSREFS_H__ +#define __XFS_FSREFS_H__ + +struct xfs_getfsrefs; + +/* internal fsrefs representation */ +struct xfs_fsrefs { + dev_t fcr_device; /* device id */ + uint32_t fcr_flags; /* mapping flags */ + uint64_t fcr_physical; /* device offset of segment */ + uint64_t fcr_owners; /* number of owners */ + xfs_filblks_t fcr_length; /* length of segment, blocks */ +}; + +struct xfs_fsrefs_head { + uint32_t fch_iflags; /* control flags */ + uint32_t fch_oflags; /* output flags */ + unsigned int fch_count; /* # of entries in array incl. input */ + unsigned int fch_entries; /* # of entries filled in (output). */ + + struct xfs_fsrefs fch_keys[2]; /* low and high keys */ +}; + +/* internal fsrefs record format */ +struct xfs_fsrefs_irec { + xfs_daddr_t start_daddr; + xfs_daddr_t len_daddr; + xfs_nlink_t refcount; + + /* + * refcount startblock corresponding to start_daddr, if the record came + * from a refcount btree. + */ + xfs_agblock_t rec_key; +}; + +int xfs_ioc_getfsrefcounts(struct xfs_inode *ip, + struct xfs_getfsrefs_head __user *arg); + +#endif /* __XFS_FSREFS_H__ */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 874e2def3d6e63..20f013bd4ce653 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -29,6 +29,7 @@ #include "xfs_btree.h" #include <linux/fsmap.h> #include "xfs_fsmap.h" +#include "xfs_fsrefs.h" #include "scrub/xfs_scrub.h" #include "xfs_sb.h" #include "xfs_ag.h" @@ -1266,6 +1267,9 @@ xfs_file_ioctl( case FS_IOC_GETFSMAP: return xfs_ioc_getfsmap(ip, arg); + case XFS_IOC_GETFSREFCOUNTS: + return xfs_ioc_getfsrefcounts(ip, arg); + case XFS_IOC_SCRUBV_METADATA: return xfs_ioc_scrubv_metadata(filp, arg); case XFS_IOC_SCRUB_METADATA: diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index a60556dbd172ee..555fe76b4d853c 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -51,6 +51,7 @@ #include "xfs_rtgroup.h" #include "xfs_zone_alloc.h" #include "xfs_zone_priv.h" +#include 
"xfs_fsrefs.h" /* * We include this last to have the helpers above available for the trace diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index dc7ffc8f8e9dea..7043b6481d5f97 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -103,6 +103,8 @@ struct xfs_refcount_intent; struct xfs_metadir_update; struct xfs_rtgroup; struct xfs_open_zone; +struct xfs_fsrefs; +struct xfs_fsrefs_irec; #define XFS_ATTR_FILTER_FLAGS \ { XFS_ATTR_ROOT, "ROOT" }, \ @@ -4297,6 +4299,129 @@ DEFINE_GETFSMAP_EVENT(xfs_getfsmap_low_key); DEFINE_GETFSMAP_EVENT(xfs_getfsmap_high_key); DEFINE_GETFSMAP_EVENT(xfs_getfsmap_mapping); +/* fsrefs traces */ +TRACE_EVENT(xfs_fsrefs_mapping, + TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_agnumber_t agno, + const struct xfs_fsrefs_irec *frec), + TP_ARGS(mp, keydev, agno, frec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, keydev) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + __field(xfs_daddr_t, start_daddr) + __field(xfs_daddr_t, len_daddr) + __field(uint64_t, owners) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->keydev = new_decode_dev(keydev); + __entry->agno = agno; + __entry->agbno = frec->rec_key; + __entry->start_daddr = frec->start_daddr; + __entry->len_daddr = frec->len_daddr; + __entry->owners = frec->refcount; + ), + TP_printk("dev %d:%d keydev %d:%d agno 0x%x agbno 0x%x start_daddr 0x%llx len_daddr 0x%llx owners %llu", + MAJOR(__entry->dev), MINOR(__entry->dev), + MAJOR(__entry->keydev), MINOR(__entry->keydev), + __entry->agno, + __entry->agbno, + __entry->start_daddr, + __entry->len_daddr, + __entry->owners) +); + +DECLARE_EVENT_CLASS(xfs_fsrefs_linear_key_class, + TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_fsblock_t fsbno), + TP_ARGS(mp, keydev, fsbno), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, keydev) + __field(xfs_fsblock_t, fsbno) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->keydev = new_decode_dev(keydev); + 
__entry->fsbno = fsbno; + ), + TP_printk("dev %d:%d keydev %d:%d fsbno 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + MAJOR(__entry->keydev), MINOR(__entry->keydev), + __entry->fsbno) +) +#define DEFINE_FSREFS_LINEAR_KEY_EVENT(name) \ +DEFINE_EVENT(xfs_fsrefs_linear_key_class, name, \ + TP_PROTO(struct xfs_mount *mp, u32 keydev, xfs_fsblock_t fsbno), \ + TP_ARGS(mp, keydev, fsbno)) +DEFINE_FSREFS_LINEAR_KEY_EVENT(xfs_fsrefs_low_linear_key); +DEFINE_FSREFS_LINEAR_KEY_EVENT(xfs_fsrefs_high_linear_key); + +DECLARE_EVENT_CLASS(xfs_fsrefs_group_key_class, + TP_PROTO(struct xfs_mount *mp, u32 keydev, const struct xfs_group *xg, + const struct xfs_refcount_irec *refc), + TP_ARGS(mp, keydev, xg, refc), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, keydev) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->keydev = new_decode_dev(keydev); + __entry->agno = xg->xg_gno; + __entry->agbno = refc->rc_startblock; + ), + TP_printk("dev %d:%d keydev %d:%d agno 0x%x refcbno 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + MAJOR(__entry->keydev), MINOR(__entry->keydev), + __entry->agno, + __entry->agbno) +) +#define DEFINE_FSREFS_GROUP_KEY_EVENT(name) \ +DEFINE_EVENT(xfs_fsrefs_group_key_class, name, \ + TP_PROTO(struct xfs_mount *mp, u32 keydev, const struct xfs_group *xg, \ + const struct xfs_refcount_irec *refc), \ + TP_ARGS(mp, keydev, xg, refc)) +DEFINE_FSREFS_GROUP_KEY_EVENT(xfs_fsrefs_low_group_key); +DEFINE_FSREFS_GROUP_KEY_EVENT(xfs_fsrefs_high_group_key); + +DECLARE_EVENT_CLASS(xfs_getfsrefs_class, + TP_PROTO(struct xfs_mount *mp, struct xfs_fsrefs *fsrefs), + TP_ARGS(mp, fsrefs), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, keydev) + __field(xfs_daddr_t, block) + __field(xfs_daddr_t, len) + __field(uint64_t, owners) + __field(uint32_t, flags) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->keydev = 
new_decode_dev(fsrefs->fcr_device); + __entry->block = fsrefs->fcr_physical; + __entry->len = fsrefs->fcr_length; + __entry->owners = fsrefs->fcr_owners; + __entry->flags = fsrefs->fcr_flags; + ), + TP_printk("dev %d:%d keydev %d:%d daddr 0x%llx bbcount 0x%llx owners %llu flags 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + MAJOR(__entry->keydev), MINOR(__entry->keydev), + __entry->block, + __entry->len, + __entry->owners, + __entry->flags) +) +#define DEFINE_GETFSREFS_EVENT(name) \ +DEFINE_EVENT(xfs_getfsrefs_class, name, \ + TP_PROTO(struct xfs_mount *mp, struct xfs_fsrefs *fsrefs), \ + TP_ARGS(mp, fsrefs)) +DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_low_key); +DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_high_key); +DEFINE_GETFSREFS_EVENT(xfs_getfsrefs_mapping); + DECLARE_EVENT_CLASS(xfs_trans_resv_class, TP_PROTO(struct xfs_mount *mp, unsigned int type, struct xfs_trans_res *res), ^ permalink raw reply related [flat|nested] 110+ messages in thread
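A note on the continuation rule described in the xfs_getfsrefs comment
above (a non-zero length in the low key means "resume just past this
record"): the loop in xfs_ioc_getfsrefcounts relies on it to drain a
query through a bounded buffer. The following is only a minimal
userspace sketch of that idea, not the kernel code. Everything named
demo_* and DEMO_OF_LAST is hypothetical and exists only for
illustration; the sketch assumes a sorted table of records with non-zero
lengths and writes straight into the destination array instead of going
through an internal buffer and copy_to_user.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one refcount record; not the kernel struct. */
struct demo_rec {
	uint64_t	physical;	/* start of the extent */
	uint64_t	length;		/* extent length; 0 in a key = fresh query */
	uint64_t	owners;		/* number of owners */
	uint32_t	flags;		/* output flags */
};

#define DEMO_OF_LAST	0x1	/* hypothetical "last record" flag */

/*
 * Toy backend.  Copy records from a sorted table into out[], starting at
 * the first record at or after (low->physical + low->length) and stopping
 * once a record starts beyond high->physical.  Sets *more when the output
 * buffer filled up before the result set was exhausted.
 */
static size_t
demo_query(const struct demo_rec *table, size_t nr,
	   const struct demo_rec *low, const struct demo_rec *high,
	   struct demo_rec *out, size_t count, int *more)
{
	uint64_t	start = low->physical + low->length;
	size_t		copied = 0;
	size_t		i;

	*more = 0;
	for (i = 0; i < nr; i++) {
		if (table[i].physical < start)
			continue;
		if (table[i].physical > high->physical)
			break;
		if (copied == count) {
			*more = 1;
			break;
		}
		out[copied++] = table[i];
	}
	return copied;
}

/*
 * Caller loop.  Fetch at most "chunk" records per call and feed the last
 * record returned back in as the next low key.  The final record is
 * flagged only if the result set was fully drained.
 */
size_t
demo_drain(const struct demo_rec *table, size_t nr, uint64_t high_physical,
	   struct demo_rec *dst, size_t dst_count, size_t chunk)
{
	struct demo_rec	low = { 0 };
	struct demo_rec	high = { .physical = high_physical };
	size_t		total = 0;
	int		more = 1;

	assert(chunk > 0);
	while (more && total < dst_count) {
		size_t	want = dst_count - total;
		size_t	got;

		if (want > chunk)
			want = chunk;
		got = demo_query(table, nr, &low, &high, &dst[total], want,
				&more);
		if (got == 0)
			break;
		total += got;
		/* Non-zero length in the low key: resume after this record. */
		low = dst[total - 1];
	}
	if (!more && total > 0)
		dst[total - 1].flags |= DEMO_OF_LAST;
	return total;
}
```

Calling demo_drain() with a destination array smaller than the result
set leaves the last record unflagged, which is how a caller of this
sketch would know to come back with that record as its next low key.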
* [PATCHSET 4/5] xfs: defragment free space
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (2 preceding siblings ...)
2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
2024-12-31 23:38 ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
` (3 more replies)
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (11 subsequent siblings)
15 siblings, 4 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

Hi all,

These patches contain experimental code to enable userspace to
defragment the free space in a filesystem. Two purposes are imagined
for this functionality: clearing space at the end of a filesystem before
shrinking it, and clearing free space in anticipation of making a large
allocation.

The first patch adds a new fallocate mode that allows userspace to
allocate free space from the filesystem into a file. The goal here is
to allow the filesystem shrink process to prevent allocation from a
certain part of the filesystem while a free space defragmenter evacuates
all the files from the doomed part of the filesystem.

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs: export realtime refcount information
 * xfs: capture the offset and length in fallocate tracepoints
 * xfs: add an ioctl to map free space into a file
 * xfs: implement FALLOC_FL_MAP_FREE for realtime files
---
 fs/xfs/libxfs/xfs_alloc.c |   88 ++++++++
 fs/xfs/libxfs/xfs_alloc.h |    3
 fs/xfs/libxfs/xfs_bmap.c  |    1
 fs/xfs/libxfs/xfs_fs.h    |   14 +
 fs/xfs/xfs_bmap_util.c    |  513 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h    |    3
 fs/xfs/xfs_file.c         |  143 ++++++++++++-
 fs/xfs/xfs_file.h         |    2
 fs/xfs/xfs_fsrefs.c       |  405 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c        |    5
 fs/xfs/xfs_rtalloc.c      |  108 +++++++++
 fs/xfs/xfs_rtalloc.h      |    7 +
 fs/xfs/xfs_trace.h        |   86 +++++++-
 13 files changed, 1368 insertions(+), 10 deletions(-)

^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 1/4] xfs: export realtime refcount information
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
@ 2024-12-31 23:38 ` Darrick J. Wong
2024-12-31 23:38 ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add support for reporting space refcount information from the realtime
volume.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_fsrefs.c |  405 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 405 insertions(+)

diff --git a/fs/xfs/xfs_fsrefs.c b/fs/xfs/xfs_fsrefs.c
index 85e109dba20f99..d5b77fe79b2653 100644
--- a/fs/xfs/xfs_fsrefs.c
+++ b/fs/xfs/xfs_fsrefs.c
@@ -478,6 +478,395 @@ xfs_fsrefs_logdev(
 	return xfs_fsrefs_helper(tp, info, &frec);
 }

+#ifdef CONFIG_XFS_RT
+/* Synthesize fsrefs records from rtbitmap records. */
+STATIC int
+xfs_fsrefs_rtdev_bitmap_helper(
+	struct xfs_rtgroup		*rtg,
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = rtg_mount(rtg);
+	struct xfs_fsrefs_info		*info = priv;
+	xfs_rtblock_t			next_rtb, rec_rtb, rtb;
+	xfs_rgnumber_t			next_rgno;
+	xfs_rgblock_t			next_rgbno;
+	xfs_rgblock_t			rec_rgbno;
+
+	/* Translate the free space record to group and block number. */
+	rec_rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+	rec_rgbno = xfs_rtb_to_rgbno(mp, rec_rtb);
+
+	/*
+	 * Figure out if there's a gap between the last fsrefs record we
+	 * emitted and this free extent. If there is, report the gap as a
+	 * refcount==1 record.
+	 */
+	next_rtb = xfs_daddr_to_rtb(mp, info->next_daddr);
+	next_rgno = xfs_rtb_to_rgno(mp, next_rtb);
+	next_rgbno = xfs_rtb_to_rgbno(mp, next_rtb);
+
+	ASSERT(next_rgno >= info->group->xg_gno);
+	ASSERT(rec_rgbno >= next_rgbno);
+
+	/*
+	 * If we've already moved on to the next rtgroup, we don't have any
+	 * fsrefs records to synthesize.
+	 */
+	if (next_rgno > info->group->xg_gno)
+		return 0;
+
+	rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext + rec->ar_extcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rtb);
+
+	if (rec_rtb == next_rtb)
+		return 0;
+
+	/* Emit a record for the in-use space. */
+	frec.start_daddr = xfs_rtb_to_daddr(mp, next_rtb);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec_rgbno - next_rgbno);
+	frec.rec_key = next_rgbno;
+	return xfs_fsrefs_helper(tp, info, &frec);
+}
+
+/* Emit records to fill a gap in the refcount btree with singly-owned blocks. */
+STATIC int
+xfs_fsrefs_rtdev_fill_refcount_gap(
+	struct xfs_trans	*tp,
+	struct xfs_fsrefs_info	*info,
+	xfs_rgblock_t		rgbno)
+{
+	struct xfs_rtalloc_rec	high = { 0 };
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtgroup	*rtg = to_rtg(info->group);
+	xfs_rtblock_t		start_rtbno =
+				xfs_daddr_to_rtb(mp, info->next_daddr);
+	xfs_rtblock_t		end_rtbno =
+				xfs_rgbno_to_rtb(rtg, rgbno);
+	xfs_rtxnum_t		low_rtx;
+	xfs_daddr_t		rec_daddr;
+	int			error;
+
+	ASSERT(xfs_rtb_to_rgno(mp, start_rtbno) == info->group->xg_gno);
+
+	low_rtx = xfs_rtb_to_rtx(mp, start_rtbno);
+	if (rgbno == -1U) {
+		/*
+		 * If the caller passes in an all 1s high key to signify the
+		 * end of the group, set the extent to all 1s as well.
+		 */
+		high.ar_startext = -1ULL;
+	} else {
+		high.ar_startext = xfs_rtb_to_rtx(mp,
+				end_rtbno + mp->m_sb.sb_rextsize - 1);
+	}
+	if (low_rtx >= high.ar_startext)
+		return 0;
+
+	error = xfs_rtalloc_query_range(rtg, tp, low_rtx, high.ar_startext,
+			xfs_fsrefs_rtdev_bitmap_helper, info);
+	if (error)
+		return error;
+
+	/*
+	 * Synthesize records for single-owner extents between the last
+	 * fsrefcount record emitted and the end of the query range.
+	 */
+	high.ar_startext = min(high.ar_startext, rtg->rtg_extents);
+	rec_daddr = xfs_rtb_to_daddr(mp, xfs_rtx_to_rtb(rtg, high.ar_startext));
+	if (info->next_daddr > rec_daddr)
+		return 0;
+
+	info->last = true;
+	return xfs_fsrefs_rtdev_bitmap_helper(rtg, tp, &high, info);
+}
+
+/* Transform an absolute-startblock refcount (rtdev, logdev) into a fsrefs */
+STATIC int
+xfs_fsrefs_rtdev_refcountbt_helper(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_refcount_irec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= rec->rc_refcount,
+		.rec_key		= rec->rc_startblock,
+	};
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_fsrefs_info		*info = priv;
+	struct xfs_rtgroup		*rtg = to_rtg(info->group);
+	xfs_rtblock_t			rec_rtbno;
+	int				error;
+
+	/*
+	 * Stop once we get to the CoW staging extents; they're all shoved to
+	 * the right side of the btree and were already covered by the rtbitmap
+	 * scan.
+	 */
+	if (rec->rc_domain != XFS_REFC_DOMAIN_SHARED)
+		return -ECANCELED;
+
+	/* Report on any gaps first */
+	error = xfs_fsrefs_rtdev_fill_refcount_gap(cur->bc_tp, info,
+			rec->rc_startblock);
+	if (error)
+		return error;
+
+	/* Report the refcount record from the refcount btree. */
+	rec_rtbno = xfs_rgbno_to_rtb(rtg, rec->rc_startblock);
+	frec.start_daddr = xfs_rtb_to_daddr(mp, rec_rtbno);
+	frec.len_daddr = XFS_FSB_TO_BB(mp, rec->rc_blockcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rec_rtbno + rec->rc_blockcount);
+	return xfs_fsrefs_helper(cur->bc_tp, info, &frec);
+}
+
+#define XFS_RTGLOCK_FSREFS	(XFS_RTGLOCK_BITMAP | XFS_RTGLOCK_REFCOUNT)
+
+/* Execute a getfsrefs query against the realtime device. */
+STATIC int
+xfs_fsrefs_rtdev(
+	struct xfs_trans	*tp,
+	const struct xfs_fsrefs	*keys,
+	struct xfs_fsrefs_info	*info)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_rtgroup	*rtg = NULL, *locked_rtg = NULL;
+	xfs_rtblock_t		start_rtbno;
+	xfs_rtblock_t		end_rtbno;
+	xfs_rgnumber_t		start_rg;
+	xfs_rgnumber_t		end_rg;
+	uint64_t		eofs;
+	int			error = 0;
+
+	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
+	if (keys[0].fcr_physical >= eofs)
+		return 0;
+	start_rtbno = xfs_daddr_to_rtb(mp, keys[0].fcr_physical);
+	end_rtbno = xfs_daddr_to_rtb(mp, min(eofs - 1, keys[1].fcr_physical));
+
+	info->refc_cur = info->bno_cur = NULL;
+
+	/*
+	 * Convert the fsrefs low/high keys to rtgroup based keys. Initialize
+	 * low to the fsrefs low key and max out the high key to the end of the
+	 * rtgroup.
+	 */
+	info->low.rc_startblock = xfs_rtb_to_rgbno(mp, start_rtbno);
+	info->low.rc_blockcount = XFS_BB_TO_FSBT(mp, keys[0].fcr_length);
+	info->low.rc_refcount = 0;
+	info->low.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	/* Adjust the low key if we are continuing from where we left off. */
+	if (info->low.rc_blockcount > 0) {
+		info->low.rc_startblock += info->low.rc_blockcount;
+
+		start_rtbno += info->low.rc_blockcount;
+		if (xfs_rtb_to_daddr(mp, start_rtbno) >= eofs)
+			return 0;
+	}
+
+	info->high.rc_startblock = -1U;
+	info->high.rc_blockcount = 0;
+	info->high.rc_refcount = 0;
+	info->high.rc_domain = XFS_REFC_DOMAIN_SHARED;
+
+	start_rg = xfs_rtb_to_rgno(mp, start_rtbno);
+	end_rg = xfs_rtb_to_rgno(mp, end_rtbno);
+
+	/* Query each rtgroup */
+	while ((rtg = xfs_rtgroup_next_range(mp, rtg, start_rg, end_rg))) {
+		info->group = rtg_group(rtg);
+
+		/*
+		 * Set the rtgroup high key from the fsrefs high key if this
+		 * is the last rtgroup that we're querying.
+		 */
+		if (rtg_rgno(rtg) == end_rg)
+			info->high.rc_startblock = xfs_rtb_to_rgbno(mp,
+					end_rtbno);
+
+		if (info->refc_cur) {
+			xfs_btree_del_cursor(info->refc_cur, XFS_BTREE_NOERROR);
+			info->refc_cur = NULL;
+		}
+		if (locked_rtg)
+			xfs_rtgroup_unlock(locked_rtg, XFS_RTGLOCK_FSREFS);
+
+		trace_xfs_fsrefs_low_group_key(mp, info->dev, info->group,
+				&info->low);
+		trace_xfs_fsrefs_high_group_key(mp, info->dev, info->group,
+				&info->high);
+
+		xfs_rtgroup_lock(rtg, XFS_RTGLOCK_FSREFS);
+		locked_rtg = rtg;
+
+		/*
+		 * Fill the query with refcount records and synthesize
+		 * singly-owned block records from free space data.
+		 */
+		if (xfs_has_rtreflink(mp)) {
+			info->refc_cur = xfs_rtrefcountbt_init_cursor(tp, rtg);
+
+			error = xfs_refcount_query_range(info->refc_cur,
+					&info->low, &info->high,
+					xfs_fsrefs_rtdev_refcountbt_helper,
+					info);
+			if (error && error != -ECANCELED)
+				break;
+		}
+
+		/*
+		 * Synthesize refcount==1 records from the free space data
+		 * between the end of the last fsrefs record reported and the
+		 * end of the range. If we don't have refcount support, the
+		 * starting point will be the start of the query range.
+		 */
+		error = xfs_fsrefs_rtdev_fill_refcount_gap(tp, info,
+				info->high.rc_startblock);
+		if (error)
+			break;
+
+		/*
+		 * Set the rtgroup low key to the start of the rtgroup prior to
+		 * moving on to the next rtgroup.
+		 */
+		if (rtg_rgno(rtg) == start_rg)
+			memset(&info->low, 0, sizeof(info->low));
+		info->group = NULL;
+	}
+
+	if (info->refc_cur) {
+		xfs_btree_del_cursor(info->refc_cur, error);
+		info->refc_cur = NULL;
+	}
+	if (locked_rtg)
+		xfs_rtgroup_unlock(locked_rtg, XFS_RTGLOCK_FSREFS);
+	if (info->group) {
+		xfs_rtgroup_rele(rtg);
+		info->group = NULL;
+	} else if (rtg) {
+		/* loop termination case */
+		xfs_rtgroup_rele(rtg);
+	}
+
+	return error;
+}
+
+/* Synthesize fsrefs records from 64-bit rtbitmap records. */
+STATIC int
+xfs_fsrefs_rtdev_nogroups_helper(
+	struct xfs_rtgroup		*rtg,
+	struct xfs_trans		*tp,
+	const struct xfs_rtalloc_rec	*rec,
+	void				*priv)
+{
+	struct xfs_fsrefs_irec		frec = {
+		.refcount		= 1,
+	};
+	struct xfs_mount		*mp = rtg_mount(rtg);
+	struct xfs_fsrefs_info		*info = priv;
+	xfs_rtblock_t			next_rtb, rec_rtb, rtb;
+
+	/* Translate the free space record to group and block number. */
+	rec_rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+
+	/*
+	 * Figure out if there's a gap between the last fsrefs record we
+	 * emitted and this free extent. If there is, report the gap as a
+	 * refcount==1 record.
+	 */
+	next_rtb = xfs_daddr_to_rtb(mp, info->next_daddr);
+
+	ASSERT(rec_rtb >= next_rtb);
+
+	rtb = xfs_rtx_to_rtb(rtg, rec->ar_startext + rec->ar_extcount);
+	info->next_daddr = xfs_rtb_to_daddr(mp, rtb);
+
+	if (rec_rtb == next_rtb)
+		return 0;
+
+	/* Emit records for the in-use space. */
+	frec.start_daddr = xfs_rtb_to_daddr(mp, next_rtb);
+	frec.len_daddr = xfs_rtb_to_daddr(mp, rec_rtb - next_rtb);
+	return xfs_fsrefs_helper(tp, info, &frec);
+}
+
+/*
+ * Synthesize refcount information from the rtbitmap for a pre-rtgroups
+ * filesystem.
+ */ +STATIC int +xfs_fsrefs_rtdev_nogroups( + struct xfs_trans *tp, + const struct xfs_fsrefs *keys, + struct xfs_fsrefs_info *info) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_rtgroup *rtg = NULL; + xfs_rtblock_t start_rtbno; + xfs_rtblock_t end_rtbno; + xfs_rtxnum_t low_rtx; + xfs_rtxnum_t high_rtx; + uint64_t eofs; + int error = 0; + + eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks); + if (keys[0].fcr_physical >= eofs) + return 0; + start_rtbno = xfs_daddr_to_rtb(mp, keys[0].fcr_physical); + end_rtbno = xfs_daddr_to_rtb(mp, min(eofs - 1, keys[1].fcr_physical)); + + info->refc_cur = info->bno_cur = NULL; + + /* + * Convert the fsrefs low/high keys to rtgroup based keys. Initialize + * low to the fsrefs low key and max out the high key to the end of the + * rtgroup. + */ + info->low_daddr = keys[0].fcr_physical; + + /* Adjust the low key if we are continuing from where we left off. */ + if (keys[0].fcr_length > 0) { + info->low_daddr += keys[0].fcr_length; + if (info->low_daddr >= eofs) + return 0; + } + + rtg = xfs_rtgroup_grab(mp, 0); + if (!rtg) + return -EFSCORRUPTED; + + info->group = rtg_group(rtg); + + trace_xfs_fsrefs_low_linear_key(mp, info->dev, start_rtbno); + trace_xfs_fsrefs_high_linear_key(mp, info->dev, end_rtbno); + + xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP); + + /* + * Walk the whole rtbitmap. Without rtgroups, the startext values can + * be more than 32-bits wide, which is why we need this separate + * implementation. + */ + low_rtx = xfs_rtb_to_rtx(mp, start_rtbno); + high_rtx = xfs_rtb_to_rtx(mp, end_rtbno + mp->m_sb.sb_rextsize - 1); + if (low_rtx < high_rtx) + error = xfs_rtalloc_query_range(rtg, tp, low_rtx, high_rtx, + xfs_fsrefs_rtdev_nogroups_helper, info); + + info->group = NULL; + + xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP); + xfs_rtgroup_rele(rtg); + + return error; +} +#endif + /* Do we recognize the device? 
*/ STATIC bool xfs_fsrefs_is_valid_device( @@ -515,7 +904,14 @@ xfs_fsrefs_check_keys( return false; } +/* + * There are only two devices if we didn't configure RT devices at build time. + */ +#ifdef CONFIG_XFS_RT +#define XFS_GETFSREFS_DEVS 3 +#else #define XFS_GETFSREFS_DEVS 2 +#endif /* CONFIG_XFS_RT */ /* * Get filesystem's extent refcounts as described in head, and format for @@ -569,6 +965,15 @@ xfs_getfsrefs( handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev); handlers[1].fn = xfs_fsrefs_logdev; } +#ifdef CONFIG_XFS_RT + if (mp->m_rtdev_targp) { + handlers[2].dev = new_encode_dev(mp->m_rtdev_targp->bt_dev); + if (xfs_has_rtgroups(mp)) + handlers[2].fn = xfs_fsrefs_rtdev; + else + handlers[2].fn = xfs_fsrefs_rtdev_nogroups; + } +#endif /* CONFIG_XFS_RT */ xfs_sort(handlers, XFS_GETFSREFS_DEVS, sizeof(struct xfs_fsrefs_dev), xfs_fsrefs_dev_compare); ^ permalink raw reply related [flat|nested] 110+ messages in thread
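The gap-filling idea used by xfs_fsrefs_rtdev_nogroups_helper above (report the space between consecutive free extents as refcount==1 records) can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: struct free_extent, struct inuse_rec and synthesize_inuse() are hypothetical names, block numbers are simple integers with no daddr conversion, and the handling of the tail gap after the last free extent is an assumption about what the caller does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a free space record (not a kernel type). */
struct free_extent {
	uint64_t	start;		/* first free block */
	uint64_t	len;		/* number of free blocks */
};

/* Hypothetical stand-in for an emitted fsrefs record. */
struct inuse_rec {
	uint64_t	start;
	uint64_t	len;
	uint32_t	refcount;	/* always 1 in this sketch */
};

/*
 * Walk sorted, non-overlapping free extents and report every gap between
 * them inside [range_start, range_end) as a refcount==1 record.  Returns
 * the number of records stored in @out; records beyond @max are dropped.
 */
static size_t
synthesize_inuse(
	const struct free_extent	*fe,
	size_t				nr,
	uint64_t			range_start,
	uint64_t			range_end,
	struct inuse_rec		*out,
	size_t				max)
{
	uint64_t	next = range_start;	/* first block not yet accounted for */
	size_t		n = 0;
	size_t		i;

	assert(range_start <= range_end);

	for (i = 0; i < nr; i++) {
		uint64_t	fstart = fe[i].start;
		uint64_t	fend = fe[i].start + fe[i].len;

		if (fend <= next)
			continue;	/* entirely before the cursor */
		if (fstart >= range_end)
			break;		/* past the end of the query */

		/* Anything between the cursor and this free extent is in use. */
		if (fstart > next && n < max)
			out[n++] = (struct inuse_rec){
				.start = next,
				.len = fstart - next,
				.refcount = 1,
			};
		next = fend;
	}

	/* Assumed tail handling: space after the last free extent is in use. */
	if (next < range_end && n < max)
		out[n++] = (struct inuse_rec){
			.start = next,
			.len = range_end - next,
			.refcount = 1,
		};
	return n;
}
```

With free extents at blocks 2-4 and 8-9 in a 12 block range, the sketch reports the three in-use runs that lie around and between them.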
* [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
2024-12-31 23:38 ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
@ 2024-12-31 23:38 ` Darrick J. Wong
2024-12-31 23:38 ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
2024-12-31 23:38 ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Change the class of the fallocate tracepoints to capture the offset and
length of the requested operation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    8 ++++----
 fs/xfs/xfs_file.c      |    2 +-
 fs/xfs/xfs_trace.h     |   10 +++++-----
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 783349f2361ad3..c9e60fb2693c9b 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -652,7 +652,7 @@ xfs_alloc_file_space(
 	if (xfs_is_always_cow_inode(ip))
 		return 0;
 
-	trace_xfs_alloc_file_space(ip);
+	trace_xfs_alloc_file_space(ip, offset, len);
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -839,7 +839,7 @@ xfs_free_file_space(
 	xfs_fileoff_t		endoffset_fsb;
 	int			done = 0, error;
 
-	trace_xfs_free_file_space(ip);
+	trace_xfs_free_file_space(ip, offset, len);
 
 	error = xfs_qm_dqattach(ip);
 	if (error)
@@ -987,7 +987,7 @@ xfs_collapse_file_space(
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
-	trace_xfs_collapse_file_space(ip);
+	trace_xfs_collapse_file_space(ip, offset, len);
 
 	error = xfs_free_file_space(ip, offset, len, ac);
 	if (error)
@@ -1056,7 +1056,7 @@ xfs_insert_file_space(
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 
-	trace_xfs_insert_file_space(ip);
+	trace_xfs_insert_file_space(ip, offset, len);
 
 	error = xfs_bmap_can_insert_extents(ip, stop_fsb, shift_fsb);
 	if (error)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d31ad7bf29885d..b8f0b9a2998b9c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1362,7 +1362,7 @@ xfs_falloc_zero_range(
 	loff_t			new_size = 0;
 	int			error;
 
-	trace_xfs_zero_file_space(XFS_I(inode));
+	trace_xfs_zero_file_space(XFS_I(inode), offset, len);
 
 	error = xfs_falloc_newsize(file, mode, offset, len, &new_size);
 	if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7043b6481d5f97..e81247b3024e53 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -928,11 +928,6 @@ DEFINE_INODE_EVENT(xfs_getattr);
 DEFINE_INODE_EVENT(xfs_setattr);
 DEFINE_INODE_EVENT(xfs_readlink);
 DEFINE_INODE_EVENT(xfs_inactive_symlink);
-DEFINE_INODE_EVENT(xfs_alloc_file_space);
-DEFINE_INODE_EVENT(xfs_free_file_space);
-DEFINE_INODE_EVENT(xfs_zero_file_space);
-DEFINE_INODE_EVENT(xfs_collapse_file_space);
-DEFINE_INODE_EVENT(xfs_insert_file_space);
 DEFINE_INODE_EVENT(xfs_readdir);
 #ifdef CONFIG_XFS_POSIX_ACL
 DEFINE_INODE_EVENT(xfs_get_acl);
@@ -1732,6 +1727,11 @@ DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append);
 DEFINE_SIMPLE_IO_EVENT(xfs_file_splice_read);
 DEFINE_SIMPLE_IO_EVENT(xfs_zoned_map_blocks);
+DEFINE_SIMPLE_IO_EVENT(xfs_alloc_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space);
+DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 3/4] xfs: add an ioctl to map free space into a file
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
2024-12-31 23:38 ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
2024-12-31 23:38 ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
@ 2024-12-31 23:38 ` Darrick J. Wong
2024-12-31 23:38 ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new ioctl to map free physical space into a file, at the same file
offset as if the file were a sparse image of the physical device backing
the filesystem. The intent here is to use this to prototype a free space
defragmentation tool.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   88 +++++++++++++
 fs/xfs/libxfs/xfs_alloc.h |    3 
 fs/xfs/libxfs/xfs_bmap.c  |    1 
 fs/xfs/libxfs/xfs_fs.h    |   14 ++
 fs/xfs/xfs_bmap_util.c    |  303 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h    |    1 
 fs/xfs/xfs_file.c         |  139 +++++++++++++++++++++
 fs/xfs/xfs_file.h         |    2 
 fs/xfs/xfs_ioctl.c        |    5 +
 fs/xfs/xfs_trace.h        |   35 +++++
 10 files changed, 591 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 3d33e17f2e5ce0..e689ec5cbccd7e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -4168,3 +4168,91 @@ xfs_extfree_intent_destroy_cache(void)
 	kmem_cache_destroy(xfs_extfree_item_cache);
 	xfs_extfree_item_cache = NULL;
 }
+
+/*
+ * Find the next chunk of free space in @pag starting at @agbno and going no
+ * higher than @end_agbno. Set @agbno and @len to whatever free space we find,
+ * or to @end_agbno if we find no space.
+ */ +int +xfs_alloc_find_freesp( + struct xfs_trans *tp, + struct xfs_perag *pag, + xfs_agblock_t *agbno, + xfs_agblock_t end_agbno, + xfs_extlen_t *len) +{ + struct xfs_mount *mp = pag_mount(pag); + struct xfs_btree_cur *cur; + struct xfs_buf *agf_bp = NULL; + xfs_agblock_t found_agbno; + xfs_extlen_t found_len; + int found; + int error; + + trace_xfs_alloc_find_freesp(pag_group(pag), *agbno, + end_agbno - *agbno); + + error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp); + if (error) + return error; + + cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag); + + /* Try to find a free extent that starts before here. */ + error = xfs_alloc_lookup_le(cur, *agbno, 0, &found); + if (error) + goto out_cur; + if (found) { + error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, + &found); + if (error) + goto out_cur; + if (XFS_IS_CORRUPT(mp, !found)) { + xfs_btree_mark_sick(cur); + error = -EFSCORRUPTED; + goto out_cur; + } + + if (found_agbno + found_len > *agbno) + goto found; + } + + /* Examine the next record if free extent not in range. */ + error = xfs_btree_increment(cur, 0, &found); + if (error) + goto out_cur; + if (!found) + goto next_ag; + + error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, &found); + if (error) + goto out_cur; + if (XFS_IS_CORRUPT(mp, !found)) { + xfs_btree_mark_sick(cur); + error = -EFSCORRUPTED; + goto out_cur; + } + + if (found_agbno >= end_agbno) + goto next_ag; + +found: + /* Found something, so update the mapping. */ + trace_xfs_alloc_find_freesp_done(pag_group(pag), found_agbno, + found_len); + if (found_agbno < *agbno) { + found_len -= *agbno - found_agbno; + found_agbno = *agbno; + } + *len = found_len; + *agbno = found_agbno; + goto out_cur; +next_ag: + /* Found nothing, so advance the cursor beyond the end of the range. 
*/ + *agbno = end_agbno; + *len = 0; +out_cur: + xfs_btree_del_cursor(cur, error); + return error; +} diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h index 50ef79a1ed41a1..069077d9ad2f8c 100644 --- a/fs/xfs/libxfs/xfs_alloc.h +++ b/fs/xfs/libxfs/xfs_alloc.h @@ -286,5 +286,8 @@ void xfs_extfree_intent_destroy_cache(void); xfs_failaddr_t xfs_validate_ag_length(struct xfs_buf *bp, uint32_t seqno, uint32_t length); +int xfs_alloc_find_freesp(struct xfs_trans *tp, struct xfs_perag *pag, + xfs_agblock_t *agbno, xfs_agblock_t end_agbno, + xfs_extlen_t *len); #endif /* __XFS_ALLOC_H__ */ diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 8c9d540c3ba91a..11dab550ca0fb6 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -41,6 +41,7 @@ #include "xfs_inode_util.h" #include "xfs_rtgroup.h" #include "xfs_zone_alloc.h" +#include "xfs_rtalloc.h" struct kmem_cache *xfs_bmap_intent_cache; diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index 936f719236944f..f4128dbdf3b9a2 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -1087,6 +1087,19 @@ xfs_getfsrefs_advance( /* fcr_flags values - returned for each non-header segment */ #define FCR_OF_LAST (1U << 0) /* last record in the dataset */ +/* map free space to file */ + +/* + * XFS_IOC_MAP_FREESP maps all the free physical space in the filesystem into + * the file at the same offsets. This ioctl requires CAP_SYS_ADMIN. 
+ */ +struct xfs_map_freesp { + __s64 offset; /* disk address to map, in bytes */ + __s64 len; /* length in bytes */ + __u64 flags; /* must be zero */ + __u64 pad; /* must be zero */ +}; + /* * ioctl commands that are used by Linux filesystems */ @@ -1127,6 +1140,7 @@ xfs_getfsrefs_advance( #define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head) #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry) #define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) +#define XFS_IOC_MAP_FREESP _IOW ('X', 67, struct xfs_map_freesp) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index c9e60fb2693c9b..8d5c2072bcd533 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -31,6 +31,10 @@ #include "xfs_rtbitmap.h" #include "xfs_rtgroup.h" #include "xfs_zone_alloc.h" +#include "xfs_health.h" +#include "xfs_alloc_btree.h" +#include "xfs_rmap.h" +#include "xfs_ag.h" /* Kernel only BMAP related definitions and functions */ @@ -1916,3 +1920,302 @@ xfs_convert_rtbigalloc_file_space( return 0; } #endif /* CONFIG_XFS_RT */ + +/* + * Reserve space and quota to this transaction to map in as much free space + * as we can. Callers should set @len to the amount of space desired; this + * function will shorten that quantity if it can't get space. + */ +STATIC int +xfs_map_free_reserve_more( + struct xfs_trans *tp, + struct xfs_inode *ip, + xfs_extlen_t *len) +{ + struct xfs_mount *mp = ip->i_mount; + unsigned int dblocks; + unsigned int rblocks; + unsigned int min_len; + bool isrt = XFS_IS_REALTIME_INODE(ip); + int error; + + if (*len > XFS_MAX_BMBT_EXTLEN) + *len = XFS_MAX_BMBT_EXTLEN; + min_len = isrt ? 
mp->m_sb.sb_rextsize : 1; + +again: + if (isrt) { + dblocks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK); + rblocks = *len; + } else { + dblocks = XFS_DIOSTRAT_SPACE_RES(mp, *len); + rblocks = 0; + } + error = xfs_trans_reserve_more_inode(tp, ip, dblocks, rblocks, false); + if (error == -ENOSPC && *len > min_len) { + *len >>= 1; + goto again; + } + if (error) { + trace_xfs_map_free_reserve_more_fail(ip, error, _RET_IP_); + return error; + } + + return 0; +} + +static inline xfs_fileoff_t +xfs_fsblock_to_fileoff( + struct xfs_mount *mp, + xfs_fsblock_t fsbno) +{ + xfs_daddr_t daddr = XFS_FSB_TO_DADDR(mp, fsbno); + + return XFS_B_TO_FSB(mp, BBTOB(daddr)); +} + +/* + * Given a file and a free physical extent, map it into the file at the same + * offset if the file were a sparse image of the physical device. Set @mval to + * whatever mapping we added to the file. + */ +STATIC int +xfs_map_free_ag_extent( + struct xfs_trans *tp, + struct xfs_inode *ip, + struct xfs_perag *pag, + xfs_agblock_t agbno, + xfs_extlen_t len, + struct xfs_bmbt_irec *mval) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_alloc_arg args = { + .mp = mp, + .tp = tp, + .pag = pag, + .oinfo = XFS_RMAP_OINFO_SKIP_UPDATE, + .resv = XFS_AG_RESV_NONE, + .prod = 1, + .datatype = XFS_ALLOC_USERDATA, + .maxlen = len, + .minlen = 1, + }; + struct xfs_bmbt_irec irec; + xfs_fsblock_t fsbno = xfs_gbno_to_fsb(pag_group(pag), agbno); + xfs_fileoff_t startoff = xfs_fsblock_to_fileoff(mp, fsbno); + int nimaps; + int error; + + ASSERT(!XFS_IS_REALTIME_INODE(ip)); + + trace_xfs_map_free_ag_extent(ip, fsbno, len); + + /* Make sure the entire range is a hole. 
*/ + nimaps = 1; + error = xfs_bmapi_read(ip, startoff, len, &irec, &nimaps, 0); + if (error) + return error; + + if (irec.br_startoff != startoff || + irec.br_startblock != HOLESTARTBLOCK || + irec.br_blockcount < len) + return -EINVAL; + + error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + return error; + + /* + * Allocate the physical extent. We should not have dropped the lock + * since the scan of the free space metadata, so this should work, + * though the length may be adjusted to play nicely with metadata space + * reservations. + */ + error = xfs_alloc_vextent_exact_bno(&args, fsbno); + if (error) + return error; + if (args.fsbno == NULLFSBLOCK) { + /* + * We were promised the space, but failed to get it. This + * could be because the space is reserved for metadata + * expansion, or it could be because the AGFL fixup grabbed the + * first block we wanted. Either way, if the transaction is + * dirty we must commit it and tell the caller to try again. + */ + if (tp->t_flags & XFS_TRANS_DIRTY) + return -EAGAIN; + return -ENOSPC; + } + if (args.fsbno != fsbno) { + ASSERT(0); + xfs_bmap_mark_sick(ip, XFS_DATA_FORK); + return -EFSCORRUPTED; + } + + /* Map extent into file, update quota. */ + mval->br_blockcount = args.len; + mval->br_startblock = fsbno; + mval->br_startoff = startoff; + mval->br_state = XFS_EXT_UNWRITTEN; + + trace_xfs_map_free_ag_extent_done(ip, mval); + + xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, mval); + xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, + mval->br_blockcount); + + return 0; +} + +/* Find a free extent in this AG and map it into the file. 
*/ +STATIC int +xfs_map_free_extent( + struct xfs_inode *ip, + struct xfs_perag *pag, + xfs_agblock_t *cursor, + xfs_agblock_t end_agbno, + xfs_agblock_t *last_enospc_agbno) +{ + struct xfs_bmbt_irec irec; + struct xfs_mount *mp = ip->i_mount; + struct xfs_trans *tp; + loff_t endpos; + xfs_extlen_t free_len, map_len; + int error; + + if (fatal_signal_pending(current)) + return -EINTR; + + error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, 0, 0, false, + &tp); + if (error) + return error; + + error = xfs_alloc_find_freesp(tp, pag, cursor, end_agbno, &free_len); + if (error) + goto out_cancel; + + /* Bail out if the cursor is beyond what we asked for. */ + if (*cursor >= end_agbno) + goto out_cancel; + + error = xfs_map_free_reserve_more(tp, ip, &free_len); + if (error) + goto out_cancel; + + map_len = free_len; + do { + error = xfs_map_free_ag_extent(tp, ip, pag, *cursor, map_len, + &irec); + if (error == -EAGAIN) { + /* Failed to map space but were told to try again. */ + error = xfs_trans_commit(tp); + goto out; + } + if (error != -ENOSPC) + break; + /* + * If we can't get the space, try asking for successively less + * space in case we're bumping up against per-AG metadata + * reservation limits. + */ + map_len >>= 1; + } while (map_len > 0); + if (error == -ENOSPC) { + if (*last_enospc_agbno != *cursor) { + /* + * However, backing off on the size of the mapping + * request might not work if an AGFL fixup allocated + * the block at *cursor. The first time this happens, + * remember that we ran out of space here, and try + * again. + */ + *last_enospc_agbno = *cursor; + } else { + /* + * If we hit this a second time on the same extent, + * then it's likely that we're bumping up against + * per-AG space reservation limits. Skip to the next + * extent. + */ + *cursor += free_len; + } + error = 0; + goto out_cancel; + } + if (error) + goto out_cancel; + + /* Update isize if needed. 
*/ + endpos = XFS_FSB_TO_B(mp, irec.br_startoff + irec.br_blockcount); + if (endpos > i_size_read(VFS_I(ip))) { + i_size_write(VFS_I(ip), endpos); + ip->i_disk_size = endpos; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); + } + + error = xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + if (error) + return error; + + *cursor += irec.br_blockcount; + return 0; +out_cancel: + xfs_trans_cancel(tp); +out: + xfs_iunlock(ip, XFS_ILOCK_EXCL); + return error; +} + +/* + * Allocate all free physical space between off and len and map it to this + * regular non-realtime file. + */ +int +xfs_map_free_space( + struct xfs_inode *ip, + xfs_off_t off, + xfs_off_t len) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_perag *pag = NULL; + xfs_daddr_t off_daddr = BTOBB(off); + xfs_daddr_t end_daddr = BTOBBT(off + len); + xfs_fsblock_t off_fsb = XFS_DADDR_TO_FSB(mp, off_daddr); + xfs_fsblock_t end_fsb = XFS_DADDR_TO_FSB(mp, end_daddr); + xfs_agnumber_t off_agno = XFS_FSB_TO_AGNO(mp, off_fsb); + xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsb); + int error = 0; + + trace_xfs_map_free_space(ip, off, len); + + while ((pag = xfs_perag_next_range(mp, pag, off_agno, + mp->m_sb.sb_agcount - 1))) { + xfs_agblock_t off_agbno = 0; + xfs_agblock_t end_agbno; + xfs_agblock_t last_enospc_agbno = NULLAGBLOCK; + + end_agbno = xfs_ag_block_count(mp, pag_agno(pag)); + + if (pag_agno(pag) == off_agno) + off_agbno = XFS_FSB_TO_AGBNO(mp, off_fsb); + if (pag_agno(pag) == end_agno) + end_agbno = XFS_FSB_TO_AGBNO(mp, end_fsb); + + while (off_agbno < end_agbno) { + error = xfs_map_free_extent(ip, pag, &off_agbno, + end_agbno, &last_enospc_agbno); + if (error) + goto out; + } + } + +out: + if (pag) + xfs_perag_rele(pag); + if (error == -ENOSPC) + return 0; + return error; +} diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index c39cce66829e26..5d84b702b16326 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -63,6 +63,7 @@ int xfs_collapse_file_space(struct 
xfs_inode *, xfs_off_t offset, xfs_off_t len, struct xfs_zone_alloc_ctx *ac); int xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset, xfs_off_t len); +int xfs_map_free_space(struct xfs_inode *ip, xfs_off_t off, xfs_off_t len); /* EOF block manipulation functions */ bool xfs_can_free_eofblocks(struct xfs_inode *ip); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index b8f0b9a2998b9c..8bf1e96ab57a5b 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -34,6 +34,7 @@ #include <linux/mman.h> #include <linux/fadvise.h> #include <linux/mount.h> +#include <linux/fsnotify.h> static const struct vm_operations_struct xfs_file_vm_ops; @@ -1548,6 +1549,144 @@ xfs_file_fallocate( return error; } +STATIC int +xfs_file_map_freesp( + struct file *file, + const struct xfs_map_freesp *mf) +{ + struct inode *inode = file_inode(file); + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + xfs_off_t device_size; + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; + loff_t new_size = 0; + int error; + + xfs_ilock(ip, iolock); + error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP); + if (error) + goto out_unlock; + + /* + * Must wait for all AIO to complete before we continue as AIO can + * change the file size on completion without holding any locks we + * currently hold. We must do this first because AIO can update both + * the on disk and in memory inode sizes, and the operations that follow + * require the in-memory size to be fully up-to-date. + */ + inode_dio_wait(inode); + + error = file_modified(file); + if (error) + goto out_unlock; + + if (XFS_IS_REALTIME_INODE(ip)) { + error = -EOPNOTSUPP; + goto out_unlock; + } + device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks); + + /* + * Bail out now if we aren't allowed to make the file size the + * same length as the device. 
+ */ + if (device_size > i_size_read(inode)) { + new_size = device_size; + error = inode_newsize_ok(inode, new_size); + if (error) + goto out_unlock; + } + + error = xfs_map_free_space(ip, mf->offset, mf->len); + if (error) { + if (error == -ECANCELED) + error = 0; + goto out_unlock; + } + + /* Change file size if needed */ + if (new_size) { + struct iattr iattr; + + iattr.ia_valid = ATTR_SIZE; + iattr.ia_size = new_size; + error = xfs_vn_setattr_size(file_mnt_idmap(file), + file_dentry(file), &iattr); + if (error) + goto out_unlock; + } + + if (xfs_file_sync_writes(file)) + error = xfs_log_force_inode(ip); + +out_unlock: + xfs_iunlock(ip, iolock); + return error; +} + +long +xfs_ioc_map_freesp( + struct file *file, + struct xfs_map_freesp __user *argp) +{ + struct xfs_map_freesp args; + struct inode *inode = file_inode(file); + int error; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + if (args.flags || args.pad) + return -EINVAL; + + if (args.offset < 0 || args.len <= 0) + return -EINVAL; + + if (!(file->f_mode & FMODE_WRITE)) + return -EBADF; + + /* + * We can only allow pure fallocate on append only files + */ + if (IS_APPEND(inode)) + return -EPERM; + + if (IS_IMMUTABLE(inode)) + return -EPERM; + + /* + * We cannot allow any fallocate operation on an active swapfile + */ + if (IS_SWAPFILE(inode)) + return -ETXTBSY; + + if (S_ISFIFO(inode->i_mode)) + return -ESPIPE; + + if (S_ISDIR(inode->i_mode)) + return -EISDIR; + + if (!S_ISREG(inode->i_mode)) + return -ENODEV; + + /* Check for wrap through zero too */ + if (args.offset + args.len > inode->i_sb->s_maxbytes) + return -EFBIG; + if (args.offset + args.len < 0) + return -EFBIG; + + file_start_write(file); + error = xfs_file_map_freesp(file, &args); + if (!error) + fsnotify_modify(file); + + file_end_write(file); + return error; +} + STATIC int xfs_file_fadvise( struct file *file, diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h index 
24490ea49e16c6..c9d50699baba85 100644 --- a/fs/xfs/xfs_file.h +++ b/fs/xfs/xfs_file.h @@ -15,4 +15,6 @@ bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos, bool xfs_truncate_needs_cow_around(struct xfs_inode *ip, loff_t pos); int xfs_file_unshare_at(struct xfs_inode *ip, loff_t pos); +long xfs_ioc_map_freesp(struct file *file, struct xfs_map_freesp __user *argp); + #endif /* __XFS_FILE_H__ */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 20f013bd4ce653..092a3699ff9e75 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -45,6 +45,8 @@ #include <linux/mount.h> #include <linux/fileattr.h> +#include <linux/security.h> +#include <linux/fsnotify.h> /* Return 0 on success or positive error */ int @@ -1429,6 +1431,9 @@ xfs_file_ioctl( case XFS_IOC_COMMIT_RANGE: return xfs_ioc_commit_range(filp, arg); + case XFS_IOC_MAP_FREESP: + return xfs_ioc_map_freesp(filp, arg); + default: return -ENOTTY; } diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index e81247b3024e53..ebbc832db8fa1e 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1732,6 +1732,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space); +DEFINE_SIMPLE_IO_EVENT(xfs_map_free_space); DECLARE_EVENT_CLASS(xfs_itrunc_class, TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size), @@ -1821,6 +1822,36 @@ TRACE_EVENT(xfs_bunmap, ); +DECLARE_EVENT_CLASS(xfs_map_free_extent_class, + TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len), + TP_ARGS(ip, bno, len), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_fsize_t, size) + __field(xfs_fileoff_t, bno) + __field(xfs_extlen_t, len) + ), + TP_fast_assign( + __entry->dev = VFS_I(ip)->i_sb->s_dev; + __entry->ino = ip->i_ino; + __entry->size = ip->i_disk_size; + __entry->bno = bno; + __entry->len = len; + ), + TP_printk("dev %d:%d ino 0x%llx disize 0x%llx 
fileoff 0x%llx fsbcount 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->size, + __entry->bno, + __entry->len) +); +#define DEFINE_MAP_FREE_EXTENT_EVENT(name) \ +DEFINE_EVENT(xfs_map_free_extent_class, name, \ + TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len), \ + TP_ARGS(ip, bno, len)) +DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_ag_extent); + DECLARE_EVENT_CLASS(xfs_extent_busy_class, TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno, xfs_extlen_t len), @@ -1856,6 +1887,8 @@ DEFINE_BUSY_EVENT(xfs_extent_busy); DEFINE_BUSY_EVENT(xfs_extent_busy_force); DEFINE_BUSY_EVENT(xfs_extent_busy_reuse); DEFINE_BUSY_EVENT(xfs_extent_busy_clear); +DEFINE_BUSY_EVENT(xfs_alloc_find_freesp); +DEFINE_BUSY_EVENT(xfs_alloc_find_freesp_done); TRACE_EVENT(xfs_extent_busy_trim, TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno, @@ -3962,6 +3995,7 @@ DECLARE_EVENT_CLASS(xfs_inode_irec_class, DEFINE_EVENT(xfs_inode_irec_class, name, \ TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \ TP_ARGS(ip, irec)) +DEFINE_INODE_IREC_EVENT(xfs_map_free_ag_extent_done); /* inode iomap invalidation events */ DECLARE_EVENT_CLASS(xfs_wb_invalid_class, @@ -4096,6 +4130,7 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_blocks_error); DEFINE_INODE_ERROR_EVENT(xfs_reflink_remap_extent_error); DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_src); DEFINE_INODE_IREC_EVENT(xfs_reflink_remap_extent_dest); +DEFINE_INODE_ERROR_EVENT(xfs_map_free_reserve_more_fail); /* dedupe tracepoints */ DEFINE_DOUBLE_IO_EVENT(xfs_reflink_compare_extents); ^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
` (2 preceding siblings ...)
2024-12-31 23:38 ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
@ 2024-12-31 23:38 ` Darrick J. Wong
3 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:38 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Implement mapfree for realtime space.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  202 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h |    2 
 fs/xfs/xfs_file.c      |   14 ++-
 fs/xfs/xfs_rtalloc.c   |  108 ++++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h   |    7 ++
 fs/xfs/xfs_trace.h     |   41 ++++++++++
 6 files changed, 368 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 8d5c2072bcd533..83e6c27f63a969 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -2219,3 +2219,205 @@ xfs_map_free_space(
 		return 0;
 	return error;
 }
+
+#ifdef CONFIG_XFS_RT
+/*
+ * Given a file and a free rt extent, map it into the file at the same offset
+ * if the file were a sparse image of the physical device. Set @mval to
+ * whatever mapping we added to the file.
+ */
+STATIC int
+xfs_map_free_rtgroup_extent(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_rtgroup	*rtg,
+	xfs_rtxnum_t		rtx,
+	xfs_rtxlen_t		rtxlen,
+	struct xfs_bmbt_irec	*mval)
+{
+	struct xfs_bmbt_irec	irec;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fsblock_t		fsbno = xfs_rtx_to_rtb(rtg, rtx);
+	xfs_fileoff_t		startoff = fsbno;
+	xfs_extlen_t		len = xfs_rtbxlen_to_blen(mp, rtxlen);
+	int			nimaps;
+	int			error;
+
+	ASSERT(XFS_IS_REALTIME_INODE(ip));
+
+	trace_xfs_map_free_rt_extent(ip, fsbno, len);
+
+	/* Make sure the entire range is a hole.
*/ + nimaps = 1; + error = xfs_bmapi_read(ip, startoff, len, &irec, &nimaps, 0); + if (error) + return error; + + if (irec.br_startoff != startoff || + irec.br_startblock != HOLESTARTBLOCK || + irec.br_blockcount < len) + return -EINVAL; + + error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK, + XFS_IEXT_ADD_NOSPLIT_CNT); + if (error) + return error; + + /* + * Allocate the physical extent. We should not have dropped the lock + * since the scan of the free space metadata, so this should work, + * though the length may be adjusted to play nicely with metadata space + * reservations. + */ + error = xfs_rtallocate_exact(tp, rtg, rtx, rtxlen); + if (error) + return error; + + /* Map extent into file, update quota. */ + mval->br_blockcount = len; + mval->br_startblock = fsbno; + mval->br_startoff = startoff; + mval->br_state = XFS_EXT_UNWRITTEN; + + trace_xfs_map_free_rt_extent_done(ip, mval); + + xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, mval); + xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_RTBCOUNT, + mval->br_blockcount); + + return 0; +} + +/* Find a free extent in this rtgroup and map it into the file. */ +STATIC int +xfs_map_free_rt_extent( + struct xfs_inode *ip, + struct xfs_rtgroup *rtg, + xfs_rtxnum_t *cursor, + xfs_rtxnum_t end_rtx) +{ + struct xfs_bmbt_irec irec; + struct xfs_mount *mp = ip->i_mount; + struct xfs_trans *tp; + loff_t endpos; + xfs_rtxlen_t len_rtx; + xfs_extlen_t free_len; + int error; + + if (fatal_signal_pending(current)) + return -EINTR; + + error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, 0, 0, false, + &tp); + if (error) + return error; + + xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP); + + error = xfs_rtallocate_find_freesp(tp, rtg, cursor, end_rtx, &len_rtx); + if (error) + goto out_rtglock; + + /* + * If off_rtx is beyond the end of the rt device or is past what the + * user asked for, bail out. 
+ */ + if (*cursor >= end_rtx) + goto out_rtglock; + + free_len = xfs_rtxlen_to_extlen(mp, len_rtx); + error = xfs_map_free_reserve_more(tp, ip, &free_len); + if (error) + goto out_rtglock; + + error = xfs_map_free_rtgroup_extent(tp, ip, rtg, *cursor, len_rtx, + &irec); + if (error == -EAGAIN) { + /* + * The allocator was busy and told us to try again. The + * transaction could be dirty due to a nrext64 upgrade, so + * commit the transaction and try again without advancing + * the cursor. + * + * XXX do we fail to unlock something here? + */ + xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP); + error = xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + return error; + } + if (error) + goto out_cancel; + + /* Update isize if needed. */ + endpos = XFS_FSB_TO_B(mp, irec.br_startoff + irec.br_blockcount); + if (endpos > i_size_read(VFS_I(ip))) { + i_size_write(VFS_I(ip), endpos); + ip->i_disk_size = endpos; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); + } + + error = xfs_trans_commit(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + if (error) + return error; + + ASSERT(xfs_blen_to_rtxoff(mp, irec.br_blockcount) == 0); + *cursor += xfs_extlen_to_rtxlen(mp, irec.br_blockcount); + return 0; +out_rtglock: + xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP); +out_cancel: + xfs_trans_cancel(tp); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + return error; +} + +/* + * Allocate all free physical space between off and len and map it to this + * regular realtime file. 
+ */ +int +xfs_map_free_rt_space( + struct xfs_inode *ip, + xfs_off_t off, + xfs_off_t len) +{ + struct xfs_mount *mp = ip->i_mount; + struct xfs_rtgroup *rtg = NULL; + xfs_daddr_t off_daddr = BTOBB(off); + xfs_daddr_t end_daddr = BTOBBT(off + len); + xfs_rtblock_t off_rtb = xfs_daddr_to_rtb(mp, off_daddr); + xfs_rtblock_t end_rtb = xfs_daddr_to_rtb(mp, end_daddr); + xfs_rgnumber_t off_rgno = xfs_rtb_to_rgno(mp, off_rtb); + xfs_rgnumber_t end_rgno = xfs_rtb_to_rgno(mp, end_rtb); + int error = 0; + + trace_xfs_map_free_rt_space(ip, off, len); + + while ((rtg = xfs_rtgroup_next_range(mp, rtg, off_rgno, + mp->m_sb.sb_rgcount))) { + xfs_rtxnum_t off_rtx = 0; + xfs_rtxnum_t end_rtx = rtg->rtg_extents; + + if (rtg_rgno(rtg) == off_rgno) + off_rtx = xfs_rtb_to_rtx(mp, off_rtb); + if (rtg_rgno(rtg) == end_rgno) + end_rtx = min(end_rtx, xfs_rtb_to_rtx(mp, end_rtb)); + + while (off_rtx < end_rtx) { + error = xfs_map_free_rt_extent(ip, rtg, &off_rtx, + end_rtx); + if (error) + goto out; + } + } + +out: + if (rtg) + xfs_rtgroup_rele(rtg); + if (error == -ENOSPC) + return 0; + return error; +} +#endif diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h index 5d84b702b16326..0e16fbfef6cd09 100644 --- a/fs/xfs/xfs_bmap_util.h +++ b/fs/xfs/xfs_bmap_util.h @@ -85,8 +85,10 @@ int xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset, #ifdef CONFIG_XFS_RT int xfs_convert_rtbigalloc_file_space(struct xfs_inode *ip, loff_t pos, uint64_t len); +int xfs_map_free_rt_space(struct xfs_inode *ip, xfs_off_t off, xfs_off_t len); #else # define xfs_convert_rtbigalloc_file_space(ip, pos, len) (-EOPNOTSUPP) +# define xfs_map_free_rt_space(ip, off, len) (-EOPNOTSUPP) #endif #endif /* __XFS_BMAP_UTIL_H__ */ diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 8bf1e96ab57a5b..ceb7936e5fd9a3 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1580,11 +1580,10 @@ xfs_file_map_freesp( if (error) goto out_unlock; - if (XFS_IS_REALTIME_INODE(ip)) { - error = -EOPNOTSUPP; - goto 
out_unlock; - } - device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks); + if (XFS_IS_REALTIME_INODE(ip)) + device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_rblocks); + else + device_size = XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks); /* * Bail out now if we aren't allowed to make the file size the @@ -1597,7 +1596,10 @@ xfs_file_map_freesp( goto out_unlock; } - error = xfs_map_free_space(ip, mf->offset, mf->len); + if (XFS_IS_REALTIME_INODE(ip)) + error = xfs_map_free_rt_space(ip, mf->offset, mf->len); + else + error = xfs_map_free_space(ip, mf->offset, mf->len); if (error) { if (error == -ECANCELED) error = 0; diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c index 2728c568ac5a8a..0a4e087b11b60e 100644 --- a/fs/xfs/xfs_rtalloc.c +++ b/fs/xfs/xfs_rtalloc.c @@ -2230,3 +2230,111 @@ xfs_bmap_rtalloc( xfs_bmap_alloc_account(ap); return 0; } + +/* + * Find the next free realtime extent starting at @rtx and going no higher than + * @end_rtx. Set @rtx and @len_rtx to whatever free extents we find, or to + * @end_rtx if we find no space. + */ +int +xfs_rtallocate_find_freesp( + struct xfs_trans *tp, + struct xfs_rtgroup *rtg, + xfs_rtxnum_t *rtx, + xfs_rtxnum_t end_rtx, + xfs_rtxlen_t *len_rtx) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_rtalloc_args args = { + .rtg = rtg, + .mp = mp, + .tp = tp, + }; + const unsigned int max_rtxlen = + xfs_blen_to_rtbxlen(mp, XFS_MAX_BMBT_EXTLEN); + int error; + + trace_xfs_rtallocate_find_freesp(rtg, *rtx, end_rtx - *rtx); + + while (*rtx < end_rtx) { + xfs_rtblock_t next_rtx; + int is_free = 0; + + if (fatal_signal_pending(current)) + return -EINTR; + + /* Is the first rtx in the range free? */ + error = xfs_rtcheck_range(&args, *rtx, 1, 1, &next_rtx, + &is_free); + if (error) + return error; + + /* Free or not, how many more rtx have the same status? 
*/ + error = xfs_rtfind_forw(&args, *rtx, end_rtx, &next_rtx); + if (error) + return error; + + if (is_free) { + *len_rtx = min_t(xfs_rtxlen_t, max_rtxlen, + next_rtx - *rtx + 1); + + trace_xfs_rtallocate_find_freesp_done(rtg, *rtx, + *len_rtx); + return 0; + } + + *rtx = next_rtx + 1; + } + + return 0; +} + +/* Allocate exactly this space from the rt device. */ +int +xfs_rtallocate_exact( + struct xfs_trans *tp, + struct xfs_rtgroup *rtg, + xfs_rtxnum_t rtx, + xfs_rtxlen_t len) +{ + struct xfs_mount *mp = tp->t_mountp; + struct xfs_rtalloc_args args = { + .rtg = rtg, + .mp = mp, + .tp = tp, + }; + int error; + + trace_xfs_rtallocate_exact(rtg, rtx, len); + + if (xfs_has_rtgroups(mp)) { + xfs_rtxnum_t resrtx = rtx; + xfs_rtxlen_t reslen = len; + + /* + * Never pass 0 for start here so that the busy extent code + * knows that we wanted a near allocation and will flush the + * log to wait for the start to become available. + */ + error = xfs_rtallocate_adjust_for_busy(&args, rtx ? rtx : 1, 1, + len, &reslen, 1, &resrtx); + if (error) + return error; + + if (resrtx != rtx) { + ASSERT(resrtx == rtx); + return -EAGAIN; + } + + len = reslen; + } + + xfs_rtgroup_trans_join(tp, rtg, XFS_RTGLOCK_BITMAP); + + error = xfs_rtallocate_range(&args, rtx, len); + if (error) + return error; + + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS, -(long)len); + return 0; +} diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h index 0d95b29092c9f3..745af8a2798d36 100644 --- a/fs/xfs/xfs_rtalloc.h +++ b/fs/xfs/xfs_rtalloc.h @@ -10,6 +10,7 @@ struct xfs_mount; struct xfs_trans; +struct xfs_rtgroup; #ifdef CONFIG_XFS_RT /* rtgroup superblock initialization */ @@ -48,6 +49,10 @@ xfs_growfs_rt( int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp); int xfs_growfs_check_rtgeom(const struct xfs_mount *mp, xfs_rfsblock_t dblocks, xfs_rfsblock_t rblocks, xfs_agblock_t rextsize); +int xfs_rtallocate_find_freesp(struct xfs_trans *tp, struct xfs_rtgroup *rtg, + xfs_rtxnum_t *rtx, xfs_rtxnum_t 
end_rtx, xfs_rtxlen_t *len_rtx); +int xfs_rtallocate_exact(struct xfs_trans *tp, struct xfs_rtgroup *rtg, + xfs_rtxnum_t rtx, xfs_rtxlen_t rtxlen); #else # define xfs_growfs_rt(mp,in) (-ENOSYS) # define xfs_rtalloc_reinit_frextents(m) (0) @@ -67,6 +72,8 @@ xfs_rtmount_init( # define xfs_rtunmount_inodes(m) # define xfs_rt_resv_free(mp) ((void)0) # define xfs_rt_resv_init(mp) (0) +# define xfs_rtallocate_find_freesp(...) (-EOPNOTSUPP) +# define xfs_rtallocate_exact(...) (-EOPNOTSUPP) static inline int xfs_growfs_check_rtgeom(const struct xfs_mount *mp, diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index ebbc832db8fa1e..76f5d78b6a6e09 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -105,6 +105,7 @@ struct xfs_rtgroup; struct xfs_open_zone; struct xfs_fsrefs; struct xfs_fsrefs_irec; +struct xfs_rtgroup; #define XFS_ATTR_FILTER_FLAGS \ { XFS_ATTR_ROOT, "ROOT" }, \ @@ -1732,6 +1733,9 @@ DEFINE_SIMPLE_IO_EVENT(xfs_free_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_zero_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_collapse_file_space); DEFINE_SIMPLE_IO_EVENT(xfs_insert_file_space); +#ifdef CONFIG_XFS_RT +DEFINE_SIMPLE_IO_EVENT(xfs_map_free_rt_space); +#endif /* CONFIG_XFS_RT */ DEFINE_SIMPLE_IO_EVENT(xfs_map_free_space); DECLARE_EVENT_CLASS(xfs_itrunc_class, @@ -1851,6 +1855,9 @@ DEFINE_EVENT(xfs_map_free_extent_class, name, \ TP_PROTO(struct xfs_inode *ip, xfs_fileoff_t bno, xfs_extlen_t len), \ TP_ARGS(ip, bno, len)) DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_ag_extent); +#ifdef CONFIG_XFS_RT +DEFINE_MAP_FREE_EXTENT_EVENT(xfs_map_free_rt_extent); +#endif DECLARE_EVENT_CLASS(xfs_extent_busy_class, TP_PROTO(const struct xfs_group *xg, xfs_agblock_t agbno, @@ -1995,6 +2002,37 @@ TRACE_EVENT(xfs_rtalloc_extent_busy_trim, __entry->new_rtx, __entry->new_len) ); + +DECLARE_EVENT_CLASS(xfs_rtextent_class, + TP_PROTO(struct xfs_rtgroup *rtg, xfs_rtxnum_t off_rtx, + xfs_rtxlen_t len_rtx), + TP_ARGS(rtg, off_rtx, len_rtx), + TP_STRUCT__entry( + __field(dev_t, dev) + 
__field(xfs_rgnumber_t, rgno) + __field(xfs_rtxnum_t, off_rtx) + __field(xfs_rtxlen_t, len_rtx) + ), + TP_fast_assign( + __entry->dev = rtg_mount(rtg)->m_super->s_dev; + __entry->rgno = rtg_rgno(rtg); + __entry->off_rtx = off_rtx; + __entry->len_rtx = len_rtx; + ), + TP_printk("dev %d:%d rgno 0x%x rtx 0x%llx rtxcount 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->rgno, + __entry->off_rtx, + __entry->len_rtx) +); +#define DEFINE_RTEXTENT_EVENT(name) \ +DEFINE_EVENT(xfs_rtextent_class, name, \ + TP_PROTO(struct xfs_rtgroup *rtg, xfs_rtxnum_t off_rtx, \ + xfs_rtxlen_t len_rtx), \ + TP_ARGS(rtg, off_rtx, len_rtx)) +DEFINE_RTEXTENT_EVENT(xfs_rtallocate_exact); +DEFINE_RTEXTENT_EVENT(xfs_rtallocate_find_freesp); +DEFINE_RTEXTENT_EVENT(xfs_rtallocate_find_freesp_done); #endif /* CONFIG_XFS_RT */ DECLARE_EVENT_CLASS(xfs_agf_class, @@ -3996,6 +4034,9 @@ DEFINE_EVENT(xfs_inode_irec_class, name, \ TP_PROTO(struct xfs_inode *ip, struct xfs_bmbt_irec *irec), \ TP_ARGS(ip, irec)) DEFINE_INODE_IREC_EVENT(xfs_map_free_ag_extent_done); +#ifdef CONFIG_XFS_RT +DEFINE_INODE_IREC_EVENT(xfs_map_free_rt_extent_done); +#endif /* inode iomap invalidation events */ DECLARE_EVENT_CLASS(xfs_wb_invalid_class, ^ permalink raw reply related [flat|nested] 110+ messages in thread
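The scan loop in xfs_rtallocate_find_freesp() above can be modelled in userspace as follows. This is only a sketch under stated assumptions: a plain bool array stands in for the realtime bitmap, and find_free_run() and its parameters are hypothetical names that do not exist in the patch. It mirrors the documented contract that the cursor ends up at the start of a free run, or at the end of the range if nothing is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model: used[i] is true if rt extent i is allocated.
 * Starting at *pos, find the next run of free extents below end.  On success,
 * *pos is the start of the run and *len is its length, capped at max_len.
 * If nothing is free, *pos is advanced to end and *len is set to 0.
 */
void
find_free_run(
	const bool	*used,
	size_t		*pos,
	size_t		end,
	size_t		max_len,
	size_t		*len)
{
	*len = 0;

	while (*pos < end) {
		bool	status = used[*pos];
		size_t	next = *pos;

		/* Free or not, how many more extents have the same status? */
		while (next + 1 < end && used[next + 1] == status)
			next++;

		if (!status) {
			size_t	run = next - *pos + 1;

			*len = run < max_len ? run : max_len;
			return;
		}

		/* Skip the allocated run and keep looking. */
		*pos = next + 1;
	}
}
```

A caller would loop much like xfs_map_free_rt_space() does, advancing the cursor by the returned length after each run has been mapped.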
* [PATCHSET 5/5] xfs: live health monitoring of filesystems
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (3 preceding siblings ...)
  2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
                     ` (15 more replies)
  2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
                   ` (10 subsequent siblings)
  15 siblings, 16 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: create debugfs uuid aliases
 * xfs: create hooks for monitoring health updates
 * xfs: create a filesystem shutdown hook
 * xfs: create hooks for media errors
 * iomap, filemap: report buffered read and write io errors to the filesystem
 * iomap: report directio read and write errors to callers
 * xfs: create file io error hooks
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: allow reconfiguration of the health monitoring device
 * xfs: add media error reporting ioctl
 * xfs: send uevents when mounting and unmounting a filesystem
---
 Documentation/filesystems/vfs.rst       |    7 
 fs/iomap/buffered-io.c                  |   26 +
 fs/iomap/direct-io.c                    |    4 
 fs/xfs/Kconfig                          |    8 
 fs/xfs/Makefile                         |    7 
 fs/xfs/libxfs/xfs_fs.h                  |   31 +
 fs/xfs/libxfs/xfs_health.h              |   47 +
 fs/xfs/libxfs/xfs_healthmon.schema.json |  595 +++++++++++++
 fs/xfs/xfs_aops.c                       |    2 
 fs/xfs/xfs_file.c                       |  167 ++++
 fs/xfs/xfs_file.h                       |   36 +
 fs/xfs/xfs_fsops.c                      |   57 +
 fs/xfs/xfs_fsops.h                      |   14 
 fs/xfs/xfs_health.c                     |  202 +++++
 fs/xfs/xfs_healthmon.c                  | 1372 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h                  |  102 ++
 fs/xfs/xfs_ioctl.c                      |    7 
 fs/xfs/xfs_linux.h                      |    3 
 fs/xfs/xfs_mount.h                      |   13 
 fs/xfs/xfs_notify_failure.c             |  137 +++
 fs/xfs/xfs_notify_failure.h             |   44 +
 fs/xfs/xfs_super.c                      |   55 +
 fs/xfs/xfs_trace.c                      |    4 
 fs/xfs/xfs_trace.h                      |  369 ++++++++
 include/linux/fs.h                      |    4 
 include/linux/iomap.h                   |    2 
 26 files changed, 3301 insertions(+), 14 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json
 create mode 100644 fs/xfs/xfs_healthmon.c
 create mode 100644 fs/xfs/xfs_healthmon.h

^ permalink raw reply	[flat|nested] 110+ messages in thread
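The cover letter says events are queued without stalling the kernel when the reader is unresponsive, but does not say how. One common way to get that property is a bounded queue whose producer never blocks and instead drops and counts events when the queue is full. The sketch below assumes that policy; it is not taken from the patchset, and struct evq, evq_push(), evq_pop() and EVQ_DEPTH are all hypothetical names. Plain ints stand in for event records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EVQ_DEPTH	4	/* hypothetical queue depth */

struct evq {
	int		events[EVQ_DEPTH];	/* stand-in for event records */
	size_t		head;			/* next slot to read */
	size_t		count;			/* events currently queued */
	unsigned long	lost;			/* events dropped while full */
};

/* Producer side: never waits; drops and counts the event if full. */
bool
evq_push(
	struct evq	*q,
	int		ev)
{
	if (q->count == EVQ_DEPTH) {
		q->lost++;
		return false;
	}

	q->events[(q->head + q->count) % EVQ_DEPTH] = ev;
	q->count++;
	return true;
}

/* Consumer side: fetch the oldest event; returns false if empty. */
bool
evq_pop(
	struct evq	*q,
	int		*ev)
{
	if (q->count == 0)
		return false;

	*ev = q->events[q->head];
	q->head = (q->head + 1) % EVQ_DEPTH;
	q->count--;
	return true;
}
```

Keeping a lost counter lets a slow reader at least learn that it missed something and fall back to a full scan, which seems like the natural recovery for a repair daemon, though whether the real daemon does this is not stated here.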
* [PATCH 01/16] xfs: create debugfs uuid aliases
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:39   ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 02/16] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an alias for the debugfs dir so that we can find a filesystem by
uuid, unless it's mounted nouuid.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   11 +++++++++++
 2 files changed, 12 insertions(+)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 87007d9de5d9d0..d73e76e36bfc10 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -292,6 +292,7 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct xfs_zone_info	*m_zone_info;	/* zone allocator information */
 	struct dentry		*m_debugfs;	/* debugfs parent */
+	struct dentry		*m_debugfs_uuid; /* debugfs symlink */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 099c30339e8f9d..fd641853fe3595 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -780,6 +780,7 @@ xfs_mount_free(
 	if (mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_ddev_targp);
 
+	debugfs_remove(mp->m_debugfs_uuid);
 	debugfs_remove(mp->m_debugfs);
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
@@ -1893,6 +1894,16 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+		char	name[UUID_STRING_LEN + 1];
+
+		snprintf(name, UUID_STRING_LEN + 1, "%pU", &mp->m_sb.sb_uuid);
+		mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+				mp->m_super->s_id);
+	} else {
+		mp->m_debugfs_uuid = NULL;
+	}
+
 	return 0;
 
  out_filestream_unmount:

^ permalink raw reply related	[flat|nested] 110+ messages in thread
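For readers unfamiliar with the kernel's "%pU" format, here is a rough userspace equivalent of how the symlink name is built from the 16 raw uuid bytes. It assumes the default "%pU" rendering is the canonical lowercase 8-4-4-4-12 form in byte order, which is my understanding of the printk-formats documentation. uuid_to_name() and UUID_STR_LEN are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define UUID_STR_LEN	36	/* 32 hex digits plus 4 hyphens */

/* Format 16 raw uuid bytes as a lowercase 8-4-4-4-12 string. */
void
uuid_to_name(
	const unsigned char	uuid[16],
	char			name[UUID_STR_LEN + 1])
{
	snprintf(name, UUID_STR_LEN + 1,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x%02x%02x%02x%02x",
		 uuid[0], uuid[1], uuid[2], uuid[3],
		 uuid[4], uuid[5],
		 uuid[6], uuid[7],
		 uuid[8], uuid[9],
		 uuid[10], uuid[11], uuid[12], uuid[13], uuid[14], uuid[15]);
}
```

In the patch, the resulting name becomes a symlink under the xfs debugfs root that points at the per-mount directory named after s_id, so a tool that knows only the filesystem uuid can locate the debugfs directory.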
* [PATCH 02/16] xfs: create hooks for monitoring health updates 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong 2024-12-31 23:39 ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong @ 2024-12-31 23:39 ` Darrick J. Wong 2024-12-31 23:39 ` [PATCH 03/16] xfs: create a filesystem shutdown hook Darrick J. Wong ` (13 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create hooks for monitoring health events. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_health.h | 47 ++++++++++ fs/xfs/xfs_health.c | 202 ++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_mount.h | 3 + fs/xfs/xfs_super.c | 1 4 files changed, 252 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h index b31000f7190ce5..39fef33dedc6a8 100644 --- a/fs/xfs/libxfs/xfs_health.h +++ b/fs/xfs/libxfs/xfs_health.h @@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs); #define xfs_metadata_is_sick(error) \ (unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC)) +/* + * Parameters for tracking health updates. The enum below is passed as the + * hook function argument. + */ +enum xfs_health_update_type { + XFS_HEALTHUP_SICK = 1, /* runtime corruption observed */ + XFS_HEALTHUP_CORRUPT, /* fsck reported corruption */ + XFS_HEALTHUP_HEALTHY, /* fsck reported healthy structure */ + XFS_HEALTHUP_UNMOUNT, /* filesystem is unmounting */ +}; + +/* Where in the filesystem was the event observed? 
*/ +enum xfs_health_update_domain { + XFS_HEALTHUP_FS = 1, /* main filesystem */ + XFS_HEALTHUP_AG, /* allocation group */ + XFS_HEALTHUP_INODE, /* inode */ + XFS_HEALTHUP_RTGROUP, /* realtime group */ +}; + +struct xfs_health_update_params { + /* XFS_HEALTHUP_INODE */ + xfs_ino_t ino; + uint32_t gen; + + /* XFS_HEALTHUP_AG/RTGROUP */ + uint32_t group; + + /* XFS_SICK_* flags */ + unsigned int old_mask; + unsigned int new_mask; + + enum xfs_health_update_domain domain; +}; + +#ifdef CONFIG_XFS_LIVE_HOOKS +struct xfs_health_hook { + struct xfs_hook health_hook; +}; + +void xfs_health_hook_disable(void); +void xfs_health_hook_enable(void); + +int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook); +void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook); +void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn); +#endif /* CONFIG_XFS_LIVE_HOOKS */ + #endif /* __XFS_HEALTH_H__ */ diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c index 7c541fb373d5b2..abf9460ae79953 100644 --- a/fs/xfs/xfs_health.c +++ b/fs/xfs/xfs_health.c @@ -20,6 +20,157 @@ #include "xfs_quota_defs.h" #include "xfs_rtgroup.h" +#ifdef CONFIG_XFS_LIVE_HOOKS +/* + * Use a static key here to reduce the overhead of health updates. If + * the compiler supports jump labels, the static branch will be replaced by a + * nop sled when there are no hook users. Online fsck is currently the only + * caller, so this is a reasonable tradeoff. + * + * Note: Patching the kernel code requires taking the cpu hotplug lock. Other + * parts of the kernel allocate memory with that lock held, which means that + * XFS callers cannot hold any locks that might be used by memory reclaim or + * writeback when calling the static_branch_{inc,dec} functions. 
+ */ +DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch); + +void +xfs_health_hook_disable(void) +{ + xfs_hooks_switch_off(&xfs_health_hooks_switch); +} + +void +xfs_health_hook_enable(void) +{ + xfs_hooks_switch_on(&xfs_health_hooks_switch); +} + +/* Call downstream hooks for a filesystem unmount health update. */ +static inline void +xfs_health_unmount_hook( + struct xfs_mount *mp) +{ + if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) { + struct xfs_health_update_params p = { + .domain = XFS_HEALTHUP_FS, + }; + + xfs_hooks_call(&mp->m_health_update_hooks, + XFS_HEALTHUP_UNMOUNT, &p); + } +} + +/* Call downstream hooks for a filesystem health update. */ +static inline void +xfs_fs_health_update_hook( + struct xfs_mount *mp, + enum xfs_health_update_type op, + unsigned int old_mask, + unsigned int new_mask) +{ + if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) { + struct xfs_health_update_params p = { + .domain = XFS_HEALTHUP_FS, + .old_mask = old_mask, + .new_mask = new_mask, + }; + + if (new_mask) + xfs_hooks_call(&mp->m_health_update_hooks, op, &p); + } +} + +/* Call downstream hooks for a group health update. */ +static inline void +xfs_group_health_update_hook( + struct xfs_group *xg, + enum xfs_health_update_type op, + unsigned int old_mask, + unsigned int new_mask) +{ + if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) { + struct xfs_health_update_params p = { + .old_mask = old_mask, + .new_mask = new_mask, + .group = xg->xg_gno, + }; + struct xfs_mount *mp = xg->xg_mount; + + switch (xg->xg_type) { + case XG_TYPE_AG: + p.domain = XFS_HEALTHUP_AG; + break; + case XG_TYPE_RTG: + p.domain = XFS_HEALTHUP_RTGROUP; + break; + default: + ASSERT(0); + return; + } + + if (new_mask) + xfs_hooks_call(&mp->m_health_update_hooks, op, &p); + } +} + +/* Call downstream hooks for an inode health update. 
*/ +static inline void +xfs_inode_health_update_hook( + struct xfs_inode *ip, + enum xfs_health_update_type op, + unsigned int old_mask, + unsigned int new_mask) +{ + if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) { + struct xfs_health_update_params p = { + .domain = XFS_HEALTHUP_INODE, + .old_mask = old_mask, + .new_mask = new_mask, + .ino = ip->i_ino, + .gen = VFS_I(ip)->i_generation, + }; + struct xfs_mount *mp = ip->i_mount; + + if (new_mask) + xfs_hooks_call(&mp->m_health_update_hooks, op, &p); + } +} + +/* Call the specified function during a health update. */ +int +xfs_health_hook_add( + struct xfs_mount *mp, + struct xfs_health_hook *hook) +{ + return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook); +} + +/* Stop calling the specified function during a health update. */ +void +xfs_health_hook_del( + struct xfs_mount *mp, + struct xfs_health_hook *hook) +{ + xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook); +} + +/* Configure health update hook functions. */ +void +xfs_health_hook_setup( + struct xfs_health_hook *hook, + notifier_fn_t mod_fn) +{ + xfs_hook_setup(&hook->health_hook, mod_fn); +} +#else +# define xfs_health_unmount_hook(...) ((void)0) +# define xfs_fs_health_update_hook(a,b,o,n) do {o = o;} while(0) +# define xfs_rt_health_update_hook(a,b,o,n) do {o = o;} while(0) +# define xfs_group_health_update_hook(a,b,o,n) do {o = o;} while(0) +# define xfs_inode_health_update_hook(a,b,o,n) do {o = o;} while(0) +#endif /* CONFIG_XFS_LIVE_HOOKS */ + static void xfs_health_unmount_group( struct xfs_group *xg, @@ -50,8 +201,10 @@ xfs_health_unmount( unsigned int checked = 0; bool warn = false; - if (xfs_is_shutdown(mp)) + if (xfs_is_shutdown(mp)) { + xfs_health_unmount_hook(mp); return; + } /* Measure AG corruption levels. 
*/ while ((pag = xfs_perag_next(mp, pag))) @@ -97,6 +250,8 @@ xfs_health_unmount( if (sick & XFS_SICK_FS_COUNTERS) xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS); } + + xfs_health_unmount_hook(mp); } /* Mark unhealthy per-fs metadata. */ @@ -105,12 +260,17 @@ xfs_fs_mark_sick( struct xfs_mount *mp, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_FS_ALL)); trace_xfs_fs_mark_sick(mp, mask); spin_lock(&mp->m_sb_lock); + old_mask = mp->m_fs_sick; mp->m_fs_sick |= mask; spin_unlock(&mp->m_sb_lock); + + xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask); } /* Mark per-fs metadata as having been checked and found unhealthy by fsck. */ @@ -119,13 +279,18 @@ xfs_fs_mark_corrupt( struct xfs_mount *mp, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_FS_ALL)); trace_xfs_fs_mark_corrupt(mp, mask); spin_lock(&mp->m_sb_lock); + old_mask = mp->m_fs_sick; mp->m_fs_sick |= mask; mp->m_fs_checked |= mask; spin_unlock(&mp->m_sb_lock); + + xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask); } /* Mark a per-fs metadata healed. */ @@ -134,15 +299,20 @@ xfs_fs_mark_healthy( struct xfs_mount *mp, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_FS_ALL)); trace_xfs_fs_mark_healthy(mp, mask); spin_lock(&mp->m_sb_lock); + old_mask = mp->m_fs_sick; mp->m_fs_sick &= ~mask; if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY)) mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY; mp->m_fs_checked |= mask; spin_unlock(&mp->m_sb_lock); + + xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask); } /* Sample which per-fs metadata are unhealthy. 
*/ @@ -192,12 +362,17 @@ xfs_group_mark_sick( struct xfs_group *xg, unsigned int mask) { + unsigned int old_mask; + xfs_group_check_mask(xg, mask); trace_xfs_group_mark_sick(xg, mask); spin_lock(&xg->xg_state_lock); + old_mask = xg->xg_sick; xg->xg_sick |= mask; spin_unlock(&xg->xg_state_lock); + + xfs_group_health_update_hook(xg, XFS_HEALTHUP_SICK, old_mask, mask); } /* @@ -208,13 +383,18 @@ xfs_group_mark_corrupt( struct xfs_group *xg, unsigned int mask) { + unsigned int old_mask; + xfs_group_check_mask(xg, mask); trace_xfs_group_mark_corrupt(xg, mask); spin_lock(&xg->xg_state_lock); + old_mask = xg->xg_sick; xg->xg_sick |= mask; xg->xg_checked |= mask; spin_unlock(&xg->xg_state_lock); + + xfs_group_health_update_hook(xg, XFS_HEALTHUP_CORRUPT, old_mask, mask); } /* @@ -225,15 +405,20 @@ xfs_group_mark_healthy( struct xfs_group *xg, unsigned int mask) { + unsigned int old_mask; + xfs_group_check_mask(xg, mask); trace_xfs_group_mark_healthy(xg, mask); spin_lock(&xg->xg_state_lock); + old_mask = xg->xg_sick; xg->xg_sick &= ~mask; if (!(xg->xg_sick & XFS_SICK_AG_PRIMARY)) xg->xg_sick &= ~XFS_SICK_AG_SECONDARY; xg->xg_checked |= mask; spin_unlock(&xg->xg_state_lock); + + xfs_group_health_update_hook(xg, XFS_HEALTHUP_HEALTHY, old_mask, mask); } /* Sample which per-ag metadata are unhealthy. */ @@ -272,10 +457,13 @@ xfs_inode_mark_sick( struct xfs_inode *ip, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_INO_ALL)); trace_xfs_inode_mark_sick(ip, mask); spin_lock(&ip->i_flags_lock); + old_mask = ip->i_sick; ip->i_sick |= mask; spin_unlock(&ip->i_flags_lock); @@ -287,6 +475,8 @@ xfs_inode_mark_sick( spin_lock(&VFS_I(ip)->i_lock); VFS_I(ip)->i_state &= ~I_DONTCACHE; spin_unlock(&VFS_I(ip)->i_lock); + + xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask); } /* Mark inode metadata as having been checked and found unhealthy by fsck. 
*/ @@ -295,10 +485,13 @@ xfs_inode_mark_corrupt( struct xfs_inode *ip, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_INO_ALL)); trace_xfs_inode_mark_corrupt(ip, mask); spin_lock(&ip->i_flags_lock); + old_mask = ip->i_sick; ip->i_sick |= mask; ip->i_checked |= mask; spin_unlock(&ip->i_flags_lock); @@ -311,6 +504,8 @@ xfs_inode_mark_corrupt( spin_lock(&VFS_I(ip)->i_lock); VFS_I(ip)->i_state &= ~I_DONTCACHE; spin_unlock(&VFS_I(ip)->i_lock); + + xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask); } /* Mark parts of an inode healed. */ @@ -319,15 +514,20 @@ xfs_inode_mark_healthy( struct xfs_inode *ip, unsigned int mask) { + unsigned int old_mask; + ASSERT(!(mask & ~XFS_SICK_INO_ALL)); trace_xfs_inode_mark_healthy(ip, mask); spin_lock(&ip->i_flags_lock); + old_mask = ip->i_sick; ip->i_sick &= ~mask; if (!(ip->i_sick & XFS_SICK_INO_PRIMARY)) ip->i_sick &= ~XFS_SICK_INO_SECONDARY; ip->i_checked |= mask; spin_unlock(&ip->i_flags_lock); + + xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask); } /* Sample which parts of an inode are unhealthy. */ diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index d73e76e36bfc10..df5e4a48af72b7 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -340,6 +340,9 @@ typedef struct xfs_mount { /* Hook to feed dirent updates to an active online repair. */ struct xfs_hooks m_dir_update_hooks; + + /* Hook to feed health events to a daemon. */ + struct xfs_hooks m_health_update_hooks; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index fd641853fe3595..e4789dfe1a369e 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -2182,6 +2182,7 @@ xfs_init_fs_context( mp->m_allocsize_log = 16; /* 64k */ xfs_hooks_init(&mp->m_dir_update_hooks); + xfs_hooks_init(&mp->m_health_update_hooks); fc->s_fs_info = mp; fc->ops = &xfs_context_ops; ^ permalink raw reply related [flat|nested] 110+ messages in thread
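The pattern used throughout this patch is: sample the old sick mask, apply the change, then hand the old mask and the caller's mask to any registered hook, but only if the hook switch is on and the mask is nonzero. A minimal userspace model of that flow is sketched below. It is an illustration under assumptions, not the kernel code: a bool stands in for the static key, a single function pointer stands in for the notifier chain, locking is omitted, and every name (fs_model, fs_mark_sick, record_update and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum health_update_type {
	HEALTHUP_SICK = 1,	/* runtime corruption observed */
	HEALTHUP_HEALTHY,	/* structure reported healthy */
};

struct health_update {
	unsigned int	old_mask;	/* sick bits before the update */
	unsigned int	new_mask;	/* bits named by the caller */
};

typedef void (*health_hook_fn)(enum health_update_type type,
		const struct health_update *p, void *priv);

struct fs_model {
	unsigned int	sick;		/* current sick bits */
	bool		hooks_on;	/* stands in for the static key */
	health_hook_fn	hook;		/* stands in for the hook chain */
	void		*hook_priv;
};

static void
call_hook(
	struct fs_model		*fs,
	enum health_update_type	type,
	unsigned int		old_mask,
	unsigned int		mask)
{
	struct health_update	p = {
		.old_mask	= old_mask,
		.new_mask	= mask,
	};

	/* Nothing to do if monitoring is off or no bits were named. */
	if (!fs->hooks_on || !fs->hook || !mask)
		return;
	fs->hook(type, &p, fs->hook_priv);
}

void
fs_mark_sick(
	struct fs_model	*fs,
	unsigned int	mask)
{
	unsigned int	old_mask = fs->sick;

	fs->sick |= mask;
	call_hook(fs, HEALTHUP_SICK, old_mask, mask);
}

void
fs_mark_healthy(
	struct fs_model	*fs,
	unsigned int	mask)
{
	unsigned int	old_mask = fs->sick;

	fs->sick &= ~mask;
	call_hook(fs, HEALTHUP_HEALTHY, old_mask, mask);
}

/* Example hook: remember the most recent update. */
struct seen {
	int			calls;
	enum health_update_type	type;
	struct health_update	last;
};

void
record_update(
	enum health_update_type		type,
	const struct health_update	*p,
	void				*priv)
{
	struct seen	*s = priv;

	s->calls++;
	s->type = type;
	s->last = *p;
}
```

Passing both masks lets a listener work out which bits actually changed state, for example old_mask & new_mask on a "sick" event tells it which problems were already known.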
* [PATCH 03/16] xfs: create a filesystem shutdown hook
  2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 02/16] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2024-12-31 23:39   ` Darrick J. Wong
  2024-12-31 23:39   ` [PATCH 04/16] xfs: create hooks for media errors Darrick J. Wong
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a hook so that health monitoring can report filesystem shutdown
events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h |   14 +++++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_super.c |    1 +
 4 files changed, 75 insertions(+)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 150979c8333530..439e76f38ed42e 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -480,6 +480,61 @@ xfs_fs_goingdown(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_shutdown_hooks_switch);
+
+void
+xfs_shutdown_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_shutdown_hooks_switch);
+}
+
+void
+xfs_shutdown_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_shutdown_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+	struct xfs_mount	*mp,
+	uint32_t		flags)
+{
+	if (xfs_hooks_switched_on(&xfs_shutdown_hooks_switch))
+		xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a shutdown update. */
+int
+xfs_shutdown_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a shutdown update. */
+void
+xfs_shutdown_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown update hook functions. */
+void
+xfs_shutdown_hook_setup(
+	struct xfs_shutdown_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -538,6 +593,8 @@ xfs_do_force_shutdown(
 		"Please unmount the filesystem and rectify the problem(s)");
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
+
+	xfs_shutdown_hook(mp, flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 9d23c361ef56e4..7f6f876de072b1 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -15,4 +15,18 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+	struct xfs_hook			shutdown_hook;
+};
+
+void xfs_shutdown_hook_disable(void);
+void xfs_shutdown_hook_enable(void);
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+		notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index df5e4a48af72b7..a8c81c4ccb2000 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,6 +343,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed health events to a daemon. */
 	struct xfs_hooks	m_health_update_hooks;
+
+	/* Hook to feed shutdown events to a daemon. */
+	struct xfs_hooks	m_shutdown_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e4789dfe1a369e..71aa97a5d1dcaa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2182,6 +2182,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;

^ permalink raw reply related	[flat|nested] 110+ messages in thread
* [PATCH 04/16] xfs: create hooks for media errors 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (2 preceding siblings ...) 2024-12-31 23:39 ` [PATCH 03/16] xfs: create a filesystem shutdown hook Darrick J. Wong @ 2024-12-31 23:39 ` Darrick J. Wong 2024-12-31 23:40 ` [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem Darrick J. Wong ` (11 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:39 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Set up a media error event hook so that we can send events to userspace. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/xfs_mount.h | 3 ++ fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++++++++--- fs/xfs/xfs_notify_failure.h | 38 +++++++++++++++++++ fs/xfs/xfs_super.c | 1 + 4 files changed, 122 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index a8c81c4ccb2000..3fcfdaaf199315 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -346,6 +346,9 @@ typedef struct xfs_mount { /* Hook to feed shutdown events to a daemon. */ struct xfs_hooks m_shutdown_hooks; + + /* Hook to feed media error events to a daemon. 
*/ + struct xfs_hooks m_media_error_hooks; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index ed8d8ed42f0a2c..ea68c7e61bb585 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -27,6 +27,73 @@ #include <linux/dax.h> #include <linux/fs.h> +#ifdef CONFIG_XFS_LIVE_HOOKS +DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_media_error_hooks_switch); + +void +xfs_media_error_hook_disable(void) +{ + xfs_hooks_switch_off(&xfs_media_error_hooks_switch); +} + +void +xfs_media_error_hook_enable(void) +{ + xfs_hooks_switch_on(&xfs_media_error_hooks_switch); +} + +/* Call downstream hooks for a media error. */ +static inline void +xfs_media_error_hook( + struct xfs_mount *mp, + enum xfs_failed_device fdev, + xfs_daddr_t daddr, + uint64_t bbcount, + bool pre_remove) +{ + if (xfs_hooks_switched_on(&xfs_media_error_hooks_switch)) { + struct xfs_media_error_params p = { + .mp = mp, + .fdev = fdev, + .daddr = daddr, + .bbcount = bbcount, + .pre_remove = pre_remove, + }; + + xfs_hooks_call(&mp->m_media_error_hooks, 0, &p); + } +} + +/* Call the specified function during a media error. */ +int +xfs_media_error_hook_add( + struct xfs_mount *mp, + struct xfs_media_error_hook *hook) +{ + return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook); +} + +/* Stop calling the specified function during a media error. */ +void +xfs_media_error_hook_del( + struct xfs_mount *mp, + struct xfs_media_error_hook *hook) +{ + xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook); +} + +/* Configure media error hook functions. */ +void +xfs_media_error_hook_setup( + struct xfs_media_error_hook *hook, + notifier_fn_t mod_fn) +{ + xfs_hook_setup(&hook->error_hook, mod_fn); +} +#else +# define xfs_media_error_hook(...) 
((void)0) +#endif /* CONFIG_XFS_LIVE_HOOKS */ + struct xfs_failure_info { xfs_agblock_t startblock; xfs_extlen_t blockcount; @@ -215,6 +282,9 @@ xfs_dax_notify_logdev_failure( if (error) return error; + xfs_media_error_hook(mp, XFS_FAILED_LOGDEV, daddr, bblen, + mf_flags & MF_MEM_PRE_REMOVE); + /* * In the pre-remove case the failure notification is attempting to * trigger a force unmount. The expectation is that the device is @@ -248,17 +318,21 @@ xfs_dax_notify_dev_failure( uint64_t bblen; struct xfs_group *xg = NULL; + error = xfs_dax_translate_range(type == XG_TYPE_RTG ? + mp->m_rtdev_targp : mp->m_ddev_targp, + offset, len, &daddr, &bblen); + if (error) + return error; + + xfs_media_error_hook(mp, type == XG_TYPE_RTG ? + XFS_FAILED_RTDEV : XFS_FAILED_DATADEV, + daddr, bblen, mf_flags & MF_MEM_PRE_REMOVE); + if (!xfs_has_rmapbt(mp)) { xfs_debug(mp, "notify_failure() needs rmapbt enabled!"); return -EOPNOTSUPP; } - error = xfs_dax_translate_range(type == XG_TYPE_RTG ? - mp->m_rtdev_targp : mp->m_ddev_targp, - offset, len, &daddr, &bblen); - if (error) - return error; - if (type == XG_TYPE_RTG) { start_bno = xfs_daddr_to_rtb(mp, daddr); end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1); diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h index 41108044d35d47..835d4af504d832 100644 --- a/fs/xfs/xfs_notify_failure.h +++ b/fs/xfs/xfs_notify_failure.h @@ -8,4 +8,42 @@ extern const struct dax_holder_operations xfs_dax_holder_operations; +enum xfs_failed_device { + XFS_FAILED_DATADEV, + XFS_FAILED_LOGDEV, + XFS_FAILED_RTDEV, +}; + +#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) +struct xfs_media_error_params { + struct xfs_mount *mp; + enum xfs_failed_device fdev; + xfs_daddr_t daddr; + uint64_t bbcount; + bool pre_remove; +}; + +struct xfs_media_error_hook { + struct xfs_hook error_hook; +}; + +void xfs_media_error_hook_disable(void); +void xfs_media_error_hook_enable(void); + +int 
xfs_media_error_hook_add(struct xfs_mount *mp, + struct xfs_media_error_hook *hook); +void xfs_media_error_hook_del(struct xfs_mount *mp, + struct xfs_media_error_hook *hook); +void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook, + notifier_fn_t mod_fn); +#else +struct xfs_media_error_params { }; +struct xfs_media_error_hook { }; +# define xfs_media_error_hook_disable() ((void)0) +# define xfs_media_error_hook_enable() ((void)0) +# define xfs_media_error_hook_add(...) (0) +# define xfs_media_error_hook_del(...) ((void)0) +# define xfs_media_error_hook_setup(...) ((void)0) +#endif /* CONFIG_XFS_LIVE_HOOKS */ + #endif /* __XFS_NOTIFY_FAILURE_H__ */ diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 71aa97a5d1dcaa..a49082159faae8 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -2184,6 +2184,7 @@ xfs_init_fs_context( xfs_hooks_init(&mp->m_dir_update_hooks); xfs_hooks_init(&mp->m_shutdown_hooks); xfs_hooks_init(&mp->m_health_update_hooks); + xfs_hooks_init(&mp->m_media_error_hooks); fc->s_fs_info = mp; fc->ops = &xfs_context_ops; ^ permalink raw reply related [flat|nested] 110+ messages in thread
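Editor's sketch, not part of the posted patch: the media error hook above follows the same shape as the shutdown hook before it. A global switch gates the call site, a params struct is built on the stack, and every registered listener is handed a pointer to it. The userspace model below illustrates only that shape. All names in it are hypothetical, the plain integer stands in for whatever DEFINE_STATIC_XFS_HOOK_SWITCH expands to, and the linked list stands in for the chain behind xfs_hooks_call (which appears to be notifier-based, given the notifier_fn_t in the setup helper).

```c
/* Userspace model of the hook-plus-switch pattern; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum failed_device { FAILED_DATADEV, FAILED_LOGDEV, FAILED_RTDEV };

/* Mirrors the fields the patch packs into xfs_media_error_params. */
struct media_error_params {
	enum failed_device	fdev;
	uint64_t		daddr;		/* start, in 512-byte basic blocks */
	uint64_t		bbcount;	/* length, in basic blocks */
	bool			pre_remove;
};

typedef void (*media_error_fn)(void *priv, const struct media_error_params *p);

struct hook {
	struct hook	*next;
	media_error_fn	fn;
	void		*priv;
};

struct hook_chain {
	struct hook	*head;
};

/* Stand-in for the hook switch: nonzero while any listener wants events. */
static int hook_switch;

static void hook_enable(void)
{
	hook_switch++;
}

static void hook_disable(void)
{
	assert(hook_switch > 0);
	hook_switch--;
}

static void hook_add(struct hook_chain *chain, struct hook *h)
{
	h->next = chain->head;
	chain->head = h;
}

static void hook_del(struct hook_chain *chain, struct hook *h)
{
	struct hook **pp;

	for (pp = &chain->head; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			return;
		}
	}
}

/*
 * Call site: do nothing unless the switch is on, then build the params on
 * the stack and pass them to every listener.  Returns the number of
 * listeners called.
 */
static int media_error_notify(struct hook_chain *chain, enum failed_device fdev,
			      uint64_t daddr, uint64_t bbcount, bool pre_remove)
{
	struct media_error_params p = {
		.fdev		= fdev,
		.daddr		= daddr,
		.bbcount	= bbcount,
		.pre_remove	= pre_remove,
	};
	struct hook *h;
	int called = 0;

	if (!hook_switch)
		return 0;

	for (h = chain->head; h; h = h->next) {
		h->fn(h->priv, &p);
		called++;
	}
	return called;
}

/* Example listener: add up the number of bad basic blocks reported. */
static void sum_bad_blocks(void *priv, const struct media_error_params *p)
{
	*(uint64_t *)priv += p->bbcount;
}
```

A listener registered while the switch is off sees nothing, which is the point of the switch: filesystems with no monitoring daemon attached pay almost nothing at the call site.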
* [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (3 preceding siblings ...) 2024-12-31 23:39 ` [PATCH 04/16] xfs: create hooks for media errors Darrick J. Wong @ 2024-12-31 23:40 ` Darrick J. Wong 2024-12-31 23:40 ` [PATCH 06/16] iomap: report directio read and write errors to callers Darrick J. Wong ` (10 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Provide a callback so that iomap can report read and write IO errors to the caller filesystem. For now this is only wired up for iomap as a testbed for XFS. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- Documentation/filesystems/vfs.rst | 7 +++++++ fs/iomap/buffered-io.c | 26 +++++++++++++++++++++++++- include/linux/fs.h | 4 ++++ 3 files changed, 36 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 0b18af3f954eb7..2f0ef4e1a8d340 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -827,6 +827,8 @@ cache in your filesystem. The following members are defined: int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span) int (*swap_deactivate)(struct file *); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); + void (*ioerror)(struct address_space *mapping, int direction, + loff_t pos, u64 len, int error); }; ``writepage`` @@ -1056,6 +1058,11 @@ cache in your filesystem. The following members are defined: ``swap_rw`` Called to read or write swap pages when SWP_FS_OPS is set. +``ioerror`` + Called to deal with IO errors during readahead or writeback. + This may be called from interrupt context, and without any + locks necessarily being held. 
+ The File Object =============== diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 86e30b56e8d41b..39782376895306 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -284,6 +284,14 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio, *lenp = plen; } +static inline void iomap_mapping_ioerror(struct address_space *mapping, + int direction, loff_t pos, u64 len, int error) +{ + if (mapping && mapping->a_ops->ioerror) + mapping->a_ops->ioerror(mapping, direction, pos, len, + error); +} + static void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len, int error) { @@ -302,6 +310,10 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off, spin_unlock_irqrestore(&ifs->state_lock, flags); } + if (error) + iomap_mapping_ioerror(folio->mapping, READ, + folio_pos(folio) + off, len, error); + if (finished) folio_end_read(folio, uptodate); } @@ -670,11 +682,16 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio, { struct bio_vec bvec; struct bio bio; + int ret; bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ); bio.bi_iter.bi_sector = iomap_sector(iomap, block_start); bio_add_folio_nofail(&bio, folio, plen, poff); - return submit_bio_wait(&bio); + ret = submit_bio_wait(&bio); + if (ret) + iomap_mapping_ioerror(folio->mapping, READ, + folio_pos(folio) + poff, plen, ret); + return ret; } static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos, @@ -1573,6 +1590,11 @@ u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend) /* walk all folios in bio, ending page IO on them */ bio_for_each_folio_all(fi, bio) { + if (ioend->io_error) + iomap_mapping_ioerror(inode->i_mapping, WRITE, + folio_pos(fi.folio) + fi.offset, + fi.length, ioend->io_error); + iomap_finish_folio_write(inode, fi.folio, fi.length); folio_count++; } @@ -1881,6 +1903,8 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, if (count) wpc->nr_folios++; + if (error && 
!count) + iomap_mapping_ioerror(inode->i_mapping, WRITE, pos, 0, error); /* * We can have dirty bits set past end of file in page_mkwrite path diff --git a/include/linux/fs.h b/include/linux/fs.h index b638fb1bcbc96f..9375753577025d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -438,6 +438,10 @@ struct address_space_operations { sector_t *span); void (*swap_deactivate)(struct file *file); int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter); + + /* Callback for dealing with IO errors during readahead or writeback */ + void (*ioerror)(struct address_space *mapping, int direction, + loff_t pos, u64 len, int error); }; extern const struct address_space_operations empty_aops; ^ permalink raw reply related [flat|nested] 110+ messages in thread
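Editor's sketch, not part of the posted patch: the new ->ioerror member is optional, so iomap_mapping_ioerror checks both the mapping and the method pointer before calling, and the read completion path reports the failed byte range as folio_pos(folio) + off. The userspace model below shows that dispatch and the offset arithmetic. Every type and function name in it is a hypothetical stand-in for the kernel one.

```c
/* Userspace model of an optional ops-table callback; all names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum io_direction { IO_READ, IO_WRITE };

struct mapping;

struct mapping_ops {
	/* Optional: may be NULL for filesystems that do not care. */
	void (*ioerror)(struct mapping *m, enum io_direction dir,
			int64_t pos, uint64_t len, int error);
};

struct mapping {
	const struct mapping_ops	*ops;
	void				*priv;
};

/* Report an error only if the owner of the mapping asked to hear about it. */
static void mapping_ioerror(struct mapping *m, enum io_direction dir,
			    int64_t pos, uint64_t len, int error)
{
	if (m && m->ops && m->ops->ioerror)
		m->ops->ioerror(m, dir, pos, len, error);
}

/*
 * Read completion: a failed range is reported as an absolute file offset,
 * that is, the folio's position in the file plus the offset within it.
 */
static void finish_folio_read(struct mapping *m, int64_t folio_pos,
			      size_t off, size_t len, int error)
{
	if (error)
		mapping_ioerror(m, IO_READ, folio_pos + (int64_t)off, len,
				error);
}

/* Example callback: remember the most recent report. */
struct last_error {
	int			seen;
	enum io_direction	dir;
	int64_t			pos;
	uint64_t		len;
	int			error;
};

static void record_ioerror(struct mapping *m, enum io_direction dir,
			   int64_t pos, uint64_t len, int error)
{
	struct last_error *le = m->priv;

	assert(le != NULL);
	le->seen++;
	le->dir = dir;
	le->pos = pos;
	le->len = len;
	le->error = error;
}
```

Because the documentation added by the patch says the callback may run in interrupt context without locks held, a real implementation must not sleep in it. The model callback only copies a few fields for the same reason.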
* [PATCH 06/16] iomap: report directio read and write errors to callers
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (4 preceding siblings ...)
2024-12-31 23:40 ` [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem Darrick J. Wong
@ 2024-12-31 23:40 ` Darrick J. Wong
2024-12-31 23:40 ` [PATCH 07/16] xfs: create file io error hooks Darrick J. Wong
` (9 subsequent siblings)
15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add more hooks to report directio IO errors to the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/direct-io.c  |    4 ++++
 include/linux/iomap.h |    2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index dd521f4edf55ac..f572be18490b0a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -100,6 +100,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+	if (dio->error && dops && dops->ioerror)
+		dops->ioerror(file_inode(iocb->ki_filp),
+				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
+				offset, dio->size, dio->error);
 
 	if (likely(!ret)) {
 		ret = dio->size;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index afa0917cf43705..69c8b45bd9b935 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -439,6 +439,8 @@ struct iomap_dio_ops {
 		      unsigned flags);
 	void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
 		          loff_t file_offset);
+	void (*ioerror)(struct inode *inode, int direction, loff_t pos,
+			u64 len, int error);
 
 	/*
 	 * Filesystems wishing to attach private information to a direct io bio
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 07/16] xfs: create file io error hooks 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (5 preceding siblings ...) 2024-12-31 23:40 ` [PATCH 06/16] iomap: report directio read and write errors to callers Darrick J. Wong @ 2024-12-31 23:40 ` Darrick J. Wong 2024-12-31 23:40 ` [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong ` (8 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create hooks within XFS to deliver IO errors to callers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/xfs_aops.c | 2 + fs/xfs/xfs_file.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_file.h | 36 +++++++++++ fs/xfs/xfs_mount.h | 3 + fs/xfs/xfs_super.c | 1 5 files changed, 208 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 4319d0488f2146..7892b794085251 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -21,6 +21,7 @@ #include "xfs_error.h" #include "xfs_zone_alloc.h" #include "xfs_rtgroup.h" +#include "xfs_file.h" struct xfs_writepage_ctx { struct iomap_writepage_ctx ctx; @@ -722,6 +723,7 @@ const struct address_space_operations xfs_address_space_operations = { .is_partially_uptodate = iomap_is_partially_uptodate, .error_remove_folio = generic_error_remove_folio, .swap_activate = xfs_iomap_swapfile_activate, + .ioerror = xfs_vm_ioerror, }; const struct address_space_operations xfs_dax_aops = { diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index ceb7936e5fd9a3..cbeb60582cb15f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -230,6 +230,169 @@ xfs_ilock_iocb_for_write( return 0; } +#ifdef CONFIG_XFS_LIVE_HOOKS +DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_file_ioerror_hooks_switch); + +void +xfs_file_ioerror_hook_disable(void) +{ + 
xfs_hooks_switch_off(&xfs_file_ioerror_hooks_switch); +} + +void +xfs_file_ioerror_hook_enable(void) +{ + xfs_hooks_switch_on(&xfs_file_ioerror_hooks_switch); +} + +struct xfs_file_ioerror { + struct work_struct work; + struct xfs_mount *mp; + xfs_ino_t ino; + loff_t pos; + u64 len; + u32 gen; + int error; + enum xfs_file_ioerror_type type; +}; + +/* Call downstream hooks for a file io error update. */ +STATIC void +xfs_file_report_ioerror( + struct work_struct *work) +{ + struct xfs_file_ioerror *ioerr; + + ioerr = container_of(work, struct xfs_file_ioerror, work); + + if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) { + struct xfs_file_ioerror_params p = { + .ino = ioerr->ino, + .gen = ioerr->gen, + .pos = ioerr->pos, + .len = ioerr->len, + }; + struct xfs_mount *mp = ioerr->mp; + + xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p); + } + + kfree(ioerr); +} + +/* Queue a directio io error notification. */ +STATIC void +xfs_dio_ioerror( + struct inode *inode, + int direction, + loff_t pos, + u64 len, + int error) +{ + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + struct xfs_file_ioerror *ioerr; + + if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) { + ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC); + if (!ioerr) { + xfs_err(mp, + "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d", + ip->i_ino, + direction == WRITE ? "WRITE" : "READ", + pos, len, error); + return; + } + + INIT_WORK(&ioerr->work, xfs_file_report_ioerror); + ioerr->mp = mp; + ioerr->ino = ip->i_ino; + ioerr->gen = VFS_I(ip)->i_generation; + ioerr->pos = pos; + ioerr->len = len; + if (direction == WRITE) + ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE; + else + ioerr->type = XFS_FILE_IOERROR_DIRECT_READ; + ioerr->error = error; + queue_work(mp->m_unwritten_workqueue, &ioerr->work); + } +} + +/* Queue a buffered io error notification. 
*/ +void +xfs_vm_ioerror( + struct address_space *mapping, + int direction, + loff_t pos, + u64 len, + int error) +{ + struct inode *inode = mapping->host; + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + struct xfs_file_ioerror *ioerr; + + if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) { + ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC); + if (!ioerr) { + xfs_err(mp, + "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d", + ip->i_ino, + direction == WRITE ? "WRITE" : "READ", + pos, len, error); + return; + } + + INIT_WORK(&ioerr->work, xfs_file_report_ioerror); + ioerr->mp = mp; + ioerr->ino = ip->i_ino; + ioerr->gen = VFS_I(ip)->i_generation; + ioerr->pos = pos; + ioerr->len = len; + if (direction == WRITE) + ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE; + else + ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ; + ioerr->error = error; + queue_work(mp->m_unwritten_workqueue, &ioerr->work); + } +} + +/* Call the specified function after a file io error. */ +int +xfs_file_ioerror_hook_add( + struct xfs_mount *mp, + struct xfs_file_ioerror_hook *hook) +{ + return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook); +} + +/* Stop calling the specified function after a file io error. */ +void +xfs_file_ioerror_hook_del( + struct xfs_mount *mp, + struct xfs_file_ioerror_hook *hook) +{ + xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook); +} + +/* Configure file io error update hook functions. 
*/ +void +xfs_file_ioerror_hook_setup( + struct xfs_file_ioerror_hook *hook, + notifier_fn_t mod_fn) +{ + xfs_hook_setup(&hook->ioerror_hook, mod_fn); +} +#else +# define xfs_dio_ioerror NULL +#endif /* CONFIG_XFS_LIVE_HOOKS */ + +static const struct iomap_dio_ops xfs_dio_read_ops = { + .ioerror = xfs_dio_ioerror, +}; + STATIC ssize_t xfs_file_dio_read( struct kiocb *iocb, @@ -248,7 +411,8 @@ xfs_file_dio_read( ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED); if (ret) return ret; - ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0); + ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops, + 0, NULL, 0); xfs_iunlock(ip, XFS_IOLOCK_SHARED); return ret; @@ -769,6 +933,7 @@ xfs_dio_write_end_io( static const struct iomap_dio_ops xfs_dio_write_ops = { .end_io = xfs_dio_write_end_io, + .ioerror = xfs_dio_ioerror, }; static void diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h index c9d50699baba85..38c546cd498a52 100644 --- a/fs/xfs/xfs_file.h +++ b/fs/xfs/xfs_file.h @@ -17,4 +17,40 @@ int xfs_file_unshare_at(struct xfs_inode *ip, loff_t pos); long xfs_ioc_map_freesp(struct file *file, struct xfs_map_freesp __user *argp); +enum xfs_file_ioerror_type { + XFS_FILE_IOERROR_BUFFERED_READ, + XFS_FILE_IOERROR_BUFFERED_WRITE, + XFS_FILE_IOERROR_DIRECT_READ, + XFS_FILE_IOERROR_DIRECT_WRITE, +}; + +struct xfs_file_ioerror_params { + xfs_ino_t ino; + loff_t pos; + u64 len; + u32 gen; + int error; +}; + +#ifdef CONFIG_XFS_LIVE_HOOKS +struct xfs_file_ioerror_hook { + struct xfs_hook ioerror_hook; +}; + +void xfs_file_ioerror_hook_disable(void); +void xfs_file_ioerror_hook_enable(void); + +int xfs_file_ioerror_hook_add(struct xfs_mount *mp, + struct xfs_file_ioerror_hook *hook); +void xfs_file_ioerror_hook_del(struct xfs_mount *mp, + struct xfs_file_ioerror_hook *hook); +void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook, + notifier_fn_t mod_fn); + +void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos, + u64 len, 
int error); +#else +# define xfs_vm_ioerror NULL +#endif /* CONFIG_XFS_LIVE_HOOKS */ + #endif /* __XFS_FILE_H__ */ diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 3fcfdaaf199315..10b4ff3548601e 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -349,6 +349,9 @@ typedef struct xfs_mount { /* Hook to feed media error events to a daemon. */ struct xfs_hooks m_media_error_hooks; + + /* Hook to feed file io error events to a daemon. */ + struct xfs_hooks m_file_ioerror_hooks; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index a49082159faae8..df6afcf8840948 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -2185,6 +2185,7 @@ xfs_init_fs_context( xfs_hooks_init(&mp->m_shutdown_hooks); xfs_hooks_init(&mp->m_health_update_hooks); xfs_hooks_init(&mp->m_media_error_hooks); + xfs_hooks_init(&mp->m_file_ioerror_hooks); fc->s_fs_info = mp; fc->ops = &xfs_context_ops; ^ permalink raw reply related [flat|nested] 110+ messages in thread
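Editor's sketch, not part of the posted patch: both xfs_dio_ioerror and xfs_vm_ioerror can be reached from contexts that must not sleep, so they only allocate a small record with GFP_ATOMIC, fill it in, and queue it. The hook chain is called later from a workqueue, and a failed allocation is logged as a lost report. The userspace model below illustrates that two-stage shape with a plain FIFO. All names are hypothetical, calloc stands in for the atomic allocation, and a counter stands in for the xfs_err message.

One difference is deliberate. As posted, the params initializer in xfs_file_report_ioerror fills in ino, gen, pos and len but not error, although both structs carry that field. The model passes the error code through to the listener, which is an assumption about the intent.

```c
/* Userspace model of deferred io error reporting; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

enum ioerror_type {
	IOERROR_BUFFERED_READ,
	IOERROR_BUFFERED_WRITE,
	IOERROR_DIRECT_READ,
	IOERROR_DIRECT_WRITE,
};

struct ioerror_record {
	struct ioerror_record	*next;
	uint64_t		ino;
	int64_t			pos;
	uint64_t		len;
	uint32_t		gen;
	int			error;
	enum ioerror_type	type;
};

struct ioerror_queue {
	struct ioerror_record	*head;
	struct ioerror_record	*tail;
	unsigned int		lost;	/* reports dropped on allocation failure */
};

typedef void (*ioerror_fn)(void *priv, const struct ioerror_record *rec);

/* Fold the io path and direction into one event type. */
static enum ioerror_type ioerror_classify(bool direct, bool write)
{
	if (direct)
		return write ? IOERROR_DIRECT_WRITE : IOERROR_DIRECT_READ;
	return write ? IOERROR_BUFFERED_WRITE : IOERROR_BUFFERED_READ;
}

/* Stage 1 (completion context): capture the details and queue them. */
static bool ioerror_defer(struct ioerror_queue *q, bool direct, bool write,
			  uint64_t ino, uint32_t gen, int64_t pos,
			  uint64_t len, int error)
{
	struct ioerror_record *rec = calloc(1, sizeof(*rec));

	if (!rec) {
		q->lost++;
		return false;
	}

	rec->ino = ino;
	rec->gen = gen;
	rec->pos = pos;
	rec->len = len;
	rec->error = error;
	rec->type = ioerror_classify(direct, write);

	if (q->tail)
		q->tail->next = rec;
	else
		q->head = rec;
	q->tail = rec;
	return true;
}

/* Stage 2 (worker context): deliver queued records in order, then free them. */
static unsigned int ioerror_drain(struct ioerror_queue *q, ioerror_fn fn,
				  void *priv)
{
	unsigned int delivered = 0;

	assert(fn != NULL);
	while (q->head) {
		struct ioerror_record *rec = q->head;

		q->head = rec->next;
		if (!q->head)
			q->tail = NULL;
		fn(priv, rec);
		free(rec);
		delivered++;
	}
	return delivered;
}

/* Example listener: keep the type and errno of the first few reports. */
struct ioerror_log {
	enum ioerror_type	types[4];
	int			errors[4];
	unsigned int		nr;
};

static void log_ioerror(void *priv, const struct ioerror_record *rec)
{
	struct ioerror_log *log = priv;

	if (log->nr < 4) {
		log->types[log->nr] = rec->type;
		log->errors[log->nr] = rec->error;
		log->nr++;
	}
}
```

The record stores the inode number and generation instead of an inode pointer, as the patch does, so a queued report does not need to hold the inode in memory until the worker runs.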
* [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (6 preceding siblings ...) 2024-12-31 23:40 ` [PATCH 07/16] xfs: create file io error hooks Darrick J. Wong @ 2024-12-31 23:40 ` Darrick J. Wong 2024-12-31 23:41 ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong ` (7 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:40 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create an ioctl that installs a file descriptor backed by an anon_inode file that will convey filesystem health events to userspace. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/Kconfig | 8 +++ fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_fs.h | 8 +++ fs/xfs/xfs_healthmon.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_healthmon.h | 16 +++++ fs/xfs/xfs_ioctl.c | 4 + 6 files changed, 182 insertions(+) create mode 100644 fs/xfs/xfs_healthmon.c create mode 100644 fs/xfs/xfs_healthmon.h diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig index 5700bc671a0e92..9d061a8c2786fe 100644 --- a/fs/xfs/Kconfig +++ b/fs/xfs/Kconfig @@ -120,6 +120,14 @@ config XFS_RT If unsure, say N. +config XFS_HEALTH_MONITOR + bool "Report filesystem health events to userspace" + depends on XFS_FS + select XFS_LIVE_HOOKS + default y + help + Report health events to userspace programs. 
+ config XFS_DRAIN_INTENTS bool select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 4c59d43c77089e..94a9dc7aa7a1d5 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -158,6 +158,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o xfs-$(CONFIG_XFS_LIVE_HOOKS) += xfs_hooks.o xfs-$(CONFIG_XFS_MEMORY_BUFS) += xfs_buf_mem.o xfs-$(CONFIG_XFS_BTREE_IN_MEM) += libxfs/xfs_btree_mem.o +xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o # online scrub/repair ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y) diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index f4128dbdf3b9a2..d1a81b02a1a3f3 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -1100,6 +1100,13 @@ struct xfs_map_freesp { __u64 pad; /* must be zero */ }; +struct xfs_health_monitor { + __u64 flags; /* flags */ + __u8 format; /* output format */ + __u8 pad1[7]; /* zeroes */ + __u64 pad2[2]; /* zeroes */ +}; + /* * ioctl commands that are used by Linux filesystems */ @@ -1141,6 +1148,7 @@ struct xfs_map_freesp { #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry) #define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) #define XFS_IOC_MAP_FREESP _IOW ('X', 67, struct xfs_map_freesp) +#define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c new file mode 100644 index 00000000000000..c5ce5699373c63 --- /dev/null +++ b/fs/xfs/xfs_healthmon.c @@ -0,0 +1,145 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2024-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_inode.h" +#include "xfs_trace.h" +#include "xfs_ag.h" +#include "xfs_btree.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_quota_defs.h" +#include "xfs_rtgroup.h" +#include "xfs_healthmon.h" + +#include <linux/anon_inodes.h> +#include <linux/eventpoll.h> +#include <linux/poll.h> + +/* + * Live Health Monitoring + * ====================== + * + * Autonomous self-healing of XFS filesystems requires a means for the kernel + * to send filesystem health events to a monitoring daemon in userspace. To + * accomplish this, we establish a thread_with_file kthread object to handle + * translating internal events about filesystem health into a format that can + * be parsed easily by userspace. Then we hook various parts of the filesystem + * to supply those internal events to the kthread. Userspace reads events + * from the file descriptor returned by the ioctl. + * + * The healthmon abstraction has a weak reference to the host filesystem mount + * so that the queueing and processing of the events do not pin the mount and + * cannot slow down the main filesystem. The healthmon object can exist past + * the end of the filesystem mount. + */ + +struct xfs_healthmon { + struct xfs_mount *mp; +}; + +/* + * Convey queued event data to userspace. First copy any remaining bytes in + * the outbuf, then format the oldest event into the outbuf and copy that too. + */ +STATIC ssize_t +xfs_healthmon_read_iter( + struct kiocb *iocb, + struct iov_iter *to) +{ + return -EIO; +} + +/* Free the health monitoring information. */ +STATIC int +xfs_healthmon_release( + struct inode *inode, + struct file *file) +{ + struct xfs_healthmon *hm = file->private_data; + + kfree(hm); + + return 0; +} + +/* Validate ioctl parameters. 
*/ +static inline bool +xfs_healthmon_validate( + const struct xfs_health_monitor *hmo) +{ + if (hmo->flags) + return false; + if (hmo->format) + return false; + if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1))) + return false; + if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2))) + return false; + return true; +} + +static const struct file_operations xfs_healthmon_fops = { + .owner = THIS_MODULE, + .read_iter = xfs_healthmon_read_iter, + .release = xfs_healthmon_release, +}; + +/* + * Create a health monitoring file. Returns an index to the fd table or a + * negative errno. + */ +long +xfs_ioc_health_monitor( + struct xfs_mount *mp, + struct xfs_health_monitor __user *arg) +{ + struct xfs_health_monitor hmo; + struct xfs_healthmon *hm; + char *name; + int fd; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&hmo, arg, sizeof(hmo))) + return -EFAULT; + + if (!xfs_healthmon_validate(&hmo)) + return -EINVAL; + + hm = kzalloc(sizeof(*hm), GFP_KERNEL); + if (!hm) + return -ENOMEM; + hm->mp = mp; + + /* Set up VFS file and file descriptor. */ + name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id); + if (!name) { + ret = -ENOMEM; + goto out_hm; + } + + fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm, + O_CLOEXEC | O_RDONLY); + kvfree(name); + if (fd < 0) { + ret = fd; + goto out_hm; + } + + return fd; + +out_hm: + kfree(hm); + return ret; +} diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h new file mode 100644 index 00000000000000..07126e39281a0c --- /dev/null +++ b/fs/xfs/xfs_healthmon.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2024-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_HEALTHMON_H__ +#define __XFS_HEALTHMON_H__ + +#ifdef CONFIG_XFS_HEALTH_MONITOR +long xfs_ioc_health_monitor(struct xfs_mount *mp, + struct xfs_health_monitor __user *arg); +#else +# define xfs_ioc_health_monitor(mp, hmo) (-ENOTTY) +#endif /* CONFIG_XFS_HEALTH_MONITOR */ + +#endif /* __XFS_HEALTHMON_H__ */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 092a3699ff9e75..6c7a30128c7bf6 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -42,6 +42,7 @@ #include "xfs_exchrange.h" #include "xfs_handle.h" #include "xfs_rtgroup.h" +#include "xfs_healthmon.h" #include <linux/mount.h> #include <linux/fileattr.h> @@ -1434,6 +1435,9 @@ xfs_file_ioctl( case XFS_IOC_MAP_FREESP: return xfs_ioc_map_freesp(filp, arg); + case XFS_IOC_HEALTH_MONITOR: + return xfs_ioc_health_monitor(mp, arg); + default: return -ENOTTY; } ^ permalink raw reply related [flat|nested] 110+ messages in thread
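Editor's sketch, not part of the posted patch: from userspace, the interface added above would be driven by opening any file on the filesystem, passing a zeroed struct xfs_health_monitor to XFS_IOC_HEALTH_MONITOR, and treating the ioctl's return value as a new read-only file descriptor. The sketch below assumes the struct layout and ioctl number exactly as they appear in this patch, copied locally because installed headers will not have them. The helper names are hypothetical. At this point in the series the validator rejects any nonzero flags, format, or padding, and read_iter still returns -EIO, so the sketch stops at obtaining the descriptor.

```c
/* Userspace sketch of requesting a health monitor fd; helper names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copy of the uapi proposed in this patch. */
struct xfs_health_monitor {
	uint64_t	flags;		/* flags */
	uint8_t		format;		/* output format */
	uint8_t		pad1[7];	/* zeroes */
	uint64_t	pad2[2];	/* zeroes */
};

#define XFS_IOC_HEALTH_MONITOR	_IOW('X', 68, struct xfs_health_monitor)

/* The ioctl number encodes the struct size, so the layout must not drift. */
static_assert(sizeof(struct xfs_health_monitor) == 32,
	      "unexpected xfs_health_monitor size");

/* Everything, padding included, must be zero to pass validation. */
static void health_monitor_init(struct xfs_health_monitor *hmo)
{
	memset(hmo, 0, sizeof(*hmo));
}

/*
 * Ask the filesystem containing @path for a health monitor fd.  Returns the
 * new fd, or -1 with errno set.  The caller needs CAP_SYS_ADMIN; a kernel
 * without this patch is expected to fail with ENOTTY.
 */
static int health_monitor_open(const char *path)
{
	struct xfs_health_monitor hmo;
	int fs_fd;
	int mon_fd;
	int saved_errno;

	fs_fd = open(path, O_RDONLY);
	if (fs_fd < 0)
		return -1;

	health_monitor_init(&hmo);
	mon_fd = ioctl(fs_fd, XFS_IOC_HEALTH_MONITOR, &hmo);

	/* The monitor fd is independent of the fd used to request it. */
	saved_errno = errno;
	close(fs_fd);
	errno = saved_errno;
	return mon_fd;
}
```

A later patch in the series defines XFS_HEALTH_MONITOR_FMT_JSON as 1. Whether callers must then set hmo.format before the ioctl is not visible in this excerpt, so the sketch leaves it at zero.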
* [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (7 preceding siblings ...) 2024-12-31 23:40 ` [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong @ 2024-12-31 23:41 ` Darrick J. Wong 2024-12-31 23:41 ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong ` (6 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create the basic infrastructure that we need to report health events to userspace. We need a compact form for recording critical information about an event and queueing them; a means to notice that we've lost some events; and a means to format the events into something that userspace can handle. Here, we've chosen json to export information to userspace. The structured key-value nature of json gives us enormous flexibility to modify the schema of what we'll send to userspace because we can add new keys at any time. Userspace can use whatever json parsers are available to consume the events and will not be confused by keys they don't recognize. Note that we do NOT allow sending json back to the kernel, nor is there any intent to do that. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_fs.h | 8 fs/xfs/libxfs/xfs_healthmon.schema.json | 63 ++++ fs/xfs/xfs_healthmon.c | 542 +++++++++++++++++++++++++++++++ fs/xfs/xfs_healthmon.h | 24 + fs/xfs/xfs_linux.h | 3 fs/xfs/xfs_trace.c | 2 fs/xfs/xfs_trace.h | 152 +++++++++ 7 files changed, 788 insertions(+), 6 deletions(-) create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index d1a81b02a1a3f3..d7404e6efd866d 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -1107,6 +1107,14 @@ struct xfs_health_monitor { __u64 pad2[2]; /* zeroes */ }; +/* Return all health status events, not just deltas */ +#define XFS_HEALTH_MONITOR_VERBOSE (1ULL << 0) + +#define XFS_HEALTH_MONITOR_ALL (XFS_HEALTH_MONITOR_VERBOSE) + +/* Return events in JSON format */ +#define XFS_HEALTH_MONITOR_FMT_JSON (1) + /* * ioctl commands that are used by Linux filesystems */ diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json new file mode 100644 index 00000000000000..9772efe25f193d --- /dev/null +++ b/fs/xfs/libxfs/xfs_healthmon.schema.json @@ -0,0 +1,63 @@ +{ + "$comment": [ + "SPDX-License-Identifier: GPL-2.0-or-later", + "Copyright (c) 2024-2025 Oracle. All Rights Reserved.", + "Author: Darrick J. Wong <djwong@kernel.org>", + "", + "This schema file describes the format of the json objects", + "readable from the fd returned by the XFS_IOC_HEALTH_MONITOR", + "ioctl."
+	],
+
+	"$schema": "https://json-schema.org/draft/2020-12/schema",
+	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
+
+	"title": "XFS Health Monitoring Events",
+
+	"$comment": "Events must be one of the following types:",
+	"oneOf": [
+		{
+			"$ref": "#/$events/lost"
+		}
+	],
+
+	"$comment": "Simple data types are defined here.",
+	"$defs": {
+		"time_ns": {
+			"title": "Time of Event",
+			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Event types are defined here.",
+	"$events": {
+		"lost": {
+			"title": "Health Monitoring Events Lost",
+			"$comment": [
+				"Previous health monitoring events were",
+				"dropped due to memory allocation failures",
+				"or queue limits."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "lost"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		}
+	}
+}
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index c5ce5699373c63..499f6aab9bdbf3 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -40,12 +40,417 @@
  * so that the queueing and processing of the events do not pin the mount and
  * cannot slow down the main filesystem. The healthmon object can exist past
  * the end of the filesystem mount.
+ *
+ * Please see the xfs_healthmon.schema.json file for a description of the
+ * format of the json events that are conveyed to userspace.
  */
 
+/* Allow this many events to build up in memory per healthmon fd. 
*/ +#define XFS_HEALTHMON_MAX_EVENTS \ + (32768 / sizeof(struct xfs_healthmon_event)) + +struct flag_string { + unsigned int mask; + const char *str; +}; + struct xfs_healthmon { + /* lock for mp and eventlist */ + struct mutex lock; + + /* waiter for signalling the arrival of events */ + struct wait_queue_head wait; + + /* list of event objects */ + struct xfs_healthmon_event *first_event; + struct xfs_healthmon_event *last_event; + struct xfs_mount *mp; + + /* number of events */ + unsigned int events; + + /* + * Buffer for formatting events. New buffer data are appended to the + * end of the seqbuf, and outpos is used to determine where to start + * a copy_iter. Both are protected by inode_lock. + */ + struct seq_buf outbuf; + size_t outpos; + + /* do we want all events? */ + bool verbose; + + /* did we lose an event? */ + bool lost_prev_event; }; +/* Remove an event from the head of the list. */ +static inline void +xfs_healthmon_free_head( + struct xfs_healthmon *hm, + struct xfs_healthmon_event *event) +{ + struct xfs_healthmon_event *head; + + mutex_lock(&hm->lock); + head = hm->first_event; + if (head != event) { + ASSERT(hm->first_event == event); + mutex_unlock(&hm->lock); + return; + } + + if (hm->last_event == head) + hm->last_event = NULL; + hm->first_event = head->next; + hm->events--; + mutex_unlock(&hm->lock); + + trace_xfs_healthmon_pop(hm->mp, head); + kfree(event); +} + +/* Push an event onto the end of the list. */ +static inline int +xfs_healthmon_push( + struct xfs_healthmon *hm, + struct xfs_healthmon_event *event) +{ + /* + * If the queue is already full, remember the fact that we lost events. + * This doesn't apply to "event lost" events; those always go through + * because there should only be one at the very end of the queue. 
+ */ + if (hm->events >= XFS_HEALTHMON_MAX_EVENTS && + event->type != XFS_HEALTHMON_LOST) { + trace_xfs_healthmon_lost_event(hm->mp); + hm->lost_prev_event = true; + return -ENOMEM; + } + + if (!hm->first_event) + hm->first_event = event; + if (hm->last_event) + hm->last_event->next = event; + hm->last_event = event; + event->next = NULL; + hm->events++; + wake_up(&hm->wait); + + trace_xfs_healthmon_push(hm->mp, event); + + return 0; +} + +/* Create a new event or record that we failed. */ +static struct xfs_healthmon_event * +xfs_healthmon_alloc( + struct xfs_healthmon *hm, + enum xfs_healthmon_type type, + enum xfs_healthmon_domain domain) +{ + struct timespec64 now; + struct xfs_healthmon_event *event; + + event = kzalloc(sizeof(*event), GFP_NOFS); + if (!event) { + trace_xfs_healthmon_lost_event(hm->mp); + hm->lost_prev_event = true; + return NULL; + } + + event->type = type; + event->domain = domain; + ktime_get_coarse_real_ts64(&now); + event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec; + + return event; +} + +/* + * Before we accept an event notification from a live update hook, we need to + * clear out any previously lost events. + */ +static inline int +xfs_healthmon_start_live_update( + struct xfs_healthmon *hm) +{ + struct xfs_healthmon_event *event; + + /* + * If we previously lost an event or the queue is full, try to queue + * a notification about lost events. + */ + if (!hm->lost_prev_event && hm->events != XFS_HEALTHMON_MAX_EVENTS) + return 0; + + /* + * A previous invocation of the live update hook could not allocate + * any memory at all. If the last event on the list is already a + * notification of lost events, we're done. + */ + if (hm->last_event && hm->last_event->type == XFS_HEALTHMON_LOST) + return 0; + + /* + * There are no events or the last one wasn't about lost events. Try + * to allocate a new one to note the lost events. 
+ */ + event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST, + XFS_HEALTHMON_MOUNT); + if (!event) + return -ENOMEM; + + hm->lost_prev_event = false; + xfs_healthmon_push(hm, event); + return 0; +} + +/* Render the health update type as a string. */ +STATIC const char * +xfs_healthmon_typestring( + const struct xfs_healthmon_event *event) +{ + static const char *type_strings[] = { + [XFS_HEALTHMON_LOST] = "lost", + }; + + if (event->type >= ARRAY_SIZE(type_strings)) + return "?"; + + return type_strings[event->type]; +} + +/* Render the health domain as a string. */ +STATIC const char * +xfs_healthmon_domstring( + const struct xfs_healthmon_event *event) +{ + static const char *dom_strings[] = { + [XFS_HEALTHMON_MOUNT] = "mount", + }; + + if (event->domain >= ARRAY_SIZE(dom_strings)) + return "?"; + + return dom_strings[event->domain]; +} + +/* Convert a flags bitmap into a jsonable string. */ +static inline int +xfs_healthmon_format_flags( + struct seq_buf *outbuf, + const struct flag_string *strings, + size_t nr_strings, + unsigned int flags) +{ + const struct flag_string *p; + ssize_t ret; + unsigned int i; + bool first = true; + + for (i = 0, p = strings; i < nr_strings; i++, p++) { + if (!(p->mask & flags)) + continue; + + ret = seq_buf_printf(outbuf, "%s\"%s\"", + first ? "" : ", ", p->str); + if (ret < 0) + return ret; + + first = false; + flags &= ~p->mask; + } + + for (i = 0; flags != 0 && i < sizeof(flags) * NBBY; i++) { + if (!(flags & (1U << i))) + continue; + + /* json doesn't support hexadecimal notation */ + ret = seq_buf_printf(outbuf, "%s%u", + first ? "" : ", ", (1U << i)); + if (ret < 0) + return ret; + + first = false; + } + + return 0; +} + +/* Convert the event mask into a jsonable string. 
*/ +static inline int +__xfs_healthmon_format_mask( + struct seq_buf *outbuf, + const char *descr, + const struct flag_string *strings, + size_t nr_strings, + unsigned int mask) +{ + ssize_t ret; + + ret = seq_buf_printf(outbuf, " \"%s\": [", descr); + if (ret < 0) + return ret; + + ret = xfs_healthmon_format_flags(outbuf, strings, nr_strings, mask); + if (ret < 0) + return ret; + + return seq_buf_printf(outbuf, "],\n"); +} + +#define xfs_healthmon_format_mask(o, d, s, m) \ + __xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m)) + +static inline void +xfs_healthmon_reset_outbuf( + struct xfs_healthmon *hm) +{ + hm->outpos = 0; + seq_buf_clear(&hm->outbuf); +} + +/* + * Format an event into json. Returns 0 if we formatted the event. If + * formatting the event overflows the buffer, returns -1 with the seqbuf len + * unchanged. + */ +STATIC int +xfs_healthmon_format( + struct xfs_healthmon *hm, + const struct xfs_healthmon_event *event) +{ + struct seq_buf *outbuf = &hm->outbuf; + size_t old_seqlen = outbuf->len; + int ret; + + trace_xfs_healthmon_format(hm->mp, event); + + ret = seq_buf_printf(outbuf, "{\n"); + if (ret < 0) + goto overrun; + + ret = seq_buf_printf(outbuf, " \"type\": \"%s\",\n", + xfs_healthmon_typestring(event)); + if (ret < 0) + goto overrun; + + ret = seq_buf_printf(outbuf, " \"domain\": \"%s\",\n", + xfs_healthmon_domstring(event)); + if (ret < 0) + goto overrun; + + switch (event->type) { + case XFS_HEALTHMON_LOST: + /* empty */ + break; + default: + break; + } + + switch (event->domain) { + case XFS_HEALTHMON_MOUNT: + /* empty */ + break; + } + if (ret < 0) + goto overrun; + + /* The last element in the json must not have a trailing comma. 
*/ + ret = seq_buf_printf(outbuf, " \"time_ns\": %llu\n", + event->time_ns); + if (ret < 0) + goto overrun; + + ret = seq_buf_printf(outbuf, "}\n"); + if (ret < 0) + goto overrun; + + ASSERT(!seq_buf_has_overflowed(outbuf)); + return 0; +overrun: + /* + * We overflowed the buffer and could not format the event. Reset the + * seqbuf and tell the caller not to delete the event. + */ + trace_xfs_healthmon_format_overflow(hm->mp, event); + outbuf->len = old_seqlen; + return -1; +} + +/* How many bytes are waiting in the outbuf to be copied? */ +static inline size_t +xfs_healthmon_outbuf_bytes( + struct xfs_healthmon *hm) +{ + unsigned int used = seq_buf_used(&hm->outbuf); + + if (used > hm->outpos) + return used - hm->outpos; + return 0; +} + +/* + * Do we have something for userspace to do? This can mean unmount events, + * events pending in the queue, or pending bytes in the outbuf. + */ +static inline bool +xfs_healthmon_has_eventdata( + struct xfs_healthmon *hm) +{ + return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0; +} + +/* Try to copy the rest of the outbuf to the iov iter. */ +STATIC ssize_t +xfs_healthmon_copybuf( + struct xfs_healthmon *hm, + struct iov_iter *to) +{ + size_t to_copy; + size_t w = 0; + + trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos); + + to_copy = xfs_healthmon_outbuf_bytes(hm); + if (to_copy) { + w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to); + if (!w) + return -EFAULT; + + hm->outpos += w; + } + + /* + * Nothing left to copy? Reset the seqbuf pointers and outbuf to the + * start since there's no live data in the buffer. + */ + if (xfs_healthmon_outbuf_bytes(hm) == 0) + xfs_healthmon_reset_outbuf(hm); + return w; +} + +/* + * See if there's an event waiting for us. If the fs is no longer mounted, + * don't bother sending any more events. 
+ */ +static inline struct xfs_healthmon_event * +xfs_healthmon_peek( + struct xfs_healthmon *hm) +{ + struct xfs_healthmon_event *event; + + mutex_lock(&hm->lock); + if (hm->mp) + event = hm->first_event; + else + event = NULL; + mutex_unlock(&hm->lock); + return event; +} + /* * Convey queued event data to userspace. First copy any remaining bytes in * the outbuf, then format the oldest event into the outbuf and copy that too. @@ -55,7 +460,112 @@ xfs_healthmon_read_iter( struct kiocb *iocb, struct iov_iter *to) { - return -EIO; + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct xfs_healthmon *hm = file->private_data; + struct xfs_healthmon_event *event; + size_t copied = 0; + ssize_t ret = 0; + + /* Wait for data to become available */ + if (!(file->f_flags & O_NONBLOCK)) { + ret = wait_event_interruptible(hm->wait, + xfs_healthmon_has_eventdata(hm)); + if (ret) + return ret; + } else if (!xfs_healthmon_has_eventdata(hm)) { + return -EAGAIN; + } + + /* Allocate formatting buffer up to 64k if necessary */ + if (hm->outbuf.size == 0) { + void *outbuf; + size_t bufsize = min(65536, max(PAGE_SIZE, + iov_iter_count(to))); + + outbuf = kzalloc(bufsize, GFP_KERNEL); + if (!outbuf) { + bufsize = PAGE_SIZE; + outbuf = kzalloc(bufsize, GFP_KERNEL); + if (!outbuf) + return -ENOMEM; + } + + inode_lock(inode); + if (hm->outbuf.size == 0) { + seq_buf_init(&hm->outbuf, outbuf, bufsize); + hm->outpos = 0; + } else { + kfree(outbuf); + } + } else { + inode_lock(inode); + } + + trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event); + + /* + * If there's anything left in the seqbuf, copy that before formatting + * more events. + */ + ret = xfs_healthmon_copybuf(hm, to); + if (ret < 0) + goto out_unlock; + copied += ret; + + while (iov_iter_count(to) > 0) { + /* Format the next events into the outbuf until it's full. 
*/ + while ((event = xfs_healthmon_peek(hm)) != NULL) { + ret = xfs_healthmon_format(hm, event); + if (ret < 0) + break; + xfs_healthmon_free_head(hm, event); + } + /* Copy it to userspace */ + ret = xfs_healthmon_copybuf(hm, to); + if (ret <= 0) + break; + + copied += ret; + } + +out_unlock: + trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event); + inode_unlock(inode); + return copied ?: ret; +} + +/* Poll for available events. */ +STATIC __poll_t +xfs_healthmon_poll( + struct file *file, + struct poll_table_struct *wait) +{ + struct xfs_healthmon *hm = file->private_data; + __poll_t mask = 0; + + poll_wait(file, &hm->wait, wait); + + if (xfs_healthmon_has_eventdata(hm)) + mask |= EPOLLIN; + return mask; +} + +/* Free all events */ +STATIC void +xfs_healthmon_free_events( + struct xfs_healthmon *hm) +{ + struct xfs_healthmon_event *event, *next; + + event = hm->first_event; + while (event != NULL) { + trace_xfs_healthmon_drop(hm->mp, event); + next = event->next; + kfree(event); + event = next; + } + hm->first_event = hm->last_event = NULL; } /* Free the health monitoring information. 
*/ @@ -66,6 +576,14 @@ xfs_healthmon_release( { struct xfs_healthmon *hm = file->private_data; + trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event); + + wake_up_all(&hm->wait); + + mutex_destroy(&hm->lock); + xfs_healthmon_free_events(hm); + if (hm->outbuf.size) + kfree(hm->outbuf.buffer); kfree(hm); return 0; @@ -76,9 +594,9 @@ static inline bool xfs_healthmon_validate( const struct xfs_health_monitor *hmo) { - if (hmo->flags) + if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL) return false; - if (hmo->format) + if (hmo->format != XFS_HEALTH_MONITOR_FMT_JSON) return false; if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1))) return false; @@ -90,6 +608,7 @@ xfs_healthmon_validate( static const struct file_operations xfs_healthmon_fops = { .owner = THIS_MODULE, .read_iter = xfs_healthmon_read_iter, + .poll = xfs_healthmon_poll, .release = xfs_healthmon_release, }; @@ -122,11 +641,18 @@ xfs_ioc_health_monitor( return -ENOMEM; hm->mp = mp; + seq_buf_init(&hm->outbuf, NULL, 0); + mutex_init(&hm->lock); + init_waitqueue_head(&hm->wait); + + if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE) + hm->verbose = true; + /* Set up VFS file and file descriptor. 
*/ name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id); if (!name) { ret = -ENOMEM; - goto out_hm; + goto out_mutex; } fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm, @@ -134,12 +660,16 @@ xfs_ioc_health_monitor( kvfree(name); if (fd < 0) { ret = fd; - goto out_hm; + goto out_mutex; } + trace_xfs_healthmon_create(mp, hmo.flags, hmo.format); + return fd; -out_hm: +out_mutex: + mutex_destroy(&hm->lock); + xfs_healthmon_free_events(hm); kfree(hm); return ret; } diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h index 07126e39281a0c..606f205074495c 100644 --- a/fs/xfs/xfs_healthmon.h +++ b/fs/xfs/xfs_healthmon.h @@ -6,6 +6,30 @@ #ifndef __XFS_HEALTHMON_H__ #define __XFS_HEALTHMON_H__ +enum xfs_healthmon_type { + XFS_HEALTHMON_LOST, /* message lost */ +}; + +enum xfs_healthmon_domain { + XFS_HEALTHMON_MOUNT, /* affects the whole fs */ +}; + +struct xfs_healthmon_event { + struct xfs_healthmon_event *next; + + enum xfs_healthmon_type type; + enum xfs_healthmon_domain domain; + + uint64_t time_ns; + + union { + /* mount */ + struct { + unsigned int flags; + }; + }; +}; + #ifdef CONFIG_XFS_HEALTH_MONITOR long xfs_ioc_health_monitor(struct xfs_mount *mp, struct xfs_health_monitor __user *arg); diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index 9a2221b4aa21ed..d13a5fa2d652ff 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -63,6 +63,9 @@ typedef __u32 xfs_nlink_t; #include <linux/xattr.h> #include <linux/mnt_idmapping.h> #include <linux/debugfs.h> +#ifdef CONFIG_XFS_HEALTH_MONITOR +# include <linux/seq_buf.h> +#endif #include <asm/page.h> #include <asm/div64.h> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 555fe76b4d853c..41a2ac85dc5fdf 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -52,6 +52,8 @@ #include "xfs_zone_alloc.h" #include "xfs_zone_priv.h" #include "xfs_fsrefs.h" +#include "xfs_health.h" +#include "xfs_healthmon.h" /* * We include this last to have the helpers above available for the 
trace diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 76f5d78b6a6e09..bd3b007d213fc6 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -106,6 +106,8 @@ struct xfs_open_zone; struct xfs_fsrefs; struct xfs_fsrefs_irec; struct xfs_rtgroup; +struct xfs_healthmon_event; +struct xfs_health_update_params; #define XFS_ATTR_FILTER_FLAGS \ { XFS_ATTR_ROOT, "ROOT" }, \ @@ -6077,6 +6079,156 @@ TRACE_EVENT(xfs_growfs_check_rtgeom, ); #endif /* CONFIG_XFS_RT */ +#ifdef CONFIG_XFS_HEALTH_MONITOR +TRACE_EVENT(xfs_healthmon_lost_event, + TP_PROTO(const struct xfs_mount *mp), + TP_ARGS(mp), + TP_STRUCT__entry( + __field(dev_t, dev) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + ), + TP_printk("dev %d:%d", + MAJOR(__entry->dev), MINOR(__entry->dev)) +); + +#define XFS_HEALTHMON_FLAGS_STRINGS \ + { XFS_HEALTH_MONITOR_VERBOSE, "verbose" } +#define XFS_HEALTHMON_FMT_STRINGS \ + { XFS_HEALTH_MONITOR_FMT_JSON, "json" } + +TRACE_EVENT(xfs_healthmon_create, + TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format), + TP_ARGS(mp, flags, format), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(u64, flags) + __field(u8, format) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->flags = flags; + __entry->format = format; + ), + TP_printk("dev %d:%d flags %s format %s", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS), + __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS)) +); + +TRACE_EVENT(xfs_healthmon_copybuf, + TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov, + const struct seq_buf *seqbuf, size_t outpos), + TP_ARGS(mp, iov, seqbuf, outpos), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(size_t, seqbuf_size) + __field(size_t, seqbuf_len) + __field(size_t, outpos) + __field(size_t, to_copy) + __field(size_t, iter_count) + ), + TP_fast_assign( + __entry->dev = mp ? 
mp->m_super->s_dev : 0; + __entry->seqbuf_size = seqbuf->size; + __entry->seqbuf_len = seqbuf->len; + __entry->outpos = outpos; + __entry->to_copy = seqbuf->len - outpos; + __entry->iter_count = iov_iter_count(iov); + ), + TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->seqbuf_size, + __entry->seqbuf_len, + __entry->outpos, + __entry->to_copy, + __entry->iter_count) +); + +DECLARE_EVENT_CLASS(xfs_healthmon_class, + TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev), + TP_ARGS(mp, events, lost_prev), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(unsigned int, events) + __field(bool, lost_prev) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->events = events; + __entry->lost_prev = lost_prev; + ), + TP_printk("dev %d:%d events %u lost_prev? %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->events, + __entry->lost_prev) +); +#define DEFINE_HEALTHMON_EVENT(name) \ +DEFINE_EVENT(xfs_healthmon_class, name, \ + TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev), \ + TP_ARGS(mp, events, lost_prev)) +DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start); +DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish); +DEFINE_HEALTHMON_EVENT(xfs_healthmon_release); +DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount); + +#define XFS_HEALTHMON_TYPE_STRINGS \ + { XFS_HEALTHMON_LOST, "lost" } + +#define XFS_HEALTHMON_DOMAIN_STRINGS \ + { XFS_HEALTHMON_MOUNT, "mount" } + +TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST); + +TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT); + +DECLARE_EVENT_CLASS(xfs_healthmon_event_class, + TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), + TP_ARGS(mp, event), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(unsigned int, type) + __field(unsigned int, domain) + __field(unsigned int, mask) + __field(unsigned long long, ino) + __field(unsigned int, gen) + __field(unsigned int, 
group) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->type = event->type; + __entry->domain = event->domain; + __entry->mask = 0; + __entry->group = 0; + __entry->ino = 0; + __entry->gen = 0; + switch (__entry->domain) { + case XFS_HEALTHMON_MOUNT: + __entry->mask = event->flags; + break; + } + ), + TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS), + __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS), + __entry->mask, + __entry->ino, + __entry->gen, + __entry->group) +); +#define DEFINE_HEALTHMONEVENT_EVENT(name) \ +DEFINE_EVENT(xfs_healthmon_event_class, name, \ + TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \ + TP_ARGS(mp, event)) +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push); +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop); +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format); +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow); +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop); +#endif /* CONFIG_XFS_HEALTH_MONITOR */ + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH ^ permalink raw reply related [flat|nested] 110+ messages in thread
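
The patch above renders a flags bitmap as a JSON array: named bits become
quoted strings, leftover unnamed bits become decimal numbers (json has no
hexadecimal notation), and a formatting overflow aborts the whole record.
The following is only a userspace sketch of that idea, not the kernel code:
it swaps seq_buf for vsnprintf on a caller-supplied buffer, and the names
struct flag_name, buf_append and format_flags_json are hypothetical,
introduced purely for illustration.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the patch's struct flag_string. */
struct flag_name {
	unsigned int	mask;
	const char	*str;
};

/*
 * Append formatted text at *pos.  Returns 0 on success, or -1 if the text
 * does not fit, in which case the buffer is rolled back to its old length.
 * Callers must ensure *pos < size.
 */
static int buf_append(char *buf, size_t size, size_t *pos, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(buf + *pos, size - *pos, fmt, ap);
	va_end(ap);

	if (n < 0 || (size_t)n >= size - *pos) {
		buf[*pos] = '\0';	/* undo the partial write */
		return -1;
	}
	*pos += (size_t)n;
	return 0;
}

/* Render a flags bitmap as a JSON array; returns 0 or -1 on overflow. */
static int format_flags_json(char *buf, size_t size,
			     const struct flag_name *names, size_t nr_names,
			     unsigned int flags)
{
	const char *sep = "";
	size_t pos = 0;
	size_t i;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	if (buf_append(buf, size, &pos, "["))
		return -1;

	/* Known bits are emitted as quoted strings. */
	for (i = 0; i < nr_names; i++) {
		if (!(names[i].mask & flags))
			continue;
		if (buf_append(buf, size, &pos, "%s\"%s\"", sep, names[i].str))
			return -1;
		sep = ", ";
		flags &= ~names[i].mask;
	}

	/* Unknown leftover bits are emitted as decimal numbers. */
	for (i = 0; flags != 0 && i < sizeof(flags) * 8; i++) {
		unsigned int bit = 1U << i;

		if (!(flags & bit))
			continue;
		if (buf_append(buf, size, &pos, "%s%u", sep, bit))
			return -1;
		sep = ", ";
		flags &= ~bit;
	}

	return buf_append(buf, size, &pos, "]");
}
```

With a hypothetical table of { 0x1, "agf" }, { 0x2, "agfl" }, { 0x4, "agi" }
and flags 0x15, the sketch emits the two matching names followed by the
leftover bit as the number 16. Emitting unknown bits numerically keeps the
output valid json even when the name table lags behind the flag definitions.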
* [PATCH 10/16] xfs: report metadata health events through healthmon
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (8 preceding siblings ...)
2024-12-31 23:41 ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2024-12-31 23:41 ` Darrick J. Wong
2024-12-31 23:41 ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong
` (5 subsequent siblings)
15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information. The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_healthmon.schema.json |  328 ++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  397 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h                  |   30 ++
 fs/xfs/xfs_trace.h                      |   97 +++++++-
 4 files changed, 846 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 9772efe25f193d..154ea0228a3615 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -18,6 +18,18 @@
 	"oneOf": [
 		{
 			"$ref": "#/$events/lost"
+		},
+		{
+			"$ref": "#/$events/fs_metadata"
+		},
+		{
+			"$ref": "#/$events/rtgroup_metadata"
+		},
+		{
+			"$ref": "#/$events/perag_metadata"
+		},
+		{
+			"$ref": "#/$events/inode_metadata"
 		}
 	],
 
@@ -27,6 +39,169 @@
 			"title": "Time of Event",
 			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
 			"type": "integer"
+		},
+		"xfs_agnumber_t": {
+			"description": "Allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_rgnumber_t": 
{ + "description": "Realtime allocation group number", + "type": "integer", + "minimum": 0, + "maximum": 2147483647 + }, + "xfs_ino_t": { + "description": "Inode number", + "type": "integer", + "minimum": 1 + }, + "i_generation": { + "description": "Inode generation number", + "type": "integer" + } + }, + + "$comment": "Filesystem metadata event data are defined here.", + "$metadata": { + "status": { + "description": "Metadata health status", + "$comment": [ + "One of:", + "", + " * sick: metadata corruption discovered", + " during a runtime operation.", + " * corrupt: corruption discovered during", + " an xfs_scrub run.", + " * healthy: metadata object was found to be", + " ok by xfs_scrub." + ], + "enum": [ + "sick", + "corrupt", + "healthy" + ] + }, + "fs": { + "description": [ + "Metadata structures that affect the entire", + "filesystem. Options include:", + "", + " * fscounters: summary counters", + " * usrquota: user quota records", + " * grpquota: group quota records", + " * prjquota: project quota records", + " * quotacheck: quota counters", + " * nlinks: file link counts", + " * metadir: metadata directory", + " * metapath: metadata inode paths" + ], + "enum": [ + "fscounters", + "grpquota", + "metadir", + "metapath", + "nlinks", + "prjquota", + "quotacheck", + "usrquota" + ] + }, + "perag": { + "description": [ + "Metadata structures owned by allocation", + "groups on the data device. 
Options include:", + "", + " * agf: group space header", + " * agfl: per-group free block list", + " * agi: group inode header", + " * bnobt: free space by position btree", + " * cntbt: free space by length btree", + " * finobt: free inode btree", + " * inobt: inode btree", + " * rmapbt: reverse mapping btree", + " * refcountbt: reference count btree", + " * inodes: problems were recorded for", + " this group's inodes, but the", + " inodes themselves had to be", + " reclaimed.", + " * super: superblock" + ], + "enum": [ + "agf", + "agfl", + "agi", + "bnobt", + "cntbt", + "finobt", + "inobt", + "inodes", + "refcountbt", + "rmapbt", + "super" + ] + }, + "rtgroup": { + "description": [ + "Metadata structures owned by allocation", + "groups on the realtime volume. Options", + "include:", + "", + " * bitmap: free space bitmap contents", + " for this group", + " * summary: realtime free space summary file", + " * rmapbt: reverse mapping btree", + " * refcountbt: reference count btree", + " * super: group superblock" + ], + "enum": [ + "bitmap", + "summary", + "refcountbt", + "rmapbt", + "super" + ] + }, + "inode": { + "description": [ + "Metadata structures owned by file inodes.", + "Options include:", + "", + " * bmapbta: attr fork", + " * bmapbtc: cow fork", + " * bmapbtd: data fork", + " * core: inode record", + " * directory: directory entries", + " * dirtree: directory tree problems detected", + " * parent: directory parent pointer", + " * symlink: symbolic link target", + " * xattr: extended attributes", + "", + "These are set when an inode record repair had", + "to drop the corresponding data structure to", + "get the inode back to a consistent state.", + "", + " * bmapbtd_zapped", + " * bmapbta_zapped", + " * directory_zapped", + " * symlink_zapped" + ], + "enum": [ + "bmapbta", + "bmapbta_zapped", + "bmapbtc", + "bmapbtd", + "bmapbtd_zapped", + "core", + "directory", + "directory_zapped", + "dirtree", + "parent", + "symlink", + "symlink_zapped", + "xattr" + ] } 
}, @@ -58,6 +233,159 @@ "time_ns", "domain" ] + }, + "fs_metadata": { + "title": "Filesystem-wide metadata event", + "description": [ + "Health status updates for filesystem-wide", + "metadata objects." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "fs" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/fs" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "structures" + ] + }, + "perag_metadata": { + "title": "Data device allocation group metadata event", + "description": [ + "Health status updates for data device ", + "allocation group metadata." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "perag" + }, + "group": { + "$ref": "#/$defs/xfs_agnumber_t" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/perag" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "group", + "structures" + ] + }, + "rtgroup_metadata": { + "title": "Realtime allocation group metadata event", + "description": [ + "Health status updates for realtime allocation", + "group metadata." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "rtgroup" + }, + "group": { + "$ref": "#/$defs/xfs_rgnumber_t" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/rtgroup" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "group", + "structures" + ] + }, + "inode_metadata": { + "title": "Inode metadata event", + "description": [ + "Health status updates for inode metadata.", + "The inode and generation number describe the", + "file that is affected by the change." 
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "inode"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/inode"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"structures"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 499f6aab9bdbf3..9d34a826726e3e 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -18,6 +18,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
+#include "xfs_health.h"
 #include "xfs_healthmon.h"
 
 #include <linux/anon_inodes.h>
@@ -65,8 +66,15 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*first_event;
 	struct xfs_healthmon_event	*last_event;
 
+	/* live update hooks */
+	struct xfs_health_hook	hhook;
+
+	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount	*mp;
 
+	/* filesystem type for safe cleanup of hooks; requires module_get */
+	struct file_system_type	*fstyp;
+
 	/* number of events */
 	unsigned int	events;
 
@@ -178,6 +186,10 @@ xfs_healthmon_start_live_update(
 {
 	struct xfs_healthmon_event	*event;
 
+	/* Already unmounted filesystem, do nothing. */
+	if (!hm->mp)
+		return -ESHUTDOWN;
+
 	/*
 	 * If we previously lost an event or the queue is full, try to queue
 	 * a notification about lost events.
@@ -207,6 +219,171 @@ xfs_healthmon_start_live_update(
 	return 0;
 }
 
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+	struct xfs_healthmon	*hm,
+	enum xfs_health_update_type	type,
+	const struct xfs_health_update_params	*hup,
+	unsigned int	*mask)
+{
+	/* Always report unmounts. */
+	if (type == XFS_HEALTHUP_UNMOUNT)
+		return true;
+
+	/* If we want all events, return all events. */
+	if (hm->verbose) {
+		*mask = hup->new_mask;
+		return true;
+	}
+
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		/* Always report runtime corruptions */
+		*mask = hup->new_mask;
+		break;
+	case XFS_HEALTHUP_CORRUPT:
+		/* Only report new fsck errors */
+		*mask = hup->new_mask & ~hup->old_mask;
+		break;
+	case XFS_HEALTHUP_HEALTHY:
+		/* Only report healthy metadata that got fixed */
+		*mask = hup->new_mask & hup->old_mask;
+		break;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* This is here for static enum checking */
+		break;
+	}
+
+	/* If not in verbose mode, mask state has to change. */
+	return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+	enum xfs_health_update_type	type)
+{
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		return XFS_HEALTHMON_SICK;
+	case XFS_HEALTHUP_CORRUPT:
+		return XFS_HEALTHMON_CORRUPT;
+	case XFS_HEALTHUP_HEALTHY:
+		return XFS_HEALTHMON_HEALTHY;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+	enum xfs_health_update_domain	domain)
+{
+	switch (domain) {
+	case XFS_HEALTHUP_FS:
+		return XFS_HEALTHMON_FS;
+	case XFS_HEALTHUP_AG:
+		return XFS_HEALTHMON_AG;
+	case XFS_HEALTHUP_RTGROUP:
+		return XFS_HEALTHMON_RTGROUP;
+	case XFS_HEALTHUP_INODE:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+	struct notifier_block	*nb,
+	unsigned long	action,
+	void	*data)
+{
+	struct xfs_health_update_params	*hup = data;
+	struct xfs_healthmon	*hm;
+	struct xfs_healthmon_event	*event;
+	enum xfs_health_update_type	type = action;
+	unsigned int	mask = 0;
+	int	error;
+
+	hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+	/* Decode event mask and skip events we don't care about. */
+	if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+		return NOTIFY_DONE;
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (type == XFS_HEALTHUP_UNMOUNT) {
+		/*
+		 * The filesystem is unmounting, so we must detach from the
+		 * mount. After this point, the healthmon thread has no
+		 * connection to the mounted filesystem.
+		 */
+		trace_xfs_healthmon_unmount(hm->mp, hm->events,
+				hm->lost_prev_event);
+		hm->mp = NULL;
+		wake_up(&hm->wait);
+		goto out_unlock;
+	}
+
+	event = xfs_healthmon_alloc(hm, health_update_to_type(type),
+			health_update_to_domain(hup->domain));
+	if (!event)
+		goto out_unlock;
+
+	/* Ignore the event if it's only reporting a secondary health state. */
+	switch (event->domain) {
+	case XFS_HEALTHMON_FS:
+		event->fsmask = mask & ~XFS_SICK_FS_SECONDARY;
+		if (!event->fsmask)
+			goto out_event;
+		break;
+	case XFS_HEALTHMON_AG:
+		event->grpmask = mask & ~XFS_SICK_AG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		event->grpmask = mask & ~XFS_SICK_RG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		event->imask = mask & ~XFS_SICK_INO_SECONDARY;
+		if (!event->imask)
+			goto out_event;
+		event->ino = hup->ino;
+		event->gen = hup->gen;
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		goto out_event;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+out_event:
+	kfree(event);
+	goto out_unlock;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -214,6 +391,10 @@ xfs_healthmon_typestring(
 {
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_LOST]	= "lost",
+		[XFS_HEALTHMON_UNMOUNT]	= "unmount",
+		[XFS_HEALTHMON_SICK]	= "sick",
+		[XFS_HEALTHMON_CORRUPT]	= "corrupt",
+		[XFS_HEALTHMON_HEALTHY]	= "healthy",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -229,6 +410,10 @@ xfs_healthmon_domstring(
 {
 	static const char *dom_strings[] = {
 		[XFS_HEALTHMON_MOUNT]	= "mount",
+		[XFS_HEALTHMON_FS]	= "fs",
+		[XFS_HEALTHMON_AG]	= "perag",
+		[XFS_HEALTHMON_INODE]	= "inode",
+		[XFS_HEALTHMON_RTGROUP]	= "rtgroup",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -254,6 +439,11 @@ xfs_healthmon_format_flags(
 		if (!(p->mask & flags))
 			continue;
 
+		if (!p->str) {
+			flags &= ~p->mask;
+			continue;
+		}
+
 		ret = seq_buf_printf(outbuf, "%s\"%s\"",
 				first ? "" : ", ", p->str);
 		if (ret < 0)
@@ -304,6 +494,118 @@ __xfs_healthmon_format_mask(
 #define xfs_healthmon_format_mask(o, d, s, m) \
 	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
 
+/* Render fs sickness mask as a string set */
+static int
+xfs_healthmon_format_fs(
+	struct seq_buf	*outbuf,
+	const struct xfs_healthmon_event	*event)
+{
+	static const struct flag_string mask_strings[] = {
+		{ XFS_SICK_FS_COUNTERS,	"fscounters" },
+		{ XFS_SICK_FS_UQUOTA,	"usrquota" },
+		{ XFS_SICK_FS_GQUOTA,	"grpquota" },
+		{ XFS_SICK_FS_PQUOTA,	"prjquota" },
+		{ XFS_SICK_FS_QUOTACHECK,	"quotacheck" },
+		{ XFS_SICK_FS_NLINKS,	"nlinks" },
+		{ XFS_SICK_FS_METADIR,	"metadir" },
+		{ XFS_SICK_FS_METAPATH,	"metapath" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->fsmask);
+}
+
+/* Render rtgroup sickness mask as a string set */
+static int
+xfs_healthmon_format_rtgroup(
+	struct seq_buf	*outbuf,
+	const struct xfs_healthmon_event	*event)
+{
+	static const struct flag_string mask_strings[] = {
+		{ XFS_SICK_RG_SUPER,	"super" },
+		{ XFS_SICK_RG_BITMAP,	"bitmap" },
+		{ XFS_SICK_RG_SUMMARY,	"summary" },
+		{ XFS_SICK_RG_RMAPBT,	"rmapbt" },
+		{ XFS_SICK_RG_REFCNTBT,	"refcountbt" },
+	};
+	ssize_t	ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, " \"group\": %u,\n",
+			event->group);
+}
+
+/* Render perag sickness mask as a string set */
+static int
+xfs_healthmon_format_ag(
+	struct seq_buf	*outbuf,
+	const struct xfs_healthmon_event	*event)
+{
+	static const struct flag_string mask_strings[] = {
+		{ XFS_SICK_AG_SB,	"super" },
+		{ XFS_SICK_AG_AGF,	"agf" },
+		{ XFS_SICK_AG_AGFL,	"agfl" },
+		{ XFS_SICK_AG_AGI,	"agi" },
+		{ XFS_SICK_AG_BNOBT,	"bnobt" },
+		{ XFS_SICK_AG_CNTBT,	"cntbt" },
+		{ XFS_SICK_AG_INOBT,	"inobt" },
+		{ XFS_SICK_AG_FINOBT,	"finobt" },
+		{ XFS_SICK_AG_RMAPBT,	"rmapbt" },
+		{ XFS_SICK_AG_REFCNTBT,	"refcountbt" },
+		{ XFS_SICK_AG_INODES,	"inodes" },
+	};
+	ssize_t	ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, " \"group\": %u,\n",
+			event->group);
+}
+
+/* Render inode sickness mask as a string set */
+static int
+xfs_healthmon_format_inode(
+	struct seq_buf	*outbuf,
+	const struct xfs_healthmon_event	*event)
+{
+	static const struct flag_string mask_strings[] = {
+		{ XFS_SICK_INO_CORE,	"core" },
+		{ XFS_SICK_INO_BMBTD,	"bmapbtd" },
+		{ XFS_SICK_INO_BMBTA,	"bmapbta" },
+		{ XFS_SICK_INO_BMBTC,	"bmapbtc" },
+		{ XFS_SICK_INO_DIR,	"directory" },
+		{ XFS_SICK_INO_XATTR,	"xattr" },
+		{ XFS_SICK_INO_SYMLINK,	"symlink" },
+		{ XFS_SICK_INO_PARENT,	"parent" },
+		{ XFS_SICK_INO_BMBTD_ZAPPED,	"bmapbtd_zapped" },
+		{ XFS_SICK_INO_BMBTA_ZAPPED,	"bmapbta_zapped" },
+		{ XFS_SICK_INO_DIR_ZAPPED,	"directory_zapped" },
+		{ XFS_SICK_INO_SYMLINK_ZAPPED,	"symlink_zapped" },
+		{ XFS_SICK_INO_FORGET,	NULL, },
+		{ XFS_SICK_INO_DIRTREE,	"dirtree" },
+	};
+	ssize_t	ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			event->imask);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, " \"inumber\": %llu,\n",
+			event->ino);
+	if (ret < 0)
+		return ret;
+	return seq_buf_printf(outbuf, " \"generation\": %u,\n",
+			event->gen);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon	*hm)
@@ -354,6 +656,18 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_MOUNT:
 		/* empty */
 		break;
+	case XFS_HEALTHMON_FS:
+		ret = xfs_healthmon_format_fs(outbuf, event);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		ret = xfs_healthmon_format_rtgroup(outbuf, event);
+		break;
+	case XFS_HEALTHMON_AG:
+		ret = xfs_healthmon_format_ag(outbuf, event);
+		break;
+	case XFS_HEALTHMON_INODE:
+		ret = xfs_healthmon_format_inode(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -400,7 +714,7 @@ static inline bool
 xfs_healthmon_has_eventdata(
 	struct xfs_healthmon	*hm)
 {
-	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+	return !hm->mp || hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
 }
 
 /* Try to copy the rest of the outbuf to the iov iter. */
@@ -521,6 +835,7 @@ xfs_healthmon_read_iter(
 			break;
 		xfs_healthmon_free_head(hm, event);
 	}
+
 	/* Copy it to userspace */
 	ret = xfs_healthmon_copybuf(hm, to);
 	if (ret <= 0)
@@ -568,6 +883,58 @@ xfs_healthmon_free_events(
 	hm->first_event = hm->last_event = NULL;
 }
 
+/*
+ * Detach all filesystem hooks that were set up for a health monitor. Only
+ * call this from iterate_super*.
+ */
+STATIC void
+xfs_healthmon_detach_hooks(
+	struct super_block	*sb,
+	void	*arg)
+{
+	struct xfs_healthmon	*hm = arg;
+
+	mutex_lock(&hm->lock);
+
+	/*
+	 * Because health monitors have a weak reference to the filesystem
+	 * they're monitoring, the hook deletions below must not race against
+	 * that filesystem being unmounted because that could lead to UAF
+	 * errors.
+	 *
+	 * If hm->mp is NULL, the health unmount hook already ran and the hook
+	 * chain head (contained within the xfs_mount structure) is gone. Do
+	 * not detach any hooks; just let them get freed when the healthmon
+	 * object is torn down.
+	 */
+	if (!hm->mp)
+		goto out_unlock;
+
+	/*
+	 * Otherwise, the caller gave us a non-dying @sb with s_umount held in
+	 * shared mode, which means that @sb cannot be running through
+	 * deactivate_locked_super and cannot be freed. It's safe to compare
+	 * @sb against the super that we snapshotted when we set up the health
+	 * monitor.
+	 */
+	if (hm->mp->m_super != sb)
+		goto out_unlock;
+
+	mutex_unlock(&hm->lock);
+
+	/*
+	 * Now we know that the filesystem @hm->mp is active and cannot be
+	 * deactivated until this function returns. Unmount events are sent
+	 * through the health monitoring subsystem from xfs_fs_put_super, so
+	 * it is now time to detach the hooks.
+	 */
+	xfs_health_hook_del(hm->mp, &hm->hhook);
+	return;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+}
+
 /* Free the health monitoring information. */
 STATIC int
 xfs_healthmon_release(
@@ -580,6 +947,9 @@ xfs_healthmon_release(
 
 	wake_up_all(&hm->wait);
 
+	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_health_hook_disable();
+
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	if (hm->outbuf.size)
@@ -641,6 +1011,13 @@ xfs_ioc_health_monitor(
 		return -ENOMEM;
 	hm->mp = mp;
 
+	/*
+	 * Since we already got a ref to the module, take a reference to the
+	 * fstype to make it easier to detach the hooks when we tear things
+	 * down later.
+	 */
+	hm->fstyp = mp->m_super->s_type;
+
 	seq_buf_init(&hm->outbuf, NULL, 0);
 	mutex_init(&hm->lock);
 	init_waitqueue_head(&hm->wait);
@@ -648,11 +1025,20 @@ xfs_ioc_health_monitor(
 	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
 		hm->verbose = true;
 
+	/* Enable hooks to receive events, generally. */
+	xfs_health_hook_enable();
+
+	/* Attach specific event hooks to this monitor. */
+	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+	ret = xfs_health_hook_add(mp, &hm->hhook);
+	if (ret)
+		goto out_hooks;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -660,14 +1046,17 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
-out_mutex:
+out_healthhook:
+	xfs_health_hook_del(mp, &hm->hhook);
+out_hooks:
+	xfs_health_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 606f205074495c..3ece61165837b2 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -8,10 +8,22 @@
 
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
+
+	/* metadata health events */
+	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
+	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
 };
 
 enum xfs_healthmon_domain {
 	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+
+	/* metadata health events */
+	XFS_HEALTHMON_FS,	/* main filesystem metadata */
+	XFS_HEALTHMON_AG,	/* allocation group metadata */
+	XFS_HEALTHMON_INODE,	/* inode metadata */
+	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
 };
 
 struct xfs_healthmon_event {
@@ -27,6 +39,24 @@ struct xfs_healthmon_event {
 		struct {
 			unsigned int	flags;
 		};
+		/* fs/rt metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	fsmask;
+		};
+		/* ag/rtgroup metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	grpmask;
+			unsigned int	group;
+		};
+		/* inode metadata */
+		struct {
+			/* XFS_SICK_INO_* flags */
+			unsigned int	imask;
+			uint32_t	gen;
+			xfs_ino_t	ino;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index bd3b007d213fc6..4a68d2ec8d0a34 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6174,14 +6174,30 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
 #define XFS_HEALTHMON_TYPE_STRINGS \
-	{ XFS_HEALTHMON_LOST,	"lost" }
+	{ XFS_HEALTHMON_LOST,	"lost" }, \
+	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHMON_SICK,	"sick" }, \
+	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHMON_HEALTHY,	"healthy" }
 
 #define XFS_HEALTHMON_DOMAIN_STRINGS \
-	{ XFS_HEALTHMON_MOUNT,	"mount" }
+	{ XFS_HEALTHMON_MOUNT,	"mount" }, \
+	{ XFS_HEALTHMON_FS,	"fs" }, \
+	{ XFS_HEALTHMON_AG,	"ag" }, \
+	{ XFS_HEALTHMON_INODE,	"inode" }, \
+	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
 
 DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6207,6 +6223,19 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 		case XFS_HEALTHMON_MOUNT:
 			__entry->mask = event->flags;
 			break;
+		case XFS_HEALTHMON_FS:
+			__entry->mask = event->fsmask;
+			break;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			__entry->mask = event->grpmask;
+			__entry->group = event->group;
+			break;
+		case XFS_HEALTHMON_INODE:
+			__entry->mask = event->imask;
+			__entry->ino = event->ino;
+			__entry->gen = event->gen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x",
@@ -6227,6 +6256,70 @@ DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+
+#define XFS_HEALTHUP_TYPE_STRINGS \
+	{ XFS_HEALTHUP_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHUP_SICK,	"sick" }, \
+	{ XFS_HEALTHUP_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHUP_HEALTHY,	"healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+	{ XFS_HEALTHUP_FS,	"fs" }, \
+	{ XFS_HEALTHUP_AG,	"ag" }, \
+	{ XFS_HEALTHUP_INODE,	"inode" }, \
+	{ XFS_HEALTHUP_RTGROUP,	"rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+		 const struct xfs_health_update_params *update,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, type, update, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, old_mask)
+		__field(unsigned int, new_mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = type;
+		__entry->domain = update->domain;
+		__entry->old_mask = update->old_mask;
+		__entry->new_mask = update->new_mask;
+		__entry->ino = update->ino;
+		__entry->gen = update->gen;
+		__entry->group = update->group;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+		  __entry->old_mask,
+		  __entry->new_mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */

^ permalink raw reply related [flat|nested] 110+ messages in thread
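For anyone skimming the filtering rules in xfs_healthmon_event_mask() above, here is a minimal userspace sketch of the same decision table. It is only an illustration, not the kernel code: the demo_* enum and function names are made up for this note, and the sketch zeroes *mask itself, whereas the patch relies on the caller initializing it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's xfs_health_update_type values. */
enum demo_update_type {
	DEMO_UNMOUNT,
	DEMO_SICK,
	DEMO_CORRUPT,
	DEMO_HEALTHY,
};

/*
 * Decide whether an update should be reported and compute the bits to
 * report.  Returns true if an event should be queued.
 */
static bool
demo_event_mask(
	enum demo_update_type	type,
	bool			verbose,
	unsigned int		old_mask,
	unsigned int		new_mask,
	unsigned int		*mask)
{
	*mask = 0;

	/* Unmounts are always reported and carry no mask. */
	if (type == DEMO_UNMOUNT)
		return true;

	/* Verbose monitors see every update unfiltered. */
	if (verbose) {
		*mask = new_mask;
		return true;
	}

	switch (type) {
	case DEMO_SICK:
		/* Runtime corruption: report everything in the update. */
		*mask = new_mask;
		break;
	case DEMO_CORRUPT:
		/* fsck findings: only bits that were not already set. */
		*mask = new_mask & ~old_mask;
		break;
	case DEMO_HEALTHY:
		/* Repairs: only bits that were previously set. */
		*mask = new_mask & old_mask;
		break;
	case DEMO_UNMOUNT:
		break;
	}

	/* Outside verbose mode, an empty mask means nothing to report. */
	return *mask != 0;
}
```

With old_mask 0x3 and new_mask 0x7, a DEMO_CORRUPT update in non-verbose mode reports only 0x4; if both masks are 0x7, nothing is reported.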
* [PATCH 11/16] xfs: report shutdown events through healthmon
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (9 preceding siblings ...)
2024-12-31 23:41 ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2024-12-31 23:41 ` Darrick J. Wong
2024-12-31 23:41 ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong
` (4 subsequent siblings)
15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_healthmon.schema.json |   62 +++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |   77 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_healthmon.h                  |    3 +
 fs/xfs/xfs_trace.h                      |   25 ++++++++++
 4 files changed, 165 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 154ea0228a3615..a8bc75b0b8c4f9 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -30,6 +30,9 @@
 		},
 		{
 			"$ref": "#/$events/inode_metadata"
+		},
+		{
+			"$ref": "#/$events/shutdown"
 		}
 	],
 
@@ -205,6 +208,31 @@
 		}
 	},
 
+	"$comment": "Shutdown event data are defined here.",
+	"$shutdown": {
+		"reason": {
+			"description": [
+				"Reason for a filesystem to shut down.",
+				"Options include:",
+				"",
+				" * corrupt_incore: in-memory corruption",
+				" * corrupt_ondisk: on-disk corruption",
+				" * device_removed: device removed",
+				" * force_umount: userspace asked for it",
+				" * log_ioerr: log write IO error",
+				" * meta_ioerr: metadata writeback IO error"
+			],
+			"enum": [
+				"corrupt_incore",
+				"corrupt_ondisk",
+				"device_removed",
+				"force_umount",
+				"log_ioerr",
+				"meta_ioerr"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"lost": {
@@ -386,6 +414,40 @@
 				"generation",
 				"structures"
 			]
+		},
+		"shutdown": {
+			"title": "Abnormal Shutdown Event",
+			"description": [
+				"The filesystem went offline due to",
+				"unrecoverable errors."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "shutdown"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				},
+				"reasons": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$shutdown/reason"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"reasons"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9d34a826726e3e..c7df6dad5612f8 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -20,6 +20,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_fsops.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -67,6 +68,7 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*last_event;
 
 	/* live update hooks */
+	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook	hhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
@@ -384,6 +386,43 @@ xfs_healthmon_metadata_hook(
 	goto out_unlock;
 }
 
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+	struct notifier_block	*nb,
+	unsigned long	action,
+	void	*data)
+{
+	struct xfs_healthmon	*hm;
+	struct xfs_healthmon_event	*event;
+	int	error;
+
+	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_SHUTDOWN,
+			XFS_HEALTHMON_MOUNT);
+	if (!event)
+		goto out_unlock;
+
+	event->flags = action;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -391,6 +430,7 @@ xfs_healthmon_typestring(
 {
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_LOST]	= "lost",
+		[XFS_HEALTHMON_SHUTDOWN]	= "shutdown",
 		[XFS_HEALTHMON_UNMOUNT]	= "unmount",
 		[XFS_HEALTHMON_SICK]	= "sick",
 		[XFS_HEALTHMON_CORRUPT]	= "corrupt",
@@ -606,6 +646,25 @@ xfs_healthmon_format_inode(
 			event->gen);
 }
 
+/* Render shutdown mask as a string set */
+static int
+xfs_healthmon_format_shutdown(
+	struct seq_buf	*outbuf,
+	const struct xfs_healthmon_event	*event)
+{
+	static const struct flag_string mask_strings[] = {
+		{ SHUTDOWN_META_IO_ERROR,	"meta_ioerr" },
+		{ SHUTDOWN_LOG_IO_ERROR,	"log_ioerr" },
+		{ SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
+		{ SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
+		{ SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
+		{ SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "reasons", mask_strings,
+			event->flags);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon	*hm)
@@ -645,6 +704,9 @@ xfs_healthmon_format(
 		goto overrun;
 
 	switch (event->type) {
+	case XFS_HEALTHMON_SHUTDOWN:
+		ret = xfs_healthmon_format_shutdown(outbuf, event);
+		break;
 	case XFS_HEALTHMON_LOST:
 		/* empty */
 		break;
@@ -928,6 +990,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
 
@@ -948,6 +1011,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
 
 	mutex_destroy(&hm->lock);
@@ -1027,6 +1091,7 @@ xfs_ioc_health_monitor(
 
 	/* Enable hooks to receive events, generally. */
 	xfs_health_hook_enable();
+	xfs_shutdown_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1034,11 +1099,16 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_hooks;
 
+	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+	ret = xfs_shutdown_hook_add(mp, &hm->shook);
+	if (ret)
+		goto out_healthhook;
+
 	/* Set up VFS file and file descriptor. */
 	name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id);
 	if (!name) {
 		ret = -ENOMEM;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm,
@@ -1046,17 +1116,20 @@ xfs_ioc_health_monitor(
 	kvfree(name);
 	if (fd < 0) {
 		ret = fd;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_shutdownhook:
+	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
 	xfs_health_hook_disable();
+	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 3ece61165837b2..a7b2eaf3dd64e1 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -9,6 +9,9 @@
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
 
+	/* filesystem shutdown */
+	XFS_HEALTHMON_SHUTDOWN,
+
 	/* metadata health events */
 	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4a68d2ec8d0a34..404b857db39d0d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6173,8 +6173,32 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+	TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, shutdown_flags, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, shutdown_flags)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->shutdown_flags = shutdown_flags;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
 #define XFS_HEALTHMON_TYPE_STRINGS \
 	{ XFS_HEALTHMON_LOST,	"lost" }, \
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }, \
 	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
 	{ XFS_HEALTHMON_SICK,	"sick" }, \
 	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
@@ -6188,6 +6212,7 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);

^ permalink raw reply related [flat|nested] 110+ messages in thread
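The shutdown event carries a bitmask that xfs_healthmon_format_shutdown() turns into the "reasons" string array from the schema. Below is a rough userspace sketch of that table-driven rendering. Treat all of it as assumption: the DEMO_* bit values, the demo_* names, and the exact output layout are invented for illustration, and it uses snprintf where the kernel code uses seq_buf_printf through xfs_healthmon_format_mask().

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed demo bit values; the kernel's SHUTDOWN_* constants may differ. */
#define DEMO_META_IO_ERROR	(1u << 0)
#define DEMO_LOG_IO_ERROR	(1u << 1)
#define DEMO_FORCE_UMOUNT	(1u << 2)
#define DEMO_CORRUPT_INCORE	(1u << 3)
#define DEMO_CORRUPT_ONDISK	(1u << 4)
#define DEMO_DEVICE_REMOVED	(1u << 5)

struct demo_flag_string {
	unsigned int	mask;
	const char	*str;
};

static const struct demo_flag_string demo_reasons[] = {
	{ DEMO_META_IO_ERROR,	"meta_ioerr" },
	{ DEMO_LOG_IO_ERROR,	"log_ioerr" },
	{ DEMO_FORCE_UMOUNT,	"force_umount" },
	{ DEMO_CORRUPT_INCORE,	"corrupt_incore" },
	{ DEMO_CORRUPT_ONDISK,	"corrupt_ondisk" },
	{ DEMO_DEVICE_REMOVED,	"device_removed" },
};

/*
 * Render @flags into @buf in the form: "reasons": ["a", "b"]
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * @buf is too small.
 */
static int
demo_format_reasons(
	unsigned int	flags,
	char		*buf,
	size_t		len)
{
	size_t		used;
	size_t		i;
	int		first = 1;
	int		ret;

	ret = snprintf(buf, len, "\"reasons\": [");
	if (ret < 0 || (size_t)ret >= len)
		return -1;
	used = ret;

	for (i = 0; i < sizeof(demo_reasons) / sizeof(demo_reasons[0]); i++) {
		/* Skip table entries whose bit is not set. */
		if (!(flags & demo_reasons[i].mask))
			continue;

		ret = snprintf(buf + used, len - used, "%s\"%s\"",
				first ? "" : ", ", demo_reasons[i].str);
		if (ret < 0 || (size_t)ret >= len - used)
			return -1;
		used += ret;
		first = 0;
	}

	ret = snprintf(buf + used, len - used, "]");
	if (ret < 0 || (size_t)ret >= len - used)
		return -1;
	return (int)(used + ret);
}
```

Strings come out in table order, not in the order the bits were set, so a consumer should treat "reasons" as a set.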
* [PATCH 12/16] xfs: report media errors through healthmon 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (10 preceding siblings ...) 2024-12-31 23:41 ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong @ 2024-12-31 23:41 ` Darrick J. Wong 2024-12-31 23:42 ` [PATCH 13/16] xfs: report file io " Darrick J. Wong ` (3 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:41 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we have hooks to report media errors, connect this to the health monitor as well. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_healthmon.schema.json | 65 +++++++++++++++++++++ fs/xfs/xfs_healthmon.c | 96 ++++++++++++++++++++++++++++++- fs/xfs/xfs_healthmon.h | 13 ++++ fs/xfs/xfs_trace.c | 1 fs/xfs/xfs_trace.h | 51 ++++++++++++++++ 5 files changed, 224 insertions(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json index a8bc75b0b8c4f9..006f4145faa9f5 100644 --- a/fs/xfs/libxfs/xfs_healthmon.schema.json +++ b/fs/xfs/libxfs/xfs_healthmon.schema.json @@ -33,6 +33,9 @@ }, { "$ref": "#/$events/shutdown" + }, + { + "$ref": "#/$events/media_error" } ], @@ -63,6 +66,31 @@ "i_generation": { "description": "Inode generation number", "type": "integer" + }, + "storage_devs": { + "description": "Storage devices in a filesystem", + "_comment": [ + "One of:", + "", + " * datadev: filesystem device", + " * logdev: external log device", + " * rtdev: realtime volume" + ], + "enum": [ + "datadev", + "logdev", + "rtdev" + ] + }, + "xfs_daddr_t": { + "description": "Storage device address, in units of 512-byte blocks", + "type": "integer", + "minimum": 0 + }, + "bbcount": { + "description": "Storage space length, in units of 512-byte blocks", + "type": "integer", + "minimum": 1 } }, @@ -448,6 +476,43 @@ "domain", "reasons" ] 
+ }, + "media_error": { + "title": "Media Error", + "description": [ + "A storage device reported a media error.", + "The domain element tells us which storage", + "device reported the media failure. The", + "daddr and bbcount elements tell us where", + "inside that device the failure was observed." + ], + "type": "object", + + "properties": { + "type": { + "const": "media" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "$ref": "#/$defs/storage_devs" + }, + "daddr": { + "$ref": "#/$defs/xfs_daddr_t" + }, + "bbcount": { + "$ref": "#/$defs/bbcount" + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "daddr", + "bbcount" + ] } } } diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c index c7df6dad5612f8..c828ea7442e932 100644 --- a/fs/xfs/xfs_healthmon.c +++ b/fs/xfs/xfs_healthmon.c @@ -21,6 +21,7 @@ #include "xfs_health.h" #include "xfs_healthmon.h" #include "xfs_fsops.h" +#include "xfs_notify_failure.h" #include <linux/anon_inodes.h> #include <linux/eventpoll.h> @@ -70,6 +71,7 @@ struct xfs_healthmon { /* live update hooks */ struct xfs_shutdown_hook shook; struct xfs_health_hook hhook; + struct xfs_media_error_hook mhook; /* filesystem mount, or NULL if we've unmounted */ struct xfs_mount *mp; @@ -423,6 +425,59 @@ xfs_healthmon_shutdown_hook( return NOTIFY_DONE; } +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) +/* Add a media error event to the reporting queue. 
*/ +STATIC int +xfs_healthmon_media_error_hook( + struct notifier_block *nb, + unsigned long action, + void *data) +{ + struct xfs_healthmon *hm; + struct xfs_healthmon_event *event; + struct xfs_media_error_params *p = data; + enum xfs_healthmon_domain domain = 0; /* shut up gcc */ + int error; + + hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb); + + mutex_lock(&hm->lock); + + trace_xfs_healthmon_media_error_hook(p, hm->events, + hm->lost_prev_event); + + error = xfs_healthmon_start_live_update(hm); + if (error) + goto out_unlock; + + switch (p->fdev) { + case XFS_FAILED_LOGDEV: + domain = XFS_HEALTHMON_LOGDEV; + break; + case XFS_FAILED_RTDEV: + domain = XFS_HEALTHMON_RTDEV; + break; + case XFS_FAILED_DATADEV: + domain = XFS_HEALTHMON_DATADEV; + break; + } + + event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_MEDIA_ERROR, domain); + if (!event) + goto out_unlock; + + event->daddr = p->daddr; + event->bbcount = p->bbcount; + error = xfs_healthmon_push(hm, event); + if (error) + kfree(event); + +out_unlock: + mutex_unlock(&hm->lock); + return NOTIFY_DONE; +} +#endif + /* Render the health update type as a string. 
*/ STATIC const char * xfs_healthmon_typestring( @@ -435,6 +490,7 @@ xfs_healthmon_typestring( [XFS_HEALTHMON_SICK] = "sick", [XFS_HEALTHMON_CORRUPT] = "corrupt", [XFS_HEALTHMON_HEALTHY] = "healthy", + [XFS_HEALTHMON_MEDIA_ERROR] = "media", }; if (event->type >= ARRAY_SIZE(type_strings)) @@ -454,6 +510,9 @@ xfs_healthmon_domstring( [XFS_HEALTHMON_AG] = "perag", [XFS_HEALTHMON_INODE] = "inode", [XFS_HEALTHMON_RTGROUP] = "rtgroup", + [XFS_HEALTHMON_DATADEV] = "datadev", + [XFS_HEALTHMON_LOGDEV] = "logdev", + [XFS_HEALTHMON_RTDEV] = "rtdev", }; if (event->domain >= ARRAY_SIZE(dom_strings)) @@ -665,6 +724,23 @@ xfs_healthmon_format_shutdown( event->flags); } +/* Render media error as a string set */ +static int +xfs_healthmon_format_media_error( + struct seq_buf *outbuf, + const struct xfs_healthmon_event *event) +{ + ssize_t ret; + + ret = seq_buf_printf(outbuf, " \"daddr\": %llu,\n", + event->daddr); + if (ret < 0) + return ret; + + return seq_buf_printf(outbuf, " \"bbcount\": %llu,\n", + event->bbcount); +} + static inline void xfs_healthmon_reset_outbuf( struct xfs_healthmon *hm) @@ -730,6 +806,11 @@ xfs_healthmon_format( case XFS_HEALTHMON_INODE: ret = xfs_healthmon_format_inode(outbuf, event); break; + case XFS_HEALTHMON_DATADEV: + case XFS_HEALTHMON_LOGDEV: + case XFS_HEALTHMON_RTDEV: + ret = xfs_healthmon_format_media_error(outbuf, event); + break; } if (ret < 0) goto overrun; @@ -990,6 +1071,7 @@ xfs_healthmon_detach_hooks( * through the health monitoring subsystem from xfs_fs_put_super, so * it is now time to detach the hooks. 
*/ + xfs_media_error_hook_del(hm->mp, &hm->mhook); xfs_shutdown_hook_del(hm->mp, &hm->shook); xfs_health_hook_del(hm->mp, &hm->hhook); return; @@ -1011,6 +1093,7 @@ xfs_healthmon_release( wake_up_all(&hm->wait); iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm); + xfs_media_error_hook_disable(); xfs_shutdown_hook_disable(); xfs_health_hook_disable(); @@ -1092,6 +1175,7 @@ xfs_ioc_health_monitor( /* Enable hooks to receive events, generally. */ xfs_health_hook_enable(); xfs_shutdown_hook_enable(); + xfs_media_error_hook_enable(); /* Attach specific event hooks to this monitor. */ xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook); @@ -1104,11 +1188,16 @@ xfs_ioc_health_monitor( if (ret) goto out_healthhook; + xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook); + ret = xfs_media_error_hook_add(mp, &hm->mhook); + if (ret) + goto out_shutdownhook; + /* Set up VFS file and file descriptor. */ name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id); if (!name) { ret = -ENOMEM; - goto out_shutdownhook; + goto out_mediahook; } fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm, @@ -1116,18 +1205,21 @@ xfs_ioc_health_monitor( kvfree(name); if (fd < 0) { ret = fd; - goto out_shutdownhook; + goto out_mediahook; } trace_xfs_healthmon_create(mp, hmo.flags, hmo.format); return fd; +out_mediahook: + xfs_media_error_hook_del(mp, &hm->mhook); out_shutdownhook: xfs_shutdown_hook_del(mp, &hm->shook); out_healthhook: xfs_health_hook_del(mp, &hm->hhook); out_hooks: + xfs_media_error_hook_disable(); xfs_health_hook_disable(); xfs_shutdown_hook_disable(); mutex_destroy(&hm->lock); diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h index a7b2eaf3dd64e1..23ce320f4b086b 100644 --- a/fs/xfs/xfs_healthmon.h +++ b/fs/xfs/xfs_healthmon.h @@ -17,6 +17,9 @@ enum xfs_healthmon_type { XFS_HEALTHMON_CORRUPT, /* fsck reported corruption */ XFS_HEALTHMON_HEALTHY, /* fsck reported healthy structure */ XFS_HEALTHMON_UNMOUNT, /* 
filesystem is unmounting */ + + /* media errors */ + XFS_HEALTHMON_MEDIA_ERROR, }; enum xfs_healthmon_domain { @@ -27,6 +30,11 @@ enum xfs_healthmon_domain { XFS_HEALTHMON_AG, /* allocation group metadata */ XFS_HEALTHMON_INODE, /* inode metadata */ XFS_HEALTHMON_RTGROUP, /* realtime group metadata */ + + /* media errors */ + XFS_HEALTHMON_DATADEV, + XFS_HEALTHMON_RTDEV, + XFS_HEALTHMON_LOGDEV, }; struct xfs_healthmon_event { @@ -60,6 +68,11 @@ struct xfs_healthmon_event { uint32_t gen; xfs_ino_t ino; }; + /* media errors */ + struct { + xfs_daddr_t daddr; + uint64_t bbcount; + }; }; }; diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 41a2ac85dc5fdf..23741ff36a2e14 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -54,6 +54,7 @@ #include "xfs_fsrefs.h" #include "xfs_health.h" #include "xfs_healthmon.h" +#include "xfs_notify_failure.h" /* * We include this last to have the helpers above available for the trace diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 404b857db39d0d..47293206400d6e 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -108,6 +108,7 @@ struct xfs_fsrefs_irec; struct xfs_rtgroup; struct xfs_healthmon_event; struct xfs_health_update_params; +struct xfs_media_error_params; #define XFS_ATTR_FILTER_FLAGS \ { XFS_ATTR_ROOT, "ROOT" }, \ @@ -6345,6 +6346,56 @@ TRACE_EVENT(xfs_healthmon_metadata_hook, __entry->events, __entry->lost_prev) ); + +#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) +TRACE_EVENT(xfs_healthmon_media_error_hook, + TP_PROTO(const struct xfs_media_error_params *p, + unsigned int events, bool lost_prev), + TP_ARGS(p, events, lost_prev), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, error_dev) + __field(uint64_t, daddr) + __field(uint64_t, bbcount) + __field(int, pre_remove) + __field(unsigned int, events) + __field(bool, lost_prev) + ), + TP_fast_assign( + struct xfs_mount *mp = p->mp; + struct xfs_buftarg *btp = NULL; + + switch 
(p->fdev) { + case XFS_FAILED_DATADEV: + btp = mp->m_ddev_targp; + break; + case XFS_FAILED_LOGDEV: + btp = mp->m_logdev_targp; + break; + case XFS_FAILED_RTDEV: + btp = mp->m_rtdev_targp; + break; + } + + __entry->dev = mp->m_super->s_dev; + if (btp) + __entry->error_dev = btp->bt_dev; + __entry->daddr = p->daddr; + __entry->bbcount = p->bbcount; + __entry->pre_remove = p->pre_remove; + __entry->events = events; + __entry->lost_prev = lost_prev; + ), + TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx pre_remove? %d events %u lost_prev? %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + MAJOR(__entry->error_dev), MINOR(__entry->error_dev), + __entry->daddr, + __entry->bbcount, + __entry->pre_remove, + __entry->events, + __entry->lost_prev) +); +#endif #endif /* CONFIG_XFS_HEALTH_MONITOR */ #endif /* _TRACE_XFS_H */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
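A monitoring program that reads these media error events will probably want byte offsets rather than the raw daddr and bbcount values. The following is a minimal userspace sketch, not code from this series. It assumes both fields count 512-byte basic blocks, as the comments on struct xfs_media_error elsewhere in this thread indicate. The names BB_SHIFT, struct byte_range, and media_error_to_bytes are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: daddr and bbcount are in 512-byte units, so shift by 9. */
#define BB_SHIFT	9

/* Hypothetical helper type: a byte range on one storage device. */
struct byte_range {
	uint64_t	start;	/* first affected byte */
	uint64_t	len;	/* number of affected bytes */
};

/*
 * Convert a (daddr, bbcount) pair from a media error event into a byte
 * range.  Returns 0 on success, or -1 if the result would not fit in
 * 64 bits.
 */
static int
media_error_to_bytes(
	uint64_t		daddr,
	uint64_t		bbcount,
	struct byte_range	*out)
{
	const uint64_t		max_bb = UINT64_MAX >> BB_SHIFT;

	/* Each shift must not lose high bits. */
	if (daddr > max_bb || bbcount > max_bb)
		return -1;

	/* The end of the range must not wrap around either. */
	if ((daddr << BB_SHIFT) > UINT64_MAX - (bbcount << BB_SHIFT))
		return -1;

	out->start = daddr << BB_SHIFT;
	out->len = bbcount << BB_SHIFT;
	return 0;
}
```

With these assumptions, an event carrying daddr 8 and bbcount 16 describes 8192 bytes starting at byte 4096 of the affected device.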
* [PATCH 13/16] xfs: report file io errors through healthmon 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (11 preceding siblings ...) 2024-12-31 23:41 ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong @ 2024-12-31 23:42 ` Darrick J. Wong 2024-12-31 23:42 ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong ` (2 subsequent siblings) 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Set up a file io error event hook so that we can send events about read errors, writeback errors, and directio errors to userspace. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_healthmon.schema.json | 77 ++++++++++++++++++++ fs/xfs/xfs_healthmon.c | 120 ++++++++++++++++++++++++++++++- fs/xfs/xfs_healthmon.h | 16 ++++ fs/xfs/xfs_trace.c | 1 fs/xfs/xfs_trace.h | 50 +++++++++++++ 5 files changed, 262 insertions(+), 2 deletions(-) diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json index 006f4145faa9f5..9c1070a629997c 100644 --- a/fs/xfs/libxfs/xfs_healthmon.schema.json +++ b/fs/xfs/libxfs/xfs_healthmon.schema.json @@ -36,6 +36,9 @@ }, { "$ref": "#/$events/media_error" + }, + { + "$ref": "#/$events/file_ioerror" } ], @@ -67,6 +70,16 @@ "description": "Inode generation number", "type": "integer" }, + "off_t": { + "description": "File position, in bytes", + "type": "integer", + "minimum": 0 + }, + "size_t": { + "description": "File operation length, in bytes", + "type": "integer", + "minimum": 1 + }, "storage_devs": { "description": "Storage devices in a filesystem", "_comment": [ @@ -261,6 +274,26 @@ } }, + "$comment": "File IO event data are defined here.", + "$fileio": { + "types": { + "description": [ + "File I/O operations. 
One of:", + "", + " * readahead: reads into the page cache.", + " * writeback: writeback of dirty page cache.", + " * dioread: O_DIRECT reads.", + " * diowrite: O_DIRECT writes." + ], + "enum": [ + "readahead", + "writeback", + "dioread", + "diowrite" + ] + } + }, + "$comment": "Event types are defined here.", "$events": { "lost": { @@ -513,6 +546,50 @@ "daddr", "bbcount" ] + }, + "file_ioerror": { + "title": "File I/O error", + "description": [ + "A read or a write to a file failed. The", + "inode, generation, pos, and len fields", + "describe the range of the file that is", + "affected." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$fileio/types" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "filerange" + }, + "inumber": { + "$ref": "#/$defs/xfs_ino_t" + }, + "generation": { + "$ref": "#/$defs/i_generation" + }, + "pos": { + "$ref": "#/$defs/off_t" + }, + "len": { + "$ref": "#/$defs/size_t" + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "inumber", + "generation", + "pos", + "len" + ] } } } diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c index c828ea7442e932..9320f12b60ade9 100644 --- a/fs/xfs/xfs_healthmon.c +++ b/fs/xfs/xfs_healthmon.c @@ -22,6 +22,7 @@ #include "xfs_healthmon.h" #include "xfs_fsops.h" #include "xfs_notify_failure.h" +#include "xfs_file.h" #include <linux/anon_inodes.h> #include <linux/eventpoll.h> @@ -72,6 +73,7 @@ struct xfs_healthmon { struct xfs_shutdown_hook shook; struct xfs_health_hook hhook; struct xfs_media_error_hook mhook; + struct xfs_file_ioerror_hook fhook; /* filesystem mount, or NULL if we've unmounted */ struct xfs_mount *mp; @@ -478,6 +480,73 @@ xfs_healthmon_media_error_hook( } #endif +/* Add a file io error event to the reporting queue. 
*/ +STATIC int +xfs_healthmon_file_ioerror_hook( + struct notifier_block *nb, + unsigned long action, + void *data) +{ + struct xfs_healthmon *hm; + struct xfs_healthmon_event *event; + struct xfs_file_ioerror_params *p = data; + enum xfs_healthmon_type type = 0; + int error; + + hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb); + + switch (action) { + case XFS_FILE_IOERROR_BUFFERED_READ: + case XFS_FILE_IOERROR_BUFFERED_WRITE: + case XFS_FILE_IOERROR_DIRECT_READ: + case XFS_FILE_IOERROR_DIRECT_WRITE: + break; + default: + ASSERT(0); + return NOTIFY_DONE; + } + + mutex_lock(&hm->lock); + + trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events, + hm->lost_prev_event); + + error = xfs_healthmon_start_live_update(hm); + if (error) + goto out_unlock; + + switch (action) { + case XFS_FILE_IOERROR_BUFFERED_READ: + type = XFS_HEALTHMON_BUFREAD; + break; + case XFS_FILE_IOERROR_BUFFERED_WRITE: + type = XFS_HEALTHMON_BUFWRITE; + break; + case XFS_FILE_IOERROR_DIRECT_READ: + type = XFS_HEALTHMON_DIOREAD; + break; + case XFS_FILE_IOERROR_DIRECT_WRITE: + type = XFS_HEALTHMON_DIOWRITE; + break; + } + + event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE); + if (!event) + goto out_unlock; + + event->fino = p->ino; + event->fgen = p->gen; + event->fpos = p->pos; + event->flen = p->len; + error = xfs_healthmon_push(hm, event); + if (error) + kfree(event); + +out_unlock: + mutex_unlock(&hm->lock); + return NOTIFY_DONE; +} + /* Render the health update type as a string. 
*/ STATIC const char * xfs_healthmon_typestring( @@ -491,6 +560,10 @@ xfs_healthmon_typestring( [XFS_HEALTHMON_CORRUPT] = "corrupt", [XFS_HEALTHMON_HEALTHY] = "healthy", [XFS_HEALTHMON_MEDIA_ERROR] = "media", + [XFS_HEALTHMON_BUFREAD] = "readahead", + [XFS_HEALTHMON_BUFWRITE] = "writeback", + [XFS_HEALTHMON_DIOREAD] = "dioread", + [XFS_HEALTHMON_DIOWRITE] = "diowrite", }; if (event->type >= ARRAY_SIZE(type_strings)) @@ -513,6 +586,7 @@ xfs_healthmon_domstring( [XFS_HEALTHMON_DATADEV] = "datadev", [XFS_HEALTHMON_LOGDEV] = "logdev", [XFS_HEALTHMON_RTDEV] = "rtdev", + [XFS_HEALTHMON_FILERANGE] = "filerange", }; if (event->domain >= ARRAY_SIZE(dom_strings)) @@ -741,6 +815,33 @@ xfs_healthmon_format_media_error( event->bbcount); } +/* Render file range events as a string set */ +static int +xfs_healthmon_format_filerange( + struct seq_buf *outbuf, + const struct xfs_healthmon_event *event) +{ + ssize_t ret; + + ret = seq_buf_printf(outbuf, " \"inumber\": %llu,\n", + event->fino); + if (ret < 0) + return ret; + + ret = seq_buf_printf(outbuf, " \"generation\": %u,\n", + event->fgen); + if (ret < 0) + return ret; + + ret = seq_buf_printf(outbuf, " \"pos\": %llu,\n", + event->fpos); + if (ret < 0) + return ret; + + return seq_buf_printf(outbuf, " \"len\": %llu,\n", + event->flen); +} + static inline void xfs_healthmon_reset_outbuf( struct xfs_healthmon *hm) @@ -811,6 +912,9 @@ xfs_healthmon_format( case XFS_HEALTHMON_RTDEV: ret = xfs_healthmon_format_media_error(outbuf, event); break; + case XFS_HEALTHMON_FILERANGE: + ret = xfs_healthmon_format_filerange(outbuf, event); + break; } if (ret < 0) goto overrun; @@ -1071,6 +1175,7 @@ xfs_healthmon_detach_hooks( * through the health monitoring subsystem from xfs_fs_put_super, so * it is now time to detach the hooks. 
*/ + xfs_file_ioerror_hook_del(hm->mp, &hm->fhook); xfs_media_error_hook_del(hm->mp, &hm->mhook); xfs_shutdown_hook_del(hm->mp, &hm->shook); xfs_health_hook_del(hm->mp, &hm->hhook); @@ -1093,6 +1198,7 @@ xfs_healthmon_release( wake_up_all(&hm->wait); iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm); + xfs_file_ioerror_hook_disable(); xfs_media_error_hook_disable(); xfs_shutdown_hook_disable(); xfs_health_hook_disable(); @@ -1176,6 +1282,7 @@ xfs_ioc_health_monitor( xfs_health_hook_enable(); xfs_shutdown_hook_enable(); xfs_media_error_hook_enable(); + xfs_file_ioerror_hook_enable(); /* Attach specific event hooks to this monitor. */ xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook); @@ -1193,11 +1300,17 @@ xfs_ioc_health_monitor( if (ret) goto out_shutdownhook; + xfs_file_ioerror_hook_setup(&hm->fhook, + xfs_healthmon_file_ioerror_hook); + ret = xfs_file_ioerror_hook_add(mp, &hm->fhook); + if (ret) + goto out_mediahook; + /* Set up VFS file and file descriptor. 
*/ name = kasprintf(GFP_KERNEL, "XFS (%s): healthmon", mp->m_super->s_id); if (!name) { ret = -ENOMEM; - goto out_mediahook; + goto out_ioerrhook; } fd = anon_inode_getfd(name, &xfs_healthmon_fops, hm, @@ -1205,13 +1318,15 @@ xfs_ioc_health_monitor( kvfree(name); if (fd < 0) { ret = fd; - goto out_mediahook; + goto out_ioerrhook; } trace_xfs_healthmon_create(mp, hmo.flags, hmo.format); return fd; +out_ioerrhook: + xfs_file_ioerror_hook_del(mp, &hm->fhook); out_mediahook: xfs_media_error_hook_del(mp, &hm->mhook); out_shutdownhook: @@ -1219,6 +1334,7 @@ xfs_ioc_health_monitor( out_healthhook: xfs_health_hook_del(mp, &hm->hhook); out_hooks: + xfs_file_ioerror_hook_disable(); xfs_media_error_hook_disable(); xfs_health_hook_disable(); xfs_shutdown_hook_disable(); diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h index 23ce320f4b086b..748173eed79660 100644 --- a/fs/xfs/xfs_healthmon.h +++ b/fs/xfs/xfs_healthmon.h @@ -20,6 +20,12 @@ enum xfs_healthmon_type { /* media errors */ XFS_HEALTHMON_MEDIA_ERROR, + + /* file range events */ + XFS_HEALTHMON_BUFREAD, + XFS_HEALTHMON_BUFWRITE, + XFS_HEALTHMON_DIOREAD, + XFS_HEALTHMON_DIOWRITE, }; enum xfs_healthmon_domain { @@ -35,6 +41,9 @@ enum xfs_healthmon_domain { XFS_HEALTHMON_DATADEV, XFS_HEALTHMON_RTDEV, XFS_HEALTHMON_LOGDEV, + + /* file range events */ + XFS_HEALTHMON_FILERANGE, }; struct xfs_healthmon_event { @@ -73,6 +82,13 @@ struct xfs_healthmon_event { xfs_daddr_t daddr; uint64_t bbcount; }; + /* file range events */ + struct { + xfs_ino_t fino; + loff_t fpos; + uint64_t flen; + uint32_t fgen; + }; }; }; diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 23741ff36a2e14..d8e5d607b0dc6a 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -55,6 +55,7 @@ #include "xfs_health.h" #include "xfs_healthmon.h" #include "xfs_notify_failure.h" +#include "xfs_file.h" /* * We include this last to have the helpers above available for the trace diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 
47293206400d6e..aba32f5ccc1a3b 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -109,6 +109,7 @@ struct xfs_rtgroup; struct xfs_healthmon_event; struct xfs_health_update_params; struct xfs_media_error_params; +struct xfs_file_ioerror_params; #define XFS_ATTR_FILTER_FLAGS \ { XFS_ATTR_ROOT, "ROOT" }, \ @@ -6396,6 +6397,55 @@ TRACE_EVENT(xfs_healthmon_media_error_hook, __entry->lost_prev) ); #endif + +#define XFS_FILE_IOERROR_STRINGS \ + { XFS_FILE_IOERROR_BUFFERED_READ, "readahead" }, \ + { XFS_FILE_IOERROR_BUFFERED_WRITE, "writeback" }, \ + { XFS_FILE_IOERROR_DIRECT_READ, "dioread" }, \ + { XFS_FILE_IOERROR_DIRECT_WRITE, "diowrite" } + +TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ); +TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE); +TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ); +TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE); + +TRACE_EVENT(xfs_healthmon_file_ioerror_hook, + TP_PROTO(const struct xfs_mount *mp, + unsigned long action, + const struct xfs_file_ioerror_params *p, + unsigned int events, bool lost_prev), + TP_ARGS(mp, action, p, events, lost_prev), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(dev_t, error_dev) + __field(unsigned long, action) + __field(unsigned long long, ino) + __field(unsigned int, gen) + __field(long long, pos) + __field(unsigned long long, len) + __field(unsigned int, events) + __field(bool, lost_prev) + ), + TP_fast_assign( + __entry->dev = mp ? mp->m_super->s_dev : 0; + __entry->action = action; + __entry->ino = p->ino; + __entry->gen = p->gen; + __entry->pos = p->pos; + __entry->len = p->len; + __entry->events = events; + __entry->lost_prev = lost_prev; + ), + TP_printk("dev %d:%d ino 0x%llx gen 0x%x op %s pos 0x%llx bytecount 0x%llx events %u lost_prev? 
%d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->gen, + __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS), + __entry->pos, + __entry->len, + __entry->events, + __entry->lost_prev) +); #endif /* CONFIG_XFS_HEALTH_MONITOR */ #endif /* _TRACE_XFS_H */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
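The schema hunk in this patch constrains file_ioerror events: type must be one of four strings, domain must be "filerange", pos has a minimum of 0, and len has a minimum of 1. A consumer could apply the same checks after decoding an event. The sketch below is only an illustration under that reading of the schema. struct file_ioerror_event and file_ioerror_event_valid are hypothetical names, and the JSON decoding step is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of one file_ioerror event. */
struct file_ioerror_event {
	const char	*type;		/* "readahead", "writeback", ... */
	const char	*domain;	/* expected to be "filerange" */
	uint64_t	inumber;
	uint32_t	generation;
	int64_t		pos;		/* file position, in bytes */
	uint64_t	len;		/* operation length, in bytes */
};

/* The four operation names listed in the schema's $fileio/types enum. */
static const char *const fileio_types[] = {
	"readahead",
	"writeback",
	"dioread",
	"diowrite",
};

/* Return true if the event satisfies the constraints in the schema. */
static bool
file_ioerror_event_valid(
	const struct file_ioerror_event	*ev)
{
	size_t				i;
	bool				type_ok = false;

	if (!ev->type || !ev->domain)
		return false;

	for (i = 0; i < sizeof(fileio_types) / sizeof(fileio_types[0]); i++) {
		if (strcmp(ev->type, fileio_types[i]) == 0) {
			type_ok = true;
			break;
		}
	}
	if (!type_ok)
		return false;

	if (strcmp(ev->domain, "filerange") != 0)
		return false;

	/* Schema: pos has "minimum": 0 and len has "minimum": 1. */
	return ev->pos >= 0 && ev->len >= 1;
}
```

A zero-length event or an unknown operation name is rejected, which matches what a JSON Schema validator would be expected to do with the same input.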
* [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (12 preceding siblings ...)
2024-12-31 23:42 ` [PATCH 13/16] xfs: report file io " Darrick J. Wong
@ 2024-12-31 23:42 ` Darrick J. Wong
2024-12-31 23:42 ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
2024-12-31 23:43 ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9320f12b60ade9..67f7d4a8cc7f58 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -23,6 +23,8 @@
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
 #include "xfs_file.h"
+#include "xfs_fs.h"
+#include "xfs_ioctl.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -1228,11 +1230,38 @@ xfs_healthmon_validate(
 	return true;
 }
 
+/* Handle ioctls for the health monitoring thread. */
+STATIC long
+xfs_healthmon_ioctl(
+	struct file		*file,
+	unsigned int		cmd,
+	unsigned long		p)
+{
+	struct xfs_health_monitor hmo;
+	struct xfs_healthmon	*hm = file->private_data;
+	void __user		*arg = (void __user *)p;
+
+	if (cmd != XFS_IOC_HEALTH_MONITOR)
+		return -ENOTTY;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	mutex_lock(&hm->lock);
+	hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+	mutex_unlock(&hm->lock);
+	return 0;
+}
+
 static const struct file_operations xfs_healthmon_fops = {
 	.owner		= THIS_MODULE,
 	.read_iter	= xfs_healthmon_read_iter,
 	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
+	.unlocked_ioctl	= xfs_healthmon_ioctl,
 };
 
 /*

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 15/16] xfs: add media error reporting ioctl 2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong ` (13 preceding siblings ...) 2024-12-31 23:42 ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong @ 2024-12-31 23:42 ` Darrick J. Wong 2024-12-31 23:43 ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong 15 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:42 UTC (permalink / raw) To: djwong, cem; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a new privileged ioctl so that xfs_scrub can report media errors to the kernel for further processing. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/Makefile | 6 +---- fs/xfs/libxfs/xfs_fs.h | 15 ++++++++++++ fs/xfs/xfs_healthmon.c | 2 -- fs/xfs/xfs_ioctl.c | 3 ++ fs/xfs/xfs_notify_failure.c | 53 ++++++++++++++++++++++++++++++++++++++++++- fs/xfs/xfs_notify_failure.h | 8 ++++++ fs/xfs/xfs_trace.h | 2 -- 7 files changed, 78 insertions(+), 11 deletions(-) diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 94a9dc7aa7a1d5..71e6512899da3a 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -99,6 +99,7 @@ xfs-y += xfs_aops.o \ xfs_message.o \ xfs_mount.o \ xfs_mru_cache.o \ + xfs_notify_failure.o \ xfs_pwork.o \ xfs_reflink.o \ xfs_stats.o \ @@ -149,11 +150,6 @@ xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o -# notify failure -ifeq ($(CONFIG_MEMORY_FAILURE),y) -xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o -endif - xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o xfs-$(CONFIG_XFS_LIVE_HOOKS) += xfs_hooks.o xfs-$(CONFIG_XFS_MEMORY_BUFS) += xfs_buf_mem.o diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index d7404e6efd866d..32e552d40b1bf5 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -1115,6 +1115,20 @@ struct xfs_health_monitor { /* Return 
events in JSON format */ #define XFS_HEALTH_MONITOR_FMT_JSON (1) +struct xfs_media_error { + __u64 flags; /* flags */ + __u64 daddr; /* disk address of range */ + __u64 bbcount; /* length, in 512b blocks */ + __u64 pad; /* zero */ +}; + +#define XFS_MEDIA_ERROR_DATADEV (1) /* data device */ +#define XFS_MEDIA_ERROR_LOGDEV (2) /* external log device */ +#define XFS_MEDIA_ERROR_RTDEV (3) /* realtime device */ + +/* bottom byte of flags is the device code */ +#define XFS_MEDIA_ERROR_DEVMASK (0xFF) + /* * ioctl commands that are used by Linux filesystems */ @@ -1157,6 +1171,7 @@ struct xfs_health_monitor { #define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) #define XFS_IOC_MAP_FREESP _IOW ('X', 67, struct xfs_map_freesp) #define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor) +#define XFS_IOC_MEDIA_ERROR _IOW ('X', 69, struct xfs_media_error) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c index 67f7d4a8cc7f58..b6fdad798fae89 100644 --- a/fs/xfs/xfs_healthmon.c +++ b/fs/xfs/xfs_healthmon.c @@ -429,7 +429,6 @@ xfs_healthmon_shutdown_hook( return NOTIFY_DONE; } -#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) /* Add a media error event to the reporting queue. */ STATIC int xfs_healthmon_media_error_hook( @@ -480,7 +479,6 @@ xfs_healthmon_media_error_hook( mutex_unlock(&hm->lock); return NOTIFY_DONE; } -#endif /* Add a file io error event to the reporting queue. 
*/ STATIC int diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 6c7a30128c7bf6..c253538c48f3b3 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -43,6 +43,7 @@ #include "xfs_handle.h" #include "xfs_rtgroup.h" #include "xfs_healthmon.h" +#include "xfs_notify_failure.h" #include <linux/mount.h> #include <linux/fileattr.h> @@ -1437,6 +1438,8 @@ xfs_file_ioctl( case XFS_IOC_HEALTH_MONITOR: return xfs_ioc_health_monitor(mp, arg); + case XFS_IOC_MEDIA_ERROR: + return xfs_ioc_media_error(mp, arg); default: return -ENOTTY; diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index ea68c7e61bb585..fcf9f0139d673c 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -91,9 +91,19 @@ xfs_media_error_hook_setup( xfs_hook_setup(&hook->error_hook, mod_fn); } #else -# define xfs_media_error_hook(...) ((void)0) +static inline void +xfs_media_error_hook( + struct xfs_mount *mp, + enum xfs_failed_device fdev, + xfs_daddr_t daddr, + uint64_t bbcount, + bool pre_remove) +{ + /* empty */ +} #endif /* CONFIG_XFS_LIVE_HOOKS */ +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) struct xfs_failure_info { xfs_agblock_t startblock; xfs_extlen_t blockcount; @@ -463,3 +473,44 @@ xfs_dax_notify_failure( const struct dax_holder_operations xfs_dax_holder_operations = { .notify_failure = xfs_dax_notify_failure, }; +#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */ + +#define XFS_VALID_MEDIA_ERROR_FLAGS (XFS_MEDIA_ERROR_DATADEV | \ + XFS_MEDIA_ERROR_LOGDEV | \ + XFS_MEDIA_ERROR_RTDEV) +int +xfs_ioc_media_error( + struct xfs_mount *mp, + struct xfs_media_error __user *arg) +{ + struct xfs_media_error me; + enum xfs_failed_device fdev; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&me, arg, sizeof(me))) + return -EFAULT; + + if (me.pad) + return -EINVAL; + if (me.flags & ~XFS_VALID_MEDIA_ERROR_FLAGS) + return -EINVAL; + + switch (me.flags & XFS_MEDIA_ERROR_DEVMASK) { + case XFS_MEDIA_ERROR_DATADEV: + 
fdev = XFS_FAILED_DATADEV; + break; + case XFS_MEDIA_ERROR_LOGDEV: + fdev = XFS_FAILED_LOGDEV; + break; + case XFS_MEDIA_ERROR_RTDEV: + fdev = XFS_FAILED_RTDEV; + break; + default: + return -EINVAL; + } + + xfs_media_error_hook(mp, fdev, me.daddr, me.bbcount, false); + return 0; +} diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h index 835d4af504d832..c23034891d99fd 100644 --- a/fs/xfs/xfs_notify_failure.h +++ b/fs/xfs/xfs_notify_failure.h @@ -6,7 +6,9 @@ #ifndef __XFS_NOTIFY_FAILURE_H__ #define __XFS_NOTIFY_FAILURE_H__ +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) extern const struct dax_holder_operations xfs_dax_holder_operations; +#endif enum xfs_failed_device { XFS_FAILED_DATADEV, @@ -14,7 +16,7 @@ enum xfs_failed_device { XFS_FAILED_RTDEV, }; -#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) +#if defined(CONFIG_XFS_LIVE_HOOKS) struct xfs_media_error_params { struct xfs_mount *mp; enum xfs_failed_device fdev; @@ -46,4 +48,8 @@ struct xfs_media_error_hook { }; # define xfs_media_error_hook_setup(...) 
((void)0) #endif /* CONFIG_XFS_LIVE_HOOKS */ +struct xfs_media_error; +int xfs_ioc_media_error(struct xfs_mount *mp, + struct xfs_media_error __user *arg); + #endif /* __XFS_NOTIFY_FAILURE_H__ */ diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index aba32f5ccc1a3b..3baa39a2b0a8b8 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -6348,7 +6348,6 @@ TRACE_EVENT(xfs_healthmon_metadata_hook, __entry->lost_prev) ); -#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX) TRACE_EVENT(xfs_healthmon_media_error_hook, TP_PROTO(const struct xfs_media_error_params *p, unsigned int events, bool lost_prev), @@ -6396,7 +6395,6 @@ TRACE_EVENT(xfs_healthmon_media_error_hook, __entry->events, __entry->lost_prev) ); -#endif #define XFS_FILE_IOERROR_STRINGS \ { XFS_FILE_IOERROR_BUFFERED_READ, "readahead" }, \ ^ permalink raw reply related [flat|nested] 110+ messages in thread
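The argument checking in xfs_ioc_media_error can be modelled in userspace to see which flag values are accepted. The sketch below is a stand-in, not the kernel code: struct media_error, the MEDIA_ERROR_* macros, and enum failed_device are local copies made for illustration, with values taken from the xfs_fs.h hunk in this patch. One deliberate difference is that it tests the reserved bits against the device mask instead of OR-ing the three device codes together. As far as I can tell both forms accept exactly the values 1, 2, and 3.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local mirror of struct xfs_media_error from the patch (illustrative). */
struct media_error {
	uint64_t	flags;		/* device code in the bottom byte */
	uint64_t	daddr;		/* disk address of range */
	uint64_t	bbcount;	/* length, in 512b blocks */
	uint64_t	pad;		/* must be zero */
};

/* Values copied from the XFS_MEDIA_ERROR_* definitions in the patch. */
#define MEDIA_ERROR_DATADEV	1
#define MEDIA_ERROR_LOGDEV	2
#define MEDIA_ERROR_RTDEV	3
#define MEDIA_ERROR_DEVMASK	0xFF

/* Userspace stand-in for enum xfs_failed_device. */
enum failed_device {
	FAILED_DATADEV,
	FAILED_LOGDEV,
	FAILED_RTDEV,
};

/*
 * Validate the ioctl argument and decode the device code.  Returns 0 and
 * sets *fdev on success, or -EINVAL for a malformed request.
 */
static int
media_error_check(
	const struct media_error	*me,
	enum failed_device		*fdev)
{
	if (me->pad)
		return -EINVAL;

	/* No flag bits are defined outside the device code byte. */
	if (me->flags & ~(uint64_t)MEDIA_ERROR_DEVMASK)
		return -EINVAL;

	switch (me->flags & MEDIA_ERROR_DEVMASK) {
	case MEDIA_ERROR_DATADEV:
		*fdev = FAILED_DATADEV;
		break;
	case MEDIA_ERROR_LOGDEV:
		*fdev = FAILED_LOGDEV;
		break;
	case MEDIA_ERROR_RTDEV:
		*fdev = FAILED_RTDEV;
		break;
	default:
		return -EINVAL;
	}
	return 0;
}
```

A caller such as xfs_scrub would fill in flags with one device code, set daddr and bbcount for the bad range, and leave pad zeroed. Anything else is expected to fail with EINVAL.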
* [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
` (14 preceding siblings ...)
2024-12-31 23:42 ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
@ 2024-12-31 23:43 ` Darrick J. Wong
15 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw)
To: djwong, cem; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Send uevents when we mount and unmount the filesystem, so that we can
trigger systemd services.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_super.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index df6afcf8840948..1d295991e08047 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1197,12 +1197,28 @@ xfs_inodegc_free_percpu(
 	free_percpu(mp->m_inodegc);
 }
 
+static void
+xfs_send_unmount_uevent(
+	struct xfs_mount	*mp)
+{
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		sid,
+		NULL,
+	};
+
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	xfs_send_unmount_uevent(mp);
 	xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1590,6 +1606,29 @@ xfs_debugfs_mkdir(
 	return child;
 }
 
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+	struct fs_context	*fc,
+	struct xfs_mount	*mp)
+{
+	char			source[256] = "";
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		source,
+		sid,
+		NULL,
+	};
+
+	snprintf(source, sizeof(source), "SOURCE=%s", fc->source);
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+}
+
 static int
 xfs_fs_fill_super(
 	struct super_block	*sb,
@@ -1904,6 +1943,7 @@ xfs_fs_fill_super(
 		mp->m_debugfs_uuid = NULL;
 	}
 
+	xfs_send_mount_uevent(fc, mp);
 	return 0;
 
  out_filestream_unmount:

^ permalink raw reply related [flat|nested] 110+ messages in thread
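The uevent helpers in this patch format each environment variable into a fixed 256-byte buffer with snprintf and do not look at the return value, so an overly long SOURCE would be silently truncated. Whether that matters in practice is not something this thread settles. The userspace sketch below only illustrates how the same KEY=value formatting could report truncation. format_uevent_var and UEVENT_VAR_LEN are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same buffer size that the patch uses for each variable. */
#define UEVENT_VAR_LEN	256

/*
 * Format "KEY=value" into buf.  Returns 0 if the whole string fit, or -1
 * if snprintf failed or had to truncate the output.
 */
static int
format_uevent_var(
	char		*buf,
	size_t		buflen,
	const char	*key,
	const char	*value)
{
	int		n;

	n = snprintf(buf, buflen, "%s=%s", key, value);
	if (n < 0 || (size_t)n >= buflen)
		return -1;
	return 0;
}
```

snprintf returns the length the full string would have needed, so comparing that against the buffer size is enough to detect a cut-off value.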
* [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (4 preceding siblings ...)
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
2024-12-31 23:43 ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
` (4 more replies)
2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
` (9 subsequent siblings)
15 siblings, 5 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This series creates a new NOALLOC flag for allocation groups that causes
the block and inode allocators to look elsewhere when trying to allocate
resources. This is either the first part of a patchset to implement
online shrinking (set noalloc on the last AGs, run fsr to move the files
and directories) or freeze-free rmapbt rebuilding (set noalloc to
prevent creation of new mappings, then hook deletion of old mappings).
This is still totally a research project.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=noalloc-ags

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=noalloc-ags
---
Commits in this patchset:
 * xfs: track deferred ops statistics
 * xfs: create a noalloc mode for allocation groups
 * xfs: enable userspace to hide an AG from allocation
 * xfs: apply noalloc mode to inode allocations too
 * xfs_io: enhance the aginfo command to control the noalloc flag
---
 include/xfs_trace.h  |    2 +
 include/xfs_trans.h  |    4 ++
 io/aginfo.c          |   45 ++++++++++++++++++--
 libxfs/xfs_ag.c      |  114 ++++++++++++++++++++++++++++++++++++++++++++++++++
 libxfs/xfs_ag.h      |    8 ++++
 libxfs/xfs_ag_resv.c |   28 +++++++++++-
 libxfs/xfs_defer.c   |   18 +++++++-
 libxfs/xfs_fs.h      |    5 ++
 libxfs/xfs_ialloc.c  |    3 +
 man/man8/xfs_io.8    |    6 ++-
 10 files changed, 223 insertions(+), 10 deletions(-)

^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 1/5] xfs: track deferred ops statistics 2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong @ 2024-12-31 23:43 ` Darrick J. Wong 2024-12-31 23:43 ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Track some basic statistics on how hard we're pushing the defer ops. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/xfs_trans.h | 4 ++++ libxfs/xfs_defer.c | 18 +++++++++++++++++- 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/include/xfs_trans.h b/include/xfs_trans.h index 248064019a0ab5..64d73c36851b75 100644 --- a/include/xfs_trans.h +++ b/include/xfs_trans.h @@ -82,6 +82,10 @@ typedef struct xfs_trans { long t_frextents_delta;/* superblock freextents chg*/ struct list_head t_items; /* log item descriptors */ struct list_head t_dfops; /* deferred operations */ + + unsigned int t_dfops_nr; + unsigned int t_dfops_nr_max; + unsigned int t_dfops_finished; } xfs_trans_t; void xfs_trans_init(struct xfs_mount *); diff --git a/libxfs/xfs_defer.c b/libxfs/xfs_defer.c index 8f6708c0f3bfcd..7e6167949f6509 100644 --- a/libxfs/xfs_defer.c +++ b/libxfs/xfs_defer.c @@ -611,6 +611,8 @@ xfs_defer_finish_one( /* Done with the dfp, free it. 
*/ list_del(&dfp->dfp_list); kmem_cache_free(xfs_defer_pending_cache, dfp); + tp->t_dfops_nr--; + tp->t_dfops_finished++; out: if (ops->finish_cleanup) ops->finish_cleanup(tp, state, error); @@ -673,6 +675,9 @@ xfs_defer_finish_noroll( list_splice_init(&(*tp)->t_dfops, &dop_pending); + (*tp)->t_dfops_nr_max = max((*tp)->t_dfops_nr, + (*tp)->t_dfops_nr_max); + if (has_intents < 0) { error = has_intents; goto out_shutdown; @@ -714,6 +719,7 @@ xfs_defer_finish_noroll( xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE); trace_xfs_defer_finish_error(*tp, error); xfs_defer_cancel_list((*tp)->t_mountp, &dop_pending); + (*tp)->t_dfops_nr = 0; xfs_defer_cancel(*tp); return error; } @@ -761,6 +767,7 @@ xfs_defer_cancel( trace_xfs_defer_cancel(tp, _RET_IP_); xfs_defer_trans_abort(tp, &tp->t_dfops); xfs_defer_cancel_list(mp, &tp->t_dfops); + tp->t_dfops_nr = 0; } /* @@ -846,8 +853,10 @@ xfs_defer_add( } dfp = xfs_defer_find_last(tp, ops); - if (!dfp || !xfs_defer_can_append(dfp, ops)) + if (!dfp || !xfs_defer_can_append(dfp, ops)) { dfp = xfs_defer_alloc(&tp->t_dfops, ops); + tp->t_dfops_nr++; + } xfs_defer_add_item(dfp, li); trace_xfs_defer_add_item(tp->t_mountp, dfp, li); @@ -872,6 +881,7 @@ xfs_defer_add_barrier( return; xfs_defer_alloc(&tp->t_dfops, &xfs_barrier_defer_type); + tp->t_dfops_nr++; trace_xfs_defer_add_item(tp->t_mountp, dfp, NULL); } @@ -932,6 +942,12 @@ xfs_defer_move( struct xfs_trans *stp) { list_splice_init(&stp->t_dfops, &dtp->t_dfops); + dtp->t_dfops_nr += stp->t_dfops_nr; + dtp->t_dfops_nr_max = stp->t_dfops_nr_max; + dtp->t_dfops_finished = stp->t_dfops_finished; + stp->t_dfops_nr = 0; + stp->t_dfops_nr_max = 0; + stp->t_dfops_finished = 0; /* * Low free space mode was historically controlled by a dfops field. ^ permalink raw reply related [flat|nested] 110+ messages in thread
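The three counters added to struct xfs_trans track pending work items, a high-water mark, and completed items. The toy model below shows how they move relative to each other. It is a simplified illustration, not the libxfs code. struct dfops_stats and the dfops_* helpers are made-up names that loosely correspond to t_dfops_nr, t_dfops_nr_max, and t_dfops_finished, and the model assumes the maximum is sampled once per pass of the finish loop, which is how I read the xfs_defer_finish_noroll hunk.

```c
#include <assert.h>

/* Toy stand-in for the three counters added to struct xfs_trans. */
struct dfops_stats {
	unsigned int	nr;		/* pending work items right now */
	unsigned int	nr_max;		/* largest nr seen when sampled */
	unsigned int	finished;	/* work items completed so far */
};

/* A new pending item was allocated (xfs_defer_add and friends). */
static void
dfops_add(struct dfops_stats *s)
{
	s->nr++;
}

/* Sample the high-water mark, as the finish loop does on each pass. */
static void
dfops_sample_max(struct dfops_stats *s)
{
	if (s->nr > s->nr_max)
		s->nr_max = s->nr;
}

/* One pending item ran to completion and was freed. */
static void
dfops_finish_one(struct dfops_stats *s)
{
	s->nr--;
	s->finished++;
}

/* Cancelling drops all pending items but keeps the history counters. */
static void
dfops_cancel(struct dfops_stats *s)
{
	s->nr = 0;
}
```

Under this model, adding three items, finishing one, and adding one more leaves nr at 3, nr_max at 3, and finished at 1. Because the maximum is only sampled at discrete points, a short spike between samples would not be recorded.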
* [PATCH 2/5] xfs: create a noalloc mode for allocation groups 2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong 2024-12-31 23:43 ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong @ 2024-12-31 23:43 ` Darrick J. Wong 2024-12-31 23:43 ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a new noalloc state for the per-AG structure that will disable block allocation in this AG. We accomplish this by subtracting from fdblocks all the free blocks in this AG, hiding those blocks from the allocator, and preventing freed blocks from updating fdblocks until we're ready to lift noalloc mode. Note that we reduce the free block count of the filesystem so that we can prevent transactions from entering the allocator looking for "free" space that we've turned off incore. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/xfs_trace.h | 2 ++ libxfs/xfs_ag.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++ libxfs/xfs_ag.h | 8 +++++++ libxfs/xfs_ag_resv.c | 28 +++++++++++++++++++++-- 4 files changed, 95 insertions(+), 3 deletions(-) diff --git a/include/xfs_trace.h b/include/xfs_trace.h index 30166c11dd597b..7778366c5e3319 100644 --- a/include/xfs_trace.h +++ b/include/xfs_trace.h @@ -13,6 +13,8 @@ #define trace_xfbtree_trans_cancel_buf(...) ((void) 0) #define trace_xfbtree_trans_commit_buf(...) ((void) 0) +#define trace_xfs_ag_clear_noalloc(a) ((void) 0) +#define trace_xfs_ag_set_noalloc(a) ((void) 0) #define trace_xfs_agfl_reset(a,b,c,d) ((void) 0) #define trace_xfs_agfl_free_defer(...) ((void) 0) #define trace_xfs_alloc_cur_check(...) 
((void) 0) diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c index 095b581a116180..462d16347cadb9 100644 --- a/libxfs/xfs_ag.c +++ b/libxfs/xfs_ag.c @@ -974,3 +974,63 @@ xfs_ag_get_geometry( xfs_buf_relse(agi_bp); return error; } + +/* How many blocks does this AG contribute to fdblocks? */ +xfs_extlen_t +xfs_ag_fdblocks( + struct xfs_perag *pag) +{ + xfs_extlen_t ret; + + ASSERT(xfs_perag_initialised_agf(pag)); + + ret = pag->pagf_freeblks + pag->pagf_flcount + pag->pagf_btreeblks; + ret -= pag->pag_meta_resv.ar_reserved; + ret -= pag->pag_rmapbt_resv.ar_orig_reserved; + return ret; +} + +/* + * Hide all the free space in this AG. Caller must hold both the AGI and the + * AGF buffers or have otherwise prevented concurrent access. + */ +int +xfs_ag_set_noalloc( + struct xfs_perag *pag) +{ + struct xfs_mount *mp = pag_mount(pag); + int error; + + ASSERT(xfs_perag_initialised_agf(pag)); + ASSERT(xfs_perag_initialised_agi(pag)); + + if (xfs_perag_prohibits_alloc(pag)) + return 0; + + error = xfs_dec_fdblocks(mp, xfs_ag_fdblocks(pag), false); + if (error) + return error; + + trace_xfs_ag_set_noalloc(pag); + set_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate); + return 0; +} + +/* + * Unhide all the free space in this AG. Caller must hold both the AGI and + * the AGF buffers or have otherwise prevented concurrent access. 
+ */ +void +xfs_ag_clear_noalloc( + struct xfs_perag *pag) +{ + struct xfs_mount *mp = pag_mount(pag); + + if (!xfs_perag_prohibits_alloc(pag)) + return; + + xfs_add_fdblocks(mp, xfs_ag_fdblocks(pag)); + + trace_xfs_ag_clear_noalloc(pag); + clear_bit(XFS_AGSTATE_NOALLOC, &pag->pag_opstate); +} diff --git a/libxfs/xfs_ag.h b/libxfs/xfs_ag.h index 1f24cfa2732172..e8fae59206d929 100644 --- a/libxfs/xfs_ag.h +++ b/libxfs/xfs_ag.h @@ -120,6 +120,7 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag) #define XFS_AGSTATE_PREFERS_METADATA 2 #define XFS_AGSTATE_ALLOWS_INODES 3 #define XFS_AGSTATE_AGFL_NEEDS_RESET 4 +#define XFS_AGSTATE_NOALLOC 5 #define __XFS_AG_OPSTATE(name, NAME) \ static inline bool xfs_perag_ ## name (struct xfs_perag *pag) \ @@ -132,6 +133,7 @@ __XFS_AG_OPSTATE(initialised_agi, AGI_INIT) __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA) __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES) __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET) +__XFS_AG_OPSTATE(prohibits_alloc, NOALLOC) int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t orig_agcount, xfs_agnumber_t new_agcount, xfs_rfsblock_t dcount, @@ -164,6 +166,7 @@ xfs_perag_put( xfs_group_put(pag_group(pag)); } + /* Active AG references */ static inline struct xfs_perag * xfs_perag_grab( @@ -208,6 +211,11 @@ xfs_perag_next( return xfs_perag_next_from(mp, pag, 0); } +/* Enable or disable allocation from an AG */ +xfs_extlen_t xfs_ag_fdblocks(struct xfs_perag *pag); +int xfs_ag_set_noalloc(struct xfs_perag *pag); +void xfs_ag_clear_noalloc(struct xfs_perag *pag); + /* * Per-ag geometry infomation and validation */ diff --git a/libxfs/xfs_ag_resv.c b/libxfs/xfs_ag_resv.c index 83cac20331fd34..e811a6807e12ea 100644 --- a/libxfs/xfs_ag_resv.c +++ b/libxfs/xfs_ag_resv.c @@ -20,6 +20,7 @@ #include "xfs_ialloc_btree.h" #include "xfs_ag.h" #include "xfs_ag_resv.h" +#include "xfs_ag.h" /* * Per-AG Block Reservations @@ -73,6 +74,13 @@ xfs_ag_resv_critical( xfs_extlen_t avail; 
xfs_extlen_t orig; + /* + * Pretend we're critically low on reservations in this AG to scare + * everyone else away. + */ + if (xfs_perag_prohibits_alloc(pag)) + return true; + switch (type) { case XFS_AG_RESV_METADATA: avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved; @@ -115,7 +123,12 @@ xfs_ag_resv_needed( break; case XFS_AG_RESV_METAFILE: case XFS_AG_RESV_NONE: - /* empty */ + /* + * In noalloc mode, we pretend that all the free blocks in this + * AG have been allocated. Make this AG look full. + */ + if (xfs_perag_prohibits_alloc(pag)) + len += xfs_ag_fdblocks(pag); break; default: ASSERT(0); @@ -343,6 +356,8 @@ xfs_ag_resv_alloc_extent( xfs_extlen_t len; uint field; + ASSERT(type != XFS_AG_RESV_NONE || !xfs_perag_prohibits_alloc(pag)); + trace_xfs_ag_resv_alloc_extent(pag, type, args->len); switch (type) { @@ -400,7 +415,14 @@ xfs_ag_resv_free_extent( ASSERT(0); fallthrough; case XFS_AG_RESV_NONE: - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len); + /* + * Normally we put freed blocks back into fdblocks. In noalloc + * mode, however, we pretend that there are no fdblocks in the + * AG, so don't put them back. + */ + if (!xfs_perag_prohibits_alloc(pag)) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, + (int64_t)len); fallthrough; case XFS_AG_RESV_IGNORE: return; @@ -413,6 +435,6 @@ xfs_ag_resv_free_extent( /* Freeing into the reserved pool only requires on-disk update... */ xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len); /* ...but freeing beyond that requires in-core and on-disk update. */ - if (len > leftover) + if (len > leftover && !xfs_perag_prohibits_alloc(pag)) xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover); } ^ permalink raw reply related [flat|nested] 110+ messages in thread
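The accounting described in the commit message above can be modelled in a few lines of plain C. This is only a sketch under stated assumptions: struct ag_model, struct fs_model and the ag_* helpers are hypothetical names, the -1 return stands in for whatever error xfs_dec_fdblocks reports, and locking is ignored entirely. The formula in ag_fdblocks follows xfs_ag_fdblocks from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-AG counters, named after the fields used in the patch. */
struct ag_model {
	uint32_t	freeblks;		/* pagf_freeblks */
	uint32_t	flcount;		/* pagf_flcount */
	uint32_t	btreeblks;		/* pagf_btreeblks */
	uint32_t	meta_reserved;		/* pag_meta_resv.ar_reserved */
	uint32_t	rmapbt_orig_reserved;	/* pag_rmapbt_resv.ar_orig_reserved */
	bool		noalloc;
};

/* Hypothetical filesystem-wide free block counter. */
struct fs_model {
	int64_t		fdblocks;
};

/* How many blocks does this AG contribute to fdblocks? */
static uint32_t ag_fdblocks(const struct ag_model *ag)
{
	return ag->freeblks + ag->flcount + ag->btreeblks -
	       ag->meta_reserved - ag->rmapbt_orig_reserved;
}

/* Hide the AG's free space; returns -1 if fdblocks cannot cover it. */
static int ag_set_noalloc(struct fs_model *fs, struct ag_model *ag)
{
	uint32_t	hide;

	if (ag->noalloc)
		return 0;
	hide = ag_fdblocks(ag);
	if (fs->fdblocks < (int64_t)hide)
		return -1;
	fs->fdblocks -= hide;
	ag->noalloc = true;
	return 0;
}

/* Freed blocks always return to the AG, but only reach fdblocks if visible. */
static void ag_free_extent(struct fs_model *fs, struct ag_model *ag,
			   uint32_t len)
{
	ag->freeblks += len;
	if (!ag->noalloc)
		fs->fdblocks += len;
}

/* Unhide the AG, giving back whatever it holds now. */
static void ag_clear_noalloc(struct fs_model *fs, struct ag_model *ag)
{
	if (!ag->noalloc)
		return;
	fs->fdblocks += ag_fdblocks(ag);
	ag->noalloc = false;
}
```

Note that clearing noalloc adds back the AG's current contribution, so blocks freed while the AG was hidden become visible only at that point.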
* [PATCH 3/5] xfs: enable userspace to hide an AG from allocation 2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong 2024-12-31 23:43 ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong 2024-12-31 23:43 ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong @ 2024-12-31 23:43 ` Darrick J. Wong 2024-12-31 23:44 ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong 2024-12-31 23:44 ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong 4 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:43 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add an administrative interface so that userspace can hide an allocation group from block allocation. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_ag.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ libxfs/xfs_fs.h | 5 +++++ 2 files changed, 59 insertions(+) diff --git a/libxfs/xfs_ag.c b/libxfs/xfs_ag.c index 462d16347cadb9..b3e21e0d26a36c 100644 --- a/libxfs/xfs_ag.c +++ b/libxfs/xfs_ag.c @@ -930,6 +930,54 @@ xfs_ag_extend_space( return 0; } +/* Compute the AG geometry flags. */ +static inline uint32_t +xfs_ag_calc_geoflags( + struct xfs_perag *pag) +{ + uint32_t ret = 0; + + if (xfs_perag_prohibits_alloc(pag)) + ret |= XFS_AG_FLAG_NOALLOC; + + return ret; +} + +/* + * Compare the current AG geometry flags against the flags in the AG geometry + * structure and update the AG state to reflect any changes, then update the + * struct to reflect the current status. 
+ */ +static inline int +xfs_ag_update_geoflags( + struct xfs_perag *pag, + struct xfs_ag_geometry *ageo, + uint32_t new_flags) +{ + uint32_t old_flags = xfs_ag_calc_geoflags(pag); + int error; + + if (!(new_flags & XFS_AG_FLAG_UPDATE)) { + ageo->ag_flags = old_flags; + return 0; + } + + if ((old_flags & XFS_AG_FLAG_NOALLOC) && + !(new_flags & XFS_AG_FLAG_NOALLOC)) { + xfs_ag_clear_noalloc(pag); + } + + if (!(old_flags & XFS_AG_FLAG_NOALLOC) && + (new_flags & XFS_AG_FLAG_NOALLOC)) { + error = xfs_ag_set_noalloc(pag); + if (error) + return error; + } + + ageo->ag_flags = xfs_ag_calc_geoflags(pag); + return 0; +} + /* Retrieve AG geometry. */ int xfs_ag_get_geometry( @@ -941,6 +989,7 @@ xfs_ag_get_geometry( struct xfs_agi *agi; struct xfs_agf *agf; unsigned int freeblks; + uint32_t inflags = ageo->ag_flags; int error; /* Lock the AG headers. */ @@ -951,6 +1000,10 @@ xfs_ag_get_geometry( if (error) goto out_agi; + error = xfs_ag_update_geoflags(pag, ageo, inflags); + if (error) + goto out; + /* Fill out form. */ memset(ageo, 0, sizeof(*ageo)); ageo->ag_number = pag_agno(pag); @@ -968,6 +1021,7 @@ xfs_ag_get_geometry( ageo->ag_freeblks = freeblks; xfs_ag_geom_health(pag, ageo); +out: /* Release resources. */ xfs_buf_relse(agf_bp); out_agi: diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h index 12463ba766da05..b391bf9de93dbf 100644 --- a/libxfs/xfs_fs.h +++ b/libxfs/xfs_fs.h @@ -307,6 +307,11 @@ struct xfs_ag_geometry { #define XFS_AG_GEOM_SICK_REFCNTBT (1 << 9) /* reference counts */ #define XFS_AG_GEOM_SICK_INODES (1 << 10) /* bad inodes were seen */ +#define XFS_AG_FLAG_UPDATE (1 << 0) /* update flags */ +#define XFS_AG_FLAG_NOALLOC (1 << 1) /* do not allocate from this AG */ +#define XFS_AG_FLAG_ALL (XFS_AG_FLAG_UPDATE | \ + XFS_AG_FLAG_NOALLOC) + /* * Structures for XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG & XFS_IOC_FSGROWFSRT */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
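The flag handshake in xfs_ag_update_geoflags can be summarised as a small state function. The sketch below is an assumption-laden simplification: apply_geoflags and the AG_FLAG_* macros are hypothetical local names (the bit values match XFS_AG_FLAG_UPDATE and XFS_AG_FLAG_NOALLOC from the diff), and it ignores the error path where setting noalloc can fail.

```c
#include <stdbool.h>
#include <stdint.h>

#define AG_FLAG_UPDATE	(1u << 0)	/* caller wants to change state */
#define AG_FLAG_NOALLOC	(1u << 1)	/* do not allocate from this AG */

/*
 * Apply the caller's input flags to the AG's noalloc state and return the
 * flags that describe the resulting state.  Without AG_FLAG_UPDATE the call
 * is a pure query and the state is left alone.
 */
static uint32_t apply_geoflags(bool *noalloc, uint32_t in_flags)
{
	if (in_flags & AG_FLAG_UPDATE)
		*noalloc = (in_flags & AG_FLAG_NOALLOC) != 0;
	return *noalloc ? AG_FLAG_NOALLOC : 0;
}
```

Requiring an explicit update bit lets callers that only want geometry pass whatever is in ag_flags without changing allocator behaviour by accident.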
* [PATCH 4/5] xfs: apply noalloc mode to inode allocations too
2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
` (2 preceding siblings ...)
2024-12-31 23:43 ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
@ 2024-12-31 23:44 ` Darrick J. Wong
2024-12-31 23:44 ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong
4 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Don't allow inode allocations from this group if it's marked noalloc.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_ialloc.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/libxfs/xfs_ialloc.c b/libxfs/xfs_ialloc.c
index b401299ad933f7..a086fb30b227a0 100644
--- a/libxfs/xfs_ialloc.c
+++ b/libxfs/xfs_ialloc.c
@@ -1102,6 +1102,7 @@ xfs_dialloc_ag_inobt(
 	ASSERT(xfs_perag_initialised_agi(pag));
 	ASSERT(xfs_perag_allows_inodes(pag));
+	ASSERT(!xfs_perag_prohibits_alloc(pag));
 	ASSERT(pag->pagi_freecount > 0);
 restart_pagno:
@@ -1730,6 +1731,8 @@ xfs_dialloc_good_ag(
 		return false;
 	if (!xfs_perag_allows_inodes(pag))
 		return false;
+	if (xfs_perag_prohibits_alloc(pag))
+		return false;
 	if (!xfs_perag_initialised_agi(pag)) {
 		error = xfs_ialloc_read_agi(pag, tp, 0, NULL);
^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag 2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong ` (3 preceding siblings ...) 2024-12-31 23:44 ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong @ 2024-12-31 23:44 ` Darrick J. Wong 4 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Augment the aginfo command to be able to set and clear the noalloc state for an AG. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- io/aginfo.c | 45 ++++++++++++++++++++++++++++++++++++++++----- man/man8/xfs_io.8 | 6 +++++- 2 files changed, 45 insertions(+), 6 deletions(-) diff --git a/io/aginfo.c b/io/aginfo.c index f81986f0df4df3..0320a98b12f981 100644 --- a/io/aginfo.c +++ b/io/aginfo.c @@ -19,9 +19,11 @@ static cmdinfo_t rginfo_cmd; static int report_aginfo( struct xfs_fd *xfd, - xfs_agnumber_t agno) + xfs_agnumber_t agno, + int oflag) { struct xfs_ag_geometry ageo = { 0 }; + bool update = false; int ret; ret = -xfrog_ag_geometry(xfd->fd, agno, &ageo); @@ -30,6 +32,26 @@ report_aginfo( return 1; } + switch (oflag) { + case 0: + ageo.ag_flags |= XFS_AG_FLAG_UPDATE; + ageo.ag_flags &= ~XFS_AG_FLAG_NOALLOC; + update = true; + break; + case 1: + ageo.ag_flags |= (XFS_AG_FLAG_UPDATE | XFS_AG_FLAG_NOALLOC); + update = true; + break; + } + + if (update) { + ret = -xfrog_ag_geometry(xfd->fd, agno, &ageo); + if (ret) { + xfrog_perror(ret, "aginfo update"); + return 1; + } + } + printf(_("AG: %u\n"), ageo.ag_number); printf(_("Blocks: %u\n"), ageo.ag_length); printf(_("Free Blocks: %u\n"), ageo.ag_freeblks); @@ -51,6 +73,7 @@ aginfo_f( struct xfs_fd xfd = XFS_FD_INIT(file->fd); unsigned long long x; xfs_agnumber_t agno = NULLAGNUMBER; + int oflag = -1; int c; int ret = 0; @@ -61,7 +84,7 @@ aginfo_f( return 1; } - while ((c = getopt(argc, argv, "a:")) != 
EOF) { + while ((c = getopt(argc, argv, "a:o:")) != EOF) { switch (c) { case 'a': errno = 0; @@ -74,16 +97,27 @@ aginfo_f( } agno = x; break; + case 'o': + errno = 0; + x = strtoll(optarg, NULL, 10); + if (!errno && x != 0 && x != 1) + errno = ERANGE; + if (errno) { + perror("aginfo"); + return 1; + } + oflag = x; + break; default: return command_usage(&aginfo_cmd); } } if (agno != NULLAGNUMBER) { - ret = report_aginfo(&xfd, agno); + ret = report_aginfo(&xfd, agno, oflag); } else { for (agno = 0; !ret && agno < xfd.fsgeom.agcount; agno++) { - ret = report_aginfo(&xfd, agno); + ret = report_aginfo(&xfd, agno, oflag); } } @@ -98,6 +132,7 @@ aginfo_help(void) "Report allocation group geometry.\n" "\n" " -a agno -- Report on the given allocation group.\n" +" -o state -- Change the NOALLOC state for this allocation group.\n" "\n")); } @@ -107,7 +142,7 @@ static cmdinfo_t aginfo_cmd = { .cfunc = aginfo_f, .argmin = 0, .argmax = -1, - .args = "[-a agno]", + .args = "[-a agno] [-o state]", .flags = CMD_NOMAP_OK, .help = aginfo_help, }; diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8 index 59d5ddc54dcc66..a42ab61a0de422 100644 --- a/man/man8/xfs_io.8 +++ b/man/man8/xfs_io.8 @@ -1243,7 +1243,7 @@ .SH MEMORY MAPPED I/O COMMANDS .SH FILESYSTEM COMMANDS .TP -.BI "aginfo [ \-a " agno " ]" +.BI "aginfo [ \-a " agno " ] [ \-o " nr " ]" Show information about or update the state of allocation groups. .RE .RS 1.0i @@ -1251,6 +1251,10 @@ .SH FILESYSTEM COMMANDS .TP .BI \-a Act only on a specific allocation group. +.TP +.BI \-o +If 0, clear the NOALLOC flag. +If 1, set the NOALLOC flag. .PD .RE ^ permalink raw reply related [flat|nested] 110+ messages in thread
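The -o argument handling above accepts only 0 or 1. A slightly stricter way to parse such an argument is sketched here; parse_noalloc_state is a hypothetical helper, not something in xfs_io, and the choice to reject trailing characters with EINVAL is an assumption that goes beyond what the patch does.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a NOALLOC state argument.  Returns 0 or 1 on success, or -1 with
 * errno set if the string is not a number, has trailing junk, or is out of
 * range.
 */
static int parse_noalloc_state(const char *arg)
{
	char		*end;
	long long	x;

	errno = 0;
	x = strtoll(arg, &end, 10);
	if (errno)
		return -1;
	if (end == arg || *end != '\0') {
		errno = EINVAL;		/* empty string or trailing characters */
		return -1;
	}
	if (x != 0 && x != 1) {
		errno = ERANGE;		/* only 0 (clear) and 1 (set) are valid */
		return -1;
	}
	return (int)x;
}
```

Checking the end pointer catches inputs such as "1x" or an empty string, which a bare strtoll call with a NULL end pointer would quietly accept as 1 or 0.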
* [PATCHSET 2/5] xfsprogs: report refcount information to userspace
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (5 preceding siblings ...)
2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
@ 2024-12-31 23:33 ` Darrick J. Wong
2024-12-31 23:44 ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong
2024-12-31 23:44 ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
` (8 subsequent siblings)
15 siblings, 2 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:33 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs
Hi all,
Create a new ioctl to report the number of owners of each disk block so
that reflink-aware defraggers can make better decisions about which
extents to target.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
* xfs: export reference count information to userspace
* xfs_io: dump reference count information
---
 io/Makefile | 1
 io/fsrefcounts.c | 476 +++++++++++++++++++++++++++++++++++
 io/init.c | 1
 io/io.h | 1
 libxfs/xfs_fs.h | 80 ++++++
 man/man2/ioctl_xfs_getfsrefcounts.2 | 237 +++++++++++++++++
 man/man8/xfs_io.8 | 88 ++++++
 7 files changed, 884 insertions(+)
 create mode 100644 io/fsrefcounts.c
 create mode 100644 man/man2/ioctl_xfs_getfsrefcounts.2
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 1/2] xfs: export reference count information to userspace 2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong @ 2024-12-31 23:44 ` Darrick J. Wong 2024-12-31 23:44 ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong 1 sibling, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Export refcount info to userspace so we can prototype a sharing-aware defrag/fs rearranging tool. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_fs.h | 80 ++++++++++++ man/man2/ioctl_xfs_getfsrefcounts.2 | 237 +++++++++++++++++++++++++++++++++++ 2 files changed, 317 insertions(+) create mode 100644 man/man2/ioctl_xfs_getfsrefcounts.2 diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h index b391bf9de93dbf..936f719236944f 100644 --- a/libxfs/xfs_fs.h +++ b/libxfs/xfs_fs.h @@ -1008,6 +1008,85 @@ struct xfs_rtgroup_geometry { #define XFS_RTGROUP_GEOM_SICK_RMAPBT (1U << 3) /* reverse mappings */ #define XFS_RTGROUP_GEOM_SICK_REFCNTBT (1U << 4) /* reference counts */ +/* + * Structure for XFS_IOC_GETFSREFCOUNTS. + * + * The memory layout for this call are the scalar values defined in struct + * xfs_getfsrefs_head, followed by two struct xfs_getfsrefs that describe + * the lower and upper bound of mappings to return, followed by an array + * of struct xfs_getfsrefs mappings. + * + * fch_iflags control the output of the call, whereas fch_oflags report + * on the overall record output. fch_count should be set to the length + * of the fch_recs array, and fch_entries will be set to the number of + * entries filled out during each call. If fch_count is zero, the number + * of refcount mappings will be returned in fch_entries, though no + * mappings will be returned. fch_reserved must be set to zero. + * + * The two elements in the fch_keys array are used to constrain the + * output. 
The first element in the array should represent the lowest + * disk mapping ("low key") that the user wants to learn about. If this + * value is all zeroes, the filesystem will return the first entry it + * knows about. For a subsequent call, the contents of + * fsrefs_head.fch_recs[fsrefs_head.fch_count - 1] should be copied into + * fch_keys[0] to have the kernel start where it left off. + * + * The second element in the fch_keys array should represent the highest + * disk mapping ("high key") that the user wants to learn about. If this + * value is all ones, the filesystem will not stop until it runs out of + * mapping to return or runs out of space in fch_recs. + * + * fcr_device can be either a 32-bit cookie representing a device, or a + * 32-bit dev_t if the FCH_OF_DEV_T flag is set. fcr_physical and + * fcr_length are expressed in units of bytes. fcr_owners is the number + * of owners. + */ +struct xfs_getfsrefs { + __u32 fcr_device; /* device id */ + __u32 fcr_flags; /* mapping flags */ + __u64 fcr_physical; /* device offset of segment */ + __u64 fcr_owners; /* number of owners */ + __u64 fcr_length; /* length of segment */ + __u64 fcr_reserved[4]; /* must be zero */ +}; + +struct xfs_getfsrefs_head { + __u32 fch_iflags; /* control flags */ + __u32 fch_oflags; /* output flags */ + __u32 fch_count; /* # of entries in array incl. input */ + __u32 fch_entries; /* # of entries filled in (output). */ + __u64 fch_reserved[6]; /* must be zero */ + + struct xfs_getfsrefs fch_keys[2]; /* low and high keys for the mapping search */ + struct xfs_getfsrefs fch_recs[]; /* returned records */ +}; + +/* Size of an fsrefs_head with room for nr records. */ +static inline unsigned long long +xfs_getfsrefs_sizeof( + unsigned int nr) +{ + return sizeof(struct xfs_getfsrefs_head) + + (nr * sizeof(struct xfs_getfsrefs)); +} + +/* Start the next fsrefs query at the end of the current query results. 
*/ +static inline void +xfs_getfsrefs_advance( + struct xfs_getfsrefs_head *head) +{ + head->fch_keys[0] = head->fch_recs[head->fch_entries - 1]; +} + +/* fch_iflags values - set by XFS_IOC_GETFSREFCOUNTS caller in the header. */ +#define FCH_IF_VALID 0 + +/* fch_oflags values - returned in the header segment only. */ +#define FCH_OF_DEV_T (1U << 0) /* fcr_device values will be dev_t */ + +/* fcr_flags values - returned for each non-header segment */ +#define FCR_OF_LAST (1U << 0) /* last record in the dataset */ + /* * ioctl commands that are used by Linux filesystems */ @@ -1047,6 +1126,7 @@ struct xfs_rtgroup_geometry { #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle) #define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head) #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry) +#define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/man/man2/ioctl_xfs_getfsrefcounts.2 b/man/man2/ioctl_xfs_getfsrefcounts.2 new file mode 100644 index 00000000000000..9a5e7273fcacdd --- /dev/null +++ b/man/man2/ioctl_xfs_getfsrefcounts.2 @@ -0,0 +1,237 @@ +.\" Copyright (c) 2021-2025 Oracle. All rights reserved. +.\" +.\" %%%LICENSE_START(GPLv2+_DOC_FULL) +.\" This is free documentation; you can redistribute it and/or +.\" modify it under the terms of the GNU General Public License as +.\" published by the Free Software Foundation; either version 2 of +.\" the License, or (at your option) any later version. +.\" +.\" The GNU General Public License's references to "object code" +.\" and "executables" are to be interpreted as the output of any +.\" document formatting or typesetting system, including +.\" intermediate and printed output. +.\" +.\" This manual is distributed in the hope that it will be useful, +.\" but WITHOUT ANY WARRANTY; without even the implied warranty of +.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the +.\" GNU General Public License for more details. +.\" +.\" You should have received a copy of the GNU General Public +.\" License along with this manual; if not, see +.\" <http://www.gnu.org/licenses/>. +.\" %%%LICENSE_END +.TH IOCTL-XFS-GETFSREFCOUNTS 2 2023-05-08 "XFS" +.SH NAME +ioctl_xfs_getfsrefcounts \- retrieve the number of owners of space in the filesystem +.SH SYNOPSIS +.nf +.B #include <sys/ioctl.h> +.PP +.BI "int ioctl(int " fd ", XFS_IOC_GETFSREFCOUNTS, struct xfs_fsrefs_head * " arg ); +.fi +.SH DESCRIPTION +This +.BR ioctl (2) +operation retrieves the number of owners for space extents in a filesystem. +This information can be used to discover the sharing factor of physical media, +among other things. +.PP +The sole argument to this operation should be a pointer to a single +.IR "struct xfs_getfsrefs_head" ":" +.PP +.in +4n +.EX +struct xfs_getfsrefs { + __u32 fcr_device; /* Device ID */ + __u32 fcr_flags; /* Mapping flags */ + __u64 fcr_physical; /* Device offset of segment */ + __u64 fcr_owners; /* Number of Owners */ + __u64 fcr_length; /* Length of segment */ + __u64 fcr_reserved[4]; /* Must be zero */ +}; + +struct xfs_getfsrefs_head { + __u32 fch_iflags; /* Control flags */ + __u32 fch_oflags; /* Output flags */ + __u32 fch_count; /* # of entries in array incl. input */ + __u32 fch_entries; /* # of entries filled in (output) */ + __u64 fch_reserved[6]; /* Must be zero */ + + struct xfs_getfsrefs fch_keys[2]; /* Low and high keys for + the mapping search */ + struct xfs_getfsrefs fch_recs[]; /* Returned records */ +}; +.EE +.in +.PP +The two +.I fch_keys +array elements specify the lowest and highest reverse-mapping +key for which the application would like physical mapping +information. +A reverse mapping key consists of the tuple (device, block, owner, offset). 
+The owner and offset fields are part of the key because some filesystems +support sharing physical blocks between multiple files and +therefore may return multiple mappings for a given physical block. +.PP +Filesystem mappings are copied into the +.I fch_recs +array, which immediately follows the header data. +.\" +.SS Fields of struct xfs_getfsrefs_head +The +.I fch_iflags +field is a bit mask passed to the kernel to alter the output. +No flags are currently defined, so the caller must set this value to zero. +.PP +The +.I fch_oflags +field is a bit mask of flags set by the kernel concerning the returned mappings. +If +.B FCH_OF_DEV_T +is set, then the +.I fcr_device +field represents a +.I dev_t +structure containing the major and minor numbers of the block device. +.PP +The +.I fch_count +field contains the number of elements in the array being passed to the +kernel. +If this value is 0, +.I fch_entries +will be set to the number of records that would have been returned had +the array been large enough; +no mapping information will be returned. +.PP +The +.I fch_entries +field contains the number of elements in the +.I fch_recs +array that contain useful information. +.PP +The +.I fch_reserved +fields must be set to zero. +.\" +.SS Keys +The two key records in +.I fsrefs_head.fch_keys +specify the lowest and highest extent records in the keyspace that the caller +wants returned. +The tuple +.RI "(" "device" ", " "physical" ", " "flags" ")" +can be used to index any filesystem space record. +The format of +.I fcr_device +in the keys must match the format of the same field in the output records, +as defined below. +By convention, the field +.I fsrefs_head.fch_keys[0] +must contain the low key and +.I fsrefs_head.fch_keys[1] +must contain the high key for the request. +.PP +For convenience, if +.I fcr_length +is set in the low key, it will be added to +.I fcr_block +as appropriate. 
+The caller can take advantage of this subtlety to set up subsequent calls +by copying +.I fsrefs_head.fch_recs[fsrefs_head.fch_entries \- 1] +into the low key. +The function +.I fsrefs_advance +(defined in +.IR linux/fsrefcounts.h ) +provides this functionality. +.\" +.SS Fields of struct xfs_getfsrefs +The +.I fcr_device +field uniquely identifies the underlying storage device. +If the +.B FCH_OF_DEV_T +flag is set in the header's +.I fch_oflags +field, this field contains a +.I dev_t +from which major and minor numbers can be extracted. +If the flag is not set, this field contains a value that must be unique +for each unique storage device. +.PP +The +.I fcr_physical +field contains the disk address of the extent in bytes. +.PP +The +.I fcr_owners +field contains the number of owners of this extent. +The actual owners can be queried with the +.BR FS_IOC_GETFSMAP (2) +ioctl. +.PP +The +.I fcr_length +field contains the length of the extent in bytes. +.PP +The +.I fcr_flags +field is a bit mask of extent state flags. +The bits are: +.RS 0.4i +.TP +.B FCR_OF_LAST +This is the last record in the data set. +.RE +.PP +The +.I fcr_reserved +field will be set to zero. +.\" +.RE +.SH RETURN VALUE +On error, \-1 is returned, and +.I errno +is set to indicate the error. +.SH ERRORS +The error placed in +.I errno +can be one of, but is not limited to, the following: +.TP +.B EBADF +.IR fd +is not open for reading. +.TP +.B EBADMSG +The filesystem has detected a checksum error in the metadata. +.TP +.B EFAULT +The pointer passed in was not mapped to a valid memory address. +.TP +.B EINVAL +The array is not long enough, the keys do not point to a valid part of +the filesystem, the low key points to a higher point in the filesystem's +physical storage address space than the high key, or a nonzero value +was passed in one of the fields that must be zero. +.TP +.B ENOMEM +Insufficient memory to process the request. 
+.TP
+.B EOPNOTSUPP
+The filesystem does not support this command.
+.TP
+.B EUCLEAN
+The filesystem metadata is corrupt and needs repair.
+.SH CONFORMING TO
+This API is XFS-specific.
+.SH EXAMPLES
+See
+.I io/fsrefcounts.c
+in the
+.I xfsprogs
+distribution for a sample program.
+.SH SEE ALSO
+.BR ioctl (2)
^ permalink raw reply related [flat|nested] 110+ messages in thread
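The man page above describes a paging protocol: fill in low and high keys, call the ioctl, then copy the last returned record into the low key and repeat until a record carries FCR_OF_LAST. The following sketch walks through that loop. Because XFS_IOC_GETFSREFCOUNTS is only proposed in this patchset, the sketch does not call ioctl(2); it takes a hypothetical callback (fsrefs_query_fn) and ships a fake in-memory backend (fake_query) so that it can run anywhere. The two structures are userspace copies of the ones in the patch, with __u32/__u64 spelled as stdint types. The high-key field choices are an assumption modelled on how callers of similar key-range ioctls usually fill them.

```c
#include <stdint.h>
#include <stdlib.h>

/* Copies of the structures proposed in the patch. */
struct xfs_getfsrefs {
	uint32_t	fcr_device;
	uint32_t	fcr_flags;
	uint64_t	fcr_physical;
	uint64_t	fcr_owners;
	uint64_t	fcr_length;
	uint64_t	fcr_reserved[4];
};

struct xfs_getfsrefs_head {
	uint32_t		fch_iflags;
	uint32_t		fch_oflags;
	uint32_t		fch_count;
	uint32_t		fch_entries;
	uint64_t		fch_reserved[6];
	struct xfs_getfsrefs	fch_keys[2];
	struct xfs_getfsrefs	fch_recs[];
};

#define FCR_OF_LAST	(1U << 0)

/* Hypothetical hook; a real caller would wrap the ioctl here. */
typedef int (*fsrefs_query_fn)(struct xfs_getfsrefs_head *head, void *priv);

/* Count extents with two or more owners, fetching 'batch' records per call. */
static long count_shared_extents(fsrefs_query_fn query, void *priv,
				 uint32_t batch)
{
	struct xfs_getfsrefs_head	*head;
	long				shared = 0;
	int				done = 0;
	uint32_t			i;

	head = calloc(1, sizeof(*head) + batch * sizeof(struct xfs_getfsrefs));
	if (!head)
		return -1;
	head->fch_count = batch;
	/* Low key stays all zeroes; high key asks for everything. */
	head->fch_keys[1].fcr_device = UINT32_MAX;
	head->fch_keys[1].fcr_flags = UINT32_MAX;
	head->fch_keys[1].fcr_physical = UINT64_MAX;
	head->fch_keys[1].fcr_owners = UINT64_MAX;

	while (!done) {
		if (query(head, priv) < 0) {
			free(head);
			return -1;
		}
		if (head->fch_entries == 0)
			break;
		for (i = 0; i < head->fch_entries; i++) {
			if (head->fch_recs[i].fcr_owners >= 2)
				shared++;
			if (head->fch_recs[i].fcr_flags & FCR_OF_LAST)
				done = 1;
		}
		/* Resume after the last record we were given. */
		head->fch_keys[0] = head->fch_recs[head->fch_entries - 1];
	}
	free(head);
	return shared;
}

/* Fake backend: a sorted array of records standing in for the kernel. */
struct fake_fs {
	const struct xfs_getfsrefs	*recs;
	uint32_t			nr;
};

static int fake_query(struct xfs_getfsrefs_head *head, void *priv)
{
	const struct fake_fs	*fs = priv;
	/* Low key physical + length marks where the previous call stopped. */
	uint64_t		start = head->fch_keys[0].fcr_physical +
					head->fch_keys[0].fcr_length;
	uint32_t		i;

	head->fch_entries = 0;
	for (i = 0; i < fs->nr && head->fch_entries < head->fch_count; i++) {
		if (fs->recs[i].fcr_physical < start)
			continue;
		head->fch_recs[head->fch_entries] = fs->recs[i];
		if (i == fs->nr - 1)
			head->fch_recs[head->fch_entries].fcr_flags |= FCR_OF_LAST;
		head->fch_entries++;
	}
	return 0;
}
```

The loop copies fch_recs[fch_entries - 1] into the low key, which is what the xfs_getfsrefs_advance helper in the patch does; indexing by fch_entries rather than fch_count matters whenever the final batch is only partly filled.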
* [PATCH 2/2] xfs_io: dump reference count information 2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong 2024-12-31 23:44 ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong @ 2024-12-31 23:44 ` Darrick J. Wong 1 sibling, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:44 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Dump refcount info from the kernel so we can prototype a sharing-aware defrag/fs rearranging tool. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- io/Makefile | 1 io/fsrefcounts.c | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++++ io/init.c | 1 io/io.h | 1 man/man8/xfs_io.8 | 88 ++++++++++ 5 files changed, 567 insertions(+) create mode 100644 io/fsrefcounts.c diff --git a/io/Makefile b/io/Makefile index 8f835ec71fd768..c57594b090f70c 100644 --- a/io/Makefile +++ b/io/Makefile @@ -22,6 +22,7 @@ CFILES = \ file.c \ freeze.c \ fsproperties.c \ + fsrefcounts.c \ fsuuid.c \ fsync.c \ getrusage.c \ diff --git a/io/fsrefcounts.c b/io/fsrefcounts.c new file mode 100644 index 00000000000000..ad1f26dfde3ec3 --- /dev/null +++ b/io/fsrefcounts.c @@ -0,0 +1,476 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "platform_defs.h" +#include "command.h" +#include "init.h" +#include "libfrog/paths.h" +#include "io.h" +#include "input.h" +#include "libfrog/fsgeom.h" + +static cmdinfo_t fsrefcounts_cmd; +static dev_t xfs_data_dev; + +static void +fsrefcounts_help(void) +{ + printf(_( +"\n" +" Prints extent owner counts for the filesystem hosting the current file" +"\n" +" fsrefcounts prints the number of owners of disk blocks used by the whole\n" +" filesystem. 
When possible, owner and offset information will be included\n" +" in the space report.\n" +"\n" +" By default, each line of the listing takes the following form:\n" +" extent: major:minor [startblock..endblock]: owner startoffset..endoffset length\n" +" All the file offsets and disk blocks are in units of 512-byte blocks.\n" +" -d -- query only the data device (default).\n" +" -l -- query only the log device.\n" +" -r -- query only the realtime device.\n" +" -n -- query n extents at a time.\n" +" -o -- only print extents with at least this many owners (default 1).\n" +" -O -- only print extents with no more than this many owners (default 2^64-1).\n" +" -m -- output machine-readable format.\n" +" -v -- Verbose information, show AG and offsets. Show flags legend on 2nd -v\n" +"\n" +"The optional start and end arguments require one of -d, -l, or -r to be set.\n" +"\n")); +} + +static void +dump_refcounts( + unsigned long long *nr, + const unsigned long long min_owners, + const unsigned long long max_owners, + struct xfs_getfsrefs_head *head) +{ + unsigned long long i; + struct xfs_getfsrefs *p; + + for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) { + if (p->fcr_owners < min_owners || p->fcr_owners > max_owners) + continue; + printf("\t%llu: %u:%u [%lld..%lld]: ", i + (*nr), + major(p->fcr_device), minor(p->fcr_device), + (long long)BTOBBT(p->fcr_physical), + (long long)BTOBBT(p->fcr_physical + p->fcr_length - 1)); + printf(_("%llu %lld\n"), + (unsigned long long)p->fcr_owners, + (long long)BTOBBT(p->fcr_length)); + } + + (*nr) += head->fch_entries; +} + +static void +dump_refcounts_machine( + unsigned long long *nr, + const unsigned long long min_owners, + const unsigned long long max_owners, + struct xfs_getfsrefs_head *head) +{ + unsigned long long i; + struct xfs_getfsrefs *p; + + if (*nr == 0) + printf(_("EXT,MAJOR,MINOR,PSTART,PEND,OWNERS,LENGTH\n")); + for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) { + if (p->fcr_owners < 
min_owners || p->fcr_owners > max_owners) + continue; + printf("%llu,%u,%u,%lld,%lld,", i + (*nr), + major(p->fcr_device), minor(p->fcr_device), + (long long)BTOBBT(p->fcr_physical), + (long long)BTOBBT(p->fcr_physical + p->fcr_length - 1)); + printf("%llu,%lld\n", + (unsigned long long)p->fcr_owners, + (long long)BTOBBT(p->fcr_length)); + } + + (*nr) += head->fch_entries; +} + +/* + * Verbose mode displays: + * extent: major:minor [startblock..endblock]: owners \ + * ag# (agoffset..agendoffset) totalbbs flags + */ +#define MINRANGE_WIDTH 16 +#define MINAG_WIDTH 2 +#define MINTOT_WIDTH 5 +#define NFLG 4 /* count of flags */ +#define FLG_NULL 00000 /* Null flag */ +#define FLG_BSU 01000 /* Not on begin of stripe unit */ +#define FLG_ESU 00100 /* Not on end of stripe unit */ +#define FLG_BSW 00010 /* Not on begin of stripe width */ +#define FLG_ESW 00001 /* Not on end of stripe width */ +static void +dump_refcounts_verbose( + unsigned long long *nr, + const unsigned long long min_owners, + const unsigned long long max_owners, + struct xfs_getfsrefs_head *head, + bool *dumped_flags, + struct xfs_fsop_geom *fsgeo) +{ + unsigned long long i; + struct xfs_getfsrefs *p; + int agno; + off_t agoff, bperag; + int boff_w, aoff_w, tot_w, agno_w, own_w; + int nr_w, dev_w; + char bbuf[40], abuf[40], obuf[40]; + char nbuf[40], dbuf[40], gbuf[40]; + int sunit, swidth; + int flg = 0; + + boff_w = aoff_w = own_w = MINRANGE_WIDTH; + dev_w = 3; + nr_w = 4; + tot_w = MINTOT_WIDTH; + bperag = (off_t)fsgeo->agblocks * (off_t)fsgeo->blocksize; + sunit = (fsgeo->sunit * fsgeo->blocksize); + swidth = (fsgeo->swidth * fsgeo->blocksize); + + /* + * Go through the extents and figure out the width + * needed for all columns. 
+ */ + for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) { + if (p->fcr_owners < min_owners || p->fcr_owners > max_owners) + continue; + if (sunit && + (p->fcr_physical % sunit != 0 || + ((p->fcr_physical + p->fcr_length) % sunit) != 0 || + p->fcr_physical % swidth != 0 || + ((p->fcr_physical + p->fcr_length) % swidth) != 0)) + flg = 1; + if (flg) + *dumped_flags = true; + snprintf(nbuf, sizeof(nbuf), "%llu", (*nr) + i); + nr_w = max(nr_w, strlen(nbuf)); + if (head->fch_oflags & FCH_OF_DEV_T) + snprintf(dbuf, sizeof(dbuf), "%u:%u", + major(p->fcr_device), + minor(p->fcr_device)); + else + snprintf(dbuf, sizeof(dbuf), "0x%x", p->fcr_device); + dev_w = max(dev_w, strlen(dbuf)); + snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:", + (long long)BTOBBT(p->fcr_physical), + (long long)BTOBBT(p->fcr_physical + p->fcr_length - 1)); + boff_w = max(boff_w, strlen(bbuf)); + snprintf(obuf, sizeof(obuf), "%llu", + (unsigned long long)p->fcr_owners); + own_w = max(own_w, strlen(obuf)); + if (p->fcr_device == xfs_data_dev) { + agno = p->fcr_physical / bperag; + agoff = p->fcr_physical - (agno * bperag); + snprintf(abuf, sizeof(abuf), + "(%lld..%lld)", + (long long)BTOBBT(agoff), + (long long)BTOBBT(agoff + p->fcr_length - 1)); + } else + abuf[0] = 0; + aoff_w = max(aoff_w, strlen(abuf)); + tot_w = max(tot_w, + numlen(BTOBBT(p->fcr_length), 10)); + } + agno_w = max(MINAG_WIDTH, numlen(fsgeo->agcount, 10)); + if (*nr == 0) + printf("%*s: %-*s %-*s %-*s %*s %-*s %*s%s\n", + nr_w, _("EXT"), + dev_w, _("DEV"), + boff_w, _("BLOCK-RANGE"), + own_w, _("OWNERS"), + agno_w, _("AG"), + aoff_w, _("AG-OFFSET"), + tot_w, _("TOTAL"), + flg ? _(" FLAGS") : ""); + for (i = 0, p = head->fch_recs; i < head->fch_entries; i++, p++) { + if (p->fcr_owners < min_owners || p->fcr_owners > max_owners) + continue; + flg = FLG_NULL; + /* + * If striping enabled, determine if extent starts/ends + * on a stripe unit boundary. 
+ */ + if (sunit) { + if (p->fcr_physical % sunit != 0) + flg |= FLG_BSU; + if (((p->fcr_physical + + p->fcr_length ) % sunit ) != 0) + flg |= FLG_ESU; + if (p->fcr_physical % swidth != 0) + flg |= FLG_BSW; + if (((p->fcr_physical + + p->fcr_length ) % swidth ) != 0) + flg |= FLG_ESW; + } + if (head->fch_oflags & FCH_OF_DEV_T) + snprintf(dbuf, sizeof(dbuf), "%u:%u", + major(p->fcr_device), + minor(p->fcr_device)); + else + snprintf(dbuf, sizeof(dbuf), "0x%x", p->fcr_device); + snprintf(bbuf, sizeof(bbuf), "[%lld..%lld]:", + (long long)BTOBBT(p->fcr_physical), + (long long)BTOBBT(p->fcr_physical + p->fcr_length - 1)); + snprintf(obuf, sizeof(obuf), "%llu", + (unsigned long long)p->fcr_owners); + if (p->fcr_device == xfs_data_dev) { + agno = p->fcr_physical / bperag; + agoff = p->fcr_physical - (agno * bperag); + snprintf(abuf, sizeof(abuf), + "(%lld..%lld)", + (long long)BTOBBT(agoff), + (long long)BTOBBT(agoff + p->fcr_length - 1)); + snprintf(gbuf, sizeof(gbuf), + "%lld", + (long long)agno); + } else { + abuf[0] = 0; + gbuf[0] = 0; + } + printf("%*llu: %-*s %-*s %-*s %-*s %-*s %*lld", + nr_w, (*nr) + i, dev_w, dbuf, boff_w, bbuf, own_w, + obuf, agno_w, gbuf, aoff_w, abuf, tot_w, + (long long)BTOBBT(p->fcr_length)); + if (flg == FLG_NULL) + printf("\n"); + else + printf(" %-*.*o\n", NFLG, NFLG, flg); + } + + (*nr) += head->fch_entries; +} + +static void +dump_verbose_key(void) +{ + printf(_(" FLAG Values:\n")); + printf(_(" %*.*o Doesn't begin on stripe unit\n"), + NFLG+1, NFLG+1, FLG_BSU); + printf(_(" %*.*o Doesn't end on stripe unit\n"), + NFLG+1, NFLG+1, FLG_ESU); + printf(_(" %*.*o Doesn't begin on stripe width\n"), + NFLG+1, NFLG+1, FLG_BSW); + printf(_(" %*.*o Doesn't end on stripe width\n"), + NFLG+1, NFLG+1, FLG_ESW); +} + +static int +fsrefcounts_f( + int argc, + char **argv) +{ + struct xfs_getfsrefs *p; + struct xfs_getfsrefs_head *head; + struct xfs_getfsrefs *l, *h; + struct xfs_fsop_geom fsgeo; + long long start = 0; + long long end = -1; + unsigned 
long long min_owners = 1; + unsigned long long max_owners = ULLONG_MAX; + int map_size; + int nflag = 0; + int vflag = 0; + int mflag = 0; + int i = 0; + int c; + unsigned long long nr = 0; + size_t fsblocksize, fssectsize; + struct fs_path *fs; + static bool tab_init; + bool dumped_flags = false; + int dflag, lflag, rflag; + + init_cvtnum(&fsblocksize, &fssectsize); + + dflag = lflag = rflag = 0; + while ((c = getopt(argc, argv, "dlmn:o:O:rv")) != EOF) { + switch (c) { + case 'd': /* data device */ + dflag = 1; + break; + case 'l': /* log device */ + lflag = 1; + break; + case 'm': /* machine readable format */ + mflag++; + break; + case 'n': /* number of extents specified */ + nflag = cvt_u32(optarg, 10); + if (errno) + return command_usage(&fsrefcounts_cmd); + break; + case 'o': /* minimum owners */ + min_owners = cvt_u64(optarg, 10); + if (errno) + return command_usage(&fsrefcounts_cmd); + if (min_owners < 1) { + fprintf(stderr, + _("min_owners must be greater than zero.\n")); + exitcode = 1; + return 0; + } + break; + case 'O': /* maximum owners */ + max_owners = cvt_u64(optarg, 10); + if (errno) + return command_usage(&fsrefcounts_cmd); + if (max_owners < 1) { + fprintf(stderr, + _("max_owners must be greater than zero.\n")); + exitcode = 1; + return 0; + } + break; + case 'r': /* rt device */ + rflag = 1; + break; + case 'v': /* Verbose output */ + vflag++; + break; + default: + exitcode = 1; + return command_usage(&fsrefcounts_cmd); + } + } + + if ((dflag + lflag + rflag > 1) || (mflag > 0 && vflag > 0) || + (argc > optind && dflag + lflag + rflag == 0)) { + exitcode = 1; + return command_usage(&fsrefcounts_cmd); + } + + if (argc > optind) { + start = cvtnum(fsblocksize, fssectsize, argv[optind]); + if (start < 0) { + fprintf(stderr, + _("Bad refcount start_bblock %s.\n"), + argv[optind]); + exitcode = 1; + return 0; + } + start <<= BBSHIFT; + } + + if (argc > optind + 1) { + end = cvtnum(fsblocksize, fssectsize, argv[optind + 1]); + if (end < 0) { + 
fprintf(stderr, + _("Bad refcount end_bblock %s.\n"), + argv[optind + 1]); + exitcode = 1; + return 0; + } + end <<= BBSHIFT; + } + + if (vflag) { + c = -xfrog_geometry(file->fd, &fsgeo); + if (c) { + fprintf(stderr, + _("%s: can't get geometry [\"%s\"]: %s\n"), + progname, file->name, strerror(c)); + exitcode = 1; + return 0; + } + } + + map_size = nflag ? nflag : 131072 / sizeof(struct xfs_getfsrefs); + head = malloc(xfs_getfsrefs_sizeof(map_size)); + if (head == NULL) { + fprintf(stderr, _("%s: malloc of %llu bytes failed.\n"), + progname, + (unsigned long long)xfs_getfsrefs_sizeof(map_size)); + exitcode = 1; + return 0; + } + + memset(head, 0, sizeof(*head)); + l = head->fch_keys; + h = head->fch_keys + 1; + if (dflag) { + l->fcr_device = h->fcr_device = file->fs_path.fs_datadev; + } else if (lflag) { + l->fcr_device = h->fcr_device = file->fs_path.fs_logdev; + } else if (rflag) { + l->fcr_device = h->fcr_device = file->fs_path.fs_rtdev; + } else { + l->fcr_device = 0; + h->fcr_device = UINT_MAX; + } + l->fcr_physical = start; + h->fcr_physical = end; + h->fcr_owners = ULLONG_MAX; + h->fcr_flags = UINT_MAX; + + /* + * If this is an XFS filesystem, remember the data device. + * (We report AG number/block for data device extents on XFS). + */ + if (!tab_init) { + fs_table_initialise(0, NULL, 0, NULL); + tab_init = true; + } + fs = fs_table_lookup(file->name, FS_MOUNT_POINT); + xfs_data_dev = fs ? 
fs->fs_datadev : 0; + + head->fch_count = map_size; + do { + /* Get some extents */ + i = ioctl(file->fd, XFS_IOC_GETFSREFCOUNTS, head); + if (i < 0) { + fprintf(stderr, _("%s: xfsctl(XFS_IOC_GETFSREFCOUNTS)" + " iflags=0x%x [\"%s\"]: %s\n"), + progname, head->fch_iflags, file->name, + strerror(errno)); + free(head); + exitcode = 1; + return 0; + } + + if (head->fch_entries == 0) + break; + + if (vflag) + dump_refcounts_verbose(&nr, min_owners, max_owners, + head, &dumped_flags, &fsgeo); + else if (mflag) + dump_refcounts_machine(&nr, min_owners, max_owners, + head); + else + dump_refcounts(&nr, min_owners, max_owners, head); + + p = &head->fch_recs[head->fch_entries - 1]; + if (p->fcr_flags & FCR_OF_LAST) + break; + xfs_getfsrefs_advance(head); + } while (true); + + if (dumped_flags) + dump_verbose_key(); + + free(head); + return 0; +} + +void +fsrefcounts_init(void) +{ + fsrefcounts_cmd.name = "fsrefcounts"; + fsrefcounts_cmd.cfunc = fsrefcounts_f; + fsrefcounts_cmd.argmin = 0; + fsrefcounts_cmd.argmax = -1; + fsrefcounts_cmd.flags = CMD_NOMAP_OK | CMD_FLAG_FOREIGN_OK; + fsrefcounts_cmd.args = _("[-d|-l|-r] [-m|-v] [-n nx] [start] [end]"); + fsrefcounts_cmd.oneline = _("print filesystem owner counts for a range of blocks"); + fsrefcounts_cmd.help = fsrefcounts_help; + + add_command(&fsrefcounts_cmd); +} diff --git a/io/init.c b/io/init.c index 4831deae1b2683..17b772813bc113 100644 --- a/io/init.c +++ b/io/init.c @@ -58,6 +58,7 @@ init_commands(void) freeze_init(); fsmap_init(); fsuuid_init(); + fsrefcounts_init(); fsync_init(); getrusage_init(); help_init(); diff --git a/io/io.h b/io/io.h index d99065582057de..7ae7cf90ace323 100644 --- a/io/io.h +++ b/io/io.h @@ -156,3 +156,4 @@ extern void bulkstat_init(void); void exchangerange_init(void); void fsprops_init(void); void aginfo_init(void); +void fsrefcounts_init(void); diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8 index a42ab61a0de422..37ad497c771051 100644 --- a/man/man8/xfs_io.8 +++ b/man/man8/xfs_io.8 @@ 
-1325,6 +1325,94 @@ .SH FILESYSTEM COMMANDS
 .B thaw
 Undo the effects of a filesystem freeze operation.
 Only available in expert mode and requires privileges.
+
+.TP
+.BI "fsrefcounts [ \-d | \-l | \-r ] [ \-m | \-v ] [ \-n " nx " ] [ \-o " min_owners " ] [ \-O " max_owners " ] [ " start " ] [ " end " ]
+Prints the number of owners of disk extents used by the filesystem hosting the
+current file.
+The listing does not include free blocks.
+Each line of the listing takes the following form:
+.PP
+.RS
+.IR extent ": " major ":" minor " [" startblock .. endblock "]: " owners " " length
+.PP
+All blocks, offsets, and lengths are specified in units of 512-byte
+blocks, no matter what the filesystem's block size is.
+The optional
+.I start
+and
+.I end
+arguments can be used to constrain the output to a particular range of
+disk blocks.
+If these two options are specified, exactly one of
+.BR "-d" ", " "-l" ", or " "-r"
+must also be set.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI \-d
+Display only extents from the data device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-l
+Display only extents from the external log device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-r
+Display only extents from the realtime device.
+This option only applies to XFS filesystems.
+.TP
+.BI \-m
+Display results in a machine readable format (CSV).
+This option is not compatible with the
+.B \-v
+flag.
+The columns of the output are: extent number, device major, device minor,
+physical start, physical end, number of owners, length.
+The start, end, and length numbers are provided in units of 512 bytes.
+
+.TP
+.BI \-n " num_extents"
+If this option is given,
+.B fsrefcounts
+obtains the extent list of the filesystem in groups of
+.I num_extents
+extents.
+In the absence of
+.BR "-n" ", " "fsrefcounts"
+queries the system for as many records as fit in a 131,072-byte buffer.
+.TP
+.BI \-o " min_owners"
+Only print extents having at least this many owners.
+This argument must be in the range 1 to 2^64-1.
+The default value is 1.
+.TP
+.BI \-O " max_owners"
+Only print extents having this many or fewer owners.
+This argument must be in the range 1 to 2^64-1.
+There is no limit by default.
+.TP
+.B \-v
+Shows verbose information.
+When this flag is specified, additional AG-specific information is
+appended to each line in the following form:
+.IP
+.RS 1.2i
+.IR agno " (" startagblock .. endagblock ") " nblocks " " flags
+.RE
+.IP
+A second
+.B \-v
+option will print out the
+.I flags
+legend.
+This option is not compatible with the
+.B \-m
+flag.
+.RE
+.PD
+
 .TP
 .BI "inject [ " tag " ]"
 Inject errors into a filesystem to observe filesystem behavior at
^ permalink raw reply related [flat|nested] 110+ messages in thread
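As a reading aid for the verbose-mode flags column, here is a minimal sketch of how the four octal flag digits are derived from an extent and the stripe geometry. The FLG_* values are copied from the patch; the helper name stripe_flags() is hypothetical, and the extra guard against a zero stripe width is an assumption of this sketch rather than something the patch does.

```c
#include <stdint.h>

/* Flag values as defined in io/fsrefcounts.c in the patch (octal digits). */
#define FLG_NULL	00000	/* Null flag */
#define FLG_BSU		01000	/* Not on begin of stripe unit */
#define FLG_ESU		00100	/* Not on end of stripe unit */
#define FLG_BSW		00010	/* Not on begin of stripe width */
#define FLG_ESW		00001	/* Not on end of stripe width */

/*
 * Hypothetical helper: compute the alignment flags for one extent.
 * All arguments are in bytes.  Unlike the patch, this sketch skips the
 * stripe width checks when swidth is zero to avoid dividing by zero.
 */
static int
stripe_flags(
	uint64_t	physical,
	uint64_t	length,
	uint64_t	sunit,
	uint64_t	swidth)
{
	int		flg = FLG_NULL;

	/* No striping configured: nothing to report. */
	if (!sunit)
		return flg;

	if (physical % sunit != 0)
		flg |= FLG_BSU;
	if ((physical + length) % sunit != 0)
		flg |= FLG_ESU;
	if (swidth && physical % swidth != 0)
		flg |= FLG_BSW;
	if (swidth && (physical + length) % swidth != 0)
		flg |= FLG_ESW;
	return flg;
}
```

With a 64KiB stripe unit and a 256KiB stripe width, an extent that covers exactly the second stripe unit is aligned to the unit but not to the width, so only the two low digits are set.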
* [PATCHSET 3/5] xfsprogs: defragment free space
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (6 preceding siblings ...)
2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
2024-12-31 23:45 ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
` (10 more replies)
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
` (7 subsequent siblings)
15 siblings, 11 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: dchinner, linux-xfs

Hi all,

These patches contain experimental code to enable userspace to defragment
the free space in a filesystem. Two purposes are imagined for this
functionality: clearing space at the end of a filesystem before
shrinking it, and clearing free space in anticipation of making a large
allocation.

The first patch adds a new fallocate mode that allows userspace to
allocate free space from the filesystem into a file. The goal here is
to allow the filesystem shrink process to prevent allocation from a
certain part of the filesystem while a free space defragmenter evacuates
all the files from the doomed part of the filesystem.

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

This patchset also includes a separate inode migration tool as
prototyped by Dave Chinner in 2020.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs_io: display rtgroup number in verbose fsrefs output
 * xfs: add an ioctl to map free space into a file
 * xfs_io: support using XFS_IOC_MAP_FREESP to map free space
 * xfs_db: get and put blocks on the AGFL
 * xfs_spaceman: implement clearing free space
 * spaceman: physically move a regular inode
 * spaceman: find owners of space in an AG
 * xfs_spaceman: wrap radix tree accesses in find_owner.c
 * xfs_spaceman: port relocation structure to 32-bit systems
 * spaceman: relocate the contents of an AG
 * spaceman: move inodes with hardlinks
---
 configure.ac | 1
 db/agfl.c | 297 +++-
 include/builddefs.in | 1
 include/xfs_trace.h | 4
 io/fsrefcounts.c | 22
 io/prealloc.c | 35
 libfrog/Makefile | 5
 libfrog/clearspace.c | 3294 +++++++++++++++++++++++++++++++++++++++
 libfrog/clearspace.h | 79 +
 libfrog/fsgeom.h | 29
 libfrog/radix-tree.c | 2
 libfrog/radix-tree.h | 2
 libxfs/libxfs_api_defs.h | 4
 libxfs/libxfs_priv.h | 9
 libxfs/xfs_alloc.c | 88 +
 libxfs/xfs_alloc.h | 3
 libxfs/xfs_fs.h | 14
 m4/package_libcdev.m4 | 20
 man/man2/ioctl_xfs_map_freesp.2 | 76 +
 man/man8/xfs_db.8 | 11
 man/man8/xfs_io.8 | 8
 man/man8/xfs_spaceman.8 | 40
 spaceman/Makefile | 11
 spaceman/clearfree.c | 171 ++
 spaceman/find_owner.c | 442 +++++
 spaceman/init.c | 7
 spaceman/move_inode.c | 662 ++++++++
 spaceman/relocation.c | 566 +++++++
 spaceman/relocation.h | 53 +
 spaceman/space.h | 6
 30 files changed, 5953 insertions(+), 9 deletions(-)
 create mode 100644 libfrog/clearspace.c
 create mode 100644 libfrog/clearspace.h
 create mode 100644 man/man2/ioctl_xfs_map_freesp.2
 create mode 100644 spaceman/clearfree.c
 create mode 100644 spaceman/find_owner.c
 create mode 100644 spaceman/move_inode.c
 create mode 100644 spaceman/relocation.c
 create mode 100644 spaceman/relocation.h
^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
@ 2024-12-31 23:45 ` Darrick J. Wong
2024-12-31 23:45 ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong
` (9 subsequent siblings)
10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Display the rtgroup number in the verbose fsrefcounts output.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 io/fsrefcounts.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/io/fsrefcounts.c b/io/fsrefcounts.c
index ad1f26dfde3ec3..9127f536da382e 100644
--- a/io/fsrefcounts.c
+++ b/io/fsrefcounts.c
@@ -13,6 +13,7 @@
 
 static cmdinfo_t fsrefcounts_cmd;
 static dev_t xfs_data_dev;
+static dev_t xfs_rt_dev;
 
 static void
 fsrefcounts_help(void)
@@ -119,7 +120,7 @@ dump_refcounts_verbose(
 	unsigned long long i;
 	struct xfs_getfsrefs *p;
 	int agno;
-	off_t agoff, bperag;
+	off_t agoff, bperag, bperrtg;
 	int boff_w, aoff_w, tot_w, agno_w, own_w;
 	int nr_w, dev_w;
 	char bbuf[40], abuf[40], obuf[40];
@@ -132,6 +133,7 @@ dump_refcounts_verbose(
 	nr_w = 4;
 	tot_w = MINTOT_WIDTH;
 	bperag = (off_t)fsgeo->agblocks * (off_t)fsgeo->blocksize;
+	bperrtg = bytes_per_rtgroup(fsgeo);
 	sunit = (fsgeo->sunit * fsgeo->blocksize);
 	swidth = (fsgeo->swidth * fsgeo->blocksize);
 
@@ -173,6 +175,13 @@ dump_refcounts_verbose(
 				"(%lld..%lld)",
 				(long long)BTOBBT(agoff),
 				(long long)BTOBBT(agoff + p->fcr_length - 1));
+		} else if (p->fcr_device == xfs_rt_dev && fsgeo->rgcount > 0) {
+			agno = p->fcr_physical / bperrtg;
+			agoff = p->fcr_physical - (agno * bperrtg);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
 		} else
 			abuf[0] = 0;
 		aoff_w = max(aoff_w, strlen(abuf));
@@ -231,6 +240,16 @@ dump_refcounts_verbose(
 			snprintf(gbuf, sizeof(gbuf),
 				"%lld",
 				(long long)agno);
+		} else if (p->fcr_device == xfs_rt_dev && fsgeo->rgcount > 0) {
+			agno = p->fcr_physical / bperrtg;
+			agoff = p->fcr_physical - (agno * bperrtg);
+			snprintf(abuf, sizeof(abuf),
+				"(%lld..%lld)",
+				(long long)BTOBBT(agoff),
+				(long long)BTOBBT(agoff + p->fcr_length - 1));
+			snprintf(gbuf, sizeof(gbuf),
+				"%lld",
+				(long long)agno);
 		} else {
 			abuf[0] = 0;
 			gbuf[0] = 0;
@@ -420,6 +439,7 @@ fsrefcounts_f(
 	}
 	fs = fs_table_lookup(file->name, FS_MOUNT_POINT);
 	xfs_data_dev = fs ? fs->fs_datadev : 0;
+	xfs_rt_dev = fs ? fs->fs_rtdev : 0;
 
 	head->fch_count = map_size;
 	do {
^ permalink raw reply related [flat|nested] 110+ messages in thread
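The group number and in-group offset arithmetic that this patch applies to rtgroups (and that the existing code applies to AGs) can be sketched in isolation as follows. The struct and function names here are hypothetical and exist only for illustration; the zero-size guard is an assumption of the sketch, and bytes_to_bb() stands in for the BTOBBT() macro, assuming 512-byte basic blocks.

```c
#include <stdint.h>

/* Hypothetical result type: which group an extent starts in, and where. */
struct group_loc {
	uint64_t	groupno;	/* AG or rtgroup number */
	uint64_t	offset;		/* byte offset within that group */
};

/*
 * Split a physical byte address into (group number, offset in group),
 * mirroring "agno = physical / bperrtg; agoff = physical - agno * bperrtg"
 * from the patch.  A zero group size yields group 0 (sketch assumption).
 */
static struct group_loc
locate_in_group(
	uint64_t		physical,
	uint64_t		bytes_per_group)
{
	struct group_loc	loc = { 0, physical };

	if (bytes_per_group == 0)
		return loc;

	loc.groupno = physical / bytes_per_group;
	loc.offset = physical - loc.groupno * bytes_per_group;
	return loc;
}

/* Bytes to 512-byte basic blocks, truncating, as BTOBBT() is assumed to do. */
static uint64_t
bytes_to_bb(
	uint64_t	bytes)
{
	return bytes >> 9;
}
```

For example, with 1GiB groups, a physical address 8KiB past the start of the fourth group maps to group 3 at basic block offset 16.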
* [PATCH 02/11] xfs: add an ioctl to map free space into a file 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong 2024-12-31 23:45 ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong @ 2024-12-31 23:45 ` Darrick J. Wong 2024-12-31 23:45 ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong ` (8 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a new ioctl to map free physical space into a file, at the same file offset as if the file were a sparse image of the physical device backing the filesystem. The intent here is to use this to prototype a free space defragmentation tool. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- include/xfs_trace.h | 4 ++ libxfs/libxfs_priv.h | 9 ++++ libxfs/xfs_alloc.c | 88 +++++++++++++++++++++++++++++++++++++++ libxfs/xfs_alloc.h | 3 + libxfs/xfs_fs.h | 14 ++++++ man/man2/ioctl_xfs_map_freesp.2 | 76 ++++++++++++++++++++++++++++++++++ 6 files changed, 194 insertions(+) create mode 100644 man/man2/ioctl_xfs_map_freesp.2 diff --git a/include/xfs_trace.h b/include/xfs_trace.h index 7778366c5e3319..178497c8770d37 100644 --- a/include/xfs_trace.h +++ b/include/xfs_trace.h @@ -26,6 +26,8 @@ #define trace_xfs_alloc_exact_done(a) ((void) 0) #define trace_xfs_alloc_exact_notfound(a) ((void) 0) #define trace_xfs_alloc_exact_error(a) ((void) 0) +#define trace_xfs_alloc_find_freesp(...) ((void) 0) +#define trace_xfs_alloc_find_freesp_done(...) ((void) 0) #define trace_xfs_alloc_near_first(a) ((void) 0) #define trace_xfs_alloc_near_greater(a) ((void) 0) #define trace_xfs_alloc_near_lesser(a) ((void) 0) @@ -197,6 +199,8 @@ #define trace_xfs_bmap_pre_update(a,b,c,d) ((void) 0) #define trace_xfs_bmap_post_update(a,b,c,d) ((void) 0) +#define trace_xfs_bmapi_freesp(...) 
((void) 0) +#define trace_xfs_bmapi_freesp_done(...) ((void) 0) #define trace_xfs_bunmap(a,b,c,d,e) ((void) 0) #define trace_xfs_read_extent(a,b,c,d) ((void) 0) diff --git a/libxfs/libxfs_priv.h b/libxfs/libxfs_priv.h index ac2f64a9a75d82..932a45d734d460 100644 --- a/libxfs/libxfs_priv.h +++ b/libxfs/libxfs_priv.h @@ -446,6 +446,15 @@ xfs_buf_readahead( #define xfs_filestream_new_ag(ip,ag) (0) #define xfs_filestream_select_ag(...) (-ENOSYS) +struct xfs_trans; + +static inline int +xfs_rtallocate_extent(struct xfs_trans *tp, xfs_rtxnum_t start, + xfs_rtxlen_t maxlen, xfs_rtxlen_t *len, xfs_rtxnum_t *rtx) +{ + return -EOPNOTSUPP; +} + #define xfs_trans_inode_buf(tp, bp) ((void) 0) /* quota bits */ diff --git a/libxfs/xfs_alloc.c b/libxfs/xfs_alloc.c index 9aebe7227a6148..e21b694420e309 100644 --- a/libxfs/xfs_alloc.c +++ b/libxfs/xfs_alloc.c @@ -4164,3 +4164,91 @@ xfs_extfree_intent_destroy_cache(void) kmem_cache_destroy(xfs_extfree_item_cache); xfs_extfree_item_cache = NULL; } + +/* + * Find the next chunk of free space in @pag starting at @agbno and going no + * higher than @end_agbno. Set @agbno and @len to whatever free space we find, + * or to @end_agbno if we find no space. + */ +int +xfs_alloc_find_freesp( + struct xfs_trans *tp, + struct xfs_perag *pag, + xfs_agblock_t *agbno, + xfs_agblock_t end_agbno, + xfs_extlen_t *len) +{ + struct xfs_mount *mp = pag_mount(pag); + struct xfs_btree_cur *cur; + struct xfs_buf *agf_bp = NULL; + xfs_agblock_t found_agbno; + xfs_extlen_t found_len; + int found; + int error; + + trace_xfs_alloc_find_freesp(pag_group(pag), *agbno, + end_agbno - *agbno); + + error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp); + if (error) + return error; + + cur = xfs_bnobt_init_cursor(mp, tp, agf_bp, pag); + + /* Try to find a free extent that starts before here. 
*/ + error = xfs_alloc_lookup_le(cur, *agbno, 0, &found); + if (error) + goto out_cur; + if (found) { + error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, + &found); + if (error) + goto out_cur; + if (XFS_IS_CORRUPT(mp, !found)) { + xfs_btree_mark_sick(cur); + error = -EFSCORRUPTED; + goto out_cur; + } + + if (found_agbno + found_len > *agbno) + goto found; + } + + /* Examine the next record if free extent not in range. */ + error = xfs_btree_increment(cur, 0, &found); + if (error) + goto out_cur; + if (!found) + goto next_ag; + + error = xfs_alloc_get_rec(cur, &found_agbno, &found_len, &found); + if (error) + goto out_cur; + if (XFS_IS_CORRUPT(mp, !found)) { + xfs_btree_mark_sick(cur); + error = -EFSCORRUPTED; + goto out_cur; + } + + if (found_agbno >= end_agbno) + goto next_ag; + +found: + /* Found something, so update the mapping. */ + trace_xfs_alloc_find_freesp_done(pag_group(pag), found_agbno, + found_len); + if (found_agbno < *agbno) { + found_len -= *agbno - found_agbno; + found_agbno = *agbno; + } + *len = found_len; + *agbno = found_agbno; + goto out_cur; +next_ag: + /* Found nothing, so advance the cursor beyond the end of the range. 
*/ + *agbno = end_agbno; + *len = 0; +out_cur: + xfs_btree_del_cursor(cur, error); + return error; +} diff --git a/libxfs/xfs_alloc.h b/libxfs/xfs_alloc.h index 50ef79a1ed41a1..069077d9ad2f8c 100644 --- a/libxfs/xfs_alloc.h +++ b/libxfs/xfs_alloc.h @@ -286,5 +286,8 @@ void xfs_extfree_intent_destroy_cache(void); xfs_failaddr_t xfs_validate_ag_length(struct xfs_buf *bp, uint32_t seqno, uint32_t length); +int xfs_alloc_find_freesp(struct xfs_trans *tp, struct xfs_perag *pag, + xfs_agblock_t *agbno, xfs_agblock_t end_agbno, + xfs_extlen_t *len); #endif /* __XFS_ALLOC_H__ */ diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h index 936f719236944f..f4128dbdf3b9a2 100644 --- a/libxfs/xfs_fs.h +++ b/libxfs/xfs_fs.h @@ -1087,6 +1087,19 @@ xfs_getfsrefs_advance( /* fcr_flags values - returned for each non-header segment */ #define FCR_OF_LAST (1U << 0) /* last record in the dataset */ +/* map free space to file */ + +/* + * XFS_IOC_MAP_FREESP maps all the free physical space in the filesystem into + * the file at the same offsets. This ioctl requires CAP_SYS_ADMIN. + */ +struct xfs_map_freesp { + __s64 offset; /* disk address to map, in bytes */ + __s64 len; /* length in bytes */ + __u64 flags; /* must be zero */ + __u64 pad; /* must be zero */ +}; + /* * ioctl commands that are used by Linux filesystems */ @@ -1127,6 +1140,7 @@ xfs_getfsrefs_advance( #define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head) #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry) #define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) +#define XFS_IOC_MAP_FREESP _IOW ('X', 67, struct xfs_map_freesp) /* * ioctl commands that replace IRIX syssgi()'s diff --git a/man/man2/ioctl_xfs_map_freesp.2 b/man/man2/ioctl_xfs_map_freesp.2 new file mode 100644 index 00000000000000..ecd2d08f3fdeee --- /dev/null +++ b/man/man2/ioctl_xfs_map_freesp.2 @@ -0,0 +1,76 @@ +.\" Copyright (c) 2023-2025 Oracle. All rights reserved. 
+.\"
+.\" %%%LICENSE_START(GPLv2+_DOC_FULL)
+.\" SPDX-License-Identifier: GPL-2.0-or-later
+.\" %%%LICENSE_END
+.TH IOCTL-XFS-MAP-FREESP 2 2023-11-17 "XFS"
+.SH NAME
+ioctl_xfs_map_freesp \- map free space into a file
+.SH SYNOPSIS
+.br
+.B #include <xfs/xfs_fs.h>
+.PP
+.BI "int ioctl(int " fd ", XFS_IOC_MAP_FREESP, struct xfs_map_freesp *" arg );
+.SH DESCRIPTION
+Maps free space into the sparse ranges of a regular file.
+This ioctl uses
+.B struct xfs_map_freesp
+to specify the range of free space to be mapped:
+.PP
+.in +4n
+.nf
+struct xfs_map_freesp {
+	__s64 offset;
+	__s64 len;
+	__u64 flags;
+	__u64 pad;
+};
+.fi
+.in
+.PP
+.I offset
+is the physical disk address, in bytes, of the start of the range to scan.
+Each free space extent in this range will be mapped to the file if the
+corresponding range of the file is sparse.
+.PP
+.I len
+is the number of bytes in the range to scan.
+.PP
+.I flags
+must be zero; there are no flags defined yet.
+.PP
+.I pad
+must be zero.
+.SH RETURN VALUE
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.PP
+.SH ERRORS
+Error codes can be one of, but are not limited to, the following:
+.TP
+.B EFAULT
+The kernel was not able to copy the argument from the userspace buffer.
+.TP
+.B EFSBADCRC
+Metadata checksum validation failed while performing the query.
+.TP
+.B EFSCORRUPTED
+Metadata corruption was encountered while performing the query.
+.TP
+.B EINVAL
+One of the arguments was not valid,
+or the file was not sparse.
+.TP
+.B EIO
+An I/O error was encountered while performing the query.
+.TP
+.B ENOMEM
+There was insufficient memory to perform the query.
+.TP
+.B ENOSPC
+There was insufficient disk space to commit the space mappings.
+.SH CONFORMING TO
+This API is specific to the XFS filesystem on the Linux kernel.
+.SH SEE ALSO
+.BR ioctl (2)
^ permalink raw reply related [flat|nested] 110+ messages in thread
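To illustrate the search that xfs_alloc_find_freesp() performs, the following sketch runs the same decision logic over a plain sorted array instead of the by-block free space btree. Everything here is hypothetical scaffolding (struct free_extent, find_freesp(), the linear scan); only the outcome rules are taken from the patch: use a record that covers the starting block, otherwise take the next record if it begins below the end of the range, trim the front of the result to the starting block, and report (end_agbno, 0) when nothing is found.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one by-block free space btree record. */
struct free_extent {
	uint32_t	start;	/* first free block */
	uint32_t	len;	/* number of free blocks */
};

/*
 * Find the next free extent at or after *agbno and below end_agbno.
 * recs[] must be sorted by start and non-overlapping.  On return,
 * *agbno and *len describe the free space found, or *agbno is set to
 * end_agbno and *len to zero if there is none.  As in the patch, the
 * returned length is not clamped to end_agbno.
 */
static void
find_freesp(
	const struct free_extent	*recs,
	size_t				nr,
	uint32_t			*agbno,
	uint32_t			end_agbno,
	uint32_t			*len)
{
	size_t				i;

	for (i = 0; i < nr; i++) {
		uint32_t		rec_end = recs[i].start + recs[i].len;

		/* Record ends at or before the search start; keep looking. */
		if (rec_end <= *agbno)
			continue;

		/* Next candidate lies beyond the requested range. */
		if (recs[i].start >= end_agbno)
			break;

		if (recs[i].start < *agbno) {
			/* Record straddles the start; trim the front. */
			*len = rec_end - *agbno;
		} else {
			*len = recs[i].len;
			*agbno = recs[i].start;
		}
		return;
	}

	/* Found nothing, so advance to the end of the range. */
	*agbno = end_agbno;
	*len = 0;
}
```

The real function gets the same answer with one less-than-or-equal btree lookup followed by at most one cursor increment, which is why it needs no loop.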
* [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong 2024-12-31 23:45 ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong 2024-12-31 23:45 ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong @ 2024-12-31 23:45 ` Darrick J. Wong 2024-12-31 23:45 ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong ` (7 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a command to call XFS_IOC_MAP_FREESP. This is experimental code to see if we can build a free space defragmenter out of this. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- io/prealloc.c | 35 +++++++++++++++++++++++++++++++++++ man/man8/xfs_io.8 | 8 +++++++- 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/io/prealloc.c b/io/prealloc.c index 8e968c9f2455d5..b7004697a045c5 100644 --- a/io/prealloc.c +++ b/io/prealloc.c @@ -41,6 +41,7 @@ static cmdinfo_t fcollapse_cmd; static cmdinfo_t finsert_cmd; static cmdinfo_t fzero_cmd; static cmdinfo_t funshare_cmd; +static cmdinfo_t fmapfree_cmd; static int offset_length( @@ -377,6 +378,30 @@ funshare_f( return 0; } +static int +fmapfree_f( + int argc, + char **argv) +{ + struct xfs_flock64 segment; + struct xfs_map_freesp args = { }; + + if (!offset_length(argv[1], argv[2], &segment)) { + exitcode = 1; + return 0; + } + + args.offset = segment.l_start; + args.len = segment.l_len; + + if (ioctl(file->fd, XFS_IOC_MAP_FREESP, &args)) { + perror("XFS_IOC_MAP_FREESP"); + exitcode = 1; + return 0; + } + return 0; +} + void prealloc_init(void) { @@ -489,4 +514,14 @@ prealloc_init(void) funshare_cmd.oneline = _("unshares shared blocks within the range"); add_command(&funshare_cmd); + + fmapfree_cmd.name = "fmapfree"; + 
	fmapfree_cmd.cfunc = fmapfree_f;
+	fmapfree_cmd.argmin = 2;
+	fmapfree_cmd.argmax = 2;
+	fmapfree_cmd.flags = CMD_NOMAP_OK | CMD_FOREIGN_OK;
+	fmapfree_cmd.args = _("off len");
+	fmapfree_cmd.oneline =
+	_("maps free space into a file");
+	add_command(&fmapfree_cmd);
 }
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 37ad497c771051..c4d09ce07f597b 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -519,8 +519,14 @@ .SH FILE I/O COMMANDS
 .BR fallocate (2)
 manual page to create the hole by shifting data blocks.
 .TP
+.BI fmapfree " offset length"
+Maps free physical space into the file by calling XFS_IOC_MAP_FREESP as
+described in the
+.BR ioctl_xfs_map_freesp (2)
+manual page.
+.TP
 .BI fpunch " offset length"
-Punches (de-allocates) blocks in the file by calling fallocate with
+Punches (de-allocates) blocks in the file by calling fallocate with
 the FALLOC_FL_PUNCH_HOLE flag as described in the
 .BR fallocate (2)
 manual page.
^ permalink raw reply related [flat|nested] 110+ messages in thread
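For readers who want to try the call outside xfs_io, here is a rough sketch of what fmapfree_f() boils down to. This ioctl exists only in this experimental series, so the sketch declares its own copy of the argument structure and request number, both transcribed from the libxfs/xfs_fs.h hunk in patch 02/11 and suffixed with _sketch to make clear they are not from an installed header. The wrapper name map_freesp() and its up-front rejection of negative offsets and non-positive lengths are assumptions of this sketch, not behavior documented by the patches.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Local copy of struct xfs_map_freesp from the patch, using fixed-width
 * types of the same size as the __s64/__u64 fields in the header.
 */
struct xfs_map_freesp_sketch {
	int64_t		offset;	/* disk address to map, in bytes */
	int64_t		len;	/* length in bytes */
	uint64_t	flags;	/* must be zero */
	uint64_t	pad;	/* must be zero */
};

/* Request number as defined in the patch: _IOW('X', 67, ...). */
#define XFS_IOC_MAP_FREESP_SKETCH \
	_IOW('X', 67, struct xfs_map_freesp_sketch)

/*
 * Hypothetical wrapper: map the free space found in the physical range
 * [offset, offset + len) into the sparse parts of the file behind fd.
 * Returns 0 on success or -1 with errno set.
 */
static int
map_freesp(
	int		fd,
	int64_t		offset,
	int64_t		len)
{
	struct xfs_map_freesp_sketch args = {
		.offset	= offset,
		.len	= len,
	};

	/* Sketch-only sanity check before handing off to the kernel. */
	if (offset < 0 || len <= 0) {
		errno = EINVAL;
		return -1;
	}

	return ioctl(fd, XFS_IOC_MAP_FREESP_SKETCH, &args);
}
```

On a kernel without the defrag-freespace patches the ioctl call is expected to fail; the exact errno in that case is not something this sketch can promise.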
* [PATCH 04/11] xfs_db: get and put blocks on the AGFL 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (2 preceding siblings ...) 2024-12-31 23:45 ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong @ 2024-12-31 23:45 ` Darrick J. Wong 2024-12-31 23:46 ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong ` (6 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:45 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a new xfs_db command to let people add and remove blocks from an AGFL. This isn't really related to rmap btree reconstruction, other than enabling debugging code to mess around with the AGFL to exercise various odd scenarios. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- db/agfl.c | 297 ++++++++++++++++++++++++++++++++++++++++++++++ libxfs/libxfs_api_defs.h | 4 + man/man8/xfs_db.8 | 11 ++ 3 files changed, 308 insertions(+), 4 deletions(-) diff --git a/db/agfl.c b/db/agfl.c index f0f3f21a64d12c..cf5a2407f6b6d8 100644 --- a/db/agfl.c +++ b/db/agfl.c @@ -15,13 +15,14 @@ #include "output.h" #include "init.h" #include "agfl.h" +#include "libfrog/bitmap.h" static int agfl_bno_size(void *obj, int startoff); static int agfl_f(int argc, char **argv); static void agfl_help(void); static const cmdinfo_t agfl_cmd = - { "agfl", NULL, agfl_f, 0, 1, 1, N_("[agno]"), + { "agfl", NULL, agfl_f, 0, -1, 1, N_("[agno] [-g nr] [-p nr]"), N_("set address to agfl block"), agfl_help }; const field_t agfl_hfld[] = { { @@ -77,10 +78,280 @@ agfl_help(void) " for each allocation group. 
This acts as a reserved pool of space\n" " separate from the general filesystem freespace (not used for user data).\n" "\n" +" -g quantity\tRemove this many blocks from the AGFL.\n" +" -p quantity\tAdd this many blocks to the AGFL.\n" +"\n" )); } +struct dump_info { + struct xfs_perag *pag; + bool leak; +}; + +/* Return blocks freed from the AGFL to the free space btrees. */ +static int +free_grabbed( + uint64_t start, + uint64_t length, + void *data) +{ + struct dump_info *di = data; + struct xfs_perag *pag = di->pag; + struct xfs_mount *mp = pag_mount(pag); + struct xfs_trans *tp; + struct xfs_buf *agf_bp; + int error; + + error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, + &tp); + if (error) + return error; + + error = -libxfs_alloc_read_agf(pag, tp, 0, &agf_bp); + if (error) + goto out_cancel; + + error = -libxfs_free_extent(tp, pag, start, length, &XFS_RMAP_OINFO_AG, + XFS_AG_RESV_AGFL); + if (error) + goto out_cancel; + + return -libxfs_trans_commit(tp); + +out_cancel: + libxfs_trans_cancel(tp); + return error; +} + +/* Report blocks freed from the AGFL. */ +static int +dump_grabbed( + uint64_t start, + uint64_t length, + void *data) +{ + struct dump_info *di = data; + const char *fmt; + + if (length == 1) + fmt = di->leak ? _("agfl %u: leaked agbno %u\n") : + _("agfl %u: removed agbno %u\n"); + else + fmt = di->leak ? _("agfl %u: leaked agbno %u-%u\n") : + _("agfl %u: removed agbno %u-%u\n"); + + printf(fmt, pag_agno(di->pag), (unsigned int)start, + (unsigned int)(start + length - 1)); + return 0; +} + +/* Remove blocks from the AGFL. 
*/ +static int +agfl_get( + struct xfs_perag *pag, + int quantity) +{ + struct dump_info di = { + .pag = pag, + .leak = quantity < 0, + }; + struct xfs_agf *agf; + struct xfs_buf *agf_bp; + struct xfs_trans *tp; + struct bitmap *grabbed; + const unsigned int agfl_size = libxfs_agfl_size(pag_mount(pag)); + unsigned int i; + int error; + + if (!quantity) + return 0; + + if (di.leak) + quantity = -quantity; + quantity = min(quantity, agfl_size); + + error = bitmap_alloc(&grabbed); + if (error) + goto out; + + error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, quantity, 0, + 0, &tp); + if (error) + goto out_bitmap; + + error = -libxfs_alloc_read_agf(pag, tp, 0, &agf_bp); + if (error) + goto out_cancel; + + agf = agf_bp->b_addr; + quantity = min(quantity, be32_to_cpu(agf->agf_flcount)); + + for (i = 0; i < quantity; i++) { + xfs_agblock_t agbno; + + error = -libxfs_alloc_get_freelist(pag, tp, agf_bp, &agbno, 0); + if (error) + goto out_cancel; + + if (agbno == NULLAGBLOCK) { + error = ENOSPC; + goto out_cancel; + } + + error = bitmap_set(grabbed, agbno, 1); + if (error) + goto out_cancel; + } + + error = -libxfs_trans_commit(tp); + if (error) + goto out_bitmap; + + error = bitmap_iterate(grabbed, dump_grabbed, &di); + if (error) + goto out_bitmap; + + if (!di.leak) { + error = bitmap_iterate(grabbed, free_grabbed, &di); + if (error) + goto out_bitmap; + } + + bitmap_free(&grabbed); + return 0; + +out_cancel: + libxfs_trans_cancel(tp); +out_bitmap: + bitmap_free(&grabbed); +out: + if (error) + printf(_("agfl %u: %s\n"), pag_agno(pag), strerror(error)); + return error; +} + +/* Add blocks to the AGFL. 
*/ +static int +agfl_put( + struct xfs_perag *pag, + int quantity) +{ + struct xfs_alloc_arg args = { + .mp = pag_mount(pag), + .alignment = 1, + .minlen = 1, + .prod = 1, + .resv = XFS_AG_RESV_AGFL, + .oinfo = XFS_RMAP_OINFO_AG, + }; + struct xfs_buf *agfl_bp; + struct xfs_agf *agf; + struct xfs_trans *tp; + xfs_fsblock_t target; + const unsigned int agfl_size = libxfs_agfl_size(pag_mount(pag)); + unsigned int i; + bool eoag = quantity < 0; + int error; + + if (!quantity) + return 0; + + if (eoag) + quantity = -quantity; + quantity = min(quantity, agfl_size); + + error = -libxfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, quantity, 0, + 0, &tp); + if (error) + return error; + args.tp = tp; + + error = -libxfs_alloc_read_agf(pag, tp, 0, &args.agbp); + if (error) + goto out_cancel; + + agf = args.agbp->b_addr; + args.maxlen = min(quantity, agfl_size - be32_to_cpu(agf->agf_flcount)); + + if (eoag) + target = xfs_agbno_to_fsb(pag, + be32_to_cpu(agf->agf_length) - 1); + else + target = xfs_agbno_to_fsb(pag, 0); + + error = -libxfs_alloc_read_agfl(pag, tp, &agfl_bp); + if (error) + goto out_cancel; + + error = -libxfs_alloc_vextent_near_bno(&args, target); + if (error) + goto out_cancel; + + if (args.agbno == NULLAGBLOCK) { + error = ENOSPC; + goto out_cancel; + } + + for (i = 0; i < args.len; i++) { + error = -libxfs_alloc_put_freelist(pag, tp, args.agbp, + agfl_bp, args.agbno + i, 0); + if (error) + goto out_cancel; + } + + if (i == 1) + printf(_("agfl %u: added agbno %u\n"), pag_agno(pag), + args.agbno); + else if (i > 1) + printf(_("agfl %u: added agbno %u-%u\n"), pag_agno(pag), + args.agbno, args.agbno + i - 1); + + error = -libxfs_trans_commit(tp); + if (error) + goto out; + + return 0; + +out_cancel: + libxfs_trans_cancel(tp); +out: + if (error) + printf(_("agfl %u: %s\n"), pag_agno(pag), strerror(error)); + return error; +} + +static void +agfl_adjust( + struct xfs_mount *mp, + xfs_agnumber_t agno, + int gblocks, + int pblocks) +{ + struct xfs_perag *pag; + int 
error; + + if (!expert_mode) { + printf(_("AGFL get/put only supported in expert mode.\n")); + exitcode = 1; + return; + } + + pag = libxfs_perag_get(mp, agno); + + error = agfl_get(pag, gblocks); + if (error) + goto out_pag; + + error = agfl_put(pag, pblocks); + +out_pag: + libxfs_perag_put(pag); + if (error) + exitcode = 1; +} + static int agfl_f( int argc, @@ -88,9 +359,25 @@ agfl_f( { xfs_agnumber_t agno; char *p; + int c; + int gblocks = 0, pblocks = 0; - if (argc > 1) { - agno = (xfs_agnumber_t)strtoul(argv[1], &p, 0); + while ((c = getopt(argc, argv, "g:p:")) != -1) { + switch (c) { + case 'g': + gblocks = atoi(optarg); + break; + case 'p': + pblocks = atoi(optarg); + break; + default: + agfl_help(); + return 1; + } + } + + if (argc > optind) { + agno = (xfs_agnumber_t)strtoul(argv[optind], &p, 0); if (*p != '\0' || agno >= mp->m_sb.sb_agcount) { dbprintf(_("bad allocation group number %s\n"), argv[1]); return 0; @@ -98,6 +385,10 @@ agfl_f( cur_agno = agno; } else if (cur_agno == NULLAGNUMBER) cur_agno = 0; + + if (gblocks || pblocks) + agfl_adjust(mp, cur_agno, gblocks, pblocks); + ASSERT(typtab[TYP_AGFL].typnm == TYP_AGFL); set_cur(&typtab[TYP_AGFL], XFS_AG_DADDR(mp, cur_agno, XFS_AGFL_DADDR(mp)), diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h index 530feef2a47db8..76f55515bb41f7 100644 --- a/libxfs/libxfs_api_defs.h +++ b/libxfs/libxfs_api_defs.h @@ -31,8 +31,12 @@ #define xfs_allocbt_maxrecs libxfs_allocbt_maxrecs #define xfs_allocbt_stage_cursor libxfs_allocbt_stage_cursor #define xfs_alloc_fix_freelist libxfs_alloc_fix_freelist +#define xfs_alloc_get_freelist libxfs_alloc_get_freelist #define xfs_alloc_min_freelist libxfs_alloc_min_freelist +#define xfs_alloc_put_freelist libxfs_alloc_put_freelist #define xfs_alloc_read_agf libxfs_alloc_read_agf +#define xfs_alloc_read_agfl libxfs_alloc_read_agfl +#define xfs_alloc_vextent_near_bno libxfs_alloc_vextent_near_bno #define xfs_alloc_vextent_start_ag libxfs_alloc_vextent_start_ag #define 
xfs_ascii_ci_hashname libxfs_ascii_ci_hashname diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8 index 553adff758bc02..4217e9932dd775 100644 --- a/man/man8/xfs_db.8 +++ b/man/man8/xfs_db.8 @@ -182,10 +182,19 @@ .SH COMMANDS .IR agno . If no argument is given, use the current allocation group. .TP -.BI "agfl [" agno ] +.BI "agfl [" agno "] [\-g " " quantity" "] [\-p " quantity ] Set current address to the AGFL block for allocation group .IR agno . If no argument is given, use the current allocation group. +If the +.B -g +option is specified with a positive quantity, remove that many blocks from the +AGFL and put them in the free space btrees. +If the quantity is negative, remove the blocks and leak them. +If the +.B -p +option is specified, add that many blocks to the AGFL. +If the quantity is negative, the blocks are selected from the end of the AG. .TP .BI "agi [" agno ] Set current address to the AGI block for allocation group ^ permalink raw reply related [flat|nested] 110+ messages in thread
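For readers trying to predict what "agfl -g" and "agfl -p" will attempt, the clamping done at the top of agfl_get() and agfl_put() in the patch above can be modelled roughly as below. This is only a sketch: struct agfl_plan, plan_agfl_get(), plan_agfl_put() and the helper names are hypothetical and are not part of xfs_db, and the guard against flcount exceeding the AGFL size is an assumption added here rather than something the patch does. The real agfl_put() may also add fewer blocks than this upper bound if the allocator returns a shorter extent.

```c
#include <stdbool.h>

/* Hypothetical model of what "agfl -g N" or "agfl -p N" will attempt. */
struct agfl_plan {
	unsigned int	count;	/* upper bound on blocks moved */
	bool		alt;	/* -g: leak blocks; -p: allocate near end of AG */
};

static unsigned int
min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Magnitude of a possibly negative quantity, avoiding signed overflow. */
static unsigned int
magnitude(int quantity)
{
	return quantity < 0 ? 0U - (unsigned int)quantity :
			      (unsigned int)quantity;
}

/* -g: clamp to the AGFL size, then to the blocks currently on the AGFL. */
static struct agfl_plan
plan_agfl_get(int quantity, unsigned int agfl_size, unsigned int flcount)
{
	struct agfl_plan	plan = { .alt = quantity < 0 };

	plan.count = min_uint(min_uint(magnitude(quantity), agfl_size),
			      flcount);
	return plan;
}

/* -p: clamp to the AGFL size, then to the free slots left on the AGFL. */
static struct agfl_plan
plan_agfl_put(int quantity, unsigned int agfl_size, unsigned int flcount)
{
	struct agfl_plan	plan = { .alt = quantity < 0 };
	/* Assumption: treat an overfull (corrupt) AGFL as having no room. */
	unsigned int		room = flcount < agfl_size ?
					agfl_size - flcount : 0;

	plan.count = min_uint(min_uint(magnitude(quantity), agfl_size), room);
	return plan;
}
```

With an AGFL that holds 118 entries and currently lists 3 blocks, plan_agfl_get(5, 118, 3) reports a count of 3, since no more than the listed blocks can be removed; a negative quantity sets the alt flag, which corresponds to leaking for -g and to allocating from the end of the AG for -p.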
* [PATCH 05/11] xfs_spaceman: implement clearing free space 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (3 preceding siblings ...) 2024-12-31 23:45 ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong @ 2024-12-31 23:46 ` Darrick J. Wong 2024-12-31 23:46 ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong ` (5 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> First attempt at evacuating all the used blocks from part of a filesystem. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libfrog/Makefile | 5 libfrog/clearspace.c | 3294 +++++++++++++++++++++++++++++++++++++++++++++++ libfrog/clearspace.h | 79 + man/man8/xfs_spaceman.8 | 17 spaceman/Makefile | 2 spaceman/clearfree.c | 171 ++ spaceman/init.c | 1 spaceman/space.h | 2 8 files changed, 3570 insertions(+), 1 deletion(-) create mode 100644 libfrog/clearspace.c create mode 100644 libfrog/clearspace.h create mode 100644 spaceman/clearfree.c diff --git a/libfrog/Makefile b/libfrog/Makefile index 4da427789411a6..91c99822002347 100644 --- a/libfrog/Makefile +++ b/libfrog/Makefile @@ -65,6 +65,11 @@ workqueue.h LSRCFILES += gen_crc32table.c +ifeq ($(HAVE_GETFSMAP),yes) +CFILES+=clearspace.c +HFILES+=clearspace.h +endif + LDIRT = gen_crc32table crc32table.h default: ltdepend $(LTLIBRARY) diff --git a/libfrog/clearspace.c b/libfrog/clearspace.c new file mode 100644 index 00000000000000..0b6ef8f1b15015 --- /dev/null +++ b/libfrog/clearspace.c @@ -0,0 +1,3294 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include <linux/fsmap.h> +#include "paths.h" +#include "fsgeom.h" +#include "logging.h" +#include "bulkstat.h" +#include "bitmap.h" +#include "file_exchange.h" +#include "clearspace.h" +#include "handle.h" + +/* + * Filesystem Space Balloons + * ========================= + * + * NOTE: Due to the evolving identity of this code, the "space_fd" or "space + * file" in the codebase are the same as the balloon file in this introduction. + * The introduction was written much later than the code. + * + * The goal of this code is to create a balloon file that is mapped to a range + * of the physical space that is managed by a filesystem. There are several + * uses envisioned for balloon files: + * + * 1. Defragmenting free space. Once the balloon is created, freeing it leaves + * a large chunk of contiguous free space ready for reallocation. + * + * 2. Shrinking the filesystem. If the balloon is inflated at the end of the + * filesystem, the file can be handed to the shrink code. The shrink code + * can then reduce the filesystem size by the size of the balloon. + * + * 3. Constraining usage of underlying thin provisioning pools. The space + * assigned to a balloon can be DISCARDed, which prevents the filesystem + * from using that space until the balloon is freed. This can be done more + * efficiently with the standard fallocate call, unless the balloon must + * target specific LBA ranges. + * + * Inflating a balloon is performed in five phases: claiming unused space; + * freezing used space; migrating file mappings away from frozen space; moving + * inodes; and rebuilding metadata elsewhere. + * + * Claiming Unused Space + * --------------------- + * + * The first step of inflating a file balloon is to define the range of + * physical space to be added to the balloon and claim as much of the free + * space inside that range as possible. 
Dirty data are flushed to disk and + * the block and inode garbage collectors are run to remove any speculative + * preallocations that might be occupying space in the target range. + * + * Second, the new XFS_IOC_MAP_FREESP ioctl is used to map free space in the + * target range to the balloon file. This step will be repeated after every + * space-clearing step below to capture that cleared space. Concurrent writer + * threads will (hopefully) be allocated space outside the target range. + * + * Freezing Used Space + * ------------------- + * + * The second phase of inflating the balloon is to freeze as much of the + * allocated space within the target range as possible. The purpose of this + * step is to grab a second reference to the used space, thereby preventing it + * from being reused elsewhere. + * + * Freezing of a physical space extent starts by using GETFSMAP to find the + * file owner of the space, and opening the file by handle. The fsmap record + * is used to create a FICLONERANGE request to link the file range into a work + * file. Once the reflink is made, any subsequent writes to any of the owners + * of that space are staged via copy on write. The balloon file prevents the + * copy on write from being staged within the target range. The frozen space + * mapping is moved from the work file to the balloon file, where it remains + * until the balloon file is freed. + * + * If reflink is not supported on the filesystem, used space cannot be frozen. + * This phase is skipped. + * + * Migrating File Mappings + * ----------------------- + * + * Once the balloon file has been populated with as much of the target range as + * possible, it is time to remap file ranges that point to the frozen space. + * + * It is advantageous to remap as many blocks as can be done with as few system + * calls as possible to avoid fragmenting files. 
Furthermore, it is preferable + * to remap heavily shared extents before lightly shared extents to preserve + * reflinks when possible. The new GETFSREFCOUNTS call is used to rank + * physical space extents by size and sharing factor so that the library always + * tries to relocate the highest ranking space extent. + * + * Once a space extent has been selected for relocation, it is reflinked from + * the balloon file into the work file. Next, fallocate is called with the + * FALLOC_FL_UNSHARE_RANGE mode to persist a new copy of the file data and + * update the mapping in the work file. The GETFSMAP call is used to find the + * remaining owners of the target space. For each owner, FIDEDUPERANGE is + * used to change the owner file's mapping to the space in the work file if the + * owner has not been changed. + * + * If the filesystem does not support reflink, FIDEDUPERANGE will not be + * available. Fortunately, there will only be one owner of the frozen space. + * The file range contents are instead copied through the page cache to the + * work file, and EXCHANGE_RANGE is used to swap the mappings if the owner + * file has not been modified. + * + * When the only remaining owner of the space is the balloon file, return to + * the GETFSREFCOUNTS step to find a new target. This phase is complete when + * there are no more targets. + * + * Moving Inodes + * ------------- + * + * NOTE: This part is not written. + * + * When GETFSMAP tells us about an inode chunk, it is necessary to move the + * inodes allocated in that inode chunk to a new chunk. The first step is to + * create a new donor file whose inode record is not in the target range. This + * file must be created in a donor directory. Next, the file contents should + * be cloned, either via FICLONE for regular files or by copying the directory + * entries for directories. The caller must ensure that no programs write to + * the victim inode while this process is ongoing. 
+ * + * Finally, the new inode must be mapped into the same points in the directory + * tree as the old inode. For each parent pointer accessible by the file, + * perform a RENAME_EXCHANGE operation to update the directory entry. One + * obvious flaw of this method is that we cannot specify (parent, name, child) + * pairs to renameat, which means that the rename does the wrong thing if + * either directory is updated concurrently. + * + * If parent pointers are not available, this phase could be performed slowly + * by iterating all directories looking for entries of interest and swapping + * them. + * + * It is required that the caller guarantee that other applications cannot + * update the filesystem concurrently. + * + * Rebuilding Metadata + * ------------------- + * + * The final phase identifies filesystem metadata occupying the target range + * and uses the online filesystem repair facility to rebuild the metadata + * structures. Assuming that the balloon file now maps most of the space in + * the target range, the new structures should be located outside of the target + * range. This phase runs in a loop until there is no more metadata to + * relocate or no progress can be made on relocating metadata. + * + * Limitations and Bugs + * -------------------- + * + * - This code must be able to find the owners of a range of physical space. + * If GETFSMAP does not return owner information, this code cannot succeed. + * In other words, reverse mapping must be enabled. + * + * - We cannot freeze EOF blocks because the FICLONERANGE code does not allow + * us to remap an EOF block into the middle of the balloon file. I think we + * actually succeed at reflinking the EOF block into the work file during the + * freeze step, but we need to dedupe/exchange the real owners' mappings + * without waiting for the freeze step. OTOH, we /also/ want to freeze as + * much space as quickly as we can. 
+ * + * - Freeze cannot use FICLONERANGE to reflink unwritten extents into the work + * file because FICLONERANGE ignores unwritten extents. We could create the + * work file as a sparse file and use EXCHANGE_RANGE to swap the unwritten + * extent with the hole, extend EOF to be allocunit aligned, and use + * EXCHANGE_RANGE to move it to the balloon file. That first exchange must + * be careful to sample the owner file's bulkstat data, re-measure the file + * range to confirm that the unwritten extent is still the one we want, and + * only exchange if the owner file has not changed. + * + * - csp_buffercopy seems to hang if pread returns zero bytes read. Do we dare + * use copy_file_range for this instead? + * + * - None of this code knows how to move inodes. Phase 4 is entirely + * speculative fiction rooted in Dave Chinner's earlier implementation. + * + * - Does this work for realtime files? Even for large rt extent sizes? + */ + +/* VFS helpers */ + +/* Remap the file range described by @fcr into fd, or return an errno. */ +static inline int +clonerange(int fd, struct file_clone_range *fcr) +{ + int ret; + + ret = ioctl(fd, FICLONERANGE, fcr); + if (ret) + return errno; + + return 0; +} + +/* + * Deduplicate part of fd into the file range described by fdr. If the + * operation succeeded, we set @same to whether or not we deduped the data and + * return zero. If not, return an errno. + */ +static inline int +deduperange(int fd, struct file_dedupe_range *fdr, bool *same) +{ + struct file_dedupe_range_info *info = &fdr->info[0]; + int ret; + + assert(fdr->dest_count == 1); + *same = false; + + ret = ioctl(fd, FIDEDUPERANGE, fdr); + if (ret) + return errno; + + if (info->status < 0) + return -info->status; + + if (info->status == FILE_DEDUPE_RANGE_DIFFERS) + return 0; + + /* The kernel should never dedupe more than it was asked. 
*/ + assert(fdr->src_length >= info->bytes_deduped); + + *same = true; + return 0; +} + +/* Space clearing operation control */ + +#define QUERY_BATCH_SIZE 1024 + +struct clearspace_tgt { + unsigned long long start; + unsigned long long length; + unsigned long long owners; + unsigned long long prio; + unsigned long long evacuated; + bool try_again; +}; + +struct clearspace_req { + struct xfs_fd *xfd; + + /* all the blocks that we've tried to clear */ + struct bitmap *visited; + + /* stat buffer of the open file */ + struct stat statbuf; + struct stat temp_statbuf; + struct stat space_statbuf; + + /* handle to this filesystem */ + void *fshandle; + size_t fshandle_sz; + + /* physical storage that we want to clear */ + unsigned long long start; + unsigned long long length; + dev_t dev; + + /* convenience variable */ + bool realtime:1; + bool use_reflink:1; + bool can_evac_metadata:1; + + /* + * The "space capture" file. Each extent in this file must be mapped + * to the same byte offset as the byte address of the physical space. 
+ */ + int space_fd; + + /* work file for migrating file data */ + int work_fd; + + /* preallocated buffers for queries */ + struct getbmapx *bhead; + struct fsmap_head *mhead; + struct xfs_getfsrefs_head *rhead; + + /* buffer for copying data */ + char *buf; + + /* buffer for deduping data */ + struct file_dedupe_range *fdr; + + /* tracing mask and indent level */ + unsigned int trace_mask; + unsigned int trace_indent; +}; + +static inline bool +csp_is_internal_owner( + const struct clearspace_req *req, + unsigned long long owner) +{ + return owner == req->temp_statbuf.st_ino || + owner == req->space_statbuf.st_ino; +} + +/* Debugging stuff */ + +static const struct csp_errstr { + unsigned int mask; + const char *tag; +} errtags[] = { + { CSP_TRACE_FREEZE, "freeze" }, + { CSP_TRACE_GRAB, "grab" }, + { CSP_TRACE_PREP, "prep" }, + { CSP_TRACE_TARGET, "target" }, + { CSP_TRACE_DEDUPE, "dedupe" }, + { CSP_TRACE_EXCHANGE, "exchange_range" }, + { CSP_TRACE_XREBUILD, "rebuild" }, + { CSP_TRACE_EFFICACY, "efficacy" }, + { CSP_TRACE_SETUP, "setup" }, + { CSP_TRACE_DUMPFILE, "dumpfile" }, + { CSP_TRACE_BITMAP, "bitmap" }, + + /* prioritize high level functions over low level queries for tagging */ + { CSP_TRACE_FSMAP, "fsmap" }, + { CSP_TRACE_FSREFS, "fsrefs" }, + { CSP_TRACE_BMAPX, "bmapx" }, + { CSP_TRACE_FALLOC, "falloc" }, + { CSP_TRACE_STATUS, "status" }, + { 0, NULL }, +}; + +static void +csp_debug( + struct clearspace_req *req, + unsigned int mask, + const char *func, + int line, + const char *format, + ...) 
+{ + const struct csp_errstr *et = errtags; + bool debug = (req->trace_mask & ~CSP_TRACE_STATUS); + int indent = req->trace_indent; + va_list args; + + if ((req->trace_mask & mask) != mask) + return; + + if (debug) { + while (indent > 0) { + fprintf(stderr, " "); + indent--; + } + + for (; et->tag; et++) { + if (et->mask & mask) { + fprintf(stderr, "%s: ", et->tag); + break; + } + } + } + + va_start(args, format); + vfprintf(stderr, format, args); + va_end(args); + + if (debug) + fprintf(stderr, " (line %d)\n", line); + else + fprintf(stderr, "\n"); + fflush(stderr); +} + +#define trace_freeze(req, format, ...) \ + csp_debug((req), CSP_TRACE_FREEZE, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_grabfree(req, format, ...) \ + csp_debug((req), CSP_TRACE_GRAB, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_fsmap(req, format, ...) \ + csp_debug((req), CSP_TRACE_FSMAP, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_fsmap_rec(req, mask, mrec) \ + while (!csp_is_internal_owner((req), (mrec)->fmr_owner)) { \ + csp_debug((req), (mask) | CSP_TRACE_FSMAP, __func__, __LINE__, \ +"fsmap phys 0x%llx owner 0x%llx offset 0x%llx bytecount 0x%llx flags 0x%x", \ + (unsigned long long)(mrec)->fmr_physical, \ + (unsigned long long)(mrec)->fmr_owner, \ + (unsigned long long)(mrec)->fmr_offset, \ + (unsigned long long)(mrec)->fmr_length, \ + (mrec)->fmr_flags); \ + break; \ + } + +#define trace_fsrefs(req, format, ...) \ + csp_debug((req), CSP_TRACE_FSREFS, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_fsrefs_rec(req, mask, rrec) \ + csp_debug((req), (mask) | CSP_TRACE_FSREFS, __func__, __LINE__, \ +"fsref phys 0x%llx bytecount 0x%llx owners %llu flags 0x%x", \ + (unsigned long long)(rrec)->fcr_physical, \ + (unsigned long long)(rrec)->fcr_length, \ + (unsigned long long)(rrec)->fcr_owners, \ + (rrec)->fcr_flags) + +#define trace_bmapx(req, format, ...) 
\ + csp_debug((req), CSP_TRACE_BMAPX, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_bmapx_rec(req, mask, brec) \ + csp_debug((req), (mask) | CSP_TRACE_BMAPX, __func__, __LINE__, \ +"bmapx pos 0x%llx bytecount 0x%llx phys 0x%llx flags 0x%x", \ + (unsigned long long)BBTOB((brec)->bmv_offset), \ + (unsigned long long)BBTOB((brec)->bmv_length), \ + (unsigned long long)BBTOB((brec)->bmv_block), \ + (brec)->bmv_oflags) + +#define trace_prep(req, format, ...) \ + csp_debug((req), CSP_TRACE_PREP, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_target(req, format, ...) \ + csp_debug((req), CSP_TRACE_TARGET, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_dedupe(req, format, ...) \ + csp_debug((req), CSP_TRACE_DEDUPE, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_falloc(req, format, ...) \ + csp_debug((req), CSP_TRACE_FALLOC, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_exchange(req, format, ...) \ + csp_debug((req), CSP_TRACE_EXCHANGE, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_xrebuild(req, format, ...) \ + csp_debug((req), CSP_TRACE_XREBUILD, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_setup(req, format, ...) \ + csp_debug((req), CSP_TRACE_SETUP, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_status(req, format, ...) \ + csp_debug((req), CSP_TRACE_STATUS, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_dumpfile(req, format, ...) \ + csp_debug((req), CSP_TRACE_DUMPFILE, __func__, __LINE__, format, __VA_ARGS__) + +#define trace_bitmap(req, format, ...) \ + csp_debug((req), CSP_TRACE_BITMAP, __func__, __LINE__, format, __VA_ARGS__) + +/* VFS Iteration helpers */ + +static inline void +start_spacefd_iter(struct clearspace_req *req) +{ + req->trace_indent++; +} + +static inline void +end_spacefd_iter(struct clearspace_req *req) +{ + req->trace_indent--; +} + +/* + * Iterate each hole in the space-capture file. 
Returns 1 if holepos/length + * has been set to a hole; 0 if there aren't any holes left, or -1 for error. + */ +static inline int +spacefd_hole_iter( + const struct clearspace_req *req, + loff_t *holepos, + loff_t *length) +{ + loff_t end = req->start + req->length; + loff_t h; + loff_t d; + + if (*length == 0) + d = req->start; + else + d = *holepos + *length; + if (d >= end) + return 0; + + h = lseek(req->space_fd, d, SEEK_HOLE); + if (h < 0) { + perror(_("finding start of hole in space capture file")); + return h; + } + if (h >= end) + return 0; + + d = lseek(req->space_fd, h, SEEK_DATA); + if (d < 0 && errno == ENXIO) + d = end; + if (d < 0) { + perror(_("finding end of hole in space capture file")); + return d; + } + if (d > end) + d = end; + + *holepos = h; + *length = d - h; + return 1; +} + +/* + * Iterate each written region in the space-capture file. Returns 1 if + * datapos/length have been set to a data area; 0 if there isn't any data left, + * or -1 for error. + */ +static int +spacefd_data_iter( + const struct clearspace_req *req, + loff_t *datapos, + loff_t *length) +{ + loff_t end = req->start + req->length; + loff_t d; + loff_t h; + + if (*length == 0) + h = req->start; + else + h = *datapos + *length; + if (h >= end) + return 0; + + d = lseek(req->space_fd, h, SEEK_DATA); + if (d < 0 && errno == ENXIO) + return 0; + if (d < 0) { + perror(_("finding start of data in space capture file")); + return d; + } + if (d >= end) + return 0; + + h = lseek(req->space_fd, d, SEEK_HOLE); + if (h < 0) { + perror(_("finding end of data in space capture file")); + return h; + } + if (h > end) + h = end; + + *datapos = d; + *length = h - d; + return 1; +} + +/* Filesystem space usage queries */ + +/* Allocate the structures needed for a fsmap query. 
*/ +static void +start_fsmap_query( + struct clearspace_req *req, + dev_t dev, + unsigned long long physical, + unsigned long long length) +{ + struct fsmap_head *mhead = req->mhead; + + assert(req->mhead->fmh_count == 0); + memset(mhead, 0, sizeof(struct fsmap_head)); + mhead->fmh_count = QUERY_BATCH_SIZE; + mhead->fmh_keys[0].fmr_device = dev; + mhead->fmh_keys[0].fmr_physical = physical; + mhead->fmh_keys[1].fmr_device = dev; + mhead->fmh_keys[1].fmr_physical = physical + length; + mhead->fmh_keys[1].fmr_owner = ULLONG_MAX; + mhead->fmh_keys[1].fmr_flags = UINT_MAX; + mhead->fmh_keys[1].fmr_offset = ULLONG_MAX; + + trace_fsmap(req, "dev %u:%u physical 0x%llx bytecount 0x%llx highkey 0x%llx", + major(dev), minor(dev), + (unsigned long long)physical, + (unsigned long long)length, + (unsigned long long)mhead->fmh_keys[1].fmr_physical); + req->trace_indent++; +} + +static inline void +end_fsmap_query( + struct clearspace_req *req) +{ + req->trace_indent--; + req->mhead->fmh_count = 0; +} + +/* Set us up for the next run_fsmap_query, or return false. */ +static inline bool +advance_fsmap_cursor(struct fsmap_head *mhead) +{ + struct fsmap *mrec; + + mrec = &mhead->fmh_recs[mhead->fmh_entries - 1]; + if (mrec->fmr_flags & FMR_OF_LAST) + return false; + + fsmap_advance(mhead); + return true; +} + +/* + * Run a GETFSMAP query. Returns 1 if there are rows, 0 if there are no rows, + * or -1 for error. 
+ */ +static inline int +run_fsmap_query( + struct clearspace_req *req) +{ + struct fsmap_head *mhead = req->mhead; + int ret; + + if (mhead->fmh_entries > 0 && !advance_fsmap_cursor(mhead)) + return 0; + + trace_fsmap(req, + "ioctl dev %u:%u physical 0x%llx length 0x%llx highkey 0x%llx", + major(mhead->fmh_keys[0].fmr_device), + minor(mhead->fmh_keys[0].fmr_device), + (unsigned long long)mhead->fmh_keys[0].fmr_physical, + (unsigned long long)mhead->fmh_keys[0].fmr_length, + (unsigned long long)mhead->fmh_keys[1].fmr_physical); + + ret = ioctl(req->xfd->fd, FS_IOC_GETFSMAP, mhead); + if (ret) { + perror(_("querying fsmap data")); + return -1; + } + + if (!(mhead->fmh_oflags & FMH_OF_DEV_T)) { + fprintf(stderr, _("fsmap does not return dev_t.\n")); + return -1; + } + + if (mhead->fmh_entries == 0) + return 0; + + return 1; +} + +#define for_each_fsmap_row(req, rec) \ + for ((rec) = (req)->mhead->fmh_recs; \ + (rec) < (req)->mhead->fmh_recs + (req)->mhead->fmh_entries; \ + (rec)++) + +/* Allocate the structures needed for a fsrefcounts query. 
*/ +static void +start_fsrefs_query( + struct clearspace_req *req, + dev_t dev, + unsigned long long physical, + unsigned long long length) +{ + struct xfs_getfsrefs_head *rhead = req->rhead; + + assert(req->rhead->fch_count == 0); + memset(rhead, 0, sizeof(struct xfs_getfsrefs_head)); + rhead->fch_count = QUERY_BATCH_SIZE; + rhead->fch_keys[0].fcr_device = dev; + rhead->fch_keys[0].fcr_physical = physical; + rhead->fch_keys[1].fcr_device = dev; + rhead->fch_keys[1].fcr_physical = physical + length; + rhead->fch_keys[1].fcr_owners = ULLONG_MAX; + rhead->fch_keys[1].fcr_flags = UINT_MAX; + + trace_fsrefs(req, "dev %u:%u physical 0x%llx bytecount 0x%llx highkey 0x%llx", + major(dev), minor(dev), + (unsigned long long)physical, + (unsigned long long)length, + (unsigned long long)rhead->fch_keys[1].fcr_physical); + req->trace_indent++; +} + +static inline void +end_fsrefs_query( + struct clearspace_req *req) +{ + req->trace_indent--; + req->rhead->fch_count = 0; +} + +/* Set us up for the next run_fsrefs_query, or return false. */ +static inline bool +advance_fsrefs_query(struct xfs_getfsrefs_head *rhead) +{ + struct xfs_getfsrefs *rrec; + + rrec = &rhead->fch_recs[rhead->fch_entries - 1]; + if (rrec->fcr_flags & FCR_OF_LAST) + return false; + + xfs_getfsrefs_advance(rhead); + return true; +} + +/* + * Run a GETFSREFCOUNTS query. Returns 1 if there are rows, 0 if there are + * no rows, or -1 for error. 
+ */ +static inline int +run_fsrefs_query( + struct clearspace_req *req) +{ + struct xfs_getfsrefs_head *rhead = req->rhead; + int ret; + + if (rhead->fch_entries > 0 && !advance_fsrefs_query(rhead)) + return 0; + + trace_fsrefs(req, + "ioctl dev %u:%u physical 0x%llx length 0x%llx highkey 0x%llx", + major(rhead->fch_keys[0].fcr_device), + minor(rhead->fch_keys[0].fcr_device), + (unsigned long long)rhead->fch_keys[0].fcr_physical, + (unsigned long long)rhead->fch_keys[0].fcr_length, + (unsigned long long)rhead->fch_keys[1].fcr_physical); + + ret = ioctl(req->xfd->fd, XFS_IOC_GETFSREFCOUNTS, rhead); + if (ret) { + perror(_("querying refcount data")); + return -1; + } + + if (!(rhead->fch_oflags & FCH_OF_DEV_T)) { + fprintf(stderr, _("fsrefcounts does not return dev_t.\n")); + return -1; + } + + if (rhead->fch_entries == 0) + return 0; + + return 1; +} + +#define for_each_fsref_row(req, rec) \ + for ((rec) = (req)->rhead->fch_recs; \ + (rec) < (req)->rhead->fch_recs + (req)->rhead->fch_entries; \ + (rec)++) + +/* Allocate the structures needed for a bmapx query. */ +static void +start_bmapx_query( + struct clearspace_req *req, + unsigned int fork, + unsigned long long pos, + unsigned long long length) +{ + struct getbmapx *bhead = req->bhead; + + assert(fork == BMV_IF_ATTRFORK || fork == BMV_IF_COWFORK || !fork); + assert(req->bhead->bmv_count == 0); + + memset(bhead, 0, sizeof(struct getbmapx)); + bhead[0].bmv_offset = BTOBB(pos); + bhead[0].bmv_length = BTOBB(length); + bhead[0].bmv_count = QUERY_BATCH_SIZE + 1; + bhead[0].bmv_iflags = fork | BMV_IF_PREALLOC | BMV_IF_DELALLOC; + + trace_bmapx(req, "%s pos 0x%llx bytecount 0x%llx", + fork == BMV_IF_COWFORK ? "cow" : fork == BMV_IF_ATTRFORK ? 
"attr" : "data", + (unsigned long long)BBTOB(bhead[0].bmv_offset), + (unsigned long long)BBTOB(bhead[0].bmv_length)); + req->trace_indent++; +} + +static inline void +end_bmapx_query( + struct clearspace_req *req) +{ + req->trace_indent--; + req->bhead->bmv_count = 0; +} + +/* Set us up for the next run_bmapx_query, or return false. */ +static inline bool +advance_bmapx_query(struct getbmapx *bhead) +{ + struct getbmapx *brec; + unsigned long long next_offset; + unsigned long long end = bhead->bmv_offset + bhead->bmv_length; + + brec = &bhead[bhead->bmv_entries]; + if (brec->bmv_oflags & BMV_OF_LAST) + return false; + + next_offset = brec->bmv_offset + brec->bmv_length; + if (next_offset > end) + return false; + + bhead->bmv_offset = next_offset; + bhead->bmv_length = end - next_offset; + return true; +} + +/* + * Run a GETBMAPX query. Returns 1 if there are rows, 0 if there are no rows, + * or -1 for error. + */ +static inline int +run_bmapx_query( + struct clearspace_req *req, + int fd) +{ + struct getbmapx *bhead = req->bhead; + unsigned int fork; + int ret; + + if (bhead->bmv_entries > 0 && !advance_bmapx_query(bhead)) + return 0; + + fork = bhead[0].bmv_iflags & (BMV_IF_COWFORK | BMV_IF_ATTRFORK); + trace_bmapx(req, "ioctl %s pos 0x%llx bytecount 0x%llx", + fork == BMV_IF_COWFORK ? "cow" : fork == BMV_IF_ATTRFORK ? 
"attr" : "data", + (unsigned long long)BBTOB(bhead[0].bmv_offset), + (unsigned long long)BBTOB(bhead[0].bmv_length)); + + ret = ioctl(fd, XFS_IOC_GETBMAPX, bhead); + if (ret) { + perror(_("querying bmapx data")); + return -1; + } + + if (bhead->bmv_entries == 0) + return 0; + + return 1; +} + +#define for_each_bmapx_row(req, rec) \ + for ((rec) = (req)->bhead + 1; \ + (rec) < (req)->bhead + 1 + (req)->bhead->bmv_entries; \ + (rec)++) + +static inline void +csp_dump_bmapx_row( + struct clearspace_req *req, + unsigned int nr, + const struct getbmapx *brec) +{ + if (brec->bmv_block == -1) { + trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx hole", + nr, + (unsigned long long)BBTOB(brec->bmv_offset), + (unsigned long long)BBTOB(brec->bmv_length)); + return; + } + + if (brec->bmv_block == -2) { + trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx delalloc", + nr, + (unsigned long long)BBTOB(brec->bmv_offset), + (unsigned long long)BBTOB(brec->bmv_length)); + return; + } + + trace_dumpfile(req, "[%u]: pos 0x%llx len 0x%llx phys 0x%llx flags 0x%x", + nr, + (unsigned long long)BBTOB(brec->bmv_offset), + (unsigned long long)BBTOB(brec->bmv_length), + (unsigned long long)BBTOB(brec->bmv_block), + brec->bmv_oflags); +} + +static inline void +csp_dump_bmapx( + struct clearspace_req *req, + int fd, + unsigned int indent, + const char *tag) +{ + unsigned int nr; + int ret; + + trace_dumpfile(req, "DUMP BMAP OF DATA FORK %s", tag); + start_bmapx_query(req, 0, req->start, req->length); + nr = 0; + while ((ret = run_bmapx_query(req, fd)) > 0) { + struct getbmapx *brec; + + for_each_bmapx_row(req, brec) { + csp_dump_bmapx_row(req, nr++, brec); + if (nr > 10) + goto dump_cow; + } + } + +dump_cow: + end_bmapx_query(req); + trace_dumpfile(req, "DUMP BMAP OF COW FORK %s", tag); + start_bmapx_query(req, BMV_IF_COWFORK, req->start, req->length); + nr = 0; + while ((ret = run_bmapx_query(req, fd)) > 0) { + struct getbmapx *brec; + + for_each_bmapx_row(req, brec) { + csp_dump_bmapx_row(req, 
nr++, brec); + if (nr > 10) + goto dump_attr; + } + } + +dump_attr: + end_bmapx_query(req); + trace_dumpfile(req, "DUMP BMAP OF ATTR FORK %s", tag); + start_bmapx_query(req, BMV_IF_ATTRFORK, req->start, req->length); + nr = 0; + while ((ret = run_bmapx_query(req, fd)) > 0) { + struct getbmapx *brec; + + for_each_bmapx_row(req, brec) { + csp_dump_bmapx_row(req, nr++, brec); + if (nr > 10) + goto stop; + } + } + +stop: + end_bmapx_query(req); + trace_dumpfile(req, "DONE DUMPING %s", tag); +} + +/* Return the first bmapx for the given file range. */ +static int +bmapx_one( + struct clearspace_req *req, + int fd, + unsigned long long pos, + unsigned long long length, + struct getbmapx *brec) +{ + struct getbmapx bhead[2]; + int ret; + + memset(bhead, 0, sizeof(struct getbmapx) * 2); + bhead[0].bmv_offset = BTOBB(pos); + bhead[0].bmv_length = BTOBB(length); + bhead[0].bmv_count = 2; + bhead[0].bmv_iflags = BMV_IF_PREALLOC | BMV_IF_DELALLOC; + + ret = ioctl(fd, XFS_IOC_GETBMAPX, bhead); + if (ret) { + perror(_("simple bmapx query")); + return -1; + } + + if (bhead->bmv_entries > 0) { + memcpy(brec, &bhead[1], sizeof(struct getbmapx)); + return 0; + } + + memset(brec, 0, sizeof(struct getbmapx)); + brec->bmv_offset = pos; + brec->bmv_block = -1; /* hole */ + brec->bmv_length = length; + return 0; +} + +/* Constrain space map records. 
*/ +static void +__trim_fsmap( + uint64_t start, + uint64_t length, + struct fsmap *fsmap) +{ + unsigned long long delta, end; + bool need_off; + + need_off = !(fsmap->fmr_flags & (FMR_OF_EXTENT_MAP | + FMR_OF_SPECIAL_OWNER)); + + if (fsmap->fmr_physical < start) { + delta = start - fsmap->fmr_physical; + fsmap->fmr_physical = start; + fsmap->fmr_length -= delta; + if (need_off) + fsmap->fmr_offset += delta; + } + + end = fsmap->fmr_physical + fsmap->fmr_length; + if (end > start + length) { + delta = end - (start + length); + fsmap->fmr_length -= delta; + } +} + +static inline void +trim_target_fsmap(const struct clearspace_tgt *tgt, struct fsmap *fsmap) +{ + return __trim_fsmap(tgt->start, tgt->length, fsmap); +} + +static inline void +trim_request_fsmap(const struct clearspace_req *req, struct fsmap *fsmap) +{ + return __trim_fsmap(req->start, req->length, fsmap); +} + +/* Actual space clearing code */ + +/* + * Map all the free space in the region that we're clearing to the space + * catcher file. + */ +static int +csp_grab_free_space( + struct clearspace_req *req) +{ + struct xfs_map_freesp args = { + .offset = req->start, + .len = req->length, + }; + int ret; + + trace_grabfree(req, "start 0x%llx length 0x%llx", + (unsigned long long)req->start, + (unsigned long long)req->length); + + ret = ioctl(req->space_fd, XFS_IOC_MAP_FREESP, &args); + if (ret) { + perror(_("map free space to space capture file")); + return -1; + } + + return 0; +} + +/* + * Rank a refcount record. We prefer to tackle highly shared and longer + * extents first. + */ +static inline unsigned long long +csp_space_prio( + const struct xfs_fsop_geom *g, + const struct xfs_getfsrefs *p) +{ + unsigned long long blocks = p->fcr_length / g->blocksize; + unsigned long long ret = blocks * p->fcr_owners; + + if (ret < blocks || ret < p->fcr_owners) + return UINT64_MAX; + return ret; +} + +/* Make the current refcount record the clearing target if desirable. 
*/
+static void
+csp_adjust_target(
+	struct clearspace_req *req,
+	struct clearspace_tgt *target,
+	const struct xfs_getfsrefs *rec,
+	unsigned long long prio)
+{
+	if (prio < target->prio)
+		return;
+	if (prio == target->prio &&
+	    rec->fcr_length <= target->length)
+		return;
+
+	/* Ignore results that go beyond the end of what we wanted. */
+	if (rec->fcr_physical >= req->start + req->length)
+		return;
+
+	/* Ignore regions that we already tried to clear. */
+	if (bitmap_test(req->visited, rec->fcr_physical, rec->fcr_length))
+		return;
+
+	trace_target(req,
+ "set target, prio 0x%llx -> 0x%llx phys 0x%llx bytecount 0x%llx",
+		target->prio, prio,
+		(unsigned long long)rec->fcr_physical,
+		(unsigned long long)rec->fcr_length);
+
+	target->start = rec->fcr_physical;
+	target->length = rec->fcr_length;
+	target->owners = rec->fcr_owners;
+	target->prio = prio;
+}
+
+/*
+ * Decide if this refcount record maps to extents that are sufficiently
+ * interesting to target.
+ */
+static int
+csp_evaluate_refcount(
+	struct clearspace_req *req,
+	const struct xfs_getfsrefs *rrec,
+	struct clearspace_tgt *target)
+{
+	const struct xfs_fsop_geom *fsgeom = &req->xfd->fsgeom;
+	unsigned long long prio = csp_space_prio(fsgeom, rrec);
+	int ret;
+
+	if (rrec->fcr_device != req->dev)
+		return 0;
+
+	if (prio < target->prio)
+		return 0;
+
+	/*
+	 * XFS only supports sharing data blocks. If there's more than one
+	 * owner, we know that we can easily move the blocks.
+	 */
+	if (rrec->fcr_owners > 1) {
+		csp_adjust_target(req, target, rrec, prio);
+		return 0;
+	}
+
+	/*
+	 * Otherwise, this extent has a single owner. Walk the fsmap records
+	 * to figure out whether the space is movable.
+ */ + start_fsmap_query(req, rrec->fcr_device, rrec->fcr_physical, + rrec->fcr_length); + while ((ret = run_fsmap_query(req)) > 0) { + struct fsmap *mrec; + uint64_t next_phys = 0; + + for_each_fsmap_row(req, mrec) { + struct xfs_getfsrefs fake_rec = { }; + + trace_fsmap_rec(req, CSP_TRACE_TARGET, mrec); + + if (mrec->fmr_device != rrec->fcr_device) + continue; + if (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER) + continue; + if (csp_is_internal_owner(req, mrec->fmr_owner)) + continue; + + /* + * If the space has become shared since the fsrefs + * query, just skip this record. We might come back to + * it in a later iteration. + */ + if (mrec->fmr_physical < next_phys) + continue; + + /* Fake enough of a fsrefs to calculate the priority. */ + fake_rec.fcr_physical = mrec->fmr_physical; + fake_rec.fcr_length = mrec->fmr_length; + fake_rec.fcr_owners = 1; + prio = csp_space_prio(fsgeom, &fake_rec); + + /* Target unwritten extents first; they're cheap. */ + if (mrec->fmr_flags & FMR_OF_PREALLOC) + prio |= (1ULL << 63); + + csp_adjust_target(req, target, &fake_rec, prio); + + next_phys = mrec->fmr_physical + mrec->fmr_length; + } + } + end_fsmap_query(req); + + return ret; +} + +/* + * Given a range of storage to search, find the most appealing target for space + * clearing. If nothing suitable is found, the target will be zeroed. + */ +static int +csp_find_target( + struct clearspace_req *req, + struct clearspace_tgt *target) +{ + int ret; + + memset(target, 0, sizeof(struct clearspace_tgt)); + + start_fsrefs_query(req, req->dev, req->start, req->length); + while ((ret = run_fsrefs_query(req)) > 0) { + struct xfs_getfsrefs *rrec; + + for_each_fsref_row(req, rrec) { + trace_fsrefs_rec(req, CSP_TRACE_TARGET, rrec); + ret = csp_evaluate_refcount(req, rrec, target); + if (ret) { + end_fsrefs_query(req); + return ret; + } + } + } + end_fsrefs_query(req); + + if (target->length != 0) { + /* + * Mark this extent visited so that we won't try again this + * round. 
+		 */
+		trace_bitmap(req, "set filedata start 0x%llx length 0x%llx",
+				target->start, target->length);
+		ret = bitmap_set(req->visited, target->start, target->length);
+		if (ret) {
+			perror(_("marking file extent visited"));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/* Try to evacuate blocks by using online repair. */
+static int
+csp_evac_file_metadata(
+	struct clearspace_req *req,
+	struct clearspace_tgt *target,
+	const struct fsmap *mrec,
+	int fd,
+	const struct xfs_bulkstat *bulkstat)
+{
+	struct xfs_scrub_metadata scrub = {
+		.sm_type = XFS_SCRUB_TYPE_PROBE,
+		.sm_flags = XFS_SCRUB_IFLAG_REPAIR |
+			    XFS_SCRUB_IFLAG_FORCE_REBUILD,
+	};
+	struct xfs_fd *xfd = req->xfd;
+	int ret;
+
+	/* Argument order must match the format: pos, bytecount, phys. */
+	trace_xrebuild(req,
+ "ino 0x%llx pos 0x%llx bytecount 0x%llx phys 0x%llx flags 0x%llx",
+		(unsigned long long)mrec->fmr_owner,
+		(unsigned long long)mrec->fmr_offset,
+		(unsigned long long)mrec->fmr_length,
+		(unsigned long long)mrec->fmr_physical,
+		(unsigned long long)mrec->fmr_flags);
+
+	if (fd == -1) {
+		scrub.sm_ino = mrec->fmr_owner;
+		scrub.sm_gen = bulkstat->bs_gen;
+		fd = xfd->fd;
+	}
+
+	if (mrec->fmr_flags & FMR_OF_ATTR_FORK) {
+		if (mrec->fmr_flags & FMR_OF_EXTENT_MAP)
+			scrub.sm_type = XFS_SCRUB_TYPE_BMBTA;
+		else
+			scrub.sm_type = XFS_SCRUB_TYPE_XATTR;
+	} else if (mrec->fmr_flags & FMR_OF_EXTENT_MAP) {
+		scrub.sm_type = XFS_SCRUB_TYPE_BMBTD;
+	} else if (S_ISLNK(bulkstat->bs_mode)) {
+		scrub.sm_type = XFS_SCRUB_TYPE_SYMLINK;
+	} else if (S_ISDIR(bulkstat->bs_mode)) {
+		scrub.sm_type = XFS_SCRUB_TYPE_DIR;
+	}
+
+	if (scrub.sm_type == XFS_SCRUB_TYPE_PROBE)
+		return 0;
+
+	trace_xrebuild(req, "ino 0x%llx gen 0x%x type %u",
+		(unsigned long long)mrec->fmr_owner,
+		(unsigned int)bulkstat->bs_gen,
+		(unsigned int)scrub.sm_type);
+
+	ret = ioctl(fd, XFS_IOC_SCRUB_METADATA, &scrub);
+	if (ret) {
+		fprintf(stderr,
+ _("evacuating inode 0x%llx metadata type %u: %s\n"),
+			(unsigned long long)mrec->fmr_owner,
+			scrub.sm_type, strerror(errno));
+		return -1;
+	}
+
+
target->evacuated++; + return 0; +} + +/* + * Open an inode via handle. Returns a file descriptor, -2 if the file is + * gone, or -1 on error. + */ +static int +csp_open_by_handle( + struct clearspace_req *req, + int oflags, + uint64_t ino, + uint32_t gen) +{ + struct xfs_handle handle = { }; + struct xfs_fsop_handlereq hreq = { + .oflags = oflags | O_NOATIME | O_NOFOLLOW | + O_NOCTTY | O_LARGEFILE, + .ihandle = &handle, + .ihandlen = sizeof(handle), + }; + int ret; + + memcpy(&handle.ha_fsid, req->fshandle, sizeof(handle.ha_fsid)); + handle.ha_fid.fid_len = sizeof(xfs_fid_t) - + sizeof(handle.ha_fid.fid_len); + handle.ha_fid.fid_pad = 0; + handle.ha_fid.fid_ino = ino; + handle.ha_fid.fid_gen = gen; + + /* + * Since we extracted the fshandle from the open file instead of using + * path_to_fshandle, the fsid cache doesn't know about the fshandle. + * Construct the open by handle request manually. + */ + ret = ioctl(req->xfd->fd, XFS_IOC_OPEN_BY_HANDLE, &hreq); + if (ret < 0) { + if (errno == ENOENT || errno == EINVAL) + return -2; + + fprintf(stderr, _("open inode 0x%llx: %s\n"), + (unsigned long long)ino, + strerror(errno)); + return -1; + } + + return ret; +} + +/* + * Open a file for evacuation. Returns a positive errno on error; a fd in @fd + * if the caller is supposed to do something; or @fd == -1 if there's nothing + * further to do. + */ +static int +csp_evac_open( + struct clearspace_req *req, + struct clearspace_tgt *target, + const struct fsmap *mrec, + struct xfs_bulkstat *bulkstat, + int oflags, + int *fd) +{ + struct xfs_bulkstat __bs; + int target_fd; + int ret; + + *fd = -1; + + if (csp_is_internal_owner(req, mrec->fmr_owner) || + (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER)) + goto nothing_to_do; + + if (bulkstat == NULL) + bulkstat = &__bs; + + /* + * Snapshot this file so that we can perform a fresh-only exchange. + * For other types of files we just skip to the evacuation step. 
+ */ + ret = -xfrog_bulkstat_single(req->xfd, mrec->fmr_owner, 0, bulkstat); + if (ret) { + if (ret == ENOENT || ret == EINVAL) + goto nothing_to_do; + + fprintf(stderr, _("bulkstat inode 0x%llx: %s\n"), + (unsigned long long)mrec->fmr_owner, + strerror(ret)); + return ret; + } + + /* + * If we get stats for a different inode, the file may have been freed + * out from under us and there's nothing to do. + */ + if (bulkstat->bs_ino != mrec->fmr_owner) + goto nothing_to_do; + + /* + * We're only allowed to open regular files and directories via handle + * so jump to online rebuild for all other file types. + */ + if (!S_ISREG(bulkstat->bs_mode) && !S_ISDIR(bulkstat->bs_mode)) + return csp_evac_file_metadata(req, target, mrec, -1, + bulkstat); + + if (S_ISDIR(bulkstat->bs_mode)) + oflags = O_RDONLY; + + target_fd = csp_open_by_handle(req, oflags, mrec->fmr_owner, + bulkstat->bs_gen); + if (target_fd == -2) + goto nothing_to_do; + if (target_fd < 0) + return -target_fd; + + /* + * Exchange only works for regular file data blocks. If that isn't the + * case, our only recourse is online rebuild. + */ + if (S_ISDIR(bulkstat->bs_mode) || + (mrec->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP))) { + int ret2; + + ret = csp_evac_file_metadata(req, target, mrec, target_fd, + bulkstat); + ret2 = close(target_fd); + if (!ret && ret2) + ret = ret2; + return ret; + } + + *fd = target_fd; + return 0; + +nothing_to_do: + target->try_again = true; + return 0; +} + +/* Unshare the space in the work file that we're using for deduplication. 
*/ +static int +csp_unshare_workfile( + struct clearspace_req *req, + unsigned long long start, + unsigned long long length) +{ + int ret; + + trace_falloc(req, "funshare workfd pos 0x%llx bytecount 0x%llx", + start, length); + + ret = fallocate(req->work_fd, FALLOC_FL_UNSHARE_RANGE, start, length); + if (ret) { + perror(_("unsharing work file")); + return ret; + } + + ret = fsync(req->work_fd); + if (ret) { + perror(_("syncing work file")); + return ret; + } + + /* Make sure we didn't get any space within the clearing range. */ + start_bmapx_query(req, 0, start, length); + while ((ret = run_bmapx_query(req, req->work_fd)) > 0) { + struct getbmapx *brec; + + for_each_bmapx_row(req, brec) { + unsigned long long p, l; + + trace_bmapx_rec(req, CSP_TRACE_FALLOC, brec); + p = BBTOB(brec->bmv_block); + l = BBTOB(brec->bmv_length); + + if (p + l < req->start || p >= req->start + req->length) + continue; + + trace_prep(req, + "workfd has extent inside clearing range, phys 0x%llx fsbcount 0x%llx", + p, l); + end_bmapx_query(req); + return -1; + } + } + end_bmapx_query(req); + + return 0; +} + +/* Try to deduplicate every block in the fdr request, if we can. 
*/
+static int
+csp_evac_dedupe_loop(
+	struct clearspace_req *req,
+	struct clearspace_tgt *target,
+	unsigned long long ino,
+	int max_reqlen)
+{
+	struct file_dedupe_range *fdr = req->fdr;
+	struct file_dedupe_range_info *info = &fdr->info[0];
+	loff_t last_unshare_off = -1;
+	int ret;
+
+	while (fdr->src_length > 0) {
+		struct getbmapx brec;
+		bool same;
+		unsigned int old_reqlen = fdr->src_length;
+
+		if (max_reqlen && fdr->src_length > max_reqlen)
+			fdr->src_length = max_reqlen;
+
+		trace_dedupe(req, "ino 0x%llx pos 0x%llx bytecount 0x%llx",
+			ino,
+			(unsigned long long)info->dest_offset,
+			(unsigned long long)fdr->src_length);
+
+		ret = bmapx_one(req, req->work_fd, fdr->src_offset,
+				fdr->src_length, &brec);
+		if (ret)
+			return ret;
+
+		trace_dedupe(req, "workfd pos 0x%llx phys 0x%llx",
+			(unsigned long long)fdr->src_offset,
+			(unsigned long long)BBTOB(brec.bmv_block));
+
+		ret = deduperange(req->work_fd, fdr, &same);
+		if (ret == ENOSPC && last_unshare_off < fdr->src_offset) {
+			req->trace_indent++;
+			trace_dedupe(req, "funshare workfd at phys 0x%llx",
+				(unsigned long long)fdr->src_offset);
+			/*
+			 * If we ran out of space, it's possible that we have
+			 * reached the maximum sharing factor of the blocks in
+			 * the work file. Try unsharing the range of the work
+			 * file to get a singly-owned range and loop again.
+			 */
+			ret = csp_unshare_workfile(req, fdr->src_offset,
+					fdr->src_length);
+			req->trace_indent--;
+			if (ret)
+				return ret;
+
+			ret = fsync(req->work_fd);
+			if (ret) {
+				perror(_("sync after unshare work file"));
+				return ret;
+			}
+
+			last_unshare_off = fdr->src_offset;
+			fdr->src_length = old_reqlen;
+			continue;
+		}
+		if (ret == EINVAL) {
+			/*
+			 * If we can't dedupe the block, it's possible that
+			 * src_fd was punched or truncated out from under us.
+			 * Treat this the same way we would if the contents
+			 * didn't match.
+ */ + trace_dedupe(req, "cannot evac space, moving on", 0); + same = false; + ret = 0; + } + if (ret) { + fprintf(stderr, _("evacuating inode 0x%llx: %s\n"), + ino, strerror(ret)); + return ret; + } + + if (same) { + req->trace_indent++; + trace_dedupe(req, + "evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx", + ino, + (unsigned long long)info->dest_offset, + (unsigned long long)info->bytes_deduped); + req->trace_indent--; + + target->evacuated++; + } else { + req->trace_indent++; + trace_dedupe(req, + "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx", + ino, + (unsigned long long)info->dest_offset, + (unsigned long long)fdr->src_length); + req->trace_indent--; + + target->try_again = true; + + /* + * If we aren't single-stepping the deduplication, + * stop early so that the caller goes into single-step + * mode. + */ + if (!max_reqlen) { + fdr->src_length = old_reqlen; + return 0; + } + + /* Contents changed, move on to the next block. */ + info->bytes_deduped = fdr->src_length; + } + fdr->src_length = old_reqlen; + + fdr->src_offset += info->bytes_deduped; + info->dest_offset += info->bytes_deduped; + fdr->src_length -= info->bytes_deduped; + } + + return 0; +} + +/* + * Evacuate one fsmapping by using dedupe to remap data stored in the target + * range to a copy stored in the work file. + */ +static int +csp_evac_dedupe_fsmap( + struct clearspace_req *req, + struct clearspace_tgt *target, + const struct fsmap *mrec) +{ + struct file_dedupe_range *fdr = req->fdr; + struct file_dedupe_range_info *info = &fdr->info[0]; + bool can_single_step; + int target_fd; + int ret, ret2; + + if (mrec->fmr_device != req->dev) { + fprintf(stderr, _("wrong fsmap device in results.\n")); + return -1; + } + + ret = csp_evac_open(req, target, mrec, NULL, O_RDONLY, &target_fd); + if (ret || target_fd < 0) + return ret; + + /* + * Use dedupe to try to shift the target file's mappings to use the + * copy of the data that's in the work file. 
+	 */
+	fdr->src_offset = mrec->fmr_physical;
+	fdr->src_length = mrec->fmr_length;
+	fdr->dest_count = 1;
+	info->dest_fd = target_fd;
+	info->dest_offset = mrec->fmr_offset;
+
+	can_single_step = mrec->fmr_length > req->xfd->fsgeom.blocksize;
+
+	/* First we try to do the entire thing all at once. */
+	ret = csp_evac_dedupe_loop(req, target, mrec->fmr_owner, 0);
+	if (ret)
+		goto out_fd;
+
+	/* If there's any work left, try again one block at a time. */
+	if (can_single_step && fdr->src_length > 0) {
+		ret = csp_evac_dedupe_loop(req, target, mrec->fmr_owner,
+				req->xfd->fsgeom.blocksize);
+		if (ret)
+			goto out_fd;
+	}
+
+out_fd:
+	ret2 = close(target_fd);
+	if (!ret && ret2)
+		ret = ret2;
+	return ret;
+}
+
+/*
+ * Evacuate a prealloc fsmapping by using exchangerange to move the
+ * preallocation to the work file.
+ */
+static int
+csp_evac_exchange_prealloc(
+	struct clearspace_req *req,
+	struct clearspace_tgt *target,
+	const struct fsmap *mrec)
+{
+	struct xfs_bulkstat bulkstat;
+	struct xfs_commit_range xcr;
+	struct getbmapx brec;
+	int target_fd;
+	int ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	ret = csp_evac_open(req, target, mrec, &bulkstat, O_RDWR, &target_fd);
+	if (ret || target_fd < 0)
+		return ret;
+
+	ret = xfrog_commitrange_prep(&xcr, target_fd, mrec->fmr_offset,
+			req->work_fd, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		perror(_("preparing for commit"));
+		goto out_fd;
+	}
+
+	/*
+	 * Now that we've snapshotted target_fd, check that the mapping we're
+	 * after is still one large preallocation. If it isn't, then we tell
+	 * the caller to try again.
+	 */
+	ret = bmapx_one(req, target_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;	/* don't leak target_fd */
+
+	trace_exchange(req,
+ "targetfd pos 0x%llx offset 0x%llx phys 0x%llx len 0x%llx prealloc? 
%d",
+		(unsigned long long)mrec->fmr_offset,
+		(unsigned long long)BBTOB(brec.bmv_offset),
+		(unsigned long long)BBTOB(brec.bmv_block),
+		(unsigned long long)BBTOB(brec.bmv_length),
+		!!(brec.bmv_oflags & BMV_OF_PREALLOC));
+
+	/* bmv_oflags carries BMV_OF_* output flags, not BMV_IF_* input flags. */
+	if (BBTOB(brec.bmv_offset) > mrec->fmr_offset ||
+	    BBTOB(brec.bmv_offset + brec.bmv_length) <
+			mrec->fmr_offset + mrec->fmr_length ||
+	    !(brec.bmv_oflags & BMV_OF_PREALLOC)) {
+		req->trace_indent++;
+		trace_exchange(req,
+ "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx",
+			bulkstat.bs_ino,
+			(unsigned long long)mrec->fmr_offset,
+			(unsigned long long)mrec->fmr_length);
+		req->trace_indent--;
+		target->try_again = true;
+		goto out_fd;
+	}
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncating work file"));
+		goto out_fd;
+	}
+
+	/*
+	 * Create a preallocation in the work file to match the one in the
+	 * file that we're evacuating.
+	 */
+	ret = fallocate(req->work_fd, 0, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		/* fallocate returns -1 and reports the cause in errno. */
+		fprintf(stderr,
+ _("copying target file preallocation to work file: %s\n"),
+			strerror(errno));
+		goto out_fd;
+	}
+
+	ret = bmapx_one(req, req->work_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;	/* don't leak target_fd */
+
+	trace_exchange(req, "workfd pos 0x%llx off 0x%llx phys 0x%llx",
+		(unsigned long long)mrec->fmr_offset,
+		(unsigned long long)BBTOB(brec.bmv_offset),
+		(unsigned long long)BBTOB(brec.bmv_block));
+
+	/*
+	 * Exchange the mappings, with the freshness check enabled. This
+	 * should result in the target file being switched to new blocks unless
+	 * it has changed, in which case we bounce out and find a new target.
+ */ + ret = xfrog_commitrange(target_fd, &xcr, 0); + if (ret) { + if (ret == EBUSY) { + req->trace_indent++; + trace_exchange(req, + "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx", + bulkstat.bs_ino, + (unsigned long long)mrec->fmr_offset, + (unsigned long long)mrec->fmr_length); + req->trace_indent--; + target->try_again = true; + } else { + fprintf(stderr, + _("exchanging target and work file contents: %s\n"), + strerror(ret)); + } + goto out_fd; + } + + req->trace_indent++; + trace_exchange(req, + "evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx", + bulkstat.bs_ino, + (unsigned long long)mrec->fmr_offset, + (unsigned long long)mrec->fmr_length); + req->trace_indent--; + target->evacuated++; + +out_fd: + ret2 = close(target_fd); + if (!ret && ret2) + ret = ret2; + return ret; +} + +/* Use deduplication to remap data extents away from where we're clearing. */ +static int +csp_evac_dedupe( + struct clearspace_req *req, + struct clearspace_tgt *target) +{ + int ret; + + start_fsmap_query(req, req->dev, target->start, target->length); + while ((ret = run_fsmap_query(req)) > 0) { + struct fsmap *mrec; + + for_each_fsmap_row(req, mrec) { + trace_fsmap_rec(req, CSP_TRACE_DEDUPE, mrec); + trim_target_fsmap(target, mrec); + + req->trace_indent++; + if (mrec->fmr_flags & FMR_OF_PREALLOC) + ret = csp_evac_exchange_prealloc(req, target, + mrec); + else + ret = csp_evac_dedupe_fsmap(req, target, mrec); + req->trace_indent--; + if (ret) + goto out; + + ret = csp_grab_free_space(req); + if (ret) + goto out; + } + } + +out: + end_fsmap_query(req); + if (ret) + trace_dedupe(req, "ret %d", ret); + return ret; +} + +#define BUFFERCOPY_BUFSZ 65536 + +/* + * Use a memory buffer to copy part of src_fd to dst_fd, or return an errno. 
*/ +static int +csp_buffercopy( + struct clearspace_req *req, + int src_fd, + loff_t src_off, + int dst_fd, + loff_t dst_off, + loff_t len) +{ + int ret = 0; + + while (len > 0) { + size_t count = min(BUFFERCOPY_BUFSZ, len); + ssize_t bytes_read, bytes_written; + + bytes_read = pread(src_fd, req->buf, count, src_off); + if (bytes_read < 0) { + ret = errno; + break; + } + + bytes_written = pwrite(dst_fd, req->buf, bytes_read, dst_off); + if (bytes_written < 0) { + ret = errno; + break; + } + + src_off += bytes_written; + dst_off += bytes_written; + len -= bytes_written; + } + + return ret; +} + +/* + * Prepare the work file to assist in evacuating file data by copying the + * contents of the frozen space into the work file. + */ +static int +csp_prepare_for_dedupe( + struct clearspace_req *req) +{ + struct file_clone_range fcr; + struct stat statbuf; + loff_t datapos = 0; + loff_t length = 0; + int ret; + + ret = fstat(req->space_fd, &statbuf); + if (ret) { + perror(_("space capture file")); + return ret; + } + + ret = ftruncate(req->work_fd, 0); + if (ret) { + perror(_("truncate work file")); + return ret; + } + + ret = ftruncate(req->work_fd, statbuf.st_size); + if (ret) { + perror(_("reset work file")); + return ret; + } + + /* Make a working copy of the frozen file data. 
*/
+	start_spacefd_iter(req);
+	while ((ret = spacefd_data_iter(req, &datapos, &length)) > 0) {
+		trace_prep(req, "clone spacefd data 0x%llx length 0x%llx",
+				(long long)datapos, (long long)length);
+
+		fcr.src_fd = req->space_fd;
+		fcr.src_offset = datapos;
+		fcr.src_length = length;
+		fcr.dest_offset = datapos;
+
+		ret = clonerange(req->work_fd, &fcr);
+		if (ret == ENOSPC) {
+			req->trace_indent++;
+			trace_prep(req,
+ "falling back to buffered copy at 0x%llx",
+					(long long)datapos);
+			req->trace_indent--;
+			ret = csp_buffercopy(req, req->space_fd, datapos,
+					req->work_fd, datapos, length);
+		}
+		if (ret) {
+			perror(
+ _("copying space capture file contents to work file"));
+			return ret;
+		}
+	}
+	end_spacefd_iter(req);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Unshare the work file so that it contains an identical copy of the
+	 * contents of the space capture file but mapped to different blocks.
+	 * This is key to using dedupe to migrate file space away from the
+	 * requested region.
+	 */
+	req->trace_indent++;
+	ret = csp_unshare_workfile(req, req->start, req->length);
+	req->trace_indent--;
+	return ret;
+}
+
+/*
+ * Evacuate one fsmapping by copying its data into the work file and then
+ * exchanging the mappings so that the target file points at the new copy.
+ */
+static int
+csp_evac_exchange_fsmap(
+	struct clearspace_req *req,
+	struct clearspace_tgt *target,
+	const struct fsmap *mrec)
+{
+	struct xfs_bulkstat bulkstat;
+	struct xfs_commit_range xcr;
+	struct getbmapx brec;
+	int target_fd;
+	int ret, ret2;
+
+	if (mrec->fmr_device != req->dev) {
+		fprintf(stderr, _("wrong fsmap device in results.\n"));
+		return -1;
+	}
+
+	ret = csp_evac_open(req, target, mrec, &bulkstat, O_RDWR, &target_fd);
+	if (ret || target_fd < 0)
+		return ret;
+
+	ret = xfrog_commitrange_prep(&xcr, target_fd, mrec->fmr_offset,
+			req->work_fd, mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		perror(_("preparing for commit"));
+		goto out_fd;
+	}
+
+	ret = ftruncate(req->work_fd, 0);
+	if (ret) {
+		perror(_("truncating work file"));
+		goto out_fd;
+	}
+
+	/*
+	 * Copy the data from the original file to the work file. We assume
+	 * that the work file will end up with different data blocks and that
+	 * they're outside of the requested range.
+	 */
+	ret = csp_buffercopy(req, target_fd, mrec->fmr_offset, req->work_fd,
+			mrec->fmr_offset, mrec->fmr_length);
+	if (ret) {
+		fprintf(stderr, _("copying target file to work file: %s\n"),
+				strerror(ret));
+		goto out_fd;
+	}
+
+	ret = fsync(req->work_fd);
+	if (ret) {
+		perror(_("flush work file for fiexchange"));
+		goto out_fd;
+	}
+
+	ret = bmapx_one(req, req->work_fd, mrec->fmr_offset, mrec->fmr_length,
+			&brec);
+	if (ret)
+		goto out_fd;	/* don't leak target_fd */
+
+	trace_exchange(req, "workfd pos 0x%llx phys 0x%llx",
+		(unsigned long long)mrec->fmr_offset,
+		(unsigned long long)BBTOB(brec.bmv_block));
+
+	/*
+	 * Exchange the mappings, with the freshness check enabled. This
+	 * should result in the target file being switched to new blocks unless
+	 * it has changed, in which case we bounce out and find a new target.
+ */ + ret = xfrog_commitrange(target_fd, &xcr, 0); + if (ret) { + if (ret == EBUSY) { + req->trace_indent++; + trace_exchange(req, + "failed evac ino 0x%llx pos 0x%llx bytecount 0x%llx", + bulkstat.bs_ino, + (unsigned long long)mrec->fmr_offset, + (unsigned long long)mrec->fmr_length); + req->trace_indent--; + target->try_again = true; + } else { + fprintf(stderr, + _("exchanging target and work file contents: %s\n"), + strerror(ret)); + } + goto out_fd; + } + + req->trace_indent++; + trace_exchange(req, + "evacuated ino 0x%llx pos 0x%llx bytecount 0x%llx", + bulkstat.bs_ino, + (unsigned long long)mrec->fmr_offset, + (unsigned long long)mrec->fmr_length); + req->trace_indent--; + target->evacuated++; + +out_fd: + ret2 = close(target_fd); + if (!ret && ret2) + ret = ret2; + return ret; +} + +/* + * Try to evacuate all data blocks in the target region by copying the contents + * to a new file and exchanging the extents. + */ +static int +csp_evac_exchange( + struct clearspace_req *req, + struct clearspace_tgt *target) +{ + int ret; + + start_fsmap_query(req, req->dev, target->start, target->length); + while ((ret = run_fsmap_query(req)) > 0) { + struct fsmap *mrec; + + for_each_fsmap_row(req, mrec) { + trace_fsmap_rec(req, CSP_TRACE_EXCHANGE, mrec); + trim_target_fsmap(target, mrec); + + req->trace_indent++; + ret = csp_evac_exchange_fsmap(req, target, mrec); + req->trace_indent--; + if (ret) + goto out; + + ret = csp_grab_free_space(req); + if (ret) + goto out; + } + } +out: + end_fsmap_query(req); + if (ret) + trace_exchange(req, "ret %d", ret); + return ret; +} + +/* Try to evacuate blocks by using online repair to rebuild AG metadata. 
*/ +static int +csp_evac_ag_metadata( + struct clearspace_req *req, + struct clearspace_tgt *target, + uint32_t agno, + uint32_t mask) +{ + struct xfs_scrub_metadata scrub = { + .sm_agno = agno, + .sm_flags = XFS_SCRUB_IFLAG_REPAIR | + XFS_SCRUB_IFLAG_FORCE_REBUILD, + }; + unsigned int i; + int ret; + + trace_xrebuild(req, "agno 0x%x mask 0x%x", + (unsigned int)agno, + (unsigned int)mask); + + for (i = XFS_SCRUB_TYPE_AGFL; i <= XFS_SCRUB_TYPE_REFCNTBT; i++) { + if (!(mask & (1U << i))) + continue; + + scrub.sm_type = i; + + req->trace_indent++; + trace_xrebuild(req, "agno %u type %u", + (unsigned int)agno, + (unsigned int)scrub.sm_type); + req->trace_indent--; + + ret = ioctl(req->xfd->fd, XFS_IOC_SCRUB_METADATA, &scrub); + if (ret) { + if (errno == ENOENT || errno == ENOSPC) + continue; + fprintf(stderr, _("rebuilding ag %u type %u: %s\n"), + (unsigned int)agno, scrub.sm_type, + strerror(errno)); + return -1; + } + + target->evacuated++; + + ret = csp_grab_free_space(req); + if (ret) + return ret; + } + + return 0; +} + +/* Compute a scrub mask for a fsmap special owner. */ +static uint32_t +fsmap_owner_to_scrub_mask(__u64 owner) +{ + switch (owner) { + case XFS_FMR_OWN_FREE: + case XFS_FMR_OWN_UNKNOWN: + case XFS_FMR_OWN_FS: + case XFS_FMR_OWN_LOG: + /* can't move these */ + return 0; + case XFS_FMR_OWN_AG: + return (1U << XFS_SCRUB_TYPE_BNOBT) | + (1U << XFS_SCRUB_TYPE_CNTBT) | + (1U << XFS_SCRUB_TYPE_AGFL) | + (1U << XFS_SCRUB_TYPE_RMAPBT); + case XFS_FMR_OWN_INOBT: + return (1U << XFS_SCRUB_TYPE_INOBT) | + (1U << XFS_SCRUB_TYPE_FINOBT); + case XFS_FMR_OWN_REFC: + return (1U << XFS_SCRUB_TYPE_REFCNTBT); + case XFS_FMR_OWN_INODES: + case XFS_FMR_OWN_COW: + /* don't know how to get rid of these */ + return 0; + case XFS_FMR_OWN_DEFECTIVE: + /* good, get rid of it */ + return 0; + default: + return 0; + } +} + +/* Try to clear all per-AG metadata from the requested range. 
*/ +static int +csp_evac_fs_metadata( + struct clearspace_req *req, + struct clearspace_tgt *target, + bool *cleared_anything) +{ + uint32_t curr_agno = -1U; + uint32_t curr_mask = 0; + int ret = 0; + + if (req->realtime) + return 0; + + start_fsmap_query(req, req->dev, target->start, target->length); + while ((ret = run_fsmap_query(req)) > 0) { + struct fsmap *mrec; + + for_each_fsmap_row(req, mrec) { + uint64_t daddr; + uint32_t agno; + uint32_t mask; + + if (mrec->fmr_device != req->dev) + continue; + if (!(mrec->fmr_flags & FMR_OF_SPECIAL_OWNER)) + continue; + + /* Ignore regions that we already tried to clear. */ + if (bitmap_test(req->visited, mrec->fmr_physical, + mrec->fmr_length)) + continue; + + mask = fsmap_owner_to_scrub_mask(mrec->fmr_owner); + if (!mask) + continue; + + trace_fsmap_rec(req, CSP_TRACE_XREBUILD, mrec); + + daddr = BTOBB(mrec->fmr_physical); + agno = cvt_daddr_to_agno(req->xfd, daddr); + + trace_xrebuild(req, + "agno 0x%x -> 0x%x mask 0x%x owner %lld", + curr_agno, agno, curr_mask, + (unsigned long long)mrec->fmr_owner); + + if (curr_agno == -1U) { + curr_agno = agno; + } else if (curr_agno != agno) { + ret = csp_evac_ag_metadata(req, target, + curr_agno, curr_mask); + if (ret) + goto out; + + *cleared_anything = true; + curr_agno = agno; + curr_mask = 0; + } + + /* Put this on the list and try to clear it once. 
*/ + curr_mask |= mask; + ret = bitmap_set(req->visited, mrec->fmr_physical, + mrec->fmr_length); + if (ret) { + perror(_("marking metadata extent visited")); + goto out; + } + } + } + + if (curr_agno != -1U && curr_mask != 0) { + ret = csp_evac_ag_metadata(req, target, curr_agno, curr_mask); + if (ret) + goto out; + *cleared_anything = true; + } + + if (*cleared_anything) + trace_bitmap(req, "set metadata start 0x%llx length 0x%llx", + target->start, target->length); + +out: + end_fsmap_query(req); + if (ret) + trace_xrebuild(req, "ret %d", ret); + return ret; +} + +/* + * Check that at least the start of the mapping was frozen into the work file + * at the correct offset. Set @len to the number of bytes that were frozen. + * Returns -1 for error, zero if written extents are waiting to be mapped into + * the space capture file, or 1 if there's nothing to transfer to the space + * capture file. + */ +enum freeze_outcome { + FREEZE_FAILED = -1, + FREEZE_DONE, + FREEZE_SKIP, +}; + +static enum freeze_outcome +csp_freeze_check_outcome( + struct clearspace_req *req, + const struct fsmap *mrec, + unsigned long long *len) +{ + struct getbmapx brec; + int ret; + + *len = 0; + + ret = bmapx_one(req, req->work_fd, 0, mrec->fmr_length, &brec); + if (ret) + return FREEZE_FAILED; + + trace_freeze(req, + "check if workfd pos 0x0 phys 0x%llx len 0x%llx maps to phys 0x%llx len 0x%llx", + (unsigned long long)mrec->fmr_physical, + (unsigned long long)mrec->fmr_length, + (unsigned long long)BBTOB(brec.bmv_block), + (unsigned long long)BBTOB(brec.bmv_length)); + + /* freeze of an unwritten extent punches a hole in the work file. */ + if ((mrec->fmr_flags & FMR_OF_PREALLOC) && brec.bmv_block == -1) { + *len = min(mrec->fmr_length, BBTOB(brec.bmv_length)); + return FREEZE_SKIP; + } + + /* + * freeze of a written extent must result in the same physical space + * being mapped into the work file. 
+ */ + if (!(mrec->fmr_flags & FMR_OF_PREALLOC) && + BBTOB(brec.bmv_block) == mrec->fmr_physical) { + *len = min(mrec->fmr_length, BBTOB(brec.bmv_length)); + return FREEZE_DONE; + } + + /* + * We didn't find what we were looking for, which implies that the + * mapping changed out from under us. Punch out everything that could + * have been mapped into the work file. Set @len to zero and return so + * that we try again with the next mapping. + */ + trace_falloc(req, "reset workfd isize 0x0", 0); + + ret = ftruncate(req->work_fd, 0); + if (ret) { + perror(_("resetting work file after failed freeze")); + return FREEZE_FAILED; + } + + return FREEZE_SKIP; +} + +/* + * Open a file to try to freeze whatever data is in the requested range. + * + * Returns nonzero on error. Returns zero and a file descriptor in @fd if the + * caller is supposed to do something; or returns zero and @fd == -1 if there's + * nothing to freeze. + */ +static int +csp_freeze_open( + struct clearspace_req *req, + const struct fsmap *mrec, + int *fd) +{ + struct xfs_bulkstat bulkstat; + int oflags = O_RDWR; + int target_fd; + int ret; + + *fd = -1; + + ret = -xfrog_bulkstat_single(req->xfd, mrec->fmr_owner, 0, &bulkstat); + if (ret) { + if (ret == ENOENT || ret == EINVAL) + return 0; + + fprintf(stderr, _("bulkstat inode 0x%llx: %s\n"), + (unsigned long long)mrec->fmr_owner, + strerror(ret)); + return ret; + } + + /* + * If we get stats for a different inode, the file may have been freed + * out from under us and there's nothing to do. + */ + if (bulkstat.bs_ino != mrec->fmr_owner) + return 0; + + /* Skip anything we can't freeze. 
*/ + if (!S_ISREG(bulkstat.bs_mode) && !S_ISDIR(bulkstat.bs_mode)) + return 0; + + if (S_ISDIR(bulkstat.bs_mode)) + oflags = O_RDONLY; + + target_fd = csp_open_by_handle(req, oflags, mrec->fmr_owner, + bulkstat.bs_gen); + if (target_fd == -2) + return 0; + if (target_fd < 0) + return target_fd; + + /* + * Skip mappings for directories, xattr data, and block mapping btree + * blocks. We still have to close the file though. + */ + if (S_ISDIR(bulkstat.bs_mode) || + (mrec->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP))) { + return close(target_fd); + } + + *fd = target_fd; + return 0; +} + +static inline uint64_t rounddown_64(uint64_t x, uint64_t y) +{ + return (x / y) * y; +} + +/* + * Deal with a frozen extent containing a partially written EOF block. Either + * we use funshare to get src_fd to release the block, or we reduce the length + * of the frozen extent by one block. + */ +static int +csp_freeze_unaligned_eofblock( + struct clearspace_req *req, + int src_fd, + const struct fsmap *mrec, + unsigned long long *frozen_len) +{ + struct getbmapx brec; + struct stat statbuf; + loff_t work_offset, length; + int ret; + + ret = fstat(req->work_fd, &statbuf); + if (ret) { + perror(_("statting work file")); + return ret; + } + + /* + * The frozen extent is less than the size of the work file, which + * means that we're already block aligned. + */ + if (*frozen_len <= statbuf.st_size) + return 0; + + /* The frozen extent does not contain a partially written EOF block. */ + if (statbuf.st_size % statbuf.st_blksize == 0) + return 0; + + /* + * Unshare what we think is a partially written EOF block of the + * original file, to try to force it to release that block. 
+ */ + work_offset = rounddown_64(statbuf.st_size, statbuf.st_blksize); + length = statbuf.st_size - work_offset; + + trace_freeze(req, + "unaligned eofblock 0x%llx work_size 0x%llx blksize 0x%x work_offset 0x%llx work_length 0x%llx", + *frozen_len, statbuf.st_size, statbuf.st_blksize, + work_offset, length); + + ret = fallocate(src_fd, FALLOC_FL_UNSHARE_RANGE, + mrec->fmr_offset + work_offset, length); + if (ret) { + perror(_("unsharing original file")); + return ret; + } + + ret = fsync(src_fd); + if (ret) { + perror(_("flushing original file")); + return ret; + } + + ret = bmapx_one(req, req->work_fd, work_offset, length, &brec); + if (ret) + return ret; + + if (BBTOB(brec.bmv_block) != mrec->fmr_physical + work_offset) { + fprintf(stderr, + _("work file offset 0x%llx maps to phys 0x%llx, expected 0x%llx\n"), + (unsigned long long)work_offset, + (unsigned long long)BBTOB(brec.bmv_block), + (unsigned long long)(mrec->fmr_physical + work_offset)); + return -1; + } + + /* + * If the block is still shared, there must be other owners of this + * block. Round down the frozen length and we'll come back to it + * eventually. + */ + if (brec.bmv_oflags & BMV_OF_SHARED) { + *frozen_len = work_offset; + return 0; + } + + /* + * Not shared anymore, so increase the size of the file to the next + * block boundary so that we can reflink it into the space capture + * file. + */ + ret = ftruncate(req->work_fd, + BBTOB(brec.bmv_length) + BBTOB(brec.bmv_offset)); + if (ret) { + perror(_("expanding work file")); + return ret; + } + + /* Double-check that we didn't lose the block. 
*/ + ret = bmapx_one(req, req->work_fd, work_offset, length, &brec); + if (ret) + return ret; + + if (BBTOB(brec.bmv_block) != mrec->fmr_physical + work_offset) { + fprintf(stderr, + _("work file offset 0x%llx maps to phys 0x%llx, should be 0x%llx\n"), + (unsigned long long)work_offset, + (unsigned long long)BBTOB(brec.bmv_block), + (unsigned long long)(mrec->fmr_physical + work_offset)); + return -1; + } + + return 0; +} + +/* + * Given a fsmap, try to reflink the physical space into the space capture + * file. + */ +static int +csp_freeze_req_fsmap( + struct clearspace_req *req, + unsigned long long *cursor, + const struct fsmap *mrec) +{ + struct fsmap short_mrec; + struct file_clone_range fcr = { }; + unsigned long long frozen_len; + enum freeze_outcome outcome; + int src_fd; + int ret, ret2; + + if (mrec->fmr_device != req->dev) { + fprintf(stderr, _("wrong fsmap device in results.\n")); + return -1; + } + + /* Ignore mappings for our secret files. */ + if (csp_is_internal_owner(req, mrec->fmr_owner)) + return 0; + + /* Ignore mappings before the cursor. */ + if (mrec->fmr_physical + mrec->fmr_length < *cursor) + return 0; + + /* Jump past mappings for metadata. */ + if (mrec->fmr_flags & FMR_OF_SPECIAL_OWNER) + goto skip; + + /* + * Open this file so that we can try to freeze its data blocks. + * For other types of files we just skip to the evacuation step. + */ + ret = csp_freeze_open(req, mrec, &src_fd); + if (ret) + return ret; + if (src_fd < 0) + goto skip; + + /* + * If the cursor is in the middle of this mapping, increase the start + * of the mapping to start at the cursor. 
+ */ + if (mrec->fmr_physical < *cursor) { + unsigned long long delta = *cursor - mrec->fmr_physical; + + short_mrec = *mrec; + short_mrec.fmr_physical = *cursor; + short_mrec.fmr_offset += delta; + short_mrec.fmr_length -= delta; + + mrec = &short_mrec; + } + + req->trace_indent++; + if (mrec->fmr_length == 0) { + trace_freeze(req, "skipping zero-length freeze", 0); + goto out_fd; + } + + /* + * Reflink the mapping from the source file into the empty work file so + * that a write will be written elsewhere. The only way to reflink a + * partially written EOF block is if the kernel can reset the work file + * size so that the post-EOF part of the block remains post-EOF. If we + * can't do that, we're sunk. If the mapping is unwritten, we'll leave + * a hole in the work file. + */ + ret = ftruncate(req->work_fd, 0); + if (ret) { + perror(_("truncating work file for freeze")); + goto out_fd; + } + + fcr.src_fd = src_fd; + fcr.src_offset = mrec->fmr_offset; + fcr.src_length = mrec->fmr_length; + fcr.dest_offset = 0; + + trace_freeze(req, + "reflink ino 0x%llx offset 0x%llx bytecount 0x%llx into workfd", + (unsigned long long)mrec->fmr_owner, + (unsigned long long)fcr.src_offset, + (unsigned long long)fcr.src_length); + + ret = clonerange(req->work_fd, &fcr); + if (ret == EINVAL) { + /* + * If that didn't work, try reflinking to EOF and picking out + * whatever pieces we want. + */ + fcr.src_length = 0; + + trace_freeze(req, + "reflink ino 0x%llx offset 0x%llx to EOF into workfd", + (unsigned long long)mrec->fmr_owner, + (unsigned long long)fcr.src_offset); + + ret = clonerange(req->work_fd, &fcr); + } + if (ret == EINVAL) { + /* + * If we still can't get the block, it's possible that src_fd + * was punched or truncated out from under us, so we just move + * on to the next fsmap. 
+ */ + trace_freeze(req, "cannot freeze space, moving on", 0); + ret = 0; + goto out_fd; + } + if (ret) { + fprintf(stderr, _("freezing space to work file: %s\n"), + strerror(ret)); + goto out_fd; + } + + req->trace_indent++; + outcome = csp_freeze_check_outcome(req, mrec, &frozen_len); + req->trace_indent--; + switch (outcome) { + case FREEZE_FAILED: + ret = -1; + goto out_fd; + case FREEZE_SKIP: + *cursor += frozen_len; + goto out_fd; + case FREEZE_DONE: + break; + } + + /* + * If we tried reflinking to EOF to capture a partially written EOF + * block in the work file, we need to unshare the end of the source + * file before we try to reflink the frozen space into the space + * capture file. + */ + if (fcr.src_length == 0) { + ret = csp_freeze_unaligned_eofblock(req, src_fd, mrec, + &frozen_len); + if (ret) + goto out_fd; + } + + /* + * We've frozen the mapping by reflinking it into the work file and + * confirmed that the work file has the space we wanted. Now we need + * to map the same extent into the space capture file. If reflink + * fails because we're out of space, fall back to EXCHANGE_RANGE. The + * end goal is to populate the space capture file; we don't care about + * the contents of the work file. 
+ */ + fcr.src_fd = req->work_fd; + fcr.src_offset = 0; + fcr.dest_offset = mrec->fmr_physical; + fcr.src_length = frozen_len; + + trace_freeze(req, "reflink phys 0x%llx len 0x%llx to spacefd", + (unsigned long long)mrec->fmr_physical, + (unsigned long long)mrec->fmr_length); + + ret = clonerange(req->space_fd, &fcr); + if (ret == ENOSPC) { + struct xfs_exchange_range fxr; + + xfrog_exchangerange_prep(&fxr, mrec->fmr_physical, req->work_fd, + mrec->fmr_physical, frozen_len); + ret = xfrog_exchangerange(req->space_fd, &fxr, 0); + } + if (ret) { + fprintf(stderr, _("freezing space to space capture file: %s\n"), + strerror(ret)); + goto out_fd; + } + + *cursor += frozen_len; +out_fd: + ret2 = close(src_fd); + if (!ret && ret2) + ret = ret2; + req->trace_indent--; + if (ret) + trace_freeze(req, "ret %d", ret); + return ret; +skip: + *cursor += mrec->fmr_length; + return 0; +} + +/* + * Try to freeze all the space in the requested range against overwrites. + * + * For each file data fsmap within each hole in the part of the space capture + * file corresponding to the requested range, try to reflink the space into the + * space capture file so that any subsequent writes to the original owner are + * CoW and nobody else can allocate the space. If we cannot use reflink to + * freeze all the space, we cannot proceed with the clearing. 
+ */ +static int +csp_freeze_req_range( + struct clearspace_req *req) +{ + unsigned long long cursor = req->start; + loff_t holepos = 0; + loff_t length = 0; + int ret; + + ret = ftruncate(req->space_fd, req->start + req->length); + if (ret) { + perror(_("setting up space capture file")); + return ret; + } + + if (!req->use_reflink) + return 0; + + start_spacefd_iter(req); + while ((ret = spacefd_hole_iter(req, &holepos, &length)) > 0) { + trace_freeze(req, "spacefd hole 0x%llx length 0x%llx", + (long long)holepos, (long long)length); + + start_fsmap_query(req, req->dev, holepos, length); + while ((ret = run_fsmap_query(req)) > 0) { + struct fsmap *mrec; + + for_each_fsmap_row(req, mrec) { + trace_fsmap_rec(req, CSP_TRACE_FREEZE, mrec); + trim_request_fsmap(req, mrec); + ret = csp_freeze_req_fsmap(req, &cursor, mrec); + if (ret) { + end_fsmap_query(req); + goto out; + } + } + } + end_fsmap_query(req); + } +out: + end_spacefd_iter(req); + return ret; +} + +/* + * Dump all speculative preallocations, COW staging blocks, and inactive inodes + * to try to free up as much space as we can. + */ +static int +csp_collect_garbage( + struct clearspace_req *req) +{ + struct xfs_fs_eofblocks eofb = { + .eof_version = XFS_EOFBLOCKS_VERSION, + .eof_flags = XFS_EOF_FLAGS_SYNC, + }; + int ret; + + ret = ioctl(req->xfd->fd, XFS_IOC_FREE_EOFBLOCKS, &eofb); + if (ret) { + perror(_("xfs garbage collector")); + return -1; + } + + return 0; +} + +static int +csp_prepare( + struct clearspace_req *req) +{ + blkcnt_t old_blocks = 0; + int ret; + + /* + * Empty out CoW forks and speculative post-EOF preallocations before + * starting the clearing process. This may be somewhat overkill. + */ + ret = syncfs(req->xfd->fd); + if (ret) { + perror(_("syncing filesystem")); + return ret; + } + + ret = csp_collect_garbage(req); + if (ret) + return ret; + + /* + * Set up the space capture file as a large sparse file mirroring the + * physical space that we want to defragment. 
+ */ + ret = ftruncate(req->space_fd, req->start + req->length); + if (ret) { + perror(_("setting up space capture file")); + return ret; + } + + /* + * If we don't have reflink, just grab the free space and move on to + * copying and exchanging file contents. + */ + if (!req->use_reflink) + return csp_grab_free_space(req); + + /* + * Try to freeze as much of the requested range as we can, grab the + * free space in that range, and run freeze again to pick up anything + * that may have been allocated while all that was going on. + */ + do { + struct stat statbuf; + + ret = csp_freeze_req_range(req); + if (ret) + return ret; + + ret = csp_grab_free_space(req); + if (ret) + return ret; + + ret = fstat(req->space_fd, &statbuf); + if (ret) + return ret; + + if (old_blocks == statbuf.st_blocks) + break; + old_blocks = statbuf.st_blocks; + } while (1); + + /* + * If reflink is enabled, our strategy is to dedupe to free blocks in + * the area that we're clearing without making any user-visible changes + * to the file contents. For all the written file data blocks in the + * area we're clearing, make an identical copy in the work file that is + * backed by blocks that are not in the clearing area. + */ + return csp_prepare_for_dedupe(req); +} + +/* Set up the target to clear all metadata from the given range. */ +static inline void +csp_target_metadata( + struct clearspace_req *req, + struct clearspace_tgt *target) +{ + target->start = req->start; + target->length = req->length; + target->prio = 0; + target->evacuated = 0; + target->owners = 0; + target->try_again = false; +} + +/* + * Loop through the space to find the most appealing part of the device to + * clear, then try to evacuate everything within. 
+ */ +int +clearspace_run( + struct clearspace_req *req) +{ + struct clearspace_tgt target; + const struct csp_errstr *es; + bool cleared_anything; + int ret; + + if (req->trace_mask) { + fprintf(stderr, "debug flags 0x%x:", req->trace_mask); + for (es = errtags; es->tag; es++) { + if (req->trace_mask & es->mask) + fprintf(stderr, " %s", es->tag); + } + fprintf(stderr, "\n"); + } + + req->trace_indent = 0; + trace_status(req, + _("Clearing dev %u:%u physical 0x%llx bytecount 0x%llx."), + major(req->dev), minor(req->dev), + req->start, req->length); + + if (req->trace_mask & ~CSP_TRACE_STATUS) + trace_status(req, "reflink? %d evac_metadata? %d", + req->use_reflink, req->can_evac_metadata); + + ret = bitmap_alloc(&req->visited); + if (ret) { + perror(_("allocating visited bitmap")); + return ret; + } + + ret = csp_prepare(req); + if (ret) + goto out_bitmap; + + /* Evacuate as many file blocks as we can. */ + do { + ret = csp_find_target(req, &target); + if (ret) + goto out_bitmap; + + if (target.length == 0) + break; + + trace_target(req, + "phys 0x%llx len 0x%llx owners 0x%llx prio 0x%llx", + target.start, target.length, + target.owners, target.prio); + + if (req->use_reflink) + ret = csp_evac_dedupe(req, &target); + else + ret = csp_evac_exchange(req, &target); + if (ret) + goto out_bitmap; + + trace_status(req, _("Evacuated %llu file items."), + target.evacuated); + } while (target.evacuated > 0 || target.try_again); + + if (!req->can_evac_metadata) + goto out_bitmap; + + /* Evacuate as many AG metadata blocks as we can. */ + do { + csp_target_metadata(req, &target); + + ret = csp_evac_fs_metadata(req, &target, &cleared_anything); + if (ret) + goto out_bitmap; + + trace_status(req, "evacuated %llu metadata items", + target.evacuated); + } while (target.evacuated > 0 && cleared_anything); + +out_bitmap: + bitmap_free(&req->visited); + return ret; +} + +/* How much space did we actually clear? 
*/ +int +clearspace_efficacy( + struct clearspace_req *req, + unsigned long long *cleared_bytes) +{ + unsigned long long cleared = 0; + int ret; + + start_bmapx_query(req, 0, req->start, req->length); + while ((ret = run_bmapx_query(req, req->space_fd)) > 0) { + struct getbmapx *brec; + + for_each_bmapx_row(req, brec) { + if (brec->bmv_block == -1) + continue; + + trace_bmapx_rec(req, CSP_TRACE_EFFICACY, brec); + + if (brec->bmv_offset != brec->bmv_block) { + fprintf(stderr, + _("space capture file mapped incorrectly\n")); + end_bmapx_query(req); + return -1; + } + cleared += BBTOB(brec->bmv_length); + } + } + end_bmapx_query(req); + if (ret) + return ret; + + *cleared_bytes = cleared; + return 0; +} + +/* + * Create a temporary file on the same volume (data/rt) that we're trying to + * clear free space on. + */ +static int +csp_open_tempfile( + struct clearspace_req *req, + struct stat *statbuf) +{ + struct fsxattr fsx; + int fd, ret; + + fd = openat(req->xfd->fd, ".", O_TMPFILE | O_RDWR | O_EXCL, 0600); + if (fd < 0) { + perror(_("opening temp file")); + return -1; + } + + /* Make sure we got the same filesystem as the open file. */ + ret = fstat(fd, statbuf); + if (ret) { + perror(_("stat temp file")); + goto fail; + } + if (statbuf->st_dev != req->statbuf.st_dev) { + fprintf(stderr, + _("Cannot create temp file on same fs as open file.\n")); + goto fail; + } + + /* Ensure this file targets the correct data/rt device. 
*/ + ret = ioctl(fd, FS_IOC_FSGETXATTR, &fsx); + if (ret) { + perror(_("FSGETXATTR temp file")); + goto fail; + } + + if (!!(fsx.fsx_xflags & FS_XFLAG_REALTIME) != req->realtime) { + if (req->realtime) + fsx.fsx_xflags |= FS_XFLAG_REALTIME; + else + fsx.fsx_xflags &= ~FS_XFLAG_REALTIME; + + ret = ioctl(fd, FS_IOC_FSSETXATTR, &fsx); + if (ret) { + perror(_("FSSETXATTR temp file")); + goto fail; + } + } + + trace_setup(req, "opening temp inode 0x%llx as fd %d", + (unsigned long long)statbuf->st_ino, fd); + + return fd; +fail: + close(fd); + return -1; +} + +/* Extract fshandle from the open file. */ +static int +csp_install_file( + struct clearspace_req *req, + struct xfs_fd *xfd) +{ + void *handle; + size_t handle_sz; + int ret; + + ret = fstat(xfd->fd, &req->statbuf); + if (ret) + return ret; + + if (!S_ISDIR(req->statbuf.st_mode)) { + errno = ENOTDIR; + return -1; + } + + ret = fd_to_handle(xfd->fd, &handle, &handle_sz); + if (ret) + return ret; + + ret = handle_to_fshandle(handle, handle_sz, &req->fshandle, + &req->fshandle_sz); + if (ret) + return ret; + + free_handle(handle, handle_sz); + req->xfd = xfd; + return 0; +} + +/* Decide if we can use online repair to evacuate metadata. */ +static void +csp_detect_evac_metadata( + struct clearspace_req *req) +{ + struct xfs_scrub_metadata scrub = { + .sm_type = XFS_SCRUB_TYPE_PROBE, + .sm_flags = XFS_SCRUB_IFLAG_REPAIR | + XFS_SCRUB_IFLAG_FORCE_REBUILD, + }; + int ret; + + ret = ioctl(req->xfd->fd, XFS_IOC_SCRUB_METADATA, &scrub); + if (ret) + return; + + /* + * We'll try to evacuate metadata if the probe works. This doesn't + * guarantee success; it merely means that the kernel call exists. + */ + req->can_evac_metadata = true; +} + +/* Detect XFS_IOC_MAP_FREESP; this is critical for grabbing free space! 
*/ +static int +csp_detect_map_freesp( + struct clearspace_req *req) +{ + struct xfs_map_freesp args = { + .offset = 0, + .len = 1, + }; + int ret; + + /* + * A single-byte fallocate request will succeed without doing anything + * to the filesystem. + */ + ret = ioctl(req->work_fd, XFS_IOC_MAP_FREESP, &args); + if (!ret) + return 0; + + if (errno == EOPNOTSUPP) { + fprintf(stderr, + _("Filesystem does not support XFS_IOC_MAP_FREESP\n")); + return -1; + } + + perror(_("test XFS_IOC_MAP_FREESP on work file")); + return -1; +} + +/* + * Assemble operation information to clear the physical space in part of a + * filesystem. + */ +int +clearspace_init( + struct clearspace_req **reqp, + const struct clearspace_init *attrs) +{ + struct clearspace_req *req; + int ret; + + req = calloc(1, sizeof(struct clearspace_req)); + if (!req) { + perror(_("malloc clearspace")); + return -1; + } + + req->work_fd = -1; + req->space_fd = -1; + req->trace_mask = attrs->trace_mask; + + req->realtime = attrs->is_realtime; + req->dev = attrs->dev; + req->start = attrs->start; + req->length = attrs->length; + + ret = csp_install_file(req, attrs->xfd); + if (ret) { + perror(attrs->fname); + goto fail; + } + + csp_detect_evac_metadata(req); + + req->work_fd = csp_open_tempfile(req, &req->temp_statbuf); + if (req->work_fd < 0) + goto fail; + + req->space_fd = csp_open_tempfile(req, &req->space_statbuf); + if (req->space_fd < 0) + goto fail; + + ret = csp_detect_map_freesp(req); + if (ret) + goto fail; + + req->mhead = calloc(1, fsmap_sizeof(QUERY_BATCH_SIZE)); + if (!req->mhead) { + perror(_("opening fs mapping query")); + goto fail; + } + + req->rhead = calloc(1, xfs_getfsrefs_sizeof(QUERY_BATCH_SIZE)); + if (!req->rhead) { + perror(_("opening refcount query")); + goto fail; + } + + req->bhead = calloc(QUERY_BATCH_SIZE + 1, sizeof(struct getbmapx)); + if (!req->bhead) { + perror(_("opening file mapping query")); + goto fail; + } + + req->buf = malloc(BUFFERCOPY_BUFSZ); + if (!req->buf) { + 
perror(_("allocating file copy buffer")); + goto fail; + } + + req->fdr = calloc(1, sizeof(struct file_dedupe_range) + + sizeof(struct file_dedupe_range_info)); + if (!req->fdr) { + perror(_("allocating dedupe control buffer")); + goto fail; + } + + req->use_reflink = req->xfd->fsgeom.flags & XFS_FSOP_GEOM_FLAGS_REFLINK; + + *reqp = req; + return 0; +fail: + clearspace_free(&req); + return -1; +} + +#ifdef CLEARSPACE_DEBUG +static void +csp_dump_fd( + struct clearspace_req *req, + int fd, + const char *tag) +{ + struct stat sb; + struct getbmapx *brec; + unsigned long i = 0; + int ret; + + ret = fstat(fd, &sb); + if (ret) { + perror("fstat"); + return; + } + + printf("CLEARFREE DUMP ino 0x%llx: %s\n", + (unsigned long long)sb.st_ino, tag); + start_bmapx_query(req, 0, 0, sb.st_size); + while ((ret = run_bmapx_query(req, fd)) > 0) { + for_each_bmapx_row(req, brec) { + char *delim = ""; + + printf("[%lu]: startoff 0x%llx ", + i++, BBTOB(brec->bmv_offset)); + + if (brec->bmv_block == -1) + printf("startblock hole "); + else if (brec->bmv_block == -2) + printf("startblock delalloc "); + else + printf("startblock 0x%llx ", + BBTOB(brec->bmv_block)); + printf("blockcount 0x%llx flags [", + BBTOB(brec->bmv_length)); + if (brec->bmv_oflags & BMV_OF_PREALLOC) { + printf("%sprealloc", delim); + delim = ", "; + } + if (brec->bmv_oflags & BMV_OF_DELALLOC) { + printf("%sdelalloc", delim); + delim = ", "; + } + if (brec->bmv_oflags & BMV_OF_SHARED) { + printf("%sshared", delim); + delim = ", "; + } + printf("]\n"); + } + } + end_bmapx_query(req); +} + +/* Dump the space file and work file contents. */ +void +clearspace_dump( + struct clearspace_req *req) +{ + csp_dump_fd(req, req->space_fd, "space file"); + csp_dump_fd(req, req->work_fd, "work file"); +} +#endif /* CLEARSPACE_DEBUG */ + +/* Free all resources associated with a space clearing request. 
*/ +int +clearspace_free( + struct clearspace_req **reqp) +{ + struct clearspace_req *req = *reqp; + int ret = 0; + + if (!req) + return 0; + + *reqp = NULL; + free(req->fdr); + free(req->buf); + free(req->bhead); + free(req->rhead); + free(req->mhead); + + if (req->space_fd >= 0) { + ret = close(req->space_fd); + if (ret) + perror(_("closing space capture file")); + } + + if (req->work_fd >= 0) { + int ret2 = close(req->work_fd); + + if (ret2) { + perror(_("closing work file")); + if (!ret && ret2) + ret = ret2; + } + } + + if (req->fshandle) + free_handle(req->fshandle, req->fshandle_sz); + free(req); + return ret; +} diff --git a/libfrog/clearspace.h b/libfrog/clearspace.h new file mode 100644 index 00000000000000..d75545752b1fbf --- /dev/null +++ b/libfrog/clearspace.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __LIBFROG_CLEARSPACE_H__ +#define __LIBFROG_CLEARSPACE_H__ + +#undef CLEARSPACE_DEBUG + +struct clearspace_req; + +struct clearspace_init { + /* Open file and its pathname */ + struct xfs_fd *xfd; + const char *fname; + + /* Which device do we want? */ + bool is_realtime; + dev_t dev; + + /* Range of device to clear. 
*/ + unsigned long long start; + unsigned long long length; + + unsigned int trace_mask; +}; + +int clearspace_init(struct clearspace_req **reqp, + const struct clearspace_init *init); +int clearspace_free(struct clearspace_req **reqp); + +int clearspace_run(struct clearspace_req *req); + +#ifdef CLEARSPACE_DEBUG +void clearspace_dump(struct clearspace_req *req); +#else +# define clearspace_dump(req) ((void)0) +#endif +int clearspace_efficacy(struct clearspace_req *req, + unsigned long long *cleared_bytes); + +/* Debugging levels */ + +#define CSP_TRACE_FREEZE (1U << 0) +#define CSP_TRACE_GRAB (1U << 1) +#define CSP_TRACE_FSMAP (1U << 2) +#define CSP_TRACE_FSREFS (1U << 3) +#define CSP_TRACE_BMAPX (1U << 4) +#define CSP_TRACE_PREP (1U << 5) +#define CSP_TRACE_TARGET (1U << 6) +#define CSP_TRACE_DEDUPE (1U << 7) +#define CSP_TRACE_FALLOC (1U << 8) +#define CSP_TRACE_EXCHANGE (1U << 9) +#define CSP_TRACE_XREBUILD (1U << 10) +#define CSP_TRACE_EFFICACY (1U << 11) +#define CSP_TRACE_SETUP (1U << 12) +#define CSP_TRACE_STATUS (1U << 13) +#define CSP_TRACE_DUMPFILE (1U << 14) +#define CSP_TRACE_BITMAP (1U << 15) + +#define CSP_TRACE_ALL (CSP_TRACE_FREEZE | \ + CSP_TRACE_GRAB | \ + CSP_TRACE_FSMAP | \ + CSP_TRACE_FSREFS | \ + CSP_TRACE_BMAPX | \ + CSP_TRACE_PREP | \ + CSP_TRACE_TARGET | \ + CSP_TRACE_DEDUPE | \ + CSP_TRACE_FALLOC | \ + CSP_TRACE_EXCHANGE | \ + CSP_TRACE_XREBUILD | \ + CSP_TRACE_EFFICACY | \ + CSP_TRACE_SETUP | \ + CSP_TRACE_STATUS | \ + CSP_TRACE_DUMPFILE | \ + CSP_TRACE_BITMAP) + +#endif /* __LIBFROG_CLEARSPACE_H__ */ diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8 index 7d2d1ff94eeb55..a326b9a6486296 100644 --- a/man/man8/xfs_spaceman.8 +++ b/man/man8/xfs_spaceman.8 @@ -25,6 +25,23 @@ .SH OPTIONS .SH COMMANDS .TP +.BI "clearfree [ \-n nr ] [ \-r ] [ \-v mask ] " start " " length +Try to clear the specified physical range in the filesystem. +The +.B start +and +.B length +arguments must be given in units of bytes. 
+If the +.B -n +option is given, run the clearing algorithm this many times. +If the +.B -r +option is given, clear the realtime device. +If the +.B -v +option is given, print what's happening every step of the way. +.TP .BI "freesp [ \-dgrs ] [-a agno]... [ \-b | \-e bsize | \-h bsize | \-m factor ]" With no arguments, .B freesp diff --git a/spaceman/Makefile b/spaceman/Makefile index 358db9edf5cb73..b9eead8340cec1 100644 --- a/spaceman/Makefile +++ b/spaceman/Makefile @@ -27,7 +27,7 @@ LLDLIBS += $(LIBEDITLINE) $(LIBTERMCAP) endif ifeq ($(HAVE_GETFSMAP),yes) -CFILES += freesp.c +CFILES += freesp.c clearfree.c endif default: depend $(LTCOMMAND) diff --git a/spaceman/clearfree.c b/spaceman/clearfree.c new file mode 100644 index 00000000000000..6d686f805855dc --- /dev/null +++ b/spaceman/clearfree.c @@ -0,0 +1,171 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2021-2025 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "platform_defs.h" +#include "command.h" +#include "init.h" +#include "libfrog/paths.h" +#include "input.h" +#include "libfrog/fsgeom.h" +#include "libfrog/clearspace.h" +#include "handle.h" +#include "space.h" + +static void +clearfree_help(void) +{ + printf(_( +"Evacuate the contents of the given range of physical storage in the filesystem" +"\n" +" -n -- Run the space clearing algorithm this many times.\n" +" -r -- clear space on the realtime device.\n" +" -v -- verbosity level, or \"all\" to print everything.\n" +"\n" +"The start and length arguments are required, and must be specified in units\n" +"of bytes.\n" +"\n")); +} + +static int +clearfree_f( + int argc, + char **argv) +{ + struct clearspace_init attrs = { + .xfd = &file->xfd, + .fname = file->name, + }; + struct clearspace_req *req = NULL; + unsigned long long cleared; + unsigned long arg; + long long lnum; + unsigned int i, nr = 1; + int c, ret; + + while ((c = getopt(argc, argv, "n:rv:")) != EOF) { + switch (c) { + case 'n': + 
errno = 0; + arg = strtoul(optarg, NULL, 0); + if (errno) { + perror(optarg); + return 1; + } + if (arg > UINT_MAX) + arg = UINT_MAX; + nr = arg; + break; + case 'r': /* rt device */ + attrs.is_realtime = true; + break; + case 'v': /* Verbose output */ + if (!strcmp(optarg, "all")) { + attrs.trace_mask = CSP_TRACE_ALL; + } else { + errno = 0; + attrs.trace_mask = strtoul(optarg, NULL, 0); + if (errno) { + perror(optarg); + return 1; + } + } + break; + default: + exitcode = 1; + clearfree_help(); + return 0; + } + } + + if (attrs.trace_mask) + attrs.trace_mask |= CSP_TRACE_STATUS; + + if (argc != optind + 2) { + clearfree_help(); + goto fail; + } + + if (attrs.is_realtime) { + if (file->xfd.fsgeom.rtblocks == 0) { + fprintf(stderr, _("No realtime volume present.\n")); + goto fail; + } + attrs.dev = file->fs_path.fs_rtdev; + } else { + attrs.dev = file->fs_path.fs_datadev; + } + + lnum = cvtnum(file->xfd.fsgeom.blocksize, file->xfd.fsgeom.sectsize, + argv[optind]); + if (lnum < 0) { + fprintf(stderr, _("Bad clearfree start sector %s.\n"), + argv[optind]); + goto fail; + } + attrs.start = lnum; + + lnum = cvtnum(file->xfd.fsgeom.blocksize, file->xfd.fsgeom.sectsize, + argv[optind + 1]); + if (lnum < 0) { + fprintf(stderr, _("Bad clearfree length %s.\n"), + argv[optind + 1]); + goto fail; + } + attrs.length = lnum; + + ret = clearspace_init(&req, &attrs); + if (ret) + goto fail; + + for (i = 0; i < nr; i++) { + ret = clearspace_run(req); + if (ret) + goto out_clearspace; + } + + ret = clearspace_efficacy(req, &cleared); + if (ret) + goto out_clearspace; + + printf(_("Cleared 0x%llx bytes (%.1f%%) from 0x%llx to 0x%llx.\n"), + cleared, 100.0 * cleared / attrs.length, attrs.start, + attrs.start + attrs.length); + + if (!cleared) + clearspace_dump(req); + + ret = clearspace_free(&req); + if (ret) + goto fail; + + fshandle_destroy(); + return 0; + +out_clearspace: + clearspace_dump(req); + clearspace_free(&req); +fail: + fshandle_destroy(); + exitcode = 1; + return 1; +} + 
+static struct cmdinfo clearfree_cmd = { + .name = "clearfree", + .cfunc = clearfree_f, + .argmin = 0, + .argmax = -1, + .flags = CMD_FLAG_ONESHOT, + .args = "[-n runs] [-r] [-v mask] start length", + .help = clearfree_help, +}; + +void +clearfree_init(void) +{ + clearfree_cmd.oneline = _("clear free space in the filesystem"); + + add_command(&clearfree_cmd); +} diff --git a/spaceman/init.c b/spaceman/init.c index cf1ff3cbb0ee8d..bce62dec47f2c8 100644 --- a/spaceman/init.c +++ b/spaceman/init.c @@ -35,6 +35,7 @@ init_commands(void) trim_init(); freesp_init(); health_init(); + clearfree_init(); } static int diff --git a/spaceman/space.h b/spaceman/space.h index 28fa35a3047957..509e923375f42f 100644 --- a/spaceman/space.h +++ b/spaceman/space.h @@ -31,8 +31,10 @@ extern void quit_init(void); extern void trim_init(void); #ifdef HAVE_GETFSMAP extern void freesp_init(void); +extern void clearfree_init(void); #else # define freesp_init() do { } while (0) +# define clearfree_init() do { } while(0) #endif extern void info_init(void); extern void health_init(void); ^ permalink raw reply related [flat|nested] 110+ messages in thread
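A side note on the option parsing in clearfree_f() above: strtoul() is called with a NULL end pointer, so an argument such as "3x" is accepted as 3, and a string with no digits at all can end up as zero runs without any error being reported. The following is a minimal sketch of a stricter helper, on the assumption that trailing junk should be rejected. The name parse_run_count is hypothetical and is not part of the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not in the patch): parse a run count the way the
 * -n option does, but also reject empty input and trailing characters.
 * Values above UINT_MAX are clamped, matching clearfree_f().
 * Returns 0 on success, -1 on a malformed or out-of-range string.
 */
static int
parse_run_count(
	const char	*s,
	unsigned int	*nr)
{
	unsigned long	arg;
	char		*end;

	errno = 0;
	arg = strtoul(s, &end, 0);
	/* end == s means no digits were consumed; *end catches "3x". */
	if (errno || end == s || *end != '\0')
		return -1;
	if (arg > UINT_MAX)
		arg = UINT_MAX;
	*nr = arg;
	return 0;
}
```

The same pattern would apply to the -v mask, which is parsed with the same strtoul() call shape.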
* [PATCH 06/11] spaceman: physically move a regular inode 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (4 preceding siblings ...) 2024-12-31 23:46 ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong @ 2024-12-31 23:46 ` Darrick J. Wong 2024-12-31 23:46 ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong ` (4 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw) To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To be able to shrink a filesystem, we need to be able to physically
move an inode and all its data and metadata from its current location
to a new AG. Add a command to spaceman to allow an inode to be moved to
a new AG.

This new command is not intended to be a perfect solution. I am not
trying to handle atomic movement of open files - this is intended to be
run as a maintenance operation on an idle filesystem. If root
filesystems are the target, then this should be run via a rescue
environment that is not executing directly on the root fs.

With those caveats in place, we can do the entire inode move as a set
of non-destructive operations finalised by an atomic inode swap without
needing any special kernel support.

To ensure we move metadata such as BMBT blocks even if we don't need to
move data, we clone the data to a new inode that we've allocated in the
destination AG. This will result in new bmbt blocks being allocated in
the new location even though the data is not copied. Attributes need to
be copied one at a time from the original inode.

If data needs to be moved, then we use fallocate(UNSHARE) to create a
private copy of the range of data that needs to be moved in the new
inode. This will be allocated in the destination AG by normal
allocation policy.
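As a rough illustration of the clone-then-unshare sequence described above, here is a minimal sketch. It is not the patch's code: the function name clone_then_unshare is hypothetical, the caller is assumed to have already created dstfd in the destination AG, and both files must live on the same reflink-capable filesystem for the calls to succeed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */
#include <linux/falloc.h>	/* FALLOC_FL_UNSHARE_RANGE */

/*
 * Hypothetical helper: share all of srcfd's data with dstfd, then force
 * one byte range of dstfd to be backed by newly allocated blocks.
 * Returns 0 on success or a negative errno.
 */
static int
clone_then_unshare(
	int	srcfd,
	int	dstfd,
	off_t	offset,
	off_t	len)
{
	/* Share every data extent of srcfd with dstfd; no data is copied. */
	if (ioctl(dstfd, FICLONE, srcfd) < 0)
		return -errno;

	/* Break sharing for this range so dstfd gets its own copy. */
	if (fallocate(dstfd, FALLOC_FL_UNSHARE_RANGE, offset, len) < 0)
		return -errno;
	return 0;
}
```

Where the new blocks land is up to the allocator; the sketch only shows the two calls, not how the destination file is placed in a particular AG.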
Once the new inode has been finalised, use RENAME_EXCHANGE to swap it
into place and unlink the original inode to free up all the resources
it still pins.

There are many optimisations still possible to speed this up, but the
goal here is "functional" rather than "optimal". Performance can be
optimised once all the parts for an "empty the tail of the filesystem
before shrink" operation are implemented and solidly tested.

This functionality has been smoke tested by creating a 32MB data file
with 4k extents and several hundred attributes:

$ cat test.sh
fname=/mnt/scratch/foo
xfs_io -f -c "pwrite 0 32m" -c sync $fname
for (( i=0; i < 4096 ; i++ )); do
	xfs_io -c "fpunch $((i * 8))k 4k" $fname
done
for (( i=0; i < 100 ; i++ )); do
	setfattr -n user.blah.$i.$i.blah -v blah.$i.$i.blah $fname
	setfattr -n user.foo.$i.$i.foo -v $i.cantbele.$i.ve.$i.tsnotbutter $fname
done
for (( i=0; i < 100 ; i++ )); do
	setfattr -n security.baz.$i.$i.baz -v wotchul$i$iookinat $fname
done
xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
xfs_spaceman -c "move_inode -a 22" /mnt/scratch/foo
xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
$

and the output looks something like:

$ sudo ./test.sh
....
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 133
/mnt/scratch/foo:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
 0: [0..7]: hole 8
 1: [8..15]: 208..215 0 (208..215) 8 000000
 2: [16..23]: hole 8
 3: [24..31]: 224..231 0 (224..231) 8 000000
 ....
 8189: [65512..65519]: 65712..65719 0 (65712..65719) 8 000000
 8190: [65520..65527]: hole 8
 8191: [65528..65535]: 65728..65735 0 (65728..65735) 8 000000
mnt/scratch/foo:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
 0: [0..7]: 392..399 0 (392..399) 8 000000
 1: [8..15]: 408..415 0 (408..415) 8 000000
 2: [16..23]: 424..431 0 (424..431) 8 000000
 3: [24..31]: 456..463 0 (456..463) 8 000000
move mnt /mnt/scratch, path /mnt/scratch/foo, agno 22
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 47244651475
....
/mnt/scratch/foo:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
 0: [0..7]: hole 8
 1: [8..15]: 47244763192..47244763199 22 (123112..123119) 8 000000
 2: [16..23]: hole 8
 3: [24..31]: 47244763208..47244763215 22 (123128..123135) 8 000000
 ....
 8189: [65512..65519]: 47244828808..47244828815 22 (188728..188735) 8 000000
 8190: [65520..65527]: hole 8
 8191: [65528..65535]: 47244828824..47244828831 22 (188744..188751) 8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
 0: [0..7]: 47244763176..47244763183 22 (123096..123103) 8 000000
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_spaceman.8 | 4
 spaceman/Makefile | 3
 spaceman/init.c | 1
 spaceman/move_inode.c | 562 +++++++++++++++++++++++++++++++++++++++++++++++
 spaceman/space.h | 1
 5 files changed, 570 insertions(+), 1 deletion(-)
 create mode 100644 spaceman/move_inode.c

diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8
index a326b9a6486296..f898a8bbe840ea 100644
--- a/man/man8/xfs_spaceman.8
+++ b/man/man8/xfs_spaceman.8
@@ -146,6 +146,10 @@ .SH COMMANDS
 .TP
 .BR "help [ " command " ]"
 Display a brief description of one or all commands.
+.TP
+.BI "move_inode \-a agno"
+Move the currently open file into the specified allocation group.
+ .TP .BI "prealloc [ \-u id ] [ \-g id ] [ -p id ] [ \-m minlen ] [ \-s ]" Removes speculative preallocation. diff --git a/spaceman/Makefile b/spaceman/Makefile index b9eead8340cec1..9d080b67de9a22 100644 --- a/spaceman/Makefile +++ b/spaceman/Makefile @@ -14,11 +14,12 @@ CFILES = \ health.c \ info.c \ init.c \ + move_inode.c \ prealloc.c \ trim.c LSRCFILES = xfs_info.sh -LLDLIBS = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG) +LLDLIBS = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG) $(LIBHANDLE) LTDEPENDENCIES = $(LIBHANDLE) $(LIBXCMD) $(LIBFROG) LLDFLAGS = -static diff --git a/spaceman/init.c b/spaceman/init.c index bce62dec47f2c8..dbeebcf97b9fb2 100644 --- a/spaceman/init.c +++ b/spaceman/init.c @@ -36,6 +36,7 @@ init_commands(void) freesp_init(); health_init(); clearfree_init(); + move_inode_init(); } static int diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c new file mode 100644 index 00000000000000..b7d71ee7a46dc6 --- /dev/null +++ b/spaceman/move_inode.c @@ -0,0 +1,562 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2020 Red Hat, Inc. + * All Rights Reserved. + */ + +#include "libxfs.h" +#include "libfrog/fsgeom.h" +#include "command.h" +#include "init.h" +#include "libfrog/paths.h" +#include "space.h" +#include "input.h" +#include "handle.h" + +#include <linux/fiemap.h> +#include <linux/falloc.h> +#include <attr/attributes.h> + +static cmdinfo_t move_inode_cmd; + +/* + * We can't entirely use O_TMPFILE here because we want to use RENAME_EXCHANGE + * to swap the inode once rebuild is complete. Hence the new file has to be + * somewhere in the namespace for rename to act upon. Hence we use a normal + * open(O_CREAT) for now. + * + * This could potentially use O_TMPFILE to rebuild the entire inode, then use + * a linkat()/renameat2() pair to add it to the namespace then atomically + * replace the original. 
+ */ +static int +create_tmpfile( + const char *mnt, + struct xfs_fd *xfd, + xfs_agnumber_t agno, + char **tmpfile, + int *tmpfd) +{ + char name[PATH_MAX + 1]; + mode_t mask; + int fd; + int i; + int ret; + + /* construct tmpdir */ + mask = umask(0); + + snprintf(name, PATH_MAX, "%s/.spaceman", mnt); + ret = mkdir(name, 0700); + if (ret) { + if (errno != EEXIST) { + fprintf(stderr, _("could not create tmpdir: %s: %s\n"), + name, strerror(errno)); + ret = -errno; + goto out_cleanup; + } + } + + /* loop creating directories until we get one in the right AG */ + for (i = 0; i < xfd->fsgeom.agcount; i++) { + struct stat st; + + snprintf(name, PATH_MAX, "%s/.spaceman/dir%d", mnt, i); + ret = mkdir(name, 0700); + if (ret) { + if (errno != EEXIST) { + fprintf(stderr, + _("cannot create tmpdir: %s: %s\n"), + name, strerror(errno)); + ret = -errno; + goto out_cleanup_dir; + } + } + ret = lstat(name, &st); + if (ret) { + fprintf(stderr, _("cannot stat tmpdir: %s: %s\n"), + name, strerror(errno)); + ret = -errno; + rmdir(name); + goto out_cleanup_dir; + } + if (cvt_ino_to_agno(xfd, st.st_ino) == agno) + break; + + /* remove directory in wrong AG */ + rmdir(name); + } + + if (i == xfd->fsgeom.agcount) { + /* + * Nothing landed in the selected AG! Must have been skipped + * because the AG is out of space. 
+ */ + fprintf(stderr, _("Cannot create AG tmpdir.\n")); + ret = -ENOSPC; + goto out_cleanup_dir; + } + + /* create tmpfile */ + snprintf(name, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d", mnt, i, getpid()); + fd = open(name, O_CREAT|O_EXCL|O_RDWR, 0700); + if (fd < 0) { + fprintf(stderr, _("cannot create tmpfile: %s: %s\n"), + name, strerror(errno)); + ret = -errno; + } + + /* return name and fd */ + (void)umask(mask); + *tmpfd = fd; + *tmpfile = strdup(name); + + return 0; +out_cleanup_dir: + snprintf(name, PATH_MAX, "%s/.spaceman", mnt); + rmdir(name); +out_cleanup: + (void)umask(mask); + return ret; +} + +static int +get_attr( + void *hdl, + size_t hlen, + char *name, + void *attrbuf, + int *attrlen, + int attr_ns) +{ + struct xfs_attr_multiop ops = { + .am_opcode = ATTR_OP_GET, + .am_attrname = name, + .am_attrvalue = attrbuf, + .am_length = *attrlen, + .am_flags = attr_ns, + }; + int ret; + + ret = attr_multi_by_handle(hdl, hlen, &ops, 1, 0); + if (ret < 0) { + fprintf(stderr, _("attr_multi_by_handle(GET): %s\n"), + strerror(errno)); + return -errno; + } + *attrlen = ops.am_length; + return 0; +} + +static int +set_attr( + void *hdl, + size_t hlen, + char *name, + void *attrbuf, + int attrlen, + int attr_ns) +{ + struct xfs_attr_multiop ops = { + .am_opcode = ATTR_OP_SET, + .am_attrname = name, + .am_attrvalue = attrbuf, + .am_length = attrlen, + .am_flags = ATTR_CREATE | attr_ns, + }; + int ret; + + ret = attr_multi_by_handle(hdl, hlen, &ops, 1, 0); + if (ret < 0) { + fprintf(stderr, _("attr_multi_by_handle(SET): %s\n"), + strerror(errno)); + return -errno; + } + return 0; +} + +/* + * Copy all the attributes from the original source file into the replacement + * destination. + * + * Oh the humanity of deprecated Irix compatible attr interfaces that are more + * functional and useful than their native Linux replacements! 
+ */ +static int +copy_attrs( + int srcfd, + int dstfd, + int attr_ns) +{ + void *shdl; + void *dhdl; + size_t shlen; + size_t dhlen; + attrlist_cursor_t cursor; + attrlist_t *alist; + struct attrlist_ent *ent; + char alistbuf[XATTR_LIST_MAX]; + char attrbuf[XATTR_SIZE_MAX]; + int attrlen; + int error; + int i; + + memset(&cursor, 0, sizeof(cursor)); + + /* + * All this handle based stuff is hoop jumping to avoid: + * + * a) deprecated API warnings because attr_list, attr_get and attr_set + * have been deprecated and hence throw compiler warnings; and + * + * b) listxattr() failing hard if there are more than 64kB worth of attr + * names on the inode so is unusable. + * + * That leaves libhandle as the only usable interface for iterating all + * xattrs on an inode reliably. Lucky for us, libhandle is part of + * xfsprogs, so this hoop jump isn't going to get ripped out from under + * us any time soon. + */ + error = fd_to_handle(srcfd, (void **)&shdl, &shlen); + if (error) { + fprintf(stderr, _("fd_to_handle(shdl): %s\n"), + strerror(errno)); + return -errno; + } + error = fd_to_handle(dstfd, (void **)&dhdl, &dhlen); + if (error) { + fprintf(stderr, _("fd_to_handle(dhdl): %s\n"), + strerror(errno)); + goto out_free_shdl; + } + + /* loop to iterate all xattrs */ + error = attr_list_by_handle(shdl, shlen, alistbuf, + XATTR_LIST_MAX, attr_ns, &cursor); + if (error) { + fprintf(stderr, _("attr_list_by_handle(shdl): %s\n"), + strerror(errno)); + } + while (!error) { + alist = (attrlist_t *)alistbuf; + + /* + * We loop one attr at a time for initial implementation + * simplicity. attr_multi_by_handle() can retrieve and set + * multiple attrs in a single call, but that is more complex. + * Get it working first, then optimise. 
+ */ + for (i = 0; i < alist->al_count; i++) { + ent = ATTR_ENTRY(alist, i); + + /* get xattr (val, len) from name */ + attrlen = XATTR_SIZE_MAX; + error = get_attr(shdl, shlen, ent->a_name, attrbuf, + &attrlen, attr_ns); + if (error) + break; + + /* set xattr (val, len) to name */ + error = set_attr(dhdl, dhlen, ent->a_name, attrbuf, + attrlen, ATTR_CREATE | attr_ns); + if (error) + break; + } + + if (!alist->al_more) + break; + error = attr_list_by_handle(shdl, shlen, alistbuf, + XATTR_LIST_MAX, attr_ns, &cursor); + } + + free_handle(dhdl, dhlen); +out_free_shdl: + free_handle(shdl, shlen); + return error ? -errno : 0; +} + +/* + * scan the range of the new file for data that isn't in the destination AG + * and unshare it to create a new copy of it in the current target location + * of the new file. + */ +#define EXTENT_BATCH 32 +static int +unshare_data( + struct xfs_fd *xfd, + int destfd, + xfs_agnumber_t agno) +{ + int ret; + struct fiemap *fiemap; + int done = 0; + int fiemap_flags = FIEMAP_FLAG_SYNC; + int i; + int map_size; + __u64 last_logical = 0; /* last extent offset handled */ + off_t range_end = -1LL; /* mapping end*/ + + /* fiemap loop over extents */ + map_size = sizeof(struct fiemap) + + (EXTENT_BATCH * sizeof(struct fiemap_extent)); + fiemap = malloc(map_size); + if (!fiemap) { + fprintf(stderr, _("%s: malloc of %d bytes failed.\n"), + progname, map_size); + return -ENOMEM; + } + + while (!done) { + memset(fiemap, 0, map_size); + fiemap->fm_flags = fiemap_flags; + fiemap->fm_start = last_logical; + fiemap->fm_length = range_end - last_logical; + fiemap->fm_extent_count = EXTENT_BATCH; + + ret = ioctl(destfd, FS_IOC_FIEMAP, (unsigned long)fiemap); + if (ret < 0) { + fprintf(stderr, "%s: ioctl(FS_IOC_FIEMAP): %s\n", + progname, strerror(errno)); + free(fiemap); + return -errno; + } + + /* No more extents to map, exit */ + if (!fiemap->fm_mapped_extents) + break; + + for (i = 0; i < fiemap->fm_mapped_extents; i++) { + struct fiemap_extent *extent; + 
xfs_agnumber_t this_agno; + + extent = &fiemap->fm_extents[i]; + this_agno = cvt_daddr_to_agno(xfd, + cvt_btobbt(extent->fe_physical)); + + /* + * If extent not in dst AG, unshare whole extent to + * trigger reallocation of the extent to be local to + * the current inode. + */ + if (this_agno != agno) { + ret = fallocate(destfd, FALLOC_FL_UNSHARE_RANGE, + extent->fe_logical, extent->fe_length); + if (ret) { + fprintf(stderr, + "%s: fallocate(UNSHARE): %s\n", + progname, strerror(errno)); + return -errno; + } + } + + last_logical = extent->fe_logical + extent->fe_length; + + /* Kernel has told us there are no more extents */ + if (extent->fe_flags & FIEMAP_EXTENT_LAST) { + done = 1; + break; + } + } + } + return 0; +} + +/* + * Exchange the inodes at the two paths indicated after first ensuring that the + * owners, permissions and timestamps are set correctly in the tmpfile. + */ +static int +exchange_inodes( + struct xfs_fd *xfd, + int tmpfd, + const char *tmpfile, + const char *path) +{ + struct timespec ts[2]; + struct stat st; + int ret; + + ret = fstat(xfd->fd, &st); + if (ret) + return -errno; + + /* set user ids */ + ret = fchown(tmpfd, st.st_uid, st.st_gid); + if (ret) + return -errno; + + /* set permissions */ + ret = fchmod(tmpfd, st.st_mode); + if (ret) + return -errno; + + /* set timestamps */ + ts[0] = st.st_atim; + ts[1] = st.st_mtim; + ret = futimens(tmpfd, ts); + if (ret) + return -errno; + + /* exchange the two inodes */ + ret = renameat2(AT_FDCWD, tmpfile, AT_FDCWD, path, RENAME_EXCHANGE); + if (ret) + return -errno; + return 0; +} + +static int +move_file_to_ag( + const char *mnt, + const char *path, + struct xfs_fd *xfd, + xfs_agnumber_t agno) +{ + int ret; + int tmpfd = -1; + char *tmpfile = NULL; + + fprintf(stderr, "move mnt %s, path %s, agno %d\n", mnt, path, agno); + + /* create temporary file in agno */ + ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd); + if (ret) + return ret; + + /* clone data to tempfile */ + ret = ioctl(tmpfd, 
FICLONE, xfd->fd); + if (ret) + goto out_cleanup; + + /* copy system attributes to tempfile */ + ret = copy_attrs(xfd->fd, tmpfd, ATTR_ROOT); + if (ret) + goto out_cleanup; + + /* copy user attributes to tempfile */ + ret = copy_attrs(xfd->fd, tmpfd, 0); + if (ret) + goto out_cleanup; + + /* unshare data to move it */ + ret = unshare_data(xfd, tmpfd, agno); + if (ret) + goto out_cleanup; + + /* swap the inodes over */ + ret = exchange_inodes(xfd, tmpfd, tmpfile, path); + +out_cleanup: + if (ret == -1) + ret = -errno; + + close(tmpfd); + if (tmpfile) + unlink(tmpfile); + free(tmpfile); + + return ret; +} + +static int +move_inode_f( + int argc, + char **argv) +{ + void *fshandle; + size_t fshdlen; + xfs_agnumber_t agno = 0; + struct stat st; + int ret; + int c; + + while ((c = getopt(argc, argv, "a:")) != EOF) { + switch (c) { + case 'a': + agno = cvt_u32(optarg, 10); + if (errno) { + fprintf(stderr, _("bad agno value %s\n"), + optarg); + return command_usage(&move_inode_cmd); + } + break; + default: + return command_usage(&move_inode_cmd); + } + } + + if (optind != argc) + return command_usage(&move_inode_cmd); + + if (agno >= file->xfd.fsgeom.agcount) { + fprintf(stderr, +_("Destination AG %d does not exist. 
Filesystem only has %d AGs\n"), + agno, file->xfd.fsgeom.agcount); + exitcode = 1; + return 0; + } + + /* this is so we can use fd_to_handle() later on */ + ret = path_to_fshandle(file->fs_path.fs_dir, &fshandle, &fshdlen); + if (ret < 0) { + fprintf(stderr, _("Cannot get fshandle for mount %s: %s\n"), + file->fs_path.fs_dir, strerror(errno)); + goto exit_fail; + } + + ret = fstat(file->xfd.fd, &st); + if (ret) { + fprintf(stderr, _("stat(%s) failed: %s\n"), + file->name, strerror(errno)); + goto exit_fail; + } + + if (S_ISREG(st.st_mode)) { + ret = move_file_to_ag(file->fs_path.fs_dir, file->name, + &file->xfd, agno); + } else { + fprintf(stderr, _("Unsupported: %s is not a regular file.\n"), + file->name); + goto exit_fail; + } + + if (ret) { + fprintf(stderr, _("Failed to move inode to AG %d: %s\n"), + agno, strerror(-ret)); + goto exit_fail; + } + fshandle_destroy(); + return 0; + +exit_fail: + fshandle_destroy(); + exitcode = 1; + return 0; +} + +static void +move_inode_help(void) +{ + printf(_( +"\n" +"Physically move an inode into a new allocation group\n" +"\n" +" -a agno -- destination AG agno for the current open file\n" +"\n")); + +} + +void +move_inode_init(void) +{ + move_inode_cmd.name = "move_inode"; + move_inode_cmd.altname = "mvino"; + move_inode_cmd.cfunc = move_inode_f; + move_inode_cmd.argmin = 2; + move_inode_cmd.argmax = 2; + move_inode_cmd.args = "-a agno"; + move_inode_cmd.flags = CMD_FLAG_ONESHOT; + move_inode_cmd.oneline = _("Move an inode into a new AG."); + move_inode_cmd.help = move_inode_help; + + add_command(&move_inode_cmd); +} diff --git a/spaceman/space.h b/spaceman/space.h index 509e923375f42f..96c3c356f13fec 100644 --- a/spaceman/space.h +++ b/spaceman/space.h @@ -38,5 +38,6 @@ extern void clearfree_init(void); #endif extern void info_init(void); extern void health_init(void); +void move_inode_init(void); #endif /* XFS_SPACEMAN_SPACE_H_ */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
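For readers following create_tmpfile() in the patch above: it keeps creating directories until lstat() reports an inode number that falls in the target AG, which relies on the AG number being encoded in the high bits of an XFS inode number. The sketch below shows that decomposition under stated assumptions. struct geom_sketch and both function names are hypothetical stand-ins for the libfrog helpers, and the aginolog value of 31 used in the note afterwards is inferred from the sample output in the commit message, not stated anywhere in the patch.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the geometry that struct xfs_fd caches.
 * aginolog is the number of low bits of an inode number that hold the
 * AG-relative inode number; the bits above it hold the AG number.
 */
struct geom_sketch {
	unsigned int	aginolog;
};

/* Extract the AG number from a filesystem-wide inode number. */
static uint32_t
sketch_ino_to_agno(
	const struct geom_sketch	*g,
	uint64_t			ino)
{
	return ino >> g->aginolog;
}

/* Extract the AG-relative inode number. */
static uint32_t
sketch_ino_to_agino(
	const struct geom_sketch	*g,
	uint64_t			ino)
{
	return ino & ((1ULL << g->aginolog) - 1);
}
```

With an assumed aginolog of 31, the post-move inode number 47244651475 from the test output decodes to AG 22, and the pre-move inode 133 decodes to AG 0, which matches the "move_inode -a 22" invocation.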
* [PATCH 07/11] spaceman: find owners of space in an AG 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (5 preceding siblings ...) 2024-12-31 23:46 ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong @ 2024-12-31 23:46 ` Darrick J. Wong 2024-12-31 23:46 ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong ` (3 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw) To: aalbersh, djwong; +Cc: dchinner, linux-xfs From: Dave Chinner <dchinner@redhat.com> Before we can move inodes for a shrink operation, we have to find all the inodes that own space in the AG(s) we want to empty. This implementation uses FS_IOC_GETFSMAP on the assumption that filesystems to be shrunk have reverse mapping enabled as it is the only way to identify inode related metadata that userspace is unable to see or influence (e.g. BMBT blocks) that may be located in the specific AG. We can use GETFSMAP to identify both inodes to be moved (via XFS_FMR_OWN_INODES records) and inodes with just data and/or metadata to be moved. Once we have identified all the inodes to be moved, we have to map them to paths so that we can use renameat2() to exchange the directory entries pointing at the moved inode atomically. We also need to record inodes with hard links and all of the paths to the inode so that hard links can be recreated appropriately. This requires a directory tree walk to discover the paths (until parent pointers are a thing). Hence for filesystems that aren't reverse mapping enabled, we can eventually use this pass to discover inodes with visible data and metadata that need to be moved. As we resolve the paths to the inodes to be moved, output the information to stdout so that it can be acted upon by other utilities. 
This results in a command that acts similar to find but with a physical location filter rather than an inode metadata filter. Again, this is not meant to be an optimal implementation. It shouldn't suck, but there is plenty of scope for performance optimisation, especially with a multithreaded and/or async directory traversal/parent pointer path resolution process to hide access latencies. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libfrog/fsgeom.h | 19 ++ libfrog/radix-tree.c | 2 libfrog/radix-tree.h | 2 man/man8/xfs_spaceman.8 | 11 + spaceman/Makefile | 1 spaceman/find_owner.c | 481 +++++++++++++++++++++++++++++++++++++++++++++++ spaceman/init.c | 4 spaceman/space.h | 2 8 files changed, 521 insertions(+), 1 deletion(-) create mode 100644 spaceman/find_owner.c diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h index b851b9bbf36a58..679046077cba84 100644 --- a/libfrog/fsgeom.h +++ b/libfrog/fsgeom.h @@ -97,6 +97,25 @@ cvt_ino_to_agino( return ino & ((1ULL << xfd->aginolog) - 1); } +/* Convert an AG block to an AG inode number. */ +static inline uint32_t +cvt_agbno_to_agino( + const struct xfs_fd *xfd, + xfs_agblock_t agbno) +{ + return agbno << xfd->inopblog; +} + +/* Calculate the number of inodes in a byte range */ +static inline uint32_t +cvt_b_to_inode_count( + const struct xfs_fd *xfd, + uint64_t bytes) +{ + return (bytes >> xfd->blocklog) << xfd->inopblog; +} + + /* * Convert a linear fs block offset number into bytes. 
This is the runtime * equivalent of XFS_FSB_TO_B, which means that it is /not/ for segmented fsbno diff --git a/libfrog/radix-tree.c b/libfrog/radix-tree.c index 261fc2487de97f..788d11612e290f 100644 --- a/libfrog/radix-tree.c +++ b/libfrog/radix-tree.c @@ -377,6 +377,8 @@ void *radix_tree_tag_set(struct radix_tree_root *root, unsigned int height, shift; struct radix_tree_node *slot; + ASSERT(tag < RADIX_TREE_MAX_TAGS); + height = root->height; if (index > radix_tree_maxindex(height)) return NULL; diff --git a/libfrog/radix-tree.h b/libfrog/radix-tree.h index 0a4e3bb4f9defc..73f41a9d902a26 100644 --- a/libfrog/radix-tree.h +++ b/libfrog/radix-tree.h @@ -28,7 +28,7 @@ do { \ } while (0) #ifdef RADIX_TREE_TAGS -#define RADIX_TREE_MAX_TAGS 2 +#define RADIX_TREE_MAX_TAGS 3 #endif int radix_tree_insert(struct radix_tree_root *, unsigned long, void *); diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8 index f898a8bbe840ea..6fef6949aa6c8b 100644 --- a/man/man8/xfs_spaceman.8 +++ b/man/man8/xfs_spaceman.8 @@ -41,6 +41,14 @@ .SH COMMANDS If the .B -v option is given, print what's happening every step of the way. +.TP +.BI "find_owner \-a agno" +Create an internal structure to map physical space in the given allocation +group to file paths. +This enables space reorganization on a mounted filesystem by enabling +us to find files. +Unclear why we can't just use FSMAP and BULKSTAT to open by handle. + .TP .BI "freesp [ \-dgrs ] [-a agno]... [ \-b | \-e bsize | \-h bsize | \-m factor ]" With no arguments, @@ -195,6 +203,9 @@ .SH COMMANDS .B print Display a list of all open files. .TP +.B resolve_owner +Resolves space in the filesystem to file paths, maybe? +.TP .B quit Exit .BR xfs_spaceman . 
diff --git a/spaceman/Makefile b/spaceman/Makefile index 9d080b67de9a22..b35ab1dbd2f440 100644 --- a/spaceman/Makefile +++ b/spaceman/Makefile @@ -11,6 +11,7 @@ HFILES = \ space.h CFILES = \ file.c \ + find_owner.c \ health.c \ info.c \ init.c \ diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c new file mode 100644 index 00000000000000..7a656d80d21217 --- /dev/null +++ b/spaceman/find_owner.c @@ -0,0 +1,481 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2017 Oracle. + * Copyright (c) 2020 Red Hat, Inc. + * All Rights Reserved. + */ + +#include "libxfs.h" +#include <linux/fiemap.h> +#include "libfrog/fsgeom.h" +#include "libfrog/radix-tree.h" +#include "command.h" +#include "init.h" +#include "libfrog/paths.h" +#include <linux/fsmap.h> +#include "space.h" +#include "input.h" + +static cmdinfo_t find_owner_cmd; +static cmdinfo_t resolve_owner_cmd; + +#define NR_EXTENTS 128 + +static RADIX_TREE(inode_tree, 0); +#define MOVE_INODE 0 +#define MOVE_BLOCKS 1 +#define INODE_PATH 2 +int inode_count; +int inode_paths; + +static void +track_inode_chunks( + struct xfs_fd *xfd, + xfs_agnumber_t agno, + uint64_t physaddr, + uint64_t length) +{ + xfs_agblock_t agbno = cvt_b_to_agbno(xfd, physaddr); + uint64_t first_ino = cvt_agino_to_ino(xfd, agno, + cvt_agbno_to_agino(xfd, agbno)); + uint64_t num_inodes = cvt_b_to_inode_count(xfd, length); + int i; + + printf(_("AG %d\tInode Range to move: 0x%llx - 0x%llx (length 0x%llx)\n"), + agno, + (unsigned long long)first_ino, + (unsigned long long)first_ino + num_inodes - 1, + (unsigned long long)length); + + for (i = 0; i < num_inodes; i++) { + if (!radix_tree_lookup(&inode_tree, first_ino + i)) { + radix_tree_insert(&inode_tree, first_ino + i, + (void *)first_ino + i); + inode_count++; + } + radix_tree_tag_set(&inode_tree, first_ino + i, MOVE_INODE); + } +} + +static void +track_inode( + struct xfs_fd *xfd, + xfs_agnumber_t agno, + uint64_t owner, + uint64_t physaddr, + uint64_t length) +{ + if 
(radix_tree_tag_get(&inode_tree, owner, MOVE_BLOCKS)) + return; + + printf(_("AG %d\tInode 0x%llx: blocks to move: 0x%llx - 0x%llx\n"), + agno, + (unsigned long long)owner, + (unsigned long long)physaddr, + (unsigned long long)physaddr + length - 1); + if (!radix_tree_lookup(&inode_tree, owner)) { + radix_tree_insert(&inode_tree, owner, (void *)owner); + inode_count++; + } + radix_tree_tag_set(&inode_tree, owner, MOVE_BLOCKS); +} + +static void +scan_ag( + xfs_agnumber_t agno) +{ + struct fsmap_head *fsmap; + struct fsmap *extent; + struct fsmap *l, *h; + struct fsmap *p; + struct xfs_fd *xfd = &file->xfd; + int ret; + int i; + + fsmap = malloc(fsmap_sizeof(NR_EXTENTS)); + if (!fsmap) { + fprintf(stderr, _("%s: fsmap malloc failed.\n"), progname); + exitcode = 1; + return; + } + + memset(fsmap, 0, sizeof(*fsmap)); + fsmap->fmh_count = NR_EXTENTS; + l = fsmap->fmh_keys; + h = fsmap->fmh_keys + 1; + l->fmr_physical = cvt_agbno_to_b(xfd, agno, 0); + h->fmr_physical = cvt_agbno_to_b(xfd, agno + 1, 0); + l->fmr_device = h->fmr_device = file->fs_path.fs_datadev; + h->fmr_owner = ULLONG_MAX; + h->fmr_flags = UINT_MAX; + h->fmr_offset = ULLONG_MAX; + + while (true) { + printf("Inode count %d\n", inode_count); + ret = ioctl(xfd->fd, FS_IOC_GETFSMAP, fsmap); + if (ret < 0) { + fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"), + progname, file->name, strerror(errno)); + free(fsmap); + exitcode = 1; + return; + } + + /* No more extents to map, exit */ + if (!fsmap->fmh_entries) + break; + + /* + * Walk the extents, ignore everything except inode chunks + * and inode owned blocks. + */ + for (i = 0, extent = fsmap->fmh_recs; + i < fsmap->fmh_entries; + i++, extent++) { + if (extent->fmr_flags & FMR_OF_SPECIAL_OWNER) { + if (extent->fmr_owner != XFS_FMR_OWN_INODES) + continue; + /* + * This extent contains inodes that need to be + * moved into another AG. Convert the extent to + * a range of inode numbers and track them all. 
+ */ + track_inode_chunks(xfd, agno, + extent->fmr_physical, + extent->fmr_length); + + continue; + } + + /* + * Extent is owned by an inode that may be located + * anywhere in the filesystem, not just this AG. + */ + track_inode(xfd, agno, extent->fmr_owner, + extent->fmr_physical, + extent->fmr_length); + } + + p = &fsmap->fmh_recs[fsmap->fmh_entries - 1]; + if (p->fmr_flags & FMR_OF_LAST) + break; + fsmap_advance(fsmap); + } + + free(fsmap); +} + +/* + * find inodes that own physical space in a given AG. + */ +static int +find_owner_f( + int argc, + char **argv) +{ + xfs_agnumber_t agno = -1; + int c; + + while ((c = getopt(argc, argv, "a:")) != EOF) { + switch (c) { + case 'a': + agno = cvt_u32(optarg, 10); + if (errno) { + fprintf(stderr, _("bad agno value %s\n"), + optarg); + return command_usage(&find_owner_cmd); + } + break; + default: + return command_usage(&find_owner_cmd); + } + } + + if (optind != argc) + return command_usage(&find_owner_cmd); + + if (agno == -1 || agno >= file->xfd.fsgeom.agcount) { + fprintf(stderr, +_("Destination AG %d does not exist. Filesystem only has %d AGs\n"), + agno, file->xfd.fsgeom.agcount); + exitcode = 1; + return 0; + } + + /* + * Check that rmap is enabled so that GETFSMAP is actually useful. + */ + if (!(file->xfd.fsgeom.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT)) { + fprintf(stderr, +_("Filesystem at %s does not have reverse mapping enabled. 
Aborting.\n"), + file->fs_path.fs_dir); + exitcode = 1; + return 0; + } + + scan_ag(agno); + return 0; +} + +static void +find_owner_help(void) +{ + printf(_( +"\n" +"Find inodes owning physical blocks in a given AG.\n" +"\n" +" -a agno -- Scan the given AG agno.\n" +"\n")); + +} + +void +find_owner_init(void) +{ + find_owner_cmd.name = "find_owner"; + find_owner_cmd.altname = "fown"; + find_owner_cmd.cfunc = find_owner_f; + find_owner_cmd.argmin = 2; + find_owner_cmd.argmax = 2; + find_owner_cmd.args = "-a agno"; + find_owner_cmd.flags = CMD_FLAG_ONESHOT; + find_owner_cmd.oneline = _("Find inodes owning physical blocks in a given AG"); + find_owner_cmd.help = find_owner_help; + + add_command(&find_owner_cmd); +} + +/* + * for each dirent we get returned, look up the inode tree to see if it is an + * inode we need to process. If it is, then replace the entry in the tree with + * a structure containing the current path and mark the entry as resolved. + */ +struct inode_path { + uint64_t ino; + struct list_head path_list; + uint32_t link_count; + char path[1]; +}; + +static int +resolve_owner_cb( + const char *path, + const struct stat *stat, + int status, + struct FTW *data) +{ + struct inode_path *ipath, *slot_ipath; + int pathlen; + void **slot; + + /* + * Lookup the slot rather than the entry so we can replace the contents + * without another lookup later on. + */ + slot = radix_tree_lookup_slot(&inode_tree, stat->st_ino); + if (!slot || *slot == NULL) + return 0; + + /* Could not get stat data? Fail! */ + if (status == FTW_NS) { + fprintf(stderr, +_("Failed to obtain stat(2) information from path %s. Aborting\n"), + path); + return -EPERM; + } + + /* Allocate a new inode path and record the path in it. 
*/ + pathlen = strlen(path); + ipath = calloc(1, sizeof(*ipath) + pathlen + 1); + if (!ipath) { + fprintf(stderr, +_("Aborting: Storing path %s for inode 0x%lx failed: %s\n"), + path, stat->st_ino, strerror(ENOMEM)); + return -ENOMEM; + } + INIT_LIST_HEAD(&ipath->path_list); + memcpy(&ipath->path[0], path, pathlen); + ipath->ino = stat->st_ino; + + /* + * If the slot contains the inode number we just looked up, then we + * haven't recorded a path for it yet. If that is the case, we just + * set the link count of the path to 1 and replace the slot contents + * with our new_ipath. + */ + if (stat->st_ino == (uint64_t)*slot) { + ipath->link_count = 1; + *slot = ipath; + radix_tree_tag_set(&inode_tree, stat->st_ino, INODE_PATH); + inode_paths++; + return 0; + } + + /* + * Multiple hard links to this inode. The slot already contains an + * ipath pointer, so we add the new ipath to the tail of the list held + * by the slot's ipath and bump the link count of the slot's ipath to + * keep track of how many hard links the inode has. + */ + slot_ipath = *slot; + slot_ipath->link_count++; + list_add_tail(&ipath->path_list, &slot_ipath->path_list); + return 0; +} + +/* + * This should be parallelised - pass subdirs off to a work queue, have the + * work queue processes subdirs, queueing more subdirs to work on. + */ +static int +walk_mount( + const char *mntpt) +{ + int ret; + + ret = nftw(mntpt, resolve_owner_cb, + 100, FTW_PHYS | FTW_MOUNT | FTW_DEPTH); + if (ret) + return -errno; + return 0; +} + +static int +list_inode_paths(void) +{ + struct inode_path *ipath; + uint64_t idx = 0; + int ret; + + do { + bool move_blocks; + bool move_inode; + + ret = radix_tree_gang_lookup_tag(&inode_tree, (void **)&ipath, + idx, 1, INODE_PATH); + if (!ret) + break; + idx = ipath->ino + 1; + + /* Grab status tags and remove from tree. 
*/ + move_blocks = radix_tree_tag_get(&inode_tree, ipath->ino, + MOVE_BLOCKS); + move_inode = radix_tree_tag_get(&inode_tree, ipath->ino, + MOVE_INODE); + radix_tree_delete(&inode_tree, ipath->ino); + + /* Print the initial path with inode number and state. */ + printf("0x%.16llx\t%s\t%s\t%8d\t%s\n", + (unsigned long long)ipath->ino, + move_blocks ? "BLOCK" : "---", + move_inode ? "INODE" : "---", + ipath->link_count, ipath->path); + ipath->link_count--; + + /* Walk all the hard link paths and emit them. */ + while (!list_empty(&ipath->path_list)) { + struct inode_path *hpath; + + hpath = list_first_entry(&ipath->path_list, + struct inode_path, path_list); + list_del(&hpath->path_list); + ipath->link_count--; + + printf("\t\t\t\t\t%s\n", hpath->path); + } + if (ipath->link_count) { + printf(_("Link count anomaly: %d paths left over\n"), + ipath->link_count); + } + free(ipath); + } while (true); + + /* + * Any inodes remaining in the tree at this point indicate inodes whose + * paths were not found. This will be unlinked but still open inodes or + * lost inodes due to corruptions. Either way, a shrink will not succeed + * until these inodes are removed from the filesystem. + */ + idx = 0; + do { + uint64_t ino; + + + ret = radix_tree_gang_lookup(&inode_tree, (void **)&ino, idx, 1); + if (!ret) { + if (idx != 0) + ret = -EBUSY; + break; + } + idx = ino + 1; + printf(_("No path found for inode 0x%llx!\n"), + (unsigned long long)ino); + radix_tree_delete(&inode_tree, ino); + } while (true); + + return ret; +} + +/* + * Resolve inode numbers to paths via a directory tree walk. + */ +static int +resolve_owner_f( + int argc, + char **argv) +{ + int ret; + + if (!inode_tree.rnode) { + fprintf(stderr, +_("Inode list has not been populated. 
No inodes to resolve.\n")); + return 0; + } + + ret = walk_mount(file->fs_path.fs_dir); + if (ret) { + fprintf(stderr, +_("Failed to resolve all paths from mount point %s: %s\n"), + file->fs_path.fs_dir, strerror(-ret)); + exitcode = 1; + return 0; + } + + ret = list_inode_paths(); + if (ret) { + fprintf(stderr, +_("Failed to list all resolved paths from mount point %s: %s\n"), + file->fs_path.fs_dir, strerror(-ret)); + exitcode = 1; + return 0; + } + return 0; +} + +static void +resolve_owner_help(void) +{ + printf(_( +"\n" +"Resolve inodes owning physical blocks in a given AG.\n" +"This requires the find_owner command to be run first to populate the table\n" +"of inodes that need to have their paths resolved.\n" +"\n")); + +} + +void +resolve_owner_init(void) +{ + resolve_owner_cmd.name = "resolve_owner"; + resolve_owner_cmd.altname = "rown"; + resolve_owner_cmd.cfunc = resolve_owner_f; + resolve_owner_cmd.argmin = 0; + resolve_owner_cmd.argmax = 0; + resolve_owner_cmd.args = ""; + resolve_owner_cmd.flags = CMD_FLAG_ONESHOT; + resolve_owner_cmd.oneline = _("Resolve paths to inodes owning physical blocks in a given AG"); + resolve_owner_cmd.help = resolve_owner_help; + + add_command(&resolve_owner_cmd); +} diff --git a/spaceman/init.c b/spaceman/init.c index dbeebcf97b9fb2..8b0af14e566dc8 100644 --- a/spaceman/init.c +++ b/spaceman/init.c @@ -10,6 +10,7 @@ #include "input.h" #include "init.h" #include "libfrog/paths.h" +#include "libfrog/radix-tree.h" #include "space.h" char *progname; @@ -37,6 +38,8 @@ init_commands(void) health_init(); clearfree_init(); move_inode_init(); + find_owner_init(); + resolve_owner_init(); } static int @@ -71,6 +74,7 @@ init( setlocale(LC_ALL, ""); bindtextdomain(PACKAGE, LOCALEDIR); textdomain(PACKAGE); + radix_tree_init(); fs_table_initialise(0, NULL, 0, NULL); while ((c = getopt(argc, argv, "c:p:V")) != EOF) { diff --git a/spaceman/space.h b/spaceman/space.h index 96c3c356f13fec..cffb1882153a18 100644 --- a/spaceman/space.h +++ 
b/spaceman/space.h @@ -39,5 +39,7 @@ extern void clearfree_init(void); extern void info_init(void); extern void health_init(void); void move_inode_init(void); +void find_owner_init(void); +void resolve_owner_init(void); #endif /* XFS_SPACEMAN_SPACE_H_ */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
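The extent-to-inode-range conversion that track_inode_chunks() performs in the patch above can be sketched in isolation. This is only an illustration under stated assumptions: struct geom_sketch and both function names are hypothetical stand-ins for the cvt_*() helpers in libfrog/fsgeom.h (which take a struct xfs_fd), and the layout assumed is the usual XFS one, where an inode number is the AG number shifted above the AG-relative inode number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry holder; the patch reads these from struct xfs_fd. */
struct geom_sketch {
	uint32_t	blocklog;	/* log2(filesystem block size in bytes) */
	uint32_t	inopblog;	/* log2(inodes per filesystem block) */
	uint32_t	agblklog;	/* log2(blocks per AG), rounded up */
	uint32_t	agblocks;	/* blocks per AG */
};

/* First inode number stored in an inode chunk that starts at physaddr. */
static uint64_t
chunk_first_ino(
	const struct geom_sketch	*g,
	uint32_t			agno,
	uint64_t			physaddr)
{
	uint64_t			agbno;
	uint64_t			agino;

	assert(g->agblocks != 0);

	/* byte address -> filesystem block -> block within the AG */
	agbno = (physaddr >> g->blocklog) % g->agblocks;

	/* block within the AG -> first inode within the AG */
	agino = agbno << g->inopblog;

	/* the AG number occupies the bits above the AG-relative inode */
	return ((uint64_t)agno << (g->agblklog + g->inopblog)) | agino;
}

/* Number of inodes that fit in an extent of the given byte length. */
static uint64_t
chunk_inode_count(
	const struct geom_sketch	*g,
	uint64_t			length)
{
	return (length >> g->blocklog) << g->inopblog;
}
```

With 4096-byte blocks, 512-byte inodes and 1000-block AGs (agblklog 10), an 8-block chunk at AG block 16 of AG 2 covers 64 inodes starting at inode 16512.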
* [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (6 preceding siblings ...) 2024-12-31 23:46 ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong @ 2024-12-31 23:46 ` Darrick J. Wong 2024-12-31 23:47 ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong ` (2 subsequent siblings) 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:46 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Wrap the raw radix tree accesses here so that we can provide an alternate implementation on platforms where radix tree indices cannot store a full 64-bit inode number. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- spaceman/Makefile | 1 spaceman/find_owner.c | 76 +++++++++------------------------ spaceman/relocation.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++ spaceman/relocation.h | 46 ++++++++++++++++++++ 4 files changed, 183 insertions(+), 54 deletions(-) create mode 100644 spaceman/relocation.c create mode 100644 spaceman/relocation.h diff --git a/spaceman/Makefile b/spaceman/Makefile index b35ab1dbd2f440..8980208285f610 100644 --- a/spaceman/Makefile +++ b/spaceman/Makefile @@ -17,6 +17,7 @@ CFILES = \ init.c \ move_inode.c \ prealloc.c \ + relocation.c \ trim.c LSRCFILES = xfs_info.sh diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c index 7a656d80d21217..80b239f9ac5de8 100644 --- a/spaceman/find_owner.c +++ b/spaceman/find_owner.c @@ -15,19 +15,13 @@ #include <linux/fsmap.h> #include "space.h" #include "input.h" +#include "relocation.h" static cmdinfo_t find_owner_cmd; static cmdinfo_t resolve_owner_cmd; #define NR_EXTENTS 128 -static RADIX_TREE(inode_tree, 0); -#define MOVE_INODE 0 -#define MOVE_BLOCKS 1 -#define INODE_PATH 2 -int inode_count; -int inode_paths; - static void track_inode_chunks( struct 
xfs_fd *xfd, @@ -39,7 +33,7 @@ track_inode_chunks( uint64_t first_ino = cvt_agino_to_ino(xfd, agno, cvt_agbno_to_agino(xfd, agbno)); uint64_t num_inodes = cvt_b_to_inode_count(xfd, length); - int i; + uint64_t i; printf(_("AG %d\tInode Range to move: 0x%llx - 0x%llx (length 0x%llx)\n"), agno, @@ -47,14 +41,8 @@ track_inode_chunks( (unsigned long long)first_ino + num_inodes - 1, (unsigned long long)length); - for (i = 0; i < num_inodes; i++) { - if (!radix_tree_lookup(&inode_tree, first_ino + i)) { - radix_tree_insert(&inode_tree, first_ino + i, - (void *)first_ino + i); - inode_count++; - } - radix_tree_tag_set(&inode_tree, first_ino + i, MOVE_INODE); - } + for (i = 0; i < num_inodes; i++) + set_reloc_iflag(first_ino + i, MOVE_INODE); } static void @@ -65,7 +53,7 @@ track_inode( uint64_t physaddr, uint64_t length) { - if (radix_tree_tag_get(&inode_tree, owner, MOVE_BLOCKS)) + if (test_reloc_iflag(owner, MOVE_BLOCKS)) return; printf(_("AG %d\tInode 0x%llx: blocks to move to move: 0x%llx - 0x%llx\n"), @@ -73,11 +61,8 @@ track_inode( (unsigned long long)owner, (unsigned long long)physaddr, (unsigned long long)physaddr + length - 1); - if (!radix_tree_lookup(&inode_tree, owner)) { - radix_tree_insert(&inode_tree, owner, (void *)owner); - inode_count++; - } - radix_tree_tag_set(&inode_tree, owner, MOVE_BLOCKS); + + set_reloc_iflag(owner, MOVE_BLOCKS); } static void @@ -111,7 +96,7 @@ scan_ag( h->fmr_offset = ULLONG_MAX; while (true) { - printf("Inode count %d\n", inode_count); + printf("Inode count %llu\n", get_reloc_count()); ret = ioctl(xfd->fd, FS_IOC_GETFSMAP, fsmap); if (ret < 0) { fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"), @@ -245,18 +230,6 @@ find_owner_init(void) add_command(&find_owner_cmd); } -/* - * for each dirent we get returned, look up the inode tree to see if it is an - * inode we need to process. If it is, then replace the entry in the tree with - * a structure containing the current path and mark the entry as resolved. 
- */ -struct inode_path { - uint64_t ino; - struct list_head path_list; - uint32_t link_count; - char path[1]; -}; - static int resolve_owner_cb( const char *path, @@ -266,14 +239,14 @@ resolve_owner_cb( { struct inode_path *ipath, *slot_ipath; int pathlen; - void **slot; + struct inode_path **slot; /* * Lookup the slot rather than the entry so we can replace the contents * without another lookup later on. */ - slot = radix_tree_lookup_slot(&inode_tree, stat->st_ino); - if (!slot || *slot == NULL) + slot = get_reloc_ipath_slot(stat->st_ino); + if (!slot) return 0; /* Could not get stat data? Fail! */ @@ -303,11 +276,10 @@ _("Aborting: Storing path %s for inode 0x%lx failed: %s\n"), * set the link count of the path to 1 and replace the slot contents * with our new_ipath. */ - if (stat->st_ino == (uint64_t)*slot) { + if (*slot == UNLINKED_IPATH) { ipath->link_count = 1; *slot = ipath; - radix_tree_tag_set(&inode_tree, stat->st_ino, INODE_PATH); - inode_paths++; + set_reloc_iflag(stat->st_ino, INODE_PATH); return 0; } @@ -351,18 +323,15 @@ list_inode_paths(void) bool move_blocks; bool move_inode; - ret = radix_tree_gang_lookup_tag(&inode_tree, (void **)&ipath, - idx, 1, INODE_PATH); - if (!ret) + ipath = get_next_reloc_ipath(idx); + if (!ipath) break; idx = ipath->ino + 1; /* Grab status tags and remove from tree. */ - move_blocks = radix_tree_tag_get(&inode_tree, ipath->ino, - MOVE_BLOCKS); - move_inode = radix_tree_tag_get(&inode_tree, ipath->ino, - MOVE_INODE); - radix_tree_delete(&inode_tree, ipath->ino); + move_blocks = test_reloc_iflag(ipath->ino, MOVE_BLOCKS); + move_inode = test_reloc_iflag(ipath->ino, MOVE_INODE); + forget_reloc_ino(ipath->ino); /* Print the initial path with inode number and state. 
*/ printf("0x%.16llx\t%s\t%s\t%8d\t%s\n", @@ -400,9 +369,8 @@ list_inode_paths(void) do { uint64_t ino; - - ret = radix_tree_gang_lookup(&inode_tree, (void **)&ino, idx, 1); - if (!ret) { + ino = get_next_reloc_unlinked(idx); + if (!ino) { if (idx != 0) ret = -EBUSY; break; @@ -410,7 +378,7 @@ list_inode_paths(void) idx = ino + 1; printf(_("No path found for inode 0x%llx!\n"), (unsigned long long)ino); - radix_tree_delete(&inode_tree, ino); + forget_reloc_ino(ino); } while (true); return ret; @@ -426,7 +394,7 @@ resolve_owner_f( { int ret; - if (!inode_tree.rnode) { + if (!is_reloc_populated()) { fprintf(stderr, _("Inode list has not been populated. No inodes to resolve.\n")); return 0; diff --git a/spaceman/relocation.c b/spaceman/relocation.c new file mode 100644 index 00000000000000..7c7d9a2b4b236f --- /dev/null +++ b/spaceman/relocation.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2020 Red Hat, Inc. + * All Rights Reserved. + */ + +#include "libxfs.h" +#include "libfrog/fsgeom.h" +#include "libfrog/radix-tree.h" +#include "libfrog/paths.h" +#include "command.h" +#include "init.h" +#include "space.h" +#include "input.h" +#include "relocation.h" +#include "handle.h" + +static unsigned long long inode_count; +static unsigned long long inode_paths; + +unsigned long long +get_reloc_count(void) +{ + return inode_count; +} + +static RADIX_TREE(relocation_data, 0); + +bool +is_reloc_populated(void) +{ + return relocation_data.rnode != NULL; +} + +bool +test_reloc_iflag( + uint64_t ino, + unsigned int flag) +{ + return radix_tree_tag_get(&relocation_data, ino, flag); +} + +void +set_reloc_iflag( + uint64_t ino, + unsigned int flag) +{ + if (!radix_tree_lookup(&relocation_data, ino)) { + radix_tree_insert(&relocation_data, ino, UNLINKED_IPATH); + if (flag != INODE_PATH) + inode_count++; + } + if (flag == INODE_PATH) + inode_paths++; + + radix_tree_tag_set(&relocation_data, ino, flag); +} + +struct inode_path * +get_next_reloc_ipath( + 
uint64_t ino) +{ + struct inode_path *ipath; + int ret; + + ret = radix_tree_gang_lookup_tag(&relocation_data, (void **)&ipath, + ino, 1, INODE_PATH); + if (!ret) + return NULL; + return ipath; +} + +uint64_t +get_next_reloc_unlinked( + uint64_t ino) +{ + uint64_t next_ino; + int ret; + + ret = radix_tree_gang_lookup(&relocation_data, (void **)&next_ino, ino, + 1); + if (!ret) + return 0; + return next_ino; +} + +/* + * Return a pointer to a pointer where the caller can read or write a pointer + * to an inode path structure. + * + * The pointed-to pointer will be set to UNLINKED_IPATH if there is no ipath + * associated with this inode but the inode has been flagged for relocation. + * + * Returns NULL if the inode is not flagged for relocation. + */ +struct inode_path ** +get_reloc_ipath_slot( + uint64_t ino) +{ + struct inode_path **slot; + + slot = (struct inode_path **)radix_tree_lookup_slot(&relocation_data, + ino); + if (!slot || *slot == NULL) + return NULL; + return slot; +} + +void +forget_reloc_ino( + uint64_t ino) +{ + radix_tree_delete(&relocation_data, ino); +} diff --git a/spaceman/relocation.h b/spaceman/relocation.h new file mode 100644 index 00000000000000..f05a871915da42 --- /dev/null +++ b/spaceman/relocation.h @@ -0,0 +1,46 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2020 Red Hat, Inc. + * All Rights Reserved. + */ +#ifndef XFS_SPACEMAN_RELOCATION_H_ +#define XFS_SPACEMAN_RELOCATION_H_ + +bool is_reloc_populated(void); +unsigned long long get_reloc_count(void); + +/* + * Tags for the relocation_data tree that indicate what it contains and the + * discovery information that needed to be stored. 
+ */ +#define MOVE_INODE 0 +#define MOVE_BLOCKS 1 +#define INODE_PATH 2 + +bool test_reloc_iflag(uint64_t ino, unsigned int flag); +void set_reloc_iflag(uint64_t ino, unsigned int flag); +struct inode_path *get_next_reloc_ipath(uint64_t ino); +uint64_t get_next_reloc_unlinked(uint64_t ino); +struct inode_path **get_reloc_ipath_slot(uint64_t ino); +void forget_reloc_ino(uint64_t ino); + +/* + * When the entry in the relocation_data tree is tagged with INODE_PATH, the + * entry contains a structure that tracks the discovered paths to the inode. If + * the inode has multiple hard links, then we chain each individual path found + * via the path_list and record the number of paths in the link_count entry. + */ +struct inode_path { + uint64_t ino; + struct list_head path_list; + uint32_t link_count; + char path[1]; +}; + +/* + * Sentinel value for inodes that we have to move but haven't yet found a path + * to. + */ +#define UNLINKED_IPATH ((struct inode_path *)1) + +#endif /* XFS_SPACEMAN_RELOCATION_H_ */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
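The slot-plus-sentinel convention that this patch introduces (UNLINKED_IPATH until the first path is found, then a chain of hard link paths) can be shown with a small standalone sketch. All names here are hypothetical; the patch itself uses struct inode_path, a list_head chain and a `char path[1]` tail, whereas this sketch swaps in a singly linked list and a C99 flexible array member to stay self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct inode_path. */
struct ipath_sketch {
	uint64_t		ino;
	struct ipath_sketch	*next;		/* further hard link paths */
	uint32_t		link_count;	/* maintained on the head entry */
	char			path[];		/* NUL-terminated path */
};

/* Sentinel: inode is flagged for relocation but has no path yet. */
#define UNLINKED_SKETCH	((struct ipath_sketch *)1)

/*
 * Record one discovered path in the slot.  The first path replaces the
 * sentinel; later paths are appended and counted as extra hard links.
 */
static int
record_path(
	struct ipath_sketch	**slot,
	uint64_t		ino,
	const char		*path)
{
	struct ipath_sketch	*ip;
	struct ipath_sketch	*tail;
	size_t			len = strlen(path);

	assert(slot != NULL && *slot != NULL);

	/* calloc zeroes the buffer, so the copied path stays terminated */
	ip = calloc(1, sizeof(*ip) + len + 1);
	if (!ip)
		return -ENOMEM;
	memcpy(ip->path, path, len);
	ip->ino = ino;

	if (*slot == UNLINKED_SKETCH) {
		ip->link_count = 1;
		*slot = ip;
		return 0;
	}

	for (tail = *slot; tail->next; tail = tail->next)
		;
	tail->next = ip;
	(*slot)->link_count++;
	return 0;
}
```

Comparing against a sentinel rather than against the inode number cast to a pointer keeps the check independent of what the tree happens to store as its value.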
* [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems 2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong ` (7 preceding siblings ...) 2024-12-31 23:46 ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong @ 2024-12-31 23:47 ` Darrick J. Wong 2024-12-31 23:47 ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong 2024-12-31 23:47 ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong 10 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> We can't use the radix tree to store relocation information on 32-bit systems because unsigned longs are not large enough to hold 64-bit inodes. Use an avl64 tree instead. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- configure.ac | 1 include/builddefs.in | 1 m4/package_libcdev.m4 | 20 +++++ spaceman/Makefile | 4 + spaceman/relocation.c | 203 +++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 229 insertions(+) diff --git a/configure.ac b/configure.ac index 224d1d3930bf2f..1f7fec838e1239 100644 --- a/configure.ac +++ b/configure.ac @@ -212,6 +212,7 @@ fi AC_MANUAL_FORMAT AC_HAVE_LIBURCU_ATOMIC64 +AC_USE_RADIX_TREE_FOR_INUMS AC_CONFIG_FILES([include/builddefs]) AC_OUTPUT diff --git a/include/builddefs.in b/include/builddefs.in index ac43b6412c8cbb..bb022c36627a72 100644 --- a/include/builddefs.in +++ b/include/builddefs.in @@ -114,6 +114,7 @@ CROND_DIR = @crond_dir@ HAVE_UDEV = @have_udev@ UDEV_RULE_DIR = @udev_rule_dir@ HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@ +USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@ GCCFLAGS = -funsigned-char -fno-strict-aliasing -Wall # -Wbitwise -Wno-transparent-union -Wno-old-initializer -Wno-decl diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4 index 4ef7e8f67a3ba6..9e48273250244c 100644 --- 
a/m4/package_libcdev.m4 +++ b/m4/package_libcdev.m4 @@ -255,3 +255,23 @@ AC_DEFUN([AC_PACKAGE_CHECK_LTO], AC_SUBST(lto_cflags) AC_SUBST(lto_ldflags) ]) + +# +# Check if the radix tree index (unsigned long) is large enough to hold a +# 64-bit inode number +# +AC_DEFUN([AC_USE_RADIX_TREE_FOR_INUMS], + [ AC_MSG_CHECKING([if radix tree can store XFS inums]) + AC_LINK_IFELSE([AC_LANG_PROGRAM([[ +#include <sys/param.h> +#include <stdint.h> +#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) + ]], [[ + typedef uint64_t xfs_ino_t; + + BUILD_BUG_ON(sizeof(unsigned long) < sizeof(xfs_ino_t)); + return 0; + ]])],[use_radix_tree_for_inums=yes + AC_MSG_RESULT(yes)],[AC_MSG_RESULT(no)]) + AC_SUBST(use_radix_tree_for_inums) + ]) diff --git a/spaceman/Makefile b/spaceman/Makefile index 8980208285f610..d9d55245ffc47a 100644 --- a/spaceman/Makefile +++ b/spaceman/Makefile @@ -33,6 +33,10 @@ ifeq ($(HAVE_GETFSMAP),yes) CFILES += freesp.c clearfree.c endif +ifeq ($(USE_RADIX_TREE_FOR_INUMS),yes) +LCFLAGS += -DUSE_RADIX_TREE_FOR_INUMS +endif + default: depend $(LTCOMMAND) include $(BUILDRULES) diff --git a/spaceman/relocation.c b/spaceman/relocation.c index 7c7d9a2b4b236f..1c0db6a1dab465 100644 --- a/spaceman/relocation.c +++ b/spaceman/relocation.c @@ -6,7 +6,11 @@ #include "libxfs.h" #include "libfrog/fsgeom.h" +#ifdef USE_RADIX_TREE_FOR_INUMS #include "libfrog/radix-tree.h" +#else +#include "libfrog/avl64.h" +#endif /* USE_RADIX_TREE_FOR_INUMS */ #include "libfrog/paths.h" #include "command.h" #include "init.h" @@ -24,6 +28,7 @@ get_reloc_count(void) return inode_count; } +#ifdef USE_RADIX_TREE_FOR_INUMS static RADIX_TREE(relocation_data, 0); bool @@ -112,3 +117,201 @@ forget_reloc_ino( { radix_tree_delete(&relocation_data, ino); } +#else +struct reloc_node { + struct avl64node node; + uint64_t ino; + struct inode_path *ipath; + unsigned int flags; +}; + +static uint64_t +reloc_start( + struct avl64node *node) +{ + struct reloc_node *rln; + + rln = 
container_of(node, struct reloc_node, node); + return rln->ino; +} + +static uint64_t +reloc_end( + struct avl64node *node) +{ + struct reloc_node *rln; + + rln = container_of(node, struct reloc_node, node); + return rln->ino + 1; +} + +static struct avl64ops reloc_ops = { + reloc_start, + reloc_end, +}; + +static struct avl64tree_desc relocation_data = { + .avl_ops = &reloc_ops, +}; + +bool +is_reloc_populated(void) +{ + return relocation_data.avl_firstino != NULL; +} + +static inline struct reloc_node * +reloc_lookup( + uint64_t ino) +{ + avl64node_t *node; + + node = avl64_find(&relocation_data, ino); + if (!node) + return NULL; + + return container_of(node, struct reloc_node, node); +} + +static inline struct reloc_node * +reloc_insert( + uint64_t ino) +{ + struct reloc_node *rln; + avl64node_t *node; + + rln = malloc(sizeof(struct reloc_node)); + if (!rln) + return NULL; + + rln->node.avl_nextino = NULL; + rln->ino = ino; + rln->ipath = UNLINKED_IPATH; + rln->flags = 0; + + node = avl64_insert(&relocation_data, &rln->node); + if (node == NULL) { + free(rln); + return NULL; + } + + return rln; +} + +bool +test_reloc_iflag( + uint64_t ino, + unsigned int flag) +{ + struct reloc_node *rln; + + rln = reloc_lookup(ino); + if (!rln) + return false; + + return rln->flags & flag; +} + +void +set_reloc_iflag( + uint64_t ino, + unsigned int flag) +{ + struct reloc_node *rln; + + rln = reloc_lookup(ino); + if (!rln) { + rln = reloc_insert(ino); + if (!rln) + abort(); + if (flag != INODE_PATH) + inode_count++; + } + if (flag == INODE_PATH) + inode_paths++; + + rln->flags |= flag; +} + +#define avl_for_each_range_safe(pos, n, l, first, last) \ + for (pos = (first), n = pos->avl_nextino, l = (last)->avl_nextino; \ + pos != (l); \ + pos = n, n = pos ? 
pos->avl_nextino : NULL) + +struct inode_path * +get_next_reloc_ipath( + uint64_t ino) +{ + struct avl64node *firstn; + struct avl64node *lastn; + struct avl64node *pos; + struct avl64node *n; + struct avl64node *l; + struct reloc_node *rln; + + avl64_findranges(&relocation_data, ino - 1, -1ULL, &firstn, &lastn); + if (firstn == NULL && lastn == NULL) + return NULL; + + avl_for_each_range_safe(pos, n, l, firstn, lastn) { + rln = container_of(pos, struct reloc_node, node); + + if (rln->flags & INODE_PATH) + return rln->ipath; + } + + return NULL; +} + +uint64_t +get_next_reloc_unlinked( + uint64_t ino) +{ + struct avl64node *firstn; + struct avl64node *lastn; + struct avl64node *pos; + struct avl64node *n; + struct avl64node *l; + struct reloc_node *rln; + + avl64_findranges(&relocation_data, ino - 1, -1ULL, &firstn, &lastn); + if (firstn == NULL && lastn == NULL) + return 0; + + avl_for_each_range_safe(pos, n, l, firstn, lastn) { + rln = container_of(pos, struct reloc_node, node); + + if (!(rln->flags & INODE_PATH)) + return rln->ino; + } + + return 0; +} + +struct inode_path ** +get_reloc_ipath_slot( + uint64_t ino) +{ + struct reloc_node *rln; + + rln = reloc_lookup(ino); + if (!rln) + return NULL; + + return &rln->ipath; +} + +void +forget_reloc_ino( + uint64_t ino) +{ + struct reloc_node *rln; + + rln = reloc_lookup(ino); + if (!rln) + return; + + avl64_delete(&relocation_data, &rln->node); + free(rln); +} +#endif /* USE_RADIX_TREE_FOR_INUMS */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
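In the avl64 variant above, the per-inode tags are kept in an unsigned int and tested with `rln->flags & flag`, while relocation.h defines the tags as 0, 1 and 2. With MOVE_INODE defined as 0, a direct mask test cannot record that tag. One possible way to keep all three tags distinguishable in a flags word is to turn the tag number into a bit mask first. The helper names below are hypothetical and this is a sketch, not what the patch does as posted.

```c
#include <assert.h>
#include <stdbool.h>

/* Tag numbers as defined in spaceman/relocation.h. */
#define MOVE_INODE	0
#define MOVE_BLOCKS	1
#define INODE_PATH	2

/* Convert a tag number into a single-bit mask. */
static inline unsigned int
tag_mask(
	unsigned int	tag)
{
	assert(tag < 8 * sizeof(unsigned int));
	return 1U << tag;
}

/* Set one tag in a flags word. */
static inline void
set_tag(
	unsigned int	*flags,
	unsigned int	tag)
{
	*flags |= tag_mask(tag);
}

/* Test one tag in a flags word. */
static inline bool
test_tag(
	unsigned int	flags,
	unsigned int	tag)
{
	return (flags & tag_mask(tag)) != 0;
}
```

This mirrors how the radix tree treats the same values: there they are tag indices, not masks, so tag 0 is as usable as any other.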
* [PATCH 10/11] spaceman: relocate the contents of an AG
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
` (8 preceding siblings ...)
2024-12-31 23:47 ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong
@ 2024-12-31 23:47 ` Darrick J. Wong
2024-12-31 23:47 ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong
10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: dchinner, linux-xfs
From: Dave Chinner <dchinner@redhat.com>

Shrinking a filesystem needs to first remove all the active user data
and metadata from the AGs that are going to be lopped off the
filesystem. Before we can do this, we have to relocate this information
to a region of the filesystem that is going to be retained.

We have a function to move an inode and all its related information to
a specific AG, we have functions to find the owners of all the
information in an AG and we can find their paths. This gives us all the
information we need to relocate all the objects in an AG we are going
to remove via shrinking.

Firstly we scan the AG to be emptied to find the inodes that need to be
relocated, then we scan the directory structure to find all the paths
to those inodes that need to be moved. Then we iterate over all the
inodes to be moved, attempting to move them to the lowest numbered AGs.
When the destination AG fills up, we'll get ENOSPC from the moving code
and this is a trigger to bump the destination AG and retry the move. If
we haven't moved all the inodes and their data by the time the
destination reaches the source AG, then the entire operation will fail
with ENOSPC - there is not enough room in the filesystem to empty the
selected AG in preparation for a shrink.

This, once again, is not intended as an optimal or even guaranteed way
of emptying an AG for shrink.
It simply provides the basic algorithm and mechanisms we need to perform a shrink operation. Improvements and optimisations will come in time, but we can't get to an optimal solution without first having basic functionality in place. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libfrog/fsgeom.h | 10 ++ man/man8/xfs_spaceman.8 | 8 ++ spaceman/find_owner.c | 32 +++--- spaceman/init.c | 1 spaceman/move_inode.c | 7 + spaceman/relocation.c | 234 +++++++++++++++++++++++++++++++++++++++++++++++ spaceman/relocation.h | 5 + spaceman/space.h | 1 8 files changed, 280 insertions(+), 18 deletions(-) diff --git a/libfrog/fsgeom.h b/libfrog/fsgeom.h index 679046077cba84..3fe642be6dc9ae 100644 --- a/libfrog/fsgeom.h +++ b/libfrog/fsgeom.h @@ -196,6 +196,16 @@ cvt_daddr_to_agno( return cvt_bb_to_off_fsbt(xfd, daddr) / xfd->fsgeom.agblocks; } +/* Convert sparse filesystem block to AG Number */ +static inline uint32_t +cvt_fsb_to_agno( + struct xfs_fd *xfd, + uint64_t fsbno) +{ + return fsbno >> xfd->agblklog; +} + + /* Convert sector number to AG block number. */ static inline uint32_t cvt_daddr_to_agbno( diff --git a/man/man8/xfs_spaceman.8 b/man/man8/xfs_spaceman.8 index 6fef6949aa6c8b..b6488810cfab30 100644 --- a/man/man8/xfs_spaceman.8 +++ b/man/man8/xfs_spaceman.8 @@ -202,9 +202,17 @@ .SH COMMANDS .TP .B print Display a list of all open files. +.TP +.BI "relocate \-a agno [ \-h agno ]" +Empty out the given allocation group by moving file data elsewhere. +The +.B -h +option specifies the highest allocation group into which we can move data. + .TP .B resolve_owner Resolves space in the filesystem to file paths, maybe? 
+ .TP .B quit Exit diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c index 80b239f9ac5de8..8e93145539a227 100644 --- a/spaceman/find_owner.c +++ b/spaceman/find_owner.c @@ -9,10 +9,10 @@ #include <linux/fiemap.h> #include "libfrog/fsgeom.h" #include "libfrog/radix-tree.h" -#include "command.h" -#include "init.h" #include "libfrog/paths.h" #include <linux/fsmap.h> +#include "command.h" +#include "init.h" #include "space.h" #include "input.h" #include "relocation.h" @@ -65,8 +65,8 @@ track_inode( set_reloc_iflag(owner, MOVE_BLOCKS); } -static void -scan_ag( +int +find_relocation_targets( xfs_agnumber_t agno) { struct fsmap_head *fsmap; @@ -80,8 +80,7 @@ scan_ag( fsmap = malloc(fsmap_sizeof(NR_EXTENTS)); if (!fsmap) { fprintf(stderr, _("%s: fsmap malloc failed.\n"), progname); - exitcode = 1; - return; + return -ENOMEM; } memset(fsmap, 0, sizeof(*fsmap)); @@ -102,8 +101,7 @@ scan_ag( fprintf(stderr, _("%s: FS_IOC_GETFSMAP [\"%s\"]: %s\n"), progname, file->name, strerror(errno)); free(fsmap); - exitcode = 1; - return; + return -errno; } /* No more extents to map, exit */ @@ -148,6 +146,7 @@ scan_ag( } free(fsmap); + return 0; } /* @@ -159,6 +158,7 @@ find_owner_f( char **argv) { xfs_agnumber_t agno = -1; + int ret; int c; while ((c = getopt(argc, argv, "a:")) != EOF) { @@ -198,7 +198,9 @@ _("Filesystem at %s does not have reverse mapping enabled. Aborting.\n"), return 0; } - scan_ag(agno); + ret = find_relocation_targets(agno); + if (ret) + exitcode = 1; return 0; } @@ -299,8 +301,8 @@ _("Aborting: Storing path %s for inode 0x%lx failed: %s\n"), * This should be parallelised - pass subdirs off to a work queue, have the * work queue processes subdirs, queueing more subdirs to work on. */ -static int -walk_mount( +int +resolve_target_paths( const char *mntpt) { int ret; @@ -361,9 +363,9 @@ list_inode_paths(void) /* * Any inodes remaining in the tree at this point indicate inodes whose - * paths were not found. 
This will be unlinked but still open inodes or - * lost inodes due to corruptions. Either way, a shrink will not succeed - * until these inodes are removed from the filesystem. + * paths were not found. This will be free inodes or unlinked but still + * open inodes. Either way, a shrink will not succeed until these inodes + * are removed from the filesystem. */ idx = 0; do { @@ -400,7 +402,7 @@ _("Inode list has not been populated. No inodes to resolve.\n")); return 0; } - ret = walk_mount(file->fs_path.fs_dir); + ret = resolve_target_paths(file->fs_path.fs_dir); if (ret) { fprintf(stderr, _("Failed to resolve all paths from mount point %s: %s\n"), diff --git a/spaceman/init.c b/spaceman/init.c index 8b0af14e566dc8..cfe1b96fb66cd1 100644 --- a/spaceman/init.c +++ b/spaceman/init.c @@ -40,6 +40,7 @@ init_commands(void) move_inode_init(); find_owner_init(); resolve_owner_init(); + relocate_init(); } static int diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c index b7d71ee7a46dc6..ab3c12f5de987b 100644 --- a/spaceman/move_inode.c +++ b/spaceman/move_inode.c @@ -12,6 +12,7 @@ #include "space.h" #include "input.h" #include "handle.h" +#include "relocation.h" #include <linux/fiemap.h> #include <linux/falloc.h> @@ -404,8 +405,8 @@ exchange_inodes( return 0; } -static int -move_file_to_ag( +int +relocate_file_to_ag( const char *mnt, const char *path, struct xfs_fd *xfd, @@ -511,7 +512,7 @@ _("Destination AG %d does not exist. 
Filesystem only has %d AGs\n"), } if (S_ISREG(st.st_mode)) { - ret = move_file_to_ag(file->fs_path.fs_dir, file->name, + ret = relocate_file_to_ag(file->fs_path.fs_dir, file->name, &file->xfd, agno); } else { fprintf(stderr, _("Unsupported: %s is not a regular file.\n"), diff --git a/spaceman/relocation.c b/spaceman/relocation.c index 1c0db6a1dab465..7b125cc0ae12b0 100644 --- a/spaceman/relocation.c +++ b/spaceman/relocation.c @@ -315,3 +315,237 @@ forget_reloc_ino( free(rln); } #endif /* USE_RADIX_TREE_FOR_INUMS */ + +static struct cmdinfo relocate_cmd; + +static int +relocate_targets_to_ag( + const char *mnt, + xfs_agnumber_t dst_agno) +{ + struct inode_path *ipath; + uint64_t idx = 0; + int ret = 0; + + do { + struct xfs_fd xfd = {0}; + struct stat st; + + /* lookup first relocation target */ + ipath = get_next_reloc_ipath(idx); + if (!ipath) + break; + + /* XXX: don't handle hard link cases yet */ + if (ipath->link_count > 1) { + fprintf(stderr, + "FIXME! Skipping hardlinked inode at path %s\n", + ipath->path); + goto next; + } + + + ret = stat(ipath->path, &st); + if (ret) { + fprintf(stderr, _("stat(%s) failed: %s\n"), + ipath->path, strerror(errno)); + goto next; + } + + if (!S_ISREG(st.st_mode)) { + fprintf(stderr, + _("FIXME! Skipping %s: not a regular file.\n"), + ipath->path); + goto next; + } + + ret = xfd_open(&xfd, ipath->path, O_RDONLY); + if (ret) { + fprintf(stderr, _("xfd_open(%s) failed: %s\n"), + ipath->path, strerror(-ret)); + goto next; + } + + /* move to destination AG */ + ret = relocate_file_to_ag(mnt, ipath->path, &xfd, dst_agno); + xfd_close(&xfd); + + /* + * If the destination AG has run out of space, we do not remove + * this inode from relocation data so it will be immediately + * retried in the next AG. Other errors will be fatal. 
+ */ + if (ret < 0) + return ret; +next: + /* remove from relocation data */ + idx = ipath->ino + 1; + forget_reloc_ino(ipath->ino); + } while (ret != -ENOSPC); + + return ret; +} + +static int +relocate_targets( + const char *mnt, + xfs_agnumber_t highest_agno) +{ + xfs_agnumber_t dst_agno = 0; + int ret; + + for (dst_agno = 0; dst_agno <= highest_agno; dst_agno++) { + ret = relocate_targets_to_ag(mnt, dst_agno); + if (ret == -ENOSPC) + continue; + break; + } + return ret; +} + +/* + * Relocate all the user objects in an AG to lower numbered AGs. + */ +static int +relocate_f( + int argc, + char **argv) +{ + xfs_agnumber_t target_agno = -1; + xfs_agnumber_t highest_agno = -1; + xfs_agnumber_t log_agno; + void *fshandle; + size_t fshdlen; + int c; + int ret; + + while ((c = getopt(argc, argv, "a:h:")) != EOF) { + switch (c) { + case 'a': + target_agno = cvt_u32(optarg, 10); + if (errno) { + fprintf(stderr, _("bad target agno value %s\n"), + optarg); + return command_usage(&relocate_cmd); + } + break; + case 'h': + highest_agno = cvt_u32(optarg, 10); + if (errno) { + fprintf(stderr, _("bad highest agno value %s\n"), + optarg); + return command_usage(&relocate_cmd); + } + break; + default: + return command_usage(&relocate_cmd); + } + } + + if (optind != argc) + return command_usage(&relocate_cmd); + + if (target_agno == -1) { + fprintf(stderr, _("Target AG must be specified!\n")); + return command_usage(&relocate_cmd); + } + + log_agno = cvt_fsb_to_agno(&file->xfd, file->xfd.fsgeom.logstart); + if (target_agno <= log_agno) { + fprintf(stderr, +_("Target AG %d must be higher than the journal AG (AG %d). Aborting.\n"), + target_agno, log_agno); + goto out_fail; + } + + if (target_agno >= file->xfd.fsgeom.agcount) { + fprintf(stderr, +_("Target AG %d does not exist. 
Filesystem only has %d AGs\n"), + target_agno, file->xfd.fsgeom.agcount); + goto out_fail; + } + + if (highest_agno == -1) + highest_agno = target_agno - 1; + + if (highest_agno >= target_agno) { + fprintf(stderr, +_("Highest destination AG %d must be less than target AG %d. Aborting.\n"), + highest_agno, target_agno); + goto out_fail; + } + + if (is_reloc_populated()) { + fprintf(stderr, +_("Relocation data populated from previous commands. Aborting.\n")); + goto out_fail; + } + + /* this is so we can use fd_to_handle() later on */ + ret = path_to_fshandle(file->fs_path.fs_dir, &fshandle, &fshdlen); + if (ret < 0) { + fprintf(stderr, _("Cannot get fshandle for mount %s: %s\n"), + file->fs_path.fs_dir, strerror(errno)); + goto out_fail; + } + + ret = find_relocation_targets(target_agno); + if (ret) { + fprintf(stderr, +_("Failure during target discovery. Aborting.\n")); + goto out_fail; + } + + ret = resolve_target_paths(file->fs_path.fs_dir); + if (ret) { + fprintf(stderr, +_("Failed to resolve all paths from mount point %s: %s\n"), + file->fs_path.fs_dir, strerror(-ret)); + goto out_fail; + } + + ret = relocate_targets(file->fs_path.fs_dir, highest_agno); + if (ret) { + fprintf(stderr, +_("Failed to relocate all targets out of AG %d: %s\n"), + target_agno, strerror(-ret)); + goto out_fail; + } + + return 0; +out_fail: + exitcode = 1; + return 0; +} + +static void +relocate_help(void) +{ + printf(_( +"\n" +"Relocate all the user data and metadata in an AG.\n" +"\n" +"This function will discover all the relocatable objects in a single AG and\n" +"move them to a lower AG as preparation for a shrink operation.\n" +"\n" +" -a <agno> Allocation group to empty\n" +" -h <agno> Highest target AG allowed to relocate into\n" +"\n")); + +} + +void +relocate_init(void) +{ + relocate_cmd.name = "relocate"; + relocate_cmd.altname = "relocate"; + relocate_cmd.cfunc = relocate_f; + relocate_cmd.argmin = 2; + relocate_cmd.argmax = 4; + relocate_cmd.args = "-a agno [-h agno]"; + 
relocate_cmd.flags = CMD_FLAG_ONESHOT; + relocate_cmd.oneline = _("Relocate data in an AG."); + relocate_cmd.help = relocate_help; + + add_command(&relocate_cmd); +} diff --git a/spaceman/relocation.h b/spaceman/relocation.h index f05a871915da42..d4c71b7bb7f054 100644 --- a/spaceman/relocation.h +++ b/spaceman/relocation.h @@ -43,4 +43,9 @@ struct inode_path { */ #define UNLINKED_IPATH ((struct inode_path *)1) +int find_relocation_targets(xfs_agnumber_t agno); +int relocate_file_to_ag(const char *mnt, const char *path, struct xfs_fd *xfd, + xfs_agnumber_t agno); +int resolve_target_paths(const char *mntpt); + #endif /* XFS_SPACEMAN_RELOCATION_H_ */ diff --git a/spaceman/space.h b/spaceman/space.h index cffb1882153a18..8c2b3e5464dee6 100644 --- a/spaceman/space.h +++ b/spaceman/space.h @@ -41,5 +41,6 @@ extern void health_init(void); void move_inode_init(void); void find_owner_init(void); void resolve_owner_init(void); +void relocate_init(void); #endif /* XFS_SPACEMAN_SPACE_H_ */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
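The destination selection in relocate_targets() and relocate_targets_to_ag() above can be looked at in isolation. The sketch below is a hypothetical model, not spaceman code: ag_free[] and file_len[] are made-up arrays standing in for per-AG free space and the list of relocation targets, and sim_move() stands in for relocate_file_to_ag(). It only illustrates the control flow, assuming that -ENOSPC means "retry this file in the next AG" and that any other error is fatal.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for relocate_file_to_ag(): consume space in one AG. */
int sim_move(uint64_t *ag_free, uint32_t agno, uint64_t len)
{
	if (ag_free[agno] < len)
		return -ENOSPC;
	ag_free[agno] -= len;
	return 0;
}

/*
 * Move every file into AGs 0..highest_agno, lowest AG first.  ag_free[]
 * must have at least highest_agno + 1 entries.  placed[i] records the AG
 * that file i landed in.  Returns 0 on success, or -ENOSPC if some file
 * fits in none of the allowed AGs.
 */
int sim_relocate(uint64_t *ag_free, uint32_t highest_agno,
		 const uint64_t *file_len, size_t nr_files, uint32_t *placed)
{
	size_t		next = 0;	/* first file that still needs moving */
	uint32_t	agno;

	for (agno = 0; agno <= highest_agno && next < nr_files; agno++) {
		while (next < nr_files) {
			int ret = sim_move(ag_free, agno, file_len[next]);

			if (ret == -ENOSPC)
				break;		/* retry this file in the next AG */
			if (ret)
				return ret;	/* any other error is fatal */
			placed[next++] = agno;
		}
	}
	return next == nr_files ? 0 : -ENOSPC;
}
```

Note that, like the loop in the patch, this model never goes back to a lower AG once it has moved on, so a small file that arrives late may land in a higher AG than strictly necessary.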
* [PATCH 11/11] spaceman: move inodes with hardlinks
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
` (9 preceding siblings ...)
2024-12-31 23:47 ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong
@ 2024-12-31 23:47 ` Darrick J. Wong
10 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: dchinner, linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When an inode to be moved to a different AG has multiple hard links, we
need to "move" all the hard links, too. To do this, we need to create
temporary hardlinks to the new file, and then use rename exchange to
swap all the hardlinks that point to the old inode with new hardlinks
that point to the new inode.

We already know that an inode has hard links via the path discovery,
and we can check it against the link count that is reported for the
inode before we start building the link farm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 spaceman/find_owner.c |   13 +----
 spaceman/move_inode.c |  119 +++++++++++++++++++++++++++++++++++++++++++++----
 spaceman/relocation.c |   35 ++++++++++----
 spaceman/relocation.h |    6 ++
 4 files changed, 140 insertions(+), 33 deletions(-)

diff --git a/spaceman/find_owner.c b/spaceman/find_owner.c
index 8e93145539a227..1984d0ee7ca5f6 100644
--- a/spaceman/find_owner.c
+++ b/spaceman/find_owner.c
@@ -240,7 +240,6 @@ resolve_owner_cb(
 	struct FTW		*data)
 {
 	struct inode_path	*ipath, *slot_ipath;
-	int			pathlen;
 	struct inode_path	**slot;
 
 	/*
@@ -260,17 +259,9 @@ _("Failed to obtain stat(2) information from path %s. Aborting\n"),
 	}
 
 	/* Allocate a new inode path and record the path in it.
*/ - pathlen = strlen(path); - ipath = calloc(1, sizeof(*ipath) + pathlen + 1); - if (!ipath) { - fprintf(stderr, -_("Aborting: Storing path %s for inode 0x%lx failed: %s\n"), - path, stat->st_ino, strerror(ENOMEM)); + ipath = ipath_alloc(path, stat); + if (!ipath) return -ENOMEM; - } - INIT_LIST_HEAD(&ipath->path_list); - memcpy(&ipath->path[0], path, pathlen); - ipath->ino = stat->st_ino; /* * If the slot contains the inode number we just looked up, then we diff --git a/spaceman/move_inode.c b/spaceman/move_inode.c index ab3c12f5de987b..3a182929579e45 100644 --- a/spaceman/move_inode.c +++ b/spaceman/move_inode.c @@ -36,12 +36,14 @@ create_tmpfile( struct xfs_fd *xfd, xfs_agnumber_t agno, char **tmpfile, - int *tmpfd) + int *tmpfd, + int link_count) { char name[PATH_MAX + 1]; + char linkname[PATH_MAX + 1]; mode_t mask; int fd; - int i; + int i, j; int ret; /* construct tmpdir */ @@ -105,14 +107,36 @@ create_tmpfile( fprintf(stderr, _("cannot create tmpfile: %s: %s\n"), name, strerror(errno)); ret = -errno; + goto out_cleanup_dir; } + /* Create hard links to temporary file. 
*/ + for (j = link_count; j > 1; j--) { + snprintf(linkname, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d.hardlink.%d", mnt, i, getpid(), j); + ret = link(name, linkname); + if (ret < 0) { + fprintf(stderr, _("cannot create hardlink: %s: %s\n"), + linkname, strerror(errno)); + ret = -errno; + goto out_cleanup_links; + } + } + + /* return name and fd */ (void)umask(mask); *tmpfd = fd; *tmpfile = strdup(name); return 0; + +out_cleanup_links: + for (; j <= link_count; j++) { + snprintf(linkname, PATH_MAX, "%s/.spaceman/dir%d/tmpfile.%d.hardlink.%d", mnt, i, getpid(), j); + unlink(linkname); + } + close(fd); + unlink(name); out_cleanup_dir: snprintf(name, PATH_MAX, "%s/.spaceman", mnt); rmdir(name); @@ -405,21 +429,53 @@ exchange_inodes( return 0; } +static int +exchange_hardlinks( + struct inode_path *ipath, + const char *tmpfile) +{ + char linkname[PATH_MAX]; + struct inode_path *linkpath; + int i = 2; + int ret; + + list_for_each_entry(linkpath, &ipath->path_list, path_list) { + if (i > ipath->link_count) { + fprintf(stderr, "ipath link count mismatch!\n"); + return 0; + } + + snprintf(linkname, PATH_MAX, "%s.hardlink.%d", tmpfile, i++); + ret = renameat2(AT_FDCWD, linkname, + AT_FDCWD, linkpath->path, RENAME_EXCHANGE); + if (ret) { + fprintf(stderr, + "failed to exchange hard link %s with %s: %s\n", + linkname, linkpath->path, strerror(errno)); + return -errno; + } + } + return 0; +} + int relocate_file_to_ag( const char *mnt, - const char *path, + struct inode_path *ipath, struct xfs_fd *xfd, xfs_agnumber_t agno) { int ret; int tmpfd = -1; char *tmpfile = NULL; + int i; - fprintf(stderr, "move mnt %s, path %s, agno %d\n", mnt, path, agno); + fprintf(stderr, "move mnt %s, path %s, agno %d\n", + mnt, ipath->path, agno); /* create temporary file in agno */ - ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd); + ret = create_tmpfile(mnt, xfd, agno, &tmpfile, &tmpfd, + ipath->link_count); if (ret) return ret; @@ -444,12 +500,28 @@ relocate_file_to_ag( goto out_cleanup; 
/* swap the inodes over */ - ret = exchange_inodes(xfd, tmpfd, tmpfile, path); + ret = exchange_inodes(xfd, tmpfd, tmpfile, ipath->path); + if (ret) + goto out_cleanup; + + /* swap the hard links over */ + ret = exchange_hardlinks(ipath, tmpfile); + if (ret) + goto out_cleanup; out_cleanup: if (ret == -1) ret = -errno; + /* remove old hard links */ + for (i = 2; i <= ipath->link_count; i++) { + char linkname[PATH_MAX + 256]; // anti-warning-crap + + snprintf(linkname, PATH_MAX + 256, "%s.hardlink.%d", tmpfile, i); + unlink(linkname); + } + + /* remove tmpfile */ close(tmpfd); if (tmpfile) unlink(tmpfile); @@ -458,11 +530,32 @@ relocate_file_to_ag( return ret; } +static int +build_ipath( + const char *path, + struct stat *st, + struct inode_path **ipathp) +{ + struct inode_path *ipath; + + *ipathp = NULL; + + ipath = ipath_alloc(path, st); + if (!ipath) + return -ENOMEM; + + /* we only move a single path with move_inode */ + ipath->link_count = 1; + *ipathp = ipath; + return 0; +} + static int move_inode_f( int argc, char **argv) { + struct inode_path *ipath = NULL; void *fshandle; size_t fshdlen; xfs_agnumber_t agno = 0; @@ -511,24 +604,30 @@ _("Destination AG %d does not exist. 
Filesystem only has %d AGs\n"), goto exit_fail; } - if (S_ISREG(st.st_mode)) { - ret = relocate_file_to_ag(file->fs_path.fs_dir, file->name, - &file->xfd, agno); - } else { + if (!S_ISREG(st.st_mode)) { fprintf(stderr, _("Unsupported: %s is not a regular file.\n"), file->name); goto exit_fail; } + ret = build_ipath(file->name, &st, &ipath); + if (ret) + goto exit_fail; + + ret = relocate_file_to_ag(file->fs_path.fs_dir, ipath, + &file->xfd, agno); if (ret) { fprintf(stderr, _("Failed to move inode to AG %d: %s\n"), agno, strerror(-ret)); goto exit_fail; } + free(ipath); fshandle_destroy(); return 0; exit_fail: + if (ipath) + free(ipath); fshandle_destroy(); exitcode = 1; return 0; diff --git a/spaceman/relocation.c b/spaceman/relocation.c index 7b125cc0ae12b0..b0960272168510 100644 --- a/spaceman/relocation.c +++ b/spaceman/relocation.c @@ -318,6 +318,30 @@ forget_reloc_ino( static struct cmdinfo relocate_cmd; +struct inode_path * +ipath_alloc( + const char *path, + const struct stat *stat) +{ + struct inode_path *ipath; + int pathlen = strlen(path); + + /* Allocate a new inode path and record the path in it. */ + ipath = calloc(1, sizeof(*ipath) + pathlen + 1); + if (!ipath) { + fprintf(stderr, +_("Failed to allocate ipath %s for inode 0x%llx: %s\n"), + path, (unsigned long long)stat->st_ino, + strerror(errno)); + return NULL; + } + INIT_LIST_HEAD(&ipath->path_list); + memcpy(&ipath->path[0], path, pathlen); + ipath->ino = stat->st_ino; + + return ipath; +} + static int relocate_targets_to_ag( const char *mnt, @@ -336,15 +360,6 @@ relocate_targets_to_ag( if (!ipath) break; - /* XXX: don't handle hard link cases yet */ - if (ipath->link_count > 1) { - fprintf(stderr, - "FIXME! 
Skipping hardlinked inode at path %s\n", - ipath->path); - goto next; - } - - ret = stat(ipath->path, &st); if (ret) { fprintf(stderr, _("stat(%s) failed: %s\n"), @@ -367,7 +382,7 @@ relocate_targets_to_ag( } /* move to destination AG */ - ret = relocate_file_to_ag(mnt, ipath->path, &xfd, dst_agno); + ret = relocate_file_to_ag(mnt, ipath, &xfd, dst_agno); xfd_close(&xfd); /* diff --git a/spaceman/relocation.h b/spaceman/relocation.h index d4c71b7bb7f054..2c807aa678ec5b 100644 --- a/spaceman/relocation.h +++ b/spaceman/relocation.h @@ -43,9 +43,11 @@ struct inode_path { */ #define UNLINKED_IPATH ((struct inode_path *)1) +struct inode_path *ipath_alloc(const char *path, const struct stat *st); + int find_relocation_targets(xfs_agnumber_t agno); -int relocate_file_to_ag(const char *mnt, const char *path, struct xfs_fd *xfd, - xfs_agnumber_t agno); +int relocate_file_to_ag(const char *mnt, struct inode_path *ipath, + struct xfs_fd *xfd, xfs_agnumber_t agno); int resolve_target_paths(const char *mntpt); #endif /* XFS_SPACEMAN_RELOCATION_H_ */ ^ permalink raw reply related [flat|nested] 110+ messages in thread
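The "link farm" idea from this patch can be modelled without touching a filesystem. The sketch below is purely illustrative and not spaceman code: struct sim_dentry and the sim_*() helpers are hypothetical, and sim_exchange() only imitates the visible effect of renameat2() with RENAME_EXCHANGE, which atomically swaps what two existing names refer to. The point is that one exchange per name leaves every original path on the new inode and every temporary name on the old one, ready to be unlinked.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory directory entry: a name bound to an inode number. */
struct sim_dentry {
	const char	*name;
	unsigned long	ino;
};

/* Find a directory entry by name; returns NULL if it is absent. */
struct sim_dentry *sim_lookup(struct sim_dentry *dir, size_t nr,
			      const char *name)
{
	size_t		i;

	for (i = 0; i < nr; i++)
		if (strcmp(dir[i].name, name) == 0)
			return &dir[i];
	return NULL;
}

/*
 * Model of an exchange rename: both names must exist, and afterwards each
 * name refers to the inode that the other one used to name.
 */
int sim_exchange(struct sim_dentry *dir, size_t nr,
		 const char *a, const char *b)
{
	struct sim_dentry	*da = sim_lookup(dir, nr, a);
	struct sim_dentry	*db = sim_lookup(dir, nr, b);
	unsigned long		tmp;

	if (!da || !db)
		return -1;
	tmp = da->ino;
	da->ino = db->ino;
	db->ino = tmp;
	return 0;
}

/*
 * Swap a whole link farm: old_names[i] is exchanged with tmp_names[i] for
 * every link.  Stops at the first failure so the caller can clean up.
 */
int sim_exchange_links(struct sim_dentry *dir, size_t nr,
		       const char **old_names, const char **tmp_names,
		       size_t nr_links)
{
	size_t		i;

	for (i = 0; i < nr_links; i++)
		if (sim_exchange(dir, nr, old_names[i], tmp_names[i]))
			return -1;
	return 0;
}
```

The model makes one requirement easy to see: the temporary names used when the links are created must be exactly the names used when they are exchanged, otherwise the exchange fails on a missing entry.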
* [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (7 preceding siblings ...)
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
2024-12-31 23:47 ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
` (20 more replies)
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (6 subsequent siblings)
15 siblings, 21 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive. This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically. This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing. It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: create hooks for monitoring health updates
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: add media error reporting ioctl
 * xfs_io: monitor filesystem health events
 * xfs_io: add a media error reporting command
 * xfs_scrubbed: create daemon to listen for health events
 * xfs_scrubbed: check events against schema
 * xfs_scrubbed: enable repairing filesystems
 * xfs_scrubbed: check for fs features needed for effective repairs
 * xfs_scrubbed: use getparents to look up file names
 * builddefs: refactor udev directory specification
 * xfs_scrubbed: create a background monitoring service
 * xfs_scrubbed: don't start service if kernel support unavailable
 * xfs_scrubbed: use the autofsck fsproperty to select mode
 * xfs_scrub: report media scrub failures to the kernel
 * debian: enable xfs_scrubbed on the root filesystem by default
---
 configure.ac                     |    2
 debian/control                   |    2
 debian/postinst                  |    8
 debian/prerm                     |   13
 include/builddefs.in             |    3
 io/Makefile                      |    1
 io/healthmon.c                   |  183 ++++++
 io/init.c                        |    1
 io/io.h                          |    1
 io/shutdown.c                    |  113 ++++
 libxfs/Makefile                  |   10
 libxfs/xfs_fs.h                  |   31 +
 libxfs/xfs_health.h              |   47 ++
 libxfs/xfs_healthmon.schema.json |  595 ++++++++++++++++++++
 m4/package_services.m4           |   30 +
 man/man8/xfs_io.8                |   46 ++
 scrub/Makefile                   |   34 +
 scrub/phase6.c                   |   25 +
 scrub/xfs_scrubbed.in            | 1106 ++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrubbed.rules         |    7
 scrub/xfs_scrubbed@.service.in   |  104 ++++
 scrub/xfs_scrubbed_start         |   17 +
 22 files changed, 2354 insertions(+), 25 deletions(-)
 create mode 100644 debian/prerm
 create mode 100644 io/healthmon.c
 create mode 100644 libxfs/xfs_healthmon.schema.json
 create mode 100644 scrub/xfs_scrubbed.in
 create mode 100644 scrub/xfs_scrubbed.rules
 create mode 100644 scrub/xfs_scrubbed@.service.in
 create mode 100755 scrub/xfs_scrubbed_start

^ permalink raw reply [flat|nested] 110+ messages in thread
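The cover letter's requirement that events be queued "without stalling the kernel if the twf client program is nonresponsive" implies a producer that never waits for the reader. The following is a minimal userspace sketch of that idea and not the kernel implementation: the sim_* names and the queue depth of 4 are assumptions made for illustration. When the queue is full, the producer drops the event and counts it, so the consumer can later be told that events were lost.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_QUEUE_DEPTH	4	/* hypothetical limit, not taken from the patches */

struct sim_event {
	int	type;		/* opaque event code for this sketch */
};

struct sim_queue {
	struct sim_event	ev[SIM_QUEUE_DEPTH];
	size_t			head;	/* index of the oldest event */
	size_t			count;	/* events currently queued */
	unsigned long		lost;	/* events dropped since last report */
};

/* Producer side: never blocks; a full queue just bumps the lost counter. */
bool sim_queue_push(struct sim_queue *q, struct sim_event ev)
{
	if (q->count == SIM_QUEUE_DEPTH) {
		q->lost++;
		return false;
	}
	q->ev[(q->head + q->count) % SIM_QUEUE_DEPTH] = ev;
	q->count++;
	return true;
}

/* Consumer side: returns false when the queue is empty. */
bool sim_queue_pop(struct sim_queue *q, struct sim_event *ev)
{
	if (q->count == 0)
		return false;
	*ev = q->ev[q->head];
	q->head = (q->head + 1) % SIM_QUEUE_DEPTH;
	q->count--;
	return true;
}

/* Report and reset the number of dropped events. */
unsigned long sim_queue_take_lost(struct sim_queue *q)
{
	unsigned long	n = q->lost;

	q->lost = 0;
	return n;
}
```

A real implementation also needs locking and a wakeup path for the reader; both are left out here to keep the drop-and-count behaviour visible.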
* [PATCH 01/21] xfs: create hooks for monitoring health updates
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:47 ` Darrick J. Wong
2024-12-31 23:48 ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
` (19 subsequent siblings)
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:47 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_health.h |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates. The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* __XFS_HEALTH_H__ */

^ permalink raw reply related [flat|nested] 110+ messages in thread
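The add/del/setup trio declared in this patch follows the usual notifier pattern: a callback is registered on a chain and is later invoked with an event type plus a parameter block. The sketch below is a userspace toy under that assumption, not the kernel's struct xfs_hook machinery. Only enum xfs_health_update_type is copied from the patch; struct sim_health_params is a trimmed-down stand-in for struct xfs_health_update_params, and every sim_* name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copied from the patch above. */
enum xfs_health_update_type {
	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
};

/* Hypothetical, trimmed-down stand-in for struct xfs_health_update_params. */
struct sim_health_params {
	uint32_t	group;
	unsigned int	old_mask;
	unsigned int	new_mask;
};

typedef int (*sim_hook_fn)(enum xfs_health_update_type type,
			   const struct sim_health_params *p, void *priv);

struct sim_hook {
	sim_hook_fn	fn;
	void		*priv;
	struct sim_hook	*next;
};

struct sim_hook_chain {
	struct sim_hook	*first;
};

/* Register a hook at the front of the chain. */
void sim_hook_add(struct sim_hook_chain *chain, struct sim_hook *hook)
{
	hook->next = chain->first;
	chain->first = hook;
}

/* Unregister a hook; does nothing if it is not on the chain. */
void sim_hook_del(struct sim_hook_chain *chain, struct sim_hook *hook)
{
	struct sim_hook	**pp;

	for (pp = &chain->first; *pp; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			hook->next = NULL;
			return;
		}
	}
}

/* Invoke every registered hook; returns how many were called. */
int sim_hook_call(const struct sim_hook_chain *chain,
		  enum xfs_health_update_type type,
		  const struct sim_health_params *p)
{
	const struct sim_hook	*h;
	int			nr = 0;

	for (h = chain->first; h; h = h->next) {
		h->fn(type, p, h->priv);
		nr++;
	}
	return nr;
}

/* Example callback: count runtime corruption reports per chain user. */
struct sim_sick_stats {
	unsigned int	nr_sick;
	uint32_t	last_group;
};

int sim_count_sick(enum xfs_health_update_type type,
		   const struct sim_health_params *p, void *priv)
{
	struct sim_sick_stats	*stats = priv;

	if (type == XFS_HEALTHUP_SICK) {
		stats->nr_sick++;
		stats->last_group = p->group;
	}
	return 0;
}
```

The callback deliberately ignores old_mask and new_mask, because this patch does not spell out how the two masks relate to each other.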
* [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:47 ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2024-12-31 23:48 ` Darrick J. Wong
2024-12-31 23:48 ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
` (18 subsequent siblings)
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_fs.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index f4128dbdf3b9a2..d1a81b02a1a3f3 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1100,6 +1100,13 @@ struct xfs_map_freesp {
 	__u64	pad;		/* must be zero */
 };
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1141,6 +1148,7 @@ struct xfs_map_freesp {
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_GETFSREFCOUNTS	_IOWR('X', 66, struct xfs_getfsrefs_head)
 #define XFS_IOC_MAP_FREESP	_IOW ('X', 67, struct xfs_map_freesp)
+#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s

^ permalink raw reply related [flat|nested] 110+ messages in thread
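From userspace, calling the new ioctl would presumably look something like the sketch below. The structure layout and the ioctl number are copied from the patch, with stdint types substituted for the __u64/__u8 kernel types so the example does not depend on an updated xfs_fs.h. The helper names are hypothetical, and treating the ioctl return value as the new file descriptor is an assumption based on the commit message ("installs a file descriptor"), not something this patch shows directly.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copy of the structure from the patch, using stdint types. */
struct xfs_health_monitor {
	uint64_t	flags;		/* flags */
	uint8_t		format;		/* output format */
	uint8_t		pad1[7];	/* zeroes */
	uint64_t	pad2[2];	/* zeroes */
};

#define XFS_IOC_HEALTH_MONITOR	_IOW('X', 68, struct xfs_health_monitor)

/* Zero the whole request so both pad fields are zero, then set the inputs. */
void health_monitor_init(struct xfs_health_monitor *hm, uint64_t flags,
			 uint8_t format)
{
	memset(hm, 0, sizeof(*hm));
	hm->flags = flags;
	hm->format = format;
}

/*
 * Ask the kernel for a health monitoring fd on the filesystem that mnt_fd
 * refers to.  Returns whatever ioctl() returns: assumed to be the new fd
 * on success, and -1 with errno set on failure.
 */
int health_monitor_open(int mnt_fd, uint64_t flags, uint8_t format)
{
	struct xfs_health_monitor	hm;

	health_monitor_init(&hm, flags, format);
	return ioctl(mnt_fd, XFS_IOC_HEALTH_MONITOR, &hm);
}
```

On a kernel without these patches the call is expected to fail, so a caller should check for -1 before reading from the result.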
* [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:47 ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-12-31 23:48 ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2024-12-31 23:48 ` Darrick J. Wong
2024-12-31 23:48 ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong
` (17 subsequent siblings)
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the basic infrastructure that we need to report health events to
userspace. We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.

Here, we've chosen json to export information to userspace. The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time. Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/xfs_fs.h                  |    8 +++++
 libxfs/xfs_healthmon.schema.json |   63 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)
 create mode 100644 libxfs/xfs_healthmon.schema.json

diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index d1a81b02a1a3f3..d7404e6efd866d 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -1107,6 +1107,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL	(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json
new file mode 100644
index 00000000000000..9772efe25f193d
--- /dev/null
+++ b/libxfs/xfs_healthmon.schema.json
@@ -0,0 +1,63 @@
+{
+	"$comment": [
+		"SPDX-License-Identifier: GPL-2.0-or-later",
+		"Copyright (c) 2024-2025 Oracle. All Rights Reserved.",
+		"Author: Darrick J. Wong <djwong@kernel.org>",
+		"",
+		"This schema file describes the format of the json objects",
+		"readable from the fd returned by the XFS_IOC_HEALTH_MONITOR",
+		"ioctl."
+	],
+
+	"$schema": "https://json-schema.org/draft/2020-12/schema",
+	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
+
+	"title": "XFS Health Monitoring Events",
+
+	"$comment": "Events must be one of the following types:",
+	"oneOf": [
+		{
+			"$ref": "#/$events/lost"
+		}
+	],
+
+	"$comment": "Simple data types are defined here.",
+	"$defs": {
+		"time_ns": {
+			"title": "Time of Event",
+			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Event types are defined here.",
+	"$events": {
+		"lost": {
+			"title": "Health Monitoring Events Lost",
+			"$comment": [
+				"Previous health monitoring events were",
+				"dropped due to memory allocation failures",
+				"or queue limits."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "lost"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		}
+	}
+}

^ permalink raw reply related [flat|nested] 110+ messages in thread
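To make the "lost" event in the schema concrete, here is a small sketch that formats one such object. It is not the kernel's formatter: format_lost_event() is a hypothetical helper, and the key order and spacing are arbitrary choices. What it takes from the schema is the set of required keys (type, time_ns, domain), the two constant values ("lost" and "mount"), and the fact that time_ns is an integer count of nanoseconds.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a "lost" event carrying the three keys the schema requires.
 * Returns the number of bytes written (not counting the trailing NUL),
 * or -1 if the output did not fit in buf.
 */
int format_lost_event(char *buf, size_t len, uint64_t time_ns)
{
	int	n;

	n = snprintf(buf, len,
		"{\"type\": \"lost\", \"time_ns\": %" PRIu64
		", \"domain\": \"mount\"}",
		time_ns);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

Because the schema only lists required keys and does not forbid extras, a consumer written against it should tolerate additional keys appearing in later versions, which is the flexibility the commit message argues for.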
* [PATCH 04/21] xfs: report metadata health events through healthmon 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (2 preceding siblings ...) 2024-12-31 23:48 ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong @ 2024-12-31 23:48 ` Darrick J. Wong 2024-12-31 23:49 ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong ` (16 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:48 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Set up a metadata health event hook so that we can send events to userspace as we collect information. The unmount hook severs the weak reference between the health monitor and the filesystem it's monitoring; when this happens, we stop reporting events because there's no longer any point. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_healthmon.schema.json | 328 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 328 insertions(+) diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json index 9772efe25f193d..154ea0228a3615 100644 --- a/libxfs/xfs_healthmon.schema.json +++ b/libxfs/xfs_healthmon.schema.json @@ -18,6 +18,18 @@ "oneOf": [ { "$ref": "#/$events/lost" + }, + { + "$ref": "#/$events/fs_metadata" + }, + { + "$ref": "#/$events/rtgroup_metadata" + }, + { + "$ref": "#/$events/perag_metadata" + }, + { + "$ref": "#/$events/inode_metadata" } ], @@ -27,6 +39,169 @@ "title": "Time of Event", "description": "Timestamp of the event, in nanoseconds since the Unix epoch.", "type": "integer" + }, + "xfs_agnumber_t": { + "description": "Allocation group number", + "type": "integer", + "minimum": 0, + "maximum": 2147483647 + }, + "xfs_rgnumber_t": { + "description": "Realtime allocation group number", + "type": "integer", + "minimum": 0, + "maximum": 2147483647 + }, + "xfs_ino_t": { + 
"description": "Inode number", + "type": "integer", + "minimum": 1 + }, + "i_generation": { + "description": "Inode generation number", + "type": "integer" + } + }, + + "$comment": "Filesystem metadata event data are defined here.", + "$metadata": { + "status": { + "description": "Metadata health status", + "$comment": [ + "One of:", + "", + " * sick: metadata corruption discovered", + " during a runtime operation.", + " * corrupt: corruption discovered during", + " an xfs_scrub run.", + " * healthy: metadata object was found to be", + " ok by xfs_scrub." + ], + "enum": [ + "sick", + "corrupt", + "healthy" + ] + }, + "fs": { + "description": [ + "Metadata structures that affect the entire", + "filesystem. Options include:", + "", + " * fscounters: summary counters", + " * usrquota: user quota records", + " * grpquota: group quota records", + " * prjquota: project quota records", + " * quotacheck: quota counters", + " * nlinks: file link counts", + " * metadir: metadata directory", + " * metapath: metadata inode paths" + ], + "enum": [ + "fscounters", + "grpquota", + "metadir", + "metapath", + "nlinks", + "prjquota", + "quotacheck", + "usrquota" + ] + }, + "perag": { + "description": [ + "Metadata structures owned by allocation", + "groups on the data device. 
Options include:", + "", + " * agf: group space header", + " * agfl: per-group free block list", + " * agi: group inode header", + " * bnobt: free space by position btree", + " * cntbt: free space by length btree", + " * finobt: free inode btree", + " * inobt: inode btree", + " * rmapbt: reverse mapping btree", + " * refcountbt: reference count btree", + " * inodes: problems were recorded for", + " this group's inodes, but the", + " inodes themselves had to be", + " reclaimed.", + " * super: superblock" + ], + "enum": [ + "agf", + "agfl", + "agi", + "bnobt", + "cntbt", + "finobt", + "inobt", + "inodes", + "refcountbt", + "rmapbt", + "super" + ] + }, + "rtgroup": { + "description": [ + "Metadata structures owned by allocation", + "groups on the realtime volume. Options", + "include:", + "", + " * bitmap: free space bitmap contents", + " for this group", + " * summary: realtime free space summary file", + " * rmapbt: reverse mapping btree", + " * refcountbt: reference count btree", + " * super: group superblock" + ], + "enum": [ + "bitmap", + "summary", + "refcountbt", + "rmapbt", + "super" + ] + }, + "inode": { + "description": [ + "Metadata structures owned by file inodes.", + "Options include:", + "", + " * bmapbta: attr fork", + " * bmapbtc: cow fork", + " * bmapbtd: data fork", + " * core: inode record", + " * directory: directory entries", + " * dirtree: directory tree problems detected", + " * parent: directory parent pointer", + " * symlink: symbolic link target", + " * xattr: extended attributes", + "", + "These are set when an inode record repair had", + "to drop the corresponding data structure to", + "get the inode back to a consistent state.", + "", + " * bmapbtd_zapped", + " * bmapbta_zapped", + " * directory_zapped", + " * symlink_zapped" + ], + "enum": [ + "bmapbta", + "bmapbta_zapped", + "bmapbtc", + "bmapbtd", + "bmapbtd_zapped", + "core", + "directory", + "directory_zapped", + "dirtree", + "parent", + "symlink", + "symlink_zapped", + "xattr" + ] } 
}, @@ -58,6 +233,159 @@ "time_ns", "domain" ] + }, + "fs_metadata": { + "title": "Filesystem-wide metadata event", + "description": [ + "Health status updates for filesystem-wide", + "metadata objects." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "fs" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/fs" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "structures" + ] + }, + "perag_metadata": { + "title": "Data device allocation group metadata event", + "description": [ + "Health status updates for data device ", + "allocation group metadata." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "perag" + }, + "group": { + "$ref": "#/$defs/xfs_agnumber_t" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/perag" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "group", + "structures" + ] + }, + "rtgroup_metadata": { + "title": "Realtime allocation group metadata event", + "description": [ + "Health status updates for realtime allocation", + "group metadata." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "rtgroup" + }, + "group": { + "$ref": "#/$defs/xfs_rgnumber_t" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/rtgroup" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "group", + "structures" + ] + }, + "inode_metadata": { + "title": "Inode metadata event", + "description": [ + "Health status updates for inode metadata.", + "The inode and generation number describe the", + "file that is affected by the change." 
+ ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$metadata/status" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "inode" + }, + "inumber": { + "$ref": "#/$defs/xfs_ino_t" + }, + "generation": { + "$ref": "#/$defs/i_generation" + }, + "structures": { + "type": "array", + "items": { + "$ref": "#/$metadata/inode" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "inumber", + "generation", + "structures" + ] } } } ^ permalink raw reply related [flat|nested] 110+ messages in thread
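For readers who want to poke at the metadata events without a JSON Schema library, the "required" arrays and the status enum from the hunks above can be checked by hand. This is only a rough sketch, not part of the patch: the function name check_metadata_event and the sample event are made up for illustration, and it checks far less than the real schema does.

```python
# Required keys per metadata event domain, copied from the "required"
# arrays in the schema hunks above.
REQUIRED = {
    "fs": ["type", "time_ns", "domain", "structures"],
    "perag": ["type", "time_ns", "domain", "group", "structures"],
    "rtgroup": ["type", "time_ns", "domain", "group", "structures"],
    "inode": ["type", "time_ns", "domain", "inumber", "generation",
              "structures"],
}

# Values of the "$metadata/status" enum.
STATUSES = {"sick", "corrupt", "healthy"}


def check_metadata_event(event):
    """Return a list of problems; an empty list means these checks passed."""
    domain = event.get("domain")
    if domain not in REQUIRED:
        return [f"unknown metadata domain: {domain!r}"]

    problems = []
    for key in REQUIRED[domain]:
        if key not in event:
            problems.append(f"missing key: {key}")

    if "type" in event and event["type"] not in STATUSES:
        problems.append(f"bad status: {event['type']!r}")

    # The schema says "structures" is an array with minItems = 1.
    structures = event.get("structures")
    if structures is not None:
        if not isinstance(structures, list) or len(structures) < 1:
            problems.append("structures must be a non-empty array")

    return problems


# Hypothetical event, invented for this example (not captured from a kernel).
sample = {
    "type": "sick",
    "time_ns": 1735688880000000000,
    "domain": "perag",
    "group": 3,
    "structures": ["bnobt", "cntbt"],
}
print(check_metadata_event(sample))
```

A real consumer would presumably validate against xfs_healthmon.schema.json itself; the point here is only to show which keys each domain carries.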
* [PATCH 05/21] xfs: report shutdown events through healthmon 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (3 preceding siblings ...) 2024-12-31 23:48 ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong @ 2024-12-31 23:49 ` Darrick J. Wong 2024-12-31 23:49 ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong ` (15 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Set up a shutdown hook so that we can send notifications to userspace. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_healthmon.schema.json | 62 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json index 154ea0228a3615..a8bc75b0b8c4f9 100644 --- a/libxfs/xfs_healthmon.schema.json +++ b/libxfs/xfs_healthmon.schema.json @@ -30,6 +30,9 @@ }, { "$ref": "#/$events/inode_metadata" + }, + { + "$ref": "#/$events/shutdown" } ], @@ -205,6 +208,31 @@ } }, + "$comment": "Shutdown event data are defined here.", + "$shutdown": { + "reason": { + "description": [ + "Reason for a filesystem to shut down.", + "Options include:", + "", + " * corrupt_incore: in-memory corruption", + " * corrupt_ondisk: on-disk corruption", + " * device_removed: device removed", + " * force_umount: userspace asked for it", + " * log_ioerr: log write IO error", + " * meta_ioerr: metadata writeback IO error" + ], + "enum": [ + "corrupt_incore", + "corrupt_ondisk", + "device_removed", + "force_umount", + "log_ioerr", + "meta_ioerr" + ] + } + }, + "$comment": "Event types are defined here.", "$events": { "lost": { @@ -386,6 +414,40 @@ "generation", "structures" ] + }, + "shutdown": { + "title": "Abnormal Shutdown Event", + "description": [ + "The filesystem went offline due to", 
+ "unrecoverable errors." + ], + "type": "object", + + "properties": { + "type": { + "const": "shutdown" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "mount" + }, + "reasons": { + "type": "array", + "items": { + "$ref": "#/$shutdown/reason" + }, + "minItems": 1 + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "reasons" + ] } } } ^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 06/21] xfs: report media errors through healthmon 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (4 preceding siblings ...) 2024-12-31 23:49 ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong @ 2024-12-31 23:49 ` Darrick J. Wong 2024-12-31 23:49 ` [PATCH 07/21] xfs: report file io " Darrick J. Wong ` (14 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we have hooks to report media errors, connect this to the health monitor as well. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_healthmon.schema.json | 65 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 65 insertions(+) diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json index a8bc75b0b8c4f9..006f4145faa9f5 100644 --- a/libxfs/xfs_healthmon.schema.json +++ b/libxfs/xfs_healthmon.schema.json @@ -33,6 +33,9 @@ }, { "$ref": "#/$events/shutdown" + }, + { + "$ref": "#/$events/media_error" } ], @@ -63,6 +66,31 @@ "i_generation": { "description": "Inode generation number", "type": "integer" + }, + "storage_devs": { + "description": "Storage devices in a filesystem", + "_comment": [ + "One of:", + "", + " * datadev: filesystem device", + " * logdev: external log device", + " * rtdev: realtime volume" + ], + "enum": [ + "datadev", + "logdev", + "rtdev" + ] + }, + "xfs_daddr_t": { + "description": "Storage device address, in units of 512-byte blocks", + "type": "integer", + "minimum": 0 + }, + "bbcount": { + "description": "Storage space length, in units of 512-byte blocks", + "type": "integer", + "minimum": 1 } }, @@ -448,6 +476,43 @@ "domain", "reasons" ] + }, + "media_error": { + "title": "Media Error", + "description": [ + "A storage device reported a media error.", + "The domain element tells us which storage", + "device reported 
the media failure. The", + "daddr and bbcount elements tell us where", + "inside that device the failure was observed." + ], + "type": "object", + + "properties": { + "type": { + "const": "media" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "$ref": "#/$defs/storage_devs" + }, + "daddr": { + "$ref": "#/$defs/xfs_daddr_t" + }, + "bbcount": { + "$ref": "#/$defs/bbcount" + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "daddr", + "bbcount" + ] } } } ^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 07/21] xfs: report file io errors through healthmon 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (5 preceding siblings ...) 2024-12-31 23:49 ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong @ 2024-12-31 23:49 ` Darrick J. Wong 2024-12-31 23:49 ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong ` (13 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Set up a file io error event hook so that we can send events about read errors, writeback errors, and directio errors to userspace. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_healthmon.schema.json | 77 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 77 insertions(+) diff --git a/libxfs/xfs_healthmon.schema.json b/libxfs/xfs_healthmon.schema.json index 006f4145faa9f5..9c1070a629997c 100644 --- a/libxfs/xfs_healthmon.schema.json +++ b/libxfs/xfs_healthmon.schema.json @@ -36,6 +36,9 @@ }, { "$ref": "#/$events/media_error" + }, + { + "$ref": "#/$events/file_ioerror" } ], @@ -67,6 +70,16 @@ "description": "Inode generation number", "type": "integer" }, + "off_t": { + "description": "File position, in bytes", + "type": "integer", + "minimum": 0 + }, + "size_t": { + "description": "File operation length, in bytes", + "type": "integer", + "minimum": 1 + }, "storage_devs": { "description": "Storage devices in a filesystem", "_comment": [ @@ -261,6 +274,26 @@ } }, + "$comment": "File IO event data are defined here.", + "$fileio": { + "types": { + "description": [ + "File I/O operations. One of:", + "", + " * readahead: reads into the page cache.", + " * writeback: writeback of dirty page cache.", + " * dioread: O_DIRECT reads.", + " * diowrite: O_DIRECT writes." 
+ ], + "enum": [ + "readahead", + "writeback", + "dioread", + "diowrite" + ] + } + }, + "$comment": "Event types are defined here.", "$events": { "lost": { @@ -513,6 +546,50 @@ "daddr", "bbcount" ] + }, + "file_ioerror": { + "title": "File I/O error", + "description": [ + "A read or a write to a file failed. The", + "inode, generation, pos, and len fields", + "describe the range of the file that is", + "affected." + ], + "type": "object", + + "properties": { + "type": { + "$ref": "#/$fileio/types" + }, + "time_ns": { + "$ref": "#/$defs/time_ns" + }, + "domain": { + "const": "filerange" + }, + "inumber": { + "$ref": "#/$defs/xfs_ino_t" + }, + "generation": { + "$ref": "#/$defs/i_generation" + }, + "pos": { + "$ref": "#/$defs/off_t" + }, + "len": { + "$ref": "#/$defs/size_t" + } + }, + + "required": [ + "type", + "time_ns", + "domain", + "inumber", + "generation", + "pos", + "len" + ] } } } ^ permalink raw reply related [flat|nested] 110+ messages in thread
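With the file I/O event added, the top-level oneOf now has eight alternatives, and a consumer can tell them apart from the "type" and "domain" keys alone. The following is a sketch under assumptions rather than anything from the patchset: classify is a hypothetical helper, and because the definition of the "lost" event is not shown in these hunks, it is matched on its type only.

```python
# Enum values copied from the schema hunks in patches 04-07.
STATUSES = {"sick", "corrupt", "healthy"}
STORAGE_DEVS = {"datadev", "logdev", "rtdev"}
FILEIO_TYPES = {"readahead", "writeback", "dioread", "diowrite"}
METADATA_DOMAINS = {
    "fs": "fs_metadata",
    "perag": "perag_metadata",
    "rtgroup": "rtgroup_metadata",
    "inode": "inode_metadata",
}


def classify(event):
    """Map an event to the name of its "$events" definition, or None."""
    etype = event.get("type")
    domain = event.get("domain")

    # Assumption: "lost" events are recognizable by type alone.
    if etype == "lost":
        return "lost"
    if etype == "shutdown" and domain == "mount":
        return "shutdown"
    if etype == "media" and domain in STORAGE_DEVS:
        return "media_error"
    if domain == "filerange" and etype in FILEIO_TYPES:
        return "file_ioerror"
    if domain in METADATA_DOMAINS and etype in STATUSES:
        return METADATA_DOMAINS[domain]
    return None


# Hypothetical events, invented for illustration.
print(classify({"type": "media", "domain": "rtdev",
                "time_ns": 0, "daddr": 2048, "bbcount": 8}))
print(classify({"type": "dioread", "domain": "filerange",
                "time_ns": 0, "inumber": 133, "generation": 1,
                "pos": 0, "len": 4096}))
```

Note that the metadata events reuse "type" for the health status while the other events use it for the event kind, so the domain has to be consulted as well.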
* [PATCH 08/21] xfs: add media error reporting ioctl 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (6 preceding siblings ...) 2024-12-31 23:49 ` [PATCH 07/21] xfs: report file io " Darrick J. Wong @ 2024-12-31 23:49 ` Darrick J. Wong 2024-12-31 23:50 ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong ` (12 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:49 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a new privileged ioctl so that xfs_scrub can report media errors to the kernel for further processing. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/xfs_fs.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h index d7404e6efd866d..32e552d40b1bf5 100644 --- a/libxfs/xfs_fs.h +++ b/libxfs/xfs_fs.h @@ -1115,6 +1115,20 @@ struct xfs_health_monitor { /* Return events in JSON format */ #define XFS_HEALTH_MONITOR_FMT_JSON (1) +struct xfs_media_error { + __u64 flags; /* flags */ + __u64 daddr; /* disk address of range */ + __u64 bbcount; /* length, in 512b blocks */ + __u64 pad; /* zero */ +}; + +#define XFS_MEDIA_ERROR_DATADEV (1) /* data device */ +#define XFS_MEDIA_ERROR_LOGDEV (2) /* external log device */ +#define XFS_MEDIA_ERROR_RTDEV (3) /* realtime device */ + +/* bottom byte of flags is the device code */ +#define XFS_MEDIA_ERROR_DEVMASK (0xFF) + /* * ioctl commands that are used by Linux filesystems */ @@ -1157,6 +1171,7 @@ struct xfs_health_monitor { #define XFS_IOC_GETFSREFCOUNTS _IOWR('X', 66, struct xfs_getfsrefs_head) #define XFS_IOC_MAP_FREESP _IOW ('X', 67, struct xfs_map_freesp) #define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor) +#define XFS_IOC_MEDIA_ERROR _IOW ('X', 69, struct xfs_media_error) /* * ioctl commands that replace IRIX syssgi()'s ^ permalink raw reply related 
[flat|nested] 110+ messages in thread
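To make the layout of the new ioctl argument concrete, here is a small sketch that packs struct xfs_media_error and computes the command number by hand. It assumes the common asm-generic ioctl encoding (8 nr bits, 8 type bits, 14 size bits, 2 direction bits, _IOC_WRITE = 1); a few architectures encode ioctls differently, so treat the resulting number as illustrative. The helper names pack_media_error and iow are invented for this example.

```python
import struct

# Device codes and mask from the xfs_fs.h hunk above.
XFS_MEDIA_ERROR_DATADEV = 1
XFS_MEDIA_ERROR_LOGDEV = 2
XFS_MEDIA_ERROR_RTDEV = 3
XFS_MEDIA_ERROR_DEVMASK = 0xFF


def pack_media_error(dev, daddr, bbcount):
    """Pack four __u64 fields: flags, daddr, bbcount, pad (must be zero)."""
    # The bottom byte of flags carries the device code.
    flags = dev & XFS_MEDIA_ERROR_DEVMASK
    return struct.pack("=QQQQ", flags, daddr, bbcount, 0)


def iow(type_char, nr, size):
    """_IOW() under the assumed asm-generic encoding."""
    ioc_write = 1
    return (ioc_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# Hypothetical report: 8 basic blocks starting at daddr 2048 on the rt device.
arg = pack_media_error(XFS_MEDIA_ERROR_RTDEV, daddr=2048, bbcount=8)
XFS_IOC_MEDIA_ERROR = iow("X", 69, len(arg))

print(len(arg))                  # 32
print(hex(XFS_IOC_MEDIA_ERROR))  # 0x40205845
```

Actually issuing the ioctl would need a kernel carrying this patchset and sufficient privileges, so the sketch stops at building the argument.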
* [PATCH 09/21] xfs_io: monitor filesystem health events 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (7 preceding siblings ...) 2024-12-31 23:49 ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong @ 2024-12-31 23:50 ` Darrick J. Wong 2024-12-31 23:50 ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong ` (11 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a subcommand to monitor for health events generated by the kernel. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- io/Makefile | 1 io/healthmon.c | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++ io/init.c | 1 io/io.h | 1 man/man8/xfs_io.8 | 25 +++++++ 5 files changed, 211 insertions(+) create mode 100644 io/healthmon.c diff --git a/io/Makefile b/io/Makefile index c57594b090f70c..451d2a15b25919 100644 --- a/io/Makefile +++ b/io/Makefile @@ -26,6 +26,7 @@ CFILES = \ fsuuid.c \ fsync.c \ getrusage.c \ + healthmon.c \ imap.c \ init.c \ inject.c \ diff --git a/io/healthmon.c b/io/healthmon.c new file mode 100644 index 00000000000000..7d372d7d8c532b --- /dev/null +++ b/io/healthmon.c @@ -0,0 +1,183 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2024-2025 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "libxfs.h" +#include "libfrog/fsgeom.h" +#include "libfrog/paths.h" +#include "command.h" +#include "init.h" +#include "io.h" + +static void +healthmon_help(void) +{ + printf(_( +"Monitor filesystem health events" +"\n" +"-c Replace the open file with the monitor file.\n" +"-d delay_ms Sleep this many milliseconds between reads.\n" +"-p Only probe for the existence of the ioctl.\n" +"-v Request all events.\n" +"\n")); +} + +static inline int +monitor_sleep( + int delay_ms) +{ + struct timespec ts; + + if (!delay_ms) + return 0; + + ts.tv_sec = delay_ms / 1000; + ts.tv_nsec = (delay_ms % 1000) * 1000000; + + return nanosleep(&ts, NULL); +} + +static int +monitor( + size_t bufsize, + bool consume, + int delay_ms, + bool verbose, + bool only_probe) +{ + struct xfs_health_monitor hmo = { + .format = XFS_HEALTH_MONITOR_FMT_JSON, + }; + char *buf; + ssize_t bytes_read; + int mon_fd; + int ret = 1; + + if (verbose) + hmo.flags |= XFS_HEALTH_MONITOR_ALL; + + mon_fd = ioctl(file->fd, XFS_IOC_HEALTH_MONITOR, &hmo); + if (mon_fd < 0) { + perror("XFS_IOC_HEALTH_MONITOR"); + return 1; + } + + if (only_probe) { + ret = 0; + goto out_mon; + } + + buf = malloc(bufsize); + if (!buf) { + perror("malloc"); + goto out_mon; + } + + if (consume) { + close(file->fd); + file->fd = mon_fd; + } + + monitor_sleep(delay_ms); + while ((bytes_read = read(mon_fd, buf, bufsize)) > 0) { + char *write_ptr = buf; + ssize_t bytes_written; + size_t to_write = bytes_read; + + while ((bytes_written = write(STDOUT_FILENO, write_ptr, to_write)) > 0) { + write_ptr += bytes_written; + to_write -= bytes_written; + } + if (bytes_written < 0) { + perror("healthdump"); + goto out_buf; + } + + monitor_sleep(delay_ms); + } + if (bytes_read < 0) { + perror("healthmon"); + goto out_buf; + } + + ret = 0; + +out_buf: + free(buf); +out_mon: + close(mon_fd); + return ret; +} + +static int +healthmon_f( + int argc, + char **argv) +{ + size_t bufsize = 4096; + bool consume = false; 
+ bool verbose = false; + bool only_probe = false; + int delay_ms = 0; + int c; + + while ((c = getopt(argc, argv, "b:cd:pv")) != EOF) { + switch (c) { + case 'b': + errno = 0; + c = atoi(optarg); + if (c <= 0 || errno) { + printf("%s: bufsize must be positive\n", + optarg); + exitcode = 1; + return 0; + } + bufsize = c; + break; + case 'c': + consume = true; + break; + case 'd': + errno = 0; + delay_ms = atoi(optarg); + if (delay_ms < 0 || errno) { + printf("%s: delay must be positive msecs\n", + optarg); + exitcode = 1; + return 0; + } + break; + case 'p': + only_probe = true; + break; + case 'v': + verbose = true; + break; + default: + exitcode = 1; + healthmon_help(); + return 0; + } + } + + return monitor(bufsize, consume, delay_ms, verbose, only_probe); +} + +static struct cmdinfo healthmon_cmd = { + .name = "healthmon", + .cfunc = healthmon_f, + .argmin = 0, + .argmax = -1, + .flags = CMD_FLAG_ONESHOT | CMD_NOMAP_OK, + .args = "[-b bufsize] [-c] [-d delay_ms] [-p] [-v]", + .help = healthmon_help, +}; + +void +healthmon_init(void) +{ + healthmon_cmd.oneline = _("monitor filesystem health events"); + + add_command(&healthmon_cmd); +} diff --git a/io/init.c b/io/init.c index 17b772813bc113..22ebd2f7522a18 100644 --- a/io/init.c +++ b/io/init.c @@ -92,6 +92,7 @@ init_commands(void) crc32cselftest_init(); exchangerange_init(); fsprops_init(); + healthmon_init(); } /* diff --git a/io/io.h b/io/io.h index 7ae7cf90ace323..267f3ffac36924 100644 --- a/io/io.h +++ b/io/io.h @@ -157,3 +157,4 @@ void exchangerange_init(void); void fsprops_init(void); void aginfo_init(void); void fsrefcounts_init(void); +void healthmon_init(void); diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8 index c4d09ce07f597b..632d07807f44f0 100644 --- a/man/man8/xfs_io.8 +++ b/man/man8/xfs_io.8 @@ -1419,6 +1419,31 @@ .SH FILESYSTEM COMMANDS .RE .PD +.TP +.BI "healthmon [ \-b " bufsize " ] [ \-c ] [ \-d " delay_ms " ] [ \-p ] [ \-v ]" +Watch for filesystem health events and write them to the console. 
+.RE +.RS 1.0i +.PD 0 +.TP +.BI "\-b " bufsize +Use a buffer of this size to read events from the kernel. +.TP +.BI \-c +Close the open file and replace it with the monitor file. +.TP +.BI "\-d " delay_ms +Sleep for this long between read attempts. +.TP +.B \-p +Probe for the existence of the functionality by opening the monitoring fd and +closing it immediately. +.TP +.BI \-v +Request all health events, even if nothing changed. +.PD +.RE + .TP .BI "inject [ " tag " ]" Inject errors into a filesystem to observe filesystem behavior at ^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 10/21] xfs_io: add a media error reporting command 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (8 preceding siblings ...) 2024-12-31 23:50 ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong @ 2024-12-31 23:50 ` Darrick J. Wong 2024-12-31 23:50 ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong ` (10 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a subcommand to invoke the media error ioctl to make sure it works. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- io/shutdown.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++ man/man8/xfs_io.8 | 21 ++++++++++ 2 files changed, 133 insertions(+), 1 deletion(-) diff --git a/io/shutdown.c b/io/shutdown.c index 3c29ea790643f8..b4fba7d78ba83b 100644 --- a/io/shutdown.c +++ b/io/shutdown.c @@ -53,6 +53,115 @@ shutdown_help(void) "\n")); } +static void +mediaerror_help(void) +{ + printf(_( +"\n" +" Report a media error on the data device to the filesystem.\n" +"\n" +" -l -- Report against the log device.\n" +" -r -- Report against the realtime device.\n" +"\n" +" offset is the byte offset of the start of the failed range. If offset is\n" +" specified, mapping length may (optionally) be specified as well." +"\n" +" length is the byte length of the failed range.\n" +"\n" +" If neither offset nor length are specified, the media error report will\n" +" be made against the entire device." 
+"\n")); +} + +static int +mediaerror_f( + int argc, + char **argv) +{ + struct xfs_media_error me = { + .daddr = 0, + .bbcount = -1ULL, + .flags = XFS_MEDIA_ERROR_DATADEV, + }; + long long l; + size_t fsblocksize, fssectsize; + int c, ret; + + init_cvtnum(&fsblocksize, &fssectsize); + + while ((c = getopt(argc, argv, "lr")) != EOF) { + switch (c) { + case 'l': + me.flags = (me.flags & ~XFS_MEDIA_ERROR_DEVMASK) | + XFS_MEDIA_ERROR_LOGDEV; + break; + case 'r': + me.flags = (me.flags & ~XFS_MEDIA_ERROR_DEVMASK) | + XFS_MEDIA_ERROR_RTDEV; + break; + default: + mediaerror_help(); + exitcode = 1; + return 0; + } + } + + /* Range start (optional) */ + if (optind < argc) { + l = cvtnum(fsblocksize, fssectsize, argv[optind]); + if (l < 0) { + printf("non-numeric offset argument -- %s\n", + argv[optind]); + exitcode = 1; + return 0; + } + + me.daddr = l / 512; + optind++; + } + + /* Range length (optional if range start was specified) */ + if (optind < argc) { + l = cvtnum(fsblocksize, fssectsize, argv[optind]); + if (l < 0) { + printf("non-numeric len argument -- %s\n", + argv[optind]); + exitcode = 1; + return 0; + } + + me.bbcount = howmany(l, 512); + optind++; + } + + if (optind < argc) { + printf("too many arguments -- %s\n", argv[optind]); + exitcode = 1; + return 0; + } + + ret = ioctl(file->fd, XFS_IOC_MEDIA_ERROR, &me); + if (ret) { + fprintf(stderr, + "%s: ioctl(XFS_IOC_MEDIA_ERROR) [\"%s\"]: %s\n", + progname, file->name, strerror(errno)); + exitcode = 1; + return 0; + } + + return 0; +} + +static struct cmdinfo mediaerror_cmd = { + .name = "mediaerror", + .cfunc = mediaerror_f, + .argmin = 0, + .argmax = -1, + .flags = CMD_FLAG_ONESHOT | CMD_NOMAP_OK, + .args = "[-lr] [offset [length]]", + .help = mediaerror_help, +}; + void shutdown_init(void) { @@ -66,6 +175,8 @@ shutdown_init(void) shutdown_cmd.oneline = _("shuts down the filesystem where the current file resides"); - if (expert) + if (expert) { add_command(&shutdown_cmd); + add_command(&mediaerror_cmd); + } 
} diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8 index 632d07807f44f0..2ca74e6ab57d4e 100644 --- a/man/man8/xfs_io.8 +++ b/man/man8/xfs_io.8 @@ -1452,6 +1452,27 @@ .SH FILESYSTEM COMMANDS argument, displays the list of error tags available. Only available in expert mode and requires privileges. +.TP +.BI "mediaerror [ \-lr ] [ " offset " [ " length " ]]" +Report a media error against the data device of an XFS filesystem. +The +.I offset +and +.I length +parameters are specified in units of bytes. +If neither are specified, the entire device will be reported. +.RE +.RS 1.0i +.PD 0 +.TP +.BI \-l +Report against the log device instead of the data device. +.TP +.BI \-r +Report against the realtime device instead of the data device. +.PD +.RE + .TP .BI "rginfo [ \-r " rgno " ]" Show information about or update the state of realtime allocation groups. ^ permalink raw reply related [flat|nested] 110+ messages in thread
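The byte-to-sector arithmetic in mediaerror_f is worth a quick look: the start is rounded down to a 512-byte block and the length is rounded up on its own. The sketch below mirrors that in Python and, for comparison, shows a variant that covers the whole byte range when the offset is unaligned. Both function names are invented here, and whether the kernel side expects the covering range is not stated in the patch, so the second function is only an assumption about what a caller might want.

```python
BBSIZE = 512  # basic block size used by daddr/bbcount


def howmany(x, y):
    """Ceiling division, like the C howmany() macro."""
    return (x + y - 1) // y


def patch_conversion(offset, length):
    """Mirror mediaerror_f: round start down, round length up independently."""
    return offset // BBSIZE, howmany(length, BBSIZE)


def covering_conversion(offset, length):
    """Alternative: smallest block range covering [offset, offset + length)."""
    daddr = offset // BBSIZE
    end = howmany(offset + length, BBSIZE)
    return daddr, end - daddr


# Aligned input: both conversions agree.
print(patch_conversion(4096, 8192))     # (8, 16)
print(covering_conversion(4096, 8192))  # (8, 16)

# Unaligned input: bytes 700..1099 touch blocks 1 and 2.
print(patch_conversion(700, 400))       # (1, 1)
print(covering_conversion(700, 400))    # (1, 2)
```

For aligned offsets, which is presumably the common case when testing with xfs_io, the two are identical.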
* [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (9 preceding siblings ...) 2024-12-31 23:50 ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong @ 2024-12-31 23:50 ` Darrick J. Wong 2024-12-31 23:50 ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong ` (9 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a daemon program that can listen for and log health events. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/Makefile | 15 ++- scrub/xfs_scrubbed.in | 287 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 300 insertions(+), 2 deletions(-) create mode 100644 scrub/xfs_scrubbed.in diff --git a/scrub/Makefile b/scrub/Makefile index 1e1109048c2a83..bd910922ceb4bb 100644 --- a/scrub/Makefile +++ b/scrub/Makefile @@ -18,6 +18,7 @@ XFS_SCRUB_ALL_PROG = xfs_scrub_all XFS_SCRUB_FAIL_PROG = xfs_scrub_fail XFS_SCRUB_ARGS = -p XFS_SCRUB_SERVICE_ARGS = -b -o autofsck +XFS_SCRUBBED_PROG = xfs_scrubbed ifeq ($(HAVE_SYSTEMD),yes) INSTALL_SCRUB += install-systemd SYSTEMD_SERVICES=\ @@ -108,9 +109,9 @@ endif # Automatically trigger a media scan once per month XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL=1mo -LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) *.service *.cron +LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) *.service *.cron -default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(OPTIONAL_TARGETS) +default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) $(OPTIONAL_TARGETS) xfs_scrub_all: xfs_scrub_all.in $(builddefs) @echo " [SED] $@" @@ -123,6 +124,14 @@ xfs_scrub_all: xfs_scrub_all.in $(builddefs) -e 
"s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@ $(Q)chmod a+x $@ +xfs_scrubbed: xfs_scrubbed.in $(builddefs) + @echo " [SED] $@" + $(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \ + -e "s|@scrub_svcname@|$(scrub_svcname)|g" \ + -e "s|@pkg_version@|$(PKG_VERSION)|g" \ + < $< > $@ + $(Q)chmod a+x $@ + xfs_scrub_fail: xfs_scrub_fail.in $(builddefs) @echo " [SED] $@" $(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \ @@ -165,6 +174,8 @@ install-scrub: default $(INSTALL) -m 755 -d $(PKG_SBIN_DIR) $(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_SBIN_DIR) $(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_SBIN_DIR) + $(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR) + $(INSTALL) -m 755 $(XFS_SCRUBBED_PROG) $(PKG_LIBEXEC_DIR) $(INSTALL) -m 755 -d $(PKG_STATE_DIR) install-udev: $(UDEV_RULES) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in new file mode 100644 index 00000000000000..4d742a9151a082 --- /dev/null +++ b/scrub/xfs_scrubbed.in @@ -0,0 +1,287 @@ +#!/usr/bin/python3 + +# SPDX-License-Identifier: GPL-2.0-or-later +# Copyright (c) 2024-2025 Oracle. All rights reserved. +# +# Author: Darrick J. 
Wong <djwong@kernel.org> + +# Daemon to listen for and react to filesystem health events + +import sys +import os +import argparse +import fcntl +import json +import datetime +import errno +import ctypes +import gc +from concurrent.futures import ProcessPoolExecutor + +debug = False +log = False +everything = False +debug_fast = False +printf_prefix = '' + +# ioctl encoding stuff +_IOC_NRBITS = 8 +_IOC_TYPEBITS = 8 +_IOC_SIZEBITS = 14 +_IOC_DIRBITS = 2 + +_IOC_NRMASK = (1 << _IOC_NRBITS) - 1 +_IOC_TYPEMASK = (1 << _IOC_TYPEBITS) - 1 +_IOC_SIZEMASK = (1 << _IOC_SIZEBITS) - 1 +_IOC_DIRMASK = (1 << _IOC_DIRBITS) - 1 + +_IOC_NRSHIFT = 0 +_IOC_TYPESHIFT = (_IOC_NRSHIFT + _IOC_NRBITS) +_IOC_SIZESHIFT = (_IOC_TYPESHIFT + _IOC_TYPEBITS) +_IOC_DIRSHIFT = (_IOC_SIZESHIFT + _IOC_SIZEBITS) + +_IOC_NONE = 0 +_IOC_WRITE = 1 +_IOC_READ = 2 + +def _IOC(direction, type, nr, t): + assert direction <= _IOC_DIRMASK, direction + assert type <= _IOC_TYPEMASK, type + assert nr <= _IOC_NRMASK, nr + + size = ctypes.sizeof(t) + assert size <= _IOC_SIZEMASK, size + + return (((direction) << _IOC_DIRSHIFT) | + ((type) << _IOC_TYPESHIFT) | + ((nr) << _IOC_NRSHIFT) | + ((size) << _IOC_SIZESHIFT)) + +def _IOR(type, number, size): + return _IOC(_IOC_READ, type, number, size) + +def _IOW(type, number, size): + return _IOC(_IOC_WRITE, type, number, size) + +def _IOWR(type, number, size): + return _IOC(_IOC_READ | _IOC_WRITE, type, number, size) + +# xfs health monitoring ioctl stuff +XFS_HEALTH_MONITOR_FMT_JSON = 1 +XFS_HEALTH_MONITOR_VERBOSE = 1 << 0 + +class xfs_health_monitor(ctypes.Structure): + _fields_ = [ + ('flags', ctypes.c_ulonglong), + ('format', ctypes.c_ubyte), + ('_pad0', ctypes.c_ubyte * 7), + ('_pad1', ctypes.c_ulonglong * 2) + ] +assert ctypes.sizeof(xfs_health_monitor) == 32 + +XFS_IOC_HEALTH_MONITOR = _IOW(0x58, 68, xfs_health_monitor) + +def open_health_monitor(fd, verbose = False): + '''Return a health monitoring fd.''' + + arg = xfs_health_monitor() + arg.format = 
XFS_HEALTH_MONITOR_FMT_JSON + + if verbose: + arg.flags |= XFS_HEALTH_MONITOR_VERBOSE + + ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, arg) + return ret + +# main program + +def health_reports(mon_fp): + '''Generate python objects describing health events.''' + global debug + global printf_prefix + + lines = [] + buf = mon_fp.readline() + while buf != '': + for line in buf.split('\0'): + line = line.strip() + if debug: + print(f'new line: {line}') + if line == '': + continue + + lines.append(line) + if not '}' in line: + continue + + s = ''.join(lines) + if debug: + print(f'new event: {s}') + try: + yield json.loads(s) + except json.decoder.JSONDecodeError as e: + print(f"{printf_prefix}: {e} from {s}", + file = sys.stderr) + pass + lines = [] + buf = mon_fp.readline() + +def log_event(event): + '''Log a monitoring event to stdout.''' + global printf_prefix + + print(f"{printf_prefix}: {event}") + sys.stdout.flush() + +def report_lost(event): + '''Report that the kernel lost events.''' + global printf_prefix + + print(f"{printf_prefix}: Events were lost.") + sys.stdout.flush() + +def report_shutdown(event): + '''Report an abortive shutdown of the filesystem.''' + global printf_prefix + REASONS = { + "meta_ioerr": "metadata IO error", + "log_ioerr": "log IO error", + "force_umount": "forced unmount", + "corrupt_incore": "in-memory state corruption", + "corrupt_ondisk": "ondisk metadata corruption", + "device_removed": "device removal", + } + + reasons = [] + for reason in event['reasons']: + if reason in REASONS: + reasons.append(REASONS[reason]) + else: + reasons.append(reason) + + print(f"{printf_prefix}: Filesystem shut down due to {', '.join(reasons)}.") + sys.stdout.flush() + +def handle_event(event): + '''Handle an event asynchronously.''' + def stringify_timestamp(event): + '''Try to convert a timestamp to something human readable.''' + try: + ts = datetime.datetime.fromtimestamp(event['time_ns'] / 1e9).astimezone() + event['time'] = str(ts) + del 
event['time_ns'] + except Exception as e: + # Not a big deal if we can't format the timestamp, but + # let's yell about that loudly + print(f'{printf_prefix}: bad timestamp: {e}', file = sys.stderr) + + global log + + stringify_timestamp(event) + if log: + log_event(event) + if event['type'] == 'lost': + report_lost(event) + elif event['type'] == 'shutdown': + report_shutdown(event) + +def monitor(mountpoint, event_queue, **kwargs): + '''Monitor the given mountpoint for health events.''' + global everything + + fd = os.open(mountpoint, os.O_RDONLY) + try: + mon_fd = open_health_monitor(fd, verbose = everything) + except OSError as e: + if e.errno != errno.ENOTTY and e.errno != errno.EOPNOTSUPP: + raise e + print(f"{mountpoint}: XFS health monitoring not supported.", + file = sys.stderr) + return 1 + finally: + # Close the mountpoint if opening the health monitor fails + os.close(fd) + + # Ownership of mon_fd (and hence responsibility for closing it) is + # transferred to the mon_fp object. + with os.fdopen(mon_fd) as mon_fp: + nr = 0 + for e in health_reports(mon_fp): + event_queue.submit(handle_event, e) + + # Periodically run the garbage collector to constrain + # memory usage in the main thread. 
If only there was + # a way to submit to a queue without everything being + # tied up in a Future + if nr % 5355 == 0: + gc.collect() + nr += 1 + + return 0 + +def main(): + global debug + global log + global printf_prefix + global everything + global debug_fast + + parser = argparse.ArgumentParser( \ + description = "XFS filesystem health monitoring demon.") + parser.add_argument("--debug", help = "Enabling debugging messages.", \ + action = "store_true") + parser.add_argument("--log", help = "Log health events to stdout.", \ + action = "store_true") + parser.add_argument("--everything", help = "Capture all events.", \ + action = "store_true") + parser.add_argument("-V", help = "Report version and exit.", \ + action = "store_true") + parser.add_argument('mountpoint', default = None, nargs = '?', + help = 'XFS filesystem mountpoint to target.') + parser.add_argument('--debug-fast', action = 'store_true', \ + help = argparse.SUPPRESS) + args = parser.parse_args() + + if args.V: + print("xfs_scrubbed version @pkg_version@") + return 0 + + if args.mountpoint is None: + parser.error("the following arguments are required: mountpoint") + return 1 + + if args.debug: + debug = True + if args.log: + log = True + if args.everything: + everything = True + if args.debug_fast: + debug_fast = True + + # Use a separate subprocess to handle the events so that the main event + # reading process does not block on the GIL of the event handling + # subprocess. The downside is that we cannot pass function pointers + # and all data must be pickleable; the upside is not losing events. + # + # If the secret maximum efficiency setting is enabled, assume this is + # part of QA, so use all CPUs to process events. Normally we start one + # background process to minimize service footprint. 
+ if debug_fast: + args.event_queue = ProcessPoolExecutor() + else: + args.event_queue = ProcessPoolExecutor(max_workers = 1) + + printf_prefix = args.mountpoint + ret = 0 + try: + ret = monitor(**vars(args)) + except KeyboardInterrupt: + # Consider SIGINT to be a clean exit. + pass + + args.event_queue.shutdown() + return ret + +if __name__ == '__main__': + sys.exit(main()) ^ permalink raw reply related [flat|nested] 110+ messages in thread
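As a worked example of the ioctl number encoding in the patch above, here is a minimal sketch that packs the direction, size, type and number fields the way _IOC() does and then unpacks them again. It assumes the asm-generic field layout that the patch hardcodes (8/8/14/2 bits), which does not hold on every architecture. ioc() and decode_ioc() are illustrative names and are not part of xfs_scrubbed.

```python
import ctypes

# asm-generic ioctl number layout, as hardcoded in the patch.  Some
# architectures use different field widths; this sketch ignores them.
_IOC_NRBITS = 8
_IOC_TYPEBITS = 8
_IOC_SIZEBITS = 14
_IOC_DIRBITS = 2

_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS

_IOC_WRITE = 1

class xfs_health_monitor(ctypes.Structure):
    # Same fields as the patch: 8 + 1 + 7 + 16 = 32 bytes.
    _fields_ = [
        ('flags', ctypes.c_ulonglong),
        ('format', ctypes.c_ubyte),
        ('_pad0', ctypes.c_ubyte * 7),
        ('_pad1', ctypes.c_ulonglong * 2),
    ]

def ioc(direction, type_, nr, struct_type):
    '''Pack an ioctl number (hypothetical helper mirroring _IOC).'''
    size = ctypes.sizeof(struct_type)
    return ((direction << _IOC_DIRSHIFT) |
            (size << _IOC_SIZESHIFT) |
            (type_ << _IOC_TYPESHIFT) |
            (nr << _IOC_NRSHIFT))

def decode_ioc(cmd):
    '''Unpack an ioctl number into (direction, type, nr, size).'''
    def field(shift, bits):
        return (cmd >> shift) & ((1 << bits) - 1)
    return (field(_IOC_DIRSHIFT, _IOC_DIRBITS),
            field(_IOC_TYPESHIFT, _IOC_TYPEBITS),
            field(_IOC_NRSHIFT, _IOC_NRBITS),
            field(_IOC_SIZESHIFT, _IOC_SIZEBITS))

# The health monitor ioctl: write direction, type 'X' (0x58), number 68.
cmd = ioc(_IOC_WRITE, 0x58, 68, xfs_health_monitor)
print(hex(cmd), decode_ioc(cmd))
```

With the generic layout this packs to 0x40205844. Decoding a number this way can be handy when matching strace output against the constants in the script.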
* [PATCH 12/21] xfs_scrubbed: check events against schema 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (10 preceding siblings ...) 2024-12-31 23:50 ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong @ 2024-12-31 23:50 ` Darrick J. Wong 2024-12-31 23:51 ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong ` (8 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:50 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Validate that the event objects that we get from the kernel actually obey the schema that the kernel publishes. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- libxfs/Makefile | 10 ++++++-- scrub/Makefile | 1 + scrub/xfs_scrubbed.in | 62 +++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 70 insertions(+), 3 deletions(-) diff --git a/libxfs/Makefile b/libxfs/Makefile index 61c43529b532b6..f84eb5b43cdddd 100644 --- a/libxfs/Makefile +++ b/libxfs/Makefile @@ -151,6 +151,8 @@ EXTRA_OBJECTS=\ LDIRT += $(EXTRA_OBJECTS) +JSON_SCHEMAS=xfs_healthmon.schema.json + # # Tracing flags: # -DMEM_DEBUG all zone memory use @@ -174,7 +176,7 @@ LTLIBS = $(LIBPTHREAD) $(LIBRT) # don't try linking xfs_repair with a debug libxfs. 
DEBUG = -DNDEBUG -default: ltdepend $(LTLIBRARY) $(EXTRA_OBJECTS) +default: ltdepend $(LTLIBRARY) $(EXTRA_OBJECTS) $(JSON_SCHEMAS) %dummy.o: %dummy.cpp @echo " [CXXD] $@" @@ -196,14 +198,16 @@ MAKECXXDEP := $(MAKEDEPEND) $(CXXFLAGS) include $(BUILDRULES) install: default - $(INSTALL) -m 755 -d $(PKG_INC_DIR) + $(INSTALL) -m 755 -d $(PKG_DATA_DIR) + $(INSTALL) -m 644 $(JSON_SCHEMAS) $(PKG_DATA_DIR) install-headers: $(addsuffix -hdrs, $(PKGHFILES)) %-hdrs: $(Q)$(LN_S) -f $(CURDIR)/$* $(TOPDIR)/include/xfs/$* -install-dev: install +install-dev: default + $(INSTALL) -m 755 -d $(PKG_INC_DIR) $(INSTALL) -m 644 $(PKGHFILES) $(PKG_INC_DIR) # We need to install the headers before building the dependencies. If we diff --git a/scrub/Makefile b/scrub/Makefile index bd910922ceb4bb..7d4fa0ddc09685 100644 --- a/scrub/Makefile +++ b/scrub/Makefile @@ -129,6 +129,7 @@ xfs_scrubbed: xfs_scrubbed.in $(builddefs) $(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \ -e "s|@scrub_svcname@|$(scrub_svcname)|g" \ -e "s|@pkg_version@|$(PKG_VERSION)|g" \ + -e "s|@pkg_data_dir@|$(PKG_DATA_DIR)|g" \ < $< > $@ $(Q)chmod a+x $@ diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index 4d742a9151a082..992797113d6d30 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -18,6 +18,52 @@ import ctypes import gc from concurrent.futures import ProcessPoolExecutor +try: + # Not all systems will have this json schema validation libarary, + # so we make it optional. 
+ import jsonschema + + def init_validation(args): + '''Initialize event json validation.''' + try: + with open(args.event_schema) as fp: + schema_js = json.load(fp) + except Exception as e: + print(f"{args.event_schema}: {e}", file = sys.stderr) + return + + try: + vcls = jsonschema.validators.validator_for(schema_js) + vcls.check_schema(schema_js) + validator = vcls(schema_js) + except jsonschema.exceptions.SchemaError as e: + print(f"{args.event_schema}: invalid event data, {e.message}", + file = sys.stderr) + return + except Exception as e: + print(f"{args.event_schema}: {e}", file = sys.stderr) + return + + def v(i): + e = jsonschema.exceptions.best_match(validator.iter_errors(i)) + if e: + print(f"{printf_prefix}: {e.message}", + file = sys.stderr) + return False + return True + + return v + +except: + def init_validation(args): + if args.require_validation: + print("JSON schema validation not available.", + file = sys.stderr) + return + + return lambda instance: True + +validator_fn = None debug = False log = False everything = False @@ -177,6 +223,12 @@ def handle_event(event): global log + # Ignore any event that doesn't pass our schema. This program must + # not try to handle a newer kernel that say things that it is not + # prepared to handle. 
+ if not validator_fn(event): + return + stringify_timestamp(event) if log: log_event(event) @@ -225,6 +277,7 @@ def main(): global printf_prefix global everything global debug_fast + global validator_fn parser = argparse.ArgumentParser( \ description = "XFS filesystem health monitoring demon.") @@ -240,6 +293,11 @@ def main(): help = 'XFS filesystem mountpoint to target.') parser.add_argument('--debug-fast', action = 'store_true', \ help = argparse.SUPPRESS) + parser.add_argument('--require-validation', action = 'store_true', \ + help = argparse.SUPPRESS) + parser.add_argument('--event-schema', type = str, \ + default = '@pkg_data_dir@/xfs_healthmon.schema.json', \ + help = argparse.SUPPRESS) args = parser.parse_args() if args.V: @@ -250,6 +308,10 @@ def main(): parser.error("the following arguments are required: mountpoint") return 1 + validator_fn = init_validation(args) + if not validator_fn: + return 1 + if args.debug: debug = True if args.log: ^ permalink raw reply related [flat|nested] 110+ messages in thread
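The optional-validation pattern in the patch above can be sketched in isolation as follows. This is only an illustration under stated assumptions: TOY_SCHEMA is a made-up stand-in for the kernel-published xfs_healthmon.schema.json (which is not reproduced here), and make_validator() is a hypothetical helper name. The sketch catches ImportError specifically, whereas the patch uses a bare except.

```python
try:
    # jsonschema is a third-party package and may not be installed.
    import jsonschema
    HAVE_JSONSCHEMA = True
except ImportError:
    HAVE_JSONSCHEMA = False

# Hypothetical toy schema, NOT the real health monitoring schema.
TOY_SCHEMA = {
    "type": "object",
    "required": ["type"],
    "properties": {
        "type": {"enum": ["lost", "shutdown", "sick"]},
    },
}

def make_validator(schema):
    '''Return a callable that reports whether an event obeys schema.'''
    if not HAVE_JSONSCHEMA:
        # Without the library, accept everything.
        return lambda instance: True

    # Pick the validator class matching the schema's draft, make sure
    # the schema itself is well formed, then build one reusable object.
    cls = jsonschema.validators.validator_for(schema)
    cls.check_schema(schema)
    validator = cls(schema)

    def check(instance):
        # best_match() returns None when there are no errors at all.
        err = jsonschema.exceptions.best_match(
                validator.iter_errors(instance))
        return err is None

    return check

check = make_validator(TOY_SCHEMA)
print(check({"type": "lost"}), check({"type": "bogus"}))
```

Building the validator once and reusing it for every event avoids re-checking the schema on each call, which matters when events arrive in bursts.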
* [PATCH 13/21] xfs_scrubbed: enable repairing filesystems 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (11 preceding siblings ...) 2024-12-31 23:50 ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong @ 2024-12-31 23:51 ` Darrick J. Wong 2024-12-31 23:51 ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong ` (7 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Make it so that our health monitoring daemon can initiate repairs. Repairs can take a while to run, so we don't want to do that work in the event thread, because the kernel queue can drop events if userspace doesn't respond in time. Therefore, create a subprocess executor to run the repairs in the background, and do the repairs from there. The subprocess executor is similar in concept to what a libfrog workqueue does, but the workers do not share an address space, which eliminates GIL contention. Signed-off-by: "Darrick J. 
Wong" <djwong@kernel.org> --- scrub/xfs_scrubbed.in | 366 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 360 insertions(+), 6 deletions(-) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index 992797113d6d30..c626c7bd56630c 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -17,6 +17,7 @@ import errno import ctypes import gc from concurrent.futures import ProcessPoolExecutor +import ctypes.util try: # Not all systems will have this json schema validation libarary, @@ -37,7 +38,7 @@ try: vcls.check_schema(schema_js) validator = vcls(schema_js) except jsonschema.exceptions.SchemaError as e: - print(f"{args.event_schema}: invalid event data, {e.message}", + print(f"{args.event_schema}: invalid event data: {e.message}", file = sys.stderr) return except Exception as e: @@ -69,6 +70,9 @@ log = False everything = False debug_fast = False printf_prefix = '' +want_repair = False +libhandle = None +repair_queue = None # placeholder for event queue worker # ioctl encoding stuff _IOC_NRBITS = 8 @@ -112,6 +116,9 @@ def _IOW(type, number, size): def _IOWR(type, number, size): return _IOC(_IOC_READ | _IOC_WRITE, type, number, size) +def _IOWR(type, number, size): + return _IOC(_IOC_READ | _IOC_WRITE, type, number, size) + # xfs health monitoring ioctl stuff XFS_HEALTH_MONITOR_FMT_JSON = 1 XFS_HEALTH_MONITOR_VERBOSE = 1 << 0 @@ -139,9 +146,206 @@ def open_health_monitor(fd, verbose = False): ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, arg) return ret +# libhandle stuff +class xfs_fsid(ctypes.Structure): + _fields_ = [ + ("_val0", ctypes.c_uint), + ("_val1", ctypes.c_uint) + ] + +class xfs_fid(ctypes.Structure): + _fields_ = [ + ("fid_len", ctypes.c_ushort), + ("fid_pad", ctypes.c_ushort), + ("fid_gen", ctypes.c_uint), + ("fid_ino", ctypes.c_ulonglong) + ] + +class xfs_handle(ctypes.Structure): + _fields_ = [ + ("_ha_fsid", xfs_fsid), + ("ha_fid", xfs_fid) + ] +assert ctypes.sizeof(xfs_handle) == 24 + +class fshandle(object): + def 
__init__(self, fd, mountpoint): + global libhandle + global printf_prefix + + self.handle = xfs_handle() + + if mountpoint is None: + raise Exception('fshandle needs a mountpoint') + + self.mountpoint = mountpoint + + # Create the file and fs handles for the open mountpoint + # so that we can compare them later + buf = ctypes.c_void_p() + buflen = ctypes.c_size_t() + ret = libhandle.fd_to_handle(fd, buf, buflen) + if ret < 0: + errcode = ctypes.get_errno() + raise OSError(errcode, + f'cannot create handle: {os.strerror(errcode)}', + printf_prefix) + if buflen.value != ctypes.sizeof(xfs_handle): + libhandle.free_handle(buf, buflen.value) + raise Exception(f"fshandle expected {ctypes.sizeof(xfs_handle)} bytes, got {buflen.value}.") + + hanp = ctypes.cast(buf, ctypes.POINTER(xfs_handle)) + self.handle = hanp.contents + + def open(self): + '''Reopen a file handle obtained via weak reference.''' + global libhandle + global printf_prefix + + buf = ctypes.c_void_p() + buflen = ctypes.c_size_t() + + fd = os.open(self.mountpoint, os.O_RDONLY) + + # Create the file and fs handles for the open mountpoint + # so that we can compare them later + ret = libhandle.fd_to_handle(fd, buf, buflen) + if ret < 0: + errcode = ctypes.get_errno() + os.close(fd) + raise OSError(errcode, + f'resampling handle: {os.strerror(errcode)}', + printf_prefix) + + hanp = ctypes.cast(buf, ctypes.POINTER(xfs_handle)) + + # Did we get the same handle? 
+ if buflen.value != ctypes.sizeof(xfs_handle) or \ + bytes(hanp.contents) != bytes(self.handle): + os.close(fd) + libhandle.free_handle(buf, buflen) + raise OSError(errno.ESTALE, + os.strerror(errno.ESTALE), + printf_prefix) + + libhandle.free_handle(buf, buflen) + return fd + +def libhandle_load(): + '''Load libhandle and set things up.''' + global libhandle + + soname = ctypes.util.find_library('handle') + if soname is None: + raise OSError(errno.ENOENT, + f'while finding library: {os.strerror(errno.ENOENT)}', + 'libhandle') + + libhandle = ctypes.CDLL(soname, use_errno = True) + libhandle.fd_to_handle.argtypes = ( + ctypes.c_int, + ctypes.POINTER(ctypes.c_void_p), + ctypes.POINTER(ctypes.c_size_t)) + libhandle.handle_to_fshandle.argtypes = ( + ctypes.c_void_p, + ctypes.c_size_t, + ctypes.POINTER(ctypes.c_void_p), + ctypes.POINTER(ctypes.c_size_t)) + libhandle.path_to_fshandle.argtypes = ( + ctypes.c_char_p, + ctypes.c_void_p, + ctypes.c_size_t) + libhandle.free_handle.argtypes = ( + ctypes.c_void_p, + ctypes.c_size_t) + +# metadata scrubbing stuff +XFS_SCRUB_TYPE_PROBE = 0 +XFS_SCRUB_TYPE_SB = 1 +XFS_SCRUB_TYPE_AGF = 2 +XFS_SCRUB_TYPE_AGFL = 3 +XFS_SCRUB_TYPE_AGI = 4 +XFS_SCRUB_TYPE_BNOBT = 5 +XFS_SCRUB_TYPE_CNTBT = 6 +XFS_SCRUB_TYPE_INOBT = 7 +XFS_SCRUB_TYPE_FINOBT = 8 +XFS_SCRUB_TYPE_RMAPBT = 9 +XFS_SCRUB_TYPE_REFCNTBT = 10 +XFS_SCRUB_TYPE_INODE = 11 +XFS_SCRUB_TYPE_BMBTD = 12 +XFS_SCRUB_TYPE_BMBTA = 13 +XFS_SCRUB_TYPE_BMBTC = 14 +XFS_SCRUB_TYPE_DIR = 15 +XFS_SCRUB_TYPE_XATTR = 16 +XFS_SCRUB_TYPE_SYMLINK = 17 +XFS_SCRUB_TYPE_PARENT = 18 +XFS_SCRUB_TYPE_RTBITMAP = 19 +XFS_SCRUB_TYPE_RTSUM = 20 +XFS_SCRUB_TYPE_UQUOTA = 21 +XFS_SCRUB_TYPE_GQUOTA = 22 +XFS_SCRUB_TYPE_PQUOTA = 23 +XFS_SCRUB_TYPE_FSCOUNTERS = 24 +XFS_SCRUB_TYPE_QUOTACHECK = 25 +XFS_SCRUB_TYPE_NLINKS = 26 +XFS_SCRUB_TYPE_HEALTHY = 27 +XFS_SCRUB_TYPE_DIRTREE = 28 +XFS_SCRUB_TYPE_METAPATH = 29 +XFS_SCRUB_TYPE_RGSUPER = 30 +XFS_SCRUB_TYPE_RGBITMAP = 31 +XFS_SCRUB_TYPE_RTRMAPBT = 32 
+XFS_SCRUB_TYPE_RTREFCBT = 33 + +XFS_SCRUB_IFLAG_REPAIR = 1 << 0 +XFS_SCRUB_OFLAG_CORRUPT = 1 << 1 +XFS_SCRUB_OFLAG_PREEN = 1 << 2 +XFS_SCRUB_OFLAG_XFAIL = 1 << 3 +XFS_SCRUB_OFLAG_XCORRUPT = 1 << 4 +XFS_SCRUB_OFLAG_INCOMPLETE = 1 << 5 +XFS_SCRUB_OFLAG_WARNING = 1 << 6 +XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED = 1 << 7 +XFS_SCRUB_IFLAG_FORCE_REBUILD = 1 << 8 + +class xfs_scrub_metadata(ctypes.Structure): + _fields_ = [ + ('sm_type', ctypes.c_uint), + ('sm_flags', ctypes.c_uint), + ('sm_ino', ctypes.c_ulonglong), + ('sm_gen', ctypes.c_uint), + ('sm_agno', ctypes.c_uint), + ('_pad', ctypes.c_ulonglong * 5), + ] +assert ctypes.sizeof(xfs_scrub_metadata) == 64 + +XFS_IOC_SCRUB_METADATA = _IOWR(0x58, 60, xfs_scrub_metadata) + +def __xfs_repair_metadata(fd, type, group, ino, gen): + '''Call the kernel to repair some inode metadata.''' + + arg = xfs_scrub_metadata() + arg.sm_type = type + arg.sm_flags = XFS_SCRUB_IFLAG_REPAIR + arg.sm_ino = ino + arg.sm_gen = gen + arg.sm_agno = group + + fcntl.ioctl(fd, XFS_IOC_SCRUB_METADATA, arg) + return arg.sm_flags + +def xfs_repair_fs_metadata(fd, type): + '''Call the kernel to repair some whole-fs metadata.''' + return __xfs_repair_metadata(fd, type, 0, 0, 0) + +def xfs_repair_group_metadata(fd, type, group): + '''Call the kernel to repair some group metadata.''' + return __xfs_repair_metadata(fd, type, group, 0, 0) + +def xfs_repair_inode_metadata(fd, type, ino, gen): + '''Call the kernel to repair some inode metadata.''' + return __xfs_repair_metadata(fd, type, 0, ino, gen) + # main program -def health_reports(mon_fp): +def health_reports(mon_fp, fh): '''Generate python objects describing health events.''' global debug global printf_prefix @@ -164,7 +368,7 @@ def health_reports(mon_fp): if debug: print(f'new event: {s}') try: - yield json.loads(s) + yield (json.loads(s), fh) except json.decoder.JSONDecodeError as e: print(f"{printf_prefix}: {e} from {s}", file = sys.stderr) @@ -208,7 +412,7 @@ def report_shutdown(event): 
print(f"{printf_prefix}: Filesystem shut down due to {', '.join(reasons)}.") sys.stdout.flush() -def handle_event(event): +def handle_event(e): '''Handle an event asynchronously.''' def stringify_timestamp(event): '''Try to convert a timestamp to something human readable.''' @@ -222,6 +426,17 @@ def handle_event(event): print(f'{printf_prefix}: bad timestamp: {e}', file = sys.stderr) global log + global repair_queue + + # Use a separate subprocess to handle the repairs so that the event + # processing worker does not block on the GIL of the repair workers. + # The downside is that we cannot pass function pointers and all data + # must be pickleable; the upside is that we don't stall processing of + # non-sickness events while repairs are in progress. + if want_repair and not repair_queue: + repair_queue = ProcessPoolExecutor(max_workers = 1) + + event, fh = e # Ignore any event that doesn't pass our schema. This program must # not try to handle a newer kernel that say things that it is not @@ -236,13 +451,21 @@ def handle_event(event): report_lost(event) elif event['type'] == 'shutdown': report_shutdown(event) + elif want_repair and event['type'] == 'sick': + repair_queue.submit(repair_metadata, event, fh) def monitor(mountpoint, event_queue, **kwargs): '''Monitor the given mountpoint for health events.''' global everything + global log + global printf_prefix + global want_repair + fh = None fd = os.open(mountpoint, os.O_RDONLY) try: + if want_repair: + fh = fshandle(fd, mountpoint) mon_fd = open_health_monitor(fd, verbose = everything) except OSError as e: if e.errno != errno.ENOTTY and e.errno != errno.EOPNOTSUPP: @@ -251,14 +474,15 @@ def monitor(mountpoint, event_queue, **kwargs): file = sys.stderr) return 1 finally: - # Close the mountpoint if opening the health monitor fails + # Close the mountpoint if opening the health monitor fails; + # the handle object will free its own memory. 
os.close(fd) # Ownership of mon_fd (and hence responsibility for closing it) is # transferred to the mon_fp object. with os.fdopen(mon_fd) as mon_fp: nr = 0 - for e in health_reports(mon_fp): + for e in health_reports(mon_fp, fh): event_queue.submit(handle_event, e) # Periodically run the garbage collector to constrain @@ -271,6 +495,125 @@ def monitor(mountpoint, event_queue, **kwargs): return 0 +def __scrub_type(code): + '''Convert a "structures" json list to a scrub type code.''' + SCRUB_TYPES = { + "probe": XFS_SCRUB_TYPE_PROBE, + "sb": XFS_SCRUB_TYPE_SB, + "agf": XFS_SCRUB_TYPE_AGF, + "agfl": XFS_SCRUB_TYPE_AGFL, + "agi": XFS_SCRUB_TYPE_AGI, + "bnobt": XFS_SCRUB_TYPE_BNOBT, + "cntbt": XFS_SCRUB_TYPE_CNTBT, + "inobt": XFS_SCRUB_TYPE_INOBT, + "finobt": XFS_SCRUB_TYPE_FINOBT, + "rmapbt": XFS_SCRUB_TYPE_RMAPBT, + "refcountbt": XFS_SCRUB_TYPE_REFCNTBT, + "inode": XFS_SCRUB_TYPE_INODE, + "bmapbtd": XFS_SCRUB_TYPE_BMBTD, + "bmapbta": XFS_SCRUB_TYPE_BMBTA, + "bmapbtc": XFS_SCRUB_TYPE_BMBTC, + "directory": XFS_SCRUB_TYPE_DIR, + "xattr": XFS_SCRUB_TYPE_XATTR, + "symlink": XFS_SCRUB_TYPE_SYMLINK, + "parent": XFS_SCRUB_TYPE_PARENT, + "rtbitmap": XFS_SCRUB_TYPE_RTBITMAP, + "rtsummary": XFS_SCRUB_TYPE_RTSUM, + "usrquota": XFS_SCRUB_TYPE_UQUOTA, + "grpquota": XFS_SCRUB_TYPE_GQUOTA, + "prjquota": XFS_SCRUB_TYPE_PQUOTA, + "fscounters": XFS_SCRUB_TYPE_FSCOUNTERS, + "quotacheck": XFS_SCRUB_TYPE_QUOTACHECK, + "nlinks": XFS_SCRUB_TYPE_NLINKS, + "healthy": XFS_SCRUB_TYPE_HEALTHY, + "dirtree": XFS_SCRUB_TYPE_DIRTREE, + "metapath": XFS_SCRUB_TYPE_METAPATH, + "rgsuper": XFS_SCRUB_TYPE_RGSUPER, + "rgbitmap": XFS_SCRUB_TYPE_RGBITMAP, + "rtrmapbt": XFS_SCRUB_TYPE_RTRMAPBT, + "rtrefcountbt": XFS_SCRUB_TYPE_RTREFCBT, + } + + if code not in SCRUB_TYPES: + return None + + return SCRUB_TYPES[code] + +def report_outcome(oflags): + if oflags & (XFS_SCRUB_OFLAG_CORRUPT | \ + XFS_SCRUB_OFLAG_CORRUPT | \ + XFS_SCRUB_OFLAG_INCOMPLETE): + return "Repair unsuccessful; offline repair required." 
+ + if oflags & XFS_SCRUB_OFLAG_XFAIL: + return "Seems correct but cross-referencing failed; offline repair recommended." + + if oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED: + return "No modification needed." + + return "Repairs successful." + +def repair_wholefs(event, fd): + '''React to a fs-domain corruption event by repairing it.''' + for s in event['structures']: + type = __scrub_type(s) + if type is None: + continue + try: + oflags = xfs_repair_fs_metadata(fd, type) + print(f"{printf_prefix}: {s}: {report_outcome(oflags)}") + sys.stdout.flush() + except Exception as e: + print(f"{printf_prefix}: {s}: {e}", file = sys.stderr) + +def repair_group(event, fd, group_type): + '''React to a group-domain corruption event by repairing it.''' + for s in event['structures']: + type = __scrub_type(s) + if type is None: + continue + try: + oflags = xfs_repair_group_metadata(fd, type, event['group']) + print(f"{printf_prefix}: {s}: {report_outcome(oflags)}") + sys.stdout.flush() + except Exception as e: + print(f"{printf_prefix}: {s}: {e}", file = sys.stderr) + +def repair_inode(event, fd): + '''React to a inode-domain corruption event by repairing it.''' + for s in event['structures']: + type = __scrub_type(s) + if type is None: + continue + try: + oflags = xfs_repair_inode_metadata(fd, type, + event['inumber'], event['generation']) + print(f"{printf_prefix}: {s}: {report_outcome(oflags)}") + sys.stdout.flush() + except Exception as e: + print(f"{printf_prefix}: {s}: {e}", file = sys.stderr) + +def repair_metadata(event, fh): + '''Repair a metadata corruption.''' + global debug + global printf_prefix + + if debug: + print(f'repair {event}') + + fd = fh.open() + try: + if event['domain'] in ['fs', 'realtime']: + repair_wholefs(event, fd) + elif event['domain'] in ['perag', 'rtgroup']: + repair_group(event, fd, event['domain']) + elif event['domain'] == 'inode': + repair_inode(event, fd) + else: + raise Exception(f"{printf_prefix}: Unknown metadata domain 
\"{event['domain']}\".") + finally: + os.close(fd) + def main(): global debug global log @@ -278,6 +621,7 @@ def main(): global everything global debug_fast global validator_fn + global want_repair parser = argparse.ArgumentParser( \ description = "XFS filesystem health monitoring demon.") @@ -287,6 +631,8 @@ def main(): action = "store_true") parser.add_argument("--everything", help = "Capture all events.", \ action = "store_true") + parser.add_argument("--repair", help = "Automatically repair corrupt metadata.", \ + action = "store_true") parser.add_argument("-V", help = "Report version and exit.", \ action = "store_true") parser.add_argument('mountpoint', default = None, nargs = '?', @@ -312,6 +658,12 @@ def main(): if not validator_fn: return 1 + try: + libhandle_load() + except OSError as e: + print(f"libhandle: {e}", file = sys.stderr) + return 1 + if args.debug: debug = True if args.log: @@ -320,6 +672,8 @@ def main(): everything = True if args.debug_fast: debug_fast = True + if args.repair: + want_repair = True # Use a separate subprocess to handle the events so that the main event # reading process does not block on the GIL of the event handling ^ permalink raw reply related [flat|nested] 110+ messages in thread
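To illustrate how the repair code above turns the scrub output flags into a message, here is a small standalone sketch. The flag values and message strings are taken from the patch. The function name describe_outcome() is hypothetical, and the sketch assumes that cross-referencing corruption (XFS_SCRUB_OFLAG_XCORRUPT) should count as a failed repair alongside CORRUPT and INCOMPLETE. That is an assumption about intent, not something the patch states.

```python
# Output flags returned in sm_flags, values as defined in the patch.
XFS_SCRUB_OFLAG_CORRUPT = 1 << 1
XFS_SCRUB_OFLAG_PREEN = 1 << 2
XFS_SCRUB_OFLAG_XFAIL = 1 << 3
XFS_SCRUB_OFLAG_XCORRUPT = 1 << 4
XFS_SCRUB_OFLAG_INCOMPLETE = 1 << 5
XFS_SCRUB_OFLAG_WARNING = 1 << 6
XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED = 1 << 7

# Assumption: any of these bits means the online repair did not work.
FAILED_MASK = (XFS_SCRUB_OFLAG_CORRUPT |
               XFS_SCRUB_OFLAG_XCORRUPT |
               XFS_SCRUB_OFLAG_INCOMPLETE)

def describe_outcome(oflags):
    '''Map scrub output flags to a human-readable repair outcome.'''
    # Check the most severe conditions first so that they win when
    # several bits are set at once.
    if oflags & FAILED_MASK:
        return "Repair unsuccessful; offline repair required."

    if oflags & XFS_SCRUB_OFLAG_XFAIL:
        return ("Seems correct but cross-referencing failed; "
                "offline repair recommended.")

    if oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED:
        return "No modification needed."

    return "Repairs successful."

for flags in (0, XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED,
              XFS_SCRUB_OFLAG_XFAIL, XFS_SCRUB_OFLAG_CORRUPT):
    print(f"{flags:#04x}: {describe_outcome(flags)}")
```

Keeping the failure bits in one named mask makes it easier to see at a glance which conditions are treated as fatal.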
* [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (12 preceding siblings ...) 2024-12-31 23:51 ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong @ 2024-12-31 23:51 ` Darrick J. Wong 2024-12-31 23:51 ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong ` (6 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Online repair relies heavily on back references such as reverse mappings and directory parent pointers to add redundancy to the filesystem. Check for these two features and whine a bit if they are missing. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/xfs_scrubbed.in | 72 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index c626c7bd56630c..25465128864583 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -71,6 +71,8 @@ everything = False debug_fast = False printf_prefix = '' want_repair = False +has_parent = False +has_rmapbt = False libhandle = None repair_queue = None # placeholder for event queue worker @@ -343,6 +345,57 @@ def xfs_repair_inode_metadata(fd, type, ino, gen): '''Call the kernel to repair some inode metadata.''' return __xfs_repair_metadata(fd, type, 0, ino, gen) +# fsgeometry ioctl +class xfs_fsop_geom(ctypes.Structure): + _fields_ = [ + ("blocksize", ctypes.c_uint), + ("rtextesize", ctypes.c_uint), + ("agblocks", ctypes.c_uint), + ("agcount", ctypes.c_uint), + ("logblocks", ctypes.c_uint), + ("sectsize", ctypes.c_uint), + ("inodesize", ctypes.c_uint), + ("imaxpct", ctypes.c_uint), + ("datablocks", ctypes.c_ulonglong), + ("rtblocks", ctypes.c_ulonglong), + ("rtextents", 
ctypes.c_ulonglong), + ("logstart", ctypes.c_ulonglong), + ("uuid", ctypes.c_ubyte * 16), + ("sunit", ctypes.c_uint), + ("swidth", ctypes.c_uint), + ("version", ctypes.c_uint), + ("flags", ctypes.c_uint), + ("logsectsize", ctypes.c_uint), + ("rtsectsize", ctypes.c_uint), + ("dirblocksize", ctypes.c_uint), + ("logsunit", ctypes.c_uint), + ("sick", ctypes.c_uint), + ("checked", ctypes.c_uint), + ("rgblocks", ctypes.c_uint), + ("rgcount", ctypes.c_uint), + ("_pad", ctypes.c_ulonglong * 16), + ] +assert ctypes.sizeof(xfs_fsop_geom) == 256 + +XFS_FSOP_GEOM_FLAGS_RMAPBT = 1 << 19 +XFS_FSOP_GEOM_FLAGS_PARENT = 1 << 25 + +XFS_IOC_FSGEOMETRY = _IOR (0x58, 126, xfs_fsop_geom) + +def xfs_has_parent(fd): + '''Does this filesystem have parent pointers?''' + + arg = xfs_fsop_geom() + fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg) + return arg.flags & XFS_FSOP_GEOM_FLAGS_PARENT != 0 + +def xfs_has_rmapbt(fd): + '''Does this filesystem have reverse mapping?''' + + arg = xfs_fsop_geom() + fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg) + return arg.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT != 0 + # main program def health_reports(mon_fp, fh): @@ -460,9 +513,28 @@ def monitor(mountpoint, event_queue, **kwargs): global log global printf_prefix global want_repair + global has_parent + global has_rmapbt fh = None fd = os.open(mountpoint, os.O_RDONLY) + try: + has_parent = xfs_has_parent(fd) + has_rmapbt = xfs_has_rmapbt(fd) + except Exception as e: + # Don't care if we can't detect parent pointers or rmap + print(f'{printf_prefix}: detecting fs features: {e}', file = sys.stderr) + + # Check for the backref metadata that makes repair effective. + if want_repair: + if not has_rmapbt: + print(f"{mountpoint}: XFS online repair is less effective without rmap btrees.") + if not has_parent: + print(f"{mountpoint}: XFS online repair is less effective without parent pointers.") + + # Flush anything that we may have printed about operational state. 
+ sys.stdout.flush() + try: if want_repair: fh = fshandle(fd, mountpoint) ^ permalink raw reply related [flat|nested] 110+ messages in thread
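The feature check in the patch above boils down to testing two bits in the geometry flags word. A minimal sketch, assuming the flags value has already been fetched with a single XFS_IOC_FSGEOMETRY call (the ioctl itself is left out so that this runs anywhere); missing_backrefs() is an illustrative name and not part of xfs_scrubbed.

```python
# Feature bits in xfs_fsop_geom.flags, values as defined in the patch.
XFS_FSOP_GEOM_FLAGS_RMAPBT = 1 << 19
XFS_FSOP_GEOM_FLAGS_PARENT = 1 << 25

def missing_backrefs(geom_flags):
    '''List the backref features that a geometry flags word lacks.'''
    missing = []
    # Python's & binds tighter than ==, unlike C, so the parentheses
    # below are only there for readers used to C precedence.
    if (geom_flags & XFS_FSOP_GEOM_FLAGS_RMAPBT) == 0:
        missing.append("rmap btrees")
    if (geom_flags & XFS_FSOP_GEOM_FLAGS_PARENT) == 0:
        missing.append("parent pointers")
    return missing

# Hypothetical flags word with rmapbt set and parent pointers clear.
example_flags = XFS_FSOP_GEOM_FLAGS_RMAPBT
for feature in missing_backrefs(example_flags):
    print(f"online repair is less effective without {feature}.")
```

Taking the flags word as an argument means both features can be derived from one geometry query instead of issuing the ioctl once per feature.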
* [PATCH 15/21] xfs_scrubbed: use getparents to look up file names 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (13 preceding siblings ...) 2024-12-31 23:51 ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong @ 2024-12-31 23:51 ` Darrick J. Wong 2024-12-31 23:51 ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong ` (5 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> If the kernel tells us about something that happened to a file, use the GETPARENTS ioctl to try to look up the path to that file for more ergonomic reporting. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/xfs_scrubbed.in | 235 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 230 insertions(+), 5 deletions(-) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index 25465128864583..a4e073b3098f7a 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -18,6 +18,7 @@ import ctypes import gc from concurrent.futures import ProcessPoolExecutor import ctypes.util +import collections try: # Not all systems will have this json schema validation libarary, @@ -171,12 +172,18 @@ class xfs_handle(ctypes.Structure): assert ctypes.sizeof(xfs_handle) == 24 class fshandle(object): - def __init__(self, fd, mountpoint): + def __init__(self, fd, mountpoint = None): global libhandle global printf_prefix self.handle = xfs_handle() + if isinstance(fd, fshandle): + # copy an existing fshandle + self.mountpoint = fd.mountpoint + ctypes.pointer(self.handle)[0] = fd.handle + return + if mountpoint is None: raise Exception('fshandle needs a mountpoint') @@ -233,6 +240,11 @@ class fshandle(object): libhandle.free_handle(buf, buflen) return fd + def subst(self, ino, gen): + '''Substitute the inode 
and generation components of a handle.''' + self.handle.ha_fid.fid_ino = ino + self.handle.ha_fid.fid_gen = gen + def libhandle_load(): '''Load libhandle and set things up.''' global libhandle @@ -396,6 +408,170 @@ def xfs_has_rmapbt(fd): fcntl.ioctl(fd, XFS_IOC_FSGEOMETRY, arg) return arg.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT != 0 +# getparents ioctl +class xfs_attrlist_cursor(ctypes.Structure): + _fields_ = [ + ("_opaque0", ctypes.c_uint), + ("_opaque1", ctypes.c_uint), + ("_opaque2", ctypes.c_uint), + ("_opaque3", ctypes.c_uint) + ] + +class xfs_getparents_rec(ctypes.Structure): + _fields_ = [ + ("gpr_parent", xfs_handle), + ("gpr_reclen", ctypes.c_uint), + ("_gpr_reserved", ctypes.c_uint), + ] + +xfs_getparents_tuple = collections.namedtuple('xfs_getparents_tuple', \ + ['gpr_parent', 'gpr_reclen', 'gpr_name']) + +class xfs_getparents_rec_array(object): + def __init__(self, nr_bytes): + self.nr_bytes = nr_bytes + self.bytearray = (ctypes.c_byte * int(nr_bytes))() + + def __slice_to_record(self, bufslice): + '''Compute the number of bytes in a getparents record that contain a null-terminated directory entry name.''' + rec = ctypes.cast(bytes(bufslice), \ + ctypes.POINTER(xfs_getparents_rec)) + fixedlen = ctypes.sizeof(xfs_getparents_rec) + namelen = rec.contents.gpr_reclen - fixedlen + + for i in range(0, namelen): + if bufslice[fixedlen + i] == 0: + namelen = i + break + + if namelen == 0: + return + + return xfs_getparents_tuple( + gpr_parent = rec.contents.gpr_parent, + gpr_reclen = rec.contents.gpr_reclen, + gpr_name = bufslice[fixedlen:fixedlen + namelen]) + + def get_buffer(self): + '''Return a pointer to the bytearray masquerading as an int.''' + return ctypes.addressof(self.bytearray) + + def __iter__(self): + '''Walk the getparents records in this array.''' + off = 0 + nr = 0 + buf = bytes(self.bytearray) + while off < self.nr_bytes: + bufslice = buf[off:] + t = self.__slice_to_record(bufslice) + if t is None: + break + yield t + off += t.gpr_reclen + nr += 
1 + +class xfs_getparents(ctypes.Structure): + _fields_ = [ + ("_gp_cursor", xfs_attrlist_cursor), + ("gp_iflags", ctypes.c_ushort), + ("gp_oflags", ctypes.c_ushort), + ("gp_bufsize", ctypes.c_uint), + ("_pad", ctypes.c_ulonglong), + ("gp_buffer", ctypes.c_ulonglong) + ] + + def __init__(self, fd, nr_bytes): + self.fd = fd + self.records = xfs_getparents_rec_array(nr_bytes) + self.gp_buffer = self.records.get_buffer() + self.gp_bufsize = nr_bytes + + def __call_kernel(self): + if self.gp_oflags & XFS_GETPARENTS_OFLAG_DONE: + return False + + ret = fcntl.ioctl(self.fd, XFS_IOC_GETPARENTS, self) + if ret != 0: + return False + + return self.gp_oflags & XFS_GETPARENTS_OFLAG_ROOT == 0 + + def __iter__(self): + ctypes.memset(ctypes.pointer(self._gp_cursor), 0, \ + ctypes.sizeof(xfs_attrlist_cursor)) + + while self.__call_kernel(): + for i in self.records: + yield i + +class xfs_getparents_by_handle(ctypes.Structure): + _fields_ = [ + ("gph_handle", xfs_handle), + ("gph_request", xfs_getparents) + ] + + def __init__(self, fd, fh, nr_bytes): + self.fd = fd + self.records = xfs_getparents_rec_array(nr_bytes) + self.gph_request.gp_buffer = self.records.get_buffer() + self.gph_request.gp_bufsize = nr_bytes + self.gph_handle = fh.handle + + def __call_kernel(self): + if self.gph_request.gp_oflags & XFS_GETPARENTS_OFLAG_DONE: + return False + + ret = fcntl.ioctl(self.fd, XFS_IOC_GETPARENTS_BY_HANDLE, self) + if ret != 0: + return False + + return self.gph_request.gp_oflags & XFS_GETPARENTS_OFLAG_ROOT == 0 + + def __iter__(self): + ctypes.memset(ctypes.pointer(self.gph_request._gp_cursor), 0, \ + ctypes.sizeof(xfs_attrlist_cursor)) + while self.__call_kernel(): + for i in self.records: + yield i + +assert ctypes.sizeof(xfs_getparents) == 40 +assert ctypes.sizeof(xfs_getparents_by_handle) == 64 +assert ctypes.sizeof(xfs_getparents_rec) == 32 + +XFS_GETPARENTS_OFLAG_ROOT = 1 << 0 +XFS_GETPARENTS_OFLAG_DONE = 1 << 1 + +XFS_IOC_GETPARENTS = _IOWR(0x58, 62, xfs_getparents) 
+XFS_IOC_GETPARENTS_BY_HANDLE = _IOWR(0x58, 63, xfs_getparents_by_handle) + +def fgetparents(fd, fh = None, bufsize = 1024): + '''Return all the parent pointers for a given fd and/or handle.''' + + if fh is not None: + return xfs_getparents_by_handle(fd, fh, bufsize) + return xfs_getparents(fd, bufsize) + +def fgetpath(fd, fh = None, mountpoint = None): + '''Return a list of path components up to the root dir of the filesystem for a given fd.''' + ret = [] + if fh is None: + nfh = fshandle(fd, mountpoint) + else: + # Don't subst into the caller's handle + nfh = fshandle(fh) + + while True: + added = False + for pptr in fgetparents(fd, nfh): + ret.insert(0, pptr.gpr_name) + nfh.subst(pptr.gpr_parent.ha_fid.fid_ino, \ + pptr.gpr_parent.ha_fid.fid_gen) + added = True + break + if not added: + break + return ret + # main program def health_reports(mon_fp, fh): @@ -429,11 +605,23 @@ def health_reports(mon_fp, fh): lines = [] buf = mon_fp.readline() +def inode_printf_prefix(event): + '''Compute the logging prefix for this event.''' + global printf_prefix + + if 'path' not in event: + return printf_prefix + + if printf_prefix.endswith(os.sep): + return f"{printf_prefix}{event['path']}" + + return f"{printf_prefix}{os.sep}{event['path']}" + def log_event(event): '''Log a monitoring event to stdout.''' global printf_prefix - print(f"{printf_prefix}: {event}") + print(f"{inode_printf_prefix(event)}: {event}") sys.stdout.flush() def report_lost(event): @@ -480,6 +668,39 @@ def handle_event(e): global log global repair_queue + global has_parent + + def pathify_event(event, fh): + '''Come up with a directory tree path for a file event.''' + try: + path_fd = fh.open() + except Exception as e: + # Not the end of the world if we get nothing + if e.errno != errno.EOPNOTSUPP and e.errno != errno.ENOTTY: + print(f'{printf_prefix}: opening file handle: {e}', file = sys.stderr) + return + + try: + fh2 = fshandle(fh) + except OSError as e: + if e.errno != errno.EOPNOTSUPP: + 
print(f'{printf_prefix}: making new file handle: {e}', file = sys.stderr) + os.close(path_fd) + return + except Exception as e: + print(f'{printf_prefix}: making new file handle: {e}', file = sys.stderr) + os.close(path_fd) + return + + try: + fh2.subst(event['inumber'], event['generation']) + components = [x.decode('utf-8') for x in fgetpath(path_fd, fh2)] + event['path'] = os.sep.join(components) + except OSError as e: + if e.errno != errno.EOPNOTSUPP: + print(f'{printf_prefix}: constructing path: {e}', file = sys.stderr) + finally: + os.close(path_fd) # Use a separate subprocess to handle the repairs so that the event # processing worker does not block on the GIL of the repair workers. @@ -498,6 +719,8 @@ def handle_event(e): return stringify_timestamp(event) + if event['domain'] == 'inode' and has_parent and not debug_fast: + pathify_event(event, fh) if log: log_event(event) if event['type'] == 'lost': @@ -536,7 +759,7 @@ def monitor(mountpoint, event_queue, **kwargs): sys.stdout.flush() try: - if want_repair: + if want_repair or has_parent: fh = fshandle(fd, mountpoint) mon_fd = open_health_monitor(fd, verbose = everything) except OSError as e: @@ -653,6 +876,8 @@ def repair_group(event, fd, group_type): def repair_inode(event, fd): '''React to a inode-domain corruption event by repairing it.''' + ipp = inode_printf_prefix(event) + for s in event['structures']: type = __scrub_type(s) if type is None: @@ -660,10 +885,10 @@ def repair_inode(event, fd): try: oflags = xfs_repair_inode_metadata(fd, type, event['inumber'], event['generation']) - print(f"{printf_prefix}: {s}: {report_outcome(oflags)}") + print(f"{ipp}: {s}: {report_outcome(oflags)}") sys.stdout.flush() except Exception as e: - print(f"{printf_prefix}: {s}: {e}", file = sys.stderr) + print(f"{ipp}: {s}: {e}", file = sys.stderr) def repair_metadata(event, fh): '''Repair a metadata corruption.''' ^ permalink raw reply related [flat|nested] 110+ messages in thread
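For readers following the record-walking logic in the patch above, here is a minimal sketch of iterating variable-length records that each carry their own length. It assumes the 32-byte fixed header implied by the patch's size assertion (a 24-byte handle followed by two 32-bit fields) and a NUL-terminated name occupying the rest of gpr_reclen. The helper names pack_record and walk_records, the 8-byte padding, and the sample buffer are all hypothetical; no ioctl is issued and this is not the xfs_scrubbed code itself.

```python
import struct

# Assumed fixed header: 24-byte opaque handle, u32 reclen, u32 reserved.
HDR = struct.Struct("=24sII")
assert HDR.size == 32

def pack_record(handle, name):
    '''Build one synthetic record; padding to 8 bytes is an assumption.'''
    raw = name + b"\0"
    pad = (-len(raw)) % 8
    reclen = HDR.size + len(raw) + pad
    return HDR.pack(handle, reclen, 0) + raw + b"\0" * pad

def walk_records(buf):
    '''Yield (handle, name) for each record found in buf.'''
    off = 0
    while off + HDR.size <= len(buf):
        handle, reclen, _ = HDR.unpack_from(buf, off)
        if reclen <= HDR.size or off + reclen > len(buf):
            break  # zero-filled tail or truncated record
        namebuf = buf[off + HDR.size:off + reclen]
        name = namebuf.split(b"\0", 1)[0]
        if not name:
            break
        yield handle, name
        off += reclen  # the record length tells us where the next one starts

buf = pack_record(b"A" * 24, b"usr") + pack_record(b"B" * 24, b"share")
buf += bytes(64)  # unused space at the end of the buffer
names = [name for _, name in walk_records(buf)]
print(names)
```

Like the patch, the sketch stops at the first record with an empty name; the extra bounds checks on reclen are an addition for the synthetic buffer and are not taken from the patch.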
* [PATCH 16/21] builddefs: refactor udev directory specification
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
` (14 preceding siblings ...)
2024-12-31 23:51 ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong
@ 2024-12-31 23:51 ` Darrick J. Wong
2024-12-31 23:52 ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong
` (4 subsequent siblings)
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:51 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Refactor the code that finds the udev rules directory to detect the
location of the parent udev directory instead. IOWs, we go from:

UDEV_RULE_DIR=/foo/bar/rules.d

to:

UDEV_DIR=/foo/bar
UDEV_RULE_DIR=/foo/bar/rules.d

This is needed by the next patch, which adds a helper script.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 configure.ac           |    2 +-
 include/builddefs.in   |    3 ++-
 m4/package_services.m4 |   30 +++++++++++++++---------------
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/configure.ac b/configure.ac
index 1f7fec838e1239..cabbef51068dbc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -175,7 +175,7 @@ if test "$enable_scrub" = "yes"; then
 fi
 AC_CONFIG_SYSTEMD_SYSTEM_UNIT_DIR
 AC_CONFIG_CROND_DIR
-AC_CONFIG_UDEV_RULE_DIR
+AC_CONFIG_UDEV_DIR
 AC_HAVE_BLKID_TOPO
 
 if test "$enable_ubsan" = "yes" || test "$enable_ubsan" = "probe"; then
diff --git a/include/builddefs.in b/include/builddefs.in
index bb022c36627a72..4a25de76d5c325 100644
--- a/include/builddefs.in
+++ b/include/builddefs.in
@@ -112,7 +112,8 @@ SYSTEMD_SYSTEM_UNIT_DIR = @systemd_system_unit_dir@
 HAVE_CROND = @have_crond@
 CROND_DIR = @crond_dir@
 HAVE_UDEV = @have_udev@
-UDEV_RULE_DIR = @udev_rule_dir@
+UDEV_DIR = @udev_dir@
+UDEV_RULE_DIR = @udev_dir@/rules.d
 HAVE_LIBURCU_ATOMIC64 = @have_liburcu_atomic64@
 USE_RADIX_TREE_FOR_INUMS = @use_radix_tree_for_inums@
 
diff --git a/m4/package_services.m4 b/m4/package_services.m4
index a683ddb93e0e91..de0504df0c206f 100644
--- a/m4/package_services.m4
+++ b/m4/package_services.m4
@@ -77,33 +77,33 @@ AC_DEFUN([AC_CONFIG_CROND_DIR],
 ])
 
 #
-# Figure out where to put udev rule files
+# Figure out where to put udev files
 #
-AC_DEFUN([AC_CONFIG_UDEV_RULE_DIR],
+AC_DEFUN([AC_CONFIG_UDEV_DIR],
 [
 	AC_REQUIRE([PKG_PROG_PKG_CONFIG])
-	AC_ARG_WITH([udev_rule_dir],
-	  [AS_HELP_STRING([--with-udev-rule-dir@<:@=DIR@:>@],
-		[Install udev rules into DIR.])],
+	AC_ARG_WITH([udev_dir],
+	  [AS_HELP_STRING([--with-udev-dir@<:@=DIR@:>@],
+		[Install udev files underneath DIR.])],
 	  [],
-	  [with_udev_rule_dir=yes])
-	AS_IF([test "x${with_udev_rule_dir}" != "xno"],
+	  [with_udev_dir=yes])
+	AS_IF([test "x${with_udev_dir}" != "xno"],
 	  [
-		AS_IF([test "x${with_udev_rule_dir}" = "xyes"],
+		AS_IF([test "x${with_udev_dir}" = "xyes"],
 		  [
 			PKG_CHECK_MODULES([udev], [udev],
 			  [
-				with_udev_rule_dir="$($PKG_CONFIG --variable=udev_dir udev)/rules.d"
+				with_udev_dir="$($PKG_CONFIG --variable=udev_dir udev)"
 			  ], [
-				with_udev_rule_dir=""
+				with_udev_dir=""
 			  ])
 			m4_pattern_allow([^PKG_(MAJOR|MINOR|BUILD|REVISION)$])
 		  ])
-		AC_MSG_CHECKING([for udev rule dir])
-		udev_rule_dir="${with_udev_rule_dir}"
-		AS_IF([test -n "${udev_rule_dir}"],
+		AC_MSG_CHECKING([for udev dir])
+		udev_dir="${with_udev_dir}"
+		AS_IF([test -n "${udev_dir}"],
 		  [
-			AC_MSG_RESULT(${udev_rule_dir})
+			AC_MSG_RESULT(${udev_dir})
 			have_udev="yes"
 		  ],
 		  [
@@ -115,5 +115,5 @@ AC_DEFUN([AC_CONFIG_UDEV_RULE_DIR],
 		have_udev="disabled"
 	  ])
 	AC_SUBST(have_udev)
-	AC_SUBST(udev_rule_dir)
+	AC_SUBST(udev_dir)
 ])

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 17/21] xfs_scrubbed: create a background monitoring service 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (15 preceding siblings ...) 2024-12-31 23:51 ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong @ 2024-12-31 23:52 ` Darrick J. Wong 2024-12-31 23:52 ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong ` (3 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a systemd service and activate it automatically. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/Makefile | 18 +++++++ scrub/xfs_scrubbed.in | 9 +++ scrub/xfs_scrubbed.rules | 7 +++ scrub/xfs_scrubbed@.service.in | 103 ++++++++++++++++++++++++++++++++++++++++ scrub/xfs_scrubbed_start | 17 +++++++ 5 files changed, 153 insertions(+), 1 deletion(-) create mode 100644 scrub/xfs_scrubbed.rules create mode 100644 scrub/xfs_scrubbed@.service.in create mode 100755 scrub/xfs_scrubbed_start diff --git a/scrub/Makefile b/scrub/Makefile index 7d4fa0ddc09685..731810d7c7fd9a 100644 --- a/scrub/Makefile +++ b/scrub/Makefile @@ -29,8 +29,16 @@ SYSTEMD_SERVICES=\ xfs_scrub_all.service \ xfs_scrub_all_fail.service \ xfs_scrub_all.timer \ - system-xfs_scrub.slice + system-xfs_scrub.slice \ + xfs_scrubbed@.service OPTIONAL_TARGETS += $(SYSTEMD_SERVICES) + +ifeq ($(HAVE_UDEV),yes) + XFS_SCRUBBED_UDEV_RULES = xfs_scrubbed.rules + XFS_SCRUBBED_HELPER = xfs_scrubbed_start + INSTALL_SCRUB += install-udev-scrubbed + OPTIONAL_TARGETS += $(XFS_SCRUBBED_HELPER) +endif endif ifeq ($(HAVE_CROND),yes) INSTALL_SCRUB += install-crond @@ -185,6 +193,14 @@ install-udev: $(UDEV_RULES) $(INSTALL) -m 644 $$i $(UDEV_RULE_DIR)/64-$$i; \ done +install-udev-scrubbed: $(XFS_SCRUBBED_HELPER) + $(INSTALL) -m 755 -d $(UDEV_DIR) + $(INSTALL) 
-m 755 $(XFS_SCRUBBED_HELPER) $(UDEV_DIR) + $(INSTALL) -m 755 -d $(UDEV_RULE_DIR) + for i in $(XFS_SCRUBBED_UDEV_RULES); do \ + $(INSTALL) -m 644 $$i $(UDEV_RULE_DIR)/64-$$i; \ + done + install-dev: -include .dep diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index a4e073b3098f7a..9df6f45e53ad80 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -19,6 +19,7 @@ import gc from concurrent.futures import ProcessPoolExecutor import ctypes.util import collections +import time try: # Not all systems will have this json schema validation libarary, @@ -994,6 +995,14 @@ def main(): pass args.event_queue.shutdown() + + # See the service mode comments in xfs_scrub.c for why we sleep and + # compress all nonzero exit codes to 1. + if 'SERVICE_MODE' in os.environ: + time.sleep(2) + if ret != 0: + ret = 1 + return ret if __name__ == '__main__': diff --git a/scrub/xfs_scrubbed.rules b/scrub/xfs_scrubbed.rules new file mode 100644 index 00000000000000..c651126d5373a1 --- /dev/null +++ b/scrub/xfs_scrubbed.rules @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0-or-later +# +# Copyright (c) 2024-2025 Oracle. All rights reserved. +# Author: Darrick J. Wong <djwong@kernel.org> +# +# Start autonomous self healing automatically +ACTION=="add", SUBSYSTEM=="xfs", ENV{TYPE}=="mount", RUN+="xfs_scrubbed_start" diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in new file mode 100644 index 00000000000000..9656bdb3cd9a9d --- /dev/null +++ b/scrub/xfs_scrubbed@.service.in @@ -0,0 +1,103 @@ +# SPDX-License-Identifier: GPL-2.0-or-later +# +# Copyright (c) 2024-2025 Oracle. All Rights Reserved. +# Author: Darrick J. 
Wong <djwong@kernel.org> + +[Unit] +Description=Self Healing of XFS Metadata for %f +Documentation=man:xfs_scrubbed(8) + +# Explicitly require the capabilities that this program needs +ConditionCapability=CAP_SYS_ADMIN +ConditionCapability=CAP_DAC_OVERRIDE + +# Must be a mountpoint +ConditionPathIsMountPoint=%f +RequiresMountsFor=%f + +[Service] +Type=exec +Environment=SERVICE_MODE=1 +ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f +SyslogIdentifier=%N + +# Run scrub with minimal CPU and IO priority so that nothing else will starve. +IOSchedulingClass=idle +CPUSchedulingPolicy=idle +CPUAccounting=true +Nice=19 + +# Create the service underneath the scrub background service slice so that we +# can control resource usage. +Slice=system-xfs_scrub.slice + +# No realtime CPU scheduling +RestrictRealtime=true + +# Dynamically create a user that isn't root +DynamicUser=true + +# Make the entire filesystem readonly, but don't hide /home and don't use a +# private bind mount like xfs_scrub. We don't want to pin the filesystem, +# because we want umount to work correctly and this service to stop +# automatically. +ProtectSystem=strict +ProtectHome=no +PrivateTmp=true +PrivateDevices=true + +# Don't let scrub complain about paths in /etc/projects that have been hidden +# by our sandboxing. scrub doesn't care about project ids anyway. 
+InaccessiblePaths=-/etc/projects + +# No network access +PrivateNetwork=true +ProtectHostname=true +RestrictAddressFamilies=none +IPAddressDeny=any + +# Don't let the program mess with the kernel configuration at all +ProtectKernelLogs=true +ProtectKernelModules=true +ProtectKernelTunables=true +ProtectControlGroups=true +ProtectProc=invisible +RestrictNamespaces=true + +# Hide everything in /proc, even /proc/mounts +ProcSubset=pid + +# Only allow the default personality Linux +LockPersonality=true + +# No writable memory pages +MemoryDenyWriteExecute=true + +# Don't let our mounts leak out to the host +PrivateMounts=true + +# Restrict system calls to the native arch and only enough to get things going +SystemCallArchitectures=native +SystemCallFilter=@system-service +SystemCallFilter=~@privileged +SystemCallFilter=~@resources +SystemCallFilter=~@mount + +# xfs_scrubbed needs these privileges to open the rootdir and monitor +CapabilityBoundingSet=CAP_SYS_ADMIN CAP_DAC_OVERRIDE +AmbientCapabilities=CAP_SYS_ADMIN CAP_DAC_OVERRIDE +NoNewPrivileges=true + +# xfs_scrubbed doesn't create files +UMask=7777 + +# No access to hardware /dev files except for block devices +ProtectClock=true +DevicePolicy=closed + +[Install] +WantedBy=multi-user.target +# If someone tries to enable the template itself, translate that into enabling +# this service on the root directory at systemd startup time. In the +# initramfs, the udev rules in xfs_scrubbed.rules run before systemd starts. +DefaultInstance=- diff --git a/scrub/xfs_scrubbed_start b/scrub/xfs_scrubbed_start new file mode 100755 index 00000000000000..82530cf7862717 --- /dev/null +++ b/scrub/xfs_scrubbed_start @@ -0,0 +1,17 @@ +#!/bin/sh + +# SPDX-License-Identifier: GPL-2.0-or-later +# +# Copyright (c) 2024-2025 Oracle. All Rights Reserved. +# Author: Darrick J. 
Wong <djwong@kernel.org> + +# Start the xfs_scrubbed service when the filesystem is mounted + +command -v systemctl || exit 0 + +grep "^$SOURCE[[:space:]]" /proc/mounts | while read source mntpt therest; do + inst="$(systemd-escape --path "$mntpt")" + systemctl restart --no-block "xfs_scrubbed@$inst" && break +done + +exit 0 ^ permalink raw reply related [flat|nested] 110+ messages in thread
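The helper script above finds every mountpoint whose source device matches $SOURCE before asking systemd to start an instance for it. The lookup step can be sketched as follows, assuming whitespace-separated fields as in /proc/mounts. The function name find_mountpoints and the sample mount table are invented for illustration, and the sketch compares the first field exactly instead of using a grep pattern; deriving the unit instance name is left to systemd-escape, as in the script.

```python
def find_mountpoints(mounts_text, source):
    '''Return mountpoints whose first field equals source, in file order.'''
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == source:
            found.append(fields[1])
    return found

# Hypothetical mount table; the same device can appear more than once.
SAMPLE = """\
/dev/sda1 / xfs rw,relatime 0 0
/dev/sdb1 /srv/data xfs rw,relatime 0 0
/dev/sdb10 /srv/other xfs rw,relatime 0 0
/dev/sdb1 /mnt/bind xfs rw,relatime 0 0
"""

print(find_mountpoints(SAMPLE, "/dev/sdb1"))
```

Matching on the whole first field keeps /dev/sdb10 from being picked up when the event is for /dev/sdb1, which is what the trailing [[:space:]] in the script's grep pattern achieves.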
* [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (16 preceding siblings ...) 2024-12-31 23:52 ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong @ 2024-12-31 23:52 ` Darrick J. Wong 2024-12-31 23:52 ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong ` (2 subsequent siblings) 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Use ExecCondition= in the system service to check if kernel support for the health monitor is available. If not, we don't want to run the service, have it fail, and generate a bunch of silly log messages. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/xfs_scrubbed.in | 39 ++++++++++++++++++++++++++++++++++++++- scrub/xfs_scrubbed@.service.in | 1 + 2 files changed, 39 insertions(+), 1 deletion(-) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index 9df6f45e53ad80..90602481f64c88 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -791,6 +791,38 @@ def monitor(mountpoint, event_queue, **kwargs): return 0 +def check_monitor(mountpoint): + '''Check if the kernel can send us health events for the given mountpoint.''' + global log + global printf_prefix + global everything + global want_repair + global has_parent + + try: + fd = os.open(mountpoint, os.O_RDONLY) + except OSError as e: + # Can't open mountpoint; monitor not available. + print(f"{mountpoint}: {e}", file = sys.stderr) + return 1 + + try: + mon_fd = open_health_monitor(fd, verbose = everything) + except OSError as e: + # Error opening monitor (or it's simply not there); monitor + # not available. 
+ if e.errno == errno.ENOTTY or e.errno == errno.EOPNOTSUPP: + print(f"{mountpoint}: XFS health monitoring not supported.", + file = sys.stderr) + return 1 + finally: + # Close the mountpoint if opening the health monitor fails; + # the handle object will free its own memory. + os.close(fd) + + # Monitor available; success! + return 0 + def __scrub_type(code): '''Convert a "structures" json list to a scrub type code.''' SCRUB_TYPES = { @@ -923,6 +955,8 @@ def main(): parser = argparse.ArgumentParser( \ description = "XFS filesystem health monitoring demon.") + parser.add_argument("--check", help = "Check presense of health monitor.", \ + action = "store_true") parser.add_argument("--debug", help = "Enabling debugging messages.", \ action = "store_true") parser.add_argument("--log", help = "Log health events to stdout.", \ @@ -989,7 +1023,10 @@ def main(): printf_prefix = args.mountpoint ret = 0 try: - ret = monitor(**vars(args)) + if args.check: + ret = check_monitor(args.mountpoint) + else: + ret = monitor(**vars(args)) except KeyboardInterrupt: # Consider SIGINT to be a clean exit. pass diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in index 9656bdb3cd9a9d..afd5c204327946 100644 --- a/scrub/xfs_scrubbed@.service.in +++ b/scrub/xfs_scrubbed@.service.in @@ -18,6 +18,7 @@ RequiresMountsFor=%f [Service] Type=exec Environment=SERVICE_MODE=1 +ExecCondition=@pkg_libexec_dir@/xfs_scrubbed --check %f ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f SyslogIdentifier=%N ^ permalink raw reply related [flat|nested] 110+ messages in thread
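The --check mode above is a probe: try to open the monitor, translate the outcome into an exit status, and let the service manager decide whether to run the unit. As I understand the systemd.service documentation, an ExecCondition= command that exits 0 lets startup continue, while an exit status from 1 to 254 skips the unit without marking it failed. A minimal sketch of the probe pattern follows, with the kernel interface replaced by stub callables; probe, unsupported and supported are hypothetical names, not functions from xfs_scrubbed.

```python
import errno
import io

def probe(open_monitor):
    '''Return (exit_code, reason) for an attempt to open the monitor.'''
    try:
        handle = open_monitor()
    except OSError as e:
        if e.errno in (errno.ENOTTY, errno.EOPNOTSUPP):
            reason = "health monitoring not supported"
        else:
            reason = str(e)
        return 1, reason
    # Only a successful open leaves something to release.
    handle.close()
    return 0, "ok"

def unsupported():
    '''Stub for a kernel that does not know the ioctl.'''
    raise OSError(errno.ENOTTY, "Inappropriate ioctl for device")

def supported():
    '''Stub for a kernel that hands back a monitor object.'''
    return io.BytesIO()

print(probe(unsupported))
print(probe(supported))
```

The sketch treats every OSError as "not available" and reports only the reason differently, which may or may not match the indentation of the return statement in the flattened patch text.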
* [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode 2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong ` (17 preceding siblings ...) 2024-12-31 23:52 ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong @ 2024-12-31 23:52 ` Darrick J. Wong 2024-12-31 23:52 ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong 2024-12-31 23:53 ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong 20 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw) To: aalbersh, djwong; +Cc: linux-xfs From: Darrick J. Wong <djwong@kernel.org> Make the xfs_scrubbed background service query the autofsck filesystem property to figure out which operating mode it should use. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- scrub/xfs_scrubbed.in | 62 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 61 insertions(+), 1 deletion(-) diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in index 90602481f64c88..2b34603cb361e2 100644 --- a/scrub/xfs_scrubbed.in +++ b/scrub/xfs_scrubbed.in @@ -573,6 +573,21 @@ def fgetpath(fd, fh = None, mountpoint = None): break return ret +# Filesystem properties + +FSPROP_NAMESPACE = "trusted." 
+FSPROP_NAME_PREFIX = "xfs:" +FSPROP_AUTOFSCK_NAME = "autofsck" + +def fsprop_attrname(n): + '''Construct the xattr name for a filesystem property.''' + return f"{FSPROP_NAMESPACE}{FSPROP_NAME_PREFIX}{n}" + +def fsprop_getstr(fd, n): + '''Return the value of a filesystem property as a string.''' + attrname = fsprop_attrname(n) + return os.getxattr(fd, attrname).decode('utf-8') + # main program def health_reports(mon_fp, fh): @@ -731,6 +746,31 @@ def handle_event(e): elif want_repair and event['type'] == 'sick': repair_queue.submit(repair_metadata, event, fh) +def want_repair_from_autofsck(fd): + '''Determine want_repair from the autofsck filesystem property.''' + global has_parent + global has_rmapbt + + try: + advice = fsprop_getstr(fd, FSPROP_AUTOFSCK_NAME) + if advice == "repair": + return True + if advice == "check" or advice == "optimize": + return False + if advice == "none": + return None + except: + # Any OS error (including ENODATA) or string parsing error is + # treated the same as an unrecognized value. + pass + + # For an unrecognized value, log but do not fix runtime corruption if + # backref metadata are enabled. If no backref metadata are available, + # the fs is too old so don't run at all. + if has_rmapbt or has_parent: + return False + return None + def monitor(mountpoint, event_queue, **kwargs): '''Monitor the given mountpoint for health events.''' global everything @@ -749,6 +789,20 @@ def monitor(mountpoint, event_queue, **kwargs): # Don't care if we can't detect parent pointers or rmap print(f'{printf_prefix}: detecting fs features: {e}', file = sys.stderr) + # Does the sysadmin have any advice for us about whether or not to + # background scrub? 
+ if want_repair is None: + want_repair = want_repair_from_autofsck(fd) + if want_repair is None: + print(f"{mountpoint}: Disabling daemon per autofsck directive.") + os.close(fd) + return 0 + elif want_repair: + print(f"{mountpoint}: Automatically repairing per autofsck directive.") + else: + print(f"{mountpoint}: Only logging errors per autofsck directive.") + + # Check for the backref metadata that makes repair effective. if want_repair: if not has_rmapbt: @@ -963,7 +1017,11 @@ def main(): action = "store_true") parser.add_argument("--everything", help = "Capture all events.", \ action = "store_true") - parser.add_argument("--repair", help = "Automatically repair corrupt metadata.", \ + action_group = parser.add_mutually_exclusive_group() + action_group.add_argument("--repair", \ + help = "Automatically repair corrupt metadata.", \ + action = "store_true") + action_group.add_argument("--autofsck", help = argparse.SUPPRESS, \ action = "store_true") parser.add_argument("-V", help = "Report version and exit.", \ action = "store_true") @@ -1004,6 +1062,8 @@ def main(): everything = True if args.debug_fast: debug_fast = True + if args.autofsck: + want_repair = None if args.repair: want_repair = True ^ permalink raw reply related [flat|nested] 110+ messages in thread
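The mode selection in the patch above is a three-way decision: repair, log only, or do not run at all. A minimal standalone sketch of that mapping is shown here, with the xattr lookup factored out so the decision can be exercised directly. The function name repair_mode is hypothetical; the value strings and the fallback rule are taken from the patch.

```python
def repair_mode(advice, has_rmapbt, has_parent):
    '''Map an autofsck value to True (repair), False (log only),
    or None (do not run the daemon).'''
    if advice == "repair":
        return True
    if advice in ("check", "optimize"):
        return False
    if advice == "none":
        return None
    # Missing or unrecognized value: log only when backref metadata
    # (rmapbt or parent pointers) exist, otherwise do not run.
    if has_rmapbt or has_parent:
        return False
    return None

for advice in ("repair", "check", "none", None):
    print(advice, repair_mode(advice, has_rmapbt=True, has_parent=False))
```

Returning None for "do not run" keeps the result distinguishable from False, which is why callers have to test with `is None` before treating the value as a boolean.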
* [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
` (18 preceding siblings ...)
2024-12-31 23:52 ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong
@ 2024-12-31 23:52 ` Darrick J. Wong
2024-12-31 23:53 ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:52 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If the media scan finds that media have been lost, report this to the
kernel so that the healthmon code can pass that along to xfs_scrubbed.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 scrub/phase6.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/scrub/phase6.c b/scrub/phase6.c
index 5a1f29738680e5..b5f6f3c1d4bc63 100644
--- a/scrub/phase6.c
+++ b/scrub/phase6.c
@@ -671,6 +671,29 @@ clean_pool(
 	return ret;
 }
 
+static void
+report_ioerr_to_kernel(
+	struct scrub_ctx		*ctx,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length)
+{
+	struct xfs_media_error		me = {
+		.daddr			= start,
+		.bbcount		= length,
+	};
+	dev_t				dev = disk_to_dev(ctx, disk);
+
+	if (dev == ctx->fsinfo.fs_datadev)
+		me.flags |= XFS_MEDIA_ERROR_DATADEV;
+	else if (dev == ctx->fsinfo.fs_rtdev)
+		me.flags |= XFS_MEDIA_ERROR_RTDEV;
+	else if (dev == ctx->fsinfo.fs_logdev)
+		me.flags |= XFS_MEDIA_ERROR_LOGDEV;
+
+	ioctl(ctx->mnt.fd, XFS_IOC_MEDIA_ERROR, &me);
+}
+
 /* Remember a media error for later. */
 static void
 remember_ioerr(
@@ -695,6 +718,8 @@ remember_ioerr(
 		return;
 	}
 
+	report_ioerr_to_kernel(ctx, disk, start, length);
+
 	tree = bitmap_for_disk(ctx, disk, vs);
 	if (!tree) {
 		str_liberror(ctx, ENOENT, _("finding bad block bitmap"));

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
` (19 preceding siblings ...)
2024-12-31 23:52 ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong
@ 2024-12-31 23:53 ` Darrick J. Wong
20 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:53 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we're finished building autonomous repair, enable the service
on the root filesystem by default. The root filesystem is mounted by
the initrd prior to starting systemd, which is why the udev rule cannot
autostart the service for the root filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible. Use a fugly shim for this.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 debian/control                 |    2 +-
 debian/postinst                |    8 ++++++++
 debian/prerm                   |   13 +++++++++++++
 scrub/xfs_scrubbed@.service.in |    2 +-
 4 files changed, 23 insertions(+), 2 deletions(-)
 create mode 100644 debian/prerm

diff --git a/debian/control b/debian/control
index 66b0a47a36ee24..31ea1e988f66be 100644
--- a/debian/control
+++ b/debian/control
@@ -10,7 +10,7 @@ Homepage: https://xfs.wiki.kernel.org/
 Package: xfsprogs
 Depends: ${shlibs:Depends}, ${misc:Depends}, python3-dbus, python3:any
 Provides: fsck-backend
-Suggests: xfsdump, acl, attr, quota
+Suggests: xfsdump, acl, attr, quota, python3-jsonschema
 Breaks: xfsdump (<< 3.0.0)
 Replaces: xfsdump (<< 3.0.0)
 Architecture: linux-any
diff --git a/debian/postinst b/debian/postinst
index 2ad9174658ceb4..4ba2e0c43b887e 100644
--- a/debian/postinst
+++ b/debian/postinst
@@ -24,5 +24,13 @@ case "${1}" in
 esac
 
 #DEBHELPER#
+#
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll install it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ -d /run/systemd/system ] ; then
+	if [ "$1" = "configure" ] || [ "$1" = "abort-upgrade" ] || [ "$1" = "abort-deconfigure" ] || [ "$1" = "abort-remove" ] ; then
+		/bin/systemctl enable xfs_scrubbed@.service || true
+	fi
+fi
 
 exit 0
diff --git a/debian/prerm b/debian/prerm
new file mode 100644
index 00000000000000..48e8e94c4fe9ac
--- /dev/null
+++ b/debian/prerm
@@ -0,0 +1,13 @@
+#!/bin/sh
+
+set -e
+
+# dh_installsystemd doesn't handle template services even if we supply a
+# default instance, so we'll install it here.
+if [ -z "${DPKG_ROOT:-}" ] && [ "$1" = remove ] && [ -d /run/systemd/system ] ; then
+	/bin/systemctl disable xfs_scrubbed@.service || true
+fi
+
+#DEBHELPER#
+
+exit 0
diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in
index afd5c204327946..5bf1e79031af8c 100644
--- a/scrub/xfs_scrubbed@.service.in
+++ b/scrub/xfs_scrubbed@.service.in
@@ -19,7 +19,7 @@ RequiresMountsFor=%f
 Type=exec
 Environment=SERVICE_MODE=1
 ExecCondition=@pkg_libexec_dir@/xfs_scrubbed --check %f
-ExecStart=@pkg_libexec_dir@/xfs_scrubbed --log %f
+ExecStart=@pkg_libexec_dir@/xfs_scrubbed --autofsck --log %f
 SyslogIdentifier=%N
 
 # Run scrub with minimal CPU and IO priority so that nothing else will starve.

^ permalink raw reply related [flat|nested] 110+ messages in thread
* [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (8 preceding siblings ...)
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
2024-12-31 23:53 ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
` (9 more replies)
2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
` (5 subsequent siblings)
15 siblings, 10 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

Hi all,

This series enables xfs_repair to add select features to existing V5
filesystems. Specifically, one can add free inode btrees, reflink
support, and reverse mapping.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-newer-features

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=upgrade-newer-features
---
Commits in this patchset:
* xfs_repair: allow sysadmins to add free inode btree indexes
* xfs_repair: allow sysadmins to add reflink
* xfs_repair: allow sysadmins to add reverse mapping indexes
* xfs_repair: upgrade an existing filesystem to have parent pointers
* xfs_repair: allow sysadmins to add metadata directories
* xfs_repair: upgrade filesystems to support rtgroups when adding metadir
* xfs_repair: allow sysadmins to add realtime reverse mapping indexes
* xfs_repair: allow sysadmins to add realtime reflink
* xfs_repair: skip free space checks when upgrading
* xfs_repair: allow adding rmapbt to reflink filesystems
---
 libxfs/libxfs_api_defs.h |    1
 man/man8/xfs_admin.8     |   37 +++++
 repair/dino_chunks.c     |    6 +
 repair/dinode.c          |    5 +
 repair/globals.c         |    7 +
 repair/globals.h         |    7 +
 repair/phase2.c          |  341 +++++++++++++++++++++++++++++++++++++++++++++-
 repair/phase4.c          |    5 +
 repair/pptr.c            |   15 ++
 repair/protos.h          |    6 +
 repair/rmap.c            |   12 +-
 repair/xfs_repair.c      |   77 ++++++++++
 12 files changed, 505 insertions(+), 14 deletions(-)

^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes
From: Darrick J. Wong @ 2024-12-31 23:53 UTC
To: aalbersh, djwong
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the free inode btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    7 +++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   28 +++++++++++++++++++++++++++-
 repair/xfs_repair.c  |   11 +++++++++++
 5 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 63f8ee90307b30..e07fc3ddb3fb82 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -163,6 +163,13 @@ .SH OPTIONS
 extended attributes, symbolic links, and realtime free space metadata.
 The filesystem cannot be downgraded after this feature is enabled.
 Once enabled, the filesystem will not be mountable by older kernels.
+.TP 0.4i
+.B finobt
+Track free inodes through a separate free inode btree index to speed up inode
+allocation on old filesystems.
+This upgrade can fail if any AG has less than 1% free space remaining.
+The filesystem cannot be downgraded after this feature is enabled.
+This feature was added to Linux 3.16.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index 143b4a8beb53f4..f13497c3121d6b 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -53,6 +53,7 @@ bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
 bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
+bool	add_finobt;		/* add free inode btrees */

 /* misc status variables */

diff --git a/repair/globals.h b/repair/globals.h
index 8bb9bbaeca4fb0..c5b27d9a60cf2e 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -94,6 +94,7 @@ extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
 extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
+extern bool	add_finobt;		/* add free inode btrees */

 /* misc status variables */

diff --git a/repair/phase2.c b/repair/phase2.c
index 71576f5806e473..1bb7cd19025be7 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -123,7 +123,7 @@ set_inobtcount(
 		exit(0);
 	}

-	if (!xfs_has_finobt(mp)) {
+	if (!xfs_has_finobt(mp) && !add_finobt) {
 		printf(
 	_("Inode btree count feature requires free inode btree.\n"));
 		exit(0);
@@ -212,6 +212,28 @@ set_exchrange(
 	return true;
 }

+static bool
+set_finobt(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_finobt(mp)) {
+		printf(_("Filesystem already supports free inode btrees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Free inode btree feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding free inode btrees to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_FINOBT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -378,6 +400,8 @@ need_check_fs_free_space(
 	struct xfs_mount		*mp,
 	const struct check_state	*old)
 {
+	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
+		return true;
 	return false;
 }

@@ -455,6 +479,8 @@ upgrade_filesystem(
 		dirty |= set_nrext64(mp, &new_sb);
 	if (add_exchrange)
 		dirty |= set_exchrange(mp, &new_sb);
+	if (add_finobt)
+		dirty |= set_finobt(mp, &new_sb);
 	if (!dirty)
 		return;

diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 7bf75c09b94542..d8f92b52b66f3a 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -71,6 +71,7 @@ enum c_opt_nums {
 	CONVERT_BIGTIME,
 	CONVERT_NREXT64,
 	CONVERT_EXCHRANGE,
+	CONVERT_FINOBT,
 	C_MAX_OPTS,
 };

@@ -80,6 +81,7 @@ static char *c_opts[] = {
 	[CONVERT_BIGTIME]	= "bigtime",
 	[CONVERT_NREXT64]	= "nrext64",
 	[CONVERT_EXCHRANGE]	= "exchange",
+	[CONVERT_FINOBT]	= "finobt",
 	[C_MAX_OPTS]		= NULL,
 };

@@ -372,6 +374,15 @@ process_args(int argc, char **argv)
 		_("-c exchange only supports upgrades\n"));
 					add_exchrange = true;
 					break;
+				case CONVERT_FINOBT:
+					if (!val)
+						do_abort(
+		_("-c finobt requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c finobt only supports upgrades\n"));
+					add_finobt = true;
+					break;
 				default:
 					unknown('c', val);
 					break;
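The "-c finobt" handling above accepts only the value 1, since the option can add a feature but never remove it. A minimal standalone sketch of that rule follows. The names parse_upgrade_arg and enum upgrade_arg_status are hypothetical and exist only for illustration; xfs_repair itself splits the suboption first and calls do_abort() rather than returning a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes; xfs_repair calls do_abort() instead. */
enum upgrade_arg_status {
	UPGRADE_OK,
	UPGRADE_NO_VALUE,	/* "-c finobt" with no "=value" */
	UPGRADE_NOT_AN_UPGRADE,	/* value parsed to something other than 1 */
	UPGRADE_UNKNOWN,	/* a different suboption name */
};

/*
 * Check one "name=value" suboption against a single feature name and
 * set *flag only when the value parses to 1.
 */
enum upgrade_arg_status
parse_upgrade_arg(const char *arg, const char *feature, bool *flag)
{
	size_t		namelen;

	assert(arg != NULL && feature != NULL && flag != NULL);

	namelen = strlen(feature);
	if (strncmp(arg, feature, namelen) != 0)
		return UPGRADE_UNKNOWN;
	if (arg[namelen] == '\0')
		return UPGRADE_NO_VALUE;
	if (arg[namelen] != '=')
		return UPGRADE_UNKNOWN;

	/* Base 0 mirrors the strtol(val, NULL, 0) call in the patch. */
	if (strtol(arg + namelen + 1, NULL, 0) != 1)
		return UPGRADE_NOT_AN_UPGRADE;

	*flag = true;
	return UPGRADE_OK;
}
```

Because the base is 0, "finobt=0x1" is accepted just like "finobt=1", while "finobt=0" and an empty value are both rejected as not being upgrades.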
* [PATCH 02/10] xfs_repair: allow sysadmins to add reflink
From: Darrick J. Wong @ 2024-12-31 23:53 UTC
To: aalbersh, djwong
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reference count btree, and therefore reflink.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    6 ++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   33 ++++++++++++++++++++++++++++++++-
 repair/rmap.c        |    6 +++---
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index e07fc3ddb3fb82..3a9175c9f018e5 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -170,6 +170,12 @@ .SH OPTIONS
 This upgrade can fail if any AG has less than 1% free space remaining.
 The filesystem cannot be downgraded after this feature is enabled.
 This feature was added to Linux 3.16.
+.TP 0.4i
+.B reflink
+Enable sharing of file data blocks.
+This upgrade can fail if any AG has less than 2% free space remaining.
+The filesystem cannot be downgraded after this feature is enabled.
+This feature was added to Linux 4.9.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index f13497c3121d6b..cf4421e34dec84 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -54,6 +54,7 @@ bool	add_bigtime;		/* add support for timestamps up to 2486 */
 bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
+bool	add_reflink;		/* add reference count btrees */

 /* misc status variables */

diff --git a/repair/globals.h b/repair/globals.h
index c5b27d9a60cf2e..efbb8db79bc080 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -95,6 +95,7 @@ extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
 extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
+extern bool	add_reflink;		/* add reference count btrees */

 /* misc status variables */

diff --git a/repair/phase2.c b/repair/phase2.c
index 1bb7cd19025be7..9cd841f8d05fc6 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -200,7 +200,7 @@ set_exchrange(
 		exit(0);
 	}

-	if (!xfs_has_reflink(mp)) {
+	if (!xfs_has_reflink(mp) && !add_reflink) {
 		printf(
 	_("File exchange-range feature cannot be added without reflink.\n"));
 		exit(0);
@@ -234,6 +234,33 @@ set_finobt(
 	return true;
 }

+static bool
+set_reflink(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_reflink(mp)) {
+		printf(_("Filesystem already supports reflink.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Reflink feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_realtime(mp)) {
+		printf(_("Reflink feature not supported with realtime.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding reflink support to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_REFLINK;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -402,6 +429,8 @@ need_check_fs_free_space(
 {
 	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
 		return true;
+	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
+		return true;
 	return false;
 }

@@ -481,6 +510,8 @@ upgrade_filesystem(
 		dirty |= set_exchrange(mp, &new_sb);
 	if (add_finobt)
 		dirty |= set_finobt(mp, &new_sb);
+	if (add_reflink)
+		dirty |= set_reflink(mp, &new_sb);
 	if (!dirty)
 		return;

diff --git a/repair/rmap.c b/repair/rmap.c
index 97510dd875911a..91f864351f6013 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -68,7 +68,7 @@ bool
 rmap_needs_work(
 	struct xfs_mount	*mp)
 {
-	return xfs_has_reflink(mp) ||
+	return xfs_has_reflink(mp) || add_reflink ||
 	       xfs_has_rmapbt(mp);
 }

@@ -1800,7 +1800,7 @@ check_refcounts(
 	struct xfs_perag	*pag = NULL;
 	int			error;

-	if (!xfs_has_reflink(mp))
+	if (!xfs_has_reflink(mp) || add_reflink)
 		return;
 	if (refcbt_suspect) {
 		if (no_modify && agno == 0)
@@ -1859,7 +1859,7 @@ check_rtrefcounts(
 	struct xfs_inode	*ip = NULL;
 	int			error;

-	if (!xfs_has_reflink(mp))
+	if (!xfs_has_reflink(mp) || add_reflink)
 		return;
 	if (refcbt_suspect) {
 		if (no_modify && rgno == 0)
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index d8f92b52b66f3a..e436dc2ef736d6 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -72,6 +72,7 @@ enum c_opt_nums {
 	CONVERT_NREXT64,
 	CONVERT_EXCHRANGE,
 	CONVERT_FINOBT,
+	CONVERT_REFLINK,
 	C_MAX_OPTS,
 };

@@ -82,6 +83,7 @@ static char *c_opts[] = {
 	[CONVERT_NREXT64]	= "nrext64",
 	[CONVERT_EXCHRANGE]	= "exchange",
 	[CONVERT_FINOBT]	= "finobt",
+	[CONVERT_REFLINK]	= "reflink",
 	[C_MAX_OPTS]		= NULL,
 };

@@ -383,6 +385,15 @@ process_args(int argc, char **argv)
 		_("-c finobt only supports upgrades\n"));
 					add_finobt = true;
 					break;
+				case CONVERT_REFLINK:
+					if (!val)
+						do_abort(
+		_("-c reflink requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c reflink only supports upgrades\n"));
+					add_reflink = true;
+					break;
 				default:
 					unknown('c', val);
 					break;
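The repair/rmap.c changes in the reflink patch separate two decisions: whether repair should gather refcount observations at all, and whether it should compare them against an on-disk refcount btree. The sketch below restates only the reflink half of those conditions as pure functions; the function names are hypothetical and the real code reads xfs_has_reflink(mp) and the add_reflink global directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Gather observations if reflink exists already or is being added. */
bool
needs_refcount_observations(bool has_reflink, bool add_reflink)
{
	return has_reflink || add_reflink;
}

/*
 * Compare against the on-disk refcount btree only when the feature is
 * present and is not being added by this run; the patch returns early
 * from check_refcounts() when (!has_reflink || add_reflink).
 */
bool
should_verify_refcountbt(bool has_reflink, bool add_reflink)
{
	return has_reflink && !add_reflink;
}

/* Walk all four input combinations and confirm the relationship. */
void
check_upgrade_invariant(void)
{
	for (int has = 0; has <= 1; has++) {
		for (int add = 0; add <= 1; add++) {
			/* A btree that is being added is never verified. */
			if (add)
				assert(!should_verify_refcountbt(has, add));
			/* Verification implies observations were gathered. */
			if (should_verify_refcountbt(has, add))
				assert(needs_refcount_observations(has, add));
		}
	}
}
```

The point of the split, as far as the diff shows, is that an upgrade run still needs the observations to build the new btree, but has no pre-existing btree worth checking.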
* [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes
From: Darrick J. Wong @ 2024-12-31 23:53 UTC
To: aalbersh, djwong
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index.  This is needed for online
fsck.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   38 ++++++++++++++++++++++++++++++++++++++
 repair/rmap.c        |    6 +++---
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 3a9175c9f018e5..74a400dcfeb557 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -176,6 +176,14 @@ .SH OPTIONS
 This upgrade can fail if any AG has less than 2% free space remaining.
 The filesystem cannot be downgraded after this feature is enabled.
 This feature was added to Linux 4.9.
+.TP 0.4i
+.B rmapbt
+Store an index of the owners of on-disk blocks.
+This enables much stronger cross-referencing of various metadata structures
+and online repairs to space usage metadata.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if any AG has less than 5% free space remaining.
+This feature was added to Linux 4.8.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index cf4421e34dec84..dd7c422bb922e4 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -55,6 +55,7 @@ bool	add_nrext64;
 bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
+bool	add_rmapbt;		/* add reverse mapping btrees */

 /* misc status variables */

diff --git a/repair/globals.h b/repair/globals.h
index efbb8db79bc080..d8c2aae23d8f0a 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -96,6 +96,7 @@ extern bool	add_nrext64;
 extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
+extern bool	add_rmapbt;		/* add reverse mapping btrees */

 /* misc status variables */

diff --git a/repair/phase2.c b/repair/phase2.c
index 9cd841f8d05fc6..9dd37e7fc5c111 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -261,6 +261,40 @@ set_reflink(
 	return true;
 }

+static bool
+set_rmapbt(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_rmapbt(mp)) {
+		printf(_("Filesystem already supports reverse mapping btrees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Reverse mapping btree feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_realtime(mp)) {
+		printf(
+	_("Reverse mapping btree feature not supported with realtime.\n"));
+		exit(0);
+	}
+
+	if (xfs_has_reflink(mp) && !add_reflink) {
+		printf(
+	_("Reverse mapping btrees cannot be added when reflink is enabled.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding reverse mapping btrees to filesystem.\n"));
+	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_RMAPBT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -431,6 +465,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
 		return true;
+	if (xfs_has_rmapbt(mp) && !(old->features & XFS_FEAT_RMAPBT))
+		return true;
 	return false;
 }

@@ -512,6 +548,8 @@ upgrade_filesystem(
 		dirty |= set_finobt(mp, &new_sb);
 	if (add_reflink)
 		dirty |= set_reflink(mp, &new_sb);
+	if (add_rmapbt)
+		dirty |= set_rmapbt(mp, &new_sb);
 	if (!dirty)
 		return;

diff --git a/repair/rmap.c b/repair/rmap.c
index 91f864351f6013..f1f837d33ea4f4 100644
--- a/repair/rmap.c
+++ b/repair/rmap.c
@@ -69,7 +69,7 @@ rmap_needs_work(
 	struct xfs_mount	*mp)
 {
 	return xfs_has_reflink(mp) || add_reflink ||
-	       xfs_has_rmapbt(mp);
+	       xfs_has_rmapbt(mp) || add_rmapbt;
 }

 static inline bool rmaps_has_observations(const struct xfs_ag_rmap *ag_rmap)
@@ -1339,7 +1339,7 @@ rmaps_verify_btree(
 	struct xfs_perag	*pag = NULL;
 	int			error;

-	if (!xfs_has_rmapbt(mp))
+	if (!xfs_has_rmapbt(mp) || add_rmapbt)
 		return;
 	if (rmapbt_suspect) {
 		if (no_modify && agno == 0)
@@ -1398,7 +1398,7 @@ rtrmaps_verify_btree(
 	struct xfs_inode	*ip = NULL;
 	int			error;

-	if (!xfs_has_rmapbt(mp))
+	if (!xfs_has_rmapbt(mp) || add_rmapbt)
 		return;
 	if (rmapbt_suspect) {
 		if (no_modify && rgno == 0)
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index e436dc2ef736d6..ca72c65f9d772a 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -73,6 +73,7 @@ enum c_opt_nums {
 	CONVERT_EXCHRANGE,
 	CONVERT_FINOBT,
 	CONVERT_REFLINK,
+	CONVERT_RMAPBT,
 	C_MAX_OPTS,
 };

@@ -84,6 +85,7 @@ static char *c_opts[] = {
 	[CONVERT_EXCHRANGE]	= "exchange",
 	[CONVERT_FINOBT]	= "finobt",
 	[CONVERT_REFLINK]	= "reflink",
+	[CONVERT_RMAPBT]	= "rmapbt",
 	[C_MAX_OPTS]		= NULL,
 };

@@ -394,6 +396,15 @@ process_args(int argc, char **argv)
 		_("-c reflink only supports upgrades\n"));
 					add_reflink = true;
 					break;
+				case CONVERT_RMAPBT:
+					if (!val)
+						do_abort(
+		_("-c rmapbt requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c rmapbt only supports upgrades\n"));
+					add_rmapbt = true;
+					break;
 				default:
 					unknown('c', val);
 					break;
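Each patch so far extends need_check_fs_free_space() with the same test: the feature is set on the upgraded mount but was absent from the saved pre-upgrade feature mask. A minimal sketch of that comparison follows. The FEAT_* bit values here are hypothetical and chosen only for illustration; they are not the XFS_FEAT_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits for illustration; not the XFS_FEAT_* values. */
#define FEAT_FINOBT	(1ULL << 0)
#define FEAT_REFLINK	(1ULL << 1)
#define FEAT_RMAPBT	(1ULL << 2)

/* True if exactly this bit went from clear (old) to set (new). */
bool
feature_newly_enabled(uint64_t old_features, uint64_t new_features,
		uint64_t bit)
{
	assert(bit != 0);
	return (new_features & bit) != 0 && (old_features & bit) == 0;
}

/* Any newly enabled btree feature triggers the free space check. */
bool
need_free_space_check(uint64_t old_features, uint64_t new_features)
{
	return feature_newly_enabled(old_features, new_features, FEAT_FINOBT) ||
	       feature_newly_enabled(old_features, new_features, FEAT_REFLINK) ||
	       feature_newly_enabled(old_features, new_features, FEAT_RMAPBT);
}
```

A filesystem that already had finobt and gains rmapbt needs the check; one whose feature mask is unchanged does not.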
* [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers
From: Darrick J. Wong @ 2024-12-31 23:54 UTC
To: aalbersh, djwong
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Upgrade an existing filesystem to have parent pointers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++++
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   39 +++++++++++++++++++++++++++++++++++++++
 repair/pptr.c        |   15 ++++++++++++++-
 repair/xfs_repair.c  |   11 +++++++++++
 6 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index 74a400dcfeb557..a25e599e5f8e2c 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -184,6 +184,14 @@ .SH OPTIONS
 The filesystem cannot be downgraded after this feature is enabled.
 This upgrade can fail if any AG has less than 5% free space remaining.
 This feature was added to Linux 4.8.
+.TP 0.4i
+.B parent
+Store in each child file a mirror pointer pointing back to the parent directory.
+This enables much stronger cross-referencing and online repairs of the
+directory tree.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if the filesystem has less than 25% free space remaining.
+This feature is not upstream yet.
 .RE
 .TP
 .BI \-U " uuid"
diff --git a/repair/globals.c b/repair/globals.c
index dd7c422bb922e4..320fcf6cfd701e 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -56,6 +56,7 @@ bool	add_exchrange;		/* add file content exchange support */
 bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
 bool	add_rmapbt;		/* add reverse mapping btrees */
+bool	add_parent;		/* add parent pointers */

 /* misc status variables */

diff --git a/repair/globals.h b/repair/globals.h
index d8c2aae23d8f0a..77d5d110048713 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -97,6 +97,7 @@ extern bool	add_exchrange;		/* add file content exchange support */
 extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
 extern bool	add_rmapbt;		/* add reverse mapping btrees */
+extern bool	add_parent;		/* add parent pointers */

 /* misc status variables */

diff --git a/repair/phase2.c b/repair/phase2.c
index 9dd37e7fc5c111..763cffdfe9d8d2 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -295,6 +295,28 @@ set_rmapbt(
 	return true;
 }

+static bool
+set_parent(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_parent(mp)) {
+		printf(_("Filesystem already supports parent pointers.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Parent pointer feature only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding parent pointers to filesystem.\n"));
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_PARENT;
+	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -435,6 +457,19 @@ check_fs_free_space(
 		libxfs_trans_cancel(tp);
 	}

+	/*
+	 * If we're adding parent pointers, we need at least 25% free since
+	 * scanning the entire filesystem to guesstimate the overhead is
+	 * prohibitively expensive.
+	 */
+	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT)) {
+		if (mp->m_sb.sb_fdblocks < mp->m_sb.sb_dblocks / 4) {
+			printf(
+	_("Filesystem does not have enough space to add parent pointers.\n"));
+			exit(1);
+		}
+	}
+
 	/*
 	 * Would the post-upgrade filesystem have enough free space on the data
 	 * device after making per-AG reservations?
@@ -467,6 +502,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_rmapbt(mp) && !(old->features & XFS_FEAT_RMAPBT))
 		return true;
+	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT))
+		return true;
 	return false;
 }

@@ -550,6 +587,8 @@ upgrade_filesystem(
 		dirty |= set_reflink(mp, &new_sb);
 	if (add_rmapbt)
 		dirty |= set_rmapbt(mp, &new_sb);
+	if (add_parent)
+		dirty |= set_parent(mp, &new_sb);
 	if (!dirty)
 		return;

diff --git a/repair/pptr.c b/repair/pptr.c
index ac0a9c618bc87d..a8156e55f1fdfc 100644
--- a/repair/pptr.c
+++ b/repair/pptr.c
@@ -793,7 +793,7 @@ add_missing_parent_ptr(
 				ag_pptr->namelen,
 				name);
 		return;
-	} else {
+	} else if (!add_parent) {
 		do_warn(
 _("adding missing ino %llu parent pointer (ino %llu gen 0x%x name '%.*s')\n"),
 				(unsigned long long)ip->i_ino,
@@ -801,6 +801,19 @@ add_missing_parent_ptr(
 				ag_pptr->parent_gen,
 				ag_pptr->namelen,
 				name);
+	} else {
+		static bool		warned = false;
+		static pthread_mutex_t	lock = PTHREAD_MUTEX_INITIALIZER;
+
+		if (!warned) {
+			pthread_mutex_lock(&lock);
+			if (!warned) {
+				do_warn(
+ _("setting parent pointers to upgrade filesystem\n"));
+				warned = true;
+			}
+			pthread_mutex_unlock(&lock);
+		}
 	}

 	error = add_file_pptr(ip, ag_pptr, name);
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index ca72c65f9d772a..189665a07d6892 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -74,6 +74,7 @@ enum c_opt_nums {
 	CONVERT_FINOBT,
 	CONVERT_REFLINK,
 	CONVERT_RMAPBT,
+	CONVERT_PARENT,
 	C_MAX_OPTS,
 };

@@ -86,6 +87,7 @@ static char *c_opts[] = {
 	[CONVERT_FINOBT]	= "finobt",
 	[CONVERT_REFLINK]	= "reflink",
 	[CONVERT_RMAPBT]	= "rmapbt",
+	[CONVERT_PARENT]	= "parent",
 	[C_MAX_OPTS]		= NULL,
 };

@@ -405,6 +407,15 @@ process_args(int argc, char **argv)
 		_("-c rmapbt only supports upgrades\n"));
 					add_rmapbt = true;
 					break;
+				case CONVERT_PARENT:
+					if (!val)
+						do_abort(
+		_("-c parent requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c parent only supports upgrades\n"));
+					add_parent = true;
+					break;
 				default:
 					unknown('c', val);
 					break;
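The parent pointer patch refuses the upgrade when the free block count is below a quarter of the data device, using integer division. A small sketch of that threshold is below; the function name is hypothetical, and the two arguments stand in for sb_dblocks and sb_fdblocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirror of the check in the patch: the upgrade is refused when
 * fdblocks < dblocks / 4, so it is allowed when fdblocks >= dblocks / 4.
 * The division truncates, exactly as it does in the patch.
 */
bool
enough_space_for_parent_pointers(uint64_t dblocks, uint64_t fdblocks)
{
	/* Free blocks cannot exceed the size of the data device. */
	assert(fdblocks <= dblocks);
	return fdblocks >= dblocks / 4;
}
```

Because of the truncation, a 1003-block device passes with 250 free blocks, slightly under a true 25%. That slack is probably harmless given that the 25% figure is itself described as a guesstimate.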
* [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories
From: Darrick J. Wong @ 2024-12-31 23:54 UTC
To: aalbersh, djwong
Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support metadata directories.  This will be needed to upgrade
filesystems to support realtime rmap and reflink.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_admin.8 |    8 ++++++
 repair/dino_chunks.c |    6 ++++
 repair/dinode.c      |    5 +++-
 repair/globals.c     |    1 +
 repair/globals.h     |    1 +
 repair/phase2.c      |   69 ++++++++++++++++++++++++++++++++++++++++++++++++++
 repair/phase4.c      |    5 +++-
 repair/protos.h      |    6 ++++
 repair/xfs_repair.c  |   11 ++++++++
 9 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/man/man8/xfs_admin.8 b/man/man8/xfs_admin.8
index a25e599e5f8e2c..e55dee6070e460 100644
--- a/man/man8/xfs_admin.8
+++ b/man/man8/xfs_admin.8
@@ -191,6 +191,14 @@ .SH OPTIONS
 directory tree.
 The filesystem cannot be downgraded after this feature is enabled.
 This upgrade can fail if the filesystem has less than 25% free space remaining.
+.TP 0.4i
+.B metadir
+Create a directory tree of metadata inodes instead of storing them all in the
+superblock.
+This is required for reverse mapping btrees and reflink support on the realtime
+device.
+The filesystem cannot be downgraded after this feature is enabled.
+This upgrade can fail if any AG has less than 5% free space remaining.
 This feature is not upstream yet.
 .RE
 .TP
diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
index 250985ec264ead..120c490b1d8324 100644
--- a/repair/dino_chunks.c
+++ b/repair/dino_chunks.c
@@ -955,7 +955,11 @@ process_inode_chunk(
 		}

 		if (status)  {
-			if (mp->m_sb.sb_rootino == ino) {
+			if (wipe_pre_metadir_file(ino)) {
+				if (!ino_discovery)
+					do_warn(
+	_("wiping pre-metadir metadata inode %"PRIu64".\n"), ino);
+			} else if (mp->m_sb.sb_rootino == ino) {
 				need_root_inode = 1;

 				if (!no_modify)  {
diff --git a/repair/dinode.c b/repair/dinode.c
index 0c559c40808588..42c7e9fa5cc5e7 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -3068,6 +3068,9 @@ process_dinode_int(
 	ASSERT(uncertain == 0 || verify_mode != 0);
 	ASSERT(ino_bpp != NULL || verify_mode != 0);

+	if (wipe_pre_metadir_file(lino))
+		goto clear_bad_out;
+
 	/*
 	 * This is the only valid point to check the CRC; after this we may have
 	 * made changes which invalidate it, and the CRC is only updated again
@@ -3278,7 +3281,7 @@ _("bad (negative) size %" PRId64 " on inode %" PRIu64 "\n"),
 	if (flags & XFS_DIFLAG_NEWRTBM) {
 		/* must be a rt bitmap inode */
 		if (lino != mp->m_sb.sb_rbmino) {
-			if (!uncertain) {
+			if (!uncertain && !add_metadir) {
 				do_warn(
 	_("inode %" PRIu64 " not rt bitmap\n"),
 					lino);
diff --git a/repair/globals.c b/repair/globals.c
index 320fcf6cfd701e..603fea73da1654 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -57,6 +57,7 @@ bool	add_finobt;		/* add free inode btrees */
 bool	add_reflink;		/* add reference count btrees */
 bool	add_rmapbt;		/* add reverse mapping btrees */
 bool	add_parent;		/* add parent pointers */
+bool	add_metadir;		/* add metadata directory tree */

 /* misc status variables */

diff --git a/repair/globals.h b/repair/globals.h
index 77d5d110048713..9211e5e2432c9a 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -98,6 +98,7 @@ extern bool	add_finobt;		/* add free inode btrees */
 extern bool	add_reflink;		/* add reference count btrees */
 extern bool	add_rmapbt;		/* add reverse mapping btrees */
 extern bool	add_parent;		/* add parent pointers */
+extern bool	add_metadir;		/* add metadata directory tree */

 /* misc status variables */

diff --git a/repair/phase2.c b/repair/phase2.c
index 763cffdfe9d8d2..35f4c19de0555c 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -317,6 +317,71 @@ set_parent(
 	return true;
 }

+static xfs_ino_t doomed_rbmino = NULLFSINO;
+static xfs_ino_t doomed_rsumino = NULLFSINO;
+static xfs_ino_t doomed_uquotino = NULLFSINO;
+static xfs_ino_t doomed_gquotino = NULLFSINO;
+static xfs_ino_t doomed_pquotino = NULLFSINO;
+
+bool
+wipe_pre_metadir_file(
+	xfs_ino_t	ino)
+{
+	if (ino == doomed_rbmino ||
+	    ino == doomed_rsumino ||
+	    ino == doomed_uquotino ||
+	    ino == doomed_gquotino ||
+	    ino == doomed_pquotino)
+		return true;
+	return false;
+}
+
+static bool
+set_metadir(
+	struct xfs_mount	*mp,
+	struct xfs_sb		*new_sb)
+{
+	if (xfs_has_metadir(mp)) {
+		printf(_("Filesystem already supports metadata directory trees.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_crc(mp)) {
+		printf(
+	_("Metadata directory trees only supported on V5 filesystems.\n"));
+		exit(0);
+	}
+
+	printf(_("Adding metadata directory trees to filesystem.\n"));
+	new_sb->sb_features_incompat |= (XFS_SB_FEAT_INCOMPAT_METADIR |
+					 XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
+
+	/* Blow out all the old metadata inodes; we'll rebuild in phase6. */
+	new_sb->sb_metadirino = new_sb->sb_rootino + 1;
+	doomed_rbmino = mp->m_sb.sb_rbmino;
+	doomed_rsumino = mp->m_sb.sb_rsumino;
+	doomed_uquotino = mp->m_sb.sb_uquotino;
+	doomed_gquotino = mp->m_sb.sb_gquotino;
+	doomed_pquotino = mp->m_sb.sb_pquotino;
+
+	new_sb->sb_rbmino = new_sb->sb_metadirino + 1;
+	new_sb->sb_rsumino = new_sb->sb_rbmino + 1;
+	new_sb->sb_uquotino = NULLFSINO;
+	new_sb->sb_gquotino = NULLFSINO;
+	new_sb->sb_pquotino = NULLFSINO;
+
+	/* Indicate that we need a rebuild. */
+	need_metadir_inode = 1;
+	need_rbmino = 1;
+	need_rsumino = 1;
+	have_uquotino = 0;
+	have_gquotino = 0;
+	have_pquotino = 0;
+	quotacheck_skip();
+
+	return true;
+}
+
 struct check_state {
 	struct xfs_sb		sb;
 	uint64_t		features;
@@ -504,6 +569,8 @@ need_check_fs_free_space(
 		return true;
 	if (xfs_has_parent(mp) && !(old->features & XFS_FEAT_PARENT))
 		return true;
+	if (xfs_has_metadir(mp) && !(old->features & XFS_FEAT_METADIR))
+		return true;
 	return false;
 }

@@ -589,6 +656,8 @@ upgrade_filesystem(
 		dirty |= set_rmapbt(mp, &new_sb);
 	if (add_parent)
 		dirty |= set_parent(mp, &new_sb);
+	if (add_metadir)
+		dirty |= set_metadir(mp, &new_sb);
 	if (!dirty)
 		return;

diff --git a/repair/phase4.c b/repair/phase4.c
index b752b4c871ea83..6d3c7857c6c343 100644
--- a/repair/phase4.c
+++ b/repair/phase4.c
@@ -431,7 +431,10 @@ phase4(xfs_mount_t *mp)
 	if (xfs_has_metadir(mp) &&
 	    (is_inode_free(irec, 1) || !inode_isadir(irec, 1))) {
 		need_metadir_inode = true;
-		if (no_modify)
+		if (add_metadir)
+			do_warn(
+	_("metadata directory root inode needs to be initialized\n"));
+		else if (no_modify)
 			do_warn(
 	_("metadata directory root inode would be lost\n"));
 		else
diff --git a/repair/protos.h b/repair/protos.h
index e2f39f1d6e8aa3..ce171f3dd87cb6 100644
--- a/repair/protos.h
+++ b/repair/protos.h
@@ -3,6 +3,8 @@
  * Copyright (c) 2000-2001,2005 Silicon Graphics, Inc.
  * All Rights Reserved.
  */
+#ifndef __XFS_REPAIR_PROTOS_H__
+#define __XFS_REPAIR_PROTOS_H__

 void	xfs_init(struct libxfs_init *args);

@@ -45,3 +47,7 @@ void	phase7(struct xfs_mount *, int);
 int	verify_set_agheader(struct xfs_mount *, struct xfs_buf *,
 		struct xfs_sb *, struct xfs_agf *, struct xfs_agi *,
 		xfs_agnumber_t);
+
+bool wipe_pre_metadir_file(xfs_ino_t ino);
+
+#endif /* __XFS_REPAIR_PROTOS_H__ */
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 189665a07d6892..d4101f7d2297d7 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -75,6 +75,7 @@ enum c_opt_nums {
 	CONVERT_REFLINK,
 	CONVERT_RMAPBT,
 	CONVERT_PARENT,
+	CONVERT_METADIR,
 	C_MAX_OPTS,
 };

@@ -88,6 +89,7 @@ static char *c_opts[] = {
 	[CONVERT_REFLINK]	= "reflink",
 	[CONVERT_RMAPBT]	= "rmapbt",
 	[CONVERT_PARENT]	= "parent",
+	[CONVERT_METADIR]	= "metadir",
 	[C_MAX_OPTS]		= NULL,
 };

@@ -416,6 +418,15 @@ process_args(int argc, char **argv)
 		_("-c parent only supports upgrades\n"));
 					add_parent = true;
 					break;
+				case CONVERT_METADIR:
+					if (!val)
+						do_abort(
+		_("-c metadir requires a parameter\n"));
+					if (strtol(val, NULL, 0) != 1)
+						do_abort(
+		_("-c metadir only supports upgrades\n"));
+					add_metadir = true;
+					break;
 				default:
 					unknown('c', val);
 					break;
* [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (4 preceding siblings ...)
2024-12-31 23:54 ` [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories Darrick J. Wong
@ 2024-12-31 23:54 ` Darrick J. Wong
2024-12-31 23:55 ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:54 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Amend the metadir upgrade code to initialize the rtgroups-related fields
in the superblock.  This obviously means that we can't add metadir to a
filesystem that already has an rt section.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/phase2.c |   36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/repair/phase2.c b/repair/phase2.c
index 35f4c19de0555c..fa6ea91711557c 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -341,6 +341,9 @@ set_metadir(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*new_sb)
 {
+	struct xfs_rtgroup	*rtg;
+	unsigned int		rgsize;
+
 	if (xfs_has_metadir(mp)) {
 		printf(_("Filesystem already supports metadata directory trees.\n"));
 		exit(0);
@@ -352,6 +355,15 @@ set_metadir(
 		exit(0);
 	}
 
+	if (xfs_has_realtime(mp)) {
+		printf(
+	_("Realtime groups cannot be added to an existing realtime section.\n"));
+		exit(0);
+	}
+
+	if (!xfs_has_exchange_range(mp))
+		set_exchrange(mp, new_sb);
+
 	printf(_("Adding metadata directory trees to filesystem.\n"));
 	new_sb->sb_features_incompat |= (XFS_SB_FEAT_INCOMPAT_METADIR |
 					 XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
@@ -364,21 +376,35 @@ set_metadir(
 	doomed_gquotino = mp->m_sb.sb_gquotino;
 	doomed_pquotino = mp->m_sb.sb_pquotino;
 
-	new_sb->sb_rbmino = new_sb->sb_metadirino + 1;
-	new_sb->sb_rsumino = new_sb->sb_rbmino + 1;
+	new_sb->sb_rbmino = NULLFSINO;
+	new_sb->sb_rsumino = NULLFSINO;
 	new_sb->sb_uquotino = NULLFSINO;
 	new_sb->sb_gquotino = NULLFSINO;
 	new_sb->sb_pquotino = NULLFSINO;
+	rgsize = XFS_B_TO_FSBT(mp, 1ULL << 40); /* 1TB */
+	rgsize -= rgsize % new_sb->sb_rextsize;
+	new_sb->sb_rgextents = rgsize;
+	new_sb->sb_rgcount = 0;
+	new_sb->sb_rgblklog = libxfs_compute_rgblklog(new_sb->sb_rgextents,
+						      new_sb->sb_rextsize);
 
 	/* Indicate that we need a rebuild. */
 	need_metadir_inode = 1;
 	need_rbmino = 1;
 	need_rsumino = 1;
-	have_uquotino = 0;
-	have_gquotino = 0;
-	have_pquotino = 0;
+	clear_quota_inode(XFS_DQTYPE_USER);
+	clear_quota_inode(XFS_DQTYPE_GROUP);
+	clear_quota_inode(XFS_DQTYPE_PROJ);
 	quotacheck_skip();
 
+	/* Dump incore rt freespace inodes. */
+	rtg = libxfs_rtgroup_grab(mp, 0);
+	if (rtg) {
+		libxfs_rtginode_irele(&rtg->rtg_inodes[XFS_RTGI_BITMAP]);
+		libxfs_rtginode_irele(&rtg->rtg_inodes[XFS_RTGI_SUMMARY]);
+		libxfs_rtgroup_rele(rtg);
+	}
+
 	return true;
 }
 
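For readers following the rgsize arithmetic in set_metadir() above: the patch takes 1TB, converts it to filesystem blocks, and rounds the result down to a multiple of the rt extent size. A minimal shell sketch of just that calculation follows. It assumes XFS_B_TO_FSBT() is a truncating right shift by the log2 of the block size; the helper name default_rgsize_blocks is made up for illustration and does not exist in xfsprogs.

```shell
# Hypothetical helper, not part of xfsprogs: mirror the rgsize arithmetic
# from set_metadir().
#   $1 = log2 of the filesystem block size (sb_blocklog)
#   $2 = rt extent size in filesystem blocks (sb_rextsize)
default_rgsize_blocks() {
	blocklog=$1
	rextsize=$2

	# 1TB expressed in filesystem blocks (truncating shift)
	rgsize=$(( (1 << 40) >> blocklog ))

	# round down to a whole number of rt extents
	rgsize=$(( rgsize - rgsize % rextsize ))

	echo "$rgsize"
}

default_rgsize_blocks 12 1	# 4k blocks, single-block rt extents
default_rgsize_blocks 12 3	# 4k blocks, three-block rt extents
```

With 4k blocks the unrounded value is 2^28 blocks; a three-block rt extent size trims one block off the end so that the group holds only whole extents.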
* [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (5 preceding siblings ...)
2024-12-31 23:54 ` [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir Darrick J. Wong
@ 2024-12-31 23:55 ` Darrick J. Wong
2024-12-31 23:55 ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes.  This
is needed for online fsck.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/libxfs_api_defs.h |    1 +
 repair/phase2.c          |   64 ++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 76f55515bb41f7..2502a7736d1670 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -78,6 +78,7 @@
 #define xfs_btree_bload			libxfs_btree_bload
 #define xfs_btree_bload_compute_geometry	libxfs_btree_bload_compute_geometry
 #define xfs_btree_calc_size		libxfs_btree_calc_size
+#define xfs_btree_compute_maxlevels	libxfs_btree_compute_maxlevels
 #define xfs_btree_decrement		libxfs_btree_decrement
 #define xfs_btree_del_cursor		libxfs_btree_del_cursor
 #define xfs_btree_delete		libxfs_btree_delete
diff --git a/repair/phase2.c b/repair/phase2.c
index fa6ea91711557c..b1288bf3dd90cd 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -277,9 +277,8 @@ set_rmapbt(
 		exit(0);
 	}
 
-	if (xfs_has_realtime(mp)) {
-		printf(
-	_("Reverse mapping btree feature not supported with realtime.\n"));
+	if (xfs_has_realtime(mp) && !xfs_has_rtgroups(mp)) {
+		printf(_("Reverse mapping btree requires realtime groups.\n"));
 		exit(0);
 	}
 
@@ -292,6 +291,7 @@ set_rmapbt(
 	printf(_("Adding reverse mapping btrees to filesystem.\n"));
 	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_RMAPBT;
 	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+
 	return true;
 }
 
@@ -466,6 +466,37 @@ check_free_space(
 	return avail > GIGABYTES(10, mp->m_sb.sb_blocklog);
 }
 
+/*
+ * Reserve space to handle rt rmap btree expansion.
+ *
+ * If the rmap inode for this group already exists, we assume that we're adding
+ * some other feature.  Note that we have not validated the metadata directory
+ * tree, so we must perform the lookup by hand and abort the upgrade if there
+ * are errors.  Otherwise, the amount of space needed to handle a new maximally
+ * sized rmap btree is added to @new_resv.
+ */
+static int
+reserve_rtrmap_inode(
+	struct xfs_rtgroup	*rtg,
+	xfs_rfsblock_t		*new_resv)
+{
+	struct xfs_mount	*mp = rtg_mount(rtg);
+	struct xfs_inode	*ip = rtg_rmap(rtg);
+	xfs_filblks_t		ask;
+
+	if (!xfs_has_rtrmapbt(mp))
+		return 0;
+
+	ask = libxfs_rtrmapbt_calc_reserves(mp);
+
+	/* failed to load the rtdir inode? */
+	if (!ip) {
+		*new_resv += ask;
+		return 0;
+	}
+	return -libxfs_metafile_resv_init(ip, ask);
+}
+
 static void
 check_fs_free_space(
 	struct xfs_mount		*mp,
@@ -473,6 +504,8 @@ check_fs_free_space(
 	struct xfs_sb			*new_sb)
 {
 	struct xfs_perag	*pag = NULL;
+	struct xfs_rtgroup	*rtg = NULL;
+	xfs_rfsblock_t		new_resv = 0;
 	int			error;
 
 	/* Make sure we have enough space for per-AG reservations. */
@@ -548,6 +581,21 @@ check_fs_free_space(
 		libxfs_trans_cancel(tp);
 	}
 
+	/* Realtime metadata btree inodes */
+	while ((rtg = xfs_rtgroup_next(mp, rtg))) {
+		error = reserve_rtrmap_inode(rtg, &new_resv);
+		if (error == ENOSPC) {
+			printf(
+_("Not enough free space would remain for rtgroup %u rmap inode.\n"),
+					rtg_rgno(rtg));
+			exit(0);
+		}
+		if (error)
+			do_error(
+_("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
+					error, rtg_rgno(rtg));
+	}
+
 	/*
 	 * If we're adding parent pointers, we need at least 25% free since
 	 * scanning the entire filesystem to guesstimate the overhead is
@@ -563,13 +611,19 @@ check_fs_free_space(
 
 	/*
 	 * Would the post-upgrade filesystem have enough free space on the data
-	 * device after making per-AG reservations?
+	 * device after making per-AG reservations and reserving rt metadata
+	 * inode blocks?
 	 */
-	if (!check_free_space(mp, mp->m_sb.sb_fdblocks, mp->m_sb.sb_dblocks)) {
+	if (new_resv > mp->m_sb.sb_fdblocks ||
+	    !check_free_space(mp, mp->m_sb.sb_fdblocks, mp->m_sb.sb_dblocks)) {
 		printf(_("Filesystem will be low on space after upgrade.\n"));
 		exit(1);
 	}
 
+	/* Unreserve the realtime metadata reservations. */
+	while ((rtg = xfs_rtgroup_next(mp, rtg)))
+		libxfs_metafile_resv_free(rtg_rmap(rtg));
+
 	/*
 	 * Release the per-AG reservations and mark the per-AG structure as
 	 * uninitialized so that we don't trip over stale cached counters
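The new_resv bookkeeping in this patch can be hard to see through the diff noise: each rtgroup whose rmap inode could not be loaded adds one maximal btree reservation to a running total, and the upgrade is refused when that total exceeds the free data blocks. A rough shell sketch of only that part follows. It is an illustration under stated assumptions, not the xfs_repair logic: the helper name rt_resv_fits and its argument convention are invented here, the per-group "ask" is treated as a constant, and both the libxfs_metafile_resv_init() path for existing inodes and the separate check_free_space() test are left out.

```shell
# Hypothetical sketch of the new_resv accumulation in check_fs_free_space().
#   $1 = free data blocks (stand-in for sb_fdblocks)
#   $2 = blocks wanted per group (stand-in for the "ask" value)
#   remaining args = one word per rtgroup: "missing" if the rmap inode could
#   not be loaded, anything else if it already exists
rt_resv_fits() {
	fdblocks=$1
	ask=$2
	shift 2

	new_resv=0
	for state in "$@"; do
		# only groups without an inode add to the running total
		if [ "$state" = "missing" ]; then
			new_resv=$(( new_resv + ask ))
		fi
	done

	# mirrors the "new_resv > mp->m_sb.sb_fdblocks" refusal
	[ "$new_resv" -le "$fdblocks" ]
}

if rt_resv_fits 1000 300 missing missing present; then
	echo "reservations fit"
fi
```

Summing first and comparing once at the end means a single oversized group and many small ones are treated the same way, which matches how the patch defers the decision until after the rtgroup loop.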
* [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (6 preceding siblings ...)
2024-12-31 23:55 ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong
@ 2024-12-31 23:55 ` Darrick J. Wong
2024-12-31 23:55 ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong
2024-12-31 23:55 ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong
9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/phase2.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)

diff --git a/repair/phase2.c b/repair/phase2.c
index b1288bf3dd90cd..8dc936b572196e 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -250,14 +250,15 @@ set_reflink(
 		exit(0);
 	}
 
-	if (xfs_has_realtime(mp)) {
-		printf(_("Reflink feature not supported with realtime.\n"));
+	if (xfs_has_realtime(mp) && !xfs_has_rtgroups(mp)) {
+		printf(_("Reference count btree requires realtime groups.\n"));
 		exit(0);
 	}
 
 	printf(_("Adding reflink support to filesystem.\n"));
 	new_sb->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_REFLINK;
 	new_sb->sb_features_incompat |= XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR;
+
 	return true;
 }
 
@@ -497,6 +498,38 @@ reserve_rtrmap_inode(
 	return -libxfs_metafile_resv_init(ip, ask);
 }
 
+/*
+ * Reserve space to handle rt refcount btree expansion.
+ *
+ * If the refcount inode for this group already exists, we assume that we're
+ * adding some other feature.  Note that we have not validated the metadata
+ * directory tree, so we must perform the lookup by hand and abort the upgrade
+ * if there are errors.  If the inode does not exist, the amount of space
+ * needed to handle a new maximally sized refcount btree is added to @new_resv.
+ */
+static int
+reserve_rtrefcount_inode(
+	struct xfs_rtgroup	*rtg,
+	xfs_rfsblock_t		*new_resv)
+{
+	struct xfs_mount	*mp = rtg_mount(rtg);
+	struct xfs_inode	*ip = rtg_refcount(rtg);
+	xfs_filblks_t		ask;
+
+	if (!xfs_has_rtreflink(mp))
+		return 0;
+
+	ask = libxfs_rtrefcountbt_calc_reserves(mp);
+
+	/* failed to load the rtdir inode? */
+	if (!ip) {
+		*new_resv += ask;
+		return 0;
+	}
+
+	return -libxfs_metafile_resv_init(ip, ask);
+}
+
 static void
 check_fs_free_space(
 	struct xfs_mount		*mp,
@@ -594,6 +627,18 @@ _("Not enough free space would remain for rtgroup %u rmap inode.\n"),
 			do_error(
 _("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
 					error, rtg_rgno(rtg));
+
+		error = reserve_rtrefcount_inode(rtg, &new_resv);
+		if (error == ENOSPC) {
+			printf(
+_("Not enough free space would remain for rtgroup %u refcount inode.\n"),
+					rtg_rgno(rtg));
+			exit(0);
+		}
+		if (error)
+			do_error(
+_("Error %d while checking rtgroup %u refcount inode space reservation.\n"),
+					error, rtg_rgno(rtg));
 	}
 
 	/*
@@ -621,8 +666,10 @@ _("Error %d while checking rtgroup %u rmap inode space reservation.\n"),
 	}
 
 	/* Unreserve the realtime metadata reservations. */
-	while ((rtg = xfs_rtgroup_next(mp, rtg)))
+	while ((rtg = xfs_rtgroup_next(mp, rtg))) {
 		libxfs_metafile_resv_free(rtg_rmap(rtg));
+		libxfs_metafile_resv_free(rtg_refcount(rtg));
+	}
 
 	/*
 	 * Release the per-AG reservations and mark the per-AG structure as
* [PATCH 09/10] xfs_repair: skip free space checks when upgrading
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (7 preceding siblings ...)
2024-12-31 23:55 ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong
@ 2024-12-31 23:55 ` Darrick J. Wong
2024-12-31 23:55 ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong
9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a debug knob to disable the free space checks when upgrading a
filesystem.  This is extremely risky and will cause severe tire damage!!!

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/globals.c    |    1 +
 repair/globals.h    |    1 +
 repair/phase2.c     |    2 ++
 repair/xfs_repair.c |   11 +++++++++++
 4 files changed, 15 insertions(+)

diff --git a/repair/globals.c b/repair/globals.c
index 603fea73da1654..fe9f9ac5914bb0 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -48,6 +48,7 @@ char	*rt_name;		/* Name of realtime device */
 int	rt_spec;		/* Realtime dev specified as option */
 int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 int	lazy_count;		/* What to set if to if converting */
+bool	skip_freesp_check_on_upgrade; /* do not enable */
 bool	features_changed;	/* did we change superblock feature bits? */
 bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/globals.h b/repair/globals.h
index 9211e5e2432c9a..c660971080f7e4 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -89,6 +89,7 @@ extern char	*rt_name;		/* Name of realtime device */
 extern int	rt_spec;		/* Realtime dev specified as option */
 extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 extern int	lazy_count;		/* What to set if to if converting */
+extern bool	skip_freesp_check_on_upgrade; /* do not enable */
 extern bool	features_changed;	/* did we change superblock feature bits? */
 extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/phase2.c b/repair/phase2.c
index 8dc936b572196e..780294d24c9900 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -688,6 +688,8 @@ need_check_fs_free_space(
 	struct xfs_mount	*mp,
 	const struct check_state *old)
 {
+	if (skip_freesp_check_on_upgrade)
+		return false;
 	if (xfs_has_finobt(mp) && !(old->features & XFS_FEAT_FINOBT))
 		return true;
 	if (xfs_has_reflink(mp) && !(old->features & XFS_FEAT_REFLINK))
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index d4101f7d2297d7..55e417201b34f7 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -46,6 +46,7 @@ enum o_opt_nums {
 	BLOAD_LEAF_SLACK,
 	BLOAD_NODE_SLACK,
 	NOQUOTA,
+	SKIP_FREESP_CHECK,
 	O_MAX_OPTS,
 };
 
@@ -59,6 +60,7 @@ static char *o_opts[] = {
 	[BLOAD_LEAF_SLACK]	= "debug_bload_leaf_slack",
 	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
 	[NOQUOTA]		= "noquota",
+	[SKIP_FREESP_CHECK]	= "debug_skip_freesp_check_on_upgrade",
 	[O_MAX_OPTS]		= NULL,
 };
 
@@ -323,6 +325,15 @@ process_args(int argc, char **argv)
 				case NOQUOTA:
 					quotacheck_skip();
 					break;
+				case SKIP_FREESP_CHECK:
+					if (!val)
+						do_abort(
+		_("-o debug_skip_freesp_check_on_upgrade requires a parameter\n"));
+					skip_freesp_check_on_upgrade = (int)strtol(val, NULL, 0);
+					if (skip_freesp_check_on_upgrade)
+						do_log(
+		_("WARNING: Allowing filesystem upgrades to proceed without free space check.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
+					break;
 				default:
 					unknown('o', val);
 					break;
* [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
` (8 preceding siblings ...)
2024-12-31 23:55 ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong
@ 2024-12-31 23:55 ` Darrick J. Wong
9 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:55 UTC (permalink / raw)
To: aalbersh, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a debugging knob so that I can upgrade a filesystem to have rmap
btrees even if reflink was already enabled.  We cannot easily precompute
the space requirements, so this is dangerous.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 repair/globals.c    |    1 +
 repair/globals.h    |    1 +
 repair/phase2.c     |    3 ++-
 repair/xfs_repair.c |   11 +++++++++++
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/repair/globals.c b/repair/globals.c
index fe9f9ac5914bb0..f4f1d317917183 100644
--- a/repair/globals.c
+++ b/repair/globals.c
@@ -49,6 +49,7 @@ int	rt_spec;		/* Realtime dev specified as option */
 int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 int	lazy_count;		/* What to set if to if converting */
 bool	skip_freesp_check_on_upgrade; /* do not enable */
+bool	allow_rmapbt_upgrade_with_reflink; /* add rmapbt when reflink already on */
 bool	features_changed;	/* did we change superblock feature bits? */
 bool	add_inobtcount;		/* add inode btree counts to AGI */
 bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/globals.h b/repair/globals.h
index c660971080f7e4..febbbbcc81f931 100644
--- a/repair/globals.h
+++ b/repair/globals.h
@@ -90,6 +90,7 @@ extern int	rt_spec;		/* Realtime dev specified as option */
 extern int	convert_lazy_count;	/* Convert lazy-count mode on/off */
 extern int	lazy_count;		/* What to set if to if converting */
 extern bool	skip_freesp_check_on_upgrade; /* do not enable */
+extern bool	allow_rmapbt_upgrade_with_reflink; /* add rmapbt when reflink already on */
 extern bool	features_changed;	/* did we change superblock feature bits? */
 extern bool	add_inobtcount;		/* add inode btree counts to AGI */
 extern bool	add_bigtime;		/* add support for timestamps up to 2486 */
diff --git a/repair/phase2.c b/repair/phase2.c
index 780294d24c9900..29a406f69ca3a1 100644
--- a/repair/phase2.c
+++ b/repair/phase2.c
@@ -283,7 +283,8 @@ set_rmapbt(
 		exit(0);
 	}
 
-	if (xfs_has_reflink(mp) && !add_reflink) {
+	if (xfs_has_reflink(mp) && !add_reflink &&
+	    !allow_rmapbt_upgrade_with_reflink) {
 		printf(
 	_("Reverse mapping btrees cannot be added when reflink is enabled.\n"));
 		exit(0);
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 55e417201b34f7..4cff11d81d6bcb 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -47,6 +47,7 @@ enum o_opt_nums {
 	BLOAD_NODE_SLACK,
 	NOQUOTA,
 	SKIP_FREESP_CHECK,
+	ALLOW_RMAPBT_UPGRADE_WITH_REFLINK,
 	O_MAX_OPTS,
 };
 
@@ -61,6 +62,7 @@ static char *o_opts[] = {
 	[BLOAD_NODE_SLACK]	= "debug_bload_node_slack",
 	[NOQUOTA]		= "noquota",
 	[SKIP_FREESP_CHECK]	= "debug_skip_freesp_check_on_upgrade",
+	[ALLOW_RMAPBT_UPGRADE_WITH_REFLINK] = "debug_allow_rmapbt_upgrade_with_reflink",
 	[O_MAX_OPTS]		= NULL,
 };
 
@@ -334,6 +336,15 @@ process_args(int argc, char **argv)
 						do_log(
 		_("WARNING: Allowing filesystem upgrades to proceed without free space check.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
 					break;
+				case ALLOW_RMAPBT_UPGRADE_WITH_REFLINK:
+					if (!val)
+						do_abort(
+		_("-o debug_allow_rmapbt_upgrade_with_reflink requires a parameter\n"));
+					allow_rmapbt_upgrade_with_reflink = (int)strtol(val, NULL, 0);
+					if (allow_rmapbt_upgrade_with_reflink)
+						do_log(
+		_("WARNING: Allowing filesystem upgrade to rmapbt when reflink enabled.  THIS MAY DESTROY YOUR FILESYSTEM!!!\n"));
+					break;
 				default:
 					unknown('o', val);
 					break;
* [PATCHSET 1/5] fstests: functional test for refcount reporting
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (9 preceding siblings ...)
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
@ 2024-12-31 23:34 ` Darrick J. Wong
2024-12-31 23:56 ` [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl Darrick J. Wong
2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
` (4 subsequent siblings)
15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:34 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

Add a short functional test for the new GETFSREFCOUNTS ioctl that allows
userspace to query reference count information for a given range of
physical blocks.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=report-refcounts

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=report-refcounts

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=report-refcounts
---
Commits in this patchset:
 * xfs: test output of new FSREFCOUNTS ioctl
---
 common/rc           |    4 +
 doc/group-names.txt |    1 
 tests/xfs/1921      |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1921.out  |    4 +
 4 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/1921
 create mode 100644 tests/xfs/1921.out
* [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl
2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
@ 2024-12-31 23:56 ` Darrick J. Wong
0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure the cursors work properly and that refcounts are correct.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc           |    4 +
 doc/group-names.txt |    1 
 tests/xfs/1921      |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1921.out  |    4 +
 4 files changed, 171 insertions(+), 2 deletions(-)
 create mode 100755 tests/xfs/1921
 create mode 100644 tests/xfs/1921.out

diff --git a/common/rc b/common/rc
index e04ca50e3140c0..c45a226849ce0f 100644
--- a/common/rc
+++ b/common/rc
@@ -2811,8 +2811,8 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Operation not supported" && \
 			_notrun "O_TMPFILE is not supported"
 		;;
-	"fsmap")
-		testio=`$XFS_IO_PROG -f -c "fsmap" $testfile 2>&1`
+	"fsmap"|"fsrefcounts")
+		testio=`$XFS_IO_PROG -f -c "$command" $testfile 2>&1`
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
diff --git a/doc/group-names.txt b/doc/group-names.txt
index ed886caac058c3..b04d0180e8ec02 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -58,6 +58,7 @@ fsck			general fsck tests
 fsmap			FS_IOC_GETFSMAP ioctl
 fsproperties		Filesystem properties
 fsr			XFS free space reorganizer
+fsrefcounts		FS_IOC_GETFSREFCOUNTS ioctl
 fuzzers			filesystem fuzz tests
 growfs			increasing the size of a filesystem
 hardlink		hardlinks
diff --git a/tests/xfs/1921 b/tests/xfs/1921
new file mode 100755
index 00000000000000..2d0af845767ed2
--- /dev/null
+++ b/tests/xfs/1921
@@ -0,0 +1,164 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2021-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1921
+#
+# Populate filesystem, check that fsrefcounts -n 65536 matches fsrefcounts -n 1,
+# then verify that the refcount information is consistent with the fsmap info.
+#
+. ./common/preamble
+_begin_fstest auto clone fsrefcounts fsmap
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $TEST_DIR/a $TEST_DIR/b
+}
+
+. ./common/filter
+
+_require_scratch
+_require_xfs_io_command "fsmap"
+_require_xfs_io_command "fsrefcounts"
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+cpus=$(( $(src/feature -o) * 4))
+
+# Use fsstress to create a directory tree with some variability
+FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 4000 $FSSTRESS_AVOID)
+$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
+
+_scratch_cycle_mount	# flush all the background gc
+
+echo "Compare fsrefcounts" | tee -a $seqres.full
+$XFS_IO_PROG -c 'fsrefcounts -m -n 65536' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/a
+$XFS_IO_PROG -c 'fsrefcounts -m -n 1' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/b
+cat $TEST_DIR/a $TEST_DIR/b >> $seqres.full
+
+diff -uw $TEST_DIR/a $TEST_DIR/b
+
+echo "Compare fsrefcounts to fsmap" | tee -a $seqres.full
+$XFS_IO_PROG -c 'fsmap -m -n 65536' $SCRATCH_MNT | grep -v 'EXT:' > $TEST_DIR/b
+cat $TEST_DIR/b >> $seqres.full
+
+while IFS=',' read ext major minor pstart pend owners length crap; do
+	test "$ext" = "EXT" && continue
+
+	awk_args=(-'F' ',' '-v' "major=$major" '-v' "minor=$minor" \
+		  '-v' "pstart=$pstart" '-v' "pend=$pend" '-v' "owners=$owners")
+
+	if [ "$owners" -eq 1 ]; then
+		$AWK_PROG "${awk_args[@]}" \
+'
+BEGIN {
+	printf("Q:%s:%s:%s:%s:%s:\n", major, minor, pstart, pend, owners) > "/dev/stderr";
+	next_map = -1;
+}
+{
+	if ($2 != major || $3 != minor) {
+		next;
+	}
+	if ($5 <= pstart) {
+		next;
+	}
+
+	printf(" A:%s:%s:%s:%s\n", $2, $3, $4, $5) > "/dev/stderr";
+	if (next_map < 0) {
+		if ($4 > pstart) {
+			exit 1
+		}
+		next_map = $5 + 1;
+	} else {
+		if ($4 != next_map) {
+			exit 1
+		}
+		next_map = $5 + 1;
+	}
+	if (next_map >= pend) {
+		nextfile;
+	}
+}
+END {
+	exit 0;
+}
+' $TEST_DIR/b 2> $tmp.debug
+		res=$?
+	else
+		$AWK_PROG "${awk_args[@]}" \
+'
+function max(a, b) {
+	return a > b ? a : b;
+}
+function min(a, b) {
+	return a < b ? a : b;
+}
+BEGIN {
+	printf("Q:%s:%s:%s:%s:%s:\n", major, minor, pstart, pend, owners) > "/dev/stderr";
+	refcount_whole = 0;
+	aborted = 0;
+}
+{
+	if ($2 != major || $3 != minor) {
+		next;
+	}
+	if ($4 > pend) {
+		nextfile;
+	}
+	if ($5 < pstart) {
+		next;
+	}
+	if ($6 == "special_0:2") {
+		# unknown owner means we cannot distinguish separate owners
+		aborted = 1;
+		exit 0;
+	}
+
+	printf(" A:%s:%s:%s:%s -> %d\n", $2, $3, $4, $5, refcount_whole) > "/dev/stderr";
+	if ($4 <= pstart && $5 >= pend) {
+		# Account for extents that span the whole range
+		refcount_whole++;
+	} else {
+		# Otherwise track refcounts per-block as we find them
+		for (block = max($4, pstart); block <= min($5, pend); block++) {
+			refcounts[block]++;
+		}
+	}
+}
+END {
+	if (aborted) {
+		exit 0;
+	}
+	deficit = owners - refcount_whole;
+	printf(" W:%d:%d:%d\n", owners, refcount_whole, deficit) > "/dev/stderr";
+	if (deficit == 0) {
+		exit 0;
+	}
+
+	refcount_slivers = deficit;
+	for (block in refcounts) {
+		printf(" X:%s:%d\n", block, refcounts[block]) > "/dev/stderr";
+		if (refcounts[block] != deficit) {
+			refcount_slivers = 0;
+		}
+	}
+
+	refcount_whole += refcount_slivers;
+	exit owners == refcount_whole ? 0 : 1;
+}
+' $TEST_DIR/b 2> $tmp.debug
+		res=$?
+	fi
+	if [ $res -ne 0 ]; then
+		echo "$major,$minor,$pstart,$pend,$owners not found in fsmap"
+		cat $tmp.debug >> $seqres.full
+		break
+	fi
+done < $TEST_DIR/a
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1921.out b/tests/xfs/1921.out
new file mode 100644
index 00000000000000..f5ea660379bbdd
--- /dev/null
+++ b/tests/xfs/1921.out
@@ -0,0 +1,4 @@
+QA output created by 1921
+Format and mount
+Compare fsrefcounts
+Compare fsrefcounts to fsmap
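The single-owner branch of the test above boils down to a contiguity walk: skip mappings that end before the queried range, require the first overlapping mapping to start at or before pstart, and require each later mapping to begin exactly one block past the previous one. A stripped-down sketch of that idea follows. It is a simplified variant rather than the test's exact logic: it reads bare "start,end" records instead of xfs_io fsmap output, it also fails when the records run out before pend, and it carries the failure in a variable so that the END rule cannot overwrite the exit status. The helper name covers_range is invented for illustration.

```shell
# Hypothetical helper: succeed if the "start,end" records on stdin (sorted by
# start, inclusive block numbers) cover [pstart, pend] without a gap.
#   $1 = pstart, $2 = pend
covers_range() {
	awk -F ',' -v pstart="$1" -v pend="$2" '
	BEGIN { next_map = -1; bad = 0; finished = 0 }
	# ignore records once the range is covered, or that end before it
	finished || $2 < pstart { next }
	{
		if (next_map < 0) {
			# first overlapping record must reach back to pstart
			if ($1 > pstart) { bad = 1; exit }
		} else if ($1 != next_map) {
			# later records must abut the previous one
			bad = 1; exit
		}
		next_map = $2 + 1
		if (next_map > pend) finished = 1
	}
	END {
		# also fail if the records ran out before pend
		if (!bad && next_map <= pend) bad = 1
		exit bad
	}'
}

printf '0,9\n10,19\n20,29\n' | covers_range 5 25 && echo "range is covered"
```

Tracking the verdict in `bad` and calling `exit bad` from END is a deliberate choice here: in awk, an `exit` with an explicit status inside END replaces whatever status an earlier `exit` supplied.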
* [PATCHSET 2/5] fstests: defragment free space
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (10 preceding siblings ...)
2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
2024-12-31 23:56 ` [PATCH 1/1] xfs: test clearing of " Darrick J. Wong
2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
` (3 subsequent siblings)
15 siblings, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

These patches contain experimental code to enable userspace to defragment
the free space in a filesystem.  Two purposes are imagined for this
functionality: clearing space at the end of a filesystem before
shrinking it, and clearing free space in anticipation of making a large
allocation.

The first patch adds a new fallocate mode that allows userspace to
allocate free space from the filesystem into a file.  The goal here is
to allow the filesystem shrink process to prevent allocation from a
certain part of the filesystem while a free space defragmenter evacuates
all the files from the doomed part of the filesystem.

The second patch amends the online repair system to allow the sysadmin
to forcibly rebuild metadata structures, even if they're not corrupt.
Without adding an ioctl to move metadata btree blocks, this is the only
way to dislodge metadata.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=defrag-freespace
---
Commits in this patchset:
 * xfs: test clearing of free space
---
 common/rc          |    5 ++++
 tests/xfs/1400     |   52 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1400.out |    2 +
 tests/xfs/1401     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1401.out |    2 +
 5 files changed, 131 insertions(+)
 create mode 100755 tests/xfs/1400
 create mode 100644 tests/xfs/1400.out
 create mode 100755 tests/xfs/1401
 create mode 100644 tests/xfs/1401.out
* [PATCH 1/1] xfs: test clearing of free space
2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
@ 2024-12-31 23:56 ` Darrick J. Wong
0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Simple regression test for the spaceman clearfree command, which tries
to free all the used space in some part of the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc          |    5 ++++
 tests/xfs/1400     |   52 +++++++++++++++++++++++++++++++++++++++
 tests/xfs/1400.out |    2 +
 tests/xfs/1401     |   70 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1401.out |    2 +
 5 files changed, 131 insertions(+)
 create mode 100755 tests/xfs/1400
 create mode 100644 tests/xfs/1400.out
 create mode 100755 tests/xfs/1401
 create mode 100644 tests/xfs/1401.out

diff --git a/common/rc b/common/rc
index c45a226849ce0f..d7dfb55bbbd7e1 100644
--- a/common/rc
+++ b/common/rc
@@ -2786,6 +2786,11 @@ _require_xfs_io_command()
 			-c "fsync" -c "$command $blocksize $((2 * $blocksize))" \
 			$testfile 2>&1`
 		;;
+	"fmapfree")
+		local blocksize=$(_get_file_block_size $TEST_DIR)
+		testio=`$XFS_IO_PROG -F -f -c "$command $blocksize $((2 * $blocksize))" \
+			$testfile 2>&1`
+		;;
 	"fiemap")
 		# If 'ranged' is passed as argument then we check to see if fiemap supports
 		# ranged query params
diff --git a/tests/xfs/1400 b/tests/xfs/1400
new file mode 100755
index 00000000000000..ec3f7aec2a318a
--- /dev/null
+++ b/tests/xfs/1400
@@ -0,0 +1,52 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 1400
+#
+# Basic functionality testing for FALLOC_FL_MAP_FREE
+#
+. ./common/preamble
+_begin_fstest auto prealloc
+
+. ./common/filter
+
+_require_scratch
+_require_xfs_io_command "fmapfree"
+
+_scratch_mkfs | _filter_mkfs 2> $tmp.mkfs > /dev/null
+_scratch_mount >> $seqres.full
+. $tmp.mkfs
+
+testfile="$SCRATCH_MNT/$seq.txt"
+touch $testfile
+if $XFS_IO_PROG -c 'stat -v' $testfile | grep -q 'realtime'; then
+	# realtime
+	increment=$((dbsize * rtblocks / 100))
+	length=$((dbsize * rtblocks))
+else
+	# data
+	increment=$((dbsize * dblocks / 100))
+	length=$((dbsize * dblocks))
+fi
+
+free_bytes=$(stat -f -c '%f * %S' $testfile | bc)
+
+echo "free space: $free_bytes; increment: $increment; length: $length" >> $seqres.full
+
+# Map all the free space on that device, 1% at a time
+for ((start = 0; start < length; start += increment)); do
+	$XFS_IO_PROG -f -c "fmapfree $start $increment" $testfile
+done
+
+space_used=$(stat -c '%b * %B' $testfile | bc)
+
+echo "space captured: $space_used" >> $seqres.full
+$FILEFRAG_PROG -v $testfile >> $seqres.full
+
+# Did we get within 10% of the free space?
+_within_tolerance "mapfree space used" $space_used $free_bytes 10% -v
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1400.out b/tests/xfs/1400.out
new file mode 100644
index 00000000000000..601404d7a46856
--- /dev/null
+++ b/tests/xfs/1400.out
@@ -0,0 +1,2 @@
+QA output created by 1400
+mapfree space used is in range
diff --git a/tests/xfs/1401 b/tests/xfs/1401
new file mode 100755
index 00000000000000..14675abd8ff985
--- /dev/null
+++ b/tests/xfs/1401
@@ -0,0 +1,70 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2022-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test No. 1401
+#
+# Basic functionality testing for the free space defragmenter.
+#
+. ./common/preamble
+_begin_fstest auto defrag shrinkfs
+
+. ./common/filter
+
+_notrun "XXX test is not ready yet; you need to deal with eof blocks"
+_notrun "XXX clearfree cannot move unwritten extents; does fiexchange work for this?"
+_notrun "XXX csp_buffercopy never returns if we hit eof"
+
+_require_scratch
+_require_xfs_spaceman_command "clearfree"
+
+_scratch_mkfs | _filter_mkfs 2> $tmp.mkfs > /dev/null
+cat $tmp.mkfs >> $seqres.full
+. $tmp.mkfs
+_scratch_mount >> $seqres.full
+
+cpus=$(( $(src/feature -o) * 4))
+
+# Use fsstress to create a directory tree with some variability
+FSSTRESS_ARGS=$(_scale_fsstress_args -p 4 -d $SCRATCH_MNT -n 4000 $FSSTRESS_AVOID)
+$FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full
+
+$XFS_IO_PROG -c 'stat -v' $SCRATCH_MNT >> $seqres.full
+
+if $XFS_IO_PROG -c 'stat -v' $SCRATCH_MNT | grep -q 'rt-inherit'; then
+	# realtime
+	increment=$((dbsize * rtblocks / agcount))
+	length=$((dbsize * rtblocks))
+	fsmap_devarg="-r"
+else
+	# data
+	increment=$((dbsize * agsize))
+	length=$((dbsize * dblocks))
+	fsmap_devarg="-d"
+fi
+
+echo "increment: $increment; length: $length" >> $seqres.full
+$DF_PROG $SCRATCH_MNT >> $seqres.full
+
+TRACE_PROG="strace -s99 -e fallocate,ioctl,openat -o $tmp.strace"
+
+for ((start = 0; start < length; start += increment)); do
+	echo "---------------------------" >> $seqres.full
+	echo "start: $start end: $((start + increment))" >> $seqres.full
+	echo "---------------------------" >> $seqres.full
+
+	fsmap_args="-vvvv $fsmap_devarg $((start / 512)) $((increment / 512))"
+	clearfree_args="-v all $start $increment"
+
+	$XFS_IO_PROG -c "fsmap $fsmap_args" $SCRATCH_MNT > $tmp.before
+	$TRACE_PROG $XFS_SPACEMAN_PROG -c "clearfree $clearfree_args" $SCRATCH_MNT &>> $seqres.full || break
+	cat $tmp.strace >> $seqres.full
+	$XFS_IO_PROG -c "fsmap $fsmap_args" $SCRATCH_MNT > $tmp.after
+	cat $tmp.before >> $seqres.full
+	cat $tmp.after >> $seqres.full
+done
+
+# success, all done
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1401.out b/tests/xfs/1401.out
new file mode 100644
index 00000000000000..504999381ea9a8
--- /dev/null
+++ b/tests/xfs/1401.out
@@ -0,0 +1,2 @@
+QA output created by 1401
+Silence is golden
* [PATCHSET 3/5] fstests: capture logs from mount failures
  2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
                   ` (11 preceding siblings ...)
  2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong
  2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

Whenever a mount fails, we should capture the kernel logs for the last
few seconds before the failure.  If the test fails, retain the log
contents for further analysis.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=capture-mount-failures
---
Commits in this patchset:
 * treewide: convert all $MOUNT_PROG to _mount
 * check: capture dmesg of mount failures if test fails
---
 check                  |   22 +++++++++++++++++++++-
 common/btrfs           |    4 ++--
 common/dmdelay         |    2 +-
 common/dmerror         |    2 +-
 common/dmlogwrites     |    2 +-
 common/overlay         |    6 +++---
 common/rc              |   26 +++++++++++++++++++++++++-
 common/report          |    8 ++++++++
 tests/btrfs/075        |    2 +-
 tests/btrfs/208        |    2 +-
 tests/ext4/032         |    2 +-
 tests/generic/067      |    6 +++---
 tests/generic/085      |    2 +-
 tests/generic/361      |    2 +-
 tests/generic/373      |    2 +-
 tests/generic/374      |    2 +-
 tests/generic/409      |    6 +++---
 tests/generic/410      |    8 ++++----
 tests/generic/411      |    8 ++++----
 tests/generic/589      |    8 ++++----
 tests/overlay/005      |    4 ++--
 tests/overlay/025      |    2 +-
 tests/overlay/035      |    2 +-
 tests/overlay/062      |    2 +-
 tests/overlay/083      |    6 +++---
 tests/overlay/086      |   12 ++++++------
 tests/selftest/008     |   20 ++++++++++++++++++++
 tests/selftest/008.out |    1 +
 tests/xfs/078          |    2 +-
 tests/xfs/149          |    4 ++--
 tests/xfs/289          |    4 ++--
 tests/xfs/544          |    2 +-
 32 files changed, 128 insertions(+), 55 deletions(-)
 create mode 100755 tests/selftest/008
 create mode 100644 tests/selftest/008.out

^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount 2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong @ 2024-12-31 23:56 ` Darrick J. Wong 2024-12-31 23:56 ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong 1 sibling, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw) To: zlang, djwong; +Cc: fstests, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Going to add some new log scraping functionality when mount failures occur, so we need everyone to use _mount instead of $MOUNT_PROG. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- common/btrfs | 4 ++-- common/dmdelay | 2 +- common/dmerror | 2 +- common/dmlogwrites | 2 +- common/overlay | 6 +++--- tests/btrfs/075 | 2 +- tests/btrfs/208 | 2 +- tests/ext4/032 | 2 +- tests/generic/067 | 6 +++--- tests/generic/085 | 2 +- tests/generic/361 | 2 +- tests/generic/373 | 2 +- tests/generic/374 | 2 +- tests/generic/409 | 6 +++--- tests/generic/410 | 8 ++++---- tests/generic/411 | 8 ++++---- tests/generic/589 | 8 ++++---- tests/overlay/005 | 4 ++-- tests/overlay/025 | 2 +- tests/overlay/035 | 2 +- tests/overlay/062 | 2 +- tests/overlay/083 | 6 +++--- tests/overlay/086 | 12 ++++++------ tests/xfs/078 | 2 +- tests/xfs/149 | 4 ++-- tests/xfs/289 | 4 ++-- tests/xfs/544 | 2 +- 27 files changed, 53 insertions(+), 53 deletions(-) diff --git a/common/btrfs b/common/btrfs index 95a9c8e6c7f448..64f38cc240ab8b 100644 --- a/common/btrfs +++ b/common/btrfs @@ -351,7 +351,7 @@ _btrfs_stress_subvolume() mkdir -p $subvol_mnt while [ ! 
-e $stop_file ]; do $BTRFS_UTIL_PROG subvolume create $btrfs_mnt/$subvol_name - $MOUNT_PROG -o subvol=$subvol_name $btrfs_dev $subvol_mnt + _mount -o subvol=$subvol_name $btrfs_dev $subvol_mnt $UMOUNT_PROG $subvol_mnt $BTRFS_UTIL_PROG subvolume delete $btrfs_mnt/$subvol_name done @@ -437,7 +437,7 @@ _btrfs_stress_remount_compress() local btrfs_mnt=$1 while true; do for algo in no zlib lzo; do - $MOUNT_PROG -o remount,compress=$algo $btrfs_mnt + _mount -o remount,compress=$algo $btrfs_mnt done done } diff --git a/common/dmdelay b/common/dmdelay index 66cac1a70c14c8..794ea37ba200ce 100644 --- a/common/dmdelay +++ b/common/dmdelay @@ -20,7 +20,7 @@ _init_delay() _mount_delay() { _scratch_options mount - $MOUNT_PROG -t $FSTYP `_common_dev_mount_options` $SCRATCH_OPTIONS \ + _mount -t $FSTYP `_common_dev_mount_options` $SCRATCH_OPTIONS \ $DELAY_DEV $SCRATCH_MNT } diff --git a/common/dmerror b/common/dmerror index 3494b6dd3b9479..2f006142a309fe 100644 --- a/common/dmerror +++ b/common/dmerror @@ -91,7 +91,7 @@ _dmerror_init() _dmerror_mount() { _scratch_options mount - $MOUNT_PROG -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \ + _mount -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \ $DMERROR_DEV $SCRATCH_MNT } diff --git a/common/dmlogwrites b/common/dmlogwrites index 7a8a9078cb8b65..c054acb875a384 100644 --- a/common/dmlogwrites +++ b/common/dmlogwrites @@ -139,7 +139,7 @@ _log_writes_mkfs() _log_writes_mount() { _scratch_options mount - $MOUNT_PROG -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \ + _mount -t $FSTYP `_common_dev_mount_options $*` $SCRATCH_OPTIONS \ $LOGWRITES_DMDEV $SCRATCH_MNT } diff --git a/common/overlay b/common/overlay index faa9339a6477f7..da1d8d2c3183f4 100644 --- a/common/overlay +++ b/common/overlay @@ -29,13 +29,13 @@ _overlay_mount_dirs() [ -n "$upperdir" ] && [ "$upperdir" != "-" ] && \ diropts+=",upperdir=$upperdir,workdir=$workdir" - $MOUNT_PROG -t overlay $diropts `_common_dev_mount_options $*` + _mount 
-t overlay $diropts `_common_dev_mount_options $*` } # Mount with mnt/dev of scratch mount and custom mount options _overlay_scratch_mount_opts() { - $MOUNT_PROG -t overlay $OVL_BASE_SCRATCH_MNT $SCRATCH_MNT $* + _mount -t overlay $OVL_BASE_SCRATCH_MNT $SCRATCH_MNT $* } # Mount with same options/mnt/dev of scratch mount, but optionally @@ -127,7 +127,7 @@ _overlay_base_scratch_mount() _overlay_scratch_mount() { if echo "$*" | grep -q remount; then - $MOUNT_PROG $SCRATCH_MNT $* + _mount $SCRATCH_MNT $* return fi diff --git a/tests/btrfs/075 b/tests/btrfs/075 index 917993ca2da3a6..737c4ffdd57865 100755 --- a/tests/btrfs/075 +++ b/tests/btrfs/075 @@ -37,7 +37,7 @@ _scratch_mount subvol_mnt=$TEST_DIR/$seq.mnt mkdir -p $subvol_mnt $BTRFS_UTIL_PROG subvolume create $SCRATCH_MNT/subvol >>$seqres.full 2>&1 -$MOUNT_PROG -o subvol=subvol $SELINUX_MOUNT_OPTIONS $SCRATCH_DEV $subvol_mnt +_mount -o subvol=subvol $SELINUX_MOUNT_OPTIONS $SCRATCH_DEV $subvol_mnt status=$? exit diff --git a/tests/btrfs/208 b/tests/btrfs/208 index 5ea732ae8f71a7..93a999541dab06 100755 --- a/tests/btrfs/208 +++ b/tests/btrfs/208 @@ -45,7 +45,7 @@ _scratch_unmount # Now we mount the subvol2, which makes subvol3 not accessible for this mount # point, but we should be able to delete it using it's subvolume id -$MOUNT_PROG -o subvol=subvol2 $SCRATCH_DEV $SCRATCH_MNT +_mount -o subvol=subvol2 $SCRATCH_DEV $SCRATCH_MNT _delete_and_list subvol3 "Last remaining subvolume:" _scratch_unmount diff --git a/tests/ext4/032 b/tests/ext4/032 index 238ab178363c12..9a1b9312cc42cc 100755 --- a/tests/ext4/032 +++ b/tests/ext4/032 @@ -48,7 +48,7 @@ ext4_online_resize() $seqres.full 2>&1 || _fail "mkfs failed" echo "+++ mount image file" | tee -a $seqres.full - $MOUNT_PROG -t ${FSTYP} ${LOOP_DEVICE} ${IMG_MNT} > \ + _mount -t ${FSTYP} ${LOOP_DEVICE} ${IMG_MNT} > \ /dev/null 2>&1 || _fail "mount failed" echo "+++ resize fs to $final_size" | tee -a $seqres.full diff --git a/tests/generic/067 b/tests/generic/067 index 
b561b7bc5946a2..b6e984f5231753 100755 --- a/tests/generic/067 +++ b/tests/generic/067 @@ -34,7 +34,7 @@ mount_nonexistent_mnt() { echo "# mount to nonexistent mount point" >>$seqres.full rm -rf $TEST_DIR/nosuchdir - $MOUNT_PROG $SCRATCH_DEV $TEST_DIR/nosuchdir >>$seqres.full 2>&1 + _mount $SCRATCH_DEV $TEST_DIR/nosuchdir >>$seqres.full 2>&1 } # fs driver should be able to handle mounting a free loop device gracefully @@ -43,7 +43,7 @@ mount_free_loopdev() { echo "# mount a free loop device" >>$seqres.full loopdev=`losetup -f` - $MOUNT_PROG -t $FSTYP $loopdev $SCRATCH_MNT >>$seqres.full 2>&1 + _mount -t $FSTYP $loopdev $SCRATCH_MNT >>$seqres.full 2>&1 } # mount with wrong fs type specified. @@ -55,7 +55,7 @@ mount_wrong_fstype() fs=xfs fi echo "# mount with wrong fs type" >>$seqres.full - $MOUNT_PROG -t $fs $SCRATCH_DEV $SCRATCH_MNT >>$seqres.full 2>&1 + _mount -t $fs $SCRATCH_DEV $SCRATCH_MNT >>$seqres.full 2>&1 } # umount a symlink to device, which is not mounted. diff --git a/tests/generic/085 b/tests/generic/085 index cfe6112d6b444d..cbabd257cad8f0 100755 --- a/tests/generic/085 +++ b/tests/generic/085 @@ -69,7 +69,7 @@ for ((i=0; i<100; i++)); do done & pid=$! for ((i=0; i<100; i++)); do - $MOUNT_PROG $lvdev $SCRATCH_MNT >/dev/null 2>&1 + _mount $lvdev $SCRATCH_MNT >/dev/null 2>&1 $UMOUNT_PROG $lvdev >/dev/null 2>&1 done & pid="$pid $!" 
diff --git a/tests/generic/361 b/tests/generic/361 index c56157391d3209..c2ebda3c1a01ad 100755 --- a/tests/generic/361 +++ b/tests/generic/361 @@ -52,7 +52,7 @@ fi $XFS_IO_PROG -fc "pwrite 0 520m" $fs_mnt/testfile >>$seqres.full 2>&1 # remount should not hang -$MOUNT_PROG -o remount,ro $fs_mnt >>$seqres.full 2>&1 +_mount -o remount,ro $fs_mnt >>$seqres.full 2>&1 # success, all done echo "Silence is golden" diff --git a/tests/generic/373 b/tests/generic/373 index 3bd46963a76686..0d5a50cbee40b8 100755 --- a/tests/generic/373 +++ b/tests/generic/373 @@ -42,7 +42,7 @@ blksz=65536 sz=$((blksz * blocks)) echo "Mount otherdir" -$MOUNT_PROG --bind $SCRATCH_MNT $otherdir +_mount --bind $SCRATCH_MNT $otherdir echo "Create file" _pwrite_byte 0x61 0 $sz $testdir/file >> $seqres.full diff --git a/tests/generic/374 b/tests/generic/374 index acb23d17289784..977a2b268bbc98 100755 --- a/tests/generic/374 +++ b/tests/generic/374 @@ -41,7 +41,7 @@ blksz=65536 sz=$((blocks * blksz)) echo "Mount otherdir" -$MOUNT_PROG --bind $SCRATCH_MNT $otherdir +_mount --bind $SCRATCH_MNT $otherdir echo "Create file" _pwrite_byte 0x61 0 $sz $testdir/file >> $seqres.full diff --git a/tests/generic/409 b/tests/generic/409 index b7edc2ac664461..79468e2b0ddb41 100755 --- a/tests/generic/409 +++ b/tests/generic/409 @@ -87,7 +87,7 @@ start_test() _scratch_mkfs >$seqres.full 2>&1 _get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD - $MOUNT_PROG --make-"${type}" $MNTHEAD + _mount --make-"${type}" $MNTHEAD mkdir $mpA $mpB $mpC $mpD } @@ -107,9 +107,9 @@ bind_run() echo "bind $source on $dest" _get_mount -t $FSTYP $SCRATCH_DEV $mpA mkdir -p $mpA/dir 2>/dev/null - $MOUNT_PROG --make-shared $mpA + _mount --make-shared $mpA _get_mount --bind $mpA $mpB - $MOUNT_PROG --make-"$source" $mpB + _mount --make-"$source" $mpB # maybe unbindable at here _get_mount --bind $mpB $mpC 2>/dev/null if [ $? 
-ne 0 ]; then diff --git a/tests/generic/410 b/tests/generic/410 index 902f27144285e4..db8c97dbac7701 100755 --- a/tests/generic/410 +++ b/tests/generic/410 @@ -93,7 +93,7 @@ start_test() _scratch_mkfs >>$seqres.full 2>&1 _get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD - $MOUNT_PROG --make-"${type}" $MNTHEAD + _mount --make-"${type}" $MNTHEAD mkdir $mpA $mpB $mpC } @@ -117,14 +117,14 @@ run() echo "make-$cmd a $orgs mount" _get_mount -t $FSTYP $SCRATCH_DEV $mpA mkdir -p $mpA/dir 2>/dev/null - $MOUNT_PROG --make-shared $mpA + _mount --make-shared $mpA # prepare the original status on mpB _get_mount --bind $mpA $mpB # shared&slave status need to do make-slave then make-shared # two operations. for t in $orgs; do - $MOUNT_PROG --make-"$t" $mpB + _mount --make-"$t" $mpB done # "before" for prepare and check original status @@ -145,7 +145,7 @@ run() _put_mount # umount C fi if [ "$i" = "before" ];then - $MOUNT_PROG --make-"${cmd}" $mpB + _mount --make-"${cmd}" $mpB fi done diff --git a/tests/generic/411 b/tests/generic/411 index c35436c82e988e..09a813f5d3028e 100755 --- a/tests/generic/411 +++ b/tests/generic/411 @@ -76,7 +76,7 @@ start_test() _scratch_mkfs >$seqres.full 2>&1 _get_mount -t $FSTYP $SCRATCH_DEV $MNTHEAD - $MOUNT_PROG --make-"${type}" $MNTHEAD + _mount --make-"${type}" $MNTHEAD mkdir $mpA $mpB $mpC } @@ -99,11 +99,11 @@ crash_test() _get_mount -t $FSTYP $SCRATCH_DEV $mpA mkdir $mpA/mnt1 - $MOUNT_PROG --make-shared $mpA + _mount --make-shared $mpA _get_mount --bind $mpA $mpB _get_mount --bind $mpA $mpC - $MOUNT_PROG --make-slave $mpB - $MOUNT_PROG --make-slave $mpC + _mount --make-slave $mpB + _mount --make-slave $mpC _get_mount -t $FSTYP $SCRATCH_DEV $mpA/mnt1 mkdir $mpA/mnt1/mnt2 diff --git a/tests/generic/589 b/tests/generic/589 index 0ce16556a05df9..6f69abd17ab01e 100755 --- a/tests/generic/589 +++ b/tests/generic/589 @@ -80,12 +80,12 @@ start_test() _get_mount -t $FSTYP $SCRATCH_DEV $SRCHEAD # make sure $SRCHEAD is private - $MOUNT_PROG --make-private 
$SRCHEAD + _mount --make-private $SRCHEAD _get_mount -t $FSTYP $SCRATCH_DEV $DSTHEAD # test start with a bind, then make-shared $DSTHEAD _get_mount --bind $DSTHEAD $DSTHEAD - $MOUNT_PROG --make-"${type}" $DSTHEAD + _mount --make-"${type}" $DSTHEAD mkdir $mpA $mpB $mpC $mpD } @@ -105,10 +105,10 @@ move_run() echo "move $source to $dest" _get_mount -t $FSTYP $SCRATCH_DEV $mpA mkdir -p $mpA/dir 2>/dev/null - $MOUNT_PROG --make-shared $mpA + _mount --make-shared $mpA # need a peer for slave later _get_mount --bind $mpA $mpB - $MOUNT_PROG --make-"$source" $mpB + _mount --make-"$source" $mpB # maybe unbindable at here _get_mount --move $mpB $mpC 2>/dev/null if [ $? -ne 0 ]; then diff --git a/tests/overlay/005 b/tests/overlay/005 index 4c11d5e1b6f701..01914ee17b9a30 100755 --- a/tests/overlay/005 +++ b/tests/overlay/005 @@ -50,8 +50,8 @@ $MKFS_XFS_PROG -f -n ftype=1 $upper_loop_dev >>$seqres.full 2>&1 # mount underlying xfs mkdir -p ${OVL_BASE_SCRATCH_MNT}/lowermnt mkdir -p ${OVL_BASE_SCRATCH_MNT}/uppermnt -$MOUNT_PROG $fs_loop_dev ${OVL_BASE_SCRATCH_MNT}/lowermnt -$MOUNT_PROG $upper_loop_dev ${OVL_BASE_SCRATCH_MNT}/uppermnt +_mount $fs_loop_dev ${OVL_BASE_SCRATCH_MNT}/lowermnt +_mount $upper_loop_dev ${OVL_BASE_SCRATCH_MNT}/uppermnt # prepare dirs mkdir -p ${OVL_BASE_SCRATCH_MNT}/lowermnt/lower diff --git a/tests/overlay/025 b/tests/overlay/025 index dc819a39348b69..6ba46191b557be 100755 --- a/tests/overlay/025 +++ b/tests/overlay/025 @@ -36,7 +36,7 @@ _require_extra_fs tmpfs # create a tmpfs in $TEST_DIR tmpfsdir=$TEST_DIR/tmpfs mkdir -p $tmpfsdir -$MOUNT_PROG -t tmpfs tmpfs $tmpfsdir +_mount -t tmpfs tmpfs $tmpfsdir mkdir -p $tmpfsdir/{lower,upper,work,mnt} mkdir -p -m 0 $tmpfsdir/upper/testd diff --git a/tests/overlay/035 b/tests/overlay/035 index 0b3257c4cce09e..cede58790e1b9d 100755 --- a/tests/overlay/035 +++ b/tests/overlay/035 @@ -42,7 +42,7 @@ mkdir -p $lowerdir1 $lowerdir2 $upperdir $workdir # Verify that overlay is mounted read-only and that it cannot be 
remounted rw. _overlay_scratch_mount_opts -o"lowerdir=$lowerdir2:$lowerdir1" touch $SCRATCH_MNT/foo 2>&1 | _filter_scratch -$MOUNT_PROG -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount +_mount -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount $UMOUNT_PROG $SCRATCH_MNT # Make workdir immutable to prevent workdir re-create on mount diff --git a/tests/overlay/062 b/tests/overlay/062 index e44628b7459bfb..9a1db7419c4ca2 100755 --- a/tests/overlay/062 +++ b/tests/overlay/062 @@ -60,7 +60,7 @@ lowertestdir=$lower2/testdir create_test_files $lowertestdir # bind mount to pin lower test dir dentry to dcache -$MOUNT_PROG --bind $lowertestdir $lowertestdir +_mount --bind $lowertestdir $lowertestdir # For non-upper overlay mount, nfs_export requires disabling redirect_dir. _overlay_scratch_mount_opts \ diff --git a/tests/overlay/083 b/tests/overlay/083 index d037d4c858e6a6..56e02f8cc77d73 100755 --- a/tests/overlay/083 +++ b/tests/overlay/083 @@ -40,14 +40,14 @@ mkdir -p "$lowerdir_spaces" "$lowerdir_colons" "$lowerdir_commas" # _overlay_mount_* helpers do not handle special chars well, so execute mount directly. # if escaped colons are not parsed correctly, mount will fail. -$MOUNT_PROG -t overlay ovl_esc_test $SCRATCH_MNT \ +_mount -t overlay ovl_esc_test $SCRATCH_MNT \ -o"upperdir=$upperdir,workdir=$workdir" \ -o"lowerdir=$lowerdir_colons_esc:$lowerdir_spaces" \ 2>&1 | tee -a $seqres.full # if spaces are not escaped when showing mount options, # mount command will not show the word 'spaces' after the spaces -$MOUNT_PROG -t overlay | grep ovl_esc_test | tee -a $seqres.full | grep -v spaces && \ +_mount -t overlay | grep ovl_esc_test | tee -a $seqres.full | grep -v spaces && \ echo "ERROR: escaped spaces truncated from lowerdir mount option" # Re-create the upper/work dirs to mount them with a different lower @@ -65,7 +65,7 @@ mkdir -p "$upperdir" "$workdir" # and this test will fail, but the failure would indicate a libmount issue, not # a kernel issue. 
Therefore, force libmount to use mount(2) syscall, so we only # test the kernel fix. -LIBMOUNT_FORCE_MOUNT2=always $MOUNT_PROG -t overlay $OVL_BASE_SCRATCH_DEV $SCRATCH_MNT \ +LIBMOUNT_FORCE_MOUNT2=always _mount -t overlay $OVL_BASE_SCRATCH_DEV $SCRATCH_MNT \ -o"upperdir=$upperdir,workdir=$workdir,lowerdir=$lowerdir_commas_esc" 2>> $seqres.full || \ echo "ERROR: incorrect parsing of escaped comma in lowerdir mount option" diff --git a/tests/overlay/086 b/tests/overlay/086 index 9c8a00588595f6..23c56d074ff34a 100755 --- a/tests/overlay/086 +++ b/tests/overlay/086 @@ -33,21 +33,21 @@ mkdir -p "$lowerdir_spaces" "$lowerdir_colons" # _overlay_mount_* helpers do not handle lowerdir+,datadir+, so execute mount directly. # check illegal combinations and order of lowerdir,lowerdir+,datadir+ -$MOUNT_PROG -t overlay none $SCRATCH_MNT \ +_mount -t overlay none $SCRATCH_MNT \ -o"lowerdir=$lowerdir,lowerdir+=$lowerdir_colons" \ 2>> $seqres.full && \ echo "ERROR: invalid combination of lowerdir and lowerdir+ mount options" $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null -$MOUNT_PROG -t overlay none $SCRATCH_MNT \ +_mount -t overlay none $SCRATCH_MNT \ -o"lowerdir=$lowerdir,datadir+=$lowerdir_colons" \ -o redirect_dir=follow,metacopy=on 2>> $seqres.full && \ echo "ERROR: invalid combination of lowerdir and datadir+ mount options" $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null -$MOUNT_PROG -t overlay none $SCRATCH_MNT \ +_mount -t overlay none $SCRATCH_MNT \ -o"datadir+=$lowerdir,lowerdir+=$lowerdir_colons" \ -o redirect_dir=follow,metacopy=on 2>> $seqres.full && \ echo "ERROR: invalid order of lowerdir+ and datadir+ mount options" @@ -55,7 +55,7 @@ $MOUNT_PROG -t overlay none $SCRATCH_MNT \ $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null # mount is expected to fail with escaped colons. 
-$MOUNT_PROG -t overlay none $SCRATCH_MNT \ +_mount -t overlay none $SCRATCH_MNT \ -o"lowerdir+=$lowerdir_colons_esc" \ 2>> $seqres.full && \ echo "ERROR: incorrect parsing of escaped colons in lowerdir+ mount option" @@ -63,14 +63,14 @@ $MOUNT_PROG -t overlay none $SCRATCH_MNT \ $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null # mount is expected to succeed without escaped colons. -$MOUNT_PROG -t overlay ovl_esc_test $SCRATCH_MNT \ +_mount -t overlay ovl_esc_test $SCRATCH_MNT \ -o"lowerdir+=$lowerdir_colons,datadir+=$lowerdir_spaces" \ -o redirect_dir=follow,metacopy=on \ 2>&1 | tee -a $seqres.full # if spaces are not escaped when showing mount options, # mount command will not show the word 'spaces' after the spaces -$MOUNT_PROG -t overlay | grep ovl_esc_test | tee -a $seqres.full | \ +_mount -t overlay | grep ovl_esc_test | tee -a $seqres.full | \ grep -q 'datadir+'.*spaces || \ echo "ERROR: escaped spaces truncated from datadir+ mount option" diff --git a/tests/xfs/078 b/tests/xfs/078 index 834c99a0020153..4224fd40bc9fea 100755 --- a/tests/xfs/078 +++ b/tests/xfs/078 @@ -75,7 +75,7 @@ _grow_loop() $XFS_IO_PROG -c "pwrite $new_size $bsize" $LOOP_IMG | _filter_io LOOP_DEV=`_create_loop_device $LOOP_IMG` echo "*** mount loop filesystem" - $MOUNT_PROG -t xfs $LOOP_DEV $LOOP_MNT + _mount -t xfs $LOOP_DEV $LOOP_MNT echo "*** grow loop filesystem" $XFS_GROWFS_PROG $LOOP_MNT 2>&1 | _filter_growfs 2>&1 diff --git a/tests/xfs/149 b/tests/xfs/149 index f1b2405e7bff11..bbaf86132dff37 100755 --- a/tests/xfs/149 +++ b/tests/xfs/149 @@ -64,7 +64,7 @@ $XFS_GROWFS_PROG $loop_symlink 2>&1 | sed -e s:$loop_symlink:LOOPSYMLINK: # These mounted operations should pass echo "=== mount ===" -$MOUNT_PROG $loop_dev $mntdir || _fail "!!! failed to loopback mount" +_mount $loop_dev $mntdir || _fail "!!! failed to loopback mount" echo "=== xfs_growfs - check device node ===" $XFS_GROWFS_PROG -D 8192 $loop_dev > /dev/null @@ -76,7 +76,7 @@ echo "=== unmount ===" $UMOUNT_PROG $mntdir || _fail "!!! 
failed to unmount" echo "=== mount device symlink ===" -$MOUNT_PROG $loop_symlink $mntdir || _fail "!!! failed to loopback mount" +_mount $loop_symlink $mntdir || _fail "!!! failed to loopback mount" echo "=== xfs_growfs - check device symlink ===" $XFS_GROWFS_PROG -D 16384 $loop_symlink > /dev/null diff --git a/tests/xfs/289 b/tests/xfs/289 index cf0f2883c4f373..089a3f8cc14a68 100755 --- a/tests/xfs/289 +++ b/tests/xfs/289 @@ -56,7 +56,7 @@ echo "=== xfs_growfs - plain file - should be rejected ===" $XFS_GROWFS_PROG $tmpfile 2>&1 | _filter_test_dir echo "=== mount ===" -$MOUNT_PROG -o loop $tmpfile $tmpdir || _fail "!!! failed to loopback mount" +_mount -o loop $tmpfile $tmpdir || _fail "!!! failed to loopback mount" echo "=== xfs_growfs - mounted - check absolute path ===" $XFS_GROWFS_PROG -D 8192 $tmpdir | _filter_test_dir > /dev/null @@ -79,7 +79,7 @@ $XFS_GROWFS_PROG -D 28672 tmpsymlink.$$ > /dev/null echo "=== xfs_growfs - bind mount ===" mkdir $tmpbind -$MOUNT_PROG -o bind $tmpdir $tmpbind +_mount -o bind $tmpdir $tmpbind $XFS_GROWFS_PROG -D 32768 $tmpbind | _filter_test_dir > /dev/null echo "=== xfs_growfs - bind mount - relative path ===" diff --git a/tests/xfs/544 b/tests/xfs/544 index bd694453d5409f..a3a23c1726ca1c 100755 --- a/tests/xfs/544 +++ b/tests/xfs/544 @@ -35,7 +35,7 @@ mkdir $TEST_DIR/dest.$seq # Test echo "*** dump with bind-mounted test ***" >> $seqres.full -$MOUNT_PROG --bind $TEST_DIR/src.$seq $TEST_DIR/dest.$seq || _fail "Bind mount failed" +_mount --bind $TEST_DIR/src.$seq $TEST_DIR/dest.$seq || _fail "Bind mount failed" $XFSDUMP_PROG -L session -M test -f $tmp.dump $TEST_DIR/dest.$seq \ >> $seqres.full 2>&1 && echo "dump with bind-mounted should be failed, but passed." ^ permalink raw reply related [flat|nested] 110+ messages in thread
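The conversion in this patch is mechanical: each direct `$MOUNT_PROG` invocation becomes a call to the `_mount` wrapper. A rough sketch of that rewrite as a stream filter is shown here. The helper name `convert_mount_calls` is hypothetical and not part of fstests; the sketch only handles the plain `$MOUNT_PROG` spelling (not `${MOUNT_PROG}`), and it should not be run over `common/rc`, where `_mount` itself has to keep calling `$MOUNT_PROG`.

```shell
#!/bin/bash
# Hypothetical helper (not part of fstests): rewrite direct $MOUNT_PROG
# invocations read from stdin into calls to the _mount wrapper.
convert_mount_calls()
{
	# \$ matches a literal dollar sign; the rest of the line is untouched.
	sed -e 's/\$MOUNT_PROG/_mount/g'
}
```

For example, feeding it the line `$MOUNT_PROG -o remount,ro $fs_mnt` yields `_mount -o remount,ro $fs_mnt`, which matches the tests/generic/361 hunk above.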
* [PATCH 2/2] check: capture dmesg of mount failures if test fails
  2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
  2024-12-31 23:56   ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong
@ 2024-12-31 23:56   ` Darrick J. Wong
  2025-01-06 11:18     ` Nirjhar Roy
  1 sibling, 1 reply; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:56 UTC (permalink / raw)
  To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Capture the kernel output after a mount failure occurs.  If the test
itself fails, then keep the logging output for further diagnosis.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 check                  |   22 +++++++++++++++++++++-
 common/rc              |   26 +++++++++++++++++++++++++-
 common/report          |    8 ++++++++
 tests/selftest/008     |   20 ++++++++++++++++++++
 tests/selftest/008.out |    1 +
 5 files changed, 75 insertions(+), 2 deletions(-)
 create mode 100755 tests/selftest/008
 create mode 100644 tests/selftest/008.out


diff --git a/check b/check
index 9222cd7e4f8197..a46ea1a54d78bb 100755
--- a/check
+++ b/check
@@ -614,7 +614,7 @@ _stash_fail_loop_files() {
 	local seq_prefix="${REPORT_DIR}/${1}"
 	local cp_suffix="$2"
 
-	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints"; do
+	for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints" ".mountfail"; do
 		rm -f "${seq_prefix}${i}${cp_suffix}"
 		if [ -f "${seq_prefix}${i}" ]; then
 			cp "${seq_prefix}${i}" "${seq_prefix}${i}${cp_suffix}"
@@ -994,6 +994,7 @@ function run_section()
 			echo -n " $seqnum -- "
 			cat $seqres.notrun
 			tc_status="notrun"
+			rm -f "$seqres.mountfail?"
 			_stash_test_status "$seqnum" "$tc_status"
 
 			# Unmount the scratch fs so that we can wipe the scratch
@@ -1053,6 +1054,7 @@ function run_section()
 		if [ ! -f $seq.out ]; then
 			_dump_err "no qualified output"
 			tc_status="fail"
+			rm -f "$seqres.mountfail?"
 			_stash_test_status "$seqnum" "$tc_status"
 			continue;
 		fi
@@ -1089,6 +1091,24 @@ function run_section()
 				rm -f $seqres.hints
 			fi
 		fi
+
+		if [ -f "$seqres.mountfail?" ]; then
+			if [ "$tc_status" = "fail" ]; then
+				# Let the user know if there were mount
+				# failures on a test that failed because that
+				# could be interesting.
+				mv "$seqres.mountfail?" "$seqres.mountfail"
+				_dump_err "check: possible mount failures (see $seqres.mountfail)"
+				test -f $seqres.mountfail && \
+					maybe_compress_logfile $seqres.mountfail $MAX_MOUNTFAIL_SIZE
+			else
+				# Don't retain mount failure logs for tests
+				# that pass or were skipped because some tests
+				# intentionally drive mount failures.
+				rm -f "$seqres.mountfail?"
+			fi
+		fi
+
 		_stash_test_status "$seqnum" "$tc_status"
 	done
 
diff --git a/common/rc b/common/rc
index d7dfb55bbbd7e1..0ede68eb912440 100644
--- a/common/rc
+++ b/common/rc
@@ -204,9 +204,33 @@ _get_hugepagesize()
 	awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo
 }
 
+# Does dmesg have a --since flag?
+_dmesg_detect_since()
+{
+	if [ -z "$DMESG_HAS_SINCE" ]; then
+		test "$DMESG_HAS_SINCE" = "yes"
+		return
+	elif dmesg --help | grep -q -- --since; then
+		DMESG_HAS_SINCE=yes
+	else
+		DMESG_HAS_SINCE=no
+	fi
+}
+
 _mount()
 {
-    $MOUNT_PROG $*
+	$MOUNT_PROG $*
+	ret=$?
+	if [ "$ret" -ne 0 ]; then
+		echo "\"$MOUNT_PROG $*\" failed at $(date)" >> "$seqres.mountfail?"
+		if _dmesg_detect_since; then
+			dmesg --since '30s ago' >> "$seqres.mountfail?"
+		else
+			dmesg | tail -n 100 >> "$seqres.mountfail?"
+		fi
+	fi
+
+	return $ret
 }
 
 # Call _mount to do mount operation but also save mountpoint to
diff --git a/common/report b/common/report
index 0e91e481f9725a..b57697f76dafb2 100644
--- a/common/report
+++ b/common/report
@@ -199,6 +199,7 @@ _xunit_make_testcase_report()
 		local out_src="${SRC_DIR}/${test_name}.out"
 		local full_file="${REPORT_DIR}/${test_name}.full"
 		local dmesg_file="${REPORT_DIR}/${test_name}.dmesg"
+		local mountfail_file="${REPORT_DIR}/${test_name}.mountfail"
 		local outbad_file="${REPORT_DIR}/${test_name}.out.bad"
 		if [ -z "$_err_msg" ]; then
 			_err_msg="Test $test_name failed, reason unknown"
@@ -225,6 +226,13 @@ _xunit_make_testcase_report()
 			printf ']]>\n' >>$report
 			echo -e "\t\t</system-err>" >> $report
 		fi
+		if [ -z "$quiet" -a -f "$mountfail_file" ]; then
+			echo -e "\t\t<mount-failure>" >> $report
+			printf '<![CDATA[\n' >>$report
+			cat "$mountfail_file" | tr -dc '[:print:][:space:]' | encode_cdata >>$report
+			printf ']]>\n' >>$report
+			echo -e "\t\t</mount-failure>" >> $report
+		fi
 		;;
 	*)
 		echo -e "\t\t<failure message=\"Unknown test_status=$test_status\" type=\"TestFail\"/>" >> $report
diff --git a/tests/selftest/008 b/tests/selftest/008
new file mode 100755
index 00000000000000..db80ffe6f77339
--- /dev/null
+++ b/tests/selftest/008
@@ -0,0 +1,20 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+#
+# FS QA Test 008
+#
+# Test mount failure capture.
+#
+. ./common/preamble
+_begin_fstest selftest
+
+_require_command "$WIPEFS_PROG" wipefs
+_require_scratch
+
+$WIPEFS_PROG -a $SCRATCH_DEV
+_scratch_mount &>> $seqres.full
+
+# success, all done
+status=0
+exit
diff --git a/tests/selftest/008.out b/tests/selftest/008.out
new file mode 100644
index 00000000000000..aaff95f3f48372
--- /dev/null
+++ b/tests/selftest/008.out
@@ -0,0 +1 @@
+QA output created by 008

^ permalink raw reply related [flat|nested] 110+ messages in thread
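Read literally, `_dmesg_detect_since` as posted tests `-z` (empty) before consulting the cached value, so when `DMESG_HAS_SINCE` starts out unset the function returns failure without ever probing `dmesg --help`, and `_mount` then takes the `tail -n 100` path. A minimal sketch of the cached probe with the test the other way around follows. It is an illustration, not the committed helper: the name `dmesg_detect_since` and the `DMESG_CMD` override are assumptions made so that the snippet stands alone.

```shell
#!/bin/bash
# Sketch only: a standalone variant of the cached "--since" probe.
# DMESG_CMD is a hypothetical override so the probe can be pointed at
# something other than the real dmesg binary.
DMESG_CMD="${DMESG_CMD:-dmesg}"
DMESG_HAS_SINCE=""

dmesg_detect_since()
{
	# Probe the help text only once, then cache the answer.
	if [ -z "$DMESG_HAS_SINCE" ]; then
		if $DMESG_CMD --help 2>&1 | grep -q -- --since; then
			DMESG_HAS_SINCE=yes
		else
			DMESG_HAS_SINCE=no
		fi
	fi

	# Report the cached answer through the exit status on every call.
	test "$DMESG_HAS_SINCE" = "yes"
}
```

A caller can then branch on it the same way `_mount` does, for example `if dmesg_detect_since; then dmesg --since '30s ago'; else dmesg | tail -n 100; fi`.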
* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails 2024-12-31 23:56 ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong @ 2025-01-06 11:18 ` Nirjhar Roy 2025-01-06 23:52 ` Darrick J. Wong 0 siblings, 1 reply; 110+ messages in thread From: Nirjhar Roy @ 2025-01-06 11:18 UTC (permalink / raw) To: Darrick J. Wong, zlang; +Cc: fstests, linux-xfs On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Capture the kernel output after a mount failure occurs. If the test > itself fails, then keep the logging output for further diagnosis. > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> > --- > check | 22 +++++++++++++++++++++- > common/rc | 26 +++++++++++++++++++++++++- > common/report | 8 ++++++++ > tests/selftest/008 | 20 ++++++++++++++++++++ > tests/selftest/008.out | 1 + > 5 files changed, 75 insertions(+), 2 deletions(-) > create mode 100755 tests/selftest/008 > create mode 100644 tests/selftest/008.out > > > diff --git a/check b/check > index 9222cd7e4f8197..a46ea1a54d78bb 100755 > --- a/check > +++ b/check > @@ -614,7 +614,7 @@ _stash_fail_loop_files() { > local seq_prefix="${REPORT_DIR}/${1}" > local cp_suffix="$2" > > - for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" > ".hints"; do > + for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints" > ".mountfail"; do > rm -f "${seq_prefix}${i}${cp_suffix}" > if [ -f "${seq_prefix}${i}" ]; then > cp "${seq_prefix}${i}" > "${seq_prefix}${i}${cp_suffix}" > @@ -994,6 +994,7 @@ function run_section() > echo -n " $seqnum -- " > cat $seqres.notrun > tc_status="notrun" > + rm -f "$seqres.mountfail?" > _stash_test_status "$seqnum" "$tc_status" > > # Unmount the scratch fs so that we can wipe > the scratch > @@ -1053,6 +1054,7 @@ function run_section() > if [ ! -f $seq.out ]; then > _dump_err "no qualified output" > tc_status="fail" > + rm -f "$seqres.mountfail?" 
> _stash_test_status "$seqnum" "$tc_status" > continue; > fi > @@ -1089,6 +1091,24 @@ function run_section() > rm -f $seqres.hints > fi > fi > + > + if [ -f "$seqres.mountfail?" ]; then > + if [ "$tc_status" = "fail" ]; then > + # Let the user know if there were mount > + # failures on a test that failed > because that > + # could be interesting. > + mv "$seqres.mountfail?" > "$seqres.mountfail" > + _dump_err "check: possible mount > failures (see $seqres.mountfail)" > + test -f $seqres.mountfail && \ > + maybe_compress_logfile > $seqres.mountfail $MAX_MOUNTFAIL_SIZE > + else > + # Don't retain mount failure logs for > tests > + # that pass or were skipped because > some tests > + # intentionally drive mount failures. > + rm -f "$seqres.mountfail?" > + fi > + fi > + > _stash_test_status "$seqnum" "$tc_status" > done > > diff --git a/common/rc b/common/rc > index d7dfb55bbbd7e1..0ede68eb912440 100644 > --- a/common/rc > +++ b/common/rc > @@ -204,9 +204,33 @@ _get_hugepagesize() > awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo > } > > +# Does dmesg have a --since flag? > +_dmesg_detect_since() > +{ > + if [ -z "$DMESG_HAS_SINCE" ]; then > + test "$DMESG_HAS_SINCE" = "yes" > + return > + elif dmesg --help | grep -q -- --since; then > + DMESG_HAS_SINCE=yes > + else > + DMESG_HAS_SINCE=no > + fi > +} > + > _mount() > { > - $MOUNT_PROG $* > + $MOUNT_PROG $* > + ret=$? > + if [ "$ret" -ne 0 ]; then > + echo "\"$MOUNT_PROG $*\" failed at $(date)" >> > "$seqres.mountfail?" > + if _dmesg_detect_since; then > + dmesg --since '30s ago' >> "$seqres.mountfail?" > + else > + dmesg | tail -n 100 >> "$seqres.mountfail?" Is it possible to grep for a mount failure message in dmesg and then capture the last n lines? Do you think that will be more accurate? Also, do you think it is useful to make this 100 configurable instead of hardcoding? 
> + fi > + fi > + > + return $ret > } > > # Call _mount to do mount operation but also save mountpoint to > diff --git a/common/report b/common/report > index 0e91e481f9725a..b57697f76dafb2 100644 > --- a/common/report > +++ b/common/report > @@ -199,6 +199,7 @@ _xunit_make_testcase_report() > local out_src="${SRC_DIR}/${test_name}.out" > local full_file="${REPORT_DIR}/${test_name}.full" > local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" > + local > mountfail_file="${REPORT_DIR}/${test_name}.mountfail" > local outbad_file="${REPORT_DIR}/${test_name}.out.bad" > if [ -z "$_err_msg" ]; then > _err_msg="Test $test_name failed, reason > unknown" > @@ -225,6 +226,13 @@ _xunit_make_testcase_report() > printf ']]>\n' >>$report > echo -e "\t\t</system-err>" >> $report > fi > + if [ -z "$quiet" -a -f "$mountfail_file" ]; then > + echo -e "\t\t<mount-failure>" >> $report > + printf '<![CDATA[\n' >>$report > + cat "$mountfail_file" | tr -dc > '[:print:][:space:]' | encode_cdata >>$report > + printf ']]>\n' >>$report > + echo -e "\t\t</mount-failure>" >> $report > + fi > ;; > *) > echo -e "\t\t<failure message=\"Unknown > test_status=$test_status\" type=\"TestFail\"/>" >> $report > diff --git a/tests/selftest/008 b/tests/selftest/008 > new file mode 100755 > index 00000000000000..db80ffe6f77339 > --- /dev/null > +++ b/tests/selftest/008 > @@ -0,0 +1,20 @@ > +#! /bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (c) 2024-2025 Oracle. All Rights Reserved. > +# > +# FS QA Test 008 > +# > +# Test mount failure capture. > +# > +. ./common/preamble > +_begin_fstest selftest > + > +_require_command "$WIPEFS_PROG" wipefs > +_require_scratch > + > +$WIPEFS_PROG -a $SCRATCH_DEV > +_scratch_mount &>> $seqres.full Minor: Do you think adding some filtered messages from the captured dmesg logs in the output will be helpful? 
> + > +# success, all done > +status=0 > +exit > diff --git a/tests/selftest/008.out b/tests/selftest/008.out > new file mode 100644 > index 00000000000000..aaff95f3f48372 > --- /dev/null > +++ b/tests/selftest/008.out > @@ -0,0 +1 @@ > +QA output created by 008 > ^ permalink raw reply [flat|nested] 110+ messages in thread
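The keep-only-on-failure handling in the quoted `check` hunk can be sketched as a standalone helper. This is a minimal illustration under stated assumptions, not the patch's code: `_settle_mountfail_log` is a hypothetical name, its first argument plays the role of `$seqres`, and the compression step (`maybe_compress_logfile`) is left out.

```shell
#!/bin/bash
# Sketch only: _settle_mountfail_log is a hypothetical helper name, not
# something defined by the patch or by fstests.

# $1: result file prefix (plays the role of $seqres)
# $2: test status ("fail", "pass", "notrun", ...)
_settle_mountfail_log()
{
	local prefix="$1"
	local result="$2"
	local pending="${prefix}.mountfail?"

	# Nothing was captured, so there is nothing to keep or discard.
	test -f "$pending" || return 0

	if [ "$result" = "fail" ]; then
		# Promote the provisional log so it is kept with the other
		# per-test artifacts.
		mv "$pending" "${prefix}.mountfail"
		echo "possible mount failures (see ${prefix}.mountfail)"
	else
		# Passing or skipped tests may provoke mount failures on
		# purpose, so their provisional logs are dropped.
		rm -f "$pending"
	fi
}
```

The provisional name ends in a literal `?`, so every reference to it has to stay quoted; unquoted, the shell would treat the `?` as a glob character.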
* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails 2025-01-06 11:18 ` Nirjhar Roy @ 2025-01-06 23:52 ` Darrick J. Wong 2025-01-13 5:55 ` Nirjhar Roy 0 siblings, 1 reply; 110+ messages in thread From: Darrick J. Wong @ 2025-01-06 23:52 UTC (permalink / raw) To: Nirjhar Roy; +Cc: zlang, fstests, linux-xfs On Mon, Jan 06, 2025 at 04:48:34PM +0530, Nirjhar Roy wrote: > On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Capture the kernel output after a mount failure occurs. If the test > > itself fails, then keep the logging output for further diagnosis. > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> > > --- > > check | 22 +++++++++++++++++++++- > > common/rc | 26 +++++++++++++++++++++++++- > > common/report | 8 ++++++++ > > tests/selftest/008 | 20 ++++++++++++++++++++ > > tests/selftest/008.out | 1 + > > 5 files changed, 75 insertions(+), 2 deletions(-) > > create mode 100755 tests/selftest/008 > > create mode 100644 tests/selftest/008.out > > > > > > diff --git a/check b/check > > index 9222cd7e4f8197..a46ea1a54d78bb 100755 > > --- a/check > > +++ b/check > > @@ -614,7 +614,7 @@ _stash_fail_loop_files() { > > local seq_prefix="${REPORT_DIR}/${1}" > > local cp_suffix="$2" > > > > - for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" > > ".hints"; do > > + for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints" > > ".mountfail"; do > > rm -f "${seq_prefix}${i}${cp_suffix}" > > if [ -f "${seq_prefix}${i}" ]; then > > cp "${seq_prefix}${i}" > > "${seq_prefix}${i}${cp_suffix}" > > @@ -994,6 +994,7 @@ function run_section() > > echo -n " $seqnum -- " > > cat $seqres.notrun > > tc_status="notrun" > > + rm -f "$seqres.mountfail?" > > _stash_test_status "$seqnum" "$tc_status" > > > > # Unmount the scratch fs so that we can wipe > > the scratch > > @@ -1053,6 +1054,7 @@ function run_section() > > if [ ! 
-f $seq.out ]; then > > _dump_err "no qualified output" > > tc_status="fail" > > + rm -f "$seqres.mountfail?" > > _stash_test_status "$seqnum" "$tc_status" > > continue; > > fi > > @@ -1089,6 +1091,24 @@ function run_section() > > rm -f $seqres.hints > > fi > > fi > > + > > + if [ -f "$seqres.mountfail?" ]; then > > + if [ "$tc_status" = "fail" ]; then > > + # Let the user know if there were mount > > + # failures on a test that failed > > because that > > + # could be interesting. > > + mv "$seqres.mountfail?" > > "$seqres.mountfail" > > + _dump_err "check: possible mount > > failures (see $seqres.mountfail)" > > + test -f $seqres.mountfail && \ > > + maybe_compress_logfile > > $seqres.mountfail $MAX_MOUNTFAIL_SIZE > > + else > > + # Don't retain mount failure logs for > > tests > > + # that pass or were skipped because > > some tests > > + # intentionally drive mount failures. > > + rm -f "$seqres.mountfail?" > > + fi > > + fi > > + > > _stash_test_status "$seqnum" "$tc_status" > > done > > > > diff --git a/common/rc b/common/rc > > index d7dfb55bbbd7e1..0ede68eb912440 100644 > > --- a/common/rc > > +++ b/common/rc > > @@ -204,9 +204,33 @@ _get_hugepagesize() > > awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo > > } > > > > +# Does dmesg have a --since flag? > > +_dmesg_detect_since() > > +{ > > + if [ -z "$DMESG_HAS_SINCE" ]; then > > + test "$DMESG_HAS_SINCE" = "yes" > > + return > > + elif dmesg --help | grep -q -- --since; then > > + DMESG_HAS_SINCE=yes > > + else > > + DMESG_HAS_SINCE=no > > + fi > > +} > > + > > _mount() > > { > > - $MOUNT_PROG $* > > + $MOUNT_PROG $* > > + ret=$? > > + if [ "$ret" -ne 0 ]; then > > + echo "\"$MOUNT_PROG $*\" failed at $(date)" >> > > "$seqres.mountfail?" > > + if _dmesg_detect_since; then > > + dmesg --since '30s ago' >> "$seqres.mountfail?" > > + else > > + dmesg | tail -n 100 >> "$seqres.mountfail?" > Is it possible to grep for a mount failure message in dmesg and then > capture the last n lines? 
Do you think that will be more accurate? Alas no, because there's no standard mount failure log message for us to latch onto. > Also, do you think it is useful to make this 100 configurable instead > of hardcoding? I suppose, but why do you need more than 100? > > + fi > > + fi > > + > > + return $ret > > } > > > > # Call _mount to do mount operation but also save mountpoint to > > diff --git a/common/report b/common/report > > index 0e91e481f9725a..b57697f76dafb2 100644 > > --- a/common/report > > +++ b/common/report > > @@ -199,6 +199,7 @@ _xunit_make_testcase_report() > > local out_src="${SRC_DIR}/${test_name}.out" > > local full_file="${REPORT_DIR}/${test_name}.full" > > local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" > > + local > > mountfail_file="${REPORT_DIR}/${test_name}.mountfail" > > local outbad_file="${REPORT_DIR}/${test_name}.out.bad" > > if [ -z "$_err_msg" ]; then > > _err_msg="Test $test_name failed, reason > > unknown" > > @@ -225,6 +226,13 @@ _xunit_make_testcase_report() > > printf ']]>\n' >>$report > > echo -e "\t\t</system-err>" >> $report > > fi > > + if [ -z "$quiet" -a -f "$mountfail_file" ]; then > > + echo -e "\t\t<mount-failure>" >> $report > > + printf '<![CDATA[\n' >>$report > > + cat "$mountfail_file" | tr -dc > > '[:print:][:space:]' | encode_cdata >>$report > > + printf ']]>\n' >>$report > > + echo -e "\t\t</mount-failure>" >> $report > > + fi > > ;; > > *) > > echo -e "\t\t<failure message=\"Unknown > > test_status=$test_status\" type=\"TestFail\"/>" >> $report > > diff --git a/tests/selftest/008 b/tests/selftest/008 > > new file mode 100755 > > index 00000000000000..db80ffe6f77339 > > --- /dev/null > > +++ b/tests/selftest/008 > > @@ -0,0 +1,20 @@ > > +#! /bin/bash > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright (c) 2024-2025 Oracle. All Rights Reserved. > > +# > > +# FS QA Test 008 > > +# > > +# Test mount failure capture. > > +# > > +. 
./common/preamble > > +_begin_fstest selftest > > + > > +_require_command "$WIPEFS_PROG" wipefs > > +_require_scratch > > + > > +$WIPEFS_PROG -a $SCRATCH_DEV > > +_scratch_mount &>> $seqres.full > Minor: Do you think adding some filtered messages from the captured > dmesg logs in the output will be helpful? No, this test exists to make sure that the dmesg log is captured in $RESULT_DIR. We don't care about the mount(8) output. --D > > + > > +# success, all done > > +status=0 > > +exit > > diff --git a/tests/selftest/008.out b/tests/selftest/008.out > > new file mode 100644 > > index 00000000000000..aaff95f3f48372 > > --- /dev/null > > +++ b/tests/selftest/008.out > > @@ -0,0 +1 @@ > > +QA output created by 008 > > > > ^ permalink raw reply [flat|nested] 110+ messages in thread
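The probe-then-fall-back logic discussed above can be sketched as follows. As quoted, the first branch of `_dmesg_detect_since` checks `-z` before comparing the cached value, which appears inverted (an empty cache would skip the probe entirely); the sketch below probes only when the cache is empty and always returns the cached verdict. All names here (`_has_since_flag`, `HAS_SINCE_CACHE`, `_mountfail_tail`, `MOUNTFAIL_DMESG_LINES`, `_recent_kernel_log`) are hypothetical and not part of the patch or of fstests; the log command is passed as an argument so the sketch can be exercised without reading the real kernel log.

```shell
#!/bin/bash
# Sketch only: every name below is hypothetical, not from the patch or fstests.

# Return 0 if "<cmd> --help" advertises a --since option. The probe runs
# once per shell; later calls reuse the answer cached in HAS_SINCE_CACHE.
_has_since_flag()
{
	if [ -z "$HAS_SINCE_CACHE" ]; then
		if "$@" --help 2>&1 | grep -q -- --since; then
			HAS_SINCE_CACHE=yes
		else
			HAS_SINCE_CACHE=no
		fi
	fi
	test "$HAS_SINCE_CACHE" = "yes"
}

# Keep the last N lines of stdin. N comes from MOUNTFAIL_DMESG_LINES and
# falls back to 100 (the value hardcoded in the patch) if it is unset or
# not a positive integer.
_mountfail_tail()
{
	local nr="${MOUNTFAIL_DMESG_LINES:-100}"

	case "$nr" in
	''|*[!0-9]*) nr=100 ;;
	esac
	test "$nr" -gt 0 || nr=100

	tail -n "$nr"
}

# Emit recent kernel messages using whichever method the log command
# supports. $1 is the log command (dmesg in real use).
_recent_kernel_log()
{
	local cmd="$1"

	if _has_since_flag "$cmd"; then
		"$cmd" --since '30s ago'
	else
		"$cmd" | _mountfail_tail
	fi
}
```

In `_mount`, the failure branch would then reduce to something like `_recent_kernel_log dmesg >> "$seqres.mountfail?"`. Whether a tunable line count is worth carrying is the open question from the review; the sketch only shows that it would be cheap to add.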
* Re: [PATCH 2/2] check: capture dmesg of mount failures if test fails 2025-01-06 23:52 ` Darrick J. Wong @ 2025-01-13 5:55 ` Nirjhar Roy 0 siblings, 0 replies; 110+ messages in thread From: Nirjhar Roy @ 2025-01-13 5:55 UTC (permalink / raw) To: Darrick J. Wong; +Cc: zlang, fstests, linux-xfs On Mon, 2025-01-06 at 15:52 -0800, Darrick J. Wong wrote: > On Mon, Jan 06, 2025 at 04:48:34PM +0530, Nirjhar Roy wrote: > > On Tue, 2024-12-31 at 15:56 -0800, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@kernel.org> > > > > > > Capture the kernel output after a mount failure occurs. If the > > > test > > > itself fails, then keep the logging output for further diagnosis. > > > > > > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> > > > --- > > > check | 22 +++++++++++++++++++++- > > > common/rc | 26 +++++++++++++++++++++++++- > > > common/report | 8 ++++++++ > > > tests/selftest/008 | 20 ++++++++++++++++++++ > > > tests/selftest/008.out | 1 + > > > 5 files changed, 75 insertions(+), 2 deletions(-) > > > create mode 100755 tests/selftest/008 > > > create mode 100644 tests/selftest/008.out > > > > > > > > > diff --git a/check b/check > > > index 9222cd7e4f8197..a46ea1a54d78bb 100755 > > > --- a/check > > > +++ b/check > > > @@ -614,7 +614,7 @@ _stash_fail_loop_files() { > > > local seq_prefix="${REPORT_DIR}/${1}" > > > local cp_suffix="$2" > > > > > > - for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" > > > ".hints"; do > > > + for i in ".full" ".dmesg" ".out.bad" ".notrun" ".core" ".hints" > > > ".mountfail"; do > > > rm -f "${seq_prefix}${i}${cp_suffix}" > > > if [ -f "${seq_prefix}${i}" ]; then > > > cp "${seq_prefix}${i}" > > > "${seq_prefix}${i}${cp_suffix}" > > > @@ -994,6 +994,7 @@ function run_section() > > > echo -n " $seqnum -- " > > > cat $seqres.notrun > > > tc_status="notrun" > > > + rm -f "$seqres.mountfail?" 
> > > _stash_test_status "$seqnum" "$tc_status" > > > > > > # Unmount the scratch fs so that we can wipe > > > the scratch > > > @@ -1053,6 +1054,7 @@ function run_section() > > > if [ ! -f $seq.out ]; then > > > _dump_err "no qualified output" > > > tc_status="fail" > > > + rm -f "$seqres.mountfail?" > > > _stash_test_status "$seqnum" "$tc_status" > > > continue; > > > fi > > > @@ -1089,6 +1091,24 @@ function run_section() > > > rm -f $seqres.hints > > > fi > > > fi > > > + > > > + if [ -f "$seqres.mountfail?" ]; then > > > + if [ "$tc_status" = "fail" ]; then > > > + # Let the user know if there were mount > > > + # failures on a test that failed > > > because that > > > + # could be interesting. > > > + mv "$seqres.mountfail?" > > > "$seqres.mountfail" > > > + _dump_err "check: possible mount > > > failures (see $seqres.mountfail)" > > > + test -f $seqres.mountfail && \ > > > + maybe_compress_logfile > > > $seqres.mountfail $MAX_MOUNTFAIL_SIZE > > > + else > > > + # Don't retain mount failure logs for > > > tests > > > + # that pass or were skipped because > > > some tests > > > + # intentionally drive mount failures. > > > + rm -f "$seqres.mountfail?" > > > + fi > > > + fi > > > + > > > _stash_test_status "$seqnum" "$tc_status" > > > done > > > > > > diff --git a/common/rc b/common/rc > > > index d7dfb55bbbd7e1..0ede68eb912440 100644 > > > --- a/common/rc > > > +++ b/common/rc > > > @@ -204,9 +204,33 @@ _get_hugepagesize() > > > awk '/Hugepagesize/ {print $2 * 1024}' /proc/meminfo > > > } > > > > > > +# Does dmesg have a --since flag? > > > +_dmesg_detect_since() > > > +{ > > > + if [ -z "$DMESG_HAS_SINCE" ]; then > > > + test "$DMESG_HAS_SINCE" = "yes" > > > + return > > > + elif dmesg --help | grep -q -- --since; then > > > + DMESG_HAS_SINCE=yes > > > + else > > > + DMESG_HAS_SINCE=no > > > + fi > > > +} > > > + > > > _mount() > > > { > > > - $MOUNT_PROG $* > > > + $MOUNT_PROG $* > > > + ret=$? 
> > > + if [ "$ret" -ne 0 ]; then > > > + echo "\"$MOUNT_PROG $*\" failed at $(date)" >> > > > "$seqres.mountfail?" > > > + if _dmesg_detect_since; then > > > + dmesg --since '30s ago' >> "$seqres.mountfail?" > > > + else > > > + dmesg | tail -n 100 >> "$seqres.mountfail?" > > Is it possible to grep for a mount failure message in dmesg and > > then > > capture the last n lines? Do you think that will be more accurate? > > Alas no, because there's no standard mount failure log message for us > to > latch onto. Okay makes sense. > > > Also, do you think it is useful to make this 100 configurable > > instead > > of hardcoding? > > I suppose, but why do you need more than 100? So my thought behind this is that in case, the dmesg gets cluttered with noisy logs from other processes. No hard preferences though. > > > > + fi > > > + fi > > > + > > > + return $ret > > > } > > > > > > # Call _mount to do mount operation but also save mountpoint to > > > diff --git a/common/report b/common/report > > > index 0e91e481f9725a..b57697f76dafb2 100644 > > > --- a/common/report > > > +++ b/common/report > > > @@ -199,6 +199,7 @@ _xunit_make_testcase_report() > > > local out_src="${SRC_DIR}/${test_name}.out" > > > local full_file="${REPORT_DIR}/${test_name}.full" > > > local dmesg_file="${REPORT_DIR}/${test_name}.dmesg" > > > + local > > > mountfail_file="${REPORT_DIR}/${test_name}.mountfail" > > > local outbad_file="${REPORT_DIR}/${test_name}.out.bad" > > > if [ -z "$_err_msg" ]; then > > > _err_msg="Test $test_name failed, reason > > > unknown" > > > @@ -225,6 +226,13 @@ _xunit_make_testcase_report() > > > printf ']]>\n' >>$report > > > echo -e "\t\t</system-err>" >> $report > > > fi > > > + if [ -z "$quiet" -a -f "$mountfail_file" ]; then > > > + echo -e "\t\t<mount-failure>" >> $report > > > + printf '<![CDATA[\n' >>$report > > > + cat "$mountfail_file" | tr -dc > > > '[:print:][:space:]' | encode_cdata >>$report > > > + printf ']]>\n' >>$report > > > + echo -e 
"\t\t</mount-failure>" >> $report > > > + fi > > > ;; > > > *) > > > echo -e "\t\t<failure message=\"Unknown > > > test_status=$test_status\" type=\"TestFail\"/>" >> $report > > > diff --git a/tests/selftest/008 b/tests/selftest/008 > > > new file mode 100755 > > > index 00000000000000..db80ffe6f77339 > > > --- /dev/null > > > +++ b/tests/selftest/008 > > > @@ -0,0 +1,20 @@ > > > +#! /bin/bash > > > +# SPDX-License-Identifier: GPL-2.0 > > > +# Copyright (c) 2024-2025 Oracle. All Rights Reserved. > > > +# > > > +# FS QA Test 008 > > > +# > > > +# Test mount failure capture. > > > +# > > > +. ./common/preamble > > > +_begin_fstest selftest > > > + > > > +_require_command "$WIPEFS_PROG" wipefs > > > +_require_scratch > > > + > > > +$WIPEFS_PROG -a $SCRATCH_DEV > > > +_scratch_mount &>> $seqres.full > > Minor: Do you think adding some filtered messages from the captured > > dmesg logs in the output will be helpful? > > No, this test exists to make sure that the dmesg log is captured in > $RESULT_DIR. We don't care about the mount(8) output. > > --D Okay, got it. --NR > > > > + > > > +# success, all done > > > +status=0 > > > +exit > > > diff --git a/tests/selftest/008.out b/tests/selftest/008.out > > > new file mode 100644 > > > index 00000000000000..aaff95f3f48372 > > > --- /dev/null > > > +++ b/tests/selftest/008.out > > > @@ -0,0 +1 @@ > > > +QA output created by 008 > > > ^ permalink raw reply [flat|nested] 110+ messages in thread
* [PATCHSET 4/5] fstests: live health monitoring of filesystems
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (12 preceding siblings ...)
2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
2024-12-31 23:57 ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
` (5 more replies)
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
2025-01-02 1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
15 siblings, 6 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive. This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically. This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing. It is autostarted via some horrible udev
rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring fstests git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring --- Commits in this patchset: * misc: convert all $UMOUNT_PROG to a _umount helper * misc: convert all umount(1) invocations to _umount * xfs: test health monitoring code * xfs: test for metadata corruption error reporting via healthmon * xfs: test io error reporting via healthmon * xfs: test new xfs_scrubbed daemon --- common/btrfs | 2 + common/config | 6 +++ common/dmdelay | 4 +- common/dmdust | 4 +- common/dmerror | 4 +- common/dmflakey | 4 +- common/dmhugedisk | 2 + common/dmlogwrites | 4 +- common/dmthin | 4 +- common/overlay | 10 +++--- common/populate | 8 ++--- common/quota | 2 + common/rc | 47 ++++++++++++++++++--------- common/systemd | 9 +++++ common/xfs | 18 ++++++++++ doc/group-names.txt | 1 + tests/btrfs/012 | 2 + tests/btrfs/020 | 2 + tests/btrfs/029 | 2 + tests/btrfs/031 | 2 + tests/btrfs/060 | 2 + tests/btrfs/065 | 2 + tests/btrfs/066 | 2 + tests/btrfs/067 | 2 + tests/btrfs/068 | 2 + tests/btrfs/075 | 2 + tests/btrfs/089 | 2 + tests/btrfs/124 | 2 + tests/btrfs/125 | 2 + tests/btrfs/185 | 4 +- tests/btrfs/197 | 4 +- tests/btrfs/199 | 2 + tests/btrfs/219 | 12 +++---- tests/btrfs/254 | 2 + tests/btrfs/291 | 2 + tests/btrfs/298 | 4 +- tests/ext4/006 | 4 +- tests/ext4/007 | 4 +- tests/ext4/008 | 4 +- tests/ext4/009 | 8 ++--- tests/ext4/010 | 6 ++- tests/ext4/011 | 2 + tests/ext4/012 | 2 + tests/ext4/013 | 6 ++- tests/ext4/014 | 6 ++- tests/ext4/015 | 6 ++- tests/ext4/016 | 6 ++- tests/ext4/017 | 6 ++- tests/ext4/018 | 6 ++- tests/ext4/019 | 6 ++- tests/ext4/032 | 4 +- tests/ext4/033 | 2 + tests/ext4/052 | 4 +- tests/ext4/053 | 32 +++++++++--------- tests/ext4/056 | 2 + tests/generic/042 | 4 +- tests/generic/067 | 6 ++- 
tests/generic/081 | 2 + tests/generic/085 | 4 +- tests/generic/108 | 2 + tests/generic/171 | 2 + tests/generic/172 | 2 + tests/generic/173 | 2 + tests/generic/174 | 2 + tests/generic/306 | 2 + tests/generic/330 | 2 + tests/generic/332 | 2 + tests/generic/361 | 2 + tests/generic/373 | 2 + tests/generic/374 | 2 + tests/generic/395 | 2 + tests/generic/459 | 2 + tests/generic/563 | 4 +- tests/generic/604 | 2 + tests/generic/631 | 2 + tests/generic/648 | 6 ++- tests/generic/698 | 4 +- tests/generic/699 | 8 ++--- tests/generic/704 | 2 + tests/generic/717 | 2 + tests/generic/730 | 2 + tests/generic/731 | 2 + tests/generic/732 | 4 +- tests/generic/746 | 8 ++--- tests/overlay/003 | 2 + tests/overlay/004 | 2 + tests/overlay/005 | 6 ++- tests/overlay/014 | 4 +- tests/overlay/022 | 2 + tests/overlay/025 | 4 +- tests/overlay/029 | 6 ++- tests/overlay/031 | 8 ++--- tests/overlay/035 | 2 + tests/overlay/036 | 8 ++--- tests/overlay/037 | 6 ++- tests/overlay/040 | 2 + tests/overlay/041 | 2 + tests/overlay/042 | 2 + tests/overlay/043 | 2 + tests/overlay/044 | 2 + tests/overlay/048 | 4 +- tests/overlay/049 | 2 + tests/overlay/050 | 2 + tests/overlay/051 | 4 +- tests/overlay/052 | 2 + tests/overlay/053 | 4 +- tests/overlay/054 | 2 + tests/overlay/055 | 4 +- tests/overlay/056 | 2 + tests/overlay/057 | 4 +- tests/overlay/059 | 2 + tests/overlay/060 | 2 + tests/overlay/062 | 2 + tests/overlay/063 | 2 + tests/overlay/065 | 22 ++++++------- tests/overlay/067 | 2 + tests/overlay/068 | 4 +- tests/overlay/069 | 6 ++- tests/overlay/070 | 6 ++- tests/overlay/071 | 6 ++- tests/overlay/076 | 2 + tests/overlay/077 | 2 + tests/overlay/078 | 2 + tests/overlay/079 | 2 + tests/overlay/080 | 2 + tests/overlay/081 | 14 ++++---- tests/overlay/083 | 2 + tests/overlay/084 | 10 +++--- tests/overlay/085 | 2 + tests/overlay/086 | 8 ++--- tests/xfs/014 | 4 +- tests/xfs/049 | 8 ++--- tests/xfs/073 | 8 ++--- tests/xfs/074 | 4 +- tests/xfs/078 | 4 +- tests/xfs/083 | 6 ++- tests/xfs/085 | 4 +- tests/xfs/086 | 8 
++--- tests/xfs/087 | 6 ++- tests/xfs/088 | 8 ++--- tests/xfs/089 | 8 ++--- tests/xfs/091 | 8 ++--- tests/xfs/093 | 6 ++- tests/xfs/097 | 6 ++- tests/xfs/098 | 4 +- tests/xfs/099 | 6 ++- tests/xfs/100 | 6 ++- tests/xfs/101 | 6 ++- tests/xfs/102 | 6 ++- tests/xfs/105 | 6 ++- tests/xfs/112 | 8 ++--- tests/xfs/113 | 6 ++- tests/xfs/117 | 6 ++- tests/xfs/120 | 6 ++- tests/xfs/123 | 6 ++- tests/xfs/124 | 6 ++- tests/xfs/125 | 6 ++- tests/xfs/126 | 6 ++- tests/xfs/130 | 2 + tests/xfs/148 | 6 ++- tests/xfs/149 | 4 +- tests/xfs/152 | 2 + tests/xfs/169 | 6 ++- tests/xfs/186 | 4 +- tests/xfs/1878 | 80 ++++++++++++++++++++++++++++++++++++++++++++++ tests/xfs/1878.out | 10 ++++++ tests/xfs/1879 | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++ tests/xfs/1879.out | 12 +++++++ tests/xfs/1882 | 64 +++++++++++++++++++++++++++++++++++++ tests/xfs/1882.out | 2 + tests/xfs/1883 | 75 +++++++++++++++++++++++++++++++++++++++++++ tests/xfs/1883.out | 2 + tests/xfs/1884 | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++ tests/xfs/1884.out | 2 + tests/xfs/1885 | 53 ++++++++++++++++++++++++++++++ tests/xfs/1885.out | 5 +++ tests/xfs/206 | 2 + tests/xfs/216 | 2 + tests/xfs/217 | 2 + tests/xfs/235 | 6 ++- tests/xfs/236 | 6 ++- tests/xfs/239 | 2 + tests/xfs/241 | 2 + tests/xfs/250 | 4 +- tests/xfs/265 | 6 ++- tests/xfs/289 | 4 +- tests/xfs/310 | 4 +- tests/xfs/507 | 2 + tests/xfs/513 | 4 +- tests/xfs/544 | 2 + tests/xfs/716 | 4 +- tests/xfs/806 | 4 +- 192 files changed, 921 insertions(+), 391 deletions(-) create mode 100755 tests/xfs/1878 create mode 100644 tests/xfs/1878.out create mode 100755 tests/xfs/1879 create mode 100644 tests/xfs/1879.out create mode 100755 tests/xfs/1882 create mode 100644 tests/xfs/1882.out create mode 100755 tests/xfs/1883 create mode 100644 tests/xfs/1883.out create mode 100755 tests/xfs/1884 create mode 100644 tests/xfs/1884.out create mode 100755 tests/xfs/1885 create mode 100644 tests/xfs/1885.out ^ permalink raw reply [flat|nested] 110+ messages in 
thread
* [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper 2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong @ 2024-12-31 23:57 ` Darrick J. Wong 2024-12-31 23:57 ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong ` (4 subsequent siblings) 5 siblings, 0 replies; 110+ messages in thread From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw) To: zlang, djwong; +Cc: fstests, linux-xfs From: Darrick J. Wong <djwong@kernel.org> We're going to start collecting ephemeral(ish) filesystem stats in the next patch, so switch all the $UMOUNT_PROG to a helper. sed -e 's/$UMOUNT_PROG/_umount/g' -i $(git ls-files common tests check) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- common/btrfs | 2 +- common/dmdelay | 4 ++-- common/dmdust | 4 ++-- common/dmerror | 2 +- common/dmflakey | 4 ++-- common/dmhugedisk | 2 +- common/dmlogwrites | 4 ++-- common/dmthin | 4 ++-- common/overlay | 10 +++++----- common/rc | 33 +++++++++++++++++++-------------- tests/btrfs/020 | 2 +- tests/btrfs/029 | 2 +- tests/btrfs/031 | 2 +- tests/btrfs/060 | 2 +- tests/btrfs/065 | 2 +- tests/btrfs/066 | 2 +- tests/btrfs/067 | 2 +- tests/btrfs/068 | 2 +- tests/btrfs/075 | 2 +- tests/btrfs/089 | 2 +- tests/btrfs/124 | 2 +- tests/btrfs/125 | 2 +- tests/btrfs/185 | 4 ++-- tests/btrfs/197 | 4 ++-- tests/btrfs/219 | 12 ++++++------ tests/btrfs/254 | 2 +- tests/ext4/032 | 4 ++-- tests/ext4/052 | 4 ++-- tests/ext4/053 | 32 ++++++++++++++++---------------- tests/ext4/056 | 2 +- tests/generic/042 | 4 ++-- tests/generic/067 | 6 +++--- tests/generic/081 | 2 +- tests/generic/085 | 4 ++-- tests/generic/108 | 2 +- tests/generic/361 | 2 +- tests/generic/373 | 2 +- tests/generic/374 | 2 +- tests/generic/459 | 2 +- tests/generic/604 | 2 ++ tests/generic/648 | 6 +++--- tests/generic/698 | 4 ++-- tests/generic/699 | 8 ++++---- tests/generic/704 | 2 +- tests/generic/730 | 2 +- tests/generic/731 | 2 +- tests/generic/732 | 4 ++-- 
tests/generic/746 | 8 ++++---- tests/overlay/003 | 2 +- tests/overlay/004 | 2 +- tests/overlay/005 | 6 +++--- tests/overlay/014 | 4 ++-- tests/overlay/022 | 2 +- tests/overlay/025 | 4 ++-- tests/overlay/029 | 6 +++--- tests/overlay/031 | 8 ++++---- tests/overlay/035 | 2 +- tests/overlay/036 | 8 ++++---- tests/overlay/037 | 6 +++--- tests/overlay/040 | 2 +- tests/overlay/041 | 2 +- tests/overlay/042 | 2 +- tests/overlay/043 | 2 +- tests/overlay/044 | 2 +- tests/overlay/048 | 4 ++-- tests/overlay/049 | 2 +- tests/overlay/050 | 2 +- tests/overlay/051 | 4 ++-- tests/overlay/052 | 2 +- tests/overlay/053 | 4 ++-- tests/overlay/054 | 2 +- tests/overlay/055 | 4 ++-- tests/overlay/056 | 2 +- tests/overlay/057 | 4 ++-- tests/overlay/059 | 2 +- tests/overlay/060 | 2 +- tests/overlay/062 | 2 +- tests/overlay/063 | 2 +- tests/overlay/065 | 22 +++++++++++----------- tests/overlay/067 | 2 +- tests/overlay/068 | 4 ++-- tests/overlay/069 | 6 +++--- tests/overlay/070 | 6 +++--- tests/overlay/071 | 6 +++--- tests/overlay/076 | 2 +- tests/overlay/077 | 2 +- tests/overlay/078 | 2 +- tests/overlay/079 | 2 +- tests/overlay/080 | 2 +- tests/overlay/081 | 14 +++++++------- tests/overlay/083 | 2 +- tests/overlay/084 | 10 +++++----- tests/overlay/085 | 2 +- tests/overlay/086 | 8 ++++---- tests/xfs/078 | 4 ++-- tests/xfs/148 | 6 +++--- tests/xfs/149 | 4 ++-- tests/xfs/186 | 4 ++-- tests/xfs/289 | 4 ++-- tests/xfs/507 | 2 +- tests/xfs/513 | 4 ++-- tests/xfs/544 | 2 +- tests/xfs/806 | 4 ++-- 103 files changed, 226 insertions(+), 219 deletions(-) diff --git a/common/btrfs b/common/btrfs index 64f38cc240ab8b..b82c8f5a934cfd 100644 --- a/common/btrfs +++ b/common/btrfs @@ -352,7 +352,7 @@ _btrfs_stress_subvolume() while [ ! 
-e $stop_file ]; do $BTRFS_UTIL_PROG subvolume create $btrfs_mnt/$subvol_name _mount -o subvol=$subvol_name $btrfs_dev $subvol_mnt - $UMOUNT_PROG $subvol_mnt + _umount $subvol_mnt $BTRFS_UTIL_PROG subvolume delete $btrfs_mnt/$subvol_name done } diff --git a/common/dmdelay b/common/dmdelay index 794ea37ba200ce..691e22538a622b 100644 --- a/common/dmdelay +++ b/common/dmdelay @@ -26,7 +26,7 @@ _mount_delay() _unmount_delay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } _cleanup_delay() @@ -34,7 +34,7 @@ _cleanup_delay() # If dmsetup load fails then we need to make sure to do resume here # otherwise the umount will hang $DMSETUP_PROG resume delay-test > /dev/null 2>&1 - $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 _dmsetup_remove delay-test } diff --git a/common/dmdust b/common/dmdust index 56fcc0e0fffa1e..13461c2dd3a006 100644 --- a/common/dmdust +++ b/common/dmdust @@ -22,7 +22,7 @@ _mount_dust() _unmount_dust() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } _cleanup_dust() @@ -30,6 +30,6 @@ _cleanup_dust() # If dmsetup load fails then we need to make sure to do resume here # otherwise the umount will hang $DMSETUP_PROG resume dust-test > /dev/null 2>&1 - $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 _dmsetup_remove dust-test } diff --git a/common/dmerror b/common/dmerror index 2f006142a309fe..1e6a35230f3ccb 100644 --- a/common/dmerror +++ b/common/dmerror @@ -106,7 +106,7 @@ _dmerror_cleanup() test -n "$NON_ERROR_RTDEV" && $DMSETUP_PROG resume error-rttest &>/dev/null $DMSETUP_PROG resume error-test > /dev/null 2>&1 - $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 test -n "$NON_ERROR_LOGDEV" && _dmsetup_remove error-logtest test -n "$NON_ERROR_RTDEV" && _dmsetup_remove error-rttest diff --git a/common/dmflakey b/common/dmflakey index 52da3b100fbe45..64723f983b27ec 100644 --- a/common/dmflakey +++ b/common/dmflakey @@ -67,7 +67,7 @@ 
_mount_flakey()
 _unmount_flakey()
 {
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 _cleanup_flakey()
@@ -78,7 +78,7 @@ _cleanup_flakey()
 test -n "$NON_FLAKEY_RTDEV" && $DMSETUP_PROG resume flakey-rttest &> /dev/null
 $DMSETUP_PROG resume flakey-test > /dev/null 2>&1
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 _dmsetup_remove flakey-test
 test -n "$NON_FLAKEY_LOGDEV" && _dmsetup_remove flakey-logtest
diff --git a/common/dmhugedisk b/common/dmhugedisk
index 502f0243772d52..a02bff4351d9be 100644
--- a/common/dmhugedisk
+++ b/common/dmhugedisk
@@ -39,7 +39,7 @@ _dmhugedisk_init()
 _dmhugedisk_cleanup()
 {
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 _dmsetup_remove huge-test
 _dmsetup_remove huge-test-zero
 }
diff --git a/common/dmlogwrites b/common/dmlogwrites
index c054acb875a384..a1a5c415338276 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -145,7 +145,7 @@ _log_writes_mount()
 _log_writes_unmount()
 {
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 # _log_writes_replay_log <mark>
@@ -177,7 +177,7 @@ _log_writes_remove()
 _log_writes_cleanup()
 {
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 _log_writes_remove
 }
diff --git a/common/dmthin b/common/dmthin
index 7107d50804896e..38d561c8eb25d6 100644
--- a/common/dmthin
+++ b/common/dmthin
@@ -23,7 +23,7 @@ DMTHIN_VOL_DEV="/dev/mapper/$DMTHIN_VOL_NAME"
 _dmthin_cleanup()
 {
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 _dmsetup_remove $DMTHIN_VOL_NAME
 _dmsetup_remove $DMTHIN_POOL_NAME
 _dmsetup_remove $DMTHIN_META_NAME
@@ -32,7 +32,7 @@ _dmthin_cleanup()
 _dmthin_check_fs()
 {
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 _check_scratch_fs $DMTHIN_VOL_DEV
 }
diff --git a/common/overlay b/common/overlay
index da1d8d2c3183f4..2877f31e22ebd9 100644
--- a/common/overlay
+++ b/common/overlay
@@ -142,18 +142,18 @@ _overlay_base_unmount()
 [ -n "$dev" -a -n "$mnt" ] || return 0
- $UMOUNT_PROG $mnt
+ _umount $mnt
 }
 _overlay_test_unmount()
 {
- $UMOUNT_PROG $TEST_DIR
+ _umount $TEST_DIR
 _overlay_base_unmount "$OVL_BASE_TEST_DEV" "$OVL_BASE_TEST_DIR"
 }
 _overlay_scratch_unmount()
 {
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 _overlay_base_unmount "$OVL_BASE_SCRATCH_DEV" "$OVL_BASE_SCRATCH_MNT"
 }
@@ -342,7 +342,7 @@ _overlay_check_scratch_dirs()
 # Need to umount overlay for scratch dir check
 local ovl_mounted=`_is_dir_mountpoint $SCRATCH_MNT`
- [ -z "$ovl_mounted" ] || $UMOUNT_PROG $SCRATCH_MNT
+ [ -z "$ovl_mounted" ] || _umount $SCRATCH_MNT
 # Check dirs with extra overlay options
 _overlay_check_dirs $lowerdir $upperdir $workdir $*
@@ -387,7 +387,7 @@ _overlay_check_fs()
 else
 # Check and umount overlay for dir check
 ovl_mounted=`_is_dir_mountpoint $ovl_mnt`
- [ -z "$ovl_mounted" ] || $UMOUNT_PROG $ovl_mnt
+ [ -z "$ovl_mounted" ] || _umount $ovl_mnt
 fi
 _overlay_check_dirs $base_mnt/$OVL_LOWER $base_mnt/$OVL_UPPER \
diff --git a/common/rc b/common/rc
index 0ede68eb912440..d3ee76e01db892 100644
--- a/common/rc
+++ b/common/rc
@@ -233,6 +233,11 @@ _mount()
 return $ret
 }
+_umount()
+{
+ $UMOUNT_PROG $*
+}
+
 # Call _mount to do mount operation but also save mountpoint to
 # MOUNTED_POINT_STACK. Note that the mount point must be the last parameter
 _get_mount()
@@ -266,7 +271,7 @@ _put_mount()
 local last_mnt=`echo $MOUNTED_POINT_STACK | awk '{print $1}'`
 if [ -n "$last_mnt" ]; then
- $UMOUNT_PROG $last_mnt
+ _umount $last_mnt
 fi
 MOUNTED_POINT_STACK=`echo $MOUNTED_POINT_STACK | cut -d\ -f2-`
 }
@@ -275,7 +280,7 @@ _put_mount()
 _clear_mount_stack()
 {
 if [ -n "$MOUNTED_POINT_STACK" ]; then
- $UMOUNT_PROG $MOUNTED_POINT_STACK
+ _umount $MOUNTED_POINT_STACK
 fi
 MOUNTED_POINT_STACK=""
 }
@@ -420,20 +425,20 @@ _scratch_unmount()
 _overlay_scratch_unmount
 ;;
 btrfs)
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 ;;
 tmpfs)
 $UMOUNT_PROG $SCRATCH_MNT
 ;;
 *)
- $UMOUNT_PROG $SCRATCH_DEV
+ _umount $SCRATCH_DEV
 ;;
 esac
 }
 _scratch_umount_idmapped()
 {
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 _scratch_remount()
@@ -457,7 +462,7 @@ _scratch_cycle_mount()
 ;;
 overlay)
 if [ "$OVL_BASE_FSTYP" = tmpfs ]; then
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 unmounted=true
 fi
 ;;
@@ -505,9 +510,9 @@ _move_mount()
 # Replace $mnt with $tmp. Use a temporary bind-mount because
 # mount --move will fail with certain mount propagation layouts.
- $UMOUNT_PROG $mnt || _fail "Failed to unmount $mnt"
+ _umount $mnt || _fail "Failed to unmount $mnt"
 _mount --bind $tmp $mnt || _fail "Failed to bind-mount $tmp to $mnt"
- $UMOUNT_PROG $tmp || _fail "Failed to unmount $tmp"
+ _umount $tmp || _fail "Failed to unmount $tmp"
 rmdir $tmp
 }
@@ -573,7 +578,7 @@ _test_unmount()
 if [ "$FSTYP" == "overlay" ]; then
 _overlay_test_unmount
 else
- $UMOUNT_PROG $TEST_DEV
+ _umount $TEST_DEV
 fi
 }
@@ -587,7 +592,7 @@ _test_cycle_mount()
 ;;
 overlay)
 if [ "$OVL_BASE_FSTYP" = tmpfs ]; then
- $UMOUNT_PROG $TEST_DIR
+ _umount $TEST_DIR
 unmounted=true
 fi
 ;;
@@ -1375,7 +1380,7 @@ _repair_scratch_fs()
 # Fall through to repair base fs
 dev=$OVL_BASE_SCRATCH_DEV
 fstyp=$OVL_BASE_FSTYP
- $UMOUNT_PROG $OVL_BASE_SCRATCH_MNT
+ _umount $OVL_BASE_SCRATCH_MNT
 fi
 # Let's hope fsck -y suffices...
 fsck -t $fstyp -y $dev 2>&1
@@ -2189,7 +2194,7 @@ _require_logdev()
 _notrun "This test requires USE_EXTERNAL to be enabled"
 # ensure its not mounted
- $UMOUNT_PROG $SCRATCH_LOGDEV 2>/dev/null
+ _umount $SCRATCH_LOGDEV 2>/dev/null
 }
 # This test requires that an external log device is not in use
@@ -3281,7 +3286,7 @@ _umount_or_remount_ro()
 local mountpoint=`_is_dev_mounted $device`
 if [ $USE_REMOUNT -eq 0 ]; then
- $UMOUNT_PROG $device
+ _umount $device
 else
 _remount $device ro
 fi
@@ -3799,7 +3804,7 @@ _require_scratch_dev_pool()
 _notrun "$i is part of TEST_DEV, this test requires unique disks"
 fi
 if _mount | grep -q $i; then
- if ! $UMOUNT_PROG $i; then
+ if ! _umount $i; then
 echo "failed to unmount $i - aborting"
 exit 1
 fi
diff --git a/tests/btrfs/020 b/tests/btrfs/020
index 7e5c6fd7b25229..f6fadab1f00bdb 100755
--- a/tests/btrfs/020
+++ b/tests/btrfs/020
@@ -17,7 +17,7 @@ _cleanup()
 {
 cd /
 rm -f $tmp.*
- $UMOUNT_PROG $loop_mnt
+ _umount $loop_mnt
 _destroy_loop_device $loop_dev1
 losetup -d $loop_dev2 >/dev/null 2>&1
 _destroy_loop_device $loop_dev3
diff --git a/tests/btrfs/029 b/tests/btrfs/029
index c37ad63fb613db..9799b275250e5a 100755
--- a/tests/btrfs/029
+++ b/tests/btrfs/029
@@ -74,7 +74,7 @@ cp --reflink=always $orig_file $copy_file >> $seqres.full 2>&1 || echo "cp refli
 md5sum $orig_file | _filter_testdir_and_scratch
 md5sum $copy_file | _filter_testdir_and_scratch
-$UMOUNT_PROG $reflink_test_dir
+_umount $reflink_test_dir
 # success, all done
 status=0
diff --git a/tests/btrfs/031 b/tests/btrfs/031
index 8ac73d3a86e70b..92c1d26f865ba9 100755
--- a/tests/btrfs/031
+++ b/tests/btrfs/031
@@ -99,7 +99,7 @@ mv $testdir2/file* $subvol2/
 echo "Verify the file contents:"
 _checksum_files
-$UMOUNT_PROG $cross_mount_test_dir
+_umount $cross_mount_test_dir
 # success, all done
 status=0
diff --git a/tests/btrfs/060 b/tests/btrfs/060
index 75c10bd23c36f5..0bf88f86ca822b 100755
--- a/tests/btrfs/060
+++ b/tests/btrfs/060
@@ -82,7 +82,7 @@ run_test()
 fi
 # in case the subvolume is still mounted
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 _scratch_unmount
 # we called _require_scratch_nocheck instead of _require_scratch
 # do check after test for each profile config
diff --git a/tests/btrfs/065 b/tests/btrfs/065
index b87c66d6e3d45e..9cd38fefe46875 100755
--- a/tests/btrfs/065
+++ b/tests/btrfs/065
@@ -90,7 +90,7 @@ run_test()
 fi
 # in case the subvolume is still mounted
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 _scratch_unmount
 # we called _require_scratch_nocheck instead of _require_scratch
 # do check after test for each profile config
diff --git a/tests/btrfs/066 b/tests/btrfs/066
index cc7cd9b7273d1c..b3db57049714ad 100755
--- a/tests/btrfs/066
+++ b/tests/btrfs/066
@@ -82,7 +82,7 @@ run_test()
 fi
 # in case the subvolume is still mounted
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 _scratch_unmount
 # we called _require_scratch_nocheck instead of _require_scratch
 # do check after test for each profile config
diff --git a/tests/btrfs/067 b/tests/btrfs/067
index 0b473050027a0a..ede9abbc689fe0 100755
--- a/tests/btrfs/067
+++ b/tests/btrfs/067
@@ -83,7 +83,7 @@ run_test()
 fi
 # in case the subvolume is still mounted
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 _scratch_unmount
 # we called _require_scratch_nocheck instead of _require_scratch
 # do check after test for each profile config
diff --git a/tests/btrfs/068 b/tests/btrfs/068
index 83e932e8417c0d..82dac5fd90ba85 100755
--- a/tests/btrfs/068
+++ b/tests/btrfs/068
@@ -83,7 +83,7 @@ run_test()
 fi
 # in case the subvolume is still mounted
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 _scratch_unmount
 # we called _require_scratch_nocheck instead of _require_scratch
 # do check after test for each profile config
diff --git a/tests/btrfs/075 b/tests/btrfs/075
index 737c4ffdd57865..8e78bd3d4b2336 100755
--- a/tests/btrfs/075
+++ b/tests/btrfs/075
@@ -15,7 +15,7 @@ _cleanup()
 {
 cd /
 rm -f $tmp.*
- $UMOUNT_PROG $subvol_mnt >/dev/null 2>&1
+ _umount $subvol_mnt >/dev/null 2>&1
 }
 . ./common/filter
diff --git a/tests/btrfs/089 b/tests/btrfs/089
index 8f8e37b6fde87b..ade38a6d189eaa 100755
--- a/tests/btrfs/089
+++ b/tests/btrfs/089
@@ -35,7 +35,7 @@ mount --bind "$SCRATCH_MNT/testvol/testdir" "$SCRATCH_MNT/testvol/mnt"
 $BTRFS_UTIL_PROG subvolume delete "$SCRATCH_MNT/testvol" >>$seqres.full 2>&1
 # Unmount the bind mount, which should still be alive.
-$UMOUNT_PROG "$SCRATCH_MNT/testvol/mnt"
+_umount "$SCRATCH_MNT/testvol/mnt"
 echo "Silence is golden"
 status=0
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index af079c2864de8e..19f8bbfc6b922e 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -132,7 +132,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
 echo "Inital sum does not match with data on dev2 written by balance"
 fi
-$UMOUNT_PROG $dev2
+_umount $dev2
 _scratch_dev_pool_put
 _btrfs_rescan_devices
 _test_mount
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index c8c0dd422f72b6..7acef2d38cda46 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -144,7 +144,7 @@ if [ "$checkpoint1" != "$checkpoint3" ]; then
 echo "Inital sum does not match with data on dev2 written by balance"
 fi
-$UMOUNT_PROG $dev2
+_umount $dev2
 _scratch_dev_pool_put
 _btrfs_rescan_devices
 _test_mount
diff --git a/tests/btrfs/185 b/tests/btrfs/185
index 8d0643450f5d7d..c3b52fc2dbff66 100755
--- a/tests/btrfs/185
+++ b/tests/btrfs/185
@@ -15,7 +15,7 @@ mnt=$TEST_DIR/$seq.mnt
 # Override the default cleanup function.
 _cleanup()
 {
- $UMOUNT_PROG $mnt > /dev/null 2>&1
+ _umount $mnt > /dev/null 2>&1
 rm -rf $mnt > /dev/null 2>&1
 cd /
 rm -f $tmp.*
@@ -62,7 +62,7 @@ $BTRFS_UTIL_PROG device scan $device_1 >> $seqres.full 2>&1
 _fail "if it fails here, then it means subvolume mount at boot may fail "\
 "in some configs."
-$UMOUNT_PROG $mnt > /dev/null 2>&1
+_umount $mnt > /dev/null 2>&1
 _scratch_dev_pool_put
 # success, all done
diff --git a/tests/btrfs/197 b/tests/btrfs/197
index 9f1d879a4e267a..913dbb2d3a50ef 100755
--- a/tests/btrfs/197
+++ b/tests/btrfs/197
@@ -15,7 +15,7 @@ _begin_fstest auto quick volume
 # Override the default cleanup function.
 _cleanup()
 {
- $UMOUNT_PROG $TEST_DIR/$seq.mnt >/dev/null 2>&1
+ _umount $TEST_DIR/$seq.mnt >/dev/null 2>&1
 rm -rf $TEST_DIR/$seq.mnt
 cd /
 rm -f $tmp.*
@@ -67,7 +67,7 @@ workout()
 grep -q "${SCRATCH_DEV_NAME[1]}" $tmp.output && _fail "found stale device"
 $BTRFS_UTIL_PROG device remove "${SCRATCH_DEV_NAME[1]}" "$TEST_DIR/$seq.mnt"
- $UMOUNT_PROG $TEST_DIR/$seq.mnt
+ _umount $TEST_DIR/$seq.mnt
 _scratch_unmount
 _spare_dev_put
 _scratch_dev_pool_put
diff --git a/tests/btrfs/219 b/tests/btrfs/219
index 052f61a399ae66..efe5096746652a 100755
--- a/tests/btrfs/219
+++ b/tests/btrfs/219
@@ -21,8 +21,8 @@ _cleanup()
 rm -f $tmp.*
 # The variables are set before the test case can fail.
- $UMOUNT_PROG ${loop_mnt1} &> /dev/null
- $UMOUNT_PROG ${loop_mnt2} &> /dev/null
+ _umount ${loop_mnt1} &> /dev/null
+ _umount ${loop_mnt2} &> /dev/null
 rm -rf $loop_mnt1
 rm -rf $loop_mnt2
@@ -66,7 +66,7 @@ loop_dev2=`_create_loop_device $fs_img2`
 # Normal single device case, should pass just fine
 _mount $loop_dev1 $loop_mnt1 > /dev/null 2>&1 || \
 _fail "Couldn't do initial mount"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 _btrfs_forget_or_module_reload
@@ -75,15 +75,15 @@ _btrfs_forget_or_module_reload
 # measure.
 _mount $loop_dev1 $loop_mnt1 > /dev/null 2>&1 || \
 _fail "Failed to mount the second time"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 _mount $loop_dev2 $loop_mnt2 > /dev/null 2>&1 || \
 _fail "We couldn't mount the old generation"
-$UMOUNT_PROG $loop_mnt2
+_umount $loop_mnt2
 _mount $loop_dev1 $loop_mnt1 > /dev/null 2>&1 || \
 _fail "Failed to mount the second time"
-$UMOUNT_PROG $loop_mnt1
+_umount $loop_mnt1
 # Now try mount them at the same time, if kernel does not support
 # temp-fsid feature then mount will fail.
diff --git a/tests/btrfs/254 b/tests/btrfs/254
index d9c9eea9c7bf23..eda32be1c2b1d1 100755
--- a/tests/btrfs/254
+++ b/tests/btrfs/254
@@ -96,7 +96,7 @@ test_add_device()
 $BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | \
 _filter_btrfs_filesystem_show
- $UMOUNT_PROG $seq_mnt
+ _umount $seq_mnt
 _scratch_unmount
 cleanup_dmdev
 }
diff --git a/tests/ext4/032 b/tests/ext4/032
index 9a1b9312cc42cc..6e98f4f4ebb8de 100755
--- a/tests/ext4/032
+++ b/tests/ext4/032
@@ -63,7 +63,7 @@ ext4_online_resize()
 fi
 cat $tmp.resize2fs >> $seqres.full
 echo "+++ umount fs" | tee -a $seqres.full
- $UMOUNT_PROG ${IMG_MNT}
+ _umount ${IMG_MNT}
 echo "+++ check fs" | tee -a $seqres.full
 _check_generic_filesystem $LOOP_DEVICE >> $seqres.full 2>&1 || \
@@ -77,7 +77,7 @@ _cleanup()
 cd /
 [ -n "$LOOP_DEVICE" ] && _destroy_loop_device $LOOP_DEVICE > /dev/null 2>&1
 rm -f $tmp.*
- $UMOUNT_PROG ${IMG_MNT} > /dev/null 2>&1
+ _umount ${IMG_MNT} > /dev/null 2>&1
 rm -f ${IMG_FILE} > /dev/null 2>&1
 }
diff --git a/tests/ext4/052 b/tests/ext4/052
index edcdc02515f725..ce3f90eb7e6d02 100755
--- a/tests/ext4/052
+++ b/tests/ext4/052
@@ -18,7 +18,7 @@ _cleanup()
 cd /
 rm -r -f $tmp.*
 if [ ! -z "$loop_mnt" ]; then
- $UMOUNT_PROG $loop_mnt
+ _umount $loop_mnt
 rm -rf $loop_mnt
 fi
 [ ! -z "$fs_img" ] && rm -rf $fs_img
@@ -63,7 +63,7 @@ then
 status=1
 fi
-$UMOUNT_PROG $loop_mnt || _fail "umount failed"
+_umount $loop_mnt || _fail "umount failed"
 loop_mnt=
 $E2FSCK_PROG -fn $fs_img >> $seqres.full 2>&1 || _fail "file system corrupted"
diff --git a/tests/ext4/053 b/tests/ext4/053
index 4f20d217d5fd7a..0beb2201260162 100755
--- a/tests/ext4/053
+++ b/tests/ext4/053
@@ -20,7 +20,7 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 if [ -n "$LOOP_LOGDEV" ];then
 _destroy_loop_device $LOOP_LOGDEV 2>/dev/null
 fi
@@ -237,7 +237,7 @@ not_mnt() {
 if simple_mount -o $1 $SCRATCH_DEV $SCRATCH_MNT; then
 print_log "(mount unexpectedly succeeded)"
 fail
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 return
 fi
 ok
@@ -248,7 +248,7 @@ not_mnt() {
 return
 fi
 not_remount $1
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 mnt_only() {
@@ -270,7 +270,7 @@ mnt() {
 fi
 mnt_only $*
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 [ "$t2fs" -eq 0 ] && return
@@ -289,7 +289,7 @@ mnt() {
 -e 's/data=writeback/journal_data_writeback/')
 $TUNE2FS_PROG -o $op_set $SCRATCH_DEV > /dev/null 2>&1
 mnt_only "defaults" $check
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 if [ "$op_set" = ^* ]; then
 op_set=${op_set#^}
 else
@@ -309,12 +309,12 @@ remount() {
 do_mnt remount,$2 $3
 if [ $? -ne 0 ]; then
 fail
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 return
 else
 ok
 fi
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 # Now just specify mnt
 print_log "mounting $fstype \"$1\" "
@@ -328,7 +328,7 @@ remount() {
 ok
 fi
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 }
 # Test that the filesystem cannot be remounted with option(s) $1 (meaning that
@@ -364,7 +364,7 @@ mnt_then_not_remount() {
 return
 fi
 not_remount $2
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
@@ -400,8 +400,8 @@ LOGDEV_DEVNUM=`echo "${majmin%:*}*2^8 + ${majmin#*:}" | bc`
 fstype=
 for fstype in ext2 ext3 ext4; do
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
- $UMOUNT_PROG $SCRATCH_DEV 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_DEV 2> /dev/null
 do_mkfs $SCRATCH_DEV ${SIZE}k
@@ -418,7 +418,7 @@ for fstype in ext2 ext3 ext4; do
 continue
 fi
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 not_mnt failme
 mnt
@@ -552,7 +552,7 @@ for fstype in ext2 ext3 ext4; do
 # dax mount options
 simple_mount -o dax=always $SCRATCH_DEV $SCRATCH_MNT > /dev/null 2>&1
 if [ $? -eq 0 ]; then
- $UMOUNT_PROG $SCRATCH_MNT 2> /dev/null
+ _umount $SCRATCH_MNT 2> /dev/null
 mnt dax
 mnt dax=always
 mnt dax=never
@@ -633,7 +633,7 @@ for fstype in ext2 ext3 ext4; do
 not_remount jqfmt=vfsv1
 not_remount noquota
 mnt_only remount,usrquota,grpquota ^usrquota,^grpquota
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 # test clearing/changing quota when enabled
 do_mkfs -E quotatype=^prjquota $SCRATCH_DEV ${SIZE}k
@@ -654,7 +654,7 @@ for fstype in ext2 ext3 ext4; do
 mnt_only remount,usrquota,grpquota usrquota,grpquota
 quotaoff -f $SCRATCH_MNT >> $seqres.full 2>&1
 mnt_only remount,noquota ^usrquota,^grpquota,quota
- $UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+ _umount $SCRATCH_MNT > /dev/null 2>&1
 # Quota feature
 echo "== Testing quota feature " >> $seqres.full
@@ -696,7 +696,7 @@ for fstype in ext2 ext3 ext4; do
 done #for fstype in ext2 ext3 ext4; do
-$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
+_umount $SCRATCH_MNT > /dev/null 2>&1
 echo "$ERR errors encountered" >> $seqres.full
 status=$ERR
diff --git a/tests/ext4/056 b/tests/ext4/056
index 8a290b11d69772..f9cb690fdfc80b 100755
--- a/tests/ext4/056
+++ b/tests/ext4/056
@@ -70,7 +70,7 @@ do_resize()
 # delay
 sleep 0.2
 _scratch_unmount >> $seqres.full 2>&1 \
- || _fail "$UMOUNT_PROG failed. Exiting"
+ || _fail "_umount failed. Exiting"
 }
 run_test()
diff --git a/tests/generic/042 b/tests/generic/042
index fd0ef705a18c3e..bea23ce29ac327 100755
--- a/tests/generic/042
+++ b/tests/generic/042
@@ -44,7 +44,7 @@ _crashtest()
 _filter_xfs_io
 $here/src/godown -f $mnt
- $UMOUNT_PROG $mnt
+ _umount $mnt
 _mount $img $mnt
 # We should /never/ see 0xCD in the file, because we wrote that pattern
@@ -54,7 +54,7 @@ _crashtest()
 _hexdump $file
 fi
- $UMOUNT_PROG $mnt
+ _umount $mnt
 }
 # Modify as appropriate.
diff --git a/tests/generic/067 b/tests/generic/067
index b6e984f5231753..19ee28d2cd945e 100755
--- a/tests/generic/067
+++ b/tests/generic/067
@@ -66,7 +66,7 @@ umount_symlink_device()
 rm -f $symlink
 echo "# umount symlink to device, which is not mounted" >>$seqres.full
 ln -s $SCRATCH_DEV $symlink
- $UMOUNT_PROG $symlink >>$seqres.full 2>&1
+ _umount $symlink >>$seqres.full 2>&1
 }
 # umount a path name that is 256 bytes long, this should fail gracefully,
@@ -78,7 +78,7 @@ umount_toolong_name()
 _scratch_mount 2>&1 | tee -a $seqres.full
 echo "# umount a too-long name" >>$seqres.full
- $UMOUNT_PROG $longname >>$seqres.full 2>&1
+ _umount $longname >>$seqres.full 2>&1
 _scratch_unmount 2>&1 | tee -a $seqres.full
 }
@@ -93,7 +93,7 @@ lazy_umount_symlink()
 rm -f $symlink
 ln -s $SCRATCH_MNT/testdir $symlink
- $UMOUNT_PROG -l $symlink >>$seqres.full 2>&1
+ _umount -l $symlink >>$seqres.full 2>&1
 # _scratch_unmount should not be blocked
 _scratch_unmount 2>&1 | tee -a $seqres.full
 }
diff --git a/tests/generic/081 b/tests/generic/081
index 468c87ac9a9f0a..57dc07a36395f8 100755
--- a/tests/generic/081
+++ b/tests/generic/081
@@ -32,7 +32,7 @@ _cleanup()
 # other tests to fail.
 while test -e /dev/mapper/$vgname-$snapname || \
 test -e /dev/mapper/$vgname-$lvname; do
- $UMOUNT_PROG $mnt >> $seqres.full 2>&1
+ _umount $mnt >> $seqres.full 2>&1
 $LVM_PROG lvremove -f $vgname/$snapname >>$seqres.full 2>&1
 $LVM_PROG lvremove -f $vgname/$lvname >>$seqres.full 2>&1
 $LVM_PROG vgremove -f $vgname >>$seqres.full 2>&1
diff --git a/tests/generic/085 b/tests/generic/085
index cbabd257cad8f0..8c33386b7c383e 100755
--- a/tests/generic/085
+++ b/tests/generic/085
@@ -27,7 +27,7 @@ cleanup_dmdev()
 $DMSETUP_PROG resume $lvdev >/dev/null 2>&1
 [ -n "$pid" ] && kill -9 $pid 2>/dev/null
 wait $pid
- $UMOUNT_PROG $lvdev >/dev/null 2>&1
+ _umount $lvdev >/dev/null 2>&1
 _dmsetup_remove $node
 }
@@ -70,7 +70,7 @@ done &
 pid=$!
 for ((i=0; i<100; i++)); do
 _mount $lvdev $SCRATCH_MNT >/dev/null 2>&1
- $UMOUNT_PROG $lvdev >/dev/null 2>&1
+ _umount $lvdev >/dev/null 2>&1
 done &
 pid="$pid $!"
diff --git a/tests/generic/108 b/tests/generic/108
index da13715f27ac21..e1df7ee1886cde 100755
--- a/tests/generic/108
+++ b/tests/generic/108
@@ -18,7 +18,7 @@ _cleanup()
 {
 cd /
 echo running > /sys/block/`_short_dev $SCSI_DEBUG_DEV`/device/state
- $UMOUNT_PROG $SCRATCH_MNT >>$seqres.full 2>&1
+ _umount $SCRATCH_MNT >>$seqres.full 2>&1
 $LVM_PROG vgremove -f $vgname >>$seqres.full 2>&1
 $LVM_PROG pvremove -f $SCRATCH_DEV $SCSI_DEBUG_DEV >>$seqres.full 2>&1
 $UDEV_SETTLE_PROG
diff --git a/tests/generic/361 b/tests/generic/361
index c2ebda3c1a01ad..456271b8d80308 100755
--- a/tests/generic/361
+++ b/tests/generic/361
@@ -16,7 +16,7 @@ _begin_fstest auto quick
 # Override the default cleanup function.
 _cleanup()
 {
- $UMOUNT_PROG $fs_mnt
+ _umount $fs_mnt
 _destroy_loop_device $loop_dev
 cd /
 rm -f $tmp.*
diff --git a/tests/generic/373 b/tests/generic/373
index 0d5a50cbee40b8..6ede189ead70bd 100755
--- a/tests/generic/373
+++ b/tests/generic/373
@@ -60,7 +60,7 @@ md5sum $testdir/file | _filter_scratch
 md5sum $othertestdir/otherfile | filter_otherdir
 echo "Unmount otherdir"
-$UMOUNT_PROG $otherdir
+_umount $otherdir
 rm -rf $otherdir
 # success, all done
diff --git a/tests/generic/374 b/tests/generic/374
index 977a2b268bbc98..bbdd8e66b4897b 100755
--- a/tests/generic/374
+++ b/tests/generic/374
@@ -59,7 +59,7 @@ echo "Check output"
 md5sum $testdir/file $othertestdir/otherfile | filter_md5
 echo "Unmount otherdir"
-$UMOUNT_PROG $otherdir
+_umount $otherdir
 rm -rf $otherdir
 # success, all done
diff --git a/tests/generic/459 b/tests/generic/459
index 32ee899f929819..e8799f75bf8e05 100755
--- a/tests/generic/459
+++ b/tests/generic/459
@@ -28,7 +28,7 @@ _cleanup()
 xfs_freeze -u $SCRATCH_MNT 2>/dev/null
 cd /
 rm -f $tmp.*
- $UMOUNT_PROG $SCRATCH_MNT >>$seqres.full 2>&1
+ _umount $SCRATCH_MNT >>$seqres.full 2>&1
 $LVM_PROG vgremove -ff $vgname >>$seqres.full 2>&1
 $LVM_PROG pvremove -ff $SCRATCH_DEV >>$seqres.full 2>&1
 $UDEV_SETTLE_PROG
diff --git a/tests/generic/604 b/tests/generic/604
index c2e03c2eabb871..124eea853ecf70 100755
--- a/tests/generic/604
+++ b/tests/generic/604
@@ -26,6 +26,8 @@ done
 # mount the base fs. Delay the mount attempt by a small amount in the hope
 # that the mount() call will try to lock s_umount /after/ umount has already
 # taken it.
+# This is the /one/ place in fstests where we need to call the umount binary
+# directly.
 $UMOUNT_PROG $SCRATCH_MNT &
 sleep 0.01s ; _scratch_mount
 wait
diff --git a/tests/generic/648 b/tests/generic/648
index 29d1b470bded4a..3e995a02983931 100755
--- a/tests/generic/648
+++ b/tests/generic/648
@@ -20,7 +20,7 @@ _cleanup()
 $KILLALL_PROG -9 fsstress > /dev/null 2>&1
 wait
 if [ -n "$loopmnt" ]; then
- $UMOUNT_PROG $loopmnt 2>/dev/null
+ _umount $loopmnt 2>/dev/null
 rm -r -f $loopmnt
 fi
 rm -f $tmp.*
@@ -111,7 +111,7 @@ while _soak_loop_running $((25 * TIME_FACTOR)); do
 # Mount again to replay log after loading working table, so we have a
 # consistent fs after test.
- $UMOUNT_PROG $loopmnt
+ _umount $loopmnt
 is_unmounted=1
 # We must unmount dmerror at here, or whole later testing will crash.
 # So try to umount enough times, before we have no choice.
@@ -137,7 +137,7 @@ done
 # Make sure the fs image file is ok
 if [ -f "$loopimg" ]; then
 if _mount $loopimg $loopmnt -o loop; then
- $UMOUNT_PROG $loopmnt &> /dev/null
+ _umount $loopmnt &> /dev/null
 else
 _metadump_dev $DMERROR_DEV $seqres.scratch.final.md
 echo "final scratch mount failed"
diff --git a/tests/generic/698 b/tests/generic/698
index 28928b2fb32532..f432837a216f82 100755
--- a/tests/generic/698
+++ b/tests/generic/698
@@ -17,8 +17,8 @@ _begin_fstest auto quick perms attr idmapped mount
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $SCRATCH_MNT/target-mnt 2>/dev/null
- $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+ _umount $SCRATCH_MNT/target-mnt 2>/dev/null
+ _umount $SCRATCH_MNT 2>/dev/null
 rm -r -f $tmp.*
 }
diff --git a/tests/generic/699 b/tests/generic/699
index 677307538a484b..5cff1cbaa67c4e 100755
--- a/tests/generic/699
+++ b/tests/generic/699
@@ -15,9 +15,9 @@ _begin_fstest auto quick perms attr idmapped mount
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $SCRATCH_MNT/target-mnt
- $UMOUNT_PROG $SCRATCH_MNT/ovl-merge 2>/dev/null
- $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+ _umount $SCRATCH_MNT/target-mnt
+ _umount $SCRATCH_MNT/ovl-merge 2>/dev/null
+ _umount $SCRATCH_MNT 2>/dev/null
 rm -r -f $tmp.*
 }
@@ -113,7 +113,7 @@ setup_overlayfs_idmapped_lower_metacopy_on()
 reset_overlayfs()
 {
- $UMOUNT_PROG $SCRATCH_MNT/ovl-merge 2>/dev/null
+ _umount $SCRATCH_MNT/ovl-merge 2>/dev/null
 rm -rf $upper $work $merge
 }
diff --git a/tests/generic/704 b/tests/generic/704
index f39d47066ccc4a..31d52a97b37f9d 100755
--- a/tests/generic/704
+++ b/tests/generic/704
@@ -14,7 +14,7 @@ _cleanup()
 {
 cd /
 rm -r -f $tmp.*
- [ -d "$SCSI_DEBUG_MNT" ] && $UMOUNT_PROG $SCSI_DEBUG_MNT 2>/dev/null
+ [ -d "$SCSI_DEBUG_MNT" ] && _umount $SCSI_DEBUG_MNT 2>/dev/null
 _put_scsi_debug_dev
 }
diff --git a/tests/generic/730 b/tests/generic/730
index 062314ea01e7b5..650c604d5fbefd 100755
--- a/tests/generic/730
+++ b/tests/generic/730
@@ -12,7 +12,7 @@ _begin_fstest auto quick
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $SCSI_DEBUG_MNT >>$seqres.full 2>&1
+ _umount $SCSI_DEBUG_MNT >>$seqres.full 2>&1
 _put_scsi_debug_dev
 rm -f $tmp.*
 }
diff --git a/tests/generic/731 b/tests/generic/731
index cd39e8b09e3906..2621f6e237741d 100755
--- a/tests/generic/731
+++ b/tests/generic/731
@@ -13,7 +13,7 @@ _begin_fstest auto quick
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $SCSI_DEBUG_MNT >>$seqres.full 2>&1
+ _umount $SCSI_DEBUG_MNT >>$seqres.full 2>&1
 _put_scsi_debug_dev
 rm -f $tmp.*
 }
diff --git a/tests/generic/732 b/tests/generic/732
index d08028c2333d1b..63406ddc163f2c 100755
--- a/tests/generic/732
+++ b/tests/generic/732
@@ -15,8 +15,8 @@ _begin_fstest auto quick rename
 # Override the default cleanup function.
 _cleanup()
 {
- $UMOUNT_PROG $testdir1 2>/dev/null
- $UMOUNT_PROG $testdir2 2>/dev/null
+ _umount $testdir1 2>/dev/null
+ _umount $testdir2 2>/dev/null
 cd /
 rm -r -f $tmp.*
 }
diff --git a/tests/generic/746 b/tests/generic/746
index 651affe07b40bc..2b40c964371175 100755
--- a/tests/generic/746
+++ b/tests/generic/746
@@ -38,7 +38,7 @@ esac
 # Override the default cleanup function.
 _cleanup()
 {
- $UMOUNT_PROG $loop_dev &> /dev/null
+ _umount $loop_dev &> /dev/null
 _destroy_loop_device $loop_dev
 if [ $status -eq 0 ]; then
 rm -rf $tmp
@@ -53,7 +53,7 @@ get_holes()
 # in-core state that will perturb the free space map on umount. Stick
 # to established convention which requires the filesystem to be
 # unmounted while we probe the underlying file.
- $UMOUNT_PROG $loop_mnt
+ _umount $loop_mnt
 # FIEMAP only works on regular files, so call it on the backing file
 # and not the loop device like everything else
@@ -66,7 +66,7 @@ get_free_sectors()
 {
 case $FSTYP in
 ext4)
- $UMOUNT_PROG $loop_mnt
+ _umount $loop_mnt
 $DUMPE2FS_PROG $loop_dev 2>&1 | grep " Free blocks" | cut -d ":" -f2- | \
 tr ',' '\n' | $SED_PROG 's/^ //' | \
 $AWK_PROG -v spb=$sectors_per_block 'BEGIN{FS="-"};
@@ -80,7 +80,7 @@ get_free_sectors()
 xfs)
 agsize=`$XFS_INFO_PROG $loop_mnt | $SED_PROG -n 's/.*agsize=\(.*\) blks.*/\1/p'`
 # Convert free space (agno, block, length) to (start sector, end sector)
- $UMOUNT_PROG $loop_mnt
+ _umount $loop_mnt
 $XFS_DB_PROG -r -c "freesp -d" $loop_dev | $SED_PROG '/^.*from/,$d'| \
 $AWK_PROG -v spb=$sectors_per_block -v agsize=$agsize \
 '{ print spb * ($1 * agsize + $2), spb * ($1 * agsize + $2 + $3) - 1 }'
diff --git a/tests/overlay/003 b/tests/overlay/003
index 41ad99e794d8ee..0a2cb928ea5c58 100755
--- a/tests/overlay/003
+++ b/tests/overlay/003
@@ -56,7 +56,7 @@ rm -rf ${SCRATCH_MNT}/*
 ls ${SCRATCH_MNT}/
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 echo "Silence is golden"
 # success, all done
diff --git a/tests/overlay/004 b/tests/overlay/004
index bea4bb543f3611..4591d4e8487ce2 100755
--- a/tests/overlay/004
+++ b/tests/overlay/004
@@ -53,7 +53,7 @@ _user_do "chmod u-X ${SCRATCH_MNT}/attr_file2 > /dev/null 2>&1"
 stat -c %a ${SCRATCH_MNT}/attr_file2
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # check mode bits of the file that has been copied up, and
 # the file that should not have been copied up.
diff --git a/tests/overlay/005 b/tests/overlay/005
index 01914ee17b9a30..6b382ddb50d873 100755
--- a/tests/overlay/005
+++ b/tests/overlay/005
@@ -75,14 +75,14 @@ $XFS_IO_PROG -f -c "o" ${SCRATCH_MNT}/test_file \
 >>$seqres.full 2>&1
 # unmount overlayfs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # check overlayfs
 _overlay_check_scratch_dirs $lowerd $upperd $workd
 # unmount undelying xfs, this tiggers panic if memleak happens
-$UMOUNT_PROG ${OVL_BASE_SCRATCH_MNT}/uppermnt
-$UMOUNT_PROG ${OVL_BASE_SCRATCH_MNT}/lowermnt
+_umount ${OVL_BASE_SCRATCH_MNT}/uppermnt
+_umount ${OVL_BASE_SCRATCH_MNT}/lowermnt
 # success, all done
 echo "Silence is golden"
diff --git a/tests/overlay/014 b/tests/overlay/014
index f07fc685572b92..08850d489e4b49 100755
--- a/tests/overlay/014
+++ b/tests/overlay/014
@@ -46,7 +46,7 @@ _overlay_scratch_mount_dirs $lowerdir1 $lowerdir2 $workdir2
 rm -rf $SCRATCH_MNT/testdir
 mkdir -p $SCRATCH_MNT/testdir/visibledir
 # unmount overlayfs but not base fs
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # check overlayfs
 _overlay_check_scratch_dirs $lowerdir1 $lowerdir2 $workdir2
@@ -59,7 +59,7 @@ touch $SCRATCH_MNT/testdir/visiblefile
 # umount and mount overlay again, buggy kernel treats the copied-up dir as
 # opaque, visibledir is not seen in merged dir.
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 _overlay_scratch_mount_dirs "$lowerdir2:$lowerdir1" $upperdir $workdir
 ls $SCRATCH_MNT/testdir
diff --git a/tests/overlay/022 b/tests/overlay/022
index d33bd29781a356..40b0dd64f6fc6c 100755
--- a/tests/overlay/022
+++ b/tests/overlay/022
@@ -17,7 +17,7 @@ _begin_fstest auto quick mount nested
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $tmp/mnt > /dev/null 2>&1
+ _umount $tmp/mnt > /dev/null 2>&1
 rm -rf $tmp
 rm -f $tmp.*
 }
diff --git a/tests/overlay/025 b/tests/overlay/025
index 6ba46191b557be..0abc8bf80b1716 100755
--- a/tests/overlay/025
+++ b/tests/overlay/025
@@ -19,8 +19,8 @@ _begin_fstest auto quick attr
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $tmpfsdir/mnt
- $UMOUNT_PROG $tmpfsdir
+ _umount $tmpfsdir/mnt
+ _umount $tmpfsdir
 rm -rf $tmpfsdir
 rm -f $tmp.*
 }
diff --git a/tests/overlay/029 b/tests/overlay/029
index 4bade9a0e129a4..007973dc075923 100755
--- a/tests/overlay/029
+++ b/tests/overlay/029
@@ -22,7 +22,7 @@ _begin_fstest auto quick nested
 _cleanup()
 {
 cd /
- $UMOUNT_PROG $tmp/mnt
+ _umount $tmp/mnt
 rm -rf $tmp
 rm -f $tmp.*
 }
@@ -56,7 +56,7 @@ _overlay_mount_dirs $SCRATCH_MNT/up $tmp/{upper,work} \
 overlay $tmp/mnt
 # accessing file in the second mount
 cat $tmp/mnt/foo
-$UMOUNT_PROG $tmp/mnt
+_umount $tmp/mnt
 # re-create upper/work to avoid ovl_verify_origin() mount failure
 # when index is enabled
@@ -66,7 +66,7 @@ mkdir -p $tmp/{upper,work}
 _overlay_mount_dirs $SCRATCH_MNT/low $tmp/{upper,work} \
 overlay $tmp/mnt
 cat $tmp/mnt/bar
-$UMOUNT_PROG $tmp/mnt
+_umount $tmp/mnt
 rm -rf $tmp/{upper,work}
 mkdir -p $tmp/{upper,work}
diff --git a/tests/overlay/031 b/tests/overlay/031
index dd9dfcdb970ac7..31d22d1cadae41 100755
--- a/tests/overlay/031
+++ b/tests/overlay/031
@@ -28,7 +28,7 @@ create_whiteout()
 rm -f $SCRATCH_MNT/testdir/$file
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 # Import common functions.
@@ -68,7 +68,7 @@ rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
 # umount overlay again, create a new file with the same name and
 # mount overlay again.
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 touch $lowerdir1/testdir
 _overlay_scratch_mount_dirs $lowerdir1 $upperdir $workdir
@@ -77,7 +77,7 @@ _overlay_scratch_mount_dirs $lowerdir1 $upperdir $workdir
 # it will not clean up the dir and lead to residue.
 rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # let lower dir have invalid whiteouts, repeat ls and rmdir test again.
 rm -rf $lowerdir1/testdir
@@ -92,7 +92,7 @@ _overlay_scratch_mount_dirs "$lowerdir1:$lowerdir2" $upperdir $workdir
 ls $SCRATCH_MNT/testdir
 rm -rf $SCRATCH_MNT/testdir 2>&1 | _filter_scratch
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # let lower dir and upper dir both have invalid whiteouts, repeat ls and rmdir again.
 rm -rf $lowerdir1/testdir
diff --git a/tests/overlay/035 b/tests/overlay/035
index cede58790e1b9d..c6ce1318fbbb37 100755
--- a/tests/overlay/035
+++ b/tests/overlay/035
@@ -43,7 +43,7 @@ mkdir -p $lowerdir1 $lowerdir2 $upperdir $workdir
 _overlay_scratch_mount_opts -o"lowerdir=$lowerdir2:$lowerdir1"
 touch $SCRATCH_MNT/foo 2>&1 | _filter_scratch
 _mount -o remount,rw $SCRATCH_MNT 2>&1 | _filter_ro_mount
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # Make workdir immutable to prevent workdir re-create on mount
 $CHATTR_PROG +i $workdir
diff --git a/tests/overlay/036 b/tests/overlay/036
index 19a181bbdd9361..f902617d4ab0a2 100755
--- a/tests/overlay/036
+++ b/tests/overlay/036
@@ -34,8 +34,8 @@ _cleanup()
 cd /
 rm -f $tmp.*
 # unmount the two extra mounts in case they did not fail
- $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
- $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+ _umount $SCRATCH_MNT 2>/dev/null
+ _umount $SCRATCH_MNT 2>/dev/null
 }
 # Import common functions.
@@ -66,13 +66,13 @@ _overlay_mount_dirs $lowerdir $upperdir $workdir \
 # with index=off - expect success
 _overlay_mount_dirs $lowerdir $upperdir $workdir2 \
 overlay0 $SCRATCH_MNT -oindex=off && \
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 # Try to mount another overlay with the same workdir
 # with index=off - expect success
 _overlay_mount_dirs $lowerdir2 $upperdir2 $workdir \
 overlay1 $SCRATCH_MNT -oindex=off && \
- $UMOUNT_PROG $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 # Try to mount another overlay with the same upperdir
 # with index=on - expect EBUSY
diff --git a/tests/overlay/037 b/tests/overlay/037
index 834e176380ebea..c278e7cab1fe05 100755
--- a/tests/overlay/037
+++ b/tests/overlay/037
@@ -39,17 +39,17 @@ mkdir -p $lowerdir $lowerdir2 $upperdir $upperdir2 $workdir
 # Mount overlay with lowerdir, upperdir, workdir and index=on
 # to store the file handles of lowerdir and upperdir in overlay.origin xattr
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -oindex=on
-$UMOUNT_PROG $SCRATCH_MNT
+_umount $SCRATCH_MNT
 # Try to mount an overlay with the same upperdir and different lowerdir - expect ESTALE
 _overlay_scratch_mount_dirs $lowerdir2 $upperdir $workdir -oindex=on \
 2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 # Try to mount an overlay with the same workdir and different upperdir - expect ESTALE
 _overlay_scratch_mount_dirs $lowerdir $upperdir2 $workdir -oindex=on \
 2>&1 | _filter_error_mount
-$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null
+_umount $SCRATCH_MNT 2>/dev/null
 # Mount overlay with original lowerdir, upperdir, workdir and index=on - expect success
 _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -oindex=on
diff --git a/tests/overlay/040 b/tests/overlay/040
index 11c7bf129a3626..47f50eb0638da0 100755
--- a/tests/overlay/040
+++ b/tests/overlay/040
@@ -48,7 +48,7 @@ _scratch_mount
 # modify lower origin file.
$CHATTR_PROG +i $SCRATCH_MNT/foo > /dev/null 2>&1 -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # touching origin file in lower, should succeed touch $lowerdir/foo diff --git a/tests/overlay/041 b/tests/overlay/041 index 36491b8fa0edf6..52ca351b66d86c 100755 --- a/tests/overlay/041 +++ b/tests/overlay/041 @@ -142,7 +142,7 @@ subdir_d=$($here/src/t_dir_type $pure_lower_dir $pure_lower_subdir_st_ino) [[ $subdir_d == "subdir d" ]] || \ echo "Merged dir: Invalid d_ino reported for subdir" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # check overlayfs _overlay_check_scratch_dirs $lowerdir $upperdir $workdir -o xino=on diff --git a/tests/overlay/042 b/tests/overlay/042 index aaa10da33e0249..ddd4173abee8ce 100755 --- a/tests/overlay/042 +++ b/tests/overlay/042 @@ -45,7 +45,7 @@ _scratch_mount -o index=off # Copy up lower and create upper hardlink with no index ln $SCRATCH_MNT/0 $SCRATCH_MNT/1 -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Add lower hardlinks while overlay is offline ln $lowerdir/0 $lowerdir/2 diff --git a/tests/overlay/043 b/tests/overlay/043 index 7325c653ab5cab..15cb9bf4bafaca 100755 --- a/tests/overlay/043 +++ b/tests/overlay/043 @@ -126,7 +126,7 @@ echo 3 > /proc/sys/vm/drop_caches check_inode_numbers $testdir $tmp.after_copyup $tmp.after_move # Verify that the inode numbers survive a mount cycle -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -o redirect_dir=on,xino=on # Compare inode numbers before/after mount cycle diff --git a/tests/overlay/044 b/tests/overlay/044 index 4d04d883efd695..5f09cc31c32a1e 100755 --- a/tests/overlay/044 +++ b/tests/overlay/044 @@ -99,7 +99,7 @@ cat $FILES check_ino_nlink $SCRATCH_MNT $tmp.before $tmp.after_one # Verify that the hardlinks survive a mount cycle -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT _overlay_check_scratch_dirs $lowerdir $upperdir $workdir -o index=on,xino=on _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir -o 
index=on,xino=on diff --git a/tests/overlay/048 b/tests/overlay/048 index 897e797e2ff549..4bd9753666bf6c 100755 --- a/tests/overlay/048 +++ b/tests/overlay/048 @@ -32,7 +32,7 @@ report_nlink() _ls_l $SCRATCH_MNT/$f | awk '{ print $2, $9 }' | _filter_scratch done - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Create lower hardlinks @@ -101,7 +101,7 @@ touch $SCRATCH_MNT/1 touch $SCRATCH_MNT/2 # Perform the rest of the changes offline -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT test_hardlinks_offline diff --git a/tests/overlay/049 b/tests/overlay/049 index 3ee500c5dd13b8..b091330ea26e2c 100755 --- a/tests/overlay/049 +++ b/tests/overlay/049 @@ -32,7 +32,7 @@ create_redirect() touch $SCRATCH_MNT/origin/file mv $SCRATCH_MNT/origin $SCRATCH_MNT/$redirect - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Import common functions. diff --git a/tests/overlay/050 b/tests/overlay/050 index ec936e2a758f81..7c8ed1a4e96e8c 100755 --- a/tests/overlay/050 +++ b/tests/overlay/050 @@ -76,7 +76,7 @@ mount_dirs() # Unmount the overlay without unmounting base fs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Check non-stale file handles of lower/upper files and verify diff --git a/tests/overlay/051 b/tests/overlay/051 index 9404dbbab90f15..2dadb5a3027180 100755 --- a/tests/overlay/051 +++ b/tests/overlay/051 @@ -28,7 +28,7 @@ _cleanup() # Cleanup overlay scratch mount that is holding base test mount # to prevent _check_test_fs and _test_umount from failing before # _check_scratch_fs _scratch_umount - $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null + _umount $SCRATCH_MNT 2>/dev/null } # Import common functions. 
@@ -103,7 +103,7 @@ mount_dirs() # underlying dirs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_scratch_dirs $middle:$lower $upper $work \ -o "index=on,nfs_export=on" diff --git a/tests/overlay/052 b/tests/overlay/052 index 37402067dbe65e..e3366ea44147cb 100755 --- a/tests/overlay/052 +++ b/tests/overlay/052 @@ -73,7 +73,7 @@ mount_dirs() # Unmount the overlay without unmounting base fs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Check non-stale file handles of lower/upper moved files diff --git a/tests/overlay/053 b/tests/overlay/053 index f7891aceda7246..87f748cefd3338 100755 --- a/tests/overlay/053 +++ b/tests/overlay/053 @@ -30,7 +30,7 @@ _cleanup() # Cleanup overlay scratch mount that is holding base test mount # to prevent _check_test_fs and _test_umount from failing before # _check_scratch_fs _scratch_umount - $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null + _umount $SCRATCH_MNT 2>/dev/null } # Import common functions. @@ -99,7 +99,7 @@ mount_dirs() # underlying dirs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_scratch_dirs $middle:$lower $upper $work \ -o "index=on,nfs_export=on,redirect_dir=on" diff --git a/tests/overlay/054 b/tests/overlay/054 index 8d7f026a2d9b00..566d266a1ad788 100755 --- a/tests/overlay/054 +++ b/tests/overlay/054 @@ -87,7 +87,7 @@ mount_dirs() # Unmount the overlay without unmounting base fs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Check encode/decode/read file handles of dir with non-indexed ancestor diff --git a/tests/overlay/055 b/tests/overlay/055 index 87a348c94489b8..a5b169956f4c09 100755 --- a/tests/overlay/055 +++ b/tests/overlay/055 @@ -37,7 +37,7 @@ _cleanup() # Cleanup overlay scratch mount that is holding base test mount # to prevent _check_test_fs and _test_umount from failing before # _check_scratch_fs _scratch_umount - $UMOUNT_PROG $SCRATCH_MNT 2>/dev/null + _umount $SCRATCH_MNT 2>/dev/null } # 
Import common functions. @@ -109,7 +109,7 @@ mount_dirs() # underlying dirs unmount_dirs() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_scratch_dirs $middle:$lower $upper $work \ -o "index=on,nfs_export=on,redirect_dir=on" diff --git a/tests/overlay/056 b/tests/overlay/056 index 158f34d05c22e9..01c319d7263f3c 100755 --- a/tests/overlay/056 +++ b/tests/overlay/056 @@ -73,7 +73,7 @@ mkdir $lowerdir/testdir2/subdir _overlay_scratch_mount_dirs $lowerdir $upperdir $workdir touch $SCRATCH_MNT/testdir1/foo touch $SCRATCH_MNT/testdir2/subdir -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT remove_impure $upperdir/testdir1 remove_impure $upperdir/testdir2 diff --git a/tests/overlay/057 b/tests/overlay/057 index da7ffda30277d9..b631d431a37b47 100755 --- a/tests/overlay/057 +++ b/tests/overlay/057 @@ -48,7 +48,7 @@ _overlay_scratch_mount_dirs $lowerdir $lowerdir2 $workdir2 -o redirect_dir=on # Create opaque parent with absolute redirect child in middle layer mkdir $SCRATCH_MNT/pure mv $SCRATCH_MNT/origin $SCRATCH_MNT/pure/redirect -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT _overlay_scratch_mount_dirs $lowerdir2:$lowerdir $upperdir $workdir -o redirect_dir=on mv $SCRATCH_MNT/pure/redirect $SCRATCH_MNT/redirect # List content of renamed merge dir before mount cycle @@ -56,7 +56,7 @@ ls $SCRATCH_MNT/redirect/ # Verify that redirects are followed by listing content of renamed merge dir # after mount cycle -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT _overlay_scratch_mount_dirs $lowerdir2:$lowerdir $upperdir $workdir -o redirect_dir=on ls $SCRATCH_MNT/redirect/ diff --git a/tests/overlay/059 b/tests/overlay/059 index c48d2a82c76ec4..84b5c80eb984de 100755 --- a/tests/overlay/059 +++ b/tests/overlay/059 @@ -33,7 +33,7 @@ create_origin_ref() _scratch_mount -o redirect_dir=on mv $SCRATCH_MNT/origin $SCRATCH_MNT/$ref - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Import common functions. 
diff --git a/tests/overlay/060 b/tests/overlay/060 index bb61fcfa644342..3d0ea353feaa9a 100755 --- a/tests/overlay/060 +++ b/tests/overlay/060 @@ -130,7 +130,7 @@ mount_ro_overlay() umount_overlay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } # Assumes it is called with overlay mounted. diff --git a/tests/overlay/062 b/tests/overlay/062 index 9a1db7419c4ca2..97a1bd8c12f20e 100755 --- a/tests/overlay/062 +++ b/tests/overlay/062 @@ -18,7 +18,7 @@ _cleanup() { cd / rm -f $tmp.* - $UMOUNT_PROG $lowertestdir + _umount $lowertestdir } # Import common functions. diff --git a/tests/overlay/063 b/tests/overlay/063 index d9f30606a92d44..a50e63665202f0 100755 --- a/tests/overlay/063 +++ b/tests/overlay/063 @@ -40,7 +40,7 @@ rm ${upperdir}/file mkdir ${SCRATCH_MNT}/file > /dev/null 2>&1 # unmount overlayfs -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT echo "Silence is golden" # success, all done diff --git a/tests/overlay/065 b/tests/overlay/065 index fb6d6dd1bfcc0e..26f1c4bde4da90 100755 --- a/tests/overlay/065 +++ b/tests/overlay/065 @@ -30,7 +30,7 @@ _cleanup() { cd / rm -f $tmp.* - $UMOUNT_PROG $mnt2 2>/dev/null + _umount $mnt2 2>/dev/null } # Import common functions. 
@@ -63,7 +63,7 @@ mkdir -p $lowerdir/lower $upperdir $workdir echo Conflicting upperdir/lowerdir _overlay_scratch_mount_dirs $upperdir $upperdir $workdir \ 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null # Use new upper/work dirs for each test to avoid ESTALE errors # on mismatch lowerdir/upperdir (see test overlay/037) @@ -75,7 +75,7 @@ mkdir $upperdir $workdir echo Conflicting workdir/lowerdir _overlay_scratch_mount_dirs $workdir $upperdir $workdir \ -oindex=off 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir $workdir mkdir -p $upperdir/lower $workdir @@ -85,7 +85,7 @@ mkdir -p $upperdir/lower $workdir echo Overlapping upperdir/lowerdir _overlay_scratch_mount_dirs $upperdir/lower $upperdir $workdir \ 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir $workdir mkdir $upperdir $workdir @@ -94,7 +94,7 @@ mkdir $upperdir $workdir echo Conflicting lower layers _overlay_scratch_mount_dirs $lowerdir:$lowerdir $upperdir $workdir \ 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir $workdir mkdir $upperdir $workdir @@ -103,7 +103,7 @@ mkdir $upperdir $workdir echo Overlapping lower layers below _overlay_scratch_mount_dirs $lowerdir:$lowerdir/lower $upperdir $workdir \ 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir $workdir mkdir $upperdir $workdir @@ -112,7 +112,7 @@ mkdir $upperdir $workdir echo Overlapping lower layers above _overlay_scratch_mount_dirs $lowerdir/lower:$lowerdir $upperdir $workdir \ 2>&1 | _filter_error_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir $workdir mkdir -p $upperdir/upper $workdir $mnt2 @@ -129,14 +129,14 @@ mkdir -p $upperdir2 $workdir2 $mnt2 echo "Overlapping with upperdir of 
another instance (index=on)" _overlay_scratch_mount_dirs $upperdir/upper $upperdir2 $workdir2 \ -oindex=on 2>&1 | _filter_busy_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir2 $workdir2 mkdir -p $upperdir2 $workdir2 echo "Overlapping with upperdir of another instance (index=off)" _overlay_scratch_mount_dirs $upperdir/upper $upperdir2 $workdir2 \ - -oindex=off && $UMOUNT_PROG $SCRATCH_MNT + -oindex=off && _umount $SCRATCH_MNT rm -rf $upperdir2 $workdir2 mkdir -p $upperdir2 $workdir2 @@ -146,14 +146,14 @@ mkdir -p $upperdir2 $workdir2 echo "Overlapping with workdir of another instance (index=on)" _overlay_scratch_mount_dirs $workdir/work $upperdir2 $workdir2 \ -oindex=on 2>&1 | _filter_busy_mount -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null rm -rf $upperdir2 $workdir2 mkdir -p $upperdir2 $workdir2 echo "Overlapping with workdir of another instance (index=off)" _overlay_scratch_mount_dirs $workdir/work $upperdir2 $workdir2 \ - -oindex=off && $UMOUNT_PROG $SCRATCH_MNT + -oindex=off && _umount $SCRATCH_MNT # Move upper layer root into lower layer after mount echo Overlapping upperdir/lowerdir after mount diff --git a/tests/overlay/067 b/tests/overlay/067 index bb09a6042b275d..12a1781c149644 100755 --- a/tests/overlay/067 +++ b/tests/overlay/067 @@ -70,7 +70,7 @@ stat $testfile >>$seqres.full diff -q $realfile $testfile >>$seqres.full && echo "diff with middle layer file doesn't know right from wrong! (cold cache)" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # check overlayfs _overlay_check_scratch_dirs $middle:$lower $upper $work -o xino=off diff --git a/tests/overlay/068 b/tests/overlay/068 index 0d33cf12de8550..480ba67e33ea74 100755 --- a/tests/overlay/068 +++ b/tests/overlay/068 @@ -28,7 +28,7 @@ _cleanup() cd / rm -f $tmp.* # Unmount the nested overlay mount - $UMOUNT_PROG $mnt2 2>/dev/null + _umount $mnt2 2>/dev/null } # Import common functions. 
@@ -100,7 +100,7 @@ mount_dirs() unmount_dirs() { # unmount & check nested overlay - $UMOUNT_PROG $mnt2 + _umount $mnt2 _overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \ -o "index=on,nfs_export=on,redirect_dir=on" diff --git a/tests/overlay/069 b/tests/overlay/069 index 373ab1ee3dc115..67969eebbfcaa3 100755 --- a/tests/overlay/069 +++ b/tests/overlay/069 @@ -28,7 +28,7 @@ _cleanup() cd / rm -f $tmp.* # Unmount the nested overlay mount - $UMOUNT_PROG $mnt2 2>/dev/null + _umount $mnt2 2>/dev/null } # Import common functions. @@ -108,12 +108,12 @@ mount_dirs() unmount_dirs() { # unmount & check nested overlay - $UMOUNT_PROG $mnt2 + _umount $mnt2 _overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \ -o "index=on,nfs_export=on,redirect_dir=on" # unmount & check underlying overlay - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_dirs $lower $upper $work \ -o "index=on,nfs_export=on,redirect_dir=on" } diff --git a/tests/overlay/070 b/tests/overlay/070 index 36991229f28fe7..104b5f492088d6 100755 --- a/tests/overlay/070 +++ b/tests/overlay/070 @@ -26,7 +26,7 @@ _cleanup() cd / rm -f $tmp.* # Unmount the nested overlay mount - $UMOUNT_PROG $mnt2 2>/dev/null + _umount $mnt2 2>/dev/null [ -z "$loopdev" ] || _destroy_loop_device $loopdev } @@ -93,12 +93,12 @@ mount_dirs() unmount_dirs() { # unmount & check nested overlay - $UMOUNT_PROG $mnt2 + _umount $mnt2 _overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \ -o "redirect_dir=on,index=on,xino=on" # unmount & check underlying overlay - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_scratch_dirs $lower $upper $work \ -o "index=on,nfs_export=on" } diff --git a/tests/overlay/071 b/tests/overlay/071 index 2a6313142d09d2..c58347f6cdb1c6 100755 --- a/tests/overlay/071 +++ b/tests/overlay/071 @@ -29,7 +29,7 @@ _cleanup() cd / rm -f $tmp.* # Unmount the nested overlay mount - $UMOUNT_PROG $mnt2 2>/dev/null + _umount $mnt2 2>/dev/null [ -z "$loopdev" ] || _destroy_loop_device $loopdev } @@ -103,12 +103,12 @@ 
mount_dirs() unmount_dirs() { # unmount & check nested overlay - $UMOUNT_PROG $mnt2 + _umount $mnt2 _overlay_check_dirs $SCRATCH_MNT $upper2 $work2 \ -o "redirect_dir=on,index=on,xino=on" # unmount & check underlying overlay - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT _overlay_check_dirs $lower $upper $work \ -o "index=on,nfs_export=on" } diff --git a/tests/overlay/076 b/tests/overlay/076 index fb94dff685b6cc..28bf2d305b94d7 100755 --- a/tests/overlay/076 +++ b/tests/overlay/076 @@ -47,7 +47,7 @@ _scratch_mount # on kernel v5.10..v5.10.14. Anything but hang is considered a test success. $CHATTR_PROG +i $SCRATCH_MNT/foo > /dev/null 2>&1 -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # success, all done echo "Silence is golden" diff --git a/tests/overlay/077 b/tests/overlay/077 index 00de0825aea6dc..cff24800469362 100755 --- a/tests/overlay/077 +++ b/tests/overlay/077 @@ -65,7 +65,7 @@ mv $SCRATCH_MNT/f100 $SCRATCH_MNT/former/ # Remove the lower directory and mount overlay again to create # a "former merge dir" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT rm -rf $lowerdir/former _scratch_mount diff --git a/tests/overlay/078 b/tests/overlay/078 index d6df11f6852f45..bcc5aff1b7dc89 100755 --- a/tests/overlay/078 +++ b/tests/overlay/078 @@ -61,7 +61,7 @@ do_check() echo "Test chattr +$1 $2" >> $seqres.full - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT # Add attribute to lower file $CHATTR_PROG +$attr $lowertestfile diff --git a/tests/overlay/079 b/tests/overlay/079 index cfcafceea56e66..f8926e091ca137 100755 --- a/tests/overlay/079 +++ b/tests/overlay/079 @@ -156,7 +156,7 @@ mount_ro_overlay() umount_overlay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } test_no_access() diff --git a/tests/overlay/080 b/tests/overlay/080 index ce5c2375fb3154..94fe33ae7db4d2 100755 --- a/tests/overlay/080 +++ b/tests/overlay/080 @@ -264,7 +264,7 @@ mount_overlay() umount_overlay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } diff --git 
a/tests/overlay/081 b/tests/overlay/081 index 2270a04750da1f..454eea2cd96576 100755 --- a/tests/overlay/081 +++ b/tests/overlay/081 @@ -46,7 +46,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir) echo "Overlayfs (uuid=null) and upper fs fsid differ" # Keep base fs mounted in case it has a volatile fsid (e.g. tmpfs) -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test legacy behavior is preserved by default for existing "impure" overlayfs _scratch_mount @@ -55,7 +55,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir) [[ "$ovl_fsid" == "$upper_fsid" ]] || \ echo "Overlayfs (after uuid=null) and upper fs fsid differ" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test unique fsid on explicit opt-in for existing "impure" overlayfs _scratch_mount -o uuid=on @@ -65,7 +65,7 @@ ovl_unique_fsid=$ovl_fsid [[ "$ovl_fsid" != "$upper_fsid" ]] || \ echo "Overlayfs (uuid=on) and upper fs fsid are the same" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test unique fsid is persistent by default after it was created _scratch_mount @@ -74,7 +74,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir) [[ "$ovl_fsid" == "$ovl_unique_fsid" ]] || \ echo "Overlayfs (after uuid=on) unique fsid is not persistent" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test ignore existing persistent fsid on explicit opt-out _scratch_mount -o uuid=null @@ -83,7 +83,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir) [[ "$ovl_fsid" == "$upper_fsid" ]] || \ echo "Overlayfs (uuid=null) and upper fs fsid differ" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test fallback to uuid=null with non-upper ovelray _overlay_scratch_mount_dirs "$upperdir:$lowerdir" "-" "-" -o ro,uuid=on @@ -110,7 +110,7 @@ ovl_unique_fsid=$ovl_fsid [[ "$ovl_fsid" != "$upper_fsid" ]] || \ echo "Overlayfs (new) and upper fs fsid are the same" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT # Test unique fsid is persistent by default after it was created _scratch_mount -o uuid=on @@ -119,7 +119,7 @@ ovl_fsid=$(stat -f -c '%i' $test_dir) [[ "$ovl_fsid" 
== "$ovl_unique_fsid" ]] || \ echo "Overlayfs (uuid=on) unique fsid is not persistent" -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT echo "Silence is golden" status=0 diff --git a/tests/overlay/083 b/tests/overlay/083 index 56e02f8cc77d73..aaa3fdb9ad139a 100755 --- a/tests/overlay/083 +++ b/tests/overlay/083 @@ -52,7 +52,7 @@ _mount -t overlay | grep ovl_esc_test | tee -a $seqres.full | grep -v spaces && # Re-create the upper/work dirs to mount them with a different lower # This is required in case index feature is enabled -$UMOUNT_PROG $SCRATCH_MNT +_umount $SCRATCH_MNT rm -rf "$upperdir" "$workdir" mkdir -p "$upperdir" "$workdir" diff --git a/tests/overlay/084 b/tests/overlay/084 index 28e9a76dc734c0..67321bc7618389 100755 --- a/tests/overlay/084 +++ b/tests/overlay/084 @@ -15,7 +15,7 @@ _cleanup() { cd / # Unmount nested mounts if things fail - $UMOUNT_PROG $OVL_BASE_SCRATCH_MNT/nested 2>/dev/null + _umount $OVL_BASE_SCRATCH_MNT/nested 2>/dev/null rm -rf $tmp } @@ -44,7 +44,7 @@ nesteddir=$OVL_BASE_SCRATCH_MNT/nested umount_overlay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } test_escape() @@ -88,12 +88,12 @@ test_escape() echo "nested xattr mount with trusted.overlay" _overlay_mount_dirs $SCRATCH_MNT/layer2:$SCRATCH_MNT/layer1 - - overlayfs $nesteddir stat $nesteddir/dir/file 2>&1 | _filter_scratch - $UMOUNT_PROG $nesteddir + _umount $nesteddir echo "nested xattr mount with user.overlay" _overlay_mount_dirs $SCRATCH_MNT/layer2:$SCRATCH_MNT/layer1 - - -o userxattr overlayfs $nesteddir stat $nesteddir/dir/file 2>&1 | _filter_scratch - $UMOUNT_PROG $nesteddir + _umount $nesteddir # Also ensure propagate the escaped xattr when we copy-up layer2/dir echo "copy-up of escaped xattrs" @@ -164,7 +164,7 @@ test_escaped_xwhiteout() do_test_xwhiteout $prefix $nesteddir - $UMOUNT_PROG $nesteddir + _umount $nesteddir } test_escaped_xwhiteout trusted diff --git a/tests/overlay/085 b/tests/overlay/085 index 046d01d161d829..8396ceb7c72b90 100755 --- 
a/tests/overlay/085 +++ b/tests/overlay/085 @@ -157,7 +157,7 @@ mount_ro_overlay() umount_overlay() { - $UMOUNT_PROG $SCRATCH_MNT + _umount $SCRATCH_MNT } test_no_access() diff --git a/tests/overlay/086 b/tests/overlay/086 index 23c56d074ff34a..45e5b45a279853 100755 --- a/tests/overlay/086 +++ b/tests/overlay/086 @@ -38,21 +38,21 @@ _mount -t overlay none $SCRATCH_MNT \ 2>> $seqres.full && \ echo "ERROR: invalid combination of lowerdir and lowerdir+ mount options" -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null _mount -t overlay none $SCRATCH_MNT \ -o"lowerdir=$lowerdir,datadir+=$lowerdir_colons" \ -o redirect_dir=follow,metacopy=on 2>> $seqres.full && \ echo "ERROR: invalid combination of lowerdir and datadir+ mount options" -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null _mount -t overlay none $SCRATCH_MNT \ -o"datadir+=$lowerdir,lowerdir+=$lowerdir_colons" \ -o redirect_dir=follow,metacopy=on 2>> $seqres.full && \ echo "ERROR: invalid order of lowerdir+ and datadir+ mount options" -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null # mount is expected to fail with escaped colons. _mount -t overlay none $SCRATCH_MNT \ @@ -60,7 +60,7 @@ _mount -t overlay none $SCRATCH_MNT \ 2>> $seqres.full && \ echo "ERROR: incorrect parsing of escaped colons in lowerdir+ mount option" -$UMOUNT_PROG $SCRATCH_MNT 2>/dev/null +_umount $SCRATCH_MNT 2>/dev/null # mount is expected to succeed without escaped colons. 
_mount -t overlay ovl_esc_test $SCRATCH_MNT \ diff --git a/tests/xfs/078 b/tests/xfs/078 index 4224fd40bc9fea..799d8881220582 100755 --- a/tests/xfs/078 +++ b/tests/xfs/078 @@ -16,7 +16,7 @@ _cleanup() { cd / rm -f $tmp.* - $UMOUNT_PROG $LOOP_MNT 2>/dev/null + _umount $LOOP_MNT 2>/dev/null [ -n "$LOOP_DEV" ] && _destroy_loop_device $LOOP_DEV 2>/dev/null # try to keep the image file if test fails [ $status -eq 0 ] && rm -f $LOOP_IMG @@ -81,7 +81,7 @@ _grow_loop() $XFS_GROWFS_PROG $LOOP_MNT 2>&1 | _filter_growfs 2>&1 echo "*** unmount" - $UMOUNT_PROG -d $LOOP_MNT && LOOP_DEV= + _umount -d $LOOP_MNT && LOOP_DEV= # Large grows takes forever to check.. if [ "$check" -gt "0" ] diff --git a/tests/xfs/148 b/tests/xfs/148 index 9e6798f999b356..7c9badd3c1b3a0 100755 --- a/tests/xfs/148 +++ b/tests/xfs/148 @@ -14,7 +14,7 @@ _begin_fstest auto quick fuzzers _cleanup() { cd / - $UMOUNT_PROG $mntpt > /dev/null 2>&1 + _umount $mntpt > /dev/null 2>&1 _destroy_loop_device $loopdev > /dev/null 2>&1 rm -r -f $tmp.* } @@ -90,7 +90,7 @@ cat $tmp.log >> $seqres.full cat $tmp.log | _filter_test_dir # Corrupt the entries -$UMOUNT_PROG $mntpt +_umount $mntpt _destroy_loop_device $loopdev cp $imgfile $imgfile.old sed -b \ @@ -121,7 +121,7 @@ fi echo "does repair complain?" >> $seqres.full # Does repair complain about this? -$UMOUNT_PROG $mntpt +_umount $mntpt $XFS_REPAIR_PROG -n $loopdev >> $seqres.full 2>&1 res=$? test $res -eq 1 || \ diff --git a/tests/xfs/149 b/tests/xfs/149 index bbaf86132dff37..ceb80b646f5784 100755 --- a/tests/xfs/149 +++ b/tests/xfs/149 @@ -22,7 +22,7 @@ loop_symlink=$TEST_DIR/loop_symlink.$$ # Override the default cleanup function. _cleanup() { - $UMOUNT_PROG $mntdir + _umount $mntdir [ -n "$loop_dev" ] && _destroy_loop_device $loop_dev rmdir $mntdir rm -f $loop_symlink @@ -73,7 +73,7 @@ echo "=== xfs_growfs - check device symlink ===" $XFS_GROWFS_PROG -D 12288 $loop_symlink > /dev/null echo "=== unmount ===" -$UMOUNT_PROG $mntdir || _fail "!!! 
failed to unmount" +_umount $mntdir || _fail "!!! failed to unmount" echo "=== mount device symlink ===" _mount $loop_symlink $mntdir || _fail "!!! failed to loopback mount" diff --git a/tests/xfs/186 b/tests/xfs/186 index 88f02585e7f667..2bd4fe10ab8930 100755 --- a/tests/xfs/186 +++ b/tests/xfs/186 @@ -87,7 +87,7 @@ _do_eas() _create_eas $2 $3 fi echo "" - cd /; $UMOUNT_PROG $SCRATCH_MNT + cd /; _umount $SCRATCH_MNT _print_inode } @@ -99,7 +99,7 @@ _do_dirents() echo "" _scratch_mount _create_dirents $1 $2 - cd /; $UMOUNT_PROG $SCRATCH_MNT + cd /; _umount $SCRATCH_MNT _print_inode } diff --git a/tests/xfs/289 b/tests/xfs/289 index 089a3f8cc14a68..aab5f96293b3a5 100755 --- a/tests/xfs/289 +++ b/tests/xfs/289 @@ -13,8 +13,8 @@ _begin_fstest growfs auto quick # Override the default cleanup function. _cleanup() { - $UMOUNT_PROG $tmpdir - $UMOUNT_PROG $tmpbind + _umount $tmpdir + _umount $tmpbind rmdir $tmpdir rm -f $tmpsymlink rmdir $tmpbind diff --git a/tests/xfs/507 b/tests/xfs/507 index 75c183c07a9fce..60542112fbd5a1 100755 --- a/tests/xfs/507 +++ b/tests/xfs/507 @@ -22,7 +22,7 @@ _register_cleanup "_cleanup" BUS _cleanup() { cd / - test -n "$loop_mount" && $UMOUNT_PROG $loop_mount > /dev/null 2>&1 + test -n "$loop_mount" && _umount $loop_mount > /dev/null 2>&1 test -n "$loop_dev" && _destroy_loop_device $loop_dev rm -rf $tmp.* } diff --git a/tests/xfs/513 b/tests/xfs/513 index 5585a9c8e76703..cb8d0aca841530 100755 --- a/tests/xfs/513 +++ b/tests/xfs/513 @@ -14,7 +14,7 @@ _cleanup() { cd / rm -f $tmp.* - $UMOUNT_PROG $LOOP_MNT 2>/dev/null + _umount $LOOP_MNT 2>/dev/null if [ -n "$LOOP_DEV" ];then _destroy_loop_device $LOOP_DEV 2>/dev/null fi @@ -89,7 +89,7 @@ get_mount_info() force_unmount() { - $UMOUNT_PROG $LOOP_MNT >/dev/null 2>&1 + _umount $LOOP_MNT >/dev/null 2>&1 } # _do_test <mount options> <should be mounted?> [<key string> <key should be found?>] diff --git a/tests/xfs/544 b/tests/xfs/544 index a3a23c1726ca1c..f1b5cc74983a62 100755 --- a/tests/xfs/544 +++ 
b/tests/xfs/544 @@ -15,7 +15,7 @@ _cleanup() _cleanup_dump cd / rm -r -f $tmp.* - $UMOUNT_PROG $TEST_DIR/dest.$seq 2> /dev/null + _umount $TEST_DIR/dest.$seq 2> /dev/null rmdir $TEST_DIR/src.$seq 2> /dev/null rmdir $TEST_DIR/dest.$seq 2> /dev/null } diff --git a/tests/xfs/806 b/tests/xfs/806 index 09c55332cc8800..9334d1780c6855 100755 --- a/tests/xfs/806 +++ b/tests/xfs/806 @@ -23,7 +23,7 @@ _cleanup() { cd / rm -r -f $tmp.* - umount $dummymnt &>/dev/null + _umount $dummymnt &>/dev/null rmdir $dummymnt &>/dev/null rm -f $dummyfile } @@ -46,7 +46,7 @@ testme() { XFS_SCRUB_PHASE=7 $XFS_SCRUB_PROG -d -o autofsck $dummymnt 2>&1 | \ grep autofsck | _filter_test_dir | \ sed -e 's/\(directive.\).*$/\1/g' - umount $dummymnt + _umount $dummymnt } # We don't test the absence of an autofsck directive because xfs_scrub behaves ^ permalink raw reply related [flat|nested] 110+ messages in thread
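A quick way to sanity-check the four sed expressions used by the next patch (2/6) is to run them over a few sample lines. The sketch below copies the expressions verbatim from that commit message; the convert_umount wrapper name and the sample input lines are made up for illustration and are not part of fstests.

```shell
# Hypothetical wrapper (not an fstests helper): apply the four sed
# expressions from the patch 2/6 commit message to stdin.
convert_umount() {
	sed \
		-e 's/\([[:space:]]\)umount\([[:space:]]*"\$\)/\1_umount\2/g' \
		-e 's/\([[:space:]]\)umount\([[:space:]]*\$\)/\1_umount\2/g' \
		-e 's/^umount\([[:space:]]*"\$\)/_umount\1/g' \
		-e 's/^umount\([[:space:]]*\$\)/_umount\1/g'
}

# Illustrative input: the first three lines call umount on a shell
# variable and get rewritten; the last three are left untouched.
convert_umount <<'EOF'
umount $SCRATCH_MNT
    umount "${SCRATCH_MNT}"
test "$umount" = "-u" && umount "$dev" &>/dev/null
_umount $SCRATCH_MNT
$UMOUNT_PROG $SCRATCH_MNT
umount /some/literal/path
EOF
```

The expressions only fire when umount sits at the start of a line or after whitespace and is followed by a (possibly quoted) shell variable. That is why the "$umount" variable in the third sample line survives, and why lines that already say _umount should not be converted a second time if the script is rerun.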
* [PATCH 2/6] misc: convert all umount(1) invocations to _umount
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:57 ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
@ 2024-12-31 23:57 ` Darrick J. Wong
2024-12-31 23:57 ` [PATCH 3/6] xfs: test health monitoring code Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Find all the places where we call umount(1) directly and convert all of
those to _umount calls as well.

sed \
 -e 's/\([[:space:]]\)umount\([[:space:]]*"\$\)/\1_umount\2/g' \
 -e 's/\([[:space:]]\)umount\([[:space:]]*\$\)/\1_umount\2/g' \
 -e 's/^umount\([[:space:]]*"\$\)/_umount\1/g' \
 -e 's/^umount\([[:space:]]*\$\)/_umount\1/g' \
 -i $(git ls-files tests common check)

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/dmerror    | 2 +-
 common/populate   | 8 ++++----
 common/quota      | 2 +-
 common/rc         | 4 ++--
 common/xfs        | 2 +-
 tests/btrfs/012   | 2 +-
 tests/btrfs/199   | 2 +-
 tests/btrfs/291   | 2 +-
 tests/btrfs/298   | 4 ++--
 tests/ext4/006    | 4 ++--
 tests/ext4/007    | 4 ++--
 tests/ext4/008    | 4 ++--
 tests/ext4/009    | 8 ++++----
 tests/ext4/010    | 6 +++---
 tests/ext4/011    | 2 +-
 tests/ext4/012    | 2 +-
 tests/ext4/013    | 6 +++---
 tests/ext4/014    | 6 +++---
 tests/ext4/015    | 6 +++---
 tests/ext4/016    | 6 +++---
 tests/ext4/017    | 6 +++---
 tests/ext4/018    | 6 +++---
 tests/ext4/019    | 6 +++---
 tests/ext4/033    | 2 +-
 tests/generic/171 | 2 +-
 tests/generic/172 | 2 +-
 tests/generic/173 | 2 +-
 tests/generic/174 | 2 +-
 tests/generic/306 | 2 +-
 tests/generic/330 | 2 +-
 tests/generic/332 | 2 +-
 tests/generic/395 | 2 +-
 tests/generic/563 | 4 ++--
 tests/generic/631 | 2 +-
 tests/generic/717 | 2 +-
 tests/xfs/014     | 4 ++--
 tests/xfs/049     | 8 ++++----
 tests/xfs/073     | 8 ++++----
 tests/xfs/074     | 4 ++--
 tests/xfs/083     | 6 +++---
 tests/xfs/085     | 4 ++--
 tests/xfs/086     | 8 ++++----
 tests/xfs/087     | 6 +++---
 tests/xfs/088     | 8 ++++----
 tests/xfs/089     | 8 ++++----
 tests/xfs/091     | 8 ++++----
 tests/xfs/093     | 6 +++---
 tests/xfs/097     | 6 +++---
 tests/xfs/098     | 4 ++--
 tests/xfs/099     | 6 +++---
 tests/xfs/100     | 6 +++---
 tests/xfs/101     | 6 +++---
 tests/xfs/102     | 6 +++---
 tests/xfs/105     | 6 +++---
 tests/xfs/112     | 8 ++++----
 tests/xfs/113     | 6 +++---
 tests/xfs/117     | 6 +++---
 tests/xfs/120     | 6 +++---
 tests/xfs/123     | 6 +++---
 tests/xfs/124     | 6 +++---
 tests/xfs/125     | 6 +++---
 tests/xfs/126     | 6 +++---
 tests/xfs/130     | 2 +-
 tests/xfs/152     | 2 +-
 tests/xfs/169     | 6 +++---
 tests/xfs/206     | 2 +-
 tests/xfs/216     | 2 +-
 tests/xfs/217     | 2 +-
 tests/xfs/235     | 6 +++---
 tests/xfs/236     | 6 +++---
 tests/xfs/239     | 2 +-
 tests/xfs/241     | 2 +-
 tests/xfs/250     | 4 ++--
 tests/xfs/265     | 6 +++---
 tests/xfs/310     | 4 ++--
 tests/xfs/716     | 4 ++--
 76 files changed, 172 insertions(+), 172 deletions(-)

diff --git a/common/dmerror b/common/dmerror
index 1e6a35230f3ccb..2b6f001b8427f6 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -97,7 +97,7 @@ _dmerror_mount()
 _dmerror_unmount()
 {
- umount $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 }
 _dmerror_cleanup()
diff --git a/common/populate b/common/populate
index 96e6a0f0572f12..e6bcdf346ac4ff 100644
--- a/common/populate
+++ b/common/populate
@@ -540,7 +540,7 @@ _scratch_xfs_populate() {
 __populate_fragment_file "${SCRATCH_MNT}/REFCOUNTBT"
 __populate_fragment_file "${SCRATCH_MNT}/RTREFCOUNTBT"
- umount "${SCRATCH_MNT}"
+ _umount "${SCRATCH_MNT}"
 }
 # Populate an ext4 on the scratch device with (we hope) all known
@@ -642,7 +642,7 @@ _scratch_ext4_populate() {
 # Make sure we get all the fragmentation we asked for
 __populate_fragment_file "${SCRATCH_MNT}/S_IFREG.FMT_ETREE"
- umount "${SCRATCH_MNT}"
+ _umount "${SCRATCH_MNT}"
 }
 # Find the inode number of a file
@@ -831,7 +831,7 @@ _scratch_xfs_populate_check() {
 dblksz="$(_xfs_get_dir_blocksize "$SCRATCH_MNT")"
 leaf_lblk="$((32 * 1073741824 / blksz))"
 node_lblk="$((64 * 1073741824 / blksz))"
- umount "${SCRATCH_MNT}"
+ _umount "${SCRATCH_MNT}"
 __populate_check_xfs_dformat "${extents_file}" "extents"
 __populate_check_xfs_dformat "${btree_file}" "btree"
@@ -948,7 +948,7 @@ _scratch_ext4_populate_check() {
 extents_slink="$(__populate_find_inode "${SCRATCH_MNT}/S_IFLNK.FMT_EXTENTS")"
 local_attr="$(__populate_find_inode "${SCRATCH_MNT}/ATTR.FMT_LOCAL")"
 block_attr="$(__populate_find_inode "${SCRATCH_MNT}/ATTR.FMT_BLOCK")"
- umount "${SCRATCH_MNT}"
+ _umount "${SCRATCH_MNT}"
 __populate_check_ext4_dformat "${extents_file}" "extents"
 __populate_check_ext4_dformat "${etree_file}" "etree"
diff --git a/common/quota b/common/quota
index 344c942045e5f2..7399819bb2579b 100644
--- a/common/quota
+++ b/common/quota
@@ -92,7 +92,7 @@ _require_xfs_quota_acct_enabled()
 if [ -z "$umount" ] && [ "$dev" = "$SCRATCH_DEV" ]; then
 umount="-u"
 fi
- test "$umount" = "-u" && umount "$dev" &>/dev/null
+ test "$umount" = "-u" && _umount "$dev" &>/dev/null
 case "$dev" in
 "$TEST_DEV") fsname="test";;
diff --git a/common/rc b/common/rc
index d3ee76e01db892..0d5c785cecc017 100644
--- a/common/rc
+++ b/common/rc
@@ -1348,7 +1348,7 @@ _repair_scratch_fs()
 _scratch_xfs_repair -L 2>&1
 echo "log zap returns $?"
 else
- umount "$SCRATCH_MNT"
+ _umount "$SCRATCH_MNT"
 fi
 _scratch_xfs_repair "$@" 2>&1
 res=$?
@@ -1413,7 +1413,7 @@ _repair_test_fs()
 _test_xfs_repair -L >>$tmp.repair 2>&1
 echo "log zap returns $?" >> $tmp.repair
 else
- umount "$TEST_DEV"
+ _umount "$TEST_DEV"
 fi
 _test_xfs_repair "$@" >>$tmp.repair 2>&1
 res=$?
diff --git a/common/xfs b/common/xfs
index 86654a9379cf89..b9e897e0e8839a 100644
--- a/common/xfs
+++ b/common/xfs
@@ -466,7 +466,7 @@ _require_xfs_has_feature()
 _xfs_has_feature "$1" "$2" && return 0
- test "$umount" = "-u" && umount "$fs" &>/dev/null
+ test "$umount" = "-u" && _umount "$fs" &>/dev/null
 test -n "$message" && _notrun "$message"
diff --git a/tests/btrfs/012 b/tests/btrfs/012
index 5811b3b339cb3e..7bb075dc2d0e93 100755
--- a/tests/btrfs/012
+++ b/tests/btrfs/012
@@ -70,7 +70,7 @@ mount -o loop $SCRATCH_MNT/ext2_saved/image $SCRATCH_MNT/mnt || \
 echo "Checking saved ext2 image against the original one:"
 $FSSUM_PROG -r $tmp.original $SCRATCH_MNT/mnt/$BASENAME
-umount $SCRATCH_MNT/mnt
+_umount $SCRATCH_MNT/mnt
 echo "Generating new data on the converted btrfs" >> $seqres.full
 mkdir -p $SCRATCH_MNT/new
diff --git a/tests/btrfs/199 b/tests/btrfs/199
index f161e55057ff27..bdad1cb934c91f 100755
--- a/tests/btrfs/199
+++ b/tests/btrfs/199
@@ -19,7 +19,7 @@ _begin_fstest auto quick trim fiemap
 _cleanup()
 {
 cd /
- umount $loop_mnt &> /dev/null
+ _umount $loop_mnt &> /dev/null
 _destroy_loop_device $loop_dev &> /dev/null
 rm -rf $tmp.*
 }
diff --git a/tests/btrfs/291 b/tests/btrfs/291
index c31de3a96ef1f5..f69b65114ed696 100755
--- a/tests/btrfs/291
+++ b/tests/btrfs/291
@@ -134,7 +134,7 @@ do
 _mount $snap_dev $SCRATCH_MNT || _fail "mount failed at entry $cur"
 fsverity measure $SCRATCH_MNT/fsv >>$seqres.full 2>&1
 measured=$?
- umount $SCRATCH_MNT
+ _umount $SCRATCH_MNT
 [ $state -eq 1 ] && [ $measured -eq 0 ] && state=2
 [ $state -eq 2 ] && ([ $measured -eq 0 ] || _fail "verity done, but measurement failed at entry $cur")
 post_mount=$(count_merkle_items $snap_dev)
diff --git a/tests/btrfs/298 b/tests/btrfs/298
index d4aee55e785a94..c5b65772d428b1 100755
--- a/tests/btrfs/298
+++ b/tests/btrfs/298
@@ -31,11 +31,11 @@ $BTRFS_UTIL_PROG device scan --forget
 echo "#Scan seed device and check using mount" >> $seqres.full
 $BTRFS_UTIL_PROG device scan $SCRATCH_DEV >> $seqres.full
 _mount $SPARE_DEV $SCRATCH_MNT
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 echo "#check again, ensures seed device still in kernel" >> $seqres.full
 _mount $SPARE_DEV $SCRATCH_MNT
-umount $SCRATCH_MNT
+_umount $SCRATCH_MNT
 echo "#Now scan of non-seed device makes kernel forget" >> $seqres.full
 $BTRFS_TUNE_PROG -f -S 0 $SCRATCH_DEV >> $seqres.full 2>&1
diff --git a/tests/ext4/006 b/tests/ext4/006
index d7862073114872..579eab55b32d26 100755
--- a/tests/ext4/006
+++ b/tests/ext4/006
@@ -97,7 +97,7 @@ echo "++ modify scratch" >> $seqres.full
 _scratch_fuzz_modify >> $seqres.full 2>&1
 echo "++ unmount" >> $seqres.full
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 # repair in a loop...
 for p in $(seq 1 "${FSCK_PASSES}"); do
@@ -122,7 +122,7 @@ echo "++ modify scratch" >> $ROUND2_LOG
 _scratch_fuzz_modify >> $ROUND2_LOG 2>&1
 echo "++ unmount" >> $ROUND2_LOG
-umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
+_umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1
 cat "$ROUND2_LOG" >> $seqres.full
diff --git a/tests/ext4/007 b/tests/ext4/007
index deedbd9e8fb3d8..24cc2290f79a29 100755
--- a/tests/ext4/007
+++ b/tests/ext4/007
@@ -54,7 +54,7 @@ done
 for x in `seq 2 64`; do
 touch "${TESTFILE}.${x}"
 done
-umount "${SCRATCH_MNT}"
+_umount "${SCRATCH_MNT}"
 echo "+ check fs"
 e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail"
@@ -89,7 +89,7 @@ for x in `seq 1 64`; do
 test $?
-ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/008 b/tests/ext4/008 index b4b20ac10d6d2a..a586bf681dfd34 100755 --- a/tests/ext4/008 +++ b/tests/ext4/008 @@ -50,7 +50,7 @@ done for x in `seq 2 64`; do echo moo >> "${TESTFILE}.${x}" done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -70,7 +70,7 @@ e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 echo "+ mount image (2)" _scratch_mount -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/009 b/tests/ext4/009 index 06a42fd77ffa0c..f6fe1e5f0d8d2a 100755 --- a/tests/ext4/009 +++ b/tests/ext4/009 @@ -45,13 +45,13 @@ done blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")" freeblks="$(stat -f -c '%a' "${SCRATCH_MNT}")" $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile2" >> $seqres.full -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ make some files" _scratch_mount rm -rf "${SCRATCH_MNT}/bigfile2" touch "${SCRATCH_MNT}/bigfile" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -70,7 +70,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >> after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")" echo "$((after * b_bytes))" lt "$((blksz * freeblks / 4))" >> $seqres.full test "$((after * b_bytes))" -lt "$((blksz * freeblks / 4))" || _fail "falloc should fail" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -80,7 +80,7 @@ _scratch_mount echo "+ modify files (2)" $XFS_IO_PROG -f -c "falloc 0 
$((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >> $seqres.full -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/010 b/tests/ext4/010 index 1139c79e80d538..27ce20f822256f 100755 --- a/tests/ext4/010 +++ b/tests/ext4/010 @@ -46,7 +46,7 @@ echo "+ make some files" for i in `seq 1 $((nr_groups * 8))`; do mkdir -p "${SCRATCH_MNT}/d_${i}" done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -61,7 +61,7 @@ _scratch_mount echo "+ modify files" touch "${SCRATCH_MNT}/file0" > /dev/null 2>&1 && _fail "touch should fail" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -71,7 +71,7 @@ _scratch_mount echo "+ modify files (2)" touch "${SCRATCH_MNT}/file1" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/011 b/tests/ext4/011 index cae4fb6b84768b..cb085c95596de1 100755 --- a/tests/ext4/011 +++ b/tests/ext4/011 @@ -39,7 +39,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")" echo "+ make some files" echo moo > "${SCRATCH_MNT}/file0" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/012 b/tests/ext4/012 index f7f2b0fb455762..e7adc617c4db17 100755 --- a/tests/ext4/012 +++ b/tests/ext4/012 @@ -39,7 +39,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")" echo "+ make some files" echo moo > "${SCRATCH_MNT}/file0" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/013 b/tests/ext4/013 index 7d2a9154a66936..4363e3d104b716 
100755 --- a/tests/ext4/013 +++ b/tests/ext4/013 @@ -50,7 +50,7 @@ for x in `seq 2 64`; do touch "${TESTFILE}.${x}" done inode="$(stat -c '%i' "${TESTFILE}.1")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -72,7 +72,7 @@ for x in `seq 1 64`; do test $? -ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -93,7 +93,7 @@ for x in `seq 1 64`; do test $? -ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/014 b/tests/ext4/014 index ffed795ad4e93c..c874a62335d1f3 100755 --- a/tests/ext4/014 +++ b/tests/ext4/014 @@ -49,7 +49,7 @@ done for x in `seq 2 64`; do touch "${TESTFILE}.${x}" done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -70,7 +70,7 @@ for x in `seq 1 64`; do test $? -ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 && _fail "e2fsck should not succeed" @@ -91,7 +91,7 @@ for x in `seq 1 64`; do test $? 
-ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/015 b/tests/ext4/015 index 81feda5c9423fb..32b3884de32035 100755 --- a/tests/ext4/015 +++ b/tests/ext4/015 @@ -45,7 +45,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((blksz * freeblks))" "${SCRATCH_MNT}/bigfile" >> seq 1 2 ${freeblks} | while read lblk; do $XFS_IO_PROG -f -c "fpunch $((lblk * blksz)) ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -60,7 +60,7 @@ _scratch_mount echo "+ modify files" echo moo >> "${SCRATCH_MNT}/bigfile" 2> /dev/null && _fail "extent tree should be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -70,7 +70,7 @@ _scratch_mount echo "+ modify files (2)" $XFS_IO_PROG -f -c "pwrite ${blksz} ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/016 b/tests/ext4/016 index b7db4cfda649ef..f0f1709b6c208a 100755 --- a/tests/ext4/016 +++ b/tests/ext4/016 @@ -40,7 +40,7 @@ echo "+ make some files" for x in `seq 1 15`; do mkdir -p "${SCRATCH_MNT}/test/d_${x}" done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -53,7 +53,7 @@ _scratch_mount echo "+ modify dirs" mkdir -p "${SCRATCH_MNT}/test/newdir" 2> /dev/null && _fail "directory should be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -63,7 +63,7 @@ _scratch_mount echo "+ modify dirs 
(2)" mkdir -p "${SCRATCH_MNT}/test/newdir" || _fail "directory should be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/017 b/tests/ext4/017 index fc867442c3da3a..7fa563106d676c 100755 --- a/tests/ext4/017 +++ b/tests/ext4/017 @@ -43,7 +43,7 @@ for x in `seq 1 $((blksz * 4 / 256))`; do fname="$(printf "%.255s\n" "$(perl -e "print \"${x}_\" x 500;")")" touch "${SCRATCH_MNT}/test/${fname}" done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -56,7 +56,7 @@ _scratch_mount echo "+ modify dirs" mkdir -p "${SCRATCH_MNT}/test/newdir" 2> /dev/null && _fail "htree should be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -66,7 +66,7 @@ _scratch_mount echo "+ modify dirs (2)" mkdir -p "${SCRATCH_MNT}/test/newdir" || _fail "htree should not be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/018 b/tests/ext4/018 index f7377f059fb826..2e24fe2e82918d 100755 --- a/tests/ext4/018 +++ b/tests/ext4/018 @@ -40,7 +40,7 @@ blksz="$(stat -f -c '%s' "${SCRATCH_MNT}")" echo "+ make some files" $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${SCRATCH_MNT}/attrfile" >> $seqres.full setfattr -n user.key -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -54,7 +54,7 @@ _scratch_mount echo "+ modify attrs" setfattr -n user.newkey -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "xattr should be corrupt" -umount "${SCRATCH_MNT}" +_umount 
"${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -64,7 +64,7 @@ _scratch_mount echo "+ modify attrs (2)" setfattr -n user.newkey -v "$(perl -e 'print "v" x 300;')" "${SCRATCH_MNT}/attrfile" || _fail "xattr should not be corrupt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/019 b/tests/ext4/019 index 987972a80a3704..7df7ccbed5e50d 100755 --- a/tests/ext4/019 +++ b/tests/ext4/019 @@ -43,7 +43,7 @@ echo "file contents: moo" > "${SCRATCH_MNT}/x" str="$(perl -e "print './' x $(( (blksz / 2) - 16));")x" (cd $SCRATCH_MNT; ln -s "${str}" "long_symlink") cat "${SCRATCH_MNT}/long_symlink" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" @@ -54,7 +54,7 @@ debugfs -w -R 'zap -f /long_symlink -p 0x62 0' "${SCRATCH_DEV}" 2> /dev/null echo "+ mount image" _scratch_mount 2> /dev/null cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 @@ -62,7 +62,7 @@ e2fsck -fy "${SCRATCH_DEV}" >> $seqres.full 2>&1 echo "+ mount image (2)" _scratch_mount cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" e2fsck -fn "${SCRATCH_DEV}" >> $seqres.full 2>&1 || _fail "fsck should not fail" diff --git a/tests/ext4/033 b/tests/ext4/033 index 53f7106e2c6ba4..19cd1fb6f20d4c 100755 --- a/tests/ext4/033 +++ b/tests/ext4/033 @@ -14,7 +14,7 @@ _begin_fstest auto ioctl resize # Override the default cleanup function. 
_cleanup() { - umount $SCRATCH_MNT >/dev/null 2>&1 + _umount $SCRATCH_MNT >/dev/null 2>&1 _dmhugedisk_cleanup cd / rm -f $tmp.* diff --git a/tests/generic/171 b/tests/generic/171 index dd56aa792afbd5..f51f58e9495f8e 100755 --- a/tests/generic/171 +++ b/tests/generic/171 @@ -36,7 +36,7 @@ mkdir $testdir echo "Reformat with appropriate size" blksz="$(_get_block_size $testdir)" nr_blks=10240 -umount $SCRATCH_MNT +_umount $SCRATCH_MNT sz_bytes=$((nr_blks * 8 * blksz)) if [ $sz_bytes -lt $((32 * 1048576)) ]; then sz_bytes=$((32 * 1048576)) diff --git a/tests/generic/172 b/tests/generic/172 index c23a1228455464..8d32f0288b1556 100755 --- a/tests/generic/172 +++ b/tests/generic/172 @@ -35,7 +35,7 @@ mkdir $testdir echo "Reformat with appropriate size" blksz="$(_get_block_size $testdir)" -umount $SCRATCH_MNT +_umount $SCRATCH_MNT file_size=$((768 * 1024 * 1024)) fs_size=$((1024 * 1024 * 1024)) diff --git a/tests/generic/173 b/tests/generic/173 index 8df3c6df21b29c..2f1ea96ef6238e 100755 --- a/tests/generic/173 +++ b/tests/generic/173 @@ -36,7 +36,7 @@ mkdir $testdir echo "Reformat with appropriate size" blksz="$(_get_block_size $testdir)" nr_blks=10240 -umount $SCRATCH_MNT +_umount $SCRATCH_MNT sz_bytes=$((nr_blks * 8 * blksz)) if [ $sz_bytes -lt $((32 * 1048576)) ]; then sz_bytes=$((32 * 1048576)) diff --git a/tests/generic/174 b/tests/generic/174 index b9c292071445fe..d93546eeb35581 100755 --- a/tests/generic/174 +++ b/tests/generic/174 @@ -37,7 +37,7 @@ mkdir $testdir echo "Reformat with appropriate size" blksz="$(_get_block_size $testdir)" nr_blks=10240 -umount $SCRATCH_MNT +_umount $SCRATCH_MNT sz_bytes=$((nr_blks * 8 * blksz)) if [ $sz_bytes -lt $((32 * 1048576)) ]; then sz_bytes=$((32 * 1048576)) diff --git a/tests/generic/306 b/tests/generic/306 index a6ea654b67d179..e6502cb881e21e 100755 --- a/tests/generic/306 +++ b/tests/generic/306 @@ -12,7 +12,7 @@ _begin_fstest auto quick rw # Override the default cleanup function. 
_cleanup() { - umount $BINDFILE + _umount $BINDFILE cd / rm -f $tmp.* } diff --git a/tests/generic/330 b/tests/generic/330 index 4fa81f9913ee7e..ab9af84611d725 100755 --- a/tests/generic/330 +++ b/tests/generic/330 @@ -61,7 +61,7 @@ md5sum $testdir/file1 | _filter_scratch md5sum $testdir/file2 | _filter_scratch echo "Check for damage" -umount $SCRATCH_MNT +_umount $SCRATCH_MNT _repair_scratch_fs >> $seqres.full # success, all done diff --git a/tests/generic/332 b/tests/generic/332 index 4a61e4a02a7cdc..b15546d66a41e0 100755 --- a/tests/generic/332 +++ b/tests/generic/332 @@ -61,7 +61,7 @@ md5sum $testdir/file1 | _filter_scratch md5sum $testdir/file2 | _filter_scratch echo "Check for damage" -umount $SCRATCH_MNT +_umount $SCRATCH_MNT _repair_scratch_fs >> $seqres.full # success, all done diff --git a/tests/generic/395 b/tests/generic/395 index 45787fff06be1d..d0600d0282c6a4 100755 --- a/tests/generic/395 +++ b/tests/generic/395 @@ -75,7 +75,7 @@ mount --bind $SCRATCH_MNT $SCRATCH_MNT/ro_bind_mnt mount -o remount,ro,bind $SCRATCH_MNT/ro_bind_mnt _set_encpolicy $SCRATCH_MNT/ro_bind_mnt/ro_dir |& _filter_scratch _get_encpolicy $SCRATCH_MNT/ro_bind_mnt/ro_dir |& _filter_scratch -umount $SCRATCH_MNT/ro_bind_mnt +_umount $SCRATCH_MNT/ro_bind_mnt # success, all done status=0 diff --git a/tests/generic/563 b/tests/generic/563 index ade66f93fbf30b..166774653a66d6 100755 --- a/tests/generic/563 +++ b/tests/generic/563 @@ -21,7 +21,7 @@ _cleanup() echo $$ > $cgdir/cgroup.procs rmdir $cgdir/$seq-cg* > /dev/null 2>&1 - umount $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 _destroy_loop_device $LOOP_DEV > /dev/null 2>&1 } @@ -80,7 +80,7 @@ reset() rmdir $cgdir/$seq-cg* > /dev/null 2>&1 $XFS_IO_PROG -fc "pwrite 0 $iosize" $SCRATCH_MNT/file \ >> $seqres.full 2>&1 - umount $SCRATCH_MNT || _fail "umount failed" + _umount $SCRATCH_MNT || _fail "umount failed" _mount $LOOP_DEV $SCRATCH_MNT || _fail "mount failed" stat $SCRATCH_MNT/file > /dev/null } diff --git 
a/tests/generic/631 b/tests/generic/631 index c7c95e5608b760..c9f8299c948f83 100755 --- a/tests/generic/631 +++ b/tests/generic/631 @@ -84,7 +84,7 @@ worker() { touch $mergedir/etc/access.conf mv $mergedir/etc/access.conf $mergedir/etc/access.conf.bak touch $mergedir/etc/access.conf - umount $mergedir + _umount $mergedir done rm -f $SCRATCH_MNT/workers/$tag } diff --git a/tests/generic/717 b/tests/generic/717 index 4378e964ab8597..7ff356e255b3d1 100755 --- a/tests/generic/717 +++ b/tests/generic/717 @@ -85,7 +85,7 @@ mkdir -p $SCRATCH_MNT/xyz mount --bind $dir $SCRATCH_MNT/xyz --bind _pwrite_byte 0x60 0 $((blksz * (nrblks + 2))) $dir/c >> $seqres.full $XFS_IO_PROG -c "exchangerange $SCRATCH_MNT/xyz/c" $dir/a -umount $SCRATCH_MNT/xyz +_umount $SCRATCH_MNT/xyz echo Swapping a file with itself $XFS_IO_PROG -c "exchangerange $dir/a" $dir/a diff --git a/tests/xfs/014 b/tests/xfs/014 index 098f64186e1134..efae4efa5138f5 100755 --- a/tests/xfs/014 +++ b/tests/xfs/014 @@ -22,7 +22,7 @@ _begin_fstest auto enospc quick quota prealloc _cleanup() { cd / - umount $LOOP_MNT 2>/dev/null + _umount $LOOP_MNT 2>/dev/null _scratch_unmount 2>/dev/null rm -f $tmp.* } @@ -174,7 +174,7 @@ mount -t xfs -o loop,uquota,gquota $LOOP_FILE $LOOP_MNT || \ _test_enospc $LOOP_MNT _test_edquot $LOOP_MNT -umount $LOOP_MNT +_umount $LOOP_MNT echo $orig_sp_time > /proc/sys/fs/xfs/speculative_prealloc_lifetime diff --git a/tests/xfs/049 b/tests/xfs/049 index 668ac374576a69..89ee1dbdff4f10 100755 --- a/tests/xfs/049 +++ b/tests/xfs/049 @@ -13,8 +13,8 @@ _begin_fstest rw auto quick _cleanup() { cd / - umount $SCRATCH_MNT/test2 > /dev/null 2>&1 - umount $SCRATCH_MNT/test > /dev/null 2>&1 + _umount $SCRATCH_MNT/test2 > /dev/null 2>&1 + _umount $SCRATCH_MNT/test > /dev/null 2>&1 rm -f $tmp.* if [ -w $seqres.full ] @@ -96,11 +96,11 @@ rm -rf $SCRATCH_MNT/test/* >> $seqres.full 2>&1 \ || _fail "!!! 
clean failed" _log "umount ext2 on xfs" -umount $SCRATCH_MNT/test2 >> $seqres.full 2>&1 \ +_umount $SCRATCH_MNT/test2 >> $seqres.full 2>&1 \ || _fail "!!! umount ext2 failed" _log "umount xfs" -umount $SCRATCH_MNT/test >> $seqres.full 2>&1 \ +_umount $SCRATCH_MNT/test >> $seqres.full 2>&1 \ || _fail "!!! umount xfs failed" echo "--- mounts at end (before cleanup)" >> $seqres.full diff --git a/tests/xfs/073 b/tests/xfs/073 index 28f1fad08b8c96..7d99179b7bc974 100755 --- a/tests/xfs/073 +++ b/tests/xfs/073 @@ -21,9 +21,9 @@ _cleanup() { cd / _scratch_unmount 2>/dev/null - umount $imgs.loop 2>/dev/null + _umount $imgs.loop 2>/dev/null [ -d $imgs.loop ] && rmdir $imgs.loop - umount $imgs.source_dir 2>/dev/null + _umount $imgs.source_dir 2>/dev/null [ -d $imgs.source_dir ] && rm -rf $imgs.source_dir rm -f $imgs.* $tmp.* /var/tmp/xfs_copy.log.* } @@ -98,8 +98,8 @@ _verify_copy() diff -u $tmp.geometry1 $tmp.geometry2 echo unmounting and removing new image - umount $source_dir - umount $target_dir > /dev/null 2>&1 + _umount $source_dir + _umount $target_dir > /dev/null 2>&1 rm -f $target } diff --git a/tests/xfs/074 b/tests/xfs/074 index 278f0ade694d22..282642a8674557 100755 --- a/tests/xfs/074 +++ b/tests/xfs/074 @@ -59,7 +59,7 @@ $XFS_IO_PROG -ft \ -c "falloc 0 $(($BLOCK_SIZE * 2097152))" \ $LOOP_MNT/foo >> $seqres.full -umount $LOOP_MNT +_umount $LOOP_MNT _check_xfs_filesystem $LOOP_DEV none none _mkfs_dev -f $LOOP_DEV @@ -72,7 +72,7 @@ $XFS_IO_PROG -ft \ -c "falloc 1023m 2g" \ $LOOP_MNT/foo >> $seqres.full -umount $LOOP_MNT +_umount $LOOP_MNT _check_xfs_filesystem $LOOP_DEV none none # success, all done diff --git a/tests/xfs/083 b/tests/xfs/083 index 9291c8c0382489..875937e6ffe3b3 100755 --- a/tests/xfs/083 +++ b/tests/xfs/083 @@ -57,7 +57,7 @@ scratch_repair() { _scratch_xfs_repair -L >> "${FSCK_LOG}" 2>&1 echo "+++ returns $?" 
>> "${FSCK_LOG}" else - umount "${SCRATCH_MNT}" >> "${FSCK_LOG}" 2>&1 + _umount "${SCRATCH_MNT}" >> "${FSCK_LOG}" 2>&1 fi elif [ "${fsck_pass}" -eq "${FSCK_PASSES}" ]; then echo "++ fsck did not fix in ${FSCK_PASSES} passes." >> "${FSCK_LOG}" @@ -109,7 +109,7 @@ echo "+++ modify scratch" >> $seqres.full _scratch_fuzz_modify >> $seqres.full 2>&1 echo "++ umount" >> $seqres.full -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" # repair in a loop... for p in $(seq 1 "${FSCK_PASSES}"); do @@ -134,7 +134,7 @@ echo "+++ modify scratch" >> $ROUND2_LOG _scratch_fuzz_modify >> $ROUND2_LOG 2>&1 echo "++ umount" >> $ROUND2_LOG -umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1 +_umount "${SCRATCH_MNT}" >> $ROUND2_LOG 2>&1 cat "$ROUND2_LOG" >> $seqres.full diff --git a/tests/xfs/085 b/tests/xfs/085 index d33dd199e6f9c1..9faf16fde5cdab 100755 --- a/tests/xfs/085 +++ b/tests/xfs/085 @@ -54,7 +54,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -82,7 +82,7 @@ for x in `seq 1 64`; do test $? 
-ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/086 b/tests/xfs/086 index 44985f3913254d..03327cdeaf3f08 100755 --- a/tests/xfs/086 +++ b/tests/xfs/086 @@ -56,7 +56,7 @@ done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" test "${agcount}" -gt 1 || _notrun "Single-AG XFS not supported" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -97,7 +97,7 @@ echo "+ modify files (2)" for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" _repair_scratch_fs >> $seqres.full 2>&1 @@ -114,7 +114,7 @@ for x in `seq 1 64`; do test -s "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/087 b/tests/xfs/087 index 3cca105685fc6a..aeef30657b9491 100755 --- a/tests/xfs/087 +++ b/tests/xfs/087 @@ -55,7 +55,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -72,7 +72,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 65 70`; do touch "${TESTFILE}.${x}" 2> /dev/null && broken=0 
done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "broken: ${broken}" @@ -91,7 +91,7 @@ for x in `seq 65 70`; do touch "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/088 b/tests/xfs/088 index b54a1ab7d00342..de100136014ba7 100755 --- a/tests/xfs/088 +++ b/tests/xfs/088 @@ -56,7 +56,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -97,7 +97,7 @@ echo "+ modify files (2)" for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" _repair_scratch_fs >> $seqres.full 2>&1 @@ -114,7 +114,7 @@ for x in `seq 1 64`; do test -s "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/089 b/tests/xfs/089 index ff3ae719326eca..f5640a46177578 100755 --- a/tests/xfs/089 +++ b/tests/xfs/089 @@ -56,7 +56,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -73,7 +73,7 @@ 
if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -98,7 +98,7 @@ echo "+ modify files (2)" for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" _repair_scratch_fs >> $seqres.full 2>&1 @@ -115,7 +115,7 @@ for x in `seq 1 64`; do test -s "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/091 b/tests/xfs/091 index 3f606f8845797d..c7857cdf1b690b 100755 --- a/tests/xfs/091 +++ b/tests/xfs/091 @@ -56,7 +56,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -73,7 +73,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 0 ${blksz}" "${TESTFILE}.${x}" >> $seqres.full 2>> $seqres.full done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -98,7 +98,7 @@ echo "+ modify files (2)" for x in `seq 1 64`; do $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" "${TESTFILE}.${x}" >> $seqres.full done -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ repair fs" _repair_scratch_fs >> $seqres.full 2>&1 @@ -115,7 +115,7 @@ for x in `seq 1 64`; do test -s "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair 
should not fail" diff --git a/tests/xfs/093 b/tests/xfs/093 index c4e8006063e121..cfb2a8c80c1770 100755 --- a/tests/xfs/093 +++ b/tests/xfs/093 @@ -55,7 +55,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -72,7 +72,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 65 70`; do touch "${TESTFILE}.${x}" 2> /dev/null && broken=0 done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "broken: ${broken}" @@ -94,7 +94,7 @@ for x in `seq 65 70`; do touch "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/097 b/tests/xfs/097 index 384c76080ddcf4..0fcf65a2a8f65a 100755 --- a/tests/xfs/097 +++ b/tests/xfs/097 @@ -58,7 +58,7 @@ for x in `seq 2 64`; do done inode="$(stat -c '%i' "${TESTFILE}.1")" agcount="$(_xfs_mount_agcount $SCRATCH_MNT)" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -74,7 +74,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then for x in `seq 65 70`; do touch "${TESTFILE}.${x}" 2> /dev/null && broken=0 done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "broken: ${broken}" @@ -93,7 +93,7 @@ for x in `seq 65 70`; do touch "${TESTFILE}.${x}" || broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/098 b/tests/xfs/098 index a47cda67e14e29..48eb3fa2b3a753 100755 --- a/tests/xfs/098 +++ b/tests/xfs/098 @@ -56,7 +56,7 @@ for x in `seq 2 64`; do touch 
"${TESTFILE}.${x}" done inode="$(stat -c '%i' "${TESTFILE}.1")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -98,7 +98,7 @@ for x in `seq 1 64`; do test $? -ne 0 && broken=1 done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/099 b/tests/xfs/099 index f5321fe3d20b1c..17e1e8df7bf751 100755 --- a/tests/xfs/099 +++ b/tests/xfs/099 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -77,7 +77,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/100 b/tests/xfs/100 index 6f465a79c926d2..dd50d984800335 100755 --- a/tests/xfs/100 +++ b/tests/xfs/100 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount 
"${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -82,7 +82,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/101 b/tests/xfs/101 index a926acb0bc6735..2abcd711b18703 100755 --- a/tests/xfs/101 +++ b/tests/xfs/101 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -77,7 +77,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n 
>> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/102 b/tests/xfs/102 index c3ddec5e432dc5..5a7c036ce55751 100755 --- a/tests/xfs/102 +++ b/tests/xfs/102 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -82,7 +82,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/105 b/tests/xfs/105 index 132aa07f8300ef..30d4dc47ec1fed 100755 --- a/tests/xfs/105 +++ b/tests/xfs/105 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -65,7 +65,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount 
"${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -82,7 +82,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/112 b/tests/xfs/112 index f0e717cf26d8c9..267432a863a92d 100755 --- a/tests/xfs/112 +++ b/tests/xfs/112 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -65,14 +65,14 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" _repair_scratch_fs >> $seqres.full 2>&1 if [ $? 
-eq 2 ]; then _scratch_mount - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" _repair_scratch_fs >> $seqres.full 2>&1 fi @@ -86,7 +86,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/113 b/tests/xfs/113 index 22ac8c3fd51b80..2f19346aa74b3d 100755 --- a/tests/xfs/113 +++ b/tests/xfs/113 @@ -44,7 +44,7 @@ node_lblk="$((64 * 1073741824 / blksz))" echo "+ make some files" __populate_create_dir "${SCRATCH_MNT}/blockdir" "${nr}" true inode="$(stat -c '%i' "${SCRATCH_MNT}/blockdir")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -86,7 +86,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then rm -rf "${SCRATCH_MNT}/blockdir/00000000" 2> /dev/null && _fail "modified corrupt directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" 2> /dev/null && _fail "add to corrupt directory" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -103,7 +103,7 @@ echo "+ modify dir (2)" mkdir -p "${SCRATCH_MNT}/blockdir" rm -rf "${SCRATCH_MNT}/blockdir/00000000" || _fail "couldn't modify repaired directory" mkdir "${SCRATCH_MNT}/blockdir/xxxxxxxx" || _fail "add to repaired directory" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/117 b/tests/xfs/117 index 0ca8f1b96ddfd9..ae73ddbebfd53b 100755 --- a/tests/xfs/117 +++ b/tests/xfs/117 @@ -65,7 +65,7 @@ for ((i = 0; i < 64; i++)); do done echo "First victim inode is: " >> $seqres.full stat -c '%i' "$fname" >> $seqres.full -umount 
"${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -85,7 +85,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then touch "$fname" &>> $seqres.full test $? -eq 0 && broken=0 done - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "broken: ${broken}" @@ -110,7 +110,7 @@ for x in `seq 1 64`; do echo "${x}: broken=${broken}" >> $seqres.full done echo "broken: ${broken}" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/120 b/tests/xfs/120 index f1f047f53a351b..9d0cc12a3e8b8d 100755 --- a/tests/xfs/120 +++ b/tests/xfs/120 @@ -45,7 +45,7 @@ for i in $(seq 1 2 ${nr}); do $XFS_IO_PROG -f -c "fpunch $((i * blksz)) ${blksz}" "${SCRATCH_MNT}/bigfile" >> $seqres.full done inode="$(stat -c '%i' "${SCRATCH_MNT}/bigfile")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -60,7 +60,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" -c 'fsync' "${SCRATCH_MNT}/bigfile" >> $seqres.full 2> /dev/null after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")" test "${before}" -eq "${after}" || _fail "pwrite should fail on corrupt bmbt" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -78,7 +78,7 @@ before="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")" $XFS_IO_PROG -f -c "pwrite -S 0x62 ${blksz} ${blksz}" -c 'fsync' "${SCRATCH_MNT}/bigfile" >> $seqres.full 2> /dev/null after="$(stat -c '%b' "${SCRATCH_MNT}/bigfile")" test "${before}" -ne "${after}" || _fail "pwrite failed after fixing corrupt bmbt" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/123 
b/tests/xfs/123 index 6b56551374cd8f..5bd3c86372058e 100755 --- a/tests/xfs/123 +++ b/tests/xfs/123 @@ -44,7 +44,7 @@ str="$(perl -e "print './' x $reps;")x" (cd $SCRATCH_MNT; ln -s "${str}" "long_symlink") cat "${SCRATCH_MNT}/long_symlink" inode="$(stat -c '%i' "${SCRATCH_MNT}/long_symlink")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -55,7 +55,7 @@ _scratch_xfs_db -x -c "inode ${inode}" -c "dblock 0" -c "stack" -c "blocktrash - echo "+ mount image" if _try_scratch_mount >> $seqres.full 2>&1; then cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -64,7 +64,7 @@ _repair_scratch_fs >> $seqres.full 2>&1 echo "+ mount image (2)" _scratch_mount cat "${SCRATCH_MNT}/long_symlink" 2>/dev/null && _fail "symlink should be broken" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/124 b/tests/xfs/124 index fe870dc96cc783..7890434b397262 100755 --- a/tests/xfs/124 +++ b/tests/xfs/124 @@ -46,7 +46,7 @@ seq 0 "${nr}" | while read d; do setfattr -n "user.x$(printf "%.08d" "$d")" -v "0000000000000000" "${SCRATCH_MNT}/attrfile" done inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -64,7 +64,7 @@ echo "+ mount image && modify xattr" if _try_scratch_mount >> $seqres.full 2>&1; then setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -80,7 +80,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/" echo "+ modify xattr (2)" getfattr "${SCRATCH_MNT}/attrfile" -n 
"user.x00000000" > /dev/null 2>&1 && (setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "remove corrupt xattr") setfattr -n "user.x00000000" -v 'x0x0x0x0' "${SCRATCH_MNT}/attrfile" || _fail "add corrupt xattr" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/125 b/tests/xfs/125 index 89e93650556e40..c3770c185b4063 100755 --- a/tests/xfs/125 +++ b/tests/xfs/125 @@ -47,7 +47,7 @@ seq 1 2 "${nr}" | while read d; do setfattr -x "user.x$(printf "%.08d" "$d")" "${SCRATCH_MNT}/attrfile" done inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" @@ -64,7 +64,7 @@ echo "+ mount image && modify xattr" if _try_scratch_mount >> $seqres.full 2>&1; then setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -80,7 +80,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/" echo "+ modify xattr (2)" setfattr -n "user.x00000000" -v "1111111111111111" "${SCRATCH_MNT}/attrfile" || _fail "modified corrupt xattr" setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "delete corrupt xattr" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/126 b/tests/xfs/126 index 5614ea398c0142..14eb2a6157e141 100755 --- a/tests/xfs/126 +++ b/tests/xfs/126 @@ -47,7 +47,7 @@ seq 1 2 "${nr}" | while read d; do setfattr -x "user.x$(printf "%.08d" "$d")" "${SCRATCH_MNT}/attrfile" done inode="$(stat -c '%i' "${SCRATCH_MNT}/attrfile")" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should 
not fail" @@ -69,7 +69,7 @@ echo "+ mount image && modify xattr" if _try_scratch_mount >> $seqres.full 2>&1; then setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" 2> /dev/null && _fail "modified corrupt xattr" - umount "${SCRATCH_MNT}" + _umount "${SCRATCH_MNT}" fi echo "+ repair fs" @@ -84,7 +84,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/" echo "+ modify xattr (2)" getfattr "${SCRATCH_MNT}/attrfile" -n "user.x00000000" 2> /dev/null && (setfattr -x "user.x00000000" "${SCRATCH_MNT}/attrfile" || _fail "modified corrupt xattr") -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || _fail "xfs_repair should not fail" diff --git a/tests/xfs/130 b/tests/xfs/130 index 3e6dd861c47851..b1792a98e57db6 100755 --- a/tests/xfs/130 +++ b/tests/xfs/130 @@ -78,7 +78,7 @@ $CHATTR_PROG -R -f -i "${SCRATCH_MNT}/" echo "+ reflink more (2)" _cp_reflink "${SCRATCH_MNT}/file1" "${SCRATCH_MNT}/file5" || \ _fail "modified refcount tree" -umount "${SCRATCH_MNT}" +_umount "${SCRATCH_MNT}" echo "+ check fs (2)" _scratch_xfs_repair -n >> "$seqres.full" 2>&1 || \ diff --git a/tests/xfs/152 b/tests/xfs/152 index 7ba00c4bfac9ff..66577cfb4617fc 100755 --- a/tests/xfs/152 +++ b/tests/xfs/152 @@ -15,7 +15,7 @@ _begin_fstest auto quick quota idmapped wipe_mounts() { - umount "${SCRATCH_MNT}/idmapped" >/dev/null 2>&1 + _umount "${SCRATCH_MNT}/idmapped" >/dev/null 2>&1 _scratch_unmount >/dev/null 2>&1 } diff --git a/tests/xfs/169 b/tests/xfs/169 index 6400fd9e6bdc8b..16c5385cf4815a 100755 --- a/tests/xfs/169 +++ b/tests/xfs/169 @@ -15,7 +15,7 @@ _begin_fstest auto clone _cleanup() { cd / - umount $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 rm -rf $tmp.* } @@ -43,7 +43,7 @@ for i in 1 2 x; do _reflink_range $testdir/file1 $((nr * blksz)) \ $testdir/file2 $((nr * blksz)) $blksz >> $seqres.full done - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount @@ -51,7 +51,7 @@ for i in 1 2 x; 
do echo "$i: Delete both files" rm -rf $testdir/file1 $testdir/file2 - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount done diff --git a/tests/xfs/206 b/tests/xfs/206 index bfd2dee939ddd7..16a734c3751194 100755 --- a/tests/xfs/206 +++ b/tests/xfs/206 @@ -18,7 +18,7 @@ _begin_fstest growfs auto quick # Override the default cleanup function. _cleanup() { - umount $tmpdir + _umount $tmpdir rmdir $tmpdir rm -f $tmp rm -f $tmpfile diff --git a/tests/xfs/216 b/tests/xfs/216 index 680239b4ef788d..149c8fdfec887d 100755 --- a/tests/xfs/216 +++ b/tests/xfs/216 @@ -52,7 +52,7 @@ _do_mkfs() -d name=$LOOP_DEV,size=${i}g $loop_mkfs_opts |grep log mount -o loop -t xfs $LOOP_DEV $LOOP_MNT echo "test write" > $LOOP_MNT/test - umount $LOOP_MNT > /dev/null 2>&1 + _umount $LOOP_MNT > /dev/null 2>&1 done } # make large holey file diff --git a/tests/xfs/217 b/tests/xfs/217 index 41caaf738267d4..30a186d7294940 100755 --- a/tests/xfs/217 +++ b/tests/xfs/217 @@ -31,7 +31,7 @@ _do_mkfs() -d name=$LOOP_DEV,size=${i}g |grep log mount -o loop -t xfs $LOOP_DEV $LOOP_MNT echo "test write" > $LOOP_MNT/test - umount $LOOP_MNT > /dev/null 2>&1 + _umount $LOOP_MNT > /dev/null 2>&1 # punch out the previous blocks so that we keep the amount of # disk space the test requires down to a minimum. 
diff --git a/tests/xfs/235 b/tests/xfs/235 index 5b201d93076952..0184ff71f2878c 100755 --- a/tests/xfs/235 +++ b/tests/xfs/235 @@ -31,7 +31,7 @@ _pwrite_byte 0x62 0 $((blksz * 64)) ${SCRATCH_MNT}/file0 >> $seqres.full _pwrite_byte 0x61 0 $((blksz * 64)) ${SCRATCH_MNT}/file1 >> $seqres.full cp -p ${SCRATCH_MNT}/file0 ${SCRATCH_MNT}/file2 cp -p ${SCRATCH_MNT}/file1 ${SCRATCH_MNT}/file3 -umount ${SCRATCH_MNT} +_umount ${SCRATCH_MNT} echo "+ check fs" _scratch_xfs_repair -n >> $seqres.full 2>&1 || \ @@ -49,7 +49,7 @@ if _try_scratch_mount >> $seqres.full 2>&1; then $XFS_IO_PROG -f -c "pwrite -S 0x63 0 $((blksz * 64))" -c "fsync" ${SCRATCH_MNT}/file4 >> $seqres.full 2>&1 test -s ${SCRATCH_MNT}/file4 && _fail "should not be able to copy with busted rmap btree" - umount ${SCRATCH_MNT} + _umount ${SCRATCH_MNT} fi echo "+ repair fs" @@ -66,7 +66,7 @@ $CHATTR_PROG -R -f -i ${SCRATCH_MNT}/ echo "+ copy more (2)" cp -p ${SCRATCH_MNT}/file1 ${SCRATCH_MNT}/file5 || \ _fail "modified rmap tree" -umount ${SCRATCH_MNT} +_umount ${SCRATCH_MNT} echo "+ check fs (2)" _scratch_xfs_repair -n >> $seqres.full 2>&1 || \ diff --git a/tests/xfs/236 b/tests/xfs/236 index a374a300d1905a..277a9a402e2e05 100755 --- a/tests/xfs/236 +++ b/tests/xfs/236 @@ -15,7 +15,7 @@ _begin_fstest auto rmap punch _cleanup() { cd / - umount $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 rm -rf $tmp.* } @@ -44,7 +44,7 @@ for i in 1 2 x; do seq 1 2 $((nr_blks - 1)) | while read nr; do $XFS_IO_PROG -c "fpunch $((nr * blksz)) $blksz" $testdir/file2 >> $seqres.full done - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount @@ -52,7 +52,7 @@ for i in 1 2 x; do echo "$i: Delete both files" rm -rf $testdir/file1 $testdir/file2 - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount done diff --git a/tests/xfs/239 b/tests/xfs/239 index bfe722c0add020..7dc9be7d2edfe0 100755 --- a/tests/xfs/239 +++ b/tests/xfs/239 @@ -66,7 +66,7 @@ md5sum $testdir/file1 | 
_filter_scratch md5sum $testdir/file2 | _filter_scratch echo "Check for damage" -umount $SCRATCH_MNT +_umount $SCRATCH_MNT _repair_scratch_fs >> $seqres.full # success, all done diff --git a/tests/xfs/241 b/tests/xfs/241 index 1532493979ffa7..a779e321417520 100755 --- a/tests/xfs/241 +++ b/tests/xfs/241 @@ -66,7 +66,7 @@ md5sum $testdir/file1 | _filter_scratch md5sum $testdir/file2 | _filter_scratch echo "Check for damage" -umount $SCRATCH_MNT +_umount $SCRATCH_MNT _repair_scratch_fs >> $seqres.full # success, all done diff --git a/tests/xfs/250 b/tests/xfs/250 index f8846be6e197aa..82ab08d65192e7 100755 --- a/tests/xfs/250 +++ b/tests/xfs/250 @@ -13,7 +13,7 @@ _begin_fstest auto quick rw prealloc metadata _cleanup() { cd / - umount $LOOP_MNT 2>/dev/null + _umount $LOOP_MNT 2>/dev/null rm -f $LOOP_DEV rmdir $LOOP_MNT } @@ -60,7 +60,7 @@ _test_loop() $XFS_IO_PROG -f -c "resvsp 0 $fsize" $LOOP_MNT/foo | _filter_io echo "*** unmount loop filesystem" - umount $LOOP_MNT > /dev/null 2>&1 + _umount $LOOP_MNT > /dev/null 2>&1 echo "*** check loop filesystem" _check_xfs_filesystem $LOOP_DEV none none diff --git a/tests/xfs/265 b/tests/xfs/265 index 21de4c054a573f..2ba7342d066bb6 100755 --- a/tests/xfs/265 +++ b/tests/xfs/265 @@ -16,7 +16,7 @@ _begin_fstest auto clone _cleanup() { cd / - umount $SCRATCH_MNT > /dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 rm -rf $tmp.* } @@ -51,7 +51,7 @@ for i in 1 2 x; do truncate -s $((blksz * (nr_blks - nr))) $testdir/file1.$nr >> $seqres.full done - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount @@ -60,7 +60,7 @@ for i in 1 2 x; do echo "$i: Delete both files" rm -rf $testdir mkdir -p $testdir - umount $SCRATCH_MNT + _umount $SCRATCH_MNT _check_scratch_fs _scratch_mount done diff --git a/tests/xfs/310 b/tests/xfs/310 index 34d17be97f36dd..f2a7ca50f67199 100755 --- a/tests/xfs/310 +++ b/tests/xfs/310 @@ -13,7 +13,7 @@ _begin_fstest auto clone rmap prealloc _cleanup() { cd / - umount $SCRATCH_MNT > 
/dev/null 2>&1 + _umount $SCRATCH_MNT > /dev/null 2>&1 _dmhugedisk_cleanup rm -rf $tmp.* } @@ -53,7 +53,7 @@ $XFS_IO_PROG -f -c "falloc 0 $((nr_blks * blksz))" $testdir/file1 >> $seqres.ful echo "Check extent count" xfs_bmap -l -p -v $testdir/file1 | grep '^[[:space:]]*2:' -q && xfs_bmap -l -p -v $testdir/file1 inum=$(stat -c '%i' $testdir/file1) -umount $SCRATCH_MNT +_umount $SCRATCH_MNT echo "Check bmap count" nr_bmaps=$(xfs_db -c "inode $inum" -c "bmap" $DMHUGEDISK_DEV | grep 'data offset' | wc -l) diff --git a/tests/xfs/716 b/tests/xfs/716 index cd4fffef298d31..55c66d1cf8bb19 100755 --- a/tests/xfs/716 +++ b/tests/xfs/716 @@ -49,7 +49,7 @@ ino=$(stat -c '%i' $file) # Figure out how many extents we need to have to create a data fork that's in # btree format. -umount $SCRATCH_MNT +_umount $SCRATCH_MNT di_forkoff=$(_scratch_xfs_db -c "inode $ino" -c "p core.forkoff" | \ awk '{print $3}') _scratch_xfs_db -c "inode $ino" -c "p" >> $seqres.full @@ -61,7 +61,7 @@ $XFS_IO_PROG -c "falloc 0 $(( (min_ext_for_btree + 1) * 2 * blksz))" $file $here/src/punch-alternating $file # Make sure the data fork is in btree format. -umount $SCRATCH_MNT +_umount $SCRATCH_MNT _scratch_xfs_db -c "inode $ino" -c "p core.format" | grep -q "btree" || \ echo "data fork not in btree format?" echo "about to start test" >> $seqres.full ^ permalink raw reply related [flat|nested] 110+ messages in thread
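For readers skimming the mechanical conversion above: routing every unmount through one wrapper gives the test suite a single place to hook extra behavior later. The following is a hypothetical sketch only. The real `_umount` is defined in patch 1/6, which is not quoted in this message, and it may well differ. `UMOUNT_PROG` defaulting to plain `umount` is an assumption made for the sketch.

```shell
#!/bin/bash
# Hypothetical sketch; NOT the helper from patch 1/6.
# Assumption: UMOUNT_PROG names the umount binary, falling back to "umount".
UMOUNT_PROG="${UMOUNT_PROG:-umount}"

_umount()
{
	# Forward every argument untouched so callers keep their own flags
	# and redirections, exactly as they did when calling umount directly.
	"$UMOUNT_PROG" "$@"
}
```

Because the wrapper passes its arguments straight through, a call such as `_umount $SCRATCH_MNT > /dev/null 2>&1` behaves the same as the line it replaced.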
* [PATCH 3/6] xfs: test health monitoring code
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:57 ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong
2024-12-31 23:57 ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong
@ 2024-12-31 23:57 ` Darrick J. Wong
2024-12-31 23:57 ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add some functionality tests for the new health monitoring code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 doc/group-names.txt |    1 +
 tests/xfs/1885      |   53 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1885.out  |    5 +++++
 3 files changed, 59 insertions(+)
 create mode 100755 tests/xfs/1885
 create mode 100644 tests/xfs/1885.out

diff --git a/doc/group-names.txt b/doc/group-names.txt
index b04d0180e8ec02..8fbb260d8c7bb5 100644
--- a/doc/group-names.txt
+++ b/doc/group-names.txt
@@ -117,6 +117,7 @@ samefs			overlayfs when all layers are on the same fs
 scrub			filesystem metadata scrubbers
 seed			btrfs seeded filesystems
 seek			llseek functionality
+selfhealing		self healing filesystem code
 selftest		tests with fixed results, used to validate testing setup
 send			btrfs send/receive
 shrinkfs		decreasing the size of a filesystem
diff --git a/tests/xfs/1885 b/tests/xfs/1885
new file mode 100755
index 00000000000000..1b87af3a9178fc
--- /dev/null
+++ b/tests/xfs/1885
@@ -0,0 +1,53 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test 1885
+#
+# Make sure that healthmon handles module refcount correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/module
+
+refcount_file="/sys/module/xfs/refcnt"
+test -e "$refcount_file" || _notrun "cannot find xfs module refcount"
+
+_require_test
+_require_xfs_io_command healthmon
+
+# Capture mod refcount without the test fs mounted
+_test_unmount
+init_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with the test fs mounted
+_test_mount
+nomon_mount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with test fs mounted and the healthmon fd open.
+# Pause the xfs_io process so that it doesn't actually respond to events.
+$XFS_IO_PROG -c 'healthmon -c -v' $TEST_DIR &
+sleep 0.5
+kill -STOP %1
+mon_mount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount with only the healthmon fd open.
+_test_unmount
+mon_nomount_refcount="$(cat "$refcount_file")"
+
+# Capture mod refcount after continuing healthmon (which should exit due to the
+# unmount) and killing it.
+kill -CONT %1
+kill %1
+wait
+nomon_nomount_refcount="$(cat "$refcount_file")"
+
+_within_tolerance "mount refcount" "$nomon_mount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "mount + healthmon refcount" "$mon_mount_refcount" "$((init_refcount + 2))" 0 -v
+_within_tolerance "healthmon refcount" "$mon_nomount_refcount" "$((init_refcount + 1))" 0 -v
+_within_tolerance "end refcount" "$nomon_nomount_refcount" "$init_refcount" 0 -v
+
+status=0
+exit
diff --git a/tests/xfs/1885.out b/tests/xfs/1885.out
new file mode 100644
index 00000000000000..f152cef0525609
--- /dev/null
+++ b/tests/xfs/1885.out
@@ -0,0 +1,5 @@
+QA output created by 1885
+mount refcount is in range
+mount + healthmon refcount is in range
+healthmon refcount is in range
+end refcount is in range

^ permalink raw reply related [flat|nested] 110+ messages in thread
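The four `_within_tolerance` checks in xfs/1885 boil down to simple bookkeeping: the test expects each mount and each open healthmon fd to pin the xfs module exactly once. A rough sketch of that arithmetic is below. `expected_refcount` and the baseline value are invented for illustration and are not part of fstests.

```shell
#!/bin/bash
# Sketch of the accounting xfs/1885 asserts. expected_refcount is a made-up
# helper; the baseline of 3 is an arbitrary example value.
expected_refcount()
{
	local base="$1"		# module refcount with nothing of ours active
	local mounted="$2"	# 1 if the test fs is mounted, else 0
	local monitored="$3"	# 1 if a healthmon fd is open, else 0

	echo "$((base + mounted + monitored))"
}

base=3
expected_refcount "$base" 1 0	# fs mounted, no monitor
expected_refcount "$base" 1 1	# fs mounted, monitor fd open
expected_refcount "$base" 0 1	# fs unmounted, monitor fd still open
expected_refcount "$base" 0 0	# everything released
```

The third case is the interesting one: the test relies on the monitor fd alone keeping the module pinned after the unmount.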
* [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
` (2 preceding siblings ...)
2024-12-31 23:57 ` [PATCH 3/6] xfs: test health monitoring code Darrick J. Wong
@ 2024-12-31 23:57 ` Darrick J. Wong
2024-12-31 23:58 ` [PATCH 5/6] xfs: test io " Darrick J. Wong
2024-12-31 23:58 ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong
5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:57 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check if we can detect runtime metadata corruptions via the health
monitor.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc          |   10 ++++++
 tests/xfs/1879     |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1879.out |   12 +++++++
 3 files changed, 111 insertions(+)
 create mode 100755 tests/xfs/1879
 create mode 100644 tests/xfs/1879.out

diff --git a/common/rc b/common/rc
index 0d5c785cecc017..dd6857461e14dd 100644
--- a/common/rc
+++ b/common/rc
@@ -2850,6 +2850,16 @@ _require_xfs_io_command()
 		echo $testio | grep -q "Inappropriate ioctl" && \
 			_notrun "xfs_io $command support is missing"
 		;;
+	"healthmon")
+		testio=`$XFS_IO_PROG -c "$command -p $param" $TEST_DIR 2>&1`
+		echo $testio | grep -q "bad argument count" && \
+			_notrun "xfs_io $command $param support is missing"
+		echo $testio | grep -q "Inappropriate ioctl" && \
+			_notrun "xfs_io $command $param ioctl support is missing"
+		echo $testio | grep -q "Operation not supported" && \
+			_notrun "xfs_io $command $param kernel support is missing"
+		param_checked="$param"
+		;;
 	"label")
 		testio=`$XFS_IO_PROG -c "label" $TEST_DIR 2>&1`
 		;;
diff --git a/tests/xfs/1879 b/tests/xfs/1879
new file mode 100755
index 00000000000000..aab7bf9fa1f6e4
--- /dev/null
+++ b/tests/xfs/1879
@@ -0,0 +1,89 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test No. 1879
+#
+# Corrupt some metadata and try to access it with the health monitoring program
+# running. Check that healthmon observes a metadata error.
+#
+. ./common/preamble
+_begin_fstest auto quick eio
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $testdir
+}
+
+. ./common/filter
+
+_require_scratch_nocheck
+_require_xfs_io_command healthmon
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs -d agcount=1 | _filter_mkfs 2> $tmp.mkfs >> $seqres.full
+. $tmp.mkfs
+_scratch_mount
+mkdir $SCRATCH_MNT/a/
+# Enough entries to get to a single block directory
+for ((i = 0; i < ( (isize + 255) / 256); i++)); do
+	path="$(printf "%s/a/%0255d" "$SCRATCH_MNT" "$i")"
+	touch "$path"
+done
+inum="$(stat -c %i "$SCRATCH_MNT/a")"
+_scratch_unmount
+
+# Fuzz the directory block so that the touch below will be guaranteed to trip
+# a runtime sickness report in exactly the manner we desire.
+_scratch_xfs_db -x -c "inode $inum" -c "dblock 0" -c 'fuzz bhdr.hdr.owner add' -c print &>> $seqres.full
+
+# Try to allocate space to trigger a metadata corruption event
+echo "Runtime corruption detection"
+_scratch_mount
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+sleep 1 # wait for python program to start up
+touch $SCRATCH_MNT/a/farts &>> $seqres.full
+_scratch_unmount
+
+wait # for healthmon to finish
+
+# Did we get errors?
+filter_healthmon()
+{
+	cat $tmp.healthmon >> $seqres.full
+	grep -A2 -E '(sick|corrupt)' $tmp.healthmon | grep -v -- '--' | sort | uniq
+}
+filter_healthmon
+
+# Run scrub to trigger a health event from there too.
+echo "Scrub corruption detection"
+_scratch_mount
+if _supports_xfs_scrub $SCRATCH_MNT $SCRATCH_DEV; then
+	$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT > $tmp.healthmon &
+	sleep 1 # wait for python program to start up
+	$XFS_SCRUB_PROG -n $SCRATCH_MNT &>> $seqres.full
+	_scratch_unmount
+
+	wait # for healthmon to finish
+
+	# Did we get errors?
+	filter_healthmon
+else
+	# mock the output since we don't support scrub
+	_scratch_unmount
+	cat << ENDL
+  "domain": "inode",
+  "structures": ["directory"],
+  "structures": ["parent"],
+  "type": "corrupt",
+  "type": "sick",
+ENDL
+fi
+
+status=0
+exit
diff --git a/tests/xfs/1879.out b/tests/xfs/1879.out
new file mode 100644
index 00000000000000..f02eefbf58ad6c
--- /dev/null
+++ b/tests/xfs/1879.out
@@ -0,0 +1,12 @@
+QA output created by 1879
+Format and mount
+Runtime corruption detection
+  "domain": "inode",
+  "structures": ["directory"],
+  "type": "sick",
+Scrub corruption detection
+  "domain": "inode",
+  "structures": ["directory"],
+  "structures": ["parent"],
+  "type": "corrupt",
+  "type": "sick",

^ permalink raw reply related [flat|nested] 110+ messages in thread
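The `filter_healthmon` pipeline in xfs/1879 is what turns a noisy event stream into the stable golden output: it keeps each "sick"/"corrupt" line plus the two lines after it, drops the `--` group separators that grep emits between non-adjacent matches, and deduplicates. A stand-alone sketch of that behavior follows. The event text is invented sample data shaped to resemble the golden output, not captured `xfs_io healthmon` output, and `filter_events` is a stdin-reading variant written for this example.

```shell
#!/bin/bash
# Demo of the grep/sort/uniq pipeline used by filter_healthmon in xfs/1879.
# ASSUMPTION: the sample events below are made up; the real monitor output
# format may differ in field order and content.
sample_events='{
  "type": "sick",
  "domain": "inode",
  "structures": ["directory"],
  "inumber": 131
}
{
  "type": "sick",
  "domain": "inode",
  "structures": ["directory"],
  "inumber": 131
}'

filter_events()
{
	# -A2 keeps two lines of trailing context per match; the second grep
	# removes the "--" separators printed between match groups.
	grep -A2 -E '(sick|corrupt)' | grep -v -- '--' | LC_ALL=C sort | uniq
}

printf '%s\n' "$sample_events" | filter_events
```

Note that the approach depends on the domain and structures lines sitting within two lines of the type line; anything further away (the `inumber` line in this sample) is deliberately left out of the comparison.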
* [PATCH 5/6] xfs: test io error reporting via healthmon
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
` (3 preceding siblings ...)
2024-12-31 23:57 ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong
@ 2024-12-31 23:58 ` Darrick J. Wong
2024-12-31 23:58 ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong
5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new test to make sure the kernel can report IO errors via
health monitoring.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1878     |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1878.out |   10 +++++++
 2 files changed, 90 insertions(+)
 create mode 100755 tests/xfs/1878
 create mode 100644 tests/xfs/1878.out

diff --git a/tests/xfs/1878 b/tests/xfs/1878
new file mode 100755
index 00000000000000..882d0dcca03cb1
--- /dev/null
+++ b/tests/xfs/1878
@@ -0,0 +1,80 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test No. 1878
+#
+# Attempt to read and write a file in buffered and directio mode with the
+# health monitoring program running. Check that healthmon observes all four
+# types of IO errors.
+#
+. ./common/preamble
+_begin_fstest auto quick eio
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.* $testdir
+	_dmerror_cleanup
+}
+
+. ./common/filter
+. ./common/dmerror
+
+_require_scratch_nocheck
+_require_xfs_io_command healthmon
+_require_dm_target error
+
+# Disable the scratch rt device to avoid test failures relating to the rt
+# bitmap consuming all the free space in our small data device.
+unset SCRATCH_RTDEV
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_dmerror_init no_log
+_dmerror_mount
+
+_require_fs_space $SCRATCH_MNT 65536
+
+# Create a file with written regions far enough apart that the pagecache can't
+# possibly be caching the regions with a single folio.
+testfile=$SCRATCH_MNT/fsync-err-test
+$XFS_IO_PROG -f \
+	-c 'pwrite -b 1m 0 1m' \
+	-c 'pwrite -b 1m 10g 1m' \
+	-c 'pwrite -b 1m 20g 1m' \
+	-c fsync $testfile >> $seqres.full
+
+# First we check if directio errors get reported
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for python program to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -d -c 'pwrite -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -d -c 'pread -b 256k 10g 16k' $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+_dmerror_mount
+
+# Next we check if buffered io errors get reported. We have to write something
+# before loading the error table to ensure the dquots get loaded.
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 1k' -c fsync $testfile >> $seqres.full
+$XFS_IO_PROG -c 'healthmon -c -v' $SCRATCH_MNT >> $tmp.healthmon &
+sleep 1	# wait for python program to start up
+_dmerror_load_error_table
+$XFS_IO_PROG -c 'pread -b 256k 12k 16k' $testfile >> $seqres.full
+$XFS_IO_PROG -c 'pwrite -b 256k 20g 16k' -c fsync $testfile >> $seqres.full
+_dmerror_load_working_table
+
+_dmerror_unmount
+wait	# for healthmon to finish
+
+# Did we get errors?
+cat $tmp.healthmon >> $seqres.full
+grep -E '(diowrite|dioread|readahead|writeback)' $tmp.healthmon | sort | uniq
+
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/xfs/1878.out b/tests/xfs/1878.out
new file mode 100644
index 00000000000000..a8070c3c1afd23
--- /dev/null
+++ b/tests/xfs/1878.out
@@ -0,0 +1,10 @@
+QA output created by 1878
+Format and mount
+pwrite: Input/output error
+pread: Input/output error
+pread: Input/output error
+fsync: Input/output error
+ "type": "dioread",
+ "type": "diowrite",
+ "type": "readahead",
+ "type": "writeback",
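The last step of the test above reduces the healthmon log to one line per
I/O error type so that the golden output does not depend on how many events
of each kind the kernel emits. A minimal sketch of that filter as a
standalone function follows. The function name summarize_io_events and the
sample log contents are made up for illustration; only the grep pattern and
the sort | uniq pipeline come from the test.

```shell
#!/bin/bash
# Hypothetical helper (not part of fstests): print each distinct I/O error
# event type found in a healthmon log exactly once, in sorted order.
summarize_io_events() {
	local logfile="$1"

	grep -E '(diowrite|dioread|readahead|writeback)' "$logfile" | sort | uniq
}

# Made-up sample log; real healthmon output carries more fields per event.
sample="$(mktemp)"
cat > "$sample" << 'ENDL'
 "type": "diowrite",
 "domain": "example",
 "type": "dioread",
 "type": "diowrite",
 "type": "writeback",
 "type": "readahead",
ENDL

summarize_io_events "$sample"
rm -f "$sample"
```

Duplicate events collapse to a single line and unrelated fields are dropped,
which is why the golden output lists each of the four types once.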
* [PATCH 6/6] xfs: test new xfs_scrubbed daemon
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
` (4 preceding siblings ...)
2024-12-31 23:58 ` [PATCH 5/6] xfs: test io " Darrick J. Wong
@ 2024-12-31 23:58 ` Darrick J. Wong
5 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make sure the daemon in charge of self healing xfs actually does what it
says it does.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/config      |    6 ++++
 common/systemd     |    9 +++++
 common/xfs         |   16 ++++++++++
 tests/xfs/1882     |   64 ++++++++++++++++++++++++++++++++++++++
 tests/xfs/1882.out |    2 +
 tests/xfs/1883     |   75 +++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1883.out |    2 +
 tests/xfs/1884     |   87 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/1884.out |    2 +
 9 files changed, 263 insertions(+)
 create mode 100755 tests/xfs/1882
 create mode 100644 tests/xfs/1882.out
 create mode 100755 tests/xfs/1883
 create mode 100644 tests/xfs/1883.out
 create mode 100755 tests/xfs/1884
 create mode 100644 tests/xfs/1884.out

diff --git a/common/config b/common/config
index fcff0660b05a97..2b3f946f3d308d 100644
--- a/common/config
+++ b/common/config
@@ -166,6 +166,12 @@ export XFS_ADMIN_PROG="$(type -P xfs_admin)"
 export XFS_GROWFS_PROG=$(type -P xfs_growfs)
 export XFS_SPACEMAN_PROG="$(type -P xfs_spaceman)"
 export XFS_SCRUB_PROG="$(type -P xfs_scrub)"
+XFS_SCRUBBED_PROG="$(type -P xfs_scrubbed)"
+# Normally the scrubbed daemon is installed in libexec
+if [ -z "$XFS_SCRUBBED_PROG" ] && [ -e /usr/libexec/xfs_scrubbed ]; then
+	XFS_SCRUBBED_PROG=/usr/libexec/xfs_scrubbed
+fi
+export XFS_SCRUBBED_PROG
 export XFS_PARALLEL_REPAIR_PROG="$(type -P xfs_prepair)"
 export XFS_PARALLEL_REPAIR64_PROG="$(type -P xfs_prepair64)"
 export __XFSDUMP_PROG="$(type -P xfsdump)"
diff --git a/common/systemd b/common/systemd
index b2e24f267b2d93..8366d4cba39d85 100644
--- a/common/systemd
+++ b/common/systemd
@@ -71,3 +71,12 @@ _systemd_unit_status() {
 	_systemd_installed || return 1
 	systemctl status "$1"
 }
+
+# Start a systemd unit
+_systemd_unit_start() {
+	systemctl start "$1"
+}
+# Stop a running systemd unit
+_systemd_unit_stop() {
+	systemctl stop "$1"
+}
diff --git a/common/xfs b/common/xfs
index b9e897e0e8839a..b4f69403e7396e 100644
--- a/common/xfs
+++ b/common/xfs
@@ -2224,3 +2224,19 @@ _scratch_find_rt_metadir_entry() {

 	return 1
 }
+
+# Run the xfs_scrubbed self healing daemon
+_scratch_xfs_scrubbed() {
+	local scrubbed_args=()
+	local daemon_dir
+	daemon_dir=$(dirname "$XFS_SCRUBBED_PROG")
+
+	# If we're being run from a development branch, we might need to find
+	# the schema file on our own.
+	local maybe_schema="$daemon_dir/../libxfs/xfs_healthmon.schema.json"
+	if [ -f "$maybe_schema" ]; then
+		scrubbed_args+=(--event-schema "$maybe_schema")
+	fi
+
+	$XFS_SCRUBBED_PROG "${scrubbed_args[@]}" "$@" $SCRATCH_MNT
+}
diff --git a/tests/xfs/1882 b/tests/xfs/1882
new file mode 100755
index 00000000000000..b6a8bd545dbcf5
--- /dev/null
+++ b/tests/xfs/1882
@@ -0,0 +1,64 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test 1882
+#
+# Make sure that xfs_scrubbed correctly handles all the reports that it gets
+# from the kernel. We simulate this by using the --everything mode so we get
+# all the events, not just the sickness reports.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+. ./common/populate
+
+_require_scrub
+_require_xfs_io_command "scrub"	# online check support
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+# Does this fs support health monitoring?
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_scratch_xfs_scrubbed --check || \
+	_notrun "health monitoring not supported on this kernel"
+_scratch_xfs_scrubbed --require-validation --check && \
+	_notrun "skipping this test in favor of the one that does json validation"
+_scratch_unmount
+
+# Create a sample fs with all the goodies
+_scratch_populate_cached nofill &>> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
+	_systemd_unit_stop "xfs_scrubbed@${scratch_path}" &>> $seqres.full
+fi
+
+# Start the health monitor, have it log everything
+_scratch_xfs_scrubbed --everything --log > $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Run scrub to make some noise
+_scratch_scrub -b -n >> $seqres.full
+
+# Unmount fs to kill scrubbed, then wait for it to finish
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+
+cat $tmp.scrubbed >> $seqres.full
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1882.out b/tests/xfs/1882.out
new file mode 100644
index 00000000000000..9b31ccb735cabd
--- /dev/null
+++ b/tests/xfs/1882.out
@@ -0,0 +1,2 @@
+QA output created by 1882
+Silence is golden
diff --git a/tests/xfs/1883 b/tests/xfs/1883
new file mode 100755
index 00000000000000..9bba989386b37e
--- /dev/null
+++ b/tests/xfs/1883
@@ -0,0 +1,75 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test 1883
+#
+# Make sure that xfs_scrubbed correctly validates the json events that it gets
+# from the kernel. We simulate this by using the --everything mode so we get
+# all the events, not just the sickness reports.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+. ./common/populate
+
+_require_scrub
+_require_xfs_io_command "scrub"	# online check support
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+# Does this fs support health monitoring?
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_scratch_xfs_scrubbed --require-validation --check || \
+	_notrun "health monitoring with validation not supported on this kernel"
+_scratch_unmount
+
+# Create a sample fs with all the goodies
+_scratch_populate_cached nofill &>> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	scratch_path=$(systemd-escape --path "$SCRATCH_MNT")
+	_systemd_unit_stop "xfs_scrubbed@${scratch_path}" &>> $seqres.full
+fi
+
+# Start the health monitor, have it validate everything
+_scratch_xfs_scrubbed --require-validation --everything --debug-fast --log &> $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Run scrub to make some noise
+_scratch_scrub -b -n >> $seqres.full
+
+# Wait for up to 60 seconds for the log file to stop growing
+old_logsz=
+new_logsz=$(stat -c '%s' $tmp.scrubbed)
+for ((i = 0; i < 60; i++)); do
+	test "$old_logsz" = "$new_logsz" && break
+	old_logsz="$new_logsz"
+	sleep 1
+	new_logsz=$(stat -c '%s' $tmp.scrubbed)
+done
+
+# Unmount fs to kill scrubbed, then wait for it to finish
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+
+# Look for schema validation errors
+grep -q 'not valid under any of the given schemas' $tmp.scrubbed && \
+	echo "Should not have found schema validation errors"
+cat $tmp.scrubbed >> $seqres.full
+
+echo Silence is golden
+status=0
+exit
diff --git a/tests/xfs/1883.out b/tests/xfs/1883.out
new file mode 100644
index 00000000000000..bc9c390c778b6e
--- /dev/null
+++ b/tests/xfs/1883.out
@@ -0,0 +1,2 @@
+QA output created by 1883
+Silence is golden
diff --git a/tests/xfs/1884 b/tests/xfs/1884
new file mode 100755
index 00000000000000..fc6e0a48372fda
--- /dev/null
+++ b/tests/xfs/1884
@@ -0,0 +1,87 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+#
+# FS QA Test 1884
+#
+# Ensure that autonomous self healing fixes the filesystem correctly.
+#
+. ./common/preamble
+_begin_fstest auto selfhealing
+
+. ./common/filter
+. ./common/fuzzy
+. ./common/systemd
+
+_require_scrub
+_require_xfs_io_command "repair"	# online repair support
+_require_xfs_db_command "blocktrash"
+_require_command "$XFS_SCRUBBED_PROG" "xfs_scrubbed"
+_require_scratch
+
+_scratch_mkfs >> $seqres.full
+_scratch_mount
+
+_xfs_has_feature $SCRATCH_MNT parent || \
+	_notrun "parent pointers required to test directory auto-repair"
+_scratch_xfs_scrubbed --repair --check || \
+	_notrun "health monitoring with repair not supported on this kernel"
+
+# Create a largeish directory
+dblksz=$(_xfs_get_dir_blocksize "$SCRATCH_MNT")
+echo testdata > $SCRATCH_MNT/a
+mkdir -p "$SCRATCH_MNT/some/victimdir"
+for ((i = 0; i < (dblksz / 255); i++)); do
+	fname="$(printf "%0255d" "$i")"
+	ln $SCRATCH_MNT/a $SCRATCH_MNT/some/victimdir/$fname
+done
+
+# Did we get at least two dir blocks?
+dirsize=$(stat -c '%s' $SCRATCH_MNT/some/victimdir)
+test "$dirsize" -gt "$dblksz" || echo "failed to create two-block directory"
+
+# Break the directory, remount filesystem
+_scratch_unmount
+_scratch_xfs_db -x \
+	-c 'path /some/victimdir' \
+	-c 'bmap' \
+	-c 'dblock 1' \
+	-c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' >> $seqres.full
+_scratch_mount
+
+# If the system xfsprogs has self healing enabled, we need to shut down the
+# daemon before we try to capture things.
+if _systemd_is_running; then
+	svcname="xfs_scrubbed@$(systemd-escape --path "$SCRATCH_MNT")"
+	echo "$svcname: $(systemctl is-active "$svcname")" >> $seqres.full
+	_systemd_unit_stop "$svcname" &>> $seqres.full
+fi
+
+# Start the health monitor, have it repair everything reported corrupt
+_scratch_xfs_scrubbed --repair --log > $tmp.scrubbed &
+scrubbed_pid=$!
+sleep 1
+
+# Access the broken directory to trigger a repair, then poll the directory
+# for 5 seconds to see if it gets fixed without us needing to intervene.
+ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+_filter_scratch < $tmp.err
+try=0
+while [ $try -lt 50 ] && grep -q 'Structure needs cleaning' $tmp.err; do
+	echo "try $try saw corruption" >> $seqres.full
+	sleep 0.1
+	ls $SCRATCH_MNT/some/victimdir > /dev/null 2> $tmp.err
+	try=$((try + 1))
+done
+_filter_scratch < $tmp.err
+
+# Unmount fs to kill scrubbed, then wait for it to finish.
+while ! _scratch_unmount &>/dev/null; do
+	sleep 0.5
+done
+kill $scrubbed_pid
+wait
+cat $tmp.scrubbed >> $seqres.full
+
+status=0
+exit
diff --git a/tests/xfs/1884.out b/tests/xfs/1884.out
new file mode 100644
index 00000000000000..929e33da01f92c
--- /dev/null
+++ b/tests/xfs/1884.out
@@ -0,0 +1,2 @@
+QA output created by 1884
+ls: reading directory 'SCRATCH_MNT/some/victimdir': Structure needs cleaning
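Test 1883 above decides that the daemon has gone quiet by polling the log
file size until two consecutive samples match. A minimal sketch of that
loop as a reusable function might look like the following. The name
wait_for_quiet_log and its poll-count and interval parameters are
assumptions made for illustration; the patch open-codes the loop with a
fixed 60 polls at one second each. Like the test, the sketch relies on GNU
stat.

```shell
#!/bin/bash
# Hypothetical helper (not in the patch): return 0 once the size of $1 is
# the same on two consecutive polls, or 1 if it never settles within
# $2 polls spaced $3 seconds apart.
wait_for_quiet_log() {
	local logfile="$1"
	local max_polls="${2:-60}"
	local interval="${3:-1}"
	local old_sz=
	local new_sz
	local i

	new_sz=$(stat -c '%s' "$logfile")
	for ((i = 0; i < max_polls; i++)); do
		# The first pass never matches because old_sz starts out empty.
		test "$old_sz" = "$new_sz" && return 0
		old_sz="$new_sz"
		sleep "$interval"
		new_sz=$(stat -c '%s' "$logfile")
	done
	return 1
}

# Demo on a file that nobody is writing to.
demo_log="$(mktemp)"
echo "some log line" > "$demo_log"
if wait_for_quiet_log "$demo_log" 5 0.1; then
	echo "log is quiet"
fi
rm -f "$demo_log"
```

Returning a status lets a caller tell a settled log apart from a timeout,
which the open-coded loop in the test does not distinguish.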
* [PATCHSET 5/5] fstests: add difficult V5 features to filesystems
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (13 preceding siblings ...)
2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong
@ 2024-12-31 23:35 ` Darrick J. Wong
2024-12-31 23:58 ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
` (2 more replies)
2025-01-02 1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
15 siblings, 3 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:35 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

Hi all,

This series enables xfs_repair to add select features to existing V5
filesystems. Specifically, one can add free inode btrees, reflink
support, and reverse mapping.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-newer-features

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=upgrade-newer-features
---
Commits in this patchset:
 * xfs/1856: add metadir upgrade to test matrix
 * xfs/1856: add rtrmapbt upgrade to test matrix
 * xfs/1856: add rtreflink upgrade to test matrix
---
 tests/xfs/1856 |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)
* [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
@ 2024-12-31 23:58 ` Darrick J. Wong
2024-12-31 23:58 ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong
2024-12-31 23:59 ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong
2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add metadata directory trees to the features that this test will try to
upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |    1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/xfs/1856 b/tests/xfs/1856
index 7524a449c3af00..fedeb157dbd9bb 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -188,6 +188,7 @@ else
 	check_repair_upgrade reflink && FEATURES+=("reflink")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
+	check_repair_upgrade metadir && FEATURES+=("metadir")
 fi

 test "${#FEATURES[@]}" -eq 0 && \
* [PATCH 2/3] xfs/1856: add rtrmapbt upgrade to test matrix
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
2024-12-31 23:58 ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
@ 2024-12-31 23:58 ` Darrick J. Wong
2024-12-31 23:59 ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong
2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:58 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add realtime reverse mapping btrees to the features that this test will
try to upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |   40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/tests/xfs/1856 b/tests/xfs/1856
index fedeb157dbd9bb..8e3213da752348 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -30,11 +30,47 @@ rt_configured()
 	test "$USE_EXTERNAL" = "yes" && test -n "$SCRATCH_RTDEV"
 }

+# Does mkfs support metadir?
+supports_metadir()
+{
+	$MKFS_XFS_PROG 2>&1 | grep -q 'metadir='
+}
+
+# Do we need to enable metadir at mkfs time to support a feature upgrade test?
+need_metadir()
+{
+	local feat="$1"
+
+	# if realtime isn't configured, we don't need metadir
+	rt_configured || return 1
+
+	# If we don't even know what realtime rmap is, we don't need rt groups
+	# and hence don't need metadir.
+	test -z "${FEATURE_STATE["rmapbt"]}" && return 1
+
+	# rt rmap btrees require metadir, but metadir cannot be added to an
+	# existing rt filesystem. Force it on at mkfs time.
+	test "${FEATURE_STATE["rmapbt"]}" -eq 1 && return 0
+	test "$feat" = "rmapbt" && return 0
+
+	return 1
+}
+
 # Compute the MKFS_OPTIONS string for a particular feature upgrade test
 compute_mkfs_options()
 {
+	local feat="$1"
 	local m_opts=""
 	local caller_options="$MKFS_OPTIONS"
+	local metadir
+
+	need_metadir "$feat" && metadir=1
+	if echo "$caller_options" | grep -q 'metadir='; then
+		test -z "$metadir" && metadir=0
+		caller_options="$(echo "$caller_options" | sed -e 's/metadir=*[0-9]*/metadir='$metadir'/g')"
+	elif [ -n "$metadir" ]; then
+		caller_options="$caller_options -m metadir=$metadir"
+	fi

 	for feat in "${FEATURES[@]}"; do
 		local feat_state="${FEATURE_STATE["${feat}"]}"
@@ -179,9 +215,11 @@ MKFS_OPTIONS="$(qerase_mkfs_options)"
 # upgrade don't spread failure to the rest of the tests.
 FEATURES=()
 if rt_configured; then
+	# rmap wasn't added to rt devices until after metadir
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
+	supports_metadir && check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
 else
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
@@ -204,7 +242,7 @@ for feat in "${FEATURES[@]}"; do

 	upgrade_start_message "$feat" | _tee_kernlog $seqres.full > /dev/null

-	opts="$(compute_mkfs_options)"
+	opts="$(compute_mkfs_options "$feat")"
 	echo "mkfs.xfs $opts" >> $seqres.full

 	# Format filesystem
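The metadir handling added to compute_mkfs_options above has three cases:
rewrite an existing metadir= setting, append one when the test needs it, or
leave the options alone. A minimal standalone sketch of that decision
follows. The function name force_metadir_option is hypothetical, and the
sed pattern is tightened to 'metadir=[0-9]*' here as an assumption about
the intent; the patch itself uses 'metadir=*[0-9]*'.

```shell
#!/bin/bash
# Hypothetical standalone version of the metadir option rewriting.
# $1 is the caller's mkfs option string; $2 is "1" if the test needs
# metadir and empty otherwise.
force_metadir_option() {
	local opts="$1"
	local metadir="$2"

	if printf '%s\n' "$opts" | grep -q 'metadir='; then
		# The caller already set metadir; override it either way.
		test -z "$metadir" && metadir=0
		opts="$(printf '%s\n' "$opts" | \
			sed -e 's/metadir=[0-9]*/metadir='"$metadir"'/g')"
	elif [ -n "$metadir" ]; then
		# Not mentioned by the caller, but the test needs it.
		opts="$opts -m metadir=$metadir"
	fi
	printf '%s\n' "$opts"
}

force_metadir_option "-m crc=1,metadir=0" 1
force_metadir_option "-d agcount=4" 1
force_metadir_option "-d agcount=4" ""
```

Note that an existing metadir= setting is forced to 0 when the test does
not need metadir, mirroring the patch, so the upgrade under test always
starts from a known state.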
* [PATCH 3/3] xfs/1856: add rtreflink upgrade to test matrix
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
2024-12-31 23:58 ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong
2024-12-31 23:58 ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong
@ 2024-12-31 23:59 ` Darrick J. Wong
2 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2024-12-31 23:59 UTC (permalink / raw)
To: zlang, djwong; +Cc: fstests, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add realtime reflink to the features that this test will try to
upgrade.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 tests/xfs/1856 |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/xfs/1856 b/tests/xfs/1856
index 8e3213da752348..9b776493f0486f 100755
--- a/tests/xfs/1856
+++ b/tests/xfs/1856
@@ -215,11 +215,12 @@ MKFS_OPTIONS="$(qerase_mkfs_options)"
 # upgrade don't spread failure to the rest of the tests.
 FEATURES=()
 if rt_configured; then
-	# rmap wasn't added to rt devices until after metadir
+	# rmap & reflink weren't added to rt devices until after metadir
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
 	check_repair_upgrade bigtime && FEATURES+=("bigtime")
 	supports_metadir && check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
+	supports_metadir && check_repair_upgrade reflink && FEATURES+=("reflink")
 else
 	check_repair_upgrade finobt && FEATURES+=("finobt")
 	check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
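On realtime configurations, the rmapbt and reflink upgrades are only added
to the test matrix when mkfs knows about metadir. A minimal sketch of that
gating with stubbed probes follows. Both stubs and the HAVE_METADIR
variable are assumptions made so the example can run without xfsprogs; the
real supports_metadir greps the mkfs.xfs usage text and the real
check_repair_upgrade lives in tests/xfs/1856.

```shell
#!/bin/bash
# Stub probes standing in for the real ones (assumptions for this sketch).
supports_metadir() { test "$HAVE_METADIR" = "1"; }
check_repair_upgrade() {
	case "$1" in
	finobt|rmapbt|reflink) return 0;;
	*) return 1;;
	esac
}

# Build the realtime feature matrix the way the rt_configured branch does.
build_rt_features() {
	FEATURES=()
	check_repair_upgrade finobt && FEATURES+=("finobt")
	check_repair_upgrade inobtcount && FEATURES+=("inobtcount")
	check_repair_upgrade bigtime && FEATURES+=("bigtime")
	supports_metadir && check_repair_upgrade rmapbt && FEATURES+=("rmapbt")
	supports_metadir && check_repair_upgrade reflink && FEATURES+=("reflink")
	return 0
}

HAVE_METADIR=1
build_rt_features
echo "with metadir: ${FEATURES[*]}"

HAVE_METADIR=0
build_rt_features
echo "without metadir: ${FEATURES[*]}"
```

Each probe is chained with && so that a failed probe simply leaves its
feature out of the matrix instead of aborting the test.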
* Re: [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
` (14 preceding siblings ...)
2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong
@ 2025-01-02 1:37 ` Stephen Zhang
2025-01-07 0:26 ` Darrick J. Wong
15 siblings, 1 reply; 110+ messages in thread
From: Stephen Zhang @ 2025-01-02 1:37 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
Christoph Hellwig, Dave Chinner, xfs, greg.marsden, shirley.ma, konrad.wilk, fstests

Darrick J. Wong <djwong@kernel.org> wrote on Wed, Jan 1, 2025 at 07:25:
>
> Hi everyone,
>
> Thank you all for helping get online repair, parent pointers, and
> metadata directory trees, and realtime allocation groups merged this
> year! We got a lot done in 2024.
>
> Having sent pull requests to Carlos for the last pieces of the realtime
> modernization project, I have exactly two worthwhile projects left in my
> development trees! The stuff here isn't necessarily in mergeable state
> yet, but I still believe everyone ought to know what I'm up to.
>
> The first project implements (somewhat buggily; I never quite got back
> to dealing with moving eof blocks) free space defragmentation so that we
> can meaningfully shrink filesystems; garbage collect regions of the
> filesystem; or prepare for large allocations. There's not much new
> kernel code other than exporting refcounts and gaining the ability to
> map free space.
>
> The second project initiates filesystem self healing routines whenever
> problems start to crop up, which means that it can run fully
> autonomously in the background. The monitoring system uses some
> pseudo-file and seqbuf tricks that I lifted from kmo last winter.
>
> Both of these projects are largely userspace code.
>
> Also I threw in some xfs_repair code to do dangerous fs upgrades.
> Nobody should use these, ever.
>
> Maintainers: please do not merge, this is a dog-and-pony show to attract
> developer attention.
>

[Add Dave to the list]

Hi, Darrick and all,

Recently, I have been considering implementing the XFS shrink feature based
on the AF concept, which was described in this posting:

https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/

That posting stated:
    The rules used by AG are more about extending outwards.
whilst
    The rules used by AF are more about restricting inwards.

so the AF concept implicitly and naturally involves the semantics of
compressing/shrinking (restricting).

AG (for xfs extend) and AF (for xfs shrink) are constructed symmetrically,
which makes it more elegant and easier to build more complex features on
top of them.

To elaborate: in the shrink context, AGs should not be treated as
independent entities. Doing so would mean that each AG requires separate
management (flags or something similar to indicate the state of that
AG/region), which would increase system complexity compared to the idea
behind AF. AF views several AGs as a whole.

And when it comes to growfs, things start to get a little more
complicated, and AF can handle that easily and naturally.

However, talk is cheap. To validate our point, we truly hope to have the
opportunity to participate in developing these features by integrating
the existing infrastructure you have already established with the AF
concept.

Best regards,
Shida

> --D
>
> PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
> patches. And finally rebase fstests to 2024-12-08.
>
* Re: [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing
2025-01-02 1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang
@ 2025-01-07 0:26 ` Darrick J. Wong
0 siblings, 0 replies; 110+ messages in thread
From: Darrick J. Wong @ 2025-01-07 0:26 UTC (permalink / raw)
To: Stephen Zhang
Cc: Carlos Maiolino, Zorro Lang, Andrey Albershteyn,
Christoph Hellwig, Dave Chinner, xfs, greg.marsden, shirley.ma, konrad.wilk, fstests

On Thu, Jan 02, 2025 at 09:37:47AM +0800, Stephen Zhang wrote:
> Darrick J. Wong <djwong@kernel.org> wrote on Wed, Jan 1, 2025 at 07:25:
> >
> > Hi everyone,
> >
> > Thank you all for helping get online repair, parent pointers, and
> > metadata directory trees, and realtime allocation groups merged this
> > year! We got a lot done in 2024.
> >
> > Having sent pull requests to Carlos for the last pieces of the realtime
> > modernization project, I have exactly two worthwhile projects left in my
> > development trees! The stuff here isn't necessarily in mergeable state
> > yet, but I still believe everyone ought to know what I'm up to.
> >
> > The first project implements (somewhat buggily; I never quite got back
> > to dealing with moving eof blocks) free space defragmentation so that we
> > can meaningfully shrink filesystems; garbage collect regions of the
> > filesystem; or prepare for large allocations. There's not much new
> > kernel code other than exporting refcounts and gaining the ability to
> > map free space.
> >
> > The second project initiates filesystem self healing routines whenever
> > problems start to crop up, which means that it can run fully
> > autonomously in the background. The monitoring system uses some
> > pseudo-file and seqbuf tricks that I lifted from kmo last winter.
> >
> > Both of these projects are largely userspace code.
> >
> > Also I threw in some xfs_repair code to do dangerous fs upgrades.
> > Nobody should use these, ever.
> >
> > Maintainers: please do not merge, this is a dog-and-pony show to attract
> > developer attention.
> >
>
> [Add Dave to the list]
>
> Hi, Darrick and all,
>
> Recently, I have been considering implementing the XFS shrink feature based
> on the AF concept, which was described in this posting:
>
> https://lore.kernel.org/linux-xfs/20241104014439.3786609-1-zhangshida@kylinos.cn/
>
> That posting stated:
>     The rules used by AG are more about extending outwards.
> whilst
>     The rules used by AF are more about restricting inwards.
>
> so the AF concept implicitly and naturally involves the semantics of
> compressing/shrinking (restricting).
>
> AG (for xfs extend) and AF (for xfs shrink) are constructed symmetrically,
> which makes it more elegant and easier to build more complex features on
> top of them.
>
> To elaborate: in the shrink context, AGs should not be treated as
> independent entities. Doing so would mean that each AG requires separate
> management (flags or something similar to indicate the state of that
> AG/region), which would increase system complexity compared to the idea
> behind AF. AF views several AGs as a whole.
>
> And when it comes to growfs, things start to get a little more
> complicated, and AF can handle that easily and naturally.
>
> However, talk is cheap. To validate our point, we truly hope to have the
> opportunity to participate in developing these features by integrating
> the existing infrastructure you have already established with the AF
> concept.

Hmm, now that's interesting -- using the AF ("allocation fencing"?)
capability to constrain allocations to a subset of AGs, and then slowly
rewriting files and whatnot to migrate data to other AGs. Eventually
you end up with an AG that's empty and therefore ready for shrink.

That's definitely a different way to do that than what I did (add a
"mapfree" ioctl to pin space to a file). I'll ponder these 2 approaches
a bit more.

--D

> Best regards,
> Shida
>
> > --D
> >
> > PS: I'll be back after the holidays to look at the zoned/atomic/fsverity
> > patches. And finally rebase fstests to 2024-12-08.
> >
>
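The fence-then-migrate idea discussed in the reply above can be pictured
with a toy model. The sketch below is not XFS code and makes several
assumptions: every AG has the same capacity, usage is tracked as a plain
block count, the last AG is the one being fenced, and "migration" just
moves counts into whichever earlier AG has room. All names in it are
invented for illustration.

```shell
#!/bin/bash
# Toy model only: per-AG used block counts with a uniform capacity.
ag_capacity=100
ag_used=(60 30 45 20)

# Treat the last AG as fenced off from new allocations and move its
# blocks into earlier AGs that still have room.  Returns 0 if the
# fenced AG ends up empty (and so could be shrunk away), 1 otherwise.
drain_last_ag() {
	local victim=$(( ${#ag_used[@]} - 1 ))
	local i room move

	for ((i = 0; i < victim && ag_used[victim] > 0; i++)); do
		room=$(( ag_capacity - ag_used[i] ))
		move=$(( room < ag_used[victim] ? room : ag_used[victim] ))
		ag_used[i]=$(( ag_used[i] + move ))
		ag_used[victim]=$(( ag_used[victim] - move ))
	done
	test "${ag_used[victim]}" -eq 0
}

if drain_last_ag; then
	echo "last AG drained: ${ag_used[*]}"
else
	echo "not enough room elsewhere: ${ag_used[*]}"
fi
```

The model only captures the ordering constraint, namely that allocation
must be blocked in the victim before data is moved out, or new writes
could refill it while it drains.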
end of thread, other threads:[~2025-01-13 5:55 UTC | newest]

Thread overview: 110+ messages
2024-12-31 23:25 [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET 1/5] xfs: improve post-close eofblocks gc behavior Darrick J. Wong
2024-12-31 23:36   ` [PATCH 1/1] xfs: Don't free EOF blocks on close when extent size hints are set Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET RFC 2/5] xfs: noalloc allocation groups Darrick J. Wong
2024-12-31 23:36   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
2024-12-31 23:36   ` [PATCH 2/5] xfs: whine to dmesg when we encounter errors Darrick J. Wong
2024-12-31 23:37   ` [PATCH 3/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
2024-12-31 23:37   ` [PATCH 4/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
2024-12-31 23:37   ` [PATCH 5/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
2024-12-31 23:32 ` [PATCHSET 3/5] xfs: report refcount information to userspace Darrick J. Wong
2024-12-31 23:37   ` [PATCH 1/1] xfs: export reference count " Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 4/5] xfs: defragment free space Darrick J. Wong
2024-12-31 23:38   ` [PATCH 1/4] xfs: export realtime refcount information Darrick J. Wong
2024-12-31 23:38   ` [PATCH 2/4] xfs: capture the offset and length in fallocate tracepoints Darrick J. Wong
2024-12-31 23:38   ` [PATCH 3/4] xfs: add an ioctl to map free space into a file Darrick J. Wong
2024-12-31 23:38   ` [PATCH 4/4] xfs: implement FALLOC_FL_MAP_FREE for realtime files Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 5/5] xfs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:39   ` [PATCH 01/16] xfs: create debugfs uuid aliases Darrick J. Wong
2024-12-31 23:39   ` [PATCH 02/16] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-12-31 23:39   ` [PATCH 03/16] xfs: create a filesystem shutdown hook Darrick J. Wong
2024-12-31 23:39   ` [PATCH 04/16] xfs: create hooks for media errors Darrick J. Wong
2024-12-31 23:40   ` [PATCH 05/16] iomap, filemap: report buffered read and write io errors to the filesystem Darrick J. Wong
2024-12-31 23:40   ` [PATCH 06/16] iomap: report directio read and write errors to callers Darrick J. Wong
2024-12-31 23:40   ` [PATCH 07/16] xfs: create file io error hooks Darrick J. Wong
2024-12-31 23:40   ` [PATCH 08/16] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2024-12-31 23:41   ` [PATCH 09/16] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2024-12-31 23:41   ` [PATCH 10/16] xfs: report metadata health events through healthmon Darrick J. Wong
2024-12-31 23:41   ` [PATCH 11/16] xfs: report shutdown " Darrick J. Wong
2024-12-31 23:41   ` [PATCH 12/16] xfs: report media errors " Darrick J. Wong
2024-12-31 23:42   ` [PATCH 13/16] xfs: report file io " Darrick J. Wong
2024-12-31 23:42   ` [PATCH 14/16] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2024-12-31 23:42   ` [PATCH 15/16] xfs: add media error reporting ioctl Darrick J. Wong
2024-12-31 23:43   ` [PATCH 16/16] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET RFC 1/5] xfsprogs: noalloc allocation groups Darrick J. Wong
2024-12-31 23:43   ` [PATCH 1/5] xfs: track deferred ops statistics Darrick J. Wong
2024-12-31 23:43   ` [PATCH 2/5] xfs: create a noalloc mode for allocation groups Darrick J. Wong
2024-12-31 23:43   ` [PATCH 3/5] xfs: enable userspace to hide an AG from allocation Darrick J. Wong
2024-12-31 23:44   ` [PATCH 4/5] xfs: apply noalloc mode to inode allocations too Darrick J. Wong
2024-12-31 23:44   ` [PATCH 5/5] xfs_io: enhance the aginfo command to control the noalloc flag Darrick J. Wong
2024-12-31 23:33 ` [PATCHSET 2/5] xfsprogs: report refcount information to userspace Darrick J. Wong
2024-12-31 23:44   ` [PATCH 1/2] xfs: export reference count " Darrick J. Wong
2024-12-31 23:44   ` [PATCH 2/2] xfs_io: dump reference count information Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 3/5] xfsprogs: defragment free space Darrick J. Wong
2024-12-31 23:45   ` [PATCH 01/11] xfs_io: display rtgroup number in verbose fsrefs output Darrick J. Wong
2024-12-31 23:45   ` [PATCH 02/11] xfs: add an ioctl to map free space into a file Darrick J. Wong
2024-12-31 23:45   ` [PATCH 03/11] xfs_io: support using XFS_IOC_MAP_FREESP to map free space Darrick J. Wong
2024-12-31 23:45   ` [PATCH 04/11] xfs_db: get and put blocks on the AGFL Darrick J. Wong
2024-12-31 23:46   ` [PATCH 05/11] xfs_spaceman: implement clearing free space Darrick J. Wong
2024-12-31 23:46   ` [PATCH 06/11] spaceman: physically move a regular inode Darrick J. Wong
2024-12-31 23:46   ` [PATCH 07/11] spaceman: find owners of space in an AG Darrick J. Wong
2024-12-31 23:46   ` [PATCH 08/11] xfs_spaceman: wrap radix tree accesses in find_owner.c Darrick J. Wong
2024-12-31 23:47   ` [PATCH 09/11] xfs_spaceman: port relocation structure to 32-bit systems Darrick J. Wong
2024-12-31 23:47   ` [PATCH 10/11] spaceman: relocate the contents of an AG Darrick J. Wong
2024-12-31 23:47   ` [PATCH 11/11] spaceman: move inodes with hardlinks Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 4/5] xfsprogs: live health monitoring of filesystems Darrick J. Wong
2024-12-31 23:47   ` [PATCH 01/21] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-12-31 23:48   ` [PATCH 02/21] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2024-12-31 23:48   ` [PATCH 03/21] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2024-12-31 23:48   ` [PATCH 04/21] xfs: report metadata health events through healthmon Darrick J. Wong
2024-12-31 23:49   ` [PATCH 05/21] xfs: report shutdown " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 06/21] xfs: report media errors " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 07/21] xfs: report file io " Darrick J. Wong
2024-12-31 23:49   ` [PATCH 08/21] xfs: add media error reporting ioctl Darrick J. Wong
2024-12-31 23:50   ` [PATCH 09/21] xfs_io: monitor filesystem health events Darrick J. Wong
2024-12-31 23:50   ` [PATCH 10/21] xfs_io: add a media error reporting command Darrick J. Wong
2024-12-31 23:50   ` [PATCH 11/21] xfs_scrubbed: create daemon to listen for health events Darrick J. Wong
2024-12-31 23:50   ` [PATCH 12/21] xfs_scrubbed: check events against schema Darrick J. Wong
2024-12-31 23:51   ` [PATCH 13/21] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
2024-12-31 23:51   ` [PATCH 14/21] xfs_scrubbed: check for fs features needed for effective repairs Darrick J. Wong
2024-12-31 23:51   ` [PATCH 15/21] xfs_scrubbed: use getparents to look up file names Darrick J. Wong
2024-12-31 23:51   ` [PATCH 16/21] builddefs: refactor udev directory specification Darrick J. Wong
2024-12-31 23:52   ` [PATCH 17/21] xfs_scrubbed: create a background monitoring service Darrick J. Wong
2024-12-31 23:52   ` [PATCH 18/21] xfs_scrubbed: don't start service if kernel support unavailable Darrick J. Wong
2024-12-31 23:52   ` [PATCH 19/21] xfs_scrubbed: use the autofsck fsproperty to select mode Darrick J. Wong
2024-12-31 23:52   ` [PATCH 20/21] xfs_scrub: report media scrub failures to the kernel Darrick J. Wong
2024-12-31 23:53   ` [PATCH 21/21] debian: enable xfs_scrubbed on the root filesystem by default Darrick J. Wong
2024-12-31 23:34 ` [PATCHSET 5/5] xfs_repair: add difficult V5 features to filesystems Darrick J. Wong
2024-12-31 23:53   ` [PATCH 01/10] xfs_repair: allow sysadmins to add free inode btree indexes Darrick J. Wong
2024-12-31 23:53   ` [PATCH 02/10] xfs_repair: allow sysadmins to add reflink Darrick J.
Wong 2024-12-31 23:53 ` [PATCH 03/10] xfs_repair: allow sysadmins to add reverse mapping indexes Darrick J. Wong 2024-12-31 23:54 ` [PATCH 04/10] xfs_repair: upgrade an existing filesystem to have parent pointers Darrick J. Wong 2024-12-31 23:54 ` [PATCH 05/10] xfs_repair: allow sysadmins to add metadata directories Darrick J. Wong 2024-12-31 23:54 ` [PATCH 06/10] xfs_repair: upgrade filesystems to support rtgroups when adding metadir Darrick J. Wong 2024-12-31 23:55 ` [PATCH 07/10] xfs_repair: allow sysadmins to add realtime reverse mapping indexes Darrick J. Wong 2024-12-31 23:55 ` [PATCH 08/10] xfs_repair: allow sysadmins to add realtime reflink Darrick J. Wong 2024-12-31 23:55 ` [PATCH 09/10] xfs_repair: skip free space checks when upgrading Darrick J. Wong 2024-12-31 23:55 ` [PATCH 10/10] xfs_repair: allow adding rmapbt to reflink filesystems Darrick J. Wong 2024-12-31 23:34 ` [PATCHSET 1/5] fstests: functional test for refcount reporting Darrick J. Wong 2024-12-31 23:56 ` [PATCH 1/1] xfs: test output of new FSREFCOUNTS ioctl Darrick J. Wong 2024-12-31 23:35 ` [PATCHSET 2/5] fstests: defragment free space Darrick J. Wong 2024-12-31 23:56 ` [PATCH 1/1] xfs: test clearing of " Darrick J. Wong 2024-12-31 23:35 ` [PATCHSET 3/5] fstests: capture logs from mount failures Darrick J. Wong 2024-12-31 23:56 ` [PATCH 1/2] treewide: convert all $MOUNT_PROG to _mount Darrick J. Wong 2024-12-31 23:56 ` [PATCH 2/2] check: capture dmesg of mount failures if test fails Darrick J. Wong 2025-01-06 11:18 ` Nirjhar Roy 2025-01-06 23:52 ` Darrick J. Wong 2025-01-13 5:55 ` Nirjhar Roy 2024-12-31 23:35 ` [PATCHSET 4/5] fstests: live health monitoring of filesystems Darrick J. Wong 2024-12-31 23:57 ` [PATCH 1/6] misc: convert all $UMOUNT_PROG to a _umount helper Darrick J. Wong 2024-12-31 23:57 ` [PATCH 2/6] misc: convert all umount(1) invocations to _umount Darrick J. Wong 2024-12-31 23:57 ` [PATCH 3/6] xfs: test health monitoring code Darrick J. 
Wong 2024-12-31 23:57 ` [PATCH 4/6] xfs: test for metadata corruption error reporting via healthmon Darrick J. Wong 2024-12-31 23:58 ` [PATCH 5/6] xfs: test io " Darrick J. Wong 2024-12-31 23:58 ` [PATCH 6/6] xfs: test new xfs_scrubbed daemon Darrick J. Wong 2024-12-31 23:35 ` [PATCHSET 5/5] fstests: add difficult V5 features to filesystems Darrick J. Wong 2024-12-31 23:58 ` [PATCH 1/3] xfs/1856: add metadir upgrade to test matrix Darrick J. Wong 2024-12-31 23:58 ` [PATCH 2/3] xfs/1856: add rtrmapbt " Darrick J. Wong 2024-12-31 23:59 ` [PATCH 3/3] xfs/1856: add rtreflink " Darrick J. Wong 2025-01-02 1:37 ` [NYE PATCHCYCLONE] xfs: free space defrag and autonomous self healing Stephen Zhang 2025-01-07 0:26 ` Darrick J. Wong