linux-fsdevel.vger.kernel.org archive mirror
* [PATCHBOMB 6.19] xfs: autonomous self healing
@ 2025-10-22 23:56 Darrick J. Wong
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-22 23:56 UTC (permalink / raw)
  To: Carlos Maiolino, Christoph Hellwig
  Cc: xfs, Chandan Babu R, linux-fsdevel, fstests

Hi everyone,

You might recall that 18 months ago I showed off an early draft of a
patchset implementing autonomous self healing capabilities for XFS.
The premise is quite simple -- add a few hooks to the kernel to capture
significant filesystem metadata and file health events (pretty much all
failures), queue these events to a special anonfd, and let userspace
read the events at its leisure.  That's patchset 1.
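
For a sense of what patchset 1's read() interface implies, here is a
minimal Python sketch of draining such an event fd.  The wire format
below (a little-endian type/length header followed by a payload) is
purely an assumption for illustration -- the real layout is defined by
the kernel patches -- and an ordinary pipe stands in for the anonfd:

```python
import os
import struct

# Hypothetical wire format: a (type, length) header followed by a
# variable-length payload.  The actual healthmon format is defined by
# the kernel side of this patchset, not here.
EVENT_HDR = struct.Struct("<II")  # (event type, payload length)

def read_events(fd):
    """Yield (type, payload) tuples from an event fd until EOF."""
    while True:
        hdr = os.read(fd, EVENT_HDR.size)
        if len(hdr) < EVENT_HDR.size:
            return
        etype, length = EVENT_HDR.unpack(hdr)
        yield etype, os.read(fd, length)

# Demonstration against a pipe standing in for the anonfd:
r, w = os.pipe()
os.write(w, EVENT_HDR.pack(1, 4) + b"agno")
os.close(w)
print(list(read_events(r)))  # → [(1, b'agno')]
os.close(r)
```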

The userspace part is more interesting, because there's a new daemon
that opens the anonfd given the root dir of a filesystem, captures a
file handle for the root dir, detaches from the root dir, and waits for
metadata events.  Upon receipt of an adverse health event, it will
reopen the root directory and initiate repairs.  I've left the prototype
Python script in place (patchset 2) but my ultimate goal is for everyone
to use the Rust version (patchset 3) because it's much quicker to
respond to problems.
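
As illustration of the daemon flow described above, a main loop might
look like this sketch (loosely in the spirit of the Python prototype in
patchset 2).  The poll semantics of the anonfd are an assumption here,
handle_event stands in for the real repair logic, and a pipe substitutes
for the anonfd in the demonstration:

```python
import os
import select

# Minimal sketch of the daemon's event loop, assuming the anonfd becomes
# readable when the kernel queues events.  handle_event stands in for
# the real repair step (reopening the root directory via its file handle
# and initiating repairs).
def event_loop(fd, handle_event):
    results = []
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    while True:
        for pfd, _mask in poller.poll():
            data = os.read(pfd, 65536)
            if not data:        # EOF: the filesystem went away (unmount)
                return results
            results.append(handle_event(data))

# Demonstration with a pipe standing in for the anonfd:
r, w = os.pipe()
os.write(w, b"metadata_sick:agno=3")
os.close(w)
print(event_loop(r, lambda ev: ev.decode()))  # → ['metadata_sick:agno=3']
os.close(r)
```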

New QA tests are patchset 4.  Zorro: No need to merge this right away.

This work was mostly complete by the end of 2024, and I've been letting
it run on my XFS QA testing fleets ever since.  I am submitting this
patchset upstream for 6.19.  Once this is merged, the online
fsck project will be complete.

--D

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCHSET V2] xfs: autonomous self healing of filesystems
  2025-10-22 23:56 [PATCHBOMB 6.19] xfs: autonomous self healing Darrick J. Wong
@ 2025-10-22 23:59 ` Darrick J. Wong
  2025-10-23  0:00   ` [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
                     ` (18 more replies)
  0 siblings, 19 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-22 23:59 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

Hi all,

This patchset builds new functionality to deliver live information about
filesystem health events to userspace.  This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.

When an event occurs, the hook functions queue an event object to each
event anonfd for later processing.  Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption.  The events themselves can be read() from the anonfd
either as json objects for human readability, or as C structs for
daemons.
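
To make the JSON flavor of the read interface concrete, a consumer might
do something like the following.  The event fields here are invented for
illustration; the authoritative layout is the schema added by this
patchset (fs/xfs/libxfs/xfs_healthmon.schema.json), which is not
reproduced here:

```python
import json

# A hypothetical JSON event; every field below is an assumption made for
# illustration, not the actual healthmon schema.
raw = '{"type": "metadata", "domain": "perag", "agno": 3, "flags": ["sick"]}'

event = json.loads(raw)
if event["type"] == "metadata":
    target = f'AG {event.get("agno", "?")}'
    print(f'would schedule repair of {target}')  # → would schedule repair of AG 3
```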

In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically.  This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing.  It is autostarted via some udev rules.
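
For illustration, the udev/systemd wiring described above might look
roughly like the sketch below.  The rule file, unit name, and binary
path are all assumptions based on this cover letter; the actual rules
ship with xfsprogs and may differ:

```
# /usr/lib/udev/rules.d/99-xfs-healer.rules (hypothetical path)
ACTION=="add", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="xfs", \
  TAG+="systemd", ENV{SYSTEMD_WANTS}+="xfs_healer@%k.service"

# /usr/lib/systemd/system/xfs_healer@.service (hypothetical sketch)
# [Unit]
# Description=XFS self healing daemon for %i
# [Service]
# ExecStart=/usr/sbin/xfs_healer /dev/%i
```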

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * docs: remove obsolete links in the xfs online repair documentation
 * docs: discuss autonomous self healing in the xfs online repair design doc
 * xfs: create debugfs uuid aliases
 * xfs: create hooks for monitoring health updates
 * xfs: create a filesystem shutdown hook
 * xfs: create hooks for media errors
 * iomap: report buffered read and write io errors to the filesystem
 * iomap: report directio read and write errors to callers
 * xfs: create file io error hooks
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: allow reconfiguration of the health monitoring device
 * xfs: validate fds against running healthmon
 * xfs: add media error reporting ioctl
 * xfs: send uevents when major filesystem events happen
---
 fs/iomap/internal.h                                |    2 
 fs/xfs/libxfs/xfs_fs.h                             |  173 ++
 fs/xfs/libxfs/xfs_health.h                         |   52 +
 fs/xfs/xfs_file.h                                  |   36 
 fs/xfs/xfs_fsops.h                                 |   14 
 fs/xfs/xfs_healthmon.h                             |  107 +
 fs/xfs/xfs_linux.h                                 |    3 
 fs/xfs/xfs_mount.h                                 |   13 
 fs/xfs/xfs_notify_failure.h                        |   44 +
 fs/xfs/xfs_super.h                                 |   13 
 fs/xfs/xfs_trace.h                                 |  404 +++++
 include/linux/fs.h                                 |    4 
 include/linux/iomap.h                              |    2 
 Documentation/filesystems/vfs.rst                  |    7 
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  336 +---
 fs/iomap/buffered-io.c                             |   27 
 fs/iomap/direct-io.c                               |    4 
 fs/iomap/ioend.c                                   |    4 
 fs/xfs/Kconfig                                     |    8 
 fs/xfs/Makefile                                    |    7 
 fs/xfs/libxfs/xfs_healthmon.schema.json            |  648 +++++++
 fs/xfs/xfs_aops.c                                  |    2 
 fs/xfs/xfs_file.c                                  |  167 ++
 fs/xfs/xfs_fsops.c                                 |   75 +
 fs/xfs/xfs_health.c                                |  269 +++
 fs/xfs/xfs_healthmon.c                             | 1741 ++++++++++++++++++++
 fs/xfs/xfs_ioctl.c                                 |    7 
 fs/xfs/xfs_notify_failure.c                        |  135 +-
 fs/xfs/xfs_super.c                                 |  109 +
 fs/xfs/xfs_trace.c                                 |    4 
 lib/seq_buf.c                                      |    1 
 31 files changed, 4173 insertions(+), 245 deletions(-)
 create mode 100644 fs/xfs/xfs_healthmon.h
 create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json
 create mode 100644 fs/xfs/xfs_healthmon.c



* [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
@ 2025-10-23  0:00   ` Darrick J. Wong
  2025-10-24  5:40     ` Christoph Hellwig
  2025-10-23  0:01   ` [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
                     ` (17 subsequent siblings)
  18 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:00 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Online repair is now merged upstream, so there is no need to point to
patchset links anymore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  236 +-------------------
 1 file changed, 6 insertions(+), 230 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 8cbcd3c2643430..189d1f5f40788d 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -105,10 +105,8 @@ occur; this capability aids both strategies.
 TLDR; Show Me the Code!
 -----------------------
 
-Code is posted to the kernel.org git trees as follows:
-`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
-`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
-`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Kernel and userspace code has been fully merged as of October 2025.
+
 Each kernel patchset adding an online repair function will use the same branch
 name across the kernel, xfsprogs, and fstests git repos.
 
@@ -764,12 +762,8 @@ allow the online fsck developers to compare online fsck against offline fsck,
 and they enable XFS developers to find deficiencies in the code base.
 
 Proposed patchsets include
-`general fuzzer improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
 `fuzzing baselines
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
-and `improvements in fuzz testing comprehensiveness
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_.
 
 Stress Testing
 --------------
@@ -801,11 +795,6 @@ Success is defined by the ability to run all of these tests without observing
 any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
 check warnings, or any other sort of mischief.
 
-Proposed patchsets include `general stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
-and the `evolution of existing per-function stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
-
 4. User Interface
 =================
 
@@ -886,10 +875,6 @@ apply as nice of a priority to IO and CPU scheduling as possible.
 This measure was taken to minimize delays in the rest of the filesystem.
 No such hardening has been performed for the cron job.
 
-Proposed patchset:
-`Enabling the xfs_scrub background service
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
-
 Health Reporting
 ----------------
 
@@ -912,13 +897,6 @@ notifications and initiate a repair?
 *Answer*: These questions remain unanswered, but should be a part of the
 conversation with early adopters and potential downstream users of XFS.
 
-Proposed patchsets include
-`wiring up health reports to correction returns
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
-and
-`preservation of sickness info during memory reclaim
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
-
 5. Kernel Algorithms and Data Structures
 ========================================
 
@@ -1310,21 +1288,6 @@ Space allocation records are cross-referenced as follows:
      are there the same number of reverse mapping records for each block as the
      reference count record claims?
 
-Proposed patchsets are the series to find gaps in
-`refcount btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
-`inode btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
-`rmap btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
-to find
-`mergeable records
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
-and to
-`improve cross referencing with rmap
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
-before starting a repair.
-
 Checking Extended Attributes
 ````````````````````````````
 
@@ -1756,10 +1719,6 @@ For scrub, the drain works as follows:
 To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
 be woken up whenever the intent count drops to zero.
 
-The proposed patchset is the
-`scrub intent drain series
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
-
 .. _jump_labels:
 
 Static Keys (aka Jump Label Patching)
@@ -2036,10 +1995,6 @@ The ``xfarray_store_anywhere`` function is used to insert a record in any
 null record slot in the bag; and the ``xfarray_unset`` function removes a
 record from the bag.
 
-The proposed patchset is the
-`big in-memory array
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
-
 Iterating Array Elements
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2172,10 +2127,6 @@ However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
 
-The proposed patchset is at the start of the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
-
 .. _xfbtree:
 
 In-Memory B+Trees
@@ -2214,11 +2165,6 @@ xfiles enables reuse of the entire btree library.
 Btrees built atop an xfile are collectively known as ``xfbtrees``.
 The next few sections describe how they actually work.
 
-The proposed patchset is the
-`in-memory btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
-series.
-
 Using xfiles as a Buffer Cache Target
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2459,14 +2405,6 @@ This enables the log to release the old EFI to keep the log moving forwards.
 EFIs have a role to play during the commit and reaping phases; please see the
 next section and the section about :ref:`reaping<reaping>` for more details.
 
-Proposed patchsets are the
-`bitmap rework
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
-and the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
-
-
 Writing the New Tree
 ````````````````````
 
@@ -2623,11 +2561,6 @@ The number of records for the inode btree is the number of xfarray records,
 but the record count for the free inode btree has to be computed as inode chunk
 records are stored in the xfarray.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding the Space Reference Counts
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2716,11 +2649,6 @@ Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
 removed via ``xfarray_unset``.
 Bag members are examined through ``xfarray_iter`` loops.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding File Fork Mapping Indices
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2757,11 +2685,6 @@ EXTENTS format instead of BMBT, which may require a conversion.
 Third, the incore extent map must be reloaded carefully to avoid disturbing
 any delayed allocation extents.
 
-The proposed patchset is the
-`file mapping repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
-series.
-
 .. _reaping:
 
 Reaping Old Metadata Blocks
@@ -2843,11 +2766,6 @@ blocks.
 As stated earlier, online repair functions use very large transactions to
 minimize the chances of this occurring.
 
-The proposed patchset is the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
-series.
-
 Case Study: Reaping After a Regular Btree Repair
 ````````````````````````````````````````````````
 
@@ -2943,11 +2861,6 @@ When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
 btrees.
 These blocks can then be reaped using the methods outlined above.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 .. _rmap_reap:
 
 Case Study: Reaping After Repairing Reverse Mapping Btrees
@@ -2972,11 +2885,6 @@ methods outlined above.
 The rest of the process of rebuildng the reverse mapping btree is discussed
 in a separate :ref:`case study<rmap_repair>`.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding the AGFL
 ```````````````````````````````
 
@@ -3024,11 +2932,6 @@ more complicated, because computing the correct value requires traversing the
 forks, or if that fails, leaving the fields invalid and waiting for the fork
 fsck functions to run.
 
-The proposed patchset is the
-`inode
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
-repair series.
-
 Quota Record Repairs
 --------------------
 
@@ -3045,11 +2948,6 @@ checking are obviously bad limits and timer values.
 Quota usage counters are checked, repaired, and discussed separately in the
 section about :ref:`live quotacheck <quotacheck>`.
 
-The proposed patchset is the
-`quota
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
-repair series.
-
 .. _fscounters:
 
 Freezing to Fix Summary Counters
@@ -3145,11 +3043,6 @@ long enough to check and correct the summary counters.
 |   This bug was fixed in Linux 5.17.                                      |
 +--------------------------------------------------------------------------+
 
-The proposed patchset is the
-`summary counter cleanup
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
-series.
-
 Full Filesystem Scans
 ---------------------
 
@@ -3277,15 +3170,6 @@ Second, if the incore inode is stuck in some intermediate state, the scan
 coordinator must release the AGI and push the main filesystem to get the inode
 back into a loadable state.
 
-The proposed patches are the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-The first user of the new functionality is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
 Inode Management
 ````````````````
 
@@ -3381,12 +3265,6 @@ To capture these nuances, the online fsck code has a separate ``xchk_irele``
 function to set or clear the ``DONTCACHE`` flag to get the required release
 behavior.
 
-Proposed patchsets include fixing
-`scrub iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
-`dir iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
-
 .. _ilocking:
 
 Locking Inodes
@@ -3443,11 +3321,6 @@ If the dotdot entry changes while the directory is unlocked, then a move or
 rename operation must have changed the child's parentage, and the scan can
 exit early.
 
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
 .. _fshooks:
 
 Filesystem Hooks
@@ -3594,11 +3467,6 @@ The inode scan APIs are pretty simple:
 
 - ``xchk_iscan_teardown`` to finish the scan
 
-This functionality is also a part of the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-
 .. _quotacheck:
 
 Case Study: Quota Counter Checking
@@ -3686,11 +3554,6 @@ needing to hold any locks for a long duration.
 If repairs are desired, the real and shadow dquots are locked and their
 resource counts are set to the values in the shadow dquot.
 
-The proposed patchset is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
 .. _nlinks:
 
 Case Study: File Link Count Checking
@@ -3744,11 +3607,6 @@ shadow information.
 If no parents are found, the file must be :ref:`reparented <orphanage>` to the
 orphanage to prevent the file from being lost forever.
 
-The proposed patchset is the
-`file link count repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
-series.
-
 .. _rmap_repair:
 
 Case Study: Rebuilding Reverse Mapping Records
@@ -3828,11 +3686,6 @@ scan for reverse mapping records.
 
 12. Free the xfbtree now that it not needed.
 
-The proposed patchset is the
-`rmap repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
-series.
-
 Staging Repairs with Temporary Files on Disk
 --------------------------------------------
 
@@ -3971,11 +3824,6 @@ Once a good copy of a data file has been constructed in a temporary file, it
 must be conveyed to the file being repaired, which is the topic of the next
 section.
 
-The proposed patches are in the
-`repair temporary files
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
-series.
-
 Logged File Content Exchanges
 -----------------------------
 
@@ -4025,11 +3873,6 @@ The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 
-The proposed patchset is the
-`file contents exchange
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
-series.
-
 +--------------------------------------------------------------------------+
 | **Sidebar: Using Log-Incompatible Feature Flags**                        |
 +--------------------------------------------------------------------------+
@@ -4323,11 +4166,6 @@ To repair the summary file, write the xfile contents into the temporary file
 and use atomic mapping exchange to commit the new contents.
 The temporary file is then reaped.
 
-The proposed patchset is the
-`realtime summary repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
-series.
-
 Case Study: Salvaging Extended Attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4369,11 +4207,6 @@ Salvaging extended attributes is done as follows:
 
 4. Reap the temporary file.
 
-The proposed patchset is the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
-series.
-
 Fixing Directories
 ------------------
 
@@ -4448,11 +4281,6 @@ Unfortunately, the current dentry cache design doesn't provide a means to walk
 every child dentry of a specific directory, which makes this a hard problem.
 There is no known solution.
 
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
 Parent Pointers
 ```````````````
 
@@ -4612,11 +4440,6 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
 
 7. Reap the temporary directory.
 
-The proposed patchset is the
-`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
 Case Study: Repairing Parent Pointers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4662,11 +4485,6 @@ directory reconstruction:
 
 8. Reap the temporary file.
 
-The proposed patchset is the
-`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
 Digression: Offline Checking of Parent Pointers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4755,11 +4573,6 @@ connectivity checks:
 
 4. Move on to examining link counts, as we do today.
 
-The proposed patchset is the
-`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
-series.
-
 Rebuilding directories from parent pointers in offline repair would be very
 challenging because xfs_repair currently uses two single-pass scans of the
 filesystem during phases 3 and 4 to decide which files are corrupt enough to be
@@ -4903,12 +4716,6 @@ Repairing the directory tree works as follows:
 
 6. If the subdirectory has zero paths, attach it to the lost and found.
 
-The proposed patches are in the
-`directory tree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
-series.
-
-
 .. _orphanage:
 
 The Orphanage
@@ -4973,11 +4780,6 @@ Orphaned files are adopted by the orphanage as follows:
 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
    resources.
 
-The proposed patches are in the
-`orphanage adoption
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
-series.
-
 6. Userspace Algorithms and Data Structures
 ===========================================
 
@@ -5091,14 +4893,6 @@ first workqueue's workers until the backlog eases.
 This doesn't completely solve the balancing problem, but reduces it enough to
 move on to more pressing issues.
 
-The proposed patchsets are the scrub
-`performance tweaks
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
-and the
-`inode scan rebalance
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
-series.
-
 .. _scrubrepair:
 
 Scheduling Repairs
@@ -5179,20 +4973,6 @@ immediately.
 Corrupt file data blocks reported by phase 6 cannot be recovered by the
 filesystem.
 
-The proposed patchsets are the
-`repair warning improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
-refactoring of the
-`repair data dependency
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
-and
-`object tracking
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
-and the
-`repair scheduling
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
-improvement series.
-
 Checking Names for Confusable Unicode Sequences
 -----------------------------------------------
 
@@ -5372,6 +5152,8 @@ The extra flexibility enables several new use cases:
   This emulates an atomic device write in software, and can support arbitrary
   scattered writes.
 
+(This functionality was merged into mainline as of 2025)
+
 Vectorized Scrub
 ----------------
 
@@ -5393,13 +5175,7 @@ It is hoped that ``io_uring`` will pick up enough of this functionality that
 online fsck can use that instead of adding a separate vectored scrub system
 call to XFS.
 
-The relevant patchsets are the
-`kernel vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
-and
-`userspace vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
-series.
+(This functionality was merged into mainline as of 2025)
 
 Quality of Service Targets for Scrub
 ------------------------------------



* [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-10-23  0:00   ` [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
@ 2025-10-23  0:01   ` Darrick J. Wong
  2025-10-30 16:38     ` Darrick J. Wong
  2025-10-23  0:01   ` [PATCH 03/19] xfs: create debugfs uuid aliases Darrick J. Wong
                     ` (16 subsequent siblings)
  18 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:01 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Update the XFS online repair document to describe the motivation and
design of the autonomous filesystem healing agent known as xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  102 ++++++++++++++++++++
 1 file changed, 100 insertions(+), 2 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 189d1f5f40788d..bdbf338a9c9f0c 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -166,9 +166,12 @@ The current XFS tools leave several problems unsolved:
    malicious actors **exploit quirks of Unicode** to place misleading names
    in directories.
 
+8. **Site Reliability and Support Engineers** would like to reduce the
+   frequency of incidents requiring **manual intervention**.
+
 Given this definition of the problems to be solved and the actors who would
 benefit, the proposed solution is a third fsck tool that acts on a running
-filesystem.
+filesystem, and an autonomous agent that fixes problems as they arise.
 
 This new third program has three components: an in-kernel facility to check
 metadata, an in-kernel facility to repair metadata, and a userspace driver
@@ -203,6 +206,13 @@ Even if a piece of filesystem metadata can only be regenerated by scanning the
 entire system, the scan can still be done in the background while other file
 operations continue.
 
+The autonomous self healing agent should listen for metadata health impact
+reports coming from the kernel and automatically schedule repairs for the
+damaged metadata.
+This agent is named ``xfs_healer``.
+If the required repairs are larger in scope than a single metadata structure,
+it invokes ``xfs_scrub`` to perform a full analysis.
+
 In summary, online fsck takes advantage of resource sharding and redundant
 metadata to enable targeted checking and repair operations while the system
 is running.
@@ -850,11 +860,16 @@ variable in the following service files:
 * ``xfs_scrub_all_fail.service``
 
 The decision to enable the background scan is left to the system administrator.
-This can be done by enabling either of the following services:
+This can be done system-wide by enabling either of the following services:
 
 * ``xfs_scrub_all.timer`` on systemd systems
 * ``xfs_scrub_all.cron`` on non-systemd systems
 
+To enable online repair for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To enable only scanning, the property should be set to ``check``.
+To disable online fsck entirely, the property should be set to ``none``.
+
 This automatic weekly scan is configured out of the box to perform an
 additional media scan of all file data once per month.
 This is less foolproof than, say, storing file data block checksums, but much
@@ -897,6 +912,36 @@ notifications and initiate a repair?
 *Answer*: These questions remain unanswered, but should be a part of the
 conversation with early adopters and potential downstream users of XFS.
 
+Autonomous Self Healing
+-----------------------
+
+The autonomous self healing agent is a background system service that starts
+when the filesystem is mounted and runs until unmount.
+When starting up, the agent opens a special pseudofile associated with that
+mount.
+When the filesystem generates new adverse health events, the events will be
+made available for reading via the special pseudofile.
+The events need not be limited to metadata concerns; they can also reflect
+events outside of the filesystem's direct control, such as file I/O errors.
+
+The agent reads these events in a loop and responds to the events
+appropriately.
+For a single trouble report about metadata, the agent initiates a targeted
+repair of the specific structure.
+If that repair fails or the agent observes too many metadata trouble reports
+over a short interval, it should then initiate a full scan of the filesystem
+via the ``xfs_scrub`` service.
+
+The decision to enable self healing is left to the system administrator.
+This can be done system-wide by enabling the following service:
+
+* ``xfs_healer@.service`` on systemd systems
+
+To enable autonomous healing for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To disable self healing, the property should be set to ``check``,
+``optimize``, or ``none``.
+
 5. Kernel Algorithms and Data Structures
 ========================================
 
@@ -5071,6 +5116,59 @@ and report what has been lost.
 For media errors in blocks owned by files, parent pointers can be used to
 construct file paths from inode numbers for user-friendly reporting.
 
+Autonomous Self Healing
+-----------------------
+
+When a filesystem mounts, the Linux kernel emits a uevent describing the
+mount and the path to the data device.
+A udev rule determines the initial mountpoint from the data device path
+and starts a mount-specific ``xfs_healer`` service instance.
+The ``xfs_healer`` service opens the mountpoint and issues the
+XFS_IOC_HEALTH_MONITOR ioctl to open a special health monitoring file.
+After that is set up, the mountpoint is closed to avoid pinning the mount.
+
+The health monitoring file hooks certain points of the filesystem so that it
+may receive events about metadata health, filesystem shutdowns, media errors,
+file I/O errors, and unmounting of the filesystem.
+Events are queued up for each health monitor file and encoded into a
+``struct xfs_health_monitor_event`` object when the agent calls ``read()`` on
+the file.
+All health events are dispatched to a background threadpool to reduce stalls
+in the main event loop.
+Events can be logged into the system log for further analysis.
+
+For metadata health events, the specific details are used to construct a call
+to the scrub ioctl.
+The filesystem mountpoint is reopened, and the scrub ioctl is invoked.
+If events are lost or the repairs fail, a full scan will be initiated by
+starting up an ``xfs_scrub@.service`` for the given mountpoint.
+
+A filesystem shutdown causes all future repair work to cease, and an unmount
+causes the agent to exit.
+
+**Question**: Why use a pseudofile instead of existing notification methods?
+
+*Answer*: The pseudofile is a private filesystem interface only available to
+processes with the CAP_SYS_ADMIN privilege.
+Being private gives the kernel and ``xfs_healer`` the flexibility to change
+the event format in the future without worrying about backwards
+compatibility.
+Using existing notifications means that the event format would be frozen in
+public UAPI forever.
+
+The pseudofile can also accept ioctls, which gives ``xfs_healer`` a reliable
+means to validate, prior to a repair, that its reopened mountpoint is
+actually the same filesystem being monitored.
+
+**Future Work Question**: Should the healer daemon also register a dbus
+listener and publish events there?
+
+*Answer*: This is unclear -- if there's a demand for system monitoring daemons
+to consume this information and make decisions, then yes, this could be wired
+up in ``xfs_healer``.
+On the other hand, systemd is in the middle of a transition to varlink, so
+it makes more sense to wait and see what happens.
+
 7. Conclusion and Future Work
 =============================
 


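The design doc above describes the agent's core loop: read fixed-size health
events from a monitor fd and dispatch each one. The loop shape can be sketched
in plain userspace C. This is a hedged illustration only: `struct fake_event`
and `drain_events` are invented stand-ins for the real
`struct xfs_health_monitor_event` and agent loop, and the demo feeds records
through a pipe rather than the XFS_IOC_HEALTH_MONITOR pseudofile.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Invented stand-in for struct xfs_health_monitor_event. */
struct fake_event {
	uint32_t	domain;	/* fs/ag/inode/rtgroup */
	uint32_t	type;	/* sick/corrupt/healthy */
	uint64_t	ino;	/* inode number, if applicable */
};

/*
 * Read fixed-size event records until the fd is exhausted, dispatching
 * each one.  Returns the number of events handled.
 */
static int drain_events(int fd)
{
	struct fake_event	ev;
	int			handled = 0;

	while (read(fd, &ev, sizeof(ev)) == (ssize_t)sizeof(ev)) {
		/* A real agent would schedule a targeted repair here. */
		handled++;
	}
	return handled;
}

/* Demo: stage two events in a pipe and run the loop over them. */
int demo_event_loop(void)
{
	struct fake_event	ev = { .domain = 3, .type = 1, .ino = 133 };
	int			pfd[2];
	int			n;

	if (pipe(pfd) != 0)
		return -1;
	if (write(pfd[1], &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
		return -1;
	ev.ino = 134;
	if (write(pfd[1], &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
		return -1;
	close(pfd[1]);		/* EOF terminates the loop */

	n = drain_events(pfd[0]);
	close(pfd[0]);
	return n;
}
```

The real daemon blocks in ``read()`` and hands events off to a thread pool,
but the record-at-a-time framing is the same.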
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 03/19] xfs: create debugfs uuid aliases
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-10-23  0:00   ` [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
  2025-10-23  0:01   ` [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
@ 2025-10-23  0:01   ` Darrick J. Wong
  2025-10-23  0:01   ` [PATCH 04/19] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (15 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:01 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an alias for the debugfs dir so that we can find a filesystem by
uuid, unless the filesystem is mounted with the nouuid option.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   11 +++++++++++
 2 files changed, 12 insertions(+)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f046d1215b043c..8643d539bc4869 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -290,6 +290,7 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct xfs_zone_info	*m_zone_info;	/* zone allocator information */
 	struct dentry		*m_debugfs;	/* debugfs parent */
+	struct dentry		*m_debugfs_uuid; /* debugfs symlink */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d8f326d8838036..abe229fa5aa4b6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -813,6 +813,7 @@ xfs_mount_free(
 	if (mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_ddev_targp);
 
+	debugfs_remove(mp->m_debugfs_uuid);
 	debugfs_remove(mp->m_debugfs);
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
@@ -1963,6 +1964,16 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+		char	name[UUID_STRING_LEN + 1];
+
+		snprintf(name, sizeof(name), "%pU", &mp->m_sb.sb_uuid);
+		mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+				mp->m_super->s_id);
+	} else {
+		mp->m_debugfs_uuid = NULL;
+	}
+
 	return 0;
 
  out_filestream_unmount:


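The patch above formats the superblock uuid with the kernel's ``%pU``
specifier to name the debugfs symlink. As a userspace sketch of what that
produces — hedged: ``%pU`` prints the 16 raw bytes in 8-4-4-4-12 hex grouping,
and ``uuid_unparse_sketch``/``UUID_STR_LEN`` below are invented names, not
kernel API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define UUID_STR_LEN	36	/* 32 hex digits + 4 dashes */

/*
 * Format 16 raw uuid bytes as "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
 * mirroring the symlink name built from %pU in the patch.
 */
static void uuid_unparse_sketch(const unsigned char u[16],
				char out[UUID_STR_LEN + 1])
{
	snprintf(out, UUID_STR_LEN + 1,
		 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		 "%02x%02x%02x%02x%02x%02x",
		 u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
		 u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}

/* Demo: format a fixed byte pattern. */
const char *demo_uuid(void)
{
	static const unsigned char u[16] = {
		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
	};
	static char buf[UUID_STR_LEN + 1];

	uuid_unparse_sketch(u, buf);
	return buf;
}
```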
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 04/19] xfs: create hooks for monitoring health updates
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-10-23  0:01   ` [PATCH 03/19] xfs: create debugfs uuid aliases Darrick J. Wong
@ 2025-10-23  0:01   ` Darrick J. Wong
  2025-10-23  0:01   ` [PATCH 05/19] xfs: create a filesystem shutdown hook Darrick J. Wong
                     ` (14 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:01 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   47 ++++++++++
 fs/xfs/xfs_mount.h         |    3 +
 fs/xfs/xfs_health.c        |  202 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c         |    1 +
 4 files changed, 252 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8643d539bc4869..b810b01734d854 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -344,6 +344,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed dirent updates to an active online repair. */
 	struct xfs_hooks	m_dir_update_hooks;
+
+	/* Hook to feed health events to a daemon. */
+	struct xfs_hooks	m_health_update_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 7c541fb373d5b2..abf9460ae79953 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -20,6 +20,157 @@
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of health updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  The health monitoring file is
+ * currently the only caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch);
+
+void
+xfs_health_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_health_hooks_switch);
+}
+
+void
+xfs_health_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_health_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem unmount health update. */
+static inline void
+xfs_health_unmount_hook(
+	struct xfs_mount		*mp)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+		};
+
+		xfs_hooks_call(&mp->m_health_update_hooks,
+				XFS_HEALTHUP_UNMOUNT, &p);
+	}
+}
+
+/* Call downstream hooks for a filesystem health update. */
+static inline void
+xfs_fs_health_update_hook(
+	struct xfs_mount		*mp,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+		};
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a group health update. */
+static inline void
+xfs_group_health_update_hook(
+	struct xfs_group		*xg,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.group		= xg->xg_gno,
+		};
+		struct xfs_mount	*mp = xg->xg_mount;
+
+		switch (xg->xg_type) {
+		case XG_TYPE_AG:
+			p.domain = XFS_HEALTHUP_AG;
+			break;
+		case XG_TYPE_RTG:
+			p.domain = XFS_HEALTHUP_RTGROUP;
+			break;
+		default:
+			ASSERT(0);
+			return;
+		}
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for an inode health update. */
+static inline void
+xfs_inode_health_update_hook(
+	struct xfs_inode		*ip,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_INODE,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.ino		= ip->i_ino,
+			.gen		= VFS_I(ip)->i_generation,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a health update. */
+int
+xfs_health_hook_add(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Stop calling the specified function during a health update. */
+void
+xfs_health_hook_del(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Configure health update hook functions. */
+void
+xfs_health_hook_setup(
+	struct xfs_health_hook	*hook,
+	notifier_fn_t		mod_fn)
+{
+	xfs_hook_setup(&hook->health_hook, mod_fn);
+}
+#else
+# define xfs_health_unmount_hook(...)			((void)0)
+# define xfs_fs_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_rt_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_group_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_inode_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 static void
 xfs_health_unmount_group(
 	struct xfs_group	*xg,
@@ -50,8 +201,10 @@ xfs_health_unmount(
 	unsigned int		checked = 0;
 	bool			warn = false;
 
-	if (xfs_is_shutdown(mp))
+	if (xfs_is_shutdown(mp)) {
+		xfs_health_unmount_hook(mp);
 		return;
+	}
 
 	/* Measure AG corruption levels. */
 	while ((pag = xfs_perag_next(mp, pag)))
@@ -97,6 +250,8 @@ xfs_health_unmount(
 		if (sick & XFS_SICK_FS_COUNTERS)
 			xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
 	}
+
+	xfs_health_unmount_hook(mp);
 }
 
 /* Mark unhealthy per-fs metadata. */
@@ -105,12 +260,17 @@ xfs_fs_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark per-fs metadata as having been checked and found unhealthy by fsck. */
@@ -119,13 +279,18 @@ xfs_fs_mark_corrupt(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_corrupt(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark a per-fs metadata healed. */
@@ -134,15 +299,20 @@ xfs_fs_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick &= ~mask;
 	if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
 		mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-fs metadata are unhealthy. */
@@ -192,12 +362,17 @@ xfs_group_mark_sick(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_sick(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /*
@@ -208,13 +383,18 @@ xfs_group_mark_corrupt(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_corrupt(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /*
@@ -225,15 +405,20 @@ xfs_group_mark_healthy(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_healthy(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick &= ~mask;
 	if (!(xg->xg_sick & XFS_SICK_AG_PRIMARY))
 		xg->xg_sick &= ~XFS_SICK_AG_SECONDARY;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-ag metadata are unhealthy. */
@@ -272,10 +457,13 @@ xfs_inode_mark_sick(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_sick(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -287,6 +475,8 @@ xfs_inode_mark_sick(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark inode metadata as having been checked and found unhealthy by fsck. */
@@ -295,10 +485,13 @@ xfs_inode_mark_corrupt(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_corrupt(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
@@ -311,6 +504,8 @@ xfs_inode_mark_corrupt(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark parts of an inode healed. */
@@ -319,15 +514,20 @@ xfs_inode_mark_healthy(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_healthy(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick &= ~mask;
 	if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
 		ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which parts of an inode are unhealthy. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index abe229fa5aa4b6..cd3b7343b326a8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2285,6 +2285,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;


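Patches 4-6 all reuse the same xfs_hooks pattern: register listeners on a
per-mount list, gate dispatch behind a switch, and only fire when new state
bits are set. A rough userspace model follows — hedged: the kernel
implementation uses notifier chains and jump labels, and every ``sketch_*``
name here is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One registered listener; the kernel uses a notifier chain instead. */
struct sketch_hook {
	int (*fn)(unsigned long action, void *data);
	struct sketch_hook *next;
};

static struct sketch_hook	*hook_list;
static int			hooks_enabled;	/* models the static key */

static void sketch_hook_add(struct sketch_hook *h)
{
	h->next = hook_list;
	hook_list = h;
}

static void sketch_hooks_enable(void)  { hooks_enabled++; }
static void sketch_hooks_disable(void) { hooks_enabled--; }

/* Walk the list, like xfs_hooks_call(); returns listeners invoked. */
static int sketch_hooks_call(unsigned long action, void *data)
{
	struct sketch_hook	*h;
	int			calls = 0;

	if (hooks_enabled <= 0)	/* switch off: behaves like a nop sled */
		return 0;
	for (h = hook_list; h != NULL; h = h->next) {
		h->fn(action, data);
		calls++;
	}
	return calls;
}

/* Example listener: remember the newly set sick bits it was told about. */
static unsigned int last_mask;
static int record_fn(unsigned long action, void *data)
{
	(void)action;
	last_mask = *(unsigned int *)data;
	return 0;
}

/* Demo mirroring xfs_fs_mark_sick(); call once. */
int demo_health_hooks(void)
{
	static struct sketch_hook	h = { .fn = record_fn };
	unsigned int			new_mask = 0x4;

	sketch_hook_add(&h);
	if (sketch_hooks_call(1, &new_mask) != 0)	/* still disabled */
		return -1;
	sketch_hooks_enable();
	if (new_mask)			/* the if (new_mask) gate above */
		if (sketch_hooks_call(1, &new_mask) != 1)
			return -1;
	sketch_hooks_disable();
	return last_mask == 0x4 ? 0 : -1;
}
```

The ``if (new_mask)`` gate mirrors the hook helpers in this patch, which skip
the call entirely when no sick bits actually change.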
^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 05/19] xfs: create a filesystem shutdown hook
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-10-23  0:01   ` [PATCH 04/19] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2025-10-23  0:01   ` Darrick J. Wong
  2025-10-23  0:02   ` [PATCH 06/19] xfs: create hooks for media errors Darrick J. Wong
                     ` (13 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:01 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a hook so that health monitoring can report filesystem shutdown
events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.h |   14 +++++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_fsops.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c |    1 +
 4 files changed, 75 insertions(+)


diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 9d23c361ef56e4..7f6f876de072b1 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -15,4 +15,18 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+	struct xfs_hook			shutdown_hook;
+};
+
+void xfs_shutdown_hook_disable(void);
+void xfs_shutdown_hook_enable(void);
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+		notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b810b01734d854..96c920ad5add13 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -347,6 +347,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed health events to a daemon. */
 	struct xfs_hooks	m_health_update_hooks;
+
+	/* Hook to feed shutdown events to a daemon. */
+	struct xfs_hooks	m_shutdown_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 0ada735693945c..69918cd1ba1dbc 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -482,6 +482,61 @@ xfs_fs_goingdown(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_shutdown_hooks_switch);
+
+void
+xfs_shutdown_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_shutdown_hooks_switch);
+}
+
+void
+xfs_shutdown_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_shutdown_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+	struct xfs_mount		*mp,
+	uint32_t			flags)
+{
+	if (xfs_hooks_switched_on(&xfs_shutdown_hooks_switch))
+		xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a shutdown update. */
+int
+xfs_shutdown_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a shutdown update. */
+void
+xfs_shutdown_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown update hook functions. */
+void
+xfs_shutdown_hook_setup(
+	struct xfs_shutdown_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -540,6 +595,8 @@ xfs_do_force_shutdown(
 		"Please unmount the filesystem and rectify the problem(s)");
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
+
+	xfs_shutdown_hook(mp, flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index cd3b7343b326a8..54dcc42c65c786 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2285,6 +2285,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 06/19] xfs: create hooks for media errors
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-10-23  0:01   ` [PATCH 05/19] xfs: create a filesystem shutdown hook Darrick J. Wong
@ 2025-10-23  0:02   ` Darrick J. Wong
  2025-10-23  0:02   ` [PATCH 07/19] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
                     ` (12 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:02 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a media error event hook so that we can send events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h          |    3 ++
 fs/xfs/xfs_notify_failure.h |   38 +++++++++++++++++++
 fs/xfs/xfs_notify_failure.c |   84 ++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_super.c          |    1 +
 4 files changed, 121 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 96c920ad5add13..0907714c9d6f21 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -350,6 +350,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed shutdown events to a daemon. */
 	struct xfs_hooks	m_shutdown_hooks;
+
+	/* Hook to feed media error events to a daemon. */
+	struct xfs_hooks	m_media_error_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 8d08ec29dd2949..528317ff24320a 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -8,4 +8,42 @@
 
 extern const struct dax_holder_operations xfs_dax_holder_operations;
 
+enum xfs_failed_device {
+	XFS_FAILED_DATADEV,
+	XFS_FAILED_LOGDEV,
+	XFS_FAILED_RTDEV,
+};
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+struct xfs_media_error_params {
+	struct xfs_mount		*mp;
+	enum xfs_failed_device		fdev;
+	xfs_daddr_t			daddr;
+	uint64_t			bbcount;
+	bool				pre_remove;
+};
+
+struct xfs_media_error_hook {
+	struct xfs_hook			error_hook;
+};
+
+void xfs_media_error_hook_disable(void);
+void xfs_media_error_hook_enable(void);
+
+int xfs_media_error_hook_add(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_del(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook,
+		notifier_fn_t mod_fn);
+#else
+struct xfs_media_error_params { };
+struct xfs_media_error_hook { };
+# define xfs_media_error_hook_disable()		((void)0)
+# define xfs_media_error_hook_enable()		((void)0)
+# define xfs_media_error_hook_add(...)		(0)
+# define xfs_media_error_hook_del(...)		((void)0)
+# define xfs_media_error_hook_setup(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS && CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index b1767288994206..2098ff452a3b87 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -27,6 +27,73 @@
 #include <linux/dax.h>
 #include <linux/fs.h>
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_media_error_hooks_switch);
+
+void
+xfs_media_error_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_media_error_hooks_switch);
+}
+
+void
+xfs_media_error_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_media_error_hooks_switch);
+}
+
+/* Call downstream hooks for a media error. */
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	if (xfs_hooks_switched_on(&xfs_media_error_hooks_switch)) {
+		struct xfs_media_error_params p = {
+			.mp		= mp,
+			.fdev		= fdev,
+			.daddr		= daddr,
+			.bbcount	= bbcount,
+			.pre_remove	= pre_remove,
+		};
+
+		xfs_hooks_call(&mp->m_media_error_hooks, 0, &p);
+	}
+}
+
+/* Call the specified function during a media error. */
+int
+xfs_media_error_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Stop calling the specified function during a media error. */
+void
+xfs_media_error_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Configure media error hook functions. */
+void
+xfs_media_error_hook_setup(
+	struct xfs_media_error_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->error_hook, mod_fn);
+}
+#else
+# define xfs_media_error_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -215,6 +282,9 @@ xfs_dax_notify_logdev_failure(
 	if (error)
 		return error;
 
+	xfs_media_error_hook(mp, XFS_FAILED_LOGDEV, daddr, bblen,
+			mf_flags & MF_MEM_PRE_REMOVE);
+
 	/*
 	 * In the pre-remove case the failure notification is attempting to
 	 * trigger a force unmount.  The expectation is that the device is
@@ -248,16 +318,20 @@ xfs_dax_notify_dev_failure(
 	uint64_t		bblen;
 	struct xfs_group	*xg = NULL;
 
+	error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
+			offset, len, &daddr, &bblen);
+	if (error)
+		return error;
+
+	xfs_media_error_hook(mp, type == XG_TYPE_RTG ?
+			XFS_FAILED_RTDEV : XFS_FAILED_DATADEV,
+			daddr, bblen, mf_flags & MF_MEM_PRE_REMOVE);
+
 	if (!xfs_has_rmapbt(mp)) {
 		xfs_debug(mp, "notify_failure() needs rmapbt enabled!");
 		return -EOPNOTSUPP;
 	}
 
-	error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
-			offset, len, &daddr, &bblen);
-	if (error)
-		return error;
-
 	if (type == XG_TYPE_RTG) {
 		start_bno = xfs_daddr_to_rtb(mp, daddr);
 		end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 54dcc42c65c786..51f8db95e717a8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2287,6 +2287,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_dir_update_hooks);
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
+	xfs_hooks_init(&mp->m_media_error_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 07/19] iomap: report buffered read and write io errors to the filesystem
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-10-23  0:02   ` [PATCH 06/19] xfs: create hooks for media errors Darrick J. Wong
@ 2025-10-23  0:02   ` Darrick J. Wong
  2025-10-23  0:02   ` [PATCH 08/19] iomap: report directio read and write errors to callers Darrick J. Wong
                     ` (11 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:02 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Provide a callback so that iomap can report read and write IO errors to
the calling filesystem.  For now this is only wired up in iomap, with XFS
as the testbed.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/internal.h               |    2 ++
 include/linux/fs.h                |    4 ++++
 Documentation/filesystems/vfs.rst |    7 +++++++
 fs/iomap/buffered-io.c            |   27 +++++++++++++++++++++++++--
 fs/iomap/ioend.c                  |    4 ++++
 5 files changed, 42 insertions(+), 2 deletions(-)


diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
index d05cb3aed96e79..06d9145b6be4fa 100644
--- a/fs/iomap/internal.h
+++ b/fs/iomap/internal.h
@@ -5,5 +5,7 @@
 #define IOEND_BATCH_SIZE	4096
 
 u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+		loff_t pos, u64 len, int error);
 
 #endif /* _IOMAP_INTERNAL_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c895146c1444be..5e4b3a4b24823f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -477,6 +477,10 @@ struct address_space_operations {
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
 	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+
+	/* Callback for dealing with IO errors during readahead or writeback */
+	void (*ioerror)(struct address_space *mapping, int direction,
+			loff_t pos, u64 len, int error);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f13b01e42eb5e..9e70006bf99a63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -822,6 +822,8 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
 		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+		void (*ioerror)(struct address_space *mapping, int direction,
+				loff_t pos, u64 len, int error);
 	};
 
 ``read_folio``
@@ -1032,6 +1034,11 @@ cache in your filesystem.  The following members are defined:
 ``swap_rw``
 	Called to read or write swap pages when SWP_FS_OPS is set.
 
+``ioerror``
+	Called to deal with IO errors during readahead or writeback.
+	This may be called from interrupt context, and without any
+	locks necessarily being held.
+
 The File Object
 ===============
 
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b847a1e27f13e..8dd5421cb910b5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -288,6 +288,14 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 		pos >= i_size_read(iter->inode);
 }
 
+inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+		loff_t pos, u64 len, int error)
+{
+	if (mapping && mapping->a_ops->ioerror)
+		mapping->a_ops->ioerror(mapping, direction, pos, len,
+				error);
+}
+
 /**
  * iomap_read_inline_data - copy inline data into the page cache
  * @iter: iteration structure
@@ -310,8 +318,11 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (folio_test_uptodate(folio))
 		return 0;
 
-	if (WARN_ON_ONCE(size > iomap->length))
+	if (WARN_ON_ONCE(size > iomap->length)) {
+		iomap_mapping_ioerror(folio->mapping, READ, iomap->offset,
+				size, -EIO);
 		return -EIO;
+	}
 	if (offset > 0)
 		ifs_alloc(iter->inode, folio, iter->flags);
 
@@ -339,6 +350,10 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		spin_unlock_irqrestore(&ifs->state_lock, flags);
 	}
 
+	if (error)
+		iomap_mapping_ioerror(folio->mapping, READ,
+				folio_pos(folio) + off, len, error);
+
 	if (finished)
 		folio_end_read(folio, uptodate);
 }
@@ -558,11 +573,15 @@ static int iomap_read_folio_range(const struct iomap_iter *iter,
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct bio_vec bvec;
 	struct bio bio;
+	int ret;
 
 	bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
 	bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
-	return submit_bio_wait(&bio);
+	ret = submit_bio_wait(&bio);
+	if (ret)
+		iomap_mapping_ioerror(folio->mapping, READ, pos, len, ret);
+	return ret;
 }
 #else
 static int iomap_read_folio_range(const struct iomap_iter *iter,
@@ -1674,6 +1693,7 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
 	u64 pos = folio_pos(folio);
 	u64 end_pos = pos + folio_size(folio);
 	u64 end_aligned = 0;
+	loff_t orig_pos = pos;
 	bool wb_pending = false;
 	int error = 0;
 	u32 rlen;
@@ -1724,6 +1744,9 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
 
 	if (wb_pending)
 		wpc->nr_folios++;
+	if (error && pos > orig_pos)
+		iomap_mapping_ioerror(inode->i_mapping, WRITE, orig_pos,
+				pos - orig_pos, error);
 
 	/*
 	 * We can have dirty bits set past end of file in page_mkwrite path
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index b49fa75eab260a..56e654f2d36fe9 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -55,6 +55,10 @@ static u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
 
 	/* walk all folios in bio, ending page IO on them */
 	bio_for_each_folio_all(fi, bio) {
+		if (ioend->io_error)
+			iomap_mapping_ioerror(inode->i_mapping, WRITE,
+					folio_pos(fi.folio) + fi.offset,
+					fi.length, ioend->io_error);
 		iomap_finish_folio_write(inode, fi.folio, fi.length);
 		folio_count++;
 	}


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 08/19] iomap: report directio read and write errors to callers
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-10-23  0:02   ` [PATCH 07/19] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
@ 2025-10-23  0:02   ` Darrick J. Wong
  2025-10-23  0:02   ` [PATCH 09/19] xfs: create file io error hooks Darrick J. Wong
                     ` (10 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:02 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add more hooks to report directio IO errors to the filesystem.
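The completion-path logic is small enough to model in userspace: the new
callback fires only when the dio actually failed and the filesystem
supplied a handler.  A sketch under those assumptions (the ``dio`` and
``dio_ops`` types here are simplified stand-ins of ours, not the iomap
structures):

```c
#include <stddef.h>

/* Simplified stand-ins for the iomap structures (illustrative only). */
struct dio_ops {
	void (*ioerror)(long long pos, unsigned long long len, int error);
};

struct dio {
	long long offset;		/* start of the dio */
	unsigned long long size;	/* bytes transferred */
	int error;			/* 0 on success */
};

static int reported_error;	/* instrumentation for the example */

static void note_error(long long pos, unsigned long long len, int error)
{
	(void)pos; (void)len;
	reported_error = error;
}

/* Models the hunk added to iomap_dio_complete(). */
static void dio_complete(const struct dio *dio, const struct dio_ops *dops)
{
	if (dio->error && dops && dops->ioerror)
		dops->ioerror(dio->offset, dio->size, dio->error);
}
```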

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/linux/iomap.h |    2 ++
 fs/iomap/direct-io.c  |    4 ++++
 2 files changed, 6 insertions(+)


diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 73dceabc21c8c7..ca1590e5002342 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,6 +486,8 @@ struct iomap_dio_ops {
 		      unsigned flags);
 	void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
 		          loff_t file_offset);
+	void (*ioerror)(struct inode *inode, int direction, loff_t pos,
+			u64 len, int error);
 
 	/*
 	 * Filesystems wishing to attach private information to a direct io bio
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5d5d63efbd5767..1512d8dbb0d2e7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -95,6 +95,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+	if (dio->error && dops && dops->ioerror)
+		dops->ioerror(file_inode(iocb->ki_filp),
+				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
+				offset, dio->size, dio->error);
 
 	if (likely(!ret)) {
 		ret = dio->size;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 09/19] xfs: create file io error hooks
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-10-23  0:02   ` [PATCH 08/19] iomap: report directio read and write errors to callers Darrick J. Wong
@ 2025-10-23  0:02   ` Darrick J. Wong
  2025-10-23  0:03   ` [PATCH 10/19] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
                     ` (9 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:02 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks within XFS to deliver file IO errors to registered
consumers such as a health monitoring daemon.
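The shape of the mechanism can be modeled in a few lines of userspace C:
a global on/off switch guards event creation on the IO path, and queued
events fan out to every registered consumer.  All names below are
illustrative, not the kernel API, and the kernel defers the fan-out to a
workqueue where this sketch runs it inline:

```c
#include <stdbool.h>

struct ioerror_params {
	long long	ino;
	int		error;
};

typedef void (*ioerror_fn)(const struct ioerror_params *p);

static bool hooks_on;		/* models the static-key hook switch */
static ioerror_fn hooks[4];	/* models the m_file_ioerror_hooks chain */
static int nr_hooks;

static int last_error_seen;	/* instrumentation for the example */

static void record_hook(const struct ioerror_params *p)
{
	last_error_seen = p->error;
}

static void hook_add(ioerror_fn fn)
{
	hooks[nr_hooks++] = fn;
}

/* Models xfs_file_report_ioerror(): call every registered hook. */
static void report_ioerror(const struct ioerror_params *p)
{
	for (int i = 0; i < nr_hooks; i++)
		hooks[i](p);
}

/* Models the IO path: do nothing unless a monitor is attached. */
static void file_ioerror(long long ino, int error)
{
	if (!hooks_on)
		return;		/* switch off: cheap early exit */

	struct ioerror_params p = { .ino = ino, .error = error };
	report_ioerror(&p);	/* the kernel queues this to a workqueue */
}
```

The switch keeps the common case (no monitor attached) nearly free, so
the hooks cost nothing for filesystems that never open a monitor fd.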

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_file.h  |   36 +++++++++++
 fs/xfs/xfs_mount.h |    3 +
 fs/xfs/xfs_aops.c  |    2 +
 fs/xfs/xfs_file.c  |  167 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c |    1 
 5 files changed, 208 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index 2ad91f755caf35..2b4e02efefb7b1 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -12,4 +12,40 @@ extern const struct file_operations xfs_dir_file_operations;
 bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos,
 		long long int len);
 
+enum xfs_file_ioerror_type {
+	XFS_FILE_IOERROR_BUFFERED_READ,
+	XFS_FILE_IOERROR_BUFFERED_WRITE,
+	XFS_FILE_IOERROR_DIRECT_READ,
+	XFS_FILE_IOERROR_DIRECT_WRITE,
+};
+
+struct xfs_file_ioerror_params {
+	xfs_ino_t		ino;
+	loff_t			pos;
+	u64			len;
+	u32			gen;
+	int			error;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror_hook {
+	struct xfs_hook			ioerror_hook;
+};
+
+void xfs_file_ioerror_hook_disable(void);
+void xfs_file_ioerror_hook_enable(void);
+
+int xfs_file_ioerror_hook_add(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_del(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook,
+		notifier_fn_t mod_fn);
+
+void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos,
+		u64 len, int error);
+#else
+# define xfs_vm_ioerror			NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0907714c9d6f21..9b17899a012fe6 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -353,6 +353,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed media error events to a daemon. */
 	struct xfs_hooks	m_media_error_hooks;
+
+	/* Hook to feed file io error events to a daemon. */
+	struct xfs_hooks	m_file_ioerror_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a26f798155331f..f3f28b9ae0f70e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,6 +22,7 @@
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
+#include "xfs_file.h"
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -810,6 +811,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
 	.swap_activate		= xfs_vm_swap_activate,
+	.ioerror		= xfs_vm_ioerror,
 };
 
 const struct address_space_operations xfs_dax_aops = {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2702fef2c90cd2..1c9b21ad97d46c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -222,6 +222,169 @@ xfs_ilock_iocb_for_write(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_file_ioerror_hooks_switch);
+
+void
+xfs_file_ioerror_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_file_ioerror_hooks_switch);
+}
+
+void
+xfs_file_ioerror_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_file_ioerror_hooks_switch);
+}
+
+struct xfs_file_ioerror {
+	struct work_struct		work;
+	struct xfs_mount		*mp;
+	xfs_ino_t			ino;
+	loff_t				pos;
+	u64				len;
+	u32				gen;
+	int				error;
+	enum xfs_file_ioerror_type	type;
+};
+
+/* Call downstream hooks for a file io error update. */
+STATIC void
+xfs_file_report_ioerror(
+	struct work_struct	*work)
+{
+	struct xfs_file_ioerror	*ioerr;
+
+	ioerr = container_of(work, struct xfs_file_ioerror, work);
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		struct xfs_file_ioerror_params	p = {
+			.ino		= ioerr->ino,
+			.gen		= ioerr->gen,
+			.pos		= ioerr->pos,
+			.len		= ioerr->len,
+			.error		= ioerr->error,
+		};
+		struct xfs_mount	*mp = ioerr->mp;
+
+		xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p);
+	}
+
+	kfree(ioerr);
+}
+
+/* Queue a directio io error notification. */
+STATIC void
+xfs_dio_ioerror(
+	struct inode		*inode,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+		if (!ioerr) {
+			xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+					ip->i_ino,
+					direction == WRITE ? "WRITE" : "READ",
+					pos, len, error);
+			return;
+		}
+
+		INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+		ioerr->mp = mp;
+		ioerr->ino = ip->i_ino;
+		ioerr->gen = VFS_I(ip)->i_generation;
+		ioerr->pos = pos;
+		ioerr->len = len;
+		if (direction == WRITE)
+			ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE;
+		else
+			ioerr->type = XFS_FILE_IOERROR_DIRECT_READ;
+		ioerr->error = error;
+		queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+	}
+}
+
+/* Queue a buffered io error notification. */
+void
+xfs_vm_ioerror(
+	struct address_space	*mapping,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct inode		*inode = mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	if (xfs_hooks_switched_on(&xfs_file_ioerror_hooks_switch)) {
+		ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+		if (!ioerr) {
+			xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+					ip->i_ino,
+					direction == WRITE ? "WRITE" : "READ",
+					pos, len, error);
+			return;
+		}
+
+		INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+		ioerr->mp = mp;
+		ioerr->ino = ip->i_ino;
+		ioerr->gen = VFS_I(ip)->i_generation;
+		ioerr->pos = pos;
+		ioerr->len = len;
+		if (direction == WRITE)
+			ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE;
+		else
+			ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ;
+		ioerr->error = error;
+		queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+	}
+}
+
+/* Call the specified function after a file io error. */
+int
+xfs_file_ioerror_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Stop calling the specified function after a file io error. */
+void
+xfs_file_ioerror_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Configure file io error update hook functions. */
+void
+xfs_file_ioerror_hook_setup(
+	struct xfs_file_ioerror_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->ioerror_hook, mod_fn);
+}
+#else
+# define xfs_dio_ioerror		NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
+static const struct iomap_dio_ops xfs_dio_read_ops = {
+	.ioerror	= xfs_dio_ioerror,
+};
+
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
@@ -240,7 +403,8 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops,
+			0, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -625,6 +789,7 @@ xfs_dio_write_end_io(
 
 static const struct iomap_dio_ops xfs_dio_write_ops = {
 	.end_io		= xfs_dio_write_end_io,
+	.ioerror	= xfs_dio_ioerror,
 };
 
 static void
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 51f8db95e717a8..b6a6027b4df8d8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2288,6 +2288,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_hooks_init(&mp->m_media_error_hooks);
+	xfs_hooks_init(&mp->m_file_ioerror_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 10/19] xfs: create a special file to pass filesystem health to userspace
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-10-23  0:02   ` [PATCH 09/19] xfs: create file io error hooks Darrick J. Wong
@ 2025-10-23  0:03   ` Darrick J. Wong
  2025-10-23  0:03   ` [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
                     ` (8 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:03 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |    8 ++
 fs/xfs/xfs_healthmon.h |   16 +++++
 fs/xfs/Kconfig         |    8 ++
 fs/xfs/Makefile        |    1 
 fs/xfs/xfs_healthmon.c |  157 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |    4 +
 6 files changed, 194 insertions(+)
 create mode 100644 fs/xfs/xfs_healthmon.h
 create mode 100644 fs/xfs/xfs_healthmon.c


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 12463ba766da05..dba7896f716092 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,13 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1042,6 +1049,7 @@ struct xfs_rtgroup_geometry {
 #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
+#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
new file mode 100644
index 00000000000000..07126e39281a0c
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_HEALTHMON_H__
+#define __XFS_HEALTHMON_H__
+
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+long xfs_ioc_health_monitor(struct xfs_mount *mp,
+		struct xfs_health_monitor __user *arg);
+#else
+# define xfs_ioc_health_monitor(mp, hmo)	(-ENOTTY)
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
+#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 8930d5254e1da6..b5d48515236302 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -121,6 +121,14 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_HEALTH_MONITOR
+	bool "Report filesystem health events to userspace"
+	depends on XFS_FS
+	select XFS_LIVE_HOOKS
+	default y
+	help
+	  Report health events to userspace programs.
+
 config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5bf501cf827172..d4e9070a9326ba 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -157,6 +157,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
 xfs-$(CONFIG_XFS_BTREE_IN_MEM)	+= libxfs/xfs_btree_mem.o
+xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
new file mode 100644
index 00000000000000..7b0d9f78b0a402
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trace.h"
+#include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
+
+#include <linux/anon_inodes.h>
+#include <linux/eventpoll.h>
+#include <linux/poll.h>
+
+/*
+ * Live Health Monitoring
+ * ======================
+ *
+ * Autonomous self-healing of XFS filesystems requires a means for the kernel
+ * to send filesystem health events to a monitoring daemon in userspace.  To
+ * accomplish this, we establish a thread_with_file kthread object to handle
+ * translating internal events about filesystem health into a format that can
+ * be parsed easily by userspace.  Then we hook various parts of the filesystem
+ * to supply those internal events to the kthread.  Userspace reads events
+ * from the file descriptor returned by the ioctl.
+ *
+ * The healthmon abstraction has a weak reference to the host filesystem mount
+ * so that the queueing and processing of the events do not pin the mount and
+ * cannot slow down the main filesystem.  The healthmon object can exist past
+ * the end of the filesystem mount.
+ */
+
+struct xfs_healthmon {
+	struct xfs_mount		*mp;
+};
+
+/*
+ * Convey queued event data to userspace.  First copy any remaining bytes in
+ * the outbuf, then format the oldest event into the outbuf and copy that too.
+ */
+STATIC ssize_t
+xfs_healthmon_read_iter(
+	struct kiocb		*iocb,
+	struct iov_iter		*to)
+{
+	return -EIO;
+}
+
+/* Free the health monitoring information. */
+STATIC int
+xfs_healthmon_release(
+	struct inode		*inode,
+	struct file		*file)
+{
+	struct xfs_healthmon	*hm = file->private_data;
+
+	kfree(hm);
+
+	return 0;
+}
+
+/* Validate ioctl parameters. */
+static inline bool
+xfs_healthmon_validate(
+	const struct xfs_health_monitor	*hmo)
+{
+	if (hmo->flags)
+		return false;
+	if (hmo->format)
+		return false;
+	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
+		return false;
+	if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2)))
+		return false;
+	return true;
+}
+
+/* Emit some data about the health monitoring fd. */
+#ifdef CONFIG_PROC_FS
+static void
+xfs_healthmon_show_fdinfo(
+	struct seq_file		*m,
+	struct file		*file)
+{
+	struct xfs_healthmon	*hm = file->private_data;
+
+	seq_printf(m, "state:\talive\ndev:\t%s\n",
+			hm->mp->m_super->s_id);
+}
+#endif
+
+static const struct file_operations xfs_healthmon_fops = {
+	.owner		= THIS_MODULE,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= xfs_healthmon_show_fdinfo,
+#endif
+	.read_iter	= xfs_healthmon_read_iter,
+	.release	= xfs_healthmon_release,
+};
+
+/*
+ * Create a health monitoring file.  Returns an index to the fd table or a
+ * negative errno.
+ */
+long
+xfs_ioc_health_monitor(
+	struct xfs_mount		*mp,
+	struct xfs_health_monitor __user *arg)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm;
+	int				fd;
+	int				ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	hm = kzalloc(sizeof(*hm), GFP_KERNEL);
+	if (!hm)
+		return -ENOMEM;
+	hm->mp = mp;
+
+	/*
+	 * Create the anonymous file.  If it succeeds, the file owns hm and
+	 * can go away at any time, so we must not access it again.
+	 */
+	fd = anon_inode_getfd("xfs_healthmon", &xfs_healthmon_fops, hm,
+			O_CLOEXEC | O_RDONLY);
+	if (fd < 0) {
+		ret = fd;
+		goto out_hm;
+	}
+
+	return fd;
+
+out_hm:
+	kfree(hm);
+	return ret;
+}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a6bb7ee7a27ad5..08998d84554f09 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -41,6 +41,7 @@
 #include "xfs_exchrange.h"
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1421,6 +1422,9 @@ xfs_file_ioctl(
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_ioc_health_monitor(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-10-23  0:03   ` [PATCH 10/19] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2025-10-23  0:03   ` Darrick J. Wong
  2025-10-30 16:54     ` Darrick J. Wong
  2025-10-23  0:03   ` [PATCH 12/19] xfs: report metadata health events through healthmon Darrick J. Wong
                     ` (7 subsequent siblings)
  18 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:03 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about events and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.

Here, we've chosen json to export information to userspace.  The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time.  Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.
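For the C-struct output format, consuming an event amounts to reading
fixed-size records off the monitor fd and switching on the type.  A
hedged userspace sketch mirroring the xfs_health_monitor_event layout
added to xfs_fs.h (``lost_count`` is our illustration, not part of the
patch):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Userspace mirrors of the definitions added to xfs_fs.h. */
#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT	0
#define XFS_HEALTH_MONITOR_TYPE_RUNNING	0
#define XFS_HEALTH_MONITOR_TYPE_LOST	1

struct xfs_health_monitor_lost {
	uint64_t count;
};

struct xfs_health_monitor_event {
	uint32_t domain;	/* XFS_HEALTH_MONITOR_DOMAIN_* */
	uint32_t type;		/* XFS_HEALTH_MONITOR_TYPE_* */
	uint64_t time_ns;	/* ns since the Unix epoch */
	union {
		struct xfs_health_monitor_lost lost;
	} e;
	uint64_t pad[2];	/* zeroes */
};

/*
 * Decode one event from a buffer read() off the monitor fd; returns the
 * number of lost events, or 0 if this is not a "lost" notification.
 */
static uint64_t lost_count(const void *buf)
{
	struct xfs_health_monitor_event ev;

	memcpy(&ev, buf, sizeof(ev));
	if (ev.type != XFS_HEALTH_MONITOR_TYPE_LOST)
		return 0;
	return ev.e.lost.count;
}
```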

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h                  |   50 ++
 fs/xfs/xfs_healthmon.h                  |   29 +
 fs/xfs/xfs_linux.h                      |    3 
 fs/xfs/xfs_trace.h                      |  171 +++++++
 fs/xfs/libxfs/xfs_healthmon.schema.json |  129 +++++
 fs/xfs/xfs_healthmon.c                  |  728 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.c                      |    2 
 lib/seq_buf.c                           |    1 
 8 files changed, 1106 insertions(+), 7 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index dba7896f716092..4b642eea18b5ca 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,45 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+/* Health monitor event domains */
+
+/* affects the whole fs */
+#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT		(0)
+
+/* Health monitor event types */
+
+/* status of the monitor itself */
+#define XFS_HEALTH_MONITOR_TYPE_RUNNING		(0)
+#define XFS_HEALTH_MONITOR_TYPE_LOST		(1)
+
+/* lost events */
+struct xfs_health_monitor_lost {
+	__u64	count;
+};
+
+struct xfs_health_monitor_event {
+	/* XFS_HEALTH_MONITOR_DOMAIN_* */
+	__u32	domain;
+
+	/* XFS_HEALTH_MONITOR_TYPE_* */
+	__u32	type;
+
+	/* Timestamp of the event, in nanoseconds since the Unix epoch */
+	__u64	time_ns;
+
+	/*
+	 * Details of the event.  The primary clients are written in Python
+	 * and Rust, so break this up because bindgen hates anonymous structs
+	 * and unions.
+	 */
+	union {
+		struct xfs_health_monitor_lost lost;
+	} e;
+
+	/* zeroes */
+	__u64	pad[2];
+};
+
 struct xfs_health_monitor {
 	__u64	flags;		/* flags */
 	__u8	format;		/* output format */
@@ -1010,6 +1049,17 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in a C structure */
+#define XFS_HEALTH_MONITOR_FMT_CSTRUCT	(0)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 07126e39281a0c..ea2d6a327dfb16 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -6,6 +6,35 @@
 #ifndef __XFS_HEALTHMON_H__
 #define __XFS_HEALTHMON_H__
 
+enum xfs_healthmon_type {
+	XFS_HEALTHMON_RUNNING,	/* monitor running */
+	XFS_HEALTHMON_LOST,	/* message lost */
+};
+
+enum xfs_healthmon_domain {
+	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+};
+
+struct xfs_healthmon_event {
+	struct xfs_healthmon_event	*next;
+
+	enum xfs_healthmon_type		type;
+	enum xfs_healthmon_domain	domain;
+
+	uint64_t			time_ns;
+
+	union {
+		/* lost events */
+		struct {
+			uint64_t	lostcount;
+		};
+		/* mount */
+		struct {
+			unsigned int	flags;
+		};
+	};
+};
+
 #ifdef CONFIG_XFS_HEALTH_MONITOR
 long xfs_ioc_health_monitor(struct xfs_mount *mp,
 		struct xfs_health_monitor __user *arg);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 4dd747bdbccab2..e122db938cc06b 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -63,6 +63,9 @@ typedef __u32			xfs_nlink_t;
 #include <linux/xattr.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/debugfs.h>
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+# include <linux/seq_buf.h>
+#endif
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 79b8641880ab9d..17af5efee026c9 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -103,6 +103,8 @@ struct xfs_refcount_intent;
 struct xfs_metadir_update;
 struct xfs_rtgroup;
 struct xfs_open_zone;
+struct xfs_healthmon_event;
+struct xfs_health_update_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -5908,6 +5910,175 @@ DEFINE_EVENT(xfs_freeblocks_resv_class, name, \
 DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_reserved);
 DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_enospc);
 
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+TRACE_EVENT(xfs_healthmon_lost_event,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long long lost_prev),
+	TP_ARGS(mp, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d lost_prev %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->lost_prev)
+);
+
+#define XFS_HEALTHMON_FLAGS_STRINGS \
+	{ XFS_HEALTH_MONITOR_VERBOSE,	"verbose" }
+#define XFS_HEALTHMON_FMT_STRINGS \
+	{ XFS_HEALTH_MONITOR_FMT_JSON,	"json" }, \
+	{ XFS_HEALTH_MONITOR_FMT_CSTRUCT,	"cstruct" }
+
+TRACE_EVENT(xfs_healthmon_create,
+	TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format),
+	TP_ARGS(mp, flags, format),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, flags)
+		__field(u8, format)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->flags = flags;
+		__entry->format = format;
+	),
+	TP_printk("dev %d:%d flags %s format %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS),
+		  __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS))
+);
+
+TRACE_EVENT(xfs_healthmon_copybuf,
+	TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov,
+		 const struct seq_buf *seqbuf, size_t outpos),
+	TP_ARGS(mp, iov, seqbuf, outpos),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(size_t, seqbuf_size)
+		__field(size_t, seqbuf_len)
+		__field(size_t, outpos)
+		__field(size_t, to_copy)
+		__field(size_t, iter_count)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->seqbuf_size = seqbuf->size;
+		__entry->seqbuf_len = seqbuf->len;
+		__entry->outpos = outpos;
+		__entry->to_copy = seqbuf->len - outpos;
+		__entry->iter_count = iov_iter_count(iov);
+	),
+	TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->seqbuf_size,
+		  __entry->seqbuf_len,
+		  __entry->outpos,
+		  __entry->to_copy,
+		  __entry->iter_count)
+);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_class,
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events,
+		 unsigned long long lost_prev),
+	TP_ARGS(mp, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#define DEFINE_HEALTHMON_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, \
+		 unsigned long long lost_prev), \
+	TP_ARGS(mp, events, lost_prev))
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+
+#define XFS_HEALTHMON_TYPE_STRINGS \
+	{ XFS_HEALTHMON_LOST,		"lost" }
+
+#define XFS_HEALTHMON_DOMAIN_STRINGS \
+	{ XFS_HEALTHMON_MOUNT,		"mount" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
+	TP_ARGS(mp, event),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned long long, offset)
+		__field(unsigned long long, length)
+		__field(unsigned long long, lostcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = event->type;
+		__entry->domain = event->domain;
+		__entry->mask = 0;
+		__entry->group = 0;
+		__entry->ino = 0;
+		__entry->gen = 0;
+		__entry->offset = 0;
+		__entry->length = 0;
+		__entry->lostcount = 0;
+		switch (__entry->domain) {
+		case XFS_HEALTHMON_MOUNT:
+			switch (__entry->type) {
+			case XFS_HEALTHMON_LOST:
+				__entry->lostcount = event->lostcount;
+				break;
+			}
+			break;
+		}
+	),
+	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
+		  __entry->mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->offset,
+		  __entry->length,
+		  __entry->group,
+		  __entry->lostcount)
+);
+#define DEFINE_HEALTHMONEVENT_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_event_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
+	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
new file mode 100644
index 00000000000000..68762738b04191
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -0,0 +1,129 @@
+{
+	"$comment": [
+		"SPDX-License-Identifier: GPL-2.0-or-later",
+		"Copyright (c) 2024-2025 Oracle.  All Rights Reserved.",
+		"Author: Darrick J. Wong <djwong@kernel.org>",
+		"",
+		"This schema file describes the format of the json objects",
+		"readable from the fd returned by the XFS_IOC_HEALTHMON",
+		"ioctl."
+	],
+
+	"$schema": "https://json-schema.org/draft/2020-12/schema",
+	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
+
+	"title": "XFS Health Monitoring Events",
+
+	"$comment": "Events must be one of the following types:",
+	"oneOf": [
+		{
+			"$ref": "#/$events/running"
+		},
+		{
+			"$ref": "#/$events/unmount"
+		},
+		{
+			"$ref": "#/$events/lost"
+		}
+	],
+
+	"$comment": "Simple data types are defined here.",
+	"$defs": {
+		"time_ns": {
+			"title": "Time of Event",
+			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
+			"type": "integer"
+		},
+		"count": {
+			"title": "Count of events",
+			"description": "Number of events.",
+			"type": "integer",
+			"minimum": 1
+		}
+	},
+
+	"$comment": "Event types are defined here.",
+	"$events": {
+		"running": {
+			"title": "Health Monitoring Running",
+			"$comment": [
+				"The health monitor is actually running."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "running"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		},
+		"unmount": {
+			"title": "Filesystem Unmounted",
+			"$comment": [
+				"The filesystem was unmounted."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "unmount"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain"
+			]
+		},
+		"lost": {
+			"title": "Health Monitoring Events Lost",
+			"$comment": [
+				"Previous health monitoring events were",
+				"dropped due to memory allocation failures",
+				"or queue limits."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "lost"
+				},
+				"count": {
+					"$ref": "#/$defs/count"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				}
+			},
+
+			"required": [
+				"type",
+				"count",
+				"time_ns",
+				"domain"
+			]
+		}
+	}
+}
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 7b0d9f78b0a402..d5ca6ef8015c0e 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -40,12 +40,558 @@
  * so that the queueing and processing of the events do not pin the mount and
  * cannot slow down the main filesystem.  The healthmon object can exist past
  * the end of the filesystem mount.
+ *
+ * Please see the xfs_healthmon.schema.json file for a description of the
+ * format of the json events that are conveyed to userspace.
  */
 
+/* Allow this many events to build up in memory per healthmon fd. */
+#define XFS_HEALTHMON_MAX_EVENTS \
+		(32768 / sizeof(struct xfs_healthmon_event))
+
+struct flag_string {
+	unsigned int	mask;
+	const char	*str;
+};
+
 struct xfs_healthmon {
+	/* lock for mp and eventlist */
+	struct mutex			lock;
+
+	/* waiter for signalling the arrival of events */
+	struct wait_queue_head		wait;
+
+	/* list of event objects */
+	struct xfs_healthmon_event	*first_event;
+	struct xfs_healthmon_event	*last_event;
+
 	struct xfs_mount		*mp;
+
+	/* number of events */
+	unsigned int			events;
+
+	/*
+	 * Buffer for formatting events.  New buffer data are appended to the
+	 * end of the seqbuf, and outpos is used to determine where to start
+	 * the next copy_to_iter().  Both are protected by the inode lock.
+	 */
+	struct seq_buf			outbuf;
+	size_t				outpos;
+
+	/* XFS_HEALTH_MONITOR_FMT_* */
+	uint8_t				format;
+
+	/* do we want all events? */
+	bool				verbose;
+
+	/* did we lose previous events? */
+	unsigned long long		lost_prev_event;
+
+	/* total counts of events observed and lost events */
+	unsigned long long		total_events;
+	unsigned long long		total_lost;
 };
 
+static inline void xfs_healthmon_bump_events(struct xfs_healthmon *hm)
+{
+	hm->events++;
+	hm->total_events++;
+}
+
+static inline void xfs_healthmon_bump_lost(struct xfs_healthmon *hm)
+{
+	hm->lost_prev_event++;
+	hm->total_lost++;
+}
+
+/* Remove an event from the head of the list. */
+static inline int
+xfs_healthmon_free_head(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	struct xfs_healthmon_event	*head;
+
+	mutex_lock(&hm->lock);
+	head = hm->first_event;
+	if (head != event) {
+		ASSERT(hm->first_event == event);
+		mutex_unlock(&hm->lock);
+		return -EFSCORRUPTED;
+	}
+
+	if (hm->last_event == head)
+		hm->last_event = NULL;
+	hm->first_event = head->next;
+	hm->events--;
+	mutex_unlock(&hm->lock);
+
+	trace_xfs_healthmon_pop(hm->mp, head);
+	kfree(event);
+	return 0;
+}
+
+/* Push an event onto the end of the list. */
+static inline void
+__xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	if (!hm->first_event)
+		hm->first_event = event;
+	if (hm->last_event)
+		hm->last_event->next = event;
+	hm->last_event = event;
+	event->next = NULL;
+	xfs_healthmon_bump_events(hm);
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_push(hm->mp, event);
+}
+
+/* Push an event onto the end of the list if we're not full. */
+static inline int
+xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		xfs_healthmon_bump_lost(hm);
+		return -ENOMEM;
+	}
+
+	__xfs_healthmon_push(hm, event);
+	return 0;
+}
+
+/* Create a new event or record that we failed. */
+static struct xfs_healthmon_event *
+xfs_healthmon_alloc(
+	struct xfs_healthmon		*hm,
+	enum xfs_healthmon_type		type,
+	enum xfs_healthmon_domain	domain)
+{
+	struct timespec64		now;
+	struct xfs_healthmon_event	*event;
+
+	event = kzalloc(sizeof(*event), GFP_NOFS);
+	if (!event) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		xfs_healthmon_bump_lost(hm);
+		return NULL;
+	}
+
+	event->type = type;
+	event->domain = domain;
+	ktime_get_coarse_real_ts64(&now);
+	event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
+
+	return event;
+}
+
+/*
+ * Before we accept an event notification from a live update hook, we need to
+ * clear out any previously lost events.
+ */
+static inline int
+xfs_healthmon_start_live_update(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event;
+
+	/* If the queue is already full.... */
+	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		if (hm->last_event &&
+		    hm->last_event->type == XFS_HEALTHMON_LOST) {
+			/*
+			 * ...and the last event notes lost events, then add
+			 * the number of events we already lost, plus one for
+			 * this event that we're about to lose.
+			 */
+			hm->last_event->lostcount += hm->lost_prev_event + 1;
+			hm->lost_prev_event = 0;
+		} else {
+			/*
+			 * ...try to create a new lost event.  Add the number
+			 * of events we previously lost, plus one for this
+			 * event.
+			 */
+			event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+					XFS_HEALTHMON_MOUNT);
+			if (!event) {
+				xfs_healthmon_bump_lost(hm);
+				return -ENOMEM;
+			}
+			event->lostcount = hm->lost_prev_event + 1;
+			hm->lost_prev_event = 0;
+
+			__xfs_healthmon_push(hm, event);
+		}
+
+		return -ENOSPC;
+	}
+
+	/* If we lost an event in the past, but the queue isn't yet full... */
+	if (hm->lost_prev_event) {
+		/*
+		 * ...try to create a new lost event.  Add the number of events
+		 * we previously lost, plus one for this event.
+		 */
+		event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+				XFS_HEALTHMON_MOUNT);
+		if (!event) {
+			xfs_healthmon_bump_lost(hm);
+			return -ENOMEM;
+		}
+		event->lostcount = hm->lost_prev_event;
+		hm->lost_prev_event = 0;
+
+		/*
+		 * If adding this lost event pushes us over the limit, we're
+		 * going to lose the current event.  Note that in the lost
+		 * event count too.
+		 */
+		if (hm->events == XFS_HEALTHMON_MAX_EVENTS - 1)
+			event->lostcount++;
+
+		__xfs_healthmon_push(hm, event);
+		if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+			trace_xfs_healthmon_lost_event(hm->mp,
+					hm->lost_prev_event);
+			return -ENOSPC;
+		}
+	}
+
+	/*
+	 * The queue is not full and it is not currently the case that events
+	 * were lost.
+	 */
+	return 0;
+}
+
+/* Render the health update type as a string. */
+STATIC const char *
+xfs_healthmon_typestring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *type_strings[] = {
+		[XFS_HEALTHMON_RUNNING]		= "running",
+		[XFS_HEALTHMON_LOST]		= "lost",
+	};
+
+	if (event->type >= ARRAY_SIZE(type_strings))
+		return "?";
+
+	return type_strings[event->type];
+}
+
+/* Render the health domain as a string. */
+STATIC const char *
+xfs_healthmon_domstring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *dom_strings[] = {
+		[XFS_HEALTHMON_MOUNT]		= "mount",
+	};
+
+	if (event->domain >= ARRAY_SIZE(dom_strings))
+		return "?";
+
+	return dom_strings[event->domain];
+}
+
+/* Convert a flags bitmap into a jsonable string. */
+static inline int
+xfs_healthmon_format_flags(
+	struct seq_buf			*outbuf,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			flags)
+{
+	const struct flag_string	*p;
+	ssize_t				ret;
+	unsigned int			i;
+	bool				first = true;
+
+	for (i = 0, p = strings; i < nr_strings; i++, p++) {
+		if (!(p->mask & flags))
+			continue;
+
+		ret = seq_buf_printf(outbuf, "%s\"%s\"",
+				first ? "" : ", ", p->str);
+		if (ret < 0)
+			return ret;
+
+		first = false;
+		flags &= ~p->mask;
+	}
+
+	for (i = 0; flags != 0 && i < sizeof(flags) * NBBY; i++) {
+		if (!(flags & (1U << i)))
+			continue;
+
+		/* json doesn't support hexadecimal notation */
+		ret = seq_buf_printf(outbuf, "%s%u",
+				first ? "" : ", ", (1U << i));
+		if (ret < 0)
+			return ret;
+
+		first = false;
+	}
+
+	return 0;
+}
+
+/* Convert the event mask into a jsonable string. */
+static inline int
+__xfs_healthmon_format_mask(
+	struct seq_buf			*outbuf,
+	const char			*descr,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			mask)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"%s\":  [", descr);
+	if (ret < 0)
+		return ret;
+
+	ret = xfs_healthmon_format_flags(outbuf, strings, nr_strings, mask);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "],\n");
+}
+
+#define xfs_healthmon_format_mask(o, d, s, m) \
+	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
+
+static inline void
+xfs_healthmon_reset_outbuf(
+	struct xfs_healthmon		*hm)
+{
+	hm->outpos = 0;
+	seq_buf_clear(&hm->outbuf);
+}
+
+/* Render lost event mask as a string set */
+static int
+xfs_healthmon_format_lost(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	return seq_buf_printf(outbuf, "  \"count\":      %llu,\n",
+			event->lostcount);
+}
+
+/*
+ * Format an event into json.  Returns 0 if we formatted the event.  If
+ * formatting the event overflows the buffer, returns -1 with the seqbuf len
+ * unchanged.
+ */
+STATIC int
+xfs_healthmon_format_json(
+	struct xfs_healthmon		*hm,
+	const struct xfs_healthmon_event *event)
+{
+	struct seq_buf			*outbuf = &hm->outbuf;
+	size_t				old_seqlen = outbuf->len;
+	int				ret;
+
+	trace_xfs_healthmon_format(hm->mp, event);
+
+	ret = seq_buf_printf(outbuf, "{\n");
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "  \"domain\":     \"%s\",\n",
+			xfs_healthmon_domstring(event));
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "  \"type\":       \"%s\",\n",
+			xfs_healthmon_typestring(event));
+	if (ret < 0)
+		goto overrun;
+
+	switch (event->domain) {
+	case XFS_HEALTHMON_MOUNT:
+		switch (event->type) {
+		case XFS_HEALTHMON_RUNNING:
+			/* nothing to format */
+			break;
+		case XFS_HEALTHMON_LOST:
+			ret = xfs_healthmon_format_lost(outbuf, event);
+			break;
+		default:
+			break;
+		}
+		break;
+	}
+	if (ret < 0)
+		goto overrun;
+
+	/* The last element in the json must not have a trailing comma. */
+	ret = seq_buf_printf(outbuf, "  \"time_ns\":    %llu\n",
+			event->time_ns);
+	if (ret < 0)
+		goto overrun;
+
+	ret = seq_buf_printf(outbuf, "}\n");
+	if (ret < 0)
+		goto overrun;
+
+	ASSERT(!seq_buf_has_overflowed(outbuf));
+	return 0;
+overrun:
+	/*
+	 * We overflowed the buffer and could not format the event.  Reset the
+	 * seqbuf and tell the caller not to delete the event.
+	 */
+	trace_xfs_healthmon_format_overflow(hm->mp, event);
+	outbuf->len = old_seqlen;
+	return -1;
+}
+
+static const unsigned int domain_map[] = {
+	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+};
+
+static const unsigned int type_map[] = {
+	[XFS_HEALTHMON_RUNNING]		= XFS_HEALTH_MONITOR_TYPE_RUNNING,
+	[XFS_HEALTHMON_LOST]		= XFS_HEALTH_MONITOR_TYPE_LOST,
+};
+
+/* Render event as a C structure */
+STATIC int
+xfs_healthmon_format_cstruct(
+	struct xfs_healthmon		*hm,
+	const struct xfs_healthmon_event *event)
+{
+	struct xfs_health_monitor_event	hme = {
+		.time_ns		= event->time_ns,
+	};
+	struct seq_buf			*outbuf = &hm->outbuf;
+	size_t				old_seqlen = outbuf->len;
+	int				ret;
+
+	trace_xfs_healthmon_format(hm->mp, event);
+
+	if (event->domain < 0 || event->domain >= ARRAY_SIZE(domain_map) ||
+	    event->type < 0   || event->type >= ARRAY_SIZE(type_map))
+		return -EFSCORRUPTED;
+
+	hme.domain = domain_map[event->domain];
+	hme.type = type_map[event->type];
+
+	/* fill in the event-specific details */
+	switch (event->domain) {
+	case XFS_HEALTHMON_MOUNT:
+		switch (event->type) {
+		case XFS_HEALTHMON_LOST:
+			hme.e.lost.count = event->lostcount;
+			break;
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+
+	ret = seq_buf_putmem(outbuf, &hme, sizeof(hme));
+	if (ret < 0) {
+		/*
+		 * We overflowed the buffer and could not format the event.
+		 * Reset the seqbuf and tell the caller not to delete the
+		 * event.
+		 */
+		trace_xfs_healthmon_format_overflow(hm->mp, event);
+		outbuf->len = old_seqlen;
+		return -1;
+	}
+
+	ASSERT(!seq_buf_has_overflowed(outbuf));
+	return 0;
+}
+
+/* How many bytes are waiting in the outbuf to be copied? */
+static inline size_t
+xfs_healthmon_outbuf_bytes(
+	struct xfs_healthmon	*hm)
+{
+	unsigned int		used = seq_buf_used(&hm->outbuf);
+
+	if (used > hm->outpos)
+		return used - hm->outpos;
+	return 0;
+}
+
+/*
+ * Do we have something for userspace to do?  This can mean events pending
+ * in the queue or pending bytes in the outbuf.
+ */
+static inline bool
+xfs_healthmon_has_eventdata(
+	struct xfs_healthmon	*hm)
+{
+	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+}
+
+/* Try to copy the rest of the outbuf to the iov iter. */
+STATIC ssize_t
+xfs_healthmon_copybuf(
+	struct xfs_healthmon	*hm,
+	struct iov_iter		*to)
+{
+	size_t			to_copy;
+	size_t			w = 0;
+
+	trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos);
+
+	to_copy = xfs_healthmon_outbuf_bytes(hm);
+	if (to_copy) {
+		w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to);
+		if (!w)
+			return -EFAULT;
+
+		hm->outpos += w;
+	}
+
+	/*
+	 * Nothing left to copy?  Reset the seqbuf pointers and outbuf to the
+	 * start since there's no live data in the buffer.
+	 */
+	if (xfs_healthmon_outbuf_bytes(hm) == 0)
+		xfs_healthmon_reset_outbuf(hm);
+	return w;
+}
+
+/*
+ * See if there's an event waiting for us.  If the fs is no longer mounted,
+ * don't bother sending any more events.
+ */
+static inline struct xfs_healthmon_event *
+xfs_healthmon_peek(
+	struct xfs_healthmon	*hm)
+{
+	struct xfs_healthmon_event *event;
+
+	mutex_lock(&hm->lock);
+	if (hm->mp)
+		event = hm->first_event;
+	else
+		event = NULL;
+	mutex_unlock(&hm->lock);
+	return event;
+}
+
 /*
  * Convey queued event data to userspace.  First copy any remaining bytes in
  * the outbuf, then format the oldest event into the outbuf and copy that too.
@@ -55,7 +601,125 @@ xfs_healthmon_read_iter(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	return -EIO;
+	struct file		*file = iocb->ki_filp;
+	struct inode		*inode = file_inode(file);
+	struct xfs_healthmon	*hm = file->private_data;
+	struct xfs_healthmon_event *event;
+	size_t			copied = 0;
+	ssize_t			ret = 0;
+
+	/* Wait for data to become available */
+	if (!(file->f_flags & O_NONBLOCK)) {
+		ret = wait_event_interruptible(hm->wait,
+				xfs_healthmon_has_eventdata(hm));
+		if (ret)
+			return ret;
+	} else if (!xfs_healthmon_has_eventdata(hm)) {
+		return -EAGAIN;
+	}
+
+	/* Allocate formatting buffer up to 64k if necessary */
+	if (hm->outbuf.size == 0) {
+		void		*outbuf;
+		size_t		bufsize = min(65536, max(PAGE_SIZE,
+							 iov_iter_count(to)));
+
+		outbuf = kzalloc(bufsize, GFP_KERNEL);
+		if (!outbuf) {
+			bufsize = PAGE_SIZE;
+			outbuf = kzalloc(bufsize, GFP_KERNEL);
+			if (!outbuf)
+				return -ENOMEM;
+		}
+
+		inode_lock(inode);
+		if (hm->outbuf.size == 0) {
+			seq_buf_init(&hm->outbuf, outbuf, bufsize);
+			hm->outpos = 0;
+		} else {
+			kfree(outbuf);
+		}
+	} else {
+		inode_lock(inode);
+	}
+
+	trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event);
+
+	/*
+	 * If there's anything left in the seqbuf, copy that before formatting
+	 * more events.
+	 */
+	ret = xfs_healthmon_copybuf(hm, to);
+	if (ret < 0)
+		goto out_unlock;
+	copied += ret;
+
+	while (iov_iter_count(to) > 0) {
+		/* Format the next events into the outbuf until it's full. */
+		while ((event = xfs_healthmon_peek(hm)) != NULL) {
+			switch (hm->format) {
+			case XFS_HEALTH_MONITOR_FMT_JSON:
+				ret = xfs_healthmon_format_json(hm, event);
+				break;
+			case XFS_HEALTH_MONITOR_FMT_CSTRUCT:
+				ret = xfs_healthmon_format_cstruct(hm, event);
+				break;
+			default:
+				ret = -EINVAL;
+				goto out_unlock;
+			}
+			if (ret < 0)
+				break;
+			ret = xfs_healthmon_free_head(hm, event);
+			if (ret)
+				goto out_unlock;
+		}
+
+		/* Copy it to userspace */
+		ret = xfs_healthmon_copybuf(hm, to);
+		if (ret <= 0)
+			break;
+
+		copied += ret;
+	}
+
+out_unlock:
+	trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event);
+	inode_unlock(inode);
+	return copied ?: ret;
+}
+
+/* Poll for available events. */
+STATIC __poll_t
+xfs_healthmon_poll(
+	struct file			*file,
+	struct poll_table_struct	*wait)
+{
+	struct xfs_healthmon		*hm = file->private_data;
+	__poll_t			mask = 0;
+
+	poll_wait(file, &hm->wait, wait);
+
+	if (xfs_healthmon_has_eventdata(hm))
+		mask |= EPOLLIN;
+	return mask;
+}
+
+/* Free all events */
+STATIC void
+xfs_healthmon_free_events(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event, *next;
+
+	event = hm->first_event;
+	while (event != NULL) {
+		trace_xfs_healthmon_drop(hm->mp, event);
+		next = event->next;
+		kfree(event);
+		event = next;
+	}
+	hm->first_event = hm->last_event = NULL;
 }
 
 /* Free the health monitoring information. */
@@ -66,6 +730,14 @@ xfs_healthmon_release(
 {
 	struct xfs_healthmon	*hm = file->private_data;
 
+	trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event);
+
+	wake_up_all(&hm->wait);
+
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
+	if (hm->outbuf.size)
+		kfree(hm->outbuf.buffer);
 	kfree(hm);
 
 	return 0;
@@ -76,9 +748,10 @@ static inline bool
 xfs_healthmon_validate(
 	const struct xfs_health_monitor	*hmo)
 {
-	if (hmo->flags)
+	if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
 		return false;
-	if (hmo->format)
+	if (hmo->format != XFS_HEALTH_MONITOR_FMT_JSON &&
+	    hmo->format != XFS_HEALTH_MONITOR_FMT_CSTRUCT)
 		return false;
 	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
 		return false;
@@ -89,6 +762,19 @@ xfs_healthmon_validate(
 
 /* Emit some data about the health monitoring fd. */
 #ifdef CONFIG_PROC_FS
+static const char *
+xfs_healthmon_format_string(const struct xfs_healthmon *hm)
+{
+	switch (hm->format) {
+	case XFS_HEALTH_MONITOR_FMT_JSON:
+		return "json";
+	case XFS_HEALTH_MONITOR_FMT_CSTRUCT:
+		return "cstruct";
+	}
+
+	return "";
+}
+
 static void
 xfs_healthmon_show_fdinfo(
 	struct seq_file		*m,
@@ -96,8 +782,13 @@ xfs_healthmon_show_fdinfo(
 {
 	struct xfs_healthmon	*hm = file->private_data;
 
-	seq_printf(m, "state:\talive\ndev:\t%s\n",
-			hm->mp->m_super->s_id);
+	mutex_lock(&hm->lock);
+	seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
+			hm->mp->m_super->s_id,
+			xfs_healthmon_format_string(hm),
+			hm->total_events,
+			hm->total_lost);
+	mutex_unlock(&hm->lock);
 }
 #endif
 
@@ -107,6 +798,7 @@ static const struct file_operations xfs_healthmon_fops = {
 	.show_fdinfo	= xfs_healthmon_show_fdinfo,
 #endif
 	.read_iter	= xfs_healthmon_read_iter,
+	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
 };
 
@@ -121,6 +813,7 @@ xfs_ioc_health_monitor(
 {
 	struct xfs_health_monitor	hmo;
 	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
 	int				fd;
 	int				ret;
 
@@ -137,6 +830,23 @@ xfs_ioc_health_monitor(
 	if (!hm)
 		return -ENOMEM;
 	hm->mp = mp;
+	hm->format = hmo.format;
+
+	seq_buf_init(&hm->outbuf, NULL, 0);
+	mutex_init(&hm->lock);
+	init_waitqueue_head(&hm->wait);
+
+	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
+		hm->verbose = true;
+
+	/* Queue up the first event that lets the client know we're running. */
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
+			XFS_HEALTHMON_MOUNT);
+	if (!event) {
+		ret = -ENOMEM;
+		goto out_mutex;
+	}
+	__xfs_healthmon_push(hm, event);
 
 	/*
 	 * Create the anonymous file.  If it succeeds, the file owns hm and
@@ -146,12 +856,16 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_hm;
+		goto out_mutex;
 	}
 
+	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
+
 	return fd;
 
-out_hm:
+out_mutex:
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
 	kfree(hm);
 	return ret;
 }
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index a60556dbd172ee..d42b864a3837a2 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -51,6 +51,8 @@
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_zone_priv.h"
+#include "xfs_health.h"
+#include "xfs_healthmon.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index f3f3436d60a940..f6a1fb46a1d6c9 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -245,6 +245,7 @@ int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len)
 	seq_buf_set_overflow(s);
 	return -1;
 }
+EXPORT_SYMBOL_GPL(seq_buf_putmem);
 
 #define MAX_MEMHEX_BYTES	8U
 #define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)


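For consumers of the binary CSTRUCT format, each read(2) on the healthmon fd
returns whole fixed-size records, since xfs_healthmon_format_cstruct() only
appends complete xfs_health_monitor_event structures to the seqbuf.  A decoder
in the style of the prototype Python daemon might split a read buffer along
these lines -- a minimal sketch, where EVENT_FMT and the field order/padding
are illustrative assumptions, not the authoritative layout from xfs_fs.h:

```python
import struct

# Assumed values mirroring XFS_HEALTH_MONITOR_DOMAIN_MOUNT and
# XFS_HEALTH_MONITOR_TYPE_LOST; illustrative only.
DOMAIN_MOUNT = 0
TYPE_LOST = 1

# Assumed record layout: u32 domain, u32 type, u64 time_ns, u64 union payload.
EVENT_FMT = "<IIQQ"
EVENT_SIZE = struct.calcsize(EVENT_FMT)

def decode_event(buf):
    """Decode one fixed-size event record into a dict."""
    domain, etype, time_ns, payload = struct.unpack_from(EVENT_FMT, buf)
    event = {"domain": domain, "type": etype, "time_ns": time_ns}
    if domain == DOMAIN_MOUNT and etype == TYPE_LOST:
        # For lost events the union carries the dropped-event count.
        event["lost_count"] = payload
    return event

def decode_stream(data):
    """Split one read(2) result into whole records; the kernel only
    returns complete events in CSTRUCT mode."""
    for off in range(0, len(data) - EVENT_SIZE + 1, EVENT_SIZE):
        yield decode_event(data[off:off + EVENT_SIZE])

record = struct.pack(EVENT_FMT, DOMAIN_MOUNT, TYPE_LOST,
                     1729641600000000000, 3)
print(next(decode_stream(record)))
# → {'domain': 0, 'type': 1, 'time_ns': 1729641600000000000, 'lost_count': 3}
```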

* [PATCH 12/19] xfs: report metadata health events through healthmon
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-10-23  0:03   ` [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2025-10-23  0:03   ` Darrick J. Wong
  2025-10-23  0:04   ` [PATCH 13/19] xfs: report shutdown " Darrick J. Wong
From: Darrick J. Wong @ 2025-10-23  0:03 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
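For reference, a metadata health event in the JSON format would render roughly
as the sample below.  This is an illustrative sketch following the formatting
conventions established in the previous patch; the exact property set for
sick/corrupt/healthy events is defined by the schema additions in this patch,
and the flag names shown in the mask array are assumptions:

```
{
  "domain":     "ag",
  "type":       "sick",
  "mask":       ["bnobt", "cntbt"],
  "group":      2,
  "time_ns":    1761178980000000000
}
```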
 fs/xfs/libxfs/xfs_fs.h                  |   38 +++
 fs/xfs/libxfs/xfs_health.h              |    5 
 fs/xfs/xfs_healthmon.h                  |   31 ++
 fs/xfs/xfs_trace.h                      |   98 ++++++-
 fs/xfs/libxfs/xfs_healthmon.schema.json |  315 +++++++++++++++++++++
 fs/xfs/xfs_health.c                     |   67 ++++
 fs/xfs/xfs_healthmon.c                  |  465 +++++++++++++++++++++++++++++++
 7 files changed, 1010 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 4b642eea18b5ca..358abe98776d69 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1008,17 +1008,52 @@ struct xfs_rtgroup_geometry {
 /* affects the whole fs */
 #define XFS_HEALTH_MONITOR_DOMAIN_MOUNT		(0)
 
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FS		(1)
+#define XFS_HEALTH_MONITOR_DOMAIN_AG		(2)
+#define XFS_HEALTH_MONITOR_DOMAIN_INODE		(3)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP	(4)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
 #define XFS_HEALTH_MONITOR_TYPE_RUNNING		(0)
 #define XFS_HEALTH_MONITOR_TYPE_LOST		(1)
 
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_TYPE_SICK		(2)
+#define XFS_HEALTH_MONITOR_TYPE_CORRUPT		(3)
+#define XFS_HEALTH_MONITOR_TYPE_HEALTHY		(4)
+
+/* filesystem was unmounted */
+#define XFS_HEALTH_MONITOR_TYPE_UNMOUNT		(5)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
 };
 
+/* fs/rt metadata */
+struct xfs_health_monitor_fs {
+	/* XFS_FSOP_GEOM_SICK_* flags */
+	__u32	mask;
+};
+
+/* ag/rtgroup metadata */
+struct xfs_health_monitor_group {
+	/* XFS_{AG,RTGROUP}_SICK_* flags */
+	__u32	mask;
+	__u32	gno;
+};
+
+/* inode metadata */
+struct xfs_health_monitor_inode {
+	/* XFS_BS_SICK_* flags */
+	__u32	mask;
+	__u32	gen;
+	__u64	ino;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1036,6 +1071,9 @@ struct xfs_health_monitor_event {
 	 */
 	union {
 		struct xfs_health_monitor_lost lost;
+		struct xfs_health_monitor_fs fs;
+		struct xfs_health_monitor_group group;
+		struct xfs_health_monitor_inode inode;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 39fef33dedc6a8..9ff3bf8ba4ed8f 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -336,4 +336,9 @@ void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
 void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+unsigned int xfs_healthmon_inode_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_rtgroup_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_perag_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_fs_mask(unsigned int sick_mask);
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index ea2d6a327dfb16..3f3ba16d5af56a 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -9,10 +9,23 @@
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_RUNNING,	/* monitor running */
 	XFS_HEALTHMON_LOST,	/* message lost */
+	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
+
+	/* metadata health events */
+	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
+	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
+
 };
 
 enum xfs_healthmon_domain {
 	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+
+	/* metadata health events */
+	XFS_HEALTHMON_FS,	/* main filesystem metadata */
+	XFS_HEALTHMON_AG,	/* allocation group metadata */
+	XFS_HEALTHMON_INODE,	/* inode metadata */
+	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
 };
 
 struct xfs_healthmon_event {
@@ -32,6 +45,24 @@ struct xfs_healthmon_event {
 		struct {
 			unsigned int	flags;
 		};
+		/* fs/rt metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	fsmask;
+		};
+		/* ag/rtgroup metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	grpmask;
+			unsigned int	group;
+		};
+		/* inode metadata */
+		struct {
+			/* XFS_SICK_INO_* flags */
+			unsigned int	imask;
+			uint32_t	gen;
+			xfs_ino_t	ino;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 17af5efee026c9..df09c225e13c2e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6011,14 +6011,30 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
 #define XFS_HEALTHMON_TYPE_STRINGS \
-	{ XFS_HEALTHMON_LOST,		"lost" }
+	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHMON_SICK,		"sick" }, \
+	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHMON_HEALTHY,	"healthy" }
 
 #define XFS_HEALTHMON_DOMAIN_STRINGS \
-	{ XFS_HEALTHMON_MOUNT,		"mount" }
+	{ XFS_HEALTHMON_MOUNT,		"mount" }, \
+	{ XFS_HEALTHMON_FS,		"fs" }, \
+	{ XFS_HEALTHMON_AG,		"ag" }, \
+	{ XFS_HEALTHMON_INODE,		"inode" }, \
+	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
 
 DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6054,6 +6070,19 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 				break;
 			}
 			break;
+		case XFS_HEALTHMON_FS:
+			__entry->mask = event->fsmask;
+			break;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			__entry->mask = event->grpmask;
+			__entry->group = event->group;
+			break;
+		case XFS_HEALTHMON_INODE:
+			__entry->mask = event->imask;
+			__entry->ino = event->ino;
+			__entry->gen = event->gen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6072,11 +6101,76 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 DEFINE_EVENT(xfs_healthmon_event_class, name, \
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
 	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_insert);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+
+#define XFS_HEALTHUP_TYPE_STRINGS \
+	{ XFS_HEALTHUP_UNMOUNT,		"unmount" }, \
+	{ XFS_HEALTHUP_SICK,		"sick" }, \
+	{ XFS_HEALTHUP_CORRUPT,		"corrupt" }, \
+	{ XFS_HEALTHUP_HEALTHY,		"healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+	{ XFS_HEALTHUP_FS,		"fs" }, \
+	{ XFS_HEALTHUP_AG,		"ag" }, \
+	{ XFS_HEALTHUP_INODE,		"inode" }, \
+	{ XFS_HEALTHUP_RTGROUP,		"rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+		 const struct xfs_health_update_params *update,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, type, update, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, old_mask)
+		__field(unsigned int, new_mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = type;
+		__entry->domain = update->domain;
+		__entry->old_mask = update->old_mask;
+		__entry->new_mask = update->new_mask;
+		__entry->ino = update->ino;
+		__entry->gen = update->gen;
+		__entry->group = update->group;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+		  __entry->old_mask,
+		  __entry->new_mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 68762738b04191..dd78f1b71d587b 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -24,6 +24,18 @@
 		},
 		{
 			"$ref": "#/$events/lost"
+		},
+		{
+			"$ref": "#/$events/fs_metadata"
+		},
+		{
+			"$ref": "#/$events/rtgroup_metadata"
+		},
+		{
+			"$ref": "#/$events/perag_metadata"
+		},
+		{
+			"$ref": "#/$events/inode_metadata"
 		}
 	],
 
@@ -39,6 +51,156 @@
 			"description": "Number of events.",
 			"type": "integer",
 			"minimum": 1
+		},
+		"xfs_agnumber_t": {
+			"description": "Allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_rgnumber_t": {
+			"description": "Realtime allocation group number",
+			"type": "integer",
+			"minimum": 0,
+			"maximum": 2147483647
+		},
+		"xfs_ino_t": {
+			"description": "Inode number",
+			"type": "integer",
+			"minimum": 1
+		},
+		"i_generation": {
+			"description": "Inode generation number",
+			"type": "integer"
+		}
+	},
+
+	"$comment": "Filesystem metadata event data are defined here.",
+	"$metadata": {
+		"status": {
+			"description": "Metadata health status",
+			"$comment": [
+				"One of:",
+				"",
+				" * sick:    metadata corruption discovered",
+				"            during a runtime operation.",
+				" * corrupt: corruption discovered during",
+				"            an xfs_scrub run.",
+				" * healthy: metadata object was found to be",
+				"            ok by xfs_scrub."
+			],
+			"enum": [
+				"sick",
+				"corrupt",
+				"healthy"
+			]
+		},
+		"fs": {
+			"description": [
+				"Metadata structures that affect the entire",
+				"filesystem.  Options include:",
+				"",
+				" * fscounters: summary counters",
+				" * usrquota:   user quota records",
+				" * grpquota:   group quota records",
+				" * prjquota:   project quota records",
+				" * quotacheck: quota counters",
+				" * nlinks:     file link counts",
+				" * metadir:    metadata directory",
+				" * metapath:   metadata inode paths"
+			],
+			"enum": [
+				"fscounters",
+				"grpquota",
+				"metadir",
+				"metapath",
+				"nlinks",
+				"prjquota",
+				"quotacheck",
+				"usrquota"
+			]
+		},
+		"perag": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the data device.  Options include:",
+				"",
+				" * agf:        group space header",
+				" * agfl:       per-group free block list",
+				" * agi:        group inode header",
+				" * bnobt:      free space by position btree",
+				" * cntbt:      free space by length btree",
+				" * finobt:     free inode btree",
+				" * inobt:      inode btree",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * inodes:     problems were recorded for",
+				"               this group's inodes, but the",
+				"               inodes themselves had to be",
+				"               reclaimed.",
+				" * super:      superblock"
+			],
+			"enum": [
+				"agf",
+				"agfl",
+				"agi",
+				"bnobt",
+				"cntbt",
+				"finobt",
+				"inobt",
+				"inodes",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"rtgroup": {
+			"description": [
+				"Metadata structures owned by allocation",
+				"groups on the realtime volume.  Options",
+				"include:",
+				"",
+				" * bitmap:     free space bitmap contents",
+				"               for this group",
+				" * summary:    realtime free space summary file",
+				" * rmapbt:     reverse mapping btree",
+				" * refcountbt: reference count btree",
+				" * super:      group superblock"
+			],
+			"enum": [
+				"bitmap",
+				"summary",
+				"refcountbt",
+				"rmapbt",
+				"super"
+			]
+		},
+		"inode": {
+			"description": [
+				"Metadata structures owned by file inodes.",
+				"Options include:",
+				"",
+				" * bmapbta:    attr fork",
+				" * bmapbtc:    cow fork",
+				" * bmapbtd:    data fork",
+				" * core:       inode record",
+				" * directory:  directory entries",
+				" * dirtree:    directory tree problems detected",
+				" * parent:     directory parent pointer",
+				" * symlink:    symbolic link target",
+				" * xattr:      extended attributes"
+			],
+			"enum": [
+				"bmapbta",
+				"bmapbtc",
+				"bmapbtd",
+				"core",
+				"directory",
+				"dirtree",
+				"parent",
+				"symlink",
+				"xattr"
+			]
 		}
 	},
 
@@ -124,6 +286,159 @@
 				"time_ns",
 				"domain"
 			]
+		},
+		"fs_metadata": {
+			"title": "Filesystem-wide metadata event",
+			"description": [
+				"Health status updates for filesystem-wide",
+				"metadata objects."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "fs"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/fs"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"structures"
+			]
+		},
+		"perag_metadata": {
+			"title": "Data device allocation group metadata event",
+			"description": [
				"Health status updates for data device",
+				"allocation group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "perag"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_agnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/perag"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"rtgroup_metadata": {
+			"title": "Realtime allocation group metadata event",
+			"description": [
+				"Health status updates for realtime allocation",
+				"group metadata."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "rtgroup"
+				},
+				"group": {
+					"$ref": "#/$defs/xfs_rgnumber_t"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/rtgroup"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"group",
+				"structures"
+			]
+		},
+		"inode_metadata": {
+			"title": "Inode metadata event",
+			"description": [
+				"Health status updates for inode metadata.",
+				"The inode and generation number describe the",
+				"file that is affected by the change."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$metadata/status"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "inode"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"structures": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$metadata/inode"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"structures"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index abf9460ae79953..70e1b098c8b449 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -607,6 +607,25 @@ xfs_fsop_geom_health(
 	}
 }
 
+/*
+ * Translate XFS_SICK_FS_* into XFS_FSOP_GEOM_SICK_* except for the rt free
+ * space codes, which are sent via the rtgroup events.
+ */
+unsigned int
+xfs_healthmon_fs_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(fs_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map ag_map[] = {
 	{ XFS_SICK_AG_SB,	XFS_AG_GEOM_SICK_SB },
 	{ XFS_SICK_AG_AGF,	XFS_AG_GEOM_SICK_AGF },
@@ -643,6 +662,22 @@ xfs_ag_geom_health(
 	}
 }
 
+/* Translate XFS_SICK_AG_* into XFS_AG_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_perag_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(ag_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map rtgroup_map[] = {
 	{ XFS_SICK_RG_SUPER,	XFS_RTGROUP_GEOM_SICK_SUPER },
 	{ XFS_SICK_RG_BITMAP,	XFS_RTGROUP_GEOM_SICK_BITMAP },
@@ -673,6 +708,22 @@ xfs_rtgroup_geom_health(
 	}
 }
 
+/* Translate XFS_SICK_RG_* into XFS_RTGROUP_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_rtgroup_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(rtgroup_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map ino_map[] = {
 	{ XFS_SICK_INO_CORE,	XFS_BS_SICK_INODE },
 	{ XFS_SICK_INO_BMBTD,	XFS_BS_SICK_BMBTD },
@@ -711,6 +762,22 @@ xfs_bulkstat_health(
 	}
 }
 
+/* Translate XFS_SICK_INO_* into XFS_BS_SICK_*. */
+unsigned int
+xfs_healthmon_inode_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(ino_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 /* Mark a block mapping sick. */
 void
 xfs_bmap_mark_sick(
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d5ca6ef8015c0e..05c67fe40f2bac 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -18,6 +18,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
+#include "xfs_health.h"
 #include "xfs_healthmon.h"
 
 #include <linux/anon_inodes.h>
@@ -65,8 +66,15 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*first_event;
 	struct xfs_healthmon_event	*last_event;
 
+	/* live update hooks */
+	struct xfs_health_hook		hhook;
+
+	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
 
+	/* filesystem type for safe cleanup of hooks; requires module_get */
+	struct file_system_type		*fstyp;
+
 	/* number of events */
 	unsigned int			events;
 
@@ -131,6 +139,23 @@ xfs_healthmon_free_head(
 	return 0;
 }
 
+/* Insert an event onto the start of the list. */
+static inline void
+__xfs_healthmon_insert(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	event->next = hm->first_event;
+	hm->first_event = event;
+	if (!hm->last_event)
+		hm->last_event = event;
+	xfs_healthmon_bump_events(hm);
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_insert(hm->mp, event);
+}
+
 /* Push an event onto the end of the list. */
 static inline void
 __xfs_healthmon_push(
@@ -202,6 +227,10 @@ xfs_healthmon_start_live_update(
 {
 	struct xfs_healthmon_event	*event;
 
+	/* Filesystem already unmounted; tell the caller to stop sending. */
+	if (!hm->mp)
+		return -ESHUTDOWN;
+
 	/* If the queue is already full.... */
 	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
 		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
@@ -274,6 +303,185 @@ xfs_healthmon_start_live_update(
 	return 0;
 }
 
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+	struct xfs_healthmon			*hm,
+	enum xfs_health_update_type		type,
+	const struct xfs_health_update_params	*hup,
+	unsigned int				*mask)
+{
+	/* Always report unmounts. */
+	if (type == XFS_HEALTHUP_UNMOUNT)
+		return true;
+
+	/* If we want all events, return all events. */
+	if (hm->verbose) {
+		*mask = hup->new_mask;
+		return true;
+	}
+
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		/* Always report runtime corruptions */
+		*mask = hup->new_mask;
+		break;
+	case XFS_HEALTHUP_CORRUPT:
+		/* Only report new fsck errors */
+		*mask = hup->new_mask & ~hup->old_mask;
+		break;
+	case XFS_HEALTHUP_HEALTHY:
+		/* Only report healthy metadata that got fixed */
+		*mask = hup->new_mask & hup->old_mask;
+		break;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* This is here for static enum checking */
+		break;
+	}
+
+	/* If not in verbose mode, mask state has to change. */
+	return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+	enum xfs_health_update_type	type)
+{
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		return XFS_HEALTHMON_SICK;
+	case XFS_HEALTHUP_CORRUPT:
+		return XFS_HEALTHMON_CORRUPT;
+	case XFS_HEALTHUP_HEALTHY:
+		return XFS_HEALTHMON_HEALTHY;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+	enum xfs_health_update_domain	domain)
+{
+	switch (domain) {
+	case XFS_HEALTHUP_FS:
+		return XFS_HEALTHMON_FS;
+	case XFS_HEALTHUP_AG:
+		return XFS_HEALTHMON_AG;
+	case XFS_HEALTHUP_RTGROUP:
+		return XFS_HEALTHMON_RTGROUP;
+	case XFS_HEALTHUP_INODE:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_health_update_params	*hup = data;
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	enum xfs_health_update_type	type = action;
+	unsigned int			mask = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+	/* Decode event mask and skip events we don't care about. */
+	if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+		return NOTIFY_DONE;
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (type == XFS_HEALTHUP_UNMOUNT) {
+		/*
+		 * The filesystem is unmounting, so we must detach from the
+		 * mount.  After this point, the healthmon thread has no
+		 * connection to the mounted filesystem and must not touch its
+		 * hooks.
+		 */
+		trace_xfs_healthmon_unmount(hm->mp, hm->events,
+				hm->lost_prev_event);
+
+		hm->mp = NULL;
+
+		/*
+		 * Try to add an unmount message to the head of the list so
+		 * that userspace will notice the unmount.  If we can't add
+		 * the event, wake up the reader directly.
+		 */
+		event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_UNMOUNT,
+				XFS_HEALTHMON_MOUNT);
+		if (event)
+			__xfs_healthmon_insert(hm, event);
+		else
+			wake_up(&hm->wait);
+
+		goto out_unlock;
+	}
+
+	event = xfs_healthmon_alloc(hm, health_update_to_type(type),
+			  health_update_to_domain(hup->domain));
+	if (!event)
+		goto out_unlock;
+
+	/* Ignore the event if it's only reporting a secondary health state. */
+	switch (event->domain) {
+	case XFS_HEALTHMON_FS:
+		event->fsmask = mask & ~XFS_SICK_FS_SECONDARY;
+		if (!event->fsmask)
+			goto out_event;
+		break;
+	case XFS_HEALTHMON_AG:
+		event->grpmask = mask & ~XFS_SICK_AG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		event->grpmask = mask & ~XFS_SICK_RG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		event->imask = mask & ~XFS_SICK_INO_SECONDARY;
+		if (!event->imask)
+			goto out_event;
+		event->ino = hup->ino;
+		event->gen = hup->gen;
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		goto out_event;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+out_event:
+	kfree(event);
+	goto out_unlock;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -282,6 +490,10 @@ xfs_healthmon_typestring(
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_RUNNING]		= "running",
 		[XFS_HEALTHMON_LOST]		= "lost",
+		[XFS_HEALTHMON_UNMOUNT]		= "unmount",
+		[XFS_HEALTHMON_SICK]		= "sick",
+		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
+		[XFS_HEALTHMON_HEALTHY]		= "healthy",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -297,6 +509,10 @@ xfs_healthmon_domstring(
 {
 	static const char *dom_strings[] = {
 		[XFS_HEALTHMON_MOUNT]		= "mount",
+		[XFS_HEALTHMON_FS]		= "fs",
+		[XFS_HEALTHMON_AG]		= "perag",
+		[XFS_HEALTHMON_INODE]		= "inode",
+		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -322,6 +538,11 @@ xfs_healthmon_format_flags(
 		if (!(p->mask & flags))
 			continue;
 
+		if (!p->str) {
+			flags &= ~p->mask;
+			continue;
+		}
+
 		ret = seq_buf_printf(outbuf, "%s\"%s\"",
 				first ? "" : ", ", p->str);
 		if (ret < 0)
@@ -372,6 +593,113 @@ __xfs_healthmon_format_mask(
 #define xfs_healthmon_format_mask(o, d, s, m) \
 	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
 
+/* Render fs sickness mask as a string set */
+static int
+xfs_healthmon_format_fs(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_FSOP_GEOM_SICK_COUNTERS,		"fscounters" },
+		{ XFS_FSOP_GEOM_SICK_UQUOTA,		"usrquota" },
+		{ XFS_FSOP_GEOM_SICK_GQUOTA,		"grpquota" },
+		{ XFS_FSOP_GEOM_SICK_PQUOTA,		"prjquota" },
+		{ XFS_FSOP_GEOM_SICK_QUOTACHECK,	"quotacheck" },
+		{ XFS_FSOP_GEOM_SICK_NLINKS,		"nlinks" },
+		{ XFS_FSOP_GEOM_SICK_METADIR,		"metadir" },
+		{ XFS_FSOP_GEOM_SICK_METAPATH,		"metapath" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			xfs_healthmon_fs_mask(event->fsmask));
+}
+
+/* Render rtgroup sickness mask as a string set */
+static int
+xfs_healthmon_format_rtgroup(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_RTGROUP_GEOM_SICK_SUPER,		"super" },
+		{ XFS_RTGROUP_GEOM_SICK_BITMAP,		"bitmap" },
+		{ XFS_RTGROUP_GEOM_SICK_SUMMARY,	"summary" },
+		{ XFS_RTGROUP_GEOM_SICK_RMAPBT,		"rmapbt" },
+		{ XFS_RTGROUP_GEOM_SICK_REFCNTBT,	"refcountbt" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			xfs_healthmon_rtgroup_mask(event->grpmask));
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render perag sickness mask as a string set */
+static int
+xfs_healthmon_format_ag(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_AG_GEOM_SICK_SB,		"super" },
+		{ XFS_AG_GEOM_SICK_AGF,		"agf" },
+		{ XFS_AG_GEOM_SICK_AGFL,	"agfl" },
+		{ XFS_AG_GEOM_SICK_AGI,		"agi" },
+		{ XFS_AG_GEOM_SICK_BNOBT,	"bnobt" },
+		{ XFS_AG_GEOM_SICK_CNTBT,	"cntbt" },
+		{ XFS_AG_GEOM_SICK_INOBT,	"inobt" },
+		{ XFS_AG_GEOM_SICK_FINOBT,	"finobt" },
+		{ XFS_AG_GEOM_SICK_RMAPBT,	"rmapbt" },
+		{ XFS_AG_GEOM_SICK_REFCNTBT,	"refcountbt" },
+		{ XFS_AG_GEOM_SICK_INODES,	"inodes" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			xfs_healthmon_perag_mask(event->grpmask));
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render inode sickness mask as a string set */
+static int
+xfs_healthmon_format_inode(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_BS_SICK_INODE,		"core" },
+		{ XFS_BS_SICK_BMBTD,		"bmapbtd" },
+		{ XFS_BS_SICK_BMBTA,		"bmapbta" },
+		{ XFS_BS_SICK_BMBTC,		"bmapbtc" },
+		{ XFS_BS_SICK_DIR,		"directory" },
+		{ XFS_BS_SICK_XATTR,		"xattr" },
+		{ XFS_BS_SICK_SYMLINK,		"symlink" },
+		{ XFS_BS_SICK_PARENT,		"parent" },
+		{ XFS_BS_SICK_DIRTREE,		"dirtree" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(outbuf, "structures", mask_strings,
+			xfs_healthmon_inode_mask(event->imask));
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"inumber\":    %llu,\n",
+			event->ino);
+	if (ret < 0)
+		return ret;
+	return seq_buf_printf(outbuf, "  \"generation\": %u,\n",
+			event->gen);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -433,6 +761,18 @@ xfs_healthmon_format_json(
 			break;
 		}
 		break;
+	case XFS_HEALTHMON_FS:
+		ret = xfs_healthmon_format_fs(outbuf, event);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		ret = xfs_healthmon_format_rtgroup(outbuf, event);
+		break;
+	case XFS_HEALTHMON_AG:
+		ret = xfs_healthmon_format_ag(outbuf, event);
+		break;
+	case XFS_HEALTHMON_INODE:
+		ret = xfs_healthmon_format_inode(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -461,11 +801,19 @@ xfs_healthmon_format_json(
 
 static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+	[XFS_HEALTHMON_FS]		= XFS_HEALTH_MONITOR_DOMAIN_FS,
+	[XFS_HEALTHMON_AG]		= XFS_HEALTH_MONITOR_DOMAIN_AG,
+	[XFS_HEALTHMON_INODE]		= XFS_HEALTH_MONITOR_DOMAIN_INODE,
+	[XFS_HEALTHMON_RTGROUP]		= XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
 };
 
 static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_RUNNING]		= XFS_HEALTH_MONITOR_TYPE_RUNNING,
 	[XFS_HEALTHMON_LOST]		= XFS_HEALTH_MONITOR_TYPE_LOST,
+	[XFS_HEALTHMON_SICK]		= XFS_HEALTH_MONITOR_TYPE_SICK,
+	[XFS_HEALTHMON_CORRUPT]		= XFS_HEALTH_MONITOR_TYPE_CORRUPT,
+	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
+	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 };
 
 /* Render event as a C structure */
@@ -501,6 +849,22 @@ xfs_healthmon_format_cstruct(
 			break;
 		}
 		break;
+	case XFS_HEALTHMON_FS:
+		hme.e.fs.mask = xfs_healthmon_fs_mask(event->fsmask);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		hme.e.group.mask = xfs_healthmon_rtgroup_mask(event->grpmask);
+		hme.e.group.gno = event->group;
+		break;
+	case XFS_HEALTHMON_AG:
+		hme.e.group.mask = xfs_healthmon_perag_mask(event->grpmask);
+		hme.e.group.gno = event->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		hme.e.inode.mask = xfs_healthmon_inode_mask(event->imask);
+		hme.e.inode.ino = event->ino;
+		hme.e.inode.gen = event->gen;
+		break;
 	default:
 		break;
 	}
@@ -541,7 +905,7 @@ static inline bool
 xfs_healthmon_has_eventdata(
 	struct xfs_healthmon	*hm)
 {
-	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+	return !hm->mp || hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
 }
 
 /* Try to copy the rest of the outbuf to the iov iter. */
@@ -584,10 +948,16 @@ xfs_healthmon_peek(
 	struct xfs_healthmon_event *event;
 
 	mutex_lock(&hm->lock);
+	event = hm->first_event;
 	if (hm->mp)
-		event = hm->first_event;
-	else
-		event = NULL;
+		goto done;
+
+	/* If the filesystem is unmounted, only return the unmount event */
+	if (event && event->type == XFS_HEALTHMON_UNMOUNT)
+		goto done;
+	event = NULL;
+
+done:
 	mutex_unlock(&hm->lock);
 	return event;
 }
@@ -722,6 +1092,58 @@ xfs_healthmon_free_events(
 	hm->first_event = hm->last_event = NULL;
 }
 
+/*
+ * Detach all filesystem hooks that were set up for a health monitor.  Only
+ * call this from iterate_super*.
+ */
+STATIC void
+xfs_healthmon_detach_hooks(
+	struct super_block	*sb,
+	void			*arg)
+{
+	struct xfs_healthmon	*hm = arg;
+
+	mutex_lock(&hm->lock);
+
+	/*
+	 * Because health monitors have a weak reference to the filesystem
+	 * they're monitoring, the hook deletions below must not race against
+	 * that filesystem being unmounted because that could lead to UAF
+	 * errors.
+	 *
+	 * If hm->mp is NULL, the health unmount hook already ran and the hook
+	 * chain head (contained within the xfs_mount structure) is gone.  Do
+	 * not detach any hooks; just let them get freed when the healthmon
+	 * object is torn down.
+	 */
+	if (!hm->mp)
+		goto out_unlock;
+
+	/*
+	 * Otherwise, the caller gave us a non-dying @sb with s_umount held in
+	 * shared mode, which means that @sb cannot be running through
+	 * deactivate_locked_super and cannot be freed.  It's safe to compare
+	 * @sb against the super that we snapshotted when we set up the health
+	 * monitor.
+	 */
+	if (hm->mp->m_super != sb)
+		goto out_unlock;
+
+	mutex_unlock(&hm->lock);
+
+	/*
+	 * Now we know that the filesystem @hm->mp is active and cannot be
+	 * deactivated until this function returns.  Unmount events are sent
+	 * through the health monitoring subsystem from xfs_fs_put_super, so
+	 * it is now time to detach the hooks.
+	 */
+	xfs_health_hook_del(hm->mp, &hm->hhook);
+	return;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+}
+
 /* Free the health monitoring information. */
 STATIC int
 xfs_healthmon_release(
@@ -734,6 +1156,9 @@ xfs_healthmon_release(
 
 	wake_up_all(&hm->wait);
 
+	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_health_hook_disable();
+
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	if (hm->outbuf.size)
@@ -783,11 +1208,18 @@ xfs_healthmon_show_fdinfo(
 	struct xfs_healthmon	*hm = file->private_data;
 
 	mutex_lock(&hm->lock);
+	if (!hm->mp) {
+		seq_printf(m, "state:\tdead\n");
+		goto out_unlock;
+	}
+
 	seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
 			hm->mp->m_super->s_id,
 			xfs_healthmon_format_string(hm),
 			hm->total_events,
 			hm->total_lost);
+
+out_unlock:
 	mutex_unlock(&hm->lock);
 }
 #endif
@@ -832,6 +1264,13 @@ xfs_ioc_health_monitor(
 	hm->mp = mp;
 	hm->format = hmo.format;
 
+	/*
+	 * Since we already got a ref to the module, take a reference to the
+	 * fstype to make it easier to detach the hooks when we tear things
+	 * down later.
+	 */
+	hm->fstyp = mp->m_super->s_type;
+
 	seq_buf_init(&hm->outbuf, NULL, 0);
 	mutex_init(&hm->lock);
 	init_waitqueue_head(&hm->wait);
@@ -839,12 +1278,21 @@ xfs_ioc_health_monitor(
 	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
 		hm->verbose = true;
 
+	/* Enable hooks to receive events, generally. */
+	xfs_health_hook_enable();
+
+	/* Attach specific event hooks to this monitor. */
+	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+	ret = xfs_health_hook_add(mp, &hm->hhook);
+	if (ret)
+		goto out_hooks;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -856,14 +1304,17 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
-out_mutex:
+out_healthhook:
+	xfs_health_hook_del(mp, &hm->hhook);
+out_hooks:
+	xfs_health_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 13/19] xfs: report shutdown events through healthmon
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-10-23  0:03   ` [PATCH 12/19] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2025-10-23  0:04   ` Darrick J. Wong
  2025-10-23  0:04   ` [PATCH 14/19] xfs: report media errors " Darrick J. Wong
                     ` (5 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:04 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h                  |   18 +++++
 fs/xfs/xfs_healthmon.h                  |    5 +
 fs/xfs/xfs_trace.h                      |   28 +++++++
 fs/xfs/libxfs/xfs_healthmon.schema.json |   62 ++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  119 ++++++++++++++++++++++++++++++-
 5 files changed, 229 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 358abe98776d69..918362a7294f27 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1028,6 +1028,9 @@ struct xfs_rtgroup_geometry {
 /* filesystem was unmounted */
 #define XFS_HEALTH_MONITOR_TYPE_UNMOUNT		(5)
 
+/* filesystem shutdown */
+#define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN	(6)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1054,6 +1057,20 @@ struct xfs_health_monitor_inode {
 	__u64	ino;
 };
 
+/* shutdown reasons */
+#define XFS_HEALTH_SHUTDOWN_META_IO_ERROR	(1u << 0)
+#define XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR	(1u << 1)
+#define XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT	(1u << 2)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE	(1u << 3)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK	(1u << 4)
+#define XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED	(1u << 5)
+
+/* shutdown */
+struct xfs_health_monitor_shutdown {
+	/* XFS_HEALTH_SHUTDOWN_* flags */
+	__u32	reasons;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1074,6 +1091,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_fs fs;
 		struct xfs_health_monitor_group group;
 		struct xfs_health_monitor_inode inode;
+		struct xfs_health_monitor_shutdown shutdown;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 3f3ba16d5af56a..a82a684bbc0e03 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -11,6 +11,9 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
 	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
 
+	/* filesystem shutdown */
+	XFS_HEALTHMON_SHUTDOWN,
+
 	/* metadata health events */
 	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
@@ -41,7 +44,7 @@ struct xfs_healthmon_event {
 		struct {
 			uint64_t	lostcount;
 		};
-		/* mount */
+		/* shutdown */
 		struct {
 			unsigned int	flags;
 		};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index df09c225e13c2e..e39138293c2782 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6010,8 +6010,32 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+	TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, shutdown_flags, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, shutdown_flags)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->shutdown_flags = shutdown_flags;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
 #define XFS_HEALTHMON_TYPE_STRINGS \
 	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }, \
 	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
 	{ XFS_HEALTHMON_SICK,		"sick" }, \
 	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
@@ -6025,6 +6049,7 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
@@ -6065,6 +6090,9 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 		switch (__entry->domain) {
 		case XFS_HEALTHMON_MOUNT:
 			switch (__entry->type) {
+			case XFS_HEALTHMON_SHUTDOWN:
+				__entry->mask = event->flags;
+				break;
 			case XFS_HEALTHMON_LOST:
 				__entry->lostcount = event->lostcount;
 				break;
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index dd78f1b71d587b..1657ccc482edff 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -36,6 +36,9 @@
 		},
 		{
 			"$ref": "#/$events/inode_metadata"
+		},
+		{
+			"$ref": "#/$events/shutdown"
 		}
 	],
 
@@ -204,6 +207,31 @@
 		}
 	},
 
+	"$comment": "Shutdown event data are defined here.",
+	"$shutdown": {
+		"reason": {
+			"description": [
+				"Reason for a filesystem to shut down.",
+				"Options include:",
+				"",
+				" * corrupt_incore: in-memory corruption",
+				" * corrupt_ondisk: on-disk corruption",
+				" * device_removed: device removed",
+				" * force_umount:   userspace asked for it",
+				" * log_ioerr:      log write IO error",
+				" * meta_ioerr:     metadata writeback IO error"
+			],
+			"enum": [
+				"corrupt_incore",
+				"corrupt_ondisk",
+				"device_removed",
+				"force_umount",
+				"log_ioerr",
+				"meta_ioerr"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"running": {
@@ -439,6 +467,40 @@
 				"generation",
 				"structures"
 			]
+		},
+		"shutdown": {
+			"title": "Abnormal Shutdown Event",
+			"description": [
+				"The filesystem went offline due to",
+				"unrecoverable errors."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "shutdown"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "mount"
+				},
+				"reasons": {
+					"type": "array",
+					"items": {
+						"$ref": "#/$shutdown/reason"
+					},
+					"minItems": 1
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"reasons"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 05c67fe40f2bac..76de516708e8f9 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -20,6 +20,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_fsops.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -67,6 +68,7 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*last_event;
 
 	/* live update hooks */
+	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
@@ -482,6 +484,43 @@ xfs_healthmon_metadata_hook(
 	goto out_unlock;
 }
 
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_SHUTDOWN,
+			XFS_HEALTHMON_MOUNT);
+	if (!event)
+		goto out_unlock;
+
+	event->flags = action;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -490,6 +529,7 @@ xfs_healthmon_typestring(
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_RUNNING]		= "running",
 		[XFS_HEALTHMON_LOST]		= "lost",
+		[XFS_HEALTHMON_SHUTDOWN]	= "shutdown",
 		[XFS_HEALTHMON_UNMOUNT]		= "unmount",
 		[XFS_HEALTHMON_SICK]		= "sick",
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
@@ -700,6 +740,25 @@ xfs_healthmon_format_inode(
 			event->gen);
 }
 
+/* Render shutdown mask as a string set */
+static int
+xfs_healthmon_format_shutdown(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ SHUTDOWN_META_IO_ERROR,	"meta_ioerr" },
+		{ SHUTDOWN_LOG_IO_ERROR,	"log_ioerr" },
+		{ SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
+		{ SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
+		{ SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
+		{ SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
+	};
+
+	return xfs_healthmon_format_mask(outbuf, "reasons", mask_strings,
+			event->flags);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -757,6 +816,9 @@ xfs_healthmon_format_json(
 		case XFS_HEALTHMON_LOST:
 			ret = xfs_healthmon_format_lost(outbuf, event);
 			break;
+		case XFS_HEALTHMON_SHUTDOWN:
+			ret = xfs_healthmon_format_shutdown(outbuf, event);
+			break;
 		default:
 			break;
 		}
@@ -799,6 +861,44 @@ xfs_healthmon_format_json(
 	return -1;
 }
 
+struct flags_map {
+	unsigned int		in_mask;
+	unsigned int		out_mask;
+};
+
+static const struct flags_map shutdown_map[] = {
+	{ SHUTDOWN_META_IO_ERROR,	XFS_HEALTH_SHUTDOWN_META_IO_ERROR },
+	{ SHUTDOWN_LOG_IO_ERROR,	XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR },
+	{ SHUTDOWN_FORCE_UMOUNT,	XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT },
+	{ SHUTDOWN_CORRUPT_INCORE,	XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE },
+	{ SHUTDOWN_CORRUPT_ONDISK,	XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK },
+	{ SHUTDOWN_DEVICE_REMOVED,	XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED },
+};
+
+static inline unsigned int
+__map_flags(
+	const struct flags_map	*map,
+	size_t			array_len,
+	unsigned int		flags)
+{
+	const struct flags_map	*m;
+	unsigned int		ret = 0;
+
+	for (m = map; m < map + array_len; m++) {
+		if (flags & m->in_mask)
+			ret |= m->out_mask;
+	}
+
+	return ret;
+}
+
+#define map_flags(map, flags) __map_flags((map), ARRAY_SIZE(map), (flags))
+
+static inline unsigned int shutdown_mask(unsigned int in)
+{
+	return map_flags(shutdown_map, in);
+}
+
 static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
 	[XFS_HEALTHMON_FS]		= XFS_HEALTH_MONITOR_DOMAIN_FS,
@@ -814,6 +914,7 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_CORRUPT]		= XFS_HEALTH_MONITOR_TYPE_CORRUPT,
 	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
+	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
 };
 
 /* Render event as a C structure */
@@ -845,6 +946,9 @@ xfs_healthmon_format_cstruct(
 		case XFS_HEALTHMON_LOST:
 			hme.e.lost.count = event->lostcount;
 			break;
+		case XFS_HEALTHMON_SHUTDOWN:
+			hme.e.shutdown.reasons = shutdown_mask(event->flags);
+			break;
 		default:
 			break;
 		}
@@ -1137,6 +1241,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
 
@@ -1157,6 +1262,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
 
 	mutex_destroy(&hm->lock);
@@ -1280,6 +1386,7 @@ xfs_ioc_health_monitor(
 
 	/* Enable hooks to receive events, generally. */
 	xfs_health_hook_enable();
+	xfs_shutdown_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1287,12 +1394,17 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_hooks;
 
+	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+	ret = xfs_shutdown_hook_add(mp, &hm->shook);
+	if (ret)
+		goto out_healthhook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -1304,17 +1416,20 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_shutdownhook:
+	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
 	xfs_health_hook_disable();
+	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);



* [PATCH 14/19] xfs: report media errors through healthmon
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-10-23  0:04   ` [PATCH 13/19] xfs: report shutdown " Darrick J. Wong
@ 2025-10-23  0:04   ` Darrick J. Wong
  2025-10-23  0:04   ` [PATCH 15/19] xfs: report file io " Darrick J. Wong
                     ` (4 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:04 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have hooks to report media errors, connect them to the
health monitor as well.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h                  |   15 ++++
 fs/xfs/xfs_healthmon.h                  |   12 ++++
 fs/xfs/xfs_trace.h                      |   57 +++++++++++++++++
 fs/xfs/libxfs/xfs_healthmon.schema.json |   65 +++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  106 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.c                      |    1 
 6 files changed, 254 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 918362a7294f27..a551b1d5d0db58 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1014,6 +1014,11 @@ struct xfs_rtgroup_geometry {
 #define XFS_HEALTH_MONITOR_DOMAIN_INODE		(3)
 #define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP	(4)
 
+/* disk events */
+#define XFS_HEALTH_MONITOR_DOMAIN_DATADEV	(5)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTDEV		(6)
+#define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV	(7)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
@@ -1031,6 +1036,9 @@ struct xfs_rtgroup_geometry {
 /* filesystem shutdown */
 #define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN	(6)
 
+/* media errors */
+#define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR	(7)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1071,6 +1079,12 @@ struct xfs_health_monitor_shutdown {
 	__u32	reasons;
 };
 
+/* disk media errors */
+struct xfs_health_monitor_media {
+	__u64	daddr;
+	__u64	bbcount;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1092,6 +1106,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_group group;
 		struct xfs_health_monitor_inode inode;
 		struct xfs_health_monitor_shutdown shutdown;
+		struct xfs_health_monitor_media media;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index a82a684bbc0e03..407c5e1f466726 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -19,6 +19,8 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
 	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
 
+	/* media errors */
+	XFS_HEALTHMON_MEDIA_ERROR,
 };
 
 enum xfs_healthmon_domain {
@@ -29,6 +31,11 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_AG,	/* allocation group metadata */
 	XFS_HEALTHMON_INODE,	/* inode metadata */
 	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
+
+	/* media errors */
+	XFS_HEALTHMON_DATADEV,
+	XFS_HEALTHMON_RTDEV,
+	XFS_HEALTHMON_LOGDEV,
 };
 
 struct xfs_healthmon_event {
@@ -66,6 +73,11 @@ struct xfs_healthmon_event {
 			uint32_t	gen;
 			xfs_ino_t	ino;
 		};
+		/* media errors */
+		struct {
+			xfs_daddr_t	daddr;
+			uint64_t	bbcount;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e39138293c2782..11d70e3792493a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -105,6 +105,7 @@ struct xfs_rtgroup;
 struct xfs_open_zone;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
+struct xfs_media_error_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6111,6 +6112,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 			__entry->ino = event->ino;
 			__entry->gen = event->gen;
 			break;
+		case XFS_HEALTHMON_DATADEV:
+		case XFS_HEALTHMON_LOGDEV:
+		case XFS_HEALTHMON_RTDEV:
+			__entry->offset = event->daddr;
+			__entry->length = event->bbcount;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6199,6 +6206,56 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+TRACE_EVENT(xfs_healthmon_media_error_hook,
+	TP_PROTO(const struct xfs_media_error_params *p,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(uint64_t, daddr)
+		__field(uint64_t, bbcount)
+		__field(int, pre_remove)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		struct xfs_mount	*mp = p->mp;
+		struct xfs_buftarg	*btp = NULL;
+
+		switch (p->fdev) {
+		case XFS_FAILED_DATADEV:
+			btp = mp->m_ddev_targp;
+			break;
+		case XFS_FAILED_LOGDEV:
+			btp = mp->m_logdev_targp;
+			break;
+		case XFS_FAILED_RTDEV:
+			btp = mp->m_rtdev_targp;
+			break;
+		}
+
+		__entry->dev = mp->m_super->s_dev;
+		if (btp)
+			__entry->error_dev = btp->bt_dev;
+		__entry->daddr = p->daddr;
+		__entry->bbcount = p->bbcount;
+		__entry->pre_remove = p->pre_remove;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx pre_remove? %d events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->error_dev), MINOR(__entry->error_dev),
+		  __entry->daddr,
+		  __entry->bbcount,
+		  __entry->pre_remove,
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#endif
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index 1657ccc482edff..d3b537a040cb83 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -39,6 +39,9 @@
 		},
 		{
 			"$ref": "#/$events/shutdown"
+		},
+		{
+			"$ref": "#/$events/media_error"
 		}
 	],
 
@@ -75,6 +78,31 @@
 		"i_generation": {
 			"description": "Inode generation number",
 			"type": "integer"
+		},
+		"storage_devs": {
+			"description": "Storage devices in a filesystem",
+			"_comment": [
+				"One of:",
+				"",
+				" * datadev: filesystem device",
+				" * logdev:  external log device",
+				" * rtdev:   realtime volume"
+			],
+			"enum": [
+				"datadev",
+				"logdev",
+				"rtdev"
+			]
+		},
+		"xfs_daddr_t": {
+			"description": "Storage device address, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 0
+		},
+		"bbcount": {
+			"description": "Storage space length, in units of 512-byte blocks",
+			"type": "integer",
+			"minimum": 1
 		}
 	},
 
@@ -501,6 +529,43 @@
 				"domain",
 				"reasons"
 			]
+		},
+		"media_error": {
+			"title": "Media Error",
+			"description": [
+				"A storage device reported a media error.",
+				"The domain element tells us which storage",
+				"device reported the media failure.  The",
+				"daddr and bbcount elements tell us where",
+				"inside that device the failure was observed."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"const": "media"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"$ref": "#/$defs/storage_devs"
+				},
+				"daddr": {
+					"$ref": "#/$defs/xfs_daddr_t"
+				},
+				"bbcount": {
+					"$ref": "#/$defs/bbcount"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"daddr",
+				"bbcount"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 76de516708e8f9..52b8be8eb7a11b 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -21,6 +21,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -70,6 +71,7 @@ struct xfs_healthmon {
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
+	struct xfs_media_error_hook	mhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -521,6 +523,59 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+/* Add a media error event to the reporting queue. */
+STATIC int
+xfs_healthmon_media_error_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_media_error_params	*p = data;
+	enum xfs_healthmon_domain	domain = 0; /* shut up gcc */
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_media_error_hook(p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (p->fdev) {
+	case XFS_FAILED_LOGDEV:
+		domain = XFS_HEALTHMON_LOGDEV;
+		break;
+	case XFS_FAILED_RTDEV:
+		domain = XFS_HEALTHMON_RTDEV;
+		break;
+	case XFS_FAILED_DATADEV:
+		domain = XFS_HEALTHMON_DATADEV;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_MEDIA_ERROR, domain);
+	if (!event)
+		goto out_unlock;
+
+	event->daddr = p->daddr;
+	event->bbcount = p->bbcount;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+#endif
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -534,6 +589,7 @@ xfs_healthmon_typestring(
 		[XFS_HEALTHMON_SICK]		= "sick",
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
 		[XFS_HEALTHMON_HEALTHY]		= "healthy",
+		[XFS_HEALTHMON_MEDIA_ERROR]	= "media",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -553,6 +609,9 @@ xfs_healthmon_domstring(
 		[XFS_HEALTHMON_AG]		= "perag",
 		[XFS_HEALTHMON_INODE]		= "inode",
 		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
+		[XFS_HEALTHMON_DATADEV]		= "datadev",
+		[XFS_HEALTHMON_LOGDEV]		= "logdev",
+		[XFS_HEALTHMON_RTDEV]		= "rtdev",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -759,6 +818,23 @@ xfs_healthmon_format_shutdown(
 			event->flags);
 }
 
+/* Render media error as a string set */
+static int
+xfs_healthmon_format_media_error(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"daddr\":      %llu,\n",
+			event->daddr);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"bbcount\":    %llu,\n",
+			event->bbcount);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -835,6 +911,11 @@ xfs_healthmon_format_json(
 	case XFS_HEALTHMON_INODE:
 		ret = xfs_healthmon_format_inode(outbuf, event);
 		break;
+	case XFS_HEALTHMON_DATADEV:
+	case XFS_HEALTHMON_LOGDEV:
+	case XFS_HEALTHMON_RTDEV:
+		ret = xfs_healthmon_format_media_error(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -905,6 +986,9 @@ static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_AG]		= XFS_HEALTH_MONITOR_DOMAIN_AG,
 	[XFS_HEALTHMON_INODE]		= XFS_HEALTH_MONITOR_DOMAIN_INODE,
 	[XFS_HEALTHMON_RTGROUP]		= XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
+	[XFS_HEALTHMON_DATADEV]		= XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
+	[XFS_HEALTHMON_RTDEV]		= XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
+	[XFS_HEALTHMON_LOGDEV]		= XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
 };
 
 static const unsigned int type_map[] = {
@@ -915,6 +999,7 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
+	[XFS_HEALTHMON_MEDIA_ERROR]	= XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
 };
 
 /* Render event as a C structure */
@@ -969,6 +1054,12 @@ xfs_healthmon_format_cstruct(
 		hme.e.inode.ino = event->ino;
 		hme.e.inode.gen = event->gen;
 		break;
+	case XFS_HEALTHMON_DATADEV:
+	case XFS_HEALTHMON_LOGDEV:
+	case XFS_HEALTHMON_RTDEV:
+		hme.e.media.daddr = event->daddr;
+		hme.e.media.bbcount = event->bbcount;
+		break;
 	default:
 		break;
 	}
@@ -1241,6 +1332,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
@@ -1262,6 +1354,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_media_error_hook_disable();
 	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
 
@@ -1387,6 +1480,7 @@ xfs_ioc_health_monitor(
 	/* Enable hooks to receive events, generally. */
 	xfs_health_hook_enable();
 	xfs_shutdown_hook_enable();
+	xfs_media_error_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1399,12 +1493,17 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_healthhook;
 
+	xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook);
+	ret = xfs_media_error_hook_add(mp, &hm->mhook);
+	if (ret)
+		goto out_shutdownhook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -1416,18 +1515,21 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_mediahook:
+	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
 	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
+	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index d42b864a3837a2..08ddab700a6cd3 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -53,6 +53,7 @@
 #include "xfs_zone_priv.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 /*
  * We include this last to have the helpers above available for the trace



* [PATCH 15/19] xfs: report file io errors through healthmon
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-10-23  0:04   ` [PATCH 14/19] xfs: report media errors " Darrick J. Wong
@ 2025-10-23  0:04   ` Darrick J. Wong
  2025-10-23  0:04   ` [PATCH 16/19] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
                     ` (3 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:04 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a file io error event hook so that we can send events about read
errors, writeback errors, and directio errors to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h                  |   18 ++++
 fs/xfs/xfs_healthmon.h                  |   16 ++++
 fs/xfs/xfs_trace.h                      |   56 +++++++++++++
 fs/xfs/libxfs/xfs_healthmon.schema.json |   77 ++++++++++++++++++
 fs/xfs/xfs_healthmon.c                  |  131 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.c                      |    1 
 6 files changed, 297 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index a551b1d5d0db58..87e915baa875d6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1019,6 +1019,9 @@ struct xfs_rtgroup_geometry {
 #define XFS_HEALTH_MONITOR_DOMAIN_RTDEV		(6)
 #define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV	(7)
 
+/* file range events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FILERANGE	(8)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
@@ -1039,6 +1042,12 @@ struct xfs_rtgroup_geometry {
 /* media errors */
 #define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR	(7)
 
+/* file range events */
+#define XFS_HEALTH_MONITOR_TYPE_BUFREAD		(8)
+#define XFS_HEALTH_MONITOR_TYPE_BUFWRITE	(9)
+#define XFS_HEALTH_MONITOR_TYPE_DIOREAD		(10)
+#define XFS_HEALTH_MONITOR_TYPE_DIOWRITE	(11)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1079,6 +1088,14 @@ struct xfs_health_monitor_shutdown {
 	__u32	reasons;
 };
 
+/* file range events */
+struct xfs_health_monitor_filerange {
+	__u64	pos;
+	__u64	len;
+	__u64	ino;
+	__u32	gen;
+};
+
 /* disk media errors */
 struct xfs_health_monitor_media {
 	__u64	daddr;
@@ -1107,6 +1124,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_inode inode;
 		struct xfs_health_monitor_shutdown shutdown;
 		struct xfs_health_monitor_media media;
+		struct xfs_health_monitor_filerange filerange;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 407c5e1f466726..421b46f97df482 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -21,6 +21,12 @@ enum xfs_healthmon_type {
 
 	/* media errors */
 	XFS_HEALTHMON_MEDIA_ERROR,
+
+	/* file range events */
+	XFS_HEALTHMON_BUFREAD,
+	XFS_HEALTHMON_BUFWRITE,
+	XFS_HEALTHMON_DIOREAD,
+	XFS_HEALTHMON_DIOWRITE,
 };
 
 enum xfs_healthmon_domain {
@@ -36,6 +42,9 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_DATADEV,
 	XFS_HEALTHMON_RTDEV,
 	XFS_HEALTHMON_LOGDEV,
+
+	/* file range events */
+	XFS_HEALTHMON_FILERANGE,
 };
 
 struct xfs_healthmon_event {
@@ -78,6 +87,13 @@ struct xfs_healthmon_event {
 			xfs_daddr_t	daddr;
 			uint64_t	bbcount;
 		};
+		/* file range events */
+		struct {
+			xfs_ino_t	fino;
+			loff_t		fpos;
+			uint64_t	flen;
+			uint32_t	fgen;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 11d70e3792493a..b23f3c41db1c03 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -106,6 +106,7 @@ struct xfs_open_zone;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
 struct xfs_media_error_params;
+struct xfs_file_ioerror_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6118,6 +6119,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 			__entry->offset = event->daddr;
 			__entry->length = event->bbcount;
 			break;
+		case XFS_HEALTHMON_FILERANGE:
+			__entry->ino = event->fino;
+			__entry->gen = event->fgen;
+			__entry->offset = event->fpos;
+			__entry->length = event->flen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6256,6 +6263,55 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->lost_prev)
 );
 #endif
+
+#define XFS_FILE_IOERROR_STRINGS \
+	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
+	{ XFS_FILE_IOERROR_BUFFERED_WRITE,	"writeback" }, \
+	{ XFS_FILE_IOERROR_DIRECT_READ,		"directio_read" }, \
+	{ XFS_FILE_IOERROR_DIRECT_WRITE,	"directio_write" }
+
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE);
+
+TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
+	TP_PROTO(const struct xfs_mount *mp,
+		 unsigned long action,
+		 const struct xfs_file_ioerror_params *p,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, action, p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(unsigned long, action)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(long long, pos)
+		__field(unsigned long long, len)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->action = action;
+		__entry->ino = p->ino;
+		__entry->gen = p->gen;
+		__entry->pos = p->pos;
+		__entry->len = p->len;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d ino 0x%llx gen 0x%x op %s pos 0x%llx bytecount 0x%llx events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->gen,
+		  __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS),
+		  __entry->pos,
+		  __entry->len,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
index d3b537a040cb83..fb696dfbbfd044 100644
--- a/fs/xfs/libxfs/xfs_healthmon.schema.json
+++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
@@ -42,6 +42,9 @@
 		},
 		{
 			"$ref": "#/$events/media_error"
+		},
+		{
+			"$ref": "#/$events/file_ioerror"
 		}
 	],
 
@@ -79,6 +82,16 @@
 			"description": "Inode generation number",
 			"type": "integer"
 		},
+		"off_t": {
+			"description": "File position, in bytes",
+			"type": "integer",
+			"minimum": 0
+		},
+		"size_t": {
+			"description": "File operation length, in bytes",
+			"type": "integer",
+			"minimum": 1
+		},
 		"storage_devs": {
 			"description": "Storage devices in a filesystem",
 			"_comment": [
@@ -260,6 +273,26 @@
 		}
 	},
 
+	"$comment": "File IO event data are defined here.",
+	"$fileio": {
+		"types": {
+			"description": [
+				"File I/O operations.  One of:",
+				"",
+				" * readahead: reads into the page cache.",
+				" * writeback: writeback of dirty page cache.",
+				" * directio_read:   O_DIRECT reads.",
+				" * directio_write:  O_DIRECT writes."
+			],
+			"enum": [
+				"readahead",
+				"writeback",
+				"directio_read",
+				"directio_write"
+			]
+		}
+	},
+
 	"$comment": "Event types are defined here.",
 	"$events": {
 		"running": {
@@ -566,6 +599,50 @@
 				"daddr",
 				"bbcount"
 			]
+		},
+		"file_ioerror": {
+			"title": "File I/O error",
+			"description": [
+				"A read or a write to a file failed.  The",
+				"inumber, generation, pos, and length",
+				"fields describe the range of the file that is",
+				"affected."
+			],
+			"type": "object",
+
+			"properties": {
+				"type": {
+					"$ref": "#/$fileio/types"
+				},
+				"time_ns": {
+					"$ref": "#/$defs/time_ns"
+				},
+				"domain": {
+					"const": "filerange"
+				},
+				"inumber": {
+					"$ref": "#/$defs/xfs_ino_t"
+				},
+				"generation": {
+					"$ref": "#/$defs/i_generation"
+				},
+				"pos": {
+					"$ref": "#/$defs/off_t"
+				},
+				"length": {
+					"$ref": "#/$defs/size_t"
+				}
+			},
+
+			"required": [
+				"type",
+				"time_ns",
+				"domain",
+				"inumber",
+				"generation",
+				"pos",
+				"length"
+			]
 		}
 	}
 }
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 52b8be8eb7a11b..74ffb7c4af078c 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -22,6 +22,7 @@
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -72,6 +73,7 @@ struct xfs_healthmon {
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 	struct xfs_media_error_hook	mhook;
+	struct xfs_file_ioerror_hook	fhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -576,6 +578,73 @@ xfs_healthmon_media_error_hook(
 }
 #endif
 
+/* Add a file io error event to the reporting queue. */
+STATIC int
+xfs_healthmon_file_ioerror_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_file_ioerror_params	*p = data;
+	enum xfs_healthmon_type		type = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb);
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+	case XFS_FILE_IOERROR_DIRECT_READ:
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+		break;
+	default:
+		ASSERT(0);
+		return NOTIFY_DONE;
+	}
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+		type = XFS_HEALTHMON_BUFREAD;
+		break;
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+		type = XFS_HEALTHMON_BUFWRITE;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_READ:
+		type = XFS_HEALTHMON_DIOREAD;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+		type = XFS_HEALTHMON_DIOWRITE;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE);
+	if (!event)
+		goto out_unlock;
+
+	event->fino = p->ino;
+	event->fgen = p->gen;
+	event->fpos = p->pos;
+	event->flen = p->len;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -590,6 +659,10 @@ xfs_healthmon_typestring(
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
 		[XFS_HEALTHMON_HEALTHY]		= "healthy",
 		[XFS_HEALTHMON_MEDIA_ERROR]	= "media",
+		[XFS_HEALTHMON_BUFREAD]		= "readahead",
+		[XFS_HEALTHMON_BUFWRITE]	= "writeback",
+		[XFS_HEALTHMON_DIOREAD]		= "directio_read",
+		[XFS_HEALTHMON_DIOWRITE]	= "directio_write",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -612,6 +685,7 @@ xfs_healthmon_domstring(
 		[XFS_HEALTHMON_DATADEV]		= "datadev",
 		[XFS_HEALTHMON_LOGDEV]		= "logdev",
 		[XFS_HEALTHMON_RTDEV]		= "rtdev",
+		[XFS_HEALTHMON_FILERANGE]	= "filerange",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -835,6 +909,33 @@ xfs_healthmon_format_media_error(
 			event->bbcount);
 }
 
+/* Render file range events as a string set */
+static int
+xfs_healthmon_format_filerange(
+	struct seq_buf			*outbuf,
+	const struct xfs_healthmon_event *event)
+{
+	ssize_t				ret;
+
+	ret = seq_buf_printf(outbuf, "  \"inumber\":    %llu,\n",
+			event->fino);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"generation\": %u,\n",
+			event->fgen);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_buf_printf(outbuf, "  \"pos\":        %llu,\n",
+			event->fpos);
+	if (ret < 0)
+		return ret;
+
+	return seq_buf_printf(outbuf, "  \"length\":     %llu,\n",
+			event->flen);
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -916,6 +1017,9 @@ xfs_healthmon_format_json(
 	case XFS_HEALTHMON_RTDEV:
 		ret = xfs_healthmon_format_media_error(outbuf, event);
 		break;
+	case XFS_HEALTHMON_FILERANGE:
+		ret = xfs_healthmon_format_filerange(outbuf, event);
+		break;
 	}
 	if (ret < 0)
 		goto overrun;
@@ -989,6 +1093,7 @@ static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_DATADEV]		= XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
 	[XFS_HEALTHMON_RTDEV]		= XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
 	[XFS_HEALTHMON_LOGDEV]		= XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
+	[XFS_HEALTHMON_FILERANGE]	= XFS_HEALTH_MONITOR_DOMAIN_FILERANGE,
 };
 
 static const unsigned int type_map[] = {
@@ -1000,6 +1105,10 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
 	[XFS_HEALTHMON_MEDIA_ERROR]	= XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
+	[XFS_HEALTHMON_BUFREAD]		= XFS_HEALTH_MONITOR_TYPE_BUFREAD,
+	[XFS_HEALTHMON_BUFWRITE]	= XFS_HEALTH_MONITOR_TYPE_BUFWRITE,
+	[XFS_HEALTHMON_DIOREAD]		= XFS_HEALTH_MONITOR_TYPE_DIOREAD,
+	[XFS_HEALTHMON_DIOWRITE]	= XFS_HEALTH_MONITOR_TYPE_DIOWRITE,
 };
 
 /* Render event as a C structure */
@@ -1060,6 +1169,12 @@ xfs_healthmon_format_cstruct(
 		hme.e.media.daddr = event->daddr;
 		hme.e.media.bbcount = event->bbcount;
 		break;
+	case XFS_HEALTHMON_FILERANGE:
+		hme.e.filerange.ino = event->fino;
+		hme.e.filerange.gen = event->fgen;
+		hme.e.filerange.pos = event->fpos;
+		hme.e.filerange.len = event->flen;
+		break;
 	default:
 		break;
 	}
@@ -1332,6 +1447,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_file_ioerror_hook_del(hm->mp, &hm->fhook);
 	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
@@ -1354,6 +1470,7 @@ xfs_healthmon_release(
 	wake_up_all(&hm->wait);
 
 	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_file_ioerror_hook_disable();
 	xfs_media_error_hook_disable();
 	xfs_shutdown_hook_disable();
 	xfs_health_hook_disable();
@@ -1481,6 +1598,7 @@ xfs_ioc_health_monitor(
 	xfs_health_hook_enable();
 	xfs_shutdown_hook_enable();
 	xfs_media_error_hook_enable();
+	xfs_file_ioerror_hook_enable();
 
 	/* Attach specific event hooks to this monitor. */
 	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
@@ -1498,12 +1616,18 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_shutdownhook;
 
+	xfs_file_ioerror_hook_setup(&hm->fhook,
+			xfs_healthmon_file_ioerror_hook);
+	ret = xfs_file_ioerror_hook_add(mp, &hm->fhook);
+	if (ret)
+		goto out_mediahook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -1515,13 +1639,15 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_ioerrhook:
+	xfs_file_ioerror_hook_del(mp, &hm->fhook);
 out_mediahook:
 	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
@@ -1529,6 +1655,7 @@ xfs_ioc_health_monitor(
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:
+	xfs_file_ioerror_hook_disable();
 	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 08ddab700a6cd3..eb35015c091570 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -54,6 +54,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 /*
  * We include this last to have the helpers above available for the trace


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 16/19] xfs: allow reconfiguration of the health monitoring device
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-10-23  0:04   ` [PATCH 15/19] xfs: report file io " Darrick J. Wong
@ 2025-10-23  0:04   ` Darrick J. Wong
  2025-10-23  0:05   ` [PATCH 17/19] xfs: validate fds against running healthmon Darrick J. Wong
                     ` (2 subsequent siblings)
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:04 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it.
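As a sketch of how a client might drive this, the fragment below mirrors the
kind of validation the kernel side performs before applying a
reconfiguration.  The struct layout, the C-struct format code, and the
VERBOSE flag value are assumptions for illustration, not copies of the UAPI
header; only XFS_HEALTH_MONITOR_FMT_JSON's value appears in this series.

```c
#include <assert.h>
#include <stdint.h>

/* From the UAPI header in this series. */
#define XFS_HEALTH_MONITOR_FMT_JSON	1
/* Assumed values; check the real xfs_fs.h before relying on these. */
#define XFS_HEALTH_MONITOR_FMT_CSTRUCT	0		/* assumption */
#define XFS_HEALTH_MONITOR_VERBOSE	(1u << 0)	/* assumption */

/* Hypothetical stand-in for the fields of struct xfs_health_monitor
 * that the reconfigure path touches. */
struct hm_opts {
	uint64_t	flags;
	uint8_t		format;
};

/*
 * Client-side sanity check mirroring what xfs_healthmon_validate() is
 * likely to reject: an unknown format code or an unknown flag bit.
 * Returns 1 if the options look acceptable, 0 otherwise.
 */
static int hm_opts_valid(const struct hm_opts *o)
{
	if (o->format != XFS_HEALTH_MONITOR_FMT_CSTRUCT &&
	    o->format != XFS_HEALTH_MONITOR_FMT_JSON)
		return 0;
	if (o->flags & ~(uint64_t)XFS_HEALTH_MONITOR_VERBOSE)
		return 0;
	return 1;
}
```

A daemon would run such a check before issuing
ioctl(hm_fd, XFS_IOC_HEALTH_MONITOR, &hmo), so a bad request fails fast
without a syscall.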

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)


diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 74ffb7c4af078c..ce84cd90df2379 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -23,6 +23,8 @@
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
 #include "xfs_file.h"
+#include "xfs_fs.h"
+#include "xfs_ioctl.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -1540,6 +1542,48 @@ xfs_healthmon_show_fdinfo(
 }
 #endif
 
+/* Reconfigure the health monitor. */
+STATIC long
+xfs_healthmon_reconfigure(
+	struct file			*file,
+	unsigned int			cmd,
+	void __user			*arg)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm = file->private_data;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	mutex_lock(&hm->lock);
+	hm->format = hmo.format;
+	hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+	mutex_unlock(&hm->lock);
+	return 0;
+}
+
+/* Handle ioctls for the health monitoring thread. */
+STATIC long
+xfs_healthmon_ioctl(
+	struct file			*file,
+	unsigned int			cmd,
+	unsigned long			p)
+{
+	void __user			*arg = (void __user *)p;
+
+	switch (cmd) {
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_healthmon_reconfigure(file, cmd, arg);
+	default:
+		break;
+	}
+
+	return -ENOTTY;
+}
+
 static const struct file_operations xfs_healthmon_fops = {
 	.owner		= THIS_MODULE,
 #ifdef CONFIG_PROC_FS
@@ -1548,6 +1592,7 @@ static const struct file_operations xfs_healthmon_fops = {
 	.read_iter	= xfs_healthmon_read_iter,
 	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
+	.unlocked_ioctl	= xfs_healthmon_ioctl,
 };
 
 /*



* [PATCH 17/19] xfs: validate fds against running healthmon
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (15 preceding siblings ...)
  2025-10-23  0:04   ` [PATCH 16/19] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
@ 2025-10-23  0:05   ` Darrick J. Wong
  2025-10-23  0:05   ` [PATCH 18/19] xfs: add media error reporting ioctl Darrick J. Wong
  2025-10-23  0:05   ` [PATCH 19/19] xfs: send uevents when major filesystem events happen Darrick J. Wong
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:05 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.

(Note that xfs_healer doesn't maintain an open fd to the filesystem that
it's monitoring, so that it doesn't pin the mount.)
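Without this ioctl, a daemon can only approximate the check from userspace
by comparing st_dev of two open fds, which is weaker (and racy across
unmount/mount cycles) -- one motivation for doing it in the kernel against
the monitored superblock.  A minimal sketch of that userspace
approximation:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Userspace approximation of XFS_IOC_HEALTH_SAMEFS: two fds refer to
 * the same filesystem instance if they report the same st_dev.  The
 * ioctl is stronger, since the health monitor holds no fd of its own.
 * Returns 1 if same fs, 0 if not, -1 on error.
 */
static int fds_on_same_fs(int fd1, int fd2)
{
	struct stat st1, st2;

	if (fstat(fd1, &st1) || fstat(fd2, &st2))
		return -1;
	return st1.st_dev == st2.st_dev;
}
```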

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   10 ++++++++++
 fs/xfs/xfs_healthmon.c |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 87e915baa875d6..b5a00ef6ce5fb9 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1149,6 +1149,15 @@ struct xfs_health_monitor {
 /* Return events in JSON format */
 #define XFS_HEALTH_MONITOR_FMT_JSON	(1)
 
+/*
+ * Check that a given fd points to the same filesystem that the health monitor
+ * is monitoring.
+ */
+struct xfs_health_samefs {
+	__s32		fd;
+	__u32		flags;	/* zero for now */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1189,6 +1198,7 @@ struct xfs_health_monitor {
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
+#define XFS_IOC_HEALTH_SAMEFS	_IOW ('X', 69, struct xfs_health_samefs)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index ce84cd90df2379..666c27d73efbdc 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -1565,6 +1565,36 @@ xfs_healthmon_reconfigure(
 	return 0;
 }
 
+/* Does the fd point to the same filesystem as the one we're monitoring? */
+STATIC long
+xfs_healthmon_samefs(
+	struct file			*file,
+	unsigned int			cmd,
+	void __user			*arg)
+{
+	struct xfs_health_samefs	hms;
+	struct xfs_healthmon		*hm = file->private_data;
+	struct inode			*hms_inode;
+	int				ret = 0;
+
+	if (copy_from_user(&hms, arg, sizeof(hms)))
+		return -EFAULT;
+
+	if (hms.flags)
+		return -EINVAL;
+
+	CLASS(fd, hms_fd)(hms.fd);
+	if (fd_empty(hms_fd))
+		return -EBADF;
+
+	hms_inode = file_inode(fd_file(hms_fd));
+	mutex_lock(&hm->lock);
+	if (!hm->mp || hm->mp->m_super != hms_inode->i_sb)
+		ret = -ESTALE;
+	mutex_unlock(&hm->lock);
+	return ret;
+}
+
 /* Handle ioctls for the health monitoring thread. */
 STATIC long
 xfs_healthmon_ioctl(
@@ -1577,6 +1607,8 @@ xfs_healthmon_ioctl(
 	switch (cmd) {
 	case XFS_IOC_HEALTH_MONITOR:
 		return xfs_healthmon_reconfigure(file, cmd, arg);
+	case XFS_IOC_HEALTH_SAMEFS:
+		return xfs_healthmon_samefs(file, cmd, arg);
 	default:
 		break;
 	}



* [PATCH 18/19] xfs: add media error reporting ioctl
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (16 preceding siblings ...)
  2025-10-23  0:05   ` [PATCH 17/19] xfs: validate fds against running healthmon Darrick J. Wong
@ 2025-10-23  0:05   ` Darrick J. Wong
  2025-10-23  0:05   ` [PATCH 19/19] xfs: send uevents when major filesystem events happen Darrick J. Wong
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:05 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new privileged ioctl so that xfs_scrub can report media errors to
the kernel for further processing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h      |   16 +++++++++++++
 fs/xfs/xfs_notify_failure.h |    8 ++++++
 fs/xfs/xfs_trace.h          |    2 --
 fs/xfs/Makefile             |    6 +----
 fs/xfs/xfs_healthmon.c      |    2 --
 fs/xfs/xfs_ioctl.c          |    3 ++
 fs/xfs/xfs_notify_failure.c |   53 ++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 79 insertions(+), 11 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b5a00ef6ce5fb9..5d35d67b10e153 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1158,6 +1158,21 @@ struct xfs_health_samefs {
 	__u32		flags;	/* zero for now */
 };
 
+/* Report a media error */
+struct xfs_media_error {
+	__u64	flags;		/* flags */
+	__u64	daddr;		/* disk address of range */
+	__u64	bbcount;	/* length, in 512b blocks */
+	__u64	pad;		/* zero */
+};
+
+#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
+#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
+#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
+
+/* bottom byte of flags is the device code */
+#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1199,6 +1214,7 @@ struct xfs_health_samefs {
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 #define XFS_IOC_HEALTH_SAMEFS	_IOW ('X', 69, struct xfs_health_samefs)
+#define XFS_IOC_MEDIA_ERROR	_IOW ('X', 70, struct xfs_media_error)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 528317ff24320a..e9ee74aa540bff 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -6,7 +6,9 @@
 #ifndef __XFS_NOTIFY_FAILURE_H__
 #define __XFS_NOTIFY_FAILURE_H__
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 extern const struct dax_holder_operations xfs_dax_holder_operations;
+#endif
 
 enum xfs_failed_device {
 	XFS_FAILED_DATADEV,
@@ -14,7 +16,7 @@ enum xfs_failed_device {
 	XFS_FAILED_RTDEV,
 };
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+#if defined(CONFIG_XFS_LIVE_HOOKS)
 struct xfs_media_error_params {
 	struct xfs_mount		*mp;
 	enum xfs_failed_device		fdev;
@@ -46,4 +48,8 @@ struct xfs_media_error_hook { };
 # define xfs_media_error_hook_setup(...)	((void)0)
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+struct xfs_media_error;
+int xfs_ioc_media_error(struct xfs_mount *mp,
+		struct xfs_media_error __user *arg);
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b23f3c41db1c03..10b1ef735a7c9c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6214,7 +6214,6 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->lost_prev)
 );
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 TRACE_EVENT(xfs_healthmon_media_error_hook,
 	TP_PROTO(const struct xfs_media_error_params *p,
 		 unsigned int events, unsigned long long lost_prev),
@@ -6262,7 +6261,6 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
-#endif
 
 #define XFS_FILE_IOERROR_STRINGS \
 	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d4e9070a9326ba..2279cb0b874814 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -98,6 +98,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_notify_failure.o \
 				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
@@ -148,11 +149,6 @@ xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
 
-# notify failure
-ifeq ($(CONFIG_MEMORY_FAILURE),y)
-xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
-endif
-
 xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 666c27d73efbdc..3053b2da6b3109 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -527,7 +527,6 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 /* Add a media error event to the reporting queue. */
 STATIC int
 xfs_healthmon_media_error_hook(
@@ -578,7 +577,6 @@ xfs_healthmon_media_error_hook(
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
-#endif
 
 /* Add a file io error event to the reporting queue. */
 STATIC int
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 08998d84554f09..7a80a6ad4b2d99 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -42,6 +42,7 @@
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1424,6 +1425,8 @@ xfs_file_ioctl(
 
 	case XFS_IOC_HEALTH_MONITOR:
 		return xfs_ioc_health_monitor(mp, arg);
+	case XFS_IOC_MEDIA_ERROR:
+		return xfs_ioc_media_error(mp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 2098ff452a3b87..00120dd1ddefbd 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -91,9 +91,19 @@ xfs_media_error_hook_setup(
 	xfs_hook_setup(&hook->error_hook, mod_fn);
 }
 #else
-# define xfs_media_error_hook(...)		((void)0)
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	/* empty */
+}
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -458,3 +468,44 @@ xfs_dax_notify_failure(
 const struct dax_holder_operations xfs_dax_holder_operations = {
 	.notify_failure		= xfs_dax_notify_failure,
 };
+#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
+#define XFS_VALID_MEDIA_ERROR_FLAGS	(XFS_MEDIA_ERROR_DATADEV | \
+					 XFS_MEDIA_ERROR_LOGDEV | \
+					 XFS_MEDIA_ERROR_RTDEV)
+int
+xfs_ioc_media_error(
+	struct xfs_mount		*mp,
+	struct xfs_media_error __user	*arg)
+{
+	struct xfs_media_error		me;
+	enum xfs_failed_device		fdev;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&me, arg, sizeof(me)))
+		return -EFAULT;
+
+	if (me.pad)
+		return -EINVAL;
+	if (me.flags & ~XFS_VALID_MEDIA_ERROR_FLAGS)
+		return -EINVAL;
+
+	switch (me.flags & XFS_MEDIA_ERROR_DEVMASK) {
+	case XFS_MEDIA_ERROR_DATADEV:
+		fdev = XFS_FAILED_DATADEV;
+		break;
+	case XFS_MEDIA_ERROR_LOGDEV:
+		fdev = XFS_FAILED_LOGDEV;
+		break;
+	case XFS_MEDIA_ERROR_RTDEV:
+		fdev = XFS_FAILED_RTDEV;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	xfs_media_error_hook(mp, fdev, me.daddr, me.bbcount, false);
+	return 0;
+}
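
The flag validation and device-code decode above can be exercised
standalone.  The sketch below copies the UAPI constants added by this
patch and mirrors the decode in xfs_ioc_media_error(), returning the
device code where the kernel would proceed and -1 where it would return
-EINVAL:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the UAPI constants added by this patch. */
#define XFS_MEDIA_ERROR_DATADEV		(1)	/* data device */
#define XFS_MEDIA_ERROR_LOGDEV		(2)	/* external log device */
#define XFS_MEDIA_ERROR_RTDEV		(3)	/* realtime device */
#define XFS_MEDIA_ERROR_DEVMASK		(0xFF)

#define XFS_VALID_MEDIA_ERROR_FLAGS	(XFS_MEDIA_ERROR_DATADEV | \
					 XFS_MEDIA_ERROR_LOGDEV | \
					 XFS_MEDIA_ERROR_RTDEV)

/*
 * Standalone copy of the decode in xfs_ioc_media_error(): reject any
 * flag bits outside the valid set, then read the device code out of the
 * bottom byte.  Returns the device code, or -1 for invalid input.
 */
static int media_error_decode_dev(uint64_t flags)
{
	if (flags & ~(uint64_t)XFS_VALID_MEDIA_ERROR_FLAGS)
		return -1;

	switch (flags & XFS_MEDIA_ERROR_DEVMASK) {
	case XFS_MEDIA_ERROR_DATADEV:
	case XFS_MEDIA_ERROR_LOGDEV:
	case XFS_MEDIA_ERROR_RTDEV:
		return (int)(flags & XFS_MEDIA_ERROR_DEVMASK);
	default:
		return -1;
	}
}
```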



* [PATCH 19/19] xfs: send uevents when major filesystem events happen
  2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (17 preceding siblings ...)
  2025-10-23  0:05   ` [PATCH 18/19] xfs: add media error reporting ioctl Darrick J. Wong
@ 2025-10-23  0:05   ` Darrick J. Wong
  18 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-23  0:05 UTC (permalink / raw)
  To: cem, djwong; +Cc: linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Send uevents when we mount, unmount, and shut down the filesystem, so
that we can trigger systemd services when major events happen.
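
For instance, a udev rule along the following lines could start a healing
service on mount.  The rule is hypothetical: the match keys follow the env
strings added by this patch, but the service name and exact kobject path
are assumptions.

```
# Hypothetical rule: start xfs_healer when an XFS mount uevent arrives.
ACTION=="add", ENV{TYPE}=="mount", ENV{UUID}=="?*", \
    RUN+="/usr/bin/systemctl --no-block start xfs_healer@$env{UUID}.service"
```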

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_super.h |   13 +++++++
 fs/xfs/xfs_fsops.c |   18 ++++++++++
 fs/xfs/xfs_super.c |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 125 insertions(+)


diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index c0e85c1e42f27d..6d428bd04a0248 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -101,4 +101,17 @@ extern struct workqueue_struct *xfs_discard_wq;
 
 struct dentry *xfs_debugfs_mkdir(const char *name, struct dentry *parent);
 
+#define XFS_UEVENT_BUFLEN ( \
+	sizeof("SID=") + sizeof_field(struct super_block, s_id) + \
+	sizeof("UUID=") + UUID_STRING_LEN + \
+	sizeof("META_UUID=") + UUID_STRING_LEN)
+
+#define XFS_UEVENT_STR_PTRS \
+	NULL, /* sid */ \
+	NULL, /* uuid */ \
+	NULL /* metauuid */
+
+int xfs_format_uevent_strings(struct xfs_mount *mp, char *buf, ssize_t buflen,
+		char **env);
+
 #endif	/* __XFS_SUPER_H__ */
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 69918cd1ba1dbc..b3a01361318320 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -537,6 +537,23 @@ xfs_shutdown_hook_setup(
 # define xfs_shutdown_hook(...)		((void)0)
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+static void
+xfs_send_shutdown_uevent(
+	struct xfs_mount	*mp)
+{
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=shutdown",
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_OFFLINE, env);
+}
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -587,6 +604,7 @@ xfs_do_force_shutdown(
 	}
 
 	trace_xfs_force_shutdown(mp, tag, flags, fname, lnnum);
+	xfs_send_shutdown_uevent(mp);
 
 	xfs_alert_tag(mp, tag,
 "%s (0x%x) detected at %pS (%s:%d).  Shutting down filesystem.",
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b6a6027b4df8d8..5137f4cb8640b8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -53,6 +53,7 @@
 #include <linux/magic.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/uuid.h>
 
 static const struct super_operations xfs_super_operations;
 
@@ -1238,12 +1239,73 @@ xfs_inodegc_free_percpu(
 	free_percpu(mp->m_inodegc);
 }
 
+int
+xfs_format_uevent_strings(
+	struct xfs_mount	*mp,
+	char			*buf,
+	ssize_t			buflen,
+	char			**env)
+{
+	ssize_t			written;
+
+	ASSERT(buflen >= XFS_UEVENT_BUFLEN);
+
+	written = snprintf(buf, buflen, "SID=%s", mp->m_super->s_id);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	written = snprintf(buf, buflen, "UUID=%pU", &mp->m_sb.sb_uuid);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	written = snprintf(buf, buflen, "META_UUID=%pU",
+			&mp->m_sb.sb_meta_uuid);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	ASSERT(buflen >= 0);
+	return 0;
+}
+
+static void
+xfs_send_unmount_uevent(
+	struct xfs_mount	*mp)
+{
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=unmount",
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	xfs_send_unmount_uevent(mp);
 	xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1661,6 +1723,37 @@ xfs_debugfs_mkdir(
 	return child;
 }
 
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+	struct fs_context	*fc,
+	struct xfs_mount	*mp)
+{
+	char			*source;
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=mount",
+		NULL, /* source */
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	source = kasprintf(GFP_KERNEL, "SOURCE=%s", fc->source);
+	if (!source)
+		return;
+	env[1] = source;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[2]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+
+	kfree(source);
+}
+
 static int
 xfs_fs_fill_super(
 	struct super_block	*sb,
@@ -1974,6 +2067,7 @@ xfs_fs_fill_super(
 		mp->m_debugfs_uuid = NULL;
 	}
 
+	xfs_send_mount_uevent(fc, mp);
 	return 0;
 
  out_filestream_unmount:


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation
  2025-10-23  0:00   ` [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
@ 2025-10-24  5:40     ` Christoph Hellwig
  2025-10-27 16:15       ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2025-10-24  5:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, linux-fsdevel, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

Maybe expedite this for 6.18-rc?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation
  2025-10-24  5:40     ` Christoph Hellwig
@ 2025-10-27 16:15       ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-27 16:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cem, linux-fsdevel, linux-xfs

On Thu, Oct 23, 2025 at 10:40:24PM -0700, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> Maybe expedite this for 6.18-rc?

Ok.  I guess removing obsolete links is a bug fix :)

--D

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc
  2025-10-23  0:01   ` [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
@ 2025-10-30 16:38     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-30 16:38 UTC (permalink / raw)
  To: cem; +Cc: linux-fsdevel, linux-xfs

On Wed, Oct 22, 2025 at 05:01:07PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Update the XFS online repair document to describe the motivation and
> design of the autonomous filesystem healing agent known as xfs_healer.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

/me decides (or rather it was pointed out to me) that there's a kernel
component to xfs_healer, but no explicit discussion of it in section 5
("Kernel Algorithms and Data Structures").  Also given the frequency of
the question "why not reuse fsnotify?" I'll address the reasons for that
here.

I've added the following text, which will appear in the next revision:

 5. Kernel Algorithms and Data Structures
 ========================================

<snip>

+Health Monitoring
+-----------------
+
+A self-correcting filesystem responds to observations of problems by scheduling
+repairs of the affected areas.
+The filesystem must therefore create event objects in response to stimuli
+(metadata corruption, file I/O errors, etc.) and dispatch these events to
+downstream consumers.
+Downstream consumers that are in the kernel itself are easy to implement with
+the ``xfs_hooks`` infrastructure created for other parts of online repair; these
+are basically indirect function calls.
+
+However, the decision to translate an adverse metadata health report into a
+repair should be made by userspace, and the actual scheduling done by userspace.
+Some users (e.g. containers) would prefer to fast-fail the container and restart
+it on another node at a previous checkpoint.
+For workloads running in isolation, repairs may be preferable; either way this
+is something the system administrator knows, and not the kernel.
+A userspace agent (``xfs_healer``, described later) will collect events from the
+kernel and dispatch them appropriately.
+
+Exporting health events to userspace requires the creation of a new component,
+known as the health monitor.
+Because the monitor exposes itself to userspace to deliver information, a file
+descriptor is the natural abstraction to use here.
+The health monitor hooks all the relevant sources of metadata health events.
+Upon activation of the hook, a new event object is created and added to a queue.
+When the agent reads from the fd, event objects are pulled from the start of the
+queue and formatted into the user's buffer.
+The events are freed, and the read call returns to userspace to allow the agent
+to perform some work.
+Memory usage is constrained on a per-fd basis to prevent memory exhaustion; if
+an event must be discarded, a special "lost event" event is delivered to the
+agent.
+
+In short, health events are captured, queued, and eventually copied out to
+userspace for dispatching.
+
+**Question**: Why use a pseudofile and not use existing notification methods?
+
+*Answer*: The pseudofile is a private filesystem interface only available to
+processes with the CAP_SYS_ADMIN privilege and the ability to open the root
+directory.
+Being private gives the kernel and ``xfs_healer`` the flexibility to change
+or update the event format in the future without worrying about backwards
+compatibility.
+Using existing notifications means that the event format would be frozen in
+the public fsnotify UAPI forever, which would affect two subsystems.
+
+The pseudofile can also accept ioctls, which gives ``xfs_healer`` a solid
+means to validate that prior to a repair, its reopened mountpoint is actually
+the same filesystem that is being monitored.
+
+**Question**: Why not reuse fs/notify?
+
+*Answer*: It's much simpler for the healthmon code to manage its own queue of
+events and to wake up readers instead of reusing fsnotify because that's the
+only part of fsnotify that healthmon would use.
+
+Before I get started, an introduction: fsnotify expects its users (e.g.
+fanotify) to implement quite a bit of functionality; all it provides is a
+wrapper around a simple queue and a lot of code to convey information about the
+calling process to that user.
+fanotify has to actually implement all the queue management code on its own,
+and so would healthmon.
+
+So if healthmon used fsnotify, it would have to create its own fsnotify group
+structure.
+For our purposes, the group is a very large wrapper around a linked list, some
+counters, and a mutex.
+The group object is critical for ensuring that healthmon sees only its own
+events, and that nobody else (e.g. regular fanotify) ever sees them.
+There's a lot more in there for controlling whether fanotify reports pids,
+groups, file handles, etc. that healthmon doesn't care about.
+
+Starting from the fsnotify() function call:
+
+ - I /think/ we'd have to define a new "data type", which itself is just a plain
+   int but I think they correspond to FSNOTIFY_EVENT_* values which themselves
+   are actually part of an enum.
+   The data type controls the typecasting options for the ``void *data``
+   parameter, which I guess is how I'd pass the healthmon event info from the
+   hooks into the fsnotify mechanism and back out to the healthmon code.
+
+ - Each filesystem that wants to do this probably has to add their own
+   FSNOTIFY_EVENT_{XFS,BTRFS,BFS} data type value because that's a casting
+   decision that's made inside the main fsnotify code.
+   I think this can be avoided if each fs is careful never to leak events
+   outside of the group.
+   Either way, it's harder to follow the data flows here because fsnotify can
+   only take and pass around ``void *`` pointers, and it makes various indirect
+   function calls to manage events.
+   Contrast this with doing everything with typed pointers and direct calls
+   within ``xfs_healthmon.c``.
+
+ - Since healthmon is both producer and consumer of fsnotify events, we can
+   probably define our own "mask" value.
+   It's a relief that we don't have to interact with fanotify, because fanotify
+   has used up 22 of its 32 mask bits.
+
+Once healthmon gets an event into fsnotify, fsnotify will call back (into
+healthmon!) to tell it that it got an event.
+From there, the fsnotify implementation (healthmon) has to allocate an event
+object and add it to the event queue in the group, which is what it already does
+now.
+Overflow control is up to the fsnotify implementation, which healthmon already
+implements.
+
+After the event is queued, the fsnotify implementation also has to implement its
+own read file op to dequeue an event and copy it to the userspace buffer in
+whatever format it likes.
+Again, healthmon already does all this.
+
+In the end, replacing the homegrown event dispatching in healthmon with fsnotify
+would make the data flows much harder to understand, and all we gain is a
+generic event dispatcher that relies on indirect function calls instead of
+direct ones.
+We still have to implement the queuing discipline ourselves! :(
+
+**Future Work Question**: Should these events be exposed through the fanotify
+filesystem error event interface?
+
+*Answer*: Yes.
+fanotify is much more careful about filtering out events to processes that
+aren't running with privileges.
+These processes should have a means to receive simple notifications about
+file errors.
+However, this will require coordination between fanotify, ext4, and XFS, and
+is (for now) outside the scope of this project.

--D

> ---
>  .../filesystems/xfs/xfs-online-fsck-design.rst     |  102 ++++++++++++++++++++
>  1 file changed, 100 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
> index 189d1f5f40788d..bdbf338a9c9f0c 100644
> --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
> +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
> @@ -166,9 +166,12 @@ The current XFS tools leave several problems unsolved:
>     malicious actors **exploit quirks of Unicode** to place misleading names
>     in directories.
>  
> +8. **Site Reliability and Support Engineers** would like to reduce the
> +   frequency of incidents requiring **manual intervention**.
> +
>  Given this definition of the problems to be solved and the actors who would
>  benefit, the proposed solution is a third fsck tool that acts on a running
> -filesystem.
> +filesystem, and an autonomous agent that fixes problems as they arise.
>  
>  This new third program has three components: an in-kernel facility to check
>  metadata, an in-kernel facility to repair metadata, and a userspace driver
> @@ -203,6 +206,13 @@ Even if a piece of filesystem metadata can only be regenerated by scanning the
>  entire system, the scan can still be done in the background while other file
>  operations continue.
>  
> +The autonomous self healing agent should listen for metadata health impact
> +reports coming from the kernel and automatically schedule repairs for the
> +damaged metadata.
> +If the required repairs are larger in scope than a single metadata structure,
> +``xfs_scrub`` should be invoked to perform a full analysis.
> +``xfs_healer`` is the name of this program.
> +
>  In summary, online fsck takes advantage of resource sharding and redundant
>  metadata to enable targeted checking and repair operations while the system
>  is running.
> @@ -850,11 +860,16 @@ variable in the following service files:
>  * ``xfs_scrub_all_fail.service``
>  
>  The decision to enable the background scan is left to the system administrator.
> -This can be done by enabling either of the following services:
> +This can be done system-wide by enabling either of the following services:
>  
>  * ``xfs_scrub_all.timer`` on systemd systems
>  * ``xfs_scrub_all.cron`` on non-systemd systems
>  
> +To enable online repair for specific filesystems, the ``autofsck``
> +filesystem property should be set to ``repair``.
> +To enable only scanning, the property should be set to ``check``.
> +To disable online fsck entirely, the property should be set to ``none``.
> +
>  This automatic weekly scan is configured out of the box to perform an
>  additional media scan of all file data once per month.
>  This is less foolproof than, say, storing file data block checksums, but much
> @@ -897,6 +912,36 @@ notifications and initiate a repair?
>  *Answer*: These questions remain unanswered, but should be a part of the
>  conversation with early adopters and potential downstream users of XFS.
>  
> +Autonomous Self Healing
> +-----------------------
> +
> +The autonomous self healing agent is a background system service that starts
> +when the filesystem is mounted and runs until unmount.
> +When starting up, the agent opens a special pseudofile under the specific
> +mount.
> +When the filesystem generates new adverse health events, the events will be
> +made available for reading via the special pseudofile.
> +The events need not be limited to metadata concerns; they can also reflect
> +events outside of the filesystem's direct control such as file I/O errors.
> +
> +The agent reads these events in a loop and responds to the events
> +appropriately.
> +For a single trouble report about metadata, the agent initiates a targeted
> +repair of the specific structure.
> +If that repair fails or the agent observes too many metadata trouble reports
> +over a short interval, it should then initiate a full scan of the filesystem
> +via the ``xfs_scrub`` service.
> +
> +The decision to enable the background scan is left to the system administrator.
> +This can be done system-wide by enabling the following service:
> +
> +* ``xfs_healer@.service`` on systemd systems
> +
> +To enable autonomous healing for specific filesystems, the ``autofsck``
> +filesystem property should be set to ``repair``.
> +To disable self healing, the property should be set to ``check``,
> +``optimize``, or ``none``.
> +
>  5. Kernel Algorithms and Data Structures
>  ========================================
>  
> @@ -5071,6 +5116,59 @@ and report what has been lost.
>  For media errors in blocks owned by files, parent pointers can be used to
>  construct file paths from inode numbers for user-friendly reporting.
>  
> +Autonomous Self Healing
> +-----------------------
> +
> +When a filesystem mounts, the Linux kernel initiates a uevent describing the
> +mount and the path to the data device.
> +A udev rule determines the initial mountpoint from the data device path
> +and starts a mount-specific ``xfs_healer`` service instance.
> +The ``xfs_healer`` service opens the mountpoint and issues the
> +XFS_IOC_HEALTH_MONITOR ioctl to open a special health monitoring file.
> +After that is set up, the mountpoint is closed to avoid pinning the mount.
> +
> +The health monitoring file hooks certain points of the filesystem so that it
> +may receive events about metadata health, filesystem shutdowns, media errors,
> +file I/O errors, and unmounting of the filesystem.
> +Events are queued up for each health monitor file and encoded into a
> +``struct xfs_health_monitor_event`` object when the agent calls ``read()`` on
> +the file.
> +All health events are dispatched to a background threadpool to reduce stalls
> +in the main event loop.
> +Events can be logged into the system log for further analysis.
> +
> +For metadata health events, the specific details are used to construct a call
> +to the scrub ioctl.
> +The filesystem mountpoint is reopened, and the kernel is called.
> +If events are lost or the repairs fail, a full scan will be initiated by
> +starting up an ``xfs_scrub@.service`` for the given mountpoint.
> +
> +A filesystem shutdown causes all future repair work to cease, and an unmount
> +causes the agent to exit.
> +
> +**Question**: Why use a pseudofile and not use existing notification methods?
> +
> +*Answer*: The pseudofile is a private filesystem interface only available to
> +processes with the CAP_SYS_ADMIN privilege.
> +Being private gives the kernel and ``xfs_healer`` the flexibility to change
> +or update the event format in the future without worrying about backwards
> +compatibility.
> +Using existing notifications means that the event format would be frozen in
> +public UAPI forever.
> +
> +The pseudofile can also accept ioctls, which gives ``xfs_healer`` a solid
> +means to validate that prior to a repair, its reopened mountpoint is actually
> +the same filesystem that is being monitored.
> +
> +**Future Work Question**: Should the healer daemon also register a dbus
> +listener and publish events there?
> +
> +*Answer*: This is unclear -- if there's a demand for system monitoring daemons
> +to consume this information and make decisions, then yes, this could be wired
> +up in ``xfs_healer``.
> +On the other hand, systemd is in the middle of a transition to varlink, so
> +it makes more sense to wait and see what happens.
> +
>  7. Conclusion and Future Work
>  =============================
>  
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure
  2025-10-23  0:03   ` [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2025-10-30 16:54     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2025-10-30 16:54 UTC (permalink / raw)
  To: cem; +Cc: linux-fsdevel, linux-xfs

On Wed, Oct 22, 2025 at 05:03:27PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create the basic infrastructure that we need to report health events to
> userspace.  We need a compact form for recording critical information
> about an event and queueing them; a means to notice that we've lost some
> events; and a means to format the events into something that userspace
> can handle.
> 
> Here, we've chosen json to export information to userspace.  The
> structured key-value nature of json gives us enormous flexibility to
> modify the schema of what we'll send to userspace because we can add new
> keys at any time.  Userspace can use whatever json parsers are available
> to consume the events and will not be confused by keys they don't
> recognize.

Self-review: originally when I started designing this new subsystem, I
wanted to explore data exchange formats that are more flexible and
easier for humans to read than C structures.  The thought being that
when we want to rev (or worse, enlarge) the event format, it ought to be
trivially easy to do that in a way that doesn't break old userspace.

I looked at formats such as protobufs and capnproto.  These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc.  The huge downside is that neither support C
all that well.

Too hard, and didn't want to port either of those huge sprawling
libraries first to the kernel and then again to xfsprogs.  Then I
thought, how about JSON?  Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.

There's a proposed schema format for json, which means that xfs can
publish a description of the events that kernel will emit.  Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.

However, json has a huge weakness -- javascript is well known for its
vague definition of what a number is.  This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants.  Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision.  Not good when you're dealing with
discrete units.

It just so happens that python's json library parses a sequence of digits
into an arbitrary-precision integer, so a full u64 survives the round trip,
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so careful.
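To make the precision point concrete, here's a quick python sketch; nothing
XFS-specific, the big number is just a stand-in for something like an inode
number or byte offset:

```python
import json

big = 2**63 - 1  # largest s64 value

# Python's json module round-trips this exactly, because python ints
# are arbitrary precision.
assert json.loads(json.dumps(big)) == big

# A runtime that stores every number as an IEEE 754 double (i.e. a plain
# javascript engine) has only 53 bits of mantissa, so the value gets
# rounded.  float() models that storage:
assert float(big) != big
assert int(float(big)) == 2**63  # rounded up to the nearest representable
```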

It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs.  The parsing in userspace was also noticeably slow and
memory-consumptive.

As a result, I'm dropping all the json stuff from the codebase and
leaving only the C structure event format.  Since this is a mostly
private interface, we can always rev the format in the traditional ways
if we ever have to; there are 254 remaining unused format values.
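As an aside, the C struct format is easy enough to consume from scripting
languages too.  Here's a hedged python sketch of decoding one event record,
assuming the struct xfs_health_monitor_event layout from this patch (little
endian, natural alignment: u32 domain, u32 type, u64 time_ns, an 8-byte
union, and u64 pad[2], 40 bytes total).  The constants mirror the ones in
xfs_fs.h; everything else is illustrative, not a supported interface:

```python
import struct

# Assumed wire layout of struct xfs_health_monitor_event:
# u32 domain, u32 type, u64 time_ns, 8-byte union (lost.count), u64 pad[2].
EVENT_FMT = "<IIQQ2Q"
EVENT_SIZE = struct.calcsize(EVENT_FMT)  # 40 bytes

# Constants from xfs_fs.h in this patch.
XFS_HEALTH_MONITOR_DOMAIN_MOUNT = 0
XFS_HEALTH_MONITOR_TYPE_RUNNING = 0
XFS_HEALTH_MONITOR_TYPE_LOST = 1

def parse_event(buf: bytes) -> dict:
    """Decode one fixed-size event record into a dict."""
    domain, etype, time_ns, union0, _pad0, _pad1 = struct.unpack(EVENT_FMT, buf)
    event = {"domain": domain, "type": etype, "time_ns": time_ns}
    if etype == XFS_HEALTH_MONITOR_TYPE_LOST:
        # The union currently carries only the lost-event counter.
        event["lost_count"] = union0
    return event
```

An agent would read() the monitor fd in EVENT_SIZE multiples and feed each
40-byte slice through parse_event().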

I wanted to document the outcome of this experiment for posterity in a
public place, and now I have done so.

--D

> Note that we do NOT allow sending json back to the kernel, nor is there
> any intent to do that.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_fs.h                  |   50 ++
>  fs/xfs/xfs_healthmon.h                  |   29 +
>  fs/xfs/xfs_linux.h                      |    3 
>  fs/xfs/xfs_trace.h                      |  171 +++++++
>  fs/xfs/libxfs/xfs_healthmon.schema.json |  129 +++++
>  fs/xfs/xfs_healthmon.c                  |  728 +++++++++++++++++++++++++++++++
>  fs/xfs/xfs_trace.c                      |    2 
>  lib/seq_buf.c                           |    1 
>  8 files changed, 1106 insertions(+), 7 deletions(-)
>  create mode 100644 fs/xfs/libxfs/xfs_healthmon.schema.json
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index dba7896f716092..4b642eea18b5ca 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -1003,6 +1003,45 @@ struct xfs_rtgroup_geometry {
>  #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
>  #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
>  
> +/* Health monitor event domains */
> +
> +/* affects the whole fs */
> +#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT		(0)
> +
> +/* Health monitor event types */
> +
> +/* status of the monitor itself */
> +#define XFS_HEALTH_MONITOR_TYPE_RUNNING		(0)
> +#define XFS_HEALTH_MONITOR_TYPE_LOST		(1)
> +
> +/* lost events */
> +struct xfs_health_monitor_lost {
> +	__u64	count;
> +};
> +
> +struct xfs_health_monitor_event {
> +	/* XFS_HEALTH_MONITOR_DOMAIN_* */
> +	__u32	domain;
> +
> +	/* XFS_HEALTH_MONITOR_TYPE_* */
> +	__u32	type;
> +
> +	/* Timestamp of the event, in nanoseconds since the Unix epoch */
> +	__u64	time_ns;
> +
> +	/*
> +	 * Details of the event.  The primary clients are written in python
> +	 * and rust, so break this up because bindgen hates anonymous structs
> +	 * and unions.
> +	 */
> +	union {
> +		struct xfs_health_monitor_lost lost;
> +	} e;
> +
> +	/* zeroes */
> +	__u64	pad[2];
> +};
> +
>  struct xfs_health_monitor {
>  	__u64	flags;		/* flags */
>  	__u8	format;		/* output format */
> @@ -1010,6 +1049,17 @@ struct xfs_health_monitor {
>  	__u64	pad2[2];	/* zeroes */
>  };
>  
> +/* Return all health status events, not just deltas */
> +#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
> +
> +#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
> +
> +/* Return events in a C structure */
> +#define XFS_HEALTH_MONITOR_FMT_CSTRUCT	(0)
> +
> +/* Return events in JSON format */
> +#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
> +
>  /*
>   * ioctl commands that are used by Linux filesystems
>   */
> diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
> index 07126e39281a0c..ea2d6a327dfb16 100644
> --- a/fs/xfs/xfs_healthmon.h
> +++ b/fs/xfs/xfs_healthmon.h
> @@ -6,6 +6,35 @@
>  #ifndef __XFS_HEALTHMON_H__
>  #define __XFS_HEALTHMON_H__
>  
> +enum xfs_healthmon_type {
> +	XFS_HEALTHMON_RUNNING,	/* monitor running */
> +	XFS_HEALTHMON_LOST,	/* message lost */
> +};
> +
> +enum xfs_healthmon_domain {
> +	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
> +};
> +
> +struct xfs_healthmon_event {
> +	struct xfs_healthmon_event	*next;
> +
> +	enum xfs_healthmon_type		type;
> +	enum xfs_healthmon_domain	domain;
> +
> +	uint64_t			time_ns;
> +
> +	union {
> +		/* lost events */
> +		struct {
> +			uint64_t	lostcount;
> +		};
> +		/* mount */
> +		struct {
> +			unsigned int	flags;
> +		};
> +	};
> +};
> +
>  #ifdef CONFIG_XFS_HEALTH_MONITOR
>  long xfs_ioc_health_monitor(struct xfs_mount *mp,
>  		struct xfs_health_monitor __user *arg);
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 4dd747bdbccab2..e122db938cc06b 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -63,6 +63,9 @@ typedef __u32			xfs_nlink_t;
>  #include <linux/xattr.h>
>  #include <linux/mnt_idmapping.h>
>  #include <linux/debugfs.h>
> +#ifdef CONFIG_XFS_HEALTH_MONITOR
> +# include <linux/seq_buf.h>
> +#endif
>  
>  #include <asm/page.h>
>  #include <asm/div64.h>
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 79b8641880ab9d..17af5efee026c9 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -103,6 +103,8 @@ struct xfs_refcount_intent;
>  struct xfs_metadir_update;
>  struct xfs_rtgroup;
>  struct xfs_open_zone;
> +struct xfs_healthmon_event;
> +struct xfs_health_update_params;
>  
>  #define XFS_ATTR_FILTER_FLAGS \
>  	{ XFS_ATTR_ROOT,	"ROOT" }, \
> @@ -5908,6 +5910,175 @@ DEFINE_EVENT(xfs_freeblocks_resv_class, name, \
>  DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_reserved);
>  DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_enospc);
>  
> +#ifdef CONFIG_XFS_HEALTH_MONITOR
> +TRACE_EVENT(xfs_healthmon_lost_event,
> +	TP_PROTO(const struct xfs_mount *mp, unsigned long long lost_prev),
> +	TP_ARGS(mp, lost_prev),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned long long, lost_prev)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->lost_prev = lost_prev;
> +	),
> +	TP_printk("dev %d:%d lost_prev %llu",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->lost_prev)
> +);
> +
> +#define XFS_HEALTHMON_FLAGS_STRINGS \
> +	{ XFS_HEALTH_MONITOR_VERBOSE,	"verbose" }
> +#define XFS_HEALTHMON_FMT_STRINGS \
> +	{ XFS_HEALTH_MONITOR_FMT_JSON,	"json" }, \
> +	{ XFS_HEALTH_MONITOR_FMT_CSTRUCT,	"cstruct" }
> +
> +TRACE_EVENT(xfs_healthmon_create,
> +	TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format),
> +	TP_ARGS(mp, flags, format),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(u64, flags)
> +		__field(u8, format)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->flags = flags;
> +		__entry->format = format;
> +	),
> +	TP_printk("dev %d:%d flags %s format %s",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS),
> +		  __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS))
> +);
> +
> +TRACE_EVENT(xfs_healthmon_copybuf,
> +	TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov,
> +		 const struct seq_buf *seqbuf, size_t outpos),
> +	TP_ARGS(mp, iov, seqbuf, outpos),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(size_t, seqbuf_size)
> +		__field(size_t, seqbuf_len)
> +		__field(size_t, outpos)
> +		__field(size_t, to_copy)
> +		__field(size_t, iter_count)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->seqbuf_size = seqbuf->size;
> +		__entry->seqbuf_len = seqbuf->len;
> +		__entry->outpos = outpos;
> +		__entry->to_copy = seqbuf->len - outpos;
> +		__entry->iter_count = iov_iter_count(iov);
> +	),
> +	TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->seqbuf_size,
> +		  __entry->seqbuf_len,
> +		  __entry->outpos,
> +		  __entry->to_copy,
> +		  __entry->iter_count)
> +);
> +
> +DECLARE_EVENT_CLASS(xfs_healthmon_class,
> +	TP_PROTO(const struct xfs_mount *mp, unsigned int events,
> +		 unsigned long long lost_prev),
> +	TP_ARGS(mp, events, lost_prev),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, events)
> +		__field(unsigned long long, lost_prev)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->events = events;
> +		__entry->lost_prev = lost_prev;
> +	),
> +	TP_printk("dev %d:%d events %u lost_prev? %llu",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->events,
> +		  __entry->lost_prev)
> +);
> +#define DEFINE_HEALTHMON_EVENT(name) \
> +DEFINE_EVENT(xfs_healthmon_class, name, \
> +	TP_PROTO(const struct xfs_mount *mp, unsigned int events, \
> +		 unsigned long long lost_prev), \
> +	TP_ARGS(mp, events, lost_prev))
> +DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start);
> +DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
> +DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
> +DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
> +
> +#define XFS_HEALTHMON_TYPE_STRINGS \
> +	{ XFS_HEALTHMON_LOST,		"lost" }
> +
> +#define XFS_HEALTHMON_DOMAIN_STRINGS \
> +	{ XFS_HEALTHMON_MOUNT,		"mount" }
> +
> +TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
> +
> +TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
> +
> +DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
> +	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
> +	TP_ARGS(mp, event),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, type)
> +		__field(unsigned int, domain)
> +		__field(unsigned int, mask)
> +		__field(unsigned long long, ino)
> +		__field(unsigned int, gen)
> +		__field(unsigned int, group)
> +		__field(unsigned long long, offset)
> +		__field(unsigned long long, length)
> +		__field(unsigned long long, lostcount)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp ? mp->m_super->s_dev : 0;
> +		__entry->type = event->type;
> +		__entry->domain = event->domain;
> +		__entry->mask = 0;
> +		__entry->group = 0;
> +		__entry->ino = 0;
> +		__entry->gen = 0;
> +		__entry->offset = 0;
> +		__entry->length = 0;
> +		__entry->lostcount = 0;
> +		switch (__entry->domain) {
> +		case XFS_HEALTHMON_MOUNT:
> +			switch (__entry->type) {
> +			case XFS_HEALTHMON_LOST:
> +				__entry->lostcount = event->lostcount;
> +				break;
> +			}
> +			break;
> +		}
> +	),
> +	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
> +		  __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
> +		  __entry->mask,
> +		  __entry->ino,
> +		  __entry->gen,
> +		  __entry->offset,
> +		  __entry->length,
> +		  __entry->group,
> +		  __entry->lostcount)
> +);
> +#define DEFINE_HEALTHMONEVENT_EVENT(name) \
> +DEFINE_EVENT(xfs_healthmon_event_class, name, \
> +	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
> +	TP_ARGS(mp, event))
> +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
> +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
> +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
> +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
> +DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
> +#endif /* CONFIG_XFS_HEALTH_MONITOR */
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> diff --git a/fs/xfs/libxfs/xfs_healthmon.schema.json b/fs/xfs/libxfs/xfs_healthmon.schema.json
> new file mode 100644
> index 00000000000000..68762738b04191
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_healthmon.schema.json
> @@ -0,0 +1,129 @@
> +{
> +	"$comment": [
> +		"SPDX-License-Identifier: GPL-2.0-or-later",
> +		"Copyright (c) 2024-2025 Oracle.  All Rights Reserved.",
> +		"Author: Darrick J. Wong <djwong@kernel.org>",
> +		"",
> +		"This schema file describes the format of the json objects",
> +		"readable from the fd returned by the XFS_IOC_HEALTHMON",
> +		"ioctl."
> +	],
> +
> +	"$schema": "https://json-schema.org/draft/2020-12/schema",
> +	"$id": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/fs/xfs/libxfs/xfs_healthmon.schema.json",
> +
> +	"title": "XFS Health Monitoring Events",
> +
> +	"$comment": "Events must be one of the following types:",
> +	"oneOf": [
> +		{
> +			"$ref": "#/$events/running"
> +		},
> +		{
> +			"$ref": "#/$events/unmount"
> +		},
> +		{
> +			"$ref": "#/$events/lost"
> +		}
> +	],
> +
> +	"$comment": "Simple data types are defined here.",
> +	"$defs": {
> +		"time_ns": {
> +			"title": "Time of Event",
> +			"description": "Timestamp of the event, in nanoseconds since the Unix epoch.",
> +			"type": "integer"
> +		},
> +		"count": {
> +			"title": "Count of events",
> +			"description": "Number of events.",
> +			"type": "integer",
> +			"minimum": 1
> +		}
> +	},
> +
> +	"$comment": "Event types are defined here.",
> +	"$events": {
> +		"running": {
> +			"title": "Health Monitoring Running",
> +			"$comment": [
> +				"The health monitor is actually running."
> +			],
> +			"type": "object",
> +
> +			"properties": {
> +				"type": {
> +					"const": "running"
> +				},
> +				"time_ns": {
> +					"$ref": "#/$defs/time_ns"
> +				},
> +				"domain": {
> +					"const": "mount"
> +				}
> +			},
> +
> +			"required": [
> +				"type",
> +				"time_ns",
> +				"domain"
> +			]
> +		},
> +		"unmount": {
> +			"title": "Filesystem Unmounted",
> +			"$comment": [
> +				"The filesystem was unmounted."
> +			],
> +			"type": "object",
> +
> +			"properties": {
> +				"type": {
> +					"const": "unmount"
> +				},
> +				"time_ns": {
> +					"$ref": "#/$defs/time_ns"
> +				},
> +				"domain": {
> +					"const": "mount"
> +				}
> +			},
> +
> +			"required": [
> +				"type",
> +				"time_ns",
> +				"domain"
> +			]
> +		},
> +		"lost": {
> +			"title": "Health Monitoring Events Lost",
> +			"$comment": [
> +				"Previous health monitoring events were",
> +				"dropped due to memory allocation failures",
> +				"or queue limits."
> +			],
> +			"type": "object",
> +
> +			"properties": {
> +				"type": {
> +					"const": "lost"
> +				},
> +				"count": {
> +					"$ref": "#/$defs/count"
> +				},
> +				"time_ns": {
> +					"$ref": "#/$defs/time_ns"
> +				},
> +				"domain": {
> +					"const": "mount"
> +				}
> +			},
> +
> +			"required": [
> +				"type",
> +				"count",
> +				"time_ns",
> +				"domain"
> +			]
> +		}
> +	}
> +}
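For reference, here is what a "lost" event conforming to the schema above might look like on the wire (the field values are illustrative, not taken from a real run), together with a stdlib-only check of the keys the schema marks as required:

```python
import json

# Illustrative "lost" event; only the shape follows the schema above,
# the values are made up.
sample = """
{
  "domain":     "mount",
  "type":       "lost",
  "count":      3,
  "time_ns":    1729641600000000000
}
"""

event = json.loads(sample)

# The schema requires these keys for a "lost" event.
required = {"type", "count", "time_ns", "domain"}
assert required <= event.keys()
assert event["type"] == "lost" and event["domain"] == "mount"
assert event["count"] >= 1  # schema: "minimum": 1
```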
> diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
> index 7b0d9f78b0a402..d5ca6ef8015c0e 100644
> --- a/fs/xfs/xfs_healthmon.c
> +++ b/fs/xfs/xfs_healthmon.c
> @@ -40,12 +40,558 @@
>   * so that the queueing and processing of the events do not pin the mount and
>   * cannot slow down the main filesystem.  The healthmon object can exist past
>   * the end of the filesystem mount.
> + *
> + * Please see the xfs_healthmon.schema.json file for a description of the
> + * format of the json events that are conveyed to userspace.
>   */
>  
> +/* Allow this many events to build up in memory per healthmon fd. */
> +#define XFS_HEALTHMON_MAX_EVENTS \
> +		(32768 / sizeof(struct xfs_healthmon_event))
> +
> +struct flag_string {
> +	unsigned int	mask;
> +	const char	*str;
> +};
> +
>  struct xfs_healthmon {
> +	/* lock for mp and eventlist */
> +	struct mutex			lock;
> +
> +	/* waiter for signalling the arrival of events */
> +	struct wait_queue_head		wait;
> +
> +	/* list of event objects */
> +	struct xfs_healthmon_event	*first_event;
> +	struct xfs_healthmon_event	*last_event;
> +
>  	struct xfs_mount		*mp;
> +
> +	/* number of events */
> +	unsigned int			events;
> +
> +	/*
> +	 * Buffer for formatting events.  New buffer data are appended to the
> +	 * end of the seqbuf, and outpos is used to determine where to start
> +	 * a copy_iter.  Both are protected by inode_lock.
> +	 */
> +	struct seq_buf			outbuf;
> +	size_t				outpos;
> +
> +	/* XFS_HEALTH_MONITOR_FMT_* */
> +	uint8_t				format;
> +
> +	/* do we want all events? */
> +	bool				verbose;
> +
> +	/* did we lose previous events? */
> +	unsigned long long		lost_prev_event;
> +
> +	/* total counts of events observed and lost events */
> +	unsigned long long		total_events;
> +	unsigned long long		total_lost;
>  };
>  
> +static inline void xfs_healthmon_bump_events(struct xfs_healthmon *hm)
> +{
> +	hm->events++;
> +	hm->total_events++;
> +}
> +
> +static inline void xfs_healthmon_bump_lost(struct xfs_healthmon *hm)
> +{
> +	hm->lost_prev_event++;
> +	hm->total_lost++;
> +}
> +
> +/* Remove an event from the head of the list. */
> +static inline int
> +xfs_healthmon_free_head(
> +	struct xfs_healthmon		*hm,
> +	struct xfs_healthmon_event	*event)
> +{
> +	struct xfs_healthmon_event	*head;
> +
> +	mutex_lock(&hm->lock);
> +	head = hm->first_event;
> +	if (head != event) {
> +		ASSERT(hm->first_event == event);
> +		mutex_unlock(&hm->lock);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	if (hm->last_event == head)
> +		hm->last_event = NULL;
> +	hm->first_event = head->next;
> +	hm->events--;
> +	mutex_unlock(&hm->lock);
> +
> +	trace_xfs_healthmon_pop(hm->mp, head);
> +	kfree(event);
> +	return 0;
> +}
> +
> +/* Push an event onto the end of the list. */
> +static inline void
> +__xfs_healthmon_push(
> +	struct xfs_healthmon		*hm,
> +	struct xfs_healthmon_event	*event)
> +{
> +	if (!hm->first_event)
> +		hm->first_event = event;
> +	if (hm->last_event)
> +		hm->last_event->next = event;
> +	hm->last_event = event;
> +	event->next = NULL;
> +	xfs_healthmon_bump_events(hm);
> +	wake_up(&hm->wait);
> +
> +	trace_xfs_healthmon_push(hm->mp, event);
> +}
> +
> +/* Push an event onto the end of the list if we're not full. */
> +static inline int
> +xfs_healthmon_push(
> +	struct xfs_healthmon		*hm,
> +	struct xfs_healthmon_event	*event)
> +{
> +	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
> +		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
> +
> +		xfs_healthmon_bump_lost(hm);
> +		return -ENOMEM;
> +	}
> +
> +	__xfs_healthmon_push(hm, event);
> +	return 0;
> +}
> +
> +/* Create a new event or record that we failed. */
> +static struct xfs_healthmon_event *
> +xfs_healthmon_alloc(
> +	struct xfs_healthmon		*hm,
> +	enum xfs_healthmon_type		type,
> +	enum xfs_healthmon_domain	domain)
> +{
> +	struct timespec64		now;
> +	struct xfs_healthmon_event	*event;
> +
> +	event = kzalloc(sizeof(*event), GFP_NOFS);
> +	if (!event) {
> +		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
> +
> +		xfs_healthmon_bump_lost(hm);
> +		return NULL;
> +	}
> +
> +	event->type = type;
> +	event->domain = domain;
> +	ktime_get_coarse_real_ts64(&now);
> +	event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
> +
> +	return event;
> +}
> +
> +/*
> + * Before we accept an event notification from a live update hook, we need to
> + * clear out any previously lost events.
> + */
> +static inline int
> +xfs_healthmon_start_live_update(
> +	struct xfs_healthmon		*hm)
> +{
> +	struct xfs_healthmon_event	*event;
> +
> +	/* If the queue is already full.... */
> +	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
> +		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
> +
> +		if (hm->last_event &&
> +		    hm->last_event->type == XFS_HEALTHMON_LOST) {
> +			/*
> +			 * ...and the last event notes lost events, then add
> +			 * the number of events we already lost, plus one for
> +			 * this event that we're about to lose.
> +			 */
> +			hm->last_event->lostcount += hm->lost_prev_event + 1;
> +			hm->lost_prev_event = 0;
> +		} else {
> +			/*
> +			 * ...try to create a new lost event.  Add the number
> +			 * of events we previously lost, plus one for this
> +			 * event.
> +			 */
> +			event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
> +					XFS_HEALTHMON_MOUNT);
> +			if (!event) {
> +				xfs_healthmon_bump_lost(hm);
> +				return -ENOMEM;
> +			}
> +			event->lostcount = hm->lost_prev_event + 1;
> +			hm->lost_prev_event = 0;
> +
> +			__xfs_healthmon_push(hm, event);
> +		}
> +
> +		return -ENOSPC;
> +	}
> +
> +	/* If we lost an event in the past, but the queue isn't yet full... */
> +	if (hm->lost_prev_event) {
> +		/*
> +		 * ...try to create a new lost event.  Add the number of events
> +		 * we previously lost, plus one for this event.
> +		 */
> +		event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
> +				XFS_HEALTHMON_MOUNT);
> +		if (!event) {
> +			xfs_healthmon_bump_lost(hm);
> +			return -ENOMEM;
> +		}
> +		event->lostcount = hm->lost_prev_event;
> +		hm->lost_prev_event = 0;
> +
> +		/*
> +		 * If adding this lost event pushes us over the limit, we're
> +		 * going to lose the current event.  Note that in the lost
> +		 * event count too.
> +		 */
> +		if (hm->events == XFS_HEALTHMON_MAX_EVENTS - 1)
> +			event->lostcount++;
> +
> +		__xfs_healthmon_push(hm, event);
> +		if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
> +			trace_xfs_healthmon_lost_event(hm->mp,
> +					hm->lost_prev_event);
> +			return -ENOSPC;
> +		}
> +	}
> +
> +	/*
> +	 * The queue is not full and it is not currently the case that events
> +	 * were lost.
> +	 */
> +	return 0;
> +}
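The dropped-event accounting above can be summarized as a bounded FIFO that folds drops into a trailing "lost" record. A rough Python sketch of that policy (a model of the logic, not the kernel code verbatim; MAX_EVENTS is shrunk for illustration):

```python
from collections import deque

MAX_EVENTS = 4  # stand-in for XFS_HEALTHMON_MAX_EVENTS

class EventQueue:
    """Bounded FIFO that folds dropped events into a trailing 'lost' record."""

    def __init__(self):
        self.q = deque()
        self.lost_prev = 0  # drops not yet reported via a 'lost' record

    def push(self, event):
        if len(self.q) >= MAX_EVENTS:
            # Queue full: fold this drop into an existing trailing
            # 'lost' record if there is one, else just count it.
            if self.q and self.q[-1]["type"] == "lost":
                self.q[-1]["count"] += self.lost_prev + 1
                self.lost_prev = 0
            else:
                self.lost_prev += 1
            return False
        if self.lost_prev:
            # Room again: emit a 'lost' record covering earlier drops.
            rec = {"type": "lost", "count": self.lost_prev}
            self.lost_prev = 0
            # If the 'lost' record itself fills the queue, the current
            # event is dropped too; note that in the record up front.
            if len(self.q) == MAX_EVENTS - 1:
                rec["count"] += 1
            self.q.append(rec)
            if len(self.q) >= MAX_EVENTS:
                return False
        self.q.append(event)
        return True

q = EventQueue()
for _ in range(6):
    q.push({"type": "running"})
# Four slots fill with 'running' events; two drops are still pending.
assert len(q.q) == MAX_EVENTS and q.lost_prev == 2

q.q.popleft()                  # reader consumes one event
q.push({"type": "running"})    # freed slot goes to a 'lost' record
assert q.q[-1] == {"type": "lost", "count": 3}
```

The last assertion shows the corner case handled in the kernel code: the "lost" record that reports the two earlier drops fills the queue again, so the current event is dropped as well and counted in that same record.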
> +
> +/* Render the health update type as a string. */
> +STATIC const char *
> +xfs_healthmon_typestring(
> +	const struct xfs_healthmon_event	*event)
> +{
> +	static const char *type_strings[] = {
> +		[XFS_HEALTHMON_RUNNING]		= "running",
> +		[XFS_HEALTHMON_LOST]		= "lost",
> +	};
> +
> +	if (event->type >= ARRAY_SIZE(type_strings))
> +		return "?";
> +
> +	return type_strings[event->type];
> +}
> +
> +/* Render the health domain as a string. */
> +STATIC const char *
> +xfs_healthmon_domstring(
> +	const struct xfs_healthmon_event	*event)
> +{
> +	static const char *dom_strings[] = {
> +		[XFS_HEALTHMON_MOUNT]		= "mount",
> +	};
> +
> +	if (event->domain >= ARRAY_SIZE(dom_strings))
> +		return "?";
> +
> +	return dom_strings[event->domain];
> +}
> +
> +/* Convert a flags bitmap into a jsonable string. */
> +static inline int
> +xfs_healthmon_format_flags(
> +	struct seq_buf			*outbuf,
> +	const struct flag_string	*strings,
> +	size_t				nr_strings,
> +	unsigned int			flags)
> +{
> +	const struct flag_string	*p;
> +	ssize_t				ret;
> +	unsigned int			i;
> +	bool				first = true;
> +
> +	for (i = 0, p = strings; i < nr_strings; i++, p++) {
> +		if (!(p->mask & flags))
> +			continue;
> +
> +		ret = seq_buf_printf(outbuf, "%s\"%s\"",
> +				first ? "" : ", ", p->str);
> +		if (ret < 0)
> +			return ret;
> +
> +		first = false;
> +		flags &= ~p->mask;
> +	}
> +
> +	for (i = 0; flags != 0 && i < sizeof(flags) * NBBY; i++) {
> +		if (!(flags & (1U << i)))
> +			continue;
> +
> +		/* json doesn't support hexadecimal notation */
> +		ret = seq_buf_printf(outbuf, "%s%u",
> +				first ? "" : ", ", (1U << i));
> +		if (ret < 0)
> +			return ret;
> +
> +		first = false;
> +	}
> +
> +	return 0;
> +}
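The helper above renders known bits as quoted names and any leftover bits as decimal integers, since JSON has no hexadecimal literals. A userspace-side equivalent, with hypothetical flag names (the real name tables arrive in later patches), might look like:

```python
def format_flags(strings, flags):
    """Render a bitmask as JSON array members: known bits become their
    names, unknown leftover bits are emitted as decimal integers."""
    out = []
    for mask, name in strings:
        if flags & mask:
            out.append(name)
            flags &= ~mask
    for i in range(32):
        bit = 1 << i
        if flags & bit:
            out.append(bit)  # JSON doesn't support hexadecimal notation
    return out

# Hypothetical flag names, purely for illustration.
flags = format_flags([(0x1, "sb"), (0x2, "agf")], 0x1 | 0x2 | 0x10)
assert flags == ["sb", "agf", 16]
```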
> +
> +/* Convert the event mask into a jsonable string. */
> +static inline int
> +__xfs_healthmon_format_mask(
> +	struct seq_buf			*outbuf,
> +	const char			*descr,
> +	const struct flag_string	*strings,
> +	size_t				nr_strings,
> +	unsigned int			mask)
> +{
> +	ssize_t				ret;
> +
> +	ret = seq_buf_printf(outbuf, "  \"%s\":  [", descr);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = xfs_healthmon_format_flags(outbuf, strings, nr_strings, mask);
> +	if (ret < 0)
> +		return ret;
> +
> +	return seq_buf_printf(outbuf, "],\n");
> +}
> +
> +#define xfs_healthmon_format_mask(o, d, s, m) \
> +	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
> +
> +static inline void
> +xfs_healthmon_reset_outbuf(
> +	struct xfs_healthmon		*hm)
> +{
> +	hm->outpos = 0;
> +	seq_buf_clear(&hm->outbuf);
> +}
> +
> +/* Render lost event mask as a string set */
> +static int
> +xfs_healthmon_format_lost(
> +	struct seq_buf			*outbuf,
> +	const struct xfs_healthmon_event *event)
> +{
> +	return seq_buf_printf(outbuf, "  \"count\":      %llu,\n",
> +			event->lostcount);
> +}
> +
> +/*
> + * Format an event into json.  Returns 0 if we formatted the event.  If
> + * formatting the event overflows the buffer, returns -1 with the seqbuf len
> + * unchanged.
> + */
> +STATIC int
> +xfs_healthmon_format_json(
> +	struct xfs_healthmon		*hm,
> +	const struct xfs_healthmon_event *event)
> +{
> +	struct seq_buf			*outbuf = &hm->outbuf;
> +	size_t				old_seqlen = outbuf->len;
> +	int				ret;
> +
> +	trace_xfs_healthmon_format(hm->mp, event);
> +
> +	ret = seq_buf_printf(outbuf, "{\n");
> +	if (ret < 0)
> +		goto overrun;
> +
> +	ret = seq_buf_printf(outbuf, "  \"domain\":     \"%s\",\n",
> +			xfs_healthmon_domstring(event));
> +	if (ret < 0)
> +		goto overrun;
> +
> +	ret = seq_buf_printf(outbuf, "  \"type\":       \"%s\",\n",
> +			xfs_healthmon_typestring(event));
> +	if (ret < 0)
> +		goto overrun;
> +
> +	switch (event->domain) {
> +	case XFS_HEALTHMON_MOUNT:
> +		switch (event->type) {
> +		case XFS_HEALTHMON_RUNNING:
> +			/* nothing to format */
> +			break;
> +		case XFS_HEALTHMON_LOST:
> +			ret = xfs_healthmon_format_lost(outbuf, event);
> +			break;
> +		default:
> +			break;
> +		}
> +		break;
> +	}
> +	if (ret < 0)
> +		goto overrun;
> +
> +	/* The last element in the json must not have a trailing comma. */
> +	ret = seq_buf_printf(outbuf, "  \"time_ns\":    %llu\n",
> +			event->time_ns);
> +	if (ret < 0)
> +		goto overrun;
> +
> +	ret = seq_buf_printf(outbuf, "}\n");
> +	if (ret < 0)
> +		goto overrun;
> +
> +	ASSERT(!seq_buf_has_overflowed(outbuf));
> +	return 0;
> +overrun:
> +	/*
> +	 * We overflowed the buffer and could not format the event.  Reset the
> +	 * seqbuf and tell the caller not to delete the event.
> +	 */
> +	trace_xfs_healthmon_format_overflow(hm->mp, event);
> +	outbuf->len = old_seqlen;
> +	return -1;
> +}
> +
> +static const unsigned int domain_map[] = {
> +	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
> +};
> +
> +static const unsigned int type_map[] = {
> +	[XFS_HEALTHMON_RUNNING]		= XFS_HEALTH_MONITOR_TYPE_RUNNING,
> +	[XFS_HEALTHMON_LOST]		= XFS_HEALTH_MONITOR_TYPE_LOST,
> +};
> +
> +/* Render event as a C structure */
> +STATIC int
> +xfs_healthmon_format_cstruct(
> +	struct xfs_healthmon		*hm,
> +	const struct xfs_healthmon_event *event)
> +{
> +	struct xfs_health_monitor_event	hme = {
> +		.time_ns		= event->time_ns,
> +	};
> +	struct seq_buf			*outbuf = &hm->outbuf;
> +	size_t				old_seqlen = outbuf->len;
> +	int				ret;
> +
> +	trace_xfs_healthmon_format(hm->mp, event);
> +
> +	if (event->domain < 0 || event->domain >= ARRAY_SIZE(domain_map) ||
> +	    event->type < 0   || event->type >= ARRAY_SIZE(type_map))
> +		return -EFSCORRUPTED;
> +
> +	hme.domain = domain_map[event->domain];
> +	hme.type = type_map[event->type];
> +
> +	/* fill in the event-specific details */
> +	switch (event->domain) {
> +	case XFS_HEALTHMON_MOUNT:
> +		switch (event->type) {
> +		case XFS_HEALTHMON_LOST:
> +			hme.e.lost.count = event->lostcount;
> +			break;
> +		default:
> +			break;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	ret = seq_buf_putmem(outbuf, &hme, sizeof(hme));
> +	if (ret < 0) {
> +		/*
> +		 * We overflowed the buffer and could not format the event.
> +		 * Reset the seqbuf and tell the caller not to delete the
> +		 * event.
> +		 */
> +		trace_xfs_healthmon_format_overflow(hm->mp, event);
> +		outbuf->len = old_seqlen;
> +		return -1;
> +	}
> +
> +	ASSERT(!seq_buf_has_overflowed(outbuf));
> +	return 0;
> +}
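A FMT_CSTRUCT consumer reads fixed-size binary records. The real layout is dictated by struct xfs_health_monitor_event in the uapi header (not shown in this hunk); assuming a purely hypothetical packing of (u64 time_ns, u32 domain, u32 type, u64 payload) just to illustrate the shape of such a reader:

```python
import struct

# Hypothetical record layout for illustration only -- the real layout
# comes from struct xfs_health_monitor_event in the uapi header.
REC = struct.Struct("<QIIQ")  # time_ns, domain, type, payload (e.g. lost count)

def parse_records(buf):
    """Split a byte buffer of fixed-size records into tuples."""
    return [REC.unpack_from(buf, off)
            for off in range(0, len(buf), REC.size)]

blob = REC.pack(1729641600, 1, 2, 3) + REC.pack(1729641601, 1, 2, 5)
recs = parse_records(blob)
assert len(recs) == 2
assert recs[1] == (1729641601, 1, 2, 5)
```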
> +
> +/* How many bytes are waiting in the outbuf to be copied? */
> +static inline size_t
> +xfs_healthmon_outbuf_bytes(
> +	struct xfs_healthmon	*hm)
> +{
> +	unsigned int		used = seq_buf_used(&hm->outbuf);
> +
> +	if (used > hm->outpos)
> +		return used - hm->outpos;
> +	return 0;
> +}
> +
> +/*
> + * Do we have something for userspace to do?  This means either events
> + * pending in the queue or unread bytes remaining in the outbuf.
> + */
> +static inline bool
> +xfs_healthmon_has_eventdata(
> +	struct xfs_healthmon	*hm)
> +{
> +	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
> +}
> +
> +/* Try to copy the rest of the outbuf to the iov iter. */
> +STATIC ssize_t
> +xfs_healthmon_copybuf(
> +	struct xfs_healthmon	*hm,
> +	struct iov_iter		*to)
> +{
> +	size_t			to_copy;
> +	size_t			w = 0;
> +
> +	trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos);
> +
> +	to_copy = xfs_healthmon_outbuf_bytes(hm);
> +	if (to_copy) {
> +		w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to);
> +		if (!w)
> +			return -EFAULT;
> +
> +		hm->outpos += w;
> +	}
> +
> +	/*
> +	 * Nothing left to copy?  Reset the seqbuf pointers and outbuf to the
> +	 * start since there's no live data in the buffer.
> +	 */
> +	if (xfs_healthmon_outbuf_bytes(hm) == 0)
> +		xfs_healthmon_reset_outbuf(hm);
> +	return w;
> +}
> +
> +/*
> + * See if there's an event waiting for us.  If the fs is no longer mounted,
> + * don't bother sending any more events.
> + */
> +static inline struct xfs_healthmon_event *
> +xfs_healthmon_peek(
> +	struct xfs_healthmon	*hm)
> +{
> +	struct xfs_healthmon_event *event;
> +
> +	mutex_lock(&hm->lock);
> +	if (hm->mp)
> +		event = hm->first_event;
> +	else
> +		event = NULL;
> +	mutex_unlock(&hm->lock);
> +	return event;
> +}
> +
>  /*
>   * Convey queued event data to userspace.  First copy any remaining bytes in
>   * the outbuf, then format the oldest event into the outbuf and copy that too.
> @@ -55,7 +601,125 @@ xfs_healthmon_read_iter(
>  	struct kiocb		*iocb,
>  	struct iov_iter		*to)
>  {
> -	return -EIO;
> +	struct file		*file = iocb->ki_filp;
> +	struct inode		*inode = file_inode(file);
> +	struct xfs_healthmon	*hm = file->private_data;
> +	struct xfs_healthmon_event *event;
> +	size_t			copied = 0;
> +	ssize_t			ret = 0;
> +
> +	/* Wait for data to become available */
> +	if (!(file->f_flags & O_NONBLOCK)) {
> +		ret = wait_event_interruptible(hm->wait,
> +				xfs_healthmon_has_eventdata(hm));
> +		if (ret)
> +			return ret;
> +	} else if (!xfs_healthmon_has_eventdata(hm)) {
> +		return -EAGAIN;
> +	}
> +
> +	/* Allocate formatting buffer up to 64k if necessary */
> +	if (hm->outbuf.size == 0) {
> +		void		*outbuf;
> +		size_t		bufsize = min(65536, max(PAGE_SIZE,
> +							 iov_iter_count(to)));
> +
> +		outbuf = kzalloc(bufsize, GFP_KERNEL);
> +		if (!outbuf) {
> +			bufsize = PAGE_SIZE;
> +			outbuf = kzalloc(bufsize, GFP_KERNEL);
> +			if (!outbuf)
> +				return -ENOMEM;
> +		}
> +
> +		inode_lock(inode);
> +		if (hm->outbuf.size == 0) {
> +			seq_buf_init(&hm->outbuf, outbuf, bufsize);
> +			hm->outpos = 0;
> +		} else {
> +			kfree(outbuf);
> +		}
> +	} else {
> +		inode_lock(inode);
> +	}
> +
> +	trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event);
> +
> +	/*
> +	 * If there's anything left in the seqbuf, copy that before formatting
> +	 * more events.
> +	 */
> +	ret = xfs_healthmon_copybuf(hm, to);
> +	if (ret < 0)
> +		goto out_unlock;
> +	copied += ret;
> +
> +	while (iov_iter_count(to) > 0) {
> +		/* Format the next events into the outbuf until it's full. */
> +		while ((event = xfs_healthmon_peek(hm)) != NULL) {
> +			switch (hm->format) {
> +			case XFS_HEALTH_MONITOR_FMT_JSON:
> +				ret = xfs_healthmon_format_json(hm, event);
> +				break;
> +			case XFS_HEALTH_MONITOR_FMT_CSTRUCT:
> +				ret = xfs_healthmon_format_cstruct(hm, event);
> +				break;
> +			default:
> +				ret = -EINVAL;
> +				goto out_unlock;
> +			}
> +			if (ret < 0)
> +				break;
> +			ret = xfs_healthmon_free_head(hm, event);
> +			if (ret)
> +				goto out_unlock;
> +		}
> +
> +		/* Copy it to userspace */
> +		ret = xfs_healthmon_copybuf(hm, to);
> +		if (ret <= 0)
> +			break;
> +
> +		copied += ret;
> +	}
> +
> +out_unlock:
> +	trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event);
> +	inode_unlock(inode);
> +	return copied ?: ret;
> +}
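Since FMT_JSON emits pretty-printed objects back to back, a line-oriented reader won't split them cleanly; a streaming decode is needed. A minimal sketch of what a userspace consumer (like the Python prototype daemon) could do with json.JSONDecoder.raw_decode:

```python
import json

def iter_events(stream_text):
    """Yield successive JSON objects from a concatenated stream, as
    produced by reading the healthmon fd in FMT_JSON mode."""
    dec = json.JSONDecoder()
    idx = 0
    while idx < len(stream_text):
        # Skip whitespace between objects; raw_decode requires the
        # document to start exactly at idx.
        while idx < len(stream_text) and stream_text[idx].isspace():
            idx += 1
        if idx >= len(stream_text):
            break
        obj, idx = dec.raw_decode(stream_text, idx)
        yield obj

# Two back-to-back events shaped like the kernel's output.
stream = ('{\n  "domain": "mount",\n  "type": "running",\n  "time_ns": 1\n}\n'
          '{\n  "domain": "mount",\n  "type": "lost",\n  "count": 2,\n'
          '  "time_ns": 3\n}\n')
events = list(iter_events(stream))
assert [e["type"] for e in events] == ["running", "lost"]
assert events[1]["count"] == 2
```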
> +
> +/* Poll for available events. */
> +STATIC __poll_t
> +xfs_healthmon_poll(
> +	struct file			*file,
> +	struct poll_table_struct	*wait)
> +{
> +	struct xfs_healthmon		*hm = file->private_data;
> +	__poll_t			mask = 0;
> +
> +	poll_wait(file, &hm->wait, wait);
> +
> +	if (xfs_healthmon_has_eventdata(hm))
> +		mask |= EPOLLIN;
> +	return mask;
> +}
> +
> +/* Free all events */
> +STATIC void
> +xfs_healthmon_free_events(
> +	struct xfs_healthmon		*hm)
> +{
> +	struct xfs_healthmon_event	*event, *next;
> +
> +	event = hm->first_event;
> +	while (event != NULL) {
> +		trace_xfs_healthmon_drop(hm->mp, event);
> +		next = event->next;
> +		kfree(event);
> +		event = next;
> +	}
> +	hm->first_event = hm->last_event = NULL;
>  }
>  
>  /* Free the health monitoring information. */
> @@ -66,6 +730,14 @@ xfs_healthmon_release(
>  {
>  	struct xfs_healthmon	*hm = file->private_data;
>  
> +	trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event);
> +
> +	wake_up_all(&hm->wait);
> +
> +	mutex_destroy(&hm->lock);
> +	xfs_healthmon_free_events(hm);
> +	if (hm->outbuf.size)
> +		kfree(hm->outbuf.buffer);
>  	kfree(hm);
>  
>  	return 0;
> @@ -76,9 +748,10 @@ static inline bool
>  xfs_healthmon_validate(
>  	const struct xfs_health_monitor	*hmo)
>  {
> -	if (hmo->flags)
> +	if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
>  		return false;
> -	if (hmo->format)
> +	if (hmo->format != XFS_HEALTH_MONITOR_FMT_JSON &&
> +	    hmo->format != XFS_HEALTH_MONITOR_FMT_CSTRUCT)
>  		return false;
>  	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
>  		return false;
> @@ -89,6 +762,19 @@ xfs_healthmon_validate(
>  
>  /* Emit some data about the health monitoring fd. */
>  #ifdef CONFIG_PROC_FS
> +static const char *
> +xfs_healthmon_format_string(const struct xfs_healthmon *hm)
> +{
> +	switch (hm->format) {
> +	case XFS_HEALTH_MONITOR_FMT_JSON:
> +		return "json";
> +	case XFS_HEALTH_MONITOR_FMT_CSTRUCT:
> +		return "blob";
> +	}
> +
> +	return "";
> +}
> +
>  static void
>  xfs_healthmon_show_fdinfo(
>  	struct seq_file		*m,
> @@ -96,8 +782,13 @@ xfs_healthmon_show_fdinfo(
>  {
>  	struct xfs_healthmon	*hm = file->private_data;
>  
> -	seq_printf(m, "state:\talive\ndev:\t%s\n",
> -			hm->mp->m_super->s_id);
> +	mutex_lock(&hm->lock);
> +	seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
> +			hm->mp->m_super->s_id,
> +			xfs_healthmon_format_string(hm),
> +			hm->total_events,
> +			hm->total_lost);
> +	mutex_unlock(&hm->lock);
>  }
>  #endif
>  
> @@ -107,6 +798,7 @@ static const struct file_operations xfs_healthmon_fops = {
>  	.show_fdinfo	= xfs_healthmon_show_fdinfo,
>  #endif
>  	.read_iter	= xfs_healthmon_read_iter,
> +	.poll		= xfs_healthmon_poll,
>  	.release	= xfs_healthmon_release,
>  };
>  
> @@ -121,6 +813,7 @@ xfs_ioc_health_monitor(
>  {
>  	struct xfs_health_monitor	hmo;
>  	struct xfs_healthmon		*hm;
> +	struct xfs_healthmon_event	*event;
>  	int				fd;
>  	int				ret;
>  
> @@ -137,6 +830,23 @@ xfs_ioc_health_monitor(
>  	if (!hm)
>  		return -ENOMEM;
>  	hm->mp = mp;
> +	hm->format = hmo.format;
> +
> +	seq_buf_init(&hm->outbuf, NULL, 0);
> +	mutex_init(&hm->lock);
> +	init_waitqueue_head(&hm->wait);
> +
> +	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
> +		hm->verbose = true;
> +
> +	/* Queue up the first event that lets the client know we're running. */
> +	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
> +			XFS_HEALTHMON_MOUNT);
> +	if (!event) {
> +		ret = -ENOMEM;
> +		goto out_mutex;
> +	}
> +	__xfs_healthmon_push(hm, event);
>  
>  	/*
>  	 * Create the anonymous file.  If it succeeds, the file owns hm and
> @@ -146,12 +856,16 @@ xfs_ioc_health_monitor(
>  			O_CLOEXEC | O_RDONLY);
>  	if (fd < 0) {
>  		ret = fd;
> -		goto out_hm;
> +		goto out_mutex;
>  	}
>  
> +	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
> +
>  	return fd;
>  
> -out_hm:
> +out_mutex:
> +	mutex_destroy(&hm->lock);
> +	xfs_healthmon_free_events(hm);
>  	kfree(hm);
>  	return ret;
>  }
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index a60556dbd172ee..d42b864a3837a2 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -51,6 +51,8 @@
>  #include "xfs_rtgroup.h"
>  #include "xfs_zone_alloc.h"
>  #include "xfs_zone_priv.h"
> +#include "xfs_health.h"
> +#include "xfs_healthmon.h"
>  
>  /*
>   * We include this last to have the helpers above available for the trace
> diff --git a/lib/seq_buf.c b/lib/seq_buf.c
> index f3f3436d60a940..f6a1fb46a1d6c9 100644
> --- a/lib/seq_buf.c
> +++ b/lib/seq_buf.c
> @@ -245,6 +245,7 @@ int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len)
>  	seq_buf_set_overflow(s);
>  	return -1;
>  }
> +EXPORT_SYMBOL_GPL(seq_buf_putmem);
>  
>  #define MAX_MEMHEX_BYTES	8U
>  #define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> 
> 



Thread overview: 25+ messages
2025-10-22 23:56 [PATCHBOMB 6.19] xfs: autonomous self healing Darrick J. Wong
2025-10-22 23:59 ` [PATCHSET V2] xfs: autonomous self healing of filesystems Darrick J. Wong
2025-10-23  0:00   ` [PATCH 01/19] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
2025-10-24  5:40     ` Christoph Hellwig
2025-10-27 16:15       ` Darrick J. Wong
2025-10-23  0:01   ` [PATCH 02/19] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
2025-10-30 16:38     ` Darrick J. Wong
2025-10-23  0:01   ` [PATCH 03/19] xfs: create debugfs uuid aliases Darrick J. Wong
2025-10-23  0:01   ` [PATCH 04/19] xfs: create hooks for monitoring health updates Darrick J. Wong
2025-10-23  0:01   ` [PATCH 05/19] xfs: create a filesystem shutdown hook Darrick J. Wong
2025-10-23  0:02   ` [PATCH 06/19] xfs: create hooks for media errors Darrick J. Wong
2025-10-23  0:02   ` [PATCH 07/19] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
2025-10-23  0:02   ` [PATCH 08/19] iomap: report directio read and write errors to callers Darrick J. Wong
2025-10-23  0:02   ` [PATCH 09/19] xfs: create file io error hooks Darrick J. Wong
2025-10-23  0:03   ` [PATCH 10/19] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2025-10-23  0:03   ` [PATCH 11/19] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2025-10-30 16:54     ` Darrick J. Wong
2025-10-23  0:03   ` [PATCH 12/19] xfs: report metadata health events through healthmon Darrick J. Wong
2025-10-23  0:04   ` [PATCH 13/19] xfs: report shutdown " Darrick J. Wong
2025-10-23  0:04   ` [PATCH 14/19] xfs: report media errors " Darrick J. Wong
2025-10-23  0:04   ` [PATCH 15/19] xfs: report file io " Darrick J. Wong
2025-10-23  0:04   ` [PATCH 16/19] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2025-10-23  0:05   ` [PATCH 17/19] xfs: validate fds against running healthmon Darrick J. Wong
2025-10-23  0:05   ` [PATCH 18/19] xfs: add media error reporting ioctl Darrick J. Wong
2025-10-23  0:05   ` [PATCH 19/19] xfs: send uevents when major filesystem events happen Darrick J. Wong
