[PATCHBOMB v2 6.19] xfs: autonomous self healing


* [PATCHBOMB v2 6.19] xfs: autonomous self healing
@ 2025-11-05  0:46 Darrick J. Wong
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
  0 siblings, 2 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:46 UTC
  To: Carlos Maiolino, Christoph Hellwig; +Cc: xfs, Chandan Babu R, linux-fsdevel

Hi everyone,

You might recall that 18 months ago I showed off an early draft of a
patchset implementing autonomous self healing capabilities for XFS.
The premise is quite simple -- add a few hooks to the kernel to capture
significant filesystem metadata and file health events (pretty much all
failures), queue these events to a special anonfd, and let userspace
read the events at its leisure.  That's patchset 1.

Since the previous release, I've removed all the json event generation
stuff and made media errors use the rmap btree to report file data loss.
I also ported the userspace program to C.  I'm not going to blast
everyone with the full set; just know that the C version is here:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

Patchset 2 is now a cleanup of the file IO error hooks in patchset 1 to
use a more generic interface and to call fsnotify with the error
reports.  This means that the fsnotify filesystem error functionality
conveys generic errors to unprivileged userspace programs, but I'm
leaving the privileged healthmon interface so that xfsprogs can figure
out which specific part of the filesystem needs fixing.
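
For readers who want to see what consuming such reports might look like,
here is a minimal sketch in C.  It assumes the reports arrive as the
existing fanotify FAN_FS_ERROR event, which this cover letter does not
state explicitly.  scan_fs_errors() and struct fs_error_record are
hypothetical names; the latter mirrors the layout of
struct fanotify_event_info_error from linux/fanotify.h.  The function only
parses a buffer that has already been returned by read() on a fanotify
descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/fanotify.h>

/* Fallbacks for older userspace headers; values match linux/fanotify.h. */
#ifndef FAN_FS_ERROR
#define FAN_FS_ERROR 0x00008000
#endif
#ifndef FAN_EVENT_INFO_TYPE_ERROR
#define FAN_EVENT_INFO_TYPE_ERROR 5
#endif

/* Hypothetical mirror of struct fanotify_event_info_error. */
struct fs_error_record {
	struct fanotify_event_info_header hdr;
	int32_t  error;        /* errno-style code of the most recent error */
	uint32_t error_count;  /* errors folded into this one report */
};

/*
 * Walk a buffer of fanotify events and pick out FAN_FS_ERROR reports.
 * Returns the number of such events; *last_error receives the code from
 * the final error record seen and *total the sum of all error counts.
 */
static size_t scan_fs_errors(const void *buf, size_t len,
			     int32_t *last_error, uint32_t *total)
{
	const struct fanotify_event_metadata *meta = buf;
	long remaining = (long)len;
	size_t nevents = 0;

	assert(last_error && total);
	*last_error = 0;
	*total = 0;

	while (FAN_EVENT_OK(meta, remaining)) {
		if (meta->mask & FAN_FS_ERROR) {
			const char *p = (const char *)meta + meta->metadata_len;
			const char *end = (const char *)meta + meta->event_len;

			/* Info records follow the fixed metadata header. */
			while (p + sizeof(struct fanotify_event_info_header) <= end) {
				struct fanotify_event_info_header hdr;

				memcpy(&hdr, p, sizeof(hdr));
				if (hdr.len < sizeof(hdr) || p + hdr.len > end)
					break;	/* malformed record; stop */
				if (hdr.info_type == FAN_EVENT_INFO_TYPE_ERROR &&
				    hdr.len >= sizeof(struct fs_error_record)) {
					struct fs_error_record rec;

					memcpy(&rec, p, sizeof(rec));
					*last_error = rec.error;
					*total += rec.error_count;
				}
				p += hdr.len;
			}
			nevents++;
		}
		meta = FAN_EVENT_NEXT(meta, remaining);
	}
	return nevents;
}
```

In a real listener the buffer would come from read() on a descriptor
created with fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY) and
marked with fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
FAN_FS_ERROR, AT_FDCWD, mountpoint).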

This work was mostly complete by the end of 2024, and I've been letting
it run on my XFS QA testing fleets ever since then.  I am submitting
this patchset upstream for 6.19.  Once this is merged, the online
fsck project will be complete.

--D

The unreviewed patches in this series are:

[PATCHSET V3 1/2] xfs: autonomous self healing of filesystems
  [PATCH 02/22] docs: discuss autonomous self healing in the xfs online
  [PATCH 03/22] xfs: create debugfs uuid aliases
  [PATCH 04/22] xfs: create hooks for monitoring health updates
  [PATCH 05/22] xfs: create a filesystem shutdown hook
  [PATCH 06/22] xfs: create hooks for media errors
  [PATCH 07/22] iomap: report buffered read and write io errors to the
  [PATCH 08/22] iomap: report directio read and write errors to callers
  [PATCH 09/22] xfs: create file io error hooks
  [PATCH 10/22] xfs: create a special file to pass filesystem health to
  [PATCH 11/22] xfs: create event queuing, formatting,
  [PATCH 12/22] xfs: report metadata health events through healthmon
  [PATCH 13/22] xfs: report shutdown events through healthmon
  [PATCH 14/22] xfs: report media errors through healthmon
  [PATCH 15/22] xfs: report file io errors through healthmon
  [PATCH 16/22] xfs: allow reconfiguration of the health monitoring
  [PATCH 17/22] xfs: validate fds against running healthmon
  [PATCH 18/22] xfs: add media error reporting ioctl
  [PATCH 19/22] xfs: send uevents when major filesystem events happen
  [PATCH 20/22] xfs: merge health monitoring events when possible
  [PATCH 21/22] xfs: restrict healthmon users further
  [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the
[PATCHSET V3 2/2] iomap: generic file IO error reporting
  [PATCH 1/6] iomap: report file IO errors to fsnotify
  [PATCH 2/6] xfs: switch healthmon to use the iomap I/O error
  [PATCH 3/6] xfs: port notify-failure to use the new vfs io error
  [PATCH 4/6] xfs: remove file I/O error hooks
  [PATCH 5/6] iomap: remove I/O error hooks
  [PATCH 6/6] xfs: report fs metadata errors via fsnotify


* [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems
  2025-11-05  0:46 [PATCHBOMB v2 6.19] xfs: autonomous self healing Darrick J. Wong
@ 2025-11-05  0:48 ` Darrick J. Wong
  2025-11-05  0:48   ` [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
                     ` (21 more replies)
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
  1 sibling, 22 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:48 UTC
  To: djwong, cem; +Cc: hch, hch, linux-fsdevel, linux-xfs

Hi all,

This patchset builds new functionality to deliver live information about
filesystem health events to userspace.  This is done by creating an
anonymous file that can be read() for events by userspace programs.
Events are captured by hooking various parts of XFS and iomap so that
metadata health failures, file I/O errors, and major changes in
filesystem state (unmounts, shutdowns, etc.) can be observed by
programs.

When an event occurs, the hook functions queue an event object to each
event anonfd for later processing.  Programs must have CAP_SYS_ADMIN
to open the anonfd and there's a maximum event lag to prevent resource
overconsumption.  The events themselves can be read() from the anonfd
either as json objects for human readability, or as C structs for
daemons.
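
To illustrate the queueing behaviour described above, here is a minimal
userspace sketch in C of a bounded event queue.  It is not the code from
xfs_healthmon.c: the names (hm_queue, hm_push, hm_pop, HM_MAX_LAG) and the
policy of dropping new events and emitting a single "lost" record once the
backlog has drained are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HM_MAX_LAG 4	/* hypothetical cap on unread events */

enum hm_type { HM_LOST, HM_SICK, HM_SHUTDOWN, HM_MEDIA_ERROR };

struct hm_event {
	enum hm_type type;
	uint64_t payload;	/* for HM_LOST: number of dropped events */
};

struct hm_queue {
	struct hm_event ring[HM_MAX_LAG];
	size_t head;	/* index of the oldest queued event */
	size_t count;	/* events currently queued */
	uint64_t lost;	/* events dropped while the ring was full */
};

/* Producer side, as a hook might call it: never blocks, never grows. */
static bool hm_push(struct hm_queue *q, struct hm_event ev)
{
	assert(q);
	if (q->count == HM_MAX_LAG) {
		q->lost++;	/* reader is too far behind; drop and count */
		return false;
	}
	q->ring[(q->head + q->count) % HM_MAX_LAG] = ev;
	q->count++;
	return true;
}

/*
 * Consumer side: hand out queued events in order.  Once the ring is
 * empty, report any drops as one HM_LOST event so the reader learns that
 * its view is incomplete.  Returns false when there is nothing to read.
 */
static bool hm_pop(struct hm_queue *q, struct hm_event *out)
{
	assert(q && out);
	if (q->count > 0) {
		*out = q->ring[q->head];
		q->head = (q->head + 1) % HM_MAX_LAG;
		q->count--;
		return true;
	}
	if (q->lost > 0) {
		out->type = HM_LOST;
		out->payload = q->lost;
		q->lost = 0;
		return true;
	}
	return false;
}
```

A reader that receives HM_LOST cannot tell what it missed, so one
reasonable response in this sketch would be to fall back to a full scan of
the filesystem rather than a targeted repair.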

In userspace, we create a new daemon program that will read the event
objects and initiate repairs automatically.  This daemon is managed
entirely by systemd and will not block unmounting of the filesystem
unless repairs are ongoing.  It is autostarted via some udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * docs: remove obsolete links in the xfs online repair documentation
 * docs: discuss autonomous self healing in the xfs online repair design doc
 * xfs: create debugfs uuid aliases
 * xfs: create hooks for monitoring health updates
 * xfs: create a filesystem shutdown hook
 * xfs: create hooks for media errors
 * iomap: report buffered read and write io errors to the filesystem
 * iomap: report directio read and write errors to callers
 * xfs: create file io error hooks
 * xfs: create a special file to pass filesystem health to userspace
 * xfs: create event queuing, formatting, and discovery infrastructure
 * xfs: report metadata health events through healthmon
 * xfs: report shutdown events through healthmon
 * xfs: report media errors through healthmon
 * xfs: report file io errors through healthmon
 * xfs: allow reconfiguration of the health monitoring device
 * xfs: validate fds against running healthmon
 * xfs: add media error reporting ioctl
 * xfs: send uevents when major filesystem events happen
 * xfs: merge health monitoring events when possible
 * xfs: restrict healthmon users further
 * xfs: charge healthmon event objects to the memcg of the listening process
---
 fs/iomap/internal.h                                |    2 
 fs/xfs/libxfs/xfs_fs.h                             |  171 ++
 fs/xfs/libxfs/xfs_health.h                         |   52 +
 fs/xfs/xfs_file.h                                  |   37 +
 fs/xfs/xfs_fsops.h                                 |   11 
 fs/xfs/xfs_healthmon.h                             |  108 +
 fs/xfs/xfs_linux.h                                 |    3 
 fs/xfs/xfs_mount.h                                 |   13 
 fs/xfs/xfs_notify_failure.h                        |   39 +
 fs/xfs/xfs_super.h                                 |   13 
 fs/xfs/xfs_trace.h                                 |  407 ++++++
 include/linux/fs.h                                 |    4 
 include/linux/iomap.h                              |    2 
 Documentation/filesystems/vfs.rst                  |    7 
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  450 +++---
 fs/iomap/buffered-io.c                             |   27 
 fs/iomap/direct-io.c                               |    4 
 fs/iomap/ioend.c                                   |    4 
 fs/xfs/Kconfig                                     |    8 
 fs/xfs/Makefile                                    |    7 
 fs/xfs/xfs_aops.c                                  |    2 
 fs/xfs/xfs_file.c                                  |  174 ++
 fs/xfs/xfs_fsops.c                                 |   60 +
 fs/xfs/xfs_health.c                                |  271 ++++
 fs/xfs/xfs_healthmon.c                             | 1466 ++++++++++++++++++++
 fs/xfs/xfs_ioctl.c                                 |    7 
 fs/xfs/xfs_notify_failure.c                        |  258 +++-
 fs/xfs/xfs_super.c                                 |  109 +
 fs/xfs/xfs_trace.c                                 |    4 
 lib/seq_buf.c                                      |    1 
 30 files changed, 3477 insertions(+), 244 deletions(-)
 create mode 100644 fs/xfs/xfs_healthmon.h
 create mode 100644 fs/xfs/xfs_healthmon.c



* [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
@ 2025-11-05  0:48   ` Darrick J. Wong
  2025-11-05  0:48   ` [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
                     ` (20 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:48 UTC
  To: djwong, cem; +Cc: hch, hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Online repair is now merged upstream, so there is no need to point to
patchset links anymore.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  236 +-------------------
 1 file changed, 6 insertions(+), 230 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 8cbcd3c2643430..189d1f5f40788d 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -105,10 +105,8 @@ occur; this capability aids both strategies.
 TLDR; Show Me the Code!
 -----------------------
 
-Code is posted to the kernel.org git trees as follows:
-`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
-`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
-`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Kernel and userspace code has been fully merged as of October 2025.
+
 Each kernel patchset adding an online repair function will use the same branch
 name across the kernel, xfsprogs, and fstests git repos.
 
@@ -764,12 +762,8 @@ allow the online fsck developers to compare online fsck against offline fsck,
 and they enable XFS developers to find deficiencies in the code base.
 
 Proposed patchsets include
-`general fuzzer improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
 `fuzzing baselines
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
-and `improvements in fuzz testing comprehensiveness
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_.
 
 Stress Testing
 --------------
@@ -801,11 +795,6 @@ Success is defined by the ability to run all of these tests without observing
 any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
 check warnings, or any other sort of mischief.
 
-Proposed patchsets include `general stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
-and the `evolution of existing per-function stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
-
 4. User Interface
 =================
 
@@ -886,10 +875,6 @@ apply as nice of a priority to IO and CPU scheduling as possible.
 This measure was taken to minimize delays in the rest of the filesystem.
 No such hardening has been performed for the cron job.
 
-Proposed patchset:
-`Enabling the xfs_scrub background service
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
-
 Health Reporting
 ----------------
 
@@ -912,13 +897,6 @@ notifications and initiate a repair?
 *Answer*: These questions remain unanswered, but should be a part of the
 conversation with early adopters and potential downstream users of XFS.
 
-Proposed patchsets include
-`wiring up health reports to correction returns
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
-and
-`preservation of sickness info during memory reclaim
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
-
 5. Kernel Algorithms and Data Structures
 ========================================
 
@@ -1310,21 +1288,6 @@ Space allocation records are cross-referenced as follows:
      are there the same number of reverse mapping records for each block as the
      reference count record claims?
 
-Proposed patchsets are the series to find gaps in
-`refcount btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
-`inode btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
-`rmap btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
-to find
-`mergeable records
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
-and to
-`improve cross referencing with rmap
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
-before starting a repair.
-
 Checking Extended Attributes
 ````````````````````````````
 
@@ -1756,10 +1719,6 @@ For scrub, the drain works as follows:
 To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
 be woken up whenever the intent count drops to zero.
 
-The proposed patchset is the
-`scrub intent drain series
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
-
 .. _jump_labels:
 
 Static Keys (aka Jump Label Patching)
@@ -2036,10 +1995,6 @@ The ``xfarray_store_anywhere`` function is used to insert a record in any
 null record slot in the bag; and the ``xfarray_unset`` function removes a
 record from the bag.
 
-The proposed patchset is the
-`big in-memory array
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
-
 Iterating Array Elements
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2172,10 +2127,6 @@ However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
 
-The proposed patchset is at the start of the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
-
 .. _xfbtree:
 
 In-Memory B+Trees
@@ -2214,11 +2165,6 @@ xfiles enables reuse of the entire btree library.
 Btrees built atop an xfile are collectively known as ``xfbtrees``.
 The next few sections describe how they actually work.
 
-The proposed patchset is the
-`in-memory btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
-series.
-
 Using xfiles as a Buffer Cache Target
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2459,14 +2405,6 @@ This enables the log to release the old EFI to keep the log moving forwards.
 EFIs have a role to play during the commit and reaping phases; please see the
 next section and the section about :ref:`reaping<reaping>` for more details.
 
-Proposed patchsets are the
-`bitmap rework
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
-and the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
-
-
 Writing the New Tree
 ````````````````````
 
@@ -2623,11 +2561,6 @@ The number of records for the inode btree is the number of xfarray records,
 but the record count for the free inode btree has to be computed as inode chunk
 records are stored in the xfarray.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding the Space Reference Counts
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2716,11 +2649,6 @@ Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
 removed via ``xfarray_unset``.
 Bag members are examined through ``xfarray_iter`` loops.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding File Fork Mapping Indices
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -2757,11 +2685,6 @@ EXTENTS format instead of BMBT, which may require a conversion.
 Third, the incore extent map must be reloaded carefully to avoid disturbing
 any delayed allocation extents.
 
-The proposed patchset is the
-`file mapping repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
-series.
-
 .. _reaping:
 
 Reaping Old Metadata Blocks
@@ -2843,11 +2766,6 @@ blocks.
 As stated earlier, online repair functions use very large transactions to
 minimize the chances of this occurring.
 
-The proposed patchset is the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
-series.
-
 Case Study: Reaping After a Regular Btree Repair
 ````````````````````````````````````````````````
 
@@ -2943,11 +2861,6 @@ When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
 btrees.
 These blocks can then be reaped using the methods outlined above.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 .. _rmap_reap:
 
 Case Study: Reaping After Repairing Reverse Mapping Btrees
@@ -2972,11 +2885,6 @@ methods outlined above.
 The rest of the process of rebuildng the reverse mapping btree is discussed
 in a separate :ref:`case study<rmap_repair>`.
 
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
 Case Study: Rebuilding the AGFL
 ```````````````````````````````
 
@@ -3024,11 +2932,6 @@ more complicated, because computing the correct value requires traversing the
 forks, or if that fails, leaving the fields invalid and waiting for the fork
 fsck functions to run.
 
-The proposed patchset is the
-`inode
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
-repair series.
-
 Quota Record Repairs
 --------------------
 
@@ -3045,11 +2948,6 @@ checking are obviously bad limits and timer values.
 Quota usage counters are checked, repaired, and discussed separately in the
 section about :ref:`live quotacheck <quotacheck>`.
 
-The proposed patchset is the
-`quota
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
-repair series.
-
 .. _fscounters:
 
 Freezing to Fix Summary Counters
@@ -3145,11 +3043,6 @@ long enough to check and correct the summary counters.
 |   This bug was fixed in Linux 5.17.                                      |
 +--------------------------------------------------------------------------+
 
-The proposed patchset is the
-`summary counter cleanup
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
-series.
-
 Full Filesystem Scans
 ---------------------
 
@@ -3277,15 +3170,6 @@ Second, if the incore inode is stuck in some intermediate state, the scan
 coordinator must release the AGI and push the main filesystem to get the inode
 back into a loadable state.
 
-The proposed patches are the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-The first user of the new functionality is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
 Inode Management
 ````````````````
 
@@ -3381,12 +3265,6 @@ To capture these nuances, the online fsck code has a separate ``xchk_irele``
 function to set or clear the ``DONTCACHE`` flag to get the required release
 behavior.
 
-Proposed patchsets include fixing
-`scrub iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
-`dir iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
-
 .. _ilocking:
 
 Locking Inodes
@@ -3443,11 +3321,6 @@ If the dotdot entry changes while the directory is unlocked, then a move or
 rename operation must have changed the child's parentage, and the scan can
 exit early.
 
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
 .. _fshooks:
 
 Filesystem Hooks
@@ -3594,11 +3467,6 @@ The inode scan APIs are pretty simple:
 
 - ``xchk_iscan_teardown`` to finish the scan
 
-This functionality is also a part of the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-
 .. _quotacheck:
 
 Case Study: Quota Counter Checking
@@ -3686,11 +3554,6 @@ needing to hold any locks for a long duration.
 If repairs are desired, the real and shadow dquots are locked and their
 resource counts are set to the values in the shadow dquot.
 
-The proposed patchset is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
 .. _nlinks:
 
 Case Study: File Link Count Checking
@@ -3744,11 +3607,6 @@ shadow information.
 If no parents are found, the file must be :ref:`reparented <orphanage>` to the
 orphanage to prevent the file from being lost forever.
 
-The proposed patchset is the
-`file link count repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
-series.
-
 .. _rmap_repair:
 
 Case Study: Rebuilding Reverse Mapping Records
@@ -3828,11 +3686,6 @@ scan for reverse mapping records.
 
 12. Free the xfbtree now that it not needed.
 
-The proposed patchset is the
-`rmap repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
-series.
-
 Staging Repairs with Temporary Files on Disk
 --------------------------------------------
 
@@ -3971,11 +3824,6 @@ Once a good copy of a data file has been constructed in a temporary file, it
 must be conveyed to the file being repaired, which is the topic of the next
 section.
 
-The proposed patches are in the
-`repair temporary files
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
-series.
-
 Logged File Content Exchanges
 -----------------------------
 
@@ -4025,11 +3873,6 @@ The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 
-The proposed patchset is the
-`file contents exchange
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
-series.
-
 +--------------------------------------------------------------------------+
 | **Sidebar: Using Log-Incompatible Feature Flags**                        |
 +--------------------------------------------------------------------------+
@@ -4323,11 +4166,6 @@ To repair the summary file, write the xfile contents into the temporary file
 and use atomic mapping exchange to commit the new contents.
 The temporary file is then reaped.
 
-The proposed patchset is the
-`realtime summary repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
-series.
-
 Case Study: Salvaging Extended Attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4369,11 +4207,6 @@ Salvaging extended attributes is done as follows:
 
 4. Reap the temporary file.
 
-The proposed patchset is the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
-series.
-
 Fixing Directories
 ------------------
 
@@ -4448,11 +4281,6 @@ Unfortunately, the current dentry cache design doesn't provide a means to walk
 every child dentry of a specific directory, which makes this a hard problem.
 There is no known solution.
 
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
 Parent Pointers
 ```````````````
 
@@ -4612,11 +4440,6 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
 
 7. Reap the temporary directory.
 
-The proposed patchset is the
-`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
 Case Study: Repairing Parent Pointers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4662,11 +4485,6 @@ directory reconstruction:
 
 8. Reap the temporary file.
 
-The proposed patchset is the
-`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
 Digression: Offline Checking of Parent Pointers
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -4755,11 +4573,6 @@ connectivity checks:
 
 4. Move on to examining link counts, as we do today.
 
-The proposed patchset is the
-`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
-series.
-
 Rebuilding directories from parent pointers in offline repair would be very
 challenging because xfs_repair currently uses two single-pass scans of the
 filesystem during phases 3 and 4 to decide which files are corrupt enough to be
@@ -4903,12 +4716,6 @@ Repairing the directory tree works as follows:
 
 6. If the subdirectory has zero paths, attach it to the lost and found.
 
-The proposed patches are in the
-`directory tree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
-series.
-
-
 .. _orphanage:
 
 The Orphanage
@@ -4973,11 +4780,6 @@ Orphaned files are adopted by the orphanage as follows:
 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
    resources.
 
-The proposed patches are in the
-`orphanage adoption
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
-series.
-
 6. Userspace Algorithms and Data Structures
 ===========================================
 
@@ -5091,14 +4893,6 @@ first workqueue's workers until the backlog eases.
 This doesn't completely solve the balancing problem, but reduces it enough to
 move on to more pressing issues.
 
-The proposed patchsets are the scrub
-`performance tweaks
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
-and the
-`inode scan rebalance
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
-series.
-
 .. _scrubrepair:
 
 Scheduling Repairs
@@ -5179,20 +4973,6 @@ immediately.
 Corrupt file data blocks reported by phase 6 cannot be recovered by the
 filesystem.
 
-The proposed patchsets are the
-`repair warning improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
-refactoring of the
-`repair data dependency
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
-and
-`object tracking
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
-and the
-`repair scheduling
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
-improvement series.
-
 Checking Names for Confusable Unicode Sequences
 -----------------------------------------------
 
@@ -5372,6 +5152,8 @@ The extra flexibility enables several new use cases:
   This emulates an atomic device write in software, and can support arbitrary
   scattered writes.
 
+(This functionality was merged into mainline as of 2025)
+
 Vectorized Scrub
 ----------------
 
@@ -5393,13 +5175,7 @@ It is hoped that ``io_uring`` will pick up enough of this functionality that
 online fsck can use that instead of adding a separate vectored scrub system
 call to XFS.
 
-The relevant patchsets are the
-`kernel vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
-and
-`userspace vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
-series.
+(This functionality was merged into mainline as of 2025)
 
 Quality of Service Targets for Scrub
 ------------------------------------



* [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-11-05  0:48   ` [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
@ 2025-11-05  0:48   ` Darrick J. Wong
  2025-11-05  0:49   ` [PATCH 03/22] xfs: create debugfs uuid aliases Darrick J. Wong
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:48 UTC
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Update the XFS online repair document to describe the motivation and
design of the autonomous filesystem healing agent known as xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  218 ++++++++++++++++++++
 1 file changed, 216 insertions(+), 2 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 189d1f5f40788d..79d5aa78f2a8bf 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -166,9 +166,12 @@ The current XFS tools leave several problems unsolved:
    malicious actors **exploit quirks of Unicode** to place misleading names
    in directories.
 
+8. **Site Reliability and Support Engineers** would like to reduce the
+   frequency of incidents requiring **manual intervention**.
+
 Given this definition of the problems to be solved and the actors who would
 benefit, the proposed solution is a third fsck tool that acts on a running
-filesystem.
+filesystem, and an autonomous agent that fixes problems as they arise.
 
 This new third program has three components: an in-kernel facility to check
 metadata, an in-kernel facility to repair metadata, and a userspace driver
@@ -203,6 +206,13 @@ Even if a piece of filesystem metadata can only be regenerated by scanning the
 entire system, the scan can still be done in the background while other file
 operations continue.
 
+The autonomous self healing agent should listen for metadata health impact
+reports coming from the kernel and automatically schedule repairs for the
+damaged metadata.
+If the required repairs are larger in scope than a single metadata structure,
+``xfs_scrub`` should be invoked to perform a full analysis.
+``xfs_healer`` is the name of this program.
+
 In summary, online fsck takes advantage of resource sharding and redundant
 metadata to enable targeted checking and repair operations while the system
 is running.
@@ -850,11 +860,16 @@ variable in the following service files:
 * ``xfs_scrub_all_fail.service``
 
 The decision to enable the background scan is left to the system administrator.
-This can be done by enabling either of the following services:
+This can be done system-wide by enabling either of the following services:
 
 * ``xfs_scrub_all.timer`` on systemd systems
 * ``xfs_scrub_all.cron`` on non-systemd systems
 
+To enable online repair for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To enable only scanning, the property should be set to ``check``.
+To disable online fsck entirely, the property should be set to ``none``.
+
 This automatic weekly scan is configured out of the box to perform an
 additional media scan of all file data once per month.
 This is less foolproof than, say, storing file data block checksums, but much
@@ -897,6 +912,36 @@ notifications and initiate a repair?
 *Answer*: These questions remain unanswered, but should be a part of the
 conversation with early adopters and potential downstream users of XFS.
 
+Autonomous Self Healing
+-----------------------
+
+The autonomous self healing agent is a background system service that starts
+when the filesystem is mounted and runs until unmount.
+When starting up, the agent opens a special pseudofile under the specific
+mount.
+When the filesystem generates new adverse health events, the events will be
+made available for reading via the special pseudofile.
+The events need not be limited to metadata concerns; they can also reflect
+events outside of the filesystem's direct control such as file I/O errors.
+
+The agent reads these events in a loop and responds to the events
+appropriately.
+For a single trouble report about metadata, the agent initiates a targeted
+repair of the specific structure.
+If that repair fails or the agent observes too many metadata trouble reports
+over a short interval, it should then initiate a full scan of the filesystem
+via the ``xfs_scrub`` service.
+
+The decision to enable the background service is left to the system
+administrator.  This can be done system-wide by enabling the following service:
+
+* ``xfs_healer@.service`` on systemd systems
+
+To enable autonomous healing for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To disable self healing, the property should be set to ``check``,
+``optimize``, or ``none``.
+
 5. Kernel Algorithms and Data Structures
 ========================================
 
@@ -4780,6 +4825,136 @@ Orphaned files are adopted by the orphanage as follows:
 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
    resources.
 
+Health Monitoring
+-----------------
+
+A self-correcting filesystem responds to observations of problems by scheduling
+repairs of the affected areas.
+The filesystem must therefore create event objects in response to stimuli
+(metadata corruption, file I/O errors, etc.) and dispatch these events to
+downstream consumers.
+Downstream consumers that are in the kernel itself are easy to implement with
+the ``xfs_hooks`` infrastructure created for other parts of online repair; these
+are basically indirect function calls.
+
+However, the decision to translate an adverse metadata health report into a
+repair should be made by userspace, and the actual scheduling done by userspace.
+Some users (e.g. containers) would prefer to fast-fail the container and restart
+it on another node at a previous checkpoint.
+For workloads running in isolation, repairs may be preferable; either way this
+is something the system administrator knows, and not the kernel.
+A userspace agent (``xfs_healer``, described later) will collect events from the
+kernel and dispatch them appropriately.
+
+Exporting health events to userspace requires the creation of a new component,
+known as the health monitor.
+Because the monitor exposes itself to userspace to deliver information, a file
+descriptor is the natural abstraction to use here.
+The health monitor hooks all the relevant sources of metadata health events.
+Upon activation of the hook, a new event object is created and added to a queue.
+When the agent reads from the fd, event objects are pulled from the start of the
+queue and formatted into the user's buffer.
+The events are freed, and the read call returns to userspace to allow the agent
+to perform some work.
+Memory usage is constrained on a per-fd basis to prevent memory exhaustion; if
+an event must be discarded, a special "lost event" event is delivered to the
+agent.
+
+In short, health events are captured, queued, and eventually copied out to
+userspace for dispatching.
+
+**Question**: Why use a pseudofile and not use existing notification methods?
+
+*Answer*: The pseudofile is a private filesystem interface only available to
+processes with the CAP_SYS_ADMIN capability and the ability to open the root
+directory.
+Being private gives the kernel and ``xfs_healer`` the flexibility to change
+or update the event format in the future without worrying about backwards
+compatibility.
+Using existing notifications means that the event format would be frozen in
+the public fsnotify UAPI forever, which would affect two subsystems.
+
+The pseudofile can also accept ioctls, which gives ``xfs_healer`` a solid
+means to validate that prior to a repair, its reopened mountpoint is actually
+the same filesystem that is being monitored.
+
+**Question**: Why not reuse fs/notify?
+
+*Answer*: It's much simpler for the healthmon code to manage its own queue of
+events and to wake up readers than to reuse fsnotify, because queueing and
+wakeups are the only parts of fsnotify that healthmon would use.
+
+Before I get started, an introduction: fsnotify expects its users (e.g.
+fanotify) to implement quite a bit of functionality; all it provides is a
+wrapper around a simple queue and a lot of code to convey information about the
+calling process to that user.
+fanotify has to actually implement all the queue management code on its own,
+and so would healthmon.
+
+So if healthmon used fsnotify, it would have to create its own fsnotify group
+structure.
+For our purposes, the group is a very large wrapper around a linked list, some
+counters, and a mutex.
+The group object is critical for ensuring that healthmon sees only its own
+events, and
+that nobody else (e.g. regular fanotify) ever sees these events.
+There's a lot more in there for controlling whether fanotify reports pids,
+groups, file handles, etc. that healthmon doesn't care about.
+
+Starting from the fsnotify() function call:
+
+ - I /think/ we'd have to define a new "data type", which itself is just a plain
+   int but I think they correspond to FSNOTIFY_EVENT_* values which themselves
+   are actually part of an enum.
+   The data type controls the typecasting options for the ``void *data``
+   parameter, which I guess is how I'd pass the healthmon event info from the
+   hooks into the fsnotify mechanism and back out to the healthmon code.
+
+ - Each filesystem that wants to do this probably has to add their own
+   FSNOTIFY_EVENT_{XFS,BTRFS,BFS} data type value because that's a casting
+   decision that's made inside the main fsnotify code.
+   I think this can be avoided if each fs is careful never to leak events
+   outside of the group.
+   Either way, it's harder to follow the data flows here because fsnotify can
+   only take and pass around ``void *`` pointers, and it makes various indirect
+   function calls to manage events.
+   Contrast this with doing everything with typed pointers and direct calls
+   within ``xfs_healthmon.c``.
+
+ - Since healthmon is both producer and consumer of fsnotify events, we can
+   probably define our own "mask" value.
+   It's a relief that we don't have to interact with fanotify, because fanotify
+   has used up 22 of its 32 mask bits.
+
+Once healthmon gets an event into fsnotify, fsnotify will call back (into
+healthmon!) to tell it that it got an event.
+From there, the fsnotify implementation (healthmon) has to allocate an event
+object and add it to the event queue in the group, which is what it already does
+now.
+Overflow control is up to the fsnotify implementation, which healthmon already
+implements.
+
+After the event is queued, the fsnotify implementation also has to implement its
+own read file op to dequeue an event and copy it to the userspace buffer in
+whatever format it likes.
+Again, healthmon already does all this.
+
+In the end, replacing the homegrown event dispatching in healthmon with fsnotify
+would make the data flows much harder to understand, and all we gain is a
+generic event dispatcher that relies on indirect function calls instead of
+direct ones.
+We still have to implement the queuing discipline ourselves! :(
+
+**Future Work Question**: Should these events be exposed through the fanotify
+filesystem error event interface?
+
+*Answer*: Yes.
+fanotify is much more careful about filtering out events to processes that
+aren't running with privileges.
+These processes should have a means to receive simple notifications about
+file errors.
+However, this will require coordination between fanotify, ext4, and XFS, and
+is (for now) outside the scope of this project.
+
 6. Userspace Algorithms and Data Structures
 ===========================================
 
@@ -5071,6 +5246,45 @@ and report what has been lost.
 For media errors in blocks owned by files, parent pointers can be used to
 construct file paths from inode numbers for user-friendly reporting.
 
+Autonomous Self Healing
+-----------------------
+
+When a filesystem mounts, the Linux kernel initiates a uevent describing the
+mount and the path to the data device.
+A udev rule determines the initial mountpoint from the data device path
+and starts a mount-specific ``xfs_healer`` service instance.
+The ``xfs_healer`` service opens the mountpoint and issues the
+XFS_IOC_HEALTH_MONITOR ioctl to open a special health monitoring file.
+After that is set up, the mountpoint is closed to avoid pinning the mount.
+
+The health monitoring file hooks certain points of the filesystem so that it
+may receive events about metadata health, filesystem shutdowns, media errors,
+file I/O errors, and unmounting of the filesystem.
+Events are queued up for each health monitor file and encoded into a
+``struct xfs_health_monitor_event`` object when the agent calls ``read()`` on
+the file.
+All health events are dispatched to a background threadpool to reduce stalls
+in the main event loop.
+Events can be logged into the system log for further analysis.
+
+For metadata health events, the specific details are used to construct a call
+to the scrub ioctl.
+The filesystem mountpoint is reopened, and the kernel is called.
+If events are lost or the repairs fail, a full scan will be initiated by
+starting up an ``xfs_scrub@.service`` for the given mountpoint.
+
+A filesystem shutdown causes all future repair work to cease, and an unmount
+causes the agent to exit.
+
+**Future Work Question**: Should the healer daemon also register a dbus
+listener and publish events there?
+
+*Answer*: This is unclear -- if there's a demand for system monitoring daemons
+to consume this information and make decisions, then yes, this could be wired
+up in ``xfs_healer``.
+On the other hand, systemd is in the middle of a transition to varlink, so
+it makes more sense to wait and see what happens.
+
 7. Conclusion and Future Work
 =============================
 

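For readers following the Health Monitoring section of the design
document, the bounded per-fd queue with its "lost event" marker can be
sketched in plain userspace C as below.  This is only an illustration
under stated assumptions, not the xfs_healthmon code: hm_queue,
hm_event, hm_queue_push, hm_queue_pop and the cap HM_QUEUE_MAX are all
hypothetical names, and a fixed ring stands in for whatever allocation
scheme the kernel really uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not the xfs_healthmon code. */
#define HM_QUEUE_MAX	4	/* assumed per-fd cap on queued events */

enum hm_event_type {
	HM_EVENT_SICK = 1,	/* an ordinary health report */
	HM_EVENT_LOST,		/* marker: reports were discarded */
};

struct hm_event {
	enum hm_event_type	type;
	uint64_t		payload; /* for HM_EVENT_LOST: number lost */
};

struct hm_queue {
	struct hm_event		events[HM_QUEUE_MAX];
	size_t			head;	/* index of the oldest event */
	size_t			count;	/* number of queued events */
};

/* Return a pointer to the slot at logical position @pos in the ring. */
static struct hm_event *
hm_queue_slot(struct hm_queue *q, size_t pos)
{
	return &q->events[(q->head + pos) % HM_QUEUE_MAX];
}

/* Producer side: queue an event, or account for it in a lost marker. */
static void
hm_queue_push(struct hm_queue *q, enum hm_event_type type, uint64_t payload)
{
	struct hm_event	*ev;

	/* Keep the final slot in reserve for a lost-event marker. */
	if (HM_QUEUE_MAX - q->count > 1) {
		ev = hm_queue_slot(q, q->count++);
		ev->type = type;
		ev->payload = payload;
		return;
	}

	/* If the newest entry is already a marker, just bump its count. */
	ev = hm_queue_slot(q, q->count - 1);
	if (ev->type == HM_EVENT_LOST) {
		ev->payload++;
		return;
	}

	/* Otherwise spend the reserved slot on a new marker. */
	ev = hm_queue_slot(q, q->count++);
	ev->type = HM_EVENT_LOST;
	ev->payload = 1;
}

/* Consumer side (the read() path): dequeue the oldest event, if any. */
static bool
hm_queue_pop(struct hm_queue *q, struct hm_event *out)
{
	if (q->count == 0)
		return false;
	*out = *hm_queue_slot(q, 0);
	q->head = (q->head + 1) % HM_QUEUE_MAX;
	q->count--;
	return true;
}
```

Reserving the last slot means the consumer always learns that it missed
something, and how much, without the producer ever having to allocate
memory at the moment the queue overflows.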

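The escalation rule in the "Autonomous Self Healing" text (targeted
repair first, full xfs_scrub run if the repair fails or too many
reports arrive in a short interval) might look roughly like the
following.  The thresholds HEAL_WINDOW_SECS and HEAL_MAX_REPORTS, the
heal_policy structure and heal_policy_next() are assumptions made up
for this sketch; the document does not specify values, and xfs_healer
may well decide differently.  The caller is assumed to pass a
monotonic timestamp in seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds; the design document does not specify values. */
#define HEAL_WINDOW_SECS	60
#define HEAL_MAX_REPORTS	3

enum heal_action {
	HEAL_TARGETED_REPAIR,	/* repair the one reported structure */
	HEAL_FULL_SCAN,		/* hand the filesystem over to xfs_scrub */
};

struct heal_policy {
	uint64_t	window_start;	/* time of first report in window */
	unsigned int	reports;	/* reports seen in this window */
};

/* Decide what to do about a metadata trouble report arriving at @now. */
static enum heal_action
heal_policy_next(struct heal_policy *p, uint64_t now, bool last_repair_failed)
{
	/* Start a new window if none is open or the old one has expired. */
	if (p->reports == 0 || now - p->window_start >= HEAL_WINDOW_SECS) {
		p->window_start = now;
		p->reports = 0;
	}
	p->reports++;

	if (last_repair_failed || p->reports > HEAL_MAX_REPORTS)
		return HEAL_FULL_SCAN;
	return HEAL_TARGETED_REPAIR;
}
```

A fixed window is the simplest choice here; a sliding window or token
bucket would smooth out bursts that straddle a window boundary.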

* [PATCH 03/22] xfs: create debugfs uuid aliases
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
  2025-11-05  0:48   ` [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
  2025-11-05  0:48   ` [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
@ 2025-11-05  0:49   ` Darrick J. Wong
  2025-11-05  0:49   ` [PATCH 04/22] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (18 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:49 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an alias for the debugfs dir so that we can find a filesystem by
uuid, unless it's mounted nouuid.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   11 +++++++++++
 2 files changed, 12 insertions(+)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b871dfde372b52..94108668ddabbd 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -289,6 +289,7 @@ typedef struct xfs_mount {
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct xfs_zone_info	*m_zone_info;	/* zone allocator information */
 	struct dentry		*m_debugfs;	/* debugfs parent */
+	struct dentry		*m_debugfs_uuid; /* debugfs symlink */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1067ebb3b001bf..ba07e4a4ae3ffa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -819,6 +819,7 @@ xfs_mount_free(
 	if (mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_ddev_targp);
 
+	debugfs_remove(mp->m_debugfs_uuid);
 	debugfs_remove(mp->m_debugfs);
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
@@ -1969,6 +1970,16 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+		char	name[UUID_STRING_LEN + 1];
+
+		snprintf(name, UUID_STRING_LEN + 1, "%pU", &mp->m_sb.sb_uuid);
+		mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+				mp->m_super->s_id);
+	} else {
+		mp->m_debugfs_uuid = NULL;
+	}
+
 	return 0;
 
  out_filestream_unmount:



* [PATCH 04/22] xfs: create hooks for monitoring health updates
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-11-05  0:49   ` [PATCH 03/22] xfs: create debugfs uuid aliases Darrick J. Wong
@ 2025-11-05  0:49   ` Darrick J. Wong
  2025-11-05  0:49   ` [PATCH 05/22] xfs: create a filesystem shutdown hook Darrick J. Wong
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:49 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   47 ++++++++++
 fs/xfs/xfs_mount.h         |    3 +
 fs/xfs/xfs_health.c        |  204 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c         |    1 
 4 files changed, 254 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 94108668ddabbd..3f20baaf9cc226 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,6 +343,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed dirent updates to an active online repair. */
 	struct xfs_hooks	m_dir_update_hooks;
+
+	/* Hook to feed health events to a daemon. */
+	struct xfs_hooks	m_health_update_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 7c541fb373d5b2..71952d5eec2a9e 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -20,6 +20,159 @@
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of health updates.  If the
+ * compiler supports jump labels, the static branch will be replaced by a nop
+ * sled when there are no hook users.  Health monitoring is currently the only
+ * caller, so this is a reasonable tradeoff, because health event status
+ * updates can be very frequent when xfs_scrub is running, and we don't know
+ * if xfs_healthmon has attached to this filesystem.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch);
+
+void
+xfs_health_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_health_hooks_switch);
+}
+
+void
+xfs_health_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_health_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem unmount health update. */
+static inline void
+xfs_health_unmount_hook(
+	struct xfs_mount		*mp)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+		};
+
+		xfs_hooks_call(&mp->m_health_update_hooks,
+				XFS_HEALTHUP_UNMOUNT, &p);
+	}
+}
+
+/* Call downstream hooks for a filesystem health update. */
+static inline void
+xfs_fs_health_update_hook(
+	struct xfs_mount		*mp,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+		};
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a group health update. */
+static inline void
+xfs_group_health_update_hook(
+	struct xfs_group		*xg,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.group		= xg->xg_gno,
+		};
+		struct xfs_mount	*mp = xg->xg_mount;
+
+		switch (xg->xg_type) {
+		case XG_TYPE_AG:
+			p.domain = XFS_HEALTHUP_AG;
+			break;
+		case XG_TYPE_RTG:
+			p.domain = XFS_HEALTHUP_RTGROUP;
+			break;
+		default:
+			ASSERT(0);
+			return;
+		}
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for an inode health update. */
+static inline void
+xfs_inode_health_update_hook(
+	struct xfs_inode		*ip,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_INODE,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.ino		= ip->i_ino,
+			.gen		= VFS_I(ip)->i_generation,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a health update. */
+int
+xfs_health_hook_add(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Stop calling the specified function during a health update. */
+void
+xfs_health_hook_del(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Configure health update hook functions. */
+void
+xfs_health_hook_setup(
+	struct xfs_health_hook	*hook,
+	notifier_fn_t		mod_fn)
+{
+	xfs_hook_setup(&hook->health_hook, mod_fn);
+}
+#else
+# define xfs_health_unmount_hook(...)			((void)0)
+# define xfs_fs_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_rt_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_group_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+# define xfs_inode_health_update_hook(a,b,o,n)		do {o = o;} while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 static void
 xfs_health_unmount_group(
 	struct xfs_group	*xg,
@@ -50,8 +203,10 @@ xfs_health_unmount(
 	unsigned int		checked = 0;
 	bool			warn = false;
 
-	if (xfs_is_shutdown(mp))
+	if (xfs_is_shutdown(mp)) {
+		xfs_health_unmount_hook(mp);
 		return;
+	}
 
 	/* Measure AG corruption levels. */
 	while ((pag = xfs_perag_next(mp, pag)))
@@ -97,6 +252,8 @@ xfs_health_unmount(
 		if (sick & XFS_SICK_FS_COUNTERS)
 			xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
 	}
+
+	xfs_health_unmount_hook(mp);
 }
 
 /* Mark unhealthy per-fs metadata. */
@@ -105,12 +262,17 @@ xfs_fs_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark per-fs metadata as having been checked and found unhealthy by fsck. */
@@ -119,13 +281,18 @@ xfs_fs_mark_corrupt(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_corrupt(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark a per-fs metadata healed. */
@@ -134,15 +301,20 @@ xfs_fs_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick &= ~mask;
 	if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
 		mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-fs metadata are unhealthy. */
@@ -192,12 +364,17 @@ xfs_group_mark_sick(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_sick(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /*
@@ -208,13 +385,18 @@ xfs_group_mark_corrupt(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_corrupt(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick |= mask;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /*
@@ -225,15 +407,20 @@ xfs_group_mark_healthy(
 	struct xfs_group	*xg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	xfs_group_check_mask(xg, mask);
 	trace_xfs_group_mark_healthy(xg, mask);
 
 	spin_lock(&xg->xg_state_lock);
+	old_mask = xg->xg_sick;
 	xg->xg_sick &= ~mask;
 	if (!(xg->xg_sick & XFS_SICK_AG_PRIMARY))
 		xg->xg_sick &= ~XFS_SICK_AG_SECONDARY;
 	xg->xg_checked |= mask;
 	spin_unlock(&xg->xg_state_lock);
+
+	xfs_group_health_update_hook(xg, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-ag metadata are unhealthy. */
@@ -272,10 +459,13 @@ xfs_inode_mark_sick(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_sick(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -287,6 +477,8 @@ xfs_inode_mark_sick(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark inode metadata as having been checked and found unhealthy by fsck. */
@@ -295,10 +487,13 @@ xfs_inode_mark_corrupt(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_corrupt(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
@@ -311,6 +506,8 @@ xfs_inode_mark_corrupt(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark parts of an inode healed. */
@@ -319,15 +516,20 @@ xfs_inode_mark_healthy(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_healthy(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick &= ~mask;
 	if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
 		ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which parts of an inode are unhealthy. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ba07e4a4ae3ffa..84cbba0ab698aa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2291,6 +2291,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;

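A note on the two masks: in this patch the *_mark_sick, *_mark_corrupt
and *_mark_healthy helpers pass the sick mask sampled before the update
as old_mask, and the caller's own mask argument (not the resulting sick
mask) as new_mask.  A hook consumer that only cares about state
transitions could derive them as sketched below.  The hm_update_type
enum and hm_changed_bits() are hypothetical stand-ins written for
userspace; no such helper exists in the patch, and a real consumer is
free to interpret the masks differently.

```c
/* Hypothetical mirror of the update types a consumer would see. */
enum hm_update_type {
	HM_UPDATE_SICK = 1,	/* runtime corruption observed */
	HM_UPDATE_CORRUPT,	/* fsck reported corruption */
	HM_UPDATE_HEALTHY,	/* fsck reported healthy structure */
};

/*
 * Given the pre-update sick mask and the bits named in the update, return
 * only the bits whose sick state actually changed.
 */
static unsigned int
hm_changed_bits(enum hm_update_type op, unsigned int old_mask,
		unsigned int new_mask)
{
	switch (op) {
	case HM_UPDATE_SICK:
	case HM_UPDATE_CORRUPT:
		/* Newly sick: named in the update, not sick before. */
		return new_mask & ~old_mask;
	case HM_UPDATE_HEALTHY:
		/* Newly healed: named in the update, sick before. */
		return new_mask & old_mask;
	}
	return 0;
}
```

Filtering this way would let a consumer avoid queueing a fresh event
each time an already-sick structure is reported sick again.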


* [PATCH 05/22] xfs: create a filesystem shutdown hook
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-11-05  0:49   ` [PATCH 04/22] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2025-11-05  0:49   ` Darrick J. Wong
  2025-11-05  0:49   ` [PATCH 06/22] xfs: create hooks for media errors Darrick J. Wong
                     ` (16 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:49 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a hook so that health monitoring can report filesystem shutdown
events to userspace.  Shutdowns should be infrequent, so we don't bother
with a static key here.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.h |   11 +++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_fsops.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c |    1 +
 4 files changed, 57 insertions(+)


diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 9d23c361ef56e4..ea5561b8580574 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -15,4 +15,15 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+	struct xfs_hook			shutdown_hook;
+};
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+		notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3f20baaf9cc226..2d4305d91a3cd9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -346,6 +346,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed health events to a daemon. */
 	struct xfs_hooks	m_health_update_hooks;
+
+	/* Hook to feed shutdown events to a daemon. */
+	struct xfs_hooks	m_shutdown_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 0ada735693945c..26ed16e67410d7 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -482,6 +482,46 @@ xfs_fs_goingdown(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+	struct xfs_mount		*mp,
+	uint32_t			flags)
+{
+	xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a shutdown update. */
+int
+xfs_shutdown_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a shutdown update. */
+void
+xfs_shutdown_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown update hook functions. */
+void
+xfs_shutdown_hook_setup(
+	struct xfs_shutdown_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -540,6 +580,8 @@ xfs_do_force_shutdown(
 		"Please unmount the filesystem and rectify the problem(s)");
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
+
+	xfs_shutdown_hook(mp, flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 84cbba0ab698aa..599900b9b0dd63 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2291,6 +2291,7 @@ xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 
 	fc->s_fs_info = mp;



* [PATCH 06/22] xfs: create hooks for media errors
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-11-05  0:49   ` [PATCH 05/22] xfs: create a filesystem shutdown hook Darrick J. Wong
@ 2025-11-05  0:49   ` Darrick J. Wong
  2025-11-05  0:50   ` [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:49 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a media error event hook so that we can send events to userspace.
Media errors are not expected to be frequent, so we don't have a static
key guarding them here either.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
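Not part of the patch: a userspace model of the hook pattern used here, in
case it helps review.  The list-based chain, the callback signature, and all
names below are assumptions made for illustration; the real code goes through
struct xfs_hooks and notifier_fn_t.  The point is only that the call site
packages the failed device, start daddr, length, and pre-remove flag into one
params struct and passes it to every registered callback.

```c
/* Userspace model of the media error hook; this is not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum failed_device { FAILED_DATADEV, FAILED_LOGDEV, FAILED_RTDEV };

/* Mirrors the fields of struct xfs_media_error_params, minus the mount. */
struct media_error_params {
	enum failed_device	fdev;
	int64_t			daddr;		/* start, in 512-byte basic blocks */
	uint64_t		bbcount;	/* length, in basic blocks */
	bool			pre_remove;
};

typedef int (*media_error_fn)(const struct media_error_params *p, void *priv);

/* Hypothetical stand-in for struct xfs_hook: one entry in a callback list. */
struct media_error_hook {
	struct media_error_hook	*next;
	media_error_fn		fn;
	void			*priv;
};

/* Hypothetical stand-in for struct xfs_hooks. */
struct hook_chain {
	struct media_error_hook	*head;
};

/* Start calling the hook's function on every media error. */
static void hook_add(struct hook_chain *chain, struct media_error_hook *hook)
{
	assert(hook->fn != NULL);
	hook->next = chain->head;
	chain->head = hook;
}

/* Stop calling the hook's function. */
static void hook_del(struct hook_chain *chain, struct media_error_hook *hook)
{
	struct media_error_hook	**pp;

	for (pp = &chain->head; *pp; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			hook->next = NULL;
			return;
		}
	}
}

/* Package the arguments once and hand them to every registered callback. */
static void media_error_call(struct hook_chain *chain, enum failed_device fdev,
		int64_t daddr, uint64_t bbcount, bool pre_remove)
{
	struct media_error_params	p = {
		.fdev		= fdev,
		.daddr		= daddr,
		.bbcount	= bbcount,
		.pre_remove	= pre_remove,
	};
	struct media_error_hook		*hook;

	for (hook = chain->head; hook; hook = hook->next)
		hook->fn(&p, hook->priv);
}

/* Example callback: accumulate the number of bytes reported as failed. */
static int count_lost_bytes(const struct media_error_params *p, void *priv)
{
	uint64_t	*total = priv;

	*total += p->bbcount << 9;
	return 0;
}
```

In this model a consumer would embed struct media_error_hook in its own
state, call hook_add() when monitoring starts, and call hook_del() before
freeing that state.
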
 fs/xfs/xfs_mount.h          |    3 ++
 fs/xfs/xfs_notify_failure.h |   33 +++++++++++++++++++++
 fs/xfs/xfs_notify_failure.c |   68 ++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_super.c          |    1 +
 4 files changed, 100 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2d4305d91a3cd9..0feb0fb685f51f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -349,6 +349,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed shutdown events to a daemon. */
 	struct xfs_hooks	m_shutdown_hooks;
+
+	/* Hook to feed media error events to a daemon. */
+	struct xfs_hooks	m_media_error_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 8d08ec29dd2949..2695732ec20875 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -8,4 +8,37 @@
 
 extern const struct dax_holder_operations xfs_dax_holder_operations;
 
+enum xfs_failed_device {
+	XFS_FAILED_DATADEV,
+	XFS_FAILED_LOGDEV,
+	XFS_FAILED_RTDEV,
+};
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+struct xfs_media_error_params {
+	struct xfs_mount		*mp;
+	enum xfs_failed_device		fdev;
+	xfs_daddr_t			daddr;
+	uint64_t			bbcount;
+	bool				pre_remove;
+};
+
+struct xfs_media_error_hook {
+	struct xfs_hook			error_hook;
+};
+
+int xfs_media_error_hook_add(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_del(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook,
+		notifier_fn_t mod_fn);
+#else
+struct xfs_media_error_params { };
+struct xfs_media_error_hook { };
+# define xfs_media_error_hook_add(...)		(0)
+# define xfs_media_error_hook_del(...)		((void)0)
+# define xfs_media_error_hook_setup(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS && CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index b1767288994206..557f4bf3463dcb 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -27,6 +27,57 @@
 #include <linux/dax.h>
 #include <linux/fs.h>
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks for a media error. */
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	struct xfs_media_error_params	p = {
+		.mp			= mp,
+		.fdev			= fdev,
+		.daddr			= daddr,
+		.bbcount		= bbcount,
+		.pre_remove		= pre_remove,
+	};
+
+	xfs_hooks_call(&mp->m_media_error_hooks, 0, &p);
+}
+
+/* Call the specified function during a media error. */
+int
+xfs_media_error_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Stop calling the specified function during a media error. */
+void
+xfs_media_error_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Configure media error hook functions. */
+void
+xfs_media_error_hook_setup(
+	struct xfs_media_error_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->error_hook, mod_fn);
+}
+#else
+# define xfs_media_error_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -215,6 +266,9 @@ xfs_dax_notify_logdev_failure(
 	if (error)
 		return error;
 
+	xfs_media_error_hook(mp, XFS_FAILED_LOGDEV, daddr, bblen,
+			mf_flags & MF_MEM_PRE_REMOVE);
+
 	/*
 	 * In the pre-remove case the failure notification is attempting to
 	 * trigger a force unmount.  The expectation is that the device is
@@ -248,16 +302,20 @@ xfs_dax_notify_dev_failure(
 	uint64_t		bblen;
 	struct xfs_group	*xg = NULL;
 
+	error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
+			offset, len, &daddr, &bblen);
+	if (error)
+		return error;
+
+	xfs_media_error_hook(mp, type == XG_TYPE_RTG ?
+			XFS_FAILED_RTDEV : XFS_FAILED_DATADEV,
+			daddr, bblen, mf_flags & MF_MEM_PRE_REMOVE);
+
 	if (!xfs_has_rmapbt(mp)) {
 		xfs_debug(mp, "notify_failure() needs rmapbt enabled!");
 		return -EOPNOTSUPP;
 	}
 
-	error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
-			offset, len, &daddr, &bblen);
-	if (error)
-		return error;
-
 	if (type == XG_TYPE_RTG) {
 		start_bno = xfs_daddr_to_rtb(mp, daddr);
 		end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 599900b9b0dd63..fb72a4976e8570 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2293,6 +2293,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_dir_update_hooks);
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
+	xfs_hooks_init(&mp->m_media_error_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-11-05  0:49   ` [PATCH 06/22] xfs: create hooks for media errors Darrick J. Wong
@ 2025-11-05  0:50   ` Darrick J. Wong
  2025-11-05  0:50   ` [PATCH 08/22] iomap: report directio read and write errors to callers Darrick J. Wong
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:50 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Provide a callback so that iomap can report read and write IO errors to
the caller filesystem.  For now this is only wired up for iomap as a
testbed for XFS.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
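Not part of the patch: a small userspace sketch of the optional-callback
dispatch described above.  The struct and function names are made up for the
example and only imitate the shape of address_space_operations; the DIR_READ
and DIR_WRITE constants stand in for the kernel's READ and WRITE.  A mapping
whose ops table has no ioerror method is simply skipped, so filesystems that
do not care pay nothing beyond the pointer check.

```c
/* Userspace model of an optional per-mapping ioerror callback. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { DIR_READ = 0, DIR_WRITE = 1 };

struct mapping;

/* Hypothetical ops table; ioerror may be left NULL. */
struct mapping_ops {
	void (*ioerror)(struct mapping *m, int direction, int64_t pos,
			uint64_t len, int error);
};

struct mapping {
	const struct mapping_ops	*ops;
	unsigned int			nr_read_errors;
	unsigned int			nr_write_errors;
	int				last_error;
};

/* Report an IO error if, and only if, the owner asked to hear about it. */
static void mapping_ioerror(struct mapping *m, int direction, int64_t pos,
		uint64_t len, int error)
{
	if (m && m->ops && m->ops->ioerror)
		m->ops->ioerror(m, direction, pos, len, error);
}

/* Example callback: keep per-direction counters and the last errno. */
static void count_ioerror(struct mapping *m, int direction, int64_t pos,
		uint64_t len, int error)
{
	(void)pos;
	(void)len;

	assert(error < 0);	/* errors are negative errno values */
	if (direction == DIR_WRITE)
		m->nr_write_errors++;
	else
		m->nr_read_errors++;
	m->last_error = error;
}

static const struct mapping_ops counting_ops = { .ioerror = count_ioerror };
static const struct mapping_ops silent_ops = { .ioerror = NULL };
```

Because the real callback may run in interrupt context, a kernel
implementation should do no more than this example does: record the event
and defer any heavy work.
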
 fs/iomap/internal.h               |    2 ++
 include/linux/fs.h                |    4 ++++
 Documentation/filesystems/vfs.rst |    7 +++++++
 fs/iomap/buffered-io.c            |   27 +++++++++++++++++++++++++--
 fs/iomap/ioend.c                  |    4 ++++
 5 files changed, 42 insertions(+), 2 deletions(-)


diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
index d05cb3aed96e79..06d9145b6be4fa 100644
--- a/fs/iomap/internal.h
+++ b/fs/iomap/internal.h
@@ -5,5 +5,7 @@
 #define IOEND_BATCH_SIZE	4096
 
 u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+		loff_t pos, u64 len, int error);
 
 #endif /* _IOMAP_INTERNAL_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c895146c1444be..5e4b3a4b24823f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -477,6 +477,10 @@ struct address_space_operations {
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
 	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+
+	/* Callback for dealing with IO errors during readahead or writeback */
+	void (*ioerror)(struct address_space *mapping, int direction,
+			loff_t pos, u64 len, int error);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f13b01e42eb5e..9e70006bf99a63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -822,6 +822,8 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
 		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+		void (*ioerror)(struct address_space *mapping, int direction,
+				loff_t pos, u64 len, int error);
 	};
 
 ``read_folio``
@@ -1032,6 +1034,11 @@ cache in your filesystem.  The following members are defined:
 ``swap_rw``
 	Called to read or write swap pages when SWP_FS_OPS is set.
 
+``ioerror``
+        Called to deal with IO errors during readahead or writeback.
+        This may be called from interrupt context, and without any
+        locks necessarily being held.
+
 The File Object
 ===============
 
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b847a1e27f13e..8dd5421cb910b5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -288,6 +288,14 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 		pos >= i_size_read(iter->inode);
 }
 
+inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+		loff_t pos, u64 len, int error)
+{
+	if (mapping && mapping->a_ops->ioerror)
+		mapping->a_ops->ioerror(mapping, direction, pos, len,
+				error);
+}
+
 /**
  * iomap_read_inline_data - copy inline data into the page cache
  * @iter: iteration structure
@@ -310,8 +318,11 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
 	if (folio_test_uptodate(folio))
 		return 0;
 
-	if (WARN_ON_ONCE(size > iomap->length))
+	if (WARN_ON_ONCE(size > iomap->length)) {
+		iomap_mapping_ioerror(folio->mapping, READ, iomap->offset,
+				size, -EIO);
 		return -EIO;
+	}
 	if (offset > 0)
 		ifs_alloc(iter->inode, folio, iter->flags);
 
@@ -339,6 +350,10 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
 		spin_unlock_irqrestore(&ifs->state_lock, flags);
 	}
 
+	if (error)
+		iomap_mapping_ioerror(folio->mapping, READ,
+				folio_pos(folio) + off, len, error);
+
 	if (finished)
 		folio_end_read(folio, uptodate);
 }
@@ -558,11 +573,15 @@ static int iomap_read_folio_range(const struct iomap_iter *iter,
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct bio_vec bvec;
 	struct bio bio;
+	int ret;
 
 	bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
 	bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
 	bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
-	return submit_bio_wait(&bio);
+	ret = submit_bio_wait(&bio);
+	if (ret)
+		iomap_mapping_ioerror(folio->mapping, READ, pos, len, ret);
+	return ret;
 }
 #else
 static int iomap_read_folio_range(const struct iomap_iter *iter,
@@ -1674,6 +1693,7 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
 	u64 pos = folio_pos(folio);
 	u64 end_pos = pos + folio_size(folio);
 	u64 end_aligned = 0;
+	loff_t orig_pos = pos;
 	bool wb_pending = false;
 	int error = 0;
 	u32 rlen;
@@ -1724,6 +1744,9 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
 
 	if (wb_pending)
 		wpc->nr_folios++;
+	if (error && pos > orig_pos)
+		iomap_mapping_ioerror(inode->i_mapping, WRITE, orig_pos, 0,
+				error);
 
 	/*
 	 * We can have dirty bits set past end of file in page_mkwrite path
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index b49fa75eab260a..56e654f2d36fe9 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -55,6 +55,10 @@ static u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
 
 	/* walk all folios in bio, ending page IO on them */
 	bio_for_each_folio_all(fi, bio) {
+		if (ioend->io_error)
+			iomap_mapping_ioerror(inode->i_mapping, WRITE,
+					folio_pos(fi.folio) + fi.offset,
+					fi.length, ioend->io_error);
 		iomap_finish_folio_write(inode, fi.folio, fi.length);
 		folio_count++;
 	}



* [PATCH 08/22] iomap: report directio read and write errors to callers
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2025-11-05  0:50   ` [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
@ 2025-11-05  0:50   ` Darrick J. Wong
  2025-11-05  0:50   ` [PATCH 09/22] xfs: create file io error hooks Darrick J. Wong
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:50 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a hook to report direct I/O errors to the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/linux/iomap.h |    2 ++
 fs/iomap/direct-io.c  |    4 ++++
 2 files changed, 6 insertions(+)


diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 73dceabc21c8c7..ca1590e5002342 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,6 +486,8 @@ struct iomap_dio_ops {
 		      unsigned flags);
 	void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
 		          loff_t file_offset);
+	void (*ioerror)(struct inode *inode, int direction, loff_t pos,
+			u64 len, int error);
 
 	/*
 	 * Filesystems wishing to attach private information to a direct io bio
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5d5d63efbd5767..1512d8dbb0d2e7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -95,6 +95,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+	if (dio->error && dops && dops->ioerror)
+		dops->ioerror(file_inode(iocb->ki_filp),
+				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
+				offset, dio->size, dio->error);
 
 	if (likely(!ret)) {
 		ret = dio->size;



* [PATCH 09/22] xfs: create file io error hooks
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (7 preceding siblings ...)
  2025-11-05  0:50   ` [PATCH 08/22] iomap: report directio read and write errors to callers Darrick J. Wong
@ 2025-11-05  0:50   ` Darrick J. Wong
  2025-11-05  0:51   ` [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:50 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks within XFS to deliver IO errors to callers.  File I/O
errors are usually rare, so we don't employ a static key here.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
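Not part of the patch: a userspace sketch of the deferral scheme used below.
The error callbacks can fire in a context that must not sleep, so the
producer only allocates a small record and queues it; a worker later walks
the queue, calls the hooks, and frees each record.  The FIFO, the function
names, and the "lost" counter are assumptions for illustration; the kernel
code uses kzalloc(GFP_ATOMIC), a work_struct per record, and a workqueue.
The sketch hands the whole record, including the error code, to the
callback.

```c
/* Userspace model of "record now, deliver later" IO error reporting. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum ioerror_type {
	BUFFERED_READ,
	BUFFERED_WRITE,
	DIRECT_READ,
	DIRECT_WRITE,
	DATA_LOST,
};

struct ioerror_report {
	struct ioerror_report	*next;
	uint64_t		ino;
	int64_t			pos;
	uint64_t		len;
	uint32_t		gen;
	int			error;
	enum ioerror_type	type;
};

/* Hypothetical stand-in for the workqueue: a plain FIFO of reports. */
struct report_queue {
	struct ioerror_report	*head;
	struct ioerror_report	**tail;
	unsigned int		lost;	/* reports dropped on allocation failure */
};

static void queue_init(struct report_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
	q->lost = 0;
}

/* Map (direct?, write?) onto the event type, as the two callbacks do. */
static enum ioerror_type classify(int direct, int write)
{
	if (direct)
		return write ? DIRECT_WRITE : DIRECT_READ;
	return write ? BUFFERED_WRITE : BUFFERED_READ;
}

/* Producer: cheap and non-blocking; a failed allocation loses the report. */
static int report_ioerror(struct report_queue *q, uint64_t ino, uint32_t gen,
		int direct, int write, int64_t pos, uint64_t len, int error)
{
	struct ioerror_report	*r = calloc(1, sizeof(*r));

	if (!r) {
		q->lost++;
		return -1;
	}
	r->ino = ino;
	r->gen = gen;
	r->pos = pos;
	r->len = len;
	r->error = error;
	r->type = classify(direct, write);
	*q->tail = r;
	q->tail = &r->next;
	return 0;
}

typedef void (*ioerror_fn)(const struct ioerror_report *r, void *priv);

/* Consumer: runs later, delivers in order, returns the number delivered. */
static unsigned int drain_reports(struct report_queue *q, ioerror_fn fn,
		void *priv)
{
	struct ioerror_report	*r;
	unsigned int		n = 0;

	assert(fn != NULL);
	while ((r = q->head) != NULL) {
		q->head = r->next;
		fn(r, priv);
		free(r);
		n++;
	}
	q->tail = &q->head;
	return n;
}

/* Example callback: remember the most recent report. */
static void remember_last(const struct ioerror_report *r, void *priv)
{
	struct ioerror_report	*out = priv;

	*out = *r;
	out->next = NULL;
}
```

Recording the inode number and generation rather than an inode pointer means
the queued record does not pin the inode while it waits for delivery.
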
 fs/xfs/xfs_file.h           |   37 +++++++++
 fs/xfs/xfs_mount.h          |    3 +
 fs/xfs/xfs_aops.c           |    2 
 fs/xfs/xfs_file.c           |  174 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_notify_failure.c |    5 +
 fs/xfs/xfs_super.c          |    1 
 6 files changed, 221 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index 2ad91f755caf35..441f8a693bb884 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -12,4 +12,41 @@ extern const struct file_operations xfs_dir_file_operations;
 bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos,
 		long long int len);
 
+enum xfs_file_ioerror_type {
+	XFS_FILE_IOERROR_BUFFERED_READ,
+	XFS_FILE_IOERROR_BUFFERED_WRITE,
+	XFS_FILE_IOERROR_DIRECT_READ,
+	XFS_FILE_IOERROR_DIRECT_WRITE,
+	XFS_FILE_IOERROR_DATA_LOST,
+};
+
+struct xfs_file_ioerror_params {
+	xfs_ino_t		ino;
+	loff_t			pos;
+	u64			len;
+	u32			gen;
+	int			error;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror_hook {
+	struct xfs_hook			ioerror_hook;
+};
+
+int xfs_file_ioerror_hook_add(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_del(struct xfs_mount *mp,
+		struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook,
+		notifier_fn_t mod_fn);
+
+void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos,
+		u64 len, int error);
+
+void xfs_inode_media_error(struct xfs_inode *ip, loff_t pos, u64 len);
+#else
+# define xfs_vm_ioerror			NULL
+# define xfs_inode_media_error(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0feb0fb685f51f..2d7f9ccba5287e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -352,6 +352,9 @@ typedef struct xfs_mount {
 
 	/* Hook to feed media error events to a daemon. */
 	struct xfs_hooks	m_media_error_hooks;
+
+	/* Hook to feed file io error events to a daemon. */
+	struct xfs_hooks	m_file_ioerror_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a26f798155331f..f3f28b9ae0f70e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,6 +22,7 @@
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
+#include "xfs_file.h"
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -810,6 +811,7 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
 	.swap_activate		= xfs_vm_swap_activate,
+	.ioerror		= xfs_vm_ioerror,
 };
 
 const struct address_space_operations xfs_dax_aops = {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2702fef2c90cd2..f5988904f5d44d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -222,6 +222,176 @@ xfs_ilock_iocb_for_write(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror {
+	struct work_struct		work;
+	struct xfs_mount		*mp;
+	xfs_ino_t			ino;
+	loff_t				pos;
+	u64				len;
+	u32				gen;
+	int				error;
+	enum xfs_file_ioerror_type	type;
+};
+
+/* Call downstream hooks for a file io error update. */
+STATIC void
+xfs_file_report_ioerror(
+	struct work_struct	*work)
+{
+	struct xfs_file_ioerror	*ioerr =
+		container_of(work, struct xfs_file_ioerror, work);
+	struct xfs_file_ioerror_params	p = {
+		.ino		= ioerr->ino,
+		.gen		= ioerr->gen,
+		.pos		= ioerr->pos,
+		.len		= ioerr->len,
+	};
+	struct xfs_mount	*mp = ioerr->mp;
+
+	xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p);
+	kfree(ioerr);
+}
+
+/* Queue a directio io error notification. */
+STATIC void
+xfs_dio_ioerror(
+	struct inode		*inode,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+	if (!ioerr) {
+		xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+				ip->i_ino,
+				direction == WRITE ? "WRITE" : "READ",
+				pos, len, error);
+		return;
+	}
+
+	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+	ioerr->mp = mp;
+	ioerr->ino = ip->i_ino;
+	ioerr->gen = VFS_I(ip)->i_generation;
+	ioerr->pos = pos;
+	ioerr->len = len;
+	if (direction == WRITE)
+		ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE;
+	else
+		ioerr->type = XFS_FILE_IOERROR_DIRECT_READ;
+	ioerr->error = error;
+	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Deal with a media error */
+void
+xfs_inode_media_error(
+	struct xfs_inode	*ip,
+	loff_t			pos,
+	u64			len)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+	if (!ioerr) {
+		xfs_err(mp,
+ "lost data error report for ino 0x%llx pos 0x%llx len 0x%llx",
+				ip->i_ino,
+				pos, len);
+		return;
+	}
+
+	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+	ioerr->mp = mp;
+	ioerr->ino = ip->i_ino;
+	ioerr->gen = VFS_I(ip)->i_generation;
+	ioerr->pos = pos;
+	ioerr->len = len;
+	ioerr->type = XFS_FILE_IOERROR_DATA_LOST;
+	ioerr->error = -EIO;
+	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Queue a buffered io error notification. */
+void
+xfs_vm_ioerror(
+	struct address_space	*mapping,
+	int			direction,
+	loff_t			pos,
+	u64			len,
+	int			error)
+{
+	struct inode		*inode = mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_file_ioerror	*ioerr;
+
+	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+	if (!ioerr) {
+		xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+				ip->i_ino,
+				direction == WRITE ? "WRITE" : "READ",
+				pos, len, error);
+		return;
+	}
+
+	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+	ioerr->mp = mp;
+	ioerr->ino = ip->i_ino;
+	ioerr->gen = VFS_I(ip)->i_generation;
+	ioerr->pos = pos;
+	ioerr->len = len;
+	if (direction == WRITE)
+		ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE;
+	else
+		ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ;
+	ioerr->error = error;
+	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Call the specified function after a file io error. */
+int
+xfs_file_ioerror_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Stop calling the specified function after a file io error. */
+void
+xfs_file_ioerror_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_file_ioerror_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Configure file io error update hook functions. */
+void
+xfs_file_ioerror_hook_setup(
+	struct xfs_file_ioerror_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->ioerror_hook, mod_fn);
+}
+#else
+# define xfs_dio_ioerror		NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
+static const struct iomap_dio_ops xfs_dio_read_ops = {
+	.ioerror	= xfs_dio_ioerror,
+};
+
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
@@ -240,7 +410,8 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops,
+			0, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -625,6 +796,7 @@ xfs_dio_write_end_io(
 
 static const struct iomap_dio_ops xfs_dio_write_ops = {
 	.end_io		= xfs_dio_write_end_io,
+	.ioerror	= xfs_dio_ioerror,
 };
 
 static void
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 557f4bf3463dcb..8766d83385ddad 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 #include "xfs_notify_failure.h"
 #include "xfs_rtgroup.h"
 #include "xfs_rtrmap_btree.h"
+#include "xfs_file.h"
 
 #include <linux/mm.h>
 #include <linux/dax.h>
@@ -167,6 +168,10 @@ xfs_dax_failure_fn(
 		invalidate_inode_pages2_range(mapping, pgoff,
 					      pgoff + pgcnt - 1);
 
+	xfs_inode_media_error(ip,
+			(loff_t)pgoff << PAGE_SHIFT,
+			(u64)pgcnt << PAGE_SHIFT);
+
 	xfs_irele(ip);
 	return error;
 }
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fb72a4976e8570..54d82f5a5b8863 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2294,6 +2294,7 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_hooks_init(&mp->m_media_error_hooks);
+	xfs_hooks_init(&mp->m_file_ioerror_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;



* [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (8 preceding siblings ...)
  2025-11-05  0:50   ` [PATCH 09/22] xfs: create file io error hooks Darrick J. Wong
@ 2025-11-05  0:51   ` Darrick J. Wong
  2025-11-05  0:51   ` [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:51 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
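Not part of the patch: a sketch of how a userspace program might request the
monitoring fd.  The struct layout and ioctl number are copied from the
xfs_fs.h hunk below, on the assumption that installed headers do not carry
them yet; the helper names are invented for the example.  The caller needs
CAP_SYS_ADMIN, all fields must be zero, and at this point in the series
read() on the returned fd still fails with EIO because event delivery
arrives in later patches.

```c
/* Userspace sketch: ask an XFS mount for a health monitoring fd. */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copy of the UAPI from this patch (assumed absent from headers). */
struct xfs_health_monitor {
	uint64_t	flags;		/* flags */
	uint8_t		format;		/* output format */
	uint8_t		pad1[7];	/* zeroes */
	uint64_t	pad2[2];	/* zeroes */
};

#ifndef XFS_IOC_HEALTH_MONITOR
# define XFS_IOC_HEALTH_MONITOR	_IOW('X', 68, struct xfs_health_monitor)
#endif

/* Mirror of the kernel-side check: for now every byte must be zero. */
static bool health_monitor_args_valid(const struct xfs_health_monitor *hmo)
{
	static const struct xfs_health_monitor	zeroes;

	return memcmp(hmo, &zeroes, sizeof(zeroes)) == 0;
}

/* Returns the new monitor fd, or -1 with errno set. */
static int open_health_monitor(const char *mountpoint)
{
	struct xfs_health_monitor	hmo;
	int				mnt_fd;
	int				mon_fd;
	int				saved_errno;

	memset(&hmo, 0, sizeof(hmo));
	assert(health_monitor_args_valid(&hmo));

	mnt_fd = open(mountpoint, O_RDONLY);
	if (mnt_fd < 0)
		return -1;

	/* On success the ioctl return value is the new file descriptor. */
	mon_fd = ioctl(mnt_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
	saved_errno = errno;
	close(mnt_fd);
	errno = saved_errno;
	return mon_fd;
}
```

On a kernel or filesystem without this ioctl the call is expected to fail
with ENOTTY, so a daemon can probe for support at startup and fall back to
periodic scrubbing.
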
 fs/xfs/libxfs/xfs_fs.h |    8 ++
 fs/xfs/xfs_healthmon.h |   16 +++++
 fs/xfs/Kconfig         |    8 ++
 fs/xfs/Makefile        |    1 
 fs/xfs/xfs_healthmon.c |  157 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_ioctl.c     |    4 +
 6 files changed, 194 insertions(+)
 create mode 100644 fs/xfs/xfs_healthmon.h
 create mode 100644 fs/xfs/xfs_healthmon.c


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 12463ba766da05..dba7896f716092 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,13 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1042,6 +1049,7 @@ struct xfs_rtgroup_geometry {
 #define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
+#define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
new file mode 100644
index 00000000000000..07126e39281a0c
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_HEALTHMON_H__
+#define __XFS_HEALTHMON_H__
+
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+long xfs_ioc_health_monitor(struct xfs_mount *mp,
+		struct xfs_health_monitor __user *arg);
+#else
+# define xfs_ioc_health_monitor(mp, hmo)	(-ENOTTY)
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
+#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index b99da294e9a310..682d9a35203494 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -130,6 +130,14 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_HEALTH_MONITOR
+	bool "Report filesystem health events to userspace"
+	depends on XFS_FS
+	select XFS_LIVE_HOOKS
+	default y
+	help
+	  Report health events to userspace programs.
+
 config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5bf501cf827172..d4e9070a9326ba 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -157,6 +157,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
 xfs-$(CONFIG_XFS_BTREE_IN_MEM)	+= libxfs/xfs_btree_mem.o
+xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
new file mode 100644
index 00000000000000..7b0d9f78b0a402
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2025 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trace.h"
+#include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
+
+#include <linux/anon_inodes.h>
+#include <linux/eventpoll.h>
+#include <linux/poll.h>
+
+/*
+ * Live Health Monitoring
+ * ======================
+ *
+ * Autonomous self-healing of XFS filesystems requires a means for the kernel
+ * to send filesystem health events to a monitoring daemon in userspace.  To
+ * accomplish this, we establish a thread_with_file kthread object to handle
+ * translating internal events about filesystem health into a format that can
+ * be parsed easily by userspace.  Then we hook various parts of the filesystem
+ * to supply those internal events to the kthread.  Userspace reads events
+ * from the file descriptor returned by the ioctl.
+ *
+ * The healthmon abstraction has a weak reference to the host filesystem mount
+ * so that the queueing and processing of the events do not pin the mount and
+ * cannot slow down the main filesystem.  The healthmon object can exist past
+ * the end of the filesystem mount.
+ */
+
+struct xfs_healthmon {
+	struct xfs_mount		*mp;
+};
+
+/*
+ * Convey queued event data to userspace.  First copy any remaining bytes in
+ * the outbuf, then format the oldest event into the outbuf and copy that too.
+ */
+STATIC ssize_t
+xfs_healthmon_read_iter(
+	struct kiocb		*iocb,
+	struct iov_iter		*to)
+{
+	return -EIO;
+}
+
+/* Free the health monitoring information. */
+STATIC int
+xfs_healthmon_release(
+	struct inode		*inode,
+	struct file		*file)
+{
+	struct xfs_healthmon	*hm = file->private_data;
+
+	kfree(hm);
+
+	return 0;
+}
+
+/* Validate ioctl parameters. */
+static inline bool
+xfs_healthmon_validate(
+	const struct xfs_health_monitor	*hmo)
+{
+	if (hmo->flags)
+		return false;
+	if (hmo->format)
+		return false;
+	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
+		return false;
+	if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2)))
+		return false;
+	return true;
+}
+
+/* Emit some data about the health monitoring fd. */
+#ifdef CONFIG_PROC_FS
+static void
+xfs_healthmon_show_fdinfo(
+	struct seq_file		*m,
+	struct file		*file)
+{
+	struct xfs_healthmon	*hm = file->private_data;
+
+	seq_printf(m, "state:\talive\ndev:\t%s\n",
+			hm->mp->m_super->s_id);
+}
+#endif
+
+static const struct file_operations xfs_healthmon_fops = {
+	.owner		= THIS_MODULE,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= xfs_healthmon_show_fdinfo,
+#endif
+	.read_iter	= xfs_healthmon_read_iter,
+	.release	= xfs_healthmon_release,
+};
+
+/*
+ * Create a health monitoring file.  Returns an index to the fd table or a
+ * negative errno.
+ */
+long
+xfs_ioc_health_monitor(
+	struct xfs_mount		*mp,
+	struct xfs_health_monitor __user *arg)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm;
+	int				fd;
+	int				ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	hm = kzalloc(sizeof(*hm), GFP_KERNEL);
+	if (!hm)
+		return -ENOMEM;
+	hm->mp = mp;
+
+	/*
+	 * Create the anonymous file.  If it succeeds, the file owns hm and
+	 * can go away at any time, so we must not access it again.
+	 */
+	fd = anon_inode_getfd("xfs_healthmon", &xfs_healthmon_fops, hm,
+			O_CLOEXEC | O_RDONLY);
+	if (fd < 0) {
+		ret = fd;
+		goto out_hm;
+	}
+
+	return fd;
+
+out_hm:
+	kfree(hm);
+	return ret;
+}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a6bb7ee7a27ad5..08998d84554f09 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -41,6 +41,7 @@
 #include "xfs_exchrange.h"
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1421,6 +1422,9 @@ xfs_file_ioctl(
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_ioc_health_monitor(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}



* [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (9 preceding siblings ...)
  2025-11-05  0:51   ` [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2025-11-05  0:51   ` Darrick J. Wong
  2025-11-05  0:51   ` [PATCH 12/22] xfs: report metadata health events through healthmon Darrick J. Wong
                     ` (10 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:51 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about events and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.  Make the kernel export C structures via read().
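
The queueing and lost-event requirements above can be sketched in plain
userspace C.  This is a simplified model, not the algorithm in the diff
below: the lq_* names and the cap of 4 are hypothetical, and the kernel
code additionally folds new losses into an existing trailing "lost"
event.  The idea it shows is that dropped events are counted while the
queue is full, and the count is delivered as its own event once there is
room again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LQ_MAX_EVENTS 4	/* hypothetical cap, chosen small for the demo */

enum lq_type { LQ_RUNNING, LQ_LOST, LQ_OTHER };

struct lq_event {
	enum lq_type	type;
	uint64_t	lostcount;	/* only meaningful for LQ_LOST */
};

struct lq_queue {
	struct lq_event	ev[LQ_MAX_EVENTS];
	size_t		head;
	size_t		count;
	uint64_t	lost_pending;	/* drops not yet reported */
};

/* Store an event at the tail of the ring; caller guarantees room. */
static void lq_enqueue(struct lq_queue *q, enum lq_type type, uint64_t lost)
{
	size_t tail = (q->head + q->count) % LQ_MAX_EVENTS;

	assert(q->count < LQ_MAX_EVENTS);
	q->ev[tail].type = type;
	q->ev[tail].lostcount = lost;
	q->count++;
}

/* Queue an event, reporting earlier drops first.  False if it was dropped. */
bool lq_push(struct lq_queue *q, enum lq_type type)
{
	/* Report earlier drops as soon as there is room for the report. */
	if (q->lost_pending && q->count < LQ_MAX_EVENTS) {
		lq_enqueue(q, LQ_LOST, q->lost_pending);
		q->lost_pending = 0;
	}

	/* Still full (or just filled by the report): count this drop. */
	if (q->count >= LQ_MAX_EVENTS) {
		q->lost_pending++;
		return false;
	}

	lq_enqueue(q, type, 0);
	return true;
}

/* Remove the oldest event.  False if the queue is empty. */
bool lq_pop(struct lq_queue *q, struct lq_event *out)
{
	if (!q->count)
		return false;
	*out = q->ev[q->head];
	q->head = (q->head + 1) % LQ_MAX_EVENTS;
	q->count--;
	return true;
}
```

A reader that pops an LQ_LOST event learns how many notifications it
missed and can fall back to a full rescan instead of trusting the
stream.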

In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures.  The thought was that when we want to rev (or worse,
enlarge) the event format, it ought to be trivially easy to do that in
a way that doesn't break old userspace.

I looked at formats such as protobufs and capnproto.  These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc.  The huge downside is that neither supports C
all that well.

That was too hard, and I didn't want to port either of those huge
sprawling libraries first to the kernel and then again to xfsprogs.
Then I thought, how about JSON?  Javascript objects are human readable,
the kernel can emit json without much fuss (it's all just strings!) and
there are plenty of parsers for python/rust/c/etc.

There's a proposed schema format for json, which means that xfs can
publish a description of the events that the kernel will emit.
Userspace consumers (e.g. xfsprogs/xfs_healer) can embed the same schema
document and use it to validate the incoming events from the kernel,
which means they can discard events that they don't understand, or
garbage emitted due to bugs.

However, json has a huge drawback -- javascript is well known for its
vague definition of what a number is.  This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants.  Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision.  Not good when you're dealing with
discrete units.

It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.
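
The precision problem is easy to reproduce in C.  The sketch below is
only an illustration, assuming the parser stores every number as an
IEEE 754 double; rt_survives_double() is a hypothetical helper name.
It reports whether a 64-bit value comes back unchanged after a trip
through a double.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(double) == 8, "sketch assumes a 64-bit IEEE 754 double");

/* True if v is unchanged after being stored in a double and read back. */
bool rt_survives_double(uint64_t v)
{
	double d = (double)v;

	/*
	 * Values near UINT64_MAX round up to 2^64, and converting that back
	 * to uint64_t is undefined behavior, so treat it as a failure.
	 */
	if (d >= 18446744073709551616.0)
		return false;

	return (uint64_t)d == v;
}
```

With this helper, 1ULL << 53 survives the round trip but
(1ULL << 53) + 1 does not, which is the kind of silent corruption that
matters when the number is an inode number or a disk address.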

It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs.  The parsing in userspace was also noticeably slow and
memory-hungry.

Hence only the C interface survives.
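
For a rough idea of what consuming the C interface looks like, here is
a minimal userspace sketch.  The mirror_* names are hypothetical: they
copy the layout of struct xfs_health_monitor_event as it stands in this
patch only, so that the example builds without the new xfs_fs.h.  The
function walks a buffer as returned by read() on the monitor fd and
totals the lost-event counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins for XFS_HEALTH_MONITOR_TYPE_{RUNNING,LOST} */
#define MIRROR_TYPE_RUNNING	0
#define MIRROR_TYPE_LOST	1

struct mirror_lost {
	uint64_t	count;
};

/* Mirrors struct xfs_health_monitor_event from this patch. */
struct mirror_event {
	uint32_t	domain;
	uint32_t	type;
	uint64_t	time_ns;
	union {
		struct mirror_lost lost;
	} e;
	uint64_t	pad[2];
};

static_assert(sizeof(struct mirror_event) == 40, "unexpected event size");

/*
 * Walk whole events in buf and return the sum of all lost counts.  The
 * number of whole events seen is stored in *nr_events if it is not NULL.
 * Trailing bytes that do not make up a whole event are ignored here; a
 * real reader would carry them over to the next read().
 */
uint64_t mirror_count_lost(const void *buf, size_t len, size_t *nr_events)
{
	const unsigned char	*p = buf;
	uint64_t		lost = 0;
	size_t			n = 0;

	while (len >= sizeof(struct mirror_event)) {
		struct mirror_event	ev;

		/* memcpy avoids alignment assumptions about buf */
		memcpy(&ev, p, sizeof(ev));
		if (ev.type == MIRROR_TYPE_LOST)
			lost += ev.e.lost.count;

		p += sizeof(ev);
		len -= sizeof(ev);
		n++;
	}

	if (nr_events)
		*nr_events = n;
	return lost;
}
```

Fixed-size records mean the reader needs no parser at all, at the cost
of having to track the structure size whenever the union grows.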

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   47 ++++
 fs/xfs/xfs_healthmon.h |   29 +++
 fs/xfs/xfs_linux.h     |    3 
 fs/xfs/xfs_trace.h     |  170 +++++++++++++++
 fs/xfs/xfs_healthmon.c |  542 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.c     |    2 
 lib/seq_buf.c          |    1 
 7 files changed, 787 insertions(+), 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index dba7896f716092..dfca42b2c31192 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,45 @@ struct xfs_rtgroup_geometry {
 #define XFS_RTGROUP_GEOM_SICK_RMAPBT	(1U << 3)  /* reverse mappings */
 #define XFS_RTGROUP_GEOM_SICK_REFCNTBT	(1U << 4)  /* reference counts */
 
+/* Health monitor event domains */
+
+/* affects the whole fs */
+#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT		(0)
+
+/* Health monitor event types */
+
+/* status of the monitor itself */
+#define XFS_HEALTH_MONITOR_TYPE_RUNNING		(0)
+#define XFS_HEALTH_MONITOR_TYPE_LOST		(1)
+
+/* lost events */
+struct xfs_health_monitor_lost {
+	__u64	count;
+};
+
+struct xfs_health_monitor_event {
+	/* XFS_HEALTH_MONITOR_DOMAIN_* */
+	__u32	domain;
+
+	/* XFS_HEALTH_MONITOR_TYPE_* */
+	__u32	type;
+
+	/* Timestamp of the event, in nanoseconds since the Unix epoch */
+	__u64	time_ns;
+
+	/*
+	 * Details of the event.  The primary clients are written in python
+	 * and rust, so break this up because bindgen hates anonymous structs
+	 * and unions.
+	 */
+	union {
+		struct xfs_health_monitor_lost lost;
+	} e;
+
+	/* zeroes */
+	__u64	pad[2];
+};
+
 struct xfs_health_monitor {
 	__u64	flags;		/* flags */
 	__u8	format;		/* output format */
@@ -1010,6 +1049,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Initial return format version */
+#define XFS_HEALTH_MONITOR_FMT_V0	(0)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 07126e39281a0c..ea2d6a327dfb16 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -6,6 +6,35 @@
 #ifndef __XFS_HEALTHMON_H__
 #define __XFS_HEALTHMON_H__
 
+enum xfs_healthmon_type {
+	XFS_HEALTHMON_RUNNING,	/* monitor running */
+	XFS_HEALTHMON_LOST,	/* message lost */
+};
+
+enum xfs_healthmon_domain {
+	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+};
+
+struct xfs_healthmon_event {
+	struct xfs_healthmon_event	*next;
+
+	enum xfs_healthmon_type		type;
+	enum xfs_healthmon_domain	domain;
+
+	uint64_t			time_ns;
+
+	union {
+		/* lost events */
+		struct {
+			uint64_t	lostcount;
+		};
+		/* mount */
+		struct {
+			unsigned int	flags;
+		};
+	};
+};
+
 #ifdef CONFIG_XFS_HEALTH_MONITOR
 long xfs_ioc_health_monitor(struct xfs_mount *mp,
 		struct xfs_health_monitor __user *arg);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 4dd747bdbccab2..e122db938cc06b 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -63,6 +63,9 @@ typedef __u32			xfs_nlink_t;
 #include <linux/xattr.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/debugfs.h>
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+# include <linux/seq_buf.h>
+#endif
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 79b8641880ab9d..309af9082c4179 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -103,6 +103,8 @@ struct xfs_refcount_intent;
 struct xfs_metadir_update;
 struct xfs_rtgroup;
 struct xfs_open_zone;
+struct xfs_healthmon_event;
+struct xfs_health_update_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -5908,6 +5910,174 @@ DEFINE_EVENT(xfs_freeblocks_resv_class, name, \
 DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_reserved);
 DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_enospc);
 
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+TRACE_EVENT(xfs_healthmon_lost_event,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long long lost_prev),
+	TP_ARGS(mp, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d lost_prev %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->lost_prev)
+);
+
+#define XFS_HEALTHMON_FLAGS_STRINGS \
+	{ XFS_HEALTH_MONITOR_VERBOSE,	"verbose" }
+#define XFS_HEALTHMON_FMT_STRINGS \
+	{ XFS_HEALTH_MONITOR_FMT_V0,	"v0" }
+
+TRACE_EVENT(xfs_healthmon_create,
+	TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format),
+	TP_ARGS(mp, flags, format),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(u64, flags)
+		__field(u8, format)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->flags = flags;
+		__entry->format = format;
+	),
+	TP_printk("dev %d:%d flags %s format %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS),
+		  __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS))
+);
+
+TRACE_EVENT(xfs_healthmon_copybuf,
+	TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov,
+		 const struct seq_buf *seqbuf, size_t outpos),
+	TP_ARGS(mp, iov, seqbuf, outpos),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(size_t, seqbuf_size)
+		__field(size_t, seqbuf_len)
+		__field(size_t, outpos)
+		__field(size_t, to_copy)
+		__field(size_t, iter_count)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->seqbuf_size = seqbuf->size;
+		__entry->seqbuf_len = seqbuf->len;
+		__entry->outpos = outpos;
+		__entry->to_copy = seqbuf->len - outpos;
+		__entry->iter_count = iov_iter_count(iov);
+	),
+	TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->seqbuf_size,
+		  __entry->seqbuf_len,
+		  __entry->outpos,
+		  __entry->to_copy,
+		  __entry->iter_count)
+);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_class,
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events,
+		 unsigned long long lost_prev),
+	TP_ARGS(mp, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#define DEFINE_HEALTHMON_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, \
+		 unsigned long long lost_prev), \
+	TP_ARGS(mp, events, lost_prev))
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+
+#define XFS_HEALTHMON_TYPE_STRINGS \
+	{ XFS_HEALTHMON_LOST,		"lost" }
+
+#define XFS_HEALTHMON_DOMAIN_STRINGS \
+	{ XFS_HEALTHMON_MOUNT,		"mount" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
+	TP_ARGS(mp, event),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned long long, offset)
+		__field(unsigned long long, length)
+		__field(unsigned long long, lostcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = event->type;
+		__entry->domain = event->domain;
+		__entry->mask = 0;
+		__entry->group = 0;
+		__entry->ino = 0;
+		__entry->gen = 0;
+		__entry->offset = 0;
+		__entry->length = 0;
+		__entry->lostcount = 0;
+		switch (__entry->domain) {
+		case XFS_HEALTHMON_MOUNT:
+			switch (__entry->type) {
+			case XFS_HEALTHMON_LOST:
+				__entry->lostcount = event->lostcount;
+				break;
+			}
+			break;
+		}
+	),
+	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
+		  __entry->mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->offset,
+		  __entry->length,
+		  __entry->group,
+		  __entry->lostcount)
+);
+#define DEFINE_HEALTHMONEVENT_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_event_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
+	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 7b0d9f78b0a402..8cf6b0b81a721b 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -42,10 +42,376 @@
  * the end of the filesystem mount.
  */
 
+/* Allow this many events to build up in memory per healthmon fd. */
+#define XFS_HEALTHMON_MAX_EVENTS \
+		(32768 / sizeof(struct xfs_healthmon_event))
+
+struct flag_string {
+	unsigned int	mask;
+	const char	*str;
+};
+
 struct xfs_healthmon {
+	/* lock for mp and eventlist */
+	struct mutex			lock;
+
+	/* waiter for signalling the arrival of events */
+	struct wait_queue_head		wait;
+
+	/* list of event objects */
+	struct xfs_healthmon_event	*first_event;
+	struct xfs_healthmon_event	*last_event;
+
 	struct xfs_mount		*mp;
+
+	/* number of events */
+	unsigned int			events;
+
+	/*
+	 * Buffer for formatting events.  New buffer data are appended to the
+	 * end of the seqbuf, and outpos is used to determine where to start
+	 * a copy_iter.  Both are protected by inode_lock.
+	 */
+	struct seq_buf			outbuf;
+	size_t				outpos;
+
+	/* XFS_HEALTH_MONITOR_FMT_* */
+	uint8_t				format;
+
+	/* do we want all events? */
+	bool				verbose;
+
+	/* did we lose previous events? */
+	unsigned long long		lost_prev_event;
+
+	/* total counts of events observed and lost events */
+	unsigned long long		total_events;
+	unsigned long long		total_lost;
 };
 
+static inline void xfs_healthmon_bump_events(struct xfs_healthmon *hm)
+{
+	hm->events++;
+	hm->total_events++;
+}
+
+static inline void xfs_healthmon_bump_lost(struct xfs_healthmon *hm)
+{
+	hm->lost_prev_event++;
+	hm->total_lost++;
+}
+
+/* Remove an event from the head of the list. */
+static inline int
+xfs_healthmon_free_head(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	struct xfs_healthmon_event	*head;
+
+	mutex_lock(&hm->lock);
+	head = hm->first_event;
+	if (head != event) {
+		ASSERT(hm->first_event == event);
+		mutex_unlock(&hm->lock);
+		return -EFSCORRUPTED;
+	}
+
+	if (hm->last_event == head)
+		hm->last_event = NULL;
+	hm->first_event = head->next;
+	hm->events--;
+	mutex_unlock(&hm->lock);
+
+	trace_xfs_healthmon_pop(hm->mp, head);
+	kfree(event);
+	return 0;
+}
+
+/* Push an event onto the end of the list. */
+static inline void
+__xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	if (!hm->first_event)
+		hm->first_event = event;
+	if (hm->last_event)
+		hm->last_event->next = event;
+	hm->last_event = event;
+	event->next = NULL;
+	xfs_healthmon_bump_events(hm);
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_push(hm->mp, event);
+}
+
+/* Push an event onto the end of the list if we're not full. */
+static inline int
+xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		xfs_healthmon_bump_lost(hm);
+		return -ENOMEM;
+	}
+
+	__xfs_healthmon_push(hm, event);
+	return 0;
+}
+
+/* Create a new event or record that we failed. */
+static struct xfs_healthmon_event *
+xfs_healthmon_alloc(
+	struct xfs_healthmon		*hm,
+	enum xfs_healthmon_type		type,
+	enum xfs_healthmon_domain	domain)
+{
+	struct timespec64		now;
+	struct xfs_healthmon_event	*event;
+
+	event = kzalloc(sizeof(*event), GFP_NOFS);
+	if (!event) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		xfs_healthmon_bump_lost(hm);
+		return NULL;
+	}
+
+	event->type = type;
+	event->domain = domain;
+	ktime_get_coarse_real_ts64(&now);
+	event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
+
+	return event;
+}
+
+/*
+ * Before we accept an event notification from a live update hook, we need to
+ * clear out any previously lost events.
+ */
+static inline int
+xfs_healthmon_start_live_update(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event;
+
+	/* If the queue is already full.... */
+	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+		if (hm->last_event &&
+		    hm->last_event->type == XFS_HEALTHMON_LOST) {
+			/*
+			 * ...and the last event notes lost events, then add
+			 * the number of events we already lost, plus one for
+			 * this event that we're about to lose.
+			 */
+			hm->last_event->lostcount += hm->lost_prev_event + 1;
+			hm->lost_prev_event = 0;
+		} else {
+			/*
+			 * ...try to create a new lost event.  Add the number
+			 * of events we previously lost, plus one for this
+			 * event.
+			 */
+			event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+					XFS_HEALTHMON_MOUNT);
+			if (!event) {
+				xfs_healthmon_bump_lost(hm);
+				return -ENOMEM;
+			}
+			event->lostcount = hm->lost_prev_event + 1;
+			hm->lost_prev_event = 0;
+
+			__xfs_healthmon_push(hm, event);
+		}
+
+		return -ENOSPC;
+	}
+
+	/* If we lost an event in the past, but the queue isn't yet full... */
+	if (hm->lost_prev_event) {
+		/*
+		 * ...try to create a new lost event recording the number of
+		 * events we previously lost.
+		 */
+		event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+				XFS_HEALTHMON_MOUNT);
+		if (!event) {
+			xfs_healthmon_bump_lost(hm);
+			return -ENOMEM;
+		}
+		event->lostcount = hm->lost_prev_event;
+		hm->lost_prev_event = 0;
+
+		/*
+		 * If adding this lost event pushes us over the limit, we're
+		 * going to lose the current event.  Note that in the lost
+		 * event count too.
+		 */
+		if (hm->events == XFS_HEALTHMON_MAX_EVENTS - 1)
+			event->lostcount++;
+
+		__xfs_healthmon_push(hm, event);
+		if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+			trace_xfs_healthmon_lost_event(hm->mp,
+					hm->lost_prev_event);
+			return -ENOSPC;
+		}
+	}
+
+	/*
+	 * The queue is not full and it is not currently the case that events
+	 * were lost.
+	 */
+	return 0;
+}
+
+static inline void
+xfs_healthmon_reset_outbuf(
+	struct xfs_healthmon		*hm)
+{
+	hm->outpos = 0;
+	seq_buf_clear(&hm->outbuf);
+}
+
+static const unsigned int domain_map[] = {
+	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+};
+
+static const unsigned int type_map[] = {
+	[XFS_HEALTHMON_RUNNING]		= XFS_HEALTH_MONITOR_TYPE_RUNNING,
+	[XFS_HEALTHMON_LOST]		= XFS_HEALTH_MONITOR_TYPE_LOST,
+};
+
+/* Render event as a V0 structure */
+STATIC int
+xfs_healthmon_format_v0(
+	struct xfs_healthmon		*hm,
+	const struct xfs_healthmon_event *event)
+{
+	struct xfs_health_monitor_event	hme = {
+		.time_ns		= event->time_ns,
+	};
+	struct seq_buf			*outbuf = &hm->outbuf;
+	size_t				old_seqlen = outbuf->len;
+	int				ret;
+
+	trace_xfs_healthmon_format(hm->mp, event);
+
+	if (event->domain < 0 || event->domain >= ARRAY_SIZE(domain_map) ||
+	    event->type < 0   || event->type >= ARRAY_SIZE(type_map))
+		return -EFSCORRUPTED;
+
+	hme.domain = domain_map[event->domain];
+	hme.type = type_map[event->type];
+
+	/* fill in the event-specific details */
+	switch (event->domain) {
+	case XFS_HEALTHMON_MOUNT:
+		switch (event->type) {
+		case XFS_HEALTHMON_LOST:
+			hme.e.lost.count = event->lostcount;
+			break;
+		default:
+			break;
+		}
+		break;
+	default:
+		break;
+	}
+
+	ret = seq_buf_putmem(outbuf, &hme, sizeof(hme));
+	if (ret < 0) {
+		/*
+		 * We overflowed the buffer and could not format the event.
+		 * Reset the seqbuf and tell the caller not to delete the
+		 * event.
+		 */
+		trace_xfs_healthmon_format_overflow(hm->mp, event);
+		outbuf->len = old_seqlen;
+		return -1;
+	}
+
+	ASSERT(!seq_buf_has_overflowed(outbuf));
+	return 0;
+}
+
+/* How many bytes are waiting in the outbuf to be copied? */
+static inline size_t
+xfs_healthmon_outbuf_bytes(
+	struct xfs_healthmon	*hm)
+{
+	unsigned int		used = seq_buf_used(&hm->outbuf);
+
+	if (used > hm->outpos)
+		return used - hm->outpos;
+	return 0;
+}
+
+/*
+ * Do we have something for userspace to do?  This can mean unmount events,
+ * events pending in the queue, or pending bytes in the outbuf.
+ */
+static inline bool
+xfs_healthmon_has_eventdata(
+	struct xfs_healthmon	*hm)
+{
+	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+}
+
+/* Try to copy the rest of the outbuf to the iov iter. */
+STATIC ssize_t
+xfs_healthmon_copybuf(
+	struct xfs_healthmon	*hm,
+	struct iov_iter		*to)
+{
+	size_t			to_copy;
+	size_t			w = 0;
+
+	trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos);
+
+	to_copy = xfs_healthmon_outbuf_bytes(hm);
+	if (to_copy) {
+		w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to);
+		if (!w)
+			return -EFAULT;
+
+		hm->outpos += w;
+	}
+
+	/*
+	 * Nothing left to copy?  Reset the seqbuf pointers and outbuf to the
+	 * start since there's no live data in the buffer.
+	 */
+	if (xfs_healthmon_outbuf_bytes(hm) == 0)
+		xfs_healthmon_reset_outbuf(hm);
+	return w;
+}
+
+/*
+ * See if there's an event waiting for us.  If the fs is no longer mounted,
+ * don't bother sending any more events.
+ */
+static inline struct xfs_healthmon_event *
+xfs_healthmon_peek(
+	struct xfs_healthmon	*hm)
+{
+	struct xfs_healthmon_event *event;
+
+	mutex_lock(&hm->lock);
+	if (hm->mp)
+		event = hm->first_event;
+	else
+		event = NULL;
+	mutex_unlock(&hm->lock);
+	return event;
+}
+
 /*
  * Convey queued event data to userspace.  First copy any remaining bytes in
  * the outbuf, then format the oldest event into the outbuf and copy that too.
@@ -55,7 +421,122 @@ xfs_healthmon_read_iter(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	return -EIO;
+	struct file		*file = iocb->ki_filp;
+	struct inode		*inode = file_inode(file);
+	struct xfs_healthmon	*hm = file->private_data;
+	struct xfs_healthmon_event *event;
+	size_t			copied = 0;
+	ssize_t			ret = 0;
+
+	/* Wait for data to become available */
+	if (!(file->f_flags & O_NONBLOCK)) {
+		ret = wait_event_interruptible(hm->wait,
+				xfs_healthmon_has_eventdata(hm));
+		if (ret)
+			return ret;
+	} else if (!xfs_healthmon_has_eventdata(hm)) {
+		return -EAGAIN;
+	}
+
+	/* Allocate formatting buffer up to 64k if necessary */
+	if (hm->outbuf.size == 0) {
+		void		*outbuf;
+		size_t		bufsize = min(65536, max(PAGE_SIZE,
+							 iov_iter_count(to)));
+
+		outbuf = kzalloc(bufsize, GFP_KERNEL);
+		if (!outbuf) {
+			bufsize = PAGE_SIZE;
+			outbuf = kzalloc(bufsize, GFP_KERNEL);
+			if (!outbuf)
+				return -ENOMEM;
+		}
+
+		inode_lock(inode);
+		if (hm->outbuf.size == 0) {
+			seq_buf_init(&hm->outbuf, outbuf, bufsize);
+			hm->outpos = 0;
+		} else {
+			kfree(outbuf);
+		}
+	} else {
+		inode_lock(inode);
+	}
+
+	trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event);
+
+	/*
+	 * If there's anything left in the seqbuf, copy that before formatting
+	 * more events.
+	 */
+	ret = xfs_healthmon_copybuf(hm, to);
+	if (ret < 0)
+		goto out_unlock;
+	copied += ret;
+
+	while (iov_iter_count(to) > 0) {
+		/* Format the next events into the outbuf until it's full. */
+		while ((event = xfs_healthmon_peek(hm)) != NULL) {
+			switch (hm->format) {
+			case XFS_HEALTH_MONITOR_FMT_V0:
+				ret = xfs_healthmon_format_v0(hm, event);
+				break;
+			default:
+				ret = -EINVAL;
+				goto out_unlock;
+			}
+			if (ret < 0)
+				break;
+			ret = xfs_healthmon_free_head(hm, event);
+			if (ret)
+				goto out_unlock;
+		}
+
+		/* Copy it to userspace */
+		ret = xfs_healthmon_copybuf(hm, to);
+		if (ret <= 0)
+			break;
+
+		copied += ret;
+	}
+
+out_unlock:
+	trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event);
+	inode_unlock(inode);
+	return copied ?: ret;
+}
+
+/* Poll for available events. */
+STATIC __poll_t
+xfs_healthmon_poll(
+	struct file			*file,
+	struct poll_table_struct	*wait)
+{
+	struct xfs_healthmon		*hm = file->private_data;
+	__poll_t			mask = 0;
+
+	poll_wait(file, &hm->wait, wait);
+
+	if (xfs_healthmon_has_eventdata(hm))
+		mask |= EPOLLIN;
+	return mask;
+}
+
+/* Free all events */
+STATIC void
+xfs_healthmon_free_events(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event, *next;
+
+	event = hm->first_event;
+	while (event != NULL) {
+		trace_xfs_healthmon_drop(hm->mp, event);
+		next = event->next;
+		kfree(event);
+		event = next;
+	}
+	hm->first_event = hm->last_event = NULL;
 }
 
 /* Free the health monitoring information. */
@@ -66,6 +547,14 @@ xfs_healthmon_release(
 {
 	struct xfs_healthmon	*hm = file->private_data;
 
+	trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event);
+
+	wake_up_all(&hm->wait);
+
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
+	if (hm->outbuf.size)
+		kfree(hm->outbuf.buffer);
 	kfree(hm);
 
 	return 0;
@@ -76,9 +565,9 @@ static inline bool
 xfs_healthmon_validate(
 	const struct xfs_health_monitor	*hmo)
 {
-	if (hmo->flags)
+	if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
 		return false;
-	if (hmo->format)
+	if (hmo->format != XFS_HEALTH_MONITOR_FMT_V0)
 		return false;
 	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
 		return false;
@@ -89,6 +578,17 @@ xfs_healthmon_validate(
 
 /* Emit some data about the health monitoring fd. */
 #ifdef CONFIG_PROC_FS
+static const char *
+xfs_healthmon_format_string(const struct xfs_healthmon *hm)
+{
+	switch (hm->format) {
+	case XFS_HEALTH_MONITOR_FMT_V0:
+		return "v0";
+	}
+
+	return "";
+}
+
 static void
 xfs_healthmon_show_fdinfo(
 	struct seq_file		*m,
@@ -96,8 +596,13 @@ xfs_healthmon_show_fdinfo(
 {
 	struct xfs_healthmon	*hm = file->private_data;
 
-	seq_printf(m, "state:\talive\ndev:\t%s\n",
-			hm->mp->m_super->s_id);
+	mutex_lock(&hm->lock);
+	seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
+			hm->mp->m_super->s_id,
+			xfs_healthmon_format_string(hm),
+			hm->total_events,
+			hm->total_lost);
+	mutex_unlock(&hm->lock);
 }
 #endif
 
@@ -107,6 +612,7 @@ static const struct file_operations xfs_healthmon_fops = {
 	.show_fdinfo	= xfs_healthmon_show_fdinfo,
 #endif
 	.read_iter	= xfs_healthmon_read_iter,
+	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
 };
 
@@ -121,6 +627,7 @@ xfs_ioc_health_monitor(
 {
 	struct xfs_health_monitor	hmo;
 	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
 	int				fd;
 	int				ret;
 
@@ -137,6 +644,23 @@ xfs_ioc_health_monitor(
 	if (!hm)
 		return -ENOMEM;
 	hm->mp = mp;
+	hm->format = hmo.format;
+
+	seq_buf_init(&hm->outbuf, NULL, 0);
+	mutex_init(&hm->lock);
+	init_waitqueue_head(&hm->wait);
+
+	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
+		hm->verbose = true;
+
+	/* Queue up the first event that lets the client know we're running. */
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
+			XFS_HEALTHMON_MOUNT);
+	if (!event) {
+		ret = -ENOMEM;
+		goto out_mutex;
+	}
+	__xfs_healthmon_push(hm, event);
 
 	/*
 	 * Create the anonymous file.  If it succeeds, the file owns hm and
@@ -146,12 +670,16 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_hm;
+		goto out_mutex;
 	}
 
+	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
+
 	return fd;
 
-out_hm:
+out_mutex:
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
 	kfree(hm);
 	return ret;
 }
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index a60556dbd172ee..d42b864a3837a2 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -51,6 +51,8 @@
 #include "xfs_rtgroup.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_zone_priv.h"
+#include "xfs_health.h"
+#include "xfs_healthmon.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index f3f3436d60a940..f6a1fb46a1d6c9 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -245,6 +245,7 @@ int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len)
 	seq_buf_set_overflow(s);
 	return -1;
 }
+EXPORT_SYMBOL_GPL(seq_buf_putmem);
 
 #define MAX_MEMHEX_BYTES	8U
 #define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)



* [PATCH 12/22] xfs: report metadata health events through healthmon
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (10 preceding siblings ...)
  2025-11-05  0:51   ` [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2025-11-05  0:51   ` Darrick J. Wong
  2025-11-05  0:51   ` [PATCH 13/22] xfs: report shutdown " Darrick J. Wong
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:51 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.
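
The weak reference described above can be pictured with a very small
model.  This is a single-threaded sketch with hypothetical weak_* names;
the kernel code would have to take the healthmon lock around these
pointer accesses.  The monitor holds a pointer to the mount that the
unmount path clears, and every report checks the pointer first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct weak_mount {
	int			id;
};

struct weak_monitor {
	struct weak_mount	*mp;		/* NULL once unmounted */
	unsigned int		reported;	/* events accepted so far */
};

/* Unmount path: sever the link so the monitor no longer sees the fs. */
void weak_unmount(struct weak_monitor *hm)
{
	assert(hm != NULL);
	hm->mp = NULL;
}

/* Event path: accept an event only while the filesystem is attached. */
bool weak_report(struct weak_monitor *hm)
{
	assert(hm != NULL);
	if (!hm->mp)
		return false;
	hm->reported++;
	return true;
}
```

Because the monitor never pins the mount, the file descriptor can
outlive the filesystem; reports after unmount are simply refused.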

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h     |   38 +++++
 fs/xfs/libxfs/xfs_health.h |    5 +
 fs/xfs/xfs_healthmon.h     |   31 ++++
 fs/xfs/xfs_trace.h         |   98 +++++++++++++
 fs/xfs/xfs_health.c        |   67 +++++++++
 fs/xfs/xfs_healthmon.c     |  333 +++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 563 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index dfca42b2c31192..2ad45351ac0ea6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1008,17 +1008,52 @@ struct xfs_rtgroup_geometry {
 /* affects the whole fs */
 #define XFS_HEALTH_MONITOR_DOMAIN_MOUNT		(0)
 
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FS		(1)
+#define XFS_HEALTH_MONITOR_DOMAIN_AG		(2)
+#define XFS_HEALTH_MONITOR_DOMAIN_INODE		(3)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP	(4)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
 #define XFS_HEALTH_MONITOR_TYPE_RUNNING		(0)
 #define XFS_HEALTH_MONITOR_TYPE_LOST		(1)
 
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_TYPE_SICK		(2)
+#define XFS_HEALTH_MONITOR_TYPE_CORRUPT		(3)
+#define XFS_HEALTH_MONITOR_TYPE_HEALTHY		(4)
+
+/* filesystem was unmounted */
+#define XFS_HEALTH_MONITOR_TYPE_UNMOUNT		(5)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
 };
 
+/* fs/rt metadata */
+struct xfs_health_monitor_fs {
+	/* XFS_FSOP_GEOM_SICK_* flags */
+	__u32	mask;
+};
+
+/* ag/rtgroup metadata */
+struct xfs_health_monitor_group {
+	/* XFS_{AG,RTGROUP}_SICK_* flags */
+	__u32	mask;
+	__u32	gno;
+};
+
+/* inode metadata */
+struct xfs_health_monitor_inode {
+	/* XFS_BS_SICK_* flags */
+	__u32	mask;
+	__u32	gen;
+	__u64	ino;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1036,6 +1071,9 @@ struct xfs_health_monitor_event {
 	 */
 	union {
 		struct xfs_health_monitor_lost lost;
+		struct xfs_health_monitor_fs fs;
+		struct xfs_health_monitor_group group;
+		struct xfs_health_monitor_inode inode;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 39fef33dedc6a8..9ff3bf8ba4ed8f 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -336,4 +336,9 @@ void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
 void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+unsigned int xfs_healthmon_inode_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_rtgroup_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_perag_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_fs_mask(unsigned int sick_mask);
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index ea2d6a327dfb16..3f3ba16d5af56a 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -9,10 +9,23 @@
 enum xfs_healthmon_type {
 	XFS_HEALTHMON_RUNNING,	/* monitor running */
 	XFS_HEALTHMON_LOST,	/* message lost */
+	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
+
+	/* metadata health events */
+	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
+	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
+
 };
 
 enum xfs_healthmon_domain {
 	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+
+	/* metadata health events */
+	XFS_HEALTHMON_FS,	/* main filesystem metadata */
+	XFS_HEALTHMON_AG,	/* allocation group metadata */
+	XFS_HEALTHMON_INODE,	/* inode metadata */
+	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
 };
 
 struct xfs_healthmon_event {
@@ -32,6 +45,24 @@ struct xfs_healthmon_event {
 		struct {
 			unsigned int	flags;
 		};
+		/* fs/rt metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	fsmask;
+		};
+		/* ag/rtgroup metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	grpmask;
+			unsigned int	group;
+		};
+		/* inode metadata */
+		struct {
+			/* XFS_SICK_INO_* flags */
+			unsigned int	imask;
+			uint32_t	gen;
+			xfs_ino_t	ino;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 309af9082c4179..051599f8433ed6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6010,14 +6010,30 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
 #define XFS_HEALTHMON_TYPE_STRINGS \
-	{ XFS_HEALTHMON_LOST,		"lost" }
+	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHMON_SICK,		"sick" }, \
+	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHMON_HEALTHY,	"healthy" }
 
 #define XFS_HEALTHMON_DOMAIN_STRINGS \
-	{ XFS_HEALTHMON_MOUNT,		"mount" }
+	{ XFS_HEALTHMON_MOUNT,		"mount" }, \
+	{ XFS_HEALTHMON_FS,		"fs" }, \
+	{ XFS_HEALTHMON_AG,		"ag" }, \
+	{ XFS_HEALTHMON_INODE,		"inode" }, \
+	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
 
 DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6053,6 +6069,19 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 				break;
 			}
 			break;
+		case XFS_HEALTHMON_FS:
+			__entry->mask = event->fsmask;
+			break;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			__entry->mask = event->grpmask;
+			__entry->group = event->group;
+			break;
+		case XFS_HEALTHMON_INODE:
+			__entry->mask = event->imask;
+			__entry->ino = event->ino;
+			__entry->gen = event->gen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6071,11 +6100,76 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 DEFINE_EVENT(xfs_healthmon_event_class, name, \
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
 	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_insert);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+
+#define XFS_HEALTHUP_TYPE_STRINGS \
+	{ XFS_HEALTHUP_UNMOUNT,		"unmount" }, \
+	{ XFS_HEALTHUP_SICK,		"sick" }, \
+	{ XFS_HEALTHUP_CORRUPT,		"corrupt" }, \
+	{ XFS_HEALTHUP_HEALTHY,		"healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+	{ XFS_HEALTHUP_FS,		"fs" }, \
+	{ XFS_HEALTHUP_AG,		"ag" }, \
+	{ XFS_HEALTHUP_INODE,		"inode" }, \
+	{ XFS_HEALTHUP_RTGROUP,		"rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+		 const struct xfs_health_update_params *update,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, type, update, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, old_mask)
+		__field(unsigned int, new_mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = type;
+		__entry->domain = update->domain;
+		__entry->old_mask = update->old_mask;
+		__entry->new_mask = update->new_mask;
+		__entry->ino = update->ino;
+		__entry->gen = update->gen;
+		__entry->group = update->group;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+		  __entry->old_mask,
+		  __entry->new_mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 71952d5eec2a9e..da827060853a8f 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -609,6 +609,25 @@ xfs_fsop_geom_health(
 	}
 }
 
+/*
+ * Translate XFS_SICK_FS_* into XFS_FSOP_GEOM_SICK_* except for the rt free
+ * space codes, which are sent via the rtgroup events.
+ */
+unsigned int
+xfs_healthmon_fs_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(fs_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map ag_map[] = {
 	{ XFS_SICK_AG_SB,	XFS_AG_GEOM_SICK_SB },
 	{ XFS_SICK_AG_AGF,	XFS_AG_GEOM_SICK_AGF },
@@ -645,6 +664,22 @@ xfs_ag_geom_health(
 	}
 }
 
+/* Translate XFS_SICK_AG_* into XFS_AG_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_perag_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(ag_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map rtgroup_map[] = {
 	{ XFS_SICK_RG_SUPER,	XFS_RTGROUP_GEOM_SICK_SUPER },
 	{ XFS_SICK_RG_BITMAP,	XFS_RTGROUP_GEOM_SICK_BITMAP },
@@ -675,6 +710,22 @@ xfs_rtgroup_geom_health(
 	}
 }
 
+/* Translate XFS_SICK_RG_* into XFS_RTGROUP_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_rtgroup_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(rtgroup_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 static const struct ioctl_sick_map ino_map[] = {
 	{ XFS_SICK_INO_CORE,	XFS_BS_SICK_INODE },
 	{ XFS_SICK_INO_BMBTD,	XFS_BS_SICK_BMBTD },
@@ -713,6 +764,22 @@ xfs_bulkstat_health(
 	}
 }
 
+/* Translate XFS_SICK_INO_* into XFS_BS_SICK_*. */
+unsigned int
+xfs_healthmon_inode_mask(
+	unsigned int			sick_mask)
+{
+	const struct ioctl_sick_map	*m;
+	unsigned int			ioctl_mask = 0;
+
+	for_each_sick_map(ino_map, m) {
+		if (sick_mask & m->sick_mask)
+			ioctl_mask |= m->ioctl_mask;
+	}
+
+	return ioctl_mask;
+}
+
 /* Mark a block mapping sick. */
 void
 xfs_bmap_mark_sick(
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 8cf6b0b81a721b..d1474e6b9ab544 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -18,6 +18,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
+#include "xfs_health.h"
 #include "xfs_healthmon.h"
 
 #include <linux/anon_inodes.h>
@@ -62,8 +63,15 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*first_event;
 	struct xfs_healthmon_event	*last_event;
 
+	/* live update hooks */
+	struct xfs_health_hook		hhook;
+
+	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
 
+	/* filesystem type for safe cleanup of hooks; requires module_get */
+	struct file_system_type		*fstyp;
+
 	/* number of events */
 	unsigned int			events;
 
@@ -128,6 +136,23 @@ xfs_healthmon_free_head(
 	return 0;
 }
 
+/* Insert an event onto the start of the list. */
+static inline void
+__xfs_healthmon_insert(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	event->next = hm->first_event;
+	/* The new event always becomes the head of the list. */
+	hm->first_event = event;
+	if (!hm->last_event)
+		hm->last_event = event;
+	xfs_healthmon_bump_events(hm);
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_insert(hm->mp, event);
+}
+
 /* Push an event onto the end of the list. */
 static inline void
 __xfs_healthmon_push(
@@ -199,6 +224,10 @@ xfs_healthmon_start_live_update(
 {
 	struct xfs_healthmon_event	*event;
 
+	/* Filesystem already unmounted, do nothing. */
+	if (!hm->mp)
+		return -ESHUTDOWN;
+
 	/* If the queue is already full.... */
 	if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
 		trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
@@ -271,6 +300,185 @@ xfs_healthmon_start_live_update(
 	return 0;
 }
 
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+	struct xfs_healthmon			*hm,
+	enum xfs_health_update_type		type,
+	const struct xfs_health_update_params	*hup,
+	unsigned int				*mask)
+{
+	/* Always report unmounts. */
+	if (type == XFS_HEALTHUP_UNMOUNT)
+		return true;
+
+	/* If we want all events, return all events. */
+	if (hm->verbose) {
+		*mask = hup->new_mask;
+		return true;
+	}
+
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		/* Always report runtime corruptions */
+		*mask = hup->new_mask;
+		break;
+	case XFS_HEALTHUP_CORRUPT:
+		/* Only report new fsck errors */
+		*mask = hup->new_mask & ~hup->old_mask;
+		break;
+	case XFS_HEALTHUP_HEALTHY:
+		/* Only report healthy metadata that got fixed */
+		*mask = hup->new_mask & hup->old_mask;
+		break;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* This is here for static enum checking */
+		break;
+	}
+
+	/* If not in verbose mode, mask state has to change. */
+	return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+	enum xfs_health_update_type	type)
+{
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		return XFS_HEALTHMON_SICK;
+	case XFS_HEALTHUP_CORRUPT:
+		return XFS_HEALTHMON_CORRUPT;
+	case XFS_HEALTHUP_HEALTHY:
+		return XFS_HEALTHMON_HEALTHY;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+	enum xfs_health_update_domain	domain)
+{
+	switch (domain) {
+	case XFS_HEALTHUP_FS:
+		return XFS_HEALTHMON_FS;
+	case XFS_HEALTHUP_AG:
+		return XFS_HEALTHMON_AG;
+	case XFS_HEALTHUP_RTGROUP:
+		return XFS_HEALTHMON_RTGROUP;
+	case XFS_HEALTHUP_INODE:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_health_update_params	*hup = data;
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	enum xfs_health_update_type	type = action;
+	unsigned int			mask = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+	/* Decode event mask and skip events we don't care about. */
+	if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+		return NOTIFY_DONE;
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (type == XFS_HEALTHUP_UNMOUNT) {
+		/*
+		 * The filesystem is unmounting, so we must detach from the
+		 * mount.  After this point, the healthmon thread has no
+		 * connection to the mounted filesystem and must not touch its
+		 * hooks.
+		 */
+		trace_xfs_healthmon_unmount(hm->mp, hm->events,
+				hm->lost_prev_event);
+
+		hm->mp = NULL;
+
+		/*
+		 * Try to add an unmount message to the head of the list so
+		 * that userspace will notice the unmount.  If we can't add
+		 * the event, wake up the reader directly.
+		 */
+		event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_UNMOUNT,
+				XFS_HEALTHMON_MOUNT);
+		if (event)
+			__xfs_healthmon_insert(hm, event);
+		else
+			wake_up(&hm->wait);
+
+		goto out_unlock;
+	}
+
+	event = xfs_healthmon_alloc(hm, health_update_to_type(type),
+			  health_update_to_domain(hup->domain));
+	if (!event)
+		goto out_unlock;
+
+	/* Ignore the event if it's only reporting a secondary health state. */
+	switch (event->domain) {
+	case XFS_HEALTHMON_FS:
+		event->fsmask = mask & ~XFS_SICK_FS_SECONDARY;
+		if (!event->fsmask)
+			goto out_event;
+		break;
+	case XFS_HEALTHMON_AG:
+		event->grpmask = mask & ~XFS_SICK_AG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		event->grpmask = mask & ~XFS_SICK_RG_SECONDARY;
+		if (!event->grpmask)
+			goto out_event;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		event->imask = mask & ~XFS_SICK_INO_SECONDARY;
+		if (!event->imask)
+			goto out_event;
+		event->ino = hup->ino;
+		event->gen = hup->gen;
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		goto out_event;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+out_event:
+	kfree(event);
+	goto out_unlock;
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -281,11 +489,19 @@ xfs_healthmon_reset_outbuf(
 
 static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+	[XFS_HEALTHMON_FS]		= XFS_HEALTH_MONITOR_DOMAIN_FS,
+	[XFS_HEALTHMON_AG]		= XFS_HEALTH_MONITOR_DOMAIN_AG,
+	[XFS_HEALTHMON_INODE]		= XFS_HEALTH_MONITOR_DOMAIN_INODE,
+	[XFS_HEALTHMON_RTGROUP]		= XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
 };
 
 static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_RUNNING]		= XFS_HEALTH_MONITOR_TYPE_RUNNING,
 	[XFS_HEALTHMON_LOST]		= XFS_HEALTH_MONITOR_TYPE_LOST,
+	[XFS_HEALTHMON_SICK]		= XFS_HEALTH_MONITOR_TYPE_SICK,
+	[XFS_HEALTHMON_CORRUPT]		= XFS_HEALTH_MONITOR_TYPE_CORRUPT,
+	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
+	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 };
 
 /* Render event as a V0 structure */
@@ -321,6 +537,22 @@ xfs_healthmon_format_v0(
 			break;
 		}
 		break;
+	case XFS_HEALTHMON_FS:
+		hme.e.fs.mask = xfs_healthmon_fs_mask(event->fsmask);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		hme.e.group.mask = xfs_healthmon_rtgroup_mask(event->grpmask);
+		hme.e.group.gno = event->group;
+		break;
+	case XFS_HEALTHMON_AG:
+		hme.e.group.mask = xfs_healthmon_perag_mask(event->grpmask);
+		hme.e.group.gno = event->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		hme.e.inode.mask = xfs_healthmon_inode_mask(event->imask);
+		hme.e.inode.ino = event->ino;
+		hme.e.inode.gen = event->gen;
+		break;
 	default:
 		break;
 	}
@@ -361,7 +593,7 @@ static inline bool
 xfs_healthmon_has_eventdata(
 	struct xfs_healthmon	*hm)
 {
-	return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+	return !hm->mp || hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
 }
 
 /* Try to copy the rest of the outbuf to the iov iter. */
@@ -404,10 +636,16 @@ xfs_healthmon_peek(
 	struct xfs_healthmon_event *event;
 
 	mutex_lock(&hm->lock);
+	event = hm->first_event;
 	if (hm->mp)
-		event = hm->first_event;
-	else
-		event = NULL;
+		goto done;
+
+	/* If the filesystem is unmounted, only return the unmount event */
+	if (event && event->type == XFS_HEALTHMON_UNMOUNT)
+		goto done;
+	event = NULL;
+
+done:
 	mutex_unlock(&hm->lock);
 	return event;
 }
@@ -539,6 +777,58 @@ xfs_healthmon_free_events(
 	hm->first_event = hm->last_event = NULL;
 }
 
+/*
+ * Detach all filesystem hooks that were set up for a health monitor.  Only
+ * call this from iterate_super*.
+ */
+STATIC void
+xfs_healthmon_detach_hooks(
+	struct super_block	*sb,
+	void			*arg)
+{
+	struct xfs_healthmon	*hm = arg;
+
+	mutex_lock(&hm->lock);
+
+	/*
+	 * Because health monitors have a weak reference to the filesystem
+	 * they're monitoring, the hook deletions below must not race against
+	 * that filesystem being unmounted because that could lead to UAF
+	 * errors.
+	 *
+	 * If hm->mp is NULL, the health unmount hook already ran and the hook
+	 * chain head (contained within the xfs_mount structure) is gone.  Do
+	 * not detach any hooks; just let them get freed when the healthmon
+	 * object is torn down.
+	 */
+	if (!hm->mp)
+		goto out_unlock;
+
+	/*
+	 * Otherwise, the caller gave us a non-dying @sb with s_umount held in
+	 * shared mode, which means that @sb cannot be running through
+	 * deactivate_locked_super and cannot be freed.  It's safe to compare
+	 * @sb against the super that we snapshotted when we set up the health
+	 * monitor.
+	 */
+	if (hm->mp->m_super != sb)
+		goto out_unlock;
+
+	mutex_unlock(&hm->lock);
+
+	/*
+	 * Now we know that the filesystem @hm->mp is active and cannot be
+	 * deactivated until this function returns.  Unmount events are sent
+	 * through the health monitoring subsystem from xfs_fs_put_super, so
+	 * it is now time to detach the hooks.
+	 */
+	xfs_health_hook_del(hm->mp, &hm->hhook);
+	return;
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+}
+
 /* Free the health monitoring information. */
 STATIC int
 xfs_healthmon_release(
@@ -551,6 +841,9 @@ xfs_healthmon_release(
 
 	wake_up_all(&hm->wait);
 
+	iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+	xfs_health_hook_disable();
+
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	if (hm->outbuf.size)
@@ -597,11 +890,18 @@ xfs_healthmon_show_fdinfo(
 	struct xfs_healthmon	*hm = file->private_data;
 
 	mutex_lock(&hm->lock);
+	if (!hm->mp) {
+		seq_printf(m, "state:\tdead\n");
+		goto out_unlock;
+	}
+
 	seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
 			hm->mp->m_super->s_id,
 			xfs_healthmon_format_string(hm),
 			hm->total_events,
 			hm->total_lost);
+
+out_unlock:
 	mutex_unlock(&hm->lock);
 }
 #endif
@@ -646,6 +946,13 @@ xfs_ioc_health_monitor(
 	hm->mp = mp;
 	hm->format = hmo.format;
 
+	/*
+	 * Since we already hold a ref to the module, stash a pointer to the
+	 * fstype so that we can find the superblock and detach the hooks
+	 * when we tear things down later.
+	 */
+	hm->fstyp = mp->m_super->s_type;
+
 	seq_buf_init(&hm->outbuf, NULL, 0);
 	mutex_init(&hm->lock);
 	init_waitqueue_head(&hm->wait);
@@ -653,12 +960,21 @@ xfs_ioc_health_monitor(
 	if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
 		hm->verbose = true;
 
+	/* Enable hooks to receive events, generally. */
+	xfs_health_hook_enable();
+
+	/* Attach specific event hooks to this monitor. */
+	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+	ret = xfs_health_hook_add(mp, &hm->hhook);
+	if (ret)
+		goto out_hooks;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -670,14 +986,17 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mutex;
+		goto out_healthhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
-out_mutex:
+out_healthhook:
+	xfs_health_hook_del(mp, &hm->hhook);
+out_hooks:
+	xfs_health_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
 	kfree(hm);



* [PATCH 13/22] xfs: report shutdown events through healthmon
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (11 preceding siblings ...)
  2025-11-05  0:51   ` [PATCH 12/22] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2025-11-05  0:51   ` Darrick J. Wong
  2025-11-05  0:52   ` [PATCH 14/22] xfs: report media errors " Darrick J. Wong
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:51 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   18 +++++++++
 fs/xfs/xfs_healthmon.h |    5 ++-
 fs/xfs/xfs_trace.h     |   28 ++++++++++++++
 fs/xfs/xfs_healthmon.c |   93 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 141 insertions(+), 3 deletions(-)
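
Note: the shutdown reasons are translated from in-kernel flags to the
userspace ABI through a lookup table, so the two bit layouts can change
independently.  A minimal standalone sketch of that table-driven
translation follows.  All TOY_* constants and toy_* names are hypothetical
and deliberately use different bit positions on each side; they are not the
real SHUTDOWN_* or XFS_HEALTH_SHUTDOWN_* values.

```c
#include <stddef.h>

/* One row per flag: internal bit on the left, exported bit on the right. */
struct toy_flags_map {
	unsigned int	in_mask;
	unsigned int	out_mask;
};

/* Hypothetical internal bits and the exported bits they map to. */
#define TOY_IN_META_IO		(1u << 4)
#define TOY_IN_LOG_IO		(1u << 7)
#define TOY_OUT_META_IO		(1u << 0)
#define TOY_OUT_LOG_IO		(1u << 1)

static const struct toy_flags_map toy_map[] = {
	{ TOY_IN_META_IO,	TOY_OUT_META_IO },
	{ TOY_IN_LOG_IO,	TOY_OUT_LOG_IO },
};

/* Translate every known internal bit; unknown bits are silently dropped. */
static unsigned int toy_translate(unsigned int flags)
{
	unsigned int	ret = 0;
	size_t		i;

	for (i = 0; i < sizeof(toy_map) / sizeof(toy_map[0]); i++) {
		if (flags & toy_map[i].in_mask)
			ret |= toy_map[i].out_mask;
	}

	return ret;
}
```

One consequence of this design is that an internal flag without a table
entry never reaches userspace, so adding a new shutdown reason requires a
new row in the map as well as a new exported constant.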


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2ad45351ac0ea6..677141a17605a4 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1028,6 +1028,9 @@ struct xfs_rtgroup_geometry {
 /* filesystem was unmounted */
 #define XFS_HEALTH_MONITOR_TYPE_UNMOUNT		(5)
 
+/* filesystem shutdown */
+#define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN	(6)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1054,6 +1057,20 @@ struct xfs_health_monitor_inode {
 	__u64	ino;
 };
 
+/* shutdown reasons */
+#define XFS_HEALTH_SHUTDOWN_META_IO_ERROR	(1u << 0)
+#define XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR	(1u << 1)
+#define XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT	(1u << 2)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE	(1u << 3)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK	(1u << 4)
+#define XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED	(1u << 5)
+
+/* shutdown */
+struct xfs_health_monitor_shutdown {
+	/* XFS_HEALTH_SHUTDOWN_* flags */
+	__u32	reasons;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1074,6 +1091,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_fs fs;
 		struct xfs_health_monitor_group group;
 		struct xfs_health_monitor_inode inode;
+		struct xfs_health_monitor_shutdown shutdown;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 3f3ba16d5af56a..a82a684bbc0e03 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -11,6 +11,9 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_LOST,	/* message lost */
 	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
 
+	/* filesystem shutdown */
+	XFS_HEALTHMON_SHUTDOWN,
+
 	/* metadata health events */
 	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
@@ -41,7 +44,7 @@ struct xfs_healthmon_event {
 		struct {
 			uint64_t	lostcount;
 		};
-		/* mount */
+		/* shutdown */
 		struct {
 			unsigned int	flags;
 		};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 051599f8433ed6..b2b056ceb52f5c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6009,8 +6009,32 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
 DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+	TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, shutdown_flags, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, shutdown_flags)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->shutdown_flags = shutdown_flags;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
 #define XFS_HEALTHMON_TYPE_STRINGS \
 	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }, \
 	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
 	{ XFS_HEALTHMON_SICK,		"sick" }, \
 	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
@@ -6024,6 +6048,7 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
@@ -6064,6 +6089,9 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 		switch (__entry->domain) {
 		case XFS_HEALTHMON_MOUNT:
 			switch (__entry->type) {
+			case XFS_HEALTHMON_SHUTDOWN:
+				__entry->mask = event->flags;
+				break;
 			case XFS_HEALTHMON_LOST:
 				__entry->lostcount = event->lostcount;
 				break;
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d1474e6b9ab544..f36d7fbfb1ca16 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -20,6 +20,7 @@
 #include "xfs_rtgroup.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_fsops.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -64,6 +65,7 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*last_event;
 
 	/* live update hooks */
+	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
@@ -479,6 +481,43 @@ xfs_healthmon_metadata_hook(
 	goto out_unlock;
 }
 
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_SHUTDOWN,
+			XFS_HEALTHMON_MOUNT);
+	if (!event)
+		goto out_unlock;
+
+	event->flags = action;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -487,6 +526,44 @@ xfs_healthmon_reset_outbuf(
 	seq_buf_clear(&hm->outbuf);
 }
 
+struct flags_map {
+	unsigned int		in_mask;
+	unsigned int		out_mask;
+};
+
+static const struct flags_map shutdown_map[] = {
+	{ SHUTDOWN_META_IO_ERROR,	XFS_HEALTH_SHUTDOWN_META_IO_ERROR },
+	{ SHUTDOWN_LOG_IO_ERROR,	XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR },
+	{ SHUTDOWN_FORCE_UMOUNT,	XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT },
+	{ SHUTDOWN_CORRUPT_INCORE,	XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE },
+	{ SHUTDOWN_CORRUPT_ONDISK,	XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK },
+	{ SHUTDOWN_DEVICE_REMOVED,	XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED },
+};
+
+static inline unsigned int
+__map_flags(
+	const struct flags_map	*map,
+	size_t			array_len,
+	unsigned int		flags)
+{
+	const struct flags_map	*m;
+	unsigned int		ret = 0;
+
+	for (m = map; m < map + array_len; m++) {
+		if (flags & m->in_mask)
+			ret |= m->out_mask;
+	}
+
+	return ret;
+}
+
+#define map_flags(map, flags) __map_flags((map), ARRAY_SIZE(map), (flags))
+
+static inline unsigned int shutdown_mask(unsigned int in)
+{
+	return map_flags(shutdown_map, in);
+}
+
 static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_MOUNT]		= XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
 	[XFS_HEALTHMON_FS]		= XFS_HEALTH_MONITOR_DOMAIN_FS,
@@ -502,6 +579,7 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_CORRUPT]		= XFS_HEALTH_MONITOR_TYPE_CORRUPT,
 	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
+	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
 };
 
 /* Render event as a V0 structure */
@@ -533,6 +611,9 @@ xfs_healthmon_format_v0(
 		case XFS_HEALTHMON_LOST:
 			hme.e.lost.count = event->lostcount;
 			break;
+		case XFS_HEALTHMON_SHUTDOWN:
+			hme.e.shutdown.reasons = shutdown_mask(event->flags);
+			break;
 		default:
 			break;
 		}
@@ -822,6 +903,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
 
@@ -969,12 +1051,17 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_hooks;
 
+	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+	ret = xfs_shutdown_hook_add(mp, &hm->shook);
+	if (ret)
+		goto out_healthhook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -986,13 +1073,15 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_healthhook;
+		goto out_shutdownhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_shutdownhook:
+	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_hooks:



* [PATCH 14/22] xfs: report media errors through healthmon
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (12 preceding siblings ...)
  2025-11-05  0:51   ` [PATCH 13/22] xfs: report shutdown " Darrick J. Wong
@ 2025-11-05  0:52   ` Darrick J. Wong
  2025-11-05  0:52   ` [PATCH 15/22] xfs: report file io " Darrick J. Wong
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:52 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that we have hooks to report media errors, connect them to the
health monitor as well so that each media error is queued as an event
for userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   15 +++++++++
 fs/xfs/xfs_healthmon.h |   12 +++++++
 fs/xfs/xfs_trace.h     |   57 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c |   77 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.c     |    1 +
 5 files changed, 160 insertions(+), 2 deletions(-)
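
For readers of the event stream, here is a minimal userspace sketch of how
a media error event might be interpreted.  It assumes that daddr and bbcount
are in 512-byte basic blocks, as xfs_daddr_t is elsewhere in XFS.  The
struct and helper names below are hypothetical local mirrors for
illustration; only the domain codes (5, 6, 7) come from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Domain codes as proposed in this patch (local copies for the sketch). */
#define SKETCH_DOMAIN_DATADEV	5
#define SKETCH_DOMAIN_RTDEV	6
#define SKETCH_DOMAIN_LOGDEV	7

/* Hypothetical local mirror of struct xfs_health_monitor_media. */
struct media_range {
	uint64_t	daddr;		/* start, in 512-byte basic blocks */
	uint64_t	bbcount;	/* length, in 512-byte basic blocks */
};

#define SKETCH_BBSHIFT	9	/* log2(512) */

/* Byte offset of the start of the bad range on the device. */
static uint64_t media_start_bytes(const struct media_range *m)
{
	return m->daddr << SKETCH_BBSHIFT;
}

/* Length of the bad range in bytes. */
static uint64_t media_length_bytes(const struct media_range *m)
{
	return m->bbcount << SKETCH_BBSHIFT;
}

/* Name the device for a media error domain, or NULL if it isn't one. */
static const char *media_domain_name(unsigned int domain)
{
	switch (domain) {
	case SKETCH_DOMAIN_DATADEV:
		return "datadev";
	case SKETCH_DOMAIN_RTDEV:
		return "rtdev";
	case SKETCH_DOMAIN_LOGDEV:
		return "logdev";
	default:
		return NULL;
	}
}
```

A consumer would check the event's domain first and only then read the
media member of the union, mirroring what xfs_healthmon_format_v0 does
on the kernel side.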


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 677141a17605a4..0711007344e16d 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1014,6 +1014,11 @@ struct xfs_rtgroup_geometry {
 #define XFS_HEALTH_MONITOR_DOMAIN_INODE		(3)
 #define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP	(4)
 
+/* disk events */
+#define XFS_HEALTH_MONITOR_DOMAIN_DATADEV	(5)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTDEV		(6)
+#define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV	(7)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
@@ -1031,6 +1036,9 @@ struct xfs_rtgroup_geometry {
 /* filesystem shutdown */
 #define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN	(6)
 
+/* media errors */
+#define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR	(7)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1071,6 +1079,12 @@ struct xfs_health_monitor_shutdown {
 	__u32	reasons;
 };
 
+/* disk media errors */
+struct xfs_health_monitor_media {
+	__u64	daddr;
+	__u64	bbcount;
+};
+
 struct xfs_health_monitor_event {
 	/* XFS_HEALTH_MONITOR_DOMAIN_* */
 	__u32	domain;
@@ -1092,6 +1106,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_group group;
 		struct xfs_health_monitor_inode inode;
 		struct xfs_health_monitor_shutdown shutdown;
+		struct xfs_health_monitor_media media;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index a82a684bbc0e03..407c5e1f466726 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -19,6 +19,8 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
 	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
 
+	/* media errors */
+	XFS_HEALTHMON_MEDIA_ERROR,
 };
 
 enum xfs_healthmon_domain {
@@ -29,6 +31,11 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_AG,	/* allocation group metadata */
 	XFS_HEALTHMON_INODE,	/* inode metadata */
 	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
+
+	/* media errors */
+	XFS_HEALTHMON_DATADEV,
+	XFS_HEALTHMON_RTDEV,
+	XFS_HEALTHMON_LOGDEV,
 };
 
 struct xfs_healthmon_event {
@@ -66,6 +73,11 @@ struct xfs_healthmon_event {
 			uint32_t	gen;
 			xfs_ino_t	ino;
 		};
+		/* media errors */
+		struct {
+			xfs_daddr_t	daddr;
+			uint64_t	bbcount;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b2b056ceb52f5c..79805ee9aa64f5 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -105,6 +105,7 @@ struct xfs_rtgroup;
 struct xfs_open_zone;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
+struct xfs_media_error_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6110,6 +6111,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 			__entry->ino = event->ino;
 			__entry->gen = event->gen;
 			break;
+		case XFS_HEALTHMON_DATADEV:
+		case XFS_HEALTHMON_LOGDEV:
+		case XFS_HEALTHMON_RTDEV:
+			__entry->offset = event->daddr;
+			__entry->length = event->bbcount;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6198,6 +6205,56 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+TRACE_EVENT(xfs_healthmon_media_error_hook,
+	TP_PROTO(const struct xfs_media_error_params *p,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(uint64_t, daddr)
+		__field(uint64_t, bbcount)
+		__field(int, pre_remove)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		struct xfs_mount	*mp = p->mp;
+		struct xfs_buftarg	*btp = NULL;
+
+		switch (p->fdev) {
+		case XFS_FAILED_DATADEV:
+			btp = mp->m_ddev_targp;
+			break;
+		case XFS_FAILED_LOGDEV:
+			btp = mp->m_logdev_targp;
+			break;
+		case XFS_FAILED_RTDEV:
+			btp = mp->m_rtdev_targp;
+			break;
+		}
+
+		__entry->dev = mp->m_super->s_dev;
+		/* trace entries are not zeroed, so always set error_dev */
+		__entry->error_dev = btp ? btp->bt_dev : 0;
+		__entry->daddr = p->daddr;
+		__entry->bbcount = p->bbcount;
+		__entry->pre_remove = p->pre_remove;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx pre_remove? %d events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->error_dev), MINOR(__entry->error_dev),
+		  __entry->daddr,
+		  __entry->bbcount,
+		  __entry->pre_remove,
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#endif
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index f36d7fbfb1ca16..efc8ff554e42da 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -21,6 +21,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -67,6 +68,7 @@ struct xfs_healthmon {
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
+	struct xfs_media_error_hook	mhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -518,6 +520,59 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+/* Add a media error event to the reporting queue. */
+STATIC int
+xfs_healthmon_media_error_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_media_error_params	*p = data;
+	enum xfs_healthmon_domain	domain = 0; /* shut up gcc */
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_media_error_hook(p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (p->fdev) {
+	case XFS_FAILED_LOGDEV:
+		domain = XFS_HEALTHMON_LOGDEV;
+		break;
+	case XFS_FAILED_RTDEV:
+		domain = XFS_HEALTHMON_RTDEV;
+		break;
+	case XFS_FAILED_DATADEV:
+		domain = XFS_HEALTHMON_DATADEV;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_MEDIA_ERROR, domain);
+	if (!event)
+		goto out_unlock;
+
+	event->daddr = p->daddr;
+	event->bbcount = p->bbcount;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+#endif
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -570,6 +625,9 @@ static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_AG]		= XFS_HEALTH_MONITOR_DOMAIN_AG,
 	[XFS_HEALTHMON_INODE]		= XFS_HEALTH_MONITOR_DOMAIN_INODE,
 	[XFS_HEALTHMON_RTGROUP]		= XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
+	[XFS_HEALTHMON_DATADEV]		= XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
+	[XFS_HEALTHMON_RTDEV]		= XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
+	[XFS_HEALTHMON_LOGDEV]		= XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
 };
 
 static const unsigned int type_map[] = {
@@ -580,6 +638,7 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_HEALTHY]		= XFS_HEALTH_MONITOR_TYPE_HEALTHY,
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
+	[XFS_HEALTHMON_MEDIA_ERROR]	= XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
 };
 
 /* Render event as a V0 structure */
@@ -634,6 +693,12 @@ xfs_healthmon_format_v0(
 		hme.e.inode.ino = event->ino;
 		hme.e.inode.gen = event->gen;
 		break;
+	case XFS_HEALTHMON_DATADEV:
+	case XFS_HEALTHMON_LOGDEV:
+	case XFS_HEALTHMON_RTDEV:
+		hme.e.media.daddr = event->daddr;
+		hme.e.media.bbcount = event->bbcount;
+		break;
 	default:
 		break;
 	}
@@ -903,6 +968,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
 	return;
@@ -1056,12 +1122,17 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_healthhook;
 
+	xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook);
+	ret = xfs_media_error_hook_add(mp, &hm->mhook);
+	if (ret)
+		goto out_shutdownhook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -1073,13 +1144,15 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_shutdownhook;
+		goto out_mediahook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_mediahook:
+	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
 	xfs_shutdown_hook_del(mp, &hm->shook);
 out_healthhook:
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index d42b864a3837a2..08ddab700a6cd3 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -53,6 +53,7 @@
 #include "xfs_zone_priv.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 /*
  * We include this last to have the helpers above available for the trace



* [PATCH 15/22] xfs: report file io errors through healthmon
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (13 preceding siblings ...)
  2025-11-05  0:52   ` [PATCH 14/22] xfs: report media errors " Darrick J. Wong
@ 2025-11-05  0:52   ` Darrick J. Wong
  2025-11-05  0:52   ` [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:52 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a file io error event hook so that we can send events about
buffered read errors, writeback errors, directio errors, and lost file
data to userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   19 +++++++++
 fs/xfs/xfs_healthmon.h |   17 ++++++++
 fs/xfs/xfs_trace.h     |   59 +++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.c |   98 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.c     |    1 
 5 files changed, 192 insertions(+), 2 deletions(-)
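
As an illustration of how a client might label the new file range events,
here is a small sketch that maps the type codes added by this patch to
strings.  The numeric values (8 through 12) are the ones defined above; the
macro and function names are hypothetical, and the strings are borrowed
from the tracepoint's XFS_FILE_IOERROR_STRINGS table rather than from any
userspace tool.

```c
#include <stddef.h>

/* Type codes as proposed in this patch (local copies for the sketch). */
#define SKETCH_TYPE_BUFREAD	8
#define SKETCH_TYPE_BUFWRITE	9
#define SKETCH_TYPE_DIOREAD	10
#define SKETCH_TYPE_DIOWRITE	11
#define SKETCH_TYPE_DATALOST	12

/*
 * Return a short label for a file range event type, or NULL if the type
 * is not one of the file range events.
 */
static const char *filerange_type_name(unsigned int type)
{
	switch (type) {
	case SKETCH_TYPE_BUFREAD:
		return "readahead";
	case SKETCH_TYPE_BUFWRITE:
		return "writeback";
	case SKETCH_TYPE_DIOREAD:
		return "directio_read";
	case SKETCH_TYPE_DIOWRITE:
		return "directio_write";
	case SKETCH_TYPE_DATALOST:
		return "datalost";
	default:
		return NULL;
	}
}
```

Returning NULL for unknown codes lets a caller fall through to its
handling of the other event types instead of mislabeling them.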


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 0711007344e16d..a96f11d9bd9c64 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1019,6 +1019,9 @@ struct xfs_rtgroup_geometry {
 #define XFS_HEALTH_MONITOR_DOMAIN_RTDEV		(6)
 #define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV	(7)
 
+/* file range events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FILERANGE	(8)
+
 /* Health monitor event types */
 
 /* status of the monitor itself */
@@ -1039,6 +1042,13 @@ struct xfs_rtgroup_geometry {
 /* media errors */
 #define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR	(7)
 
+/* file range events */
+#define XFS_HEALTH_MONITOR_TYPE_BUFREAD		(8)
+#define XFS_HEALTH_MONITOR_TYPE_BUFWRITE	(9)
+#define XFS_HEALTH_MONITOR_TYPE_DIOREAD		(10)
+#define XFS_HEALTH_MONITOR_TYPE_DIOWRITE	(11)
+#define XFS_HEALTH_MONITOR_TYPE_DATALOST	(12)
+
 /* lost events */
 struct xfs_health_monitor_lost {
 	__u64	count;
@@ -1079,6 +1089,14 @@ struct xfs_health_monitor_shutdown {
 	__u32	reasons;
 };
 
+/* file range events */
+struct xfs_health_monitor_filerange {
+	__u64	pos;
+	__u64	len;
+	__u64	ino;
+	__u32	gen;
+};
+
 /* disk media errors */
 struct xfs_health_monitor_media {
 	__u64	daddr;
@@ -1107,6 +1125,7 @@ struct xfs_health_monitor_event {
 		struct xfs_health_monitor_inode inode;
 		struct xfs_health_monitor_shutdown shutdown;
 		struct xfs_health_monitor_media media;
+		struct xfs_health_monitor_filerange filerange;
 	} e;
 
 	/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 407c5e1f466726..1ce49197262b1c 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -21,6 +21,13 @@ enum xfs_healthmon_type {
 
 	/* media errors */
 	XFS_HEALTHMON_MEDIA_ERROR,
+
+	/* file range events */
+	XFS_HEALTHMON_BUFREAD,
+	XFS_HEALTHMON_BUFWRITE,
+	XFS_HEALTHMON_DIOREAD,
+	XFS_HEALTHMON_DIOWRITE,
+	XFS_HEALTHMON_DATALOST,
 };
 
 enum xfs_healthmon_domain {
@@ -36,6 +43,9 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_DATADEV,
 	XFS_HEALTHMON_RTDEV,
 	XFS_HEALTHMON_LOGDEV,
+
+	/* file range events */
+	XFS_HEALTHMON_FILERANGE,
 };
 
 struct xfs_healthmon_event {
@@ -78,6 +88,13 @@ struct xfs_healthmon_event {
 			xfs_daddr_t	daddr;
 			uint64_t	bbcount;
 		};
+		/* file range events */
+		struct {
+			xfs_ino_t	fino;
+			loff_t		fpos;
+			uint64_t	flen;
+			uint32_t	fgen;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 79805ee9aa64f5..d1836583d4dfbb 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -106,6 +106,7 @@ struct xfs_open_zone;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
 struct xfs_media_error_params;
+struct xfs_file_ioerror_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6117,6 +6118,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 			__entry->offset = event->daddr;
 			__entry->length = event->bbcount;
 			break;
+		case XFS_HEALTHMON_FILERANGE:
+			__entry->ino = event->fino;
+			__entry->gen = event->fgen;
+			__entry->offset = event->fpos;
+			__entry->length = event->flen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6255,6 +6262,58 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->lost_prev)
 );
 #endif
+
+#define XFS_FILE_IOERROR_STRINGS \
+	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
+	{ XFS_FILE_IOERROR_BUFFERED_WRITE,	"writeback" }, \
+	{ XFS_FILE_IOERROR_DIRECT_READ,		"directio_read" }, \
+	{ XFS_FILE_IOERROR_DIRECT_WRITE,	"directio_write" }, \
+	{ XFS_FILE_IOERROR_DATA_LOST,		"datalost" }
+
+
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DATA_LOST);
+
+TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
+	TP_PROTO(const struct xfs_mount *mp,
+		 unsigned long action,
+		 const struct xfs_file_ioerror_params *p,
+		 unsigned int events, unsigned long long lost_prev),
+	TP_ARGS(mp, action, p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(unsigned long, action)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(long long, pos)
+		__field(unsigned long long, len)
+		__field(unsigned int, events)
+		__field(unsigned long long, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->action = action;
+		__entry->ino = p->ino;
+		__entry->gen = p->gen;
+		__entry->pos = p->pos;
+		__entry->len = p->len;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d ino 0x%llx gen 0x%x op %s pos 0x%llx bytecount 0x%llx events %u lost_prev? %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->gen,
+		  __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS),
+		  __entry->pos,
+		  __entry->len,
+		  __entry->events,
+		  __entry->lost_prev)
+);
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index efc8ff554e42da..31c2f6f43cf474 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -22,6 +22,7 @@
 #include "xfs_healthmon.h"
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -69,6 +70,7 @@ struct xfs_healthmon {
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 	struct xfs_media_error_hook	mhook;
+	struct xfs_file_ioerror_hook	fhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -573,6 +575,77 @@ xfs_healthmon_media_error_hook(
 }
 #endif
 
+/* Add a file io error event to the reporting queue. */
+STATIC int
+xfs_healthmon_file_ioerror_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_file_ioerror_params	*p = data;
+	enum xfs_healthmon_type		type = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb);
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+	case XFS_FILE_IOERROR_DIRECT_READ:
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+	case XFS_FILE_IOERROR_DATA_LOST:
+		break;
+	default:
+		ASSERT(0);
+		return NOTIFY_DONE;
+	}
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	switch (action) {
+	case XFS_FILE_IOERROR_BUFFERED_READ:
+		type = XFS_HEALTHMON_BUFREAD;
+		break;
+	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+		type = XFS_HEALTHMON_BUFWRITE;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_READ:
+		type = XFS_HEALTHMON_DIOREAD;
+		break;
+	case XFS_FILE_IOERROR_DIRECT_WRITE:
+		type = XFS_HEALTHMON_DIOWRITE;
+		break;
+	case XFS_FILE_IOERROR_DATA_LOST:
+		type = XFS_HEALTHMON_DATALOST;
+		break;
+	}
+
+	event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE);
+	if (!event)
+		goto out_unlock;
+
+	event->fino = p->ino;
+	event->fgen = p->gen;
+	event->fpos = p->pos;
+	event->flen = p->len;
+	error = xfs_healthmon_push(hm, event);
+	if (error)
+		kfree(event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 static inline void
 xfs_healthmon_reset_outbuf(
 	struct xfs_healthmon		*hm)
@@ -628,6 +701,7 @@ static const unsigned int domain_map[] = {
 	[XFS_HEALTHMON_DATADEV]		= XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
 	[XFS_HEALTHMON_RTDEV]		= XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
 	[XFS_HEALTHMON_LOGDEV]		= XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
+	[XFS_HEALTHMON_FILERANGE]	= XFS_HEALTH_MONITOR_DOMAIN_FILERANGE,
 };
 
 static const unsigned int type_map[] = {
@@ -639,6 +713,11 @@ static const unsigned int type_map[] = {
 	[XFS_HEALTHMON_UNMOUNT]		= XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
 	[XFS_HEALTHMON_SHUTDOWN]	= XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
 	[XFS_HEALTHMON_MEDIA_ERROR]	= XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
+	[XFS_HEALTHMON_BUFREAD]		= XFS_HEALTH_MONITOR_TYPE_BUFREAD,
+	[XFS_HEALTHMON_BUFWRITE]	= XFS_HEALTH_MONITOR_TYPE_BUFWRITE,
+	[XFS_HEALTHMON_DIOREAD]		= XFS_HEALTH_MONITOR_TYPE_DIOREAD,
+	[XFS_HEALTHMON_DIOWRITE]	= XFS_HEALTH_MONITOR_TYPE_DIOWRITE,
+	[XFS_HEALTHMON_DATALOST]	= XFS_HEALTH_MONITOR_TYPE_DATALOST,
 };
 
 /* Render event as a V0 structure */
@@ -699,6 +778,12 @@ xfs_healthmon_format_v0(
 		hme.e.media.daddr = event->daddr;
 		hme.e.media.bbcount = event->bbcount;
 		break;
+	case XFS_HEALTHMON_FILERANGE:
+		hme.e.filerange.ino = event->fino;
+		hme.e.filerange.gen = event->fgen;
+		hme.e.filerange.pos = event->fpos;
+		hme.e.filerange.len = event->flen;
+		break;
 	default:
 		break;
 	}
@@ -968,6 +1053,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
+	xfs_file_ioerror_hook_del(hm->mp, &hm->fhook);
 	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
@@ -1127,12 +1213,18 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_shutdownhook;
 
+	xfs_file_ioerror_hook_setup(&hm->fhook,
+			xfs_healthmon_file_ioerror_hook);
+	ret = xfs_file_ioerror_hook_add(mp, &hm->fhook);
+	if (ret)
+		goto out_mediahook;
+
 	/* Queue up the first event that lets the client know we're running. */
 	event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
 			XFS_HEALTHMON_MOUNT);
 	if (!event) {
 		ret = -ENOMEM;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 	__xfs_healthmon_push(hm, event);
 
@@ -1144,13 +1236,15 @@ xfs_ioc_health_monitor(
 			O_CLOEXEC | O_RDONLY);
 	if (fd < 0) {
 		ret = fd;
-		goto out_mediahook;
+		goto out_ioerrhook;
 	}
 
 	trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
 
 	return fd;
 
+out_ioerrhook:
+	xfs_file_ioerror_hook_del(mp, &hm->fhook);
 out_mediahook:
 	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 08ddab700a6cd3..eb35015c091570 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -54,6 +54,7 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_notify_failure.h"
+#include "xfs_file.h"
 
 /*
  * We include this last to have the helpers above available for the trace



* [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (14 preceding siblings ...)
  2025-11-05  0:52   ` [PATCH 15/22] xfs: report file io " Darrick J. Wong
@ 2025-11-05  0:52   ` Darrick J. Wong
  2025-11-05  0:52   ` [PATCH 17/22] xfs: validate fds against running healthmon Darrick J. Wong
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:52 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow userspace to reconfigure a running health monitor by calling the
XFS_IOC_HEALTH_MONITOR ioctl on the health monitoring file itself.  The
call updates the event format and the verbose setting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)


diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 31c2f6f43cf474..d3784073494ec6 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -23,6 +23,7 @@
 #include "xfs_fsops.h"
 #include "xfs_notify_failure.h"
 #include "xfs_file.h"
+#include "xfs_ioctl.h"
 
 #include <linux/anon_inodes.h>
 #include <linux/eventpoll.h>
@@ -1140,6 +1141,48 @@ xfs_healthmon_show_fdinfo(
 }
 #endif
 
+/* Reconfigure the health monitor. */
+STATIC long
+xfs_healthmon_reconfigure(
+	struct file			*file,
+	unsigned int			cmd,
+	void __user			*arg)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm = file->private_data;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	mutex_lock(&hm->lock);
+	hm->format = hmo.format;
+	hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+	mutex_unlock(&hm->lock);
+	return 0;
+}
+
+/* Handle ioctls for the health monitoring thread. */
+STATIC long
+xfs_healthmon_ioctl(
+	struct file			*file,
+	unsigned int			cmd,
+	unsigned long			p)
+{
+	void __user			*arg = (void __user *)p;
+
+	switch (cmd) {
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_healthmon_reconfigure(file, cmd, arg);
+	default:
+		break;
+	}
+
+	return -ENOTTY;
+}
+
 static const struct file_operations xfs_healthmon_fops = {
 	.owner		= THIS_MODULE,
 #ifdef CONFIG_PROC_FS
@@ -1148,6 +1191,7 @@ static const struct file_operations xfs_healthmon_fops = {
 	.read_iter	= xfs_healthmon_read_iter,
 	.poll		= xfs_healthmon_poll,
 	.release	= xfs_healthmon_release,
+	.unlocked_ioctl	= xfs_healthmon_ioctl,
 };
 
 /*



* [PATCH 17/22] xfs: validate fds against running healthmon
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (15 preceding siblings ...)
  2025-11-05  0:52   ` [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
@ 2025-11-05  0:52   ` Darrick J. Wong
  2025-11-05  0:53   ` [PATCH 18/22] xfs: add media error reporting ioctl Darrick J. Wong
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:52 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.

(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   10 ++++++++++
 fs/xfs/xfs_healthmon.c |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)
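
For illustration, a caller such as xfs_healer might wrap the new ioctl
roughly as follows.  This is a sketch under assumptions: the structure and
ioctl number are copied locally from this patch because they are not in any
released header, and healer_check_samefs is a hypothetical name, not a
function from xfsprogs.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local copy of the structure proposed in this patch. */
struct xfs_health_samefs {
	int32_t		fd;
	uint32_t	flags;	/* zero for now */
};

/* Local copy of the ioctl number proposed in this patch. */
#define XFS_IOC_HEALTH_SAMEFS	_IOW('X', 69, struct xfs_health_samefs)

/*
 * Hypothetical helper: ask the health monitor open on mon_fd whether
 * mnt_fd refers to the filesystem it is monitoring.  Returns 0 if so,
 * or a negative errno (for example -ESTALE on a mismatch).
 */
static int healer_check_samefs(int mon_fd, int mnt_fd)
{
	struct xfs_health_samefs hms = {
		.fd	= mnt_fd,
		.flags	= 0,
	};

	if (ioctl(mon_fd, XFS_IOC_HEALTH_SAMEFS, &hms) < 0)
		return -errno;
	return 0;
}
```

The intended pattern is to reopen the mountpoint, run this check, and
proceed with repairs only when it returns 0.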


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index a96f11d9bd9c64..2b82535196cdb0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1147,6 +1147,15 @@ struct xfs_health_monitor {
 /* Initial return format version */
 #define XFS_HEALTH_MONITOR_FMT_V0	(0)
 
+/*
+ * Check that a given fd points to the same filesystem that the health monitor
+ * is monitoring.
+ */
+struct xfs_health_samefs {
+	__s32		fd;
+	__u32		flags;	/* zero for now */
+};
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1187,6 +1196,7 @@ struct xfs_health_monitor {
 #define XFS_IOC_SCRUBV_METADATA	_IOWR('X', 64, struct xfs_scrub_vec_head)
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
+#define XFS_IOC_HEALTH_SAMEFS	_IOW ('X', 69, struct xfs_health_samefs)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d3784073494ec6..9752b058978995 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -1164,6 +1164,36 @@ xfs_healthmon_reconfigure(
 	return 0;
 }
 
+/* Does the fd point to the same filesystem as the one we're monitoring? */
+STATIC long
+xfs_healthmon_samefs(
+	struct file			*file,
+	unsigned int			cmd,
+	void __user			*arg)
+{
+	struct xfs_health_samefs	hms;
+	struct xfs_healthmon		*hm = file->private_data;
+	struct inode			*hms_inode;
+	int				ret = 0;
+
+	if (copy_from_user(&hms, arg, sizeof(hms)))
+		return -EFAULT;
+
+	if (hms.flags)
+		return -EINVAL;
+
+	CLASS(fd, hms_fd)(hms.fd);
+	if (fd_empty(hms_fd))
+		return -EBADF;
+
+	hms_inode = file_inode(fd_file(hms_fd));
+	mutex_lock(&hm->lock);
+	if (!hm->mp || hm->mp->m_super != hms_inode->i_sb)
+		ret = -ESTALE;
+	mutex_unlock(&hm->lock);
+	return ret;
+}
+
 /* Handle ioctls for the health monitoring thread. */
 STATIC long
 xfs_healthmon_ioctl(
@@ -1176,6 +1206,8 @@ xfs_healthmon_ioctl(
 	switch (cmd) {
 	case XFS_IOC_HEALTH_MONITOR:
 		return xfs_healthmon_reconfigure(file, cmd, arg);
+	case XFS_IOC_HEALTH_SAMEFS:
+		return xfs_healthmon_samefs(file, cmd, arg);
 	default:
 		break;
 	}



* [PATCH 18/22] xfs: add media error reporting ioctl
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (16 preceding siblings ...)
  2025-11-05  0:52   ` [PATCH 17/22] xfs: validate fds against running healthmon Darrick J. Wong
@ 2025-11-05  0:53   ` Darrick J. Wong
  2025-11-05  0:53   ` [PATCH 19/22] xfs: send uevents when major filesystem events happen Darrick J. Wong
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:53 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add a new privileged ioctl so that xfs_scrub can report media errors to
the kernel for further processing.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h      |   16 ++++
 fs/xfs/xfs_notify_failure.h |    8 ++
 fs/xfs/xfs_trace.h          |    2 
 fs/xfs/Makefile             |    6 -
 fs/xfs/xfs_healthmon.c      |    2 
 fs/xfs/xfs_ioctl.c          |    3 +
 fs/xfs/xfs_notify_failure.c |  187 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 213 insertions(+), 11 deletions(-)
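
To make the argument layout concrete, here is a sketch of how a caller might
fill in struct xfs_media_error for a byte range on one of the devices.  The
structure layout and device codes are copied locally from this patch.
media_error_fill is a hypothetical helper.  Widening the range outward to
512-byte boundaries and rejecting zero-length ranges are choices made for
this sketch, not behavior confirmed by the kernel side.

```c
#include <stdint.h>
#include <string.h>

/* Local copies of the definitions proposed in this patch. */
struct xfs_media_error {
	uint64_t	flags;		/* flags */
	uint64_t	daddr;		/* disk address of range */
	uint64_t	bbcount;	/* length, in 512b blocks */
	uint64_t	pad;		/* zero */
};

#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)	/* bottom byte is the device code */

/*
 * Hypothetical helper: describe the byte range [offset, offset + length)
 * on the given device.  Returns 0 on success, -1 on bad input.
 */
static int media_error_fill(struct xfs_media_error *me, unsigned int devcode,
		uint64_t offset, uint64_t length)
{
	uint64_t	end = offset + length;
	uint64_t	end_bb;

	if (devcode < XFS_MEDIA_ERROR_DATADEV ||
	    devcode > XFS_MEDIA_ERROR_RTDEV)
		return -1;
	if (length == 0 || end < offset)	/* empty range or overflow */
		return -1;

	/* Round the end up to the next 512-byte block without overflowing. */
	end_bb = (end >> 9) + ((end & 511) ? 1 : 0);

	memset(me, 0, sizeof(*me));
	me->flags = devcode & XFS_MEDIA_ERROR_DEVMASK;
	me->daddr = offset >> 9;
	me->bbcount = end_bb - me->daddr;
	return 0;
}
```

The filled structure would then be passed to XFS_IOC_MEDIA_ERROR on an fd
for the mounted filesystem by a suitably privileged caller.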


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2b82535196cdb0..65fcc94ed9b40c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1156,6 +1156,21 @@ struct xfs_health_samefs {
 	__u32		flags;	/* zero for now */
 };
 
+/* Report a media error */
+struct xfs_media_error {
+	__u64	flags;		/* flags */
+	__u64	daddr;		/* disk address of range */
+	__u64	bbcount;	/* length, in 512b blocks */
+	__u64	pad;		/* zero */
+};
+
+#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
+#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
+#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
+
+/* bottom byte of flags is the device code */
+#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)
+
 /*
  * ioctl commands that are used by Linux filesystems
  */
@@ -1197,6 +1212,7 @@ struct xfs_health_samefs {
 #define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
 #define XFS_IOC_HEALTH_MONITOR	_IOW ('X', 68, struct xfs_health_monitor)
 #define XFS_IOC_HEALTH_SAMEFS	_IOW ('X', 69, struct xfs_health_samefs)
+#define XFS_IOC_MEDIA_ERROR	_IOW ('X', 70, struct xfs_media_error)
 
 /*
  * ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 2695732ec20875..279f9329a4d5f3 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -6,7 +6,9 @@
 #ifndef __XFS_NOTIFY_FAILURE_H__
 #define __XFS_NOTIFY_FAILURE_H__
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 extern const struct dax_holder_operations xfs_dax_holder_operations;
+#endif
 
 enum xfs_failed_device {
 	XFS_FAILED_DATADEV,
@@ -14,7 +16,7 @@ enum xfs_failed_device {
 	XFS_FAILED_RTDEV,
 };
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+#if defined(CONFIG_XFS_LIVE_HOOKS)
 struct xfs_media_error_params {
 	struct xfs_mount		*mp;
 	enum xfs_failed_device		fdev;
@@ -41,4 +43,8 @@ struct xfs_media_error_hook { };
 # define xfs_media_error_hook_setup(...)	((void)0)
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+struct xfs_media_error;
+int xfs_ioc_media_error(struct xfs_mount *mp,
+		struct xfs_media_error __user *arg);
+
 #endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d1836583d4dfbb..e5d95add53d347 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6213,7 +6213,6 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->lost_prev)
 );
 
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 TRACE_EVENT(xfs_healthmon_media_error_hook,
 	TP_PROTO(const struct xfs_media_error_params *p,
 		 unsigned int events, unsigned long long lost_prev),
@@ -6261,7 +6260,6 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->events,
 		  __entry->lost_prev)
 );
-#endif
 
 #define XFS_FILE_IOERROR_STRINGS \
 	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d4e9070a9326ba..2279cb0b874814 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -98,6 +98,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_notify_failure.o \
 				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
@@ -148,11 +149,6 @@ xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
 
-# notify failure
-ifeq ($(CONFIG_MEMORY_FAILURE),y)
-xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
-endif
-
 xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9752b058978995..e5715f52f4b218 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -523,7 +523,6 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 /* Add a media error event to the reporting queue. */
 STATIC int
 xfs_healthmon_media_error_hook(
@@ -574,7 +573,6 @@ xfs_healthmon_media_error_hook(
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
-#endif
 
 /* Add a file io error event to the reporting queue. */
 STATIC int
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 08998d84554f09..7a80a6ad4b2d99 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -42,6 +42,7 @@
 #include "xfs_handle.h"
 #include "xfs_rtgroup.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/mount.h>
 #include <linux/fileattr.h>
@@ -1424,6 +1425,8 @@ xfs_file_ioctl(
 
 	case XFS_IOC_HEALTH_MONITOR:
 		return xfs_ioc_health_monitor(mp, arg);
+	case XFS_IOC_MEDIA_ERROR:
+		return xfs_ioc_media_error(mp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 8766d83385ddad..bf6e1865d5c3a5 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -76,9 +76,19 @@ xfs_media_error_hook_setup(
 	xfs_hook_setup(&hook->error_hook, mod_fn);
 }
 #else
-# define xfs_media_error_hook(...)		((void)0)
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	enum xfs_failed_device		fdev,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	bool				pre_remove)
+{
+	/* empty */
+}
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
 	xfs_extlen_t		blockcount;
@@ -447,3 +457,178 @@ xfs_dax_notify_failure(
 const struct dax_holder_operations xfs_dax_holder_operations = {
 	.notify_failure		= xfs_dax_notify_failure,
 };
+#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
+struct xfs_group_data_lost {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+};
+
+static int
+xfs_report_one_data_lost(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*data)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip;
+	struct xfs_group_data_lost	*lost = data;
+	xfs_fileoff_t			fileoff = rec->rm_offset;
+	xfs_extlen_t			blocks = rec->rm_blockcount;
+	const xfs_agblock_t		lost_end =
+			lost->startblock + lost->blockcount;
+	const xfs_agblock_t		rmap_end =
+			rec->rm_startblock + rec->rm_blockcount;
+	int				error = 0;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)))
+		return 0;
+
+	error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, 0, 0, &ip);
+	if (error)
+		return 0;
+
+	if (lost->startblock > rec->rm_startblock) {
+		fileoff += lost->startblock - rec->rm_startblock;
+		blocks -= lost->startblock - rec->rm_startblock;
+	}
+	if (rmap_end > lost_end)
+		blocks -= rmap_end - lost_end;
+
+	xfs_inode_media_error(ip, XFS_FSB_TO_B(mp, fileoff),
+			XFS_FSB_TO_B(mp, blocks));
+
+	xfs_irele(ip);
+	return 0;
+}
+
+static int
+xfs_report_data_lost(
+	struct xfs_mount	*mp,
+	enum xfs_group_type	type,
+	xfs_daddr_t		daddr,
+	u64			bblen)
+{
+	struct xfs_group	*xg = NULL;
+	struct xfs_trans	*tp;
+	xfs_fsblock_t		start_bno, end_bno;
+	uint32_t		start_gno, end_gno;
+	int			error;
+
+	if (type == XG_TYPE_RTG) {
+		start_bno = xfs_daddr_to_rtb(mp, daddr);
+		end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
+	} else {
+		start_bno = XFS_DADDR_TO_FSB(mp, daddr);
+		end_bno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
+	}
+
+	tp = xfs_trans_alloc_empty(mp);
+	start_gno = xfs_fsb_to_gno(mp, start_bno, type);
+	end_gno = xfs_fsb_to_gno(mp, end_bno, type);
+	while ((xg = xfs_group_next_range(mp, xg, start_gno, end_gno, type))) {
+		struct xfs_buf		*agf_bp = NULL;
+		struct xfs_rtgroup	*rtg = NULL;
+		struct xfs_btree_cur	*cur;
+		struct xfs_rmap_irec	ri_low = { };
+		struct xfs_rmap_irec	ri_high;
+		struct xfs_group_data_lost lost;
+
+		if (type == XG_TYPE_AG) {
+			struct xfs_perag	*pag = to_perag(xg);
+
+			error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+			if (error) {
+				xfs_perag_put(pag);
+				break;
+			}
+
+			cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, pag);
+		} else {
+			rtg = to_rtg(xg);
+			xfs_rtgroup_lock(rtg, XFS_RTGLOCK_RMAP);
+			cur = xfs_rtrmapbt_init_cursor(tp, rtg);
+		}
+
+		/*
+		 * Set the rmap range from ri_low to ri_high, which represents
+		 * a [start, end] range that we search for files or metadata.
+		 */
+		memset(&ri_high, 0xFF, sizeof(ri_high));
+		if (xg->xg_gno == start_gno)
+			ri_low.rm_startblock =
+				xfs_fsb_to_gbno(mp, start_bno, type);
+		if (xg->xg_gno == end_gno)
+			ri_high.rm_startblock =
+				xfs_fsb_to_gbno(mp, end_bno, type);
+
+		lost.startblock = ri_low.rm_startblock;
+		lost.blockcount = min(xg->xg_block_count,
+				      ri_high.rm_startblock + 1) -
+							ri_low.rm_startblock;
+
+		error = xfs_rmap_query_range(cur, &ri_low, &ri_high,
+				xfs_report_one_data_lost, &lost);
+		xfs_btree_del_cursor(cur, error);
+		if (agf_bp)
+			xfs_trans_brelse(tp, agf_bp);
+		if (rtg)
+			xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_RMAP);
+		if (error) {
+			xfs_group_put(xg);
+			break;
+		}
+	}
+
+	xfs_trans_cancel(tp);
+	return 0;
+}
+
+/* only the device code in the bottom byte of flags is defined so far */
+#define XFS_VALID_MEDIA_ERROR_FLAGS	(XFS_MEDIA_ERROR_DEVMASK)
+
+int
+xfs_ioc_media_error(
+	struct xfs_mount		*mp,
+	struct xfs_media_error __user	*arg)
+{
+	struct xfs_media_error		me;
+	enum xfs_failed_device		fdev;
+	enum xfs_group_type		type;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&me, arg, sizeof(me)))
+		return -EFAULT;
+
+	if (me.pad)
+		return -EINVAL;
+	if (me.flags & ~XFS_VALID_MEDIA_ERROR_FLAGS)
+		return -EINVAL;
+
+	switch (me.flags & XFS_MEDIA_ERROR_DEVMASK) {
+	case XFS_MEDIA_ERROR_DATADEV:
+		fdev = XFS_FAILED_DATADEV;
+		type = XG_TYPE_AG;
+		break;
+	case XFS_MEDIA_ERROR_LOGDEV:
+		fdev = XFS_FAILED_LOGDEV;
+		type = -1;
+		break;
+	case XFS_MEDIA_ERROR_RTDEV:
+		fdev = XFS_FAILED_RTDEV;
+		type = XG_TYPE_RTG;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	xfs_media_error_hook(mp, fdev, me.daddr, me.bbcount, false);
+
+	if (xfs_has_rmapbt(mp) && fdev != XFS_FAILED_LOGDEV)
+		return xfs_report_data_lost(mp, type, me.daddr, me.bbcount);
+
+	return 0;
+}

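The range trimming in xfs_report_one_data_lost() is easy to get wrong by
one block, so here is the same arithmetic as a standalone sketch.  The
type and function names are made up for illustration, plain integers
stand in for xfs_agblock_t, xfs_extlen_t and xfs_fileoff_t, and the
sketch assumes that the rmap record overlaps the lost range, which the
range query is expected to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Result of trimming one rmap record to the lost range (hypothetical type). */
struct lost_file_range {
	uint64_t	fileoff;	/* file offset of the damage, in blocks */
	uint32_t	blocks;		/* length of the damage, in blocks */
};

/*
 * Trim an rmap record (rm_startblock, rm_blockcount, rm_offset) so that it
 * only describes the part that falls inside [lost_start, lost_start +
 * lost_count).  All values are in filesystem blocks.
 */
static struct lost_file_range
clamp_rmap_to_lost(
	uint32_t		rm_startblock,
	uint32_t		rm_blockcount,
	uint64_t		rm_offset,
	uint32_t		lost_start,
	uint32_t		lost_count)
{
	struct lost_file_range	ret = {
		.fileoff	= rm_offset,
		.blocks		= rm_blockcount,
	};
	const uint32_t		lost_end = lost_start + lost_count;
	const uint32_t		rmap_end = rm_startblock + rm_blockcount;

	/* The caller must only pass records that overlap the lost range. */
	assert(rm_startblock < lost_end && lost_start < rmap_end);

	/* Damage starts partway into the mapping: skip the intact head. */
	if (lost_start > rm_startblock) {
		ret.fileoff += lost_start - rm_startblock;
		ret.blocks -= lost_start - rm_startblock;
	}

	/* Mapping extends past the damage: drop the intact tail. */
	if (rmap_end > lost_end)
		ret.blocks -= rmap_end - lost_end;

	return ret;
}
```

For example, a ten block mapping at block 100 with file offset 50 that is
hit by a three block loss starting at block 104 yields file offset 54 and
a length of three blocks.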


* [PATCH 19/22] xfs: send uevents when major filesystem events happen
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (17 preceding siblings ...)
  2025-11-05  0:53   ` [PATCH 18/22] xfs: add media error reporting ioctl Darrick J. Wong
@ 2025-11-05  0:53   ` Darrick J. Wong
  2025-11-05  0:53   ` [PATCH 20/22] xfs: merge health monitoring events when possible Darrick J. Wong
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:53 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Send uevents when we mount, unmount, and shut down the filesystem, so
that we can trigger systemd services when major events happen.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_super.h |   13 +++++++
 fs/xfs/xfs_fsops.c |   18 ++++++++++
 fs/xfs/xfs_super.c |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 125 insertions(+)

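The string packing in xfs_format_uevent_strings() can be tried out in
userspace.  The sketch below is only an approximation of that function:
pack_uevent_env() is a hypothetical name, it formats two variables
instead of three, and it takes plain strings because the kernel's %pU
format is not available in libc.  Each variable is written back to back
into one buffer, separated by its NUL terminator, and the env array
receives a pointer to the start of each one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Pack "SID=..." and "UUID=..." into buf as consecutive NUL-terminated
 * strings and point env[0] and env[1] at them.  env[2] is set to NULL to
 * terminate the list.  Returns 0 on success or -1 if buf is too small.
 */
static int
pack_uevent_env(
	char		*buf,
	size_t		buflen,
	char		**env,
	const char	*sid,
	const char	*uuid)
{
	int		written;

	assert(buf != NULL && env != NULL);

	written = snprintf(buf, buflen, "SID=%s", sid);
	if (written < 0 || (size_t)written >= buflen)
		return -1;

	*env = buf;
	env++;
	buf += written + 1;		/* step over the NUL terminator */
	buflen -= written + 1;

	written = snprintf(buf, buflen, "UUID=%s", uuid);
	if (written < 0 || (size_t)written >= buflen)
		return -1;

	*env = buf;
	env++;
	*env = NULL;
	return 0;
}
```

Note that the caller has to pass the address of the first unused slot in
its env array, because a NULL entry anywhere before the formatted
strings would end the list early.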

diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index c0e85c1e42f27d..6d428bd04a0248 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -101,4 +101,17 @@ extern struct workqueue_struct *xfs_discard_wq;
 
 struct dentry *xfs_debugfs_mkdir(const char *name, struct dentry *parent);
 
+#define XFS_UEVENT_BUFLEN ( \
+	sizeof("SID=") + sizeof_field(struct super_block, s_id) + \
+	sizeof("UUID=") + UUID_STRING_LEN + \
+	sizeof("META_UUID=") + UUID_STRING_LEN)
+
+#define XFS_UEVENT_STR_PTRS \
+	NULL, /* sid */ \
+	NULL, /* uuid */ \
+	NULL /* metauuid */
+
+int xfs_format_uevent_strings(struct xfs_mount *mp, char *buf, ssize_t buflen,
+		char **env);
+
 #endif	/* __XFS_SUPER_H__ */
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 26ed16e67410d7..0b6b178cb8169a 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -522,6 +522,23 @@ xfs_shutdown_hook_setup(
 # define xfs_shutdown_hook(...)		((void)0)
 #endif /* CONFIG_XFS_LIVE_HOOKS */
 
+static void
+xfs_send_shutdown_uevent(
+	struct xfs_mount	*mp)
+{
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=shutdown",
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_OFFLINE, env);
+}
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -572,6 +589,7 @@ xfs_do_force_shutdown(
 	}
 
 	trace_xfs_force_shutdown(mp, tag, flags, fname, lnnum);
+	xfs_send_shutdown_uevent(mp);
 
 	xfs_alert_tag(mp, tag,
 "%s (0x%x) detected at %pS (%s:%d).  Shutting down filesystem.",
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 54d82f5a5b8863..bfd12ccaa707a8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -53,6 +53,7 @@
 #include <linux/magic.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/uuid.h>
 
 static const struct super_operations xfs_super_operations;
 
@@ -1244,12 +1245,73 @@ xfs_inodegc_free_percpu(
 	free_percpu(mp->m_inodegc);
 }
 
+int
+xfs_format_uevent_strings(
+	struct xfs_mount	*mp,
+	char			*buf,
+	ssize_t			buflen,
+	char			**env)
+{
+	ssize_t			written;
+
+	ASSERT(buflen >= XFS_UEVENT_BUFLEN);
+
+	written = snprintf(buf, buflen, "SID=%s", mp->m_super->s_id);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	written = snprintf(buf, buflen, "UUID=%pU", &mp->m_sb.sb_uuid);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	written = snprintf(buf, buflen, "META_UUID=%pU",
+			&mp->m_sb.sb_meta_uuid);
+	if (written >= buflen)
+		return -EINVAL;
+
+	*env = buf;
+	env++;
+	buf += written + 1;
+	buflen -= written + 1;
+
+	ASSERT(buflen >= 0);
+	return 0;
+}
+
+static void
+xfs_send_unmount_uevent(
+	struct xfs_mount	*mp)
+{
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=unmount",
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	xfs_send_unmount_uevent(mp);
 	xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1667,6 +1729,37 @@ xfs_debugfs_mkdir(
 	return child;
 }
 
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+	struct fs_context	*fc,
+	struct xfs_mount	*mp)
+{
+	char			*source;
+	char			buf[XFS_UEVENT_BUFLEN];
+	char			*env[] = {
+		"TYPE=mount",
+		NULL, /* source */
+		XFS_UEVENT_STR_PTRS,
+		NULL,
+	};
+	int			error;
+
+	source = kasprintf(GFP_KERNEL, "SOURCE=%s", fc->source);
+	if (!source)
+		return;
+	env[1] = source;
+
+	error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[2]);
+	if (!error)
+		kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+
+	kfree(source);
+}
+
 static int
 xfs_fs_fill_super(
 	struct super_block	*sb,
@@ -1980,6 +2073,7 @@ xfs_fs_fill_super(
 		mp->m_debugfs_uuid = NULL;
 	}
 
+	xfs_send_mount_uevent(fc, mp);
 	return 0;
 
  out_filestream_unmount:



* [PATCH 20/22] xfs: merge health monitoring events when possible
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (18 preceding siblings ...)
  2025-11-05  0:53   ` [PATCH 19/22] xfs: send uevents when major filesystem events happen Darrick J. Wong
@ 2025-11-05  0:53   ` Darrick J. Wong
  2025-11-05  0:53   ` [PATCH 21/22] xfs: restrict healthmon users further Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process Darrick J. Wong
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:53 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Reduce memory consumption and event traffic by merging each new
healthmon event into the event at the head or tail of the queue when
the two are compatible, instead of always queuing a separate object.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_trace.h     |    1 
 fs/xfs/xfs_healthmon.c |  107 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e5d95add53d347..520526ef9cd11c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6143,6 +6143,7 @@ DEFINE_EVENT(xfs_healthmon_event_class, name, \
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
 	TP_ARGS(mp, event))
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_insert);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_merge);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
 DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index e5715f52f4b218..b46b63e62d5143 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -143,12 +143,112 @@ xfs_healthmon_free_head(
 	return 0;
 }
 
+static bool
+xfs_healthmon_merge_events(
+	struct xfs_healthmon_event		*existing,
+	const struct xfs_healthmon_event	*new)
+{
+	if (!existing)
+		return false;
+
+	/* type and domain must match to merge events */
+	if (existing->type != new->type ||
+	    existing->domain != new->domain)
+		return false;
+
+	switch (existing->type) {
+	case XFS_HEALTHMON_RUNNING:
+	case XFS_HEALTHMON_UNMOUNT:
+		/* should only ever be one of these events anyway */
+		return false;
+
+	case XFS_HEALTHMON_LOST:
+		existing->lostcount += new->lostcount;
+		return true;
+
+	case XFS_HEALTHMON_SHUTDOWN:
+		/* yes, we can race to shutdown */
+		existing->flags |= new->flags;
+		return true;
+
+	case XFS_HEALTHMON_SICK:
+	case XFS_HEALTHMON_CORRUPT:
+	case XFS_HEALTHMON_HEALTHY:
+		switch (existing->domain) {
+		case XFS_HEALTHMON_FS:
+			existing->fsmask |= new->fsmask;
+			return true;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			if (existing->group == new->group) {
+				existing->grpmask |= new->grpmask;
+				return true;
+			}
+			return false;
+		case XFS_HEALTHMON_INODE:
+			if (existing->ino == new->ino &&
+			    existing->gen == new->gen) {
+				existing->imask |= new->imask;
+				return true;
+			}
+			return false;
+		default:
+			ASSERT(0);
+			return false;
+		}
+		return false;
+
+	case XFS_HEALTHMON_MEDIA_ERROR:
+		/* physically adjacent errors can merge */
+		if (existing->daddr + existing->bbcount == new->daddr) {
+			existing->bbcount += new->bbcount;
+			return true;
+		}
+		if (new->daddr + new->bbcount == existing->daddr) {
+			existing->daddr = new->daddr;
+			existing->bbcount += new->bbcount;
+			return true;
+		}
+		return false;
+
+	case XFS_HEALTHMON_BUFREAD:
+	case XFS_HEALTHMON_BUFWRITE:
+	case XFS_HEALTHMON_DIOREAD:
+	case XFS_HEALTHMON_DIOWRITE:
+	case XFS_HEALTHMON_DATALOST:
+		/* logically adjacent file ranges can merge */
+		if (existing->fino != new->fino || existing->fgen != new->fgen)
+			return false;
+
+		if (existing->fpos + existing->flen == new->fpos) {
+			existing->flen += new->flen;
+			return true;
+		}
+
+		if (new->fpos + new->flen == existing->fpos) {
+			existing->fpos = new->fpos;
+			existing->flen += new->flen;
+			return true;
+		}
+		return false;
+	}
+
+	return false;
+}
+
 /* Insert an event onto the start of the list. */
 static inline void
 __xfs_healthmon_insert(
 	struct xfs_healthmon		*hm,
 	struct xfs_healthmon_event	*event)
 {
+	if (xfs_healthmon_merge_events(hm->first_event, event)) {
+		trace_xfs_healthmon_merge(hm->mp, hm->first_event);
+		kfree(event);
+		wake_up(&hm->wait);
+		return;
+	}
+
 	event->next = hm->first_event;
 	if (!hm->first_event)
 		hm->first_event = event;
@@ -166,6 +266,13 @@ __xfs_healthmon_push(
 	struct xfs_healthmon		*hm,
 	struct xfs_healthmon_event	*event)
 {
+	if (xfs_healthmon_merge_events(hm->last_event, event)) {
+		trace_xfs_healthmon_merge(hm->mp, hm->last_event);
+		kfree(event);
+		wake_up(&hm->wait);
+		return;
+	}
+
 	if (!hm->first_event)
 		hm->first_event = event;
 	if (hm->last_event)



* [PATCH 21/22] xfs: restrict healthmon users further
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (19 preceding siblings ...)
  2025-11-05  0:53   ` [PATCH 20/22] xfs: merge health monitoring events when possible Darrick J. Wong
@ 2025-11-05  0:53   ` Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process Darrick J. Wong
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:53 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Because health monitoring events include file handles and deep
information about the filesystem structure, restrict usage of healthmon
to processes that open the root directory of the filesystem and run in
the initial user namespace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.h |    4 ++--
 fs/xfs/xfs_healthmon.c |    9 ++++++++-
 fs/xfs/xfs_ioctl.c     |    2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 1ce49197262b1c..6b650ab0c92238 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -99,10 +99,10 @@ struct xfs_healthmon_event {
 };
 
 #ifdef CONFIG_XFS_HEALTH_MONITOR
-long xfs_ioc_health_monitor(struct xfs_mount *mp,
+long xfs_ioc_health_monitor(struct file *file,
 		struct xfs_health_monitor __user *arg);
 #else
-# define xfs_ioc_health_monitor(mp, hmo)	(-ENOTTY)
+# define xfs_ioc_health_monitor(file, hmo)	(-ENOTTY)
 #endif /* CONFIG_XFS_HEALTH_MONITOR */
 
 #endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index b46b63e62d5143..a8ea6483ca98fb 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -1337,18 +1337,25 @@ static const struct file_operations xfs_healthmon_fops = {
  */
 long
 xfs_ioc_health_monitor(
-	struct xfs_mount		*mp,
+	struct file			*file,
 	struct xfs_health_monitor __user *arg)
 {
 	struct xfs_health_monitor	hmo;
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
+	struct xfs_inode		*ip = XFS_I(file_inode(file));
+	struct xfs_mount		*mp = ip->i_mount;
 	int				fd;
 	int				ret;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	if (ip->i_ino != mp->m_sb.sb_rootino)
+		return -EPERM;
+	if (current_user_ns() != &init_user_ns)
+		return -EPERM;
+
 	if (copy_from_user(&hmo, arg, sizeof(hmo)))
 		return -EFAULT;
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7a80a6ad4b2d99..6c3eecabf09908 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1424,7 +1424,7 @@ xfs_file_ioctl(
 		return xfs_ioc_commit_range(filp, arg);
 
 	case XFS_IOC_HEALTH_MONITOR:
-		return xfs_ioc_health_monitor(mp, arg);
+		return xfs_ioc_health_monitor(filp, arg);
 	case XFS_IOC_MEDIA_ERROR:
 		return xfs_ioc_media_error(mp, arg);
 



* [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
                     ` (20 preceding siblings ...)
  2025-11-05  0:53   ` [PATCH 21/22] xfs: restrict healthmon users further Darrick J. Wong
@ 2025-11-05  0:54   ` Darrick J. Wong
  21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:54 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Event objects are created in the context of whichever process
experienced the health event, which means that we currently charge that
process' memory cgroup controller for that object.  This isn't entirely
fair to that process, because it's being charged for memory that solely
benefits whatever's using the healthmon fd (xfs_healer).

Therefore, save the memcg that was in place when the healthmon fd was
created (which we assume is xfs_healer) and make sure the objects are
charged to that memcg.  This also enables sysadmins to constrain the
kernel memory usage of xfs_healer through memcgs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_healthmon.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

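The save/override/restore pattern used in the hooks can be modelled in
userspace.  This is a toy model only: struct account, active_account and
the helpers are invented for illustration and have nothing to do with
the real memcg implementation.  It shows the intended effect, namely
that the listener pays for the event object and the process that hit the
error gets its own accounting context back afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup: just a byte counter. */
struct account {
	size_t		charged;
};

/* Whoever is charged for allocations right now (toy global state). */
static struct account	*active_account;

/* Switch the active account and return the previous one. */
static struct account *
set_active_account(
	struct account	*acct)
{
	struct account	*old = active_account;

	active_account = acct;
	return old;
}

/* Pretend to allocate: bill the currently active account. */
static void
charge_alloc(
	size_t		bytes)
{
	if (active_account)
		active_account->charged += bytes;
}

/* Queue an event on behalf of the listener, then restore the old account. */
static void
queue_event_for(
	struct account	*listener,
	size_t		event_size)
{
	struct account	*old;

	assert(listener != NULL);

	old = set_active_account(listener);
	charge_alloc(event_size);
	set_active_account(old);
}
```

The important property is that every exit path restores the previous
account, which is why the patch puts the restore under the shared
out_unlock label in each hook.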

diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index a8ea6483ca98fb..def4de5f6bc543 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -67,6 +67,9 @@ struct xfs_healthmon {
 	struct xfs_healthmon_event	*first_event;
 	struct xfs_healthmon_event	*last_event;
 
+	/* charge event object usage to this memory cgroup */
+	struct mem_cgroup		*memcg;
+
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
@@ -500,6 +503,7 @@ xfs_healthmon_metadata_hook(
 	struct xfs_health_update_params	*hup = data;
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
+	struct mem_cgroup		*old_memcg;
 	enum xfs_health_update_type	type = action;
 	unsigned int			mask = 0;
 	int				error;
@@ -511,6 +515,7 @@ xfs_healthmon_metadata_hook(
 		return NOTIFY_DONE;
 
 	mutex_lock(&hm->lock);
+	old_memcg = set_active_memcg(hm->memcg);
 
 	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
 			hm->lost_prev_event);
@@ -586,6 +591,7 @@ xfs_healthmon_metadata_hook(
 		goto out_event;
 
 out_unlock:
+	set_active_memcg(old_memcg);
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 out_event:
@@ -602,11 +608,13 @@ xfs_healthmon_shutdown_hook(
 {
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
+	struct mem_cgroup		*old_memcg;
 	int				error;
 
 	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
 
 	mutex_lock(&hm->lock);
+	old_memcg = set_active_memcg(hm->memcg);
 
 	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
 			hm->lost_prev_event);
@@ -626,6 +634,7 @@ xfs_healthmon_shutdown_hook(
 		kfree(event);
 
 out_unlock:
+	set_active_memcg(old_memcg);
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
@@ -640,12 +649,14 @@ xfs_healthmon_media_error_hook(
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
 	struct xfs_media_error_params	*p = data;
+	struct mem_cgroup		*old_memcg;
 	enum xfs_healthmon_domain	domain = 0; /* shut up gcc */
 	int				error;
 
 	hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
 
 	mutex_lock(&hm->lock);
+	old_memcg = set_active_memcg(hm->memcg);
 
 	trace_xfs_healthmon_media_error_hook(p, hm->events,
 			hm->lost_prev_event);
@@ -677,6 +688,7 @@ xfs_healthmon_media_error_hook(
 		kfree(event);
 
 out_unlock:
+	set_active_memcg(old_memcg);
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
@@ -691,6 +703,7 @@ xfs_healthmon_file_ioerror_hook(
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
 	struct xfs_file_ioerror_params	*p = data;
+	struct mem_cgroup		*old_memcg;
 	enum xfs_healthmon_type		type = 0;
 	int				error;
 
@@ -709,6 +722,7 @@ xfs_healthmon_file_ioerror_hook(
 	}
 
 	mutex_lock(&hm->lock);
+	old_memcg = set_active_memcg(hm->memcg);
 
 	trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
 			hm->lost_prev_event);
@@ -748,6 +762,7 @@ xfs_healthmon_file_ioerror_hook(
 		kfree(event);
 
 out_unlock:
+	set_active_memcg(old_memcg);
 	mutex_unlock(&hm->lock);
 	return NOTIFY_DONE;
 }
@@ -1188,6 +1203,7 @@ xfs_healthmon_release(
 	xfs_healthmon_free_events(hm);
 	if (hm->outbuf.size)
 		kfree(hm->outbuf.buffer);
+	mem_cgroup_put(hm->memcg);
 	kfree(hm);
 
 	return 0;
@@ -1367,6 +1383,7 @@ xfs_ioc_health_monitor(
 		return -ENOMEM;
 	hm->mp = mp;
 	hm->format = hmo.format;
+	hm->memcg = get_mem_cgroup_from_mm(current->mm);
 
 	/*
 	 * Since we already got a ref to the module, take a reference to the
@@ -1443,6 +1460,7 @@ xfs_ioc_health_monitor(
 	xfs_health_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
+	mem_cgroup_put(hm->memcg);
 	kfree(hm);
 	return ret;
 }



* [PATCHSET V3 2/2] iomap: generic file IO error reporting
  2025-11-05  0:46 [PATCHBOMB v2 6.19] xfs: autonomous self healing Darrick J. Wong
  2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
@ 2025-11-05  0:48 ` Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 1/6] iomap: report file IO errors to fsnotify Darrick J. Wong
                     ` (5 more replies)
  1 sibling, 6 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:48 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

Hi all,

Refactor the iomap file I/O error handling code so that failures are
reported in a generic way to fsnotify.  Then connect the XFS health
reporting to the same fsnotify path, so that XFS can notify userspace
of all manner of problems.

This series is much more experimental than the main healer patchset,
and I'd rather it not become a blocker for the main patchset.  I
wouldn't mind rebasing if they went in at the same time though.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=iomap-ioerr-reporting
---
Commits in this patchset:
 * iomap: report file IO errors to fsnotify
 * xfs: switch healthmon to use the iomap I/O error reporting
 * xfs: port notify-failure to use the new vfs io error reporting
 * xfs: remove file I/O error hooks
 * iomap: remove I/O error hooks
 * xfs: report fs metadata errors via fsnotify
---
 fs/xfs/xfs_file.h                 |   37 --------
 fs/xfs/xfs_mount.h                |    3 -
 fs/xfs/xfs_trace.h                |   35 ++++---
 include/linux/fs.h                |   68 ++++++++++++++
 include/linux/iomap.h             |    2 
 Documentation/filesystems/vfs.rst |    7 -
 fs/iomap/buffered-io.c            |    8 +-
 fs/iomap/direct-io.c              |    9 +-
 fs/super.c                        |   53 +++++++++++
 fs/xfs/xfs_aops.c                 |    2 
 fs/xfs/xfs_file.c                 |  174 -------------------------------------
 fs/xfs/xfs_health.c               |   12 ++-
 fs/xfs/xfs_healthmon.c            |   51 ++++++-----
 fs/xfs/xfs_notify_failure.c       |    9 +-
 fs/xfs/xfs_super.c                |    1 
 15 files changed, 190 insertions(+), 281 deletions(-)



* [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
@ 2025-11-05  0:54   ` Darrick J. Wong
  2025-11-05 11:00     ` Jan Kara
  2025-11-05  0:54   ` [PATCH 2/6] xfs: switch healthmon to use the iomap I/O error reporting Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:54 UTC (permalink / raw)
  To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

Create a generic hook for iomap filesystems to report IO errors to
fsnotify and in-kernel subsystems that want to know about such things.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/linux/fs.h     |   64 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/iomap/buffered-io.c |    6 +++++
 fs/iomap/direct-io.c   |    5 ++++
 fs/super.c             |   53 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+)

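For readers who have not used notifier chains, the hook/unhook/report
flow can be approximated in userspace as follows.  This is a simplified
analogy under stated assumptions: the chain is a plain singly linked
list with no locking or priorities, the report is delivered
synchronously, and every name except the fs_error_type values is made up
for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fs_error_type {
	FSERR_READAHEAD,
	FSERR_WRITEBACK,
	FSERR_DIO_READ,
	FSERR_DIO_WRITE,
	FSERR_DATA_LOST,
	FSERR_METADATA,
};

/* What a listener gets told about one failure (hypothetical layout). */
struct fs_error_report {
	enum fs_error_type	type;
	int64_t			pos;
	uint64_t		len;
	int			error;
};

struct error_hook {
	void	(*fn)(struct error_hook *h, const struct fs_error_report *r);
	struct error_hook	*next;
};

struct error_chain {
	struct error_hook	*head;
};

/* Add a listener to the front of the chain. */
static void chain_hook(struct error_chain *c, struct error_hook *h)
{
	assert(h->fn != NULL);
	h->next = c->head;
	c->head = h;
}

/* Remove a listener; does nothing if it is not on the chain. */
static void chain_unhook(struct error_chain *c, struct error_hook *h)
{
	struct error_hook	**pp;

	for (pp = &c->head; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			return;
		}
	}
}

/* Tell every listener on the chain about one error. */
static void chain_report(struct error_chain *c, enum fs_error_type type,
		int64_t pos, uint64_t len, int error)
{
	struct fs_error_report	r = { type, pos, len, error };
	struct error_hook	*h;

	for (h = c->head; h; h = h->next)
		h->fn(h, &r);
}

/* Example listener that embeds its hook, container_of style. */
struct error_counter {
	struct error_hook	hook;
	unsigned int		seen;
	int			last_error;
};

static void count_error(struct error_hook *h, const struct fs_error_report *r)
{
	struct error_counter	*ec = (struct error_counter *)
			((char *)h - offsetof(struct error_counter, hook));

	ec->seen++;
	ec->last_error = r->error;
}
```

Embedding the hook in the listener's own structure is what lets the
callback find its private state without a separate lookup, which matches
how struct fs_error_hook wraps a notifier_block in this patch.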

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5e4b3a4b24823f..1cb3965db3275c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -80,6 +80,7 @@ struct fs_context;
 struct fs_parameter_spec;
 struct file_kattr;
 struct iomap_ops;
+struct notifier_head;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1587,6 +1588,7 @@ struct super_block {
 
 	spinlock_t		s_inode_wblist_lock;
 	struct list_head	s_inodes_wb;	/* writeback inodes */
+	struct blocking_notifier_head	s_error_notifier;
 } __randomize_layout;
 
 static inline struct user_namespace *i_user_ns(const struct inode *inode)
@@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
 	return true;
 }
 
+enum fs_error_type {
+	/* pagecache reads and writes */
+	FSERR_READAHEAD,
+	FSERR_WRITEBACK,
+
+	/* directio reads and writes */
+	FSERR_DIO_READ,
+	FSERR_DIO_WRITE,
+
+	/* media error */
+	FSERR_DATA_LOST,
+
+	/* filesystem metadata */
+	FSERR_METADATA,
+};
+
+struct fs_error {
+	struct work_struct work;
+	struct super_block *sb;
+	struct inode *inode;
+	loff_t pos;
+	u64 len;
+	enum fs_error_type type;
+	int error;
+};
+
+struct fs_error_hook {
+	struct notifier_block nb;
+};
+
+static inline int sb_hook_error(struct super_block *sb,
+				struct fs_error_hook *h)
+{
+	return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
+}
+
+static inline void sb_unhook_error(struct super_block *sb,
+				   struct fs_error_hook *h)
+{
+	blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
+}
+
+static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
+{
+	h->nb.notifier_call = fn;
+	h->nb.priority = 0;
+}
+
+void __sb_error(struct super_block *sb, struct inode *inode,
+		enum fs_error_type type, loff_t pos, u64 len, int error);
+
+static inline void sb_error(struct super_block *sb, int error)
+{
+	__sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
+}
+
+static inline void inode_error(struct inode *inode, enum fs_error_type type,
+			       loff_t pos, u64 len, int error)
+{
+	__sb_error(inode->i_sb, inode, type, pos, len, error);
+}
+
 #endif /* _LINUX_FS_H */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8dd5421cb910b5..dc19311fe1c6c0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -291,6 +291,12 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
 inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
 		loff_t pos, u64 len, int error)
 {
+	struct inode *inode = mapping->host;
+
+	inode_error(inode,
+		    direction == READ ? FSERR_READAHEAD : FSERR_WRITEBACK,
+		    pos, len, error);
+
 	if (mapping && mapping->a_ops->ioerror)
 		mapping->a_ops->ioerror(mapping, direction, pos, len,
 				error);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1512d8dbb0d2e7..9f6ce0d9c531bb 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -95,6 +95,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 
 	if (dops && dops->end_io)
 		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+	if (dio->error)
+		inode_error(file_inode(iocb->ki_filp),
+			    (dio->flags & IOMAP_DIO_WRITE) ? FSERR_DIO_WRITE :
+							     FSERR_DIO_READ,
+			    offset, dio->size, dio->error);
 	if (dio->error && dops && dops->ioerror)
 		dops->ioerror(file_inode(iocb->ki_filp),
 				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
diff --git a/fs/super.c b/fs/super.c
index 5bab94fb7e0358..f6d38e4b3d76b2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -363,6 +363,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	spin_lock_init(&s->s_inode_list_lock);
 	INIT_LIST_HEAD(&s->s_inodes_wb);
 	spin_lock_init(&s->s_inode_wblist_lock);
+	BLOCKING_INIT_NOTIFIER_HEAD(&s->s_error_notifier);
 
 	s->s_count = 1;
 	atomic_set(&s->s_active, 1);
@@ -2267,3 +2268,55 @@ int sb_init_dio_done_wq(struct super_block *sb)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sb_init_dio_done_wq);
+
+static void handle_sb_error(struct work_struct *work)
+{
+	struct fs_error *fserr = container_of(work, struct fs_error, work);
+
+	fsnotify_sb_error(fserr->sb, fserr->inode, fserr->error);
+	blocking_notifier_call_chain(&fserr->sb->s_error_notifier, fserr->type,
+				     fserr);
+	iput(fserr->inode);
+	kfree(fserr);
+}
+
+/**
+ * Report a filesystem error.  The actual work is deferred to a workqueue so
+ * that we're always in process context and to avoid blowing out the caller's
+ * stack.
+ *
+ * @sb Filesystem superblock
+ * @inode Inode within filesystem, if applicable
+ * @type Type of error
+ * @pos Start of file range affected, if applicable
+ * @len Length of file range affected, if applicable
+ * @error Error encountered.
+ */
+void __sb_error(struct super_block *sb, struct inode *inode,
+		enum fs_error_type type, loff_t pos, u64 len, int error)
+{
+	struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
+
+	if (!fserr) {
+		printk(KERN_ERR
+ "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
+				inode ? inode->i_ino : 0, type,
+				pos, len, error);
+		return;
+	}
+
+	if (inode) {
+		fserr->sb = inode->i_sb;
+		fserr->inode = igrab(inode);
+	} else {
+		fserr->sb = sb;
+	}
+	fserr->type = type;
+	fserr->pos = pos;
+	fserr->len = len;
+	fserr->error = error;
+	INIT_WORK(&fserr->work, handle_sb_error);
+
+	schedule_work(&fserr->work);
+}
+EXPORT_SYMBOL_GPL(__sb_error);
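
For readers following the hook API above, here is a minimal userspace sketch of how a consumer might attach a callback and receive error reports. It is a model only: the type and function names mirror the patch, but the singly linked list, the simplified callback signature, the `priv` pointer, `struct super_block_model`, `sb_dispatch_error()` and `count_writeback()` are all assumptions made for illustration. In the patch itself the chain is a kernel blocking notifier chain that is called with fserr->type as the action and the fs_error as the data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same event classes as the patch's enum fs_error_type. */
enum fs_error_type {
	FSERR_READAHEAD,
	FSERR_WRITEBACK,
	FSERR_DIO_READ,
	FSERR_DIO_WRITE,
	FSERR_DATA_LOST,
	FSERR_METADATA,
};

/* Trimmed-down event record; the kernel struct also carries sb/inode/work. */
struct fs_error {
	int64_t pos;
	uint64_t len;
	enum fs_error_type type;
	int error;
};

struct fs_error_hook;
typedef int (*fs_error_fn)(struct fs_error_hook *h, const struct fs_error *e);

/* Hypothetical stand-in for the notifier_block wrapped by fs_error_hook. */
struct fs_error_hook {
	fs_error_fn fn;
	struct fs_error_hook *next;
	void *priv;
};

/* Hypothetical stand-in for the s_error_notifier head in struct super_block. */
struct super_block_model {
	struct fs_error_hook *hooks;
};

static void sb_init_error_hook(struct fs_error_hook *h, fs_error_fn fn,
			       void *priv)
{
	h->fn = fn;
	h->next = NULL;
	h->priv = priv;
}

static void sb_hook_error(struct super_block_model *sb,
			  struct fs_error_hook *h)
{
	assert(h->fn != NULL);
	h->next = sb->hooks;
	sb->hooks = h;
}

static void sb_unhook_error(struct super_block_model *sb,
			    struct fs_error_hook *h)
{
	struct fs_error_hook **pp;

	for (pp = &sb->hooks; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			return;
		}
	}
}

/* Call every registered hook; returns how many hooks were invoked. */
static int sb_dispatch_error(struct super_block_model *sb,
			     const struct fs_error *e)
{
	int called = 0;

	for (struct fs_error_hook *h = sb->hooks; h; h = h->next) {
		h->fn(h, e);
		called++;
	}
	return called;
}

/* Example consumer: count writeback failures in the int that priv points to. */
static int count_writeback(struct fs_error_hook *h, const struct fs_error *e)
{
	if (e->type == FSERR_WRITEBACK)
		(*(int *)h->priv)++;
	return 0;
}
```

A consumer would initialize a hook with count_writeback, attach it with sb_hook_error(), and detach it with sb_unhook_error() before the hook's storage goes away.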



* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05  0:54   ` [PATCH 1/6] iomap: report file IO errors to fsnotify Darrick J. Wong
@ 2025-11-05 11:00     ` Jan Kara
  2025-11-05 11:14       ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-11-05 11:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: cem, hch, linux-fsdevel, linux-xfs, amir73il, jack, gabriel

On Tue 04-11-25 16:54:24, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a generic hook for iomap filesystems to report IO errors to
> fsnotify and in-kernel subsystems that want to know about such things.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h     |   64 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/iomap/buffered-io.c |    6 +++++
>  fs/iomap/direct-io.c   |    5 ++++
>  fs/super.c             |   53 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 128 insertions(+)
> 
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5e4b3a4b24823f..1cb3965db3275c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -80,6 +80,7 @@ struct fs_context;
>  struct fs_parameter_spec;
>  struct file_kattr;
>  struct iomap_ops;
> +struct notifier_head;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1587,6 +1588,7 @@ struct super_block {
>  
>  	spinlock_t		s_inode_wblist_lock;
>  	struct list_head	s_inodes_wb;	/* writeback inodes */
> +	struct blocking_notifier_head	s_error_notifier;
>  } __randomize_layout;
>  
>  static inline struct user_namespace *i_user_ns(const struct inode *inode)
> @@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
>  	return true;
>  }
>  
> +enum fs_error_type {
> +	/* pagecache reads and writes */
> +	FSERR_READAHEAD,
> +	FSERR_WRITEBACK,
> +
> +	/* directio read and writes */
> +	FSERR_DIO_READ,
> +	FSERR_DIO_WRITE,
> +
> +	/* media error */
> +	FSERR_DATA_LOST,
> +
> +	/* filesystem metadata */
> +	FSERR_METADATA,
> +};
> +
> +struct fs_error {
> +	struct work_struct work;
> +	struct super_block *sb;
> +	struct inode *inode;
> +	loff_t pos;
> +	u64 len;
> +	enum fs_error_type type;
> +	int error;
> +};
> +
> +struct fs_error_hook {
> +	struct notifier_block nb;
> +};
> +
> +static inline int sb_hook_error(struct super_block *sb,
> +				struct fs_error_hook *h)
> +{
> +	return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
> +}
> +
> +static inline void sb_unhook_error(struct super_block *sb,
> +				   struct fs_error_hook *h)
> +{
> +	blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
> +}
> +
> +static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
> +{
> +	h->nb.notifier_call = fn;
> +	h->nb.priority = 0;
> +}
> +
> +void __sb_error(struct super_block *sb, struct inode *inode,
> +		enum fs_error_type type, loff_t pos, u64 len, int error);
> +
> +static inline void sb_error(struct super_block *sb, int error)
> +{
> +	__sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
> +}
> +
> +static inline void inode_error(struct inode *inode, enum fs_error_type type,
> +			       loff_t pos, u64 len, int error)
> +{
> +	__sb_error(inode->i_sb, inode, type, pos, len, error);
> +}
> +
>  #endif /* _LINUX_FS_H */
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8dd5421cb910b5..dc19311fe1c6c0 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -291,6 +291,12 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
>  inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
>  		loff_t pos, u64 len, int error)
>  {
> +	struct inode *inode = mapping->host;
> +
> +	inode_error(inode,
> +		    direction == READ ? FSERR_READAHEAD : FSERR_WRITEBACK,
> +		    pos, len, error);
> +
>  	if (mapping && mapping->a_ops->ioerror)
>  		mapping->a_ops->ioerror(mapping, direction, pos, len,
>  				error);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 1512d8dbb0d2e7..9f6ce0d9c531bb 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -95,6 +95,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  
>  	if (dops && dops->end_io)
>  		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
> +	if (dio->error)
> +		inode_error(file_inode(iocb->ki_filp),
> +			    (dio->flags & IOMAP_DIO_WRITE) ? FSERR_DIO_WRITE :
> +							     FSERR_DIO_READ,
> +			    offset, dio->size, dio->error);
>  	if (dio->error && dops && dops->ioerror)
>  		dops->ioerror(file_inode(iocb->ki_filp),
>  				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
> diff --git a/fs/super.c b/fs/super.c
> index 5bab94fb7e0358..f6d38e4b3d76b2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -363,6 +363,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
>  	spin_lock_init(&s->s_inode_list_lock);
>  	INIT_LIST_HEAD(&s->s_inodes_wb);
>  	spin_lock_init(&s->s_inode_wblist_lock);
> +	BLOCKING_INIT_NOTIFIER_HEAD(&s->s_error_notifier);
>  
>  	s->s_count = 1;
>  	atomic_set(&s->s_active, 1);
> @@ -2267,3 +2268,55 @@ int sb_init_dio_done_wq(struct super_block *sb)
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(sb_init_dio_done_wq);
> +
> +static void handle_sb_error(struct work_struct *work)
> +{
> +	struct fs_error *fserr = container_of(work, struct fs_error, work);
> +
> +	fsnotify_sb_error(fserr->sb, fserr->inode, fserr->error);
> +	blocking_notifier_call_chain(&fserr->sb->s_error_notifier, fserr->type,
> +				     fserr);
> +	iput(fserr->inode);
> +	kfree(fserr);
> +}
> +
> +/**
> + * Report a filesystem error.  The actual work is deferred to a workqueue so
> + * that we're always in process context and to avoid blowing out the caller's
> + * stack.
> + *
> + * @sb Filesystem superblock
> + * @inode Inode within filesystem, if applicable
> + * @type Type of error
> + * @pos Start of file range affected, if applicable
> + * @len Length of file range affected, if applicable
> + * @error Error encountered.
> + */
> +void __sb_error(struct super_block *sb, struct inode *inode,
> +		enum fs_error_type type, loff_t pos, u64 len, int error)
> +{
> +	struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
> +
> +	if (!fserr) {
> +		printk(KERN_ERR
> + "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
> +				inode ? inode->i_ino : 0, type,
> +				pos, len, error);
> +		return;
> +	}
> +
> +	if (inode) {
> +		fserr->sb = inode->i_sb;
> +		fserr->inode = igrab(inode);
> +	} else {
> +		fserr->sb = sb;
> +	}
> +	fserr->type = type;
> +	fserr->pos = pos;
> +	fserr->len = len;
> +	fserr->error = error;
> +	INIT_WORK(&fserr->work, handle_sb_error);
> +
> +	schedule_work(&fserr->work);
> +}
> +EXPORT_SYMBOL_GPL(__sb_error);
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05 11:00     ` Jan Kara
@ 2025-11-05 11:14       ` Amir Goldstein
  2025-11-05 14:24         ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2025-11-05 11:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Darrick J. Wong, cem, hch, linux-fsdevel, linux-xfs, gabriel,
	Christian Brauner

On Wed, Nov 5, 2025 at 12:00 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 04-11-25 16:54:24, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create a generic hook for iomap filesystems to report IO errors to
> > fsnotify and in-kernel subsystems that want to know about such things.
> >
> > Suggested-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
>
> Looks good to me. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>
>
>                                                                 Honza
>
> > ---
> >  include/linux/fs.h     |   64 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/iomap/buffered-io.c |    6 +++++
> >  fs/iomap/direct-io.c   |    5 ++++
> >  fs/super.c             |   53 ++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 128 insertions(+)
> >
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 5e4b3a4b24823f..1cb3965db3275c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -80,6 +80,7 @@ struct fs_context;
> >  struct fs_parameter_spec;
> >  struct file_kattr;
> >  struct iomap_ops;
> > +struct notifier_head;
> >
> >  extern void __init inode_init(void);
> >  extern void __init inode_init_early(void);
> > @@ -1587,6 +1588,7 @@ struct super_block {
> >
> >       spinlock_t              s_inode_wblist_lock;
> >       struct list_head        s_inodes_wb;    /* writeback inodes */
> > +     struct blocking_notifier_head   s_error_notifier;
> >  } __randomize_layout;
> >
> >  static inline struct user_namespace *i_user_ns(const struct inode *inode)
> > @@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
> >       return true;
> >  }
> >
> > +enum fs_error_type {
> > +     /* pagecache reads and writes */
> > +     FSERR_READAHEAD,
> > +     FSERR_WRITEBACK,
> > +
> > +     /* directio read and writes */
> > +     FSERR_DIO_READ,
> > +     FSERR_DIO_WRITE,
> > +
> > +     /* media error */
> > +     FSERR_DATA_LOST,
> > +
> > +     /* filesystem metadata */
> > +     FSERR_METADATA,
> > +};
> > +
> > +struct fs_error {
> > +     struct work_struct work;
> > +     struct super_block *sb;
> > +     struct inode *inode;
> > +     loff_t pos;
> > +     u64 len;
> > +     enum fs_error_type type;
> > +     int error;
> > +};
> > +
> > +struct fs_error_hook {
> > +     struct notifier_block nb;
> > +};
> > +
> > +static inline int sb_hook_error(struct super_block *sb,
> > +                             struct fs_error_hook *h)
> > +{
> > +     return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
> > +}
> > +
> > +static inline void sb_unhook_error(struct super_block *sb,
> > +                                struct fs_error_hook *h)
> > +{
> > +     blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
> > +}
> > +
> > +static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
> > +{
> > +     h->nb.notifier_call = fn;
> > +     h->nb.priority = 0;
> > +}
> > +
> > +void __sb_error(struct super_block *sb, struct inode *inode,
> > +             enum fs_error_type type, loff_t pos, u64 len, int error);
> > +
> > +static inline void sb_error(struct super_block *sb, int error)
> > +{
> > +     __sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
> > +}
> > +
> > +static inline void inode_error(struct inode *inode, enum fs_error_type type,
> > +                            loff_t pos, u64 len, int error)
> > +{
> > +     __sb_error(inode->i_sb, inode, type, pos, len, error);
> > +}
> > +

Apart from the fact that Christian is not going to be happy with this
bloat of fs.h shouldn't all this be part of fsnotify.h?

I do not see why ext4 should not use the same workqueue
or why any code would need to call fsnotify_sb_error() directly.
...

> >  #endif /* _LINUX_FS_H */
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 8dd5421cb910b5..dc19311fe1c6c0 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -291,6 +291,12 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
> >  inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
> >               loff_t pos, u64 len, int error)
> >  {
> > +     struct inode *inode = mapping->host;
> > +
> > +     inode_error(inode,
> > +                 direction == READ ? FSERR_READAHEAD : FSERR_WRITEBACK,
> > +                 pos, len, error);
> > +
> >       if (mapping && mapping->a_ops->ioerror)
> >               mapping->a_ops->ioerror(mapping, direction, pos, len,
> >                               error);
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 1512d8dbb0d2e7..9f6ce0d9c531bb 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -95,6 +95,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> >
> >       if (dops && dops->end_io)
> >               ret = dops->end_io(iocb, dio->size, ret, dio->flags);
> > +     if (dio->error)
> > +             inode_error(file_inode(iocb->ki_filp),
> > +                         (dio->flags & IOMAP_DIO_WRITE) ? FSERR_DIO_WRITE :
> > +                                                          FSERR_DIO_READ,
> > +                         offset, dio->size, dio->error);
> >       if (dio->error && dops && dops->ioerror)
> >               dops->ioerror(file_inode(iocb->ki_filp),
> >                               (dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
> > diff --git a/fs/super.c b/fs/super.c
> > index 5bab94fb7e0358..f6d38e4b3d76b2 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -363,6 +363,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
> >       spin_lock_init(&s->s_inode_list_lock);
> >       INIT_LIST_HEAD(&s->s_inodes_wb);
> >       spin_lock_init(&s->s_inode_wblist_lock);
> > +     BLOCKING_INIT_NOTIFIER_HEAD(&s->s_error_notifier);
> >
> >       s->s_count = 1;
> >       atomic_set(&s->s_active, 1);
> > @@ -2267,3 +2268,55 @@ int sb_init_dio_done_wq(struct super_block *sb)
> >       return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(sb_init_dio_done_wq);
> > +
> > +static void handle_sb_error(struct work_struct *work)
> > +{
> > +     struct fs_error *fserr = container_of(work, struct fs_error, work);
> > +
> > +     fsnotify_sb_error(fserr->sb, fserr->inode, fserr->error);
> > +     blocking_notifier_call_chain(&fserr->sb->s_error_notifier, fserr->type,
> > +                                  fserr);
> > +     iput(fserr->inode);
> > +     kfree(fserr);
> > +}
> > +
> > +/**
> > + * Report a filesystem error.  The actual work is deferred to a workqueue so
> > + * that we're always in process context and to avoid blowing out the caller's
> > + * stack.
> > + *
> > + * @sb Filesystem superblock
> > + * @inode Inode within filesystem, if applicable
> > + * @type Type of error
> > + * @pos Start of file range affected, if applicable
> > + * @len Length of file range affected, if applicable
> > + * @error Error encountered.
> > + */
> > +void __sb_error(struct super_block *sb, struct inode *inode,
> > +             enum fs_error_type type, loff_t pos, u64 len, int error)
> > +{
> > +     struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
> > +
> > +     if (!fserr) {
> > +             printk(KERN_ERR
> > + "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
> > +                             inode ? inode->i_ino : 0, type,
> > +                             pos, len, error);
> > +             return;
> > +     }
> > +
> > +     if (inode) {
> > +             fserr->sb = inode->i_sb;
> > +             fserr->inode = igrab(inode);
> > +     } else {
> > +             fserr->sb = sb;
> > +     }
> > +     fserr->type = type;
> > +     fserr->pos = pos;
> > +     fserr->len = len;
> > +     fserr->error = error;
> > +     INIT_WORK(&fserr->work, handle_sb_error);
> > +
> > +     schedule_work(&fserr->work);
> > +}
> > +EXPORT_SYMBOL_GPL(__sb_error);
> >

...
We recently discovered that fsnotify_sb_error() calls are exposed to
races with generic_shutdown_super():
https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/

Will punting all FS_ERROR events to workqueue help to improve this
situation or will it make it worse?

Another question to ask is whether reporting fs errors during fs shutdown
is a feature or an anti-feature?

If this is needed, then we could change fsnotify_sb_error() to take an
(ino, gen) pair or a file handle directly instead of calling the filesystem
to encode a file handle to report with the event.

Thanks,
Amir.


* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05 11:14       ` Amir Goldstein
@ 2025-11-05 14:24         ` Jan Kara
  2025-11-05 18:28           ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-11-05 14:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Darrick J. Wong, cem, hch, linux-fsdevel, linux-xfs,
	gabriel, Christian Brauner

On Wed 05-11-25 12:14:52, Amir Goldstein wrote:
> On Wed, Nov 5, 2025 at 12:00 PM Jan Kara <jack@suse.cz> wrote:
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 5e4b3a4b24823f..1cb3965db3275c 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -80,6 +80,7 @@ struct fs_context;
> > >  struct fs_parameter_spec;
> > >  struct file_kattr;
> > >  struct iomap_ops;
> > > +struct notifier_head;
> > >
> > >  extern void __init inode_init(void);
> > >  extern void __init inode_init_early(void);
> > > @@ -1587,6 +1588,7 @@ struct super_block {
> > >
> > >       spinlock_t              s_inode_wblist_lock;
> > >       struct list_head        s_inodes_wb;    /* writeback inodes */
> > > +     struct blocking_notifier_head   s_error_notifier;
> > >  } __randomize_layout;
> > >
> > >  static inline struct user_namespace *i_user_ns(const struct inode *inode)
> > > @@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
> > >       return true;
> > >  }
> > >
> > > +enum fs_error_type {
> > > +     /* pagecache reads and writes */
> > > +     FSERR_READAHEAD,
> > > +     FSERR_WRITEBACK,
> > > +
> > > +     /* directio read and writes */
> > > +     FSERR_DIO_READ,
> > > +     FSERR_DIO_WRITE,
> > > +
> > > +     /* media error */
> > > +     FSERR_DATA_LOST,
> > > +
> > > +     /* filesystem metadata */
> > > +     FSERR_METADATA,
> > > +};
> > > +
> > > +struct fs_error {
> > > +     struct work_struct work;
> > > +     struct super_block *sb;
> > > +     struct inode *inode;
> > > +     loff_t pos;
> > > +     u64 len;
> > > +     enum fs_error_type type;
> > > +     int error;
> > > +};
> > > +
> > > +struct fs_error_hook {
> > > +     struct notifier_block nb;
> > > +};
> > > +
> > > +static inline int sb_hook_error(struct super_block *sb,
> > > +                             struct fs_error_hook *h)
> > > +{
> > > +     return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
> > > +}
> > > +
> > > +static inline void sb_unhook_error(struct super_block *sb,
> > > +                                struct fs_error_hook *h)
> > > +{
> > > +     blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
> > > +}
> > > +
> > > +static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
> > > +{
> > > +     h->nb.notifier_call = fn;
> > > +     h->nb.priority = 0;
> > > +}
> > > +
> > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > +             enum fs_error_type type, loff_t pos, u64 len, int error);
> > > +
> > > +static inline void sb_error(struct super_block *sb, int error)
> > > +{
> > > +     __sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
> > > +}
> > > +
> > > +static inline void inode_error(struct inode *inode, enum fs_error_type type,
> > > +                            loff_t pos, u64 len, int error)
> > > +{
> > > +     __sb_error(inode->i_sb, inode, type, pos, len, error);
> > > +}
> > > +
> 
> Apart from the fact that Christian is not going to be happy with this
> bloat of fs.h shouldn't all this be part of fsnotify.h?

The point that this may not belong in fs.h is a good one. But I don't
think fsnotify.h is appropriate either because this isn't really part of
fsnotify. It is a layer on top that binds fsnotify and notifier chain
notification together. So maybe a new fs_error.h header?

> I do not see why ext4 should not use the same workqueue
> or why any code would need to call fsnotify_sb_error() directly.

Yes, I guess we can convert ext4 to the same framework but I'm fine with
cleaning that up later.

> > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > +             enum fs_error_type type, loff_t pos, u64 len, int error)
> > > +{
> > > +     struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
> > > +
> > > +     if (!fserr) {
> > > +             printk(KERN_ERR
> > > + "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
> > > +                             inode ? inode->i_ino : 0, type,
> > > +                             pos, len, error);
> > > +             return;
> > > +     }
> > > +
> > > +     if (inode) {
> > > +             fserr->sb = inode->i_sb;
> > > +             fserr->inode = igrab(inode);
> > > +     } else {
> > > +             fserr->sb = sb;
> > > +     }
> > > +     fserr->type = type;
> > > +     fserr->pos = pos;
> > > +     fserr->len = len;
> > > +     fserr->error = error;
> > > +     INIT_WORK(&fserr->work, handle_sb_error);
> > > +
> > > +     schedule_work(&fserr->work);
> > > +}
> > > +EXPORT_SYMBOL_GPL(__sb_error);
> > >
> 
> ...
> We recently discovered that fsnotify_sb_error() calls are exposed to
> races with generic_shutdown_super():
> https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/
> 
> Will punting all FS_ERROR events to workqueue help to improve this
> situation or will it make it worse?

Worse. But you raise a really good point which I missed during my
review. Currently nothing synchronizes pending work items with the
superblock being destroyed, which leaves obvious UAF issues in
handle_sb_error().

> Another question to ask is whether reporting fs errors during fs shutdown
> is a feature or an anti-feature?

I think there must be a point of no return during fs shutdown after which
we just stop emitting errors.

> If this is needed, then we could change fsnotify_sb_error() to take an
> (ino, gen) pair or a file handle directly instead of calling the filesystem
> to encode a file handle to report with the event.

This lifetime issue is not limited to fsnotify. I think __sb_error() needs
to check whether the superblock is still alive and synchronize properly
with sb shutdown (at which point making ext4 use this framework will be a
net win because it will close this race for ext4 as well).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05 14:24         ` Jan Kara
@ 2025-11-05 18:28           ` Darrick J. Wong
  2025-11-05 19:41             ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 18:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, cem, hch, linux-fsdevel, linux-xfs, gabriel,
	Christian Brauner

On Wed, Nov 05, 2025 at 03:24:41PM +0100, Jan Kara wrote:
> On Wed 05-11-25 12:14:52, Amir Goldstein wrote:
> > On Wed, Nov 5, 2025 at 12:00 PM Jan Kara <jack@suse.cz> wrote:
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 5e4b3a4b24823f..1cb3965db3275c 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -80,6 +80,7 @@ struct fs_context;
> > > >  struct fs_parameter_spec;
> > > >  struct file_kattr;
> > > >  struct iomap_ops;
> > > > +struct notifier_head;
> > > >
> > > >  extern void __init inode_init(void);
> > > >  extern void __init inode_init_early(void);
> > > > @@ -1587,6 +1588,7 @@ struct super_block {
> > > >
> > > >       spinlock_t              s_inode_wblist_lock;
> > > >       struct list_head        s_inodes_wb;    /* writeback inodes */
> > > > +     struct blocking_notifier_head   s_error_notifier;
> > > >  } __randomize_layout;
> > > >
> > > >  static inline struct user_namespace *i_user_ns(const struct inode *inode)
> > > > @@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
> > > >       return true;
> > > >  }
> > > >
> > > > +enum fs_error_type {
> > > > +     /* pagecache reads and writes */
> > > > +     FSERR_READAHEAD,
> > > > +     FSERR_WRITEBACK,
> > > > +
> > > > +     /* directio read and writes */
> > > > +     FSERR_DIO_READ,
> > > > +     FSERR_DIO_WRITE,
> > > > +
> > > > +     /* media error */
> > > > +     FSERR_DATA_LOST,
> > > > +
> > > > +     /* filesystem metadata */
> > > > +     FSERR_METADATA,
> > > > +};
> > > > +
> > > > +struct fs_error {
> > > > +     struct work_struct work;
> > > > +     struct super_block *sb;
> > > > +     struct inode *inode;
> > > > +     loff_t pos;
> > > > +     u64 len;
> > > > +     enum fs_error_type type;
> > > > +     int error;
> > > > +};
> > > > +
> > > > +struct fs_error_hook {
> > > > +     struct notifier_block nb;
> > > > +};
> > > > +
> > > > +static inline int sb_hook_error(struct super_block *sb,
> > > > +                             struct fs_error_hook *h)
> > > > +{
> > > > +     return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
> > > > +}
> > > > +
> > > > +static inline void sb_unhook_error(struct super_block *sb,
> > > > +                                struct fs_error_hook *h)
> > > > +{
> > > > +     blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
> > > > +}
> > > > +
> > > > +static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
> > > > +{
> > > > +     h->nb.notifier_call = fn;
> > > > +     h->nb.priority = 0;
> > > > +}
> > > > +
> > > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > > +             enum fs_error_type type, loff_t pos, u64 len, int error);
> > > > +
> > > > +static inline void sb_error(struct super_block *sb, int error)
> > > > +{
> > > > +     __sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
> > > > +}
> > > > +
> > > > +static inline void inode_error(struct inode *inode, enum fs_error_type type,
> > > > +                            loff_t pos, u64 len, int error)
> > > > +{
> > > > +     __sb_error(inode->i_sb, inode, type, pos, len, error);
> > > > +}
> > > > +
> > 
> > Apart from the fact that Christian is not going to be happy with this
> > bloat of fs.h shouldn't all this be part of fsnotify.h?
> 
> > The point that this may not belong in fs.h is a good one. But I don't
> > think fsnotify.h is appropriate either because this isn't really part of
> > fsnotify. It is a layer on top that binds fsnotify and notifier chain
> > notification together. So maybe a new fs_error.h header?

Fine with me, though it'd be a small header.

> > I do not see why ext4 should not use the same workqueue
> > or why any code would need to call fsnotify_sb_error() directly.
> 
> Yes, I guess we can convert ext4 to the same framework but I'm fine with
> cleaning that up later.

Yes, that should be a trivial patch to change fsnotify_sb_error ->
sb_error/inode_error.

> > > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > > +             enum fs_error_type type, loff_t pos, u64 len, int error)
> > > > +{
> > > > +     struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
> > > > +
> > > > +     if (!fserr) {
> > > > +             printk(KERN_ERR
> > > > + "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
> > > > +                             inode ? inode->i_ino : 0, type,
> > > > +                             pos, len, error);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     if (inode) {
> > > > +             fserr->sb = inode->i_sb;
> > > > +             fserr->inode = igrab(inode);
> > > > +     } else {
> > > > +             fserr->sb = sb;
> > > > +     }
> > > > +     fserr->type = type;
> > > > +     fserr->pos = pos;
> > > > +     fserr->len = len;
> > > > +     fserr->error = error;
> > > > +     INIT_WORK(&fserr->work, handle_sb_error);
> > > > +
> > > > +     schedule_work(&fserr->work);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(__sb_error);
> > > >
> > 
> > ...
> > We recently discovered that fsnotify_sb_error() calls are exposed to
> > races with generic_shutdown_super():
> > https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/

Hrmm.  I've noticed that ever since I added this new patchset, I've been
getting more instances of outright crashes in the timer code, or
workqueue lockups.  I wonder if that UAF is what's going on here...

> > Will punting all FS_ERROR events to workqueue help to improve this
> > situation or will it make it worse?
> 
> Worse. But you raise a really good point which I've missed during my
> review. Currently there's nothing which synchronizes pending works with
> superblock getting destroyed with obvious UAF issues already in
> handle_sb_error().

I wonder, could __sb_error call get_active_super() to obtain an active
reference to the sb, and then deactivate_super() it in the workqueue
callback?  If we can't get an active ref then we presume that the fs is
already shutting down and don't send the event.

The igrab/iput was supposed to prevent the same UAF from happening with
the inode, but I should've checked for a non-null return value.

> > Another question to ask is whether reporting fs error during fs shutdown
> > is a feature or anti feature?
> 
> I think there must be a point of no return during fs shutdown after which
> we just stop emitting errors.

I agree, once s_active hits zero there's no point in sending further
errors.

> > If this is needed then we could change fsnotify_sb_error() to
> > take ino,gen or file handle directly instead of calling filesystem to encode
> > a file handle to report with the event.

That would be another way to do it.  The sole downstream consumer of the
s_error_notifier-based events only cares about ino/gen.
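
A minimal userspace sketch of that alternative, assuming the event only
needs ino/gen: copy the two values when the error is queued, so the
deferred worker never dereferences the inode.  The model_* names are
hypothetical stand-ins and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two inode fields the consumer needs. */
struct model_inode {
	uint64_t i_ino;
	uint32_t i_generation;
};

/* Hypothetical event that copies identity instead of holding a pointer. */
struct model_fs_error {
	uint64_t ino;
	uint32_t gen;
	int64_t pos;
	uint64_t len;
	int error;
};

/* Snapshot everything the worker needs; returns NULL if allocation fails. */
static struct model_fs_error *
model_capture_error(const struct model_inode *inode, int64_t pos,
		    uint64_t len, int error)
{
	struct model_fs_error *e = calloc(1, sizeof(*e));

	if (!e)
		return NULL;
	if (inode) {
		/* Copy by value: no reference to the inode is kept. */
		e->ino = inode->i_ino;
		e->gen = inode->i_generation;
	}
	e->pos = pos;
	e->len = len;
	e->error = error;
	return e;
}
```

Because the event owns its own copy, the inode can go away before the
worker runs without creating a use-after-free.  This only addresses the
inode lifetime; the superblock lifetime still needs separate handling.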

> This lifetime issue is not limited to fsnotify. I think __sb_error() needs
> to check whether the superblock is still alive and synchronize properly
> with sb shutdown (at which point making ext4 use this framework will be a
> net win because it will close this race for ext4 as well).

<nod>

--D

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05 18:28           ` Darrick J. Wong
@ 2025-11-05 19:41             ` Darrick J. Wong
  2025-11-06 10:13               ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 19:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, cem, hch, linux-fsdevel, linux-xfs, gabriel,
	Christian Brauner

On Wed, Nov 05, 2025 at 10:28:08AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 03:24:41PM +0100, Jan Kara wrote:
> > On Wed 05-11-25 12:14:52, Amir Goldstein wrote:
> > > On Wed, Nov 5, 2025 at 12:00 PM Jan Kara <jack@suse.cz> wrote:
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 5e4b3a4b24823f..1cb3965db3275c 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -80,6 +80,7 @@ struct fs_context;
> > > > >  struct fs_parameter_spec;
> > > > >  struct file_kattr;
> > > > >  struct iomap_ops;
> > > > > +struct notifier_head;
> > > > >
> > > > >  extern void __init inode_init(void);
> > > > >  extern void __init inode_init_early(void);
> > > > > @@ -1587,6 +1588,7 @@ struct super_block {
> > > > >
> > > > >       spinlock_t              s_inode_wblist_lock;
> > > > >       struct list_head        s_inodes_wb;    /* writeback inodes */
> > > > > +     struct blocking_notifier_head   s_error_notifier;
> > > > >  } __randomize_layout;
> > > > >
> > > > >  static inline struct user_namespace *i_user_ns(const struct inode *inode)
> > > > > @@ -4069,4 +4071,66 @@ static inline bool extensible_ioctl_valid(unsigned int cmd_a,
> > > > >       return true;
> > > > >  }
> > > > >
> > > > > +enum fs_error_type {
> > > > > +     /* pagecache reads and writes */
> > > > > +     FSERR_READAHEAD,
> > > > > +     FSERR_WRITEBACK,
> > > > > +
> > > > > +     /* directio read and writes */
> > > > > +     FSERR_DIO_READ,
> > > > > +     FSERR_DIO_WRITE,
> > > > > +
> > > > > +     /* media error */
> > > > > +     FSERR_DATA_LOST,
> > > > > +
> > > > > +     /* filesystem metadata */
> > > > > +     FSERR_METADATA,
> > > > > +};
> > > > > +
> > > > > +struct fs_error {
> > > > > +     struct work_struct work;
> > > > > +     struct super_block *sb;
> > > > > +     struct inode *inode;
> > > > > +     loff_t pos;
> > > > > +     u64 len;
> > > > > +     enum fs_error_type type;
> > > > > +     int error;
> > > > > +};
> > > > > +
> > > > > +struct fs_error_hook {
> > > > > +     struct notifier_block nb;
> > > > > +};
> > > > > +
> > > > > +static inline int sb_hook_error(struct super_block *sb,
> > > > > +                             struct fs_error_hook *h)
> > > > > +{
> > > > > +     return blocking_notifier_chain_register(&sb->s_error_notifier, &h->nb);
> > > > > +}
> > > > > +
> > > > > +static inline void sb_unhook_error(struct super_block *sb,
> > > > > +                                struct fs_error_hook *h)
> > > > > +{
> > > > > +     blocking_notifier_chain_unregister(&sb->s_error_notifier, &h->nb);
> > > > > +}
> > > > > +
> > > > > +static inline void sb_init_error_hook(struct fs_error_hook *h, notifier_fn_t fn)
> > > > > +{
> > > > > +     h->nb.notifier_call = fn;
> > > > > +     h->nb.priority = 0;
> > > > > +}
> > > > > +
> > > > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > > > +             enum fs_error_type type, loff_t pos, u64 len, int error);
> > > > > +
> > > > > +static inline void sb_error(struct super_block *sb, int error)
> > > > > +{
> > > > > +     __sb_error(sb, NULL, FSERR_METADATA, 0, 0, error);
> > > > > +}
> > > > > +
> > > > > +static inline void inode_error(struct inode *inode, enum fs_error_type type,
> > > > > +                            loff_t pos, u64 len, int error)
> > > > > +{
> > > > > +     __sb_error(inode->i_sb, inode, type, pos, len, error);
> > > > > +}
> > > > > +
> > > 
> > > Apart from the fact that Christian is not going to be happy with this
> > > bloat of fs.h shouldn't all this be part of fsnotify.h?
> > 
> > Point that this maybe doesn't belong to fs.h is a good one. But I don't
> > think fsnotify.h is appropriate either because this isn't really part of
> > fsnotify. It is a layer on top that's binding fsnotify and notifier chain
> > notification. So maybe a new fs_error.h header?
> 
> Fine with me, though it'd be a small header.
> 
> > > I do not see why ext4 should not use the same workqueue
> > > or why any code would need to call fsnotify_sb_error() directly.
> > 
> > Yes, I guess we can convert ext4 to the same framework but I'm fine with
> > cleaning that up later.
> 
> Yes, that should be a trivial patch to change fsnotify_sb_error ->
> sb_error/inode_error.
> 
> > > > > +void __sb_error(struct super_block *sb, struct inode *inode,
> > > > > +             enum fs_error_type type, loff_t pos, u64 len, int error)
> > > > > +{
> > > > > +     struct fs_error *fserr = kzalloc(sizeof(struct fs_error), GFP_ATOMIC);
> > > > > +
> > > > > +     if (!fserr) {
> > > > > +             printk(KERN_ERR
> > > > > + "lost fs error report for ino %lu type %u pos 0x%llx len 0x%llx error %d",
> > > > > +                             inode ? inode->i_ino : 0, type,
> > > > > +                             pos, len, error);
> > > > > +             return;
> > > > > +     }
> > > > > +
> > > > > +     if (inode) {
> > > > > +             fserr->sb = inode->i_sb;
> > > > > +             fserr->inode = igrab(inode);
> > > > > +     } else {
> > > > > +             fserr->sb = sb;
> > > > > +     }
> > > > > +     fserr->type = type;
> > > > > +     fserr->pos = pos;
> > > > > +     fserr->len = len;
> > > > > +     fserr->error = error;
> > > > > +     INIT_WORK(&fserr->work, handle_sb_error);
> > > > > +
> > > > > +     schedule_work(&fserr->work);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(__sb_error);
> > > > >
> > > 
> > > ...
> > > We recently discovered that fsnotify_sb_error() calls are exposed to
> > > races with generic_shutdown_super():
> > > https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/
> 
> Hrmm.  I've noticed that ever since I added this new patchset, I've been
> getting more instances of outright crashes in the timer code, or
> workqueue lockups.  I wonder if that UAF is what's going on here...
> 
> > > Will punting all FS_ERROR events to workqueue help to improve this
> > > situation or will it make it worse?
> > 
> > Worse. But you raise a really good point which I've missed during my
> > review. Currently there's nothing which synchronizes pending works with
> > superblock getting destroyed with obvious UAF issues already in
> > handle_sb_error().
> 
> I wonder, could __sb_error call get_active_super() to obtain an active
> reference to the sb, and then deactivate_super() it in the workqueue
> callback?  If we can't get an active ref then we presume that the fs is
> already shutting down and don't send the event.

...and now that I've actually tried it, I realize that we can't actually
call get_active_super because it can sleep waiting for s_umount and
SB_BORN.  Maybe we could directly atomic_inc_not_zero(&sb->s_active)
and trust that the caller has an active ref to the sb?  I think that's
true for anyone calling __sb_error with a non-null inode.
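
For reference, a userspace sketch of the increment-unless-zero primitive
that idea relies on, written with C11 atomics.  model_inc_not_zero() is
a hypothetical model of the semantics only; it is not the kernel's
atomic_inc_not_zero() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v unless it is zero; returns true if a reference was taken. */
static bool model_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	/* The count already dropped to zero: too late to take a ref. */
	return false;
}
```

A false return is the "fs is already shutting down" case, in which the
event would simply not be sent.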

--D

> The igrab/iput was supposed to prevent the same UAF from happening with
> the inode, but I should've checked for a non-null return value.
> 
> > > Another question to ask is whether reporting fs error during fs shutdown
> > > is a feature or anti feature?
> > 
> > I think there must be a point of no return during fs shutdown after which
> > we just stop emitting errors.
> 
> I agree, once s_active hits zero there's no point in sending further
> errors.
> 
> > > If this is needed then we could change fsnotify_sb_error() to
> > > take ino,gen or file handle directly instead of calling filesystem to encode
> > > a file handle to report with the event.
> 
> That would be another way to do it.  The sole downstream consumer of the
> s_error_notifier-based events only cares about ino/gen.
> 
> > This lifetime issue is not limited to fsnotify. I think __sb_error() needs
> > to check whether the superblock is still alive and synchronize properly
> > with sb shutdown (at which point making ext4 use this framework will be a
> > net win because it will close this race for ext4 as well).
> 
> <nod>
> 
> --D
> 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> > 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-05 19:41             ` Darrick J. Wong
@ 2025-11-06 10:13               ` Jan Kara
  2025-11-06 17:06                 ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2025-11-06 10:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Amir Goldstein, cem, hch, linux-fsdevel, linux-xfs,
	gabriel, Christian Brauner

On Wed 05-11-25 11:41:38, Darrick J. Wong wrote:
> On Wed, Nov 05, 2025 at 10:28:08AM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 03:24:41PM +0100, Jan Kara wrote:
> > > On Wed 05-11-25 12:14:52, Amir Goldstein wrote:
> > > > 
> > > > ...
> > > > We recently discovered that fsnotify_sb_error() calls are exposed to
> > > > races with generic_shutdown_super():
> > > > https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/
> > 
> > Hrmm.  I've noticed that ever since I added this new patchset, I've been
> > getting more instances of outright crashes in the timer code, or
> > workqueue lockups.  I wonder if that UAF is what's going on here...
> > 
> > > > Will punting all FS_ERROR events to workqueue help to improve this
> > > > situation or will it make it worse?
> > > 
> > > Worse. But you raise a really good point which I've missed during my
> > > review. Currently there's nothing which synchronizes pending works with
> > > superblock getting destroyed with obvious UAF issues already in
> > > handle_sb_error().
> > 
> > I wonder, could __sb_error call get_active_super() to obtain an active
> > reference to the sb, and then deactivate_super() it in the workqueue
> > callback?  If we can't get an active ref then we presume that the fs is
> > already shutting down and don't send the event.
> 
> ...and now that I've actually tried it, I realize that we can't actually
> call get_active_super because it can sleep waiting for s_umount and
> SB_BORN.  Maybe we could directly atomic_inc_not_zero(&sb->s_active)
> and trust that the caller has an active ref to the sb?  I think that's
> true for anyone calling __sb_error with a non-null inode.

Well, the side-effects of holding active sb reference from some workqueue
item tend to hit back occasionally (like when userspace assumes the device
isn't used anymore but it in fact still is because of the active
reference). Every time we tried something like this (last time it was with
io_uring I believe) some user came back and complained his setup broke. In
this case it should be really rare but still I think it's better to avoid
it if we can (plus I'm not sure what you'd like to do for __sb_error()
callers that don't get the inode and thus active reference isn't really
guaranteed - they still need the protection against umount so that
handle_sb_error() can do the notifier callchain thing).

So I think a better solution might be that generic_shutdown_super() waits
for pending error notifications after clearing SB_ACTIVE before umount
proceeds further and __sb_error() just starts discarding new notifications
as soon as we see SB_ACTIVE is clear.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/6] iomap: report file IO errors to fsnotify
  2025-11-06 10:13               ` Jan Kara
@ 2025-11-06 17:06                 ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-06 17:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, cem, hch, linux-fsdevel, linux-xfs, gabriel,
	Christian Brauner

On Thu, Nov 06, 2025 at 11:13:45AM +0100, Jan Kara wrote:
> On Wed 05-11-25 11:41:38, Darrick J. Wong wrote:
> > On Wed, Nov 05, 2025 at 10:28:08AM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 05, 2025 at 03:24:41PM +0100, Jan Kara wrote:
> > > > On Wed 05-11-25 12:14:52, Amir Goldstein wrote:
> > > > > 
> > > > > ...
> > > > > We recently discovered that fsnotify_sb_error() calls are exposed to
> > > > > races with generic_shutdown_super():
> > > > > https://lore.kernel.org/linux-fsdevel/scmyycf2trich22v25s6gpe3ib6ejawflwf76znxg7sedqablp@ejfycd34xvpa/
> > > 
> > > Hrmm.  I've noticed that ever since I added this new patchset, I've been
> > > getting more instances of outright crashes in the timer code, or
> > > workqueue lockups.  I wonder if that UAF is what's going on here...
> > > 
> > > > > Will punting all FS_ERROR events to workqueue help to improve this
> > > > > situation or will it make it worse?
> > > > 
> > > > Worse. But you raise a really good point which I've missed during my
> > > > review. Currently there's nothing which synchronizes pending works with
> > > > superblock getting destroyed with obvious UAF issues already in
> > > > handle_sb_error().
> > > 
> > > I wonder, could __sb_error call get_active_super() to obtain an active
> > > reference to the sb, and then deactivate_super() it in the workqueue
> > > callback?  If we can't get an active ref then we presume that the fs is
> > > already shutting down and don't send the event.
> > 
> > ...and now that I've actually tried it, I realize that we can't actually
> > call get_active_super because it can sleep waiting for s_umount and
> > SB_BORN.  Maybe we could directly atomic_inc_not_zero(&sb->s_active)
> > and trust that the caller has an active ref to the sb?  I think that's
> > true for anyone calling __sb_error with a non-null inode.
> 
> Well, the side-effects of holding active sb reference from some workqueue
> item tend to hit back occasionally (like when userspace assumes the device
> isn't used anymore but it in fact still is because of the active
> reference). Every time we tried something like this (last time it was with
> io_uring I believe) some user came back and complained his setup broke. In
> this case it should be really rare but still I think it's better to avoid
> it if we can (plus I'm not sure what you'd like to do for __sb_error()
> callers that don't get the inode and thus active reference isn't really
> guaranteed - they still need the protection against umount so that
> handle_sb_error() can do the notifier callchain thing).
> 
> So I think a better solution might be that generic_shutdown_super() waits
> for pending error notifications after clearing SB_ACTIVE before umount
> proceeds further and __sb_error() just starts discarding new notifications
> as soon as we see SB_ACTIVE is clear.

<nod> Summarizing what we just talked about on the ext4 call--

Instead of grabbing an active reference to the sb, I'll instead fix
__sb_error to (a) ignore !BORN || !ACTIVE || DYING supers, and (b)
increment a counter in the sb whenever we queue an event, and decrement
it when the worker finishes with it.  generic_shutdown_super can then
wait (having just cleared ACTIVE) for the counter to hit zero before it
stops fsnotify and drops the dentry cache.

Or at least that's what I'll try today.
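
A rough userspace model of that plan, as a sketch only.  The struct, the
field names, and the model_* functions are hypothetical; the single
"active" flag stands in for the BORN/ACTIVE/DYING checks, and the spin
loop stands in for whatever sleeping wait the kernel version would use.
Counting before checking the flag is an assumption of this sketch, made
so that shutdown cannot miss a report that is being queued concurrently.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the superblock state relevant to error reports. */
struct model_sb {
	atomic_bool active;		/* BORN && ACTIVE && !DYING */
	atomic_int pending_errors;	/* queued but unfinished reports */
};

static void model_sb_init(struct model_sb *sb)
{
	atomic_init(&sb->active, true);
	atomic_init(&sb->pending_errors, 0);
}

/* Returns true if the report was queued, false if it was discarded. */
static bool model_sb_error(struct model_sb *sb)
{
	/* Count first, then check, so shutdown cannot miss this report. */
	atomic_fetch_add(&sb->pending_errors, 1);
	if (!atomic_load(&sb->active)) {
		atomic_fetch_sub(&sb->pending_errors, 1);
		return false;
	}
	return true;
}

/* Called by the worker once it has finished delivering one report. */
static void model_error_done(struct model_sb *sb)
{
	atomic_fetch_sub(&sb->pending_errors, 1);
}

/* Shutdown: stop accepting reports, then wait for the queued ones. */
static void model_shutdown(struct model_sb *sb)
{
	atomic_store(&sb->active, false);
	while (atomic_load(&sb->pending_errors) > 0)
		sched_yield();	/* a kernel version would sleep, not spin */
}
```

With sequentially consistent atomics, either the reporter sees the
cleared flag and backs out, or shutdown sees the raised counter and
waits, so no report can still be in flight once model_shutdown()
returns.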

--D

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 2/6] xfs: switch healthmon to use the iomap I/O error reporting
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 1/6] iomap: report file IO errors to fsnotify Darrick J. Wong
@ 2025-11-05  0:54   ` Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 3/6] xfs: port notify-failure to use the new vfs io " Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:54 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

Use the new generic I/O error reporting paths so that we can remove the
xfs-specific hooks.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_trace.h     |   35 ++++++++++++++++++---------------
 fs/xfs/xfs_healthmon.c |   51 +++++++++++++++++++++++++-----------------------
 2 files changed, 46 insertions(+), 40 deletions(-)


diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 520526ef9cd11c..bb7335b56a53e4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6262,24 +6262,25 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
 		  __entry->lost_prev)
 );
 
-#define XFS_FILE_IOERROR_STRINGS \
-	{ XFS_FILE_IOERROR_BUFFERED_READ,	"readahead" }, \
-	{ XFS_FILE_IOERROR_BUFFERED_WRITE,	"writeback" }, \
-	{ XFS_FILE_IOERROR_DIRECT_READ,		"directio_read" }, \
-	{ XFS_FILE_IOERROR_DIRECT_WRITE,	"directio_write" }, \
-	{ XFS_FILE_IOERROR_DATA_LOST,		"datalost" }
+#define FS_ERROR_STRINGS \
+	{ FSERR_READAHEAD,	"readahead" }, \
+	{ FSERR_WRITEBACK,	"writeback" }, \
+	{ FSERR_DIO_READ,	"directio_read" }, \
+	{ FSERR_DIO_WRITE,	"directio_write" }, \
+	{ FSERR_DATA_LOST,	"datalost" }, \
+	{ FSERR_METADATA,	"metadata" }
 
-
-TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ);
-TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE);
-TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ);
-TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE);
-TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DATA_LOST);
+TRACE_DEFINE_ENUM(FSERR_READAHEAD);
+TRACE_DEFINE_ENUM(FSERR_WRITEBACK);
+TRACE_DEFINE_ENUM(FSERR_DIO_READ);
+TRACE_DEFINE_ENUM(FSERR_DIO_WRITE);
+TRACE_DEFINE_ENUM(FSERR_DATA_LOST);
+TRACE_DEFINE_ENUM(FSERR_METADATA);
 
 TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
 	TP_PROTO(const struct xfs_mount *mp,
 		 unsigned long action,
-		 const struct xfs_file_ioerror_params *p,
+		 const struct fs_error *p,
 		 unsigned int events, unsigned long long lost_prev),
 	TP_ARGS(mp, action, p, events, lost_prev),
 	TP_STRUCT__entry(
@@ -6294,10 +6295,12 @@ TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
 		__field(unsigned long long, lost_prev)
 	),
 	TP_fast_assign(
+		struct xfs_inode *ip = XFS_I(p->inode);
+
 		__entry->dev = mp ? mp->m_super->s_dev : 0;
 		__entry->action = action;
-		__entry->ino = p->ino;
-		__entry->gen = p->gen;
+		__entry->ino = ip->i_ino;
+		__entry->gen = p->inode->i_generation;
 		__entry->pos = p->pos;
 		__entry->len = p->len;
 		__entry->events = events;
@@ -6307,7 +6310,7 @@ TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->gen,
-		  __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS),
+		  __print_symbolic(__entry->action, FS_ERROR_STRINGS),
 		  __entry->pos,
 		  __entry->len,
 		  __entry->events,
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index def4de5f6bc543..3796c7335bb58d 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -74,7 +74,7 @@ struct xfs_healthmon {
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
 	struct xfs_media_error_hook	mhook;
-	struct xfs_file_ioerror_hook	fhook;
+	struct fs_error_hook		fhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -702,22 +702,23 @@ xfs_healthmon_file_ioerror_hook(
 {
 	struct xfs_healthmon		*hm;
 	struct xfs_healthmon_event	*event;
-	struct xfs_file_ioerror_params	*p = data;
+	struct fs_error			*p = data;
 	struct mem_cgroup		*old_memcg;
+	struct xfs_inode		*ip;
 	enum xfs_healthmon_type		type = 0;
 	int				error;
 
-	hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb);
+	hm = container_of(nb, struct xfs_healthmon, fhook.nb);
 
-	switch (action) {
-	case XFS_FILE_IOERROR_BUFFERED_READ:
-	case XFS_FILE_IOERROR_BUFFERED_WRITE:
-	case XFS_FILE_IOERROR_DIRECT_READ:
-	case XFS_FILE_IOERROR_DIRECT_WRITE:
-	case XFS_FILE_IOERROR_DATA_LOST:
+	switch (p->type) {
+	case FSERR_READAHEAD:
+	case FSERR_WRITEBACK:
+	case FSERR_DIO_READ:
+	case FSERR_DIO_WRITE:
+	case FSERR_DATA_LOST:
 		break;
-	default:
-		ASSERT(0);
+	case FSERR_METADATA:
+		/* already handled by xfs_health */
 		return NOTIFY_DONE;
 	}
 
@@ -731,30 +732,33 @@ xfs_healthmon_file_ioerror_hook(
 	if (error)
 		goto out_unlock;
 
-	switch (action) {
-	case XFS_FILE_IOERROR_BUFFERED_READ:
+	switch (p->type) {
+	case FSERR_READAHEAD:
 		type = XFS_HEALTHMON_BUFREAD;
 		break;
-	case XFS_FILE_IOERROR_BUFFERED_WRITE:
+	case FSERR_WRITEBACK:
 		type = XFS_HEALTHMON_BUFWRITE;
 		break;
-	case XFS_FILE_IOERROR_DIRECT_READ:
+	case FSERR_DIO_READ:
 		type = XFS_HEALTHMON_DIOREAD;
 		break;
-	case XFS_FILE_IOERROR_DIRECT_WRITE:
+	case FSERR_DIO_WRITE:
 		type = XFS_HEALTHMON_DIOWRITE;
 		break;
-	case XFS_FILE_IOERROR_DATA_LOST:
+	case FSERR_DATA_LOST:
 		type = XFS_HEALTHMON_DATALOST;
 		break;
+	default:
+		break;
 	}
 
 	event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE);
 	if (!event)
 		goto out_unlock;
 
-	event->fino = p->ino;
-	event->fgen = p->gen;
+	ip = XFS_I(p->inode);
+	event->fino = ip->i_ino;
+	event->fgen = p->inode->i_generation;
 	event->fpos = p->pos;
 	event->flen = p->len;
 	error = xfs_healthmon_push(hm, event);
@@ -1174,7 +1178,7 @@ xfs_healthmon_detach_hooks(
 	 * through the health monitoring subsystem from xfs_fs_put_super, so
 	 * it is now time to detach the hooks.
 	 */
-	xfs_file_ioerror_hook_del(hm->mp, &hm->fhook);
+	sb_unhook_error(hm->mp->m_super, &hm->fhook);
 	xfs_media_error_hook_del(hm->mp, &hm->mhook);
 	xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	xfs_health_hook_del(hm->mp, &hm->hhook);
@@ -1418,9 +1422,8 @@ xfs_ioc_health_monitor(
 	if (ret)
 		goto out_shutdownhook;
 
-	xfs_file_ioerror_hook_setup(&hm->fhook,
-			xfs_healthmon_file_ioerror_hook);
-	ret = xfs_file_ioerror_hook_add(mp, &hm->fhook);
+	sb_init_error_hook(&hm->fhook, xfs_healthmon_file_ioerror_hook);
+	ret = sb_hook_error(mp->m_super, &hm->fhook);
 	if (ret)
 		goto out_mediahook;
 
@@ -1449,7 +1452,7 @@ xfs_ioc_health_monitor(
 	return fd;
 
 out_ioerrhook:
-	xfs_file_ioerror_hook_del(mp, &hm->fhook);
+	sb_unhook_error(mp->m_super, &hm->fhook);
 out_mediahook:
 	xfs_media_error_hook_del(mp, &hm->mhook);
 out_shutdownhook:


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/6] xfs: port notify-failure to use the new vfs io error reporting
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 1/6] iomap: report file IO errors to fsnotify Darrick J. Wong
  2025-11-05  0:54   ` [PATCH 2/6] xfs: switch healthmon to use the iomap I/O error reporting Darrick J. Wong
@ 2025-11-05  0:54   ` Darrick J. Wong
  2025-11-05  0:55   ` [PATCH 4/6] xfs: remove file I/O error hooks Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:54 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

Port the media error notification code to use the new generic reporting
code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_notify_failure.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index bf6e1865d5c3a5..20f040a35537f6 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -178,9 +178,10 @@ xfs_dax_failure_fn(
 		invalidate_inode_pages2_range(mapping, pgoff,
 					      pgoff + pgcnt - 1);
 
-	xfs_inode_media_error(ip,
+	inode_error(VFS_I(ip), FSERR_DATA_LOST,
 			XFS_FSB_TO_B(mp, (u64)pgoff << PAGE_SHIFT),
-			XFS_FSB_TO_B(mp, (u64)pgcnt << PAGE_SHIFT));
+			XFS_FSB_TO_B(mp, (u64)pgcnt << PAGE_SHIFT),
+			-EIO);
 
 	xfs_irele(ip);
 	return error;
@@ -496,8 +497,8 @@ xfs_report_one_data_lost(
 	if (rmap_end > lost_end)
 		blocks -= rmap_end - lost_end;
 
-	xfs_inode_media_error(ip, XFS_FSB_TO_B(mp, fileoff),
-			XFS_FSB_TO_B(mp, blocks));
+	inode_error(VFS_I(ip), FSERR_DATA_LOST, XFS_FSB_TO_B(mp, fileoff),
+			XFS_FSB_TO_B(mp, blocks), -EIO);
 
 	xfs_irele(ip);
 	return 0;


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/6] xfs: remove file I/O error hooks
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-11-05  0:54   ` [PATCH 3/6] xfs: port notify-failure to use the new vfs io " Darrick J. Wong
@ 2025-11-05  0:55   ` Darrick J. Wong
  2025-11-05  0:55   ` [PATCH 5/6] iomap: remove " Darrick J. Wong
  2025-11-05  0:55   ` [PATCH 6/6] xfs: report fs metadata errors via fsnotify Darrick J. Wong
  5 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:55 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

Remove these hooks since iomap now does that on its own.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_file.h  |   37 -----------
 fs/xfs/xfs_mount.h |    3 -
 fs/xfs/xfs_aops.c  |    2 -
 fs/xfs/xfs_file.c  |  174 ----------------------------------------------------
 fs/xfs/xfs_super.c |    1 
 5 files changed, 1 insertion(+), 216 deletions(-)


diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index 441f8a693bb884..2ad91f755caf35 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -12,41 +12,4 @@ extern const struct file_operations xfs_dir_file_operations;
 bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos,
 		long long int len);
 
-enum xfs_file_ioerror_type {
-	XFS_FILE_IOERROR_BUFFERED_READ,
-	XFS_FILE_IOERROR_BUFFERED_WRITE,
-	XFS_FILE_IOERROR_DIRECT_READ,
-	XFS_FILE_IOERROR_DIRECT_WRITE,
-	XFS_FILE_IOERROR_DATA_LOST,
-};
-
-struct xfs_file_ioerror_params {
-	xfs_ino_t		ino;
-	loff_t			pos;
-	u64			len;
-	u32			gen;
-	int			error;
-};
-
-#ifdef CONFIG_XFS_LIVE_HOOKS
-struct xfs_file_ioerror_hook {
-	struct xfs_hook			ioerror_hook;
-};
-
-int xfs_file_ioerror_hook_add(struct xfs_mount *mp,
-		struct xfs_file_ioerror_hook *hook);
-void xfs_file_ioerror_hook_del(struct xfs_mount *mp,
-		struct xfs_file_ioerror_hook *hook);
-void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook,
-		notifier_fn_t mod_fn);
-
-void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos,
-		u64 len, int error);
-
-void xfs_inode_media_error(struct xfs_inode *ip, loff_t pos, u64 len);
-#else
-# define xfs_vm_ioerror			NULL
-# define xfs_inode_media_error(...)	((void)0)
-#endif /* CONFIG_XFS_LIVE_HOOKS */
-
 #endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2d7f9ccba5287e..0feb0fb685f51f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -352,9 +352,6 @@ typedef struct xfs_mount {
 
 	/* Hook to feed media error events to a daemon. */
 	struct xfs_hooks	m_media_error_hooks;
-
-	/* Hook to feed file io error events to a daemon. */
-	struct xfs_hooks	m_file_ioerror_hooks;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index f3f28b9ae0f70e..a26f798155331f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,7 +22,6 @@
 #include "xfs_icache.h"
 #include "xfs_zone_alloc.h"
 #include "xfs_rtgroup.h"
-#include "xfs_file.h"
 
 struct xfs_writepage_ctx {
 	struct iomap_writepage_ctx ctx;
@@ -811,7 +810,6 @@ const struct address_space_operations xfs_address_space_operations = {
 	.is_partially_uptodate  = iomap_is_partially_uptodate,
 	.error_remove_folio	= generic_error_remove_folio,
 	.swap_activate		= xfs_vm_swap_activate,
-	.ioerror		= xfs_vm_ioerror,
 };
 
 const struct address_space_operations xfs_dax_aops = {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f5988904f5d44d..2702fef2c90cd2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -222,176 +222,6 @@ xfs_ilock_iocb_for_write(
 	return 0;
 }
 
-#ifdef CONFIG_XFS_LIVE_HOOKS
-struct xfs_file_ioerror {
-	struct work_struct		work;
-	struct xfs_mount		*mp;
-	xfs_ino_t			ino;
-	loff_t				pos;
-	u64				len;
-	u32				gen;
-	int				error;
-	enum xfs_file_ioerror_type	type;
-};
-
-/* Call downstream hooks for a file io error update. */
-STATIC void
-xfs_file_report_ioerror(
-	struct work_struct	*work)
-{
-	struct xfs_file_ioerror	*ioerr =
-		container_of(work, struct xfs_file_ioerror, work);
-	struct xfs_file_ioerror_params	p = {
-		.ino		= ioerr->ino,
-		.gen		= ioerr->gen,
-		.pos		= ioerr->pos,
-		.len		= ioerr->len,
-	};
-	struct xfs_mount	*mp = ioerr->mp;
-
-	xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p);
-	kfree(ioerr);
-}
-
-/* Queue a directio io error notification. */
-STATIC void
-xfs_dio_ioerror(
-	struct inode		*inode,
-	int			direction,
-	loff_t			pos,
-	u64			len,
-	int			error)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_file_ioerror	*ioerr;
-
-	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
-	if (!ioerr) {
-		xfs_err(mp,
- "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
-				ip->i_ino,
-				direction == WRITE ? "WRITE" : "READ",
-				pos, len, error);
-		return;
-	}
-
-	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
-	ioerr->mp = mp;
-	ioerr->ino = ip->i_ino;
-	ioerr->gen = VFS_I(ip)->i_generation;
-	ioerr->pos = pos;
-	ioerr->len = len;
-	if (direction == WRITE)
-		ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE;
-	else
-		ioerr->type = XFS_FILE_IOERROR_DIRECT_READ;
-	ioerr->error = error;
-	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
-}
-
-/* Deal with a media error */
-void
-xfs_inode_media_error(
-	struct xfs_inode	*ip,
-	loff_t			pos,
-	u64			len)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_file_ioerror	*ioerr;
-
-	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
-	if (!ioerr) {
-		xfs_err(mp,
- "lost data error report for ino 0x%llx pos 0x%llx len 0x%llx",
-				ip->i_ino,
-				pos, len);
-		return;
-	}
-
-	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
-	ioerr->mp = mp;
-	ioerr->ino = ip->i_ino;
-	ioerr->gen = VFS_I(ip)->i_generation;
-	ioerr->pos = pos;
-	ioerr->len = len;
-	ioerr->type = XFS_FILE_IOERROR_DATA_LOST;
-	ioerr->error = -EIO;
-	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
-}
-
-/* Queue a buffered io error notification. */
-void
-xfs_vm_ioerror(
-	struct address_space	*mapping,
-	int			direction,
-	loff_t			pos,
-	u64			len,
-	int			error)
-{
-	struct inode		*inode = mapping->host;
-	struct xfs_inode	*ip = XFS_I(inode);
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_file_ioerror	*ioerr;
-
-	ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
-	if (!ioerr) {
-		xfs_err(mp,
- "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
-				ip->i_ino,
-				direction == WRITE ? "WRITE" : "READ",
-				pos, len, error);
-		return;
-	}
-
-	INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
-	ioerr->mp = mp;
-	ioerr->ino = ip->i_ino;
-	ioerr->gen = VFS_I(ip)->i_generation;
-	ioerr->pos = pos;
-	ioerr->len = len;
-	if (direction == WRITE)
-		ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE;
-	else
-		ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ;
-	ioerr->error = error;
-	queue_work(mp->m_unwritten_workqueue, &ioerr->work);
-}
-
-/* Call the specified function after a file io error. */
-int
-xfs_file_ioerror_hook_add(
-	struct xfs_mount		*mp,
-	struct xfs_file_ioerror_hook	*hook)
-{
-	return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
-}
-
-/* Stop calling the specified function after a file io error. */
-void
-xfs_file_ioerror_hook_del(
-	struct xfs_mount		*mp,
-	struct xfs_file_ioerror_hook	*hook)
-{
-	xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
-}
-
-/* Configure file io error update hook functions. */
-void
-xfs_file_ioerror_hook_setup(
-	struct xfs_file_ioerror_hook	*hook,
-	notifier_fn_t			mod_fn)
-{
-	xfs_hook_setup(&hook->ioerror_hook, mod_fn);
-}
-#else
-# define xfs_dio_ioerror		NULL
-#endif /* CONFIG_XFS_LIVE_HOOKS */
-
-static const struct iomap_dio_ops xfs_dio_read_ops = {
-	.ioerror	= xfs_dio_ioerror,
-};
-
 STATIC ssize_t
 xfs_file_dio_read(
 	struct kiocb		*iocb,
@@ -410,8 +240,7 @@ xfs_file_dio_read(
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
 		return ret;
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops,
-			0, NULL, 0);
+	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
@@ -796,7 +625,6 @@ xfs_dio_write_end_io(
 
 static const struct iomap_dio_ops xfs_dio_write_ops = {
 	.end_io		= xfs_dio_write_end_io,
-	.ioerror	= xfs_dio_ioerror,
 };
 
 static void
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bfd12ccaa707a8..4a8d439ff57408 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2388,7 +2388,6 @@ xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_hooks_init(&mp->m_media_error_hooks);
-	xfs_hooks_init(&mp->m_file_ioerror_hooks);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;



* [PATCH 5/6] iomap: remove I/O error hooks
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-11-05  0:55   ` [PATCH 4/6] xfs: remove file I/O error hooks Darrick J. Wong
@ 2025-11-05  0:55   ` Darrick J. Wong
  2025-11-05  0:55   ` [PATCH 6/6] xfs: report fs metadata errors via fsnotify Darrick J. Wong
  5 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:55 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

Remove the ->ioerror hooks from struct address_space_operations and
struct iomap_dio_ops because they no longer have any users.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/linux/fs.h                |    4 ----
 include/linux/iomap.h             |    2 --
 Documentation/filesystems/vfs.rst |    7 -------
 fs/iomap/buffered-io.c            |    4 ----
 fs/iomap/direct-io.c              |    4 ----
 5 files changed, 21 deletions(-)


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1cb3965db3275c..6e3a7cbefbca8a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -478,10 +478,6 @@ struct address_space_operations {
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
 	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
-
-	/* Callback for dealing with IO errors during readahead or writeback */
-	void (*ioerror)(struct address_space *mapping, int direction,
-			loff_t pos, u64 len, int error);
 };
 
 extern const struct address_space_operations empty_aops;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index ca1590e5002342..73dceabc21c8c7 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,8 +486,6 @@ struct iomap_dio_ops {
 		      unsigned flags);
 	void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
 		          loff_t file_offset);
-	void (*ioerror)(struct inode *inode, int direction, loff_t pos,
-			u64 len, int error);
 
 	/*
 	 * Filesystems wishing to attach private information to a direct io bio
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 9e70006bf99a63..4f13b01e42eb5e 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -822,8 +822,6 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
 		int (*swap_deactivate)(struct file *);
 		int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
-		void (*ioerror)(struct address_space *mapping, int direction,
-				loff_t pos, u64 len, int error);
 	};
 
 ``read_folio``
@@ -1034,11 +1032,6 @@ cache in your filesystem.  The following members are defined:
 ``swap_rw``
 	Called to read or write swap pages when SWP_FS_OPS is set.
 
-``ioerror``
-        Called to deal with IO errors during readahead or writeback.
-        This may be called from interrupt context, and without any
-        locks necessarily being held.
-
 The File Object
 ===============
 
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index dc19311fe1c6c0..32628550093f65 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -296,10 +296,6 @@ inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
 	inode_error(inode,
 		    direction == READ ? FSERR_READAHEAD : FSERR_WRITEBACK,
 		    pos, len, error);
-
-	if (mapping && mapping->a_ops->ioerror)
-		mapping->a_ops->ioerror(mapping, direction, pos, len,
-				error);
 }
 
 /**
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9f6ce0d9c531bb..1f140031416a0c 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -100,10 +100,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 			    (dio->flags & IOMAP_DIO_WRITE) ? FSERR_DIO_WRITE :
 							     FSERR_DIO_READ,
 			    offset, dio->size, dio->error);
-	if (dio->error && dops && dops->ioerror)
-		dops->ioerror(file_inode(iocb->ki_filp),
-				(dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
-				offset, dio->size, dio->error);
 
 	if (likely(!ret)) {
 		ret = dio->size;



* [PATCH 6/6] xfs: report fs metadata errors via fsnotify
  2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-11-05  0:55   ` [PATCH 5/6] iomap: remove " Darrick J. Wong
@ 2025-11-05  0:55   ` Darrick J. Wong
  5 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05  0:55 UTC (permalink / raw)
  To: djwong, cem; +Cc: linux-fsdevel, linux-xfs, hch, amir73il, jack, gabriel

From: Darrick J. Wong <djwong@kernel.org>

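Report filesystem metadata corruption to fsnotify.  Whenever a health
update other than XFS_HEALTHUP_HEALTHY carries a nonzero mask, call
sb_error() for filesystem-wide and per-group metadata, and inode_error()
with FSERR_METADATA for inode metadata.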

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_health.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
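
For readers who want to see what the receiving end might look like, here
is a minimal userspace sketch.  It assumes that the errors reported
through sb_error() and inode_error() reach listeners as the existing
fanotify FAN_FS_ERROR event; the cover letter only says the reports are
passed to fsnotify, so treat that as an assumption.  The helper names
find_error_info() and watch_fs_errors() are made up for illustration;
the fanotify calls and structures are the ones documented in fanotify(7).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/fanotify.h>

/*
 * Walk the info records that follow one fanotify event and extract the
 * FAN_EVENT_INFO_TYPE_ERROR record.  Returns 1 and fills *error and
 * *count if one is found, 0 otherwise.  (Hypothetical helper.)
 */
int find_error_info(const struct fanotify_event_metadata *ev,
		    int *error, unsigned int *count)
{
	const char *p, *end;

	assert(ev && error && count);
	p = (const char *)ev + ev->metadata_len;
	end = (const char *)ev + ev->event_len;

	while (p + sizeof(struct fanotify_event_info_header) <= end) {
		struct fanotify_event_info_header hdr;

		memcpy(&hdr, p, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || p + hdr.len > end)
			break;	/* malformed record, stop walking */
		if (hdr.info_type == FAN_EVENT_INFO_TYPE_ERROR &&
		    hdr.len >= sizeof(struct fanotify_event_info_error)) {
			struct fanotify_event_info_error err;

			memcpy(&err, p, sizeof(err));
			*error = err.error;
			*count = err.error_count;
			return 1;
		}
		p += hdr.len;
	}
	return 0;
}

/*
 * Print every FAN_FS_ERROR event for the filesystem containing @path.
 * Returns -1 if the watch cannot be set up.  (Hypothetical helper.)
 */
int watch_fs_errors(const char *path)
{
	struct fanotify_event_metadata buf[256];
	struct fanotify_event_metadata *ev;
	ssize_t len;
	int fd;

	/* FAN_FS_ERROR needs FAN_REPORT_FID and a filesystem-wide mark. */
	fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_FS_ERROR, AT_FDCWD, path) < 0) {
		close(fd);
		return -1;
	}

	while ((len = read(fd, buf, sizeof(buf))) > 0) {
		for (ev = buf; FAN_EVENT_OK(ev, len);
		     ev = FAN_EVENT_NEXT(ev, len)) {
			unsigned int count;
			int error;

			if (!(ev->mask & FAN_FS_ERROR))
				continue;
			/* The value is printed as received; no sign
			 * convention is assumed here. */
			if (find_error_info(ev, &error, &count))
				printf("fs error %d (seen %u times)\n",
				       error, count);
		}
	}
	close(fd);
	return 0;
}
```

A caller would run something like watch_fs_errors("/mnt") from a process
with CAP_SYS_ADMIN on a kernel and libc new enough to define
FAN_FS_ERROR (Linux 5.16 or later).  The record parsing is kept in its
own function so that it can be exercised against a hand-built event
buffer without a mounted filesystem.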


diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index da827060853a8f..0da4ae216dc169 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -71,6 +71,9 @@ xfs_fs_health_update_hook(
 	unsigned int			old_mask,
 	unsigned int			new_mask)
 {
+	if (op != XFS_HEALTHUP_HEALTHY && new_mask)
+		sb_error(mp->m_super, -EFSCORRUPTED);
+
 	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
 		struct xfs_health_update_params	p = {
 			.domain		= XFS_HEALTHUP_FS,
@@ -91,13 +94,17 @@ xfs_group_health_update_hook(
 	unsigned int			old_mask,
 	unsigned int			new_mask)
 {
+	struct xfs_mount		*mp = xg->xg_mount;
+
+	if (op != XFS_HEALTHUP_HEALTHY && new_mask)
+		sb_error(mp->m_super, -EFSCORRUPTED);
+
 	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
 		struct xfs_health_update_params	p = {
 			.old_mask	= old_mask,
 			.new_mask	= new_mask,
 			.group		= xg->xg_gno,
 		};
-		struct xfs_mount	*mp = xg->xg_mount;
 
 		switch (xg->xg_type) {
 		case XG_TYPE_AG:
@@ -124,6 +131,9 @@ xfs_inode_health_update_hook(
 	unsigned int			old_mask,
 	unsigned int			new_mask)
 {
+	if (op != XFS_HEALTHUP_HEALTHY && new_mask)
+		inode_error(VFS_I(ip), FSERR_METADATA, 0, 0, -EFSCORRUPTED);
+
 	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
 		struct xfs_health_update_params	p = {
 			.domain		= XFS_HEALTHUP_INODE,




Thread overview: 38+ messages
2025-11-05  0:46 [PATCHBOMB v2 6.19] xfs: autonomous self healing Darrick J. Wong
2025-11-05  0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
2025-11-05  0:48   ` [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
2025-11-05  0:48   ` [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
2025-11-05  0:49   ` [PATCH 03/22] xfs: create debugfs uuid aliases Darrick J. Wong
2025-11-05  0:49   ` [PATCH 04/22] xfs: create hooks for monitoring health updates Darrick J. Wong
2025-11-05  0:49   ` [PATCH 05/22] xfs: create a filesystem shutdown hook Darrick J. Wong
2025-11-05  0:49   ` [PATCH 06/22] xfs: create hooks for media errors Darrick J. Wong
2025-11-05  0:50   ` [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
2025-11-05  0:50   ` [PATCH 08/22] iomap: report directio read and write errors to callers Darrick J. Wong
2025-11-05  0:50   ` [PATCH 09/22] xfs: create file io error hooks Darrick J. Wong
2025-11-05  0:51   ` [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
2025-11-05  0:51   ` [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
2025-11-05  0:51   ` [PATCH 12/22] xfs: report metadata health events through healthmon Darrick J. Wong
2025-11-05  0:51   ` [PATCH 13/22] xfs: report shutdown " Darrick J. Wong
2025-11-05  0:52   ` [PATCH 14/22] xfs: report media errors " Darrick J. Wong
2025-11-05  0:52   ` [PATCH 15/22] xfs: report file io " Darrick J. Wong
2025-11-05  0:52   ` [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2025-11-05  0:52   ` [PATCH 17/22] xfs: validate fds against running healthmon Darrick J. Wong
2025-11-05  0:53   ` [PATCH 18/22] xfs: add media error reporting ioctl Darrick J. Wong
2025-11-05  0:53   ` [PATCH 19/22] xfs: send uevents when major filesystem events happen Darrick J. Wong
2025-11-05  0:53   ` [PATCH 20/22] xfs: merge health monitoring events when possible Darrick J. Wong
2025-11-05  0:53   ` [PATCH 21/22] xfs: restrict healthmon users further Darrick J. Wong
2025-11-05  0:54   ` [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process Darrick J. Wong
2025-11-05  0:48 ` [PATCHSET V3 2/2] iomap: generic file IO error reporting Darrick J. Wong
2025-11-05  0:54   ` [PATCH 1/6] iomap: report file IO errors to fsnotify Darrick J. Wong
2025-11-05 11:00     ` Jan Kara
2025-11-05 11:14       ` Amir Goldstein
2025-11-05 14:24         ` Jan Kara
2025-11-05 18:28           ` Darrick J. Wong
2025-11-05 19:41             ` Darrick J. Wong
2025-11-06 10:13               ` Jan Kara
2025-11-06 17:06                 ` Darrick J. Wong
2025-11-05  0:54   ` [PATCH 2/6] xfs: switch healthmon to use the iomap I/O error reporting Darrick J. Wong
2025-11-05  0:54   ` [PATCH 3/6] xfs: port notify-failure to use the new vfs io " Darrick J. Wong
2025-11-05  0:55   ` [PATCH 4/6] xfs: remove file I/O error hooks Darrick J. Wong
2025-11-05  0:55   ` [PATCH 5/6] iomap: remove " Darrick J. Wong
2025-11-05  0:55   ` [PATCH 6/6] xfs: report fs metadata errors via fsnotify Darrick J. Wong
