* [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
@ 2025-11-05 0:48 ` Darrick J. Wong
2025-11-05 0:48 ` [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc Darrick J. Wong
` (20 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:48 UTC
To: djwong, cem; +Cc: hch, hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Online repair is now merged upstream, so there is no need to point to
patchset links anymore.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
.../filesystems/xfs/xfs-online-fsck-design.rst | 236 +-------------------
1 file changed, 6 insertions(+), 230 deletions(-)
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 8cbcd3c2643430..189d1f5f40788d 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -105,10 +105,8 @@ occur; this capability aids both strategies.
TLDR; Show Me the Code!
-----------------------
-Code is posted to the kernel.org git trees as follows:
-`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
-`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
-`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Kernel and userspace code has been fully merged as of October 2025.
+
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.
@@ -764,12 +762,8 @@ allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.
Proposed patchsets include
-`general fuzzer improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
-and `improvements in fuzz testing comprehensiveness
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_.
Stress Testing
--------------
@@ -801,11 +795,6 @@ Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.
-Proposed patchsets include `general stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
-and the `evolution of existing per-function stress testing
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
-
4. User Interface
=================
@@ -886,10 +875,6 @@ apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.
-Proposed patchset:
-`Enabling the xfs_scrub background service
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
-
Health Reporting
----------------
@@ -912,13 +897,6 @@ notifications and initiate a repair?
*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
-Proposed patchsets include
-`wiring up health reports to correction returns
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
-and
-`preservation of sickness info during memory reclaim
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
-
5. Kernel Algorithms and Data Structures
========================================
@@ -1310,21 +1288,6 @@ Space allocation records are cross-referenced as follows:
are there the same number of reverse mapping records for each block as the
reference count record claims?
-Proposed patchsets are the series to find gaps in
-`refcount btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
-`inode btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
-`rmap btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
-to find
-`mergeable records
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
-and to
-`improve cross referencing with rmap
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
-before starting a repair.
-
Checking Extended Attributes
````````````````````````````
@@ -1756,10 +1719,6 @@ For scrub, the drain works as follows:
To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.
-The proposed patchset is the
-`scrub intent drain series
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
-
.. _jump_labels:
Static Keys (aka Jump Label Patching)
@@ -2036,10 +1995,6 @@ The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.
-The proposed patchset is the
-`big in-memory array
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
-
Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2172,10 +2127,6 @@ However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.
-The proposed patchset is at the start of the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
-
.. _xfbtree:
In-Memory B+Trees
@@ -2214,11 +2165,6 @@ xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.
-The proposed patchset is the
-`in-memory btree
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
-series.
-
Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2459,14 +2405,6 @@ This enables the log to release the old EFI to keep the log moving forwards.
EFIs have a role to play during the commit and reaping phases; please see the
next section and the section about :ref:`reaping<reaping>` for more details.
-Proposed patchsets are the
-`bitmap rework
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
-and the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
-
-
Writing the New Tree
````````````````````
@@ -2623,11 +2561,6 @@ The number of records for the inode btree is the number of xfarray records,
but the record count for the free inode btree has to be computed as inode chunk
records are stored in the xfarray.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2716,11 +2649,6 @@ Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -2757,11 +2685,6 @@ EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.
-The proposed patchset is the
-`file mapping repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
-series.
-
.. _reaping:
Reaping Old Metadata Blocks
@@ -2843,11 +2766,6 @@ blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.
-The proposed patchset is the
-`preparation for bulk loading btrees
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
-series.
-
Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````
@@ -2943,11 +2861,6 @@ When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
btrees.
These blocks can then be reaped using the methods outlined above.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
.. _rmap_reap:
Case Study: Reaping After Repairing Reverse Mapping Btrees
@@ -2972,11 +2885,6 @@ methods outlined above.
The rest of the process of rebuildng the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.
-The proposed patchset is the
-`AG btree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
-series.
-
Case Study: Rebuilding the AGFL
```````````````````````````````
@@ -3024,11 +2932,6 @@ more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.
-The proposed patchset is the
-`inode
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
-repair series.
-
Quota Record Repairs
--------------------
@@ -3045,11 +2948,6 @@ checking are obviously bad limits and timer values.
Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.
-The proposed patchset is the
-`quota
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
-repair series.
-
.. _fscounters:
Freezing to Fix Summary Counters
@@ -3145,11 +3043,6 @@ long enough to check and correct the summary counters.
| This bug was fixed in Linux 5.17. |
+--------------------------------------------------------------------------+
-The proposed patchset is the
-`summary counter cleanup
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
-series.
-
Full Filesystem Scans
---------------------
@@ -3277,15 +3170,6 @@ Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the inode
back into a loadable state.
-The proposed patches are the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-The first user of the new functionality is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
Inode Management
````````````````
@@ -3381,12 +3265,6 @@ To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.
-Proposed patchsets include fixing
-`scrub iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
-`dir iget usage
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
-
.. _ilocking:
Locking Inodes
@@ -3443,11 +3321,6 @@ If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
.. _fshooks:
Filesystem Hooks
@@ -3594,11 +3467,6 @@ The inode scan APIs are pretty simple:
- ``xchk_iscan_teardown`` to finish the scan
-This functionality is also a part of the
-`inode scanner
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
-series.
-
.. _quotacheck:
Case Study: Quota Counter Checking
@@ -3686,11 +3554,6 @@ needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.
-The proposed patchset is the
-`online quotacheck
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
-series.
-
.. _nlinks:
Case Study: File Link Count Checking
@@ -3744,11 +3607,6 @@ shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.
-The proposed patchset is the
-`file link count repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
-series.
-
.. _rmap_repair:
Case Study: Rebuilding Reverse Mapping Records
@@ -3828,11 +3686,6 @@ scan for reverse mapping records.
12. Free the xfbtree now that it not needed.
-The proposed patchset is the
-`rmap repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
-series.
-
Staging Repairs with Temporary Files on Disk
--------------------------------------------
@@ -3971,11 +3824,6 @@ Once a good copy of a data file has been constructed in a temporary file, it
must be conveyed to the file being repaired, which is the topic of the next
section.
-The proposed patches are in the
-`repair temporary files
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
-series.
-
Logged File Content Exchanges
-----------------------------
@@ -4025,11 +3873,6 @@ The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.
-The proposed patchset is the
-`file contents exchange
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
-series.
-
+--------------------------------------------------------------------------+
| **Sidebar: Using Log-Incompatible Feature Flags** |
+--------------------------------------------------------------------------+
@@ -4323,11 +4166,6 @@ To repair the summary file, write the xfile contents into the temporary file
and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.
-The proposed patchset is the
-`realtime summary repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
-series.
-
Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4369,11 +4207,6 @@ Salvaging extended attributes is done as follows:
4. Reap the temporary file.
-The proposed patchset is the
-`extended attribute repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
-series.
-
Fixing Directories
------------------
@@ -4448,11 +4281,6 @@ Unfortunately, the current dentry cache design doesn't provide a means to walk
every child dentry of a specific directory, which makes this a hard problem.
There is no known solution.
-The proposed patchset is the
-`directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
-series.
-
Parent Pointers
```````````````
@@ -4612,11 +4440,6 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
7. Reap the temporary directory.
-The proposed patchset is the
-`parent pointers directory repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4662,11 +4485,6 @@ directory reconstruction:
8. Reap the temporary file.
-The proposed patchset is the
-`parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
-series.
-
Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -4755,11 +4573,6 @@ connectivity checks:
4. Move on to examining link counts, as we do today.
-The proposed patchset is the
-`offline parent pointers repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
-series.
-
Rebuilding directories from parent pointers in offline repair would be very
challenging because xfs_repair currently uses two single-pass scans of the
filesystem during phases 3 and 4 to decide which files are corrupt enough to be
@@ -4903,12 +4716,6 @@ Repairing the directory tree works as follows:
6. If the subdirectory has zero paths, attach it to the lost and found.
-The proposed patches are in the
-`directory tree repair
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
-series.
-
-
.. _orphanage:
The Orphanage
@@ -4973,11 +4780,6 @@ Orphaned files are adopted by the orphanage as follows:
7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
resources.
-The proposed patches are in the
-`orphanage adoption
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
-series.
-
6. Userspace Algorithms and Data Structures
===========================================
@@ -5091,14 +4893,6 @@ first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.
-The proposed patchsets are the scrub
-`performance tweaks
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
-and the
-`inode scan rebalance
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
-series.
-
.. _scrubrepair:
Scheduling Repairs
@@ -5179,20 +4973,6 @@ immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.
-The proposed patchsets are the
-`repair warning improvements
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
-refactoring of the
-`repair data dependency
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
-and
-`object tracking
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
-and the
-`repair scheduling
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
-improvement series.
-
Checking Names for Confusable Unicode Sequences
-----------------------------------------------
@@ -5372,6 +5152,8 @@ The extra flexibility enables several new use cases:
This emulates an atomic device write in software, and can support arbitrary
scattered writes.
+(This functionality was merged into mainline as of 2025.)
+
Vectorized Scrub
----------------
@@ -5393,13 +5175,7 @@ It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectored scrub system
call to XFS.
-The relevant patchsets are the
-`kernel vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
-and
-`userspace vectorized scrub
-<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
-series.
+(This functionality was merged into mainline as of 2025.)
Quality of Service Targets for Scrub
------------------------------------
* [PATCH 02/22] docs: discuss autonomous self healing in the xfs online repair design doc
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
2025-11-05 0:48 ` [PATCH 01/22] docs: remove obsolete links in the xfs online repair documentation Darrick J. Wong
@ 2025-11-05 0:48 ` Darrick J. Wong
2025-11-05 0:49 ` [PATCH 03/22] xfs: create debugfs uuid aliases Darrick J. Wong
` (19 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:48 UTC
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Update the XFS online repair document to describe the motivation and
design of the autonomous filesystem healing agent known as xfs_healer.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
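Note for readers of the Health Monitoring section added below: the queueing
behaviour it describes (a per-fd cap, a "lost event" marker when something is
dropped, and reads that drain from the front) can be modelled in a few lines.
This is only an illustrative userspace sketch, not the kernel code. The names
HealthEventQueue, push, read, and dispatch, the event tuples, and the choice to
coalesce drops into one marker at the tail are all assumptions made for the
example.

```python
from collections import deque


class HealthEventQueue:
    """Toy model of a bounded per-fd health event queue (hypothetical)."""

    def __init__(self, max_events):
        # One slot is held back so that a "lost" marker always fits.
        assert max_events >= 2
        self.max_events = max_events
        self.events = deque()

    def push(self, kind, detail):
        """Queue an event; return False if it had to be discarded."""
        if len(self.events) < self.max_events - 1:
            self.events.append([kind, detail])
            return True
        # No room: fold the drop into a single marker at the tail.
        if self.events[-1][0] == "lost":
            self.events[-1][1] += 1
        else:
            self.events.append(["lost", 1])
        return False

    def read(self, max_out):
        """Drain up to max_out events from the start of the queue."""
        out = []
        while self.events and len(out) < max_out:
            out.append(tuple(self.events.popleft()))
        return out


def dispatch(event):
    """Pick a response: a lost event forces a full scan (assumed policy)."""
    kind, detail = event
    if kind == "lost":
        return "full scan"
    return "targeted repair: " + detail
```

With a cap of three, two events are queued and later ones are counted in the
marker, so an agent that drains the queue sees the two real events followed by
one "lost" record and can fall back to a full xfs_scrub run.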
.../filesystems/xfs/xfs-online-fsck-design.rst | 218 ++++++++++++++++++++
1 file changed, 216 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 189d1f5f40788d..79d5aa78f2a8bf 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -166,9 +166,12 @@ The current XFS tools leave several problems unsolved:
malicious actors **exploit quirks of Unicode** to place misleading names
in directories.
+8. **Site Reliability and Support Engineers** would like to reduce the
+ frequency of incidents requiring **manual intervention**.
+
Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
-filesystem.
+filesystem, and an autonomous agent that fixes problems as they arise.
This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
@@ -203,6 +206,13 @@ Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.
+The autonomous self healing agent should listen for metadata health impact
+reports coming from the kernel and automatically schedule repairs for the
+damaged metadata.
+If the required repairs are larger in scope than a single metadata structure,
+``xfs_scrub`` should be invoked to perform a full analysis.
+``xfs_healer`` is the name of this program.
+
In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
@@ -850,11 +860,16 @@ variable in the following service files:
* ``xfs_scrub_all_fail.service``
The decision to enable the background scan is left to the system administrator.
-This can be done by enabling either of the following services:
+This can be done system-wide by enabling either of the following services:
* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems
+To enable online repair for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To enable only scanning, the property should be set to ``check``.
+To disable online fsck entirely, the property should be set to ``none``.
+
This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
@@ -897,6 +912,36 @@ notifications and initiate a repair?
*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
+Autonomous Self Healing
+-----------------------
+
+The autonomous self healing agent is a background system service that starts
+when the filesystem is mounted and runs until unmount.
+When starting up, the agent opens a special pseudofile under the specific
+mount.
+When the filesystem generates new adverse health events, the events will be
+made available for reading via the special pseudofile.
+The events need not be limited to metadata concerns; they can also reflect
+events outside of the filesystem's direct control such as file I/O errors.
+
+The agent reads these events in a loop and responds to the events
+appropriately.
+For a single trouble report about metadata, the agent initiates a targeted
+repair of the specific structure.
+If that repair fails or the agent observes too many metadata trouble reports
+over a short interval, it should then initiate a full scan of the filesystem
+via the ``xfs_scrub`` service.
+
+The decision to enable the healing agent is left to the system administrator.
+This can be done system-wide by enabling the following service:
+
+* ``xfs_healer@.service`` on systemd systems
+
+To enable autonomous healing for specific filesystems, the ``autofsck``
+filesystem property should be set to ``repair``.
+To disable self healing, the property should be set to ``check``,
+``optimize``, or ``none``.
+
5. Kernel Algorithms and Data Structures
========================================
@@ -4780,6 +4825,136 @@ Orphaned files are adopted by the orphanage as follows:
7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
resources.
+Health Monitoring
+-----------------
+
+A self-correcting filesystem responds to observations of problems by scheduling
+repairs of the affected areas.
+The filesystem must therefore create event objects in response to stimuli
+(metadata corruption, file I/O errors, etc.) and dispatch these events to
+downstream consumers.
+Downstream consumers that are in the kernel itself are easy to implement with
+the ``xfs_hooks`` infrastructure created for other parts of online repair; these
+are basically indirect function calls.
+
+However, the decision to translate an adverse metadata health report into a
+repair should be made by userspace, and the actual scheduling done by userspace.
+Some users (e.g. containers) would prefer to fast-fail the container and restart
+it on another node at a previous checkpoint.
+For workloads running in isolation, repairs may be preferable; either way this
+is something the system administrator knows, and not the kernel.
+A userspace agent (``xfs_healer``, described later) will collect events from the
+kernel and dispatch them appropriately.
+
+Exporting health events to userspace requires the creation of a new component,
+known as the health monitor.
+Because the monitor exposes itself to userspace to deliver information, a file
+descriptor is the natural abstraction to use here.
+The health monitor hooks all the relevant sources of metadata health events.
+Upon activation of the hook, a new event object is created and added to a queue.
+When the agent reads from the fd, event objects are pulled from the start of the
+queue and formatted into the user's buffer.
+The events are freed, and the read call returns to userspace to allow the agent
+to perform some work.
+Memory usage is constrained on a per-fd basis to prevent memory exhaustion; if
+an event must be discarded, a special "lost event" event is delivered to the
+agent.
+
+In short, health events are captured, queued, and eventually copied out to
+userspace for dispatching.
+
+**Question**: Why use a pseudofile and not use existing notification methods?
+
+*Answer*: The pseudofile is a private filesystem interface only available to
+processes with the CAP_SYS_ADMIN privilege and the ability to open the root
+directory.
+Being private gives the kernel and ``xfs_healer`` the flexibility to change
+or update the event format in the future without worrying about backwards
+compatibility.
+Using existing notifications means that the event format would be frozen in
+the public fsnotify UAPI forever, which would affect two subsystems.
+
+The pseudofile can also accept ioctls, which gives ``xfs_healer`` a solid
+means to validate that prior to a repair, its reopened mountpoint is actually
+the same filesystem that is being monitored.
+
+**Question**: Why not reuse fs/notify?
+
+*Answer*: It's much simpler for the healthmon code to manage its own queue of
+events and to wake up readers instead of reusing fsnotify because that's the
+only part of fsnotify that healthmon would use.
+
+Before I get started, an introduction: fsnotify expects its users (e.g.
+fanotify) to implement quite a bit of functionality; all it provides is a
+wrapper around a simple queue and a lot of code to convey information about the
+calling process to that user.
+fanotify has to actually implement all the queue management code on its own,
+and so would healthmon.
+
+So if healthmon used fsnotify, it would have to create its own fsnotify group
+structure.
+For our purposes, the group is a very large wrapper around a linked list, some
+counters, and a mutex.
+The group object is critical for ensuring that healthmon sees only its own
+events, and that nobody else (e.g. regular fanotify) ever sees these events.
+There's a lot more in there for controlling whether fanotify reports pids,
+groups, file handles, etc. that healthmon doesn't care about.
+
+Starting from the fsnotify() function call:
+
+ - I /think/ we'd have to define a new "data type", which itself is just a plain
+ int but I think they correspond to FSNOTIFY_EVENT_* values which themselves
+ are actually part of an enum.
+ The data type controls the typecasting options for the ``void *data``
+ parameter, which I guess is how I'd pass the healthmon event info from the
+ hooks into the fsnotify mechanism and back out to the healthmon code.
+
+ - Each filesystem that wants to do this probably has to add their own
+ FSNOTIFY_EVENT_{XFS,BTRFS,BFS} data type value because that's a casting
+ decision that's made inside the main fsnotify code.
+ I think this can be avoided if each fs is careful never to leak events
+ outside of the group.
+ Either way, it's harder to follow the data flows here because fsnotify can
+ only take and pass around ``void *`` pointers, and it makes various indirect
+ function calls to manage events.
+ Contrast this with doing everything with typed pointers and direct calls
+ within ``xfs_healthmon.c``.
+
+ - Since healthmon is both producer and consumer of fsnotify events, we can
+ probably define our own "mask" value.
+ It's a relief that we don't have to interact with fanotify, because fanotify
+ has used up 22 of its 32 mask bits.
+
+Once healthmon gets an event into fsnotify, fsnotify will call back (into
+healthmon!) to tell it that it got an event.
+From there, the fsnotify implementation (healthmon) has to allocate an event
+object and add it to the event queue in the group, which is what it already does
+now.
+Overflow control is up to the fsnotify implementation, which healthmon already
+implements.
+
+After the event is queued, the fsnotify implementation also has to implement its
+own read file op to dequeue an event and copy it to the userspace buffer in
+whatever format it likes.
+Again, healthmon already does all this.
+
+In the end, replacing the homegrown event dispatching in healthmon with fsnotify
+would make the data flows much harder to understand, and all we gain is a
+generic event dispatcher that relies on indirect function calls instead of
+direct ones.
+We still have to implement the queuing discipline ourselves! :(
+
+**Future Work Question**: Should these events be exposed through the fanotify
+filesystem error event interface?
+
+*Answer*: Yes.
+fanotify is much more careful about filtering out events to processes that
+aren't running with privileges.
+These processes should have a means to receive simple notifications about
+file errors.
+However, this will require coordination between fanotify, ext4, and XFS, and
+is (for now) outside the scope of this project.
+
6. Userspace Algorithms and Data Structures
===========================================
@@ -5071,6 +5246,45 @@ and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.
+Autonomous Self Healing
+-----------------------
+
+When a filesystem mounts, the Linux kernel initiates a uevent describing the
+mount and the path to the data device.
+A udev rule determines the initial mountpoint from the data device path
+and starts a mount-specific ``xfs_healer`` service instance.
+The ``xfs_healer`` service opens the mountpoint and issues the
+XFS_IOC_HEALTH_MONITOR ioctl to open a special health monitoring file.
+After that is set up, the mountpoint is closed to avoid pinning the mount.
+
+The health monitoring file hooks certain points of the filesystem so that it
+may receive events about metadata health, filesystem shutdowns, media errors,
+file I/O errors, and unmounting of the filesystem.
+Events are queued up for each health monitor file and encoded into a
+``struct xfs_health_monitor_event`` object when the agent calls ``read()`` on
+the file.
+All health events are dispatched to a background threadpool to reduce stalls
+in the main event loop.
+Events can be logged into the system log for further analysis.
+
+For metadata health events, the specific details are used to construct a call
+to the scrub ioctl.
+The filesystem mountpoint is reopened, and the scrub ioctl is issued on it.
+If events are lost or the repairs fail, a full scan will be initiated by
+starting up an ``xfs_scrub@.service`` for the given mountpoint.
+
+A filesystem shutdown causes all future repair work to cease, and an unmount
+causes the agent to exit.
+
+**Future Work Question**: Should the healer daemon also register a dbus
+listener and publish events there?
+
+*Answer*: This is unclear -- if there's a demand for system monitoring daemons
+to consume this information and make decisions, then yes, this could be wired
+up in ``xfs_healer``.
+On the other hand, systemd is in the middle of a transition to varlink, so
+it makes more sense to wait and see what happens.
+
7. Conclusion and Future Work
=============================
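The event handling policy described in the "Autonomous Self Healing" text above could be modeled roughly as follows. This is only a sketch under stated assumptions: the event kinds, `struct mock_event`, `healer_decide()` and `healer_after_repair()` are hypothetical names, the real `struct xfs_health_monitor_event` layout is not shown in this message, and the real daemon hands work to a thread pool instead of deciding inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds; the real UAPI values are not shown here. */
enum mock_event_kind {
	MOCK_EV_METADATA_SICK,	/* metadata health event */
	MOCK_EV_LOST,		/* some events were lost */
	MOCK_EV_SHUTDOWN,	/* filesystem shut down */
	MOCK_EV_UNMOUNT,	/* filesystem is unmounting */
	MOCK_EV_OTHER,		/* media error, file I/O error, ... */
};

enum healer_action {
	HEALER_LOG_ONLY,	/* just log the event */
	HEALER_TARGETED_REPAIR,	/* call the scrub ioctl for one structure */
	HEALER_FULL_SCAN,	/* start xfs_scrub@.service for the mount */
	HEALER_EXIT,		/* stop the agent */
};

struct mock_event {
	enum mock_event_kind	kind;
};

struct healer_state {
	bool	shut_down;	/* set once a shutdown event was seen */
};

/* Decide what to do with one event, following the prose description. */
enum healer_action
healer_decide(struct healer_state *st, const struct mock_event *ev)
{
	assert(st != NULL && ev != NULL);

	switch (ev->kind) {
	case MOCK_EV_UNMOUNT:
		return HEALER_EXIT;
	case MOCK_EV_SHUTDOWN:
		/* A shutdown stops all future repair work. */
		st->shut_down = true;
		return HEALER_LOG_ONLY;
	case MOCK_EV_LOST:
		return st->shut_down ? HEALER_LOG_ONLY : HEALER_FULL_SCAN;
	case MOCK_EV_METADATA_SICK:
		return st->shut_down ? HEALER_LOG_ONLY : HEALER_TARGETED_REPAIR;
	default:
		return HEALER_LOG_ONLY;
	}
}

/* A failed targeted repair escalates to a full scan unless shut down. */
enum healer_action
healer_after_repair(const struct healer_state *st, int repair_error)
{
	assert(st != NULL);

	if (repair_error == 0 || st->shut_down)
		return HEALER_LOG_ONLY;
	return HEALER_FULL_SCAN;
}
```

Keeping the policy in a pure function like this would make it easy to unit test separately from the ioctl and systemd plumbing; whether the actual `xfs_healer` is structured that way is not stated here.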
* [PATCH 03/22] xfs: create debugfs uuid aliases
From: Darrick J. Wong @ 2025-11-05 0:49 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a symlink alias for the debugfs directory so that a filesystem can
be found by uuid, unless it is mounted with nouuid.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
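A note on the buffer sizing in the hunk below: a canonical UUID string is 36 characters (32 hex digits plus four dashes), so `UUID_STRING_LEN + 1` leaves room for the terminating NUL. The userspace sketch below mimics what the kernel's `%pU` format produces by default (lowercase hex, bytes in storage order). `mock_uuid_format()` and `MOCK_UUID_STRING_LEN` are hypothetical names used for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same value as the kernel's UUID_STRING_LEN. */
#define MOCK_UUID_STRING_LEN	36

/*
 * Format 16 raw UUID bytes as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
 * Returns the number of characters written, excluding the NUL.
 */
int
mock_uuid_format(char *out, size_t outlen, const unsigned char uuid[16])
{
	assert(out != NULL && outlen >= MOCK_UUID_STRING_LEN + 1);

	return snprintf(out, outlen,
		"%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
		"%02x%02x%02x%02x%02x%02x",
		uuid[0], uuid[1], uuid[2], uuid[3],
		uuid[4], uuid[5], uuid[6], uuid[7],
		uuid[8], uuid[9], uuid[10], uuid[11],
		uuid[12], uuid[13], uuid[14], uuid[15]);
}
```

In the patch, the resulting string becomes the symlink name under the XFS debugfs root, and the link target is the superblock's `s_id`.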
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b871dfde372b52..94108668ddabbd 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -289,6 +289,7 @@ typedef struct xfs_mount {
struct delayed_work m_reclaim_work; /* background inode reclaim */
struct xfs_zone_info *m_zone_info; /* zone allocator information */
struct dentry *m_debugfs; /* debugfs parent */
+ struct dentry *m_debugfs_uuid; /* debugfs symlink */
struct xfs_kobj m_kobj;
struct xfs_kobj m_error_kobj;
struct xfs_kobj m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1067ebb3b001bf..ba07e4a4ae3ffa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -819,6 +819,7 @@ xfs_mount_free(
if (mp->m_ddev_targp)
xfs_free_buftarg(mp->m_ddev_targp);
+ debugfs_remove(mp->m_debugfs_uuid);
debugfs_remove(mp->m_debugfs);
kfree(mp->m_rtname);
kfree(mp->m_logname);
@@ -1969,6 +1970,16 @@ xfs_fs_fill_super(
goto out_unmount;
}
+ if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+ char name[UUID_STRING_LEN + 1];
+
+ snprintf(name, UUID_STRING_LEN + 1, "%pU", &mp->m_sb.sb_uuid);
+ mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+ mp->m_super->s_id);
+ } else {
+ mp->m_debugfs_uuid = NULL;
+ }
+
return 0;
out_filestream_unmount:
* [PATCH 04/22] xfs: create hooks for monitoring health updates
From: Darrick J. Wong @ 2025-11-05 0:49 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create hooks for monitoring health events.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
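The dispatch pattern used by the fs, group, and inode update hooks in this patch can be illustrated outside the kernel. The sketch below is a userspace model under stated assumptions: a plain boolean stands in for the static key, a fixed array stands in for the `xfs_hooks` notifier chain, and every `mock_` name is hypothetical. It shows the two gates those hook functions apply: nothing is called unless the switch is on, and nothing is called when `new_mask` is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_HOOKS	4

struct mock_update_params {
	unsigned int	old_mask;
	unsigned int	new_mask;
};

/* Same shape as a notifier callback: an action code plus opaque data. */
typedef int (*mock_hook_fn)(unsigned long action, void *data);

struct mock_hooks {
	mock_hook_fn	fns[MOCK_MAX_HOOKS];
	size_t		nr;
	bool		switched_on;	/* stands in for the static key */
};

/* Bookkeeping for the example callback below. */
static unsigned int mock_nr_calls;
static unsigned int mock_last_new_mask;

/* Example callback: records the most recent update it was given. */
int
mock_record_hook(unsigned long action, void *data)
{
	struct mock_update_params *p = data;

	(void)action;
	mock_nr_calls++;
	mock_last_new_mask = p->new_mask;
	return 0;
}

/* Register a callback; returns 0 on success or -1 if the table is full. */
int
mock_hook_add(struct mock_hooks *h, mock_hook_fn fn)
{
	assert(h != NULL && fn != NULL);

	if (h->nr == MOCK_MAX_HOOKS)
		return -1;
	h->fns[h->nr++] = fn;
	return 0;
}

/* Deliver one health update; returns the number of callbacks invoked. */
size_t
mock_health_update(struct mock_hooks *h, unsigned long op,
		   unsigned int old_mask, unsigned int new_mask)
{
	struct mock_update_params p = {
		.old_mask = old_mask,
		.new_mask = new_mask,
	};
	size_t i;

	/* Gate 1: skip all work when nobody is listening. */
	if (!h->switched_on)
		return 0;
	/* Gate 2: an empty mask carries no information. */
	if (!new_mask)
		return 0;

	for (i = 0; i < h->nr; i++)
		h->fns[i](op, &p);
	return h->nr;
}
```

Note that the unmount hook in the patch applies only the first gate, since an unmount event has no mask to report.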
fs/xfs/libxfs/xfs_health.h | 47 ++++++++++
fs/xfs/xfs_mount.h | 3 +
fs/xfs/xfs_health.c | 204 ++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_super.c | 1
4 files changed, 254 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b31000f7190ce5..39fef33dedc6a8 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -289,4 +289,51 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
#define xfs_metadata_is_sick(error) \
(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
+/*
+ * Parameters for tracking health updates. The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+ XFS_HEALTHUP_SICK = 1, /* runtime corruption observed */
+ XFS_HEALTHUP_CORRUPT, /* fsck reported corruption */
+ XFS_HEALTHUP_HEALTHY, /* fsck reported healthy structure */
+ XFS_HEALTHUP_UNMOUNT, /* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+ XFS_HEALTHUP_FS = 1, /* main filesystem */
+ XFS_HEALTHUP_AG, /* allocation group */
+ XFS_HEALTHUP_INODE, /* inode */
+ XFS_HEALTHUP_RTGROUP, /* realtime group */
+};
+
+struct xfs_health_update_params {
+ /* XFS_HEALTHUP_INODE */
+ xfs_ino_t ino;
+ uint32_t gen;
+
+ /* XFS_HEALTHUP_AG/RTGROUP */
+ uint32_t group;
+
+ /* XFS_SICK_* flags */
+ unsigned int old_mask;
+ unsigned int new_mask;
+
+ enum xfs_health_update_domain domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+ struct xfs_hook health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
#endif /* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 94108668ddabbd..3f20baaf9cc226 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -343,6 +343,9 @@ typedef struct xfs_mount {
/* Hook to feed dirent updates to an active online repair. */
struct xfs_hooks m_dir_update_hooks;
+
+ /* Hook to feed health events to a daemon. */
+ struct xfs_hooks m_health_update_hooks;
} xfs_mount_t;
#define M_IGEO(mp) (&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 7c541fb373d5b2..71952d5eec2a9e 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -20,6 +20,159 @@
#include "xfs_quota_defs.h"
#include "xfs_rtgroup.h"
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of health updates. If the
+ * compiler supports jump labels, the static branch will be replaced by a nop
+ * sled when there are no hook users. Health monitoring is currently the only
+ * caller, so this is a reasonable tradeoff, because health event status
+ * updates can be very frequent when xfs_scrub is running, and we don't know
+ * if xfs_healthmon has attached to this filesystem.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock. Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch);
+
+void
+xfs_health_hook_disable(void)
+{
+ xfs_hooks_switch_off(&xfs_health_hooks_switch);
+}
+
+void
+xfs_health_hook_enable(void)
+{
+ xfs_hooks_switch_on(&xfs_health_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem unmount health update. */
+static inline void
+xfs_health_unmount_hook(
+ struct xfs_mount *mp)
+{
+ if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+ struct xfs_health_update_params p = {
+ .domain = XFS_HEALTHUP_FS,
+ };
+
+ xfs_hooks_call(&mp->m_health_update_hooks,
+ XFS_HEALTHUP_UNMOUNT, &p);
+ }
+}
+
+/* Call downstream hooks for a filesystem health update. */
+static inline void
+xfs_fs_health_update_hook(
+ struct xfs_mount *mp,
+ enum xfs_health_update_type op,
+ unsigned int old_mask,
+ unsigned int new_mask)
+{
+ if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+ struct xfs_health_update_params p = {
+ .domain = XFS_HEALTHUP_FS,
+ .old_mask = old_mask,
+ .new_mask = new_mask,
+ };
+
+ if (new_mask)
+ xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+ }
+}
+
+/* Call downstream hooks for a group health update. */
+static inline void
+xfs_group_health_update_hook(
+ struct xfs_group *xg,
+ enum xfs_health_update_type op,
+ unsigned int old_mask,
+ unsigned int new_mask)
+{
+ if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+ struct xfs_health_update_params p = {
+ .old_mask = old_mask,
+ .new_mask = new_mask,
+ .group = xg->xg_gno,
+ };
+ struct xfs_mount *mp = xg->xg_mount;
+
+ switch (xg->xg_type) {
+ case XG_TYPE_AG:
+ p.domain = XFS_HEALTHUP_AG;
+ break;
+ case XG_TYPE_RTG:
+ p.domain = XFS_HEALTHUP_RTGROUP;
+ break;
+ default:
+ ASSERT(0);
+ return;
+ }
+
+ if (new_mask)
+ xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+ }
+}
+
+/* Call downstream hooks for an inode health update. */
+static inline void
+xfs_inode_health_update_hook(
+ struct xfs_inode *ip,
+ enum xfs_health_update_type op,
+ unsigned int old_mask,
+ unsigned int new_mask)
+{
+ if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+ struct xfs_health_update_params p = {
+ .domain = XFS_HEALTHUP_INODE,
+ .old_mask = old_mask,
+ .new_mask = new_mask,
+ .ino = ip->i_ino,
+ .gen = VFS_I(ip)->i_generation,
+ };
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (new_mask)
+ xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+ }
+}
+
+/* Call the specified function during a health update. */
+int
+xfs_health_hook_add(
+ struct xfs_mount *mp,
+ struct xfs_health_hook *hook)
+{
+ return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Stop calling the specified function during a health update. */
+void
+xfs_health_hook_del(
+ struct xfs_mount *mp,
+ struct xfs_health_hook *hook)
+{
+ xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Configure health update hook functions. */
+void
+xfs_health_hook_setup(
+ struct xfs_health_hook *hook,
+ notifier_fn_t mod_fn)
+{
+ xfs_hook_setup(&hook->health_hook, mod_fn);
+}
+#else
+# define xfs_health_unmount_hook(...) ((void)0)
+# define xfs_fs_health_update_hook(a,b,o,n) do {o = o;} while(0)
+# define xfs_rt_health_update_hook(a,b,o,n) do {o = o;} while(0)
+# define xfs_group_health_update_hook(a,b,o,n) do {o = o;} while(0)
+# define xfs_inode_health_update_hook(a,b,o,n) do {o = o;} while(0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
static void
xfs_health_unmount_group(
struct xfs_group *xg,
@@ -50,8 +203,10 @@ xfs_health_unmount(
unsigned int checked = 0;
bool warn = false;
- if (xfs_is_shutdown(mp))
+ if (xfs_is_shutdown(mp)) {
+ xfs_health_unmount_hook(mp);
return;
+ }
/* Measure AG corruption levels. */
while ((pag = xfs_perag_next(mp, pag)))
@@ -97,6 +252,8 @@ xfs_health_unmount(
if (sick & XFS_SICK_FS_COUNTERS)
xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
}
+
+ xfs_health_unmount_hook(mp);
}
/* Mark unhealthy per-fs metadata. */
@@ -105,12 +262,17 @@ xfs_fs_mark_sick(
struct xfs_mount *mp,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_FS_ALL));
trace_xfs_fs_mark_sick(mp, mask);
spin_lock(&mp->m_sb_lock);
+ old_mask = mp->m_fs_sick;
mp->m_fs_sick |= mask;
spin_unlock(&mp->m_sb_lock);
+
+ xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
}
/* Mark per-fs metadata as having been checked and found unhealthy by fsck. */
@@ -119,13 +281,18 @@ xfs_fs_mark_corrupt(
struct xfs_mount *mp,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_FS_ALL));
trace_xfs_fs_mark_corrupt(mp, mask);
spin_lock(&mp->m_sb_lock);
+ old_mask = mp->m_fs_sick;
mp->m_fs_sick |= mask;
mp->m_fs_checked |= mask;
spin_unlock(&mp->m_sb_lock);
+
+ xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
}
/* Mark a per-fs metadata healed. */
@@ -134,15 +301,20 @@ xfs_fs_mark_healthy(
struct xfs_mount *mp,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_FS_ALL));
trace_xfs_fs_mark_healthy(mp, mask);
spin_lock(&mp->m_sb_lock);
+ old_mask = mp->m_fs_sick;
mp->m_fs_sick &= ~mask;
if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
mp->m_fs_checked |= mask;
spin_unlock(&mp->m_sb_lock);
+
+ xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
}
/* Sample which per-fs metadata are unhealthy. */
@@ -192,12 +364,17 @@ xfs_group_mark_sick(
struct xfs_group *xg,
unsigned int mask)
{
+ unsigned int old_mask;
+
xfs_group_check_mask(xg, mask);
trace_xfs_group_mark_sick(xg, mask);
spin_lock(&xg->xg_state_lock);
+ old_mask = xg->xg_sick;
xg->xg_sick |= mask;
spin_unlock(&xg->xg_state_lock);
+
+ xfs_group_health_update_hook(xg, XFS_HEALTHUP_SICK, old_mask, mask);
}
/*
@@ -208,13 +385,18 @@ xfs_group_mark_corrupt(
struct xfs_group *xg,
unsigned int mask)
{
+ unsigned int old_mask;
+
xfs_group_check_mask(xg, mask);
trace_xfs_group_mark_corrupt(xg, mask);
spin_lock(&xg->xg_state_lock);
+ old_mask = xg->xg_sick;
xg->xg_sick |= mask;
xg->xg_checked |= mask;
spin_unlock(&xg->xg_state_lock);
+
+ xfs_group_health_update_hook(xg, XFS_HEALTHUP_CORRUPT, old_mask, mask);
}
/*
@@ -225,15 +407,20 @@ xfs_group_mark_healthy(
struct xfs_group *xg,
unsigned int mask)
{
+ unsigned int old_mask;
+
xfs_group_check_mask(xg, mask);
trace_xfs_group_mark_healthy(xg, mask);
spin_lock(&xg->xg_state_lock);
+ old_mask = xg->xg_sick;
xg->xg_sick &= ~mask;
if (!(xg->xg_sick & XFS_SICK_AG_PRIMARY))
xg->xg_sick &= ~XFS_SICK_AG_SECONDARY;
xg->xg_checked |= mask;
spin_unlock(&xg->xg_state_lock);
+
+ xfs_group_health_update_hook(xg, XFS_HEALTHUP_HEALTHY, old_mask, mask);
}
/* Sample which per-ag metadata are unhealthy. */
@@ -272,10 +459,13 @@ xfs_inode_mark_sick(
struct xfs_inode *ip,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_INO_ALL));
trace_xfs_inode_mark_sick(ip, mask);
spin_lock(&ip->i_flags_lock);
+ old_mask = ip->i_sick;
ip->i_sick |= mask;
spin_unlock(&ip->i_flags_lock);
@@ -287,6 +477,8 @@ xfs_inode_mark_sick(
spin_lock(&VFS_I(ip)->i_lock);
VFS_I(ip)->i_state &= ~I_DONTCACHE;
spin_unlock(&VFS_I(ip)->i_lock);
+
+ xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask);
}
/* Mark inode metadata as having been checked and found unhealthy by fsck. */
@@ -295,10 +487,13 @@ xfs_inode_mark_corrupt(
struct xfs_inode *ip,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_INO_ALL));
trace_xfs_inode_mark_corrupt(ip, mask);
spin_lock(&ip->i_flags_lock);
+ old_mask = ip->i_sick;
ip->i_sick |= mask;
ip->i_checked |= mask;
spin_unlock(&ip->i_flags_lock);
@@ -311,6 +506,8 @@ xfs_inode_mark_corrupt(
spin_lock(&VFS_I(ip)->i_lock);
VFS_I(ip)->i_state &= ~I_DONTCACHE;
spin_unlock(&VFS_I(ip)->i_lock);
+
+ xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask);
}
/* Mark parts of an inode healed. */
@@ -319,15 +516,20 @@ xfs_inode_mark_healthy(
struct xfs_inode *ip,
unsigned int mask)
{
+ unsigned int old_mask;
+
ASSERT(!(mask & ~XFS_SICK_INO_ALL));
trace_xfs_inode_mark_healthy(ip, mask);
spin_lock(&ip->i_flags_lock);
+ old_mask = ip->i_sick;
ip->i_sick &= ~mask;
if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
ip->i_checked |= mask;
spin_unlock(&ip->i_flags_lock);
+
+ xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask);
}
/* Sample which parts of an inode are unhealthy. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ba07e4a4ae3ffa..84cbba0ab698aa 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2291,6 +2291,7 @@ xfs_init_fs_context(
mp->m_allocsize_log = 16; /* 64k */
xfs_hooks_init(&mp->m_dir_update_hooks);
+ xfs_hooks_init(&mp->m_health_update_hooks);
fc->s_fs_info = mp;
fc->ops = &xfs_context_ops;
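As a reading aid for the mark_sick/mark_healthy changes above: each function snapshots the sick mask before modifying it, then passes that snapshot to the hook together with the caller's mask once the lock has been dropped. The sketch below models this for the sick and healthy cases, including the rule that secondary bits are cleared once no primary bits remain. The bit values and all `mock_` names are hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit assignments, for illustration only. */
#define MOCK_SICK_PRIMARY	0x0fu
#define MOCK_SICK_SECONDARY	0xf0u

struct mock_health {
	unsigned int	sick;
	unsigned int	checked;
};

struct mock_report {
	unsigned int	old_mask;	/* sick state before the update */
	unsigned int	new_mask;	/* mask the caller passed in */
};

/* Mark some metadata sick and describe the change for the hook. */
struct mock_report
mock_mark_sick(struct mock_health *hs, unsigned int mask)
{
	struct mock_report r;

	assert(hs != NULL);
	r.old_mask = hs->sick;
	r.new_mask = mask;

	hs->sick |= mask;
	return r;
}

/* Mark some metadata healthy and describe the change for the hook. */
struct mock_report
mock_mark_healthy(struct mock_health *hs, unsigned int mask)
{
	struct mock_report r;

	assert(hs != NULL);
	r.old_mask = hs->sick;
	r.new_mask = mask;

	hs->sick &= ~mask;
	/* Secondary problems are forgotten once no primary ones remain. */
	if (!(hs->sick & MOCK_SICK_PRIMARY))
		hs->sick &= ~MOCK_SICK_SECONDARY;
	hs->checked |= mask;
	return r;
}
```

A listener that wants the resulting state rather than the delta would have to derive it from `old_mask`, `new_mask`, and the update type; the hook parameters in the patch do not carry the post-update mask directly.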
* [PATCH 05/22] xfs: create a filesystem shutdown hook
From: Darrick J. Wong @ 2025-11-05 0:49 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a hook so that health monitoring can report filesystem shutdown
events to userspace. Shutdowns should be infrequent, so we don't bother
with a static key here.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/xfs_fsops.h | 11 +++++++++++
fs/xfs/xfs_mount.h | 3 +++
fs/xfs/xfs_fsops.c | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_super.c | 1 +
4 files changed, 57 insertions(+)
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 9d23c361ef56e4..ea5561b8580574 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -15,4 +15,15 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+ struct xfs_hook shutdown_hook;
+};
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+ notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
#endif /* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 3f20baaf9cc226..2d4305d91a3cd9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -346,6 +346,9 @@ typedef struct xfs_mount {
/* Hook to feed health events to a daemon. */
struct xfs_hooks m_health_update_hooks;
+
+ /* Hook to feed shutdown events to a daemon. */
+ struct xfs_hooks m_shutdown_hooks;
} xfs_mount_t;
#define M_IGEO(mp) (&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 0ada735693945c..26ed16e67410d7 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -482,6 +482,46 @@ xfs_fs_goingdown(
return 0;
}
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+ struct xfs_mount *mp,
+ uint32_t flags)
+{
+ xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a shutdown update. */
+int
+xfs_shutdown_hook_add(
+ struct xfs_mount *mp,
+ struct xfs_shutdown_hook *hook)
+{
+ return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a shutdown update. */
+void
+xfs_shutdown_hook_del(
+ struct xfs_mount *mp,
+ struct xfs_shutdown_hook *hook)
+{
+ xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown update hook functions. */
+void
+xfs_shutdown_hook_setup(
+ struct xfs_shutdown_hook *hook,
+ notifier_fn_t mod_fn)
+{
+ xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...) ((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
/*
* Force a shutdown of the filesystem instantly while keeping the filesystem
* consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -540,6 +580,8 @@ xfs_do_force_shutdown(
"Please unmount the filesystem and rectify the problem(s)");
if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
xfs_stack_trace();
+
+ xfs_shutdown_hook(mp, flags);
}
/*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 84cbba0ab698aa..599900b9b0dd63 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2291,6 +2291,7 @@ xfs_init_fs_context(
mp->m_allocsize_log = 16; /* 64k */
xfs_hooks_init(&mp->m_dir_update_hooks);
+ xfs_hooks_init(&mp->m_shutdown_hooks);
xfs_hooks_init(&mp->m_health_update_hooks);
fc->s_fs_info = mp;
* [PATCH 06/22] xfs: create hooks for media errors
From: Darrick J. Wong @ 2025-11-05 0:49 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Set up a media error event hook so that we can send events to userspace.
Media errors are not expected to be frequent, so we don't have a static
key guarding them here either.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
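One detail of the diff below is worth spelling out: in `xfs_dax_notify_dev_failure` the range translation is moved ahead of the rmapbt check, which appears to be so that the media error event is emitted even on filesystems without a reverse-mapping btree. The userspace sketch below models that ordering and the derivation of the event fields. All `mock_` names and the flag bit value are hypothetical, and the already-translated `daddr`/`bblen` are taken as inputs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types and flags. */
enum mock_group_type { MOCK_TYPE_AG, MOCK_TYPE_RTG };

enum mock_failed_device {
	MOCK_FAILED_DATADEV,
	MOCK_FAILED_LOGDEV,
	MOCK_FAILED_RTDEV,
};

#define MOCK_MF_PRE_REMOVE	(1u << 0)	/* placeholder bit value */

struct mock_media_error {
	enum mock_failed_device	fdev;
	int64_t			daddr;		/* start, in 512-byte blocks */
	uint64_t		bbcount;	/* length, in 512-byte blocks */
	bool			pre_remove;
	unsigned int		nr_reported;	/* events emitted so far */
};

/*
 * Model of the failure path: report the event first, then bail out if
 * the owner lookup cannot be performed.
 */
int
mock_notify_dev_failure(struct mock_media_error *ev,
			enum mock_group_type type, int64_t daddr,
			uint64_t bblen, unsigned int mf_flags,
			bool has_rmapbt)
{
	assert(ev != NULL);

	/* Report first, so listeners hear about the error regardless. */
	ev->fdev = (type == MOCK_TYPE_RTG) ? MOCK_FAILED_RTDEV :
					      MOCK_FAILED_DATADEV;
	ev->daddr = daddr;
	ev->bbcount = bblen;
	ev->pre_remove = (mf_flags & MOCK_MF_PRE_REMOVE) != 0;
	ev->nr_reported++;

	/* Only the owner lookup that follows needs the reverse mapping. */
	if (!has_rmapbt)
		return -EOPNOTSUPP;
	return 0;
}
```

The log device path in the patch follows the same idea but always reports `XFS_FAILED_LOGDEV`.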
fs/xfs/xfs_mount.h | 3 ++
fs/xfs/xfs_notify_failure.h | 33 +++++++++++++++++++++
fs/xfs/xfs_notify_failure.c | 68 ++++++++++++++++++++++++++++++++++++++++---
fs/xfs/xfs_super.c | 1 +
4 files changed, 100 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2d4305d91a3cd9..0feb0fb685f51f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -349,6 +349,9 @@ typedef struct xfs_mount {
/* Hook to feed shutdown events to a daemon. */
struct xfs_hooks m_shutdown_hooks;
+
+ /* Hook to feed media error events to a daemon. */
+ struct xfs_hooks m_media_error_hooks;
} xfs_mount_t;
#define M_IGEO(mp) (&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 8d08ec29dd2949..2695732ec20875 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -8,4 +8,37 @@
extern const struct dax_holder_operations xfs_dax_holder_operations;
+enum xfs_failed_device {
+ XFS_FAILED_DATADEV,
+ XFS_FAILED_LOGDEV,
+ XFS_FAILED_RTDEV,
+};
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+struct xfs_media_error_params {
+ struct xfs_mount *mp;
+ enum xfs_failed_device fdev;
+ xfs_daddr_t daddr;
+ uint64_t bbcount;
+ bool pre_remove;
+};
+
+struct xfs_media_error_hook {
+ struct xfs_hook error_hook;
+};
+
+int xfs_media_error_hook_add(struct xfs_mount *mp,
+ struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_del(struct xfs_mount *mp,
+ struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook,
+ notifier_fn_t mod_fn);
+#else
+struct xfs_media_error_params { };
+struct xfs_media_error_hook { };
+# define xfs_media_error_hook_add(...) (0)
+# define xfs_media_error_hook_del(...) ((void)0)
+# define xfs_media_error_hook_setup(...) ((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS && CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
#endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index b1767288994206..557f4bf3463dcb 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -27,6 +27,57 @@
#include <linux/dax.h>
#include <linux/fs.h>
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/* Call downstream hooks for a media error. */
+static inline void
+xfs_media_error_hook(
+ struct xfs_mount *mp,
+ enum xfs_failed_device fdev,
+ xfs_daddr_t daddr,
+ uint64_t bbcount,
+ bool pre_remove)
+{
+ struct xfs_media_error_params p = {
+ .mp = mp,
+ .fdev = fdev,
+ .daddr = daddr,
+ .bbcount = bbcount,
+ .pre_remove = pre_remove,
+ };
+
+ xfs_hooks_call(&mp->m_media_error_hooks, 0, &p);
+}
+
+/* Call the specified function during a media error. */
+int
+xfs_media_error_hook_add(
+ struct xfs_mount *mp,
+ struct xfs_media_error_hook *hook)
+{
+ return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Stop calling the specified function during a media error. */
+void
+xfs_media_error_hook_del(
+ struct xfs_mount *mp,
+ struct xfs_media_error_hook *hook)
+{
+ xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Configure media error hook functions. */
+void
+xfs_media_error_hook_setup(
+ struct xfs_media_error_hook *hook,
+ notifier_fn_t mod_fn)
+{
+ xfs_hook_setup(&hook->error_hook, mod_fn);
+}
+#else
+# define xfs_media_error_hook(...) ((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
struct xfs_failure_info {
xfs_agblock_t startblock;
xfs_extlen_t blockcount;
@@ -215,6 +266,9 @@ xfs_dax_notify_logdev_failure(
if (error)
return error;
+ xfs_media_error_hook(mp, XFS_FAILED_LOGDEV, daddr, bblen,
+ mf_flags & MF_MEM_PRE_REMOVE);
+
/*
* In the pre-remove case the failure notification is attempting to
* trigger a force unmount. The expectation is that the device is
@@ -248,16 +302,20 @@ xfs_dax_notify_dev_failure(
uint64_t bblen;
struct xfs_group *xg = NULL;
+ error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
+ offset, len, &daddr, &bblen);
+ if (error)
+ return error;
+
+ xfs_media_error_hook(mp, type == XG_TYPE_RTG ?
+ XFS_FAILED_RTDEV : XFS_FAILED_DATADEV,
+ daddr, bblen, mf_flags & MF_MEM_PRE_REMOVE);
+
if (!xfs_has_rmapbt(mp)) {
xfs_debug(mp, "notify_failure() needs rmapbt enabled!");
return -EOPNOTSUPP;
}
- error = xfs_dax_translate_range(xfs_group_type_buftarg(mp, type),
- offset, len, &daddr, &bblen);
- if (error)
- return error;
-
if (type == XG_TYPE_RTG) {
start_bno = xfs_daddr_to_rtb(mp, daddr);
end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 599900b9b0dd63..fb72a4976e8570 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2293,6 +2293,7 @@ xfs_init_fs_context(
xfs_hooks_init(&mp->m_dir_update_hooks);
xfs_hooks_init(&mp->m_shutdown_hooks);
xfs_hooks_init(&mp->m_health_update_hooks);
+ xfs_hooks_init(&mp->m_media_error_hooks);
fc->s_fs_info = mp;
fc->ops = &xfs_context_ops;
* [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem
From: Darrick J. Wong @ 2025-11-05 0:50 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Provide a callback so that iomap can report read and write IO errors to
the caller filesystem. For now this is only wired up for iomap as a
testbed for XFS.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
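The new `->ioerror` member is an optional operation: filesystems that do not set it see no behavior change, because the helper checks both the mapping and the function pointer before calling. A minimal userspace model of that pattern is sketched below. All `mock_` names are hypothetical, and the bookkeeping fields exist only so the example callback has something to record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mock_dir { MOCK_READ = 0, MOCK_WRITE = 1 };

struct mock_mapping;

/* Operations table with one optional callback. */
struct mock_aops {
	void (*ioerror)(struct mock_mapping *mapping, int direction,
			int64_t pos, uint64_t len, int error);
};

struct mock_mapping {
	const struct mock_aops	*a_ops;
	/* Bookkeeping for the example callback. */
	unsigned int		nr_errors;
	int			last_error;
	int64_t			last_pos;
};

/* Forward an I/O error to the filesystem only if it asked for them. */
void
mock_mapping_ioerror(struct mock_mapping *mapping, int direction,
		     int64_t pos, uint64_t len, int error)
{
	if (mapping && mapping->a_ops->ioerror)
		mapping->a_ops->ioerror(mapping, direction, pos, len, error);
}

/* Example filesystem callback: just record what happened. */
void
mock_fs_ioerror(struct mock_mapping *mapping, int direction,
		int64_t pos, uint64_t len, int error)
{
	assert(mapping != NULL);

	(void)direction;
	(void)len;
	mapping->nr_errors++;
	mapping->last_error = error;
	mapping->last_pos = pos;
}

const struct mock_aops mock_fs_aops = { .ioerror = mock_fs_ioerror };
const struct mock_aops mock_empty_aops = { .ioerror = NULL };
```

Per the vfs.rst text added by this patch, the real callback may run in interrupt context without any locks necessarily held, so an implementation would presumably need to avoid sleeping and defer heavier work.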
fs/iomap/internal.h | 2 ++
include/linux/fs.h | 4 ++++
Documentation/filesystems/vfs.rst | 7 +++++++
fs/iomap/buffered-io.c | 27 +++++++++++++++++++++++++--
fs/iomap/ioend.c | 4 ++++
5 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/internal.h b/fs/iomap/internal.h
index d05cb3aed96e79..06d9145b6be4fa 100644
--- a/fs/iomap/internal.h
+++ b/fs/iomap/internal.h
@@ -5,5 +5,7 @@
#define IOEND_BATCH_SIZE 4096
u32 iomap_finish_ioend_direct(struct iomap_ioend *ioend);
+void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+ loff_t pos, u64 len, int error);
#endif /* _IOMAP_INTERNAL_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c895146c1444be..5e4b3a4b24823f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -477,6 +477,10 @@ struct address_space_operations {
sector_t *span);
void (*swap_deactivate)(struct file *file);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+
+ /* Callback for dealing with IO errors during readahead or writeback */
+ void (*ioerror)(struct address_space *mapping, int direction,
+ loff_t pos, u64 len, int error);
};
extern const struct address_space_operations empty_aops;
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f13b01e42eb5e..9e70006bf99a63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -822,6 +822,8 @@ cache in your filesystem. The following members are defined:
int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span)
int (*swap_deactivate)(struct file *);
int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
+ void (*ioerror)(struct address_space *mapping, int direction,
+ loff_t pos, u64 len, int error);
};
``read_folio``
@@ -1032,6 +1034,11 @@ cache in your filesystem. The following members are defined:
``swap_rw``
Called to read or write swap pages when SWP_FS_OPS is set.
+``ioerror``
+ Called to deal with IO errors during readahead or writeback.
+ This may be called from interrupt context, and without any
+ locks necessarily being held.
+
The File Object
===============
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b847a1e27f13e..8dd5421cb910b5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -288,6 +288,14 @@ static inline bool iomap_block_needs_zeroing(const struct iomap_iter *iter,
pos >= i_size_read(iter->inode);
}
+inline void iomap_mapping_ioerror(struct address_space *mapping, int direction,
+ loff_t pos, u64 len, int error)
+{
+ if (mapping && mapping->a_ops->ioerror)
+ mapping->a_ops->ioerror(mapping, direction, pos, len,
+ error);
+}
+
/**
* iomap_read_inline_data - copy inline data into the page cache
* @iter: iteration structure
@@ -310,8 +318,11 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
if (folio_test_uptodate(folio))
return 0;
- if (WARN_ON_ONCE(size > iomap->length))
+ if (WARN_ON_ONCE(size > iomap->length)) {
+ iomap_mapping_ioerror(folio->mapping, READ, iomap->offset,
+ size, -EIO);
return -EIO;
+ }
if (offset > 0)
ifs_alloc(iter->inode, folio, iter->flags);
@@ -339,6 +350,10 @@ static void iomap_finish_folio_read(struct folio *folio, size_t off,
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
+ if (error)
+ iomap_mapping_ioerror(folio->mapping, READ,
+ folio_pos(folio) + off, len, error);
+
if (finished)
folio_end_read(folio, uptodate);
}
@@ -558,11 +573,15 @@ static int iomap_read_folio_range(const struct iomap_iter *iter,
const struct iomap *srcmap = iomap_iter_srcmap(iter);
struct bio_vec bvec;
struct bio bio;
+ int ret;
bio_init(&bio, srcmap->bdev, &bvec, 1, REQ_OP_READ);
bio.bi_iter.bi_sector = iomap_sector(srcmap, pos);
bio_add_folio_nofail(&bio, folio, len, offset_in_folio(folio, pos));
- return submit_bio_wait(&bio);
+ ret = submit_bio_wait(&bio);
+ if (ret)
+ iomap_mapping_ioerror(folio->mapping, READ, pos, len, ret);
+ return ret;
}
#else
static int iomap_read_folio_range(const struct iomap_iter *iter,
@@ -1674,6 +1693,7 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
u64 pos = folio_pos(folio);
u64 end_pos = pos + folio_size(folio);
u64 end_aligned = 0;
+ loff_t orig_pos = pos;
bool wb_pending = false;
int error = 0;
u32 rlen;
@@ -1724,6 +1744,9 @@ int iomap_writeback_folio(struct iomap_writepage_ctx *wpc, struct folio *folio)
if (wb_pending)
wpc->nr_folios++;
+ if (error && pos > orig_pos)
+ iomap_mapping_ioerror(inode->i_mapping, WRITE, orig_pos, 0,
+ error);
/*
* We can have dirty bits set past end of file in page_mkwrite path
diff --git a/fs/iomap/ioend.c b/fs/iomap/ioend.c
index b49fa75eab260a..56e654f2d36fe9 100644
--- a/fs/iomap/ioend.c
+++ b/fs/iomap/ioend.c
@@ -55,6 +55,10 @@ static u32 iomap_finish_ioend_buffered(struct iomap_ioend *ioend)
/* walk all folios in bio, ending page IO on them */
bio_for_each_folio_all(fi, bio) {
+ if (ioend->io_error)
+ iomap_mapping_ioerror(inode->i_mapping, WRITE,
+ folio_pos(fi.folio) + fi.offset,
+ fi.length, ioend->io_error);
iomap_finish_folio_write(inode, fi.folio, fi.length);
folio_count++;
}
* [PATCH 08/22] iomap: report directio read and write errors to callers
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (6 preceding siblings ...)
2025-11-05 0:50 ` [PATCH 07/22] iomap: report buffered read and write io errors to the filesystem Darrick J. Wong
@ 2025-11-05 0:50 ` Darrick J. Wong
2025-11-05 0:50 ` [PATCH 09/22] xfs: create file io error hooks Darrick J. Wong
` (13 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:50 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add more hooks to report directio IO errors to the filesystem.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
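A minimal userspace sketch of the dispatch rule that the direct-io.c hunk
below adds: the callback is optional, and it fires only when the dio
recorded an error. Every demo_* name here is hypothetical and only models
the kernel structures; none of this is kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's READ/WRITE direction values. */
enum demo_dir { DEMO_READ = 0, DEMO_WRITE = 1 };

/* Model of struct iomap_dio_ops: every callback is optional. */
struct demo_dio_ops {
	void (*ioerror)(void *inode, int direction, int64_t pos,
			uint64_t len, int error);
};

/* Model of the struct iomap_dio fields that the new hook consumes. */
struct demo_dio {
	const struct demo_dio_ops *dops;
	void *inode;
	int64_t offset;
	uint64_t size;
	int error;		/* 0, or a negative errno */
	int is_write;
};

/* Record of the most recent report, so that a caller can inspect it. */
struct demo_report {
	int calls;
	int direction;
	int64_t pos;
	uint64_t len;
	int error;
};
static struct demo_report demo_last;

/* Example filesystem callback: remember what was reported. */
static void demo_record_ioerror(void *inode, int direction, int64_t pos,
				uint64_t len, int error)
{
	(void)inode;
	demo_last.calls++;
	demo_last.direction = direction;
	demo_last.pos = pos;
	demo_last.len = len;
	demo_last.error = error;
}

/* Report a failed direct I/O, but only if the filesystem opted in. */
static void demo_dio_complete(const struct demo_dio *dio)
{
	const struct demo_dio_ops *dops = dio->dops;

	if (dio->error && dops && dops->ioerror)
		dops->ioerror(dio->inode,
			      dio->is_write ? DEMO_WRITE : DEMO_READ,
			      dio->offset, dio->size, dio->error);
}
```

A filesystem that leaves the callback unset, or passes no ops at all, sees
no behaviour change in this model, which matches the NULL checks in the
hunk.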
include/linux/iomap.h | 2 ++
fs/iomap/direct-io.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 73dceabc21c8c7..ca1590e5002342 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -486,6 +486,8 @@ struct iomap_dio_ops {
unsigned flags);
void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
loff_t file_offset);
+ void (*ioerror)(struct inode *inode, int direction, loff_t pos,
+ u64 len, int error);
/*
* Filesystems wishing to attach private information to a direct io bio
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5d5d63efbd5767..1512d8dbb0d2e7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -95,6 +95,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
if (dops && dops->end_io)
ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+ if (dio->error && dops && dops->ioerror)
+ dops->ioerror(file_inode(iocb->ki_filp),
+ (dio->flags & IOMAP_DIO_WRITE) ? WRITE : READ,
+ offset, dio->size, dio->error);
if (likely(!ret)) {
ret = dio->size;
* [PATCH 09/22] xfs: create file io error hooks
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (7 preceding siblings ...)
2025-11-05 0:50 ` [PATCH 08/22] iomap: report directio read and write errors to callers Darrick J. Wong
@ 2025-11-05 0:50 ` Darrick J. Wong
2025-11-05 0:51 ` [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
` (12 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:50 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create hooks within XFS to deliver IO errors to callers. File I/O
errors are usually rare, so we don't employ a static key here.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
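The three reporting functions in xfs_file.c below share one pattern:
capture the error details in a small record without blocking, then let a
worker deliver the record to the hooks later. The following is a rough
userspace sketch of that capture/deliver split, assuming a plain FIFO in
place of the workqueue and calloc() in place of a GFP_ATOMIC allocation.
All demo_* names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of enum xfs_file_ioerror_type. */
enum demo_ioerror_type {
	DEMO_BUFFERED_READ,
	DEMO_BUFFERED_WRITE,
	DEMO_DIRECT_READ,
	DEMO_DIRECT_WRITE,
	DEMO_DATA_LOST,
};

/* One captured error, small enough to allocate without blocking. */
struct demo_ioerror {
	struct demo_ioerror *next;
	uint64_t ino;
	int64_t pos;
	uint64_t len;
	uint32_t gen;
	int error;
	enum demo_ioerror_type type;
};

/* FIFO standing in for the workqueue; "lost" counts failed allocations. */
struct demo_queue {
	struct demo_ioerror *head;
	struct demo_ioerror *tail;
	unsigned long lost;
};

typedef void (*demo_hook_fn)(const struct demo_ioerror *e, void *priv);

/* Capture step: copy the details and return; never call hooks from here. */
static int demo_capture(struct demo_queue *q, enum demo_ioerror_type type,
			uint64_t ino, uint32_t gen, int64_t pos, uint64_t len,
			int error)
{
	struct demo_ioerror *e = calloc(1, sizeof(*e));

	if (!e) {
		q->lost++;	/* the patch logs a message at this point */
		return -1;
	}

	e->type = type;
	e->ino = ino;
	e->gen = gen;
	e->pos = pos;
	e->len = len;
	e->error = error;

	if (q->tail)
		q->tail->next = e;
	else
		q->head = e;
	q->tail = e;
	return 0;
}

/* Delivery step: runs later, in a context that is allowed to sleep. */
static unsigned int demo_deliver(struct demo_queue *q, demo_hook_fn fn,
				 void *priv)
{
	unsigned int delivered = 0;
	struct demo_ioerror *e;

	while ((e = q->head) != NULL) {
		q->head = e->next;
		if (!q->head)
			q->tail = NULL;
		fn(e, priv);
		free(e);
		delivered++;
	}
	return delivered;
}

/* Example hook: remember the last errno that was delivered. */
static void demo_last_error_hook(const struct demo_ioerror *e, void *priv)
{
	*(int *)priv = e->error;
}
```

The split matters because the ->ioerror callbacks may run in interrupt
context, while hook functions may want to take sleeping locks.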
fs/xfs/xfs_file.h | 37 +++++++++
fs/xfs/xfs_mount.h | 3 +
fs/xfs/xfs_aops.c | 2
fs/xfs/xfs_file.c | 174 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_notify_failure.c | 5 +
fs/xfs/xfs_super.c | 1
6 files changed, 221 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h
index 2ad91f755caf35..441f8a693bb884 100644
--- a/fs/xfs/xfs_file.h
+++ b/fs/xfs/xfs_file.h
@@ -12,4 +12,41 @@ extern const struct file_operations xfs_dir_file_operations;
bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos,
long long int len);
+enum xfs_file_ioerror_type {
+ XFS_FILE_IOERROR_BUFFERED_READ,
+ XFS_FILE_IOERROR_BUFFERED_WRITE,
+ XFS_FILE_IOERROR_DIRECT_READ,
+ XFS_FILE_IOERROR_DIRECT_WRITE,
+ XFS_FILE_IOERROR_DATA_LOST,
+};
+
+struct xfs_file_ioerror_params {
+ xfs_ino_t ino;
+ loff_t pos;
+ u64 len;
+ u32 gen;
+ int error;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror_hook {
+ struct xfs_hook ioerror_hook;
+};
+
+int xfs_file_ioerror_hook_add(struct xfs_mount *mp,
+ struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_del(struct xfs_mount *mp,
+ struct xfs_file_ioerror_hook *hook);
+void xfs_file_ioerror_hook_setup(struct xfs_file_ioerror_hook *hook,
+ notifier_fn_t mod_fn);
+
+void xfs_vm_ioerror(struct address_space *mapping, int direction, loff_t pos,
+ u64 len, int error);
+
+void xfs_inode_media_error(struct xfs_inode *ip, loff_t pos, u64 len);
+#else
+# define xfs_vm_ioerror NULL
+# define xfs_inode_media_error(...) ((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
#endif /* __XFS_FILE_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 0feb0fb685f51f..2d7f9ccba5287e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -352,6 +352,9 @@ typedef struct xfs_mount {
/* Hook to feed media error events to a daemon. */
struct xfs_hooks m_media_error_hooks;
+
+ /* Hook to feed file io error events to a daemon. */
+ struct xfs_hooks m_file_ioerror_hooks;
} xfs_mount_t;
#define M_IGEO(mp) (&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a26f798155331f..f3f28b9ae0f70e 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -22,6 +22,7 @@
#include "xfs_icache.h"
#include "xfs_zone_alloc.h"
#include "xfs_rtgroup.h"
+#include "xfs_file.h"
struct xfs_writepage_ctx {
struct iomap_writepage_ctx ctx;
@@ -810,6 +811,7 @@ const struct address_space_operations xfs_address_space_operations = {
.is_partially_uptodate = iomap_is_partially_uptodate,
.error_remove_folio = generic_error_remove_folio,
.swap_activate = xfs_vm_swap_activate,
+ .ioerror = xfs_vm_ioerror,
};
const struct address_space_operations xfs_dax_aops = {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2702fef2c90cd2..f5988904f5d44d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -222,6 +222,176 @@ xfs_ilock_iocb_for_write(
return 0;
}
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_file_ioerror {
+ struct work_struct work;
+ struct xfs_mount *mp;
+ xfs_ino_t ino;
+ loff_t pos;
+ u64 len;
+ u32 gen;
+ int error;
+ enum xfs_file_ioerror_type type;
+};
+
+/* Call downstream hooks for a file io error update. */
+STATIC void
+xfs_file_report_ioerror(
+ struct work_struct *work)
+{
+ struct xfs_file_ioerror *ioerr =
+ container_of(work, struct xfs_file_ioerror, work);
+ struct xfs_file_ioerror_params p = {
+ .ino = ioerr->ino,
+ .gen = ioerr->gen,
+ .pos = ioerr->pos,
+ .len = ioerr->len,
+ };
+ struct xfs_mount *mp = ioerr->mp;
+
+ xfs_hooks_call(&mp->m_file_ioerror_hooks, ioerr->type, &p);
+ kfree(ioerr);
+}
+
+/* Queue a directio io error notification. */
+STATIC void
+xfs_dio_ioerror(
+ struct inode *inode,
+ int direction,
+ loff_t pos,
+ u64 len,
+ int error)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_file_ioerror *ioerr;
+
+ ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+ if (!ioerr) {
+ xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+ ip->i_ino,
+ direction == WRITE ? "WRITE" : "READ",
+ pos, len, error);
+ return;
+ }
+
+ INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+ ioerr->mp = mp;
+ ioerr->ino = ip->i_ino;
+ ioerr->gen = VFS_I(ip)->i_generation;
+ ioerr->pos = pos;
+ ioerr->len = len;
+ if (direction == WRITE)
+ ioerr->type = XFS_FILE_IOERROR_DIRECT_WRITE;
+ else
+ ioerr->type = XFS_FILE_IOERROR_DIRECT_READ;
+ ioerr->error = error;
+ queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Deal with a media error */
+void
+xfs_inode_media_error(
+ struct xfs_inode *ip,
+ loff_t pos,
+ u64 len)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_file_ioerror *ioerr;
+
+ ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+ if (!ioerr) {
+ xfs_err(mp,
+ "lost data error report for ino 0x%llx pos 0x%llx len 0x%llx",
+ ip->i_ino,
+ pos, len);
+ return;
+ }
+
+ INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+ ioerr->mp = mp;
+ ioerr->ino = ip->i_ino;
+ ioerr->gen = VFS_I(ip)->i_generation;
+ ioerr->pos = pos;
+ ioerr->len = len;
+ ioerr->type = XFS_FILE_IOERROR_DATA_LOST;
+ ioerr->error = -EIO;
+ queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Queue a buffered io error notification. */
+void
+xfs_vm_ioerror(
+ struct address_space *mapping,
+ int direction,
+ loff_t pos,
+ u64 len,
+ int error)
+{
+ struct inode *inode = mapping->host;
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_file_ioerror *ioerr;
+
+ ioerr = kzalloc(sizeof(*ioerr), GFP_ATOMIC);
+ if (!ioerr) {
+ xfs_err(mp,
+ "lost ioerror report for ino 0x%llx %s pos 0x%llx len 0x%llx error %d",
+ ip->i_ino,
+ direction == WRITE ? "WRITE" : "READ",
+ pos, len, error);
+ return;
+ }
+
+ INIT_WORK(&ioerr->work, xfs_file_report_ioerror);
+ ioerr->mp = mp;
+ ioerr->ino = ip->i_ino;
+ ioerr->gen = VFS_I(ip)->i_generation;
+ ioerr->pos = pos;
+ ioerr->len = len;
+ if (direction == WRITE)
+ ioerr->type = XFS_FILE_IOERROR_BUFFERED_WRITE;
+ else
+ ioerr->type = XFS_FILE_IOERROR_BUFFERED_READ;
+ ioerr->error = error;
+ queue_work(mp->m_unwritten_workqueue, &ioerr->work);
+}
+
+/* Call the specified function after a file io error. */
+int
+xfs_file_ioerror_hook_add(
+ struct xfs_mount *mp,
+ struct xfs_file_ioerror_hook *hook)
+{
+ return xfs_hooks_add(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Stop calling the specified function after a file io error. */
+void
+xfs_file_ioerror_hook_del(
+ struct xfs_mount *mp,
+ struct xfs_file_ioerror_hook *hook)
+{
+ xfs_hooks_del(&mp->m_file_ioerror_hooks, &hook->ioerror_hook);
+}
+
+/* Configure file io error update hook functions. */
+void
+xfs_file_ioerror_hook_setup(
+ struct xfs_file_ioerror_hook *hook,
+ notifier_fn_t mod_fn)
+{
+ xfs_hook_setup(&hook->ioerror_hook, mod_fn);
+}
+#else
+# define xfs_dio_ioerror NULL
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
+static const struct iomap_dio_ops xfs_dio_read_ops = {
+ .ioerror = xfs_dio_ioerror,
+};
+
STATIC ssize_t
xfs_file_dio_read(
struct kiocb *iocb,
@@ -240,7 +410,8 @@ xfs_file_dio_read(
ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
if (ret)
return ret;
- ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+ ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, &xfs_dio_read_ops,
+ 0, NULL, 0);
xfs_iunlock(ip, XFS_IOLOCK_SHARED);
return ret;
@@ -625,6 +796,7 @@ xfs_dio_write_end_io(
static const struct iomap_dio_ops xfs_dio_write_ops = {
.end_io = xfs_dio_write_end_io,
+ .ioerror = xfs_dio_ioerror,
};
static void
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 557f4bf3463dcb..8766d83385ddad 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include "xfs_notify_failure.h"
#include "xfs_rtgroup.h"
#include "xfs_rtrmap_btree.h"
+#include "xfs_file.h"
#include <linux/mm.h>
#include <linux/dax.h>
@@ -167,6 +168,10 @@ xfs_dax_failure_fn(
invalidate_inode_pages2_range(mapping, pgoff,
pgoff + pgcnt - 1);
+ xfs_inode_media_error(ip,
+ (loff_t)pgoff << PAGE_SHIFT,
+ (u64)pgcnt << PAGE_SHIFT);
+
xfs_irele(ip);
return error;
}
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fb72a4976e8570..54d82f5a5b8863 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2294,6 +2294,7 @@ xfs_init_fs_context(
xfs_hooks_init(&mp->m_shutdown_hooks);
xfs_hooks_init(&mp->m_health_update_hooks);
xfs_hooks_init(&mp->m_media_error_hooks);
+ xfs_hooks_init(&mp->m_file_ioerror_hooks);
fc->s_fs_info = mp;
fc->ops = &xfs_context_ops;
* [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (8 preceding siblings ...)
2025-11-05 0:50 ` [PATCH 09/22] xfs: create file io error hooks Darrick J. Wong
@ 2025-11-05 0:51 ` Darrick J. Wong
2025-11-05 0:51 ` [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
` (11 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:51 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 8 ++
fs/xfs/xfs_healthmon.h | 16 +++++
fs/xfs/Kconfig | 8 ++
fs/xfs/Makefile | 1
fs/xfs/xfs_healthmon.c | 157 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_ioctl.c | 4 +
6 files changed, 194 insertions(+)
create mode 100644 fs/xfs/xfs_healthmon.h
create mode 100644 fs/xfs/xfs_healthmon.c
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 12463ba766da05..dba7896f716092 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,13 @@ struct xfs_rtgroup_geometry {
#define XFS_RTGROUP_GEOM_SICK_RMAPBT (1U << 3) /* reverse mappings */
#define XFS_RTGROUP_GEOM_SICK_REFCNTBT (1U << 4) /* reference counts */
+struct xfs_health_monitor {
+ __u64 flags; /* flags */
+ __u8 format; /* output format */
+ __u8 pad1[7]; /* zeroes */
+ __u64 pad2[2]; /* zeroes */
+};
+
/*
* ioctl commands that are used by Linux filesystems
*/
@@ -1042,6 +1049,7 @@ struct xfs_rtgroup_geometry {
#define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
#define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head)
#define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
+#define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor)
/*
* ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
new file mode 100644
index 00000000000000..07126e39281a0c
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_HEALTHMON_H__
+#define __XFS_HEALTHMON_H__
+
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+long xfs_ioc_health_monitor(struct xfs_mount *mp,
+ struct xfs_health_monitor __user *arg);
+#else
+# define xfs_ioc_health_monitor(mp, hmo) (-ENOTTY)
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
+#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index b99da294e9a310..682d9a35203494 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -130,6 +130,14 @@ config XFS_RT
If unsure, say N.
+config XFS_HEALTH_MONITOR
+ bool "Report filesystem health events to userspace"
+ depends on XFS_FS
+ select XFS_LIVE_HOOKS
+ default y
+ help
+ Report health events to userspace programs.
+
config XFS_DRAIN_INTENTS
bool
select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 5bf501cf827172..d4e9070a9326ba 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -157,6 +157,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o
xfs-$(CONFIG_XFS_LIVE_HOOKS) += xfs_hooks.o
xfs-$(CONFIG_XFS_MEMORY_BUFS) += xfs_buf_mem.o
xfs-$(CONFIG_XFS_BTREE_IN_MEM) += libxfs/xfs_btree_mem.o
+xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o
# online scrub/repair
ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
new file mode 100644
index 00000000000000..7b0d9f78b0a402
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024-2025 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trace.h"
+#include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
+
+#include <linux/anon_inodes.h>
+#include <linux/eventpoll.h>
+#include <linux/poll.h>
+
+/*
+ * Live Health Monitoring
+ * ======================
+ *
+ * Autonomous self-healing of XFS filesystems requires a means for the kernel
+ * to send filesystem health events to a monitoring daemon in userspace. To
+ * accomplish this, we establish a thread_with_file kthread object to handle
+ * translating internal events about filesystem health into a format that can
+ * be parsed easily by userspace. Then we hook various parts of the filesystem
+ * to supply those internal events to the kthread. Userspace reads events
+ * from the file descriptor returned by the ioctl.
+ *
+ * The healthmon abstraction has a weak reference to the host filesystem mount
+ * so that the queueing and processing of the events do not pin the mount and
+ * cannot slow down the main filesystem. The healthmon object can exist past
+ * the end of the filesystem mount.
+ */
+
+struct xfs_healthmon {
+ struct xfs_mount *mp;
+};
+
+/*
+ * Convey queued event data to userspace. First copy any remaining bytes in
+ * the outbuf, then format the oldest event into the outbuf and copy that too.
+ */
+STATIC ssize_t
+xfs_healthmon_read_iter(
+ struct kiocb *iocb,
+ struct iov_iter *to)
+{
+ return -EIO;
+}
+
+/* Free the health monitoring information. */
+STATIC int
+xfs_healthmon_release(
+ struct inode *inode,
+ struct file *file)
+{
+ struct xfs_healthmon *hm = file->private_data;
+
+ kfree(hm);
+
+ return 0;
+}
+
+/* Validate ioctl parameters. */
+static inline bool
+xfs_healthmon_validate(
+ const struct xfs_health_monitor *hmo)
+{
+ if (hmo->flags)
+ return false;
+ if (hmo->format)
+ return false;
+ if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
+ return false;
+ if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2)))
+ return false;
+ return true;
+}
+
+/* Emit some data about the health monitoring fd. */
+#ifdef CONFIG_PROC_FS
+static void
+xfs_healthmon_show_fdinfo(
+ struct seq_file *m,
+ struct file *file)
+{
+ struct xfs_healthmon *hm = file->private_data;
+
+ seq_printf(m, "state:\talive\ndev:\t%s\n",
+ hm->mp->m_super->s_id);
+}
+#endif
+
+static const struct file_operations xfs_healthmon_fops = {
+ .owner = THIS_MODULE,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = xfs_healthmon_show_fdinfo,
+#endif
+ .read_iter = xfs_healthmon_read_iter,
+ .release = xfs_healthmon_release,
+};
+
+/*
+ * Create a health monitoring file. Returns an index to the fd table or a
+ * negative errno.
+ */
+long
+xfs_ioc_health_monitor(
+ struct xfs_mount *mp,
+ struct xfs_health_monitor __user *arg)
+{
+ struct xfs_health_monitor hmo;
+ struct xfs_healthmon *hm;
+ int fd;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (copy_from_user(&hmo, arg, sizeof(hmo)))
+ return -EFAULT;
+
+ if (!xfs_healthmon_validate(&hmo))
+ return -EINVAL;
+
+ hm = kzalloc(sizeof(*hm), GFP_KERNEL);
+ if (!hm)
+ return -ENOMEM;
+ hm->mp = mp;
+
+ /*
+ * Create the anonymous file. If it succeeds, the file owns hm and
+ * can go away at any time, so we must not access it again.
+ */
+ fd = anon_inode_getfd("xfs_healthmon", &xfs_healthmon_fops, hm,
+ O_CLOEXEC | O_RDONLY);
+ if (fd < 0) {
+ ret = fd;
+ goto out_hm;
+ }
+
+ return fd;
+
+out_hm:
+ kfree(hm);
+ return ret;
+}
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index a6bb7ee7a27ad5..08998d84554f09 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -41,6 +41,7 @@
#include "xfs_exchrange.h"
#include "xfs_handle.h"
#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
#include <linux/mount.h>
#include <linux/fileattr.h>
@@ -1421,6 +1422,9 @@ xfs_file_ioctl(
case XFS_IOC_COMMIT_RANGE:
return xfs_ioc_commit_range(filp, arg);
+ case XFS_IOC_HEALTH_MONITOR:
+ return xfs_ioc_health_monitor(mp, arg);
+
default:
return -ENOTTY;
}
* [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (9 preceding siblings ...)
2025-11-05 0:51 ` [PATCH 10/22] xfs: create a special file to pass filesystem health to userspace Darrick J. Wong
@ 2025-11-05 0:51 ` Darrick J. Wong
2025-11-05 0:51 ` [PATCH 12/22] xfs: report metadata health events through healthmon Darrick J. Wong
` (10 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:51 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create the basic infrastructure that we need to report health events to
userspace. We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle. Make the kernel export C structures via read().
In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures. The thought was that when we want to rev (or
worse, enlarge) the event format, it ought to be trivially easy to do
that in a way that doesn't break old userspace.
I looked at formats such as protobufs and capnproto. These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc. The huge downside is that neither supports C
all that well.
That was too hard, and I didn't want to port either of those huge
sprawling libraries first to the kernel and then again to xfsprogs. Then I
thought, how about JSON? Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.
There's a proposed schema format for json, which means that xfs can
publish a description of the events that the kernel will emit. Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.
However, json has a huge drawback -- javascript is well known for its
vague definition of what a number is. This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants. Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision. Not good when you're dealing with
discrete units.
It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.
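To make the precision concern concrete: an IEEE 754 double carries 53
bits of integer precision, so not every 64-bit value above 2^53 survives
a trip through one. A small sketch (the function name is hypothetical)
that checks whether a given u64 round-trips:

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if @v survives a round trip through an IEEE 754 double. */
static bool demo_u64_survives_double(uint64_t v)
{
	/* volatile forces a real 64-bit double, not a wider register. */
	volatile double d = (double)v;

	/* 2^64 and above cannot be converted back to uint64_t safely. */
	if (d >= 18446744073709551616.0)
		return false;
	return (uint64_t)d == v;
}
```

A byte offset of 2^53 + 1 fails this check, which is the sort of silent
corruption that a parser storing every number as a double would
introduce.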
It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs. The parsing in userspace was also noticeably slow and
memory-hungry.
Hence only the C interface survives.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
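A rough sketch of how a userspace client might walk a buffer of events,
using a local copy of struct xfs_health_monitor_event from the xfs_fs.h
hunk below. It assumes that read() hands back whole fixed-size records
packed back to back; treat that as an assumption of this sketch. The
demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the domain and type codes added by this patch. */
#define DEMO_DOMAIN_MOUNT	0
#define DEMO_TYPE_RUNNING	0
#define DEMO_TYPE_LOST		1

/* Local copy of struct xfs_health_monitor_lost. */
struct demo_lost {
	uint64_t count;
};

/* Local copy of struct xfs_health_monitor_event. */
struct demo_event {
	uint32_t domain;
	uint32_t type;
	uint64_t time_ns;
	union {
		struct demo_lost lost;
	} e;
	uint64_t pad[2];
};
static_assert(sizeof(struct demo_event) == 40, "event layout changed");

/* Running totals across all buffers parsed so far. */
struct demo_summary {
	unsigned int events;	/* records seen */
	unsigned int unknown;	/* records this client does not understand */
	uint64_t lost;		/* events the kernel says it dropped */
};

/*
 * Parse as many whole records as @buf holds.  Returns the number of bytes
 * consumed; a trailing partial record is left for the caller to retry.
 */
static size_t demo_parse_events(const void *buf, size_t len,
				struct demo_summary *sum)
{
	const unsigned char *p = buf;
	size_t used = 0;

	while (len - used >= sizeof(struct demo_event)) {
		struct demo_event ev;

		/* memcpy avoids alignment assumptions about @buf. */
		memcpy(&ev, p + used, sizeof(ev));
		used += sizeof(ev);
		sum->events++;

		if (ev.domain != DEMO_DOMAIN_MOUNT) {
			sum->unknown++;
			continue;
		}

		switch (ev.type) {
		case DEMO_TYPE_RUNNING:
			break;
		case DEMO_TYPE_LOST:
			sum->lost += ev.e.lost.count;
			break;
		default:
			sum->unknown++;
			break;
		}
	}
	return used;
}
```

Counting unknown records instead of failing lets an older client keep
running when a newer kernel adds domains or types.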
fs/xfs/libxfs/xfs_fs.h | 47 ++++
fs/xfs/xfs_healthmon.h | 29 +++
fs/xfs/xfs_linux.h | 3
fs/xfs/xfs_trace.h | 170 +++++++++++++++
fs/xfs/xfs_healthmon.c | 542 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_trace.c | 2
lib/seq_buf.c | 1
7 files changed, 787 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index dba7896f716092..dfca42b2c31192 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1003,6 +1003,45 @@ struct xfs_rtgroup_geometry {
#define XFS_RTGROUP_GEOM_SICK_RMAPBT (1U << 3) /* reverse mappings */
#define XFS_RTGROUP_GEOM_SICK_REFCNTBT (1U << 4) /* reference counts */
+/* Health monitor event domains */
+
+/* affects the whole fs */
+#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT (0)
+
+/* Health monitor event types */
+
+/* status of the monitor itself */
+#define XFS_HEALTH_MONITOR_TYPE_RUNNING (0)
+#define XFS_HEALTH_MONITOR_TYPE_LOST (1)
+
+/* lost events */
+struct xfs_health_monitor_lost {
+ __u64 count;
+};
+
+struct xfs_health_monitor_event {
+ /* XFS_HEALTH_MONITOR_DOMAIN_* */
+ __u32 domain;
+
+ /* XFS_HEALTH_MONITOR_TYPE_* */
+ __u32 type;
+
+ /* Timestamp of the event, in nanoseconds since the Unix epoch */
+ __u64 time_ns;
+
+ /*
+ * Details of the event. The primary clients are written in python
+ * and rust, so break this up because bindgen hates anonymous structs
+ * and unions.
+ */
+ union {
+ struct xfs_health_monitor_lost lost;
+ } e;
+
+ /* zeroes */
+ __u64 pad[2];
+};
+
struct xfs_health_monitor {
__u64 flags; /* flags */
__u8 format; /* output format */
@@ -1010,6 +1049,14 @@ struct xfs_health_monitor {
__u64 pad2[2]; /* zeroes */
};
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE (1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL (XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Initial return format version */
+#define XFS_HEALTH_MONITOR_FMT_V0 (0)
+
/*
* ioctl commands that are used by Linux filesystems
*/
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 07126e39281a0c..ea2d6a327dfb16 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -6,6 +6,35 @@
#ifndef __XFS_HEALTHMON_H__
#define __XFS_HEALTHMON_H__
+enum xfs_healthmon_type {
+ XFS_HEALTHMON_RUNNING, /* monitor running */
+ XFS_HEALTHMON_LOST, /* message lost */
+};
+
+enum xfs_healthmon_domain {
+ XFS_HEALTHMON_MOUNT, /* affects the whole fs */
+};
+
+struct xfs_healthmon_event {
+ struct xfs_healthmon_event *next;
+
+ enum xfs_healthmon_type type;
+ enum xfs_healthmon_domain domain;
+
+ uint64_t time_ns;
+
+ union {
+ /* lost events */
+ struct {
+ uint64_t lostcount;
+ };
+ /* mount */
+ struct {
+ unsigned int flags;
+ };
+ };
+};
+
#ifdef CONFIG_XFS_HEALTH_MONITOR
long xfs_ioc_health_monitor(struct xfs_mount *mp,
struct xfs_health_monitor __user *arg);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 4dd747bdbccab2..e122db938cc06b 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -63,6 +63,9 @@ typedef __u32 xfs_nlink_t;
#include <linux/xattr.h>
#include <linux/mnt_idmapping.h>
#include <linux/debugfs.h>
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+# include <linux/seq_buf.h>
+#endif
#include <asm/page.h>
#include <asm/div64.h>
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 79b8641880ab9d..309af9082c4179 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -103,6 +103,8 @@ struct xfs_refcount_intent;
struct xfs_metadir_update;
struct xfs_rtgroup;
struct xfs_open_zone;
+struct xfs_healthmon_event;
+struct xfs_health_update_params;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -5908,6 +5910,174 @@ DEFINE_EVENT(xfs_freeblocks_resv_class, name, \
DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_reserved);
DEFINE_FREEBLOCKS_RESV_EVENT(xfs_freecounter_enospc);
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+TRACE_EVENT(xfs_healthmon_lost_event,
+ TP_PROTO(const struct xfs_mount *mp, unsigned long long lost_prev),
+ TP_ARGS(mp, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d lost_prev %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->lost_prev)
+);
+
+#define XFS_HEALTHMON_FLAGS_STRINGS \
+ { XFS_HEALTH_MONITOR_VERBOSE, "verbose" }
+#define XFS_HEALTHMON_FMT_STRINGS \
+ { XFS_HEALTH_MONITOR_FMT_V0, "v0" }
+
+TRACE_EVENT(xfs_healthmon_create,
+ TP_PROTO(const struct xfs_mount *mp, u64 flags, u8 format),
+ TP_ARGS(mp, flags, format),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u64, flags)
+ __field(u8, format)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->flags = flags;
+ __entry->format = format;
+ ),
+ TP_printk("dev %d:%d flags %s format %s",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_flags(__entry->flags, "|", XFS_HEALTHMON_FLAGS_STRINGS),
+ __print_symbolic(__entry->format, XFS_HEALTHMON_FMT_STRINGS))
+);
+
+TRACE_EVENT(xfs_healthmon_copybuf,
+ TP_PROTO(const struct xfs_mount *mp, const struct iov_iter *iov,
+ const struct seq_buf *seqbuf, size_t outpos),
+ TP_ARGS(mp, iov, seqbuf, outpos),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(size_t, seqbuf_size)
+ __field(size_t, seqbuf_len)
+ __field(size_t, outpos)
+ __field(size_t, to_copy)
+ __field(size_t, iter_count)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->seqbuf_size = seqbuf->size;
+ __entry->seqbuf_len = seqbuf->len;
+ __entry->outpos = outpos;
+ __entry->to_copy = seqbuf->len - outpos;
+ __entry->iter_count = iov_iter_count(iov);
+ ),
+ TP_printk("dev %d:%d seqsize %zu seqlen %zu out_pos %zu to_copy %zu iter_count %zu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->seqbuf_size,
+ __entry->seqbuf_len,
+ __entry->outpos,
+ __entry->to_copy,
+ __entry->iter_count)
+);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_class,
+ TP_PROTO(const struct xfs_mount *mp, unsigned int events,
+ unsigned long long lost_prev),
+ TP_ARGS(mp, events, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned int, events)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->events = events;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d events %u lost_prev? %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->events,
+ __entry->lost_prev)
+);
+#define DEFINE_HEALTHMON_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_class, name, \
+ TP_PROTO(const struct xfs_mount *mp, unsigned int events, \
+ unsigned long long lost_prev), \
+ TP_ARGS(mp, events, lost_prev))
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_start);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+
+#define XFS_HEALTHMON_TYPE_STRINGS \
+ { XFS_HEALTHMON_LOST, "lost" }
+
+#define XFS_HEALTHMON_DOMAIN_STRINGS \
+ { XFS_HEALTHMON_MOUNT, "mount" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
+ TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
+ TP_ARGS(mp, event),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned int, type)
+ __field(unsigned int, domain)
+ __field(unsigned int, mask)
+ __field(unsigned long long, ino)
+ __field(unsigned int, gen)
+ __field(unsigned int, group)
+ __field(unsigned long long, offset)
+ __field(unsigned long long, length)
+ __field(unsigned long long, lostcount)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->type = event->type;
+ __entry->domain = event->domain;
+ __entry->mask = 0;
+ __entry->group = 0;
+ __entry->ino = 0;
+ __entry->gen = 0;
+ __entry->offset = 0;
+ __entry->length = 0;
+ __entry->lostcount = 0;
+ switch (__entry->domain) {
+ case XFS_HEALTHMON_MOUNT:
+ switch (__entry->type) {
+ case XFS_HEALTHMON_LOST:
+ __entry->lostcount = event->lostcount;
+ break;
+ }
+ break;
+ }
+ ),
+ TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
+ __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
+ __entry->mask,
+ __entry->ino,
+ __entry->gen,
+ __entry->offset,
+ __entry->length,
+ __entry->group,
+ __entry->lostcount)
+);
+#define DEFINE_HEALTHMONEVENT_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_event_class, name, \
+ TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
+ TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
#endif /* _TRACE_XFS_H */
#undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 7b0d9f78b0a402..8cf6b0b81a721b 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -42,10 +42,376 @@
* the end of the filesystem mount.
*/
+/* Allow this many events to build up in memory per healthmon fd. */
+#define XFS_HEALTHMON_MAX_EVENTS \
+ (32768 / sizeof(struct xfs_healthmon_event))
+
+struct flag_string {
+ unsigned int mask;
+ const char *str;
+};
+
struct xfs_healthmon {
+ /* lock for mp and eventlist */
+ struct mutex lock;
+
+ /* waiter for signalling the arrival of events */
+ struct wait_queue_head wait;
+
+ /* list of event objects */
+ struct xfs_healthmon_event *first_event;
+ struct xfs_healthmon_event *last_event;
+
struct xfs_mount *mp;
+
+ /* number of events */
+ unsigned int events;
+
+ /*
+ * Buffer for formatting events. New buffer data are appended to the
+ * end of the seqbuf, and outpos is used to determine where to start
+ * a copy_to_iter. Both are protected by inode_lock.
+ */
+ struct seq_buf outbuf;
+ size_t outpos;
+
+ /* XFS_HEALTH_MONITOR_FMT_* */
+ uint8_t format;
+
+ /* do we want all events? */
+ bool verbose;
+
+ /* did we lose previous events? */
+ unsigned long long lost_prev_event;
+
+ /* total counts of events observed and lost events */
+ unsigned long long total_events;
+ unsigned long long total_lost;
};
+static inline void xfs_healthmon_bump_events(struct xfs_healthmon *hm)
+{
+ hm->events++;
+ hm->total_events++;
+}
+
+static inline void xfs_healthmon_bump_lost(struct xfs_healthmon *hm)
+{
+ hm->lost_prev_event++;
+ hm->total_lost++;
+}
+
+/* Remove an event from the head of the list. */
+static inline int
+xfs_healthmon_free_head(
+ struct xfs_healthmon *hm,
+ struct xfs_healthmon_event *event)
+{
+ struct xfs_healthmon_event *head;
+
+ mutex_lock(&hm->lock);
+ head = hm->first_event;
+ if (head != event) {
+ ASSERT(hm->first_event == event);
+ mutex_unlock(&hm->lock);
+ return -EFSCORRUPTED;
+ }
+
+ if (hm->last_event == head)
+ hm->last_event = NULL;
+ hm->first_event = head->next;
+ hm->events--;
+ mutex_unlock(&hm->lock);
+
+ trace_xfs_healthmon_pop(hm->mp, head);
+ kfree(event);
+ return 0;
+}
+
+/* Push an event onto the end of the list. */
+static inline void
+__xfs_healthmon_push(
+ struct xfs_healthmon *hm,
+ struct xfs_healthmon_event *event)
+{
+ if (!hm->first_event)
+ hm->first_event = event;
+ if (hm->last_event)
+ hm->last_event->next = event;
+ hm->last_event = event;
+ event->next = NULL;
+ xfs_healthmon_bump_events(hm);
+ wake_up(&hm->wait);
+
+ trace_xfs_healthmon_push(hm->mp, event);
+}
+
+/* Push an event onto the end of the list if we're not full. */
+static inline int
+xfs_healthmon_push(
+ struct xfs_healthmon *hm,
+ struct xfs_healthmon_event *event)
+{
+ if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+ trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+ xfs_healthmon_bump_lost(hm);
+ return -ENOMEM;
+ }
+
+ __xfs_healthmon_push(hm, event);
+ return 0;
+}
+
+/* Create a new event or record that we failed. */
+static struct xfs_healthmon_event *
+xfs_healthmon_alloc(
+ struct xfs_healthmon *hm,
+ enum xfs_healthmon_type type,
+ enum xfs_healthmon_domain domain)
+{
+ struct timespec64 now;
+ struct xfs_healthmon_event *event;
+
+ event = kzalloc(sizeof(*event), GFP_NOFS);
+ if (!event) {
+ trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+ xfs_healthmon_bump_lost(hm);
+ return NULL;
+ }
+
+ event->type = type;
+ event->domain = domain;
+ ktime_get_coarse_real_ts64(&now);
+ event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
+
+ return event;
+}
+
+/*
+ * Before we accept an event notification from a live update hook, we need to
+ * clear out any previously lost events.
+ */
+static inline int
+xfs_healthmon_start_live_update(
+ struct xfs_healthmon *hm)
+{
+ struct xfs_healthmon_event *event;
+
+ /* If the queue is already full.... */
+ if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+ trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
+
+ if (hm->last_event &&
+ hm->last_event->type == XFS_HEALTHMON_LOST) {
+ /*
+ * ...and the last event notes lost events, then add
+ * the number of events we already lost, plus one for
+ * this event that we're about to lose.
+ */
+ hm->last_event->lostcount += hm->lost_prev_event + 1;
+ hm->lost_prev_event = 0;
+ } else {
+ /*
+ * ...try to create a new lost event. Add the number
+ * of events we previously lost, plus one for this
+ * event.
+ */
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+ XFS_HEALTHMON_MOUNT);
+ if (!event) {
+ xfs_healthmon_bump_lost(hm);
+ return -ENOMEM;
+ }
+ event->lostcount = hm->lost_prev_event + 1;
+ hm->lost_prev_event = 0;
+
+ __xfs_healthmon_push(hm, event);
+ }
+
+ return -ENOSPC;
+ }
+
+ /* If we lost an event in the past, but the queue isn't yet full... */
+ if (hm->lost_prev_event) {
+ /*
+ * ...try to create a new lost event that carries the number of
+ * events we previously lost.
+ */
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_LOST,
+ XFS_HEALTHMON_MOUNT);
+ if (!event) {
+ xfs_healthmon_bump_lost(hm);
+ return -ENOMEM;
+ }
+ event->lostcount = hm->lost_prev_event;
+ hm->lost_prev_event = 0;
+
+ /*
+ * If adding this lost event pushes us over the limit, we're
+ * going to lose the current event. Note that in the lost
+ * event count too.
+ */
+ if (hm->events == XFS_HEALTHMON_MAX_EVENTS - 1)
+ event->lostcount++;
+
+ __xfs_healthmon_push(hm, event);
+ if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
+ trace_xfs_healthmon_lost_event(hm->mp,
+ hm->lost_prev_event);
+ return -ENOSPC;
+ }
+ }
+
+ /*
+ * The queue has room for the new event, and there are no unreported
+ * lost events.
+ */
+ return 0;
+}
+
+static inline void
+xfs_healthmon_reset_outbuf(
+ struct xfs_healthmon *hm)
+{
+ hm->outpos = 0;
+ seq_buf_clear(&hm->outbuf);
+}
+
+static const unsigned int domain_map[] = {
+ [XFS_HEALTHMON_MOUNT] = XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+};
+
+static const unsigned int type_map[] = {
+ [XFS_HEALTHMON_RUNNING] = XFS_HEALTH_MONITOR_TYPE_RUNNING,
+ [XFS_HEALTHMON_LOST] = XFS_HEALTH_MONITOR_TYPE_LOST,
+};
+
+/* Render event as a V0 structure */
+STATIC int
+xfs_healthmon_format_v0(
+ struct xfs_healthmon *hm,
+ const struct xfs_healthmon_event *event)
+{
+ struct xfs_health_monitor_event hme = {
+ .time_ns = event->time_ns,
+ };
+ struct seq_buf *outbuf = &hm->outbuf;
+ size_t old_seqlen = outbuf->len;
+ int ret;
+
+ trace_xfs_healthmon_format(hm->mp, event);
+
+ if (event->domain < 0 || event->domain >= ARRAY_SIZE(domain_map) ||
+ event->type < 0 || event->type >= ARRAY_SIZE(type_map))
+ return -EFSCORRUPTED;
+
+ hme.domain = domain_map[event->domain];
+ hme.type = type_map[event->type];
+
+ /* fill in the event-specific details */
+ switch (event->domain) {
+ case XFS_HEALTHMON_MOUNT:
+ switch (event->type) {
+ case XFS_HEALTHMON_LOST:
+ hme.e.lost.count = event->lostcount;
+ break;
+ default:
+ break;
+ }
+ break;
+ default:
+ break;
+ }
+
+ ret = seq_buf_putmem(outbuf, &hme, sizeof(hme));
+ if (ret < 0) {
+ /*
+ * We overflowed the buffer and could not format the event.
+ * Reset the seqbuf and tell the caller not to delete the
+ * event.
+ */
+ trace_xfs_healthmon_format_overflow(hm->mp, event);
+ outbuf->len = old_seqlen;
+ return -1;
+ }
+
+ ASSERT(!seq_buf_has_overflowed(outbuf));
+ return 0;
+}
+
+/* How many bytes are waiting in the outbuf to be copied? */
+static inline size_t
+xfs_healthmon_outbuf_bytes(
+ struct xfs_healthmon *hm)
+{
+ unsigned int used = seq_buf_used(&hm->outbuf);
+
+ if (used > hm->outpos)
+ return used - hm->outpos;
+ return 0;
+}
+
+/*
+ * Do we have something for userspace to do? This can mean unmount events,
+ * events pending in the queue, or pending bytes in the outbuf.
+ */
+static inline bool
+xfs_healthmon_has_eventdata(
+ struct xfs_healthmon *hm)
+{
+ return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+}
+
+/* Try to copy the rest of the outbuf to the iov iter. */
+STATIC ssize_t
+xfs_healthmon_copybuf(
+ struct xfs_healthmon *hm,
+ struct iov_iter *to)
+{
+ size_t to_copy;
+ size_t w = 0;
+
+ trace_xfs_healthmon_copybuf(hm->mp, to, &hm->outbuf, hm->outpos);
+
+ to_copy = xfs_healthmon_outbuf_bytes(hm);
+ if (to_copy) {
+ w = copy_to_iter(hm->outbuf.buffer + hm->outpos, to_copy, to);
+ if (!w)
+ return -EFAULT;
+
+ hm->outpos += w;
+ }
+
+ /*
+ * Nothing left to copy? Reset the seqbuf pointers and outbuf to the
+ * start since there's no live data in the buffer.
+ */
+ if (xfs_healthmon_outbuf_bytes(hm) == 0)
+ xfs_healthmon_reset_outbuf(hm);
+ return w;
+}
+
+/*
+ * See if there's an event waiting for us. If the fs is no longer mounted,
+ * don't bother sending any more events.
+ */
+static inline struct xfs_healthmon_event *
+xfs_healthmon_peek(
+ struct xfs_healthmon *hm)
+{
+ struct xfs_healthmon_event *event;
+
+ mutex_lock(&hm->lock);
+ if (hm->mp)
+ event = hm->first_event;
+ else
+ event = NULL;
+ mutex_unlock(&hm->lock);
+ return event;
+}
+
/*
* Convey queued event data to userspace. First copy any remaining bytes in
* the outbuf, then format the oldest event into the outbuf and copy that too.
@@ -55,7 +421,122 @@ xfs_healthmon_read_iter(
struct kiocb *iocb,
struct iov_iter *to)
{
- return -EIO;
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ struct xfs_healthmon *hm = file->private_data;
+ struct xfs_healthmon_event *event;
+ size_t copied = 0;
+ ssize_t ret = 0;
+
+ /* Wait for data to become available */
+ if (!(file->f_flags & O_NONBLOCK)) {
+ ret = wait_event_interruptible(hm->wait,
+ xfs_healthmon_has_eventdata(hm));
+ if (ret)
+ return ret;
+ } else if (!xfs_healthmon_has_eventdata(hm)) {
+ return -EAGAIN;
+ }
+
+ /* Allocate formatting buffer up to 64k if necessary */
+ if (hm->outbuf.size == 0) {
+ void *outbuf;
+ size_t bufsize = min(65536, max(PAGE_SIZE,
+ iov_iter_count(to)));
+
+ outbuf = kzalloc(bufsize, GFP_KERNEL);
+ if (!outbuf) {
+ bufsize = PAGE_SIZE;
+ outbuf = kzalloc(bufsize, GFP_KERNEL);
+ if (!outbuf)
+ return -ENOMEM;
+ }
+
+ inode_lock(inode);
+ if (hm->outbuf.size == 0) {
+ seq_buf_init(&hm->outbuf, outbuf, bufsize);
+ hm->outpos = 0;
+ } else {
+ kfree(outbuf);
+ }
+ } else {
+ inode_lock(inode);
+ }
+
+ trace_xfs_healthmon_read_start(hm->mp, hm->events, hm->lost_prev_event);
+
+ /*
+ * If there's anything left in the seqbuf, copy that before formatting
+ * more events.
+ */
+ ret = xfs_healthmon_copybuf(hm, to);
+ if (ret < 0)
+ goto out_unlock;
+ copied += ret;
+
+ while (iov_iter_count(to) > 0) {
+ /* Format the next events into the outbuf until it's full. */
+ while ((event = xfs_healthmon_peek(hm)) != NULL) {
+ switch (hm->format) {
+ case XFS_HEALTH_MONITOR_FMT_V0:
+ ret = xfs_healthmon_format_v0(hm, event);
+ break;
+ default:
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ if (ret < 0)
+ break;
+ ret = xfs_healthmon_free_head(hm, event);
+ if (ret)
+ goto out_unlock;
+ }
+
+ /* Copy it to userspace */
+ ret = xfs_healthmon_copybuf(hm, to);
+ if (ret <= 0)
+ break;
+
+ copied += ret;
+ }
+
+out_unlock:
+ trace_xfs_healthmon_read_finish(hm->mp, hm->events, hm->lost_prev_event);
+ inode_unlock(inode);
+ return copied ?: ret;
+}
+
+/* Poll for available events. */
+STATIC __poll_t
+xfs_healthmon_poll(
+ struct file *file,
+ struct poll_table_struct *wait)
+{
+ struct xfs_healthmon *hm = file->private_data;
+ __poll_t mask = 0;
+
+ poll_wait(file, &hm->wait, wait);
+
+ if (xfs_healthmon_has_eventdata(hm))
+ mask |= EPOLLIN;
+ return mask;
+}
+
+/* Free all events */
+STATIC void
+xfs_healthmon_free_events(
+ struct xfs_healthmon *hm)
+{
+ struct xfs_healthmon_event *event, *next;
+
+ event = hm->first_event;
+ while (event != NULL) {
+ trace_xfs_healthmon_drop(hm->mp, event);
+ next = event->next;
+ kfree(event);
+ event = next;
+ }
+ hm->first_event = hm->last_event = NULL;
}
/* Free the health monitoring information. */
@@ -66,6 +547,14 @@ xfs_healthmon_release(
{
struct xfs_healthmon *hm = file->private_data;
+ trace_xfs_healthmon_release(hm->mp, hm->events, hm->lost_prev_event);
+
+ wake_up_all(&hm->wait);
+
+ mutex_destroy(&hm->lock);
+ xfs_healthmon_free_events(hm);
+ if (hm->outbuf.size)
+ kfree(hm->outbuf.buffer);
kfree(hm);
return 0;
@@ -76,9 +565,9 @@ static inline bool
xfs_healthmon_validate(
const struct xfs_health_monitor *hmo)
{
- if (hmo->flags)
+ if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
return false;
- if (hmo->format)
+ if (hmo->format != XFS_HEALTH_MONITOR_FMT_V0)
return false;
if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
return false;
@@ -89,6 +578,17 @@ xfs_healthmon_validate(
/* Emit some data about the health monitoring fd. */
#ifdef CONFIG_PROC_FS
+static const char *
+xfs_healthmon_format_string(const struct xfs_healthmon *hm)
+{
+ switch (hm->format) {
+ case XFS_HEALTH_MONITOR_FMT_V0:
+ return "v0";
+ }
+
+ return "";
+}
+
static void
xfs_healthmon_show_fdinfo(
struct seq_file *m,
@@ -96,8 +596,13 @@ xfs_healthmon_show_fdinfo(
{
struct xfs_healthmon *hm = file->private_data;
- seq_printf(m, "state:\talive\ndev:\t%s\n",
- hm->mp->m_super->s_id);
+ mutex_lock(&hm->lock);
+ seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
+ hm->mp->m_super->s_id,
+ xfs_healthmon_format_string(hm),
+ hm->total_events,
+ hm->total_lost);
+ mutex_unlock(&hm->lock);
}
#endif
@@ -107,6 +612,7 @@ static const struct file_operations xfs_healthmon_fops = {
.show_fdinfo = xfs_healthmon_show_fdinfo,
#endif
.read_iter = xfs_healthmon_read_iter,
+ .poll = xfs_healthmon_poll,
.release = xfs_healthmon_release,
};
@@ -121,6 +627,7 @@ xfs_ioc_health_monitor(
{
struct xfs_health_monitor hmo;
struct xfs_healthmon *hm;
+ struct xfs_healthmon_event *event;
int fd;
int ret;
@@ -137,6 +644,23 @@ xfs_ioc_health_monitor(
if (!hm)
return -ENOMEM;
hm->mp = mp;
+ hm->format = hmo.format;
+
+ seq_buf_init(&hm->outbuf, NULL, 0);
+ mutex_init(&hm->lock);
+ init_waitqueue_head(&hm->wait);
+
+ if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
+ hm->verbose = true;
+
+ /* Queue up the first event that lets the client know we're running. */
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
+ XFS_HEALTHMON_MOUNT);
+ if (!event) {
+ ret = -ENOMEM;
+ goto out_mutex;
+ }
+ __xfs_healthmon_push(hm, event);
/*
* Create the anonymous file. If it succeeds, the file owns hm and
@@ -146,12 +670,16 @@ xfs_ioc_health_monitor(
O_CLOEXEC | O_RDONLY);
if (fd < 0) {
ret = fd;
- goto out_hm;
+ goto out_mutex;
}
+ trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
+
return fd;
-out_hm:
+out_mutex:
+ mutex_destroy(&hm->lock);
+ xfs_healthmon_free_events(hm);
kfree(hm);
return ret;
}
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index a60556dbd172ee..d42b864a3837a2 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -51,6 +51,8 @@
#include "xfs_rtgroup.h"
#include "xfs_zone_alloc.h"
#include "xfs_zone_priv.h"
+#include "xfs_health.h"
+#include "xfs_healthmon.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index f3f3436d60a940..f6a1fb46a1d6c9 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -245,6 +245,7 @@ int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len)
seq_buf_set_overflow(s);
return -1;
}
+EXPORT_SYMBOL_GPL(seq_buf_putmem);
#define MAX_MEMHEX_BYTES 8U
#define HEX_CHARS (MAX_MEMHEX_BYTES*2 + 1)
* [PATCH 12/22] xfs: report metadata health events through healthmon
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (10 preceding siblings ...)
2025-11-05 0:51 ` [PATCH 11/22] xfs: create event queuing, formatting, and discovery infrastructure Darrick J. Wong
@ 2025-11-05 0:51 ` Darrick J. Wong
2025-11-05 0:51 ` [PATCH 13/22] xfs: report shutdown " Darrick J. Wong
` (9 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:51 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Set up a metadata health event hook so that we can send events to
userspace as we collect information. The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.
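As a rough illustration of the weak-reference idea described above (not
the code in this patch), here is a minimal userspace sketch. All of the
demo_* names are hypothetical, and locking is elided; the kernel code
takes hm->lock around the equivalent checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mounted filesystem. */
struct demo_mount {
	const char *name;
};

/* Hypothetical monitor, only loosely modelled on struct xfs_healthmon. */
struct demo_monitor {
	struct demo_mount *mp;	/* weak reference; NULL once unmounted */
	unsigned int reported;	/* events accepted while mounted */
	unsigned int ignored;	/* events dropped after unmount */
};

/* Attach the monitor to a mount and zero its counters. */
static void demo_monitor_init(struct demo_monitor *hm, struct demo_mount *mp)
{
	assert(mp != NULL);
	hm->mp = mp;
	hm->reported = 0;
	hm->ignored = 0;
}

/* Health hook: accept the event only while the mount is still attached. */
static bool demo_report_event(struct demo_monitor *hm)
{
	if (!hm->mp) {
		hm->ignored++;
		return false;
	}
	hm->reported++;
	return true;
}

/* Unmount hook: sever the weak reference so later events are ignored. */
static void demo_unmount(struct demo_monitor *hm)
{
	hm->mp = NULL;
}
```

A caller would run demo_monitor_init() once, call demo_report_event() for
each health update, and call demo_unmount() from the unmount path; every
report after that point is counted as ignored rather than queued.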
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 38 +++++
fs/xfs/libxfs/xfs_health.h | 5 +
fs/xfs/xfs_healthmon.h | 31 ++++
fs/xfs/xfs_trace.h | 98 +++++++++++++
fs/xfs/xfs_health.c | 67 +++++++++
fs/xfs/xfs_healthmon.c | 333 +++++++++++++++++++++++++++++++++++++++++++-
6 files changed, 563 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index dfca42b2c31192..2ad45351ac0ea6 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1008,17 +1008,52 @@ struct xfs_rtgroup_geometry {
/* affects the whole fs */
#define XFS_HEALTH_MONITOR_DOMAIN_MOUNT (0)
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FS (1)
+#define XFS_HEALTH_MONITOR_DOMAIN_AG (2)
+#define XFS_HEALTH_MONITOR_DOMAIN_INODE (3)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP (4)
+
/* Health monitor event types */
/* status of the monitor itself */
#define XFS_HEALTH_MONITOR_TYPE_RUNNING (0)
#define XFS_HEALTH_MONITOR_TYPE_LOST (1)
+/* metadata health events */
+#define XFS_HEALTH_MONITOR_TYPE_SICK (2)
+#define XFS_HEALTH_MONITOR_TYPE_CORRUPT (3)
+#define XFS_HEALTH_MONITOR_TYPE_HEALTHY (4)
+
+/* filesystem was unmounted */
+#define XFS_HEALTH_MONITOR_TYPE_UNMOUNT (5)
+
/* lost events */
struct xfs_health_monitor_lost {
__u64 count;
};
+/* fs/rt metadata */
+struct xfs_health_monitor_fs {
+ /* XFS_FSOP_GEOM_SICK_* flags */
+ __u32 mask;
+};
+
+/* ag/rtgroup metadata */
+struct xfs_health_monitor_group {
+ /* XFS_{AG,RTGROUP}_SICK_* flags */
+ __u32 mask;
+ __u32 gno;
+};
+
+/* inode metadata */
+struct xfs_health_monitor_inode {
+ /* XFS_BS_SICK_* flags */
+ __u32 mask;
+ __u32 gen;
+ __u64 ino;
+};
+
struct xfs_health_monitor_event {
/* XFS_HEALTH_MONITOR_DOMAIN_* */
__u32 domain;
@@ -1036,6 +1071,9 @@ struct xfs_health_monitor_event {
*/
union {
struct xfs_health_monitor_lost lost;
+ struct xfs_health_monitor_fs fs;
+ struct xfs_health_monitor_group group;
+ struct xfs_health_monitor_inode inode;
} e;
/* zeroes */
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 39fef33dedc6a8..9ff3bf8ba4ed8f 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -336,4 +336,9 @@ void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
#endif /* CONFIG_XFS_LIVE_HOOKS */
+unsigned int xfs_healthmon_inode_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_rtgroup_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_perag_mask(unsigned int sick_mask);
+unsigned int xfs_healthmon_fs_mask(unsigned int sick_mask);
+
#endif /* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index ea2d6a327dfb16..3f3ba16d5af56a 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -9,10 +9,23 @@
enum xfs_healthmon_type {
XFS_HEALTHMON_RUNNING, /* monitor running */
XFS_HEALTHMON_LOST, /* message lost */
+ XFS_HEALTHMON_UNMOUNT, /* filesystem is unmounting */
+
+ /* metadata health events */
+ XFS_HEALTHMON_SICK, /* runtime corruption observed */
+ XFS_HEALTHMON_CORRUPT, /* fsck reported corruption */
+ XFS_HEALTHMON_HEALTHY, /* fsck reported healthy structure */
+
};
enum xfs_healthmon_domain {
XFS_HEALTHMON_MOUNT, /* affects the whole fs */
+
+ /* metadata health events */
+ XFS_HEALTHMON_FS, /* main filesystem metadata */
+ XFS_HEALTHMON_AG, /* allocation group metadata */
+ XFS_HEALTHMON_INODE, /* inode metadata */
+ XFS_HEALTHMON_RTGROUP, /* realtime group metadata */
};
struct xfs_healthmon_event {
@@ -32,6 +45,24 @@ struct xfs_healthmon_event {
struct {
unsigned int flags;
};
+ /* fs/rt metadata */
+ struct {
+ /* XFS_SICK_* flags */
+ unsigned int fsmask;
+ };
+ /* ag/rtgroup metadata */
+ struct {
+ /* XFS_SICK_* flags */
+ unsigned int grpmask;
+ unsigned int group;
+ };
+ /* inode metadata */
+ struct {
+ /* XFS_SICK_INO_* flags */
+ unsigned int imask;
+ uint32_t gen;
+ xfs_ino_t ino;
+ };
};
};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 309af9082c4179..051599f8433ed6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6010,14 +6010,30 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
#define XFS_HEALTHMON_TYPE_STRINGS \
- { XFS_HEALTHMON_LOST, "lost" }
+ { XFS_HEALTHMON_LOST, "lost" }, \
+ { XFS_HEALTHMON_UNMOUNT, "unmount" }, \
+ { XFS_HEALTHMON_SICK, "sick" }, \
+ { XFS_HEALTHMON_CORRUPT, "corrupt" }, \
+ { XFS_HEALTHMON_HEALTHY, "healthy" }
#define XFS_HEALTHMON_DOMAIN_STRINGS \
- { XFS_HEALTHMON_MOUNT, "mount" }
+ { XFS_HEALTHMON_MOUNT, "mount" }, \
+ { XFS_HEALTHMON_FS, "fs" }, \
+ { XFS_HEALTHMON_AG, "ag" }, \
+ { XFS_HEALTHMON_INODE, "inode" }, \
+ { XFS_HEALTHMON_RTGROUP, "rtgroup" }
TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6053,6 +6069,19 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
break;
}
break;
+ case XFS_HEALTHMON_FS:
+ __entry->mask = event->fsmask;
+ break;
+ case XFS_HEALTHMON_AG:
+ case XFS_HEALTHMON_RTGROUP:
+ __entry->mask = event->grpmask;
+ __entry->group = event->group;
+ break;
+ case XFS_HEALTHMON_INODE:
+ __entry->mask = event->imask;
+ __entry->ino = event->ino;
+ __entry->gen = event->gen;
+ break;
}
),
TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6071,11 +6100,76 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
DEFINE_EVENT(xfs_healthmon_event_class, name, \
TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_insert);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format_overflow);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+
+#define XFS_HEALTHUP_TYPE_STRINGS \
+ { XFS_HEALTHUP_UNMOUNT, "unmount" }, \
+ { XFS_HEALTHUP_SICK, "sick" }, \
+ { XFS_HEALTHUP_CORRUPT, "corrupt" }, \
+ { XFS_HEALTHUP_HEALTHY, "healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+ { XFS_HEALTHUP_FS, "fs" }, \
+ { XFS_HEALTHUP_AG, "ag" }, \
+ { XFS_HEALTHUP_INODE, "inode" }, \
+ { XFS_HEALTHUP_RTGROUP, "rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+ TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+ const struct xfs_health_update_params *update,
+ unsigned int events, unsigned long long lost_prev),
+ TP_ARGS(mp, type, update, events, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(unsigned long, type)
+ __field(unsigned int, domain)
+ __field(unsigned int, old_mask)
+ __field(unsigned int, new_mask)
+ __field(unsigned long long, ino)
+ __field(unsigned int, gen)
+ __field(unsigned int, group)
+ __field(unsigned int, events)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->type = type;
+ __entry->domain = update->domain;
+ __entry->old_mask = update->old_mask;
+ __entry->new_mask = update->new_mask;
+ __entry->ino = update->ino;
+ __entry->gen = update->gen;
+ __entry->group = update->group;
+ __entry->events = events;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+ __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+ __entry->old_mask,
+ __entry->new_mask,
+ __entry->ino,
+ __entry->gen,
+ __entry->group,
+ __entry->events,
+ __entry->lost_prev)
+);
#endif /* CONFIG_XFS_HEALTH_MONITOR */
#endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 71952d5eec2a9e..da827060853a8f 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -609,6 +609,25 @@ xfs_fsop_geom_health(
}
}
+/*
+ * Translate XFS_SICK_FS_* into XFS_FSOP_GEOM_SICK_* except for the rt free
+ * space codes, which are sent via the rtgroup events.
+ */
+unsigned int
+xfs_healthmon_fs_mask(
+ unsigned int sick_mask)
+{
+ const struct ioctl_sick_map *m;
+ unsigned int ioctl_mask = 0;
+
+ for_each_sick_map(fs_map, m) {
+ if (sick_mask & m->sick_mask)
+ ioctl_mask |= m->ioctl_mask;
+ }
+
+ return ioctl_mask;
+}
+
static const struct ioctl_sick_map ag_map[] = {
{ XFS_SICK_AG_SB, XFS_AG_GEOM_SICK_SB },
{ XFS_SICK_AG_AGF, XFS_AG_GEOM_SICK_AGF },
@@ -645,6 +664,22 @@ xfs_ag_geom_health(
}
}
+/* Translate XFS_SICK_AG_* into XFS_AG_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_perag_mask(
+ unsigned int sick_mask)
+{
+ const struct ioctl_sick_map *m;
+ unsigned int ioctl_mask = 0;
+
+ for_each_sick_map(ag_map, m) {
+ if (sick_mask & m->sick_mask)
+ ioctl_mask |= m->ioctl_mask;
+ }
+
+ return ioctl_mask;
+}
+
static const struct ioctl_sick_map rtgroup_map[] = {
{ XFS_SICK_RG_SUPER, XFS_RTGROUP_GEOM_SICK_SUPER },
{ XFS_SICK_RG_BITMAP, XFS_RTGROUP_GEOM_SICK_BITMAP },
@@ -675,6 +710,22 @@ xfs_rtgroup_geom_health(
}
}
+/* Translate XFS_SICK_RG_* into XFS_RTGROUP_GEOM_SICK_*. */
+unsigned int
+xfs_healthmon_rtgroup_mask(
+ unsigned int sick_mask)
+{
+ const struct ioctl_sick_map *m;
+ unsigned int ioctl_mask = 0;
+
+ for_each_sick_map(rtgroup_map, m) {
+ if (sick_mask & m->sick_mask)
+ ioctl_mask |= m->ioctl_mask;
+ }
+
+ return ioctl_mask;
+}
+
static const struct ioctl_sick_map ino_map[] = {
{ XFS_SICK_INO_CORE, XFS_BS_SICK_INODE },
{ XFS_SICK_INO_BMBTD, XFS_BS_SICK_BMBTD },
@@ -713,6 +764,22 @@ xfs_bulkstat_health(
}
}
+/* Translate XFS_SICK_INO_* into XFS_BS_SICK_*. */
+unsigned int
+xfs_healthmon_inode_mask(
+ unsigned int sick_mask)
+{
+ const struct ioctl_sick_map *m;
+ unsigned int ioctl_mask = 0;
+
+ for_each_sick_map(ino_map, m) {
+ if (sick_mask & m->sick_mask)
+ ioctl_mask |= m->ioctl_mask;
+ }
+
+ return ioctl_mask;
+}
+
/* Mark a block mapping sick. */
void
xfs_bmap_mark_sick(
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 8cf6b0b81a721b..d1474e6b9ab544 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -18,6 +18,7 @@
#include "xfs_da_btree.h"
#include "xfs_quota_defs.h"
#include "xfs_rtgroup.h"
+#include "xfs_health.h"
#include "xfs_healthmon.h"
#include <linux/anon_inodes.h>
@@ -62,8 +63,15 @@ struct xfs_healthmon {
struct xfs_healthmon_event *first_event;
struct xfs_healthmon_event *last_event;
+ /* live update hooks */
+ struct xfs_health_hook hhook;
+
+ /* filesystem mount, or NULL if we've unmounted */
struct xfs_mount *mp;
+ /* filesystem type for safe cleanup of hooks; requires module_get */
+ struct file_system_type *fstyp;
+
/* number of events */
unsigned int events;
@@ -128,6 +136,23 @@ xfs_healthmon_free_head(
return 0;
}
+/* Insert an event onto the start of the list. */
+static inline void
+__xfs_healthmon_insert(
+ struct xfs_healthmon *hm,
+ struct xfs_healthmon_event *event)
+{
+ event->next = hm->first_event;
+ /* The new event always becomes the head of the list. */
+ hm->first_event = event;
+ if (!hm->last_event)
+ hm->last_event = event;
+ xfs_healthmon_bump_events(hm);
+ wake_up(&hm->wait);
+
+ trace_xfs_healthmon_insert(hm->mp, event);
+}
+
/* Push an event onto the end of the list. */
static inline void
__xfs_healthmon_push(
@@ -199,6 +224,10 @@ xfs_healthmon_start_live_update(
{
struct xfs_healthmon_event *event;
+ /* Filesystem already unmounted, do nothing. */
+ if (!hm->mp)
+ return -ESHUTDOWN;
+
/* If the queue is already full.... */
if (hm->events >= XFS_HEALTHMON_MAX_EVENTS) {
trace_xfs_healthmon_lost_event(hm->mp, hm->lost_prev_event);
@@ -271,6 +300,185 @@ xfs_healthmon_start_live_update(
return 0;
}
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+ struct xfs_healthmon *hm,
+ enum xfs_health_update_type type,
+ const struct xfs_health_update_params *hup,
+ unsigned int *mask)
+{
+ /* Always report unmounts. */
+ if (type == XFS_HEALTHUP_UNMOUNT)
+ return true;
+
+ /* If we want all events, return all events. */
+ if (hm->verbose) {
+ *mask = hup->new_mask;
+ return true;
+ }
+
+ switch (type) {
+ case XFS_HEALTHUP_SICK:
+ /* Always report runtime corruptions */
+ *mask = hup->new_mask;
+ break;
+ case XFS_HEALTHUP_CORRUPT:
+ /* Only report new fsck errors */
+ *mask = hup->new_mask & ~hup->old_mask;
+ break;
+ case XFS_HEALTHUP_HEALTHY:
+ /* Only report healthy metadata that got fixed */
+ *mask = hup->new_mask & hup->old_mask;
+ break;
+ case XFS_HEALTHUP_UNMOUNT:
+ /* This is here for static enum checking */
+ break;
+ }
+
+ /* If not in verbose mode, mask state has to change. */
+ return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+ enum xfs_health_update_type type)
+{
+ switch (type) {
+ case XFS_HEALTHUP_SICK:
+ return XFS_HEALTHMON_SICK;
+ case XFS_HEALTHUP_CORRUPT:
+ return XFS_HEALTHMON_CORRUPT;
+ case XFS_HEALTHUP_HEALTHY:
+ return XFS_HEALTHMON_HEALTHY;
+ case XFS_HEALTHUP_UNMOUNT:
+ /* static checking */
+ break;
+ }
+ return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+ enum xfs_health_update_domain domain)
+{
+ switch (domain) {
+ case XFS_HEALTHUP_FS:
+ return XFS_HEALTHMON_FS;
+ case XFS_HEALTHUP_AG:
+ return XFS_HEALTHMON_AG;
+ case XFS_HEALTHUP_RTGROUP:
+ return XFS_HEALTHMON_RTGROUP;
+ case XFS_HEALTHUP_INODE:
+ /* static checking */
+ break;
+ }
+ return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+ struct notifier_block *nb,
+ unsigned long action,
+ void *data)
+{
+ struct xfs_health_update_params *hup = data;
+ struct xfs_healthmon *hm;
+ struct xfs_healthmon_event *event;
+ enum xfs_health_update_type type = action;
+ unsigned int mask = 0;
+ int error;
+
+ hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+ /* Decode event mask and skip events we don't care about. */
+ if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+ return NOTIFY_DONE;
+
+ mutex_lock(&hm->lock);
+
+ trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+ hm->lost_prev_event);
+
+ error = xfs_healthmon_start_live_update(hm);
+ if (error)
+ goto out_unlock;
+
+ if (type == XFS_HEALTHUP_UNMOUNT) {
+ /*
+ * The filesystem is unmounting, so we must detach from the
+ * mount. After this point, the healthmon thread has no
+ * connection to the mounted filesystem and must not touch its
+ * hooks.
+ */
+ trace_xfs_healthmon_unmount(hm->mp, hm->events,
+ hm->lost_prev_event);
+
+ hm->mp = NULL;
+
+ /*
+ * Try to add an unmount message to the head of the list so
+ * that userspace will notice the unmount. If we can't add
+ * the event, wake up the reader directly.
+ */
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_UNMOUNT,
+ XFS_HEALTHMON_MOUNT);
+ if (event)
+ __xfs_healthmon_insert(hm, event);
+ else
+ wake_up(&hm->wait);
+
+ goto out_unlock;
+ }
+
+ event = xfs_healthmon_alloc(hm, health_update_to_type(type),
+ health_update_to_domain(hup->domain));
+ if (!event)
+ goto out_unlock;
+
+ /* Ignore the event if it's only reporting a secondary health state. */
+ switch (event->domain) {
+ case XFS_HEALTHMON_FS:
+ event->fsmask = mask & ~XFS_SICK_FS_SECONDARY;
+ if (!event->fsmask)
+ goto out_event;
+ break;
+ case XFS_HEALTHMON_AG:
+ event->grpmask = mask & ~XFS_SICK_AG_SECONDARY;
+ if (!event->grpmask)
+ goto out_event;
+ event->group = hup->group;
+ break;
+ case XFS_HEALTHMON_RTGROUP:
+ event->grpmask = mask & ~XFS_SICK_RG_SECONDARY;
+ if (!event->grpmask)
+ goto out_event;
+ event->group = hup->group;
+ break;
+ case XFS_HEALTHMON_INODE:
+ event->imask = mask & ~XFS_SICK_INO_SECONDARY;
+ if (!event->imask)
+ goto out_event;
+ event->ino = hup->ino;
+ event->gen = hup->gen;
+ break;
+ default:
+ ASSERT(0);
+ break;
+ }
+ error = xfs_healthmon_push(hm, event);
+ if (error)
+ goto out_event;
+
+out_unlock:
+ mutex_unlock(&hm->lock);
+ return NOTIFY_DONE;
+out_event:
+ kfree(event);
+ goto out_unlock;
+}
+
static inline void
xfs_healthmon_reset_outbuf(
struct xfs_healthmon *hm)
@@ -281,11 +489,19 @@ xfs_healthmon_reset_outbuf(
static const unsigned int domain_map[] = {
[XFS_HEALTHMON_MOUNT] = XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
+ [XFS_HEALTHMON_FS] = XFS_HEALTH_MONITOR_DOMAIN_FS,
+ [XFS_HEALTHMON_AG] = XFS_HEALTH_MONITOR_DOMAIN_AG,
+ [XFS_HEALTHMON_INODE] = XFS_HEALTH_MONITOR_DOMAIN_INODE,
+ [XFS_HEALTHMON_RTGROUP] = XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
};
static const unsigned int type_map[] = {
[XFS_HEALTHMON_RUNNING] = XFS_HEALTH_MONITOR_TYPE_RUNNING,
[XFS_HEALTHMON_LOST] = XFS_HEALTH_MONITOR_TYPE_LOST,
+ [XFS_HEALTHMON_SICK] = XFS_HEALTH_MONITOR_TYPE_SICK,
+ [XFS_HEALTHMON_CORRUPT] = XFS_HEALTH_MONITOR_TYPE_CORRUPT,
+ [XFS_HEALTHMON_HEALTHY] = XFS_HEALTH_MONITOR_TYPE_HEALTHY,
+ [XFS_HEALTHMON_UNMOUNT] = XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
};
/* Render event as a V0 structure */
@@ -321,6 +537,22 @@ xfs_healthmon_format_v0(
break;
}
break;
+ case XFS_HEALTHMON_FS:
+ hme.e.fs.mask = xfs_healthmon_fs_mask(event->fsmask);
+ break;
+ case XFS_HEALTHMON_RTGROUP:
+ hme.e.group.mask = xfs_healthmon_rtgroup_mask(event->grpmask);
+ hme.e.group.gno = event->group;
+ break;
+ case XFS_HEALTHMON_AG:
+ hme.e.group.mask = xfs_healthmon_perag_mask(event->grpmask);
+ hme.e.group.gno = event->group;
+ break;
+ case XFS_HEALTHMON_INODE:
+ hme.e.inode.mask = xfs_healthmon_inode_mask(event->imask);
+ hme.e.inode.ino = event->ino;
+ hme.e.inode.gen = event->gen;
+ break;
default:
break;
}
@@ -361,7 +593,7 @@ static inline bool
xfs_healthmon_has_eventdata(
struct xfs_healthmon *hm)
{
- return hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
+ return !hm->mp || hm->events > 0 || xfs_healthmon_outbuf_bytes(hm) > 0;
}
/* Try to copy the rest of the outbuf to the iov iter. */
@@ -404,10 +636,16 @@ xfs_healthmon_peek(
struct xfs_healthmon_event *event;
mutex_lock(&hm->lock);
+ event = hm->first_event;
if (hm->mp)
- event = hm->first_event;
- else
- event = NULL;
+ goto done;
+
+ /* If the filesystem is unmounted, only return the unmount event */
+ if (event && event->type == XFS_HEALTHMON_UNMOUNT)
+ goto done;
+ event = NULL;
+
+done:
mutex_unlock(&hm->lock);
return event;
}
@@ -539,6 +777,58 @@ xfs_healthmon_free_events(
hm->first_event = hm->last_event = NULL;
}
+/*
+ * Detach all filesystem hooks that were set up for a health monitor. Only
+ * call this from iterate_super*.
+ */
+STATIC void
+xfs_healthmon_detach_hooks(
+ struct super_block *sb,
+ void *arg)
+{
+ struct xfs_healthmon *hm = arg;
+
+ mutex_lock(&hm->lock);
+
+ /*
+ * Because health monitors have a weak reference to the filesystem
+ * they're monitoring, the hook deletions below must not race against
+ * that filesystem being unmounted because that could lead to UAF
+ * errors.
+ *
+ * If hm->mp is NULL, the health unmount hook already ran and the hook
+ * chain head (contained within the xfs_mount structure) is gone. Do
+ * not detach any hooks; just let them get freed when the healthmon
+ * object is torn down.
+ */
+ if (!hm->mp)
+ goto out_unlock;
+
+ /*
+ * Otherwise, the caller gave us a non-dying @sb with s_umount held in
+ * shared mode, which means that @sb cannot be running through
+ * deactivate_locked_super and cannot be freed. It's safe to compare
+ * @sb against the super that we snapshotted when we set up the health
+ * monitor.
+ */
+ if (hm->mp->m_super != sb)
+ goto out_unlock;
+
+ mutex_unlock(&hm->lock);
+
+ /*
+ * Now we know that the filesystem @hm->mp is active and cannot be
+ * deactivated until this function returns. Unmount events are sent
+ * through the health monitoring subsystem from xfs_fs_put_super, so
+ * it is now time to detach the hooks.
+ */
+ xfs_health_hook_del(hm->mp, &hm->hhook);
+ return;
+
+out_unlock:
+ mutex_unlock(&hm->lock);
+}
+
/* Free the health monitoring information. */
STATIC int
xfs_healthmon_release(
@@ -551,6 +841,9 @@ xfs_healthmon_release(
wake_up_all(&hm->wait);
+ iterate_supers_type(hm->fstyp, xfs_healthmon_detach_hooks, hm);
+ xfs_health_hook_disable();
+
mutex_destroy(&hm->lock);
xfs_healthmon_free_events(hm);
if (hm->outbuf.size)
@@ -597,11 +890,18 @@ xfs_healthmon_show_fdinfo(
struct xfs_healthmon *hm = file->private_data;
mutex_lock(&hm->lock);
+ if (!hm->mp) {
+ seq_printf(m, "state:\tdead\n");
+ goto out_unlock;
+ }
+
seq_printf(m, "state:\talive\ndev:\t%s\nformat:\t%s\nevents:\t%llu\nlost:\t%llu\n",
hm->mp->m_super->s_id,
xfs_healthmon_format_string(hm),
hm->total_events,
hm->total_lost);
+
+out_unlock:
mutex_unlock(&hm->lock);
}
#endif
@@ -646,6 +946,13 @@ xfs_ioc_health_monitor(
hm->mp = mp;
hm->format = hmo.format;
+ /*
+ * Since we already hold a reference to the module, record a pointer to
+ * the fstype to make it easier to detach the hooks when we tear things
+ * down later.
+ */
+ hm->fstyp = mp->m_super->s_type;
+
seq_buf_init(&hm->outbuf, NULL, 0);
mutex_init(&hm->lock);
init_waitqueue_head(&hm->wait);
@@ -653,12 +960,21 @@ xfs_ioc_health_monitor(
if (hmo.flags & XFS_HEALTH_MONITOR_VERBOSE)
hm->verbose = true;
+ /* Enable hooks to receive events, generally. */
+ xfs_health_hook_enable();
+
+ /* Attach specific event hooks to this monitor. */
+ xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+ ret = xfs_health_hook_add(mp, &hm->hhook);
+ if (ret)
+ goto out_hooks;
+
/* Queue up the first event that lets the client know we're running. */
event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
XFS_HEALTHMON_MOUNT);
if (!event) {
ret = -ENOMEM;
- goto out_mutex;
+ goto out_healthhook;
}
__xfs_healthmon_push(hm, event);
@@ -670,14 +986,17 @@ xfs_ioc_health_monitor(
O_CLOEXEC | O_RDONLY);
if (fd < 0) {
ret = fd;
- goto out_mutex;
+ goto out_healthhook;
}
trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
return fd;
-out_mutex:
+out_healthhook:
+ xfs_health_hook_del(mp, &hm->hhook);
+out_hooks:
+ xfs_health_hook_disable();
mutex_destroy(&hm->lock);
xfs_healthmon_free_events(hm);
kfree(hm);
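Note, not part of the patch: the filtering rules in
xfs_healthmon_event_mask() above can be modelled in userspace to see
which bits end up in an event. This is only a sketch. The enum and the
function name are hypothetical stand-ins for XFS_HEALTHUP_* and the
kernel helper, and the masks in the usage note are arbitrary example
bits, not real XFS_SICK_* values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's XFS_HEALTHUP_* update types. */
enum update_type {
	UPDATE_SICK,
	UPDATE_CORRUPT,
	UPDATE_HEALTHY,
	UPDATE_UNMOUNT,
};

/*
 * Userspace model of the filtering rules.  Returns true if an event should
 * be queued and stores the bits to report in *mask.
 */
static bool
model_event_mask(
	bool			verbose,
	enum update_type	type,
	unsigned int		old_mask,
	unsigned int		new_mask,
	unsigned int		*mask)
{
	assert(mask != NULL);
	*mask = 0;

	/* Unmounts are always reported and carry no mask. */
	if (type == UPDATE_UNMOUNT)
		return true;

	/* Verbose monitors see every update unfiltered. */
	if (verbose) {
		*mask = new_mask;
		return true;
	}

	switch (type) {
	case UPDATE_SICK:
		/* Runtime corruption: report everything observed. */
		*mask = new_mask;
		break;
	case UPDATE_CORRUPT:
		/* fsck findings: only bits that were not already set. */
		*mask = new_mask & ~old_mask;
		break;
	case UPDATE_HEALTHY:
		/* Repairs: only bits that were previously marked bad. */
		*mask = new_mask & old_mask;
		break;
	case UPDATE_UNMOUNT:
		break;
	}

	/* Outside verbose mode, an empty mask means nothing to report. */
	return *mask != 0;
}
```

With old_mask 0x3 and new_mask 0x6, a CORRUPT update reports 0x4 (the
newly found bit) and a HEALTHY update reports 0x2 (the bit that had
previously been marked bad). A CORRUPT update that repeats an already
known mask is dropped unless the monitor is verbose.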
* [PATCH 13/22] xfs: report shutdown events through healthmon
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (11 preceding siblings ...)
2025-11-05 0:51 ` [PATCH 12/22] xfs: report metadata health events through healthmon Darrick J. Wong
@ 2025-11-05 0:51 ` Darrick J. Wong
2025-11-05 0:52 ` [PATCH 14/22] xfs: report media errors " Darrick J. Wong
` (8 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:51 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Set up a shutdown hook so that we can send notifications to userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 18 +++++++++
fs/xfs/xfs_healthmon.h | 5 ++-
fs/xfs/xfs_trace.h | 28 ++++++++++++++
fs/xfs/xfs_healthmon.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 141 insertions(+), 3 deletions(-)
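
Note, not part of the commit: a minimal sketch of how a userspace client
might decode the reasons field of struct xfs_health_monitor_shutdown.
The flag values are copied from the xfs_fs.h hunk in this patch. The
display names and the helper format_shutdown_reasons() are hypothetical;
the patch defines neither.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values copied from the xfs_fs.h hunk in this patch. */
#define XFS_HEALTH_SHUTDOWN_META_IO_ERROR	(1u << 0)
#define XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR	(1u << 1)
#define XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT	(1u << 2)
#define XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE	(1u << 3)
#define XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK	(1u << 4)
#define XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED	(1u << 5)

/* Hypothetical display names; the patch does not define any. */
static const struct {
	uint32_t	flag;
	const char	*name;
} reason_names[] = {
	{ XFS_HEALTH_SHUTDOWN_META_IO_ERROR,	"meta_io_error" },
	{ XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR,	"log_io_error" },
	{ XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
	{ XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
	{ XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
	{ XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
};

/*
 * Write a comma-separated list of the reasons set in @reasons to @buf.
 * Returns the number of names written; stops early if @buf is too small.
 */
static int
format_shutdown_reasons(
	uint32_t	reasons,
	char		*buf,
	size_t		buflen)
{
	size_t		used = 0;
	size_t		i;
	int		nr = 0;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(reason_names) / sizeof(reason_names[0]); i++) {
		const char	*name = reason_names[i].name;
		size_t		namelen = strlen(name);
		size_t		need = namelen + (nr ? 1 : 0);

		if (!(reasons & reason_names[i].flag))
			continue;

		/* Stop if separator + name + NUL no longer fit. */
		if (need >= buflen - used)
			break;

		if (nr)
			buf[used++] = ',';
		memcpy(buf + used, name, namelen + 1);
		used += namelen;
		nr++;
	}

	return nr;
}
```

Several bits may be set at once, so a client should treat reasons as a
mask rather than compare it against a single value. Bits the table does
not know about are skipped silently in this sketch.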
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2ad45351ac0ea6..677141a17605a4 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1028,6 +1028,9 @@ struct xfs_rtgroup_geometry {
/* filesystem was unmounted */
#define XFS_HEALTH_MONITOR_TYPE_UNMOUNT (5)
+/* filesystem shutdown */
+#define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN (6)
+
/* lost events */
struct xfs_health_monitor_lost {
__u64 count;
@@ -1054,6 +1057,20 @@ struct xfs_health_monitor_inode {
__u64 ino;
};
+/* shutdown reasons */
+#define XFS_HEALTH_SHUTDOWN_META_IO_ERROR (1u << 0)
+#define XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR (1u << 1)
+#define XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT (1u << 2)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE (1u << 3)
+#define XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK (1u << 4)
+#define XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED (1u << 5)
+
+/* shutdown */
+struct xfs_health_monitor_shutdown {
+ /* XFS_HEALTH_SHUTDOWN_* flags */
+ __u32 reasons;
+};
+
struct xfs_health_monitor_event {
/* XFS_HEALTH_MONITOR_DOMAIN_* */
__u32 domain;
@@ -1074,6 +1091,7 @@ struct xfs_health_monitor_event {
struct xfs_health_monitor_fs fs;
struct xfs_health_monitor_group group;
struct xfs_health_monitor_inode inode;
+ struct xfs_health_monitor_shutdown shutdown;
} e;
/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 3f3ba16d5af56a..a82a684bbc0e03 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -11,6 +11,9 @@ enum xfs_healthmon_type {
XFS_HEALTHMON_LOST, /* message lost */
XFS_HEALTHMON_UNMOUNT, /* filesystem is unmounting */
+ /* filesystem shutdown */
+ XFS_HEALTHMON_SHUTDOWN,
+
/* metadata health events */
XFS_HEALTHMON_SICK, /* runtime corruption observed */
XFS_HEALTHMON_CORRUPT, /* fsck reported corruption */
@@ -41,7 +44,7 @@ struct xfs_healthmon_event {
struct {
uint64_t lostcount;
};
- /* mount */
+ /* shutdown */
struct {
unsigned int flags;
};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 051599f8433ed6..b2b056ceb52f5c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6009,8 +6009,32 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_read_finish);
DEFINE_HEALTHMON_EVENT(xfs_healthmon_release);
DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+ TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+ unsigned int events, unsigned long long lost_prev),
+ TP_ARGS(mp, shutdown_flags, events, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(uint32_t, shutdown_flags)
+ __field(unsigned int, events)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->shutdown_flags = shutdown_flags;
+ __entry->events = events;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev? %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+ __entry->events,
+ __entry->lost_prev)
+);
+
#define XFS_HEALTHMON_TYPE_STRINGS \
{ XFS_HEALTHMON_LOST, "lost" }, \
+ { XFS_HEALTHMON_SHUTDOWN, "shutdown" }, \
{ XFS_HEALTHMON_UNMOUNT, "unmount" }, \
{ XFS_HEALTHMON_SICK, "sick" }, \
{ XFS_HEALTHMON_CORRUPT, "corrupt" }, \
@@ -6024,6 +6048,7 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
{ XFS_HEALTHMON_RTGROUP, "rtgroup" }
TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
@@ -6064,6 +6089,9 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
switch (__entry->domain) {
case XFS_HEALTHMON_MOUNT:
switch (__entry->type) {
+ case XFS_HEALTHMON_SHUTDOWN:
+ __entry->mask = event->flags;
+ break;
case XFS_HEALTHMON_LOST:
__entry->lostcount = event->lostcount;
break;
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d1474e6b9ab544..f36d7fbfb1ca16 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -20,6 +20,7 @@
#include "xfs_rtgroup.h"
#include "xfs_health.h"
#include "xfs_healthmon.h"
+#include "xfs_fsops.h"
#include <linux/anon_inodes.h>
#include <linux/eventpoll.h>
@@ -64,6 +65,7 @@ struct xfs_healthmon {
struct xfs_healthmon_event *last_event;
/* live update hooks */
+ struct xfs_shutdown_hook shook;
struct xfs_health_hook hhook;
/* filesystem mount, or NULL if we've unmounted */
@@ -479,6 +481,43 @@ xfs_healthmon_metadata_hook(
goto out_unlock;
}
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+ struct notifier_block *nb,
+ unsigned long action,
+ void *data)
+{
+ struct xfs_healthmon *hm;
+ struct xfs_healthmon_event *event;
+ int error;
+
+ hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+ mutex_lock(&hm->lock);
+
+ trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+ hm->lost_prev_event);
+
+ error = xfs_healthmon_start_live_update(hm);
+ if (error)
+ goto out_unlock;
+
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_SHUTDOWN,
+ XFS_HEALTHMON_MOUNT);
+ if (!event)
+ goto out_unlock;
+
+ event->flags = action;
+ error = xfs_healthmon_push(hm, event);
+ if (error)
+ kfree(event);
+
+out_unlock:
+ mutex_unlock(&hm->lock);
+ return NOTIFY_DONE;
+}
+
static inline void
xfs_healthmon_reset_outbuf(
struct xfs_healthmon *hm)
@@ -487,6 +526,44 @@ xfs_healthmon_reset_outbuf(
seq_buf_clear(&hm->outbuf);
}
+struct flags_map {
+ unsigned int in_mask;
+ unsigned int out_mask;
+};
+
+static const struct flags_map shutdown_map[] = {
+ { SHUTDOWN_META_IO_ERROR, XFS_HEALTH_SHUTDOWN_META_IO_ERROR },
+ { SHUTDOWN_LOG_IO_ERROR, XFS_HEALTH_SHUTDOWN_LOG_IO_ERROR },
+ { SHUTDOWN_FORCE_UMOUNT, XFS_HEALTH_SHUTDOWN_FORCE_UMOUNT },
+ { SHUTDOWN_CORRUPT_INCORE, XFS_HEALTH_SHUTDOWN_CORRUPT_INCORE },
+ { SHUTDOWN_CORRUPT_ONDISK, XFS_HEALTH_SHUTDOWN_CORRUPT_ONDISK },
+ { SHUTDOWN_DEVICE_REMOVED, XFS_HEALTH_SHUTDOWN_DEVICE_REMOVED },
+};
+
+static inline unsigned int
+__map_flags(
+ const struct flags_map *map,
+ size_t array_len,
+ unsigned int flags)
+{
+ const struct flags_map *m;
+ unsigned int ret = 0;
+
+ for (m = map; m < map + array_len; m++) {
+ if (flags & m->in_mask)
+ ret |= m->out_mask;
+ }
+
+ return ret;
+}
+
+#define map_flags(map, flags) __map_flags((map), ARRAY_SIZE(map), (flags))
+
+static inline unsigned int shutdown_mask(unsigned int in)
+{
+ return map_flags(shutdown_map, in);
+}
+
static const unsigned int domain_map[] = {
[XFS_HEALTHMON_MOUNT] = XFS_HEALTH_MONITOR_DOMAIN_MOUNT,
[XFS_HEALTHMON_FS] = XFS_HEALTH_MONITOR_DOMAIN_FS,
@@ -502,6 +579,7 @@ static const unsigned int type_map[] = {
[XFS_HEALTHMON_CORRUPT] = XFS_HEALTH_MONITOR_TYPE_CORRUPT,
[XFS_HEALTHMON_HEALTHY] = XFS_HEALTH_MONITOR_TYPE_HEALTHY,
[XFS_HEALTHMON_UNMOUNT] = XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
+ [XFS_HEALTHMON_SHUTDOWN] = XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
};
/* Render event as a V0 structure */
@@ -533,6 +611,9 @@ xfs_healthmon_format_v0(
case XFS_HEALTHMON_LOST:
hme.e.lost.count = event->lostcount;
break;
+ case XFS_HEALTHMON_SHUTDOWN:
+ hme.e.shutdown.reasons = shutdown_mask(event->flags);
+ break;
default:
break;
}
@@ -822,6 +903,7 @@ xfs_healthmon_detach_hooks(
* through the health monitoring subsystem from xfs_fs_put_super, so
* it is now time to detach the hooks.
*/
+ xfs_shutdown_hook_del(hm->mp, &hm->shook);
xfs_health_hook_del(hm->mp, &hm->hhook);
return;
@@ -969,12 +1051,17 @@ xfs_ioc_health_monitor(
if (ret)
goto out_hooks;
+ xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+ ret = xfs_shutdown_hook_add(mp, &hm->shook);
+ if (ret)
+ goto out_healthhook;
+
/* Queue up the first event that lets the client know we're running. */
event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
XFS_HEALTHMON_MOUNT);
if (!event) {
ret = -ENOMEM;
- goto out_healthhook;
+ goto out_shutdownhook;
}
__xfs_healthmon_push(hm, event);
@@ -986,13 +1073,15 @@ xfs_ioc_health_monitor(
O_CLOEXEC | O_RDONLY);
if (fd < 0) {
ret = fd;
- goto out_healthhook;
+ goto out_shutdownhook;
}
trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
return fd;
+out_shutdownhook:
+ xfs_shutdown_hook_del(mp, &hm->shook);
out_healthhook:
xfs_health_hook_del(mp, &hm->hhook);
out_hooks:
* [PATCH 14/22] xfs: report media errors through healthmon
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (12 preceding siblings ...)
2025-11-05 0:51 ` [PATCH 13/22] xfs: report shutdown " Darrick J. Wong
@ 2025-11-05 0:52 ` Darrick J. Wong
2025-11-05 0:52 ` [PATCH 15/22] xfs: report file io " Darrick J. Wong
` (7 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:52 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Now that we have hooks to report media errors, connect this to the
health monitor as well.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 15 +++++++++
fs/xfs/xfs_healthmon.h | 12 +++++++
fs/xfs/xfs_trace.h | 57 ++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_healthmon.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_trace.c | 1 +
5 files changed, 160 insertions(+), 2 deletions(-)
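
Note, not part of the commit: a small sketch of turning a media error
event into a byte range on the failed device. It assumes that daddr and
bbcount are expressed in 512-byte basic blocks, which is how xfs_daddr_t
is used inside the kernel; the patch itself does not spell out the units
of the userspace structure. The struct below is a local mirror of
struct xfs_health_monitor_media, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: events use 512-byte basic blocks, like xfs_daddr_t. */
#define BASIC_BLOCK_SHIFT	9

/* Local mirror of struct xfs_health_monitor_media from this patch. */
struct media_error {
	uint64_t	daddr;
	uint64_t	bbcount;
};

struct byte_range {
	uint64_t	offset;
	uint64_t	length;
};

/*
 * Convert a media error event to a byte offset and length.  Returns false
 * if either value would overflow 64 bits once scaled to bytes.
 */
static bool
media_error_to_bytes(
	const struct media_error	*me,
	struct byte_range		*out)
{
	const uint64_t	max_bb = UINT64_MAX >> BASIC_BLOCK_SHIFT;

	assert(me != NULL && out != NULL);

	if (me->daddr > max_bb || me->bbcount > max_bb)
		return false;

	out->offset = me->daddr << BASIC_BLOCK_SHIFT;
	out->length = me->bbcount << BASIC_BLOCK_SHIFT;
	return true;
}
```

Under that assumption, an event with daddr 8 and bbcount 16 covers bytes
4096 through 12287 of whichever device the event domain names (data,
realtime, or log).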
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 677141a17605a4..0711007344e16d 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1014,6 +1014,11 @@ struct xfs_rtgroup_geometry {
#define XFS_HEALTH_MONITOR_DOMAIN_INODE (3)
#define XFS_HEALTH_MONITOR_DOMAIN_RTGROUP (4)
+/* disk events */
+#define XFS_HEALTH_MONITOR_DOMAIN_DATADEV (5)
+#define XFS_HEALTH_MONITOR_DOMAIN_RTDEV (6)
+#define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV (7)
+
/* Health monitor event types */
/* status of the monitor itself */
@@ -1031,6 +1036,9 @@ struct xfs_rtgroup_geometry {
/* filesystem shutdown */
#define XFS_HEALTH_MONITOR_TYPE_SHUTDOWN (6)
+/* media errors */
+#define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR (7)
+
/* lost events */
struct xfs_health_monitor_lost {
__u64 count;
@@ -1071,6 +1079,12 @@ struct xfs_health_monitor_shutdown {
__u32 reasons;
};
+/* disk media errors */
+struct xfs_health_monitor_media {
+ __u64 daddr;
+ __u64 bbcount;
+};
+
struct xfs_health_monitor_event {
/* XFS_HEALTH_MONITOR_DOMAIN_* */
__u32 domain;
@@ -1092,6 +1106,7 @@ struct xfs_health_monitor_event {
struct xfs_health_monitor_group group;
struct xfs_health_monitor_inode inode;
struct xfs_health_monitor_shutdown shutdown;
+ struct xfs_health_monitor_media media;
} e;
/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index a82a684bbc0e03..407c5e1f466726 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -19,6 +19,8 @@ enum xfs_healthmon_type {
XFS_HEALTHMON_CORRUPT, /* fsck reported corruption */
XFS_HEALTHMON_HEALTHY, /* fsck reported healthy structure */
+ /* media errors */
+ XFS_HEALTHMON_MEDIA_ERROR,
};
enum xfs_healthmon_domain {
@@ -29,6 +31,11 @@ enum xfs_healthmon_domain {
XFS_HEALTHMON_AG, /* allocation group metadata */
XFS_HEALTHMON_INODE, /* inode metadata */
XFS_HEALTHMON_RTGROUP, /* realtime group metadata */
+
+ /* media errors */
+ XFS_HEALTHMON_DATADEV,
+ XFS_HEALTHMON_RTDEV,
+ XFS_HEALTHMON_LOGDEV,
};
struct xfs_healthmon_event {
@@ -66,6 +73,11 @@ struct xfs_healthmon_event {
uint32_t gen;
xfs_ino_t ino;
};
+ /* media errors */
+ struct {
+ xfs_daddr_t daddr;
+ uint64_t bbcount;
+ };
};
};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index b2b056ceb52f5c..79805ee9aa64f5 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -105,6 +105,7 @@ struct xfs_rtgroup;
struct xfs_open_zone;
struct xfs_healthmon_event;
struct xfs_health_update_params;
+struct xfs_media_error_params;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -6110,6 +6111,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
__entry->ino = event->ino;
__entry->gen = event->gen;
break;
+ case XFS_HEALTHMON_DATADEV:
+ case XFS_HEALTHMON_LOGDEV:
+ case XFS_HEALTHMON_RTDEV:
+ __entry->offset = event->daddr;
+ __entry->length = event->bbcount;
+ break;
}
),
TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6198,6 +6205,56 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
__entry->events,
__entry->lost_prev)
);
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+TRACE_EVENT(xfs_healthmon_media_error_hook,
+ TP_PROTO(const struct xfs_media_error_params *p,
+ unsigned int events, unsigned long long lost_prev),
+ TP_ARGS(p, events, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(dev_t, error_dev)
+ __field(uint64_t, daddr)
+ __field(uint64_t, bbcount)
+ __field(int, pre_remove)
+ __field(unsigned int, events)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ struct xfs_mount *mp = p->mp;
+ struct xfs_buftarg *btp = NULL;
+
+ switch (p->fdev) {
+ case XFS_FAILED_DATADEV:
+ btp = mp->m_ddev_targp;
+ break;
+ case XFS_FAILED_LOGDEV:
+ btp = mp->m_logdev_targp;
+ break;
+ case XFS_FAILED_RTDEV:
+ btp = mp->m_rtdev_targp;
+ break;
+ }
+
+ __entry->dev = mp->m_super->s_dev;
+ if (btp)
+ __entry->error_dev = btp->bt_dev;
+ __entry->daddr = p->daddr;
+ __entry->bbcount = p->bbcount;
+ __entry->pre_remove = p->pre_remove;
+ __entry->events = events;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx pre_remove? %d events %u lost_prev? %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ MAJOR(__entry->error_dev), MINOR(__entry->error_dev),
+ __entry->daddr,
+ __entry->bbcount,
+ __entry->pre_remove,
+ __entry->events,
+ __entry->lost_prev)
+);
+#endif
#endif /* CONFIG_XFS_HEALTH_MONITOR */
#endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index f36d7fbfb1ca16..efc8ff554e42da 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -21,6 +21,7 @@
#include "xfs_health.h"
#include "xfs_healthmon.h"
#include "xfs_fsops.h"
+#include "xfs_notify_failure.h"
#include <linux/anon_inodes.h>
#include <linux/eventpoll.h>
@@ -67,6 +68,7 @@ struct xfs_healthmon {
/* live update hooks */
struct xfs_shutdown_hook shook;
struct xfs_health_hook hhook;
+ struct xfs_media_error_hook mhook;
/* filesystem mount, or NULL if we've unmounted */
struct xfs_mount *mp;
@@ -518,6 +520,59 @@ xfs_healthmon_shutdown_hook(
return NOTIFY_DONE;
}
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+/* Add a media error event to the reporting queue. */
+STATIC int
+xfs_healthmon_media_error_hook(
+ struct notifier_block *nb,
+ unsigned long action,
+ void *data)
+{
+ struct xfs_healthmon *hm;
+ struct xfs_healthmon_event *event;
+ struct xfs_media_error_params *p = data;
+ enum xfs_healthmon_domain domain = 0; /* shut up gcc */
+ int error;
+
+ hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
+
+ mutex_lock(&hm->lock);
+
+ trace_xfs_healthmon_media_error_hook(p, hm->events,
+ hm->lost_prev_event);
+
+ error = xfs_healthmon_start_live_update(hm);
+ if (error)
+ goto out_unlock;
+
+ switch (p->fdev) {
+ case XFS_FAILED_LOGDEV:
+ domain = XFS_HEALTHMON_LOGDEV;
+ break;
+ case XFS_FAILED_RTDEV:
+ domain = XFS_HEALTHMON_RTDEV;
+ break;
+ case XFS_FAILED_DATADEV:
+ domain = XFS_HEALTHMON_DATADEV;
+ break;
+ }
+
+ event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_MEDIA_ERROR, domain);
+ if (!event)
+ goto out_unlock;
+
+ event->daddr = p->daddr;
+ event->bbcount = p->bbcount;
+ error = xfs_healthmon_push(hm, event);
+ if (error)
+ kfree(event);
+
+out_unlock:
+ mutex_unlock(&hm->lock);
+ return NOTIFY_DONE;
+}
+#endif
+
static inline void
xfs_healthmon_reset_outbuf(
struct xfs_healthmon *hm)
@@ -570,6 +625,9 @@ static const unsigned int domain_map[] = {
[XFS_HEALTHMON_AG] = XFS_HEALTH_MONITOR_DOMAIN_AG,
[XFS_HEALTHMON_INODE] = XFS_HEALTH_MONITOR_DOMAIN_INODE,
[XFS_HEALTHMON_RTGROUP] = XFS_HEALTH_MONITOR_DOMAIN_RTGROUP,
+ [XFS_HEALTHMON_DATADEV] = XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
+ [XFS_HEALTHMON_RTDEV] = XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
+ [XFS_HEALTHMON_LOGDEV] = XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
};
static const unsigned int type_map[] = {
@@ -580,6 +638,7 @@ static const unsigned int type_map[] = {
[XFS_HEALTHMON_HEALTHY] = XFS_HEALTH_MONITOR_TYPE_HEALTHY,
[XFS_HEALTHMON_UNMOUNT] = XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
[XFS_HEALTHMON_SHUTDOWN] = XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
+ [XFS_HEALTHMON_MEDIA_ERROR] = XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
};
/* Render event as a V0 structure */
@@ -634,6 +693,12 @@ xfs_healthmon_format_v0(
hme.e.inode.ino = event->ino;
hme.e.inode.gen = event->gen;
break;
+ case XFS_HEALTHMON_DATADEV:
+ case XFS_HEALTHMON_LOGDEV:
+ case XFS_HEALTHMON_RTDEV:
+ hme.e.media.daddr = event->daddr;
+ hme.e.media.bbcount = event->bbcount;
+ break;
default:
break;
}
@@ -903,6 +968,7 @@ xfs_healthmon_detach_hooks(
* through the health monitoring subsystem from xfs_fs_put_super, so
* it is now time to detach the hooks.
*/
+ xfs_media_error_hook_del(hm->mp, &hm->mhook);
xfs_shutdown_hook_del(hm->mp, &hm->shook);
xfs_health_hook_del(hm->mp, &hm->hhook);
return;
@@ -1056,12 +1122,17 @@ xfs_ioc_health_monitor(
if (ret)
goto out_healthhook;
+ xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook);
+ ret = xfs_media_error_hook_add(mp, &hm->mhook);
+ if (ret)
+ goto out_shutdownhook;
+
/* Queue up the first event that lets the client know we're running. */
event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
XFS_HEALTHMON_MOUNT);
if (!event) {
ret = -ENOMEM;
- goto out_shutdownhook;
+ goto out_mediahook;
}
__xfs_healthmon_push(hm, event);
@@ -1073,13 +1144,15 @@ xfs_ioc_health_monitor(
O_CLOEXEC | O_RDONLY);
if (fd < 0) {
ret = fd;
- goto out_shutdownhook;
+ goto out_mediahook;
}
trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
return fd;
+out_mediahook:
+ xfs_media_error_hook_del(mp, &hm->mhook);
out_shutdownhook:
xfs_shutdown_hook_del(mp, &hm->shook);
out_healthhook:
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index d42b864a3837a2..08ddab700a6cd3 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -53,6 +53,7 @@
#include "xfs_zone_priv.h"
#include "xfs_health.h"
#include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
/*
* We include this last to have the helpers above available for the trace
* [PATCH 15/22] xfs: report file io errors through healthmon
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (13 preceding siblings ...)
2025-11-05 0:52 ` [PATCH 14/22] xfs: report media errors " Darrick J. Wong
@ 2025-11-05 0:52 ` Darrick J. Wong
2025-11-05 0:52 ` [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
` (6 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:52 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Set up a file I/O error event hook so that we can send events about read
errors, writeback errors, and direct I/O errors to userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 19 +++++++++
fs/xfs/xfs_healthmon.h | 17 ++++++++
fs/xfs/xfs_trace.h | 59 +++++++++++++++++++++++++++++
fs/xfs/xfs_healthmon.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_trace.c | 1 +
5 files changed, 192 insertions(+), 2 deletions(-)
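
Note, not part of the commit: a sketch of how a client might render a
file range event for logging. The type values are copied from the
xfs_fs.h hunk in this patch, and struct filerange is a local mirror of
struct xfs_health_monitor_filerange. The display names and both helper
functions are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Type values copied from the xfs_fs.h hunk in this patch. */
#define XFS_HEALTH_MONITOR_TYPE_BUFREAD		(8)
#define XFS_HEALTH_MONITOR_TYPE_BUFWRITE	(9)
#define XFS_HEALTH_MONITOR_TYPE_DIOREAD		(10)
#define XFS_HEALTH_MONITOR_TYPE_DIOWRITE	(11)
#define XFS_HEALTH_MONITOR_TYPE_DATALOST	(12)

/* Local mirror of struct xfs_health_monitor_filerange. */
struct filerange {
	uint64_t	pos;
	uint64_t	len;
	uint64_t	ino;
	uint32_t	gen;
};

/* Hypothetical display names for the file range event types. */
static const char *
filerange_type_name(
	uint32_t	type)
{
	switch (type) {
	case XFS_HEALTH_MONITOR_TYPE_BUFREAD:
		return "buffered_read";
	case XFS_HEALTH_MONITOR_TYPE_BUFWRITE:
		return "buffered_write";
	case XFS_HEALTH_MONITOR_TYPE_DIOREAD:
		return "directio_read";
	case XFS_HEALTH_MONITOR_TYPE_DIOWRITE:
		return "directio_write";
	case XFS_HEALTH_MONITOR_TYPE_DATALOST:
		return "datalost";
	default:
		return "unknown";
	}
}

/*
 * Format one event into @buf.  Returns what snprintf() returns: the length
 * the full string needs, which exceeds @buflen - 1 if output was truncated.
 */
static int
format_filerange_event(
	uint32_t		type,
	const struct filerange	*fr,
	char			*buf,
	size_t			buflen)
{
	assert(fr != NULL && buf != NULL);

	return snprintf(buf, buflen,
			"%s ino 0x%" PRIx64 " gen 0x%" PRIx32
			" pos %" PRIu64 " len %" PRIu64,
			filerange_type_name(type), fr->ino, fr->gen,
			fr->pos, fr->len);
}
```

The inode number and generation together identify the file, so a client
that wants to act on the event would normally keep both rather than the
inode number alone.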
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 0711007344e16d..a96f11d9bd9c64 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1019,6 +1019,9 @@ struct xfs_rtgroup_geometry {
#define XFS_HEALTH_MONITOR_DOMAIN_RTDEV (6)
#define XFS_HEALTH_MONITOR_DOMAIN_LOGDEV (7)
+/* file range events */
+#define XFS_HEALTH_MONITOR_DOMAIN_FILERANGE (8)
+
/* Health monitor event types */
/* status of the monitor itself */
@@ -1039,6 +1042,13 @@ struct xfs_rtgroup_geometry {
/* media errors */
#define XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR (7)
+/* file range events */
+#define XFS_HEALTH_MONITOR_TYPE_BUFREAD (8)
+#define XFS_HEALTH_MONITOR_TYPE_BUFWRITE (9)
+#define XFS_HEALTH_MONITOR_TYPE_DIOREAD (10)
+#define XFS_HEALTH_MONITOR_TYPE_DIOWRITE (11)
+#define XFS_HEALTH_MONITOR_TYPE_DATALOST (12)
+
/* lost events */
struct xfs_health_monitor_lost {
__u64 count;
@@ -1079,6 +1089,14 @@ struct xfs_health_monitor_shutdown {
__u32 reasons;
};
+/* file range events */
+struct xfs_health_monitor_filerange {
+ __u64 pos;
+ __u64 len;
+ __u64 ino;
+ __u32 gen;
+};
+
/* disk media errors */
struct xfs_health_monitor_media {
__u64 daddr;
@@ -1107,6 +1125,7 @@ struct xfs_health_monitor_event {
struct xfs_health_monitor_inode inode;
struct xfs_health_monitor_shutdown shutdown;
struct xfs_health_monitor_media media;
+ struct xfs_health_monitor_filerange filerange;
} e;
/* zeroes */
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 407c5e1f466726..1ce49197262b1c 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -21,6 +21,13 @@ enum xfs_healthmon_type {
/* media errors */
XFS_HEALTHMON_MEDIA_ERROR,
+
+ /* file range events */
+ XFS_HEALTHMON_BUFREAD,
+ XFS_HEALTHMON_BUFWRITE,
+ XFS_HEALTHMON_DIOREAD,
+ XFS_HEALTHMON_DIOWRITE,
+ XFS_HEALTHMON_DATALOST,
};
enum xfs_healthmon_domain {
@@ -36,6 +43,9 @@ enum xfs_healthmon_domain {
XFS_HEALTHMON_DATADEV,
XFS_HEALTHMON_RTDEV,
XFS_HEALTHMON_LOGDEV,
+
+ /* file range events */
+ XFS_HEALTHMON_FILERANGE,
};
struct xfs_healthmon_event {
@@ -78,6 +88,13 @@ struct xfs_healthmon_event {
xfs_daddr_t daddr;
uint64_t bbcount;
};
+ /* file range events */
+ struct {
+ xfs_ino_t fino;
+ loff_t fpos;
+ uint64_t flen;
+ uint32_t fgen;
+ };
};
};
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 79805ee9aa64f5..d1836583d4dfbb 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -106,6 +106,7 @@ struct xfs_open_zone;
struct xfs_healthmon_event;
struct xfs_health_update_params;
struct xfs_media_error_params;
+struct xfs_file_ioerror_params;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -6117,6 +6118,12 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
__entry->offset = event->daddr;
__entry->length = event->bbcount;
break;
+ case XFS_HEALTHMON_FILERANGE:
+ __entry->ino = event->fino;
+ __entry->gen = event->fgen;
+ __entry->offset = event->fpos;
+ __entry->length = event->flen;
+ break;
}
),
TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x offset 0x%llx len 0x%llx group 0x%x lost %llu",
@@ -6255,6 +6262,58 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
__entry->lost_prev)
);
#endif
+
+#define XFS_FILE_IOERROR_STRINGS \
+ { XFS_FILE_IOERROR_BUFFERED_READ, "readahead" }, \
+ { XFS_FILE_IOERROR_BUFFERED_WRITE, "writeback" }, \
+ { XFS_FILE_IOERROR_DIRECT_READ, "directio_read" }, \
+ { XFS_FILE_IOERROR_DIRECT_WRITE, "directio_write" }, \
+ { XFS_FILE_IOERROR_DATA_LOST, "datalost" }
+
+
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_BUFFERED_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_READ);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DIRECT_WRITE);
+TRACE_DEFINE_ENUM(XFS_FILE_IOERROR_DATA_LOST);
+
+TRACE_EVENT(xfs_healthmon_file_ioerror_hook,
+ TP_PROTO(const struct xfs_mount *mp,
+ unsigned long action,
+ const struct xfs_file_ioerror_params *p,
+ unsigned int events, unsigned long long lost_prev),
+ TP_ARGS(mp, action, p, events, lost_prev),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(dev_t, error_dev)
+ __field(unsigned long, action)
+ __field(unsigned long long, ino)
+ __field(unsigned int, gen)
+ __field(long long, pos)
+ __field(unsigned long long, len)
+ __field(unsigned int, events)
+ __field(unsigned long long, lost_prev)
+ ),
+ TP_fast_assign(
+ __entry->dev = mp ? mp->m_super->s_dev : 0;
+ __entry->action = action;
+ __entry->ino = p->ino;
+ __entry->gen = p->gen;
+ __entry->pos = p->pos;
+ __entry->len = p->len;
+ __entry->events = events;
+ __entry->lost_prev = lost_prev;
+ ),
+ TP_printk("dev %d:%d ino 0x%llx gen 0x%x op %s pos 0x%llx bytecount 0x%llx events %u lost_prev? %llu",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->gen,
+ __print_symbolic(__entry->action, XFS_FILE_IOERROR_STRINGS),
+ __entry->pos,
+ __entry->len,
+ __entry->events,
+ __entry->lost_prev)
+);
#endif /* CONFIG_XFS_HEALTH_MONITOR */
#endif /* _TRACE_XFS_H */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index efc8ff554e42da..31c2f6f43cf474 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -22,6 +22,7 @@
#include "xfs_healthmon.h"
#include "xfs_fsops.h"
#include "xfs_notify_failure.h"
+#include "xfs_file.h"
#include <linux/anon_inodes.h>
#include <linux/eventpoll.h>
@@ -69,6 +70,7 @@ struct xfs_healthmon {
struct xfs_shutdown_hook shook;
struct xfs_health_hook hhook;
struct xfs_media_error_hook mhook;
+ struct xfs_file_ioerror_hook fhook;
/* filesystem mount, or NULL if we've unmounted */
struct xfs_mount *mp;
@@ -573,6 +575,77 @@ xfs_healthmon_media_error_hook(
}
#endif
+/* Add a file io error event to the reporting queue. */
+STATIC int
+xfs_healthmon_file_ioerror_hook(
+ struct notifier_block *nb,
+ unsigned long action,
+ void *data)
+{
+ struct xfs_healthmon *hm;
+ struct xfs_healthmon_event *event;
+ struct xfs_file_ioerror_params *p = data;
+ enum xfs_healthmon_type type = 0;
+ int error;
+
+ hm = container_of(nb, struct xfs_healthmon, fhook.ioerror_hook.nb);
+
+ switch (action) {
+ case XFS_FILE_IOERROR_BUFFERED_READ:
+ case XFS_FILE_IOERROR_BUFFERED_WRITE:
+ case XFS_FILE_IOERROR_DIRECT_READ:
+ case XFS_FILE_IOERROR_DIRECT_WRITE:
+ case XFS_FILE_IOERROR_DATA_LOST:
+ break;
+ default:
+ ASSERT(0);
+ return NOTIFY_DONE;
+ }
+
+ mutex_lock(&hm->lock);
+
+ trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
+ hm->lost_prev_event);
+
+ error = xfs_healthmon_start_live_update(hm);
+ if (error)
+ goto out_unlock;
+
+ switch (action) {
+ case XFS_FILE_IOERROR_BUFFERED_READ:
+ type = XFS_HEALTHMON_BUFREAD;
+ break;
+ case XFS_FILE_IOERROR_BUFFERED_WRITE:
+ type = XFS_HEALTHMON_BUFWRITE;
+ break;
+ case XFS_FILE_IOERROR_DIRECT_READ:
+ type = XFS_HEALTHMON_DIOREAD;
+ break;
+ case XFS_FILE_IOERROR_DIRECT_WRITE:
+ type = XFS_HEALTHMON_DIOWRITE;
+ break;
+ case XFS_FILE_IOERROR_DATA_LOST:
+ type = XFS_HEALTHMON_DATALOST;
+ break;
+ }
+
+ event = xfs_healthmon_alloc(hm, type, XFS_HEALTHMON_FILERANGE);
+ if (!event)
+ goto out_unlock;
+
+ event->fino = p->ino;
+ event->fgen = p->gen;
+ event->fpos = p->pos;
+ event->flen = p->len;
+ error = xfs_healthmon_push(hm, event);
+ if (error)
+ kfree(event);
+
+out_unlock:
+ mutex_unlock(&hm->lock);
+ return NOTIFY_DONE;
+}
+
static inline void
xfs_healthmon_reset_outbuf(
struct xfs_healthmon *hm)
@@ -628,6 +701,7 @@ static const unsigned int domain_map[] = {
[XFS_HEALTHMON_DATADEV] = XFS_HEALTH_MONITOR_DOMAIN_DATADEV,
[XFS_HEALTHMON_RTDEV] = XFS_HEALTH_MONITOR_DOMAIN_RTDEV,
[XFS_HEALTHMON_LOGDEV] = XFS_HEALTH_MONITOR_DOMAIN_LOGDEV,
+ [XFS_HEALTHMON_FILERANGE] = XFS_HEALTH_MONITOR_DOMAIN_FILERANGE,
};
static const unsigned int type_map[] = {
@@ -639,6 +713,11 @@ static const unsigned int type_map[] = {
[XFS_HEALTHMON_UNMOUNT] = XFS_HEALTH_MONITOR_TYPE_UNMOUNT,
[XFS_HEALTHMON_SHUTDOWN] = XFS_HEALTH_MONITOR_TYPE_SHUTDOWN,
[XFS_HEALTHMON_MEDIA_ERROR] = XFS_HEALTH_MONITOR_TYPE_MEDIA_ERROR,
+ [XFS_HEALTHMON_BUFREAD] = XFS_HEALTH_MONITOR_TYPE_BUFREAD,
+ [XFS_HEALTHMON_BUFWRITE] = XFS_HEALTH_MONITOR_TYPE_BUFWRITE,
+ [XFS_HEALTHMON_DIOREAD] = XFS_HEALTH_MONITOR_TYPE_DIOREAD,
+ [XFS_HEALTHMON_DIOWRITE] = XFS_HEALTH_MONITOR_TYPE_DIOWRITE,
+ [XFS_HEALTHMON_DATALOST] = XFS_HEALTH_MONITOR_TYPE_DATALOST,
};
/* Render event as a V0 structure */
@@ -699,6 +778,12 @@ xfs_healthmon_format_v0(
hme.e.media.daddr = event->daddr;
hme.e.media.bbcount = event->bbcount;
break;
+ case XFS_HEALTHMON_FILERANGE:
+ hme.e.filerange.ino = event->fino;
+ hme.e.filerange.gen = event->fgen;
+ hme.e.filerange.pos = event->fpos;
+ hme.e.filerange.len = event->flen;
+ break;
default:
break;
}
@@ -968,6 +1053,7 @@ xfs_healthmon_detach_hooks(
* through the health monitoring subsystem from xfs_fs_put_super, so
* it is now time to detach the hooks.
*/
+ xfs_file_ioerror_hook_del(hm->mp, &hm->fhook);
xfs_media_error_hook_del(hm->mp, &hm->mhook);
xfs_shutdown_hook_del(hm->mp, &hm->shook);
xfs_health_hook_del(hm->mp, &hm->hhook);
@@ -1127,12 +1213,18 @@ xfs_ioc_health_monitor(
if (ret)
goto out_shutdownhook;
+ xfs_file_ioerror_hook_setup(&hm->fhook,
+ xfs_healthmon_file_ioerror_hook);
+ ret = xfs_file_ioerror_hook_add(mp, &hm->fhook);
+ if (ret)
+ goto out_mediahook;
+
/* Queue up the first event that lets the client know we're running. */
event = xfs_healthmon_alloc(hm, XFS_HEALTHMON_RUNNING,
XFS_HEALTHMON_MOUNT);
if (!event) {
ret = -ENOMEM;
- goto out_mediahook;
+ goto out_ioerrhook;
}
__xfs_healthmon_push(hm, event);
@@ -1144,13 +1236,15 @@ xfs_ioc_health_monitor(
O_CLOEXEC | O_RDONLY);
if (fd < 0) {
ret = fd;
- goto out_mediahook;
+ goto out_ioerrhook;
}
trace_xfs_healthmon_create(mp, hmo.flags, hmo.format);
return fd;
+out_ioerrhook:
+ xfs_file_ioerror_hook_del(mp, &hm->fhook);
out_mediahook:
xfs_media_error_hook_del(mp, &hm->mhook);
out_shutdownhook:
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 08ddab700a6cd3..eb35015c091570 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -54,6 +54,7 @@
#include "xfs_health.h"
#include "xfs_healthmon.h"
#include "xfs_notify_failure.h"
+#include "xfs_file.h"
/*
* We include this last to have the helpers above available for the trace
* [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (14 preceding siblings ...)
2025-11-05 0:52 ` [PATCH 15/22] xfs: report file io " Darrick J. Wong
@ 2025-11-05 0:52 ` Darrick J. Wong
2025-11-05 0:52 ` [PATCH 17/22] xfs: validate fds against running healthmon Darrick J. Wong
` (5 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:52 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/xfs_healthmon.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 31c2f6f43cf474..d3784073494ec6 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -23,6 +23,7 @@
#include "xfs_fsops.h"
#include "xfs_notify_failure.h"
#include "xfs_file.h"
+#include "xfs_ioctl.h"
#include <linux/anon_inodes.h>
#include <linux/eventpoll.h>
@@ -1140,6 +1141,48 @@ xfs_healthmon_show_fdinfo(
}
#endif
+/* Reconfigure the health monitor. */
+STATIC long
+xfs_healthmon_reconfigure(
+ struct file *file,
+ unsigned int cmd,
+ void __user *arg)
+{
+ struct xfs_health_monitor hmo;
+ struct xfs_healthmon *hm = file->private_data;
+
+ if (copy_from_user(&hmo, arg, sizeof(hmo)))
+ return -EFAULT;
+
+ if (!xfs_healthmon_validate(&hmo))
+ return -EINVAL;
+
+ mutex_lock(&hm->lock);
+ hm->format = hmo.format;
+ hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+ mutex_unlock(&hm->lock);
+ return 0;
+}
+
+/* Handle ioctls for the health monitoring thread. */
+STATIC long
+xfs_healthmon_ioctl(
+ struct file *file,
+ unsigned int cmd,
+ unsigned long p)
+{
+ void __user *arg = (void __user *)p;
+
+ switch (cmd) {
+ case XFS_IOC_HEALTH_MONITOR:
+ return xfs_healthmon_reconfigure(file, cmd, arg);
+ default:
+ break;
+ }
+
+ return -ENOTTY;
+}
+
static const struct file_operations xfs_healthmon_fops = {
.owner = THIS_MODULE,
#ifdef CONFIG_PROC_FS
@@ -1148,6 +1191,7 @@ static const struct file_operations xfs_healthmon_fops = {
.read_iter = xfs_healthmon_read_iter,
.poll = xfs_healthmon_poll,
.release = xfs_healthmon_release,
+ .unlocked_ioctl = xfs_healthmon_ioctl,
};
/*
* [PATCH 17/22] xfs: validate fds against running healthmon
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (15 preceding siblings ...)
2025-11-05 0:52 ` [PATCH 16/22] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
@ 2025-11-05 0:52 ` Darrick J. Wong
2025-11-05 0:53 ` [PATCH 18/22] xfs: add media error reporting ioctl Darrick J. Wong
` (4 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:52 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.
(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)
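A rough userspace sketch of how a caller might use this ioctl follows. It
is not taken from xfs_healer: the helper name healer_check_samefs is
hypothetical, and the struct and ioctl number are copied locally from the
xfs_fs.h hunk below only so that the example is self-contained. A real
program would include the installed xfs_fs.h header instead.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Local mirror of the uapi additions in this patch (assumption: unchanged). */
struct xfs_health_samefs {
	int32_t		fd;
	uint32_t	flags;	/* zero for now */
};
#define XFS_IOC_HEALTH_SAMEFS	_IOW('X', 69, struct xfs_health_samefs)

/*
 * Hypothetical helper.  Returns 1 if target_fd is on the filesystem that
 * hmon_fd is monitoring, 0 if the kernel says ESTALE (different or
 * unmounted filesystem), or a negative errno for any other failure.
 */
int
healer_check_samefs(int hmon_fd, int target_fd)
{
	struct xfs_health_samefs hms = {
		.fd	= target_fd,
		.flags	= 0,
	};

	if (ioctl(hmon_fd, XFS_IOC_HEALTH_SAMEFS, &hms) == 0)
		return 1;
	if (errno == ESTALE)
		return 0;
	return -errno;
}
```

The intent is that a repair tool reopens the mountpoint, passes the new fd
through a check like this one, and only proceeds with repairs when the
result is 1.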
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 10 ++++++++++
fs/xfs/xfs_healthmon.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index a96f11d9bd9c64..2b82535196cdb0 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1147,6 +1147,15 @@ struct xfs_health_monitor {
/* Initial return format version */
#define XFS_HEALTH_MONITOR_FMT_V0 (0)
+/*
+ * Check that a given fd points to the same filesystem that the health monitor
+ * is monitoring.
+ */
+struct xfs_health_samefs {
+ __s32 fd;
+ __u32 flags; /* zero for now */
+};
+
/*
* ioctl commands that are used by Linux filesystems
*/
@@ -1187,6 +1196,7 @@ struct xfs_health_monitor {
#define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head)
#define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
#define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor)
+#define XFS_IOC_HEALTH_SAMEFS _IOW ('X', 69, struct xfs_health_samefs)
/*
* ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d3784073494ec6..9752b058978995 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -1164,6 +1164,36 @@ xfs_healthmon_reconfigure(
return 0;
}
+/* Does the fd point to the same filesystem as the one we're monitoring? */
+STATIC long
+xfs_healthmon_samefs(
+ struct file *file,
+ unsigned int cmd,
+ void __user *arg)
+{
+ struct xfs_health_samefs hms;
+ struct xfs_healthmon *hm = file->private_data;
+ struct inode *hms_inode;
+ int ret = 0;
+
+ if (copy_from_user(&hms, arg, sizeof(hms)))
+ return -EFAULT;
+
+ if (hms.flags)
+ return -EINVAL;
+
+ CLASS(fd, hms_fd)(hms.fd);
+ if (fd_empty(hms_fd))
+ return -EBADF;
+
+ hms_inode = file_inode(fd_file(hms_fd));
+ mutex_lock(&hm->lock);
+ if (!hm->mp || hm->mp->m_super != hms_inode->i_sb)
+ ret = -ESTALE;
+ mutex_unlock(&hm->lock);
+ return ret;
+}
+
/* Handle ioctls for the health monitoring thread. */
STATIC long
xfs_healthmon_ioctl(
@@ -1176,6 +1206,8 @@ xfs_healthmon_ioctl(
switch (cmd) {
case XFS_IOC_HEALTH_MONITOR:
return xfs_healthmon_reconfigure(file, cmd, arg);
+ case XFS_IOC_HEALTH_SAMEFS:
+ return xfs_healthmon_samefs(file, cmd, arg);
default:
break;
}
* [PATCH 18/22] xfs: add media error reporting ioctl
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (16 preceding siblings ...)
2025-11-05 0:52 ` [PATCH 17/22] xfs: validate fds against running healthmon Darrick J. Wong
@ 2025-11-05 0:53 ` Darrick J. Wong
2025-11-05 0:53 ` [PATCH 19/22] xfs: send uevents when major filesystem events happen Darrick J. Wong
` (3 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:53 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add a new privileged ioctl so that xfs_scrub can report media errors to
the kernel for further processing.
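The following is a minimal userspace sketch of how a caller could fill
out and submit the new structure. It is an illustration only, not the
xfs_scrub implementation: scrub_fill_media_error and
scrub_report_media_error are hypothetical names, and the struct, flag
values, and ioctl number are mirrored locally from the xfs_fs.h hunk
below so that the example stands alone.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local mirror of the uapi additions in this patch (assumption: unchanged). */
struct xfs_media_error {
	uint64_t	flags;		/* flags */
	uint64_t	daddr;		/* disk address of range */
	uint64_t	bbcount;	/* length, in 512b blocks */
	uint64_t	pad;		/* zero */
};
#define XFS_MEDIA_ERROR_DATADEV	(1)	/* data device */
#define XFS_MEDIA_ERROR_LOGDEV	(2)	/* external log device */
#define XFS_MEDIA_ERROR_RTDEV	(3)	/* realtime device */
#define XFS_MEDIA_ERROR_DEVMASK	(0xFF)
#define XFS_IOC_MEDIA_ERROR	_IOW('X', 70, struct xfs_media_error)

/*
 * Hypothetical helper: describe the 512-byte sectors that cover the byte
 * range [byte_off, byte_off + byte_len) on the given device.
 */
int
scrub_fill_media_error(struct xfs_media_error *me, uint64_t devcode,
		uint64_t byte_off, uint64_t byte_len)
{
	if (devcode < XFS_MEDIA_ERROR_DATADEV ||
	    devcode > XFS_MEDIA_ERROR_RTDEV || byte_len == 0)
		return -EINVAL;
	if (byte_off > UINT64_MAX - byte_len)
		return -EINVAL;

	memset(me, 0, sizeof(*me));	/* pad must be zero */
	me->flags = devcode;		/* device code lives in the low byte */
	me->daddr = byte_off >> 9;
	me->bbcount = ((byte_off + byte_len - 1) >> 9) + 1 - me->daddr;
	return 0;
}

/* Hypothetical helper: submit the report; needs CAP_SYS_ADMIN. */
int
scrub_report_media_error(int fs_fd, const struct xfs_media_error *me)
{
	return ioctl(fs_fd, XFS_IOC_MEDIA_ERROR, me) ? -errno : 0;
}
```

Rounding the byte range outward to whole sectors is a choice made for this
sketch; the ioctl itself only sees daddr and bbcount.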
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 16 ++++
fs/xfs/xfs_notify_failure.h | 8 ++
fs/xfs/xfs_trace.h | 2
fs/xfs/Makefile | 6 -
fs/xfs/xfs_healthmon.c | 2
fs/xfs/xfs_ioctl.c | 3 +
fs/xfs/xfs_notify_failure.c | 187 +++++++++++++++++++++++++++++++++++++++++++
7 files changed, 213 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2b82535196cdb0..65fcc94ed9b40c 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -1156,6 +1156,21 @@ struct xfs_health_samefs {
__u32 flags; /* zero for now */
};
+/* Report a media error */
+struct xfs_media_error {
+ __u64 flags; /* flags */
+ __u64 daddr; /* disk address of range */
+ __u64 bbcount; /* length, in 512b blocks */
+ __u64 pad; /* zero */
+};
+
+#define XFS_MEDIA_ERROR_DATADEV (1) /* data device */
+#define XFS_MEDIA_ERROR_LOGDEV (2) /* external log device */
+#define XFS_MEDIA_ERROR_RTDEV (3) /* realtime device */
+
+/* bottom byte of flags is the device code */
+#define XFS_MEDIA_ERROR_DEVMASK (0xFF)
+
/*
* ioctl commands that are used by Linux filesystems
*/
@@ -1197,6 +1212,7 @@ struct xfs_health_samefs {
#define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
#define XFS_IOC_HEALTH_MONITOR _IOW ('X', 68, struct xfs_health_monitor)
#define XFS_IOC_HEALTH_SAMEFS _IOW ('X', 69, struct xfs_health_samefs)
+#define XFS_IOC_MEDIA_ERROR _IOW ('X', 70, struct xfs_media_error)
/*
* ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
index 2695732ec20875..279f9329a4d5f3 100644
--- a/fs/xfs/xfs_notify_failure.h
+++ b/fs/xfs/xfs_notify_failure.h
@@ -6,7 +6,9 @@
#ifndef __XFS_NOTIFY_FAILURE_H__
#define __XFS_NOTIFY_FAILURE_H__
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
extern const struct dax_holder_operations xfs_dax_holder_operations;
+#endif
enum xfs_failed_device {
XFS_FAILED_DATADEV,
@@ -14,7 +16,7 @@ enum xfs_failed_device {
XFS_FAILED_RTDEV,
};
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+#if defined(CONFIG_XFS_LIVE_HOOKS)
struct xfs_media_error_params {
struct xfs_mount *mp;
enum xfs_failed_device fdev;
@@ -41,4 +43,8 @@ struct xfs_media_error_hook { };
# define xfs_media_error_hook_setup(...) ((void)0)
#endif /* CONFIG_XFS_LIVE_HOOKS */
+struct xfs_media_error;
+int xfs_ioc_media_error(struct xfs_mount *mp,
+ struct xfs_media_error __user *arg);
+
#endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d1836583d4dfbb..e5d95add53d347 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6213,7 +6213,6 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
__entry->lost_prev)
);
-#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
TRACE_EVENT(xfs_healthmon_media_error_hook,
TP_PROTO(const struct xfs_media_error_params *p,
unsigned int events, unsigned long long lost_prev),
@@ -6261,7 +6260,6 @@ TRACE_EVENT(xfs_healthmon_media_error_hook,
__entry->events,
__entry->lost_prev)
);
-#endif
#define XFS_FILE_IOERROR_STRINGS \
{ XFS_FILE_IOERROR_BUFFERED_READ, "readahead" }, \
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index d4e9070a9326ba..2279cb0b874814 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -98,6 +98,7 @@ xfs-y += xfs_aops.o \
xfs_message.o \
xfs_mount.o \
xfs_mru_cache.o \
+ xfs_notify_failure.o \
xfs_pwork.o \
xfs_reflink.o \
xfs_stats.o \
@@ -148,11 +149,6 @@ xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
-# notify failure
-ifeq ($(CONFIG_MEMORY_FAILURE),y)
-xfs-$(CONFIG_FS_DAX) += xfs_notify_failure.o
-endif
-
xfs-$(CONFIG_XFS_DRAIN_INTENTS) += xfs_drain.o
xfs-$(CONFIG_XFS_LIVE_HOOKS) += xfs_hooks.o
xfs-$(CONFIG_XFS_MEMORY_BUFS) += xfs_buf_mem.o
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9752b058978995..e5715f52f4b218 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -523,7 +523,6 @@ xfs_healthmon_shutdown_hook(
return NOTIFY_DONE;
}
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
/* Add a media error event to the reporting queue. */
STATIC int
xfs_healthmon_media_error_hook(
@@ -574,7 +573,6 @@ xfs_healthmon_media_error_hook(
mutex_unlock(&hm->lock);
return NOTIFY_DONE;
}
-#endif
/* Add a file io error event to the reporting queue. */
STATIC int
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 08998d84554f09..7a80a6ad4b2d99 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -42,6 +42,7 @@
#include "xfs_handle.h"
#include "xfs_rtgroup.h"
#include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
#include <linux/mount.h>
#include <linux/fileattr.h>
@@ -1424,6 +1425,8 @@ xfs_file_ioctl(
case XFS_IOC_HEALTH_MONITOR:
return xfs_ioc_health_monitor(mp, arg);
+ case XFS_IOC_MEDIA_ERROR:
+ return xfs_ioc_media_error(mp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 8766d83385ddad..bf6e1865d5c3a5 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -76,9 +76,19 @@ xfs_media_error_hook_setup(
xfs_hook_setup(&hook->error_hook, mod_fn);
}
#else
-# define xfs_media_error_hook(...) ((void)0)
+static inline void
+xfs_media_error_hook(
+ struct xfs_mount *mp,
+ enum xfs_failed_device fdev,
+ xfs_daddr_t daddr,
+ uint64_t bbcount,
+ bool pre_remove)
+{
+ /* empty */
+}
#endif /* CONFIG_XFS_LIVE_HOOKS */
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
struct xfs_failure_info {
xfs_agblock_t startblock;
xfs_extlen_t blockcount;
@@ -447,3 +457,178 @@ xfs_dax_notify_failure(
const struct dax_holder_operations xfs_dax_holder_operations = {
.notify_failure = xfs_dax_notify_failure,
};
+#endif /* CONFIG_MEMORY_FAILURE && CONFIG_FS_DAX */
+
+struct xfs_group_data_lost {
+ xfs_agblock_t startblock;
+ xfs_extlen_t blockcount;
+};
+
+static int
+xfs_report_one_data_lost(
+ struct xfs_btree_cur *cur,
+ const struct xfs_rmap_irec *rec,
+ void *data)
+{
+ struct xfs_mount *mp = cur->bc_mp;
+ struct xfs_inode *ip;
+ struct xfs_group_data_lost *lost = data;
+ xfs_fileoff_t fileoff = rec->rm_offset;
+ xfs_extlen_t blocks = rec->rm_blockcount;
+ const xfs_agblock_t lost_end =
+ lost->startblock + lost->blockcount;
+ const xfs_agblock_t rmap_end =
+ rec->rm_startblock + rec->rm_blockcount;
+ int error = 0;
+
+ if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+ (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK)))
+ return 0;
+
+ error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, 0, 0, &ip);
+ if (error)
+ return 0;
+
+ if (lost->startblock > rec->rm_startblock) {
+ fileoff += lost->startblock - rec->rm_startblock;
+ blocks -= lost->startblock - rec->rm_startblock;
+ }
+ if (rmap_end > lost_end)
+ blocks -= rmap_end - lost_end;
+
+ xfs_inode_media_error(ip, XFS_FSB_TO_B(mp, fileoff),
+ XFS_FSB_TO_B(mp, blocks));
+
+ xfs_irele(ip);
+ return 0;
+}
+
+static int
+xfs_report_data_lost(
+ struct xfs_mount *mp,
+ enum xfs_group_type type,
+ xfs_daddr_t daddr,
+ u64 bblen)
+{
+ struct xfs_group *xg = NULL;
+ struct xfs_trans *tp;
+ xfs_fsblock_t start_bno, end_bno;
+ uint32_t start_gno, end_gno;
+ int error;
+
+ if (type == XG_TYPE_RTG) {
+ start_bno = xfs_daddr_to_rtb(mp, daddr);
+ end_bno = xfs_daddr_to_rtb(mp, daddr + bblen - 1);
+ } else {
+ start_bno = XFS_DADDR_TO_FSB(mp, daddr);
+ end_bno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
+ }
+
+ tp = xfs_trans_alloc_empty(mp);
+ start_gno = xfs_fsb_to_gno(mp, start_bno, type);
+ end_gno = xfs_fsb_to_gno(mp, end_bno, type);
+ while ((xg = xfs_group_next_range(mp, xg, start_gno, end_gno, type))) {
+ struct xfs_buf *agf_bp = NULL;
+ struct xfs_rtgroup *rtg = NULL;
+ struct xfs_btree_cur *cur;
+ struct xfs_rmap_irec ri_low = { };
+ struct xfs_rmap_irec ri_high;
+ struct xfs_group_data_lost lost;
+
+ if (type == XG_TYPE_AG) {
+ struct xfs_perag *pag = to_perag(xg);
+
+ error = xfs_alloc_read_agf(pag, tp, 0, &agf_bp);
+ if (error) {
+ xfs_perag_put(pag);
+ break;
+ }
+
+ cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, pag);
+ } else {
+ rtg = to_rtg(xg);
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_RMAP);
+ cur = xfs_rtrmapbt_init_cursor(tp, rtg);
+ }
+
+ /*
+ * Set the rmap range from ri_low to ri_high, which represents
+ * a [start, end] range in which we are looking for files or metadata.
+ */
+ memset(&ri_high, 0xFF, sizeof(ri_high));
+ if (xg->xg_gno == start_gno)
+ ri_low.rm_startblock =
+ xfs_fsb_to_gbno(mp, start_bno, type);
+ if (xg->xg_gno == end_gno)
+ ri_high.rm_startblock =
+ xfs_fsb_to_gbno(mp, end_bno, type);
+
+ lost.startblock = ri_low.rm_startblock;
+ lost.blockcount = min(xg->xg_block_count,
+ ri_high.rm_startblock + 1) -
+ ri_low.rm_startblock;
+
+ error = xfs_rmap_query_range(cur, &ri_low, &ri_high,
+ xfs_report_one_data_lost, &lost);
+ xfs_btree_del_cursor(cur, error);
+ if (agf_bp)
+ xfs_trans_brelse(tp, agf_bp);
+ if (rtg)
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_RMAP);
+ if (error) {
+ xfs_group_put(xg);
+ break;
+ }
+ }
+
+ xfs_trans_cancel(tp);
+ return 0;
+}
+
+#define XFS_VALID_MEDIA_ERROR_FLAGS (XFS_MEDIA_ERROR_DATADEV | \
+ XFS_MEDIA_ERROR_LOGDEV | \
+ XFS_MEDIA_ERROR_RTDEV)
+int
+xfs_ioc_media_error(
+ struct xfs_mount *mp,
+ struct xfs_media_error __user *arg)
+{
+ struct xfs_media_error me;
+ enum xfs_failed_device fdev;
+ enum xfs_group_type type;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (copy_from_user(&me, arg, sizeof(me)))
+ return -EFAULT;
+
+ if (me.pad)
+ return -EINVAL;
+ if (me.flags & ~XFS_VALID_MEDIA_ERROR_FLAGS)
+ return -EINVAL;
+
+ switch (me.flags & XFS_MEDIA_ERROR_DEVMASK) {
+ case XFS_MEDIA_ERROR_DATADEV:
+ fdev = XFS_FAILED_DATADEV;
+ type = XG_TYPE_AG;
+ break;
+ case XFS_MEDIA_ERROR_LOGDEV:
+ fdev = XFS_FAILED_LOGDEV;
+ type = -1;
+ break;
+ case XFS_MEDIA_ERROR_RTDEV:
+ fdev = XFS_FAILED_RTDEV;
+ type = XG_TYPE_RTG;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ xfs_media_error_hook(mp, fdev, me.daddr, me.bbcount, false);
+
+ if (xfs_has_rmapbt(mp) && fdev != XFS_FAILED_LOGDEV)
+ return xfs_report_data_lost(mp, type, me.daddr, me.bbcount);
+
+ return 0;
+}
* [PATCH 19/22] xfs: send uevents when major filesystem events happen
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (17 preceding siblings ...)
2025-11-05 0:53 ` [PATCH 18/22] xfs: add media error reporting ioctl Darrick J. Wong
@ 2025-11-05 0:53 ` Darrick J. Wong
2025-11-05 0:53 ` [PATCH 20/22] xfs: merge health monitoring events when possible Darrick J. Wong
` (2 subsequent siblings)
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:53 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Send uevents when we mount, unmount, and shut down the filesystem, so
that we can trigger systemd services when major events happen.
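The environment strings are packed back to back into one buffer, and
consecutive slots of the env array are pointed at them. Here is a
simplified userspace model of that packing, assuming plain strings in
place of the superblock id and UUIDs. The name pack_env_strings is
hypothetical, and this is a sketch of the technique rather than the kernel
function itself.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical userspace model: format each "KEY=value" string into buf,
 * one after another including its NUL terminator, and point successive
 * env[] slots at them.  The caller passes the address of the first
 * placeholder slot and must leave the final NULL terminator untouched.
 */
int
pack_env_strings(char *buf, ssize_t buflen, char **env,
		const char *sid, const char *uuid, const char *meta_uuid)
{
	const char	*keys[] = { "SID", "UUID", "META_UUID" };
	const char	*vals[] = { sid, uuid, meta_uuid };
	size_t		i;

	for (i = 0; i < 3; i++) {
		int	written;

		written = snprintf(buf, (size_t)buflen, "%s=%s",
				keys[i], vals[i]);
		if (written < 0 || written >= buflen)
			return -EINVAL;	/* string was truncated */

		env[i] = buf;
		buf += written + 1;	/* skip past the NUL */
		buflen -= written + 1;
	}
	return 0;
}
```

Note that the slot index handed to the packer has to match where the
placeholders start: one past "TYPE=..." when there is no SOURCE entry, two
past it when there is.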
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/xfs_super.h | 13 +++++++
fs/xfs/xfs_fsops.c | 18 ++++++++++
fs/xfs/xfs_super.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 125 insertions(+)
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index c0e85c1e42f27d..6d428bd04a0248 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -101,4 +101,17 @@ extern struct workqueue_struct *xfs_discard_wq;
struct dentry *xfs_debugfs_mkdir(const char *name, struct dentry *parent);
+#define XFS_UEVENT_BUFLEN ( \
+ sizeof("SID=") + sizeof_field(struct super_block, s_id) + \
+ sizeof("UUID=") + UUID_STRING_LEN + \
+ sizeof("META_UUID=") + UUID_STRING_LEN)
+
+#define XFS_UEVENT_STR_PTRS \
+ NULL, /* sid */ \
+ NULL, /* uuid */ \
+ NULL /* metauuid */
+
+int xfs_format_uevent_strings(struct xfs_mount *mp, char *buf, ssize_t buflen,
+ char **env);
+
#endif /* __XFS_SUPER_H__ */
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 26ed16e67410d7..0b6b178cb8169a 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -522,6 +522,23 @@ xfs_shutdown_hook_setup(
# define xfs_shutdown_hook(...) ((void)0)
#endif /* CONFIG_XFS_LIVE_HOOKS */
+static void
+xfs_send_shutdown_uevent(
+ struct xfs_mount *mp)
+{
+ char buf[XFS_UEVENT_BUFLEN];
+ char *env[] = {
+ "TYPE=shutdown",
+ XFS_UEVENT_STR_PTRS,
+ NULL,
+ };
+ int error;
+
+ error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+ if (!error)
+ kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_OFFLINE, env);
+}
+
/*
* Force a shutdown of the filesystem instantly while keeping the filesystem
* consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -572,6 +589,7 @@ xfs_do_force_shutdown(
}
trace_xfs_force_shutdown(mp, tag, flags, fname, lnnum);
+ xfs_send_shutdown_uevent(mp);
xfs_alert_tag(mp, tag,
"%s (0x%x) detected at %pS (%s:%d). Shutting down filesystem.",
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 54d82f5a5b8863..bfd12ccaa707a8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -53,6 +53,7 @@
#include <linux/magic.h>
#include <linux/fs_context.h>
#include <linux/fs_parser.h>
+#include <linux/uuid.h>
static const struct super_operations xfs_super_operations;
@@ -1244,12 +1245,73 @@ xfs_inodegc_free_percpu(
free_percpu(mp->m_inodegc);
}
+int
+xfs_format_uevent_strings(
+ struct xfs_mount *mp,
+ char *buf,
+ ssize_t buflen,
+ char **env)
+{
+ ssize_t written;
+
+ ASSERT(buflen >= XFS_UEVENT_BUFLEN);
+
+ written = snprintf(buf, buflen, "SID=%s", mp->m_super->s_id);
+ if (written >= buflen)
+ return -EINVAL;
+
+ *env = buf;
+ env++;
+ buf += written + 1;
+ buflen -= written + 1;
+
+ written = snprintf(buf, buflen, "UUID=%pU", &mp->m_sb.sb_uuid);
+ if (written >= buflen)
+ return -EINVAL;
+
+ *env = buf;
+ env++;
+ buf += written + 1;
+ buflen -= written + 1;
+
+ written = snprintf(buf, buflen, "META_UUID=%pU",
+ &mp->m_sb.sb_meta_uuid);
+ if (written >= buflen)
+ return -EINVAL;
+
+ *env = buf;
+ env++;
+ buf += written + 1;
+ buflen -= written + 1;
+
+ ASSERT(buflen >= 0);
+ return 0;
+}
+
+static void
+xfs_send_unmount_uevent(
+ struct xfs_mount *mp)
+{
+ char buf[XFS_UEVENT_BUFLEN];
+ char *env[] = {
+ "TYPE=unmount",
+ XFS_UEVENT_STR_PTRS,
+ NULL,
+ };
+ int error;
+
+ error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[1]);
+ if (!error)
+ kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
static void
xfs_fs_put_super(
struct super_block *sb)
{
struct xfs_mount *mp = XFS_M(sb);
+ xfs_send_unmount_uevent(mp);
xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
xfs_filestream_unmount(mp);
xfs_unmountfs(mp);
@@ -1667,6 +1729,37 @@ xfs_debugfs_mkdir(
return child;
}
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+ struct fs_context *fc,
+ struct xfs_mount *mp)
+{
+ char *source;
+ char buf[XFS_UEVENT_BUFLEN];
+ char *env[] = {
+ "TYPE=mount",
+ NULL, /* source */
+ XFS_UEVENT_STR_PTRS,
+ NULL,
+ };
+ int error;
+
+ source = kasprintf(GFP_KERNEL, "SOURCE=%s", fc->source);
+ if (!source)
+ return;
+ env[1] = source;
+
+ error = xfs_format_uevent_strings(mp, buf, sizeof(buf), &env[2]);
+ if (!error)
+ kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+
+ kfree(source);
+}
+
static int
xfs_fs_fill_super(
struct super_block *sb,
@@ -1980,6 +2073,7 @@ xfs_fs_fill_super(
mp->m_debugfs_uuid = NULL;
}
+ xfs_send_mount_uevent(fc, mp);
return 0;
out_filestream_unmount:
* [PATCH 20/22] xfs: merge health monitoring events when possible
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (18 preceding siblings ...)
2025-11-05 0:53 ` [PATCH 19/22] xfs: send uevents when major filesystem events happen Darrick J. Wong
@ 2025-11-05 0:53 ` Darrick J. Wong
2025-11-05 0:53 ` [PATCH 21/22] xfs: restrict healthmon users further Darrick J. Wong
2025-11-05 0:54 ` [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process Darrick J. Wong
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:53 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Reduce memory consumption and event traffic by merging healthmon events
whenever we actually add an event to the queue.
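For range-style events (media errors and file ranges), merging means
growing an already queued event when the new one touches it on either
side. A standalone sketch of that rule is below. The struct range_event
type and the merge_adjacent name are hypothetical, and the real events
also require matching type, domain, and (for file ranges) inode and
generation before the ranges are compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the range part of a healthmon event. */
struct range_event {
	uint64_t	start;
	uint64_t	len;
};

/*
 * Merge incoming into existing if the two ranges are exactly adjacent.
 * Returns true if existing was extended, false if they do not touch.
 */
bool
merge_adjacent(struct range_event *existing,
		const struct range_event *incoming)
{
	/* incoming starts where existing ends: extend forward */
	if (existing->start + existing->len == incoming->start) {
		existing->len += incoming->len;
		return true;
	}

	/* incoming ends where existing starts: extend backward */
	if (incoming->start + incoming->len == existing->start) {
		existing->start = incoming->start;
		existing->len += incoming->len;
		return true;
	}

	return false;
}
```

Overlapping or disjoint ranges are deliberately left alone in this sketch;
only exact adjacency merges, which matches the comparisons in the diff
below.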
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
fs/xfs/xfs_trace.h | 1
fs/xfs/xfs_healthmon.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 108 insertions(+)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e5d95add53d347..520526ef9cd11c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -6143,6 +6143,7 @@ DEFINE_EVENT(xfs_healthmon_event_class, name, \
TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
TP_ARGS(mp, event))
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_insert);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_merge);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index e5715f52f4b218..b46b63e62d5143 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -143,12 +143,112 @@ xfs_healthmon_free_head(
return 0;
}
+static bool
+xfs_healthmon_merge_events(
+	struct xfs_healthmon_event		*existing,
+	const struct xfs_healthmon_event	*new)
+{
+	if (!existing)
+		return false;
+
+	/* type and domain must match to merge events */
+	if (existing->type != new->type ||
+	    existing->domain != new->domain)
+		return false;
+
+	switch (existing->type) {
+	case XFS_HEALTHMON_RUNNING:
+	case XFS_HEALTHMON_UNMOUNT:
+		/* should only ever be one of these events anyway */
+		return false;
+
+	case XFS_HEALTHMON_LOST:
+		existing->lostcount += new->lostcount;
+		return true;
+
+	case XFS_HEALTHMON_SHUTDOWN:
+		/* yes, we can race to shutdown */
+		existing->flags |= new->flags;
+		return true;
+
+	case XFS_HEALTHMON_SICK:
+	case XFS_HEALTHMON_CORRUPT:
+	case XFS_HEALTHMON_HEALTHY:
+		switch (existing->domain) {
+		case XFS_HEALTHMON_FS:
+			existing->fsmask |= new->fsmask;
+			return true;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			if (existing->group == new->group) {
+				existing->grpmask |= new->grpmask;
+				return true;
+			}
+			return false;
+		case XFS_HEALTHMON_INODE:
+			if (existing->ino == new->ino &&
+			    existing->gen == new->gen) {
+				existing->imask |= new->imask;
+				return true;
+			}
+			return false;
+		default:
+			ASSERT(0);
+			return false;
+		}
+		return false;
+
+	case XFS_HEALTHMON_MEDIA_ERROR:
+		/* physically adjacent errors can merge */
+		if (existing->daddr + existing->bbcount == new->daddr) {
+			existing->bbcount += new->bbcount;
+			return true;
+		}
+		if (new->daddr + new->bbcount == existing->daddr) {
+			existing->daddr = new->daddr;
+			existing->bbcount += new->bbcount;
+			return true;
+		}
+		return false;
+
+	case XFS_HEALTHMON_BUFREAD:
+	case XFS_HEALTHMON_BUFWRITE:
+	case XFS_HEALTHMON_DIOREAD:
+	case XFS_HEALTHMON_DIOWRITE:
+	case XFS_HEALTHMON_DATALOST:
+		/* logically adjacent file ranges can merge */
+		if (existing->fino != new->fino || existing->fgen != new->fgen)
+			return false;
+
+		if (existing->fpos + existing->flen == new->fpos) {
+			existing->flen += new->flen;
+			return true;
+		}
+
+		if (new->fpos + new->flen == existing->fpos) {
+			existing->fpos = new->fpos;
+			existing->flen += new->flen;
+			return true;
+		}
+		return false;
+	}
+
+	return false;
+}
+
+
/* Insert an event onto the start of the list. */
static inline void
__xfs_healthmon_insert(
struct xfs_healthmon *hm,
struct xfs_healthmon_event *event)
{
+ if (xfs_healthmon_merge_events(hm->first_event, event)) {
+ trace_xfs_healthmon_merge(hm->mp, hm->first_event);
+ kfree(event);
+ wake_up(&hm->wait);
+ return;
+ }
+
event->next = hm->first_event;
if (!hm->first_event)
hm->first_event = event;
@@ -166,6 +266,13 @@ __xfs_healthmon_push(
struct xfs_healthmon *hm,
struct xfs_healthmon_event *event)
{
+ if (xfs_healthmon_merge_events(hm->last_event, event)) {
+ trace_xfs_healthmon_merge(hm->mp, hm->last_event);
+ kfree(event);
+ wake_up(&hm->wait);
+ return;
+ }
+
if (!hm->first_event)
hm->first_event = event;
if (hm->last_event)
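
For readers who want to see the queue-level effect of the push path above, here is a small userspace sketch. It is a simplified model under stated assumptions: `struct hm_event`, `struct hm_queue`, `hm_queue_push()` and `hm_queue_free()` are hypothetical names, events carry only an inode number, generation and sickness bitmask, and the sketch checks for a merge before allocating, whereas the patch frees an already-allocated event after merging.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified event: one inode plus a bitmask of sick parts. */
struct hm_event {
	struct hm_event	*next;
	uint64_t	ino;
	uint32_t	gen;
	uint32_t	imask;
};

/* Hypothetical singly linked event queue with head and tail pointers. */
struct hm_queue {
	struct hm_event	*first;
	struct hm_event	*last;
	unsigned int	nr_events;
};

/*
 * Append a report to the tail of the queue.  If the newest queued event
 * describes the same inode, OR the new bits into it instead of queueing
 * another object.  Returns 0 on success or -1 if allocation fails.
 */
static int
hm_queue_push(
	struct hm_queue	*q,
	uint64_t	ino,
	uint32_t	gen,
	uint32_t	imask)
{
	struct hm_event	*tail = q->last;
	struct hm_event	*ev;

	/* only the tail is examined, never older events */
	if (tail && tail->ino == ino && tail->gen == gen) {
		tail->imask |= imask;
		return 0;
	}

	ev = calloc(1, sizeof(*ev));
	if (!ev)
		return -1;
	ev->ino = ino;
	ev->gen = gen;
	ev->imask = imask;

	if (!q->first)
		q->first = ev;
	if (q->last)
		q->last->next = ev;
	q->last = ev;
	q->nr_events++;
	return 0;
}

/* Release every queued event and reset the queue to empty. */
static void
hm_queue_free(
	struct hm_queue	*q)
{
	struct hm_event	*ev;

	while ((ev = q->first) != NULL) {
		q->first = ev->next;
		free(ev);
	}
	q->last = NULL;
	q->nr_events = 0;
}
```

Because only the newest event is compared, a report for an inode that was queued earlier but is no longer at the tail becomes a new event. This keeps the merge check constant-time and preserves event ordering.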
* [PATCH 21/22] xfs: restrict healthmon users further
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (19 preceding siblings ...)
2025-11-05 0:53 ` [PATCH 20/22] xfs: merge health monitoring events when possible Darrick J. Wong
@ 2025-11-05 0:53 ` Darrick J. Wong
2025-11-05 0:54 ` [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process Darrick J. Wong
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:53 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Because health monitoring events include file handles and deep
information about the filesystem structure, restrict usage of healthmon
to processes that call the ioctl on the root directory of the filesystem
and run in the initial user namespace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
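
The resulting access policy is three independent checks that must all pass. A minimal userspace model follows. `struct hm_caller` and `hm_check_access()` are hypothetical names, and the boolean fields stand in for facts the kernel derives itself via `capable(CAP_SYS_ADMIN)` and `current_user_ns()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the calling process and the fd it passed in. */
struct hm_caller {
	bool		has_sys_admin;	/* models capable(CAP_SYS_ADMIN) */
	bool		in_init_userns;	/* models current_user_ns() == &init_user_ns */
	uint64_t	fd_ino;		/* inode number behind the ioctl fd */
};

/*
 * Return 0 if the caller may start health monitoring, or -EPERM if any
 * one of the three requirements is not met.
 */
static int
hm_check_access(
	const struct hm_caller	*c,
	uint64_t		rootino)
{
	assert(c != NULL);

	if (!c->has_sys_admin)
		return -EPERM;
	/* the ioctl must be issued on the filesystem's root directory */
	if (c->fd_ino != rootino)
		return -EPERM;
	/* callers inside a non-initial user namespace are refused */
	if (!c->in_init_userns)
		return -EPERM;
	return 0;
}
```

All failures map to the same `-EPERM`, so a caller cannot tell from the error code which requirement it missed.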
fs/xfs/xfs_healthmon.h | 4 ++--
fs/xfs/xfs_healthmon.c | 9 ++++++++-
fs/xfs/xfs_ioctl.c | 2 +-
3 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index 1ce49197262b1c..6b650ab0c92238 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -99,10 +99,10 @@ struct xfs_healthmon_event {
};
#ifdef CONFIG_XFS_HEALTH_MONITOR
-long xfs_ioc_health_monitor(struct xfs_mount *mp,
+long xfs_ioc_health_monitor(struct file *file,
struct xfs_health_monitor __user *arg);
#else
-# define xfs_ioc_health_monitor(mp, hmo) (-ENOTTY)
+# define xfs_ioc_health_monitor(file, hmo) (-ENOTTY)
#endif /* CONFIG_XFS_HEALTH_MONITOR */
#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index b46b63e62d5143..a8ea6483ca98fb 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -1337,18 +1337,25 @@ static const struct file_operations xfs_healthmon_fops = {
*/
long
xfs_ioc_health_monitor(
- struct xfs_mount *mp,
+ struct file *file,
struct xfs_health_monitor __user *arg)
{
struct xfs_health_monitor hmo;
struct xfs_healthmon *hm;
struct xfs_healthmon_event *event;
+ struct xfs_inode *ip = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip->i_mount;
int fd;
int ret;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
+ if (ip->i_ino != mp->m_sb.sb_rootino)
+ return -EPERM;
+ if (current_user_ns() != &init_user_ns)
+ return -EPERM;
+
if (copy_from_user(&hmo, arg, sizeof(hmo)))
return -EFAULT;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7a80a6ad4b2d99..6c3eecabf09908 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1424,7 +1424,7 @@ xfs_file_ioctl(
return xfs_ioc_commit_range(filp, arg);
case XFS_IOC_HEALTH_MONITOR:
- return xfs_ioc_health_monitor(mp, arg);
+ return xfs_ioc_health_monitor(filp, arg);
case XFS_IOC_MEDIA_ERROR:
return xfs_ioc_media_error(mp, arg);
* [PATCH 22/22] xfs: charge healthmon event objects to the memcg of the listening process
2025-11-05 0:48 ` [PATCHSET V3 1/2] xfs: autonomous self healing of filesystems Darrick J. Wong
` (20 preceding siblings ...)
2025-11-05 0:53 ` [PATCH 21/22] xfs: restrict healthmon users further Darrick J. Wong
@ 2025-11-05 0:54 ` Darrick J. Wong
21 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-11-05 0:54 UTC (permalink / raw)
To: djwong, cem; +Cc: hch, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Event objects are created in the context of whichever process
experienced the health event, which means that we currently charge that
process's memory cgroup controller for that object. This isn't entirely
fair to that process, because it's being charged for memory that solely
benefits whatever is using the healthmon fd (xfs_healer).
Therefore, save the memcg that was in place when the healthmon fd was
created (which we assume is xfs_healer) and make sure the objects are
charged to that memcg. This also enables sysadmins to constrain the
kernel memory usage of xfs_healer through memcgs.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
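
The save/override/restore pattern that each hook follows around `set_active_memcg()` can be modelled in plain C. This is only a single-threaded sketch under stated assumptions: `struct account`, `set_active_account()`, `charge()` and `report_event()` are hypothetical names, and a byte counter stands in for a memory cgroup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accounting bucket standing in for a memory cgroup. */
struct account {
	size_t	charged;
};

/* Bucket charged when no override is active (models the current task). */
static struct account	*task_account;
/* Optional override, modelled on the kernel's active memcg. */
static struct account	*active_account;

/* Install a new override and return the previous one for later restore. */
static struct account *
set_active_account(
	struct account	*acct)
{
	struct account	*old = active_account;

	active_account = acct;
	return old;
}

/* Charge @bytes to the override if one is set, else to the current task. */
static void
charge(
	size_t		bytes)
{
	struct account	*acct = active_account ? active_account : task_account;

	assert(acct != NULL);
	acct->charged += bytes;
}

/*
 * Event hook: runs in the context of the process that hit the error, but
 * bills the allocation to the listener that opened the monitoring fd.
 */
static void
report_event(
	struct account	*listener,
	size_t		event_size)
{
	struct account	*old = set_active_account(listener);

	charge(event_size);

	/* always restore, so later allocations are billed normally */
	set_active_account(old);
}
```

Restoring the previous value rather than clearing it keeps the pattern safe to nest inside a caller that has already set its own override.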
fs/xfs/xfs_healthmon.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index a8ea6483ca98fb..def4de5f6bc543 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -67,6 +67,9 @@ struct xfs_healthmon {
struct xfs_healthmon_event *first_event;
struct xfs_healthmon_event *last_event;
+ /* charge event object usage to this memory cgroup */
+ struct mem_cgroup *memcg;
+
/* live update hooks */
struct xfs_shutdown_hook shook;
struct xfs_health_hook hhook;
@@ -500,6 +503,7 @@ xfs_healthmon_metadata_hook(
struct xfs_health_update_params *hup = data;
struct xfs_healthmon *hm;
struct xfs_healthmon_event *event;
+ struct mem_cgroup *old_memcg;
enum xfs_health_update_type type = action;
unsigned int mask = 0;
int error;
@@ -511,6 +515,7 @@ xfs_healthmon_metadata_hook(
return NOTIFY_DONE;
mutex_lock(&hm->lock);
+ old_memcg = set_active_memcg(hm->memcg);
trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
hm->lost_prev_event);
@@ -586,6 +591,7 @@ xfs_healthmon_metadata_hook(
goto out_event;
out_unlock:
+ set_active_memcg(old_memcg);
mutex_unlock(&hm->lock);
return NOTIFY_DONE;
out_event:
@@ -602,11 +608,13 @@ xfs_healthmon_shutdown_hook(
{
struct xfs_healthmon *hm;
struct xfs_healthmon_event *event;
+ struct mem_cgroup *old_memcg;
int error;
hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
mutex_lock(&hm->lock);
+ old_memcg = set_active_memcg(hm->memcg);
trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
hm->lost_prev_event);
@@ -626,6 +634,7 @@ xfs_healthmon_shutdown_hook(
kfree(event);
out_unlock:
+ set_active_memcg(old_memcg);
mutex_unlock(&hm->lock);
return NOTIFY_DONE;
}
@@ -640,12 +649,14 @@ xfs_healthmon_media_error_hook(
struct xfs_healthmon *hm;
struct xfs_healthmon_event *event;
struct xfs_media_error_params *p = data;
+ struct mem_cgroup *old_memcg;
enum xfs_healthmon_domain domain = 0; /* shut up gcc */
int error;
hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
mutex_lock(&hm->lock);
+ old_memcg = set_active_memcg(hm->memcg);
trace_xfs_healthmon_media_error_hook(p, hm->events,
hm->lost_prev_event);
@@ -677,6 +688,7 @@ xfs_healthmon_media_error_hook(
kfree(event);
out_unlock:
+ set_active_memcg(old_memcg);
mutex_unlock(&hm->lock);
return NOTIFY_DONE;
}
@@ -691,6 +703,7 @@ xfs_healthmon_file_ioerror_hook(
struct xfs_healthmon *hm;
struct xfs_healthmon_event *event;
struct xfs_file_ioerror_params *p = data;
+ struct mem_cgroup *old_memcg;
enum xfs_healthmon_type type = 0;
int error;
@@ -709,6 +722,7 @@ xfs_healthmon_file_ioerror_hook(
}
mutex_lock(&hm->lock);
+ old_memcg = set_active_memcg(hm->memcg);
trace_xfs_healthmon_file_ioerror_hook(hm->mp, action, p, hm->events,
hm->lost_prev_event);
@@ -748,6 +762,7 @@ xfs_healthmon_file_ioerror_hook(
kfree(event);
out_unlock:
+ set_active_memcg(old_memcg);
mutex_unlock(&hm->lock);
return NOTIFY_DONE;
}
@@ -1188,6 +1203,7 @@ xfs_healthmon_release(
xfs_healthmon_free_events(hm);
if (hm->outbuf.size)
kfree(hm->outbuf.buffer);
+ mem_cgroup_put(hm->memcg);
kfree(hm);
return 0;
@@ -1367,6 +1383,7 @@ xfs_ioc_health_monitor(
return -ENOMEM;
hm->mp = mp;
hm->format = hmo.format;
+ hm->memcg = get_mem_cgroup_from_mm(current->mm);
/*
* Since we already got a ref to the module, take a reference to the
@@ -1443,6 +1460,7 @@ xfs_ioc_health_monitor(
xfs_health_hook_disable();
mutex_destroy(&hm->lock);
xfs_healthmon_free_events(hm);
+ mem_cgroup_put(hm->memcg);
kfree(hm);
return ret;
}