From: "Darrick J. Wong" <djwong@kernel.org>
To: Allison Henderson <allison.henderson@oracle.com>
Cc: Catherine Hoang <catherine.hoang@oracle.com>,
"david@fromorbit.com" <david@fromorbit.com>,
"willy@infradead.org" <willy@infradead.org>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
Chandan Babu <chandan.babu@oracle.com>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"hch@infradead.org" <hch@infradead.org>
Subject: Re: [PATCH 14/14] xfs: document future directions of online fsck
Date: Wed, 1 Mar 2023 16:39:53 -0800
Message-ID: <Y//wWfERMOrEtFnu@magnolia>
In-Reply-To: <1a1bb01af95baab71172d0f6366e156a01b68143.camel@oracle.com>
On Wed, Mar 01, 2023 at 05:37:19AM +0000, Allison Henderson wrote:
> On Fri, 2022-12-30 at 14:10 -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add the seventh and final chapter of the online fsck documentation,
> > where we talk about future functionality that can tie in with the
> > functionality provided by the online fsck patchset.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > .../filesystems/xfs-online-fsck-design.rst | 155 ++++++++++++++++++++
> > 1 file changed, 155 insertions(+)
> >
> >
> > diff --git a/Documentation/filesystems/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs-online-fsck-design.rst
> > index 05b9411fac7f..41291edb02b9 100644
> > --- a/Documentation/filesystems/xfs-online-fsck-design.rst
> > +++ b/Documentation/filesystems/xfs-online-fsck-design.rst
> > @@ -4067,6 +4067,8 @@ The extra flexibility enables several new use cases:
> > (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby
> > committing all
> > of the updates to the original file, or none of them.
> >
> > +.. _swapext_if_unchanged:
> > +
> > - **Transactional file updates**: The same mechanism as above, but
> > the caller
> > only wants the commit to occur if the original file's contents
> > have not
> > changed.
> > @@ -4818,3 +4820,156 @@ and report what has been lost.
> > For media errors in blocks owned by files, the lack of parent
> > pointers means
> > that the entire filesystem must be walked to report the file paths
> > and offsets
> > corresponding to the media error.
> > +
> > +7. Conclusion and Future Work
> > +=============================
> > +
> > +It is hoped that the reader of this document has followed the
> > designs laid out
> > +in this document and now has some familiarity with how XFS performs
> > online
> > +rebuilding of its metadata indices, and how filesystem users can
> > interact with
> > +that functionality.
> > +Although the scope of this work is daunting, it is hoped that this
> > guide will
> > +make it easier for code readers to understand what has been built,
> > for whom it
> > +has been built, and why.
> > +Please feel free to contact the XFS mailing list with questions.
> > +
> > +FIEXCHANGE_RANGE
> > +----------------
> > +
> > +As discussed earlier, a second frontend to the atomic extent swap
> > mechanism is
> > +a new ioctl call that userspace programs can use to commit updates
> > to files
> > +atomically.
> > +This frontend has been out for review for several years now, though
> > the
> > +necessary refinements to online repair and lack of customer demand
> > mean that
> > +the proposal has not been pushed very hard.
Note: The "Extent Swapping with Regular User Files" section has moved
here.
> > +Vectorized Scrub
> > +----------------
> > +
> > +As it turns out, the :ref:`refactoring <scrubrepair>` of repair
> > items mentioned
> > +earlier was a catalyst for enabling a vectorized scrub system call.
> > +Since 2018, the cost of making a kernel call has increased
> > considerably on some
> > +systems to mitigate the effects of speculative execution attacks.
> > +This incentivizes program authors to make as few system calls as
> > possible to
> > +reduce the number of times an execution path crosses a security
> > boundary.
> > +
> > +With vectorized scrub, userspace pushes to the kernel the identity
> > of a
> > +filesystem object, a list of scrub types to run against that object,
> > and a
> > +simple representation of the data dependencies between the selected
> > scrub
> > +types.
> > +The kernel executes as much of the caller's plan as it can until it
> > hits a
> > +dependency that cannot be satisfied due to a corruption, and tells
> > userspace
> > +how much was accomplished.
> > +It is hoped that ``io_uring`` will pick up enough of this
> > functionality that
> > +online fsck can use that instead of adding a separate vectored scrub
> > system
> > +call to XFS.
> > +
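For readers following along, the "run the plan until a dependency is corrupt" behavior described above can be sketched in plain C. This is only a userspace model under stated assumptions: `struct scrub_vec`, `run_scrub_vec()`, and `demo_check()` are hypothetical names invented for illustration, and the real vectorized scrub ABI in the patchsets may look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout; not the ABI from the vectorized-scrub series. */
struct scrub_vec {
	unsigned int type;	/* scrub type to run (illustrative number) */
	int dep;		/* index of an earlier entry we depend on, or -1 */
	int corrupt;		/* out: nonzero if the check found corruption */
	int done;		/* out: nonzero if this entry was executed */
};

/* Caller-supplied checker: returns nonzero if the object is corrupt. */
typedef int (*scrub_fn)(unsigned int type);

/* Demo checker (an assumption): pretend scrub type 2 finds corruption. */
int demo_check(unsigned int type)
{
	return type == 2;
}

/*
 * Run the entries in order.  Stop at the first entry whose dependency
 * was found corrupt (or does not point at an earlier entry).
 * Returns the number of entries that were executed.
 */
size_t run_scrub_vec(struct scrub_vec *v, size_t n, scrub_fn check)
{
	size_t i;

	assert(v != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int dep = v[i].dep;

		if (dep >= 0 && ((size_t)dep >= i || v[dep].corrupt))
			break;
		v[i].corrupt = check(v[i].type);
		v[i].done = 1;
	}
	return i;
}
```

The return value plays the role of "how much was accomplished": the caller can resume from that index after deciding what to do about the corrupt dependency.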
> > +The relevant patchsets are the
> > +`kernel vectorized scrub
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
> > +and
> > +`userspace vectorized scrub
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
> > +series.
> > +
> > +Quality of Service Targets for Scrub
> > +------------------------------------
> > +
> > +One serious shortcoming of the online fsck code is that the amount
> > of time that
> > +it can spend in the kernel holding resource locks is basically
> > unbounded.
> > +Userspace is allowed to send a fatal signal to the process which
> > will cause
> > +``xfs_scrub`` to exit when it reaches a good stopping point, but
> > there's no way
> > +for userspace to provide a time budget to the kernel.
> > +Given that the scrub codebase has helpers to detect fatal signals,
> > it shouldn't
> > +be too much work to allow userspace to specify a timeout for a
> > scrub/repair
> > +operation and abort the operation if it exceeds budget.
> > +However, most repair functions have the property that once they
> > begin to touch
> > +ondisk metadata, the operation cannot be cancelled cleanly, after
> > which a QoS
> > +timeout is no longer useful.
> > +
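As a rough illustration of the time budget idea above, here is a small userspace sketch in C. It assumes a monotonic-clock deadline and a "committed" flag that is set once ondisk metadata has been touched, after which the timeout is deliberately ignored. `struct scrub_budget`, `budget_start()`, and `budget_should_abort()` are hypothetical names; nothing like this exists in the scrub code as quoted.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-operation budget; not part of any real scrub ABI. */
struct scrub_budget {
	struct timespec deadline;	/* absolute CLOCK_MONOTONIC time */
	bool committed;			/* set once ondisk metadata is touched */
};

/* Start the clock: the operation may run for budget_ms milliseconds. */
void budget_start(struct scrub_budget *b, long budget_ms)
{
	assert(budget_ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &b->deadline);
	b->deadline.tv_sec += budget_ms / 1000;
	b->deadline.tv_nsec += (budget_ms % 1000) * 1000000L;
	if (b->deadline.tv_nsec >= 1000000000L) {
		b->deadline.tv_sec++;
		b->deadline.tv_nsec -= 1000000000L;
	}
	b->committed = false;
}

/*
 * Return true if the operation is over budget and can still be
 * cancelled cleanly.  Once committed, the timeout no longer applies.
 */
bool budget_should_abort(const struct scrub_budget *b)
{
	struct timespec now;

	if (b->committed)
		return false;
	clock_gettime(CLOCK_MONOTONIC, &now);
	if (now.tv_sec != b->deadline.tv_sec)
		return now.tv_sec > b->deadline.tv_sec;
	return now.tv_nsec >= b->deadline.tv_nsec;
}
```

A check like this would presumably sit next to the existing fatal-signal checks, so that both use the same stopping points.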
> > +Defragmenting Free Space
> > +------------------------
> > +
> > +Over the years, many XFS users have requested the creation of a
> > program to
> > +clear a portion of the physical storage underlying a filesystem so
> > that it
> > +becomes a contiguous chunk of free space.
> > +Call this free space defragmenter ``clearspace`` for short.
> > +
> > +The first piece the ``clearspace`` program needs is the ability to
> > read the
> > +reverse mapping index from userspace.
> > +This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
> > +The second piece it needs is a new fallocate mode
> > +(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a
> > region and
> > +maps it to a file.
> > +Call this file the "space collector" file.
> > +The third piece is the ability to force an online repair.
> > +
> > +To clear all the metadata out of a portion of physical storage,
> > clearspace
> > +uses the new fallocate map-freespace call to map any free space in
> > that region
> > +to the space collector file.
> > +Next, clearspace finds all metadata blocks in that region by way of
> > +``GETFSMAP`` and issues forced repair requests on the data
> > structure.
> > +This often results in the metadata being rebuilt somewhere that is
> > not being
> > +cleared.
> > +After each relocation, clearspace calls the "map free space"
> > function again to
> > +collect any newly freed space in the region being cleared.
> > +
> > +To clear all the file data out of a portion of the physical storage,
> > clearspace
> > +uses the FSMAP information to find relevant file data blocks.
> > +Having identified a good target, it uses the ``FICLONERANGE`` call
> > on that part
> > +of the file to try to share the physical space with a dummy file.
> > +Cloning the extent means that the original owners cannot overwrite
> > the
> > +contents; any changes will be written somewhere else via copy-on-
> > write.
> > +Clearspace makes its own copy of the frozen extent in an area that
> > is not being
> > +cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
> > +<swapext_if_unchanged>` feature) to change the target file's data
> > extent
> > +mapping away from the area being cleared.
> > +When all other mappings have been moved, clearspace reflinks the
> > space into the
> > +space collector file so that it becomes unavailable.
> > +
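The "freeze the extent by cloning it into a dummy file" step above maps onto the existing ``FICLONERANGE`` ioctl. The following is a minimal sketch, assuming the dummy file receives the range at the same file offset as the source (the text does not say which offset clearspace uses); `freeze_extent()` is a hypothetical helper name, not code from the clearspace patchset.

```c
#include <assert.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>

/*
 * Share the range [offset, offset + length) of src_fd with the same
 * range of dummy_fd.  Returns 0 on success, or -1 with errno set.
 * The same-offset placement is an assumption made for this sketch.
 */
int freeze_extent(int src_fd, int dummy_fd,
		  unsigned long long offset, unsigned long long length)
{
	struct file_clone_range fcr = {
		.src_fd = src_fd,
		.src_offset = offset,
		.src_length = length,
		.dest_offset = offset,
	};

	/* A length of 0 would mean "clone to end of file"; require a range. */
	assert(length > 0);
	return ioctl(dummy_fd, FICLONERANGE, &fcr);
}
```

On filesystems without reflink support the call fails, so a real tool would need a fallback path; that is outside the scope of this sketch.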
> > +There are further optimizations that could apply to the above
> > algorithm.
> > +To clear a piece of physical storage that has a high sharing factor,
> > it is
> > +strongly desirable to retain this sharing factor.
> > +In fact, these extents should be moved first to maximize sharing
> > factor after
> > +the operation completes.
> > +To make this work smoothly, clearspace needs a new ioctl
> > +(``FS_IOC_GETREFCOUNTS``) to report reference count information to
> > userspace.
> > +With the refcount information exposed, clearspace can quickly find
> > the longest,
> > +most shared data extents in the filesystem, and target them first.
> > +
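Since ``FS_IOC_GETREFCOUNTS`` is only proposed, the targeting step can be illustrated with plain records. The sketch below assumes that "most shared" takes priority over "longest" when ordering candidates; that tie-breaking policy, `struct shared_extent`, and `order_targets()` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record, as a refcount-reporting ioctl might return it. */
struct shared_extent {
	uint64_t physical;	/* start of the extent on disk */
	uint64_t length;	/* length of the extent */
	uint32_t refcount;	/* number of owners sharing the extent */
};

/* Sort key: higher refcount first, then longer extents first. */
static int cmp_target(const void *a, const void *b)
{
	const struct shared_extent *x = a, *y = b;

	if (x->refcount != y->refcount)
		return x->refcount > y->refcount ? -1 : 1;
	if (x->length != y->length)
		return x->length > y->length ? -1 : 1;
	return 0;
}

/* Order candidate extents so the best relocation targets come first. */
void order_targets(struct shared_extent *ext, size_t n)
{
	assert(ext != NULL || n == 0);
	if (n > 1)
		qsort(ext, n, sizeof(*ext), cmp_target);
}
```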
>
>
> > +**Question**: How might the filesystem move inode chunks?
> > +
> > +*Answer*:
> "In order to move inode chunks.."
Done.
> > Dave Chinner has a prototype that creates a new file with the old
> > +contents and then locklessly runs around the filesystem updating
> > directory
> > +entries.
> > +The operation cannot complete if the filesystem goes down.
> > +That problem isn't totally insurmountable: create an inode remapping
> > table
> > +hidden behind a jump label, and a log item that tracks the kernel
> > walking the
> > +filesystem to update directory entries.
> > +The trouble is, the kernel can't do anything about open files, since
> > it cannot
> > +revoke them.
> > +
>
>
> > +**Question**: Can static keys be used to add a revoke bailout return
> > to
> > +*every* code path coming in from userspace?
> > +
> > +*Answer*: In principle, yes.
> > +This
>
> "It is also possible to use static keys to add a revoke bailout return
> to each code path coming in from userspace. This..."
I think this change would make the answer redundant with the question.
"Can static keys be used to minimize the runtime cost of supporting
``revoke()`` on XFS files?"
"Yes. Until the first revocation, the bailout code need not be in the
call path at all."
> > would eliminate the overhead of the check until a revocation happens.
> > +It's not clear what we do to a revoked file after all the callers
> > are finished
> > +with it, however.
> > +
> > +The relevant patchsets are the
> > +`kernel freespace defrag
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
> > +and
> > +`userspace freespace defrag
> > +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
> > +series.
>
> I guess since they're just future ideas just light documentation is
> fine. Other than cleaning out the Q & A's, I think it looks pretty
> good.
Ok. Thank you x100000000 for being the first person to publicly comment
on the entire document!
--D
> Allison
>
> > +
> > +Shrinking Filesystems
> > +---------------------
> > +
> > +Removing the end of the filesystem ought to be a simple matter of
> > evacuating
> > +the data and metadata at the end of the filesystem, and handing the
> > freed space
> > +to the shrink code.
> > +That requires an evacuation of the space at the end of the filesystem,
> > which is a
> > +use of free space defragmentation!
> >
>