* [PATCHBOMB 6.12] xfs: metadata directories and realtime groups
@ 2024-08-22 23:52 Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (9 more replies)
0 siblings, 10 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:52 UTC (permalink / raw)
To: Christoph Hellwig, Chandan Babu R; +Cc: xfs
Hi everyone,
Christoph and I have been working on getting the long-delayed metadata
directory tree patchset into mergeable shape, and I think we're now
satisfied that we've gotten the code to where we want it for 6.12.
First come all the accumulated bug fixes for 6.11. After that comes all
the new code:
The metadata directory tree sets us up for much more flexible metadata
within an XFS filesystem. Instead of rooting inodes in the superblock,
which has very limited space, we create a directory tree that can
contain arbitrary numbers of metadata files.
Having done that, we can now shard the realtime volume into multiple
allocation groups, much as we do with AGs for the data device. However,
the realtime volume has a fun twist -- each rtgroup gets its own space
metadata files, and for that we need a metadata directory tree.
Metadata directory trees and realtime groups also enable us to complete
the realtime modernization project, which will add reverse mapping
btrees, reflink, quota support, and zoned storage support for rt
volumes. The commit-range ioctl is now part of the rt groups patchset,
because that's the only practical way to defragment rt files when the
rt extent size is larger than 1 fsblock and rmap is enabled. Also,
with Jeff Layton's multigrained ctime work headed for 6.12, we can now
measure file changes in a saner fashion.
Finally, quota inodes now live in the metadata directory tree, which is
a pretty simple conversion. However, we added yet another new feature,
which is that xfs will now remember the quota accounting and enforcement
state across unmounts. You can still tweak them via mount options, but
not specifying any is no longer interpreted the same as 'noquota'.
I'm only sending the kernel patches to the list for now, but please have
a look at the git tree links for xfsprogs and fstests changes.
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=metadir-quotas_2024-08-22
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=metadir-quotas_2024-08-22
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=metadir-quotas_2024-08-22
--D
^ permalink raw reply [flat|nested] 271+ messages in thread
* [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
@ 2024-08-22 23:56 ` Darrick J. Wong
2024-08-22 23:59 ` [PATCH 1/9] xfs: fix di_onlink checking for V1/V2 inodes Darrick J. Wong
` (9 more replies)
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
` (8 subsequent siblings)
9 siblings, 10 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:56 UTC (permalink / raw)
To: djwong
Cc: Dave Chinner, wozizhi, Anders Blomdell, Christoph Hellwig, willy,
kjell.m.randa, Zizhi Wo, hch, linux-xfs
Hi all,
Various bug fixes for 6.11.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfs-6.11-fixes
---
Commits in this patchset:
* xfs: fix di_onlink checking for V1/V2 inodes
* xfs: fix folio dirtying for XFILE_ALLOC callers
* xfs: xfs_finobt_count_blocks() walks the wrong btree
* xfs: don't bother reporting blocks trimmed via FITRIM
* xfs: Fix the owner setting issue for rmap query in xfs fsmap
* xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
* xfs: Fix missing interval for missing_owner in xfs fsmap
* xfs: take m_growlock when running growfsrt
* xfs: reset rootdir extent size hint after growfsrt
---
fs/xfs/libxfs/xfs_ialloc_btree.c | 2 -
fs/xfs/libxfs/xfs_inode_buf.c | 14 +++++--
fs/xfs/scrub/xfile.c | 2 -
fs/xfs/xfs_discard.c | 36 +++++-------------
fs/xfs/xfs_fsmap.c | 30 +++++++++++++--
fs/xfs/xfs_rtalloc.c | 78 ++++++++++++++++++++++++++++++++------
6 files changed, 114 insertions(+), 48 deletions(-)
* [PATCHSET v31.0 02/10] xfs: atomic file content commits
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
@ 2024-08-22 23:56 ` Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
` (7 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:56 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs, linux-fsdevel
Hi all,
This series creates XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE ioctls
to perform the exchange only if the target file has not been changed
since a given sampling point.
This new functionality uses the mechanism underlying EXCHANGE_RANGE to
stage and commit file updates such that reader programs will see either
the old contents or the new contents in their entirety, with no chance
of torn writes. A successful call completion guarantees that the new
contents will be seen even if the system fails. The pair of ioctls
allows userspace to perform what amounts to a compare and exchange
operation on entire file contents.
Note that there are ongoing arguments in the community about how best to
implement some sort of file data write counter that nfsd could also use
to signal invalidations to clients. Until such a thing is implemented,
this patch will rely on ctime/mtime updates.
Here are the proposed manual pages:
IOCTL-XFS-COMMIT-RANGE(2)    System Calls Manual    IOCTL-XFS-COMMIT-RANGE(2)
NAME
ioctl_xfs_start_commit - prepare to exchange the contents of
two files
ioctl_xfs_commit_range - conditionally exchange the contents
of parts of two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>
int ioctl(int file2_fd, XFS_IOC_START_COMMIT,
          struct xfs_commit_range *arg);
int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,
          struct xfs_commit_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2)
exchanges the contents of the two ranges if file2_fd passes
certain freshness criteria.
Before exchanging the contents, the program must call the
XFS_IOC_START_COMMIT ioctl to sample freshness data for
file2_fd.  If the sampled metadata does not match the file
metadata at commit time, XFS_IOC_COMMIT_RANGE will return
EBUSY.
Exchanges are atomic with regards to concurrent file
operations.  Implementations must guarantee that readers see
either the old contents or the new contents in their entirety,
even if the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_commit_range {
__s32 file1_fd;
__u32 pad;
__u64 file1_offset;
__u64 file2_offset;
__u64 length;
__u64 flags;
__u64 file2_freshness[5];
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
The field file2_freshness is an opaque field whose contents are
determined by the kernel.  These file attributes are used to
confirm that file2_fd has not been changed by another thread
since the current thread began staging its own update.
Both files must be from the same filesystem mount.  If the two
file descriptors represent the same file, the byte ranges must
not overlap.  Most disk-based filesystems require that the
starts of both ranges be aligned to the file block size.  If
this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
The field flags controls the behavior of the exchange
operation.
XFS_EXCHANGE_RANGE_TO_EOF
Ignore the length parameter.  All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to
(file2_offset + (file1_length - file1_offset)).
Meanwhile, all bytes in file2 from file2_offset to
EOF are moved to file1 and file1's size is set to
(file1_offset + (file2_length - file2_offset)).
XFS_EXCHANGE_RANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns.  Opening either file
descriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHANGE_RANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block.  For files on the realtime device, this is
the realtime extent size.  This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHANGE_RANGE_DRY_RUN
Check the parameters and the feasibility of the
operation, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the
error.
ERRORS
Error codes can be one of, but are not limited to, the
following:
EBADF  file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EBUSY  The file2 inode number and timestamps supplied do not
match file2_fd.
EINVAL The parameters are not correct for these files.  This
error can also appear if either file descriptor
represents a device, FIFO, or socket.  Disk filesystems
generally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO    An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem to
exchange the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM  file1_fd or file2_fd is immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV  file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call.
Coordination between multiple threads is performed by the
kernel.
The first is a filesystem defragmenter, which copies the
contents of a file into another file and wishes to exchange the
space mappings of the two files, provided that the original
file has not changed.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
struct xfs_commit_range args = {
	.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
int ret;
/* gather file2's freshness information */
ioctl(fd, XFS_IOC_START_COMMIT, &args);
fstat(fd, &sb);
/* make a fresh copy of the file with terrible alignment to avoid reflink */
copy_file_range(fd, NULL, temp_fd, NULL, 1, 0);
copy_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
	printf("file changed while defrag was underway\n");
The second is a data storage program that wants to commit
non-contiguous updates to a file atomically.  This program
cannot coordinate updates to the file and therefore relies on
the kernel to reject the COMMIT_RANGE command if the file has
been updated by someone else.  This can be done by creating a
temporary file, calling FICLONE(2) to share the contents, and
staging the updates into the temporary file.  The FULL_FILES
flag is recommended for this purpose.  The temporary file can
be deleted or punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct xfs_commit_range args = {
	.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
int ret;
/* gather file2's freshness information */
ioctl(fd, XFS_IOC_START_COMMIT, &args);
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
	printf("file changed before commit; will roll back\n");
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-18 IOCTL-XFS-COMMIT-RANGE(2)
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-commits
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-commits
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-commits
---
Commits in this patchset:
* xfs: introduce new file range commit ioctls
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
* [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-22 23:56 ` Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/3] xfs: validate inumber in xfs_iget Darrick J. Wong
` (2 more replies)
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (6 subsequent siblings)
9 siblings, 3 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:56 UTC (permalink / raw)
To: djwong; +Cc: Dave Chinner, Christoph Hellwig, hch, linux-xfs
Hi all,
Before we start adding code for metadata directory trees, let's clean up
some warts in the realtime bitmap code and the inode allocator code.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=metadir-cleanups
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=metadir-cleanups
---
Commits in this patchset:
* xfs: validate inumber in xfs_iget
* xfs: match on the global RT inode numbers in xfs_is_metadata_inode
* xfs: pass the icreate args object to xfs_dialloc
---
fs/xfs/libxfs/xfs_ialloc.c | 5 +++--
fs/xfs/libxfs/xfs_ialloc.h | 4 +++-
fs/xfs/scrub/tempfile.c | 2 +-
fs/xfs/xfs_icache.c | 2 +-
fs/xfs/xfs_inode.c | 4 ++--
fs/xfs/xfs_inode.h | 7 ++++---
fs/xfs/xfs_qm.c | 2 +-
fs/xfs/xfs_symlink.c | 2 +-
8 files changed, 16 insertions(+), 12 deletions(-)
* [PATCHSET v4.0 04/10] xfs: metadata inode directories
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (2 preceding siblings ...)
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
@ 2024-08-22 23:57 ` Darrick J. Wong
2024-08-23 0:02 ` [PATCH 01/26] xfs: define the on-disk format for the metadir feature Darrick J. Wong
` (25 more replies)
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (5 subsequent siblings)
9 siblings, 26 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:57 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
Hi all,
This series delivers a new feature -- metadata inode directories. This
is a separate directory tree (rooted in the superblock) that contains
only inodes holding filesystem metadata. Different metadata objects can
be looked up with regular paths.
We start by creating xfs_imeta_* functions to mediate access to metadata
inode pointers. This enables the imeta code to abstract inode pointers,
whether they're the classic five in the superblock, or the much more
complex directory tree. All current users of metadata inodes (rt+quota)
are converted to use the boilerplate code.
Next, we define the metadir on-disk format, which consists of marking
inodes with a new iflag that says they're metadata. This we use to
prevent bulkstat and friends from ever getting their hands on fs
metadata.
Finally, we implement metadir operations so that clients can create,
delete, zap, and look up metadata inodes by path. Beware that much of
this code is only lightly used, because the five current users of
metadata inodes don't tend to change them very often. This is likely to
change if and when the subvolume and multiple-rt-volume features get
written/merged/etc.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=metadir
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=metadir
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=metadir
xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=metadir
---
Commits in this patchset:
* xfs: define the on-disk format for the metadir feature
* xfs: refactor loading quota inodes in the regular case
* xfs: iget for metadata inodes
* xfs: load metadata directory root at mount time
* xfs: enforce metadata inode flag
* xfs: read and write metadata inode directory tree
* xfs: disable the agi rotor for metadata inodes
* xfs: hide metadata inodes from everyone because they are special
* xfs: advertise metadata directory feature
* xfs: allow bulkstat to return metadata directories
* xfs: don't count metadata directory files to quota
* xfs: mark quota inodes as metadata files
* xfs: adjust xfs_bmap_add_attrfork for metadir
* xfs: record health problems with the metadata directory
* xfs: refactor directory tree root predicates
* xfs: do not count metadata directory files when doing online quotacheck
* xfs: don't fail repairs on metadata files with no attr fork
* xfs: metadata files can have xattrs if metadir is enabled
* xfs: adjust parent pointer scrubber for sb-rooted metadata files
* xfs: fix di_metatype field of inodes that won't load
* xfs: scrub metadata directories
* xfs: check the metadata directory inumber in superblocks
* xfs: move repair temporary files to the metadata directory tree
* xfs: check metadata directory file path connectivity
* xfs: confirm dotdot target before replacing it during a repair
* xfs: repair metadata directory file path connectivity
---
fs/xfs/Makefile | 5
fs/xfs/libxfs/xfs_attr.c | 5
fs/xfs/libxfs/xfs_bmap.c | 5
fs/xfs/libxfs/xfs_format.h | 81 +++++-
fs/xfs/libxfs/xfs_fs.h | 26 ++
fs/xfs/libxfs/xfs_health.h | 6
fs/xfs/libxfs/xfs_ialloc.c | 58 +++-
fs/xfs/libxfs/xfs_inode_buf.c | 83 ++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3
fs/xfs/libxfs/xfs_inode_util.c | 2
fs/xfs/libxfs/xfs_log_format.h | 2
fs/xfs/libxfs/xfs_metadir.c | 481 ++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_metadir.h | 47 ++++
fs/xfs/libxfs/xfs_metafile.c | 52 ++++
fs/xfs/libxfs/xfs_metafile.h | 31 ++
fs/xfs/libxfs/xfs_ondisk.h | 2
fs/xfs/libxfs/xfs_sb.c | 12 +
fs/xfs/scrub/agheader.c | 5
fs/xfs/scrub/common.c | 65 ++++-
fs/xfs/scrub/common.h | 5
fs/xfs/scrub/dir.c | 10 +
fs/xfs/scrub/dir_repair.c | 20 +
fs/xfs/scrub/dirtree.c | 32 ++
fs/xfs/scrub/dirtree.h | 12 -
fs/xfs/scrub/findparent.c | 28 ++
fs/xfs/scrub/health.c | 1
fs/xfs/scrub/inode.c | 35 ++-
fs/xfs/scrub/inode_repair.c | 34 ++-
fs/xfs/scrub/metapath.c | 521 +++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/nlinks.c | 4
fs/xfs/scrub/nlinks_repair.c | 4
fs/xfs/scrub/orphanage.c | 4
fs/xfs/scrub/parent.c | 39 ++-
fs/xfs/scrub/parent_repair.c | 37 ++-
fs/xfs/scrub/quotacheck.c | 7 -
fs/xfs/scrub/repair.c | 22 +-
fs/xfs/scrub/repair.h | 3
fs/xfs/scrub/scrub.c | 9 +
fs/xfs/scrub/scrub.h | 2
fs/xfs/scrub/stats.c | 1
fs/xfs/scrub/tempfile.c | 105 ++++++++
fs/xfs/scrub/tempfile.h | 3
fs/xfs/scrub/trace.c | 1
fs/xfs/scrub/trace.h | 42 +++
fs/xfs/xfs_dquot.c | 1
fs/xfs/xfs_health.c | 2
fs/xfs/xfs_icache.c | 73 +++++
fs/xfs/xfs_inode.c | 13 +
fs/xfs/xfs_inode.h | 14 +
fs/xfs/xfs_inode_item.c | 7 -
fs/xfs/xfs_inode_item_recover.c | 5
fs/xfs/xfs_ioctl.c | 7 +
fs/xfs/xfs_iops.c | 15 +
fs/xfs/xfs_itable.c | 33 ++
fs/xfs/xfs_itable.h | 3
fs/xfs/xfs_mount.c | 31 ++
fs/xfs/xfs_mount.h | 3
fs/xfs/xfs_qm.c | 80 +++++-
fs/xfs/xfs_qm.h | 3
fs/xfs/xfs_qm_syscalls.c | 13 -
fs/xfs/xfs_quota.h | 5
fs/xfs/xfs_quotaops.c | 53 ++--
fs/xfs/xfs_rtalloc.c | 38 ++-
fs/xfs/xfs_super.c | 4
fs/xfs/xfs_trace.c | 2
fs/xfs/xfs_trace.h | 102 ++++++++
fs/xfs/xfs_trans_dquot.c | 6
67 files changed, 2288 insertions(+), 177 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_metadir.c
create mode 100644 fs/xfs/libxfs/xfs_metadir.h
create mode 100644 fs/xfs/libxfs/xfs_metafile.c
create mode 100644 fs/xfs/libxfs/xfs_metafile.h
create mode 100644 fs/xfs/scrub/metapath.c
* [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (3 preceding siblings ...)
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
@ 2024-08-22 23:57 ` Darrick J. Wong
2024-08-23 0:09 ` [PATCH 01/12] xfs: remove xfs_validate_rtextents Darrick J. Wong
` (11 more replies)
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (4 subsequent siblings)
9 siblings, 12 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:57 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
Hi all,
Here are some cleanups and reorganization of the realtime bitmap code to share
more of that code between userspace and the kernel.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rtbitmap-cleanups
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rtbitmap-cleanups
---
Commits in this patchset:
* xfs: remove xfs_validate_rtextents
* xfs: factor out a xfs_validate_rt_geometry helper
* xfs: make the RT rsum_cache mandatory
* xfs: remove the limit argument to xfs_rtfind_back
* xfs: assert a valid limit in xfs_rtfind_forw
* xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf
* xfs: cleanup the calling convention for xfs_rtpick_extent
* xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc
* xfs: factor out a xfs_growfs_rt_bmblock helper
* xfs: factor out a xfs_last_rt_bmblock helper
* xfs: factor out rtbitmap/summary initialization helpers
* xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock
---
fs/xfs/libxfs/xfs_bmap.c | 3
fs/xfs/libxfs/xfs_rtbitmap.c | 192 ++++++++++++++-
fs/xfs/libxfs/xfs_rtbitmap.h | 33 +--
fs/xfs/libxfs/xfs_sb.c | 64 +++--
fs/xfs/libxfs/xfs_sb.h | 1
fs/xfs/libxfs/xfs_types.h | 12 -
fs/xfs/xfs_rtalloc.c | 535 +++++++++++++++++-------------------------
7 files changed, 438 insertions(+), 402 deletions(-)
* [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (4 preceding siblings ...)
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
@ 2024-08-22 23:57 ` Darrick J. Wong
2024-08-23 0:12 ` [PATCH 01/10] xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock Darrick J. Wong
` (9 more replies)
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 10 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:57 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
Hi all,
While I was reviewing how to integrate realtime allocation groups with
the rt allocator, I noticed several bugs in the existing allocation code
with regards to calculating the maximum range of rtx to scan for free
space. This series fixes those range bugs and cleans up a few things
too.
I also added a few cleanups from Christoph.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rtalloc-fixes
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rtalloc-fixes
---
Commits in this patchset:
* xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock
* xfs: ensure rtx mask/shift are correct after growfs
* xfs: don't return too-short extents from xfs_rtallocate_extent_block
* xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block
* xfs: refactor aligning bestlen to prod
* xfs: clean up xfs_rtallocate_extent_exact a bit
* xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near
* xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block
* xfs: remove xfs_rtb_to_rtxrem
* xfs: simplify xfs_rtalloc_query_range
---
fs/xfs/libxfs/xfs_rtbitmap.c | 51 ++++++---------
fs/xfs/libxfs/xfs_rtbitmap.h | 21 ------
fs/xfs/libxfs/xfs_sb.c | 12 +++
fs/xfs/libxfs/xfs_sb.h | 2 +
fs/xfs/xfs_discard.c | 15 ++--
fs/xfs/xfs_fsmap.c | 11 +--
fs/xfs/xfs_rtalloc.c | 145 +++++++++++++++++++++++-------------------
7 files changed, 124 insertions(+), 133 deletions(-)
* [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (5 preceding siblings ...)
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
@ 2024-08-22 23:57 ` Darrick J. Wong
2024-08-23 0:14 ` [PATCH 01/24] xfs: clean up the ISVALID macro in xfs_bmap_adjacent Darrick J. Wong
` (23 more replies)
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 24 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:57 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
Hi all,
This series adds in-memory data structures for sharding the realtime volume
into independent allocation groups. For existing filesystems, the entire rt
volume is modelled as having a single large group, with (potentially) a number
of rt extents exceeding 2^32, though these are not likely to exist because the
codebase has been a bit broken for decades. The next series fills in the
ondisk format and other supporting structures.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=incore-rtgroups
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=incore-rtgroups
---
Commits in this patchset:
* xfs: clean up the ISVALID macro in xfs_bmap_adjacent
* xfs: factor out a xfs_rtallocate helper
* xfs: rework the rtalloc fallback handling
* xfs: factor out a xfs_rtallocate_align helper
* xfs: make the rtalloc start hint a xfs_rtblock_t
* xfs: add xchk_setup_nothing and xchk_nothing helpers
* xfs: remove xfs_{rtbitmap,rtsummary}_wordcount
* xfs: replace m_rsumsize with m_rsumblocks
* xfs: rearrange xfs_fsmap.c a little bit
* xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c
* xfs: create incore realtime group structures
* xfs: define locking primitives for realtime groups
* xfs: add a lockdep class key for rtgroup inodes
* xfs: support caching rtgroup metadata inodes
* xfs: add rtgroup-based realtime scrubbing context management
* xfs: move RT bitmap and summary information to the rtgroup
* xfs: remove XFS_ILOCK_RT*
* xfs: calculate RT bitmap and summary blocks based on sb_rextents
* xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper
* xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks
* xfs: factor out a xfs_growfs_check_rtgeom helper
* xfs: refactor xfs_rtbitmap_blockcount
* xfs: refactor xfs_rtsummary_blockcount
* xfs: make RT extent numbers relative to the rtgroup
---
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_bmap.c | 101 +++--
fs/xfs/libxfs/xfs_format.h | 3
fs/xfs/libxfs/xfs_rtbitmap.c | 222 +++++------
fs/xfs/libxfs/xfs_rtbitmap.h | 152 ++++----
fs/xfs/libxfs/xfs_rtgroup.c | 529 +++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 268 ++++++++++++++
fs/xfs/libxfs/xfs_sb.c | 7
fs/xfs/libxfs/xfs_trans_resv.c | 4
fs/xfs/libxfs/xfs_types.h | 4
fs/xfs/scrub/bmap.c | 13 +
fs/xfs/scrub/common.c | 78 ++++
fs/xfs/scrub/common.h | 59 ++-
fs/xfs/scrub/fscounters.c | 26 +
fs/xfs/scrub/repair.c | 24 +
fs/xfs/scrub/repair.h | 7
fs/xfs/scrub/rtbitmap.c | 54 ++-
fs/xfs/scrub/rtsummary.c | 118 +++---
fs/xfs/scrub/rtsummary.h | 2
fs/xfs/scrub/rtsummary_repair.c | 19 -
fs/xfs/scrub/scrub.c | 33 ++
fs/xfs/scrub/scrub.h | 42 +-
fs/xfs/xfs_discard.c | 100 +++--
fs/xfs/xfs_fsmap.c | 435 +++++++++++++++--------
fs/xfs/xfs_fsmap.h | 6
fs/xfs/xfs_inode.c | 3
fs/xfs/xfs_inode.h | 13 -
fs/xfs/xfs_ioctl.c | 130 -------
fs/xfs/xfs_iomap.c | 4
fs/xfs/xfs_log_recover.c | 20 +
fs/xfs/xfs_mount.c | 18 +
fs/xfs/xfs_mount.h | 29 +-
fs/xfs/xfs_qm.c | 27 +
fs/xfs/xfs_rtalloc.c | 753 ++++++++++++++++++++++++---------------
fs/xfs/xfs_super.c | 4
fs/xfs/xfs_trace.c | 1
fs/xfs/xfs_trace.h | 38 ++
37 files changed, 2326 insertions(+), 1021 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
* [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (6 preceding siblings ...)
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
@ 2024-08-22 23:58 ` Darrick J. Wong
2024-08-23 0:21 ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:58 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, linux-fsdevel, hch, linux-xfs
Hi all,
Having cleaned up the rtbitmap code and fixed various weird bugs in the
allocator, now we want to do some more cleanups to the rt free space management
code to get it ready for the introduction of allocation groups.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rtgroups-prep
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rtgroups-prep
---
Commits in this patchset:
* iomap: add a merge boundary flag
---
fs/iomap/buffered-io.c | 6 ++++++
include/linux/iomap.h | 4 ++++
2 files changed, 10 insertions(+)
* [PATCHSET v4.0 09/10] xfs: shard the realtime section
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (7 preceding siblings ...)
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
@ 2024-08-22 23:58 ` Darrick J. Wong
2024-08-23 0:21 ` [PATCH 01/26] xfs: define the format of rt groups Darrick J. Wong
` (25 more replies)
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
9 siblings, 26 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:58 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
Hi all,
Right now, the realtime section uses a single pair of metadata inodes to
store the free space information. This presents a scalability problem
since every thread trying to allocate or free rt extents has to lock
these files. It would be very useful if we could begin to tackle these
problems by sharding the realtime section, so create the notion of
realtime groups, which are similar to allocation groups on the data
section.
While we're at it, define a superblock to be stamped into the start of
each rt section. This enables utilities such as blkid to identify block
devices containing realtime sections, and helpfully avoids the situation
where a file extent can cross an rtgroup boundary.
The best advantage for rtgroups will become evident later when we get to
adding rmap and reflink to the realtime volume, since the geometry
constraints are the same for rt groups and AGs. Hence we can reuse all
that code directly.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=realtime-groups
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=realtime-groups
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=realtime-groups
xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=realtime-groups
---
Commits in this patchset:
* xfs: define the format of rt groups
* xfs: check the realtime superblock at mount time
* xfs: update realtime super every time we update the primary fs super
* xfs: export realtime group geometry via XFS_FSOP_GEOM
* xfs: check that rtblock extents do not break rtsupers or rtgroups
* xfs: add a helper to prevent bmap merges across rtgroup boundaries
* xfs: add frextents to the lazysbcounters when rtgroups enabled
* xfs: convert sick_map loops to use ARRAY_SIZE
* xfs: record rt group metadata errors in the health system
* xfs: export the geometry of realtime groups to userspace
* xfs: add block headers to realtime bitmap and summary blocks
* xfs: encode the rtbitmap in big endian format
* xfs: encode the rtsummary in big endian format
* xfs: grow the realtime section when realtime groups are enabled
* xfs: store rtgroup information with a bmap intent
* xfs: force swapext to a realtime file to use the file content exchange ioctl
* xfs: support logging EFIs for realtime extents
* xfs: support error injection when freeing rt extents
* xfs: use realtime EFI to free extents when rtgroups are enabled
* xfs: don't merge ioends across RTGs
* xfs: make the RT allocator rtgroup aware
* xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub
* xfs: scrub the realtime group superblock
* xfs: repair realtime group superblock
* xfs: scrub metadir paths for rtgroup metadata
* xfs: mask off the rtbitmap and summary inodes when metadir in use
---
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_alloc.c | 15 +
fs/xfs/libxfs/xfs_alloc.h | 17 +
fs/xfs/libxfs/xfs_bmap.c | 86 ++++++-
fs/xfs/libxfs/xfs_bmap.h | 5
fs/xfs/libxfs/xfs_defer.c | 6 +
fs/xfs/libxfs/xfs_defer.h | 1
fs/xfs/libxfs/xfs_format.h | 76 ++++++
fs/xfs/libxfs/xfs_fs.h | 29 ++
fs/xfs/libxfs/xfs_health.h | 61 +++--
fs/xfs/libxfs/xfs_log_format.h | 6 -
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/libxfs/xfs_ondisk.h | 4
fs/xfs/libxfs/xfs_rtbitmap.c | 211 +++++++++++++++---
fs/xfs/libxfs/xfs_rtbitmap.h | 64 +++++
fs/xfs/libxfs/xfs_rtgroup.c | 195 ++++++++++++++++-
fs/xfs/libxfs/xfs_rtgroup.h | 20 ++
fs/xfs/libxfs/xfs_sb.c | 165 +++++++++++++-
fs/xfs/libxfs/xfs_sb.h | 2
fs/xfs/libxfs/xfs_shared.h | 4
fs/xfs/libxfs/xfs_types.c | 38 +++
fs/xfs/scrub/bmap.c | 16 +
fs/xfs/scrub/common.h | 2
fs/xfs/scrub/fscounters_repair.c | 9 -
fs/xfs/scrub/health.c | 34 ++-
fs/xfs/scrub/metapath.c | 92 ++++++++
fs/xfs/scrub/repair.h | 3
fs/xfs/scrub/rgsuper.c | 89 ++++++++
fs/xfs/scrub/rtsummary.c | 5
fs/xfs/scrub/rtsummary_repair.c | 15 +
fs/xfs/scrub/scrub.c | 7 +
fs/xfs/scrub/scrub.h | 2
fs/xfs/scrub/stats.c | 1
fs/xfs/scrub/trace.h | 4
fs/xfs/xfs_bmap_item.c | 18 +-
fs/xfs/xfs_bmap_util.c | 12 +
fs/xfs/xfs_buf_item_recover.c | 43 +++-
fs/xfs/xfs_discard.c | 2
fs/xfs/xfs_extfree_item.c | 281 ++++++++++++++++++++++--
fs/xfs/xfs_health.c | 205 +++++++++++------
fs/xfs/xfs_ioctl.c | 37 +++
fs/xfs/xfs_iomap.c | 14 +
fs/xfs/xfs_log_recover.c | 2
fs/xfs/xfs_mount.h | 11 +
fs/xfs/xfs_rtalloc.c | 446 ++++++++++++++++++++++++++++++++++----
fs/xfs/xfs_rtalloc.h | 6 +
fs/xfs/xfs_super.c | 12 +
fs/xfs/xfs_trace.h | 30 ++-
fs/xfs/xfs_trans.c | 27 ++
fs/xfs/xfs_trans.h | 2
fs/xfs/xfs_trans_buf.c | 25 ++
51 files changed, 2146 insertions(+), 314 deletions(-)
create mode 100644 fs/xfs/scrub/rgsuper.c
* [PATCHSET v4.0 10/10] xfs: store quota files in the metadir
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
` (8 preceding siblings ...)
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
@ 2024-08-22 23:58 ` Darrick J. Wong
2024-08-23 0:28 ` [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos Darrick J. Wong
` (5 more replies)
9 siblings, 6 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:58 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
Hi all,
Store the quota files in the metadata directory tree instead of the superblock.
Since we're introducing a new incompat feature flag, let's also make the mount
process bring up quotas in whatever state they were in when the filesystem was
last unmounted, instead of requiring sysadmins to remember that themselves.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=metadir-quotas
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=metadir-quotas
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=metadir-quotas
---
Commits in this patchset:
* xfs: refactor xfs_qm_destroy_quotainos
* xfs: use metadir for quota inodes
* xfs: scrub quota file metapaths
* xfs: persist quota flags with metadir
* xfs: update sb field checks when metadir is turned on
* xfs: enable metadata directory feature
---
fs/xfs/libxfs/xfs_dquot_buf.c | 190 ++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_format.h | 3
fs/xfs/libxfs/xfs_fs.h | 6 +
fs/xfs/libxfs/xfs_quota_defs.h | 43 +++++++
fs/xfs/libxfs/xfs_sb.c | 1
fs/xfs/scrub/agheader.c | 36 ++++--
fs/xfs/scrub/metapath.c | 76 ++++++++++++
fs/xfs/xfs_mount.c | 15 ++
fs/xfs/xfs_mount.h | 6 +
fs/xfs/xfs_qm.c | 250 +++++++++++++++++++++++++++++++---------
fs/xfs/xfs_qm_bhv.c | 18 +++
fs/xfs/xfs_quota.h | 2
fs/xfs/xfs_super.c | 22 ++++
13 files changed, 597 insertions(+), 71 deletions(-)
* [PATCH 1/9] xfs: fix di_onlink checking for V1/V2 inodes
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
@ 2024-08-22 23:59 ` Darrick J. Wong
2024-08-22 23:59 ` [PATCH 2/9] xfs: fix folio dirtying for XFILE_ALLOC callers Darrick J. Wong
` (8 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:59 UTC (permalink / raw)
To: djwong; +Cc: kjell.m.randa, Christoph Hellwig, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
"KjellR" complained on IRC that an old V4 filesystem suddenly stopped
mounting after upgrading from 6.9.11 to 6.10.3, with the following splat
when trying to read the rt bitmap inode:
00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00 IN..............
00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30 ........C...!..0
00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00 C...!..0........
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 ................
00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
As Dave Chinner points out, this is a V1 inode with both di_onlink and
di_nlink set to 1 and di_flushiter == 0. In other words, this inode was
formatted this way by mkfs and hasn't been touched since then.
Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc
would set di_nlink, but if the filesystem didn't have NLINK, it would
then set di_version = 1. libxfs_iflush_int would later see the V1 inode and
copy the value of di_nlink to di_onlink without zeroing di_nlink.
Eventually this filesystem must have been upgraded to support NLINK
because 6.10 doesn't support !NLINK filesystems, which is how we tripped
over this old behavior. The filesystem doesn't have a realtime section,
so that's why the rtbitmap inode has never been touched.
Fix this by removing the di_nlink check for V1 inodes, because this area is a
muddy mess. V2 and V3 inodes have always been written out with
di_onlink==0, so keep that check. The removal of the V1 inode handling code
when we dropped support for !NLINK filesystems obscured this old behavior.
Reported-by: kjell.m.randa@gmail.com
Fixes: 40cb8613d612 ("xfs: check unused nlink fields in the ondisk inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_inode_buf.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 513b50da6215f..79babeac9d754 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -514,12 +514,18 @@ xfs_dinode_verify(
return __this_address;
}
- if (dip->di_version > 1) {
+ /*
+ * Historical note: xfsprogs in the 3.2 era set up its incore inodes to
+ * have di_nlink track the link count, even if the actual filesystem
+ * only supported V1 inodes (i.e. di_onlink). When writing out the
+ * ondisk inode, it would set both the ondisk di_nlink and di_onlink to
+ * the incore di_nlink value, which is why we cannot check for
+ * di_nlink==0 on a V1 inode. V2/3 inodes would get written out with
+ * di_onlink==0, so we can check that.
+ */
+ if (dip->di_version >= 2) {
if (dip->di_onlink)
return __this_address;
- } else {
- if (dip->di_nlink)
- return __this_address;
}
/* don't allow invalid i_size */
* [PATCH 2/9] xfs: fix folio dirtying for XFILE_ALLOC callers
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
2024-08-22 23:59 ` [PATCH 1/9] xfs: fix di_onlink checking for V1/V2 inodes Darrick J. Wong
@ 2024-08-22 23:59 ` Darrick J. Wong
2024-08-22 23:59 ` [PATCH 3/9] xfs: xfs_finobt_count_blocks() walks the wrong btree Darrick J. Wong
` (7 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:59 UTC (permalink / raw)
To: djwong; +Cc: willy, Christoph Hellwig, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
willy pointed out that folio_mark_dirty is the correct function to use
to mark an xfile folio dirty because it calls out to the mapping's aops
to mark it dirty. For tmpfs this likely doesn't matter much since it
currently uses noop_dirty_folio, but let's use the abstractions properly.
Reported-by: willy@infradead.org
Fixes: 6907e3c00a40 ("xfs: add file_{get,put}_folio")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/scrub/xfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c
index d848222f802ba..9b5d98fe1f8ab 100644
--- a/fs/xfs/scrub/xfile.c
+++ b/fs/xfs/scrub/xfile.c
@@ -293,7 +293,7 @@ xfile_get_folio(
* (potentially last) reference in xfile_put_folio.
*/
if (flags & XFILE_ALLOC)
- folio_set_dirty(folio);
+ folio_mark_dirty(folio);
return folio;
}
* [PATCH 3/9] xfs: xfs_finobt_count_blocks() walks the wrong btree
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
2024-08-22 23:59 ` [PATCH 1/9] xfs: fix di_onlink checking for V1/V2 inodes Darrick J. Wong
2024-08-22 23:59 ` [PATCH 2/9] xfs: fix folio dirtying for XFILE_ALLOC callers Darrick J. Wong
@ 2024-08-22 23:59 ` Darrick J. Wong
2024-08-22 23:59 ` [PATCH 4/9] xfs: don't bother reporting blocks trimmed via FITRIM Darrick J. Wong
` (6 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:59 UTC (permalink / raw)
To: djwong; +Cc: Anders Blomdell, Dave Chinner, Christoph Hellwig, hch, linux-xfs
From: Dave Chinner <dchinner@redhat.com>
As a result of the factoring in commit 14dd46cf31f4 ("xfs: split
xfs_inobt_init_cursor"), mount started taking a long time on a
user's filesystem. For Anders, this made mount times regress from
under a second to over 15 minutes for a filesystem with only 30
million inodes in it.
Anders bisected it down to the above commit, but even then the bug
was not obvious. In this commit, over 20 calls to
xfs_inobt_init_cursor() were modified, and some were changed to call
a new function named xfs_finobt_init_cursor().
If that takes you a moment to reread those function names to see
what the rename was, then you have realised why this bug wasn't
spotted during review. And it wasn't spotted on inspection even
after the bisect pointed at this commit - a single missing "f" isn't
the easiest thing for a human eye to notice....
The result is that xfs_finobt_count_blocks() now incorrectly calls
xfs_inobt_init_cursor() so it is now walking the inobt instead of
the finobt. Hence when there are lots of allocated inodes in a
filesystem, mount takes a -long- time to run because it now walks the
massive allocated inode btrees instead of the small, nearly empty
free inode btrees. It also means all the finobt space reservations
are wrong, so mount could potentially give ENOSPC on kernel
upgrade.
In hindsight, commit 14dd46cf31f4 should have been two commits - the
first to convert the finobt callers to the new API, the second to
modify the xfs_inobt_init_cursor() API for the inobt callers. That
would have made the bug very obvious during review.
Fixes: 14dd46cf31f4 ("xfs: split xfs_inobt_init_cursor")
Reported-by: Anders Blomdell <anders.blomdell@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ialloc_btree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 496e2f72a85b9..797d5b5f7b725 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -749,7 +749,7 @@ xfs_finobt_count_blocks(
if (error)
return error;
- cur = xfs_inobt_init_cursor(pag, tp, agbp);
+ cur = xfs_finobt_init_cursor(pag, tp, agbp);
error = xfs_btree_count_blocks(cur, tree_blocks);
xfs_btree_del_cursor(cur, error);
xfs_trans_brelse(tp, agbp);
* [PATCH 4/9] xfs: don't bother reporting blocks trimmed via FITRIM
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (2 preceding siblings ...)
2024-08-22 23:59 ` [PATCH 3/9] xfs: xfs_finobt_count_blocks() walks the wrong btree Darrick J. Wong
@ 2024-08-22 23:59 ` Darrick J. Wong
2024-08-23 0:00 ` [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap Darrick J. Wong
` (5 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:59 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, Christoph Hellwig, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Don't bother reporting the number of bytes that we "trimmed" because the
underlying storage isn't required to do anything(!) and failed discard
IOs aren't reported to the caller anyway. It's not like userspace can
use the reported value for anything useful like adjusting the offset
parameter of the next call, and it's not like anyone ever wrote a
manpage about FITRIM's out parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_discard.c | 36 +++++++++++-------------------------
1 file changed, 11 insertions(+), 25 deletions(-)
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 6f0fc7fe1f2ba..25f5dffeab2ae 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -158,8 +158,7 @@ static int
xfs_trim_gather_extents(
struct xfs_perag *pag,
struct xfs_trim_cur *tcur,
- struct xfs_busy_extents *extents,
- uint64_t *blocks_trimmed)
+ struct xfs_busy_extents *extents)
{
struct xfs_mount *mp = pag->pag_mount;
struct xfs_trans *tp;
@@ -280,7 +279,6 @@ xfs_trim_gather_extents(
xfs_extent_busy_insert_discard(pag, fbno, flen,
&extents->extent_list);
- *blocks_trimmed += flen;
next_extent:
if (tcur->by_bno)
error = xfs_btree_increment(cur, 0, &i);
@@ -327,8 +325,7 @@ xfs_trim_perag_extents(
struct xfs_perag *pag,
xfs_agblock_t start,
xfs_agblock_t end,
- xfs_extlen_t minlen,
- uint64_t *blocks_trimmed)
+ xfs_extlen_t minlen)
{
struct xfs_trim_cur tcur = {
.start = start,
@@ -354,8 +351,7 @@ xfs_trim_perag_extents(
extents->owner = extents;
INIT_LIST_HEAD(&extents->extent_list);
- error = xfs_trim_gather_extents(pag, &tcur, extents,
- blocks_trimmed);
+ error = xfs_trim_gather_extents(pag, &tcur, extents);
if (error) {
kfree(extents);
break;
@@ -389,8 +385,7 @@ xfs_trim_datadev_extents(
struct xfs_mount *mp,
xfs_daddr_t start,
xfs_daddr_t end,
- xfs_extlen_t minlen,
- uint64_t *blocks_trimmed)
+ xfs_extlen_t minlen)
{
xfs_agnumber_t start_agno, end_agno;
xfs_agblock_t start_agbno, end_agbno;
@@ -411,8 +406,7 @@ xfs_trim_datadev_extents(
if (start_agno == end_agno)
agend = end_agbno;
- error = xfs_trim_perag_extents(pag, start_agbno, agend, minlen,
- blocks_trimmed);
+ error = xfs_trim_perag_extents(pag, start_agbno, agend, minlen);
if (error)
last_error = error;
@@ -431,9 +425,6 @@ struct xfs_trim_rtdev {
/* list of rt extents to free */
struct list_head extent_list;
- /* pointer to count of blocks trimmed */
- uint64_t *blocks_trimmed;
-
/* minimum length that caller allows us to trim */
xfs_rtblock_t minlen_fsb;
@@ -551,7 +542,6 @@ xfs_trim_gather_rtextent(
busyp->length = rlen;
INIT_LIST_HEAD(&busyp->list);
list_add_tail(&busyp->list, &tr->extent_list);
- *tr->blocks_trimmed += rlen;
tr->restart_rtx = rec->ar_startext + rec->ar_extcount;
return 0;
@@ -562,13 +552,11 @@ xfs_trim_rtdev_extents(
struct xfs_mount *mp,
xfs_daddr_t start,
xfs_daddr_t end,
- xfs_daddr_t minlen,
- uint64_t *blocks_trimmed)
+ xfs_daddr_t minlen)
{
struct xfs_rtalloc_rec low = { };
struct xfs_rtalloc_rec high = { };
struct xfs_trim_rtdev tr = {
- .blocks_trimmed = blocks_trimmed,
.minlen_fsb = XFS_BB_TO_FSB(mp, minlen),
};
struct xfs_trans *tp;
@@ -634,7 +622,7 @@ xfs_trim_rtdev_extents(
return error;
}
#else
-# define xfs_trim_rtdev_extents(m,s,e,n,b) (-EOPNOTSUPP)
+# define xfs_trim_rtdev_extents(...) (-EOPNOTSUPP)
#endif /* CONFIG_XFS_RT */
/*
@@ -661,7 +649,6 @@ xfs_ioc_trim(
xfs_daddr_t start, end;
xfs_extlen_t minlen;
xfs_rfsblock_t max_blocks;
- uint64_t blocks_trimmed = 0;
int error, last_error = 0;
if (!capable(CAP_SYS_ADMIN))
@@ -706,15 +693,13 @@ xfs_ioc_trim(
end = start + BTOBBT(range.len) - 1;
if (bdev_max_discard_sectors(mp->m_ddev_targp->bt_bdev)) {
- error = xfs_trim_datadev_extents(mp, start, end, minlen,
- &blocks_trimmed);
+ error = xfs_trim_datadev_extents(mp, start, end, minlen);
if (error)
last_error = error;
}
if (rt_bdev && !xfs_trim_should_stop()) {
- error = xfs_trim_rtdev_extents(mp, start, end, minlen,
- &blocks_trimmed);
+ error = xfs_trim_rtdev_extents(mp, start, end, minlen);
if (error)
last_error = error;
}
@@ -722,7 +707,8 @@ xfs_ioc_trim(
if (last_error)
return last_error;
- range.len = XFS_FSB_TO_B(mp, blocks_trimmed);
+ range.len = min_t(unsigned long long, range.len,
+ XFS_FSB_TO_B(mp, max_blocks));
if (copy_to_user(urange, &range, sizeof(range)))
return -EFAULT;
return 0;
* [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (3 preceding siblings ...)
2024-08-22 23:59 ` [PATCH 4/9] xfs: don't bother reporting blocks trimmed via FITRIM Darrick J. Wong
@ 2024-08-23 0:00 ` Darrick J. Wong
2024-08-23 4:10 ` Christoph Hellwig
2024-08-23 0:00 ` [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Darrick J. Wong
` (4 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:00 UTC (permalink / raw)
To: djwong; +Cc: Zizhi Wo, hch, linux-xfs
From: Zizhi Wo <wozizhi@huawei.com>
I noticed an rmap query bug in xfs_io fsmap:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16
2: 253:16 [24..39]: inode btree 0 (24..39) 16
3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8
4: 253:16 [48..55]: refcount btree 0 (48..55) 8
5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48
6: 253:16 [104..127]: free space 0 (104..127) 24
......
Bug:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
[root@fedora ~]#
Normally, we should be able to get one record, but we got nothing.
The root cause of this problem lies in the incorrect setting of rm_owner in
the rmap query. In the initial query, where the owner is not set,
__xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX.
This is done to prevent any omissions when comparing rmap items. However,
if the current AG is detected to be the last one, the function sets info's
high_irec based on the provided key. If high->rm_owner is not specified, it
should remain ULLONG_MAX; otherwise intervals can be omitted. For example,
consider "start" and "end" within the same block: if high->rm_owner == 0,
the high key compares smaller than the record found in the rmapbt, so the
query returns no records. The main call stack
is as follows:
xfs_ioc_getfsmap
xfs_getfsmap
xfs_getfsmap_datadev_rmapbt
__xfs_getfsmap_datadev
info->high.rm_owner = ULLONG_MAX
if (pag->pag_agno == end_ag)
xfs_fsmap_owner_to_rmap
// set info->high.rm_owner = 0 because fmr_owner == -1ULL
dest->rm_owner = 0
// get nothing
xfs_getfsmap_datadev_rmapbt_query
The problem can be resolved by a simple change to the internal logic of
the xfs_fsmap_owner_to_rmap function.
After applying this patch, the above problem has been solved:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_fsmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 85dbb46452ca0..3a30b36779db5 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -71,7 +71,7 @@ xfs_fsmap_owner_to_rmap(
switch (src->fmr_owner) {
case 0: /* "lowest owner id possible" */
case -1ULL: /* "highest owner id possible" */
- dest->rm_owner = 0;
+ dest->rm_owner = src->fmr_owner;
break;
case XFS_FMR_OWN_FREE:
dest->rm_owner = XFS_RMAP_OWN_NULL;
* [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:00 ` [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap Darrick J. Wong
@ 2024-08-23 0:00 ` Darrick J. Wong
2024-08-23 4:10 ` Christoph Hellwig
2024-08-23 0:00 ` [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:00 UTC (permalink / raw)
To: djwong; +Cc: wozizhi, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this
field is null" like the rest of xfs.
Cc: wozizhi@huawei.com
Fixes: e89c041338ed6 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_fsmap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 3a30b36779db5..613a0ec204120 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -252,7 +252,7 @@ xfs_getfsmap_rec_before_start(
const struct xfs_rmap_irec *rec,
xfs_daddr_t rec_daddr)
{
- if (info->low_daddr != -1ULL)
+ if (info->low_daddr != XFS_BUF_DADDR_NULL)
return rec_daddr < info->low_daddr;
if (info->low.rm_blockcount)
return xfs_rmap_compare(rec, &info->low) < 0;
@@ -983,7 +983,7 @@ xfs_getfsmap(
info.dev = handlers[i].dev;
info.last = false;
info.pag = NULL;
- info.low_daddr = -1ULL;
+ info.low_daddr = XFS_BUF_DADDR_NULL;
info.low.rm_blockcount = 0;
error = handlers[i].fn(tp, dkeys, &info);
if (error)
* [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (5 preceding siblings ...)
2024-08-23 0:00 ` [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Darrick J. Wong
@ 2024-08-23 0:00 ` Darrick J. Wong
2024-08-26 3:58 ` Zizhi Wo
2024-08-23 0:00 ` [PATCH 8/9] xfs: take m_growlock when running growfsrt Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:00 UTC (permalink / raw)
To: djwong; +Cc: Zizhi Wo, hch, linux-xfs
From: Zizhi Wo <wozizhi@huawei.com>
In the xfs fsmap query code, there is a missing-interval problem:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16
2: 253:16 [24..39]: inode btree 0 (24..39) 16
3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8
4: 253:16 [48..55]: refcount btree 0 (48..55) 8
5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48
6: 253:16 [104..127]: free space 0 (104..127) 24
......
BUG:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
[root@fedora ~]#
Normally, we should be able to get [104, 107), but we got nothing.
The problem is caused by shifting. The scenario that triggers it is a query
for a missing_owner interval (e.g. free space in the rmapbt, or unknown
space in the bnobt), which is synthesized by subtraction (the gap) during
the info->last pass. However, rec_daddr is calculated from the start_block
recorded in key[1], which is converted by calling XFS_BB_TO_FSBT. If
rec_daddr does not exceed info->next_daddr -- that is, if
keys[1].fmr_physical >> (mp)->m_blkbb_log <= info->next_daddr -- no records
will be emitted. In the above example (with a 4096-byte block size),
104 >> (mp)->m_blkbb_log = 13 and 107 >> (mp)->m_blkbb_log = 13, so the
gap between the two is reduced to 0 and ignored:
before conversion ---------------> after shifting
104(st)   107(ed)                  13(st/ed)
|---------|                        |
sector granularity                 block granularity
Resolve this issue by introducing an "end_daddr" field in
xfs_getfsmap_info, which records key[1].fmr_physical + key[1].length at
sector granularity. If the current query is the last one, rec_daddr is set
to end_daddr to prevent the missing-interval problem caused by shifting.
Only the last query needs this treatment, because xfs allocations are
internally aligned to the filesystem block size, which is a power of two
and at least 512 bytes, so shifting loses nothing in the earlier queries.
After applying this patch, the above problem has been solved:
[root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
0: 253:16 [104..106]: free space 0 (104..106) 3
Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: limit the range of end_addr correctly]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_fsmap.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 613a0ec204120..71f32354944e4 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -162,6 +162,7 @@ struct xfs_getfsmap_info {
xfs_daddr_t next_daddr; /* next daddr we expect */
/* daddr of low fsmap key when we're using the rtbitmap */
xfs_daddr_t low_daddr;
+ xfs_daddr_t end_daddr; /* daddr of high fsmap key */
u64 missing_owner; /* owner of holes */
u32 dev; /* device id */
/*
@@ -182,6 +183,7 @@ struct xfs_getfsmap_dev {
int (*fn)(struct xfs_trans *tp,
const struct xfs_fsmap *keys,
struct xfs_getfsmap_info *info);
+ sector_t nr_sectors;
};
/* Compare two getfsmap device handlers. */
@@ -294,6 +296,18 @@ xfs_getfsmap_helper(
return 0;
}
+ /*
+ * For an info->last query, we're looking for a gap between the last
+ * mapping emitted and the high key specified by userspace. If the
+ * user's query spans less than 1 fsblock, then info->high and
+ * info->low will have the same rm_startblock, which causes rec_daddr
+ * and next_daddr to be the same. Therefore, use the end_daddr that
+ * we calculated from userspace's high key to synthesize the record.
+ * Note that if the btree query found a mapping, there won't be a gap.
+ */
+ if (info->last && info->end_daddr != XFS_BUF_DADDR_NULL)
+ rec_daddr = info->end_daddr;
+
/* Are we just counting mappings? */
if (info->head->fmh_count == 0) {
if (info->head->fmh_entries == UINT_MAX)
@@ -904,17 +918,21 @@ xfs_getfsmap(
/* Set up our device handlers. */
memset(handlers, 0, sizeof(handlers));
+ handlers[0].nr_sectors = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
handlers[0].dev = new_encode_dev(mp->m_ddev_targp->bt_dev);
if (use_rmap)
handlers[0].fn = xfs_getfsmap_datadev_rmapbt;
else
handlers[0].fn = xfs_getfsmap_datadev_bnobt;
if (mp->m_logdev_targp != mp->m_ddev_targp) {
+ handlers[1].nr_sectors = XFS_FSB_TO_BB(mp,
+ mp->m_sb.sb_logblocks);
handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev);
handlers[1].fn = xfs_getfsmap_logdev;
}
#ifdef CONFIG_XFS_RT
if (mp->m_rtdev_targp) {
+ handlers[2].nr_sectors = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
handlers[2].dev = new_encode_dev(mp->m_rtdev_targp->bt_dev);
handlers[2].fn = xfs_getfsmap_rtdev_rtbitmap;
}
@@ -946,6 +964,7 @@ xfs_getfsmap(
info.next_daddr = head->fmh_keys[0].fmr_physical +
head->fmh_keys[0].fmr_length;
+ info.end_daddr = XFS_BUF_DADDR_NULL;
info.fsmap_recs = fsmap_recs;
info.head = head;
@@ -966,8 +985,11 @@ xfs_getfsmap(
* low key, zero out the low key so that we get
* everything from the beginning.
*/
- if (handlers[i].dev == head->fmh_keys[1].fmr_device)
+ if (handlers[i].dev == head->fmh_keys[1].fmr_device) {
dkeys[1] = head->fmh_keys[1];
+ info.end_daddr = min(handlers[i].nr_sectors - 1,
+ dkeys[1].fmr_physical);
+ }
if (handlers[i].dev > head->fmh_keys[0].fmr_device)
memset(&dkeys[0], 0, sizeof(struct xfs_fsmap));
* [PATCH 8/9] xfs: take m_growlock when running growfsrt
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (6 preceding siblings ...)
2024-08-23 0:00 ` [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap Darrick J. Wong
@ 2024-08-23 0:00 ` Darrick J. Wong
2024-08-23 4:08 ` Christoph Hellwig
2024-08-23 0:01 ` [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt Darrick J. Wong
2024-08-23 4:09 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Christoph Hellwig
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:00 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Take the grow lock when we're expanding the realtime volume, like we do
for the other growfs calls.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 0c3e96c621a67..776d6c401f62f 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -821,34 +821,39 @@ xfs_growfs_rt(
/* Needs to have been mounted with an rt device. */
if (!XFS_IS_REALTIME_MOUNT(mp))
return -EINVAL;
+
+ if (!mutex_trylock(&mp->m_growlock))
+ return -EWOULDBLOCK;
/*
* Mount should fail if the rt bitmap/summary files don't load, but
* we'll check anyway.
*/
+ error = -EINVAL;
if (!mp->m_rbmip || !mp->m_rsumip)
- return -EINVAL;
+ goto out_unlock;
/* Shrink not supported. */
if (in->newblocks <= sbp->sb_rblocks)
- return -EINVAL;
+ goto out_unlock;
/* Can only change rt extent size when adding rt volume. */
if (sbp->sb_rblocks > 0 && in->extsize != sbp->sb_rextsize)
- return -EINVAL;
+ goto out_unlock;
/* Range check the extent size. */
if (XFS_FSB_TO_B(mp, in->extsize) > XFS_MAX_RTEXTSIZE ||
XFS_FSB_TO_B(mp, in->extsize) < XFS_MIN_RTEXTSIZE)
- return -EINVAL;
+ goto out_unlock;
/* Unsupported realtime features. */
+ error = -EOPNOTSUPP;
if (xfs_has_rmapbt(mp) || xfs_has_reflink(mp) || xfs_has_quota(mp))
- return -EOPNOTSUPP;
+ goto out_unlock;
nrblocks = in->newblocks;
error = xfs_sb_validate_fsb_count(sbp, nrblocks);
if (error)
- return error;
+ goto out_unlock;
/*
* Read in the last block of the device, make sure it exists.
*/
@@ -856,7 +861,7 @@ xfs_growfs_rt(
XFS_FSB_TO_BB(mp, nrblocks - 1),
XFS_FSB_TO_BB(mp, 1), 0, &bp, NULL);
if (error)
- return error;
+ goto out_unlock;
xfs_buf_relse(bp);
/*
@@ -864,8 +869,10 @@ xfs_growfs_rt(
*/
nrextents = nrblocks;
do_div(nrextents, in->extsize);
- if (!xfs_validate_rtextents(nrextents))
- return -EINVAL;
+ if (!xfs_validate_rtextents(nrextents)) {
+ error = -EINVAL;
+ goto out_unlock;
+ }
nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
nrextslog = xfs_compute_rextslog(nrextents);
nrsumlevels = nrextslog + 1;
@@ -876,8 +883,11 @@ xfs_growfs_rt(
* the log. This prevents us from getting a log overflow,
* since we'll log basically the whole summary file at once.
*/
- if (nrsumblocks > (mp->m_sb.sb_logblocks >> 1))
- return -EINVAL;
+ if (nrsumblocks > (mp->m_sb.sb_logblocks >> 1)) {
+ error = -EINVAL;
+ goto out_unlock;
+ }
+
/*
* Get the old block counts for bitmap and summary inodes.
* These can't change since other growfs callers are locked out.
@@ -889,10 +899,10 @@ xfs_growfs_rt(
*/
error = xfs_growfs_rt_alloc(mp, rbmblocks, nrbmblocks, mp->m_rbmip);
if (error)
- return error;
+ goto out_unlock;
error = xfs_growfs_rt_alloc(mp, rsumblocks, nrsumblocks, mp->m_rsumip);
if (error)
- return error;
+ goto out_unlock;
rsum_cache = mp->m_rsum_cache;
if (nrbmblocks != sbp->sb_rbmblocks)
@@ -1059,6 +1069,8 @@ xfs_growfs_rt(
}
}
+out_unlock:
+ mutex_unlock(&mp->m_growlock);
return error;
}
* [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (7 preceding siblings ...)
2024-08-23 0:00 ` [PATCH 8/9] xfs: take m_growlock when running growfsrt Darrick J. Wong
@ 2024-08-23 0:01 ` Darrick J. Wong
2024-08-23 4:09 ` Christoph Hellwig
2024-08-23 4:09 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Christoph Hellwig
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:01 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
If growfsrt is run on a filesystem that doesn't have a rt volume, it's
possible to change the rt extent size. If the root directory was
previously set up with an inherited extent size hint and rtinherit, it's
possible that the hint is no longer a multiple of the rt extent size.
Although the verifiers don't complain about this, xfs_repair will, so if
we detect this situation, log the root directory to clean it up. This
is still racy, but it's better than nothing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 776d6c401f62f..ebeab8e4dab10 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -784,6 +784,39 @@ xfs_alloc_rsum_cache(
xfs_warn(mp, "could not allocate realtime summary cache");
}
+/*
+ * If we changed the rt extent size (meaning there was no rt volume previously)
+ * and the root directory had EXTSZINHERIT and RTINHERIT set, it's possible
+ * that the extent size hint on the root directory is no longer congruent with
+ * the new rt extent size. Log the rootdir inode to fix this.
+ */
+static int
+xfs_growfs_rt_fixup_extsize(
+ struct xfs_mount *mp)
+{
+ struct xfs_inode *ip = mp->m_rootip;
+ struct xfs_trans *tp;
+ int error = 0;
+
+ xfs_ilock(ip, XFS_IOLOCK_EXCL);
+ if (!(ip->i_diflags & XFS_DIFLAG_RTINHERIT) ||
+ !(ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT))
+ goto out_iolock;
+
+ error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_ichange, 0, 0, false,
+ &tp);
+ if (error)
+ goto out_iolock;
+
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ error = xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+out_iolock:
+ xfs_iunlock(ip, XFS_IOLOCK_EXCL);
+ return error;
+}
+
/*
* Visible (exported) functions.
*/
@@ -812,6 +845,7 @@ xfs_growfs_rt(
xfs_extlen_t rsumblocks; /* current number of rt summary blks */
xfs_sb_t *sbp; /* old superblock */
uint8_t *rsum_cache; /* old summary cache */
+ xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
sbp = &mp->m_sb;
@@ -1046,6 +1080,12 @@ xfs_growfs_rt(
if (error)
goto out_free;
+ if (old_rextsize != in->extsize) {
+ error = xfs_growfs_rt_fixup_extsize(mp);
+ if (error)
+ goto out_free;
+ }
+
/* Update secondary superblocks now the physical grow has completed */
error = xfs_update_secondary_sbs(mp);
* [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-23 0:01 ` Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
0 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:01 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call. The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+ __s32 file1_fd;
+ __u32 pad; /* must be zeroes */
+ __u64 file1_offset; /* file1 offset, bytes */
+ __u64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
+
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[6];
+};
+
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
return error;
}
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+ const struct xfs_exchrange *fxr,
+ struct xfs_inode *ip2)
+{
+ struct inode *inode2 = VFS_I(ip2);
+ struct timespec64 ctime = inode_get_ctime(inode2);
+ struct timespec64 mtime = inode_get_mtime(inode2);
+
+ trace_xfs_exchrange_freshness(fxr, ip2);
+
+ /* Check that file2 hasn't otherwise been modified. */
+ if (fxr->file2_ino != ip2->i_ino ||
+ fxr->file2_gen != inode2->i_generation ||
+ !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+ !timespec64_equal(&fxr->file2_mtime, &mtime))
+ return -EBUSY;
+
+ return 0;
+}
+
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
if (error || fxr->length == 0)
return error;
+ if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+ error = xfs_exchrange_check_freshness(fxr, ip2);
+ if (error)
+ return error;
+ }
+
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
- if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
return -EINVAL;
/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
fdput(file1);
return error;
}
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ xfs_fsid_t fsid; /* m_fixedfsid */
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u32 file2_gen; /* inode generation */
+ __u32 magic; /* zero */
+};
+#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_commit_range args = { };
+ struct timespec64 ts;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+ struct inode *inode2 = file_inode(file);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ const unsigned int lockflags = XFS_IOLOCK_SHARED |
+ XFS_MMAPLOCK_SHARED |
+ XFS_ILOCK_SHARED;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+ xfs_ilock(ip2, lockflags);
+ ts = inode_get_ctime(inode2);
+ kern_f->file2_ctime = ts.tv_sec;
+ kern_f->file2_ctime_nsec = ts.tv_nsec;
+ ts = inode_get_mtime(inode2);
+ kern_f->file2_mtime = ts.tv_sec;
+ kern_f->file2_mtime_nsec = ts.tv_nsec;
+ kern_f->file2_ino = ip2->i_ino;
+ kern_f->file2_gen = inode2->i_generation;
+ kern_f->magic = XCR_FRESH_MAGIC;
+ xfs_iunlock(ip2, lockflags);
+
+ user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_inode *ip2 = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip2->i_mount;
+ struct fd file1;
+ int error;
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ return -EINVAL;
+ if (kern_f->magic != XCR_FRESH_MAGIC)
+ return -EBUSY;
+ if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+ return -EBUSY;
+
+ fxr.file1_offset = args.file1_offset;
+ fxr.file2_offset = args.file2_offset;
+ fxr.length = args.length;
+ fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_gen = kern_f->file2_gen;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+ fxr.file1 = file1.file;
+
+ error = xfs_exchange_range(&fxr);
+ fdput(file1);
+ return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
+
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
- __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+ __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
struct xfs_exchrange {
struct file *file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
+
+ /* file2 metadata for freshness checks */
+ u64 file2_ino;
+ struct timespec64 file2_mtime;
+ struct timespec64 file2_ctime;
+ u32 file2_gen;
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+ struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+ struct xfs_commit_range __user *argp);
struct xfs_exchmaps_req;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
+ case XFS_IOC_COMMIT_RANGE:
+ return xfs_ioc_commit_range(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
- { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
+ { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
+ { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+TRACE_EVENT(xfs_exchrange_freshness,
+ TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+ TP_ARGS(fxr, ip2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ struct timespec64 ts64;
+ struct inode *inode2 = VFS_I(ip2);
+
+ __entry->dev = inode2->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+
+ ts64 = inode_get_ctime(inode2);
+ __entry->ip2_ctime = ts64.tv_sec;
+ __entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+ ts64 = inode_get_mtime(inode2);
+ __entry->ip2_mtime = ts64.tv_sec;
+ __entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime.tv_sec;
+ __entry->file2_ctime = fxr->file2_ctime.tv_sec;
+ __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
* [PATCH 1/3] xfs: validate inumber in xfs_iget
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
@ 2024-08-23 0:01 ` Darrick J. Wong
2024-08-23 0:01 ` [PATCH 2/3] xfs: match on the global RT inode numbers in xfs_is_metadata_inode Darrick J. Wong
2024-08-23 0:02 ` [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc Darrick J. Wong
2 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:01 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, Dave Chinner, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Actually use the inumber validator to check the argument passed in here.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_icache.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e7..887d2a01161e4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -755,7 +755,7 @@ xfs_iget(
ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);
/* reject inode numbers outside existing AGs */
- if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount)
+ if (!xfs_verify_ino(mp, ino))
return -EINVAL;
XFS_STATS_INC(mp, xs_ig_attempts);
* [PATCH 2/3] xfs: match on the global RT inode numbers in xfs_is_metadata_inode
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/3] xfs: validate inumber in xfs_iget Darrick J. Wong
@ 2024-08-23 0:01 ` Darrick J. Wong
2024-08-23 0:02 ` [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc Darrick J. Wong
2 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:01 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Match the inode numbers instead of the inode pointers, as the inode
pointers in the superblock will go away soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: port to my tree, make the parameter a const pointer]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_inode.h | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 51defdebef30e..1908409968dba 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -276,12 +276,13 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
}
-static inline bool xfs_is_metadata_inode(struct xfs_inode *ip)
+static inline bool xfs_is_metadata_inode(const struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;
- return ip == mp->m_rbmip || ip == mp->m_rsumip ||
- xfs_is_quota_inode(&mp->m_sb, ip->i_ino);
+ return ip->i_ino == mp->m_sb.sb_rbmino ||
+ ip->i_ino == mp->m_sb.sb_rsumino ||
+ xfs_is_quota_inode(&mp->m_sb, ip->i_ino);
}
bool xfs_is_always_cow_inode(struct xfs_inode *ip);
* [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/3] xfs: validate inumber in xfs_iget Darrick J. Wong
2024-08-23 0:01 ` [PATCH 2/3] xfs: match on the global RT inode numbers in xfs_is_metadata_inode Darrick J. Wong
@ 2024-08-23 0:02 ` Darrick J. Wong
2024-08-23 4:13 ` Christoph Hellwig
2 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:02 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Pass the xfs_icreate_args object to xfs_dialloc since we can extract the
relevant mode (really just the file type) and parent inumber from there.
This simplifies the calling convention in preparation for the next
patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ialloc.c | 5 +++--
fs/xfs/libxfs/xfs_ialloc.h | 4 +++-
fs/xfs/scrub/tempfile.c | 2 +-
fs/xfs/xfs_inode.c | 4 ++--
fs/xfs/xfs_qm.c | 2 +-
fs/xfs/xfs_symlink.c | 2 +-
6 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0af5b7a33d055..fc70601e8d8ee 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1855,11 +1855,12 @@ xfs_dialloc_try_ag(
int
xfs_dialloc(
struct xfs_trans **tpp,
- xfs_ino_t parent,
- umode_t mode,
+ const struct xfs_icreate_args *args,
xfs_ino_t *new_ino)
{
struct xfs_mount *mp = (*tpp)->t_mountp;
+ xfs_ino_t parent = args->pip ? args->pip->i_ino : 0;
+ umode_t mode = args->mode & S_IFMT;
xfs_agnumber_t agno;
int error = 0;
xfs_agnumber_t start_agno;
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index b549627e3a615..3a1323155a455 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -33,11 +33,13 @@ xfs_make_iptr(struct xfs_mount *mp, struct xfs_buf *b, int o)
return xfs_buf_offset(b, o << (mp)->m_sb.sb_inodelog);
}
+struct xfs_icreate_args;
+
/*
* Allocate an inode on disk. Mode is used to tell whether the new inode will
* need space, and whether it is a directory.
*/
-int xfs_dialloc(struct xfs_trans **tpp, xfs_ino_t parent, umode_t mode,
+int xfs_dialloc(struct xfs_trans **tpp, const struct xfs_icreate_args *args,
xfs_ino_t *new_ino);
int xfs_difree(struct xfs_trans *tp, struct xfs_perag *pag,
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index d390d56cd8751..177f922acfaf1 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -88,7 +88,7 @@ xrep_tempfile_create(
goto out_release_dquots;
/* Allocate inode, set up directory. */
- error = xfs_dialloc(&tp, dp->i_ino, mode, &ino);
+ error = xfs_dialloc(&tp, &args, &ino);
if (error)
goto out_trans_cancel;
error = xfs_icreate(tp, ino, &args, &sc->tempip);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7dc6f326936ca..9ea7a18f5da14 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -704,7 +704,7 @@ xfs_create(
* entry pointing to them, but a directory also the "." entry
* pointing to itself.
*/
- error = xfs_dialloc(&tp, dp->i_ino, args->mode, &ino);
+ error = xfs_dialloc(&tp, args, &ino);
if (!error)
error = xfs_icreate(tp, ino, args, &du.ip);
if (error)
@@ -812,7 +812,7 @@ xfs_create_tmpfile(
if (error)
goto out_release_dquots;
- error = xfs_dialloc(&tp, dp->i_ino, args->mode, &ino);
+ error = xfs_dialloc(&tp, args, &ino);
if (!error)
error = xfs_icreate(tp, ino, args, &ip);
if (error)
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 9490b913a4ab4..63f6ca2db2515 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -799,7 +799,7 @@ xfs_qm_qino_alloc(
};
xfs_ino_t ino;
- error = xfs_dialloc(&tp, 0, S_IFREG, &ino);
+ error = xfs_dialloc(&tp, &args, &ino);
if (!error)
error = xfs_icreate(tp, ino, &args, ipp);
if (error) {
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 77f19e2f66e07..4252b07cd2513 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -165,7 +165,7 @@ xfs_symlink(
/*
* Allocate an inode for the symlink.
*/
- error = xfs_dialloc(&tp, dp->i_ino, S_IFLNK, &ino);
+ error = xfs_dialloc(&tp, &args, &ino);
if (!error)
error = xfs_icreate(tp, ino, &args, &du.ip);
if (error)
* [PATCH 01/26] xfs: define the on-disk format for the metadir feature
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
@ 2024-08-23 0:02 ` Darrick J. Wong
2024-08-23 4:30 ` Christoph Hellwig
2024-08-23 0:02 ` [PATCH 02/26] xfs: refactor loading quota inodes in the regular case Darrick J. Wong
` (24 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:02 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Define the on-disk layout and feature flags for the metadata inode
directory feature. Add an xfs_sb_version_hasmetadir helper for the
benefit of xfs_repair, which needs to know where the new end of the
superblock lies.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 81 +++++++++++++++++++++++++++++++++++----
fs/xfs/libxfs/xfs_inode_buf.c | 13 +++++-
fs/xfs/libxfs/xfs_inode_util.c | 2 +
fs/xfs/libxfs/xfs_log_format.h | 2 -
fs/xfs/libxfs/xfs_ondisk.h | 2 -
fs/xfs/libxfs/xfs_sb.c | 10 +++++
fs/xfs/scrub/inode.c | 3 +
fs/xfs/scrub/inode_repair.c | 9 +++-
fs/xfs/xfs_inode.h | 14 +++++++
fs/xfs/xfs_inode_item.c | 7 +++
fs/xfs/xfs_inode_item_recover.c | 5 ++
fs/xfs/xfs_mount.h | 2 +
fs/xfs/xfs_super.c | 4 ++
13 files changed, 136 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index e1bfee0c3b1a8..16a7bc02aa5f5 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -174,6 +174,8 @@ typedef struct xfs_sb {
xfs_lsn_t sb_lsn; /* last write sequence */
uuid_t sb_meta_uuid; /* metadata file system unique id */
+ xfs_ino_t sb_metadirino; /* metadata directory tree root */
+
/* must be padded to 64 bit alignment */
} xfs_sb_t;
@@ -259,6 +261,8 @@ struct xfs_dsb {
__be64 sb_lsn; /* last write sequence */
uuid_t sb_meta_uuid; /* metadata file system unique id */
+ __be64 sb_metadirino; /* metadata directory tree root */
+
/* must be padded to 64 bit alignment */
};
@@ -374,6 +378,7 @@ xfs_sb_has_ro_compat_feature(
#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */
#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE (1 << 6) /* exchangerange supported */
#define XFS_SB_FEAT_INCOMPAT_PARENT (1 << 7) /* parent pointers */
+#define XFS_SB_FEAT_INCOMPAT_METADIR (1 << 8) /* metadata dir tree */
#define XFS_SB_FEAT_INCOMPAT_ALL \
(XFS_SB_FEAT_INCOMPAT_FTYPE | \
XFS_SB_FEAT_INCOMPAT_SPINODES | \
@@ -426,6 +431,12 @@ static inline bool xfs_sb_version_haslogxattrs(struct xfs_sb *sbp)
XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
}
+static inline bool xfs_sb_version_hasmetadir(const struct xfs_sb *sbp)
+{
+ return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
+ (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR);
+}
+
static inline bool
xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
{
@@ -790,6 +801,27 @@ static inline time64_t xfs_bigtime_to_unix(uint64_t ondisk_seconds)
return (time64_t)ondisk_seconds - XFS_BIGTIME_EPOCH_OFFSET;
}
+enum xfs_metafile_type {
+ XFS_METAFILE_UNKNOWN, /* unknown */
+ XFS_METAFILE_DIR, /* metadir directory */
+ XFS_METAFILE_USRQUOTA, /* user quota */
+ XFS_METAFILE_GRPQUOTA, /* group quota */
+ XFS_METAFILE_PRJQUOTA, /* project quota */
+ XFS_METAFILE_RTBITMAP, /* rt bitmap */
+ XFS_METAFILE_RTSUMMARY, /* rt summary */
+
+ XFS_METAFILE_MAX
+} __packed;
+
+#define XFS_METAFILE_TYPE_STR \
+ { XFS_METAFILE_UNKNOWN, "unknown" }, \
+ { XFS_METAFILE_DIR, "dir" }, \
+ { XFS_METAFILE_USRQUOTA, "usrquota" }, \
+ { XFS_METAFILE_GRPQUOTA, "grpquota" }, \
+ { XFS_METAFILE_PRJQUOTA, "prjquota" }, \
+ { XFS_METAFILE_RTBITMAP, "rtbitmap" }, \
+ { XFS_METAFILE_RTSUMMARY, "rtsummary" }
+
/*
* On-disk inode structure.
*
@@ -812,7 +844,10 @@ struct xfs_dinode {
__be16 di_mode; /* mode and type of file */
__u8 di_version; /* inode version */
__u8 di_format; /* format of di_c data */
- __be16 di_onlink; /* old number of links to file */
+ union {
+ __be16 di_onlink; /* old number of links to file */
+ __be16 di_metatype; /* XFS_METAFILE_* */
+ } __packed; /* explicit packing because arm gcc bloats this up */
__be32 di_uid; /* owner's user id */
__be32 di_gid; /* owner's group id */
__be32 di_nlink; /* number of links to file */
@@ -1092,17 +1127,47 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
#define XFS_DIFLAG2_REFLINK_BIT 1 /* file's blocks may be shared */
#define XFS_DIFLAG2_COWEXTSIZE_BIT 2 /* copy on write extent size hint */
#define XFS_DIFLAG2_BIGTIME_BIT 3 /* big timestamps */
-#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+#define XFS_DIFLAG2_NREXT64_BIT 4 /* large extent counters */
+#define XFS_DIFLAG2_METADATA_BIT 63 /* filesystem metadata */
-#define XFS_DIFLAG2_DAX (1 << XFS_DIFLAG2_DAX_BIT)
-#define XFS_DIFLAG2_REFLINK (1 << XFS_DIFLAG2_REFLINK_BIT)
-#define XFS_DIFLAG2_COWEXTSIZE (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
-#define XFS_DIFLAG2_BIGTIME (1 << XFS_DIFLAG2_BIGTIME_BIT)
-#define XFS_DIFLAG2_NREXT64 (1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_DAX (1ULL << XFS_DIFLAG2_DAX_BIT)
+#define XFS_DIFLAG2_REFLINK (1ULL << XFS_DIFLAG2_REFLINK_BIT)
+#define XFS_DIFLAG2_COWEXTSIZE (1ULL << XFS_DIFLAG2_COWEXTSIZE_BIT)
+#define XFS_DIFLAG2_BIGTIME (1ULL << XFS_DIFLAG2_BIGTIME_BIT)
+#define XFS_DIFLAG2_NREXT64 (1ULL << XFS_DIFLAG2_NREXT64_BIT)
+
+/*
+ * The inode contains filesystem metadata and can be found through the metadata
+ * directory tree. Metadata inodes must satisfy the following constraints:
+ *
+ * - V5 filesystem (and ftype) are enabled;
+ * - The only valid modes are regular files and directories;
+ * - The access bits must be zero;
+ * - DMAPI event and state masks are zero;
+ * - The user and group IDs must be zero;
+ * - The project ID can be used as a u32 annotation;
+ * - The immutable, sync, noatime, nodump, nodefrag flags must be set.
+ * - The dax flag must not be set.
+ * - Directories must have nosymlinks set.
+ *
+ * These requirements are chosen defensively to minimize the ability of
+ * userspace to read or modify the contents, should a metadata file ever
+ * escape to userspace.
+ *
+ * There are further constraints on the directory tree itself:
+ *
+ * - Metadata inodes must never be resolvable through the root directory;
+ * - They must never be accessed by userspace;
+ * - Metadata directory entries must have correct ftype.
+ *
+ * Superblock-rooted metadata files must have the METADATA iflag set even
+ * though they do not have a parent directory.
+ */
+#define XFS_DIFLAG2_METADATA (1ULL << XFS_DIFLAG2_METADATA_BIT)
#define XFS_DIFLAG2_ANY \
(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
- XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+ XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_METADATA)
static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
{
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 79babeac9d754..cdd6ed4279649 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -215,6 +215,8 @@ xfs_inode_from_disk(
set_nlink(inode, be32_to_cpu(from->di_nlink));
ip->i_projid = (prid_t)be16_to_cpu(from->di_projid_hi) << 16 |
be16_to_cpu(from->di_projid_lo);
+ if (from->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA))
+ ip->i_metatype = be16_to_cpu(from->di_metatype);
}
i_uid_write(inode, be32_to_cpu(from->di_uid));
@@ -315,7 +317,10 @@ xfs_inode_to_disk(
struct inode *inode = VFS_I(ip);
to->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
- to->di_onlink = 0;
+ if (xfs_is_metadir_inode(ip))
+ to->di_metatype = cpu_to_be16(ip->i_metatype);
+ else
+ to->di_onlink = 0;
to->di_format = xfs_ifork_format(&ip->i_df);
to->di_uid = cpu_to_be32(i_uid_read(inode));
@@ -523,9 +528,13 @@ xfs_dinode_verify(
* di_nlink==0 on a V1 inode. V2/3 inodes would get written out with
* di_onlink==0, so we can check that.
*/
- if (dip->di_version >= 2) {
+ if (dip->di_version == 2) {
if (dip->di_onlink)
return __this_address;
+ } else if (dip->di_version >= 3) {
+ if (!(dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)) &&
+ dip->di_onlink)
+ return __this_address;
}
/* don't allow invalid i_size */
diff --git a/fs/xfs/libxfs/xfs_inode_util.c b/fs/xfs/libxfs/xfs_inode_util.c
index 032333289113b..34c1b998b6c9a 100644
--- a/fs/xfs/libxfs/xfs_inode_util.c
+++ b/fs/xfs/libxfs/xfs_inode_util.c
@@ -224,6 +224,8 @@ xfs_inode_inherit_flags2(
}
if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+ if (pip->i_diflags2 & XFS_DIFLAG2_METADATA)
+ ip->i_diflags2 |= XFS_DIFLAG2_METADATA;
/* Don't let invalid cowextsize hints propagate. */
failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 3e6682ed656b3..ace7384a275bf 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -404,7 +404,7 @@ struct xfs_log_dinode {
uint16_t di_mode; /* mode and type of file */
int8_t di_version; /* inode version */
int8_t di_format; /* format of di_c data */
- uint8_t di_pad3[2]; /* unused in v2/3 inodes */
+ uint16_t di_metatype; /* metadata type, if DIFLAG2_METADATA */
uint32_t di_uid; /* owner's user id */
uint32_t di_gid; /* owner's group id */
uint32_t di_nlink; /* number of links to file */
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index 23c133fd36f5b..8bca86e350fdc 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -37,7 +37,7 @@ xfs_check_ondisk_structs(void)
XFS_CHECK_STRUCT_SIZE(struct xfs_dinode, 176);
XFS_CHECK_STRUCT_SIZE(struct xfs_disk_dquot, 104);
XFS_CHECK_STRUCT_SIZE(struct xfs_dqblk, 136);
- XFS_CHECK_STRUCT_SIZE(struct xfs_dsb, 264);
+ XFS_CHECK_STRUCT_SIZE(struct xfs_dsb, 272);
XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr, 56);
XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key, 4);
XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec, 16);
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 6b56f0f6d4c1a..7afde477c0a79 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -180,6 +180,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_EXCHANGE_RANGE;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_PARENT)
features |= XFS_FEAT_PARENT;
+ if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
+ features |= XFS_FEAT_METADIR;
return features;
}
@@ -683,6 +685,11 @@ __xfs_sb_from_disk(
/* Convert on-disk flags to in-memory flags? */
if (convert_xquota)
xfs_sb_quota_from_disk(to);
+
+ if (to->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
+ to->sb_metadirino = be64_to_cpu(from->sb_metadirino);
+ else
+ to->sb_metadirino = NULLFSINO;
}
void
@@ -830,6 +837,9 @@ xfs_sb_to_disk(
to->sb_lsn = cpu_to_be64(from->sb_lsn);
if (from->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID)
uuid_copy(&to->sb_meta_uuid, &from->sb_meta_uuid);
+
+ if (from->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
+ to->sb_metadirino = cpu_to_be64(from->sb_metadirino);
}
/*
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index d32716fb2fecf..ec2c694c4083f 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -421,7 +421,8 @@ xchk_dinode(
break;
case 2:
case 3:
- if (dip->di_onlink != 0)
+ if (!(dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)) &&
+ dip->di_onlink != 0)
xchk_ino_set_corrupt(sc, ino);
if (dip->di_mode == 0 && sc->ip)
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index daf9f1ee7c2cb..344fdffb19aba 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -521,10 +521,13 @@ STATIC void
xrep_dinode_nlinks(
struct xfs_dinode *dip)
{
- if (dip->di_version > 1)
- dip->di_onlink = 0;
- else
+ if (dip->di_version < 2) {
dip->di_nlink = 0;
+ return;
+ }
+
+ if (!(dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)))
+ dip->di_onlink = 0;
}
/* Fix any conflicting flags that the verifiers complain about. */
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1908409968dba..54d995740b328 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -65,6 +65,7 @@ typedef struct xfs_inode {
uint16_t i_flushiter; /* incremented on flush */
};
uint8_t i_forkoff; /* attr fork offset >> 3 */
+ enum xfs_metafile_type i_metatype; /* XFS_METAFILE_* */
uint16_t i_diflags; /* XFS_DIFLAG_... */
uint64_t i_diflags2; /* XFS_DIFLAG2_... */
struct timespec64 i_crtime; /* time created */
@@ -276,10 +277,23 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
return ip->i_diflags2 & XFS_DIFLAG2_REFLINK;
}
+static inline bool xfs_is_metadir_inode(const struct xfs_inode *ip)
+{
+ return ip->i_diflags2 & XFS_DIFLAG2_METADATA;
+}
+
static inline bool xfs_is_metadata_inode(const struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;
+ /* Any file in the metadata directory tree is a metadata inode. */
+ if (xfs_has_metadir(mp))
+ return xfs_is_metadir_inode(ip);
+
+ /*
+ * Before metadata directories, the only metadata inodes were the
+ * three quota files, the realtime bitmap, and the realtime summary.
+ */
return ip->i_ino == mp->m_sb.sb_rbmino ||
ip->i_ino == mp->m_sb.sb_rsumino ||
xfs_is_quota_inode(&mp->m_sb, ip->i_ino);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b509cbd191f4e..912f0b1bc3cb7 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -556,7 +556,6 @@ xfs_inode_to_log_dinode(
to->di_projid_lo = ip->i_projid & 0xffff;
to->di_projid_hi = ip->i_projid >> 16;
- memset(to->di_pad3, 0, sizeof(to->di_pad3));
to->di_atime = xfs_inode_to_log_dinode_ts(ip, inode_get_atime(inode));
to->di_mtime = xfs_inode_to_log_dinode_ts(ip, inode_get_mtime(inode));
to->di_ctime = xfs_inode_to_log_dinode_ts(ip, inode_get_ctime(inode));
@@ -590,10 +589,16 @@ xfs_inode_to_log_dinode(
/* dummy value for initialisation */
to->di_crc = 0;
+
+ if (xfs_is_metadir_inode(ip))
+ to->di_metatype = ip->i_metatype;
+ else
+ to->di_metatype = 0;
} else {
to->di_version = 2;
to->di_flushiter = ip->i_flushiter;
memset(to->di_v2_pad, 0, sizeof(to->di_v2_pad));
+ to->di_metatype = 0;
}
xfs_inode_to_log_dinode_iext_counters(ip, to);
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index dbdab4ce7c44c..4034933386807 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -175,7 +175,10 @@ xfs_log_dinode_to_disk(
to->di_mode = cpu_to_be16(from->di_mode);
to->di_version = from->di_version;
to->di_format = from->di_format;
- to->di_onlink = 0;
+ if (from->di_flags2 & XFS_DIFLAG2_METADATA)
+ to->di_metatype = cpu_to_be16(from->di_metatype);
+ else
+ to->di_onlink = 0;
to->di_uid = cpu_to_be32(from->di_uid);
to->di_gid = cpu_to_be32(from->di_gid);
to->di_nlink = cpu_to_be32(from->di_nlink);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d0567dfbc0368..d404ce122f238 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -299,6 +299,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
+#define XFS_FEAT_METADIR (1ULL << 28) /* metadata directory tree */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -354,6 +355,7 @@ __XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
+__XFS_HAS_FEAT(metadir, METADIR)
/*
* Some features are always on for v5 file systems, allow the compiler to
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7f..34066b50585e8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1721,6 +1721,10 @@ xfs_fs_fill_super(
mp->m_features &= ~XFS_FEAT_DISCARD;
}
+ if (xfs_has_metadir(mp))
+ xfs_warn(mp,
+"EXPERIMENTAL metadata directory feature in use. Use at your own risk!");
+
if (xfs_has_reflink(mp)) {
if (mp->m_sb.sb_rblocks) {
xfs_alert(mp,
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 02/26] xfs: refactor loading quota inodes in the regular case
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
2024-08-23 0:02 ` [PATCH 01/26] xfs: define the on-disk format for the metadir feature Darrick J. Wong
@ 2024-08-23 0:02 ` Darrick J. Wong
2024-08-23 4:31 ` Christoph Hellwig
2024-08-23 0:02 ` [PATCH 03/26] xfs: iget for metadata inodes Darrick J. Wong
` (23 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:02 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a helper function to load quota inodes when the dqtype
corresponds directly to one of the superblock quota inode fields. This
is true for nearly all the iget call sites in the quota code, except
when we're switching the group and project quota inodes. We'll need
this in subsequent patches to make the metadir handling less convoluted.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
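[Not part of the patch: a rough userspace sketch of the lookup rule this helper centralizes. The type and struct names below are simplified stand-ins, not the kernel's, and -EINVAL stands in for the kernel's -EFSCORRUPTED.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for the kernel's NULLFSINO sentinel ("no inode here"). */
#define NULLFSINO ((uint64_t)-1)

enum dqtype { DQTYPE_USER, DQTYPE_GROUP, DQTYPE_PROJ };

/* Simplified superblock holding only the three quota inode fields. */
struct fake_sb {
	uint64_t sb_uquotino;
	uint64_t sb_gquotino;
	uint64_t sb_pquotino;
};

/*
 * Mirror of the lookup half of xfs_qm_qino_load(): map the quota type
 * to the matching superblock field, and distinguish "no quota file"
 * (-ENOENT) from an unknown type (-EINVAL here).
 */
static int qino_lookup(const struct fake_sb *sbp, enum dqtype type,
		       uint64_t *inop)
{
	uint64_t ino;

	switch (type) {
	case DQTYPE_USER:
		ino = sbp->sb_uquotino;
		break;
	case DQTYPE_GROUP:
		ino = sbp->sb_gquotino;
		break;
	case DQTYPE_PROJ:
		ino = sbp->sb_pquotino;
		break;
	default:
		return -EINVAL;
	}

	if (ino == NULLFSINO)
		return -ENOENT;
	*inop = ino;
	return 0;
}
```

The point of the helper is exactly this mapping: every caller that previously open-coded "pick the sb field, check for NULLFSINO, then iget" now goes through one function.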
fs/xfs/xfs_qm.c | 46 +++++++++++++++++++++++++++++++++++-----
fs/xfs/xfs_qm.h | 3 +++
fs/xfs/xfs_qm_syscalls.c | 13 +++++------
fs/xfs/xfs_quotaops.c | 53 +++++++++++++++++++++++++++-------------------
4 files changed, 80 insertions(+), 35 deletions(-)
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 63f6ca2db2515..7e2307921deb2 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1538,6 +1538,43 @@ xfs_qm_mount_quotas(
}
}
+/*
+ * Load the inode for a given type of quota, assuming that the sb fields have
+ * been sorted out. This is not true when switching quota types on a V4
+ * filesystem, so do not use this function for that.
+ *
+ * Returns -ENOENT if the quota inode field is NULLFSINO; 0 and an inode on
+ * success; or a negative errno.
+ */
+int
+xfs_qm_qino_load(
+ struct xfs_mount *mp,
+ xfs_dqtype_t type,
+ struct xfs_inode **ipp)
+{
+ xfs_ino_t ino = NULLFSINO;
+
+ switch (type) {
+ case XFS_DQTYPE_USER:
+ ino = mp->m_sb.sb_uquotino;
+ break;
+ case XFS_DQTYPE_GROUP:
+ ino = mp->m_sb.sb_gquotino;
+ break;
+ case XFS_DQTYPE_PROJ:
+ ino = mp->m_sb.sb_pquotino;
+ break;
+ default:
+ ASSERT(0);
+ return -EFSCORRUPTED;
+ }
+
+ if (ino == NULLFSINO)
+ return -ENOENT;
+
+ return xfs_iget(mp, NULL, ino, 0, 0, ipp);
+}
+
/*
* This is called after the superblock has been read in and we're ready to
* iget the quota inodes.
@@ -1561,24 +1598,21 @@ xfs_qm_init_quotainos(
if (XFS_IS_UQUOTA_ON(mp) &&
mp->m_sb.sb_uquotino != NULLFSINO) {
ASSERT(mp->m_sb.sb_uquotino > 0);
- error = xfs_iget(mp, NULL, mp->m_sb.sb_uquotino,
- 0, 0, &uip);
+ error = xfs_qm_qino_load(mp, XFS_DQTYPE_USER, &uip);
if (error)
return error;
}
if (XFS_IS_GQUOTA_ON(mp) &&
mp->m_sb.sb_gquotino != NULLFSINO) {
ASSERT(mp->m_sb.sb_gquotino > 0);
- error = xfs_iget(mp, NULL, mp->m_sb.sb_gquotino,
- 0, 0, &gip);
+ error = xfs_qm_qino_load(mp, XFS_DQTYPE_GROUP, &gip);
if (error)
goto error_rele;
}
if (XFS_IS_PQUOTA_ON(mp) &&
mp->m_sb.sb_pquotino != NULLFSINO) {
ASSERT(mp->m_sb.sb_pquotino > 0);
- error = xfs_iget(mp, NULL, mp->m_sb.sb_pquotino,
- 0, 0, &pip);
+ error = xfs_qm_qino_load(mp, XFS_DQTYPE_PROJ, &pip);
if (error)
goto error_rele;
}
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 6e09dfcd13e25..e919c7f62f578 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -184,4 +184,7 @@ xfs_get_defquota(struct xfs_quotainfo *qi, xfs_dqtype_t type)
}
}
+int xfs_qm_qino_load(struct xfs_mount *mp, xfs_dqtype_t type,
+ struct xfs_inode **ipp);
+
#endif /* __XFS_QM_H__ */
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 392cb39cc10c8..4eda50ae2d1cb 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -53,16 +53,15 @@ xfs_qm_scall_quotaoff(
STATIC int
xfs_qm_scall_trunc_qfile(
struct xfs_mount *mp,
- xfs_ino_t ino)
+ xfs_dqtype_t type)
{
struct xfs_inode *ip;
struct xfs_trans *tp;
int error;
- if (ino == NULLFSINO)
+ error = xfs_qm_qino_load(mp, type, &ip);
+ if (error == -ENOENT)
return 0;
-
- error = xfs_iget(mp, NULL, ino, 0, 0, &ip);
if (error)
return error;
@@ -113,17 +112,17 @@ xfs_qm_scall_trunc_qfiles(
}
if (flags & XFS_QMOPT_UQUOTA) {
- error = xfs_qm_scall_trunc_qfile(mp, mp->m_sb.sb_uquotino);
+ error = xfs_qm_scall_trunc_qfile(mp, XFS_DQTYPE_USER);
if (error)
return error;
}
if (flags & XFS_QMOPT_GQUOTA) {
- error = xfs_qm_scall_trunc_qfile(mp, mp->m_sb.sb_gquotino);
+ error = xfs_qm_scall_trunc_qfile(mp, XFS_DQTYPE_GROUP);
if (error)
return error;
}
if (flags & XFS_QMOPT_PQUOTA)
- error = xfs_qm_scall_trunc_qfile(mp, mp->m_sb.sb_pquotino);
+ error = xfs_qm_scall_trunc_qfile(mp, XFS_DQTYPE_PROJ);
return error;
}
diff --git a/fs/xfs/xfs_quotaops.c b/fs/xfs/xfs_quotaops.c
index 9c162e69976be..4c7f7ce4fd2f4 100644
--- a/fs/xfs/xfs_quotaops.c
+++ b/fs/xfs/xfs_quotaops.c
@@ -16,24 +16,25 @@
#include "xfs_qm.h"
-static void
+static int
xfs_qm_fill_state(
struct qc_type_state *tstate,
struct xfs_mount *mp,
- struct xfs_inode *ip,
- xfs_ino_t ino,
- struct xfs_def_quota *defq)
+ xfs_dqtype_t type)
{
- bool tempqip = false;
+ struct xfs_inode *ip;
+ struct xfs_def_quota *defq;
+ int error;
- tstate->ino = ino;
- if (!ip && ino == NULLFSINO)
- return;
- if (!ip) {
- if (xfs_iget(mp, NULL, ino, 0, 0, &ip))
- return;
- tempqip = true;
+ error = xfs_qm_qino_load(mp, type, &ip);
+ if (error) {
+ tstate->ino = NULLFSINO;
+ return error != -ENOENT ? error : 0;
}
+
+ defq = xfs_get_defquota(mp->m_quotainfo, type);
+
+ tstate->ino = ip->i_ino;
tstate->flags |= QCI_SYSFILE;
tstate->blocks = ip->i_nblocks;
tstate->nextents = ip->i_df.if_nextents;
@@ -43,8 +44,9 @@ xfs_qm_fill_state(
tstate->spc_warnlimit = 0;
tstate->ino_warnlimit = 0;
tstate->rt_spc_warnlimit = 0;
- if (tempqip)
- xfs_irele(ip);
+ xfs_irele(ip);
+
+ return 0;
}
/*
@@ -56,8 +58,9 @@ xfs_fs_get_quota_state(
struct super_block *sb,
struct qc_state *state)
{
- struct xfs_mount *mp = XFS_M(sb);
- struct xfs_quotainfo *q = mp->m_quotainfo;
+ struct xfs_mount *mp = XFS_M(sb);
+ struct xfs_quotainfo *q = mp->m_quotainfo;
+ int error;
memset(state, 0, sizeof(*state));
if (!XFS_IS_QUOTA_ON(mp))
@@ -76,12 +79,18 @@ xfs_fs_get_quota_state(
if (XFS_IS_PQUOTA_ENFORCED(mp))
state->s_state[PRJQUOTA].flags |= QCI_LIMITS_ENFORCED;
- xfs_qm_fill_state(&state->s_state[USRQUOTA], mp, q->qi_uquotaip,
- mp->m_sb.sb_uquotino, &q->qi_usr_default);
- xfs_qm_fill_state(&state->s_state[GRPQUOTA], mp, q->qi_gquotaip,
- mp->m_sb.sb_gquotino, &q->qi_grp_default);
- xfs_qm_fill_state(&state->s_state[PRJQUOTA], mp, q->qi_pquotaip,
- mp->m_sb.sb_pquotino, &q->qi_prj_default);
+ error = xfs_qm_fill_state(&state->s_state[USRQUOTA], mp,
+ XFS_DQTYPE_USER);
+ if (error)
+ return error;
+ error = xfs_qm_fill_state(&state->s_state[GRPQUOTA], mp,
+ XFS_DQTYPE_GROUP);
+ if (error)
+ return error;
+ error = xfs_qm_fill_state(&state->s_state[PRJQUOTA], mp,
+ XFS_DQTYPE_PROJ);
+ if (error)
+ return error;
return 0;
}
* [PATCH 03/26] xfs: iget for metadata inodes
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
2024-08-23 0:02 ` [PATCH 01/26] xfs: define the on-disk format for the metadir feature Darrick J. Wong
2024-08-23 0:02 ` [PATCH 02/26] xfs: refactor loading quota inodes in the regular case Darrick J. Wong
@ 2024-08-23 0:02 ` Darrick J. Wong
2024-08-23 4:35 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 04/26] xfs: load metadata directory root at mount time Darrick J. Wong
` (22 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:02 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create an xfs_metafile_iget function for metadata inodes so that when
we iget a metadata file, we verify that the inobt thinks the inode is
in use and that its metadata type matches what we expect.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_metafile.h | 16 ++++++++++
fs/xfs/xfs_icache.c | 65 ++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.c | 1 +
fs/xfs/xfs_qm.c | 23 ++++++++++++++-
fs/xfs/xfs_rtalloc.c | 38 ++++++++++++++-----------
5 files changed, 125 insertions(+), 18 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_metafile.h
diff --git a/fs/xfs/libxfs/xfs_metafile.h b/fs/xfs/libxfs/xfs_metafile.h
new file mode 100644
index 0000000000000..60fe189061127
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_metafile.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2018-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_METAFILE_H__
+#define __XFS_METAFILE_H__
+
+/* Code specific to kernel/userspace; must be provided externally. */
+
+int xfs_trans_metafile_iget(struct xfs_trans *tp, xfs_ino_t ino,
+ enum xfs_metafile_type metafile_type, struct xfs_inode **ipp);
+int xfs_metafile_iget(struct xfs_mount *mp, xfs_ino_t ino,
+ enum xfs_metafile_type metafile_type, struct xfs_inode **ipp);
+
+#endif /* __XFS_METAFILE_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 887d2a01161e4..a3d4334d4151b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -25,6 +25,9 @@
#include "xfs_ag.h"
#include "xfs_log_priv.h"
#include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_dir2.h"
+#include "xfs_metafile.h"
#include <linux/iversion.h>
@@ -809,6 +812,68 @@ xfs_iget(
return error;
}
+/*
+ * Get a metadata inode. The metafile @type must match the inode exactly.
+ * Caller must supply a transaction (even if empty) to avoid livelocking if the
+ * inobt has a cycle.
+ */
+int
+xfs_trans_metafile_iget(
+ struct xfs_trans *tp,
+ xfs_ino_t ino,
+ enum xfs_metafile_type metafile_type,
+ struct xfs_inode **ipp)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_inode *ip;
+ umode_t mode;
+ int error;
+
+ error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &ip);
+ if (error == -EFSCORRUPTED)
+ goto whine;
+ if (error)
+ return error;
+
+ if (VFS_I(ip)->i_nlink == 0)
+ goto bad_rele;
+
+ if (metafile_type == XFS_METAFILE_DIR)
+ mode = S_IFDIR;
+ else
+ mode = S_IFREG;
+ if (inode_wrong_type(VFS_I(ip), mode))
+ goto bad_rele;
+
+ *ipp = ip;
+ return 0;
+bad_rele:
+ xfs_irele(ip);
+whine:
+ xfs_err(mp, "metadata inode 0x%llx is corrupt", ino);
+ return -EFSCORRUPTED;
+}
+
+/* Grab a metadata file if the caller doesn't already have a transaction. */
+int
+xfs_metafile_iget(
+ struct xfs_mount *mp,
+ xfs_ino_t ino,
+ enum xfs_metafile_type metafile_type,
+ struct xfs_inode **ipp)
+{
+ struct xfs_trans *tp;
+ int error;
+
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_trans_metafile_iget(tp, ino, metafile_type, ipp);
+ xfs_trans_cancel(tp);
+ return error;
+}
+
/*
* Grab the inode for reclaim exclusively.
*
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9ea7a18f5da14..e1c65507479cd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -43,6 +43,7 @@
#include "xfs_parent.h"
#include "xfs_xattr.h"
#include "xfs_inode_util.h"
+#include "xfs_metafile.h"
struct kmem_cache *xfs_inode_cache;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 7e2307921deb2..d0674d84af3ec 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -27,6 +27,8 @@
#include "xfs_ialloc.h"
#include "xfs_log_priv.h"
#include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_metafile.h"
/*
* The global quota manager. There is only one of these for the entire
@@ -733,6 +735,17 @@ xfs_qm_destroy_quotainfo(
mp->m_quotainfo = NULL;
}
+static inline enum xfs_metafile_type
+xfs_qm_metafile_type(
+ unsigned int flags)
+{
+ if (flags & XFS_QMOPT_UQUOTA)
+ return XFS_METAFILE_USRQUOTA;
+ else if (flags & XFS_QMOPT_GQUOTA)
+ return XFS_METAFILE_GRPQUOTA;
+ return XFS_METAFILE_PRJQUOTA;
+}
+
/*
* Create an inode and return with a reference already taken, but unlocked
* This is how we create quota inodes
@@ -744,6 +757,7 @@ xfs_qm_qino_alloc(
unsigned int flags)
{
struct xfs_trans *tp;
+ enum xfs_metafile_type metafile_type = xfs_qm_metafile_type(flags);
int error;
bool need_alloc = true;
@@ -777,9 +791,10 @@ xfs_qm_qino_alloc(
}
}
if (ino != NULLFSINO) {
- error = xfs_iget(mp, NULL, ino, 0, 0, ipp);
+ error = xfs_metafile_iget(mp, ino, metafile_type, ipp);
if (error)
return error;
+
mp->m_sb.sb_gquotino = NULLFSINO;
mp->m_sb.sb_pquotino = NULLFSINO;
need_alloc = false;
@@ -1553,16 +1568,20 @@ xfs_qm_qino_load(
struct xfs_inode **ipp)
{
xfs_ino_t ino = NULLFSINO;
+ enum xfs_metafile_type metafile_type = XFS_METAFILE_UNKNOWN;
switch (type) {
case XFS_DQTYPE_USER:
ino = mp->m_sb.sb_uquotino;
+ metafile_type = XFS_METAFILE_USRQUOTA;
break;
case XFS_DQTYPE_GROUP:
ino = mp->m_sb.sb_gquotino;
+ metafile_type = XFS_METAFILE_GRPQUOTA;
break;
case XFS_DQTYPE_PROJ:
ino = mp->m_sb.sb_pquotino;
+ metafile_type = XFS_METAFILE_PRJQUOTA;
break;
default:
ASSERT(0);
@@ -1572,7 +1591,7 @@ xfs_qm_qino_load(
if (ino == NULLFSINO)
return -ENOENT;
- return xfs_iget(mp, NULL, ino, 0, 0, ipp);
+ return xfs_metafile_iget(mp, ino, metafile_type, ipp);
}
/*
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index ebeab8e4dab10..b4c3c5a3171bf 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -25,6 +25,8 @@
#include "xfs_quota.h"
#include "xfs_log_priv.h"
#include "xfs_health.h"
+#include "xfs_da_format.h"
+#include "xfs_metafile.h"
/*
* Return whether there are any free extents in the size range given
@@ -1206,16 +1208,12 @@ xfs_rtalloc_reinit_frextents(
*/
static inline int
xfs_rtmount_iread_extents(
+ struct xfs_trans *tp,
struct xfs_inode *ip,
unsigned int lock_class)
{
- struct xfs_trans *tp;
int error;
- error = xfs_trans_alloc_empty(ip->i_mount, &tp);
- if (error)
- return error;
-
xfs_ilock(ip, XFS_ILOCK_EXCL | lock_class);
error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
@@ -1230,7 +1228,6 @@ xfs_rtmount_iread_extents(
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_EXCL | lock_class);
- xfs_trans_cancel(tp);
return error;
}
@@ -1238,43 +1235,52 @@ xfs_rtmount_iread_extents(
* Get the bitmap and summary inodes and the summary cache into the mount
* structure at mount time.
*/
-int /* error */
+int
xfs_rtmount_inodes(
- xfs_mount_t *mp) /* file system mount structure */
+ struct xfs_mount *mp)
{
- int error; /* error return value */
- xfs_sb_t *sbp;
+ struct xfs_trans *tp;
+ struct xfs_sb *sbp = &mp->m_sb;
+ int error;
- sbp = &mp->m_sb;
- error = xfs_iget(mp, NULL, sbp->sb_rbmino, 0, 0, &mp->m_rbmip);
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_trans_metafile_iget(tp, mp->m_sb.sb_rbmino,
+ XFS_METAFILE_RTBITMAP, &mp->m_rbmip);
if (xfs_metadata_is_sick(error))
xfs_rt_mark_sick(mp, XFS_SICK_RT_BITMAP);
if (error)
- return error;
+ goto out_trans;
ASSERT(mp->m_rbmip != NULL);
- error = xfs_rtmount_iread_extents(mp->m_rbmip, XFS_ILOCK_RTBITMAP);
+ error = xfs_rtmount_iread_extents(tp, mp->m_rbmip, XFS_ILOCK_RTBITMAP);
if (error)
goto out_rele_bitmap;
- error = xfs_iget(mp, NULL, sbp->sb_rsumino, 0, 0, &mp->m_rsumip);
+ error = xfs_trans_metafile_iget(tp, mp->m_sb.sb_rsumino,
+ XFS_METAFILE_RTSUMMARY, &mp->m_rsumip);
if (xfs_metadata_is_sick(error))
xfs_rt_mark_sick(mp, XFS_SICK_RT_SUMMARY);
if (error)
goto out_rele_bitmap;
ASSERT(mp->m_rsumip != NULL);
- error = xfs_rtmount_iread_extents(mp->m_rsumip, XFS_ILOCK_RTSUM);
+ error = xfs_rtmount_iread_extents(tp, mp->m_rsumip, XFS_ILOCK_RTSUM);
if (error)
goto out_rele_summary;
xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
+ xfs_trans_cancel(tp);
return 0;
out_rele_summary:
xfs_irele(mp->m_rsumip);
out_rele_bitmap:
xfs_irele(mp->m_rbmip);
+out_trans:
+ xfs_trans_cancel(tp);
return error;
}
* [PATCH 04/26] xfs: load metadata directory root at mount time
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (2 preceding siblings ...)
2024-08-23 0:02 ` [PATCH 03/26] xfs: iget for metadata inodes Darrick J. Wong
@ 2024-08-23 0:03 ` Darrick J. Wong
2024-08-23 4:35 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 05/26] xfs: enforce metadata inode flag Darrick J. Wong
` (21 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:03 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Load the metadata directory root inode into memory at mount time and
release it at unmount time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
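[Not part of the patch: a toy sketch of the reference lifecycle this patch sets up. Names are invented stand-ins; the real xfs_irele() drops an inode reference, modeled here with a plain counter. The NULL guard matters because the mount error path can run before the metadir root was ever loaded.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy inode with a refcount standing in for xfs_inode/xfs_irele. */
struct toy_inode { int refcount; };

struct toy_mount { struct toy_inode *m_metadirip; };

static struct toy_inode metadir_root = { .refcount = 0 };

/* Stand-in for xfs_metafile_iget() of the metadir root at mount time. */
static int toy_metadir_mount(struct toy_mount *mp)
{
	metadir_root.refcount++;
	mp->m_metadirip = &metadir_root;
	return 0;
}

static void toy_irele(struct toy_inode *ip)
{
	ip->refcount--;
}

/*
 * Both xfs_unmountfs() and the xfs_mountfs() error path drop the
 * reference only if it was taken -- hence the "if (mp->m_metadirip)"
 * guard before xfs_irele() in the patch.
 */
static void toy_metadir_release(struct toy_mount *mp)
{
	if (mp->m_metadirip) {
		toy_irele(mp->m_metadirip);
		mp->m_metadirip = NULL;
	}
}
```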
fs/xfs/xfs_mount.c | 31 +++++++++++++++++++++++++++++--
fs/xfs/xfs_mount.h | 1 +
2 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 09eef1721ef4f..b0ea88acdb618 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -35,6 +35,7 @@
#include "xfs_trace.h"
#include "xfs_ag.h"
#include "xfs_rtbitmap.h"
+#include "xfs_metafile.h"
#include "scrub/stats.h"
static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -616,6 +617,22 @@ xfs_mount_setup_inode_geom(
xfs_ialloc_setup_geometry(mp);
}
+/* Mount the metadata directory tree root. */
+STATIC int
+xfs_mount_setup_metadir(
+ struct xfs_mount *mp)
+{
+ int error;
+
+ /* Load the metadata directory root inode into memory. */
+ error = xfs_metafile_iget(mp, mp->m_sb.sb_metadirino, XFS_METAFILE_DIR,
+ &mp->m_metadirip);
+ if (error)
+ xfs_warn(mp, "Failed to load metadir root directory, error %d",
+ error);
+ return error;
+}
+
/* Compute maximum possible height for per-AG btree types for this fs. */
static inline void
xfs_agbtree_compute_maxlevels(
@@ -862,6 +879,12 @@ xfs_mountfs(
mp->m_features |= XFS_FEAT_ATTR2;
}
+ if (xfs_has_metadir(mp)) {
+ error = xfs_mount_setup_metadir(mp);
+ if (error)
+ goto out_free_metadir;
+ }
+
/*
* Get and sanity-check the root inode.
* Save the pointer to it in the mount structure.
@@ -872,7 +895,7 @@ xfs_mountfs(
xfs_warn(mp,
"Failed to read root inode 0x%llx, error %d",
sbp->sb_rootino, -error);
- goto out_log_dealloc;
+ goto out_free_metadir;
}
ASSERT(rip != NULL);
@@ -1014,6 +1037,9 @@ xfs_mountfs(
xfs_irele(rip);
/* Clean out dquots that might be in memory after quotacheck. */
xfs_qm_unmount(mp);
+ out_free_metadir:
+ if (mp->m_metadirip)
+ xfs_irele(mp->m_metadirip);
/*
* Inactivate all inodes that might still be in memory after a log
@@ -1035,7 +1061,6 @@ xfs_mountfs(
* quota inodes.
*/
xfs_unmount_flush_inodes(mp);
- out_log_dealloc:
xfs_log_mount_cancel(mp);
out_inodegc_shrinker:
shrinker_free(mp->m_inodegc_shrinker);
@@ -1087,6 +1112,8 @@ xfs_unmountfs(
xfs_qm_unmount_quotas(mp);
xfs_rtunmount_inodes(mp);
xfs_irele(mp->m_rootip);
+ if (mp->m_metadirip)
+ xfs_irele(mp->m_metadirip);
xfs_unmount_flush_inodes(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d404ce122f238..6251ebced3062 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -93,6 +93,7 @@ typedef struct xfs_mount {
struct xfs_inode *m_rbmip; /* pointer to bitmap inode */
struct xfs_inode *m_rsumip; /* pointer to summary inode */
struct xfs_inode *m_rootip; /* pointer to root directory */
+ struct xfs_inode *m_metadirip; /* ptr to metadata directory */
struct xfs_quotainfo *m_quotainfo; /* disk quota information */
struct xfs_buftarg *m_ddev_targp; /* data device */
struct xfs_buftarg *m_logdev_targp;/* log device */
* [PATCH 05/26] xfs: enforce metadata inode flag
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (3 preceding siblings ...)
2024-08-23 0:03 ` [PATCH 04/26] xfs: load metadata directory root at mount time Darrick J. Wong
@ 2024-08-23 0:03 ` Darrick J. Wong
2024-08-23 4:38 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 06/26] xfs: read and write metadata inode directory tree Darrick J. Wong
` (20 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:03 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add checks for the metadata inode flag so that we don't ever leak
metadata inodes out to userspace, and we don't ever try to read a
regular inode as metadata.
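The verifier rules added below can be modeled as a small predicate. The constants and helper here are illustrative stand-ins (not the real XFS_DIFLAG_* bit values and not kernel code); they only sketch the policy that a metadata file must carry the full mandatory flag set for its file type and must have no access permission bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for illustration -- the on-disk bits differ. */
#define DIFLAG_IMMUTABLE	(1u << 0)
#define DIFLAG_SYNC		(1u << 1)
#define DIFLAG_NOATIME		(1u << 2)
#define DIFLAG_NODUMP		(1u << 3)
#define DIFLAG_NODEFRAG		(1u << 4)
#define DIFLAG_NOSYMLINKS	(1u << 5)

#define METAFILE_DIFLAGS	(DIFLAG_IMMUTABLE | DIFLAG_SYNC | \
				 DIFLAG_NOATIME | DIFLAG_NODUMP | \
				 DIFLAG_NODEFRAG)
#define METADIR_DIFLAGS		(METAFILE_DIFLAGS | DIFLAG_NOSYMLINKS)

/*
 * Sketch of the verifier's flag policy: zero access permissions, and the
 * full mandatory flag set for the file type (directories additionally
 * require NOSYMLINKS).
 */
static bool metadir_flags_ok(bool is_dir, uint16_t mode_bits, uint16_t flags)
{
	if (mode_bits & 0777)		/* must have zero access permissions */
		return false;
	if (is_dir)
		return (flags & METADIR_DIFLAGS) == METADIR_DIFLAGS;
	return (flags & METAFILE_DIFLAGS) == METAFILE_DIFLAGS;
}
```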
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_inode_buf.c | 70 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_inode_buf.h | 3 ++
fs/xfs/libxfs/xfs_metafile.h | 11 ++++++
fs/xfs/scrub/common.c | 10 +++++-
fs/xfs/scrub/inode.c | 26 ++++++++++++++-
fs/xfs/scrub/inode_repair.c | 10 ++++++
fs/xfs/xfs_icache.c | 9 +++++
fs/xfs/xfs_inode.c | 11 ++++++
8 files changed, 145 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index cdd6ed4279649..a74040ffdb5e2 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -19,6 +19,7 @@
#include "xfs_ialloc.h"
#include "xfs_dir2.h"
#include "xfs_health.h"
+#include "xfs_metafile.h"
#include <linux/iversion.h>
@@ -488,6 +489,69 @@ xfs_dinode_verify_nrext64(
return NULL;
}
+/*
+ * Validate all the picky requirements we have for a file that claims to be
+ * filesystem metadata.
+ */
+xfs_failaddr_t
+xfs_dinode_verify_metadir(
+ struct xfs_mount *mp,
+ struct xfs_dinode *dip,
+ uint16_t mode,
+ uint16_t flags,
+ uint64_t flags2)
+{
+ if (!xfs_has_metadir(mp))
+ return __this_address;
+
+ /* V5 filesystem only */
+ if (dip->di_version < 3)
+ return __this_address;
+
+ if (be16_to_cpu(dip->di_metatype) >= XFS_METAFILE_MAX)
+ return __this_address;
+
+ /* V3 inode fields that are always zero */
+ if ((flags2 & XFS_DIFLAG2_NREXT64) && dip->di_nrext64_pad)
+ return __this_address;
+ if (!(flags2 & XFS_DIFLAG2_NREXT64) && dip->di_flushiter)
+ return __this_address;
+
+ /* Metadata files can only be directories or regular files */
+ if (!S_ISDIR(mode) && !S_ISREG(mode))
+ return __this_address;
+
+ /* They must have zero access permissions */
+ if (mode & 0777)
+ return __this_address;
+
+ /* DMAPI event and state masks must be zero */
+ if (dip->di_dmevmask || dip->di_dmstate)
+ return __this_address;
+
+ /*
+ * User and group IDs must be zero. The project ID is used for
+ * grouping inodes. Metadata inodes are never accounted to quotas.
+ */
+ if (dip->di_uid || dip->di_gid)
+ return __this_address;
+
+ /* Mandatory directory flags must be set */
+ if (S_ISDIR(mode)) {
+ if ((flags & XFS_METADIR_DIFLAGS) != XFS_METADIR_DIFLAGS)
+ return __this_address;
+ } else {
+ if ((flags & XFS_METAFILE_DIFLAGS) != XFS_METAFILE_DIFLAGS)
+ return __this_address;
+ }
+
+ /* dax flags2 must not be set */
+ if (flags2 & XFS_DIFLAG2_DAX)
+ return __this_address;
+
+ return NULL;
+}
+
xfs_failaddr_t
xfs_dinode_verify(
struct xfs_mount *mp,
@@ -672,6 +736,12 @@ xfs_dinode_verify(
!xfs_has_bigtime(mp))
return __this_address;
+ if (flags2 & XFS_DIFLAG2_METADATA) {
+ fa = xfs_dinode_verify_metadir(mp, dip, mode, flags, flags2);
+ if (fa)
+ return fa;
+ }
+
return NULL;
}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af4..8d43d2641c732 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -28,6 +28,9 @@ int xfs_inode_from_disk(struct xfs_inode *ip, struct xfs_dinode *from);
xfs_failaddr_t xfs_dinode_verify(struct xfs_mount *mp, xfs_ino_t ino,
struct xfs_dinode *dip);
+xfs_failaddr_t xfs_dinode_verify_metadir(struct xfs_mount *mp,
+ struct xfs_dinode *dip, uint16_t mode, uint16_t flags,
+ uint64_t flags2);
xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
uint32_t extsize, uint16_t mode, uint16_t flags);
xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
diff --git a/fs/xfs/libxfs/xfs_metafile.h b/fs/xfs/libxfs/xfs_metafile.h
index 60fe189061127..07ff20639bd54 100644
--- a/fs/xfs/libxfs/xfs_metafile.h
+++ b/fs/xfs/libxfs/xfs_metafile.h
@@ -6,6 +6,17 @@
#ifndef __XFS_METAFILE_H__
#define __XFS_METAFILE_H__
+/* All metadata files must have these flags set. */
+#define XFS_METAFILE_DIFLAGS (XFS_DIFLAG_IMMUTABLE | \
+ XFS_DIFLAG_SYNC | \
+ XFS_DIFLAG_NOATIME | \
+ XFS_DIFLAG_NODUMP | \
+ XFS_DIFLAG_NODEFRAG)
+
+/* All metadata directory files must have these flags set. */
+#define XFS_METADIR_DIFLAGS (XFS_METAFILE_DIFLAGS | \
+ XFS_DIFLAG_NOSYMLINKS)
+
/* Code specific to kernel/userspace; must be provided externally. */
int xfs_trans_metafile_iget(struct xfs_trans *tp, xfs_ino_t ino,
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 22f5f1a9d3f09..f64271ccb786c 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -947,9 +947,15 @@ xchk_iget_for_scrubbing(
if (sc->sm->sm_ino == 0 || sc->sm->sm_ino == ip_in->i_ino)
return xchk_install_live_inode(sc, ip_in);
- /* Reject internal metadata files and obviously bad inode numbers. */
- if (xfs_internal_inum(mp, sc->sm->sm_ino))
+ /*
+ * On pre-metadir filesystems, reject internal metadata files. For
+ * metadir filesystems, limited scrubbing of any file in the metadata
+ * directory tree by handle is allowed, because that is the only way to
+ * validate the lack of parent pointers in the sb-root metadata inodes.
+ */
+ if (!xfs_has_metadir(mp) && xfs_internal_inum(mp, sc->sm->sm_ino))
return -ENOENT;
+ /* Reject obviously bad inode numbers. */
if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino))
return -ENOENT;
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index ec2c694c4083f..45222552a51cc 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -60,6 +60,22 @@ xchk_install_handle_iscrub(
if (error)
return error;
+ /*
+ * Don't allow scrubbing by handle of any non-directory inode records
+ * in the metadata directory tree. We don't know if any of the scans
+ * launched by this scrubber will end up indirectly trying to lock this
+ * file.
+ *
+ * Scrubbers of inode-rooted metadata files (e.g. quota files) will
+ * attach all the resources needed to scrub the inode and call
+ * xchk_inode directly. Userspace cannot call this directly.
+ */
+ if (xfs_is_metadir_inode(ip) && !S_ISDIR(VFS_I(ip)->i_mode)) {
+ xchk_irele(sc, ip);
+ sc->ip = NULL;
+ return -ENOENT;
+ }
+
return xchk_prepare_iscrub(sc);
}
@@ -94,9 +110,15 @@ xchk_setup_inode(
return xchk_prepare_iscrub(sc);
}
- /* Reject internal metadata files and obviously bad inode numbers. */
- if (xfs_internal_inum(mp, sc->sm->sm_ino))
+ /*
+ * On pre-metadir filesystems, reject internal metadata files. For
+ * metadir filesystems, limited scrubbing of any file in the metadata
+ * directory tree by handle is allowed, because that is the only way to
+ * validate the lack of parent pointers in the sb-root metadata inodes.
+ */
+ if (!xfs_has_metadir(mp) && xfs_internal_inum(mp, sc->sm->sm_ino))
return -ENOENT;
+ /* Reject obviously bad inode numbers. */
if (!xfs_verify_ino(sc->mp, sc->sm->sm_ino))
return -ENOENT;
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 344fdffb19aba..060ebfb25c7a5 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -568,6 +568,16 @@ xrep_dinode_flags(
dip->di_nrext64_pad = 0;
else if (dip->di_version >= 3)
dip->di_v3_pad = 0;
+
+ if (flags2 & XFS_DIFLAG2_METADATA) {
+ xfs_failaddr_t fa;
+
+ fa = xfs_dinode_verify_metadir(sc->mp, dip, mode, flags,
+ flags2);
+ if (fa)
+ flags2 &= ~XFS_DIFLAG2_METADATA;
+ }
+
dip->di_flags = cpu_to_be16(flags);
dip->di_flags2 = cpu_to_be64(flags2);
}
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index a3d4334d4151b..61bba47e565f4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -844,13 +844,20 @@ xfs_trans_metafile_iget(
mode = S_IFREG;
if (inode_wrong_type(VFS_I(ip), mode))
goto bad_rele;
+ if (xfs_has_metadir(mp)) {
+ if (!xfs_is_metadir_inode(ip))
+ goto bad_rele;
+ if (metafile_type != ip->i_metatype)
+ goto bad_rele;
+ }
*ipp = ip;
return 0;
bad_rele:
xfs_irele(ip);
whine:
- xfs_err(mp, "metadata inode 0x%llx is corrupt", ino);
+ xfs_err(mp, "metadata inode 0x%llx type %u is corrupt", ino,
+ metafile_type);
return -EFSCORRUPTED;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index e1c65507479cd..35acb73665fdd 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -555,8 +555,19 @@ xfs_lookup(
if (error)
goto out_free_name;
+ /*
+ * Fail if a directory entry in the regular directory tree points to
+ * a metadata file.
+ */
+ if (XFS_IS_CORRUPT(dp->i_mount, xfs_is_metadir_inode(*ipp))) {
+ error = -EFSCORRUPTED;
+ goto out_irele;
+ }
+
return 0;
+out_irele:
+ xfs_irele(*ipp);
out_free_name:
if (ci_name)
kfree(ci_name->name);
* [PATCH 06/26] xfs: read and write metadata inode directory tree
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:03 ` [PATCH 05/26] xfs: enforce metadata inode flag Darrick J. Wong
@ 2024-08-23 0:03 ` Darrick J. Wong
2024-08-23 4:39 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 07/26] xfs: disable the agi rotor for metadata inodes Darrick J. Wong
` (19 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:03 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Plumb in the bits we need to load metadata inodes from a named entry in
a metadir directory, create (or hardlink) inodes into a metadir
directory, create metadir directories, and flag inodes as being metadata
files.
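Both the create and link paths below share a precondition: a directory lookup of the target name must fail with -ENOENT before a new entry may be added, while an existing name becomes -EEXIST. A minimal user-space model of that switch/fallthrough idiom (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <errno.h>

/*
 * Map the result of a name lookup onto the create/link precondition:
 * only "name not found" lets the operation proceed; an existing name
 * becomes -EEXIST; any other lookup failure is propagated unchanged.
 */
static int metadir_create_precheck(int lookup_err)
{
	switch (lookup_err) {
	case -ENOENT:			/* name is free: ok to create */
		return 0;
	case 0:				/* name already present */
		return -EEXIST;
	default:			/* real lookup failure */
		return lookup_err;
	}
}
```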
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 4
fs/xfs/libxfs/xfs_metadir.c | 474 ++++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_metadir.h | 47 ++++
fs/xfs/libxfs/xfs_metafile.c | 52 +++++
fs/xfs/libxfs/xfs_metafile.h | 4
fs/xfs/xfs_icache.c | 2
fs/xfs/xfs_trace.c | 2
fs/xfs/xfs_trace.h | 102 +++++++++
8 files changed, 685 insertions(+), 2 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_metadir.c
create mode 100644 fs/xfs/libxfs/xfs_metadir.h
create mode 100644 fs/xfs/libxfs/xfs_metafile.c
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index dd692619bed58..4482cc8c39039 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -15,6 +15,7 @@ xfs-y += xfs_trace.o
# build the libxfs code first
xfs-y += $(addprefix libxfs/, \
xfs_ag.o \
+ xfs_ag_resv.o \
xfs_alloc.o \
xfs_alloc_btree.o \
xfs_attr.o \
@@ -42,7 +43,8 @@ xfs-y += $(addprefix libxfs/, \
xfs_inode_buf.o \
xfs_inode_util.o \
xfs_log_rlimit.o \
- xfs_ag_resv.o \
+ xfs_metadir.o \
+ xfs_metafile.o \
xfs_parent.o \
xfs_rmap.o \
xfs_rmap_btree.o \
diff --git a/fs/xfs/libxfs/xfs_metadir.c b/fs/xfs/libxfs/xfs_metadir.c
new file mode 100644
index 0000000000000..0a61316b4f520
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_metadir.c
@@ -0,0 +1,474 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2018-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_metafile.h"
+#include "xfs_metadir.h"
+#include "xfs_trace.h"
+#include "xfs_inode.h"
+#include "xfs_quota.h"
+#include "xfs_ialloc.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_ag.h"
+#include "xfs_dir2.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_parent.h"
+
+/*
+ * Metadata Directory Tree
+ * =======================
+ *
+ * These functions provide an abstraction layer for looking up, creating, and
+ * deleting metadata inodes that live within a special metadata directory tree.
+ *
+ * This code does not manage the five existing metadata inodes: realtime
+ * bitmap & summary; and the user, group, and project quota files. All
+ * other metadata inodes must use only the xfs_meta{dir,file}_* functions.
+ *
+ * Callers wishing to create or hardlink a metadata inode must create an
+ * xfs_metadir_update structure, call the appropriate xfs_metadir* function,
+ * and then call xfs_metadir_commit or xfs_metadir_cancel to commit or cancel
+ * the update. Files in the metadata directory tree currently cannot be
+ * unlinked.
+ *
+ * When the metadir feature is enabled, all metadata inodes must have the
+ * "metadata" inode flag set to prevent them from being exposed to the outside
+ * world.
+ *
+ * Callers must take the ILOCK of any inode in the metadata directory tree to
+ * synchronize access to that inode. It is never necessary to take the IOLOCK
+ * or the MMAPLOCK since metadata inodes must not be exposed to user space.
+ */
+
+static inline void
+xfs_metadir_set_xname(
+ struct xfs_name *xname,
+ const char *path,
+ unsigned char ftype)
+{
+ xname->name = (const unsigned char *)path;
+ xname->len = strlen(path);
+ xname->type = ftype;
+}
+
+/*
+ * Given a parent directory @dp and a metadata inode path component @xname,
+ * look up the inode number in the directory, returning it in @ino.
+ * @xname.type must match the directory entry's ftype.
+ *
+ * Caller must hold ILOCK_EXCL.
+ */
+static inline int
+xfs_metadir_lookup(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ struct xfs_name *xname,
+ xfs_ino_t *ino)
+{
+ struct xfs_mount *mp = dp->i_mount;
+ struct xfs_da_args args = {
+ .trans = tp,
+ .dp = dp,
+ .geo = mp->m_dir_geo,
+ .name = xname->name,
+ .namelen = xname->len,
+ .hashval = xfs_dir2_hashname(mp, xname),
+ .whichfork = XFS_DATA_FORK,
+ .op_flags = XFS_DA_OP_OKNOENT,
+ .owner = dp->i_ino,
+ };
+ int error;
+
+ if (!S_ISDIR(VFS_I(dp)->i_mode))
+ return -EFSCORRUPTED;
+ if (xfs_is_shutdown(mp))
+ return -EIO;
+
+ error = xfs_dir_lookup_args(&args);
+ if (error)
+ return error;
+
+ if (!xfs_verify_ino(mp, args.inumber))
+ return -EFSCORRUPTED;
+ if (xname->type != XFS_DIR3_FT_UNKNOWN && xname->type != args.filetype)
+ return -EFSCORRUPTED;
+
+ trace_xfs_metadir_lookup(dp, xname, args.inumber);
+ *ino = args.inumber;
+ return 0;
+}
+
+/*
+ * Look up and read a metadata inode from the metadata directory. If the path
+ * component doesn't exist, return -ENOENT.
+ */
+int
+xfs_metadir_load(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ const char *path,
+ enum xfs_metafile_type metafile_type,
+ struct xfs_inode **ipp)
+{
+ struct xfs_name xname;
+ xfs_ino_t ino;
+ int error;
+
+ xfs_metadir_set_xname(&xname, path, XFS_DIR3_FT_UNKNOWN);
+
+ xfs_ilock(dp, XFS_ILOCK_EXCL);
+ error = xfs_metadir_lookup(tp, dp, &xname, &ino);
+ xfs_iunlock(dp, XFS_ILOCK_EXCL);
+ if (error)
+ return error;
+ return xfs_trans_metafile_iget(tp, ino, metafile_type, ipp);
+}
+
+/*
+ * Unlock and release resources after committing (or cancelling) a metadata
+ * directory tree operation. The caller retains its reference to @upd->ip
+ * and must release it explicitly.
+ */
+static inline void
+xfs_metadir_teardown(
+ struct xfs_metadir_update *upd,
+ int error)
+{
+ trace_xfs_metadir_teardown(upd, error);
+
+ if (upd->ppargs) {
+ xfs_parent_finish(upd->dp->i_mount, upd->ppargs);
+ upd->ppargs = NULL;
+ }
+
+ if (upd->ip) {
+ if (upd->ip_locked)
+ xfs_iunlock(upd->ip, XFS_ILOCK_EXCL);
+ upd->ip_locked = false;
+ }
+
+ if (upd->dp_locked)
+ xfs_iunlock(upd->dp, XFS_ILOCK_EXCL);
+ upd->dp_locked = false;
+}
+
+/*
+ * Begin the process of creating a metadata file by allocating transactions
+ * and taking whatever resources we're going to need.
+ */
+int
+xfs_metadir_start_create(
+ struct xfs_metadir_update *upd)
+{
+ struct xfs_mount *mp = upd->dp->i_mount;
+ int error;
+
+ ASSERT(upd->dp != NULL);
+ ASSERT(upd->ip == NULL);
+ ASSERT(xfs_has_metadir(mp));
+ ASSERT(upd->metafile_type != XFS_METAFILE_UNKNOWN);
+
+ error = xfs_parent_start(mp, &upd->ppargs);
+ if (error)
+ return error;
+
+ /*
+ * If we ever need the ability to create rt metadata files on a
+ * pre-metadir filesystem, we'll need to dqattach the parent here.
+ * Currently we assume that mkfs will create the files and quotacheck
+ * will account for them.
+ */
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_create,
+ xfs_create_space_res(mp, MAXNAMELEN), 0, 0, &upd->tp);
+ if (error)
+ goto out_teardown;
+
+ /*
+ * Lock the parent directory if there is one. We can't ijoin it to
+ * the transaction until after the child file has been created.
+ */
+ xfs_ilock(upd->dp, XFS_ILOCK_EXCL | XFS_ILOCK_PARENT);
+ upd->dp_locked = true;
+
+ trace_xfs_metadir_start_create(upd);
+ return 0;
+out_teardown:
+ xfs_metadir_teardown(upd, error);
+ return error;
+}
+
+/*
+ * Create a metadata inode with the given @mode, and insert it into the
+ * metadata directory tree at the given @upd->path. The path up to the final
+ * component must already exist. The final path component must not exist.
+ *
+ * The new metadata inode will be attached to the update structure @upd->ip,
+ * with the ILOCK held until the caller releases it.
+ *
+ * NOTE: This function may return a new inode to the caller even if it returns
+ * a negative error code. If an inode is passed back, the caller must finish
+ * setting up the inode before releasing it.
+ */
+int
+xfs_metadir_create(
+ struct xfs_metadir_update *upd,
+ umode_t mode)
+{
+ struct xfs_icreate_args args = {
+ .pip = upd->dp,
+ .mode = mode,
+ };
+ struct xfs_name xname;
+ struct xfs_dir_update du = {
+ .dp = upd->dp,
+ .name = &xname,
+ .ppargs = upd->ppargs,
+ };
+ struct xfs_mount *mp = upd->dp->i_mount;
+ xfs_ino_t ino;
+ unsigned int resblks;
+ int error;
+
+ xfs_assert_ilocked(upd->dp, XFS_ILOCK_EXCL);
+
+ /* Check that the name does not already exist in the directory. */
+ xfs_metadir_set_xname(&xname, upd->path, XFS_DIR3_FT_UNKNOWN);
+ error = xfs_metadir_lookup(upd->tp, upd->dp, &xname, &ino);
+ switch (error) {
+ case -ENOENT:
+ break;
+ case 0:
+ error = -EEXIST;
+ fallthrough;
+ default:
+ return error;
+ }
+
+ /*
+ * A newly created regular or special file just has one directory
+ * entry pointing to it, but a directory also has the "." entry
+ * pointing to itself.
+ */
+ error = xfs_dialloc(&upd->tp, &args, &ino);
+ if (error)
+ return error;
+ error = xfs_icreate(upd->tp, ino, &args, &upd->ip);
+ if (error)
+ return error;
+ du.ip = upd->ip;
+ xfs_metafile_set_iflag(upd->tp, upd->ip, upd->metafile_type);
+ upd->ip_locked = true;
+
+ /*
+ * Join the directory inode to the transaction. We do not do it
+ * earlier because xfs_dialloc rolls the transaction.
+ */
+ xfs_trans_ijoin(upd->tp, upd->dp, 0);
+
+ /* Create the entry. */
+ if (S_ISDIR(args.mode))
+ resblks = xfs_mkdir_space_res(mp, xname.len);
+ else
+ resblks = xfs_create_space_res(mp, xname.len);
+ xname.type = xfs_mode_to_ftype(args.mode);
+
+ trace_xfs_metadir_try_create(upd);
+
+ error = xfs_dir_create_child(upd->tp, resblks, &du);
+ if (error)
+ return error;
+
+ /* Metadir files are not accounted to quota. */
+
+ trace_xfs_metadir_create(upd);
+
+ return 0;
+}
+
+#ifndef __KERNEL__
+/*
+ * Begin the process of linking a metadata file by allocating transactions
+ * and locking whatever resources we're going to need.
+ */
+int
+xfs_metadir_start_link(
+ struct xfs_metadir_update *upd)
+{
+ struct xfs_mount *mp = upd->dp->i_mount;
+ unsigned int resblks;
+ int nospace_error = 0;
+ int error;
+
+ ASSERT(upd->dp != NULL);
+ ASSERT(upd->ip != NULL);
+ ASSERT(xfs_has_metadir(mp));
+
+ error = xfs_parent_start(mp, &upd->ppargs);
+ if (error)
+ return error;
+
+ resblks = xfs_link_space_res(mp, MAXNAMELEN);
+ error = xfs_trans_alloc_dir(upd->dp, &M_RES(mp)->tr_link, upd->ip,
+ &resblks, &upd->tp, &nospace_error);
+ if (error)
+ goto out_teardown;
+ if (!resblks) {
+ /* We don't allow reservationless updates. */
+ xfs_trans_cancel(upd->tp);
+ upd->tp = NULL;
+ xfs_iunlock(upd->dp, XFS_ILOCK_EXCL);
+ xfs_iunlock(upd->ip, XFS_ILOCK_EXCL);
+ error = nospace_error;
+ goto out_teardown;
+ }
+
+ upd->dp_locked = true;
+ upd->ip_locked = true;
+
+ trace_xfs_metadir_start_link(upd);
+ return 0;
+out_teardown:
+ xfs_metadir_teardown(upd, error);
+ return error;
+}
+
+/*
+ * Link the inode @upd->ip into the metadata directory tree at @upd->path.
+ * The path (up to the final component) must already exist, but the final
+ * component must not already exist.
+ */
+int
+xfs_metadir_link(
+ struct xfs_metadir_update *upd)
+{
+ struct xfs_name xname;
+ struct xfs_dir_update du = {
+ .dp = upd->dp,
+ .name = &xname,
+ .ip = upd->ip,
+ .ppargs = upd->ppargs,
+ };
+ struct xfs_mount *mp = upd->dp->i_mount;
+ xfs_ino_t ino;
+ unsigned int resblks;
+ int error;
+
+ xfs_assert_ilocked(upd->dp, XFS_ILOCK_EXCL);
+ xfs_assert_ilocked(upd->ip, XFS_ILOCK_EXCL);
+
+ /* Look up the name in the current directory. */
+ xfs_metadir_set_xname(&xname, upd->path,
+ xfs_mode_to_ftype(VFS_I(upd->ip)->i_mode));
+ error = xfs_metadir_lookup(upd->tp, upd->dp, &xname, &ino);
+ switch (error) {
+ case -ENOENT:
+ break;
+ case 0:
+ error = -EEXIST;
+ fallthrough;
+ default:
+ return error;
+ }
+
+ resblks = xfs_link_space_res(mp, xname.len);
+ error = xfs_dir_add_child(upd->tp, resblks, &du);
+ if (error)
+ return error;
+
+ trace_xfs_metadir_link(upd);
+
+ return 0;
+}
+#endif /* ! __KERNEL__ */
+
+/* Commit a metadir update and unlock/drop all resources. */
+int
+xfs_metadir_commit(
+ struct xfs_metadir_update *upd)
+{
+ int error;
+
+ trace_xfs_metadir_commit(upd);
+
+ error = xfs_trans_commit(upd->tp);
+ upd->tp = NULL;
+
+ xfs_metadir_teardown(upd, error);
+ return error;
+}
+
+/* Cancel a metadir update and unlock/drop all resources. */
+void
+xfs_metadir_cancel(
+ struct xfs_metadir_update *upd,
+ int error)
+{
+ trace_xfs_metadir_cancel(upd);
+
+ xfs_trans_cancel(upd->tp);
+ upd->tp = NULL;
+
+ xfs_metadir_teardown(upd, error);
+}
+
+/* Create a metadata directory for the last component of the path. */
+int
+xfs_metadir_mkdir(
+ struct xfs_inode *dp,
+ const char *path,
+ struct xfs_inode **ipp)
+{
+ struct xfs_metadir_update upd = {
+ .dp = dp,
+ .path = path,
+ .metafile_type = XFS_METAFILE_DIR,
+ };
+ int error;
+
+ if (xfs_is_shutdown(dp->i_mount))
+ return -EIO;
+
+ /* Allocate a transaction to create the last directory. */
+ error = xfs_metadir_start_create(&upd);
+ if (error)
+ return error;
+
+ /* Create the subdirectory and take our reference. */
+ error = xfs_metadir_create(&upd, S_IFDIR);
+ if (error)
+ goto out_cancel;
+
+ error = xfs_metadir_commit(&upd);
+ if (error)
+ goto out_irele;
+
+ xfs_finish_inode_setup(upd.ip);
+ *ipp = upd.ip;
+ return 0;
+
+out_cancel:
+ xfs_metadir_cancel(&upd, error);
+out_irele:
+ /* Have to finish setting up the inode to ensure it's deleted. */
+ if (upd.ip) {
+ xfs_finish_inode_setup(upd.ip);
+ xfs_irele(upd.ip);
+ }
+ return error;
+}
diff --git a/fs/xfs/libxfs/xfs_metadir.h b/fs/xfs/libxfs/xfs_metadir.h
new file mode 100644
index 0000000000000..bfecac7d3d147
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_metadir.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2018-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_METADIR_H__
+#define __XFS_METADIR_H__
+
+/* Cleanup widget for metadata inode creation and deletion. */
+struct xfs_metadir_update {
+ /* Parent directory */
+ struct xfs_inode *dp;
+
+ /* Path to metadata file */
+ const char *path;
+
+ /* Parent pointer update context */
+ struct xfs_parent_args *ppargs;
+
+ /* Child metadata file */
+ struct xfs_inode *ip;
+
+ struct xfs_trans *tp;
+
+ enum xfs_metafile_type metafile_type;
+
+ unsigned int dp_locked:1;
+ unsigned int ip_locked:1;
+};
+
+int xfs_metadir_load(struct xfs_trans *tp, struct xfs_inode *dp,
+ const char *path, enum xfs_metafile_type metafile_type,
+ struct xfs_inode **ipp);
+
+int xfs_metadir_start_create(struct xfs_metadir_update *upd);
+int xfs_metadir_create(struct xfs_metadir_update *upd, umode_t mode);
+
+int xfs_metadir_start_link(struct xfs_metadir_update *upd);
+int xfs_metadir_link(struct xfs_metadir_update *upd);
+
+int xfs_metadir_commit(struct xfs_metadir_update *upd);
+void xfs_metadir_cancel(struct xfs_metadir_update *upd, int error);
+
+int xfs_metadir_mkdir(struct xfs_inode *dp, const char *path,
+ struct xfs_inode **ipp);
+
+#endif /* __XFS_METADIR_H__ */
diff --git a/fs/xfs/libxfs/xfs_metafile.c b/fs/xfs/libxfs/xfs_metafile.c
new file mode 100644
index 0000000000000..adeb25d1a444c
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_metafile.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2018-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_trans.h"
+#include "xfs_metafile.h"
+#include "xfs_trace.h"
+#include "xfs_inode.h"
+
+/* Set up an inode to be recognized as a metadata directory inode. */
+void
+xfs_metafile_set_iflag(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip,
+ enum xfs_metafile_type metafile_type)
+{
+ VFS_I(ip)->i_mode &= ~0777;
+ VFS_I(ip)->i_uid = GLOBAL_ROOT_UID;
+ VFS_I(ip)->i_gid = GLOBAL_ROOT_GID;
+ if (S_ISDIR(VFS_I(ip)->i_mode))
+ ip->i_diflags |= XFS_METADIR_DIFLAGS;
+ else
+ ip->i_diflags |= XFS_METAFILE_DIFLAGS;
+ ip->i_diflags2 &= ~XFS_DIFLAG2_DAX;
+ ip->i_diflags2 |= XFS_DIFLAG2_METADATA;
+ ip->i_metatype = metafile_type;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Clear the metadata directory inode flag. */
+void
+xfs_metafile_clear_iflag(
+ struct xfs_trans *tp,
+ struct xfs_inode *ip)
+{
+ ASSERT(xfs_is_metadir_inode(ip));
+ ASSERT(VFS_I(ip)->i_nlink == 0);
+
+ ip->i_diflags2 &= ~XFS_DIFLAG2_METADATA;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
diff --git a/fs/xfs/libxfs/xfs_metafile.h b/fs/xfs/libxfs/xfs_metafile.h
index 07ff20639bd54..c3ffe7e49430d 100644
--- a/fs/xfs/libxfs/xfs_metafile.h
+++ b/fs/xfs/libxfs/xfs_metafile.h
@@ -17,6 +17,10 @@
#define XFS_METADIR_DIFLAGS (XFS_METAFILE_DIFLAGS | \
XFS_DIFLAG_NOSYMLINKS)
+void xfs_metafile_set_iflag(struct xfs_trans *tp, struct xfs_inode *ip,
+ enum xfs_metafile_type metafile_type);
+void xfs_metafile_clear_iflag(struct xfs_trans *tp, struct xfs_inode *ip);
+
/* Code specific to kernel/userspace; must be provided externally. */
int xfs_trans_metafile_iget(struct xfs_trans *tp, xfs_ino_t ino,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 61bba47e565f4..c5ccfbbb98c5c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -830,7 +830,7 @@ xfs_trans_metafile_iget(
int error;
error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &ip);
- if (error == -EFSCORRUPTED)
+ if (error == -EFSCORRUPTED || error == -EINVAL)
goto whine;
if (error)
return error;
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 2af9f274e8724..c5f818cf40c29 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -44,6 +44,8 @@
#include "xfs_parent.h"
#include "xfs_rmap.h"
#include "xfs_refcount.h"
+#include "xfs_metafile.h"
+#include "xfs_metadir.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4cf0fa71ba9ce..7f259891ebcaa 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -93,6 +93,7 @@ struct xfs_attrlist_cursor_kern;
struct xfs_extent_free_item;
struct xfs_rmap_intent;
struct xfs_refcount_intent;
+struct xfs_metadir_update;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -5332,6 +5333,107 @@ DEFINE_EVENT(xfs_getparents_class, name, \
DEFINE_XFS_GETPARENTS_EVENT(xfs_getparents_begin);
DEFINE_XFS_GETPARENTS_EVENT(xfs_getparents_end);
+DECLARE_EVENT_CLASS(xfs_metadir_update_class,
+ TP_PROTO(const struct xfs_metadir_update *upd),
+ TP_ARGS(upd),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, dp_ino)
+ __field(xfs_ino_t, ino)
+ __string(fname, upd->path)
+ ),
+ TP_fast_assign(
+ __entry->dev = upd->dp->i_mount->m_super->s_dev;
+ __entry->dp_ino = upd->dp->i_ino;
+ __entry->ino = upd->ip ? upd->ip->i_ino : NULLFSINO;
+ __assign_str(fname);
+ ),
+ TP_printk("dev %d:%d dp 0x%llx fname '%s' ino 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->dp_ino,
+ __get_str(fname),
+ __entry->ino)
+)
+
+#define DEFINE_METADIR_UPDATE_EVENT(name) \
+DEFINE_EVENT(xfs_metadir_update_class, name, \
+ TP_PROTO(const struct xfs_metadir_update *upd), \
+ TP_ARGS(upd))
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_start_create);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_start_link);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_commit);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_cancel);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_try_create);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_create);
+DEFINE_METADIR_UPDATE_EVENT(xfs_metadir_link);
+
+DECLARE_EVENT_CLASS(xfs_metadir_update_error_class,
+ TP_PROTO(const struct xfs_metadir_update *upd, int error),
+ TP_ARGS(upd, error),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, dp_ino)
+ __field(xfs_ino_t, ino)
+ __field(int, error)
+ __string(fname, upd->path)
+ ),
+ TP_fast_assign(
+ __entry->dev = upd->dp->i_mount->m_super->s_dev;
+ __entry->dp_ino = upd->dp->i_ino;
+ __entry->ino = upd->ip ? upd->ip->i_ino : NULLFSINO;
+ __entry->error = error;
+ __assign_str(fname);
+ ),
+ TP_printk("dev %d:%d dp 0x%llx fname '%s' ino 0x%llx error %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->dp_ino,
+ __get_str(fname),
+ __entry->ino,
+ __entry->error)
+)
+
+#define DEFINE_METADIR_UPDATE_ERROR_EVENT(name) \
+DEFINE_EVENT(xfs_metadir_update_error_class, name, \
+ TP_PROTO(const struct xfs_metadir_update *upd, int error), \
+ TP_ARGS(upd, error))
+DEFINE_METADIR_UPDATE_ERROR_EVENT(xfs_metadir_teardown);
+
+DECLARE_EVENT_CLASS(xfs_metadir_class,
+ TP_PROTO(struct xfs_inode *dp, struct xfs_name *name,
+ xfs_ino_t ino),
+ TP_ARGS(dp, name, ino),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, dp_ino)
+ __field(xfs_ino_t, ino)
+ __field(int, ftype)
+ __field(int, namelen)
+ __dynamic_array(char, name, name->len)
+ ),
+ TP_fast_assign(
+ __entry->dev = VFS_I(dp)->i_sb->s_dev;
+ __entry->dp_ino = dp->i_ino;
+ __entry->ino = ino;
+ __entry->ftype = name->type;
+ __entry->namelen = name->len;
+ memcpy(__get_str(name), name->name, name->len);
+ ),
+ TP_printk("dev %d:%d dir 0x%llx type %s name '%.*s' ino 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->dp_ino,
+ __print_symbolic(__entry->ftype, XFS_DIR3_FTYPE_STR),
+ __entry->namelen,
+ __get_str(name),
+ __entry->ino)
+)
+
+#define DEFINE_METADIR_EVENT(name) \
+DEFINE_EVENT(xfs_metadir_class, name, \
+ TP_PROTO(struct xfs_inode *dp, struct xfs_name *name, \
+ xfs_ino_t ino), \
+ TP_ARGS(dp, name, ino))
+DEFINE_METADIR_EVENT(xfs_metadir_lookup);
+
#endif /* _TRACE_XFS_H */
#undef TRACE_INCLUDE_PATH
* [PATCH 07/26] xfs: disable the agi rotor for metadata inodes
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (5 preceding siblings ...)
2024-08-23 0:03 ` [PATCH 06/26] xfs: read and write metadata inode directory tree Darrick J. Wong
@ 2024-08-23 0:03 ` Darrick J. Wong
2024-08-23 4:39 ` Christoph Hellwig
2024-08-23 0:04 ` [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special Darrick J. Wong
` (18 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:03 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Ideally, we'd put all the metadata inodes in one place if we could, so
that the metadata stays reasonably close together instead of spreading
out over the disk. Furthermore, if the log is internal, we'd probably
prefer to keep the metadata near the log. Therefore, disable AGI
rotoring for metadata inode allocations.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ialloc.c | 58 ++++++++++++++++++++++++++++++--------------
1 file changed, 40 insertions(+), 18 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index fc70601e8d8ee..79321aed6dc20 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1844,6 +1844,40 @@ xfs_dialloc_try_ag(
return error;
}
+/*
+ * Pick an AG for the new inode.
+ *
+ * Directories, symlinks, and regular files frequently allocate at least one
+ * block, so factor in that potential expansion when we examine whether an AG
+ * has enough space for file creation. Try to keep metadata files all in the
+ * same AG.
+ */
+static inline xfs_agnumber_t
+xfs_dialloc_pick_ag(
+ struct xfs_mount *mp,
+ struct xfs_inode *dp,
+ umode_t mode)
+{
+ xfs_agnumber_t start_agno;
+
+ if (!dp)
+ return 0;
+ if (xfs_is_metadir_inode(dp)) {
+ if (mp->m_sb.sb_logstart)
+ return XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart);
+ return 0;
+ }
+
+ if (S_ISDIR(mode))
+ return (atomic_inc_return(&mp->m_agirotor) - 1) % mp->m_maxagi;
+
+ start_agno = XFS_INO_TO_AGNO(mp, dp->i_ino);
+ if (start_agno >= mp->m_maxagi)
+ start_agno = 0;
+
+ return start_agno;
+}
+
/*
* Allocate an on-disk inode.
*
@@ -1859,31 +1893,19 @@ xfs_dialloc(
xfs_ino_t *new_ino)
{
struct xfs_mount *mp = (*tpp)->t_mountp;
+ struct xfs_perag *pag;
+ struct xfs_ino_geometry *igeo = M_IGEO(mp);
+ xfs_ino_t ino = NULLFSINO;
xfs_ino_t parent = args->pip ? args->pip->i_ino : 0;
- umode_t mode = args->mode & S_IFMT;
xfs_agnumber_t agno;
- int error = 0;
xfs_agnumber_t start_agno;
- struct xfs_perag *pag;
- struct xfs_ino_geometry *igeo = M_IGEO(mp);
+ umode_t mode = args->mode & S_IFMT;
bool ok_alloc = true;
bool low_space = false;
int flags;
- xfs_ino_t ino = NULLFSINO;
+ int error = 0;
- /*
- * Directories, symlinks, and regular files frequently allocate at least
- * one block, so factor that potential expansion when we examine whether
- * an AG has enough space for file creation.
- */
- if (S_ISDIR(mode))
- start_agno = (atomic_inc_return(&mp->m_agirotor) - 1) %
- mp->m_maxagi;
- else {
- start_agno = XFS_INO_TO_AGNO(mp, parent);
- if (start_agno >= mp->m_maxagi)
- start_agno = 0;
- }
+ start_agno = xfs_dialloc_pick_ag(mp, args->pip, mode);
/*
* If we have already hit the ceiling of inode blocks then clear
* [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (6 preceding siblings ...)
2024-08-23 0:03 ` [PATCH 07/26] xfs: disable the agi rotor for metadata inodes Darrick J. Wong
@ 2024-08-23 0:04 ` Darrick J. Wong
2024-08-23 4:40 ` Christoph Hellwig
2024-08-26 0:41 ` Dave Chinner
2024-08-23 0:04 ` [PATCH 09/26] xfs: advertise metadata directory feature Darrick J. Wong
` (17 subsequent siblings)
25 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:04 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Metadata inodes are private files and therefore cannot be exposed to
userspace. This means no bulkstat, no open-by-handle, no linking them
into the directory tree, and no feeding them to LSMs. As such, we mark
them S_PRIVATE, which stops all that.
While we're at it, put them in a separate lockdep class so that lockdep
won't get confused by "recursive" i_rwsem locking, such as what happens
when we write to an rt file and need to allocate from the rt bitmap
file. The static function that we use to do this will be exported in
the rtgroups patchset.
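The hiding step itself is just two flag operations on the VFS inode. A minimal userspace model (the MODEL_* bit values are made up for illustration; the real S_PRIVATE and IOP_XATTR live in include/linux/fs.h):

```c
#include <assert.h>

/* Illustrative flag values only, not the kernel's actual bits. */
#define MODEL_S_PRIVATE		(1U << 0)
#define MODEL_IOP_XATTR		(1U << 1)

struct model_inode {
	unsigned int i_flags;
	unsigned int i_opflags;
};

/*
 * Mirror of the setup step in the patch: mark a metadata file private
 * and strip xattr support so LSMs, the ACL code, and handle-based
 * interfaces all skip it.
 */
static void hide_metadata_inode(struct model_inode *inode)
{
	inode->i_flags |= MODEL_S_PRIVATE;
	inode->i_opflags &= ~MODEL_IOP_XATTR;
}
```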
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/tempfile.c | 8 ++++++++
fs/xfs/xfs_iops.c | 15 ++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index 177f922acfaf1..3c5a1d77fefae 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -844,6 +844,14 @@ xrep_is_tempfile(
const struct xfs_inode *ip)
{
const struct inode *inode = &ip->i_vnode;
+ struct xfs_mount *mp = ip->i_mount;
+
+ /*
+ * Files in the metadata directory tree also have S_PRIVATE set and
+ * IOP_XATTR unset, so we must check for them explicitly.
+ */
+ if (xfs_has_metadir(mp) && (ip->i_diflags2 & XFS_DIFLAG2_METADATA))
+ return false;
if (IS_PRIVATE(inode) && !(inode->i_opflags & IOP_XATTR))
return true;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1cdc8034f54d9..c1686163299a0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -42,7 +42,9 @@
* held. For regular files, the lock order is the other way around - the
* mmap_lock is taken during the page fault, and then we lock the ilock to do
* block mapping. Hence we need a different class for the directory ilock so
- * that lockdep can tell them apart.
+ * that lockdep can tell them apart. Directories in the metadata directory
+ * tree get a separate class so that lockdep reports will warn us if someone
+ * ever tries to lock regular directories after locking metadata directories.
*/
static struct lock_class_key xfs_nondir_ilock_class;
static struct lock_class_key xfs_dir_ilock_class;
@@ -1299,6 +1301,7 @@ xfs_setup_inode(
{
struct inode *inode = &ip->i_vnode;
gfp_t gfp_mask;
+ bool is_meta = xfs_is_metadata_inode(ip);
inode->i_ino = ip->i_ino;
inode->i_state |= I_NEW;
@@ -1310,6 +1313,16 @@ xfs_setup_inode(
i_size_write(inode, ip->i_disk_size);
xfs_diflags_to_iflags(ip, true);
+ /*
+ * Mark our metadata files as private so that LSMs and the ACL code
+ * don't try to add their own metadata or reason about these files,
+ * and users cannot ever obtain file handles to them.
+ */
+ if (is_meta) {
+ inode->i_flags |= S_PRIVATE;
+ inode->i_opflags &= ~IOP_XATTR;
+ }
+
if (S_ISDIR(inode->i_mode)) {
/*
* We set the i_rwsem class here to avoid potential races with
* [PATCH 09/26] xfs: advertise metadata directory feature
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (7 preceding siblings ...)
2024-08-23 0:04 ` [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special Darrick J. Wong
@ 2024-08-23 0:04 ` Darrick J. Wong
2024-08-23 4:40 ` Christoph Hellwig
2024-08-23 0:04 ` [PATCH 10/26] xfs: allow bulkstat to return metadata directories Darrick J. Wong
` (16 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:04 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Advertise the existence of the metadata directory feature; this will be
used by scrub to decide if it needs to scan the metadir too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 2 ++
fs/xfs/libxfs/xfs_sb.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index c85c8077fac39..aba7fb0389bab 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -242,6 +242,8 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */
#define XFS_FSOP_GEOM_FLAGS_PARENT (1 << 25) /* linux parent pointers */
+#define XFS_FSOP_GEOM_FLAGS_METADIR (1U << 30) /* metadata directories */
+
/*
* Minimum and maximum sizes need for growth checks.
*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 7afde477c0a79..1dcbf8ade39f8 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1279,6 +1279,8 @@ xfs_fs_geometry(
geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
if (xfs_has_exchange_range(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE;
+ if (xfs_has_metadir(mp))
+ geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
geo->rtsectsize = sbp->sb_blocksize;
geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
* [PATCH 10/26] xfs: allow bulkstat to return metadata directories
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:04 ` [PATCH 09/26] xfs: advertise metadata directory feature Darrick J. Wong
@ 2024-08-23 0:04 ` Darrick J. Wong
2024-08-23 4:41 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 11/26] xfs: don't count metadata directory files to quota Darrick J. Wong
` (15 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:04 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Allow the V5 bulkstat ioctl to return information about metadata
directory files so that xfs_scrub can find and scrub them, since they
are otherwise ordinary directories.
(Metadata files of course require per-file scrub code and hence do not
need exposure.)
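The flag-validation asymmetry introduced here (bulkstat accepts the new flag, INUMBERS rejects it) can be modeled in a few lines. The MODEL_* constants below copy the bit positions from the hunks but are otherwise stand-ins:

```c
#include <assert.h>
#include <errno.h>

/* Bit positions taken from the patch; names are illustrative. */
#define MODEL_IREQ_AGNO		(1U << 0)
#define MODEL_IREQ_SPECIAL	(1U << 1)
#define MODEL_IREQ_NREXT64	(1U << 2)
#define MODEL_IREQ_METADIR	(1U << 31)
#define MODEL_IREQ_FLAGS_ALL	(MODEL_IREQ_AGNO | MODEL_IREQ_SPECIAL | \
				 MODEL_IREQ_NREXT64 | MODEL_IREQ_METADIR)

/*
 * Sketch of the INUMBERS path: unknown flags are rejected by the
 * common setup code, and METADIR is additionally rejected up front
 * because inumbers has no per-inode filtering to apply it to.
 */
static int check_inumbers_flags(unsigned int flags)
{
	if (flags & ~MODEL_IREQ_FLAGS_ALL)
		return -EINVAL;
	if (flags & MODEL_IREQ_METADIR)
		return -EINVAL;
	return 0;
}
```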
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 10 +++++++++-
fs/xfs/xfs_ioctl.c | 7 +++++++
fs/xfs/xfs_itable.c | 33 +++++++++++++++++++++++++++++----
fs/xfs/xfs_itable.h | 3 +++
4 files changed, 48 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index aba7fb0389bab..cb7563d330d0f 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -490,9 +490,17 @@ struct xfs_bulk_ireq {
*/
#define XFS_BULK_IREQ_NREXT64 (1U << 2)
+/*
+ * Allow bulkstat to return information about metadata directories. This
+ * enables xfs_scrub to find them for scanning, as they are otherwise ordinary
+ * directories.
+ */
+#define XFS_BULK_IREQ_METADIR (1U << 31)
+
#define XFS_BULK_IREQ_FLAGS_ALL (XFS_BULK_IREQ_AGNO | \
XFS_BULK_IREQ_SPECIAL | \
- XFS_BULK_IREQ_NREXT64)
+ XFS_BULK_IREQ_NREXT64 | \
+ XFS_BULK_IREQ_METADIR)
/* Operate on the root directory inode. */
#define XFS_BULK_IREQ_SPECIAL_ROOT (1)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 90b3ee21e7fe6..b53af3e674912 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -233,6 +233,10 @@ xfs_bulk_ireq_setup(
if (hdr->flags & XFS_BULK_IREQ_NREXT64)
breq->flags |= XFS_IBULK_NREXT64;
+ /* Caller wants to see metadata directories in bulkstat output. */
+ if (hdr->flags & XFS_BULK_IREQ_METADIR)
+ breq->flags |= XFS_IBULK_METADIR;
+
return 0;
}
@@ -323,6 +327,9 @@ xfs_ioc_inumbers(
if (copy_from_user(&hdr, &arg->hdr, sizeof(hdr)))
return -EFAULT;
+ if (hdr.flags & XFS_BULK_IREQ_METADIR)
+ return -EINVAL;
+
error = xfs_bulk_ireq_setup(mp, &hdr, &breq, arg->inumbers);
if (error == -ECANCELED)
goto out_teardown;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index c0757ab994957..198d52e9f81f6 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -36,6 +36,14 @@ struct xfs_bstat_chunk {
struct xfs_bulkstat *buf;
};
+static inline bool
+want_metadir_file(
+ struct xfs_inode *ip,
+ struct xfs_ibulk *breq)
+{
+ return xfs_is_metadir_inode(ip) && (breq->flags & XFS_IBULK_METADIR);
+}
+
/*
* Fill out the bulkstat info for a single inode and report it somewhere.
*
@@ -69,9 +77,6 @@ xfs_bulkstat_one_int(
vfsuid_t vfsuid;
vfsgid_t vfsgid;
- if (xfs_internal_inum(mp, ino))
- goto out_advance;
-
error = xfs_iget(mp, tp, ino,
(XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
XFS_ILOCK_SHARED, &ip);
@@ -97,8 +102,28 @@ xfs_bulkstat_one_int(
vfsuid = i_uid_into_vfsuid(idmap, inode);
vfsgid = i_gid_into_vfsgid(idmap, inode);
+ /*
+ * If caller wants files from the metadata directories, push out the
+ * bare minimum information for enabling scrub.
+ */
+ if (want_metadir_file(ip, bc->breq)) {
+ memset(buf, 0, sizeof(*buf));
+ buf->bs_ino = ino;
+ buf->bs_gen = inode->i_generation;
+ buf->bs_mode = inode->i_mode & S_IFMT;
+ xfs_bulkstat_health(ip, buf);
+ buf->bs_version = XFS_BULKSTAT_VERSION_V5;
+ xfs_iunlock(ip, XFS_ILOCK_SHARED);
+ xfs_irele(ip);
+
+ error = bc->formatter(bc->breq, buf);
+ if (!error || error == -ECANCELED)
+ goto out_advance;
+ goto out;
+ }
+
/* If this is a private inode, don't leak its details to userspace. */
- if (IS_PRIVATE(inode)) {
+ if (IS_PRIVATE(inode) || xfs_internal_inum(mp, ino)) {
xfs_iunlock(ip, XFS_ILOCK_SHARED);
xfs_irele(ip);
error = -EINVAL;
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 1659f13f17a89..f10e8f8f23351 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -22,6 +22,9 @@ struct xfs_ibulk {
/* Fill out the bs_extents64 field if set. */
#define XFS_IBULK_NREXT64 (1U << 1)
+/* Signal that we can return metadata directories. */
+#define XFS_IBULK_METADIR (1U << 2)
+
/*
* Advance the user buffer pointer by one record of the given size. If the
* buffer is now full, return the appropriate error code.
* [PATCH 11/26] xfs: don't count metadata directory files to quota
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (9 preceding siblings ...)
2024-08-23 0:04 ` [PATCH 10/26] xfs: allow bulkstat to return metadata directories Darrick J. Wong
@ 2024-08-23 0:05 ` Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-26 0:47 ` Dave Chinner
2024-08-23 0:05 ` [PATCH 12/26] xfs: mark quota inodes as metadata files Darrick J. Wong
` (14 subsequent siblings)
25 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:05 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Files in the metadata directory tree are internal to the filesystem.
Don't count the inodes or the blocks they use in the root dquot because
users do not need to know about their resource usage. This will also
quiet down complaints about dquot usage not matching du output.
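The invariant the new asserts enforce, and the early-out in the reservation path, can be sketched in userspace. The struct below is a toy stand-in for xfs_inode, keeping only the three dquot pointers the new XFS_IS_DQDETACHED() macro examines:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_inode {
	void	*i_udquot;	/* user dquot */
	void	*i_gdquot;	/* group dquot */
	void	*i_pdquot;	/* project dquot */
	bool	is_metadir;	/* file lives in the metadata dir tree */
};

/* Model of XFS_IS_DQDETACHED(): no dquot of any type is attached. */
static bool dqdetached(const struct model_inode *ip)
{
	return ip->i_udquot == NULL && ip->i_gdquot == NULL &&
	       ip->i_pdquot == NULL;
}

/*
 * Model of the xfs_trans_reserve_quota_nblks() early return: metadir
 * files never take quota reservations, so they must stay detached.
 */
static bool quota_reservation_allowed(const struct model_inode *ip)
{
	return !ip->is_metadir;
}
```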
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_dquot.c | 1 +
fs/xfs/xfs_qm.c | 11 +++++++++++
fs/xfs/xfs_quota.h | 5 +++++
fs/xfs/xfs_trans_dquot.c | 6 ++++++
4 files changed, 23 insertions(+)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index c1b211c260a9d..3bf47458c517a 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -983,6 +983,7 @@ xfs_qm_dqget_inode(
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
ASSERT(xfs_inode_dquot(ip, type) == NULL);
+ ASSERT(!xfs_is_metadir_inode(ip));
id = xfs_qm_id_for_quotatype(ip, type);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index d0674d84af3ec..ec983cca9adae 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -304,6 +304,8 @@ xfs_qm_need_dqattach(
return false;
if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
return false;
+ if (xfs_is_metadir_inode(ip))
+ return false;
return true;
}
@@ -326,6 +328,7 @@ xfs_qm_dqattach_locked(
return 0;
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
+ ASSERT(!xfs_is_metadir_inode(ip));
if (XFS_IS_UQUOTA_ON(mp) && !ip->i_udquot) {
error = xfs_qm_dqattach_one(ip, XFS_DQTYPE_USER,
@@ -1204,6 +1207,10 @@ xfs_qm_dqusage_adjust(
}
}
+ /* Metadata directory files are not accounted to user-visible quotas. */
+ if (xfs_is_metadir_inode(ip))
+ goto error0;
+
ASSERT(ip->i_delayed_blks == 0);
if (XFS_IS_REALTIME_INODE(ip)) {
@@ -1754,6 +1761,8 @@ xfs_qm_vop_dqalloc(
if (!XFS_IS_QUOTA_ON(mp))
return 0;
+ ASSERT(!xfs_is_metadir_inode(ip));
+
lockflags = XFS_ILOCK_EXCL;
xfs_ilock(ip, lockflags);
@@ -1883,6 +1892,7 @@ xfs_qm_vop_chown(
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
ASSERT(XFS_IS_QUOTA_ON(ip->i_mount));
+ ASSERT(!xfs_is_metadir_inode(ip));
/* old dquot */
prevdq = *IO_olddq;
@@ -1970,6 +1980,7 @@ xfs_qm_vop_create_dqattach(
return;
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
+ ASSERT(!xfs_is_metadir_inode(ip));
if (udqp && XFS_IS_UQUOTA_ON(mp)) {
ASSERT(ip->i_udquot == NULL);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 23d71a55bbc00..645761997bf2d 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -29,6 +29,11 @@ struct xfs_buf;
(XFS_IS_GQUOTA_ON(mp) && (ip)->i_gdquot == NULL) || \
(XFS_IS_PQUOTA_ON(mp) && (ip)->i_pdquot == NULL))
+#define XFS_IS_DQDETACHED(ip) \
+ ((ip)->i_udquot == NULL && \
+ (ip)->i_gdquot == NULL && \
+ (ip)->i_pdquot == NULL)
+
#define XFS_QM_NEED_QUOTACHECK(mp) \
((XFS_IS_UQUOTA_ON(mp) && \
(mp->m_sb.sb_qflags & XFS_UQUOTA_CHKD) == 0) || \
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index b368e13424c4f..ca7df018290e0 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -156,6 +156,8 @@ xfs_trans_mod_ino_dquot(
unsigned int field,
int64_t delta)
{
+ ASSERT(!xfs_is_metadir_inode(ip) || XFS_IS_DQDETACHED(ip));
+
xfs_trans_mod_dquot(tp, dqp, field, delta);
if (xfs_hooks_switched_on(&xfs_dqtrx_hooks_switch)) {
@@ -247,6 +249,8 @@ xfs_trans_mod_dquot_byino(
xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
return;
+ ASSERT(!xfs_is_metadir_inode(ip) || XFS_IS_DQDETACHED(ip));
+
if (XFS_IS_UQUOTA_ON(mp) && ip->i_udquot)
xfs_trans_mod_ino_dquot(tp, ip, ip->i_udquot, field, delta);
if (XFS_IS_GQUOTA_ON(mp) && ip->i_gdquot)
@@ -962,6 +966,8 @@ xfs_trans_reserve_quota_nblks(
if (!XFS_IS_QUOTA_ON(mp))
return 0;
+ if (xfs_is_metadir_inode(ip))
+ return 0;
ASSERT(!xfs_is_quota_inode(&mp->m_sb, ip->i_ino));
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
* [PATCH 12/26] xfs: mark quota inodes as metadata files
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (10 preceding siblings ...)
2024-08-23 0:05 ` [PATCH 11/26] xfs: don't count metadata directory files to quota Darrick J. Wong
@ 2024-08-23 0:05 ` Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir Darrick J. Wong
` (13 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:05 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
When we're creating quota files at mount time, make sure to mark them as
metadir inodes if appropriate.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_qm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index ec983cca9adae..b94d6f192e725 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -824,6 +824,8 @@ xfs_qm_qino_alloc(
xfs_trans_cancel(tp);
return error;
}
+ if (xfs_has_metadir(mp))
+ xfs_metafile_set_iflag(tp, *ipp, metafile_type);
}
/*
* [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (11 preceding siblings ...)
2024-08-23 0:05 ` [PATCH 12/26] xfs: mark quota inodes as metadata files Darrick J. Wong
@ 2024-08-23 0:05 ` Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 14/26] xfs: record health problems with the metadata directory Darrick J. Wong
` (12 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:05 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Online repair might use xfs_bmap_add_attrfork to repair a file in the
metadata directory tree if (say) the metadata file lacks the correct
parent pointers. In that case, it is not correct to check that the file
is dqattached -- metadata files must not have /any/ dquots attached at
all.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_attr.c | 5 ++++-
fs/xfs/libxfs/xfs_bmap.c | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index f30bcc64100d5..4b7202e91b0ff 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -953,7 +953,10 @@ xfs_attr_add_fork(
unsigned int blks; /* space reservation */
int error; /* error return value */
- ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
+ if (xfs_is_metadir_inode(ip))
+ ASSERT(XFS_IS_DQDETACHED(ip));
+ else
+ ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
blks = XFS_ADDAFORK_SPACE_RES(mp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 7df74c35d9f90..b79803784b766 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1042,7 +1042,10 @@ xfs_bmap_add_attrfork(
int error; /* error return value */
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
- ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
+ if (xfs_is_metadir_inode(ip))
+ ASSERT(XFS_IS_DQDETACHED(ip));
+ else
+ ASSERT(!XFS_NOT_DQATTACHED(mp, ip));
ASSERT(!xfs_inode_has_attr_fork(ip));
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
* [PATCH 14/26] xfs: record health problems with the metadata directory
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (12 preceding siblings ...)
2024-08-23 0:05 ` [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir Darrick J. Wong
@ 2024-08-23 0:05 ` Darrick J. Wong
2024-08-23 4:43 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 15/26] xfs: refactor directory tree root predicates Darrick J. Wong
` (11 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:05 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Make a report to the health monitoring subsystem any time we encounter
something in the metadata directory tree that looks like corruption.
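The reporting flow, from the in-kernel sick flag to the geometry ioctl's namespace, can be modeled as follows. The bit values are copied from the hunks below; the struct and function names are stand-ins for xfs_mount health state and the fs_map[] translation in xfs_health.c:

```c
#include <assert.h>

/* Bit positions from the patch; everything else here is illustrative. */
#define SICK_FS_METADIR		(1U << 6)	/* XFS_SICK_FS_METADIR */
#define GEOM_SICK_METADIR	(1U << 8)	/* XFS_FSOP_GEOM_SICK_METADIR */

struct model_mount {
	unsigned int fs_sick;	/* observed per-fs health problems */
};

/* Model of xfs_fs_mark_sick(): latch the problem in the mount. */
static void mark_sick(struct model_mount *mp, unsigned int mask)
{
	mp->fs_sick |= mask;
}

/*
 * Model of the fs_map[] translation: rewrite in-kernel sick flags into
 * the bit namespace that the geometry ioctl exposes to userspace.
 */
static unsigned int report_geom_sick(const struct model_mount *mp)
{
	unsigned int out = 0;

	if (mp->fs_sick & SICK_FS_METADIR)
		out |= GEOM_SICK_METADIR;
	return out;
}
```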
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 1 +
fs/xfs/libxfs/xfs_health.h | 4 +++-
fs/xfs/libxfs/xfs_metadir.c | 13 ++++++++++---
fs/xfs/xfs_health.c | 1 +
fs/xfs/xfs_icache.c | 1 +
fs/xfs/xfs_inode.c | 1 +
6 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index cb7563d330d0f..6f5aebaf47ac8 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -197,6 +197,7 @@ struct xfs_fsop_geom {
#define XFS_FSOP_GEOM_SICK_RT_SUMMARY (1 << 5) /* realtime summary */
#define XFS_FSOP_GEOM_SICK_QUOTACHECK (1 << 6) /* quota counts */
#define XFS_FSOP_GEOM_SICK_NLINKS (1 << 7) /* inode link counts */
+#define XFS_FSOP_GEOM_SICK_METADIR (1 << 8) /* metadata directory */
/* Output for XFS_FS_COUNTS */
typedef struct xfs_fsop_counts {
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index b0edb4288e592..0ded0cd93ce63 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -60,6 +60,7 @@ struct xfs_da_args;
#define XFS_SICK_FS_PQUOTA (1 << 3) /* project quota */
#define XFS_SICK_FS_QUOTACHECK (1 << 4) /* quota counts */
#define XFS_SICK_FS_NLINKS (1 << 5) /* inode link counts */
+#define XFS_SICK_FS_METADIR (1 << 6) /* metadata directory tree */
/* Observable health issues for realtime volume metadata. */
#define XFS_SICK_RT_BITMAP (1 << 0) /* realtime bitmap */
@@ -103,7 +104,8 @@ struct xfs_da_args;
XFS_SICK_FS_GQUOTA | \
XFS_SICK_FS_PQUOTA | \
XFS_SICK_FS_QUOTACHECK | \
- XFS_SICK_FS_NLINKS)
+ XFS_SICK_FS_NLINKS | \
+ XFS_SICK_FS_METADIR)
#define XFS_SICK_RT_PRIMARY (XFS_SICK_RT_BITMAP | \
XFS_SICK_RT_SUMMARY)
diff --git a/fs/xfs/libxfs/xfs_metadir.c b/fs/xfs/libxfs/xfs_metadir.c
index 0a61316b4f520..bae7377c0f228 100644
--- a/fs/xfs/libxfs/xfs_metadir.c
+++ b/fs/xfs/libxfs/xfs_metadir.c
@@ -28,6 +28,7 @@
#include "xfs_dir2.h"
#include "xfs_dir2_priv.h"
#include "xfs_parent.h"
+#include "xfs_health.h"
/*
* Metadata Directory Tree
@@ -94,8 +95,10 @@ xfs_metadir_lookup(
};
int error;
- if (!S_ISDIR(VFS_I(dp)->i_mode))
+ if (!S_ISDIR(VFS_I(dp)->i_mode)) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
if (xfs_is_shutdown(mp))
return -EIO;
@@ -103,10 +106,14 @@ xfs_metadir_lookup(
if (error)
return error;
- if (!xfs_verify_ino(mp, args.inumber))
+ if (!xfs_verify_ino(mp, args.inumber)) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
- if (xname->type != XFS_DIR3_FT_UNKNOWN && xname->type != args.filetype)
+ }
+ if (xname->type != XFS_DIR3_FT_UNKNOWN && xname->type != args.filetype) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
trace_xfs_metadir_lookup(dp, xname, args.inumber);
*ino = args.inumber;
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 10f116d093a22..d5367fd2d0615 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -376,6 +376,7 @@ static const struct ioctl_sick_map fs_map[] = {
{ XFS_SICK_FS_PQUOTA, XFS_FSOP_GEOM_SICK_PQUOTA },
{ XFS_SICK_FS_QUOTACHECK, XFS_FSOP_GEOM_SICK_QUOTACHECK },
{ XFS_SICK_FS_NLINKS, XFS_FSOP_GEOM_SICK_NLINKS },
+ { XFS_SICK_FS_METADIR, XFS_FSOP_GEOM_SICK_METADIR },
{ 0, 0 },
};
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index c5ccfbbb98c5c..321efbf6656fa 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -858,6 +858,7 @@ xfs_trans_metafile_iget(
whine:
xfs_err(mp, "metadata inode 0x%llx type %u is corrupt", ino,
metafile_type);
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 35acb73665fdd..fff3037e67574 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -560,6 +560,7 @@ xfs_lookup(
* a metadata file.
*/
if (XFS_IS_CORRUPT(dp->i_mount, xfs_is_metadir_inode(*ipp))) {
+ xfs_fs_mark_sick(dp->i_mount, XFS_SICK_FS_METADIR);
error = -EFSCORRUPTED;
goto out_irele;
}
* [PATCH 15/26] xfs: refactor directory tree root predicates
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (13 preceding siblings ...)
2024-08-23 0:05 ` [PATCH 14/26] xfs: record health problems with the metadata directory Darrick J. Wong
@ 2024-08-23 0:06 ` Darrick J. Wong
2024-08-23 4:48 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck Darrick J. Wong
` (10 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:06 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Metadata directory trees make reasoning about the parent of a file more
difficult. Traditionally, user files are children of sb_rootino, and
metadata files are "children" of the superblock. Now, we add a third
possibility -- some metadata files can be children of sb_metadirino, but
the classic ones (rt free space data and quotas) are left alone.
Let's add some helper functions (instead of open-coding the logic
everywhere) to make scrub logic easier to understand.
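The three-way distinction the helpers draw can be sketched with toy inode numbers. The enum values below are arbitrary stand-ins for sb_rootino, sb_metadirino, and a classic superblock-pointed metadata inode; only the predicate structure mirrors the new xchk_inode_is_dirtree_root()/xchk_inode_is_sb_rooted() helpers:

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary stand-in inode numbers for this model. */
enum {
	ROOT_INO	= 128,	/* sb_rootino */
	METADIR_INO	= 129,	/* sb_metadirino */
	RTBITMAP_INO	= 130,	/* classic sb-rooted metadata inode */
	USER_INO	= 1024,	/* an ordinary user file */
};

/* Model of xchk_inode_is_dirtree_root(): root of either tree. */
static bool is_dirtree_root(unsigned long ino)
{
	return ino == ROOT_INO || ino == METADIR_INO;
}

/* Stand-in for xfs_internal_inum(): classic sb-pointed metadata. */
static bool is_internal_inum(unsigned long ino)
{
	return ino == RTBITMAP_INO;
}

/* Model of xchk_inode_is_sb_rooted(): does the sb point down here? */
static bool is_sb_rooted(unsigned long ino)
{
	return is_dirtree_root(ino) || is_internal_inum(ino);
}
```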
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/common.c | 29 +++++++++++++++++++++++++++++
fs/xfs/scrub/common.h | 4 ++++
fs/xfs/scrub/dir.c | 2 +-
fs/xfs/scrub/dir_repair.c | 2 +-
fs/xfs/scrub/dirtree.c | 15 ++++++++++++++-
fs/xfs/scrub/dirtree.h | 12 +-----------
fs/xfs/scrub/findparent.c | 15 +++++++++------
fs/xfs/scrub/inode_repair.c | 11 ++---------
fs/xfs/scrub/nlinks.c | 4 ++--
fs/xfs/scrub/nlinks_repair.c | 4 +---
fs/xfs/scrub/orphanage.c | 4 +++-
fs/xfs/scrub/parent.c | 17 ++++++++---------
fs/xfs/scrub/parent_repair.c | 2 +-
13 files changed, 76 insertions(+), 45 deletions(-)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index f64271ccb786c..72cec56f52eb1 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1452,3 +1452,32 @@ xchk_inode_is_allocated(
rcu_read_unlock();
return error;
}
+
+/* Is this inode a root directory for either tree? */
+bool
+xchk_inode_is_dirtree_root(const struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ return ip == mp->m_rootip ||
+ (xfs_has_metadir(mp) && ip == mp->m_metadirip);
+}
+
+/* Does the superblock point down to this inode? */
+bool
+xchk_inode_is_sb_rooted(const struct xfs_inode *ip)
+{
+ return xchk_inode_is_dirtree_root(ip) ||
+ xfs_internal_inum(ip->i_mount, ip->i_ino);
+}
+
+/* What is the root directory inumber for this inode? */
+xfs_ino_t
+xchk_inode_rootdir_inum(const struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (xfs_is_metadir_inode(ip))
+ return mp->m_metadirip->i_ino;
+ return mp->m_rootip->i_ino;
+}
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 3d5f1f6b4b7bf..4d713e2a463cd 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -252,4 +252,8 @@ void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks);
int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
bool *inuse);
+bool xchk_inode_is_dirtree_root(const struct xfs_inode *ip);
+bool xchk_inode_is_sb_rooted(const struct xfs_inode *ip);
+xfs_ino_t xchk_inode_rootdir_inum(const struct xfs_inode *ip);
+
#endif /* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index bf9199e8df633..6b719c8885ef7 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -253,7 +253,7 @@ xchk_dir_actor(
* If this is ".." in the root inode, check that the inum
* matches this dir.
*/
- if (dp->i_ino == mp->m_sb.sb_rootino && ino != dp->i_ino)
+ if (xchk_inode_is_dirtree_root(dp) && ino != dp->i_ino)
xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
}
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index 64679fe084465..0c2cd42b3110f 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -1270,7 +1270,7 @@ xrep_dir_scan_dirtree(
int error;
/* Roots of directory trees are their own parents. */
- if (sc->ip == sc->mp->m_rootip)
+ if (xchk_inode_is_dirtree_root(sc->ip))
xrep_findparent_scan_found(&rd->pscan, sc->ip->i_ino);
/*
diff --git a/fs/xfs/scrub/dirtree.c b/fs/xfs/scrub/dirtree.c
index bde58fb561ea1..e43840733de94 100644
--- a/fs/xfs/scrub/dirtree.c
+++ b/fs/xfs/scrub/dirtree.c
@@ -917,7 +917,7 @@ xchk_dirtree(
* scan, because the hook doesn't detach until after sc->ip gets
* released during teardown.
*/
- dl->root_ino = sc->mp->m_rootip->i_ino;
+ dl->root_ino = xchk_inode_rootdir_inum(sc->ip);
dl->scan_ino = sc->ip->i_ino;
trace_xchk_dirtree_start(sc->ip, sc->sm, 0);
@@ -983,3 +983,16 @@ xchk_dirtree(
trace_xchk_dirtree_done(sc->ip, sc->sm, error);
return error;
}
+
+/* Does the directory targeted by this scrub have no parents? */
+bool
+xchk_dirtree_parentless(const struct xchk_dirtree *dl)
+{
+ struct xfs_scrub *sc = dl->sc;
+
+ if (xchk_inode_is_dirtree_root(sc->ip))
+ return true;
+ if (VFS_I(sc->ip)->i_nlink == 0)
+ return true;
+ return false;
+}
diff --git a/fs/xfs/scrub/dirtree.h b/fs/xfs/scrub/dirtree.h
index 1e1686365c61c..9e5d95492717d 100644
--- a/fs/xfs/scrub/dirtree.h
+++ b/fs/xfs/scrub/dirtree.h
@@ -156,17 +156,7 @@ struct xchk_dirtree {
#define xchk_dirtree_for_each_path(dl, path) \
list_for_each_entry((path), &(dl)->path_list, list)
-static inline bool
-xchk_dirtree_parentless(const struct xchk_dirtree *dl)
-{
- struct xfs_scrub *sc = dl->sc;
-
- if (sc->ip == sc->mp->m_rootip)
- return true;
- if (VFS_I(sc->ip)->i_nlink == 0)
- return true;
- return false;
-}
+bool xchk_dirtree_parentless(const struct xchk_dirtree *dl);
int xchk_dirtree_find_paths_to_root(struct xchk_dirtree *dl);
int xchk_dirpath_append(struct xchk_dirtree *dl, struct xfs_inode *ip,
diff --git a/fs/xfs/scrub/findparent.c b/fs/xfs/scrub/findparent.c
index 01766041ba2cd..153d185190d8a 100644
--- a/fs/xfs/scrub/findparent.c
+++ b/fs/xfs/scrub/findparent.c
@@ -362,15 +362,18 @@ xrep_findparent_confirm(
};
int error;
- /*
- * The root directory always points to itself. Unlinked dirs can point
- * anywhere, so we point them at the root dir too.
- */
- if (sc->ip == sc->mp->m_rootip || VFS_I(sc->ip)->i_nlink == 0) {
+ /* The root directory always points to itself. */
+ if (sc->ip == sc->mp->m_rootip) {
*parent_ino = sc->mp->m_sb.sb_rootino;
return 0;
}
+ /* Unlinked dirs can point anywhere; point them up to the root dir. */
+ if (VFS_I(sc->ip)->i_nlink == 0) {
+ *parent_ino = xchk_inode_rootdir_inum(sc->ip);
+ return 0;
+ }
+
/* Reject garbage parent inode numbers and self-referential parents. */
if (*parent_ino == NULLFSINO)
return 0;
@@ -413,7 +416,7 @@ xrep_findparent_self_reference(
return sc->mp->m_sb.sb_rootino;
if (VFS_I(sc->ip)->i_nlink == 0)
- return sc->mp->m_sb.sb_rootino;
+ return xchk_inode_rootdir_inum(sc->ip);
return NULLFSINO;
}
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 060ebfb25c7a5..91d0da58443a1 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -1768,15 +1768,8 @@ xrep_inode_pptr(
if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
return 0;
- /* The root directory doesn't have a parent pointer. */
- if (ip == mp->m_rootip)
- return 0;
-
- /*
- * Metadata inodes are rooted in the superblock and do not have any
- * parents.
- */
- if (xfs_is_metadata_inode(ip))
+ /* Children of the superblock do not have parent pointers. */
+ if (xchk_inode_is_sb_rooted(ip))
return 0;
/* Inode already has an attr fork; no further work possible here. */
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c
index 80aee30886c45..4a47d0aabf73b 100644
--- a/fs/xfs/scrub/nlinks.c
+++ b/fs/xfs/scrub/nlinks.c
@@ -279,7 +279,7 @@ xchk_nlinks_collect_dirent(
* determine the backref count.
*/
if (dotdot) {
- if (dp == sc->mp->m_rootip)
+ if (xchk_inode_is_dirtree_root(dp))
error = xchk_nlinks_update_incore(xnc, ino, 1, 0, 0);
else if (!xfs_has_parent(sc->mp))
error = xchk_nlinks_update_incore(xnc, ino, 0, 1, 0);
@@ -735,7 +735,7 @@ xchk_nlinks_compare_inode(
}
}
- if (ip == sc->mp->m_rootip) {
+ if (xchk_inode_is_dirtree_root(ip)) {
/*
* For the root of a directory tree, both the '.' and '..'
* entries should point to the root directory. The dotdot
diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c
index b3e707f47b7b5..4ebdee0954280 100644
--- a/fs/xfs/scrub/nlinks_repair.c
+++ b/fs/xfs/scrub/nlinks_repair.c
@@ -60,11 +60,9 @@ xrep_nlinks_is_orphaned(
unsigned int actual_nlink,
const struct xchk_nlink *obs)
{
- struct xfs_mount *mp = ip->i_mount;
-
if (obs->parents != 0)
return false;
- if (ip == mp->m_rootip || ip == sc->orphanage)
+ if (xchk_inode_is_dirtree_root(ip) || ip == sc->orphanage)
return false;
return actual_nlink != 0;
}
diff --git a/fs/xfs/scrub/orphanage.c b/fs/xfs/scrub/orphanage.c
index 7148d8362db83..11d7b8a62ebe1 100644
--- a/fs/xfs/scrub/orphanage.c
+++ b/fs/xfs/scrub/orphanage.c
@@ -295,7 +295,9 @@ xrep_orphanage_can_adopt(
return false;
if (sc->ip == sc->orphanage)
return false;
- if (xfs_internal_inum(sc->mp, sc->ip->i_ino))
+ if (xchk_inode_is_sb_rooted(sc->ip))
+ return false;
+ if (xfs_is_metadata_inode(sc->ip))
return false;
return true;
}
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 91e7b51ce0680..582536076433a 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -300,7 +300,7 @@ xchk_parent_pptr_and_dotdot(
}
/* Is this the root dir? Then '..' must point to itself. */
- if (sc->ip == sc->mp->m_rootip) {
+ if (xchk_inode_is_dirtree_root(sc->ip)) {
if (sc->ip->i_ino != pp->parent_ino)
xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
return 0;
@@ -711,7 +711,7 @@ xchk_parent_count_pptrs(
}
if (S_ISDIR(VFS_I(sc->ip)->i_mode)) {
- if (sc->ip == sc->mp->m_rootip)
+ if (xchk_inode_is_dirtree_root(sc->ip))
pp->pptrs_found++;
if (VFS_I(sc->ip)->i_nlink == 0 && pp->pptrs_found > 0)
@@ -885,10 +885,9 @@ bool
xchk_pptr_looks_zapped(
struct xfs_inode *ip)
{
- struct xfs_mount *mp = ip->i_mount;
struct inode *inode = VFS_I(ip);
- ASSERT(xfs_has_parent(mp));
+ ASSERT(xfs_has_parent(ip->i_mount));
/*
* Temporary files that cannot be linked into the directory tree do not
@@ -902,15 +901,15 @@ xchk_pptr_looks_zapped(
* of a parent pointer scan is always the empty set. It's safe to scan
* them even if the attr fork was zapped.
*/
- if (ip == mp->m_rootip)
+ if (xchk_inode_is_dirtree_root(ip))
return false;
/*
- * Metadata inodes are all rooted in the superblock and do not have
- * any parents. Hence the attr fork will not be initialized, but
- * there are no parent pointers that might have been zapped.
+ * Metadata inodes that are rooted in the superblock do not have any
+ * parents. Hence the attr fork will not be initialized, but there are
+ * no parent pointers that might have been zapped.
*/
- if (xfs_is_metadata_inode(ip))
+ if (xchk_inode_is_sb_rooted(ip))
return false;
/*
diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c
index 7b42b7f65a0bd..f4e4845b7ec09 100644
--- a/fs/xfs/scrub/parent_repair.c
+++ b/fs/xfs/scrub/parent_repair.c
@@ -1334,7 +1334,7 @@ xrep_parent_rebuild_pptrs(
* so that we can decide if we're moving this file to the orphanage.
* For this purpose, root directories are their own parents.
*/
- if (sc->ip == sc->mp->m_rootip) {
+ if (xchk_inode_is_dirtree_root(sc->ip)) {
xrep_findparent_scan_found(&rp->pscan, sc->ip->i_ino);
} else {
error = xrep_parent_lookup_pptrs(sc, &parent_ino);
^ permalink raw reply related [flat|nested] 271+ messages in thread
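[Not part of the patch: the xchk_inode_is_dirtree_root(), xchk_inode_is_sb_rooted(), and xchk_inode_rootdir_inum() helpers used throughout the hunks above come from the preceding "refactor directory tree root predicates" patch and are not visible in this excerpt. A rough userspace sketch of the intended semantics follows; the struct layout, field names, and the exact sb-rooted rule are assumptions inferred from the call sites here, not the kernel definitions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for xfs_mount/xfs_inode; the fields are assumptions. */
struct toy_mount {
	uint64_t	sb_rootino;	/* root of the user-visible tree */
	uint64_t	sb_metadirino;	/* root of the metadata dir tree */
	bool		has_metadir;	/* metadir feature enabled? */
};

struct toy_inode {
	const struct toy_mount *mp;
	uint64_t	ino;
	bool		is_metadata;	/* any metadata file */
	bool		is_metadir;	/* linked into the metadir tree */
};

/* Is this inode the root of either directory tree? */
static bool toy_is_dirtree_root(const struct toy_inode *ip)
{
	if (ip->ino == ip->mp->sb_rootino)
		return true;
	return ip->mp->has_metadir && ip->ino == ip->mp->sb_metadirino;
}

/* Is this inode a child of the superblock, with no parent directory? */
static bool toy_is_sb_rooted(const struct toy_inode *ip)
{
	if (toy_is_dirtree_root(ip))
		return true;
	/* metadata files not linked into the metadir tree hang off the sb */
	return ip->is_metadata && !ip->is_metadir;
}

/* Which tree root should an unlinked directory be (re)pointed at? */
static uint64_t toy_rootdir_inum(const struct toy_inode *ip)
{
	return ip->is_metadir ? ip->mp->sb_metadirino : ip->mp->sb_rootino;
}
```

With these semantics the xchk_dirtree_parentless() rewrite above reads naturally: both tree roots and unlinked directories legitimately have no parents, and xrep_findparent_confirm() now points unlinked metadir directories at the metadir root instead of the user-visible root.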
* [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (14 preceding siblings ...)
2024-08-23 0:06 ` [PATCH 15/26] xfs: refactor directory tree root predicates Darrick J. Wong
@ 2024-08-23 0:06 ` Darrick J. Wong
2024-08-23 4:48 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork Darrick J. Wong
` (9 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:06 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Previously, we stated that files in the metadata directory tree are not
counted in the dquot information. Fix the online quotacheck code to
reflect this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
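[Not part of the patch: the skip condition added below reduces to a small predicate. A minimal model, with the boolean inputs standing in for xfs_is_metadir_inode() and xfs_is_quota_inode():]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Should online quotacheck count this inode toward its dquots?  Mirrors
 * the test in xqcheck_collect_inode() after this patch: quota files
 * themselves and anything in the metadata directory tree are invisible
 * to quota, so neither needs the ILOCK taken during the scan.
 */
static bool counts_toward_quota(bool is_metadir_inode, bool is_quota_inode)
{
	if (is_metadir_inode)
		return false;
	if (is_quota_inode)
		return false;
	return true;
}
```

The first branch is what this patch adds; the quota-inode branch existed before.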
fs/xfs/scrub/quotacheck.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/scrub/quotacheck.c b/fs/xfs/scrub/quotacheck.c
index c77eb2de8df71..dc4033b91e440 100644
--- a/fs/xfs/scrub/quotacheck.c
+++ b/fs/xfs/scrub/quotacheck.c
@@ -398,10 +398,13 @@ xqcheck_collect_inode(
bool isreg = S_ISREG(VFS_I(ip)->i_mode);
int error = 0;
- if (xfs_is_quota_inode(&tp->t_mountp->m_sb, ip->i_ino)) {
+ if (xfs_is_metadir_inode(ip) ||
+ xfs_is_quota_inode(&tp->t_mountp->m_sb, ip->i_ino)) {
/*
* Quota files are never counted towards quota, so we do not
- * need to take the lock.
+ * need to take the lock. Files do not switch between the
+ * metadata and regular directory trees without a reallocation,
+ * so we do not need to ILOCK them either.
*/
xchk_iscan_mark_visited(&xqc->iscan, ip);
return 0;
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (15 preceding siblings ...)
2024-08-23 0:06 ` [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck Darrick J. Wong
@ 2024-08-23 0:06 ` Darrick J. Wong
2024-08-23 4:49 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled Darrick J. Wong
` (8 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:06 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Fix a minor bug where we fail repairs on metadata files that do not have
attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT.
Fixes: 5a8e07e799721 ("xfs: repair the inode core and forks of a metadata inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/repair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 67478294f11ae..155bbaaa496e4 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -1084,9 +1084,11 @@ xrep_metadata_inode_forks(
return error;
/* Make sure the attr fork looks ok before we delete it. */
- error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA);
- if (error)
- return error;
+ if (xfs_inode_hasattr(sc->ip)) {
+ error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA);
+ if (error)
+ return error;
+ }
/* Clear the reflink flag since metadata never shares. */
if (xfs_is_reflink_inode(sc->ip)) {
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (16 preceding siblings ...)
2024-08-23 0:06 ` [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork Darrick J. Wong
@ 2024-08-23 0:06 ` Darrick J. Wong
2024-08-23 4:50 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files Darrick J. Wong
` (7 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:06 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
If metadata directory trees are enabled, it's possible that some future
metadata file might want to store information in extended attributes.
Or, if parent pointers are enabled, then children of the metadir tree
need parent pointers. Either way, we start allowing xattr data when
metadir is enabled, so the check and repair code must now examine attr
forks for metadata files on metadir filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
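[Not part of the patch: the new control flow in xchk_metadata_inode_forks() boils down to a three-way decision about the attr fork. A sketch of that decision; the enum is illustrative, since the kernel signals these outcomes through the corruption flag and the BMBTA sub-scrub instead:]

```c
#include <assert.h>
#include <stdbool.h>

enum attr_verdict {
	ATTR_NOTHING,	/* no attr fork, nothing to do */
	ATTR_CORRUPT,	/* pre-metadir metadata files must not have xattrs */
	ATTR_SCRUB,	/* metadir fs: run the BMBTA scrubber on the fork */
};

/* Decide what to do about a metadata file's attr fork. */
static enum attr_verdict check_metadata_attr_fork(bool fs_has_metadir,
						  bool inode_has_attr)
{
	if (!inode_has_attr)
		return ATTR_NOTHING;
	if (!fs_has_metadir)
		return ATTR_CORRUPT;
	return ATTR_SCRUB;
}
```

The repair.c hunk applies the same split: on metadir filesystems the fork is repaired like any other, while on non-metadir filesystems it is cleared.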
fs/xfs/scrub/common.c | 21 +++++++++++++++------
fs/xfs/scrub/repair.c | 14 +++++++++++---
2 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 72cec56f52eb1..f3fa9f2770d4a 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -1245,12 +1245,6 @@ xchk_metadata_inode_forks(
return 0;
}
- /* They also should never have extended attributes. */
- if (xfs_inode_hasattr(sc->ip)) {
- xchk_ino_set_corrupt(sc, sc->ip->i_ino);
- return 0;
- }
-
/* Invoke the data fork scrubber. */
error = xchk_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTD);
if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
@@ -1267,6 +1261,21 @@ xchk_metadata_inode_forks(
xchk_ino_set_corrupt(sc, sc->ip->i_ino);
}
+ /*
+ * Metadata files can only have extended attributes on metadir
+ * filesystems, either for parent pointers or for actual xattr data.
+ */
+ if (xfs_inode_hasattr(sc->ip)) {
+ if (!xfs_has_metadir(sc->mp)) {
+ xchk_ino_set_corrupt(sc, sc->ip->i_ino);
+ return 0;
+ }
+
+ error = xchk_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA);
+ if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
+ return error;
+ }
+
return 0;
}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 155bbaaa496e4..01c0e863775d4 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -1083,7 +1083,12 @@ xrep_metadata_inode_forks(
if (error)
return error;
- /* Make sure the attr fork looks ok before we delete it. */
+ /*
+ * Metadata files can only have extended attributes on metadir
+ * filesystems, either for parent pointers or for actual xattr data.
+ * For a non-metadir filesystem, make sure the attr fork looks ok
+ * before we delete it.
+ */
if (xfs_inode_hasattr(sc->ip)) {
error = xrep_metadata_inode_subtype(sc, XFS_SCRUB_TYPE_BMBTA);
if (error)
@@ -1099,8 +1104,11 @@ xrep_metadata_inode_forks(
return error;
}
- /* Clear the attr forks since metadata shouldn't have that. */
- if (xfs_inode_hasattr(sc->ip)) {
+ /*
+ * Metadata files on non-metadir filesystems cannot have attr forks,
+ * so clear them now.
+ */
+ if (xfs_inode_hasattr(sc->ip) && !xfs_has_metadir(sc->mp)) {
if (!dirty) {
dirty = true;
xfs_trans_ijoin(sc->tp, sc->ip, 0);
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (17 preceding siblings ...)
2024-08-23 0:06 ` [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled Darrick J. Wong
@ 2024-08-23 0:07 ` Darrick J. Wong
2024-08-23 4:50 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load Darrick J. Wong
` (6 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:07 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Starting with the metadata directory feature, we're allowed to call the
directory and parent pointer scrubbers for every metadata file,
including the ones that are children of the superblock.
For these children, checking the link count against the number of parent
pointers is a bit funny -- there's no such thing as a parent pointer for
a child of the superblock since there's no corresponding dirent. For
purposes of validating nlink, we pretend that there is a parent pointer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
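[Not part of the patch: for non-directories, the nlink-vs-parent-pointer comparison with the synthetic pointer described above can be modeled as follows. This is a sketch of the xchk_parent_count_pptrs() logic, not the kernel code:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Does the observed parent pointer count match the link count of a
 * non-directory file?  Children of the superblock have no dirent and
 * therefore no real parent pointer, so credit them one synthetic
 * pointer before comparing, as the hunk below does.
 */
static bool nondir_pptrs_match_nlink(bool fs_has_metadir, bool sb_rooted,
				     unsigned int nlink,
				     unsigned int pptrs_found)
{
	if (fs_has_metadir && sb_rooted)
		pptrs_found++;	/* pretend we found a parent pointer attr */
	return pptrs_found == nlink;
}
```

The repair-side hunk does the same accounting with rp->parents before deciding whether the file belongs on the unlinked list.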
fs/xfs/scrub/parent.c | 8 ++++++++
fs/xfs/scrub/parent_repair.c | 35 +++++++++++++++++++++++++++++++----
2 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index 582536076433a..d8ea393f50597 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -720,6 +720,14 @@ xchk_parent_count_pptrs(
pp->pptrs_found == 0)
xchk_ino_set_corrupt(sc, sc->ip->i_ino);
} else {
+ /*
+ * Starting with metadir, we allow checking of parent pointers
+ * of non-directory files that are children of the superblock.
+ * Pretend that we found a parent pointer attr.
+ */
+ if (xfs_has_metadir(sc->mp) && xchk_inode_is_sb_rooted(sc->ip))
+ pp->pptrs_found++;
+
if (VFS_I(sc->ip)->i_nlink != pp->pptrs_found)
xchk_ino_set_corrupt(sc, sc->ip->i_ino);
}
diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c
index f4e4845b7ec09..31bfe10be22a2 100644
--- a/fs/xfs/scrub/parent_repair.c
+++ b/fs/xfs/scrub/parent_repair.c
@@ -1354,21 +1354,40 @@ STATIC int
xrep_parent_rebuild_tree(
struct xrep_parent *rp)
{
+ struct xfs_scrub *sc = rp->sc;
+ bool try_adoption;
int error;
- if (xfs_has_parent(rp->sc->mp)) {
+ if (xfs_has_parent(sc->mp)) {
error = xrep_parent_rebuild_pptrs(rp);
if (error)
return error;
}
- if (rp->pscan.parent_ino == NULLFSINO) {
- if (xrep_orphanage_can_adopt(rp->sc))
+ /*
+ * Any file with no parent could be adopted. This check happens after
+ * rebuilding the parent pointer structure because we might have cycled
+ * the ILOCK during that process.
+ */
+ try_adoption = rp->pscan.parent_ino == NULLFSINO;
+
+ /*
+ * Starting with metadir, we allow checking of parent pointers
+ * of non-directory files that are children of the superblock.
+ * Lack of parent is ok here.
+ */
+ if (try_adoption && xfs_has_metadir(sc->mp) &&
+ xchk_inode_is_sb_rooted(sc->ip))
+ try_adoption = false;
+
+ if (try_adoption) {
+ if (xrep_orphanage_can_adopt(sc))
return xrep_parent_move_to_orphanage(rp);
return -EFSCORRUPTED;
}
- if (S_ISDIR(VFS_I(rp->sc->ip)->i_mode))
+ if (S_ISDIR(VFS_I(sc->ip)->i_mode))
return xrep_parent_reset_dotdot(rp);
return 0;
@@ -1422,6 +1441,14 @@ xrep_parent_set_nondir_nlink(
if (error)
return error;
+ /*
+ * Starting with metadir, we allow checking of parent pointers of
+ * non-directory files that are children of the superblock. Pretend
+ * that we found a parent pointer attr.
+ */
+ if (xfs_has_metadir(sc->mp) && xchk_inode_is_sb_rooted(sc->ip))
+ rp->parents++;
+
if (rp->parents > 0 && xfs_inode_on_unlinked_list(ip)) {
xfs_trans_ijoin(sc->tp, sc->ip, 0);
joined = true;
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (18 preceding siblings ...)
2024-08-23 0:07 ` [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files Darrick J. Wong
@ 2024-08-23 0:07 ` Darrick J. Wong
2024-08-23 4:51 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 21/26] xfs: scrub metadata directories Darrick J. Wong
` (5 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:07 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Make sure that the di_metatype field is at least set plausibly so that
later scrubbers could set the real type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
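[Not part of the patch: the repair-side change below clamps a garbage di_metatype to a placeholder so a later scrubber can set the real type. A sketch; the TOY_METAFILE_* values are illustrative placeholders, not the ondisk encoding:]

```c
#include <assert.h>
#include <stdint.h>

#define TOY_METAFILE_UNKNOWN	0	/* illustrative values only */
#define TOY_METAFILE_MAX	9

/*
 * Fix up di_metatype for an inode with XFS_DIFLAG2_METADATA set, per the
 * xrep_dinode_nlinks() hunk: out-of-range values become a plausible
 * placeholder rather than failing the repair outright.
 */
static uint16_t fix_metatype(uint16_t metatype)
{
	if (metatype >= TOY_METAFILE_MAX)
		return TOY_METAFILE_UNKNOWN;
	return metatype;
}
```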
fs/xfs/scrub/inode.c | 10 +++++++---
fs/xfs/scrub/inode_repair.c | 6 +++++-
2 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 45222552a51cc..07987a6569c43 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -443,9 +443,13 @@ xchk_dinode(
break;
case 2:
case 3:
- if (!(dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)) &&
- dip->di_onlink != 0)
- xchk_ino_set_corrupt(sc, ino);
+ if (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)) {
+ if (be16_to_cpu(dip->di_metatype) >= XFS_METAFILE_MAX)
+ xchk_ino_set_corrupt(sc, ino);
+ } else {
+ if (dip->di_onlink != 0)
+ xchk_ino_set_corrupt(sc, ino);
+ }
if (dip->di_mode == 0 && sc->ip)
xchk_ino_set_corrupt(sc, ino);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 91d0da58443a1..e3f9a91807de7 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -526,8 +526,12 @@ xrep_dinode_nlinks(
return;
}
- if (!(dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)))
+ if (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_METADATA)) {
+ if (be16_to_cpu(dip->di_metatype) >= XFS_METAFILE_MAX)
+ dip->di_metatype = cpu_to_be16(XFS_METAFILE_UNKNOWN);
+ } else {
dip->di_onlink = 0;
+ }
}
/* Fix any conflicting flags that the verifiers complain about. */
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 21/26] xfs: scrub metadata directories
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (19 preceding siblings ...)
2024-08-23 0:07 ` [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load Darrick J. Wong
@ 2024-08-23 0:07 ` Darrick J. Wong
2024-08-23 4:53 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 22/26] xfs: check the metadata directory inumber in superblocks Darrick J. Wong
` (4 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:07 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Teach online scrub about the metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
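[Not part of the patch: the dirtree.c changes below thread a single is_metadir flag through the upward path walk. It is sampled once while sc->ip's ILOCK is still held, then compared against each ancestor after the lock is dropped. A compressed model of that invariant:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Walk a recorded path of ancestors and verify that every step stays in
 * the same directory tree as the scrub target.  Returns true if the path
 * is consistent; the kernel code returns -EFSCORRUPTED on a mismatch.
 */
static bool path_stays_in_one_tree(bool target_is_metadir,
				   const bool *ancestor_is_metadir,
				   int nsteps)
{
	for (int i = 0; i < nsteps; i++)
		if (ancestor_is_metadir[i] != target_is_metadir)
			return false;
	return true;
}
```

Sampling the flag up front matters because a file cannot change trees without a full free/realloc cycle, so the cached value stays valid across the ILOCK cycling that the walk requires.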
fs/xfs/scrub/dir.c | 8 ++++++++
fs/xfs/scrub/dir_repair.c | 6 ++++++
fs/xfs/scrub/dirtree.c | 17 ++++++++++++++---
fs/xfs/scrub/findparent.c | 13 +++++++++++++
fs/xfs/scrub/parent.c | 14 ++++++++++++++
fs/xfs/scrub/trace.h | 1 +
6 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 6b719c8885ef7..c877bde71e628 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -100,6 +100,14 @@ xchk_dir_check_ftype(
if (xfs_mode_to_ftype(VFS_I(ip)->i_mode) != ftype)
xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
+
+ /*
+ * Metadata and regular inodes cannot cross trees. This property
+ * cannot change without a full inode free and realloc cycle, so it's
+ * safe to check this without holding locks.
+ */
+ if (xfs_is_metadir_inode(ip) != xfs_is_metadir_inode(sc->ip))
+ xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
}
/*
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index 0c2cd42b3110f..2456cf1cb7441 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -415,6 +415,12 @@ xrep_dir_salvage_entry(
if (error)
return 0;
+ /* Don't mix metadata and regular directory trees. */
+ if (xfs_is_metadir_inode(ip) != xfs_is_metadir_inode(rd->sc->ip)) {
+ xchk_irele(sc, ip);
+ return 0;
+ }
+
xname.type = xfs_mode_to_ftype(VFS_I(ip)->i_mode);
xchk_irele(sc, ip);
diff --git a/fs/xfs/scrub/dirtree.c b/fs/xfs/scrub/dirtree.c
index e43840733de94..3a9cdf8738b6d 100644
--- a/fs/xfs/scrub/dirtree.c
+++ b/fs/xfs/scrub/dirtree.c
@@ -362,7 +362,8 @@ xchk_dirpath_set_outcome(
STATIC int
xchk_dirpath_step_up(
struct xchk_dirtree *dl,
- struct xchk_dirpath *path)
+ struct xchk_dirpath *path,
+ bool is_metadir)
{
struct xfs_scrub *sc = dl->sc;
struct xfs_inode *dp;
@@ -435,6 +436,14 @@ xchk_dirpath_step_up(
goto out_scanlock;
}
+ /* Parent must be in the same directory tree. */
+ if (is_metadir != xfs_is_metadir_inode(dp)) {
+ trace_xchk_dirpath_crosses_tree(dl->sc, dp, path->path_nr,
+ path->nr_steps, &dl->xname, &dl->pptr_rec);
+ error = -EFSCORRUPTED;
+ goto out_scanlock;
+ }
+
/*
* If the extended attributes look as though they have been zapped by
* the inode record repair code, we cannot scan for parent pointers.
@@ -508,6 +517,7 @@ xchk_dirpath_walk_upwards(
struct xchk_dirpath *path)
{
struct xfs_scrub *sc = dl->sc;
+ bool is_metadir;
int error;
ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL);
@@ -538,6 +548,7 @@ xchk_dirpath_walk_upwards(
* ILOCK state is no longer tracked in the scrub context. Hence we
* must drop @sc->ip's ILOCK during the walk.
*/
+ is_metadir = xfs_is_metadir_inode(sc->ip);
mutex_unlock(&dl->lock);
xchk_iunlock(sc, XFS_ILOCK_EXCL);
@@ -547,7 +558,7 @@ xchk_dirpath_walk_upwards(
* If we see any kind of error here (including corruptions), the parent
* pointer of @sc->ip is corrupt. Stop the whole scan.
*/
- error = xchk_dirpath_step_up(dl, path);
+ error = xchk_dirpath_step_up(dl, path, is_metadir);
if (error) {
xchk_ilock(sc, XFS_ILOCK_EXCL);
mutex_lock(&dl->lock);
@@ -560,7 +571,7 @@ xchk_dirpath_walk_upwards(
* *somewhere* in the path, but we don't need to stop scanning.
*/
while (!error && path->outcome == XCHK_DIRPATH_SCANNING)
- error = xchk_dirpath_step_up(dl, path);
+ error = xchk_dirpath_step_up(dl, path, is_metadir);
/* Retake the locks we had, mark paths, etc. */
xchk_ilock(sc, XFS_ILOCK_EXCL);
diff --git a/fs/xfs/scrub/findparent.c b/fs/xfs/scrub/findparent.c
index 153d185190d8a..84487072b6dd6 100644
--- a/fs/xfs/scrub/findparent.c
+++ b/fs/xfs/scrub/findparent.c
@@ -172,6 +172,10 @@ xrep_findparent_walk_directory(
*/
lock_mode = xfs_ilock_data_map_shared(dp);
+ /* Don't mix metadata and regular directory trees. */
+ if (xfs_is_metadir_inode(dp) != xfs_is_metadir_inode(sc->ip))
+ goto out_unlock;
+
/*
* If this directory is known to be sick, we cannot scan it reliably
* and must abort.
@@ -368,6 +372,12 @@ xrep_findparent_confirm(
return 0;
}
+ /* The metadata root directory always points to itself. */
+ if (sc->ip == sc->mp->m_metadirip) {
+ *parent_ino = sc->mp->m_sb.sb_metadirino;
+ return 0;
+ }
+
/* Unlinked dirs can point anywhere; point them up to the root dir. */
if (VFS_I(sc->ip)->i_nlink == 0) {
*parent_ino = xchk_inode_rootdir_inum(sc->ip);
@@ -415,6 +425,9 @@ xrep_findparent_self_reference(
if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino)
return sc->mp->m_sb.sb_rootino;
+ if (sc->ip->i_ino == sc->mp->m_sb.sb_metadirino)
+ return sc->mp->m_sb.sb_metadirino;
+
if (VFS_I(sc->ip)->i_nlink == 0)
return xchk_inode_rootdir_inum(sc->ip);
diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c
index d8ea393f50597..3b692c4acc1e6 100644
--- a/fs/xfs/scrub/parent.c
+++ b/fs/xfs/scrub/parent.c
@@ -132,6 +132,14 @@ xchk_parent_validate(
return 0;
}
+ /* Is this the metadata root dir? Then '..' must point to itself. */
+ if (sc->ip == mp->m_metadirip) {
+ if (sc->ip->i_ino != mp->m_sb.sb_metadirino ||
+ sc->ip->i_ino != parent_ino)
+ xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+ return 0;
+ }
+
/* '..' must not point to ourselves. */
if (sc->ip->i_ino == parent_ino) {
xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
@@ -185,6 +193,12 @@ xchk_parent_validate(
goto out_unlock;
}
+ /* Metadata and regular inodes cannot cross trees. */
+ if (xfs_is_metadir_inode(dp) != xfs_is_metadir_inode(sc->ip)) {
+ xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
+ goto out_unlock;
+ }
+
/* Look for a directory entry in the parent pointing to the child. */
error = xchk_dir_walk(sc, dp, xchk_parent_actor, &spc);
if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index c886d5d0eb021..f9d37db6fa5d2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1752,6 +1752,7 @@ DEFINE_XCHK_DIRPATH_EVENT(xchk_dirpath_badgen);
DEFINE_XCHK_DIRPATH_EVENT(xchk_dirpath_nondir_parent);
DEFINE_XCHK_DIRPATH_EVENT(xchk_dirpath_unlinked_parent);
DEFINE_XCHK_DIRPATH_EVENT(xchk_dirpath_found_next_step);
+DEFINE_XCHK_DIRPATH_EVENT(xchk_dirpath_crosses_tree);
TRACE_DEFINE_ENUM(XCHK_DIRPATH_SCANNING);
TRACE_DEFINE_ENUM(XCHK_DIRPATH_DELETE);
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 22/26] xfs: check the metadata directory inumber in superblocks
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (20 preceding siblings ...)
2024-08-23 0:07 ` [PATCH 21/26] xfs: scrub metadata directories Darrick J. Wong
@ 2024-08-23 0:07 ` Darrick J. Wong
2024-08-23 4:53 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree Darrick J. Wong
` (3 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:07 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
When metadata directories are enabled, make sure that the secondary
superblocks point to the metadata directory. This isn't strictly
required because the secondaries are only used to recover damaged
filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/agheader.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index f8e5b67128d25..cad997f38a424 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -144,6 +144,11 @@ xchk_superblock(
if (sb->sb_rootino != cpu_to_be64(mp->m_sb.sb_rootino))
xchk_block_set_preen(sc, bp);
+ if (xfs_has_metadir(sc->mp)) {
+ if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
+ xchk_block_set_preen(sc, bp);
+ }
+
if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
xchk_block_set_preen(sc, bp);
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (21 preceding siblings ...)
2024-08-23 0:07 ` [PATCH 22/26] xfs: check the metadata directory inumber in superblocks Darrick J. Wong
@ 2024-08-23 0:08 ` Darrick J. Wong
2024-08-23 4:54 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 24/26] xfs: check metadata directory file path connectivity Darrick J. Wong
` (2 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:08 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Due to resource acquisition rules, we have to create the ondisk
temporary files used to stage a filesystem repair before we can acquire
a reference to the inode that we actually want to repair. Therefore,
we do not know at tempfile creation time whether the tempfile will
belong to the regular directory tree or the metadata directory tree.
This distinction becomes important when the swapext code tries to figure
out the quota accounting of the two files whose mappings are being
swapped. The swapext code assumes that accounting updates are required
for a file if dqattach attaches dquots. Metadir files are never
accounted in quota, which means that swapext must not update the quota
accounting when swapping in a repaired directory/xattr/rtbitmap structure.
Prior to the swapext call, therefore, both files must be marked as
METADIR for dqattach so that dqattach will ignore them. Add support for
a repair tempfile to be switched to the metadir tree and switched back
before being released so that ifree will just free the file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
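[Not part of the patch: the two transactions described above are symmetric quota adjustments. A sketch of the deltas; the struct and names are illustrative, since the kernel applies these through xfs_trans_mod_dquot_byino():]

```c
#include <assert.h>

/* Per-dquot counters affected when a repair tempfile switches trees. */
struct toy_dquot {
	long icount;	/* inodes charged to this dquot */
	long bcount;	/* blocks charged to this dquot */
};

/*
 * Moving the freshly created tempfile into the metadata tree: metadir
 * files are invisible to quota, so un-account the inode.  The file
 * holds no blocks yet at this point, so only icount moves.
 */
static void tempfile_to_metadir(struct toy_dquot *dq)
{
	dq->icount -= 1;
}

/*
 * Moving it back out before release: re-account both the inode and
 * whatever blocks it accumulated during the repair, so inactivation
 * can proceed the normal way.
 */
static void tempfile_from_metadir(struct toy_dquot *dq, long i_nblocks)
{
	dq->icount += 1;
	dq->bcount += i_nblocks;
}
```

This asymmetry (icount only on the way in, icount plus bcount on the way out) matches the two hunks in tempfile.c below.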
fs/xfs/scrub/common.c | 5 ++
fs/xfs/scrub/tempfile.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/tempfile.h | 3 +
3 files changed, 105 insertions(+)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index f3fa9f2770d4a..5245943496c8b 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -39,6 +39,7 @@
#include "scrub/trace.h"
#include "scrub/repair.h"
#include "scrub/health.h"
+#include "scrub/tempfile.h"
/* Common code for the metadata scrubbers. */
@@ -1090,6 +1091,10 @@ xchk_setup_inode_contents(
if (error)
return error;
+ error = xrep_tempfile_adjust_directory_tree(sc);
+ if (error)
+ return error;
+
/* Lock the inode so the VFS cannot touch this file. */
xchk_ilock(sc, XFS_IOLOCK_EXCL);
diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
index 3c5a1d77fefae..4b7f7860e37ec 100644
--- a/fs/xfs/scrub/tempfile.c
+++ b/fs/xfs/scrub/tempfile.c
@@ -22,6 +22,7 @@
#include "xfs_exchmaps.h"
#include "xfs_defer.h"
#include "xfs_symlink_remote.h"
+#include "xfs_metafile.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/repair.h"
@@ -182,6 +183,101 @@ xrep_tempfile_create(
return error;
}
+/*
+ * Temporary files have to be created before we even know which inode we're
+ * going to scrub, so we assume that they will be part of the regular directory
+ * tree. If it turns out that we're actually scrubbing a file from the
+ * metadata directory tree, we have to subtract the temp file from the root
+ * dquots and detach the dquots.
+ */
+int
+xrep_tempfile_adjust_directory_tree(
+ struct xfs_scrub *sc)
+{
+ int error;
+
+ if (!sc->tempip)
+ return 0;
+
+ ASSERT(sc->tp == NULL);
+ ASSERT(!xfs_is_metadir_inode(sc->tempip));
+
+ if (!sc->ip || !xfs_is_metadir_inode(sc->ip))
+ return 0;
+
+ xfs_ilock(sc->tempip, XFS_IOLOCK_EXCL);
+ sc->temp_ilock_flags |= XFS_IOLOCK_EXCL;
+
+ error = xchk_trans_alloc(sc, 0);
+ if (error)
+ goto out_iolock;
+
+ xrep_tempfile_ilock(sc);
+ xfs_trans_ijoin(sc->tp, sc->tempip, 0);
+
+ /* Metadir files are not accounted in quota, so drop icount */
+ xfs_trans_mod_dquot_byino(sc->tp, sc->tempip, XFS_TRANS_DQ_ICOUNT, -1L);
+ xfs_metafile_set_iflag(sc->tp, sc->tempip, XFS_METAFILE_UNKNOWN);
+
+ error = xrep_trans_commit(sc);
+ if (error)
+ goto out_ilock;
+
+ xfs_qm_dqdetach(sc->tempip);
+out_ilock:
+ xrep_tempfile_iunlock(sc);
+out_iolock:
+ xrep_tempfile_iounlock(sc);
+ return error;
+}
+
+/*
+ * Remove this temporary file from the metadata directory tree so that it can
+ * be inactivated the normal way.
+ */
+STATIC int
+xrep_tempfile_remove_metadir(
+ struct xfs_scrub *sc)
+{
+ int error;
+
+ if (!sc->tempip || !xfs_is_metadir_inode(sc->tempip))
+ return 0;
+
+ ASSERT(sc->tp == NULL);
+
+ xfs_ilock(sc->tempip, XFS_IOLOCK_EXCL);
+ sc->temp_ilock_flags |= XFS_IOLOCK_EXCL;
+
+ error = xchk_trans_alloc(sc, 0);
+ if (error)
+ goto out_iolock;
+
+ xrep_tempfile_ilock(sc);
+ xfs_trans_ijoin(sc->tp, sc->tempip, 0);
+
+ xfs_metafile_clear_iflag(sc->tp, sc->tempip);
+
+ /* Non-metadir files are accounted in quota, so bump bcount/icount */
+ error = xfs_qm_dqattach_locked(sc->tempip, false);
+ if (error)
+ goto out_cancel;
+
+ xfs_trans_mod_dquot_byino(sc->tp, sc->tempip, XFS_TRANS_DQ_ICOUNT, 1L);
+ xfs_trans_mod_dquot_byino(sc->tp, sc->tempip, XFS_TRANS_DQ_BCOUNT,
+ sc->tempip->i_nblocks);
+ error = xrep_trans_commit(sc);
+ goto out_ilock;
+
+out_cancel:
+ xchk_trans_cancel(sc);
+out_ilock:
+ xrep_tempfile_iunlock(sc);
+out_iolock:
+ xrep_tempfile_iounlock(sc);
+ return error;
+}
+
/* Take IOLOCK_EXCL on the temporary file, maybe. */
bool
xrep_tempfile_iolock_nowait(
@@ -290,6 +386,7 @@ xrep_tempfile_rele(
sc->temp_ilock_flags = 0;
}
+ xrep_tempfile_remove_metadir(sc);
xchk_irele(sc, sc->tempip);
sc->tempip = NULL;
}
diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h
index e51399f595fe9..71c1b54599c30 100644
--- a/fs/xfs/scrub/tempfile.h
+++ b/fs/xfs/scrub/tempfile.h
@@ -10,6 +10,8 @@
int xrep_tempfile_create(struct xfs_scrub *sc, uint16_t mode);
void xrep_tempfile_rele(struct xfs_scrub *sc);
+int xrep_tempfile_adjust_directory_tree(struct xfs_scrub *sc);
+
bool xrep_tempfile_iolock_nowait(struct xfs_scrub *sc);
int xrep_tempfile_iolock_polled(struct xfs_scrub *sc);
void xrep_tempfile_iounlock(struct xfs_scrub *sc);
@@ -42,6 +44,7 @@ static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc)
xchk_ilock(sc, XFS_IOLOCK_EXCL);
}
# define xrep_is_tempfile(ip) (false)
+# define xrep_tempfile_adjust_directory_tree(sc) (0)
# define xrep_tempfile_rele(sc)
#endif /* CONFIG_XFS_ONLINE_REPAIR */
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 24/26] xfs: check metadata directory file path connectivity
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (22 preceding siblings ...)
2024-08-23 0:08 ` [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree Darrick J. Wong
@ 2024-08-23 0:08 ` Darrick J. Wong
2024-08-23 4:55 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair Darrick J. Wong
2024-08-23 0:08 ` [PATCH 26/26] xfs: repair metadata directory file path connectivity Darrick J. Wong
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:08 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a new scrubber type that checks that well-known metadata
directory paths are connected to the metadata inode that the incore
structures think is in use. IOWs, check that "/quota/user" in the
metadata directory tree actually points to
mp->m_quotainfo->qi_uquotaip->i_ino.
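The check described above can be modeled in a few lines of userspace C. This is a hypothetical sketch, not kernel code: the `metadir` table, `check_metapath()`, and the path strings are made-up stand-ins for the ondisk directory tree and the incore inode pointers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the ondisk metadata directory: path -> inode number. */
typedef uint64_t ino64_t;
#define NULLINO ((ino64_t)-1)

struct dirent_model { const char *name; ino64_t ino; };

static const struct dirent_model metadir[] = {
	{ "quota/user",    128 },
	{ "quota/group",   129 },
	{ "quota/project", 130 },
};

static ino64_t metadir_lookup(const char *path)
{
	for (size_t i = 0; i < sizeof(metadir) / sizeof(metadir[0]); i++)
		if (!strcmp(metadir[i].name, path))
			return metadir[i].ino;
	return NULLINO;
}

/*
 * Model of the scrubber: return 1 if the ondisk path resolves to the
 * inode that the incore structures claim is in use, 0 if "corrupt".
 */
static int check_metapath(const char *path, ino64_t incore_ino)
{
	ino64_t ondisk = metadir_lookup(path);

	if (ondisk == NULLINO)	/* no dirent at all */
		return 0;
	return ondisk == incore_ino;
}
```

Both failure modes the scrubber flags show up here: a missing dirent and a dirent pointing at the wrong inode number.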
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_fs.h | 13 +++
fs/xfs/libxfs/xfs_health.h | 4 +
fs/xfs/scrub/common.h | 1
fs/xfs/scrub/health.c | 1
fs/xfs/scrub/metapath.c | 174 ++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/scrub.c | 9 ++
fs/xfs/scrub/scrub.h | 2 +
fs/xfs/scrub/stats.c | 1
fs/xfs/scrub/trace.c | 1
fs/xfs/scrub/trace.h | 36 +++++++++
fs/xfs/xfs_health.c | 1
12 files changed, 241 insertions(+), 3 deletions(-)
create mode 100644 fs/xfs/scrub/metapath.c
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4482cc8c39039..4d8ca08cdd0ec 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -173,6 +173,7 @@ xfs-y += $(addprefix scrub/, \
inode.o \
iscan.o \
listxattr.o \
+ metapath.o \
nlinks.o \
parent.o \
readdir.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 6f5aebaf47ac8..b441b9258128e 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -198,6 +198,7 @@ struct xfs_fsop_geom {
#define XFS_FSOP_GEOM_SICK_QUOTACHECK (1 << 6) /* quota counts */
#define XFS_FSOP_GEOM_SICK_NLINKS (1 << 7) /* inode link counts */
#define XFS_FSOP_GEOM_SICK_METADIR (1 << 8) /* metadata directory */
+#define XFS_FSOP_GEOM_SICK_METAPATH (1 << 9) /* metadir tree path */
/* Output for XFS_FS_COUNTS */
typedef struct xfs_fsop_counts {
@@ -732,9 +733,10 @@ struct xfs_scrub_metadata {
#define XFS_SCRUB_TYPE_NLINKS 26 /* inode link counts */
#define XFS_SCRUB_TYPE_HEALTHY 27 /* everything checked out ok */
#define XFS_SCRUB_TYPE_DIRTREE 28 /* directory tree structure */
+#define XFS_SCRUB_TYPE_METAPATH 29 /* metadata directory tree paths */
/* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR 29
+#define XFS_SCRUB_TYPE_NR 30
/*
* This special type code only applies to the vectored scrub implementation.
@@ -812,6 +814,15 @@ struct xfs_scrub_vec_head {
#define XFS_SCRUB_VEC_FLAGS_ALL (0)
+/*
+ * i: sm_ino values for XFS_SCRUB_TYPE_METAPATH to select a metadata file for
+ * path checking.
+ */
+#define XFS_SCRUB_METAPATH_PROBE (0) /* do we have a metapath scrubber? */
+
+/* Number of metapath sm_ino values */
+#define XFS_SCRUB_METAPATH_NR (1)
+
/*
* ioctl limits
*/
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 0ded0cd93ce63..8abd345e23885 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -61,6 +61,7 @@ struct xfs_da_args;
#define XFS_SICK_FS_QUOTACHECK (1 << 4) /* quota counts */
#define XFS_SICK_FS_NLINKS (1 << 5) /* inode link counts */
#define XFS_SICK_FS_METADIR (1 << 6) /* metadata directory tree */
+#define XFS_SICK_FS_METAPATH (1 << 7) /* metadata directory tree path */
/* Observable health issues for realtime volume metadata. */
#define XFS_SICK_RT_BITMAP (1 << 0) /* realtime bitmap */
@@ -105,7 +106,8 @@ struct xfs_da_args;
XFS_SICK_FS_PQUOTA | \
XFS_SICK_FS_QUOTACHECK | \
XFS_SICK_FS_NLINKS | \
- XFS_SICK_FS_METADIR)
+ XFS_SICK_FS_METADIR | \
+ XFS_SICK_FS_METAPATH)
#define XFS_SICK_RT_PRIMARY (XFS_SICK_RT_BITMAP | \
XFS_SICK_RT_SUMMARY)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 4d713e2a463cd..96fe6ef5f4dc7 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -68,6 +68,7 @@ int xchk_setup_xattr(struct xfs_scrub *sc);
int xchk_setup_symlink(struct xfs_scrub *sc);
int xchk_setup_parent(struct xfs_scrub *sc);
int xchk_setup_dirtree(struct xfs_scrub *sc);
+int xchk_setup_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xchk_setup_rtbitmap(struct xfs_scrub *sc);
int xchk_setup_rtsummary(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index b712a8bd34f54..e202d84ec5140 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -109,6 +109,7 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
[XFS_SCRUB_TYPE_QUOTACHECK] = { XHG_FS, XFS_SICK_FS_QUOTACHECK },
[XFS_SCRUB_TYPE_NLINKS] = { XHG_FS, XFS_SICK_FS_NLINKS },
[XFS_SCRUB_TYPE_DIRTREE] = { XHG_INO, XFS_SICK_INO_DIRTREE },
+ [XFS_SCRUB_TYPE_METAPATH] = { XHG_FS, XFS_SICK_FS_METAPATH },
};
/* Return the health status mask for this scrub type. */
diff --git a/fs/xfs/scrub/metapath.c b/fs/xfs/scrub/metapath.c
new file mode 100644
index 0000000000000..b7bd86df9877c
--- /dev/null
+++ b/fs/xfs/scrub/metapath.c
@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2023-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_inode.h"
+#include "xfs_metafile.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dir2.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/readdir.h"
+
+/*
+ * Metadata Directory Tree Paths
+ * =============================
+ *
+ * A filesystem with metadir enabled expects to find metadata structures
+ * attached to files that are accessible by walking a path down the metadata
+ * directory tree. Given the metadir path and the incore inode storing the
+ * metadata, this scrubber ensures that the ondisk metadir path points to the
+ * ondisk inode represented by the incore inode.
+ */
+
+struct xchk_metapath {
+ struct xfs_scrub *sc;
+
+ /* Name for lookup */
+ struct xfs_name xname;
+
+ /* Path for this metadata file and the parent directory */
+ const char *path;
+ const char *parent_path;
+
+ /* Directory parent of the metadata file. */
+ struct xfs_inode *dp;
+
+ /* Locks held on dp */
+ unsigned int dp_ilock_flags;
+};
+
+/* Release resources tracked in the buffer. */
+static inline void
+xchk_metapath_cleanup(
+ void *buf)
+{
+ struct xchk_metapath *mpath = buf;
+
+ if (mpath->dp_ilock_flags)
+ xfs_iunlock(mpath->dp, mpath->dp_ilock_flags);
+ kfree(mpath->path);
+}
+
+int
+xchk_setup_metapath(
+ struct xfs_scrub *sc)
+{
+ if (!xfs_has_metadir(sc->mp))
+ return -ENOENT;
+ if (sc->sm->sm_gen)
+ return -EINVAL;
+
+ switch (sc->sm->sm_ino) {
+ case XFS_SCRUB_METAPATH_PROBE:
+ /* Just probing, nothing else to do. */
+ if (sc->sm->sm_agno)
+ return -EINVAL;
+ return 0;
+ default:
+ return -ENOENT;
+ }
+}
+
+/*
+ * Take the ILOCK on the metadata directory parent and child. We do not know
+ * that the metadata directory is not corrupt, so we lock the parent and try
+ * to lock the child. Returns 0 if successful, or -EINTR to abort the scrub.
+ */
+STATIC int
+xchk_metapath_ilock_both(
+ struct xchk_metapath *mpath)
+{
+ struct xfs_scrub *sc = mpath->sc;
+ int error = 0;
+
+ while (true) {
+ xfs_ilock(mpath->dp, XFS_ILOCK_EXCL);
+ if (xchk_ilock_nowait(sc, XFS_ILOCK_EXCL)) {
+ mpath->dp_ilock_flags |= XFS_ILOCK_EXCL;
+ return 0;
+ }
+ xfs_iunlock(mpath->dp, XFS_ILOCK_EXCL);
+
+ if (xchk_should_terminate(sc, &error))
+ return error;
+
+ delay(1);
+ }
+
+ ASSERT(0);
+ return -EINTR;
+}
+
+/* Unlock parent and child inodes. */
+static inline void
+xchk_metapath_iunlock(
+ struct xchk_metapath *mpath)
+{
+ struct xfs_scrub *sc = mpath->sc;
+
+ xchk_iunlock(sc, XFS_ILOCK_EXCL);
+
+ mpath->dp_ilock_flags &= ~XFS_ILOCK_EXCL;
+ xfs_iunlock(mpath->dp, XFS_ILOCK_EXCL);
+}
+
+int
+xchk_metapath(
+ struct xfs_scrub *sc)
+{
+ struct xchk_metapath *mpath = sc->buf;
+ xfs_ino_t ino = NULLFSINO;
+ int error;
+
+ /* Just probing, nothing else to do. */
+ if (sc->sm->sm_ino == XFS_SCRUB_METAPATH_PROBE)
+ return 0;
+
+ /* Parent required to do anything else. */
+ if (mpath->dp == NULL) {
+ xchk_ino_set_corrupt(sc, sc->ip->i_ino);
+ return 0;
+ }
+
+ error = xchk_trans_alloc_empty(sc);
+ if (error)
+ return error;
+
+ error = xchk_metapath_ilock_both(mpath);
+ if (error)
+ goto out_cancel;
+
+ /* Make sure the parent dir has a dirent pointing to this file. */
+ error = xchk_dir_lookup(sc, mpath->dp, &mpath->xname, &ino);
+ trace_xchk_metapath_lookup(sc, mpath->path, mpath->dp, ino);
+ if (error == -ENOENT) {
+ /* No directory entry at all */
+ xchk_ino_set_corrupt(sc, sc->ip->i_ino);
+ error = 0;
+ goto out_ilock;
+ }
+ if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error))
+ goto out_ilock;
+ if (ino != sc->ip->i_ino) {
+ /* Pointing to wrong inode */
+ xchk_ino_set_corrupt(sc, sc->ip->i_ino);
+ }
+
+out_ilock:
+ xchk_metapath_iunlock(mpath);
+out_cancel:
+ xchk_trans_cancel(sc);
+ return error;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 4cbcf7a86dbec..f1b2714e2894a 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -442,6 +442,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
.has = xfs_has_parent,
.repair = xrep_dirtree,
},
+ [XFS_SCRUB_TYPE_METAPATH] = { /* metadata directory tree path */
+ .type = ST_GENERIC,
+ .setup = xchk_setup_metapath,
+ .scrub = xchk_metapath,
+ .has = xfs_has_metadir,
+ .repair = xrep_notsupported,
+ },
};
static int
@@ -489,6 +496,8 @@ xchk_validate_inputs(
if (sm->sm_agno || (sm->sm_gen && !sm->sm_ino))
goto out;
break;
+ case ST_GENERIC:
+ break;
default:
goto out;
}
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 1bc33f010d0e7..ab143c7a531e8 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -73,6 +73,7 @@ enum xchk_type {
ST_PERAG, /* per-AG metadata */
ST_FS, /* per-FS metadata */
ST_INODE, /* per-inode metadata */
+ ST_GENERIC, /* determined by the scrubber */
};
struct xchk_meta_ops {
@@ -250,6 +251,7 @@ int xchk_xattr(struct xfs_scrub *sc);
int xchk_symlink(struct xfs_scrub *sc);
int xchk_parent(struct xfs_scrub *sc);
int xchk_dirtree(struct xfs_scrub *sc);
+int xchk_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xchk_rtbitmap(struct xfs_scrub *sc);
int xchk_rtsummary(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/stats.c b/fs/xfs/scrub/stats.c
index 7996c23354763..edcd02dc2e62c 100644
--- a/fs/xfs/scrub/stats.c
+++ b/fs/xfs/scrub/stats.c
@@ -80,6 +80,7 @@ static const char *name_map[XFS_SCRUB_TYPE_NR] = {
[XFS_SCRUB_TYPE_QUOTACHECK] = "quotacheck",
[XFS_SCRUB_TYPE_NLINKS] = "nlinks",
[XFS_SCRUB_TYPE_DIRTREE] = "dirtree",
+ [XFS_SCRUB_TYPE_METAPATH] = "metapath",
};
/* Format the scrub stats into a text buffer, similar to pcp style. */
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 4470ad0533b81..98f923ae664d0 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -20,6 +20,7 @@
#include "xfs_dir2.h"
#include "xfs_rmap.h"
#include "xfs_parent.h"
+#include "xfs_metafile.h"
#include "scrub/scrub.h"
#include "scrub/xfile.h"
#include "scrub/xfarray.h"
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index f9d37db6fa5d2..d4ca3f98679a2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -70,6 +70,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_NLINKS);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_DIRTREE);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_BARRIER);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_METAPATH);
#define XFS_SCRUB_TYPE_STRINGS \
{ XFS_SCRUB_TYPE_PROBE, "probe" }, \
@@ -101,7 +102,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_BARRIER);
{ XFS_SCRUB_TYPE_NLINKS, "nlinks" }, \
{ XFS_SCRUB_TYPE_HEALTHY, "healthy" }, \
{ XFS_SCRUB_TYPE_DIRTREE, "dirtree" }, \
- { XFS_SCRUB_TYPE_BARRIER, "barrier" }
+ { XFS_SCRUB_TYPE_BARRIER, "barrier" }, \
+ { XFS_SCRUB_TYPE_METAPATH, "metapath" }
#define XFS_SCRUB_FLAG_STRINGS \
{ XFS_SCRUB_IFLAG_REPAIR, "repair" }, \
@@ -1915,6 +1917,38 @@ TRACE_EVENT(xchk_dirtree_live_update,
__get_str(name))
);
+DECLARE_EVENT_CLASS(xchk_metapath_class,
+ TP_PROTO(struct xfs_scrub *sc, const char *path,
+ struct xfs_inode *dp, xfs_ino_t ino),
+ TP_ARGS(sc, path, dp, ino),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, scrub_ino)
+ __field(xfs_ino_t, parent_ino)
+ __field(xfs_ino_t, ino)
+ __string(name, path)
+ ),
+ TP_fast_assign(
+ __entry->dev = sc->mp->m_super->s_dev;
+ __entry->scrub_ino = sc->ip ? sc->ip->i_ino : NULLFSINO;
+ __entry->parent_ino = dp ? dp->i_ino : NULLFSINO;
+ __entry->ino = ino;
+ __assign_str(name);
+ ),
+ TP_printk("dev %d:%d ino 0x%llx parent_ino 0x%llx name '%s' ino 0x%llx",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->scrub_ino,
+ __entry->parent_ino,
+ __get_str(name),
+ __entry->ino)
+);
+#define DEFINE_XCHK_METAPATH_EVENT(name) \
+DEFINE_EVENT(xchk_metapath_class, name, \
+ TP_PROTO(struct xfs_scrub *sc, const char *path, \
+ struct xfs_inode *dp, xfs_ino_t ino), \
+ TP_ARGS(sc, path, dp, ino))
+DEFINE_XCHK_METAPATH_EVENT(xchk_metapath_lookup);
+
/* repair tracepoints */
#if IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index d5367fd2d0615..0bdbf6807bd29 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -377,6 +377,7 @@ static const struct ioctl_sick_map fs_map[] = {
{ XFS_SICK_FS_QUOTACHECK, XFS_FSOP_GEOM_SICK_QUOTACHECK },
{ XFS_SICK_FS_NLINKS, XFS_FSOP_GEOM_SICK_NLINKS },
{ XFS_SICK_FS_METADIR, XFS_FSOP_GEOM_SICK_METADIR },
+ { XFS_SICK_FS_METAPATH, XFS_FSOP_GEOM_SICK_METAPATH },
{ 0, 0 },
};
* [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (23 preceding siblings ...)
2024-08-23 0:08 ` [PATCH 24/26] xfs: check metadata directory file path connectivity Darrick J. Wong
@ 2024-08-23 0:08 ` Darrick J. Wong
2024-08-23 4:55 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 26/26] xfs: repair metadata directory file path connectivity Darrick J. Wong
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:08 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
xfs_dir_replace trips an assertion if you tell it to change a dirent to
point to an inumber that it already points at. Look up the dotdot entry
directly to confirm that we need to make a change.
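The lookup-before-replace pattern this patch adopts can be sketched in userspace C. The names (`replace_dotdot`, `maybe_replace_dotdot`) are invented for illustration; the `assert` stands in for the debug assertion that `xfs_dir_replace` trips when asked for a no-op replacement.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the low-level replace: a same-value replace is a caller bug. */
static int replace_dotdot(uint64_t *dotdot, uint64_t new_parent)
{
	assert(*dotdot != new_parent);	/* mirrors the kernel ASSERT */
	*dotdot = new_parent;
	return 0;
}

/* Look up the current '..' target first and skip a no-op replace. */
static int maybe_replace_dotdot(uint64_t *dotdot, uint64_t new_parent)
{
	if (*dotdot == new_parent)
		return 0;	/* already correct, nothing to do */
	return replace_dotdot(dotdot, new_parent);
}
```

The old code guessed ("skip if the parent is the root directory"); the new code asks the directory itself, which is correct even when the tempdir's '..' has already been rewritten.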
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/dir_repair.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c
index 2456cf1cb7441..2493138821087 100644
--- a/fs/xfs/scrub/dir_repair.c
+++ b/fs/xfs/scrub/dir_repair.c
@@ -1638,6 +1638,7 @@ xrep_dir_swap(
struct xrep_dir *rd)
{
struct xfs_scrub *sc = rd->sc;
+ xfs_ino_t ino;
bool ip_local, temp_local;
int error = 0;
@@ -1655,14 +1656,17 @@ xrep_dir_swap(
/*
* Reset the temporary directory's '..' entry to point to the parent
- * that we found. The temporary directory was created with the root
- * directory as the parent, so we can skip this if repairing a
- * subdirectory of the root.
+ * that we found. The dirent replace code asserts if the dirent
+ * already points at the new inumber, so we look it up here.
*
* It's also possible that this replacement could also expand a sf
* tempdir into block format.
*/
- if (rd->pscan.parent_ino != sc->mp->m_rootip->i_ino) {
+ error = xchk_dir_lookup(sc, rd->sc->tempip, &xfs_name_dotdot, &ino);
+ if (error)
+ return error;
+
+ if (rd->pscan.parent_ino != ino) {
error = xrep_dir_replace(rd, rd->sc->tempip, &xfs_name_dotdot,
rd->pscan.parent_ino, rd->tx.req.resblks);
if (error)
* [PATCH 26/26] xfs: repair metadata directory file path connectivity
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
` (24 preceding siblings ...)
2024-08-23 0:08 ` [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair Darrick J. Wong
@ 2024-08-23 0:08 ` Darrick J. Wong
2024-08-23 4:56 ` Christoph Hellwig
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:08 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Fix disconnected or incorrect metadata directory paths.
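The repair below is a link/unlink retry protocol: try to link the correct inode; if a dirent already points elsewhere, unlink the alleged child and retry, restarting the unlink whenever the dirent changes underneath us. Here is a minimal userspace sketch of just that protocol; the single `slot` variable, the helper names, and the `MODEL_*` error codes are invented stand-ins for the directory and the kernel errnos.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_EEXIST	17
#define MODEL_EAGAIN	11
#define NO_DIRENT	((uint64_t)0)

static uint64_t slot;	/* current dirent target; 0 means no dirent */

/* 0 if the dirent now points at @want; -EEXIST + @alleged otherwise. */
static int try_link(uint64_t want, uint64_t *alleged)
{
	if (slot == NO_DIRENT) { slot = want; return 0; }
	if (slot == want) return 0;
	*alleged = slot;	/* points elsewhere: report it */
	return -MODEL_EEXIST;
}

/* 0 if no dirent remains; -EEXIST if it points at @want; -EAGAIN if moved. */
static int try_unlink(uint64_t want, uint64_t *alleged)
{
	if (slot == NO_DIRENT) return 0;
	if (slot == want) return -MODEL_EEXIST;	/* link established */
	if (slot != *alleged) { *alleged = slot; return -MODEL_EAGAIN; }
	slot = NO_DIRENT;	/* remove the incorrect link */
	return 0;
}

/* The outer loop of xrep_metapath, reduced to the protocol itself. */
static int repair_path(uint64_t want)
{
	int error;

	do {
		uint64_t alleged;

		error = try_link(want, &alleged);
		if (!error) return 0;
		if (error != -MODEL_EEXIST) return error;

		do {
			error = try_unlink(want, &alleged);
		} while (error == -MODEL_EAGAIN);
		if (error == -MODEL_EEXIST) return 0;
	} while (!error);

	return error;
}
```

Splitting link and unlink into separate transactions, as the patch does, means each step can drop and retake locks safely; the `-EAGAIN` path covers the dirent changing between steps.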
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/metapath.c | 351 +++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/repair.h | 3
fs/xfs/scrub/scrub.c | 2
fs/xfs/scrub/trace.h | 5 +
4 files changed, 358 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/scrub/metapath.c b/fs/xfs/scrub/metapath.c
index b7bd86df9877c..edc1a395c4015 100644
--- a/fs/xfs/scrub/metapath.c
+++ b/fs/xfs/scrub/metapath.c
@@ -16,10 +16,15 @@
#include "xfs_quota.h"
#include "xfs_qm.h"
#include "xfs_dir2.h"
+#include "xfs_parent.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_attr.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
#include "scrub/readdir.h"
+#include "scrub/repair.h"
/*
* Metadata Directory Tree Paths
@@ -38,15 +43,28 @@ struct xchk_metapath {
/* Name for lookup */
struct xfs_name xname;
- /* Path for this metadata file and the parent directory */
+ /* Directory update for repairs */
+ struct xfs_dir_update du;
+
+ /* Path down to this metadata file from the parent directory */
const char *path;
- const char *parent_path;
/* Directory parent of the metadata file. */
struct xfs_inode *dp;
/* Locks held on dp */
unsigned int dp_ilock_flags;
+
+ /* Transaction block reservations */
+ unsigned int link_resblks;
+ unsigned int unlink_resblks;
+
+ /* Parent pointer updates */
+ struct xfs_parent_args link_ppargs;
+ struct xfs_parent_args unlink_ppargs;
+
+ /* Scratchpads for removing links */
+ struct xfs_da_args pptr_args;
};
/* Release resources tracked in the buffer. */
@@ -172,3 +190,332 @@ xchk_metapath(
xchk_trans_cancel(sc);
return error;
}
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+/* Create the dirent represented by the final component of the path. */
+STATIC int
+xrep_metapath_link(
+ struct xchk_metapath *mpath)
+{
+ struct xfs_scrub *sc = mpath->sc;
+
+ mpath->du.dp = mpath->dp;
+ mpath->du.name = &mpath->xname;
+ mpath->du.ip = sc->ip;
+
+ if (xfs_has_parent(sc->mp))
+ mpath->du.ppargs = &mpath->link_ppargs;
+ else
+ mpath->du.ppargs = NULL;
+
+ trace_xrep_metapath_link(sc, mpath->path, mpath->dp, sc->ip->i_ino);
+
+ return xfs_dir_add_child(sc->tp, mpath->link_resblks, &mpath->du);
+}
+
+/* Remove the dirent at the final component of the path. */
+STATIC int
+xrep_metapath_unlink(
+ struct xchk_metapath *mpath,
+ xfs_ino_t ino,
+ struct xfs_inode *ip)
+{
+ struct xfs_parent_rec rec;
+ struct xfs_scrub *sc = mpath->sc;
+ struct xfs_mount *mp = sc->mp;
+ int error;
+
+ trace_xrep_metapath_unlink(sc, mpath->path, mpath->dp, ino);
+
+ if (!ip) {
+ /* The child inode isn't allocated. Junk the dirent. */
+ xfs_trans_log_inode(sc->tp, mpath->dp, XFS_ILOG_CORE);
+ return xfs_dir_removename(sc->tp, mpath->dp, &mpath->xname,
+ ino, mpath->unlink_resblks);
+ }
+
+ mpath->du.dp = mpath->dp;
+ mpath->du.name = &mpath->xname;
+ mpath->du.ip = ip;
+ mpath->du.ppargs = NULL;
+
+ /* Figure out if we're removing a parent pointer too. */
+ if (xfs_has_parent(mp)) {
+ xfs_inode_to_parent_rec(&rec, ip);
+ error = xfs_parent_lookup(sc->tp, ip, &mpath->xname, &rec,
+ &mpath->pptr_args);
+ switch (error) {
+ case -ENOATTR:
+ break;
+ case 0:
+ mpath->du.ppargs = &mpath->unlink_ppargs;
+ break;
+ default:
+ return error;
+ }
+ }
+
+ return xfs_dir_remove_child(sc->tp, mpath->unlink_resblks, &mpath->du);
+}
+
+/*
+ * Try to create a dirent in @mpath->dp with the name @mpath->xname that points
+ * to @sc->ip. Returns:
+ *
+ * -EEXIST and an @alleged_child if the dirent points to the wrong inode;
+ * 0 if there is now a dirent pointing to @sc->ip; or
+ * A negative errno on error.
+ */
+STATIC int
+xrep_metapath_try_link(
+ struct xchk_metapath *mpath,
+ xfs_ino_t *alleged_child)
+{
+ struct xfs_scrub *sc = mpath->sc;
+ xfs_ino_t ino;
+ int error;
+
+ /* Allocate transaction, lock inodes, join to transaction. */
+ error = xchk_trans_alloc(sc, mpath->link_resblks);
+ if (error)
+ return error;
+
+ error = xchk_metapath_ilock_both(mpath);
+ if (error) {
+ xchk_trans_cancel(sc);
+ return error;
+ }
+ xfs_trans_ijoin(sc->tp, mpath->dp, 0);
+ xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+ error = xchk_dir_lookup(sc, mpath->dp, &mpath->xname, &ino);
+ trace_xrep_metapath_lookup(sc, mpath->path, mpath->dp, ino);
+ if (error == -ENOENT) {
+ /*
+ * There is no dirent in the directory. Create an entry
+ * pointing to @sc->ip.
+ */
+ error = xrep_metapath_link(mpath);
+ if (error)
+ goto out_cancel;
+
+ error = xrep_trans_commit(sc);
+ xchk_metapath_iunlock(mpath);
+ return error;
+ }
+ if (error)
+ goto out_cancel;
+
+ if (ino == sc->ip->i_ino) {
+ /* The dirent already points to @sc->ip; we're done. */
+ error = 0;
+ goto out_cancel;
+ }
+
+ /*
+ * The dirent points elsewhere; pass that back so that the caller
+ * can try to remove the dirent.
+ */
+ *alleged_child = ino;
+ error = -EEXIST;
+
+out_cancel:
+ xchk_trans_cancel(sc);
+ xchk_metapath_iunlock(mpath);
+ return error;
+}
+
+/*
+ * Take the ILOCK on the metadata directory parent and a bad child, if one is
+ * supplied. We do not know that the metadata directory is not corrupt, so we
+ * lock the parent and try to lock the child. Returns 0 if successful, or
+ * -EINTR to abort the repair. The lock state of @dp is not recorded in @mpath.
+ */
+STATIC int
+xchk_metapath_ilock_parent_and_child(
+ struct xchk_metapath *mpath,
+ struct xfs_inode *ip)
+{
+ struct xfs_scrub *sc = mpath->sc;
+ int error = 0;
+
+ while (true) {
+ xfs_ilock(mpath->dp, XFS_ILOCK_EXCL);
+ if (!ip || xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+ return 0;
+ xfs_iunlock(mpath->dp, XFS_ILOCK_EXCL);
+
+ if (xchk_should_terminate(sc, &error))
+ return error;
+
+ delay(1);
+ }
+
+ ASSERT(0);
+ return -EINTR;
+}
+
+/*
+ * Try to remove a dirent in @mpath->dp with the name @mpath->xname that points
+ * to @alleged_child. Returns:
+ *
+ * 0 if there is no longer a dirent;
+ * -EEXIST if the dirent points to @sc->ip;
+ * -EAGAIN and an updated @alleged_child if the dirent points elsewhere; or
+ * A negative errno for any other error.
+ */
+STATIC int
+xrep_metapath_try_unlink(
+ struct xchk_metapath *mpath,
+ xfs_ino_t *alleged_child)
+{
+ struct xfs_scrub *sc = mpath->sc;
+ struct xfs_inode *ip = NULL;
+ xfs_ino_t ino;
+ int error;
+
+ ASSERT(*alleged_child != sc->ip->i_ino);
+
+ trace_xrep_metapath_try_unlink(sc, mpath->path, mpath->dp,
+ *alleged_child);
+
+ /*
+ * Allocate transaction, grab the alleged child inode, lock inodes,
+ * join to transaction.
+ */
+ error = xchk_trans_alloc(sc, mpath->unlink_resblks);
+ if (error)
+ return error;
+
+ error = xchk_iget(sc, *alleged_child, &ip);
+ if (error == -EINVAL || error == -ENOENT) {
+ /* inode number is bogus, junk the dirent */
+ error = 0;
+ }
+ if (error) {
+ xchk_trans_cancel(sc);
+ return error;
+ }
+
+ error = xchk_metapath_ilock_parent_and_child(mpath, ip);
+ if (error) {
+ xchk_trans_cancel(sc);
+ return error;
+ }
+ xfs_trans_ijoin(sc->tp, mpath->dp, 0);
+ if (ip)
+ xfs_trans_ijoin(sc->tp, ip, 0);
+
+ error = xchk_dir_lookup(sc, mpath->dp, &mpath->xname, &ino);
+ trace_xrep_metapath_lookup(sc, mpath->path, mpath->dp, ino);
+ if (error == -ENOENT) {
+ /*
+ * There is no dirent in the directory anymore. We're ready to
+ * try the link operation again.
+ */
+ error = 0;
+ goto out_cancel;
+ }
+ if (error)
+ goto out_cancel;
+
+ if (ino == sc->ip->i_ino) {
+ /* The dirent already points to @sc->ip; we're done. */
+ error = -EEXIST;
+ goto out_cancel;
+ }
+
+ /*
+ * The dirent does not point to the alleged child. Update the caller
+ * and signal that we want to be called again.
+ */
+ if (ino != *alleged_child) {
+ *alleged_child = ino;
+ error = -EAGAIN;
+ goto out_cancel;
+ }
+
+ /* Remove the link to the child. */
+ error = xrep_metapath_unlink(mpath, ino, ip);
+ if (error)
+ goto out_cancel;
+
+ error = xrep_trans_commit(sc);
+ goto out_unlock;
+
+out_cancel:
+ xchk_trans_cancel(sc);
+out_unlock:
+ xfs_iunlock(mpath->dp, XFS_ILOCK_EXCL);
+ if (ip) {
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ xchk_irele(sc, ip);
+ }
+ return error;
+}
+
+/*
+ * Make sure the metadata directory path points to the child being examined.
+ *
+ * Repair needs to be able to create a directory structure, create its own
+ * transactions, and take ILOCKs. This function /must/ be called after all
+ * other repairs have completed.
+ */
+int
+xrep_metapath(
+ struct xfs_scrub *sc)
+{
+ struct xchk_metapath *mpath = sc->buf;
+ struct xfs_mount *mp = sc->mp;
+ int error = 0;
+
+ /* Just probing, nothing to repair. */
+ if (sc->sm->sm_ino == XFS_SCRUB_METAPATH_PROBE)
+ return 0;
+
+ /* Parent required to do anything else. */
+ if (mpath->dp == NULL)
+ return -EFSCORRUPTED;
+
+ /*
+ * Make sure the child file actually has an attr fork to receive a new
+ * parent pointer if the fs has parent pointers.
+ */
+ if (xfs_has_parent(mp)) {
+ error = xfs_attr_add_fork(sc->ip,
+ sizeof(struct xfs_attr_sf_hdr), 1);
+ if (error)
+ return error;
+ }
+
+ /* Compute block reservation required to unlink and link a file. */
+ mpath->unlink_resblks = xfs_remove_space_res(mp, MAXNAMELEN);
+ mpath->link_resblks = xfs_link_space_res(mp, MAXNAMELEN);
+
+ do {
+ xfs_ino_t alleged_child;
+
+ /* Re-establish the link, or tell us which inode to remove. */
+ error = xrep_metapath_try_link(mpath, &alleged_child);
+ if (!error)
+ return 0;
+ if (error != -EEXIST)
+ return error;
+
+ /*
+ * Remove an incorrect link to an alleged child, or tell us
+ * which inode to remove.
+ */
+ do {
+ error = xrep_metapath_try_unlink(mpath, &alleged_child);
+ } while (error == -EAGAIN);
+ if (error == -EEXIST) {
+ /* Link established; we're done. */
+ error = 0;
+ break;
+ }
+ } while (!error);
+
+ return error;
+}
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 0e0dc2bf985c2..90f9cb3b5ad8b 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -134,6 +134,7 @@ int xrep_directory(struct xfs_scrub *sc);
int xrep_parent(struct xfs_scrub *sc);
int xrep_symlink(struct xfs_scrub *sc);
int xrep_dirtree(struct xfs_scrub *sc);
+int xrep_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xrep_rtbitmap(struct xfs_scrub *sc);
@@ -208,6 +209,7 @@ xrep_setup_nothing(
#define xrep_setup_parent xrep_setup_nothing
#define xrep_setup_nlinks xrep_setup_nothing
#define xrep_setup_dirtree xrep_setup_nothing
+#define xrep_setup_metapath xrep_setup_nothing
#define xrep_setup_inode(sc, imap) ((void)0)
@@ -243,6 +245,7 @@ static inline int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *x)
#define xrep_parent xrep_notsupported
#define xrep_symlink xrep_notsupported
#define xrep_dirtree xrep_notsupported
+#define xrep_metapath xrep_notsupported
#endif /* CONFIG_XFS_ONLINE_REPAIR */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index f1b2714e2894a..04a7a5944837d 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -447,7 +447,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
.setup = xchk_setup_metapath,
.scrub = xchk_metapath,
.has = xfs_has_metadir,
- .repair = xrep_notsupported,
+ .repair = xrep_metapath,
},
};
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index d4ca3f98679a2..fe901b9138b4b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -3607,6 +3607,11 @@ DEFINE_XCHK_DIRTREE_EVENT(xrep_dirtree_delete_path);
DEFINE_XCHK_DIRTREE_EVENT(xrep_dirtree_create_adoption);
DEFINE_XCHK_DIRTREE_EVALUATE_EVENT(xrep_dirtree_decided_fate);
+DEFINE_XCHK_METAPATH_EVENT(xrep_metapath_lookup);
+DEFINE_XCHK_METAPATH_EVENT(xrep_metapath_try_unlink);
+DEFINE_XCHK_METAPATH_EVENT(xrep_metapath_unlink);
+DEFINE_XCHK_METAPATH_EVENT(xrep_metapath_link);
+
#endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
#endif /* _TRACE_XFS_SCRUB_TRACE_H */
* [PATCH 01/12] xfs: remove xfs_validate_rtextents
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
@ 2024-08-23 0:09 ` Darrick J. Wong
2024-08-23 0:09 ` [PATCH 02/12] xfs: factor out a xfs_validate_rt_geometry helper Darrick J. Wong
` (10 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:09 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Replace xfs_validate_rtextents with an open-coded check for zero
rtextents. The function's name implies that it does a lot more than
a zero check; open coding makes the actual check obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_sb.c | 2 +-
fs/xfs/libxfs/xfs_types.h | 12 ------------
fs/xfs/xfs_rtalloc.c | 2 +-
3 files changed, 2 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 1dcbf8ade39f8..b781e5f836e4c 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -516,7 +516,7 @@ xfs_validate_sb_common(
rbmblocks = howmany_64(sbp->sb_rextents,
NBBY * sbp->sb_blocksize);
- if (!xfs_validate_rtextents(rexts) ||
+ if (sbp->sb_rextents == 0 ||
sbp->sb_rextents != rexts ||
sbp->sb_rextslog != xfs_compute_rextslog(rexts) ||
sbp->sb_rbmblocks != rbmblocks) {
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 76eb9e328835f..a8cd44d03ef64 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -235,16 +235,4 @@ bool xfs_verify_fileoff(struct xfs_mount *mp, xfs_fileoff_t off);
bool xfs_verify_fileext(struct xfs_mount *mp, xfs_fileoff_t off,
xfs_fileoff_t len);
-/* Do we support an rt volume having this number of rtextents? */
-static inline bool
-xfs_validate_rtextents(
- xfs_rtbxlen_t rtextents)
-{
- /* No runt rt volumes */
- if (rtextents == 0)
- return false;
-
- return true;
-}
-
#endif /* __XFS_TYPES_H__ */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index b4c3c5a3171bf..2acb75336b7b9 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -905,7 +905,7 @@ xfs_growfs_rt(
*/
nrextents = nrblocks;
do_div(nrextents, in->extsize);
- if (!xfs_validate_rtextents(nrextents)) {
+ if (nrextents == 0) {
error = -EINVAL;
goto out_unlock;
}
* [PATCH 02/12] xfs: factor out a xfs_validate_rt_geometry helper
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
2024-08-23 0:09 ` [PATCH 01/12] xfs: remove xfs_validate_rtextents Darrick J. Wong
@ 2024-08-23 0:09 ` Darrick J. Wong
2024-08-23 0:09 ` [PATCH 03/12] xfs: make the RT rsum_cache mandatory Darrick J. Wong
` (9 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:09 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Split the RT geometry validation in the early mount code into a
helper that can be reused by repair (from which this code was
apparently originally stolen anyway).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: u64 return value for calc_rbmblocks]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_sb.c | 64 ++++++++++++++++++++++++++----------------------
fs/xfs/libxfs/xfs_sb.h | 1 +
2 files changed, 36 insertions(+), 29 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b781e5f836e4c..a4221afb012b6 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -234,6 +234,38 @@ xfs_validate_sb_read(
return 0;
}
+static uint64_t
+xfs_sb_calc_rbmblocks(
+ struct xfs_sb *sbp)
+{
+ return howmany_64(sbp->sb_rextents, NBBY * sbp->sb_blocksize);
+}
+
+/* Validate the realtime geometry */
+bool
+xfs_validate_rt_geometry(
+ struct xfs_sb *sbp)
+{
+ if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
+ sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE)
+ return false;
+
+ if (sbp->sb_rblocks == 0) {
+ if (sbp->sb_rextents != 0 || sbp->sb_rbmblocks != 0 ||
+ sbp->sb_rextslog != 0 || sbp->sb_frextents != 0)
+ return false;
+ return true;
+ }
+
+ if (sbp->sb_rextents == 0 ||
+ sbp->sb_rextents != div_u64(sbp->sb_rblocks, sbp->sb_rextsize) ||
+ sbp->sb_rextslog != xfs_compute_rextslog(sbp->sb_rextents) ||
+ sbp->sb_rbmblocks != xfs_sb_calc_rbmblocks(sbp))
+ return false;
+
+ return true;
+}
+
/* Check all the superblock fields we care about when writing one out. */
STATIC int
xfs_validate_sb_write(
@@ -493,39 +525,13 @@ xfs_validate_sb_common(
}
}
- /* Validate the realtime geometry; stolen from xfs_repair */
- if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
- sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) {
+ if (!xfs_validate_rt_geometry(sbp)) {
xfs_notice(mp,
- "realtime extent sanity check failed");
+ "realtime %sgeometry check failed",
+ sbp->sb_rblocks ? "" : "zeroed ");
return -EFSCORRUPTED;
}
- if (sbp->sb_rblocks == 0) {
- if (sbp->sb_rextents != 0 || sbp->sb_rbmblocks != 0 ||
- sbp->sb_rextslog != 0 || sbp->sb_frextents != 0) {
- xfs_notice(mp,
- "realtime zeroed geometry check failed");
- return -EFSCORRUPTED;
- }
- } else {
- uint64_t rexts;
- uint64_t rbmblocks;
-
- rexts = div_u64(sbp->sb_rblocks, sbp->sb_rextsize);
- rbmblocks = howmany_64(sbp->sb_rextents,
- NBBY * sbp->sb_blocksize);
-
- if (sbp->sb_rextents == 0 ||
- sbp->sb_rextents != rexts ||
- sbp->sb_rextslog != xfs_compute_rextslog(rexts) ||
- sbp->sb_rbmblocks != rbmblocks) {
- xfs_notice(mp,
- "realtime geometry sanity check failed");
- return -EFSCORRUPTED;
- }
- }
-
/*
* Either (sb_unit and !hasdalign) or (!sb_unit and hasdalign)
* would imply the image is corrupted.
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 37b1ed1bc2095..796f02191dfd2 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -38,6 +38,7 @@ extern int xfs_sb_get_secondary(struct xfs_mount *mp,
bool xfs_validate_stripe_geometry(struct xfs_mount *mp,
__s64 sunit, __s64 swidth, int sectorsize, bool may_repair,
bool silent);
+bool xfs_validate_rt_geometry(struct xfs_sb *sbp);
uint8_t xfs_compute_rextslog(xfs_rtbxlen_t rtextents);
^ permalink raw reply related [flat|nested] 271+ messages in thread
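The two-branch validation factored out above (a zeroed rt volume must have all-zero derived fields; a populated one must have internally consistent derived fields) can be illustrated with a minimal standalone sketch. The struct, field names, and the compute_rextslog loop below are simplified stand-ins for the real XFS types, and the rt extent-size min/max checks are omitted for brevity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct xfs_sb: only the
 * fields the realtime geometry checks look at. */
struct rt_sb {
	uint64_t rblocks;	/* rt device size in fs blocks */
	uint64_t rextents;	/* number of rt extents */
	uint64_t rbmblocks;	/* blocks in the rt bitmap file */
	uint64_t frextents;	/* free rt extents */
	uint32_t rextsize;	/* rt extent size in fs blocks */
	uint32_t blocksize;	/* fs block size in bytes */
	uint8_t  rextslog;	/* log2 of rextents, or 0 */
};

#define NBBY		8
/* ceiling division, the semantics of howmany_64() */
#define howmany_64(x, y)	(((x) + ((y) - 1)) / (y))

/* floor(log2(rtextents)); 0 for a zero-extent volume */
static uint8_t compute_rextslog(uint64_t rtextents)
{
	uint8_t log = 0;

	while (rtextents >>= 1)
		log++;
	return log;
}

/* Mirrors the structure of xfs_validate_rt_geometry(): zeroed
 * volumes must be all-zero, populated volumes must be consistent. */
static bool validate_rt_geometry(const struct rt_sb *sb)
{
	if (sb->rblocks == 0)
		return sb->rextents == 0 && sb->rbmblocks == 0 &&
		       sb->rextslog == 0 && sb->frextents == 0;

	if (sb->rextents == 0 ||
	    sb->rextents != sb->rblocks / sb->rextsize ||
	    sb->rextslog != compute_rextslog(sb->rextents) ||
	    sb->rbmblocks != howmany_64(sb->rextents,
					NBBY * (uint64_t)sb->blocksize))
		return false;

	return true;
}
```

With a 4096-byte block size and rextsize of 4, a 40960-block rt volume has 10240 extents, one bitmap block, and rextslog 13; perturbing any derived field makes the check fail.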
* [PATCH 03/12] xfs: make the RT rsum_cache mandatory
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
2024-08-23 0:09 ` [PATCH 01/12] xfs: remove xfs_validate_rtextents Darrick J. Wong
2024-08-23 0:09 ` [PATCH 02/12] xfs: factor out a xfs_validate_rt_geometry helper Darrick J. Wong
@ 2024-08-23 0:09 ` Darrick J. Wong
2024-08-23 0:09 ` [PATCH 04/12] xfs: remove the limit argument to xfs_rtfind_back Darrick J. Wong
` (8 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:09 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Currently the RT mount code simply ignores an allocation failure for the
rsum_cache. The code mostly works fine without the cache, but its
absence leads to nasty corner cases in the growfs code that we don't
really handle well. Switch to failing the mount if we can't allocate
the memory; the file system would not exactly be useful in such a
constrained environment to start with.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 2acb75336b7b9..f5a39cfd9bcb8 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -769,21 +769,20 @@ xfs_growfs_rt_alloc(
return error;
}
-static void
+static int
xfs_alloc_rsum_cache(
- xfs_mount_t *mp, /* file system mount structure */
- xfs_extlen_t rbmblocks) /* number of rt bitmap blocks */
+ struct xfs_mount *mp,
+ xfs_extlen_t rbmblocks)
{
/*
* The rsum cache is initialized to the maximum value, which is
* trivially an upper bound on the maximum level with any free extents.
- * We can continue without the cache if it couldn't be allocated.
*/
mp->m_rsum_cache = kvmalloc(rbmblocks, GFP_KERNEL);
- if (mp->m_rsum_cache)
- memset(mp->m_rsum_cache, -1, rbmblocks);
- else
- xfs_warn(mp, "could not allocate realtime summary cache");
+ if (!mp->m_rsum_cache)
+ return -ENOMEM;
+ memset(mp->m_rsum_cache, -1, rbmblocks);
+ return 0;
}
/*
@@ -941,8 +940,11 @@ xfs_growfs_rt(
goto out_unlock;
rsum_cache = mp->m_rsum_cache;
- if (nrbmblocks != sbp->sb_rbmblocks)
- xfs_alloc_rsum_cache(mp, nrbmblocks);
+ if (nrbmblocks != sbp->sb_rbmblocks) {
+ error = xfs_alloc_rsum_cache(mp, nrbmblocks);
+ if (error)
+ goto out_unlock;
+ }
/*
* Allocate a new (fake) mount/sb.
@@ -1271,7 +1273,9 @@ xfs_rtmount_inodes(
if (error)
goto out_rele_summary;
- xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
+ error = xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
+ if (error)
+ goto out_rele_summary;
xfs_trans_cancel(tp);
return 0;
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 04/12] xfs: remove the limit argument to xfs_rtfind_back
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (2 preceding siblings ...)
2024-08-23 0:09 ` [PATCH 03/12] xfs: make the RT rsum_cache mandatory Darrick J. Wong
@ 2024-08-23 0:09 ` Darrick J. Wong
2024-08-23 0:10 ` [PATCH 05/12] xfs: assert a valid limit in xfs_rtfind_forw Darrick J. Wong
` (7 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:09 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
All callers pass a 0 limit to xfs_rtfind_back, so remove the argument
and hard code it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 9 ++++-----
fs/xfs/libxfs/xfs_rtbitmap.h | 2 +-
fs/xfs/xfs_rtalloc.c | 2 +-
3 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 386b672c50589..9feeefe539488 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -139,14 +139,13 @@ xfs_rtbuf_get(
}
/*
- * Searching backward from start to limit, find the first block whose
- * allocated/free state is different from start's.
+ * Searching backward from start find the first block whose allocated/free state
+ * is different from start's.
*/
int
xfs_rtfind_back(
struct xfs_rtalloc_args *args,
xfs_rtxnum_t start, /* starting rtext to look at */
- xfs_rtxnum_t limit, /* last rtext to look at */
xfs_rtxnum_t *rtx) /* out: start rtext found */
{
struct xfs_mount *mp = args->mp;
@@ -175,7 +174,7 @@ xfs_rtfind_back(
*/
word = xfs_rtx_to_rbmword(mp, start);
bit = (int)(start & (XFS_NBWORD - 1));
- len = start - limit + 1;
+ len = start + 1;
/*
* Compute match value, based on the bit at start: if 1 (free)
* then all-ones, else all-zeroes.
@@ -698,7 +697,7 @@ xfs_rtfree_range(
* We need to find the beginning and end of the extent so we can
* properly update the summary.
*/
- error = xfs_rtfind_back(args, start, 0, &preblock);
+ error = xfs_rtfind_back(args, start, &preblock);
if (error) {
return error;
}
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 6186585f2c376..1e04f0954a0fa 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -316,7 +316,7 @@ xfs_rtsummary_read_buf(
int xfs_rtcheck_range(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
xfs_rtxlen_t len, int val, xfs_rtxnum_t *new, int *stat);
int xfs_rtfind_back(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
- xfs_rtxnum_t limit, xfs_rtxnum_t *rtblock);
+ xfs_rtxnum_t *rtblock);
int xfs_rtfind_forw(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
xfs_rtxnum_t limit, xfs_rtxnum_t *rtblock);
int xfs_rtmodify_range(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index f5a39cfd9bcb8..aaa969433ba8a 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -144,7 +144,7 @@ xfs_rtallocate_range(
* We need to find the beginning and end of the extent so we can
* properly update the summary.
*/
- error = xfs_rtfind_back(args, start, 0, &preblock);
+ error = xfs_rtfind_back(args, start, &preblock);
if (error)
return error;
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 05/12] xfs: assert a valid limit in xfs_rtfind_forw
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (3 preceding siblings ...)
2024-08-23 0:09 ` [PATCH 04/12] xfs: remove the limit argument to xfs_rtfind_back Darrick J. Wong
@ 2024-08-23 0:10 ` Darrick J. Wong
2024-08-23 0:10 ` [PATCH 06/12] xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf Darrick J. Wong
` (6 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:10 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Protect against developers passing stupid limits when refactoring the
RT code once again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 9feeefe539488..4de97c4e8ebdd 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -315,6 +315,8 @@ xfs_rtfind_forw(
xfs_rtword_t incore;
unsigned int word; /* word number in the buffer */
+ ASSERT(start <= limit);
+
/*
* Compute and read in starting bitmap block for starting block.
*/
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 06/12] xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:10 ` [PATCH 05/12] xfs: assert a valid limit in xfs_rtfind_forw Darrick J. Wong
@ 2024-08-23 0:10 ` Darrick J. Wong
2024-08-23 0:10 ` [PATCH 07/12] xfs: cleanup the calling convention for xfs_rtpick_extent Darrick J. Wong
` (5 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:10 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Add a corruption check for passing an invalid block number, which is a
lot easier to understand than the xfs_bmapi_read failure later on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 31 ++++++++++++++++++++++++++++++-
fs/xfs/libxfs/xfs_rtbitmap.h | 22 ++--------------------
2 files changed, 32 insertions(+), 21 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 4de97c4e8ebdd..02d6668d860fd 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -69,7 +69,7 @@ xfs_rtbuf_cache_relse(
* Get a buffer for the bitmap or summary file block specified.
* The buffer is returned read and locked.
*/
-int
+static int
xfs_rtbuf_get(
struct xfs_rtalloc_args *args,
xfs_fileoff_t block, /* block number in bitmap or summary */
@@ -138,6 +138,35 @@ xfs_rtbuf_get(
return 0;
}
+int
+xfs_rtbitmap_read_buf(
+ struct xfs_rtalloc_args *args,
+ xfs_fileoff_t block)
+{
+ struct xfs_mount *mp = args->mp;
+
+ if (XFS_IS_CORRUPT(mp, block >= mp->m_sb.sb_rbmblocks)) {
+ xfs_rt_mark_sick(mp, XFS_SICK_RT_BITMAP);
+ return -EFSCORRUPTED;
+ }
+
+ return xfs_rtbuf_get(args, block, 0);
+}
+
+int
+xfs_rtsummary_read_buf(
+ struct xfs_rtalloc_args *args,
+ xfs_fileoff_t block)
+{
+ struct xfs_mount *mp = args->mp;
+
+ if (XFS_IS_CORRUPT(mp, block >= XFS_B_TO_FSB(mp, mp->m_rsumsize))) {
+ xfs_rt_mark_sick(args->mp, XFS_SICK_RT_SUMMARY);
+ return -EFSCORRUPTED;
+ }
+ return xfs_rtbuf_get(args, block, 1);
+}
+
/*
* Searching backward from start find the first block whose allocated/free state
* is different from start's.
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 1e04f0954a0fa..e87e2099cff5e 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -293,26 +293,8 @@ typedef int (*xfs_rtalloc_query_range_fn)(
#ifdef CONFIG_XFS_RT
void xfs_rtbuf_cache_relse(struct xfs_rtalloc_args *args);
-
-int xfs_rtbuf_get(struct xfs_rtalloc_args *args, xfs_fileoff_t block,
- int issum);
-
-static inline int
-xfs_rtbitmap_read_buf(
- struct xfs_rtalloc_args *args,
- xfs_fileoff_t block)
-{
- return xfs_rtbuf_get(args, block, 0);
-}
-
-static inline int
-xfs_rtsummary_read_buf(
- struct xfs_rtalloc_args *args,
- xfs_fileoff_t block)
-{
- return xfs_rtbuf_get(args, block, 1);
-}
-
+int xfs_rtbitmap_read_buf(struct xfs_rtalloc_args *args, xfs_fileoff_t block);
+int xfs_rtsummary_read_buf(struct xfs_rtalloc_args *args, xfs_fileoff_t block);
int xfs_rtcheck_range(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
xfs_rtxlen_t len, int val, xfs_rtxnum_t *new, int *stat);
int xfs_rtfind_back(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
^ permalink raw reply related [flat|nested] 271+ messages in thread
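The pattern in this patch (validate the file offset against the known size of the metadata file before issuing the read, so a corrupt block number fails with a clear corruption error instead of a confusing mapping failure deep in the read path) can be sketched outside the kernel. rt_mount, rtbuf_get, and the errno value below are hypothetical stand-ins, not the real XFS API:

```c
#include <assert.h>
#include <stdint.h>

#define EFSCORRUPTED	117	/* illustrative errno value */

/* hypothetical stand-in for the mount geometry */
struct rt_mount {
	uint64_t rbmblocks;	/* size of the rt bitmap file, in blocks */
};

/* pretend block read; returns 0 on success */
static int rtbuf_get(struct rt_mount *mp, uint64_t block)
{
	(void)mp;
	(void)block;
	return 0;
}

/*
 * Bounds-check before reading: an out-of-range block number is
 * reported as corruption up front rather than failing later in
 * the mapping code.
 */
static int rtbitmap_read_buf(struct rt_mount *mp, uint64_t block)
{
	if (block >= mp->rbmblocks)
		return -EFSCORRUPTED;
	return rtbuf_get(mp, block);
}
```

The kernel version additionally marks the rt bitmap sick via xfs_rt_mark_sick() so scrub can report the corruption.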
* [PATCH 07/12] xfs: cleanup the calling convention for xfs_rtpick_extent
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (5 preceding siblings ...)
2024-08-23 0:10 ` [PATCH 06/12] xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf Darrick J. Wong
@ 2024-08-23 0:10 ` Darrick J. Wong
2024-08-23 0:11 ` [PATCH 08/12] xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc Darrick J. Wong
` (4 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:10 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
xfs_rtpick_extent never returns an error. Do away with the error return
and directly return the picked extent instead of passing it back through
a call-by-reference argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index aaa969433ba8a..8da59d941db3c 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1306,12 +1306,11 @@ xfs_rtunmount_inodes(
* of rtextents and the fraction.
* The fraction sequence is 0, 1/2, 1/4, 3/4, 1/8, ..., 7/8, 1/16, ...
*/
-static int
+static xfs_rtxnum_t
xfs_rtpick_extent(
xfs_mount_t *mp, /* file system mount point */
xfs_trans_t *tp, /* transaction pointer */
- xfs_rtxlen_t len, /* allocation length (rtextents) */
- xfs_rtxnum_t *pick) /* result rt extent */
+ xfs_rtxlen_t len) /* allocation length (rtextents) */
{
xfs_rtxnum_t b; /* result rtext */
int log2; /* log of sequence number */
@@ -1342,8 +1341,7 @@ xfs_rtpick_extent(
ts.tv_sec = seq + 1;
inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);
xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
- *pick = b;
- return 0;
+ return b;
}
static void
@@ -1450,9 +1448,7 @@ xfs_bmap_rtalloc(
* If it's an allocation to an empty file at offset 0, pick an
* extent that will space things out in the rt area.
*/
- error = xfs_rtpick_extent(mp, ap->tp, ralen, &start);
- if (error)
- return error;
+ start = xfs_rtpick_extent(mp, ap->tp, ralen);
} else {
start = 0;
}
^ permalink raw reply related [flat|nested] 271+ messages in thread
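The placement math in the patch above (the fraction sequence 0, 1/2, 1/4, 3/4, 1/8, ..., 7/8, 1/16, ... driven by a per-mount sequence number) can be sketched as a standalone function. highbit64 and pick_extent below are simplified stand-ins stripped of the transaction and atime plumbing the real xfs_rtpick_extent uses to persist the sequence number:

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(n)) for n > 0 -- stand-in for xfs_highbit64() */
static int highbit64(uint64_t n)
{
	int bit = -1;

	while (n) {
		n >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Given sequence number 'seq', pick a starting rt extent that walks
 * the fractions 0, 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, ... of
 * the volume: for seq > 0, take log2 = floor(log2(seq)) and
 * resid = seq - 2^log2, giving fraction (2*resid + 1) / 2^(log2+1).
 * 'rtextents' is the volume size and 'len' the allocation length,
 * both in rt extents.
 */
static uint64_t pick_extent(uint64_t rtextents, uint64_t seq, uint64_t len)
{
	uint64_t b;

	if (seq == 0) {
		b = 0;
	} else {
		int log2 = highbit64(seq);
		uint64_t resid = seq - (1ULL << log2);

		b = (rtextents * ((resid << 1) + 1)) >> (log2 + 1);
	}
	/* clamp so the allocation fits inside the volume */
	if (b + len > rtextents)
		b = rtextents - len;
	return b;
}
```

For a 1024-extent volume the successive picks land at extents 0, 512, 256, 768, 128, 384, ..., spacing allocations out across the rt area as the commit message describes.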
* [PATCH 08/12] xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (6 preceding siblings ...)
2024-08-23 0:10 ` [PATCH 07/12] xfs: cleanup the calling convention for xfs_rtpick_extent Darrick J. Wong
@ 2024-08-23 0:11 ` Darrick J. Wong
2024-08-23 0:11 ` [PATCH 09/12] xfs: factor out a xfs_growfs_rt_bmblock helper Darrick J. Wong
` (3 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:11 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Currently the various low-level RT allocator functions call into
xfs_rtallocate_range directly, which ties them into the locking protocol
for the RT bitmap. As these helpers already return the allocated range,
lift the call to xfs_rtallocate_range into xfs_bmap_rtalloc so that it
happens as high as possible in the stack, which will simplify future
changes to the locking protocol.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 38 ++++++++++++++++++--------------------
1 file changed, 18 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 8da59d941db3c..a98e22c76280b 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -261,9 +261,9 @@ xfs_rtallocate_extent_block(
/*
* i for maxlen is all free, allocate and return that.
*/
- bestlen = maxlen;
- besti = i;
- goto allocate;
+ *len = maxlen;
+ *rtx = i;
+ return 0;
}
/*
@@ -314,12 +314,8 @@ xfs_rtallocate_extent_block(
}
/*
- * Allocate besti for bestlen & return that.
+ * Pick besti for bestlen & return that.
*/
-allocate:
- error = xfs_rtallocate_range(args, besti, bestlen);
- if (error)
- return error;
*len = bestlen;
*rtx = besti;
return 0;
@@ -373,12 +369,6 @@ xfs_rtallocate_extent_exact(
}
}
- /*
- * Allocate what we can and return it.
- */
- error = xfs_rtallocate_range(args, start, maxlen);
- if (error)
- return error;
*len = maxlen;
*rtx = start;
return 0;
@@ -431,7 +421,6 @@ xfs_rtallocate_extent_near(
if (error != -ENOSPC)
return error;
-
bbno = xfs_rtx_to_rbmblock(mp, start);
i = 0;
j = -1;
@@ -554,11 +543,11 @@ xfs_rtalloc_sumlevel(
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
xfs_fileoff_t i; /* bitmap block number */
+ int error;
for (i = 0; i < args->mp->m_sb.sb_rbmblocks; i++) {
xfs_suminfo_t sum; /* summary information for extents */
xfs_rtxnum_t n; /* next rtext to be tried */
- int error;
error = xfs_rtget_summary(args, l, i, &sum);
if (error)
@@ -1473,9 +1462,12 @@ xfs_bmap_rtalloc(
error = xfs_rtallocate_extent_size(&args, raminlen,
ralen, &ralen, prod, &rtx);
}
- xfs_rtbuf_cache_relse(&args);
- if (error == -ENOSPC) {
+ if (error) {
+ xfs_rtbuf_cache_relse(&args);
+ if (error != -ENOSPC)
+ return error;
+
if (align > mp->m_sb.sb_rextsize) {
/*
* We previously enlarged the request length to try to
@@ -1503,14 +1495,20 @@ xfs_bmap_rtalloc(
ap->length = 0;
return 0;
}
+
+ error = xfs_rtallocate_range(&args, rtx, ralen);
if (error)
- return error;
+ goto out_release;
xfs_trans_mod_sb(ap->tp, ap->wasdel ?
XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
-(long)ralen);
+
ap->blkno = xfs_rtx_to_rtb(mp, rtx);
ap->length = xfs_rtxlen_to_extlen(mp, ralen);
xfs_bmap_alloc_account(ap);
- return 0;
+
+out_release:
+ xfs_rtbuf_cache_relse(&args);
+ return error;
}
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 09/12] xfs: factor out a xfs_growfs_rt_bmblock helper
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (7 preceding siblings ...)
2024-08-23 0:11 ` [PATCH 08/12] xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc Darrick J. Wong
@ 2024-08-23 0:11 ` Darrick J. Wong
2024-08-23 0:11 ` [PATCH 10/12] xfs: factor out a xfs_last_rt_bmblock helper Darrick J. Wong
` (2 subsequent siblings)
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:11 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Add a helper to contain the per-rtbitmap block logic in xfs_growfs_rt.
Note that this helper now allocates a new fake mount structure for
each rtbitmap block iteration instead of reusing the memory for an
entire growfs call. Compared to all the other work done when freeing
the blocks, the overhead for this is in the noise, and it keeps the
code nicely modular.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 317 +++++++++++++++++++++++++-------------------------
1 file changed, 158 insertions(+), 159 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index a98e22c76280b..5a637166cc788 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -807,9 +807,148 @@ xfs_growfs_rt_fixup_extsize(
return error;
}
-/*
- * Visible (exported) functions.
- */
+static int
+xfs_growfs_rt_bmblock(
+ struct xfs_mount *mp,
+ xfs_rfsblock_t nrblocks,
+ xfs_agblock_t rextsize,
+ xfs_fileoff_t bmbno)
+{
+ struct xfs_inode *rbmip = mp->m_rbmip;
+ struct xfs_inode *rsumip = mp->m_rsumip;
+ struct xfs_rtalloc_args args = {
+ .mp = mp,
+ };
+ struct xfs_rtalloc_args nargs = {
+ };
+ struct xfs_mount *nmp;
+ xfs_rfsblock_t nrblocks_step;
+ xfs_rtbxlen_t freed_rtx;
+ int error;
+
+
+ nrblocks_step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
+
+ nmp = nargs.mp = kmemdup(mp, sizeof(*mp), GFP_KERNEL);
+ if (!nmp)
+ return -ENOMEM;
+
+ /*
+ * Calculate new sb and mount fields for this round.
+ */
+ nmp->m_rtxblklog = -1; /* don't use shift or masking */
+ nmp->m_sb.sb_rextsize = rextsize;
+ nmp->m_sb.sb_rbmblocks = bmbno + 1;
+ nmp->m_sb.sb_rblocks = min(nrblocks, nrblocks_step);
+ nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
+ nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
+ nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
+ nmp->m_rsumsize = XFS_FSB_TO_B(mp,
+ xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
+ nmp->m_sb.sb_rbmblocks));
+
+ /* recompute growfsrt reservation from new rsumsize */
+ xfs_trans_resv_calc(nmp, &nmp->m_resv);
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtfree, 0, 0, 0,
+ &args.tp);
+ if (error)
+ goto out_free;
+ nargs.tp = args.tp;
+
+ xfs_rtbitmap_lock(args.tp, mp);
+
+ /*
+ * Update the bitmap inode's size ondisk and incore. We need to update
+ * the incore size so that inode inactivation won't punch what it thinks
+ * are "posteof" blocks.
+ */
+ rbmip->i_disk_size = nmp->m_sb.sb_rbmblocks * nmp->m_sb.sb_blocksize;
+ i_size_write(VFS_I(rbmip), rbmip->i_disk_size);
+ xfs_trans_log_inode(args.tp, rbmip, XFS_ILOG_CORE);
+
+ /*
+ * Update the summary inode's size. We need to update the incore size
+ * so that inode inactivation won't punch what it thinks are "posteof"
+ * blocks.
+ */
+ rsumip->i_disk_size = nmp->m_rsumsize;
+ i_size_write(VFS_I(rsumip), rsumip->i_disk_size);
+ xfs_trans_log_inode(args.tp, rsumip, XFS_ILOG_CORE);
+
+ /*
+ * Copy summary data from old to new sizes when the real size (not
+ * block-aligned) changes.
+ */
+ if (mp->m_sb.sb_rbmblocks != nmp->m_sb.sb_rbmblocks ||
+ mp->m_rsumlevels != nmp->m_rsumlevels) {
+ error = xfs_rtcopy_summary(&args, &nargs);
+ if (error)
+ goto out_cancel;
+ }
+
+ /*
+ * Update superblock fields.
+ */
+ if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
+ nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
+ if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
+ nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
+ if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
+ nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
+ if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
+ nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
+ if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
+ nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
+
+ /*
+ * Free the new extent.
+ */
+ freed_rtx = nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents;
+ error = xfs_rtfree_range(&nargs, mp->m_sb.sb_rextents, freed_rtx);
+ xfs_rtbuf_cache_relse(&nargs);
+ if (error)
+ goto out_cancel;
+
+ /*
+ * Mark more blocks free in the superblock.
+ */
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
+
+ /*
+ * Update mp values into the real mp structure.
+ */
+ mp->m_rsumlevels = nmp->m_rsumlevels;
+ mp->m_rsumsize = nmp->m_rsumsize;
+
+ /*
+ * Recompute the growfsrt reservation from the new rsumsize.
+ */
+ xfs_trans_resv_calc(mp, &mp->m_resv);
+
+ error = xfs_trans_commit(args.tp);
+ if (error)
+ goto out_free;
+
+ /*
+ * Ensure the mount RT feature flag is now set.
+ */
+ mp->m_features |= XFS_FEAT_REALTIME;
+
+ kfree(nmp);
+ return 0;
+
+out_cancel:
+ xfs_trans_cancel(args.tp);
+out_free:
+ kfree(nmp);
+ return error;
+}
/*
* Grow the realtime area of the filesystem.
@@ -822,23 +961,14 @@ xfs_growfs_rt(
xfs_fileoff_t bmbno; /* bitmap block number */
struct xfs_buf *bp; /* temporary buffer */
int error; /* error return value */
- xfs_mount_t *nmp; /* new (fake) mount structure */
- xfs_rfsblock_t nrblocks; /* new number of realtime blocks */
xfs_extlen_t nrbmblocks; /* new number of rt bitmap blocks */
xfs_rtxnum_t nrextents; /* new number of realtime extents */
- uint8_t nrextslog; /* new log2 of sb_rextents */
xfs_extlen_t nrsumblocks; /* new number of summary blocks */
- uint nrsumlevels; /* new rt summary levels */
- uint nrsumsize; /* new size of rt summary, bytes */
- xfs_sb_t *nsbp; /* new superblock */
xfs_extlen_t rbmblocks; /* current number of rt bitmap blocks */
xfs_extlen_t rsumblocks; /* current number of rt summary blks */
- xfs_sb_t *sbp; /* old superblock */
uint8_t *rsum_cache; /* old summary cache */
xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
- sbp = &mp->m_sb;
-
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -857,11 +987,10 @@ xfs_growfs_rt(
goto out_unlock;
/* Shrink not supported. */
- if (in->newblocks <= sbp->sb_rblocks)
+ if (in->newblocks <= mp->m_sb.sb_rblocks)
goto out_unlock;
-
/* Can only change rt extent size when adding rt volume. */
- if (sbp->sb_rblocks > 0 && in->extsize != sbp->sb_rextsize)
+ if (mp->m_sb.sb_rblocks > 0 && in->extsize != mp->m_sb.sb_rextsize)
goto out_unlock;
/* Range check the extent size. */
@@ -874,15 +1003,14 @@ xfs_growfs_rt(
if (xfs_has_rmapbt(mp) || xfs_has_reflink(mp) || xfs_has_quota(mp))
goto out_unlock;
- nrblocks = in->newblocks;
- error = xfs_sb_validate_fsb_count(sbp, nrblocks);
+ error = xfs_sb_validate_fsb_count(&mp->m_sb, in->newblocks);
if (error)
goto out_unlock;
/*
* Read in the last block of the device, make sure it exists.
*/
error = xfs_buf_read_uncached(mp->m_rtdev_targp,
- XFS_FSB_TO_BB(mp, nrblocks - 1),
+ XFS_FSB_TO_BB(mp, in->newblocks - 1),
XFS_FSB_TO_BB(mp, 1), 0, &bp, NULL);
if (error)
goto out_unlock;
@@ -891,17 +1019,15 @@ xfs_growfs_rt(
/*
* Calculate new parameters. These are the final values to be reached.
*/
- nrextents = nrblocks;
- do_div(nrextents, in->extsize);
+ nrextents = div_u64(in->newblocks, in->extsize);
if (nrextents == 0) {
error = -EINVAL;
goto out_unlock;
}
nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
- nrextslog = xfs_compute_rextslog(nrextents);
- nrsumlevels = nrextslog + 1;
- nrsumblocks = xfs_rtsummary_blockcount(mp, nrsumlevels, nrbmblocks);
- nrsumsize = XFS_FSB_TO_B(mp, nrsumblocks);
+ nrsumblocks = xfs_rtsummary_blockcount(mp,
+ xfs_compute_rextslog(nrextents) + 1, nrbmblocks);
+
/*
* New summary size can't be more than half the size of
* the log. This prevents us from getting a log overflow,
@@ -929,149 +1055,27 @@ xfs_growfs_rt(
goto out_unlock;
rsum_cache = mp->m_rsum_cache;
- if (nrbmblocks != sbp->sb_rbmblocks) {
+ if (nrbmblocks != mp->m_sb.sb_rbmblocks) {
error = xfs_alloc_rsum_cache(mp, nrbmblocks);
if (error)
goto out_unlock;
}
- /*
- * Allocate a new (fake) mount/sb.
- */
- nmp = kmalloc(sizeof(*nmp), GFP_KERNEL | __GFP_NOFAIL);
/*
* Loop over the bitmap blocks.
* We will do everything one bitmap block at a time.
* Skip the current block if it is exactly full.
* This also deals with the case where there were no rtextents before.
*/
- for (bmbno = sbp->sb_rbmblocks -
- ((sbp->sb_rextents & ((1 << mp->m_blkbit_log) - 1)) != 0);
- bmbno < nrbmblocks;
- bmbno++) {
- struct xfs_rtalloc_args args = {
- .mp = mp,
- };
- struct xfs_rtalloc_args nargs = {
- .mp = nmp,
- };
- struct xfs_trans *tp;
- xfs_rfsblock_t nrblocks_step;
-
- *nmp = *mp;
- nsbp = &nmp->m_sb;
- /*
- * Calculate new sb and mount fields for this round.
- */
- nsbp->sb_rextsize = in->extsize;
- nmp->m_rtxblklog = -1; /* don't use shift or masking */
- nsbp->sb_rbmblocks = bmbno + 1;
- nrblocks_step = (bmbno + 1) * NBBY * nsbp->sb_blocksize *
- nsbp->sb_rextsize;
- nsbp->sb_rblocks = min(nrblocks, nrblocks_step);
- nsbp->sb_rextents = xfs_rtb_to_rtx(nmp, nsbp->sb_rblocks);
- ASSERT(nsbp->sb_rextents != 0);
- nsbp->sb_rextslog = xfs_compute_rextslog(nsbp->sb_rextents);
- nrsumlevels = nmp->m_rsumlevels = nsbp->sb_rextslog + 1;
- nrsumblocks = xfs_rtsummary_blockcount(mp, nrsumlevels,
- nsbp->sb_rbmblocks);
- nmp->m_rsumsize = nrsumsize = XFS_FSB_TO_B(mp, nrsumblocks);
- /* recompute growfsrt reservation from new rsumsize */
- xfs_trans_resv_calc(nmp, &nmp->m_resv);
-
- /*
- * Start a transaction, get the log reservation.
- */
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtfree, 0, 0, 0,
- &tp);
+ bmbno = mp->m_sb.sb_rbmblocks;
+ if (xfs_rtx_to_rbmword(mp, mp->m_sb.sb_rextents) != 0)
+ bmbno--;
+ for (; bmbno < nrbmblocks; bmbno++) {
+ error = xfs_growfs_rt_bmblock(mp, in->newblocks, in->extsize,
+ bmbno);
if (error)
- break;
- args.tp = tp;
- nargs.tp = tp;
-
- /*
- * Lock out other callers by grabbing the bitmap and summary
- * inode locks and joining them to the transaction.
- */
- xfs_rtbitmap_lock(tp, mp);
- /*
- * Update the bitmap inode's size ondisk and incore. We need
- * to update the incore size so that inode inactivation won't
- * punch what it thinks are "posteof" blocks.
- */
- mp->m_rbmip->i_disk_size =
- nsbp->sb_rbmblocks * nsbp->sb_blocksize;
- i_size_write(VFS_I(mp->m_rbmip), mp->m_rbmip->i_disk_size);
- xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
- /*
- * Update the summary inode's size. We need to update the
- * incore size so that inode inactivation won't punch what it
- * thinks are "posteof" blocks.
- */
- mp->m_rsumip->i_disk_size = nmp->m_rsumsize;
- i_size_write(VFS_I(mp->m_rsumip), mp->m_rsumip->i_disk_size);
- xfs_trans_log_inode(tp, mp->m_rsumip, XFS_ILOG_CORE);
- /*
- * Copy summary data from old to new sizes.
- * Do this when the real size (not block-aligned) changes.
- */
- if (sbp->sb_rbmblocks != nsbp->sb_rbmblocks ||
- mp->m_rsumlevels != nmp->m_rsumlevels) {
- error = xfs_rtcopy_summary(&args, &nargs);
- if (error)
- goto error_cancel;
- }
- /*
- * Update superblock fields.
- */
- if (nsbp->sb_rextsize != sbp->sb_rextsize)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSIZE,
- nsbp->sb_rextsize - sbp->sb_rextsize);
- if (nsbp->sb_rbmblocks != sbp->sb_rbmblocks)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
- nsbp->sb_rbmblocks - sbp->sb_rbmblocks);
- if (nsbp->sb_rblocks != sbp->sb_rblocks)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBLOCKS,
- nsbp->sb_rblocks - sbp->sb_rblocks);
- if (nsbp->sb_rextents != sbp->sb_rextents)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTENTS,
- nsbp->sb_rextents - sbp->sb_rextents);
- if (nsbp->sb_rextslog != sbp->sb_rextslog)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
- nsbp->sb_rextslog - sbp->sb_rextslog);
- /*
- * Free new extent.
- */
- error = xfs_rtfree_range(&nargs, sbp->sb_rextents,
- nsbp->sb_rextents - sbp->sb_rextents);
- xfs_rtbuf_cache_relse(&nargs);
- if (error) {
-error_cancel:
- xfs_trans_cancel(tp);
- break;
- }
- /*
- * Mark more blocks free in the superblock.
- */
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS,
- nsbp->sb_rextents - sbp->sb_rextents);
- /*
- * Update mp values into the real mp structure.
- */
- mp->m_rsumlevels = nrsumlevels;
- mp->m_rsumsize = nrsumsize;
- /* recompute growfsrt reservation from new rsumsize */
- xfs_trans_resv_calc(mp, &mp->m_resv);
-
- error = xfs_trans_commit(tp);
- if (error)
- break;
-
- /* Ensure the mount RT feature flag is now set. */
- mp->m_features |= XFS_FEAT_REALTIME;
+ goto out_free;
}
- if (error)
- goto out_free;
if (old_rextsize != in->extsize) {
error = xfs_growfs_rt_fixup_extsize(mp);
@@ -1083,11 +1087,6 @@ xfs_growfs_rt(
error = xfs_update_secondary_sbs(mp);
out_free:
- /*
- * Free the fake mp structure.
- */
- kfree(nmp);
-
/*
* If we had to allocate a new rsum_cache, we either need to free the
* old one (if we succeeded) or free the new one and restore the old one
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 10/12] xfs: factor out a xfs_last_rt_bmblock helper
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:11 ` [PATCH 09/12] xfs: factor out a xfs_growfs_rt_bmblock helper Darrick J. Wong
@ 2024-08-23 0:11 ` Darrick J. Wong
2024-08-23 0:11 ` [PATCH 11/12] xfs: factor out rtbitmap/summary initialization helpers Darrick J. Wong
2024-08-23 0:12 ` [PATCH 12/12] xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock Darrick J. Wong
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:11 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Add a helper to calculate the last currently used rt bitmap block to
better structure the growfs code and prepare for future changes to it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 29 +++++++++++++++++++----------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 5a637166cc788..5bead4db164e9 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -950,6 +950,23 @@ xfs_growfs_rt_bmblock(
return error;
}
+/*
+ * Calculate the last rbmblock currently used.
+ *
+ * This also deals with the case where there were no rtextents before.
+ */
+static xfs_fileoff_t
+xfs_last_rt_bmblock(
+ struct xfs_mount *mp)
+{
+ xfs_fileoff_t bmbno = mp->m_sb.sb_rbmblocks;
+
+ /* Skip the current block if it is exactly full. */
+ if (xfs_rtx_to_rbmword(mp, mp->m_sb.sb_rextents) != 0)
+ bmbno--;
+ return bmbno;
+}
+
/*
* Grow the realtime area of the filesystem.
*/
@@ -1061,16 +1078,8 @@ xfs_growfs_rt(
goto out_unlock;
}
- /*
- * Loop over the bitmap blocks.
- * We will do everything one bitmap block at a time.
- * Skip the current block if it is exactly full.
- * This also deals with the case where there were no rtextents before.
- */
- bmbno = mp->m_sb.sb_rbmblocks;
- if (xfs_rtx_to_rbmword(mp, mp->m_sb.sb_rextents) != 0)
- bmbno--;
- for (; bmbno < nrbmblocks; bmbno++) {
+ /* Initialize the free space bitmap one bitmap block at a time. */
+ for (bmbno = xfs_last_rt_bmblock(mp); bmbno < nrbmblocks; bmbno++) {
error = xfs_growfs_rt_bmblock(mp, in->newblocks, in->extsize,
bmbno);
if (error)
* [PATCH 11/12] xfs: factor out rtbitmap/summary initialization helpers
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (9 preceding siblings ...)
2024-08-23 0:11 ` [PATCH 10/12] xfs: factor out a xfs_last_rt_bmblock helper Darrick J. Wong
@ 2024-08-23 0:11 ` Darrick J. Wong
2024-08-23 0:12 ` [PATCH 12/12] xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock Darrick J. Wong
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:11 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Add helpers to libxfs that can be shared by growfs and mkfs for
initializing the rtbitmap and summary files. By passing the optional
data pointer, repair can also use them to rebuild those files. This
will become even more
useful when the rtgroups feature adds a metadata header to each block,
which means even more shared code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor documentation and data advance tweaks]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 126 ++++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtbitmap.h | 3 +
fs/xfs/xfs_rtalloc.c | 121 +---------------------------------------
3 files changed, 133 insertions(+), 117 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 02d6668d860fd..715d2c54ce029 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -13,6 +13,8 @@
#include "xfs_mount.h"
#include "xfs_inode.h"
#include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
#include "xfs_trans.h"
#include "xfs_rtalloc.h"
#include "xfs_error.h"
@@ -1255,3 +1257,127 @@ xfs_rtbitmap_unlock_shared(
if (rbmlock_flags & XFS_RBMLOCK_BITMAP)
xfs_iunlock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
}
+
+static int
+xfs_rtfile_alloc_blocks(
+ struct xfs_inode *ip,
+ xfs_fileoff_t offset_fsb,
+ xfs_filblks_t count_fsb,
+ struct xfs_bmbt_irec *map)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ int nmap = 1;
+ int error;
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtalloc,
+ XFS_GROWFSRT_SPACE_RES(mp, count_fsb), 0, 0, &tp);
+ if (error)
+ return error;
+
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+ error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
+ XFS_IEXT_ADD_NOSPLIT_CNT);
+ if (error)
+ goto out_trans_cancel;
+
+ error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
+ XFS_BMAPI_METADATA, 0, map, &nmap);
+ if (error)
+ goto out_trans_cancel;
+
+ return xfs_trans_commit(tp);
+
+out_trans_cancel:
+ xfs_trans_cancel(tp);
+ return error;
+}
+
+/* Get a buffer for the block. */
+static int
+xfs_rtfile_initialize_block(
+ struct xfs_inode *ip,
+ xfs_fsblock_t fsbno,
+ void *data)
+{
+ struct xfs_mount *mp = ip->i_mount;
+ struct xfs_trans *tp;
+ struct xfs_buf *bp;
+ const size_t copylen = mp->m_blockwsize << XFS_WORDLOG;
+ enum xfs_blft buf_type;
+ int error;
+
+ if (ip == mp->m_rsumip)
+ buf_type = XFS_BLFT_RTSUMMARY_BUF;
+ else
+ buf_type = XFS_BLFT_RTBITMAP_BUF;
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtzero, 0, 0, 0, &tp);
+ if (error)
+ return error;
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
+
+ error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+ XFS_FSB_TO_DADDR(mp, fsbno), mp->m_bsize, 0, &bp);
+ if (error) {
+ xfs_trans_cancel(tp);
+ return error;
+ }
+
+ xfs_trans_buf_set_type(tp, bp, buf_type);
+ bp->b_ops = &xfs_rtbuf_ops;
+ if (data)
+ memcpy(bp->b_addr, data, copylen);
+ else
+ memset(bp->b_addr, 0, copylen);
+ xfs_trans_log_buf(tp, bp, 0, mp->m_sb.sb_blocksize - 1);
+ return xfs_trans_commit(tp);
+}
+
+/*
+ * Allocate space to the bitmap or summary file, and zero it, for growfs.
+ * @data must be a contiguous buffer large enough to fill all blocks in the
+ * file; or NULL to initialize the contents to zeroes.
+ */
+int
+xfs_rtfile_initialize_blocks(
+ struct xfs_inode *ip, /* inode (bitmap/summary) */
+ xfs_fileoff_t offset_fsb, /* offset to start from */
+ xfs_fileoff_t end_fsb, /* offset to allocate to */
+ void *data) /* data to fill the blocks */
+{
+ struct xfs_mount *mp = ip->i_mount;
+ const size_t copylen = mp->m_blockwsize << XFS_WORDLOG;
+
+ while (offset_fsb < end_fsb) {
+ struct xfs_bmbt_irec map;
+ xfs_filblks_t i;
+ int error;
+
+ error = xfs_rtfile_alloc_blocks(ip, offset_fsb,
+ end_fsb - offset_fsb, &map);
+ if (error)
+ return error;
+
+ /*
+ * Now we need to clear the allocated blocks.
+ *
+ * Do this one block per transaction, to keep it simple.
+ */
+ for (i = 0; i < map.br_blockcount; i++) {
+ error = xfs_rtfile_initialize_block(ip,
+ map.br_startblock + i, data);
+ if (error)
+ return error;
+ if (data)
+ data += copylen;
+ }
+
+ offset_fsb = map.br_startoff + map.br_blockcount;
+ }
+
+ return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index e87e2099cff5e..0d5ab5e2cb6a3 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -343,6 +343,9 @@ xfs_filblks_t xfs_rtsummary_blockcount(struct xfs_mount *mp,
unsigned long long xfs_rtsummary_wordcount(struct xfs_mount *mp,
unsigned int rsumlevels, xfs_extlen_t rbmblocks);
+int xfs_rtfile_initialize_blocks(struct xfs_inode *ip,
+ xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb, void *data);
+
void xfs_rtbitmap_lock(struct xfs_trans *tp, struct xfs_mount *mp);
void xfs_rtbitmap_unlock(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 5bead4db164e9..52ed8448d9925 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -643,121 +643,6 @@ xfs_rtallocate_extent_size(
return -ENOSPC;
}
-/*
- * Allocate space to the bitmap or summary file, and zero it, for growfs.
- */
-STATIC int
-xfs_growfs_rt_alloc(
- struct xfs_mount *mp, /* file system mount point */
- xfs_extlen_t oblocks, /* old count of blocks */
- xfs_extlen_t nblocks, /* new count of blocks */
- struct xfs_inode *ip) /* inode (bitmap/summary) */
-{
- xfs_fileoff_t bno; /* block number in file */
- struct xfs_buf *bp; /* temporary buffer for zeroing */
- xfs_daddr_t d; /* disk block address */
- int error; /* error return value */
- xfs_fsblock_t fsbno; /* filesystem block for bno */
- struct xfs_bmbt_irec map; /* block map output */
- int nmap; /* number of block maps */
- int resblks; /* space reservation */
- enum xfs_blft buf_type;
- struct xfs_trans *tp;
-
- if (ip == mp->m_rsumip)
- buf_type = XFS_BLFT_RTSUMMARY_BUF;
- else
- buf_type = XFS_BLFT_RTBITMAP_BUF;
-
- /*
- * Allocate space to the file, as necessary.
- */
- while (oblocks < nblocks) {
- resblks = XFS_GROWFSRT_SPACE_RES(mp, nblocks - oblocks);
- /*
- * Reserve space & log for one extent added to the file.
- */
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtalloc, resblks,
- 0, 0, &tp);
- if (error)
- return error;
- /*
- * Lock the inode.
- */
- xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
-
- error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
- XFS_IEXT_ADD_NOSPLIT_CNT);
- if (error)
- goto out_trans_cancel;
-
- /*
- * Allocate blocks to the bitmap file.
- */
- nmap = 1;
- error = xfs_bmapi_write(tp, ip, oblocks, nblocks - oblocks,
- XFS_BMAPI_METADATA, 0, &map, &nmap);
- if (error)
- goto out_trans_cancel;
- /*
- * Free any blocks freed up in the transaction, then commit.
- */
- error = xfs_trans_commit(tp);
- if (error)
- return error;
- /*
- * Now we need to clear the allocated blocks.
- * Do this one block per transaction, to keep it simple.
- */
- for (bno = map.br_startoff, fsbno = map.br_startblock;
- bno < map.br_startoff + map.br_blockcount;
- bno++, fsbno++) {
- /*
- * Reserve log for one block zeroing.
- */
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtzero,
- 0, 0, 0, &tp);
- if (error)
- return error;
- /*
- * Lock the bitmap inode.
- */
- xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
- /*
- * Get a buffer for the block.
- */
- d = XFS_FSB_TO_DADDR(mp, fsbno);
- error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
- mp->m_bsize, 0, &bp);
- if (error)
- goto out_trans_cancel;
-
- xfs_trans_buf_set_type(tp, bp, buf_type);
- bp->b_ops = &xfs_rtbuf_ops;
- memset(bp->b_addr, 0, mp->m_sb.sb_blocksize);
- xfs_trans_log_buf(tp, bp, 0, mp->m_sb.sb_blocksize - 1);
- /*
- * Commit the transaction.
- */
- error = xfs_trans_commit(tp);
- if (error)
- return error;
- }
- /*
- * Go on to the next extent, if any.
- */
- oblocks = map.br_startoff + map.br_blockcount;
- }
-
- return 0;
-
-out_trans_cancel:
- xfs_trans_cancel(tp);
- return error;
-}
-
static int
xfs_alloc_rsum_cache(
struct xfs_mount *mp,
@@ -1064,10 +949,12 @@ xfs_growfs_rt(
/*
* Allocate space to the bitmap and summary files, as necessary.
*/
- error = xfs_growfs_rt_alloc(mp, rbmblocks, nrbmblocks, mp->m_rbmip);
+ error = xfs_rtfile_initialize_blocks(mp->m_rbmip, rbmblocks,
+ nrbmblocks, NULL);
if (error)
goto out_unlock;
- error = xfs_growfs_rt_alloc(mp, rsumblocks, nrsumblocks, mp->m_rsumip);
+ error = xfs_rtfile_initialize_blocks(mp->m_rsumip, rsumblocks,
+ nrsumblocks, NULL);
if (error)
goto out_unlock;
* [PATCH 12/12] xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
` (10 preceding siblings ...)
2024-08-23 0:11 ` [PATCH 11/12] xfs: factor out rtbitmap/summary initialization helpers Darrick J. Wong
@ 2024-08-23 0:12 ` Darrick J. Wong
11 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:12 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
To prepare for being able to join an already locked rtbitmap inode to a
transaction, split out separate helpers for joining the transaction from
the locking helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 3 ++-
fs/xfs/libxfs/xfs_rtbitmap.c | 24 +++++++++++++-----------
fs/xfs/libxfs/xfs_rtbitmap.h | 6 ++++--
fs/xfs/xfs_rtalloc.c | 6 ++++--
4 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b79803784b766..314fc7d55659a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5379,7 +5379,8 @@ xfs_bmap_del_extent_real(
*/
if (!(tp->t_flags & XFS_TRANS_RTBITMAP_LOCKED)) {
tp->t_flags |= XFS_TRANS_RTBITMAP_LOCKED;
- xfs_rtbitmap_lock(tp, mp);
+ xfs_rtbitmap_lock(mp);
+ xfs_rtbitmap_trans_join(tp);
}
error = xfs_rtfree_blocks(tp, del->br_startblock,
del->br_blockcount);
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 715d2c54ce029..d7c731aeee12d 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1201,23 +1201,25 @@ xfs_rtsummary_wordcount(
return XFS_FSB_TO_B(mp, blocks) >> XFS_WORDLOG;
}
-/*
- * Lock both realtime free space metadata inodes for a freespace update. If a
- * transaction is given, the inodes will be joined to the transaction and the
- * ILOCKs will be released on transaction commit.
- */
+/* Lock both realtime free space metadata inodes for a freespace update. */
void
xfs_rtbitmap_lock(
- struct xfs_trans *tp,
struct xfs_mount *mp)
{
xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
- if (tp)
- xfs_trans_ijoin(tp, mp->m_rbmip, XFS_ILOCK_EXCL);
-
xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
- if (tp)
- xfs_trans_ijoin(tp, mp->m_rsumip, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Join both realtime free space metadata inodes to the transaction. The
+ * ILOCKs will be released on transaction commit.
+ */
+void
+xfs_rtbitmap_trans_join(
+ struct xfs_trans *tp)
+{
+ xfs_trans_ijoin(tp, tp->t_mountp->m_rbmip, XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, tp->t_mountp->m_rsumip, XFS_ILOCK_EXCL);
}
/* Unlock both realtime free space metadata inodes after a freespace update. */
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 0d5ab5e2cb6a3..523d3d3c12c60 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -346,8 +346,9 @@ unsigned long long xfs_rtsummary_wordcount(struct xfs_mount *mp,
int xfs_rtfile_initialize_blocks(struct xfs_inode *ip,
xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb, void *data);
-void xfs_rtbitmap_lock(struct xfs_trans *tp, struct xfs_mount *mp);
+void xfs_rtbitmap_lock(struct xfs_mount *mp);
void xfs_rtbitmap_unlock(struct xfs_mount *mp);
+void xfs_rtbitmap_trans_join(struct xfs_trans *tp);
/* Lock the rt bitmap inode in shared mode */
#define XFS_RBMLOCK_BITMAP (1U << 0)
@@ -376,7 +377,8 @@ xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
# define xfs_rtbitmap_wordcount(mp, r) (0)
# define xfs_rtsummary_blockcount(mp, l, b) (0)
# define xfs_rtsummary_wordcount(mp, l, b) (0)
-# define xfs_rtbitmap_lock(tp, mp) do { } while (0)
+# define xfs_rtbitmap_lock(mp) do { } while (0)
+# define xfs_rtbitmap_trans_join(tp) do { } while (0)
# define xfs_rtbitmap_unlock(mp) do { } while (0)
# define xfs_rtbitmap_lock_shared(mp, lf) do { } while (0)
# define xfs_rtbitmap_unlock_shared(mp, lf) do { } while (0)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 52ed8448d9925..e809a8649c60c 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -741,7 +741,8 @@ xfs_growfs_rt_bmblock(
goto out_free;
nargs.tp = args.tp;
- xfs_rtbitmap_lock(args.tp, mp);
+ xfs_rtbitmap_lock(mp);
+ xfs_rtbitmap_trans_join(args.tp);
/*
* Update the bitmap inode's size ondisk and incore. We need to update
@@ -1319,7 +1320,8 @@ xfs_bmap_rtalloc(
* Lock out modifications to both the RT bitmap and summary inodes
*/
if (!rtlocked) {
- xfs_rtbitmap_lock(ap->tp, mp);
+ xfs_rtbitmap_lock(mp);
+ xfs_rtbitmap_trans_join(ap->tp);
rtlocked = true;
}
* [PATCH 01/10] xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
@ 2024-08-23 0:12 ` Darrick J. Wong
2024-08-23 0:12 ` [PATCH 02/10] xfs: ensure rtx mask/shift are correct after growfs Darrick J. Wong
` (8 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:12 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
After going to great lengths to calculate the transaction reservation
for the new geometry, we should also use it to allocate the transaction
it was calculated for.
Fixes: 578bd4ce7100 ("xfs: recompute growfsrtfree transaction reservation while growing rt volume")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index e809a8649c60c..1f31b08c95a06 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -732,10 +732,12 @@ xfs_growfs_rt_bmblock(
xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
nmp->m_sb.sb_rbmblocks));
- /* recompute growfsrt reservation from new rsumsize */
+ /*
+ * Recompute the growfsrt reservation from the new rsumsize, so that
+ * the transaction below uses the new, potentially larger value.
+ */
xfs_trans_resv_calc(nmp, &nmp->m_resv);
-
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtfree, 0, 0, 0,
+ error = xfs_trans_alloc(mp, &M_RES(nmp)->tr_growrtfree, 0, 0, 0,
&args.tp);
if (error)
goto out_free;
* [PATCH 02/10] xfs: ensure rtx mask/shift are correct after growfs
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
2024-08-23 0:12 ` [PATCH 01/10] xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock Darrick J. Wong
@ 2024-08-23 0:12 ` Darrick J. Wong
2024-08-23 0:12 ` [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block Darrick J. Wong
` (7 subsequent siblings)
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:12 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
When growfs sets a new extent size, it doesn't update the m_rtxblklog
and m_rtxblkmask values, which could lead to incorrect use of them if
they were set for the old extent size but aren't valid for the new one.
Add an xfs_mount_sb_set_rextsize helper that updates the two fields, and
also use it when calculating the new RT geometry instead of disabling
the optimization there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_sb.c | 12 ++++++++++--
fs/xfs/libxfs/xfs_sb.h | 2 ++
fs/xfs/xfs_rtalloc.c | 5 +++--
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a4221afb012b6..b83ce29640511 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -975,6 +975,15 @@ const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
.verify_write = xfs_sb_write_verify,
};
+void
+xfs_mount_sb_set_rextsize(
+ struct xfs_mount *mp,
+ struct xfs_sb *sbp)
+{
+ mp->m_rtxblklog = log2_if_power2(sbp->sb_rextsize);
+ mp->m_rtxblkmask = mask64_if_power2(sbp->sb_rextsize);
+}
+
/*
* xfs_mount_common
*
@@ -999,8 +1008,7 @@ xfs_sb_mount_common(
mp->m_blockmask = sbp->sb_blocksize - 1;
mp->m_blockwsize = sbp->sb_blocksize >> XFS_WORDLOG;
mp->m_blockwmask = mp->m_blockwsize - 1;
- mp->m_rtxblklog = log2_if_power2(sbp->sb_rextsize);
- mp->m_rtxblkmask = mask64_if_power2(sbp->sb_rextsize);
+ xfs_mount_sb_set_rextsize(mp, sbp);
mp->m_alloc_mxr[0] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 1);
mp->m_alloc_mxr[1] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 0);
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 796f02191dfd2..885c837559914 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -17,6 +17,8 @@ extern void xfs_log_sb(struct xfs_trans *tp);
extern int xfs_sync_sb(struct xfs_mount *mp, bool wait);
extern int xfs_sync_sb_buf(struct xfs_mount *mp);
extern void xfs_sb_mount_common(struct xfs_mount *mp, struct xfs_sb *sbp);
+void xfs_mount_sb_set_rextsize(struct xfs_mount *mp,
+ struct xfs_sb *sbp);
extern void xfs_sb_from_disk(struct xfs_sb *to, struct xfs_dsb *from);
extern void xfs_sb_to_disk(struct xfs_dsb *to, struct xfs_sb *from);
extern void xfs_sb_quota_from_disk(struct xfs_sb *sbp);
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 1f31b08c95a06..ffa417a3e8a76 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -721,8 +721,8 @@ xfs_growfs_rt_bmblock(
/*
* Calculate new sb and mount fields for this round.
*/
- nmp->m_rtxblklog = -1; /* don't use shift or masking */
nmp->m_sb.sb_rextsize = rextsize;
+ xfs_mount_sb_set_rextsize(nmp, &nmp->m_sb);
nmp->m_sb.sb_rbmblocks = bmbno + 1;
nmp->m_sb.sb_rblocks = min(nrblocks, nrblocks_step);
nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
@@ -809,10 +809,11 @@ xfs_growfs_rt_bmblock(
xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
/*
- * Update mp values into the real mp structure.
+ * Update the calculated values in the real mount structure.
*/
mp->m_rsumlevels = nmp->m_rsumlevels;
mp->m_rsumsize = nmp->m_rsumsize;
+ xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
/*
* Recompute the growfsrt reservation from the new rsumsize.
* [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
2024-08-23 0:12 ` [PATCH 01/10] xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock Darrick J. Wong
2024-08-23 0:12 ` [PATCH 02/10] xfs: ensure rtx mask/shift are correct after growfs Darrick J. Wong
@ 2024-08-23 0:12 ` Darrick J. Wong
2024-08-23 4:57 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block Darrick J. Wong
` (6 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:12 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
If xfs_rtallocate_extent_block is asked for a variable-sized allocation,
it will try to return the best-sized free extent, which is apparently
the largest one that it finds starting in this rtbitmap block. It will
then trim the size of the extent as needed to align it with prod.
However, it misses one thing -- rounding down the best-fit candidate to
the required alignment could make the extent shorter than minlen. In
the case where minlen > 1, we'd rather the caller relaxed its alignment
requirements and tried again, as the allocator already supports that.
Returning a too-short extent causes xfs_bmapi_write to return ENOSR if
there aren't enough nmaps to handle multiple new allocations, which can
then cause filesystem shutdowns.
I haven't seen this happen on any production systems, but then I don't
think it's very common to set a per-file extent size hint on realtime
files. I tripped it while working on the rtgroups feature and pounding
on the realtime allocator enthusiastically.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index ffa417a3e8a76..3d78dc0940190 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -291,16 +291,9 @@ xfs_rtallocate_extent_block(
return error;
}
- /*
- * Searched the whole thing & didn't find a maxlen free extent.
- */
- if (minlen > maxlen || besti == -1) {
- /*
- * Allocation failed. Set *nextp to the next block to try.
- */
- *nextp = next;
- return -ENOSPC;
- }
+ /* Searched the whole thing & didn't find a maxlen free extent. */
+ if (minlen > maxlen || besti == -1)
+ goto nospace;
/*
* If size should be a multiple of prod, make that so.
@@ -313,12 +306,20 @@ xfs_rtallocate_extent_block(
bestlen -= p;
}
+ /* Don't return a too-short extent. */
+ if (bestlen < minlen)
+ goto nospace;
+
/*
* Pick besti for bestlen & return that.
*/
*len = bestlen;
*rtx = besti;
return 0;
+nospace:
+ /* Allocation failed. Set *nextp to the next block to try. */
+ *nextp = next;
+ return -ENOSPC;
}
/*
* [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (2 preceding siblings ...)
2024-08-23 0:12 ` [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 0:13 ` Darrick J. Wong
2024-08-23 4:57 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 05/10] xfs: refactor aligning bestlen to prod Darrick J. Wong
` (5 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:13 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The loop conditional here is not quite correct because an rtbitmap block
can represent rtextents beyond the end of the rt volume. There's no way
that it makes sense to scan for free space beyond EOFS, so don't do it.
This overrun has been present since v2.6.0.
Also fix the type of bestlen, which was incorrectly converted.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3d78dc0940190..7e45e1c74c027 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -231,22 +231,20 @@ xfs_rtallocate_extent_block(
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
struct xfs_mount *mp = args->mp;
- xfs_rtxnum_t besti; /* best rtext found so far */
- xfs_rtxnum_t bestlen;/* best length found so far */
+ xfs_rtxnum_t besti = -1; /* best rtext found so far */
xfs_rtxnum_t end; /* last rtext in chunk */
- int error;
xfs_rtxnum_t i; /* current rtext trying */
xfs_rtxnum_t next; /* next rtext to try */
+ xfs_rtxlen_t bestlen = 0; /* best length found so far */
int stat; /* status from internal calls */
+ int error;
/*
- * Loop over all the extents starting in this bitmap block,
- * looking for one that's long enough.
+ * Loop over all the extents starting in this bitmap block up to the
+ * end of the rt volume, looking for one that's long enough.
*/
- for (i = xfs_rbmblock_to_rtx(mp, bbno), besti = -1, bestlen = 0,
- end = xfs_rbmblock_to_rtx(mp, bbno + 1) - 1;
- i <= end;
- i++) {
+ end = min(mp->m_sb.sb_rextents, xfs_rbmblock_to_rtx(mp, bbno + 1)) - 1;
+ for (i = xfs_rbmblock_to_rtx(mp, bbno); i <= end; i++) {
/* Make sure we don't scan off the end of the rt volume. */
maxlen = xfs_rtallocate_clamp_len(mp, i, maxlen, prod);
* [PATCH 05/10] xfs: refactor aligning bestlen to prod
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (3 preceding siblings ...)
2024-08-23 0:13 ` [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 0:13 ` Darrick J. Wong
2024-08-23 4:58 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit Darrick J. Wong
` (4 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:13 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
There are two places in xfs_rtalloc.c where we want to make sure that a
count of rt extents is aligned with a particular prod(uct) factor. In
one spot, we actually use rounddown(), albeit unnecessarily if prod < 2.
In the other case, we open-code this rounding inefficiently by promoting
the 32-bit length value to a 64-bit value and then performing a 64-bit
division to figure out the subtraction.
Refactor this into a single helper that uses the correct types and
division method for the type, and skips the division entirely unless
prod is large enough to make a difference.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 7e45e1c74c027..54f34d7d4c199 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -196,6 +196,17 @@ xfs_rtallocate_range(
return xfs_rtmodify_range(args, start, len, 0);
}
+/* Reduce @rtxlen until it is a multiple of @prod. */
+static inline xfs_rtxlen_t
+xfs_rtalloc_align_len(
+ xfs_rtxlen_t rtxlen,
+ xfs_rtxlen_t prod)
+{
+ if (unlikely(prod > 1))
+ return rounddown(rtxlen, prod);
+ return rtxlen;
+}
+
/*
* Make sure we don't run off the end of the rt volume. Be careful that
* adjusting maxlen downwards doesn't cause us to fail the alignment checks.
@@ -210,7 +221,7 @@ xfs_rtallocate_clamp_len(
xfs_rtxlen_t ret;
ret = min(mp->m_sb.sb_rextents, startrtx + rtxlen) - startrtx;
- return rounddown(ret, prod);
+ return xfs_rtalloc_align_len(ret, prod);
}
/*
@@ -294,17 +305,10 @@ xfs_rtallocate_extent_block(
goto nospace;
/*
- * If size should be a multiple of prod, make that so.
+ * Ensure bestlen is a multiple of prod, but don't return a too-short
+ * extent.
*/
- if (prod > 1) {
- xfs_rtxlen_t p; /* amount to trim length by */
-
- div_u64_rem(bestlen, prod, &p);
- if (p)
- bestlen -= p;
- }
-
- /* Don't return a too-short extent. */
+ bestlen = xfs_rtalloc_align_len(bestlen, prod);
if (bestlen < minlen)
goto nospace;
* [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:13 ` [PATCH 05/10] xfs: refactor aligning bestlen to prod Darrick J. Wong
@ 2024-08-23 0:13 ` Darrick J. Wong
2024-08-23 4:58 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near Darrick J. Wong
` (3 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:13 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Before we start doing more surgery on the rt allocator, let's clean up
the exact allocator so that it doesn't change its arguments and uses the
helper introduced in the previous patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 41 +++++++++++++++++++++--------------------
1 file changed, 21 insertions(+), 20 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 54f34d7d4c199..2fe3f6563cad3 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -340,10 +340,10 @@ xfs_rtallocate_extent_exact(
xfs_rtxlen_t prod, /* extent product factor */
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
- int error;
- xfs_rtxlen_t i; /* extent length trimmed due to prod */
- int isfree; /* extent is free */
xfs_rtxnum_t next; /* next rtext to try (dummy) */
+ xfs_rtxlen_t alloclen; /* candidate length */
+ int isfree; /* extent is free */
+ int error;
ASSERT(minlen % prod == 0);
ASSERT(maxlen % prod == 0);
@@ -354,25 +354,26 @@ xfs_rtallocate_extent_exact(
if (error)
return error;
- if (!isfree) {
- /*
- * If not, allocate what there is, if it's at least minlen.
- */
- maxlen = next - start;
- if (maxlen < minlen)
- return -ENOSPC;
-
- /*
- * Trim off tail of extent, if prod is specified.
- */
- if (prod > 1 && (i = maxlen % prod)) {
- maxlen -= i;
- if (maxlen < minlen)
- return -ENOSPC;
- }
+ if (isfree) {
+ /* start to maxlen is all free; allocate it. */
+ *len = maxlen;
+ *rtx = start;
+ return 0;
}
- *len = maxlen;
+ /*
+ * If not, allocate what there is, if it's at least minlen.
+ */
+ alloclen = next - start;
+ if (alloclen < minlen)
+ return -ENOSPC;
+
+ /* Ensure alloclen is a multiple of prod. */
+ alloclen = xfs_rtalloc_align_len(alloclen, prod);
+ if (alloclen < minlen)
+ return -ENOSPC;
+
+ *len = alloclen;
*rtx = start;
return 0;
}
* [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (5 preceding siblings ...)
2024-08-23 0:13 ` [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit Darrick J. Wong
@ 2024-08-23 0:13 ` Darrick J. Wong
2024-08-23 4:59 ` Christoph Hellwig
2024-08-23 0:14 ` [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block Darrick J. Wong
` (2 subsequent siblings)
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:13 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The near rt allocator employs two allocation strategies -- first it
tries to allocate at exactly @start. If that fails, it will pivot back
and forth around that starting point looking for an appropriately sized
free space.
However, I clamped maxlen ages ago to prevent the exact allocation scan
from running off the end of the rt volume. This, I realize, was
excessive. If the allocation request is (say) for 32 rtx but the start
position is 5 rtx from the end of the volume, we clamp maxlen to 5. If
the exact allocation fails, we then pivot back and forth looking for 5
rtx, even though the original intent was to try to get 32 rtx.
If we then find 5 rtx when we could have gotten 32 rtx, we've not done
as well as we could have. This may be moot if the caller immediately
comes back for more space, but it might not be. Either way, we can do
better here.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 2fe3f6563cad3..3dafe37f01f64 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -340,23 +340,29 @@ xfs_rtallocate_extent_exact(
xfs_rtxlen_t prod, /* extent product factor */
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
+ struct xfs_mount *mp = args->mp;
xfs_rtxnum_t next; /* next rtext to try (dummy) */
xfs_rtxlen_t alloclen; /* candidate length */
+ xfs_rtxlen_t scanlen; /* number of free rtx to look for */
int isfree; /* extent is free */
int error;
ASSERT(minlen % prod == 0);
ASSERT(maxlen % prod == 0);
- /*
- * Check if the range in question (for maxlen) is free.
- */
- error = xfs_rtcheck_range(args, start, maxlen, 1, &next, &isfree);
+
+ /* Make sure we don't run off the end of the rt volume. */
+ scanlen = xfs_rtallocate_clamp_len(mp, start, maxlen, prod);
+ if (scanlen < minlen)
+ return -ENOSPC;
+
+ /* Check if the range in question (for scanlen) is free. */
+ error = xfs_rtcheck_range(args, start, scanlen, 1, &next, &isfree);
if (error)
return error;
if (isfree) {
- /* start to maxlen is all free; allocate it. */
- *len = maxlen;
+ /* start to scanlen is all free; allocate it. */
+ *len = scanlen;
*rtx = start;
return 0;
}
@@ -412,11 +418,6 @@ xfs_rtallocate_extent_near(
if (start >= mp->m_sb.sb_rextents)
start = mp->m_sb.sb_rextents - 1;
- /* Make sure we don't run off the end of the rt volume. */
- maxlen = xfs_rtallocate_clamp_len(mp, start, maxlen, prod);
- if (maxlen < minlen)
- return -ENOSPC;
-
/*
* Try the exact allocation first.
*/
* [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (6 preceding siblings ...)
2024-08-23 0:13 ` [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near Darrick J. Wong
@ 2024-08-23 0:14 ` Darrick J. Wong
2024-08-23 4:59 ` Christoph Hellwig
2024-08-23 0:14 ` [PATCH 09/10] xfs: remove xfs_rtb_to_rtxrem Darrick J. Wong
2024-08-23 0:14 ` [PATCH 10/10] xfs: simplify xfs_rtalloc_query_range Darrick J. Wong
9 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:14 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
This function tries to find a suitable free space extent starting from
a particular rtbitmap block. Some time ago, I added a clamping function
to prevent the free space scans from running off the end of the bitmap,
but I didn't quite get the logic right.
Let's say there's an allocation request with a minlen of 5 and a maxlen
of 32 and we're scanning the last rtbitmap block. If we come within 4
rtx of the end of the rt volume, maxlen will get clamped to 4. If the
next 3 rtx are free, we could have satisfied the allocation, but the
code setting partial besti/bestlen for "minlen < maxlen" will think that
we're doing a non-variable allocation and ignore it.
The root of this problem is overwriting maxlen; I should have stuffed
the results in a different variable, which would not have introduced
this bug.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3dafe37f01f64..4e7db8d4c0827 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -246,6 +246,7 @@ xfs_rtallocate_extent_block(
xfs_rtxnum_t end; /* last rtext in chunk */
xfs_rtxnum_t i; /* current rtext trying */
xfs_rtxnum_t next; /* next rtext to try */
+ xfs_rtxlen_t scanlen; /* number of free rtx to look for */
xfs_rtxlen_t bestlen = 0; /* best length found so far */
int stat; /* status from internal calls */
int error;
@@ -257,20 +258,22 @@ xfs_rtallocate_extent_block(
end = min(mp->m_sb.sb_rextents, xfs_rbmblock_to_rtx(mp, bbno + 1)) - 1;
for (i = xfs_rbmblock_to_rtx(mp, bbno); i <= end; i++) {
/* Make sure we don't scan off the end of the rt volume. */
- maxlen = xfs_rtallocate_clamp_len(mp, i, maxlen, prod);
+ scanlen = xfs_rtallocate_clamp_len(mp, i, maxlen, prod);
+ if (scanlen < minlen)
+ break;
/*
- * See if there's a free extent of maxlen starting at i.
+ * See if there's a free extent of scanlen starting at i.
* If it's not so then next will contain the first non-free.
*/
- error = xfs_rtcheck_range(args, i, maxlen, 1, &next, &stat);
+ error = xfs_rtcheck_range(args, i, scanlen, 1, &next, &stat);
if (error)
return error;
if (stat) {
/*
- * i for maxlen is all free, allocate and return that.
+ * i to scanlen is all free, allocate and return that.
*/
- *len = maxlen;
+ *len = scanlen;
*rtx = i;
return 0;
}
@@ -301,7 +304,7 @@ xfs_rtallocate_extent_block(
}
/* Searched the whole thing & didn't find a maxlen free extent. */
- if (minlen > maxlen || besti == -1)
+ if (besti == -1)
goto nospace;
/*
* [PATCH 09/10] xfs: remove xfs_rtb_to_rtxrem
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (7 preceding siblings ...)
2024-08-23 0:14 ` [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 0:14 ` Darrick J. Wong
2024-08-23 0:14 ` [PATCH 10/10] xfs: simplify xfs_rtalloc_query_range Darrick J. Wong
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:14 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Reduce the number of block number conversion helpers by removing
xfs_rtb_to_rtxrem. Any recent compiler is smart enough to eliminate
the double divisions when using separate xfs_rtb_to_rtx and
xfs_rtb_to_rtxoff calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 9 ++++-----
fs/xfs/libxfs/xfs_rtbitmap.h | 18 ------------------
2 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index d7c731aeee12d..431ef62939caa 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1022,25 +1022,24 @@ xfs_rtfree_blocks(
xfs_filblks_t rtlen)
{
struct xfs_mount *mp = tp->t_mountp;
- xfs_rtxnum_t start;
- xfs_filblks_t len;
xfs_extlen_t mod;
ASSERT(rtlen <= XFS_MAX_BMBT_EXTLEN);
- len = xfs_rtb_to_rtxrem(mp, rtlen, &mod);
+ mod = xfs_rtb_to_rtxoff(mp, rtlen);
if (mod) {
ASSERT(mod == 0);
return -EIO;
}
- start = xfs_rtb_to_rtxrem(mp, rtbno, &mod);
+ mod = xfs_rtb_to_rtxoff(mp, rtbno);
if (mod) {
ASSERT(mod == 0);
return -EIO;
}
- return xfs_rtfree_extent(tp, start, len);
+ return xfs_rtfree_extent(tp, xfs_rtb_to_rtx(mp, rtbno),
+ xfs_rtb_to_rtx(mp, rtlen));
}
/* Find all the free records within a given range. */
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 523d3d3c12c60..69ddacd4b01e6 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -86,24 +86,6 @@ xfs_rtb_to_rtxoff(
return do_div(rtbno, mp->m_sb.sb_rextsize);
}
-/*
- * Crack an rt block number into an rt extent number and an offset within that
- * rt extent. Returns the rt extent number directly and the offset in @off.
- */
-static inline xfs_rtxnum_t
-xfs_rtb_to_rtxrem(
- struct xfs_mount *mp,
- xfs_rtblock_t rtbno,
- xfs_extlen_t *off)
-{
- if (likely(mp->m_rtxblklog >= 0)) {
- *off = rtbno & mp->m_rtxblkmask;
- return rtbno >> mp->m_rtxblklog;
- }
-
- return div_u64_rem(rtbno, mp->m_sb.sb_rextsize, off);
-}
-
/*
* Convert an rt block number into an rt extent number, rounding up to the next
* rt extent if the rt block is not aligned to an rt extent boundary.
* [PATCH 10/10] xfs: simplify xfs_rtalloc_query_range
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:14 ` [PATCH 09/10] xfs: remove xfs_rtb_to_rtxrem Darrick J. Wong
@ 2024-08-23 0:14 ` Darrick J. Wong
9 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:14 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
There isn't much of a good reason to pass the xfs_rtalloc_rec structures
that describe extents to xfs_rtalloc_query_range as we really just want
a lower and upper bound xfs_rtxnum_t. Pass the rtxnum directly and
simplify the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 42 +++++++++++++++++-------------------------
fs/xfs/libxfs/xfs_rtbitmap.h | 3 +--
fs/xfs/xfs_discard.c | 15 +++++++--------
fs/xfs/xfs_fsmap.c | 11 +++++------
4 files changed, 30 insertions(+), 41 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 431ef62939caa..c58eb75ef0fa0 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1047,8 +1047,8 @@ int
xfs_rtalloc_query_range(
struct xfs_mount *mp,
struct xfs_trans *tp,
- const struct xfs_rtalloc_rec *low_rec,
- const struct xfs_rtalloc_rec *high_rec,
+ xfs_rtxnum_t start,
+ xfs_rtxnum_t end,
xfs_rtalloc_query_range_fn fn,
void *priv)
{
@@ -1056,45 +1056,42 @@ xfs_rtalloc_query_range(
.mp = mp,
.tp = tp,
};
- struct xfs_rtalloc_rec rec;
- xfs_rtxnum_t rtstart;
- xfs_rtxnum_t rtend;
- xfs_rtxnum_t high_key;
- int is_free;
int error = 0;
- if (low_rec->ar_startext > high_rec->ar_startext)
+ if (start > end)
return -EINVAL;
- if (low_rec->ar_startext >= mp->m_sb.sb_rextents ||
- low_rec->ar_startext == high_rec->ar_startext)
+ if (start == end || start >= mp->m_sb.sb_rextents)
return 0;
- high_key = min(high_rec->ar_startext, mp->m_sb.sb_rextents - 1);
+ end = min(end, mp->m_sb.sb_rextents - 1);
/* Iterate the bitmap, looking for discrepancies. */
- rtstart = low_rec->ar_startext;
- while (rtstart <= high_key) {
+ while (start <= end) {
+ struct xfs_rtalloc_rec rec;
+ int is_free;
+ xfs_rtxnum_t rtend;
+
/* Is the first block free? */
- error = xfs_rtcheck_range(&args, rtstart, 1, 1, &rtend,
+ error = xfs_rtcheck_range(&args, start, 1, 1, &rtend,
&is_free);
if (error)
break;
/* How long does the extent go for? */
- error = xfs_rtfind_forw(&args, rtstart, high_key, &rtend);
+ error = xfs_rtfind_forw(&args, start, end, &rtend);
if (error)
break;
if (is_free) {
- rec.ar_startext = rtstart;
- rec.ar_extcount = rtend - rtstart + 1;
+ rec.ar_startext = start;
+ rec.ar_extcount = rtend - start + 1;
error = fn(mp, tp, &rec, priv);
if (error)
break;
}
- rtstart = rtend + 1;
+ start = rtend + 1;
}
xfs_rtbuf_cache_relse(&args);
@@ -1109,13 +1106,8 @@ xfs_rtalloc_query_all(
xfs_rtalloc_query_range_fn fn,
void *priv)
{
- struct xfs_rtalloc_rec keys[2];
-
- keys[0].ar_startext = 0;
- keys[1].ar_startext = mp->m_sb.sb_rextents - 1;
- keys[0].ar_extcount = keys[1].ar_extcount = 0;
-
- return xfs_rtalloc_query_range(mp, tp, &keys[0], &keys[1], fn, priv);
+ return xfs_rtalloc_query_range(mp, tp, 0, mp->m_sb.sb_rextents - 1, fn,
+ priv);
}
/* Is the given extent all free? */
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 69ddacd4b01e6..0dbc9bb40668a 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -292,8 +292,7 @@ int xfs_rtmodify_summary(struct xfs_rtalloc_args *args, int log,
int xfs_rtfree_range(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
xfs_rtxlen_t len);
int xfs_rtalloc_query_range(struct xfs_mount *mp, struct xfs_trans *tp,
- const struct xfs_rtalloc_rec *low_rec,
- const struct xfs_rtalloc_rec *high_rec,
+ xfs_rtxnum_t start, xfs_rtxnum_t end,
xfs_rtalloc_query_range_fn fn, void *priv);
int xfs_rtalloc_query_all(struct xfs_mount *mp, struct xfs_trans *tp,
xfs_rtalloc_query_range_fn fn,
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 25f5dffeab2ae..bf1e3f330018d 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -554,11 +554,10 @@ xfs_trim_rtdev_extents(
xfs_daddr_t end,
xfs_daddr_t minlen)
{
- struct xfs_rtalloc_rec low = { };
- struct xfs_rtalloc_rec high = { };
struct xfs_trim_rtdev tr = {
.minlen_fsb = XFS_BB_TO_FSB(mp, minlen),
};
+ xfs_rtxnum_t low, high;
struct xfs_trans *tp;
xfs_daddr_t rtdev_daddr;
int error;
@@ -584,17 +583,17 @@ xfs_trim_rtdev_extents(
XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks) - 1);
/* Convert the rt blocks to rt extents */
- low.ar_startext = xfs_rtb_to_rtxup(mp, XFS_BB_TO_FSB(mp, start));
- high.ar_startext = xfs_rtb_to_rtx(mp, XFS_BB_TO_FSBT(mp, end));
+ low = xfs_rtb_to_rtxup(mp, XFS_BB_TO_FSB(mp, start));
+ high = xfs_rtb_to_rtx(mp, XFS_BB_TO_FSBT(mp, end));
/*
* Walk the free ranges between low and high. The query_range function
* trims the extents returned.
*/
do {
- tr.stop_rtx = low.ar_startext + (mp->m_sb.sb_blocksize * NBBY);
+ tr.stop_rtx = low + (mp->m_sb.sb_blocksize * NBBY);
xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
- error = xfs_rtalloc_query_range(mp, tp, &low, &high,
+ error = xfs_rtalloc_query_range(mp, tp, low, high,
xfs_trim_gather_rtextent, &tr);
if (error == -ECANCELED)
@@ -615,8 +614,8 @@ xfs_trim_rtdev_extents(
if (error)
break;
- low.ar_startext = tr.restart_rtx;
- } while (!xfs_trim_should_stop() && low.ar_startext <= high.ar_startext);
+ low = tr.restart_rtx;
+ } while (!xfs_trim_should_stop() && low <= high);
xfs_trans_cancel(tp);
return error;
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 71f32354944e4..e154466268757 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -520,11 +520,11 @@ xfs_getfsmap_rtdev_rtbitmap(
struct xfs_getfsmap_info *info)
{
- struct xfs_rtalloc_rec alow = { 0 };
struct xfs_rtalloc_rec ahigh = { 0 };
struct xfs_mount *mp = tp->t_mountp;
xfs_rtblock_t start_rtb;
xfs_rtblock_t end_rtb;
+ xfs_rtxnum_t high;
uint64_t eofs;
int error;
@@ -553,10 +553,9 @@ xfs_getfsmap_rtdev_rtbitmap(
* Set up query parameters to return free rtextents covering the range
* we want.
*/
- alow.ar_startext = xfs_rtb_to_rtx(mp, start_rtb);
- ahigh.ar_startext = xfs_rtb_to_rtxup(mp, end_rtb);
- error = xfs_rtalloc_query_range(mp, tp, &alow, &ahigh,
- xfs_getfsmap_rtdev_rtbitmap_helper, info);
+ high = xfs_rtb_to_rtxup(mp, end_rtb);
+ error = xfs_rtalloc_query_range(mp, tp, xfs_rtb_to_rtx(mp, start_rtb),
+ high, xfs_getfsmap_rtdev_rtbitmap_helper, info);
if (error)
goto err;
@@ -565,7 +564,7 @@ xfs_getfsmap_rtdev_rtbitmap(
* rmap starting at the block after the end of the query range.
*/
info->last = true;
- ahigh.ar_startext = min(mp->m_sb.sb_rextents, ahigh.ar_startext);
+ ahigh.ar_startext = min(mp->m_sb.sb_rextents, high);
error = xfs_getfsmap_rtdev_rtbitmap_helper(mp, tp, &ahigh, info);
if (error)
* [PATCH 01/24] xfs: clean up the ISVALID macro in xfs_bmap_adjacent
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
@ 2024-08-23 0:14 ` Darrick J. Wong
2024-08-23 0:15 ` [PATCH 02/24] xfs: factor out a xfs_rtallocate helper Darrick J. Wong
` (22 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:14 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Turn the ISVALID macro, which is defined and used inside xfs_bmap_adjacent
and relies on implicit context, into a proper inline function.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 55 +++++++++++++++++++++++++++-------------------
1 file changed, 32 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 314fc7d55659a..3a8796f165d6d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3115,6 +3115,23 @@ xfs_bmap_extsize_align(
return 0;
}
+static inline bool
+xfs_bmap_adjacent_valid(
+ struct xfs_bmalloca *ap,
+ xfs_fsblock_t x,
+ xfs_fsblock_t y)
+{
+ struct xfs_mount *mp = ap->ip->i_mount;
+
+ if (XFS_IS_REALTIME_INODE(ap->ip) &&
+ (ap->datatype & XFS_ALLOC_USERDATA))
+ return x < mp->m_sb.sb_rblocks;
+
+ return XFS_FSB_TO_AGNO(mp, x) == XFS_FSB_TO_AGNO(mp, y) &&
+ XFS_FSB_TO_AGNO(mp, x) < mp->m_sb.sb_agcount &&
+ XFS_FSB_TO_AGBNO(mp, x) < mp->m_sb.sb_agblocks;
+}
+
#define XFS_ALLOC_GAP_UNITS 4
/* returns true if ap->blkno was modified */
@@ -3122,36 +3139,25 @@ bool
xfs_bmap_adjacent(
struct xfs_bmalloca *ap) /* bmap alloc argument struct */
{
- xfs_fsblock_t adjust; /* adjustment to block numbers */
- xfs_mount_t *mp; /* mount point structure */
- int rt; /* true if inode is realtime */
+ xfs_fsblock_t adjust; /* adjustment to block numbers */
-#define ISVALID(x,y) \
- (rt ? \
- (x) < mp->m_sb.sb_rblocks : \
- XFS_FSB_TO_AGNO(mp, x) == XFS_FSB_TO_AGNO(mp, y) && \
- XFS_FSB_TO_AGNO(mp, x) < mp->m_sb.sb_agcount && \
- XFS_FSB_TO_AGBNO(mp, x) < mp->m_sb.sb_agblocks)
-
- mp = ap->ip->i_mount;
- rt = XFS_IS_REALTIME_INODE(ap->ip) &&
- (ap->datatype & XFS_ALLOC_USERDATA);
/*
* If allocating at eof, and there's a previous real block,
* try to use its last block as our starting point.
*/
if (ap->eof && ap->prev.br_startoff != NULLFILEOFF &&
!isnullstartblock(ap->prev.br_startblock) &&
- ISVALID(ap->prev.br_startblock + ap->prev.br_blockcount,
- ap->prev.br_startblock)) {
+ xfs_bmap_adjacent_valid(ap,
+ ap->prev.br_startblock + ap->prev.br_blockcount,
+ ap->prev.br_startblock)) {
ap->blkno = ap->prev.br_startblock + ap->prev.br_blockcount;
/*
* Adjust for the gap between prevp and us.
*/
adjust = ap->offset -
(ap->prev.br_startoff + ap->prev.br_blockcount);
- if (adjust &&
- ISVALID(ap->blkno + adjust, ap->prev.br_startblock))
+ if (adjust && xfs_bmap_adjacent_valid(ap, ap->blkno + adjust,
+ ap->prev.br_startblock))
ap->blkno += adjust;
return true;
}
@@ -3174,7 +3180,8 @@ xfs_bmap_adjacent(
!isnullstartblock(ap->prev.br_startblock) &&
(prevbno = ap->prev.br_startblock +
ap->prev.br_blockcount) &&
- ISVALID(prevbno, ap->prev.br_startblock)) {
+ xfs_bmap_adjacent_valid(ap, prevbno,
+ ap->prev.br_startblock)) {
/*
* Calculate gap to end of previous block.
*/
@@ -3190,8 +3197,8 @@ xfs_bmap_adjacent(
* number, then just use the end of the previous block.
*/
if (prevdiff <= XFS_ALLOC_GAP_UNITS * ap->length &&
- ISVALID(prevbno + prevdiff,
- ap->prev.br_startblock))
+ xfs_bmap_adjacent_valid(ap, prevbno + prevdiff,
+ ap->prev.br_startblock))
prevbno += adjust;
else
prevdiff += adjust;
@@ -3223,9 +3230,11 @@ xfs_bmap_adjacent(
* offset by our length.
*/
if (gotdiff <= XFS_ALLOC_GAP_UNITS * ap->length &&
- ISVALID(gotbno - gotdiff, gotbno))
+ xfs_bmap_adjacent_valid(ap, gotbno - gotdiff,
+ gotbno))
gotbno -= adjust;
- else if (ISVALID(gotbno - ap->length, gotbno)) {
+ else if (xfs_bmap_adjacent_valid(ap, gotbno - ap->length,
+ gotbno)) {
gotbno -= ap->length;
gotdiff += adjust - ap->length;
} else
@@ -3253,7 +3262,7 @@ xfs_bmap_adjacent(
return true;
}
}
-#undef ISVALID
+
return false;
}
* [PATCH 02/24] xfs: factor out a xfs_rtallocate helper
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:14 ` [PATCH 01/24] xfs: clean up the ISVALID macro in xfs_bmap_adjacent Darrick J. Wong
@ 2024-08-23 0:15 ` Darrick J. Wong
2024-08-23 0:15 ` [PATCH 03/24] xfs: rework the rtalloc fallback handling Darrick J. Wong
` (21 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:15 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Split the actual allocation out of xfs_bmap_rtalloc into a new
xfs_rtallocate helper. This keeps the scope of the xfs_rtalloc_args structure
contained, and prepares for rtgroups support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 81 +++++++++++++++++++++++++++++++-------------------
1 file changed, 50 insertions(+), 31 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 4e7db8d4c0827..861a82471b5d0 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1269,6 +1269,51 @@ xfs_rtalloc_align_minmax(
*raminlen = newminlen;
}
+static int
+xfs_rtallocate(
+ struct xfs_trans *tp,
+ xfs_rtxnum_t start,
+ xfs_rtxlen_t minlen,
+ xfs_rtxlen_t maxlen,
+ xfs_rtxlen_t prod,
+ bool wasdel,
+ xfs_rtblock_t *bno,
+ xfs_extlen_t *blen)
+{
+ struct xfs_rtalloc_args args = {
+ .mp = tp->t_mountp,
+ .tp = tp,
+ };
+ xfs_rtxnum_t rtx;
+ xfs_rtxlen_t len = 0;
+ int error;
+
+ if (start) {
+ error = xfs_rtallocate_extent_near(&args, start, minlen, maxlen,
+ &len, prod, &rtx);
+ } else {
+ error = xfs_rtallocate_extent_size(&args, minlen, maxlen, &len,
+ prod, &rtx);
+ }
+
+ if (error)
+ goto out_release;
+
+ error = xfs_rtallocate_range(&args, rtx, len);
+ if (error)
+ goto out_release;
+
+ xfs_trans_mod_sb(tp, wasdel ?
+ XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
+ -(long)len);
+ *bno = xfs_rtx_to_rtb(args.mp, rtx);
+ *blen = xfs_rtxlen_to_extlen(args.mp, len);
+
+out_release:
+ xfs_rtbuf_cache_relse(&args);
+ return error;
+}
+
int
xfs_bmap_rtalloc(
struct xfs_bmalloca *ap)
@@ -1276,7 +1321,6 @@ xfs_bmap_rtalloc(
struct xfs_mount *mp = ap->ip->i_mount;
xfs_fileoff_t orig_offset = ap->offset;
xfs_rtxnum_t start; /* allocation hint rtextent no */
- xfs_rtxnum_t rtx; /* actually allocated rtextent no */
xfs_rtxlen_t prod = 0; /* product factor for allocators */
xfs_extlen_t mod = 0; /* product factor for allocators */
xfs_rtxlen_t ralen = 0; /* realtime allocation length */
@@ -1286,10 +1330,6 @@ xfs_bmap_rtalloc(
xfs_rtxlen_t raminlen;
bool rtlocked = false;
bool ignore_locality = false;
- struct xfs_rtalloc_args args = {
- .mp = mp,
- .tp = ap->tp,
- };
int error;
align = xfs_get_extsz_hint(ap->ip);
@@ -1363,19 +1403,9 @@ xfs_bmap_rtalloc(
xfs_rtalloc_align_minmax(&raminlen, &ralen, &prod);
}
- if (start) {
- error = xfs_rtallocate_extent_near(&args, start, raminlen,
- ralen, &ralen, prod, &rtx);
- } else {
- error = xfs_rtallocate_extent_size(&args, raminlen,
- ralen, &ralen, prod, &rtx);
- }
-
- if (error) {
- xfs_rtbuf_cache_relse(&args);
- if (error != -ENOSPC)
- return error;
-
+ error = xfs_rtallocate(ap->tp, start, raminlen, ralen, prod, ap->wasdel,
+ &ap->blkno, &ap->length);
+ if (error == -ENOSPC) {
if (align > mp->m_sb.sb_rextsize) {
/*
* We previously enlarged the request length to try to
@@ -1403,20 +1433,9 @@ xfs_bmap_rtalloc(
ap->length = 0;
return 0;
}
-
- error = xfs_rtallocate_range(&args, rtx, ralen);
if (error)
- goto out_release;
+ return error;
- xfs_trans_mod_sb(ap->tp, ap->wasdel ?
- XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
- -(long)ralen);
-
- ap->blkno = xfs_rtx_to_rtb(mp, rtx);
- ap->length = xfs_rtxlen_to_extlen(mp, ralen);
xfs_bmap_alloc_account(ap);
-
-out_release:
- xfs_rtbuf_cache_relse(&args);
- return error;
+ return 0;
}
* [PATCH 03/24] xfs: rework the rtalloc fallback handling
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:14 ` [PATCH 01/24] xfs: clean up the ISVALID macro in xfs_bmap_adjacent Darrick J. Wong
2024-08-23 0:15 ` [PATCH 02/24] xfs: factor out a xfs_rtallocate helper Darrick J. Wong
@ 2024-08-23 0:15 ` Darrick J. Wong
2024-08-23 0:15 ` [PATCH 04/24] xfs: factor out a xfs_rtallocate_align helper Darrick J. Wong
` (20 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:15 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
xfs_rtallocate currently has two fallbacks, when an allocation fails:
1) drop the requested extent size alignment, if any, and retry
2) ignore the locality hint
Oddly enough it tries them in that order, even though trying a different
location is more in line with what the user asked for, and it implements
the fallbacks in a very unstructured way.
Lift the fallback to try to allocate without the locality hint into
xfs_rtallocate to both perform them in a more sensible order and to
clean up the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 69 +++++++++++++++++++++++++-------------------------
1 file changed, 34 insertions(+), 35 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 861a82471b5d0..f39f05397201a 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1277,6 +1277,8 @@ xfs_rtallocate(
xfs_rtxlen_t maxlen,
xfs_rtxlen_t prod,
bool wasdel,
+ bool initial_user_data,
+ bool *rtlocked,
xfs_rtblock_t *bno,
xfs_extlen_t *blen)
{
@@ -1286,12 +1288,38 @@ xfs_rtallocate(
};
xfs_rtxnum_t rtx;
xfs_rtxlen_t len = 0;
- int error;
+ int error = 0;
+
+ /*
+ * Lock out modifications to both the RT bitmap and summary inodes.
+ */
+ if (!*rtlocked) {
+ xfs_rtbitmap_lock(args.mp);
+ xfs_rtbitmap_trans_join(tp);
+ *rtlocked = true;
+ }
+
+ /*
+ * For an allocation to an empty file at offset 0, pick an extent that
+ * will space things out in the rt area.
+ */
+ if (!start && initial_user_data)
+ start = xfs_rtpick_extent(args.mp, tp, maxlen);
if (start) {
error = xfs_rtallocate_extent_near(&args, start, minlen, maxlen,
&len, prod, &rtx);
- } else {
+ /*
+ * If we can't allocate near a specific rt extent, try again
+ * without locality criteria.
+ */
+ if (error == -ENOSPC) {
+ xfs_rtbuf_cache_relse(&args);
+ error = 0;
+ }
+ }
+
+ if (!error) {
error = xfs_rtallocate_extent_size(&args, minlen, maxlen, &len,
prod, &rtx);
}
@@ -1320,7 +1348,7 @@ xfs_bmap_rtalloc(
{
struct xfs_mount *mp = ap->ip->i_mount;
xfs_fileoff_t orig_offset = ap->offset;
- xfs_rtxnum_t start; /* allocation hint rtextent no */
+ xfs_rtxnum_t start = 0; /* allocation hint rtextent no */
xfs_rtxlen_t prod = 0; /* product factor for allocators */
xfs_extlen_t mod = 0; /* product factor for allocators */
xfs_rtxlen_t ralen = 0; /* realtime allocation length */
@@ -1329,7 +1357,6 @@ xfs_bmap_rtalloc(
xfs_extlen_t minlen = mp->m_sb.sb_rextsize;
xfs_rtxlen_t raminlen;
bool rtlocked = false;
- bool ignore_locality = false;
int error;
align = xfs_get_extsz_hint(ap->ip);
@@ -1367,28 +1394,8 @@ xfs_bmap_rtalloc(
ASSERT(raminlen > 0);
ASSERT(raminlen <= ralen);
- /*
- * Lock out modifications to both the RT bitmap and summary inodes
- */
- if (!rtlocked) {
- xfs_rtbitmap_lock(mp);
- xfs_rtbitmap_trans_join(ap->tp);
- rtlocked = true;
- }
-
- if (ignore_locality) {
- start = 0;
- } else if (xfs_bmap_adjacent(ap)) {
+ if (xfs_bmap_adjacent(ap))
start = xfs_rtb_to_rtx(mp, ap->blkno);
- } else if (ap->datatype & XFS_ALLOC_INITIAL_USER_DATA) {
- /*
- * If it's an allocation to an empty file at offset 0, pick an
- * extent that will space things out in the rt area.
- */
- start = xfs_rtpick_extent(mp, ap->tp, ralen);
- } else {
- start = 0;
- }
/*
* Only bother calculating a real prod factor if offset & length are
@@ -1404,7 +1411,8 @@ xfs_bmap_rtalloc(
}
error = xfs_rtallocate(ap->tp, start, raminlen, ralen, prod, ap->wasdel,
- &ap->blkno, &ap->length);
+ ap->datatype & XFS_ALLOC_INITIAL_USER_DATA, &rtlocked,
+ &ap->blkno, &ap->length);
if (error == -ENOSPC) {
if (align > mp->m_sb.sb_rextsize) {
/*
@@ -1420,15 +1428,6 @@ xfs_bmap_rtalloc(
goto retry;
}
- if (!ignore_locality && start != 0) {
- /*
- * If we can't allocate near a specific rt extent, try
- * again without locality criteria.
- */
- ignore_locality = true;
- goto retry;
- }
-
ap->blkno = NULLFSBLOCK;
ap->length = 0;
return 0;
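The reworked fallback can be modeled in a few lines of userspace C. This is a sketch only, not kernel code: the `toy_*` names and the small search window in `alloc_near` are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NSLOTS 16

/* Toy free-space map: one flag per rt extent slot, true = free. */
struct toy_rt {
	bool	free[NSLOTS];
};

/* Allocate within a small window around @hint; -ENOSPC if none close by. */
static int alloc_near(struct toy_rt *rt, int hint, int *out)
{
	for (int d = 0; d <= 2; d++) {
		for (int sign = -1; sign <= 1; sign += 2) {
			int slot = hint + sign * d;

			if (slot >= 0 && slot < NSLOTS && rt->free[slot]) {
				rt->free[slot] = false;
				*out = slot;
				return 0;
			}
		}
	}
	return -ENOSPC;
}

/* Allocate anywhere, ignoring locality (the by-size allocator). */
static int alloc_any(struct toy_rt *rt, int *out)
{
	for (int slot = 0; slot < NSLOTS; slot++) {
		if (rt->free[slot]) {
			rt->free[slot] = false;
			*out = slot;
			return 0;
		}
	}
	return -ENOSPC;
}

/*
 * Mirror the reworked flow: try the locality-aware allocator only when a
 * hint exists, and on -ENOSPC fall straight through to the size-based
 * allocator in the same pass.
 */
static int toy_rtallocate(struct toy_rt *rt, int hint, int *out)
{
	if (hint >= 0) {
		int error = alloc_near(rt, hint, out);

		if (error != -ENOSPC)
			return error;
		/* Can't allocate near the hint; retry without locality. */
	}
	return alloc_any(rt, out);
}
```

The point of the restructuring is that a failed locality-aware attempt falls through to the unconstrained allocator directly, instead of the caller resetting state and jumping back to a retry label.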

* [PATCH 04/24] xfs: factor out a xfs_rtallocate_align helper
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:15 ` [PATCH 03/24] xfs: rework the rtalloc fallback handling Darrick J. Wong
@ 2024-08-23 0:15 ` Darrick J. Wong
2024-08-23 0:15 ` [PATCH 05/24] xfs: make the rtalloc start hint a xfs_rtblock_t Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:15 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Split the code to calculate the aligned allocation request from
xfs_bmap_rtalloc into a separate self-contained helper.
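The alignment math that moves into the new helper can be sketched standalone. This is a simplification with hypothetical names: `toy_extsize_align` only models the rounding that an extent size hint applies to a request (start rounds down, end rounds up), not the full `xfs_bmap_extsize_align` logic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen an (offset, length) request to multiples of @align, the way an
 * extent size hint enlarges a realtime allocation: the start rounds
 * down to an alignment boundary and the end rounds up to one.
 */
static void toy_extsize_align(uint64_t align, uint64_t *off, uint64_t *len)
{
	uint64_t start = *off - *off % align;
	uint64_t end = *off + *len;

	if (end % align)
		end += align - end % align;

	*off = start;
	*len = end - start;
}
```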
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 93 ++++++++++++++++++++++++++++++++------------------
1 file changed, 59 insertions(+), 34 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index f39f05397201a..7f20bc412d074 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1342,30 +1342,33 @@ xfs_rtallocate(
return error;
}
-int
-xfs_bmap_rtalloc(
- struct xfs_bmalloca *ap)
+static int
+xfs_rtallocate_align(
+ struct xfs_bmalloca *ap,
+ xfs_rtxlen_t *ralen,
+ xfs_rtxlen_t *raminlen,
+ xfs_rtxlen_t *prod,
+ bool *noalign)
{
struct xfs_mount *mp = ap->ip->i_mount;
xfs_fileoff_t orig_offset = ap->offset;
- xfs_rtxnum_t start = 0; /* allocation hint rtextent no */
- xfs_rtxlen_t prod = 0; /* product factor for allocators */
- xfs_extlen_t mod = 0; /* product factor for allocators */
- xfs_rtxlen_t ralen = 0; /* realtime allocation length */
- xfs_extlen_t align; /* minimum allocation alignment */
- xfs_extlen_t orig_length = ap->length;
xfs_extlen_t minlen = mp->m_sb.sb_rextsize;
- xfs_rtxlen_t raminlen;
- bool rtlocked = false;
+ xfs_extlen_t align; /* minimum allocation alignment */
+ xfs_extlen_t mod; /* product factor for allocators */
int error;
- align = xfs_get_extsz_hint(ap->ip);
- if (!align)
- align = 1;
-retry:
- error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
- align, 1, ap->eof, 0,
- ap->conv, &ap->offset, &ap->length);
+ if (*noalign) {
+ align = mp->m_sb.sb_rextsize;
+ } else {
+ align = xfs_get_extsz_hint(ap->ip);
+ if (!align)
+ align = 1;
+ if (align == mp->m_sb.sb_rextsize)
+ *noalign = true;
+ }
+
+ error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 1,
+ ap->eof, 0, ap->conv, &ap->offset, &ap->length);
if (error)
return error;
ASSERT(ap->length);
@@ -1389,32 +1392,54 @@ xfs_bmap_rtalloc(
* XFS_BMBT_MAX_EXTLEN), we don't hear about that number, and can't
* adjust the starting point to match it.
*/
- ralen = xfs_extlen_to_rtxlen(mp, min(ap->length, XFS_MAX_BMBT_EXTLEN));
- raminlen = max_t(xfs_rtxlen_t, 1, xfs_extlen_to_rtxlen(mp, minlen));
- ASSERT(raminlen > 0);
- ASSERT(raminlen <= ralen);
-
- if (xfs_bmap_adjacent(ap))
- start = xfs_rtb_to_rtx(mp, ap->blkno);
+ *ralen = xfs_extlen_to_rtxlen(mp, min(ap->length, XFS_MAX_BMBT_EXTLEN));
+ *raminlen = max_t(xfs_rtxlen_t, 1, xfs_extlen_to_rtxlen(mp, minlen));
+ ASSERT(*raminlen > 0);
+ ASSERT(*raminlen <= *ralen);
/*
* Only bother calculating a real prod factor if offset & length are
* perfectly aligned, otherwise it will just get us in trouble.
*/
div_u64_rem(ap->offset, align, &mod);
- if (mod || ap->length % align) {
- prod = 1;
- } else {
- prod = xfs_extlen_to_rtxlen(mp, align);
- if (prod > 1)
- xfs_rtalloc_align_minmax(&raminlen, &ralen, &prod);
- }
+ if (mod || ap->length % align)
+ *prod = 1;
+ else
+ *prod = xfs_extlen_to_rtxlen(mp, align);
+
+ if (*prod > 1)
+ xfs_rtalloc_align_minmax(raminlen, ralen, prod);
+ return 0;
+}
+
+int
+xfs_bmap_rtalloc(
+ struct xfs_bmalloca *ap)
+{
+ struct xfs_mount *mp = ap->ip->i_mount;
+ xfs_fileoff_t orig_offset = ap->offset;
+ xfs_rtxnum_t start = 0; /* allocation hint rtextent no */
+ xfs_rtxlen_t prod = 0; /* product factor for allocators */
+ xfs_rtxlen_t ralen = 0; /* realtime allocation length */
+ xfs_extlen_t orig_length = ap->length;
+ xfs_rtxlen_t raminlen;
+ bool rtlocked = false;
+ bool noalign = false;
+ int error;
+
+retry:
+ error = xfs_rtallocate_align(ap, &ralen, &raminlen, &prod, &noalign);
+ if (error)
+ return error;
+
+ if (xfs_bmap_adjacent(ap))
+ start = xfs_rtb_to_rtx(mp, ap->blkno);
error = xfs_rtallocate(ap->tp, start, raminlen, ralen, prod, ap->wasdel,
ap->datatype & XFS_ALLOC_INITIAL_USER_DATA, &rtlocked,
&ap->blkno, &ap->length);
if (error == -ENOSPC) {
- if (align > mp->m_sb.sb_rextsize) {
+ if (!noalign) {
/*
* We previously enlarged the request length to try to
* satisfy an extent size hint. The allocator didn't
@@ -1424,7 +1449,7 @@ xfs_bmap_rtalloc(
*/
ap->offset = orig_offset;
ap->length = orig_length;
- minlen = align = mp->m_sb.sb_rextsize;
+ noalign = true;
goto retry;
}
* [PATCH 05/24] xfs: make the rtalloc start hint a xfs_rtblock_t
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:15 ` [PATCH 04/24] xfs: factor out a xfs_rtallocate_align helper Darrick J. Wong
@ 2024-08-23 0:15 ` Darrick J. Wong
2024-08-23 0:16 ` [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:15 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
0 is a valid starting RT extent, and with pending changes it will become
both more common and non-unique. Switch to passing an xfs_rtblock_t
instead so that we can use NULLRTBLOCK to determine whether a hint was
set.
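The sentinel idea can be sketched in plain C. The `toy_*` names below are hypothetical stand-ins for the XFS types; the all-ones value plays the role of NULLRTBLOCK.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_rtblock_t;			/* stand-in for xfs_rtblock_t */
#define TOY_NULLRTBLOCK	((toy_rtblock_t)-1)	/* "no hint", like NULLRTBLOCK */

/*
 * With a plain number, a hint of 0 is ambiguous: it could mean "block 0"
 * or "no hint".  With a sentinel, block 0 is a perfectly usable hint and
 * only the all-ones value means the caller set no hint at all.
 */
static int toy_have_hint(toy_rtblock_t bno_hint)
{
	return bno_hint != TOY_NULLRTBLOCK;
}
```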
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 7f20bc412d074..7854cd355311b 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1272,7 +1272,7 @@ xfs_rtalloc_align_minmax(
static int
xfs_rtallocate(
struct xfs_trans *tp,
- xfs_rtxnum_t start,
+ xfs_rtblock_t bno_hint,
xfs_rtxlen_t minlen,
xfs_rtxlen_t maxlen,
xfs_rtxlen_t prod,
@@ -1286,6 +1286,7 @@ xfs_rtallocate(
.mp = tp->t_mountp,
.tp = tp,
};
+ xfs_rtxnum_t start = 0;
xfs_rtxnum_t rtx;
xfs_rtxlen_t len = 0;
int error = 0;
@@ -1303,7 +1304,9 @@ xfs_rtallocate(
* For an allocation to an empty file at offset 0, pick an extent that
* will space things out in the rt area.
*/
- if (!start && initial_user_data)
+ if (bno_hint)
+ start = xfs_rtb_to_rtx(args.mp, bno_hint);
+ else if (initial_user_data)
start = xfs_rtpick_extent(args.mp, tp, maxlen);
if (start) {
@@ -1416,15 +1419,16 @@ int
xfs_bmap_rtalloc(
struct xfs_bmalloca *ap)
{
- struct xfs_mount *mp = ap->ip->i_mount;
xfs_fileoff_t orig_offset = ap->offset;
- xfs_rtxnum_t start = 0; /* allocation hint rtextent no */
xfs_rtxlen_t prod = 0; /* product factor for allocators */
xfs_rtxlen_t ralen = 0; /* realtime allocation length */
+ xfs_rtblock_t bno_hint = NULLRTBLOCK;
xfs_extlen_t orig_length = ap->length;
xfs_rtxlen_t raminlen;
bool rtlocked = false;
bool noalign = false;
+ bool initial_user_data =
+ ap->datatype & XFS_ALLOC_INITIAL_USER_DATA;
int error;
retry:
@@ -1433,10 +1437,10 @@ xfs_bmap_rtalloc(
return error;
if (xfs_bmap_adjacent(ap))
- start = xfs_rtb_to_rtx(mp, ap->blkno);
+ bno_hint = ap->blkno;
- error = xfs_rtallocate(ap->tp, start, raminlen, ralen, prod, ap->wasdel,
- ap->datatype & XFS_ALLOC_INITIAL_USER_DATA, &rtlocked,
+ error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
+ ap->wasdel, initial_user_data, &rtlocked,
&ap->blkno, &ap->length);
if (error == -ENOSPC) {
if (!noalign) {
* [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:15 ` [PATCH 05/24] xfs: make the rtalloc start hint a xfs_rtblock_t Darrick J. Wong
@ 2024-08-23 0:16 ` Darrick J. Wong
2024-08-23 5:00 ` Christoph Hellwig
2024-08-23 0:16 ` [PATCH 07/24] xfs: remove xfs_{rtbitmap,rtsummary}_wordcount Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:16 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add common helpers for no-op scrubbing methods.
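The pattern being introduced — one shared stub plus a #define alias per disabled method — looks roughly like this userspace sketch (the `toy_*` names and the CONFIG_TOY_RT symbol are invented for illustration):

```c
#include <assert.h>
#include <errno.h>

struct toy_scrub {
	int	unused;
};

/* One shared no-op: "this scrubber is not compiled into the kernel". */
static int toy_xchk_nothing(struct toy_scrub *sc)
{
	(void)sc;
	return -ENOENT;
}

#ifndef CONFIG_TOY_RT
/*
 * Feature disabled: alias each method to the shared stub instead of
 * spelling out a separate, identical static inline per method.
 */
# define toy_xchk_rtbitmap	toy_xchk_nothing
# define toy_xchk_rtsummary	toy_xchk_nothing
#endif
```

Every alias resolves to the same function, so the per-method boilerplate collapses to one line each.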
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/scrub/common.h | 29 +++++++++--------------------
fs/xfs/scrub/scrub.h | 29 +++++++++--------------------
2 files changed, 18 insertions(+), 40 deletions(-)
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 96fe6ef5f4dc7..27e5bf8f7c60b 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -53,6 +53,11 @@ int xchk_checkpoint_log(struct xfs_mount *mp);
bool xchk_should_check_xref(struct xfs_scrub *sc, int *error,
struct xfs_btree_cur **curpp);
+static inline int xchk_setup_nothing(struct xfs_scrub *sc)
+{
+ return -ENOENT;
+}
+
/* Setup functions */
int xchk_setup_agheader(struct xfs_scrub *sc);
int xchk_setup_fs(struct xfs_scrub *sc);
@@ -73,16 +78,8 @@ int xchk_setup_metapath(struct xfs_scrub *sc);
int xchk_setup_rtbitmap(struct xfs_scrub *sc);
int xchk_setup_rtsummary(struct xfs_scrub *sc);
#else
-static inline int
-xchk_setup_rtbitmap(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
-static inline int
-xchk_setup_rtsummary(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
+# define xchk_setup_rtbitmap xchk_setup_nothing
+# define xchk_setup_rtsummary xchk_setup_nothing
#endif
#ifdef CONFIG_XFS_QUOTA
int xchk_ino_dqattach(struct xfs_scrub *sc);
@@ -94,16 +91,8 @@ xchk_ino_dqattach(struct xfs_scrub *sc)
{
return 0;
}
-static inline int
-xchk_setup_quota(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
-static inline int
-xchk_setup_quotacheck(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
+# define xchk_setup_quota xchk_setup_nothing
+# define xchk_setup_quotacheck xchk_setup_nothing
#endif
int xchk_setup_fscounters(struct xfs_scrub *sc);
int xchk_setup_nlinks(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index ab143c7a531e8..c688ff4fc7fc4 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -232,6 +232,11 @@ xchk_should_terminate(
return false;
}
+static inline int xchk_nothing(struct xfs_scrub *sc)
+{
+ return -ENOENT;
+}
+
/* Metadata scrubbers */
int xchk_tester(struct xfs_scrub *sc);
int xchk_superblock(struct xfs_scrub *sc);
@@ -256,31 +261,15 @@ int xchk_metapath(struct xfs_scrub *sc);
int xchk_rtbitmap(struct xfs_scrub *sc);
int xchk_rtsummary(struct xfs_scrub *sc);
#else
-static inline int
-xchk_rtbitmap(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
-static inline int
-xchk_rtsummary(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
+# define xchk_rtbitmap xchk_nothing
+# define xchk_rtsummary xchk_nothing
#endif
#ifdef CONFIG_XFS_QUOTA
int xchk_quota(struct xfs_scrub *sc);
int xchk_quotacheck(struct xfs_scrub *sc);
#else
-static inline int
-xchk_quota(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
-static inline int
-xchk_quotacheck(struct xfs_scrub *sc)
-{
- return -ENOENT;
-}
+# define xchk_quota xchk_nothing
+# define xchk_quotacheck xchk_nothing
#endif
int xchk_fscounters(struct xfs_scrub *sc);
int xchk_nlinks(struct xfs_scrub *sc);
* [PATCH 07/24] xfs: remove xfs_{rtbitmap,rtsummary}_wordcount
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:16 ` [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers Darrick J. Wong
@ 2024-08-23 0:16 ` Darrick J. Wong
2024-08-23 0:16 ` [PATCH 08/24] xfs: replace m_rsumsize with m_rsumblocks Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:16 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
xfs_rtbitmap_wordcount and xfs_rtsummary_wordcount are currently unused,
so remove them to simplify refactoring the other rtbitmap helpers. They
can be added back, or simply open-coded, when actually needed.
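Should a caller ever need a word count again, open-coding it from the block count is a one-liner. A sketch under assumed constants (`toy_*` names invented; 4096-byte blocks and 4-byte rtbitmap words, matching XFS_WORDLOG):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCKLOG	12	/* assume 4096-byte filesystem blocks */
#define TOY_WORDLOG	2	/* 4-byte bitmap words, like XFS_WORDLOG */

/* Words needed to fill @blocks: bytes in the blocks over bytes per word. */
static uint64_t toy_words_per_blocks(uint64_t blocks)
{
	return (blocks << TOY_BLOCKLOG) >> TOY_WORDLOG;
}
```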
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 31 -------------------------------
fs/xfs/libxfs/xfs_rtbitmap.h | 7 -------
2 files changed, 38 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index c58eb75ef0fa0..76706e8bbc4ea 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1148,21 +1148,6 @@ xfs_rtbitmap_blockcount(
return howmany_64(rtextents, NBBY * mp->m_sb.sb_blocksize);
}
-/*
- * Compute the number of rtbitmap words needed to populate every block of a
- * bitmap that is large enough to track the given number of rt extents.
- */
-unsigned long long
-xfs_rtbitmap_wordcount(
- struct xfs_mount *mp,
- xfs_rtbxlen_t rtextents)
-{
- xfs_filblks_t blocks;
-
- blocks = xfs_rtbitmap_blockcount(mp, rtextents);
- return XFS_FSB_TO_B(mp, blocks) >> XFS_WORDLOG;
-}
-
/* Compute the number of rtsummary blocks needed to track the given rt space. */
xfs_filblks_t
xfs_rtsummary_blockcount(
@@ -1176,22 +1161,6 @@ xfs_rtsummary_blockcount(
return XFS_B_TO_FSB(mp, rsumwords << XFS_WORDLOG);
}
-/*
- * Compute the number of rtsummary info words needed to populate every block of
- * a summary file that is large enough to track the given rt space.
- */
-unsigned long long
-xfs_rtsummary_wordcount(
- struct xfs_mount *mp,
- unsigned int rsumlevels,
- xfs_extlen_t rbmblocks)
-{
- xfs_filblks_t blocks;
-
- blocks = xfs_rtsummary_blockcount(mp, rsumlevels, rbmblocks);
- return XFS_FSB_TO_B(mp, blocks) >> XFS_WORDLOG;
-}
-
/* Lock both realtime free space metadata inodes for a freespace update. */
void
xfs_rtbitmap_lock(
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 0dbc9bb40668a..140513d1d6bcf 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -316,13 +316,8 @@ int xfs_rtfree_blocks(struct xfs_trans *tp, xfs_fsblock_t rtbno,
xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t
rtextents);
-unsigned long long xfs_rtbitmap_wordcount(struct xfs_mount *mp,
- xfs_rtbxlen_t rtextents);
-
xfs_filblks_t xfs_rtsummary_blockcount(struct xfs_mount *mp,
unsigned int rsumlevels, xfs_extlen_t rbmblocks);
-unsigned long long xfs_rtsummary_wordcount(struct xfs_mount *mp,
- unsigned int rsumlevels, xfs_extlen_t rbmblocks);
int xfs_rtfile_initialize_blocks(struct xfs_inode *ip,
xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb, void *data);
@@ -355,9 +350,7 @@ xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
/* shut up gcc */
return 0;
}
-# define xfs_rtbitmap_wordcount(mp, r) (0)
# define xfs_rtsummary_blockcount(mp, l, b) (0)
-# define xfs_rtsummary_wordcount(mp, l, b) (0)
# define xfs_rtbitmap_lock(mp) do { } while (0)
# define xfs_rtbitmap_trans_join(tp) do { } while (0)
# define xfs_rtbitmap_unlock(mp) do { } while (0)
* [PATCH 08/24] xfs: replace m_rsumsize with m_rsumblocks
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:16 ` [PATCH 07/24] xfs: remove xfs_{rtbitmap,rtsummary}_wordcount Darrick J. Wong
@ 2024-08-23 0:16 ` Darrick J. Wong
2024-08-23 0:17 ` [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:16 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Track the RT summary file size in blocks, just like the RT bitmap file.
While we have users of both units, blocks are used slightly more often,
and storing blocks keeps the summary bookkeeping consistent with the
bitmap file.
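A minimal sketch of the single-unit bookkeeping, assuming 4096-byte blocks (`toy_*` names are hypothetical): store the size once, in blocks, and convert at the few byte-based call sites rather than carrying a redundant byte-sized field that can drift out of sync.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCKLOG	12	/* assume 4096-byte filesystem blocks */

/* Convert filesystem blocks to bytes, like XFS_FSB_TO_B. */
static uint64_t toy_fsb_to_b(uint64_t fsb)
{
	return fsb << TOY_BLOCKLOG;
}

/* Keep the summary file size in blocks only. */
struct toy_mount {
	uint64_t	m_rsumblocks;	/* replaces a byte-sized m_rsumsize */
};
```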
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 2 +-
fs/xfs/libxfs/xfs_trans_resv.c | 2 +-
fs/xfs/scrub/rtsummary.c | 11 +++++------
fs/xfs/scrub/rtsummary.h | 2 +-
fs/xfs/scrub/rtsummary_repair.c | 12 +++++-------
fs/xfs/xfs_mount.h | 2 +-
fs/xfs/xfs_rtalloc.c | 13 +++++--------
7 files changed, 19 insertions(+), 25 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 76706e8bbc4ea..27a4472402bac 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -162,7 +162,7 @@ xfs_rtsummary_read_buf(
{
struct xfs_mount *mp = args->mp;
- if (XFS_IS_CORRUPT(mp, block >= XFS_B_TO_FSB(mp, mp->m_rsumsize))) {
+ if (XFS_IS_CORRUPT(mp, block >= mp->m_rsumblocks)) {
xfs_rt_mark_sick(args->mp, XFS_SICK_RT_SUMMARY);
return -EFSCORRUPTED;
}
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 45aaf169806aa..2e6d7bb3b5a2f 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -918,7 +918,7 @@ xfs_calc_growrtfree_reservation(
return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
xfs_calc_inode_res(mp, 2) +
xfs_calc_buf_res(1, mp->m_sb.sb_blocksize) +
- xfs_calc_buf_res(1, mp->m_rsumsize);
+ xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, mp->m_rsumblocks));
}
/*
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 3fee603f52441..7c7366c98338b 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -63,7 +63,8 @@ xchk_setup_rtsummary(
* us to avoid pinning kernel memory for this purpose.
*/
descr = xchk_xfile_descr(sc, "realtime summary file");
- error = xfile_create(descr, mp->m_rsumsize, &sc->xfile);
+ error = xfile_create(descr, XFS_FSB_TO_B(mp, mp->m_rsumblocks),
+ &sc->xfile);
kfree(descr);
if (error)
return error;
@@ -95,16 +96,14 @@ xchk_setup_rtsummary(
* volume. Hence it is safe to compute and check the geometry values.
*/
if (mp->m_sb.sb_rblocks) {
- xfs_filblks_t rsumblocks;
int rextslog;
rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
rextslog = xfs_compute_rextslog(rts->rextents);
rts->rsumlevels = rextslog + 1;
rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents);
- rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
+ rts->rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
rts->rbmblocks);
- rts->rsumsize = XFS_FSB_TO_B(mp, rsumblocks);
}
return 0;
}
@@ -316,7 +315,7 @@ xchk_rtsummary(
}
/* Is m_rsumsize correct? */
- if (mp->m_rsumsize != rts->rsumsize) {
+ if (mp->m_rsumblocks != rts->rsumblocks) {
xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
goto out_rbm;
}
@@ -332,7 +331,7 @@ xchk_rtsummary(
* growfsrt expands the summary file before updating sb_rextents, so
* the file can be larger than rsumsize.
*/
- if (mp->m_rsumip->i_disk_size < rts->rsumsize) {
+ if (mp->m_rsumip->i_disk_size < XFS_FSB_TO_B(mp, rts->rsumblocks)) {
xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
goto out_rbm;
}
diff --git a/fs/xfs/scrub/rtsummary.h b/fs/xfs/scrub/rtsummary.h
index e1d50304d8d48..e44b04cb6e2d5 100644
--- a/fs/xfs/scrub/rtsummary.h
+++ b/fs/xfs/scrub/rtsummary.h
@@ -14,7 +14,7 @@ struct xchk_rtsummary {
uint64_t rextents;
uint64_t rbmblocks;
- uint64_t rsumsize;
+ xfs_filblks_t rsumblocks;
unsigned int rsumlevels;
unsigned int resblks;
diff --git a/fs/xfs/scrub/rtsummary_repair.c b/fs/xfs/scrub/rtsummary_repair.c
index d9e971c4c79fb..7deeb948cb702 100644
--- a/fs/xfs/scrub/rtsummary_repair.c
+++ b/fs/xfs/scrub/rtsummary_repair.c
@@ -56,7 +56,7 @@ xrep_setup_rtsummary(
* transaction (which we cannot drop because we cannot drop the
* rtsummary ILOCK) and cannot ask for more reservation.
*/
- blocks = XFS_B_TO_FSB(mp, mp->m_rsumsize);
+ blocks = mp->m_rsumblocks;
blocks += xfs_bmbt_calc_size(mp, blocks) * 2;
if (blocks > UINT_MAX)
return -EOPNOTSUPP;
@@ -100,7 +100,6 @@ xrep_rtsummary(
{
struct xchk_rtsummary *rts = sc->buf;
struct xfs_mount *mp = sc->mp;
- xfs_filblks_t rsumblocks;
int error;
/* We require the rmapbt to rebuild anything. */
@@ -131,10 +130,9 @@ xrep_rtsummary(
}
/* Make sure we have space allocated for the entire summary file. */
- rsumblocks = XFS_B_TO_FSB(mp, rts->rsumsize);
xfs_trans_ijoin(sc->tp, sc->ip, 0);
xfs_trans_ijoin(sc->tp, sc->tempip, 0);
- error = xrep_tempfile_prealloc(sc, 0, rsumblocks);
+ error = xrep_tempfile_prealloc(sc, 0, rts->rsumblocks);
if (error)
return error;
@@ -143,11 +141,11 @@ xrep_rtsummary(
return error;
/* Copy the rtsummary file that we generated. */
- error = xrep_tempfile_copyin(sc, 0, rsumblocks,
+ error = xrep_tempfile_copyin(sc, 0, rts->rsumblocks,
xrep_rtsummary_prep_buf, rts);
if (error)
return error;
- error = xrep_tempfile_set_isize(sc, rts->rsumsize);
+ error = xrep_tempfile_set_isize(sc, XFS_FSB_TO_B(mp, rts->rsumblocks));
if (error)
return error;
@@ -168,7 +166,7 @@ xrep_rtsummary(
memset(mp->m_rsum_cache, 0xFF, mp->m_sb.sb_rbmblocks);
mp->m_rsumlevels = rts->rsumlevels;
- mp->m_rsumsize = rts->rsumsize;
+ mp->m_rsumblocks = rts->rsumblocks;
/* Free the old rtsummary blocks if they're not in use. */
return xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 6251ebced3062..9e883d2159fd9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -148,7 +148,7 @@ typedef struct xfs_mount {
int m_logbufs; /* number of log buffers */
int m_logbsize; /* size of each log buffer */
uint m_rsumlevels; /* rt summary levels */
- uint m_rsumsize; /* size of rt summary, bytes */
+ xfs_filblks_t m_rsumblocks; /* size of rt summary, FSBs */
int m_fixedfsid[2]; /* unchanged for life of FS */
uint m_qflags; /* quota status flags */
uint64_t m_features; /* active filesystem features */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 7854cd355311b..46a920b192d19 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -736,9 +736,8 @@ xfs_growfs_rt_bmblock(
nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
- nmp->m_rsumsize = XFS_FSB_TO_B(mp,
- xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
- nmp->m_sb.sb_rbmblocks));
+ nmp->m_rsumblocks = xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
+ nmp->m_sb.sb_rbmblocks);
/*
* Recompute the growfsrt reservation from the new rsumsize, so that the
@@ -768,7 +767,7 @@ xfs_growfs_rt_bmblock(
* so that inode inactivation won't punch what it thinks are "posteof"
* blocks.
*/
- rsumip->i_disk_size = nmp->m_rsumsize;
+ rsumip->i_disk_size = nmp->m_rsumblocks * nmp->m_sb.sb_blocksize;
i_size_write(VFS_I(rsumip), rsumip->i_disk_size);
xfs_trans_log_inode(args.tp, rsumip, XFS_ILOG_CORE);
@@ -820,7 +819,7 @@ xfs_growfs_rt_bmblock(
* Update the calculated values in the real mount structure.
*/
mp->m_rsumlevels = nmp->m_rsumlevels;
- mp->m_rsumsize = nmp->m_rsumsize;
+ mp->m_rsumblocks = nmp->m_rsumblocks;
xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
/*
@@ -1024,7 +1023,6 @@ xfs_rtmount_init(
struct xfs_buf *bp; /* buffer for last block of subvolume */
struct xfs_sb *sbp; /* filesystem superblock copy in mount */
xfs_daddr_t d; /* address of last block of subvolume */
- unsigned int rsumblocks;
int error;
sbp = &mp->m_sb;
@@ -1036,9 +1034,8 @@ xfs_rtmount_init(
return -ENODEV;
}
mp->m_rsumlevels = sbp->sb_rextslog + 1;
- rsumblocks = xfs_rtsummary_blockcount(mp, mp->m_rsumlevels,
+ mp->m_rsumblocks = xfs_rtsummary_blockcount(mp, mp->m_rsumlevels,
mp->m_sb.sb_rbmblocks);
- mp->m_rsumsize = XFS_FSB_TO_B(mp, rsumblocks);
mp->m_rbmip = mp->m_rsumip = NULL;
/*
* Check that the realtime section is an ok size.
* [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:16 ` [PATCH 08/24] xfs: replace m_rsumsize with m_rsumblocks Darrick J. Wong
@ 2024-08-23 0:17 ` Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-23 0:17 ` [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c Darrick J. Wong
From: Darrick J. Wong @ 2024-08-23 0:17 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The order of the functions in this file has gotten a little confusing
over the years. Specifically, the two data device implementations
(bnobt and rmapbt) could be adjacent in the source code instead of split
in two by the logdev and rtdev fsmap implementations. We're about to
add more functionality to this file, so rearrange things now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_fsmap.c | 268 ++++++++++++++++++++++++++--------------------------
1 file changed, 134 insertions(+), 134 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index e154466268757..615253406fde1 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -441,140 +441,6 @@ xfs_getfsmap_set_irec_flags(
irec->rm_flags |= XFS_RMAP_UNWRITTEN;
}
-/* Execute a getfsmap query against the log device. */
-STATIC int
-xfs_getfsmap_logdev(
- struct xfs_trans *tp,
- const struct xfs_fsmap *keys,
- struct xfs_getfsmap_info *info)
-{
- struct xfs_mount *mp = tp->t_mountp;
- struct xfs_rmap_irec rmap;
- xfs_daddr_t rec_daddr, len_daddr;
- xfs_fsblock_t start_fsb, end_fsb;
- uint64_t eofs;
-
- eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
- if (keys[0].fmr_physical >= eofs)
- return 0;
- start_fsb = XFS_BB_TO_FSBT(mp,
- keys[0].fmr_physical + keys[0].fmr_length);
- end_fsb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
-
- /* Adjust the low key if we are continuing from where we left off. */
- if (keys[0].fmr_length > 0)
- info->low_daddr = XFS_FSB_TO_BB(mp, start_fsb);
-
- trace_xfs_fsmap_low_key_linear(mp, info->dev, start_fsb);
- trace_xfs_fsmap_high_key_linear(mp, info->dev, end_fsb);
-
- if (start_fsb > 0)
- return 0;
-
- /* Fabricate an rmap entry for the external log device. */
- rmap.rm_startblock = 0;
- rmap.rm_blockcount = mp->m_sb.sb_logblocks;
- rmap.rm_owner = XFS_RMAP_OWN_LOG;
- rmap.rm_offset = 0;
- rmap.rm_flags = 0;
-
- rec_daddr = XFS_FSB_TO_BB(mp, rmap.rm_startblock);
- len_daddr = XFS_FSB_TO_BB(mp, rmap.rm_blockcount);
- return xfs_getfsmap_helper(tp, info, &rmap, rec_daddr, len_daddr);
-}
-
-#ifdef CONFIG_XFS_RT
-/* Transform a rtbitmap "record" into a fsmap */
-STATIC int
-xfs_getfsmap_rtdev_rtbitmap_helper(
- struct xfs_mount *mp,
- struct xfs_trans *tp,
- const struct xfs_rtalloc_rec *rec,
- void *priv)
-{
- struct xfs_getfsmap_info *info = priv;
- struct xfs_rmap_irec irec;
- xfs_rtblock_t rtbno;
- xfs_daddr_t rec_daddr, len_daddr;
-
- rtbno = xfs_rtx_to_rtb(mp, rec->ar_startext);
- rec_daddr = XFS_FSB_TO_BB(mp, rtbno);
- irec.rm_startblock = rtbno;
-
- rtbno = xfs_rtx_to_rtb(mp, rec->ar_extcount);
- len_daddr = XFS_FSB_TO_BB(mp, rtbno);
- irec.rm_blockcount = rtbno;
-
- irec.rm_owner = XFS_RMAP_OWN_NULL; /* "free" */
- irec.rm_offset = 0;
- irec.rm_flags = 0;
-
- return xfs_getfsmap_helper(tp, info, &irec, rec_daddr, len_daddr);
-}
-
-/* Execute a getfsmap query against the realtime device rtbitmap. */
-STATIC int
-xfs_getfsmap_rtdev_rtbitmap(
- struct xfs_trans *tp,
- const struct xfs_fsmap *keys,
- struct xfs_getfsmap_info *info)
-{
-
- struct xfs_rtalloc_rec ahigh = { 0 };
- struct xfs_mount *mp = tp->t_mountp;
- xfs_rtblock_t start_rtb;
- xfs_rtblock_t end_rtb;
- xfs_rtxnum_t high;
- uint64_t eofs;
- int error;
-
- eofs = XFS_FSB_TO_BB(mp, xfs_rtx_to_rtb(mp, mp->m_sb.sb_rextents));
- if (keys[0].fmr_physical >= eofs)
- return 0;
- start_rtb = XFS_BB_TO_FSBT(mp,
- keys[0].fmr_physical + keys[0].fmr_length);
- end_rtb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
-
- info->missing_owner = XFS_FMR_OWN_UNKNOWN;
-
- /* Adjust the low key if we are continuing from where we left off. */
- if (keys[0].fmr_length > 0) {
- info->low_daddr = XFS_FSB_TO_BB(mp, start_rtb);
- if (info->low_daddr >= eofs)
- return 0;
- }
-
- trace_xfs_fsmap_low_key_linear(mp, info->dev, start_rtb);
- trace_xfs_fsmap_high_key_linear(mp, info->dev, end_rtb);
-
- xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
-
- /*
- * Set up query parameters to return free rtextents covering the range
- * we want.
- */
- high = xfs_rtb_to_rtxup(mp, end_rtb);
- error = xfs_rtalloc_query_range(mp, tp, xfs_rtb_to_rtx(mp, start_rtb),
- high, xfs_getfsmap_rtdev_rtbitmap_helper, info);
- if (error)
- goto err;
-
- /*
- * Report any gaps at the end of the rtbitmap by simulating a null
- * rmap starting at the block after the end of the query range.
- */
- info->last = true;
- ahigh.ar_startext = min(mp->m_sb.sb_rextents, high);
-
- error = xfs_getfsmap_rtdev_rtbitmap_helper(mp, tp, &ahigh, info);
- if (error)
- goto err;
-err:
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
- return error;
-}
-#endif /* CONFIG_XFS_RT */
-
static inline bool
rmap_not_shareable(struct xfs_mount *mp, const struct xfs_rmap_irec *r)
{
@@ -799,6 +665,140 @@ xfs_getfsmap_datadev_bnobt(
xfs_getfsmap_datadev_bnobt_query, &akeys[0]);
}
+/* Execute a getfsmap query against the log device. */
+STATIC int
+xfs_getfsmap_logdev(
+ struct xfs_trans *tp,
+ const struct xfs_fsmap *keys,
+ struct xfs_getfsmap_info *info)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_rmap_irec rmap;
+ xfs_daddr_t rec_daddr, len_daddr;
+ xfs_fsblock_t start_fsb, end_fsb;
+ uint64_t eofs;
+
+ eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
+ if (keys[0].fmr_physical >= eofs)
+ return 0;
+ start_fsb = XFS_BB_TO_FSBT(mp,
+ keys[0].fmr_physical + keys[0].fmr_length);
+ end_fsb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
+
+ /* Adjust the low key if we are continuing from where we left off. */
+ if (keys[0].fmr_length > 0)
+ info->low_daddr = XFS_FSB_TO_BB(mp, start_fsb);
+
+ trace_xfs_fsmap_low_key_linear(mp, info->dev, start_fsb);
+ trace_xfs_fsmap_high_key_linear(mp, info->dev, end_fsb);
+
+ if (start_fsb > 0)
+ return 0;
+
+ /* Fabricate an rmap entry for the external log device. */
+ rmap.rm_startblock = 0;
+ rmap.rm_blockcount = mp->m_sb.sb_logblocks;
+ rmap.rm_owner = XFS_RMAP_OWN_LOG;
+ rmap.rm_offset = 0;
+ rmap.rm_flags = 0;
+
+ rec_daddr = XFS_FSB_TO_BB(mp, rmap.rm_startblock);
+ len_daddr = XFS_FSB_TO_BB(mp, rmap.rm_blockcount);
+ return xfs_getfsmap_helper(tp, info, &rmap, rec_daddr, len_daddr);
+}
+
+#ifdef CONFIG_XFS_RT
+/* Transform a rtbitmap "record" into a fsmap */
+STATIC int
+xfs_getfsmap_rtdev_rtbitmap_helper(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ const struct xfs_rtalloc_rec *rec,
+ void *priv)
+{
+ struct xfs_getfsmap_info *info = priv;
+ struct xfs_rmap_irec irec;
+ xfs_rtblock_t rtbno;
+ xfs_daddr_t rec_daddr, len_daddr;
+
+ rtbno = xfs_rtx_to_rtb(mp, rec->ar_startext);
+ rec_daddr = XFS_FSB_TO_BB(mp, rtbno);
+ irec.rm_startblock = rtbno;
+
+ rtbno = xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ len_daddr = XFS_FSB_TO_BB(mp, rtbno);
+ irec.rm_blockcount = rtbno;
+
+ irec.rm_owner = XFS_RMAP_OWN_NULL; /* "free" */
+ irec.rm_offset = 0;
+ irec.rm_flags = 0;
+
+ return xfs_getfsmap_helper(tp, info, &irec, rec_daddr, len_daddr);
+}
+
+/* Execute a getfsmap query against the realtime device rtbitmap. */
+STATIC int
+xfs_getfsmap_rtdev_rtbitmap(
+ struct xfs_trans *tp,
+ const struct xfs_fsmap *keys,
+ struct xfs_getfsmap_info *info)
+{
+ struct xfs_rtalloc_rec ahigh = { 0 };
+ struct xfs_mount *mp = tp->t_mountp;
+ xfs_rtblock_t start_rtb;
+ xfs_rtblock_t end_rtb;
+ xfs_rtxnum_t high;
+ uint64_t eofs;
+ int error;
+
+ eofs = XFS_FSB_TO_BB(mp, xfs_rtx_to_rtb(mp, mp->m_sb.sb_rextents));
+ if (keys[0].fmr_physical >= eofs)
+ return 0;
+ start_rtb = XFS_BB_TO_FSBT(mp,
+ keys[0].fmr_physical + keys[0].fmr_length);
+ end_rtb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
+
+ info->missing_owner = XFS_FMR_OWN_UNKNOWN;
+
+ /* Adjust the low key if we are continuing from where we left off. */
+ if (keys[0].fmr_length > 0) {
+ info->low_daddr = XFS_FSB_TO_BB(mp, start_rtb);
+ if (info->low_daddr >= eofs)
+ return 0;
+ }
+
+ trace_xfs_fsmap_low_key_linear(mp, info->dev, start_rtb);
+ trace_xfs_fsmap_high_key_linear(mp, info->dev, end_rtb);
+
+ xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
+
+ /*
+ * Set up query parameters to return free rtextents covering the range
+ * we want.
+ */
+ high = xfs_rtb_to_rtxup(mp, end_rtb);
+ error = xfs_rtalloc_query_range(mp, tp, xfs_rtb_to_rtx(mp, start_rtb),
+ high, xfs_getfsmap_rtdev_rtbitmap_helper, info);
+ if (error)
+ goto err;
+
+ /*
+ * Report any gaps at the end of the rtbitmap by simulating a null
+ * rmap starting at the block after the end of the query range.
+ */
+ info->last = true;
+ ahigh.ar_startext = min(mp->m_sb.sb_rextents, high);
+
+ error = xfs_getfsmap_rtdev_rtbitmap_helper(mp, tp, &ahigh, info);
+ if (error)
+ goto err;
+err:
+ xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
+ return error;
+}
+#endif /* CONFIG_XFS_RT */
+
/* Do we recognize the device? */
STATIC bool
xfs_getfsmap_is_valid_device(
* [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:17 ` [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit Darrick J. Wong
@ 2024-08-23 0:17 ` Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-23 0:17 ` [PATCH 11/24] xfs: create incore realtime group structures Darrick J. Wong
` (13 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:17 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Move this function out of xfs_ioctl.c to reduce the clutter in there,
and make the entire getfsmap implementation self-contained in a single
file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_fsmap.c | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_fsmap.h | 6 +-
fs/xfs/xfs_ioctl.c | 130 --------------------------------------------------
3 files changed, 134 insertions(+), 136 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 615253406fde1..ae18ab86e608b 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -44,7 +44,7 @@ xfs_fsmap_from_internal(
}
/* Convert an fsmap to an xfs_fsmap. */
-void
+static void
xfs_fsmap_to_internal(
struct xfs_fsmap *dest,
struct fsmap *src)
@@ -889,7 +889,7 @@ xfs_getfsmap_check_keys(
* xfs_getfsmap_info.low/high -- per-AG low/high keys computed from
* dkeys; used to query the metadata.
*/
-int
+STATIC int
xfs_getfsmap(
struct xfs_mount *mp,
struct xfs_fsmap_head *head,
@@ -1019,3 +1019,133 @@ xfs_getfsmap(
head->fmh_oflags = FMH_OF_DEV_T;
return error;
}
+
+int
+xfs_ioc_getfsmap(
+ struct xfs_inode *ip,
+ struct fsmap_head __user *arg)
+{
+ struct xfs_fsmap_head xhead = {0};
+ struct fsmap_head head;
+ struct fsmap *recs;
+ unsigned int count;
+ __u32 last_flags = 0;
+ bool done = false;
+ int error;
+
+ if (copy_from_user(&head, arg, sizeof(struct fsmap_head)))
+ return -EFAULT;
+ if (memchr_inv(head.fmh_reserved, 0, sizeof(head.fmh_reserved)) ||
+ memchr_inv(head.fmh_keys[0].fmr_reserved, 0,
+ sizeof(head.fmh_keys[0].fmr_reserved)) ||
+ memchr_inv(head.fmh_keys[1].fmr_reserved, 0,
+ sizeof(head.fmh_keys[1].fmr_reserved)))
+ return -EINVAL;
+
+ /*
+ * Use an internal memory buffer so that we don't have to copy fsmap
+ * data to userspace while holding locks. Start by trying to allocate
+ * up to 128k for the buffer, but fall back to a single page if needed.
+ */
+ count = min_t(unsigned int, head.fmh_count,
+ 131072 / sizeof(struct fsmap));
+ recs = kvcalloc(count, sizeof(struct fsmap), GFP_KERNEL);
+ if (!recs) {
+ count = min_t(unsigned int, head.fmh_count,
+ PAGE_SIZE / sizeof(struct fsmap));
+ recs = kvcalloc(count, sizeof(struct fsmap), GFP_KERNEL);
+ if (!recs)
+ return -ENOMEM;
+ }
+
+ xhead.fmh_iflags = head.fmh_iflags;
+ xfs_fsmap_to_internal(&xhead.fmh_keys[0], &head.fmh_keys[0]);
+ xfs_fsmap_to_internal(&xhead.fmh_keys[1], &head.fmh_keys[1]);
+
+ trace_xfs_getfsmap_low_key(ip->i_mount, &xhead.fmh_keys[0]);
+ trace_xfs_getfsmap_high_key(ip->i_mount, &xhead.fmh_keys[1]);
+
+ head.fmh_entries = 0;
+ do {
+ struct fsmap __user *user_recs;
+ struct fsmap *last_rec;
+
+ user_recs = &arg->fmh_recs[head.fmh_entries];
+ xhead.fmh_entries = 0;
+ xhead.fmh_count = min_t(unsigned int, count,
+ head.fmh_count - head.fmh_entries);
+
+ /* Run query, record how many entries we got. */
+ error = xfs_getfsmap(ip->i_mount, &xhead, recs);
+ switch (error) {
+ case 0:
+ /*
+ * There are no more records in the result set. Copy
+ * whatever we got to userspace and break out.
+ */
+ done = true;
+ break;
+ case -ECANCELED:
+ /*
+ * The internal memory buffer is full. Copy whatever
+ * records we got to userspace and go again if we have
+ * not yet filled the userspace buffer.
+ */
+ error = 0;
+ break;
+ default:
+ goto out_free;
+ }
+ head.fmh_entries += xhead.fmh_entries;
+ head.fmh_oflags = xhead.fmh_oflags;
+
+ /*
+ * If the caller wanted a record count or there aren't any
+ * new records to return, we're done.
+ */
+ if (head.fmh_count == 0 || xhead.fmh_entries == 0)
+ break;
+
+ /* Copy all the records we got out to userspace. */
+ if (copy_to_user(user_recs, recs,
+ xhead.fmh_entries * sizeof(struct fsmap))) {
+ error = -EFAULT;
+ goto out_free;
+ }
+
+ /* Remember the last record flags we copied to userspace. */
+ last_rec = &recs[xhead.fmh_entries - 1];
+ last_flags = last_rec->fmr_flags;
+
+ /* Set up the low key for the next iteration. */
+ xfs_fsmap_to_internal(&xhead.fmh_keys[0], last_rec);
+ trace_xfs_getfsmap_low_key(ip->i_mount, &xhead.fmh_keys[0]);
+ } while (!done && head.fmh_entries < head.fmh_count);
+
+ /*
+ * If there are no more records in the query result set and we're not
+ * in counting mode, mark the last record returned with the LAST flag.
+ */
+ if (done && head.fmh_count > 0 && head.fmh_entries > 0) {
+ struct fsmap __user *user_rec;
+
+ last_flags |= FMR_OF_LAST;
+ user_rec = &arg->fmh_recs[head.fmh_entries - 1];
+
+ if (copy_to_user(&user_rec->fmr_flags, &last_flags,
+ sizeof(last_flags))) {
+ error = -EFAULT;
+ goto out_free;
+ }
+ }
+
+ /* copy back header */
+ if (copy_to_user(arg, &head, sizeof(struct fsmap_head))) {
+ error = -EFAULT;
+ goto out_free;
+ }
+
+out_free:
+ kvfree(recs);
+ return error;
+}
diff --git a/fs/xfs/xfs_fsmap.h b/fs/xfs/xfs_fsmap.h
index a0775788e7b13..a0bcc38486a56 100644
--- a/fs/xfs/xfs_fsmap.h
+++ b/fs/xfs/xfs_fsmap.h
@@ -7,6 +7,7 @@
#define __XFS_FSMAP_H__
struct fsmap;
+struct fsmap_head;
/* internal fsmap representation */
struct xfs_fsmap {
@@ -27,9 +28,6 @@ struct xfs_fsmap_head {
struct xfs_fsmap fmh_keys[2]; /* low and high keys */
};
-void xfs_fsmap_to_internal(struct xfs_fsmap *dest, struct fsmap *src);
-
-int xfs_getfsmap(struct xfs_mount *mp, struct xfs_fsmap_head *head,
- struct fsmap *out_recs);
+int xfs_ioc_getfsmap(struct xfs_inode *ip, struct fsmap_head __user *arg);
#endif /* __XFS_FSMAP_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b53af3e674912..461780ffb8fc0 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -883,136 +883,6 @@ xfs_ioc_getbmap(
return error;
}
-STATIC int
-xfs_ioc_getfsmap(
- struct xfs_inode *ip,
- struct fsmap_head __user *arg)
-{
- struct xfs_fsmap_head xhead = {0};
- struct fsmap_head head;
- struct fsmap *recs;
- unsigned int count;
- __u32 last_flags = 0;
- bool done = false;
- int error;
-
- if (copy_from_user(&head, arg, sizeof(struct fsmap_head)))
- return -EFAULT;
- if (memchr_inv(head.fmh_reserved, 0, sizeof(head.fmh_reserved)) ||
- memchr_inv(head.fmh_keys[0].fmr_reserved, 0,
- sizeof(head.fmh_keys[0].fmr_reserved)) ||
- memchr_inv(head.fmh_keys[1].fmr_reserved, 0,
- sizeof(head.fmh_keys[1].fmr_reserved)))
- return -EINVAL;
-
- /*
- * Use an internal memory buffer so that we don't have to copy fsmap
- * data to userspace while holding locks. Start by trying to allocate
- * up to 128k for the buffer, but fall back to a single page if needed.
- */
- count = min_t(unsigned int, head.fmh_count,
- 131072 / sizeof(struct fsmap));
- recs = kvcalloc(count, sizeof(struct fsmap), GFP_KERNEL);
- if (!recs) {
- count = min_t(unsigned int, head.fmh_count,
- PAGE_SIZE / sizeof(struct fsmap));
- recs = kvcalloc(count, sizeof(struct fsmap), GFP_KERNEL);
- if (!recs)
- return -ENOMEM;
- }
-
- xhead.fmh_iflags = head.fmh_iflags;
- xfs_fsmap_to_internal(&xhead.fmh_keys[0], &head.fmh_keys[0]);
- xfs_fsmap_to_internal(&xhead.fmh_keys[1], &head.fmh_keys[1]);
-
- trace_xfs_getfsmap_low_key(ip->i_mount, &xhead.fmh_keys[0]);
- trace_xfs_getfsmap_high_key(ip->i_mount, &xhead.fmh_keys[1]);
-
- head.fmh_entries = 0;
- do {
- struct fsmap __user *user_recs;
- struct fsmap *last_rec;
-
- user_recs = &arg->fmh_recs[head.fmh_entries];
- xhead.fmh_entries = 0;
- xhead.fmh_count = min_t(unsigned int, count,
- head.fmh_count - head.fmh_entries);
-
- /* Run query, record how many entries we got. */
- error = xfs_getfsmap(ip->i_mount, &xhead, recs);
- switch (error) {
- case 0:
- /*
- * There are no more records in the result set. Copy
- * whatever we got to userspace and break out.
- */
- done = true;
- break;
- case -ECANCELED:
- /*
- * The internal memory buffer is full. Copy whatever
- * records we got to userspace and go again if we have
- * not yet filled the userspace buffer.
- */
- error = 0;
- break;
- default:
- goto out_free;
- }
- head.fmh_entries += xhead.fmh_entries;
- head.fmh_oflags = xhead.fmh_oflags;
-
- /*
- * If the caller wanted a record count or there aren't any
- * new records to return, we're done.
- */
- if (head.fmh_count == 0 || xhead.fmh_entries == 0)
- break;
-
- /* Copy all the records we got out to userspace. */
- if (copy_to_user(user_recs, recs,
- xhead.fmh_entries * sizeof(struct fsmap))) {
- error = -EFAULT;
- goto out_free;
- }
-
- /* Remember the last record flags we copied to userspace. */
- last_rec = &recs[xhead.fmh_entries - 1];
- last_flags = last_rec->fmr_flags;
-
- /* Set up the low key for the next iteration. */
- xfs_fsmap_to_internal(&xhead.fmh_keys[0], last_rec);
- trace_xfs_getfsmap_low_key(ip->i_mount, &xhead.fmh_keys[0]);
- } while (!done && head.fmh_entries < head.fmh_count);
-
- /*
- * If there are no more records in the query result set and we're not
- * in counting mode, mark the last record returned with the LAST flag.
- */
- if (done && head.fmh_count > 0 && head.fmh_entries > 0) {
- struct fsmap __user *user_rec;
-
- last_flags |= FMR_OF_LAST;
- user_rec = &arg->fmh_recs[head.fmh_entries - 1];
-
- if (copy_to_user(&user_rec->fmr_flags, &last_flags,
- sizeof(last_flags))) {
- error = -EFAULT;
- goto out_free;
- }
- }
-
- /* copy back header */
- if (copy_to_user(arg, &head, sizeof(struct fsmap_head))) {
- error = -EFAULT;
- goto out_free;
- }
-
-out_free:
- kvfree(recs);
- return error;
-}
-
int
xfs_ioc_swapext(
xfs_swapext_t *sxp)
* [PATCH 11/24] xfs: create incore realtime group structures
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (9 preceding siblings ...)
2024-08-23 0:17 ` [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c Darrick J. Wong
@ 2024-08-23 0:17 ` Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-25 23:56 ` Dave Chinner
2024-08-23 0:17 ` [PATCH 12/24] xfs: define locking primitives for realtime groups Darrick J. Wong
` (12 subsequent siblings)
23 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:17 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create an incore object that will contain information about a realtime
allocation group. This will eventually enable us to shard the realtime
section in a similar manner to how we shard the data section, but for
now just a single object for the entire RT subvolume is created.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 1
fs/xfs/libxfs/xfs_format.h | 3 +
fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_sb.c | 7 +
fs/xfs/libxfs/xfs_types.h | 4 +
fs/xfs/xfs_log_recover.c | 20 ++++
fs/xfs/xfs_mount.c | 16 +++
fs/xfs/xfs_mount.h | 14 +++
fs/xfs/xfs_rtalloc.c | 6 +
fs/xfs/xfs_super.c | 1
fs/xfs/xfs_trace.c | 1
fs/xfs/xfs_trace.h | 38 ++++++++
13 files changed, 517 insertions(+), 2 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4d8ca08cdd0ec..388b5cef48ca5 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -60,6 +60,7 @@ xfs-y += $(addprefix libxfs/, \
# xfs_rtbitmap is shared with libxfs
xfs-$(CONFIG_XFS_RT) += $(addprefix libxfs/, \
xfs_rtbitmap.o \
+ xfs_rtgroup.o \
)
# highlevel code
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 16a7bc02aa5f5..fa5cfc8265d92 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -176,6 +176,9 @@ typedef struct xfs_sb {
xfs_ino_t sb_metadirino; /* metadata directory tree root */
+ xfs_rgnumber_t sb_rgcount; /* number of realtime groups */
+ xfs_rtxlen_t sb_rgextents; /* size of a realtime group in rtx */
+
/* must be padded to 64 bit alignment */
} xfs_sb_t;
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
new file mode 100644
index 0000000000000..2bad1ecb811eb
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_ag.h"
+#include "xfs_ag_resv.h"
+#include "xfs_health.h"
+#include "xfs_error.h"
+#include "xfs_bmap.h"
+#include "xfs_defer.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_trace.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_rtgroup.h"
+#include "xfs_rtbitmap.h"
+
+/*
+ * Passive reference counting access wrappers to the rtgroup structures. If
+ * the rtgroup structure is to be freed, the freeing code is responsible for
+ * cleaning up objects with passive references before freeing the structure.
+ */
+struct xfs_rtgroup *
+xfs_rtgroup_get(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ struct xfs_rtgroup *rtg;
+
+ rcu_read_lock();
+ rtg = xa_load(&mp->m_rtgroups, rgno);
+ if (rtg) {
+ trace_xfs_rtgroup_get(rtg, _RET_IP_);
+ ASSERT(atomic_read(&rtg->rtg_ref) >= 0);
+ atomic_inc(&rtg->rtg_ref);
+ }
+ rcu_read_unlock();
+ return rtg;
+}
+
+/* Get a passive reference to the given rtgroup. */
+struct xfs_rtgroup *
+xfs_rtgroup_hold(
+ struct xfs_rtgroup *rtg)
+{
+ ASSERT(atomic_read(&rtg->rtg_ref) > 0 ||
+ atomic_read(&rtg->rtg_active_ref) > 0);
+
+ trace_xfs_rtgroup_hold(rtg, _RET_IP_);
+ atomic_inc(&rtg->rtg_ref);
+ return rtg;
+}
+
+void
+xfs_rtgroup_put(
+ struct xfs_rtgroup *rtg)
+{
+ trace_xfs_rtgroup_put(rtg, _RET_IP_);
+ ASSERT(atomic_read(&rtg->rtg_ref) > 0);
+ atomic_dec(&rtg->rtg_ref);
+}
+
+/*
+ * Active references for rtgroup structures. This is for short term access to
+ * the rtgroup structures for walking trees or accessing state. If an rtgroup
+ * is being shrunk or is offline, then this will fail to find that group and
+ * return NULL instead.
+ */
+struct xfs_rtgroup *
+xfs_rtgroup_grab(
+ struct xfs_mount *mp,
+ xfs_agnumber_t agno)
+{
+ struct xfs_rtgroup *rtg;
+
+ rcu_read_lock();
+ rtg = xa_load(&mp->m_rtgroups, agno);
+ if (rtg) {
+ trace_xfs_rtgroup_grab(rtg, _RET_IP_);
+ if (!atomic_inc_not_zero(&rtg->rtg_active_ref))
+ rtg = NULL;
+ }
+ rcu_read_unlock();
+ return rtg;
+}
+
+void
+xfs_rtgroup_rele(
+ struct xfs_rtgroup *rtg)
+{
+ trace_xfs_rtgroup_rele(rtg, _RET_IP_);
+ if (atomic_dec_and_test(&rtg->rtg_active_ref))
+ wake_up(&rtg->rtg_active_wq);
+}
+
+int
+xfs_rtgroup_alloc(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ struct xfs_rtgroup *rtg;
+ int error;
+
+ rtg = kzalloc(sizeof(struct xfs_rtgroup), GFP_KERNEL);
+ if (!rtg)
+ return -ENOMEM;
+ rtg->rtg_rgno = rgno;
+ rtg->rtg_mount = mp;
+
+ error = xa_insert(&mp->m_rtgroups, rgno, rtg, GFP_KERNEL);
+ if (error) {
+ WARN_ON_ONCE(error == -EBUSY);
+ goto out_free_rtg;
+ }
+
+#ifdef __KERNEL__
+ /* Place kernel structure only init below this point. */
+ spin_lock_init(&rtg->rtg_state_lock);
+ init_waitqueue_head(&rtg->rtg_active_wq);
+#endif /* __KERNEL__ */
+
+ /* Active ref owned by mount indicates rtgroup is online. */
+ atomic_set(&rtg->rtg_active_ref, 1);
+ return 0;
+
+out_free_rtg:
+ kfree(rtg);
+ return error;
+}
+
+void
+xfs_rtgroup_free(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ struct xfs_rtgroup *rtg;
+
+ rtg = xa_erase(&mp->m_rtgroups, rgno);
+ if (!rtg) /* can happen when growfs fails */
+ return;
+
+ XFS_IS_CORRUPT(mp, atomic_read(&rtg->rtg_ref) != 0);
+
+ /* drop the mount's active reference */
+ xfs_rtgroup_rele(rtg);
+ XFS_IS_CORRUPT(mp, atomic_read(&rtg->rtg_active_ref) != 0);
+
+ kfree_rcu_mightsleep(rtg);
+}
+
+/*
+ * Free up the rtgroup resources associated with the mount structure.
+ */
+void
+xfs_free_rtgroups(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgcount)
+{
+ xfs_rgnumber_t rgno;
+
+ for (rgno = 0; rgno < rgcount; rgno++)
+ xfs_rtgroup_free(mp, rgno);
+}
+
+/* Compute the number of rt extents in this realtime group. */
+xfs_rtxnum_t
+xfs_rtgroup_extents(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ xfs_rgnumber_t rgcount = mp->m_sb.sb_rgcount;
+
+ ASSERT(rgno < rgcount);
+ if (rgno == rgcount - 1)
+ return mp->m_sb.sb_rextents -
+ ((xfs_rtxnum_t)rgno * mp->m_sb.sb_rgextents);
+
+ ASSERT(xfs_has_rtgroups(mp));
+ return mp->m_sb.sb_rgextents;
+}
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
new file mode 100644
index 0000000000000..2c09ecfc50328
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __LIBXFS_RTGROUP_H
+#define __LIBXFS_RTGROUP_H 1
+
+struct xfs_mount;
+struct xfs_trans;
+
+/*
+ * Realtime group incore structure, similar to the per-AG structure.
+ */
+struct xfs_rtgroup {
+ struct xfs_mount *rtg_mount;
+ xfs_rgnumber_t rtg_rgno;
+ atomic_t rtg_ref; /* passive reference count */
+ atomic_t rtg_active_ref; /* active reference count */
+ wait_queue_head_t rtg_active_wq;/* woken when active_ref falls to zero */
+
+ /* Number of rt extents in this group */
+ xfs_rtxnum_t rtg_extents;
+
+#ifdef __KERNEL__
+ /* -- kernel only structures below this line -- */
+ spinlock_t rtg_state_lock;
+#endif /* __KERNEL__ */
+};
+
+#ifdef CONFIG_XFS_RT
+/* Passive rtgroup references */
+struct xfs_rtgroup *xfs_rtgroup_get(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+struct xfs_rtgroup *xfs_rtgroup_hold(struct xfs_rtgroup *rtg);
+void xfs_rtgroup_put(struct xfs_rtgroup *rtg);
+
+/* Active rtgroup references */
+struct xfs_rtgroup *xfs_rtgroup_grab(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+void xfs_rtgroup_rele(struct xfs_rtgroup *rtg);
+
+int xfs_rtgroup_alloc(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+void xfs_rtgroup_free(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+void xfs_free_rtgroups(struct xfs_mount *mp, xfs_rgnumber_t rgcount);
+#else /* CONFIG_XFS_RT */
+static inline struct xfs_rtgroup *xfs_rtgroup_get(struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ return NULL;
+}
+static inline struct xfs_rtgroup *xfs_rtgroup_hold(struct xfs_rtgroup *rtg)
+{
+ ASSERT(rtg == NULL);
+ return NULL;
+}
+static inline void xfs_rtgroup_put(struct xfs_rtgroup *rtg)
+{
+}
+static inline int xfs_rtgroup_alloc(struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ return 0;
+}
+static inline void xfs_free_rtgroups(struct xfs_mount *mp,
+ xfs_rgnumber_t rgcount)
+{
+}
+#define xfs_rtgroup_grab xfs_rtgroup_get
+#define xfs_rtgroup_rele xfs_rtgroup_put
+#endif /* CONFIG_XFS_RT */
+
+/*
+ * rt group iteration APIs
+ */
+static inline struct xfs_rtgroup *
+xfs_rtgroup_next(
+ struct xfs_rtgroup *rtg,
+ xfs_rgnumber_t *rgno,
+ xfs_rgnumber_t end_rgno)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+
+ *rgno = rtg->rtg_rgno + 1;
+ xfs_rtgroup_rele(rtg);
+ if (*rgno > end_rgno)
+ return NULL;
+ return xfs_rtgroup_grab(mp, *rgno);
+}
+
+#define for_each_rtgroup_range(mp, rgno, end_rgno, rtg) \
+ for ((rtg) = xfs_rtgroup_grab((mp), (rgno)); \
+ (rtg) != NULL; \
+ (rtg) = xfs_rtgroup_next((rtg), &(rgno), (end_rgno)))
+
+#define for_each_rtgroup_from(mp, rgno, rtg) \
+ for_each_rtgroup_range((mp), (rgno), (mp)->m_sb.sb_rgcount - 1, (rtg))
+
+#define for_each_rtgroup(mp, rgno, rtg) \
+ (rgno) = 0; \
+ for_each_rtgroup_from((mp), (rgno), (rtg))
+
+static inline bool
+xfs_verify_rgbno(
+ struct xfs_rtgroup *rtg,
+ xfs_rgblock_t rgbno)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+
+ if (rgbno >= rtg->rtg_extents * mp->m_sb.sb_rextsize)
+ return false;
+ if (xfs_has_rtsb(mp) && rtg->rtg_rgno == 0 &&
+ rgbno < mp->m_sb.sb_rextsize)
+ return false;
+ return true;
+}
+
+static inline bool
+xfs_verify_rgbext(
+ struct xfs_rtgroup *rtg,
+ xfs_rgblock_t rgbno,
+ xfs_rgblock_t len)
+{
+ if (rgbno + len <= rgbno)
+ return false;
+
+ if (!xfs_verify_rgbno(rtg, rgbno))
+ return false;
+
+ return xfs_verify_rgbno(rtg, rgbno + len - 1);
+}
+
+static inline xfs_rtblock_t
+xfs_rgno_start_rtb(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno)
+{
+ if (mp->m_rgblklog >= 0)
+ return ((xfs_rtblock_t)rgno << mp->m_rgblklog);
+ return ((xfs_rtblock_t)rgno * mp->m_rgblocks);
+}
+
+static inline xfs_rtblock_t
+xfs_rgbno_to_rtb(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno,
+ xfs_rgblock_t rgbno)
+{
+ return xfs_rgno_start_rtb(mp, rgno) + rgbno;
+}
+
+static inline xfs_rgnumber_t
+xfs_rtb_to_rgno(
+ struct xfs_mount *mp,
+ xfs_rtblock_t rtbno)
+{
+ if (!xfs_has_rtgroups(mp))
+ return 0;
+
+ if (mp->m_rgblklog >= 0)
+ return rtbno >> mp->m_rgblklog;
+
+ return div_u64(rtbno, mp->m_rgblocks);
+}
+
+static inline uint64_t
+__xfs_rtb_to_rgbno(
+ struct xfs_mount *mp,
+ xfs_rtblock_t rtbno)
+{
+ uint32_t rem;
+
+ if (!xfs_has_rtgroups(mp))
+ return rtbno;
+
+ if (mp->m_rgblklog >= 0)
+ return rtbno & mp->m_rgblkmask;
+
+ div_u64_rem(rtbno, mp->m_rgblocks, &rem);
+ return rem;
+}
+
+static inline xfs_rgblock_t
+xfs_rtb_to_rgbno(
+ struct xfs_mount *mp,
+ xfs_rtblock_t rtbno)
+{
+ return __xfs_rtb_to_rgbno(mp, rtbno);
+}
+
+static inline xfs_daddr_t
+xfs_rtb_to_daddr(
+ struct xfs_mount *mp,
+ xfs_rtblock_t rtbno)
+{
+ return rtbno << mp->m_blkbb_log;
+}
+
+static inline xfs_rtblock_t
+xfs_daddr_to_rtb(
+ struct xfs_mount *mp,
+ xfs_daddr_t daddr)
+{
+ return daddr >> mp->m_blkbb_log;
+}
+
+#ifdef CONFIG_XFS_RT
+xfs_rtxnum_t xfs_rtgroup_extents(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+#else
+# define xfs_rtgroup_extents(mp, rgno) (0)
+#endif /* CONFIG_XFS_RT */
+
+#endif /* __LIBXFS_RTGROUP_H */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b83ce29640511..f1cdffb2f3392 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -696,6 +696,9 @@ __xfs_sb_from_disk(
to->sb_metadirino = be64_to_cpu(from->sb_metadirino);
else
to->sb_metadirino = NULLFSINO;
+
+ to->sb_rgcount = 1;
+ to->sb_rgextents = 0;
}
void
@@ -982,6 +985,10 @@ xfs_mount_sb_set_rextsize(
{
mp->m_rtxblklog = log2_if_power2(sbp->sb_rextsize);
mp->m_rtxblkmask = mask64_if_power2(sbp->sb_rextsize);
+
+ mp->m_rgblocks = 0;
+ mp->m_rgblklog = 0;
+ mp->m_rgblkmask = 0;
}
/*
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index a8cd44d03ef64..1ce4b9eb16f47 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -9,10 +9,12 @@
typedef uint32_t prid_t; /* project ID */
typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
+typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
typedef uint32_t xfs_agino_t; /* inode # within allocation grp */
typedef uint32_t xfs_extlen_t; /* extent length in blocks */
typedef uint32_t xfs_rtxlen_t; /* file extent length in rtextents */
typedef uint32_t xfs_agnumber_t; /* allocation group number */
+typedef uint32_t xfs_rgnumber_t; /* realtime group number */
typedef uint64_t xfs_extnum_t; /* # of extents in a file */
typedef uint32_t xfs_aextnum_t; /* # extents in an attribute fork */
typedef int64_t xfs_fsize_t; /* bytes in a file */
@@ -53,7 +55,9 @@ typedef void * xfs_failaddr_t;
#define NULLFILEOFF ((xfs_fileoff_t)-1)
#define NULLAGBLOCK ((xfs_agblock_t)-1)
+#define NULLRGBLOCK ((xfs_rgblock_t)-1)
#define NULLAGNUMBER ((xfs_agnumber_t)-1)
+#define NULLRGNUMBER ((xfs_rgnumber_t)-1)
#define NULLCOMMITLSN ((xfs_lsn_t)-1)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 4423dd344239b..c627cde3bb1e0 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -28,6 +28,7 @@
#include "xfs_ag.h"
#include "xfs_quota.h"
#include "xfs_reflink.h"
+#include "xfs_rtgroup.h"
#define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1)
@@ -3346,6 +3347,7 @@ xlog_do_recover(
struct xfs_mount *mp = log->l_mp;
struct xfs_buf *bp = mp->m_sb_bp;
struct xfs_sb *sbp = &mp->m_sb;
+ xfs_rgnumber_t old_rgcount = sbp->sb_rgcount;
int error;
trace_xfs_log_recover(log, head_blk, tail_blk);
@@ -3399,6 +3401,24 @@ xlog_do_recover(
xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
return error;
}
+
+ if (sbp->sb_rgcount < old_rgcount) {
+ xfs_warn(mp, "rgcount shrink not supported");
+ return -EINVAL;
+ }
+ if (sbp->sb_rgcount > old_rgcount) {
+ xfs_rgnumber_t rgno;
+
+ for (rgno = old_rgcount; rgno < sbp->sb_rgcount; rgno++) {
+ error = xfs_rtgroup_alloc(mp, rgno);
+ if (error) {
+ xfs_warn(mp,
+ "Failed post-recovery rtgroup init: %d",
+ error);
+ return error;
+ }
+ }
+ }
mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
/* Normal transactions can now occur */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b0ea88acdb618..e1e849101cdd4 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -36,6 +36,7 @@
#include "xfs_ag.h"
#include "xfs_rtbitmap.h"
#include "xfs_metafile.h"
+#include "xfs_rtgroup.h"
#include "scrub/stats.h"
static DEFINE_MUTEX(xfs_uuid_table_mutex);
@@ -664,6 +665,7 @@ xfs_mountfs(
struct xfs_ino_geometry *igeo = M_IGEO(mp);
uint quotamount = 0;
uint quotaflags = 0;
+ xfs_rgnumber_t rgno;
int error = 0;
xfs_sb_mount_common(mp, sbp);
@@ -830,10 +832,18 @@ xfs_mountfs(
goto out_free_dir;
}
+ for (rgno = 0; rgno < mp->m_sb.sb_rgcount; rgno++) {
+ error = xfs_rtgroup_alloc(mp, rgno);
+ if (error) {
+ xfs_warn(mp, "Failed rtgroup init: %d", error);
+ goto out_free_rtgroup;
+ }
+ }
+
if (XFS_IS_CORRUPT(mp, !sbp->sb_logblocks)) {
xfs_warn(mp, "no log defined");
error = -EFSCORRUPTED;
- goto out_free_perag;
+ goto out_free_rtgroup;
}
error = xfs_inodegc_register_shrinker(mp);
@@ -1068,7 +1078,8 @@ xfs_mountfs(
if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
xfs_buftarg_drain(mp->m_logdev_targp);
xfs_buftarg_drain(mp->m_ddev_targp);
- out_free_perag:
+ out_free_rtgroup:
+ xfs_free_rtgroups(mp, rgno);
xfs_free_perag(mp);
out_free_dir:
xfs_da_unmount(mp);
@@ -1152,6 +1163,7 @@ xfs_unmountfs(
xfs_errortag_clearall(mp);
#endif
shrinker_free(mp->m_inodegc_shrinker);
+ xfs_free_rtgroups(mp, mp->m_sb.sb_rgcount);
xfs_free_perag(mp);
xfs_errortag_del(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9e883d2159fd9..f69da6802e8c1 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -121,6 +121,7 @@ typedef struct xfs_mount {
uint8_t m_agno_log; /* log #ag's */
uint8_t m_sectbb_log; /* sectlog - BBSHIFT */
int8_t m_rtxblklog; /* log2 of rextsize, if possible */
+ int8_t m_rgblklog; /* log2 of rt group sz if possible */
uint m_blockmask; /* sb_blocksize-1 */
uint m_blockwsize; /* sb_blocksize in words */
uint m_blockwmask; /* blockwsize-1 */
@@ -149,12 +150,14 @@ typedef struct xfs_mount {
int m_logbsize; /* size of each log buffer */
uint m_rsumlevels; /* rt summary levels */
xfs_filblks_t m_rsumblocks; /* size of rt summary, FSBs */
+ uint32_t m_rgblocks; /* size of rtgroup in rtblocks */
int m_fixedfsid[2]; /* unchanged for life of FS */
uint m_qflags; /* quota status flags */
uint64_t m_features; /* active filesystem features */
uint64_t m_low_space[XFS_LOWSP_MAX];
uint64_t m_low_rtexts[XFS_LOWSP_MAX];
uint64_t m_rtxblkmask; /* rt extent block mask */
+ uint64_t m_rgblkmask; /* rt group block mask */
struct xfs_ino_geometry m_ino_geo; /* inode geometry */
struct xfs_trans_resv m_resv; /* precomputed res values */
/* low free space thresholds */
@@ -209,6 +212,7 @@ typedef struct xfs_mount {
*/
atomic64_t m_allocbt_blks;
+ struct xarray m_rtgroups; /* per-rt group info */
struct radix_tree_root m_perag_tree; /* per-ag accounting info */
spinlock_t m_perag_lock; /* lock for m_perag_tree */
uint64_t m_resblks; /* total reserved blocks */
@@ -358,6 +362,16 @@ __XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
__XFS_HAS_FEAT(metadir, METADIR)
+static inline bool xfs_has_rtgroups(struct xfs_mount *mp)
+{
+ return false;
+}
+
+static inline bool xfs_has_rtsb(struct xfs_mount *mp)
+{
+ return false;
+}
+
/*
* Some features are always on for v5 file systems, allow the compiler to
eliminate dead code when building without v4 support.
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 46a920b192d19..59898117f817d 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -27,6 +27,7 @@
#include "xfs_health.h"
#include "xfs_da_format.h"
#include "xfs_metafile.h"
+#include "xfs_rtgroup.h"
/*
* Return whether there are any free extents in the size range given
@@ -1136,6 +1137,8 @@ xfs_rtmount_inodes(
{
struct xfs_trans *tp;
struct xfs_sb *sbp = &mp->m_sb;
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
int error;
error = xfs_trans_alloc_empty(mp, &tp);
@@ -1166,6 +1169,9 @@ xfs_rtmount_inodes(
if (error)
goto out_rele_summary;
+ for_each_rtgroup(mp, rgno, rtg)
+ rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
+
error = xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
if (error)
goto out_rele_summary;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 34066b50585e8..cee64c1a7d650 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2015,6 +2015,7 @@ static int xfs_init_fs_context(
spin_lock_init(&mp->m_sb_lock);
INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
spin_lock_init(&mp->m_perag_lock);
+ xa_init(&mp->m_rtgroups);
mutex_init(&mp->m_growlock);
INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index c5f818cf40c29..f888d41e3283f 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -46,6 +46,7 @@
#include "xfs_refcount.h"
#include "xfs_metafile.h"
#include "xfs_metadir.h"
+#include "xfs_rtgroup.h"
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7f259891ebcaa..4401a7c6230df 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -94,6 +94,7 @@ struct xfs_extent_free_item;
struct xfs_rmap_intent;
struct xfs_refcount_intent;
struct xfs_metadir_update;
+struct xfs_rtgroup;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -220,6 +221,43 @@ DEFINE_PERAG_REF_EVENT(xfs_perag_rele);
DEFINE_PERAG_REF_EVENT(xfs_perag_set_inode_tag);
DEFINE_PERAG_REF_EVENT(xfs_perag_clear_inode_tag);
+#ifdef CONFIG_XFS_RT
+DECLARE_EVENT_CLASS(xfs_rtgroup_class,
+ TP_PROTO(struct xfs_rtgroup *rtg, unsigned long caller_ip),
+ TP_ARGS(rtg, caller_ip),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_rgnumber_t, rgno)
+ __field(int, refcount)
+ __field(int, active_refcount)
+ __field(unsigned long, caller_ip)
+ ),
+ TP_fast_assign(
+ __entry->dev = rtg->rtg_mount->m_super->s_dev;
+ __entry->rgno = rtg->rtg_rgno;
+ __entry->refcount = atomic_read(&rtg->rtg_ref);
+ __entry->active_refcount = atomic_read(&rtg->rtg_active_ref);
+ __entry->caller_ip = caller_ip;
+ ),
+ TP_printk("dev %d:%d rgno 0x%x passive refs %d active refs %d caller %pS",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->rgno,
+ __entry->refcount,
+ __entry->active_refcount,
+ (char *)__entry->caller_ip)
+);
+
+#define DEFINE_RTGROUP_REF_EVENT(name) \
+DEFINE_EVENT(xfs_rtgroup_class, name, \
+ TP_PROTO(struct xfs_rtgroup *rtg, unsigned long caller_ip), \
+ TP_ARGS(rtg, caller_ip))
+DEFINE_RTGROUP_REF_EVENT(xfs_rtgroup_get);
+DEFINE_RTGROUP_REF_EVENT(xfs_rtgroup_hold);
+DEFINE_RTGROUP_REF_EVENT(xfs_rtgroup_put);
+DEFINE_RTGROUP_REF_EVENT(xfs_rtgroup_grab);
+DEFINE_RTGROUP_REF_EVENT(xfs_rtgroup_rele);
+#endif /* CONFIG_XFS_RT */
+
TRACE_EVENT(xfs_inodegc_worker,
TP_PROTO(struct xfs_mount *mp, unsigned int shrinker_hits),
TP_ARGS(mp, shrinker_hits),
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 12/24] xfs: define locking primitives for realtime groups
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (10 preceding siblings ...)
2024-08-23 0:17 ` [PATCH 11/24] xfs: create incore realtime group structures Darrick J. Wong
@ 2024-08-23 0:17 ` Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-23 0:18 ` [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes Darrick J. Wong
` (11 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:17 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Define helper functions to lock all metadata inodes related to a
realtime group. There's not much to look at now, but this will become
important when we add per-rtgroup metadata files and online fsck code
for them.
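The assertion pattern in these lock helpers encodes a simple rule: the
flags passed in must be a subset of the known set, and the exclusive and
shared bitmap flags are mutually exclusive. That validity check can be
modeled in a few lines of standalone userspace C; the RTGLOCK_* names
below are illustrative stand-ins for the kernel's XFS_RTGLOCK_* flags,
not the kernel API itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's XFS_RTGLOCK_* flags. */
#define RTGLOCK_BITMAP        (1U << 0)  /* bitmap inode, exclusive */
#define RTGLOCK_BITMAP_SHARED (1U << 1)  /* bitmap inode, shared */
#define RTGLOCK_ALL_FLAGS     (RTGLOCK_BITMAP | RTGLOCK_BITMAP_SHARED)

/* Return true if the flag combination is one the lock helpers accept. */
static bool rtglock_flags_valid(unsigned int flags)
{
	/* no unknown bits allowed */
	if (flags & ~RTGLOCK_ALL_FLAGS)
		return false;
	/* exclusive and shared bitmap locks are mutually exclusive */
	if ((flags & RTGLOCK_BITMAP) && (flags & RTGLOCK_BITMAP_SHARED))
		return false;
	return true;
}
```

The kernel expresses the same rule with ASSERTs at the top of
xfs_rtgroup_lock() and xfs_rtgroup_unlock() so that lock and unlock
calls stay symmetric.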
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtgroup.c | 49 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 16 ++++++++++++++
2 files changed, 65 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 2bad1ecb811eb..51f04cad5227c 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -194,3 +194,52 @@ xfs_rtgroup_extents(
ASSERT(xfs_has_rtgroups(mp));
return mp->m_sb.sb_rgextents;
}
+
+/* Lock metadata inodes associated with this rt group. */
+void
+xfs_rtgroup_lock(
+ struct xfs_rtgroup *rtg,
+ unsigned int rtglock_flags)
+{
+ ASSERT(!(rtglock_flags & ~XFS_RTGLOCK_ALL_FLAGS));
+ ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) ||
+ !(rtglock_flags & XFS_RTGLOCK_BITMAP));
+
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP)
+ xfs_rtbitmap_lock(rtg->rtg_mount);
+ else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED)
+ xfs_rtbitmap_lock_shared(rtg->rtg_mount, XFS_RBMLOCK_BITMAP);
+}
+
+/* Unlock metadata inodes associated with this rt group. */
+void
+xfs_rtgroup_unlock(
+ struct xfs_rtgroup *rtg,
+ unsigned int rtglock_flags)
+{
+ ASSERT(!(rtglock_flags & ~XFS_RTGLOCK_ALL_FLAGS));
+ ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) ||
+ !(rtglock_flags & XFS_RTGLOCK_BITMAP));
+
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP)
+ xfs_rtbitmap_unlock(rtg->rtg_mount);
+ else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED)
+ xfs_rtbitmap_unlock_shared(rtg->rtg_mount, XFS_RBMLOCK_BITMAP);
+}
+
+/*
+ * Join realtime group metadata inodes to the transaction. The ILOCKs will be
+ * released on transaction commit.
+ */
+void
+xfs_rtgroup_trans_join(
+ struct xfs_trans *tp,
+ struct xfs_rtgroup *rtg,
+ unsigned int rtglock_flags)
+{
+ ASSERT(!(rtglock_flags & ~XFS_RTGLOCK_ALL_FLAGS));
+ ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED));
+
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP)
+ xfs_rtbitmap_trans_join(tp);
+}
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index 2c09ecfc50328..d2eb2cd5775dd 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -205,8 +205,24 @@ xfs_daddr_to_rtb(
#ifdef CONFIG_XFS_RT
xfs_rtxnum_t xfs_rtgroup_extents(struct xfs_mount *mp, xfs_rgnumber_t rgno);
+
+/* Lock the rt bitmap inode in exclusive mode */
+#define XFS_RTGLOCK_BITMAP (1U << 0)
+/* Lock the rt bitmap inode in shared mode */
+#define XFS_RTGLOCK_BITMAP_SHARED (1U << 1)
+
+#define XFS_RTGLOCK_ALL_FLAGS (XFS_RTGLOCK_BITMAP | \
+ XFS_RTGLOCK_BITMAP_SHARED)
+
+void xfs_rtgroup_lock(struct xfs_rtgroup *rtg, unsigned int rtglock_flags);
+void xfs_rtgroup_unlock(struct xfs_rtgroup *rtg, unsigned int rtglock_flags);
+void xfs_rtgroup_trans_join(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
+ unsigned int rtglock_flags);
#else
# define xfs_rtgroup_extents(mp, rgno) (0)
+# define xfs_rtgroup_lock(rtg, gf) ((void)0)
+# define xfs_rtgroup_unlock(rtg, gf) ((void)0)
+# define xfs_rtgroup_trans_join(tp, rtg, gf) ((void)0)
#endif /* CONFIG_XFS_RT */
#endif /* __LIBXFS_RTGROUP_H */
* [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (11 preceding siblings ...)
2024-08-23 0:17 ` [PATCH 12/24] xfs: define locking primitives for realtime groups Darrick J. Wong
@ 2024-08-23 0:18 ` Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-25 23:58 ` Dave Chinner
2024-08-23 0:18 ` [PATCH 14/24] xfs: support caching rtgroup metadata inodes Darrick J. Wong
` (10 subsequent siblings)
23 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:18 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add a dynamic lockdep class key for rtgroup inodes. This will enable
lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
order. Each class can have 8 subclasses, and for now we will only have
2 inodes per group. This enables rtgroup order and inode order checks
when nesting ILOCKs.
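The comparison function at the heart of this patch simply orders inodes
by the owning group number, which the series stashes in i_projid; that
is what lets lockdep verify that nested ILOCKs are always taken in
ascending rtgroup order. A minimal userspace model of the comparator
follows (struct fake_inode and rtg_ilock_cmp are hypothetical names,
not kernel code).

```c
#include <assert.h>

/* Userspace stand-in: rtgroup inode ILOCKs compare by owning group
 * number, which the kernel patch stores in i_projid. */
struct fake_inode {
	unsigned int projid;	/* owning rtgroup number */
};

/* Mirror of the lockdep cmp_fn contract: return negative, zero, or
 * positive like memcmp, so the checker can confirm locks are acquired
 * in ascending group order. */
static int rtg_ilock_cmp(const struct fake_inode *a,
			 const struct fake_inode *b)
{
	if (a->projid < b->projid)
		return -1;
	if (a->projid > b->projid)
		return 1;
	return 0;
}
```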
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 51f04cad5227c..ae6d67c673b1a 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
if (rtglock_flags & XFS_RTGLOCK_BITMAP)
xfs_rtbitmap_trans_join(tp);
}
+
+#ifdef CONFIG_PROVE_LOCKING
+static struct lock_class_key xfs_rtginode_lock_class;
+
+static int
+xfs_rtginode_ilock_cmp_fn(
+ const struct lockdep_map *m1,
+ const struct lockdep_map *m2)
+{
+ const struct xfs_inode *ip1 =
+ container_of(m1, struct xfs_inode, i_lock.dep_map);
+ const struct xfs_inode *ip2 =
+ container_of(m2, struct xfs_inode, i_lock.dep_map);
+
+ if (ip1->i_projid < ip2->i_projid)
+ return -1;
+ if (ip1->i_projid > ip2->i_projid)
+ return 1;
+ return 0;
+}
+
+static inline void
+xfs_rtginode_ilock_print_fn(
+ const struct lockdep_map *m)
+{
+ const struct xfs_inode *ip =
+ container_of(m, struct xfs_inode, i_lock.dep_map);
+
+ printk(KERN_CONT " rgno=%u", ip->i_projid);
+}
+
+/*
+ * Most of the time, each of the RTG inode locks is only taken one at a time.
+ * But when committing deferred ops, more than one of a kind can be taken.
+ * However, deferred rt ops will be committed in rgno order so there is no
+ * potential for deadlocks. The code here is needed to tell lockdep about this
+ * order.
+ */
+static inline void
+xfs_rtginode_lockdep_setup(
+ struct xfs_inode *ip,
+ xfs_rgnumber_t rgno,
+ enum xfs_rtg_inodes type)
+{
+ lockdep_set_class_and_subclass(&ip->i_lock, &xfs_rtginode_lock_class,
+ type);
+ lock_set_cmp_fn(&ip->i_lock, xfs_rtginode_ilock_cmp_fn,
+ xfs_rtginode_ilock_print_fn);
+}
+#else
+#define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
+#endif /* CONFIG_PROVE_LOCKING */
* [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (12 preceding siblings ...)
2024-08-23 0:18 ` [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes Darrick J. Wong
@ 2024-08-23 0:18 ` Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-26 1:41 ` Dave Chinner
2024-08-23 0:18 ` [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management Darrick J. Wong
` (9 subsequent siblings)
23 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:18 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create the per-rtgroup infrastructure needed to load metadata inodes
into memory.
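The patch names each per-group metadata file "<rgno>.<name>" under the
"rtgroups" metadata directory (see xfs_rtginode_path, which builds the
string with kasprintf). A userspace sketch of that naming scheme,
assuming a fixed-size static buffer in place of kasprintf; the function
name and buffer size are illustrative only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the metadir path for an rtgroup metadata inode, e.g. group 0's
 * bitmap file is "0.bitmap" under the "rtgroups" directory.  Uses a
 * static buffer for illustration; the kernel allocates with kasprintf. */
static const char *rtginode_path(unsigned int rgno, const char *name)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "%u.%s", rgno, name);
	return buf;
}
```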
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtgroup.c | 182 +++++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 28 +++++++
fs/xfs/xfs_mount.h | 1
fs/xfs/xfs_rtalloc.c | 48 +++++++++++
4 files changed, 258 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index ae6d67c673b1a..50e4a56d749f0 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -30,6 +30,8 @@
#include "xfs_icache.h"
#include "xfs_rtgroup.h"
#include "xfs_rtbitmap.h"
+#include "xfs_metafile.h"
+#include "xfs_metadir.h"
/*
* Passive reference counting access wrappers to the rtgroup structures. If
@@ -295,3 +297,183 @@ xfs_rtginode_lockdep_setup(
#else
#define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
#endif /* CONFIG_PROVE_LOCKING */
+
+struct xfs_rtginode_ops {
+ const char *name; /* short name */
+
+ enum xfs_metafile_type metafile_type;
+
+ /* Does the fs have this feature? */
+ bool (*enabled)(struct xfs_mount *mp);
+
+ /* Create this rtgroup metadata inode and initialize it. */
+ int (*create)(struct xfs_rtgroup *rtg,
+ struct xfs_inode *ip,
+ struct xfs_trans *tp,
+ bool init);
+};
+
+static const struct xfs_rtginode_ops xfs_rtginode_ops[XFS_RTGI_MAX] = {
+};
+
+/* Return the shortname of this rtgroup inode. */
+const char *
+xfs_rtginode_name(
+ enum xfs_rtg_inodes type)
+{
+ return xfs_rtginode_ops[type].name;
+}
+
+/* Should this rtgroup inode be present? */
+bool
+xfs_rtginode_enabled(
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type)
+{
+ const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
+
+ if (!ops->enabled)
+ return true;
+ return ops->enabled(rtg->rtg_mount);
+}
+
+/* Load an existing rtgroup inode into the rtgroup structure. */
+int
+xfs_rtginode_load(
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type,
+ struct xfs_trans *tp)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ const char *path;
+ struct xfs_inode *ip;
+ const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
+ int error;
+
+ if (!xfs_rtginode_enabled(rtg, type))
+ return 0;
+
+ if (!mp->m_rtdirip)
+ return -EFSCORRUPTED;
+
+ path = xfs_rtginode_path(rtg->rtg_rgno, type);
+ if (!path)
+ return -ENOMEM;
+ error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
+ &ip);
+ kfree(path);
+
+ if (error)
+ return error;
+
+ if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
+ ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
+ xfs_irele(ip);
+ return -EFSCORRUPTED;
+ }
+
+ if (XFS_IS_CORRUPT(mp, ip->i_projid != rtg->rtg_rgno)) {
+ xfs_irele(ip);
+ return -EFSCORRUPTED;
+ }
+
+ xfs_rtginode_lockdep_setup(ip, rtg->rtg_rgno, type);
+ rtg->rtg_inodes[type] = ip;
+ return 0;
+}
+
+/* Release an rtgroup metadata inode. */
+void
+xfs_rtginode_irele(
+ struct xfs_inode **ipp)
+{
+ if (*ipp)
+ xfs_irele(*ipp);
+ *ipp = NULL;
+}
+
+/* Create an rtgroup metadata inode and initialize it. */
+int
+xfs_rtginode_create(
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type,
+ bool init)
+{
+ const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
+ struct xfs_mount *mp = rtg->rtg_mount;
+ struct xfs_metadir_update upd = {
+ .dp = mp->m_rtdirip,
+ .metafile_type = ops->metafile_type,
+ };
+ int error;
+
+ if (!xfs_rtginode_enabled(rtg, type))
+ return 0;
+
+ if (!mp->m_rtdirip)
+ return -EFSCORRUPTED;
+
+ upd.path = xfs_rtginode_path(rtg->rtg_rgno, type);
+ if (!upd.path)
+ return -ENOMEM;
+
+ error = xfs_metadir_start_create(&upd);
+ if (error)
+ goto out_path;
+
+ error = xfs_metadir_create(&upd, S_IFREG);
+ if (error)
+ goto out_cancel;
+
+ xfs_rtginode_lockdep_setup(upd.ip, rtg->rtg_rgno, type);
+
+ upd.ip->i_projid = rtg->rtg_rgno;
+ error = ops->create(rtg, upd.ip, upd.tp, init);
+ if (error)
+ goto out_cancel;
+
+ error = xfs_metadir_commit(&upd);
+ if (error)
+ goto out_path;
+
+ kfree(upd.path);
+ xfs_finish_inode_setup(upd.ip);
+ rtg->rtg_inodes[type] = upd.ip;
+ return 0;
+
+out_cancel:
+ xfs_metadir_cancel(&upd, error);
+ /* Have to finish setting up the inode to ensure it's deleted. */
+ if (upd.ip) {
+ xfs_finish_inode_setup(upd.ip);
+ xfs_irele(upd.ip);
+ }
+out_path:
+ kfree(upd.path);
+ return error;
+}
+
+/* Create the parent directory for all rtgroup inodes and load it. */
+int
+xfs_rtginode_mkdir_parent(
+ struct xfs_mount *mp)
+{
+ if (!mp->m_metadirip)
+ return -EFSCORRUPTED;
+
+ return xfs_metadir_mkdir(mp->m_metadirip, "rtgroups", &mp->m_rtdirip);
+}
+
+/* Load the parent directory of all rtgroup inodes. */
+int
+xfs_rtginode_load_parent(
+ struct xfs_trans *tp)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+
+ if (!mp->m_metadirip)
+ return -EFSCORRUPTED;
+
+ return xfs_metadir_load(tp, mp->m_metadirip, "rtgroups",
+ XFS_METAFILE_DIR, &mp->m_rtdirip);
+}
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index d2eb2cd5775dd..b5c769211b4bb 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -9,6 +9,14 @@
struct xfs_mount;
struct xfs_trans;
+enum xfs_rtg_inodes {
+ XFS_RTGI_MAX,
+};
+
+#ifdef MAX_LOCKDEP_SUBCLASSES
+static_assert(XFS_RTGI_MAX <= MAX_LOCKDEP_SUBCLASSES);
+#endif
+
/*
* Realtime group incore structure, similar to the per-AG structure.
*/
@@ -19,6 +27,9 @@ struct xfs_rtgroup {
atomic_t rtg_active_ref; /* active reference count */
wait_queue_head_t rtg_active_wq;/* woken active_ref falls to zero */
+ /* per-rtgroup metadata inodes */
+ struct xfs_inode *rtg_inodes[XFS_RTGI_MAX];
+
/* Number of rt extents in this group */
xfs_rtxnum_t rtg_extents;
@@ -218,6 +229,23 @@ void xfs_rtgroup_lock(struct xfs_rtgroup *rtg, unsigned int rtglock_flags);
void xfs_rtgroup_unlock(struct xfs_rtgroup *rtg, unsigned int rtglock_flags);
void xfs_rtgroup_trans_join(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
unsigned int rtglock_flags);
+
+int xfs_rtginode_mkdir_parent(struct xfs_mount *mp);
+int xfs_rtginode_load_parent(struct xfs_trans *tp);
+
+const char *xfs_rtginode_name(enum xfs_rtg_inodes type);
+bool xfs_rtginode_enabled(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type);
+int xfs_rtginode_load(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type,
+ struct xfs_trans *tp);
+int xfs_rtginode_create(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type,
+ bool init);
+void xfs_rtginode_irele(struct xfs_inode **ipp);
+
+static inline const char *xfs_rtginode_path(xfs_rgnumber_t rgno,
+ enum xfs_rtg_inodes type)
+{
+ return kasprintf(GFP_KERNEL, "%u.%s", rgno, xfs_rtginode_name(type));
+}
#else
# define xfs_rtgroup_extents(mp, rgno) (0)
# define xfs_rtgroup_lock(rtg, gf) ((void)0)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f69da6802e8c1..73959c26075a5 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -94,6 +94,7 @@ typedef struct xfs_mount {
struct xfs_inode *m_rsumip; /* pointer to summary inode */
struct xfs_inode *m_rootip; /* pointer to root directory */
struct xfs_inode *m_metadirip; /* ptr to metadata directory */
+ struct xfs_inode *m_rtdirip; /* ptr to realtime metadir */
struct xfs_quotainfo *m_quotainfo; /* disk quota information */
struct xfs_buftarg *m_ddev_targp; /* data device */
struct xfs_buftarg *m_logdev_targp;/* log device */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 59898117f817d..dcdb726ebe4a0 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -28,6 +28,7 @@
#include "xfs_da_format.h"
#include "xfs_metafile.h"
#include "xfs_rtgroup.h"
+#include "xfs_error.h"
/*
* Return whether there are any free extents in the size range given
@@ -652,6 +653,16 @@ xfs_rtallocate_extent_size(
return -ENOSPC;
}
+static void
+xfs_rtunmount_rtg(
+ struct xfs_rtgroup *rtg)
+{
+ int i;
+
+ for (i = 0; i < XFS_RTGI_MAX; i++)
+ xfs_rtginode_irele(&rtg->rtg_inodes[i]);
+}
+
static int
xfs_alloc_rsum_cache(
struct xfs_mount *mp,
@@ -1127,6 +1138,18 @@ xfs_rtmount_iread_extents(
return error;
}
+static void
+xfs_rtgroup_unmount_inodes(
+ struct xfs_mount *mp)
+{
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
+
+ for_each_rtgroup(mp, rgno, rtg)
+ xfs_rtunmount_rtg(rtg);
+ xfs_rtginode_irele(&mp->m_rtdirip);
+}
+
/*
* Get the bitmap and summary inodes and the summary cache into the mount
* structure at mount time.
@@ -1139,6 +1162,7 @@ xfs_rtmount_inodes(
struct xfs_sb *sbp = &mp->m_sb;
struct xfs_rtgroup *rtg;
xfs_rgnumber_t rgno;
+ unsigned int i;
int error;
error = xfs_trans_alloc_empty(mp, &tp);
@@ -1169,15 +1193,34 @@ xfs_rtmount_inodes(
if (error)
goto out_rele_summary;
- for_each_rtgroup(mp, rgno, rtg)
+ if (xfs_has_rtgroups(mp) && mp->m_sb.sb_rgcount > 0) {
+ error = xfs_rtginode_load_parent(tp);
+ if (error)
+ goto out_rele_rtdir;
+ }
+
+ for_each_rtgroup(mp, rgno, rtg) {
rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
+ for (i = 0; i < XFS_RTGI_MAX; i++) {
+ error = xfs_rtginode_load(rtg, i, tp);
+ if (error) {
+ xfs_rtgroup_rele(rtg);
+ goto out_rele_inodes;
+ }
+ }
+ }
+
error = xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
if (error)
goto out_rele_inodes;
xfs_trans_cancel(tp);
return 0;
+out_rele_inodes:
+ xfs_rtgroup_unmount_inodes(mp);
+out_rele_rtdir:
+ xfs_rtginode_irele(&mp->m_rtdirip);
out_rele_summary:
xfs_irele(mp->m_rsumip);
out_rele_bitmap:
@@ -1192,6 +1235,9 @@ xfs_rtunmount_inodes(
struct xfs_mount *mp)
{
kvfree(mp->m_rsum_cache);
+
+ xfs_rtgroup_unmount_inodes(mp);
+ xfs_rtginode_irele(&mp->m_rtdirip);
if (mp->m_rbmip)
xfs_irele(mp->m_rbmip);
if (mp->m_rsumip)
* [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (13 preceding siblings ...)
2024-08-23 0:18 ` [PATCH 14/24] xfs: support caching rtgroup metadata inodes Darrick J. Wong
@ 2024-08-23 0:18 ` Darrick J. Wong
2024-08-23 5:03 ` Christoph Hellwig
2024-08-23 0:18 ` [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup Darrick J. Wong
` (8 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:18 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create a pair of helpers to deal with setting up the necessary incore
context to check metadata records against the realtime metadata. Right
now this is limited to locking the realtime bitmap and summary inodes,
but as we add rmap and reflink to the realtime device this will grow to
include btree cursors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/common.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/common.h | 30 +++++++++++++++++++
fs/xfs/scrub/scrub.c | 29 ++++++++++++++++++
fs/xfs/scrub/scrub.h | 13 ++++++++
4 files changed, 150 insertions(+)
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 5245943496c8b..8d44f18787c42 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -34,6 +34,7 @@
#include "xfs_quota.h"
#include "xfs_exchmaps.h"
#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
@@ -121,6 +122,17 @@ xchk_process_error(
XFS_SCRUB_OFLAG_CORRUPT, __return_address);
}
+bool
+xchk_process_rt_error(
+ struct xfs_scrub *sc,
+ xfs_rgnumber_t rgno,
+ xfs_rgblock_t rgbno,
+ int *error)
+{
+ return __xchk_process_error(sc, rgno, rgbno, error,
+ XFS_SCRUB_OFLAG_CORRUPT, __return_address);
+}
+
bool
xchk_xref_process_error(
struct xfs_scrub *sc,
@@ -684,6 +696,72 @@ xchk_ag_init(
return 0;
}
+#ifdef CONFIG_XFS_RT
+/*
+ * For scrubbing a realtime group, grab all the in-core resources we'll need to
+ * check the metadata, which means taking the ILOCK of the realtime group's
+ * metadata inodes. Callers must not join these inodes to the transaction with
+ * non-zero lockflags or concurrency problems will result. The @rtglock_flags
+ * argument takes XFS_RTGLOCK_* flags.
+ */
+int
+xchk_rtgroup_init(
+ struct xfs_scrub *sc,
+ xfs_rgnumber_t rgno,
+ struct xchk_rt *sr)
+{
+ ASSERT(sr->rtg == NULL);
+ ASSERT(sr->rtlock_flags == 0);
+
+ sr->rtg = xfs_rtgroup_get(sc->mp, rgno);
+ if (!sr->rtg)
+ return -ENOENT;
+ return 0;
+}
+
+void
+xchk_rtgroup_lock(
+ struct xchk_rt *sr,
+ unsigned int rtglock_flags)
+{
+ xfs_rtgroup_lock(sr->rtg, rtglock_flags);
+ sr->rtlock_flags = rtglock_flags;
+}
+
+/*
+ * Unlock the realtime group. This must be done /after/ committing (or
+ * cancelling) the scrub transaction.
+ */
+static void
+xchk_rtgroup_unlock(
+ struct xchk_rt *sr)
+{
+ ASSERT(sr->rtg != NULL);
+
+ if (sr->rtlock_flags) {
+ xfs_rtgroup_unlock(sr->rtg, sr->rtlock_flags);
+ sr->rtlock_flags = 0;
+ }
+}
+
+/*
+ * Unlock the realtime group and release its resources. This must be done
+ * /after/ committing (or cancelling) the scrub transaction.
+ */
+void
+xchk_rtgroup_free(
+ struct xfs_scrub *sc,
+ struct xchk_rt *sr)
+{
+ ASSERT(sr->rtg != NULL);
+
+ xchk_rtgroup_unlock(sr);
+
+ xfs_rtgroup_put(sr->rtg);
+ sr->rtg = NULL;
+}
+#endif /* CONFIG_XFS_RT */
+
/* Per-scrubber setup functions */
void
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 27e5bf8f7c60b..0d531770e83b0 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -12,6 +12,8 @@ void xchk_trans_cancel(struct xfs_scrub *sc);
bool xchk_process_error(struct xfs_scrub *sc, xfs_agnumber_t agno,
xfs_agblock_t bno, int *error);
+bool xchk_process_rt_error(struct xfs_scrub *sc, xfs_rgnumber_t rgno,
+ xfs_rgblock_t rgbno, int *error);
bool xchk_fblock_process_error(struct xfs_scrub *sc, int whichfork,
xfs_fileoff_t offset, int *error);
@@ -118,6 +120,34 @@ xchk_ag_init_existing(
return error == -ENOENT ? -EFSCORRUPTED : error;
}
+#ifdef CONFIG_XFS_RT
+
+/* All the locks we need to check an rtgroup. */
+#define XCHK_RTGLOCK_ALL (XFS_RTGLOCK_BITMAP)
+
+int xchk_rtgroup_init(struct xfs_scrub *sc, xfs_rgnumber_t rgno,
+ struct xchk_rt *sr);
+
+static inline int
+xchk_rtgroup_init_existing(
+ struct xfs_scrub *sc,
+ xfs_rgnumber_t rgno,
+ struct xchk_rt *sr)
+{
+ int error = xchk_rtgroup_init(sc, rgno, sr);
+
+ return error == -ENOENT ? -EFSCORRUPTED : error;
+}
+
+void xchk_rtgroup_lock(struct xchk_rt *sr, unsigned int rtglock_flags);
+void xchk_rtgroup_free(struct xfs_scrub *sc, struct xchk_rt *sr);
+#else
+# define xchk_rtgroup_init(sc, rgno, sr) (-EFSCORRUPTED)
+# define xchk_rtgroup_init_existing(sc, rgno, sr) (-EFSCORRUPTED)
+# define xchk_rtgroup_lock(sc, lockflags) do { } while (0)
+# define xchk_rtgroup_free(sc, sr) do { } while (0)
+#endif /* CONFIG_XFS_RT */
+
int xchk_ag_read_headers(struct xfs_scrub *sc, xfs_agnumber_t agno,
struct xchk_ag *sa);
void xchk_ag_btcur_free(struct xchk_ag *sa);
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 04a7a5944837d..9d9990d5c6c48 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -225,6 +225,8 @@ xchk_teardown(
xfs_trans_cancel(sc->tp);
sc->tp = NULL;
}
+ if (sc->sr.rtg)
+ xchk_rtgroup_free(sc, &sc->sr);
if (sc->ip) {
if (sc->ilock_flags)
xchk_iunlock(sc, sc->ilock_flags);
@@ -498,6 +500,33 @@ xchk_validate_inputs(
break;
case ST_GENERIC:
break;
+ case ST_RTGROUP:
+ if (sm->sm_ino || sm->sm_gen)
+ goto out;
+ if (xfs_has_rtgroups(mp)) {
+ /*
+ * On a rtgroups filesystem, there won't be an rtbitmap
+ * or rtsummary file for group 0 unless there's
+ * actually a realtime volume attached. However, older
+ * xfs_scrub always calls the rtbitmap/rtsummary
+ * scrubbers with sm_agno==0 so transform the error
+ * code to ENOENT.
+ */
+ if (sm->sm_agno >= mp->m_sb.sb_rgcount) {
+ if (sm->sm_agno == 0)
+ error = -ENOENT;
+ goto out;
+ }
+ } else {
+ /*
+ * Prior to rtgroups, the rtbitmap/rtsummary scrubbers
+ * accepted sm_agno==0, so we still accept that for
+ * scrubbing pre-rtgroups filesystems.
+ */
+ if (sm->sm_agno != 0)
+ goto out;
+ }
+ break;
default:
goto out;
}
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index c688ff4fc7fc4..f73c6d0d90a11 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -74,6 +74,7 @@ enum xchk_type {
ST_FS, /* per-FS metadata */
ST_INODE, /* per-inode metadata */
ST_GENERIC, /* determined by the scrubber */
+ ST_RTGROUP, /* rtgroup metadata */
};
struct xchk_meta_ops {
@@ -118,6 +119,15 @@ struct xchk_ag {
struct xfs_btree_cur *refc_cur;
};
+/* Inode lock state for the RT volume. */
+struct xchk_rt {
+ /* incore rtgroup, if applicable */
+ struct xfs_rtgroup *rtg;
+
+ /* XFS_RTGLOCK_* lock state if locked */
+ unsigned int rtlock_flags;
+};
+
struct xfs_scrub {
/* General scrub state. */
struct xfs_mount *mp;
@@ -179,6 +189,9 @@ struct xfs_scrub {
/* State tracking for single-AG operations. */
struct xchk_ag sa;
+
+ /* State tracking for realtime operations. */
+ struct xchk_rt sr;
};
/* XCHK state flags grow up from zero, XREP state flags grown down from 2^31 */
* [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (14 preceding siblings ...)
2024-08-23 0:18 ` [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management Darrick J. Wong
@ 2024-08-23 0:18 ` Darrick J. Wong
2024-08-26 1:58 ` Dave Chinner
2024-08-23 0:19 ` [PATCH 17/24] xfs: remove XFS_ILOCK_RT* Darrick J. Wong
` (7 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:18 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Move the pointers to the RT bitmap and summary inodes, as well as the
summary cache, to the rtgroup structure to prepare for having separate
bitmap and summary inodes for each rtgroup.
Code using the inodes now needs to operate on an rtgroup. Where easily
possible, such code is converted to iterate over all rtgroups; otherwise
rtgroup 0 (the only one that can currently exist) is hardcoded.
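One detail worth tracing in this diff is the summary cache update rule:
rsum_cache[bbno] holds one more than the largest log2 extent-size level
with a nonzero free-extent count for that bitmap block, and is nudged up
or down as summary counters change. A standalone sketch of that rule,
with illustrative function and variable names:

```c
#include <assert.h>
#include <stdint.h>

/* Apply the cache update rule after a summary counter at (bbno, log)
 * changes to val_after.  Mirrors the logic visible in the diff; this is
 * a userspace model, not the kernel code. */
static void rsum_cache_update(uint8_t *rsum_cache, unsigned int bbno,
			      unsigned int log, int val_after)
{
	if (val_after == 0 && log + 1 == rsum_cache[bbno])
		rsum_cache[bbno] = log;		/* highest level emptied */
	if (val_after != 0 && log >= rsum_cache[bbno])
		rsum_cache[bbno] = log + 1;	/* new highest nonzero level */
}

/* Exercise one grow/shrink cycle; returns 0 on success. */
static int rsum_cache_demo(void)
{
	uint8_t cache[1] = { 0 };

	rsum_cache_update(cache, 0, 3, 1);	/* level 3 gains an extent */
	if (cache[0] != 4)
		return -1;
	rsum_cache_update(cache, 0, 3, 0);	/* level 3 drained again */
	return cache[0] == 3 ? 0 : -1;
}
```

Keeping this cache per-rtgroup (rtg_rsum_cache) rather than per-mount is
what allows each group's allocator to skip empty summary levels
independently.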
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 40 +++-
fs/xfs/libxfs/xfs_rtbitmap.c | 174 ++++++++--------
fs/xfs/libxfs/xfs_rtbitmap.h | 68 +++---
fs/xfs/libxfs/xfs_rtgroup.c | 90 +++++++-
fs/xfs/libxfs/xfs_rtgroup.h | 14 +
fs/xfs/scrub/bmap.c | 13 +
fs/xfs/scrub/fscounters.c | 26 +-
fs/xfs/scrub/repair.c | 24 ++
fs/xfs/scrub/repair.h | 7 +
fs/xfs/scrub/rtbitmap.c | 45 ++--
fs/xfs/scrub/rtsummary.c | 93 +++++----
fs/xfs/scrub/rtsummary_repair.c | 7 -
fs/xfs/scrub/scrub.c | 4
fs/xfs/xfs_discard.c | 100 ++++++---
fs/xfs/xfs_fsmap.c | 143 ++++++++-----
fs/xfs/xfs_mount.h | 10 -
fs/xfs/xfs_qm.c | 27 ++-
fs/xfs/xfs_rtalloc.c | 415 ++++++++++++++++++++++-----------------
18 files changed, 763 insertions(+), 537 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3a8796f165d6d..a1ee8dc91d6ba 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5167,6 +5167,34 @@ xfs_bmap_del_extent_cow(
ip->i_delayed_blks -= del->br_blockcount;
}
+static int
+xfs_bmap_free_rtblocks(
+ struct xfs_trans *tp,
+ struct xfs_bmbt_irec *del)
+{
+ struct xfs_rtgroup *rtg;
+ int error;
+
+ rtg = xfs_rtgroup_grab(tp->t_mountp, 0);
+ if (!rtg)
+ return -EIO;
+
+ /*
+ * Ensure the bitmap and summary inodes are locked and joined to the
+ * transaction before modifying them.
+ */
+ if (!(tp->t_flags & XFS_TRANS_RTBITMAP_LOCKED)) {
+ tp->t_flags |= XFS_TRANS_RTBITMAP_LOCKED;
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP);
+ xfs_rtgroup_trans_join(tp, rtg, XFS_RTGLOCK_BITMAP);
+ }
+
+ error = xfs_rtfree_blocks(tp, rtg, del->br_startblock,
+ del->br_blockcount);
+ xfs_rtgroup_rele(rtg);
+ return error;
+}
+
/*
* Called by xfs_bmapi to update file extent records and the btree
* after removing space.
@@ -5382,17 +5410,7 @@ xfs_bmap_del_extent_real(
if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
xfs_refcount_decrease_extent(tp, del);
} else if (xfs_ifork_is_realtime(ip, whichfork)) {
- /*
- * Ensure the bitmap and summary inodes are locked
- * and joined to the transaction before modifying them.
- */
- if (!(tp->t_flags & XFS_TRANS_RTBITMAP_LOCKED)) {
- tp->t_flags |= XFS_TRANS_RTBITMAP_LOCKED;
- xfs_rtbitmap_lock(mp);
- xfs_rtbitmap_trans_join(tp);
- }
- error = xfs_rtfree_blocks(tp, del->br_startblock,
- del->br_blockcount);
+ error = xfs_bmap_free_rtblocks(tp, del);
} else {
unsigned int efi_flags = 0;
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 27a4472402bac..41de2f071934f 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -90,12 +90,12 @@ xfs_rtbuf_get(
if (issum) {
cbpp = &args->sumbp;
coffp = &args->sumoff;
- ip = mp->m_rsumip;
+ ip = args->rtg->rtg_inodes[XFS_RTGI_SUMMARY];
type = XFS_BLFT_RTSUMMARY_BUF;
} else {
cbpp = &args->rbmbp;
coffp = &args->rbmoff;
- ip = mp->m_rbmip;
+ ip = args->rtg->rtg_inodes[XFS_RTGI_BITMAP];
type = XFS_BLFT_RTBITMAP_BUF;
}
@@ -503,6 +503,7 @@ xfs_rtmodify_summary(
{
struct xfs_mount *mp = args->mp;
xfs_rtsumoff_t so = xfs_rtsumoffs(mp, log, bbno);
+ uint8_t *rsum_cache = args->rtg->rtg_rsum_cache;
unsigned int infoword;
xfs_suminfo_t val;
int error;
@@ -514,11 +515,11 @@ xfs_rtmodify_summary(
infoword = xfs_rtsumoffs_to_infoword(mp, so);
val = xfs_suminfo_add(args, infoword, delta);
- if (mp->m_rsum_cache) {
- if (val == 0 && log + 1 == mp->m_rsum_cache[bbno])
- mp->m_rsum_cache[bbno] = log;
- if (val != 0 && log >= mp->m_rsum_cache[bbno])
- mp->m_rsum_cache[bbno] = log + 1;
+ if (rsum_cache) {
+ if (val == 0 && log + 1 == rsum_cache[bbno])
+ rsum_cache[bbno] = log;
+ if (val != 0 && log >= rsum_cache[bbno])
+ rsum_cache[bbno] = log + 1;
}
xfs_trans_log_rtsummary(args, infoword);
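[Editor's note] The rsum_cache update above maintains an upper-bound invariant rather than an exact maximum: lowering by one when the top level empties keeps the bound valid without rescanning lower levels. A toy model of just that update logic (demo names, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the rtg_rsum_cache update above.  Invariant: cache[bbno] is
 * strictly greater than every log level with a nonzero summary count for
 * bitmap block bbno (0 when every level is zero) -- an upper bound, not
 * necessarily a tight one.
 */
#define DEMO_LEVELS	8
#define DEMO_BBLOCKS	4

static int	suminfo[DEMO_LEVELS][DEMO_BBLOCKS];
static uint8_t	rsum_cache[DEMO_BBLOCKS];

static void modify_summary(int log, int bbno, int delta)
{
	int val = (suminfo[log][bbno] += delta);

	if (val == 0 && log + 1 == rsum_cache[bbno])
		rsum_cache[bbno] = log;		/* top level just emptied */
	if (val != 0 && log >= rsum_cache[bbno])
		rsum_cache[bbno] = log + 1;	/* new highest nonzero level */
}
```

Note that emptying a level below the cached top leaves the cache untouched; the bound stays valid, just not tight, which is cheap to maintain under the bitmap inode lock.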
@@ -737,7 +738,7 @@ xfs_rtfree_range(
/*
* Find the next allocated block (end of allocated extent).
*/
- error = xfs_rtfind_forw(args, end, mp->m_sb.sb_rextents - 1,
+ error = xfs_rtfind_forw(args, end, args->rtg->rtg_extents - 1,
&postblock);
if (error)
return error;
@@ -961,19 +962,22 @@ xfs_rtcheck_alloc_range(
int
xfs_rtfree_extent(
struct xfs_trans *tp, /* transaction pointer */
+ struct xfs_rtgroup *rtg,
xfs_rtxnum_t start, /* starting rtext number to free */
xfs_rtxlen_t len) /* length of extent freed */
{
struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
struct xfs_rtalloc_args args = {
.mp = mp,
.tp = tp,
+ .rtg = rtg,
};
int error;
struct timespec64 atime;
- ASSERT(mp->m_rbmip->i_itemp != NULL);
- xfs_assert_ilocked(mp->m_rbmip, XFS_ILOCK_EXCL);
+ ASSERT(rbmip->i_itemp != NULL);
+ xfs_assert_ilocked(rbmip, XFS_ILOCK_EXCL);
error = xfs_rtcheck_alloc_range(&args, start, len);
if (error)
@@ -996,13 +1000,13 @@ xfs_rtfree_extent(
*/
if (tp->t_frextents_delta + mp->m_sb.sb_frextents ==
mp->m_sb.sb_rextents) {
- if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM))
- mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
+ if (!(rbmip->i_diflags & XFS_DIFLAG_NEWRTBM))
+ rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
- atime = inode_get_atime(VFS_I(mp->m_rbmip));
+ atime = inode_get_atime(VFS_I(rbmip));
atime.tv_sec = 0;
- inode_set_atime_to_ts(VFS_I(mp->m_rbmip), atime);
- xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
+ inode_set_atime_to_ts(VFS_I(rbmip), atime);
+ xfs_trans_log_inode(tp, rbmip, XFS_ILOG_CORE);
}
error = 0;
out:
@@ -1018,6 +1022,7 @@ xfs_rtfree_extent(
int
xfs_rtfree_blocks(
struct xfs_trans *tp,
+ struct xfs_rtgroup *rtg,
xfs_fsblock_t rtbno,
xfs_filblks_t rtlen)
{
@@ -1038,21 +1043,23 @@ xfs_rtfree_blocks(
return -EIO;
}
- return xfs_rtfree_extent(tp, xfs_rtb_to_rtx(mp, rtbno),
- xfs_rtb_to_rtx(mp, rtlen));
+ return xfs_rtfree_extent(tp, rtg, xfs_rtb_to_rtx(mp, rtbno),
+ xfs_extlen_to_rtxlen(mp, rtlen));
}
/* Find all the free records within a given range. */
int
xfs_rtalloc_query_range(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
xfs_rtxnum_t start,
xfs_rtxnum_t end,
xfs_rtalloc_query_range_fn fn,
void *priv)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_rtalloc_args args = {
+ .rtg = rtg,
.mp = mp,
.tp = tp,
};
@@ -1060,10 +1067,10 @@ xfs_rtalloc_query_range(
if (start > end)
return -EINVAL;
- if (start == end || start >= mp->m_sb.sb_rextents)
+ if (start == end || start >= rtg->rtg_extents)
return 0;
- end = min(end, mp->m_sb.sb_rextents - 1);
+ end = min(end, rtg->rtg_extents - 1);
/* Iterate the bitmap, looking for discrepancies. */
while (start <= end) {
@@ -1086,7 +1093,7 @@ xfs_rtalloc_query_range(
rec.ar_startext = start;
rec.ar_extcount = rtend - start + 1;
- error = fn(mp, tp, &rec, priv);
+ error = fn(rtg, tp, &rec, priv);
if (error)
break;
}
@@ -1101,26 +1108,27 @@ xfs_rtalloc_query_range(
/* Find all the free records. */
int
xfs_rtalloc_query_all(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
xfs_rtalloc_query_range_fn fn,
void *priv)
{
- return xfs_rtalloc_query_range(mp, tp, 0, mp->m_sb.sb_rextents - 1, fn,
+ return xfs_rtalloc_query_range(rtg, tp, 0, rtg->rtg_extents - 1, fn,
priv);
}
/* Is the given extent all free? */
int
xfs_rtalloc_extent_is_free(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
xfs_rtxnum_t start,
xfs_rtxlen_t len,
bool *is_free)
{
struct xfs_rtalloc_args args = {
- .mp = mp,
+ .mp = rtg->rtg_mount,
+ .rtg = rtg,
.tp = tp,
};
xfs_rtxnum_t end;
@@ -1161,65 +1169,6 @@ xfs_rtsummary_blockcount(
return XFS_B_TO_FSB(mp, rsumwords << XFS_WORDLOG);
}
-/* Lock both realtime free space metadata inodes for a freespace update. */
-void
-xfs_rtbitmap_lock(
- struct xfs_mount *mp)
-{
- xfs_ilock(mp->m_rbmip, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
- xfs_ilock(mp->m_rsumip, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-}
-
-/*
- * Join both realtime free space metadata inodes to the transaction. The
- * ILOCKs will be released on transaction commit.
- */
-void
-xfs_rtbitmap_trans_join(
- struct xfs_trans *tp)
-{
- xfs_trans_ijoin(tp, tp->t_mountp->m_rbmip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, tp->t_mountp->m_rsumip, XFS_ILOCK_EXCL);
-}
-
-/* Unlock both realtime free space metadata inodes after a freespace update. */
-void
-xfs_rtbitmap_unlock(
- struct xfs_mount *mp)
-{
- xfs_iunlock(mp->m_rsumip, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
- xfs_iunlock(mp->m_rbmip, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
-}
-
-/*
- * Lock the realtime free space metadata inodes for a freespace scan. Callers
- * must walk metadata blocks in order of increasing file offset.
- */
-void
-xfs_rtbitmap_lock_shared(
- struct xfs_mount *mp,
- unsigned int rbmlock_flags)
-{
- if (rbmlock_flags & XFS_RBMLOCK_BITMAP)
- xfs_ilock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
-
- if (rbmlock_flags & XFS_RBMLOCK_SUMMARY)
- xfs_ilock(mp->m_rsumip, XFS_ILOCK_SHARED | XFS_ILOCK_RTSUM);
-}
-
-/* Unlock the realtime free space metadata inodes after a freespace scan. */
-void
-xfs_rtbitmap_unlock_shared(
- struct xfs_mount *mp,
- unsigned int rbmlock_flags)
-{
- if (rbmlock_flags & XFS_RBMLOCK_SUMMARY)
- xfs_iunlock(mp->m_rsumip, XFS_ILOCK_SHARED | XFS_ILOCK_RTSUM);
-
- if (rbmlock_flags & XFS_RBMLOCK_BITMAP)
- xfs_iunlock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
-}
-
static int
xfs_rtfile_alloc_blocks(
struct xfs_inode *ip,
@@ -1260,21 +1209,25 @@ xfs_rtfile_alloc_blocks(
/* Get a buffer for the block. */
static int
xfs_rtfile_initialize_block(
- struct xfs_inode *ip,
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type,
xfs_fsblock_t fsbno,
void *data)
{
- struct xfs_mount *mp = ip->i_mount;
+ struct xfs_mount *mp = rtg->rtg_mount;
+ struct xfs_inode *ip = rtg->rtg_inodes[type];
struct xfs_trans *tp;
struct xfs_buf *bp;
const size_t copylen = mp->m_blockwsize << XFS_WORDLOG;
enum xfs_blft buf_type;
int error;
- if (ip == mp->m_rsumip)
- buf_type = XFS_BLFT_RTSUMMARY_BUF;
- else
+ if (type == XFS_RTGI_BITMAP)
buf_type = XFS_BLFT_RTBITMAP_BUF;
+ else if (type == XFS_RTGI_SUMMARY)
+ buf_type = XFS_BLFT_RTSUMMARY_BUF;
+ else
+ return -EINVAL;
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growrtzero, 0, 0, 0, &tp);
if (error)
@@ -1306,12 +1259,13 @@ xfs_rtfile_initialize_block(
*/
int
xfs_rtfile_initialize_blocks(
- struct xfs_inode *ip, /* inode (bitmap/summary) */
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type,
xfs_fileoff_t offset_fsb, /* offset to start from */
xfs_fileoff_t end_fsb, /* offset to allocate to */
void *data) /* data to fill the blocks */
{
- struct xfs_mount *mp = ip->i_mount;
+ struct xfs_mount *mp = rtg->rtg_mount;
const size_t copylen = mp->m_blockwsize << XFS_WORDLOG;
while (offset_fsb < end_fsb) {
@@ -1319,8 +1273,8 @@ xfs_rtfile_initialize_blocks(
xfs_filblks_t i;
int error;
- error = xfs_rtfile_alloc_blocks(ip, offset_fsb,
- end_fsb - offset_fsb, &map);
+ error = xfs_rtfile_alloc_blocks(rtg->rtg_inodes[type],
+ offset_fsb, end_fsb - offset_fsb, &map);
if (error)
return error;
@@ -1330,7 +1284,7 @@ xfs_rtfile_initialize_blocks(
* Do this one block per transaction, to keep it simple.
*/
for (i = 0; i < map.br_blockcount; i++) {
- error = xfs_rtfile_initialize_block(ip,
+ error = xfs_rtfile_initialize_block(rtg, type,
map.br_startblock + i, data);
if (error)
return error;
@@ -1343,3 +1297,35 @@ xfs_rtfile_initialize_blocks(
return 0;
}
+
+int
+xfs_rtbitmap_create(
+ struct xfs_rtgroup *rtg,
+ struct xfs_inode *ip,
+ struct xfs_trans *tp,
+ bool init)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+
+ ip->i_disk_size = mp->m_sb.sb_rbmblocks * mp->m_sb.sb_blocksize;
+ if (init && !xfs_has_rtgroups(mp)) {
+ ip->i_diflags |= XFS_DIFLAG_NEWRTBM;
+ inode_set_atime(VFS_I(ip), 0, 0);
+ }
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ return 0;
+}
+
+int
+xfs_rtsummary_create(
+ struct xfs_rtgroup *rtg,
+ struct xfs_inode *ip,
+ struct xfs_trans *tp,
+ bool init)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+
+ ip->i_disk_size = mp->m_rsumblocks * mp->m_sb.sb_blocksize;
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 140513d1d6bcf..e4994a3e461d3 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -6,7 +6,10 @@
#ifndef __XFS_RTBITMAP_H__
#define __XFS_RTBITMAP_H__
+#include "xfs_rtgroup.h"
+
struct xfs_rtalloc_args {
+ struct xfs_rtgroup *rtg;
struct xfs_mount *mp;
struct xfs_trans *tp;
@@ -268,7 +271,7 @@ struct xfs_rtalloc_rec {
};
typedef int (*xfs_rtalloc_query_range_fn)(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv);
@@ -291,53 +294,41 @@ int xfs_rtmodify_summary(struct xfs_rtalloc_args *args, int log,
xfs_fileoff_t bbno, int delta);
int xfs_rtfree_range(struct xfs_rtalloc_args *args, xfs_rtxnum_t start,
xfs_rtxlen_t len);
-int xfs_rtalloc_query_range(struct xfs_mount *mp, struct xfs_trans *tp,
+int xfs_rtalloc_query_range(struct xfs_rtgroup *rtg, struct xfs_trans *tp,
xfs_rtxnum_t start, xfs_rtxnum_t end,
xfs_rtalloc_query_range_fn fn, void *priv);
-int xfs_rtalloc_query_all(struct xfs_mount *mp, struct xfs_trans *tp,
- xfs_rtalloc_query_range_fn fn,
- void *priv);
-int xfs_rtalloc_extent_is_free(struct xfs_mount *mp, struct xfs_trans *tp,
- xfs_rtxnum_t start, xfs_rtxlen_t len,
- bool *is_free);
-/*
- * Free an extent in the realtime subvolume. Length is expressed in
- * realtime extents, as is the block number.
- */
-int /* error */
-xfs_rtfree_extent(
- struct xfs_trans *tp, /* transaction pointer */
- xfs_rtxnum_t start, /* starting rtext number to free */
- xfs_rtxlen_t len); /* length of extent freed */
-
+int xfs_rtalloc_query_all(struct xfs_rtgroup *rtg, struct xfs_trans *tp,
+ xfs_rtalloc_query_range_fn fn, void *priv);
+int xfs_rtalloc_extent_is_free(struct xfs_rtgroup *rtg, struct xfs_trans *tp,
+ xfs_rtxnum_t start, xfs_rtxlen_t len, bool *is_free);
+int xfs_rtfree_extent(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
+ xfs_rtxnum_t start, xfs_rtxlen_t len);
/* Same as above, but in units of rt blocks. */
-int xfs_rtfree_blocks(struct xfs_trans *tp, xfs_fsblock_t rtbno,
- xfs_filblks_t rtlen);
+int xfs_rtfree_blocks(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
+ xfs_fsblock_t rtbno, xfs_filblks_t rtlen);
xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t
rtextents);
xfs_filblks_t xfs_rtsummary_blockcount(struct xfs_mount *mp,
unsigned int rsumlevels, xfs_extlen_t rbmblocks);
-int xfs_rtfile_initialize_blocks(struct xfs_inode *ip,
- xfs_fileoff_t offset_fsb, xfs_fileoff_t end_fsb, void *data);
+int xfs_rtfile_initialize_blocks(struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type, xfs_fileoff_t offset_fsb,
+ xfs_fileoff_t end_fsb, void *data);
+int xfs_rtbitmap_create(struct xfs_rtgroup *rtg, struct xfs_inode *ip,
+ struct xfs_trans *tp, bool init);
+int xfs_rtsummary_create(struct xfs_rtgroup *rtg, struct xfs_inode *ip,
+ struct xfs_trans *tp, bool init);
-void xfs_rtbitmap_lock(struct xfs_mount *mp);
-void xfs_rtbitmap_unlock(struct xfs_mount *mp);
-void xfs_rtbitmap_trans_join(struct xfs_trans *tp);
-
-/* Lock the rt bitmap inode in shared mode */
-#define XFS_RBMLOCK_BITMAP (1U << 0)
-/* Lock the rt summary inode in shared mode */
-#define XFS_RBMLOCK_SUMMARY (1U << 1)
-
-void xfs_rtbitmap_lock_shared(struct xfs_mount *mp,
- unsigned int rbmlock_flags);
-void xfs_rtbitmap_unlock_shared(struct xfs_mount *mp,
- unsigned int rbmlock_flags);
#else /* CONFIG_XFS_RT */
# define xfs_rtfree_extent(t,g,s,l) (-ENOSYS)
-# define xfs_rtfree_blocks(t,rb,rl) (-ENOSYS)
+
+static inline int xfs_rtfree_blocks(struct xfs_trans *tp,
+ struct xfs_rtgroup *rtg, xfs_fsblock_t rtbno,
+ xfs_filblks_t rtlen)
+{
+ return -ENOSYS;
+}
# define xfs_rtalloc_query_range(m,t,l,h,f,p) (-ENOSYS)
# define xfs_rtalloc_query_all(m,t,f,p) (-ENOSYS)
# define xfs_rtbitmap_read_buf(a,b) (-ENOSYS)
@@ -351,11 +342,6 @@ xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
return 0;
}
# define xfs_rtsummary_blockcount(mp, l, b) (0)
-# define xfs_rtbitmap_lock(mp) do { } while (0)
-# define xfs_rtbitmap_trans_join(tp) do { } while (0)
-# define xfs_rtbitmap_unlock(mp) do { } while (0)
-# define xfs_rtbitmap_lock_shared(mp, lf) do { } while (0)
-# define xfs_rtbitmap_unlock_shared(mp, lf) do { } while (0)
#endif /* CONFIG_XFS_RT */
#endif /* __XFS_RTBITMAP_H__ */
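[Editor's note] The hunk above converts the CONFIG_XFS_RT=n stub for xfs_rtfree_blocks() from a function-like macro into a static inline. The motivation is type checking: a macro stub never evaluates or checks its arguments when the feature is compiled out, so callers can bit-rot silently. A minimal illustration (CONFIG_DEMO_RT and the demo_* names are invented):

```c
#include <assert.h>
#include <errno.h>

/*
 * Sketch of macro stub vs. static inline stub.  With the feature compiled
 * out, "#define demo_rtfree_blocks(t, g, b, l) (-ENOSYS)" would accept any
 * argument count or type; the inline below keeps every call fully checked.
 */
#undef CONFIG_DEMO_RT

#ifdef CONFIG_DEMO_RT
int demo_rtfree_blocks(void *tp, void *rtg, long rtbno, long rtlen);
#else
static inline int demo_rtfree_blocks(void *tp, void *rtg, long rtbno,
		long rtlen)
{
	return -ENOSYS;
}
#endif
```

The same argument applies to the remaining macro stubs in this header, which upstream has been converting piecemeal.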
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 50e4a56d749f0..4618caf344efd 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -207,10 +207,16 @@ xfs_rtgroup_lock(
ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) ||
!(rtglock_flags & XFS_RTGLOCK_BITMAP));
- if (rtglock_flags & XFS_RTGLOCK_BITMAP)
- xfs_rtbitmap_lock(rtg->rtg_mount);
- else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED)
- xfs_rtbitmap_lock_shared(rtg->rtg_mount, XFS_RBMLOCK_BITMAP);
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP) {
+ /*
+ * Lock both realtime free space metadata inodes for a freespace
+ * update.
+ */
+ xfs_ilock(rtg->rtg_inodes[XFS_RTGI_BITMAP], XFS_ILOCK_EXCL);
+ xfs_ilock(rtg->rtg_inodes[XFS_RTGI_SUMMARY], XFS_ILOCK_EXCL);
+ } else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) {
+ xfs_ilock(rtg->rtg_inodes[XFS_RTGI_BITMAP], XFS_ILOCK_SHARED);
+ }
}
/* Unlock metadata inodes associated with this rt group. */
@@ -223,10 +229,12 @@ xfs_rtgroup_unlock(
ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) ||
!(rtglock_flags & XFS_RTGLOCK_BITMAP));
- if (rtglock_flags & XFS_RTGLOCK_BITMAP)
- xfs_rtbitmap_unlock(rtg->rtg_mount);
- else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED)
- xfs_rtbitmap_unlock_shared(rtg->rtg_mount, XFS_RBMLOCK_BITMAP);
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP) {
+ xfs_iunlock(rtg->rtg_inodes[XFS_RTGI_SUMMARY], XFS_ILOCK_EXCL);
+ xfs_iunlock(rtg->rtg_inodes[XFS_RTGI_BITMAP], XFS_ILOCK_EXCL);
+ } else if (rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED) {
+ xfs_iunlock(rtg->rtg_inodes[XFS_RTGI_BITMAP], XFS_ILOCK_SHARED);
+ }
}
/*
@@ -242,8 +250,12 @@ xfs_rtgroup_trans_join(
ASSERT(!(rtglock_flags & ~XFS_RTGLOCK_ALL_FLAGS));
ASSERT(!(rtglock_flags & XFS_RTGLOCK_BITMAP_SHARED));
- if (rtglock_flags & XFS_RTGLOCK_BITMAP)
- xfs_rtbitmap_trans_join(tp);
+ if (rtglock_flags & XFS_RTGLOCK_BITMAP) {
+ xfs_trans_ijoin(tp, rtg->rtg_inodes[XFS_RTGI_BITMAP],
+ XFS_ILOCK_EXCL);
+ xfs_trans_ijoin(tp, rtg->rtg_inodes[XFS_RTGI_SUMMARY],
+ XFS_ILOCK_EXCL);
+ }
}
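[Editor's note] The open-coded lock/unlock hunks above preserve the old helpers' ordering rule: bitmap before summary on acquisition, the reverse on release. A toy model of that invariant (demo names only):

```c
#include <assert.h>

/*
 * Toy model of the rtgroup lock ordering above: bitmap is always taken
 * before summary, and locks are dropped in reverse acquisition order.
 */
enum { DEMO_BITMAP, DEMO_SUMMARY };

static int	held[2];
static int	seq;		/* global acquisition counter */
static int	taken_at[2];	/* sequence number at lock time */

static void demo_ilock(int which)
{
	held[which] = 1;
	taken_at[which] = ++seq;
}

static void demo_iunlock(int which)
{
	held[which] = 0;
}

static void demo_rtgroup_lock(void)
{
	demo_ilock(DEMO_BITMAP);	/* bitmap always first */
	demo_ilock(DEMO_SUMMARY);
}

static void demo_rtgroup_unlock(void)
{
	demo_iunlock(DEMO_SUMMARY);	/* reverse acquisition order */
	demo_iunlock(DEMO_BITMAP);
}
```

Keeping one canonical order across all lockers is what makes the later per-rtgroup locking safe against ABBA deadlocks.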
#ifdef CONFIG_PROVE_LOCKING
@@ -314,6 +326,16 @@ struct xfs_rtginode_ops {
};
static const struct xfs_rtginode_ops xfs_rtginode_ops[XFS_RTGI_MAX] = {
+ [XFS_RTGI_BITMAP] = {
+ .name = "bitmap",
+ .metafile_type = XFS_METAFILE_RTBITMAP,
+ .create = xfs_rtbitmap_create,
+ },
+ [XFS_RTGI_SUMMARY] = {
+ .name = "summary",
+ .metafile_type = XFS_METAFILE_RTSUMMARY,
+ .create = xfs_rtsummary_create,
+ },
};
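[Editor's note] The ops table above uses designated initializers keyed by the inode-type enum, so adding a new per-rtgroup metadata file is a one-entry change. A minimal version of the same table shape (demo_* names invented; only the structure mirrors the patch):

```c
#include <assert.h>
#include <string.h>

/* Minimal per-type dispatch table, indexed by an inode-type enum. */
enum demo_rtg_inodes { DEMO_RTGI_BITMAP, DEMO_RTGI_SUMMARY, DEMO_RTGI_MAX };

struct demo_rtginode_ops {
	const char	*name;
	int		(*create)(void);
};

static int demo_bitmap_create(void)  { return 1; }
static int demo_summary_create(void) { return 2; }

static const struct demo_rtginode_ops demo_ops[DEMO_RTGI_MAX] = {
	[DEMO_RTGI_BITMAP]  = { .name = "bitmap",
				.create = demo_bitmap_create },
	[DEMO_RTGI_SUMMARY] = { .name = "summary",
				.create = demo_summary_create },
};

static const char *demo_rtginode_name(enum demo_rtg_inodes type)
{
	return demo_ops[type].name;
}
```

Designated initializers keep the array robust against enum reordering, which matters as later patches in the series grow XFS_RTGI_MAX.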
/* Return the shortname of this rtgroup inode. */
@@ -324,6 +346,14 @@ xfs_rtginode_name(
return xfs_rtginode_ops[type].name;
}
+/* Return the metafile type of this rtgroup inode. */
+enum xfs_metafile_type
+xfs_rtginode_metafile_type(
+ enum xfs_rtg_inodes type)
+{
+ return xfs_rtginode_ops[type].metafile_type;
+}
+
/* Should this rtgroup inode be present? */
bool
xfs_rtginode_enabled(
@@ -345,7 +375,6 @@ xfs_rtginode_load(
struct xfs_trans *tp)
{
struct xfs_mount *mp = tp->t_mountp;
- const char *path;
struct xfs_inode *ip;
const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
int error;
@@ -353,15 +382,36 @@ xfs_rtginode_load(
if (!xfs_rtginode_enabled(rtg, type))
return 0;
- if (!mp->m_rtdirip)
- return -EFSCORRUPTED;
-
- path = xfs_rtginode_path(rtg->rtg_rgno, type);
- if (!path)
- return -ENOMEM;
- error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
- &ip);
- kfree(path);
+ if (!xfs_has_rtgroups(mp)) {
+ xfs_ino_t ino;
+
+ switch (type) {
+ case XFS_RTGI_BITMAP:
+ ino = mp->m_sb.sb_rbmino;
+ break;
+ case XFS_RTGI_SUMMARY:
+ ino = mp->m_sb.sb_rsumino;
+ break;
+ default:
+ /* None of the other types exist on !rtgroups */
+ return 0;
+ }
+
+ error = xfs_trans_metafile_iget(tp, ino, ops->metafile_type,
+ &ip);
+ } else {
+ const char *path;
+
+ if (!mp->m_rtdirip)
+ return -EFSCORRUPTED;
+
+ path = xfs_rtginode_path(rtg->rtg_rgno, type);
+ if (!path)
+ return -ENOMEM;
+ error = xfs_metadir_load(tp, mp->m_rtdirip, path,
+ ops->metafile_type, &ip);
+ kfree(path);
+ }
if (error)
return error;
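[Editor's note] The xfs_rtginode_load() change above adds a compatibility fork: pre-rtgroups filesystems still find the bitmap/summary inodes via superblock fields, while rtgroups filesystems resolve them by path in the metadata directory. A sketch of the branching (all names invented for illustration):

```c
#include <assert.h>

/*
 * Sketch of the load fallback above: without the rtgroups feature the
 * inode numbers come from superblock fields; with it, the inode is found
 * by a metadir path lookup.
 */
struct demo_sb {
	int	has_rtgroups;
	long	rbmino;
	long	rsumino;
};

enum demo_type { DEMO_BITMAP_T, DEMO_SUMMARY_T };

static long demo_load_rtginode(const struct demo_sb *sb, enum demo_type t)
{
	if (!sb->has_rtgroups)
		return t == DEMO_BITMAP_T ? sb->rbmino : sb->rsumino;
	/* stands in for a metadir path lookup like "rtgroups/0.bitmap" */
	return 1000 + t;
}
```

Folding both cases into one loader lets the rest of the series operate on rtg_inodes[] uniformly, regardless of on-disk format.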
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index b5c769211b4bb..e622b24a0d75f 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -10,6 +10,9 @@ struct xfs_mount;
struct xfs_trans;
enum xfs_rtg_inodes {
+ XFS_RTGI_BITMAP, /* allocation bitmap */
+ XFS_RTGI_SUMMARY, /* allocation summary */
+
XFS_RTGI_MAX,
};
@@ -28,11 +31,19 @@ struct xfs_rtgroup {
wait_queue_head_t rtg_active_wq;/* woken active_ref falls to zero */
/* per-rtgroup metadata inodes */
- struct xfs_inode *rtg_inodes[1 /* hack */];
+ struct xfs_inode *rtg_inodes[XFS_RTGI_MAX];
/* Number of rt extents in this group */
xfs_rtxnum_t rtg_extents;
+ /*
+ * Optional cache of rt summary level per bitmap block with the
+ * invariant that rtg_rsum_cache[bbno] > the maximum i for which
+ * rsum[i][bbno] != 0, or 0 if rsum[i][bbno] == 0 for all i.
+ * Reads and writes are serialized by the rsumip inode lock.
+ */
+ uint8_t *rtg_rsum_cache;
+
#ifdef __KERNEL__
/* -- kernel only structures below this line -- */
spinlock_t rtg_state_lock;
@@ -234,6 +245,7 @@ int xfs_rtginode_mkdir_parent(struct xfs_mount *mp);
int xfs_rtginode_load_parent(struct xfs_trans *tp);
const char *xfs_rtginode_name(enum xfs_rtg_inodes type);
+enum xfs_metafile_type xfs_rtginode_metafile_type(enum xfs_rtg_inodes type);
bool xfs_rtginode_enabled(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type);
int xfs_rtginode_load(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type,
struct xfs_trans *tp);
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 5ab2ac53c9200..69dac1bd6a83e 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -19,6 +19,7 @@
#include "xfs_bmap_btree.h"
#include "xfs_rmap.h"
#include "xfs_rmap_btree.h"
+#include "xfs_rtgroup.h"
#include "xfs_health.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
@@ -314,8 +315,20 @@ xchk_bmap_rt_iextent_xref(
struct xchk_bmap_info *info,
struct xfs_bmbt_irec *irec)
{
+ int error;
+
+ error = xchk_rtgroup_init_existing(info->sc,
+ xfs_rtb_to_rgno(ip->i_mount, irec->br_startblock),
+ &info->sc->sr);
+ if (!xchk_fblock_process_error(info->sc, info->whichfork,
+ irec->br_startoff, &error))
+ return;
+
+ xchk_rtgroup_lock(&info->sc->sr, XCHK_RTGLOCK_ALL);
xchk_xref_is_used_rt_space(info->sc, irec->br_startblock,
irec->br_blockcount);
+
+ xchk_rtgroup_free(info->sc, &info->sc->sr);
}
/* Cross-reference a single datadev extent record. */
diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c
index 1d3e98346933e..5f6449fc85dc0 100644
--- a/fs/xfs/scrub/fscounters.c
+++ b/fs/xfs/scrub/fscounters.c
@@ -19,6 +19,7 @@
#include "xfs_rtbitmap.h"
#include "xfs_inode.h"
#include "xfs_icache.h"
+#include "xfs_rtgroup.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
@@ -388,7 +389,7 @@ xchk_fscount_aggregate_agcounts(
#ifdef CONFIG_XFS_RT
STATIC int
xchk_fscount_add_frextent(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
@@ -409,6 +410,8 @@ xchk_fscount_count_frextents(
struct xchk_fscounters *fsc)
{
struct xfs_mount *mp = sc->mp;
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
int error;
fsc->frextents = 0;
@@ -416,19 +419,20 @@ xchk_fscount_count_frextents(
if (!xfs_has_realtime(mp))
return 0;
- xfs_rtbitmap_lock_shared(sc->mp, XFS_RBMLOCK_BITMAP);
- error = xfs_rtalloc_query_all(sc->mp, sc->tp,
- xchk_fscount_add_frextent, fsc);
- if (error) {
- xchk_set_incomplete(sc);
- goto out_unlock;
+ for_each_rtgroup(mp, rgno, rtg) {
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ error = xfs_rtalloc_query_all(rtg, sc->tp,
+ xchk_fscount_add_frextent, fsc);
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ if (error) {
+ xchk_set_incomplete(sc);
+ xfs_rtgroup_rele(rtg);
+ return error;
+ }
}
fsc->frextents_delayed = percpu_counter_sum(&mp->m_delalloc_rtextents);
-
-out_unlock:
- xfs_rtbitmap_unlock_shared(sc->mp, XFS_RBMLOCK_BITMAP);
- return error;
+ return 0;
}
#else
STATIC int
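[Editor's note] The fscounters hunk above replaces one global bitmap walk with a for_each_rtgroup() loop, summing free extents one group at a time under each group's own lock. A toy version of the aggregation (demo arrays replace the real bitmap queries):

```c
#include <assert.h>

/*
 * Toy per-group aggregation: each group's free-extent count is gathered
 * independently (in the kernel, under that group's bitmap lock) and summed.
 */
#define DEMO_NR_GROUPS	3

static long demo_frextents[DEMO_NR_GROUPS] = { 5, 0, 7 };

static long demo_count_frextents(void)
{
	long total = 0;
	int rgno;

	for (rgno = 0; rgno < DEMO_NR_GROUPS; rgno++) {
		/* lock group, query all free extents, unlock */
		total += demo_frextents[rgno];
	}
	return total;
}
```

Note the error path in the actual patch also drops the group reference (xfs_rtgroup_rele) before returning, since for_each_rtgroup holds one per iteration.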
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 01c0e863775d4..cb01a9bdfd6db 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -21,6 +21,7 @@
#include "xfs_rmap.h"
#include "xfs_rmap_btree.h"
#include "xfs_refcount_btree.h"
+#include "xfs_rtbitmap.h"
#include "xfs_extent_busy.h"
#include "xfs_ag.h"
#include "xfs_ag_resv.h"
@@ -953,6 +954,29 @@ xrep_ag_init(
return 0;
}
+#ifdef CONFIG_XFS_RT
+/*
+ * Given a reference to a rtgroup structure, lock rtgroup btree inodes and
+ * create btree cursors. Must only be called to repair a regular rt file.
+ */
+int
+xrep_rtgroup_init(
+ struct xfs_scrub *sc,
+ struct xfs_rtgroup *rtg,
+ struct xchk_rt *sr,
+ unsigned int rtglock_flags)
+{
+ ASSERT(sr->rtg == NULL);
+
+ xfs_rtgroup_lock(rtg, rtglock_flags);
+ sr->rtlock_flags = rtglock_flags;
+
+ /* Grab our own passive reference from the caller's ref. */
+ sr->rtg = xfs_rtgroup_hold(rtg);
+ return 0;
+}
+#endif /* CONFIG_XFS_RT */
+
/* Reinitialize the per-AG block reservation for the AG we just fixed. */
int
xrep_reset_perag_resv(
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 90f9cb3b5ad8b..4052185743910 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -8,6 +8,7 @@
#include "xfs_quota_defs.h"
+struct xfs_rtgroup;
struct xchk_stats_run;
static inline int xrep_notsupported(struct xfs_scrub *sc)
@@ -106,6 +107,12 @@ int xrep_setup_inode(struct xfs_scrub *sc, const struct xfs_imap *imap);
void xrep_ag_btcur_init(struct xfs_scrub *sc, struct xchk_ag *sa);
int xrep_ag_init(struct xfs_scrub *sc, struct xfs_perag *pag,
struct xchk_ag *sa);
+#ifdef CONFIG_XFS_RT
+int xrep_rtgroup_init(struct xfs_scrub *sc, struct xfs_rtgroup *rtg,
+ struct xchk_rt *sr, unsigned int rtglock_flags);
+#else
+# define xrep_rtgroup_init(sc, rtg, sr, lockflags) (-ENOSYS)
+#endif /* CONFIG_XFS_RT */
/* Metadata revalidators */
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 46583517377ff..6551b4374b89f 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -35,6 +35,10 @@ xchk_setup_rtbitmap(
return -ENOMEM;
sc->buf = rtb;
+ error = xchk_rtgroup_init(sc, sc->sm->sm_agno, &sc->sr);
+ if (error)
+ return error;
+
if (xchk_could_repair(sc)) {
error = xrep_setup_rtbitmap(sc, rtb);
if (error)
@@ -45,7 +49,8 @@ xchk_setup_rtbitmap(
if (error)
return error;
- error = xchk_install_live_inode(sc, sc->mp->m_rbmip);
+ error = xchk_install_live_inode(sc,
+ sc->sr.rtg->rtg_inodes[XFS_RTGI_BITMAP]);
if (error)
return error;
@@ -53,18 +58,18 @@ xchk_setup_rtbitmap(
if (error)
return error;
- xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTBITMAP);
-
/*
* Now that we've locked the rtbitmap, we can't race with growfsrt
* trying to expand the bitmap or change the size of the rt volume.
* Hence it is safe to compute and check the geometry values.
*/
+ xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP);
if (mp->m_sb.sb_rblocks) {
rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
rtb->rextslog = xfs_compute_rextslog(rtb->rextents);
rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents);
}
+
return 0;
}
@@ -73,11 +78,12 @@ xchk_setup_rtbitmap(
/* Scrub a free extent record from the realtime bitmap. */
STATIC int
xchk_rtbitmap_rec(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_scrub *sc = priv;
xfs_rtblock_t startblock;
xfs_filblks_t blockcount;
@@ -140,18 +146,20 @@ xchk_rtbitmap(
struct xfs_scrub *sc)
{
struct xfs_mount *mp = sc->mp;
+ struct xfs_rtgroup *rtg = sc->sr.rtg;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
struct xchk_rtbitmap *rtb = sc->buf;
int error;
/* Is sb_rextents correct? */
if (mp->m_sb.sb_rextents != rtb->rextents) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
/* Is sb_rextslog correct? */
if (mp->m_sb.sb_rextslog != rtb->rextslog) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
@@ -160,17 +168,17 @@ xchk_rtbitmap(
* case can we exceed 4bn bitmap blocks since the super field is a u32.
*/
if (rtb->rbmblocks > U32_MAX) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
/* The bitmap file length must be aligned to an fsblock. */
- if (mp->m_rbmip->i_disk_size & mp->m_blockmask) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ if (rbmip->i_disk_size & mp->m_blockmask) {
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
@@ -179,8 +187,8 @@ xchk_rtbitmap(
* growfsrt expands the bitmap file before updating sb_rextents, so the
* file can be larger than sb_rbmblocks.
*/
- if (mp->m_rbmip->i_disk_size < XFS_FSB_TO_B(mp, rtb->rbmblocks)) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
+ if (rbmip->i_disk_size < XFS_FSB_TO_B(mp, rtb->rbmblocks)) {
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
return 0;
}
@@ -193,7 +201,7 @@ xchk_rtbitmap(
if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
return error;
- error = xfs_rtalloc_query_all(mp, sc->tp, xchk_rtbitmap_rec, sc);
+ error = xfs_rtalloc_query_all(rtg, sc->tp, xchk_rtbitmap_rec, sc);
if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, 0, &error))
return error;
@@ -207,6 +215,8 @@ xchk_xref_is_used_rt_space(
xfs_rtblock_t rtbno,
xfs_extlen_t len)
{
+ struct xfs_rtgroup *rtg = sc->sr.rtg;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
xfs_rtxnum_t startext;
xfs_rtxnum_t endext;
bool is_free;
@@ -217,13 +227,10 @@ xchk_xref_is_used_rt_space(
startext = xfs_rtb_to_rtx(sc->mp, rtbno);
endext = xfs_rtb_to_rtx(sc->mp, rtbno + len - 1);
- xfs_ilock(sc->mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
- error = xfs_rtalloc_extent_is_free(sc->mp, sc->tp, startext,
+ error = xfs_rtalloc_extent_is_free(rtg, sc->tp, startext,
endext - startext + 1, &is_free);
if (!xchk_should_check_xref(sc, &error, NULL))
- goto out_unlock;
+ return;
if (is_free)
- xchk_ino_xref_set_corrupt(sc, sc->mp->m_rbmip->i_ino);
-out_unlock:
- xfs_iunlock(sc->mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
+ xchk_ino_xref_set_corrupt(sc, rbmip->i_ino);
}
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 7c7366c98338b..43d509422053c 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -18,6 +18,7 @@
#include "xfs_bmap.h"
#include "xfs_sb.h"
#include "xfs_exchmaps.h"
+#include "xfs_rtgroup.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
@@ -46,12 +47,19 @@ xchk_setup_rtsummary(
struct xchk_rtsummary *rts;
int error;
+ if (xchk_need_intent_drain(sc))
+ xchk_fsgates_enable(sc, XCHK_FSGATES_DRAIN);
+
rts = kvzalloc(struct_size(rts, words, mp->m_blockwsize),
XCHK_GFP_FLAGS);
if (!rts)
return -ENOMEM;
sc->buf = rts;
+ error = xchk_rtgroup_init(sc, sc->sm->sm_agno, &sc->sr);
+ if (error)
+ return error;
+
if (xchk_could_repair(sc)) {
error = xrep_setup_rtsummary(sc, rts);
if (error)
@@ -73,7 +81,8 @@ xchk_setup_rtsummary(
if (error)
return error;
- error = xchk_install_live_inode(sc, mp->m_rsumip);
+ error = xchk_install_live_inode(sc,
+ sc->sr.rtg->rtg_inodes[XFS_RTGI_SUMMARY]);
if (error)
return error;
@@ -81,20 +90,17 @@ xchk_setup_rtsummary(
if (error)
return error;
- /*
- * Locking order requires us to take the rtbitmap first. We must be
- * careful to unlock it ourselves when we are done with the rtbitmap
- * file since the scrub infrastructure won't do that for us. Only
- * then we can lock the rtsummary inode.
- */
- xfs_ilock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
- xchk_ilock(sc, XFS_ILOCK_EXCL | XFS_ILOCK_RTSUM);
-
/*
* Now that we've locked the rtbitmap and rtsummary, we can't race with
* growfsrt trying to expand the summary or change the size of the rt
* volume. Hence it is safe to compute and check the geometry values.
+ *
+ * Note that there is no strict requirement for an exclusive lock on the
+ * summary here, but to keep the locking APIs simple we lock both inodes
+ * exclusively. If we ever start caring about running fsmap concurrently
+ * with scrub, this could be changed.
*/
+ xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP);
if (mp->m_sb.sb_rblocks) {
int rextslog;
@@ -105,6 +111,7 @@ xchk_setup_rtsummary(
rts->rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
rts->rbmblocks);
}
+
return 0;
}
@@ -155,11 +162,12 @@ xchk_rtsum_inc(
/* Update the summary file to reflect the free extent that we've accumulated. */
STATIC int
xchk_rtsum_record_free(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_scrub *sc = priv;
xfs_fileoff_t rbmoff;
xfs_rtblock_t rtbno;
@@ -182,7 +190,8 @@ xchk_rtsum_record_free(
rtlen = xfs_rtx_to_rtb(mp, rec->ar_extcount);
if (!xfs_verify_rtbext(mp, rtbno, rtlen)) {
- xchk_ino_xref_set_corrupt(sc, mp->m_rbmip->i_ino);
+ xchk_ino_xref_set_corrupt(sc,
+ rtg->rtg_inodes[XFS_RTGI_BITMAP]->i_ino);
return -EFSCORRUPTED;
}
@@ -204,15 +213,16 @@ xchk_rtsum_compute(
struct xfs_scrub *sc)
{
struct xfs_mount *mp = sc->mp;
+ struct xfs_rtgroup *rtg = sc->sr.rtg;
unsigned long long rtbmp_blocks;
/* If the bitmap size doesn't match the computed size, bail. */
rtbmp_blocks = xfs_rtbitmap_blockcount(mp, mp->m_sb.sb_rextents);
- if (XFS_FSB_TO_B(mp, rtbmp_blocks) != mp->m_rbmip->i_disk_size)
+ if (XFS_FSB_TO_B(mp, rtbmp_blocks) !=
+ rtg->rtg_inodes[XFS_RTGI_BITMAP]->i_disk_size)
return -EFSCORRUPTED;
- return xfs_rtalloc_query_all(sc->mp, sc->tp, xchk_rtsum_record_free,
- sc);
+ return xfs_rtalloc_query_all(rtg, sc->tp, xchk_rtsum_record_free, sc);
}
/* Compare the rtsummary file against the one we computed. */
@@ -231,8 +241,9 @@ xchk_rtsum_compare(
xfs_rtsumoff_t sumoff = 0;
int error = 0;
- rts->args.mp = sc->mp;
+ rts->args.mp = mp;
rts->args.tp = sc->tp;
+ rts->args.rtg = sc->sr.rtg;
/* Mappings may not cross or lie beyond EOF. */
endoff = XFS_B_TO_FSB(mp, ip->i_disk_size);
@@ -299,31 +310,34 @@ xchk_rtsummary(
struct xfs_scrub *sc)
{
struct xfs_mount *mp = sc->mp;
+ struct xfs_rtgroup *rtg = sc->sr.rtg;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
+ struct xfs_inode *rsumip = rtg->rtg_inodes[XFS_RTGI_SUMMARY];
struct xchk_rtsummary *rts = sc->buf;
- int error = 0;
+ int error;
/* Is sb_rextents correct? */
if (mp->m_sb.sb_rextents != rts->rextents) {
- xchk_ino_set_corrupt(sc, mp->m_rbmip->i_ino);
- goto out_rbm;
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
+ return 0;
}
/* Is m_rsumlevels correct? */
if (mp->m_rsumlevels != rts->rsumlevels) {
- xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
- goto out_rbm;
+ xchk_ino_set_corrupt(sc, rsumip->i_ino);
+ return 0;
}
/* Is m_rsumsize correct? */
if (mp->m_rsumblocks != rts->rsumblocks) {
- xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
- goto out_rbm;
+ xchk_ino_set_corrupt(sc, rsumip->i_ino);
+ return 0;
}
/* The summary file length must be aligned to an fsblock. */
- if (mp->m_rsumip->i_disk_size & mp->m_blockmask) {
- xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
- goto out_rbm;
+ if (rsumip->i_disk_size & mp->m_blockmask) {
+ xchk_ino_set_corrupt(sc, rsumip->i_ino);
+ return 0;
}
/*
@@ -331,15 +345,15 @@ xchk_rtsummary(
* growfsrt expands the summary file before updating sb_rextents, so
* the file can be larger than rsumsize.
*/
- if (mp->m_rsumip->i_disk_size < XFS_FSB_TO_B(mp, rts->rsumblocks)) {
- xchk_ino_set_corrupt(sc, mp->m_rsumip->i_ino);
- goto out_rbm;
+ if (rsumip->i_disk_size < XFS_FSB_TO_B(mp, rts->rsumblocks)) {
+ xchk_ino_set_corrupt(sc, rsumip->i_ino);
+ return 0;
}
/* Invoke the fork scrubber. */
error = xchk_metadata_inode_forks(sc);
if (error || (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
- goto out_rbm;
+ return error;
/* Construct the new summary file from the rtbitmap. */
error = xchk_rtsum_compute(sc);
@@ -348,23 +362,12 @@ xchk_rtsummary(
* EFSCORRUPTED means the rtbitmap is corrupt, which is an xref
* error since we're checking the summary file.
*/
- xchk_ino_xref_set_corrupt(sc, mp->m_rbmip->i_ino);
- error = 0;
- goto out_rbm;
+ xchk_ino_set_corrupt(sc, rbmip->i_ino);
+ return 0;
}
if (error)
- goto out_rbm;
+ return error;
/* Does the computed summary file match the actual rtsummary file? */
- error = xchk_rtsum_compare(sc);
-
-out_rbm:
- /*
- * Unlock the rtbitmap since we're done with it. All other writers of
- * the rt free space metadata grab the bitmap and summary ILOCKs in
- * that order, so we're still protected against allocation activities
- * even if we continue on to the repair function.
- */
- xfs_iunlock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP);
- return error;
+ return xchk_rtsum_compare(sc);
}
diff --git a/fs/xfs/scrub/rtsummary_repair.c b/fs/xfs/scrub/rtsummary_repair.c
index 7deeb948cb702..1688380988007 100644
--- a/fs/xfs/scrub/rtsummary_repair.c
+++ b/fs/xfs/scrub/rtsummary_repair.c
@@ -76,8 +76,9 @@ xrep_rtsummary_prep_buf(
union xfs_suminfo_raw *ondisk;
int error;
- rts->args.mp = sc->mp;
+ rts->args.mp = mp;
rts->args.tp = sc->tp;
+ rts->args.rtg = sc->sr.rtg;
rts->args.sumbp = bp;
ondisk = xfs_rsumblock_infoptr(&rts->args, 0);
rts->args.sumbp = NULL;
@@ -162,8 +163,8 @@ xrep_rtsummary(
return error;
/* Reset incore state and blow out the summary cache. */
- if (mp->m_rsum_cache)
- memset(mp->m_rsum_cache, 0xFF, mp->m_sb.sb_rbmblocks);
+ if (sc->sr.rtg->rtg_rsum_cache)
+ memset(sc->sr.rtg->rtg_rsum_cache, 0xFF, mp->m_sb.sb_rbmblocks);
mp->m_rsumlevels = rts->rsumlevels;
mp->m_rsumblocks = rts->rsumblocks;
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 9d9990d5c6c48..910825d4b61a2 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -384,13 +384,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
.repair = xrep_parent,
},
[XFS_SCRUB_TYPE_RTBITMAP] = { /* realtime bitmap */
- .type = ST_FS,
+ .type = ST_RTGROUP,
.setup = xchk_setup_rtbitmap,
.scrub = xchk_rtbitmap,
.repair = xrep_rtbitmap,
},
[XFS_SCRUB_TYPE_RTSUM] = { /* realtime summary */
- .type = ST_FS,
+ .type = ST_RTGROUP,
.setup = xchk_setup_rtsummary,
.scrub = xchk_rtsummary,
.repair = xrep_rtsummary,
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index bf1e3f330018d..b2ef5ebe1f047 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -21,6 +21,7 @@
#include "xfs_ag.h"
#include "xfs_health.h"
#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
/*
* Notes on an efficient, low latency fstrim algorithm
@@ -506,7 +507,7 @@ xfs_discard_rtdev_extents(
static int
xfs_trim_gather_rtextent(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
@@ -525,12 +526,12 @@ xfs_trim_gather_rtextent(
return -ECANCELED;
}
- rbno = xfs_rtx_to_rtb(mp, rec->ar_startext);
- rlen = xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ rbno = xfs_rtx_to_rtb(rtg->rtg_mount, rec->ar_startext);
+ rlen = xfs_rtx_to_rtb(rtg->rtg_mount, rec->ar_extcount);
/* Ignore too small. */
if (rlen < tr->minlen_fsb) {
- trace_xfs_discard_rttoosmall(mp, rbno, rlen);
+ trace_xfs_discard_rttoosmall(rtg->rtg_mount, rbno, rlen);
return 0;
}
@@ -548,69 +549,49 @@ xfs_trim_gather_rtextent(
}
static int
-xfs_trim_rtdev_extents(
- struct xfs_mount *mp,
- xfs_daddr_t start,
- xfs_daddr_t end,
+xfs_trim_rtg_extents(
+ struct xfs_rtgroup *rtg,
+ xfs_rtxnum_t low,
+ xfs_rtxnum_t high,
xfs_daddr_t minlen)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_trim_rtdev tr = {
.minlen_fsb = XFS_BB_TO_FSB(mp, minlen),
+ .extent_list = LIST_HEAD_INIT(tr.extent_list),
};
- xfs_rtxnum_t low, high;
struct xfs_trans *tp;
- xfs_daddr_t rtdev_daddr;
int error;
- INIT_LIST_HEAD(&tr.extent_list);
-
- /* Shift the start and end downwards to match the rt device. */
- rtdev_daddr = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
- if (start > rtdev_daddr)
- start -= rtdev_daddr;
- else
- start = 0;
-
- if (end <= rtdev_daddr)
- return 0;
- end -= rtdev_daddr;
-
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
return error;
- end = min_t(xfs_daddr_t, end,
- XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks) - 1);
-
- /* Convert the rt blocks to rt extents */
- low = xfs_rtb_to_rtxup(mp, XFS_BB_TO_FSB(mp, start));
- high = xfs_rtb_to_rtx(mp, XFS_BB_TO_FSBT(mp, end));
-
/*
* Walk the free ranges between low and high. The query_range function
* trims the extents returned.
*/
do {
tr.stop_rtx = low + (mp->m_sb.sb_blocksize * NBBY);
- xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
- error = xfs_rtalloc_query_range(mp, tp, low, high,
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ error = xfs_rtalloc_query_range(rtg, tp, low, high,
xfs_trim_gather_rtextent, &tr);
if (error == -ECANCELED)
error = 0;
if (error) {
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
xfs_discard_free_rtdev_extents(&tr);
break;
}
if (list_empty(&tr.extent_list)) {
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
break;
}
error = xfs_discard_rtdev_extents(mp, &tr);
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
if (error)
break;
@@ -620,6 +601,55 @@ xfs_trim_rtdev_extents(
xfs_trans_cancel(tp);
return error;
}
+
+static int
+xfs_trim_rtdev_extents(
+ struct xfs_mount *mp,
+ xfs_daddr_t start,
+ xfs_daddr_t end,
+ xfs_daddr_t minlen)
+{
+ xfs_rtblock_t start_rtbno, end_rtbno;
+ xfs_rtxnum_t start_rtx, end_rtx;
+ xfs_rgnumber_t rgno, end_rgno;
+ int last_error = 0, error;
+ struct xfs_rtgroup *rtg;
+
+ /* Shift the start and end downwards to match the rt device. */
+ start_rtbno = xfs_daddr_to_rtb(mp, start);
+ if (start_rtbno > mp->m_sb.sb_dblocks)
+ start_rtbno -= mp->m_sb.sb_dblocks;
+ else
+ start_rtbno = 0;
+ start_rtx = xfs_rtb_to_rtx(mp, start_rtbno);
+ rgno = xfs_rtb_to_rgno(mp, start_rtbno);
+
+ end_rtbno = xfs_daddr_to_rtb(mp, end);
+ if (end_rtbno <= mp->m_sb.sb_dblocks)
+ return 0;
+ end_rtbno -= mp->m_sb.sb_dblocks;
+ end_rtx = xfs_rtb_to_rtx(mp, end_rtbno + mp->m_sb.sb_rextsize - 1);
+ end_rgno = xfs_rtb_to_rgno(mp, end_rtbno);
+
+ for_each_rtgroup_range(mp, rgno, end_rgno, rtg) {
+ xfs_rtxnum_t rtg_end = rtg->rtg_extents;
+
+ if (rgno == end_rgno)
+ rtg_end = min(rtg_end, end_rtx);
+
+ error = xfs_trim_rtg_extents(rtg, start_rtx, rtg_end, minlen);
+ if (error)
+ last_error = error;
+
+ if (xfs_trim_should_stop()) {
+ xfs_rtgroup_rele(rtg);
+ break;
+ }
+ start_rtx = 0;
+ }
+
+ return last_error;
+}
#else
# define xfs_trim_rtdev_extents(...) (-EOPNOTSUPP)
#endif /* CONFIG_XFS_RT */
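With fstrim now iterating realtime groups, the interesting part of the new
xfs_trim_rtdev_extents is how a linear rt block range is split into per-group
extent ranges: the first group starts at the computed start extent, every later
group restarts at zero, and only the last group is clamped to the range end.
A minimal userspace sketch of that splitting, under a simplified geometry with
a fixed number of extents per group (all names here are hypothetical, not the
kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Group number owning an rt block (cf. xfs_rtb_to_rgno). */
static uint64_t rtb_to_rgno(uint64_t rextsize, uint64_t gextents,
			    uint64_t rtbno)
{
	return rtbno / (rextsize * gextents);
}

/* Extent index of an rt block within its group (cf. xfs_rtb_to_rtx). */
static uint64_t rtb_to_rtx(uint64_t rextsize, uint64_t gextents,
			   uint64_t rtbno)
{
	return (rtbno / rextsize) % gextents;
}

/*
 * First extent to trim in group rgno: the computed start extent in the
 * first group of the range, zero in every later group.
 */
static uint64_t group_low(uint64_t rextsize, uint64_t gextents,
			  uint64_t rgno, uint64_t start_rtbno)
{
	if (rgno == rtb_to_rgno(rextsize, gextents, start_rtbno))
		return rtb_to_rtx(rextsize, gextents, start_rtbno);
	return 0;
}

/*
 * Last extent to trim in group rgno: clamped to the range end in the
 * final group, the whole group otherwise.
 */
static uint64_t group_high(uint64_t rextsize, uint64_t gextents,
			   uint64_t rgno, uint64_t end_rtbno)
{
	if (rgno == rtb_to_rgno(rextsize, gextents, end_rtbno))
		return rtb_to_rtx(rextsize, gextents, end_rtbno);
	return gextents - 1;
}
```

With 4 blocks per extent and 8 extents per group (32 blocks per group), a trim
of blocks 10 through 70 touches groups 0..2: group 0 from extent 2, group 1 in
full, and group 2 only through extent 1.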
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index ae18ab86e608b..0e0ec3f0574b1 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -159,6 +159,7 @@ struct xfs_getfsmap_info {
struct fsmap *fsmap_recs; /* mapping records */
struct xfs_buf *agf_bp; /* AGF, for refcount queries */
struct xfs_perag *pag; /* AG info, if applicable */
+ struct xfs_rtgroup *rtg; /* rtgroup, if applicable */
xfs_daddr_t next_daddr; /* next daddr we expect */
/* daddr of low fsmap key when we're using the rtbitmap */
xfs_daddr_t low_daddr;
@@ -352,8 +353,14 @@ xfs_getfsmap_helper(
if (info->head->fmh_entries >= info->head->fmh_count)
return -ECANCELED;
- trace_xfs_fsmap_mapping(mp, info->dev,
- info->pag ? info->pag->pag_agno : NULLAGNUMBER, rec);
+ if (info->pag)
+ trace_xfs_fsmap_mapping(mp, info->dev, info->pag->pag_agno,
+ rec);
+ else if (info->rtg)
+ trace_xfs_fsmap_mapping(mp, info->dev, info->rtg->rtg_rgno,
+ rec);
+ else
+ trace_xfs_fsmap_mapping(mp, info->dev, NULLAGNUMBER, rec);
fmr.fmr_device = info->dev;
fmr.fmr_physical = rec_daddr;
@@ -711,29 +718,26 @@ xfs_getfsmap_logdev(
/* Transform a rtbitmap "record" into a fsmap */
STATIC int
xfs_getfsmap_rtdev_rtbitmap_helper(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_getfsmap_info *info = priv;
- struct xfs_rmap_irec irec;
- xfs_rtblock_t rtbno;
- xfs_daddr_t rec_daddr, len_daddr;
-
- rtbno = xfs_rtx_to_rtb(mp, rec->ar_startext);
- rec_daddr = XFS_FSB_TO_BB(mp, rtbno);
- irec.rm_startblock = rtbno;
-
- rtbno = xfs_rtx_to_rtb(mp, rec->ar_extcount);
- len_daddr = XFS_FSB_TO_BB(mp, rtbno);
- irec.rm_blockcount = rtbno;
-
- irec.rm_owner = XFS_RMAP_OWN_NULL; /* "free" */
- irec.rm_offset = 0;
- irec.rm_flags = 0;
-
- return xfs_getfsmap_helper(tp, info, &irec, rec_daddr, len_daddr);
+ xfs_rtblock_t start_rtb =
+ xfs_rtx_to_rtb(mp, rec->ar_startext);
+ uint64_t rtbcount =
+ xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ struct xfs_rmap_irec irec = {
+ .rm_startblock = start_rtb,
+ .rm_blockcount = rtbcount,
+ .rm_owner = XFS_RMAP_OWN_NULL, /* "free" */
+ };
+
+ return xfs_getfsmap_helper(tp, info, &irec,
+ xfs_rtb_to_daddr(mp, start_rtb),
+ xfs_rtb_to_daddr(mp, rtbcount));
}
/* Execute a getfsmap query against the realtime device rtbitmap. */
@@ -743,58 +747,82 @@ xfs_getfsmap_rtdev_rtbitmap(
const struct xfs_fsmap *keys,
struct xfs_getfsmap_info *info)
{
-
- struct xfs_rtalloc_rec ahigh = { 0 };
struct xfs_mount *mp = tp->t_mountp;
- xfs_rtblock_t start_rtb;
- xfs_rtblock_t end_rtb;
- xfs_rtxnum_t high;
+ xfs_rtblock_t start_rtbno, end_rtbno;
+ xfs_rtxnum_t start_rtx, end_rtx;
+ xfs_rgnumber_t rgno, end_rgno;
+ struct xfs_rtgroup *rtg;
uint64_t eofs;
int error;
- eofs = XFS_FSB_TO_BB(mp, xfs_rtx_to_rtb(mp, mp->m_sb.sb_rextents));
+ eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
if (keys[0].fmr_physical >= eofs)
return 0;
- start_rtb = XFS_BB_TO_FSBT(mp,
- keys[0].fmr_physical + keys[0].fmr_length);
- end_rtb = XFS_BB_TO_FSB(mp, min(eofs - 1, keys[1].fmr_physical));
info->missing_owner = XFS_FMR_OWN_UNKNOWN;
/* Adjust the low key if we are continuing from where we left off. */
+ start_rtbno = xfs_daddr_to_rtb(mp,
+ keys[0].fmr_physical + keys[0].fmr_length);
if (keys[0].fmr_length > 0) {
- info->low_daddr = XFS_FSB_TO_BB(mp, start_rtb);
+ info->low_daddr = XFS_FSB_TO_BB(mp, start_rtbno);
if (info->low_daddr >= eofs)
return 0;
}
+ start_rtx = xfs_rtb_to_rtx(mp, start_rtbno);
+ rgno = xfs_rtb_to_rgno(mp, start_rtbno);
+ trace_xfs_fsmap_low_key_linear(mp, info->dev, start_rtbno);
+
+ end_rtbno = xfs_daddr_to_rtb(mp, min(eofs - 1, keys[1].fmr_physical));
+ end_rgno = xfs_rtb_to_rgno(mp, end_rtbno);
+ trace_xfs_fsmap_high_key_linear(mp, info->dev, end_rtbno);
+
+ end_rtx = -1ULL;
+
+ for_each_rtgroup_range(mp, rgno, end_rgno, rtg) {
+ if (rgno == end_rgno)
+ end_rtx = xfs_rtb_to_rtx(mp,
+ end_rtbno + mp->m_sb.sb_rextsize - 1);
+
+ info->rtg = rtg;
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ error = xfs_rtalloc_query_range(rtg, tp, start_rtx, end_rtx,
+ xfs_getfsmap_rtdev_rtbitmap_helper, info);
+ if (error)
+ break;
+
+ /*
+ * Report any gaps at the end of the rtbitmap by simulating a
+ * zero-length free extent starting at the rtx after the end
+ * of the query range.
+ */
+ if (rgno == end_rgno) {
+ struct xfs_rtalloc_rec ahigh = {
+ .ar_startext = min(end_rtx + 1,
+ rtg->rtg_extents),
+ };
+
+ info->last = true;
+ error = xfs_getfsmap_rtdev_rtbitmap_helper(rtg, tp,
+ &ahigh, info);
+ if (error)
+ break;
+ }
+
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ info->rtg = NULL;
+ start_rtx = 0;
+ }
+
+ if (info->rtg) {
+ xfs_rtgroup_unlock(info->rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ xfs_rtgroup_rele(info->rtg);
+ info->rtg = NULL;
+ } else if (rtg) {
+ /* loop termination case */
+ xfs_rtgroup_rele(rtg);
+ }
- trace_xfs_fsmap_low_key_linear(mp, info->dev, start_rtb);
- trace_xfs_fsmap_high_key_linear(mp, info->dev, end_rtb);
-
- xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
-
- /*
- * Set up query parameters to return free rtextents covering the range
- * we want.
- */
- high = xfs_rtb_to_rtxup(mp, end_rtb);
- error = xfs_rtalloc_query_range(mp, tp, xfs_rtb_to_rtx(mp, start_rtb),
- high, xfs_getfsmap_rtdev_rtbitmap_helper, info);
- if (error)
- goto err;
-
- /*
- * Report any gaps at the end of the rtbitmap by simulating a null
- * rmap starting at the block after the end of the query range.
- */
- info->last = true;
- ahigh.ar_startext = min(mp->m_sb.sb_rextents, high);
-
- error = xfs_getfsmap_rtdev_rtbitmap_helper(mp, tp, &ahigh, info);
- if (error)
- goto err;
-err:
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
return error;
}
#endif /* CONFIG_XFS_RT */
@@ -1004,6 +1032,7 @@ xfs_getfsmap(
info.dev = handlers[i].dev;
info.last = false;
info.pag = NULL;
+ info.rtg = NULL;
info.low_daddr = XFS_BUF_DADDR_NULL;
info.low.rm_blockcount = 0;
error = handlers[i].fn(tp, dkeys, &info);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 73959c26075a5..2518977150295 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -90,8 +90,6 @@ typedef struct xfs_mount {
struct xfs_da_geometry *m_dir_geo; /* directory block geometry */
struct xfs_da_geometry *m_attr_geo; /* attribute block geometry */
struct xlog *m_log; /* log specific stuff */
- struct xfs_inode *m_rbmip; /* pointer to bitmap inode */
- struct xfs_inode *m_rsumip; /* pointer to summary inode */
struct xfs_inode *m_rootip; /* pointer to root directory */
struct xfs_inode *m_metadirip; /* ptr to metadata directory */
struct xfs_inode *m_rtdirip; /* ptr to realtime metadir */
@@ -100,14 +98,6 @@ typedef struct xfs_mount {
struct xfs_buftarg *m_logdev_targp;/* log device */
struct xfs_buftarg *m_rtdev_targp; /* rt device */
void __percpu *m_inodegc; /* percpu inodegc structures */
-
- /*
- * Optional cache of rt summary level per bitmap block with the
- * invariant that m_rsum_cache[bbno] > the maximum i for which
- * rsum[i][bbno] != 0, or 0 if rsum[i][bbno] == 0 for all i.
- * Reads and writes are serialized by the rsumip inode lock.
- */
- uint8_t *m_rsum_cache;
struct xfs_mru_cache *m_filestream; /* per-mount filestream data */
struct workqueue_struct *m_buf_workqueue;
struct workqueue_struct *m_unwritten_workqueue;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index b94d6f192e725..28b1420bac1dd 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -29,6 +29,7 @@
#include "xfs_health.h"
#include "xfs_da_format.h"
#include "xfs_metafile.h"
+#include "xfs_rtgroup.h"
/*
* The global quota manager. There is only one of these for the entire
@@ -210,6 +211,21 @@ xfs_qm_unmount(
}
}
+static void
+xfs_qm_unmount_rt(
+ struct xfs_mount *mp)
+{
+ struct xfs_rtgroup *rtg = xfs_rtgroup_grab(mp, 0);
+
+ if (!rtg)
+ return;
+ if (rtg->rtg_inodes[XFS_RTGI_BITMAP])
+ xfs_qm_dqdetach(rtg->rtg_inodes[XFS_RTGI_BITMAP]);
+ if (rtg->rtg_inodes[XFS_RTGI_SUMMARY])
+ xfs_qm_dqdetach(rtg->rtg_inodes[XFS_RTGI_SUMMARY]);
+ xfs_rtgroup_rele(rtg);
+}
+
/*
* Called from the vfsops layer.
*/
@@ -223,10 +239,13 @@ xfs_qm_unmount_quotas(
*/
ASSERT(mp->m_rootip);
xfs_qm_dqdetach(mp->m_rootip);
- if (mp->m_rbmip)
- xfs_qm_dqdetach(mp->m_rbmip);
- if (mp->m_rsumip)
- xfs_qm_dqdetach(mp->m_rsumip);
+
+ /*
+	 * For pre-RTG file systems, the RT inodes have quotas attached;
+ * detach them now.
+ */
+ if (!xfs_has_rtgroups(mp))
+ xfs_qm_unmount_rt(mp);
/*
* Release the quota inodes.
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index dcdb726ebe4a0..f63228b3dd9a2 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -42,14 +42,14 @@ xfs_rtany_summary(
xfs_fileoff_t bbno, /* bitmap block number */
int *maxlog) /* out: max log2 extent size free */
{
- struct xfs_mount *mp = args->mp;
+ uint8_t *rsum_cache = args->rtg->rtg_rsum_cache;
int error;
int log; /* loop counter, log2 of ext. size */
xfs_suminfo_t sum; /* summary data */
- /* There are no extents at levels >= m_rsum_cache[bbno]. */
- if (mp->m_rsum_cache) {
- high = min(high, mp->m_rsum_cache[bbno] - 1);
+ /* There are no extents at levels >= rsum_cache[bbno]. */
+ if (rsum_cache) {
+ high = min(high, rsum_cache[bbno] - 1);
if (low > high) {
*maxlog = -1;
return 0;
@@ -81,12 +81,11 @@ xfs_rtany_summary(
*maxlog = -1;
out:
/* There were no extents at levels > log. */
- if (mp->m_rsum_cache && log + 1 < mp->m_rsum_cache[bbno])
- mp->m_rsum_cache[bbno] = log + 1;
+ if (rsum_cache && log + 1 < rsum_cache[bbno])
+ rsum_cache[bbno] = log + 1;
return 0;
}
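The hunk above keeps the summary-cache clamp while moving the cache itself from
the mount to the rtgroup. The invariant, as the removed xfs_mount.h comment
states, is that rsum_cache[bbno] is a strict upper bound on the highest summary
level with any free extents for that bitmap block (or 0 if all levels are
empty). A toy userspace model of the clamp-then-tighten logic for a single
bitmap block, with hypothetical names rather than the kernel API:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LEVELS 16	/* hypothetical number of summary levels */

/*
 * Toy model of one bitmap block's summary column plus its cache entry.
 * Invariant: cache > the maximum level i with sum[i] != 0, or 0 if all
 * levels are empty.
 */
struct toy_rsum {
	uint32_t sum[TOY_LEVELS];	/* free-extent counts per level */
	uint8_t cache;			/* exclusive upper bound on levels */
};

/*
 * Highest log2 extent size in [low, high] with any free extents, or -1.
 * As in xfs_rtany_summary(), callers pass the topmost level as high, so
 * tightening the cache after an unsuccessful scan stays correct.
 */
static int toy_maxlog(struct toy_rsum *rs, int low, int high)
{
	int log;

	/* There are no free extents at levels >= cache. */
	if (high > rs->cache - 1)
		high = rs->cache - 1;
	if (low > high)
		return -1;

	for (log = high; log >= low; log--)
		if (rs->sum[log])
			break;

	/* There were no extents at levels > log; tighten the bound. */
	if (log + 1 < rs->cache)
		rs->cache = log + 1;
	return log >= low ? log : -1;
}
```

After one successful scan the cache drops from TOY_LEVELS to just above the
found level, so a later search for larger extents can bail out without reading
any summary data at all.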
-
/*
* Copy and transform the summary file, given the old and new
* parameters in the mount structures.
@@ -153,7 +152,7 @@ xfs_rtallocate_range(
/*
* Find the next allocated block (end of free extent).
*/
- error = xfs_rtfind_forw(args, end, mp->m_sb.sb_rextents - 1,
+ error = xfs_rtfind_forw(args, end, args->rtg->rtg_extents - 1,
&postblock);
if (error)
return error;
@@ -215,14 +214,14 @@ xfs_rtalloc_align_len(
*/
static inline xfs_rtxlen_t
xfs_rtallocate_clamp_len(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
xfs_rtxnum_t startrtx,
xfs_rtxlen_t rtxlen,
xfs_rtxlen_t prod)
{
xfs_rtxlen_t ret;
- ret = min(mp->m_sb.sb_rextents, startrtx + rtxlen) - startrtx;
+ ret = min(rtg->rtg_extents, startrtx + rtxlen) - startrtx;
return xfs_rtalloc_align_len(ret, prod);
}
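The clamp above now bounds a candidate scan by the group's extent count rather
than the whole volume's. The underlying arithmetic is simple enough to sketch
standalone: cap the scan at the end of the group, then round the surviving
length down to a multiple of the extent-size hint. This is a hypothetical
helper mirroring xfs_rtallocate_clamp_len, assuming prod > 0 and a start inside
the group:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Limit a candidate allocation of "len" extents starting at "start" so
 * it cannot run past the end of a group holding "group_extents"
 * extents, then align the result down to a multiple of "prod".
 */
static uint64_t clamp_len(uint64_t group_extents, uint64_t start,
			  uint64_t len, uint64_t prod)
{
	uint64_t end = start + len;

	if (end > group_extents)
		end = group_extents;
	len = end - start;
	return len - len % prod;	/* align down to the hint */
}
```

A result smaller than the caller's minimum length (including zero, when the
clamp leaves less than one aligned unit) is what makes the scan loops above
break out early.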
@@ -257,10 +256,11 @@ xfs_rtallocate_extent_block(
* Loop over all the extents starting in this bitmap block up to the
* end of the rt volume, looking for one that's long enough.
*/
- end = min(mp->m_sb.sb_rextents, xfs_rbmblock_to_rtx(mp, bbno + 1)) - 1;
+ end = min(args->rtg->rtg_extents, xfs_rbmblock_to_rtx(mp, bbno + 1)) -
+ 1;
for (i = xfs_rbmblock_to_rtx(mp, bbno); i <= end; i++) {
/* Make sure we don't scan off the end of the rt volume. */
- scanlen = xfs_rtallocate_clamp_len(mp, i, maxlen, prod);
+ scanlen = xfs_rtallocate_clamp_len(args->rtg, i, maxlen, prod);
if (scanlen < minlen)
break;
@@ -345,7 +345,6 @@ xfs_rtallocate_extent_exact(
xfs_rtxlen_t prod, /* extent product factor */
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
- struct xfs_mount *mp = args->mp;
xfs_rtxnum_t next; /* next rtext to try (dummy) */
xfs_rtxlen_t alloclen; /* candidate length */
xfs_rtxlen_t scanlen; /* number of free rtx to look for */
@@ -356,7 +355,7 @@ xfs_rtallocate_extent_exact(
ASSERT(maxlen % prod == 0);
/* Make sure we don't run off the end of the rt volume. */
- scanlen = xfs_rtallocate_clamp_len(mp, start, maxlen, prod);
+ scanlen = xfs_rtallocate_clamp_len(args->rtg, start, maxlen, prod);
if (scanlen < minlen)
return -ENOSPC;
@@ -417,11 +416,10 @@ xfs_rtallocate_extent_near(
ASSERT(maxlen % prod == 0);
/*
- * If the block number given is off the end, silently set it to
- * the last block.
+ * If the block number given is off the end, silently set it to the last
+ * block.
*/
- if (start >= mp->m_sb.sb_rextents)
- start = mp->m_sb.sb_rextents - 1;
+ start = min(start, args->rtg->rtg_extents - 1);
/*
* Try the exact allocation first.
@@ -661,21 +659,22 @@ xfs_rtunmount_rtg(
for (i = 0; i < XFS_RTGI_MAX; i++)
xfs_rtginode_irele(&rtg->rtg_inodes[i]);
+ kvfree(rtg->rtg_rsum_cache);
}
static int
xfs_alloc_rsum_cache(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
xfs_extlen_t rbmblocks)
{
/*
* The rsum cache is initialized to the maximum value, which is
* trivially an upper bound on the maximum level with any free extents.
*/
- mp->m_rsum_cache = kvmalloc(rbmblocks, GFP_KERNEL);
- if (!mp->m_rsum_cache)
+ rtg->rtg_rsum_cache = kvmalloc(rbmblocks, GFP_KERNEL);
+ if (!rtg->rtg_rsum_cache)
return -ENOMEM;
- memset(mp->m_rsum_cache, -1, rbmblocks);
+ memset(rtg->rtg_rsum_cache, -1, rbmblocks);
return 0;
}
@@ -712,19 +711,45 @@ xfs_growfs_rt_fixup_extsize(
return error;
}
+/* Ensure that the rtgroup metadata inode is loaded, creating it if needed. */
+static int
+xfs_rtginode_ensure(
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type)
+{
+ struct xfs_trans *tp;
+ int error;
+
+ if (rtg->rtg_inodes[type])
+ return 0;
+
+ error = xfs_trans_alloc_empty(rtg->rtg_mount, &tp);
+ if (error)
+ return error;
+ error = xfs_rtginode_load(rtg, type, tp);
+ xfs_trans_cancel(tp);
+
+	if (error != -ENOENT)
+		return error;
+ return xfs_rtginode_create(rtg, type, true);
+}
+
static int
xfs_growfs_rt_bmblock(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
xfs_rfsblock_t nrblocks,
xfs_agblock_t rextsize,
xfs_fileoff_t bmbno)
{
- struct xfs_inode *rbmip = mp->m_rbmip;
- struct xfs_inode *rsumip = mp->m_rsumip;
+ struct xfs_mount *mp = rtg->rtg_mount;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
+ struct xfs_inode *rsumip = rtg->rtg_inodes[XFS_RTGI_SUMMARY];
struct xfs_rtalloc_args args = {
.mp = mp,
+ .rtg = rtg,
};
struct xfs_rtalloc_args nargs = {
+ .rtg = rtg,
};
struct xfs_mount *nmp;
xfs_rfsblock_t nrblocks_step;
@@ -750,6 +775,7 @@ xfs_growfs_rt_bmblock(
nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
nmp->m_rsumblocks = xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
nmp->m_sb.sb_rbmblocks);
+ rtg->rtg_extents = xfs_rtgroup_extents(nmp, rtg->rtg_rgno);
/*
* Recompute the growfsrt reservation from the new rsumsize, so that the
@@ -762,8 +788,8 @@ xfs_growfs_rt_bmblock(
goto out_free;
nargs.tp = args.tp;
- xfs_rtbitmap_lock(mp);
- xfs_rtbitmap_trans_join(args.tp);
+ xfs_rtgroup_lock(args.rtg, XFS_RTGLOCK_BITMAP);
+ xfs_rtgroup_trans_join(args.tp, args.rtg, XFS_RTGLOCK_BITMAP);
/*
* Update the bitmap inode's size ondisk and incore. We need to update
@@ -865,8 +891,9 @@ xfs_growfs_rt_bmblock(
*/
static xfs_fileoff_t
xfs_last_rt_bmblock(
- struct xfs_mount *mp)
+ struct xfs_rtgroup *rtg)
{
+ struct xfs_mount *mp = rtg->rtg_mount;
xfs_fileoff_t bmbno = mp->m_sb.sb_rbmblocks;
/* Skip the current block if it is exactly full. */
@@ -875,6 +902,103 @@ xfs_last_rt_bmblock(
return bmbno;
}
+/*
+ * Allocate space to the bitmap and summary files, as necessary.
+ */
+static int
+xfs_growfs_rt_alloc_blocks(
+ struct xfs_rtgroup *rtg,
+ xfs_rfsblock_t nrblocks,
+ xfs_agblock_t rextsize,
+ xfs_extlen_t *nrbmblocks)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
+ struct xfs_inode *rsumip = rtg->rtg_inodes[XFS_RTGI_SUMMARY];
+ xfs_rtxnum_t nrextents = div_u64(nrblocks, rextsize);
+ xfs_extlen_t orbmblocks;
+ xfs_extlen_t orsumblocks;
+ xfs_extlen_t nrsumblocks;
+ int error;
+
+ /*
+ * Get the old block counts for bitmap and summary inodes.
+ * These can't change since other growfs callers are locked out.
+ */
+ orbmblocks = XFS_B_TO_FSB(mp, rbmip->i_disk_size);
+ orsumblocks = XFS_B_TO_FSB(mp, rsumip->i_disk_size);
+
+ *nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
+ nrsumblocks = xfs_rtsummary_blockcount(mp,
+ xfs_compute_rextslog(nrextents) + 1, *nrbmblocks);
+
+ error = xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_BITMAP, orbmblocks,
+ *nrbmblocks, NULL);
+ if (error)
+ return error;
+ return xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_SUMMARY, orsumblocks,
+ nrsumblocks, NULL);
+}
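xfs_growfs_rt_alloc_blocks recomputes the bitmap and summary sizes from the new
extent count before growing the files. The arithmetic behind
xfs_rtbitmap_blockcount and xfs_rtsummary_blockcount can be sketched as
follows, assuming one bit per rt extent in the bitmap and 4-byte summary
counters (the counter size is an assumption for this sketch):

```c
#include <assert.h>
#include <stdint.h>

#define NBBY 8			/* bits per byte */
#define SUMINFO_SIZE 4		/* assumed bytes per summary counter */

/* ceil(a / b) */
static uint64_t div_roundup(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* Bitmap blocks needed: one bit per rt extent. */
static uint64_t bitmap_blocks(uint64_t blocksize, uint64_t rextents)
{
	return div_roundup(rextents, blocksize * NBBY);
}

/* floor(log2(rextents)), i.e. sb_rextslog for a nonzero extent count. */
static unsigned int rextslog(uint64_t rextents)
{
	unsigned int log = 0;

	while (rextents >>= 1)
		log++;
	return log;
}

/*
 * Summary blocks needed: one counter per (level, bitmap block) pair,
 * with rextslog + 1 levels in total.
 */
static uint64_t summary_blocks(uint64_t blocksize, uint64_t rextents)
{
	uint64_t bmblocks = bitmap_blocks(blocksize, rextents);
	uint64_t levels = rextslog(rextents) + 1;

	return div_roundup(levels * bmblocks * SUMINFO_SIZE, blocksize);
}
```

For a 4096-byte block, each bitmap block covers 32768 extents, and one extent
past that boundary costs a whole extra bitmap block, which in turn is what can
trigger the rsum-cache reallocation in xfs_growfs_rtg.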
+
+static int
+xfs_growfs_rtg(
+ struct xfs_mount *mp,
+ xfs_rfsblock_t nrblocks,
+ xfs_agblock_t rextsize)
+{
+ uint8_t *old_rsum_cache = NULL;
+ xfs_extlen_t bmblocks;
+ xfs_fileoff_t bmbno;
+ struct xfs_rtgroup *rtg;
+ unsigned int i;
+ int error;
+
+ rtg = xfs_rtgroup_grab(mp, 0);
+ if (!rtg)
+ return -EINVAL;
+
+ for (i = 0; i < XFS_RTGI_MAX; i++) {
+ error = xfs_rtginode_ensure(rtg, i);
+ if (error)
+ goto out_rele;
+ }
+
+ error = xfs_growfs_rt_alloc_blocks(rtg, nrblocks, rextsize, &bmblocks);
+ if (error)
+ goto out_rele;
+
+ if (bmblocks != rtg->rtg_mount->m_sb.sb_rbmblocks) {
+ old_rsum_cache = rtg->rtg_rsum_cache;
+ error = xfs_alloc_rsum_cache(rtg, bmblocks);
+ if (error)
+ goto out_rele;
+ }
+
+ for (bmbno = xfs_last_rt_bmblock(rtg); bmbno < bmblocks; bmbno++) {
+ error = xfs_growfs_rt_bmblock(rtg, nrblocks, rextsize, bmbno);
+ if (error)
+ goto out_error;
+ }
+
+	kvfree(old_rsum_cache);
+ xfs_rtgroup_rele(rtg);
+ return 0;
+
+out_error:
+ /*
+ * Reset rtg_extents to the old value if adding more blocks failed.
+ */
+ rtg->rtg_extents = xfs_rtgroup_extents(rtg->rtg_mount, rtg->rtg_rgno);
+ if (old_rsum_cache) {
+ kvfree(rtg->rtg_rsum_cache);
+ rtg->rtg_rsum_cache = old_rsum_cache;
+ }
+out_rele:
+ xfs_rtgroup_rele(rtg);
+ return error;
+}
+
/*
* Grow the realtime area of the filesystem.
*/
@@ -883,16 +1007,12 @@ xfs_growfs_rt(
xfs_mount_t *mp, /* mount point for filesystem */
xfs_growfs_rt_t *in) /* growfs rt input struct */
{
- xfs_fileoff_t bmbno; /* bitmap block number */
- struct xfs_buf *bp; /* temporary buffer */
- int error; /* error return value */
- xfs_extlen_t nrbmblocks; /* new number of rt bitmap blocks */
- xfs_rtxnum_t nrextents; /* new number of realtime extents */
- xfs_extlen_t nrsumblocks; /* new number of summary blocks */
- xfs_extlen_t rbmblocks; /* current number of rt bitmap blocks */
- xfs_extlen_t rsumblocks; /* current number of rt summary blks */
- uint8_t *rsum_cache; /* old summary cache */
- xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
+ xfs_rtxnum_t nrextents;
+ xfs_extlen_t nrbmblocks;
+ xfs_extlen_t nrsumblocks;
+ struct xfs_buf *bp;
+ xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
+ int error;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -903,15 +1023,9 @@ xfs_growfs_rt(
if (!mutex_trylock(&mp->m_growlock))
return -EWOULDBLOCK;
- /*
- * Mount should fail if the rt bitmap/summary files don't load, but
- * we'll check anyway.
- */
- error = -EINVAL;
- if (!mp->m_rbmip || !mp->m_rsumip)
- goto out_unlock;
/* Shrink not supported. */
+ error = -EINVAL;
if (in->newblocks <= mp->m_sb.sb_rblocks)
goto out_unlock;
/* Can only change rt extent size when adding rt volume. */
@@ -945,10 +1059,9 @@ xfs_growfs_rt(
* Calculate new parameters. These are the final values to be reached.
*/
nrextents = div_u64(in->newblocks, in->extsize);
- if (nrextents == 0) {
- error = -EINVAL;
+ error = -EINVAL;
+ if (nrextents == 0)
goto out_unlock;
- }
nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
nrsumblocks = xfs_rtsummary_blockcount(mp,
xfs_compute_rextslog(nrextents) + 1, nrbmblocks);
@@ -958,68 +1071,22 @@ xfs_growfs_rt(
* the log. This prevents us from getting a log overflow,
* since we'll log basically the whole summary file at once.
*/
- if (nrsumblocks > (mp->m_sb.sb_logblocks >> 1)) {
- error = -EINVAL;
+ if (nrsumblocks > (mp->m_sb.sb_logblocks >> 1))
goto out_unlock;
- }
- /*
- * Get the old block counts for bitmap and summary inodes.
- * These can't change since other growfs callers are locked out.
- */
- rbmblocks = XFS_B_TO_FSB(mp, mp->m_rbmip->i_disk_size);
- rsumblocks = XFS_B_TO_FSB(mp, mp->m_rsumip->i_disk_size);
- /*
- * Allocate space to the bitmap and summary files, as necessary.
- */
- error = xfs_rtfile_initialize_blocks(mp->m_rbmip, rbmblocks,
- nrbmblocks, NULL);
+ error = xfs_growfs_rtg(mp, in->newblocks, in->extsize);
if (error)
goto out_unlock;
- error = xfs_rtfile_initialize_blocks(mp->m_rsumip, rsumblocks,
- nrsumblocks, NULL);
- if (error)
- goto out_unlock;
-
- rsum_cache = mp->m_rsum_cache;
- if (nrbmblocks != mp->m_sb.sb_rbmblocks) {
- error = xfs_alloc_rsum_cache(mp, nrbmblocks);
- if (error)
- goto out_unlock;
- }
-
- /* Initialize the free space bitmap one bitmap block at a time. */
- for (bmbno = xfs_last_rt_bmblock(mp); bmbno < nrbmblocks; bmbno++) {
- error = xfs_growfs_rt_bmblock(mp, in->newblocks, in->extsize,
- bmbno);
- if (error)
- goto out_free;
- }
if (old_rextsize != in->extsize) {
error = xfs_growfs_rt_fixup_extsize(mp);
if (error)
- goto out_free;
+ goto out_unlock;
}
/* Update secondary superblocks now the physical grow has completed */
error = xfs_update_secondary_sbs(mp);
-out_free:
- /*
- * If we had to allocate a new rsum_cache, we either need to free the
- * old one (if we succeeded) or free the new one and restore the old one
- * (if there was an error).
- */
- if (rsum_cache != mp->m_rsum_cache) {
- if (error) {
- kvfree(mp->m_rsum_cache);
- mp->m_rsum_cache = rsum_cache;
- } else {
- kvfree(rsum_cache);
- }
- }
-
out_unlock:
mutex_unlock(&mp->m_growlock);
return error;
@@ -1048,7 +1115,7 @@ xfs_rtmount_init(
mp->m_rsumlevels = sbp->sb_rextslog + 1;
mp->m_rsumblocks = xfs_rtsummary_blockcount(mp, mp->m_rsumlevels,
mp->m_sb.sb_rbmblocks);
- mp->m_rbmip = mp->m_rsumip = NULL;
+
/*
* Check that the realtime section is an ok size.
*/
@@ -1072,7 +1139,7 @@ xfs_rtmount_init(
static int
xfs_rtalloc_count_frextent(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
struct xfs_trans *tp,
const struct xfs_rtalloc_rec *rec,
void *priv)
@@ -1094,12 +1161,17 @@ xfs_rtalloc_reinit_frextents(
uint64_t val = 0;
int error;
- xfs_rtbitmap_lock_shared(mp, XFS_RBMLOCK_BITMAP);
- error = xfs_rtalloc_query_all(mp, NULL, xfs_rtalloc_count_frextent,
- &val);
- xfs_rtbitmap_unlock_shared(mp, XFS_RBMLOCK_BITMAP);
- if (error)
- return error;
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
+
+ for_each_rtgroup(mp, rgno, rtg) {
+ xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ error = xfs_rtalloc_query_all(rtg, NULL, xfs_rtalloc_count_frextent,
+ &val);
+ xfs_rtgroup_unlock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
+ if (error)
+ return error;
+ }
spin_lock(&mp->m_sb_lock);
mp->m_sb.sb_frextents = val;
@@ -1138,16 +1210,30 @@ xfs_rtmount_iread_extents(
return error;
}
-static void
-xfs_rtgroup_unmount_inodes(
- struct xfs_mount *mp)
+static int
+xfs_rtmount_rtg(
+ struct xfs_mount *mp,
+ struct xfs_trans *tp,
+ struct xfs_rtgroup *rtg)
{
- struct xfs_rtgroup *rtg;
- xfs_rgnumber_t rgno;
+ int error, i;
- for_each_rtgroup(mp, rgno, rtg)
- xfs_rtunmount_rtg(rtg);
- xfs_rtginode_irele(&mp->m_rtdirip);
+ rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
+
+ for (i = 0; i < XFS_RTGI_MAX; i++) {
+ error = xfs_rtginode_load(rtg, i, tp);
+ if (error)
+ return error;
+
+ if (rtg->rtg_inodes[i]) {
+ error = xfs_rtmount_iread_extents(tp,
+ rtg->rtg_inodes[i], 0);
+ if (error)
+ return error;
+ }
+ }
+
+ return xfs_alloc_rsum_cache(rtg, mp->m_sb.sb_rbmblocks);
}
/*
@@ -1159,73 +1245,30 @@ xfs_rtmount_inodes(
struct xfs_mount *mp)
{
struct xfs_trans *tp;
- struct xfs_sb *sbp = &mp->m_sb;
struct xfs_rtgroup *rtg;
xfs_rgnumber_t rgno;
- unsigned int i;
int error;
error = xfs_trans_alloc_empty(mp, &tp);
if (error)
return error;
- error = xfs_trans_metafile_iget(tp, mp->m_sb.sb_rbmino,
- XFS_METAFILE_RTBITMAP, &mp->m_rbmip);
- if (xfs_metadata_is_sick(error))
- xfs_rt_mark_sick(mp, XFS_SICK_RT_BITMAP);
- if (error)
- goto out_trans;
- ASSERT(mp->m_rbmip != NULL);
-
- error = xfs_rtmount_iread_extents(tp, mp->m_rbmip, XFS_ILOCK_RTBITMAP);
- if (error)
- goto out_rele_bitmap;
-
- error = xfs_trans_metafile_iget(tp, mp->m_sb.sb_rsumino,
- XFS_METAFILE_RTSUMMARY, &mp->m_rsumip);
- if (xfs_metadata_is_sick(error))
- xfs_rt_mark_sick(mp, XFS_SICK_RT_SUMMARY);
- if (error)
- goto out_rele_bitmap;
- ASSERT(mp->m_rsumip != NULL);
-
- error = xfs_rtmount_iread_extents(tp, mp->m_rsumip, XFS_ILOCK_RTSUM);
- if (error)
- goto out_rele_summary;
-
if (xfs_has_rtgroups(mp) && mp->m_sb.sb_rgcount > 0) {
error = xfs_rtginode_load_parent(tp);
if (error)
- goto out_rele_rtdir;
+ goto out_cancel;
}
for_each_rtgroup(mp, rgno, rtg) {
- rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
-
- for (i = 0; i < XFS_RTGI_MAX; i++) {
- error = xfs_rtginode_load(rtg, i, tp);
- if (error) {
- xfs_rtgroup_rele(rtg);
- goto out_rele_inodes;
- }
+ error = xfs_rtmount_rtg(mp, tp, rtg);
+ if (error) {
+ xfs_rtgroup_rele(rtg);
+ xfs_rtunmount_inodes(mp);
+ break;
}
}
- error = xfs_alloc_rsum_cache(mp, sbp->sb_rbmblocks);
- if (error)
- goto out_rele_summary;
- xfs_trans_cancel(tp);
- return 0;
-
-out_rele_inodes:
- xfs_rtgroup_unmount_inodes(mp);
-out_rele_rtdir:
- xfs_rtginode_irele(&mp->m_rtdirip);
-out_rele_summary:
- xfs_irele(mp->m_rsumip);
-out_rele_bitmap:
- xfs_irele(mp->m_rbmip);
-out_trans:
+out_cancel:
xfs_trans_cancel(tp);
return error;
}
@@ -1234,14 +1277,12 @@ void
xfs_rtunmount_inodes(
struct xfs_mount *mp)
{
- kvfree(mp->m_rsum_cache);
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
- xfs_rtgroup_unmount_inodes(mp);
+ for_each_rtgroup(mp, rgno, rtg)
+ xfs_rtunmount_rtg(rtg);
xfs_rtginode_irele(&mp->m_rtdirip);
- if (mp->m_rbmip)
- xfs_irele(mp->m_rbmip);
- if (mp->m_rsumip)
- xfs_irele(mp->m_rsumip);
}
/*
@@ -1253,28 +1294,29 @@ xfs_rtunmount_inodes(
*/
static xfs_rtxnum_t
xfs_rtpick_extent(
- xfs_mount_t *mp, /* file system mount point */
- xfs_trans_t *tp, /* transaction pointer */
+ struct xfs_rtgroup *rtg,
+ struct xfs_trans *tp,
xfs_rtxlen_t len) /* allocation length (rtextents) */
{
- xfs_rtxnum_t b; /* result rtext */
+ struct xfs_mount *mp = rtg->rtg_mount;
+ struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
+ xfs_rtxnum_t b = 0; /* result rtext */
int log2; /* log of sequence number */
uint64_t resid; /* residual after log removed */
uint64_t seq; /* sequence number of file creation */
struct timespec64 ts; /* timespec in inode */
- xfs_assert_ilocked(mp->m_rbmip, XFS_ILOCK_EXCL);
+ xfs_assert_ilocked(rbmip, XFS_ILOCK_EXCL);
- ts = inode_get_atime(VFS_I(mp->m_rbmip));
- if (!(mp->m_rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
- mp->m_rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
+ ts = inode_get_atime(VFS_I(rbmip));
+ if (!(rbmip->i_diflags & XFS_DIFLAG_NEWRTBM)) {
+ rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
seq = 0;
} else {
seq = ts.tv_sec;
}
- if ((log2 = xfs_highbit64(seq)) == -1)
- b = 0;
- else {
+ log2 = xfs_highbit64(seq);
+ if (log2 != -1) {
resid = seq - (1ULL << log2);
b = (mp->m_sb.sb_rextents * ((resid << 1) + 1ULL)) >>
(log2 + 1);
@@ -1284,8 +1326,8 @@ xfs_rtpick_extent(
b = mp->m_sb.sb_rextents - len;
}
ts.tv_sec = seq + 1;
- inode_set_atime_to_ts(VFS_I(mp->m_rbmip), ts);
- xfs_trans_log_inode(tp, mp->m_rbmip, XFS_ILOG_CORE);
+ inode_set_atime_to_ts(VFS_I(rbmip), ts);
+ xfs_trans_log_inode(tp, rbmip, XFS_ILOG_CORE);
return b;
}
@@ -1340,12 +1382,16 @@ xfs_rtallocate(
xfs_rtxlen_t len = 0;
int error = 0;
+ args.rtg = xfs_rtgroup_grab(args.mp, 0);
+ if (!args.rtg)
+ return -ENOSPC;
+
/*
* Lock out modifications to both the RT bitmap and summary inodes.
*/
if (!*rtlocked) {
- xfs_rtbitmap_lock(args.mp);
- xfs_rtbitmap_trans_join(tp);
+ xfs_rtgroup_lock(args.rtg, XFS_RTGLOCK_BITMAP);
+ xfs_rtgroup_trans_join(tp, args.rtg, XFS_RTGLOCK_BITMAP);
*rtlocked = true;
}
@@ -1356,7 +1402,7 @@ xfs_rtallocate(
if (bno_hint)
start = xfs_rtb_to_rtx(args.mp, bno_hint);
else if (initial_user_data)
- start = xfs_rtpick_extent(args.mp, tp, maxlen);
+ start = xfs_rtpick_extent(args.rtg, tp, maxlen);
if (start) {
error = xfs_rtallocate_extent_near(&args, start, minlen, maxlen,
@@ -1390,6 +1436,7 @@ xfs_rtallocate(
*blen = xfs_rtxlen_to_extlen(args.mp, len);
out_release:
+ xfs_rtgroup_rele(args.rtg);
xfs_rtbuf_cache_relse(&args);
return error;
}
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 17/24] xfs: remove XFS_ILOCK_RT*
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (15 preceding siblings ...)
2024-08-23 0:18 ` [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup Darrick J. Wong
@ 2024-08-23 0:19 ` Darrick J. Wong
2024-08-23 5:04 ` Christoph Hellwig
2024-08-23 0:19 ` [PATCH 18/24] xfs: calculate RT bitmap and summary blocks based on sb_rextents Darrick J. Wong
` (6 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:19 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Now that we've centralized the realtime metadata locking routines, get
rid of the RTBITMAP and RTSUM ILOCK subclasses, since we now use explicit
lockdep classes.
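For context, the ILOCK lockdep subclass lives in the top byte of the lock mode; removing the RTBITMAP (6) and RTSUM (7) values leaves only subclasses 0-5 in use. A standalone sketch of the encoding, using the defines from fs/xfs/xfs_inode.h (the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* subclass encoding as in fs/xfs/xfs_inode.h */
#define XFS_ILOCK_SHIFT		24
#define XFS_ILOCK_PARENT_VAL	5u
#define XFS_ILOCK_DEP_MASK	0xff000000u
#define XFS_ILOCK_PARENT	(XFS_ILOCK_PARENT_VAL << XFS_ILOCK_SHIFT)

/* recover the lockdep subclass number from a lock mode */
static uint32_t ilock_subclass(uint32_t lock_mode)
{
	return (lock_mode & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT;
}
```

With only six values in use, the former RTBITMAP/RTSUM slots (6 and 7) become free subclass numbers again.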
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_inode.c | 3 +--
fs/xfs/xfs_inode.h | 13 ++++---------
fs/xfs/xfs_rtalloc.c | 9 ++++-----
3 files changed, 9 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index fff3037e67574..4ae628fe7d877 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -342,8 +342,7 @@ xfs_lock_inumorder(
{
uint class = 0;
- ASSERT(!(lock_mode & (XFS_ILOCK_PARENT | XFS_ILOCK_RTBITMAP |
- XFS_ILOCK_RTSUM)));
+ ASSERT(!(lock_mode & XFS_ILOCK_PARENT));
ASSERT(xfs_lockdep_subclass_ok(subclass));
if (lock_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 54d995740b328..7c35511d0e471 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -443,9 +443,8 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
* However, MAX_LOCKDEP_SUBCLASSES == 8, which means we are greatly
* limited to the subclasses we can represent via nesting. We need at least
* 5 inodes nest depth for the ILOCK through rename, and we also have to support
- * XFS_ILOCK_PARENT, which gives 6 subclasses. Then we have XFS_ILOCK_RTBITMAP
- * and XFS_ILOCK_RTSUM, which are another 2 unique subclasses, so that's all
- * 8 subclasses supported by lockdep.
+ * XFS_ILOCK_PARENT, which gives 6 subclasses. That's 6 of the 8 subclasses
+ * supported by lockdep.
*
* This also means we have to number the sub-classes in the lowest bits of
* the mask we keep, and we have to ensure we never exceed 3 bits of lockdep
@@ -471,8 +470,8 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
* ILOCK values
* 0-4 subclass values
* 5 PARENT subclass (not nestable)
- * 6 RTBITMAP subclass (not nestable)
- * 7 RTSUM subclass (not nestable)
+ * 6 unused
+ * 7 unused
*
*/
#define XFS_IOLOCK_SHIFT 16
@@ -487,12 +486,8 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
#define XFS_ILOCK_SHIFT 24
#define XFS_ILOCK_PARENT_VAL 5u
#define XFS_ILOCK_MAX_SUBCLASS (XFS_ILOCK_PARENT_VAL - 1)
-#define XFS_ILOCK_RTBITMAP_VAL 6u
-#define XFS_ILOCK_RTSUM_VAL 7u
#define XFS_ILOCK_DEP_MASK 0xff000000u
#define XFS_ILOCK_PARENT (XFS_ILOCK_PARENT_VAL << XFS_ILOCK_SHIFT)
-#define XFS_ILOCK_RTBITMAP (XFS_ILOCK_RTBITMAP_VAL << XFS_ILOCK_SHIFT)
-#define XFS_ILOCK_RTSUM (XFS_ILOCK_RTSUM_VAL << XFS_ILOCK_SHIFT)
#define XFS_LOCK_SUBCLASS_MASK (XFS_IOLOCK_DEP_MASK | \
XFS_MMAPLOCK_DEP_MASK | \
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index f63228b3dd9a2..2a694ad8ead2c 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1188,12 +1188,11 @@ xfs_rtalloc_reinit_frextents(
static inline int
xfs_rtmount_iread_extents(
struct xfs_trans *tp,
- struct xfs_inode *ip,
- unsigned int lock_class)
+ struct xfs_inode *ip)
{
int error;
- xfs_ilock(ip, XFS_ILOCK_EXCL | lock_class);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
if (error)
@@ -1206,7 +1205,7 @@ xfs_rtmount_iread_extents(
}
out_unlock:
- xfs_iunlock(ip, XFS_ILOCK_EXCL | lock_class);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
return error;
}
@@ -1227,7 +1226,7 @@ xfs_rtmount_rtg(
if (rtg->rtg_inodes[i]) {
error = xfs_rtmount_iread_extents(tp,
- rtg->rtg_inodes[i], 0);
+ rtg->rtg_inodes[i]);
if (error)
return error;
}
* [PATCH 18/24] xfs: calculate RT bitmap and summary blocks based on sb_rextents
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (16 preceding siblings ...)
2024-08-23 0:19 ` [PATCH 17/24] xfs: remove XFS_ILOCK_RT* Darrick J. Wong
@ 2024-08-23 0:19 ` Darrick J. Wong
2024-08-23 0:19 ` [PATCH 19/24] xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper Darrick J. Wong
` (5 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:19 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Use the on-disk sb_rextents value to calculate the bitmap and summary
block counts instead of the recomputed extent count, so that we can
refactor the helpers that calculate them.
As the RT bitmap and summary scrubbers already check that sb_rextents
matches the block count, this does not change the scrubbers' coverage.
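For reference, the bitmap block count being checked here is just the rt extent count packed one bit per extent into filesystem blocks, rounded up. A minimal standalone sketch of that geometry (helper name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define NBBY 8	/* bits per byte */

/*
 * Number of rtbitmap blocks needed for a given extent count: one bit per
 * rt extent, rounded up to whole filesystem blocks.  Mirrors the math in
 * xfs_rtbitmap_blockcount().
 */
static uint64_t rtbitmap_blockcount(uint64_t rextents, uint32_t blocksize)
{
	uint64_t bits_per_block = (uint64_t)NBBY * blocksize;

	return (rextents + bits_per_block - 1) / bits_per_block;
}
```

With 4096-byte blocks each bitmap block tracks 32768 extents, so the count only grows at those boundaries.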
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/rtbitmap.c | 3 ++-
fs/xfs/scrub/rtsummary.c | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 6551b4374b89f..4a3e9d0302b51 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -67,7 +67,8 @@ xchk_setup_rtbitmap(
if (mp->m_sb.sb_rblocks) {
rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
rtb->rextslog = xfs_compute_rextslog(rtb->rextents);
- rtb->rbmblocks = xfs_rtbitmap_blockcount(mp, rtb->rextents);
+ rtb->rbmblocks = xfs_rtbitmap_blockcount(mp,
+ mp->m_sb.sb_rextents);
}
return 0;
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 43d509422053c..a756fb2c4abf8 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -105,9 +105,10 @@ xchk_setup_rtsummary(
int rextslog;
rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
- rextslog = xfs_compute_rextslog(rts->rextents);
+ rextslog = xfs_compute_rextslog(mp->m_sb.sb_rextents);
rts->rsumlevels = rextslog + 1;
- rts->rbmblocks = xfs_rtbitmap_blockcount(mp, rts->rextents);
+ rts->rbmblocks = xfs_rtbitmap_blockcount(mp,
+ mp->m_sb.sb_rextents);
rts->rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
rts->rbmblocks);
}
* [PATCH 19/24] xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (17 preceding siblings ...)
2024-08-23 0:19 ` [PATCH 18/24] xfs: calculate RT bitmap and summary blocks based on sb_rextents Darrick J. Wong
@ 2024-08-23 0:19 ` Darrick J. Wong
2024-08-23 0:19 ` [PATCH 20/24] xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks Darrick J. Wong
` (4 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:19 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Split the code that sets up a fake mount structure for calculating new RT
geometry out of xfs_growfs_rt_bmblock so that it can be reused.
Note that this changes the rbmblocks calculation to be based on the
passed-in rblocks and rextsize rather than on the explicitly passed
value, but both methods always produce the same result. The new
version just does a little more math while being more general.
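The per-round size cap in xfs_growfs_rt_bmblock comes from how much rt space one more bitmap block can describe: each bitmap block holds NBBY * blocksize extent bits, and each extent covers rextsize blocks. A hedged standalone sketch of that step calculation (not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define NBBY 8	/* bits per byte */

/*
 * rt blocks describable once (bmbno + 1) bitmap blocks exist: each
 * bitmap block tracks NBBY * blocksize rt extents, each of which is
 * rextsize blocks long.  Mirrors the nrblocks_step expression in
 * xfs_growfs_rt_bmblock().
 */
static uint64_t nrblocks_step(uint32_t bmbno, uint32_t blocksize,
		uint32_t rextsize)
{
	return (uint64_t)(bmbno + 1) * NBBY * blocksize * rextsize;
}
```

Growfs thus expands the rt volume one bitmap block's worth of extents at a time, capping each round at min(nrblocks, nrblocks_step).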
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 52 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 36 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 2a694ad8ead2c..71e650b6c4253 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -734,6 +734,36 @@ xfs_rtginode_ensure(
return xfs_rtginode_create(rtg, type, true);
}
+static struct xfs_mount *
+xfs_growfs_rt_alloc_fake_mount(
+ const struct xfs_mount *mp,
+ xfs_rfsblock_t rblocks,
+ xfs_agblock_t rextsize)
+{
+ struct xfs_mount *nmp;
+
+ nmp = kmemdup(mp, sizeof(*mp), GFP_KERNEL);
+ if (!nmp)
+ return NULL;
+ nmp->m_sb.sb_rextsize = rextsize;
+ xfs_mount_sb_set_rextsize(nmp, &nmp->m_sb);
+ nmp->m_sb.sb_rblocks = rblocks;
+ nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
+ nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp,
+ nmp->m_sb.sb_rextents);
+ nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
+ nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
+ nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, nmp->m_rsumlevels,
+ nmp->m_sb.sb_rbmblocks);
+
+ if (rblocks > 0)
+ nmp->m_features |= XFS_FEAT_REALTIME;
+
+ /* recompute growfsrt reservation from new rsumsize */
+ xfs_trans_resv_calc(nmp, &nmp->m_resv);
+ return nmp;
+}
+
static int
xfs_growfs_rt_bmblock(
struct xfs_rtgroup *rtg,
@@ -756,25 +786,15 @@ xfs_growfs_rt_bmblock(
xfs_rtbxlen_t freed_rtx;
int error;
-
- nrblocks_step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
-
- nmp = nargs.mp = kmemdup(mp, sizeof(*mp), GFP_KERNEL);
- if (!nmp)
- return -ENOMEM;
-
/*
* Calculate new sb and mount fields for this round.
*/
- nmp->m_sb.sb_rextsize = rextsize;
- xfs_mount_sb_set_rextsize(nmp, &nmp->m_sb);
- nmp->m_sb.sb_rbmblocks = bmbno + 1;
- nmp->m_sb.sb_rblocks = min(nrblocks, nrblocks_step);
- nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
- nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
- nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
- nmp->m_rsumblocks = xfs_rtsummary_blockcount(mp, nmp->m_rsumlevels,
- nmp->m_sb.sb_rbmblocks);
+ nrblocks_step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
+ nmp = nargs.mp = xfs_growfs_rt_alloc_fake_mount(mp,
+ min(nrblocks, nrblocks_step), rextsize);
+ if (!nmp)
+ return -ENOMEM;
+
rtg->rtg_extents = xfs_rtgroup_extents(nmp, rtg->rtg_rgno);
/*
* [PATCH 20/24] xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (18 preceding siblings ...)
2024-08-23 0:19 ` [PATCH 19/24] xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper Darrick J. Wong
@ 2024-08-23 0:19 ` Darrick J. Wong
2024-08-23 0:20 ` [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper Darrick J. Wong
` (3 subsequent siblings)
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:19 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Use xfs_growfs_rt_alloc_fake_mount instead of manually recalculating
the RT bitmap geometry.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 71e650b6c4253..61231b1dc4b79 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -935,10 +935,10 @@ xfs_growfs_rt_alloc_blocks(
struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
struct xfs_inode *rsumip = rtg->rtg_inodes[XFS_RTGI_SUMMARY];
- xfs_rtxnum_t nrextents = div_u64(nrblocks, rextsize);
xfs_extlen_t orbmblocks;
xfs_extlen_t orsumblocks;
xfs_extlen_t nrsumblocks;
+ struct xfs_mount *nmp;
int error;
/*
@@ -948,9 +948,13 @@ xfs_growfs_rt_alloc_blocks(
orbmblocks = XFS_B_TO_FSB(mp, rbmip->i_disk_size);
orsumblocks = XFS_B_TO_FSB(mp, rsumip->i_disk_size);
- *nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
- nrsumblocks = xfs_rtsummary_blockcount(mp,
- xfs_compute_rextslog(nrextents) + 1, *nrbmblocks);
+ nmp = xfs_growfs_rt_alloc_fake_mount(mp, nrblocks, rextsize);
+ if (!nmp)
+ return -ENOMEM;
+
+ *nrbmblocks = nmp->m_sb.sb_rbmblocks;
+ nrsumblocks = nmp->m_rsumblocks;
+ kfree(nmp);
error = xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_BITMAP, orbmblocks,
*nrbmblocks, NULL);
* [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (19 preceding siblings ...)
2024-08-23 0:19 ` [PATCH 20/24] xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks Darrick J. Wong
@ 2024-08-23 0:20 ` Darrick J. Wong
2024-08-26 2:06 ` Dave Chinner
2024-08-23 0:20 ` [PATCH 22/24] xfs: refactor xfs_rtbitmap_blockcount Darrick J. Wong
` (2 subsequent siblings)
23 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:20 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Split the check that the rtsummary fits into the log into a separate
helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
geometry.
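The "[djwong: avoid division for the 0-rtx growfs check]" note below refers to replacing a div_u64-based zero-extent test with a plain comparison. A small sketch showing why the two forms are equivalent for any positive extent size (helper names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* old form: compute the new extent count and reject zero */
static bool rt_too_small_div(uint64_t newblocks, uint32_t extsize)
{
	return newblocks / extsize == 0;
}

/* new division-free form: fewer than extsize blocks means zero extents */
static bool rt_too_small_cmp(uint64_t newblocks, uint32_t extsize)
{
	return newblocks < extsize;
}
```

For extsize > 0, the quotient is zero exactly when newblocks < extsize, so the comparison form rejects the same inputs without a 64-bit division.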
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: avoid division for the 0-rtx growfs check]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
1 file changed, 29 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 61231b1dc4b79..78a3879ad6193 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
return error;
}
+static int
+xfs_growfs_check_rtgeom(
+ const struct xfs_mount *mp,
+ xfs_rfsblock_t rblocks,
+ xfs_extlen_t rextsize)
+{
+ struct xfs_mount *nmp;
+ int error = 0;
+
+ nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
+ if (!nmp)
+ return -ENOMEM;
+
+ /*
+ * New summary size can't be more than half the size of the log. This
+ * prevents us from getting a log overflow, since we'll log basically
+ * the whole summary file at once.
+ */
+ if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
+ error = -EINVAL;
+
+ kfree(nmp);
+ return error;
+}
+
/*
* Grow the realtime area of the filesystem.
*/
@@ -1031,9 +1056,6 @@ xfs_growfs_rt(
xfs_mount_t *mp, /* mount point for filesystem */
xfs_growfs_rt_t *in) /* growfs rt input struct */
{
- xfs_rtxnum_t nrextents;
- xfs_extlen_t nrbmblocks;
- xfs_extlen_t nrsumblocks;
struct xfs_buf *bp;
xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
int error;
@@ -1082,20 +1104,13 @@ xfs_growfs_rt(
/*
* Calculate new parameters. These are the final values to be reached.
*/
- nrextents = div_u64(in->newblocks, in->extsize);
error = -EINVAL;
- if (nrextents == 0)
+ if (in->newblocks < in->extsize)
goto out_unlock;
- nrbmblocks = xfs_rtbitmap_blockcount(mp, nrextents);
- nrsumblocks = xfs_rtsummary_blockcount(mp,
- xfs_compute_rextslog(nrextents) + 1, nrbmblocks);
- /*
- * New summary size can't be more than half the size of
- * the log. This prevents us from getting a log overflow,
- * since we'll log basically the whole summary file at once.
- */
- if (nrsumblocks > (mp->m_sb.sb_logblocks >> 1))
+ /* Make sure the new fs size won't cause problems with the log. */
+ error = xfs_growfs_check_rtgeom(mp, in->newblocks, in->extsize);
+ if (error)
goto out_unlock;
error = xfs_growfs_rtg(mp, in->newblocks, in->extsize);
* [PATCH 22/24] xfs: refactor xfs_rtbitmap_blockcount
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (20 preceding siblings ...)
2024-08-23 0:20 ` [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper Darrick J. Wong
@ 2024-08-23 0:20 ` Darrick J. Wong
2024-08-23 0:20 ` [PATCH 23/24] xfs: refactor xfs_rtsummary_blockcount Darrick J. Wong
2024-08-23 0:20 ` [PATCH 24/24] xfs: make RT extent numbers relative to the rtgroup Darrick J. Wong
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:20 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Rename the existing xfs_rtbitmap_blockcount to
xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper
around it that takes the number of extents from the mount structure.
This will simplify the move to per-rtgroup bitmaps as those will need to
pass in the number of extents per rtgroup instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 12 +++++++++++-
fs/xfs/libxfs/xfs_rtbitmap.h | 7 ++++---
fs/xfs/libxfs/xfs_trans_resv.c | 2 +-
fs/xfs/scrub/rtbitmap.c | 3 +--
fs/xfs/scrub/rtsummary.c | 7 ++-----
fs/xfs/xfs_rtalloc.c | 3 +--
6 files changed, 20 insertions(+), 14 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 41de2f071934f..ea89503213c62 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1149,13 +1149,23 @@ xfs_rtalloc_extent_is_free(
* extents.
*/
xfs_filblks_t
-xfs_rtbitmap_blockcount(
+xfs_rtbitmap_blockcount_len(
struct xfs_mount *mp,
xfs_rtbxlen_t rtextents)
{
return howmany_64(rtextents, NBBY * mp->m_sb.sb_blocksize);
}
+/*
+ * Compute the number of rtbitmap blocks used for a given file system.
+ */
+xfs_filblks_t
+xfs_rtbitmap_blockcount(
+ struct xfs_mount *mp)
+{
+ return xfs_rtbitmap_blockcount_len(mp, mp->m_sb.sb_rextents);
+}
+
/* Compute the number of rtsummary blocks needed to track the given rt space. */
xfs_filblks_t
xfs_rtsummary_blockcount(
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index e4994a3e461d3..58672863053a9 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -307,8 +307,9 @@ int xfs_rtfree_extent(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
int xfs_rtfree_blocks(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
xfs_fsblock_t rtbno, xfs_filblks_t rtlen);
-xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t
- rtextents);
+xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp);
+xfs_filblks_t xfs_rtbitmap_blockcount_len(struct xfs_mount *mp,
+ xfs_rtbxlen_t rtextents);
xfs_filblks_t xfs_rtsummary_blockcount(struct xfs_mount *mp,
unsigned int rsumlevels, xfs_extlen_t rbmblocks);
@@ -336,7 +337,7 @@ static inline int xfs_rtfree_blocks(struct xfs_trans *tp,
# define xfs_rtbuf_cache_relse(a) (0)
# define xfs_rtalloc_extent_is_free(m,t,s,l,i) (-ENOSYS)
static inline xfs_filblks_t
-xfs_rtbitmap_blockcount(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
+xfs_rtbitmap_blockcount_len(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
{
/* shut up gcc */
return 0;
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 2e6d7bb3b5a2f..5050fbcc37b75 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -224,7 +224,7 @@ xfs_rtalloc_block_count(
xfs_rtxlen_t rtxlen;
rtxlen = xfs_extlen_to_rtxlen(mp, XFS_MAX_BMBT_EXTLEN);
- rtbmp_blocks = xfs_rtbitmap_blockcount(mp, rtxlen);
+ rtbmp_blocks = xfs_rtbitmap_blockcount_len(mp, rtxlen);
return (rtbmp_blocks + 1) * num_ops;
}
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 4a3e9d0302b51..3f090c3e3d11e 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -67,8 +67,7 @@ xchk_setup_rtbitmap(
if (mp->m_sb.sb_rblocks) {
rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
rtb->rextslog = xfs_compute_rextslog(rtb->rextents);
- rtb->rbmblocks = xfs_rtbitmap_blockcount(mp,
- mp->m_sb.sb_rextents);
+ rtb->rbmblocks = xfs_rtbitmap_blockcount(mp);
}
return 0;
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index a756fb2c4abf8..e96aa24d89f62 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -107,8 +107,7 @@ xchk_setup_rtsummary(
rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
rextslog = xfs_compute_rextslog(mp->m_sb.sb_rextents);
rts->rsumlevels = rextslog + 1;
- rts->rbmblocks = xfs_rtbitmap_blockcount(mp,
- mp->m_sb.sb_rextents);
+ rts->rbmblocks = xfs_rtbitmap_blockcount(mp);
rts->rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
rts->rbmblocks);
}
@@ -215,11 +214,9 @@ xchk_rtsum_compute(
{
struct xfs_mount *mp = sc->mp;
struct xfs_rtgroup *rtg = sc->sr.rtg;
- unsigned long long rtbmp_blocks;
/* If the bitmap size doesn't match the computed size, bail. */
- rtbmp_blocks = xfs_rtbitmap_blockcount(mp, mp->m_sb.sb_rextents);
- if (XFS_FSB_TO_B(mp, rtbmp_blocks) !=
+ if (XFS_FSB_TO_B(mp, xfs_rtbitmap_blockcount(mp)) !=
rtg->rtg_inodes[XFS_RTGI_BITMAP]->i_disk_size)
return -EFSCORRUPTED;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 78a3879ad6193..fc35cdf856194 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -749,8 +749,7 @@ xfs_growfs_rt_alloc_fake_mount(
xfs_mount_sb_set_rextsize(nmp, &nmp->m_sb);
nmp->m_sb.sb_rblocks = rblocks;
nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
- nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp,
- nmp->m_sb.sb_rextents);
+ nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp);
nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, nmp->m_rsumlevels,
* [PATCH 23/24] xfs: refactor xfs_rtsummary_blockcount
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (21 preceding siblings ...)
2024-08-23 0:20 ` [PATCH 22/24] xfs: refactor xfs_rtbitmap_blockcount Darrick J. Wong
@ 2024-08-23 0:20 ` Darrick J. Wong
2024-08-23 0:20 ` [PATCH 24/24] xfs: make RT extent numbers relative to the rtgroup Darrick J. Wong
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:20 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Make xfs_rtsummary_blockcount take all the required information from
the mount structure and return the number of summary levels from it
as well. This cleans up many of the callers and prepares for making the
rtsummary files per-rtgroup, where they will need to look at different
values.
This means we recalculate some values in some callers, but as all these
calculations are outside the fast path and are cheap, that seems like a
price worth paying.
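Putting the pieces together, the refactored helper derives rsumlevels from the extent count (rextslog + 1) and sizes the summary file as one 32-bit word per (level, bitmap block) pair, rounded up to whole blocks. A hedged standalone sketch of that combined geometry (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define NBBY 8		/* bits per byte */
#define XFS_WORDLOG 2	/* log2 of the 4-byte rtsummary word size */

/* floor(log2(v)); -1 for v == 0, like xfs_highbit64() */
static int highbit64(uint64_t v)
{
	int bit = -1;

	while (v) {
		v >>= 1;
		bit++;
	}
	return bit;
}

/* summary levels: rextslog + 1 */
static unsigned int rtsummary_levels(uint64_t rextents)
{
	return highbit64(rextents) + 1;
}

/*
 * Size of the rtsummary file in filesystem blocks: one word per summary
 * level per bitmap block, rounded up to whole blocks.  Illustrative
 * reimplementation of the combined math in the refactored
 * xfs_rtsummary_blockcount().
 */
static uint64_t rtsummary_blockcount(uint64_t rextents, uint32_t blocksize)
{
	uint64_t bits_per_block = (uint64_t)NBBY * blocksize;
	uint64_t rbmblocks = (rextents + bits_per_block - 1) / bits_per_block;
	uint64_t rsumwords = rbmblocks * rtsummary_levels(rextents);

	return ((rsumwords << XFS_WORDLOG) + blocksize - 1) / blocksize;
}
```

Even a million extents need only a single 4k summary block (20 levels × 31 bitmap blocks × 4 bytes ≈ 2.4 KiB), which is why logging the whole file at once is feasible.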
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 13 +++++++++----
fs/xfs/libxfs/xfs_rtbitmap.h | 3 +--
fs/xfs/scrub/rtsummary.c | 8 ++------
fs/xfs/xfs_mount.h | 2 +-
fs/xfs/xfs_rtalloc.c | 13 ++++---------
5 files changed, 17 insertions(+), 22 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index ea89503213c62..7a848cacd561d 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -20,6 +20,7 @@
#include "xfs_error.h"
#include "xfs_rtbitmap.h"
#include "xfs_health.h"
+#include "xfs_sb.h"
/*
* Realtime allocator bitmap functions shared with userspace.
@@ -1166,16 +1167,20 @@ xfs_rtbitmap_blockcount(
return xfs_rtbitmap_blockcount_len(mp, mp->m_sb.sb_rextents);
}
-/* Compute the number of rtsummary blocks needed to track the given rt space. */
+/*
+ * Compute the geometry of the rtsummary file needed to track the given rt
+ * space.
+ */
xfs_filblks_t
xfs_rtsummary_blockcount(
struct xfs_mount *mp,
- unsigned int rsumlevels,
- xfs_extlen_t rbmblocks)
+ unsigned int *rsumlevels)
{
unsigned long long rsumwords;
- rsumwords = (unsigned long long)rsumlevels * rbmblocks;
+ *rsumlevels = xfs_compute_rextslog(mp->m_sb.sb_rextents) + 1;
+
+ rsumwords = xfs_rtbitmap_blockcount(mp) * (*rsumlevels);
return XFS_B_TO_FSB(mp, rsumwords << XFS_WORDLOG);
}
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 58672863053a9..776cca9e41bf0 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -311,7 +311,7 @@ xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp);
xfs_filblks_t xfs_rtbitmap_blockcount_len(struct xfs_mount *mp,
xfs_rtbxlen_t rtextents);
xfs_filblks_t xfs_rtsummary_blockcount(struct xfs_mount *mp,
- unsigned int rsumlevels, xfs_extlen_t rbmblocks);
+ unsigned int *rsumlevels);
int xfs_rtfile_initialize_blocks(struct xfs_rtgroup *rtg,
enum xfs_rtg_inodes type, xfs_fileoff_t offset_fsb,
@@ -342,7 +342,6 @@ xfs_rtbitmap_blockcount_len(struct xfs_mount *mp, xfs_rtbxlen_t rtextents)
/* shut up gcc */
return 0;
}
-# define xfs_rtsummary_blockcount(mp, l, b) (0)
#endif /* CONFIG_XFS_RT */
#endif /* __XFS_RTBITMAP_H__ */
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index e96aa24d89f62..3e2357f50b9d3 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -102,14 +102,10 @@ xchk_setup_rtsummary(
*/
xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP);
if (mp->m_sb.sb_rblocks) {
- int rextslog;
-
rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
- rextslog = xfs_compute_rextslog(mp->m_sb.sb_rextents);
- rts->rsumlevels = rextslog + 1;
rts->rbmblocks = xfs_rtbitmap_blockcount(mp);
- rts->rsumblocks = xfs_rtsummary_blockcount(mp, rts->rsumlevels,
- rts->rbmblocks);
+ rts->rsumblocks =
+ xfs_rtsummary_blockcount(mp, &rts->rsumlevels);
}
return 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2518977150295..137fb5f88307b 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -139,7 +139,7 @@ typedef struct xfs_mount {
uint m_allocsize_blocks; /* min write size blocks */
int m_logbufs; /* number of log buffers */
int m_logbsize; /* size of each log buffer */
- uint m_rsumlevels; /* rt summary levels */
+ unsigned int m_rsumlevels; /* rt summary levels */
xfs_filblks_t m_rsumblocks; /* size of rt summary, FSBs */
uint32_t m_rgblocks; /* size of rtgroup in rtblocks */
int m_fixedfsid[2]; /* unchanged for life of FS */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index fc35cdf856194..28d8cea4f84e3 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -751,9 +751,7 @@ xfs_growfs_rt_alloc_fake_mount(
nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp);
nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
- nmp->m_rsumlevels = nmp->m_sb.sb_rextslog + 1;
- nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, nmp->m_rsumlevels,
- nmp->m_sb.sb_rbmblocks);
+ nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, &nmp->m_rsumlevels);
if (rblocks > 0)
nmp->m_features |= XFS_FEAT_REALTIME;
@@ -1138,21 +1136,18 @@ xfs_rtmount_init(
struct xfs_mount *mp) /* file system mount structure */
{
struct xfs_buf *bp; /* buffer for last block of subvolume */
- struct xfs_sb *sbp; /* filesystem superblock copy in mount */
xfs_daddr_t d; /* address of last block of subvolume */
int error;
- sbp = &mp->m_sb;
- if (sbp->sb_rblocks == 0)
+ if (mp->m_sb.sb_rblocks == 0)
return 0;
if (mp->m_rtdev_targp == NULL) {
xfs_warn(mp,
"Filesystem has a realtime volume, use rtdev=device option");
return -ENODEV;
}
- mp->m_rsumlevels = sbp->sb_rextslog + 1;
- mp->m_rsumblocks = xfs_rtsummary_blockcount(mp, mp->m_rsumlevels,
- mp->m_sb.sb_rbmblocks);
+
+ mp->m_rsumblocks = xfs_rtsummary_blockcount(mp, &mp->m_rsumlevels);
/*
* Check that the realtime section is an ok size.
* [PATCH 24/24] xfs: make RT extent numbers relative to the rtgroup
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
` (22 preceding siblings ...)
2024-08-23 0:20 ` [PATCH 23/24] xfs: refactor xfs_rtsummary_blockcount Darrick J. Wong
@ 2024-08-23 0:20 ` Darrick J. Wong
23 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:20 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t
type encode the RT extent number relative to the rtgroup. The biggest
part of this is clearly distinguishing between the relative extent
number, which gets masked when converting from a global block number,
and length values, which just have a factor applied to them when
converting from file system blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 6 ++--
fs/xfs/libxfs/xfs_rtbitmap.h | 69 ++++++++++++++++++++++++++----------------
fs/xfs/scrub/rtbitmap.c | 9 ++---
fs/xfs/scrub/rtsummary.c | 6 ++--
fs/xfs/xfs_discard.c | 4 +-
fs/xfs/xfs_fsmap.c | 4 +-
fs/xfs/xfs_iomap.c | 4 +-
fs/xfs/xfs_mount.c | 2 +
fs/xfs/xfs_rtalloc.c | 4 +-
fs/xfs/xfs_super.c | 3 +-
10 files changed, 64 insertions(+), 47 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a1ee8dc91d6ba..c056ca8ad6090 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4126,7 +4126,7 @@ xfs_bmapi_reserve_delalloc(
fdblocks = indlen;
if (XFS_IS_REALTIME_INODE(ip)) {
- error = xfs_dec_frextents(mp, xfs_rtb_to_rtx(mp, alen));
+ error = xfs_dec_frextents(mp, xfs_blen_to_rtbxlen(mp, alen));
if (error)
goto out_unreserve_quota;
} else {
@@ -4161,7 +4161,7 @@ xfs_bmapi_reserve_delalloc(
out_unreserve_frextents:
if (XFS_IS_REALTIME_INODE(ip))
- xfs_add_frextents(mp, xfs_rtb_to_rtx(mp, alen));
+ xfs_add_frextents(mp, xfs_blen_to_rtbxlen(mp, alen));
out_unreserve_quota:
if (XFS_IS_QUOTA_ON(mp))
xfs_quota_unreserve_blkres(ip, alen);
@@ -5088,7 +5088,7 @@ xfs_bmap_del_extent_delay(
fdblocks = da_diff;
if (isrt)
- xfs_add_frextents(mp, xfs_rtb_to_rtx(mp, del->br_blockcount));
+ xfs_add_frextents(mp, xfs_blen_to_rtbxlen(mp, del->br_blockcount));
else
fdblocks += del->br_blockcount;
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 776cca9e41bf0..cf21ae31bfaa4 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -22,13 +22,37 @@ struct xfs_rtalloc_args {
static inline xfs_rtblock_t
xfs_rtx_to_rtb(
- struct xfs_mount *mp,
+ struct xfs_rtgroup *rtg,
xfs_rtxnum_t rtx)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+ xfs_rtblock_t start = xfs_rgno_start_rtb(mp, rtg->rtg_rgno);
+
+ if (mp->m_rtxblklog >= 0)
+ return start + (rtx << mp->m_rtxblklog);
+ return start + (rtx * mp->m_sb.sb_rextsize);
+}
+
+/* Convert an rgbno into an rt extent number. */
+static inline xfs_rtxnum_t
+xfs_rgbno_to_rtx(
+ struct xfs_mount *mp,
+ xfs_rgblock_t rgbno)
+{
+ if (likely(mp->m_rtxblklog >= 0))
+ return rgbno >> mp->m_rtxblklog;
+ return rgbno / mp->m_sb.sb_rextsize;
+}
+
+static inline uint64_t
+xfs_rtbxlen_to_blen(
+ struct xfs_mount *mp,
+ xfs_rtbxlen_t rtbxlen)
{
if (mp->m_rtxblklog >= 0)
- return rtx << mp->m_rtxblklog;
+ return rtbxlen << mp->m_rtxblklog;
- return rtx * mp->m_sb.sb_rextsize;
+ return rtbxlen * mp->m_sb.sb_rextsize;
}
static inline xfs_extlen_t
@@ -65,16 +89,29 @@ xfs_extlen_to_rtxlen(
return len / mp->m_sb.sb_rextsize;
}
+/* Convert an rt block count into an rt extent count. */
+static inline xfs_rtbxlen_t
+xfs_blen_to_rtbxlen(
+ struct xfs_mount *mp,
+ uint64_t blen)
+{
+ if (likely(mp->m_rtxblklog >= 0))
+ return blen >> mp->m_rtxblklog;
+
+ return div_u64(blen, mp->m_sb.sb_rextsize);
+}
+
/* Convert an rt block number into an rt extent number. */
static inline xfs_rtxnum_t
xfs_rtb_to_rtx(
struct xfs_mount *mp,
xfs_rtblock_t rtbno)
{
- if (likely(mp->m_rtxblklog >= 0))
- return rtbno >> mp->m_rtxblklog;
+ uint64_t __rgbno = __xfs_rtb_to_rgbno(mp, rtbno);
- return div_u64(rtbno, mp->m_sb.sb_rextsize);
+ if (likely(mp->m_rtxblklog >= 0))
+ return __rgbno >> mp->m_rtxblklog;
+ return div_u64(__rgbno, mp->m_sb.sb_rextsize);
}
/* Return the offset of an rt block number within an rt extent. */
@@ -89,26 +126,6 @@ xfs_rtb_to_rtxoff(
return do_div(rtbno, mp->m_sb.sb_rextsize);
}
-/*
- * Convert an rt block number into an rt extent number, rounding up to the next
- * rt extent if the rt block is not aligned to an rt extent boundary.
- */
-static inline xfs_rtxnum_t
-xfs_rtb_to_rtxup(
- struct xfs_mount *mp,
- xfs_rtblock_t rtbno)
-{
- if (likely(mp->m_rtxblklog >= 0)) {
- if (rtbno & mp->m_rtxblkmask)
- return (rtbno >> mp->m_rtxblklog) + 1;
- return rtbno >> mp->m_rtxblklog;
- }
-
- if (do_div(rtbno, mp->m_sb.sb_rextsize))
- rtbno++;
- return rtbno;
-}
-
/* Round this rtblock up to the nearest rt extent size. */
static inline xfs_rtblock_t
xfs_rtb_roundup_rtx(
diff --git a/fs/xfs/scrub/rtbitmap.c b/fs/xfs/scrub/rtbitmap.c
index 3f090c3e3d11e..17aff4227721e 100644
--- a/fs/xfs/scrub/rtbitmap.c
+++ b/fs/xfs/scrub/rtbitmap.c
@@ -65,7 +65,7 @@ xchk_setup_rtbitmap(
*/
xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP);
if (mp->m_sb.sb_rblocks) {
- rtb->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
+ rtb->rextents = xfs_blen_to_rtbxlen(mp, mp->m_sb.sb_rblocks);
rtb->rextslog = xfs_compute_rextslog(rtb->rextents);
rtb->rbmblocks = xfs_rtbitmap_blockcount(mp);
}
@@ -83,15 +83,14 @@ xchk_rtbitmap_rec(
const struct xfs_rtalloc_rec *rec,
void *priv)
{
- struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_scrub *sc = priv;
xfs_rtblock_t startblock;
xfs_filblks_t blockcount;
- startblock = xfs_rtx_to_rtb(mp, rec->ar_startext);
- blockcount = xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ startblock = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+ blockcount = xfs_rtxlen_to_extlen(rtg->rtg_mount, rec->ar_extcount);
- if (!xfs_verify_rtbext(mp, startblock, blockcount))
+ if (!xfs_verify_rtbext(rtg->rtg_mount, startblock, blockcount))
xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0);
return 0;
}
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 3e2357f50b9d3..1f01ed9450388 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -102,7 +102,7 @@ xchk_setup_rtsummary(
*/
xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP);
if (mp->m_sb.sb_rblocks) {
- rts->rextents = xfs_rtb_to_rtx(mp, mp->m_sb.sb_rblocks);
+ rts->rextents = xfs_blen_to_rtbxlen(mp, mp->m_sb.sb_rblocks);
rts->rbmblocks = xfs_rtbitmap_blockcount(mp);
rts->rsumblocks =
xfs_rtsummary_blockcount(mp, &rts->rsumlevels);
@@ -182,8 +182,8 @@ xchk_rtsum_record_free(
lenlog = xfs_highbit64(rec->ar_extcount);
offs = xfs_rtsumoffs(mp, lenlog, rbmoff);
- rtbno = xfs_rtx_to_rtb(mp, rec->ar_startext);
- rtlen = xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ rtbno = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+ rtlen = xfs_rtxlen_to_extlen(mp, rec->ar_extcount);
if (!xfs_verify_rtbext(mp, rtbno, rtlen)) {
xchk_ino_xref_set_corrupt(sc,
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index b2ef5ebe1f047..e1a024f68a68f 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -526,8 +526,8 @@ xfs_trim_gather_rtextent(
return -ECANCELED;
}
- rbno = xfs_rtx_to_rtb(rtg->rtg_mount, rec->ar_startext);
- rlen = xfs_rtx_to_rtb(rtg->rtg_mount, rec->ar_extcount);
+ rbno = xfs_rtx_to_rtb(rtg, rec->ar_startext);
+ rlen = xfs_rtxlen_to_extlen(rtg->rtg_mount, rec->ar_extcount);
/* Ignore too small. */
if (rlen < tr->minlen_fsb) {
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 0e0ec3f0574b1..6ae929dd65b6e 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -726,9 +726,9 @@ xfs_getfsmap_rtdev_rtbitmap_helper(
struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_getfsmap_info *info = priv;
xfs_rtblock_t start_rtb =
- xfs_rtx_to_rtb(mp, rec->ar_startext);
+ xfs_rtx_to_rtb(rtg, rec->ar_startext);
uint64_t rtbcount =
- xfs_rtx_to_rtb(mp, rec->ar_extcount);
+ xfs_rtbxlen_to_blen(mp, rec->ar_extcount);
struct xfs_rmap_irec irec = {
.rm_startblock = start_rtb,
.rm_blockcount = rtbcount,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 72c981e3dc921..13cabd345e227 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -501,8 +501,8 @@ xfs_iomap_prealloc_size(
alloc_blocks);
if (unlikely(XFS_IS_REALTIME_INODE(ip)))
- freesp = xfs_rtx_to_rtb(mp,
- xfs_iomap_freesp(&mp->m_frextents,
+ freesp = xfs_rtbxlen_to_blen(mp,
+ xfs_iomap_freesp(&mp->m_frextents,
mp->m_low_rtexts, &shift));
else
freesp = xfs_iomap_freesp(&mp->m_fdblocks, mp->m_low_space,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e1e849101cdd4..5726ea597f5a2 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1472,7 +1472,7 @@ xfs_mod_delalloc(
if (XFS_IS_REALTIME_INODE(ip)) {
percpu_counter_add_batch(&mp->m_delalloc_rtextents,
- xfs_rtb_to_rtx(mp, data_delta),
+ xfs_blen_to_rtbxlen(mp, data_delta),
XFS_DELALLOC_BATCH);
if (!ind_delta)
return;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 28d8cea4f84e3..308049f2fb79d 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -748,7 +748,7 @@ xfs_growfs_rt_alloc_fake_mount(
nmp->m_sb.sb_rextsize = rextsize;
xfs_mount_sb_set_rextsize(nmp, &nmp->m_sb);
nmp->m_sb.sb_rblocks = rblocks;
- nmp->m_sb.sb_rextents = xfs_rtb_to_rtx(nmp, nmp->m_sb.sb_rblocks);
+ nmp->m_sb.sb_rextents = xfs_blen_to_rtbxlen(nmp, nmp->m_sb.sb_rblocks);
nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp);
nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, &nmp->m_rsumlevels);
@@ -1464,7 +1464,7 @@ xfs_rtallocate(
xfs_trans_mod_sb(tp, wasdel ?
XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
-(long)len);
- *bno = xfs_rtx_to_rtb(args.mp, rtx);
+ *bno = xfs_rtx_to_rtb(args.rtg, rtx);
*blen = xfs_rtxlen_to_extlen(args.mp, len);
out_release:
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index cee64c1a7d650..2767083612bf6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -885,7 +885,8 @@ xfs_fs_statfs(
statp->f_blocks = sbp->sb_rblocks;
freertx = percpu_counter_sum_positive(&mp->m_frextents);
- statp->f_bavail = statp->f_bfree = xfs_rtx_to_rtb(mp, freertx);
+ statp->f_bavail = statp->f_bfree =
+ xfs_rtbxlen_to_blen(mp, freertx);
}
return 0;
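[Editorial note] The group-relative conversions this patch introduces can be
sketched outside the kernel. This is a simplified model of the
xfs_rtx_to_rtb/xfs_rtb_to_rtx pair under the stated assumptions: rtxblklog is
log2(rextsize) when the extent size is a power of two (enabling the shift fast
path) and -1 otherwise, and rt groups are assumed contiguous and equally sized
so the group offset is a simple modulo. The struct and function names are
illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct rt_geom {
	uint32_t rextsize;	/* fs blocks per rt extent */
	int	 rtxblklog;	/* log2(rextsize), or -1 if not a power of 2 */
	uint64_t rgblocks;	/* fs blocks per rt group */
};

/* group-relative extent number within group rgno -> global rt block number */
static uint64_t rtx_to_rtb(const struct rt_geom *g, uint32_t rgno,
			   uint64_t rtx)
{
	uint64_t start = (uint64_t)rgno * g->rgblocks;

	if (g->rtxblklog >= 0)
		return start + (rtx << g->rtxblklog);	/* power-of-2 fast path */
	return start + rtx * g->rextsize;
}

/* global rt block number -> extent number relative to its group */
static uint64_t rtb_to_rtx(const struct rt_geom *g, uint64_t rtbno)
{
	uint64_t rgbno = rtbno % g->rgblocks;	/* offset within the group */

	if (g->rtxblklog >= 0)
		return rgbno >> g->rtxblklog;
	return rgbno / g->rextsize;
}
```

Note how the divide-based slow path and the shift-based fast path agree
whenever rextsize is a power of two; the kernel keeps both for odd extent
sizes.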
* [PATCH 1/1] iomap: add a merge boundary flag
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
@ 2024-08-23 0:21 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:21 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
File systems might have boundaries over which merges aren't possible.
In fact these are very common, although most of the time some kind of
header at the beginning of the region (e.g. XFS allocation groups, ext4
block groups) automatically creates a merge barrier. But if that is
not present, say for a device purely used for data, we need to manually
communicate the boundary to iomap.
Add an IOMAP_F_BOUNDARY flag so that I/O is never merged into a previous mapping.
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 6 ++++++
include/linux/iomap.h | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc..685136a57cbf7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1603,6 +1603,8 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
{
if (ioend->io_bio.bi_status != next->io_bio.bi_status)
return false;
+ if (next->io_flags & IOMAP_F_BOUNDARY)
+ return false;
if ((ioend->io_flags & IOMAP_F_SHARED) ^
(next->io_flags & IOMAP_F_SHARED))
return false;
@@ -1722,6 +1724,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
INIT_LIST_HEAD(&ioend->io_list);
ioend->io_type = wpc->iomap.type;
ioend->io_flags = wpc->iomap.flags;
+ if (pos > wpc->iomap.offset)
+ wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
ioend->io_inode = inode;
ioend->io_size = 0;
ioend->io_offset = pos;
@@ -1733,6 +1737,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
{
+ if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
+ return false;
if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
(wpc->ioend->io_flags & IOMAP_F_SHARED))
return false;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d1..ba3c9e5124637 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -53,6 +53,9 @@ struct vm_fault;
*
* IOMAP_F_XATTR indicates that the iomap is for an extended attribute extent
* rather than a file data extent.
+ *
+ * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
+ * never be merged with the mapping before it.
*/
#define IOMAP_F_NEW (1U << 0)
#define IOMAP_F_DIRTY (1U << 1)
@@ -64,6 +67,7 @@ struct vm_fault;
#define IOMAP_F_BUFFER_HEAD 0
#endif /* CONFIG_BUFFER_HEAD */
#define IOMAP_F_XATTR (1U << 5)
+#define IOMAP_F_BOUNDARY (1U << 6)
/*
* Flags set by the core iomap code during operations:
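[Editorial note] The merge rule added by this patch can be modelled in a few
lines of standalone C. The flag values mirror include/linux/iomap.h as shown
in the diff above; ioend_can_merge here is an illustrative reduction of
iomap_ioend_can_merge to just the flag checks, not the kernel function itself.

```c
#include <assert.h>
#include <stdbool.h>

#define IOMAP_F_SHARED		(1U << 2)
#define IOMAP_F_BOUNDARY	(1U << 6)

/* May the `next` ioend be merged into the ioend immediately before it? */
static bool ioend_can_merge(unsigned int prev_flags, unsigned int next_flags)
{
	/* a boundary mapping must never merge with the mapping before it */
	if (next_flags & IOMAP_F_BOUNDARY)
		return false;
	/* shared (COW) state must match on both sides of the merge */
	if ((prev_flags ^ next_flags) & IOMAP_F_SHARED)
		return false;
	return true;
}
```

The asymmetry is deliberate: only the *next* mapping's boundary flag matters,
because the flag means "do not merge with what came before me".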
* [PATCH 01/26] xfs: define the format of rt groups
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
@ 2024-08-23 0:21 ` Darrick J. Wong
2024-08-23 5:11 ` Christoph Hellwig
2024-08-23 0:21 ` [PATCH 02/26] xfs: check the realtime superblock at mount time Darrick J. Wong
` (24 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:21 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Define the ondisk format of realtime group metadata, and a superblock
for realtime volumes. rt supers are protected by a separate rocompat
bit so that we can leave them off if the rt device is zoned.
Add an xfs_sb_version_hasrtgroups helper so that xfs_repair knows how
to zero the tail of superblocks.
For rt-group-enabled file systems there is a separate bitmap and summary
file for each group, and thus the number of bitmap and summary blocks
needs to be calculated differently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 50 +++++++++++++++++++--
fs/xfs/libxfs/xfs_ondisk.h | 3 +
fs/xfs/libxfs/xfs_rtbitmap.c | 20 +++++++-
fs/xfs/libxfs/xfs_rtgroup.c | 82 ++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_sb.c | 102 +++++++++++++++++++++++++++++++++++++-----
fs/xfs/libxfs/xfs_shared.h | 1
fs/xfs/xfs_mount.h | 6 ++
fs/xfs/xfs_rtalloc.c | 30 +++++++++++-
8 files changed, 268 insertions(+), 26 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index fa5cfc8265d92..9e351b19bd86e 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -265,8 +265,15 @@ struct xfs_dsb {
uuid_t sb_meta_uuid; /* metadata file system unique id */
__be64 sb_metadirino; /* metadata directory tree root */
+ __be32 sb_rgcount; /* # of realtime groups */
+ __be32 sb_rgextents; /* size of rtgroup in rtx */
- /* must be padded to 64 bit alignment */
+ /*
+ * The size of this structure must be padded to 64 bit alignment.
+ *
+ * NOTE: Don't forget to update secondary_sb_whack in xfs_repair when
+ * adding new fields here.
+ */
};
#define XFS_SB_CRC_OFF offsetof(struct xfs_dsb, sb_crc)
@@ -355,10 +362,10 @@ xfs_sb_has_compat_feature(
return (sbp->sb_features_compat & feature) != 0;
}
-#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
-#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
-#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
-#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
+#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
+#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
+#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
#define XFS_SB_FEAT_RO_COMPAT_ALL \
(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -722,6 +729,39 @@ union xfs_suminfo_raw {
__u32 old;
};
+/*
+ * Realtime allocation groups break the rt section into multiple pieces that
+ * could be locked independently. Realtime block group numbers are 32-bit
+ * quantities. Block numbers within a group are also 32-bit quantities, but
+ * the upper bit must never be set. rtgroup 0 might have a superblock in it,
+ * so the minimum size of an rtgroup is 2 rtx.
+ */
+#define XFS_MAX_RGBLOCKS ((xfs_rgblock_t)(1U << 31) - 1)
+#define XFS_MIN_RGEXTENTS ((xfs_rtxlen_t)2)
+#define XFS_MAX_RGNUMBER ((xfs_rgnumber_t)(-1U))
+
+#define XFS_RTSB_MAGIC 0x46726F67 /* 'Frog' */
+
+/*
+ * Realtime superblock - on disk version. Must be padded to 64 bit alignment.
+ * The first block of the realtime volume contains this superblock.
+ */
+struct xfs_rtsb {
+ __be32 rsb_magicnum; /* magic number == XFS_RTSB_MAGIC */
+ __le32 rsb_crc; /* superblock crc */
+
+ __be32 rsb_pad; /* zero */
+ unsigned char rsb_fname[XFSLABEL_MAX]; /* file system name */
+
+ uuid_t rsb_uuid; /* user-visible file system unique id */
+ uuid_t rsb_meta_uuid; /* metadata file system unique id */
+
+ /* must be padded to 64 bit alignment */
+};
+
+#define XFS_RTSB_CRC_OFF offsetof(struct xfs_rtsb, rsb_crc)
+#define XFS_RTSB_DADDR ((xfs_daddr_t)0) /* daddr in rt section */
+
/*
* XFS Timestamps
* ==============
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index 8bca86e350fdc..38b314113d8f2 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -37,7 +37,7 @@ xfs_check_ondisk_structs(void)
XFS_CHECK_STRUCT_SIZE(struct xfs_dinode, 176);
XFS_CHECK_STRUCT_SIZE(struct xfs_disk_dquot, 104);
XFS_CHECK_STRUCT_SIZE(struct xfs_dqblk, 136);
- XFS_CHECK_STRUCT_SIZE(struct xfs_dsb, 272);
+ XFS_CHECK_STRUCT_SIZE(struct xfs_dsb, 280);
XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr, 56);
XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key, 4);
XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec, 16);
@@ -53,6 +53,7 @@ xfs_check_ondisk_structs(void)
XFS_CHECK_STRUCT_SIZE(xfs_inobt_ptr_t, 4);
XFS_CHECK_STRUCT_SIZE(xfs_refcount_ptr_t, 4);
XFS_CHECK_STRUCT_SIZE(xfs_rmap_ptr_t, 4);
+ XFS_CHECK_STRUCT_SIZE(struct xfs_rtsb, 56);
/* dir/attr trees */
XFS_CHECK_STRUCT_SIZE(struct xfs_attr3_leaf_hdr, 80);
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 7a848cacd561d..330acf1ab39f8 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1157,6 +1157,21 @@ xfs_rtbitmap_blockcount_len(
return howmany_64(rtextents, NBBY * mp->m_sb.sb_blocksize);
}
+/* How many rt extents does each rtbitmap file track? */
+static inline xfs_rtbxlen_t
+xfs_rtbitmap_bitcount(
+ struct xfs_mount *mp)
+{
+ if (!mp->m_sb.sb_rextents)
+ return 0;
+
+ /* rtgroup size can be nonzero even if rextents is zero */
+ if (xfs_has_rtgroups(mp))
+ return mp->m_sb.sb_rgextents;
+
+ return mp->m_sb.sb_rextents;
+}
+
/*
* Compute the number of rtbitmap blocks used for a given file system.
*/
@@ -1164,7 +1179,7 @@ xfs_filblks_t
xfs_rtbitmap_blockcount(
struct xfs_mount *mp)
{
- return xfs_rtbitmap_blockcount_len(mp, mp->m_sb.sb_rextents);
+ return xfs_rtbitmap_blockcount_len(mp, xfs_rtbitmap_bitcount(mp));
}
/*
@@ -1178,8 +1193,7 @@ xfs_rtsummary_blockcount(
{
unsigned long long rsumwords;
- *rsumlevels = xfs_compute_rextslog(mp->m_sb.sb_rextents) + 1;
-
+ *rsumlevels = xfs_compute_rextslog(xfs_rtbitmap_bitcount(mp)) + 1;
rsumwords = xfs_rtbitmap_blockcount(mp) * (*rsumlevels);
return XFS_B_TO_FSB(mp, rsumwords << XFS_WORDLOG);
}
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 4618caf344efd..2a8d5561da9d0 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -28,6 +28,7 @@
#include "xfs_trace.h"
#include "xfs_inode.h"
#include "xfs_icache.h"
+#include "xfs_buf_item.h"
#include "xfs_rtgroup.h"
#include "xfs_rtbitmap.h"
#include "xfs_metafile.h"
@@ -527,3 +528,84 @@ xfs_rtginode_load_parent(
return xfs_metadir_load(tp, mp->m_metadirip, "rtgroups",
XFS_METAFILE_DIR, &mp->m_rtdirip);
}
+
+/* Check superblock fields for a read or a write. */
+static xfs_failaddr_t
+xfs_rtsb_verify_common(
+ struct xfs_buf *bp)
+{
+ struct xfs_rtsb *rsb = bp->b_addr;
+
+ if (!xfs_verify_magic(bp, rsb->rsb_magicnum))
+ return __this_address;
+ if (rsb->rsb_pad)
+ return __this_address;
+
+ /* Everything to the end of the fs block must be zero */
+ if (memchr_inv(rsb + 1, 0, BBTOB(bp->b_length) - sizeof(*rsb)))
+ return __this_address;
+
+ return NULL;
+}
+
+/* Check superblock fields for a read or revalidation. */
+static inline xfs_failaddr_t
+xfs_rtsb_verify_all(
+ struct xfs_buf *bp)
+{
+ struct xfs_rtsb *rsb = bp->b_addr;
+ struct xfs_mount *mp = bp->b_mount;
+ xfs_failaddr_t fa;
+
+ fa = xfs_rtsb_verify_common(bp);
+ if (fa)
+ return fa;
+
+ if (memcmp(&rsb->rsb_fname, &mp->m_sb.sb_fname, XFSLABEL_MAX))
+ return __this_address;
+ if (!uuid_equal(&rsb->rsb_uuid, &mp->m_sb.sb_uuid))
+ return __this_address;
+ if (!uuid_equal(&rsb->rsb_meta_uuid, &mp->m_sb.sb_meta_uuid))
+ return __this_address;
+
+ return NULL;
+}
+
+static void
+xfs_rtsb_read_verify(
+ struct xfs_buf *bp)
+{
+ xfs_failaddr_t fa;
+
+ if (!xfs_buf_verify_cksum(bp, XFS_RTSB_CRC_OFF)) {
+ xfs_verifier_error(bp, -EFSBADCRC, __this_address);
+ return;
+ }
+
+ fa = xfs_rtsb_verify_all(bp);
+ if (fa)
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+}
+
+static void
+xfs_rtsb_write_verify(
+ struct xfs_buf *bp)
+{
+ xfs_failaddr_t fa;
+
+ fa = xfs_rtsb_verify_common(bp);
+ if (fa) {
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+ return;
+ }
+
+ xfs_buf_update_cksum(bp, XFS_RTSB_CRC_OFF);
+}
+
+const struct xfs_buf_ops xfs_rtsb_buf_ops = {
+ .name = "xfs_rtsb",
+ .magic = { 0, cpu_to_be32(XFS_RTSB_MAGIC) },
+ .verify_read = xfs_rtsb_read_verify,
+ .verify_write = xfs_rtsb_write_verify,
+ .verify_struct = xfs_rtsb_verify_all,
+};
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f1cdffb2f3392..e33afd8f3e256 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -234,11 +234,21 @@ xfs_validate_sb_read(
return 0;
}
+/* Return the number of extents covered by a single rt bitmap file */
+static xfs_rtbxlen_t
+xfs_extents_per_rbm(
+ struct xfs_sb *sbp)
+{
+ if (xfs_sb_version_hasmetadir(sbp))
+ return sbp->sb_rgextents;
+ return sbp->sb_rextents;
+}
+
static uint64_t
-xfs_sb_calc_rbmblocks(
+xfs_expected_rbmblocks(
struct xfs_sb *sbp)
{
- return howmany_64(sbp->sb_rextents, NBBY * sbp->sb_blocksize);
+ return howmany_64(xfs_extents_per_rbm(sbp), NBBY * sbp->sb_blocksize);
}
/* Validate the realtime geometry */
@@ -260,7 +270,7 @@ xfs_validate_rt_geometry(
if (sbp->sb_rextents == 0 ||
sbp->sb_rextents != div_u64(sbp->sb_rblocks, sbp->sb_rextsize) ||
sbp->sb_rextslog != xfs_compute_rextslog(sbp->sb_rextents) ||
- sbp->sb_rbmblocks != xfs_sb_calc_rbmblocks(sbp))
+ sbp->sb_rbmblocks != xfs_expected_rbmblocks(sbp))
return false;
return true;
@@ -341,6 +351,62 @@ xfs_validate_sb_write(
return 0;
}
+static int
+xfs_validate_sb_rtgroups(
+ struct xfs_mount *mp,
+ struct xfs_sb *sbp)
+{
+ uint64_t groups;
+
+ if (!sbp->sb_rextents)
+ return 0;
+
+ if (sbp->sb_rextsize == 0) {
+ xfs_warn(mp,
+"Realtime extent size must not be zero.");
+ return -EINVAL;
+ }
+
+ if (sbp->sb_rgextents > XFS_MAX_RGBLOCKS / sbp->sb_rextsize) {
+ xfs_warn(mp,
+"Realtime group size (%u) must be less than %u rt extents.",
+ sbp->sb_rgextents,
+ XFS_MAX_RGBLOCKS / sbp->sb_rextsize);
+ return -EINVAL;
+ }
+
+ if (sbp->sb_rgextents < XFS_MIN_RGEXTENTS) {
+ xfs_warn(mp,
+"Realtime group size (%u) must be at least %u rt extents.",
+ sbp->sb_rgextents, XFS_MIN_RGEXTENTS);
+ return -EINVAL;
+ }
+
+ if (sbp->sb_rgcount > XFS_MAX_RGNUMBER) {
+ xfs_warn(mp,
+"Realtime groups (%u) must be less than %u.",
+ sbp->sb_rgcount, XFS_MAX_RGNUMBER);
+ return -EINVAL;
+ }
+
+ groups = howmany_64(sbp->sb_rextents, sbp->sb_rgextents);
+ if (groups != sbp->sb_rgcount) {
+ xfs_warn(mp,
+"Realtime groups (%u) do not cover the entire rt section; need (%llu) groups.",
+ sbp->sb_rgcount, groups);
+ return -EINVAL;
+ }
+
+ /* Exchange-range is required for fsr to work on realtime files */
+ if (!(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE)) {
+ xfs_warn(mp,
+"Realtime groups feature requires exchange-range support.");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
/* Check the validity of the SB. */
STATIC int
xfs_validate_sb_common(
@@ -352,6 +418,7 @@ xfs_validate_sb_common(
uint32_t agcount = 0;
uint32_t rem;
bool has_dalign;
+ int error;
if (!xfs_verify_magic(bp, dsb->sb_magicnum)) {
xfs_warn(mp,
@@ -401,6 +468,12 @@ xfs_validate_sb_common(
return -EINVAL;
}
}
+
+ if (xfs_sb_version_hasmetadir(sbp)) {
+ error = xfs_validate_sb_rtgroups(mp, sbp);
+ if (error)
+ return error;
+ }
} else if (sbp->sb_qflags & (XFS_PQUOTA_ENFD | XFS_GQUOTA_ENFD |
XFS_PQUOTA_CHKD | XFS_GQUOTA_CHKD)) {
xfs_notice(mp,
@@ -692,13 +765,15 @@ __xfs_sb_from_disk(
if (convert_xquota)
xfs_sb_quota_from_disk(to);
- if (to->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
+ if (to->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR) {
to->sb_metadirino = be64_to_cpu(from->sb_metadirino);
- else
+ to->sb_rgcount = be32_to_cpu(from->sb_rgcount);
+ to->sb_rgextents = be32_to_cpu(from->sb_rgextents);
+ } else {
to->sb_metadirino = NULLFSINO;
-
- to->sb_rgcount = 1;
- to->sb_rgextents = 0;
+ to->sb_rgcount = 1;
+ to->sb_rgextents = 0;
+ }
}
void
@@ -847,8 +922,11 @@ xfs_sb_to_disk(
if (from->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_META_UUID)
uuid_copy(&to->sb_meta_uuid, &from->sb_meta_uuid);
- if (from->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
+ if (from->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR) {
to->sb_metadirino = cpu_to_be64(from->sb_metadirino);
+ to->sb_rgcount = cpu_to_be32(from->sb_rgcount);
+ to->sb_rgextents = cpu_to_be32(from->sb_rgextents);
+ }
}
/*
@@ -986,9 +1064,9 @@ xfs_mount_sb_set_rextsize(
mp->m_rtxblklog = log2_if_power2(sbp->sb_rextsize);
mp->m_rtxblkmask = mask64_if_power2(sbp->sb_rextsize);
- mp->m_rgblocks = 0;
- mp->m_rgblklog = 0;
- mp->m_rgblkmask = 0;
+ mp->m_rgblocks = sbp->sb_rgextents * sbp->sb_rextsize;
+ mp->m_rgblklog = log2_if_power2(mp->m_rgblocks);
+ mp->m_rgblkmask = mask64_if_power2(mp->m_rgblocks);
}
/*
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 2f7413afbf46c..0343926d2a6b4 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -39,6 +39,7 @@ extern const struct xfs_buf_ops xfs_inode_buf_ra_ops;
extern const struct xfs_buf_ops xfs_refcountbt_buf_ops;
extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
extern const struct xfs_buf_ops xfs_rtbuf_ops;
+extern const struct xfs_buf_ops xfs_rtsb_buf_ops;
extern const struct xfs_buf_ops xfs_sb_buf_ops;
extern const struct xfs_buf_ops xfs_sb_quiet_buf_ops;
extern const struct xfs_buf_ops xfs_symlink_buf_ops;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 137fb5f88307b..6d49893fc91c7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -355,12 +355,14 @@ __XFS_HAS_FEAT(metadir, METADIR)
static inline bool xfs_has_rtgroups(struct xfs_mount *mp)
{
- return false;
+ /* all metadir file systems also allow rtgroups */
+ return xfs_has_metadir(mp);
}
static inline bool xfs_has_rtsb(struct xfs_mount *mp)
{
- return false;
+ /* all rtgroups filesystems with an rt section have an rtsb */
+ return xfs_has_rtgroups(mp) && xfs_has_realtime(mp);
}
/*
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 308049f2fb79d..b2c0c3fe64a11 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -751,6 +751,11 @@ xfs_growfs_rt_alloc_fake_mount(
nmp->m_sb.sb_rextents = xfs_blen_to_rtbxlen(nmp, nmp->m_sb.sb_rblocks);
nmp->m_sb.sb_rbmblocks = xfs_rtbitmap_blockcount(nmp);
nmp->m_sb.sb_rextslog = xfs_compute_rextslog(nmp->m_sb.sb_rextents);
+ if (xfs_has_rtgroups(nmp))
+ nmp->m_sb.sb_rgcount =
+ howmany_64(nmp->m_sb.sb_rextents, nmp->m_sb.sb_rgextents);
+ else
+ nmp->m_sb.sb_rgcount = 1;
nmp->m_rsumblocks = xfs_rtsummary_blockcount(nmp, &nmp->m_rsumlevels);
if (rblocks > 0)
@@ -761,6 +766,26 @@ xfs_growfs_rt_alloc_fake_mount(
return nmp;
}
+static xfs_rfsblock_t
+xfs_growfs_rt_nrblocks(
+ struct xfs_rtgroup *rtg,
+ xfs_rfsblock_t nrblocks,
+ xfs_agblock_t rextsize,
+ xfs_fileoff_t bmbno)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+ xfs_rfsblock_t step;
+
+ step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
+ if (xfs_has_rtgroups(mp)) {
+ xfs_rfsblock_t rgblocks = mp->m_sb.sb_rgextents * rextsize;
+
+ step = min(rgblocks, step) + rgblocks * rtg->rtg_rgno;
+ }
+
+ return min(nrblocks, step);
+}
+
static int
xfs_growfs_rt_bmblock(
struct xfs_rtgroup *rtg,
@@ -779,16 +804,15 @@ xfs_growfs_rt_bmblock(
.rtg = rtg,
};
struct xfs_mount *nmp;
- xfs_rfsblock_t nrblocks_step;
xfs_rtbxlen_t freed_rtx;
int error;
/*
* Calculate new sb and mount fields for this round.
*/
- nrblocks_step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
nmp = nargs.mp = xfs_growfs_rt_alloc_fake_mount(mp,
- min(nrblocks, nrblocks_step), rextsize);
+ xfs_growfs_rt_nrblocks(rtg, nrblocks, rextsize, bmbno),
+ rextsize);
if (!nmp)
return -ENOMEM;
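[Editorial note] The geometry invariant enforced by xfs_validate_sb_rtgroups
above — the group count must exactly cover the rt section — can be checked
with a short standalone sketch. howmany64 models the kernel's howmany_64
ceiling division; rtgroup_count_ok is an illustrative name, not a kernel
helper.

```c
#include <assert.h>
#include <stdint.h>

/* ceiling division, as done by the kernel's howmany_64() */
static uint64_t howmany64(uint64_t x, uint64_t y)
{
	return (x + y - 1) / y;
}

/* Does sb_rgcount exactly cover sb_rextents at sb_rgextents per group? */
static int rtgroup_count_ok(uint64_t rextents, uint32_t rgextents,
			    uint32_t rgcount)
{
	if (!rextents)
		return 1;	/* no rt section: nothing to cover */
	return howmany64(rextents, rgextents) == rgcount;
}
```

Both an undercount and an overcount are rejected: the last group may be
short, but there must be exactly enough groups to reach sb_rextents.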
* [PATCH 02/26] xfs: check the realtime superblock at mount time
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
2024-08-23 0:21 ` [PATCH 01/26] xfs: define the format of rt groups Darrick J. Wong
@ 2024-08-23 0:21 ` Darrick J. Wong
2024-08-23 5:11 ` Christoph Hellwig
2024-08-23 0:21 ` [PATCH 03/26] xfs: update realtime super every time we update the primary fs super Darrick J. Wong
` (23 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:21 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Check the realtime superblock at mount time, to ensure that the label
and uuids actually match the primary superblock on the data device. If
the rt superblock is good, attach it to the xfs_mount so that the log
can use ordered buffers to keep it in sync with the primary super on
the data device.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_rtalloc.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_rtalloc.h | 6 ++++++
fs/xfs/xfs_super.c | 12 ++++++++++--
4 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 6d49893fc91c7..1da20fafcf978 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -85,6 +85,7 @@ typedef struct xfs_mount {
struct super_block *m_super;
struct xfs_ail *m_ail; /* fs active log item list */
struct xfs_buf *m_sb_bp; /* buffer for superblock */
+ struct xfs_buf *m_rtsb_bp; /* realtime superblock */
char *m_rtname; /* realtime device name */
char *m_logname; /* external log device name */
struct xfs_da_geometry *m_dir_geo; /* directory block geometry */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index b2c0c3fe64a11..d8aa354b3bf14 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1152,6 +1152,56 @@ xfs_growfs_rt(
return error;
}
+/* Read the realtime superblock and attach it to the mount. */
+int
+xfs_rtmount_readsb(
+ struct xfs_mount *mp)
+{
+ struct xfs_buf *bp;
+ int error;
+
+ if (!xfs_has_rtsb(mp))
+ return 0;
+ if (mp->m_sb.sb_rblocks == 0)
+ return 0;
+ if (mp->m_rtdev_targp == NULL) {
+ xfs_warn(mp,
+ "Filesystem has a realtime volume, use rtdev=device option");
+ return -ENODEV;
+ }
+
+ /* m_blkbb_log is not set up yet */
+ error = xfs_buf_read_uncached(mp->m_rtdev_targp, XFS_RTSB_DADDR,
+ mp->m_sb.sb_blocksize >> BBSHIFT, XBF_NO_IOACCT, &bp,
+ &xfs_rtsb_buf_ops);
+ if (error) {
+ xfs_warn(mp, "rt sb validate failed with error %d.", error);
+ /* bad CRC means corrupted metadata */
+ if (error == -EFSBADCRC)
+ error = -EFSCORRUPTED;
+ return error;
+ }
+
+ mp->m_rtsb_bp = bp;
+ xfs_buf_unlock(bp);
+ return 0;
+}
+
+/* Detach the realtime superblock from the mount and free it. */
+void
+xfs_rtmount_freesb(
+ struct xfs_mount *mp)
+{
+ struct xfs_buf *bp = mp->m_rtsb_bp;
+
+ if (!bp)
+ return;
+
+ xfs_buf_lock(bp);
+ mp->m_rtsb_bp = NULL;
+ xfs_buf_relse(bp);
+}
+
/*
* Initialize realtime fields in the mount structure.
*/
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index a6836da9bebef..8e2a07b8174b7 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -12,6 +12,10 @@ struct xfs_mount;
struct xfs_trans;
#ifdef CONFIG_XFS_RT
+/* rtgroup superblock initialization */
+int xfs_rtmount_readsb(struct xfs_mount *mp);
+void xfs_rtmount_freesb(struct xfs_mount *mp);
+
/*
* Initialize realtime fields in the mount structure.
*/
@@ -42,6 +46,8 @@ int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
#else
# define xfs_growfs_rt(mp,in) (-ENOSYS)
# define xfs_rtalloc_reinit_frextents(m) (0)
+# define xfs_rtmount_readsb(mp) (0)
+# define xfs_rtmount_freesb(mp) ((void)0)
static inline int /* error */
xfs_rtmount_init(
xfs_mount_t *mp) /* file system mount structure */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2767083612bf6..835886c322a83 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -45,6 +45,7 @@
#include "xfs_rtbitmap.h"
#include "xfs_exchmaps_item.h"
#include "xfs_parent.h"
+#include "xfs_rtalloc.h"
#include "scrub/stats.h"
#include "scrub/rcbag_btree.h"
@@ -1145,6 +1146,7 @@ xfs_fs_put_super(
xfs_filestream_unmount(mp);
xfs_unmountfs(mp);
+ xfs_rtmount_freesb(mp);
xfs_freesb(mp);
xchk_mount_stats_free(mp);
free_percpu(mp->m_stats.xs_stats);
@@ -1680,9 +1682,13 @@ xfs_fs_fill_super(
goto out_free_sb;
}
+ error = xfs_rtmount_readsb(mp);
+ if (error)
+ goto out_free_sb;
+
error = xfs_filestream_mount(mp);
if (error)
- goto out_free_sb;
+ goto out_free_rtsb;
/*
* we must configure the block size in the superblock before we run the
@@ -1774,6 +1780,8 @@ xfs_fs_fill_super(
out_filestream_unmount:
xfs_filestream_unmount(mp);
+ out_free_rtsb:
+ xfs_rtmount_freesb(mp);
out_free_sb:
xfs_freesb(mp);
out_free_scrub_stats:
@@ -1793,7 +1801,7 @@ xfs_fs_fill_super(
out_unmount:
xfs_filestream_unmount(mp);
xfs_unmountfs(mp);
- goto out_free_sb;
+ goto out_free_rtsb;
}
static int
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 03/26] xfs: update realtime super every time we update the primary fs super
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
2024-08-23 0:21 ` [PATCH 01/26] xfs: define the format of rt groups Darrick J. Wong
2024-08-23 0:21 ` [PATCH 02/26] xfs: check the realtime superblock at mount time Darrick J. Wong
@ 2024-08-23 0:21 ` Darrick J. Wong
2024-08-23 5:12 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM Darrick J. Wong
` (22 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:21 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Every time we update parts of the primary filesystem superblock that are
echoed in the rt superblock, we must update the rt super. Avoid
changing the log to support logging to the rt device by using ordered
buffers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtgroup.c | 60 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 7 +++++
fs/xfs/libxfs/xfs_sb.c | 14 +++++++++-
fs/xfs/libxfs/xfs_sb.h | 2 +
fs/xfs/xfs_buf_item_recover.c | 18 ++++++++++++
fs/xfs/xfs_ioctl.c | 4 ++-
fs/xfs/xfs_trans.c | 1 +
fs/xfs/xfs_trans.h | 1 +
fs/xfs/xfs_trans_buf.c | 25 ++++++++++++++---
9 files changed, 124 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 2a8d5561da9d0..89194a66267e2 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -609,3 +609,63 @@ const struct xfs_buf_ops xfs_rtsb_buf_ops = {
.verify_write = xfs_rtsb_write_verify,
.verify_struct = xfs_rtsb_verify_all,
};
+
+/* Update a realtime superblock from the primary fs super */
+void
+xfs_update_rtsb(
+ struct xfs_buf *rtsb_bp,
+ const struct xfs_buf *sb_bp)
+{
+ const struct xfs_dsb *dsb = sb_bp->b_addr;
+ struct xfs_rtsb *rsb = rtsb_bp->b_addr;
+ const uuid_t *meta_uuid;
+
+ rsb->rsb_magicnum = cpu_to_be32(XFS_RTSB_MAGIC);
+
+ rsb->rsb_pad = 0;
+ memcpy(&rsb->rsb_fname, &dsb->sb_fname, XFSLABEL_MAX);
+
+ memcpy(&rsb->rsb_uuid, &dsb->sb_uuid, sizeof(rsb->rsb_uuid));
+
+ /*
+ * The metadata uuid is the fs uuid if the metauuid feature is not
+ * enabled.
+ */
+ if (dsb->sb_features_incompat &
+ cpu_to_be32(XFS_SB_FEAT_INCOMPAT_META_UUID))
+ meta_uuid = &dsb->sb_meta_uuid;
+ else
+ meta_uuid = &dsb->sb_uuid;
+ memcpy(&rsb->rsb_meta_uuid, meta_uuid, sizeof(rsb->rsb_meta_uuid));
+}
+
+/*
+ * Update the realtime superblock from a filesystem superblock and log it to
+ * the given transaction.
+ */
+struct xfs_buf *
+xfs_log_rtsb(
+ struct xfs_trans *tp,
+ const struct xfs_buf *sb_bp)
+{
+ struct xfs_buf *rtsb_bp;
+
+ if (!xfs_has_rtsb(tp->t_mountp))
+ return NULL;
+
+ rtsb_bp = xfs_trans_getrtsb(tp);
+ if (!rtsb_bp) {
+ /*
+ * It's possible for the rtgroups feature to be enabled but
+ * there is no incore rt superblock buffer if the rt geometry
+ * was specified at mkfs time but the rt section has not yet
+ * been attached. In this case, rblocks must be zero.
+ */
+ ASSERT(tp->t_mountp->m_sb.sb_rblocks == 0);
+ return NULL;
+ }
+
+ xfs_update_rtsb(rtsb_bp, sb_bp);
+ xfs_trans_ordered_buf(tp, rtsb_bp);
+ return rtsb_bp;
+}
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index e622b24a0d75f..a18ea0aca3db1 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -258,11 +258,18 @@ static inline const char *xfs_rtginode_path(xfs_rgnumber_t rgno,
{
return kasprintf(GFP_KERNEL, "%u.%s", rgno, xfs_rtginode_name(type));
}
+
+void xfs_update_rtsb(struct xfs_buf *rtsb_bp,
+ const struct xfs_buf *sb_bp);
+struct xfs_buf *xfs_log_rtsb(struct xfs_trans *tp,
+ const struct xfs_buf *sb_bp);
#else
# define xfs_rtgroup_extents(mp, rgno) (0)
# define xfs_rtgroup_lock(rtg, gf) ((void)0)
# define xfs_rtgroup_unlock(rtg, gf) ((void)0)
# define xfs_rtgroup_trans_join(tp, rtg, gf) ((void)0)
+# define xfs_update_rtsb(bp, sb_bp) ((void)0)
+# define xfs_log_rtsb(tp, sb_bp) (NULL)
#endif /* CONFIG_XFS_RT */
#endif /* __LIBXFS_RTGROUP_H */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index e33afd8f3e256..29b20615d80bb 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -27,6 +27,7 @@
#include "xfs_ag.h"
#include "xfs_rtbitmap.h"
#include "xfs_exchrange.h"
+#include "xfs_rtgroup.h"
/*
* Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -1270,10 +1271,12 @@ xfs_update_secondary_sbs(
*/
int
xfs_sync_sb_buf(
- struct xfs_mount *mp)
+ struct xfs_mount *mp,
+ bool update_rtsb)
{
struct xfs_trans *tp;
struct xfs_buf *bp;
+ struct xfs_buf *rtsb_bp = NULL;
int error;
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0, 0, &tp);
@@ -1283,6 +1286,11 @@ xfs_sync_sb_buf(
bp = xfs_trans_getsb(tp);
xfs_log_sb(tp);
xfs_trans_bhold(tp, bp);
+ if (update_rtsb) {
+ rtsb_bp = xfs_log_rtsb(tp, bp);
+ if (rtsb_bp)
+ xfs_trans_bhold(tp, rtsb_bp);
+ }
xfs_trans_set_sync(tp);
error = xfs_trans_commit(tp);
if (error)
@@ -1291,7 +1299,11 @@ xfs_sync_sb_buf(
* write out the sb buffer to get the changes to disk
*/
error = xfs_bwrite(bp);
+ if (!error && rtsb_bp)
+ error = xfs_bwrite(rtsb_bp);
out:
+ if (rtsb_bp)
+ xfs_buf_relse(rtsb_bp);
xfs_buf_relse(bp);
return error;
}
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 885c837559914..999dcfccdaf96 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -15,7 +15,7 @@ struct xfs_perag;
extern void xfs_log_sb(struct xfs_trans *tp);
extern int xfs_sync_sb(struct xfs_mount *mp, bool wait);
-extern int xfs_sync_sb_buf(struct xfs_mount *mp);
+extern int xfs_sync_sb_buf(struct xfs_mount *mp, bool update_rtsb);
extern void xfs_sb_mount_common(struct xfs_mount *mp, struct xfs_sb *sbp);
void xfs_mount_sb_set_rextsize(struct xfs_mount *mp,
struct xfs_sb *sbp);
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 09e893cf563cb..51cb239d7924c 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -22,6 +22,7 @@
#include "xfs_inode.h"
#include "xfs_dir2.h"
#include "xfs_quota.h"
+#include "xfs_rtgroup.h"
/*
* This is the number of entries in the l_buf_cancel_table used during
@@ -995,6 +996,23 @@ xlog_recover_buf_commit_pass2(
ASSERT(bp->b_mount == mp);
bp->b_flags |= _XBF_LOGRECOVERY;
xfs_buf_delwri_queue(bp, buffer_list);
+
+ /*
+ * Update the rt super if we just recovered the primary fs
+ * super.
+ */
+ if (xfs_has_rtsb(mp) && bp->b_ops == &xfs_sb_buf_ops) {
+ struct xfs_buf *rtsb_bp = mp->m_rtsb_bp;
+
+ if (rtsb_bp) {
+ xfs_buf_lock(rtsb_bp);
+ xfs_buf_hold(rtsb_bp);
+ xfs_update_rtsb(rtsb_bp, bp);
+ rtsb_bp->b_flags |= _XBF_LOGRECOVERY;
+ xfs_buf_delwri_queue(rtsb_bp, buffer_list);
+ xfs_buf_relse(rtsb_bp);
+ }
+ }
}
out_release:
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 461780ffb8fc0..c5526434f66fd 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1028,7 +1028,7 @@ xfs_ioc_setlabel(
* buffered reads from userspace (i.e. from blkid) are invalidated,
* and userspace will see the newly-written label.
*/
- error = xfs_sync_sb_buf(mp);
+ error = xfs_sync_sb_buf(mp, true);
if (error)
goto out;
/*
@@ -1039,6 +1039,8 @@ xfs_ioc_setlabel(
mutex_unlock(&mp->m_growlock);
invalidate_bdev(mp->m_ddev_targp->bt_bdev);
+ if (xfs_has_rtsb(mp) && mp->m_rtdev_targp)
+ invalidate_bdev(mp->m_rtdev_targp->bt_bdev);
out:
mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bdf3704dc3011..5fd1765b3dcd8 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -25,6 +25,7 @@
#include "xfs_dquot.h"
#include "xfs_icache.h"
#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
struct kmem_cache *xfs_trans_cache;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f06cc0f41665a..f97e5c416efad 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -214,6 +214,7 @@ xfs_trans_read_buf(
}
struct xfs_buf *xfs_trans_getsb(struct xfs_trans *);
+struct xfs_buf *xfs_trans_getrtsb(struct xfs_trans *tp);
void xfs_trans_brelse(xfs_trans_t *, struct xfs_buf *);
void xfs_trans_bjoin(xfs_trans_t *, struct xfs_buf *);
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index e28ab74af4f0e..8e886ecfd69a3 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -168,12 +168,11 @@ xfs_trans_get_buf_map(
/*
* Get and lock the superblock buffer for the given transaction.
*/
-struct xfs_buf *
-xfs_trans_getsb(
- struct xfs_trans *tp)
+static struct xfs_buf *
+__xfs_trans_getsb(
+ struct xfs_trans *tp,
+ struct xfs_buf *bp)
{
- struct xfs_buf *bp = tp->t_mountp->m_sb_bp;
-
/*
* Just increment the lock recursion count if the buffer is already
* attached to this transaction.
@@ -197,6 +196,22 @@ xfs_trans_getsb(
return bp;
}
+struct xfs_buf *
+xfs_trans_getsb(
+ struct xfs_trans *tp)
+{
+ return __xfs_trans_getsb(tp, tp->t_mountp->m_sb_bp);
+}
+
+struct xfs_buf *
+xfs_trans_getrtsb(
+ struct xfs_trans *tp)
+{
+ if (!tp->t_mountp->m_rtsb_bp)
+ return NULL;
+ return __xfs_trans_getsb(tp, tp->t_mountp->m_rtsb_bp);
+}
+
/*
* Get and lock the buffer for the caller if it is not already
* locked within the given transaction. If it has not yet been
* [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (2 preceding siblings ...)
2024-08-23 0:21 ` [PATCH 03/26] xfs: update realtime super every time we update the primary fs super Darrick J. Wong
@ 2024-08-23 0:22 ` Darrick J. Wong
2024-08-23 5:12 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups Darrick J. Wong
` (21 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:22 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Export the realtime geometry information so that userspace can query it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 4 +++-
fs/xfs/libxfs/xfs_sb.c | 5 +++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b441b9258128e..57819fea064e7 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -186,7 +186,9 @@ struct xfs_fsop_geom {
__u32 logsunit; /* log stripe unit, bytes */
uint32_t sick; /* o: unhealthy fs & rt metadata */
uint32_t checked; /* o: checked fs & rt metadata */
- __u64 reserved[17]; /* reserved space */
+ __u32 rgextents; /* rt extents in a realtime group */
+ __u32 rgcount; /* number of realtime groups */
+ __u64 reserved[16]; /* reserved space */
};
#define XFS_FSOP_GEOM_SICK_COUNTERS (1 << 0) /* summary counters */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 29b20615d80bb..2a0155d946c1e 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1407,6 +1407,11 @@ xfs_fs_geometry(
return;
geo->version = XFS_FSOP_GEOM_VERSION_V5;
+
+ if (xfs_has_rtgroups(mp)) {
+ geo->rgcount = sbp->sb_rgcount;
+ geo->rgextents = sbp->sb_rgextents;
+ }
}
/* Read a secondary superblock. */
* [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (3 preceding siblings ...)
2024-08-23 0:22 ` [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM Darrick J. Wong
@ 2024-08-23 0:22 ` Darrick J. Wong
2024-08-23 5:13 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 06/26] xfs: add a helper to prevent bmap merges across rtgroup boundaries Darrick J. Wong
` (20 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:22 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Check that rt block pointers do not point to the realtime superblock and
that allocated rt space extents do not cross rtgroup boundaries.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_types.c | 38 +++++++++++++++++++++++++++++++++-----
1 file changed, 33 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index c299b16c9365f..8625cbaf530e5 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -12,6 +12,8 @@
#include "xfs_bit.h"
#include "xfs_mount.h"
#include "xfs_ag.h"
+#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
/*
@@ -135,18 +137,37 @@ xfs_verify_dir_ino(
}
/*
- * Verify that an realtime block number pointer doesn't point off the
- * end of the realtime device.
+ * Verify that a realtime block number pointer neither points outside the
+ * allocatable areas of the rtgroup nor off the end of the realtime
+ * device.
*/
inline bool
xfs_verify_rtbno(
struct xfs_mount *mp,
xfs_rtblock_t rtbno)
{
- return rtbno < mp->m_sb.sb_rblocks;
+ if (rtbno >= mp->m_sb.sb_rblocks)
+ return false;
+
+ if (xfs_has_rtgroups(mp)) {
+ xfs_rgnumber_t rgno = xfs_rtb_to_rgno(mp, rtbno);
+ xfs_rtxnum_t rtx = xfs_rtb_to_rtx(mp, rtbno);
+
+ if (rgno >= mp->m_sb.sb_rgcount)
+ return false;
+ if (rtx >= xfs_rtgroup_extents(mp, rgno))
+ return false;
+ if (xfs_has_rtsb(mp) && rgno == 0 && rtx == 0)
+ return false;
+ }
+ return true;
}
-/* Verify that a realtime device extent is fully contained inside the volume. */
+/*
+ * Verify that an allocated realtime device extent neither points outside
+ * allocatable areas of the rtgroup, across an rtgroup boundary, nor off the
+ * end of the realtime device.
+ */
bool
xfs_verify_rtbext(
struct xfs_mount *mp,
@@ -159,7 +180,14 @@ xfs_verify_rtbext(
if (!xfs_verify_rtbno(mp, rtbno))
return false;
- return xfs_verify_rtbno(mp, rtbno + len - 1);
+ if (!xfs_verify_rtbno(mp, rtbno + len - 1))
+ return false;
+
+ if (xfs_has_rtgroups(mp) &&
+ xfs_rtb_to_rgno(mp, rtbno) != xfs_rtb_to_rgno(mp, rtbno + len - 1))
+ return false;
+
+ return true;
}
/* Calculate the range of valid icount values. */
* [PATCH 06/26] xfs: add a helper to prevent bmap merges across rtgroup boundaries
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:22 ` [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups Darrick J. Wong
@ 2024-08-23 0:22 ` Darrick J. Wong
2024-08-23 0:22 ` [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled Darrick J. Wong
` (19 subsequent siblings)
25 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:22 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Except for the rt superblock, realtime groups do not store any metadata
at the start (or end) of the group. There is nothing to prevent the
bmap code from merging allocations from multiple groups into a single
bmap record. Add a helper to check for this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: massage the commit message after pulling this into rtgroups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 56 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 44 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c056ca8ad6090..f1bf8635a8cf3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -40,6 +40,7 @@
#include "xfs_bmap_item.h"
#include "xfs_symlink_remote.h"
#include "xfs_inode_util.h"
+#include "xfs_rtgroup.h"
struct kmem_cache *xfs_bmap_intent_cache;
@@ -1426,6 +1427,24 @@ xfs_bmap_last_offset(
* Extent tree manipulation functions used during allocation.
*/
+static inline bool
+xfs_bmap_same_rtgroup(
+ struct xfs_inode *ip,
+ int whichfork,
+ struct xfs_bmbt_irec *left,
+ struct xfs_bmbt_irec *right)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (xfs_ifork_is_realtime(ip, whichfork) && xfs_has_rtgroups(mp)) {
+ if (xfs_rtb_to_rgno(mp, left->br_startblock) !=
+ xfs_rtb_to_rgno(mp, right->br_startblock))
+ return false;
+ }
+
+ return true;
+}
+
/*
* Convert a delayed allocation to a real allocation.
*/
@@ -1495,7 +1514,8 @@ xfs_bmap_add_extent_delay_real(
LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&
LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&
LEFT.br_state == new->br_state &&
- LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+ LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
+ xfs_bmap_same_rtgroup(bma->ip, whichfork, &LEFT, new))
state |= BMAP_LEFT_CONTIG;
/*
@@ -1519,7 +1539,8 @@ xfs_bmap_add_extent_delay_real(
(BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
BMAP_RIGHT_FILLING) ||
LEFT.br_blockcount + new->br_blockcount + RIGHT.br_blockcount
- <= XFS_MAX_BMBT_EXTLEN))
+ <= XFS_MAX_BMBT_EXTLEN) &&
+ xfs_bmap_same_rtgroup(bma->ip, whichfork, new, &RIGHT))
state |= BMAP_RIGHT_CONTIG;
error = 0;
@@ -2064,7 +2085,8 @@ xfs_bmap_add_extent_unwritten_real(
LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&
LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&
LEFT.br_state == new->br_state &&
- LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+ LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
+ xfs_bmap_same_rtgroup(ip, whichfork, &LEFT, new))
state |= BMAP_LEFT_CONTIG;
/*
@@ -2088,7 +2110,8 @@ xfs_bmap_add_extent_unwritten_real(
(BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
BMAP_RIGHT_FILLING) ||
LEFT.br_blockcount + new->br_blockcount + RIGHT.br_blockcount
- <= XFS_MAX_BMBT_EXTLEN))
+ <= XFS_MAX_BMBT_EXTLEN) &&
+ xfs_bmap_same_rtgroup(ip, whichfork, new, &RIGHT))
state |= BMAP_RIGHT_CONTIG;
/*
@@ -2597,7 +2620,8 @@ xfs_bmap_add_extent_hole_delay(
*/
if ((state & BMAP_LEFT_VALID) && (state & BMAP_LEFT_DELAY) &&
left.br_startoff + left.br_blockcount == new->br_startoff &&
- left.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+ left.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
+ xfs_bmap_same_rtgroup(ip, whichfork, &left, new))
state |= BMAP_LEFT_CONTIG;
if ((state & BMAP_RIGHT_VALID) && (state & BMAP_RIGHT_DELAY) &&
@@ -2605,7 +2629,8 @@ xfs_bmap_add_extent_hole_delay(
new->br_blockcount + right.br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
(!(state & BMAP_LEFT_CONTIG) ||
(left.br_blockcount + new->br_blockcount +
- right.br_blockcount <= XFS_MAX_BMBT_EXTLEN)))
+ right.br_blockcount <= XFS_MAX_BMBT_EXTLEN)) &&
+ xfs_bmap_same_rtgroup(ip, whichfork, new, &right))
state |= BMAP_RIGHT_CONTIG;
/*
@@ -2748,7 +2773,8 @@ xfs_bmap_add_extent_hole_real(
left.br_startoff + left.br_blockcount == new->br_startoff &&
left.br_startblock + left.br_blockcount == new->br_startblock &&
left.br_state == new->br_state &&
- left.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+ left.br_blockcount + new->br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
+ xfs_bmap_same_rtgroup(ip, whichfork, &left, new))
state |= BMAP_LEFT_CONTIG;
if ((state & BMAP_RIGHT_VALID) && !(state & BMAP_RIGHT_DELAY) &&
@@ -2758,7 +2784,8 @@ xfs_bmap_add_extent_hole_real(
new->br_blockcount + right.br_blockcount <= XFS_MAX_BMBT_EXTLEN &&
(!(state & BMAP_LEFT_CONTIG) ||
left.br_blockcount + new->br_blockcount +
- right.br_blockcount <= XFS_MAX_BMBT_EXTLEN))
+ right.br_blockcount <= XFS_MAX_BMBT_EXTLEN) &&
+ xfs_bmap_same_rtgroup(ip, whichfork, new, &right))
state |= BMAP_RIGHT_CONTIG;
error = 0;
@@ -5766,6 +5793,8 @@ xfs_bunmapi(
*/
STATIC bool
xfs_bmse_can_merge(
+ struct xfs_inode *ip,
+ int whichfork,
struct xfs_bmbt_irec *left, /* preceding extent */
struct xfs_bmbt_irec *got, /* current extent to shift */
xfs_fileoff_t shift) /* shift fsb */
@@ -5781,7 +5810,8 @@ xfs_bmse_can_merge(
if ((left->br_startoff + left->br_blockcount != startoff) ||
(left->br_startblock + left->br_blockcount != got->br_startblock) ||
(left->br_state != got->br_state) ||
- (left->br_blockcount + got->br_blockcount > XFS_MAX_BMBT_EXTLEN))
+ (left->br_blockcount + got->br_blockcount > XFS_MAX_BMBT_EXTLEN) ||
+ !xfs_bmap_same_rtgroup(ip, whichfork, left, got))
return false;
return true;
@@ -5817,7 +5847,7 @@ xfs_bmse_merge(
blockcount = left->br_blockcount + got->br_blockcount;
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL);
- ASSERT(xfs_bmse_can_merge(left, got, shift));
+ ASSERT(xfs_bmse_can_merge(ip, whichfork, left, got, shift));
new = *left;
new.br_blockcount = blockcount;
@@ -5979,7 +6009,8 @@ xfs_bmap_collapse_extents(
goto del_cursor;
}
- if (xfs_bmse_can_merge(&prev, &got, offset_shift_fsb)) {
+ if (xfs_bmse_can_merge(ip, whichfork, &prev, &got,
+ offset_shift_fsb)) {
error = xfs_bmse_merge(tp, ip, whichfork,
offset_shift_fsb, &icur, &got, &prev,
cur, &logflags);
@@ -6115,7 +6146,8 @@ xfs_bmap_insert_extents(
* never find mergeable extents in this scenario. Check anyways
* and warn if we encounter two extents that could be one.
*/
- if (xfs_bmse_can_merge(&got, &next, offset_shift_fsb))
+ if (xfs_bmse_can_merge(ip, whichfork, &got, &next,
+ offset_shift_fsb))
WARN_ON_ONCE(1);
}
* [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (5 preceding siblings ...)
2024-08-23 0:22 ` [PATCH 06/26] xfs: add a helper to prevent bmap merges across rtgroup boundaries Darrick J. Wong
@ 2024-08-23 0:22 ` Darrick J. Wong
2024-08-23 5:13 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE Darrick J. Wong
` (18 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:22 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Make the free rt extent count a part of the lazy sb counters when the
realtime groups feature is enabled. This is possible because the patch
that recomputes frextents from the rtbitmap during log recovery predates
the code adding rtgroup support, so we know that the value will always
be correct at runtime.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_sb.c | 8 ++++++++
fs/xfs/scrub/fscounters_repair.c | 9 +++++----
fs/xfs/xfs_trans.c | 17 ++++++++++++++---
3 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 2a0155d946c1e..109be10c6e84f 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1146,6 +1146,11 @@ xfs_log_sb(
* sb counters, despite having a percpu counter. It is always kept
* consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
* and hence we don't need to update it here.
+ *
+ * sb_frextents was added to the lazy sb counters when the rt groups
+ * feature was introduced. This counter can go negative due to the way
+ * we handle nearly-lockless reservations, so we must use the _positive
+ * variant here to avoid writing out nonsense frextents.
*/
if (xfs_has_lazysbcount(mp)) {
mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
@@ -1155,6 +1160,9 @@ xfs_log_sb(
mp->m_sb.sb_fdblocks =
percpu_counter_sum_positive(&mp->m_fdblocks);
}
+ if (xfs_has_rtgroups(mp))
+ mp->m_sb.sb_frextents =
+ percpu_counter_sum_positive(&mp->m_frextents);
xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
diff --git a/fs/xfs/scrub/fscounters_repair.c b/fs/xfs/scrub/fscounters_repair.c
index 469bf645dbea5..cda13447a373e 100644
--- a/fs/xfs/scrub/fscounters_repair.c
+++ b/fs/xfs/scrub/fscounters_repair.c
@@ -68,15 +68,16 @@ xrep_fscounters(
/*
* Online repair is only supported on v5 file systems, which require
- * lazy sb counters and thus no update of sb_fdblocks here. But as of
- * now we don't support lazy counting sb_frextents yet, and thus need
- * to also update it directly here. And for that we need to keep
+ * lazy sb counters and thus no update of sb_fdblocks here. But
+ * sb_frextents only uses a lazy counter with rtgroups, and thus needs
+ * to be updated directly here otherwise. And for that we need to keep
* track of the delalloc reservations separately, as they are
* subtracted from m_frextents, but not included in sb_frextents.
*/
percpu_counter_set(&mp->m_frextents,
fsc->frextents - fsc->frextents_delayed);
- mp->m_sb.sb_frextents = fsc->frextents;
+ if (!xfs_has_rtgroups(mp))
+ mp->m_sb.sb_frextents = fsc->frextents;
return 0;
}
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 5fd1765b3dcd8..552e3a149346c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -421,6 +421,8 @@ xfs_trans_mod_sb(
ASSERT(tp->t_rtx_res_used <= tp->t_rtx_res);
}
tp->t_frextents_delta += delta;
+ if (xfs_has_rtgroups(mp))
+ flags &= ~XFS_TRANS_SB_DIRTY;
break;
case XFS_TRANS_SB_RES_FREXTENTS:
/*
@@ -510,8 +512,14 @@ xfs_trans_apply_sb_deltas(
*
* Don't touch m_frextents because it includes incore reservations,
* and those are handled by the unreserve function.
+ *
+ * sb_frextents was added to the lazy sb counters when the rt groups
+ * feature was introduced. This is possible because we know that all
+ * kernels supporting rtgroups will also recompute frextents from the
+ * realtime bitmap.
*/
- if (tp->t_frextents_delta || tp->t_res_frextents_delta) {
+ if ((tp->t_frextents_delta || tp->t_res_frextents_delta) &&
+ !xfs_has_rtgroups(tp->t_mountp)) {
struct xfs_mount *mp = tp->t_mountp;
int64_t rtxdelta;
@@ -619,7 +627,7 @@ xfs_trans_unreserve_and_mod_sb(
}
ASSERT(tp->t_rtx_res || tp->t_frextents_delta >= 0);
- if (tp->t_flags & XFS_TRANS_SB_DIRTY) {
+ if (xfs_has_rtgroups(mp) || (tp->t_flags & XFS_TRANS_SB_DIRTY)) {
rtxdelta += tp->t_frextents_delta;
ASSERT(rtxdelta >= 0);
}
@@ -655,8 +663,11 @@ xfs_trans_unreserve_and_mod_sb(
* Do not touch sb_frextents here because we are dealing with incore
* reservation. sb_frextents is not part of the lazy sb counters so it
* must be consistent with the ondisk rtbitmap and must never include
- * incore reservations.
+ * incore reservations. sb_frextents was added to the lazy sb counters
+ * when the realtime groups feature was introduced.
*/
+ if (xfs_has_rtgroups(mp))
+ mp->m_sb.sb_frextents += rtxdelta;
mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
mp->m_sb.sb_agcount += tp->t_agcount_delta;
mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
* [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (6 preceding siblings ...)
2024-08-23 0:22 ` [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled Darrick J. Wong
@ 2024-08-23 0:23 ` Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 09/26] xfs: record rt group metadata errors in the health system Darrick J. Wong
` (17 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:23 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Convert these arrays to use ARRAY_SIZE instead of requiring an empty
sentinel array element at the end. This saves memory and would have
avoided a bug that worked its way into the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_health.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 0bdbf6807bd29..cb43bd11dcac5 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -369,6 +369,9 @@ struct ioctl_sick_map {
unsigned int ioctl_mask;
};
+#define for_each_sick_map(map, m) \
+ for ((m) = (map); (m) < (map) + ARRAY_SIZE(map); (m)++)
+
static const struct ioctl_sick_map fs_map[] = {
{ XFS_SICK_FS_COUNTERS, XFS_FSOP_GEOM_SICK_COUNTERS},
{ XFS_SICK_FS_UQUOTA, XFS_FSOP_GEOM_SICK_UQUOTA },
@@ -378,13 +381,11 @@ static const struct ioctl_sick_map fs_map[] = {
{ XFS_SICK_FS_NLINKS, XFS_FSOP_GEOM_SICK_NLINKS },
{ XFS_SICK_FS_METADIR, XFS_FSOP_GEOM_SICK_METADIR },
{ XFS_SICK_FS_METAPATH, XFS_FSOP_GEOM_SICK_METAPATH },
- { 0, 0 },
};
static const struct ioctl_sick_map rt_map[] = {
{ XFS_SICK_RT_BITMAP, XFS_FSOP_GEOM_SICK_RT_BITMAP },
{ XFS_SICK_RT_SUMMARY, XFS_FSOP_GEOM_SICK_RT_SUMMARY },
- { 0, 0 },
};
static inline void
@@ -414,11 +415,11 @@ xfs_fsop_geom_health(
geo->checked = 0;
xfs_fs_measure_sickness(mp, &sick, &checked);
- for (m = fs_map; m->sick_mask; m++)
+ for_each_sick_map(fs_map, m)
xfgeo_health_tick(geo, sick, checked, m);
xfs_rt_measure_sickness(mp, &sick, &checked);
- for (m = rt_map; m->sick_mask; m++)
+ for_each_sick_map(rt_map, m)
xfgeo_health_tick(geo, sick, checked, m);
}
@@ -434,7 +435,6 @@ static const struct ioctl_sick_map ag_map[] = {
{ XFS_SICK_AG_RMAPBT, XFS_AG_GEOM_SICK_RMAPBT },
{ XFS_SICK_AG_REFCNTBT, XFS_AG_GEOM_SICK_REFCNTBT },
{ XFS_SICK_AG_INODES, XFS_AG_GEOM_SICK_INODES },
- { 0, 0 },
};
/* Fill out ag geometry health info. */
@@ -451,7 +451,7 @@ xfs_ag_geom_health(
ageo->ag_checked = 0;
xfs_ag_measure_sickness(pag, &sick, &checked);
- for (m = ag_map; m->sick_mask; m++) {
+ for_each_sick_map(ag_map, m) {
if (checked & m->sick_mask)
ageo->ag_checked |= m->ioctl_mask;
if (sick & m->sick_mask)
@@ -473,7 +473,6 @@ static const struct ioctl_sick_map ino_map[] = {
{ XFS_SICK_INO_DIR_ZAPPED, XFS_BS_SICK_DIR },
{ XFS_SICK_INO_SYMLINK_ZAPPED, XFS_BS_SICK_SYMLINK },
{ XFS_SICK_INO_DIRTREE, XFS_BS_SICK_DIRTREE },
- { 0, 0 },
};
/* Fill out bulkstat health info. */
@@ -490,7 +489,7 @@ xfs_bulkstat_health(
bs->bs_checked = 0;
xfs_inode_measure_sickness(ip, &sick, &checked);
- for (m = ino_map; m->sick_mask; m++) {
+ for_each_sick_map(ino_map, m) {
if (checked & m->sick_mask)
bs->bs_checked |= m->ioctl_mask;
if (sick & m->sick_mask)
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 09/26] xfs: record rt group metadata errors in the health system
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (7 preceding siblings ...)
2024-08-23 0:23 ` [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE Darrick J. Wong
@ 2024-08-23 0:23 ` Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 10/26] xfs: export the geometry of realtime groups to userspace Darrick J. Wong
` (16 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:23 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Record the state of per-rtgroup metadata sickness in the rtgroup
structure for later reporting.
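Setting the spinlock aside, the sick/checked bitset semantics this patch
gives each rtgroup can be sketched in plain C as follows. This is an
illustrative userspace model, not the kernel API: the struct and function
names are invented, and only the bit values mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's per-rtgroup health bits. */
#define SICK_RG_SUPER     (1U << 0)
#define SICK_RG_BITMAP    (1U << 1)
#define SICK_RG_SUMMARY   (1U << 2)
#define SICK_RG_PRIMARY   (SICK_RG_SUPER | SICK_RG_BITMAP | SICK_RG_SUMMARY)
#define SICK_RG_SECONDARY (0U)

struct rtgroup_health {
	uint16_t sick;    /* metadata known to be bad */
	uint16_t checked; /* metadata that has been examined */
};

/* Runtime corruption observed: remember it, but it hasn't been checked. */
static void mark_sick(struct rtgroup_health *h, unsigned int mask)
{
	h->sick |= mask;
}

/* fsck examined this metadata and found it bad. */
static void mark_corrupt(struct rtgroup_health *h, unsigned int mask)
{
	h->sick |= mask;
	h->checked |= mask;
}

/*
 * fsck examined this metadata and found it ok; once no primary evidence
 * of sickness remains, any secondary state is cleared as well.
 */
static void mark_healthy(struct rtgroup_health *h, unsigned int mask)
{
	h->sick &= ~mask;
	if (!(h->sick & SICK_RG_PRIMARY))
		h->sick &= ~SICK_RG_SECONDARY;
	h->checked |= mask;
}
```

The distinction between mark_sick and mark_corrupt is what lets the
reporting side tell "never checked" apart from "checked and found bad".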
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_health.h | 59 ++++++++-------
fs/xfs/libxfs/xfs_rtbitmap.c | 37 +++++----
fs/xfs/libxfs/xfs_rtgroup.c | 38 ++++++++--
fs/xfs/libxfs/xfs_rtgroup.h | 9 ++
fs/xfs/scrub/health.c | 33 ++++++--
fs/xfs/xfs_health.c | 164 ++++++++++++++++++++++++------------------
fs/xfs/xfs_trace.h | 30 +++++++-
7 files changed, 236 insertions(+), 134 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 8abd345e23885..7e77e2df9704a 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -52,6 +52,7 @@ struct xfs_inode;
struct xfs_fsop_geom;
struct xfs_btree_cur;
struct xfs_da_args;
+struct xfs_rtgroup;
/* Observable health issues for metadata spanning the entire filesystem. */
#define XFS_SICK_FS_COUNTERS (1 << 0) /* summary counters */
@@ -63,9 +64,10 @@ struct xfs_da_args;
#define XFS_SICK_FS_METADIR (1 << 6) /* metadata directory tree */
#define XFS_SICK_FS_METAPATH (1 << 7) /* metadata directory tree path */
-/* Observable health issues for realtime volume metadata. */
-#define XFS_SICK_RT_BITMAP (1 << 0) /* realtime bitmap */
-#define XFS_SICK_RT_SUMMARY (1 << 1) /* realtime summary */
+/* Observable health issues for realtime group metadata. */
+#define XFS_SICK_RG_SUPER (1 << 0) /* rt group superblock */
+#define XFS_SICK_RG_BITMAP (1 << 1) /* rt group bitmap */
+#define XFS_SICK_RG_SUMMARY (1 << 2) /* rt group summary */
/* Observable health issues for AG metadata. */
#define XFS_SICK_AG_SB (1 << 0) /* superblock */
@@ -109,8 +111,9 @@ struct xfs_da_args;
XFS_SICK_FS_METADIR | \
XFS_SICK_FS_METAPATH)
-#define XFS_SICK_RT_PRIMARY (XFS_SICK_RT_BITMAP | \
- XFS_SICK_RT_SUMMARY)
+#define XFS_SICK_RG_PRIMARY (XFS_SICK_RG_SUPER | \
+ XFS_SICK_RG_BITMAP | \
+ XFS_SICK_RG_SUMMARY)
#define XFS_SICK_AG_PRIMARY (XFS_SICK_AG_SB | \
XFS_SICK_AG_AGF | \
@@ -140,26 +143,26 @@ struct xfs_da_args;
/* Secondary state related to (but not primary evidence of) health problems. */
#define XFS_SICK_FS_SECONDARY (0)
-#define XFS_SICK_RT_SECONDARY (0)
+#define XFS_SICK_RG_SECONDARY (0)
#define XFS_SICK_AG_SECONDARY (0)
#define XFS_SICK_INO_SECONDARY (XFS_SICK_INO_FORGET)
/* Evidence of health problems elsewhere. */
#define XFS_SICK_FS_INDIRECT (0)
-#define XFS_SICK_RT_INDIRECT (0)
+#define XFS_SICK_RG_INDIRECT (0)
#define XFS_SICK_AG_INDIRECT (XFS_SICK_AG_INODES)
#define XFS_SICK_INO_INDIRECT (0)
/* All health masks. */
-#define XFS_SICK_FS_ALL (XFS_SICK_FS_PRIMARY | \
+#define XFS_SICK_FS_ALL (XFS_SICK_FS_PRIMARY | \
XFS_SICK_FS_SECONDARY | \
XFS_SICK_FS_INDIRECT)
-#define XFS_SICK_RT_ALL (XFS_SICK_RT_PRIMARY | \
- XFS_SICK_RT_SECONDARY | \
- XFS_SICK_RT_INDIRECT)
+#define XFS_SICK_RG_ALL (XFS_SICK_RG_PRIMARY | \
+ XFS_SICK_RG_SECONDARY | \
+ XFS_SICK_RG_INDIRECT)
-#define XFS_SICK_AG_ALL (XFS_SICK_AG_PRIMARY | \
+#define XFS_SICK_AG_ALL (XFS_SICK_AG_PRIMARY | \
XFS_SICK_AG_SECONDARY | \
XFS_SICK_AG_INDIRECT)
@@ -193,10 +196,12 @@ void xfs_fs_mark_healthy(struct xfs_mount *mp, unsigned int mask);
void xfs_fs_measure_sickness(struct xfs_mount *mp, unsigned int *sick,
unsigned int *checked);
-void xfs_rt_mark_sick(struct xfs_mount *mp, unsigned int mask);
-void xfs_rt_mark_corrupt(struct xfs_mount *mp, unsigned int mask);
-void xfs_rt_mark_healthy(struct xfs_mount *mp, unsigned int mask);
-void xfs_rt_measure_sickness(struct xfs_mount *mp, unsigned int *sick,
+void xfs_rgno_mark_sick(struct xfs_mount *mp, xfs_rgnumber_t rgno,
+ unsigned int mask);
+void xfs_rtgroup_mark_sick(struct xfs_rtgroup *rtg, unsigned int mask);
+void xfs_rtgroup_mark_corrupt(struct xfs_rtgroup *rtg, unsigned int mask);
+void xfs_rtgroup_mark_healthy(struct xfs_rtgroup *rtg, unsigned int mask);
+void xfs_rtgroup_measure_sickness(struct xfs_rtgroup *rtg, unsigned int *sick,
unsigned int *checked);
void xfs_agno_mark_sick(struct xfs_mount *mp, xfs_agnumber_t agno,
@@ -230,15 +235,6 @@ xfs_fs_has_sickness(struct xfs_mount *mp, unsigned int mask)
return sick & mask;
}
-static inline bool
-xfs_rt_has_sickness(struct xfs_mount *mp, unsigned int mask)
-{
- unsigned int sick, checked;
-
- xfs_rt_measure_sickness(mp, &sick, &checked);
- return sick & mask;
-}
-
static inline bool
xfs_ag_has_sickness(struct xfs_perag *pag, unsigned int mask)
{
@@ -248,6 +244,15 @@ xfs_ag_has_sickness(struct xfs_perag *pag, unsigned int mask)
return sick & mask;
}
+static inline bool
+xfs_rtgroup_has_sickness(struct xfs_rtgroup *rtg, unsigned int mask)
+{
+ unsigned int sick, checked;
+
+ xfs_rtgroup_measure_sickness(rtg, &sick, &checked);
+ return sick & mask;
+}
+
static inline bool
xfs_inode_has_sickness(struct xfs_inode *ip, unsigned int mask)
{
@@ -264,9 +269,9 @@ xfs_fs_is_healthy(struct xfs_mount *mp)
}
static inline bool
-xfs_rt_is_healthy(struct xfs_mount *mp)
+xfs_rtgroup_is_healthy(struct xfs_rtgroup *rtg)
{
- return !xfs_rt_has_sickness(mp, -1U);
+ return !xfs_rtgroup_has_sickness(rtg, -1U);
}
static inline bool
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 330acf1ab39f8..44e3c027c0537 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -76,28 +76,31 @@ static int
xfs_rtbuf_get(
struct xfs_rtalloc_args *args,
xfs_fileoff_t block, /* block number in bitmap or summary */
- int issum) /* is summary not bitmap */
+ enum xfs_rtg_inodes type)
{
+ struct xfs_inode *ip = args->rtg->rtg_inodes[type];
struct xfs_mount *mp = args->mp;
struct xfs_buf **cbpp; /* cached block buffer */
xfs_fileoff_t *coffp; /* cached block number */
struct xfs_buf *bp; /* block buffer, result */
- struct xfs_inode *ip; /* bitmap or summary inode */
struct xfs_bmbt_irec map;
- enum xfs_blft type;
+ enum xfs_blft buf_type;
int nmap = 1;
int error;
- if (issum) {
+ switch (type) {
+ case XFS_RTGI_SUMMARY:
cbpp = &args->sumbp;
coffp = &args->sumoff;
- ip = args->rtg->rtg_inodes[XFS_RTGI_SUMMARY];
- type = XFS_BLFT_RTSUMMARY_BUF;
- } else {
+ buf_type = XFS_BLFT_RTSUMMARY_BUF;
+ break;
+ case XFS_RTGI_BITMAP:
cbpp = &args->rbmbp;
coffp = &args->rbmoff;
- ip = args->rtg->rtg_inodes[XFS_RTGI_BITMAP];
- type = XFS_BLFT_RTBITMAP_BUF;
+ buf_type = XFS_BLFT_RTBITMAP_BUF;
+ break;
+ default:
+ return -EINVAL;
}
/*
@@ -120,8 +123,7 @@ xfs_rtbuf_get(
return error;
if (XFS_IS_CORRUPT(mp, nmap == 0 || !xfs_bmap_is_written_extent(&map))) {
- xfs_rt_mark_sick(mp, issum ? XFS_SICK_RT_SUMMARY :
- XFS_SICK_RT_BITMAP);
+ xfs_rtginode_mark_sick(args->rtg, type);
return -EFSCORRUPTED;
}
@@ -130,12 +132,11 @@ xfs_rtbuf_get(
XFS_FSB_TO_DADDR(mp, map.br_startblock),
mp->m_bsize, 0, &bp, &xfs_rtbuf_ops);
if (xfs_metadata_is_sick(error))
- xfs_rt_mark_sick(mp, issum ? XFS_SICK_RT_SUMMARY :
- XFS_SICK_RT_BITMAP);
+ xfs_rtginode_mark_sick(args->rtg, type);
if (error)
return error;
- xfs_trans_buf_set_type(args->tp, bp, type);
+ xfs_trans_buf_set_type(args->tp, bp, buf_type);
*cbpp = bp;
*coffp = block;
return 0;
@@ -149,11 +150,11 @@ xfs_rtbitmap_read_buf(
struct xfs_mount *mp = args->mp;
if (XFS_IS_CORRUPT(mp, block >= mp->m_sb.sb_rbmblocks)) {
- xfs_rt_mark_sick(mp, XFS_SICK_RT_BITMAP);
+ xfs_rtginode_mark_sick(args->rtg, XFS_RTGI_BITMAP);
return -EFSCORRUPTED;
}
- return xfs_rtbuf_get(args, block, 0);
+ return xfs_rtbuf_get(args, block, XFS_RTGI_BITMAP);
}
int
@@ -164,10 +165,10 @@ xfs_rtsummary_read_buf(
struct xfs_mount *mp = args->mp;
if (XFS_IS_CORRUPT(mp, block >= mp->m_rsumblocks)) {
- xfs_rt_mark_sick(args->mp, XFS_SICK_RT_SUMMARY);
+ xfs_rtginode_mark_sick(args->rtg, XFS_RTGI_SUMMARY);
return -EFSCORRUPTED;
}
- return xfs_rtbuf_get(args, block, 1);
+ return xfs_rtbuf_get(args, block, XFS_RTGI_SUMMARY);
}
/*
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 89194a66267e2..3cb08f5cfc260 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -316,6 +316,8 @@ struct xfs_rtginode_ops {
enum xfs_metafile_type metafile_type;
+ unsigned int sick; /* rtgroup sickness flag */
+
/* Does the fs have this feature? */
bool (*enabled)(struct xfs_mount *mp);
@@ -330,11 +332,13 @@ static const struct xfs_rtginode_ops xfs_rtginode_ops[XFS_RTGI_MAX] = {
[XFS_RTGI_BITMAP] = {
.name = "bitmap",
.metafile_type = XFS_METAFILE_RTBITMAP,
+ .sick = XFS_SICK_RG_BITMAP,
.create = xfs_rtbitmap_create,
},
[XFS_RTGI_SUMMARY] = {
.name = "summary",
.metafile_type = XFS_METAFILE_RTSUMMARY,
+ .sick = XFS_SICK_RG_SUMMARY,
.create = xfs_rtsummary_create,
},
};
@@ -368,6 +372,17 @@ xfs_rtginode_enabled(
return ops->enabled(rtg->rtg_mount);
}
+/* Mark an rtgroup inode sick */
+void
+xfs_rtginode_mark_sick(
+ struct xfs_rtgroup *rtg,
+ enum xfs_rtg_inodes type)
+{
+ const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
+
+ xfs_rtgroup_mark_sick(rtg, ops->sick);
+}
+
/* Load an existing rtgroup inode into the rtgroup structure. */
int
xfs_rtginode_load(
@@ -403,8 +418,10 @@ xfs_rtginode_load(
} else {
const char *path;
- if (!mp->m_rtdirip)
+ if (!mp->m_rtdirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
path = xfs_rtginode_path(rtg->rtg_rgno, type);
if (!path)
@@ -414,17 +431,22 @@ xfs_rtginode_load(
kfree(path);
}
- if (error)
+ if (error) {
+ if (xfs_metadata_is_sick(error))
+ xfs_rtginode_mark_sick(rtg, type);
return error;
+ }
if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
xfs_irele(ip);
+ xfs_rtginode_mark_sick(rtg, type);
return -EFSCORRUPTED;
}
if (XFS_IS_CORRUPT(mp, ip->i_projid != rtg->rtg_rgno)) {
xfs_irele(ip);
+ xfs_rtginode_mark_sick(rtg, type);
return -EFSCORRUPTED;
}
@@ -461,8 +483,10 @@ xfs_rtginode_create(
if (!xfs_rtginode_enabled(rtg, type))
return 0;
- if (!mp->m_rtdirip)
+ if (!mp->m_rtdirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
upd.path = xfs_rtginode_path(rtg->rtg_rgno, type);
if (!upd.path)
@@ -509,8 +533,10 @@ int
xfs_rtginode_mkdir_parent(
struct xfs_mount *mp)
{
- if (!mp->m_metadirip)
+ if (!mp->m_metadirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
return xfs_metadir_mkdir(mp->m_metadirip, "rtgroups", &mp->m_rtdirip);
}
@@ -522,8 +548,10 @@ xfs_rtginode_load_parent(
{
struct xfs_mount *mp = tp->t_mountp;
- if (!mp->m_metadirip)
+ if (!mp->m_metadirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
return -EFSCORRUPTED;
+ }
return xfs_metadir_load(tp, mp->m_metadirip, "rtgroups",
XFS_METAFILE_DIR, &mp->m_rtdirip);
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index a18ea0aca3db1..f51f1a7592775 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -36,6 +36,14 @@ struct xfs_rtgroup {
/* Number of rt extents in this group */
xfs_rtxnum_t rtg_extents;
+ /*
+ * Bitsets of per-rtgroup metadata that have been checked and/or are
+ * sick. Callers should hold rtg_state_lock before accessing this
+ * field.
+ */
+ uint16_t rtg_checked;
+ uint16_t rtg_sick;
+
/*
* Optional cache of rt summary level per bitmap block with the
* invariant that rtg_rsum_cache[bbno] > the maximum i for which
@@ -247,6 +255,7 @@ int xfs_rtginode_load_parent(struct xfs_trans *tp);
const char *xfs_rtginode_name(enum xfs_rtg_inodes type);
enum xfs_metafile_type xfs_rtginode_metafile_type(enum xfs_rtg_inodes type);
bool xfs_rtginode_enabled(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type);
+void xfs_rtginode_mark_sick(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type);
int xfs_rtginode_load(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type,
struct xfs_trans *tp);
int xfs_rtginode_create(struct xfs_rtgroup *rtg, enum xfs_rtg_inodes type,
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index e202d84ec5140..a0a721ae5763d 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -12,6 +12,7 @@
#include "xfs_btree.h"
#include "xfs_ag.h"
#include "xfs_health.h"
+#include "xfs_rtgroup.h"
#include "scrub/scrub.h"
#include "scrub/health.h"
#include "scrub/common.h"
@@ -71,9 +72,9 @@
enum xchk_health_group {
XHG_FS = 1,
- XHG_RT,
XHG_AG,
XHG_INO,
+ XHG_RTGROUP,
};
struct xchk_health_map {
@@ -100,8 +101,8 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
[XFS_SCRUB_TYPE_XATTR] = { XHG_INO, XFS_SICK_INO_XATTR },
[XFS_SCRUB_TYPE_SYMLINK] = { XHG_INO, XFS_SICK_INO_SYMLINK },
[XFS_SCRUB_TYPE_PARENT] = { XHG_INO, XFS_SICK_INO_PARENT },
- [XFS_SCRUB_TYPE_RTBITMAP] = { XHG_RT, XFS_SICK_RT_BITMAP },
- [XFS_SCRUB_TYPE_RTSUM] = { XHG_RT, XFS_SICK_RT_SUMMARY },
+ [XFS_SCRUB_TYPE_RTBITMAP] = { XHG_RTGROUP, XFS_SICK_RG_BITMAP },
+ [XFS_SCRUB_TYPE_RTSUM] = { XHG_RTGROUP, XFS_SICK_RG_SUMMARY },
[XFS_SCRUB_TYPE_UQUOTA] = { XHG_FS, XFS_SICK_FS_UQUOTA },
[XFS_SCRUB_TYPE_GQUOTA] = { XHG_FS, XFS_SICK_FS_GQUOTA },
[XFS_SCRUB_TYPE_PQUOTA] = { XHG_FS, XFS_SICK_FS_PQUOTA },
@@ -162,12 +163,15 @@ xchk_mark_all_healthy(
struct xfs_mount *mp)
{
struct xfs_perag *pag;
+ struct xfs_rtgroup *rtg;
xfs_agnumber_t agno;
+ xfs_rgnumber_t rgno;
xfs_fs_mark_healthy(mp, XFS_SICK_FS_INDIRECT);
- xfs_rt_mark_healthy(mp, XFS_SICK_RT_INDIRECT);
for_each_perag(mp, agno, pag)
xfs_ag_mark_healthy(pag, XFS_SICK_AG_INDIRECT);
+ for_each_rtgroup(mp, rgno, rtg)
+ xfs_rtgroup_mark_healthy(rtg, XFS_SICK_RG_INDIRECT);
}
/*
@@ -185,6 +189,7 @@ xchk_update_health(
struct xfs_scrub *sc)
{
struct xfs_perag *pag;
+ struct xfs_rtgroup *rtg;
bool bad;
/*
@@ -237,11 +242,13 @@ xchk_update_health(
else
xfs_fs_mark_healthy(sc->mp, sc->sick_mask);
break;
- case XHG_RT:
+ case XHG_RTGROUP:
+ rtg = xfs_rtgroup_get(sc->mp, sc->sm->sm_agno);
if (bad)
- xfs_rt_mark_corrupt(sc->mp, sc->sick_mask);
+ xfs_rtgroup_mark_corrupt(rtg, sc->sick_mask);
else
- xfs_rt_mark_healthy(sc->mp, sc->sick_mask);
+ xfs_rtgroup_mark_healthy(rtg, sc->sick_mask);
+ xfs_rtgroup_put(rtg);
break;
default:
ASSERT(0);
@@ -296,7 +303,9 @@ xchk_health_record(
{
struct xfs_mount *mp = sc->mp;
struct xfs_perag *pag;
+ struct xfs_rtgroup *rtg;
xfs_agnumber_t agno;
+ xfs_rgnumber_t rgno;
unsigned int sick;
unsigned int checked;
@@ -305,15 +314,17 @@ xchk_health_record(
if (sick & XFS_SICK_FS_PRIMARY)
xchk_set_corrupt(sc);
- xfs_rt_measure_sickness(mp, &sick, &checked);
- if (sick & XFS_SICK_RT_PRIMARY)
- xchk_set_corrupt(sc);
-
for_each_perag(mp, agno, pag) {
xfs_ag_measure_sickness(pag, &sick, &checked);
if (sick & XFS_SICK_AG_PRIMARY)
xchk_set_corrupt(sc);
}
+ for_each_rtgroup(mp, rgno, rtg) {
+ xfs_rtgroup_measure_sickness(rtg, &sick, &checked);
+ if (sick & XFS_SICK_RG_PRIMARY)
+ xchk_set_corrupt(sc);
+ }
+
return 0;
}
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index cb43bd11dcac5..e94a5ede103d4 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -18,6 +18,7 @@
#include "xfs_da_format.h"
#include "xfs_da_btree.h"
#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
/*
* Warn about metadata corruption that we detected but haven't fixed, and
@@ -29,7 +30,9 @@ xfs_health_unmount(
struct xfs_mount *mp)
{
struct xfs_perag *pag;
+ struct xfs_rtgroup *rtg;
xfs_agnumber_t agno;
+ xfs_rgnumber_t rgno;
unsigned int sick = 0;
unsigned int checked = 0;
bool warn = false;
@@ -46,11 +49,13 @@ xfs_health_unmount(
}
}
- /* Measure realtime volume corruption levels. */
- xfs_rt_measure_sickness(mp, &sick, &checked);
- if (sick) {
- trace_xfs_rt_unfixed_corruption(mp, sick);
- warn = true;
+ /* Measure realtime group corruption levels. */
+ for_each_rtgroup(mp, rgno, rtg) {
+ xfs_rtgroup_measure_sickness(rtg, &sick, &checked);
+ if (sick) {
+ trace_xfs_rtgroup_unfixed_corruption(rtg, sick);
+ warn = true;
+ }
}
/*
@@ -150,65 +155,6 @@ xfs_fs_measure_sickness(
spin_unlock(&mp->m_sb_lock);
}
-/* Mark unhealthy realtime metadata. */
-void
-xfs_rt_mark_sick(
- struct xfs_mount *mp,
- unsigned int mask)
-{
- ASSERT(!(mask & ~XFS_SICK_RT_ALL));
- trace_xfs_rt_mark_sick(mp, mask);
-
- spin_lock(&mp->m_sb_lock);
- mp->m_rt_sick |= mask;
- spin_unlock(&mp->m_sb_lock);
-}
-
-/* Mark realtime metadata as having been checked and found unhealthy by fsck. */
-void
-xfs_rt_mark_corrupt(
- struct xfs_mount *mp,
- unsigned int mask)
-{
- ASSERT(!(mask & ~XFS_SICK_RT_ALL));
- trace_xfs_rt_mark_corrupt(mp, mask);
-
- spin_lock(&mp->m_sb_lock);
- mp->m_rt_sick |= mask;
- mp->m_rt_checked |= mask;
- spin_unlock(&mp->m_sb_lock);
-}
-
-/* Mark a realtime metadata healed. */
-void
-xfs_rt_mark_healthy(
- struct xfs_mount *mp,
- unsigned int mask)
-{
- ASSERT(!(mask & ~XFS_SICK_RT_ALL));
- trace_xfs_rt_mark_healthy(mp, mask);
-
- spin_lock(&mp->m_sb_lock);
- mp->m_rt_sick &= ~mask;
- if (!(mp->m_rt_sick & XFS_SICK_RT_PRIMARY))
- mp->m_rt_sick &= ~XFS_SICK_RT_SECONDARY;
- mp->m_rt_checked |= mask;
- spin_unlock(&mp->m_sb_lock);
-}
-
-/* Sample which realtime metadata are unhealthy. */
-void
-xfs_rt_measure_sickness(
- struct xfs_mount *mp,
- unsigned int *sick,
- unsigned int *checked)
-{
- spin_lock(&mp->m_sb_lock);
- *sick = mp->m_rt_sick;
- *checked = mp->m_rt_checked;
- spin_unlock(&mp->m_sb_lock);
-}
-
/* Mark unhealthy per-ag metadata given a raw AG number. */
void
xfs_agno_mark_sick(
@@ -285,6 +231,82 @@ xfs_ag_measure_sickness(
spin_unlock(&pag->pag_state_lock);
}
+/* Mark unhealthy per-rtgroup metadata given a raw rt group number. */
+void
+xfs_rgno_mark_sick(
+ struct xfs_mount *mp,
+ xfs_rgnumber_t rgno,
+ unsigned int mask)
+{
+ struct xfs_rtgroup *rtg = xfs_rtgroup_get(mp, rgno);
+
+ /* per-rtgroup structure not set up yet? */
+ if (!rtg)
+ return;
+
+ xfs_rtgroup_mark_sick(rtg, mask);
+ xfs_rtgroup_put(rtg);
+}
+
+/* Mark unhealthy per-rtgroup metadata. */
+void
+xfs_rtgroup_mark_sick(
+ struct xfs_rtgroup *rtg,
+ unsigned int mask)
+{
+ ASSERT(!(mask & ~XFS_SICK_RG_ALL));
+ trace_xfs_rtgroup_mark_sick(rtg, mask);
+
+ spin_lock(&rtg->rtg_state_lock);
+ rtg->rtg_sick |= mask;
+ spin_unlock(&rtg->rtg_state_lock);
+}
+
+/* Mark rtgroup metadata as having been checked and found unhealthy by fsck. */
+void
+xfs_rtgroup_mark_corrupt(
+ struct xfs_rtgroup *rtg,
+ unsigned int mask)
+{
+ ASSERT(!(mask & ~XFS_SICK_RG_ALL));
+ trace_xfs_rtgroup_mark_corrupt(rtg, mask);
+
+ spin_lock(&rtg->rtg_state_lock);
+ rtg->rtg_sick |= mask;
+ rtg->rtg_checked |= mask;
+ spin_unlock(&rtg->rtg_state_lock);
+}
+
+/* Mark per-rtgroup metadata ok. */
+void
+xfs_rtgroup_mark_healthy(
+ struct xfs_rtgroup *rtg,
+ unsigned int mask)
+{
+ ASSERT(!(mask & ~XFS_SICK_RG_ALL));
+ trace_xfs_rtgroup_mark_healthy(rtg, mask);
+
+ spin_lock(&rtg->rtg_state_lock);
+ rtg->rtg_sick &= ~mask;
+ if (!(rtg->rtg_sick & XFS_SICK_RG_PRIMARY))
+ rtg->rtg_sick &= ~XFS_SICK_RG_SECONDARY;
+ rtg->rtg_checked |= mask;
+ spin_unlock(&rtg->rtg_state_lock);
+}
+
+/* Sample which per-rtgroup metadata are unhealthy. */
+void
+xfs_rtgroup_measure_sickness(
+ struct xfs_rtgroup *rtg,
+ unsigned int *sick,
+ unsigned int *checked)
+{
+ spin_lock(&rtg->rtg_state_lock);
+ *sick = rtg->rtg_sick;
+ *checked = rtg->rtg_checked;
+ spin_unlock(&rtg->rtg_state_lock);
+}
+
/* Mark the unhealthy parts of an inode. */
void
xfs_inode_mark_sick(
@@ -384,8 +406,8 @@ static const struct ioctl_sick_map fs_map[] = {
};
static const struct ioctl_sick_map rt_map[] = {
- { XFS_SICK_RT_BITMAP, XFS_FSOP_GEOM_SICK_RT_BITMAP },
- { XFS_SICK_RT_SUMMARY, XFS_FSOP_GEOM_SICK_RT_SUMMARY },
+ { XFS_SICK_RG_BITMAP, XFS_FSOP_GEOM_SICK_RT_BITMAP },
+ { XFS_SICK_RG_SUMMARY, XFS_FSOP_GEOM_SICK_RT_SUMMARY },
};
static inline void
@@ -410,6 +432,8 @@ xfs_fsop_geom_health(
const struct ioctl_sick_map *m;
unsigned int sick;
unsigned int checked;
+ struct xfs_rtgroup *rtg;
+ xfs_rgnumber_t rgno;
geo->sick = 0;
geo->checked = 0;
@@ -418,9 +442,11 @@ xfs_fsop_geom_health(
for_each_sick_map(fs_map, m)
xfgeo_health_tick(geo, sick, checked, m);
- xfs_rt_measure_sickness(mp, &sick, &checked);
- for_each_sick_map(rt_map, m)
- xfgeo_health_tick(geo, sick, checked, m);
+ for_each_rtgroup(mp, rgno, rtg) {
+ xfs_rtgroup_measure_sickness(rtg, &sick, &checked);
+ for_each_sick_map(rt_map, m)
+ xfgeo_health_tick(geo, sick, checked, m);
+ }
}
static const struct ioctl_sick_map ag_map[] = {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 4401a7c6230df..43bfa0d51c7d6 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4224,10 +4224,6 @@ DEFINE_FS_CORRUPT_EVENT(xfs_fs_mark_sick);
DEFINE_FS_CORRUPT_EVENT(xfs_fs_mark_corrupt);
DEFINE_FS_CORRUPT_EVENT(xfs_fs_mark_healthy);
DEFINE_FS_CORRUPT_EVENT(xfs_fs_unfixed_corruption);
-DEFINE_FS_CORRUPT_EVENT(xfs_rt_mark_sick);
-DEFINE_FS_CORRUPT_EVENT(xfs_rt_mark_corrupt);
-DEFINE_FS_CORRUPT_EVENT(xfs_rt_mark_healthy);
-DEFINE_FS_CORRUPT_EVENT(xfs_rt_unfixed_corruption);
DECLARE_EVENT_CLASS(xfs_ag_corrupt_class,
TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, unsigned int flags),
@@ -4256,6 +4252,32 @@ DEFINE_AG_CORRUPT_EVENT(xfs_ag_mark_corrupt);
DEFINE_AG_CORRUPT_EVENT(xfs_ag_mark_healthy);
DEFINE_AG_CORRUPT_EVENT(xfs_ag_unfixed_corruption);
+DECLARE_EVENT_CLASS(xfs_rtgroup_corrupt_class,
+ TP_PROTO(struct xfs_rtgroup *rtg, unsigned int flags),
+ TP_ARGS(rtg, flags),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_rgnumber_t, rgno)
+ __field(unsigned int, flags)
+ ),
+ TP_fast_assign(
+ __entry->dev = rtg->rtg_mount->m_super->s_dev;
+ __entry->rgno = rtg->rtg_rgno;
+ __entry->flags = flags;
+ ),
+ TP_printk("dev %d:%d rgno 0x%x flags 0x%x",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->rgno, __entry->flags)
+);
+#define DEFINE_RTGROUP_CORRUPT_EVENT(name) \
+DEFINE_EVENT(xfs_rtgroup_corrupt_class, name, \
+ TP_PROTO(struct xfs_rtgroup *rtg, unsigned int flags), \
+ TP_ARGS(rtg, flags))
+DEFINE_RTGROUP_CORRUPT_EVENT(xfs_rtgroup_mark_sick);
+DEFINE_RTGROUP_CORRUPT_EVENT(xfs_rtgroup_mark_corrupt);
+DEFINE_RTGROUP_CORRUPT_EVENT(xfs_rtgroup_mark_healthy);
+DEFINE_RTGROUP_CORRUPT_EVENT(xfs_rtgroup_unfixed_corruption);
+
DECLARE_EVENT_CLASS(xfs_inode_corrupt_class,
TP_PROTO(struct xfs_inode *ip, unsigned int flags),
TP_ARGS(ip, flags),
* [PATCH 10/26] xfs: export the geometry of realtime groups to userspace
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:23 ` [PATCH 09/26] xfs: record rt group metadata errors in the health system Darrick J. Wong
@ 2024-08-23 0:23 ` Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks Darrick J. Wong
` (15 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:23 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Create an ioctl so that the kernel can report the status of realtime
groups to userspace.
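The translation from in-core XFS_SICK_RG_* bits to the ioctl's
XFS_RTGROUP_GEOM_SICK_* flags follows the same sick_map pattern as the
AG geometry ioctl. A standalone sketch of that translation, with the bit
values taken from this patch but simplified types and invented helper
names:

```c
#include <assert.h>

/* Bit values as defined in this patch. */
#define XFS_SICK_RG_SUPER             (1U << 0)
#define XFS_SICK_RG_BITMAP            (1U << 1)
#define XFS_SICK_RG_SUMMARY           (1U << 2)
#define XFS_RTGROUP_GEOM_SICK_SUPER   (1U << 0)
#define XFS_RTGROUP_GEOM_SICK_BITMAP  (1U << 1)
#define XFS_RTGROUP_GEOM_SICK_SUMMARY (1U << 2)

struct sick_map {
	unsigned int sick_mask;  /* in-core health bit */
	unsigned int ioctl_mask; /* bit reported to userspace */
};

static const struct sick_map rtgroup_map[] = {
	{ XFS_SICK_RG_SUPER,   XFS_RTGROUP_GEOM_SICK_SUPER },
	{ XFS_SICK_RG_BITMAP,  XFS_RTGROUP_GEOM_SICK_BITMAP },
	{ XFS_SICK_RG_SUMMARY, XFS_RTGROUP_GEOM_SICK_SUMMARY },
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Translate in-core sick/checked bits to the ioctl's geometry flags. */
static void fill_geom_health(unsigned int sick, unsigned int checked,
			     unsigned int *rg_sick, unsigned int *rg_checked)
{
	const struct sick_map *m;

	*rg_sick = 0;
	*rg_checked = 0;
	for (m = rtgroup_map; m < rtgroup_map + ARRAY_SIZE(rtgroup_map); m++) {
		if (checked & m->sick_mask)
			*rg_checked |= m->ioctl_mask;
		if (sick & m->sick_mask)
			*rg_sick |= m->ioctl_mask;
	}
}
```

Keeping the in-core and ioctl namespaces separate through a map table
means the on-disk/ABI bit assignments can stay stable even if the
in-core health flags are renumbered later.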
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 17 +++++++++++++++++
fs/xfs/libxfs/xfs_health.h | 2 ++
fs/xfs/libxfs/xfs_rtgroup.c | 15 +++++++++++++++
fs/xfs/libxfs/xfs_rtgroup.h | 4 ++++
fs/xfs/xfs_health.c | 28 ++++++++++++++++++++++++++++
fs/xfs/xfs_ioctl.c | 33 +++++++++++++++++++++++++++++++++
6 files changed, 99 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 57819fea064e7..2dacc19723c37 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -971,6 +971,22 @@ struct xfs_getparents_by_handle {
struct xfs_getparents gph_request;
};
+/*
+ * Output for XFS_IOC_RTGROUP_GEOMETRY
+ */
+struct xfs_rtgroup_geometry {
+ __u32 rg_number; /* i/o: rtgroup number */
+ __u32 rg_length; /* o: length in blocks */
+ __u32 rg_capacity; /* o: usable capacity in blocks */
+ __u32 rg_sick; /* o: sick things in rtgroup */
+ __u32 rg_checked; /* o: checked metadata in rtgroup */
+ __u32 rg_flags; /* i/o: flags for this rtgroup */
+ __u64 rg_reserved[13]; /* o: zero */
+};
+#define XFS_RTGROUP_GEOM_SICK_SUPER (1U << 0) /* superblock */
+#define XFS_RTGROUP_GEOM_SICK_BITMAP (1U << 1) /* rtbitmap */
+#define XFS_RTGROUP_GEOM_SICK_SUMMARY (1U << 2) /* rtsummary */
+
/*
* ioctl commands that are used by Linux filesystems
*/
@@ -1009,6 +1025,7 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_GETPARENTS _IOWR('X', 62, struct xfs_getparents)
#define XFS_IOC_GETPARENTS_BY_HANDLE _IOWR('X', 63, struct xfs_getparents_by_handle)
#define XFS_IOC_SCRUBV_METADATA _IOWR('X', 64, struct xfs_scrub_vec_head)
+#define XFS_IOC_RTGROUP_GEOMETRY _IOWR('X', 65, struct xfs_rtgroup_geometry)
/*
* ioctl commands that replace IRIX syssgi()'s
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 7e77e2df9704a..2da64555434d5 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -288,6 +288,8 @@ xfs_inode_is_healthy(struct xfs_inode *ip)
void xfs_fsop_geom_health(struct xfs_mount *mp, struct xfs_fsop_geom *geo);
void xfs_ag_geom_health(struct xfs_perag *pag, struct xfs_ag_geometry *ageo);
+void xfs_rtgroup_geom_health(struct xfs_rtgroup *rtg,
+ struct xfs_rtgroup_geometry *rgeo);
void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
#define xfs_metadata_is_sick(error) \
diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
index 3cb08f5cfc260..df70015c68dd0 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.c
+++ b/fs/xfs/libxfs/xfs_rtgroup.c
@@ -259,6 +259,21 @@ xfs_rtgroup_trans_join(
}
}
+/* Retrieve rt group geometry. */
+int
+xfs_rtgroup_get_geometry(
+ struct xfs_rtgroup *rtg,
+ struct xfs_rtgroup_geometry *rgeo)
+{
+ /* Fill out form. */
+ memset(rgeo, 0, sizeof(*rgeo));
+ rgeo->rg_number = rtg->rtg_rgno;
+ rgeo->rg_length = rtg->rtg_extents * rtg->rtg_mount->m_sb.sb_rextsize;
+ rgeo->rg_capacity = rgeo->rg_length;
+ xfs_rtgroup_geom_health(rtg, rgeo);
+ return 0;
+}
+
#ifdef CONFIG_PROVE_LOCKING
static struct lock_class_key xfs_rtginode_lock_class;
diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
index f51f1a7592775..4525aaa26efc2 100644
--- a/fs/xfs/libxfs/xfs_rtgroup.h
+++ b/fs/xfs/libxfs/xfs_rtgroup.h
@@ -249,6 +249,9 @@ void xfs_rtgroup_unlock(struct xfs_rtgroup *rtg, unsigned int rtglock_flags);
void xfs_rtgroup_trans_join(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
unsigned int rtglock_flags);
+int xfs_rtgroup_get_geometry(struct xfs_rtgroup *rtg,
+ struct xfs_rtgroup_geometry *rgeo);
+
int xfs_rtginode_mkdir_parent(struct xfs_mount *mp);
int xfs_rtginode_load_parent(struct xfs_trans *tp);
@@ -279,6 +282,7 @@ struct xfs_buf *xfs_log_rtsb(struct xfs_trans *tp,
# define xfs_rtgroup_trans_join(tp, rtg, gf) ((void)0)
# define xfs_update_rtsb(bp, sb_bp) ((void)0)
# define xfs_log_rtsb(tp, sb_bp) (NULL)
+# define xfs_rtgroup_get_geometry(rtg, rgeo) (-EOPNOTSUPP)
#endif /* CONFIG_XFS_RT */
#endif /* __LIBXFS_RTGROUP_H */
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index e94a5ede103d4..b3d288df4ca20 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -485,6 +485,34 @@ xfs_ag_geom_health(
}
}
+static const struct ioctl_sick_map rtgroup_map[] = {
+ { XFS_SICK_RG_SUPER, XFS_RTGROUP_GEOM_SICK_SUPER },
+ { XFS_SICK_RG_BITMAP, XFS_RTGROUP_GEOM_SICK_BITMAP },
+ { XFS_SICK_RG_SUMMARY, XFS_RTGROUP_GEOM_SICK_SUMMARY },
+};
+
+/* Fill out rtgroup geometry health info. */
+void
+xfs_rtgroup_geom_health(
+ struct xfs_rtgroup *rtg,
+ struct xfs_rtgroup_geometry *rgeo)
+{
+ const struct ioctl_sick_map *m;
+ unsigned int sick;
+ unsigned int checked;
+
+ rgeo->rg_sick = 0;
+ rgeo->rg_checked = 0;
+
+ xfs_rtgroup_measure_sickness(rtg, &sick, &checked);
+ for_each_sick_map(rtgroup_map, m) {
+ if (checked & m->sick_mask)
+ rgeo->rg_checked |= m->ioctl_mask;
+ if (sick & m->sick_mask)
+ rgeo->rg_sick |= m->ioctl_mask;
+ }
+}
+
static const struct ioctl_sick_map ino_map[] = {
{ XFS_SICK_INO_CORE, XFS_BS_SICK_INODE },
{ XFS_SICK_INO_BMBTD, XFS_BS_SICK_BMBTD },
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index c5526434f66fd..6f5cd06267873 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -40,6 +40,7 @@
#include "xfs_file.h"
#include "xfs_exchrange.h"
#include "xfs_handle.h"
+#include "xfs_rtgroup.h"
#include <linux/mount.h>
#include <linux/fileattr.h>
@@ -403,6 +404,36 @@ xfs_ioc_ag_geometry(
return 0;
}
+STATIC int
+xfs_ioc_rtgroup_geometry(
+ struct xfs_mount *mp,
+ void __user *arg)
+{
+ struct xfs_rtgroup *rtg;
+ struct xfs_rtgroup_geometry rgeo;
+ int error;
+
+ if (copy_from_user(&rgeo, arg, sizeof(rgeo)))
+ return -EFAULT;
+ if (rgeo.rg_flags)
+ return -EINVAL;
+ if (memchr_inv(&rgeo.rg_reserved, 0, sizeof(rgeo.rg_reserved)))
+ return -EINVAL;
+
+ rtg = xfs_rtgroup_get(mp, rgeo.rg_number);
+ if (!rtg)
+ return -EINVAL;
+
+ error = xfs_rtgroup_get_geometry(rtg, &rgeo);
+ xfs_rtgroup_put(rtg);
+ if (error)
+ return error;
+
+ if (copy_to_user(arg, &rgeo, sizeof(rgeo)))
+ return -EFAULT;
+ return 0;
+}
+
/*
* Linux extended inode flags interface.
*/
@@ -1225,6 +1256,8 @@ xfs_file_ioctl(
case XFS_IOC_AG_GEOMETRY:
return xfs_ioc_ag_geometry(mp, arg);
+ case XFS_IOC_RTGROUP_GEOMETRY:
+ return xfs_ioc_rtgroup_geometry(mp, arg);
case XFS_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *)arg);
* [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (9 preceding siblings ...)
2024-08-23 0:23 ` [PATCH 10/26] xfs: export the geometry of realtime groups to userspace Darrick J. Wong
@ 2024-08-23 0:24 ` Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 12/26] xfs: encode the rtbitmap in big endian format Darrick J. Wong
` (14 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:24 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Upgrade rtbitmap and rtsummary blocks to have self-describing metadata
like most other structures in XFS.
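The verification logic the new header enables can be sketched in
userspace roughly as below. The struct mirrors xfs_rtbuf_blkinfo from
the patch (48 bytes, matching the XFS_CHECK_STRUCT_SIZE assertion), but
the helper is illustrative: real on-disk fields are big-endian and the
kernel also checks the CRC and LSN, both elided here.

```c
#include <stdint.h>
#include <string.h>

#define XFS_RTBITMAP_MAGIC  0x424D505AU /* BMPZ */
#define XFS_RTSUMMARY_MAGIC 0x53554D59U /* SUMY */

/* Host-endian model of struct xfs_rtbuf_blkinfo. */
struct rtbuf_blkinfo {
	uint32_t rt_magic; /* validity check on block */
	uint32_t rt_crc;   /* CRC of block */
	uint64_t rt_owner; /* inode that owns the block */
	uint64_t rt_blkno; /* first block of the buffer */
	uint64_t rt_lsn;   /* sequence number of last write */
	uint8_t  rt_uuid[16]; /* filesystem we belong to */
};

/*
 * Self-describing metadata checks: the magic identifies the block type,
 * the UUID ties it to this filesystem, and the block's own address
 * stamped into the header catches misdirected writes.
 */
static int rtbuf_hdr_ok(const struct rtbuf_blkinfo *hdr,
			uint32_t want_magic,
			const uint8_t fs_uuid[16], uint64_t daddr)
{
	if (hdr->rt_magic != want_magic)
		return 0;
	if (memcmp(hdr->rt_uuid, fs_uuid, 16) != 0)
		return 0;
	if (hdr->rt_blkno != daddr)
		return 0;
	return 1;
}
```

This is the same owner/blkno/uuid triple that other v5-format XFS
metadata blocks carry, which is why the commit can reuse the generic
buffer verifier pattern.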
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 18 +++++
fs/xfs/libxfs/xfs_ondisk.h | 1
fs/xfs/libxfs/xfs_rtbitmap.c | 146 +++++++++++++++++++++++++++++++++++----
fs/xfs/libxfs/xfs_rtbitmap.h | 50 +++++++++++++
fs/xfs/libxfs/xfs_sb.c | 20 +++++
fs/xfs/libxfs/xfs_shared.h | 2 +
fs/xfs/scrub/rtsummary_repair.c | 15 +++-
fs/xfs/xfs_buf_item_recover.c | 25 ++++++-
fs/xfs/xfs_discard.c | 2 -
fs/xfs/xfs_mount.h | 3 +
fs/xfs/xfs_rtalloc.c | 2 -
11 files changed, 256 insertions(+), 28 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 9e351b19bd86e..27193a2b0ea62 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1272,6 +1272,24 @@ static inline bool xfs_dinode_has_large_extent_counts(
#define XFS_DFL_RTEXTSIZE (64 * 1024) /* 64kB */
#define XFS_MIN_RTEXTSIZE (4 * 1024) /* 4kB */
+/*
+ * Realtime bitmap and summary block headers.
+ */
+#define XFS_RTBITMAP_MAGIC 0x424D505A /* BMPZ */
+#define XFS_RTSUMMARY_MAGIC 0x53554D59 /* SUMY */
+
+struct xfs_rtbuf_blkinfo {
+ __be32 rt_magic; /* validity check on block */
+ __be32 rt_crc; /* CRC of block */
+ __be64 rt_owner; /* inode that owns the block */
+ __be64 rt_blkno; /* first block of the buffer */
+ __be64 rt_lsn; /* sequence number of last write */
+ uuid_t rt_uuid; /* filesystem we belong to */
+};
+
+#define XFS_RTBUF_CRC_OFF \
+ offsetof(struct xfs_rtbuf_blkinfo, rt_crc)
+
/*
* Dquot and dquot block format definitions
*/
diff --git a/fs/xfs/libxfs/xfs_ondisk.h b/fs/xfs/libxfs/xfs_ondisk.h
index 38b314113d8f2..6a2bcbc392842 100644
--- a/fs/xfs/libxfs/xfs_ondisk.h
+++ b/fs/xfs/libxfs/xfs_ondisk.h
@@ -76,6 +76,7 @@ xfs_check_ondisk_structs(void)
/* realtime structures */
XFS_CHECK_STRUCT_SIZE(union xfs_rtword_raw, 4);
XFS_CHECK_STRUCT_SIZE(union xfs_suminfo_raw, 4);
+ XFS_CHECK_STRUCT_SIZE(struct xfs_rtbuf_blkinfo, 48);
/*
* m68k has problems with xfs_attr_leaf_name_remote_t, but we pad it to
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 44e3c027c0537..dfac0e89409a9 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -21,28 +21,84 @@
#include "xfs_rtbitmap.h"
#include "xfs_health.h"
#include "xfs_sb.h"
+#include "xfs_log.h"
+#include "xfs_buf_item.h"
/*
* Realtime allocator bitmap functions shared with userspace.
*/
-/*
- * Real time buffers need verifiers to avoid runtime warnings during IO.
- * We don't have anything to verify, however, so these are just dummy
- * operations.
- */
+static xfs_failaddr_t
+xfs_rtbuf_verify(
+ struct xfs_buf *bp)
+{
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+
+ if (!xfs_verify_magic(bp, hdr->rt_magic))
+ return __this_address;
+ if (!xfs_has_rtgroups(mp))
+ return __this_address;
+ if (!xfs_has_crc(mp))
+ return __this_address;
+ if (!uuid_equal(&hdr->rt_uuid, &mp->m_sb.sb_meta_uuid))
+ return __this_address;
+ if (hdr->rt_blkno != cpu_to_be64(xfs_buf_daddr(bp)))
+ return __this_address;
+ return NULL;
+}
+
static void
xfs_rtbuf_verify_read(
- struct xfs_buf *bp)
+ struct xfs_buf *bp)
{
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+ xfs_failaddr_t fa;
+
+ if (!xfs_has_rtgroups(mp))
+ return;
+
+ if (!xfs_log_check_lsn(mp, be64_to_cpu(hdr->rt_lsn))) {
+ fa = __this_address;
+ goto fail;
+ }
+
+ if (!xfs_buf_verify_cksum(bp, XFS_RTBUF_CRC_OFF)) {
+ fa = __this_address;
+ goto fail;
+ }
+
+ fa = xfs_rtbuf_verify(bp);
+ if (fa)
+ goto fail;
+
return;
+fail:
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
}
static void
xfs_rtbuf_verify_write(
struct xfs_buf *bp)
{
- return;
+ struct xfs_mount *mp = bp->b_mount;
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+ struct xfs_buf_log_item *bip = bp->b_log_item;
+ xfs_failaddr_t fa;
+
+ if (!xfs_has_rtgroups(mp))
+ return;
+
+ fa = xfs_rtbuf_verify(bp);
+ if (fa) {
+ xfs_verifier_error(bp, -EFSCORRUPTED, fa);
+ return;
+ }
+
+ if (bip)
+ hdr->rt_lsn = cpu_to_be64(bip->bli_item.li_lsn);
+ xfs_buf_update_cksum(bp, XFS_RTBUF_CRC_OFF);
}
const struct xfs_buf_ops xfs_rtbuf_ops = {
@@ -51,6 +107,22 @@ const struct xfs_buf_ops xfs_rtbuf_ops = {
.verify_write = xfs_rtbuf_verify_write,
};
+const struct xfs_buf_ops xfs_rtbitmap_buf_ops = {
+ .name = "xfs_rtbitmap",
+ .magic = { 0, cpu_to_be32(XFS_RTBITMAP_MAGIC) },
+ .verify_read = xfs_rtbuf_verify_read,
+ .verify_write = xfs_rtbuf_verify_write,
+ .verify_struct = xfs_rtbuf_verify,
+};
+
+const struct xfs_buf_ops xfs_rtsummary_buf_ops = {
+ .name = "xfs_rtsummary",
+ .magic = { 0, cpu_to_be32(XFS_RTSUMMARY_MAGIC) },
+ .verify_read = xfs_rtbuf_verify_read,
+ .verify_write = xfs_rtbuf_verify_write,
+ .verify_struct = xfs_rtbuf_verify,
+};
+
/* Release cached rt bitmap and summary buffers. */
void
xfs_rtbuf_cache_relse(
@@ -130,12 +202,24 @@ xfs_rtbuf_get(
ASSERT(map.br_startblock != NULLFSBLOCK);
error = xfs_trans_read_buf(mp, args->tp, mp->m_ddev_targp,
XFS_FSB_TO_DADDR(mp, map.br_startblock),
- mp->m_bsize, 0, &bp, &xfs_rtbuf_ops);
+ mp->m_bsize, 0, &bp,
+ xfs_rtblock_ops(mp, type));
if (xfs_metadata_is_sick(error))
xfs_rtginode_mark_sick(args->rtg, type);
if (error)
return error;
+ if (xfs_has_rtgroups(mp)) {
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+
+ if (hdr->rt_owner != cpu_to_be64(ip->i_ino)) {
+ xfs_buf_mark_corrupt(bp);
+ xfs_trans_brelse(args->tp, bp);
+ xfs_rtginode_mark_sick(args->rtg, type);
+ return -EFSCORRUPTED;
+ }
+ }
+
xfs_trans_buf_set_type(args->tp, bp, buf_type);
*cbpp = bp;
*coffp = block;
@@ -1146,6 +1230,19 @@ xfs_rtalloc_extent_is_free(
return 0;
}
+/* Compute the number of rt extents tracked by a single bitmap block. */
+xfs_rtxnum_t
+xfs_rtbitmap_rtx_per_rbmblock(
+ struct xfs_mount *mp)
+{
+ unsigned int rbmblock_bytes = mp->m_sb.sb_blocksize;
+
+ if (xfs_has_rtgroups(mp))
+ rbmblock_bytes -= sizeof(struct xfs_rtbuf_blkinfo);
+
+ return rbmblock_bytes * NBBY;
+}
+
/*
* Compute the number of rtbitmap blocks needed to track the given number of rt
* extents.
@@ -1155,7 +1252,7 @@ xfs_rtbitmap_blockcount_len(
struct xfs_mount *mp,
xfs_rtbxlen_t rtextents)
{
- return howmany_64(rtextents, NBBY * mp->m_sb.sb_blocksize);
+ return howmany_64(rtextents, xfs_rtbitmap_rtx_per_rbmblock(mp));
}
/* How many rt extents does each rtbitmap file track? */
@@ -1192,11 +1289,12 @@ xfs_rtsummary_blockcount(
struct xfs_mount *mp,
unsigned int *rsumlevels)
{
+ xfs_rtbxlen_t rextents = xfs_rtbitmap_bitcount(mp);
unsigned long long rsumwords;
- *rsumlevels = xfs_compute_rextslog(xfs_rtbitmap_bitcount(mp)) + 1;
- rsumwords = xfs_rtbitmap_blockcount(mp) * (*rsumlevels);
- return XFS_B_TO_FSB(mp, rsumwords << XFS_WORDLOG);
+ *rsumlevels = xfs_compute_rextslog(rextents) + 1;
+ rsumwords = xfs_rtbitmap_blockcount_len(mp, rextents) * (*rsumlevels);
+ return howmany_64(rsumwords, mp->m_blockwsize);
}
static int
@@ -1248,6 +1346,7 @@ xfs_rtfile_initialize_block(
struct xfs_inode *ip = rtg->rtg_inodes[type];
struct xfs_trans *tp;
struct xfs_buf *bp;
+ void *bufdata;
const size_t copylen = mp->m_blockwsize << XFS_WORDLOG;
enum xfs_blft buf_type;
int error;
@@ -1271,13 +1370,30 @@ xfs_rtfile_initialize_block(
xfs_trans_cancel(tp);
return error;
}
+ bufdata = bp->b_addr;
xfs_trans_buf_set_type(tp, bp, buf_type);
- bp->b_ops = &xfs_rtbuf_ops;
+ bp->b_ops = xfs_rtblock_ops(mp, type);
+
+ if (xfs_has_rtgroups(mp)) {
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+
+ if (type == XFS_RTGI_BITMAP)
+ hdr->rt_magic = cpu_to_be32(XFS_RTBITMAP_MAGIC);
+ else
+ hdr->rt_magic = cpu_to_be32(XFS_RTSUMMARY_MAGIC);
+ hdr->rt_owner = cpu_to_be64(ip->i_ino);
+ hdr->rt_blkno = cpu_to_be64(XFS_FSB_TO_DADDR(mp, fsbno));
+ hdr->rt_lsn = 0;
+ uuid_copy(&hdr->rt_uuid, &mp->m_sb.sb_meta_uuid);
+
+ bufdata += sizeof(*hdr);
+ }
+
if (data)
- memcpy(bp->b_addr, data, copylen);
+ memcpy(bufdata, data, copylen);
else
- memset(bp->b_addr, 0, copylen);
+ memset(bufdata, 0, copylen);
xfs_trans_log_buf(tp, bp, 0, mp->m_sb.sb_blocksize - 1);
return xfs_trans_commit(tp);
}
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index cf21ae31bfaa4..13a05dce47601 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -150,6 +150,9 @@ xfs_rtx_to_rbmblock(
struct xfs_mount *mp,
xfs_rtxnum_t rtx)
{
+ if (xfs_has_rtgroups(mp))
+ return div_u64(rtx, mp->m_rtx_per_rbmblock);
+
return rtx >> mp->m_blkbit_log;
}
@@ -159,6 +162,13 @@ xfs_rtx_to_rbmword(
struct xfs_mount *mp,
xfs_rtxnum_t rtx)
{
+ if (xfs_has_rtgroups(mp)) {
+ unsigned int mod;
+
+ div_u64_rem(rtx >> XFS_NBWORDLOG, mp->m_blockwsize, &mod);
+ return mod;
+ }
+
return (rtx >> XFS_NBWORDLOG) & (mp->m_blockwsize - 1);
}
@@ -168,6 +178,9 @@ xfs_rbmblock_to_rtx(
struct xfs_mount *mp,
xfs_fileoff_t rbmoff)
{
+ if (xfs_has_rtgroups(mp))
+ return rbmoff * mp->m_rtx_per_rbmblock;
+
return rbmoff << mp->m_blkbit_log;
}
@@ -177,7 +190,14 @@ xfs_rbmblock_wordptr(
struct xfs_rtalloc_args *args,
unsigned int index)
{
- union xfs_rtword_raw *words = args->rbmbp->b_addr;
+ struct xfs_mount *mp = args->mp;
+ union xfs_rtword_raw *words;
+ struct xfs_rtbuf_blkinfo *hdr = args->rbmbp->b_addr;
+
+ if (xfs_has_rtgroups(mp))
+ words = (union xfs_rtword_raw *)(hdr + 1);
+ else
+ words = args->rbmbp->b_addr;
return words + index;
}
@@ -227,6 +247,9 @@ xfs_rtsumoffs_to_block(
struct xfs_mount *mp,
xfs_rtsumoff_t rsumoff)
{
+ if (xfs_has_rtgroups(mp))
+ return rsumoff / mp->m_blockwsize;
+
return XFS_B_TO_FSBT(mp, rsumoff * sizeof(xfs_suminfo_t));
}
@@ -241,6 +264,9 @@ xfs_rtsumoffs_to_infoword(
{
unsigned int mask = mp->m_blockmask >> XFS_SUMINFOLOG;
+ if (xfs_has_rtgroups(mp))
+ return rsumoff % mp->m_blockwsize;
+
return rsumoff & mask;
}
@@ -250,7 +276,13 @@ xfs_rsumblock_infoptr(
struct xfs_rtalloc_args *args,
unsigned int index)
{
- union xfs_suminfo_raw *info = args->sumbp->b_addr;
+ union xfs_suminfo_raw *info;
+ struct xfs_rtbuf_blkinfo *hdr = args->sumbp->b_addr;
+
+ if (xfs_has_rtgroups(args->mp))
+ info = (union xfs_suminfo_raw *)(hdr + 1);
+ else
+ info = args->sumbp->b_addr;
return info + index;
}
@@ -279,6 +311,19 @@ xfs_suminfo_add(
return info->old;
}
+static inline const struct xfs_buf_ops *
+xfs_rtblock_ops(
+ struct xfs_mount *mp,
+ enum xfs_rtg_inodes type)
+{
+ if (xfs_has_rtgroups(mp)) {
+ if (type == XFS_RTGI_SUMMARY)
+ return &xfs_rtsummary_buf_ops;
+ return &xfs_rtbitmap_buf_ops;
+ }
+ return &xfs_rtbuf_ops;
+}
+
/*
* Functions for walking free space rtextents in the realtime bitmap.
*/
@@ -324,6 +369,7 @@ int xfs_rtfree_extent(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
int xfs_rtfree_blocks(struct xfs_trans *tp, struct xfs_rtgroup *rtg,
xfs_fsblock_t rtbno, xfs_filblks_t rtlen);
+xfs_rtxnum_t xfs_rtbitmap_rtx_per_rbmblock(struct xfs_mount *mp);
xfs_filblks_t xfs_rtbitmap_blockcount(struct xfs_mount *mp);
xfs_filblks_t xfs_rtbitmap_blockcount_len(struct xfs_mount *mp,
xfs_rtbxlen_t rtextents);
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 109be10c6e84f..f94d081f7d928 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -245,11 +245,25 @@ xfs_extents_per_rbm(
return sbp->sb_rextents;
}
+/*
+ * Return the payload size of a single rt bitmap block (without the metadata
+ * header if any).
+ */
+static inline unsigned int
+xfs_rtbmblock_size(
+ struct xfs_sb *sbp)
+{
+ if (xfs_sb_version_hasmetadir(sbp))
+ return sbp->sb_blocksize - sizeof(struct xfs_rtbuf_blkinfo);
+ return sbp->sb_blocksize;
+}
+
static uint64_t
xfs_expected_rbmblocks(
struct xfs_sb *sbp)
{
- return howmany_64(xfs_extents_per_rbm(sbp), NBBY * sbp->sb_blocksize);
+ return howmany_64(xfs_extents_per_rbm(sbp),
+ NBBY * xfs_rtbmblock_size(sbp));
}
/* Validate the realtime geometry */
@@ -1092,8 +1106,8 @@ xfs_sb_mount_common(
mp->m_sectbb_log = sbp->sb_sectlog - BBSHIFT;
mp->m_agno_log = xfs_highbit32(sbp->sb_agcount - 1) + 1;
mp->m_blockmask = sbp->sb_blocksize - 1;
- mp->m_blockwsize = sbp->sb_blocksize >> XFS_WORDLOG;
- mp->m_blockwmask = mp->m_blockwsize - 1;
+ mp->m_blockwsize = xfs_rtbmblock_size(sbp) >> XFS_WORDLOG;
+ mp->m_rtx_per_rbmblock = mp->m_blockwsize << XFS_NBWORDLOG;
xfs_mount_sb_set_rextsize(mp, sbp);
mp->m_alloc_mxr[0] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 1);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 0343926d2a6b4..4f5f1d3526803 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -38,6 +38,8 @@ extern const struct xfs_buf_ops xfs_inode_buf_ops;
extern const struct xfs_buf_ops xfs_inode_buf_ra_ops;
extern const struct xfs_buf_ops xfs_refcountbt_buf_ops;
extern const struct xfs_buf_ops xfs_rmapbt_buf_ops;
+extern const struct xfs_buf_ops xfs_rtbitmap_buf_ops;
+extern const struct xfs_buf_ops xfs_rtsummary_buf_ops;
extern const struct xfs_buf_ops xfs_rtbuf_ops;
extern const struct xfs_buf_ops xfs_rtsb_buf_ops;
extern const struct xfs_buf_ops xfs_sb_buf_ops;
diff --git a/fs/xfs/scrub/rtsummary_repair.c b/fs/xfs/scrub/rtsummary_repair.c
index 1688380988007..8198ea84ad70e 100644
--- a/fs/xfs/scrub/rtsummary_repair.c
+++ b/fs/xfs/scrub/rtsummary_repair.c
@@ -83,12 +83,23 @@ xrep_rtsummary_prep_buf(
ondisk = xfs_rsumblock_infoptr(&rts->args, 0);
rts->args.sumbp = NULL;
- bp->b_ops = &xfs_rtbuf_ops;
-
error = xfsum_copyout(sc, rts->prep_wordoff, ondisk, mp->m_blockwsize);
if (error)
return error;
+ if (xfs_has_rtgroups(sc->mp)) {
+ struct xfs_rtbuf_blkinfo *hdr = bp->b_addr;
+
+ hdr->rt_magic = cpu_to_be32(XFS_RTSUMMARY_MAGIC);
+ hdr->rt_owner = cpu_to_be64(sc->ip->i_ino);
+ hdr->rt_blkno = cpu_to_be64(xfs_buf_daddr(bp));
+ hdr->rt_lsn = 0;
+ uuid_copy(&hdr->rt_uuid, &sc->mp->m_sb.sb_meta_uuid);
+ bp->b_ops = &xfs_rtsummary_buf_ops;
+ } else {
+ bp->b_ops = &xfs_rtbuf_ops;
+ }
+
rts->prep_wordoff += mp->m_blockwsize;
xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_RTSUMMARY_BUF);
return 0;
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 51cb239d7924c..c55c911446728 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -23,6 +23,7 @@
#include "xfs_dir2.h"
#include "xfs_quota.h"
#include "xfs_rtgroup.h"
+#include "xfs_rtbitmap.h"
/*
* This is the number of entries in the l_buf_cancel_table used during
@@ -391,9 +392,18 @@ xlog_recover_validate_buf_type(
break;
#ifdef CONFIG_XFS_RT
case XFS_BLFT_RTBITMAP_BUF:
+ if (xfs_has_rtgroups(mp) && magic32 != XFS_RTBITMAP_MAGIC) {
+ warnmsg = "Bad rtbitmap magic!";
+ break;
+ }
+ bp->b_ops = xfs_rtblock_ops(mp, XFS_RTGI_BITMAP);
+ break;
case XFS_BLFT_RTSUMMARY_BUF:
- /* no magic numbers for verification of RT buffers */
- bp->b_ops = &xfs_rtbuf_ops;
+ if (xfs_has_rtgroups(mp) && magic32 != XFS_RTSUMMARY_MAGIC) {
+ warnmsg = "Bad rtsummary magic!";
+ break;
+ }
+ bp->b_ops = xfs_rtblock_ops(mp, XFS_RTGI_SUMMARY);
break;
#endif /* CONFIG_XFS_RT */
default:
@@ -728,11 +738,20 @@ xlog_recover_get_buf_lsn(
* UUIDs, so we must recover them immediately.
*/
blft = xfs_blft_from_flags(buf_f);
- if (blft == XFS_BLFT_RTBITMAP_BUF || blft == XFS_BLFT_RTSUMMARY_BUF)
+ if (!xfs_has_rtgroups(mp) && (blft == XFS_BLFT_RTBITMAP_BUF ||
+ blft == XFS_BLFT_RTSUMMARY_BUF))
goto recover_immediately;
magic32 = be32_to_cpu(*(__be32 *)blk);
switch (magic32) {
+ case XFS_RTSUMMARY_MAGIC:
+ case XFS_RTBITMAP_MAGIC: {
+ struct xfs_rtbuf_blkinfo *hdr = blk;
+
+ lsn = be64_to_cpu(hdr->rt_lsn);
+ uuid = &hdr->rt_uuid;
+ break;
+ }
case XFS_ABTB_CRC_MAGIC:
case XFS_ABTC_CRC_MAGIC:
case XFS_ABTB_MAGIC:
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index e1a024f68a68f..b66780cbe55bf 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -572,7 +572,7 @@ xfs_trim_rtg_extents(
* trims the extents returned.
*/
do {
- tr.stop_rtx = low + (mp->m_sb.sb_blocksize * NBBY);
+ tr.stop_rtx = low + xfs_rtbitmap_rtx_per_rbmblock(mp);
xfs_rtgroup_lock(rtg, XFS_RTGLOCK_BITMAP_SHARED);
error = xfs_rtalloc_query_range(rtg, tp, low, high,
xfs_trim_gather_rtextent, &tr);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 1da20fafcf978..c4e4f5414a299 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -116,7 +116,8 @@ typedef struct xfs_mount {
int8_t m_rgblklog; /* log2 of rt group sz if possible */
uint m_blockmask; /* sb_blocksize-1 */
uint m_blockwsize; /* sb_blocksize in words */
- uint m_blockwmask; /* blockwsize-1 */
+ /* number of rt extents per rt bitmap block if rtgroups enabled */
+ unsigned int m_rtx_per_rbmblock;
uint m_alloc_mxr[2]; /* max alloc btree records */
uint m_alloc_mnr[2]; /* min alloc btree records */
uint m_bmap_dmxr[2]; /* max bmap btree records */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index d8aa354b3bf14..6989ee1c13fa0 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -776,7 +776,7 @@ xfs_growfs_rt_nrblocks(
struct xfs_mount *mp = rtg->rtg_mount;
xfs_rfsblock_t step;
- step = (bmbno + 1) * NBBY * mp->m_sb.sb_blocksize * rextsize;
+ step = (bmbno + 1) * mp->m_rtx_per_rbmblock * rextsize;
if (xfs_has_rtgroups(mp)) {
xfs_rfsblock_t rgblocks = mp->m_sb.sb_rgextents * rextsize;
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 12/26] xfs: encode the rtbitmap in big endian format
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (10 preceding siblings ...)
2024-08-23 0:24 ` [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks Darrick J. Wong
@ 2024-08-23 0:24 ` Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 13/26] xfs: encode the rtsummary " Darrick J. Wong
` (13 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:24 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Currently, the ondisk realtime bitmap file is accessed in units of
32-bit words. There's no endian translation of the contents of this
file, which means that the Bad Things Happen(tm) if you go from (say)
x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 4 +++-
fs/xfs/libxfs/xfs_rtbitmap.h | 7 ++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 27193a2b0ea62..506f5d5ee03fe 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -715,10 +715,12 @@ struct xfs_agfl {
/*
* Realtime bitmap information is accessed by the word, which is currently
- * stored in host-endian format.
+ * stored in host-endian format. Starting with the realtime groups feature,
+ * the words are stored in be32 ondisk.
*/
union xfs_rtword_raw {
__u32 old;
+ __be32 rtg;
};
/*
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 13a05dce47601..148f7631d7fc2 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -210,6 +210,8 @@ xfs_rtbitmap_getword(
{
union xfs_rtword_raw *word = xfs_rbmblock_wordptr(args, index);
+ if (xfs_has_rtgroups(args->mp))
+ return be32_to_cpu(word->rtg);
return word->old;
}
@@ -222,7 +224,10 @@ xfs_rtbitmap_setword(
{
union xfs_rtword_raw *word = xfs_rbmblock_wordptr(args, index);
- word->old = value;
+ if (xfs_has_rtgroups(args->mp))
+ word->rtg = cpu_to_be32(value);
+ else
+ word->old = value;
}
/*
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 13/26] xfs: encode the rtsummary in big endian format
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (11 preceding siblings ...)
2024-08-23 0:24 ` [PATCH 12/26] xfs: encode the rtbitmap in big endian format Darrick J. Wong
@ 2024-08-23 0:24 ` Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled Darrick J. Wong
` (12 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:24 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Currently, the ondisk realtime summary file counters are accessed in
units of 32-bit words. There's no endian translation of the contents of
this file, which means that the Bad Things Happen(tm) if you go from
(say) x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file. Encode the summary
information in big endian format, like most of the rest of the
filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 4 +++-
fs/xfs/libxfs/xfs_rtbitmap.h | 7 +++++++
fs/xfs/scrub/rtsummary.c | 5 +++++
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 506f5d5ee03fe..cafac42cd51ad 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -725,10 +725,12 @@ union xfs_rtword_raw {
/*
* Realtime summary counts are accessed by the word, which is currently
- * stored in host-endian format.
+ * stored in host-endian format. Starting with the realtime groups feature,
+ * the words are stored in be32 ondisk.
*/
union xfs_suminfo_raw {
__u32 old;
+ __be32 rtg;
};
/*
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.h b/fs/xfs/libxfs/xfs_rtbitmap.h
index 148f7631d7fc2..d36fe2efc56f0 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.h
+++ b/fs/xfs/libxfs/xfs_rtbitmap.h
@@ -300,6 +300,8 @@ xfs_suminfo_get(
{
union xfs_suminfo_raw *info = xfs_rsumblock_infoptr(args, index);
+ if (xfs_has_rtgroups(args->mp))
+ return be32_to_cpu(info->rtg);
return info->old;
}
@@ -312,6 +314,11 @@ xfs_suminfo_add(
{
union xfs_suminfo_raw *info = xfs_rsumblock_infoptr(args, index);
+ if (xfs_has_rtgroups(args->mp)) {
+ be32_add_cpu(&info->rtg, delta);
+ return be32_to_cpu(info->rtg);
+ }
+
info->old += delta;
return info->old;
}
diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c
index 1f01ed9450388..f6779af92d57b 100644
--- a/fs/xfs/scrub/rtsummary.c
+++ b/fs/xfs/scrub/rtsummary.c
@@ -151,6 +151,11 @@ xchk_rtsum_inc(
struct xfs_mount *mp,
union xfs_suminfo_raw *v)
{
+ if (xfs_has_rtgroups(mp)) {
+ be32_add_cpu(&v->rtg, 1);
+ return be32_to_cpu(v->rtg);
+ }
+
v->old += 1;
return v->old;
}
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (12 preceding siblings ...)
2024-08-23 0:24 ` [PATCH 13/26] xfs: encode the rtsummary " Darrick J. Wong
@ 2024-08-23 0:24 ` Darrick J. Wong
2024-08-23 5:16 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 15/26] xfs: store rtgroup information with a bmap intent Darrick J. Wong
` (11 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:24 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Allow the realtime section to be grown when realtime groups are enabled.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_shared.h | 1
fs/xfs/xfs_rtalloc.c | 268 ++++++++++++++++++++++++++++++++++++++------
fs/xfs/xfs_trans.c | 9 +
fs/xfs/xfs_trans.h | 1
4 files changed, 242 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 4f5f1d3526803..b6e56daa6a147 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -160,6 +160,7 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
#define XFS_TRANS_SB_RBLOCKS 0x00000800
#define XFS_TRANS_SB_REXTENTS 0x00001000
#define XFS_TRANS_SB_REXTSLOG 0x00002000
+#define XFS_TRANS_SB_RGCOUNT 0x00004000
/*
* Here we centralize the specification of XFS meta-data buffer reference count
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 6989ee1c13fa0..3fedc552b51b0 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -766,6 +766,31 @@ xfs_growfs_rt_alloc_fake_mount(
return nmp;
}
+/* Free all the new space and return the number of extents actually freed. */
+static int
+xfs_growfs_rt_free_new(
+ struct xfs_rtgroup *rtg,
+ struct xfs_rtalloc_args *nargs,
+ xfs_rtbxlen_t *freed_rtx)
+{
+ struct xfs_mount *mp = rtg->rtg_mount;
+ xfs_rgnumber_t rgno = rtg->rtg_rgno;
+ xfs_rtxnum_t start_rtx = 0, end_rtx;
+
+ if (rgno < mp->m_sb.sb_rgcount)
+ start_rtx = xfs_rtgroup_extents(mp, rgno);
+ end_rtx = xfs_rtgroup_extents(nargs->mp, rgno);
+
+ /*
+ * Compute the first new extent that we want to free, being careful to
+ * skip past a realtime superblock at the start of the realtime volume.
+ */
+ if (xfs_has_rtsb(nargs->mp) && rgno == 0 && start_rtx == 0)
+ start_rtx++;
+ *freed_rtx = end_rtx - start_rtx;
+ return xfs_rtfree_range(nargs, start_rtx, *freed_rtx);
+}
+
static xfs_rfsblock_t
xfs_growfs_rt_nrblocks(
struct xfs_rtgroup *rtg,
@@ -786,6 +811,43 @@ xfs_growfs_rt_nrblocks(
return min(nrblocks, step);
}
+/*
+ * If the post-grow filesystem will have an rtsb; we're initializing the first
+ * rtgroup; and the filesystem didn't have a realtime section, write the rtsb
+ * now, and attach the rtsb buffer to the real mount.
+ */
+static int
+xfs_growfs_rt_init_rtsb(
+ const struct xfs_rtalloc_args *nargs,
+ const struct xfs_rtgroup *rtg,
+ const struct xfs_rtalloc_args *args)
+{
+ struct xfs_mount *mp = args->mp;
+ struct xfs_buf *rtsb_bp;
+ int error;
+
+ if (!xfs_has_rtsb(nargs->mp))
+ return 0;
+ if (rtg->rtg_rgno > 0)
+ return 0;
+ if (mp->m_sb.sb_rblocks)
+ return 0;
+
+ error = xfs_buf_get_uncached(mp->m_rtdev_targp, XFS_FSB_TO_BB(mp, 1),
+ 0, &rtsb_bp);
+ if (error)
+ return error;
+
+ rtsb_bp->b_maps[0].bm_bn = XFS_RTSB_DADDR;
+ rtsb_bp->b_ops = &xfs_rtsb_buf_ops;
+
+ xfs_update_rtsb(rtsb_bp, mp->m_sb_bp);
+ mp->m_rtsb_bp = rtsb_bp;
+ error = xfs_bwrite(rtsb_bp);
+ xfs_buf_unlock(rtsb_bp);
+ return error;
+}
+
static int
xfs_growfs_rt_bmblock(
struct xfs_rtgroup *rtg,
@@ -808,7 +870,8 @@ xfs_growfs_rt_bmblock(
int error;
/*
- * Calculate new sb and mount fields for this round.
+ * Calculate new sb and mount fields for this round. Also ensure the
+ * rtg_extents value is uptodate as the rtbitmap code relies on it.
*/
nmp = nargs.mp = xfs_growfs_rt_alloc_fake_mount(mp,
xfs_growfs_rt_nrblocks(rtg, nrblocks, rextsize, bmbno),
@@ -861,6 +924,10 @@ xfs_growfs_rt_bmblock(
goto out_cancel;
}
+ error = xfs_growfs_rt_init_rtsb(&nargs, rtg, &args);
+ if (error)
+ goto out_cancel;
+
/*
* Update superblock fields.
*/
@@ -879,12 +946,14 @@ xfs_growfs_rt_bmblock(
if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
+ if (nmp->m_sb.sb_rgcount != mp->m_sb.sb_rgcount)
+ xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RGCOUNT,
+ nmp->m_sb.sb_rgcount - mp->m_sb.sb_rgcount);
/*
* Free the new extent.
*/
- freed_rtx = nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents;
- error = xfs_rtfree_range(&nargs, mp->m_sb.sb_rextents, freed_rtx);
+ error = xfs_growfs_rt_free_new(rtg, &nargs, &freed_rtx);
xfs_rtbuf_cache_relse(&nargs);
if (error)
goto out_cancel;
@@ -925,6 +994,15 @@ xfs_growfs_rt_bmblock(
return error;
}
+static xfs_rtxnum_t
+xfs_last_rtgroup_extents(
+ struct xfs_mount *mp)
+{
+ return mp->m_sb.sb_rextents -
+ ((xfs_rtxnum_t)(mp->m_sb.sb_rgcount - 1) *
+ mp->m_sb.sb_rgextents);
+}
+
/*
* Calculate the last rbmblock currently used.
*
@@ -935,11 +1013,20 @@ xfs_last_rt_bmblock(
struct xfs_rtgroup *rtg)
{
struct xfs_mount *mp = rtg->rtg_mount;
- xfs_fileoff_t bmbno = mp->m_sb.sb_rbmblocks;
+ xfs_rgnumber_t rgno = rtg->rtg_rgno;
+ xfs_fileoff_t bmbno = 0;
+
+ ASSERT(!mp->m_sb.sb_rgcount || rgno >= mp->m_sb.sb_rgcount - 1);
+
+ if (mp->m_sb.sb_rgcount && rgno == mp->m_sb.sb_rgcount - 1) {
+ xfs_rtxnum_t nrext = xfs_last_rtgroup_extents(mp);
+
+ /* Also fill up the previous block if not entirely full. */
+ bmbno = xfs_rtbitmap_blockcount_len(mp, nrext);
+ if (xfs_rtx_to_rbmword(mp, nrext) != 0)
+ bmbno--;
+ }
- /* Skip the current block if it is exactly full. */
- if (xfs_rtx_to_rbmword(mp, mp->m_sb.sb_rextents) != 0)
- bmbno--;
return bmbno;
}
@@ -956,38 +1043,56 @@ xfs_growfs_rt_alloc_blocks(
struct xfs_mount *mp = rtg->rtg_mount;
struct xfs_inode *rbmip = rtg->rtg_inodes[XFS_RTGI_BITMAP];
struct xfs_inode *rsumip = rtg->rtg_inodes[XFS_RTGI_SUMMARY];
- xfs_extlen_t orbmblocks;
- xfs_extlen_t orsumblocks;
- xfs_extlen_t nrsumblocks;
+ xfs_extlen_t orbmblocks = 0;
+ xfs_extlen_t orsumblocks = 0;
struct xfs_mount *nmp;
- int error;
-
- /*
- * Get the old block counts for bitmap and summary inodes.
- * These can't change since other growfs callers are locked out.
- */
- orbmblocks = XFS_B_TO_FSB(mp, rbmip->i_disk_size);
- orsumblocks = XFS_B_TO_FSB(mp, rsumip->i_disk_size);
+ int error = 0;
nmp = xfs_growfs_rt_alloc_fake_mount(mp, nrblocks, rextsize);
if (!nmp)
return -ENOMEM;
-
*nrbmblocks = nmp->m_sb.sb_rbmblocks;
- nrsumblocks = nmp->m_rsumblocks;
- kfree(nmp);
+
+ if (xfs_has_rtgroups(mp)) {
+ /*
+ * For file systems with the rtgroups feature, the RT bitmap and
+ * summary are always fully allocated, which means that we never
+ * need to grow the existing files.
+ *
+ * But we have to be careful to only fill the bitmap until the
+ * end of the actually used range.
+ */
+ if (rtg->rtg_rgno == nmp->m_sb.sb_rgcount - 1)
+ *nrbmblocks = xfs_rtbitmap_blockcount_len(nmp,
+ xfs_last_rtgroup_extents(nmp));
+
+ if (mp->m_sb.sb_rgcount &&
+ rtg->rtg_rgno == mp->m_sb.sb_rgcount - 1)
+ goto out_free;
+ } else {
+ /*
+ * Get the old block counts for bitmap and summary inodes.
+ * These can't change since other growfs callers are locked out.
+ */
+ orbmblocks = XFS_B_TO_FSB(mp, rbmip->i_disk_size);
+ orsumblocks = XFS_B_TO_FSB(mp, rsumip->i_disk_size);
+ }
error = xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_BITMAP, orbmblocks,
- *nrbmblocks, NULL);
+ nmp->m_sb.sb_rbmblocks, NULL);
if (error)
- return error;
- return xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_SUMMARY, orsumblocks,
- nrsumblocks, NULL);
+ goto out_free;
+ error = xfs_rtfile_initialize_blocks(rtg, XFS_RTGI_SUMMARY, orsumblocks,
+ nmp->m_rsumblocks, NULL);
+out_free:
+ kfree(nmp);
+ return error;
}
static int
xfs_growfs_rtg(
struct xfs_mount *mp,
+ xfs_rgnumber_t rgno,
xfs_rfsblock_t nrblocks,
xfs_agblock_t rextsize)
{
@@ -998,7 +1103,7 @@ xfs_growfs_rtg(
unsigned int i;
int error;
- rtg = xfs_rtgroup_grab(mp, 0);
+ rtg = xfs_rtgroup_grab(mp, rgno);
if (!rtg)
return -EINVAL;
@@ -1069,14 +1174,67 @@ xfs_growfs_check_rtgeom(
return error;
}
+/*
+ * Compute the new number of rt groups and ensure that /rtgroups exists.
+ *
+ * Changing the rtgroup size is not allowed (even if the rt volume hasn't yet
+ * been initialized) because the userspace ABI doesn't support it.
+ */
+static int
+xfs_growfs_rt_prep_groups(
+ struct xfs_mount *mp,
+ xfs_rfsblock_t rblocks,
+ xfs_extlen_t rextsize,
+ xfs_rgnumber_t *new_rgcount)
+{
+ int error;
+
+ *new_rgcount = howmany_64(rblocks, mp->m_sb.sb_rgextents * rextsize);
+ if (*new_rgcount > XFS_MAX_RGNUMBER)
+ return -EINVAL;
+
+ /* Make sure the /rtgroups dir has been created */
+ if (!mp->m_rtdirip) {
+ struct xfs_trans *tp;
+
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+ error = xfs_rtginode_load_parent(tp);
+ xfs_trans_cancel(tp);
+
+ if (error == -ENOENT)
+ error = xfs_rtginode_mkdir_parent(mp);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
+static bool
+xfs_grow_last_rtg(
+ struct xfs_mount *mp)
+{
+ if (!xfs_has_rtgroups(mp))
+ return true;
+ if (mp->m_sb.sb_rgcount == 0)
+ return false;
+ return xfs_rtgroup_extents(mp, mp->m_sb.sb_rgcount - 1) <=
+ mp->m_sb.sb_rgextents;
+}
+
/*
* Grow the realtime area of the filesystem.
*/
int
xfs_growfs_rt(
- xfs_mount_t *mp, /* mount point for filesystem */
- xfs_growfs_rt_t *in) /* growfs rt input struct */
+ struct xfs_mount *mp,
+ struct xfs_growfs_rt *in)
{
+ xfs_rgnumber_t old_rgcount = mp->m_sb.sb_rgcount;
+ xfs_rgnumber_t new_rgcount = 1;
+ xfs_rgnumber_t rgno;
struct xfs_buf *bp;
xfs_agblock_t old_rextsize = mp->m_sb.sb_rextsize;
int error;
@@ -1134,19 +1292,55 @@ xfs_growfs_rt(
if (error)
goto out_unlock;
- error = xfs_growfs_rtg(mp, in->newblocks, in->extsize);
- if (error)
- goto out_unlock;
+ if (xfs_has_rtgroups(mp)) {
+ error = xfs_growfs_rt_prep_groups(mp, in->newblocks,
+ in->extsize, &new_rgcount);
+ if (error)
+ goto out_unlock;
+ }
- if (old_rextsize != in->extsize) {
+ if (xfs_grow_last_rtg(mp)) {
+ error = xfs_growfs_rtg(mp, old_rgcount - 1, in->newblocks,
+ in->extsize);
+ if (error)
+ goto out_unlock;
+ }
+
+ for (rgno = old_rgcount; rgno < new_rgcount; rgno++) {
+ error = xfs_rtgroup_alloc(mp, rgno);
+ if (error)
+ goto out_unlock;
+
+ error = xfs_growfs_rtg(mp, rgno, in->newblocks, in->extsize);
+ if (error) {
+ struct xfs_rtgroup *rtg;
+
+ rtg = xfs_rtgroup_grab(mp, rgno);
+ if (!WARN_ON_ONCE(!rtg)) {
+ xfs_rtunmount_rtg(rtg);
+ xfs_rtgroup_rele(rtg);
+ xfs_rtgroup_free(mp, rgno);
+ }
+ break;
+ }
+ }
+
+ if (!error && old_rextsize != in->extsize)
error = xfs_growfs_rt_fixup_extsize(mp);
- if (error)
- goto out_unlock;
+
+ /*
+ * Update secondary superblocks now the physical grow has completed.
+ *
+ * Also do this in case of an error as we might have already
+ * successfully updated one or more RTGs and incremented sb_rgcount.
+ */
+ if (!xfs_is_shutdown(mp)) {
+ int error2 = xfs_update_secondary_sbs(mp);
+
+ if (!error)
+ error = error2;
}
- /* Update secondary superblocks now the physical grow has completed */
- error = xfs_update_secondary_sbs(mp);
-
out_unlock:
mutex_unlock(&mp->m_growlock);
return error;
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 552e3a149346c..dc90856e1e9bb 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -458,6 +458,10 @@ xfs_trans_mod_sb(
case XFS_TRANS_SB_REXTSLOG:
tp->t_rextslog_delta += delta;
break;
+ case XFS_TRANS_SB_RGCOUNT:
+ ASSERT(delta > 0);
+ tp->t_rgcount_delta += delta;
+ break;
default:
ASSERT(0);
return;
@@ -563,6 +567,10 @@ xfs_trans_apply_sb_deltas(
sbp->sb_rextslog += tp->t_rextslog_delta;
whole = 1;
}
+ if (tp->t_rgcount_delta) {
+ be32_add_cpu(&sbp->sb_rgcount, tp->t_rgcount_delta);
+ whole = 1;
+ }
xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
if (whole)
@@ -680,6 +688,7 @@ xfs_trans_unreserve_and_mod_sb(
mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
mp->m_sb.sb_rextents += tp->t_rextents_delta;
mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
+ mp->m_sb.sb_rgcount += tp->t_rgcount_delta;
spin_unlock(&mp->m_sb_lock);
/*
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f97e5c416efad..71c2e82e4dadf 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -148,6 +148,7 @@ typedef struct xfs_trans {
int64_t t_rblocks_delta;/* superblock rblocks change */
int64_t t_rextents_delta;/* superblocks rextents chg */
int64_t t_rextslog_delta;/* superblocks rextslog chg */
+ int64_t t_rgcount_delta; /* realtime group count */
struct list_head t_items; /* log item descriptors */
struct list_head t_busy; /* list of busy extents */
struct list_head t_dfops; /* deferred operations */
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 15/26] xfs: store rtgroup information with a bmap intent
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (13 preceding siblings ...)
2024-08-23 0:24 ` [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled Darrick J. Wong
@ 2024-08-23 0:25 ` Darrick J. Wong
2024-08-23 5:16 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl Darrick J. Wong
` (10 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:25 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Make the bmap intent items take an active reference to the rtgroup
containing the space that is being mapped or unmapped. We will need
this functionality once we start enabling rmap and reflink on the rt
volume.
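The active-reference pattern can be sketched in userspace. This is not the kernel code: the `fake_*` structs and helper names are invented stand-ins for the refcounted perag/rtgroup objects, showing only how the union added to struct xfs_bmap_intent discriminates between the two group types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's refcounted group objects. */
struct fake_perag   { int refcount; };
struct fake_rtgroup { int refcount; };

/*
 * Mirrors the union added to struct xfs_bmap_intent: exactly one of
 * the two pointers is meaningful, discriminated by whether the fork
 * being mapped is a realtime fork.
 */
struct fake_bmap_intent {
	bool is_realtime;
	union {
		struct fake_perag   *bi_pag;
		struct fake_rtgroup *bi_rtg;
	};
};

static void intent_get_group(struct fake_bmap_intent *bi,
			     struct fake_perag *pag,
			     struct fake_rtgroup *rtg)
{
	if (bi->is_realtime) {
		/* pre-rtgroups filesystems store NULL here */
		bi->bi_rtg = rtg;
		if (rtg)
			rtg->refcount++;	/* like xfs_rtgroup_get() */
	} else {
		bi->bi_pag = pag;
		pag->refcount++;		/* like xfs_perag_intent_get() */
	}
}

static void intent_put_group(struct fake_bmap_intent *bi)
{
	if (bi->is_realtime) {
		if (bi->bi_rtg)
			bi->bi_rtg->refcount--;	/* like xfs_rtgroup_put() */
	} else {
		bi->bi_pag->refcount--;		/* like xfs_perag_intent_put() */
	}
}
```

The point of the union is that an intent only ever references one group, so no space is wasted and the put path can be symmetric with the get path.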
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.h | 5 ++++-
fs/xfs/xfs_bmap_item.c | 18 ++++++++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 7592d46e97c66..eb3670ecd1373 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -248,7 +248,10 @@ struct xfs_bmap_intent {
enum xfs_bmap_intent_type bi_type;
int bi_whichfork;
struct xfs_inode *bi_owner;
- struct xfs_perag *bi_pag;
+ union {
+ struct xfs_perag *bi_pag;
+ struct xfs_rtgroup *bi_rtg;
+ };
struct xfs_bmbt_irec bi_bmap;
};
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index e224b49b7cff6..9a7e97a922b6d 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -26,6 +26,7 @@
#include "xfs_log_recover.h"
#include "xfs_ag.h"
#include "xfs_trace.h"
+#include "xfs_rtgroup.h"
struct kmem_cache *xfs_bui_cache;
struct kmem_cache *xfs_bud_cache;
@@ -324,8 +325,18 @@ xfs_bmap_update_get_group(
struct xfs_mount *mp,
struct xfs_bmap_intent *bi)
{
- if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork))
+ if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork)) {
+ if (xfs_has_rtgroups(mp)) {
+ xfs_rgnumber_t rgno;
+
+ rgno = xfs_rtb_to_rgno(mp, bi->bi_bmap.br_startblock);
+ bi->bi_rtg = xfs_rtgroup_get(mp, rgno);
+ } else {
+ bi->bi_rtg = NULL;
+ }
+
return;
+ }
/*
* Bump the intent count on behalf of the deferred rmap and refcount
@@ -354,8 +365,11 @@ static inline void
xfs_bmap_update_put_group(
struct xfs_bmap_intent *bi)
{
- if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork))
+ if (xfs_ifork_is_realtime(bi->bi_owner, bi->bi_whichfork)) {
+ if (xfs_has_rtgroups(bi->bi_owner->i_mount))
+ xfs_rtgroup_put(bi->bi_rtg);
return;
+ }
xfs_perag_intent_put(bi->bi_pag);
}
* [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (14 preceding siblings ...)
2024-08-23 0:25 ` [PATCH 15/26] xfs: store rtgroup information with a bmap intent Darrick J. Wong
@ 2024-08-23 0:25 ` Darrick J. Wong
2024-08-23 5:17 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 17/26] xfs: support logging EFIs for realtime extents Darrick J. Wong
` (9 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:25 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
xfs_swap_extent_rmap does not use log items to track the overall
progress of an attempt to swap the extent mappings between two files.
If the system crashes in the middle of swapping a partially written
realtime extent, the mapping will be left in an inconsistent state
wherein a file can point to multiple extents on the rt volume.
The new file range exchange functionality handles this correctly, so all
callers must upgrade to that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_bmap_util.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fe2e2c9309755..025e58daf6f50 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1522,6 +1522,18 @@ xfs_swap_extents(
goto out_unlock;
}
+ /*
+ * The rmapbt implementation is unable to resume a swapext operation
+ * after a crash if the allocation unit size is larger than a block.
+ * This (deprecated) interface will not be upgraded to handle this
+ * situation. Defragmentation must be performed with the commit range
+ * ioctl.
+ */
+ if (XFS_IS_REALTIME_INODE(ip) && xfs_has_rtgroups(ip->i_mount)) {
+ error = -EOPNOTSUPP;
+ goto out_unlock;
+ }
+
error = xfs_qm_dqattach(ip);
if (error)
goto out_unlock;
* [PATCH 17/26] xfs: support logging EFIs for realtime extents
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (15 preceding siblings ...)
2024-08-23 0:25 ` [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl Darrick J. Wong
@ 2024-08-23 0:25 ` Darrick J. Wong
2024-08-23 5:17 ` Christoph Hellwig
2024-08-26 4:33 ` Dave Chinner
2024-08-23 0:25 ` [PATCH 18/26] xfs: support error injection when freeing rt extents Darrick J. Wong
` (8 subsequent siblings)
25 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:25 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Teach the EFI mechanism how to free realtime extents. We're going to
need this to enforce proper ordering of operations when we enable
realtime rmap.
Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
ops for rt extents. This keeps the ondisk artifacts and processing code
completely separate between the rt and non-rt cases. Hopefully this
will make it easier to debug filesystem problems.
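How the separate item type keeps the two cases apart can be shown with a small sketch. The 0x124a/0x124b values come from this patch's additions to xfs_log_format.h; the 0x1236/0x1237 values for the existing EFI/EFD types are assumed from the current header, and the `fake_*` names are invented here.

```c
#include <assert.h>
#include <stdbool.h>

/* Log item type codes; EFI_RT/EFD_RT are the new ones from this patch. */
enum {
	FAKE_LI_EFI    = 0x1236,
	FAKE_LI_EFD    = 0x1237,
	FAKE_LI_EFI_RT = 0x124a,
	FAKE_LI_EFD_RT = 0x124b,
};

/* One predicate distinguishes the rt and non-rt intent flavors. */
static bool fake_efi_isrt(unsigned short type)
{
	return type == FAKE_LI_EFI_RT;
}

/*
 * The done-item type is derived from the intent's type, so an rt
 * intent always pairs with an rt done item and recovery never has to
 * guess which defer ops table to use.
 */
static unsigned short fake_efd_type_from_efi(unsigned short efi_type)
{
	return fake_efi_isrt(efi_type) ? FAKE_LI_EFD_RT : FAKE_LI_EFD;
}
```

This mirrors xfs_efi_item_isrt() and xfs_efd_type_from_efi() in the patch below: the shared EFI/EFD code paths stay common, and only the type code selects the rt behavior.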
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_alloc.c | 15 ++
fs/xfs/libxfs/xfs_alloc.h | 17 ++
fs/xfs/libxfs/xfs_defer.c | 6 +
fs/xfs/libxfs/xfs_defer.h | 1
fs/xfs/libxfs/xfs_log_format.h | 6 +
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/xfs_extfree_item.c | 281 ++++++++++++++++++++++++++++++++++++---
fs/xfs/xfs_log_recover.c | 2
8 files changed, 305 insertions(+), 25 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 59326f84f6a57..0eae7835c92a9 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2645,8 +2645,17 @@ xfs_defer_extent_free(
ASSERT(!isnullstartblock(bno));
ASSERT(!(free_flags & ~XFS_FREE_EXTENT_ALL_FLAGS));
- if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
- return -EFSCORRUPTED;
+ if (free_flags & XFS_FREE_EXTENT_REALTIME) {
+ if (type != XFS_AG_RESV_NONE) {
+ ASSERT(type == XFS_AG_RESV_NONE);
+ return -EFSCORRUPTED;
+ }
+ if (XFS_IS_CORRUPT(mp, !xfs_verify_rtbext(mp, bno, len)))
+ return -EFSCORRUPTED;
+ } else {
+ if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
+ return -EFSCORRUPTED;
+ }
xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
GFP_KERNEL | __GFP_NOFAIL);
@@ -2655,6 +2664,8 @@ xfs_defer_extent_free(
xefi->xefi_agresv = type;
if (free_flags & XFS_FREE_EXTENT_SKIP_DISCARD)
xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
+ if (free_flags & XFS_FREE_EXTENT_REALTIME)
+ xefi->xefi_flags |= XFS_EFI_REALTIME;
if (oinfo) {
ASSERT(oinfo->oi_offset == 0);
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index fae170825be06..349ffeb407690 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -240,7 +240,11 @@ int xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
/* Don't issue a discard for the blocks freed. */
#define XFS_FREE_EXTENT_SKIP_DISCARD (1U << 0)
-#define XFS_FREE_EXTENT_ALL_FLAGS (XFS_FREE_EXTENT_SKIP_DISCARD)
+/* Free blocks on the realtime device. */
+#define XFS_FREE_EXTENT_REALTIME (1U << 1)
+
+#define XFS_FREE_EXTENT_ALL_FLAGS (XFS_FREE_EXTENT_SKIP_DISCARD | \
+ XFS_FREE_EXTENT_REALTIME)
/*
* List of extents to be free "later".
@@ -251,7 +255,10 @@ struct xfs_extent_free_item {
uint64_t xefi_owner;
xfs_fsblock_t xefi_startblock;/* starting fs block number */
xfs_extlen_t xefi_blockcount;/* number of blocks in extent */
- struct xfs_perag *xefi_pag;
+ union {
+ struct xfs_perag *xefi_pag;
+ struct xfs_rtgroup *xefi_rtg;
+ };
unsigned int xefi_flags;
enum xfs_ag_resv_type xefi_agresv;
};
@@ -260,6 +267,12 @@ struct xfs_extent_free_item {
#define XFS_EFI_ATTR_FORK (1U << 1) /* freeing attr fork block */
#define XFS_EFI_BMBT_BLOCK (1U << 2) /* freeing bmap btree block */
#define XFS_EFI_CANCELLED (1U << 3) /* dont actually free the space */
+#define XFS_EFI_REALTIME (1U << 4) /* freeing realtime extent */
+
+static inline bool xfs_efi_is_realtime(const struct xfs_extent_free_item *xefi)
+{
+ return xefi->xefi_flags & XFS_EFI_REALTIME;
+}
struct xfs_alloc_autoreap {
struct xfs_defer_pending *dfp;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 40021849b42f0..a33e22d091367 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -847,6 +847,12 @@ xfs_defer_add(
ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
+ if (!ops->finish_item) {
+ ASSERT(ops->finish_item != NULL);
+ xfs_force_shutdown(tp->t_mountp, SHUTDOWN_CORRUPT_INCORE);
+ return NULL;
+ }
+
dfp = xfs_defer_find_last(tp, ops);
if (!dfp || !xfs_defer_can_append(dfp, ops))
dfp = xfs_defer_alloc(&tp->t_dfops, ops);
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 8b338031e487c..ec51b8465e61c 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -71,6 +71,7 @@ extern const struct xfs_defer_op_type xfs_refcount_update_defer_type;
extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
+extern const struct xfs_defer_op_type xfs_rtextent_free_defer_type;
extern const struct xfs_defer_op_type xfs_attr_defer_type;
extern const struct xfs_defer_op_type xfs_exchmaps_defer_type;
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index ace7384a275bf..15dec19b6c32a 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -248,6 +248,8 @@ typedef struct xfs_trans_header {
#define XFS_LI_ATTRD 0x1247 /* attr set/remove done */
#define XFS_LI_XMI 0x1248 /* mapping exchange intent */
#define XFS_LI_XMD 0x1249 /* mapping exchange done */
+#define XFS_LI_EFI_RT 0x124a /* realtime extent free intent */
+#define XFS_LI_EFD_RT 0x124b /* realtime extent free done */
#define XFS_LI_TYPE_DESC \
{ XFS_LI_EFI, "XFS_LI_EFI" }, \
@@ -267,7 +269,9 @@ typedef struct xfs_trans_header {
{ XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \
{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \
{ XFS_LI_XMI, "XFS_LI_XMI" }, \
- { XFS_LI_XMD, "XFS_LI_XMD" }
+ { XFS_LI_XMD, "XFS_LI_XMD" }, \
+ { XFS_LI_EFI_RT, "XFS_LI_EFI_RT" }, \
+ { XFS_LI_EFD_RT, "XFS_LI_EFD_RT" }
/*
* Inode Log Item Format definitions.
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 521d327e4c89e..5397a8ff004df 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -77,6 +77,8 @@ extern const struct xlog_recover_item_ops xlog_attri_item_ops;
extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
extern const struct xlog_recover_item_ops xlog_xmi_item_ops;
extern const struct xlog_recover_item_ops xlog_xmd_item_ops;
+extern const struct xlog_recover_item_ops xlog_rtefi_item_ops;
+extern const struct xlog_recover_item_ops xlog_rtefd_item_ops;
/*
* Macros, structures, prototypes for internal log manager use.
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index abffc74a924f7..57b46f1b8463d 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -25,6 +25,10 @@
#include "xfs_error.h"
#include "xfs_log_priv.h"
#include "xfs_log_recover.h"
+#include "xfs_rtalloc.h"
+#include "xfs_inode.h"
+#include "xfs_rtbitmap.h"
+#include "xfs_rtgroup.h"
struct kmem_cache *xfs_efi_cache;
struct kmem_cache *xfs_efd_cache;
@@ -95,16 +99,15 @@ xfs_efi_item_format(
ASSERT(atomic_read(&efip->efi_next_extent) ==
efip->efi_format.efi_nextents);
+ ASSERT(lip->li_type == XFS_LI_EFI || lip->li_type == XFS_LI_EFI_RT);
- efip->efi_format.efi_type = XFS_LI_EFI;
+ efip->efi_format.efi_type = lip->li_type;
efip->efi_format.efi_size = 1;
- xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_EFI_FORMAT,
- &efip->efi_format,
+ xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_EFI_FORMAT, &efip->efi_format,
xfs_efi_log_format_sizeof(efip->efi_format.efi_nextents));
}
-
/*
* The unpin operation is the last place an EFI is manipulated in the log. It is
* either inserted in the AIL or aborted in the event of a log I/O error. In
@@ -140,12 +143,14 @@ xfs_efi_item_release(
STATIC struct xfs_efi_log_item *
xfs_efi_init(
struct xfs_mount *mp,
+ unsigned short item_type,
uint nextents)
-
{
struct xfs_efi_log_item *efip;
+ ASSERT(item_type == XFS_LI_EFI || item_type == XFS_LI_EFI_RT);
ASSERT(nextents > 0);
+
if (nextents > XFS_EFI_MAX_FAST_EXTENTS) {
efip = kzalloc(xfs_efi_log_item_sizeof(nextents),
GFP_KERNEL | __GFP_NOFAIL);
@@ -154,7 +159,7 @@ xfs_efi_init(
GFP_KERNEL | __GFP_NOFAIL);
}
- xfs_log_item_init(mp, &efip->efi_item, XFS_LI_EFI, &xfs_efi_item_ops);
+ xfs_log_item_init(mp, &efip->efi_item, item_type, &xfs_efi_item_ops);
efip->efi_format.efi_nextents = nextents;
efip->efi_format.efi_id = (uintptr_t)(void *)efip;
atomic_set(&efip->efi_next_extent, 0);
@@ -264,12 +269,12 @@ xfs_efd_item_format(
struct xfs_log_iovec *vecp = NULL;
ASSERT(efdp->efd_next_extent == efdp->efd_format.efd_nextents);
+ ASSERT(lip->li_type == XFS_LI_EFD || lip->li_type == XFS_LI_EFD_RT);
- efdp->efd_format.efd_type = XFS_LI_EFD;
+ efdp->efd_format.efd_type = lip->li_type;
efdp->efd_format.efd_size = 1;
- xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_EFD_FORMAT,
- &efdp->efd_format,
+ xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_EFD_FORMAT, &efdp->efd_format,
xfs_efd_log_format_sizeof(efdp->efd_format.efd_nextents));
}
@@ -308,6 +313,14 @@ static inline struct xfs_extent_free_item *xefi_entry(const struct list_head *e)
return list_entry(e, struct xfs_extent_free_item, xefi_list);
}
+static inline bool
+xfs_efi_item_isrt(const struct xfs_log_item *lip)
+{
+ ASSERT(lip->li_type == XFS_LI_EFI || lip->li_type == XFS_LI_EFI_RT);
+
+ return lip->li_type == XFS_LI_EFI_RT;
+}
+
/*
* Fill the EFD with all extents from the EFI when we need to roll the
* transaction and continue with a new EFI.
@@ -395,11 +408,12 @@ xfs_extent_free_create_intent(
bool sort)
{
struct xfs_mount *mp = tp->t_mountp;
- struct xfs_efi_log_item *efip = xfs_efi_init(mp, count);
+ struct xfs_efi_log_item *efip;
struct xfs_extent_free_item *xefi;
ASSERT(count > 0);
+ efip = xfs_efi_init(mp, XFS_LI_EFI, count);
if (sort)
list_sort(mp, items, xfs_extent_free_diff_items);
list_for_each_entry(xefi, items, xefi_list)
@@ -407,6 +421,12 @@ xfs_extent_free_create_intent(
return &efip->efi_item;
}
+static inline unsigned short
+xfs_efd_type_from_efi(const struct xfs_efi_log_item *efip)
+{
+ return xfs_efi_item_isrt(&efip->efi_item) ? XFS_LI_EFD_RT : XFS_LI_EFD;
+}
+
/* Get an EFD so we can process all the free extents. */
static struct xfs_log_item *
xfs_extent_free_create_done(
@@ -427,8 +447,8 @@ xfs_extent_free_create_done(
GFP_KERNEL | __GFP_NOFAIL);
}
- xfs_log_item_init(tp->t_mountp, &efdp->efd_item, XFS_LI_EFD,
- &xfs_efd_item_ops);
+ xfs_log_item_init(tp->t_mountp, &efdp->efd_item,
+ xfs_efd_type_from_efi(efip), &xfs_efd_item_ops);
efdp->efd_efip = efip;
efdp->efd_format.efd_nextents = count;
efdp->efd_format.efd_efi_id = efip->efi_format.efi_id;
@@ -447,6 +467,17 @@ xfs_extent_free_defer_add(
trace_xfs_extent_free_defer(mp, xefi);
+ if (xfs_efi_is_realtime(xefi)) {
+ xfs_rgnumber_t rgno;
+
+ rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock);
+ xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
+
+ *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
+ &xfs_rtextent_free_defer_type);
+ return;
+ }
+
xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock);
if (xefi->xefi_agresv == XFS_AG_RESV_AGFL)
*dfpp = xfs_defer_add(tp, &xefi->xefi_list,
@@ -559,8 +590,12 @@ xfs_agfl_free_finish_item(
static inline bool
xfs_efi_validate_ext(
struct xfs_mount *mp,
+ bool isrt,
struct xfs_extent *extp)
{
+ if (isrt)
+ return xfs_verify_rtbext(mp, extp->ext_start, extp->ext_len);
+
return xfs_verify_fsbext(mp, extp->ext_start, extp->ext_len);
}
@@ -568,6 +603,7 @@ static inline void
xfs_efi_recover_work(
struct xfs_mount *mp,
struct xfs_defer_pending *dfp,
+ bool isrt,
struct xfs_extent *extp)
{
struct xfs_extent_free_item *xefi;
@@ -578,7 +614,15 @@ xfs_efi_recover_work(
xefi->xefi_blockcount = extp->ext_len;
xefi->xefi_agresv = XFS_AG_RESV_NONE;
xefi->xefi_owner = XFS_RMAP_OWN_UNKNOWN;
- xefi->xefi_pag = xfs_perag_intent_get(mp, extp->ext_start);
+ if (isrt) {
+ xfs_rgnumber_t rgno;
+
+ xefi->xefi_flags |= XFS_EFI_REALTIME;
+ rgno = xfs_rtb_to_rgno(mp, extp->ext_start);
+ xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
+ } else {
+ xefi->xefi_pag = xfs_perag_intent_get(mp, extp->ext_start);
+ }
xfs_defer_add_item(dfp, &xefi->xefi_list);
}
@@ -599,14 +643,15 @@ xfs_extent_free_recover_work(
struct xfs_trans *tp;
int i;
int error = 0;
+ bool isrt = xfs_efi_item_isrt(lip);
/*
- * First check the validity of the extents described by the
- * EFI. If any are bad, then assume that all are bad and
- * just toss the EFI.
+ * First check the validity of the extents described by the EFI. If
+ * any are bad, then assume that all are bad and just toss the EFI.
+ * Mixing RT and non-RT extents in the same EFI item is not allowed.
*/
for (i = 0; i < efip->efi_format.efi_nextents; i++) {
- if (!xfs_efi_validate_ext(mp,
+ if (!xfs_efi_validate_ext(mp, isrt,
&efip->efi_format.efi_extents[i])) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
&efip->efi_format,
@@ -614,7 +659,8 @@ xfs_extent_free_recover_work(
return -EFSCORRUPTED;
}
- xfs_efi_recover_work(mp, dfp, &efip->efi_format.efi_extents[i]);
+ xfs_efi_recover_work(mp, dfp, isrt,
+ &efip->efi_format.efi_extents[i]);
}
resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
@@ -652,10 +698,12 @@ xfs_extent_free_relog_intent(
count = EFI_ITEM(intent)->efi_format.efi_nextents;
extp = EFI_ITEM(intent)->efi_format.efi_extents;
+ ASSERT(intent->li_type == XFS_LI_EFI || intent->li_type == XFS_LI_EFI_RT);
+
efdp->efd_next_extent = count;
memcpy(efdp->efd_format.efd_extents, extp, count * sizeof(*extp));
- efip = xfs_efi_init(tp->t_mountp, count);
+ efip = xfs_efi_init(tp->t_mountp, intent->li_type, count);
memcpy(efip->efi_format.efi_extents, extp, count * sizeof(*extp));
atomic_set(&efip->efi_next_extent, count);
@@ -687,6 +735,106 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
.relog_intent = xfs_extent_free_relog_intent,
};
+#ifdef CONFIG_XFS_RT
+/* Sort realtime efi items by rtgroup for efficiency. */
+static int
+xfs_rtextent_free_diff_items(
+ void *priv,
+ const struct list_head *a,
+ const struct list_head *b)
+{
+ struct xfs_extent_free_item *ra = xefi_entry(a);
+ struct xfs_extent_free_item *rb = xefi_entry(b);
+
+ return ra->xefi_rtg->rtg_rgno - rb->xefi_rtg->rtg_rgno;
+}
+
+/* Create a realtime extent freeing intent. */
+static struct xfs_log_item *
+xfs_rtextent_free_create_intent(
+ struct xfs_trans *tp,
+ struct list_head *items,
+ unsigned int count,
+ bool sort)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_efi_log_item *efip;
+ struct xfs_extent_free_item *xefi;
+
+ ASSERT(count > 0);
+
+ efip = xfs_efi_init(mp, XFS_LI_EFI_RT, count);
+ if (sort)
+ list_sort(mp, items, xfs_rtextent_free_diff_items);
+ list_for_each_entry(xefi, items, xefi_list)
+ xfs_extent_free_log_item(tp, efip, xefi);
+ return &efip->efi_item;
+}
+
+/* Cancel a realtime extent freeing. */
+STATIC void
+xfs_rtextent_free_cancel_item(
+ struct list_head *item)
+{
+ struct xfs_extent_free_item *xefi = xefi_entry(item);
+
+ xfs_rtgroup_put(xefi->xefi_rtg);
+ kmem_cache_free(xfs_extfree_item_cache, xefi);
+}
+
+/* Process a free realtime extent. */
+STATIC int
+xfs_rtextent_free_finish_item(
+ struct xfs_trans *tp,
+ struct xfs_log_item *done,
+ struct list_head *item,
+ struct xfs_btree_cur **state)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_extent_free_item *xefi = xefi_entry(item);
+ struct xfs_efd_log_item *efdp = EFD_ITEM(done);
+ struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state;
+ int error = 0;
+
+ trace_xfs_extent_free_deferred(mp, xefi);
+
+ if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) {
+ if (*rtgp != xefi->xefi_rtg) {
+ xfs_rtgroup_lock(xefi->xefi_rtg, XFS_RTGLOCK_BITMAP);
+ xfs_rtgroup_trans_join(tp, xefi->xefi_rtg,
+ XFS_RTGLOCK_BITMAP);
+ *rtgp = xefi->xefi_rtg;
+ }
+ error = xfs_rtfree_blocks(tp, xefi->xefi_rtg,
+ xefi->xefi_startblock, xefi->xefi_blockcount);
+ }
+ if (error == -EAGAIN) {
+ xfs_efd_from_efi(efdp);
+ return error;
+ }
+
+ xfs_efd_add_extent(efdp, xefi);
+ xfs_rtextent_free_cancel_item(item);
+ return error;
+}
+
+const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
+ .name = "rtextent_free",
+ .max_items = XFS_EFI_MAX_FAST_EXTENTS,
+ .create_intent = xfs_rtextent_free_create_intent,
+ .abort_intent = xfs_extent_free_abort_intent,
+ .create_done = xfs_extent_free_create_done,
+ .finish_item = xfs_rtextent_free_finish_item,
+ .cancel_item = xfs_rtextent_free_cancel_item,
+ .recover_work = xfs_extent_free_recover_work,
+ .relog_intent = xfs_extent_free_relog_intent,
+};
+#else
+const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
+ .name = "rtextent_free",
+};
+#endif /* CONFIG_XFS_RT */
+
STATIC bool
xfs_efi_item_match(
struct xfs_log_item *lip,
@@ -731,7 +879,7 @@ xlog_recover_efi_commit_pass2(
return -EFSCORRUPTED;
}
- efip = xfs_efi_init(mp, efi_formatp->efi_nextents);
+ efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
if (error) {
xfs_efi_item_free(efip);
@@ -749,6 +897,58 @@ const struct xlog_recover_item_ops xlog_efi_item_ops = {
.commit_pass2 = xlog_recover_efi_commit_pass2,
};
+#ifdef CONFIG_XFS_RT
+STATIC int
+xlog_recover_rtefi_commit_pass2(
+ struct xlog *log,
+ struct list_head *buffer_list,
+ struct xlog_recover_item *item,
+ xfs_lsn_t lsn)
+{
+ struct xfs_mount *mp = log->l_mp;
+ struct xfs_efi_log_item *efip;
+ struct xfs_efi_log_format *efi_formatp;
+ int error;
+
+ efi_formatp = item->ri_buf[0].i_addr;
+
+ if (item->ri_buf[0].i_len < xfs_efi_log_format_sizeof(0)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+ item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
+ return -EFSCORRUPTED;
+ }
+
+ efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
+ error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
+ if (error) {
+ xfs_efi_item_free(efip);
+ return error;
+ }
+ atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents);
+
+ xlog_recover_intent_item(log, &efip->efi_item, lsn,
+ &xfs_rtextent_free_defer_type);
+ return 0;
+}
+#else
+STATIC int
+xlog_recover_rtefi_commit_pass2(
+ struct xlog *log,
+ struct list_head *buffer_list,
+ struct xlog_recover_item *item,
+ xfs_lsn_t lsn)
+{
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp,
+ item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
+ return -EFSCORRUPTED;
+}
+#endif
+
+const struct xlog_recover_item_ops xlog_rtefi_item_ops = {
+ .item_type = XFS_LI_EFI_RT,
+ .commit_pass2 = xlog_recover_rtefi_commit_pass2,
+};
+
/*
* This routine is called when an EFD format structure is found in a committed
* transaction in the log. Its purpose is to cancel the corresponding EFI if it
@@ -791,3 +991,44 @@ const struct xlog_recover_item_ops xlog_efd_item_ops = {
.item_type = XFS_LI_EFD,
.commit_pass2 = xlog_recover_efd_commit_pass2,
};
+
+#ifdef CONFIG_XFS_RT
+STATIC int
+xlog_recover_rtefd_commit_pass2(
+ struct xlog *log,
+ struct list_head *buffer_list,
+ struct xlog_recover_item *item,
+ xfs_lsn_t lsn)
+{
+ struct xfs_efd_log_format *efd_formatp;
+ int buflen = item->ri_buf[0].i_len;
+
+ efd_formatp = item->ri_buf[0].i_addr;
+
+ if (buflen < sizeof(struct xfs_efd_log_format)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp,
+ efd_formatp, buflen);
+ return -EFSCORRUPTED;
+ }
+
+ if (item->ri_buf[0].i_len != xfs_efd_log_format32_sizeof(
+ efd_formatp->efd_nextents) &&
+ item->ri_buf[0].i_len != xfs_efd_log_format64_sizeof(
+ efd_formatp->efd_nextents)) {
+ XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp,
+ efd_formatp, buflen);
+ return -EFSCORRUPTED;
+ }
+
+ xlog_recover_release_intent(log, XFS_LI_EFI_RT,
+ efd_formatp->efd_efi_id);
+ return 0;
+}
+#else
+# define xlog_recover_rtefd_commit_pass2 xlog_recover_rtefi_commit_pass2
+#endif
+
+const struct xlog_recover_item_ops xlog_rtefd_item_ops = {
+ .item_type = XFS_LI_EFD_RT,
+ .commit_pass2 = xlog_recover_rtefd_commit_pass2,
+};
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index c627cde3bb1e0..07f63c300626a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1819,6 +1819,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
&xlog_attrd_item_ops,
&xlog_xmi_item_ops,
&xlog_xmd_item_ops,
+ &xlog_rtefi_item_ops,
+ &xlog_rtefd_item_ops,
};
static const struct xlog_recover_item_ops *
* [PATCH 18/26] xfs: support error injection when freeing rt extents
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (16 preceding siblings ...)
2024-08-23 0:25 ` [PATCH 17/26] xfs: support logging EFIs for realtime extents Darrick J. Wong
@ 2024-08-23 0:25 ` Darrick J. Wong
2024-08-23 5:18 ` Christoph Hellwig
2024-08-23 0:26 ` [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled Darrick J. Wong
` (7 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:25 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
A handful of fstests expect to be able to test what happens when extent
free intents fail to actually free the extent. Now that we're
supporting EFIs for realtime extents, add to xfs_rtfree_extent the same
injection point that exists in the regular extent freeing code.
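The shape of such an injection point can be sketched outside the kernel. This is a toy: the real XFS_TEST_ERROR/XFS_ERRTAG_FREE_EXTENT knob is randomized and per-mount, whereas here it is just a settable flag, and the function name is invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

/* Toy error-injection knob standing in for XFS_ERRTAG_FREE_EXTENT. */
static bool inject_free_extent_failure;

static int fake_rtfree_extent(void)
{
	/*
	 * Fail early, before any bitmap state is touched, matching
	 * where the patch places the check in xfs_rtfree_extent().
	 */
	if (inject_free_extent_failure)
		return -EIO;
	return 0;	/* pretend the extent was freed */
}
```

Failing before any state changes is what lets fstests exercise the EFI recovery paths without leaving the filesystem half-modified.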
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_rtbitmap.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index dfac0e89409a9..c8958d3e0abe0 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -21,6 +21,7 @@
#include "xfs_rtbitmap.h"
#include "xfs_health.h"
#include "xfs_sb.h"
+#include "xfs_errortag.h"
#include "xfs_log.h"
#include "xfs_buf_item.h"
@@ -1065,6 +1066,9 @@ xfs_rtfree_extent(
ASSERT(rbmip->i_itemp != NULL);
xfs_assert_ilocked(rbmip, XFS_ILOCK_EXCL);
+ if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FREE_EXTENT))
+ return -EIO;
+
error = xfs_rtcheck_alloc_range(&args, start, len);
if (error)
return error;
* [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (17 preceding siblings ...)
2024-08-23 0:25 ` [PATCH 18/26] xfs: support error injection when freeing rt extents Darrick J. Wong
@ 2024-08-23 0:26 ` Darrick J. Wong
2024-08-23 5:18 ` Christoph Hellwig
2024-08-23 0:26 ` [PATCH 20/26] xfs: don't merge ioends across RTGs Darrick J. Wong
` (6 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:26 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
When rmap is enabled, XFS expects a certain order of operations, which
is: 1) remove the file mapping, 2) remove the reverse mapping, and then
3) free the blocks. When reflink is enabled, XFS replaces (3) with a
deferred refcount decrement operation that can schedule freeing the
blocks if that was the last refcount.
For realtime files, xfs_bmap_del_extent_real tries to do steps (1) and
(3) in the same transaction, which will break both rmap and reflink
unless we
switch it to use realtime EFIs. Both rmap and reflink depend on the
rtgroups feature, so let's turn on EFIs for all rtgroups filesystems.
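The resulting dispatch can be summarized as a small sketch; the enum and function names are invented here, but the three branches correspond to the cases in xfs_bmap_del_extent_real after this patch.

```c
#include <assert.h>
#include <stdbool.h>

enum free_path {
	FREE_VIA_EFI,		/* defer through a regular EFI */
	FREE_INLINE_RT,		/* legacy: free rt blocks in this transaction */
	FREE_VIA_RT_EFI,	/* defer through a realtime EFI */
};

static enum free_path choose_free_path(bool isrt, bool has_rtgroups)
{
	if (!isrt)
		return FREE_VIA_EFI;		/* data device always defers */
	if (!has_rtgroups)
		return FREE_INLINE_RT;		/* historical rt behavior */
	return FREE_VIA_RT_EFI;			/* rt + rtgroups: ordered free */
}
```

Deferring the rt free through an intent item is what guarantees the free cannot be reordered ahead of the rmap removal after a crash.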
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index f1bf8635a8cf3..126a0d253654a 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5434,9 +5434,11 @@ xfs_bmap_del_extent_real(
* If we need to, add to list of extents to delete.
*/
if (!(bflags & XFS_BMAPI_REMAP)) {
+ bool isrt = xfs_ifork_is_realtime(ip, whichfork);
+
if (xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK) {
xfs_refcount_decrease_extent(tp, del);
- } else if (xfs_ifork_is_realtime(ip, whichfork)) {
+ } else if (isrt && !xfs_has_rtgroups(mp)) {
error = xfs_bmap_free_rtblocks(tp, del);
} else {
unsigned int efi_flags = 0;
@@ -5445,6 +5447,19 @@ xfs_bmap_del_extent_real(
del->br_state == XFS_EXT_UNWRITTEN)
efi_flags |= XFS_FREE_EXTENT_SKIP_DISCARD;
+ /*
+ * Historically, we did not use EFIs to free realtime
+ * extents. However, when reverse mapping is enabled,
+ * we must maintain the same order of operations as the
+ * data device, which is: Remove the file mapping,
+ * remove the reverse mapping, and then free the
+ * blocks. Reflink for realtime volumes requires the
+ * same sort of ordering. Both features rely on
+ * rtgroups, so let's gate rt EFI usage on rtgroups.
+ */
+ if (isrt)
+ efi_flags |= XFS_FREE_EXTENT_REALTIME;
+
error = xfs_free_extent_later(tp, del->br_startblock,
del->br_blockcount, NULL,
XFS_AG_RESV_NONE, efi_flags);
* [PATCH 20/26] xfs: don't merge ioends across RTGs
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (18 preceding siblings ...)
2024-08-23 0:26 ` [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled Darrick J. Wong
@ 2024-08-23 0:26 ` Darrick J. Wong
2024-08-23 0:26 ` [PATCH 21/26] xfs: make the RT allocator rtgroup aware Darrick J. Wong
` (5 subsequent siblings)
25 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:26 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Unlike AGs, RTGs don't always have metadata in their first blocks, and
thus we don't get automatic protection from merging I/O completions
across RTG boundaries. Add code to set the IOMAP_F_BOUNDARY flag for
ioends that start at the first block of an RTG so that they never get
merged into the previous ioend.
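The "does this mapping start a new rt group?" test can be modeled arithmetically (struct and function names here are hypothetical; the kernel derives the geometry from the superblock via xfs_rtb_to_rtx and xfs_rtb_to_rtxoff):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy realtime geometry: a group holds rtx_per_group extents of
 * blocks_per_rtx filesystem blocks each. */
struct rtgeom {
	uint64_t blocks_per_rtx;	/* rt extent size in fs blocks */
	uint64_t rtx_per_group;		/* models sb_rgextents */
};

/* Offset of bno within its rt extent (models xfs_rtb_to_rtxoff). */
static uint64_t rtb_to_rtxoff(const struct rtgeom *g, uint64_t bno)
{
	return bno % g->blocks_per_rtx;
}

/* Extent number of bno within its group. */
static uint64_t rtb_to_group_rtx(const struct rtgeom *g, uint64_t bno)
{
	return (bno / g->blocks_per_rtx) % g->rtx_per_group;
}

/* A mapping starts a new group iff it sits at extent 0, offset 0 of
 * some group -- exactly the mappings that get IOMAP_F_BOUNDARY. */
static bool starts_new_rtgroup(const struct rtgeom *g, uint64_t bno)
{
	return rtb_to_group_rtx(g, bno) == 0 && rtb_to_rtxoff(g, bno) == 0;
}
```

With 4-block extents and 8 extents per group, blocks 0, 32, 64, ... start a group; everything in between does not, so only group-leading ioends refuse to merge backwards.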
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_iomap.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 13cabd345e227..607d360c4a911 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -24,6 +24,7 @@
#include "xfs_iomap.h"
#include "xfs_trace.h"
#include "xfs_quota.h"
+#include "xfs_rtgroup.h"
#include "xfs_dquot_item.h"
#include "xfs_dquot.h"
#include "xfs_reflink.h"
@@ -115,7 +116,9 @@ xfs_bmbt_to_iomap(
iomap->addr = IOMAP_NULL_ADDR;
iomap->type = IOMAP_DELALLOC;
} else {
- iomap->addr = BBTOB(xfs_fsb_to_db(ip, imap->br_startblock));
+ xfs_daddr_t bno = xfs_fsb_to_db(ip, imap->br_startblock);
+
+ iomap->addr = BBTOB(bno);
if (mapping_flags & IOMAP_DAX)
iomap->addr += target->bt_dax_part_off;
@@ -124,6 +127,15 @@ xfs_bmbt_to_iomap(
else
iomap->type = IOMAP_MAPPED;
+ /*
+ * Mark iomaps starting at the first sector of an RTG as a merge
+ * boundary so that each I/O completion is contained within a
+ * single RTG.
+ */
+ if (XFS_IS_REALTIME_INODE(ip) && xfs_has_rtgroups(mp) &&
+ xfs_rtb_to_rtx(mp, bno) == 0 &&
+ xfs_rtb_to_rtxoff(mp, bno) == 0)
+ iomap->flags |= IOMAP_F_BOUNDARY;
}
iomap->offset = XFS_FSB_TO_B(mp, imap->br_startoff);
iomap->length = XFS_FSB_TO_B(mp, imap->br_blockcount);
* [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (19 preceding siblings ...)
2024-08-23 0:26 ` [PATCH 20/26] xfs: don't merge ioends across RTGs Darrick J. Wong
@ 2024-08-23 0:26 ` Darrick J. Wong
2024-08-26 4:56 ` Dave Chinner
2024-08-23 0:26 ` [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub Darrick J. Wong
` (4 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:26 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
Make the allocator rtgroup aware by either picking a specific group if
there is a hint, or looping over all groups otherwise. A simple rotor
is provided to pick the placement for initial allocations.
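The rotor-plus-wraparound-scan scheme can be sketched as follows (a simplified model: `has_space[]` stands in for the per-group allocator call, and the kernel's rotor is an atomic counter rather than a plain integer):

```c
#include <stdint.h>

/* Models the new m_rtgrotor: the Nth hint-less initial allocation
 * starts its group scan at N % group_count (the kernel derives N
 * from atomic_inc_return). */
static uint32_t rotor_start(uint32_t ticket, uint32_t group_count)
{
	return ticket % group_count;
}

/* Models the loop in xfs_rtallocate_rtgs: try each group in turn,
 * wrapping around at the end, until one has space.  Returns the
 * winning group number, or -1 (the -ENOSPC analogue) if every
 * group is full. */
static int scan_groups(uint32_t start, uint32_t group_count,
		       const int *has_space)
{
	uint32_t g = start;

	do {
		if (has_space[g])
			return (int)g;
		if (++g == group_count)
			g = 0;
	} while (g != start);

	return -1;
}
```

The rotor spreads unrelated files across groups while the do/while guarantees every group is visited exactly once before giving up, mirroring how the real code resets `bno_hint` to NULLFSBLOCK after the first group it tries.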
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_bmap.c | 13 +++++-
fs/xfs/libxfs/xfs_rtbitmap.c | 6 ++-
fs/xfs/xfs_mount.h | 1
fs/xfs/xfs_rtalloc.c | 98 ++++++++++++++++++++++++++++++++++++++----
4 files changed, 105 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 126a0d253654a..88c62e1158ac7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3151,8 +3151,17 @@ xfs_bmap_adjacent_valid(
struct xfs_mount *mp = ap->ip->i_mount;
if (XFS_IS_REALTIME_INODE(ap->ip) &&
- (ap->datatype & XFS_ALLOC_USERDATA))
- return x < mp->m_sb.sb_rblocks;
+ (ap->datatype & XFS_ALLOC_USERDATA)) {
+ if (x >= mp->m_sb.sb_rblocks)
+ return false;
+ if (!xfs_has_rtgroups(mp))
+ return true;
+
+ return xfs_rtb_to_rgno(mp, x) == xfs_rtb_to_rgno(mp, y) &&
+ xfs_rtb_to_rgno(mp, x) < mp->m_sb.sb_rgcount &&
+ xfs_rtb_to_rtx(mp, x) < mp->m_sb.sb_rgextents;
+
+ }
return XFS_FSB_TO_AGNO(mp, x) == XFS_FSB_TO_AGNO(mp, y) &&
XFS_FSB_TO_AGNO(mp, x) < mp->m_sb.sb_agcount &&
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index c8958d3e0abe0..ef94b67feccd7 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1084,11 +1084,13 @@ xfs_rtfree_extent(
* Mark more blocks free in the superblock.
*/
xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS, (long)len);
+
/*
* If we've now freed all the blocks, reset the file sequence
- * number to 0.
+ * number to 0 for pre-RTG file systems.
*/
- if (tp->t_frextents_delta + mp->m_sb.sb_frextents ==
+ if (!xfs_has_rtgroups(mp) &&
+ tp->t_frextents_delta + mp->m_sb.sb_frextents ==
mp->m_sb.sb_rextents) {
if (!(rbmip->i_diflags & XFS_DIFLAG_NEWRTBM))
rbmip->i_diflags |= XFS_DIFLAG_NEWRTBM;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c4e4f5414a299..7e68812db1be7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -223,6 +223,7 @@ typedef struct xfs_mount {
#endif
xfs_agnumber_t m_agfrotor; /* last ag where space found */
atomic_t m_agirotor; /* last ag dir inode alloced */
+ atomic_t m_rtgrotor; /* last rtgroup rtpicked */
/* Memory shrinker to throttle and reprioritize inodegc */
struct shrinker *m_inodegc_shrinker;
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3fedc552b51b0..2b57ff2687bf6 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1661,8 +1661,9 @@ xfs_rtalloc_align_minmax(
}
static int
-xfs_rtallocate(
+xfs_rtallocate_rtg(
struct xfs_trans *tp,
+ xfs_rgnumber_t rgno,
xfs_rtblock_t bno_hint,
xfs_rtxlen_t minlen,
xfs_rtxlen_t maxlen,
@@ -1682,16 +1683,33 @@ xfs_rtallocate(
xfs_rtxlen_t len = 0;
int error = 0;
- args.rtg = xfs_rtgroup_grab(args.mp, 0);
+ args.rtg = xfs_rtgroup_grab(args.mp, rgno);
if (!args.rtg)
return -ENOSPC;
/*
- * Lock out modifications to both the RT bitmap and summary inodes.
+ * We need to lock out modifications to both the RT bitmap and summary
+ * inodes for finding free space in xfs_rtallocate_extent_{near,size}
+ * and join the bitmap and summary inodes for the actual allocation
+ * down in xfs_rtallocate_range.
+ *
+ * For RTG-enabled file systems we don't want to join the inodes to the
+ * transaction until we are committed to allocating from this RTG so
+ * that only one inode of each type is locked at a time.
+ *
+ * But for pre-RTG file systems we already need to join the bitmap
+ * inode to the transaction for xfs_rtpick_extent, which bumps the
+ * sequence number in it, so we'll have to join the inode to the
+ * transaction early here.
+ *
+ * This is all a bit messy, but at least the mess is contained in
+ * this function.
*/
if (!*rtlocked) {
xfs_rtgroup_lock(args.rtg, XFS_RTGLOCK_BITMAP);
- xfs_rtgroup_trans_join(tp, args.rtg, XFS_RTGLOCK_BITMAP);
+ if (!xfs_has_rtgroups(args.mp))
+ xfs_rtgroup_trans_join(tp, args.rtg,
+ XFS_RTGLOCK_BITMAP);
*rtlocked = true;
}
@@ -1701,7 +1719,7 @@ xfs_rtallocate(
*/
if (bno_hint)
start = xfs_rtb_to_rtx(args.mp, bno_hint);
- else if (initial_user_data)
+ else if (!xfs_has_rtgroups(args.mp) && initial_user_data)
start = xfs_rtpick_extent(args.rtg, tp, maxlen);
if (start) {
@@ -1722,8 +1740,16 @@ xfs_rtallocate(
prod, &rtx);
}
- if (error)
+ if (error) {
+ if (xfs_has_rtgroups(args.mp)) {
+ xfs_rtgroup_unlock(args.rtg, XFS_RTGLOCK_BITMAP);
+ *rtlocked = false;
+ }
goto out_release;
+ }
+
+ if (xfs_has_rtgroups(args.mp))
+ xfs_rtgroup_trans_join(tp, args.rtg, XFS_RTGLOCK_BITMAP);
error = xfs_rtallocate_range(&args, rtx, len);
if (error)
@@ -1741,6 +1767,53 @@ xfs_rtallocate(
return error;
}
+static int
+xfs_rtallocate_rtgs(
+ struct xfs_trans *tp,
+ xfs_fsblock_t bno_hint,
+ xfs_rtxlen_t minlen,
+ xfs_rtxlen_t maxlen,
+ xfs_rtxlen_t prod,
+ bool wasdel,
+ bool initial_user_data,
+ xfs_rtblock_t *bno,
+ xfs_extlen_t *blen)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ xfs_rgnumber_t start_rgno, rgno;
+ int error;
+
+ /*
+ * For now this just blindly iterates over the RTGs for an initial
+ * allocation. We could try to keep an in-memory rtg_longest member
+ * to avoid the locking when just looking for big enough free space,
+ * but for now this keeps things simple.
+ */
+ if (bno_hint != NULLFSBLOCK)
+ start_rgno = xfs_rtb_to_rgno(mp, bno_hint);
+ else
+ start_rgno = (atomic_inc_return(&mp->m_rtgrotor) - 1) %
+ mp->m_sb.sb_rgcount;
+
+ rgno = start_rgno;
+ do {
+ bool rtlocked = false;
+
+ error = xfs_rtallocate_rtg(tp, rgno, bno_hint, minlen, maxlen,
+ prod, wasdel, initial_user_data, &rtlocked,
+ bno, blen);
+ if (error != -ENOSPC)
+ return error;
+ ASSERT(!rtlocked);
+
+ if (++rgno == mp->m_sb.sb_rgcount)
+ rgno = 0;
+ bno_hint = NULLFSBLOCK;
+ } while (rgno != start_rgno);
+
+ return -ENOSPC;
+}
+
static int
xfs_rtallocate_align(
struct xfs_bmalloca *ap,
@@ -1835,9 +1908,16 @@ xfs_bmap_rtalloc(
if (xfs_bmap_adjacent(ap))
bno_hint = ap->blkno;
- error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
- ap->wasdel, initial_user_data, &rtlocked,
- &ap->blkno, &ap->length);
+ if (xfs_has_rtgroups(ap->ip->i_mount)) {
+ error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
+ prod, ap->wasdel, initial_user_data,
+ &ap->blkno, &ap->length);
+ } else {
+ error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
+ prod, ap->wasdel, initial_user_data,
+ &rtlocked, &ap->blkno, &ap->length);
+ }
+
if (error == -ENOSPC) {
if (!noalign) {
/*
* [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (20 preceding siblings ...)
2024-08-23 0:26 ` [PATCH 21/26] xfs: make the RT allocator rtgroup aware Darrick J. Wong
@ 2024-08-23 0:26 ` Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 23/26] xfs: scrub the realtime group superblock Darrick J. Wong
` (3 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:26 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The bmbt scrubber will combine file mappings if they are mergeable to
reduce the number of cross-referencing checks. However, we shouldn't
combine mappings that cross rt group boundaries because that will cause
verifiers to trip incorrectly.
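The amended contiguity test can be modeled with a small standalone check (struct and parameter names are illustrative; `blocks_per_group` stands in for the xfs_rtb_to_rgno conversion, and the model omits the delalloc/hole checks the real scrubber also performs):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified mapping record (models struct xfs_bmbt_irec). */
struct bmap {
	uint64_t startoff;	/* file offset, in fs blocks */
	uint64_t startblock;	/* rt device block */
	uint64_t blockcount;
	int      state;		/* written vs. unwritten */
};

/* Two mappings merge only if they are adjacent in both file and
 * device space, share the same state, and -- the new rule -- live
 * in the same rt group. */
static bool bmaps_contiguous(const struct bmap *b1, const struct bmap *b2,
			     uint64_t blocks_per_group)
{
	if (b1->startoff + b1->blockcount != b2->startoff)
		return false;
	if (b1->startblock + b1->blockcount != b2->startblock)
		return false;
	if (b1->state != b2->state)
		return false;
	/* Never combine across an rtgroup boundary. */
	if (b1->startblock / blocks_per_group !=
	    b2->startblock / blocks_per_group)
		return false;
	return true;
}
```

Physically adjacent mappings that straddle a group boundary are a perfectly valid on-disk state; they are merely kept as two records so that per-group rtb checks downstream see in-bounds extents.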
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/bmap.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 69dac1bd6a83e..173c371e822f5 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -835,9 +835,12 @@ xchk_bmap_iext_mapping(
/* Are these two mappings contiguous with each other? */
static inline bool
xchk_are_bmaps_contiguous(
+ const struct xchk_bmap_info *info,
const struct xfs_bmbt_irec *b1,
const struct xfs_bmbt_irec *b2)
{
+ struct xfs_mount *mp = info->sc->mp;
+
/* Don't try to combine unallocated mappings. */
if (!xfs_bmap_is_real_extent(b1))
return false;
@@ -851,6 +854,17 @@ xchk_are_bmaps_contiguous(
return false;
if (b1->br_state != b2->br_state)
return false;
+
+ /*
+ * Don't combine bmaps that would cross rtgroup boundaries. This is a
+ * valid state, but if combined they will fail rtb extent checks.
+ */
+ if (info->is_rt && xfs_has_rtgroups(mp)) {
+ if (xfs_rtb_to_rgno(mp, b1->br_startblock) !=
+ xfs_rtb_to_rgno(mp, b2->br_startblock))
+ return false;
+ }
+
return true;
}
@@ -888,7 +902,7 @@ xchk_bmap_iext_iter(
* that we just read, if possible.
*/
while (xfs_iext_peek_next_extent(ifp, &info->icur, &got)) {
- if (!xchk_are_bmaps_contiguous(irec, &got))
+ if (!xchk_are_bmaps_contiguous(info, irec, &got))
break;
if (!xchk_bmap_iext_mapping(info, &got)) {
* [PATCH 23/26] xfs: scrub the realtime group superblock
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (21 preceding siblings ...)
2024-08-23 0:26 ` [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub Darrick J. Wong
@ 2024-08-23 0:27 ` Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 24/26] xfs: repair " Darrick J. Wong
` (2 subsequent siblings)
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:27 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Enable scrubbing of realtime group superblocks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/Makefile | 1 +
fs/xfs/libxfs/xfs_fs.h | 3 +-
fs/xfs/scrub/common.h | 2 +
fs/xfs/scrub/health.c | 1 +
fs/xfs/scrub/rgsuper.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/scrub/scrub.c | 7 +++++
fs/xfs/scrub/scrub.h | 2 +
fs/xfs/scrub/stats.c | 1 +
fs/xfs/scrub/trace.h | 4 ++-
9 files changed, 92 insertions(+), 2 deletions(-)
create mode 100644 fs/xfs/scrub/rgsuper.c
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 388b5cef48ca5..56f518e5017fd 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -190,6 +190,7 @@ xfs-y += $(addprefix scrub/, \
xfs-$(CONFIG_XFS_ONLINE_SCRUB_STATS) += scrub/stats.o
xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \
+ rgsuper.o \
rtbitmap.o \
rtsummary.o \
)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2dacc19723c37..07337958fc41a 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -736,9 +736,10 @@ struct xfs_scrub_metadata {
#define XFS_SCRUB_TYPE_HEALTHY 27 /* everything checked out ok */
#define XFS_SCRUB_TYPE_DIRTREE 28 /* directory tree structure */
#define XFS_SCRUB_TYPE_METAPATH 29 /* metadata directory tree paths */
+#define XFS_SCRUB_TYPE_RGSUPER 30 /* realtime superblock */
/* Number of scrub subcommands. */
-#define XFS_SCRUB_TYPE_NR 30
+#define XFS_SCRUB_TYPE_NR 31
/*
* This special type code only applies to the vectored scrub implementation.
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 0d531770e83b0..c8465a4eb594a 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -79,9 +79,11 @@ int xchk_setup_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xchk_setup_rtbitmap(struct xfs_scrub *sc);
int xchk_setup_rtsummary(struct xfs_scrub *sc);
+int xchk_setup_rgsuperblock(struct xfs_scrub *sc);
#else
# define xchk_setup_rtbitmap xchk_setup_nothing
# define xchk_setup_rtsummary xchk_setup_nothing
+# define xchk_setup_rgsuperblock xchk_setup_nothing
#endif
#ifdef CONFIG_XFS_QUOTA
int xchk_ino_dqattach(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
index a0a721ae5763d..3406579db71eb 100644
--- a/fs/xfs/scrub/health.c
+++ b/fs/xfs/scrub/health.c
@@ -111,6 +111,7 @@ static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = {
[XFS_SCRUB_TYPE_NLINKS] = { XHG_FS, XFS_SICK_FS_NLINKS },
[XFS_SCRUB_TYPE_DIRTREE] = { XHG_INO, XFS_SICK_INO_DIRTREE },
[XFS_SCRUB_TYPE_METAPATH] = { XHG_FS, XFS_SICK_FS_METAPATH },
+ [XFS_SCRUB_TYPE_RGSUPER] = { XHG_RTGROUP, XFS_SICK_RG_SUPER },
};
/* Return the health status mask for this scrub type. */
diff --git a/fs/xfs/scrub/rgsuper.c b/fs/xfs/scrub/rgsuper.c
new file mode 100644
index 0000000000000..bfba31a03adbc
--- /dev/null
+++ b/fs/xfs/scrub/rgsuper.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_rtgroup.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+
+/* Set us up with a transaction and an empty context. */
+int
+xchk_setup_rgsuperblock(
+ struct xfs_scrub *sc)
+{
+ return xchk_trans_alloc(sc, 0);
+}
+
+/* Cross-reference with the other rt metadata. */
+STATIC void
+xchk_rgsuperblock_xref(
+ struct xfs_scrub *sc)
+{
+ struct xfs_mount *mp = sc->mp;
+ xfs_rgnumber_t rgno = sc->sr.rtg->rtg_rgno;
+ xfs_rtblock_t rtbno;
+
+ if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+ return;
+
+ rtbno = xfs_rgbno_to_rtb(mp, rgno, 0);
+ xchk_xref_is_used_rt_space(sc, rtbno, 1);
+}
+
+int
+xchk_rgsuperblock(
+ struct xfs_scrub *sc)
+{
+ xfs_rgnumber_t rgno = sc->sm->sm_agno;
+ int error;
+
+ /*
+ * Only rtgroup 0 has a superblock. We may someday want to use a
+ * higher rgno for other functions, similar to what we do with the primary
+ * super scrub function.
+ */
+ if (rgno != 0)
+ return -ENOENT;
+
+ /*
+ * Grab an active reference to the rtgroup structure. If we can't get
+ * it, we're racing with something that's tearing down the group, so
+ * signal that the group no longer exists. Take the rtbitmap in shared
+ * mode so that the group can't change while we're doing things.
+ */
+ error = xchk_rtgroup_init_existing(sc, rgno, &sc->sr);
+ if (!xchk_xref_process_error(sc, 0, 0, &error))
+ return error;
+
+ xchk_rtgroup_lock(&sc->sr, XFS_RTGLOCK_BITMAP_SHARED);
+
+ /*
+ * Since we already validated the rt superblock at mount time, we don't
+ * need to check its contents again. All we need is to cross-reference.
+ */
+ xchk_rgsuperblock_xref(sc);
+ return 0;
+}
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 910825d4b61a2..fc8476c522746 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -451,6 +451,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
.has = xfs_has_metadir,
.repair = xrep_metapath,
},
+ [XFS_SCRUB_TYPE_RGSUPER] = { /* realtime group superblock */
+ .type = ST_RTGROUP,
+ .setup = xchk_setup_rgsuperblock,
+ .scrub = xchk_rgsuperblock,
+ .has = xfs_has_rtsb,
+ .repair = xrep_notsupported,
+ },
};
static int
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index f73c6d0d90a11..a7fda3e2b0137 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -273,9 +273,11 @@ int xchk_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xchk_rtbitmap(struct xfs_scrub *sc);
int xchk_rtsummary(struct xfs_scrub *sc);
+int xchk_rgsuperblock(struct xfs_scrub *sc);
#else
# define xchk_rtbitmap xchk_nothing
# define xchk_rtsummary xchk_nothing
+# define xchk_rgsuperblock xchk_nothing
#endif
#ifdef CONFIG_XFS_QUOTA
int xchk_quota(struct xfs_scrub *sc);
diff --git a/fs/xfs/scrub/stats.c b/fs/xfs/scrub/stats.c
index edcd02dc2e62c..a476c7b2ab759 100644
--- a/fs/xfs/scrub/stats.c
+++ b/fs/xfs/scrub/stats.c
@@ -81,6 +81,7 @@ static const char *name_map[XFS_SCRUB_TYPE_NR] = {
[XFS_SCRUB_TYPE_NLINKS] = "nlinks",
[XFS_SCRUB_TYPE_DIRTREE] = "dirtree",
[XFS_SCRUB_TYPE_METAPATH] = "metapath",
+ [XFS_SCRUB_TYPE_RGSUPER] = "rgsuper",
};
/* Format the scrub stats into a text buffer, similar to pcp style. */
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index fe901b9138b4b..d4d0e8ceeeb7b 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -71,6 +71,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_DIRTREE);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_BARRIER);
TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_METAPATH);
+TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_RGSUPER);
#define XFS_SCRUB_TYPE_STRINGS \
{ XFS_SCRUB_TYPE_PROBE, "probe" }, \
@@ -103,7 +104,8 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_METAPATH);
{ XFS_SCRUB_TYPE_HEALTHY, "healthy" }, \
{ XFS_SCRUB_TYPE_DIRTREE, "dirtree" }, \
{ XFS_SCRUB_TYPE_BARRIER, "barrier" }, \
- { XFS_SCRUB_TYPE_METAPATH, "metapath" }
+ { XFS_SCRUB_TYPE_METAPATH, "metapath" }, \
+ { XFS_SCRUB_TYPE_RGSUPER, "rgsuper" }
#define XFS_SCRUB_FLAG_STRINGS \
{ XFS_SCRUB_IFLAG_REPAIR, "repair" }, \
* [PATCH 24/26] xfs: repair realtime group superblock
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (22 preceding siblings ...)
2024-08-23 0:27 ` [PATCH 23/26] xfs: scrub the realtime group superblock Darrick J. Wong
@ 2024-08-23 0:27 ` Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata Darrick J. Wong
2024-08-23 0:27 ` [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use Darrick J. Wong
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:27 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Repair the realtime group superblock if it has become out of date
relative to the primary superblock.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/repair.h | 3 +++
fs/xfs/scrub/rgsuper.c | 16 ++++++++++++++++
fs/xfs/scrub/scrub.c | 2 +-
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 4052185743910..b649da1a93eb8 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -146,9 +146,11 @@ int xrep_metapath(struct xfs_scrub *sc);
#ifdef CONFIG_XFS_RT
int xrep_rtbitmap(struct xfs_scrub *sc);
int xrep_rtsummary(struct xfs_scrub *sc);
+int xrep_rgsuperblock(struct xfs_scrub *sc);
#else
# define xrep_rtbitmap xrep_notsupported
# define xrep_rtsummary xrep_notsupported
+# define xrep_rgsuperblock xrep_notsupported
#endif /* CONFIG_XFS_RT */
#ifdef CONFIG_XFS_QUOTA
@@ -253,6 +255,7 @@ static inline int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *x)
#define xrep_symlink xrep_notsupported
#define xrep_dirtree xrep_notsupported
#define xrep_metapath xrep_notsupported
+#define xrep_rgsuperblock xrep_notsupported
#endif /* CONFIG_XFS_ONLINE_REPAIR */
diff --git a/fs/xfs/scrub/rgsuper.c b/fs/xfs/scrub/rgsuper.c
index bfba31a03adbc..ad54a58cd9848 100644
--- a/fs/xfs/scrub/rgsuper.c
+++ b/fs/xfs/scrub/rgsuper.c
@@ -10,8 +10,12 @@
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_rtgroup.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
+#include "scrub/repair.h"
/* Set us up with a transaction and an empty context. */
int
@@ -71,3 +75,15 @@ xchk_rgsuperblock(
xchk_rgsuperblock_xref(sc);
return 0;
}
+
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+int
+xrep_rgsuperblock(
+ struct xfs_scrub *sc)
+{
+ ASSERT(sc->sr.rtg->rtg_rgno == 0);
+
+ xfs_log_sb(sc->tp);
+ return 0;
+}
+#endif /* CONFIG_XFS_ONLINE_REPAIR */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index fc8476c522746..c255882fc5e40 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -456,7 +456,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
.setup = xchk_setup_rgsuperblock,
.scrub = xchk_rgsuperblock,
.has = xfs_has_rtsb,
- .repair = xrep_notsupported,
+ .repair = xrep_rgsuperblock,
},
};
* [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (23 preceding siblings ...)
2024-08-23 0:27 ` [PATCH 24/26] xfs: repair " Darrick J. Wong
@ 2024-08-23 0:27 ` Darrick J. Wong
2024-08-23 5:20 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use Darrick J. Wong
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:27 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Add the code we need to scan the metadata directory paths of rt group
metadata files.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 5 ++-
fs/xfs/scrub/metapath.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 07337958fc41a..11fa3d0c38086 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -822,9 +822,12 @@ struct xfs_scrub_vec_head {
* path checking.
*/
#define XFS_SCRUB_METAPATH_PROBE (0) /* do we have a metapath scrubber? */
+#define XFS_SCRUB_METAPATH_RTDIR (1) /* rtrgroups metadir */
+#define XFS_SCRUB_METAPATH_RTBITMAP (2) /* per-rtg bitmap */
+#define XFS_SCRUB_METAPATH_RTSUMMARY (3) /* per-rtg summary */
/* Number of metapath sm_ino values */
-#define XFS_SCRUB_METAPATH_NR (1)
+#define XFS_SCRUB_METAPATH_NR (4)
/*
* ioctl limits
diff --git a/fs/xfs/scrub/metapath.c b/fs/xfs/scrub/metapath.c
index edc1a395c4015..e5714655152db 100644
--- a/fs/xfs/scrub/metapath.c
+++ b/fs/xfs/scrub/metapath.c
@@ -20,6 +20,7 @@
#include "xfs_bmap_btree.h"
#include "xfs_trans_space.h"
#include "xfs_attr.h"
+#include "xfs_rtgroup.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
@@ -79,6 +80,91 @@ xchk_metapath_cleanup(
kfree(mpath->path);
}
+/* Set up a metadir path scan. @path must be dynamically allocated. */
+static inline int
+xchk_setup_metapath_scan(
+ struct xfs_scrub *sc,
+ struct xfs_inode *dp,
+ const char *path,
+ struct xfs_inode *ip)
+{
+ struct xchk_metapath *mpath;
+ int error;
+
+ if (!path)
+ return -ENOMEM;
+
+ error = xchk_install_live_inode(sc, ip);
+ if (error) {
+ kfree(path);
+ return error;
+ }
+
+ mpath = kzalloc(sizeof(struct xchk_metapath), XCHK_GFP_FLAGS);
+ if (!mpath) {
+ kfree(path);
+ return -ENOMEM;
+ }
+
+ mpath->sc = sc;
+ sc->buf = mpath;
+ sc->buf_cleanup = xchk_metapath_cleanup;
+
+ mpath->dp = dp;
+ mpath->path = path; /* path is now owned by mpath */
+
+ mpath->xname.name = mpath->path;
+ mpath->xname.len = strlen(mpath->path);
+ mpath->xname.type = xfs_mode_to_ftype(VFS_I(ip)->i_mode);
+
+ return 0;
+}
+
+#ifdef CONFIG_XFS_RT
+/* Scan the /rtgroups directory itself. */
+static int
+xchk_setup_metapath_rtdir(
+ struct xfs_scrub *sc)
+{
+ if (!sc->mp->m_rtdirip)
+ return -ENOENT;
+
+ return xchk_setup_metapath_scan(sc, sc->mp->m_metadirip,
+ kasprintf(GFP_KERNEL, "rtgroups"), sc->mp->m_rtdirip);
+}
+
+/* Scan a rtgroup inode under the /rtgroups directory. */
+static int
+xchk_setup_metapath_rtginode(
+ struct xfs_scrub *sc,
+ enum xfs_rtg_inodes type)
+{
+ struct xfs_rtgroup *rtg;
+ struct xfs_inode *ip;
+ int error;
+
+ rtg = xfs_rtgroup_get(sc->mp, sc->sm->sm_agno);
+ if (!rtg)
+ return -ENOENT;
+
+ ip = rtg->rtg_inodes[type];
+ if (!ip) {
+ error = -ENOENT;
+ goto out_put_rtg;
+ }
+
+ error = xchk_setup_metapath_scan(sc, sc->mp->m_rtdirip,
+ xfs_rtginode_path(rtg->rtg_rgno, type), ip);
+
+out_put_rtg:
+ xfs_rtgroup_put(rtg);
+ return error;
+}
+#else
+# define xchk_setup_metapath_rtdir(...) (-ENOENT)
+# define xchk_setup_metapath_rtginode(...) (-ENOENT)
+#endif /* CONFIG_XFS_RT */
+
int
xchk_setup_metapath(
struct xfs_scrub *sc)
@@ -94,6 +180,12 @@ xchk_setup_metapath(
if (sc->sm->sm_agno)
return -EINVAL;
return 0;
+ case XFS_SCRUB_METAPATH_RTDIR:
+ return xchk_setup_metapath_rtdir(sc);
+ case XFS_SCRUB_METAPATH_RTBITMAP:
+ return xchk_setup_metapath_rtginode(sc, XFS_RTGI_BITMAP);
+ case XFS_SCRUB_METAPATH_RTSUMMARY:
+ return xchk_setup_metapath_rtginode(sc, XFS_RTGI_SUMMARY);
default:
return -ENOENT;
}
* [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
` (24 preceding siblings ...)
2024-08-23 0:27 ` [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata Darrick J. Wong
@ 2024-08-23 0:27 ` Darrick J. Wong
2024-08-23 5:20 ` Christoph Hellwig
25 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:27 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Set the rtbitmap and summary file inumbers to NULLFSINO in the
superblock and make sure they're zeroed whenever we write the superblock
to disk, to mimic mkfs behavior.
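The asymmetric conversion -- read as NULLFSINO, write as zero -- can be sketched with two pure helper functions (the names below are illustrative; the kernel does this inline in __xfs_sb_from_disk and xfs_sb_to_disk, and applies the same pattern to the quota inode fields):

```c
#include <stdint.h>

#define NULLFSINO ((uint64_t)-1)

/* On a metadir filesystem the rt inode pointers in the superblock
 * are dead space: whatever is on disk is ignored and the in-core
 * field reads as NULLFSINO... */
static uint64_t sb_rtino_from_disk(int has_metadir, uint64_t ondisk)
{
	return has_metadir ? NULLFSINO : ondisk;
}

/* ...and the field is written back as zero, mimicking mkfs. */
static uint64_t sb_rtino_to_disk(int has_metadir, uint64_t incore)
{
	return has_metadir ? 0 : incore;
}
```

Pre-metadir filesystems keep the straight pass-through in both directions, so only the new format pays the masking cost.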
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_sb.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f94d081f7d928..3dc6d272519ba 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -656,6 +656,13 @@ xfs_validate_sb_common(
void
xfs_sb_quota_from_disk(struct xfs_sb *sbp)
{
+ if (xfs_sb_version_hasmetadir(sbp)) {
+ sbp->sb_uquotino = NULLFSINO;
+ sbp->sb_gquotino = NULLFSINO;
+ sbp->sb_pquotino = NULLFSINO;
+ return;
+ }
+
/*
* older mkfs doesn't initialize quota inodes to NULLFSINO. This
* leads to in-core values having two different values for a quota
@@ -784,6 +791,8 @@ __xfs_sb_from_disk(
to->sb_metadirino = be64_to_cpu(from->sb_metadirino);
to->sb_rgcount = be32_to_cpu(from->sb_rgcount);
to->sb_rgextents = be32_to_cpu(from->sb_rgextents);
+ to->sb_rbmino = NULLFSINO;
+ to->sb_rsumino = NULLFSINO;
} else {
to->sb_metadirino = NULLFSINO;
to->sb_rgcount = 1;
@@ -806,6 +815,13 @@ xfs_sb_quota_to_disk(
{
uint16_t qflags = from->sb_qflags;
+ if (xfs_sb_version_hasmetadir(from)) {
+ to->sb_uquotino = cpu_to_be64(0);
+ to->sb_gquotino = cpu_to_be64(0);
+ to->sb_pquotino = cpu_to_be64(0);
+ return;
+ }
+
to->sb_uquotino = cpu_to_be64(from->sb_uquotino);
/*
@@ -941,6 +957,8 @@ xfs_sb_to_disk(
to->sb_metadirino = cpu_to_be64(from->sb_metadirino);
to->sb_rgcount = cpu_to_be32(from->sb_rgcount);
to->sb_rgextents = cpu_to_be32(from->sb_rgextents);
+ to->sb_rbmino = cpu_to_be64(0);
+ to->sb_rsumino = cpu_to_be64(0);
}
}
* [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
@ 2024-08-23 0:28 ` Darrick J. Wong
2024-08-23 5:51 ` Christoph Hellwig
2024-08-23 0:28 ` [PATCH 2/6] xfs: use metadir for quota inodes Darrick J. Wong
` (4 subsequent siblings)
5 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:28 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Reuse this function instead of open-coding the logic.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_qm.c | 53 ++++++++++++++++++++---------------------------------
1 file changed, 20 insertions(+), 33 deletions(-)
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 28b1420bac1dd..b37e80fe7e86a 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -40,7 +40,6 @@
STATIC int xfs_qm_init_quotainos(struct xfs_mount *mp);
STATIC int xfs_qm_init_quotainfo(struct xfs_mount *mp);
-STATIC void xfs_qm_destroy_quotainos(struct xfs_quotainfo *qi);
STATIC void xfs_qm_dqfree_one(struct xfs_dquot *dqp);
/*
* We use the batch lookup interface to iterate over the dquots as it
@@ -226,6 +225,24 @@ xfs_qm_unmount_rt(
xfs_rtgroup_rele(rtg);
}
+STATIC void
+xfs_qm_destroy_quotainos(
+ struct xfs_quotainfo *qi)
+{
+ if (qi->qi_uquotaip) {
+ xfs_irele(qi->qi_uquotaip);
+ qi->qi_uquotaip = NULL; /* paranoia */
+ }
+ if (qi->qi_gquotaip) {
+ xfs_irele(qi->qi_gquotaip);
+ qi->qi_gquotaip = NULL;
+ }
+ if (qi->qi_pquotaip) {
+ xfs_irele(qi->qi_pquotaip);
+ qi->qi_pquotaip = NULL;
+ }
+}
+
/*
* Called from the vfsops layer.
*/
@@ -250,20 +267,8 @@ xfs_qm_unmount_quotas(
/*
* Release the quota inodes.
*/
- if (mp->m_quotainfo) {
- if (mp->m_quotainfo->qi_uquotaip) {
- xfs_irele(mp->m_quotainfo->qi_uquotaip);
- mp->m_quotainfo->qi_uquotaip = NULL;
- }
- if (mp->m_quotainfo->qi_gquotaip) {
- xfs_irele(mp->m_quotainfo->qi_gquotaip);
- mp->m_quotainfo->qi_gquotaip = NULL;
- }
- if (mp->m_quotainfo->qi_pquotaip) {
- xfs_irele(mp->m_quotainfo->qi_pquotaip);
- mp->m_quotainfo->qi_pquotaip = NULL;
- }
- }
+ if (mp->m_quotainfo)
+ xfs_qm_destroy_quotainos(mp->m_quotainfo);
}
STATIC int
@@ -1712,24 +1717,6 @@ xfs_qm_init_quotainos(
return error;
}
-STATIC void
-xfs_qm_destroy_quotainos(
- struct xfs_quotainfo *qi)
-{
- if (qi->qi_uquotaip) {
- xfs_irele(qi->qi_uquotaip);
- qi->qi_uquotaip = NULL; /* paranoia */
- }
- if (qi->qi_gquotaip) {
- xfs_irele(qi->qi_gquotaip);
- qi->qi_gquotaip = NULL;
- }
- if (qi->qi_pquotaip) {
- xfs_irele(qi->qi_pquotaip);
- qi->qi_pquotaip = NULL;
- }
-}
-
STATIC void
xfs_qm_dqfree_one(
struct xfs_dquot *dqp)
^ permalink raw reply related [flat|nested] 271+ messages in thread
* [PATCH 2/6] xfs: use metadir for quota inodes
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
2024-08-23 0:28 ` [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos Darrick J. Wong
@ 2024-08-23 0:28 ` Darrick J. Wong
2024-08-23 5:53 ` Christoph Hellwig
2024-08-23 0:28 ` [PATCH 3/6] xfs: scrub quota file metapaths Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:28 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Store the quota inodes in a metadata directory if metadir is enabled.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_dquot_buf.c | 190 +++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_quota_defs.h | 43 +++++++++
fs/xfs/libxfs/xfs_sb.c | 1
fs/xfs/xfs_qm.c | 197 +++++++++++++++++++++++++++++++++++-----
4 files changed, 407 insertions(+), 24 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_dquot_buf.c b/fs/xfs/libxfs/xfs_dquot_buf.c
index 15a362e2f5ea7..dceef2abd4e2a 100644
--- a/fs/xfs/libxfs/xfs_dquot_buf.c
+++ b/fs/xfs/libxfs/xfs_dquot_buf.c
@@ -16,6 +16,9 @@
#include "xfs_trans.h"
#include "xfs_qm.h"
#include "xfs_error.h"
+#include "xfs_health.h"
+#include "xfs_metadir.h"
+#include "xfs_metafile.h"
int
xfs_calc_dquots_per_chunk(
@@ -323,3 +326,190 @@ xfs_dquot_to_disk_ts(
return cpu_to_be32(t);
}
+
+inline unsigned int
+xfs_dqinode_sick_mask(xfs_dqtype_t type)
+{
+ switch (type) {
+ case XFS_DQTYPE_USER:
+ return XFS_SICK_FS_UQUOTA;
+ case XFS_DQTYPE_GROUP:
+ return XFS_SICK_FS_GQUOTA;
+ case XFS_DQTYPE_PROJ:
+ return XFS_SICK_FS_PQUOTA;
+ }
+
+ ASSERT(0);
+ return 0;
+}
+
+/*
+ * Load the inode for a given type of quota, assuming that the sb fields have
+ * been sorted out. This is not true when switching quota types on a V4
+ * filesystem, so do not use this function for that. If metadir is enabled,
+ * @dp must be the /quota metadir.
+ *
+ * Returns -ENOENT if the quota inode field is NULLFSINO; 0 and an inode on
+ * success; or a negative errno.
+ */
+int
+xfs_dqinode_load(
+ struct xfs_trans *tp,
+ struct xfs_inode *dp,
+ xfs_dqtype_t type,
+ struct xfs_inode **ipp)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_inode *ip;
+ enum xfs_metafile_type metafile_type = xfs_dqinode_metafile_type(type);
+ int error;
+
+ if (!xfs_has_metadir(mp)) {
+ xfs_ino_t ino;
+
+ switch (type) {
+ case XFS_DQTYPE_USER:
+ ino = mp->m_sb.sb_uquotino;
+ break;
+ case XFS_DQTYPE_GROUP:
+ ino = mp->m_sb.sb_gquotino;
+ break;
+ case XFS_DQTYPE_PROJ:
+ ino = mp->m_sb.sb_pquotino;
+ break;
+ default:
+ ASSERT(0);
+ return -EFSCORRUPTED;
+ }
+
+ /* Should have set 0 to NULLFSINO when loading superblock */
+ if (ino == NULLFSINO)
+ return -ENOENT;
+
+ error = xfs_trans_metafile_iget(tp, ino, metafile_type, &ip);
+ } else {
+ error = xfs_metadir_load(tp, dp, xfs_dqinode_path(type),
+ metafile_type, &ip);
+ if (error == -ENOENT)
+ return error;
+ }
+ if (error) {
+ if (xfs_metadata_is_sick(error))
+ xfs_fs_mark_sick(mp, xfs_dqinode_sick_mask(type));
+ return error;
+ }
+
+ if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
+ ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
+ xfs_irele(ip);
+ xfs_fs_mark_sick(mp, xfs_dqinode_sick_mask(type));
+ return -EFSCORRUPTED;
+ }
+
+ if (XFS_IS_CORRUPT(mp, ip->i_projid != 0)) {
+ xfs_irele(ip);
+ xfs_fs_mark_sick(mp, xfs_dqinode_sick_mask(type));
+ return -EFSCORRUPTED;
+ }
+
+ *ipp = ip;
+ return 0;
+}
+
+/* Create a metadata directory quota inode. */
+int
+xfs_dqinode_metadir_create(
+ struct xfs_inode *dp,
+ xfs_dqtype_t type,
+ struct xfs_inode **ipp)
+{
+ struct xfs_metadir_update upd = {
+ .dp = dp,
+ .metafile_type = xfs_dqinode_metafile_type(type),
+ .path = xfs_dqinode_path(type),
+ };
+ int error;
+
+ error = xfs_metadir_start_create(&upd);
+ if (error)
+ return error;
+
+ error = xfs_metadir_create(&upd, S_IFREG);
+ if (error)
+ return error;
+
+ xfs_trans_log_inode(upd.tp, upd.ip, XFS_ILOG_CORE);
+
+ error = xfs_metadir_commit(&upd);
+ if (error)
+ return error;
+
+ xfs_finish_inode_setup(upd.ip);
+ *ipp = upd.ip;
+ return 0;
+}
+
+#ifndef __KERNEL__
+/* Link a metadata directory quota inode. */
+int
+xfs_dqinode_metadir_link(
+ struct xfs_inode *dp,
+ xfs_dqtype_t type,
+ struct xfs_inode *ip)
+{
+ struct xfs_metadir_update upd = {
+ .dp = dp,
+ .metafile_type = xfs_dqinode_metafile_type(type),
+ .path = xfs_dqinode_path(type),
+ .ip = ip,
+ };
+ int error;
+
+ error = xfs_metadir_start_link(&upd);
+ if (error)
+ return error;
+
+ error = xfs_metadir_link(&upd);
+ if (error)
+ return error;
+
+ xfs_trans_log_inode(upd.tp, upd.ip, XFS_ILOG_CORE);
+
+ return xfs_metadir_commit(&upd);
+}
+#endif /* __KERNEL__ */
+
+/* Create the parent directory for all quota inodes and load it. */
+int
+xfs_dqinode_mkdir_parent(
+ struct xfs_mount *mp,
+ struct xfs_inode **dpp)
+{
+ if (!mp->m_metadirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
+ return -EFSCORRUPTED;
+ }
+
+ return xfs_metadir_mkdir(mp->m_metadirip, "quota", dpp);
+}
+
+/*
+ * Load the parent directory of all quota inodes. Pass the inode to the caller
+ * because quota functions (e.g. QUOTARM) can be called on the quota files even
+ * if quotas are not enabled.
+ */
+int
+xfs_dqinode_load_parent(
+ struct xfs_trans *tp,
+ struct xfs_inode **dpp)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+
+ if (!mp->m_metadirip) {
+ xfs_fs_mark_sick(mp, XFS_SICK_FS_METADIR);
+ return -EFSCORRUPTED;
+ }
+
+ return xfs_metadir_load(tp, mp->m_metadirip, "quota", XFS_METAFILE_DIR,
+ dpp);
+}
diff --git a/fs/xfs/libxfs/xfs_quota_defs.h b/fs/xfs/libxfs/xfs_quota_defs.h
index fb05f44f6c754..763d941a8420c 100644
--- a/fs/xfs/libxfs/xfs_quota_defs.h
+++ b/fs/xfs/libxfs/xfs_quota_defs.h
@@ -143,4 +143,47 @@ time64_t xfs_dquot_from_disk_ts(struct xfs_disk_dquot *ddq,
__be32 dtimer);
__be32 xfs_dquot_to_disk_ts(struct xfs_dquot *ddq, time64_t timer);
+static inline const char *
+xfs_dqinode_path(xfs_dqtype_t type)
+{
+ switch (type) {
+ case XFS_DQTYPE_USER:
+ return "user";
+ case XFS_DQTYPE_GROUP:
+ return "group";
+ case XFS_DQTYPE_PROJ:
+ return "project";
+ }
+
+ ASSERT(0);
+ return NULL;
+}
+
+static inline enum xfs_metafile_type
+xfs_dqinode_metafile_type(xfs_dqtype_t type)
+{
+ switch (type) {
+ case XFS_DQTYPE_USER:
+ return XFS_METAFILE_USRQUOTA;
+ case XFS_DQTYPE_GROUP:
+ return XFS_METAFILE_GRPQUOTA;
+ case XFS_DQTYPE_PROJ:
+ return XFS_METAFILE_PRJQUOTA;
+ }
+
+ ASSERT(0);
+ return XFS_METAFILE_UNKNOWN;
+}
+
+unsigned int xfs_dqinode_sick_mask(xfs_dqtype_t type);
+
+int xfs_dqinode_load(struct xfs_trans *tp, struct xfs_inode *dp,
+ xfs_dqtype_t type, struct xfs_inode **ipp);
+int xfs_dqinode_metadir_create(struct xfs_inode *dp, xfs_dqtype_t type,
+ struct xfs_inode **ipp);
+int xfs_dqinode_metadir_link(struct xfs_inode *dp, xfs_dqtype_t type,
+ struct xfs_inode *ip);
+int xfs_dqinode_mkdir_parent(struct xfs_mount *mp, struct xfs_inode **dpp);
+int xfs_dqinode_load_parent(struct xfs_trans *tp, struct xfs_inode **dpp);
+
#endif /* __XFS_QUOTA_H__ */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 3dc6d272519ba..2f5ccd6e7a662 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -816,6 +816,7 @@ xfs_sb_quota_to_disk(
uint16_t qflags = from->sb_qflags;
if (xfs_sb_version_hasmetadir(from)) {
+ to->sb_qflags = cpu_to_be16(from->sb_qflags);
to->sb_uquotino = cpu_to_be64(0);
to->sb_gquotino = cpu_to_be64(0);
to->sb_pquotino = cpu_to_be64(0);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index b37e80fe7e86a..d9d09195eabb0 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -645,6 +645,157 @@ xfs_qm_init_timelimits(
xfs_qm_dqdestroy(dqp);
}
+static int
+xfs_qm_load_metadir_qinos(
+ struct xfs_mount *mp,
+ struct xfs_quotainfo *qi,
+ struct xfs_inode **dpp)
+{
+ struct xfs_trans *tp;
+ int error;
+
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_dqinode_load_parent(tp, dpp);
+ if (error == -ENOENT) {
+ /* no quota directory yet, but we'll create one later */
+ error = 0;
+ goto out_trans;
+ }
+ if (error)
+ goto out_trans;
+
+ if (XFS_IS_UQUOTA_ON(mp)) {
+ error = xfs_dqinode_load(tp, *dpp, XFS_DQTYPE_USER,
+ &qi->qi_uquotaip);
+ if (error && error != -ENOENT)
+ goto out_trans;
+ }
+
+ if (XFS_IS_GQUOTA_ON(mp)) {
+ error = xfs_dqinode_load(tp, *dpp, XFS_DQTYPE_GROUP,
+ &qi->qi_gquotaip);
+ if (error && error != -ENOENT)
+ goto out_trans;
+ }
+
+ if (XFS_IS_PQUOTA_ON(mp)) {
+ error = xfs_dqinode_load(tp, *dpp, XFS_DQTYPE_PROJ,
+ &qi->qi_pquotaip);
+ if (error && error != -ENOENT)
+ goto out_trans;
+ }
+
+ error = 0;
+out_trans:
+ xfs_trans_cancel(tp);
+ return error;
+}
+
+/* Create quota inodes in the metadata directory tree. */
+STATIC int
+xfs_qm_create_metadir_qinos(
+ struct xfs_mount *mp,
+ struct xfs_quotainfo *qi,
+ struct xfs_inode **dpp)
+{
+ int error;
+
+ if (!*dpp) {
+ error = xfs_dqinode_mkdir_parent(mp, dpp);
+ if (error && error != -EEXIST)
+ return error;
+ }
+
+ if (XFS_IS_UQUOTA_ON(mp) && !qi->qi_uquotaip) {
+ error = xfs_dqinode_metadir_create(*dpp, XFS_DQTYPE_USER,
+ &qi->qi_uquotaip);
+ if (error)
+ return error;
+ }
+
+ if (XFS_IS_GQUOTA_ON(mp) && !qi->qi_gquotaip) {
+ error = xfs_dqinode_metadir_create(*dpp, XFS_DQTYPE_GROUP,
+ &qi->qi_gquotaip);
+ if (error)
+ return error;
+ }
+
+ if (XFS_IS_PQUOTA_ON(mp) && !qi->qi_pquotaip) {
+ error = xfs_dqinode_metadir_create(*dpp, XFS_DQTYPE_PROJ,
+ &qi->qi_pquotaip);
+ if (error)
+ return error;
+ }
+
+ return 0;
+}
+
+/*
+ * Add QUOTABIT to sb_versionnum and initialize qflags in preparation for
+ * creating quota files on a metadir filesystem.
+ */
+STATIC int
+xfs_qm_prep_metadir_sb(
+ struct xfs_mount *mp)
+{
+ struct xfs_trans *tp;
+ int error;
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_sb, 0, 0, 0, &tp);
+ if (error)
+ return error;
+
+ spin_lock(&mp->m_sb_lock);
+
+ xfs_add_quota(mp);
+
+ /* qflags will get updated fully _after_ quotacheck */
+ mp->m_sb.sb_qflags = mp->m_qflags & XFS_ALL_QUOTA_ACCT;
+
+ spin_unlock(&mp->m_sb_lock);
+ xfs_log_sb(tp);
+
+ return xfs_trans_commit(tp);
+}
+
+/*
+ * Load existing quota inodes or create them. Since this is a V5 filesystem,
+ * we don't have to deal with the grp/prjquota switcheroo thing from V4.
+ */
+STATIC int
+xfs_qm_init_metadir_qinos(
+ struct xfs_mount *mp)
+{
+ struct xfs_quotainfo *qi = mp->m_quotainfo;
+ struct xfs_inode *dp = NULL;
+ int error;
+
+ if (!xfs_has_quota(mp)) {
+ error = xfs_qm_prep_metadir_sb(mp);
+ if (error)
+ return error;
+ }
+
+ error = xfs_qm_load_metadir_qinos(mp, qi, &dp);
+ if (error)
+ goto out_err;
+
+ error = xfs_qm_create_metadir_qinos(mp, qi, &dp);
+ if (error)
+ goto out_err;
+
+ xfs_irele(dp);
+ return 0;
+out_err:
+ xfs_qm_destroy_quotainos(mp->m_quotainfo);
+ if (dp)
+ xfs_irele(dp);
+ return error;
+}
+
/*
* This initializes all the quota information that's kept in the
* mount structure
@@ -669,7 +820,10 @@ xfs_qm_init_quotainfo(
* See if quotainodes are setup, and if not, allocate them,
* and change the superblock accordingly.
*/
- error = xfs_qm_init_quotainos(mp);
+ if (xfs_has_metadir(mp))
+ error = xfs_qm_init_metadir_qinos(mp);
+ else
+ error = xfs_qm_init_quotainos(mp);
if (error)
goto out_free_lru;
@@ -1581,7 +1735,7 @@ xfs_qm_mount_quotas(
}
if (error) {
- xfs_warn(mp, "Failed to initialize disk quotas.");
+ xfs_warn(mp, "Failed to initialize disk quotas, err %d.", error);
return;
}
}
@@ -1600,31 +1754,26 @@ xfs_qm_qino_load(
xfs_dqtype_t type,
struct xfs_inode **ipp)
{
- xfs_ino_t ino = NULLFSINO;
- enum xfs_metafile_type metafile_type = XFS_METAFILE_UNKNOWN;
+ struct xfs_trans *tp;
+ struct xfs_inode *dp = NULL;
+ int error;
- switch (type) {
- case XFS_DQTYPE_USER:
- ino = mp->m_sb.sb_uquotino;
- metafile_type = XFS_METAFILE_USRQUOTA;
- break;
- case XFS_DQTYPE_GROUP:
- ino = mp->m_sb.sb_gquotino;
- metafile_type = XFS_METAFILE_GRPQUOTA;
- break;
- case XFS_DQTYPE_PROJ:
- ino = mp->m_sb.sb_pquotino;
- metafile_type = XFS_METAFILE_PRJQUOTA;
- break;
- default:
- ASSERT(0);
- return -EFSCORRUPTED;
+ error = xfs_trans_alloc_empty(mp, &tp);
+ if (error)
+ return error;
+
+ if (xfs_has_metadir(mp)) {
+ error = xfs_dqinode_load_parent(tp, &dp);
+ if (error)
+ goto out_cancel;
}
- if (ino == NULLFSINO)
- return -ENOENT;
-
- return xfs_metafile_iget(mp, ino, metafile_type, ipp);
+ error = xfs_dqinode_load(tp, dp, type, ipp);
+ if (dp)
+ xfs_irele(dp);
+out_cancel:
+ xfs_trans_cancel(tp);
+ return error;
}
/*
* [PATCH 3/6] xfs: scrub quota file metapaths
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
2024-08-23 0:28 ` [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos Darrick J. Wong
2024-08-23 0:28 ` [PATCH 2/6] xfs: use metadir for quota inodes Darrick J. Wong
@ 2024-08-23 0:28 ` Darrick J. Wong
2024-08-23 5:53 ` Christoph Hellwig
2024-08-23 0:28 ` [PATCH 4/6] xfs: persist quota flags with metadir Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:28 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Enable online fsck for quota file metadata directory paths.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 6 +++-
fs/xfs/scrub/metapath.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 11fa3d0c38086..d460946cae8f1 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,9 +825,13 @@ struct xfs_scrub_vec_head {
#define XFS_SCRUB_METAPATH_RTDIR (1) /* rtgroups metadir */
#define XFS_SCRUB_METAPATH_RTBITMAP (2) /* per-rtg bitmap */
#define XFS_SCRUB_METAPATH_RTSUMMARY (3) /* per-rtg summary */
+#define XFS_SCRUB_METAPATH_QUOTADIR (4) /* quota metadir */
+#define XFS_SCRUB_METAPATH_USRQUOTA (5) /* user quota */
+#define XFS_SCRUB_METAPATH_GRPQUOTA (6) /* group quota */
+#define XFS_SCRUB_METAPATH_PRJQUOTA (7) /* project quota */
/* Number of metapath sm_ino values */
-#define XFS_SCRUB_METAPATH_NR (4)
+#define XFS_SCRUB_METAPATH_NR (8)
/*
* ioctl limits
diff --git a/fs/xfs/scrub/metapath.c b/fs/xfs/scrub/metapath.c
index e5714655152db..49ea19edc1492 100644
--- a/fs/xfs/scrub/metapath.c
+++ b/fs/xfs/scrub/metapath.c
@@ -165,6 +165,74 @@ xchk_setup_metapath_rtginode(
# define xchk_setup_metapath_rtginode(...) (-ENOENT)
#endif /* CONFIG_XFS_RT */
+#ifdef CONFIG_XFS_QUOTA
+/* Scan the /quota directory itself. */
+static int
+xchk_setup_metapath_quotadir(
+ struct xfs_scrub *sc)
+{
+ struct xfs_trans *tp;
+ struct xfs_inode *dp = NULL;
+ int error;
+
+ error = xfs_trans_alloc_empty(sc->mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_dqinode_load_parent(tp, &dp);
+ xfs_trans_cancel(tp);
+ if (error)
+ return error;
+
+ error = xchk_setup_metapath_scan(sc, sc->mp->m_metadirip,
+ kasprintf(GFP_KERNEL, "quota"), dp);
+ xfs_irele(dp);
+ return error;
+}
+
+/* Scan a quota inode under the /quota directory. */
+static int
+xchk_setup_metapath_dqinode(
+ struct xfs_scrub *sc,
+ xfs_dqtype_t type)
+{
+ struct xfs_trans *tp = NULL;
+ struct xfs_inode *dp = NULL;
+ struct xfs_inode *ip = NULL;
+ const char *path;
+ int error;
+
+ error = xfs_trans_alloc_empty(sc->mp, &tp);
+ if (error)
+ return error;
+
+ error = xfs_dqinode_load_parent(tp, &dp);
+ if (error)
+ goto out_cancel;
+
+ error = xfs_dqinode_load(tp, dp, type, &ip);
+ if (error)
+ goto out_dp;
+
+ xfs_trans_cancel(tp);
+ tp = NULL;
+
+ path = kasprintf(GFP_KERNEL, "%s", xfs_dqinode_path(type));
+ error = xchk_setup_metapath_scan(sc, dp, path, ip);
+
+ xfs_irele(ip);
+out_dp:
+ xfs_irele(dp);
+out_cancel:
+ if (tp)
+ xfs_trans_cancel(tp);
+ return error;
+}
+#else
+# define xchk_setup_metapath_quotadir(...) (-ENOENT)
+# define xchk_setup_metapath_dqinode(...) (-ENOENT)
+#endif /* CONFIG_XFS_QUOTA */
+
int
xchk_setup_metapath(
struct xfs_scrub *sc)
@@ -186,6 +254,14 @@ xchk_setup_metapath(
return xchk_setup_metapath_rtginode(sc, XFS_RTGI_BITMAP);
case XFS_SCRUB_METAPATH_RTSUMMARY:
return xchk_setup_metapath_rtginode(sc, XFS_RTGI_SUMMARY);
+ case XFS_SCRUB_METAPATH_QUOTADIR:
+ return xchk_setup_metapath_quotadir(sc);
+ case XFS_SCRUB_METAPATH_USRQUOTA:
+ return xchk_setup_metapath_dqinode(sc, XFS_DQTYPE_USER);
+ case XFS_SCRUB_METAPATH_GRPQUOTA:
+ return xchk_setup_metapath_dqinode(sc, XFS_DQTYPE_GROUP);
+ case XFS_SCRUB_METAPATH_PRJQUOTA:
+ return xchk_setup_metapath_dqinode(sc, XFS_DQTYPE_PROJ);
default:
return -ENOENT;
}
* [PATCH 4/6] xfs: persist quota flags with metadir
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
` (2 preceding siblings ...)
2024-08-23 0:28 ` [PATCH 3/6] xfs: scrub quota file metapaths Darrick J. Wong
@ 2024-08-23 0:28 ` Darrick J. Wong
2024-08-23 5:54 ` Christoph Hellwig
2024-08-26 9:42 ` Dave Chinner
2024-08-23 0:29 ` [PATCH 5/6] xfs: update sb field checks when metadir is turned on Darrick J. Wong
2024-08-23 0:29 ` [PATCH 6/6] xfs: enable metadata directory feature Darrick J. Wong
5 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:28 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
It's annoying that one has to keep reminding XFS about what quota
options it should mount with, since the quota flags recording the
previous state are sitting right there in the primary superblock.
Stranger still, there exists a noquota option to disable quotas
completely, yet providing no options at all behaves the same as
noquota.
Starting with metadir, let's change the behavior so that if the user
does not specify any quota-related mount options at all, the ondisk
quota flags will be used to bring up quota. In other words, the
filesystem will mount in the same state and with the same functionality
as it had during the last mount.
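The sentinel-bit scheme that this patch uses to detect "no quota mount
options given" can be sketched in user space roughly as follows. The
struct, constants, and function names here are simplified stand-ins for
the kernel's struct xfs_mount, XFS_QFLAGS_MNTOPTS, the option-parsing
path, and xfs_fs_fill_super — a sketch of the idea, not the real
implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel flag values (illustrative only). */
#define QF_UQUOTA_ACCT   (1U << 0)
#define QF_UQUOTA_ENFD   (1U << 1)
#define QF_ALL_ACCT_ENFD 0x3fU        /* all accounting/enforcement bits */
#define QF_MNTOPTS       (1U << 31)   /* sentinel: "a quota option was given" */

struct mount {
	uint32_t m_qflags;   /* in-memory quota flags being assembled */
	uint32_t sb_qflags;  /* quota flags recorded in the ondisk superblock */
	bool     resuming;   /* resume ondisk quota state on this mount? */
};

/* Every quota mount option sets its flags plus the sentinel bit. */
static void parse_usrquota(struct mount *mp)
{
	mp->m_qflags |= QF_UQUOTA_ACCT | QF_UQUOTA_ENFD | QF_MNTOPTS;
}

/*
 * At fill_super time: if the sentinel never got set, no quota options
 * were given, so resume the ondisk state.  Strip the sentinel either
 * way so it never reaches the ondisk qflags.
 */
static void fill_super(struct mount *mp)
{
	if (!(mp->m_qflags & QF_MNTOPTS))
		mp->resuming = true;
	mp->m_qflags &= ~QF_MNTOPTS;

	if (mp->resuming)
		mp->m_qflags = mp->sb_qflags & QF_ALL_ACCT_ENFD;
}
```

The top bit is safe to borrow because the real accounting/enforcement
flags only occupy the low bits of qflags, which the patch enforces with
a BUILD_BUG_ON against XFS_MOUNT_QUOTA_ALL.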
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_mount.c | 15 +++++++++++++++
fs/xfs/xfs_mount.h | 6 ++++++
fs/xfs/xfs_qm_bhv.c | 18 ++++++++++++++++++
fs/xfs/xfs_quota.h | 2 ++
fs/xfs/xfs_super.c | 22 ++++++++++++++++++++++
5 files changed, 63 insertions(+)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 5726ea597f5a2..cbf47354561c1 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -850,6 +850,13 @@ xfs_mountfs(
if (error)
goto out_fail_wait;
+ /*
+ * If we're resuming quota status, pick up the preliminary qflags from
+ * the ondisk superblock so that we know if we should recover dquots.
+ */
+ if (xfs_is_resuming_quotaon(mp))
+ xfs_qm_resume_quotaon(mp);
+
/*
* Log's mount-time initialization. The first part of recovery can place
* some items on the AIL, to be handled when recovery is finished or
@@ -863,6 +870,14 @@ xfs_mountfs(
goto out_inodegc_shrinker;
}
+ /*
+ * If we're resuming quota status and recovered the log, re-sample the
+ * qflags from the ondisk superblock now that we've recovered it, just
+ * in case someone shut down enforcement just before a crash.
+ */
+ if (xfs_clear_resuming_quotaon(mp) && xlog_recovery_needed(mp->m_log))
+ xfs_qm_resume_quotaon(mp);
+
/*
* If logged xattrs are still enabled after log recovery finishes, then
* they'll be available until unmount. Otherwise, turn them off.
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7e68812db1be7..ba9af63aec143 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -459,6 +459,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
#define XFS_OPSTATE_UNSET_LOG_INCOMPAT 11
/* Filesystem can use logged extended attributes */
#define XFS_OPSTATE_USE_LARP 12
+/* Filesystem should use qflags to determine quotaon status */
+#define XFS_OPSTATE_RESUMING_QUOTAON 13
#define __XFS_IS_OPSTATE(name, NAME) \
static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
@@ -483,8 +485,12 @@ __XFS_IS_OPSTATE(inodegc_enabled, INODEGC_ENABLED)
__XFS_IS_OPSTATE(blockgc_enabled, BLOCKGC_ENABLED)
#ifdef CONFIG_XFS_QUOTA
__XFS_IS_OPSTATE(quotacheck_running, QUOTACHECK_RUNNING)
+__XFS_IS_OPSTATE(resuming_quotaon, RESUMING_QUOTAON)
#else
# define xfs_is_quotacheck_running(mp) (false)
+# define xfs_is_resuming_quotaon(mp) (false)
+# define xfs_set_resuming_quotaon(mp) (false)
+# define xfs_clear_resuming_quotaon(mp) (false)
#endif
__XFS_IS_OPSTATE(done_with_log_incompat, UNSET_LOG_INCOMPAT)
__XFS_IS_OPSTATE(using_logged_xattrs, USE_LARP)
diff --git a/fs/xfs/xfs_qm_bhv.c b/fs/xfs/xfs_qm_bhv.c
index a11436579877d..79a96558f739e 100644
--- a/fs/xfs/xfs_qm_bhv.c
+++ b/fs/xfs/xfs_qm_bhv.c
@@ -135,3 +135,21 @@ xfs_qm_newmount(
return 0;
}
+
+/*
+ * If the sysadmin didn't provide any quota mount options, restore the quota
+ * accounting and enforcement state from the ondisk superblock. Only do this
+ * for metadir filesystems because this is a behavior change.
+ */
+void
+xfs_qm_resume_quotaon(
+ struct xfs_mount *mp)
+{
+ if (!xfs_has_metadir(mp))
+ return;
+ if (xfs_has_norecovery(mp))
+ return;
+
+ mp->m_qflags = mp->m_sb.sb_qflags & (XFS_ALL_QUOTA_ACCT |
+ XFS_ALL_QUOTA_ENFD);
+}
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 645761997bf2d..2d36d967380e7 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -125,6 +125,7 @@ extern void xfs_qm_dqdetach(struct xfs_inode *);
extern void xfs_qm_dqrele(struct xfs_dquot *);
extern void xfs_qm_statvfs(struct xfs_inode *, struct kstatfs *);
extern int xfs_qm_newmount(struct xfs_mount *, uint *, uint *);
+void xfs_qm_resume_quotaon(struct xfs_mount *mp);
extern void xfs_qm_mount_quotas(struct xfs_mount *);
extern void xfs_qm_unmount(struct xfs_mount *);
extern void xfs_qm_unmount_quotas(struct xfs_mount *);
@@ -202,6 +203,7 @@ xfs_trans_reserve_quota_icreate(struct xfs_trans *tp, struct xfs_dquot *udqp,
#define xfs_qm_dqrele(d) do { (d) = (d); } while(0)
#define xfs_qm_statvfs(ip, s) do { } while(0)
#define xfs_qm_newmount(mp, a, b) (0)
+#define xfs_qm_resume_quotaon(mp) ((void)0)
#define xfs_qm_mount_quotas(mp)
#define xfs_qm_unmount(mp)
#define xfs_qm_unmount_quotas(mp)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 835886c322a83..d02bfe9ddfe58 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -67,6 +67,9 @@ enum xfs_dax_mode {
XFS_DAX_NEVER = 2,
};
+/* Were quota mount options provided? Must use the upper 16 bits of qflags. */
+#define XFS_QFLAGS_MNTOPTS (1U << 31)
+
static void
xfs_mount_set_dax_mode(
struct xfs_mount *mp,
@@ -1264,6 +1267,8 @@ xfs_fs_parse_param(
int size = 0;
int opt;
+ BUILD_BUG_ON(XFS_QFLAGS_MNTOPTS & XFS_MOUNT_QUOTA_ALL);
+
opt = fs_parse(fc, xfs_fs_parameters, param, &result);
if (opt < 0)
return opt;
@@ -1341,32 +1346,39 @@ xfs_fs_parse_param(
case Opt_noquota:
parsing_mp->m_qflags &= ~XFS_ALL_QUOTA_ACCT;
parsing_mp->m_qflags &= ~XFS_ALL_QUOTA_ENFD;
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_quota:
case Opt_uquota:
case Opt_usrquota:
parsing_mp->m_qflags |= (XFS_UQUOTA_ACCT | XFS_UQUOTA_ENFD);
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_qnoenforce:
case Opt_uqnoenforce:
parsing_mp->m_qflags |= XFS_UQUOTA_ACCT;
parsing_mp->m_qflags &= ~XFS_UQUOTA_ENFD;
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_pquota:
case Opt_prjquota:
parsing_mp->m_qflags |= (XFS_PQUOTA_ACCT | XFS_PQUOTA_ENFD);
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_pqnoenforce:
parsing_mp->m_qflags |= XFS_PQUOTA_ACCT;
parsing_mp->m_qflags &= ~XFS_PQUOTA_ENFD;
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_gquota:
case Opt_grpquota:
parsing_mp->m_qflags |= (XFS_GQUOTA_ACCT | XFS_GQUOTA_ENFD);
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_gqnoenforce:
parsing_mp->m_qflags |= XFS_GQUOTA_ACCT;
parsing_mp->m_qflags &= ~XFS_GQUOTA_ENFD;
+ parsing_mp->m_qflags |= XFS_QFLAGS_MNTOPTS;
return 0;
case Opt_discard:
parsing_mp->m_features |= XFS_FEAT_DISCARD;
@@ -1761,6 +1773,14 @@ xfs_fs_fill_super(
xfs_warn(mp,
"EXPERIMENTAL parent pointer feature enabled. Use at your own risk!");
+ /*
+ * If no quota mount options were provided, maybe we'll try to pick
+ * up the quota accounting and enforcement flags from the ondisk sb.
+ */
+ if (!(mp->m_qflags & XFS_QFLAGS_MNTOPTS))
+ xfs_set_resuming_quotaon(mp);
+ mp->m_qflags &= ~XFS_QFLAGS_MNTOPTS;
+
error = xfs_mountfs(mp);
if (error)
goto out_filestream_unmount;
@@ -1947,6 +1967,8 @@ xfs_fs_reconfigure(
int flags = fc->sb_flags;
int error;
+ new_mp->m_qflags &= ~XFS_QFLAGS_MNTOPTS;
+
/* version 5 superblocks always support version counters. */
if (xfs_has_crc(mp))
fc->sb_flags |= SB_I_VERSION;
* [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
` (3 preceding siblings ...)
2024-08-23 0:28 ` [PATCH 4/6] xfs: persist quota flags with metadir Darrick J. Wong
@ 2024-08-23 0:29 ` Darrick J. Wong
2024-08-23 5:55 ` Christoph Hellwig
2024-08-26 9:52 ` Dave Chinner
2024-08-23 0:29 ` [PATCH 6/6] xfs: enable metadata directory feature Darrick J. Wong
5 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:29 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
When metadir is enabled, we want to check the two new rtgroups fields,
and we don't want to check the old inumbers that are now in the metadir.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/scrub/agheader.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index cad997f38a424..0d22d70950a5c 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -147,14 +147,14 @@ xchk_superblock(
if (xfs_has_metadir(sc->mp)) {
if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
xchk_block_set_preen(sc, bp);
+ } else {
+ if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
+ xchk_block_set_preen(sc, bp);
+
+ if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
+ xchk_block_set_preen(sc, bp);
}
- if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
- xchk_block_set_preen(sc, bp);
-
- if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
- xchk_block_set_preen(sc, bp);
-
if (sb->sb_rextsize != cpu_to_be32(mp->m_sb.sb_rextsize))
xchk_block_set_corrupt(sc, bp);
@@ -229,11 +229,13 @@ xchk_superblock(
* sb_icount, sb_ifree, sb_fdblocks, sb_frexents
*/
- if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
- xchk_block_set_preen(sc, bp);
+ if (!xfs_has_metadir(mp)) {
+ if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
+ xchk_block_set_preen(sc, bp);
- if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
- xchk_block_set_preen(sc, bp);
+ if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
+ xchk_block_set_preen(sc, bp);
+ }
/*
* Skip the quota flags since repair will force quotacheck.
@@ -342,8 +344,10 @@ xchk_superblock(
if (sb->sb_spino_align != cpu_to_be32(mp->m_sb.sb_spino_align))
xchk_block_set_corrupt(sc, bp);
- if (sb->sb_pquotino != cpu_to_be64(mp->m_sb.sb_pquotino))
- xchk_block_set_preen(sc, bp);
+ if (!xfs_has_metadir(mp)) {
+ if (sb->sb_pquotino != cpu_to_be64(mp->m_sb.sb_pquotino))
+ xchk_block_set_preen(sc, bp);
+ }
/* Don't care about sb_lsn */
}
@@ -354,6 +358,14 @@ xchk_superblock(
xchk_block_set_corrupt(sc, bp);
}
+ if (xfs_has_metadir(mp)) {
+ if (sb->sb_rgcount != cpu_to_be32(mp->m_sb.sb_rgcount))
+ xchk_block_set_corrupt(sc, bp);
+
+ if (sb->sb_rgextents != cpu_to_be32(mp->m_sb.sb_rgextents))
+ xchk_block_set_corrupt(sc, bp);
+ }
+
/* Everything else must be zero. */
if (memchr_inv(sb + 1, 0,
BBTOB(bp->b_length) - sizeof(struct xfs_dsb)))
* [PATCH 6/6] xfs: enable metadata directory feature
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
` (4 preceding siblings ...)
2024-08-23 0:29 ` [PATCH 5/6] xfs: update sb field checks when metadir is turned on Darrick J. Wong
@ 2024-08-23 0:29 ` Darrick J. Wong
2024-08-23 5:58 ` Christoph Hellwig
5 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:29 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Enable the metadata directory feature.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_format.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index cafac42cd51ad..6aa141c99e808 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -397,7 +397,8 @@ xfs_sb_has_ro_compat_feature(
XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \
XFS_SB_FEAT_INCOMPAT_NREXT64 | \
XFS_SB_FEAT_INCOMPAT_EXCHRANGE | \
- XFS_SB_FEAT_INCOMPAT_PARENT)
+ XFS_SB_FEAT_INCOMPAT_PARENT | \
+ XFS_SB_FEAT_INCOMPAT_METADIR)
#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool
* Re: [PATCH 8/9] xfs: take m_growlock when running growfsrt
2024-08-23 0:00 ` [PATCH 8/9] xfs: take m_growlock when running growfsrt Darrick J. Wong
@ 2024-08-23 4:08 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:08 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:00:51PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Take the grow lock when we're expanding the realtime volume, like we do
> for the other growfs calls.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt
2024-08-23 0:01 ` [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt Darrick J. Wong
@ 2024-08-23 4:09 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:09 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
` (8 preceding siblings ...)
2024-08-23 0:01 ` [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt Darrick J. Wong
@ 2024-08-23 4:09 ` Christoph Hellwig
9 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:09 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, wozizhi, Anders Blomdell, Christoph Hellwig, willy,
kjell.m.randa, linux-xfs
On Thu, Aug 22, 2024 at 04:56:25PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> Various bug fixes for 6.11.
FYI, patches 5+ seem to be pretty long and not critical issues.
I'd probably defer them to 6.12 with the rest of the patchbomb.
* Re: [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code
2024-08-23 0:00 ` [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Darrick J. Wong
@ 2024-08-23 4:10 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:10 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: wozizhi, hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap
2024-08-23 0:00 ` [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap Darrick J. Wong
@ 2024-08-23 4:10 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:10 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Zizhi Wo, hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
@ 2024-08-23 4:12 ` Christoph Hellwig
2024-08-23 13:20 ` Jeff Layton
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, Jeff Layton
On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point. The start-commit ioctl performs the sampling
> of file attributes.
The code itself looks simple enough now, but how do we guarantee
that ctime actually works as a full change counter and not just by
chance here?
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc
2024-08-23 0:02 ` [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc Darrick J. Wong
@ 2024-08-23 4:13 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:13 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 01/26] xfs: define the on-disk format for the metadir feature
2024-08-23 0:02 ` [PATCH 01/26] xfs: define the on-disk format for the metadir feature Darrick J. Wong
@ 2024-08-23 4:30 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:30 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:02:25PM -0700, Darrick J. Wong wrote:
> +static inline bool xfs_sb_version_hasmetadir(const struct xfs_sb *sbp)
> +{
> + return (XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5) &&
This is copy and paste from the other xfs_sb_version_* helpers,
but there really is no need for the braces here.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 02/26] xfs: refactor loading quota inodes in the regular case
2024-08-23 0:02 ` [PATCH 02/26] xfs: refactor loading quota inodes in the regular case Darrick J. Wong
@ 2024-08-23 4:31 ` Christoph Hellwig
2024-08-23 17:51 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:31 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
> + xfs_ino_t ino = NULLFSINO;
> +
> + switch (type) {
> + case XFS_DQTYPE_USER:
> + ino = mp->m_sb.sb_uquotino;
> + break;
> + case XFS_DQTYPE_GROUP:
> + ino = mp->m_sb.sb_gquotino;
> + break;
> + case XFS_DQTYPE_PROJ:
> + ino = mp->m_sb.sb_pquotino;
> + break;
> + default:
> + ASSERT(0);
> + return -EFSCORRUPTED;
> + }
I'd probably split this type-to-ino lookup into a separate helper,
but that doesn't really matter.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 03/26] xfs: iget for metadata inodes
2024-08-23 0:02 ` [PATCH 03/26] xfs: iget for metadata inodes Darrick J. Wong
@ 2024-08-23 4:35 ` Christoph Hellwig
2024-08-23 17:53 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:35 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:02:56PM -0700, Darrick J. Wong wrote:
> +#include "xfs_da_format.h"
> +#include "xfs_dir2.h"
> +#include "xfs_metafile.h"
Hmm, there really should be no need to include xfs_da_format.h before
metafile.h - enum xfs_metafile_type is in format.h and I can't see what
else would need it.
I don't think dir2.h is needed here either.
Otherwise this looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 04/26] xfs: load metadata directory root at mount time
2024-08-23 0:03 ` [PATCH 04/26] xfs: load metadata directory root at mount time Darrick J. Wong
@ 2024-08-23 4:35 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:35 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 05/26] xfs: enforce metadata inode flag
2024-08-23 0:03 ` [PATCH 05/26] xfs: enforce metadata inode flag Darrick J. Wong
@ 2024-08-23 4:38 ` Christoph Hellwig
2024-08-23 17:55 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:38 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
> + /* Mandatory directory flags must be set */
s/directory/inode/ ?
> +/* All metadata directory files must have these flags set. */
s/directory files/directories/ ?
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 06/26] xfs: read and write metadata inode directory tree
2024-08-23 0:03 ` [PATCH 06/26] xfs: read and write metadata inode directory tree Darrick J. Wong
@ 2024-08-23 4:39 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:39 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:03:43PM -0700, Darrick J. Wong wrote:
> +#ifndef __KERNEL__
> +/*
> + * Begin the process of linking a metadata file by allocating transactions
> + * and locking whatever resources we're going to need.
> + */
> +int
> +xfs_metadir_start_link(
> + struct xfs_metadir_update *upd)
I kinda hate placing this repair-only code in libxfs, but given the
dependencies on metadir internals I can't really think of anything
better, so:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 07/26] xfs: disable the agi rotor for metadata inodes
2024-08-23 0:03 ` [PATCH 07/26] xfs: disable the agi rotor for metadata inodes Darrick J. Wong
@ 2024-08-23 4:39 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:39 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special
2024-08-23 0:04 ` [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special Darrick J. Wong
@ 2024-08-23 4:40 ` Christoph Hellwig
2024-08-26 0:41 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:40 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 09/26] xfs: advertise metadata directory feature
2024-08-23 0:04 ` [PATCH 09/26] xfs: advertise metadata directory feature Darrick J. Wong
@ 2024-08-23 4:40 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:40 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 10/26] xfs: allow bulkstat to return metadata directories
2024-08-23 0:04 ` [PATCH 10/26] xfs: allow bulkstat to return metadata directories Darrick J. Wong
@ 2024-08-23 4:41 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/26] xfs: don't count metadata directory files to quota
2024-08-23 0:05 ` [PATCH 11/26] xfs: don't count metadata directory files to quota Darrick J. Wong
@ 2024-08-23 4:42 ` Christoph Hellwig
2024-08-26 0:47 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 12/26] xfs: mark quota inodes as metadata files
2024-08-23 0:05 ` [PATCH 12/26] xfs: mark quota inodes as metadata files Darrick J. Wong
@ 2024-08-23 4:42 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir
2024-08-23 0:05 ` [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir Darrick J. Wong
@ 2024-08-23 4:42 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:05:32PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Online repair might use the xfs_bmap_add_attrfork to repair a file in
> the metadata directory tree if (say) the metadata file lacks the correct
> parent pointers. In that case, it is not correct to check that the file
> is dqattached -- metadata files must not have /any/ dquot attached at
> all.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 14/26] xfs: record health problems with the metadata directory
2024-08-23 0:05 ` [PATCH 14/26] xfs: record health problems with the metadata directory Darrick J. Wong
@ 2024-08-23 4:43 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:43 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 15/26] xfs: refactor directory tree root predicates
2024-08-23 0:06 ` [PATCH 15/26] xfs: refactor directory tree root predicates Darrick J. Wong
@ 2024-08-23 4:48 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:48 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck
2024-08-23 0:06 ` [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck Darrick J. Wong
@ 2024-08-23 4:48 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:48 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork
2024-08-23 0:06 ` [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork Darrick J. Wong
@ 2024-08-23 4:49 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:49 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled
2024-08-23 0:06 ` [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled Darrick J. Wong
@ 2024-08-23 4:50 ` Christoph Hellwig
2024-08-23 18:00 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:50 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:06:50PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> If metadata directory trees are enabled, it's possible that some future
> metadata file might want to store information in extended attributes.
> Or, if parent pointers are enabled, then children of the metadir tree
> need parent pointers. Either way, we start allowing xattr data when
> metadir is enabled, so we now need check and repair to examine attr
> forks for metadata files on metadir filesystems.
I think the parent pointer case is the relevant one here, so maybe state
that more clearly?
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files
2024-08-23 0:07 ` [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files Darrick J. Wong
@ 2024-08-23 4:50 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:50 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load
2024-08-23 0:07 ` [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load Darrick J. Wong
@ 2024-08-23 4:51 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:51 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/26] xfs: scrub metadata directories
2024-08-23 0:07 ` [PATCH 21/26] xfs: scrub metadata directories Darrick J. Wong
@ 2024-08-23 4:53 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 22/26] xfs: check the metadata directory inumber in superblocks
2024-08-23 0:07 ` [PATCH 22/26] xfs: check the metadata directory inumber in superblocks Darrick J. Wong
@ 2024-08-23 4:53 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree
2024-08-23 0:08 ` [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree Darrick J. Wong
@ 2024-08-23 4:54 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:54 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 24/26] xfs: check metadata directory file path connectivity
2024-08-23 0:08 ` [PATCH 24/26] xfs: check metadata directory file path connectivity Darrick J. Wong
@ 2024-08-23 4:55 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair
2024-08-23 0:08 ` [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair Darrick J. Wong
@ 2024-08-23 4:55 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 26/26] xfs: repair metadata directory file path connectivity
2024-08-23 0:08 ` [PATCH 26/26] xfs: repair metadata directory file path connectivity Darrick J. Wong
@ 2024-08-23 4:56 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block
2024-08-23 0:12 ` [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 4:57 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block
2024-08-23 0:13 ` [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 4:57 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 05/10] xfs: refactor aligning bestlen to prod
2024-08-23 0:13 ` [PATCH 05/10] xfs: refactor aligning bestlen to prod Darrick J. Wong
@ 2024-08-23 4:58 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit
2024-08-23 0:13 ` [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit Darrick J. Wong
@ 2024-08-23 4:58 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near
2024-08-23 0:13 ` [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near Darrick J. Wong
@ 2024-08-23 4:59 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block
2024-08-23 0:14 ` [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block Darrick J. Wong
@ 2024-08-23 4:59 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers
2024-08-23 0:16 ` [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers Darrick J. Wong
@ 2024-08-23 5:00 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit
2024-08-23 0:17 ` [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit Darrick J. Wong
@ 2024-08-23 5:01 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c
2024-08-23 0:17 ` [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c Darrick J. Wong
@ 2024-08-23 5:01 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-23 0:17 ` [PATCH 11/24] xfs: create incore realtime group structures Darrick J. Wong
@ 2024-08-23 5:01 ` Christoph Hellwig
2024-08-25 23:56 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 12/24] xfs: define locking primitives for realtime groups
2024-08-23 0:17 ` [PATCH 12/24] xfs: define locking primitives for realtime groups Darrick J. Wong
@ 2024-08-23 5:02 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:02 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-23 0:18 ` [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes Darrick J. Wong
@ 2024-08-23 5:02 ` Christoph Hellwig
2024-08-25 23:58 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:02 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-23 0:18 ` [PATCH 14/24] xfs: support caching rtgroup metadata inodes Darrick J. Wong
@ 2024-08-23 5:02 ` Christoph Hellwig
2024-08-26 1:41 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:02 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management
2024-08-23 0:18 ` [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management Darrick J. Wong
@ 2024-08-23 5:03 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:03 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 17/24] xfs: remove XFS_ILOCK_RT*
2024-08-23 0:19 ` [PATCH 17/24] xfs: remove XFS_ILOCK_RT* Darrick J. Wong
@ 2024-08-23 5:04 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:04 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 01/26] xfs: define the format of rt groups
2024-08-23 0:21 ` [PATCH 01/26] xfs: define the format of rt groups Darrick J. Wong
@ 2024-08-23 5:11 ` Christoph Hellwig
2024-08-23 18:12 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:21:25PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Define the ondisk format of realtime group metadata, and a superblock
> for realtime volumes. rt supers are protected by a separate rocompat
> bit so that we can leave them off if the rt device is zoned.
We actually killed the flag again and just kept the separate helper
to check for it.
> Add a xfs_sb_version_hasrtgroups so that xfs_repair knows how to zero
> the tail of superblocks.
.. and merged the rtgroup and metadir flags, so while this helper
still exists (and will get lots of use to make the code readable),
that particular use case is gone now.
> -#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
> -#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
> -#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> -#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> +#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
> +#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
> +#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> +#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
That also means the above is just a spurious unrelated cleanup now.
Still useful, but maybe it should go into a separate patch? Or just
don't bother. Btw, one day we should clearly mark all our on-disk
bitmaps as unsigned.
> + if (xfs_has_rtgroups(nmp))
> + nmp->m_sb.sb_rgcount =
> + howmany_64(nmp->m_sb.sb_rextents, nmp->m_sb.sb_rgextents);
	nmp->m_sb.sb_rgcount = howmany_64(nmp->m_sb.sb_rextents,
			nmp->m_sb.sb_rgextents);
to avoid the overly long line.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 02/26] xfs: check the realtime superblock at mount time
2024-08-23 0:21 ` [PATCH 02/26] xfs: check the realtime superblock at mount time Darrick J. Wong
@ 2024-08-23 5:11 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:21:41PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Check the realtime superblock at mount time, to ensure that the label
> and uuids actually match the primary superblock on the data device. If
> the rt superblock is good, attach it to the xfs_mount so that the log
> can use ordered buffers to keep this primary in sync with the primary
> super on the data device.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 03/26] xfs: update realtime super every time we update the primary fs super
2024-08-23 0:21 ` [PATCH 03/26] xfs: update realtime super every time we update the primary fs super Darrick J. Wong
@ 2024-08-23 5:12 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM
2024-08-23 0:22 ` [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM Darrick J. Wong
@ 2024-08-23 5:12 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups
2024-08-23 0:22 ` [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups Darrick J. Wong
@ 2024-08-23 5:13 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:13 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled
2024-08-23 0:22 ` [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled Darrick J. Wong
@ 2024-08-23 5:13 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:13 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE
2024-08-23 0:23 ` [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE Darrick J. Wong
@ 2024-08-23 5:14 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:14 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 09/26] xfs: record rt group metadata errors in the health system
2024-08-23 0:23 ` [PATCH 09/26] xfs: record rt group metadata errors in the health system Darrick J. Wong
@ 2024-08-23 5:14 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:14 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 10/26] xfs: export the geometry of realtime groups to userspace
2024-08-23 0:23 ` [PATCH 10/26] xfs: export the geometry of realtime groups to userspace Darrick J. Wong
@ 2024-08-23 5:14 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:14 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks
2024-08-23 0:24 ` [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks Darrick J. Wong
@ 2024-08-23 5:15 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 12/26] xfs: encode the rtbitmap in big endian format
2024-08-23 0:24 ` [PATCH 12/26] xfs: encode the rtbitmap in big endian format Darrick J. Wong
@ 2024-08-23 5:15 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 13/26] xfs: encode the rtsummary in big endian format
2024-08-23 0:24 ` [PATCH 13/26] xfs: encode the rtsummary " Darrick J. Wong
@ 2024-08-23 5:15 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled
2024-08-23 0:24 ` [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled Darrick J. Wong
@ 2024-08-23 5:16 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 15/26] xfs: store rtgroup information with a bmap intent
2024-08-23 0:25 ` [PATCH 15/26] xfs: store rtgroup information with a bmap intent Darrick J. Wong
@ 2024-08-23 5:16 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl
2024-08-23 0:25 ` [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl Darrick J. Wong
@ 2024-08-23 5:17 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:17 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 17/26] xfs: support logging EFIs for realtime extents
2024-08-23 0:25 ` [PATCH 17/26] xfs: support logging EFIs for realtime extents Darrick J. Wong
@ 2024-08-23 5:17 ` Christoph Hellwig
2024-08-26 4:33 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:17 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 18/26] xfs: support error injection when freeing rt extents
2024-08-23 0:25 ` [PATCH 18/26] xfs: support error injection when freeing rt extents Darrick J. Wong
@ 2024-08-23 5:18 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled
2024-08-23 0:26 ` [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled Darrick J. Wong
@ 2024-08-23 5:18 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub
2024-08-23 0:26 ` [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub Darrick J. Wong
@ 2024-08-23 5:19 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:19 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 23/26] xfs: scrub the realtime group superblock
2024-08-23 0:27 ` [PATCH 23/26] xfs: scrub the realtime group superblock Darrick J. Wong
@ 2024-08-23 5:19 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:19 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 24/26] xfs: repair realtime group superblock
2024-08-23 0:27 ` [PATCH 24/26] xfs: repair " Darrick J. Wong
@ 2024-08-23 5:19 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:19 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata
2024-08-23 0:27 ` [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata Darrick J. Wong
@ 2024-08-23 5:20 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:20 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use
2024-08-23 0:27 ` [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use Darrick J. Wong
@ 2024-08-23 5:20 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:20 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos
2024-08-23 0:28 ` [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos Darrick J. Wong
@ 2024-08-23 5:51 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:51 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 2/6] xfs: use metadir for quota inodes
2024-08-23 0:28 ` [PATCH 2/6] xfs: use metadir for quota inodes Darrick J. Wong
@ 2024-08-23 5:53 ` Christoph Hellwig
2024-08-23 18:20 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 05:28:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Store the quota inodes in a metadata directory if metadir is enabled.
I think this commit log could explain a bit better what this means
and why it is done.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 3/6] xfs: scrub quota file metapaths
2024-08-23 0:28 ` [PATCH 3/6] xfs: scrub quota file metapaths Darrick J. Wong
@ 2024-08-23 5:53 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:53 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 4/6] xfs: persist quota flags with metadir
2024-08-23 0:28 ` [PATCH 4/6] xfs: persist quota flags with metadir Darrick J. Wong
@ 2024-08-23 5:54 ` Christoph Hellwig
2024-08-23 18:23 ` Darrick J. Wong
2024-08-26 9:42 ` Dave Chinner
1 sibling, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:54 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:28:59PM -0700, Darrick J. Wong wrote:
> Starting with metadir, let's change the behavior so that if the user
> does not specify any quota-related mount options at all, the ondisk
> quota flags will be used to bring up quota. In other words, the
> filesystem will mount in the same state and with the same functionality
> as it had during the last mount.
Finally!
Are you going to send some tests that test this behavior?
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-23 0:29 ` [PATCH 5/6] xfs: update sb field checks when metadir is turned on Darrick J. Wong
@ 2024-08-23 5:55 ` Christoph Hellwig
2024-08-26 9:52 ` Dave Chinner
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:55 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 6/6] xfs: enable metadata directory feature
2024-08-23 0:29 ` [PATCH 6/6] xfs: enable metadata directory feature Darrick J. Wong
@ 2024-08-23 5:58 ` Christoph Hellwig
2024-08-23 18:26 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-23 5:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:29:30PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Enable the metadata directory feature.
Maybe put in just a little bit more information. E.g.:
With this feature all metadata inodes are placed in the metadata
directory and no sb root metadata except for the metadir itself is left.
The RT device is now sharded into a number of rtgroups, where zero rtgroups
means that no RT extents are supported, and the traditional XFS stub
RT bitmap and summary inodes don't exist, while a single rtgroup gives
roughly identical behavior to the traditional RT setup, just with
checksummed and self-identifying metadata.
For quota the quota options are read from the superblock unless
explicitly overridden.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 4:12 ` Christoph Hellwig
@ 2024-08-23 13:20 ` Jeff Layton
2024-08-23 17:41 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Jeff Layton @ 2024-08-23 13:20 UTC (permalink / raw)
To: Christoph Hellwig, Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel
On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > This patch introduces two more new ioctls to manage atomic updates to
> > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > does, but with the additional requirement that file2 cannot have changed
> > since some sampling point. The start-commit ioctl performs the sampling
> > of file attributes.
>
> The code itself looks simple enough now, but how do we guarantee
> that ctime actually works as a full change count and not just by
> chance here?
>
With current mainline kernels it won't, but the updated multigrain
timestamp series is in linux-next and is slated to go into v6.12. At
that point it should be fine for this purpose.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 13:20 ` Jeff Layton
@ 2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
0 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:41 UTC (permalink / raw)
To: Jeff Layton; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > This patch introduces two more new ioctls to manage atomic updates to
> > > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> > > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > > does, but with the additional requirement that file2 cannot have changed
> > > since some sampling point. The start-commit ioctl performs the sampling
> > > of file attributes.
> >
> > The code itself looks simple enough now, but how do we guarantee
> > that ctime actually works as a full change count and not just by
> > chance here?
> >
>
> With current mainline kernels it won't, but the updated multigrain
> timestamp series is in linux-next and is slated to go into v6.12. At
> that point it should be fine for this purpose.
<nod> If these both get merged for 6.12, I think the appropriate port
for this patch is to change xfs_ioc_start_commit to do:
struct kstat kstat;
fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
kern_f->file2_ctime = kstat.ctime.tv_sec;
kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
kern_f->file2_mtime = kstat.mtime.tv_sec;
kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
instead of open-coding the inode_get_[cm]time calls. The entire
exchangerange feature is still marked experimental, so I didn't think it
was worth rebasing my entire dev branch on the multigrain timestamp
redux series; we can just fix it later.
--D
> --
> Jeff Layton <jlayton@kernel.org>
>
* Re: [PATCH 02/26] xfs: refactor loading quota inodes in the regular case
2024-08-23 4:31 ` Christoph Hellwig
@ 2024-08-23 17:51 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 09:31:58PM -0700, Christoph Hellwig wrote:
> > + xfs_ino_t ino = NULLFSINO;
> > +
> > + switch (type) {
> > + case XFS_DQTYPE_USER:
> > + ino = mp->m_sb.sb_uquotino;
> > + break;
> > + case XFS_DQTYPE_GROUP:
> > + ino = mp->m_sb.sb_gquotino;
> > + break;
> > + case XFS_DQTYPE_PROJ:
> > + ino = mp->m_sb.sb_pquotino;
> > + break;
> > + default:
> > + ASSERT(0);
> > + return -EFSCORRUPTED;
> > + }
>
> I'd probably split this type to ino lookup into a separate helper,
> but that doesn't really matter.
I tried that, but left it embedded here because I didn't want to write a
helper function that then had to return a magic value for "some
programmer f*cked up, let's just bail out" that also couldn't have been
read in from disk. In theory 0 should work because
xfs_sb_quota_from_disk should have converted that to NULLFSINO for us,
but that felt like a good way to introduce a subtle bug that will blow
up later.
I suppose 0 wouldn't be the worst magic value, since xfs_iget would just
blow up noisily for us. OTOH all this gets deleted at the other end of
the metadir series anyway so I'd preferentially fix xfs_dqinode_load.
Thanks for the review!
--D
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
>
* Re: [PATCH 03/26] xfs: iget for metadata inodes
2024-08-23 4:35 ` Christoph Hellwig
@ 2024-08-23 17:53 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:53 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 09:35:31PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:02:56PM -0700, Darrick J. Wong wrote:
> > +#include "xfs_da_format.h"
> > +#include "xfs_dir2.h"
> > +#include "xfs_metafile.h"
>
> Hmm, there really should be no need to include xfs_da_format.h before
> metafile.h - enum xfs_metafile_type is in format.h and I can't see what
> else would need it.
>
> I don't think dir2.h is needed here either.
Yeah, I'll go figure out which of these includes can go away.
--D
> Otherwise this looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
* Re: [PATCH 05/26] xfs: enforce metadata inode flag
2024-08-23 4:38 ` Christoph Hellwig
@ 2024-08-23 17:55 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 09:38:00PM -0700, Christoph Hellwig wrote:
> > + /* Mandatory directory flags must be set */
>
> s/directory/inode/ ?
>
> > +/* All metadata directory files must have these flags set. */
>
> s/directory files/directories/ ?
Fixed.
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thanks!
--D
* Re: [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled
2024-08-23 4:50 ` Christoph Hellwig
@ 2024-08-23 18:00 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 18:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 09:50:16PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:06:50PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > If metadata directory trees are enabled, it's possible that some future
> > metadata file might want to store information in extended attributes.
> > Or, if parent pointers are enabled, then children of the metadir tree
> > need parent pointers. Either way, we start allowing xattr data when
> > metadir is enabled, so we now need check and repair to examine attr
> > forks for metadata files on metadir filesystems.
>
> I think the parent pointer case is the relevant here, so maybe state
> that more clearly?
I'll change this to:
"If parent pointers are enabled, then metadata files will store parent
pointers in xattrs, just like files in the user visible directory tree.
Therefore, scrub and repair need to handle attr forks for metadata files
on metadir filesystems."
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thanks!
--D
* Re: [PATCH 01/26] xfs: define the format of rt groups
2024-08-23 5:11 ` Christoph Hellwig
@ 2024-08-23 18:12 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 18:12 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 10:11:04PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:21:25PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Define the ondisk format of realtime group metadata, and a superblock
> > for realtime volumes. rt supers are protected by a separate rocompat
> > bit so that we can leave them off if the rt device is zoned.
>
> We actually killed the flag again and just kept the separate helper
> to check for it.
>
> > Add a xfs_sb_version_hasrtgroups so that xfs_repair knows how to zero
> > the tail of superblocks.
>
> .. and merged the rtgroup and metadir flags, so while this helper
> still exists (and will get lots of use to make the code readable),
> that particular use case is gone now.
I'll just delete this sentence since xfs_sb_version_hasmetadir is in
another patch anyway.
> > -#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
> > -#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
> > -#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> > -#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
> > +#define XFS_SB_FEAT_RO_COMPAT_FINOBT (1 << 0) /* free inode btree */
> > +#define XFS_SB_FEAT_RO_COMPAT_RMAPBT (1 << 1) /* reverse map btree */
> > +#define XFS_SB_FEAT_RO_COMPAT_REFLINK (1 << 2) /* reflinked files */
> > +#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3) /* inobt block counts */
>
> That also means the above is just a spurious unrelated cleanup now.
> Still useful, but maybe it should go into a separate patch? Or just
> don't bother. Btw, one day we should clearly mark all our on-disk
> bitmaps as unsigned.
Eh, I'll drop it and make the next new rocompat feature do it.
> > + if (xfs_has_rtgroups(nmp))
> > + nmp->m_sb.sb_rgcount =
> > + howmany_64(nmp->m_sb.sb_rextents, nmp->m_sb.sb_rgextents);
>
> nmp->m_sb.sb_rgcount = howmany_64(nmp->m_sb.sb_rextents,
> nmp->m_sb.sb_rgextents);
Done.
> to avoid the overly long line.
>
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thanks!
--D
* Re: [PATCH 2/6] xfs: use metadir for quota inodes
2024-08-23 5:53 ` Christoph Hellwig
@ 2024-08-23 18:20 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 18:20 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs
On Thu, Aug 22, 2024 at 10:53:00PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:28:28PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Store the quota inodes in a metadata directory if metadir is enabled.
>
> I think this commit log could explain a bit better what this means
> and why it is done.
How about:
"Store the quota inodes in the /quota metadata directory if metadir is
enabled. This enables us to stop using the sb_[ugp]uotino fields in the
superblock. From this point on, all metadata files will be children of
the metadata directory tree root."
> Otherwis looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thanks!
--D
* Re: [PATCH 4/6] xfs: persist quota flags with metadir
2024-08-23 5:54 ` Christoph Hellwig
@ 2024-08-23 18:23 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 18:23 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 10:54:51PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:28:59PM -0700, Darrick J. Wong wrote:
> > Starting with metadir, let's change the behavior so that if the user
> > does not specify any quota-related mount options at all, the ondisk
> > quota flags will be used to bring up quota. In other words, the
> > filesystem will mount in the same state and with the same functionality
> > as it had during the last mount.
>
> Finally!
>
> Are you going to send some tests that test this behavior?
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Yes.
--D
xfs: test persistent quota flags
Test the persistent quota flags that come with the metadir feature.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
tests/xfs/1891 | 128 +++++++++++++++++++++++++++++++++++++++++++++
tests/xfs/1891.out | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 282 insertions(+), 1 deletion(-)
create mode 100755 tests/xfs/1891
create mode 100644 tests/xfs/1891.out
diff --git a/tests/xfs/1891 b/tests/xfs/1891
new file mode 100755
index 0000000000..53009571a9
--- /dev/null
+++ b/tests/xfs/1891
@@ -0,0 +1,128 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2024 Oracle. All Rights Reserved.
+#
+# FS QA Test 1891
+#
+# Functionality test for persistent quota accounting and enforcement flags in
+# XFS when metadata directories are enabled.
+#
+. ./common/preamble
+_begin_fstest auto quick quota
+
+. ./common/filter
+. ./common/quota
+
+$MKFS_XFS_PROG 2>&1 | grep -q 'uquota' || \
+ _notrun "mkfs does not support uquota option"
+
+_require_scratch
+_require_xfs_quota
+
+filter_quota_state() {
+ sed -e 's/Inode: #[0-9]\+/Inode #XXX/g' \
+ -e '/max warnings:/d' \
+ -e '/Blocks grace time:/d' \
+ -e '/Inodes grace time:/d' \
+ | _filter_scratch
+}
+
+qerase_mkfs_options() {
+ echo "$MKFS_OPTIONS" | sed \
+ -e 's/uquota//g' \
+ -e 's/gquota//g' \
+ -e 's/pquota//g' \
+ -e 's/uqnoenforce//g' \
+ -e 's/gqnoenforce//g' \
+ -e 's/pqnoenforce//g' \
+ -e 's/,,*/,/g'
+}
+
+confirm() {
+ echo "$MOUNT_OPTIONS" | grep -E -q '(qnoenforce|quota)' && \
+ echo "saw quota mount options"
+ _scratch_mount
+ $XFS_QUOTA_PROG -x -c "state -ugp" $SCRATCH_MNT | filter_quota_state
+ _check_xfs_scratch_fs
+ _scratch_unmount
+}
+
+ORIG_MOUNT_OPTIONS="$MOUNT_OPTIONS"
+MKFS_OPTIONS="$(qerase_mkfs_options)"
+
+echo "Test 0: formatting a subset"
+_scratch_mkfs -m uquota,gqnoenforce &>> $seqres.full
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+confirm
+
+echo "Test 1: formatting"
+_scratch_mkfs -m uquota,gquota,pquota &>> $seqres.full
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+confirm
+
+echo "Test 2: only grpquota"
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option grpquota
+confirm
+
+echo "Test 3: repair"
+_scratch_xfs_repair &>> $seqres.full || echo "repair failed?"
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+confirm
+
+echo "Test 4: weird options"
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option pqnoenforce,uquota
+confirm
+
+echo "Test 5: simple recovery"
+_scratch_mkfs -m uquota,gquota,pquota &>> $seqres.full
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+echo "$MOUNT_OPTIONS" | grep -E -q '(qnoenforce|quota)' && \
+ echo "saw quota mount options"
+_scratch_mount
+$XFS_QUOTA_PROG -x -c "state -ugp" $SCRATCH_MNT | filter_quota_state
+touch $SCRATCH_MNT/a
+_scratch_shutdown -v -f >> $seqres.full
+echo shutdown
+_scratch_unmount
+confirm
+
+echo "Test 6: simple recovery with mount options"
+_scratch_mkfs -m uquota,gquota,pquota &>> $seqres.full
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+echo "$MOUNT_OPTIONS" | grep -E -q '(qnoenforce|quota)' && \
+ echo "saw quota mount options"
+_scratch_mount
+$XFS_QUOTA_PROG -x -c "state -ugp" $SCRATCH_MNT | filter_quota_state
+touch $SCRATCH_MNT/a
+_scratch_shutdown -v -f >> $seqres.full
+echo shutdown
+_scratch_unmount
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option gqnoenforce
+confirm
+
+echo "Test 7: user quotaoff recovery"
+_scratch_mkfs -m uquota,gquota,pquota &>> $seqres.full
+MOUNT_OPTIONS="$ORIG_MOUNT_OPTIONS"
+_qmount_option # blank out quota options
+echo "$MOUNT_OPTIONS" | grep -E -q '(qnoenforce|quota)' && \
+ echo "saw quota mount options"
+_scratch_mount
+$XFS_QUOTA_PROG -x -c "state -ugp" $SCRATCH_MNT | filter_quota_state
+touch $SCRATCH_MNT/a
+$XFS_QUOTA_PROG -x -c 'off -u' $SCRATCH_MNT
+_scratch_shutdown -v -f >> $seqres.full
+echo shutdown
+_scratch_unmount
+confirm
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/1891.out b/tests/xfs/1891.out
new file mode 100644
index 0000000000..7e88940880
--- /dev/null
+++ b/tests/xfs/1891.out
@@ -0,0 +1,147 @@
+QA output created by 1891
+Test 0: formatting a subset
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode: N/A
+Test 1: formatting
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Test 2: only grpquota
+saw quota mount options
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Test 3: repair
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Test 4: weird options
+saw quota mount options
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Test 5: simple recovery
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+shutdown
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Test 6: simple recovery with mount options
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+shutdown
+saw quota mount options
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: OFF
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Test 7: user quotaoff recovery
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+shutdown
+User quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: OFF
+ Inode #XXX (1 blocks, 1 extents)
+Group quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
+Project quota state on SCRATCH_MNT (SCRATCH_DEV)
+ Accounting: ON
+ Enforcement: ON
+ Inode #XXX (1 blocks, 1 extents)
* Re: [PATCH 6/6] xfs: enable metadata directory feature
2024-08-23 5:58 ` Christoph Hellwig
@ 2024-08-23 18:26 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-23 18:26 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 10:58:23PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:29:30PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Enable the metadata directory feature.
>
> Maybe put in just a little bit more information. E.g.:
>
> With this feature all metadata inodes are placed in the metadata
> directory and no sb root metadata except for the metadir itself is left.
>
> The RT device is now sharded into a number of rtgroups, where zero rtgroups
> means that no RT extents are supported, and the traditional XFS stub
> RT bitmap and summary inodes don't exist, while a single rtgroup gives
> roughly identical behavior to the traditional RT setup, just with
> checksummed and self-identifying metadata.
>
> For quota the quota options are read from the superblock unless
> explicitly overridden.
I've massaged that into:
"With this feature all metadata inodes are placed in the metadata
directory and no sb root metadata except for the metadir itself is left.
"The RT device is now sharded into a number of rtgroups, where zero
rtgroups means that no RT extents are supported, and the traditional XFS
stub RT bitmap and summary inodes don't exist, while a single rtgroup
gives roughly identical behavior to the traditional RT setup, just with
checksummed and self-identifying metadata.
"For quota the quota options are read from the superblock unless
explicitly overridden."
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Thanks for all your help getting this ready!
--D
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 17:41 ` Darrick J. Wong
@ 2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Jeff Layton @ 2024-08-23 19:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, 2024-08-23 at 10:41 -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> > On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > This patch introduces two more new ioctls to manage atomic updates
> > > > to file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.
> > > > The commit mechanism here is exactly the same as what
> > > > XFS_IOC_EXCHANGE_RANGE does, but with the additional requirement
> > > > that file2 cannot have changed since some sampling point. The
> > > > start-commit ioctl performs the sampling of file attributes.
> > >
> > > The code itself looks simply enough now, but how do we guarantee
> > > that ctime actually works as a full change count and not just by
> > > chance here?
> > >
> >
> > With current mainline kernels it won't, but the updated multigrain
> > timestamp series is in linux-next and is slated to go into v6.12. At
> > that point it should be fine for this purpose.
>
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
>
> 	struct kstat kstat;
>
> 	fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> 	kern_f->file2_ctime = kstat.ctime.tv_sec;
> 	kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> 	kern_f->file2_mtime = kstat.mtime.tv_sec;
> 	kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
>
Yep, that's exactly what you'd want to do.
> instead of open-coding the inode_get_[cm]time calls. The entire
> exchangerange feature is still marked experimental, so I didn't think it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.
>
Sounds good.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
@ 2024-08-24 3:29 ` Christoph Hellwig
2024-08-24 4:46 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-24 3:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jeff Layton, Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
>
> struct kstat kstat;
>
> fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> kern_f->file2_ctime = kstat.ctime.tv_sec;
> kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> kern_f->file2_mtime = kstat.mtime.tv_sec;
> kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
>
> instead of open-coding the inode_get_[cm]time calls. The entire
> exchangerange feature is still marked experimental, so I didn't think it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.
But the commit log could really note this dependency. This will be
especially useful for backports, but also for anyone reading through
code history.
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-24 3:29 ` Christoph Hellwig
@ 2024-08-24 4:46 ` Darrick J. Wong
2024-08-24 4:48 ` Christoph Hellwig
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-24 4:46 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jeff Layton, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 08:29:18PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> > <nod> If these both get merged for 6.12, I think the appropriate port
> > for this patch is to change xfs_ioc_start_commit to do:
> >
> > struct kstat kstat;
> >
> > fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> > kern_f->file2_ctime = kstat.ctime.tv_sec;
> > kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> > kern_f->file2_mtime = kstat.mtime.tv_sec;
> > kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
> >
> > instead of open-coding the inode_get_[cm]time calls. The entire
> > exchangerange feature is still marked experimental, so I didn't think it
> > was worth rebasing my entire dev branch on the multigrain timestamp
> > redux series; we can just fix it later.
>
> But the commit log could really note this dependency. This will be
> especially useful for backports, but also for anyone reading through
> code history.
Ok, how about this for a commit message:
"This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
"Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse granularity
there, and very fast updates could collide with a COMMIT_RANGE. With
the multigrain ctime introduced in that release by Jeff Layton, it's
now possible to update ctime such that this does not happen.
"It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps."
Will that pass muster?
--D
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-24 4:46 ` Darrick J. Wong
@ 2024-08-24 4:48 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-24 4:48 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Jeff Layton, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 09:46:43PM -0700, Darrick J. Wong wrote:
> "It is critical, then, that this patch must not be backported to any
> kernel that does not support fine-grained file change timestamps."
>
> Will that pass muster?
I'd drop that last sentence as the previous part should be clear
enough.
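The two-ioctl flow being discussed in this thread can be sketched from userspace as follows. This is an illustrative sketch, not code from the patchset: the structure layout and ioctl numbers are mirrored from the uapi hunk in the patch that follows, while the wrapper function and its staging callback are invented for the example.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Userspace mirror of the uapi structure; a real build would get this
 * (and the ioctl numbers) from the XFS headers instead. */
struct xfs_commit_range {
	int32_t		file1_fd;
	uint32_t	pad;			/* must be zeroes */
	uint64_t	file1_offset;		/* file1 offset, bytes */
	uint64_t	file2_offset;		/* file2 offset, bytes */
	uint64_t	length;			/* bytes to exchange */
	uint64_t	flags;			/* XFS_EXCHANGE_RANGE_* flags */
	uint64_t	file2_freshness[6];	/* opaque, filled by the kernel */
};

#define XFS_IOC_START_COMMIT	_IOR('X', 130, struct xfs_commit_range)
#define XFS_IOC_COMMIT_RANGE	_IOW('X', 131, struct xfs_commit_range)

/*
 * Replace @len bytes of @fd2 with contents staged into @fd1 by
 * @stage_update; fails with -EBUSY if fd2 changed after the sample.
 * Hypothetical helper, not part of the patchset.
 */
static int commit_file_update(int fd1, int fd2, uint64_t len,
			      int (*stage_update)(int fd))
{
	struct xfs_commit_range req = { .file1_fd = fd1, .length = len };
	int ret;

	/* Sample fd2's identity and timestamps into the opaque blob. */
	if (ioctl(fd2, XFS_IOC_START_COMMIT, &req))
		return -errno;

	ret = stage_update(fd1);	/* write the new contents to fd1 */
	if (ret)
		return ret;

	/* Exchange the ranges only if fd2 is still "fresh". */
	if (ioctl(fd2, XFS_IOC_COMMIT_RANGE, &req))
		return -errno;
	return 0;
}
```

Because the freshness blob travels opaquely from START_COMMIT to COMMIT_RANGE inside the same structure, the caller never needs to interpret the sampled timestamps itself.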
* [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
@ 2024-08-24 6:29 ` Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-24 6:29 UTC (permalink / raw)
To: hch; +Cc: linux-xfs, linux-fsdevel, jlayton
From: Darrick J. Wong <djwong@kernel.org>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse granularity
there, and very fast updates could collide with a COMMIT_RANGE. With
the multigrain ctime introduced by Jeff Layton, it's now possible to
update ctime such that this does not happen.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call. The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+ __s32 file1_fd;
+ __u32 pad; /* must be zeroes */
+ __u64 file1_offset; /* file1 offset, bytes */
+ __u64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
+
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[6];
+};
+
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
return error;
}
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+ const struct xfs_exchrange *fxr,
+ struct xfs_inode *ip2)
+{
+ struct inode *inode2 = VFS_I(ip2);
+ struct timespec64 ctime = inode_get_ctime(inode2);
+ struct timespec64 mtime = inode_get_mtime(inode2);
+
+ trace_xfs_exchrange_freshness(fxr, ip2);
+
+ /* Check that file2 hasn't otherwise been modified. */
+ if (fxr->file2_ino != ip2->i_ino ||
+ fxr->file2_gen != inode2->i_generation ||
+ !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+ !timespec64_equal(&fxr->file2_mtime, &mtime))
+ return -EBUSY;
+
+ return 0;
+}
+
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
if (error || fxr->length == 0)
return error;
+ if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+ error = xfs_exchrange_check_freshness(fxr, ip2);
+ if (error)
+ return error;
+ }
+
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
- if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
return -EINVAL;
/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
fdput(file1);
return error;
}
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ xfs_fsid_t fsid; /* m_fixedfsid */
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u32 file2_gen; /* inode generation */
+ __u32 magic; /* XCR_FRESH_MAGIC */
+};
+#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_commit_range args = { };
+ struct timespec64 ts;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+ struct inode *inode2 = file_inode(file);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ const unsigned int lockflags = XFS_IOLOCK_SHARED |
+ XFS_MMAPLOCK_SHARED |
+ XFS_ILOCK_SHARED;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+ xfs_ilock(ip2, lockflags);
+ ts = inode_get_ctime(inode2);
+ kern_f->file2_ctime = ts.tv_sec;
+ kern_f->file2_ctime_nsec = ts.tv_nsec;
+ ts = inode_get_mtime(inode2);
+ kern_f->file2_mtime = ts.tv_sec;
+ kern_f->file2_mtime_nsec = ts.tv_nsec;
+ kern_f->file2_ino = ip2->i_ino;
+ kern_f->file2_gen = inode2->i_generation;
+ kern_f->magic = XCR_FRESH_MAGIC;
+ xfs_iunlock(ip2, lockflags);
+
+ user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_inode *ip2 = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip2->i_mount;
+ struct fd file1;
+ int error;
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ return -EINVAL;
+ if (kern_f->magic != XCR_FRESH_MAGIC)
+ return -EBUSY;
+ if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+ return -EBUSY;
+
+ fxr.file1_offset = args.file1_offset;
+ fxr.file2_offset = args.file2_offset;
+ fxr.length = args.length;
+ fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_gen = kern_f->file2_gen;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+ fxr.file1 = file1.file;
+
+ error = xfs_exchange_range(&fxr);
+ fdput(file1);
+ return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
+
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
- __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+ __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
struct xfs_exchrange {
struct file *file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
+
+ /* file2 metadata for freshness checks */
+ u64 file2_ino;
+ struct timespec64 file2_mtime;
+ struct timespec64 file2_ctime;
+ u32 file2_gen;
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+ struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+ struct xfs_commit_range __user *argp);
struct xfs_exchmaps_req;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
+ case XFS_IOC_COMMIT_RANGE:
+ return xfs_ioc_commit_range(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
- { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
+ { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
+ { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+TRACE_EVENT(xfs_exchrange_freshness,
+ TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+ TP_ARGS(fxr, ip2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ struct timespec64 ts64;
+ struct inode *inode2 = VFS_I(ip2);
+
+ __entry->dev = inode2->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+
+ ts64 = inode_get_ctime(inode2);
+ __entry->ip2_ctime = ts64.tv_sec;
+ __entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+ ts64 = inode_get_mtime(inode2);
+ __entry->ip2_mtime = ts64.tv_sec;
+ __entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime.tv_sec;
+ __entry->file2_ctime = fxr->file2_ctime.tv_sec;
+ __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
@ 2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Jeff Layton @ 2024-08-24 12:11 UTC (permalink / raw)
To: Darrick J. Wong, hch; +Cc: linux-xfs, linux-fsdevel
On Fri, 2024-08-23 at 23:29 -0700, Darrick J. Wong wrote:
> [... full v31.0.1 patch quoted unchanged from the posting above; snipped ...]
Acked-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
@ 2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-25 4:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, jlayton
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-23 0:17 ` [PATCH 11/24] xfs: create incore realtime group structures Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
@ 2024-08-25 23:56 ` Dave Chinner
2024-08-26 19:14 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-25 23:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:17:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create an incore object that will contain information about a realtime
> allocation group. This will eventually enable us to shard the realtime
> section in a similar manner to how we shard the data section, but for
> now just a single object for the entire RT subvolume is created.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/Makefile | 1
> fs/xfs/libxfs/xfs_format.h | 3 +
> fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_sb.c | 7 +
> fs/xfs/libxfs/xfs_types.h | 4 +
> fs/xfs/xfs_log_recover.c | 20 ++++
> fs/xfs/xfs_mount.c | 16 +++
> fs/xfs/xfs_mount.h | 14 +++
> fs/xfs/xfs_rtalloc.c | 6 +
> fs/xfs/xfs_super.c | 1
> fs/xfs/xfs_trace.c | 1
> fs/xfs/xfs_trace.h | 38 ++++++++
> 13 files changed, 517 insertions(+), 2 deletions(-)
> create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
> create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
Ok, how is the global address space for real time extents laid out
across rt groups? i.e. is it sparse similar to how fsbnos and inode
numbers are created for the data device like so?
fsbno = (agno << agblklog) | agbno
Or is it something different? I can't find that defined anywhere in
this patch, so I can't determine if the unit conversion code and
validation is correct or not...
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 4d8ca08cdd0ec..388b5cef48ca5 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -60,6 +60,7 @@ xfs-y += $(addprefix libxfs/, \
> # xfs_rtbitmap is shared with libxfs
> xfs-$(CONFIG_XFS_RT) += $(addprefix libxfs/, \
> xfs_rtbitmap.o \
> + xfs_rtgroup.o \
> )
>
> # highlevel code
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 16a7bc02aa5f5..fa5cfc8265d92 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -176,6 +176,9 @@ typedef struct xfs_sb {
>
> xfs_ino_t sb_metadirino; /* metadata directory tree root */
>
> + xfs_rgnumber_t sb_rgcount; /* number of realtime groups */
> + xfs_rtxlen_t sb_rgextents; /* size of a realtime group in rtx */
So min/max rtgroup size is defined by the sb_rextsize field? What
redundant metadata do we end up with that allows us to validate
the sb_rextsize field is still valid w.r.t. rtgroups geometry?
Also, rtgroup lengths are defined by "rtx counts", but the
definitions in the xfs_mount later on are "m_rtblklog" and
"m_rgblocks" and we use xfs_rgblock_t and rgbno all over the place.
Just from the context of this patch, it is somewhat confusing trying
to work out what the difference is...
> /* must be padded to 64 bit alignment */
> } xfs_sb_t;
>
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> new file mode 100644
> index 0000000000000..2bad1ecb811eb
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> @@ -0,0 +1,196 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_bit.h"
> +#include "xfs_sb.h"
> +#include "xfs_mount.h"
> +#include "xfs_btree.h"
> +#include "xfs_alloc_btree.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_alloc.h"
> +#include "xfs_ialloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_ag.h"
> +#include "xfs_ag_resv.h"
> +#include "xfs_health.h"
> +#include "xfs_error.h"
> +#include "xfs_bmap.h"
> +#include "xfs_defer.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +#include "xfs_trace.h"
> +#include "xfs_inode.h"
> +#include "xfs_icache.h"
> +#include "xfs_rtgroup.h"
> +#include "xfs_rtbitmap.h"
> +
> +/*
> + * Passive reference counting access wrappers to the rtgroup structures. If
> + * the rtgroup structure is to be freed, the freeing code is responsible for
> + * cleaning up objects with passive references before freeing the structure.
> + */
> +struct xfs_rtgroup *
> +xfs_rtgroup_get(
> + struct xfs_mount *mp,
> + xfs_rgnumber_t rgno)
> +{
> + struct xfs_rtgroup *rtg;
> +
> + rcu_read_lock();
> + rtg = xa_load(&mp->m_rtgroups, rgno);
> + if (rtg) {
> + trace_xfs_rtgroup_get(rtg, _RET_IP_);
> + ASSERT(atomic_read(&rtg->rtg_ref) >= 0);
> + atomic_inc(&rtg->rtg_ref);
> + }
> + rcu_read_unlock();
> + return rtg;
> +}
> +
> +/* Get a passive reference to the given rtgroup. */
> +struct xfs_rtgroup *
> +xfs_rtgroup_hold(
> + struct xfs_rtgroup *rtg)
> +{
> + ASSERT(atomic_read(&rtg->rtg_ref) > 0 ||
> + atomic_read(&rtg->rtg_active_ref) > 0);
> +
> + trace_xfs_rtgroup_hold(rtg, _RET_IP_);
> + atomic_inc(&rtg->rtg_ref);
> + return rtg;
> +}
> +
> +void
> +xfs_rtgroup_put(
> + struct xfs_rtgroup *rtg)
> +{
> + trace_xfs_rtgroup_put(rtg, _RET_IP_);
> + ASSERT(atomic_read(&rtg->rtg_ref) > 0);
> + atomic_dec(&rtg->rtg_ref);
> +}
> +
> +/*
> + * Active references for rtgroup structures. This is for short term access to
> + * the rtgroup structures for walking trees or accessing state. If an rtgroup
> + * is being shrunk or is offline, then this will fail to find that group and
> + * return NULL instead.
> + */
> +struct xfs_rtgroup *
> +xfs_rtgroup_grab(
> + struct xfs_mount *mp,
> + xfs_agnumber_t agno)
> +{
> + struct xfs_rtgroup *rtg;
> +
> + rcu_read_lock();
> + rtg = xa_load(&mp->m_rtgroups, agno);
> + if (rtg) {
> + trace_xfs_rtgroup_grab(rtg, _RET_IP_);
> + if (!atomic_inc_not_zero(&rtg->rtg_active_ref))
> + rtg = NULL;
> + }
> + rcu_read_unlock();
> + return rtg;
> +}
> +
> +void
> +xfs_rtgroup_rele(
> + struct xfs_rtgroup *rtg)
> +{
> + trace_xfs_rtgroup_rele(rtg, _RET_IP_);
> + if (atomic_dec_and_test(&rtg->rtg_active_ref))
> + wake_up(&rtg->rtg_active_wq);
> +}
This is all a duplicate of the xfs_perag code. Can you put together a
patchset to abstract this into a "xfs_group" and embed it in both
the perag and rtgroup structures?
That way we only need one set of lookup and iterator infrastructure,
and it will work for both data and rt groups...
> +
> +/* Compute the number of rt extents in this realtime group. */
> +xfs_rtxnum_t
> +xfs_rtgroup_extents(
> + struct xfs_mount *mp,
> + xfs_rgnumber_t rgno)
> +{
> + xfs_rgnumber_t rgcount = mp->m_sb.sb_rgcount;
> +
> + ASSERT(rgno < rgcount);
> + if (rgno == rgcount - 1)
> + return mp->m_sb.sb_rextents -
> + ((xfs_rtxnum_t)rgno * mp->m_sb.sb_rgextents);
Urk. So this relies on a non-rtgroup filesystem doing a
multiplication by zero of a field that the on-disk format does not
understand to get the right result. I think this is copying a bad
pattern we've been slowly trying to remove from the normal
allocation group code.
> +
> + ASSERT(xfs_has_rtgroups(mp));
> + return mp->m_sb.sb_rgextents;
> +}
We already embed the length of the rtgroup in the rtgroup structure.
THis should be looking up the rtgroup (or being passed the rtgroup
the caller already has) and doing the right thing. i.e.
if (!rtg || !xfs_has_rtgroups(rtg->rtg_mount))
return mp->m_sb.sb_rextents;
return rtg->rtg_extents;
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> new file mode 100644
> index 0000000000000..2c09ecfc50328
> --- /dev/null
> +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> @@ -0,0 +1,212 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#ifndef __LIBXFS_RTGROUP_H
> +#define __LIBXFS_RTGROUP_H 1
> +
> +struct xfs_mount;
> +struct xfs_trans;
> +
> +/*
> + * Realtime group incore structure, similar to the per-AG structure.
> + */
> +struct xfs_rtgroup {
> + struct xfs_mount *rtg_mount;
> + xfs_rgnumber_t rtg_rgno;
> + atomic_t rtg_ref; /* passive reference count */
> + atomic_t rtg_active_ref; /* active reference count */
> + wait_queue_head_t rtg_active_wq;/* woken active_ref falls to zero */
Yeah, that's all common with xfs_perag....
....
> +/*
> + * rt group iteration APIs
> + */
> +static inline struct xfs_rtgroup *
> +xfs_rtgroup_next(
> + struct xfs_rtgroup *rtg,
> + xfs_rgnumber_t *rgno,
> + xfs_rgnumber_t end_rgno)
> +{
> + struct xfs_mount *mp = rtg->rtg_mount;
> +
> + *rgno = rtg->rtg_rgno + 1;
> + xfs_rtgroup_rele(rtg);
> + if (*rgno > end_rgno)
> + return NULL;
> + return xfs_rtgroup_grab(mp, *rgno);
> +}
> +
> +#define for_each_rtgroup_range(mp, rgno, end_rgno, rtg) \
> + for ((rtg) = xfs_rtgroup_grab((mp), (rgno)); \
> + (rtg) != NULL; \
> + (rtg) = xfs_rtgroup_next((rtg), &(rgno), (end_rgno)))
> +
> +#define for_each_rtgroup_from(mp, rgno, rtg) \
> + for_each_rtgroup_range((mp), (rgno), (mp)->m_sb.sb_rgcount - 1, (rtg))
> +
> +
> +#define for_each_rtgroup(mp, rgno, rtg) \
> + (rgno) = 0; \
> + for_each_rtgroup_from((mp), (rgno), (rtg))
Yup, that's all common with xfs_perag iteration, too. Can you put
together a patchset to unify these, please?
> +static inline bool
> +xfs_verify_rgbno(
> + struct xfs_rtgroup *rtg,
> + xfs_rgblock_t rgbno)
Ok, what's the difference between an xfs_rgblock_t and a "rtx"?
OH.... The penny just dropped - it's another "single letter
difference that's really, really hard to spot" problem. You've
defined "xfs_r*g*block_t" like an a*g*bno, but we have
xfs_r*t*block_t for the global 64bit block number instead of a
xfs_fsbno_t.
We just had a bug caused by exactly this sort of confusion with a
patch that mixed xfs_[f]inobt changes together and one of the
conversions was incorrect. Nobody spotted the single incorrect
letter in the bigger patch, and I can see -exactly- the same sort of
confusion happening with rtblock vs rgblock causing implicit 32/64
bit integer promotion bugs...
> +{
> + struct xfs_mount *mp = rtg->rtg_mount;
> +
> + if (rgbno >= rtg->rtg_extents * mp->m_sb.sb_rextsize)
> + return false;
Why isn't the max valid "rgbno" stored in the rtgroup instead of
having to multiply the extent count by extent size every time we
have to verify a rgbno? (i.e. same as pag->block_count).
We know from the agbno verification this will be a -very- hot path,
and so precalculating all the constants and storing them in the rtg
should be done right from the start here.
> + if (xfs_has_rtsb(mp) && rtg->rtg_rgno == 0 &&
> + rgbno < mp->m_sb.sb_rextsize)
> + return false;
Same here - this value is stored in pag->min_block...
> + return true;
> +}
And then, if we put the max_bno and min_bno in the generic
"xfs_group" structure, we suddenly have a generic "group bno"
verification mechanism that is independent of whether the group
is a data AG or an rt group:
static inline bool
xfs_verify_gbno(
struct xfs_group *g,
xfs_gblock_t gbno)
{
struct xfs_mount *mp = g->g_mount;
if (gbno >= g->block_count)
return false;
if (gbno < g->min_block)
return false;
return true;
}
And the rest of these functions fall out the same way....
> +static inline xfs_rtblock_t
> +xfs_rgno_start_rtb(
> + struct xfs_mount *mp,
> + xfs_rgnumber_t rgno)
> +{
> + if (mp->m_rgblklog >= 0)
> + return ((xfs_rtblock_t)rgno << mp->m_rgblklog);
> + return ((xfs_rtblock_t)rgno * mp->m_rgblocks);
> +}
Where does mp->m_rgblklog come from? That wasn't added to the
on-disk superblock structure and it is always initialised to zero
in this patch.
When will m_rgblklog be zero and when will it be non-zero? If it's
only going to be zero for existing non-rtg realtime systems,
then this code makes little sense (again, relying on multiplication
by zero to get the right result). If it's not always used for
rtg enabled filesystems, then the reason for that has not been
explained and I can't work out why this would ever need to be done.
> +static inline xfs_rtblock_t
> +xfs_rgbno_to_rtb(
> + struct xfs_mount *mp,
> + xfs_rgnumber_t rgno,
> + xfs_rgblock_t rgbno)
> +{
> + return xfs_rgno_start_rtb(mp, rgno) + rgbno;
> +}
> +
> +static inline xfs_rgnumber_t
> +xfs_rtb_to_rgno(
> + struct xfs_mount *mp,
> + xfs_rtblock_t rtbno)
> +{
> + if (!xfs_has_rtgroups(mp))
> + return 0;
> +
> + if (mp->m_rgblklog >= 0)
> + return rtbno >> mp->m_rgblklog;
> +
> + return div_u64(rtbno, mp->m_rgblocks);
> +}
Ah, now I'm really confused, because m_rgblklog is completely
bypassed for legacy rt filesystems.
And I just realised, this "if (mp->m_rgblklog >= 0)" implies that
m_rgblklog can have negative values and there's no comments anywhere
about why that can happen and what would trigger it.
We validate sb_agblklog during the superblock verifier, and so once
the filesystem is mounted we never, ever need to check whether
sb_agblklog is in range. Why is the rtblklog being handled so
differently here?
> +
> +static inline uint64_t
> +__xfs_rtb_to_rgbno(
> + struct xfs_mount *mp,
> + xfs_rtblock_t rtbno)
> +{
> + uint32_t rem;
> +
> + if (!xfs_has_rtgroups(mp))
> + return rtbno;
> +
> + if (mp->m_rgblklog >= 0)
> + return rtbno & mp->m_rgblkmask;
> +
> + div_u64_rem(rtbno, mp->m_rgblocks, &rem);
> + return rem;
> +}
Why is this function returning a uint64_t - a xfs_rgblock_t is only
a 32 bit type...
> +
> +static inline xfs_rgblock_t
> +xfs_rtb_to_rgbno(
> + struct xfs_mount *mp,
> + xfs_rtblock_t rtbno)
> +{
> + return __xfs_rtb_to_rgbno(mp, rtbno);
> +}
> +
> +static inline xfs_daddr_t
> +xfs_rtb_to_daddr(
> + struct xfs_mount *mp,
> + xfs_rtblock_t rtbno)
> +{
> + return rtbno << mp->m_blkbb_log;
> +}
> +
> +static inline xfs_rtblock_t
> +xfs_daddr_to_rtb(
> + struct xfs_mount *mp,
> + xfs_daddr_t daddr)
> +{
> + return daddr >> mp->m_blkbb_log;
> +}
Ah. This code doesn't sparsify the xfs_rtblock_t address space for
rtgroups. xfs_rtblock_t is still direct physical encoding of the
location on disk.
I really think that needs to be changed to match how xfs_fsbno_t is
a sparse encoding before these changes get merged. It shouldn't
affect any of the other code in the patch set - the existing rt code
has a rtgno of 0, so it will always be a direct physical encoding
even when using a sparse xfs_rtblock_t address space.
All that moving to a sparse encoding means is that the addresses
stored in the BMBT are logical addresses rather than physical
addresses. It should not affect any of the other code, just what
ends up stored on disk for global 64-bit rt extent addresses...
In doing this, I think we can greatly simplify all this group
management stuff as most of the verification, type conversion and
iteration infrastructure can then be shared between the existing perag
and the new rtg infrastructure....
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index a8cd44d03ef64..1ce4b9eb16f47 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -9,10 +9,12 @@
> typedef uint32_t prid_t; /* project ID */
>
> typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
> +typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
Is that right? The rtg length is 2^32 * rtextsize, and rtextsize can
be 2^30 bytes:
#define XFS_MAX_RTEXTSIZE (1024 * 1024 * 1024)
Hence for a 4kB fsbno filesystem, the actual maximum size of an rtg
in filesystem blocks far exceeds what we can address with a 32 bit
variable.
If xfs_rgblock_t is actually indexing multi-fsbno rtextents, then it
is an extent number index, not a "block" index. An extent number
index won't overflow 32 bits (because the rtg has a max of 2^32 - 1
rtextents).
IOWs, shouldn't this be named something like:
typedef uint32_t xfs_rgext_t; /* extent number in realtime group */
> typedef uint32_t xfs_agino_t; /* inode # within allocation grp */
> typedef uint32_t xfs_extlen_t; /* extent length in blocks */
> typedef uint32_t xfs_rtxlen_t; /* file extent length in rtextents */
> typedef uint32_t xfs_agnumber_t; /* allocation group number */
> +typedef uint32_t xfs_rgnumber_t; /* realtime group number */
> typedef uint64_t xfs_extnum_t; /* # of extents in a file */
> typedef uint32_t xfs_aextnum_t; /* # extents in an attribute fork */
> typedef int64_t xfs_fsize_t; /* bytes in a file */
> @@ -53,7 +55,9 @@ typedef void * xfs_failaddr_t;
> #define NULLFILEOFF ((xfs_fileoff_t)-1)
>
> #define NULLAGBLOCK ((xfs_agblock_t)-1)
> +#define NULLRGBLOCK ((xfs_rgblock_t)-1)
> #define NULLAGNUMBER ((xfs_agnumber_t)-1)
> +#define NULLRGNUMBER ((xfs_rgnumber_t)-1)
What's the maximum valid rtg number? We're not ever going to be
supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
this at and validate it at?
> #define NULLCOMMITLSN ((xfs_lsn_t)-1)
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 4423dd344239b..c627cde3bb1e0 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -28,6 +28,7 @@
> #include "xfs_ag.h"
> #include "xfs_quota.h"
> #include "xfs_reflink.h"
> +#include "xfs_rtgroup.h"
>
> #define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1)
>
> @@ -3346,6 +3347,7 @@ xlog_do_recover(
> struct xfs_mount *mp = log->l_mp;
> struct xfs_buf *bp = mp->m_sb_bp;
> struct xfs_sb *sbp = &mp->m_sb;
> + xfs_rgnumber_t old_rgcount = sbp->sb_rgcount;
> int error;
>
> trace_xfs_log_recover(log, head_blk, tail_blk);
> @@ -3399,6 +3401,24 @@ xlog_do_recover(
> xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> return error;
> }
> +
> + if (sbp->sb_rgcount < old_rgcount) {
> + xfs_warn(mp, "rgcount shrink not supported");
> + return -EINVAL;
> + }
> + if (sbp->sb_rgcount > old_rgcount) {
> + xfs_rgnumber_t rgno;
> +
> + for (rgno = old_rgcount; rgno < sbp->sb_rgcount; rgno++) {
> + error = xfs_rtgroup_alloc(mp, rgno);
> + if (error) {
> + xfs_warn(mp,
> + "Failed post-recovery rtgroup init: %d",
> + error);
> + return error;
> + }
> + }
> + }
Please factor this out into a separate function with all the other
rtgroup init/teardown code. That means we don't have to care about
how rtgrowfs functions in recovery code, similar to the
xfs_initialize_perag() already in this function for handling
recovery of data device growing...
> mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>
> /* Normal transactions can now occur */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index b0ea88acdb618..e1e849101cdd4 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -36,6 +36,7 @@
> #include "xfs_ag.h"
> #include "xfs_rtbitmap.h"
> #include "xfs_metafile.h"
> +#include "xfs_rtgroup.h"
> #include "scrub/stats.h"
>
> static DEFINE_MUTEX(xfs_uuid_table_mutex);
> @@ -664,6 +665,7 @@ xfs_mountfs(
> struct xfs_ino_geometry *igeo = M_IGEO(mp);
> uint quotamount = 0;
> uint quotaflags = 0;
> + xfs_rgnumber_t rgno;
> int error = 0;
>
> xfs_sb_mount_common(mp, sbp);
> @@ -830,10 +832,18 @@ xfs_mountfs(
> goto out_free_dir;
> }
>
> + for (rgno = 0; rgno < mp->m_sb.sb_rgcount; rgno++) {
> + error = xfs_rtgroup_alloc(mp, rgno);
> + if (error) {
> + xfs_warn(mp, "Failed rtgroup init: %d", error);
> + goto out_free_rtgroup;
> + }
> + }
Same - factor this to a xfs_rtgroup_init() function located with the
rest of the rtgroup infrastructure...
> +
> if (XFS_IS_CORRUPT(mp, !sbp->sb_logblocks)) {
> xfs_warn(mp, "no log defined");
> error = -EFSCORRUPTED;
> - goto out_free_perag;
> + goto out_free_rtgroup;
> }
>
> error = xfs_inodegc_register_shrinker(mp);
> @@ -1068,7 +1078,8 @@ xfs_mountfs(
> if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
> xfs_buftarg_drain(mp->m_logdev_targp);
> xfs_buftarg_drain(mp->m_ddev_targp);
> - out_free_perag:
> + out_free_rtgroup:
> + xfs_free_rtgroups(mp, rgno);
> xfs_free_perag(mp);
> out_free_dir:
> xfs_da_unmount(mp);
> @@ -1152,6 +1163,7 @@ xfs_unmountfs(
> xfs_errortag_clearall(mp);
> #endif
> shrinker_free(mp->m_inodegc_shrinker);
> + xfs_free_rtgroups(mp, mp->m_sb.sb_rgcount);
... like you've already for the cleanup side ;)
....
> @@ -1166,6 +1169,9 @@ xfs_rtmount_inodes(
> if (error)
> goto out_rele_summary;
>
> + for_each_rtgroup(mp, rgno, rtg)
> + rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
> +
This also needs to be done after recovery has initialised new rtgs
as a result of replaying an sb growfs modification, right?
Which leads to the next question: if there are thousands of rtgs,
this requires walking every rtg at mount time, right? We know that
walking thousands of static structures at mount time is a
scalability issue, so can we please avoid this if at all possible?
i.e. do demand loading of per-rtg metadata when it is first required
(like we do with agf/agi information) rather than doing it all at
mount time...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-23 0:18 ` [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
@ 2024-08-25 23:58 ` Dave Chinner
2024-08-26 21:38 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-25 23:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:18:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Add a dynamic lockdep class key for rtgroup inodes. This will enable
> lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
> order. Each class can have 8 subclasses, and for now we will only have
> 2 inodes per group. This enables rtgroup order and inode order checks
> when nesting ILOCKs.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 52 insertions(+)
>
>
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> index 51f04cad5227c..ae6d67c673b1a 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.c
> +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> @@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
> if (rtglock_flags & XFS_RTGLOCK_BITMAP)
> xfs_rtbitmap_trans_join(tp);
> }
> +
> +#ifdef CONFIG_PROVE_LOCKING
> +static struct lock_class_key xfs_rtginode_lock_class;
> +
> +static int
> +xfs_rtginode_ilock_cmp_fn(
> + const struct lockdep_map *m1,
> + const struct lockdep_map *m2)
> +{
> + const struct xfs_inode *ip1 =
> + container_of(m1, struct xfs_inode, i_lock.dep_map);
> + const struct xfs_inode *ip2 =
> + container_of(m2, struct xfs_inode, i_lock.dep_map);
> +
> + if (ip1->i_projid < ip2->i_projid)
> + return -1;
> + if (ip1->i_projid > ip2->i_projid)
> + return 1;
> + return 0;
> +}
What's the project ID of the inode got to do with realtime groups?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special
2024-08-23 0:04 ` [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special Darrick J. Wong
2024-08-23 4:40 ` Christoph Hellwig
@ 2024-08-26 0:41 ` Dave Chinner
2024-08-26 17:33 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 0:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:04:14PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Metadata inodes are private files and therefore cannot be exposed to
> userspace. This means no bulkstat, no open-by-handle, no linking them
> into the directory tree, and no feeding them to LSMs. As such, we mark
> them S_PRIVATE, which stops all that.
Can you merge this back up into the initial iget support code?
>
> While we're at it, put them in a separate lockdep class so that it won't
> get confused by "recursive" i_rwsem locking such as what happens when we
> write to a rt file and need to allocate from the rt bitmap file. The
> static function that we use to do this will be exported in the rtgroups
> patchset.
Stale commit message? There's nothing of the sort in this patch....
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/scrub/tempfile.c | 8 ++++++++
> fs/xfs/xfs_iops.c | 15 ++++++++++++++-
> 2 files changed, 22 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
> index 177f922acfaf1..3c5a1d77fefae 100644
> --- a/fs/xfs/scrub/tempfile.c
> +++ b/fs/xfs/scrub/tempfile.c
> @@ -844,6 +844,14 @@ xrep_is_tempfile(
> const struct xfs_inode *ip)
> {
> const struct inode *inode = &ip->i_vnode;
> + struct xfs_mount *mp = ip->i_mount;
> +
> + /*
> + * Files in the metadata directory tree also have S_PRIVATE set and
> + * IOP_XATTR unset, so we must distinguish them separately.
> + */
> + if (xfs_has_metadir(mp) && (ip->i_diflags2 & XFS_DIFLAG2_METADATA))
> + return false;
Why do you need to check both xfs_has_metadir() and the inode flag
here? The latter should only be set if the former is set, yes?
If it's the other way around, then we have an on-disk corruption...
> if (IS_PRIVATE(inode) && !(inode->i_opflags & IOP_XATTR))
> return true;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 1cdc8034f54d9..c1686163299a0 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -42,7 +42,9 @@
> * held. For regular files, the lock order is the other way around - the
> * mmap_lock is taken during the page fault, and then we lock the ilock to do
> * block mapping. Hence we need a different class for the directory ilock so
> - * that lockdep can tell them apart.
> + * that lockdep can tell them apart. Directories in the metadata directory
> + * tree get a separate class so that lockdep reports will warn us if someone
> + * ever tries to lock regular directories after locking metadata directories.
> */
> static struct lock_class_key xfs_nondir_ilock_class;
> static struct lock_class_key xfs_dir_ilock_class;
> @@ -1299,6 +1301,7 @@ xfs_setup_inode(
> {
> struct inode *inode = &ip->i_vnode;
> gfp_t gfp_mask;
> + bool is_meta = xfs_is_metadata_inode(ip);
>
> inode->i_ino = ip->i_ino;
> inode->i_state |= I_NEW;
> @@ -1310,6 +1313,16 @@ xfs_setup_inode(
> i_size_write(inode, ip->i_disk_size);
> xfs_diflags_to_iflags(ip, true);
>
> + /*
> + * Mark our metadata files as private so that LSMs and the ACL code
> + * don't try to add their own metadata or reason about these files,
> + * and users cannot ever obtain file handles to them.
> + */
> + if (is_meta) {
> + inode->i_flags |= S_PRIVATE;
> + inode->i_opflags &= ~IOP_XATTR;
> + }
No need for a temporary variable here.
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 11/26] xfs: don't count metadata directory files to quota
2024-08-23 0:05 ` [PATCH 11/26] xfs: don't count metadata directory files to quota Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
@ 2024-08-26 0:47 ` Dave Chinner
2024-08-26 17:57 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 0:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:05:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Files in the metadata directory tree are internal to the filesystem.
> Don't count the inodes or the blocks they use in the root dquot because
> users do not need to know about their resource usage. This will also
> quiet down complaints about dquot usage not matching du output.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_dquot.c | 1 +
> fs/xfs/xfs_qm.c | 11 +++++++++++
> fs/xfs/xfs_quota.h | 5 +++++
> fs/xfs/xfs_trans_dquot.c | 6 ++++++
> 4 files changed, 23 insertions(+)
>
>
> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index c1b211c260a9d..3bf47458c517a 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -983,6 +983,7 @@ xfs_qm_dqget_inode(
>
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> ASSERT(xfs_inode_dquot(ip, type) == NULL);
> + ASSERT(!xfs_is_metadir_inode(ip));
>
> id = xfs_qm_id_for_quotatype(ip, type);
>
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index d0674d84af3ec..ec983cca9adae 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -304,6 +304,8 @@ xfs_qm_need_dqattach(
> return false;
> if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> return false;
> + if (xfs_is_metadir_inode(ip))
> + return false;
> return true;
> }
>
> @@ -326,6 +328,7 @@ xfs_qm_dqattach_locked(
> return 0;
>
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> + ASSERT(!xfs_is_metadir_inode(ip));
>
> if (XFS_IS_UQUOTA_ON(mp) && !ip->i_udquot) {
> error = xfs_qm_dqattach_one(ip, XFS_DQTYPE_USER,
> @@ -1204,6 +1207,10 @@ xfs_qm_dqusage_adjust(
> }
> }
>
> + /* Metadata directory files are not accounted to user-visible quotas. */
> + if (xfs_is_metadir_inode(ip))
> + goto error0;
> +
Hmmmm. I'm starting to think that xfs_iget() should not return
metadata inodes unless a new XFS_IGET_METAINODE flag is set.
That would replace all these post xfs_iget() checks with a single
check in xfs_iget(), and then xfs_trans_metafile_iget() is the only
place that sets this specific flag.
That means stuff like VFS lookups, bulkstat, quotacheck, and
filehandle lookups will never return metadata inodes and we don't
need to add special checks all over for them...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-23 0:18 ` [PATCH 14/24] xfs: support caching rtgroup metadata inodes Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
@ 2024-08-26 1:41 ` Dave Chinner
2024-08-26 18:37 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 1:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:18:18PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Create the necessary per-rtgroup infrastructure that we need to load
> metadata inodes into memory.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_rtgroup.c | 182 +++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_rtgroup.h | 28 +++++++
> fs/xfs/xfs_mount.h | 1
> fs/xfs/xfs_rtalloc.c | 48 +++++++++++
> 4 files changed, 258 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> index ae6d67c673b1a..50e4a56d749f0 100644
> --- a/fs/xfs/libxfs/xfs_rtgroup.c
> +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> @@ -30,6 +30,8 @@
> #include "xfs_icache.h"
> #include "xfs_rtgroup.h"
> #include "xfs_rtbitmap.h"
> +#include "xfs_metafile.h"
> +#include "xfs_metadir.h"
>
> /*
> * Passive reference counting access wrappers to the rtgroup structures. If
> @@ -295,3 +297,183 @@ xfs_rtginode_lockdep_setup(
> #else
> #define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
> #endif /* CONFIG_PROVE_LOCKING */
> +
> +struct xfs_rtginode_ops {
> + const char *name; /* short name */
> +
> + enum xfs_metafile_type metafile_type;
> +
> + /* Does the fs have this feature? */
> + bool (*enabled)(struct xfs_mount *mp);
> +
> + /* Create this rtgroup metadata inode and initialize it. */
> + int (*create)(struct xfs_rtgroup *rtg,
> + struct xfs_inode *ip,
> + struct xfs_trans *tp,
> + bool init);
> +};
What's all this for?
AFAICT, loading the inodes into the rtgs requires a call to
xfs_metadir_load() when initialising the rtg (either at mount or
lazily on the first access to the rtg). Hence I'm not really sure
what this complexity is needed for, and the commit message is not
very informative....
> +static const struct xfs_rtginode_ops xfs_rtginode_ops[XFS_RTGI_MAX] = {
> +};
> +
> +/* Return the shortname of this rtgroup inode. */
> +const char *
> +xfs_rtginode_name(
> + enum xfs_rtg_inodes type)
> +{
> + return xfs_rtginode_ops[type].name;
> +}
> +
> +/* Should this rtgroup inode be present? */
> +bool
> +xfs_rtginode_enabled(
> + struct xfs_rtgroup *rtg,
> + enum xfs_rtg_inodes type)
> +{
> + const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
> +
> + if (!ops->enabled)
> + return true;
> + return ops->enabled(rtg->rtg_mount);
> +}
> +
> +/* Load an existing rtgroup inode into the rtgroup structure. */
> +int
> +xfs_rtginode_load(
> + struct xfs_rtgroup *rtg,
> + enum xfs_rtg_inodes type,
> + struct xfs_trans *tp)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + const char *path;
> + struct xfs_inode *ip;
> + const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
> + int error;
> +
> + if (!xfs_rtginode_enabled(rtg, type))
> + return 0;
> +
> + if (!mp->m_rtdirip)
> + return -EFSCORRUPTED;
> +
> + path = xfs_rtginode_path(rtg->rtg_rgno, type);
> + if (!path)
> + return -ENOMEM;
> + error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
> + &ip);
> + kfree(path);
> +
> + if (error)
> + return error;
> +
> + if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
> + ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
> + xfs_irele(ip);
> + return -EFSCORRUPTED;
> + }
We don't support LOCAL format for any type of regular file inodes,
so I'm a little confused as to why this wouldn't be caught by the
verifier on inode read? i.e. What problem is this trying to catch,
and why doesn't the inode verifier catch it for us?
> + if (XFS_IS_CORRUPT(mp, ip->i_projid != rtg->rtg_rgno)) {
> + xfs_irele(ip);
> + return -EFSCORRUPTED;
> + }
> +
> + xfs_rtginode_lockdep_setup(ip, rtg->rtg_rgno, type);
> + rtg->rtg_inodes[type] = ip;
> + return 0;
> +}
> +
> +/* Release an rtgroup metadata inode. */
> +void
> +xfs_rtginode_irele(
> + struct xfs_inode **ipp)
> +{
> + if (*ipp)
> + xfs_irele(*ipp);
> + *ipp = NULL;
> +}
> +
> +/* Add a metadata inode for a realtime rmap btree. */
> +int
> +xfs_rtginode_create(
> + struct xfs_rtgroup *rtg,
> + enum xfs_rtg_inodes type,
> + bool init)
This doesn't seem to belong in this patchset...
....
> +/* Create the parent directory for all rtgroup inodes and load it. */
> +int
> +xfs_rtginode_mkdir_parent(
> + struct xfs_mount *mp)
Or this...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup
2024-08-23 0:18 ` [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup Darrick J. Wong
@ 2024-08-26 1:58 ` Dave Chinner
0 siblings, 0 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 1:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Thu, Aug 22, 2024 at 05:18:49PM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Move the pointers to the RT bitmap and summary inodes as well as the
> summary cache to the rtgroups structure to prepare for having a
> separate bitmap and summary inodes for each rtgroup.
>
> Code using the inodes now needs to operate on a rtgroup. Where easily
> possible such code is converted to iterate over all rtgroups, else
> rtgroup 0 (the only one that can currently exist) is hardcoded.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 40 +++-
> fs/xfs/libxfs/xfs_rtbitmap.c | 174 ++++++++--------
> fs/xfs/libxfs/xfs_rtbitmap.h | 68 +++---
> fs/xfs/libxfs/xfs_rtgroup.c | 90 +++++++-
> fs/xfs/libxfs/xfs_rtgroup.h | 14 +
> fs/xfs/scrub/bmap.c | 13 +
> fs/xfs/scrub/fscounters.c | 26 +-
> fs/xfs/scrub/repair.c | 24 ++
> fs/xfs/scrub/repair.h | 7 +
> fs/xfs/scrub/rtbitmap.c | 45 ++--
> fs/xfs/scrub/rtsummary.c | 93 +++++----
> fs/xfs/scrub/rtsummary_repair.c | 7 -
> fs/xfs/scrub/scrub.c | 4
> fs/xfs/xfs_discard.c | 100 ++++++---
> fs/xfs/xfs_fsmap.c | 143 ++++++++-----
> fs/xfs/xfs_mount.h | 10 -
> fs/xfs/xfs_qm.c | 27 ++-
> fs/xfs/xfs_rtalloc.c | 415 ++++++++++++++++++++++-----------------
> 18 files changed, 763 insertions(+), 537 deletions(-)
I'm finding this patch does far too many things to be reviewable.
There's code factoring, abstraction by local variables, changes to
locking APIs, etc that are needed to simplify the conversion, but
could all be done separately before the actual changeover to using
rtgroups.
There's also multiple functional changes in the code - like support
for growfs using rtgroups and moving to per-rtg summary caches - so
it's really difficult to separate and review the individual changes
in this.
Can you please split this up into a few separate steps? One
for all the local variable conversions and "no change" factoring,
one to move the summary cache code, one to add the growfs support
and, finally, one to actually convert everything over to use
rtgroups directly?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-23 0:20 ` [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper Darrick J. Wong
@ 2024-08-26 2:06 ` Dave Chinner
2024-08-26 18:27 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 2:06 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Thu, Aug 22, 2024 at 05:20:07PM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Split the check that the rtsummary fits into the log into a separate
> helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
> geometry.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> [djwong: avoid division for the 0-rtx growfs check]
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
> 1 file changed, 29 insertions(+), 14 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 61231b1dc4b79..78a3879ad6193 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
> return error;
> }
>
> +static int
> +xfs_growfs_check_rtgeom(
> + const struct xfs_mount *mp,
> + xfs_rfsblock_t rblocks,
> + xfs_extlen_t rextsize)
> +{
> + struct xfs_mount *nmp;
> + int error = 0;
> +
> + nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
> + if (!nmp)
> + return -ENOMEM;
> +
> + /*
> + * New summary size can't be more than half the size of the log. This
> + * prevents us from getting a log overflow, since we'll log basically
> + * the whole summary file at once.
> + */
> + if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
> + error = -EINVAL;
FWIW, the new size needs to be smaller than that, because the "half
the log size" budget must include all the log metadata needed to
encapsulate that object. The growfs transaction also logs inodes and
the superblock, so that also takes away from the maximum size of
the summary file....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap
2024-08-23 0:00 ` [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap Darrick J. Wong
@ 2024-08-26 3:58 ` Zizhi Wo
0 siblings, 0 replies; 271+ messages in thread
From: Zizhi Wo @ 2024-08-26 3:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
Hi!
在 2024/8/23 8:00, Darrick J. Wong 写道:
> From: Zizhi Wo <wozizhi@huawei.com>
>
> In the fsmap query of xfs, there is an interval missing problem:
> [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt
> EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
> 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8
> 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16
> 2: 253:16 [24..39]: inode btree 0 (24..39) 16
> 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8
> 4: 253:16 [48..55]: refcount btree 0 (48..55) 8
> 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48
> 6: 253:16 [104..127]: free space 0 (104..127) 24
> ......
>
> BUG:
> [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
> [root@fedora ~]#
> Normally, we should be able to get [104, 107), but we got nothing.
>
> The problem is caused by shifting. The query for the problem-triggered
> scenario is for the missing_owner interval (e.g. freespace in rmapbt/
> unknown space in bnobt), which is obtained by subtraction (gap). For this
> scenario, the interval is obtained by info->last. However, rec_daddr is
> calculated based on the start_block recorded in key[1], which is converted
> by calling XFS_BB_TO_FSBT. Then if rec_daddr does not exceed
> info->next_daddr, which means keys[1].fmr_physical >> (mp)->m_blkbb_log
> <= info->next_daddr, no records will be displayed. In the above example,
> 104 >> (mp)->m_blkbb_log = 12 and 107 >> (mp)->m_blkbb_log = 12, so the two
> are reduced to 0 and the gap is ignored:
>
> before calculate ----------------> after shifting
> 104(st) 107(ed) 12(st/ed)
> |---------| |
> sector size block size
>
> Resolve this issue by introducing the "end_daddr" field in
> xfs_getfsmap_info. This records |key[1].fmr_physical + key[1].length| at
> the granularity of sector. If the current query is the last, the rec_daddr
> is end_daddr to prevent missing interval problems caused by shifting. We
> only need to focus on the last query, because xfs disks are internally
> aligned with disk blocksize that are powers of two and minimum 512, so
> there is no problem with shifting in previous queries.
>
> After applying this patch, the above problem has been solved:
> [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt
> EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL
> 0: 253:16 [104..106]: free space 0 (104..106) 3
>
> Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl")
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> [djwong: limit the range of end_addr correctly]
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_fsmap.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
> index 613a0ec204120..71f32354944e4 100644
> --- a/fs/xfs/xfs_fsmap.c
> +++ b/fs/xfs/xfs_fsmap.c
> @@ -162,6 +162,7 @@ struct xfs_getfsmap_info {
> xfs_daddr_t next_daddr; /* next daddr we expect */
> /* daddr of low fsmap key when we're using the rtbitmap */
> xfs_daddr_t low_daddr;
> + xfs_daddr_t end_daddr; /* daddr of high fsmap key */
> u64 missing_owner; /* owner of holes */
> u32 dev; /* device id */
> /*
> @@ -182,6 +183,7 @@ struct xfs_getfsmap_dev {
> int (*fn)(struct xfs_trans *tp,
> const struct xfs_fsmap *keys,
> struct xfs_getfsmap_info *info);
> + sector_t nr_sectors;
> };
>
> /* Compare two getfsmap device handlers. */
> @@ -294,6 +296,18 @@ xfs_getfsmap_helper(
> return 0;
> }
>
> + /*
> + * For an info->last query, we're looking for a gap between the last
> + * mapping emitted and the high key specified by userspace. If the
> + * user's query spans less than 1 fsblock, then info->high and
> + * info->low will have the same rm_startblock, which causes rec_daddr
> + * and next_daddr to be the same. Therefore, use the end_daddr that
> + * we calculated from userspace's high key to synthesize the record.
> + * Note that if the btree query found a mapping, there won't be a gap.
> + */
> + if (info->last && info->end_daddr != XFS_BUF_DADDR_NULL)
> + rec_daddr = info->end_daddr;
> +
> /* Are we just counting mappings? */
> if (info->head->fmh_count == 0) {
> if (info->head->fmh_entries == UINT_MAX)
> @@ -904,17 +918,21 @@ xfs_getfsmap(
>
> /* Set up our device handlers. */
> memset(handlers, 0, sizeof(handlers));
> + handlers[0].nr_sectors = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
> handlers[0].dev = new_encode_dev(mp->m_ddev_targp->bt_dev);
> if (use_rmap)
> handlers[0].fn = xfs_getfsmap_datadev_rmapbt;
> else
> handlers[0].fn = xfs_getfsmap_datadev_bnobt;
> if (mp->m_logdev_targp != mp->m_ddev_targp) {
> + handlers[1].nr_sectors = XFS_FSB_TO_BB(mp,
> + mp->m_sb.sb_logblocks);
> handlers[1].dev = new_encode_dev(mp->m_logdev_targp->bt_dev);
> handlers[1].fn = xfs_getfsmap_logdev;
> }
> #ifdef CONFIG_XFS_RT
> if (mp->m_rtdev_targp) {
> + handlers[2].nr_sectors = XFS_FSB_TO_BB(mp, mp->m_sb.sb_rblocks);
> handlers[2].dev = new_encode_dev(mp->m_rtdev_targp->bt_dev);
> handlers[2].fn = xfs_getfsmap_rtdev_rtbitmap;
> }
> @@ -946,6 +964,7 @@ xfs_getfsmap(
>
> info.next_daddr = head->fmh_keys[0].fmr_physical +
> head->fmh_keys[0].fmr_length;
> + info.end_daddr = XFS_BUF_DADDR_NULL;
> info.fsmap_recs = fsmap_recs;
> info.head = head;
>
> @@ -966,8 +985,11 @@ xfs_getfsmap(
> * low key, zero out the low key so that we get
> * everything from the beginning.
> */
> - if (handlers[i].dev == head->fmh_keys[1].fmr_device)
> + if (handlers[i].dev == head->fmh_keys[1].fmr_device) {
> dkeys[1] = head->fmh_keys[1];
> + info.end_daddr = min(handlers[i].nr_sectors - 1,
> + dkeys[1].fmr_physical);
> + }
In this case, we shouldn't subtract 1 from handlers[i].nr_sectors,
otherwise we lose 1 sector, and after we've shifted it, we lose 1
block (8 sectors). This boundary bug is similar to the one in the
latest patch set I sent [1].
[1] https://lore.kernel.org/all/20240826031005.2493150-1-wozizhi@huawei.com/
> if (handlers[i].dev > head->fmh_keys[0].fmr_device)
> memset(&dkeys[0], 0, sizeof(struct xfs_fsmap));
>
>
* Re: [PATCH 17/26] xfs: support logging EFIs for realtime extents
2024-08-23 0:25 ` [PATCH 17/26] xfs: support logging EFIs for realtime extents Darrick J. Wong
2024-08-23 5:17 ` Christoph Hellwig
@ 2024-08-26 4:33 ` Dave Chinner
2024-08-26 19:38 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 4:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Teach the EFI mechanism how to free realtime extents. We're going to
> need this to enforce proper ordering of operations when we enable
> realtime rmap.
>
> Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
> ops for rt extents. This keeps the ondisk artifacts and processing code
> completely separate between the rt and non-rt cases. Hopefully this
> will make it easier to debug filesystem problems.
Doesn't this now require busy extent tracking for rt extents that
are being freed? i.e. they get marked as free with the EFD, but
cannot be reallocated (or discarded) until the EFD is committed to
disk.
We don't allow user data allocation on the data device to reuse busy
ranges because the freeing of the extent has not yet been committed
to the journal. Because we use async transaction commits, that means
we can return to userspace without even the EFI in the journal - it
can still be in memory in the CIL. Hence we cannot allow userspace
to reallocate that range and write to it, even though it is marked
free in the in-memory metadata.
If userspace then does a write and then we crash without the
original EFI on disk, then we've just violated metadata vs data
update ordering because recovery will not replay the extent free nor
the new allocation, yet the data in that extent will have been
changed.
Hence I think that if we are moving to intent-based freeing of
realtime extents, we absolutely need to add support for busy extent
tracking to realtime groups before we enable EFIs on realtime
groups.....
Also ....
> @@ -447,6 +467,17 @@ xfs_extent_free_defer_add(
>
> trace_xfs_extent_free_defer(mp, xefi);
>
> + if (xfs_efi_is_realtime(xefi)) {
> + xfs_rgnumber_t rgno;
> +
> + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock);
> + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
> +
> + *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> + &xfs_rtextent_free_defer_type);
> + return;
> + }
> +
> xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock);
> if (xefi->xefi_agresv == XFS_AG_RESV_AGFL)
> *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
Hmmmm. Isn't this also missing the xfs_drain intent interlocks that
allow online repair to wait until all the intents outstanding on a
group complete?
> @@ -687,6 +735,106 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
> .relog_intent = xfs_extent_free_relog_intent,
> };
>
> +#ifdef CONFIG_XFS_RT
> +/* Sort realtime efi items by rtgroup for efficiency. */
> +static int
> +xfs_rtextent_free_diff_items(
> + void *priv,
> + const struct list_head *a,
> + const struct list_head *b)
> +{
> + struct xfs_extent_free_item *ra = xefi_entry(a);
> + struct xfs_extent_free_item *rb = xefi_entry(b);
> +
> + return ra->xefi_rtg->rtg_rgno - rb->xefi_rtg->rtg_rgno;
> +}
> +
> +/* Create a realtime extent freeing */
> +static struct xfs_log_item *
> +xfs_rtextent_free_create_intent(
> + struct xfs_trans *tp,
> + struct list_head *items,
> + unsigned int count,
> + bool sort)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + struct xfs_efi_log_item *efip;
> + struct xfs_extent_free_item *xefi;
> +
> + ASSERT(count > 0);
> +
> + efip = xfs_efi_init(mp, XFS_LI_EFI_RT, count);
> + if (sort)
> + list_sort(mp, items, xfs_rtextent_free_diff_items);
> + list_for_each_entry(xefi, items, xefi_list)
> + xfs_extent_free_log_item(tp, efip, xefi);
> + return &efip->efi_item;
> +}
Hmmmm - when would we get an XFS_LI_EFI_RT with multiple extents in
it? We only ever free a single user data extent per transaction at a
time, right? There will be no metadata blocks being freed on the rt
device - all the BMBT, refcountbt and rmapbt blocks that get freed
as a result of freeing the user data extent will be in the data
device and so will use EFIs, not EFI_RTs....
> +
> +/* Cancel a realtime extent freeing. */
> +STATIC void
> +xfs_rtextent_free_cancel_item(
> + struct list_head *item)
> +{
> + struct xfs_extent_free_item *xefi = xefi_entry(item);
> +
> + xfs_rtgroup_put(xefi->xefi_rtg);
> + kmem_cache_free(xfs_extfree_item_cache, xefi);
> +}
> +
> +/* Process a free realtime extent. */
> +STATIC int
> +xfs_rtextent_free_finish_item(
> + struct xfs_trans *tp,
> + struct xfs_log_item *done,
> + struct list_head *item,
> + struct xfs_btree_cur **state)
btree cursor ....
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + struct xfs_extent_free_item *xefi = xefi_entry(item);
> + struct xfs_efd_log_item *efdp = EFD_ITEM(done);
> + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state;
... but is apparently holding an xfs_rtgroup. That's kinda nasty, and
the rtg the xefi is supposed to be associated with is already held
by the xefi, so....
> + int error = 0;
> +
> + trace_xfs_extent_free_deferred(mp, xefi);
> +
> + if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) {
> + if (*rtgp != xefi->xefi_rtg) {
> + xfs_rtgroup_lock(xefi->xefi_rtg, XFS_RTGLOCK_BITMAP);
> + xfs_rtgroup_trans_join(tp, xefi->xefi_rtg,
> + XFS_RTGLOCK_BITMAP);
> + *rtgp = xefi->xefi_rtg;
How does this case happen? Why is it safe to lock the xefi rtg
here, and why are we returning the xefi rtg to the caller without
taking extra references or dropping the rtg the caller passed in?
At least a comment explaining what is happening is necessary here...
> + }
> + error = xfs_rtfree_blocks(tp, xefi->xefi_rtg,
> + xefi->xefi_startblock, xefi->xefi_blockcount);
> + }
> + if (error == -EAGAIN) {
> + xfs_efd_from_efi(efdp);
> + return error;
> + }
> +
> + xfs_efd_add_extent(efdp, xefi);
> + xfs_rtextent_free_cancel_item(item);
> + return error;
> +}
> +
> +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
> + .name = "rtextent_free",
> + .max_items = XFS_EFI_MAX_FAST_EXTENTS,
> + .create_intent = xfs_rtextent_free_create_intent,
> + .abort_intent = xfs_extent_free_abort_intent,
> + .create_done = xfs_extent_free_create_done,
> + .finish_item = xfs_rtextent_free_finish_item,
> + .cancel_item = xfs_rtextent_free_cancel_item,
> + .recover_work = xfs_extent_free_recover_work,
> + .relog_intent = xfs_extent_free_relog_intent,
> +};
> +#else
> +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
> + .name = "rtextent_free",
> +};
> +#endif /* CONFIG_XFS_RT */
> +
> STATIC bool
> xfs_efi_item_match(
> struct xfs_log_item *lip,
> @@ -731,7 +879,7 @@ xlog_recover_efi_commit_pass2(
> return -EFSCORRUPTED;
> }
>
> - efip = xfs_efi_init(mp, efi_formatp->efi_nextents);
> + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
> error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
> if (error) {
> xfs_efi_item_free(efip);
> @@ -749,6 +897,58 @@ const struct xlog_recover_item_ops xlog_efi_item_ops = {
> .commit_pass2 = xlog_recover_efi_commit_pass2,
> };
>
> +#ifdef CONFIG_XFS_RT
> +STATIC int
> +xlog_recover_rtefi_commit_pass2(
> + struct xlog *log,
> + struct list_head *buffer_list,
> + struct xlog_recover_item *item,
> + xfs_lsn_t lsn)
> +{
> + struct xfs_mount *mp = log->l_mp;
> + struct xfs_efi_log_item *efip;
> + struct xfs_efi_log_format *efi_formatp;
> + int error;
> +
> + efi_formatp = item->ri_buf[0].i_addr;
> +
> + if (item->ri_buf[0].i_len < xfs_efi_log_format_sizeof(0)) {
> + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
> + item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
> + return -EFSCORRUPTED;
> + }
> +
> + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
> + error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
> + if (error) {
> + xfs_efi_item_free(efip);
> + return error;
> + }
> + atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents);
> +
> + xlog_recover_intent_item(log, &efip->efi_item, lsn,
> + &xfs_rtextent_free_defer_type);
> + return 0;
> +}
> +#else
> +STATIC int
> +xlog_recover_rtefi_commit_pass2(
> + struct xlog *log,
> + struct list_head *buffer_list,
> + struct xlog_recover_item *item,
> + xfs_lsn_t lsn)
> +{
> + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp,
> + item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
> + return -EFSCORRUPTED;
This needs to be a more meaningful error. It's not technically a
corruption - we recognised that an RTEFI needs to be recovered,
but this kernel does not have RTEFI support compiled in. Hence the
error should be something along the lines of:
"RTEFI found in journal, but kernel not compiled with CONFIG_XFS_RT enabled.
Cannot recover journal, please remount using a kernel with RT device
support enabled."
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-23 0:26 ` [PATCH 21/26] xfs: make the RT allocator rtgroup aware Darrick J. Wong
@ 2024-08-26 4:56 ` Dave Chinner
2024-08-26 19:40 ` Darrick J. Wong
2024-08-27 4:59 ` Christoph Hellwig
0 siblings, 2 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 4:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Thu, Aug 22, 2024 at 05:26:38PM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Make the allocator rtgroup aware by either picking a specific group if
> there is a hint, or loop over all groups otherwise. A simple rotor is
> provided to pick the placement for initial allocations.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_bmap.c | 13 +++++-
> fs/xfs/libxfs/xfs_rtbitmap.c | 6 ++-
> fs/xfs/xfs_mount.h | 1
> fs/xfs/xfs_rtalloc.c | 98 ++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 105 insertions(+), 13 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 126a0d253654a..88c62e1158ac7 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3151,8 +3151,17 @@ xfs_bmap_adjacent_valid(
> struct xfs_mount *mp = ap->ip->i_mount;
>
> if (XFS_IS_REALTIME_INODE(ap->ip) &&
> - (ap->datatype & XFS_ALLOC_USERDATA))
> - return x < mp->m_sb.sb_rblocks;
> + (ap->datatype & XFS_ALLOC_USERDATA)) {
> + if (x >= mp->m_sb.sb_rblocks)
> + return false;
> + if (!xfs_has_rtgroups(mp))
> + return true;
> +
> + return xfs_rtb_to_rgno(mp, x) == xfs_rtb_to_rgno(mp, y) &&
> + xfs_rtb_to_rgno(mp, x) < mp->m_sb.sb_rgcount &&
> + xfs_rtb_to_rtx(mp, x) < mp->m_sb.sb_rgextents;
Why do we need the xfs_has_rtgroups() check here? The new rtg logic
will return true for an old-school rt device here, right?
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3fedc552b51b0..2b57ff2687bf6 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -1661,8 +1661,9 @@ xfs_rtalloc_align_minmax(
> }
>
> static int
> -xfs_rtallocate(
> +xfs_rtallocate_rtg(
> struct xfs_trans *tp,
> + xfs_rgnumber_t rgno,
> xfs_rtblock_t bno_hint,
> xfs_rtxlen_t minlen,
> xfs_rtxlen_t maxlen,
> @@ -1682,16 +1683,33 @@ xfs_rtallocate(
> xfs_rtxlen_t len = 0;
> int error = 0;
>
> - args.rtg = xfs_rtgroup_grab(args.mp, 0);
> + args.rtg = xfs_rtgroup_grab(args.mp, rgno);
> if (!args.rtg)
> return -ENOSPC;
>
> /*
> - * Lock out modifications to both the RT bitmap and summary inodes.
> + * We need to lock out modifications to both the RT bitmap and summary
> + * inodes for finding free space in xfs_rtallocate_extent_{near,size}
> + * and join the bitmap and summary inodes for the actual allocation
> + * down in xfs_rtallocate_range.
> + *
> + * For RTG-enabled file systems we don't want to join the inodes to the
> + * transaction until we are committed to allocate from this
> + * RTG so that only one inode of each type is locked at a time.
> + *
> + * But for pre-RTG file systems we already need to join the bitmap
> + * inode to the transaction for xfs_rtpick_extent, which bumps the
> + * sequence number in it, so we'll have to join the inode to the
> + * transaction early here.
> + *
> + * This is all a bit messy, but at least the mess is contained in
> + * this function.
> */
> if (!*rtlocked) {
> xfs_rtgroup_lock(args.rtg, XFS_RTGLOCK_BITMAP);
> - xfs_rtgroup_trans_join(tp, args.rtg, XFS_RTGLOCK_BITMAP);
> + if (!xfs_has_rtgroups(args.mp))
> + xfs_rtgroup_trans_join(tp, args.rtg,
> + XFS_RTGLOCK_BITMAP);
> *rtlocked = true;
> }
>
> @@ -1701,7 +1719,7 @@ xfs_rtallocate(
> */
> if (bno_hint)
> start = xfs_rtb_to_rtx(args.mp, bno_hint);
> - else if (initial_user_data)
> + else if (!xfs_has_rtgroups(args.mp) && initial_user_data)
> start = xfs_rtpick_extent(args.rtg, tp, maxlen);
Check initial_user_data first - we don't care if there are rtgroups
enabled if initial_user_data is not true, and we only ever allocate
initial data on an inode once...
> @@ -1741,6 +1767,53 @@ xfs_rtallocate(
> return error;
> }
>
> +static int
> +xfs_rtallocate_rtgs(
> + struct xfs_trans *tp,
> + xfs_fsblock_t bno_hint,
> + xfs_rtxlen_t minlen,
> + xfs_rtxlen_t maxlen,
> + xfs_rtxlen_t prod,
> + bool wasdel,
> + bool initial_user_data,
> + xfs_rtblock_t *bno,
> + xfs_extlen_t *blen)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + xfs_rgnumber_t start_rgno, rgno;
> + int error;
> +
> + /*
> + * For now this just blindly iterates over the RTGs for an initial
> + * allocation. We could try to keep an in-memory rtg_longest member
> + * to avoid the locking when just looking for big enough free space,
> + * but for now this keeps things simple.
> + */
> + if (bno_hint != NULLFSBLOCK)
> + start_rgno = xfs_rtb_to_rgno(mp, bno_hint);
> + else
> + start_rgno = (atomic_inc_return(&mp->m_rtgrotor) - 1) %
> + mp->m_sb.sb_rgcount;
> +
> + rgno = start_rgno;
> + do {
> + bool rtlocked = false;
> +
> + error = xfs_rtallocate_rtg(tp, rgno, bno_hint, minlen, maxlen,
> + prod, wasdel, initial_user_data, &rtlocked,
> + bno, blen);
> + if (error != -ENOSPC)
> + return error;
> + ASSERT(!rtlocked);
> +
> + if (++rgno == mp->m_sb.sb_rgcount)
> + rgno = 0;
> + bno_hint = NULLFSBLOCK;
> + } while (rgno != start_rgno);
> +
> + return -ENOSPC;
> +}
> +
> static int
> xfs_rtallocate_align(
> struct xfs_bmalloca *ap,
> @@ -1835,9 +1908,16 @@ xfs_bmap_rtalloc(
> if (xfs_bmap_adjacent(ap))
> bno_hint = ap->blkno;
>
> - error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
> - ap->wasdel, initial_user_data, &rtlocked,
> - &ap->blkno, &ap->length);
> + if (xfs_has_rtgroups(ap->ip->i_mount)) {
> + error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
> + prod, ap->wasdel, initial_user_data,
> + &ap->blkno, &ap->length);
> + } else {
> + error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
> + prod, ap->wasdel, initial_user_data,
> + &rtlocked, &ap->blkno, &ap->length);
> + }
The xfs_has_rtgroups() check is unnecessary. The iterator in
xfs_rtallocate_rtgs() will do the right thing for the
!xfs_has_rtgroups() case - it'll set start_rgno = 0 and break out
after a single call to xfs_rtallocate_rtg() with rgno = 0.
Another thing that probably should be done here is push all the
constant value calculations a couple of functions down the stack to
where they are used. Then we only need to pass two parameters down
through the rg iterator here, not 11...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 4/6] xfs: persist quota flags with metadir
2024-08-23 0:28 ` [PATCH 4/6] xfs: persist quota flags with metadir Darrick J. Wong
2024-08-23 5:54 ` Christoph Hellwig
@ 2024-08-26 9:42 ` Dave Chinner
2024-08-26 18:15 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 9:42 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:28:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> It's annoying that one has to keep reminding XFS about what quota
> options it should mount with, since the quota flags recording the
> previous state are sitting right there in the primary superblock. Even
> more strangely, there exists a noquota option to disable quotas
> completely, so it's odder still that providing no options is the same as
> noquota.
>
> Starting with metadir, let's change the behavior so that if the user
> does not specify any quota-related mount options at all, the ondisk
> quota flags will be used to bring up quota. In other words, the
> filesystem will mount in the same state and with the same functionality
> as it had during the last mount.
This means the only way to switch quota off completely with this
functionality is to explicitly unmount the filesystem and then mount
it again with the "-o noquota" option instead of mounting it again
without any quota options.
If so, this will need clear documentation in various man pages
because users will not expect this change of quota admin behaviour
caused by enabling some other unrelated functionality (like
rtgroups).....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-23 0:29 ` [PATCH 5/6] xfs: update sb field checks when metadir is turned on Darrick J. Wong
2024-08-23 5:55 ` Christoph Hellwig
@ 2024-08-26 9:52 ` Dave Chinner
2024-08-26 18:07 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-26 9:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Thu, Aug 22, 2024 at 05:29:15PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When metadir is enabled, we want to check the two new rtgroups fields,
> and we don't want to check the old inumbers that are now in the metadir.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/scrub/agheader.c | 36 ++++++++++++++++++++++++------------
> 1 file changed, 24 insertions(+), 12 deletions(-)
>
>
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index cad997f38a424..0d22d70950a5c 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -147,14 +147,14 @@ xchk_superblock(
> if (xfs_has_metadir(sc->mp)) {
> if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
> xchk_block_set_preen(sc, bp);
> + } else {
> + if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> + xchk_block_set_preen(sc, bp);
> +
> + if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> + xchk_block_set_preen(sc, bp);
> }
>
> - if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> - xchk_block_set_preen(sc, bp);
> -
> - if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> - xchk_block_set_preen(sc, bp);
> -
If metadir is enabled, then shouldn't sb->sb_rbmino/sb_rsumino both
be NULLFSINO to indicate they aren't valid?
Given the rt inodes should have a well defined value even when
metadir is enabled, I would say the current code, which validates
that the values are consistent with the primary across all secondary
superblocks, is correct and this change is unnecessary....
> @@ -229,11 +229,13 @@ xchk_superblock(
> * sb_icount, sb_ifree, sb_fdblocks, sb_frexents
> */
>
> - if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> - xchk_block_set_preen(sc, bp);
> + if (!xfs_has_metadir(mp)) {
> + if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> + xchk_block_set_preen(sc, bp);
>
> - if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> - xchk_block_set_preen(sc, bp);
> + if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> + xchk_block_set_preen(sc, bp);
> + }
Same - if metadir is in use and quota inodes are in the metadir,
then the superblock quota inodes should be NULLFSINO....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special
2024-08-26 0:41 ` Dave Chinner
@ 2024-08-26 17:33 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 17:33 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 10:41:18AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:04:14PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Metadata inodes are private files and therefore cannot be exposed to
> > userspace. This means no bulkstat, no open-by-handle, no linking them
> > into the directory tree, and no feeding them to LSMs. As such, we mark
> > them S_PRIVATE, which stops all that.
>
> Can you merge this back up into the initial iget support code?
>
> >
> > While we're at it, put them in a separate lockdep class so that it won't
> > get confused by "recursive" i_rwsem locking such as what happens when we
> > write to a rt file and need to allocate from the rt bitmap file. The
> > static function that we use to do this will be exported in the rtgroups
> > patchset.
>
> Stale commit message? There's nothing of the sort in this patch....
Yeah, sorry. Previously there were separate lockdep classes for metadir
directories and files each, but hch and I decided that each consumer of
a metadata file should set its own class accordingly, and that the
directories could continue using xfs_nondir_ilock_class as the only
code that uses them is either mount time setup code or repair.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/scrub/tempfile.c | 8 ++++++++
> > fs/xfs/xfs_iops.c | 15 ++++++++++++++-
> > 2 files changed, 22 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c
> > index 177f922acfaf1..3c5a1d77fefae 100644
> > --- a/fs/xfs/scrub/tempfile.c
> > +++ b/fs/xfs/scrub/tempfile.c
> > @@ -844,6 +844,14 @@ xrep_is_tempfile(
> > const struct xfs_inode *ip)
> > {
> > const struct inode *inode = &ip->i_vnode;
> > + struct xfs_mount *mp = ip->i_mount;
> > +
> > + /*
> > + * Files in the metadata directory tree also have S_PRIVATE set and
> > + * IOP_XATTR unset, so we must distinguish them separately.
> > + */
> > + if (xfs_has_metadir(mp) && (ip->i_diflags2 & XFS_DIFLAG2_METADATA))
> > + return false;
>
> Why do you need to check both xfs_has_metadir() and the inode flag
> here? The latter should only be set if the former is set, yes?
> If it's the other way around, then we have an on-disk corruption...
Probably just stale code that's been sitting around for a while.
But yes, this could all be:
if (xfs_is_metadir_inode(ip))
return false;
since the inode verifier should have already caught this.
> > if (IS_PRIVATE(inode) && !(inode->i_opflags & IOP_XATTR))
> > return true;
>
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 1cdc8034f54d9..c1686163299a0 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -42,7 +42,9 @@
> > * held. For regular files, the lock order is the other way around - the
> > * mmap_lock is taken during the page fault, and then we lock the ilock to do
> > * block mapping. Hence we need a different class for the directory ilock so
> > - * that lockdep can tell them apart.
> > + * that lockdep can tell them apart. Directories in the metadata directory
> > + * tree get a separate class so that lockdep reports will warn us if someone
> > + * ever tries to lock regular directories after locking metadata directories.
> > */
> > static struct lock_class_key xfs_nondir_ilock_class;
> > static struct lock_class_key xfs_dir_ilock_class;
> > @@ -1299,6 +1301,7 @@ xfs_setup_inode(
> > {
> > struct inode *inode = &ip->i_vnode;
> > gfp_t gfp_mask;
> > + bool is_meta = xfs_is_metadata_inode(ip);
> >
> > inode->i_ino = ip->i_ino;
> > inode->i_state |= I_NEW;
> > @@ -1310,6 +1313,16 @@ xfs_setup_inode(
> > i_size_write(inode, ip->i_disk_size);
> > xfs_diflags_to_iflags(ip, true);
> >
> > + /*
> > + * Mark our metadata files as private so that LSMs and the ACL code
> > + * don't try to add their own metadata or reason about these files,
> > + * and users cannot ever obtain file handles to them.
> > + */
> > + if (is_meta) {
> > + inode->i_flags |= S_PRIVATE;
> > + inode->i_opflags &= ~IOP_XATTR;
> > + }
>
> No need for a temporary variable here.
<nod>
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 11/26] xfs: don't count metadata directory files to quota
2024-08-26 0:47 ` Dave Chinner
@ 2024-08-26 17:57 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 17:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 10:47:41AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:05:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Files in the metadata directory tree are internal to the filesystem.
> > Don't count the inodes or the blocks they use in the root dquot because
> > users do not need to know about their resource usage. This will also
> > quiet down complaints about dquot usage not matching du output.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_dquot.c | 1 +
> > fs/xfs/xfs_qm.c | 11 +++++++++++
> > fs/xfs/xfs_quota.h | 5 +++++
> > fs/xfs/xfs_trans_dquot.c | 6 ++++++
> > 4 files changed, 23 insertions(+)
> >
> >
> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index c1b211c260a9d..3bf47458c517a 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -983,6 +983,7 @@ xfs_qm_dqget_inode(
> >
> > xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> > ASSERT(xfs_inode_dquot(ip, type) == NULL);
> > + ASSERT(!xfs_is_metadir_inode(ip));
> >
> > id = xfs_qm_id_for_quotatype(ip, type);
> >
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index d0674d84af3ec..ec983cca9adae 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -304,6 +304,8 @@ xfs_qm_need_dqattach(
> > return false;
> > if (xfs_is_quota_inode(&mp->m_sb, ip->i_ino))
> > return false;
> > + if (xfs_is_metadir_inode(ip))
> > + return false;
> > return true;
> > }
> >
> > @@ -326,6 +328,7 @@ xfs_qm_dqattach_locked(
> > return 0;
> >
> > xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> > + ASSERT(!xfs_is_metadir_inode(ip));
> >
> > if (XFS_IS_UQUOTA_ON(mp) && !ip->i_udquot) {
> > error = xfs_qm_dqattach_one(ip, XFS_DQTYPE_USER,
> > @@ -1204,6 +1207,10 @@ xfs_qm_dqusage_adjust(
> > }
> > }
> >
> > + /* Metadata directory files are not accounted to user-visible quotas. */
> > + if (xfs_is_metadir_inode(ip))
> > + goto error0;
> > +
>
> Hmmmm. I'm starting to think that xfs_iget() should not return
> metadata inodes unless a new XFS_IGET_METAINODE flag is set.
>
> That would replace all these post xfs_iget() checks with a single
> check in xfs_iget(), and then xfs_trans_metafile_iget() is the only
> place that sets this specific flag.
>
> That means stuff like VFS lookups, bulkstat, quotacheck, and
> filehandle lookups will never return metadata inodes and we don't
> need to add special checks all over for them...
I think doing so is likely an overall improvement for the codebase, but
it will complicate life for the directory and parent pointer scrubbers
because they can be called with the ino/gen of files in the metadata
directory tree, and there's a special bulkstat mode for xfs_scrub
wherein one can get the ino/gen of metadata directory tree files.
Anyway, I'll think further about this as a separate cleanup.
Note that the checks in this particular patch exist so that we don't
have to sprinkle them all over the bmap code so that the rtbitmap and
summary files can be excluded from quota accounting.
--D
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-26 9:52 ` Dave Chinner
@ 2024-08-26 18:07 ` Darrick J. Wong
2024-08-27 2:16 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 18:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 07:52:43PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:29:15PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > When metadir is enabled, we want to check the two new rtgroups fields,
> > and we don't want to check the old inumbers that are now in the metadir.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/scrub/agheader.c | 36 ++++++++++++++++++++++++------------
> > 1 file changed, 24 insertions(+), 12 deletions(-)
> >
> >
> > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > index cad997f38a424..0d22d70950a5c 100644
> > --- a/fs/xfs/scrub/agheader.c
> > +++ b/fs/xfs/scrub/agheader.c
> > @@ -147,14 +147,14 @@ xchk_superblock(
> > if (xfs_has_metadir(sc->mp)) {
> > if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
> > xchk_block_set_preen(sc, bp);
> > + } else {
> > + if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > + xchk_block_set_preen(sc, bp);
> > +
> > + if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > + xchk_block_set_preen(sc, bp);
> > }
> >
> > - if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > - xchk_block_set_preen(sc, bp);
> > -
> > - if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > - xchk_block_set_preen(sc, bp);
> > -
>
> If metadir is enabled, then shouldn't sb->sb_rbmino/sb_rsumino both
> be NULLFSINO to indicate they aren't valid?
The ondisk sb values aren't defined anymore and we set the incore values
to NULLFSINO (and never write that back out) so there's not much to
check anymore. I guess we could check that they're all zero or
something, which is what mkfs writes out, though my intent here was to
leave them as undefined bits, figuring that if we ever want to reuse
those fields we're going to define a new incompat bit anyway.
OTOH now would be the time to define what the field contents are
supposed to be -- zero or NULLFSINO?
> Given the rt inodes should have a well defined value even when
> metadir is enabled, I would say the current code, which validates
> that the values are consistent with the primary across all secondary
> superblocks, is correct and this change is unnecessary....
>
>
> > @@ -229,11 +229,13 @@ xchk_superblock(
> > * sb_icount, sb_ifree, sb_fdblocks, sb_frexents
> > */
> >
> > - if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > - xchk_block_set_preen(sc, bp);
> > + if (!xfs_has_metadir(mp)) {
> > + if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > + xchk_block_set_preen(sc, bp);
> >
> > - if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > - xchk_block_set_preen(sc, bp);
> > + if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > + xchk_block_set_preen(sc, bp);
> > + }
>
> Same - if metadir is in use and quota inodes are in the metadir,
> then the superblock quota inodes should be NULLFSINO....
Ok, I'll go with NULLFSINO ondisk and in memory.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 4/6] xfs: persist quota flags with metadir
2024-08-26 9:42 ` Dave Chinner
@ 2024-08-26 18:15 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 18:15 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 07:42:48PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:28:59PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > It's annoying that one has to keep reminding XFS about what quota
> > options it should mount with, since the quota flags recording the
> > previous state are sitting right there in the primary superblock. Even
> > more strangely, there exists a noquota option to disable quotas
> > completely, so it's odder still that providing no options is the same as
> > noquota.
> >
> > Starting with metadir, let's change the behavior so that if the user
> > does not specify any quota-related mount options at all, the ondisk
> > quota flags will be used to bring up quota. In other words, the
> > filesystem will mount in the same state and with the same functionality
> > as it had during the last mount.
>
> This means the only way to switch quota off completely with this
> functionality is to explicitly unmount the filesystem and then mount
> it again with the "-o noquota" option instead of mounting it again
> without any quota options.
>
> If so, this will need clear documentation in various man pages
> because users will not expect this change of quota admin behaviour
> caused by enabling some other unrelated functionality (like
> rtgroups).....
Yeah, manpage updates are in progress.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-26 2:06 ` Dave Chinner
@ 2024-08-26 18:27 ` Darrick J. Wong
2024-08-27 1:29 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 18:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 12:06:58PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:20:07PM -0700, Darrick J. Wong wrote:
> > From: Christoph Hellwig <hch@lst.de>
> >
> > Split the check that the rtsummary fits into the log into a separate
> > helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
> > geometry.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > [djwong: avoid division for the 0-rtx growfs check]
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
> > 1 file changed, 29 insertions(+), 14 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > index 61231b1dc4b79..78a3879ad6193 100644
> > --- a/fs/xfs/xfs_rtalloc.c
> > +++ b/fs/xfs/xfs_rtalloc.c
> > @@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
> > return error;
> > }
> >
> > +static int
> > +xfs_growfs_check_rtgeom(
> > + const struct xfs_mount *mp,
> > + xfs_rfsblock_t rblocks,
> > + xfs_extlen_t rextsize)
> > +{
> > + struct xfs_mount *nmp;
> > + int error = 0;
> > +
> > + nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
> > + if (!nmp)
> > + return -ENOMEM;
> > +
> > + /*
> > + * New summary size can't be more than half the size of the log. This
> > + * prevents us from getting a log overflow, since we'll log basically
> > + * the whole summary file at once.
> > + */
> > + if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
> > + error = -EINVAL;
>
> FWIW, the new size needs to be smaller than that, because the "half
> the log size" must to include all the log metadata needed to
> encapsulate that object. The grwofs transaction also logs inodes and
> the superblock, so that also takes away from the maximum size of
> the summary file....
<shrug> It's the same logic as what's there now, and there haven't been
any bug reports, have there? Though I suppose that's just a reduction
of what? One block for the rtbitmap, and (conservatively) two inodes
and a superblock?
n = nmp->m_rsumblocks + 1 + howmany(inodesize * 2, blocksize) + 1;
if (n > (logblocks / 2))
return -EINVAL;
--D
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-26 1:41 ` Dave Chinner
@ 2024-08-26 18:37 ` Darrick J. Wong
2024-08-27 1:05 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 18:37 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 11:41:19AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:18:18PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create the necessary per-rtgroup infrastructure that we need to load
> > metadata inodes into memory.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/libxfs/xfs_rtgroup.c | 182 +++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/libxfs/xfs_rtgroup.h | 28 +++++++
> > fs/xfs/xfs_mount.h | 1
> > fs/xfs/xfs_rtalloc.c | 48 +++++++++++
> > 4 files changed, 258 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > index ae6d67c673b1a..50e4a56d749f0 100644
> > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > @@ -30,6 +30,8 @@
> > #include "xfs_icache.h"
> > #include "xfs_rtgroup.h"
> > #include "xfs_rtbitmap.h"
> > +#include "xfs_metafile.h"
> > +#include "xfs_metadir.h"
> >
> > /*
> > * Passive reference counting access wrappers to the rtgroup structures. If
> > @@ -295,3 +297,183 @@ xfs_rtginode_lockdep_setup(
> > #else
> > #define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
> > #endif /* CONFIG_PROVE_LOCKING */
> > +
> > +struct xfs_rtginode_ops {
> > + const char *name; /* short name */
> > +
> > + enum xfs_metafile_type metafile_type;
> > +
> > + /* Does the fs have this feature? */
> > + bool (*enabled)(struct xfs_mount *mp);
> > +
> > + /* Create this rtgroup metadata inode and initialize it. */
> > + int (*create)(struct xfs_rtgroup *rtg,
> > + struct xfs_inode *ip,
> > + struct xfs_trans *tp,
> > + bool init);
> > +};
>
> What's all this for?
>
> AFAICT, loading the inodes into the rtgs requires a call to
> xfs_metadir_load() when initialising the rtg (either at mount or
> lazily on the first access to the rtg). Hence I'm not really sure
> what this complexity is needed for, and the commit message is not
> very informative....
Yes, the creation and mkdir code in here is really to support growfs,
mkfs, and repair. How about I change the commit message to:
"Create the necessary per-rtgroup infrastructure that we need to load
metadata inodes into memory and to create directory trees on the fly.
Loading is needed by the mounting process. Creation is needed by
growfs, mkfs, and repair."
> > +static const struct xfs_rtginode_ops xfs_rtginode_ops[XFS_RTGI_MAX] = {
> > +};
> > +
> > +/* Return the shortname of this rtgroup inode. */
> > +const char *
> > +xfs_rtginode_name(
> > + enum xfs_rtg_inodes type)
> > +{
> > + return xfs_rtginode_ops[type].name;
> > +}
> > +
> > +/* Should this rtgroup inode be present? */
> > +bool
> > +xfs_rtginode_enabled(
> > + struct xfs_rtgroup *rtg,
> > + enum xfs_rtg_inodes type)
> > +{
> > + const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
> > +
> > + if (!ops->enabled)
> > + return true;
> > + return ops->enabled(rtg->rtg_mount);
> > +}
> > +
> > +/* Load an existing rtgroup inode into the rtgroup structure. */
> > +int
> > +xfs_rtginode_load(
> > + struct xfs_rtgroup *rtg,
> > + enum xfs_rtg_inodes type,
> > + struct xfs_trans *tp)
> > +{
> > + struct xfs_mount *mp = tp->t_mountp;
> > + const char *path;
> > + struct xfs_inode *ip;
> > + const struct xfs_rtginode_ops *ops = &xfs_rtginode_ops[type];
> > + int error;
> > +
> > + if (!xfs_rtginode_enabled(rtg, type))
> > + return 0;
> > +
> > + if (!mp->m_rtdirip)
> > + return -EFSCORRUPTED;
> > +
> > + path = xfs_rtginode_path(rtg->rtg_rgno, type);
> > + if (!path)
> > + return -ENOMEM;
> > + error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
> > + &ip);
> > + kfree(path);
> > +
> > + if (error)
> > + return error;
> > +
> > + if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
> > + ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
> > + xfs_irele(ip);
> > + return -EFSCORRUPTED;
> > + }
>
> We don't support LOCAL format for any type of regular file inodes,
> so I'm a little confiused as to why this wouldn't be caught by the
> verifier on inode read? i.e. What problem is this trying to catch,
> and why doesn't the inode verifier catch it for us?
This is really more of a placeholder for more refactorings coming down
the line for the rtrmap patchset, which will create a new
XFS_DINODE_FMT_RMAP. At that point we'll need to check that an inode
we're loading as the rmap btree inode actually has that format set.
> > + if (XFS_IS_CORRUPT(mp, ip->i_projid != rtg->rtg_rgno)) {
> > + xfs_irele(ip);
> > + return -EFSCORRUPTED;
> > + }
> > +
> > + xfs_rtginode_lockdep_setup(ip, rtg->rtg_rgno, type);
> > + rtg->rtg_inodes[type] = ip;
> > + return 0;
> > +}
> > +
> > +/* Release an rtgroup metadata inode. */
> > +void
> > +xfs_rtginode_irele(
> > + struct xfs_inode **ipp)
> > +{
> > + if (*ipp)
> > + xfs_irele(*ipp);
> > + *ipp = NULL;
> > +}
> > +
> > +/* Add a metadata inode for a realtime rmap btree. */
> > +int
> > +xfs_rtginode_create(
> > + struct xfs_rtgroup *rtg,
> > + enum xfs_rtg_inodes type,
> > + bool init)
>
> This doesn't seem to belong in this patchset...
>
> ....
>
> > +/* Create the parent directory for all rtgroup inodes and load it. */
> > +int
> > +xfs_rtginode_mkdir_parent(
> > + struct xfs_mount *mp)
>
> Or this...
>
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-25 23:56 ` Dave Chinner
@ 2024-08-26 19:14 ` Darrick J. Wong
2024-08-27 0:57 ` Dave Chinner
2024-08-27 4:27 ` Christoph Hellwig
0 siblings, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 19:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 09:56:08AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:17:31PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Create an incore object that will contain information about a realtime
> > allocation group. This will eventually enable us to shard the realtime
> > section in a similar manner to how we shard the data section, but for
> > now just a single object for the entire RT subvolume is created.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/Makefile | 1
> > fs/xfs/libxfs/xfs_format.h | 3 +
> > fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/libxfs/xfs_sb.c | 7 +
> > fs/xfs/libxfs/xfs_types.h | 4 +
> > fs/xfs/xfs_log_recover.c | 20 ++++
> > fs/xfs/xfs_mount.c | 16 +++
> > fs/xfs/xfs_mount.h | 14 +++
> > fs/xfs/xfs_rtalloc.c | 6 +
> > fs/xfs/xfs_super.c | 1
> > fs/xfs/xfs_trace.c | 1
> > fs/xfs/xfs_trace.h | 38 ++++++++
> > 13 files changed, 517 insertions(+), 2 deletions(-)
> > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
> > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
>
> Ok, how is the global address space for real time extents laid out
> across rt groups? i.e. is it sparse similar to how fsbnos and inode
> numbers are created for the data device like so?
>
> fsbno = (agno << agblklog) | agbno
>
> Or is it something different? I can't find that defined anywhere in
> this patch, so I can't determine if the unit conversion code and
> validation is correct or not...
They're not sparse like fsbnos on the data device; they're laid end to
end. IOWs, it's a straight linear translation. If each rtgroup is 50
blocks long, then rtgroup 1 starts at (50 * blocksize).
This patch, FWIW, refactors the existing rt code so that a !rtgroups
filesystem is represented by one large "group", with xfs_rtxnum_t now
indexing rt extents within a group. Probably it should be renamed to
xfs_rgxnum_t.
Note that we haven't defined the rtgroup ondisk format yet, so I'll go
amend that patch to spell out the ondisk format of the brave new world.
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 4d8ca08cdd0ec..388b5cef48ca5 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -60,6 +60,7 @@ xfs-y += $(addprefix libxfs/, \
> > # xfs_rtbitmap is shared with libxfs
> > xfs-$(CONFIG_XFS_RT) += $(addprefix libxfs/, \
> > xfs_rtbitmap.o \
> > + xfs_rtgroup.o \
> > )
> >
> > # highlevel code
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 16a7bc02aa5f5..fa5cfc8265d92 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -176,6 +176,9 @@ typedef struct xfs_sb {
> >
> > xfs_ino_t sb_metadirino; /* metadata directory tree root */
> >
> > + xfs_rgnumber_t sb_rgcount; /* number of realtime groups */
> > + xfs_rtxlen_t sb_rgextents; /* size of a realtime group in rtx */
>
> So min/max rtgroup size is defined by the sb_rextsize field? What
> redundant metadata do we end up with that allows us to validate
> the sb_rextsize field is still valid w.r.t. rtgroups geometry?
>
> Also, rtgroup lengths are defined by "rtx counts", but the
> definitions in the xfs_mount later on are "m_rtblklog" and
> "m_rgblocks" and we use xfs_rgblock_t and rgbno all over the place.
>
> Just from the context of this patch, it is somewhat confusing trying
> to work out what the difference is...
>
>
> > /* must be padded to 64 bit alignment */
> > } xfs_sb_t;
> >
> > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > new file mode 100644
> > index 0000000000000..2bad1ecb811eb
> > --- /dev/null
> > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > @@ -0,0 +1,196 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_bit.h"
> > +#include "xfs_sb.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_btree.h"
> > +#include "xfs_alloc_btree.h"
> > +#include "xfs_rmap_btree.h"
> > +#include "xfs_alloc.h"
> > +#include "xfs_ialloc.h"
> > +#include "xfs_rmap.h"
> > +#include "xfs_ag.h"
> > +#include "xfs_ag_resv.h"
> > +#include "xfs_health.h"
> > +#include "xfs_error.h"
> > +#include "xfs_bmap.h"
> > +#include "xfs_defer.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans.h"
> > +#include "xfs_trace.h"
> > +#include "xfs_inode.h"
> > +#include "xfs_icache.h"
> > +#include "xfs_rtgroup.h"
> > +#include "xfs_rtbitmap.h"
> > +
> > +/*
> > + * Passive reference counting access wrappers to the rtgroup structures. If
> > + * the rtgroup structure is to be freed, the freeing code is responsible for
> > + * cleaning up objects with passive references before freeing the structure.
> > + */
> > +struct xfs_rtgroup *
> > +xfs_rtgroup_get(
> > + struct xfs_mount *mp,
> > + xfs_rgnumber_t rgno)
> > +{
> > + struct xfs_rtgroup *rtg;
> > +
> > + rcu_read_lock();
> > + rtg = xa_load(&mp->m_rtgroups, rgno);
> > + if (rtg) {
> > + trace_xfs_rtgroup_get(rtg, _RET_IP_);
> > + ASSERT(atomic_read(&rtg->rtg_ref) >= 0);
> > + atomic_inc(&rtg->rtg_ref);
> > + }
> > + rcu_read_unlock();
> > + return rtg;
> > +}
> > +
> > +/* Get a passive reference to the given rtgroup. */
> > +struct xfs_rtgroup *
> > +xfs_rtgroup_hold(
> > + struct xfs_rtgroup *rtg)
> > +{
> > + ASSERT(atomic_read(&rtg->rtg_ref) > 0 ||
> > + atomic_read(&rtg->rtg_active_ref) > 0);
> > +
> > + trace_xfs_rtgroup_hold(rtg, _RET_IP_);
> > + atomic_inc(&rtg->rtg_ref);
> > + return rtg;
> > +}
> > +
> > +void
> > +xfs_rtgroup_put(
> > + struct xfs_rtgroup *rtg)
> > +{
> > + trace_xfs_rtgroup_put(rtg, _RET_IP_);
> > + ASSERT(atomic_read(&rtg->rtg_ref) > 0);
> > + atomic_dec(&rtg->rtg_ref);
> > +}
> > +
> > +/*
> > + * Active references for rtgroup structures. This is for short term access to
> > + * the rtgroup structures for walking trees or accessing state. If an rtgroup
> > + * is being shrunk or is offline, then this will fail to find that group and
> > + * return NULL instead.
> > + */
> > +struct xfs_rtgroup *
> > +xfs_rtgroup_grab(
> > + struct xfs_mount *mp,
> > + xfs_agnumber_t agno)
> > +{
> > + struct xfs_rtgroup *rtg;
> > +
> > + rcu_read_lock();
> > + rtg = xa_load(&mp->m_rtgroups, agno);
> > + if (rtg) {
> > + trace_xfs_rtgroup_grab(rtg, _RET_IP_);
> > + if (!atomic_inc_not_zero(&rtg->rtg_active_ref))
> > + rtg = NULL;
> > + }
> > + rcu_read_unlock();
> > + return rtg;
> > +}
> > +
> > +void
> > +xfs_rtgroup_rele(
> > + struct xfs_rtgroup *rtg)
> > +{
> > + trace_xfs_rtgroup_rele(rtg, _RET_IP_);
> > + if (atomic_dec_and_test(&rtg->rtg_active_ref))
> > + wake_up(&rtg->rtg_active_wq);
> > +}
>
> This is all duplicates of the xfs_perag code. Can you put together a
> patchset to abstract this into a "xfs_group" and embed them in both
> the perag and and rtgroup structures?
>
> That way we only need one set of lookup and iterator infrastructure,
> and it will work for both data and rt groups...
How will that work with perags still using the radix tree and rtgroups
using the xarray? Yes, we should move the perags to use the xarray too
(and indeed hch already has a series on the list to do that), but now
is really not the time to do that because I don't want to frontload a
bunch more core changes onto this already huge patchset.
> > +
> > +/* Compute the number of rt extents in this realtime group. */
> > +xfs_rtxnum_t
> > +xfs_rtgroup_extents(
> > + struct xfs_mount *mp,
> > + xfs_rgnumber_t rgno)
> > +{
> > + xfs_rgnumber_t rgcount = mp->m_sb.sb_rgcount;
> > +
> > + ASSERT(rgno < rgcount);
> > + if (rgno == rgcount - 1)
> > + return mp->m_sb.sb_rextents -
> > + ((xfs_rtxnum_t)rgno * mp->m_sb.sb_rgextents);
>
> Urk. So this relies on a non-rtgroup filesystem doing a
> multiplication by zero of a field that the on-disk format does not
> understand to get the right result. I think this is copying a bad
> pattern we've been slowly trying to remove from the normal
> allocation group code.
>
> > +
> > + ASSERT(xfs_has_rtgroups(mp));
> > + return mp->m_sb.sb_rgextents;
> > +}
>
> We already embed the length of the rtgroup in the rtgroup structure.
> This should be looking up the rtgroup (or being passed the rtgroup
> the caller already has) and doing the right thing. i.e.
>
> if (!rtg || !xfs_has_rtgroups(rtg->rtg_mount))
> return mp->m_sb.sb_rextents;
> return rtg->rtg_extents;
xfs_rtgroup_extents is the function that we use to set rtg->rtg_extents.
> > diff --git a/fs/xfs/libxfs/xfs_rtgroup.h b/fs/xfs/libxfs/xfs_rtgroup.h
> > new file mode 100644
> > index 0000000000000..2c09ecfc50328
> > --- /dev/null
> > +++ b/fs/xfs/libxfs/xfs_rtgroup.h
> > @@ -0,0 +1,212 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > + * Copyright (c) 2022-2024 Oracle. All Rights Reserved.
> > + * Author: Darrick J. Wong <djwong@kernel.org>
> > + */
> > +#ifndef __LIBXFS_RTGROUP_H
> > +#define __LIBXFS_RTGROUP_H 1
> > +
> > +struct xfs_mount;
> > +struct xfs_trans;
> > +
> > +/*
> > + * Realtime group incore structure, similar to the per-AG structure.
> > + */
> > +struct xfs_rtgroup {
> > + struct xfs_mount *rtg_mount;
> > + xfs_rgnumber_t rtg_rgno;
> > + atomic_t rtg_ref; /* passive reference count */
> > + atomic_t rtg_active_ref; /* active reference count */
> > + wait_queue_head_t rtg_active_wq;/* woken when active_ref falls to zero */
>
> Yeah, that's all common with xfs_perag....
>
> ....
> > +/*
> > + * rt group iteration APIs
> > + */
> > +static inline struct xfs_rtgroup *
> > +xfs_rtgroup_next(
> > + struct xfs_rtgroup *rtg,
> > + xfs_rgnumber_t *rgno,
> > + xfs_rgnumber_t end_rgno)
> > +{
> > + struct xfs_mount *mp = rtg->rtg_mount;
> > +
> > + *rgno = rtg->rtg_rgno + 1;
> > + xfs_rtgroup_rele(rtg);
> > + if (*rgno > end_rgno)
> > + return NULL;
> > + return xfs_rtgroup_grab(mp, *rgno);
> > +}
> > +
> > +#define for_each_rtgroup_range(mp, rgno, end_rgno, rtg) \
> > + for ((rtg) = xfs_rtgroup_grab((mp), (rgno)); \
> > + (rtg) != NULL; \
> > + (rtg) = xfs_rtgroup_next((rtg), &(rgno), (end_rgno)))
> > +
> > +#define for_each_rtgroup_from(mp, rgno, rtg) \
> > + for_each_rtgroup_range((mp), (rgno), (mp)->m_sb.sb_rgcount - 1, (rtg))
> > +
> > +
> > +#define for_each_rtgroup(mp, rgno, rtg) \
> > + (rgno) = 0; \
> > + for_each_rtgroup_from((mp), (rgno), (rtg))
>
> Yup, that's all common with xfs_perag iteration, too. Can you put
> together a patchset to unify these, please?
Here's the first part of that, to convert perags to xarrays...
https://lore.kernel.org/linux-xfs/20240821063901.650776-1-hch@lst.de/
> > +static inline bool
> > +xfs_verify_rgbno(
> > + struct xfs_rtgroup *rtg,
> > + xfs_rgblock_t rgbno)
>
> Ok, what's the difference between and xfs_rgblock_t and a "rtx"?
>
> OH.... The penny just dropped - it's another "single letter
> difference that's really, really hard to spot" problem. You've
> defined "xfs_r*g*block_t" like an a*g*bno, but we have
> xfs_r*t*block_t for the global 64bit block number instead of a
> xfs_fsbno_t.
>
> We just had a bug caused by exactly this sort of confusion with a
> patch that mixed xfs_[f]inobt changes together and one of the
> conversions was incorrect. Nobody spotted the single incorrect
> letter in the bigger patch, and I can see -exactly- the same sort of
> confusion happening with rtblock vs rgblock causing implicit 32/64
> bit integer promotion bugs...
>
> > +{
> > + struct xfs_mount *mp = rtg->rtg_mount;
> > +
> > + if (rgbno >= rtg->rtg_extents * mp->m_sb.sb_rextsize)
> > + return false;
>
> Why isn't the max valid "rgbno" stored in the rtgroup instead of
> having to multiply the extent count by extent size every time we
> have to verify a rgbno? (i.e. same as pag->block_count).
>
> We know from the agbno verification this will be a -very- hot path,
> and so precalculating all the constants and storing them in the rtg
> should be done right from the start here.
>
> > + if (xfs_has_rtsb(mp) && rtg->rtg_rgno == 0 &&
> > + rgbno < mp->m_sb.sb_rextsize)
> > + return false;
>
> Same here - this value is stored in pag->min_block...
>
> > + return true;
> > +}
>
> And then, if we put the max_bno and min_bno in the generic
> "xfs_group" structure, we suddenly have a generic "group bno"
> verification mechanism that is independent of whether the group is a data or rt group:
>
> static inline bool
> xfs_verify_gbno(
> struct xfs_group *g,
> xfs_gblock_t gbno)
> {
> struct xfs_mount *mp = g->g_mount;
>
> if (gbno >= g->block_count)
> return false;
> if (gbno < g->min_block)
> return false;
> return true;
> }
>
> And the rest of these functions fall out the same way....
>
>
> > +static inline xfs_rtblock_t
> > +xfs_rgno_start_rtb(
> > + struct xfs_mount *mp,
> > + xfs_rgnumber_t rgno)
> > +{
> > + if (mp->m_rgblklog >= 0)
> > + return ((xfs_rtblock_t)rgno << mp->m_rgblklog);
> > + return ((xfs_rtblock_t)rgno * mp->m_rgblocks);
> > +}
>
> Where does mp->m_rgblklog come from? That wasn't added to the
> on-disk superblock structure and it is always initialised to zero
> in this patch.
>
> When will m_rgblklog be zero and when will it be non-zero? If it's
As I mentioned before, this patch merely ports non-rtg filesystems to
use the rtgroup structure. m_rgblklog will be set to nonzero values
when we get to defining the ondisk rtgroup structure.
But, to cut ahead here, m_rgblklog will be set to a non-negative value
if the rtgroup size (in blocks) is a power of two. Then these unit
conversion functions can use shifts instead of expensive multiplication
and divisions. The same goes for rt extent to {fs,rt}block conversions.
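To make the shift-versus-divide trade concrete, here is a hedged userspace sketch of that conversion pattern. The struct and field names below are illustrative stand-ins, not the real struct xfs_mount fields:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the m_rgblklog/m_rgblocks pair discussed above. */
struct rtg_geom {
	int		rgblklog;	/* log2(blocks per rtgroup), -1 if not a power of 2 */
	uint32_t	rgblocks;	/* blocks per rtgroup */
};

/* Group number for a global rt block number: shift on the fast path. */
uint32_t rtb_to_rgno(const struct rtg_geom *g, uint64_t rtbno)
{
	if (g->rgblklog >= 0)
		return (uint32_t)(rtbno >> g->rgblklog);
	return (uint32_t)(rtbno / g->rgblocks);	/* slow path: 64-bit division */
}

/* Block offset within the group: mask on the fast path. */
uint64_t rtb_to_rgbno(const struct rtg_geom *g, uint64_t rtbno)
{
	if (g->rgblklog >= 0)
		return rtbno & ((1ULL << g->rgblklog) - 1);
	return rtbno % g->rgblocks;
}
```

The kernel code caches the precomputed mask (m_rgblkmask) rather than rebuilding it from the shift each time, but the branch structure is the same as in the patch quoted below.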
> only going to be zero for existing non-rtg realtime systems,
> then this code makes little sense (again, relying on multiplication
> by zero to get the right result). If it's not always used for
> rtg enabled filesystems, then the reason for that has not been
> explained and I can't work out why this would ever need to be done.
https://lore.kernel.org/linux-xfs/172437088534.60592.14072619855969226822.stgit@frogsfrogsfrogs/
>
> > +static inline xfs_rtblock_t
> > +xfs_rgbno_to_rtb(
> > + struct xfs_mount *mp,
> > + xfs_rgnumber_t rgno,
> > + xfs_rgblock_t rgbno)
> > +{
> > + return xfs_rgno_start_rtb(mp, rgno) + rgbno;
> > +}
> > +
> > +static inline xfs_rgnumber_t
> > +xfs_rtb_to_rgno(
> > + struct xfs_mount *mp,
> > + xfs_rtblock_t rtbno)
> > +{
> > + if (!xfs_has_rtgroups(mp))
> > + return 0;
> > +
> > + if (mp->m_rgblklog >= 0)
> > + return rtbno >> mp->m_rgblklog;
> > +
> > + return div_u64(rtbno, mp->m_rgblocks);
> > +}
>
> Ah, now I'm really confused, because m_rgblklog is completely
> bypassed for legacy rt filesystems.
>
> And I just realised, this "if (mp->m_rgblklog >= 0)" implies that
> m_rgblklog can have negative values and there are no comments anywhere
> about why that can happen and what would trigger it.
-1 is the magic value for "the rtgroup size is not a power of two, so
you have to use slow integer division and multiplication".
> We validate sb_agblklog during the superblock verifier, and so once
> the filesystem is mounted we never, ever need to check whether
> sb_agblklog is in range. Why is the rtblklog being handled so
> differently here?
This all could be documented better across the patches. Originally the
incore and ondisk patches were adjacent and it was at least somewhat
easier to figure these things out, but hch really wanted to shard the
rtbitmaps, so now the patchset has grown even larger and harder to
understand if you only read one patch at a time.
> > +
> > +static inline uint64_t
> > +__xfs_rtb_to_rgbno(
> > + struct xfs_mount *mp,
> > + xfs_rtblock_t rtbno)
> > +{
> > + uint32_t rem;
> > +
> > + if (!xfs_has_rtgroups(mp))
> > + return rtbno;
> > +
> > + if (mp->m_rgblklog >= 0)
> > + return rtbno & mp->m_rgblkmask;
> > +
> > + div_u64_rem(rtbno, mp->m_rgblocks, &rem);
> > + return rem;
> > +}
>
> Why is this function returning a uint64_t - a xfs_rgblock_t is only
> a 32 bit type...
group 0 on a !rtg filesystem can be 64-bits in block/rt count. This is
a /very/ annoying pain point -- if you actually created such a
filesystem it would never work because the rtsummary file would
be created undersized due to an integer overflow, but the verifiers
never checked any of that, and due to the same overflow the rtallocator
would search the wrong places and (eventually) fall back to a dumb
linear scan.
Soooooo this is an obnoxious usecase (broken large !rtg filesystems)
that we can't just drop, though I'm pretty sure there aren't any systems
in the wild.
> > +
> > +static inline xfs_rgblock_t
> > +xfs_rtb_to_rgbno(
> > + struct xfs_mount *mp,
> > + xfs_rtblock_t rtbno)
> > +{
> > + return __xfs_rtb_to_rgbno(mp, rtbno);
> > +}
> > +
> > +static inline xfs_daddr_t
> > +xfs_rtb_to_daddr(
> > + struct xfs_mount *mp,
> > + xfs_rtblock_t rtbno)
> > +{
> > + return rtbno << mp->m_blkbb_log;
> > +}
> > +
> > +static inline xfs_rtblock_t
> > +xfs_daddr_to_rtb(
> > + struct xfs_mount *mp,
> > + xfs_daddr_t daddr)
> > +{
> > + return daddr >> mp->m_blkbb_log;
> > +}
>
> Ah. This code doesn't sparsify the xfs_rtblock_t address space for
> rtgroups. xfs_rtblock_t is still direct physical encoding of the
> location on disk.
Yes.
> I really think that needs to be changed to match how xfs_fsbno_t is
> a sparse encoding before these changes get merged. It shouldn't
> affect any of the other code in the patch set - the existing rt code
> has a rtgno of 0, so it will always be a direct physical encoding
> even when using a sparse xfs_rtblock_t address space.
>
> All that moving to a sparse encoding means is that the addresses
> stored in the BMBT are logical addresses rather than physical
> addresses. It should not affect any of the other code, just what
> ends up stored on disk for global 64-bit rt extent addresses...
>
> In doing this, I think we can greatly simplify all this group
> management stuff as most of the verification, type conversion and
> iteration infrastructure can then be shared between the existing perag
> and the new rtg infrastructure....
>
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index a8cd44d03ef64..1ce4b9eb16f47 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -9,10 +9,12 @@
> > typedef uint32_t prid_t; /* project ID */
> >
> > typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
> > +typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
>
> Is that right? The rtg length is 2^32 * rtextsize, and rtextsize can
> be 2^20 bytes:
>
> #define XFS_MAX_RTEXTSIZE (1024 * 1024 * 1024)
No, the maximum rtgroup length is 2^32-1 blocks.
> Hence for a 4kB fsbno filesystem, the actual maximum size of an rtg
> in filesystem blocks far exceeds what we can address with a 32 bit
> variable.
>
> If xfs_rgblock_t is actually indexing multi-fsbno rtextents, then it
> is an extent number index, not a "block" index. An extent number
> index won't overflow 32 bits (because the rtg has a max of 2^32 - 1
> rtextents)
>
> IOWs, shouldn't this be named something like:
>
> typedef uint32_t xfs_rgext_t; /* extent number in realtime group */
and again, we can't do that because we emulate !rtg filesystems with a
single "rtgroup" that can be more than 2^32 rtx long.
> > typedef uint32_t xfs_agino_t; /* inode # within allocation grp */
> > typedef uint32_t xfs_extlen_t; /* extent length in blocks */
> > typedef uint32_t xfs_rtxlen_t; /* file extent length in rtextents */
> > typedef uint32_t xfs_agnumber_t; /* allocation group number */
> > +typedef uint32_t xfs_rgnumber_t; /* realtime group number */
> > typedef uint64_t xfs_extnum_t; /* # of extents in a file */
> > typedef uint32_t xfs_aextnum_t; /* # extents in an attribute fork */
> > typedef int64_t xfs_fsize_t; /* bytes in a file */
> > @@ -53,7 +55,9 @@ typedef void * xfs_failaddr_t;
> > #define NULLFILEOFF ((xfs_fileoff_t)-1)
> >
> > #define NULLAGBLOCK ((xfs_agblock_t)-1)
> > +#define NULLRGBLOCK ((xfs_rgblock_t)-1)
> > #define NULLAGNUMBER ((xfs_agnumber_t)-1)
> > +#define NULLRGNUMBER ((xfs_rgnumber_t)-1)
>
> What's the maximum valid rtg number? We're not ever going to be
> supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
> this at and validate it at?
/me shrugs -- the smallest AG size on the data device is 16M, which
technically speaking means that one /could/ format 2^(63-24) groups,
or order 39.
Realistically with the maximum rtgroup size of 2^31 blocks, we probably
only need 2^(63 - (31 + 10)) = 2^22 rtgroups max on a 1k fsblock fs.
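As a sanity check on the arithmetic above (treating the limit as a 2^63-byte address space, which is what the exponents in the message imply), the two bounds can be recomputed directly; the names here are illustrative, not kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

enum {
	DEV_BYTES_LOG		= 63,	/* address-space limit used above */
	MIN_AG_BYTES_LOG	= 24,	/* 16M minimum AG size */
	MAX_RTG_BLOCKS_LOG	= 31,	/* 2^31 blocks per rtgroup */
	FSBLOCK_1K_LOG		= 10,	/* 1k filesystem blocks */
};

/* Maximum group count when each group occupies 2^group_bytes_log bytes. */
uint64_t max_groups(int group_bytes_log)
{
	return 1ULL << (DEV_BYTES_LOG - group_bytes_log);
}
```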
> > #define NULLCOMMITLSN ((xfs_lsn_t)-1)
> >
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 4423dd344239b..c627cde3bb1e0 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -28,6 +28,7 @@
> > #include "xfs_ag.h"
> > #include "xfs_quota.h"
> > #include "xfs_reflink.h"
> > +#include "xfs_rtgroup.h"
> >
> > #define BLK_AVG(blk1, blk2) ((blk1+blk2) >> 1)
> >
> > @@ -3346,6 +3347,7 @@ xlog_do_recover(
> > struct xfs_mount *mp = log->l_mp;
> > struct xfs_buf *bp = mp->m_sb_bp;
> > struct xfs_sb *sbp = &mp->m_sb;
> > + xfs_rgnumber_t old_rgcount = sbp->sb_rgcount;
> > int error;
> >
> > trace_xfs_log_recover(log, head_blk, tail_blk);
> > @@ -3399,6 +3401,24 @@ xlog_do_recover(
> > xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> > return error;
> > }
> > +
> > + if (sbp->sb_rgcount < old_rgcount) {
> > + xfs_warn(mp, "rgcount shrink not supported");
> > + return -EINVAL;
> > + }
> > + if (sbp->sb_rgcount > old_rgcount) {
> > + xfs_rgnumber_t rgno;
> > +
> > + for (rgno = old_rgcount; rgno < sbp->sb_rgcount; rgno++) {
> > + error = xfs_rtgroup_alloc(mp, rgno);
> > + if (error) {
> > + xfs_warn(mp,
> > + "Failed post-recovery rtgroup init: %d",
> > + error);
> > + return error;
> > + }
> > + }
> > + }
>
> Please factor this out into a separate function with all the other
> rtgroup init/teardown code. That means we don't have to care about
> how rtgrowfs functions in recovery code, similar to the
> xfs_initialize_perag() already in this function for handling
> recovery of data device growing...
>
> > mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> >
> > /* Normal transactions can now occur */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b0ea88acdb618..e1e849101cdd4 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -36,6 +36,7 @@
> > #include "xfs_ag.h"
> > #include "xfs_rtbitmap.h"
> > #include "xfs_metafile.h"
> > +#include "xfs_rtgroup.h"
> > #include "scrub/stats.h"
> >
> > static DEFINE_MUTEX(xfs_uuid_table_mutex);
> > @@ -664,6 +665,7 @@ xfs_mountfs(
> > struct xfs_ino_geometry *igeo = M_IGEO(mp);
> > uint quotamount = 0;
> > uint quotaflags = 0;
> > + xfs_rgnumber_t rgno;
> > int error = 0;
> >
> > xfs_sb_mount_common(mp, sbp);
> > @@ -830,10 +832,18 @@ xfs_mountfs(
> > goto out_free_dir;
> > }
> >
> > + for (rgno = 0; rgno < mp->m_sb.sb_rgcount; rgno++) {
> > + error = xfs_rtgroup_alloc(mp, rgno);
> > + if (error) {
> > + xfs_warn(mp, "Failed rtgroup init: %d", error);
> > + goto out_free_rtgroup;
> > + }
> > + }
>
> Same - factor this to a xfs_rtgroup_init() function located with the
> rest of the rtgroup infrastructure...
>
> > +
> > if (XFS_IS_CORRUPT(mp, !sbp->sb_logblocks)) {
> > xfs_warn(mp, "no log defined");
> > error = -EFSCORRUPTED;
> > - goto out_free_perag;
> > + goto out_free_rtgroup;
> > }
> >
> > error = xfs_inodegc_register_shrinker(mp);
> > @@ -1068,7 +1078,8 @@ xfs_mountfs(
> > if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp)
> > xfs_buftarg_drain(mp->m_logdev_targp);
> > xfs_buftarg_drain(mp->m_ddev_targp);
> > - out_free_perag:
> > + out_free_rtgroup:
> > + xfs_free_rtgroups(mp, rgno);
> > xfs_free_perag(mp);
> > out_free_dir:
> > xfs_da_unmount(mp);
> > @@ -1152,6 +1163,7 @@ xfs_unmountfs(
> > xfs_errortag_clearall(mp);
> > #endif
> > shrinker_free(mp->m_inodegc_shrinker);
> > + xfs_free_rtgroups(mp, mp->m_sb.sb_rgcount);
>
> ... like you've already for the cleanup side ;)
>
> ....
>
> > @@ -1166,6 +1169,9 @@ xfs_rtmount_inodes(
> > if (error)
> > goto out_rele_summary;
> >
> > + for_each_rtgroup(mp, rgno, rtg)
> > + rtg->rtg_extents = xfs_rtgroup_extents(mp, rtg->rtg_rgno);
> > +
>
> This also needs to be done after recovery has initialised new rtgs
> as a result of replaying a sb growfs modification, right?
>
> Which leads to the next question: if there are thousands of rtgs,
> this requires walking every rtg at mount time, right? We know that
> walking thousands of static structures at mount time is a
> scalability issue, so can we please avoid this if at all possible?
> i.e. do demand loading of per-rtg metadata when it is first required
> (like we do with agf/agi information) rather than doing it all at
> mount time...
Sounds like a reasonable optimization patch...
--D
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 17/26] xfs: support logging EFIs for realtime extents
2024-08-26 4:33 ` Dave Chinner
@ 2024-08-26 19:38 ` Darrick J. Wong
2024-08-27 1:36 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 19:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 02:33:08PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Teach the EFI mechanism how to free realtime extents. We're going to
> > need this to enforce proper ordering of operations when we enable
> > realtime rmap.
> >
> > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
> > ops for rt extents. This keeps the ondisk artifacts and processing code
> > completely separate between the rt and non-rt cases. Hopefully this
> > will make it easier to debug filesystem problems.
>
> Doesn't this now require busy extent tracking for rt extents that
> are being freed? i.e. they get marked as free with the EFD, but
> cannot be reallocated (or discarded) until the EFD is committed to
> disk.
>
> we don't allow user data allocation on the data device to reuse busy
> ranges because the freeing of the extent has not yet been committed
> to the journal. Because we use async transaction commits, that means
> we can return to userspace without even the EFI in the journal - it
> can still be in memory in the CIL. Hence we cannot allow userspace
> to reallocate that range and write to it, even though it is marked free in the
> in-memory metadata.
Ah, that's a good point -- in memory the bunmapi -> RTEFI -> RTEFD ->
rtalloc -> bmapi transactions succeed, userspace writes to the file
blocks, then the log goes down without completing /any/ of those
transactions, and now a read of the old file gets new contents.
> If userspace then does a write and then we crash without the
> original EFI on disk, then we've just violated metadata vs data
> update ordering because recovery will not replay the extent free nor
> the new allocation, yet the data in that extent will have been
> changed.
>
> Hence I think that if we are moving to intent based freeing of real
> time extents, we absolutely need to add support for busy extent
> tracking to realtime groups before we enable EFIs on realtime
> groups.....
Yep. As a fringe benefit, we'd be able to support issuing discards from
FITRIM without holding the rtbitmap lock, and -o discard on rt extents
too.
> Also ....
>
> > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add(
> >
> > trace_xfs_extent_free_defer(mp, xefi);
> >
> > + if (xfs_efi_is_realtime(xefi)) {
> > + xfs_rgnumber_t rgno;
> > +
> > + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock);
> > + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
> > +
> > + *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> > + &xfs_rtextent_free_defer_type);
> > + return;
> > + }
> > +
> > xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock);
> > if (xefi->xefi_agresv == XFS_AG_RESV_AGFL)
> > *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
>
> Hmmmm. Isn't this also missing the xfs_drain intent interlocks that
> allow online repair to wait until all the intents outstanding on a
> group complete?
Yep. I forgot about that.
> > @@ -687,6 +735,106 @@ const struct xfs_defer_op_type xfs_agfl_free_defer_type = {
> > .relog_intent = xfs_extent_free_relog_intent,
> > };
> >
> > +#ifdef CONFIG_XFS_RT
> > +/* Sort realtime efi items by rtgroup for efficiency. */
> > +static int
> > +xfs_rtextent_free_diff_items(
> > + void *priv,
> > + const struct list_head *a,
> > + const struct list_head *b)
> > +{
> > + struct xfs_extent_free_item *ra = xefi_entry(a);
> > + struct xfs_extent_free_item *rb = xefi_entry(b);
> > +
> > + return ra->xefi_rtg->rtg_rgno - rb->xefi_rtg->rtg_rgno;
> > +}
> > +
> > +/* Create a realtime extent freeing intent. */
> > +static struct xfs_log_item *
> > +xfs_rtextent_free_create_intent(
> > + struct xfs_trans *tp,
> > + struct list_head *items,
> > + unsigned int count,
> > + bool sort)
> > +{
> > + struct xfs_mount *mp = tp->t_mountp;
> > + struct xfs_efi_log_item *efip;
> > + struct xfs_extent_free_item *xefi;
> > +
> > + ASSERT(count > 0);
> > +
> > + efip = xfs_efi_init(mp, XFS_LI_EFI_RT, count);
> > + if (sort)
> > + list_sort(mp, items, xfs_rtextent_free_diff_items);
> > + list_for_each_entry(xefi, items, xefi_list)
> > + xfs_extent_free_log_item(tp, efip, xefi);
> > + return &efip->efi_item;
> > +}
>
> Hmmmm - when would we get an XFS_LI_EFI_RT with multiple extents in
> it? We only ever free a single user data extent per transaction at a
> time, right? There will be no metadata blocks being freed on the rt
> device - all the BMBT, refcountbt and rmapbt blocks that get freed
> as a result of freeing the user data extent will be in the data
> device and so will use EFIs, not EFI_RTs....
Later on when we get to reflink, a refcount decrement operation on an
extent that has a mix of single and multiple-owned blocks can generate
RTEFIs with multiple extents.
> > +
> > +/* Cancel a realtime extent freeing. */
> > +STATIC void
> > +xfs_rtextent_free_cancel_item(
> > + struct list_head *item)
> > +{
> > + struct xfs_extent_free_item *xefi = xefi_entry(item);
> > +
> > + xfs_rtgroup_put(xefi->xefi_rtg);
> > + kmem_cache_free(xfs_extfree_item_cache, xefi);
> > +}
> > +
> > +/* Process a free realtime extent. */
> > +STATIC int
> > +xfs_rtextent_free_finish_item(
> > + struct xfs_trans *tp,
> > + struct xfs_log_item *done,
> > + struct list_head *item,
> > + struct xfs_btree_cur **state)
>
> btree cursor ....
>
> > +{
> > + struct xfs_mount *mp = tp->t_mountp;
> > + struct xfs_extent_free_item *xefi = xefi_entry(item);
> > + struct xfs_efd_log_item *efdp = EFD_ITEM(done);
> > + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state;
>
> ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and
> the rtg the xefi is supposed to be associated with is already held
> by the xefi, so....
It's very nasty, and I preferred when it was just a void**. Maybe we
should just change that to a:
struct xfs_intent_item_state {
	struct xfs_btree_cur *cur;
	struct xfs_rtgroup *rtg;
};
and pass that around? At least then the compiler can typecheck that for
us.
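A hedged sketch of what that typed state buys us. The types below are trivial stand-ins (not the kernel structures), just to show that the compiler now checks which member each defer type uses, and that the "only relock when the group changes" test from the finish_item path falls out naturally:

```c
#include <assert.h>
#include <stddef.h>

/* Trivial stand-ins for the kernel types; illustration only. */
struct xfs_btree_cur { int unused; };
struct xfs_rtgroup { int rgno; int lock_count; };

/* The proposed typed container, replacing the void** / btree-cursor cast. */
struct xfs_intent_item_state {
	struct xfs_btree_cur	*cur;
	struct xfs_rtgroup	*rtg;
};

/* Lock the rtgroup only when it differs from the one cached in the state. */
void finish_item_lock_rtg(struct xfs_intent_item_state *state,
			  struct xfs_rtgroup *rtg)
{
	if (state->rtg != rtg) {
		rtg->lock_count++;	/* stands in for xfs_rtgroup_lock() */
		state->rtg = rtg;
	}
}
```

With this, an rt defer type that touches state->cur by mistake fails to compile against the wrong member type instead of silently reinterpreting a pointer.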
> > + int error = 0;
> > +
> > + trace_xfs_extent_free_deferred(mp, xefi);
> > +
> > + if (!(xefi->xefi_flags & XFS_EFI_CANCELLED)) {
> > + if (*rtgp != xefi->xefi_rtg) {
> > + xfs_rtgroup_lock(xefi->xefi_rtg, XFS_RTGLOCK_BITMAP);
> > + xfs_rtgroup_trans_join(tp, xefi->xefi_rtg,
> > + XFS_RTGLOCK_BITMAP);
> > + *rtgp = xefi->xefi_rtg;
>
> How does this case happen? Why is it safe to lock the xefi rtg
> here, and why are we returning the xefi rtg to the caller without
> taking extra references or dropping the rtg the caller passed in?
>
> At least a comment explaining what is happening is necessary here...
Hmm, I wonder when /is/ this possible? I don't think it can actually
happen ... except maybe in the case of a bunmapi where we pass in a
large bmbt_irec array? Let me investigate...
The locks and ijoins will be dropped at transaction commit.
> > + }
> > + error = xfs_rtfree_blocks(tp, xefi->xefi_rtg,
> > + xefi->xefi_startblock, xefi->xefi_blockcount);
> > + }
> > + if (error == -EAGAIN) {
> > + xfs_efd_from_efi(efdp);
> > + return error;
> > + }
> > +
> > + xfs_efd_add_extent(efdp, xefi);
> > + xfs_rtextent_free_cancel_item(item);
> > + return error;
> > +}
> > +
> > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
> > + .name = "rtextent_free",
> > + .max_items = XFS_EFI_MAX_FAST_EXTENTS,
> > + .create_intent = xfs_rtextent_free_create_intent,
> > + .abort_intent = xfs_extent_free_abort_intent,
> > + .create_done = xfs_extent_free_create_done,
> > + .finish_item = xfs_rtextent_free_finish_item,
> > + .cancel_item = xfs_rtextent_free_cancel_item,
> > + .recover_work = xfs_extent_free_recover_work,
> > + .relog_intent = xfs_extent_free_relog_intent,
> > +};
> > +#else
> > +const struct xfs_defer_op_type xfs_rtextent_free_defer_type = {
> > + .name = "rtextent_free",
> > +};
> > +#endif /* CONFIG_XFS_RT */
> > +
> > STATIC bool
> > xfs_efi_item_match(
> > struct xfs_log_item *lip,
> > @@ -731,7 +879,7 @@ xlog_recover_efi_commit_pass2(
> > return -EFSCORRUPTED;
> > }
> >
> > - efip = xfs_efi_init(mp, efi_formatp->efi_nextents);
> > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
> > error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
> > if (error) {
> > xfs_efi_item_free(efip);
> > @@ -749,6 +897,58 @@ const struct xlog_recover_item_ops xlog_efi_item_ops = {
> > .commit_pass2 = xlog_recover_efi_commit_pass2,
> > };
> >
> > +#ifdef CONFIG_XFS_RT
> > +STATIC int
> > +xlog_recover_rtefi_commit_pass2(
> > + struct xlog *log,
> > + struct list_head *buffer_list,
> > + struct xlog_recover_item *item,
> > + xfs_lsn_t lsn)
> > +{
> > + struct xfs_mount *mp = log->l_mp;
> > + struct xfs_efi_log_item *efip;
> > + struct xfs_efi_log_format *efi_formatp;
> > + int error;
> > +
> > + efi_formatp = item->ri_buf[0].i_addr;
> > +
> > + if (item->ri_buf[0].i_len < xfs_efi_log_format_sizeof(0)) {
> > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
> > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
> > + return -EFSCORRUPTED;
> > + }
> > +
> > + efip = xfs_efi_init(mp, ITEM_TYPE(item), efi_formatp->efi_nextents);
> > + error = xfs_efi_copy_format(&item->ri_buf[0], &efip->efi_format);
> > + if (error) {
> > + xfs_efi_item_free(efip);
> > + return error;
> > + }
> > + atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents);
> > +
> > + xlog_recover_intent_item(log, &efip->efi_item, lsn,
> > + &xfs_rtextent_free_defer_type);
> > + return 0;
> > +}
> > +#else
> > +STATIC int
> > +xlog_recover_rtefi_commit_pass2(
> > + struct xlog *log,
> > + struct list_head *buffer_list,
> > + struct xlog_recover_item *item,
> > + xfs_lsn_t lsn)
> > +{
> > + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, log->l_mp,
> > + item->ri_buf[0].i_addr, item->ri_buf[0].i_len);
> > + return -EFSCORRUPTED;
>
> This needs to be a more meaningful error. It's not technically a
> corruption - we recognised that an RTEFI is needing to be recovered,
> but this kernel does not have RTEFI support compiled in. Hence the
> error should be something along the lines of
>
> "RTEFI found in journal, but kernel not compiled with CONFIG_XFS_RT enabled.
> Cannot recover journal, please remount using a kernel with RT device
> support enabled."
Ok. That should probably get applied to the RTRUI and RTCUI recovery
stubs too.
--D
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-26 4:56 ` Dave Chinner
@ 2024-08-26 19:40 ` Darrick J. Wong
2024-08-27 1:56 ` Dave Chinner
2024-08-27 5:00 ` Christoph Hellwig
2024-08-27 4:59 ` Christoph Hellwig
1 sibling, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 19:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 02:56:37PM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:26:38PM -0700, Darrick J. Wong wrote:
> > From: Christoph Hellwig <hch@lst.de>
> >
> > Make the allocator rtgroup aware by either picking a specific group if
> > there is a hint, or loop over all groups otherwise. A simple rotor is
> > provided to pick the placement for initial allocations.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/libxfs/xfs_bmap.c | 13 +++++-
> > fs/xfs/libxfs/xfs_rtbitmap.c | 6 ++-
> > fs/xfs/xfs_mount.h | 1
> > fs/xfs/xfs_rtalloc.c | 98 ++++++++++++++++++++++++++++++++++++++----
> > 4 files changed, 105 insertions(+), 13 deletions(-)
> >
> >
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 126a0d253654a..88c62e1158ac7 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3151,8 +3151,17 @@ xfs_bmap_adjacent_valid(
> > struct xfs_mount *mp = ap->ip->i_mount;
> >
> > if (XFS_IS_REALTIME_INODE(ap->ip) &&
> > - (ap->datatype & XFS_ALLOC_USERDATA))
> > - return x < mp->m_sb.sb_rblocks;
> > + (ap->datatype & XFS_ALLOC_USERDATA)) {
> > + if (x >= mp->m_sb.sb_rblocks)
> > + return false;
> > + if (!xfs_has_rtgroups(mp))
> > + return true;
> > +
> > + return xfs_rtb_to_rgno(mp, x) == xfs_rtb_to_rgno(mp, y) &&
> > + xfs_rtb_to_rgno(mp, x) < mp->m_sb.sb_rgcount &&
> > + xfs_rtb_to_rtx(mp, x) < mp->m_sb.sb_rgextents;
>
> WHy do we need the xfs_has_rtgroups() check here? The new rtg logic will
> return true for an old school rt device here, right?
The incore sb_rgextents is zero on !rtg filesystems, so we need the
xfs_has_rtgroups.
> > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > index 3fedc552b51b0..2b57ff2687bf6 100644
> > --- a/fs/xfs/xfs_rtalloc.c
> > +++ b/fs/xfs/xfs_rtalloc.c
> > @@ -1661,8 +1661,9 @@ xfs_rtalloc_align_minmax(
> > }
> >
> > static int
> > -xfs_rtallocate(
> > +xfs_rtallocate_rtg(
> > struct xfs_trans *tp,
> > + xfs_rgnumber_t rgno,
> > xfs_rtblock_t bno_hint,
> > xfs_rtxlen_t minlen,
> > xfs_rtxlen_t maxlen,
> > @@ -1682,16 +1683,33 @@ xfs_rtallocate(
> > xfs_rtxlen_t len = 0;
> > int error = 0;
> >
> > - args.rtg = xfs_rtgroup_grab(args.mp, 0);
> > + args.rtg = xfs_rtgroup_grab(args.mp, rgno);
> > if (!args.rtg)
> > return -ENOSPC;
> >
> > /*
> > - * Lock out modifications to both the RT bitmap and summary inodes.
> > + * We need to lock out modifications to both the RT bitmap and summary
> > + * inodes for finding free space in xfs_rtallocate_extent_{near,size}
> > + * and join the bitmap and summary inodes for the actual allocation
> > + * down in xfs_rtallocate_range.
> > + *
> > + * For RTG-enabled file system we don't want to join the inodes to the
> > + * transaction until we are committed to allocating from this
> > + * RTG so that only one inode of each type is locked at a time.
> > + *
> > + * But for pre-RTG file systems we already need to join the bitmap
> > + * inode to the transaction for xfs_rtpick_extent, which bumps the
> > + * sequence number in it, so we'll have to join the inode to the
> > + * transaction early here.
> > + *
> > + * This is all a bit messy, but at least the mess is contained in
> > + * this function.
> > */
> > if (!*rtlocked) {
> > xfs_rtgroup_lock(args.rtg, XFS_RTGLOCK_BITMAP);
> > - xfs_rtgroup_trans_join(tp, args.rtg, XFS_RTGLOCK_BITMAP);
> > + if (!xfs_has_rtgroups(args.mp))
> > + xfs_rtgroup_trans_join(tp, args.rtg,
> > + XFS_RTGLOCK_BITMAP);
> > *rtlocked = true;
> > }
> >
> > @@ -1701,7 +1719,7 @@ xfs_rtallocate(
> > */
> > if (bno_hint)
> > start = xfs_rtb_to_rtx(args.mp, bno_hint);
> > - else if (initial_user_data)
> > + else if (!xfs_has_rtgroups(args.mp) && initial_user_data)
> > start = xfs_rtpick_extent(args.rtg, tp, maxlen);
>
> Check initial_user_data first - we don't care if there are rtgroups
> enabled if initial_user_data is not true, and we only ever allocate
> initial data on an inode once...
<nod>
> > @@ -1741,6 +1767,53 @@ xfs_rtallocate(
> > return error;
> > }
> >
> > +static int
> > +xfs_rtallocate_rtgs(
> > + struct xfs_trans *tp,
> > + xfs_fsblock_t bno_hint,
> > + xfs_rtxlen_t minlen,
> > + xfs_rtxlen_t maxlen,
> > + xfs_rtxlen_t prod,
> > + bool wasdel,
> > + bool initial_user_data,
> > + xfs_rtblock_t *bno,
> > + xfs_extlen_t *blen)
> > +{
> > + struct xfs_mount *mp = tp->t_mountp;
> > + xfs_rgnumber_t start_rgno, rgno;
> > + int error;
> > +
> > + /*
> > + * For now this just blindly iterates over the RTGs for an initial
> > + * allocation. We could try to keep an in-memory rtg_longest member
> > + * to avoid the locking when just looking for big enough free space,
> > + * but for now this keeps things simple.
> > + */
> > + if (bno_hint != NULLFSBLOCK)
> > + start_rgno = xfs_rtb_to_rgno(mp, bno_hint);
> > + else
> > + start_rgno = (atomic_inc_return(&mp->m_rtgrotor) - 1) %
> > + mp->m_sb.sb_rgcount;
> > +
> > + rgno = start_rgno;
> > + do {
> > + bool rtlocked = false;
> > +
> > + error = xfs_rtallocate_rtg(tp, rgno, bno_hint, minlen, maxlen,
> > + prod, wasdel, initial_user_data, &rtlocked,
> > + bno, blen);
> > + if (error != -ENOSPC)
> > + return error;
> > + ASSERT(!rtlocked);
> > +
> > + if (++rgno == mp->m_sb.sb_rgcount)
> > + rgno = 0;
> > + bno_hint = NULLFSBLOCK;
> > + } while (rgno != start_rgno);
> > +
> > + return -ENOSPC;
> > +}
> > +
> > static int
> > xfs_rtallocate_align(
> > struct xfs_bmalloca *ap,
> > @@ -1835,9 +1908,16 @@ xfs_bmap_rtalloc(
> > if (xfs_bmap_adjacent(ap))
> > bno_hint = ap->blkno;
> >
> > - error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
> > - ap->wasdel, initial_user_data, &rtlocked,
> > - &ap->blkno, &ap->length);
> > + if (xfs_has_rtgroups(ap->ip->i_mount)) {
> > + error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
> > + prod, ap->wasdel, initial_user_data,
> > + &ap->blkno, &ap->length);
> > + } else {
> > + error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
> > + prod, ap->wasdel, initial_user_data,
> > + &rtlocked, &ap->blkno, &ap->length);
> > + }
>
> The xfs_has_rtgroups() check is unnecessary. The iterator in
> xfs_rtallocate_rtgs() will do the right thing for the
> !xfs_has_rtgroups() case - it'll set start_rgno = 0 and break out
> after a single call to xfs_rtallocate_rtg() with rgno = 0.
>
> Another thing that probably should be done here is push all the
> constant value calculations a couple of functions down the stack to
> where they are used. Then we only need to pass two parameters down
> through the rg iterator here, not 11...
..and pass the ap itself too, to remove three of the parameters?
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-25 23:58 ` Dave Chinner
@ 2024-08-26 21:38 ` Darrick J. Wong
2024-08-27 0:58 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-26 21:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 09:58:05AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2024 at 05:18:02PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Add a dynamic lockdep class key for rtgroup inodes. This will enable
> > lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
> > order. Each class can have 8 subclasses, and for now we will only have
> > 2 inodes per group. This enables rtgroup order and inode order checks
> > when nesting ILOCKs.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 52 insertions(+)
> >
> >
> > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > index 51f04cad5227c..ae6d67c673b1a 100644
> > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > @@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
> > if (rtglock_flags & XFS_RTGLOCK_BITMAP)
> > xfs_rtbitmap_trans_join(tp);
> > }
> > +
> > +#ifdef CONFIG_PROVE_LOCKING
> > +static struct lock_class_key xfs_rtginode_lock_class;
> > +
> > +static int
> > +xfs_rtginode_ilock_cmp_fn(
> > + const struct lockdep_map *m1,
> > + const struct lockdep_map *m2)
> > +{
> > + const struct xfs_inode *ip1 =
> > + container_of(m1, struct xfs_inode, i_lock.dep_map);
> > + const struct xfs_inode *ip2 =
> > + container_of(m2, struct xfs_inode, i_lock.dep_map);
> > +
> > + if (ip1->i_projid < ip2->i_projid)
> > + return -1;
> > + if (ip1->i_projid > ip2->i_projid)
> > + return 1;
> > + return 0;
> > +}
>
> What's the project ID of the inode got to do with realtime groups?
Each rtgroup metadata file stores its group number in i_projid so that
mount can detect if there's a corruption in /rtgroup and we just opened
the bitmap from the wrong group.
We can also use lockdep to detect code that locks rtgroup metadata in
the wrong order. Potentially we could use this _cmp_fn to enforce that
we always ILOCK in the order bitmap -> summary -> rmap -> refcount based
on i_metatype.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-26 19:14 ` Darrick J. Wong
@ 2024-08-27 0:57 ` Dave Chinner
2024-08-27 1:55 ` Darrick J. Wong
2024-08-27 4:38 ` Christoph Hellwig
2024-08-27 4:27 ` Christoph Hellwig
1 sibling, 2 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 0:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 12:14:04PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 09:56:08AM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:17:31PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Create an incore object that will contain information about a realtime
> > > allocation group. This will eventually enable us to shard the realtime
> > > section in a similar manner to how we shard the data section, but for
> > > now just a single object for the entire RT subvolume is created.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/Makefile | 1
> > > fs/xfs/libxfs/xfs_format.h | 3 +
> > > fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_sb.c | 7 +
> > > fs/xfs/libxfs/xfs_types.h | 4 +
> > > fs/xfs/xfs_log_recover.c | 20 ++++
> > > fs/xfs/xfs_mount.c | 16 +++
> > > fs/xfs/xfs_mount.h | 14 +++
> > > fs/xfs/xfs_rtalloc.c | 6 +
> > > fs/xfs/xfs_super.c | 1
> > > fs/xfs/xfs_trace.c | 1
> > > fs/xfs/xfs_trace.h | 38 ++++++++
> > > 13 files changed, 517 insertions(+), 2 deletions(-)
> > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
> > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
> >
> > Ok, how is the global address space for real time extents laid out
> > across rt groups? i.e. is it sparse similar to how fsbnos and inode
> > numbers are created for the data device like so?
> >
> > fsbno = (agno << agblklog) | agbno
> >
> > Or is it something different? I can't find that defined anywhere in
> > this patch, so I can't determine if the unit conversion code and
> > validation is correct or not...
>
> They're not sparse like fsbnos on the data device, they're laid end to
> end. IOWs, it's a straight linear translation. If you have an rtgroup
> that is 50 blocks long, then rtgroup 1 starts at (50 * blocksize).
Yes, I figured that out later. I think that's less than optimal,
because it essentially repeats the problems we have with AGs being
fixed size without the potential for fixing it easily. i.e. the
global sharded fsbno address space is sparse, so we can actually
space out the sparse address regions to allow future flexibility in
group size and location work.
By having the rtgroup addressing being purely physical, we're
completely stuck with fixed sized rtgroups and there is no way
around that. IOWs, the physical address space sharding repeats the
existing grow and shrink problems we have with the existing fixed
size AGs.
We're discussing how to use the sparse fsbno addressing to allow
resizing of AGs, but we will not be able to do that at all with
rtgroups as they stand. The limitation is that a 64 bit global rt extent
address is essentially the physical address of the extent in the block
device LBA space.
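The difference between the two layouts can be sketched with a pair of
illustrative encode helpers. The names and the AGBLKLOG value are made up
for the example and are not the kernel API; they just contrast the sparse
data-device fsbno encoding with the purely linear rt addressing described
above.

```c
#include <stdint.h>

#define AGBLKLOG 20	/* example: 2^20-block address slot per AG */

/*
 * Sparse: the group number lives in the high bits, so a group can be
 * shorter than its address slot and later be grown within it without
 * renumbering any other group's blocks.
 */
static uint64_t sparse_fsbno(uint32_t agno, uint32_t agbno)
{
	return ((uint64_t)agno << AGBLKLOG) | agbno;
}

/*
 * Linear: groups are packed end to end, so a group's start address
 * depends on the exact size of every group before it. Resizing group N
 * renumbers every block in groups N+1 and beyond.
 */
static uint64_t linear_rtbno(uint32_t rgno, uint32_t rgbno,
			     uint64_t rgblocks)
{
	return (uint64_t)rgno * rgblocks + rgbno;
}
```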
>
> This patch, FWIW, refactors the existing rt code so that a !rtgroups
> filesystem is represented by one large "group", with xfs_rtxnum_t now
> indexing rt extents within a group.
Right, we can do that regardless of whether we use logical or
physical addressing for the global rtbno in a sharded rtgroup layout.
An rtgno of 0 for that rtg always results in logical = physical
addressing.
> Probably it should be renamed to xfs_rgxnum_t.
That might be a good idea.
> Note that we haven't defined the rtgroup ondisk format yet, so I'll go
> amend that patch to spell out the ondisk format of the brave new world.
Yes, please! That would have made working out all the differences
between the combinations of rt, rtx, rg, num, len, blk, etc a
whole lot easier.
> > > +struct xfs_rtgroup *
> > > +xfs_rtgroup_grab(
> > > + struct xfs_mount *mp,
> > > + xfs_agnumber_t agno)
> > > +{
> > > + struct xfs_rtgroup *rtg;
> > > +
> > > + rcu_read_lock();
> > > + rtg = xa_load(&mp->m_rtgroups, agno);
> > > + if (rtg) {
> > > + trace_xfs_rtgroup_grab(rtg, _RET_IP_);
> > > + if (!atomic_inc_not_zero(&rtg->rtg_active_ref))
> > > + rtg = NULL;
> > > + }
> > > + rcu_read_unlock();
> > > + return rtg;
> > > +}
> > > +
> > > +void
> > > +xfs_rtgroup_rele(
> > > + struct xfs_rtgroup *rtg)
> > > +{
> > > + trace_xfs_rtgroup_rele(rtg, _RET_IP_);
> > > + if (atomic_dec_and_test(&rtg->rtg_active_ref))
> > > + wake_up(&rtg->rtg_active_wq);
> > > +}
> >
> > This is all duplicates of the xfs_perag code. Can you put together a
> > patchset to abstract this into a "xfs_group" and embed them in both
> > the perag and rtgroup structures?
> >
> > That way we only need one set of lookup and iterator infrastructure,
> > and it will work for both data and rt groups...
>
> How will that work with perags still using the radix tree and rtgroups
> using the xarray? Yes, we should move the perags to use the xarray too
> (and indeed hch already has a series on list to do that) but here's
> really not the time to do that because I don't want to frontload a bunch
> more core changes onto this already huge patchset.
Let's first assume they both use xarray (that's just a matter of
time, yes?) so it's easier to reason about. Then we have something
like this:
/*
* xfs_group - a contiguous 32 bit block address space group
*/
struct xfs_group {
struct xarray xarr;
u32 num_groups;
};
struct xfs_group_item {
struct xfs_group *group; /* so put/rele don't need any other context */
u32 gno;
atomic_t passive_refs;
atomic_t active_refs;
wait_queue_head_t active_wq;
unsigned long opstate;
u32 blocks; /* length in fsb */
u32 extents; /* length in extents */
u32 blk_log; /* extent size in fsb */
/* limits for min/max valid addresses */
u32 max_addr;
u32 min_addr;
};
And then we define:
struct xfs_perag {
struct xfs_group_item g;
/* perag specific stuff follows */
....
};
struct xfs_rtgroup {
struct xfs_group_item g;
/* rtg specific stuff follows */
.....
}
And then a couple of generic macros:
#define to_grpi(grpi, gi) container_of((gi), typeof(grpi), g)
#define to_gi(grpi) (&(grpi)->g)
though this might be better as just typed macros:
#define gi_to_pag(gi) container_of((gi), struct xfs_perag, g)
#define gi_to_rtg(gi) container_of((gi), struct xfs_rtgroup, g)
And then all the grab/rele/get/put stuff becomes:
rtg = to_grpi(rtg, xfs_group_grab(mp->m_rtgroups, rgno));
pag = to_grpi(pag, xfs_group_grab(mp->m_perags, agno));
....
xfs_group_put(&rtg->g);
xfs_group_put(&pag->g);
or
rtg = gi_to_rtg(xfs_group_grab(mp->m_rtgroups, rgno));
pag = gi_to_pag(xfs_group_grab(mp->m_perags, agno));
....
xfs_group_put(&rtg->g);
xfs_group_put(&pag->g);
then we pass the group to each of the "for_each_group..." iterators
like so:
for_each_group(&mp->m_perags, agno, pag) {
/* do stuff with pag */
}
or
for_each_group(&mp->m_rtgroups, rtgno, rtg) {
/* do stuff with rtg */
}
And we use typeof() and container_of() to access the group structure
within the pag/rtg. Something like:
#define to_grpi(grpi, gi) container_of((gi), typeof(grpi), g)
#define to_gi(grpi) (&(grpi)->g)
#define for_each_group(grp, gno, grpi) \
(gno) = 0; \
for ((grpi) = to_grpi((grpi), xfs_group_grab((grp), (gno))); \
(grpi) != NULL; \
(grpi) = to_grpi(grpi, xfs_group_next((grp), to_gi(grpi), \
&(gno), (grp)->num_groups))
And now we essentially have common group infrastructure for
access, iteration, geometry and address verification purposes...
>
> > > +
> > > +/* Compute the number of rt extents in this realtime group. */
> > > +xfs_rtxnum_t
> > > +xfs_rtgroup_extents(
> > > + struct xfs_mount *mp,
> > > + xfs_rgnumber_t rgno)
> > > +{
> > > + xfs_rgnumber_t rgcount = mp->m_sb.sb_rgcount;
> > > +
> > > + ASSERT(rgno < rgcount);
> > > + if (rgno == rgcount - 1)
> > > + return mp->m_sb.sb_rextents -
> > > + ((xfs_rtxnum_t)rgno * mp->m_sb.sb_rgextents);
> >
> > Urk. So this relies on a non-rtgroup filesystem doing a
> > multiplication by zero of a field that the on-disk format does not
> > understand to get the right result. I think this is a copying a bad
> > pattern we've been slowly trying to remove from the normal
> > allocation group code.
> >
> > > +
> > > + ASSERT(xfs_has_rtgroups(mp));
> > > + return mp->m_sb.sb_rgextents;
> > > +}
> >
> > We already embed the length of the rtgroup in the rtgroup structure.
> > THis should be looking up the rtgroup (or being passed the rtgroup
> > the caller already has) and doing the right thing. i.e.
> >
> > if (!rtg || !xfs_has_rtgroups(rtg->rtg_mount))
> > return mp->m_sb.sb_rextents;
> > return rtg->rtg_extents;
>
> xfs_rtgroup_extents is the function that we use to set rtg->rtg_extents.
That wasn't clear from the context of the patch. Perhaps a better
name would be xfs_rtgroup_calc_extents(), to indicate that it is a
setup function, not something that should be regularly called at
runtime?
>
> > > +static inline xfs_rtblock_t
> > > +xfs_rgno_start_rtb(
> > > + struct xfs_mount *mp,
> > > + xfs_rgnumber_t rgno)
> > > +{
> > > + if (mp->m_rgblklog >= 0)
> > > + return ((xfs_rtblock_t)rgno << mp->m_rgblklog);
> > > + return ((xfs_rtblock_t)rgno * mp->m_rgblocks);
> > > +}
> >
> > Where does mp->m_rgblklog come from? That wasn't added to the
> > on-disk superblock structure and it is always initialised to zero
> > in this patch.
> >
> > When will m_rgblklog be zero and when will it be non-zero? If it's
>
> As I mentioned before, this patch merely ports non-rtg filesystems to
> use the rtgroup structure. m_rgblklog will be set to nonzero values
> when we get to defining the ondisk rtgroup structure.
Yeah, which makes some of the context in the patch hard to
understand... :/
> But, to cut ahead here, m_rgblklog will be set to a non-negative value
> if the rtgroup size (in blocks) is a power of two. Then these unit
> conversion functions can use shifts instead of expensive multiplication
> and divisions. The same goes for rt extent to {fs,rt}block conversions.
Yeah, so mp->m_rgblklog is not equivalent to mp->m_agblklog at all.
It took me some time to understand that - the names are the same,
they are used in similar address conversions, but they have
completely different functions.
I suspect we need some better naming here, regardless of the
rtgroups global address space layout discussion...
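To pin down the distinction: a sketch of the conversion style described
above, where a non-negative m_rgblklog means the group size is a power of
two (so shifts can be used) and a negative value means falling back to
multiplication. The struct and field names are simplified stand-ins for the
mount structure, not the actual kernel definitions.

```c
#include <stdint.h>

struct rg_geom {
	int	 rgblklog;	/* log2(rgblocks) if power of two, else -1 */
	uint64_t rgblocks;	/* rt group size in fs blocks */
};

/* Start block of group rgno in the linear rt address space. */
static uint64_t rgno_start(const struct rg_geom *g, uint32_t rgno)
{
	if (g->rgblklog >= 0)
		return (uint64_t)rgno << g->rgblklog;	/* cheap shift */
	return (uint64_t)rgno * g->rgblocks;		/* general case */
}
```

This is why the name is misleading when read against m_agblklog: m_agblklog
defines the width of a sparse address slot, while m_rgblklog is only a
speed optimization over an otherwise purely physical layout.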
> > > +
> > > +static inline uint64_t
> > > +__xfs_rtb_to_rgbno(
> > > + struct xfs_mount *mp,
> > > + xfs_rtblock_t rtbno)
> > > +{
> > > + uint32_t rem;
> > > +
> > > + if (!xfs_has_rtgroups(mp))
> > > + return rtbno;
> > > +
> > > + if (mp->m_rgblklog >= 0)
> > > + return rtbno & mp->m_rgblkmask;
> > > +
> > > + div_u64_rem(rtbno, mp->m_rgblocks, &rem);
> > > + return rem;
> > > +}
> >
> > Why is this function returning a uint64_t - a xfs_rgblock_t is only
> > a 32 bit type...
>
> group 0 on a !rtg filesystem can be 64-bits in block/rt count. This is
> a /very/ annoying pain point -- if you actually created such a
> filesystem it would never work because the rtsummary file would
> be created undersized due to an integer overflow, but the verifiers
> never checked any of that, and due to the same overflow the rtallocator
> would search the wrong places and (eventually) fall back to a dumb
> linear scan.
>
> Soooooo this is an obnoxious usecase (broken large !rtg filesystems)
> that we can't just drop, though I'm pretty sure there aren't any systems
> in the wild.
Ugh. That definitely needs to be a comment somewhere in the code to
explain this. :/
> > > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > > index a8cd44d03ef64..1ce4b9eb16f47 100644
> > > --- a/fs/xfs/libxfs/xfs_types.h
> > > +++ b/fs/xfs/libxfs/xfs_types.h
> > > @@ -9,10 +9,12 @@
> > > typedef uint32_t prid_t; /* project ID */
> > >
> > > typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
> > > +typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
> >
> > Is that right? The rtg length is 2^32 * rtextsize, and rtextsize can
> > be 2^20 bytes:
> >
> > #define XFS_MAX_RTEXTSIZE (1024 * 1024 * 1024)
>
> No, the maximum rtgroup length is 2^32-1 blocks.
I couldn't tell if the max length was being defined as the maximum
number of rt extents that the rtgroup could index, or whether it was
the maximum number of filesystem blocks (i.e. data device fsblock
size) that an rtgroup could index...
> > Hence for a 4kB fsbno filesystem, the actual maximum size of an rtg
> > in filesystem blocks far exceeds what we can address with a 32 bit
> > variable.
> >
> > If xfs_rgblock_t is actually indexing multi-fsbno rtextents, then it
> > is an extent number index, not a "block" index. An extent number
> > index won't overflow 32 bits (because the rtg has a max of 2^32 - 1
> > rtextents)
> >
> > IOWs, shouldn't this be named something like:
> >
> > typedef uint32_t xfs_rgext_t; /* extent number in realtime group */
>
> and again, we can't do that because we emulate !rtg filesystems with a
> single "rtgroup" that can be more than 2^32 rtx long.
*nod*
> > > typedef uint32_t xfs_agino_t; /* inode # within allocation grp */
> > > typedef uint32_t xfs_extlen_t; /* extent length in blocks */
> > > typedef uint32_t xfs_rtxlen_t; /* file extent length in rtextents */
> > > typedef uint32_t xfs_agnumber_t; /* allocation group number */
> > > +typedef uint32_t xfs_rgnumber_t; /* realtime group number */
> > > typedef uint64_t xfs_extnum_t; /* # of extents in a file */
> > > typedef uint32_t xfs_aextnum_t; /* # extents in an attribute fork */
> > > typedef int64_t xfs_fsize_t; /* bytes in a file */
> > > @@ -53,7 +55,9 @@ typedef void * xfs_failaddr_t;
> > > #define NULLFILEOFF ((xfs_fileoff_t)-1)
> > >
> > > #define NULLAGBLOCK ((xfs_agblock_t)-1)
> > > +#define NULLRGBLOCK ((xfs_rgblock_t)-1)
> > > #define NULLAGNUMBER ((xfs_agnumber_t)-1)
> > > +#define NULLRGNUMBER ((xfs_rgnumber_t)-1)
> >
> > What's the maximum valid rtg number? We're not ever going to be
> > supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
> > this at and validate it at?
>
> /me shrugs -- the smallest AG size on the data device is 16M, which
> technically speaking means that one /could/ format 2^(63-24) groups,
> or order 39.
>
> Realistically with the maximum rtgroup size of 2^31 blocks, we probably
> only need 2^(63 - (31 + 10)) = 2^22 rtgroups max on a 1k fsblock fs.
Right, those are the theoretical maximums. Practically speaking,
though, mkfs and mount iteration of all AGs means millions to
billions of IOs need to be done before the filesystem can even be
fully mounted. Hence the practical limit to AG count is closer to a
few tens of thousands, not hundreds of billions.
Hence I'm wondering if we should actually cap the maximum number of
rtgroups. We're just about at BS > PS, so with a 64k block size a
single rtgroup can index 2^32 * 2^16 bytes which puts individual
rtgs at 256TB in size. Unless there are use cases for rtgroup sizes
smaller than a few GBs, I just don't see the need to support
theoretical maximum counts on tiny block size filesystems. Thirty
thousand rtgs at 256TB per rtg puts us at 64 bit device size limits,
and we hit those limits on 4kB block sizes at around 500,000 rtgs.
So do we need to support millions of rtgs? I'd say no....
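The figures above can be checked with a little 64 bit arithmetic, assuming
a maximum of 2^32 blocks per rtgroup and a 2^63-byte device size limit
(both taken from the discussion, not from any kernel constant):

```c
#include <stdint.h>

/* Maximum rtgroup size in bytes for a given block size log2. */
static uint64_t max_rtg_bytes(unsigned int blklog)
{
	return (uint64_t)1 << (32 + blklog);	/* 2^32 blocks per rtg */
}

/* How many maximally sized rtgroups fit in a 2^63 byte device. */
static uint64_t rtgs_to_fill_device(unsigned int blklog)
{
	return ((uint64_t)1 << 63) / max_rtg_bytes(blklog);
}
```

At 64k blocks each rtgroup tops out at 2^48 bytes (256TB) and 32768 of them
exhaust the device address space; at 4k blocks the same limit is reached at
524288 rtgroups, matching the "around 500,000" figure above.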
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-26 21:38 ` Darrick J. Wong
@ 2024-08-27 0:58 ` Dave Chinner
2024-08-27 1:56 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 0:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 09:58:05AM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:18:02PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Add a dynamic lockdep class key for rtgroup inodes. This will enable
> > > lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
> > > order. Each class can have 8 subclasses, and for now we will only have
> > > 2 inodes per group. This enables rtgroup order and inode order checks
> > > when nesting ILOCKs.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 52 insertions(+)
> > >
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > > index 51f04cad5227c..ae6d67c673b1a 100644
> > > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > > @@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
> > > if (rtglock_flags & XFS_RTGLOCK_BITMAP)
> > > xfs_rtbitmap_trans_join(tp);
> > > }
> > > +
> > > +#ifdef CONFIG_PROVE_LOCKING
> > > +static struct lock_class_key xfs_rtginode_lock_class;
> > > +
> > > +static int
> > > +xfs_rtginode_ilock_cmp_fn(
> > > + const struct lockdep_map *m1,
> > > + const struct lockdep_map *m2)
> > > +{
> > > + const struct xfs_inode *ip1 =
> > > + container_of(m1, struct xfs_inode, i_lock.dep_map);
> > > + const struct xfs_inode *ip2 =
> > > + container_of(m2, struct xfs_inode, i_lock.dep_map);
> > > +
> > > + if (ip1->i_projid < ip2->i_projid)
> > > + return -1;
> > > + if (ip1->i_projid > ip2->i_projid)
> > > + return 1;
> > > + return 0;
> > > +}
> >
> > What's the project ID of the inode got to do with realtime groups?
>
> Each rtgroup metadata file stores its group number in i_projid so that
> mount can detect if there's a corruption in /rtgroup and we just opened
> the bitmap from the wrong group.
>
> We can also use lockdep to detect code that locks rtgroup metadata in
> the wrong order. Potentially we could use this _cmp_fn to enforce that
> we always ILOCK in the order bitmap -> summary -> rmap -> refcount based
> on i_metatype.
Ok, can we union the i_projid field (both in memory and in the
on-disk structure) so that dual use of the field is well documented
by the code?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-26 18:37 ` Darrick J. Wong
@ 2024-08-27 1:05 ` Dave Chinner
2024-08-27 2:01 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 1:05 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 11:37:34AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 11:41:19AM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:18:18PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Create the necessary per-rtgroup infrastructure that we need to load
> > > metadata inodes into memory.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/libxfs/xfs_rtgroup.c | 182 +++++++++++++++++++++++++++++++++++++++++++
> > > fs/xfs/libxfs/xfs_rtgroup.h | 28 +++++++
> > > fs/xfs/xfs_mount.h | 1
> > > fs/xfs/xfs_rtalloc.c | 48 +++++++++++
> > > 4 files changed, 258 insertions(+), 1 deletion(-)
> > >
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > > index ae6d67c673b1a..50e4a56d749f0 100644
> > > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > > @@ -30,6 +30,8 @@
> > > #include "xfs_icache.h"
> > > #include "xfs_rtgroup.h"
> > > #include "xfs_rtbitmap.h"
> > > +#include "xfs_metafile.h"
> > > +#include "xfs_metadir.h"
> > >
> > > /*
> > > * Passive reference counting access wrappers to the rtgroup structures. If
> > > @@ -295,3 +297,183 @@ xfs_rtginode_lockdep_setup(
> > > #else
> > > #define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
> > > #endif /* CONFIG_PROVE_LOCKING */
> > > +
> > > +struct xfs_rtginode_ops {
> > > + const char *name; /* short name */
> > > +
> > > + enum xfs_metafile_type metafile_type;
> > > +
> > > + /* Does the fs have this feature? */
> > > + bool (*enabled)(struct xfs_mount *mp);
> > > +
> > > + /* Create this rtgroup metadata inode and initialize it. */
> > > + int (*create)(struct xfs_rtgroup *rtg,
> > > + struct xfs_inode *ip,
> > > + struct xfs_trans *tp,
> > > + bool init);
> > > +};
> >
> > What's all this for?
> >
> > AFAICT, loading the inodes into the rtgs requires a call to
> > xfs_metadir_load() when initialising the rtg (either at mount or
> > lazily on the first access to the rtg). Hence I'm not really sure
> > what this complexity is needed for, and the commit message is not
> > very informative....
>
> Yes, the creation and mkdir code in here is really to support growfs,
> mkfs, and repair. How about I change the commit message to:
>
> "Create the necessary per-rtgroup infrastructure that we need to load
> metadata inodes into memory and to create directory trees on the fly.
> Loading is needed by the mounting process. Creation is needed by
> growfs, mkfs, and repair."
IMO it would have been nicer to add this with the patch that
adds growfs support for rtgs. That way the initial inode loading
would be much easier to understand and review, and the rest of it
would have enough context to be able to review it sanely. There
isn't enough context in this patch to determine if the creation code
is sane or works correctly....
> > > + path = xfs_rtginode_path(rtg->rtg_rgno, type);
> > > + if (!path)
> > > + return -ENOMEM;
> > > + error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
> > > + &ip);
> > > + kfree(path);
> > > +
> > > + if (error)
> > > + return error;
> > > +
> > > + if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
> > > + ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
> > > + xfs_irele(ip);
> > > + return -EFSCORRUPTED;
> > > + }
> >
> > We don't support LOCAL format for any type of regular file inodes,
> > so I'm a little confused as to why this wouldn't be caught by the
> > verifier on inode read? i.e. What problem is this trying to catch,
> > and why doesn't the inode verifier catch it for us?
>
> This is really more of a placeholder for more refactorings coming down
> the line for the rtrmap patchset, which will create a new
> XFS_DINODE_FMT_RMAP. At that time we'll need to check that an inode
> that we are loading to be the rmap btree actually has that set.
Ok, can you leave a comment to indicate this so I don't have to
remember why this code exists?
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-26 18:27 ` Darrick J. Wong
@ 2024-08-27 1:29 ` Dave Chinner
2024-08-27 4:27 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 1:29 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 11:27:34AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 12:06:58PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:20:07PM -0700, Darrick J. Wong wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > >
> > > Split the check that the rtsummary fits into the log into a separate
> > > helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
> > > geometry.
> > >
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > [djwong: avoid division for the 0-rtx growfs check]
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
> > > 1 file changed, 29 insertions(+), 14 deletions(-)
> > >
> > >
> > > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > > index 61231b1dc4b79..78a3879ad6193 100644
> > > --- a/fs/xfs/xfs_rtalloc.c
> > > +++ b/fs/xfs/xfs_rtalloc.c
> > > @@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
> > > return error;
> > > }
> > >
> > > +static int
> > > +xfs_growfs_check_rtgeom(
> > > + const struct xfs_mount *mp,
> > > + xfs_rfsblock_t rblocks,
> > > + xfs_extlen_t rextsize)
> > > +{
> > > + struct xfs_mount *nmp;
> > > + int error = 0;
> > > +
> > > + nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
> > > + if (!nmp)
> > > + return -ENOMEM;
> > > +
> > > + /*
> > > + * New summary size can't be more than half the size of the log. This
> > > + * prevents us from getting a log overflow, since we'll log basically
> > > + * the whole summary file at once.
> > > + */
> > > + if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
> > > + error = -EINVAL;
> >
> > FWIW, the new size needs to be smaller than that, because the "half
> > the log size" must to include all the log metadata needed to
> > encapsulate that object. The grwofs transaction also logs inodes and
> > the superblock, so that also takes away from the maximum size of
> > the summary file....
>
> <shrug> It's the same logic as what's there now, and there haven't been
> any bug reports, have there?
No, none that I know of - it was just an observation that the code
doesn't actually guarantee what the comment says it should do.
> Though I suppose that's just a reduction
> of what? One block for the rtbitmap, and (conservatively) two inodes
> and a superblock?
The rtbitmap update might touch a lot more than one block. The newly
allocated space in the rtbitmap inode is initialised to zeros, and
so the xfs_rtfree_range() call from the growfs code to mark the new
space free has to write all 1s to that range of the rtbitmap. This
is all done in a single transaction, so we might actually be logging
a *lot* of rtbitmap buffers here.
IIRC, there is a bit per rtextent, so in a 4kB buffer we can mark
32768 rtextents free. If they are 4kB each, then that's 128MB of
space tracked per rtbitmap block. This adds up to roughly 3.5MB of
log space for the rtbitmap updates per TB of grown rtdev space....
So, yeah, I think that calculation and comment are inaccurate, but we
don't have to fix this right now.
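For illustration, the per-block bookkeeping described above (one bit in the rtbitmap per rt extent) works out as follows. This is a standalone sketch, not kernel code; the 4kB block size and 4kB rt extent size are just the example values from the discussion:

```c
#include <assert.h>
#include <stdint.h>

/* One bit in the rtbitmap tracks one rt extent. */
static uint64_t rtextents_per_bitmap_block(uint64_t blocksize)
{
	return blocksize * 8;	/* bits per bitmap block */
}

/* Bytes of rt device space tracked by a single rtbitmap block. */
static uint64_t bytes_tracked_per_bitmap_block(uint64_t blocksize,
		uint64_t rtextsize_bytes)
{
	return rtextents_per_bitmap_block(blocksize) * rtextsize_bytes;
}

/* Rtbitmap blocks a grow of @grow_bytes must dirty (and hence log). */
static uint64_t bitmap_blocks_touched(uint64_t grow_bytes,
		uint64_t blocksize, uint64_t rtextsize_bytes)
{
	uint64_t per_block = bytes_tracked_per_bitmap_block(blocksize,
			rtextsize_bytes);

	return (grow_bytes + per_block - 1) / per_block;
}
```

With 4kB blocks and 4kB rt extents this gives 32768 extents and 128MB of tracked space per bitmap block, i.e. 8192 bitmap blocks dirtied per TB of grown rtdev space, all within the single growfs transaction.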
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 17/26] xfs: support logging EFIs for realtime extents
2024-08-26 19:38 ` Darrick J. Wong
@ 2024-08-27 1:36 ` Dave Chinner
0 siblings, 0 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 1:36 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 12:38:35PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 02:33:08PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:25:36PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Teach the EFI mechanism how to free realtime extents. We're going to
> > > need this to enforce proper ordering of operations when we enable
> > > realtime rmap.
> > >
> > > Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
> > > ops for rt extents. This keeps the ondisk artifacts and processing code
> > > completely separate between the rt and non-rt cases. Hopefully this
> > > will make it easier to debug filesystem problems.
> >
> > Doesn't this now require busy extent tracking for rt extents that
> > are being freed? i.e. they get marked as free with the EFD, but
> > cannot be reallocated (or discarded) until the EFD is committed to
> > disk.
> >
> > We don't allow user data allocation on the data device to reuse busy
> > ranges because the freeing of the extent has not yet been committed
> > to the journal. Because we use async transaction commits, that means
> > we can return to userspace without even the EFI in the journal - it
> > can still be in memory in the CIL. Hence we cannot allow userspace
> > to reallocate that range and write to it, even though it is marked
> > free in the in-memory metadata.
>
> Ah, that's a good point -- in memory the bunmapi -> RTEFI -> RTEFD ->
> rtalloc -> bmapi transactions succeed, userspace writes to the file
> blocks, then the log goes down without completing /any/ of those
> transactions, and now a read of the old file gets new contents.
*nod*
> > If userspace then does a write and then we crash without the
> > original EFI on disk, then we've just violated metadata vs data
> > update ordering because recovery will not replay the extent free nor
> > the new allocation, yet the data in that extent will have been
> > changed.
> >
> > Hence I think that if we are moving to intent based freeing of real
> > time extents, we absolutely need to add support for busy extent
> > tracking to realtime groups before we enable EFIs on realtime
> > groups.....
>
> Yep. As a fringe benefit, we'd be able to support issuing discards from
> FITRIM without holding the rtbitmap lock, and -o discard on rt extents
> too.
Yes. And I suspect that if we unify the perag and rtg into a single
group abstraction, the busy extent tracking will work for both
allocators without much functional change being needed at all...
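A minimal sketch of the busy-extent interlock being discussed: freed space stays "busy" until the transaction that freed it is durable in the log, and the allocator must refuse to hand out overlapping ranges before then. Names and structure here are purely illustrative, not the kernel's actual busy extent implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BUSY	16

/* A freed extent that is not yet safe to reuse. */
struct busy_extent {
	uint64_t	start;
	uint64_t	len;
	uint64_t	commit_lsn;	/* log seq that makes the free durable */
};

static struct busy_extent busy[MAX_BUSY];
static int nbusy;
static uint64_t log_tail_lsn;		/* everything <= this is on disk */

/* Freeing an extent marks it busy until its commit reaches the log. */
static void extent_free(uint64_t start, uint64_t len, uint64_t lsn)
{
	assert(nbusy < MAX_BUSY);
	busy[nbusy++] = (struct busy_extent){ start, len, lsn };
}

/* The log moves forward; drop busy extents whose free is now durable. */
static void log_commit(uint64_t lsn)
{
	int i = 0;

	log_tail_lsn = lsn;
	while (i < nbusy) {
		if (busy[i].commit_lsn <= log_tail_lsn)
			busy[i] = busy[--nbusy];
		else
			i++;
	}
}

/* Allocation must refuse ranges overlapping any busy extent. */
static bool extent_alloc_ok(uint64_t start, uint64_t len)
{
	for (int i = 0; i < nbusy; i++) {
		if (start < busy[i].start + busy[i].len &&
		    busy[i].start < start + len)
			return false;
	}
	return true;
}
```

The failure mode described upthread is exactly what the check prevents: without it, a range freed by an in-memory-only EFI could be reallocated and overwritten before the free ever reaches the journal.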
> > Also ....
> >
> > > @@ -447,6 +467,17 @@ xfs_extent_free_defer_add(
> > >
> > > trace_xfs_extent_free_defer(mp, xefi);
> > >
> > > + if (xfs_efi_is_realtime(xefi)) {
> > > + xfs_rgnumber_t rgno;
> > > +
> > > + rgno = xfs_rtb_to_rgno(mp, xefi->xefi_startblock);
> > > + xefi->xefi_rtg = xfs_rtgroup_get(mp, rgno);
> > > +
> > > + *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> > > + &xfs_rtextent_free_defer_type);
> > > + return;
> > > + }
> > > +
> > > xefi->xefi_pag = xfs_perag_intent_get(mp, xefi->xefi_startblock);
> > > if (xefi->xefi_agresv == XFS_AG_RESV_AGFL)
> > > *dfpp = xfs_defer_add(tp, &xefi->xefi_list,
> >
> > Hmmmm. Isn't this also missing the xfs_drain intent interlocks that
> > allow online repair to wait until all the intents outstanding on a
> > group complete?
>
> Yep. I forgot about that.
Same comment about unified group infrastructure ;)
> > > +
> > > +/* Cancel a realtime extent freeing. */
> > > +STATIC void
> > > +xfs_rtextent_free_cancel_item(
> > > + struct list_head *item)
> > > +{
> > > + struct xfs_extent_free_item *xefi = xefi_entry(item);
> > > +
> > > + xfs_rtgroup_put(xefi->xefi_rtg);
> > > + kmem_cache_free(xfs_extfree_item_cache, xefi);
> > > +}
> > > +
> > > +/* Process a free realtime extent. */
> > > +STATIC int
> > > +xfs_rtextent_free_finish_item(
> > > + struct xfs_trans *tp,
> > > + struct xfs_log_item *done,
> > > + struct list_head *item,
> > > + struct xfs_btree_cur **state)
> >
> > btree cursor ....
> >
> > > +{
> > > + struct xfs_mount *mp = tp->t_mountp;
> > > + struct xfs_extent_free_item *xefi = xefi_entry(item);
> > > + struct xfs_efd_log_item *efdp = EFD_ITEM(done);
> > > + struct xfs_rtgroup **rtgp = (struct xfs_rtgroup **)state;
> >
> > ... but is apparently holding a xfs_rtgroup. that's kinda nasty, and
> > the rtg the xefi is supposed to be associated with is already held
> > by the xefi, so....
>
> It's very nasty, and I preferred when it was just a void**. Maybe we
> should just change that to a:
>
> struct xfs_intent_item_state {
> struct xfs_btree_cur *cur;
> struct xfs_rtgroup *rtg;
> };
>
> and pass that around? At least then the compiler can typecheck that for
> us.
Sounds good to me. :)
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 0:57 ` Dave Chinner
@ 2024-08-27 1:55 ` Darrick J. Wong
2024-08-27 3:00 ` Dave Chinner
2024-08-27 4:44 ` Christoph Hellwig
2024-08-27 4:38 ` Christoph Hellwig
1 sibling, 2 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 1:55 UTC (permalink / raw)
To: Dave Chinner, Christoph Hellwig; +Cc: linux-xfs
On Tue, Aug 27, 2024 at 10:57:34AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 12:14:04PM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 09:56:08AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:17:31PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Create an incore object that will contain information about a realtime
> > > > allocation group. This will eventually enable us to shard the realtime
> > > > section in a similar manner to how we shard the data section, but for
> > > > now just a single object for the entire RT subvolume is created.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/Makefile | 1
> > > > fs/xfs/libxfs/xfs_format.h | 3 +
> > > > fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
> > > > fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
> > > > fs/xfs/libxfs/xfs_sb.c | 7 +
> > > > fs/xfs/libxfs/xfs_types.h | 4 +
> > > > fs/xfs/xfs_log_recover.c | 20 ++++
> > > > fs/xfs/xfs_mount.c | 16 +++
> > > > fs/xfs/xfs_mount.h | 14 +++
> > > > fs/xfs/xfs_rtalloc.c | 6 +
> > > > fs/xfs/xfs_super.c | 1
> > > > fs/xfs/xfs_trace.c | 1
> > > > fs/xfs/xfs_trace.h | 38 ++++++++
> > > > 13 files changed, 517 insertions(+), 2 deletions(-)
> > > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
> > > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
> > >
> > > Ok, how is the global address space for real time extents laid out
> > > across rt groups? i.e. is it sparse similar to how fsbnos and inode
> > > numbers are created for the data device like so?
> > >
> > > fsbno = (agno << agblklog) | agbno
> > >
> > > Or is it something different? I can't find that defined anywhere in
> > > this patch, so I can't determine if the unit conversion code and
> > > validation is correct or not...
> >
> > They're not sparse like fsbnos on the data device, they're laid end to
> > end. IOWs, it's a straight linear translation. If you have an rtgroup
> > that is 50 blocks long, then rtgroup 1 starts at (50 * blocksize).
>
> Yes, I figured that out later. I think that's less than optimal,
> because it essentially repeats the problems we have with AGs being
> fixed size without the potential for fixing it easily. i.e. the
> global sharded fsbno address space is sparse, so we can actually
> space out the sparse address regions to allow future flexibility in
> group size and location work.
>
> By having the rtgroup addressing being purely physical, we're
> completely stuck with fixed sized rtgroups and there is no way
> around that. IOWs, the physical address space sharding repeats the
> existing grow and shrink problems we have with the existing fixed
> size AGs.
>
> We're discussing how to use the sparse fsbno addressing to allow
> resizing of AGs, but we will not be able to do that at all with
> rtgroups as they stand. The limitation is that a 64 bit global rt extent
> address is essentially the physical address of the extent in the block
> device LBA space.
<nod> I /think/ it's pretty simple to convert the rtgroups rtblock
numbers to sparse ala xfs_fsblock_t -- all we have to do is make sure
that mp->m_rgblklog is set to highbit64(rtgroup block count) and then
delete all the multiply/divide code, just like we do on the data device.
The thing I *don't* know is how will this affect hch's zoned device
support -- he's mentioned that rtgroups will eventually have both a size
and a "capacity" to keep the zones aligned to groups, or groups aligned
to zones, I don't remember which. I don't know if segmenting
br_startblock for rt mappings makes things better or worse for that.
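The two addressing schemes under discussion can be sketched side by side. This is illustrative only; the helper names do not match the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sparse addressing, as used for data device fsbnos: the group number
 * lives in the high bits, so a group can be shorter than 1 << blklog
 * without disturbing the addresses of any later group.
 */
static uint64_t sparse_encode(uint32_t gno, uint32_t gbno, int blklog)
{
	return ((uint64_t)gno << blklog) | gbno;
}

static uint32_t sparse_gno(uint64_t bno, int blklog)
{
	return bno >> blklog;
}

/*
 * Linear (purely physical) addressing, as proposed for rtgroups:
 * groups are laid end to end, so every group must stay exactly
 * gblocks long or all later addresses shift.
 */
static uint64_t linear_encode(uint32_t gno, uint32_t gbno, uint32_t gblocks)
{
	return (uint64_t)gno * gblocks + gbno;
}

static uint32_t linear_gno(uint64_t bno, uint32_t gblocks)
{
	return bno / gblocks;
}
```

For the 50-block example above, linear addressing puts rtgroup 1 at block 50, whereas sparse addressing would put it at 1 << blklog regardless of how long group 0 actually is, which is what leaves room for resizing groups later.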
> > This patch, FWIW, refactors the existing rt code so that a !rtgroups
> > filesystem is represented by one large "group", with xfs_rtxnum_t now
> > indexing rt extents within a group.
>
> Right, we can do that regardless of whether we use logical or
> physical addressing for the global rtbno for sharded rtgroup layout.
> The rtgno of 0 for that rtg always results in logical = physical
> addressing.
>
> > Probably it should be renamed to xfs_rgxnum_t.
>
> That might be a good idea.
>
> > Note that we haven't defined the rtgroup ondisk format yet, so I'll go
> > amend that patch to spell out the ondisk format of the brave new world.
>
> Yes, please! That would have made working out all the differences
> between all the combinations of rt, rtx, rg, num, len, blk, etc a
> whole lot easier to work out.
<Nod> I'll go work all that out.
> > > > +struct xfs_rtgroup *
> > > > +xfs_rtgroup_grab(
> > > > + struct xfs_mount *mp,
> > > > + xfs_agnumber_t agno)
> > > > +{
> > > > + struct xfs_rtgroup *rtg;
> > > > +
> > > > + rcu_read_lock();
> > > > + rtg = xa_load(&mp->m_rtgroups, agno);
> > > > + if (rtg) {
> > > > + trace_xfs_rtgroup_grab(rtg, _RET_IP_);
> > > > + if (!atomic_inc_not_zero(&rtg->rtg_active_ref))
> > > > + rtg = NULL;
> > > > + }
> > > > + rcu_read_unlock();
> > > > + return rtg;
> > > > +}
> > > > +
> > > > +void
> > > > +xfs_rtgroup_rele(
> > > > + struct xfs_rtgroup *rtg)
> > > > +{
> > > > + trace_xfs_rtgroup_rele(rtg, _RET_IP_);
> > > > + if (atomic_dec_and_test(&rtg->rtg_active_ref))
> > > > + wake_up(&rtg->rtg_active_wq);
> > > > +}
> > >
> > > This is all duplicates of the xfs_perag code. Can you put together a
> > > patchset to abstract this into a "xfs_group" and embed them in both
> > > the perag and and rtgroup structures?
> > >
> > > That way we only need one set of lookup and iterator infrastructure,
> > > and it will work for both data and rt groups...
> >
> > How will that work with perags still using the radix tree and rtgroups
> > using the xarray? Yes, we should move the perags to use the xarray too
> > (and indeed hch already has a series on list to do that) but now is
> > really not the time to do that because I don't want to frontload a bunch
> > more core changes onto this already huge patchset.
>
> Let's first assume they both use xarray (that's just a matter of
> time, yes?) so it's easier to reason about. Then we have something
> like this:
>
> /*
> * xfs_group - a contiguous 32 bit block address space group
> */
> struct xfs_group {
> struct xarray xarr;
> u32 num_groups;
> };
Ah, that's the group head. I might call this struct xfs_groups?
So ... would it theoretically make more sense to use an rhashtable here?
Insofar as the only place that totally falls down is if you want to
iterate tagged groups; and that's only done for AGs.
I'm ok with using an xarray here, fwiw.
> struct xfs_group_item {
> struct xfs_group *group; /* so put/rele don't need any other context */
> u32 gno;
> atomic_t passive_refs;
> atomic_t active_refs;
> wait_queue_head_t active_wq;
> unsigned long opstate;
>
> u32 blocks; /* length in fsb */
> u32 extents; /* length in extents */
> u32 blk_log; /* extent size in fsb */
>
> /* limits for min/max valid addresses */
> u32 max_addr;
> u32 min_addr;
> };
Yeah, that's pretty much what I had in the prototype that I shredded an
hour ago.
> And then we define:
>
> struct xfs_perag {
> struct xfs_group_item g;
>
> /* perag specific stuff follows */
> ....
> };
>
> struct xfs_rtgroup {
> struct xfs_group_item g;
>
> /* rtg specific stuff follows */
> .....
>
> }
>
> And then a couple of generic macros:
>
> #define to_grpi(grpi, gi) container_of((gi), typeof(grpi), g)
> #define to_gi(grpi) (&(grpi)->g)
>
> though this might be better as just typed macros:
>
> #define gi_to_pag(gi) container_of((gi), struct xfs_perag, g)
> #define gi_to_rtg(gi) container_of((gi), struct xfs_rtgroup, g)
>
> And then all the grab/rele/get/put stuff becomes:
>
> rtg = to_grpi(rtg, xfs_group_grab(mp->m_rtgroups, rgno));
> pag = to_grpi(pag, xfs_group_grab(mp->m_perags, agno));
> ....
> xfs_group_put(&rtg->g);
> xfs_group_put(&pag->g);
>
>
> or
>
> rtg = gi_to_rtg(xfs_group_grab(mp->m_rtgroups, rgno));
> pag = gi_to_pag(xfs_group_grab(mp->m_perags, agno));
> ....
> xfs_group_put(&rtg->g);
> xfs_group_put(&pag->g);
>
>
> then we pass the group to each of the "for_each_group..." iterators
> like so:
>
> for_each_group(&mp->m_perags, agno, pag) {
> /* do stuff with pag */
> }
>
> or
> for_each_group(&mp->m_rtgroups, rtgno, rtg) {
> /* do stuff with rtg */
> }
>
> And we use typeof() and container_of() to access the group structure
> within the pag/rtg. Something like:
>
> #define to_grpi(grpi, gi) container_of((gi), typeof(grpi), g)
> #define to_gi(grpi) (&(grpi)->g)
>
> #define for_each_group(grp, gno, grpi) \
> (gno) = 0; \
> for ((grpi) = to_grpi((grpi), xfs_group_grab((grp), (gno))); \
> (grpi) != NULL; \
> (grpi) = to_grpi(grpi, xfs_group_next((grp), to_gi(grpi), \
> &(gno), (grp)->num_groups))
>
> And now we essentially have common group infrastructure for
> access, iteration, geometry and address verification purposes...
<nod> That's pretty much what I had drafted, albeit with different
helper macros since I kept the for_each_{perag,rtgroup} things around
for type safety. Though I think for_each_perag just becomes:
#define for_each_perag(mp, agno, pag) \
for_each_group((mp)->m_perags, (agno), (pag))
Right?
> >
> > > > +
> > > > +/* Compute the number of rt extents in this realtime group. */
> > > > +xfs_rtxnum_t
> > > > +xfs_rtgroup_extents(
> > > > +	struct xfs_mount	*mp,
> > > > + xfs_rgnumber_t rgno)
> > > > +{
> > > > + xfs_rgnumber_t rgcount = mp->m_sb.sb_rgcount;
> > > > +
> > > > + ASSERT(rgno < rgcount);
> > > > + if (rgno == rgcount - 1)
> > > > + return mp->m_sb.sb_rextents -
> > > > + ((xfs_rtxnum_t)rgno * mp->m_sb.sb_rgextents);
> > >
> > > Urk. So this relies on a non-rtgroup filesystem doing a
> > > multiplication by zero of a field that the on-disk format does not
> > > understand to get the right result. I think this is copying a bad
> > > pattern we've been slowly trying to remove from the normal
> > > allocation group code.
> > >
> > > > +
> > > > + ASSERT(xfs_has_rtgroups(mp));
> > > > + return mp->m_sb.sb_rgextents;
> > > > +}
> > >
> > > We already embed the length of the rtgroup in the rtgroup structure.
> > > THis should be looking up the rtgroup (or being passed the rtgroup
> > > the caller already has) and doing the right thing. i.e.
> > >
> > > if (!rtg || !xfs_has_rtgroups(rtg->rtg_mount))
> > > return mp->m_sb.sb_rextents;
> > > return rtg->rtg_extents;
> >
> > xfs_rtgroup_extents is the function that we use to set rtg->rtg_extents.
>
> That wasn't clear from the context of the patch. Perhaps a better
> name xfs_rtgroup_calc_extents() to indicate that it is a setup
> function, not something that should be regularly called at runtime?
<nod>
> >
> > > > +static inline xfs_rtblock_t
> > > > +xfs_rgno_start_rtb(
> > > > + struct xfs_mount *mp,
> > > > + xfs_rgnumber_t rgno)
> > > > +{
> > > > + if (mp->m_rgblklog >= 0)
> > > > + return ((xfs_rtblock_t)rgno << mp->m_rgblklog);
> > > > + return ((xfs_rtblock_t)rgno * mp->m_rgblocks);
> > > > +}
> > >
> > > Where does mp->m_rgblklog come from? That wasn't added to the
> > > on-disk superblock structure and it is always initialised to zero
> > > in this patch.
> > >
> > > When will m_rgblklog be zero and when will it be non-zero? If it's
> >
> > As I mentioned before, this patch merely ports non-rtg filesystems to
> > use the rtgroup structure. m_rgblklog will be set to nonzero values
> > when we get to defining the ondisk rtgroup structure.
>
> Yeah, which makes some of the context in the patch hard to
> understand... :/
>
> > But, to cut ahead here, m_rgblklog will be set to a non-negative value
> > if the rtgroup size (in blocks) is a power of two. Then these unit
> > conversion functions can use shifts instead of expensive multiplication
> > and divisions. The same goes for rt extent to {fs,rt}block conversions.
>
> yeah, so mp->m_rgblklog is not equivalent of mp->m_agblklog at all.
> It took me some time to understand that - the names are the same,
> they are used in similar address conversions, but they have
> completely different functions.
>
> I suspect we need some better naming here, regardless of the
> rtgroups global address space layout discussion...
Or just make xfs_rtblock_t sparse, in which case I think m_rgblklog
usage patterns become exactly the same as m_agblklog.
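What a precomputed log2 field like m_rgblklog buys can be sketched as follows: when the group size is a power of two, unit conversions collapse to a shift and a mask; otherwise a full multiply or divide is needed. The struct and helper names here are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

struct geom {
	uint32_t	gblocks;	/* blocks per group */
	int		gblklog;	/* log2(gblocks), or -1 if not pow2 */
};

/* First block of group @gno. */
static uint64_t gno_start(const struct geom *g, uint32_t gno)
{
	if (g->gblklog >= 0)
		return (uint64_t)gno << g->gblklog;	/* fast path */
	return (uint64_t)gno * g->gblocks;
}

/* Offset of @bno within its group. */
static uint32_t bno_to_gbno(const struct geom *g, uint64_t bno)
{
	if (g->gblklog >= 0)
		return bno & ((1U << g->gblklog) - 1);	/* mask, no divide */
	return bno % g->gblocks;
}
```

Both paths must agree for power-of-two sizes; the non-power-of-two fallback is what a sparse address space would let us delete entirely.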
> > > > +
> > > > +static inline uint64_t
> > > > +__xfs_rtb_to_rgbno(
> > > > + struct xfs_mount *mp,
> > > > + xfs_rtblock_t rtbno)
> > > > +{
> > > > + uint32_t rem;
> > > > +
> > > > + if (!xfs_has_rtgroups(mp))
> > > > + return rtbno;
> > > > +
> > > > + if (mp->m_rgblklog >= 0)
> > > > + return rtbno & mp->m_rgblkmask;
> > > > +
> > > > + div_u64_rem(rtbno, mp->m_rgblocks, &rem);
> > > > + return rem;
> > > > +}
> > >
> > > Why is this function returning a uint64_t - a xfs_rgblock_t is only
> > > a 32 bit type...
> >
> > group 0 on a !rtg filesystem can be 64-bits in block/rt count. This is
> > a /very/ annoying pain point -- if you actually created such a
> > filesystem it actually would never work because the rtsummary file would
> > be created undersized due to an integer overflow, but the verifiers
> > never checked any of that, and due to the same overflow the rtallocator
> > would search the wrong places and (eventually) fall back to a dumb
> > linear scan.
> >
> > Soooooo this is an obnoxious usecase (broken large !rtg filesystems)
> > that we can't just drop, though I'm pretty sure there aren't any systems
> > in the wild.
>
> Ugh. That definitely needs to be a comment somewhere in the code to
> explain this. :/
Well it's all in the commit that fixed the rtsummary for those things.
> > > > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > > > index a8cd44d03ef64..1ce4b9eb16f47 100644
> > > > --- a/fs/xfs/libxfs/xfs_types.h
> > > > +++ b/fs/xfs/libxfs/xfs_types.h
> > > > @@ -9,10 +9,12 @@
> > > > typedef uint32_t prid_t; /* project ID */
> > > >
> > > > typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
> > > > +typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
> > >
> > > Is that right? The rtg length is 2^32 * rtextsize, and rtextsize can
> > > be 2^20 bytes:
> > >
> > > #define XFS_MAX_RTEXTSIZE (1024 * 1024 * 1024)
> >
> > No, the maximum rtgroup length is 2^32-1 blocks.
>
> I couldn't tell if the max length was being defined as the maximum
> number of rt extents that the rtgroup could index, or whether it was
> the maximum number of filesystem blocks (i.e. data device fsblock
> size) that an rtgroup could index...
The max rtgroup length is defined in blocks; the min is defined in rt
extents. I might want to bump up the minimum a bit, but I think
Christoph should weigh in on that first -- I think his zns patchset
currently assigns one rtgroup to each zone? Because he was muttering
about how 130,000x 256MB rtgroups really sucks. Would it be very messy
to have a minimum size of (say) 1GB?
> > > Hence for a 4kB fsbno filesystem, the actual maximum size of an rtg
> > > in filesystem blocks far exceeds what we can address with a 32 bit
> > > variable.
> > >
> > > If xfs_rgblock_t is actually indexing multi-fsbno rtextents, then it
> > > is an extent number index, not a "block" index. An extent number
> > > index won't overflow 32 bits (because the rtg has a max of 2^32 - 1
> > > rtextents)
> > >
> > > IOWs, shouldn't this be named soemthing like:
> > >
> > > typedef uint32_t xfs_rgext_t; /* extent number in realtime group */
> >
> > and again, we can't do that because we emulate !rtg filesystems with a
> > single "rtgroup" that can be more than 2^32 rtx long.
>
> *nod*
>
> > > > typedef uint32_t xfs_agino_t; /* inode # within allocation grp */
> > > > typedef uint32_t xfs_extlen_t; /* extent length in blocks */
> > > > typedef uint32_t xfs_rtxlen_t; /* file extent length in rtextents */
> > > > typedef uint32_t xfs_agnumber_t; /* allocation group number */
> > > > +typedef uint32_t xfs_rgnumber_t; /* realtime group number */
> > > > typedef uint64_t xfs_extnum_t; /* # of extents in a file */
> > > > typedef uint32_t xfs_aextnum_t; /* # extents in an attribute fork */
> > > > typedef int64_t xfs_fsize_t; /* bytes in a file */
> > > > @@ -53,7 +55,9 @@ typedef void * xfs_failaddr_t;
> > > > #define NULLFILEOFF ((xfs_fileoff_t)-1)
> > > >
> > > > #define NULLAGBLOCK ((xfs_agblock_t)-1)
> > > > +#define NULLRGBLOCK ((xfs_rgblock_t)-1)
> > > > #define NULLAGNUMBER ((xfs_agnumber_t)-1)
> > > > +#define NULLRGNUMBER ((xfs_rgnumber_t)-1)
> > >
> > > What's the maximum valid rtg number? We're not ever going to be
> > > supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
> > > this at and validate it at?
> >
> > /me shrugs -- the smallest AG size on the data device is 16M, which
> > technically speaking means that one /could/ format 2^(63-24) groups,
> > or order 39.
> >
> > Realistically with the maximum rtgroup size of 2^31 blocks, we probably
> > only need 2^(63 - (31 + 10)) = 2^22 rtgroups max on a 1k fsblock fs.
>
> Right, those are the theoretical maximums. Practically speaking,
> though, mkfs and mount iteration of all AGs means millions to
> billions of IOs need to be done before the filesystem can even be
> fully mounted. Hence the practical limit to AG count is closer to a
> few tens of thousands, not hundreds of billions.
>
> Hence I'm wondering if we should actually cap the maximum number of
> rtgroups. We're just about at BS > PS, so with a 64k block size a
> single rtgroup can index 2^32 * 2^16 bytes which puts individual
> rtgs at 256TB in size. Unless there are use cases for rtgroup sizes
> smaller than a few GBs, I just don't see the need for supporting
> theoretical maximum counts on tiny block size filesystems. Thirty
> thousand rtgs at 256TB per rtg puts us at 64 bit device size limits,
> and we hit those limits on 4kB block sizes at around 500,000 rtgs.
>
> So do we need to support millions of rtgs? I'd say no....
...but we might. Christoph, how gnarly does zns support get if you have
to be able to pack multiple SMR zones into a single rtgroup?
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
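The sizing arithmetic in this exchange can be checked directly. This sketch follows the 2^31-block maximum rtgroup size and the 2^63-byte device address space assumed in the discussion (illustrative helper, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Maximum number of max-sized (2^31 block) rtgroups that fit in a
 * 2^63 byte device, for a given filesystem block size log2.
 */
static uint64_t max_rtgroups(uint32_t blocklog)
{
	uint64_t max_group_bytes = (1ULL << 31) << blocklog;

	return (1ULL << 63) / max_group_bytes;
}
```

So a 1k-block filesystem tops out at 2^22 rtgroups and a 64k-block one at 2^16, which is the basis for the "do we really need millions of rtgs" question.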
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-26 19:40 ` Darrick J. Wong
@ 2024-08-27 1:56 ` Dave Chinner
2024-08-27 2:16 ` Darrick J. Wong
2024-08-27 5:00 ` Christoph Hellwig
1 sibling, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 1:56 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 12:40:28PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 02:56:37PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:26:38PM -0700, Darrick J. Wong wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > >
> > > Make the allocator rtgroup aware by either picking a specific group if
> > > there is a hint, or loop over all groups otherwise. A simple rotor is
> > > provided to pick the placement for initial allocations.
> > >
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/libxfs/xfs_bmap.c | 13 +++++-
> > > fs/xfs/libxfs/xfs_rtbitmap.c | 6 ++-
> > > fs/xfs/xfs_mount.h | 1
> > > fs/xfs/xfs_rtalloc.c | 98 ++++++++++++++++++++++++++++++++++++++----
> > > 4 files changed, 105 insertions(+), 13 deletions(-)
> > >
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 126a0d253654a..88c62e1158ac7 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3151,8 +3151,17 @@ xfs_bmap_adjacent_valid(
> > > struct xfs_mount *mp = ap->ip->i_mount;
> > >
> > > if (XFS_IS_REALTIME_INODE(ap->ip) &&
> > > - (ap->datatype & XFS_ALLOC_USERDATA))
> > > - return x < mp->m_sb.sb_rblocks;
> > > + (ap->datatype & XFS_ALLOC_USERDATA)) {
> > > + if (x >= mp->m_sb.sb_rblocks)
> > > + return false;
> > > + if (!xfs_has_rtgroups(mp))
> > > + return true;
> > > +
> > > + return xfs_rtb_to_rgno(mp, x) == xfs_rtb_to_rgno(mp, y) &&
> > > + xfs_rtb_to_rgno(mp, x) < mp->m_sb.sb_rgcount &&
> > > + xfs_rtb_to_rtx(mp, x) < mp->m_sb.sb_rgextents;
> >
> > Why do we need the xfs_has_rtgroups() check here? The new rtg logic will
> > return true for an old school rt device here, right?
>
> The incore sb_rgextents is zero on !rtg filesystems, so we need the
> xfs_has_rtgroups.
Hmmm. Could we initialise it in memory only for !rtg filesystems,
and make sure we never write it back via a check in the
xfs_sb_to_disk() formatter function?
That would remove one of the problematic in-memory differences
between old skool rtdev setups and the new rtg-based setups...
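The suggestion could look something like this toy sketch: synthesize the group geometry in memory at mount time for old rt filesystems, and strip it again on the way to disk. Struct and function names here are invented for illustration, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

struct toy_sb {
	uint64_t	rextents;	/* total rt extents */
	uint32_t	rgcount;	/* number of rt groups */
	uint64_t	rgextents;	/* extents per group */
	int		has_rtgroups;	/* feature bit */
};

/* At mount: synthesize one whole-volume group for !rtg filesystems. */
static void sb_mount_fixup(struct toy_sb *sb)
{
	if (!sb->has_rtgroups) {
		sb->rgcount = 1;
		sb->rgextents = sb->rextents;
	}
}

/* At write-out: zero the synthetic fields so they never hit disk. */
static void sb_to_disk(const struct toy_sb *in, struct toy_sb *out)
{
	*out = *in;
	if (!in->has_rtgroups) {
		out->rgcount = 0;
		out->rgextents = 0;
	}
}
```

With that in place the xfs_has_rtgroups() special case in the allocator predicates could go away, since sb_rgextents would always be meaningful in memory.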
> > > @@ -1835,9 +1908,16 @@ xfs_bmap_rtalloc(
> > > if (xfs_bmap_adjacent(ap))
> > > bno_hint = ap->blkno;
> > >
> > > - error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
> > > - ap->wasdel, initial_user_data, &rtlocked,
> > > - &ap->blkno, &ap->length);
> > > + if (xfs_has_rtgroups(ap->ip->i_mount)) {
> > > + error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
> > > + prod, ap->wasdel, initial_user_data,
> > > + &ap->blkno, &ap->length);
> > > + } else {
> > > + error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
> > > + prod, ap->wasdel, initial_user_data,
> > > + &rtlocked, &ap->blkno, &ap->length);
> > > + }
> >
> > The xfs_has_rtgroups() check is unnecessary. The iterator in
> > xfs_rtallocate_rtgs() will do the right thing for the
> > !xfs_has_rtgroups() case - it'll set start_rgno = 0 and break out
> > after a single call to xfs_rtallocate_rtg() with rgno = 0.
> >
> > Another thing that probably should be done here is push all the
> > constant value calculations a couple of functions down the stack to
> > where they are used. Then we only need to pass two parameters down
> > through the rg iterator here, not 11...
>
> ..and pass the ap itself too, to remove three of the parameters?
Yeah, I was thinking that the iterator only needs the bno_hint
to determine which group to start iterating. Everything else is
derived from information in the ap structure and so doesn't need to
be calculated above the iterator.
Though we could just lift the xfs_rtalloc_args() up to this level
and stuff all the parameters into that structure and pass it down
instead (like we do with xfs_alloc_args for the btree allocator).
Then we only need to pass args through xfs_rtallocate(),
xfs_rtallocate_extent_near/size() and all the other helper
functions, too.
That's a much bigger sort of cleanup, though, but I think it would
be worth doing at some point because it would bring the rtalloc code
closer to how the btalloc code is structured. And, perhaps, allow us
to potentially share group selection and iteration code between
the bt and rt allocators in future...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-27 0:58 ` Dave Chinner
@ 2024-08-27 1:56 ` Darrick J. Wong
2024-08-27 3:00 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 1:56 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Tue, Aug 27, 2024 at 10:58:59AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 09:58:05AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:18:02PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Add a dynamic lockdep class key for rtgroup inodes. This will enable
> > > > lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
> > > > order. Each class can have 8 subclasses, and for now we will only have
> > > > 2 inodes per group. This enables rtgroup order and inode order checks
> > > > when nesting ILOCKs.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 52 insertions(+)
> > > >
> > > >
> > > > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > index 51f04cad5227c..ae6d67c673b1a 100644
> > > > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > > > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > @@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
> > > > if (rtglock_flags & XFS_RTGLOCK_BITMAP)
> > > > xfs_rtbitmap_trans_join(tp);
> > > > }
> > > > +
> > > > +#ifdef CONFIG_PROVE_LOCKING
> > > > +static struct lock_class_key xfs_rtginode_lock_class;
> > > > +
> > > > +static int
> > > > +xfs_rtginode_ilock_cmp_fn(
> > > > + const struct lockdep_map *m1,
> > > > + const struct lockdep_map *m2)
> > > > +{
> > > > + const struct xfs_inode *ip1 =
> > > > + container_of(m1, struct xfs_inode, i_lock.dep_map);
> > > > + const struct xfs_inode *ip2 =
> > > > + container_of(m2, struct xfs_inode, i_lock.dep_map);
> > > > +
> > > > + if (ip1->i_projid < ip2->i_projid)
> > > > + return -1;
> > > > + if (ip1->i_projid > ip2->i_projid)
> > > > + return 1;
> > > > + return 0;
> > > > +}
> > >
> > > What's the project ID of the inode got to do with realtime groups?
> >
> > Each rtgroup metadata file stores its group number in i_projid so that
> > mount can detect if there's a corruption in /rtgroup and we just opened
> > the bitmap from the wrong group.
> >
> > We can also use lockdep to detect code that locks rtgroup metadata in
> > the wrong order. Potentially we could use this _cmp_fn to enforce that
> > we always ILOCK in the order bitmap -> summary -> rmap -> refcount based
> > on i_metatype.
>
> Ok, can we union the i_projid field (both in memory and in the
> on-disk structure) so that dual use of the field is well documented
> by the code?
Sounds good to me. Does
union {
xfs_prid_t i_projid;
uint32_t i_metagroup;
};
sound ok?
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 14/24] xfs: support caching rtgroup metadata inodes
2024-08-27 1:05 ` Dave Chinner
@ 2024-08-27 2:01 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 2:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Tue, Aug 27, 2024 at 11:05:53AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 11:37:34AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 11:41:19AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:18:18PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > Create the necessary per-rtgroup infrastructure that we need to load
> > > > metadata inodes into memory.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/libxfs/xfs_rtgroup.c | 182 +++++++++++++++++++++++++++++++++++++++++++
> > > > fs/xfs/libxfs/xfs_rtgroup.h | 28 +++++++
> > > > fs/xfs/xfs_mount.h | 1
> > > > fs/xfs/xfs_rtalloc.c | 48 +++++++++++
> > > > 4 files changed, 258 insertions(+), 1 deletion(-)
> > > >
> > > >
> > > > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > index ae6d67c673b1a..50e4a56d749f0 100644
> > > > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > > > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > @@ -30,6 +30,8 @@
> > > > #include "xfs_icache.h"
> > > > #include "xfs_rtgroup.h"
> > > > #include "xfs_rtbitmap.h"
> > > > +#include "xfs_metafile.h"
> > > > +#include "xfs_metadir.h"
> > > >
> > > > /*
> > > > * Passive reference counting access wrappers to the rtgroup structures. If
> > > > @@ -295,3 +297,183 @@ xfs_rtginode_lockdep_setup(
> > > > #else
> > > > #define xfs_rtginode_lockdep_setup(ip, rgno, type) do { } while (0)
> > > > #endif /* CONFIG_PROVE_LOCKING */
> > > > +
> > > > +struct xfs_rtginode_ops {
> > > > + const char *name; /* short name */
> > > > +
> > > > + enum xfs_metafile_type metafile_type;
> > > > +
> > > > + /* Does the fs have this feature? */
> > > > + bool (*enabled)(struct xfs_mount *mp);
> > > > +
> > > > + /* Create this rtgroup metadata inode and initialize it. */
> > > > + int (*create)(struct xfs_rtgroup *rtg,
> > > > + struct xfs_inode *ip,
> > > > + struct xfs_trans *tp,
> > > > + bool init);
> > > > +};
> > >
> > > What's all this for?
> > >
> > > AFAICT, loading the inodes into the rtgs requires a call to
> > > xfs_metadir_load() when initialising the rtg (either at mount or
> > > lazily on the first access to the rtg). Hence I'm not really sure
> > > what this complexity is needed for, and the commit message is not
> > > very informative....
> >
> > Yes, the creation and mkdir code in here is really to support growfs,
> > mkfs, and repair. How about I change the commit message to:
> >
> > "Create the necessary per-rtgroup infrastructure that we need to load
> > metadata inodes into memory and to create directory trees on the fly.
> > Loading is needed by the mounting process. Creation is needed by
> > growfs, mkfs, and repair."
>
> IMO it would have been nicer to add this with the patch that
> adds growfs support for rtgs. That way the initial inode loading
> would be much easier to understand and review, and the rest of it
> would have enough context to be able to review it sanely. There
> isn't enough context in this patch to determine if the creation code
> is sane or works correctly....
<nod> I think that's doable. I also want to change the name to
->init_inode because that's the only thing it can really do at the point
that we're creating inodes in growfs.
>
> > > > + path = xfs_rtginode_path(rtg->rtg_rgno, type);
> > > > + if (!path)
> > > > + return -ENOMEM;
> > > > + error = xfs_metadir_load(tp, mp->m_rtdirip, path, ops->metafile_type,
> > > > + &ip);
> > > > + kfree(path);
> > > > +
> > > > + if (error)
> > > > + return error;
> > > > +
> > > > + if (XFS_IS_CORRUPT(mp, ip->i_df.if_format != XFS_DINODE_FMT_EXTENTS &&
> > > > + ip->i_df.if_format != XFS_DINODE_FMT_BTREE)) {
> > > > + xfs_irele(ip);
> > > > + return -EFSCORRUPTED;
> > > > + }
> > >
> > > We don't support LOCAL format for any type of regular file inodes,
> > > so I'm a little confiused as to why this wouldn't be caught by the
> > > verifier on inode read? i.e. What problem is this trying to catch,
> > > and why doesn't the inode verifier catch it for us?
> >
> > This is really more of a placeholder for more refactorings coming down
> > the line for the rtrmap patchset, which will create a new
> > XFS_DINODE_FMT_RMAP. At that time we'll need to check that an inode
> > that we are loading to be the rmap btree actually has that set.
>
> Ok, can you leave a comment to indicate this so I don't have to
> remember why this code exists?
Will do.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-27 1:56 ` Dave Chinner
@ 2024-08-27 2:16 ` Darrick J. Wong
2024-08-27 5:00 ` Christoph Hellwig
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 2:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs
On Tue, Aug 27, 2024 at 11:56:31AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 12:40:28PM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 02:56:37PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:26:38PM -0700, Darrick J. Wong wrote:
> > > > From: Christoph Hellwig <hch@lst.de>
> > > >
> > > > Make the allocator rtgroup aware by either picking a specific group if
> > > > there is a hint, or loop over all groups otherwise. A simple rotor is
> > > > provided to pick the placement for initial allocations.
> > > >
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/libxfs/xfs_bmap.c | 13 +++++-
> > > > fs/xfs/libxfs/xfs_rtbitmap.c | 6 ++-
> > > > fs/xfs/xfs_mount.h | 1
> > > > fs/xfs/xfs_rtalloc.c | 98 ++++++++++++++++++++++++++++++++++++++----
> > > > 4 files changed, 105 insertions(+), 13 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index 126a0d253654a..88c62e1158ac7 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -3151,8 +3151,17 @@ xfs_bmap_adjacent_valid(
> > > > struct xfs_mount *mp = ap->ip->i_mount;
> > > >
> > > > if (XFS_IS_REALTIME_INODE(ap->ip) &&
> > > > - (ap->datatype & XFS_ALLOC_USERDATA))
> > > > - return x < mp->m_sb.sb_rblocks;
> > > > + (ap->datatype & XFS_ALLOC_USERDATA)) {
> > > > + if (x >= mp->m_sb.sb_rblocks)
> > > > + return false;
> > > > + if (!xfs_has_rtgroups(mp))
> > > > + return true;
> > > > +
> > > > + return xfs_rtb_to_rgno(mp, x) == xfs_rtb_to_rgno(mp, y) &&
> > > > + xfs_rtb_to_rgno(mp, x) < mp->m_sb.sb_rgcount &&
> > > > + xfs_rtb_to_rtx(mp, x) < mp->m_sb.sb_rgextents;
> > >
> > > Why do we need the xfs_has_rtgroups() check here? The new rtg logic will
> > > return true for an old school rt device here, right?
> >
> > The incore sb_rgextents is zero on !rtg filesystems, so we need the
> > xfs_has_rtgroups.
>
> Hmmm. Could we initialise it in memory only for !rtg filesystems,
> and make sure we never write it back via a check in the
> xfs_sb_to_disk() formatter function?
Only if the incore sb_rgextents becomes u64, which will then cause the
incore and ondisk superblock structures not to match anymore. There's
probably not much reason to keep them the same anymore. That said, up
until recently the metadir patchset actually broke the two apart, but
then hch and I put things back to reduce our own confusion.
> That would remove one of the problematic in-memory differences
> between old skool rtdev setups and the new rtg-based setups...
>
> > > > @@ -1835,9 +1908,16 @@ xfs_bmap_rtalloc(
> > > > if (xfs_bmap_adjacent(ap))
> > > > bno_hint = ap->blkno;
> > > >
> > > > - error = xfs_rtallocate(ap->tp, bno_hint, raminlen, ralen, prod,
> > > > - ap->wasdel, initial_user_data, &rtlocked,
> > > > - &ap->blkno, &ap->length);
> > > > + if (xfs_has_rtgroups(ap->ip->i_mount)) {
> > > > + error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
> > > > + prod, ap->wasdel, initial_user_data,
> > > > + &ap->blkno, &ap->length);
> > > > + } else {
> > > > + error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
> > > > + prod, ap->wasdel, initial_user_data,
> > > > + &rtlocked, &ap->blkno, &ap->length);
> > > > + }
> > >
> > > The xfs_has_rtgroups() check is unnecessary. The iterator in
> > > xfs_rtallocate_rtgs() will do the right thing for the
> > > !xfs_has_rtgroups() case - it'll set start_rgno = 0 and break out
> > > after a single call to xfs_rtallocate_rtg() with rgno = 0.
> > >
> > > Another thing that probably should be done here is push all the
> > > constant value calculations a couple of functions down the stack to
> > > where they are used. Then we only need to pass two parameters down
> > > through the rg iterator here, not 11...
> >
> > ..and pass the ap itself too, to remove three of the parameters?
>
> Yeah, I was thinking that the iterator only needs the bno_hint
> to determine which group to start iterating. Everything else is
> derived from information in the ap structure and so doesn't need to
> be calculated above the iterator.
>
> Though we could just lift the xfs_rtalloc_args() up to this level
> and stuff all the parameters into that structure and pass it down
> instead (like we do with xfs_alloc_args for the btree allocator).
> Then we only need to pass args through xfs_rtallocate(),
> xfs_rtallocate_extent_near/size() and all the other helper
> functions, too.
>
> That's a much bigger sort of cleanup, though, but I think it would
> be worth doing a some point because it would bring the rtalloc code
> closer to how the btalloc code is structured. And, perhaps, allow us
> to potentially share group selection and iteration code between
> the bt and rt allocators in future...
Well we're already tearing the rt allocator to pieces and rebuilding it,
so why not...
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-26 18:07 ` Darrick J. Wong
@ 2024-08-27 2:16 ` Dave Chinner
2024-08-27 3:16 ` Darrick J. Wong
0 siblings, 1 reply; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 2:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 11:07:47AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 26, 2024 at 07:52:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2024 at 05:29:15PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > When metadir is enabled, we want to check the two new rtgroups fields,
> > > and we don't want to check the old inumbers that are now in the metadir.
> > >
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/scrub/agheader.c | 36 ++++++++++++++++++++++++------------
> > > 1 file changed, 24 insertions(+), 12 deletions(-)
> > >
> > >
> > > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > > index cad997f38a424..0d22d70950a5c 100644
> > > --- a/fs/xfs/scrub/agheader.c
> > > +++ b/fs/xfs/scrub/agheader.c
> > > @@ -147,14 +147,14 @@ xchk_superblock(
> > > if (xfs_has_metadir(sc->mp)) {
> > > if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
> > > xchk_block_set_preen(sc, bp);
> > > + } else {
> > > + if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > > + xchk_block_set_preen(sc, bp);
> > > +
> > > + if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > > + xchk_block_set_preen(sc, bp);
> > > }
> > >
> > > - if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > > - xchk_block_set_preen(sc, bp);
> > > -
> > > - if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > > - xchk_block_set_preen(sc, bp);
> > > -
> >
> > If metadir is enabled, then shouldn't sb->sb_rbmino/sb_rsumino both
> > be NULLFSINO to indicate they aren't valid?
>
> The ondisk sb values aren't defined anymore and we set the incore values
> to NULLFSINO (and never write that back out) so there's not much to
> check anymore. I guess we could check that they're all zero or
> something, which is what mkfs writes out, though my intent here was to
> leave them as undefined bits, figuring that if we ever want to reuse
> those fields we're going to define a new incompat bit anyway.
>
> OTOH now would be the time to define what the field contents are
> supposed to be -- zero or NULLFSINO?
Yeah, I think it's best to give them a solid definition, that way we
don't bump up against "we can't tell if it has never been used
before" problems.
>
> > Given the rt inodes should have a well defined value even when
> > metadir is enabled, I would say the current code that is validating
> > the values are consistent with the primary across all secondary
> > superblocks is correct and this change is unnecessary....
> >
> >
> > > @@ -229,11 +229,13 @@ xchk_superblock(
> > > * sb_icount, sb_ifree, sb_fdblocks, sb_frexents
> > > */
> > >
> > > - if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > > - xchk_block_set_preen(sc, bp);
> > > + if (!xfs_has_metadir(mp)) {
> > > + if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > > + xchk_block_set_preen(sc, bp);
> > >
> > > - if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > > - xchk_block_set_preen(sc, bp);
> > > + if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > > + xchk_block_set_preen(sc, bp);
> > > + }
> >
> > Same - if metadir is in use and quota inodes are in the metadir,
> > then the superblock quota inodes should be NULLFSINO....
>
> Ok, I'll go with NULLFSINO ondisk and in memory.
OK.
Just to add to that (because I looked), mkfs.xfs does this to
initialise rtino numbers before they are allocated:
$ git grep NULLFSINO mkfs
mkfs/xfs_mkfs.c: sbp->sb_rootino = sbp->sb_rbmino = sbp->sb_rsumino = NULLFSINO;
$
and repair does this for quota inodes when clearing the superblock
inode fields:
$ git grep NULLFSINO repair/dinode.c
repair/dinode.c: mp->m_sb.sb_uquotino = NULLFSINO;
repair/dinode.c: mp->m_sb.sb_gquotino = NULLFSINO;
repair/dinode.c: mp->m_sb.sb_pquotino = NULLFSINO;
$
So the current code is typically using NULLFSINO instead of zero on
disk for "inode does not exist".
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 1:55 ` Darrick J. Wong
@ 2024-08-27 3:00 ` Dave Chinner
2024-08-27 4:44 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 3:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 06:55:58PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 27, 2024 at 10:57:34AM +1000, Dave Chinner wrote:
> > On Mon, Aug 26, 2024 at 12:14:04PM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 26, 2024 at 09:56:08AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 22, 2024 at 05:17:31PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Create an incore object that will contain information about a realtime
> > > > > allocation group. This will eventually enable us to shard the realtime
> > > > > section in a similar manner to how we shard the data section, but for
> > > > > now just a single object for the entire RT subvolume is created.
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/xfs/Makefile | 1
> > > > > fs/xfs/libxfs/xfs_format.h | 3 +
> > > > > fs/xfs/libxfs/xfs_rtgroup.c | 196 ++++++++++++++++++++++++++++++++++++++++
> > > > > fs/xfs/libxfs/xfs_rtgroup.h | 212 +++++++++++++++++++++++++++++++++++++++++++
> > > > > fs/xfs/libxfs/xfs_sb.c | 7 +
> > > > > fs/xfs/libxfs/xfs_types.h | 4 +
> > > > > fs/xfs/xfs_log_recover.c | 20 ++++
> > > > > fs/xfs/xfs_mount.c | 16 +++
> > > > > fs/xfs/xfs_mount.h | 14 +++
> > > > > fs/xfs/xfs_rtalloc.c | 6 +
> > > > > fs/xfs/xfs_super.c | 1
> > > > > fs/xfs/xfs_trace.c | 1
> > > > > fs/xfs/xfs_trace.h | 38 ++++++++
> > > > > 13 files changed, 517 insertions(+), 2 deletions(-)
> > > > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.c
> > > > > create mode 100644 fs/xfs/libxfs/xfs_rtgroup.h
> > > >
> > > > Ok, how is the global address space for real time extents laid out
> > > > across rt groups? i.e. is it sparse similar to how fsbnos and inode
> > > > numbers are created for the data device like so?
> > > >
> > > > fsbno = (agno << agblklog) | agbno
> > > >
> > > > Or is it something different? I can't find that defined anywhere in
> > > > this patch, so I can't determine if the unit conversion code and
> > > > validation is correct or not...
> > >
> > > They're not sparse like fsbnos on the data device, they're laid end to
> > > end. IOWs, it's a straight linear translation. If you have an rtgroup
> > > that is 50 blocks long, then rtgroup 1 starts at (50 * blocksize).
> >
> > Yes, I figured that out later. I think that's less than optimal,
> > because it essentially repeats the problems we have with AGs being
> > fixed size without the potential for fixing it easily. i.e. the
> > global sharded fsbno address space is sparse, so we can actually
> > space out the sparse address regions to allow future flexibility in
> > group size and location work.
> >
> > By having the rtgroup addressing being purely physical, we're
> > completely stuck with fixed sized rtgroups and there is no way
> > around that. IOWs, the physical address space sharding repeats the
> > existing grow and shrink problems we have with the existing fixed
> > size AGs.
> >
> > We're discussing how to use the sparse fsbno addressing to allow
> > resizing of AGs, but we will not be able to do that at all with
> > rtgroups as they stand. The limitation is a 64 bit global rt extent
> > address is essential the physical address of the extent in the block
> > device LBA space.
>
> <nod> I /think/ it's pretty simple to convert the rtgroups rtblock
> numbers to sparse ala xfs_fsblock_t -- all we have to do is make sure
> that mp->m_rgblklog is set to highbit64(rtgroup block count) and then
> delete all the multiply/divide code, just like we do on the data device.
>
> The thing I *don't* know is how will this affect hch's zoned device
> support -- he's mentioned that rtgroups will eventually have both a size
> and a "capacity" to keep the zones aligned to groups, or groups aligned
> to zones, I don't remember which. I don't know if segmenting
> br_startblock for rt mappings makes things better or worse for that.
I can't really comment on that because I haven't heard anything
about this requirement. It kinda sounds like sparse addressing just
with different names, but I'm just guessing there. Maybe Christoph
can educate us here...
> > > > This is all duplicates of the xfs_perag code. Can you put together a
> > > > patchset to abstract this into a "xfs_group" and embed them in both
> > > > the perag and and rtgroup structures?
> > > >
> > > > That way we only need one set of lookup and iterator infrastructure,
> > > > and it will work for both data and rt groups...
> > >
> > > How will that work with perags still using the radix tree and rtgroups
> > > using the xarray? Yes, we should move the perags to use the xarray too
> > > (and indeed hch already has a series on list to do that) but here's
> > > really not the time to do that because I don't want to frontload a bunch
> > > more core changes onto this already huge patchset.
> >
> > Let's first assume they both use xarray (that's just a matter of
> > time, yes?) so it's easier to reason about. Then we have something
> > like this:
> >
> > /*
> > * xfs_group - a contiguous 32 bit block address space group
> > */
> > struct xfs_group {
> > 	struct xarray	xarr;
> > 	u32		num_groups;
> > };
>
> Ah, that's the group head. I might call this struct xfs_groups?
Sure.
>
> So ... would it theoretically make more sense to use an rhashtable here?
> Insofar as the only place that totally falls down is if you want to
> iterate tagged groups; and that's only done for AGs.
The index is contiguous and starts at zero, so it packs extremely
well into an xarray. For small numbers of groups (i.e. the vast
majority of installations) item lookup is essentially O(1) (single
node), and it scales out at O(log N) for large numbers and random
access. It also has efficient sequential iteration, which is what
we mostly do with groups.
rhashtable has an advantage at scale of being mostly O(1), but it
comes at an increased memory footprint and terrible ordered iteration
behaviour even ignoring tags (essentially random memory
access).
> I'm ok with using an xarray here, fwiw.
OK.
> > then we pass the group to each of the "for_each_group..." iterators
> > like so:
> >
> > 	for_each_group(&mp->m_perags, agno, pag) {
> > 		/* do stuff with pag */
> > 	}
> >
> > or
> >
> > 	for_each_group(&mp->m_rtgroups, rtgno, rtg) {
> > 		/* do stuff with rtg */
> > 	}
> >
> > And we use typeof() and container_of() to access the group structure
> > within the pag/rtg. Something like:
> >
> > #define to_grpi(grpi, gi) container_of((gi), typeof(grpi), g)
> > #define to_gi(grpi) (&(grpi)->g)
> >
> > #define for_each_group(grp, gno, grpi) \
> > (gno) = 0; \
> > for ((grpi) = to_grpi((grpi), xfs_group_grab((grp), (gno))); \
> > (grpi) != NULL; \
> > (grpi) = to_grpi(grpi, xfs_group_next((grp), to_gi(grpi), \
> > &(gno), (grp)->num_groups))
> >
> > And now we essentially have common group infrastructure for
> > access, iteration, geometry and address verification purposes...
>
> <nod> That's pretty much what I had drafted, albeit with different
> helper macros since I kept the for_each_{perag,rtgroup} things around
> for type safety. Though I think for_each_perag just becomes:
>
> #define for_each_perag(mp, agno, pag) \
> for_each_group((mp)->m_perags, (agno), (pag))
>
> Right?
Yeah, that's what I thought of doing first, but then figured a little
bit of compiler magic gets rid of the need for the type specific
iterator wrappers altogether...
> > > > > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > > > > index a8cd44d03ef64..1ce4b9eb16f47 100644
> > > > > --- a/fs/xfs/libxfs/xfs_types.h
> > > > > +++ b/fs/xfs/libxfs/xfs_types.h
> > > > > @@ -9,10 +9,12 @@
> > > > > typedef uint32_t prid_t; /* project ID */
> > > > >
> > > > > typedef uint32_t xfs_agblock_t; /* blockno in alloc. group */
> > > > > +typedef uint32_t xfs_rgblock_t; /* blockno in realtime group */
> > > >
> > > > Is that right? The rtg length is 2^32 * rtextsize, and rtextsize can
> > > > be 2^20 bytes:
> > > >
> > > > #define XFS_MAX_RTEXTSIZE (1024 * 1024 * 1024)
> > >
> > > No, the maximum rtgroup length is 2^32-1 blocks.
> >
> > I couldn't tell if the max length was being defined as the maximum
> > number of rt extents that the rtgroup could index, or whether it was
> > the maximum number of filesystem blocks (i.e. data device fsblock
> > size) that an rtgroup could index...
>
> The max rtgroup length is defined in blocks; the min is defined in rt
> extents.
I think that's part of the problem - can we define min and max in
the same units? Or have two sets of definitions - one for each unit?
> I might want to bump up the minimum a bit, but I think
> Christoph should weigh in on that first -- I think his zns patchset
> currently assigns one rtgroup to each zone? Because he was muttering
> about how 130,000x 256MB rtgroups really sucks.
Ah, that might be the capacity vs size thing - to allow rtgroups to
be sized as an integer multiple of the zone capacity and so have an
rtgroup for every N contiguous zones....
> Would it be very messy
> to have a minimum size of (say) 1GB?
I was thinking of larger than that, but the question comes down to
how *small* do we need to support for rtg based rtdevs? I was
thinking that hundreds of GB would be the smallest size device we
might deploy this sort of feature on, in which case somewhere around
50GB would be the typical minimum rtg size...
I'm kind of worried that 1GB sizes still allow the crazy growfs small
to huge capacity problems we have with AGs. It's probably a good
place to start, but I think larger would be better...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes
2024-08-27 1:56 ` Darrick J. Wong
@ 2024-08-27 3:00 ` Dave Chinner
0 siblings, 0 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 3:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs
On Mon, Aug 26, 2024 at 06:56:58PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 27, 2024 at 10:58:59AM +1000, Dave Chinner wrote:
> > On Mon, Aug 26, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 26, 2024 at 09:58:05AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 22, 2024 at 05:18:02PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > Add a dynamic lockdep class key for rtgroup inodes. This will enable
> > > > > lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
> > > > > order. Each class can have 8 subclasses, and for now we will only have
> > > > > 2 inodes per group. This enables rtgroup order and inode order checks
> > > > > when nesting ILOCKs.
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/xfs/libxfs/xfs_rtgroup.c | 52 +++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 52 insertions(+)
> > > > >
> > > > >
> > > > > diff --git a/fs/xfs/libxfs/xfs_rtgroup.c b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > > index 51f04cad5227c..ae6d67c673b1a 100644
> > > > > --- a/fs/xfs/libxfs/xfs_rtgroup.c
> > > > > +++ b/fs/xfs/libxfs/xfs_rtgroup.c
> > > > > @@ -243,3 +243,55 @@ xfs_rtgroup_trans_join(
> > > > > if (rtglock_flags & XFS_RTGLOCK_BITMAP)
> > > > > xfs_rtbitmap_trans_join(tp);
> > > > > }
> > > > > +
> > > > > +#ifdef CONFIG_PROVE_LOCKING
> > > > > +static struct lock_class_key xfs_rtginode_lock_class;
> > > > > +
> > > > > +static int
> > > > > +xfs_rtginode_ilock_cmp_fn(
> > > > > + const struct lockdep_map *m1,
> > > > > + const struct lockdep_map *m2)
> > > > > +{
> > > > > + const struct xfs_inode *ip1 =
> > > > > + container_of(m1, struct xfs_inode, i_lock.dep_map);
> > > > > + const struct xfs_inode *ip2 =
> > > > > + container_of(m2, struct xfs_inode, i_lock.dep_map);
> > > > > +
> > > > > + if (ip1->i_projid < ip2->i_projid)
> > > > > + return -1;
> > > > > + if (ip1->i_projid > ip2->i_projid)
> > > > > + return 1;
> > > > > + return 0;
> > > > > +}
> > > >
> > > > What's the project ID of the inode got to do with realtime groups?
> > >
> > > Each rtgroup metadata file stores its group number in i_projid so that
> > > mount can detect if there's a corruption in /rtgroup and we just opened
> > > the bitmap from the wrong group.
> > >
> > > We can also use lockdep to detect code that locks rtgroup metadata in
> > > the wrong order. Potentially we could use this _cmp_fn to enforce that
> > > we always ILOCK in the order bitmap -> summary -> rmap -> refcount based
> > > on i_metatype.
> >
> > Ok, can we union the i_projid field (both in memory and in the
> > on-disk structure) so that dual use of the field is well documented
> > by the code?
>
> Sounds good to me. Does
>
> 	union {
> 		xfs_prid_t	i_projid;
> 		uint32_t	i_metagroup;
> 	};
>
> sound ok?
Yup.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/6] xfs: update sb field checks when metadir is turned on
2024-08-27 2:16 ` Dave Chinner
@ 2024-08-27 3:16 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 3:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: hch, linux-xfs
On Tue, Aug 27, 2024 at 12:16:43PM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 11:07:47AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 07:52:43PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:29:15PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > When metadir is enabled, we want to check the two new rtgroups fields,
> > > > and we don't want to check the old inumbers that are now in the metadir.
> > > >
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/scrub/agheader.c | 36 ++++++++++++++++++++++++------------
> > > > 1 file changed, 24 insertions(+), 12 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> > > > index cad997f38a424..0d22d70950a5c 100644
> > > > --- a/fs/xfs/scrub/agheader.c
> > > > +++ b/fs/xfs/scrub/agheader.c
> > > > @@ -147,14 +147,14 @@ xchk_superblock(
> > > > if (xfs_has_metadir(sc->mp)) {
> > > > if (sb->sb_metadirino != cpu_to_be64(mp->m_sb.sb_metadirino))
> > > > xchk_block_set_preen(sc, bp);
> > > > + } else {
> > > > + if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > > > + xchk_block_set_preen(sc, bp);
> > > > +
> > > > + if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > > > + xchk_block_set_preen(sc, bp);
> > > > }
> > > >
> > > > - if (sb->sb_rbmino != cpu_to_be64(mp->m_sb.sb_rbmino))
> > > > - xchk_block_set_preen(sc, bp);
> > > > -
> > > > - if (sb->sb_rsumino != cpu_to_be64(mp->m_sb.sb_rsumino))
> > > > - xchk_block_set_preen(sc, bp);
> > > > -
> > >
> > > If metadir is enabled, then shouldn't sb->sb_rbmino/sb_rsumino both
> > > be NULLFSINO to indicate they aren't valid?
> >
> > The ondisk sb values aren't defined anymore and we set the incore values
> > to NULLFSINO (and never write that back out) so there's not much to
> > check anymore. I guess we could check that they're all zero or
> > something, which is what mkfs writes out, though my intent here was to
> > leave them as undefined bits, figuring that if we ever want to reuse
> > those fields we're going to define a new incompat bit anyway.
> >
> > OTOH now would be the time to define what the field contents are
> > supposed to be -- zero or NULLFSINO?
>
> Yeah, I think it's best to give them a solid definition, that way we
> don't bump up against "we can't tell if it has never been used
> before" problems.
>
> >
> > > Given the rt inodes should have a well defined value even when
> > > metadir is enabled, I would say the current code that is validating
> > > the values are consistent with the primary across all secondary
> > > superblocks is correct and this change is unnecessary....
> > >
> > >
> > > > @@ -229,11 +229,13 @@ xchk_superblock(
> > > > * sb_icount, sb_ifree, sb_fdblocks, sb_frexents
> > > > */
> > > >
> > > > - if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > > > - xchk_block_set_preen(sc, bp);
> > > > + if (!xfs_has_metadir(mp)) {
> > > > + if (sb->sb_uquotino != cpu_to_be64(mp->m_sb.sb_uquotino))
> > > > + xchk_block_set_preen(sc, bp);
> > > >
> > > > - if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > > > - xchk_block_set_preen(sc, bp);
> > > > + if (sb->sb_gquotino != cpu_to_be64(mp->m_sb.sb_gquotino))
> > > > + xchk_block_set_preen(sc, bp);
> > > > + }
> > >
> > > Same - if metadir is in use and quota inodes are in the metadir,
> > > then the superblock quota inodes should be NULLFSINO....
> >
> > Ok, I'll go with NULLFSINO ondisk and in memory.
>
> OK.
>
> Just to add to that (because I looked), mkfs.xfs does this to
> initialise rtino numbers before they are allocated:
>
> $ git grep NULLFSINO mkfs
> mkfs/xfs_mkfs.c: sbp->sb_rootino = sbp->sb_rbmino = sbp->sb_rsumino = NULLFSINO;
> $
>
> and repair does this for quota inodes when clearing the superblock
> inode fields:
>
> $ git grep NULLFSINO repair/dinode.c
> repair/dinode.c: mp->m_sb.sb_uquotino = NULLFSINO;
> repair/dinode.c: mp->m_sb.sb_gquotino = NULLFSINO;
> repair/dinode.c: mp->m_sb.sb_pquotino = NULLFSINO;
> $
>
> So the current code is typically using NULLFSINO instead of zero on
> disk for "inode does not exist".
<nod> Though I noticed that it writes out sb_[ugp]quotino = 0.
Christoph once remarked that those parts of the sb were at some point
unused, so they were zero, and they only become NULLFSINO once someone
turns on QUOTABIT in sb_versionnum.
Regardless, all 1s is ok by me.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-26 19:14 ` Darrick J. Wong
2024-08-27 0:57 ` Dave Chinner
@ 2024-08-27 4:27 ` Christoph Hellwig
2024-08-27 5:19 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 4:27 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, hch, linux-xfs
On Mon, Aug 26, 2024 at 12:14:04PM -0700, Darrick J. Wong wrote:
> They're not sparse like fsbnos on the data device, they're laid end to
> end. IOWs, it's a straight linear translation. If you have an rtgroup
> that is 50 blocks long, then rtgroup 1 starts at (50 * blocksize).
Except with the zone capacity features on ZNS devices, where they
already are sparse. But that's like 200 patches away from the state
here..
> group 0 on a !rtg filesystem can be 64-bits in block/rt count. This is
> a /very/ annoying pain point -- if you actually created such a
> filesystem it would never work because the rtsummary file would
> be created undersized due to an integer overflow, but the verifiers
> never checked any of that, and due to the same overflow the rtallocator
> would search the wrong places and (eventually) fall back to a dumb
> linear scan.
>
> Soooooo this is an obnoxious usecase (broken large !rtg filesystems)
> that we can't just drop, though I'm pretty sure there aren't any systems
> in the wild.
So, do we really need to support that? I think we've always supported
a 64-bit block count, so we'll have to support that, but if a > 32bit
extent count was always broken, maybe we should simply stop pretending
to support it?
> > What's the maximum valid rtg number? We're not ever going to be
> > supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
> > this at and validate it at?
>
> /me shrugs -- the smallest AG size on the data device is 16M, which
> technically speaking means that one /could/ format 2^(63-24) groups,
> or order 39.
>
> Realistically with the maximum rtgroup size of 2^31 blocks, we probably
> only need 2^(63 - (31 + 10)) = 2^22 rtgroups max on a 1k fsblock fs.
Note that with zoned file systems later on we are bound by hardware
size. SMR HDDs by convention come with 256MB zones. This is a bit
on the small side, but grouping multiple of those into an RT group
would be a major pain. I hope the hardware zone size will eventually
increase, maybe when they move to 3-digit TB capacity points.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-27 1:29 ` Dave Chinner
@ 2024-08-27 4:27 ` Darrick J. Wong
2024-08-27 22:16 ` Dave Chinner
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 4:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs
On Tue, Aug 27, 2024 at 11:29:40AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2024 at 11:27:34AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 26, 2024 at 12:06:58PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2024 at 05:20:07PM -0700, Darrick J. Wong wrote:
> > > > From: Christoph Hellwig <hch@lst.de>
> > > >
> > > > Split the check that the rtsummary fits into the log into a separate
> > > > helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
> > > > geometry.
> > > >
> > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > [djwong: avoid division for the 0-rtx growfs check]
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
> > > > 1 file changed, 29 insertions(+), 14 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > > > index 61231b1dc4b79..78a3879ad6193 100644
> > > > --- a/fs/xfs/xfs_rtalloc.c
> > > > +++ b/fs/xfs/xfs_rtalloc.c
> > > > @@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
> > > > return error;
> > > > }
> > > >
> > > > +static int
> > > > +xfs_growfs_check_rtgeom(
> > > > + const struct xfs_mount *mp,
> > > > + xfs_rfsblock_t rblocks,
> > > > + xfs_extlen_t rextsize)
> > > > +{
> > > > + struct xfs_mount *nmp;
> > > > + int error = 0;
> > > > +
> > > > + nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
> > > > + if (!nmp)
> > > > + return -ENOMEM;
> > > > +
> > > > + /*
> > > > + * New summary size can't be more than half the size of the log. This
> > > > + * prevents us from getting a log overflow, since we'll log basically
> > > > + * the whole summary file at once.
> > > > + */
> > > > + if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
> > > > + error = -EINVAL;
> > >
> > > FWIW, the new size needs to be smaller than that, because the "half
> > > the log size" must include all the log metadata needed to
> > > encapsulate that object. The growfs transaction also logs inodes and
> > > the superblock, so that also takes away from the maximum size of
> > > the summary file....
> >
> > <shrug> It's the same logic as what's there now, and there haven't been
> > any bug reports, have there?
>
> No, none that I know of - it was just an observation that the code
> doesn't actually guarantee what the comment says it should do.
>
> > Though I suppose that's just a reduction
> > of what? One block for the rtbitmap, and (conservatively) two inodes
> > and a superblock?
>
> The rtbitmap update might touch a lot more than one block. The newly
> allocated space in the rtbitmap inode is initialised to zeros, and
> so the xfs_rtfree_range() call from the growfs code to mark the new
> space free has to write all 1s to that range of the rtbitmap. This
> is all done in a single transaction, so we might actually be logging
> a *lot* of rtbitmap buffers here.
>
> IIRC, there is a bit per rtextent, so in a 4kB buffer we can mark
> 32768 rtextents free. If they are 4kB each, then that's 128MB of
> space tracked per rtbitmap block. This adds up to roughly 3.5MB of
> log space for the rtbitmap updates per TB of grown rtdev space....
>
> So, yeah, I think that calculation and comment is inaccurate, but we
> don't have to fix this right now.
The kernel only "frees" the new space one rbmblock at a time, so I think
that's why this calculation has never misfired. I /think/ that means
that each transaction only ends up logging two rtsummary blocks at a
time? One to decrement a counter, and another to increment the counter
one level up?
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 0:57 ` Dave Chinner
2024-08-27 1:55 ` Darrick J. Wong
@ 2024-08-27 4:38 ` Christoph Hellwig
2024-08-27 5:17 ` Darrick J. Wong
1 sibling, 1 reply; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 4:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, hch, linux-xfs
On Tue, Aug 27, 2024 at 10:57:34AM +1000, Dave Chinner wrote:
> We're discussing how to use the sparse fsbno addressing to allow
> resizing of AGs, but we will not be able to do that at all with
> rtgroups as they stand. The limitation is a 64 bit global rt extent
> address is essentially the physical address of the extent in the block
> device LBA space.
With this series there are no global RT extent addresses; the extents
are always relative to the group, and the extent number is an entity
used only in the allocator.
> /*
> * xfs_group - a contiguous 32 bit block address space group
> */
> struct xfs_group {
> struct xarray xarr;
> u32 num_groups;
> };
>
> struct xfs_group_item {
> struct xfs_group *group; /* so put/rele don't need any other context */
> u32 gno;
> atomic_t passive_refs;
> atomic_t active_refs;
What is the point of splitting the group and group_item? This isn't
done in the current perag structure either.
> Hence I'm wondering if we should actually cap the maximum number of
> rtgroups. We're just about at BS > PS, so with a 64k block size a
> single rtgroup can index 2^32 * 2^16 bytes which puts individual
> rtgs at 256TB in size. Unless there are use cases for rtgroup sizes
> smaller than a few GBs, I just don't see the need to support
> theoretical maximum counts on tiny block size filesystems. Thirty
> thousand rtgs at 256TB per rtg puts us at 64 bit device size limits,
> and we hit those limits on 4kB block sizes at around 500,000 rtgs.
>
> So do we need to support millions of rtgs? I'd say no....
As said before hardware is having a word with the 256GB hardware
zone size in SMR HDDs. I hope that size will eventually increase, but
I would not bet my house on it.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 1:55 ` Darrick J. Wong
2024-08-27 3:00 ` Dave Chinner
@ 2024-08-27 4:44 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 4:44 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs
On Mon, Aug 26, 2024 at 06:55:58PM -0700, Darrick J. Wong wrote:
> The thing I *don't* know is how will this affect hch's zoned device
> support -- he's mentioned that rtgroups will eventually have both a size
> and a "capacity" to keep the zones aligned to groups, or groups aligned
> to zones, I don't remember which. I don't know if segmenting
> br_startblock for rt mappings makes things better or worse for that.
This should be fine. The ZNS zone capacity feature, where zones have
a size (the LBA space allocated to them) and a capacity (the LBAs that
can actually be written to), is the hardware equivalent of this.
> So ... would it theoretically make more sense to use an rhashtable here?
> Insofar as the only place that totally falls down is if you want to
> iterate tagged groups; and that's only done for AGs.
It also is an important part of garbage collection for zoned XFS, where
we'll use it on RTGs.
> >
> > #define for_each_group(grp, gno, grpi) \
> > (gno) = 0; \
> > for ((grpi) = to_grpi((grpi), xfs_group_grab((grp), (gno))); \
> > (grpi) != NULL; \
> > (grpi) = to_grpi(grpi, xfs_group_next((grp), to_gi(grpi), \
> > &(gno), (grp)->num_groups))
> >
> > And now we essentially have common group infrastructure for
> > access, iteration, geometry and address verification purposes...
>
> <nod> That's pretty much what I had drafted, albeit with different
> helper macros since I kept the for_each_{perag,rtgroup} things around
> for type safety. Though I think for_each_perag just becomes:
>
> #define for_each_perag(mp, agno, pag) \
> for_each_group((mp)->m_perags, (agno), (pag))
>
> Right?
Btw, if we touch all of this anyway I'd drop the agno argument.
We can get the group number from the group struct (see my perag xarray
conversion series for an example where I'm doing this for the tagged
iteration).
>
> The max rtgroup length is defined in blocks; the min is defined in rt
> extents. I might want to bump up the minimum a bit, but I think
> Christoph should weigh in on that first -- I think his zns patchset
> currently assigns one rtgroup to each zone? Because he was muttering
> about how 130,000x 256MB rtgroups really sucks. Would it be very messy
> to have a minimum size of (say) 1GB?
Very messy. I can live with a minimum of 256 MB, but no byte less :)
This is the size used by all shipping SMR hard drives. For ZNS SSDs
there are samples with very small zone sizes that are basically open
channel devices in disguise - no sane person would want them and they
don't make sense to support in XFS as they require extensive erasure
encoding and error correction. The ZNS drives with full data integrity
support have zone sizes and capacities way over 1GB and growing.
> > and we hit those limits on 4kB block sizes at around 500,000 rtgs.
> >
> > So do we need to support millions of rtgs? I'd say no....
>
> ...but we might. Christoph, how gnarly does zns support get if you have
> to be able to pack multiple SMR zones into a single rtgroup?
I thought about it, but it creates real accounting nightmares. It's
not entirely undoable, but it's really messy.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-26 4:56 ` Dave Chinner
2024-08-26 19:40 ` Darrick J. Wong
@ 2024-08-27 4:59 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 4:59 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 02:56:37PM +1000, Dave Chinner wrote:
> > + if (xfs_has_rtgroups(ap->ip->i_mount)) {
> > + error = xfs_rtallocate_rtgs(ap->tp, bno_hint, raminlen, ralen,
> > + prod, ap->wasdel, initial_user_data,
> > + &ap->blkno, &ap->length);
> > + } else {
> > + error = xfs_rtallocate_rtg(ap->tp, 0, bno_hint, raminlen, ralen,
> > + prod, ap->wasdel, initial_user_data,
> > + &rtlocked, &ap->blkno, &ap->length);
> > + }
>
> The xfs_has_rtgroups() check is unnecessary. The iterator in
> xfs_rtallocate_rtgs() will do the right thing for the
> !xfs_has_rtgroups() case - it'll set start_rgno = 0 and break out
> after a single call to xfs_rtallocate_rtg() with rgno = 0.
The iterator itself does, but the start_rgno calculation does not.
But we can make that conditional, which shouldn't be too bad especially
if we merge xfs_rtallocate_rtgs into xfs_bmap_rtalloc.
> Another thing that probably should be done here is push all the
> constant value calculations a couple of functions down the stack to
> where they are used. Then we only need to pass two parameters down
> through the rg iterator here, not 11...
Well, not too much of that actually is constant.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-26 19:40 ` Darrick J. Wong
2024-08-27 1:56 ` Dave Chinner
@ 2024-08-27 5:00 ` Christoph Hellwig
1 sibling, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 5:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 12:40:28PM -0700, Darrick J. Wong wrote:
> ..and pass the ap itself too, to remove three of the parameters?
I tried that earlier, but it breaks the allocation added from the
repair code later in your tree. We could fake up a partial
xfs_bmalloca there, but that seemed pretty ugly.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/26] xfs: make the RT allocator rtgroup aware
2024-08-27 2:16 ` Darrick J. Wong
@ 2024-08-27 5:00 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 5:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 07:16:09PM -0700, Darrick J. Wong wrote:
> > Hmmm. Could we initialise it in memory only for !rtg filesystems,
> > and make sure we never write it back via a check in the
> > xfs_sb_to_disk() formatter function?
>
> Only if the incore sb_rgextents becomes u64, which will then cause the
> incore and ondisk superblock structures not to match anymore. There's
> probably not much reason to keep them the same anymore. That said, up
> until recently the metadir patchset actually broke the two apart, but
> then hch and I put things back to reduce our own confusion.
Note that the incore sb really isn't much of a thing. It's a random
structure that only exists embedded into the XFS mount. The only
reason we keep adding fields to it is because some of the conversion
functions from/to disk are a mess. The RT growfs cleanups earlier
in this patchbomb actually take care of a large part of that, so
we should be able to retire it in the not too distant future.
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 4:38 ` Christoph Hellwig
@ 2024-08-27 5:17 ` Darrick J. Wong
2024-08-27 5:18 ` Christoph Hellwig
0 siblings, 1 reply; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 5:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, hch, linux-xfs
On Mon, Aug 26, 2024 at 09:38:16PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 27, 2024 at 10:57:34AM +1000, Dave Chinner wrote:
> > We're discussing how to use the sparse fsbno addressing to allow
> > resizing of AGs, but we will not be able to do that at all with
> > rtgroups as they stand. The limitation is a 64 bit global rt extent
> > address is essentially the physical address of the extent in the block
> > device LBA space.
>
> With this series there are no global RT extent addresses; the extents
> are always relative to the group, and the extent number is an entity
> used only in the allocator.
>
> > /*
> > * xfs_group - a contiguous 32 bit block address space group
> > */
> > struct xfs_group {
> > struct xarray xarr;
> > u32 num_groups;
> > };
> >
> > struct xfs_group_item {
> > struct xfs_group *group; /* so put/rele don't need any other context */
> > u32 gno;
> > atomic_t passive_refs;
> > atomic_t active_refs;
>
> What is the point of splitting the group and group_item? This isn't
> done in the current perag structure either.
I think xfs_group encapsulates/replaces the radix tree root in struct
xfs_mount, and the xarray inside it points to xfs_group_item objects.
> > Hence I'm wondering if we should actually cap the maximum number of
> > rtgroups. We're just about at BS > PS, so with a 64k block size a
> > single rtgroup can index 2^32 * 2^16 bytes which puts individual
> > rtgs at 256TB in size. Unless there are use cases for rtgroup sizes
> > smaller than a few GBs, I just don't see the need to support
> > theoretical maximum counts on tiny block size filesystems. Thirty
> > thousand rtgs at 256TB per rtg puts us at 64 bit device size limits,
> > and we hit those limits on 4kB block sizes at around 500,000 rtgs.
> >
> > So do we need to support millions of rtgs? I'd say no....
>
> As said before hardware is having a word with the 256GB hardware
> zone size in SMR HDDs. I hope that size will eventually increase, but
> I would not bet my house on it.
Wait, 256 *gigabytes*? That wouldn't be such a bad minimum.
--D
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 5:17 ` Darrick J. Wong
@ 2024-08-27 5:18 ` Christoph Hellwig
0 siblings, 0 replies; 271+ messages in thread
From: Christoph Hellwig @ 2024-08-27 5:18 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, hch, linux-xfs
On Mon, Aug 26, 2024 at 10:17:19PM -0700, Darrick J. Wong wrote:
> > allocator.
> >
> > > /*
> > > * xfs_group - a contiguous 32 bit block address space group
> > > */
> > > struct xfs_group {
> > > struct xarray xarr;
> > > u32 num_groups;
> > > };
> > >
> > > struct xfs_group_item {
> > > struct xfs_group *group; /* so put/rele don't need any other context */
> > > u32 gno;
> > > atomic_t passive_refs;
> > > atomic_t active_refs;
> >
> > What is the point of splitting the group and group_item? This isn't
> > done in the current perag structure either.
>
> I think xfs_group encapsulates/replaces the radix tree root in struct
> xfs_mount, and the xarray inside it points to xfs_group_item objects.
Ahh. So it's not a xfs_group structure, but a xfs_groups one,
with the group item really being xfs_group.
>
> > > Hence I'm wondering if we should actually cap the maximum number of
> > > rtgroups. We're just about at BS > PS, so with a 64k block size a
> > > single rtgroup can index 2^32 * 2^16 bytes which puts individual
> > > rtgs at 256TB in size. Unless there are use cases for rtgroup sizes
> > > smaller than a few GBs, I just don't see the need to support
> > > theoretical maximum counts on tiny block size filesystems. Thirty
> > > thousand rtgs at 256TB per rtg puts us at 64 bit device size limits,
> > > and we hit those limits on 4kB block sizes at around 500,000 rtgs.
> > >
> > > So do we need to support millions of rtgs? I'd say no....
> >
> > As said before hardware is having a word with the 256GB hardware
> > zone size in SMR HDDs. I hope that size will eventually increase, but
> > I would not bet my house on it.
>
> Wait, 256 *gigabytes*? That wouldn't be such a bad minimum.
Sorry, MB. My units really suck this morning :)
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 11/24] xfs: create incore realtime group structures
2024-08-27 4:27 ` Christoph Hellwig
@ 2024-08-27 5:19 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-08-27 5:19 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Dave Chinner, hch, linux-xfs
On Mon, Aug 26, 2024 at 09:27:03PM -0700, Christoph Hellwig wrote:
> On Mon, Aug 26, 2024 at 12:14:04PM -0700, Darrick J. Wong wrote:
> > They're not sparse like fsbnos on the data device, they're laid end to
> > end. IOWs, it's a straight linear translation. If you have an rtgroup
> > that is 50 blocks long, then rtgroup 1 starts at (50 * blocksize).
>
> Except with the zone capacity features on ZNS devices, where they
> already are sparse. But that's like 200 patches away from the state
> here..
Heh.
> > group 0 on a !rtg filesystem can be 64-bits in block/rt count. This is
> > a /very/ annoying pain point -- if you actually created such a
> > filesystem it would never work because the rtsummary file would
> > be created undersized due to an integer overflow, but the verifiers
> > never checked any of that, and due to the same overflow the rtallocator
> > would search the wrong places and (eventually) fall back to a dumb
> > linear scan.
> >
> > Soooooo this is an obnoxious usecase (broken large !rtg filesystems)
> > that we can't just drop, though I'm pretty sure there aren't any systems
> > in the wild.
>
> So, do we really need to support that? I think we've always supported
> a 64-bit block count, so we'll have to support that, but if a > 32bit
> extent count was always broken, maybe we should simply stop pretending
> to support it?
I'm in favor of that. The rextslog computation only got fixed in 6.8,
which means none of the LTS kernels really have it yet. And the ones
that do are migrating verrrrry slowly due to the global rtbmp lock.
> > > What's the maximum valid rtg number? We're not ever going to be
> > > supporting 2^32 - 2 rtgs, so what is a realistic maximum we can cap
> > > this at and validate it at?
> >
> > /me shrugs -- the smallest AG size on the data device is 16M, which
> > technically speaking means that one /could/ format 2^(63-24) groups,
> > or order 39.
> >
> > Realistically with the maximum rtgroup size of 2^31 blocks, we probably
> > only need 2^(63 - (31 + 10)) = 2^22 rtgroups max on a 1k fsblock fs.
>
> Note that with zoned file systems later on we are bound by hardware
> size. SMR HDDs by convention come with 256MB zones. This is a bit
> on the small side, but grouping multiple of those into an RT group
> would be a major pain. I hope the hardware zone size will eventually
> increase, maybe when they move to 3-digit TB capacity points.
<nod>
--D
^ permalink raw reply [flat|nested] 271+ messages in thread
* Re: [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper
2024-08-27 4:27 ` Darrick J. Wong
@ 2024-08-27 22:16 ` Dave Chinner
0 siblings, 0 replies; 271+ messages in thread
From: Dave Chinner @ 2024-08-27 22:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs
On Mon, Aug 26, 2024 at 09:27:24PM -0700, Darrick J. Wong wrote:
> On Tue, Aug 27, 2024 at 11:29:40AM +1000, Dave Chinner wrote:
> > On Mon, Aug 26, 2024 at 11:27:34AM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 26, 2024 at 12:06:58PM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 22, 2024 at 05:20:07PM -0700, Darrick J. Wong wrote:
> > > > > From: Christoph Hellwig <hch@lst.de>
> > > > >
> > > > > Split the check that the rtsummary fits into the log into a separate
> > > > > helper, and use xfs_growfs_rt_alloc_fake_mount to calculate the new RT
> > > > > geometry.
> > > > >
> > > > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > > [djwong: avoid division for the 0-rtx growfs check]
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/xfs/xfs_rtalloc.c | 43 +++++++++++++++++++++++++++++--------------
> > > > > 1 file changed, 29 insertions(+), 14 deletions(-)
> > > > >
> > > > >
> > > > > diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> > > > > index 61231b1dc4b79..78a3879ad6193 100644
> > > > > --- a/fs/xfs/xfs_rtalloc.c
> > > > > +++ b/fs/xfs/xfs_rtalloc.c
> > > > > @@ -1023,6 +1023,31 @@ xfs_growfs_rtg(
> > > > > return error;
> > > > > }
> > > > >
> > > > > +static int
> > > > > +xfs_growfs_check_rtgeom(
> > > > > + const struct xfs_mount *mp,
> > > > > + xfs_rfsblock_t rblocks,
> > > > > + xfs_extlen_t rextsize)
> > > > > +{
> > > > > + struct xfs_mount *nmp;
> > > > > + int error = 0;
> > > > > +
> > > > > + nmp = xfs_growfs_rt_alloc_fake_mount(mp, rblocks, rextsize);
> > > > > + if (!nmp)
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + /*
> > > > > + * New summary size can't be more than half the size of the log. This
> > > > > + * prevents us from getting a log overflow, since we'll log basically
> > > > > + * the whole summary file at once.
> > > > > + */
> > > > > + if (nmp->m_rsumblocks > (mp->m_sb.sb_logblocks >> 1))
> > > > > + error = -EINVAL;
> > > >
> > > > FWIW, the new size needs to be smaller than that, because the "half
> > > > the log size" must include all the log metadata needed to
> > > > encapsulate that object. The growfs transaction also logs inodes and
> > > > the superblock, so that also takes away from the maximum size of
> > > > the summary file....
> > >
> > > <shrug> It's the same logic as what's there now, and there haven't been
> > > any bug reports, have there?
> >
> > No, none that I know of - it was just an observation that the code
> > doesn't actually guarantee what the comment says it should do.
> >
> > > Though I suppose that's just a reduction
> > > of what? One block for the rtbitmap, and (conservatively) two inodes
> > > and a superblock?
> >
> > The rtbitmap update might touch a lot more than one block. The newly
> > allocated space in the rtbitmap inode is initialised to zeros, and
> > so the xfs_rtfree_range() call from the growfs code to mark the new
> > space free has to write all 1s to that range of the rtbitmap. This
> > is all done in a single transaction, so we might actually be logging
> > a *lot* of rtbitmap buffers here.
> >
> > IIRC, there is a bit per rtextent, so in a 4kB buffer we can mark
> > 32768 rtextents free. If they are 4kB each, then that's 128MB of
> > space tracked per rtbitmap block. This adds up to roughly 3.5MB of
> > log space for the rtbitmap updates per TB of grown rtdev space....
> >
> > So, yeah, I think that calculation and comment is inaccurate, but we
> > don't have to fix this right now.
>
> The kernel only "frees" the new space one rbmblock at a time, so I think
> that's why this calculation has never misfired.
Not quite. It iterates over all the rbmblocks in the given range (i.e.
the entire extent being freed) in xfs_rtmodify_range() in
a single transaction, but...
> I /think/ that means
> that each transaction only ends up logging two rtsummary blocks at a
> time? One to decrement a counter, and another to increment the counter
> one level up?
... we only do one update of the summary blocks per extent being
freed (i.e. in xfs_rtfree_range() after the call to
xfs_rtmodify_range()). So, yes, we should only end up logging two
rtsummary blocks per extent being freed, but the number of rbmblocks
logged in that same transaction is O(extent length).
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 271+ messages in thread
* [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block
2024-09-02 18:21 [PATCHSET v4.2 4/8] xfs: fixes for the realtime allocator Darrick J. Wong
@ 2024-09-02 18:28 ` Darrick J. Wong
0 siblings, 0 replies; 271+ messages in thread
From: Darrick J. Wong @ 2024-09-02 18:28 UTC (permalink / raw)
To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
The loop conditional here is not quite correct because an rtbitmap block
can represent rtextents beyond the end of the rt volume. There's no way
that it makes sense to scan for free space beyond EOFS, so don't do it.
This overrun has been present since v2.6.0.
Also fix the type of bestlen, which was incorrectly converted.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_rtalloc.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index c65ee8d1d38d..58081ce5247b 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -229,22 +229,20 @@ xfs_rtallocate_extent_block(
xfs_rtxnum_t *rtx) /* out: start rtext allocated */
{
struct xfs_mount *mp = args->mp;
- xfs_rtxnum_t besti; /* best rtext found so far */
- xfs_rtxnum_t bestlen;/* best length found so far */
+ xfs_rtxnum_t besti = -1; /* best rtext found so far */
xfs_rtxnum_t end; /* last rtext in chunk */
- int error;
xfs_rtxnum_t i; /* current rtext trying */
xfs_rtxnum_t next; /* next rtext to try */
+ xfs_rtxlen_t bestlen = 0; /* best length found so far */
int stat; /* status from internal calls */
+ int error;
/*
- * Loop over all the extents starting in this bitmap block,
- * looking for one that's long enough.
+ * Loop over all the extents starting in this bitmap block up to the
+ * end of the rt volume, looking for one that's long enough.
*/
- for (i = xfs_rbmblock_to_rtx(mp, bbno), besti = -1, bestlen = 0,
- end = xfs_rbmblock_to_rtx(mp, bbno + 1) - 1;
- i <= end;
- i++) {
+ end = min(mp->m_sb.sb_rextents, xfs_rbmblock_to_rtx(mp, bbno + 1)) - 1;
+ for (i = xfs_rbmblock_to_rtx(mp, bbno); i <= end; i++) {
/* Make sure we don't scan off the end of the rt volume. */
maxlen = xfs_rtallocate_clamp_len(mp, i, maxlen, prod);
^ permalink raw reply related [flat|nested] 271+ messages in thread
end of thread, other threads:[~2024-09-02 18:28 UTC | newest]
Thread overview: 271+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-22 23:52 [PATCHBOMB 6.12] xfs: metadata directories and realtime groups Darrick J. Wong
2024-08-22 23:56 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Darrick J. Wong
2024-08-22 23:59 ` [PATCH 1/9] xfs: fix di_onlink checking for V1/V2 inodes Darrick J. Wong
2024-08-22 23:59 ` [PATCH 2/9] xfs: fix folio dirtying for XFILE_ALLOC callers Darrick J. Wong
2024-08-22 23:59 ` [PATCH 3/9] xfs: xfs_finobt_count_blocks() walks the wrong btree Darrick J. Wong
2024-08-22 23:59 ` [PATCH 4/9] xfs: don't bother reporting blocks trimmed via FITRIM Darrick J. Wong
2024-08-23 0:00 ` [PATCH 5/9] xfs: Fix the owner setting issue for rmap query in xfs fsmap Darrick J. Wong
2024-08-23 4:10 ` Christoph Hellwig
2024-08-23 0:00 ` [PATCH 6/9] xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Darrick J. Wong
2024-08-23 4:10 ` Christoph Hellwig
2024-08-23 0:00 ` [PATCH 7/9] xfs: Fix missing interval for missing_owner in xfs fsmap Darrick J. Wong
2024-08-26 3:58 ` Zizhi Wo
2024-08-23 0:00 ` [PATCH 8/9] xfs: take m_growlock when running growfsrt Darrick J. Wong
2024-08-23 4:08 ` Christoph Hellwig
2024-08-23 0:01 ` [PATCH 9/9] xfs: reset rootdir extent size hint after growfsrt Darrick J. Wong
2024-08-23 4:09 ` Christoph Hellwig
2024-08-23 4:09 ` [PATCHSET v4.0 01/10] xfs: various bug fixes for 6.11 Christoph Hellwig
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
2024-08-23 13:20 ` Jeff Layton
2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
2024-08-24 4:46 ` Darrick J. Wong
2024-08-24 4:48 ` Christoph Hellwig
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
2024-08-22 23:56 ` [PATCHSET v4.0 03/10] xfs: cleanups before adding metadata directories Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/3] xfs: validate inumber in xfs_iget Darrick J. Wong
2024-08-23 0:01 ` [PATCH 2/3] xfs: match on the global RT inode numbers in xfs_is_metadata_inode Darrick J. Wong
2024-08-23 0:02 ` [PATCH 3/3] xfs: pass the icreate args object to xfs_dialloc Darrick J. Wong
2024-08-23 4:13 ` Christoph Hellwig
2024-08-22 23:57 ` [PATCHSET v4.0 04/10] xfs: metadata inode directories Darrick J. Wong
2024-08-23 0:02 ` [PATCH 01/26] xfs: define the on-disk format for the metadir feature Darrick J. Wong
2024-08-23 4:30 ` Christoph Hellwig
2024-08-23 0:02 ` [PATCH 02/26] xfs: refactor loading quota inodes in the regular case Darrick J. Wong
2024-08-23 4:31 ` Christoph Hellwig
2024-08-23 17:51 ` Darrick J. Wong
2024-08-23 0:02 ` [PATCH 03/26] xfs: iget for metadata inodes Darrick J. Wong
2024-08-23 4:35 ` Christoph Hellwig
2024-08-23 17:53 ` Darrick J. Wong
2024-08-23 0:03 ` [PATCH 04/26] xfs: load metadata directory root at mount time Darrick J. Wong
2024-08-23 4:35 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 05/26] xfs: enforce metadata inode flag Darrick J. Wong
2024-08-23 4:38 ` Christoph Hellwig
2024-08-23 17:55 ` Darrick J. Wong
2024-08-23 0:03 ` [PATCH 06/26] xfs: read and write metadata inode directory tree Darrick J. Wong
2024-08-23 4:39 ` Christoph Hellwig
2024-08-23 0:03 ` [PATCH 07/26] xfs: disable the agi rotor for metadata inodes Darrick J. Wong
2024-08-23 4:39 ` Christoph Hellwig
2024-08-23 0:04 ` [PATCH 08/26] xfs: hide metadata inodes from everyone because they are special Darrick J. Wong
2024-08-23 4:40 ` Christoph Hellwig
2024-08-26 0:41 ` Dave Chinner
2024-08-26 17:33 ` Darrick J. Wong
2024-08-23 0:04 ` [PATCH 09/26] xfs: advertise metadata directory feature Darrick J. Wong
2024-08-23 4:40 ` Christoph Hellwig
2024-08-23 0:04 ` [PATCH 10/26] xfs: allow bulkstat to return metadata directories Darrick J. Wong
2024-08-23 4:41 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 11/26] xfs: don't count metadata directory files to quota Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-26 0:47 ` Dave Chinner
2024-08-26 17:57 ` Darrick J. Wong
2024-08-23 0:05 ` [PATCH 12/26] xfs: mark quota inodes as metadata files Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 13/26] xfs: adjust xfs_bmap_add_attrfork for metadir Darrick J. Wong
2024-08-23 4:42 ` Christoph Hellwig
2024-08-23 0:05 ` [PATCH 14/26] xfs: record health problems with the metadata directory Darrick J. Wong
2024-08-23 4:43 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 15/26] xfs: refactor directory tree root predicates Darrick J. Wong
2024-08-23 4:48 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 16/26] xfs: do not count metadata directory files when doing online quotacheck Darrick J. Wong
2024-08-23 4:48 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 17/26] xfs: don't fail repairs on metadata files with no attr fork Darrick J. Wong
2024-08-23 4:49 ` Christoph Hellwig
2024-08-23 0:06 ` [PATCH 18/26] xfs: metadata files can have xattrs if metadir is enabled Darrick J. Wong
2024-08-23 4:50 ` Christoph Hellwig
2024-08-23 18:00 ` Darrick J. Wong
2024-08-23 0:07 ` [PATCH 19/26] xfs: adjust parent pointer scrubber for sb-rooted metadata files Darrick J. Wong
2024-08-23 4:50 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 20/26] xfs: fix di_metatype field of inodes that won't load Darrick J. Wong
2024-08-23 4:51 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 21/26] xfs: scrub metadata directories Darrick J. Wong
2024-08-23 4:53 ` Christoph Hellwig
2024-08-23 0:07 ` [PATCH 22/26] xfs: check the metadata directory inumber in superblocks Darrick J. Wong
2024-08-23 4:53 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 23/26] xfs: move repair temporary files to the metadata directory tree Darrick J. Wong
2024-08-23 4:54 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 24/26] xfs: check metadata directory file path connectivity Darrick J. Wong
2024-08-23 4:55 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 25/26] xfs: confirm dotdot target before replacing it during a repair Darrick J. Wong
2024-08-23 4:55 ` Christoph Hellwig
2024-08-23 0:08 ` [PATCH 26/26] xfs: repair metadata directory file path connectivity Darrick J. Wong
2024-08-23 4:56 ` Christoph Hellwig
2024-08-22 23:57 ` [PATCHSET v4.0 05/10] xfs: clean up the rtbitmap code Darrick J. Wong
2024-08-23 0:09 ` [PATCH 01/12] xfs: remove xfs_validate_rtextents Darrick J. Wong
2024-08-23 0:09 ` [PATCH 02/12] xfs: factor out a xfs_validate_rt_geometry helper Darrick J. Wong
2024-08-23 0:09 ` [PATCH 03/12] xfs: make the RT rsum_cache mandatory Darrick J. Wong
2024-08-23 0:09 ` [PATCH 04/12] xfs: remove the limit argument to xfs_rtfind_back Darrick J. Wong
2024-08-23 0:10 ` [PATCH 05/12] xfs: assert a valid limit in xfs_rtfind_forw Darrick J. Wong
2024-08-23 0:10 ` [PATCH 06/12] xfs: add bounds checking to xfs_rt{bitmap,summary}_read_buf Darrick J. Wong
2024-08-23 0:10 ` [PATCH 07/12] xfs: cleanup the calling convention for xfs_rtpick_extent Darrick J. Wong
2024-08-23 0:11 ` [PATCH 08/12] xfs: push the calls to xfs_rtallocate_range out to xfs_bmap_rtalloc Darrick J. Wong
2024-08-23 0:11 ` [PATCH 09/12] xfs: factor out a xfs_growfs_rt_bmblock helper Darrick J. Wong
2024-08-23 0:11 ` [PATCH 10/12] xfs: factor out a xfs_last_rt_bmblock helper Darrick J. Wong
2024-08-23 0:11 ` [PATCH 11/12] xfs: factor out rtbitmap/summary initialization helpers Darrick J. Wong
2024-08-23 0:12 ` [PATCH 12/12] xfs: push transaction join out of xfs_rtbitmap_lock and xfs_rtgroup_lock Darrick J. Wong
2024-08-22 23:57 ` [PATCHSET v4.0 06/10] xfs: fixes and cleanups for the realtime allocator Darrick J. Wong
2024-08-23 0:12 ` [PATCH 01/10] xfs: use the recalculated transaction reservation in xfs_growfs_rt_bmblock Darrick J. Wong
2024-08-23 0:12 ` [PATCH 02/10] xfs: ensure rtx mask/shift are correct after growfs Darrick J. Wong
2024-08-23 0:12 ` [PATCH 03/10] xfs: don't return too-short extents from xfs_rtallocate_extent_block Darrick J. Wong
2024-08-23 4:57 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block Darrick J. Wong
2024-08-23 4:57 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 05/10] xfs: refactor aligning bestlen to prod Darrick J. Wong
2024-08-23 4:58 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 06/10] xfs: clean up xfs_rtallocate_extent_exact a bit Darrick J. Wong
2024-08-23 4:58 ` Christoph Hellwig
2024-08-23 0:13 ` [PATCH 07/10] xfs: reduce excessive clamping of maxlen in xfs_rtallocate_extent_near Darrick J. Wong
2024-08-23 4:59 ` Christoph Hellwig
2024-08-23 0:14 ` [PATCH 08/10] xfs: fix broken variable-sized allocation detection in xfs_rtallocate_extent_block Darrick J. Wong
2024-08-23 4:59 ` Christoph Hellwig
2024-08-23 0:14 ` [PATCH 09/10] xfs: remove xfs_rtb_to_rtxrem Darrick J. Wong
2024-08-23 0:14 ` [PATCH 10/10] xfs: simplify xfs_rtalloc_query_range Darrick J. Wong
2024-08-22 23:57 ` [PATCHSET v4.0 07/10] xfs: create incore rt allocation groups Darrick J. Wong
2024-08-23 0:14 ` [PATCH 01/24] xfs: clean up the ISVALID macro in xfs_bmap_adjacent Darrick J. Wong
2024-08-23 0:15 ` [PATCH 02/24] xfs: factor out a xfs_rtallocate helper Darrick J. Wong
2024-08-23 0:15 ` [PATCH 03/24] xfs: rework the rtalloc fallback handling Darrick J. Wong
2024-08-23 0:15 ` [PATCH 04/24] xfs: factor out a xfs_rtallocate_align helper Darrick J. Wong
2024-08-23 0:15 ` [PATCH 05/24] xfs: make the rtalloc start hint a xfs_rtblock_t Darrick J. Wong
2024-08-23 0:16 ` [PATCH 06/24] xfs: add xchk_setup_nothing and xchk_nothing helpers Darrick J. Wong
2024-08-23 5:00 ` Christoph Hellwig
2024-08-23 0:16 ` [PATCH 07/24] xfs: remove xfs_{rtbitmap,rtsummary}_wordcount Darrick J. Wong
2024-08-23 0:16 ` [PATCH 08/24] xfs: replace m_rsumsize with m_rsumblocks Darrick J. Wong
2024-08-23 0:17 ` [PATCH 09/24] xfs: rearrange xfs_fsmap.c a little bit Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-23 0:17 ` [PATCH 10/24] xfs: move xfs_ioc_getfsmap out of xfs_ioctl.c Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-23 0:17 ` [PATCH 11/24] xfs: create incore realtime group structures Darrick J. Wong
2024-08-23 5:01 ` Christoph Hellwig
2024-08-25 23:56 ` Dave Chinner
2024-08-26 19:14 ` Darrick J. Wong
2024-08-27 0:57 ` Dave Chinner
2024-08-27 1:55 ` Darrick J. Wong
2024-08-27 3:00 ` Dave Chinner
2024-08-27 4:44 ` Christoph Hellwig
2024-08-27 4:38 ` Christoph Hellwig
2024-08-27 5:17 ` Darrick J. Wong
2024-08-27 5:18 ` Christoph Hellwig
2024-08-27 4:27 ` Christoph Hellwig
2024-08-27 5:19 ` Darrick J. Wong
2024-08-23 0:17 ` [PATCH 12/24] xfs: define locking primitives for realtime groups Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-23 0:18 ` [PATCH 13/24] xfs: add a lockdep class key for rtgroup inodes Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-25 23:58 ` Dave Chinner
2024-08-26 21:38 ` Darrick J. Wong
2024-08-27 0:58 ` Dave Chinner
2024-08-27 1:56 ` Darrick J. Wong
2024-08-27 3:00 ` Dave Chinner
2024-08-23 0:18 ` [PATCH 14/24] xfs: support caching rtgroup metadata inodes Darrick J. Wong
2024-08-23 5:02 ` Christoph Hellwig
2024-08-26 1:41 ` Dave Chinner
2024-08-26 18:37 ` Darrick J. Wong
2024-08-27 1:05 ` Dave Chinner
2024-08-27 2:01 ` Darrick J. Wong
2024-08-23 0:18 ` [PATCH 15/24] xfs: add rtgroup-based realtime scrubbing context management Darrick J. Wong
2024-08-23 5:03 ` Christoph Hellwig
2024-08-23 0:18 ` [PATCH 16/24] xfs: move RT bitmap and summary information to the rtgroup Darrick J. Wong
2024-08-26 1:58 ` Dave Chinner
2024-08-23 0:19 ` [PATCH 17/24] xfs: remove XFS_ILOCK_RT* Darrick J. Wong
2024-08-23 5:04 ` Christoph Hellwig
2024-08-23 0:19 ` [PATCH 18/24] xfs: calculate RT bitmap and summary blocks based on sb_rextents Darrick J. Wong
2024-08-23 0:19 ` [PATCH 19/24] xfs: factor out a xfs_growfs_rt_alloc_fake_mount helper Darrick J. Wong
2024-08-23 0:19 ` [PATCH 20/24] xfs: use xfs_growfs_rt_alloc_fake_mount in xfs_growfs_rt_alloc_blocks Darrick J. Wong
2024-08-23 0:20 ` [PATCH 21/24] xfs: factor out a xfs_growfs_check_rtgeom helper Darrick J. Wong
2024-08-26 2:06 ` Dave Chinner
2024-08-26 18:27 ` Darrick J. Wong
2024-08-27 1:29 ` Dave Chinner
2024-08-27 4:27 ` Darrick J. Wong
2024-08-27 22:16 ` Dave Chinner
2024-08-23 0:20 ` [PATCH 22/24] xfs: refactor xfs_rtbitmap_blockcount Darrick J. Wong
2024-08-23 0:20 ` [PATCH 23/24] xfs: refactor xfs_rtsummary_blockcount Darrick J. Wong
2024-08-23 0:20 ` [PATCH 24/24] xfs: make RT extent numbers relative to the rtgroup Darrick J. Wong
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
2024-08-23 0:21 ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
2024-08-22 23:58 ` [PATCHSET v4.0 09/10] xfs: shard the realtime section Darrick J. Wong
2024-08-23 0:21 ` [PATCH 01/26] xfs: define the format of rt groups Darrick J. Wong
2024-08-23 5:11 ` Christoph Hellwig
2024-08-23 18:12 ` Darrick J. Wong
2024-08-23 0:21 ` [PATCH 02/26] xfs: check the realtime superblock at mount time Darrick J. Wong
2024-08-23 5:11 ` Christoph Hellwig
2024-08-23 0:21 ` [PATCH 03/26] xfs: update realtime super every time we update the primary fs super Darrick J. Wong
2024-08-23 5:12 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 04/26] xfs: export realtime group geometry via XFS_FSOP_GEOM Darrick J. Wong
2024-08-23 5:12 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 05/26] xfs: check that rtblock extents do not break rtsupers or rtgroups Darrick J. Wong
2024-08-23 5:13 ` Christoph Hellwig
2024-08-23 0:22 ` [PATCH 06/26] xfs: add a helper to prevent bmap merges across rtgroup boundaries Darrick J. Wong
2024-08-23 0:22 ` [PATCH 07/26] xfs: add frextents to the lazysbcounters when rtgroups enabled Darrick J. Wong
2024-08-23 5:13 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 08/26] xfs: convert sick_map loops to use ARRAY_SIZE Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 09/26] xfs: record rt group metadata errors in the health system Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:23 ` [PATCH 10/26] xfs: export the geometry of realtime groups to userspace Darrick J. Wong
2024-08-23 5:14 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 11/26] xfs: add block headers to realtime bitmap and summary blocks Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 12/26] xfs: encode the rtbitmap in big endian format Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 13/26] xfs: encode the rtsummary " Darrick J. Wong
2024-08-23 5:15 ` Christoph Hellwig
2024-08-23 0:24 ` [PATCH 14/26] xfs: grow the realtime section when realtime groups are enabled Darrick J. Wong
2024-08-23 5:16 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 15/26] xfs: store rtgroup information with a bmap intent Darrick J. Wong
2024-08-23 5:16 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 16/26] xfs: force swapext to a realtime file to use the file content exchange ioctl Darrick J. Wong
2024-08-23 5:17 ` Christoph Hellwig
2024-08-23 0:25 ` [PATCH 17/26] xfs: support logging EFIs for realtime extents Darrick J. Wong
2024-08-23 5:17 ` Christoph Hellwig
2024-08-26 4:33 ` Dave Chinner
2024-08-26 19:38 ` Darrick J. Wong
2024-08-27 1:36 ` Dave Chinner
2024-08-23 0:25 ` [PATCH 18/26] xfs: support error injection when freeing rt extents Darrick J. Wong
2024-08-23 5:18 ` Christoph Hellwig
2024-08-23 0:26 ` [PATCH 19/26] xfs: use realtime EFI to free extents when rtgroups are enabled Darrick J. Wong
2024-08-23 5:18 ` Christoph Hellwig
2024-08-23 0:26 ` [PATCH 20/26] xfs: don't merge ioends across RTGs Darrick J. Wong
2024-08-23 0:26 ` [PATCH 21/26] xfs: make the RT allocator rtgroup aware Darrick J. Wong
2024-08-26 4:56 ` Dave Chinner
2024-08-26 19:40 ` Darrick J. Wong
2024-08-27 1:56 ` Dave Chinner
2024-08-27 2:16 ` Darrick J. Wong
2024-08-27 5:00 ` Christoph Hellwig
2024-08-27 5:00 ` Christoph Hellwig
2024-08-27 4:59 ` Christoph Hellwig
2024-08-23 0:26 ` [PATCH 22/26] xfs: don't coalesce file mappings that cross rtgroup boundaries in scrub Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 23/26] xfs: scrub the realtime group superblock Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 24/26] xfs: repair " Darrick J. Wong
2024-08-23 5:19 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 25/26] xfs: scrub metadir paths for rtgroup metadata Darrick J. Wong
2024-08-23 5:20 ` Christoph Hellwig
2024-08-23 0:27 ` [PATCH 26/26] xfs: mask off the rtbitmap and summary inodes when metadir in use Darrick J. Wong
2024-08-23 5:20 ` Christoph Hellwig
2024-08-22 23:58 ` [PATCHSET v4.0 10/10] xfs: store quota files in the metadir Darrick J. Wong
2024-08-23 0:28 ` [PATCH 1/6] xfs: refactor xfs_qm_destroy_quotainos Darrick J. Wong
2024-08-23 5:51 ` Christoph Hellwig
2024-08-23 0:28 ` [PATCH 2/6] xfs: use metadir for quota inodes Darrick J. Wong
2024-08-23 5:53 ` Christoph Hellwig
2024-08-23 18:20 ` Darrick J. Wong
2024-08-23 0:28 ` [PATCH 3/6] xfs: scrub quota file metapaths Darrick J. Wong
2024-08-23 5:53 ` Christoph Hellwig
2024-08-23 0:28 ` [PATCH 4/6] xfs: persist quota flags with metadir Darrick J. Wong
2024-08-23 5:54 ` Christoph Hellwig
2024-08-23 18:23 ` Darrick J. Wong
2024-08-26 9:42 ` Dave Chinner
2024-08-26 18:15 ` Darrick J. Wong
2024-08-23 0:29 ` [PATCH 5/6] xfs: update sb field checks when metadir is turned on Darrick J. Wong
2024-08-23 5:55 ` Christoph Hellwig
2024-08-26 9:52 ` Dave Chinner
2024-08-26 18:07 ` Darrick J. Wong
2024-08-27 2:16 ` Dave Chinner
2024-08-27 3:16 ` Darrick J. Wong
2024-08-23 0:29 ` [PATCH 6/6] xfs: enable metadata directory feature Darrick J. Wong
2024-08-23 5:58 ` Christoph Hellwig
2024-08-23 18:26 ` Darrick J. Wong
-- strict thread matches above, loose matches on Subject: below --
2024-09-02 18:21 [PATCHSET v4.2 4/8] xfs: fixes for the realtime allocator Darrick J. Wong
2024-09-02 18:28 ` [PATCH 04/10] xfs: don't scan off the end of the rt volume in xfs_rtallocate_extent_block Darrick J. Wong