[PATCHSET v29.4 03/13] xfs: atomic file content exchanges

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
@ 2024-02-27  2:18 Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:18 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, linux-xfs, hch

Hi all,

This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Callers can arrange for the update to be rejected if the
original file has been changed.

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
Here are the proposed manual pages:

IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range  -  exchange  the contents of parts of
       two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs_staging.h>

       int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
       xfs_exch_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The  system  call  parameters are conveyed in structures of the
       following form:

           struct xfs_exch_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
               __s64    length;
               __u64    flags;

               __u64    pad;
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both files must be from the same filesystem mount.  If the  two
       file  descriptors represent the same file, the byte ranges must
       not overlap.  Most  disk-based  filesystems  require  that  the
       starts  of  both ranges must be aligned to the file block size.
       If this is the case, the ends of the ranges  must  also  be  so
       aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           XFS_EXCHRANGE_TO_EOF
                  Ignore  the length parameter.  All bytes in file1_fd
                  from file1_offset to EOF are moved to file2_fd,  and
                  file2's  size is set to (file2_offset+(file1_length-
                  file1_offset)).  Meanwhile, all bytes in file2  from
                  file2_offset  to  EOF are moved to file1 and file1's
                  size   is   set   to    (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHRANGE_DSYNC
                  Ensure  that  all modified in-core data in both file
                  ranges and all metadata updates  pertaining  to  the
                  exchange operation are flushed to persistent storage
                  before the call returns.  Opening  either  file  de‐
                  scriptor  with  O_SYNC or O_DSYNC will have the same
                  effect.

           XFS_EXCHRANGE_FILE1_WRITTEN
                  Only exchange sub-ranges of file1_fd that are  known
                  to  contain  data  written  by application software.
                  Each sub-range may be  expanded  (both  upwards  and
                  downwards)  to  align with the file allocation unit.
                  For files on the data device, this is one filesystem
                  block.   For  files  on the realtime device, this is
                  the realtime extent size.  This facility can be used
                  to  implement  fast  atomic scatter-gather writes of
                  any complexity for software-defined storage  targets
                  if  all  writes  are  aligned to the file allocation
                  unit.

           XFS_EXCHRANGE_DRY_RUN
                  Check the parameters and the feasibility of the  op‐
                  eration, but do not change anything.

RETURN VALUE
       On  error, -1 is returned, and errno is set to indicate the er‐
       ror.

ERRORS
       Error codes can be one of, but are not limited to, the  follow‐
       ing:

       EBADF  file1_fd  is not open for reading and writing or is open
              for append-only writes; or  file2_fd  is  not  open  for
              reading and writing or is open for append-only writes.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space  in  the  filesystem  ex‐
              change the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several  use  cases  are imagined for this system call.  In all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The  first  is a data storage program that wants to commit non-
       contiguous updates to a file atomically and  coordinates  write
       access  to that file.  This can be done by creating a temporary
       file, calling FICLONE(2) to share the contents, and staging the
       updates into the temporary file.  The FULL_FILES flag is recom‐
       mended for this purpose.  The temporary file can be deleted  or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exch_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The  second  is  a  software-defined  storage host (e.g. a disk
       jukebox) which implements an atomic scatter-gather  write  com‐
       mand.   Provided the exported disk's logical block size matches
       the file's allocation unit size, this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       It is recommended that the temporary file be truncated  to  the
       size  of  the  regular file before any writes are staged to the
       temporary file to avoid issues with zeroing during  EOF  exten‐
       sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
       only the file allocation units involved  in  the  emulated  de‐
       vice's  write  command.  The temporary file should be truncated
       or punched out completely before being reused to stage  another
       write.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exch_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length       = blksz * 400,
               .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
                               XFS_EXCHRANGE_FILE1_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)

NAME
       ioctl_xfs_commit_range - conditionally exchange the contents of
       parts of two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs_staging.h>

       int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
       mit_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes  the contents of the two ranges if file2_fd passes cer‐
       tain freshness criteria.

       After locking both files but before  exchanging  the  contents,
       the  supplied  file2_ino field must match file2_fd's inode num‐
       ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
       file2_ctime, and file2_ctime_nsec fields must match the modifi‐
       cation time and change time of file2.  If they  do  not  match,
       EBUSY will be returned.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The  system  call  parameters are conveyed in structures of the
       following form:

           struct xfs_commit_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
               __s64    length;
               __u64    flags;

               __s64    file2_ino;
               __s64    file2_mtime;
               __s64    file2_ctime;
               __s32    file2_mtime_nsec;
               __s32    file2_ctime_nsec;

               __u64    pad;
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       The   fields    file2_ino,    file2_mtime,    file2_mtime_nsec,
       file2_ctime   and   file2_ctime_nsec   must  be  gathered  from
       file2_fd's stat information prior to beginning the file update.
       These file attributes are used to confirm that file2_fd has not
       changed by another thread since the current thread began  stag‐
       ing its own update.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           XFS_EXCHRANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHRANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHRANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHRANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EBUSY  The file2 inode number and timestamps  supplied  do  not
              match file2_fd.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space  in  the  filesystem  ex‐
              change the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system call.  Coordina‐
       tion between multiple threads is performed by the kernel.

       The first is a filesystem defragmenter, which copies  the  con‐
       tents  of  a  file into another file and wishes to exchange the
       space mappings of the two files,  provided  that  the  original
       file has not changed.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           struct xfs_commit_range args = {
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           fstat(fd, &sb);
           args.file2_ino = sb.st_ino;
           args.file2_mtime = sb.st_mtim.tv_sec;
           args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
           args.file2_ctime = sb.st_ctim.tv_sec;
           args.file2_ctime_nsec = sb.st_ctim.tv_nsec;

           /* make a fresh copy of the file with terrible alignment to avoid reflink */
           clone_file_range(fd, NULL, temp_fd, NULL, 1, 0);
           clone_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed while defrag was underway
");

       The  second is a data storage program that wants to commit non-
       contiguous updates to a file atomically.  This  program  cannot
       coordinate updates to the file and therefore relies on the ker‐
       nel to reject the COMMIT_RANGE command if the file has been up‐
       dated  by  someone else.  This can be done by creating a tempo‐
       rary file, calling FICLONE(2) to share the contents, and  stag‐
       ing  the  updates into the temporary file.  The FULL_FILES flag
       is recommended for this purpose.  The  temporary  file  can  be
       deleted or punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           struct xfs_commit_range args = {
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           fstat(fd, &sb);
           args.file2_ino = sb.st_ino;
           args.file2_mtime = sb.st_mtim.tv_sec;
           args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
           args.file2_ctime = sb.st_ctim.tv_sec;
           args.file2_ctime_nsec = sb.st_ctim.tv_nsec;

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed before commit; will roll back
");

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-18     IOCTL-XFS-COMMIT-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

or userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
Commits in this patchset:
 * vfs: export remap and write check helpers
 * xfs: introduce new file range exchange ioctls
 * xfs: create a log incompat flag for atomic file mapping exchanges
 * xfs: introduce a file mapping exchange log intent item
 * xfs: create deferred log items for file mapping exchanges
 * xfs: bind together the front and back ends of the file range exchange code
 * xfs: add error injection to test file mapping exchange recovery
 * xfs: condense extended attributes after a mapping exchange operation
 * xfs: condense directories after a mapping exchange operation
 * xfs: condense symbolic links after a mapping exchange operation
 * xfs: make file range exchange support realtime files
 * xfs: support non-power-of-two rtextsize with exchange-range
 * docs: update swapext -> exchmaps language
 * xfs: enable logged file mapping exchange feature
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  259 ++--
 fs/read_write.c                                    |    1 
 fs/remap_range.c                                   |    4 
 fs/xfs/Makefile                                    |    3 
 fs/xfs/libxfs/xfs_bmap.h                           |    2 
 fs/xfs/libxfs/xfs_defer.c                          |    6 
 fs/xfs/libxfs/xfs_defer.h                          |    2 
 fs/xfs/libxfs/xfs_errortag.h                       |    4 
 fs/xfs/libxfs/xfs_exchmaps.c                       | 1222 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_exchmaps.h                       |  123 ++
 fs/xfs/libxfs/xfs_format.h                         |   16 
 fs/xfs/libxfs/xfs_fs.h                             |   75 +
 fs/xfs/libxfs/xfs_log_format.h                     |   64 +
 fs/xfs/libxfs/xfs_log_recover.h                    |    2 
 fs/xfs/libxfs/xfs_sb.c                             |    3 
 fs/xfs/libxfs/xfs_symlink_remote.c                 |   47 +
 fs/xfs/libxfs/xfs_symlink_remote.h                 |    1 
 fs/xfs/libxfs/xfs_trans_space.h                    |    4 
 fs/xfs/xfs_error.c                                 |    3 
 fs/xfs/xfs_exchmaps_item.c                         |  603 ++++++++++
 fs/xfs/xfs_exchmaps_item.h                         |   64 +
 fs/xfs/xfs_exchrange.c                             |  865 ++++++++++++++
 fs/xfs/xfs_exchrange.h                             |   51 +
 fs/xfs/xfs_ioctl.c                                 |   89 +
 fs/xfs/xfs_log_recover.c                           |    2 
 fs/xfs/xfs_mount.h                                 |    5 
 fs/xfs/xfs_super.c                                 |   19 
 fs/xfs/xfs_symlink.c                               |   49 -
 fs/xfs/xfs_trace.c                                 |    2 
 fs/xfs/xfs_trace.h                                 |  382 ++++++
 include/linux/fs.h                                 |    1 
 31 files changed, 3797 insertions(+), 176 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h
 create mode 100644 fs/xfs/xfs_exchmaps_item.c
 create mode 100644 fs/xfs/xfs_exchmaps_item.h
 create mode 100644 fs/xfs/xfs_exchrange.c
 create mode 100644 fs/xfs/xfs_exchrange.h


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 01/14] vfs: export remap and write check helpers
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
@ 2024-02-27  2:21 ` Darrick J. Wong
  2024-02-28 15:40   ` Christoph Hellwig
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/read_write.c    |    1 +
 fs/remap_range.c   |    4 ++--
 include/linux/fs.h |    1 +
 3 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index d4c036e82b6c3..85c096f2c0d06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(generic_write_check_limits);
 
 /* Like generic_write_checks(), but takes size of write instead of iter. */
 int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index de07f978ce3eb..28246dfc84851 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
-			     bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
 {
 	int mask = write ? MAY_WRITE : MAY_READ;
 	loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 
 	return fsnotify_file_area_perm(file, mask, &pos, len);
 }
+EXPORT_SYMBOL_GPL(remap_verify_area);
 
 /*
  * Ensure that we don't remap a partial EOF block in the middle of something
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112c..f0ada316dc97b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2096,6 +2096,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
 int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    loff_t *len, unsigned int remap_flags,


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
@ 2024-02-27  9:23 ` Amir Goldstein
  2024-02-27 10:53   ` Jeff Layton
  2024-02-27 15:45   ` Darrick J. Wong
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
  3 siblings, 2 replies; 27+ messages in thread
From: Amir Goldstein @ 2024-02-27  9:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically.  This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes.  A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file.  If the updates are aligned with the file block
> size, a new (since v2) flag provides for exchanging only the written
> areas.  Callers can arrange for the update to be rejected if the
> original file has been changed.
>
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory, fsync the new
> file, rename the new file on top of the old filename, and then fsync the
> directory.  People get it wrong all the time, and $fs hacks abound.
> Here are the proposed manual pages:
>
> IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
>
> NAME
>        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
>        two files
>
> SYNOPSIS
>        #include <sys/ioctl.h>
>        #include <xfs/xfs_fs_staging.h>
>
>        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
>        xfs_exch_range *arg);
>
> DESCRIPTION
>        Given  a  range  of bytes in a first file file1_fd and a second
>        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
>        changes the contents of the two ranges.
>
>        Exchanges  are  atomic  with  regards to concurrent file opera‐
>        tions, so no userspace-level locks need to be taken  to  obtain
>        consistent  results.  Implementations must guarantee that read‐
>        ers see either the old contents or the new  contents  in  their
>        entirety, even if the system fails.
>
>        The  system  call  parameters are conveyed in structures of the
>        following form:
>
>            struct xfs_exch_range {
>                __s64    file1_fd;
>                __s64    file1_offset;
>                __s64    file2_offset;
>                __s64    length;
>                __u64    flags;
>
>                __u64    pad;
>            };
>
>        The field pad must be zero.
>
>        The fields file1_fd, file1_offset, and length define the  first
>        range of bytes to be exchanged.
>
>        The fields file2_fd, file2_offset, and length define the second
>        range of bytes to be exchanged.
>
>        Both files must be from the same filesystem mount.  If the  two
>        file  descriptors represent the same file, the byte ranges must
>        not overlap.  Most  disk-based  filesystems  require  that  the
>        starts  of  both ranges must be aligned to the file block size.
>        If this is the case, the ends of the ranges  must  also  be  so
>        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
>
>        The field flags control the behavior of the exchange operation.
>
>            XFS_EXCHRANGE_TO_EOF
>                   Ignore  the length parameter.  All bytes in file1_fd
>                   from file1_offset to EOF are moved to file2_fd,  and
>                   file2's  size is set to (file2_offset+(file1_length-
>                   file1_offset)).  Meanwhile, all bytes in file2  from
>                   file2_offset  to  EOF are moved to file1 and file1's
>                   size   is   set   to    (file1_offset+(file2_length-
>                   file2_offset)).
>
>            XFS_EXCHRANGE_DSYNC
>                   Ensure  that  all modified in-core data in both file
>                   ranges and all metadata updates  pertaining  to  the
>                   exchange operation are flushed to persistent storage
>                   before the call returns.  Opening  either  file  de‐
>                   scriptor  with  O_SYNC or O_DSYNC will have the same
>                   effect.
>
>            XFS_EXCHRANGE_FILE1_WRITTEN
>                   Only exchange sub-ranges of file1_fd that are  known
>                   to  contain  data  written  by application software.
>                   Each sub-range may be  expanded  (both  upwards  and
>                   downwards)  to  align with the file allocation unit.
>                   For files on the data device, this is one filesystem
>                   block.   For  files  on the realtime device, this is
>                   the realtime extent size.  This facility can be used
>                   to  implement  fast  atomic scatter-gather writes of
>                   any complexity for software-defined storage  targets
>                   if  all  writes  are  aligned to the file allocation
>                   unit.
>
>            XFS_EXCHRANGE_DRY_RUN
>                   Check the parameters and the feasibility of the  op‐
>                   eration, but do not change anything.
>
> RETURN VALUE
>        On  error, -1 is returned, and errno is set to indicate the er‐
>        ror.
>
> ERRORS
>        Error codes can be one of, but are not limited to, the  follow‐
>        ing:
>
>        EBADF  file1_fd  is not open for reading and writing or is open
>               for append-only writes; or  file2_fd  is  not  open  for
>               reading and writing or is open for append-only writes.
>
>        EINVAL The  parameters  are  not correct for these files.  This
>               error can also appear if either file  descriptor  repre‐
>               sents  a device, FIFO, or socket.  Disk filesystems gen‐
>               erally require the offset and  length  arguments  to  be
>               aligned to the fundamental block sizes of both files.
>
>        EIO    An I/O error occurred.
>
>        EISDIR One of the files is a directory.
>
>        ENOMEM The  kernel  was unable to allocate sufficient memory to
>               perform the operation.
>
>        ENOSPC There is not enough free space  in  the  filesystem  ex‐
>               change the contents safely.
>
>        EOPNOTSUPP
>               The filesystem does not support exchanging bytes between
>               the two files.
>
>        EPERM  file1_fd or file2_fd are immutable.
>
>        ETXTBSY
>               One of the files is a swap file.
>
>        EUCLEAN
>               The filesystem is corrupt.
>
>        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
>               filesystem.
>
> CONFORMING TO
>        This API is XFS-specific.
>
> USE CASES
>        Several  use  cases  are imagined for this system call.  In all
>        cases, application software must coordinate updates to the file
>        because the exchange is performed unconditionally.
>
>        The  first  is a data storage program that wants to commit non-
>        contiguous updates to a file atomically and  coordinates  write
>        access  to that file.  This can be done by creating a temporary
>        file, calling FICLONE(2) to share the contents, and staging the
>        updates into the temporary file.  The FULL_FILES flag is recom‐
>        mended for this purpose.  The temporary file can be deleted  or
>        punched out afterwards.
>
>        An example program might look like this:
>
>            int fd = open("/some/file", O_RDWR);
>            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
>
>            ioctl(temp_fd, FICLONE, fd);
>
>            /* append 1MB of records */
>            lseek(temp_fd, 0, SEEK_END);
>            write(temp_fd, data1, 1000000);
>
>            /* update record index */
>            pwrite(temp_fd, data1, 600, 98765);
>            pwrite(temp_fd, data2, 320, 54321);
>            pwrite(temp_fd, data2, 15, 0);
>
>            /* commit the entire update */
>            struct xfs_exch_range args = {
>                .file1_fd = temp_fd,
>                .flags = XFS_EXCHRANGE_TO_EOF,
>            };
>
>            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
>        The  second  is  a  software-defined  storage host (e.g. a disk
>        jukebox) which implements an atomic scatter-gather  write  com‐
>        mand.   Provided the exported disk's logical block size matches
>        the file's allocation unit size, this can be done by creating a
>        temporary file and writing the data at the appropriate offsets.
>        It is recommended that the temporary file be truncated  to  the
>        size  of  the  regular file before any writes are staged to the
>        temporary file to avoid issues with zeroing during  EOF  exten‐
>        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
>        only the file allocation units involved  in  the  emulated  de‐
>        vice's  write  command.  The temporary file should be truncated
>        or punched out completely before being reused to stage  another
>        write.
>
>        An example program might look like this:
>
>            int fd = open("/some/file", O_RDWR);
>            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
>            struct stat sb;
>            int blksz;
>
>            fstat(fd, &sb);
>            blksz = sb.st_blksize;
>
>            /* land scatter gather writes between 100fsb and 500fsb */
>            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
>            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
>            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
>
>            /* commit the entire update */
>            struct xfs_exch_range args = {
>                .file1_fd = temp_fd,
>                .file1_offset = blksz * 100,
>                .file2_offset = blksz * 100,
>                .length       = blksz * 400,
>                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
>                                XFS_EXCHRANGE_FILE1_DSYNC,
>            };
>
>            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
> NOTES
>        Some  filesystems may limit the amount of data or the number of
>        extents that can be exchanged in a single call.
>
> SEE ALSO
>        ioctl(2)
>
> XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
>
> NAME
>        ioctl_xfs_commit_range - conditionally exchange the contents of
>        parts of two files
>
> SYNOPSIS
>        #include <sys/ioctl.h>
>        #include <xfs/xfs_fs_staging.h>
>
>        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
>        mit_range *arg);
>
> DESCRIPTION
>        Given  a  range  of bytes in a first file file1_fd and a second
>        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
>        changes  the contents of the two ranges if file2_fd passes cer‐
>        tain freshness criteria.
>
>        After locking both files but before  exchanging  the  contents,
>        the  supplied  file2_ino field must match file2_fd's inode num‐
>        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
>        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
>        cation time and change time of file2.  If they  do  not  match,
>        EBUSY will be returned.
>

Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?

And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
Even if this API is designed to be hoisted out of XFS at some future time,
Is there a real need to support it on filesystems that do not support
i_version(?)

Not to mention the fact that POSIX does not explicitly define how ctime should
behave with changes to fiemap (uninitialized extent and all), so who knows
how other filesystems may update ctime in those cases.

I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
really explicitly requests a bump of i_version on the next change.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 10:53   ` Jeff Layton
  2024-02-27 16:06     ` Darrick J. Wong
  2024-02-27 23:46     ` Dave Chinner
  2024-02-27 15:45   ` Darrick J. Wong
  1 sibling, 2 replies; 27+ messages in thread
From: Jeff Layton @ 2024-02-27 10:53 UTC (permalink / raw)
  To: Amir Goldstein, Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch

On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > Hi all,
> > 
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> > 
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> > 
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.  If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas.  Callers can arrange for the update to be rejected if the
> > original file has been changed.
> > 
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> > 

This is a cool idea!  I've had some handwavy ideas about making a gated
write() syscall (i.e. only write if the change cookie hasn't changed),
but something like this may be a simpler lift initially.

> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > 
> > NAME
> >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> >        two files
> > 
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> > 
> >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> >        xfs_exch_range *arg);
> > 
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes the contents of the two ranges.
> > 
> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> > 
> >        The  system  call  parameters are conveyed in structures of the
> >        following form:
> > 
> >            struct xfs_exch_range {
> >                __s64    file1_fd;
> >                __s64    file1_offset;
> >                __s64    file2_offset;
> >                __s64    length;
> >                __u64    flags;
> > 
> >                __u64    pad;
> >            };
> > 
> >        The field pad must be zero.
> > 
> >        The fields file1_fd, file1_offset, and length define the  first
> >        range of bytes to be exchanged.
> > 
> >        The fields file2_fd, file2_offset, and length define the second
> >        range of bytes to be exchanged.
> > 
> >        Both files must be from the same filesystem mount.  If the  two
> >        file  descriptors represent the same file, the byte ranges must
> >        not overlap.  Most  disk-based  filesystems  require  that  the
> >        starts  of  both ranges must be aligned to the file block size.
> >        If this is the case, the ends of the ranges  must  also  be  so
> >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > 
> >        The field flags control the behavior of the exchange operation.
> > 
> >            XFS_EXCHRANGE_TO_EOF
> >                   Ignore  the length parameter.  All bytes in file1_fd
> >                   from file1_offset to EOF are moved to file2_fd,  and
> >                   file2's  size is set to (file2_offset+(file1_length-
> >                   file1_offset)).  Meanwhile, all bytes in file2  from
> >                   file2_offset  to  EOF are moved to file1 and file1's
> >                   size   is   set   to    (file1_offset+(file2_length-
> >                   file2_offset)).
> > 
> >            XFS_EXCHRANGE_DSYNC
> >                   Ensure  that  all modified in-core data in both file
> >                   ranges and all metadata updates  pertaining  to  the
> >                   exchange operation are flushed to persistent storage
> >                   before the call returns.  Opening  either  file  de‐
> >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> >                   effect.
> > 
> >            XFS_EXCHRANGE_FILE1_WRITTEN
> >                   Only exchange sub-ranges of file1_fd that are  known
> >                   to  contain  data  written  by application software.
> >                   Each sub-range may be  expanded  (both  upwards  and
> >                   downwards)  to  align with the file allocation unit.
> >                   For files on the data device, this is one filesystem
> >                   block.   For  files  on the realtime device, this is
> >                   the realtime extent size.  This facility can be used
> >                   to  implement  fast  atomic scatter-gather writes of
> >                   any complexity for software-defined storage  targets
> >                   if  all  writes  are  aligned to the file allocation
> >                   unit.
> > 
> >            XFS_EXCHRANGE_DRY_RUN
> >                   Check the parameters and the feasibility of the  op‐
> >                   eration, but do not change anything.
> > 
> > RETURN VALUE
> >        On  error, -1 is returned, and errno is set to indicate the er‐
> >        ror.
> > 
> > ERRORS
> >        Error codes can be one of, but are not limited to, the  follow‐
> >        ing:
> > 
> >        EBADF  file1_fd  is not open for reading and writing or is open
> >               for append-only writes; or  file2_fd  is  not  open  for
> >               reading and writing or is open for append-only writes.
> > 
> >        EINVAL The  parameters  are  not correct for these files.  This
> >               error can also appear if either file  descriptor  repre‐
> >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> >               erally require the offset and  length  arguments  to  be
> >               aligned to the fundamental block sizes of both files.
> > 
> >        EIO    An I/O error occurred.
> > 
> >        EISDIR One of the files is a directory.
> > 
> >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> >               perform the operation.
> > 
> >        ENOSPC There is not enough free space  in  the  filesystem  ex‐
> >               change the contents safely.
> > 
> >        EOPNOTSUPP
> >               The filesystem does not support exchanging bytes between
> >               the two files.
> > 
> >        EPERM  file1_fd or file2_fd are immutable.
> > 
> >        ETXTBSY
> >               One of the files is a swap file.
> > 
> >        EUCLEAN
> >               The filesystem is corrupt.
> > 
> >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> >               filesystem.
> > 
> > CONFORMING TO
> >        This API is XFS-specific.
> > 
> > USE CASES
> >        Several  use  cases  are imagined for this system call.  In all
> >        cases, application software must coordinate updates to the file
> >        because the exchange is performed unconditionally.
> > 
> >        The  first  is a data storage program that wants to commit non-
> >        contiguous updates to a file atomically and  coordinates  write
> >        access  to that file.  This can be done by creating a temporary
> >        file, calling FICLONE(2) to share the contents, and staging the
> >        updates into the temporary file.  The FULL_FILES flag is recom‐
> >        mended for this purpose.  The temporary file can be deleted  or
> >        punched out afterwards.
> > 
> >        An example program might look like this:
> > 
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > 
> >            ioctl(temp_fd, FICLONE, fd);
> > 
> >            /* append 1MB of records */
> >            lseek(temp_fd, 0, SEEK_END);
> >            write(temp_fd, data1, 1000000);
> > 
> >            /* update record index */
> >            pwrite(temp_fd, data1, 600, 98765);
> >            pwrite(temp_fd, data2, 320, 54321);
> >            pwrite(temp_fd, data2, 15, 0);
> > 
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .flags = XFS_EXCHRANGE_TO_EOF,
> >            };
> > 
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > 
> >        The  second  is  a  software-defined  storage host (e.g. a disk
> >        jukebox) which implements an atomic scatter-gather  write  com‐
> >        mand.   Provided the exported disk's logical block size matches
> >        the file's allocation unit size, this can be done by creating a
> >        temporary file and writing the data at the appropriate offsets.
> >        It is recommended that the temporary file be truncated  to  the
> >        size  of  the  regular file before any writes are staged to the
> >        temporary file to avoid issues with zeroing during  EOF  exten‐
> >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> >        only the file allocation units involved  in  the  emulated  de‐
> >        vice's  write  command.  The temporary file should be truncated
> >        or punched out completely before being reused to stage  another
> >        write.
> > 
> >        An example program might look like this:
> > 
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >            struct stat sb;
> >            int blksz;
> > 
> >            fstat(fd, &sb);
> >            blksz = sb.st_blksize;
> > 
> >            /* land scatter gather writes between 100fsb and 500fsb */
> >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > 
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .file1_offset = blksz * 100,
> >                .file2_offset = blksz * 100,
> >                .length       = blksz * 400,
> >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> >                                XFS_EXCHRANGE_FILE1_DSYNC,
> >            };
> > 
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > 
> > NOTES
> >        Some  filesystems may limit the amount of data or the number of
> >        extents that can be exchanged in a single call.
> > 
> > SEE ALSO
> >        ioctl(2)
> > 
> > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > 
> > NAME
> >        ioctl_xfs_commit_range - conditionally exchange the contents of
> >        parts of two files
> > 
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> > 
> >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> >        mit_range *arg);
> > 
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes  the contents of the two ranges if file2_fd passes cer‐
> >        tain freshness criteria.
> > 
> >        After locking both files but before  exchanging  the  contents,
> >        the  supplied  file2_ino field must match file2_fd's inode num‐
> >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> >        cation time and change time of file2.  If they  do  not  match,
> >        EBUSY will be returned.
> > 
> 
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?
> 

ctime should always change if the mtime does. An mtime update means that
the metadata was updated, so you also need to update the ctime. 

> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)
> 
> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.
> 
> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.
> 


I agree. Using an opaque change cookie would be a lot nicer from an API
standpoint, and shouldn't be subject to timestamp granularity issues.

That said, XFS's change cookie is currently broken. Dave C. said he had
some patches in progress to fix that however.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
  2024-02-27 10:53   ` Jeff Layton
@ 2024-02-27 15:45   ` Darrick J. Wong
  2024-02-27 16:58     ` Amir Goldstein
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-27 15:45 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.  If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas.  Callers can arrange for the update to be rejected if the
> > original file has been changed.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> >
> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> >
> > NAME
> >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> >        two files
> >
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> >
> >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> >        xfs_exch_range *arg);
> >
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes the contents of the two ranges.
> >
> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> >
> >        The  system  call  parameters are conveyed in structures of the
> >        following form:
> >
> >            struct xfs_exch_range {
> >                __s64    file1_fd;
> >                __s64    file1_offset;
> >                __s64    file2_offset;
> >                __s64    length;
> >                __u64    flags;
> >
> >                __u64    pad;
> >            };
> >
> >        The field pad must be zero.
> >
> >        The fields file1_fd, file1_offset, and length define the  first
> >        range of bytes to be exchanged.
> >
> >        The fields file2_fd, file2_offset, and length define the second
> >        range of bytes to be exchanged.
> >
> >        Both files must be from the same filesystem mount.  If the  two
> >        file  descriptors represent the same file, the byte ranges must
> >        not overlap.  Most  disk-based  filesystems  require  that  the
> >        starts  of  both ranges must be aligned to the file block size.
> >        If this is the case, the ends of the ranges  must  also  be  so
> >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> >
> >        The field flags control the behavior of the exchange operation.
> >
> >            XFS_EXCHRANGE_TO_EOF
> >                   Ignore  the length parameter.  All bytes in file1_fd
> >                   from file1_offset to EOF are moved to file2_fd,  and
> >                   file2's  size is set to (file2_offset+(file1_length-
> >                   file1_offset)).  Meanwhile, all bytes in file2  from
> >                   file2_offset  to  EOF are moved to file1 and file1's
> >                   size   is   set   to    (file1_offset+(file2_length-
> >                   file2_offset)).
> >
> >            XFS_EXCHRANGE_DSYNC
> >                   Ensure  that  all modified in-core data in both file
> >                   ranges and all metadata updates  pertaining  to  the
> >                   exchange operation are flushed to persistent storage
> >                   before the call returns.  Opening  either  file  de‐
> >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> >                   effect.
> >
> >            XFS_EXCHRANGE_FILE1_WRITTEN
> >                   Only exchange sub-ranges of file1_fd that are  known
> >                   to  contain  data  written  by application software.
> >                   Each sub-range may be  expanded  (both  upwards  and
> >                   downwards)  to  align with the file allocation unit.
> >                   For files on the data device, this is one filesystem
> >                   block.   For  files  on the realtime device, this is
> >                   the realtime extent size.  This facility can be used
> >                   to  implement  fast  atomic scatter-gather writes of
> >                   any complexity for software-defined storage  targets
> >                   if  all  writes  are  aligned to the file allocation
> >                   unit.
> >
> >            XFS_EXCHRANGE_DRY_RUN
> >                   Check the parameters and the feasibility of the  op‐
> >                   eration, but do not change anything.
> >
> > RETURN VALUE
> >        On  error, -1 is returned, and errno is set to indicate the er‐
> >        ror.
> >
> > ERRORS
> >        Error codes can be one of, but are not limited to, the  follow‐
> >        ing:
> >
> >        EBADF  file1_fd  is not open for reading and writing or is open
> >               for append-only writes; or  file2_fd  is  not  open  for
> >               reading and writing or is open for append-only writes.
> >
> >        EINVAL The  parameters  are  not correct for these files.  This
> >               error can also appear if either file  descriptor  repre‐
> >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> >               erally require the offset and  length  arguments  to  be
> >               aligned to the fundamental block sizes of both files.
> >
> >        EIO    An I/O error occurred.
> >
> >        EISDIR One of the files is a directory.
> >
> >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> >               perform the operation.
> >
> >        ENOSPC There is not enough free space  in  the  filesystem  ex‐
> >               change the contents safely.
> >
> >        EOPNOTSUPP
> >               The filesystem does not support exchanging bytes between
> >               the two files.
> >
> >        EPERM  file1_fd or file2_fd are immutable.
> >
> >        ETXTBSY
> >               One of the files is a swap file.
> >
> >        EUCLEAN
> >               The filesystem is corrupt.
> >
> >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> >               filesystem.
> >
> > CONFORMING TO
> >        This API is XFS-specific.
> >
> > USE CASES
> >        Several  use  cases  are imagined for this system call.  In all
> >        cases, application software must coordinate updates to the file
> >        because the exchange is performed unconditionally.
> >
> >        The  first  is a data storage program that wants to commit non-
> >        contiguous updates to a file atomically and  coordinates  write
> >        access  to that file.  This can be done by creating a temporary
> >        file, calling FICLONE(2) to share the contents, and staging the
> >        updates into the temporary file.  The FULL_FILES flag is recom‐
> >        mended for this purpose.  The temporary file can be deleted  or
> >        punched out afterwards.
> >
> >        An example program might look like this:
> >
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >
> >            ioctl(temp_fd, FICLONE, fd);
> >
> >            /* append 1MB of records */
> >            lseek(temp_fd, 0, SEEK_END);
> >            write(temp_fd, data1, 1000000);
> >
> >            /* update record index */
> >            pwrite(temp_fd, data1, 600, 98765);
> >            pwrite(temp_fd, data2, 320, 54321);
> >            pwrite(temp_fd, data2, 15, 0);
> >
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .flags = XFS_EXCHRANGE_TO_EOF,
> >            };
> >
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> >        The  second  is  a  software-defined  storage host (e.g. a disk
> >        jukebox) which implements an atomic scatter-gather  write  com‐
> >        mand.   Provided the exported disk's logical block size matches
> >        the file's allocation unit size, this can be done by creating a
> >        temporary file and writing the data at the appropriate offsets.
> >        It is recommended that the temporary file be truncated  to  the
> >        size  of  the  regular file before any writes are staged to the
> >        temporary file to avoid issues with zeroing during  EOF  exten‐
> >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> >        only the file allocation units involved  in  the  emulated  de‐
> >        vice's  write  command.  The temporary file should be truncated
> >        or punched out completely before being reused to stage  another
> >        write.
> >
> >        An example program might look like this:
> >
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >            struct stat sb;
> >            int blksz;
> >
> >            fstat(fd, &sb);
> >            blksz = sb.st_blksize;
> >
> >            /* land scatter gather writes between 100fsb and 500fsb */
> >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> >
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .file1_offset = blksz * 100,
> >                .file2_offset = blksz * 100,
> >                .length       = blksz * 400,
> >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> >                                XFS_EXCHRANGE_FILE1_DSYNC,
> >            };
> >
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > NOTES
> >        Some  filesystems may limit the amount of data or the number of
> >        extents that can be exchanged in a single call.
> >
> > SEE ALSO
> >        ioctl(2)
> >
> > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> >
> > NAME
> >        ioctl_xfs_commit_range - conditionally exchange the contents of
> >        parts of two files
> >
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> >
> >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> >        mit_range *arg);
> >
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes  the contents of the two ranges if file2_fd passes cer‐
> >        tain freshness criteria.
> >
> >        After locking both files but before  exchanging  the  contents,
> >        the  supplied  file2_ino field must match file2_fd's inode num‐
> >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> >        cation time and change time of file2.  If they  do  not  match,
> >        EBUSY will be returned.
> >
> 
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?

It's the other way 'round -- mtime doesn't change but ctime does.  The
race I'm trying to protect against is:

Thread 0			Thread 1
<snapshot fd cmtime>
<start writing tempfd>
				<fstat fd>
				<write to fd>
				<futimens to reset mtime>
<commitrange>

mtime is controllable by "attackers" but ctime isn't.  I think we only
need to capture ctime, but ye olde swapext ioctl (from which this
derives) did both.

> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?

Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
work in the intended manner in XFS, no.

> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)

Given the way the iversion discussions have gone (file data write
counter) I don't think there's a way to support commitrange on
non-iversion filesystems.

I withdrew any plans to make this more than an XFS-specific ioctl last
year after giving up on ever getting through fsdevel review.  I think
the last reply I got was from viro back in 2021...

> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.

...and given the lack of interest from any other filesystem developers
in porting it to !xfs, I'm not likely to take this up ever again.  To be
fair, I think the only filesystems that could possibly support
EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.

> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.

Another way I could've structured this (and still could!) would be to
declare the entire freshness region as an untyped u64 fresh[4] blob and
add a START_COMMIT ioctl to fill it out.  Then the kernel fs drivers
gets to determine what goes in there.

That at least would be less work for userspace to do.

I don't want userspace API wrangling to hold up online repair **yet
again**.  I only made EXCHANGE_RANGE so that I could test the functionality
that fsck relies on.  If there was another way to test it then I would
have gladly done that.  Further down the line, COMMIT_RANGE will get us
out of trouble with the xfs defrag tool.

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 10:53   ` Jeff Layton
@ 2024-02-27 16:06     ` Darrick J. Wong
  2024-03-01 13:16       ` Jeff Layton
  2024-02-27 23:46     ` Dave Chinner
  1 sibling, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-27 16:06 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > Hi all,
> > > 
> > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > ranges of bytes between two files atomically.  This new functionality
> > > enables data storage programs to stage and commit file updates such that
> > > reader programs will see either the old contents or the new contents in
> > > their entirety, with no chance of torn writes.  A successful call
> > > completion guarantees that the new contents will be seen even if the
> > > system fails.
> > > 
> > > The ability to exchange file fork mappings between files in this manner
> > > is critical to supporting online filesystem repair, which is built upon
> > > the strategy of constructing a clean copy of a damaged structure and
> > > committing the new structure into the metadata file atomically.
> > > 
> > > User programs will be able to update files atomically by opening an
> > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > they want to make, and exchange the relevant ranges of the temp file
> > > with the original file.  If the updates are aligned with the file block
> > > size, a new (since v2) flag provides for exchanging only the written
> > > areas.  Callers can arrange for the update to be rejected if the
> > > original file has been changed.
> > > 
> > > The intent behind this new userspace functionality is to enable atomic
> > > rewrites of arbitrary parts of individual files.  For years, application
> > > programmers wanting to ensure the atomicity of a file update had to
> > > write the changes to a new file in the same directory, fsync the new
> > > file, rename the new file on top of the old filename, and then fsync the
> > > directory.  People get it wrong all the time, and $fs hacks abound.
> > > Here are the proposed manual pages:
> > > 
> 
> This is a cool idea!  I've had some handwavy ideas about making a gated
> write() syscall (i.e. only write if the change cookie hasn't changed),
> but something like this may be a simpler lift initially.

How /does/ userspace get at the change cookie nowadays?

> > > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > > 
> > > NAME
> > >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> > >        two files
> > > 
> > > SYNOPSIS
> > >        #include <sys/ioctl.h>
> > >        #include <xfs/xfs_fs_staging.h>
> > > 
> > >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> > >        xfs_exch_range *arg);
> > > 
> > > DESCRIPTION
> > >        Given  a  range  of bytes in a first file file1_fd and a second
> > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > >        changes the contents of the two ranges.
> > > 
> > >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> > >        tions, so no userspace-level locks need to be taken  to  obtain
> > >        consistent  results.  Implementations must guarantee that read‐
> > >        ers see either the old contents or the new  contents  in  their
> > >        entirety, even if the system fails.
> > > 
> > >        The  system  call  parameters are conveyed in structures of the
> > >        following form:
> > > 
> > >            struct xfs_exch_range {
> > >                __s64    file1_fd;
> > >                __s64    file1_offset;
> > >                __s64    file2_offset;
> > >                __s64    length;
> > >                __u64    flags;
> > > 
> > >                __u64    pad;
> > >            };
> > > 
> > >        The field pad must be zero.
> > > 
> > >        The fields file1_fd, file1_offset, and length define the  first
> > >        range of bytes to be exchanged.
> > > 
> > >        The fields file2_fd, file2_offset, and length define the second
> > >        range of bytes to be exchanged.
> > > 
> > >        Both files must be from the same filesystem mount.  If the  two
> > >        file  descriptors represent the same file, the byte ranges must
> > >        not overlap.  Most  disk-based  filesystems  require  that  the
> > >        starts  of  both ranges must be aligned to the file block size.
> > >        If this is the case, the ends of the ranges  must  also  be  so
> > >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > 
> > >        The field flags control the behavior of the exchange operation.
> > > 
> > >            XFS_EXCHRANGE_TO_EOF
> > >                   Ignore  the length parameter.  All bytes in file1_fd
> > >                   from file1_offset to EOF are moved to file2_fd,  and
> > >                   file2's  size is set to (file2_offset+(file1_length-
> > >                   file1_offset)).  Meanwhile, all bytes in file2  from
> > >                   file2_offset  to  EOF are moved to file1 and file1's
> > >                   size   is   set   to    (file1_offset+(file2_length-
> > >                   file2_offset)).
> > > 
> > >            XFS_EXCHRANGE_DSYNC
> > >                   Ensure  that  all modified in-core data in both file
> > >                   ranges and all metadata updates  pertaining  to  the
> > >                   exchange operation are flushed to persistent storage
> > >                   before the call returns.  Opening  either  file  de‐
> > >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> > >                   effect.
> > > 
> > >            XFS_EXCHRANGE_FILE1_WRITTEN
> > >                   Only exchange sub-ranges of file1_fd that are  known
> > >                   to  contain  data  written  by application software.
> > >                   Each sub-range may be  expanded  (both  upwards  and
> > >                   downwards)  to  align with the file allocation unit.
> > >                   For files on the data device, this is one filesystem
> > >                   block.   For  files  on the realtime device, this is
> > >                   the realtime extent size.  This facility can be used
> > >                   to  implement  fast  atomic scatter-gather writes of
> > >                   any complexity for software-defined storage  targets
> > >                   if  all  writes  are  aligned to the file allocation
> > >                   unit.
> > > 
> > >            XFS_EXCHRANGE_DRY_RUN
> > >                   Check the parameters and the feasibility of the  op‐
> > >                   eration, but do not change anything.
> > > 
> > > RETURN VALUE
> > >        On  error, -1 is returned, and errno is set to indicate the er‐
> > >        ror.
> > > 
> > > ERRORS
> > >        Error codes can be one of, but are not limited to, the  follow‐
> > >        ing:
> > > 
> > >        EBADF  file1_fd  is not open for reading and writing or is open
> > >               for append-only writes; or  file2_fd  is  not  open  for
> > >               reading and writing or is open for append-only writes.
> > > 
> > >        EINVAL The  parameters  are  not correct for these files.  This
> > >               error can also appear if either file  descriptor  repre‐
> > >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> > >               erally require the offset and  length  arguments  to  be
> > >               aligned to the fundamental block sizes of both files.
> > > 
> > >        EIO    An I/O error occurred.
> > > 
> > >        EISDIR One of the files is a directory.
> > > 
> > >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> > >               perform the operation.
> > > 
> > >        ENOSPC There is not enough free space  in  the  filesystem  ex‐
> > >               change the contents safely.
> > > 
> > >        EOPNOTSUPP
> > >               The filesystem does not support exchanging bytes between
> > >               the two files.
> > > 
> > >        EPERM  file1_fd or file2_fd are immutable.
> > > 
> > >        ETXTBSY
> > >               One of the files is a swap file.
> > > 
> > >        EUCLEAN
> > >               The filesystem is corrupt.
> > > 
> > >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> > >               filesystem.
> > > 
> > > CONFORMING TO
> > >        This API is XFS-specific.
> > > 
> > > USE CASES
> > >        Several  use  cases  are imagined for this system call.  In all
> > >        cases, application software must coordinate updates to the file
> > >        because the exchange is performed unconditionally.
> > > 
> > >        The  first  is a data storage program that wants to commit non-
> > >        contiguous updates to a file atomically and  coordinates  write
> > >        access  to that file.  This can be done by creating a temporary
> > >        file, calling FICLONE(2) to share the contents, and staging the
> > >        updates into the temporary file.  The FULL_FILES flag is recom‐
> > >        mended for this purpose.  The temporary file can be deleted  or
> > >        punched out afterwards.
> > > 
> > >        An example program might look like this:
> > > 
> > >            int fd = open("/some/file", O_RDWR);
> > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > 
> > >            ioctl(temp_fd, FICLONE, fd);
> > > 
> > >            /* append 1MB of records */
> > >            lseek(temp_fd, 0, SEEK_END);
> > >            write(temp_fd, data1, 1000000);
> > > 
> > >            /* update record index */
> > >            pwrite(temp_fd, data1, 600, 98765);
> > >            pwrite(temp_fd, data2, 320, 54321);
> > >            pwrite(temp_fd, data2, 15, 0);
> > > 
> > >            /* commit the entire update */
> > >            struct xfs_exch_range args = {
> > >                .file1_fd = temp_fd,
> > >                .flags = XFS_EXCHRANGE_TO_EOF,
> > >            };
> > > 
> > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > 
> > >        The  second  is  a  software-defined  storage host (e.g. a disk
> > >        jukebox) which implements an atomic scatter-gather  write  com‐
> > >        mand.   Provided the exported disk's logical block size matches
> > >        the file's allocation unit size, this can be done by creating a
> > >        temporary file and writing the data at the appropriate offsets.
> > >        It is recommended that the temporary file be truncated  to  the
> > >        size  of  the  regular file before any writes are staged to the
> > >        temporary file to avoid issues with zeroing during  EOF  exten‐
> > >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> > >        only the file allocation units involved  in  the  emulated  de‐
> > >        vice's  write  command.  The temporary file should be truncated
> > >        or punched out completely before being reused to stage  another
> > >        write.
> > > 
> > >        An example program might look like this:
> > > 
> > >            int fd = open("/some/file", O_RDWR);
> > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > >            struct stat sb;
> > >            int blksz;
> > > 
> > >            fstat(fd, &sb);
> > >            blksz = sb.st_blksize;
> > > 
> > >            /* land scatter gather writes between 100fsb and 500fsb */
> > >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > 
> > >            /* commit the entire update */
> > >            struct xfs_exch_range args = {
> > >                .file1_fd = temp_fd,
> > >                .file1_offset = blksz * 100,
> > >                .file2_offset = blksz * 100,
> > >                .length       = blksz * 400,
> > >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> > >                                XFS_EXCHRANGE_FILE1_DSYNC,
> > >            };
> > > 
> > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > 
> > > NOTES
> > >        Some  filesystems may limit the amount of data or the number of
> > >        extents that can be exchanged in a single call.
> > > 
> > > SEE ALSO
> > >        ioctl(2)
> > > 
> > > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > > 
> > > NAME
> > >        ioctl_xfs_commit_range - conditionally exchange the contents of
> > >        parts of two files
> > > 
> > > SYNOPSIS
> > >        #include <sys/ioctl.h>
> > >        #include <xfs/xfs_fs_staging.h>
> > > 
> > >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> > >        mit_range *arg);
> > > 
> > > DESCRIPTION
> > >        Given  a  range  of bytes in a first file file1_fd and a second
> > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > >        changes  the contents of the two ranges if file2_fd passes cer‐
> > >        tain freshness criteria.
> > > 
> > >        After locking both files but before  exchanging  the  contents,
> > >        the  supplied  file2_ino field must match file2_fd's inode num‐
> > >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> > >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > >        cation time and change time of file2.  If they  do  not  match,
> > >        EBUSY will be returned.
> > > 
> > 
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
> > 
> 
> ctime should always change if the mtime does. An mtime update means that
> the metadata was updated, so you also need to update the ctime. 

Exactly. :)

> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> > 
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> > 
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> > 
> 
> 
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.

TLDR: No.

> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.

Dave says that about a lot of things.  I'm not willing to delay the
online fsck project _even further_ for a bunch of vaporware that's not
even out on linux-xfs for review.

The difference in opinion between xfs and the rest of the kernel about
i_version is 50% of why I didn't include it here.  The other 50% is the
part where userspace can't access it, because I do not want to saddle my
mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
CHANGES.

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 15:45   ` Darrick J. Wong
@ 2024-02-27 16:58     ` Amir Goldstein
  0 siblings, 0 replies; 27+ messages in thread
From: Amir Goldstein @ 2024-02-27 16:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 5:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
[...]
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
>
> It's the other way 'round -- mtime doesn't change but ctime does.  The
> race I'm trying to protect against is:
>
> Thread 0                        Thread 1
> <snapshot fd cmtime>
> <start writing tempfd>
>                                 <fstat fd>
>                                 <write to fd>
>                                 <futimens to reset mtime>
> <commitrange>
>
> mtime is controllable by "attackers" but ctime isn't.  I think we only
> need to capture ctime, but ye olde swapext ioctl (from which this
> derives) did both.
>

Yes, that's what I meant. was just a braino.
mtime seems redundant, but if you want to keep it for compatibility with
legacy API, so be it.

> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
>
> Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
> work in the intended manner in XFS, no.
>

OK. for the record, AFAICT, the problem of NFS guys with xfs iversion
is that it is too
aggressive to their taste (i.e. bumped on atime updates), but because
ctime is always
updated along with iversion in xfs and because iversion has no granularity
problem, I think it is a better choice for you, regardless of any
other filesystems
and their interpretation of ctime or iversion.

> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
>
> Given the way the iversion discussions have gone (file data write
> counter) I don't think there's a way to support commitrange on
> non-iversion filesystems.
>
> I withdrew any plans to make this more than an XFS-specific ioctl last
> year after giving up on ever getting through fsdevel review.  I think
> the last reply I got was from viro back in 2021...
>

understandable.
I wasn't implying that you should hoist this out of XFS.
I was wondering about why not use xfs's iversion, which
seems like a better change counter than ctime.

> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
>
> ...and given the lack of interest from any other filesystem developers
> in porting it to !xfs, I'm not likely to take this up ever again.  To be
> faiproblemr, I think the only filesystems that could possibly support
> EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.
>

Again, sorry if my question were misinterpreted.
I was not trying to imply that this API should be hoisted.

> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
>
> Another way I could've structured this (and still could!) would be to
> deproblemclare the entire freshness region as an untyped u64 fresh[4] blob and
> add a START_COMMIT ioctl to fill it out.  Then the kernel fs drivers
> gets to determine what goes in there.
>
> That at least would be less work for userspace to do.
>

To me that makes sense. Cleaner API.
Less questions asked.
But it's up to you.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 17:46 ` Darrick J. Wong
  2024-02-27 18:52   ` Amir Goldstein
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-27 17:46 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs, hch; +Cc: Amir Goldstein, jlayton

From: Darrick J. Wong <djwong@kernel.org>

To head off bikeshedding about the fields in xfs_commit_range, let's
make it an opaque u64 array and require the userspace program to call
a third ioctl to sample the freshness data for us.  If we ever converge
on a definition for i_version then we can use that; for now we'll just
use mtime/ctime like the old swapext ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   13 +++--------
 fs/xfs/xfs_exchrange.c |   15 ++++++++++++
 fs/xfs/xfs_exchrange.h |    1 +
 fs/xfs/xfs_ioctl.c     |   58 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 01b3553adfc55..4019a78ee3ea5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -860,14 +860,8 @@ struct xfs_commit_range {
 
 	__u64		flags;		/* see XFS_EXCHRANGE_* below */
 
-	/* file2 metadata for freshness checks */
-	__u64		file2_ino;	/* inode number */
-	__s64		file2_mtime;	/* modification time */
-	__s64		file2_ctime;	/* change time */
-	__s32		file2_mtime_nsec; /* mod time, nsec */
-	__s32		file2_ctime_nsec; /* change time, nsec */
-
-	__u64		pad;		/* must be zeroes */
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[5];
 };
 
 /*
@@ -973,7 +967,8 @@ struct xfs_commit_range {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOWR('X', 129, struct xfs_exch_range)
-#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 129, struct xfs_commit_range)
+#define XFS_IOC_START_COMMIT	     _IOWR('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index e55ae06f1a32c..dae855515c3c4 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -863,3 +863,18 @@ xfs_exchange_range(
 		fsnotify_modify(fxr->file2);
 	return 0;
 }
+
+/* Sample freshness data from fxr->file2 for a commit range operation. */
+void
+xfs_exchrange_freshness(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode2 = file_inode(fxr->file2);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+
+	xfs_ilock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+	fxr->file2_ino = ip2->i_ino;
+	fxr->file2_ctime = inode_get_ctime(inode2);
+	fxr->file2_mtime = inode_get_mtime(inode2);
+	xfs_iunlock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 2dd9ab7d76828..942283a7f75f5 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -36,6 +36,7 @@ struct xfs_exchrange {
 	struct timespec64	file2_ctime;
 };
 
+void xfs_exchrange_freshness(struct xfs_exchrange *fxr);
 int xfs_exchange_range(struct xfs_exchrange *fxr);
 
 /* XFS-specific parts of file exchanges */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ee26ac2028da1..1940da72a1da7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2402,6 +2402,47 @@ xfs_ioc_exchange_range(
 	return error;
 }
 
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u64		pad;		/* zero */
+};
+
+static long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	xfs_exchrange_freshness(&fxr);
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+	kern_f->file2_ino		= fxr.file2_ino;
+	kern_f->file2_mtime		= fxr.file2_mtime.tv_sec;
+	kern_f->file2_mtime_nsec	= fxr.file2_mtime.tv_nsec;
+	kern_f->file2_ctime		= fxr.file2_ctime.tv_sec;
+	kern_f->file2_ctime_nsec	= fxr.file2_ctime.tv_nsec;
+
+	user_f = (struct xfs_commit_range_fresh *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static long
 xfs_ioc_commit_range(
 	struct file			*file,
@@ -2411,12 +2452,15 @@ xfs_ioc_commit_range(
 		.file2			= file,
 	};
 	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
 	struct fd			file1;
 	int				error;
 
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
 	if (copy_from_user(&args, argp, sizeof(args)))
 		return -EFAULT;
-	if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+	if (memchr_inv(&kern_f->pad, 0, sizeof(kern_f->pad)))
 		return -EINVAL;
 	if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
 		return -EINVAL;
@@ -2425,11 +2469,11 @@ xfs_ioc_commit_range(
 	fxr.file2_offset	= args.file2_offset;
 	fxr.length		= args.length;
 	fxr.flags		= args.flags | __XFS_EXCHRANGE_CHECK_FRESH2;
-	fxr.file2_ino		= args.file2_ino;
-	fxr.file2_mtime.tv_sec	= args.file2_mtime;
-	fxr.file2_mtime.tv_nsec	= args.file2_mtime_nsec;
-	fxr.file2_ctime.tv_sec	= args.file2_ctime;
-	fxr.file2_ctime.tv_nsec	= args.file2_ctime_nsec;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
 
 	file1 = fdget(args.file1_fd);
 	if (!file1.file)
@@ -2782,6 +2826,8 @@ xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-27 18:52   ` Amir Goldstein
  2024-02-29 23:27     ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Amir Goldstein @ 2024-02-27 18:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> To head off bikeshedding about the fields in xfs_commit_range, let's
> make it an opaque u64 array and require the userspace program to call
> a third ioctl to sample the freshness data for us.  If we ever converge
> on a definition for i_version then we can use that; for now we'll just
> use mtime/ctime like the old swapext ioctl.

This addresses my concerns about using mtime/ctime.

I have to say, Darrick, that I think that referring to this concern as
bikeshedding is not being honest.

I do hate nit picking reviews and I do hate "maybe also fix the world"
review comments, but I think the question about using mtime/ctime in
this new API was not out of place and I think that making the freshness
data opaque is better for everyone in the long run and hopefully, this will
help you move to the things you care about faster.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 10:53   ` Jeff Layton
  2024-02-27 16:06     ` Darrick J. Wong
@ 2024-02-27 23:46     ` Dave Chinner
  2024-02-28 10:30       ` Jeff Layton
  1 sibling, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2024-02-27 23:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch

On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?

Like xfs_fsr doing online defrag, we really only care about explicit
user data changes here, not internal layout and metadata changes to
the files...

> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> > 
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> > 
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> > 
> 
> 
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.
> 
> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.

By "fix", I meant "remove".

i.e. the patches I was proposing were to remove SB_I_VERSION support
from XFS so NFS just uses the ctime on XFS because the recent
changes to i_version make it a ctime change counter, not an inode
change counter.

Then patches were posted for finer grained inode timestamps to allow
everything to use ctime instead of i_version, and with that I
thought NFS was just going to change to ctime for everyone with that
the whole change cookie issue was going away.

It now sounds like that isn't happening, so I'll just ressurect the
patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
support from XFS for now and us XFS people can just go back to
ignoring this problem again.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-28  1:50 ` Colin Walters
  2024-02-29 20:18   ` Darrick J. Wong
  3 siblings, 1 reply; 27+ messages in thread
From: Colin Walters @ 2024-02-28  1:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically.  This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes.  A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file. 

It's probably worth noting that the "reflinking the source file" here
is optional, right?  IOW one can just:

- open(O_TMPFILE)
- write()
- ioctl(FIEXCHANGE_RANGE)

I suspect the "simpler" non-database cases (think e.g. editors
operating on plain text files) are going to be operating on an
in-memory copy; in theory of course we could identify common ranges
and reflink, but it's not clear to me it's really worth it at the
tiny scale most source files are.

> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory

More sophisticated tools already are using O_TMPFILE I would say,
just with a final last step of materializing it with a name,
and then rename() into place.  So if this also
obviates the need for
https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
that seems good.

>        Exchanges  are  atomic  with  regards to concurrent file opera‐
>        tions, so no userspace-level locks need to be taken  to  obtain
>        consistent  results.  Implementations must guarantee that read‐
>        ers see either the old contents or the new  contents  in  their
>        entirety, even if the system fails.

But given that we're reusing the same inode, I don't think that can *really* be true...at least, not without higher level serialization.

A classic case today is dconf in GNOME is a basic memory-mapped database file that is atomically replaced by the "create new file, rename into place" model.  Clients with mmap() view just see the old data until they reload explicitly.  But with this, clients with mmap'd view *will* immediately see the new contents (because it's the same inode, right?) and that's just going to lead to possibly split reads and undefined behavior - without extra userspace serialization or locking (that more proper databases) are going to be doing.

Arguably of course, dconf is too simple and more sophisticated tools like sqlite or LMDB could make use of this.  (There's some special atomic write that got added to f2fs for sqlite last I saw...I'm curious if this could replace it)

But still...it seems to me like there's going to be quite a lot of the "potentially concurrent reader, atomic replace desired" pattern and since this can't replace that, we should call that out explicitly in the man page.  And also if so, then there's still a need for the linkat(AT_REPLACE) etc.

>            XFS_EXCHRANGE_TO_EOF

I kept reading this as some sort of typo...would it really be too onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.?  Echoes of unix "creat" here =)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 23:46     ` Dave Chinner
@ 2024-02-28 10:30       ` Jeff Layton
  2024-02-28 10:58         ` Amir Goldstein
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2024-02-28 10:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch

On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> 
> Like xfs_fsr doing online defrag, we really only care about explicit
> user data changes here, not internal layout and metadata changes to
> the files...
> 
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > > 
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > > 
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > > 
> > 
> > 
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> > 
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
> 
> By "fix", I meant "remove".
> 
> i.e. the patches I was proposing were to remove SB_I_VERSION support
> from XFS so NFS just uses the ctime on XFS because the recent
> changes to i_version make it a ctime change counter, not an inode
> change counter.
> 
> Then patches were posted for finer grained inode timestamps to allow
> everything to use ctime instead of i_version, and with that I
> thought NFS was just going to change to ctime for everyone with that
> the whole change cookie issue was going away.
> 
> It now sounds like that isn't happening, so I'll just ressurect the
> patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> support from XFS for now and us XFS people can just go back to
> ignoring this problem again.


I must have misunderstood what you said when we were at LPC this year:

After the multigrain ctime patches were reverted, you mentioned that you
were working on a patchset that used the unused bits in the tv_nsec
field as counter for counting changes that have occurred within the same
tv_nsec value.

Did those not pan out for some reason?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28 10:30       ` Jeff Layton
@ 2024-02-28 10:58         ` Amir Goldstein
  2024-02-28 11:01           ` Jeff Layton
  0 siblings, 1 reply; 27+ messages in thread
From: Amir Goldstein @ 2024-02-28 10:58 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
	Jan Kara

On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> >
> > Like xfs_fsr doing online defrag, we really only care about explicit
> > user data changes here, not internal layout and metadata changes to
> > the files...
> >
> > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > Is there a real need to support it on filesystems that do not support
> > > > i_version(?)
> > > >
> > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > how other filesystems may update ctime in those cases.
> > > >
> > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > really explicitly requests a bump of i_version on the next change.
> > > >
> > >
> > >
> > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > >
> > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > some patches in progress to fix that however.
> >
> > By "fix", I meant "remove".
> >
> > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > from XFS so NFS just uses the ctime on XFS because the recent
> > changes to i_version make it a ctime change counter, not an inode
> > change counter.
> >
> > Then patches were posted for finer grained inode timestamps to allow
> > everything to use ctime instead of i_version, and with that I
> > thought NFS was just going to change to ctime for everyone with that
> > the whole change cookie issue was going away.
> >
> > It now sounds like that isn't happening, so I'll just ressurect the
> > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > support from XFS for now and us XFS people can just go back to
> > ignoring this problem again.
>
>
> I must have misunderstood what you said when we were at LPC this year:
>
> After the multigrain ctime patches were reverted, you mentioned that you
> were working on a patchset that used the unused bits in the tv_nsec
> field as counter for counting changes that have occurred within the same
> tv_nsec value.
>
> Did those not pan out for some reason?

Jeff,

Could I trouble you to suggest a topic for LSFMM to summarize
everything that has been going on this year wrt change cookie/time
at xfs/vfs level and try to set a clear roadmap?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28 10:58         ` Amir Goldstein
@ 2024-02-28 11:01           ` Jeff Layton
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2024-02-28 11:01 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
	Jan Kara

On Wed, 2024-02-28 at 12:58 +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > 
> > > Like xfs_fsr doing online defrag, we really only care about explicit
> > > user data changes here, not internal layout and metadata changes to
> > > the files...
> > > 
> > > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > > Is there a real need to support it on filesystems that do not support
> > > > > i_version(?)
> > > > > 
> > > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > > how other filesystems may update ctime in those cases.
> > > > > 
> > > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > > really explicitly requests a bump of i_version on the next change.
> > > > > 
> > > > 
> > > > 
> > > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > > > 
> > > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > > some patches in progress to fix that however.
> > > 
> > > By "fix", I meant "remove".
> > > 
> > > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > > from XFS so NFS just uses the ctime on XFS because the recent
> > > changes to i_version make it a ctime change counter, not an inode
> > > change counter.
> > > 
> > > Then patches were posted for finer grained inode timestamps to allow
> > > everything to use ctime instead of i_version, and with that I
> > > thought NFS was just going to change to ctime for everyone with that
> > > the whole change cookie issue was going away.
> > > 
> > > It now sounds like that isn't happening, so I'll just ressurect the
> > > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > > support from XFS for now and us XFS people can just go back to
> > > ignoring this problem again.
> > 
> > 
> > I must have misunderstood what you said when we were at LPC this year:
> > 
> > After the multigrain ctime patches were reverted, you mentioned that you
> > were working on a patchset that used the unused bits in the tv_nsec
> > field as counter for counting changes that have occurred within the same
> > tv_nsec value.
> > 
> > Did those not pan out for some reason?
> 
> Jeff,
> 
> Could I trouble you to suggest a topic for LSFMM to summarize
> everything that has been going on this year wrt change cookie/time
> at xfs/vfs level and try to set a clear roadmap?
> 

Sure, I'll send something later today.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 01/14] vfs: export remap and write check helpers
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
@ 2024-02-28 15:40   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
@ 2024-02-29 20:18   ` Darrick J. Wong
  2024-02-29 22:43     ` Colin Walters
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-29 20:18 UTC (permalink / raw)
  To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Tue, Feb 27, 2024 at 08:50:20PM -0500, Colin Walters wrote:
> 
> 
> On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file. 
> 
> It's probably worth noting that the "reflinking the source file" here
> is optional, right?  IOW one can just:
> 
> - open(O_TMPFILE)
> - write()
> - ioctl(FIEXCHANGE_RANGE)

If the write() rewrites the entire file, then yes, that'll also work.

> I suspect the "simpler" non-database cases (think e.g. editors
> operating on plain text files) are going to be operating on an
> in-memory copy; in theory of course we could identify common ranges
> and reflink, but it's not clear to me it's really worth it at the
> tiny scale most source files are.

Correct, there's no built-in dedupe.  For small files you'll probably
end up with a single allocation anyway, which is ideal in terms of
ondisk metadata overhead.

One advantage that EXCHANGE_RANGE has over the rename dance is that the
calling application doesn't have to copy all the file attributes and
xattrs to the temporary file before the switch.

> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory
> 
> More sophisticated tools already are using O_TMPFILE I would say,
> just with a final last step of materializing it with a name,
> and then rename() into place.  So if this also
> obviates the need for
> https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
> that seems good.

It would, though I would bet that extending linkat (or rename, or
whatever) is going to be the only workable solution for old / simple
filesystems (e.g. fat32).

> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> 
> But given that we're reusing the same inode, I don't think that can
> *really* be true...at least, not without higher level serialization.

Higher level coordination is required, yes.  It doesn't have to be
serialization, though.  The committing thread could signal all the other
readers that they should invalidate and restart whatever they're working
on if that work depends on the file that was COMMIT_RANGE'd.  The
readers could detect unexpected data and resample mtime of the files
they've read and restart if it's changed.

> A classic case today is dconf in GNOME is a basic memory-mapped
> database file that is atomically replaced by the "create new file,
> rename into place" model.  Clients with mmap() view just see the old
> data until they reload explicitly.  But with this, clients with mmap'd
> view *will* immediately see the new contents (because it's the same
> inode, right?)

Correct, they'll start seeing the new contents as soon as they access
the affected pages.

How /does/ dconf handle those changes?  Does it rename the file and
signal all the other dconf threads to reopen the file?  And then those
threads get the new file contents?

>                and that's just going to lead to possibly split reads
> and undefined behavior - without extra userspace serialization or
> locking (that more proper databases) are going to be doing.

Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
with mmap without actual flock()ing. :(

fsnotify will send a message out to userspace after the exchange
finishes, which means that userspace could watch for the notifications
via fanotify.  However, that's still a bit racy... :/

> Arguably of course, dconf is too simple and more sophisticated tools
> like sqlite or LMDB could make use of this.  (There's some special
> atomic write that got added to f2fs for sqlite last I saw...I'm
> curious if this could replace it)

I think so:

F2FS_IOC_START_ATOMIC_WRITE -> XFS_IOC_START_COMMIT,
F2FS_IOC_COMMIT_ATOMIC_WRITE -> XFS_IOC_COMMIT_RANGE, and
F2FS_IOC_ABORT_VOLATILE_WRITE merely turns into close(temp_fd);

> But still...it seems to me like there's going to be quite a lot of the
> "potentially concurrent reader, atomic replace desired" pattern and
> since this can't replace that, we should call that out explicitly in
> the man page.  And also if so, then there's still a need for the
> linkat(AT_REPLACE) etc.

Hmm, I think I'll shrink that paragraph of the manpage:

"Exchanges are atomic with regards to concurrent file operations.
Implementations must guarantee that readers see either the old contents
or the new contents in their entirety, even if the system fails."

> 
> >            XFS_EXCHRANGE_TO_EOF
> 
> I kept reading this as some sort of typo...would it really be too
> onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.?  Echoes of
> unix "creat" here =)

Yeah, I've expanded that to XFS_EXCHANGE_RANGE_TO_EOF for v29.5.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-29 20:18   ` Darrick J. Wong
@ 2024-02-29 22:43     ` Colin Walters
  2024-03-01  0:03       ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Colin Walters @ 2024-02-29 22:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
>
> Correct, there's no built-in dedupe.  For small files you'll probably
> end up with a single allocation anyway, which is ideal in terms of
> ondisk metadata overhead.

Makes sense.

> though I would bet that extending linkat (or rename, or
> whatever) is going to be the only workable solution for old / simple
> filesystems (e.g. fat32).

Ah, right; that too.

> How /does/ dconf handle those changes?  Does it rename the file and
> signal all the other dconf threads to reopen the file?  And then those
> threads get the new file contents?

I briefly skimmed the code and couldn't find it, but yes I believe it's basically that clients have an inotify watch that gets handled from the mainloop and clients close and reopen and re-mmap - it's probably nonexistent to have non-mainloop threads reading things from the mmap, so there's no races with any other threads.

>
> Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
> with mmap without actual flock()ing. :(
>
> fsnotify will send a message out to userspace after the exchange
> finishes, which means that userspace could watch for the notifications
> via fanotify.  However, that's still a bit racy... :/

Right.  However...it's not just about mmap.  Sorry this is a minor rant but...near my top ten list of changes to make with a time machine for Unix would be the concept of a contents-immutable file, like all the seals that work on memfd with F_ADD_SEALS (and outside of fsverity, which is good but can be a bit of a heavier hammer).

A few times I've been working on shell script in my editor on my desktop, and these shell scripts are tests because shell script is so tempting.  I'm sure this familiar, given (x)fstests.

And if you just run the tests (directly from source in git), and then notice a bug, and start typing in your editor, save the changes, and then and your editor happens to do a generic "open(O_TRUNC), save" instead of an atomic rename.  This happens to be what `nano` and VSCode do, although at least the `vi` I have here does an atomic rename.  (One could say all editors that don't are broken...but...)

And now because the way bash works (and I assume other historical Unix shells) is that they interpret the file *as they're reading it* in this scenario you can get completely undefined behavior.  It could do *anything*.

At least one of those times, I got an error from an `rm -rf` invocation that happened to live in one of those test scripts...that could have in theory just gone off and removed anything.

Basically the contents-immutable is really what you *always* want for executables and really anything that can be parsed without locking (like, almost all config files in /etc too).  With ELF files there's EXTBUSY if it *happens* to be in use, but that's just a hack.  Also in that other thread about racing writes to suid executables...well, there'd be no possibility for races if we just denied writing because again - it makes no sense to just make random writes in-place to an executable.  (OK I did see the zig folks are trying an incremental linker, but still I would just assume reflinks are available for that)

Now this is relevant here because, I don't think anything like dpkg/rpm and all those things could ever use this ioctl for this reason.

So, it seems to me like it should really be more explicitly targeted at
- Things that are using open()+write() today and it's safe for that use case
- The database cases

And not talk about replacing the general open(O_TMPFILE) + rename() path.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27 18:52   ` Amir Goldstein
@ 2024-02-29 23:27     ` Darrick J. Wong
  2024-03-01 13:00       ` Amir Goldstein
  2024-03-01 13:31       ` Jeff Layton
  0 siblings, 2 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-02-29 23:27 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > To head off bikeshedding about the fields in xfs_commit_range, let's
> > make it an opaque u64 array and require the userspace program to call
> > a third ioctl to sample the freshness data for us.  If we ever converge
> > on a definition for i_version then we can use that; for now we'll just
> > use mtime/ctime like the old swapext ioctl.
> 
> This addresses my concerns about using mtime/ctime.

Oh good! :)

> I have to say, Darrick, that I think that referring to this concern as
> bikeshedding is not being honest.
> 
> I do hate nit picking reviews and I do hate "maybe also fix the world"
> review comments, but I think the question about using mtime/ctime in
> this new API was not out of place

I agree, your question about mtime/ctime:

"Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?"

was a very good question.  But perhaps that statement referred to the
other part of that thread.

>                                   and I think that making the freshness
> data opaque is better for everyone in the long run and hopefully, this will
> help you move to the things you care about faster.

I wish you'd suggested an opaque blob that the fs can lay out however it
wants instead of suggesting specifically the change cookie.  I'm very
much ok with an opaque freshness blob that allows future flexibility in
how we define the blob's contents.

I was however very upset about the Jeff's suggestion of using i_version.
I apologize for using all caps in that reply, and snarling about it in
the commit message here.  The final version of this patch will not have
that.

That said, I don't think it is at all helpful to suggest using a file
attribute whose behavior is as yet unresolved.  Multigrain timestamps
were a clever idea, regrettably reverted.  As far as I could tell when I
wrote my reply, neither had NFS implemented a better behavior and
quietly merged it; nor have Jeff and Dave produced any sort of candidate
patchset to fix all the resulting issues in XFS.

Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
internal" made me think "OH $deity, they wants me to do that work
too???"

A better way to have woreded that might've been "How about switching
this to a fs-determined structure so that we can switch the freshness
check to i_version when that's fully working on XFS?"

The problem I have with reading patch review emails is that I can't
easily tell whether an author's suggestion is being made in a casual
offhand manner?  Or if it reflects something they feel strongly needs
change before merging.

In fairness to you, Amir, I don't know how much you've kept on top of
that i_version vs. XFS discussion.  So I have no idea if you were aware
of the status of that work.

--D

> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-29 22:43     ` Colin Walters
@ 2024-03-01  0:03       ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-03-01  0:03 UTC (permalink / raw)
  To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Thu, Feb 29, 2024 at 05:43:21PM -0500, Colin Walters wrote:
> 
> 
> On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
> >
> > Correct, there's no built-in dedupe.  For small files you'll probably
> > end up with a single allocation anyway, which is ideal in terms of
> > ondisk metadata overhead.
> 
> Makes sense.
> 
> > though I would bet that extending linkat (or rename, or
> > whatever) is going to be the only workable solution for old / simple
> > filesystems (e.g. fat32).
> 
> Ah, right; that too.
> 
> > How /does/ dconf handle those changes?  Does it rename the file and
> > signal all the other dconf threads to reopen the file?  And then those
> > threads get the new file contents?
> 
> I briefly skimmed the code and couldn't find it, but yes I believe
> it's basically that clients have an inotify watch that gets handled
> from the mainloop and clients close and reopen and re-mmap - it's
> probably nonexistent to have non-mainloop threads reading things from
> the mmap, so there's no races with any other threads.

Hrmm.  IIRC inotify and fanotify both use the same fsnotify backend.
fsnotify events are emitted after i_rwsem drops, which (if I read
read_write.c correctly) means this is technically racy.

That said if they're mostly waiting around in the inotify loop then it
probably doesn't matter.

> > Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
> > with mmap without actual flock()ing. :(
> >
> > fsnotify will send a message out to userspace after the exchange
> > finishes, which means that userspace could watch for the notifications
> > via fanotify.  However, that's still a bit racy... :/
> 
> Right.  However...it's not just about mmap.  Sorry this is a minor
> rant but...near my top ten list of changes to make with a time machine
> for Unix would be the concept of a contents-immutable file, like all
> the seals that work on memfd with F_ADD_SEALS (and outside of
> fsverity, which is good but can be a bit of a heavier hammer).

You and me both. :)

Also I want a persistent file contents write counter; and a
file-anything write counter.

Oh, and a conditional read where you pass in the file contents write
counter and returns an error if the file has been changed since sampling
time.  The changecookie thing mentioned elsewhere gets us towards that,
if onlty the issues w/ XFS get resolved.

> A few times I've been working on shell script in my editor on my
> desktop, and these shell scripts are tests because shell script is so
> tempting.  I'm sure this familiar, given (x)fstests.
> 
> And if you just run the tests (directly from source in git), and then
> notice a bug, and start typing in your editor, save the changes, and
> then and your editor happens to do a generic "open(O_TRUNC), save"
> instead of an atomic rename.  This happens to be what `nano` and
> VSCode do, although at least the `vi` I have here does an atomic
> rename.  (One could say all editors that don't are broken...but...)

I think they do O_TRUNC because it saves them from having to copy the
file attrs and xattrs.  Too bad it severely screws up a program running
in another terminal that just happens to hit the zero-byte file.

> And now because the way bash works (and I assume other historical Unix
> shells) is that they interpret the file *as they're reading it* in
> this scenario you can get completely undefined behavior.  It could do
> *anything*.
> 
> At least one of those times, I got an error from an `rm -rf`
> invocation that happened to live in one of those test scripts...that
> could have in theory just gone off and removed anything.
> 
> Basically the contents-immutable is really what you *always* want for
> executables and really anything that can be parsed without locking
> (like, almost all config files in /etc too).  With ELF files there's

Yes.

> EXTBUSY if it *happens* to be in use, but that's just a hack.  Also in

A hack that doesn't work for scripts.  Either interpreters have to read
the entire script into memory before execution, or I guess they can do
the insane thing that the DOS batch interpreter did, where before each
statement it would save the file pos, close it, execute the command,
reopen the batch file, and seek back to that line.

> that other thread about racing writes to suid executables...well,
> there'd be no possibility for races if we just denied writing because
> again - it makes no sense to just make random writes in-place to an
> executable.  (OK I did see the zig folks are trying an incremental
> linker, but still I would just assume reflinks are available for that)
> 
> Now this is relevant here because, I don't think anything like
> dpkg/rpm and all those things could ever use this ioctl for this
> reason.

Right.  dpkg executable file replacement really doesn't make much sense
for exchange range.  That's also wasn't the usecase I was targetting
though admittedly I'm only using this ioctl to test functionality that
online fsck requires.

> So, it seems to me like it should really be more explicitly targeted at
> - Things that are using open()+write() today and it's safe for that use case
> - The database cases
> 
> And not talk about replacing the general open(O_TMPFILE) + rename() path.

I think I'll change the cover letter to talk about what it does, what
problems it solves, and what problems it introduces.  Figuring out how
to take advantage of it is an exercise for application writers.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-29 23:27     ` Darrick J. Wong
@ 2024-03-01 13:00       ` Amir Goldstein
  2024-03-01 13:31       ` Jeff Layton
  1 sibling, 0 replies; 27+ messages in thread
From: Amir Goldstein @ 2024-03-01 13:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Fri, Mar 1, 2024 at 1:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> >
> > This addresses my concerns about using mtime/ctime.
>
> Oh good! :)
>
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> >
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
>
> I agree, your question about mtime/ctime:
>
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
>
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
>
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
>
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
>

I wish I had thought of it myself - it is a good idea - just did not
occur to me.
Using the language of i_changecounter, that is "the current xfs implementation
of i_version", I still think that using it as the content of the
opaque freshness blob
makes more sense than mtime+ctime, but it is none of my concern what
you decide to fill in the freshness blob for the first version.

I was not aware of the way xfs_fsr is currently using mtime+ctime when
I replied and I am not sure if and how it is relevant to the new API.

> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
>
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
>
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
>

Yeh, I should have chosen my words more carefully.
I was perfectly aware of your lack of interest in doing extra work
and wasn't trying to request any.

> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner?  Or if it reflects something they feel strongly needs
> change before merging.
>

Can't speak for everyone else, but coming from the middle east,
I have fewer politeness filters.
When I write "wouldn't it be better to use change_cookie?"
I am just asking that question.

When I am asking something to be changed before merge,
I try to be much more explicit about it and this is what I expect
others to do when reviewing my patches.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 16:06     ` Darrick J. Wong
@ 2024-03-01 13:16       ` Jeff Layton
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Layton @ 2024-03-01 13:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Tue, 2024-02-27 at 08:06 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > > ranges of bytes between two files atomically.  This new functionality
> > > > enables data storage programs to stage and commit file updates such that
> > > > reader programs will see either the old contents or the new contents in
> > > > their entirety, with no chance of torn writes.  A successful call
> > > > completion guarantees that the new contents will be seen even if the
> > > > system fails.
> > > > 
> > > > The ability to exchange file fork mappings between files in this manner
> > > > is critical to supporting online filesystem repair, which is built upon
> > > > the strategy of constructing a clean copy of a damaged structure and
> > > > committing the new structure into the metadata file atomically.
> > > > 
> > > > User programs will be able to update files atomically by opening an
> > > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > > they want to make, and exchange the relevant ranges of the temp file
> > > > with the original file.  If the updates are aligned with the file block
> > > > size, a new (since v2) flag provides for exchanging only the written
> > > > areas.  Callers can arrange for the update to be rejected if the
> > > > original file has been changed.
> > > > 
> > > > The intent behind this new userspace functionality is to enable atomic
> > > > rewrites of arbitrary parts of individual files.  For years, application
> > > > programmers wanting to ensure the atomicity of a file update had to
> > > > write the changes to a new file in the same directory, fsync the new
> > > > file, rename the new file on top of the old filename, and then fsync the
> > > > directory.  People get it wrong all the time, and $fs hacks abound.
> > > > Here are the proposed manual pages:
> > > > 
> > 
> > This is a cool idea!  I've had some handwavy ideas about making a gated
> > write() syscall (i.e. only write if the change cookie hasn't changed),
> > but something like this may be a simpler lift initially.
> 
> How /does/ userspace get at the change cookie nowadays?
> 

Today, it doesn't. That would need to be exposed before we could make
that work.

> > > > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> > > >        two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> > > >        xfs_exch_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes the contents of the two ranges.
> > > > 
> > > >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> > > >        tions, so no userspace-level locks need to be taken  to  obtain
> > > >        consistent  results.  Implementations must guarantee that read‐
> > > >        ers see either the old contents or the new  contents  in  their
> > > >        entirety, even if the system fails.
> > > > 
> > > >        The  system  call  parameters are conveyed in structures of the
> > > >        following form:
> > > > 
> > > >            struct xfs_exch_range {
> > > >                __s64    file1_fd;
> > > >                __s64    file1_offset;
> > > >                __s64    file2_offset;
> > > >                __s64    length;
> > > >                __u64    flags;
> > > > 
> > > >                __u64    pad;
> > > >            };
> > > > 
> > > >        The field pad must be zero.
> > > > 
> > > >        The fields file1_fd, file1_offset, and length define the  first
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        The fields file2_fd, file2_offset, and length define the second
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        Both files must be from the same filesystem mount.  If the  two
> > > >        file  descriptors represent the same file, the byte ranges must
> > > >        not overlap.  Most  disk-based  filesystems  require  that  the
> > > >        starts  of  both ranges must be aligned to the file block size.
> > > >        If this is the case, the ends of the ranges  must  also  be  so
> > > >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > > 
> > > >        The field flags control the behavior of the exchange operation.
> > > > 
> > > >            XFS_EXCHRANGE_TO_EOF
> > > >                   Ignore  the length parameter.  All bytes in file1_fd
> > > >                   from file1_offset to EOF are moved to file2_fd,  and
> > > >                   file2's  size is set to (file2_offset+(file1_length-
> > > >                   file1_offset)).  Meanwhile, all bytes in file2  from
> > > >                   file2_offset  to  EOF are moved to file1 and file1's
> > > >                   size   is   set   to    (file1_offset+(file2_length-
> > > >                   file2_offset)).
> > > > 
> > > >            XFS_EXCHRANGE_DSYNC
> > > >                   Ensure  that  all modified in-core data in both file
> > > >                   ranges and all metadata updates  pertaining  to  the
> > > >                   exchange operation are flushed to persistent storage
> > > >                   before the call returns.  Opening  either  file  de‐
> > > >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> > > >                   effect.
> > > > 
> > > >            XFS_EXCHRANGE_FILE1_WRITTEN
> > > >                   Only exchange sub-ranges of file1_fd that are  known
> > > >                   to  contain  data  written  by application software.
> > > >                   Each sub-range may be  expanded  (both  upwards  and
> > > >                   downwards)  to  align with the file allocation unit.
> > > >                   For files on the data device, this is one filesystem
> > > >                   block.   For  files  on the realtime device, this is
> > > >                   the realtime extent size.  This facility can be used
> > > >                   to  implement  fast  atomic scatter-gather writes of
> > > >                   any complexity for software-defined storage  targets
> > > >                   if  all  writes  are  aligned to the file allocation
> > > >                   unit.
> > > > 
> > > >            XFS_EXCHRANGE_DRY_RUN
> > > >                   Check the parameters and the feasibility of the  op‐
> > > >                   eration, but do not change anything.
> > > > 
> > > > RETURN VALUE
> > > >        On  error, -1 is returned, and errno is set to indicate the er‐
> > > >        ror.
> > > > 
> > > > ERRORS
> > > >        Error codes can be one of, but are not limited to, the  follow‐
> > > >        ing:
> > > > 
> > > >        EBADF  file1_fd  is not open for reading and writing or is open
> > > >               for append-only writes; or  file2_fd  is  not  open  for
> > > >               reading and writing or is open for append-only writes.
> > > > 
> > > >        EINVAL The  parameters  are  not correct for these files.  This
> > > >               error can also appear if either file  descriptor  repre‐
> > > >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> > > >               erally require the offset and  length  arguments  to  be
> > > >               aligned to the fundamental block sizes of both files.
> > > > 
> > > >        EIO    An I/O error occurred.
> > > > 
> > > >        EISDIR One of the files is a directory.
> > > > 
> > > >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> > > >               perform the operation.
> > > > 
> > > >        ENOSPC There is not enough free space  in  the  filesystem  ex‐
> > > >               change the contents safely.
> > > > 
> > > >        EOPNOTSUPP
> > > >               The filesystem does not support exchanging bytes between
> > > >               the two files.
> > > > 
> > > >        EPERM  file1_fd or file2_fd are immutable.
> > > > 
> > > >        ETXTBSY
> > > >               One of the files is a swap file.
> > > > 
> > > >        EUCLEAN
> > > >               The filesystem is corrupt.
> > > > 
> > > >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> > > >               filesystem.
> > > > 
> > > > CONFORMING TO
> > > >        This API is XFS-specific.
> > > > 
> > > > USE CASES
> > > >        Several  use  cases  are imagined for this system call.  In all
> > > >        cases, application software must coordinate updates to the file
> > > >        because the exchange is performed unconditionally.
> > > > 
> > > >        The  first  is a data storage program that wants to commit non-
> > > >        contiguous updates to a file atomically and  coordinates  write
> > > >        access  to that file.  This can be done by creating a temporary
> > > >        file, calling FICLONE(2) to share the contents, and staging the
> > > >        updates into the temporary file.  The FULL_FILES flag is recom‐
> > > >        mended for this purpose.  The temporary file can be deleted  or
> > > >        punched out afterwards.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > > 
> > > >            ioctl(temp_fd, FICLONE, fd);
> > > > 
> > > >            /* append 1MB of records */
> > > >            lseek(temp_fd, 0, SEEK_END);
> > > >            write(temp_fd, data1, 1000000);
> > > > 
> > > >            /* update record index */
> > > >            pwrite(temp_fd, data1, 600, 98765);
> > > >            pwrite(temp_fd, data2, 320, 54321);
> > > >            pwrite(temp_fd, data2, 15, 0);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .flags = XFS_EXCHRANGE_TO_EOF,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > >        The  second  is  a  software-defined  storage host (e.g. a disk
> > > >        jukebox) which implements an atomic scatter-gather  write  com‐
> > > >        mand.   Provided the exported disk's logical block size matches
> > > >        the file's allocation unit size, this can be done by creating a
> > > >        temporary file and writing the data at the appropriate offsets.
> > > >        It is recommended that the temporary file be truncated  to  the
> > > >        size  of  the  regular file before any writes are staged to the
> > > >        temporary file to avoid issues with zeroing during  EOF  exten‐
> > > >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> > > >        only the file allocation units involved  in  the  emulated  de‐
> > > >        vice's  write  command.  The temporary file should be truncated
> > > >        or punched out completely before being reused to stage  another
> > > >        write.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > >            struct stat sb;
> > > >            int blksz;
> > > > 
> > > >            fstat(fd, &sb);
> > > >            blksz = sb.st_blksize;
> > > > 
> > > >            /* land scatter gather writes between 100fsb and 500fsb */
> > > >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .file1_offset = blksz * 100,
> > > >                .file2_offset = blksz * 100,
> > > >                .length       = blksz * 400,
> > > >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > >                                XFS_EXCHRANGE_FILE1_DSYNC,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > > NOTES
> > > >        Some  filesystems may limit the amount of data or the number of
> > > >        extents that can be exchanged in a single call.
> > > > 
> > > > SEE ALSO
> > > >        ioctl(2)
> > > > 
> > > > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_commit_range - conditionally exchange the contents of
> > > >        parts of two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> > > >        mit_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes  the contents of the two ranges if file2_fd passes cer‐
> > > >        tain freshness criteria.
> > > > 
> > > >        After locking both files but before  exchanging  the  contents,
> > > >        the  supplied  file2_ino field must match file2_fd's inode num‐
> > > >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> > > >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > >        cation time and change time of file2.  If they  do  not  match,
> > > >        EBUSY will be returned.
> > > > 
> > > 
> > > Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?
> > > 
> > 
> > ctime should always change if the mtime does. An mtime update means that
> > the metadata was updated, so you also need to update the ctime. 
> 
> Exactly. :)
> 
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > > 
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > > 
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > > 
> > 
> > 
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> 
> TLDR: No.
> 
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
> 
> Dave says that about a lot of things.  I'm not willing to delay the
> online fsck project _even further_ for a bunch of vaporware that's not
> even out on linux-xfs for review.
> 
> The difference in opinion between xfs and the rest of the kernel about
> i_version is 50% of why I didn't include it here.  The other 50% is the
> part where userspace can't access it, because I do not want to saddle my
> mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
> CHANGES.

Ouch, point taken.

I just have grave concerns about using something as coarse-grained as
the  to gate changes to a file. With modern machines, a single timestamp
can represent a large number of different states of the file's contents.

Is that not a problem here?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-29 23:27     ` Darrick J. Wong
  2024-03-01 13:00       ` Amir Goldstein
@ 2024-03-01 13:31       ` Jeff Layton
  2024-03-02  2:48         ` Darrick J. Wong
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2024-03-01 13:31 UTC (permalink / raw)
  To: Darrick J. Wong, Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch

On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> > 
> > This addresses my concerns about using mtime/ctime.
> 
> Oh good! :)
> 
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> > 
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
> 
> I agree, your question about mtime/ctime:
> 
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
> 
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
> 
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
> 
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
> 
> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
> 
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
> 
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
> 
> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner?  Or if it reflects something they feel strongly needs
> change before merging.
> 
> In fairness to you, Amir, I don't know how much you've kept on top of
> that i_version vs. XFS discussion.  So I have no idea if you were aware
> of the status of that work.
> 

Sorry, I didn't mean to trigger anyone, but I do have real concerns
about any API that attempts to use timestamps to detect whether
something has changed.

We learned that lesson in NFS in the 90's. VFS timestamp resolution is
just not enough to show whether there was a change to a file -- full
stop.

I get the hand-wringing over i_version definitions and I don't care to
rehash that discussion here, but I'll point out that this is a
(proposed) XFS-private interface:

What you could do is expose the XFS change counter (the one that gets
bumped for everything, even atime updates, possibly via different
ioctl), and use that for your "freshness" check.

You'd unfortunately get false negative freshness checks after read
operations, but you shouldn't get any false positives (which is real
danger with timestamps).
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-01 13:31       ` Jeff Layton
@ 2024-03-02  2:48         ` Darrick J. Wong
  2024-03-02 12:43           ` Jeff Layton
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2024-03-02  2:48 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > make it an opaque u64 array and require the userspace program to call
> > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > on a definition for i_version then we can use that; for now we'll just
> > > > use mtime/ctime like the old swapext ioctl.
> > > 
> > > This addresses my concerns about using mtime/ctime.
> > 
> > Oh good! :)
> > 
> > > I have to say, Darrick, that I think that referring to this concern as
> > > bikeshedding is not being honest.
> > > 
> > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > review comments, but I think the question about using mtime/ctime in
> > > this new API was not out of place
> > 
> > I agree, your question about mtime/ctime:
> > 
> > "Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?"
> > 
> > was a very good question.  But perhaps that statement referred to the
> > other part of that thread.
> > 
> > >                                   and I think that making the freshness
> > > data opaque is better for everyone in the long run and hopefully, this will
> > > help you move to the things you care about faster.
> > 
> > I wish you'd suggested an opaque blob that the fs can lay out however it
> > wants instead of suggesting specifically the change cookie.  I'm very
> > much ok with an opaque freshness blob that allows future flexibility in
> > how we define the blob's contents.
> > 
> > I was however very upset about the Jeff's suggestion of using i_version.
> > I apologize for using all caps in that reply, and snarling about it in
> > the commit message here.  The final version of this patch will not have
> > that.
> > 
> > That said, I don't think it is at all helpful to suggest using a file
> > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > were a clever idea, regrettably reverted.  As far as I could tell when I
> > wrote my reply, neither had NFS implemented a better behavior and
> > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > patchset to fix all the resulting issues in XFS.
> >
> > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > internal" made me think "OH $deity, they wants me to do that work
> > too???"
> > 
> > A better way to have woreded that might've been "How about switching
> > this to a fs-determined structure so that we can switch the freshness
> > check to i_version when that's fully working on XFS?"
> > 
> > The problem I have with reading patch review emails is that I can't
> > easily tell whether an author's suggestion is being made in a casual
> > offhand manner?  Or if it reflects something they feel strongly needs
> > change before merging.
> > 
> > In fairness to you, Amir, I don't know how much you've kept on top of
> > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > of the status of that work.
> > 
> 
> Sorry, I didn't mean to trigger anyone, but I do have real concerns
> about any API that attempts to use timestamps to detect whether
> something has changed.
> 
> We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> just not enough to show whether there was a change to a file -- full
> stop.
> 
> I get the hand-wringing over i_version definitions and I don't care to
> rehash that discussion here, but I'll point out that this is a
> (proposed) XFS-private interface:
> 
> What you could do is expose the XFS change counter (the one that gets
> bumped for everything, even atime updates, possibly via different
> ioctl), and use that for your "freshness" check.
> 
> You'd unfortunately get false negative freshness checks after read
> operations, but you shouldn't get any false positives (which is real
> danger with timestamps).

I don't see how would that work for this usecase?  You have to sample
file2 before reflinking file2's contents to file1, writing the changes
to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
inode flag on file2 will trigger an iversion update even though it won't
change mtime or ctime.  The COMMIT then fails due to the inode flags
change.

Worse yet, applications aren't going to know if a particular access is
actually the one that will trigger an atime update.  So this will just
fail unpredictably.

If iversion was purely a write counter then I would switch the freshness
implementation to use it.  But it's not, and I know this to be true
because I tried that and could not get COMMIT_RANGE to work reliably.
I suppose the advantage of the blob thing is that we actually /can/
switch over whenever it's ready.

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-02  2:48         ` Darrick J. Wong
@ 2024-03-02 12:43           ` Jeff Layton
  2024-03-07 23:25             ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Layton @ 2024-03-02 12:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > 
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > make it an opaque u64 array and require the userspace program to call
> > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > use mtime/ctime like the old swapext ioctl.
> > > > 
> > > > This addresses my concerns about using mtime/ctime.
> > > 
> > > Oh good! :)
> > > 
> > > > I have to say, Darrick, that I think that referring to this concern as
> > > > bikeshedding is not being honest.
> > > > 
> > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > review comments, but I think the question about using mtime/ctime in
> > > > this new API was not out of place
> > > 
> > > I agree, your question about mtime/ctime:
> > > 
> > > "Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?"
> > > 
> > > was a very good question.  But perhaps that statement referred to the
> > > other part of that thread.
> > > 
> > > >                                   and I think that making the freshness
> > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > help you move to the things you care about faster.
> > > 
> > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > wants instead of suggesting specifically the change cookie.  I'm very
> > > much ok with an opaque freshness blob that allows future flexibility in
> > > how we define the blob's contents.
> > > 
> > > I was however very upset about the Jeff's suggestion of using i_version.
> > > I apologize for using all caps in that reply, and snarling about it in
> > > the commit message here.  The final version of this patch will not have
> > > that.
> > > 
> > > That said, I don't think it is at all helpful to suggest using a file
> > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > wrote my reply, neither had NFS implemented a better behavior and
> > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > patchset to fix all the resulting issues in XFS.
> > > 
> > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > internal" made me think "OH $deity, they wants me to do that work
> > > too???"
> > > 
> > > A better way to have woreded that might've been "How about switching
> > > this to a fs-determined structure so that we can switch the freshness
> > > check to i_version when that's fully working on XFS?"
> > > 
> > > The problem I have with reading patch review emails is that I can't
> > > easily tell whether an author's suggestion is being made in a casual
> > > offhand manner?  Or if it reflects something they feel strongly needs
> > > change before merging.
> > > 
> > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > of the status of that work.
> > > 
> > 
> > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > about any API that attempts to use timestamps to detect whether
> > something has changed.
> > 
> > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > just not enough to show whether there was a change to a file -- full
> > stop.
> > 
> > I get the hand-wringing over i_version definitions and I don't care to
> > rehash that discussion here, but I'll point out that this is a
> > (proposed) XFS-private interface:
> > 
> > What you could do is expose the XFS change counter (the one that gets
> > bumped for everything, even atime updates, possibly via different
> > ioctl), and use that for your "freshness" check.
> > 
> > You'd unfortunately get false negative freshness checks after read
> > operations, but you shouldn't get any false positives (which is real
> > danger with timestamps).
> 
> I don't see how would that work for this usecase?  You have to sample
> file2 before reflinking file2's contents to file1, writing the changes
> to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> inode flag on file2 will trigger an iversion update even though it won't
> change mtime or ctime.  The COMMIT then fails due to the inode flags
> change.
> 
> Worse yet, applications aren't going to know if a particular access is
> actually the one that will trigger an atime update.  So this will just
> fail unpredictably.
> 
> If iversion was purely a write counter then I would switch the freshness
> implementation to use it.  But it's not, and I know this to be true
> because I tried that and could not get COMMIT_RANGE to work reliably.
> I suppose the advantage of the blob thing is that we actually /can/
> switch over whenever it's ready.
> 

Yeah, that's the other part -- you have to be willing to redrive the I/O
every time the freshness check fails, which can get expensive depending
on how active the file is. Again this is an XFS interface, so I don't
really have a dog in this fight. If you think timestamps are good
enough, then so be it.

All I can do is mention that it has been our experience in the NFS world
that relying on timestamps like this will eventually lead to data
corruption. The race conditions may be tight, and much of the time the
race may be benign, but if you do this enough you'll eventually get
bitten, and end up exchanging data when you shouldn't have.

All of that said, I think this is great discussion fodder for LSF this
year. I feel like the time is right to consider these sorts of
interfaces that do synchronized I/O without locking. I've already
proposed a discussion around the state of the i_version counter, so
maybe we can chat about it then?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-02 12:43           ` Jeff Layton
@ 2024-03-07 23:25             ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-03-07 23:25 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Sat, Mar 02, 2024 at 07:43:53AM -0500, Jeff Layton wrote:
> On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> > On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > 
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > > make it an opaque u64 array and require the userspace program to call
> > > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > > use mtime/ctime like the old swapext ioctl.
> > > > > 
> > > > > This addresses my concerns about using mtime/ctime.
> > > > 
> > > > Oh good! :)
> > > > 
> > > > > I have to say, Darrick, that I think that referring to this concern as
> > > > > bikeshedding is not being honest.
> > > > > 
> > > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > > review comments, but I think the question about using mtime/ctime in
> > > > > this new API was not out of place
> > > > 
> > > > I agree, your question about mtime/ctime:
> > > > 
> > > > "Maybe a stupid question, but under which circumstances would mtime
> > > > change and ctime not change? Why are both needed?"
> > > > 
> > > > was a very good question.  But perhaps that statement referred to the
> > > > other part of that thread.
> > > > 
> > > > >                                   and I think that making the freshness
> > > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > > help you move to the things you care about faster.
> > > > 
> > > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > > wants instead of suggesting specifically the change cookie.  I'm very
> > > > much ok with an opaque freshness blob that allows future flexibility in
> > > > how we define the blob's contents.
> > > > 
> > > > I was however very upset about the Jeff's suggestion of using i_version.
> > > > I apologize for using all caps in that reply, and snarling about it in
> > > > the commit message here.  The final version of this patch will not have
> > > > that.
> > > > 
> > > > That said, I don't think it is at all helpful to suggest using a file
> > > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > > wrote my reply, neither had NFS implemented a better behavior and
> > > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > > patchset to fix all the resulting issues in XFS.
> > > > 
> > > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > > internal" made me think "OH $deity, they wants me to do that work
> > > > too???"
> > > > 
> > > > A better way to have woreded that might've been "How about switching
> > > > this to a fs-determined structure so that we can switch the freshness
> > > > check to i_version when that's fully working on XFS?"
> > > > 
> > > > The problem I have with reading patch review emails is that I can't
> > > > easily tell whether an author's suggestion is being made in a casual
> > > > offhand manner?  Or if it reflects something they feel strongly needs
> > > > change before merging.
> > > > 
> > > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > > of the status of that work.
> > > > 
> > > 
> > > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > > about any API that attempts to use timestamps to detect whether
> > > something has changed.
> > > 
> > > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > > just not enough to show whether there was a change to a file -- full
> > > stop.
> > > 
> > > I get the hand-wringing over i_version definitions and I don't care to
> > > rehash that discussion here, but I'll point out that this is a
> > > (proposed) XFS-private interface:
> > > 
> > > What you could do is expose the XFS change counter (the one that gets
> > > bumped for everything, even atime updates, possibly via different
> > > ioctl), and use that for your "freshness" check.
> > > 
> > > You'd unfortunately get false negative freshness checks after read
> > > operations, but you shouldn't get any false positives (which is real
> > > danger with timestamps).
> > 
> > I don't see how would that work for this usecase?  You have to sample
> > file2 before reflinking file2's contents to file1, writing the changes
> > to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> > inode flag on file2 will trigger an iversion update even though it won't
> > change mtime or ctime.  The COMMIT then fails due to the inode flags
> > change.
> > 
> > Worse yet, applications aren't going to know if a particular access is
> > actually the one that will trigger an atime update.  So this will just
> > fail unpredictably.
> > 
> > If iversion was purely a write counter then I would switch the freshness
> > implementation to use it.  But it's not, and I know this to be true
> > because I tried that and could not get COMMIT_RANGE to work reliably.
> > I suppose the advantage of the blob thing is that we actually /can/
> > switch over whenever it's ready.
> > 
> 
> Yeah, that's the other part -- you have to be willing to redrive the I/O
> every time the freshness check fails, which can get expensive depending
> on how active the file is. Again this is an XFS interface, so I don't
> really have a dog in this fight. If you think timestamps are good
> enough, then so be it.
> 
> All I can do is mention that it has been our experience in the NFS world
> that relying on timestamps like this will eventually lead to data
> corruption. The race conditions may be tight, and much of the time the
> race may be benign, but if you do this enough you'll eventually get
> bitten, and end up exchanging data when you shouldn't have.
> 
> All of that said, I think this is great discussion fodder for LSF this
> year. I feel like the time is right to consider these sorts of
> interfaces that do synchronized I/O without locking. I've already
> proposed a discussion around the state of the i_version counter, so
> maybe we can chat about it then?

Yes.  I've gotten an invitation, so corporate approval and dumb injuries
notwithstanding, I'll be there this year. :)

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 01/14] vfs: export remap and write check helpers
  2024-03-30  0:57 [PATCHSET v30.1] " Darrick J. Wong
@ 2024-03-30  0:57 ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-03-30  0:57 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, Christoph Hellwig, linux-fsdevel, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c    |    1 +
 fs/remap_range.c   |    4 ++--
 include/linux/fs.h |    1 +
 3 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index d4c036e82b6c3..85c096f2c0d06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(generic_write_check_limits);
 
 /* Like generic_write_checks(), but takes size of write instead of iter. */
 int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index de07f978ce3eb..28246dfc84851 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
-			     bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
 {
 	int mask = write ? MAY_WRITE : MAY_READ;
 	loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 
 	return fsnotify_file_area_perm(file, mask, &pos, len);
 }
+EXPORT_SYMBOL_GPL(remap_verify_area);
 
 /*
  * Ensure that we don't remap a partial EOF block in the middle of something
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00fc429b0af0f..9cbec9750d86b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2117,6 +2117,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
 int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    loff_t *len, unsigned int remap_flags,


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 01/14] vfs: export remap and write check helpers
  2024-04-09  3:34 [PATCHSET v30.2] xfs: atomic file content exchanges Darrick J. Wong
@ 2024-04-09  3:34 ` Darrick J. Wong
  0 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2024-04-09  3:34 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, Christoph Hellwig, hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c    |    1 +
 fs/remap_range.c   |    4 ++--
 include/linux/fs.h |    1 +
 3 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index d4c036e82b6c3..85c096f2c0d06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(generic_write_check_limits);
 
 /* Like generic_write_checks(), but takes size of write instead of iter. */
 int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index de07f978ce3eb..28246dfc84851 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
-			     bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
 {
 	int mask = write ? MAY_WRITE : MAY_READ;
 	loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 
 	return fsnotify_file_area_perm(file, mask, &pos, len);
 }
+EXPORT_SYMBOL_GPL(remap_verify_area);
 
 /*
  * Ensure that we don't remap a partial EOF block in the middle of something
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8dfd53b52744a..0835faeebe7b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2119,6 +2119,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
 int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    loff_t *len, unsigned int remap_flags,


^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-04-09  3:34 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
2024-02-28 15:40   ` Christoph Hellwig
2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
2024-02-27 10:53   ` Jeff Layton
2024-02-27 16:06     ` Darrick J. Wong
2024-03-01 13:16       ` Jeff Layton
2024-02-27 23:46     ` Dave Chinner
2024-02-28 10:30       ` Jeff Layton
2024-02-28 10:58         ` Amir Goldstein
2024-02-28 11:01           ` Jeff Layton
2024-02-27 15:45   ` Darrick J. Wong
2024-02-27 16:58     ` Amir Goldstein
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
2024-02-27 18:52   ` Amir Goldstein
2024-02-29 23:27     ` Darrick J. Wong
2024-03-01 13:00       ` Amir Goldstein
2024-03-01 13:31       ` Jeff Layton
2024-03-02  2:48         ` Darrick J. Wong
2024-03-02 12:43           ` Jeff Layton
2024-03-07 23:25             ` Darrick J. Wong
2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
2024-02-29 20:18   ` Darrick J. Wong
2024-02-29 22:43     ` Colin Walters
2024-03-01  0:03       ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2024-03-30  0:57 [PATCHSET v30.1] " Darrick J. Wong
2024-03-30  0:57 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
2024-04-09  3:34 [PATCHSET v30.2] xfs: atomic file content exchanges Darrick J. Wong
2024-04-09  3:34 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).