linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v31.0 02/10] xfs: atomic file content commits
       [not found] <20240822235230.GJ6043@frogsfrogsfrogs>
@ 2024-08-22 23:56 ` Darrick J. Wong
  2024-08-23  0:01   ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
  2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:56 UTC (permalink / raw)
  To: djwong; +Cc: hch, linux-xfs, linux-fsdevel

Hi all,

This series creates XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE ioctls
to perform the exchange only if the target file has not been changed
since a given sampling point.

This new functionality uses the mechanism underlying EXCHANGE_RANGE to
stage and commit file updates such that reader programs will see either
the old contents or the new contents in their entirety, with no chance
of torn writes.  A successful call completion guarantees that the new
contents will be seen even if the system fails.  The pair of ioctls
allows userspace to perform what amounts to a compare and exchange
operation on entire file contents.

Note that there are ongoing arguments in the community about how best to
implement some sort of file data write counter that nfsd could also use
to signal invalidations to clients.  Until such a thing is implemented,
this patch will rely on ctime/mtime updates.

Here are the proposed manual pages:

IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)

NAME
       ioctl_xfs_start_commit  -  prepare  to exchange the contents of
       two files ioctl_xfs_commit_range - conditionally  exchange  the
       contents of parts of two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs.h>

       int  ioctl(int  file2_fd, XFS_IOC_START_COMMIT, struct xfs_com‐
       mit_range *arg);

       int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
       mit_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes  the contents of the two ranges if file2_fd passes cer‐
       tain freshness criteria.

       Before exchanging the  contents,  the  program  must  call  the
       XFS_IOC_START_COMMIT   ioctl   to  sample  freshness  data  for
       file2_fd.  If the sampled metadata  does  not  match  the  file
       metadata  at  commit  time,  XFS_IOC_COMMIT_RANGE  will  return
       EBUSY.

       Exchanges are atomic with regards  to  concurrent  file  opera‐
       tions.   Implementations must guarantee that readers see either
       the old contents or the new contents in their entirety, even if
       the system fails.

       The  system  call  parameters are conveyed in structures of the
       following form:

           struct xfs_commit_range {
               __s32    file1_fd;
               __u32    pad;
               __u64    file1_offset;
               __u64    file2_offset;
               __u64    length;
               __u64    flags;
               __u64    file2_freshness[5];
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       The field file2_freshness is an opaque field whose contents are
       determined  by  the  kernel.  These file attributes are used to
       confirm that file2_fd has not changed by another  thread  since
       the current thread began staging its own update.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           XFS_EXCHANGE_RANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHANGE_RANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHANGE_RANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EBUSY  The file2 inode number and timestamps  supplied  do  not
              match file2_fd.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space  in  the  filesystem  ex‐
              change the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system call.  Coordina‐
       tion between multiple threads is performed by the kernel.

       The first is a filesystem defragmenter, which copies  the  con‐
       tents  of  a  file into another file and wishes to exchange the
       space mappings of the two files,  provided  that  the  original
       file has not changed.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           struct xfs_commit_range args = {
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           ioctl(fd, XFS_IOC_START_COMMIT, &args);
           fstat(fd, &sb);

           /* make a fresh copy of the file with terrible alignment to avoid reflink */
           clone_file_range(fd, NULL, temp_fd, NULL, 1, 0);
           clone_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed while defrag was underway
");

       The  second is a data storage program that wants to commit non-
       contiguous updates to a file atomically.  This  program  cannot
       coordinate updates to the file and therefore relies on the ker‐
       nel to reject the COMMIT_RANGE command if the file has been up‐
       dated  by  someone else.  This can be done by creating a tempo‐
       rary file, calling FICLONE(2) to share the contents, and  stag‐
       ing  the  updates into the temporary file.  The FULL_FILES flag
       is recommended for this purpose.  The  temporary  file  can  be
       deleted or punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct xfs_commit_range args = {
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           ioctl(fd, XFS_IOC_START_COMMIT, &args);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed before commit; will roll back
");

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-18     IOCTL-XFS-COMMIT-RANGE(2)

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-commits

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-commits

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-commits
---
Commits in this patchset:
 * xfs: introduce new file range commit ioctls
---
 fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
 fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |   16 +++++
 fs/xfs/xfs_ioctl.c     |    4 +
 fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
 5 files changed, 243 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups
       [not found] <20240822235230.GJ6043@frogsfrogsfrogs>
  2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-22 23:58 ` Darrick J. Wong
  2024-08-23  0:21   ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:58 UTC (permalink / raw)
  To: djwong; +Cc: Christoph Hellwig, linux-fsdevel, hch, linux-xfs

Hi all,

Having cleaned up the rtbitmap code and fixed various weird bugs in the
allocator, now we want to do some more cleanups to the rt free space management
code to get it ready for the introduction of allocation groups.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rtgroups-prep

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rtgroups-prep
---
Commits in this patchset:
 * iomap: add a merge boundary flag
---
 fs/iomap/buffered-io.c |    6 ++++++
 include/linux/iomap.h  |    4 ++++
 2 files changed, 10 insertions(+)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-23  0:01   ` Darrick J. Wong
  2024-08-23  4:12     ` Christoph Hellwig
  2024-08-24  6:29     ` [PATCH v31.0.1 " Darrick J. Wong
  0 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23  0:01 UTC (permalink / raw)
  To: djwong; +Cc: hch, linux-xfs, linux-fsdevel

From: Darrick J. Wong <djwong@kernel.org>

This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point.  The start-commit ioctl performs the sampling
of file attributes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
 fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |   16 +++++
 fs/xfs/xfs_ioctl.c     |    4 +
 fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
 5 files changed, 243 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
 	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
 };
 
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call.  The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+	__s32		file1_fd;
+	__u32		pad;		/* must be zeroes */
+	__u64		file1_offset;	/* file1 offset, bytes */
+	__u64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
+
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[6];
+};
+
 /*
  * Exchange file data all the way to the ends of both files, and then exchange
  * the file sizes.  This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
 	return error;
 }
 
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip2)
+{
+	struct inode			*inode2 = VFS_I(ip2);
+	struct timespec64		ctime = inode_get_ctime(inode2);
+	struct timespec64		mtime = inode_get_mtime(inode2);
+
+	trace_xfs_exchrange_freshness(fxr, ip2);
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if (fxr->file2_ino != ip2->i_ino ||
+	    fxr->file2_gen != inode2->i_generation ||
+	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+	    !timespec64_equal(&fxr->file2_mtime, &mtime))
+		return -EBUSY;
+
+	return 0;
+}
+
 #define QRETRY_IP1	(0x1)
 #define QRETRY_IP2	(0x2)
 
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
 	if (error || fxr->length == 0)
 		return error;
 
+	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+		error = xfs_exchrange_check_freshness(fxr, ip2);
+		if (error)
+			return error;
+	}
+
 	/* Attach dquots to both inodes before changing block maps. */
 	error = xfs_qm_dqattach(ip2);
 	if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
 	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
 		return -EXDEV;
 
-	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
 		return -EINVAL;
 
 	/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
 	fdput(file1);
 	return error;
 }
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	xfs_fsid_t	fsid;		/* m_fixedfsid */
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u32		file2_gen;	/* inode generation */
+	__u32		magic;		/* zero */
+};
+#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_commit_range		args = { };
+	struct timespec64		ts;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+	struct inode			*inode2 = file_inode(file);
+	struct xfs_inode		*ip2 = XFS_I(inode2);
+	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
+						    XFS_MMAPLOCK_SHARED |
+						    XFS_ILOCK_SHARED;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+	xfs_ilock(ip2, lockflags);
+	ts = inode_get_ctime(inode2);
+	kern_f->file2_ctime		= ts.tv_sec;
+	kern_f->file2_ctime_nsec	= ts.tv_nsec;
+	ts = inode_get_mtime(inode2);
+	kern_f->file2_mtime		= ts.tv_sec;
+	kern_f->file2_mtime_nsec	= ts.tv_nsec;
+	kern_f->file2_ino		= ip2->i_ino;
+	kern_f->file2_gen		= inode2->i_generation;
+	kern_f->magic			= XCR_FRESH_MAGIC;
+	xfs_iunlock(ip2, lockflags);
+
+	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
+	struct xfs_mount		*mp = ip2->i_mount;
+	struct fd			file1;
+	int				error;
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+		return -EINVAL;
+	if (kern_f->magic != XCR_FRESH_MAGIC)
+		return -EBUSY;
+	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+		return -EBUSY;
+
+	fxr.file1_offset	= args.file1_offset;
+	fxr.file2_offset	= args.file2_offset;
+	fxr.length		= args.length;
+	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_gen		= kern_f->file2_gen;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+	fxr.file1 = file1.file;
+
+	error = xfs_exchange_range(&fxr);
+	fdput(file1);
+	return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
 
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
+
 #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
-					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
 
 struct xfs_exchrange {
 	struct file		*file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
 	u64			length;
 
 	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
+
+	/* file2 metadata for freshness checks */
+	u64			file2_ino;
+	struct timespec64	file2_mtime;
+	struct timespec64	file2_ctime;
+	u32			file2_gen;
 };
 
 long xfs_ioc_exchange_range(struct file *file,
 		struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+		struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+		struct xfs_commit_range __user	*argp);
 
 struct xfs_exchmaps_req;
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
+	case XFS_IOC_COMMIT_RANGE:
+		return xfs_ioc_commit_range(filp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
 	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
 	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
 	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
-	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
+	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
+	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
 
 /* file exchange-range tracepoint class */
 DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
 
+TRACE_EVENT(xfs_exchrange_freshness,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip2_ino)
+		__field(long long, ip2_mtime)
+		__field(long long, ip2_ctime)
+		__field(int, ip2_mtime_nsec)
+		__field(int, ip2_ctime_nsec)
+
+		__field(xfs_ino_t, file2_ino)
+		__field(long long, file2_mtime)
+		__field(long long, file2_ctime)
+		__field(int, file2_mtime_nsec)
+		__field(int, file2_ctime_nsec)
+	),
+	TP_fast_assign(
+		struct timespec64	ts64;
+		struct inode		*inode2 = VFS_I(ip2);
+
+		__entry->dev = inode2->i_sb->s_dev;
+		__entry->ip2_ino = ip2->i_ino;
+
+		ts64 = inode_get_ctime(inode2);
+		__entry->ip2_ctime = ts64.tv_sec;
+		__entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+		ts64 = inode_get_mtime(inode2);
+		__entry->ip2_mtime = ts64.tv_sec;
+		__entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+		__entry->file2_ino = fxr->file2_ino;
+		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
+		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
+		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ip2_ino,
+		  __entry->ip2_mtime,
+		  __entry->ip2_mtime_nsec,
+		  __entry->ip2_ctime,
+		  __entry->ip2_ctime_nsec,
+		  __entry->file2_ino,
+		  __entry->file2_mtime,
+		  __entry->file2_mtime_nsec,
+		  __entry->file2_ctime,
+		  __entry->file2_ctime_nsec)
+);
+
 TRACE_EVENT(xfs_exchmaps_overhead,
 	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
 		 unsigned long long rmapbt_blocks),


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 1/1] iomap: add a merge boundary flag
  2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
@ 2024-08-23  0:21   ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23  0:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-fsdevel, Christoph Hellwig, hch, linux-xfs

From: Christoph Hellwig <hch@lst.de>

File systems might have boundaries over which merges aren't possible.
In fact these are very common, although most of the time some kind of
header at the beginning of this region (e.g. XFS alloation groups, ext4
block groups) automatically create a merge barrier.  But if that is
not present, say for a device purely used for data we need to manually
communicate that to iomap.

Add a IOMAP_F_BOUNDARY flag to never merge I/O into a previous mapping.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c |    6 ++++++
 include/linux/iomap.h  |    4 ++++
 2 files changed, 10 insertions(+)


diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc..685136a57cbf7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1603,6 +1603,8 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
 {
 	if (ioend->io_bio.bi_status != next->io_bio.bi_status)
 		return false;
+	if (next->io_flags & IOMAP_F_BOUNDARY)
+		return false;
 	if ((ioend->io_flags & IOMAP_F_SHARED) ^
 	    (next->io_flags & IOMAP_F_SHARED))
 		return false;
@@ -1722,6 +1724,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 	INIT_LIST_HEAD(&ioend->io_list);
 	ioend->io_type = wpc->iomap.type;
 	ioend->io_flags = wpc->iomap.flags;
+	if (pos > wpc->iomap.offset)
+		wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
 	ioend->io_inode = inode;
 	ioend->io_size = 0;
 	ioend->io_offset = pos;
@@ -1733,6 +1737,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
 
 static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
 {
+	if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
+		return false;
 	if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
 	    (wpc->ioend->io_flags & IOMAP_F_SHARED))
 		return false;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d1..ba3c9e5124637 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -53,6 +53,9 @@ struct vm_fault;
  *
  * IOMAP_F_XATTR indicates that the iomap is for an extended attribute extent
  * rather than a file data extent.
+ *
+ * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
+ * never be merged with the mapping before it.
  */
 #define IOMAP_F_NEW		(1U << 0)
 #define IOMAP_F_DIRTY		(1U << 1)
@@ -64,6 +67,7 @@ struct vm_fault;
 #define IOMAP_F_BUFFER_HEAD	0
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
+#define IOMAP_F_BOUNDARY	(1U << 6)
 
 /*
  * Flags set by the core iomap code during operations:


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-23  0:01   ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
@ 2024-08-23  4:12     ` Christoph Hellwig
  2024-08-23 13:20       ` Jeff Layton
  2024-08-24  6:29     ` [PATCH v31.0.1 " Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-23  4:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, Jeff Layton

On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point.  The start-commit ioctl performs the sampling
> of file attributes.

The code itself looks simply enough now, but how do we guarantee
that ctime actually works as a full change count and not just by
chance here?


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-23  4:12     ` Christoph Hellwig
@ 2024-08-23 13:20       ` Jeff Layton
  2024-08-23 17:41         ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2024-08-23 13:20 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel

On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > This patch introduces two more new ioctls to manage atomic updates to
> > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > does, but with the additional requirement that file2 cannot have changed
> > since some sampling point.  The start-commit ioctl performs the sampling
> > of file attributes.
> 
> The code itself looks simply enough now, but how do we guarantee
> that ctime actually works as a full change count and not just by
> chance here?
> 

With current mainline kernels it won't, but the updated multigrain
timestamp series is in linux-next and is slated to go into v6.12. At
that point it should be fine for this purpose.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-23 13:20       ` Jeff Layton
@ 2024-08-23 17:41         ` Darrick J. Wong
  2024-08-23 19:15           ` Jeff Layton
  2024-08-24  3:29           ` Christoph Hellwig
  0 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel

On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > This patch introduces two more new ioctls to manage atomic updates to
> > > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> > > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > > does, but with the additional requirement that file2 cannot have changed
> > > since some sampling point.  The start-commit ioctl performs the sampling
> > > of file attributes.
> > 
> > The code itself looks simply enough now, but how do we guarantee
> > that ctime actually works as a full change count and not just by
> > chance here?
> > 
> 
> With current mainline kernels it won't, but the updated multigrain
> timestamp series is in linux-next and is slated to go into v6.12. At
> that point it should be fine for this purpose.

<nod> If these both get merged for 6.12, I think the appropriate port
for this patch is to change xfs_ioc_start_commit to do:

	struct kstat	kstat;

	fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
	kern_f->file2_ctime		= kstat.ctime.tv_sec;
	kern_f->file2_ctime_nsec	= kstat.ctime.tv_nsec;
	kern_f->file2_mtime		= kstat.mtime.tv_sec;
	kern_f->file2_mtime_nsec	= kstat.mtime.tv_nsec;

instead of open-coding the inode_get_[cm]time calls.  The entire
exchangerange feature is still marked experimental, so I didn't think it
was worth rebasing my entire dev branch on the multigrain timestamp
redux series; we can just fix it later.

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-23 17:41         ` Darrick J. Wong
@ 2024-08-23 19:15           ` Jeff Layton
  2024-08-24  3:29           ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2024-08-23 19:15 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel

On Fri, 2024-08-23 at 10:41 -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> > On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > This patch introduces two more new ioctls to manage atomic
> > > > updates to
> > > > file contents -- XFS_IOC_START_COMMIT and
> > > > XFS_IOC_COMMIT_RANGE.  The
> > > > commit mechanism here is exactly the same as what
> > > > XFS_IOC_EXCHANGE_RANGE
> > > > does, but with the additional requirement that file2 cannot
> > > > have changed
> > > > since some sampling point.  The start-commit ioctl performs the
> > > > sampling
> > > > of file attributes.
> > > 
> > > The code itself looks simply enough now, but how do we guarantee
> > > that ctime actually works as a full change count and not just by
> > > chance here?
> > > 
> > 
> > With current mainline kernels it won't, but the updated multigrain
> > timestamp series is in linux-next and is slated to go into v6.12.
> > At
> > that point it should be fine for this purpose.
> 
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
> 
> 	struct kstat	kstat;
> 
> 	fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME,
> XFS_I(ip2));
> 	kern_f->file2_ctime		= kstat.ctime.tv_sec;
> 	kern_f->file2_ctime_nsec	= kstat.ctime.tv_nsec;
> 	kern_f->file2_mtime		= kstat.mtime.tv_sec;
> 	kern_f->file2_mtime_nsec	= kstat.mtime.tv_nsec;
> 

Yep, that's exactly what you'd want to do.

> instead of open-coding the inode_get_[cm]time calls.  The entire
> exchangerange feature is still marked experimental, so I didn't think
> it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.
> 

Sounds good.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-23 17:41         ` Darrick J. Wong
  2024-08-23 19:15           ` Jeff Layton
@ 2024-08-24  3:29           ` Christoph Hellwig
  2024-08-24  4:46             ` Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-24  3:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Layton, Christoph Hellwig, hch, linux-xfs, linux-fsdevel

On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
> 
> 	struct kstat	kstat;
> 
> 	fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> 	kern_f->file2_ctime		= kstat.ctime.tv_sec;
> 	kern_f->file2_ctime_nsec	= kstat.ctime.tv_nsec;
> 	kern_f->file2_mtime		= kstat.mtime.tv_sec;
> 	kern_f->file2_mtime_nsec	= kstat.mtime.tv_nsec;
> 
> instead of open-coding the inode_get_[cm]time calls.  The entire
> exchangerange feature is still marked experimental, so I didn't think it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.

But the commit log could really note this dependency.  This will be
especially useful for backports, but also for anyone reading through
code history.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-24  3:29           ` Christoph Hellwig
@ 2024-08-24  4:46             ` Darrick J. Wong
  2024-08-24  4:48               ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-24  4:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jeff Layton, hch, linux-xfs, linux-fsdevel

On Fri, Aug 23, 2024 at 08:29:18PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> > <nod> If these both get merged for 6.12, I think the appropriate port
> > for this patch is to change xfs_ioc_start_commit to do:
> > 
> > 	struct kstat	kstat;
> > 
> > 	fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> > 	kern_f->file2_ctime		= kstat.ctime.tv_sec;
> > 	kern_f->file2_ctime_nsec	= kstat.ctime.tv_nsec;
> > 	kern_f->file2_mtime		= kstat.mtime.tv_sec;
> > 	kern_f->file2_mtime_nsec	= kstat.mtime.tv_nsec;
> > 
> > instead of open-coding the inode_get_[cm]time calls.  The entire
> > exchangerange feature is still marked experimental, so I didn't think it
> > was worth rebasing my entire dev branch on the multigrain timestamp
> > redux series; we can just fix it later.
> 
> But the commit log could really note this dependency.  This will be
> especially useful for backports, but also for anyone reading through
> code history.

Ok, how about this for a commit message:

"This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point.  The start-commit ioctl performs the sampling
of file attributes.

"Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced in that release by Jeff
Layton, it's now possible to update ctime such that this does not
happen.

"It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps."

Will that pass muster?

--D

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-08-24  4:46             ` Darrick J. Wong
@ 2024-08-24  4:48               ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-24  4:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Jeff Layton, hch, linux-xfs, linux-fsdevel

On Fri, Aug 23, 2024 at 09:46:43PM -0700, Darrick J. Wong wrote:
> "It is critical, then, that this patch must not be backported to any
> kernel that does not support fine-grained file change timestamps."
> 
> Will that pass muster?

I'd drop that last sentence as the previous part should be clear
enough.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
  2024-08-23  0:01   ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
  2024-08-23  4:12     ` Christoph Hellwig
@ 2024-08-24  6:29     ` Darrick J. Wong
  2024-08-24 12:11       ` Jeff Layton
  2024-08-25  4:52       ` Christoph Hellwig
  1 sibling, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-24  6:29 UTC (permalink / raw)
  To: hch; +Cc: linux-xfs, linux-fsdevel, jlayton

From: Darrick J. Wong <djwong@kernel.org>

This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point.  The start-commit ioctl performs the sampling
of file attributes.

Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
 fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |   16 +++++
 fs/xfs/xfs_ioctl.c     |    4 +
 fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
 5 files changed, 243 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
 	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
 };
 
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call.  The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+	__s32		file1_fd;
+	__u32		pad;		/* must be zeroes */
+	__u64		file1_offset;	/* file1 offset, bytes */
+	__u64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
+
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[6];
+};
+
 /*
  * Exchange file data all the way to the ends of both files, and then exchange
  * the file sizes.  This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
 	return error;
 }
 
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip2)
+{
+	struct inode			*inode2 = VFS_I(ip2);
+	struct timespec64		ctime = inode_get_ctime(inode2);
+	struct timespec64		mtime = inode_get_mtime(inode2);
+
+	trace_xfs_exchrange_freshness(fxr, ip2);
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if (fxr->file2_ino != ip2->i_ino ||
+	    fxr->file2_gen != inode2->i_generation ||
+	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+	    !timespec64_equal(&fxr->file2_mtime, &mtime))
+		return -EBUSY;
+
+	return 0;
+}
+
 #define QRETRY_IP1	(0x1)
 #define QRETRY_IP2	(0x2)
 
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
 	if (error || fxr->length == 0)
 		return error;
 
+	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+		error = xfs_exchrange_check_freshness(fxr, ip2);
+		if (error)
+			return error;
+	}
+
 	/* Attach dquots to both inodes before changing block maps. */
 	error = xfs_qm_dqattach(ip2);
 	if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
 	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
 		return -EXDEV;
 
-	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
 		return -EINVAL;
 
 	/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
 	fdput(file1);
 	return error;
 }
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	xfs_fsid_t	fsid;		/* m_fixedfsid */
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u32		file2_gen;	/* inode generation */
+	__u32		magic;		/* zero */
+};
+#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_commit_range		args = { };
+	struct timespec64		ts;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+	struct inode			*inode2 = file_inode(file);
+	struct xfs_inode		*ip2 = XFS_I(inode2);
+	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
+						    XFS_MMAPLOCK_SHARED |
+						    XFS_ILOCK_SHARED;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+	xfs_ilock(ip2, lockflags);
+	ts = inode_get_ctime(inode2);
+	kern_f->file2_ctime		= ts.tv_sec;
+	kern_f->file2_ctime_nsec	= ts.tv_nsec;
+	ts = inode_get_mtime(inode2);
+	kern_f->file2_mtime		= ts.tv_sec;
+	kern_f->file2_mtime_nsec	= ts.tv_nsec;
+	kern_f->file2_ino		= ip2->i_ino;
+	kern_f->file2_gen		= inode2->i_generation;
+	kern_f->magic			= XCR_FRESH_MAGIC;
+	xfs_iunlock(ip2, lockflags);
+
+	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
+	struct xfs_mount		*mp = ip2->i_mount;
+	struct fd			file1;
+	int				error;
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+		return -EINVAL;
+	if (kern_f->magic != XCR_FRESH_MAGIC)
+		return -EBUSY;
+	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+		return -EBUSY;
+
+	fxr.file1_offset	= args.file1_offset;
+	fxr.file2_offset	= args.file2_offset;
+	fxr.length		= args.length;
+	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_gen		= kern_f->file2_gen;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+	fxr.file1 = file1.file;
+
+	error = xfs_exchange_range(&fxr);
+	fdput(file1);
+	return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
 
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
+
 #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
-					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
 
 struct xfs_exchrange {
 	struct file		*file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
 	u64			length;
 
 	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
+
+	/* file2 metadata for freshness checks */
+	u64			file2_ino;
+	struct timespec64	file2_mtime;
+	struct timespec64	file2_ctime;
+	u32			file2_gen;
 };
 
 long xfs_ioc_exchange_range(struct file *file,
 		struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+		struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+		struct xfs_commit_range __user	*argp);
 
 struct xfs_exchmaps_req;
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
+	case XFS_IOC_COMMIT_RANGE:
+		return xfs_ioc_commit_range(filp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
 	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
 	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
 	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
-	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
+	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
+	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
 
 /* file exchange-range tracepoint class */
 DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
 
+TRACE_EVENT(xfs_exchrange_freshness,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip2_ino)
+		__field(long long, ip2_mtime)
+		__field(long long, ip2_ctime)
+		__field(int, ip2_mtime_nsec)
+		__field(int, ip2_ctime_nsec)
+
+		__field(xfs_ino_t, file2_ino)
+		__field(long long, file2_mtime)
+		__field(long long, file2_ctime)
+		__field(int, file2_mtime_nsec)
+		__field(int, file2_ctime_nsec)
+	),
+	TP_fast_assign(
+		struct timespec64	ts64;
+		struct inode		*inode2 = VFS_I(ip2);
+
+		__entry->dev = inode2->i_sb->s_dev;
+		__entry->ip2_ino = ip2->i_ino;
+
+		ts64 = inode_get_ctime(inode2);
+		__entry->ip2_ctime = ts64.tv_sec;
+		__entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+		ts64 = inode_get_mtime(inode2);
+		__entry->ip2_mtime = ts64.tv_sec;
+		__entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+		__entry->file2_ino = fxr->file2_ino;
+		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
+		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
+		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ip2_ino,
+		  __entry->ip2_mtime,
+		  __entry->ip2_mtime_nsec,
+		  __entry->ip2_ctime,
+		  __entry->ip2_ctime_nsec,
+		  __entry->file2_ino,
+		  __entry->file2_mtime,
+		  __entry->file2_mtime_nsec,
+		  __entry->file2_ctime,
+		  __entry->file2_ctime_nsec)
+);
+
 TRACE_EVENT(xfs_exchmaps_overhead,
 	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
 		 unsigned long long rmapbt_blocks),

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
  2024-08-24  6:29     ` [PATCH v31.0.1 " Darrick J. Wong
@ 2024-08-24 12:11       ` Jeff Layton
  2024-08-25  4:52       ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2024-08-24 12:11 UTC (permalink / raw)
  To: Darrick J. Wong, hch; +Cc: linux-xfs, linux-fsdevel

On Fri, 2024-08-23 at 23:29 -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point.  The start-commit ioctl performs the sampling
> of file attributes.
> 
> Note: This patch currently samples i_ctime during START_COMMIT and
> checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
> safe in kernels prior to 6.12 because ctime only had coarse grained
> granularity and very fast updates could collide with a COMMIT_RANGE.
> With the multi-granularity ctime introduced by Jeff Layton, it's now
> possible to update ctime such that this does not happen.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
>  fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_exchrange.h |   16 +++++
>  fs/xfs/xfs_ioctl.c     |    4 +
>  fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
>  5 files changed, 243 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 454b63ef72016..c85c8077fac39 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -825,6 +825,30 @@ struct xfs_exchange_range {
>  	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
>  };
>  
> +/*
> + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> + * ctime as the arguments provided to the call.  The old contents of file2 will
> + * be moved to file1.
> + *
> + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> + *
> + * Filesystems must be able to restart and complete the operation even after
> + * the system goes down.
> + */
> +struct xfs_commit_range {
> +	__s32		file1_fd;
> +	__u32		pad;		/* must be zeroes */
> +	__u64		file1_offset;	/* file1 offset, bytes */
> +	__u64		file2_offset;	/* file2 offset, bytes */
> +	__u64		length;		/* bytes to exchange */
> +
> +	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
> +
> +	/* opaque file2 metadata for freshness checks */
> +	__u64		file2_freshness[6];
> +};
> +
>  /*
>   * Exchange file data all the way to the ends of both files, and then exchange
>   * the file sizes.  This flag can be used to replace a file's contents with a
> @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
>  #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
>  #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
>  #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
> +#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
> +#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
>  /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
>  
>  
> diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> index c8a655c92c92f..d0889190ab7ff 100644
> --- a/fs/xfs/xfs_exchrange.c
> +++ b/fs/xfs/xfs_exchrange.c
> @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
>  	return error;
>  }
>  
> +/*
> + * Check that file2's metadata agree with the snapshot that we took for the
> + * range commit request.
> + *
> + * This should be called after the filesystem has locked /all/ inode metadata
> + * against modification.
> + */
> +STATIC int
> +xfs_exchrange_check_freshness(
> +	const struct xfs_exchrange	*fxr,
> +	struct xfs_inode		*ip2)
> +{
> +	struct inode			*inode2 = VFS_I(ip2);
> +	struct timespec64		ctime = inode_get_ctime(inode2);
> +	struct timespec64		mtime = inode_get_mtime(inode2);
> +
> +	trace_xfs_exchrange_freshness(fxr, ip2);
> +
> +	/* Check that file2 hasn't otherwise been modified. */
> +	if (fxr->file2_ino != ip2->i_ino ||
> +	    fxr->file2_gen != inode2->i_generation ||
> +	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> +	    !timespec64_equal(&fxr->file2_mtime, &mtime))
> +		return -EBUSY;
> +
> +	return 0;
> +}
> +
>  #define QRETRY_IP1	(0x1)
>  #define QRETRY_IP2	(0x2)
>  
> @@ -607,6 +635,12 @@ xfs_exchrange_prep(
>  	if (error || fxr->length == 0)
>  		return error;
>  
> +	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> +		error = xfs_exchrange_check_freshness(fxr, ip2);
> +		if (error)
> +			return error;
> +	}
> +
>  	/* Attach dquots to both inodes before changing block maps. */
>  	error = xfs_qm_dqattach(ip2);
>  	if (error)
> @@ -719,7 +753,8 @@ xfs_exchange_range(
>  	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
>  		return -EXDEV;
>  
> -	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> +	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> +			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
>  		return -EINVAL;
>  
>  	/* Userspace requests only honored for regular files. */
> @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
>  	fdput(file1);
>  	return error;
>  }
> +
> +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> +struct xfs_commit_range_fresh {
> +	xfs_fsid_t	fsid;		/* m_fixedfsid */
> +	__u64		file2_ino;	/* inode number */
> +	__s64		file2_mtime;	/* modification time */
> +	__s64		file2_ctime;	/* change time */
> +	__s32		file2_mtime_nsec; /* mod time, nsec */
> +	__s32		file2_ctime_nsec; /* change time, nsec */
> +	__u32		file2_gen;	/* inode generation */
> +	__u32		magic;		/* zero */
> +};
> +#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
> +
> +/* Set up a commitrange operation by sampling file2's write-related attrs */
> +long
> +xfs_ioc_start_commit(
> +	struct file			*file,
> +	struct xfs_commit_range __user	*argp)
> +{
> +	struct xfs_commit_range		args = { };
> +	struct timespec64		ts;
> +	struct xfs_commit_range_fresh	*kern_f;
> +	struct xfs_commit_range_fresh	__user *user_f;
> +	struct inode			*inode2 = file_inode(file);
> +	struct xfs_inode		*ip2 = XFS_I(inode2);
> +	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
> +						    XFS_MMAPLOCK_SHARED |
> +						    XFS_ILOCK_SHARED;
> +
> +	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> +		     sizeof(args.file2_freshness));
> +
> +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> +	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> +
> +	xfs_ilock(ip2, lockflags);
> +	ts = inode_get_ctime(inode2);
> +	kern_f->file2_ctime		= ts.tv_sec;
> +	kern_f->file2_ctime_nsec	= ts.tv_nsec;
> +	ts = inode_get_mtime(inode2);
> +	kern_f->file2_mtime		= ts.tv_sec;
> +	kern_f->file2_mtime_nsec	= ts.tv_nsec;
> +	kern_f->file2_ino		= ip2->i_ino;
> +	kern_f->file2_gen		= inode2->i_generation;
> +	kern_f->magic			= XCR_FRESH_MAGIC;
> +	xfs_iunlock(ip2, lockflags);
> +
> +	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> +	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +/*
> + * Exchange file1 and file2 contents if file2 has not been written since the
> + * start commit operation.
> + */
> +long
> +xfs_ioc_commit_range(
> +	struct file			*file,
> +	struct xfs_commit_range __user	*argp)
> +{
> +	struct xfs_exchrange		fxr = {
> +		.file2			= file,
> +	};
> +	struct xfs_commit_range		args;
> +	struct xfs_commit_range_fresh	*kern_f;
> +	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
> +	struct xfs_mount		*mp = ip2->i_mount;
> +	struct fd			file1;
> +	int				error;
> +
> +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> +		return -EINVAL;
> +	if (kern_f->magic != XCR_FRESH_MAGIC)
> +		return -EBUSY;
> +	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> +		return -EBUSY;
> +
> +	fxr.file1_offset	= args.file1_offset;
> +	fxr.file2_offset	= args.file2_offset;
> +	fxr.length		= args.length;
> +	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> +	fxr.file2_ino		= kern_f->file2_ino;
> +	fxr.file2_gen		= kern_f->file2_gen;
> +	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
> +	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
> +	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
> +	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
> +
> +	file1 = fdget(args.file1_fd);
> +	if (!file1.file)
> +		return -EBADF;
> +	fxr.file1 = file1.file;
> +
> +	error = xfs_exchange_range(&fxr);
> +	fdput(file1);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> index 039abcca546e8..bc1298aba806b 100644
> --- a/fs/xfs/xfs_exchrange.h
> +++ b/fs/xfs/xfs_exchrange.h
> @@ -10,8 +10,12 @@
>  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
>  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
>  
> +/* Freshness check required */
> +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
> +
>  #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> -					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> +					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> +					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
>  
>  struct xfs_exchrange {
>  	struct file		*file1;
> @@ -22,10 +26,20 @@ struct xfs_exchrange {
>  	u64			length;
>  
>  	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
> +
> +	/* file2 metadata for freshness checks */
> +	u64			file2_ino;
> +	struct timespec64	file2_mtime;
> +	struct timespec64	file2_ctime;
> +	u32			file2_gen;
>  };
>  
>  long xfs_ioc_exchange_range(struct file *file,
>  		struct xfs_exchange_range __user *argp);
> +long xfs_ioc_start_commit(struct file *file,
> +		struct xfs_commit_range __user *argp);
> +long xfs_ioc_commit_range(struct file *file,
> +		struct xfs_commit_range __user	*argp);
>  
>  struct xfs_exchmaps_req;
>  
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 6b13666d4e963..90b3ee21e7fe6 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
>  
>  	case XFS_IOC_EXCHANGE_RANGE:
>  		return xfs_ioc_exchange_range(filp, arg);
> +	case XFS_IOC_START_COMMIT:
> +		return xfs_ioc_start_commit(filp, arg);
> +	case XFS_IOC_COMMIT_RANGE:
> +		return xfs_ioc_commit_range(filp, arg);
>  
>  	default:
>  		return -ENOTTY;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 180ce697305a9..4cf0fa71ba9ce 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
>  	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
>  	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
>  	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
> -	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
> +	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
> +	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
>  
>  /* file exchange-range tracepoint class */
>  DECLARE_EVENT_CLASS(xfs_exchrange_class,
> @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
>  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
>  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
>  
> +TRACE_EVENT(xfs_exchrange_freshness,
> +	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> +	TP_ARGS(fxr, ip2),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_ino_t, ip2_ino)
> +		__field(long long, ip2_mtime)
> +		__field(long long, ip2_ctime)
> +		__field(int, ip2_mtime_nsec)
> +		__field(int, ip2_ctime_nsec)
> +
> +		__field(xfs_ino_t, file2_ino)
> +		__field(long long, file2_mtime)
> +		__field(long long, file2_ctime)
> +		__field(int, file2_mtime_nsec)
> +		__field(int, file2_ctime_nsec)
> +	),
> +	TP_fast_assign(
> +		struct timespec64	ts64;
> +		struct inode		*inode2 = VFS_I(ip2);
> +
> +		__entry->dev = inode2->i_sb->s_dev;
> +		__entry->ip2_ino = ip2->i_ino;
> +
> +		ts64 = inode_get_ctime(inode2);
> +		__entry->ip2_ctime = ts64.tv_sec;
> +		__entry->ip2_ctime_nsec = ts64.tv_nsec;
> +
> +		ts64 = inode_get_mtime(inode2);
> +		__entry->ip2_mtime = ts64.tv_sec;
> +		__entry->ip2_mtime_nsec = ts64.tv_nsec;
> +
> +		__entry->file2_ino = fxr->file2_ino;
> +		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
> +		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
> +		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> +		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> +	),
> +	TP_printk("dev %d:%d "
> +		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> +		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ip2_ino,
> +		  __entry->ip2_mtime,
> +		  __entry->ip2_mtime_nsec,
> +		  __entry->ip2_ctime,
> +		  __entry->ip2_ctime_nsec,
> +		  __entry->file2_ino,
> +		  __entry->file2_mtime,
> +		  __entry->file2_mtime_nsec,
> +		  __entry->file2_ctime,
> +		  __entry->file2_ctime_nsec)
> +);
> +
>  TRACE_EVENT(xfs_exchmaps_overhead,
>  	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
>  		 unsigned long long rmapbt_blocks),

Acked-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
  2024-08-24  6:29     ` [PATCH v31.0.1 " Darrick J. Wong
  2024-08-24 12:11       ` Jeff Layton
@ 2024-08-25  4:52       ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-25  4:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, jlayton

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-09-02 18:21 [PATCHSET v31.1 1/8] xfs: atomic file content commits Darrick J. Wong
@ 2024-09-02 18:23 ` Darrick J. Wong
  2024-09-03  7:52   ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-09-02 18:23 UTC (permalink / raw)
  To: chandanbabu, djwong
  Cc: Jeff Layton, Christoph Hellwig, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point.  The start-commit ioctl performs the sampling
of file attributes.

Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.

It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
 fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |   16 +++++
 fs/xfs/xfs_ioctl.c     |    4 +
 fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
 5 files changed, 243 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef7201..c85c8077fac3 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
 	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
 };
 
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call.  The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+	__s32		file1_fd;
+	__u32		pad;		/* must be zeroes */
+	__u64		file1_offset;	/* file1 offset, bytes */
+	__u64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
+
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[6];
+};
+
 /*
  * Exchange file data all the way to the ends of both files, and then exchange
  * the file sizes.  This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92..d0889190ab7f 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
 	return error;
 }
 
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip2)
+{
+	struct inode			*inode2 = VFS_I(ip2);
+	struct timespec64		ctime = inode_get_ctime(inode2);
+	struct timespec64		mtime = inode_get_mtime(inode2);
+
+	trace_xfs_exchrange_freshness(fxr, ip2);
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if (fxr->file2_ino != ip2->i_ino ||
+	    fxr->file2_gen != inode2->i_generation ||
+	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+	    !timespec64_equal(&fxr->file2_mtime, &mtime))
+		return -EBUSY;
+
+	return 0;
+}
+
 #define QRETRY_IP1	(0x1)
 #define QRETRY_IP2	(0x2)
 
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
 	if (error || fxr->length == 0)
 		return error;
 
+	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+		error = xfs_exchrange_check_freshness(fxr, ip2);
+		if (error)
+			return error;
+	}
+
 	/* Attach dquots to both inodes before changing block maps. */
 	error = xfs_qm_dqattach(ip2);
 	if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
 	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
 		return -EXDEV;
 
-	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
 		return -EINVAL;
 
 	/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
 	fdput(file1);
 	return error;
 }
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	xfs_fsid_t	fsid;		/* m_fixedfsid */
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u32		file2_gen;	/* inode generation */
+	__u32		magic;		/* zero */
+};
+#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_commit_range		args = { };
+	struct timespec64		ts;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+	struct inode			*inode2 = file_inode(file);
+	struct xfs_inode		*ip2 = XFS_I(inode2);
+	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
+						    XFS_MMAPLOCK_SHARED |
+						    XFS_ILOCK_SHARED;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+	xfs_ilock(ip2, lockflags);
+	ts = inode_get_ctime(inode2);
+	kern_f->file2_ctime		= ts.tv_sec;
+	kern_f->file2_ctime_nsec	= ts.tv_nsec;
+	ts = inode_get_mtime(inode2);
+	kern_f->file2_mtime		= ts.tv_sec;
+	kern_f->file2_mtime_nsec	= ts.tv_nsec;
+	kern_f->file2_ino		= ip2->i_ino;
+	kern_f->file2_gen		= inode2->i_generation;
+	kern_f->magic			= XCR_FRESH_MAGIC;
+	xfs_iunlock(ip2, lockflags);
+
+	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
+	struct xfs_mount		*mp = ip2->i_mount;
+	struct fd			file1;
+	int				error;
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+		return -EINVAL;
+	if (kern_f->magic != XCR_FRESH_MAGIC)
+		return -EBUSY;
+	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+		return -EBUSY;
+
+	fxr.file1_offset	= args.file1_offset;
+	fxr.file2_offset	= args.file2_offset;
+	fxr.length		= args.length;
+	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_gen		= kern_f->file2_gen;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+	fxr.file1 = file1.file;
+
+	error = xfs_exchange_range(&fxr);
+	fdput(file1);
+	return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e..bc1298aba806 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
 #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
 
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
+
 #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
-					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
 
 struct xfs_exchrange {
 	struct file		*file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
 	u64			length;
 
 	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
+
+	/* file2 metadata for freshness checks */
+	u64			file2_ino;
+	struct timespec64	file2_mtime;
+	struct timespec64	file2_ctime;
+	u32			file2_gen;
 };
 
 long xfs_ioc_exchange_range(struct file *file,
 		struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+		struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+		struct xfs_commit_range __user	*argp);
 
 struct xfs_exchmaps_req;
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e96..90b3ee21e7fe 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
+	case XFS_IOC_COMMIT_RANGE:
+		return xfs_ioc_commit_range(filp, arg);
 
 	default:
 		return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a..4cf0fa71ba9c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
 	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
 	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
 	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
-	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
+	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
+	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
 
 /* file exchange-range tracepoint class */
 DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
 DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
 
+TRACE_EVENT(xfs_exchrange_freshness,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip2_ino)
+		__field(long long, ip2_mtime)
+		__field(long long, ip2_ctime)
+		__field(int, ip2_mtime_nsec)
+		__field(int, ip2_ctime_nsec)
+
+		__field(xfs_ino_t, file2_ino)
+		__field(long long, file2_mtime)
+		__field(long long, file2_ctime)
+		__field(int, file2_mtime_nsec)
+		__field(int, file2_ctime_nsec)
+	),
+	TP_fast_assign(
+		struct timespec64	ts64;
+		struct inode		*inode2 = VFS_I(ip2);
+
+		__entry->dev = inode2->i_sb->s_dev;
+		__entry->ip2_ino = ip2->i_ino;
+
+		ts64 = inode_get_ctime(inode2);
+		__entry->ip2_ctime = ts64.tv_sec;
+		__entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+		ts64 = inode_get_mtime(inode2);
+		__entry->ip2_mtime = ts64.tv_sec;
+		__entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+		__entry->file2_ino = fxr->file2_ino;
+		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
+		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
+		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ip2_ino,
+		  __entry->ip2_mtime,
+		  __entry->ip2_mtime_nsec,
+		  __entry->ip2_ctime,
+		  __entry->ip2_ctime_nsec,
+		  __entry->file2_ino,
+		  __entry->file2_mtime,
+		  __entry->file2_mtime_nsec,
+		  __entry->file2_ctime,
+		  __entry->file2_ctime_nsec)
+);
+
 TRACE_EVENT(xfs_exchmaps_overhead,
 	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
 		 unsigned long long rmapbt_blocks),


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-09-02 18:23 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
@ 2024-09-03  7:52   ` Christian Brauner
  2024-10-25 21:23     ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2024-09-03  7:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: chandanbabu, Jeff Layton, Christoph Hellwig, linux-fsdevel,
	linux-xfs

On Mon, Sep 02, 2024 at 11:23:07AM GMT, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point.  The start-commit ioctl performs the sampling
> of file attributes.
> 
> Note: This patch currently samples i_ctime during START_COMMIT and
> checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
> safe in kernels prior to 6.12 because ctime only had coarse grained
> granularity and very fast updates could collide with a COMMIT_RANGE.
> With the multi-granularity ctime introduced by Jeff Layton, it's now
> possible to update ctime such that this does not happen.
> 
> It is critical, then, that this patch must not be backported to any
> kernel that does not support fine-grained file change timestamps.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Jeff Layton <jlayton@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
>  fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_exchrange.h |   16 +++++
>  fs/xfs/xfs_ioctl.c     |    4 +
>  fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
>  5 files changed, 243 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 454b63ef7201..c85c8077fac3 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -825,6 +825,30 @@ struct xfs_exchange_range {
>  	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
>  };
>  
> +/*
> + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> + * ctime as the arguments provided to the call.  The old contents of file2 will
> + * be moved to file1.
> + *
> + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> + *
> + * Filesystems must be able to restart and complete the operation even after
> + * the system goes down.
> + */
> +struct xfs_commit_range {
> +	__s32		file1_fd;
> +	__u32		pad;		/* must be zeroes */
> +	__u64		file1_offset;	/* file1 offset, bytes */
> +	__u64		file2_offset;	/* file2 offset, bytes */
> +	__u64		length;		/* bytes to exchange */
> +
> +	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
> +
> +	/* opaque file2 metadata for freshness checks */
> +	__u64		file2_freshness[6];
> +};
> +
>  /*
>   * Exchange file data all the way to the ends of both files, and then exchange
>   * the file sizes.  This flag can be used to replace a file's contents with a
> @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
>  #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
>  #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
>  #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
> +#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
> +#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
>  /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
>  
>  
> diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> index c8a655c92c92..d0889190ab7f 100644
> --- a/fs/xfs/xfs_exchrange.c
> +++ b/fs/xfs/xfs_exchrange.c
> @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
>  	return error;
>  }
>  
> +/*
> + * Check that file2's metadata agree with the snapshot that we took for the
> + * range commit request.
> + *
> + * This should be called after the filesystem has locked /all/ inode metadata
> + * against modification.
> + */
> +STATIC int
> +xfs_exchrange_check_freshness(
> +	const struct xfs_exchrange	*fxr,
> +	struct xfs_inode		*ip2)
> +{
> +	struct inode			*inode2 = VFS_I(ip2);
> +	struct timespec64		ctime = inode_get_ctime(inode2);
> +	struct timespec64		mtime = inode_get_mtime(inode2);
> +
> +	trace_xfs_exchrange_freshness(fxr, ip2);
> +
> +	/* Check that file2 hasn't otherwise been modified. */
> +	if (fxr->file2_ino != ip2->i_ino ||
> +	    fxr->file2_gen != inode2->i_generation ||
> +	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> +	    !timespec64_equal(&fxr->file2_mtime, &mtime))
> +		return -EBUSY;
> +
> +	return 0;
> +}
> +
>  #define QRETRY_IP1	(0x1)
>  #define QRETRY_IP2	(0x2)
>  
> @@ -607,6 +635,12 @@ xfs_exchrange_prep(
>  	if (error || fxr->length == 0)
>  		return error;
>  
> +	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> +		error = xfs_exchrange_check_freshness(fxr, ip2);
> +		if (error)
> +			return error;
> +	}
> +
>  	/* Attach dquots to both inodes before changing block maps. */
>  	error = xfs_qm_dqattach(ip2);
>  	if (error)
> @@ -719,7 +753,8 @@ xfs_exchange_range(
>  	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
>  		return -EXDEV;
>  
> -	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> +	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> +			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
>  		return -EINVAL;
>  
>  	/* Userspace requests only honored for regular files. */
> @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
>  	fdput(file1);
>  	return error;
>  }
> +
> +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> +struct xfs_commit_range_fresh {
> +	xfs_fsid_t	fsid;		/* m_fixedfsid */
> +	__u64		file2_ino;	/* inode number */
> +	__s64		file2_mtime;	/* modification time */
> +	__s64		file2_ctime;	/* change time */
> +	__s32		file2_mtime_nsec; /* mod time, nsec */
> +	__s32		file2_ctime_nsec; /* change time, nsec */
> +	__u32		file2_gen;	/* inode generation */
> +	__u32		magic;		/* zero */
> +};
> +#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
> +
> +/* Set up a commitrange operation by sampling file2's write-related attrs */
> +long
> +xfs_ioc_start_commit(
> +	struct file			*file,
> +	struct xfs_commit_range __user	*argp)
> +{
> +	struct xfs_commit_range		args = { };
> +	struct timespec64		ts;
> +	struct xfs_commit_range_fresh	*kern_f;
> +	struct xfs_commit_range_fresh	__user *user_f;
> +	struct inode			*inode2 = file_inode(file);
> +	struct xfs_inode		*ip2 = XFS_I(inode2);
> +	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
> +						    XFS_MMAPLOCK_SHARED |
> +						    XFS_ILOCK_SHARED;
> +
> +	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> +		     sizeof(args.file2_freshness));
> +
> +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> +	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> +
> +	xfs_ilock(ip2, lockflags);
> +	ts = inode_get_ctime(inode2);
> +	kern_f->file2_ctime		= ts.tv_sec;
> +	kern_f->file2_ctime_nsec	= ts.tv_nsec;
> +	ts = inode_get_mtime(inode2);
> +	kern_f->file2_mtime		= ts.tv_sec;
> +	kern_f->file2_mtime_nsec	= ts.tv_nsec;
> +	kern_f->file2_ino		= ip2->i_ino;
> +	kern_f->file2_gen		= inode2->i_generation;
> +	kern_f->magic			= XCR_FRESH_MAGIC;
> +	xfs_iunlock(ip2, lockflags);
> +
> +	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> +	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +/*
> + * Exchange file1 and file2 contents if file2 has not been written since the
> + * start commit operation.
> + */
> +long
> +xfs_ioc_commit_range(
> +	struct file			*file,
> +	struct xfs_commit_range __user	*argp)
> +{
> +	struct xfs_exchrange		fxr = {
> +		.file2			= file,
> +	};
> +	struct xfs_commit_range		args;
> +	struct xfs_commit_range_fresh	*kern_f;
> +	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
> +	struct xfs_mount		*mp = ip2->i_mount;
> +	struct fd			file1;
> +	int				error;
> +
> +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> +	if (copy_from_user(&args, argp, sizeof(args)))
> +		return -EFAULT;
> +	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> +		return -EINVAL;
> +	if (kern_f->magic != XCR_FRESH_MAGIC)
> +		return -EBUSY;
> +	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> +		return -EBUSY;

So, I mentioned this before in another mail a few months ago and I think
you liked the idea so just as a reminder in case you forgot:

Ioctls are extensible if done correctly:

switch (__IOC_NR(ioctl)) {
	case _IOC_NR(XFS_IOC_START_COMMIT): {
	        size_t usize = _IOC_SIZE(ioctl);
		struct xfs_commit_range	args;

		if (usize < XFS_IOC_START_COMMIT_SIZE_VER0)
                        return -EINVAL;

		if (copy_struct_from_user(&args, sizeof(args), argp, usize))
			return -EFAULT;
	}

If you code it this way, relying on copy_struct_from_user() right from
the start you can easily extend your struct in a backward and forward
compatible manner.

}

> +
> +	fxr.file1_offset	= args.file1_offset;
> +	fxr.file2_offset	= args.file2_offset;
> +	fxr.length		= args.length;
> +	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> +	fxr.file2_ino		= kern_f->file2_ino;
> +	fxr.file2_gen		= kern_f->file2_gen;
> +	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
> +	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
> +	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
> +	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
> +
> +	file1 = fdget(args.file1_fd);
> +	if (!file1.file)
> +		return -EBADF;

Please use CLASS(fd, f)(args.file1_fd) :)

> +	fxr.file1 = file1.file;
> +
> +	error = xfs_exchange_range(&fxr);
> +	fdput(file1);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> index 039abcca546e..bc1298aba806 100644
> --- a/fs/xfs/xfs_exchrange.h
> +++ b/fs/xfs/xfs_exchrange.h
> @@ -10,8 +10,12 @@
>  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
>  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
>  
> +/* Freshness check required */
> +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
> +
>  #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> -					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> +					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> +					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
>  
>  struct xfs_exchrange {
>  	struct file		*file1;
> @@ -22,10 +26,20 @@ struct xfs_exchrange {
>  	u64			length;
>  
>  	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
> +
> +	/* file2 metadata for freshness checks */
> +	u64			file2_ino;
> +	struct timespec64	file2_mtime;
> +	struct timespec64	file2_ctime;
> +	u32			file2_gen;
>  };
>  
>  long xfs_ioc_exchange_range(struct file *file,
>  		struct xfs_exchange_range __user *argp);
> +long xfs_ioc_start_commit(struct file *file,
> +		struct xfs_commit_range __user *argp);
> +long xfs_ioc_commit_range(struct file *file,
> +		struct xfs_commit_range __user	*argp);
>  
>  struct xfs_exchmaps_req;
>  
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 6b13666d4e96..90b3ee21e7fe 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
>  
>  	case XFS_IOC_EXCHANGE_RANGE:
>  		return xfs_ioc_exchange_range(filp, arg);
> +	case XFS_IOC_START_COMMIT:
> +		return xfs_ioc_start_commit(filp, arg);
> +	case XFS_IOC_COMMIT_RANGE:
> +		return xfs_ioc_commit_range(filp, arg);
>  
>  	default:
>  		return -ENOTTY;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 180ce697305a..4cf0fa71ba9c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
>  	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
>  	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
>  	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
> -	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
> +	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
> +	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
>  
>  /* file exchange-range tracepoint class */
>  DECLARE_EVENT_CLASS(xfs_exchrange_class,
> @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
>  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
>  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
>  
> +TRACE_EVENT(xfs_exchrange_freshness,
> +	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> +	TP_ARGS(fxr, ip2),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_ino_t, ip2_ino)
> +		__field(long long, ip2_mtime)
> +		__field(long long, ip2_ctime)
> +		__field(int, ip2_mtime_nsec)
> +		__field(int, ip2_ctime_nsec)
> +
> +		__field(xfs_ino_t, file2_ino)
> +		__field(long long, file2_mtime)
> +		__field(long long, file2_ctime)
> +		__field(int, file2_mtime_nsec)
> +		__field(int, file2_ctime_nsec)
> +	),
> +	TP_fast_assign(
> +		struct timespec64	ts64;
> +		struct inode		*inode2 = VFS_I(ip2);
> +
> +		__entry->dev = inode2->i_sb->s_dev;
> +		__entry->ip2_ino = ip2->i_ino;
> +
> +		ts64 = inode_get_ctime(inode2);
> +		__entry->ip2_ctime = ts64.tv_sec;
> +		__entry->ip2_ctime_nsec = ts64.tv_nsec;
> +
> +		ts64 = inode_get_mtime(inode2);
> +		__entry->ip2_mtime = ts64.tv_sec;
> +		__entry->ip2_mtime_nsec = ts64.tv_nsec;
> +
> +		__entry->file2_ino = fxr->file2_ino;
> +		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
> +		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
> +		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> +		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> +	),
> +	TP_printk("dev %d:%d "
> +		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> +		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->ip2_ino,
> +		  __entry->ip2_mtime,
> +		  __entry->ip2_mtime_nsec,
> +		  __entry->ip2_ctime,
> +		  __entry->ip2_ctime_nsec,
> +		  __entry->file2_ino,
> +		  __entry->file2_mtime,
> +		  __entry->file2_mtime_nsec,
> +		  __entry->file2_ctime,
> +		  __entry->file2_ctime_nsec)
> +);
> +
>  TRACE_EVENT(xfs_exchmaps_overhead,
>  	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
>  		 unsigned long long rmapbt_blocks),
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
  2024-09-03  7:52   ` Christian Brauner
@ 2024-10-25 21:23     ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-10-25 21:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: chandanbabu, Jeff Layton, Christoph Hellwig, linux-fsdevel,
	linux-xfs

On Tue, Sep 03, 2024 at 09:52:43AM +0200, Christian Brauner wrote:
> On Mon, Sep 02, 2024 at 11:23:07AM GMT, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > This patch introduces two more new ioctls to manage atomic updates to
> > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE.  The
> > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > does, but with the additional requirement that file2 cannot have changed
> > since some sampling point.  The start-commit ioctl performs the sampling
> > of file attributes.
> > 
> > Note: This patch currently samples i_ctime during START_COMMIT and
> > checks that it hasn't changed during COMMIT_RANGE.  This isn't entirely
> > safe in kernels prior to 6.12 because ctime only had coarse grained
> > granularity and very fast updates could collide with a COMMIT_RANGE.
> > With the multi-granularity ctime introduced by Jeff Layton, it's now
> > possible to update ctime such that this does not happen.
> > 
> > It is critical, then, that this patch must not be backported to any
> > kernel that does not support fine-grained file change timestamps.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Acked-by: Jeff Layton <jlayton@kernel.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/libxfs/xfs_fs.h |   26 +++++++++
> >  fs/xfs/xfs_exchrange.c |  143 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_exchrange.h |   16 +++++
> >  fs/xfs/xfs_ioctl.c     |    4 +
> >  fs/xfs/xfs_trace.h     |   57 +++++++++++++++++++
> >  5 files changed, 243 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index 454b63ef7201..c85c8077fac3 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -825,6 +825,30 @@ struct xfs_exchange_range {
> >  	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
> >  };
> >  
> > +/*
> > + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> > + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> > + * ctime as the arguments provided to the call.  The old contents of file2 will
> > + * be moved to file1.
> > + *
> > + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> > + *
> > + * Filesystems must be able to restart and complete the operation even after
> > + * the system goes down.
> > + */
> > +struct xfs_commit_range {
> > +	__s32		file1_fd;
> > +	__u32		pad;		/* must be zeroes */
> > +	__u64		file1_offset;	/* file1 offset, bytes */
> > +	__u64		file2_offset;	/* file2 offset, bytes */
> > +	__u64		length;		/* bytes to exchange */
> > +
> > +	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
> > +
> > +	/* opaque file2 metadata for freshness checks */
> > +	__u64		file2_freshness[6];
> > +};
> > +
> >  /*
> >   * Exchange file data all the way to the ends of both files, and then exchange
> >   * the file sizes.  This flag can be used to replace a file's contents with a
> > @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
> >  #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
> >  #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
> >  #define XFS_IOC_EXCHANGE_RANGE	     _IOW ('X', 129, struct xfs_exchange_range)
> > +#define XFS_IOC_START_COMMIT	     _IOR ('X', 130, struct xfs_commit_range)
> > +#define XFS_IOC_COMMIT_RANGE	     _IOW ('X', 131, struct xfs_commit_range)
> >  /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
> >  
> >  
> > diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> > index c8a655c92c92..d0889190ab7f 100644
> > --- a/fs/xfs/xfs_exchrange.c
> > +++ b/fs/xfs/xfs_exchrange.c
> > @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
> >  	return error;
> >  }
> >  
> > +/*
> > + * Check that file2's metadata agree with the snapshot that we took for the
> > + * range commit request.
> > + *
> > + * This should be called after the filesystem has locked /all/ inode metadata
> > + * against modification.
> > + */
> > +STATIC int
> > +xfs_exchrange_check_freshness(
> > +	const struct xfs_exchrange	*fxr,
> > +	struct xfs_inode		*ip2)
> > +{
> > +	struct inode			*inode2 = VFS_I(ip2);
> > +	struct timespec64		ctime = inode_get_ctime(inode2);
> > +	struct timespec64		mtime = inode_get_mtime(inode2);
> > +
> > +	trace_xfs_exchrange_freshness(fxr, ip2);
> > +
> > +	/* Check that file2 hasn't otherwise been modified. */
> > +	if (fxr->file2_ino != ip2->i_ino ||
> > +	    fxr->file2_gen != inode2->i_generation ||
> > +	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> > +	    !timespec64_equal(&fxr->file2_mtime, &mtime))
> > +		return -EBUSY;
> > +
> > +	return 0;
> > +}
> > +
> >  #define QRETRY_IP1	(0x1)
> >  #define QRETRY_IP2	(0x2)
> >  
> > @@ -607,6 +635,12 @@ xfs_exchrange_prep(
> >  	if (error || fxr->length == 0)
> >  		return error;
> >  
> > +	if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> > +		error = xfs_exchrange_check_freshness(fxr, ip2);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> >  	/* Attach dquots to both inodes before changing block maps. */
> >  	error = xfs_qm_dqattach(ip2);
> >  	if (error)
> > @@ -719,7 +753,8 @@ xfs_exchange_range(
> >  	if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
> >  		return -EXDEV;
> >  
> > -	if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> > +	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> > +			 __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
> >  		return -EINVAL;
> >  
> >  	/* Userspace requests only honored for regular files. */
> > @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
> >  	fdput(file1);
> >  	return error;
> >  }
> > +
> > +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> > +struct xfs_commit_range_fresh {
> > +	xfs_fsid_t	fsid;		/* m_fixedfsid */
> > +	__u64		file2_ino;	/* inode number */
> > +	__s64		file2_mtime;	/* modification time */
> > +	__s64		file2_ctime;	/* change time */
> > +	__s32		file2_mtime_nsec; /* mod time, nsec */
> > +	__s32		file2_ctime_nsec; /* change time, nsec */
> > +	__u32		file2_gen;	/* inode generation */
> > +	__u32		magic;		/* zero */
> > +};
> > +#define XCR_FRESH_MAGIC	0x444F524B	/* DORK */
> > +
> > +/* Set up a commitrange operation by sampling file2's write-related attrs */
> > +long
> > +xfs_ioc_start_commit(
> > +	struct file			*file,
> > +	struct xfs_commit_range __user	*argp)
> > +{
> > +	struct xfs_commit_range		args = { };
> > +	struct timespec64		ts;
> > +	struct xfs_commit_range_fresh	*kern_f;
> > +	struct xfs_commit_range_fresh	__user *user_f;
> > +	struct inode			*inode2 = file_inode(file);
> > +	struct xfs_inode		*ip2 = XFS_I(inode2);
> > +	const unsigned int		lockflags = XFS_IOLOCK_SHARED |
> > +						    XFS_MMAPLOCK_SHARED |
> > +						    XFS_ILOCK_SHARED;
> > +
> > +	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> > +		     sizeof(args.file2_freshness));
> > +
> > +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> > +
> > +	memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> > +
> > +	xfs_ilock(ip2, lockflags);
> > +	ts = inode_get_ctime(inode2);
> > +	kern_f->file2_ctime		= ts.tv_sec;
> > +	kern_f->file2_ctime_nsec	= ts.tv_nsec;
> > +	ts = inode_get_mtime(inode2);
> > +	kern_f->file2_mtime		= ts.tv_sec;
> > +	kern_f->file2_mtime_nsec	= ts.tv_nsec;
> > +	kern_f->file2_ino		= ip2->i_ino;
> > +	kern_f->file2_gen		= inode2->i_generation;
> > +	kern_f->magic			= XCR_FRESH_MAGIC;
> > +	xfs_iunlock(ip2, lockflags);
> > +
> > +	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> > +	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Exchange file1 and file2 contents if file2 has not been written since the
> > + * start commit operation.
> > + */
> > +long
> > +xfs_ioc_commit_range(
> > +	struct file			*file,
> > +	struct xfs_commit_range __user	*argp)
> > +{
> > +	struct xfs_exchrange		fxr = {
> > +		.file2			= file,
> > +	};
> > +	struct xfs_commit_range		args;
> > +	struct xfs_commit_range_fresh	*kern_f;
> > +	struct xfs_inode		*ip2 = XFS_I(file_inode(file));
> > +	struct xfs_mount		*mp = ip2->i_mount;
> > +	struct fd			file1;
> > +	int				error;
> > +
> > +	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> > +
> > +	if (copy_from_user(&args, argp, sizeof(args)))
> > +		return -EFAULT;
> > +	if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> > +		return -EINVAL;
> > +	if (kern_f->magic != XCR_FRESH_MAGIC)
> > +		return -EBUSY;
> > +	if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> > +		return -EBUSY;
> 
> So, I mentioned this before in another mail a few months ago and I think
> you liked the idea so just as a reminder in case you forgot:
> 
> Ioctls are extensible if done correctly:
> 
> switch (__IOC_NR(ioctl)) {
> 	case _IOC_NR(XFS_IOC_START_COMMIT): {
> 	        size_t usize = _IOC_SIZE(ioctl);
> 		struct xfs_commit_range	args;
> 
> 		if (usize < XFS_IOC_START_COMMIT_SIZE_VER0)
>                         return -EINVAL;
> 
> 		if (copy_struct_from_user(&args, sizeof(args), argp, usize))
> 			return -EFAULT;
> 	}
> 
> If you code it this way, relying on copy_struct_from_user() right from
> the start you can easily extend your struct in a backward and forward
> compatible manner.

I don't know that we'd really need it for commitrange since there's
plenty of space (~40 bytes) in the "opaque" freshness blob.  I suspect
that if we ever add subvolumes to XFS then we might want to take over
the 12 bytes used by mtime* for the subvolume id.

That said, I also think we could convert to this format pretty easily.
Also, copy_struct_from_user can return -E2BIG so I think that needs to
be:

	ret = copy_struct_from_user(&args, sizeof(args), argp, usize);
	if (ret)
		return ret;

Though the overriding reason for not writing the __IOC_NR dispatch code
this way is that every time I've tried extend an xfs ioctl in this
manner, Dave says no because (I think) he doesn't trust how the struct
size is opaquely encoded in the ioctl number /and/ doesn't trust the
BUILD_BUG_ONs I put in the code to guarantee uniqueness so I pick a new
number so I can complete the review instead of starting over with a
different reviewer who doesn't have that particular preference.

> }
> 
> > +
> > +	fxr.file1_offset	= args.file1_offset;
> > +	fxr.file2_offset	= args.file2_offset;
> > +	fxr.length		= args.length;
> > +	fxr.flags		= args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> > +	fxr.file2_ino		= kern_f->file2_ino;
> > +	fxr.file2_gen		= kern_f->file2_gen;
> > +	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
> > +	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
> > +	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
> > +	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
> > +
> > +	file1 = fdget(args.file1_fd);
> > +	if (!file1.file)
> > +		return -EBADF;
> 
> Please use CLASS(fd, f)(args.file1_fd) :)

Yeah, I saw that the fd cleanups collided with xfs in for-next, thanks
for the heads up.

--D

> > +	fxr.file1 = file1.file;
> > +
> > +	error = xfs_exchange_range(&fxr);
> > +	fdput(file1);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> > index 039abcca546e..bc1298aba806 100644
> > --- a/fs/xfs/xfs_exchrange.h
> > +++ b/fs/xfs/xfs_exchrange.h
> > @@ -10,8 +10,12 @@
> >  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1	(1ULL << 63)
> >  #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2	(1ULL << 62)
> >  
> > +/* Freshness check required */
> > +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2	(1ULL << 61)
> > +
> >  #define XFS_EXCHANGE_RANGE_PRIV_FLAGS	(__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> > -					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> > +					 __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> > +					 __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
> >  
> >  struct xfs_exchrange {
> >  	struct file		*file1;
> > @@ -22,10 +26,20 @@ struct xfs_exchrange {
> >  	u64			length;
> >  
> >  	u64			flags;	/* XFS_EXCHANGE_RANGE flags */
> > +
> > +	/* file2 metadata for freshness checks */
> > +	u64			file2_ino;
> > +	struct timespec64	file2_mtime;
> > +	struct timespec64	file2_ctime;
> > +	u32			file2_gen;
> >  };
> >  
> >  long xfs_ioc_exchange_range(struct file *file,
> >  		struct xfs_exchange_range __user *argp);
> > +long xfs_ioc_start_commit(struct file *file,
> > +		struct xfs_commit_range __user *argp);
> > +long xfs_ioc_commit_range(struct file *file,
> > +		struct xfs_commit_range __user	*argp);
> >  
> >  struct xfs_exchmaps_req;
> >  
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index 6b13666d4e96..90b3ee21e7fe 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
> >  
> >  	case XFS_IOC_EXCHANGE_RANGE:
> >  		return xfs_ioc_exchange_range(filp, arg);
> > +	case XFS_IOC_START_COMMIT:
> > +		return xfs_ioc_start_commit(filp, arg);
> > +	case XFS_IOC_COMMIT_RANGE:
> > +		return xfs_ioc_commit_range(filp, arg);
> >  
> >  	default:
> >  		return -ENOTTY;
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 180ce697305a..4cf0fa71ba9c 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
> >  	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
> >  	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
> >  	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
> > -	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
> > +	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }, \
> > +	{ __XFS_EXCHANGE_RANGE_CHECK_FRESH2,	"FRESH2" }
> >  
> >  /* file exchange-range tracepoint class */
> >  DECLARE_EVENT_CLASS(xfs_exchrange_class,
> > @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
> >  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
> >  DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
> >  
> > +TRACE_EVENT(xfs_exchrange_freshness,
> > +	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> > +	TP_ARGS(fxr, ip2),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_ino_t, ip2_ino)
> > +		__field(long long, ip2_mtime)
> > +		__field(long long, ip2_ctime)
> > +		__field(int, ip2_mtime_nsec)
> > +		__field(int, ip2_ctime_nsec)
> > +
> > +		__field(xfs_ino_t, file2_ino)
> > +		__field(long long, file2_mtime)
> > +		__field(long long, file2_ctime)
> > +		__field(int, file2_mtime_nsec)
> > +		__field(int, file2_ctime_nsec)
> > +	),
> > +	TP_fast_assign(
> > +		struct timespec64	ts64;
> > +		struct inode		*inode2 = VFS_I(ip2);
> > +
> > +		__entry->dev = inode2->i_sb->s_dev;
> > +		__entry->ip2_ino = ip2->i_ino;
> > +
> > +		ts64 = inode_get_ctime(inode2);
> > +		__entry->ip2_ctime = ts64.tv_sec;
> > +		__entry->ip2_ctime_nsec = ts64.tv_nsec;
> > +
> > +		ts64 = inode_get_mtime(inode2);
> > +		__entry->ip2_mtime = ts64.tv_sec;
> > +		__entry->ip2_mtime_nsec = ts64.tv_nsec;
> > +
> > +		__entry->file2_ino = fxr->file2_ino;
> > +		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
> > +		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
> > +		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> > +		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> > +	),
> > +	TP_printk("dev %d:%d "
> > +		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> > +		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->ip2_ino,
> > +		  __entry->ip2_mtime,
> > +		  __entry->ip2_mtime_nsec,
> > +		  __entry->ip2_ctime,
> > +		  __entry->ip2_ctime_nsec,
> > +		  __entry->file2_ino,
> > +		  __entry->file2_mtime,
> > +		  __entry->file2_mtime_nsec,
> > +		  __entry->file2_ctime,
> > +		  __entry->file2_ctime_nsec)
> > +);
> > +
> >  TRACE_EVENT(xfs_exchmaps_overhead,
> >  	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
> >  		 unsigned long long rmapbt_blocks),
> > 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-10-25 21:23 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240822235230.GJ6043@frogsfrogsfrogs>
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
2024-08-23  0:01   ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-23  4:12     ` Christoph Hellwig
2024-08-23 13:20       ` Jeff Layton
2024-08-23 17:41         ` Darrick J. Wong
2024-08-23 19:15           ` Jeff Layton
2024-08-24  3:29           ` Christoph Hellwig
2024-08-24  4:46             ` Darrick J. Wong
2024-08-24  4:48               ` Christoph Hellwig
2024-08-24  6:29     ` [PATCH v31.0.1 " Darrick J. Wong
2024-08-24 12:11       ` Jeff Layton
2024-08-25  4:52       ` Christoph Hellwig
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
2024-08-23  0:21   ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
2024-09-02 18:21 [PATCHSET v31.1 1/8] xfs: atomic file content commits Darrick J. Wong
2024-09-02 18:23 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-09-03  7:52   ` Christian Brauner
2024-10-25 21:23     ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).