* [PATCHSET v31.0 02/10] xfs: atomic file content commits
[not found] <20240822235230.GJ6043@frogsfrogsfrogs>
@ 2024-08-22 23:56 ` Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:56 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs, linux-fsdevel
Hi all,
This series creates XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE ioctls
to perform the exchange only if the target file has not been changed
since a given sampling point.
This new functionality uses the mechanism underlying EXCHANGE_RANGE to
stage and commit file updates such that reader programs will see either
the old contents or the new contents in their entirety, with no chance
of torn writes. A successful call completion guarantees that the new
contents will be seen even if the system fails. The pair of ioctls
allows userspace to perform what amounts to a compare and exchange
operation on entire file contents.
Note that there are ongoing arguments in the community about how best to
implement some sort of file data write counter that nfsd could also use
to signal invalidations to clients. Until such a thing is implemented,
this patch will rely on ctime/mtime updates.
Here are the proposed manual pages:
IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
NAME
ioctl_xfs_start_commit - prepare to exchange the contents of
two files ioctl_xfs_commit_range - conditionally exchange the
contents of parts of two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>
int ioctl(int file2_fd, XFS_IOC_START_COMMIT, struct xfs_com‐
mit_range *arg);
int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
mit_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges if file2_fd passes cer‐
tain freshness criteria.
Before exchanging the contents, the program must call the
XFS_IOC_START_COMMIT ioctl to sample freshness data for
file2_fd. If the sampled metadata does not match the file
metadata at commit time, XFS_IOC_COMMIT_RANGE will return
EBUSY.
Exchanges are atomic with regards to concurrent file opera‐
tions. Implementations must guarantee that readers see either
the old contents or the new contents in their entirety, even if
the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_commit_range {
__s32 file1_fd;
__u32 pad;
__u64 file1_offset;
__u64 file2_offset;
__u64 length;
__u64 flags;
__u64 file2_freshness[5];
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
The field file2_freshness is an opaque field whose contents are
determined by the kernel. These file attributes are used to
confirm that file2_fd has not changed by another thread since
the current thread began staging its own update.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges must be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
XFS_EXCHANGE_RANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)).
XFS_EXCHANGE_RANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHANGE_RANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block. For files on the realtime device, this is
the realtime extent size. This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHANGE_RANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EBUSY The file2 inode number and timestamps supplied do not
match file2_fd.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem ex‐
change the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call. Coordina‐
tion between multiple threads is performed by the kernel.
The first is a filesystem defragmenter, which copies the con‐
tents of a file into another file and wishes to exchange the
space mappings of the two files, provided that the original
file has not changed.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
struct xfs_commit_range args = {
.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
/* gather file2's freshness information */
ioctl(fd, XFS_IOC_START_COMMIT, &args);
fstat(fd, &sb);
/* make a fresh copy of the file with terrible alignment to avoid reflink */
clone_file_range(fd, NULL, temp_fd, NULL, 1, 0);
clone_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
printf("file changed while defrag was underway
");
The second is a data storage program that wants to commit non-
contiguous updates to a file atomically. This program cannot
coordinate updates to the file and therefore relies on the ker‐
nel to reject the COMMIT_RANGE command if the file has been up‐
dated by someone else. This can be done by creating a tempo‐
rary file, calling FICLONE(2) to share the contents, and stag‐
ing the updates into the temporary file. The FULL_FILES flag
is recommended for this purpose. The temporary file can be
deleted or punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct xfs_commit_range args = {
.flags = XFS_EXCHANGE_RANGE_TO_EOF,
};
/* gather file2's freshness information */
ioctl(fd, XFS_IOC_START_COMMIT, &args);
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
printf("file changed before commit; will roll back
");
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-18 IOCTL-XFS-COMMIT-RANGE(2)
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-commits
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-commits
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-commits
---
Commits in this patchset:
* xfs: introduce new file range commit ioctls
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups
[not found] <20240822235230.GJ6043@frogsfrogsfrogs>
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-22 23:58 ` Darrick J. Wong
2024-08-23 0:21 ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-22 23:58 UTC (permalink / raw)
To: djwong; +Cc: Christoph Hellwig, linux-fsdevel, hch, linux-xfs
Hi all,
Having cleaned up the rtbitmap code and fixed various weird bugs in the
allocator, now we want to do some more cleanups to the rt free space management
code to get it ready for the introduction of allocation groups.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=rtgroups-prep
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=rtgroups-prep
---
Commits in this patchset:
* iomap: add a merge boundary flag
---
fs/iomap/buffered-io.c | 6 ++++++
include/linux/iomap.h | 4 ++++
2 files changed, 10 insertions(+)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
@ 2024-08-23 0:01 ` Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
0 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:01 UTC (permalink / raw)
To: djwong; +Cc: hch, linux-xfs, linux-fsdevel
From: Darrick J. Wong <djwong@kernel.org>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call. The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+ __s32 file1_fd;
+ __u32 pad; /* must be zeroes */
+ __u64 file1_offset; /* file1 offset, bytes */
+ __u64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
+
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[6];
+};
+
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
return error;
}
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+ const struct xfs_exchrange *fxr,
+ struct xfs_inode *ip2)
+{
+ struct inode *inode2 = VFS_I(ip2);
+ struct timespec64 ctime = inode_get_ctime(inode2);
+ struct timespec64 mtime = inode_get_mtime(inode2);
+
+ trace_xfs_exchrange_freshness(fxr, ip2);
+
+ /* Check that file2 hasn't otherwise been modified. */
+ if (fxr->file2_ino != ip2->i_ino ||
+ fxr->file2_gen != inode2->i_generation ||
+ !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+ !timespec64_equal(&fxr->file2_mtime, &mtime))
+ return -EBUSY;
+
+ return 0;
+}
+
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
if (error || fxr->length == 0)
return error;
+ if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+ error = xfs_exchrange_check_freshness(fxr, ip2);
+ if (error)
+ return error;
+ }
+
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
- if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
return -EINVAL;
/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
fdput(file1);
return error;
}
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ xfs_fsid_t fsid; /* m_fixedfsid */
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u32 file2_gen; /* inode generation */
+ __u32 magic; /* zero */
+};
+#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_commit_range args = { };
+ struct timespec64 ts;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+ struct inode *inode2 = file_inode(file);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ const unsigned int lockflags = XFS_IOLOCK_SHARED |
+ XFS_MMAPLOCK_SHARED |
+ XFS_ILOCK_SHARED;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+ xfs_ilock(ip2, lockflags);
+ ts = inode_get_ctime(inode2);
+ kern_f->file2_ctime = ts.tv_sec;
+ kern_f->file2_ctime_nsec = ts.tv_nsec;
+ ts = inode_get_mtime(inode2);
+ kern_f->file2_mtime = ts.tv_sec;
+ kern_f->file2_mtime_nsec = ts.tv_nsec;
+ kern_f->file2_ino = ip2->i_ino;
+ kern_f->file2_gen = inode2->i_generation;
+ kern_f->magic = XCR_FRESH_MAGIC;
+ xfs_iunlock(ip2, lockflags);
+
+ user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_inode *ip2 = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip2->i_mount;
+ struct fd file1;
+ int error;
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ return -EINVAL;
+ if (kern_f->magic != XCR_FRESH_MAGIC)
+ return -EBUSY;
+ if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+ return -EBUSY;
+
+ fxr.file1_offset = args.file1_offset;
+ fxr.file2_offset = args.file2_offset;
+ fxr.length = args.length;
+ fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_gen = kern_f->file2_gen;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+ fxr.file1 = file1.file;
+
+ error = xfs_exchange_range(&fxr);
+ fdput(file1);
+ return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
+
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
- __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+ __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
struct xfs_exchrange {
struct file *file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
+
+ /* file2 metadata for freshness checks */
+ u64 file2_ino;
+ struct timespec64 file2_mtime;
+ struct timespec64 file2_ctime;
+ u32 file2_gen;
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+ struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+ struct xfs_commit_range __user *argp);
struct xfs_exchmaps_req;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
+ case XFS_IOC_COMMIT_RANGE:
+ return xfs_ioc_commit_range(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
- { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
+ { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
+ { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+TRACE_EVENT(xfs_exchrange_freshness,
+ TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+ TP_ARGS(fxr, ip2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ struct timespec64 ts64;
+ struct inode *inode2 = VFS_I(ip2);
+
+ __entry->dev = inode2->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+
+ ts64 = inode_get_ctime(inode2);
+ __entry->ip2_ctime = ts64.tv_sec;
+ __entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+ ts64 = inode_get_mtime(inode2);
+ __entry->ip2_mtime = ts64.tv_sec;
+ __entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime.tv_sec;
+ __entry->file2_ctime = fxr->file2_ctime.tv_sec;
+ __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 1/1] iomap: add a merge boundary flag
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
@ 2024-08-23 0:21 ` Darrick J. Wong
0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23 0:21 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, Christoph Hellwig, hch, linux-xfs
From: Christoph Hellwig <hch@lst.de>
File systems might have boundaries over which merges aren't possible.
In fact these are very common, although most of the time some kind of
header at the beginning of this region (e.g. XFS alloation groups, ext4
block groups) automatically create a merge barrier. But if that is
not present, say for a device purely used for data we need to manually
communicate that to iomap.
Add a IOMAP_F_BOUNDARY flag to never merge I/O into a previous mapping.
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 6 ++++++
include/linux/iomap.h | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86acc..685136a57cbf7 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1603,6 +1603,8 @@ iomap_ioend_can_merge(struct iomap_ioend *ioend, struct iomap_ioend *next)
{
if (ioend->io_bio.bi_status != next->io_bio.bi_status)
return false;
+ if (next->io_flags & IOMAP_F_BOUNDARY)
+ return false;
if ((ioend->io_flags & IOMAP_F_SHARED) ^
(next->io_flags & IOMAP_F_SHARED))
return false;
@@ -1722,6 +1724,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
INIT_LIST_HEAD(&ioend->io_list);
ioend->io_type = wpc->iomap.type;
ioend->io_flags = wpc->iomap.flags;
+ if (pos > wpc->iomap.offset)
+ wpc->iomap.flags &= ~IOMAP_F_BOUNDARY;
ioend->io_inode = inode;
ioend->io_size = 0;
ioend->io_offset = pos;
@@ -1733,6 +1737,8 @@ static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
{
+ if (wpc->iomap.offset == pos && (wpc->iomap.flags & IOMAP_F_BOUNDARY))
+ return false;
if ((wpc->iomap.flags & IOMAP_F_SHARED) !=
(wpc->ioend->io_flags & IOMAP_F_SHARED))
return false;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d1..ba3c9e5124637 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -53,6 +53,9 @@ struct vm_fault;
*
* IOMAP_F_XATTR indicates that the iomap is for an extended attribute extent
* rather than a file data extent.
+ *
+ * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
+ * never be merged with the mapping before it.
*/
#define IOMAP_F_NEW (1U << 0)
#define IOMAP_F_DIRTY (1U << 1)
@@ -64,6 +67,7 @@ struct vm_fault;
#define IOMAP_F_BUFFER_HEAD 0
#endif /* CONFIG_BUFFER_HEAD */
#define IOMAP_F_XATTR (1U << 5)
+#define IOMAP_F_BOUNDARY (1U << 6)
/*
* Flags set by the core iomap code during operations:
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
@ 2024-08-23 4:12 ` Christoph Hellwig
2024-08-23 13:20 ` Jeff Layton
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-23 4:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, Jeff Layton
On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point. The start-commit ioctl performs the sampling
> of file attributes.
The code itself looks simply enough now, but how do we guarantee
that ctime actually works as a full change count and not just by
chance here?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 4:12 ` Christoph Hellwig
@ 2024-08-23 13:20 ` Jeff Layton
2024-08-23 17:41 ` Darrick J. Wong
0 siblings, 1 reply; 17+ messages in thread
From: Jeff Layton @ 2024-08-23 13:20 UTC (permalink / raw)
To: Christoph Hellwig, Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel
On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > This patch introduces two more new ioctls to manage atomic updates to
> > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > does, but with the additional requirement that file2 cannot have changed
> > since some sampling point. The start-commit ioctl performs the sampling
> > of file attributes.
>
> The code itself looks simply enough now, but how do we guarantee
> that ctime actually works as a full change count and not just by
> chance here?
>
With current mainline kernels it won't, but the updated multigrain
timestamp series is in linux-next and is slated to go into v6.12. At
that point it should be fine for this purpose.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 13:20 ` Jeff Layton
@ 2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
0 siblings, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-23 17:41 UTC (permalink / raw)
To: Jeff Layton; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > This patch introduces two more new ioctls to manage atomic updates to
> > > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> > > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > > does, but with the additional requirement that file2 cannot have changed
> > > since some sampling point. The start-commit ioctl performs the sampling
> > > of file attributes.
> >
> > The code itself looks simply enough now, but how do we guarantee
> > that ctime actually works as a full change count and not just by
> > chance here?
> >
>
> With current mainline kernels it won't, but the updated multigrain
> timestamp series is in linux-next and is slated to go into v6.12. At
> that point it should be fine for this purpose.
<nod> If these both get merged for 6.12, I think the appropriate port
for this patch is to change xfs_ioc_start_commit to do:
struct kstat kstat;
fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
kern_f->file2_ctime = kstat.ctime.tv_sec;
kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
kern_f->file2_mtime = kstat.mtime.tv_sec;
kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
instead of open-coding the inode_get_[cm]time calls. The entire
exchangerange feature is still marked experimental, so I didn't think it
was worth rebasing my entire dev branch on the multigrain timestamp
redux series; we can just fix it later.
--D
> --
> Jeff Layton <jlayton@kernel.org>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 17:41 ` Darrick J. Wong
@ 2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2024-08-23 19:15 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, 2024-08-23 at 10:41 -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 09:20:15AM -0400, Jeff Layton wrote:
> > On Thu, 2024-08-22 at 21:12 -0700, Christoph Hellwig wrote:
> > > On Thu, Aug 22, 2024 at 05:01:22PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > This patch introduces two more new ioctls to manage atomic
> > > > updates to
> > > > file contents -- XFS_IOC_START_COMMIT and
> > > > XFS_IOC_COMMIT_RANGE. The
> > > > commit mechanism here is exactly the same as what
> > > > XFS_IOC_EXCHANGE_RANGE
> > > > does, but with the additional requirement that file2 cannot
> > > > have changed
> > > > since some sampling point. The start-commit ioctl performs the
> > > > sampling
> > > > of file attributes.
> > >
> > > The code itself looks simply enough now, but how do we guarantee
> > > that ctime actually works as a full change count and not just by
> > > chance here?
> > >
> >
> > With current mainline kernels it won't, but the updated multigrain
> > timestamp series is in linux-next and is slated to go into v6.12.
> > At
> > that point it should be fine for this purpose.
>
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
>
> struct kstat kstat;
>
> fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME,
> XFS_I(ip2));
> kern_f->file2_ctime = kstat.ctime.tv_sec;
> kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> kern_f->file2_mtime = kstat.mtime.tv_sec;
> kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
>
Yep, that's exactly what you'd want to do.
> instead of open-coding the inode_get_[cm]time calls. The entire
> exchangerange feature is still marked experimental, so I didn't think
> it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.
>
Sounds good.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
@ 2024-08-24 3:29 ` Christoph Hellwig
2024-08-24 4:46 ` Darrick J. Wong
1 sibling, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-24 3:29 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Jeff Layton, Christoph Hellwig, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> <nod> If these both get merged for 6.12, I think the appropriate port
> for this patch is to change xfs_ioc_start_commit to do:
>
> struct kstat kstat;
>
> fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> kern_f->file2_ctime = kstat.ctime.tv_sec;
> kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> kern_f->file2_mtime = kstat.mtime.tv_sec;
> kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
>
> instead of open-coding the inode_get_[cm]time calls. The entire
> exchangerange feature is still marked experimental, so I didn't think it
> was worth rebasing my entire dev branch on the multigrain timestamp
> redux series; we can just fix it later.
But the commit log could really note this dependency. This will be
especially useful for backports, but also for anyone reading through
code history.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-24 3:29 ` Christoph Hellwig
@ 2024-08-24 4:46 ` Darrick J. Wong
2024-08-24 4:48 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-24 4:46 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jeff Layton, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 08:29:18PM -0700, Christoph Hellwig wrote:
> On Fri, Aug 23, 2024 at 10:41:40AM -0700, Darrick J. Wong wrote:
> > <nod> If these both get merged for 6.12, I think the appropriate port
> > for this patch is to change xfs_ioc_start_commit to do:
> >
> > struct kstat kstat;
> >
> > fill_mg_cmtime(&kstat, STATX_CTIME | STATX_MTIME, XFS_I(ip2));
> > kern_f->file2_ctime = kstat.ctime.tv_sec;
> > kern_f->file2_ctime_nsec = kstat.ctime.tv_nsec;
> > kern_f->file2_mtime = kstat.mtime.tv_sec;
> > kern_f->file2_mtime_nsec = kstat.mtime.tv_nsec;
> >
> > instead of open-coding the inode_get_[cm]time calls. The entire
> > exchangerange feature is still marked experimental, so I didn't think it
> > was worth rebasing my entire dev branch on the multigrain timestamp
> > redux series; we can just fix it later.
>
> But the commit log could really note this dependency. This will be
> especially useful for backports, but also for anyone reading through
> code history.
Ok, how about this for a commit message:
"This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
"Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced in that release by Jeff
Layton, it's now possible to update ctime such that this does not
happen.
"It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps."
Will that pass muster?
--D
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-08-24 4:46 ` Darrick J. Wong
@ 2024-08-24 4:48 ` Christoph Hellwig
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-24 4:48 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, Jeff Layton, hch, linux-xfs, linux-fsdevel
On Fri, Aug 23, 2024 at 09:46:43PM -0700, Darrick J. Wong wrote:
> "It is critical, then, that this patch must not be backported to any
> kernel that does not support fine-grained file change timestamps."
>
> Will that pass muster?
I'd drop that last sentence as the previous part should be clear
enough.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
@ 2024-08-24 6:29 ` Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 2 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-08-24 6:29 UTC (permalink / raw)
To: hch; +Cc: linux-xfs, linux-fsdevel, jlayton
From: Darrick J. Wong <djwong@kernel.org>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef72016..c85c8077fac39 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call. The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+ __s32 file1_fd;
+ __u32 pad; /* must be zeroes */
+ __u64 file1_offset; /* file1 offset, bytes */
+ __u64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
+
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[6];
+};
+
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92f..d0889190ab7ff 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
return error;
}
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+ const struct xfs_exchrange *fxr,
+ struct xfs_inode *ip2)
+{
+ struct inode *inode2 = VFS_I(ip2);
+ struct timespec64 ctime = inode_get_ctime(inode2);
+ struct timespec64 mtime = inode_get_mtime(inode2);
+
+ trace_xfs_exchrange_freshness(fxr, ip2);
+
+ /* Check that file2 hasn't otherwise been modified. */
+ if (fxr->file2_ino != ip2->i_ino ||
+ fxr->file2_gen != inode2->i_generation ||
+ !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+ !timespec64_equal(&fxr->file2_mtime, &mtime))
+ return -EBUSY;
+
+ return 0;
+}
+
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
if (error || fxr->length == 0)
return error;
+ if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+ error = xfs_exchrange_check_freshness(fxr, ip2);
+ if (error)
+ return error;
+ }
+
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
- if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
return -EINVAL;
/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
fdput(file1);
return error;
}
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ xfs_fsid_t fsid; /* m_fixedfsid */
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u32 file2_gen; /* inode generation */
+ __u32 magic; /* zero */
+};
+#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_commit_range args = { };
+ struct timespec64 ts;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+ struct inode *inode2 = file_inode(file);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ const unsigned int lockflags = XFS_IOLOCK_SHARED |
+ XFS_MMAPLOCK_SHARED |
+ XFS_ILOCK_SHARED;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+ xfs_ilock(ip2, lockflags);
+ ts = inode_get_ctime(inode2);
+ kern_f->file2_ctime = ts.tv_sec;
+ kern_f->file2_ctime_nsec = ts.tv_nsec;
+ ts = inode_get_mtime(inode2);
+ kern_f->file2_mtime = ts.tv_sec;
+ kern_f->file2_mtime_nsec = ts.tv_nsec;
+ kern_f->file2_ino = ip2->i_ino;
+ kern_f->file2_gen = inode2->i_generation;
+ kern_f->magic = XCR_FRESH_MAGIC;
+ xfs_iunlock(ip2, lockflags);
+
+ user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_inode *ip2 = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip2->i_mount;
+ struct fd file1;
+ int error;
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ return -EINVAL;
+ if (kern_f->magic != XCR_FRESH_MAGIC)
+ return -EBUSY;
+ if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+ return -EBUSY;
+
+ fxr.file1_offset = args.file1_offset;
+ fxr.file2_offset = args.file2_offset;
+ fxr.length = args.length;
+ fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_gen = kern_f->file2_gen;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+ fxr.file1 = file1.file;
+
+ error = xfs_exchange_range(&fxr);
+ fdput(file1);
+ return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e8..bc1298aba806b 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
+
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
- __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+ __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
struct xfs_exchrange {
struct file *file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
+
+ /* file2 metadata for freshness checks */
+ u64 file2_ino;
+ struct timespec64 file2_mtime;
+ struct timespec64 file2_ctime;
+ u32 file2_gen;
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+ struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+ struct xfs_commit_range __user *argp);
struct xfs_exchmaps_req;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e963..90b3ee21e7fe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
+ case XFS_IOC_COMMIT_RANGE:
+ return xfs_ioc_commit_range(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a9..4cf0fa71ba9ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
- { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
+ { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
+ { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+TRACE_EVENT(xfs_exchrange_freshness,
+ TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+ TP_ARGS(fxr, ip2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ struct timespec64 ts64;
+ struct inode *inode2 = VFS_I(ip2);
+
+ __entry->dev = inode2->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+
+ ts64 = inode_get_ctime(inode2);
+ __entry->ip2_ctime = ts64.tv_sec;
+ __entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+ ts64 = inode_get_mtime(inode2);
+ __entry->ip2_mtime = ts64.tv_sec;
+ __entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime.tv_sec;
+ __entry->file2_ctime = fxr->file2_ctime.tv_sec;
+ __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
@ 2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Jeff Layton @ 2024-08-24 12:11 UTC (permalink / raw)
To: Darrick J. Wong, hch; +Cc: linux-xfs, linux-fsdevel
On Fri, 2024-08-23 at 23:29 -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point. The start-commit ioctl performs the sampling
> of file attributes.
>
> Note: This patch currently samples i_ctime during START_COMMIT and
> checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
> safe in kernels prior to 6.12 because ctime only had coarse grained
> granularity and very fast updates could collide with a COMMIT_RANGE.
> With the multi-granularity ctime introduced by Jeff Layton, it's now
> possible to update ctime such that this does not happen.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
> fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_exchrange.h | 16 +++++
> fs/xfs/xfs_ioctl.c | 4 +
> fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
> 5 files changed, 243 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 454b63ef72016..c85c8077fac39 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -825,6 +825,30 @@ struct xfs_exchange_range {
> __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> };
>
> +/*
> + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> + * ctime as the arguments provided to the call. The old contents of file2 will
> + * be moved to file1.
> + *
> + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> + *
> + * Filesystems must be able to restart and complete the operation even after
> + * the system goes down.
> + */
> +struct xfs_commit_range {
> + __s32 file1_fd;
> + __u32 pad; /* must be zeroes */
> + __u64 file1_offset; /* file1 offset, bytes */
> + __u64 file2_offset; /* file2 offset, bytes */
> + __u64 length; /* bytes to exchange */
> +
> + __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> +
> + /* opaque file2 metadata for freshness checks */
> + __u64 file2_freshness[6];
> +};
> +
> /*
> * Exchange file data all the way to the ends of both files, and then exchange
> * the file sizes. This flag can be used to replace a file's contents with a
> @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
> #define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
> #define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
> #define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
> +#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
> +#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
> /* XFS_IOC_GETFSUUID ---------- deprecated 140 */
>
>
> diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> index c8a655c92c92f..d0889190ab7ff 100644
> --- a/fs/xfs/xfs_exchrange.c
> +++ b/fs/xfs/xfs_exchrange.c
> @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
> return error;
> }
>
> +/*
> + * Check that file2's metadata agree with the snapshot that we took for the
> + * range commit request.
> + *
> + * This should be called after the filesystem has locked /all/ inode metadata
> + * against modification.
> + */
> +STATIC int
> +xfs_exchrange_check_freshness(
> + const struct xfs_exchrange *fxr,
> + struct xfs_inode *ip2)
> +{
> + struct inode *inode2 = VFS_I(ip2);
> + struct timespec64 ctime = inode_get_ctime(inode2);
> + struct timespec64 mtime = inode_get_mtime(inode2);
> +
> + trace_xfs_exchrange_freshness(fxr, ip2);
> +
> + /* Check that file2 hasn't otherwise been modified. */
> + if (fxr->file2_ino != ip2->i_ino ||
> + fxr->file2_gen != inode2->i_generation ||
> + !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> + !timespec64_equal(&fxr->file2_mtime, &mtime))
> + return -EBUSY;
> +
> + return 0;
> +}
> +
> #define QRETRY_IP1 (0x1)
> #define QRETRY_IP2 (0x2)
>
> @@ -607,6 +635,12 @@ xfs_exchrange_prep(
> if (error || fxr->length == 0)
> return error;
>
> + if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> + error = xfs_exchrange_check_freshness(fxr, ip2);
> + if (error)
> + return error;
> + }
> +
> /* Attach dquots to both inodes before changing block maps. */
> error = xfs_qm_dqattach(ip2);
> if (error)
> @@ -719,7 +753,8 @@ xfs_exchange_range(
> if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
> return -EXDEV;
>
> - if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> + if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> + __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
> return -EINVAL;
>
> /* Userspace requests only honored for regular files. */
> @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
> fdput(file1);
> return error;
> }
> +
> +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> +struct xfs_commit_range_fresh {
> + xfs_fsid_t fsid; /* m_fixedfsid */
> + __u64 file2_ino; /* inode number */
> + __s64 file2_mtime; /* modification time */
> + __s64 file2_ctime; /* change time */
> + __s32 file2_mtime_nsec; /* mod time, nsec */
> + __s32 file2_ctime_nsec; /* change time, nsec */
> + __u32 file2_gen; /* inode generation */
> + __u32 magic; /* zero */
> +};
> +#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
> +
> +/* Set up a commitrange operation by sampling file2's write-related attrs */
> +long
> +xfs_ioc_start_commit(
> + struct file *file,
> + struct xfs_commit_range __user *argp)
> +{
> + struct xfs_commit_range args = { };
> + struct timespec64 ts;
> + struct xfs_commit_range_fresh *kern_f;
> + struct xfs_commit_range_fresh __user *user_f;
> + struct inode *inode2 = file_inode(file);
> + struct xfs_inode *ip2 = XFS_I(inode2);
> + const unsigned int lockflags = XFS_IOLOCK_SHARED |
> + XFS_MMAPLOCK_SHARED |
> + XFS_ILOCK_SHARED;
> +
> + BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> + sizeof(args.file2_freshness));
> +
> + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> + memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> +
> + xfs_ilock(ip2, lockflags);
> + ts = inode_get_ctime(inode2);
> + kern_f->file2_ctime = ts.tv_sec;
> + kern_f->file2_ctime_nsec = ts.tv_nsec;
> + ts = inode_get_mtime(inode2);
> + kern_f->file2_mtime = ts.tv_sec;
> + kern_f->file2_mtime_nsec = ts.tv_nsec;
> + kern_f->file2_ino = ip2->i_ino;
> + kern_f->file2_gen = inode2->i_generation;
> + kern_f->magic = XCR_FRESH_MAGIC;
> + xfs_iunlock(ip2, lockflags);
> +
> + user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> + if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/*
> + * Exchange file1 and file2 contents if file2 has not been written since the
> + * start commit operation.
> + */
> +long
> +xfs_ioc_commit_range(
> + struct file *file,
> + struct xfs_commit_range __user *argp)
> +{
> + struct xfs_exchrange fxr = {
> + .file2 = file,
> + };
> + struct xfs_commit_range args;
> + struct xfs_commit_range_fresh *kern_f;
> + struct xfs_inode *ip2 = XFS_I(file_inode(file));
> + struct xfs_mount *mp = ip2->i_mount;
> + struct fd file1;
> + int error;
> +
> + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> + if (copy_from_user(&args, argp, sizeof(args)))
> + return -EFAULT;
> + if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> + return -EINVAL;
> + if (kern_f->magic != XCR_FRESH_MAGIC)
> + return -EBUSY;
> + if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> + return -EBUSY;
> +
> + fxr.file1_offset = args.file1_offset;
> + fxr.file2_offset = args.file2_offset;
> + fxr.length = args.length;
> + fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> + fxr.file2_ino = kern_f->file2_ino;
> + fxr.file2_gen = kern_f->file2_gen;
> + fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
> + fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
> + fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
> + fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
> +
> + file1 = fdget(args.file1_fd);
> + if (!file1.file)
> + return -EBADF;
> + fxr.file1 = file1.file;
> +
> + error = xfs_exchange_range(&fxr);
> + fdput(file1);
> + return error;
> +}
> diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> index 039abcca546e8..bc1298aba806b 100644
> --- a/fs/xfs/xfs_exchrange.h
> +++ b/fs/xfs/xfs_exchrange.h
> @@ -10,8 +10,12 @@
> #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
> #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
>
> +/* Freshness check required */
> +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
> +
> #define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> - __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> + __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> + __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
>
> struct xfs_exchrange {
> struct file *file1;
> @@ -22,10 +26,20 @@ struct xfs_exchrange {
> u64 length;
>
> u64 flags; /* XFS_EXCHANGE_RANGE flags */
> +
> + /* file2 metadata for freshness checks */
> + u64 file2_ino;
> + struct timespec64 file2_mtime;
> + struct timespec64 file2_ctime;
> + u32 file2_gen;
> };
>
> long xfs_ioc_exchange_range(struct file *file,
> struct xfs_exchange_range __user *argp);
> +long xfs_ioc_start_commit(struct file *file,
> + struct xfs_commit_range __user *argp);
> +long xfs_ioc_commit_range(struct file *file,
> + struct xfs_commit_range __user *argp);
>
> struct xfs_exchmaps_req;
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 6b13666d4e963..90b3ee21e7fe6 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
>
> case XFS_IOC_EXCHANGE_RANGE:
> return xfs_ioc_exchange_range(filp, arg);
> + case XFS_IOC_START_COMMIT:
> + return xfs_ioc_start_commit(filp, arg);
> + case XFS_IOC_COMMIT_RANGE:
> + return xfs_ioc_commit_range(filp, arg);
>
> default:
> return -ENOTTY;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 180ce697305a9..4cf0fa71ba9ce 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
> { XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
> { XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
> { __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
> - { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
> + { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
> + { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
>
> /* file exchange-range tracepoint class */
> DECLARE_EVENT_CLASS(xfs_exchrange_class,
> @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
> DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
> DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
>
> +TRACE_EVENT(xfs_exchrange_freshness,
> + TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> + TP_ARGS(fxr, ip2),
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(xfs_ino_t, ip2_ino)
> + __field(long long, ip2_mtime)
> + __field(long long, ip2_ctime)
> + __field(int, ip2_mtime_nsec)
> + __field(int, ip2_ctime_nsec)
> +
> + __field(xfs_ino_t, file2_ino)
> + __field(long long, file2_mtime)
> + __field(long long, file2_ctime)
> + __field(int, file2_mtime_nsec)
> + __field(int, file2_ctime_nsec)
> + ),
> + TP_fast_assign(
> + struct timespec64 ts64;
> + struct inode *inode2 = VFS_I(ip2);
> +
> + __entry->dev = inode2->i_sb->s_dev;
> + __entry->ip2_ino = ip2->i_ino;
> +
> + ts64 = inode_get_ctime(inode2);
> + __entry->ip2_ctime = ts64.tv_sec;
> + __entry->ip2_ctime_nsec = ts64.tv_nsec;
> +
> + ts64 = inode_get_mtime(inode2);
> + __entry->ip2_mtime = ts64.tv_sec;
> + __entry->ip2_mtime_nsec = ts64.tv_nsec;
> +
> + __entry->file2_ino = fxr->file2_ino;
> + __entry->file2_mtime = fxr->file2_mtime.tv_sec;
> + __entry->file2_ctime = fxr->file2_ctime.tv_sec;
> + __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> + __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> + ),
> + TP_printk("dev %d:%d "
> + "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> + "file 0x%llx mtime %lld:%d ctime %lld:%d",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->ip2_ino,
> + __entry->ip2_mtime,
> + __entry->ip2_mtime_nsec,
> + __entry->ip2_ctime,
> + __entry->ip2_ctime_nsec,
> + __entry->file2_ino,
> + __entry->file2_mtime,
> + __entry->file2_mtime_nsec,
> + __entry->file2_ctime,
> + __entry->file2_ctime_nsec)
> +);
> +
> TRACE_EVENT(xfs_exchmaps_overhead,
> TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
> unsigned long long rmapbt_blocks),
Acked-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v31.0.1 1/1] xfs: introduce new file range commit ioctls
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
@ 2024-08-25 4:52 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-08-25 4:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: hch, linux-xfs, linux-fsdevel, jlayton
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-09-02 18:21 [PATCHSET v31.1 1/8] xfs: atomic file content commits Darrick J. Wong
@ 2024-09-02 18:23 ` Darrick J. Wong
2024-09-03 7:52 ` Christian Brauner
0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-09-02 18:23 UTC (permalink / raw)
To: chandanbabu, djwong
Cc: Jeff Layton, Christoph Hellwig, linux-fsdevel, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
This patch introduces two more new ioctls to manage atomic updates to
file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
does, but with the additional requirement that file2 cannot have changed
since some sampling point. The start-commit ioctl performs the sampling
of file attributes.
Note: This patch currently samples i_ctime during START_COMMIT and
checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
safe in kernels prior to 6.12 because ctime only had coarse grained
granularity and very fast updates could collide with a COMMIT_RANGE.
With the multi-granularity ctime introduced by Jeff Layton, it's now
possible to update ctime such that this does not happen.
It is critical, then, that this patch must not be backported to any
kernel that does not support fine-grained file change timestamps.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_exchrange.h | 16 +++++
fs/xfs/xfs_ioctl.c | 4 +
fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
5 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 454b63ef7201..c85c8077fac3 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -825,6 +825,30 @@ struct xfs_exchange_range {
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
+/*
+ * Using the same definition of file2 as struct xfs_exchange_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call. The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+ __s32 file1_fd;
+ __u32 pad; /* must be zeroes */
+ __u64 file1_offset; /* file1 offset, bytes */
+ __u64 file2_offset; /* file2 offset, bytes */
+ __u64 length; /* bytes to exchange */
+
+ __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
+
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[6];
+};
+
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
@@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
+#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index c8a655c92c92..d0889190ab7f 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -72,6 +72,34 @@ xfs_exchrange_estimate(
return error;
}
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+ const struct xfs_exchrange *fxr,
+ struct xfs_inode *ip2)
+{
+ struct inode *inode2 = VFS_I(ip2);
+ struct timespec64 ctime = inode_get_ctime(inode2);
+ struct timespec64 mtime = inode_get_mtime(inode2);
+
+ trace_xfs_exchrange_freshness(fxr, ip2);
+
+ /* Check that file2 hasn't otherwise been modified. */
+ if (fxr->file2_ino != ip2->i_ino ||
+ fxr->file2_gen != inode2->i_generation ||
+ !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+ !timespec64_equal(&fxr->file2_mtime, &mtime))
+ return -EBUSY;
+
+ return 0;
+}
+
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
@@ -607,6 +635,12 @@ xfs_exchrange_prep(
if (error || fxr->length == 0)
return error;
+ if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
+ error = xfs_exchrange_check_freshness(fxr, ip2);
+ if (error)
+ return error;
+ }
+
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
@@ -719,7 +753,8 @@ xfs_exchange_range(
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
- if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
return -EINVAL;
/* Userspace requests only honored for regular files. */
@@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
fdput(file1);
return error;
}
+
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ xfs_fsid_t fsid; /* m_fixedfsid */
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u32 file2_gen; /* inode generation */
+ __u32 magic; /* zero */
+};
+#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
+
+/* Set up a commitrange operation by sampling file2's write-related attrs */
+long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_commit_range args = { };
+ struct timespec64 ts;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+ struct inode *inode2 = file_inode(file);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+ const unsigned int lockflags = XFS_IOLOCK_SHARED |
+ XFS_MMAPLOCK_SHARED |
+ XFS_ILOCK_SHARED;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
+
+ xfs_ilock(ip2, lockflags);
+ ts = inode_get_ctime(inode2);
+ kern_f->file2_ctime = ts.tv_sec;
+ kern_f->file2_ctime_nsec = ts.tv_nsec;
+ ts = inode_get_mtime(inode2);
+ kern_f->file2_mtime = ts.tv_sec;
+ kern_f->file2_mtime_nsec = ts.tv_nsec;
+ kern_f->file2_ino = ip2->i_ino;
+ kern_f->file2_gen = inode2->i_generation;
+ kern_f->magic = XCR_FRESH_MAGIC;
+ xfs_iunlock(ip2, lockflags);
+
+ user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
+/*
+ * Exchange file1 and file2 contents if file2 has not been written since the
+ * start commit operation.
+ */
+long
+xfs_ioc_commit_range(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_inode *ip2 = XFS_I(file_inode(file));
+ struct xfs_mount *mp = ip2->i_mount;
+ struct fd file1;
+ int error;
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
+ return -EINVAL;
+ if (kern_f->magic != XCR_FRESH_MAGIC)
+ return -EBUSY;
+ if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
+ return -EBUSY;
+
+ fxr.file1_offset = args.file1_offset;
+ fxr.file2_offset = args.file2_offset;
+ fxr.length = args.length;
+ fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_gen = kern_f->file2_gen;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
+
+ file1 = fdget(args.file1_fd);
+ if (!file1.file)
+ return -EBADF;
+ fxr.file1 = file1.file;
+
+ error = xfs_exchange_range(&fxr);
+ fdput(file1);
+ return error;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 039abcca546e..bc1298aba806 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -10,8 +10,12 @@
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
+/* Freshness check required */
+#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
+
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
- __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+ __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
+ __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
struct xfs_exchrange {
struct file *file1;
@@ -22,10 +26,20 @@ struct xfs_exchrange {
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
+
+ /* file2 metadata for freshness checks */
+ u64 file2_ino;
+ struct timespec64 file2_mtime;
+ struct timespec64 file2_ctime;
+ u32 file2_gen;
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
+long xfs_ioc_start_commit(struct file *file,
+ struct xfs_commit_range __user *argp);
+long xfs_ioc_commit_range(struct file *file,
+ struct xfs_commit_range __user *argp);
struct xfs_exchmaps_req;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6b13666d4e96..90b3ee21e7fe 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1518,6 +1518,10 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
+ case XFS_IOC_COMMIT_RANGE:
+ return xfs_ioc_commit_range(filp, arg);
default:
return -ENOTTY;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 180ce697305a..4cf0fa71ba9c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
- { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
+ { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
+ { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
@@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+TRACE_EVENT(xfs_exchrange_freshness,
+ TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+ TP_ARGS(fxr, ip2),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(xfs_ino_t, ip2_ino)
+ __field(long long, ip2_mtime)
+ __field(long long, ip2_ctime)
+ __field(int, ip2_mtime_nsec)
+ __field(int, ip2_ctime_nsec)
+
+ __field(xfs_ino_t, file2_ino)
+ __field(long long, file2_mtime)
+ __field(long long, file2_ctime)
+ __field(int, file2_mtime_nsec)
+ __field(int, file2_ctime_nsec)
+ ),
+ TP_fast_assign(
+ struct timespec64 ts64;
+ struct inode *inode2 = VFS_I(ip2);
+
+ __entry->dev = inode2->i_sb->s_dev;
+ __entry->ip2_ino = ip2->i_ino;
+
+ ts64 = inode_get_ctime(inode2);
+ __entry->ip2_ctime = ts64.tv_sec;
+ __entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+ ts64 = inode_get_mtime(inode2);
+ __entry->ip2_mtime = ts64.tv_sec;
+ __entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+ __entry->file2_ino = fxr->file2_ino;
+ __entry->file2_mtime = fxr->file2_mtime.tv_sec;
+ __entry->file2_ctime = fxr->file2_ctime.tv_sec;
+ __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+ __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+ ),
+ TP_printk("dev %d:%d "
+ "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+ "file 0x%llx mtime %lld:%d ctime %lld:%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ip2_ino,
+ __entry->ip2_mtime,
+ __entry->ip2_mtime_nsec,
+ __entry->ip2_ctime,
+ __entry->ip2_ctime_nsec,
+ __entry->file2_ino,
+ __entry->file2_mtime,
+ __entry->file2_mtime_nsec,
+ __entry->file2_ctime,
+ __entry->file2_ctime_nsec)
+);
+
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-09-02 18:23 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
@ 2024-09-03 7:52 ` Christian Brauner
2024-10-25 21:23 ` Darrick J. Wong
0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2024-09-03 7:52 UTC (permalink / raw)
To: Darrick J. Wong
Cc: chandanbabu, Jeff Layton, Christoph Hellwig, linux-fsdevel,
linux-xfs
On Mon, Sep 02, 2024 at 11:23:07AM GMT, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> This patch introduces two more new ioctls to manage atomic updates to
> file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> does, but with the additional requirement that file2 cannot have changed
> since some sampling point. The start-commit ioctl performs the sampling
> of file attributes.
>
> Note: This patch currently samples i_ctime during START_COMMIT and
> checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
> safe in kernels prior to 6.12 because ctime only had coarse grained
> granularity and very fast updates could collide with a COMMIT_RANGE.
> With the multi-granularity ctime introduced by Jeff Layton, it's now
> possible to update ctime such that this does not happen.
>
> It is critical, then, that this patch must not be backported to any
> kernel that does not support fine-grained file change timestamps.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Acked-by: Jeff Layton <jlayton@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
> fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_exchrange.h | 16 +++++
> fs/xfs/xfs_ioctl.c | 4 +
> fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
> 5 files changed, 243 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 454b63ef7201..c85c8077fac3 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -825,6 +825,30 @@ struct xfs_exchange_range {
> __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> };
>
> +/*
> + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> + * ctime as the arguments provided to the call. The old contents of file2 will
> + * be moved to file1.
> + *
> + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> + *
> + * Filesystems must be able to restart and complete the operation even after
> + * the system goes down.
> + */
> +struct xfs_commit_range {
> + __s32 file1_fd;
> + __u32 pad; /* must be zeroes */
> + __u64 file1_offset; /* file1 offset, bytes */
> + __u64 file2_offset; /* file2 offset, bytes */
> + __u64 length; /* bytes to exchange */
> +
> + __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> +
> + /* opaque file2 metadata for freshness checks */
> + __u64 file2_freshness[6];
> +};
> +
> /*
> * Exchange file data all the way to the ends of both files, and then exchange
> * the file sizes. This flag can be used to replace a file's contents with a
> @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
> #define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
> #define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
> #define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
> +#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
> +#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
> /* XFS_IOC_GETFSUUID ---------- deprecated 140 */
>
>
> diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> index c8a655c92c92..d0889190ab7f 100644
> --- a/fs/xfs/xfs_exchrange.c
> +++ b/fs/xfs/xfs_exchrange.c
> @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
> return error;
> }
>
> +/*
> + * Check that file2's metadata agree with the snapshot that we took for the
> + * range commit request.
> + *
> + * This should be called after the filesystem has locked /all/ inode metadata
> + * against modification.
> + */
> +STATIC int
> +xfs_exchrange_check_freshness(
> + const struct xfs_exchrange *fxr,
> + struct xfs_inode *ip2)
> +{
> + struct inode *inode2 = VFS_I(ip2);
> + struct timespec64 ctime = inode_get_ctime(inode2);
> + struct timespec64 mtime = inode_get_mtime(inode2);
> +
> + trace_xfs_exchrange_freshness(fxr, ip2);
> +
> + /* Check that file2 hasn't otherwise been modified. */
> + if (fxr->file2_ino != ip2->i_ino ||
> + fxr->file2_gen != inode2->i_generation ||
> + !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> + !timespec64_equal(&fxr->file2_mtime, &mtime))
> + return -EBUSY;
> +
> + return 0;
> +}
> +
> #define QRETRY_IP1 (0x1)
> #define QRETRY_IP2 (0x2)
>
> @@ -607,6 +635,12 @@ xfs_exchrange_prep(
> if (error || fxr->length == 0)
> return error;
>
> + if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> + error = xfs_exchrange_check_freshness(fxr, ip2);
> + if (error)
> + return error;
> + }
> +
> /* Attach dquots to both inodes before changing block maps. */
> error = xfs_qm_dqattach(ip2);
> if (error)
> @@ -719,7 +753,8 @@ xfs_exchange_range(
> if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
> return -EXDEV;
>
> - if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> + if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> + __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
> return -EINVAL;
>
> /* Userspace requests only honored for regular files. */
> @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
> fdput(file1);
> return error;
> }
> +
> +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> +struct xfs_commit_range_fresh {
> + xfs_fsid_t fsid; /* m_fixedfsid */
> + __u64 file2_ino; /* inode number */
> + __s64 file2_mtime; /* modification time */
> + __s64 file2_ctime; /* change time */
> + __s32 file2_mtime_nsec; /* mod time, nsec */
> + __s32 file2_ctime_nsec; /* change time, nsec */
> + __u32 file2_gen; /* inode generation */
> + __u32 magic; /* zero */
> +};
> +#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
> +
> +/* Set up a commitrange operation by sampling file2's write-related attrs */
> +long
> +xfs_ioc_start_commit(
> + struct file *file,
> + struct xfs_commit_range __user *argp)
> +{
> + struct xfs_commit_range args = { };
> + struct timespec64 ts;
> + struct xfs_commit_range_fresh *kern_f;
> + struct xfs_commit_range_fresh __user *user_f;
> + struct inode *inode2 = file_inode(file);
> + struct xfs_inode *ip2 = XFS_I(inode2);
> + const unsigned int lockflags = XFS_IOLOCK_SHARED |
> + XFS_MMAPLOCK_SHARED |
> + XFS_ILOCK_SHARED;
> +
> + BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> + sizeof(args.file2_freshness));
> +
> + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> + memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> +
> + xfs_ilock(ip2, lockflags);
> + ts = inode_get_ctime(inode2);
> + kern_f->file2_ctime = ts.tv_sec;
> + kern_f->file2_ctime_nsec = ts.tv_nsec;
> + ts = inode_get_mtime(inode2);
> + kern_f->file2_mtime = ts.tv_sec;
> + kern_f->file2_mtime_nsec = ts.tv_nsec;
> + kern_f->file2_ino = ip2->i_ino;
> + kern_f->file2_gen = inode2->i_generation;
> + kern_f->magic = XCR_FRESH_MAGIC;
> + xfs_iunlock(ip2, lockflags);
> +
> + user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> + if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/*
> + * Exchange file1 and file2 contents if file2 has not been written since the
> + * start commit operation.
> + */
> +long
> +xfs_ioc_commit_range(
> + struct file *file,
> + struct xfs_commit_range __user *argp)
> +{
> + struct xfs_exchrange fxr = {
> + .file2 = file,
> + };
> + struct xfs_commit_range args;
> + struct xfs_commit_range_fresh *kern_f;
> + struct xfs_inode *ip2 = XFS_I(file_inode(file));
> + struct xfs_mount *mp = ip2->i_mount;
> + struct fd file1;
> + int error;
> +
> + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> +
> + if (copy_from_user(&args, argp, sizeof(args)))
> + return -EFAULT;
> + if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> + return -EINVAL;
> + if (kern_f->magic != XCR_FRESH_MAGIC)
> + return -EBUSY;
> + if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> + return -EBUSY;
So, I mentioned this before in another mail a few months ago and I think
you liked the idea so just as a reminder in case you forgot:
Ioctls are extensible if done correctly:
switch (__IOC_NR(ioctl)) {
case _IOC_NR(XFS_IOC_START_COMMIT): {
size_t usize = _IOC_SIZE(ioctl);
struct xfs_commit_range args;
if (usize < XFS_IOC_START_COMMIT_SIZE_VER0)
return -EINVAL;
if (copy_struct_from_user(&args, sizeof(args), argp, usize))
return -EFAULT;
}
If you code it this way, relying on copy_struct_from_user() right from
the start you can easily extend your struct in a backward and forward
compatible manner.
}
> +
> + fxr.file1_offset = args.file1_offset;
> + fxr.file2_offset = args.file2_offset;
> + fxr.length = args.length;
> + fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> + fxr.file2_ino = kern_f->file2_ino;
> + fxr.file2_gen = kern_f->file2_gen;
> + fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
> + fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
> + fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
> + fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
> +
> + file1 = fdget(args.file1_fd);
> + if (!file1.file)
> + return -EBADF;
Please use CLASS(fd, f)(args.file1_fd) :)
> + fxr.file1 = file1.file;
> +
> + error = xfs_exchange_range(&fxr);
> + fdput(file1);
> + return error;
> +}
> diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> index 039abcca546e..bc1298aba806 100644
> --- a/fs/xfs/xfs_exchrange.h
> +++ b/fs/xfs/xfs_exchrange.h
> @@ -10,8 +10,12 @@
> #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
> #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
>
> +/* Freshness check required */
> +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
> +
> #define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> - __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> + __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> + __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
>
> struct xfs_exchrange {
> struct file *file1;
> @@ -22,10 +26,20 @@ struct xfs_exchrange {
> u64 length;
>
> u64 flags; /* XFS_EXCHANGE_RANGE flags */
> +
> + /* file2 metadata for freshness checks */
> + u64 file2_ino;
> + struct timespec64 file2_mtime;
> + struct timespec64 file2_ctime;
> + u32 file2_gen;
> };
>
> long xfs_ioc_exchange_range(struct file *file,
> struct xfs_exchange_range __user *argp);
> +long xfs_ioc_start_commit(struct file *file,
> + struct xfs_commit_range __user *argp);
> +long xfs_ioc_commit_range(struct file *file,
> + struct xfs_commit_range __user *argp);
>
> struct xfs_exchmaps_req;
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 6b13666d4e96..90b3ee21e7fe 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
>
> case XFS_IOC_EXCHANGE_RANGE:
> return xfs_ioc_exchange_range(filp, arg);
> + case XFS_IOC_START_COMMIT:
> + return xfs_ioc_start_commit(filp, arg);
> + case XFS_IOC_COMMIT_RANGE:
> + return xfs_ioc_commit_range(filp, arg);
>
> default:
> return -ENOTTY;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 180ce697305a..4cf0fa71ba9c 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
> { XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
> { XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
> { __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
> - { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
> + { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
> + { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
>
> /* file exchange-range tracepoint class */
> DECLARE_EVENT_CLASS(xfs_exchrange_class,
> @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
> DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
> DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
>
> +TRACE_EVENT(xfs_exchrange_freshness,
> + TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> + TP_ARGS(fxr, ip2),
> + TP_STRUCT__entry(
> + __field(dev_t, dev)
> + __field(xfs_ino_t, ip2_ino)
> + __field(long long, ip2_mtime)
> + __field(long long, ip2_ctime)
> + __field(int, ip2_mtime_nsec)
> + __field(int, ip2_ctime_nsec)
> +
> + __field(xfs_ino_t, file2_ino)
> + __field(long long, file2_mtime)
> + __field(long long, file2_ctime)
> + __field(int, file2_mtime_nsec)
> + __field(int, file2_ctime_nsec)
> + ),
> + TP_fast_assign(
> + struct timespec64 ts64;
> + struct inode *inode2 = VFS_I(ip2);
> +
> + __entry->dev = inode2->i_sb->s_dev;
> + __entry->ip2_ino = ip2->i_ino;
> +
> + ts64 = inode_get_ctime(inode2);
> + __entry->ip2_ctime = ts64.tv_sec;
> + __entry->ip2_ctime_nsec = ts64.tv_nsec;
> +
> + ts64 = inode_get_mtime(inode2);
> + __entry->ip2_mtime = ts64.tv_sec;
> + __entry->ip2_mtime_nsec = ts64.tv_nsec;
> +
> + __entry->file2_ino = fxr->file2_ino;
> + __entry->file2_mtime = fxr->file2_mtime.tv_sec;
> + __entry->file2_ctime = fxr->file2_ctime.tv_sec;
> + __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> + __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> + ),
> + TP_printk("dev %d:%d "
> + "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> + "file 0x%llx mtime %lld:%d ctime %lld:%d",
> + MAJOR(__entry->dev), MINOR(__entry->dev),
> + __entry->ip2_ino,
> + __entry->ip2_mtime,
> + __entry->ip2_mtime_nsec,
> + __entry->ip2_ctime,
> + __entry->ip2_ctime_nsec,
> + __entry->file2_ino,
> + __entry->file2_mtime,
> + __entry->file2_mtime_nsec,
> + __entry->file2_ctime,
> + __entry->file2_ctime_nsec)
> +);
> +
> TRACE_EVENT(xfs_exchmaps_overhead,
> TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
> unsigned long long rmapbt_blocks),
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/1] xfs: introduce new file range commit ioctls
2024-09-03 7:52 ` Christian Brauner
@ 2024-10-25 21:23 ` Darrick J. Wong
0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-10-25 21:23 UTC (permalink / raw)
To: Christian Brauner
Cc: chandanbabu, Jeff Layton, Christoph Hellwig, linux-fsdevel,
linux-xfs
On Tue, Sep 03, 2024 at 09:52:43AM +0200, Christian Brauner wrote:
> On Mon, Sep 02, 2024 at 11:23:07AM GMT, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > This patch introduces two more new ioctls to manage atomic updates to
> > file contents -- XFS_IOC_START_COMMIT and XFS_IOC_COMMIT_RANGE. The
> > commit mechanism here is exactly the same as what XFS_IOC_EXCHANGE_RANGE
> > does, but with the additional requirement that file2 cannot have changed
> > since some sampling point. The start-commit ioctl performs the sampling
> > of file attributes.
> >
> > Note: This patch currently samples i_ctime during START_COMMIT and
> > checks that it hasn't changed during COMMIT_RANGE. This isn't entirely
> > safe in kernels prior to 6.12 because ctime only had coarse grained
> > granularity and very fast updates could collide with a COMMIT_RANGE.
> > With the multi-granularity ctime introduced by Jeff Layton, it's now
> > possible to update ctime such that this does not happen.
> >
> > It is critical, then, that this patch must not be backported to any
> > kernel that does not support fine-grained file change timestamps.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Acked-by: Jeff Layton <jlayton@kernel.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> > fs/xfs/libxfs/xfs_fs.h | 26 +++++++++
> > fs/xfs/xfs_exchrange.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/xfs_exchrange.h | 16 +++++
> > fs/xfs/xfs_ioctl.c | 4 +
> > fs/xfs/xfs_trace.h | 57 +++++++++++++++++++
> > 5 files changed, 243 insertions(+), 3 deletions(-)
> >
> >
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index 454b63ef7201..c85c8077fac3 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -825,6 +825,30 @@ struct xfs_exchange_range {
> > __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> > };
> >
> > +/*
> > + * Using the same definition of file2 as struct xfs_exchange_range, commit the
> > + * contents of file1 into file2 if file2 has the same inode number, mtime, and
> > + * ctime as the arguments provided to the call. The old contents of file2 will
> > + * be moved to file1.
> > + *
> > + * Returns -EBUSY if there isn't an exact match for the file2 fields.
> > + *
> > + * Filesystems must be able to restart and complete the operation even after
> > + * the system goes down.
> > + */
> > +struct xfs_commit_range {
> > + __s32 file1_fd;
> > + __u32 pad; /* must be zeroes */
> > + __u64 file1_offset; /* file1 offset, bytes */
> > + __u64 file2_offset; /* file2 offset, bytes */
> > + __u64 length; /* bytes to exchange */
> > +
> > + __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
> > +
> > + /* opaque file2 metadata for freshness checks */
> > + __u64 file2_freshness[6];
> > +};
> > +
> > /*
> > * Exchange file data all the way to the ends of both files, and then exchange
> > * the file sizes. This flag can be used to replace a file's contents with a
> > @@ -997,6 +1021,8 @@ struct xfs_getparents_by_handle {
> > #define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
> > #define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
> > #define XFS_IOC_EXCHANGE_RANGE _IOW ('X', 129, struct xfs_exchange_range)
> > +#define XFS_IOC_START_COMMIT _IOR ('X', 130, struct xfs_commit_range)
> > +#define XFS_IOC_COMMIT_RANGE _IOW ('X', 131, struct xfs_commit_range)
> > /* XFS_IOC_GETFSUUID ---------- deprecated 140 */
> >
> >
> > diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
> > index c8a655c92c92..d0889190ab7f 100644
> > --- a/fs/xfs/xfs_exchrange.c
> > +++ b/fs/xfs/xfs_exchrange.c
> > @@ -72,6 +72,34 @@ xfs_exchrange_estimate(
> > return error;
> > }
> >
> > +/*
> > + * Check that file2's metadata agree with the snapshot that we took for the
> > + * range commit request.
> > + *
> > + * This should be called after the filesystem has locked /all/ inode metadata
> > + * against modification.
> > + */
> > +STATIC int
> > +xfs_exchrange_check_freshness(
> > + const struct xfs_exchrange *fxr,
> > + struct xfs_inode *ip2)
> > +{
> > + struct inode *inode2 = VFS_I(ip2);
> > + struct timespec64 ctime = inode_get_ctime(inode2);
> > + struct timespec64 mtime = inode_get_mtime(inode2);
> > +
> > + trace_xfs_exchrange_freshness(fxr, ip2);
> > +
> > + /* Check that file2 hasn't otherwise been modified. */
> > + if (fxr->file2_ino != ip2->i_ino ||
> > + fxr->file2_gen != inode2->i_generation ||
> > + !timespec64_equal(&fxr->file2_ctime, &ctime) ||
> > + !timespec64_equal(&fxr->file2_mtime, &mtime))
> > + return -EBUSY;
> > +
> > + return 0;
> > +}
> > +
> > #define QRETRY_IP1 (0x1)
> > #define QRETRY_IP2 (0x2)
> >
> > @@ -607,6 +635,12 @@ xfs_exchrange_prep(
> > if (error || fxr->length == 0)
> > return error;
> >
> > + if (fxr->flags & __XFS_EXCHANGE_RANGE_CHECK_FRESH2) {
> > + error = xfs_exchrange_check_freshness(fxr, ip2);
> > + if (error)
> > + return error;
> > + }
> > +
> > /* Attach dquots to both inodes before changing block maps. */
> > error = xfs_qm_dqattach(ip2);
> > if (error)
> > @@ -719,7 +753,8 @@ xfs_exchange_range(
> > if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
> > return -EXDEV;
> >
> > - if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> > + if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
> > + __XFS_EXCHANGE_RANGE_CHECK_FRESH2))
> > return -EINVAL;
> >
> > /* Userspace requests only honored for regular files. */
> > @@ -802,3 +837,109 @@ xfs_ioc_exchange_range(
> > fdput(file1);
> > return error;
> > }
> > +
> > +/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
> > +struct xfs_commit_range_fresh {
> > + xfs_fsid_t fsid; /* m_fixedfsid */
> > + __u64 file2_ino; /* inode number */
> > + __s64 file2_mtime; /* modification time */
> > + __s64 file2_ctime; /* change time */
> > + __s32 file2_mtime_nsec; /* mod time, nsec */
> > + __s32 file2_ctime_nsec; /* change time, nsec */
> > + __u32 file2_gen; /* inode generation */
> > + __u32 magic; /* zero */
> > +};
> > +#define XCR_FRESH_MAGIC 0x444F524B /* DORK */
> > +
> > +/* Set up a commitrange operation by sampling file2's write-related attrs */
> > +long
> > +xfs_ioc_start_commit(
> > + struct file *file,
> > + struct xfs_commit_range __user *argp)
> > +{
> > + struct xfs_commit_range args = { };
> > + struct timespec64 ts;
> > + struct xfs_commit_range_fresh *kern_f;
> > + struct xfs_commit_range_fresh __user *user_f;
> > + struct inode *inode2 = file_inode(file);
> > + struct xfs_inode *ip2 = XFS_I(inode2);
> > + const unsigned int lockflags = XFS_IOLOCK_SHARED |
> > + XFS_MMAPLOCK_SHARED |
> > + XFS_ILOCK_SHARED;
> > +
> > + BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
> > + sizeof(args.file2_freshness));
> > +
> > + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> > +
> > + memcpy(&kern_f->fsid, ip2->i_mount->m_fixedfsid, sizeof(xfs_fsid_t));
> > +
> > + xfs_ilock(ip2, lockflags);
> > + ts = inode_get_ctime(inode2);
> > + kern_f->file2_ctime = ts.tv_sec;
> > + kern_f->file2_ctime_nsec = ts.tv_nsec;
> > + ts = inode_get_mtime(inode2);
> > + kern_f->file2_mtime = ts.tv_sec;
> > + kern_f->file2_mtime_nsec = ts.tv_nsec;
> > + kern_f->file2_ino = ip2->i_ino;
> > + kern_f->file2_gen = inode2->i_generation;
> > + kern_f->magic = XCR_FRESH_MAGIC;
> > + xfs_iunlock(ip2, lockflags);
> > +
> > + user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
> > + if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
> > + return -EFAULT;
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Exchange file1 and file2 contents if file2 has not been written since the
> > + * start commit operation.
> > + */
> > +long
> > +xfs_ioc_commit_range(
> > + struct file *file,
> > + struct xfs_commit_range __user *argp)
> > +{
> > + struct xfs_exchrange fxr = {
> > + .file2 = file,
> > + };
> > + struct xfs_commit_range args;
> > + struct xfs_commit_range_fresh *kern_f;
> > + struct xfs_inode *ip2 = XFS_I(file_inode(file));
> > + struct xfs_mount *mp = ip2->i_mount;
> > + struct fd file1;
> > + int error;
> > +
> > + kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
> > +
> > + if (copy_from_user(&args, argp, sizeof(args)))
> > + return -EFAULT;
> > + if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
> > + return -EINVAL;
> > + if (kern_f->magic != XCR_FRESH_MAGIC)
> > + return -EBUSY;
> > + if (memcmp(&kern_f->fsid, mp->m_fixedfsid, sizeof(xfs_fsid_t)))
> > + return -EBUSY;
>
> So, I mentioned this before in another mail a few months ago and I think
> you liked the idea so just as a reminder in case you forgot:
>
> Ioctls are extensible if done correctly:
>
> switch (__IOC_NR(ioctl)) {
> case _IOC_NR(XFS_IOC_START_COMMIT): {
> size_t usize = _IOC_SIZE(ioctl);
> struct xfs_commit_range args;
>
> if (usize < XFS_IOC_START_COMMIT_SIZE_VER0)
> return -EINVAL;
>
> if (copy_struct_from_user(&args, sizeof(args), argp, usize))
> return -EFAULT;
> }
>
> If you code it this way, relying on copy_struct_from_user() right from
> the start you can easily extend your struct in a backward and forward
> compatible manner.
I don't know that we'd really need it for commitrange since there's
plenty of space (~40 bytes) in the "opaque" freshness blob. I suspect
that if we ever add subvolumes to XFS then we might want to take over
the 12 bytes used by mtime* for the subvolume id.
That said, I also think we could convert to this format pretty easily.
Also, copy_struct_from_user can return -E2BIG so I think that needs to
be:
ret = copy_struct_from_user(&args, sizeof(args), argp, usize);
if (ret)
return ret;
Though the overriding reason for not writing the __IOC_NR dispatch code
this way is that every time I've tried extend an xfs ioctl in this
manner, Dave says no because (I think) he doesn't trust how the struct
size is opaquely encoded in the ioctl number /and/ doesn't trust the
BUILD_BUG_ONs I put in the code to guarantee uniqueness so I pick a new
number so I can complete the review instead of starting over with a
different reviewer who doesn't have that particular preference.
> }
>
> > +
> > + fxr.file1_offset = args.file1_offset;
> > + fxr.file2_offset = args.file2_offset;
> > + fxr.length = args.length;
> > + fxr.flags = args.flags | __XFS_EXCHANGE_RANGE_CHECK_FRESH2;
> > + fxr.file2_ino = kern_f->file2_ino;
> > + fxr.file2_gen = kern_f->file2_gen;
> > + fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
> > + fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
> > + fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
> > + fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
> > +
> > + file1 = fdget(args.file1_fd);
> > + if (!file1.file)
> > + return -EBADF;
>
> Please use CLASS(fd, f)(args.file1_fd) :)
Yeah, I saw that the fd cleanups collided with xfs in for-next, thanks
for the heads up.
--D
> > + fxr.file1 = file1.file;
> > +
> > + error = xfs_exchange_range(&fxr);
> > + fdput(file1);
> > + return error;
> > +}
> > diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
> > index 039abcca546e..bc1298aba806 100644
> > --- a/fs/xfs/xfs_exchrange.h
> > +++ b/fs/xfs/xfs_exchrange.h
> > @@ -10,8 +10,12 @@
> > #define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
> > #define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
> >
> > +/* Freshness check required */
> > +#define __XFS_EXCHANGE_RANGE_CHECK_FRESH2 (1ULL << 61)
> > +
> > #define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
> > - __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
> > + __XFS_EXCHANGE_RANGE_UPD_CMTIME2 | \
> > + __XFS_EXCHANGE_RANGE_CHECK_FRESH2)
> >
> > struct xfs_exchrange {
> > struct file *file1;
> > @@ -22,10 +26,20 @@ struct xfs_exchrange {
> > u64 length;
> >
> > u64 flags; /* XFS_EXCHANGE_RANGE flags */
> > +
> > + /* file2 metadata for freshness checks */
> > + u64 file2_ino;
> > + struct timespec64 file2_mtime;
> > + struct timespec64 file2_ctime;
> > + u32 file2_gen;
> > };
> >
> > long xfs_ioc_exchange_range(struct file *file,
> > struct xfs_exchange_range __user *argp);
> > +long xfs_ioc_start_commit(struct file *file,
> > + struct xfs_commit_range __user *argp);
> > +long xfs_ioc_commit_range(struct file *file,
> > + struct xfs_commit_range __user *argp);
> >
> > struct xfs_exchmaps_req;
> >
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index 6b13666d4e96..90b3ee21e7fe 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1518,6 +1518,10 @@ xfs_file_ioctl(
> >
> > case XFS_IOC_EXCHANGE_RANGE:
> > return xfs_ioc_exchange_range(filp, arg);
> > + case XFS_IOC_START_COMMIT:
> > + return xfs_ioc_start_commit(filp, arg);
> > + case XFS_IOC_COMMIT_RANGE:
> > + return xfs_ioc_commit_range(filp, arg);
> >
> > default:
> > return -ENOTTY;
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 180ce697305a..4cf0fa71ba9c 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -4926,7 +4926,8 @@ DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
> > { XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
> > { XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
> > { __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
> > - { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
> > + { __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }, \
> > + { __XFS_EXCHANGE_RANGE_CHECK_FRESH2, "FRESH2" }
> >
> > /* file exchange-range tracepoint class */
> > DECLARE_EVENT_CLASS(xfs_exchrange_class,
> > @@ -4986,6 +4987,60 @@ DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
> > DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
> > DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
> >
> > +TRACE_EVENT(xfs_exchrange_freshness,
> > + TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
> > + TP_ARGS(fxr, ip2),
> > + TP_STRUCT__entry(
> > + __field(dev_t, dev)
> > + __field(xfs_ino_t, ip2_ino)
> > + __field(long long, ip2_mtime)
> > + __field(long long, ip2_ctime)
> > + __field(int, ip2_mtime_nsec)
> > + __field(int, ip2_ctime_nsec)
> > +
> > + __field(xfs_ino_t, file2_ino)
> > + __field(long long, file2_mtime)
> > + __field(long long, file2_ctime)
> > + __field(int, file2_mtime_nsec)
> > + __field(int, file2_ctime_nsec)
> > + ),
> > + TP_fast_assign(
> > + struct timespec64 ts64;
> > + struct inode *inode2 = VFS_I(ip2);
> > +
> > + __entry->dev = inode2->i_sb->s_dev;
> > + __entry->ip2_ino = ip2->i_ino;
> > +
> > + ts64 = inode_get_ctime(inode2);
> > + __entry->ip2_ctime = ts64.tv_sec;
> > + __entry->ip2_ctime_nsec = ts64.tv_nsec;
> > +
> > + ts64 = inode_get_mtime(inode2);
> > + __entry->ip2_mtime = ts64.tv_sec;
> > + __entry->ip2_mtime_nsec = ts64.tv_nsec;
> > +
> > + __entry->file2_ino = fxr->file2_ino;
> > + __entry->file2_mtime = fxr->file2_mtime.tv_sec;
> > + __entry->file2_ctime = fxr->file2_ctime.tv_sec;
> > + __entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
> > + __entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
> > + ),
> > + TP_printk("dev %d:%d "
> > + "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
> > + "file 0x%llx mtime %lld:%d ctime %lld:%d",
> > + MAJOR(__entry->dev), MINOR(__entry->dev),
> > + __entry->ip2_ino,
> > + __entry->ip2_mtime,
> > + __entry->ip2_mtime_nsec,
> > + __entry->ip2_ctime,
> > + __entry->ip2_ctime_nsec,
> > + __entry->file2_ino,
> > + __entry->file2_mtime,
> > + __entry->file2_mtime_nsec,
> > + __entry->file2_ctime,
> > + __entry->file2_ctime_nsec)
> > +);
> > +
> > TRACE_EVENT(xfs_exchmaps_overhead,
> > TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
> > unsigned long long rmapbt_blocks),
> >
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-10-25 21:23 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20240822235230.GJ6043@frogsfrogsfrogs>
2024-08-22 23:56 ` [PATCHSET v31.0 02/10] xfs: atomic file content commits Darrick J. Wong
2024-08-23 0:01 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-08-23 4:12 ` Christoph Hellwig
2024-08-23 13:20 ` Jeff Layton
2024-08-23 17:41 ` Darrick J. Wong
2024-08-23 19:15 ` Jeff Layton
2024-08-24 3:29 ` Christoph Hellwig
2024-08-24 4:46 ` Darrick J. Wong
2024-08-24 4:48 ` Christoph Hellwig
2024-08-24 6:29 ` [PATCH v31.0.1 " Darrick J. Wong
2024-08-24 12:11 ` Jeff Layton
2024-08-25 4:52 ` Christoph Hellwig
2024-08-22 23:58 ` [PATCHSET v4.0 08/10] xfs: preparation for realtime allocation groups Darrick J. Wong
2024-08-23 0:21 ` [PATCH 1/1] iomap: add a merge boundary flag Darrick J. Wong
2024-09-02 18:21 [PATCHSET v31.1 1/8] xfs: atomic file content commits Darrick J. Wong
2024-09-02 18:23 ` [PATCH 1/1] xfs: introduce new file range commit ioctls Darrick J. Wong
2024-09-03 7:52 ` Christian Brauner
2024-10-25 21:23 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).