* [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
@ 2024-02-27 2:18 Darrick J. Wong
2024-02-27 2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-27 2:18 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, linux-xfs, hch
Hi all,
This series creates a new FIEXCHANGE_RANGE system call to exchange
ranges of bytes between two files atomically. This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes. A successful call
completion guarantees that the new contents will be seen even if the
system fails.
The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.
User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file. If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas. Callers can arrange for the update to be rejected if the
original file has been changed.
The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files. For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory. People get it wrong all the time, and $fs hacks abound.
Here are the proposed manual pages:
IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
NAME
ioctl_xfs_exchange_range - exchange the contents of parts of
two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs_staging.h>
int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
xfs_exch_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges.
Exchanges are atomic with regards to concurrent file opera‐
tions, so no userspace-level locks need to be taken to obtain
consistent results. Implementations must guarantee that read‐
ers see either the old contents or the new contents in their
entirety, even if the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_exch_range {
__s64 file1_fd;
__s64 file1_offset;
__s64 file2_offset;
__s64 length;
__u64 flags;
__u64 pad;
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges must be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
XFS_EXCHRANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)).
XFS_EXCHRANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHRANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block. For files on the realtime device, this is
the realtime extent size. This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHRANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem ex‐
change the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call. In all
cases, application software must coordinate updates to the file
because the exchange is performed unconditionally.
The first is a data storage program that wants to commit non-
contiguous updates to a file atomically and coordinates write
access to that file. This can be done by creating a temporary
file, calling FICLONE(2) to share the contents, and staging the
updates into the temporary file. The FULL_FILES flag is recom‐
mended for this purpose. The temporary file can be deleted or
punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
struct xfs_exch_range args = {
.file1_fd = temp_fd,
.flags = XFS_EXCHRANGE_TO_EOF,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
The second is a software-defined storage host (e.g. a disk
jukebox) which implements an atomic scatter-gather write com‐
mand. Provided the exported disk's logical block size matches
the file's allocation unit size, this can be done by creating a
temporary file and writing the data at the appropriate offsets.
It is recommended that the temporary file be truncated to the
size of the regular file before any writes are staged to the
temporary file to avoid issues with zeroing during EOF exten‐
sion. Use this call with the FILE1_WRITTEN flag to exchange
only the file allocation units involved in the emulated de‐
vice's write command. The temporary file should be truncated
or punched out completely before being reused to stage another
write.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
int blksz;
fstat(fd, &sb);
blksz = sb.st_blksize;
/* land scatter gather writes between 100fsb and 500fsb */
pwrite(temp_fd, data1, blksz * 2, blksz * 100);
pwrite(temp_fd, data2, blksz * 20, blksz * 480);
pwrite(temp_fd, data3, blksz * 7, blksz * 257);
/* commit the entire update */
struct xfs_exch_range args = {
.file1_fd = temp_fd,
.file1_offset = blksz * 100,
.file2_offset = blksz * 100,
.length = blksz * 400,
.flags = XFS_EXCHRANGE_FILE1_WRITTEN |
XFS_EXCHRANGE_FILE1_DSYNC,
};
ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
NAME
ioctl_xfs_commit_range - conditionally exchange the contents of
parts of two files
SYNOPSIS
#include <sys/ioctl.h>
#include <xfs/xfs_fs_staging.h>
int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
mit_range *arg);
DESCRIPTION
Given a range of bytes in a first file file1_fd and a second
range of bytes in a second file file2_fd, this ioctl(2) ex‐
changes the contents of the two ranges if file2_fd passes cer‐
tain freshness criteria.
After locking both files but before exchanging the contents,
the supplied file2_ino field must match file2_fd's inode num‐
ber, and the supplied file2_mtime, file2_mtime_nsec,
file2_ctime, and file2_ctime_nsec fields must match the modifi‐
cation time and change time of file2. If they do not match,
EBUSY will be returned.
Exchanges are atomic with regards to concurrent file opera‐
tions, so no userspace-level locks need to be taken to obtain
consistent results. Implementations must guarantee that read‐
ers see either the old contents or the new contents in their
entirety, even if the system fails.
The system call parameters are conveyed in structures of the
following form:
struct xfs_commit_range {
__s64 file1_fd;
__s64 file1_offset;
__s64 file2_offset;
__s64 length;
__u64 flags;
__s64 file2_ino;
__s64 file2_mtime;
__s64 file2_ctime;
__s32 file2_mtime_nsec;
__s32 file2_ctime_nsec;
__u64 pad;
};
The field pad must be zero.
The fields file1_fd, file1_offset, and length define the first
range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second
range of bytes to be exchanged.
The fields file2_ino, file2_mtime, file2_mtime_nsec,
file2_ctime and file2_ctime_nsec must be gathered from
file2_fd's stat information prior to beginning the file update.
These file attributes are used to confirm that file2_fd has not
changed by another thread since the current thread began stag‐
ing its own update.
Both files must be from the same filesystem mount. If the two
file descriptors represent the same file, the byte ranges must
not overlap. Most disk-based filesystems require that the
starts of both ranges must be aligned to the file block size.
If this is the case, the ends of the ranges must also be so
aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
The field flags control the behavior of the exchange operation.
XFS_EXCHRANGE_TO_EOF
Ignore the length parameter. All bytes in file1_fd
from file1_offset to EOF are moved to file2_fd, and
file2's size is set to (file2_offset+(file1_length-
file1_offset)). Meanwhile, all bytes in file2 from
file2_offset to EOF are moved to file1 and file1's
size is set to (file1_offset+(file2_length-
file2_offset)).
XFS_EXCHRANGE_DSYNC
Ensure that all modified in-core data in both file
ranges and all metadata updates pertaining to the
exchange operation are flushed to persistent storage
before the call returns. Opening either file de‐
scriptor with O_SYNC or O_DSYNC will have the same
effect.
XFS_EXCHRANGE_FILE1_WRITTEN
Only exchange sub-ranges of file1_fd that are known
to contain data written by application software.
Each sub-range may be expanded (both upwards and
downwards) to align with the file allocation unit.
For files on the data device, this is one filesystem
block. For files on the realtime device, this is
the realtime extent size. This facility can be used
to implement fast atomic scatter-gather writes of
any complexity for software-defined storage targets
if all writes are aligned to the file allocation
unit.
XFS_EXCHRANGE_DRY_RUN
Check the parameters and the feasibility of the op‐
eration, but do not change anything.
RETURN VALUE
On error, -1 is returned, and errno is set to indicate the er‐
ror.
ERRORS
Error codes can be one of, but are not limited to, the follow‐
ing:
EBADF file1_fd is not open for reading and writing or is open
for append-only writes; or file2_fd is not open for
reading and writing or is open for append-only writes.
EBUSY The file2 inode number and timestamps supplied do not
match file2_fd.
EINVAL The parameters are not correct for these files. This
error can also appear if either file descriptor repre‐
sents a device, FIFO, or socket. Disk filesystems gen‐
erally require the offset and length arguments to be
aligned to the fundamental block sizes of both files.
EIO An I/O error occurred.
EISDIR One of the files is a directory.
ENOMEM The kernel was unable to allocate sufficient memory to
perform the operation.
ENOSPC There is not enough free space in the filesystem ex‐
change the contents safely.
EOPNOTSUPP
The filesystem does not support exchanging bytes between
the two files.
EPERM file1_fd or file2_fd are immutable.
ETXTBSY
One of the files is a swap file.
EUCLEAN
The filesystem is corrupt.
EXDEV file1_fd and file2_fd are not on the same mounted
filesystem.
CONFORMING TO
This API is XFS-specific.
USE CASES
Several use cases are imagined for this system call. Coordina‐
tion between multiple threads is performed by the kernel.
The first is a filesystem defragmenter, which copies the con‐
tents of a file into another file and wishes to exchange the
space mappings of the two files, provided that the original
file has not changed.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
struct xfs_commit_range args = {
.flags = XFS_EXCHRANGE_TO_EOF,
};
/* gather file2's freshness information */
fstat(fd, &sb);
args.file2_ino = sb.st_ino;
args.file2_mtime = sb.st_mtim.tv_sec;
args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
args.file2_ctime = sb.st_ctim.tv_sec;
args.file2_ctime_nsec = sb.st_ctim.tv_nsec;
/* make a fresh copy of the file with terrible alignment to avoid reflink */
clone_file_range(fd, NULL, temp_fd, NULL, 1, 0);
clone_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
printf("file changed while defrag was underway
");
The second is a data storage program that wants to commit non-
contiguous updates to a file atomically. This program cannot
coordinate updates to the file and therefore relies on the ker‐
nel to reject the COMMIT_RANGE command if the file has been up‐
dated by someone else. This can be done by creating a tempo‐
rary file, calling FICLONE(2) to share the contents, and stag‐
ing the updates into the temporary file. The FULL_FILES flag
is recommended for this purpose. The temporary file can be
deleted or punched out afterwards.
An example program might look like this:
int fd = open("/some/file", O_RDWR);
int temp_fd = open("/some", O_TMPFILE | O_RDWR);
struct stat sb;
struct xfs_commit_range args = {
.flags = XFS_EXCHRANGE_TO_EOF,
};
/* gather file2's freshness information */
fstat(fd, &sb);
args.file2_ino = sb.st_ino;
args.file2_mtime = sb.st_mtim.tv_sec;
args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
args.file2_ctime = sb.st_ctim.tv_sec;
args.file2_ctime_nsec = sb.st_ctim.tv_nsec;
ioctl(temp_fd, FICLONE, fd);
/* append 1MB of records */
lseek(temp_fd, 0, SEEK_END);
write(temp_fd, data1, 1000000);
/* update record index */
pwrite(temp_fd, data1, 600, 98765);
pwrite(temp_fd, data2, 320, 54321);
pwrite(temp_fd, data2, 15, 0);
/* commit the entire update */
args.file1_fd = temp_fd;
ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
if (ret && errno == EBUSY)
printf("file changed before commit; will roll back
");
NOTES
Some filesystems may limit the amount of data or the number of
extents that can be exchanged in a single call.
SEE ALSO
ioctl(2)
XFS 2024-02-18 IOCTL-XFS-COMMIT-RANGE(2)
The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down. Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.
Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years. It is also not
the RWF_ATOMIC patchset that has been shared. This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.
As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata. The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.
Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges. This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode. If this
completes successfully, the new contents can be committed atomically
into the inode being repaired. This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.
or userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates
xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
Commits in this patchset:
* vfs: export remap and write check helpers
* xfs: introduce new file range exchange ioctls
* xfs: create a log incompat flag for atomic file mapping exchanges
* xfs: introduce a file mapping exchange log intent item
* xfs: create deferred log items for file mapping exchanges
* xfs: bind together the front and back ends of the file range exchange code
* xfs: add error injection to test file mapping exchange recovery
* xfs: condense extended attributes after a mapping exchange operation
* xfs: condense directories after a mapping exchange operation
* xfs: condense symbolic links after a mapping exchange operation
* xfs: make file range exchange support realtime files
* xfs: support non-power-of-two rtextsize with exchange-range
* docs: update swapext -> exchmaps language
* xfs: enable logged file mapping exchange feature
---
.../filesystems/xfs/xfs-online-fsck-design.rst | 259 ++--
fs/read_write.c | 1
fs/remap_range.c | 4
fs/xfs/Makefile | 3
fs/xfs/libxfs/xfs_bmap.h | 2
fs/xfs/libxfs/xfs_defer.c | 6
fs/xfs/libxfs/xfs_defer.h | 2
fs/xfs/libxfs/xfs_errortag.h | 4
fs/xfs/libxfs/xfs_exchmaps.c | 1222 ++++++++++++++++++++
fs/xfs/libxfs/xfs_exchmaps.h | 123 ++
fs/xfs/libxfs/xfs_format.h | 16
fs/xfs/libxfs/xfs_fs.h | 75 +
fs/xfs/libxfs/xfs_log_format.h | 64 +
fs/xfs/libxfs/xfs_log_recover.h | 2
fs/xfs/libxfs/xfs_sb.c | 3
fs/xfs/libxfs/xfs_symlink_remote.c | 47 +
fs/xfs/libxfs/xfs_symlink_remote.h | 1
fs/xfs/libxfs/xfs_trans_space.h | 4
fs/xfs/xfs_error.c | 3
fs/xfs/xfs_exchmaps_item.c | 603 ++++++++++
fs/xfs/xfs_exchmaps_item.h | 64 +
fs/xfs/xfs_exchrange.c | 865 ++++++++++++++
fs/xfs/xfs_exchrange.h | 51 +
fs/xfs/xfs_ioctl.c | 89 +
fs/xfs/xfs_log_recover.c | 2
fs/xfs/xfs_mount.h | 5
fs/xfs/xfs_super.c | 19
fs/xfs/xfs_symlink.c | 49 -
fs/xfs/xfs_trace.c | 2
fs/xfs/xfs_trace.h | 382 ++++++
include/linux/fs.h | 1
31 files changed, 3797 insertions(+), 176 deletions(-)
create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c
create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h
create mode 100644 fs/xfs/xfs_exchmaps_item.c
create mode 100644 fs/xfs/xfs_exchmaps_item.h
create mode 100644 fs/xfs/xfs_exchrange.c
create mode 100644 fs/xfs/xfs_exchrange.h
^ permalink raw reply [flat|nested] 25+ messages in thread* [PATCH 01/14] vfs: export remap and write check helpers
2024-02-27 2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
@ 2024-02-27 2:21 ` Darrick J. Wong
2024-02-28 15:40 ` Christoph Hellwig
2024-02-27 9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
` (2 subsequent siblings)
3 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-27 2:21 UTC (permalink / raw)
To: djwong; +Cc: linux-fsdevel, linux-xfs, hch
From: Darrick J. Wong <djwong@kernel.org>
Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/read_write.c | 1 +
fs/remap_range.c | 4 ++--
include/linux/fs.h | 1 +
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index d4c036e82b6c3..85c096f2c0d06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
return 0;
}
+EXPORT_SYMBOL_GPL(generic_write_check_limits);
/* Like generic_write_checks(), but takes size of write instead of iter. */
int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index de07f978ce3eb..28246dfc84851 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
return 0;
}
-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
- bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
{
int mask = write ? MAY_WRITE : MAY_READ;
loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
return fsnotify_file_area_perm(file, mask, &pos, len);
}
+EXPORT_SYMBOL_GPL(remap_verify_area);
/*
* Ensure that we don't remap a partial EOF block in the middle of something
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112c..f0ada316dc97b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2096,6 +2096,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,
^ permalink raw reply related [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
2024-02-27 2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
@ 2024-02-27 9:23 ` Amir Goldstein
2024-02-27 10:53 ` Jeff Layton
2024-02-27 15:45 ` Darrick J. Wong
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
2024-02-28 1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
3 siblings, 2 replies; 25+ messages in thread
From: Amir Goldstein @ 2024-02-27 9:23 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton
On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically. This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes. A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file. If the updates are aligned with the file block
> size, a new (since v2) flag provides for exchanging only the written
> areas. Callers can arrange for the update to be rejected if the
> original file has been changed.
>
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files. For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory, fsync the new
> file, rename the new file on top of the old filename, and then fsync the
> directory. People get it wrong all the time, and $fs hacks abound.
> Here are the proposed manual pages:
>
> IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
>
> NAME
> ioctl_xfs_exchange_range - exchange the contents of parts of
> two files
>
> SYNOPSIS
> #include <sys/ioctl.h>
> #include <xfs/xfs_fs_staging.h>
>
> int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> xfs_exch_range *arg);
>
> DESCRIPTION
> Given a range of bytes in a first file file1_fd and a second
> range of bytes in a second file file2_fd, this ioctl(2) ex‐
> changes the contents of the two ranges.
>
> Exchanges are atomic with regards to concurrent file opera‐
> tions, so no userspace-level locks need to be taken to obtain
> consistent results. Implementations must guarantee that read‐
> ers see either the old contents or the new contents in their
> entirety, even if the system fails.
>
> The system call parameters are conveyed in structures of the
> following form:
>
> struct xfs_exch_range {
> __s64 file1_fd;
> __s64 file1_offset;
> __s64 file2_offset;
> __s64 length;
> __u64 flags;
>
> __u64 pad;
> };
>
> The field pad must be zero.
>
> The fields file1_fd, file1_offset, and length define the first
> range of bytes to be exchanged.
>
> The fields file2_fd, file2_offset, and length define the second
> range of bytes to be exchanged.
>
> Both files must be from the same filesystem mount. If the two
> file descriptors represent the same file, the byte ranges must
> not overlap. Most disk-based filesystems require that the
> starts of both ranges must be aligned to the file block size.
> If this is the case, the ends of the ranges must also be so
> aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
>
> The field flags control the behavior of the exchange operation.
>
> XFS_EXCHRANGE_TO_EOF
> Ignore the length parameter. All bytes in file1_fd
> from file1_offset to EOF are moved to file2_fd, and
> file2's size is set to (file2_offset+(file1_length-
> file1_offset)). Meanwhile, all bytes in file2 from
> file2_offset to EOF are moved to file1 and file1's
> size is set to (file1_offset+(file2_length-
> file2_offset)).
>
> XFS_EXCHRANGE_DSYNC
> Ensure that all modified in-core data in both file
> ranges and all metadata updates pertaining to the
> exchange operation are flushed to persistent storage
> before the call returns. Opening either file de‐
> scriptor with O_SYNC or O_DSYNC will have the same
> effect.
>
> XFS_EXCHRANGE_FILE1_WRITTEN
> Only exchange sub-ranges of file1_fd that are known
> to contain data written by application software.
> Each sub-range may be expanded (both upwards and
> downwards) to align with the file allocation unit.
> For files on the data device, this is one filesystem
> block. For files on the realtime device, this is
> the realtime extent size. This facility can be used
> to implement fast atomic scatter-gather writes of
> any complexity for software-defined storage targets
> if all writes are aligned to the file allocation
> unit.
>
> XFS_EXCHRANGE_DRY_RUN
> Check the parameters and the feasibility of the op‐
> eration, but do not change anything.
>
> RETURN VALUE
> On error, -1 is returned, and errno is set to indicate the er‐
> ror.
>
> ERRORS
> Error codes can be one of, but are not limited to, the follow‐
> ing:
>
> EBADF file1_fd is not open for reading and writing or is open
> for append-only writes; or file2_fd is not open for
> reading and writing or is open for append-only writes.
>
> EINVAL The parameters are not correct for these files. This
> error can also appear if either file descriptor repre‐
> sents a device, FIFO, or socket. Disk filesystems gen‐
> erally require the offset and length arguments to be
> aligned to the fundamental block sizes of both files.
>
> EIO An I/O error occurred.
>
> EISDIR One of the files is a directory.
>
> ENOMEM The kernel was unable to allocate sufficient memory to
> perform the operation.
>
> ENOSPC There is not enough free space in the filesystem ex‐
> change the contents safely.
>
> EOPNOTSUPP
> The filesystem does not support exchanging bytes between
> the two files.
>
> EPERM file1_fd or file2_fd are immutable.
>
> ETXTBSY
> One of the files is a swap file.
>
> EUCLEAN
> The filesystem is corrupt.
>
> EXDEV file1_fd and file2_fd are not on the same mounted
> filesystem.
>
> CONFORMING TO
> This API is XFS-specific.
>
> USE CASES
> Several use cases are imagined for this system call. In all
> cases, application software must coordinate updates to the file
> because the exchange is performed unconditionally.
>
> The first is a data storage program that wants to commit non-
> contiguous updates to a file atomically and coordinates write
> access to that file. This can be done by creating a temporary
> file, calling FICLONE(2) to share the contents, and staging the
> updates into the temporary file. The FULL_FILES flag is recom‐
> mended for this purpose. The temporary file can be deleted or
> punched out afterwards.
>
> An example program might look like this:
>
> int fd = open("/some/file", O_RDWR);
> int temp_fd = open("/some", O_TMPFILE | O_RDWR);
>
> ioctl(temp_fd, FICLONE, fd);
>
> /* append 1MB of records */
> lseek(temp_fd, 0, SEEK_END);
> write(temp_fd, data1, 1000000);
>
> /* update record index */
> pwrite(temp_fd, data1, 600, 98765);
> pwrite(temp_fd, data2, 320, 54321);
> pwrite(temp_fd, data2, 15, 0);
>
> /* commit the entire update */
> struct xfs_exch_range args = {
> .file1_fd = temp_fd,
> .flags = XFS_EXCHRANGE_TO_EOF,
> };
>
> ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
> The second is a software-defined storage host (e.g. a disk
> jukebox) which implements an atomic scatter-gather write com‐
> mand. Provided the exported disk's logical block size matches
> the file's allocation unit size, this can be done by creating a
> temporary file and writing the data at the appropriate offsets.
> It is recommended that the temporary file be truncated to the
> size of the regular file before any writes are staged to the
> temporary file to avoid issues with zeroing during EOF exten‐
> sion. Use this call with the FILE1_WRITTEN flag to exchange
> only the file allocation units involved in the emulated de‐
> vice's write command. The temporary file should be truncated
> or punched out completely before being reused to stage another
> write.
>
> An example program might look like this:
>
> int fd = open("/some/file", O_RDWR);
> int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> struct stat sb;
> int blksz;
>
> fstat(fd, &sb);
> blksz = sb.st_blksize;
>
> /* land scatter gather writes between 100fsb and 500fsb */
> pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> pwrite(temp_fd, data3, blksz * 7, blksz * 257);
>
> /* commit the entire update */
> struct xfs_exch_range args = {
> .file1_fd = temp_fd,
> .file1_offset = blksz * 100,
> .file2_offset = blksz * 100,
> .length = blksz * 400,
> .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> XFS_EXCHRANGE_FILE1_DSYNC,
> };
>
> ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
> NOTES
> Some filesystems may limit the amount of data or the number of
> extents that can be exchanged in a single call.
>
> SEE ALSO
> ioctl(2)
>
> XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
>
> NAME
> ioctl_xfs_commit_range - conditionally exchange the contents of
> parts of two files
>
> SYNOPSIS
> #include <sys/ioctl.h>
> #include <xfs/xfs_fs_staging.h>
>
> int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
> mit_range *arg);
>
> DESCRIPTION
> Given a range of bytes in a first file file1_fd and a second
> range of bytes in a second file file2_fd, this ioctl(2) ex‐
> changes the contents of the two ranges if file2_fd passes cer‐
> tain freshness criteria.
>
> After locking both files but before exchanging the contents,
> the supplied file2_ino field must match file2_fd's inode num‐
> ber, and the supplied file2_mtime, file2_mtime_nsec,
> file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> cation time and change time of file2. If they do not match,
> EBUSY will be returned.
>
Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?
And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
Even if this API is designed to be hoisted out of XFS at some future time,
Is there a real need to support it on filesystems that do not support
i_version(?)
Not to mention the fact that POSIX does not explicitly define how ctime should
behave with changes to fiemap (uninitialized extent and all), so who knows
how other filesystems may update ctime in those cases.
I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
really explicitly requests a bump of i_version on the next change.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 10:53 ` Jeff Layton
2024-02-27 16:06 ` Darrick J. Wong
2024-02-27 23:46 ` Dave Chinner
2024-02-27 15:45 ` Darrick J. Wong
1 sibling, 2 replies; 25+ messages in thread
From: Jeff Layton @ 2024-02-27 10:53 UTC (permalink / raw)
To: Amir Goldstein, Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch
On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically. This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes. A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file. If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas. Callers can arrange for the update to be rejected if the
> > original file has been changed.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files. For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory. People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> >
This is a cool idea! I've had some handwavy ideas about making a gated
write() syscall (i.e. only write if the change cookie hasn't changed),
but something like this may be a simpler lift initially.
> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> >
> > NAME
> > ioctl_xfs_exchange_range - exchange the contents of parts of
> > two files
> >
> > SYNOPSIS
> > #include <sys/ioctl.h>
> > #include <xfs/xfs_fs_staging.h>
> >
> > int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> > xfs_exch_range *arg);
> >
> > DESCRIPTION
> > Given a range of bytes in a first file file1_fd and a second
> > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > changes the contents of the two ranges.
> >
> > Exchanges are atomic with regards to concurrent file opera‐
> > tions, so no userspace-level locks need to be taken to obtain
> > consistent results. Implementations must guarantee that read‐
> > ers see either the old contents or the new contents in their
> > entirety, even if the system fails.
> >
> > The system call parameters are conveyed in structures of the
> > following form:
> >
> > struct xfs_exch_range {
> > __s64 file1_fd;
> > __s64 file1_offset;
> > __s64 file2_offset;
> > __s64 length;
> > __u64 flags;
> >
> > __u64 pad;
> > };
> >
> > The field pad must be zero.
> >
> > The fields file1_fd, file1_offset, and length define the first
> > range of bytes to be exchanged.
> >
> > The fields file2_fd, file2_offset, and length define the second
> > range of bytes to be exchanged.
> >
> > Both files must be from the same filesystem mount. If the two
> > file descriptors represent the same file, the byte ranges must
> > not overlap. Most disk-based filesystems require that the
> > starts of both ranges must be aligned to the file block size.
> > If this is the case, the ends of the ranges must also be so
> > aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> >
> > The field flags control the behavior of the exchange operation.
> >
> > XFS_EXCHRANGE_TO_EOF
> > Ignore the length parameter. All bytes in file1_fd
> > from file1_offset to EOF are moved to file2_fd, and
> > file2's size is set to (file2_offset+(file1_length-
> > file1_offset)). Meanwhile, all bytes in file2 from
> > file2_offset to EOF are moved to file1 and file1's
> > size is set to (file1_offset+(file2_length-
> > file2_offset)).
> >
> > XFS_EXCHRANGE_DSYNC
> > Ensure that all modified in-core data in both file
> > ranges and all metadata updates pertaining to the
> > exchange operation are flushed to persistent storage
> > before the call returns. Opening either file de‐
> > scriptor with O_SYNC or O_DSYNC will have the same
> > effect.
> >
> > XFS_EXCHRANGE_FILE1_WRITTEN
> > Only exchange sub-ranges of file1_fd that are known
> > to contain data written by application software.
> > Each sub-range may be expanded (both upwards and
> > downwards) to align with the file allocation unit.
> > For files on the data device, this is one filesystem
> > block. For files on the realtime device, this is
> > the realtime extent size. This facility can be used
> > to implement fast atomic scatter-gather writes of
> > any complexity for software-defined storage targets
> > if all writes are aligned to the file allocation
> > unit.
> >
> > XFS_EXCHRANGE_DRY_RUN
> > Check the parameters and the feasibility of the op‐
> > eration, but do not change anything.
> >
> > RETURN VALUE
> > On error, -1 is returned, and errno is set to indicate the er‐
> > ror.
> >
> > ERRORS
> > Error codes can be one of, but are not limited to, the follow‐
> > ing:
> >
> > EBADF file1_fd is not open for reading and writing or is open
> > for append-only writes; or file2_fd is not open for
> > reading and writing or is open for append-only writes.
> >
> > EINVAL The parameters are not correct for these files. This
> > error can also appear if either file descriptor repre‐
> > sents a device, FIFO, or socket. Disk filesystems gen‐
> > erally require the offset and length arguments to be
> > aligned to the fundamental block sizes of both files.
> >
> > EIO An I/O error occurred.
> >
> > EISDIR One of the files is a directory.
> >
> > ENOMEM The kernel was unable to allocate sufficient memory to
> > perform the operation.
> >
> > ENOSPC There is not enough free space in the filesystem ex‐
> > change the contents safely.
> >
> > EOPNOTSUPP
> > The filesystem does not support exchanging bytes between
> > the two files.
> >
> > EPERM file1_fd or file2_fd are immutable.
> >
> > ETXTBSY
> > One of the files is a swap file.
> >
> > EUCLEAN
> > The filesystem is corrupt.
> >
> > EXDEV file1_fd and file2_fd are not on the same mounted
> > filesystem.
> >
> > CONFORMING TO
> > This API is XFS-specific.
> >
> > USE CASES
> > Several use cases are imagined for this system call. In all
> > cases, application software must coordinate updates to the file
> > because the exchange is performed unconditionally.
> >
> > The first is a data storage program that wants to commit non-
> > contiguous updates to a file atomically and coordinates write
> > access to that file. This can be done by creating a temporary
> > file, calling FICLONE(2) to share the contents, and staging the
> > updates into the temporary file. The FULL_FILES flag is recom‐
> > mended for this purpose. The temporary file can be deleted or
> > punched out afterwards.
> >
> > An example program might look like this:
> >
> > int fd = open("/some/file", O_RDWR);
> > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >
> > ioctl(temp_fd, FICLONE, fd);
> >
> > /* append 1MB of records */
> > lseek(temp_fd, 0, SEEK_END);
> > write(temp_fd, data1, 1000000);
> >
> > /* update record index */
> > pwrite(temp_fd, data1, 600, 98765);
> > pwrite(temp_fd, data2, 320, 54321);
> > pwrite(temp_fd, data2, 15, 0);
> >
> > /* commit the entire update */
> > struct xfs_exch_range args = {
> > .file1_fd = temp_fd,
> > .flags = XFS_EXCHRANGE_TO_EOF,
> > };
> >
> > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > The second is a software-defined storage host (e.g. a disk
> > jukebox) which implements an atomic scatter-gather write com‐
> > mand. Provided the exported disk's logical block size matches
> > the file's allocation unit size, this can be done by creating a
> > temporary file and writing the data at the appropriate offsets.
> > It is recommended that the temporary file be truncated to the
> > size of the regular file before any writes are staged to the
> > temporary file to avoid issues with zeroing during EOF exten‐
> > sion. Use this call with the FILE1_WRITTEN flag to exchange
> > only the file allocation units involved in the emulated de‐
> > vice's write command. The temporary file should be truncated
> > or punched out completely before being reused to stage another
> > write.
> >
> > An example program might look like this:
> >
> > int fd = open("/some/file", O_RDWR);
> > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > struct stat sb;
> > int blksz;
> >
> > fstat(fd, &sb);
> > blksz = sb.st_blksize;
> >
> > /* land scatter gather writes between 100fsb and 500fsb */
> > pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> >
> > /* commit the entire update */
> > struct xfs_exch_range args = {
> > .file1_fd = temp_fd,
> > .file1_offset = blksz * 100,
> > .file2_offset = blksz * 100,
> > .length = blksz * 400,
> > .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> > XFS_EXCHRANGE_FILE1_DSYNC,
> > };
> >
> > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > NOTES
> > Some filesystems may limit the amount of data or the number of
> > extents that can be exchanged in a single call.
> >
> > SEE ALSO
> > ioctl(2)
> >
> > XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> >
> > NAME
> > ioctl_xfs_commit_range - conditionally exchange the contents of
> > parts of two files
> >
> > SYNOPSIS
> > #include <sys/ioctl.h>
> > #include <xfs/xfs_fs_staging.h>
> >
> > int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
> > mit_range *arg);
> >
> > DESCRIPTION
> > Given a range of bytes in a first file file1_fd and a second
> > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > changes the contents of the two ranges if file2_fd passes cer‐
> > tain freshness criteria.
> >
> > After locking both files but before exchanging the contents,
> > the supplied file2_ino field must match file2_fd's inode num‐
> > ber, and the supplied file2_mtime, file2_mtime_nsec,
> > file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > cation time and change time of file2. If they do not match,
> > EBUSY will be returned.
> >
>
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?
>
ctime should always change if the mtime does. An mtime update means that
the metadata was updated, so you also need to update the ctime.
> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)
>
> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.
>
> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.
>
I agree. Using an opaque change cookie would be a lot nicer from an API
standpoint, and shouldn't be subject to timestamp granularity issues.
That said, XFS's change cookie is currently broken. Dave C. said he had
some patches in progress to fix that however.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 10:53 ` Jeff Layton
@ 2024-02-27 16:06 ` Darrick J. Wong
2024-03-01 13:16 ` Jeff Layton
2024-02-27 23:46 ` Dave Chinner
1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-27 16:06 UTC (permalink / raw)
To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch
On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > Hi all,
> > >
> > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > ranges of bytes between two files atomically. This new functionality
> > > enables data storage programs to stage and commit file updates such that
> > > reader programs will see either the old contents or the new contents in
> > > their entirety, with no chance of torn writes. A successful call
> > > completion guarantees that the new contents will be seen even if the
> > > system fails.
> > >
> > > The ability to exchange file fork mappings between files in this manner
> > > is critical to supporting online filesystem repair, which is built upon
> > > the strategy of constructing a clean copy of a damaged structure and
> > > committing the new structure into the metadata file atomically.
> > >
> > > User programs will be able to update files atomically by opening an
> > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > they want to make, and exchange the relevant ranges of the temp file
> > > with the original file. If the updates are aligned with the file block
> > > size, a new (since v2) flag provides for exchanging only the written
> > > areas. Callers can arrange for the update to be rejected if the
> > > original file has been changed.
> > >
> > > The intent behind this new userspace functionality is to enable atomic
> > > rewrites of arbitrary parts of individual files. For years, application
> > > programmers wanting to ensure the atomicity of a file update had to
> > > write the changes to a new file in the same directory, fsync the new
> > > file, rename the new file on top of the old filename, and then fsync the
> > > directory. People get it wrong all the time, and $fs hacks abound.
> > > Here are the proposed manual pages:
> > >
>
> This is a cool idea! I've had some handwavy ideas about making a gated
> write() syscall (i.e. only write if the change cookie hasn't changed),
> but something like this may be a simpler lift initially.
How /does/ userspace get at the change cookie nowadays?
> > > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > >
> > > NAME
> > > ioctl_xfs_exchange_range - exchange the contents of parts of
> > > two files
> > >
> > > SYNOPSIS
> > > #include <sys/ioctl.h>
> > > #include <xfs/xfs_fs_staging.h>
> > >
> > > int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> > > xfs_exch_range *arg);
> > >
> > > DESCRIPTION
> > > Given a range of bytes in a first file file1_fd and a second
> > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > changes the contents of the two ranges.
> > >
> > > Exchanges are atomic with regards to concurrent file opera‐
> > > tions, so no userspace-level locks need to be taken to obtain
> > > consistent results. Implementations must guarantee that read‐
> > > ers see either the old contents or the new contents in their
> > > entirety, even if the system fails.
> > >
> > > The system call parameters are conveyed in structures of the
> > > following form:
> > >
> > > struct xfs_exch_range {
> > > __s64 file1_fd;
> > > __s64 file1_offset;
> > > __s64 file2_offset;
> > > __s64 length;
> > > __u64 flags;
> > >
> > > __u64 pad;
> > > };
> > >
> > > The field pad must be zero.
> > >
> > > The fields file1_fd, file1_offset, and length define the first
> > > range of bytes to be exchanged.
> > >
> > > The fields file2_fd, file2_offset, and length define the second
> > > range of bytes to be exchanged.
> > >
> > > Both files must be from the same filesystem mount. If the two
> > > file descriptors represent the same file, the byte ranges must
> > > not overlap. Most disk-based filesystems require that the
> > > starts of both ranges must be aligned to the file block size.
> > > If this is the case, the ends of the ranges must also be so
> > > aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > >
> > > The field flags control the behavior of the exchange operation.
> > >
> > > XFS_EXCHRANGE_TO_EOF
> > > Ignore the length parameter. All bytes in file1_fd
> > > from file1_offset to EOF are moved to file2_fd, and
> > > file2's size is set to (file2_offset+(file1_length-
> > > file1_offset)). Meanwhile, all bytes in file2 from
> > > file2_offset to EOF are moved to file1 and file1's
> > > size is set to (file1_offset+(file2_length-
> > > file2_offset)).
> > >
> > > XFS_EXCHRANGE_DSYNC
> > > Ensure that all modified in-core data in both file
> > > ranges and all metadata updates pertaining to the
> > > exchange operation are flushed to persistent storage
> > > before the call returns. Opening either file de‐
> > > scriptor with O_SYNC or O_DSYNC will have the same
> > > effect.
> > >
> > > XFS_EXCHRANGE_FILE1_WRITTEN
> > > Only exchange sub-ranges of file1_fd that are known
> > > to contain data written by application software.
> > > Each sub-range may be expanded (both upwards and
> > > downwards) to align with the file allocation unit.
> > > For files on the data device, this is one filesystem
> > > block. For files on the realtime device, this is
> > > the realtime extent size. This facility can be used
> > > to implement fast atomic scatter-gather writes of
> > > any complexity for software-defined storage targets
> > > if all writes are aligned to the file allocation
> > > unit.
> > >
> > > XFS_EXCHRANGE_DRY_RUN
> > > Check the parameters and the feasibility of the op‐
> > > eration, but do not change anything.
> > >
> > > RETURN VALUE
> > > On error, -1 is returned, and errno is set to indicate the er‐
> > > ror.
> > >
> > > ERRORS
> > > Error codes can be one of, but are not limited to, the follow‐
> > > ing:
> > >
> > > EBADF file1_fd is not open for reading and writing or is open
> > > for append-only writes; or file2_fd is not open for
> > > reading and writing or is open for append-only writes.
> > >
> > > EINVAL The parameters are not correct for these files. This
> > > error can also appear if either file descriptor repre‐
> > > sents a device, FIFO, or socket. Disk filesystems gen‐
> > > erally require the offset and length arguments to be
> > > aligned to the fundamental block sizes of both files.
> > >
> > > EIO An I/O error occurred.
> > >
> > > EISDIR One of the files is a directory.
> > >
> > > ENOMEM The kernel was unable to allocate sufficient memory to
> > > perform the operation.
> > >
> > > ENOSPC There is not enough free space in the filesystem ex‐
> > > change the contents safely.
> > >
> > > EOPNOTSUPP
> > > The filesystem does not support exchanging bytes between
> > > the two files.
> > >
> > > EPERM file1_fd or file2_fd are immutable.
> > >
> > > ETXTBSY
> > > One of the files is a swap file.
> > >
> > > EUCLEAN
> > > The filesystem is corrupt.
> > >
> > > EXDEV file1_fd and file2_fd are not on the same mounted
> > > filesystem.
> > >
> > > CONFORMING TO
> > > This API is XFS-specific.
> > >
> > > USE CASES
> > > Several use cases are imagined for this system call. In all
> > > cases, application software must coordinate updates to the file
> > > because the exchange is performed unconditionally.
> > >
> > > The first is a data storage program that wants to commit non-
> > > contiguous updates to a file atomically and coordinates write
> > > access to that file. This can be done by creating a temporary
> > > file, calling FICLONE(2) to share the contents, and staging the
> > > updates into the temporary file. The FULL_FILES flag is recom‐
> > > mended for this purpose. The temporary file can be deleted or
> > > punched out afterwards.
> > >
> > > An example program might look like this:
> > >
> > > int fd = open("/some/file", O_RDWR);
> > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > >
> > > ioctl(temp_fd, FICLONE, fd);
> > >
> > > /* append 1MB of records */
> > > lseek(temp_fd, 0, SEEK_END);
> > > write(temp_fd, data1, 1000000);
> > >
> > > /* update record index */
> > > pwrite(temp_fd, data1, 600, 98765);
> > > pwrite(temp_fd, data2, 320, 54321);
> > > pwrite(temp_fd, data2, 15, 0);
> > >
> > > /* commit the entire update */
> > > struct xfs_exch_range args = {
> > > .file1_fd = temp_fd,
> > > .flags = XFS_EXCHRANGE_TO_EOF,
> > > };
> > >
> > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > >
> > > The second is a software-defined storage host (e.g. a disk
> > > jukebox) which implements an atomic scatter-gather write com‐
> > > mand. Provided the exported disk's logical block size matches
> > > the file's allocation unit size, this can be done by creating a
> > > temporary file and writing the data at the appropriate offsets.
> > > It is recommended that the temporary file be truncated to the
> > > size of the regular file before any writes are staged to the
> > > temporary file to avoid issues with zeroing during EOF exten‐
> > > sion. Use this call with the FILE1_WRITTEN flag to exchange
> > > only the file allocation units involved in the emulated de‐
> > > vice's write command. The temporary file should be truncated
> > > or punched out completely before being reused to stage another
> > > write.
> > >
> > > An example program might look like this:
> > >
> > > int fd = open("/some/file", O_RDWR);
> > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > struct stat sb;
> > > int blksz;
> > >
> > > fstat(fd, &sb);
> > > blksz = sb.st_blksize;
> > >
> > > /* land scatter gather writes between 100fsb and 500fsb */
> > > pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > >
> > > /* commit the entire update */
> > > struct xfs_exch_range args = {
> > > .file1_fd = temp_fd,
> > > .file1_offset = blksz * 100,
> > > .file2_offset = blksz * 100,
> > > .length = blksz * 400,
> > > .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > XFS_EXCHRANGE_FILE1_DSYNC,
> > > };
> > >
> > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > >
> > > NOTES
> > > Some filesystems may limit the amount of data or the number of
> > > extents that can be exchanged in a single call.
> > >
> > > SEE ALSO
> > > ioctl(2)
> > >
> > > XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> > > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > >
> > > NAME
> > > ioctl_xfs_commit_range - conditionally exchange the contents of
> > > parts of two files
> > >
> > > SYNOPSIS
> > > #include <sys/ioctl.h>
> > > #include <xfs/xfs_fs_staging.h>
> > >
> > > int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
> > > mit_range *arg);
> > >
> > > DESCRIPTION
> > > Given a range of bytes in a first file file1_fd and a second
> > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > changes the contents of the two ranges if file2_fd passes cer‐
> > > tain freshness criteria.
> > >
> > > After locking both files but before exchanging the contents,
> > > the supplied file2_ino field must match file2_fd's inode num‐
> > > ber, and the supplied file2_mtime, file2_mtime_nsec,
> > > file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > cation time and change time of file2. If they do not match,
> > > EBUSY will be returned.
> > >
> >
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
> >
>
> ctime should always change if the mtime does. An mtime update means that
> the metadata was updated, so you also need to update the ctime.
Exactly. :)
> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> >
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> >
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> >
>
>
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.
TLDR: No.
> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.
Dave says that about a lot of things. I'm not willing to delay the
online fsck project _even further_ for a bunch of vaporware that's not
even out on linux-xfs for review.
The difference in opinion between xfs and the rest of the kernel about
i_version is 50% of why I didn't include it here. The other 50% is the
part where userspace can't access it, because I do not want to saddle my
mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
CHANGES.
--D
> --
> Jeff Layton <jlayton@kernel.org>
>
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 16:06 ` Darrick J. Wong
@ 2024-03-01 13:16 ` Jeff Layton
0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2024-03-01 13:16 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch
On Tue, 2024-02-27 at 08:06 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > > ranges of bytes between two files atomically. This new functionality
> > > > enables data storage programs to stage and commit file updates such that
> > > > reader programs will see either the old contents or the new contents in
> > > > their entirety, with no chance of torn writes. A successful call
> > > > completion guarantees that the new contents will be seen even if the
> > > > system fails.
> > > >
> > > > The ability to exchange file fork mappings between files in this manner
> > > > is critical to supporting online filesystem repair, which is built upon
> > > > the strategy of constructing a clean copy of a damaged structure and
> > > > committing the new structure into the metadata file atomically.
> > > >
> > > > User programs will be able to update files atomically by opening an
> > > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > > they want to make, and exchange the relevant ranges of the temp file
> > > > with the original file. If the updates are aligned with the file block
> > > > size, a new (since v2) flag provides for exchanging only the written
> > > > areas. Callers can arrange for the update to be rejected if the
> > > > original file has been changed.
> > > >
> > > > The intent behind this new userspace functionality is to enable atomic
> > > > rewrites of arbitrary parts of individual files. For years, application
> > > > programmers wanting to ensure the atomicity of a file update had to
> > > > write the changes to a new file in the same directory, fsync the new
> > > > file, rename the new file on top of the old filename, and then fsync the
> > > > directory. People get it wrong all the time, and $fs hacks abound.
> > > > Here are the proposed manual pages:
> > > >
> >
> > This is a cool idea! I've had some handwavy ideas about making a gated
> > write() syscall (i.e. only write if the change cookie hasn't changed),
> > but something like this may be a simpler lift initially.
>
> How /does/ userspace get at the change cookie nowadays?
>
Today, it doesn't. That would need to be exposed before we could make
that work.
> > > > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > > >
> > > > NAME
> > > > ioctl_xfs_exchange_range - exchange the contents of parts of
> > > > two files
> > > >
> > > > SYNOPSIS
> > > > #include <sys/ioctl.h>
> > > > #include <xfs/xfs_fs_staging.h>
> > > >
> > > > int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> > > > xfs_exch_range *arg);
> > > >
> > > > DESCRIPTION
> > > > Given a range of bytes in a first file file1_fd and a second
> > > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > > changes the contents of the two ranges.
> > > >
> > > > Exchanges are atomic with regards to concurrent file opera‐
> > > > tions, so no userspace-level locks need to be taken to obtain
> > > > consistent results. Implementations must guarantee that read‐
> > > > ers see either the old contents or the new contents in their
> > > > entirety, even if the system fails.
> > > >
> > > > The system call parameters are conveyed in structures of the
> > > > following form:
> > > >
> > > > struct xfs_exch_range {
> > > > __s64 file1_fd;
> > > > __s64 file1_offset;
> > > > __s64 file2_offset;
> > > > __s64 length;
> > > > __u64 flags;
> > > >
> > > > __u64 pad;
> > > > };
> > > >
> > > > The field pad must be zero.
> > > >
> > > > The fields file1_fd, file1_offset, and length define the first
> > > > range of bytes to be exchanged.
> > > >
> > > > The fields file2_fd, file2_offset, and length define the second
> > > > range of bytes to be exchanged.
> > > >
> > > > Both files must be from the same filesystem mount. If the two
> > > > file descriptors represent the same file, the byte ranges must
> > > > not overlap. Most disk-based filesystems require that the
> > > > starts of both ranges must be aligned to the file block size.
> > > > If this is the case, the ends of the ranges must also be so
> > > > aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > >
> > > > The field flags control the behavior of the exchange operation.
> > > >
> > > > XFS_EXCHRANGE_TO_EOF
> > > > Ignore the length parameter. All bytes in file1_fd
> > > > from file1_offset to EOF are moved to file2_fd, and
> > > > file2's size is set to (file2_offset+(file1_length-
> > > > file1_offset)). Meanwhile, all bytes in file2 from
> > > > file2_offset to EOF are moved to file1 and file1's
> > > > size is set to (file1_offset+(file2_length-
> > > > file2_offset)).
> > > >
> > > > XFS_EXCHRANGE_DSYNC
> > > > Ensure that all modified in-core data in both file
> > > > ranges and all metadata updates pertaining to the
> > > > exchange operation are flushed to persistent storage
> > > > before the call returns. Opening either file de‐
> > > > scriptor with O_SYNC or O_DSYNC will have the same
> > > > effect.
> > > >
> > > > XFS_EXCHRANGE_FILE1_WRITTEN
> > > > Only exchange sub-ranges of file1_fd that are known
> > > > to contain data written by application software.
> > > > Each sub-range may be expanded (both upwards and
> > > > downwards) to align with the file allocation unit.
> > > > For files on the data device, this is one filesystem
> > > > block. For files on the realtime device, this is
> > > > the realtime extent size. This facility can be used
> > > > to implement fast atomic scatter-gather writes of
> > > > any complexity for software-defined storage targets
> > > > if all writes are aligned to the file allocation
> > > > unit.
> > > >
> > > > XFS_EXCHRANGE_DRY_RUN
> > > > Check the parameters and the feasibility of the op‐
> > > > eration, but do not change anything.
> > > >
> > > > RETURN VALUE
> > > > On error, -1 is returned, and errno is set to indicate the er‐
> > > > ror.
> > > >
> > > > ERRORS
> > > > Error codes can be one of, but are not limited to, the follow‐
> > > > ing:
> > > >
> > > > EBADF file1_fd is not open for reading and writing or is open
> > > > for append-only writes; or file2_fd is not open for
> > > > reading and writing or is open for append-only writes.
> > > >
> > > > EINVAL The parameters are not correct for these files. This
> > > > error can also appear if either file descriptor repre‐
> > > > sents a device, FIFO, or socket. Disk filesystems gen‐
> > > > erally require the offset and length arguments to be
> > > > aligned to the fundamental block sizes of both files.
> > > >
> > > > EIO An I/O error occurred.
> > > >
> > > > EISDIR One of the files is a directory.
> > > >
> > > > ENOMEM The kernel was unable to allocate sufficient memory to
> > > > perform the operation.
> > > >
> > > > ENOSPC There is not enough free space in the filesystem ex‐
> > > > change the contents safely.
> > > >
> > > > EOPNOTSUPP
> > > > The filesystem does not support exchanging bytes between
> > > > the two files.
> > > >
> > > > EPERM file1_fd or file2_fd are immutable.
> > > >
> > > > ETXTBSY
> > > > One of the files is a swap file.
> > > >
> > > > EUCLEAN
> > > > The filesystem is corrupt.
> > > >
> > > > EXDEV file1_fd and file2_fd are not on the same mounted
> > > > filesystem.
> > > >
> > > > CONFORMING TO
> > > > This API is XFS-specific.
> > > >
> > > > USE CASES
> > > > Several use cases are imagined for this system call. In all
> > > > cases, application software must coordinate updates to the file
> > > > because the exchange is performed unconditionally.
> > > >
> > > > The first is a data storage program that wants to commit non-
> > > > contiguous updates to a file atomically and coordinates write
> > > > access to that file. This can be done by creating a temporary
> > > > file, calling FICLONE(2) to share the contents, and staging the
> > > > updates into the temporary file. The FULL_FILES flag is recom‐
> > > > mended for this purpose. The temporary file can be deleted or
> > > > punched out afterwards.
> > > >
> > > > An example program might look like this:
> > > >
> > > > int fd = open("/some/file", O_RDWR);
> > > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > >
> > > > ioctl(temp_fd, FICLONE, fd);
> > > >
> > > > /* append 1MB of records */
> > > > lseek(temp_fd, 0, SEEK_END);
> > > > write(temp_fd, data1, 1000000);
> > > >
> > > > /* update record index */
> > > > pwrite(temp_fd, data1, 600, 98765);
> > > > pwrite(temp_fd, data2, 320, 54321);
> > > > pwrite(temp_fd, data2, 15, 0);
> > > >
> > > > /* commit the entire update */
> > > > struct xfs_exch_range args = {
> > > > .file1_fd = temp_fd,
> > > > .flags = XFS_EXCHRANGE_TO_EOF,
> > > > };
> > > >
> > > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > >
> > > > The second is a software-defined storage host (e.g. a disk
> > > > jukebox) which implements an atomic scatter-gather write com‐
> > > > mand. Provided the exported disk's logical block size matches
> > > > the file's allocation unit size, this can be done by creating a
> > > > temporary file and writing the data at the appropriate offsets.
> > > > It is recommended that the temporary file be truncated to the
> > > > size of the regular file before any writes are staged to the
> > > > temporary file to avoid issues with zeroing during EOF exten‐
> > > > sion. Use this call with the FILE1_WRITTEN flag to exchange
> > > > only the file allocation units involved in the emulated de‐
> > > > vice's write command. The temporary file should be truncated
> > > > or punched out completely before being reused to stage another
> > > > write.
> > > >
> > > > An example program might look like this:
> > > >
> > > > int fd = open("/some/file", O_RDWR);
> > > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > > struct stat sb;
> > > > int blksz;
> > > >
> > > > fstat(fd, &sb);
> > > > blksz = sb.st_blksize;
> > > >
> > > > /* land scatter gather writes between 100fsb and 500fsb */
> > > > pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > > pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > > pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > >
> > > > /* commit the entire update */
> > > > struct xfs_exch_range args = {
> > > > .file1_fd = temp_fd,
> > > > .file1_offset = blksz * 100,
> > > > .file2_offset = blksz * 100,
> > > > .length = blksz * 400,
> > > > .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > > XFS_EXCHRANGE_FILE1_DSYNC,
> > > > };
> > > >
> > > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > >
> > > > NOTES
> > > > Some filesystems may limit the amount of data or the number of
> > > > extents that can be exchanged in a single call.
> > > >
> > > > SEE ALSO
> > > > ioctl(2)
> > > >
> > > > XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > > >
> > > > NAME
> > > > ioctl_xfs_commit_range - conditionally exchange the contents of
> > > > parts of two files
> > > >
> > > > SYNOPSIS
> > > > #include <sys/ioctl.h>
> > > > #include <xfs/xfs_fs_staging.h>
> > > >
> > > > int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
> > > > mit_range *arg);
> > > >
> > > > DESCRIPTION
> > > > Given a range of bytes in a first file file1_fd and a second
> > > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > > changes the contents of the two ranges if file2_fd passes cer‐
> > > > tain freshness criteria.
> > > >
> > > > After locking both files but before exchanging the contents,
> > > > the supplied file2_ino field must match file2_fd's inode num‐
> > > > ber, and the supplied file2_mtime, file2_mtime_nsec,
> > > > file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > > cation time and change time of file2. If they do not match,
> > > > EBUSY will be returned.
> > > >
> > >
> > > Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?
> > >
> >
> > ctime should always change if the mtime does. An mtime update means that
> > the metadata was updated, so you also need to update the ctime.
>
> Exactly. :)
>
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > >
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > >
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > >
> >
> >
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
>
> TLDR: No.
>
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
>
> Dave says that about a lot of things. I'm not willing to delay the
> online fsck project _even further_ for a bunch of vaporware that's not
> even out on linux-xfs for review.
>
> The difference in opinion between xfs and the rest of the kernel about
> i_version is 50% of why I didn't include it here. The other 50% is the
> part where userspace can't access it, because I do not want to saddle my
> mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
> CHANGES.
Ouch, point taken.
I just have grave concerns about using something as coarse-grained as
the to gate changes to a file. With modern machines, a single timestamp
can represent a large number of different states of the file's contents.
Is that not a problem here?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 10:53 ` Jeff Layton
2024-02-27 16:06 ` Darrick J. Wong
@ 2024-02-27 23:46 ` Dave Chinner
2024-02-28 10:30 ` Jeff Layton
1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2024-02-27 23:46 UTC (permalink / raw)
To: Jeff Layton
Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch
On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
Like xfs_fsr doing online defrag, we really only care about explicit
user data changes here, not internal layout and metadata changes to
the files...
> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> >
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> >
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> >
>
>
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.
>
> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.
By "fix", I meant "remove".
i.e. the patches I was proposing were to remove SB_I_VERSION support
from XFS so NFS just uses the ctime on XFS because the recent
changes to i_version make it a ctime change counter, not an inode
change counter.
Then patches were posted for finer grained inode timestamps to allow
everything to use ctime instead of i_version, and with that I
thought NFS was just going to change to ctime for everyone with that
the whole change cookie issue was going away.
It now sounds like that isn't happening, so I'll just ressurect the
patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
support from XFS for now and us XFS people can just go back to
ignoring this problem again.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 23:46 ` Dave Chinner
@ 2024-02-28 10:30 ` Jeff Layton
2024-02-28 10:58 ` Amir Goldstein
0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2024-02-28 10:30 UTC (permalink / raw)
To: Dave Chinner
Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch
On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
>
> Like xfs_fsr doing online defrag, we really only care about explicit
> user data changes here, not internal layout and metadata changes to
> the files...
>
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > >
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > >
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > >
> >
> >
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> >
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
>
> By "fix", I meant "remove".
>
> i.e. the patches I was proposing were to remove SB_I_VERSION support
> from XFS so NFS just uses the ctime on XFS because the recent
> changes to i_version make it a ctime change counter, not an inode
> change counter.
>
> Then patches were posted for finer grained inode timestamps to allow
> everything to use ctime instead of i_version, and with that I
> thought NFS was just going to change to ctime for everyone with that
> the whole change cookie issue was going away.
>
> It now sounds like that isn't happening, so I'll just ressurect the
> patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> support from XFS for now and us XFS people can just go back to
> ignoring this problem again.
I must have misunderstood what you said when we were at LPC this year:
After the multigrain ctime patches were reverted, you mentioned that you
were working on a patchset that used the unused bits in the tv_nsec
field as counter for counting changes that have occurred within the same
tv_nsec value.
Did those not pan out for some reason?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-28 10:30 ` Jeff Layton
@ 2024-02-28 10:58 ` Amir Goldstein
2024-02-28 11:01 ` Jeff Layton
0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2024-02-28 10:58 UTC (permalink / raw)
To: Jeff Layton
Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
Jan Kara
On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> >
> > Like xfs_fsr doing online defrag, we really only care about explicit
> > user data changes here, not internal layout and metadata changes to
> > the files...
> >
> > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > Is there a real need to support it on filesystems that do not support
> > > > i_version(?)
> > > >
> > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > how other filesystems may update ctime in those cases.
> > > >
> > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > really explicitly requests a bump of i_version on the next change.
> > > >
> > >
> > >
> > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > >
> > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > some patches in progress to fix that however.
> >
> > By "fix", I meant "remove".
> >
> > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > from XFS so NFS just uses the ctime on XFS because the recent
> > changes to i_version make it a ctime change counter, not an inode
> > change counter.
> >
> > Then patches were posted for finer grained inode timestamps to allow
> > everything to use ctime instead of i_version, and with that I
> > thought NFS was just going to change to ctime for everyone with that
> > the whole change cookie issue was going away.
> >
> > It now sounds like that isn't happening, so I'll just ressurect the
> > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > support from XFS for now and us XFS people can just go back to
> > ignoring this problem again.
>
>
> I must have misunderstood what you said when we were at LPC this year:
>
> After the multigrain ctime patches were reverted, you mentioned that you
> were working on a patchset that used the unused bits in the tv_nsec
> field as counter for counting changes that have occurred within the same
> tv_nsec value.
>
> Did those not pan out for some reason?
Jeff,
Could I trouble you to suggest a topic for LSFMM to summarize
everything that has been going on this year wrt change cookie/time
at xfs/vfs level and try to set a clear roadmap?
Thanks,
Amir.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-28 10:58 ` Amir Goldstein
@ 2024-02-28 11:01 ` Jeff Layton
0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2024-02-28 11:01 UTC (permalink / raw)
To: Amir Goldstein
Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
Jan Kara
On Wed, 2024-02-28 at 12:58 +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > >
> > > Like xfs_fsr doing online defrag, we really only care about explicit
> > > user data changes here, not internal layout and metadata changes to
> > > the files...
> > >
> > > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > > Is there a real need to support it on filesystems that do not support
> > > > > i_version(?)
> > > > >
> > > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > > how other filesystems may update ctime in those cases.
> > > > >
> > > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > > really explicitly requests a bump of i_version on the next change.
> > > > >
> > > >
> > > >
> > > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > > >
> > > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > > some patches in progress to fix that however.
> > >
> > > By "fix", I meant "remove".
> > >
> > > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > > from XFS so NFS just uses the ctime on XFS because the recent
> > > changes to i_version make it a ctime change counter, not an inode
> > > change counter.
> > >
> > > Then patches were posted for finer grained inode timestamps to allow
> > > everything to use ctime instead of i_version, and with that I
> > > thought NFS was just going to change to ctime for everyone with that
> > > the whole change cookie issue was going away.
> > >
> > > It now sounds like that isn't happening, so I'll just ressurect the
> > > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > > support from XFS for now and us XFS people can just go back to
> > > ignoring this problem again.
> >
> >
> > I must have misunderstood what you said when we were at LPC this year:
> >
> > After the multigrain ctime patches were reverted, you mentioned that you
> > were working on a patchset that used the unused bits in the tv_nsec
> > field as counter for counting changes that have occurred within the same
> > tv_nsec value.
> >
> > Did those not pan out for some reason?
>
> Jeff,
>
> Could I trouble you to suggest a topic for LSFMM to summarize
> everything that has been going on this year wrt change cookie/time
> at xfs/vfs level and try to set a clear roadmap?
>
Sure, I'll send something later today.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
2024-02-27 10:53 ` Jeff Layton
@ 2024-02-27 15:45 ` Darrick J. Wong
2024-02-27 16:58 ` Amir Goldstein
1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-27 15:45 UTC (permalink / raw)
To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton
On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically. This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes. A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file. If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas. Callers can arrange for the update to be rejected if the
> > original file has been changed.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files. For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory. People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> >
> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> >
> > NAME
> > ioctl_xfs_exchange_range - exchange the contents of parts of
> > two files
> >
> > SYNOPSIS
> > #include <sys/ioctl.h>
> > #include <xfs/xfs_fs_staging.h>
> >
> > int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> > xfs_exch_range *arg);
> >
> > DESCRIPTION
> > Given a range of bytes in a first file file1_fd and a second
> > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > changes the contents of the two ranges.
> >
> > Exchanges are atomic with regards to concurrent file opera‐
> > tions, so no userspace-level locks need to be taken to obtain
> > consistent results. Implementations must guarantee that read‐
> > ers see either the old contents or the new contents in their
> > entirety, even if the system fails.
> >
> > The system call parameters are conveyed in structures of the
> > following form:
> >
> > struct xfs_exch_range {
> > __s64 file1_fd;
> > __s64 file1_offset;
> > __s64 file2_offset;
> > __s64 length;
> > __u64 flags;
> >
> > __u64 pad;
> > };
> >
> > The field pad must be zero.
> >
> > The fields file1_fd, file1_offset, and length define the first
> > range of bytes to be exchanged.
> >
> > The fields file2_fd, file2_offset, and length define the second
> > range of bytes to be exchanged.
> >
> > Both files must be from the same filesystem mount. If the two
> > file descriptors represent the same file, the byte ranges must
> > not overlap. Most disk-based filesystems require that the
> > starts of both ranges must be aligned to the file block size.
> > If this is the case, the ends of the ranges must also be so
> > aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> >
> > The field flags control the behavior of the exchange operation.
> >
> > XFS_EXCHRANGE_TO_EOF
> > Ignore the length parameter. All bytes in file1_fd
> > from file1_offset to EOF are moved to file2_fd, and
> > file2's size is set to (file2_offset+(file1_length-
> > file1_offset)). Meanwhile, all bytes in file2 from
> > file2_offset to EOF are moved to file1 and file1's
> > size is set to (file1_offset+(file2_length-
> > file2_offset)).
> >
> > XFS_EXCHRANGE_DSYNC
> > Ensure that all modified in-core data in both file
> > ranges and all metadata updates pertaining to the
> > exchange operation are flushed to persistent storage
> > before the call returns. Opening either file de‐
> > scriptor with O_SYNC or O_DSYNC will have the same
> > effect.
> >
> > XFS_EXCHRANGE_FILE1_WRITTEN
> > Only exchange sub-ranges of file1_fd that are known
> > to contain data written by application software.
> > Each sub-range may be expanded (both upwards and
> > downwards) to align with the file allocation unit.
> > For files on the data device, this is one filesystem
> > block. For files on the realtime device, this is
> > the realtime extent size. This facility can be used
> > to implement fast atomic scatter-gather writes of
> > any complexity for software-defined storage targets
> > if all writes are aligned to the file allocation
> > unit.
> >
> > XFS_EXCHRANGE_DRY_RUN
> > Check the parameters and the feasibility of the op‐
> > eration, but do not change anything.
> >
> > RETURN VALUE
> > On error, -1 is returned, and errno is set to indicate the er‐
> > ror.
> >
> > ERRORS
> > Error codes can be one of, but are not limited to, the follow‐
> > ing:
> >
> > EBADF file1_fd is not open for reading and writing or is open
> > for append-only writes; or file2_fd is not open for
> > reading and writing or is open for append-only writes.
> >
> > EINVAL The parameters are not correct for these files. This
> > error can also appear if either file descriptor repre‐
> > sents a device, FIFO, or socket. Disk filesystems gen‐
> > erally require the offset and length arguments to be
> > aligned to the fundamental block sizes of both files.
> >
> > EIO An I/O error occurred.
> >
> > EISDIR One of the files is a directory.
> >
> > ENOMEM The kernel was unable to allocate sufficient memory to
> > perform the operation.
> >
> > ENOSPC There is not enough free space in the filesystem ex‐
> > change the contents safely.
> >
> > EOPNOTSUPP
> > The filesystem does not support exchanging bytes between
> > the two files.
> >
> > EPERM file1_fd or file2_fd are immutable.
> >
> > ETXTBSY
> > One of the files is a swap file.
> >
> > EUCLEAN
> > The filesystem is corrupt.
> >
> > EXDEV file1_fd and file2_fd are not on the same mounted
> > filesystem.
> >
> > CONFORMING TO
> > This API is XFS-specific.
> >
> > USE CASES
> > Several use cases are imagined for this system call. In all
> > cases, application software must coordinate updates to the file
> > because the exchange is performed unconditionally.
> >
> > The first is a data storage program that wants to commit non-
> > contiguous updates to a file atomically and coordinates write
> > access to that file. This can be done by creating a temporary
> > file, calling FICLONE(2) to share the contents, and staging the
> > updates into the temporary file. The FULL_FILES flag is recom‐
> > mended for this purpose. The temporary file can be deleted or
> > punched out afterwards.
> >
> > An example program might look like this:
> >
> > int fd = open("/some/file", O_RDWR);
> > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >
> > ioctl(temp_fd, FICLONE, fd);
> >
> > /* append 1MB of records */
> > lseek(temp_fd, 0, SEEK_END);
> > write(temp_fd, data1, 1000000);
> >
> > /* update record index */
> > pwrite(temp_fd, data1, 600, 98765);
> > pwrite(temp_fd, data2, 320, 54321);
> > pwrite(temp_fd, data2, 15, 0);
> >
> > /* commit the entire update */
> > struct xfs_exch_range args = {
> > .file1_fd = temp_fd,
> > .flags = XFS_EXCHRANGE_TO_EOF,
> > };
> >
> > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > The second is a software-defined storage host (e.g. a disk
> > jukebox) which implements an atomic scatter-gather write com‐
> > mand. Provided the exported disk's logical block size matches
> > the file's allocation unit size, this can be done by creating a
> > temporary file and writing the data at the appropriate offsets.
> > It is recommended that the temporary file be truncated to the
> > size of the regular file before any writes are staged to the
> > temporary file to avoid issues with zeroing during EOF exten‐
> > sion. Use this call with the FILE1_WRITTEN flag to exchange
> > only the file allocation units involved in the emulated de‐
> > vice's write command. The temporary file should be truncated
> > or punched out completely before being reused to stage another
> > write.
> >
> > An example program might look like this:
> >
> > int fd = open("/some/file", O_RDWR);
> > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > struct stat sb;
> > int blksz;
> >
> > fstat(fd, &sb);
> > blksz = sb.st_blksize;
> >
> > /* land scatter gather writes between 100fsb and 500fsb */
> > pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> >
> > /* commit the entire update */
> > struct xfs_exch_range args = {
> > .file1_fd = temp_fd,
> > .file1_offset = blksz * 100,
> > .file2_offset = blksz * 100,
> > .length = blksz * 400,
> > .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> > XFS_EXCHRANGE_FILE1_DSYNC,
> > };
> >
> > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > NOTES
> > Some filesystems may limit the amount of data or the number of
> > extents that can be exchanged in a single call.
> >
> > SEE ALSO
> > ioctl(2)
> >
> > XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> >
> > NAME
> > ioctl_xfs_commit_range - conditionally exchange the contents of
> > parts of two files
> >
> > SYNOPSIS
> > #include <sys/ioctl.h>
> > #include <xfs/xfs_fs_staging.h>
> >
> > int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE, struct xfs_com‐
> > mit_range *arg);
> >
> > DESCRIPTION
> > Given a range of bytes in a first file file1_fd and a second
> > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > changes the contents of the two ranges if file2_fd passes cer‐
> > tain freshness criteria.
> >
> > After locking both files but before exchanging the contents,
> > the supplied file2_ino field must match file2_fd's inode num‐
> > ber, and the supplied file2_mtime, file2_mtime_nsec,
> > file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > cation time and change time of file2. If they do not match,
> > EBUSY will be returned.
> >
>
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?
It's the other way 'round -- mtime doesn't change but ctime does. The
race I'm trying to protect against is:
Thread 0 Thread 1
<snapshot fd cmtime>
<start writing tempfd>
<fstat fd>
<write to fd>
<futimens to reset mtime>
<commitrange>
mtime is controllable by "attackers" but ctime isn't. I think we only
need to capture ctime, but ye olde swapext ioctl (from which this
derives) did both.
> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
work in the intended manner in XFS, no.
> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)
Given the way the iversion discussions have gone (file data write
counter) I don't think there's a way to support commitrange on
non-iversion filesystems.
I withdrew any plans to make this more than an XFS-specific ioctl last
year after giving up on ever getting through fsdevel review. I think
the last reply I got was from viro back in 2021...
> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.
...and given the lack of interest from any other filesystem developers
in porting it to !xfs, I'm not likely to take this up ever again. To be
fair, I think the only filesystems that could possibly support
EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.
> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.
Another way I could've structured this (and still could!) would be to
declare the entire freshness region as an untyped u64 fresh[4] blob and
add a START_COMMIT ioctl to fill it out. Then the kernel fs drivers
gets to determine what goes in there.
That at least would be less work for userspace to do.
I don't want userspace API wrangling to hold up online repair **yet
again**. I only made EXCHANGE_RANGE so that I could test the functionality
that fsck relies on. If there was another way to test it then I would
have gladly done that. Further down the line, COMMIT_RANGE will get us
out of trouble with the xfs defrag tool.
--D
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 15:45 ` Darrick J. Wong
@ 2024-02-27 16:58 ` Amir Goldstein
0 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2024-02-27 16:58 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton
On Tue, Feb 27, 2024 at 5:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
[...]
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
>
> It's the other way 'round -- mtime doesn't change but ctime does. The
> race I'm trying to protect against is:
>
> Thread 0 Thread 1
> <snapshot fd cmtime>
> <start writing tempfd>
> <fstat fd>
> <write to fd>
> <futimens to reset mtime>
> <commitrange>
>
> mtime is controllable by "attackers" but ctime isn't. I think we only
> need to capture ctime, but ye olde swapext ioctl (from which this
> derives) did both.
>
Yes, that's what I meant. was just a braino.
mtime seems redundant, but if you want to keep it for compatibility with
legacy API, so be it.
> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
>
> Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
> work in the intended manner in XFS, no.
>
OK. for the record, AFAICT, the problem of NFS guys with xfs iversion
is that it is too
aggressive to their taste (i.e. bumped on atime updates), but because
ctime is always
updated along with iversion in xfs and because iversion has no granularity
problem, I think it is a better choice for you, regardless of any
other filesystems
and their interpretation of ctime or iversion.
> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
>
> Given the way the iversion discussions have gone (file data write
> counter) I don't think there's a way to support commitrange on
> non-iversion filesystems.
>
> I withdrew any plans to make this more than an XFS-specific ioctl last
> year after giving up on ever getting through fsdevel review. I think
> the last reply I got was from viro back in 2021...
>
understandable.
I wasn't implying that you should hoist this out of XFS.
I was wondering about why not use xfs's iversion, which
seems like a better change counter than ctime.
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
>
> ...and given the lack of interest from any other filesystem developers
> in porting it to !xfs, I'm not likely to take this up ever again. To be
> faiproblemr, I think the only filesystems that could possibly support
> EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.
>
Again, sorry if my question were misinterpreted.
I was not trying to imply that this API should be hoisted.
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
>
> Another way I could've structured this (and still could!) would be to
> deproblemclare the entire freshness region as an untyped u64 fresh[4] blob and
> add a START_COMMIT ioctl to fill it out. Then the kernel fs drivers
> gets to determine what goes in there.
>
> That at least would be less work for userspace to do.
>
To me that makes sense. Cleaner API.
Less questions asked.
But it's up to you.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-02-27 2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
2024-02-27 2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
2024-02-27 9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 17:46 ` Darrick J. Wong
2024-02-27 18:52 ` Amir Goldstein
2024-02-28 1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
3 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-27 17:46 UTC (permalink / raw)
To: linux-fsdevel, linux-xfs, hch; +Cc: Amir Goldstein, jlayton
From: Darrick J. Wong <djwong@kernel.org>
To head off bikeshedding about the fields in xfs_commit_range, let's
make it an opaque u64 array and require the userspace program to call
a third ioctl to sample the freshness data for us. If we ever converge
on a definition for i_version then we can use that; for now we'll just
use mtime/ctime like the old swapext ioctl.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_fs.h | 13 +++--------
fs/xfs/xfs_exchrange.c | 15 ++++++++++++
fs/xfs/xfs_exchrange.h | 1 +
fs/xfs/xfs_ioctl.c | 58 +++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 72 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 01b3553adfc55..4019a78ee3ea5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -860,14 +860,8 @@ struct xfs_commit_range {
__u64 flags; /* see XFS_EXCHRANGE_* below */
- /* file2 metadata for freshness checks */
- __u64 file2_ino; /* inode number */
- __s64 file2_mtime; /* modification time */
- __s64 file2_ctime; /* change time */
- __s32 file2_mtime_nsec; /* mod time, nsec */
- __s32 file2_ctime_nsec; /* change time, nsec */
-
- __u64 pad; /* must be zeroes */
+ /* opaque file2 metadata for freshness checks */
+ __u64 file2_freshness[5];
};
/*
@@ -973,7 +967,8 @@ struct xfs_commit_range {
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exch_range)
-#define XFS_IOC_COMMIT_RANGE _IOWR('X', 129, struct xfs_commit_range)
+#define XFS_IOC_START_COMMIT _IOWR('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE _IOWR('X', 131, struct xfs_commit_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index e55ae06f1a32c..dae855515c3c4 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -863,3 +863,18 @@ xfs_exchange_range(
fsnotify_modify(fxr->file2);
return 0;
}
+
+/* Sample freshness data from fxr->file2 for a commit range operation. */
+void
+xfs_exchrange_freshness(
+ struct xfs_exchrange *fxr)
+{
+ struct inode *inode2 = file_inode(fxr->file2);
+ struct xfs_inode *ip2 = XFS_I(inode2);
+
+ xfs_ilock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+ fxr->file2_ino = ip2->i_ino;
+ fxr->file2_ctime = inode_get_ctime(inode2);
+ fxr->file2_mtime = inode_get_mtime(inode2);
+ xfs_iunlock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 2dd9ab7d76828..942283a7f75f5 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -36,6 +36,7 @@ struct xfs_exchrange {
struct timespec64 file2_ctime;
};
+void xfs_exchrange_freshness(struct xfs_exchrange *fxr);
int xfs_exchange_range(struct xfs_exchrange *fxr);
/* XFS-specific parts of file exchanges */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ee26ac2028da1..1940da72a1da7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2402,6 +2402,47 @@ xfs_ioc_exchange_range(
return error;
}
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+ __u64 file2_ino; /* inode number */
+ __s64 file2_mtime; /* modification time */
+ __s64 file2_ctime; /* change time */
+ __s32 file2_mtime_nsec; /* mod time, nsec */
+ __s32 file2_ctime_nsec; /* change time, nsec */
+ __u64 pad; /* zero */
+};
+
+static long
+xfs_ioc_start_commit(
+ struct file *file,
+ struct xfs_commit_range __user *argp)
+{
+ struct xfs_exchrange fxr = {
+ .file2 = file,
+ };
+ struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
+ struct xfs_commit_range_fresh __user *user_f;
+
+ BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+ sizeof(args.file2_freshness));
+
+ xfs_exchrange_freshness(&fxr);
+
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+ kern_f->file2_ino = fxr.file2_ino;
+ kern_f->file2_mtime = fxr.file2_mtime.tv_sec;
+ kern_f->file2_mtime_nsec = fxr.file2_mtime.tv_nsec;
+ kern_f->file2_ctime = fxr.file2_ctime.tv_sec;
+ kern_f->file2_ctime_nsec = fxr.file2_ctime.tv_nsec;
+
+ user_f = (struct xfs_commit_range_fresh *)&argp->file2_freshness;
+ if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+ return -EFAULT;
+
+ return 0;
+}
+
static long
xfs_ioc_commit_range(
struct file *file,
@@ -2411,12 +2452,15 @@ xfs_ioc_commit_range(
.file2 = file,
};
struct xfs_commit_range args;
+ struct xfs_commit_range_fresh *kern_f;
struct fd file1;
int error;
+ kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
if (copy_from_user(&args, argp, sizeof(args)))
return -EFAULT;
- if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+ if (memchr_inv(&kern_f->pad, 0, sizeof(kern_f->pad)))
return -EINVAL;
if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
return -EINVAL;
@@ -2425,11 +2469,11 @@ xfs_ioc_commit_range(
fxr.file2_offset = args.file2_offset;
fxr.length = args.length;
fxr.flags = args.flags | __XFS_EXCHRANGE_CHECK_FRESH2;
- fxr.file2_ino = args.file2_ino;
- fxr.file2_mtime.tv_sec = args.file2_mtime;
- fxr.file2_mtime.tv_nsec = args.file2_mtime_nsec;
- fxr.file2_ctime.tv_sec = args.file2_ctime;
- fxr.file2_ctime.tv_nsec = args.file2_ctime_nsec;
+ fxr.file2_ino = kern_f->file2_ino;
+ fxr.file2_mtime.tv_sec = kern_f->file2_mtime;
+ fxr.file2_mtime.tv_nsec = kern_f->file2_mtime_nsec;
+ fxr.file2_ctime.tv_sec = kern_f->file2_ctime;
+ fxr.file2_ctime.tv_nsec = kern_f->file2_ctime_nsec;
file1 = fdget(args.file1_fd);
if (!file1.file)
@@ -2782,6 +2826,8 @@ xfs_file_ioctl(
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
+ case XFS_IOC_START_COMMIT:
+ return xfs_ioc_start_commit(filp, arg);
case XFS_IOC_COMMIT_RANGE:
return xfs_ioc_commit_range(filp, arg);
^ permalink raw reply related [flat|nested] 25+ messages in thread* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-27 18:52 ` Amir Goldstein
2024-02-29 23:27 ` Darrick J. Wong
0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2024-02-27 18:52 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton
On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> To head off bikeshedding about the fields in xfs_commit_range, let's
> make it an opaque u64 array and require the userspace program to call
> a third ioctl to sample the freshness data for us. If we ever converge
> on a definition for i_version then we can use that; for now we'll just
> use mtime/ctime like the old swapext ioctl.
This addresses my concerns about using mtime/ctime.
I have to say, Darrick, that I think that referring to this concern as
bikeshedding is not being honest.
I do hate nit picking reviews and I do hate "maybe also fix the world"
review comments, but I think the question about using mtime/ctime in
this new API was not out of place and I think that making the freshness
data opaque is better for everyone in the long run and hopefully, this will
help you move to the things you care about faster.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-02-27 18:52 ` Amir Goldstein
@ 2024-02-29 23:27 ` Darrick J. Wong
2024-03-01 13:00 ` Amir Goldstein
2024-03-01 13:31 ` Jeff Layton
0 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-29 23:27 UTC (permalink / raw)
To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, jlayton
On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > To head off bikeshedding about the fields in xfs_commit_range, let's
> > make it an opaque u64 array and require the userspace program to call
> > a third ioctl to sample the freshness data for us. If we ever converge
> > on a definition for i_version then we can use that; for now we'll just
> > use mtime/ctime like the old swapext ioctl.
>
> This addresses my concerns about using mtime/ctime.
Oh good! :)
> I have to say, Darrick, that I think that referring to this concern as
> bikeshedding is not being honest.
>
> I do hate nit picking reviews and I do hate "maybe also fix the world"
> review comments, but I think the question about using mtime/ctime in
> this new API was not out of place
I agree, your question about mtime/ctime:
"Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?"
was a very good question. But perhaps that statement referred to the
other part of that thread.
> and I think that making the freshness
> data opaque is better for everyone in the long run and hopefully, this will
> help you move to the things you care about faster.
I wish you'd suggested an opaque blob that the fs can lay out however it
wants instead of suggesting specifically the change cookie. I'm very
much ok with an opaque freshness blob that allows future flexibility in
how we define the blob's contents.
I was however very upset about the Jeff's suggestion of using i_version.
I apologize for using all caps in that reply, and snarling about it in
the commit message here. The final version of this patch will not have
that.
That said, I don't think it is at all helpful to suggest using a file
attribute whose behavior is as yet unresolved. Multigrain timestamps
were a clever idea, regrettably reverted. As far as I could tell when I
wrote my reply, neither had NFS implemented a better behavior and
quietly merged it; nor have Jeff and Dave produced any sort of candidate
patchset to fix all the resulting issues in XFS.
Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
internal" made me think "OH $deity, they wants me to do that work
too???"
A better way to have woreded that might've been "How about switching
this to a fs-determined structure so that we can switch the freshness
check to i_version when that's fully working on XFS?"
The problem I have with reading patch review emails is that I can't
easily tell whether an author's suggestion is being made in a casual
offhand manner? Or if it reflects something they feel strongly needs
change before merging.
In fairness to you, Amir, I don't know how much you've kept on top of
that i_version vs. XFS discussion. So I have no idea if you were aware
of the status of that work.
--D
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-02-29 23:27 ` Darrick J. Wong
@ 2024-03-01 13:00 ` Amir Goldstein
2024-03-01 13:31 ` Jeff Layton
1 sibling, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2024-03-01 13:00 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton
On Fri, Mar 1, 2024 at 1:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us. If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> >
> > This addresses my concerns about using mtime/ctime.
>
> Oh good! :)
>
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> >
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
>
> I agree, your question about mtime/ctime:
>
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
>
> was a very good question. But perhaps that statement referred to the
> other part of that thread.
>
> > and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
>
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie. I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
>
I wish I had thought of it myself - it is a good idea - just did not
occur to me.
Using the language of i_changecounter, that is "the current xfs implementation
of i_version", I still think that using it as the content of the
opaque freshness blob
makes more sense than mtime+ctime, but it is none of my concern what
you decide to fill in the freshness blob for the first version.
I was not aware of the way xfs_fsr is currently using mtime+ctime when
I replied and I am not sure if and how it is relevant to the new API.
> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here. The final version of this patch will not have
> that.
>
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved. Multigrain timestamps
> were a clever idea, regrettably reverted. As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
>
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
>
Yeh, I should have chosen my words more carefully.
I was perfectly aware of your lack of interest in doing extra work
and wasn't trying to request any.
> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner? Or if it reflects something they feel strongly needs
> change before merging.
>
Can't speak for everyone else, but coming from the middle east,
I have fewer politeness filters.
When I write "wouldn't it be better to use change_cookie?"
I am just asking that question.
When I am asking something to be changed before merge,
I try to be much more explicit about it and this is what I expect
others to do when reviewing my patches.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-02-29 23:27 ` Darrick J. Wong
2024-03-01 13:00 ` Amir Goldstein
@ 2024-03-01 13:31 ` Jeff Layton
2024-03-02 2:48 ` Darrick J. Wong
1 sibling, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2024-03-01 13:31 UTC (permalink / raw)
To: Darrick J. Wong, Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch
On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us. If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> >
> > This addresses my concerns about using mtime/ctime.
>
> Oh good! :)
>
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> >
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
>
> I agree, your question about mtime/ctime:
>
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
>
> was a very good question. But perhaps that statement referred to the
> other part of that thread.
>
> > and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
>
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie. I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
>
> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here. The final version of this patch will not have
> that.
>
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved. Multigrain timestamps
> were a clever idea, regrettably reverted. As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
>
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
>
> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner? Or if it reflects something they feel strongly needs
> change before merging.
>
> In fairness to you, Amir, I don't know how much you've kept on top of
> that i_version vs. XFS discussion. So I have no idea if you were aware
> of the status of that work.
>
Sorry, I didn't mean to trigger anyone, but I do have real concerns
about any API that attempts to use timestamps to detect whether
something has changed.
We learned that lesson in NFS in the 90's. VFS timestamp resolution is
just not enough to show whether there was a change to a file -- full
stop.
I get the hand-wringing over i_version definitions and I don't care to
rehash that discussion here, but I'll point out that this is a
(proposed) XFS-private interface:
What you could do is expose the XFS change counter (the one that gets
bumped for everything, even atime updates, possibly via different
ioctl), and use that for your "freshness" check.
You'd unfortunately get false negative freshness checks after read
operations, but you shouldn't get any false positives (which is real
danger with timestamps).
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-03-01 13:31 ` Jeff Layton
@ 2024-03-02 2:48 ` Darrick J. Wong
2024-03-02 12:43 ` Jeff Layton
0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-02 2:48 UTC (permalink / raw)
To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch
On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > >
> > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > make it an opaque u64 array and require the userspace program to call
> > > > a third ioctl to sample the freshness data for us. If we ever converge
> > > > on a definition for i_version then we can use that; for now we'll just
> > > > use mtime/ctime like the old swapext ioctl.
> > >
> > > This addresses my concerns about using mtime/ctime.
> >
> > Oh good! :)
> >
> > > I have to say, Darrick, that I think that referring to this concern as
> > > bikeshedding is not being honest.
> > >
> > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > review comments, but I think the question about using mtime/ctime in
> > > this new API was not out of place
> >
> > I agree, your question about mtime/ctime:
> >
> > "Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?"
> >
> > was a very good question. But perhaps that statement referred to the
> > other part of that thread.
> >
> > > and I think that making the freshness
> > > data opaque is better for everyone in the long run and hopefully, this will
> > > help you move to the things you care about faster.
> >
> > I wish you'd suggested an opaque blob that the fs can lay out however it
> > wants instead of suggesting specifically the change cookie. I'm very
> > much ok with an opaque freshness blob that allows future flexibility in
> > how we define the blob's contents.
> >
> > I was however very upset about the Jeff's suggestion of using i_version.
> > I apologize for using all caps in that reply, and snarling about it in
> > the commit message here. The final version of this patch will not have
> > that.
> >
> > That said, I don't think it is at all helpful to suggest using a file
> > attribute whose behavior is as yet unresolved. Multigrain timestamps
> > were a clever idea, regrettably reverted. As far as I could tell when I
> > wrote my reply, neither had NFS implemented a better behavior and
> > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > patchset to fix all the resulting issues in XFS.
> >
> > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > internal" made me think "OH $deity, they wants me to do that work
> > too???"
> >
> > A better way to have woreded that might've been "How about switching
> > this to a fs-determined structure so that we can switch the freshness
> > check to i_version when that's fully working on XFS?"
> >
> > The problem I have with reading patch review emails is that I can't
> > easily tell whether an author's suggestion is being made in a casual
> > offhand manner? Or if it reflects something they feel strongly needs
> > change before merging.
> >
> > In fairness to you, Amir, I don't know how much you've kept on top of
> > that i_version vs. XFS discussion. So I have no idea if you were aware
> > of the status of that work.
> >
>
> Sorry, I didn't mean to trigger anyone, but I do have real concerns
> about any API that attempts to use timestamps to detect whether
> something has changed.
>
> We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> just not enough to show whether there was a change to a file -- full
> stop.
>
> I get the hand-wringing over i_version definitions and I don't care to
> rehash that discussion here, but I'll point out that this is a
> (proposed) XFS-private interface:
>
> What you could do is expose the XFS change counter (the one that gets
> bumped for everything, even atime updates, possibly via different
> ioctl), and use that for your "freshness" check.
>
> You'd unfortunately get false negative freshness checks after read
> operations, but you shouldn't get any false positives (which is real
> danger with timestamps).
I don't see how would that work for this usecase? You have to sample
file2 before reflinking file2's contents to file1, writing the changes
to file1, and executing COMMIT_RANGE. Setting the xfs-private REFLINK
inode flag on file2 will trigger an iversion update even though it won't
change mtime or ctime. The COMMIT then fails due to the inode flags
change.
Worse yet, applications aren't going to know if a particular access is
actually the one that will trigger an atime update. So this will just
fail unpredictably.
If iversion was purely a write counter then I would switch the freshness
implementation to use it. But it's not, and I know this to be true
because I tried that and could not get COMMIT_RANGE to work reliably.
I suppose the advantage of the blob thing is that we actually /can/
switch over whenever it's ready.
--D
> --
> Jeff Layton <jlayton@kernel.org>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-03-02 2:48 ` Darrick J. Wong
@ 2024-03-02 12:43 ` Jeff Layton
2024-03-07 23:25 ` Darrick J. Wong
0 siblings, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2024-03-02 12:43 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch
On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > >
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > >
> > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > make it an opaque u64 array and require the userspace program to call
> > > > > a third ioctl to sample the freshness data for us. If we ever converge
> > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > use mtime/ctime like the old swapext ioctl.
> > > >
> > > > This addresses my concerns about using mtime/ctime.
> > >
> > > Oh good! :)
> > >
> > > > I have to say, Darrick, that I think that referring to this concern as
> > > > bikeshedding is not being honest.
> > > >
> > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > review comments, but I think the question about using mtime/ctime in
> > > > this new API was not out of place
> > >
> > > I agree, your question about mtime/ctime:
> > >
> > > "Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?"
> > >
> > > was a very good question. But perhaps that statement referred to the
> > > other part of that thread.
> > >
> > > > and I think that making the freshness
> > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > help you move to the things you care about faster.
> > >
> > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > wants instead of suggesting specifically the change cookie. I'm very
> > > much ok with an opaque freshness blob that allows future flexibility in
> > > how we define the blob's contents.
> > >
> > > I was however very upset about the Jeff's suggestion of using i_version.
> > > I apologize for using all caps in that reply, and snarling about it in
> > > the commit message here. The final version of this patch will not have
> > > that.
> > >
> > > That said, I don't think it is at all helpful to suggest using a file
> > > attribute whose behavior is as yet unresolved. Multigrain timestamps
> > > were a clever idea, regrettably reverted. As far as I could tell when I
> > > wrote my reply, neither had NFS implemented a better behavior and
> > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > patchset to fix all the resulting issues in XFS.
> > >
> > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > internal" made me think "OH $deity, they wants me to do that work
> > > too???"
> > >
> > > A better way to have woreded that might've been "How about switching
> > > this to a fs-determined structure so that we can switch the freshness
> > > check to i_version when that's fully working on XFS?"
> > >
> > > The problem I have with reading patch review emails is that I can't
> > > easily tell whether an author's suggestion is being made in a casual
> > > offhand manner? Or if it reflects something they feel strongly needs
> > > change before merging.
> > >
> > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > that i_version vs. XFS discussion. So I have no idea if you were aware
> > > of the status of that work.
> > >
> >
> > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > about any API that attempts to use timestamps to detect whether
> > something has changed.
> >
> > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > just not enough to show whether there was a change to a file -- full
> > stop.
> >
> > I get the hand-wringing over i_version definitions and I don't care to
> > rehash that discussion here, but I'll point out that this is a
> > (proposed) XFS-private interface:
> >
> > What you could do is expose the XFS change counter (the one that gets
> > bumped for everything, even atime updates, possibly via different
> > ioctl), and use that for your "freshness" check.
> >
> > You'd unfortunately get false negative freshness checks after read
> > operations, but you shouldn't get any false positives (which is real
> > danger with timestamps).
>
> I don't see how would that work for this usecase? You have to sample
> file2 before reflinking file2's contents to file1, writing the changes
> to file1, and executing COMMIT_RANGE. Setting the xfs-private REFLINK
> inode flag on file2 will trigger an iversion update even though it won't
> change mtime or ctime. The COMMIT then fails due to the inode flags
> change.
>
> Worse yet, applications aren't going to know if a particular access is
> actually the one that will trigger an atime update. So this will just
> fail unpredictably.
>
> If iversion was purely a write counter then I would switch the freshness
> implementation to use it. But it's not, and I know this to be true
> because I tried that and could not get COMMIT_RANGE to work reliably.
> I suppose the advantage of the blob thing is that we actually /can/
> switch over whenever it's ready.
>
Yeah, that's the other part -- you have to be willing to redrive the I/O
every time the freshness check fails, which can get expensive depending
on how active the file is. Again this is an XFS interface, so I don't
really have a dog in this fight. If you think timestamps are good
enough, then so be it.
All I can do is mention that it has been our experience in the NFS world
that relying on timestamps like this will eventually lead to data
corruption. The race conditions may be tight, and much of the time the
race may be benign, but if you do this enough you'll eventually get
bitten, and end up exchanging data when you shouldn't have.
All of that said, I think this is great discussion fodder for LSF this
year. I feel like the time is right to consider these sorts of
interfaces that do synchronized I/O without locking. I've already
proposed a discussion around the state of the i_version counter, so
maybe we can chat about it then?
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
2024-03-02 12:43 ` Jeff Layton
@ 2024-03-07 23:25 ` Darrick J. Wong
0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-07 23:25 UTC (permalink / raw)
To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch
On Sat, Mar 02, 2024 at 07:43:53AM -0500, Jeff Layton wrote:
> On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> > On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > >
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > >
> > > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > > make it an opaque u64 array and require the userspace program to call
> > > > > > a third ioctl to sample the freshness data for us. If we ever converge
> > > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > > use mtime/ctime like the old swapext ioctl.
> > > > >
> > > > > This addresses my concerns about using mtime/ctime.
> > > >
> > > > Oh good! :)
> > > >
> > > > > I have to say, Darrick, that I think that referring to this concern as
> > > > > bikeshedding is not being honest.
> > > > >
> > > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > > review comments, but I think the question about using mtime/ctime in
> > > > > this new API was not out of place
> > > >
> > > > I agree, your question about mtime/ctime:
> > > >
> > > > "Maybe a stupid question, but under which circumstances would mtime
> > > > change and ctime not change? Why are both needed?"
> > > >
> > > > was a very good question. But perhaps that statement referred to the
> > > > other part of that thread.
> > > >
> > > > > and I think that making the freshness
> > > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > > help you move to the things you care about faster.
> > > >
> > > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > > wants instead of suggesting specifically the change cookie. I'm very
> > > > much ok with an opaque freshness blob that allows future flexibility in
> > > > how we define the blob's contents.
> > > >
> > > > I was however very upset about the Jeff's suggestion of using i_version.
> > > > I apologize for using all caps in that reply, and snarling about it in
> > > > the commit message here. The final version of this patch will not have
> > > > that.
> > > >
> > > > That said, I don't think it is at all helpful to suggest using a file
> > > > attribute whose behavior is as yet unresolved. Multigrain timestamps
> > > > were a clever idea, regrettably reverted. As far as I could tell when I
> > > > wrote my reply, neither had NFS implemented a better behavior and
> > > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > > patchset to fix all the resulting issues in XFS.
> > > >
> > > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > > internal" made me think "OH $deity, they wants me to do that work
> > > > too???"
> > > >
> > > > A better way to have woreded that might've been "How about switching
> > > > this to a fs-determined structure so that we can switch the freshness
> > > > check to i_version when that's fully working on XFS?"
> > > >
> > > > The problem I have with reading patch review emails is that I can't
> > > > easily tell whether an author's suggestion is being made in a casual
> > > > offhand manner? Or if it reflects something they feel strongly needs
> > > > change before merging.
> > > >
> > > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > > that i_version vs. XFS discussion. So I have no idea if you were aware
> > > > of the status of that work.
> > > >
> > >
> > > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > > about any API that attempts to use timestamps to detect whether
> > > something has changed.
> > >
> > > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > > just not enough to show whether there was a change to a file -- full
> > > stop.
> > >
> > > I get the hand-wringing over i_version definitions and I don't care to
> > > rehash that discussion here, but I'll point out that this is a
> > > (proposed) XFS-private interface:
> > >
> > > What you could do is expose the XFS change counter (the one that gets
> > > bumped for everything, even atime updates, possibly via different
> > > ioctl), and use that for your "freshness" check.
> > >
> > > You'd unfortunately get false negative freshness checks after read
> > > operations, but you shouldn't get any false positives (which is real
> > > danger with timestamps).
> >
> > I don't see how would that work for this usecase? You have to sample
> > file2 before reflinking file2's contents to file1, writing the changes
> > to file1, and executing COMMIT_RANGE. Setting the xfs-private REFLINK
> > inode flag on file2 will trigger an iversion update even though it won't
> > change mtime or ctime. The COMMIT then fails due to the inode flags
> > change.
> >
> > Worse yet, applications aren't going to know if a particular access is
> > actually the one that will trigger an atime update. So this will just
> > fail unpredictably.
> >
> > If iversion was purely a write counter then I would switch the freshness
> > implementation to use it. But it's not, and I know this to be true
> > because I tried that and could not get COMMIT_RANGE to work reliably.
> > I suppose the advantage of the blob thing is that we actually /can/
> > switch over whenever it's ready.
> >
>
> Yeah, that's the other part -- you have to be willing to redrive the I/O
> every time the freshness check fails, which can get expensive depending
> on how active the file is. Again this is an XFS interface, so I don't
> really have a dog in this fight. If you think timestamps are good
> enough, then so be it.
>
> All I can do is mention that it has been our experience in the NFS world
> that relying on timestamps like this will eventually lead to data
> corruption. The race conditions may be tight, and much of the time the
> race may be benign, but if you do this enough you'll eventually get
> bitten, and end up exchanging data when you shouldn't have.
>
> All of that said, I think this is great discussion fodder for LSF this
> year. I feel like the time is right to consider these sorts of
> interfaces that do synchronized I/O without locking. I've already
> proposed a discussion around the state of the i_version counter, so
> maybe we can chat about it then?
Yes. I've gotten an invitation, so corporate approval and dumb injuries
notwithstanding, I'll be there this year. :)
--D
> --
> Jeff Layton <jlayton@kernel.org>
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-27 2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
` (2 preceding siblings ...)
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-28 1:50 ` Colin Walters
2024-02-29 20:18 ` Darrick J. Wong
3 siblings, 1 reply; 25+ messages in thread
From: Colin Walters @ 2024-02-28 1:50 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig
On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically. This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes. A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file.
It's probably worth noting that the "reflinking the source file" here
is optional, right? IOW one can just:
- open(O_TMPFILE)
- write()
- ioctl(FIEXCHANGE_RANGE)
I suspect the "simpler" non-database cases (think e.g. editors
operating on plain text files) are going to be operating on an
in-memory copy; in theory of course we could identify common ranges
and reflink, but it's not clear to me it's really worth it at the
tiny scale most source files are.
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files. For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory
More sophisticated tools already are using O_TMPFILE I would say,
just with a final last step of materializing it with a name,
and then rename() into place. So if this also
obviates the need for
https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
that seems good.
> Exchanges are atomic with regards to concurrent file opera‐
> tions, so no userspace-level locks need to be taken to obtain
> consistent results. Implementations must guarantee that read‐
> ers see either the old contents or the new contents in their
> entirety, even if the system fails.
But given that we're reusing the same inode, I don't think that can *really* be true...at least, not without higher level serialization.
A classic case today is dconf in GNOME is a basic memory-mapped database file that is atomically replaced by the "create new file, rename into place" model. Clients with mmap() view just see the old data until they reload explicitly. But with this, clients with mmap'd view *will* immediately see the new contents (because it's the same inode, right?) and that's just going to lead to possibly split reads and undefined behavior - without extra userspace serialization or locking (that more proper databases) are going to be doing.
Arguably of course, dconf is too simple and more sophisticated tools like sqlite or LMDB could make use of this. (There's some special atomic write that got added to f2fs for sqlite last I saw...I'm curious if this could replace it)
But still...it seems to me like there's going to be quite a lot of the "potentially concurrent reader, atomic replace desired" pattern and since this can't replace that, we should call that out explicitly in the man page. And also if so, then there's still a need for the linkat(AT_REPLACE) etc.
> XFS_EXCHRANGE_TO_EOF
I kept reading this as some sort of typo...would it really be too onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.? Echoes of unix "creat" here =)
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-28 1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
@ 2024-02-29 20:18 ` Darrick J. Wong
2024-02-29 22:43 ` Colin Walters
0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2024-02-29 20:18 UTC (permalink / raw)
To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig
On Tue, Feb 27, 2024 at 08:50:20PM -0500, Colin Walters wrote:
>
>
> On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically. This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes. A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.
>
> It's probably worth noting that the "reflinking the source file" here
> is optional, right? IOW one can just:
>
> - open(O_TMPFILE)
> - write()
> - ioctl(FIEXCHANGE_RANGE)
If the write() rewrites the entire file, then yes, that'll also work.
> I suspect the "simpler" non-database cases (think e.g. editors
> operating on plain text files) are going to be operating on an
> in-memory copy; in theory of course we could identify common ranges
> and reflink, but it's not clear to me it's really worth it at the
> tiny scale most source files are.
Correct, there's no built-in dedupe. For small files you'll probably
end up with a single allocation anyway, which is ideal in terms of
ondisk metadata overhead.
One advantage that EXCHANGE_RANGE has over the rename dance is that the
calling application doesn't have to copy all the file attributes and
xattrs to the temporary file before the switch.
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files. For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory
>
> More sophisticated tools already are using O_TMPFILE I would say,
> just with a final last step of materializing it with a name,
> and then rename() into place. So if this also
> obviates the need for
> https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
> that seems good.
It would, though I would bet that extending linkat (or rename, or
whatever) is going to be the only workable solution for old / simple
filesystems (e.g. fat32).
> > Exchanges are atomic with regards to concurrent file opera‐
> > tions, so no userspace-level locks need to be taken to obtain
> > consistent results. Implementations must guarantee that read‐
> > ers see either the old contents or the new contents in their
> > entirety, even if the system fails.
>
> But given that we're reusing the same inode, I don't think that can
> *really* be true...at least, not without higher level serialization.
Higher level coordination is required, yes. It doesn't have to be
serialization, though. The committing thread could signal all the other
readers that they should invalidate and restart whatever they're working
on if that work depends on the file that was COMMIT_RANGE'd. The
readers could detect unexpected data and resample mtime of the files
they've read and restart if it's changed.
> A classic case today is dconf in GNOME is a basic memory-mapped
> database file that is atomically replaced by the "create new file,
> rename into place" model. Clients with mmap() view just see the old
> data until they reload explicitly. But with this, clients with mmap'd
> view *will* immediately see the new contents (because it's the same
> inode, right?)
Correct, they'll start seeing the new contents as soon as they access
the affected pages.
How /does/ dconf handle those changes? Does it rename the file and
signal all the other dconf threads to reopen the file? And then those
threads get the new file contents?
> and that's just going to lead to possibly split reads
> and undefined behavior - without extra userspace serialization or
> locking (that more proper databases) are going to be doing.
Huurrrh hurrrh. That's right, I don't see how exchange can mesh well
with mmap without actual flock()ing. :(
fsnotify will send a message out to userspace after the exchange
finishes, which means that userspace could watch for the notifications
via fanotify. However, that's still a bit racy... :/
> Arguably of course, dconf is too simple and more sophisticated tools
> like sqlite or LMDB could make use of this. (There's some special
> atomic write that got added to f2fs for sqlite last I saw...I'm
> curious if this could replace it)
I think so:
F2FS_IOC_START_ATOMIC_WRITE -> XFS_IOC_START_COMMIT,
F2FS_IOC_COMMIT_ATOMIC_WRITE -> XFS_IOC_COMMIT_RANGE, and
F2FS_IOC_ABORT_VOLATILE_WRITE merely turns into close(temp_fd);
> But still...it seems to me like there's going to be quite a lot of the
> "potentially concurrent reader, atomic replace desired" pattern and
> since this can't replace that, we should call that out explicitly in
> the man page. And also if so, then there's still a need for the
> linkat(AT_REPLACE) etc.
Hmm, I think I'll shrink that paragraph of the manpage:
"Exchanges are atomic with regards to concurrent file operations.
Implementations must guarantee that readers see either the old contents
or the new contents in their entirety, even if the system fails."
>
> > XFS_EXCHRANGE_TO_EOF
>
> I kept reading this as some sort of typo...would it really be too
> onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.? Echoes of
> unix "creat" here =)
Yeah, I've expanded that to XFS_EXCHANGE_RANGE_TO_EOF for v29.5.
--D
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-29 20:18 ` Darrick J. Wong
@ 2024-02-29 22:43 ` Colin Walters
2024-03-01 0:03 ` Darrick J. Wong
0 siblings, 1 reply; 25+ messages in thread
From: Colin Walters @ 2024-02-29 22:43 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig
On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
>
> Correct, there's no built-in dedupe. For small files you'll probably
> end up with a single allocation anyway, which is ideal in terms of
> ondisk metadata overhead.
Makes sense.
> though I would bet that extending linkat (or rename, or
> whatever) is going to be the only workable solution for old / simple
> filesystems (e.g. fat32).
Ah, right; that too.
> How /does/ dconf handle those changes? Does it rename the file and
> signal all the other dconf threads to reopen the file? And then those
> threads get the new file contents?
I briefly skimmed the code and couldn't find it, but yes I believe it's basically that clients have an inotify watch that gets handled from the mainloop and clients close and reopen and re-mmap - it's probably nonexistent to have non-mainloop threads reading things from the mmap, so there's no races with any other threads.
>
> Huurrrh hurrrh. That's right, I don't see how exchange can mesh well
> with mmap without actual flock()ing. :(
>
> fsnotify will send a message out to userspace after the exchange
> finishes, which means that userspace could watch for the notifications
> via fanotify. However, that's still a bit racy... :/
Right. However...it's not just about mmap. Sorry this is a minor rant but...near my top ten list of changes to make with a time machine for Unix would be the concept of a contents-immutable file, like all the seals that work on memfd with F_ADD_SEALS (and outside of fsverity, which is good but can be a bit of a heavier hammer).
A few times I've been working on shell script in my editor on my desktop, and these shell scripts are tests because shell script is so tempting. I'm sure this familiar, given (x)fstests.
And if you just run the tests (directly from source in git), and then notice a bug, and start typing in your editor, save the changes, and then and your editor happens to do a generic "open(O_TRUNC), save" instead of an atomic rename. This happens to be what `nano` and VSCode do, although at least the `vi` I have here does an atomic rename. (One could say all editors that don't are broken...but...)
And now because the way bash works (and I assume other historical Unix shells) is that they interpret the file *as they're reading it* in this scenario you can get completely undefined behavior. It could do *anything*.
At least one of those times, I got an error from an `rm -rf` invocation that happened to live in one of those test scripts...that could have in theory just gone off and removed anything.
Basically the contents-immutable is really what you *always* want for executables and really anything that can be parsed without locking (like, almost all config files in /etc too). With ELF files there's EXTBUSY if it *happens* to be in use, but that's just a hack. Also in that other thread about racing writes to suid executables...well, there'd be no possibility for races if we just denied writing because again - it makes no sense to just make random writes in-place to an executable. (OK I did see the zig folks are trying an incremental linker, but still I would just assume reflinks are available for that)
Now this is relevant here because, I don't think anything like dpkg/rpm and all those things could ever use this ioctl for this reason.
So, it seems to me like it should really be more explicitly targeted at
- Things that are using open()+write() today and it's safe for that use case
- The database cases
And not talk about replacing the general open(O_TMPFILE) + rename() path.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
2024-02-29 22:43 ` Colin Walters
@ 2024-03-01 0:03 ` Darrick J. Wong
0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2024-03-01 0:03 UTC (permalink / raw)
To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig
On Thu, Feb 29, 2024 at 05:43:21PM -0500, Colin Walters wrote:
>
>
> On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
> >
> > Correct, there's no built-in dedupe. For small files you'll probably
> > end up with a single allocation anyway, which is ideal in terms of
> > ondisk metadata overhead.
>
> Makes sense.
>
> > though I would bet that extending linkat (or rename, or
> > whatever) is going to be the only workable solution for old / simple
> > filesystems (e.g. fat32).
>
> Ah, right; that too.
>
> > How /does/ dconf handle those changes? Does it rename the file and
> > signal all the other dconf threads to reopen the file? And then those
> > threads get the new file contents?
>
> I briefly skimmed the code and couldn't find it, but yes I believe
> it's basically that clients have an inotify watch that gets handled
> from the mainloop and clients close and reopen and re-mmap - it's
> probably nonexistent to have non-mainloop threads reading things from
> the mmap, so there's no races with any other threads.
Hrmm. IIRC inotify and fanotify both use the same fsnotify backend.
fsnotify events are emitted after i_rwsem drops, which (if I read
read_write.c correctly) means this is technically racy.
That said if they're mostly waiting around in the inotify loop then it
probably doesn't matter.
> > Huurrrh hurrrh. That's right, I don't see how exchange can mesh well
> > with mmap without actual flock()ing. :(
> >
> > fsnotify will send a message out to userspace after the exchange
> > finishes, which means that userspace could watch for the notifications
> > via fanotify. However, that's still a bit racy... :/
>
> Right. However...it's not just about mmap. Sorry this is a minor
> rant but...near my top ten list of changes to make with a time machine
> for Unix would be the concept of a contents-immutable file, like all
> the seals that work on memfd with F_ADD_SEALS (and outside of
> fsverity, which is good but can be a bit of a heavier hammer).
You and me both. :)
Also I want a persistent file contents write counter; and a
file-anything write counter.
Oh, and a conditional read where you pass in the file contents write
counter and returns an error if the file has been changed since sampling
time. The changecookie thing mentioned elsewhere gets us towards that,
if onlty the issues w/ XFS get resolved.
> A few times I've been working on shell script in my editor on my
> desktop, and these shell scripts are tests because shell script is so
> tempting. I'm sure this familiar, given (x)fstests.
>
> And if you just run the tests (directly from source in git), and then
> notice a bug, and start typing in your editor, save the changes, and
> then and your editor happens to do a generic "open(O_TRUNC), save"
> instead of an atomic rename. This happens to be what `nano` and
> VSCode do, although at least the `vi` I have here does an atomic
> rename. (One could say all editors that don't are broken...but...)
I think they do O_TRUNC because it saves them from having to copy the
file attrs and xattrs. Too bad it severely screws up a program running
in another terminal that just happens to hit the zero-byte file.
> And now because the way bash works (and I assume other historical Unix
> shells) is that they interpret the file *as they're reading it* in
> this scenario you can get completely undefined behavior. It could do
> *anything*.
>
> At least one of those times, I got an error from an `rm -rf`
> invocation that happened to live in one of those test scripts...that
> could have in theory just gone off and removed anything.
>
> Basically the contents-immutable is really what you *always* want for
> executables and really anything that can be parsed without locking
> (like, almost all config files in /etc too). With ELF files there's
Yes.
> EXTBUSY if it *happens* to be in use, but that's just a hack. Also in
A hack that doesn't work for scripts. Either interpreters have to read
the entire script into memory before execution, or I guess they can do
the insane thing that the DOS batch interpreter did, where before each
statement it would save the file pos, close it, execute the command,
reopen the batch file, and seek back to that line.
> that other thread about racing writes to suid executables...well,
> there'd be no possibility for races if we just denied writing because
> again - it makes no sense to just make random writes in-place to an
> executable. (OK I did see the zig folks are trying an incremental
> linker, but still I would just assume reflinks are available for that)
>
> Now this is relevant here because, I don't think anything like
> dpkg/rpm and all those things could ever use this ioctl for this
> reason.
Right. dpkg executable file replacement really doesn't make much sense
for exchange range. That's also wasn't the usecase I was targetting
though admittedly I'm only using this ioctl to test functionality that
online fsck requires.
> So, it seems to me like it should really be more explicitly targeted at
> - Things that are using open()+write() today and it's safe for that use case
> - The database cases
>
> And not talk about replacing the general open(O_TMPFILE) + rename() path.
I think I'll change the cover letter to talk about what it does, what
problems it solves, and what problems it introduces. Figuring out how
to take advantage of it is an exercise for application writers.
--D
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2024-03-07 23:25 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-27 2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
2024-02-27 2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
2024-02-28 15:40 ` Christoph Hellwig
2024-02-27 9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
2024-02-27 10:53 ` Jeff Layton
2024-02-27 16:06 ` Darrick J. Wong
2024-03-01 13:16 ` Jeff Layton
2024-02-27 23:46 ` Dave Chinner
2024-02-28 10:30 ` Jeff Layton
2024-02-28 10:58 ` Amir Goldstein
2024-02-28 11:01 ` Jeff Layton
2024-02-27 15:45 ` Darrick J. Wong
2024-02-27 16:58 ` Amir Goldstein
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
2024-02-27 18:52 ` Amir Goldstein
2024-02-29 23:27 ` Darrick J. Wong
2024-03-01 13:00 ` Amir Goldstein
2024-03-01 13:31 ` Jeff Layton
2024-03-02 2:48 ` Darrick J. Wong
2024-03-02 12:43 ` Jeff Layton
2024-03-07 23:25 ` Darrick J. Wong
2024-02-28 1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
2024-02-29 20:18 ` Darrick J. Wong
2024-02-29 22:43 ` Colin Walters
2024-03-01 0:03 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).