From: Jeff Layton <jlayton@kernel.org>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Amir Goldstein <amir73il@gmail.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	hch@lst.de
Subject: Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
Date: Fri, 01 Mar 2024 08:16:44 -0500
Message-ID: <6a0e108a26b57402ed6ed0fc58fb640b5dadb400.camel@kernel.org>
In-Reply-To: <20240227160658.GW616564@frogsfrogsfrogs>

On Tue, 2024-02-27 at 08:06 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > > ranges of bytes between two files atomically. This new functionality
> > > > enables data storage programs to stage and commit file updates such that
> > > > reader programs will see either the old contents or the new contents in
> > > > their entirety, with no chance of torn writes. A successful call
> > > > completion guarantees that the new contents will be seen even if the
> > > > system fails.
> > > >
> > > > The ability to exchange file fork mappings between files in this manner
> > > > is critical to supporting online filesystem repair, which is built upon
> > > > the strategy of constructing a clean copy of a damaged structure and
> > > > committing the new structure into the metadata file atomically.
> > > >
> > > > User programs will be able to update files atomically by opening an
> > > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > > they want to make, and exchange the relevant ranges of the temp file
> > > > with the original file. If the updates are aligned with the file block
> > > > size, a new (since v2) flag provides for exchanging only the written
> > > > areas. Callers can arrange for the update to be rejected if the
> > > > original file has been changed.
> > > >
> > > > The intent behind this new userspace functionality is to enable atomic
> > > > rewrites of arbitrary parts of individual files. For years, application
> > > > programmers wanting to ensure the atomicity of a file update had to
> > > > write the changes to a new file in the same directory, fsync the new
> > > > file, rename the new file on top of the old filename, and then fsync the
> > > > directory. People get it wrong all the time, and $fs hacks abound.
> > > > Here are the proposed manual pages:
> > > >
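
For anyone following along, the rename-based sequence described above
looks roughly like the sketch below. This is only an illustration:
replace_file(), write_all(), and the caller-supplied temporary name are
names made up here, and real code would also want a unique temporary
name and better error reporting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace dir/name with buf by staging the data in
 * dir/tmpname. Returns 0 on success, -1 with errno set on failure.
 */
int replace_file(const char *dir, const char *name, const char *tmpname,
		 const char *buf, size_t len)
{
	int dfd, fd, saved, ret = -1;

	assert(dir && name && tmpname && buf);

	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd < 0)
		return -1;

	/* 1. Write the new contents to a new file in the same directory. */
	fd = openat(dfd, tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		goto out;

	/* 2. fsync the new file before it becomes visible under 'name'. */
	if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
		saved = errno;
		close(fd);
		unlinkat(dfd, tmpname, 0);
		errno = saved;
		goto out;
	}
	close(fd);

	/* 3. Rename the new file on top of the old filename. */
	if (renameat(dfd, tmpname, dfd, name) < 0) {
		saved = errno;
		unlinkat(dfd, tmpname, 0);
		errno = saved;
		goto out;
	}

	/* 4. fsync the directory so the rename itself is persistent. */
	if (fsync(dfd) == 0)
		ret = 0;
out:
	saved = errno;
	close(dfd);
	errno = saved;
	return ret;
}
```

Skipping either fsync is the usual way to get this wrong, which is the
point the cover letter is making.
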
> >
> > This is a cool idea! I've had some handwavy ideas about making a gated
> > write() syscall (i.e. only write if the change cookie hasn't changed),
> > but something like this may be a simpler lift initially.
>
> How /does/ userspace get at the change cookie nowadays?
>

Today, it doesn't. That would need to be exposed before we could make
that work.

> > > > IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2)
> > > >
> > > > NAME
> > > > ioctl_xfs_exchange_range - exchange the contents of parts of
> > > > two files
> > > >
> > > > SYNOPSIS
> > > > #include <sys/ioctl.h>
> > > > #include <xfs/xfs_fs_staging.h>
> > > >
> > > > int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct
> > > > xfs_exch_range *arg);
> > > >
> > > > DESCRIPTION
> > > > Given a range of bytes in a first file file1_fd and a second
> > > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > > changes the contents of the two ranges.
> > > >
> > > > Exchanges are atomic with regard to concurrent file operations,
> > > > so no userspace-level locks need to be taken to obtain
> > > > consistent results. Implementations must guarantee that read‐
> > > > ers see either the old contents or the new contents in their
> > > > entirety, even if the system fails.
> > > >
> > > > The system call parameters are conveyed in structures of the
> > > > following form:
> > > >
> > > > struct xfs_exch_range {
> > > > __s64 file1_fd;
> > > > __s64 file1_offset;
> > > > __s64 file2_offset;
> > > > __s64 length;
> > > > __u64 flags;
> > > >
> > > > __u64 pad;
> > > > };
> > > >
> > > > The field pad must be zero.
> > > >
> > > > The fields file1_fd, file1_offset, and length define the first
> > > > range of bytes to be exchanged.
> > > >
> > > > The fields file2_fd, file2_offset, and length define the second
> > > > range of bytes to be exchanged.
> > > >
> > > > Both files must be from the same filesystem mount. If the two
> > > > file descriptors represent the same file, the byte ranges must
> > > > not overlap. Most disk-based filesystems require that the
> > > > starts of both ranges must be aligned to the file block size.
> > > > If this is the case, the ends of the ranges must also be so
> > > > aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > >
> > > > The flags field controls the behavior of the exchange operation.
> > > >
> > > > XFS_EXCHRANGE_TO_EOF
> > > > Ignore the length parameter. All bytes in file1_fd
> > > > from file1_offset to EOF are moved to file2_fd, and
> > > > file2's size is set to (file2_offset+(file1_length-
> > > > file1_offset)). Meanwhile, all bytes in file2 from
> > > > file2_offset to EOF are moved to file1 and file1's
> > > > size is set to (file1_offset+(file2_length-
> > > > file2_offset)).
> > > >
> > > > XFS_EXCHRANGE_DSYNC
> > > > Ensure that all modified in-core data in both file
> > > > ranges and all metadata updates pertaining to the
> > > > exchange operation are flushed to persistent storage
> > > > before the call returns. Opening either file de‐
> > > > scriptor with O_SYNC or O_DSYNC will have the same
> > > > effect.
> > > >
> > > > XFS_EXCHRANGE_FILE1_WRITTEN
> > > > Only exchange sub-ranges of file1_fd that are known
> > > > to contain data written by application software.
> > > > Each sub-range may be expanded (both upwards and
> > > > downwards) to align with the file allocation unit.
> > > > For files on the data device, this is one filesystem
> > > > block. For files on the realtime device, this is
> > > > the realtime extent size. This facility can be used
> > > > to implement fast atomic scatter-gather writes of
> > > > any complexity for software-defined storage targets
> > > > if all writes are aligned to the file allocation
> > > > unit.
> > > >
> > > > XFS_EXCHRANGE_DRY_RUN
> > > > Check the parameters and the feasibility of the op‐
> > > > eration, but do not change anything.
> > > >
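
A quick worked example of the XFS_EXCHRANGE_TO_EOF size arithmetic
quoted above, as I read it. The struct and helper names are
hypothetical, and this is plain arithmetic rather than anything taken
from the patches:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two post-exchange file sizes. */
struct eof_sizes {
	int64_t file1_size;
	int64_t file2_size;
};

/*
 * Compute the sizes both files end up with after a TO_EOF exchange,
 * following the formulas in the quoted man page text.
 */
struct eof_sizes exch_to_eof_sizes(int64_t file1_size, int64_t file1_offset,
				   int64_t file2_size, int64_t file2_offset)
{
	struct eof_sizes out;

	/* Offsets must lie within their files for the formulas to make sense. */
	assert(file1_offset >= 0 && file1_offset <= file1_size);
	assert(file2_offset >= 0 && file2_offset <= file2_size);

	/* file1 keeps its head and receives file2's tail. */
	out.file1_size = file1_offset + (file2_size - file2_offset);
	/* file2 keeps its head and receives file1's tail. */
	out.file2_size = file2_offset + (file1_size - file1_offset);
	return out;
}
```

With file1 at 12288 bytes exchanged from offset 4096, and file2 at 40960
bytes exchanged from offset 8192, file1 ends up at 36864 bytes and file2
at 16384 bytes.
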
> > > > RETURN VALUE
> > > > On error, -1 is returned, and errno is set to indicate the er‐
> > > > ror.
> > > >
> > > > ERRORS
> > > > Error codes can be one of, but are not limited to, the follow‐
> > > > ing:
> > > >
> > > > EBADF file1_fd is not open for reading and writing or is open
> > > > for append-only writes; or file2_fd is not open for
> > > > reading and writing or is open for append-only writes.
> > > >
> > > > EINVAL The parameters are not correct for these files. This
> > > > error can also appear if either file descriptor repre‐
> > > > sents a device, FIFO, or socket. Disk filesystems gen‐
> > > > erally require the offset and length arguments to be
> > > > aligned to the fundamental block sizes of both files.
> > > >
> > > > EIO An I/O error occurred.
> > > >
> > > > EISDIR One of the files is a directory.
> > > >
> > > > ENOMEM The kernel was unable to allocate sufficient memory to
> > > > perform the operation.
> > > >
> > > > ENOSPC There is not enough free space in the filesystem to
> > > > exchange the contents safely.
> > > >
> > > > EOPNOTSUPP
> > > > The filesystem does not support exchanging bytes between
> > > > the two files.
> > > >
> > > > EPERM file1_fd or file2_fd are immutable.
> > > >
> > > > ETXTBSY
> > > > One of the files is a swap file.
> > > >
> > > > EUCLEAN
> > > > The filesystem is corrupt.
> > > >
> > > > EXDEV file1_fd and file2_fd are not on the same mounted
> > > > filesystem.
> > > >
> > > > CONFORMING TO
> > > > This API is XFS-specific.
> > > >
> > > > USE CASES
> > > > Several use cases are imagined for this system call. In all
> > > > cases, application software must coordinate updates to the file
> > > > because the exchange is performed unconditionally.
> > > >
> > > > The first is a data storage program that wants to commit non-
> > > > contiguous updates to a file atomically and coordinates write
> > > > access to that file. This can be done by creating a temporary
> > > > file, calling FICLONE(2) to share the contents, and staging the
> > > > updates into the temporary file. The XFS_EXCHRANGE_TO_EOF flag is
> > > > recommended for this purpose. The temporary file can be deleted or
> > > > punched out afterwards.
> > > >
> > > > An example program might look like this:
> > > >
> > > > int fd = open("/some/file", O_RDWR);
> > > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > >
> > > > ioctl(temp_fd, FICLONE, fd);
> > > >
> > > > /* append 1MB of records */
> > > > lseek(temp_fd, 0, SEEK_END);
> > > > write(temp_fd, data1, 1000000);
> > > >
> > > > /* update record index */
> > > > pwrite(temp_fd, data1, 600, 98765);
> > > > pwrite(temp_fd, data2, 320, 54321);
> > > > pwrite(temp_fd, data2, 15, 0);
> > > >
> > > > /* commit the entire update */
> > > > struct xfs_exch_range args = {
> > > > .file1_fd = temp_fd,
> > > > .flags = XFS_EXCHRANGE_TO_EOF,
> > > > };
> > > >
> > > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > >
> > > > The second is a software-defined storage host (e.g. a disk
> > > > jukebox) which implements an atomic scatter-gather write com‐
> > > > mand. Provided the exported disk's logical block size matches
> > > > the file's allocation unit size, this can be done by creating a
> > > > temporary file and writing the data at the appropriate offsets.
> > > > It is recommended that the temporary file be truncated to the
> > > > size of the regular file before any writes are staged to the
> > > > temporary file to avoid issues with zeroing during EOF exten‐
> > > > sion. Use this call with the FILE1_WRITTEN flag to exchange
> > > > only the file allocation units involved in the emulated de‐
> > > > vice's write command. The temporary file should be truncated
> > > > or punched out completely before being reused to stage another
> > > > write.
> > > >
> > > > An example program might look like this:
> > > >
> > > > int fd = open("/some/file", O_RDWR);
> > > > int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > > struct stat sb;
> > > > int blksz;
> > > >
> > > > fstat(fd, &sb);
> > > > blksz = sb.st_blksize;
> > > >
> > > > /* land scatter gather writes between 100fsb and 500fsb */
> > > > pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > > pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > > pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > >
> > > > /* commit the entire update */
> > > > struct xfs_exch_range args = {
> > > > .file1_fd = temp_fd,
> > > > .file1_offset = blksz * 100,
> > > > .file2_offset = blksz * 100,
> > > > .length = blksz * 400,
> > > > .flags = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > > XFS_EXCHRANGE_DSYNC,
> > > > };
> > > >
> > > > ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > >
> > > > NOTES
> > > > Some filesystems may limit the amount of data or the number of
> > > > extents that can be exchanged in a single call.
> > > >
> > > > SEE ALSO
> > > > ioctl(2)
> > > >
> > > > XFS 2024-02-10 IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > IOCTL-XFS-COMMIT-RANGE(2) System Calls Manual IOCTL-XFS-COMMIT-RANGE(2)
> > > >
> > > > NAME
> > > > ioctl_xfs_commit_range - conditionally exchange the contents of
> > > > parts of two files
> > > >
> > > > SYNOPSIS
> > > > #include <sys/ioctl.h>
> > > > #include <xfs/xfs_fs_staging.h>
> > > >
> > > > int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,
> > > > struct xfs_commit_range *arg);
> > > >
> > > > DESCRIPTION
> > > > Given a range of bytes in a first file file1_fd and a second
> > > > range of bytes in a second file file2_fd, this ioctl(2) ex‐
> > > > changes the contents of the two ranges if file2_fd passes cer‐
> > > > tain freshness criteria.
> > > >
> > > > After locking both files but before exchanging the contents,
> > > > the supplied file2_ino field must match file2_fd's inode num‐
> > > > ber, and the supplied file2_mtime, file2_mtime_nsec,
> > > > file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > > cation time and change time of file2. If they do not match,
> > > > EBUSY will be returned.
> > > >
> > >
> > > Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?
> > >
> >
> > ctime should always change if the mtime does. An mtime update means that
> > the metadata was updated, so you also need to update the ctime.
>
> Exactly. :)
>
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > >
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > >
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > >
> >
> >
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
>
> TLDR: No.
>
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
>
> Dave says that about a lot of things. I'm not willing to delay the
> online fsck project _even further_ for a bunch of vaporware that's not
> even out on linux-xfs for review.
>
> The difference in opinion between xfs and the rest of the kernel about
> i_version is 50% of why I didn't include it here. The other 50% is the
> part where userspace can't access it, because I do not want to saddle my
> mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
> CHANGES.

Ouch, point taken.

I just have grave concerns about using something as coarse-grained as
the timestamps to gate changes to a file. With modern machines, a single
timestamp can represent a large number of different states of the file's
contents. Is that not a problem here?
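
To make the concern concrete, here is a rough sketch of a
timestamp-based freshness check of the kind the quoted man page
describes. It is my own illustration, not code from the patchset: the
fresh_sample struct and the fresh_take()/fresh_matches() helpers are
hypothetical, and only fstat(2) and the st_ino/st_mtim/st_ctim fields
are real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical snapshot of what a timestamp-gated commit would compare. */
struct fresh_sample {
	ino_t ino;
	struct timespec mtime_ts;
	struct timespec ctime_ts;
};

static bool ts_equal(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Record inode number, mtime and ctime of fd; 0 on success, -1 on error. */
int fresh_take(int fd, struct fresh_sample *out)
{
	struct stat sb;

	assert(out != NULL);
	if (fstat(fd, &sb) < 0)
		return -1;
	out->ino = sb.st_ino;
	out->mtime_ts = sb.st_mtim;
	out->ctime_ts = sb.st_ctim;
	return 0;
}

/*
 * True if the two samples cannot be told apart. Two different file
 * states written within one timestamp tick would also compare equal,
 * which is the case I am worried about.
 */
bool fresh_matches(const struct fresh_sample *a, const struct fresh_sample *b)
{
	return a->ino == b->ino &&
	       ts_equal(&a->mtime_ts, &b->mtime_ts) &&
	       ts_equal(&a->ctime_ts, &b->ctime_ts);
}
```

A caller would take one sample before staging its update and compare it
against a fresh sample at commit time. If the filesystem's timestamp
granularity is coarser than the interval between two writes, both
samples match even though the contents changed in between.
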
--
Jeff Layton <jlayton@kernel.org>