From: "Colin Walters" <walters@verbum.org>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org, xfs <linux-xfs@vger.kernel.org>,
"Christoph Hellwig" <hch@lst.de>
Subject: Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
Date: Tue, 27 Feb 2024 20:50:20 -0500
Message-ID: <87961163-a4b9-4032-aa06-f5126c9c8ca2@app.fastmail.com>
In-Reply-To: <170900011604.938268.9876750689883987904.stgit@frogsfrogsfrogs>
On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically. This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes. A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file.
It's probably worth noting that the "reflinking the source file" here
is optional, right? IOW one can just:
- open(O_TMPFILE)
- write()
- ioctl(FIEXCHANGE_RANGE)
I suspect the "simpler" non-database cases (think e.g. editors
operating on plain text files) are going to be operating on an
in-memory copy; in theory of course we could identify common ranges
and reflink, but it's not clear to me it's really worth it at the
tiny scale of most source files.
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files. For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory
More sophisticated tools are already using O_TMPFILE, I would say,
just with a final step of materializing it with a name and then
rename()-ing it into place. So if this also obviates the need for
https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
that seems good.
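For reference, the existing O_TMPFILE-then-rename pattern mentioned above looks roughly like the sketch below. It is only an illustration, not code from the patchset: the helper name replace_file_via_tmpfile and the ".NAME.tmp.PID" temporary name are made up here, and it assumes Linux with /proc mounted and a filesystem that supports O_TMPFILE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): replace DIR/NAME with LEN bytes
 * from BUF using O_TMPFILE + linkat() + rename().
 * Returns 0 on success, -1 on failure with errno set.
 */
static int replace_file_via_tmpfile(const char *dir, const char *name,
                                    const void *buf, size_t len)
{
    char procpath[64], tmppath[4096], dstpath[4096];
    const char *p = buf;
    size_t left = len;
    int fd, saved;

    /* Unnamed file on the same filesystem as the target directory. */
    fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    /* Write the whole buffer, retrying on short writes and EINTR. */
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        left -= (size_t)n;
    }
    if (fsync(fd) < 0)
        goto fail;

    /* Materialize the unnamed file under a temporary name. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    snprintf(tmppath, sizeof(tmppath), "%s/.%s.tmp.%ld",
             dir, name, (long)getpid());
    snprintf(dstpath, sizeof(dstpath), "%s/%s", dir, name);
    if (linkat(AT_FDCWD, procpath, AT_FDCWD, tmppath, AT_SYMLINK_FOLLOW) < 0)
        goto fail;

    /* Atomically swing the final name over to the new inode. */
    if (rename(tmppath, dstpath) < 0) {
        saved = errno;
        unlink(tmppath);
        errno = saved;
        goto fail;
    }
    close(fd);
    return 0;

fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

A caller would do something like replace_file_via_tmpfile("/etc/myapp", "config", data, size). Note that the temporary name is visible in the directory for a short window between linkat() and rename(), which is the awkward part of this approach.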
> Exchanges are atomic with regards to concurrent file operations,
> so no userspace-level locks need to be taken to obtain consistent
> results. Implementations must guarantee that readers see either
> the old contents or the new contents in their entirety, even if
> the system fails.
But given that we're reusing the same inode, I don't think that can *really* be true...at least, not without higher level serialization.
A classic case today is dconf in GNOME, a basic memory-mapped database file that is atomically replaced by the "create new file, rename into place" model. Clients with an mmap() view just see the old data until they reload explicitly. But with this, clients with an mmap'd view *will* immediately see the new contents (because it's the same inode, right?), and that's just going to lead to possibly split reads and undefined behavior without the extra userspace serialization or locking that more proper databases are going to be doing.
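To make the rename side of that concrete, here is a small sketch of the behavior dconf-style readers rely on today: a mapping taken before the replacement keeps pointing at the old inode. The function names are invented for illustration, and this only demonstrates the rename() model; it does not exercise the new exchange ioctl, where (if I understand correctly) the same mapping would refer to the inode whose contents get swapped.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create/truncate PATH and write TEXT into it. */
static int write_file(const char *path, const char *text)
{
    size_t len = strlen(text);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/*
 * Hypothetical demo: returns 1 if a mapping created before a rename()-based
 * replacement still shows the old bytes while a fresh open() of the same
 * path sees the new ones, 0 if not, and -1 on error.
 */
static int old_mapping_survives_rename(const char *path, const char *tmppath)
{
    char fresh[4] = {0};
    int result = -1;

    if (write_file(path, "old") < 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char *map = mmap(NULL, 3, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping keeps the old inode alive */
    if (map == MAP_FAILED)
        return -1;

    /* Replace the file the classic way: new inode, rename into place. */
    if (write_file(tmppath, "new") == 0 && rename(tmppath, path) == 0) {
        int fd2 = open(path, O_RDONLY);
        if (fd2 >= 0) {
            if (read(fd2, fresh, 3) == 3)
                result = memcmp(map, "old", 3) == 0 &&
                         memcmp(fresh, "new", 3) == 0;
            close(fd2);
        }
    }
    munmap(map, 3);
    return result;
}
```

With rename(), the reader decides when to pick up the new data by reopening and remapping. That is the property an in-place exchange on the same inode would not give such clients without extra coordination.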
Arguably, of course, dconf is too simple, and more sophisticated tools like SQLite or LMDB could make use of this. (There's some special atomic write that got added to f2fs for SQLite, last I saw... I'm curious whether this could replace it.)
But still... it seems to me there's going to be quite a lot of the "potentially concurrent reader, atomic replace desired" pattern, and since this can't replace that, we should call that out explicitly in the man page. And if so, then there's still a need for linkat(AT_REPLACE) etc.
> XFS_EXCHRANGE_TO_EOF
I kept reading this as some sort of typo... would it really be too onerous to spell it out as, e.g., XFS_EXCHANGE_RANGE_TO_EOF? Echoes of Unix "creat" here =)