From: "Darrick J. Wong" <djwong@kernel.org>
To: Christoph Hellwig <hch@lst.de>
Cc: John Garry <john.g.garry@oracle.com>,
viro@zeniv.linux.org.uk, brauner@kernel.org, dchinner@redhat.com,
jack@suse.cz, chandan.babu@oracle.com,
martin.petersen@oracle.com, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
tytso@mit.edu, jbongio@google.com, ojaswin@linux.ibm.com
Subject: Re: [PATCH 0/6] block atomic writes for XFS
Date: Wed, 21 Feb 2024 08:56:15 -0800
Message-ID: <20240221165615.GH6184@frogsfrogsfrogs>
In-Reply-To: <20240214074559.GB10006@lst.de>

On Wed, Feb 14, 2024 at 08:45:59AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 13, 2024 at 09:55:49AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> > > From reading the series and the discussions with Darrick and Dave
> > > I'm coming more and more back to my initial position that tying this
> > > user visible feature to hardware limits is wrong and will just keep
> > > on creating ever more painpoints in the future.
> > >
> > > Based on that I suspect that doing proper software only atomic writes
> > > using the swapext log item and selective always COW mode
> >
> > Er, what are you thinking w.r.t. swapext and sometimescow?
>
> What do you mean with sometimescow? Just normal reflinked inodes?
>
> > swapext
> > doesn't currently handle COW forks at all, and it can only exchange
> > between two of the same type of fork (e.g. both data forks or both attr
> > forks, no mixing).
> >
> > Or will that be your next suggestion whenever I get back to fiddling
> > with the online fsck patches? ;)
>
> Let's take a step back. If we want atomic write semantics without
> hardware offload, what we need is to allocate new blocks and atomically
> swap them into the data fork. Basically an atomic version of
> xfs_reflink_end_cow. But yes, the details of the current swapext
> item might not be an exact fit, maybe it's just shared infrastructure
> and concepts.

Hmm. For rt reflink (whenever I get back to that, ha) I've been
starting to think that yes, we actually /do/ want to have a log item
that tracks the progress of remap and cow operations. That would solve
the problem of someone wanting to reflink a semi-written rtx.

That said, it might complicate the reflink code quite a bit since right
now it writes zeroes to the unwritten parts of an rt file's rtx so that
there's only one mapping record for the whole rtx, and then it remaps
them. That's most of why I haven't bothered to implement that solution.

> I'm not planning to make you do it, because such a log item would
> generally be pretty useful for always COW mode.

One other thing -- while I was refactoring the swapext code into
exch{range,maps}, it occurred to me that doing an exchange between the
cow and data forks isn't possible because log recovery won't be able to
do anything. There's no ondisk metadata to map a cow staging extent
back to the file it came from, which means we can't generally resume an
exchange operation.

However for a small write I guess you could simply queue all the log
intent items for all the changes needed and commit that.
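
As a rough illustration of that last idea, here is a toy userspace model
in Python. It is not XFS code: ToyLog, RemapIntent and
atomic_small_write are made-up names, the "log" is just an in-memory
list, and the crash is simulated with a flag. The only point it tries to
show is that if every remap intent for a small write goes into one
committed transaction, recovery sees either all of the changes or none
of them, and each intent names the file block itself, so replay never
has to look up which file a staging extent belongs to.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RemapIntent:
    """Hypothetical intent: point one file block at a newly written block."""
    file_block: int
    new_block: int


@dataclass
class ToyLog:
    """Toy journal: a transaction counts only once its commit record lands."""
    records: list = field(default_factory=list)

    def commit(self, intents, crash_before_commit=False):
        tx = list(intents)
        if crash_before_commit:
            # Simulate losing power before the commit record is written:
            # nothing from this transaction is recoverable.
            return
        # All intents for the write travel in one committed transaction.
        self.records.append(tx)

    def recover(self, mapping):
        """Replay every committed transaction onto a copy of the mapping."""
        result = dict(mapping)
        for tx in self.records:
            for intent in tx:
                result[intent.file_block] = intent.new_block
        return result


def atomic_small_write(log, file_blocks, new_blocks, crash_before_commit=False):
    """Queue one intent per block, then commit them all together."""
    intents = [RemapIntent(fb, nb) for fb, nb in zip(file_blocks, new_blocks)]
    log.commit(intents, crash_before_commit=crash_before_commit)
```

In this model a write that "crashes" before commit leaves the old
mapping untouched after recovery, while a committed write remaps every
block at once. How many intents fit in a single real transaction is a
separate question that the sketch does not try to answer.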
> > > and making that
> > > work should be the first step. We can then avoid that overhead for
> > > properly aligned writes if the hardware supports it. For your Oracle
> > > DB loads you'll set the alignment hints and maybe even check with
> > > fiemap that everything is fine and will get the offload, but we also
> > > provide a nice and useful API for less performance critical applications
> > > that don't have to care about all these details.
> >
> > I suspect they might want to fail-fast (back to standard WAL mode or
> > whatever) if the hardware support isn't available.
>
> Maybe for your particular DB use case. But there's plenty of
> applications that just want atomic writes without building their
> own infrastructure, including some that want pretty large chunks.
>
> Also if a file system supports logging data (which I have an
> XFS early prototype for that I plan to finish), we can even do
> the small double writes more efficiently than the application,
> all through the same interface.

Heh. Ted's been trying to kill data=journal. Now we've found a use for
it after all. :)

--D