* Re: XFS reflink overhead, ioctl(FICLONE)
[not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com>
@ 2022-12-13 17:18 ` Darrick J. Wong
2022-12-14 1:46 ` Terence Kelly
2022-12-14 4:47 ` Suyash Mahar
0 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-12-13 17:18 UTC (permalink / raw)
To: Suyash Mahar; +Cc: linux-xfs, tpkelly, Suyash Mahar
[ugh, your email never made it to the list. I bet the email security
standards have been tightened again. <insert rant about dkim and dmarc
silent failures here>] :(
On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> Hi all!
>
> While using XFS's ioctl(FICLONE), we found that XFS seems to have
> poor performance (ioctl takes milliseconds for sparse files) and the
> overhead
> increases with every call.
>
> For the demo, we are using an Optane DC-PMM configured as a
> block device (fsdax) and running XFS (Linux v5.18.13).
How are you using fsdax and reflink on a 5.18 kernel? That combination
of features wasn't supported until 6.0, and the data corruption problems
won't get fixed until a pull request that's about to happen for 6.2.
> We create a 1 GiB dense file, then repeatedly modify a tiny random
> fraction of it and make a clone via ioctl(FICLONE).
Yay, random cow writes, that will slowly increase the number of space
mapping records in the file metadata.
> The time required for the ioctl() calls increases from large to insane
> over the course of ~250 iterations: From roughly a millisecond for the
> first iteration or two (which seems high, given that this is on
> Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> milliseconds (which seems crazy).
Does the system call runtime increase with O(number_extents)? You might
record the number of extents in the file you're cloning by running this
periodically:
xfs_io -c stat $path | grep fsxattr.nextents
FICLONE (at least on XFS) persists dirty pagecache data to disk, and
then duplicates all written-space mapping records from the source file to
the destination file. It skips preallocated mappings created with
fallocate.
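[For reference, a minimal C sketch of the FICLONE call under discussion. The filenames, helper name, and error handling are illustrative, not taken from the thread:]

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */

/* Clones src into dst with a single FICLONE call: dst ends up sharing
 * all of src's written extents.  Returns 0 on success, else the errno
 * value (e.g. EOPNOTSUPP on a filesystem without reflink support). */
int clone_file(const char *src, const char *dst)
{
    int sfd = open(src, O_RDONLY);
    if (sfd < 0)
        return errno;
    int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dfd < 0) {
        int e = errno;
        close(sfd);
        return e;
    }
    /* The clone is atomic: it either fully succeeds or fails. */
    int e = ioctl(dfd, FICLONE, sfd) ? errno : 0;
    close(dfd);
    close(sfd);
    return e;
}
```

Note that, as described above, the kernel flushes dirty pagecache for the source file before sharing its mappings, so the call's cost is not purely in-memory.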
So yes, the plot is exactly what I was expecting.
--D
> The plot is attached to this email.
>
> A cursory look at the extent map suggests that it gets increasingly
> complicated, which would explain the growing cost.
>
> The enclosed tarball contains our code, our results, and some other info
> like a flame graph that might shed light on where the ioctl is spending
> its time.
>
> - Suyash & Terence
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
@ 2022-12-14 1:46 ` Terence Kelly
2022-12-14 4:47 ` Suyash Mahar
1 sibling, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-14 1:46 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Suyash Mahar, linux-xfs, Suyash Mahar
Hi Darrick,
Thanks for your quick and detailed reply.
The thing that really puzzled me when I re-ran Suyash's experiments on a
DRAM-backed file system is that the ioctl(FICLONE) calls were still very
very slow. A slow block storage device can't be blamed, because there
wasn't a slow block storage device anywhere in the picture; the slowness
came from software.
Suyash, can you send those results?
-- Terence Kelly
On Tue, 13 Dec 2022, Darrick J. Wong wrote:
> FICLONE (at least on XFS) persists dirty pagecache data to disk, and
> then duplicates all written-space mapping records from the source file
> to the destination file. It skips preallocated mappings created with
> fallocate.
>
> So yes, the plot is exactly what I was expecting.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
2022-12-14 1:46 ` Terence Kelly
@ 2022-12-14 4:47 ` Suyash Mahar
2022-12-15 0:19 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Suyash Mahar @ 2022-12-14 4:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, tpkelly, Suyash Mahar
Hi Darrick,
Thank you for the response. I have replied inline.
-Suyash
Le mar. 13 déc. 2022 à 09:18, Darrick J. Wong <djwong@kernel.org> a écrit :
>
> [ugh, your email never made it to the list. I bet the email security
> standards have been tightened again. <insert rant about dkim and dmarc
> silent failures here>] :(
>
> On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> > Hi all!
> >
> > While using XFS's ioctl(FICLONE), we found that XFS seems to have
> > poor performance (ioctl takes milliseconds for sparse files) and the
> > overhead
> > increases with every call.
> >
> > For the demo, we are using an Optane DC-PMM configured as a
> > block device (fsdax) and running XFS (Linux v5.18.13).
>
> How are you using fsdax and reflink on a 5.18 kernel? That combination
> of features wasn't supported until 6.0, and the data corruption problems
> won't get fixed until a pull request that's about to happen for 6.2.
We did not enable the dax option. The optane DIMMs are configured to
appear as a block device.
$ mount | grep xfs
/dev/pmem0p4 on /mnt/pmem0p4 type xfs
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
Regardless of the block device (the plot includes results for optane
and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > We create a 1 GiB dense file, then repeatedly modify a tiny random
> > fraction of it and make a clone via ioctl(FICLONE).
>
> Yay, random cow writes, that will slowly increase the number of space
> mapping records in the file metadata.
>
> > The time required for the ioctl() calls increases from large to insane
> > over the course of ~250 iterations: From roughly a millisecond for the
> > first iteration or two (which seems high, given that this is on
> > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > milliseconds (which seems crazy).
>
> Does the system call runtime increase with O(number_extents)? You might
> record the number of extents in the file you're cloning by running this
> periodically:
>
> xfs_io -c stat $path | grep fsxattr.nextents
The extent count does increase linearly (just like the ioctl() call latency).
I used the xfs_bmap tool, let me know if this is not the right way. If
it is not, I'll update the microbenchmark to run xfs_io.
> FICLONE (at least on XFS) persists dirty pagecache data to disk, and
> then duplicates all written-space mapping records from the source file to
> the destination file. It skips preallocated mappings created with
> fallocate.
>
> So yes, the plot is exactly what I was expecting.
>
> --D
>
> > The plot is attached to this email.
> >
> > A cursory look at the extent map suggests that it gets increasingly
> > complicated, which would explain the growing cost.
> >
> > The enclosed tarball contains our code, our results, and some other info
> > like a flame graph that might shed light on where the ioctl is spending
> > its time.
> >
> > - Suyash & Terence
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-14 4:47 ` Suyash Mahar
@ 2022-12-15 0:19 ` Dave Chinner
2022-12-16 1:06 ` Terence Kelly
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2022-12-15 0:19 UTC (permalink / raw)
To: Suyash Mahar; +Cc: Darrick J. Wong, linux-xfs, tpkelly, Suyash Mahar
On Tue, Dec 13, 2022 at 08:47:03PM -0800, Suyash Mahar wrote:
> Hi Darrick,
>
> Thank you for the response. I have replied inline.
>
> -Suyash
>
> Le mar. 13 déc. 2022 à 09:18, Darrick J. Wong <djwong@kernel.org> a écrit :
> >
> > [ugh, your email never made it to the list. I bet the email security
> > standards have been tightened again. <insert rant about dkim and dmarc
> > silent failures here>] :(
> >
> > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> > > Hi all!
> > >
> > > While using XFS's ioctl(FICLONE), we found that XFS seems to have
> > > poor performance (ioctl takes milliseconds for sparse files) and the
> > > overhead
> > > increases with every call.
> > >
> > > For the demo, we are using an Optane DC-PMM configured as a
> > > block device (fsdax) and running XFS (Linux v5.18.13).
> >
> > How are you using fsdax and reflink on a 5.18 kernel? That combination
> > of features wasn't supported until 6.0, and the data corruption problems
> > won't get fixed until a pull request that's about to happen for 6.2.
>
> We did not enable the dax option. The optane DIMMs are configured to
> appear as a block device.
>
> $ mount | grep xfs
> /dev/pmem0p4 on /mnt/pmem0p4 type xfs
> (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>
> Regardless of the block device (the plot includes results for optane
> and RamFS), it seems like the ioctl(FICLONE) call is slow.
Please define "slow" - is it actually slower than it should be
(i.e. a bug) or does it simply not perform according to your
expectations?
A few things that you can quantify to answer these questions.
1. What is the actual rate it is cloning extents at, i.e. extent count
/ clone time? Is this rate consistent/sustained, or is it dropping substantially
over time and/or with increasing extent count?
2. How does clone speed of a given file compare to the actual data
copy speed of that file (please include fsync time in the data
copy results)? Is cloning faster or slower than copying
the data? What is the extent count of the file at the cross-over
point where cloning goes from being faster to slower than copying
the data?
3. How does it compare with btrfs running the same write/clone
workload? Does btrfs run faster? Does it perform better with
high extent counts than XFS? What about with high sharing counts
(e.g. after 500 or 1000 clones of the source file)?
Basically, I'm trying to understand what "slow" means in the context
of the operations you are performing. I haven't seen any recent
performance regressions in clone speed on XFS, so I'm trying to
understand what you are seeing and why you think it is slower than
it should be.
> > > We create a 1 GiB dense file, then repeatedly modify a tiny random
> > > fraction of it and make a clone via ioctl(FICLONE).
> >
> > Yay, random cow writes, that will slowly increase the number of space
> > mapping records in the file metadata.
Yup, the scripts I use do exactly this - 10,000 random 4kB writes to
a 8GB file between reflink clones. I then iterate a few thousand
times and measure the reflink time.
> > > The time required for the ioctl() calls increases from large to insane
> > > over the course of ~250 iterations: From roughly a millisecond for the
> > > first iteration or two (which seems high, given that this is on
> > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > > milliseconds (which seems crazy).
> >
> > Does the system call runtime increase with O(number_extents)? You might
> > record the number of extents in the file you're cloning by running this
> > periodically:
> >
> > xfs_io -c stat $path | grep fsxattr.nextents
>
> The extent count does increase linearly (just like the ioctl() call latency).
As expected. Changing the sharing state a single extent has a
roughly constant overhead regardless of the number of extents in the
file. Hence clone time should scale linearly with the number of
extents that need to have their shared state modified.
> I used the xfs_bmap tool, let me know if this is not the right way. If
> it is not, I'll update the microbenchmark to run xfs_io.
xfs_bmap is the slow way - it has to iterate every extent and
format them out to userspace. The above mechanism just does a single
syscall to query the count of extents from the inode. Using the
fsxattr extent count query is much faster, especially when you have
files with tens of millions of extents in them....
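[The single-syscall query described above can be sketched in C with FS_IOC_FSGETXATTR, which is what `xfs_io -c stat` reports as fsxattr.nextents; the helper name and path are illustrative:]

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR */

/* Returns the file's data-fork extent count via one ioctl, without
 * walking the extent map.  Returns -1 on error (including filesystems
 * that don't support FS_IOC_FSGETXATTR). */
long extent_count(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct fsxattr fx;
    int ret = ioctl(fd, FS_IOC_FSGETXATTR, &fx);
    close(fd);
    return ret ? -1 : (long)fx.fsx_nextents;
}
```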
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-15 0:19 ` Dave Chinner
@ 2022-12-16 1:06 ` Terence Kelly
2022-12-17 17:30 ` Mike Fleetwood
2022-12-18 1:46 ` Dave Chinner
0 siblings, 2 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-16 1:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar
Hi Dave,
Thanks for your quick and detailed reply. More inline....
On Thu, 15 Dec 2022, Dave Chinner wrote:
>> Regardless of the block device (the plot includes results for optane
>> and RamFS), it seems like the ioctl(FICLONE) call is slow.
>
> Please define "slow" - is it actually slower than it should be (i.e. a
> bug) or does it simply not perform according to your expectations?
I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
*milli*seconds right from the start, and grew to *tens* of milliseconds.
There's no slow block storage device to increase latency; all of the
latency is due to software. I was expecting microseconds of latency with
DRAM underneath.
Performance matters because cloning is an excellent crash-tolerance
mechanism. Applications that maintain persistent state in files ---
that's a huge number of applications --- can make clones of said files and
recover from crashes by reverting to the most recent successful clone.
In many situations this is much easier and better than shoe-horning
application data into something like an ACID-transactional relational
database or transactional key-value store. But the run-time cost of
making a clone during failure-free operation can't be excessive. Cloning
for crash tolerance usually requires durable media beneath the file system
(HDD or SSD, not DRAM), so performance on block storage devices matters
too. We measured performance of cloning atop DRAM to understand how much
latency is due to block storage hardware vs. software alone.
My colleagues and I started working on clone-based crash tolerance
mechanisms nearly a decade ago. Extensive experience with cloning and
related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
the DEC Tru64 file system, taught me to expect cloning to be *faster* than
alternatives for crash tolerance:
https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
https://web.eecs.umich.edu/~tpkelly/papers/HPL-2015-103.pdf
The point I'm trying to make is: I'm a serious customer who loves cloning
and my performance expectations aren't based on idle speculation but on
experience with other cloning implementations. (AdvFS is not open source
and I'm no longer an HP employee, so I no longer have access to it.)
More recently I torture-tested XFS cloning as a crash-tolerance mechanism
by subjecting it to real whole-system power interruptions:
https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
I performed these correctness tests before making any performance
measurements because I don't care how fast a mechanism is if it doesn't
correctly tolerate crashes. XFS passed the power-fail tests with flying
colors. Now it's time to consider performance.
I'm surprised that in XFS, cloning alone *without* fsync() pushes data
down to storage. I would have expected that the implementation of cloning
would always operate upon memory alone, and that an explicit fsync() would
be required to force data down to durable media. Analogy: write()
doesn't modify storage; write() plus fsync() does. Is there a reason why
copying via ioctl(FICLONE) isn't similar?
Finally I understand your explanation that the cost of cloning is
proportional to the size of the extent map, and that in the limit where
the extent map is very large, cloning a file of size N requires O(N) time.
However the constant factors surprise me. If memory serves we were seeing
latencies of milliseconds atop DRAM for the first few clones on files that
began as sparse files and had only a few blocks written to them. Copying
the extent map on a DRAM file system must be tantamount to a bunch of
memcpy() calls (right?), and I'm surprised that the volume of data that
must be memcpy'd is so large that it takes milliseconds.
We might be able to take some of the additional measurements you suggested
during/after the holidays.
Thanks again.
> A few things that you can quantify to answer these questions.
>
> ...
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-16 1:06 ` Terence Kelly
@ 2022-12-17 17:30 ` Mike Fleetwood
2022-12-17 18:43 ` Terence Kelly
2022-12-18 1:46 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Mike Fleetwood @ 2022-12-17 17:30 UTC (permalink / raw)
To: Terence Kelly
Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs,
Suyash Mahar
On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote:
> (AdvFS is not open source
> and I'm no longer an HP employee, so I no longer have access to it.)
Just to put the record straight, HP did (abandon and) open source AdvFS
in June 2008.
https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html
It's available under a GPLv2 license from
https://advfs.sourceforge.net/
Mike
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-17 17:30 ` Mike Fleetwood
@ 2022-12-17 18:43 ` Terence Kelly
0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-17 18:43 UTC (permalink / raw)
To: Mike Fleetwood
Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs,
Suyash Mahar
It's confusing.
My FAST '15 paper was co-authored with AdvFS developers from the HP
Storage Division. The paper mentions the open-source release of AdvFS.
There's not a lot of recent activity on open-source AdvFS:
https://sourceforge.net/p/advfs/discussion/
One thing is certain, however: HP did not "abandon" AdvFS in 2008. At
the time of my FAST paper it was used under the hood in HP products and
was being actively developed internally. See Section 3 of the FAST paper.
The whole point of the paper is to describe a new (internal-only) AdvFS
feature.
I'm pretty sure (relying on memory) that the changes to AdvFS made by HP
between 2008 and 2015 did not find their way into the open-source release.
On Sat, 17 Dec 2022, Mike Fleetwood wrote:
> On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote:
>> (AdvFS is not open source and I'm no longer an HP employee, so I no
>> longer have access to it.)
>
> Just to put the record straight, HP did (abandon and) open source AdvFS
> in June 2008.
> https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html
>
> It's available under a GPLv2 license from
> https://advfs.sourceforge.net/
>
> Mike
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-16 1:06 ` Terence Kelly
2022-12-17 17:30 ` Mike Fleetwood
@ 2022-12-18 1:46 ` Dave Chinner
2022-12-18 4:47 ` Suyash Mahar
2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2022-12-18 1:46 UTC (permalink / raw)
To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar
On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
>
> Hi Dave,
>
> Thanks for your quick and detailed reply. More inline....
>
> On Thu, 15 Dec 2022, Dave Chinner wrote:
>
> > > Regardless of the block device (the plot includes results for optane
> > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> >
> > Please define "slow" - is it actually slower than it should be (i.e. a
> > bug) or does it simply not perform according to your expectations?
>
> I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> *milli*seconds right from the start, and grew to *tens* of milliseconds.
> There's no slow block storage device to increase latency; all of the latency
> is due to software. I was expecting microseconds of latency with DRAM
> underneath.
Ah - slower than expectations then, and you have unrealistic
expectations about how "fast" DRAM is.
From a storage engineer's perspective, DRAM is slow compared to nvme
based flash storage - DRAM has better access latency, but on all
other aspects of storage performance and capability, it falls way
behind pcie attached storage because the *CPU time* is the limiting
factor in storage performance these days, not storage device speed.
The problem with DRAM based storage (and DAX in general) is that
data movement is run by the CPU - it's synchronous storage.
Filesystems like XFS are built around highly concurrent pipelined
asynchronous IO hardware. Filesystems are capable of keeping
thousands of IOs in flight *per CPU*, but on synchronous storage
like DRAM we can only have *1 IO per CPU* in flight at any given
time.
Hence when we compare synchronous write performance, DRAM is fast
compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
numbers go the other way and SSDs come out further in front the more
of them you attach to the system. DRAM based IO doesn't get any
faster because it still can only process one IO at a time, whilst
*each SSD* can process 100+ IOs at a time.
IOWs, for normal block based storage we only use the CPU to marshall
the data movement in the system, and the hardware takes care of the
data movement. i.e. DMA-based storage devices are a hardware offload
mechanism. DRAM based storage relies on the CPU to move data, and so
we use all the time that the CPU could be sending IO to the hardware
to move data in DRAM from A to B.
Put simply: DRAM can only be considered fast if your application
does (or is optimised for) synchronous IO. For all other uses, DRAM
based storage is a poor choice.
> Performance matters because cloning is an excellent crash-tolerance
> mechanism.
Guaranteeing filesystem and data integrity is our primary focus when
building infrastructure that can be used for crash-tolerance
mechanisms...
> Applications that maintain persistent state in files --- that's
> a huge number of applications --- can make clones of said files and recover
> from crashes by reverting to the most recent successful clone.
... and that's the data integrity guarantee that the filesystem
*must* provide the application.
> In many
> situations this is much easier and better than shoe-horning application data
> into something like an ACID-transactional relational database or
> transactional key-value store.
Of course. But that doesn't mean the need for ACID-transactional
database functionality goes away. We've just moved that
functionality into the filesystem to implement FICLONE
functionality.
> But the run-time cost of making a clone
> during failure-free operation can't be excessive.
Define "excessive".
Our design constraints were that FICLONE had to be faster than
copying the data, and needed to have fixed cost per shared extent
reference modification or better so that it could scale to millions
of extents without bringing the filesystem, storage and/or system
to its knees when someone tried to do that.
Remember - extent sharing and clones were retrofitted to XFS 20
years after it was designed. We had to make lots of compromises
just to make it work correctly, let alone achieve the performance
requirements we set a decade ago.
> Cloning for crash
> tolerance usually requires durable media beneath the file system (HDD or
> SSD, not DRAM), so performance on block storage devices matters too. We
> measured performance of cloning atop DRAM to understand how much latency is
> due to block storage hardware vs. software alone.
Cloning is a CPU intensive operation, not an IO intensive operation.
What you are measuring is *entirely* the CPU overhead of doing all
the transactions and cross-referencing needed to track extent
sharing in a manner that is crash consistent, atomic and fully
recoverable.
> My colleagues and I started working on clone-based crash tolerance
> mechanisms nearly a decade ago. Extensive experience with cloning and
> related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> alternatives for crash tolerance:
Cloning files on XFS and btrfs is still much faster than the
existing safe overwrite mechanism of {create a whole new data copy,
fsync, rename, fsync}. So I'm not sure what you're actually
complaining about here.
> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
Ah, now I get it. You want *anonymous ephemeral clones*, not named
persistent clones. For everyone else, so they don't have to read
the paper and try to work it out:
The mechanism hacks the O_ATOMIC path to instantiate a whole
new cloned inode, which is linked into a hidden namespace in the
filesystem so the user can't see it but it is still present after a
crash.
It doesn't track the cloned extents in a persistent index; the
hidden file simply shares the same block map on disk and the sharing
is tracked in memory. After a crash, nothing is done with this until
the original file is instantiated in memory. At this point, the
hidden clone file(s) are accessed, the shared state is recovered in
memory, and decisions are made about which copy contains the most
recent data.
The clone is only present while the fd returned by the
open(O_ATOMIC) is valid. On close(), the clone is deleted and all
the in-memory and hidden on-disk state is torn down. Effectively,
the close() operation becomes an unlink().
Further, a new syscall (called syncv()) that takes a vector of these
O_ATOMIC cloned file descriptors is added. This syscall forces the
filesystem to make the inode -metadata- persistent without requiring
data modifications to be persistent. This allows the ephemeral
clones to be persisted without requiring the data in the original
file to be written to disk. At this point, we have a hidden clone
with a matching block map that can be used for crash recovery
purposes.
This clone mechanism in advfs is limited by journal size - 256
clones per 128MB journal space due to reservation space needed for
clone deletes.
----
So my reading of this paper is that the "file clone operation"
essentially creates an ephemeral clone rather than a persistent
named clone. I think they are more equivalent to ephemeral tmp files
than FICLONE. That is, we use open(O_TMPFILE) to create an
ephemeral temporary file attached to a file descriptor instead of
requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
unlinking it and holding the fd open, or relying on /tmp being
volatile or cleaned at boot to remove tmpfiles on crash.
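[The O_TMPFILE analogy above in a minimal C sketch; the helper names are illustrative:]

```c
#define _GNU_SOURCE     /* for O_TMPFILE */
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Opens an unnamed file inside `dir`: no directory entry is created,
 * so the file vanishes automatically when the last fd is closed --
 * the kernel's ephemeral-file primitive described above.
 * Returns the fd, or -1 if the fs doesn't support O_TMPFILE. */
int make_ephemeral(const char *dir)
{
    return open(dir, O_TMPFILE | O_RDWR, 0600);
}

/* Returns 1 if the open file has no directory entry (nlink == 0). */
int is_unnamed(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 && st.st_nlink == 0;
}
```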
Hence the difference in functionality is that FICLONE provides
persistent, unrestricted named clones rather than ephemeral clones.
We could implement ephemeral clones in XFS, but nobody has ever
mentioned needing or wanting such functionality until this thread.
Darrick already has patches to provide an internal hidden
persistent namespace for XFS filesystems, we could add a new O_CLONE
open flag that provides ephemeral clone behaviour, we could add a
flag to the inode to indicate it has ephemeral clones that need
recovery on next acces, add in-memory tracking of ephemeral shared extents to trigger
COW instead of overwrite in place, etc. It's just a matter of time
and resources.
If you've got resources available to implement this, I can find the
time to help design and integrate it into the VFS and XFS....
> The point I'm trying to make is: I'm a serious customer who loves cloning
> and my performance expectations aren't based on idle speculation but on
> experience with other cloning implementations. (AdvFS is not open source
> and I'm no longer an HP employee, so I no longer have access to it.)
>
> More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> subjecting it to real whole-system power interruptions:
>
> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
Heh. You're still using hardware to do filesystem power fail
testing? We moved away from needing hardware to do power fail
testing of filesystems several years ago.
Using functionality like dm-logwrites, we can simulate the effect of
several hundred different power fail cases with write-by-write
replay and recovery in the space of a couple of minutes.
Not only that, failures are fully replayable and so we can actually
debug every single individual failure without having to guess at the
runtime context that created the failure or the recovery context
that exposed the failure.
This infrastructure has provided us with a massive step forward for
improving crash resilience and recovery capability in ext4, btrfs and
XFS. These tests are built into automated tests suites (e.g.
fstests) that pretty much all linux fs engineers and distro QE teams
run these days.
IOWs, hardware based power fail testing of filesystems is largely
obsolete these days....
> I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> to storage. I would have expected that the implementation of cloning would
> always operate upon memory alone, and that an explicit fsync() would be
> required to force data down to durable media. Analogy: write() doesn't
> modify storage; write() plus fsync() does. Is there a reason why copying
> via ioctl(FICLONE) isn't similar?
Because FICLONE provides a persistent named clone that is a fully
functioning file in its own right. That means it has to be
completely independent of the source file by the time the FICLONE
operation completes. This implies that there is a certain order to
the operations the clone performs - the data has to be on disk
before the clone is made persistent and recoverable, so that both
files are guaranteed to have identical contents if we crash
immediately after the clone completes.
> Finally I understand your explanation that the cost of cloning is
> proportional to the size of the extent map, and that in the limit where the
> extent map is very large, cloning a file of size N requires O(N) time.
> However the constant factors surprise me. If memory serves we were seeing
> latencies of milliseconds atop DRAM for the first few clones on files that
> began as sparse files and had only a few blocks written to them. Copying
> the extent map on a DRAM file system must be tantamount to a bunch of
> memcpy() calls (right?),
At the IO layer, yes, it's just a memcpy.
But we can't just copy a million extents from one in-memory btree to
another. We have to modify the filesystem metadata in an atomic,
transactional, recoverable way. Those transactions work one extent
at a time because each extent might require a different set of
modifications. Persistent clones require tracking of the number of
times a given block on disk is shared so that we know when extent
removals result in the extent no longer being shared and/or
referenced. A file that has been cloned a million times might have
a million extents each shared a different number of times. When we
remove one of those clones, how do we know which blocks are now
unreferenced and need to be freed?
IOWs, named persistent clones are *much more complex* than ephemeral
clones. The overhead you are measuring is the result of all the
persistent cross referencing and reference counting metadata we need
to atomically update on each extent sharing operation to ensure
long term persistent clones work correctly.
If we were to implement ephemeral clones as per the mechanism you've
outlined in the papers above, then we could just copy the in-memory
extent list btree with a series of memcpy() operations because we
don't need persistent on-disk shared reference counting to implement
it....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-18 1:46 ` Dave Chinner
@ 2022-12-18 4:47 ` Suyash Mahar
2022-12-20 3:06 ` Darrick J. Wong
2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
1 sibling, 1 reply; 14+ messages in thread
From: Suyash Mahar @ 2022-12-18 4:47 UTC (permalink / raw)
To: Dave Chinner; +Cc: Terence Kelly, Darrick J. Wong, linux-xfs, Suyash Mahar
Thank you for the detailed response. This does confirm some of our
observations that the overhead is mainly from the software layer. We
did see better performance from optimization in the transaction code
moving from kernel v5.4 to v5.18.
-Suyash
On Sat, Dec 17, 2022 at 17:46, Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
> >
> > Hi Dave,
> >
> > Thanks for your quick and detailed reply. More inline....
> >
> > On Thu, 15 Dec 2022, Dave Chinner wrote:
> >
> > > > Regardless of the block device (the plot includes results for optane
> > > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > >
> > > Please define "slow" - is it actually slower than it should be (i.e. a
> > > bug) or does it simply not perform according to your expectations?
> >
> > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> > *milli*seconds right from the start, and grew to *tens* of milliseconds.
> > There's no slow block storage device to increase latency; all of the latency
> > is due to software. I was expecting microseconds of latency with DRAM
> > underneath.
>
> Ah - slower than expectations then, and you have unrealistic
> expectations about how "fast" DRAM is.
>
> From a storage engineer's perspective, DRAM is slow compared to nvme
> based flash storage - DRAM has better access latency, but on all
> other aspects of storage performance and capability, it falls way
> behind pcie attached storage because the *CPU time* is the limiting
> factor in storage performance these days, not storage device speed.
>
> The problem with DRAM based storage (and DAX in general) is that
> data movement is run by the CPU - it's synchronous storage.
> Filesystems like XFS are built around highly concurrent pipelined
> asynchronous IO hardware. Filesystems are capable of keeping
> thousands of IOs in flight *per CPU*, but on synchronous storage
> like DRAM we can only have *1 IO per CPU* in flight at any given
> time.
>
> Hence when we compare synchronous write performance, DRAM is fast
> compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
> numbers go the other way and SSDs come out further in front the more
> of them you attach to the system. DRAM based IO doesn't get any
> faster because it still can only process one IO at a time, whilst
> *each SSD* can process 100+ IOs at a time.
>
> IOWs, for normal block based storage we only use the CPU to marshall
> the data movement in the system, and the hardware takes care of the
> data movement. i.e. DMA-based storage devices are a hardware offload
> mechanism. DRAM based storage relies on the CPU to move data, and so
> we use all the time that the CPU could be sending IO to the hardware
> to move data in DRAM from A to B.
>
>
> Put simply: DRAM can only be considered fast if your application
> does (or is optimised for) synchronous IO. For all other uses, DRAM
> based storage is a poor choice.
>
> > Performance matters because cloning is an excellent crash-tolerance
> > mechanism.
>
> Guaranteeing filesystem and data integrity is our primary focus when
> building infrastructure that can be used for crash-tolerance
> mechanisms...
>
> > Applications that maintain persistent state in files --- that's
> > a huge number of applications --- can make clones of said files and recover
> > from crashes by reverting to the most recent successful clone.
>
> ... and that's the data integrity guarantee that the filesystem
> *must* provide the application.
>
> > In many
> > situations this is much easier and better than shoe-horning application data
> > into something like an ACID-transactional relational database or
> > transactional key-value store.
>
> Of course. But that doesn't mean the need for ACID-transactional
> database functionality goes away. We've just moved that
> functionality into the filesystem to implement FICLONE
> functionality.
>
> > But the run-time cost of making a clone
> > during failure-free operation can't be excessive.
>
> Define "excessive".
>
> Our design constraints were that FICLONE had to be faster than
> copying the data, and needed to have fixed cost per shared extent
> reference modification or better so that it could scale to millions
> of extents without bringing the filesystem, storage and/or system
> to its knees when someone tried to do that.
>
> Remember - extent sharing and clones were retrofitted to XFS 20
> years after it was designed. We had to make lots of compromises
> just to make it work correctly, let alone achieve the performance
> requirements we set a decade ago.
>
> > Cloning for crash
> > tolerance usually requires durable media beneath the file system (HDD or
> > SSD, not DRAM), so performance on block storage devices matters too. We
> > measured performance of cloning atop DRAM to understand how much latency is
> > due to block storage hardware vs. software alone.
>
> Cloning is a CPU intensive operation, not an IO intensive operation.
> What you are measuring is *entirely* the CPU overhead of doing all
> the transactions and cross-referencing needed to track extent
> sharing in a manner that is crash consistent, atomic and fully
> recoverable.
>
> > My colleagues and I started working on clone-based crash tolerance
> > mechanisms nearly a decade ago. Extensive experience with cloning and
> > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> > the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> > alternatives for crash tolerance:
>
> Cloning files on XFS and btrfs is still much faster than the
> existing safe overwrite mechanism of {create a whole new data copy,
> fsync, rename, fsync}. So I'm not sure what you're actually
> complaining about here.
>
>
> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
>
> Ah, now I get it. You want *anonymous ephemeral clones*, not named
> persistent clones. For everyone else, so they don't have to read
> the paper and try to work it out:
>
> The mechanism is a hacked O_ATOMIC path that instantiates a whole
> new cloned inode which is linked into a hidden namespace in the
> filesystem so the user can't see it but so it is present after a
> crash.
>
> It doesn't track the cloned extents in a persistent index, the
> hidden file simply shares the same block map on disk and the sharing
> is tracked in memory. After a crash, nothing is done with this until
> the original file is instantiated in memory. At this point, the
> hidden clone file(s) are then accessed and the shared state is
> recovered in memory and decisions are made about which contains the
> most recent data.
>
> The clone is only present while the fd returned by the
> open(O_ATOMIC) is valid. On close(), the clone is deleted and all
> the in-memory and hidden on-disk state is torn down. Effectively,
> the close() operation becomes an unlink().
>
> Further, a new syscall (called syncv()) that takes a vector of these
> O_ATOMIC cloned file descriptors is added. This syscall forces the
> filesystem to make the inode -metadata- persistent without requiring
> data modifications to be persistent. This allows the ephemeral
> clones to be persisted without requiring the data in the original
> file to be written to disk. At this point, we have a hidden clone
> with a matching block map that can be used for crash recovery
> purposes.
>
> This clone mechanism in AdvFS is limited by journal size - 256
> clones per 128MB journal space due to reservation space needed for
> clone deletes.
>
> ----
>
> So my reading of this paper is that the "file clone operation"
> essentially creates an ephemeral clone rather than a persistent
> named clone. I think they are more equivalent to ephemeral tmp files
> than FICLONE. That is, we use open(O_TMPFILE) to create an
> ephemeral temporary file attached to a file descriptor instead of
> requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
> unlinking it and holding the fd open or relying on /tmp being
> volatile or cleaned at boot to remove tmpfiles on crash.
>
> Hence the difference in functionality is that FICLONE provides
> persistent, unrestricted named clones rather than ephemeral clones.
>
>
> We could implement ephemeral clones in XFS, but nobody has ever
> mentioned needing or wanting such functionality until this thread.
> Darrick already has patches to provide an internal hidden
> persistent namespace for XFS filesystems, we could add a new O_CLONE
> open flag that provides ephemeral clone behaviour, we could add a
> flag to the inode to indicate it has ephemeral clones that need
> recovery on next access, add in-memory tracking of ephemeral shared extents to trigger
> COW instead of overwrite in place, etc. It's just a matter of time
> and resources.
>
> If you've got resources available to implement this, I can find the
> time to help design and integrate it into the VFS and XFS....
>
> > The point I'm trying to make is: I'm a serious customer who loves cloning
> > and my performance expectations aren't based on idle speculation but on
> > experience with other cloning implementations. (AdvFS is not open source
> > and I'm no longer an HP employee, so I no longer have access to it.)
> >
> > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> > subjecting it to real whole-system power interruptions:
> >
> > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
>
> Heh. You're still using hardware to do filesystem power fail
> testing? We moved away from needing hardware to do power fail
> testing of filesystems several years ago.
>
> Using functionality like dm-logwrites, we can simulate the effect of
> several hundred different power fail cases with write-by-write
> replay and recovery in the space of a couple of minutes.
>
> Not only that, failures are fully replayable and so we can actually
> debug every single individual failure without having to guess at the
> runtime context that created the failure or the recovery context
> that exposed the failure.
>
> This infrastructure has provided us with a massive step forward for
> improving crash resilience and recovery capability in ext4, btrfs and
> XFS. These tests are built into automated test suites (e.g.
> fstests) that pretty much all linux fs engineers and distro QE teams
> run these days.
>
> IOWs, hardware based power fail testing of filesystems is largely
> obsolete these days....
>
> > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> > to storage. I would have expected that the implementation of cloning would
> > always operate upon memory alone, and that an explicit fsync() would be
> > required to force data down to durable media. Analogy: write() doesn't
> > modify storage; write() plus fsync() does. Is there a reason why copying
> > via ioctl(FICLONE) isn't similar?
>
> Because FICLONE provides a persistent named clone that is a fully
> functioning file in its own right. That means it has to be
> completely independent of the source file by the time the FICLONE
> operation completes. This implies that there is a certain order to
> the operations the clone performs - the data has to be on disk
> before the clone is made persistent and recoverable so that both
> files are guaranteed to have identical contents if we crash
> immediately after the clone completes.
>
> > Finally I understand your explanation that the cost of cloning is
> > proportional to the size of the extent map, and that in the limit where the
> > extent map is very large, cloning a file of size N requires O(N) time.
> > However the constant factors surprise me. If memory serves we were seeing
> > latencies of milliseconds atop DRAM for the first few clones on files that
> > began as sparse files and had only a few blocks written to them. Copying
> > the extent map on a DRAM file system must be tantamount to a bunch of
> > memcpy() calls (right?),
>
> At the IO layer, yes, it's just a memcpy.
>
> But we can't just copy a million extents from one in-memory btree to
> another. We have to modify the filesystem metadata in an atomic,
> transactional, recoverable way. Those transactions work one extent
> at a time because each extent might require a different set of
> modifications. Persistent clones require tracking of the number of
> times a given block on disk is shared so that we know when extent
> removals result in the extent no longer being shared and/or
> referenced. A file that has been cloned a million times might have
> a million extents each shared a different number of times. When we
> remove one of those clones, how do we know which blocks are now
> unreferenced and need to be freed?
>
> IOWs, named persistent clones are *much more complex* than ephemeral
> clones. The overhead you are measuring is the result of all the
> persistent cross referencing and reference counting metadata we need
> to atomically update on each extent sharing operation to ensure
> long-term persistent clones work correctly.
>
> If we were to implement ephemeral clones as per the mechanism you've
> outlined in the papers above, then we could just copy the in-memory
> extent list btree with a series of memcpy() operations because we
> don't need persistent on-disk shared reference counting to implement
> it....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-18 1:46 ` Dave Chinner
2022-12-18 4:47 ` Suyash Mahar
@ 2022-12-18 23:40 ` Terence Kelly
2022-12-20 2:16 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Terence Kelly @ 2022-12-18 23:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar
On Sun, 18 Dec 2022, Dave Chinner wrote:
>> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
>
> Ah, now I get it. You want *anonymous ephemeral clones*, not named
> persistent clones. For everyone else, so they don't have to read the
> paper and try to work it out:
>
> The mechanism is a hacked O_ATOMIC path ...
No. To be clear, nobody now in 2022 is asking for the AdvFS features of
the FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).
The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and
foreseeable needs, except for performance.
I cited the FAST 2015 paper simply to show that I've worked with a
clone-based mechanism in the past and it delighted me in every way. It's
simply an existence proof that cloning can be delightful for crash
tolerance.
> Hence the difference in functionality is that FICLONE provides
> persistent, unrestricted named clones rather than ephemeral clones.
For the record, the AdvFS implementation of clone-based crash tolerance
--- the moral equivalent of failure-atomic msync(), which was the topic of
my EuroSys 2013 paper --- involved persistent files on durable storage;
the files were hidden and were discarded when their usefulness was over
but the hidden files were not "ephemeral" in the sense of a file in a
DRAM-backed file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash
tolerance survived real power failures. But this is a side issue of
historical interest only.
I mainly want to emphasize that nobody is asking for the behavior of AdvFS
in that FAST 2015 paper.
> We could implement ephemeral clones in XFS, but nobody has ever
> mentioned needing or wanting such functionality until this thread.
Nobody needs or wants such functionality, even in this thread. The
current ioctl(FICLONE) is perfect except for performance.
>> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
>
> Heh. You're still using hardware to do filesystem power fail testing?
> We moved away from needing hardware to do power fail testing of
> filesystems several years ago.
>
> Using functionality like dm-logwrites, we can simulate the effect of
> several hundred different power fail cases with write-by-write replay
> and recovery in the space of a couple of minutes.
Cool. I assume you're familiar with a paper on a similar technique that
my HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for
Fun and Profit."
> Not only that, failures are fully replayable and so we can actually
> debug every single individual failure without having to guess at the
> runtime context that created the failure or the recovery context that
> exposed the failure.
>
> This infrastructure has provided us with a massive step forward for
> improving crash resilience and recovery capability in ext4, btrfs and
> XFS. These tests are built into automated test suites (e.g. fstests)
> that pretty much all linux fs engineers and distro QE teams run these
> days.
If you think the world would benefit from reading about this technique and
using it more widely, I might be able to help. My column in _Queue_
magazine reaches thousands of readers, sometimes tens of thousands. It's
about teaching better techniques to working programmers.
I'd be honored to help pass along to my readers practical techniques that
you're using to improve quality.
> IOWs, hardware based power fail testing of filesystems is largely
> obsolete these days....
I don't mind telling the world that my own past work is obsolete. That's
what progress is all about.
>> I'm surprised that in XFS, cloning alone *without* fsync() pushes data
>> down to storage. I would have expected that the implementation of
>> cloning would always operate upon memory alone, and that an explicit
>> fsync() would be required to force data down to durable media.
>> Analogy: write() doesn't modify storage; write() plus fsync() does.
>> Is there a reason why copying via ioctl(FICLONE) isn't similar?
>
> Because FICLONE provides a persistent named clone that is a fully
> functioning file in its own right. That means it has to be completely
> independent of the source file by the time the FICLONE operation
> completes. This implies that there is a certain order to the operations
> the clone performs - the data has to be on disk before the clone is
> made persistent and recoverable so that both files are guaranteed to have
> identical contents if we crash immediately after the clone completes.
I thought the rule was that if an application doesn't call fsync() or
msync(), no durability of any kind is guaranteed. I thought modern file
systems did all their work in DRAM until an explicit fsync/msync or other
necessity compelled them to push data down to durable media (in the right
order etc.).
Also, we might be using terminology differently:
I use "persistent" in the sense of "outlives processes". Files in /tmp/
and /dev/shm/ are persistent, but not durable.
I use "durable" to mean "written to non-volatile media (HDD or SSD) in
such a way as to guarantee that it will survive power cycling."
I expect *persistence* from ioctl(FICLONE) but I didn't expect a
*durability* guarantee without fsync(). If I'm understanding you
correctly, cloning in XFS gives us durability whether we want it or not.
>> Finally I understand your explanation that the cost of cloning is
>> proportional to the size of the extent map, and that in the limit where
>> the extent map is very large, cloning a file of size N requires O(N)
>> time. However the constant factors surprise me. If memory serves we
>> were seeing latencies of milliseconds atop DRAM for the first few
>> clones on files that began as sparse files and had only a few blocks
>> written to them. Copying the extent map on a DRAM file system must be
>> tantamount to a bunch of memcpy() calls (right?),
>
> At the IO layer, yes, it's just a memcpy.
>
> But we can't just copy a million extents from one in-memory btree to
> another. We have to modify the filesystem metadata in an atomic,
> transactional, recoverable way. Those transactions work one extent at a
> time because each extent might require a different set of modifications.
Ah, so now I see where the time goes. This is clear.
> Persistent clones require tracking of the number of times a given block
> on disk is shared so that we know when extent removals result in the
> extent no longer being shared and/or referenced. A file that has been
> cloned a million times might have a million extents each shared a
> different number of times. When we remove one of those clones, how do we
> know which blocks are now unreferenced and need to be freed?
>
> IOWs, named persistent clones are *much more complex* than ephemeral
> clones.
Again, I don't know where you're getting "ephemeral" from; that word does
not appear in the FAST '15 paper. The AdvFS clones of the FAST '15 paper
were both durable and persistent; they were just hidden from the
user-visible namespace. A crash (power outage or whatever) caused a file
to revert to the most recent hidden clone. In AdvFS, a hidden clone was
created by an fsync/msync call. This is how AdvFS made file updates
failure-atomic.
Again, we're not asking for the same functionality of the FAST '15 paper.
However if the contrast between what AdvFS did with clones and how XFS
works illuminates issues like XFS performance, then it might be worth
understanding AdvFS.
Incidentally, I really appreciate the time & effort you're taking to
educate me & Suyash. I hope I'm not being too sluggish a student, though
sometimes I am.
For the near term, Suyash and I are getting closer to an understanding of
today's ioctl(FICLONE) that we can pass along to readers in the paper
we're writing.
> The overhead you are measuring is the result of all the persistent cross
> referencing and reference counting metadata we need to atomically update
> on each extent sharing operation to ensure long-term persistent clones work
> correctly.
This is clear. Thanks.
> If we were to implement ephemeral clones as per the mechanism you've
> outlined in the papers above, then we could just copy the in-memory
> extent list btree with a series of memcpy() operations because we don't
> need persistent on-disk shared reference counting to implement it....
We're not on the same page about what AdvFS did.
Of course I'll understand if you don't have time or interest to get on the
same page; we understand that you're busy with a lot of important work.
Thanks for your help and Happy Holidays!
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
@ 2022-12-20 2:16 ` Dave Chinner
2022-12-21 23:07 ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2022-12-20 2:16 UTC (permalink / raw)
To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar
On Sun, Dec 18, 2022 at 06:40:54PM -0500, Terence Kelly wrote:
>
>
> On Sun, 18 Dec 2022, Dave Chinner wrote:
>
> > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> >
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones. For everyone else, so they don't have to read the
> > paper and try to work it out:
> >
> > The mechanism is a hacked O_ATOMIC path ...
>
> No. To be clear, nobody now in 2022 is asking for the AdvFS features of the
> FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).
>
> The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and
> foreseeable needs, except for performance.
>
> I cited the FAST 2015 paper simply to show that I've worked with a
> clone-based mechanism in the past and it delighted me in every way. It's
> simply an existence proof that cloning can be delightful for crash
> tolerance.
Sure, you're preaching to the choir. But the context was quoting a
paper as an example of the cloning performance you expected from XFS
but weren't getting. You're still talking about how XFS clones are
too slow for your needs, but now you are saying you don't want
clones for fault tolerance as implemented in AdvFS.
> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
>
> For the record, the AdvFS implementation of clone-based crash tolerance ---
> the moral equivalent of failure-atomic msync(), which was the topic of my
> EuroSys 2013 paper --- involved persistent files on durable storage; the
> files were hidden and were discarded when their usefulness was over but the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is the very definition of an ephemeral filesystem object.
The clones are temporary filesystem objects that exist only within
the context of an active file descriptor; users don't know they
exist, cannot discover their existence, and they get cleaned
up automatically by the filesystem when they are no longer useful.
Yes, there is some persistent state needed to implement the required
garbage collection semantics of the ephemeral object (just like
O_TMPFILE!), but that doesn't change the fact that users don't know
(or care) that the internal filesystem objects even exist.
Really, I can't think of a better example of an ephemeral object
than this, regardless of whether the paper's authors used that term
or not.
> hidden files were not "ephemeral" in the sense of a file in a DRAM-backed
> file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash tolerance survived
> real power failures. But this is a side issue of historical interest only.
>
> I mainly want to emphasize that nobody is asking for the behavior of AdvFS
> in that FAST 2015 paper.
OK, so what are you asking us to do, then?
[....]
> > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> >
> > Heh. You're still using hardware to do filesystem power fail testing? We
> > moved away from needing hardware to do power fail testing of filesystems
> > several years ago.
> >
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write replay
> > and recovery in the space of a couple of minutes.
>
> Cool. I assume you're familiar with a paper on a similar technique that my
> HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for Fun
> and Profit."
Nope, but it's not a new or revolutionary technique so I'm not
surprised that other people have done similar things. There's been
plenty of research based on model checking over the past 2-3 decades
- the series of Iron Filesystem papers is a good example of this.
What we have in fstests is just a version of these concepts that
simplifies discovering and debugging previously undiscovered write
ordering issues...
> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context that
> > exposed the failure.
> >
> > This infrastructure has provided us with a massive step forward for
> > improving crash resilience and recovery capability in ext4, btrfs and
> > XFS. These tests are built into automated test suites (e.g. fstests)
> > that pretty much all linux fs engineers and distro QE teams run these
> > days.
>
> If you think the world would benefit from reading about this technique and
> using it more widely, I might be able to help. My column in _Queue_
> magazine reaches thousands of readers, sometimes tens of thousands. It's
> about teaching better techniques to working programmers.
You're welcome to do so - the source code is all there, there's a
mailing list for fstests where you can ask questions about it, etc.
If you think it's valuable for people outside the core linux fs
developer community, then you don't need to ask our permission to
write an article on it....
> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes
> > > data down to storage. I would have expected that the implementation
> > > of cloning would always operate upon memory alone, and that an
> > > explicit fsync() would be required to force data down to durable
> > > media. Analogy: write() doesn't modify storage; write() plus
> > > fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't
> > > similar?
> >
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in its own right. That means it has to be completely
> > independent of the source file by the time the FICLONE operation
> > completes. This implies that there is a certain order to the operations
> > the clone performs - the data has to be on disk before the clone is
> > made persistent and recoverable so that both files are guaranteed to have
> > identical contents if we crash immediately after the clone completes.
>
> I thought the rule was that if an application doesn't call fsync() or
> msync(), no durability of any kind is guaranteed.
No durability of any kind is guaranteed, but that doesn't preclude
the OS and/or filesystem actually performing an operation in a way
that guarantees persistence....
That said, the FICLONE API doesn't guarantee persistence. The
application still has to call fdatasync() to ensure that all the
metadata changes that FICLONE makes are persisted all the way down
to stable storage.
> I thought modern file
> systems did all their work in DRAM until an explicit fsync/msync or other
> necessity compelled them to push data down to durable media (in the right
> order etc.).
Largely, they do. But some operations have dependencies and require
data/metadata update synchronisation, and at that point we have
ordering constraints. To an outside observer, that may look like
the filesystem is trying to provide durability, but in fact it is
doing nothing of the sort...
I suspect you've seen the data writeback in FICLONE and thought this
is because it needs to provide a durability guarantee.
For XFS, this is an ordering constraint - we have to ensure the right
thing happens with delayed allocation and resolve pending COW
operations on a file before we clone the extent map to a new file.
We do this by running writeback to process these pending extent map
operations we deferred at write() time. Once those deferred
operations have been resolved, we can run the transactions to clone
the extent map.
However, if FICLONE is acting on files containing only data at rest,
then it can run without doing a single data IO, and the whole clone
can be lost on crash if fdatasync() is not run once it is complete.
IOWs, the FICLONE API provides no persistence guarantees.
fdatasync/O_DSYNC is still required.
> Also, we might be using terminology differently:
>
> I use "persistent" in the sense of "outlives processes". Files in /tmp/ and
> /dev/shm/ are persistent, but not durable.
Yeah, different terminology - you seem to have different frames of
reference for the terms you are using.
The frame of reference I'm using for terminology is filesystem
objects rather than processes or storage. Stuff that exists purely
in memory (such as tmpfs or shm files) is always considered
"volatile" - they are lost if the system crashes or shuts down.
Volatile storage also include caches like dirty data in the page
cache and storage devices with DRAM based caches.
Persistent refers to ensuring filesystem objects are not volatile;
they do not get lost during shutdown or abnormal termination because
they have been guaranteed to exist on a stable, permanent storage
media.
> I use "durable" to mean "written to non-volatile media (HDD or SSD) in such
> a way as to guarantee that it will survive power cycling."
Sure. We typically refer to non-volatile storage media as "stable
storage" because the hardware can be durable in the short term but
volatile in the long term. e.g. battery backed RAM is considered
"stable" if the battery backup lasts longer than 72 hours, but
over long periods it will not retain its contents. Hence calling it
"non-volatile media" isn't really correct - the contents are only
stable over a fixed timeframe.
Regardless of terminology, "persisting objects to stable
storage" is effectively the same thing as "making durable".
> I expect *persistence* from ioctl(FICLONE) but I didn't expect a
> *durability* guarantee without fsync(). If I'm understanding you correctly,
> cloning in XFS gives us durability whether we want it or not.
See above. We provide no guarantees about persistence, but in some
cases we can't perform the FICLONE operation correctly without
performing most of the operations needed to provide persistence of
the source file.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE)
2022-12-18 4:47 ` Suyash Mahar
@ 2022-12-20 3:06 ` Darrick J. Wong
2022-12-21 22:34 ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2022-12-20 3:06 UTC (permalink / raw)
To: Suyash Mahar; +Cc: Dave Chinner, Terence Kelly, linux-xfs, Suyash Mahar
On Sat, Dec 17, 2022 at 08:47:45PM -0800, Suyash Mahar wrote:
> Thank you for the detailed response. This does confirm some of our
> observations that the overhead is mainly from the software layer. We
> did see better performance from optimization in the transaction code
> moving from kernel v5.4 to v5.18.
>
> -Suyash
>
> Le sam. 17 déc. 2022 à 17:46, Dave Chinner <david@fromorbit.com> a écrit :
> >
> > On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
> > >
> > > Hi Dave,
> > >
> > > Thanks for your quick and detailed reply. More inline....
> > >
> > > On Thu, 15 Dec 2022, Dave Chinner wrote:
> > >
> > > > > Regardless of the block device (the plot includes results for optane
> > > > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > > >
> > > > Please define "slow" - is it actually slower than it should be (i.e. a
> > > > bug) or does it simply not perform according to your expectations?
> > >
> > > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> > > *milli*seconds right from the start, and grew to *tens* of milliseconds.
> > > There's no slow block storage device to increase latency; all of the latency
> > > is due to software. I was expecting microseconds of latency with DRAM
> > > underneath.
> >
> > Ah - slower than expectations then, and you have unrealistic
> > expectations about how "fast" DRAM is.
> >
> > From a storage engineer's perspective, DRAM is slow compared to nvme
> > based flash storage - DRAM has better access latency, but on all
> > other aspects of storage performance and capability, it falls way
> > behind pcie attached storage because the *CPU time* is the limiting
> > factor in storage performance these days, not storage device speed.
> >
> > The problem with DRAM based storage (and DAX in general) is that
> > data movement is run by the CPU - it's synchronous storage.
> > Filesystems like XFS are built around highly concurrent pipelined
> > asynchronous IO hardware. Filesystems are capable of keeping
> > thousands of IOs in flight *per CPU*, but on synchronous storage
> > like DRAM we can only have *1 IO per CPU* in flight at any given
> > time.
> >
> > Hence when we compare synchronous write performance, DRAM is fast
> > compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
> > numbers go the other way and SSDs come out further in front the more
> > of them you attach to the system. DRAM based IO doesn't get any
> > faster because it still can only process one IO at a time, whilst
> > *each SSD* can process 100+ IOs at a time.
> >
> > IOWs, for normal block based storage we only use the CPU to marshall
> > the data movement in the system, and the hardware takes care of the
> > data movement. i.e. DMA-based storage devices are a hardware offload
> > mechanism. DRAM based storage relies on the CPU to move data, and so
> > we use all the time that the CPU could be sending IO to the hardware
> > to move data in DRAM from A to B.
> >
> >
> > Put simply: DRAM can only be considered fast if your application
> > does (or is optimised for) synchronous IO. For all other uses, DRAM
> > based storage is a poor choice.
Oh, it's worse than that -- since you're using 5.18 with reflink
enabled, DAX will always yield to reflink. IOWs, the random writes are
done to the pagecache, so the implied fdatasync in the FICLONE
preparation also has to *copy* the dirty pagecache to the pmem.
It would at least be interesting (a) to bump to 6.2, and (b) stuff an
fsync(src_fd) call in before you start timing the FICLONE to see what
proportion of the clone time was actually just pagecache maneuvers.
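The experiment Darrick suggests - fsync the source first, so the measured
FICLONE time excludes the implied pagecache writeback - could be sketched
along these lines (illustrative Python only; function and file names are
made up for the example):

```python
import fcntl
import os
import time

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from linux/fs.h

def time_clone(src_fd, dst_path, fsync_first=False):
    """Return FICLONE latency in seconds, or None if reflink is unsupported.

    With fsync_first=True, dirty pages are written back before timing
    starts, so the measurement isolates the extent-map cloning cost
    from the implied writeback."""
    if fsync_first:
        os.fsync(src_fd)              # flush dirty pagecache before timing
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        t0 = time.perf_counter()
        fcntl.ioctl(dst, FICLONE, src_fd)
        return time.perf_counter() - t0
    except OSError:
        return None                   # e.g. EOPNOTSUPP on non-reflink fs
    finally:
        os.close(dst)
```

Comparing the fsync_first=True and fsync_first=False latencies over the
same random-write workload would show what fraction of the clone time is
writeback rather than extent-map manipulation.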
> > > Performance matters because cloning is an excellent crash-tolerance
> > > mechanism.
> >
> > Guaranteeing filesystem and data integrity is our primary focus when
> > building infrastructure that can be used for crash-tolerance
> > mechanisms...
> >
> > > Applications that maintain persistent state in files --- that's
> > > a huge number of applications --- can make clones of said files and recover
> > > from crashes by reverting to the most recent successful clone.
> >
> > ... and that's the data integrity guarantee that the filesystem
> > *must* provide the application.
> >
> > > In many
> > > situations this is much easier and better than shoe-horning application data
> > > into something like an ACID-transactional relational database or
> > > transactional key-value store.
> >
> > Of course. But that doesn't mean the need for ACID-transactional
> > database functionality goes away. We've just moved that
> > functionality into the filesystem to implement FICLONE
> > functionality.
> >
> > > But the run-time cost of making a clone
> > > during failure-free operation can't be excessive.
> >
> > Define "excessive".
> >
> > Our design constraints were that FICLONE had to be faster than
> > copying the data, and needed to have fixed cost per shared extent
> > reference modification or better so that it could scale to millions
> > of extents without bringing the filesystem, storage and/or system
> > to it's knees when someone tried to do that.
> >
> > Remember - extent sharing and clones were retrofitted to XFS 20
> > years after it was designed. We had to make lots of compromises
> > just to make it work correctly, let alone achieve the performance
> > requirements we set a decade ago.
> >
> > > Cloning for crash
> > > tolerance usually requires durable media beneath the file system (HDD or
> > > SSD, not DRAM), so performance on block storage devices matters too. We
> > > measured performance of cloning atop DRAM to understand how much latency is
> > > due to block storage hardware vs. software alone.
> >
> > Cloning is a CPU intensive operation, not an IO intensive operation.
> > What you are measuring is *entirely* the CPU overhead of doing all
> > the transactions and cross-referencing needed to track extent
> > sharing in a manner that is crash consistent, atomic and fully
> > recoverable.
> >
> > > My colleagues and I started working on clone-based crash tolerance
> > > mechanisms nearly a decade ago. Extensive experience with cloning and
> > > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> > > the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> > > alternatives for crash tolerance:
> >
> > Cloning files on XFS and btrfs is still much faster than the
> > existing safe overwrite mechanism of {create a whole new data copy,
> > fsync, rename, fsync}. So I'm not sure what you're actually
> > complaining about here.
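For reference, the safe overwrite sequence Dave mentions is the classic
copy + fsync + rename + fsync-the-directory pattern. A minimal Python
sketch (illustration only; temp-file naming and error handling are
simplified):

```python
import os

def safe_overwrite(path, data):
    """Atomically replace `path` with `data`: write a full new copy,
    fsync it, rename it over the original, then fsync the directory
    so the new directory entry is durable."""
    d = os.path.dirname(os.path.abspath(path))
    tmp = os.path.join(d, ".tmp." + os.path.basename(path))
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                  # data durable before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)              # atomic replacement
    dfd = os.open(d, os.O_RDONLY)     # persist the directory entry
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Note that this always rewrites the whole file, which is exactly the
O(data) cost that reflink-based cloning is meant to beat.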
> >
> >
> > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> >
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones. For everyone else, so they don't have to read
> > the paper and try to work it out:
> >
> > The mechanism hacks the O_ATOMIC path to instantiate a whole
> > new cloned inode which is linked into a hidden namespace in the
> > filesystem so the user can't see it, but so it is present after a
> > crash.
> >
> > It doesn't track the cloned extents in a persistent index, the
> > hidden file simply shares the same block map on disk and the sharing
> > is tracked in memory. After a crash, nothing is done with this until
> > the original file is instantiated in memory. At this point, the
> > hidden clone file(s) are then accessed, the shared state is
> > recovered in memory, and decisions are made about which contains
> > the most recent data.
> >
> > The clone is only present while the fd returned by the
> > open(O_ATOMIC) is valid. On close(), the clone is deleted and all
> > the in-memory and hidden on-disk state is torn down. Effectively,
> > the close() operation becomes an unlink().
> >
> > Further, a new syscall (called syncv()) is added that takes a
> > vector of these O_ATOMIC cloned file descriptors. This syscall
> > forces the filesystem to make the inode -metadata- persistent
> > without requiring data modifications to be persistent. This allows
> > the ephemeral clones to be persisted without requiring the data in
> > the original file to be written to disk. At this point, we have a
> > hidden clone with a matching block map that can be used for crash
> > recovery purposes.
> >
> > This clone mechanism in AdvFS is limited by journal size - 256
> > clones per 128MB journal space due to reservation space needed for
> > clone deletes.
> >
> > ----
> >
> > So my reading of this paper is that the "file clone operation"
> > essentially creates an ephemeral clone rather than a persistent
> > named clone. I think they are more equivalent to ephemeral tmp
> > files than FICLONE. That is, we use open(O_TMPFILE) to create an
> > ephemeral temporary file attached to a file descriptor instead of
> > requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
> > unlinking it and holding the fd open, or relying on /tmp being
> > volatile or cleaned at boot to remove tmpfiles on crash.
> >
> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
> >
> >
> > We could implement ephemeral clones in XFS, but nobody has ever
> > mentioned needing or wanting such functionality until this thread.
> > Darrick already has patches to provide an internal hidden
> > persistent namespace for XFS filesystems, we could add a new O_CLONE
> > open flag that provides ephemeral clone behaviour, we could add a
> > flag to the inode to indicate it has ephemeral clones that need
> > recovery on next access, add in-memory tracking of ephemeral shared extents to trigger
> > COW instead of overwrite in place, etc. It's just a matter of time
> > and resources.
<cough> The bits needed for atomic file commits have been out for review
on fsdevel since **before the COVID19 pandemic started**. It's buried
in the middle of the online repair featureset.
Summary of the usage model:
fd = open(sourcefile...)
tmp_fd = open(..., O_TMPFILE)
ioctl(tmp_fd, FICLONE, fd); /* clone data to temporary file */
/* write whatever you want to the temporary file */
ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */
close(tmp_fd)
True, this isn't an ephemeral file -- for such a thing, we could just
duplicate the in-memory data fork and never commit it to disk. But that
said, I've been trying to get the parts I /have/ built merged for three
years.
I'm planning to push the whole giant thing to the list on Thursday.
--D
> > If you've got resources available to implement this, I can find the
> > time to help design and integrate it into the VFS and XFS....
> >
> > > The point I'm trying to make is: I'm a serious customer who loves cloning
> > > and my performance expectations aren't based on idle speculation but on
> > > experience with other cloning implementations. (AdvFS is not open source
> > > and I'm no longer an HP employee, so I no longer have access to it.)
> > >
> > > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> > > subjecting it to real whole-system power interruptions:
> > >
> > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> >
> > Heh. You're still using hardware to do filesystem power fail
> > testing? We moved away from needing hardware to do power fail
> > testing of filesystems several years ago.
> >
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write
> > replay and recovery in the space of a couple of minutes.
> >
> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context
> > that exposed the failure.
> >
> > This infrastructure has provided us with a massive step forward for
> > improving crash resilience and recovery capability in ext4, btrfs and
> > XFS. These tests are built into automated tests suites (e.g.
> > fstests) that pretty much all linux fs engineers and distro QE teams
> > run these days.
> >
> > IOWs, hardware based power fail testing of filesystems is largely
> > obsolete these days....
> >
> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> > > to storage. I would have expected that the implementation of cloning would
> > > always operate upon memory alone, and that an explicit fsync() would be
> > > required to force data down to durable media. Analogy: write() doesn't
> > > modify storage; write() plus fsync() does. Is there a reason why copying
> > > via ioctl(FICLONE) isn't similar?
> >
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in its own right. That means it has to be
> > completely independent of the source file by the time the FICLONE
> > operation completes. This implies that there is a certain order to
> > the operations the clone performs - the data has to be on disk
> > before the clone is made persistent and recoverable, so that both
> > files are guaranteed to have identical contents if we crash
> > immediately after the clone completes.
> >
> > > Finally I understand your explanation that the cost of cloning is
> > > proportional to the size of the extent map, and that in the limit where the
> > > extent map is very large, cloning a file of size N requires O(N) time.
> > > However the constant factors surprise me. If memory serves we were seeing
> > > latencies of milliseconds atop DRAM for the first few clones on files that
> > > began as sparse files and had only a few blocks written to them. Copying
> > > the extent map on a DRAM file system must be tantamount to a bunch of
> > > memcpy() calls (right?),
> >
> > At the IO layer, yes, it's just a memcpy.
> >
> > But we can't just copy a million extents from one in-memory btree to
> > another. We have to modify the filesystem metadata in an atomic,
> > transactional, recoverable way. Those transactions work one extent
> > at a time because each extent might require a different set of
> > modifications. Persistent clones require tracking of the number of
> > times a given block on disk is shared so that we know when extent
> > removals result in the extent no longer being shared and/or
> > referenced. A file that has been cloned a million times might have
> > a million extents each shared a different number of times. When we
> > remove one of those clones, how do we know which blocks are now
> > unreferenced and need to be freed?
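The bookkeeping problem Dave describes - knowing which blocks become
free when one of many clones is removed - can be shown with a toy model.
Nothing below resembles XFS's on-disk refcount btree; it only illustrates
why per-extent reference counting is needed at all:

```python
from collections import Counter

class RefcountedExtents:
    """Toy per-block reference counting across clones.

    Each clone bumps the refcount of every block it shares; removing
    a clone frees only the blocks whose count drops to zero."""
    def __init__(self):
        self.refs = Counter()

    def clone(self, blocks):
        for b in blocks:
            self.refs[b] += 1
        return list(blocks)

    def remove(self, blocks):
        freed = []
        for b in blocks:
            self.refs[b] -= 1
            if self.refs[b] == 0:   # no clone references this block now
                freed.append(b)
                del self.refs[b]
        return freed
```

Even in this toy, every clone and remove touches each shared block's
count individually - which is why the real, crash-consistent version
costs a transaction per extent rather than a memcpy of the extent list.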
> >
> > IOWs, named persistent clones are *much more complex* than ephemeral
> > clones. The overhead you are measuring is the result of all the
> > persistent cross referencing and reference counting metadata we need
> > to atomically update on each extent sharing operation to ensure
> > long term persistent clones work correctly.
> >
> > If we were to implement ephemeral clones as per the mechanism you've
> > outlined in the papers above, then we could just copy the in-memory
> > extent list btree with a series of memcpy() operations because we
> > don't need persistent on-disk shared reference counting to implement
> > it....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
* atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE))
2022-12-20 3:06 ` Darrick J. Wong
@ 2022-12-21 22:34 ` Terence Kelly
0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-21 22:34 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Suyash Mahar, Dave Chinner, linux-xfs, Suyash Mahar
Hi Darrick,
I should have mentioned this earlier, but for several years XFS developer
Christoph Hellwig has been working on a feature inspired by the FAST 2015
paper. My HP colleagues and I met Christoph at FAST 2015 and he expressed
interest in doing something similar in XFS. Since then he has reported
doing a considerable amount of work toward that goal, though I don't know
the current state of his efforts.
I'm just pointing out a possible connection between the "atomic file
commits" described below and Christoph's work; I don't know if the
implementations are similar, but to an outsider it sounds like they aspire
to serve the same purpose: Enabling applications to efficiently evolve
files from one well-defined state to another atomically even in the
presence of failure.
Regardless of how and by whom this goal is achieved, folks like Suyash and
I eagerly await the results.
May the Force be with you!
-- Terence
On Mon, 19 Dec 2022, Darrick J. Wong wrote:
> ...
>
> <cough> The bits needed for atomic file commits have been out for review
> on fsdevel since **before the COVID19 pandemic started**. It's buried
> in the middle of the online repair featureset.
>
> Summary of the usage model:
>
> fd = open(sourcefile...)
> tmp_fd = open(..., O_TMPFILE)
>
> ioctl(tmp_fd, FICLONE, fd); /* clone data to temporary file */
>
> /* write whatever you want to the temporary file */
>
> ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */
>
> close(tmp_fd)
>
> True, this isn't an ephemeral file -- for such a thing, we could just
> duplicate the in-memory data fork and never commit it to disk. But that
> said, I've been trying to get the parts I /have/ built merged for three
> years.
>
> I'm planning to push the whole giant thing to the list on Thursday.
>
> --D
* wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE))
2022-12-20 2:16 ` Dave Chinner
@ 2022-12-21 23:07 ` Terence Kelly
0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-21 23:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar
Hi Dave,
To answer your question below:
When we sent our observations about ioctl(FICLONE) performance recently,
starting this e-mail thread, we were hoping for one of several outcomes:
Perhaps we were misusing the feature, in which case guidance on how to
obtain better performance would be helpful. Or if we're not doing
anything wrong, an explanation of why ioctl(FICLONE) isn't as fast as we
expected based on experience with the clone-based crash-tolerance
mechanism in AdvFS. In recent days we've been getting the latter, for
which we are grateful. We may try to pass along your explanations in a
paper we're writing; if so we'll offer y'all the opportunity to review
this paper and ask if you'd like to be acknowledged.
In the longer term, we're very interested in any developments related to
crash tolerance. The details of interfaces are less important as long as
user-level applications can with reasonable convenience and performance
obtain a simple guarantee: Following a power failure or other crash a
file can always be restored to a state that the application deemed
consistent (application-level invariants & correctness criteria hold).
Ideally the application would like a synchronous function call whose
successful return provides the consistent-recoverability guarantee for the
current state of the file. That's the guarantee that the original
failure-atomic msync() of EuroSys 2013 provided.
Obtaining this guarantee with ioctl(FICLONE) is quite convenient: When
the application knows that the file is in a consistent state, the
application makes a clone and stashes the clone in a safe place. Loosely
speaking, the performance desired is that the work of cloning should be
"O(delta) not O(data)", i.e., the time and effort required to make & stash
a clone should be proportional to the amount of data in the file changed
between consecutive clones, not to the logical size of the entire file.
I gather from our recent correspondence that XFS cloning today requires
O(data) time and effort, not O(delta). Still, that is progress of a
sort: we now have a much better understanding of what's going on under
the hood.
We understand that you're volunteers and that you're busy with many
important matters. We're not asking for any further work, though we'll
surely applaud from the sidelines any improvements toward crash tolerance.
I've been thinking about alternative approaches to crash tolerance for
over a decade. In practice today people use things like relational
databases and transactional key-value stores to protect application data
integrity from crashes. I'm interested in other approaches, including but
not limited to failure-atomic msync() and the moral equivalents thereof
implemented with help from file systems. I've worked on a half-dozen
variants of this theme and I'd be happy to explain why I think this area
is exciting to anyone willing to listen. In a nutshell I look forward to
the day when file systems render relational databases and transactional
key-value stores obsolete for some (not all) use cases.
Thanks again for your extraordinary help clarifying matters, which goes
above & beyond the call of duty, and happy holidays!
-- Terence
On Tue, 20 Dec 2022, Dave Chinner wrote:
>> I mainly want to emphasize that nobody is asking for the behavior of
>> AdvFS in that FAST 2015 paper.
>
> OK, so what are you asking us to do, then?
end of thread, other threads:[~2022-12-21 23:08 UTC | newest]
Thread overview: 14+ messages
-- links below jump to the message on this page --
[not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com>
2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
2022-12-14 1:46 ` Terence Kelly
2022-12-14 4:47 ` Suyash Mahar
2022-12-15 0:19 ` Dave Chinner
2022-12-16 1:06 ` Terence Kelly
2022-12-17 17:30 ` Mike Fleetwood
2022-12-17 18:43 ` Terence Kelly
2022-12-18 1:46 ` Dave Chinner
2022-12-18 4:47 ` Suyash Mahar
2022-12-20 3:06 ` Darrick J. Wong
2022-12-21 22:34 ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
2022-12-20 2:16 ` Dave Chinner
2022-12-21 23:07 ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly