public inbox for linux-xfs@vger.kernel.org
* Re: XFS reflink overhead, ioctl(FICLONE)
       [not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com>
@ 2022-12-13 17:18 ` Darrick J. Wong
  2022-12-14  1:46   ` Terence Kelly
  2022-12-14  4:47   ` Suyash Mahar
  0 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-12-13 17:18 UTC (permalink / raw)
  To: Suyash Mahar; +Cc: linux-xfs, tpkelly, Suyash Mahar

[ugh, your email never made it to the list.  I bet the email security
standards have been tightened again.  <insert rant about dkim and dmarc
silent failures here>] :(

On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> Hi all!
> 
> While using XFS's ioctl(FICLONE), we found that XFS seems to have
> poor performance (ioctl takes milliseconds for sparse files) and the
> overhead
> increases with every call.
> 
> For the demo, we are using an Optane DC-PMM configured as a
> block device (fsdax) and running XFS (Linux v5.18.13).

How are you using fsdax and reflink on a 5.18 kernel?  That combination
of features wasn't supported until 6.0, and the data corruption problems
won't get fixed until a pull request that's about to happen for 6.2.

> We create a 1 GiB dense file, then repeatedly modify a tiny random
> fraction of it and make a clone via ioctl(FICLONE).

Yay, random cow writes, that will slowly increase the number of space
mapping records in the file metadata.
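To make this concrete, here is a toy model (purely illustrative; real XFS extent records live in on-disk btrees) of how random single-block COW writes fragment a once-contiguous file: each write can split the extent it lands in, so the mapping count grows with the number of distinct blocks touched.

```python
import random

def simulate_extents(file_blocks, writes_per_clone, clones, seed=0):
    """Toy model of extent-count growth under random COW writes.

    The file starts as one contiguous extent.  Each single-block COW
    write introduces extent boundaries around the rewritten block,
    splitting whatever extent it lands in.  Returns the extent count
    after each clone round."""
    rng = random.Random(seed)
    # represent the mapping as a sorted set of extent boundaries;
    # number of extents == number of boundaries - 1
    boundaries = {0, file_blocks}
    history = []
    for _ in range(clones):
        for _ in range(writes_per_clone):
            b = rng.randrange(file_blocks)
            boundaries.add(b)       # left edge of the rewritten block
            boundaries.add(b + 1)   # right edge of the rewritten block
        history.append(len(boundaries) - 1)
    return history
```

With a 1 GiB file and a few hundred clone rounds, the count climbs toward one extent per touched block, which is the O(number_extents) growth discussed later in the thread.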

> The time required for the ioctl() calls increases from large to insane
> over the course of ~250 iterations: From roughly a millisecond for the
> first iteration or two (which seems high, given that this is on
> Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> milliseconds (which seems crazy).

Does the system call runtime increase with O(number_extents)?  You might
record the number of extents in the file you're cloning by running this
periodically:

xfs_io -c stat $path | grep fsxattr.nextents

FICLONE (at least on XFS) persists dirty pagecache data to disk, and
then duplicates all written-space mapping records from the source file to
the destination file.  It skips preallocated mappings created with
fallocate.
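For completeness, the clone itself can be driven straight from userspace.  A minimal sketch (the FICLONE request number comes from <linux/fs.h>; on filesystems without reflink support the ioctl fails, which this helper reports rather than raising):

```python
import errno
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def try_ficlone(src_path, dst_path):
    """Clone src into dst via ioctl(FICLONE).

    Returns True on success, False when the filesystem (or the
    src/dst combination) does not support reflink."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError as e:
            if e.errno in (errno.EOPNOTSUPP, errno.ENOTTY,
                           errno.EXDEV, errno.EINVAL):
                return False
            raise
```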

So yes, the plot is exactly what I was expecting.

--D

> The plot is attached to this email.
> 
> A cursory look at the extent map suggests that it gets increasingly
> complicated resulting in the complexity.
> 
> The enclosed tarball contains our code, our results, and some other info
> like a flame graph that might shed light on where the ioctl is spending
> its time.
> 
> - Suyash & Terence




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
@ 2022-12-14  1:46   ` Terence Kelly
  2022-12-14  4:47   ` Suyash Mahar
  1 sibling, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-14  1:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Suyash Mahar, linux-xfs, Suyash Mahar



Hi Darrick,

Thanks for your quick and detailed reply.

The thing that really puzzled me when I re-ran Suyash's experiments on a 
DRAM-backed file system is that the ioctl(FICLONE) calls were still very 
very slow.  A slow block storage device can't be blamed, because there 
wasn't a slow block storage device anywhere in the picture; the slowness 
came from software.

Suyash, can you send those results?

-- Terence Kelly



On Tue, 13 Dec 2022, Darrick J. Wong wrote:

> FICLONE (at least on XFS) persists dirty pagecache data to disk, and 
> then duplicates all written-space mapping records from the source file 
> to the destination file.  It skips preallocated mappings created with 
> fallocate.
> 
> So yes, the plot is exactly what I was expecting.


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
  2022-12-14  1:46   ` Terence Kelly
@ 2022-12-14  4:47   ` Suyash Mahar
  2022-12-15  0:19     ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Suyash Mahar @ 2022-12-14  4:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, tpkelly, Suyash Mahar

Hi Darrick,

Thank you for the response. I have replied inline.

-Suyash

On Tue, Dec 13, 2022 at 9:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> [ugh, your email never made it to the list.  I bet the email security
> standards have been tightened again.  <insert rant about dkim and dmarc
> silent failures here>] :(
>
> On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> > Hi all!
> >
> > While using XFS's ioctl(FICLONE), we found that XFS seems to have
> > poor performance (ioctl takes milliseconds for sparse files) and the
> > overhead
> > increases with every call.
> >
> > For the demo, we are using an Optane DC-PMM configured as a
> > block device (fsdax) and running XFS (Linux v5.18.13).
>
> How are you using fsdax and reflink on a 5.18 kernel?  That combination
> of features wasn't supported until 6.0, and the data corruption problems
> won't get fixed until a pull request that's about to happen for 6.2.

We did not enable the dax option. The optane DIMMs are configured to
appear as a block device.

$ mount | grep xfs
/dev/pmem0p4 on /mnt/pmem0p4 type xfs
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Regardless of the block device (the plot includes results for optane
and RamFS), it seems like the ioctl(FICLONE) call is slow.

> > We create a 1 GiB dense file, then repeatedly modify a tiny random
> > fraction of it and make a clone via ioctl(FICLONE).
>
> Yay, random cow writes, that will slowly increase the number of space
> mapping records in the file metadata.
>
> > The time required for the ioctl() calls increases from large to insane
> > over the course of ~250 iterations: From roughly a millisecond for the
> > first iteration or two (which seems high, given that this is on
> > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > milliseconds (which seems crazy).
>
> Does the system call runtime increase with O(number_extents)?  You might
> record the number of extents in the file you're cloning by running this
> periodically:
>
> xfs_io -c stat $path | grep fsxattr.nextents

The extent count does increase linearly (just like the ioctl() call latency).
I used the xfs_bmap tool, let me know if this is not the right way. If
it is not, I'll update the microbenchmark to run xfs_io.

> FICLONE (at least on XFS) persists dirty pagecache data to disk, and
> then duplicates all written-space mapping records from the source file to
> the destination file.  It skips preallocated mappings created with
> fallocate.
>
> So yes, the plot is exactly what I was expecting.
>
> --D
>
> > The plot is attached to this email.
> >
> > A cursory look at the extent map suggests that it gets increasingly
> > complicated resulting in the complexity.
> >
> > The enclosed tarball contains our code, our results, and some other info
> > like a flame graph that might shed light on where the ioctl is spending
> > its time.
> >
> > - Suyash & Terence


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-14  4:47   ` Suyash Mahar
@ 2022-12-15  0:19     ` Dave Chinner
  2022-12-16  1:06       ` Terence Kelly
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2022-12-15  0:19 UTC (permalink / raw)
  To: Suyash Mahar; +Cc: Darrick J. Wong, linux-xfs, tpkelly, Suyash Mahar

On Tue, Dec 13, 2022 at 08:47:03PM -0800, Suyash Mahar wrote:
> Hi Darrick,
> 
> Thank you for the response. I have replied inline.
> 
> -Suyash
> 
> On Tue, Dec 13, 2022 at 9:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > [ugh, your email never made it to the list.  I bet the email security
> > standards have been tightened again.  <insert rant about dkim and dmarc
> > silent failures here>] :(
> >
> > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote:
> > > Hi all!
> > >
> > > While using XFS's ioctl(FICLONE), we found that XFS seems to have
> > > poor performance (ioctl takes milliseconds for sparse files) and the
> > > overhead
> > > increases with every call.
> > >
> > > For the demo, we are using an Optane DC-PMM configured as a
> > > block device (fsdax) and running XFS (Linux v5.18.13).
> >
> > How are you using fsdax and reflink on a 5.18 kernel?  That combination
> > of features wasn't supported until 6.0, and the data corruption problems
> > won't get fixed until a pull request that's about to happen for 6.2.
> 
> We did not enable the dax option. The optane DIMMs are configured to
> appear as a block device.
> 
> $ mount | grep xfs
> /dev/pmem0p4 on /mnt/pmem0p4 type xfs
> (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
> 
> Regardless of the block device (the plot includes results for optane
> and RamFS), it seems like the ioctl(FICLONE) call is slow.

Please define "slow" - is it actually slower than it should be
(i.e. a bug) or does it simply not perform according to your
expectations?

A few things you can quantify to answer these questions:

1. What is the actual rate at which it is cloning extents? i.e. extent
count / clone time.  Is this rate consistent/sustained, or does it drop
substantially over time and/or with increasing extent count?

2. How does clone speed of a given file compare to the actual data
copy speed of that file (please include fsync time in the data
copy results)? Is cloning faster or slower than copying
the data? What is the extent count of the file at the cross-over
point where cloning goes from being faster to slower than copying
the data?

3. How does it compare with btrfs running the same write/clone
workload? Does btrfs run faster? Does it perform better with
high extent counts than XFS? What about with high sharing counts
(e.g. after 500 or 1000 clones of the source file)?

Basically, I'm trying to understand what "slow" means in the context
of the operations you are performing.  I haven't seen any recent
performance regressions in clone speed on XFS, so I'm trying to
understand what you are seeing and why you think it is slower than
it should be.

> > > We create a 1 GiB dense file, then repeatedly modify a tiny random
> > > fraction of it and make a clone via ioctl(FICLONE).
> >
> > Yay, random cow writes, that will slowly increase the number of space
> > mapping records in the file metadata.

Yup, the scripts I use do exactly this - 10,000 random 4kB writes to
an 8GB file between reflink clones. I then iterate a few thousand
times and measure the reflink time.

> > > The time required for the ioctl() calls increases from large to insane
> > > over the course of ~250 iterations: From roughly a millisecond for the
> > > first iteration or two (which seems high, given that this is on
> > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20
> > > milliseconds (which seems crazy).
> >
> > Does the system call runtime increase with O(number_extents)?  You might
> > record the number of extents in the file you're cloning by running this
> > periodically:
> >
> > xfs_io -c stat $path | grep fsxattr.nextents
> 
> The extent count does increase linearly (just like the ioctl() call latency).

As expected. Changing the sharing state of a single extent has a
roughly constant overhead regardless of the number of extents in the
file. Hence clone time should scale linearly with the number of
extents that need to have their shared state modified.

> I used the xfs_bmap tool, let me know if this is not the right way. If
> it is not, I'll update the microbenchmark to run xfs_io.

xfs_bmap is the slow way - it has to iterate every extent and
format them out to userspace. The above mechanism just does a single
syscall to query the count of extents from the inode. Using the
fsxattr extent count query is much faster, especially when you have
files with tens of millions of extents in them....
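For scripting, the same fsxattr query that xfs_io issues can be made directly.  A sketch (the ioctl number and the 28-byte struct fsxattr layout are taken from <linux/fs.h>; it returns None on filesystems that don't implement FS_IOC_FSGETXATTR):

```python
import fcntl
import struct

FS_IOC_FSGETXATTR = 0x801c581f  # _IOR('X', 31, struct fsxattr)

def nextents(path):
    """Return fsx_nextents for path, or None if FS_IOC_FSGETXATTR
    is not supported on this filesystem."""
    # struct fsxattr: five __u32 fields followed by 8 pad bytes = 28 bytes
    buf = bytearray(28)
    with open(path, "rb") as f:
        try:
            fcntl.ioctl(f.fileno(), FS_IOC_FSGETXATTR, buf)
        except OSError:
            return None
    _xflags, _extsize, n, _projid, _cowextsize = struct.unpack_from("=5I", buf)
    return n
```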

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-15  0:19     ` Dave Chinner
@ 2022-12-16  1:06       ` Terence Kelly
  2022-12-17 17:30         ` Mike Fleetwood
  2022-12-18  1:46         ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-16  1:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar


Hi Dave,

Thanks for your quick and detailed reply.  More inline....

On Thu, 15 Dec 2022, Dave Chinner wrote:

>> Regardless of the block device (the plot includes results for optane 
>> and RamFS), it seems like the ioctl(FICLONE) call is slow.
>
> Please define "slow" - is it actually slower than it should be (i.e. a 
> bug) or does it simply not perform according to your expectations?

I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took 
*milli*seconds right from the start, and grew to *tens* of milliseconds. 
There's no slow block storage device to increase latency; all of the 
latency is due to software.  I was expecting microseconds of latency with 
DRAM underneath.

Performance matters because cloning is an excellent crash-tolerance 
mechanism.  Applications that maintain persistent state in files --- 
that's a huge number of applications --- can make clones of said files and 
recover from crashes by reverting to the most recent successful clone. 
In many situations this is much easier and better than shoe-horning 
application data into something like an ACID-transactional relational 
database or transactional key-value store.  But the run-time cost of 
making a clone during failure-free operation can't be excessive.  Cloning 
for crash tolerance usually requires durable media beneath the file system 
(HDD or SSD, not DRAM), so performance on block storage devices matters 
too.  We measured performance of cloning atop DRAM to understand how much 
latency is due to block storage hardware vs. software alone.

My colleagues and I started working on clone-based crash tolerance 
mechanisms nearly a decade ago.  Extensive experience with cloning and 
related mechanisms in the HP Advanced File System (AdvFS), a Linux port of 
the DEC Tru64 file system, taught me to expect cloning to be *faster* than 
alternatives for crash tolerance:

https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

https://web.eecs.umich.edu/~tpkelly/papers/HPL-2015-103.pdf

The point I'm trying to make is:  I'm a serious customer who loves cloning 
and my performance expectations aren't based on idle speculation but on 
experience with other cloning implementations.  (AdvFS is not open source 
and I'm no longer an HP employee, so I no longer have access to it.)

More recently I torture-tested XFS cloning as a crash-tolerance mechanism 
by subjecting it to real whole-system power interruptions:

https://dl.acm.org/doi/pdf/10.1145/3400899.3400902

I performed these correctness tests before making any performance 
measurements because I don't care how fast a mechanism is if it doesn't 
correctly tolerate crashes.  XFS passed the power-fail tests with flying 
colors.  Now it's time to consider performance.

I'm surprised that in XFS, cloning alone *without* fsync() pushes data 
down to storage.  I would have expected that the implementation of cloning 
would always operate upon memory alone, and that an explicit fsync() would 
be required to force data down to durable media.  Analogy:  write() 
doesn't modify storage; write() plus fsync() does.  Is there a reason why 
copying via ioctl(FICLONE) isn't similar?

Finally I understand your explanation that the cost of cloning is 
proportional to the size of the extent map, and that in the limit where 
the extent map is very large, cloning a file of size N requires O(N) time. 
However the constant factors surprise me.  If memory serves we were seeing 
latencies of milliseconds atop DRAM for the first few clones on files that 
began as sparse files and had only a few blocks written to them.  Copying 
the extent map on a DRAM file system must be tantamount to a bunch of 
memcpy() calls (right?), and I'm surprised that the volume of data that 
must be memcpy'd is so large that it takes milliseconds.

We might be able to take some of the additional measurements you suggested 
during/after the holidays.

Thanks again.

> A few things that you can quantify to answer these questions.
>
> ...


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-16  1:06       ` Terence Kelly
@ 2022-12-17 17:30         ` Mike Fleetwood
  2022-12-17 18:43           ` Terence Kelly
  2022-12-18  1:46         ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Fleetwood @ 2022-12-17 17:30 UTC (permalink / raw)
  To: Terence Kelly
  Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs,
	Suyash Mahar

On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote:
> (AdvFS is not open source
> and I'm no longer an HP employee, so I no longer have access to it.)

Just to put the record straight, HP did (abandon and) open source AdvFS
in June 2008.
https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html

It's available under a GPLv2 license from
https://advfs.sourceforge.net/

Mike


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-17 17:30         ` Mike Fleetwood
@ 2022-12-17 18:43           ` Terence Kelly
  0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-17 18:43 UTC (permalink / raw)
  To: Mike Fleetwood
  Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs,
	Suyash Mahar



It's confusing.

My FAST '15 paper was co-authored with AdvFS developers from the HP 
Storage Division.  The paper mentions the open-source release of AdvFS.

There's not a lot of recent activity on open-source AdvFS:

https://sourceforge.net/p/advfs/discussion/

One thing is certain, however:  HP did not "abandon" AdvFS in 2008.  At 
the time of my FAST paper it was used under the hood in HP products and 
was being actively developed internally.  See Section 3 of the FAST paper. 
The whole point of the paper is to describe a new (internal-only) AdvFS 
feature.

I'm pretty sure (relying on memory) that the changes to AdvFS made by HP 
between 2008 and 2015 did not find their way into the open-source release.



On Sat, 17 Dec 2022, Mike Fleetwood wrote:

> On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote:

>> (AdvFS is not open source and I'm no longer an HP employee, so I no 
>> longer have access to it.)
>
> Just to put the record straight, HP did (abandon and) open source AdvFS
> in June 2008.
> https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html
>
> It's available under a GPLv2 license from
> https://advfs.sourceforge.net/
>
> Mike
>


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-16  1:06       ` Terence Kelly
  2022-12-17 17:30         ` Mike Fleetwood
@ 2022-12-18  1:46         ` Dave Chinner
  2022-12-18  4:47           ` Suyash Mahar
  2022-12-18 23:40           ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
  1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2022-12-18  1:46 UTC (permalink / raw)
  To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar

On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
> 
> Hi Dave,
> 
> Thanks for your quick and detailed reply.  More inline....
> 
> On Thu, 15 Dec 2022, Dave Chinner wrote:
> 
> > > Regardless of the block device (the plot includes results for optane
> > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > 
> > Please define "slow" - is it actually slower than it should be (i.e. a
> > bug) or does it simply not perform according to your expectations?
> 
> I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> *milli*seconds right from the start, and grew to *tens* of milliseconds.
> There's no slow block storage device to increase latency; all of the latency
> is due to software.  I was expecting microseconds of latency with DRAM
> underneath.

Ah - slower than expectations then, and you have unrealistic
expectations about how "fast" DRAM is.

From a storage engineer's perspective, DRAM is slow compared to nvme
based flash storage - DRAM has better access latency, but on all
other aspects of storage performance and capability, it falls way
behind pcie attached storage because the *CPU time* is the limiting
factor in storage performance these days, not storage device speed.

The problem with DRAM based storage (and DAX in general) is that
data movement is run by the CPU - it's synchronous storage.
Filesystems like XFS are built around highly concurrent pipelined
asynchronous IO hardware. Filesystems are capable of keeping
thousands of IOs in flight *per CPU*, but on synchronous storage
like DRAM we can only have *1 IO per CPU* in flight at any given
time.

Hence when we compare synchronous write performance, DRAM is fast
compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
numbers go the other way and SSDs come out further in front the more
of them you attach to the system. DRAM based IO doesn't get any
faster because it still can only process one IO at a time, whilst
*each SSD* can process 100+ IOs at a time.

IOWs, for normal block based storage we only use the CPU to marshall
the data movement in the system, and the hardware takes care of the
data movement. i.e. DMA-based storage devices are a hardware offload
mechanism. DRAM based storage relies on the CPU to move data, and so
we use all the time that the CPU could be sending IO to the hardware
to move data in DRAM from A to B. 


Put simply: DRAM can only be considered fast if your application
does (or is optimised for) synchronous IO. For all other uses, DRAM
based storage is a poor choice.

> Performance matters because cloning is an excellent crash-tolerance
> mechanism.

Guaranteeing filesystem and data integrity is our primary focus when
building infrastructure that can be used for crash-tolerance
mechanisms...

> Applications that maintain persistent state in files --- that's
> a huge number of applications --- can make clones of said files and recover
> from crashes by reverting to the most recent successful clone.

... and that's the data integrity guarantee that the filesystem
*must* provide the application. 

> In many
> situations this is much easier and better than shoe-horning application data
> into something like an ACID-transactional relational database or
> transactional key-value store.

Of course. But that doesn't mean the need for ACID-transactional
database functionality goes away.  We've just moved that
functionality into the filesystem to implement FICLONE.

> But the run-time cost of making a clone
> during failure-free operation can't be excessive.

Define "excessive".

Our design constraints were that FICLONE had to be faster than
copying the data, and needed to have fixed cost per shared extent
reference modification or better so that it could scale to millions
of extents without bringing the filesystem, storage and/or system
to its knees when someone tried to do that.

Remember - extent sharing and clones were retrofitted to XFS 20
years after it was designed. We had to make lots of compromises
just to make it work correctly, let alone achieve the performance
requirements we set a decade ago.

> Cloning for crash
> tolerance usually requires durable media beneath the file system (HDD or
> SSD, not DRAM), so performance on block storage devices matters too.  We
> measured performance of cloning atop DRAM to understand how much latency is
> due to block storage hardware vs. software alone.

Cloning is a CPU intensive operation, not an IO intensive operation.
What you are measuring is *entirely* the CPU overhead of doing all
the transactions and cross-referencing needed to track extent
sharing in a manner that is crash consistent, atomic and fully
recoverable.

> My colleagues and I started working on clone-based crash tolerance
> mechanisms nearly a decade ago.  Extensive experience with cloning and
> related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> alternatives for crash tolerance:

Cloning files on XFS and btrfs is still much faster than the
existing safe overwrite mechanism of {create a whole new data copy,
fsync, rename, fsync}. So I'm not sure what you're actually
complaining about here.


> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

Ah, now I get it. You want *anonymous ephemeral clones*, not named
persistent clones.  For everyone else, so they don't have to read
the paper and try to work it out:

The mechanism hacks the O_ATOMIC path to instantiate a whole
new cloned inode which is linked into a hidden namespace in the
filesystem so that the user can't see it but it is still present
after a crash.

It doesn't track the cloned extents in a persistent index, the
hidden file simply shares the same block map on disk and the sharing
is tracked in memory. After a crash, nothing is done with this until
the original file is instantiated in memory. At this point, the
hidden clone file(s) are accessed, the shared state is recovered
in memory, and decisions are made about which copy contains the
most recent data.

The clone is only present while the fd returned by the
open(O_ATOMIC) is valid. On close(), the clone is deleted and all
the in-memory and hidden on-disk state is torn down. Effectively,
the close() operation becomes an unlink().

Further, a new syscall (called syncv()) that takes a vector of these
O_ATOMIC cloned file descriptors is added. This syscall forces the
filesystem to make the inode -metadata- persistent without requiring
data modifications to be persistent. This allows the ephemeral
clones to be persisted without requiring the data in the original
file to be written to disk. At this point, we have a hidden clone
with a matching block map that can be used for crash recovery
purposes.

This clone mechanism in advfs is limited by journal size - 256
clones per 128MB journal space due to reservation space needed for
clone deletes.

----

So my reading of this paper is that the "file clone operation"
essentially creates an ephemeral clone rather than a persistent
named clone. I think they are more equivalent to ephemeral tmp files
than FICLONE. That is, we use open(O_TMPFILE) to create an
ephemeral temporary file attached to a file descriptor instead of
requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
unlinking it and holding the fd open, or relying on /tmp being
volatile or cleaned at boot to remove tmpfiles on crash.

Hence the difference in functionality is that FICLONE provides
persistent, unrestricted named clones rather than ephemeral clones.


We could implement ephemeral clones in XFS, but nobody has ever
mentioned needing or wanting such functionality until this thread.
Darrick already has patches to provide an internal hidden
persistent namespace for XFS filesystems, we could add a new O_CLONE
open flag that provides ephemeral clone behaviour, we could add a
flag to the inode to indicate it has ephemeral clones that need
recovery on next access, add in-memory tracking of ephemeral shared
extents to trigger COW instead of overwrite in place, etc. It's just
a matter of time
and resources.

If you've got resources available to implement this, I can find the
time to help design and integrate it into the VFS and XFS....

> The point I'm trying to make is:  I'm a serious customer who loves cloning
> and my performance expectations aren't based on idle speculation but on
> experience with other cloning implementations.  (AdvFS is not open source
> and I'm no longer an HP employee, so I no longer have access to it.)
> 
> More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> subjecting it to real whole-system power interruptions:
> 
> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902

Heh. You're still using hardware to do filesystem power fail
testing?  We moved away from needing hardware to do power fail
testing of filesystems several years ago.

Using functionality like dm-logwrites, we can simulate the effect of
several hundred different power fail cases with write-by-write
replay and recovery in the space of a couple of minutes.

Not only that, failures are fully replayable and so we can actually
debug every single individual failure without having to guess at the
runtime context that created the failure or the recovery context
that exposed the failure.

This infrastructure has provided us with a massive step forward for
improving crash resilience and recovery capability in ext4, btrfs and
XFS.  These tests are built into automated tests suites (e.g.
fstests) that pretty much all linux fs engineers and distro QE teams
run these days.

IOWs, hardware based power fail testing of filesystems is largely
obsolete these days....

> I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> to storage.  I would have expected that the implementation of cloning would
> always operate upon memory alone, and that an explicit fsync() would be
> required to force data down to durable media.  Analogy:  write() doesn't
> modify storage; write() plus fsync() does.  Is there a reason why copying
> via ioctl(FICLONE) isn't similar?

Because FICLONE provides a persistent named clone that is a fully
functioning file in its own right.  That means it has to be
completely independent of the source file by the time the FICLONE
operation completes.  This implies that there is a certain order to
the operations the clone performs - the data has to be on disk
before the clone is made persistent and recoverable so that both
files are guaranteed to have identical contents if we crash
immediately after the clone completes.

> Finally I understand your explanation that the cost of cloning is
> proportional to the size of the extent map, and that in the limit where the
> extent map is very large, cloning a file of size N requires O(N) time.
> However the constant factors surprise me.  If memory serves we were seeing
> latencies of milliseconds atop DRAM for the first few clones on files that
> began as sparse files and had only a few blocks written to them.  Copying
> the extent map on a DRAM file system must be tantamount to a bunch of
> memcpy() calls (right?),

At the IO layer, yes, it's just a memcpy.

But we can't just copy a million extents from one in-memory btree to
another. We have to modify the filesystem metadata in an atomic,
transactional, recoverable way. Those transactions work one extent
at a time because each extent might require a different set of
modifications. Persistent clones require tracking of the number of
times a given block on disk is shared so that we know when extent
removals result in the extent no longer being shared and/or
referenced. A file that has been cloned a million times might have
a million extents each shared a different number of times. When we
remove one of those clones, how do we know which blocks are now
unreferenced and need to be freed?

IOWs, named persistent clones are *much more complex* than ephemeral
clones. The overhead you are measuring is the result of all the
persistent cross referencing and reference counting metadata we need
to atomically update on each extent sharing operation to ensure
long-term persistent clones work correctly.
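That reference counting can be pictured with a toy in-memory map (nothing like the real XFS refcount btree, which is persistent and transactional; this only shows the freeing decision): each clone bumps a count on every extent it shares, and removing a file frees only the extents whose count drops to zero.

```python
class ToyRefcounts:
    """Toy per-extent reference counts for shared extents.

    Keys are extent ids; values are how many files reference them."""

    def __init__(self):
        self.counts = {}

    def clone(self, extents):
        # a new reference (original file or clone) bumps every shared extent
        for e in extents:
            self.counts[e] = self.counts.get(e, 0) + 1

    def remove(self, extents):
        # removing a file drops one reference per extent; return the
        # extents that are now unreferenced and can be freed
        freed = []
        for e in extents:
            self.counts[e] -= 1
            if self.counts[e] == 0:
                del self.counts[e]
                freed.append(e)
        return freed
```

The expense measured in this thread comes from making each of those per-extent count updates atomic and persistent on disk.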

If we were to implement ephemeral clones as per the mechanism you've
outlined in the papers above, then we could just copy the in-memory
extent list btree with a series of memcpy() operations because we
don't need persistent on-disk shared reference counting to implement
it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-18  1:46         ` Dave Chinner
@ 2022-12-18  4:47           ` Suyash Mahar
  2022-12-20  3:06             ` Darrick J. Wong
  2022-12-18 23:40           ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
  1 sibling, 1 reply; 14+ messages in thread
From: Suyash Mahar @ 2022-12-18  4:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Terence Kelly, Darrick J. Wong, linux-xfs, Suyash Mahar

Thank you for the detailed response. This does confirm some of our
observations that the overhead is mainly from the software layer. We
did see better performance from optimization in the transaction code
moving from kernel v5.4 to v5.18.

-Suyash

Le sam. 17 déc. 2022 à 17:46, Dave Chinner <david@fromorbit.com> a écrit :
>
> On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
> >
> > Hi Dave,
> >
> > Thanks for your quick and detailed reply.  More inline....
> >
> > On Thu, 15 Dec 2022, Dave Chinner wrote:
> >
> > > > Regardless of the block device (the plot includes results for optane
> > > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > >
> > > Please define "slow" - is it actually slower than it should be (i.e. a
> > > bug) or does it simply not perform according to your expectations?
> >
> > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> > *milli*seconds right from the start, and grew to *tens* of milliseconds.
> > There's no slow block storage device to increase latency; all of the latency
> > is due to software.  I was expecting microseconds of latency with DRAM
> > underneath.
>
> Ah - slower than expectations then, and you have unrealistic
> expectations about how "fast" DRAM is.
>
> From a storage engineer's perspective, DRAM is slow compared to nvme
> based flash storage - DRAM has better access latency, but on all
> other aspects of storage performance and capability, it falls way
> behind pcie attached storage because the *CPU time* is the limiting
> factor in storage performance these days, not storage device speed.
>
> The problem with DRAM based storage (and DAX in general) is that
> data movement is run by the CPU - it's synchronous storage.
> Filesystems like XFS are built around highly concurrent pipelined
> asynchronous IO hardware. Filesystems are capable of keeping
> thousands of IOs in flight *per CPU*, but on synchronous storage
> like DRAM we can only have *1 IO per CPU* in flight at any given
> time.
>
> Hence when we compare synchronous write performance, DRAM is fast
> compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
> numbers go the other way and SSDs come out further in front the more
> of them you attach to the system. DRAM based IO doesn't get any
> faster because it still can only process one IO at a time, whilst
> *each SSD* can process 100+ IOs at a time.
>
> IOWs, for normal block based storage we only use the CPU to marshall
> the data movement in the system, and the hardware takes care of the
> data movement. i.e. DMA-based storage devices are a hardware offload
> mechanism. DRAM based storage relies on the CPU to move data, and so
> we use all the time that the CPU could be sending IO to the hardware
> to move data in DRAM from A to B.
>
>
> Put simply: DRAM can only be considered fast if your application
> does (or is optimised for) synchronous IO. For all other uses, DRAM
> based storage is a poor choice.
>
> > Performance matters because cloning is an excellent crash-tolerance
> > mechanism.
>
> Guaranteeing filesystem and data integrity is our primary focus when
> building infrastructure that can be used for crash-tolerance
> mechanisms...
>
> > Applications that maintain persistent state in files --- that's
> > a huge number of applications --- can make clones of said files and recover
> > from crashes by reverting to the most recent successful clone.
>
> ... and that's the data integrity guarantee that the filesystem
> *must* provide the application.
>
> > In many
> > situations this is much easier and better than shoe-horning application data
> > into something like an ACID-transactional relational database or
> > transactional key-value store.
>
> Of course. But that doesn't mean the need for ACID-transactional
> database functionality goes away.  We've just moved that
> functionality into the filesystem to implement FICLONE
> functionality.
>
> > But the run-time cost of making a clone
> > during failure-free operation can't be excessive.
>
> Define "excessive".
>
> Our design constraints were that FICLONE had to be faster than
> copying the data, and needed to have fixed cost per shared extent
> reference modification or better so that it could scale to millions
> of extents without bringing the filesystem, storage and/or system
> to its knees when someone tried to do that.
>
> Remember - extent sharing and clones were retrofitted to XFS 20
> years after it was designed. We had to make lots of compromises
> just to make it work correctly, let alone achieve the performance
> requirements we set a decade ago.
>
> > Cloning for crash
> > tolerance usually requires durable media beneath the file system (HDD or
> > SSD, not DRAM), so performance on block storage devices matters too.  We
> > measured performance of cloning atop DRAM to understand how much latency is
> > due to block storage hardware vs. software alone.
>
> Cloning is a CPU intensive operation, not an IO intensive operation.
> What you are measuring is *entirely* the CPU overhead of doing all
> the transactions and cross-referencing needed to track extent
> sharing in a manner that is crash consistent, atomic and fully
> recoverable.
>
> > My colleagues and I started working on clone-based crash tolerance
> > mechanisms nearly a decade ago.  Extensive experience with cloning and
> > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> > the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> > alternatives for crash tolerance:
>
> Cloning files on XFS and btrfs is still much faster than the
> existing safe overwrite mechanism of {create a whole new data copy,
> fsync, rename, fsync}. So I'm not sure what you're actually
> complaining about here.
>
>
> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
>
> Ah, now I get it. You want *anonymous ephemeral clones*, not named
> persistent clones.  For everyone else, so they don't have to read
> the paper and try to work it out:
>
> The mechanism is a hacked O_ATOMIC path that instantiates a whole
> new cloned inode which is linked into a hidden namespace in the
> filesystem so the user can't see it, but it is still present after
> a crash.
>
> It doesn't track the cloned extents in a persistent index, the
> hidden file simply shares the same block map on disk and the sharing
> is tracked in memory. After a crash, nothing is done with this until
> the original file is instantiated in memory. At this point, the
> hidden clone file(s) are then accessed and the shared state is
> recovered in memory, and decisions are made about which file
> contains the most recent data.
>
> The clone is only present while the fd returned by the
> open(O_ATOMIC) is valid. On close(), the clone is deleted and all
> the in-memory and hidden on-disk state is torn down. Effectively,
> the close() operation becomes an unlink().
>
> Further, a new syscall (called syncv()) is added that takes a
> vector of these O_ATOMIC cloned file descriptors. This syscall
> forces the filesystem to make the inode -metadata- persistent
> without requiring data modifications to be persistent. This allows
> the ephemeral clones to be persisted without requiring the data in
> the original file to be written to disk. At this point, we have a
> hidden clone with a matching block map that can be used for crash
> recovery purposes.
>
> This clone mechanism in advfs is limited by journal size - 256
> clones per 128MB journal space due to reservation space needed for
> clone deletes.
>
> ----
>
> So my reading of this paper is that the "file clone operation"
> essentially creates an ephemeral clone rather than a persistent
> named clone. I think they are more equivalent to ephemeral tmp
> files than FICLONE. That is, we use open(O_TMPFILE) to create an
> ephemeral temporary file attached to a file descriptor instead of
> requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
> unlinking it and holding the fd open or relying on /tmp being
> volatile or cleaned at boot to remove tmpfiles on crash.
>
> Hence the difference in functionality is that FICLONE provides
> persistent, unrestricted named clones rather than ephemeral clones.
>
>
> We could implement ephemeral clones in XFS, but nobody has ever
> mentioned needing or wanting such functionality until this thread.
> Darrick already has patches to provide an internal hidden
> persistent namespace for XFS filesystems, we could add a new O_CLONE
> open flag that provides ephemeral clone behaviour, we could add a
> flag to the inode to indicate it has ephemeral clones that need
> recovery on next access, add in-memory tracking of ephemeral
> shared extents to trigger COW instead of overwrite in place, etc.
> It's just a matter of time and resources.
>
> If you've got resources available to implement this, I can find the
> time to help design and integrate it into the VFS and XFS....
>
> > The point I'm trying to make is:  I'm a serious customer who loves cloning
> > and my performance expectations aren't based on idle speculation but on
> > experience with other cloning implementations.  (AdvFS is not open source
> > and I'm no longer an HP employee, so I no longer have access to it.)
> >
> > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> > subjecting it to real whole-system power interruptions:
> >
> > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
>
> Heh. You're still using hardware to do filesystem power fail
> testing?  We moved away from needing hardware to do power fail
> testing of filesystems several years ago.
>
> Using functionality like dm-logwrites, we can simulate the effect of
> several hundred different power fail cases with write-by-write
> replay and recovery in the space of a couple of minutes.
>
> Not only that, failures are fully replayable and so we can actually
> debug every single individual failure without having to guess at the
> runtime context that created the failure or the recovery context
> that exposed the failure.
>
> This infrastructure has provided us with a massive step forward for
> improving crash resilience and recovery capability in ext4, btrfs and
> XFS.  These tests are built into automated tests suites (e.g.
> fstests) that pretty much all linux fs engineers and distro QE teams
> run these days.
>
> IOWs, hardware based power fail testing of filesystems is largely
> obsolete these days....
>
> > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> > to storage.  I would have expected that the implementation of cloning would
> > always operate upon memory alone, and that an explicit fsync() would be
> > required to force data down to durable media.  Analogy:  write() doesn't
> > modify storage; write() plus fsync() does.  Is there a reason why copying
> > via ioctl(FICLONE) isn't similar?
>
> Because FICLONE provides a persistent named clone that is a fully
> functioning file in its own right.  That means it has to be
> completely independent of the source file by the time the FICLONE
> operation completes.  This implies that there is a certain order to
> the operations the clone performs - the data has to be on disk
> before the clone is made persistent and recoverable so that both
> files are guaranteed to have identical contents if we crash
> immediately after the clone completes.
>
> > Finally I understand your explanation that the cost of cloning is
> > proportional to the size of the extent map, and that in the limit where the
> > extent map is very large, cloning a file of size N requires O(N) time.
> > However the constant factors surprise me.  If memory serves we were seeing
> > latencies of milliseconds atop DRAM for the first few clones on files that
> > began as sparse files and had only a few blocks written to them.  Copying
> > the extent map on a DRAM file system must be tantamount to a bunch of
> > memcpy() calls (right?),
>
> At the IO layer, yes, it's just a memcpy.
>
> But we can't just copy a million extents from one in-memory btree to
> another. We have to modify the filesystem metadata in an atomic,
> transactional, recoverable way. Those transactions work one extent
> at a time because each extent might require a different set of
> modifications. Persistent clones require tracking of the number of
> times a given block on disk is shared so that we know when extent
> removals result in the extent no longer being shared and/or
> referenced. A file that has been cloned a million times might have
> a million extents each shared a different number of times. When we
> remove one of those clones, how do we know which blocks are now
> unreferenced and need to be freed?
>
> IOWs, named persistent clones are *much more complex* than ephemeral
> clones. The overhead you are measuring is the result of all the
> persistent cross referencing and reference counting metadata we need
> to atomically update on each extent sharing operation to ensure
> long-term persistent clones work correctly.
>
> If we were to implement ephemeral clones as per the mechanism you've
> outlined in the papers above, then we could just copy the in-memory
> extent list btree with a series of memcpy() operations because we
> don't need persistent on-disk shared reference counting to implement
> it....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-18  1:46         ` Dave Chinner
  2022-12-18  4:47           ` Suyash Mahar
@ 2022-12-18 23:40           ` Terence Kelly
  2022-12-20  2:16             ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Terence Kelly @ 2022-12-18 23:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar



On Sun, 18 Dec 2022, Dave Chinner wrote:

>> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
>
> Ah, now I get it. You want *anonymous ephemeral clones*, not named 
> persistent clones.  For everyone else, so they don't have to read the 
> paper and try to work it out:
>
> The mechanism is a hacked O_ATOMIC path ...

No.  To be clear, nobody now in 2022 is asking for the AdvFS features of 
the FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).

The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and 
foreseeable needs, except for performance.

I cited the FAST 2015 paper simply to show that I've worked with a 
clone-based mechanism in the past and it delighted me in every way.  It's 
simply an existence proof that cloning can be delightful for crash 
tolerance.

> Hence the difference in functionality is that FICLONE provides 
> persistent, unrestricted named clones rather than ephemeral clones.

For the record, the AdvFS implementation of clone-based crash tolerance 
--- the moral equivalent of failure-atomic msync(), which was the topic of 
my EuroSys 2013 paper --- involved persistent files on durable storage; 
the files were hidden and were discarded when their usefulness was over 
but the hidden files were not "ephemeral" in the sense of a file in a 
DRAM-backed file system (/tmp/ or /dev/shm/ or whatnot).  AdvFS crash 
tolerance survived real power failures.  But this is a side issue of 
historical interest only.

I mainly want to emphasize that nobody is asking for the behavior of AdvFS 
in that FAST 2015 paper.

> We could implement ephemeral clones in XFS, but nobody has ever 
> mentioned needing or wanting such functionality until this thread.

Nobody needs or wants such functionality, even in this thread.  The 
current ioctl(FICLONE) is perfect except for performance.

>> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
>
> Heh. You're still using hardware to do filesystem power fail testing? 
> We moved away from needing hardware to do power fail testing of 
> filesystems several years ago.
>
> Using functionality like dm-logwrites, we can simulate the effect of 
> several hundred different power fail cases with write-by-write replay 
> and recovery in the space of a couple of minutes.

Cool.  I assume you're familiar with a paper on a similar technique that 
my HP Labs colleagues wrote circa 2013 or 2014:  "Torturing Databases for 
Fun and Profit."

> Not only that, failures are fully replayable and so we can actually 
> debug every single individual failure without having to guess at the 
> runtime context that created the failure or the recovery context that 
> exposed the failure.
>
> This infrastructure has provided us with a massive step forward for 
> improving crash resilence and recovery capability in ext4, btrfs and 
> XFS.  These tests are built into automated tests suites (e.g. fstests) 
> that pretty much all linux fs engineers and distro QE teams run these 
> days.

If you think the world would benefit from reading about this technique and 
using it more widely, I might be able to help.  My column in _Queue_ 
magazine reaches thousands of readers, sometimes tens of thousands.  It's 
about teaching better techniques to working programmers.

I'd be honored to help pass along to my readers practical techniques that 
you're using to improve quality.

> IOWs, hardware based power fail testing of filesystems is largely 
> obsolete these days....

I don't mind telling the world that my own past work is obsolete.  That's 
what progress is all about.

>> I'm surprised that in XFS, cloning alone *without* fsync() pushes data 
>> down to storage.  I would have expected that the implementation of 
>> cloning would always operate upon memory alone, and that an explicit 
>> fsync() would be required to force data down to durable media. 
>> Analogy:  write() doesn't modify storage; write() plus fsync() does. 
>> Is there a reason why copying via ioctl(FICLONE) isn't similar?
>
> Because FICLONE provides a persistent named clone that is a fully 
> functioning file in its own right.  That means it has to be completely 
> independent of the source file by the time the FICLONE operation 
> completes.  This implies that there is a certain order to the operations 
> the clone performs - the data has to be on disk before the clone is 
> made persistent and recoverable so that both files are guaranteed to have 
> identical contents if we crash immediately after the clone completes.

I thought the rule was that if an application doesn't call fsync() or 
msync(), no durability of any kind is guaranteed.  I thought modern file 
systems did all their work in DRAM until an explicit fsync/msync or other 
necessity compelled them to push data down to durable media (in the right 
order etc.).

Also, we might be using terminology differently:

I use "persistent" in the sense of "outlives processes".  Files in /tmp/ 
and /dev/shm/ are persistent, but not durable.

I use "durable" to mean "written to non-volatile media (HDD or SSD) in 
such a way as to guarantee that it will survive power cycling."

I expect *persistence* from ioctl(FICLONE) but I didn't expect a 
*durability* guarantee without fsync().  If I'm understanding you 
correctly, cloning in XFS gives us durability whether we want it or not.

>> Finally I understand your explanation that the cost of cloning is 
>> proportional to the size of the extent map, and that in the limit where 
>> the extent map is very large, cloning a file of size N requires O(N) 
>> time. However the constant factors surprise me.  If memory serves we 
>> were seeing latencies of milliseconds atop DRAM for the first few 
>> clones on files that began as sparse files and had only a few blocks 
>> written to them.  Copying the extent map on a DRAM file system must be 
>> tantamount to a bunch of memcpy() calls (right?),
>
> At the IO layer, yes, it's just a memcpy.
>
> But we can't just copy a million extents from one in-memory btree to 
> another.  We have to modify the filesystem metadata in an atomic, 
> transactional, recoverable way. Those transactions work one extent at a 
> time because each extent might require a different set of modifications.

Ah, so now I see where the time goes.  This is clear.

> Persistent clones require tracking of the number of times a given block 
> on disk is shared so that we know when extent removals result in the 
> extent no longer being shared and/or referenced. A file that has been 
> cloned a million times might have a million extents each shared a 
> different number of times. When we remove one of those clones, how do we 
> know which blocks are now unreferenced and need to be freed?
>
> IOWs, named persistent clones are *much more complex* than ephemeral 
> clones.

Again, I don't know where you're getting "ephemeral" from; that word does 
not appear in the FAST '15 paper.  The AdvFS clones of the FAST '15 paper 
were both durable and persistent; they were just hidden from the 
user-visible namespace.  A crash (power outage or whatever) caused a file 
to revert to the most recent hidden clone.  In AdvFS, a hidden clone was 
created by an fsync/msync call.  This is how AdvFS made file updates 
failure-atomic.

Again, we're not asking for the same functionality of the FAST '15 paper.

However if the contrast between what AdvFS did with clones and how XFS 
works illuminates issues like XFS performance, then it might be worth 
understanding AdvFS.

Incidentally, I really appreciate the time & effort you're taking to 
educate me & Suyash.  I hope I'm not being too sluggish a student, though 
sometimes I am.

For the near term, Suyash and I are getting closer to an understanding of 
today's ioctl(FICLONE) that we can pass along to readers in the paper 
we're writing.

> The overhead you are measuring is the result of all the persistent cross 
> referencing and reference counting metadata we need to atomically update 
> on each extent sharing operation to ensure long-term persistent clones 
> work correctly.

This is clear.  Thanks.

> If we were to implement ephemeral clones as per the mechanism you've 
> outlined in the papers above, then we could just copy the in-memory 
> extent list btree with a series of memcpy() operations because we don't 
> need persistent on-disk shared reference counting to implement it....

We're not on the same page about what AdvFS did.

Of course I'll understand if you don't have time or interest to get on the 
same page; we understand that you're busy with a lot of important work.

Thanks for your help and Happy Holidays!

> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-18 23:40           ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
@ 2022-12-20  2:16             ` Dave Chinner
  2022-12-21 23:07               ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2022-12-20  2:16 UTC (permalink / raw)
  To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar

On Sun, Dec 18, 2022 at 06:40:54PM -0500, Terence Kelly wrote:
> 
> 
> On Sun, 18 Dec 2022, Dave Chinner wrote:
> 
> > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> > 
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones.  For everyone else, so they don't have to read the
> > paper and try to work it out:
> > 
> > The mechanism is a hacked O_ATOMIC path ...
> 
> No.  To be clear, nobody now in 2022 is asking for the AdvFS features of the
> FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).
> 
> The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and
> foreseeable needs, except for performance.
> 
> I cited the FAST 2015 paper simply to show that I've worked with a
> clone-based mechanism in the past and it delighted me in every way.  It's
> simply an existence proof that cloning can be delightful for crash
> tolerance.

Sure, you're preaching to the choir. But the context was quoting a
paper as an example of the cloning performance you expected from XFS
but weren't getting. You're still talking about how XFS clones are
too slow for your needs, but now you are saying you don't want
clones for fault tolerance as implemented in advfs.

> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
> 
> For the record, the AdvFS implementation of clone-based crash tolerance ---
> the moral equivalent of failure-atomic msync(), which was the topic of my
> EuroSys 2013 paper --- involved persistent files on durable storage; the
> files were hidden and were discarded when their usefulness was over but the
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the very definition of an ephemeral filesystem object.

The clones are temporary filesystem objects that exist only within
the context of an active file descriptor, users don't know they
exist, users cannot discover their existence, and they get cleaned
up automatically by the filesystem when they are no longer useful.

Yes, there is some persistent state needed to implement the required
garbage collection semantics of the ephemeral object (just like
O_TMPFILE!), but that doesn't change the fact that users don't know
(or care) that the internal filesystem objects even exist.

Really, I can't think of a better example of an ephemeral object
than this, regardless of whether the paper's authors used that term
or not.

> hidden files were not "ephemeral" in the sense of a file in a DRAM-backed
> file system (/tmp/ or /dev/shm/ or whatnot).  AdvFS crash tolerance survived
> real power failures.  But this is a side issue of historical interest only.
>
> I mainly want to emphasize that nobody is asking for the behavior of AdvFS
> in that FAST 2015 paper.

OK, so what are you asking us to do, then?

[....]

> > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> > 
> > Heh. You're still using hardware to do filesystem power fail testing? We
> > moved away from needing hardware to do power fail testing of filesystems
> > several years ago.
> > 
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write replay
> > and recovery in the space of a couple of minutes.
> 
> Cool.  I assume you're familiar with a paper on a similar technique that my
> HP Labs colleagues wrote circa 2013 or 2014:  "Torturing Databases for Fun
> and Profit."

Nope, but it's not a new or revolutionary technique so I'm not
surprised that other people have done similar things. There's been
plenty of research based on model checking over the past 2-3 decades
- the series of Iron Filesystem papers is a good example of this.
What we have in fstests is just a version of these concepts that
simplifies discovering and debugging previously undiscovered write
ordering issues...

> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context that
> > exposed the failure.
> > 
> > This infrastructure has provided us with a massive step forward for
> > improving crash resilence and recovery capability in ext4, btrfs and
> > XFS.  These tests are built into automated tests suites (e.g. fstests)
> > that pretty much all linux fs engineers and distro QE teams run these
> > days.
> 
> If you think the world would benefit from reading about this technique and
> using it more widely, I might be able to help.  My column in _Queue_
> magazine reaches thousands of readers, sometimes tens of thousands.  It's
> about teaching better techniques to working programmers.

You're welcome to do so - the source code is all there, there's a
mailing list for fstests where you can ask questions about it, etc.
If you think it's valuable for people outside the core linux fs
developer community, then you don't need to ask our permission to
write an article on it....

> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes
> > > data down to storage.  I would have expected that the implementation
> > > of cloning would always operate upon memory alone, and that an
> > > explicit fsync() would be required to force data down to durable
> > > media. Analogy:  write() doesn't modify storage; write() plus
> > > fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't
> > > similar?
> > 
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in it's own right.  That means it has to be completely
> > indepedent of the source file by the time the FICLONE operation
> > completes.  This implies that there is a certain order to the operations
> > the clone performances - the data has to be on disk before the clone is
> > made persistent and recoverable so that both files as guaranteed to have
> > identical contents if we crash immediately after the clone completes.
> 
> I thought the rule was that if an application doesn't call fsync() or
> msync(), no durability of any kind is guaranteed.

No durability of any kind is guaranteed, but that doesn't preclude
the OS and/or filesystem actually performing an operation in a way
that guarantees persistence....

That said, the FICLONE API doesn't guarantee persistence. The
application still has to call fdatasync() to ensure that all the
metadata changes that FICLONE makes are persisted all the way down
to stable storage.

> I thought modern file
> systems did all their work in DRAM until an explicit fsync/msync or other
> necessity compelled them to push data down to durable media (in the right
> order etc.).

Largely, they do. But some operations have dependencies and require
data/metadata update synchronisation, and at that point we have
ordering constraints. To an outside observer, that may look like
the filesystem is trying to provide durability, but in fact it is
doing nothing of the sort...

I suspect you've seen the data writeback in FICLONE and thought this
is because it needs to provide a durability guarantee.

For XFS, this is an ordering constraint - we have to ensure the right
thing happens with delayed allocation and resolve pending COW
operations on a file before we clone the extent map to a new file.
We do this by running writeback to process these pending extent map
operations we deferred at write() time. Once those deferred
operations have been resolved, we can run the transactions to clone
the extent map.

However, if FICLONE is acting on files containing only data at rest,
then it can run without doing a single data IO, and the whole clone
can be lost on crash if fdatasync() is not run once it is complete.

IOWs, the FICLONE API provides no persistence guarantees.
fdatasync/O_DSYNC is still required.

> Also, we might be using terminology differently:
> 
> I use "persistent" in the sense of "outlives processes".  Files in /tmp/ and
> /dev/shm/ are persistent, but not durable.

Yeah, different terminology - you seem to have different frames of
reference for the terms you are using.

The frame of reference I'm using for terminology is filesystem
objects rather than processes or storage.  Stuff that exists purely
in memory (such as tmpfs or shm files) is always considered
"volatile" - they are lost if the system crashes or shuts down.
Volatile storage also includes caches like dirty data in the page
cache and storage devices with DRAM based caches.

Persistent refers to ensuring filesystem objects are not volatile;
they do not get lost during shutdown or abnormal termination because
they have been guaranteed to exist on a stable, permanent storage
media. 

> I use "durable" to mean "written to non-volatile media (HDD or SSD) in such
> a way as to guarantee that it will survive power cycling."

Sure. We typically refer to non-volatile storage media as "stable
storage" because the hardware can be durable in the short term but
volatile in the long term. e.g. battery backed RAM is considered
"stable" if the battery backup lasts longer than 72 hours, but
over long periods it will not retain its contents. Hence calling it
"non-volatile media" isn't really correct - the contents are only
stable over a fixed timeframe.

Regardless of terminology, "persisting objects to stable
storage" is effectively the same thing as "making durable".

> I expect *persistence* from ioctl(FICLONE) but I didn't expect a
> *durability* guarantee without fsync().  If I'm understanding you correctly,
> cloning in XFS gives us durability whether we want it or not.

See above. We provide no guarantees about persistence, but in some
cases we can't perform the FICLONE operation correctly without
performing most of the operations needed to provide persistence of
the source file.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-18  4:47           ` Suyash Mahar
@ 2022-12-20  3:06             ` Darrick J. Wong
  2022-12-21 22:34               ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
  0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2022-12-20  3:06 UTC (permalink / raw)
  To: Suyash Mahar; +Cc: Dave Chinner, Terence Kelly, linux-xfs, Suyash Mahar

On Sat, Dec 17, 2022 at 08:47:45PM -0800, Suyash Mahar wrote:
> Thank you for the detailed response. This does confirm some of our
> observations that the overhead is mainly from the software layer. We
> did see better performance from optimization in the transaction code
> moving from kernel v5.4 to v5.18.
> 
> -Suyash
> 
> On Sat, Dec 17, 2022 at 17:46, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote:
> > >
> > > Hi Dave,
> > >
> > > Thanks for your quick and detailed reply.  More inline....
> > >
> > > On Thu, 15 Dec 2022, Dave Chinner wrote:
> > >
> > > > > Regardless of the block device (the plot includes results for optane
> > > > > and RamFS), it seems like the ioctl(FICLONE) call is slow.
> > > >
> > > > Please define "slow" - is it actually slower than it should be (i.e. a
> > > > bug) or does it simply not perform according to your expectations?
> > >
> > > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took
> > > *milli*seconds right from the start, and grew to *tens* of milliseconds.
> > > There's no slow block storage device to increase latency; all of the latency
> > > is due to software.  I was expecting microseconds of latency with DRAM
> > > underneath.
> >
> > Ah - slower than expectations then, and you have unrealistic
> > expectations about how "fast" DRAM is.
> >
> > From a storage engineer's perspective, DRAM is slow compared to nvme
> > based flash storage - DRAM has better access latency, but on all
> > other aspects of storage performance and capability, it falls way
> > behind pcie attached storage because the *CPU time* is the limiting
> > factor in storage performance these days, not storage device speed.
> >
> > The problem with DRAM based storage (and DAX in general) is that
> > data movement is run by the CPU - it's synchronous storage.
> > Filesystems like XFS are built around highly concurrent pipelined
> > asynchronous IO hardware. Filesystems are capable of keeping
> > thousands of IOs in flight *per CPU*, but on synchronous storage
> > like DRAM we can only have *1 IO per CPU* in flight at any given
> > time.
> >
> > Hence when we compare synchronous write performance, DRAM is fast
> > compared to SSDs. When we use async IO (AIO+DIO or io_uring), the
> > numbers go the other way and SSDs come out further in front the more
> > of them you attach to the system. DRAM based IO doesn't get any
> > faster because it still can only process one IO at a time, whilst
> > *each SSD* can process 100+ IOs at a time.
> >
> > IOWs, for normal block based storage we only use the CPU to marshall
> > the data movement in the system, and the hardware takes care of the
> > data movement. i.e. DMA-based storage devices are a hardware offload
> > mechanism. DRAM based storage relies on the CPU to move data, and so
> > we use all the time that the CPU could be sending IO to the hardware
> > to move data in DRAM from A to B.
> >
> >
> > Put simply: DRAM can only be considered fast if your application
> > does (or is optimised for) synchronous IO. For all other uses, DRAM
> > based storage is a poor choice.

Oh, it's worse than that -- since you're using 5.18 with reflink
enabled, DAX will always yield to reflink.  IOWs, the random writes are
done to the pagecache, so the implied fdatasync in the FICLONE
preparation also has to *copy* the dirty pagecache to the pmem.

It would at least be interesting (a) to bump to 6.2, and (b) stuff an
fsync(src_fd) call in before you start timing the FICLONE to see what
proportion of the clone time was actually just pagecache maneuvers.

> > > Performance matters because cloning is an excellent crash-tolerance
> > > mechanism.
> >
> > Guaranteeing filesystem and data integrity is our primary focus when
> > building infrastructure that can be used for crash-tolerance
> > mechanisms...
> >
> > > Applications that maintain persistent state in files --- that's
> > > a huge number of applications --- can make clones of said files and recover
> > > from crashes by reverting to the most recent successful clone.
> >
> > ... and that's the data integrity guarantee that the filesystem
> > *must* provide the application.
> >
> > > In many
> > > situations this is much easier and better than shoe-horning application data
> > > into something like an ACID-transactional relational database or
> > > transactional key-value store.
> >
> > Of course. But that doesn't mean the need for ACID-transactional
> > database functionality goes away.  We've just moved that
> > functionality into the filesystem to implement FICLONE.
> >
> > > But the run-time cost of making a clone
> > > during failure-free operation can't be excessive.
> >
> > Define "excessive".
> >
> > Our design constraints were that FICLONE had to be faster than
> > copying the data, and needed to have fixed cost per shared extent
> > reference modification or better so that it could scale to millions
> > of extents without bringing the filesystem, storage and/or system
> > to its knees when someone tried to do that.
> >
> > Remember - extent sharing and clones were retrofitted to XFS 20
> > years after it was designed. We had to make lots of compromises
> > just to make it work correctly, let alone achieve the performance
> > requirements we set a decade ago.
> >
> > > Cloning for crash
> > > tolerance usually requires durable media beneath the file system (HDD or
> > > SSD, not DRAM), so performance on block storage devices matters too.  We
> > > measured performance of cloning atop DRAM to understand how much latency is
> > > due to block storage hardware vs. software alone.
> >
> > Cloning is a CPU intensive operation, not an IO intensive operation.
> > What you are measuring is *entirely* the CPU overhead of doing all
> > the transactions and cross-referencing needed to track extent
> > sharing in a manner that is crash consistent, atomic and fully
> > recoverable.
> >
> > > My colleagues and I started working on clone-based crash tolerance
> > > mechanisms nearly a decade ago.  Extensive experience with cloning and
> > > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of
> > > the DEC Tru64 file system, taught me to expect cloning to be *faster* than
> > > alternatives for crash tolerance:
> >
> > Cloning files on XFS and btrfs is still much faster than the
> > existing safe overwrite mechanism of {create a whole new data copy,
> > fsync, rename, fsync}. So I'm not sure what you're actually
> > complaining about here.
> >
> >
> > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> >
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones.  For everyone else, so they don't have to read
> > the paper and try to work it out:
> >
> > The mechanism hacks the O_ATOMIC path to instantiate a whole
> > new cloned inode which is linked into a hidden namespace in the
> > filesystem, so the user can't see it but it is still present after
> > a crash.
> >
> > It doesn't track the cloned extents in a persistent index, the
> > hidden file simply shares the same block map on disk and the sharing
> > is tracked in memory. After a crash, nothing is done with this until
> > the original file is instantiated in memory. At this point, the
> > hidden clone file(s) are accessed, the shared state is recovered
> > in memory, and decisions are made about which copy contains the
> > most recent data.
> >
> > The clone is only present while the fd returned by the
> > open(O_ATOMIC) is valid. On close(), the clone is deleted and all
> > the in-memory and hidden on-disk state is torn down. Effectively,
> > the close() operation becomes an unlink().
> >
> > Further, a new syscall (called syncv()), which takes a vector of
> > these O_ATOMIC cloned file descriptors, is added. This syscall
> > forces the filesystem to make the inode -metadata- persistent without
> > requiring data modifications to be persistent. This allows the
> > ephemeral clones to be persisted without requiring the data in the
> > original file to be written to disk. At this point, we have a hidden clone
> > with a matching block map that can be used for crash recovery
> > purposes.
> >
> > This clone mechanism in advfs is limited by journal size - 256
> > clones per 128MB journal space due to reservation space needed for
> > clone deletes.
> >
> > ----
> >
> > So my reading of this paper is that the "file clone operation"
> > essentially creates an ephemeral clone rather than a persistent named
> > clone. I think they are more equivalent to ephemeral tmp files
> > than FICLONE. That is, we use open(O_TMPFILE) to create an
> > ephemeral temporary file attached to a file descriptor instead of
> > requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then
> > unlinking it and holding the fd open or relying on /tmp being
> > volatile or cleaned at boot to remove tmpfiles on crash.
> >
> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
> >
> >
> > We could implement ephemeral clones in XFS, but nobody has ever
> > mentioned needing or wanting such functionality until this thread.
> > Darrick already has patches to provide an internal hidden
> > persistent namespace for XFS filesystems, we could add a new O_CLONE
> > open flag that provides ephemeral clone behaviour, we could add a
> > flag to the inode to indicate it has ephemeral clones that need
> > recovery on next access, add in-memory tracking of ephemeral shared
> > extents to trigger COW instead of overwrite in place, etc. It's just
> > a matter of time
> > and resources.

<cough> The bits needed for atomic file commits have been out for review
on fsdevel since **before the COVID19 pandemic started**.  It's buried
in the middle of the online repair featureset.

Summary of the usage model:

fd = open(sourcefile...)
tmp_fd = open(..., O_TMPFILE)

ioctl(tmp_fd, FICLONE, fd);	/* clone data to temporary file */

/* write whatever you want to the temporary file */

ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */

close(tmp_fd)

True, this isn't an ephemeral file -- for such a thing, we could just
duplicate the in-memory data fork and never commit it to disk.  But that
said, I've been trying to get the parts I /have/ built merged for three
years.

I'm planning to push the whole giant thing to the list on Thursday.

--D

> > If you've got resources available to implement this, I can find the
> > time to help design and integrate it into the VFS and XFS....
> >
> > > The point I'm trying to make is:  I'm a serious customer who loves cloning
> > > and my performance expectations aren't based on idle speculation but on
> > > experience with other cloning implementations.  (AdvFS is not open source
> > > and I'm no longer an HP employee, so I no longer have access to it.)
> > >
> > > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by
> > > subjecting it to real whole-system power interruptions:
> > >
> > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> >
> > Heh. You're still using hardware to do filesystem power fail
> > testing?  We moved away from needing hardware to do power fail
> > testing of filesystems several years ago.
> >
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write
> > replay and recovery in the space of a couple of minutes.
> >
> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context
> > that exposed the failure.
> >
> > This infrastructure has provided us with a massive step forward for
> > improving crash resilience and recovery capability in ext4, btrfs and
> > XFS.  These tests are built into automated tests suites (e.g.
> > fstests) that pretty much all linux fs engineers and distro QE teams
> > run these days.
> >
> > IOWs, hardware based power fail testing of filesystems is largely
> > obsolete these days....
> >
> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down
> > > to storage.  I would have expected that the implementation of cloning would
> > > always operate upon memory alone, and that an explicit fsync() would be
> > > required to force data down to durable media.  Analogy:  write() doesn't
> > > modify storage; write() plus fsync() does.  Is there a reason why copying
> > > via ioctl(FICLONE) isn't similar?
> >
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in its own right.  That means it has to be
> > completely independent of the source file by the time the FICLONE
> > operation completes.  This implies that there is a certain order to
> > the operations the clone performs - the data has to be on disk
> > before the clone is made persistent and recoverable so that both
> > files are guaranteed to have identical contents if we crash
> > immediately after the clone completes.
> >
> > > Finally I understand your explanation that the cost of cloning is
> > > proportional to the size of the extent map, and that in the limit where the
> > > extent map is very large, cloning a file of size N requires O(N) time.
> > > However the constant factors surprise me.  If memory serves we were seeing
> > > latencies of milliseconds atop DRAM for the first few clones on files that
> > > began as sparse files and had only a few blocks written to them.  Copying
> > > the extent map on a DRAM file system must be tantamount to a bunch of
> > > memcpy() calls (right?),
> >
> > At the IO layer, yes, it's just a memcpy.
> >
> > But we can't just copy a million extents from one in-memory btree to
> > another. We have to modify the filesystem metadata in an atomic,
> > transactional, recoverable way. Those transactions work one extent
> > at a time because each extent might require a different set of
> > modifications. Persistent clones require tracking of the number of
> > times a given block on disk is shared so that we know when extent
> > removals result in the extent no longer being shared and/or
> > referenced. A file that has been cloned a million times might have
> > a million extents each shared a different number of times. When we
> > remove one of those clones, how do we know which blocks are now
> > unreferenced and need to be freed?
> >
> > IOWs, named persistent clones are *much more complex* than ephemeral
> > clones. The overhead you are measuring is the result of all the
> > persistent cross referencing and reference counting metadata we need
> > to atomically update on each extent sharing operation to ensure
> > long-term persistent clones work correctly.
> >
> > If we were to implement ephemeral clones as per the mechanism you've
> > outlined in the papers above, then we could just copy the in-memory
> > extent list btree with a series of memcpy() operations because we
> > don't need persistent on-disk shared reference counting to implement
> > it....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE))
  2022-12-20  3:06             ` Darrick J. Wong
@ 2022-12-21 22:34               ` Terence Kelly
  0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-21 22:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Suyash Mahar, Dave Chinner, linux-xfs, Suyash Mahar



Hi Darrick,

I should have mentioned this earlier, but for several years XFS developer 
Christoph Hellwig has been working on a feature inspired by the FAST 2015 
paper.  My HP colleagues and I met Christoph at FAST 2015 and he expressed 
interest in doing something similar in XFS.  Since then he has reported 
doing a considerable amount of work toward that goal, though I don't know 
the current state of his efforts.

I'm just pointing out a possible connection between the "atomic file 
commits" described below and Christoph's work; I don't know if the 
implementations are similar, but to an outsider it sounds like they aspire 
to serve the same purpose:  Enabling applications to efficiently evolve 
files from one well-defined state to another atomically even in the 
presence of failure.

Regardless of how and by whom this goal is achieved, folks like Suyash and 
I eagerly await the results.

May the Force be with you!

-- Terence



On Mon, 19 Dec 2022, Darrick J. Wong wrote:

> ...
>
> <cough> The bits needed for atomic file commits have been out for review 
> on fsdevel since **before the COVID19 pandemic started**.  It's buried 
> in the middle of the online repair featureset.
>
> Summary of the usage model:
>
> fd = open(sourcefile...)
> tmp_fd = open(..., O_TMPFILE)
>
> ioctl(tmp_fd, FICLONE, fd);	/* clone data to temporary file */
>
> /* write whatever you want to the temporary file */
>
> ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */
>
> close(tmp_fd)
>
> True, this isn't an ephemeral file -- for such a thing, we could just 
> duplicate the in-memory data fork and never commit it to disk.  But that 
> said, I've been trying to get the parts I /have/ built merged for three 
> years.
>
> I'm planning to push the whole giant thing to the list on Thursday.
>
> --D

^ permalink raw reply	[flat|nested] 14+ messages in thread

* wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE))
  2022-12-20  2:16             ` Dave Chinner
@ 2022-12-21 23:07               ` Terence Kelly
  0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-21 23:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar


Hi Dave,

To answer your question below:

When we sent our observations about ioctl(FICLONE) performance recently, 
starting this e-mail thread, we were hoping for one of several outcomes: 
Perhaps we were misusing the feature, in which case guidance on how to 
obtain better performance would be helpful.  Or if we're not doing 
anything wrong, an explanation of why ioctl(FICLONE) isn't as fast as we 
expected based on experience with the clone-based crash-tolerance 
mechanism in AdvFS.  In recent days we've been getting the latter, for 
which we are grateful.  We may try to pass along your explanations in a 
paper we're writing; if so we'll offer y'all the opportunity to review 
this paper and ask if you'd like to be acknowledged.

In the longer term, we're very interested in any developments related to 
crash tolerance.  The details of interfaces are less important as long as 
user-level applications can with reasonable convenience and performance 
obtain a simple guarantee:  Following a power failure or other crash a 
file can always be restored to a state that the application deemed 
consistent (application-level invariants & correctness criteria hold). 
Ideally the application would like a synchronous function call whose 
successful return provides the consistent-recoverability guarantee for the 
current state of the file.  That's the guarantee that the original 
failure-atomic msync() of EuroSys 2013 provided.

Obtaining this guarantee with ioctl(FICLONE) is quite convenient:  When 
the application knows that the file is in a consistent state, the 
application makes a clone and stashes the clone in a safe place.  Loosely 
speaking, the performance desired is that the work of cloning should be 
"O(delta) not O(data)", i.e., the time and effort required to make & stash 
a clone should be proportional to the amount of data in the file changed 
between consecutive clones, not to the logical size of the entire file. 
I gather from our recent correspondence that XFS cloning today requires 
O(data) time and effort, not O(delta).  Which is progress; we have a much 
better understanding of what's going on under the hood.

We understand that you're volunteers and that you're busy with many 
important matters.  We're not asking for any further work, though we'll 
surely applaud from the sidelines any improvements toward crash tolerance.

I've been thinking about alternative approaches to crash tolerance for 
over a decade.  In practice today people use things like relational 
databases and transactional key-value stores to protect application data 
integrity from crashes. I'm interested in other approaches, including but 
not limited to failure-atomic msync() and the moral equivalents thereof 
implemented with help from file systems.  I've worked on a half-dozen 
variants of this theme and I'd be happy to explain why I think this area 
is exciting to anyone willing to listen.  In a nutshell I look forward to 
the day when file systems render relational databases and transactional 
key-value stores obsolete for some (not all) use cases.

Thanks again for your extraordinary help clarifying matters, which goes 
above & beyond the call of duty, and happy holidays!

-- Terence



On Tue, 20 Dec 2022, Dave Chinner wrote:

>> I mainly want to emphasize that nobody is asking for the behavior of 
>> AdvFS in that FAST 2015 paper.
>
> OK, so what are you asking us to do, then?

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-12-21 23:08 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com>
2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
2022-12-14  1:46   ` Terence Kelly
2022-12-14  4:47   ` Suyash Mahar
2022-12-15  0:19     ` Dave Chinner
2022-12-16  1:06       ` Terence Kelly
2022-12-17 17:30         ` Mike Fleetwood
2022-12-17 18:43           ` Terence Kelly
2022-12-18  1:46         ` Dave Chinner
2022-12-18  4:47           ` Suyash Mahar
2022-12-20  3:06             ` Darrick J. Wong
2022-12-21 22:34               ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
2022-12-18 23:40           ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
2022-12-20  2:16             ` Dave Chinner
2022-12-21 23:07               ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox