* Re: XFS reflink overhead, ioctl(FICLONE) [not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com> @ 2022-12-13 17:18 ` Darrick J. Wong 2022-12-14 1:46 ` Terence Kelly 2022-12-14 4:47 ` Suyash Mahar 0 siblings, 2 replies; 14+ messages in thread From: Darrick J. Wong @ 2022-12-13 17:18 UTC (permalink / raw) To: Suyash Mahar; +Cc: linux-xfs, tpkelly, Suyash Mahar [ugh, your email never made it to the list. I bet the email security standards have been tightened again. <insert rant about dkim and dmarc silent failures here>] :( On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote: > Hi all! > > While using XFS's ioctl(FICLONE), we found that XFS seems to have > poor performance (ioctl takes milliseconds for sparse files) and the > overhead > increases with every call. > > For the demo, we are using an Optane DC-PMM configured as a > block device (fsdax) and running XFS (Linux v5.18.13). How are you using fsdax and reflink on a 5.18 kernel? That combination of features wasn't supported until 6.0, and the data corruption problems won't get fixed until a pull request that's about to happen for 6.2. > We create a 1 GiB dense file, then repeatedly modify a tiny random > fraction of it and make a clone via ioctl(FICLONE). Yay, random cow writes, that will slowly increase the number of space mapping records in the file metadata. > The time required for the ioctl() calls increases from large to insane > over the course of ~250 iterations: From roughly a millisecond for the > first iteration or two (which seems high, given that this is on > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20 > milliseconds (which seems crazy). Does the system call runtime increase with O(number_extents)? 
You might record the number of extents in the file you're cloning by running this periodically:

  xfs_io -c stat $path | grep fsxattr.nextents

FICLONE (at least on XFS) persists dirty pagecache data to disk, and then duplicates all written-space mapping records from the source file to the destination file. It skips preallocated mappings created with fallocate.

So yes, the plot is exactly what I was expecting.

--D

> The plot is attached to this email. > > A cursory look at the extent map suggests that it gets increasingly > complicated, resulting in the growing overhead. > > The enclosed tarball contains our code, our results, and some other info > like a flame graph that might shed light on where the ioctl is spending > its time. > > - Suyash & Terence

^ permalink raw reply [flat|nested] 14+ messages in thread
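Darrick's write-then-clone diagnosis can be reproduced from userspace. Below is a minimal sketch of the reported microbenchmark loop (the helper names are mine, not from the thread; the FICLONE request code is taken from <linux/fs.h>); it degrades gracefully to reporting "unsupported" on filesystems without reflinks:

```python
import errno, fcntl, os, random, time

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def dirty_random_blocks(path, n, size, blk=4096):
    """Overwrite n random 4 KiB blocks: the random COW writes that
    fragment the shared extent map a little more on every round."""
    with open(path, "r+b") as f:
        for _ in range(n):
            f.seek(random.randrange(size // blk) * blk)
            f.write(os.urandom(blk))

def timed_clone(src_path, dst_path):
    """Clone src into dst with ioctl(FICLONE) and return the elapsed
    seconds, or None if the filesystem does not support reflinks."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        t0 = time.perf_counter()
        fcntl.ioctl(dst, FICLONE, src)
        return time.perf_counter() - t0
    except OSError as e:
        if e.errno in (errno.EOPNOTSUPP, errno.EINVAL, errno.EXDEV, errno.ENOTTY):
            return None  # no reflink support on this filesystem
        raise
    finally:
        os.close(src)
        os.close(dst)
```

Run on an XFS mount, plotting timed_clone() against the iteration number should reproduce the roughly linear growth reported above.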
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong @ 2022-12-14 1:46 ` Terence Kelly 2022-12-14 4:47 ` Suyash Mahar 1 sibling, 0 replies; 14+ messages in thread From: Terence Kelly @ 2022-12-14 1:46 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Suyash Mahar, linux-xfs, Suyash Mahar Hi Darrick, Thanks for your quick and detailed reply. The thing that really puzzled me when I re-ran Suyash's experiments on a DRAM-backed file system is that the ioctl(FICLONE) calls were still very very slow. A slow block storage device can't be blamed, because there wasn't a slow block storage device anywhere in the picture; the slowness came from software. Suyash, can you send those results? -- Terence Kelly On Tue, 13 Dec 2022, Darrick J. Wong wrote: > FICLONE (at least on XFS) persists dirty pagecache data to disk, and > then duplicates all written-space mapping records from the source file > to the destination file. It skips preallocated mappings created with > fallocate. > > So yes, the plot is exactly what I was expecting. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong 2022-12-14 1:46 ` Terence Kelly @ 2022-12-14 4:47 ` Suyash Mahar 2022-12-15 0:19 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Suyash Mahar @ 2022-12-14 4:47 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs, tpkelly, Suyash Mahar

Hi Darrick, Thank you for the response. I have replied inline. -Suyash

On Tue, Dec 13, 2022 at 09:18, Darrick J. Wong <djwong@kernel.org> wrote: > > [ugh, your email never made it to the list. I bet the email security > standards have been tightened again. <insert rant about dkim and dmarc > silent failures here>] :( > > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote: > > Hi all! > > > > While using XFS's ioctl(FICLONE), we found that XFS seems to have > > poor performance (ioctl takes milliseconds for sparse files) and the > > overhead > > increases with every call. > > > > For the demo, we are using an Optane DC-PMM configured as a > > block device (fsdax) and running XFS (Linux v5.18.13). > > How are you using fsdax and reflink on a 5.18 kernel? That combination > of features wasn't supported until 6.0, and the data corruption problems > won't get fixed until a pull request that's about to happen for 6.2.

We did not enable the dax option. The Optane DIMMs are configured to appear as a block device.

$ mount | grep xfs /dev/pmem0p4 on /mnt/pmem0p4 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Regardless of the block device (the plot includes results for Optane and RamFS), it seems like the ioctl(FICLONE) call is slow.

> > We create a 1 GiB dense file, then repeatedly modify a tiny random > > fraction of it and make a clone via ioctl(FICLONE). > > Yay, random cow writes, that will slowly increase the number of space > mapping records in the file metadata.
> > > The time required for the ioctl() calls increases from large to insane > > over the course of ~250 iterations: From roughly a millisecond for the > > first iteration or two (which seems high, given that this is on > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20 > > milliseconds (which seems crazy). > > Does the system call runtime increase with O(number_extents)? You might > record the number of extents in the file you're cloning by running this > periodically: > > xfs_io -c stat $path | grep fsxattr.nextents The extent count does increase linearly (just like the ioctl() call latency). I used the xfs_bmap tool, let me know if this is not the right way. If it is not, I'll update the microbenchmark to run xfs_io. > FICLONE (at least on XFS) persists dirty pagecache data to disk, and > then duplicates all written-space mapping records from the source file to > the destination file. It skips preallocated mappings created with > fallocate. > > So yes, the plot is exactly what I was expecting. > > --D > > > The plot is attached to this email. > > > > A cursory look at the extent map suggests that it gets increasingly > > complicated resulting in the complexity. > > > > The enclosed tarball contains our code, our results, and some other info > > like a flame graph that might shed light on where the ioctl is spending > > its time. > > > > - Suyash & Terence ^ permalink raw reply [flat|nested] 14+ messages in thread
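A middle ground between xfs_bmap (which walks and formats every extent) and the XFS-specific fsxattr query is the generic FIEMAP ioctl: with fm_extent_count left at zero the kernel returns only the mapped-extent count, without filling in any extent records. A sketch (the helper name is mine; struct layout per the kernel's fiemap documentation):

```python
import array, fcntl, os, struct

FS_IOC_FIEMAP = 0xC020660B  # _IOWR('f', 11, struct fiemap)

def extent_count(path):
    """Return the number of mapped extents in path via FIEMAP,
    or None if the filesystem doesn't support the ioctl."""
    # struct fiemap header: fm_start, fm_length (u64), then
    # fm_flags, fm_mapped_extents, fm_extent_count, fm_reserved (u32).
    # fm_extent_count == 0 asks only for the count, not the records.
    hdr = struct.pack("=QQLLLL", 0, 0xFFFFFFFFFFFFFFFF, 0, 0, 0, 0)
    buf = array.array("B", hdr)
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FS_IOC_FIEMAP, buf, True)
    except OSError:
        return None  # e.g. tmpfs: no FIEMAP support
    finally:
        os.close(fd)
    _, _, _, mapped, _, _ = struct.unpack("=QQLLLL", buf.tobytes())
    return mapped
```

Sampling extent_count() once per clone iteration gives the x-axis for the latency-vs-extents plot without the per-extent formatting cost of xfs_bmap.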
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-14 4:47 ` Suyash Mahar @ 2022-12-15 0:19 ` Dave Chinner 2022-12-16 1:06 ` Terence Kelly 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2022-12-15 0:19 UTC (permalink / raw) To: Suyash Mahar; +Cc: Darrick J. Wong, linux-xfs, tpkelly, Suyash Mahar On Tue, Dec 13, 2022 at 08:47:03PM -0800, Suyash Mahar wrote: > Hi Darrick, > > Thank you for the response. I have replied inline. > > -Suyash > > Le mar. 13 déc. 2022 à 09:18, Darrick J. Wong <djwong@kernel.org> a écrit : > > > > [ugh, your email never made it to the list. I bet the email security > > standards have been tightened again. <insert rant about dkim and dmarc > > silent failures here>] :( > > > > On Sat, Dec 10, 2022 at 09:28:36PM -0800, Suyash Mahar wrote: > > > Hi all! > > > > > > While using XFS's ioctl(FICLONE), we found that XFS seems to have > > > poor performance (ioctl takes milliseconds for sparse files) and the > > > overhead > > > increases with every call. > > > > > > For the demo, we are using an Optane DC-PMM configured as a > > > block device (fsdax) and running XFS (Linux v5.18.13). > > > > How are you using fsdax and reflink on a 5.18 kernel? That combination > > of features wasn't supported until 6.0, and the data corruption problems > > won't get fixed until a pull request that's about to happen for 6.2. > > We did not enable the dax option. The optane DIMMs are configured to > appear as a block device. > > $ mount | grep xfs > /dev/pmem0p4 on /mnt/pmem0p4 type xfs > (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota) > > Regardless of the block device (the plot includes results for optane > and RamFS), it seems like the ioctl(FICLONE) call is slow. Please define "slow" - is it actually slower than it should be (i.e. a bug) or does it simply not perform according to your expectations? A few things that you can quantify to answer these questions. 1. What is the actual rate it is cloning extents at? i.e. 
extent count / clone time? Is this rate consistent/sustained, or is it dropping substantially over time and/or with increasing extent count?

2. How does clone speed of a given file compare to the actual data copy speed of that file (please include fsync time in the data copy results)? Is cloning faster or slower than copying the data? What is the extent count of the file at the cross-over point where cloning goes from being faster to slower than copying the data?

3. How does it compare with btrfs running the same write/clone workload? Does btrfs run faster? Does it perform better with high extent counts than XFS? What about with high sharing counts (e.g. after 500 or 1000 clones of the source file)?

Basically, I'm trying to understand what "slow" means in the context of the operations you are performing. I haven't seen any recent performance regressions in clone speed on XFS, so I'm trying to understand what you are seeing and why you think it is slower than it should be.

> > > We create a 1 GiB dense file, then repeatedly modify a tiny random > > > fraction of it and make a clone via ioctl(FICLONE). > > > > Yay, random cow writes, that will slowly increase the number of space > > mapping records in the file metadata.

Yup, the scripts I use do exactly this - 10,000 random 4kB writes to an 8GB file between reflink clones. I then iterate a few thousand times and measure the reflink time.

> > > The time required for the ioctl() calls increases from large to insane > > > over the course of ~250 iterations: From roughly a millisecond for the > > > first iteration or two (which seems high, given that this is on > > > Optane and the code doesn't fsync or msync anywhere at all, ever) to 20 > > > milliseconds (which seems crazy). > > > > Does the system call runtime increase with O(number_extents)?
You might > > record the number of extents in the file you're cloning by running this > > periodically: > > > > xfs_io -c stat $path | grep fsxattr.nextents > > The extent count does increase linearly (just like the ioctl() call latency).

As expected. Changing the sharing state of a single extent has a roughly constant overhead regardless of the number of extents in the file. Hence clone time should scale linearly with the number of extents that need to have their shared state modified.

> I used the xfs_bmap tool, let me know if this is not the right way. If > it is not, I'll update the microbenchmark to run xfs_io.

xfs_bmap is the slow way - it has to iterate every extent and format it out to userspace. The above mechanism just does a single syscall to query the count of extents from the inode. Using the fsxattr extent count query is much faster, especially when you have files with tens of millions of extents in them....

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
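One of Dave's suggested measurements is comparing clone time against a full data copy, fsync included. A minimal sketch of that copy baseline (the helper name is mine): timing it alongside ioctl(FICLONE) on the same source file at increasing extent counts locates the cross-over point he asks about.

```python
import os, shutil, time

def copy_with_fsync(src, dst, bufsize=1 << 20):
    """Full data copy of src to dst followed by fsync: the baseline
    that a reflink clone must beat for FICLONE to be the faster choice."""
    t0 = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, bufsize)
        fout.flush()
        os.fsync(fout.fileno())  # include persistence cost, per Dave's note
    return time.perf_counter() - t0
```

Unlike the clone, this cost scales with file size rather than extent count, which is exactly why the two curves should cross somewhere.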
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-15 0:19 ` Dave Chinner @ 2022-12-16 1:06 ` Terence Kelly 2022-12-17 17:30 ` Mike Fleetwood 2022-12-18 1:46 ` Dave Chinner 0 siblings, 2 replies; 14+ messages in thread From: Terence Kelly @ 2022-12-16 1:06 UTC (permalink / raw) To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar Hi Dave, Thanks for your quick and detailed reply. More inline.... On Thu, 15 Dec 2022, Dave Chinner wrote: >> Regardless of the block device (the plot includes results for optane >> and RamFS), it seems like the ioctl(FICLONE) call is slow. > > Please define "slow" - is it actually slower than it should be (i.e. a > bug) or does it simply not perform according to your expectations? I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took *milli*seconds right from the start, and grew to *tens* of milliseconds. There's no slow block storage device to increase latency; all of the latency is due to software. I was expecting microseconds of latency with DRAM underneath. Performance matters because cloning is an excellent crash-tolerance mechanism. Applications that maintain persistent state in files --- that's a huge number of applications --- can make clones of said files and recover from crashes by reverting to the most recent successful clone. In many situations this is much easier and better than shoe-horning application data into something like an ACID-transactional relational database or transactional key-value store. But the run-time cost of making a clone during failure-free operation can't be excessive. Cloning for crash tolerance usually requires durable media beneath the file system (HDD or SSD, not DRAM), so performance on block storage devices matters too. We measured performance of cloning atop DRAM to understand how much latency is due to block storage hardware vs. software alone. My colleagues and I started working on clone-based crash tolerance mechanisms nearly a decade ago. 
Extensive experience with cloning and related mechanisms in the HP Advanced File System (AdvFS), a Linux port of the DEC Tru64 file system, taught me to expect cloning to be *faster* than alternatives for crash tolerance: https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf https://web.eecs.umich.edu/~tpkelly/papers/HPL-2015-103.pdf The point I'm trying to make is: I'm a serious customer who loves cloning and my performance expectations aren't based on idle speculation but on experience with other cloning implementations. (AdvFS is not open source and I'm no longer an HP employee, so I no longer have access to it.) More recently I torture-tested XFS cloning as a crash-tolerance mechanism by subjecting it to real whole-system power interruptions: https://dl.acm.org/doi/pdf/10.1145/3400899.3400902 I performed these correctness tests before making any performance measurements because I don't care how fast a mechanism is if it doesn't correctly tolerate crashes. XFS passed the power-fail tests with flying colors. Now it's time to consider performance. I'm surprised that in XFS, cloning alone *without* fsync() pushes data down to storage. I would have expected that the implementation of cloning would always operate upon memory alone, and that an explicit fsync() would be required to force data down to durable media. Analogy: write() doesn't modify storage; write() plus fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't similar? Finally I understand your explanation that the cost of cloning is proportional to the size of the extent map, and that in the limit where the extent map is very large, cloning a file of size N requires O(N) time. However the constant factors surprise me. If memory serves we were seeing latencies of milliseconds atop DRAM for the first few clones on files that began as sparse files and had only a few blocks written to them. 
Copying the extent map on a DRAM file system must be tantamount to a bunch of memcpy() calls (right?), and I'm surprised that the volume of data that must be memcpy'd is so large that it takes milliseconds. We might be able to take some of the additional measurements you suggested during/after the holidays. Thanks again. > A few things that you can quantify to answer these questions. > > ... ^ permalink raw reply [flat|nested] 14+ messages in thread
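The clone-based crash-tolerance pattern Terence describes, snapshot the state file after each successful update and revert to the last snapshot after a crash, can be sketched as follows (helper names are mine, not from the thread; the sketch falls back to a plain copy where FICLONE is unsupported, and publishes each checkpoint with fsync plus an atomic rename):

```python
import errno, fcntl, os

FICLONE = 0x40049409  # from <linux/fs.h>

def checkpoint(state_path, ckpt_path):
    """Snapshot state_path to ckpt_path, by reflink where possible."""
    tmp = ckpt_path + ".tmp"
    with open(state_path, "rb") as src, open(tmp, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError as e:
            if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL,
                               errno.ENOTTY, errno.EXDEV):
                raise
            dst.write(src.read())      # fallback: plain data copy
        dst.flush()
        os.fsync(dst.fileno())         # make the checkpoint durable
    os.replace(tmp, ckpt_path)         # atomically publish it

def recover(state_path, ckpt_path):
    """After a crash, revert working state to the last good checkpoint."""
    if os.path.exists(ckpt_path):
        checkpoint(ckpt_path, state_path)  # clone back the other way
```

A production version would also fsync the containing directory after os.replace() so the rename itself is durable.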
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-16 1:06 ` Terence Kelly @ 2022-12-17 17:30 ` Mike Fleetwood 2022-12-17 18:43 ` Terence Kelly 2022-12-18 1:46 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Mike Fleetwood @ 2022-12-17 17:30 UTC (permalink / raw) To: Terence Kelly Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote: > (AdvFS is not open source > and I'm no longer an HP employee, so I no longer have access to it.) Just to put the record straight, HP did (abandon and) open source AdvFS in June 2008. https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html It's available under a GPLv2 license from https://advfs.sourceforge.net/ Mike ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-17 17:30 ` Mike Fleetwood @ 2022-12-17 18:43 ` Terence Kelly 0 siblings, 0 replies; 14+ messages in thread From: Terence Kelly @ 2022-12-17 18:43 UTC (permalink / raw) To: Mike Fleetwood Cc: Dave Chinner, Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar It's confusing. My FAST '15 paper was co-authored with AdvFS developers from the HP Storage Division. The paper mentions the open-source release of AdvFS. There's not a lot of recent activity on open-source AdvFS: https://sourceforge.net/p/advfs/discussion/ One thing is certain, however: HP did not "abandon" AdvFS in 2008. At the time of my FAST paper it was used under the hood in HP products and was being actively developed internally. See Section 3 of the FAST paper. The whole point of the paper is to describe a new (internal-only) AdvFS feature. I'm pretty sure (relying on memory) that the changes to AdvFS made by HP between 2008 and 2015 did not find their way into the open-source release. On Sat, 17 Dec 2022, Mike Fleetwood wrote: > On Fri, 16 Dec 2022 at 01:06, Terence Kelly <tpkelly@eecs.umich.edu> wrote: >> (AdvFS is not open source and I'm no longer an HP employee, so I no >> longer have access to it.) > > Just to put the record straight, HP did (abandon and) open source AdvFS > in June 2008. > https://www.hp.com/hpinfo/newsroom/press/2008/080623a.html > > It's available under a GPLv2 license from > https://advfs.sourceforge.net/ > > Mike > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-16 1:06 ` Terence Kelly 2022-12-17 17:30 ` Mike Fleetwood @ 2022-12-18 1:46 ` Dave Chinner 2022-12-18 4:47 ` Suyash Mahar 2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly 1 sibling, 2 replies; 14+ messages in thread From: Dave Chinner @ 2022-12-18 1:46 UTC (permalink / raw) To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote: > > Hi Dave, > > Thanks for your quick and detailed reply. More inline.... > > On Thu, 15 Dec 2022, Dave Chinner wrote: > > > > Regardless of the block device (the plot includes results for optane > > > and RamFS), it seems like the ioctl(FICLONE) call is slow. > > > > Please define "slow" - is it actually slower than it should be (i.e. a > > bug) or does it simply not perform according to your expectations? > > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took > *milli*seconds right from the start, and grew to *tens* of milliseconds. > There's no slow block storage device to increase latency; all of the latency > is due to software. I was expecting microseconds of latency with DRAM > underneath. Ah - slower than expectations then, and you have unrealistic expectations about how "fast" DRAM is. From a storage engineer's perspective, DRAM is slow compared to nvme based flash storage - DRAM has better access latency, but on all other aspects of storage performance and capability, it falls way behind pcie attached storage because the *CPU time* is the limiting factor in storage performance these days, not storage device speed. The problem with DRAM based storage (and DAX in general) is that data movement is run by the CPU - it's synchronous storage. Filesystems like XFS are built around highly concurrent pipelined asynchronous IO hardware. 
Filesystems are capable of keeping thousands of IOs in flight *per CPU*, but on synchronous storage like DRAM we can only have *1 IO per CPU* in flight at any given time. Hence when we compare synchronous write performance, DRAM is fast compared to SSDs. When we use async IO (AIO+DIO or io_uring), the numbers go the other way and SSDs come out further in front the more of them you attach to the system. DRAM based IO doesn't get any faster because it still can only process one IO at a time, whilst *each SSD* can process 100+ IOs at a time. IOWs, for normal block based storage we only use the CPU to marshall the data movement in the system, and the hardware takes care of the data movement. i.e. DMA-based storage devices are a hardware offload mechanism. DRAM based storage relies on the CPU to move data, and so we use all the time that the CPU could be sending IO to the hardware to move data in DRAM from A to B. Put simply: DRAM can only be considered fast if your application does (or is optimised for) synchronous IO. For all other uses, DRAM based storage is a poor choice. > Performance matters because cloning is an excellent crash-tolerance > mechanism. Guaranteeing filesystem and data integrity is our primary focus when building infrastructure that can be used for crash-tolerance mechanisms... > Applications that maintain persistent state in files --- that's > a huge number of applications --- can make clones of said files and recover > from crashes by reverting to the most recent successful clone. ... and that's the data integrity guarantee that the filesystem *must* provide the application. > In many > situations this is much easier and better than shoe-horning application data > into something like an ACID-transactional relational database or > transactional key-value store. Of course. But that doesn't mean the need for ACID-transactional database functionality goes away. We've just moved that functionality into the filesystem to implement FICLONE functionality. 
> But the run-time cost of making a clone > during failure-free operation can't be excessive.

Define "excessive".

Our design constraints were that FICLONE had to be faster than copying the data, and needed to have fixed cost per shared extent reference modification or better so that it could scale to millions of extents without bringing the filesystem, storage and/or system to its knees when someone tried to do that.

Remember - extent sharing and clones were retrofitted to XFS 20 years after it was designed. We had to make lots of compromises just to make it work correctly, let alone achieve the performance requirements we set a decade ago.

> Cloning for crash > tolerance usually requires durable media beneath the file system (HDD or > SSD, not DRAM), so performance on block storage devices matters too. We > measured performance of cloning atop DRAM to understand how much latency is > due to block storage hardware vs. software alone.

Cloning is a CPU intensive operation, not an IO intensive operation. What you are measuring is *entirely* the CPU overhead of doing all the transactions and cross-referencing needed to track extent sharing in a manner that is crash consistent, atomic and fully recoverable.

> My colleagues and I started working on clone-based crash tolerance > mechanisms nearly a decade ago. Extensive experience with cloning and > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of > the DEC Tru64 file system, taught me to expect cloning to be *faster* than > alternatives for crash tolerance:

Cloning files on XFS and btrfs is still much faster than the existing safe overwrite mechanism of {create a whole new data copy, fsync, rename, fsync}. So I'm not sure what you're actually complaining about here.

> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

Ah, now I get it. You want *anonymous ephemeral clones*, not named persistent clones.
For everyone else, so they don't have to read the paper and try to work it out:

The mechanism hacks the O_ATOMIC path to instantiate a whole new cloned inode which is linked into a hidden namespace in the filesystem, so the user can't see it but it is present after a crash. It doesn't track the cloned extents in a persistent index; the hidden file simply shares the same block map on disk and the sharing is tracked in memory. After a crash, nothing is done with this until the original file is instantiated in memory. At this point, the hidden clone file(s) are then accessed, the shared state is recovered in memory, and decisions are made about which contains the most recent data.

The clone is only present while the fd returned by the open(O_ATOMIC) is valid. On close(), the clone is deleted and all the in-memory and hidden on-disk state is torn down. Effectively, the close() operation becomes an unlink().

Further, a new syscall (called syncv()) is added that takes a vector of these O_ATOMIC cloned file descriptors. This syscall forces the filesystem to make the inode -metadata- persistent without requiring data modifications to be persistent. This allows the ephemeral clones to be persisted without requiring the data in the original file to be written to disk. At this point, we have a hidden clone with a matching block map that can be used for crash recovery purposes.

This clone mechanism in advfs is limited by journal size - 256 clones per 128MB journal space due to reservation space needed for clone deletes.

----

So my reading of this paper is that the "file clone operation" essentially creates an ephemeral clone rather than persistent named clones. I think they are more equivalent to ephemeral tmp files than FICLONE.
That is, we use open(O_TMPFILE) to create an ephemeral temporary file attached to a file descriptor instead of requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then unlinking it and holding the fd open, or relying on /tmp being volatile or cleaned at boot to remove tmpfiles on crash.

Hence the difference in functionality is that FICLONE provides persistent, unrestricted named clones rather than ephemeral clones. We could implement ephemeral clones in XFS, but nobody has ever mentioned needing or wanting such functionality until this thread. Darrick already has patches to provide an internal hidden persistent namespace for XFS filesystems, we could add a new O_CLONE open flag that provides ephemeral clone behaviour, we could add a flag to the inode to indicate it has ephemeral clones that need recovery on next access, add in-memory tracking of ephemeral shared extents to trigger COW instead of overwrite in place, etc. It's just a matter of time and resources.

If you've got resources available to implement this, I can find the time to help design and integrate it into the VFS and XFS....

> The point I'm trying to make is: I'm a serious customer who loves cloning > and my performance expectations aren't based on idle speculation but on > experience with other cloning implementations. (AdvFS is not open source > and I'm no longer an HP employee, so I no longer have access to it.) > > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by > subjecting it to real whole-system power interruptions: > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902

Heh. You're still using hardware to do filesystem power fail testing? We moved away from needing hardware to do power fail testing of filesystems several years ago. Using functionality like dm-logwrites, we can simulate the effect of several hundred different power fail cases with write-by-write replay and recovery in the space of a couple of minutes.
Not only that, failures are fully replayable and so we can actually debug every single individual failure without having to guess at the runtime context that created the failure or the recovery context that exposed the failure. This infrastructure has provided us with a massive step forward for improving crash resilience and recovery capability in ext4, btrfs and XFS. These tests are built into automated test suites (e.g. fstests) that pretty much all linux fs engineers and distro QE teams run these days.

IOWs, hardware based power fail testing of filesystems is largely obsolete these days....

> I'm surprised that in XFS, cloning alone *without* fsync() pushes data down > to storage. I would have expected that the implementation of cloning would > always operate upon memory alone, and that an explicit fsync() would be > required to force data down to durable media. Analogy: write() doesn't > modify storage; write() plus fsync() does. Is there a reason why copying > via ioctl(FICLONE) isn't similar?

Because FICLONE provides a persistent named clone that is a fully functioning file in its own right. That means it has to be completely independent of the source file by the time the FICLONE operation completes. This implies that there is a certain order to the operations the clone performs - the data has to be on disk before the clone is made persistent and recoverable, so that both files are guaranteed to have identical contents if we crash immediately after the clone completes.

> Finally I understand your explanation that the cost of cloning is > proportional to the size of the extent map, and that in the limit where the > extent map is very large, cloning a file of size N requires O(N) time. > However the constant factors surprise me. If memory serves we were seeing > latencies of milliseconds atop DRAM for the first few clones on files that > began as sparse files and had only a few blocks written to them.
Copying > the extent map on a DRAM file system must be tantamount to a bunch of > memcpy() calls (right?),

At the IO layer, yes, it's just a memcpy. But we can't just copy a million extents from one in-memory btree to another. We have to modify the filesystem metadata in an atomic, transactional, recoverable way. Those transactions work one extent at a time because each extent might require a different set of modifications.

Persistent clones require tracking of the number of times a given block on disk is shared so that we know when extent removals result in the extent no longer being shared and/or referenced. A file that has been cloned a million times might have a million extents each shared a different number of times. When we remove one of those clones, how do we know which blocks are now unreferenced and need to be freed?

IOWs, named persistent clones are *much more complex* than ephemeral clones. The overhead you are measuring is the result of all the persistent cross referencing and reference counting metadata we need to atomically update on each extent sharing operation to ensure long term persistent clones work correctly.

If we were to implement ephemeral clones as per the mechanism you've outlined in the papers above, then we could just copy the in-memory extent list btree with a series of memcpy() operations because we don't need persistent on-disk shared reference counting to implement it....

Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
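Dave's point about per-extent reference counting can be illustrated with a toy in-memory model (emphatically not XFS code): cloning must visit every extent to bump its share count, and deleting a clone must visit every extent to decide which blocks are actually freed, hence the O(nextents) clone cost.

```python
from collections import Counter

class ToyReflinkFS:
    """Toy model of persistent shared-extent refcounting."""
    def __init__(self):
        self.refcount = Counter()   # physical extent -> number of owners
        self.files = {}             # file name -> list of physical extents

    def create(self, name, extents):
        self.files[name] = list(extents)
        for ext in extents:
            self.refcount[ext] += 1

    def clone(self, src, dst):
        # One bookkeeping update per extent: this is the O(nextents) work.
        self.files[dst] = list(self.files[src])
        for ext in self.files[src]:
            self.refcount[ext] += 1

    def delete(self, name):
        freed = []
        for ext in self.files.pop(name):
            self.refcount[ext] -= 1
            if self.refcount[ext] == 0:   # last owner: block really freed
                del self.refcount[ext]
                freed.append(ext)
        return freed
```

In the real filesystem each of those per-extent updates is an atomic, logged metadata change against on-disk refcount structures, which is the CPU overhead Dave describes.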
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-18 1:46 ` Dave Chinner @ 2022-12-18 4:47 ` Suyash Mahar 2022-12-20 3:06 ` Darrick J. Wong 2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly 1 sibling, 1 reply; 14+ messages in thread From: Suyash Mahar @ 2022-12-18 4:47 UTC (permalink / raw) To: Dave Chinner; +Cc: Terence Kelly, Darrick J. Wong, linux-xfs, Suyash Mahar

Thank you for the detailed response. This does confirm some of our observations that the overhead is mainly from the software layer. We did see better performance from optimizations in the transaction code moving from kernel v5.4 to v5.18. -Suyash

On Sat, Dec 17, 2022 at 17:46, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote: > > > > Hi Dave, > > > > Thanks for your quick and detailed reply. More inline.... > > > > On Thu, 15 Dec 2022, Dave Chinner wrote: > > > > > > Regardless of the block device (the plot includes results for optane > > > > and RamFS), it seems like the ioctl(FICLONE) call is slow. > > > > > > Please define "slow" - is it actually slower than it should be (i.e. a > > > bug) or does it simply not perform according to your expectations? > > > > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took > > *milli*seconds right from the start, and grew to *tens* of milliseconds. > > There's no slow block storage device to increase latency; all of the latency > > is due to software. I was expecting microseconds of latency with DRAM > > underneath. > > Ah - slower than expectations then, and you have unrealistic > expectations about how "fast" DRAM is.
> > From a storage engineer's perspective, DRAM is slow compared to nvme > based flash storage - DRAM has better access latency, but on all > other aspects of storage performance and capability, it falls way > behind pcie attached storage because the *CPU time* is the limiting > factor in storage performance these days, not storage device speed. > > The problem with DRAM based storage (and DAX in general) is that > data movement is run by the CPU - it's synchronous storage. > Filesystems like XFS are built around highly concurrent pipelined > asynchronous IO hardware. Filesystems are capable of keeping > thousands of IOs in flight *per CPU*, but on synchronous storage > like DRAM we can only have *1 IO per CPU* in flight at any given > time. > > Hence when we compare synchronous write performance, DRAM is fast > compared to SSDs. When we use async IO (AIO+DIO or io_uring), the > numbers go the other way and SSDs come out further in front the more > of them you attach to the system. DRAM based IO doesn't get any > faster because it still can only process one IO at a time, whilst > *each SSD* can process 100+ IOs at a time. > > IOWs, for normal block based storage we only use the CPU to marshal > the data movement in the system, and the hardware takes care of the > data movement. i.e. DMA-based storage devices are a hardware offload > mechanism. DRAM based storage relies on the CPU to move data, and so > we use all the time that the CPU could be sending IO to the hardware > to move data in DRAM from A to B. > > > Put simply: DRAM can only be considered fast if your application > does (or is optimised for) synchronous IO. For all other uses, DRAM > based storage is a poor choice. > > > Performance matters because cloning is an excellent crash-tolerance > > mechanism. > > Guaranteeing filesystem and data integrity is our primary focus when > building infrastructure that can be used for crash-tolerance > mechanisms... 
> > > Applications that maintain persistent state in files --- that's > > a huge number of applications --- can make clones of said files and recover > > from crashes by reverting to the most recent successful clone. > > ... and that's the data integrity guarantee that the filesystem > *must* provide the application. > > > In many > > situations this is much easier and better than shoe-horning application data > > into something like an ACID-transactional relational database or > > transactional key-value store. > > Of course. But that doesn't mean the need for ACID-transactional > database functionality goes away. We've just moved that > functionality into the filesystem to implement FICLONE > functionality. > > > But the run-time cost of making a clone > > during failure-free operation can't be excessive. > > Define "excessive". > > Our design constraints were that FICLONE had to be faster than > copying the data, and needed to have fixed cost per shared extent > reference modification or better so that it could scale to millions > of extents without bringing the filesystem, storage and/or system > to its knees when someone tried to do that. > > Remember - extent sharing and clones were retrofitted to XFS 20 > years after it was designed. We had to make lots of compromises > just to make it work correctly, let alone achieve the performance > requirements we set a decade ago. > > > Cloning for crash > > tolerance usually requires durable media beneath the file system (HDD or > > SSD, not DRAM), so performance on block storage devices matters too. We > > measured performance of cloning atop DRAM to understand how much latency is > > due to block storage hardware vs. software alone. > > Cloning is a CPU intensive operation, not an IO intensive operation. > What you are measuring is *entirely* the CPU overhead of doing all > the transactions and cross-referencing needed to track extent > sharing in a manner that is crash consistent, atomic and fully > recoverable. 
> > My colleagues and I started working on clone-based crash tolerance > > mechanisms nearly a decade ago. Extensive experience with cloning and > > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of > > the DEC Tru64 file system, taught me to expect cloning to be *faster* than > > alternatives for crash tolerance: > > Cloning files on XFS and btrfs is still much faster than the > existing safe overwrite mechanism of {create a whole new data copy, > fsync, rename, fsync}. So I'm not sure what you're actually > complaining about here. > > > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf > > Ah, now I get it. You want *anonymous ephemeral clones*, not named > persistent clones. For everyone else, so they don't have to read > the paper and try to work it out: > > The mechanism is a hacked O_ATOMIC path to instantiate a whole > new cloned inode which is linked into a hidden namespace in the > filesystem so the user can't see it but it is still present after a > crash. > > It doesn't track the cloned extents in a persistent index, the > hidden file simply shares the same block map on disk and the sharing > is tracked in memory. After a crash, nothing is done with this until > the original file is instantiated in memory. At this point, the > hidden clone file(s) are then accessed and the shared state is > recovered in memory and decisions are made about which contains the > most recent data. > > The clone is only present while the fd returned by the > open(O_ATOMIC) is valid. On close(), the clone is deleted and all > the in-memory and hidden on-disk state is torn down. Effectively, > the close() operation becomes an unlink(). > > Further, a new syscall (called syncv()) that takes a vector of these > O_ATOMIC cloned file descriptors is added. 
This syscall forces the > filesystem to make the inode -metadata- persistent without requiring > data modifications to be persistent. This allows the ephemeral > clones to be persisted without requiring the data in the original > file to be written to disk. At this point, we have a hidden clone > with a matching block map that can be used for crash recovery > purposes. > > This clone mechanism in advfs is limited by journal size - 256 > clones per 128MB journal space due to reservation space needed for > clone deletes. > > ---- > > So my reading of this paper is that the "file clone operation" > essentially creates an ephemeral clone rather than a persistent named > clone. I think they are more equivalent to ephemeral tmp files > than FICLONE. That is, we use open(O_TMPFILE) to create an > ephemeral temporary file attached to a file descriptor instead of > requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then > unlinking it and holding the fd open or relying on /tmp being > volatile or cleaned at boot to remove tmpfiles on crash. > > Hence the difference in functionality is that FICLONE provides > persistent, unrestricted named clones rather than ephemeral clones. > > > We could implement ephemeral clones in XFS, but nobody has ever > mentioned needing or wanting such functionality until this thread. > Darrick already has patches to provide an internal hidden > persistent namespace for XFS filesystems, we could add a new O_CLONE > open flag that provides ephemeral clone behaviour, we could add a > flag to the inode to indicate it has ephemeral clones that need > recovery on next access, add in-memory tracking of ephemeral shared extents to trigger > COW instead of overwrite in place, etc. It's just a matter of time > and resources. > > If you've got resources available to implement this, I can find the > time to help design and integrate it into the VFS and XFS.... 
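The open(O_TMPFILE) pattern Dave contrasts with the create-then-unlink dance can be sketched as follows (a hedged sketch: os.O_TMPFILE is Linux-only and not every filesystem supports it, hence the portable fallback):

```python
import os
import tempfile

def ephemeral_file(directory=None):
    """Open an anonymous file: it has no name and vanishes on close/crash."""
    directory = directory or tempfile.gettempdir()
    try:
        # Linux O_TMPFILE: the inode never appears in the namespace at all.
        return os.open(directory, os.O_TMPFILE | os.O_RDWR, 0o600)
    except (AttributeError, OSError):
        # Portable fallback: create a named file and immediately unlink it;
        # the data lives on only while an fd remains open.
        fd, path = tempfile.mkstemp(dir=directory)
        os.unlink(path)
        return fd

fd = ephemeral_file()
os.write(fd, b"scratch data")
os.lseek(fd, 0, os.SEEK_SET)
assert os.read(fd, 12) == b"scratch data"
os.close(fd)  # last reference dropped: the storage is reclaimed
```

Either way the file is gone after a crash without any cleanup pass, which is the property that makes the pattern a natural fit for ephemeral clones.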
> > The point I'm trying to make is: I'm a serious customer who loves cloning > > and my performance expectations aren't based on idle speculation but on > > experience with other cloning implementations. (AdvFS is not open source > > and I'm no longer an HP employee, so I no longer have access to it.) > > > > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by > > subjecting it to real whole-system power interruptions: > > > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902 > > Heh. You're still using hardware to do filesystem power fail > testing? We moved away from needing hardware to do power fail > testing of filesystems several years ago. > > Using functionality like dm-logwrites, we can simulate the effect of > several hundred different power fail cases with write-by-write > replay and recovery in the space of a couple of minutes. > > Not only that, failures are fully replayable and so we can actually > debug every single individual failure without having to guess at the > runtime context that created the failure or the recovery context > that exposed the failure. > > This infrastructure has provided us with a massive step forward for > improving crash resilience and recovery capability in ext4, btrfs and > XFS. These tests are built into automated test suites (e.g. > fstests) that pretty much all linux fs engineers and distro QE teams > run these days. > > IOWs, hardware based power fail testing of filesystems is largely > obsolete these days.... > > > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down > > to storage. I would have expected that the implementation of cloning would > > always operate upon memory alone, and that an explicit fsync() would be > > required to force data down to durable media. Analogy: write() doesn't > > modify storage; write() plus fsync() does. 
Is there a reason why copying > > via ioctl(FICLONE) isn't similar? > > Because FICLONE provides a persistent named clone that is a fully > functioning file in its own right. That means it has to be > completely independent of the source file by the time the FICLONE > operation completes. This implies that there is a certain order to > the operations the clone performs - the data has to be on disk > before the clone is made persistent and recoverable so that both > files are guaranteed to have identical contents if we crash > immediately after the clone completes. > > > Finally I understand your explanation that the cost of cloning is > > proportional to the size of the extent map, and that in the limit where the > > extent map is very large, cloning a file of size N requires O(N) time. > > However the constant factors surprise me. If memory serves we were seeing > > latencies of milliseconds atop DRAM for the first few clones on files that > > began as sparse files and had only a few blocks written to them. Copying > > the extent map on a DRAM file system must be tantamount to a bunch of > > memcpy() calls (right?), > > At the IO layer, yes, it's just a memcpy. > > But we can't just copy a million extents from one in-memory btree to > another. We have to modify the filesystem metadata in an atomic, > transactional, recoverable way. Those transactions work one extent > at a time because each extent might require a different set of > modifications. Persistent clones require tracking of the number of > times a given block on disk is shared so that we know when extent > removals result in the extent no longer being shared and/or > referenced. A file that has been cloned a million times might have > a million extents each shared a different number of times. When we > remove one of those clones, how do we know which blocks are now > unreferenced and need to be freed? > > IOWs, named persistent clones are *much more complex* than ephemeral > clones. 
The overhead you are measuring is the result of all the > persistent cross referencing and reference counting metadata we need > to atomically update on each extent sharing operation to ensure > long-term persistent clones work correctly. > > If we were to implement ephemeral clones as per the mechanism you've > outlined in the papers above, then we could just copy the in-memory > extent list btree with a series of memcpy() operations because we > don't need persistent on-disk shared reference counting to implement > it.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
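For reference, the "safe overwrite" baseline Dave compares FICLONE against ({create a whole new data copy, fsync, rename, fsync}) looks roughly like this in user space (a sketch of the well-known pattern, not code from any filesystem):

```python
import os
import tempfile

def safe_replace(path, new_data):
    """Replace path's contents so a crash leaves old or new, never a mix."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_data)
        os.fsync(fd)        # new data durable before it gets the real name
    finally:
        os.close(fd)
    os.rename(tmp, path)    # atomic swap of the name
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)     # make the rename itself durable
    finally:
        os.close(dirfd)

workdir = tempfile.mkdtemp()
state = os.path.join(workdir, "state.db")
with open(state, "wb") as f:
    f.write(b"old contents")
safe_replace(state, b"new contents")
assert open(state, "rb").read() == b"new contents"
```

FICLONE beats this pattern because it shares extents instead of rewriting all N bytes of data; the complaint in this thread is only about the constant factor of that metadata work.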
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-18 4:47 ` Suyash Mahar @ 2022-12-20 3:06 ` Darrick J. Wong 2022-12-21 22:34 ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly 0 siblings, 1 reply; 14+ messages in thread From: Darrick J. Wong @ 2022-12-20 3:06 UTC (permalink / raw) To: Suyash Mahar; +Cc: Dave Chinner, Terence Kelly, linux-xfs, Suyash Mahar On Sat, Dec 17, 2022 at 08:47:45PM -0800, Suyash Mahar wrote: > Thank you for the detailed response. This does confirm some of our > observations that the overhead is mainly from the software layer. We > did see better performance from optimizations in the transaction code > when moving from kernel v5.4 to v5.18. > > -Suyash > > On Sat, Dec 17, 2022 at 17:46, Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Dec 15, 2022 at 08:06:18PM -0500, Terence Kelly wrote: > > > > > > Hi Dave, > > > > > > Thanks for your quick and detailed reply. More inline.... > > > > > > On Thu, 15 Dec 2022, Dave Chinner wrote: > > > > > > > > Regardless of the block device (the plot includes results for optane > > > > > and RamFS), it seems like the ioctl(FICLONE) call is slow. > > > > > > > > Please define "slow" - is it actually slower than it should be (i.e. a > > > > bug) or does it simply not perform according to your expectations? > > > > > > I was surprised that on a DRAM-backed file system the ioctl(FICLONE) took > > > *milli*seconds right from the start, and grew to *tens* of milliseconds. > > > There's no slow block storage device to increase latency; all of the latency > > > is due to software. I was expecting microseconds of latency with DRAM > > > underneath. > > > > Ah - slower than expectations then, and you have unrealistic > > expectations about how "fast" DRAM is. 
> > > > From a storage engineer's perspective, DRAM is slow compared to nvme > > based flash storage - DRAM has better access latency, but on all > > other aspects of storage performance and capability, it falls way > > behind pcie attached storage because the *CPU time* is the limiting > > factor in storage performance these days, not storage device speed. > > > > The problem with DRAM based storage (and DAX in general) is that > > data movement is run by the CPU - it's synchronous storage. > > Filesystems like XFS are built around highly concurrent pipelined > > asynchronous IO hardware. Filesystems are capable of keeping > > thousands of IOs in flight *per CPU*, but on synchronous storage > > like DRAM we can only have *1 IO per CPU* in flight at any given > > time. > > > > Hence when we compare synchronous write performance, DRAM is fast > > compared to SSDs. When we use async IO (AIO+DIO or io_uring), the > > numbers go the other way and SSDs come out further in front the more > > of them you attach to the system. DRAM based IO doesn't get any > > faster because it still can only process one IO at a time, whilst > > *each SSD* can process 100+ IOs at a time. > > > > IOWs, for normal block based storage we only use the CPU to marshal > > the data movement in the system, and the hardware takes care of the > > data movement. i.e. DMA-based storage devices are a hardware offload > > mechanism. DRAM based storage relies on the CPU to move data, and so > > we use all the time that the CPU could be sending IO to the hardware > > to move data in DRAM from A to B. > > > > > > Put simply: DRAM can only be considered fast if your application > > does (or is optimised for) synchronous IO. For all other uses, DRAM > > based storage is a poor choice. Oh, it's worse than that -- since you're using 5.18 with reflink enabled, DAX will always yield to reflink. 
IOWs, the random writes are done to the pagecache, so the implied fdatasync in the FICLONE preparation also has to *copy* the dirty pagecache to the pmem. It would at least be interesting (a) to bump to 6.2, and (b) stuff an fsync(src_fd) call in before you start timing the FICLONE to see what proportion of the clone time was actually just pagecache maneuvers. > > > Performance matters because cloning is an excellent crash-tolerance > > > mechanism. > > > > Guaranteeing filesystem and data integrity is our primary focus when > > building infrastructure that can be used for crash-tolerance > > mechanisms... > > > > > Applications that maintain persistent state in files --- that's > > > a huge number of applications --- can make clones of said files and recover > > > from crashes by reverting to the most recent successful clone. > > > > ... and that's the data integrity guarantee that the filesystem > > *must* provide the application. > > > > > In many > > > situations this is much easier and better than shoe-horning application data > > > into something like an ACID-transactional relational database or > > > transactional key-value store. > > > > Of course. But that doesn't mean the need for ACID-transactional > > database functionality goes away. We've just moved that > > functionality into the filesystem to implement FICLONE > > functionality. > > > > > But the run-time cost of making a clone > > > during failure-free operation can't be excessive. > > > > Define "excessive". > > > > Our design constraints were that FICLONE had to be faster than > > copying the data, and needed to have fixed cost per shared extent > > reference modification or better so that it could scale to millions > > of extents without bringing the filesystem, storage and/or system > > to its knees when someone tried to do that. > > > > Remember - extent sharing and clones were retrofitted to XFS 20 > > years after it was designed. 
We had to make lots of compromises > > just to make it work correctly, let alone achieve the performance > > requirements we set a decade ago. > > > > > Cloning for crash > > > tolerance usually requires durable media beneath the file system (HDD or > > > SSD, not DRAM), so performance on block storage devices matters too. We > > > measured performance of cloning atop DRAM to understand how much latency is > > > due to block storage hardware vs. software alone. > > > > Cloning is a CPU intensive operation, not an IO intensive operation. > > What you are measuring is *entirely* the CPU overhead of doing all > > the transactions and cross-referencing needed to track extent > > sharing in a manner that is crash consistent, atomic and fully > > recoverable. > > > > > My colleagues and I started working on clone-based crash tolerance > > > mechanisms nearly a decade ago. Extensive experience with cloning and > > > related mechanisms in the HP Advanced File System (AdvFS), a Linux port of > > > the DEC Tru64 file system, taught me to expect cloning to be *faster* than > > > alternatives for crash tolerance: > > > > Cloning files on XFS and btrfs is still much faster than the > > existing safe overwrite mechanism of {create a whole new data copy, > > fsync, rename, fsync}. So I'm not sure what you're actually > > complaining about here. > > > > > > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf > > > > Ah, now I get it. You want *anonymous ephemeral clones*, not named > > persistent clones. For everyone else, so they don't have to read > > the paper and try to work it out: > > > > The mechanism is a hacked O_ATOMIC path to instantiate a whole > > new cloned inode which is linked into a hidden namespace in the > > filesystem so the user can't see it but it is still present after a > > crash. 
> > > > It doesn't track the cloned extents in a persistent index, the > > hidden file simply shares the same block map on disk and the sharing > > is tracked in memory. After a crash, nothing is done with this until > > the original file is instantiated in memory. At this point, the > > hidden clone file(s) are then accessed and the shared state is > > recovered in memory and decisions are made about which contains the > > most recent data. > > > > The clone is only present while the fd returned by the > > open(O_ATOMIC) is valid. On close(), the clone is deleted and all > > the in-memory and hidden on-disk state is torn down. Effectively, > > the close() operation becomes an unlink(). > > > > Further, a new syscall (called syncv()) that takes a vector of these > > O_ATOMIC cloned file descriptors is added. This syscall forces the > > filesystem to make the inode -metadata- persistent without requiring > > data modifications to be persistent. This allows the ephemeral > > clones to be persisted without requiring the data in the original > > file to be written to disk. At this point, we have a hidden clone > > with a matching block map that can be used for crash recovery > > purposes. > > > > This clone mechanism in advfs is limited by journal size - 256 > > clones per 128MB journal space due to reservation space needed for > > clone deletes. > > > > ---- > > > > So my reading of this paper is that the "file clone operation" > > essentially creates an ephemeral clone rather than a persistent named > > clone. I think they are more equivalent to ephemeral tmp files > > than FICLONE. That is, we use open(O_TMPFILE) to create an > > ephemeral temporary file attached to a file descriptor instead of > > requiring userspace to create a /tmp/tmp.xxxxxxxxx file and then > > unlinking it and holding the fd open or relying on /tmp being > > volatile or cleaned at boot to remove tmpfiles on crash. 
> > > > Hence the difference in functionality is that FICLONE provides > > persistent, unrestricted named clones rather than ephemeral clones. > > > > > > We could implement ephemeral clones in XFS, but nobody has ever > > mentioned needing or wanting such functionality until this thread. > > Darrick already has patches to provide an internal hidden > > persistent namespace for XFS filesystems, we could add a new O_CLONE > > open flag that provides ephemeral clone behaviour, we could add a > > flag to the inode to indicate it has ephemeral clones that need > > recovery on next access, add in-memory tracking of ephemeral shared extents to trigger > > COW instead of overwrite in place, etc. It's just a matter of time > > and resources. <cough> The bits needed for atomic file commits have been out for review on fsdevel since **before the COVID19 pandemic started**. It's buried in the middle of the online repair featureset. Summary of the usage model: fd = open(sourcefile...) tmp_fd = open(..., O_TMPFILE) ioctl(tmp_fd, FICLONE, fd); /* clone data to temporary file */ /* write whatever you want to the temporary file */ ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */ close(tmp_fd) True, this isn't an ephemeral file -- for such a thing, we could just duplicate the in-memory data fork and never commit it to disk. But that said, I've been trying to get the parts I /have/ built merged for three years. I'm planning to push the whole giant thing to the list on Thursday. --D > > If you've got resources available to implement this, I can find the > > time to help design and integrate it into the VFS and XFS.... > > > The point I'm trying to make is: I'm a serious customer who loves cloning > > > and my performance expectations aren't based on idle speculation but on > > > experience with other cloning implementations. (AdvFS is not open source > > > and I'm no longer an HP employee, so I no longer have access to it.) 
> > > > > More recently I torture-tested XFS cloning as a crash-tolerance mechanism by > > > subjecting it to real whole-system power interruptions: > > > > > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902 > > > > Heh. You're still using hardware to do filesystem power fail > > testing? We moved away from needing hardware to do power fail > > testing of filesystems several years ago. > > > > Using functionality like dm-logwrites, we can simulate the effect of > > several hundred different power fail cases with write-by-write > > replay and recovery in the space of a couple of minutes. > > > > Not only that, failures are fully replayable and so we can actually > > debug every single individual failure without having to guess at the > > runtime context that created the failure or the recovery context > > that exposed the failure. > > > > This infrastructure has provided us with a massive step forward for > > improving crash resilience and recovery capability in ext4, btrfs and > > XFS. These tests are built into automated test suites (e.g. > > fstests) that pretty much all linux fs engineers and distro QE teams > > run these days. > > > > IOWs, hardware based power fail testing of filesystems is largely > > obsolete these days.... > > > > > I'm surprised that in XFS, cloning alone *without* fsync() pushes data down > > > to storage. I would have expected that the implementation of cloning would > > > always operate upon memory alone, and that an explicit fsync() would be > > > required to force data down to durable media. Analogy: write() doesn't > > > modify storage; write() plus fsync() does. Is there a reason why copying > > > via ioctl(FICLONE) isn't similar? > > > > Because FICLONE provides a persistent named clone that is a fully > > functioning file in its own right. 
That means it has to be > > completely independent of the source file by the time the FICLONE > > operation completes. This implies that there is a certain order to > > the operations the clone performs - the data has to be on disk > > before the clone is made persistent and recoverable so that both > > files are guaranteed to have identical contents if we crash > > immediately after the clone completes. > > > > > Finally I understand your explanation that the cost of cloning is > > > proportional to the size of the extent map, and that in the limit where the > > > extent map is very large, cloning a file of size N requires O(N) time. > > > However the constant factors surprise me. If memory serves we were seeing > > > latencies of milliseconds atop DRAM for the first few clones on files that > > > began as sparse files and had only a few blocks written to them. Copying > > > the extent map on a DRAM file system must be tantamount to a bunch of > > > memcpy() calls (right?), > > > > At the IO layer, yes, it's just a memcpy. > > > > But we can't just copy a million extents from one in-memory btree to > > another. We have to modify the filesystem metadata in an atomic, > > transactional, recoverable way. Those transactions work one extent > > at a time because each extent might require a different set of > > modifications. Persistent clones require tracking of the number of > > times a given block on disk is shared so that we know when extent > > removals result in the extent no longer being shared and/or > > referenced. A file that has been cloned a million times might have > > a million extents each shared a different number of times. When we > > remove one of those clones, how do we know which blocks are now > > unreferenced and need to be freed? > > > > IOWs, named persistent clones are *much more complex* than ephemeral > > clones. 
> > The overhead you are measuring is the result of all the > > persistent cross referencing and reference counting metadata we need > > to atomically update on each extent sharing operation to ensure > > long-term persistent clones work correctly. > > > > If we were to implement ephemeral clones as per the mechanism you've > > outlined in the papers above, then we could just copy the in-memory > > extent list btree with a series of memcpy() operations because we > > don't need persistent on-disk shared reference counting to implement > > it.... > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
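For readers who want to reproduce the clone call itself, here is a minimal user-space sketch (hedged: the FICLONE request number 0x40049409 is the _IOW(0x94, 9, int) value from <linux/fs.h>; filesystems without reflink support fail the ioctl, so the sketch falls back to a plain copy):

```python
import errno
import fcntl
import os
import shutil
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def clone_or_copy(src_path, dst_path):
    """Reflink dst from src when the filesystem supports it, else copy."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "cloned"  # extents now shared; cost is O(extents) metadata
        except OSError as e:
            if e.errno not in (errno.EOPNOTSUPP, errno.ENOTTY,
                               errno.EXDEV, errno.EINVAL):
                raise
    shutil.copyfile(src_path, dst_path)  # fallback: actually copy the bytes
    return "copied"

d = tempfile.mkdtemp()
src = os.path.join(d, "src")
dst = os.path.join(d, "dst")
with open(src, "wb") as f:
    f.write(b"hello reflink")
clone_or_copy(src, dst)
assert open(dst, "rb").read() == b"hello reflink"
```

On XFS or btrfs the ioctl path shares extents; everywhere else the fallback moves the bytes, which is exactly the copy cost FICLONE was designed to avoid.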
* atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) 2022-12-20 3:06 ` Darrick J. Wong @ 2022-12-21 22:34 ` Terence Kelly 0 siblings, 0 replies; 14+ messages in thread From: Terence Kelly @ 2022-12-21 22:34 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Suyash Mahar, Dave Chinner, linux-xfs, Suyash Mahar Hi Darrick, I should have mentioned this earlier, but for several years XFS developer Christoph Hellwig has been working on a feature inspired by the FAST 2015 paper. My HP colleagues and I met Christoph at FAST 2015 and he expressed interest in doing something similar in XFS. Since then he has reported doing a considerable amount of work toward that goal, though I don't know the current state of his efforts. I'm just pointing out a possible connection between the "atomic file commits" described below and Christoph's work; I don't know if the implementations are similar, but to an outsider it sounds like they aspire to serve the same purpose: Enabling applications to efficiently evolve files from one well-defined state to another atomically even in the presence of failure. Regardless of how and by whom this goal is achieved, folks like Suyash and I eagerly await the results. May the Force be with you! -- Terence On Mon, 19 Dec 2022, Darrick J. Wong wrote: > ... > > <cough> The bits needed for atomic file commits have been out for review > on fsdevel since **before the COVID19 pandemic started**. It's buried > in the middle of the online repair featureset. > > Summary of the usage model: > > fd = open(sourcefile...) > tmp_fd = open(..., O_TMPFILE) > > ioctl(tmp_fd, FICLONE, fd); /* clone data to temporary file */ > > /* write whatever you want to the temporary file */ > > ioctl(fd, FIEXCHANGE_RANGE, {tmp_fd, file range...}) /* durable commit */ > > close(tmp_fd) > > True, this isn't an ephemeral file -- for such a thing, we could just > duplicate the in-memory data fork and never commit it to disk. 
But that > said, I've been trying to get the parts I /have/ built merged for three > years. > > I'm planning to push the whole giant thing to the list on Thursday. > > --D ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: XFS reflink overhead, ioctl(FICLONE) 2022-12-18 1:46 ` Dave Chinner 2022-12-18 4:47 ` Suyash Mahar @ 2022-12-18 23:40 ` Terence Kelly 2022-12-20 2:16 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Terence Kelly @ 2022-12-18 23:40 UTC (permalink / raw) To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar On Sun, 18 Dec 2022, Dave Chinner wrote: >> https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf > > Ah, now I get it. You want *anonymous ephemeral clones*, not named > persistent clones. For everyone else, so they don't have to read the > paper and try to work it out: > > The mechanism is a hacked O_ATOMIC path ... No. To be clear, nobody now in 2022 is asking for the AdvFS features of the FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS). The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and foreseeable needs, except for performance. I cited the FAST 2015 paper simply to show that I've worked with a clone-based mechanism in the past and it delighted me in every way. It's simply an existence proof that cloning can be delightful for crash tolerance. > Hence the difference in functionality is that FICLONE provides > persistent, unrestricted named clones rather than ephemeral clones. For the record, the AdvFS implementation of clone-based crash tolerance --- the moral equivalent of failure-atomic msync(), which was the topic of my EuroSys 2013 paper --- involved persistent files on durable storage; the files were hidden and were discarded when their usefulness was over but the hidden files were not "ephemeral" in the sense of a file in a DRAM-backed file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash tolerance survived real power failures. But this is a side issue of historical interest only. I mainly want to emphasize that nobody is asking for the behavior of AdvFS in that FAST 2015 paper. 
> We could implement ephemeral clones in XFS, but nobody has ever
> mentioned needing or wanting such functionality until this thread.

Nobody needs or wants such functionality, even in this thread. The current ioctl(FICLONE) is perfect except for performance.

>> https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
>
> Heh. You're still using hardware to do filesystem power fail testing?
> We moved away from needing hardware to do power fail testing of
> filesystems several years ago.
>
> Using functionality like dm-logwrites, we can simulate the effect of
> several hundred different power fail cases with write-by-write replay
> and recovery in the space of a couple of minutes.

Cool. I assume you're familiar with a paper on a similar technique that my HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for Fun and Profit."

> Not only that, failures are fully replayable and so we can actually
> debug every single individual failure without having to guess at the
> runtime context that created the failure or the recovery context that
> exposed the failure.
>
> This infrastructure has provided us with a massive step forward for
> improving crash resilience and recovery capability in ext4, btrfs and
> XFS. These tests are built into automated test suites (e.g. fstests)
> that pretty much all linux fs engineers and distro QE teams run these
> days.

If you think the world would benefit from reading about this technique and using it more widely, I might be able to help. My column in _Queue_ magazine reaches thousands of readers, sometimes tens of thousands. It's about teaching better techniques to working programmers. I'd be honored to help pass along to my readers practical techniques that you're using to improve quality.

> IOWs, hardware based power fail testing of filesystems is largely
> obsolete these days....

I don't mind telling the world that my own past work is obsolete. That's what progress is all about.
>> I'm surprised that in XFS, cloning alone *without* fsync() pushes data
>> down to storage. I would have expected that the implementation of
>> cloning would always operate upon memory alone, and that an explicit
>> fsync() would be required to force data down to durable media.
>> Analogy: write() doesn't modify storage; write() plus fsync() does.
>> Is there a reason why copying via ioctl(FICLONE) isn't similar?
>
> Because FICLONE provides a persistent named clone that is a fully
> functioning file in its own right. That means it has to be completely
> independent of the source file by the time the FICLONE operation
> completes. This implies that there is a certain order to the operations
> the clone performs - the data has to be on disk before the clone is
> made persistent and recoverable so that both files are guaranteed to
> have identical contents if we crash immediately after the clone
> completes.

I thought the rule was that if an application doesn't call fsync() or msync(), no durability of any kind is guaranteed. I thought modern file systems did all their work in DRAM until an explicit fsync/msync or other necessity compelled them to push data down to durable media (in the right order etc.).

Also, we might be using terminology differently:

I use "persistent" in the sense of "outlives processes". Files in /tmp/ and /dev/shm/ are persistent, but not durable.

I use "durable" to mean "written to non-volatile media (HDD or SSD) in such a way as to guarantee that it will survive power cycling."

I expect *persistence* from ioctl(FICLONE) but I didn't expect a *durability* guarantee without fsync(). If I'm understanding you correctly, cloning in XFS gives us durability whether we want it or not.

>> Finally I understand your explanation that the cost of cloning is
>> proportional to the size of the extent map, and that in the limit where
>> the extent map is very large, cloning a file of size N requires O(N)
>> time. However the constant factors surprise me.
>> If memory serves we were seeing latencies of milliseconds atop DRAM
>> for the first few clones on files that began as sparse files and had
>> only a few blocks written to them. Copying the extent map on a DRAM
>> file system must be tantamount to a bunch of memcpy() calls (right?),
>
> At the IO layer, yes, it's just a memcpy.
>
> But we can't just copy a million extents from one in-memory btree to
> another. We have to modify the filesystem metadata in an atomic,
> transactional, recoverable way. Those transactions work one extent at a
> time because each extent might require a different set of modifications.

Ah, so now I see where the time goes. This is clear.

> Persistent clones require tracking of the number of times a given block
> on disk is shared so that we know when extent removals result in the
> extent no longer being shared and/or referenced. A file that has been
> cloned a million times might have a million extents, each shared a
> different number of times. When we remove one of those clones, how do we
> know which blocks are now unreferenced and need to be freed?
>
> IOWs, named persistent clones are *much more complex* than ephemeral
> clones.

Again, I don't know where you're getting "ephemeral" from; that word does not appear in the FAST '15 paper. The AdvFS clones of the FAST '15 paper were both durable and persistent; they were just hidden from the user-visible namespace. A crash (power outage or whatever) caused a file to revert to the most recent hidden clone. In AdvFS, a hidden clone was created by an fsync/msync call. This is how AdvFS made file updates failure-atomic.

Again, we're not asking for the same functionality as the FAST '15 paper. However, if the contrast between what AdvFS did with clones and how XFS works illuminates issues like XFS performance, then it might be worth understanding AdvFS.

Incidentally, I really appreciate the time & effort you're taking to educate me & Suyash.
I hope I'm not being too sluggish a student, though sometimes I am. For the near term, Suyash and I are getting closer to an understanding of today's ioctl(FICLONE) that we can pass along to readers in the paper we're writing.

> The overhead you are measuring is the result of all the persistent cross
> referencing and reference counting metadata we need to atomically update
> on each extent sharing operation to ensure long term persistent clones
> work correctly.

This is clear. Thanks.

> If we were to implement ephemeral clones as per the mechanism you've
> outlined in the papers above, then we could just copy the in-memory
> extent list btree with a series of memcpy() operations because we don't
> need persistent on-disk shared reference counting to implement it....

We're not on the same page about what AdvFS did. Of course I'll understand if you don't have time or interest to get on the same page; we understand that you're busy with a lot of important work.

Thanks for your help and Happy Holidays!

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
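Dave's point about shared-extent reference counting can be illustrated with a toy model. The sketch below is not XFS code or from this thread; it is an assumed, simplified data structure showing why cloning costs O(number of extents) and why removing a clone needs per-extent refcounts to know which blocks to free.

```python
from collections import Counter

class ToyReflinkFS:
    """Toy model of shared-extent reference counting: each file maps to a
    list of on-disk extent ids; a refcount per extent tells us when an
    extent removal actually frees disk blocks."""

    def __init__(self):
        self.files = {}            # file name -> list of extent ids
        self.refcount = Counter()  # extent id -> number of references

    def create(self, name, extents):
        self.files[name] = list(extents)
        self.refcount.update(extents)

    def clone(self, src, dst):
        # O(number of extents): every mapping record is copied and every
        # refcount bumped -- a stand-in for the per-extent transactions
        # the thread is measuring.
        self.create(dst, self.files[src])

    def remove(self, name):
        # Only extents whose refcount drops to zero are really freed.
        freed = []
        for ext in self.files.pop(name):
            self.refcount[ext] -= 1
            if self.refcount[ext] == 0:
                del self.refcount[ext]
                freed.append(ext)
        return freed
```

Removing one of two clones frees nothing; removing the last reference frees the blocks, which is exactly the bookkeeping an ephemeral-clone scheme could skip.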
* Re: XFS reflink overhead, ioctl(FICLONE)
  2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
@ 2022-12-20  2:16 ` Dave Chinner
  2022-12-21 23:07 ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2022-12-20  2:16 UTC (permalink / raw)
To: Terence Kelly; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar

On Sun, Dec 18, 2022 at 06:40:54PM -0500, Terence Kelly wrote:
>
> On Sun, 18 Dec 2022, Dave Chinner wrote:
>
> > > https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf
> >
> > Ah, now I get it. You want *anonymous ephemeral clones*, not named
> > persistent clones. For everyone else, so they don't have to read the
> > paper and try to work it out:
> >
> > The mechanism is a hacked O_ATOMIC path ...
>
> No. To be clear, nobody now in 2022 is asking for the AdvFS features of
> the FAST 2015 paper to be implemented in XFS (or BtrFS or any other FS).
>
> The current XFS/BtrFS/Linux ioctl(FICLONE) is perfect for my current and
> foreseeable needs, except for performance.
>
> I cited the FAST 2015 paper simply to show that I've worked with a
> clone-based mechanism in the past and it delighted me in every way. It's
> simply an existence proof that cloning can be delightful for crash
> tolerance.

Sure, you're preaching to the choir. But the context was quoting a paper as an example of the cloning performance you expected from XFS but weren't getting. You're still talking about how XFS clones are too slow for your needs, but now you are saying you don't want clones for fault tolerance as implemented in AdvFS.

> > Hence the difference in functionality is that FICLONE provides
> > persistent, unrestricted named clones rather than ephemeral clones.
> For the record, the AdvFS implementation of clone-based crash tolerance ---
> the moral equivalent of failure-atomic msync(), which was the topic of my
> EuroSys 2013 paper --- involved persistent files on durable storage; the
> files were hidden and were discarded when their usefulness was over but the
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the very definition of an ephemeral filesystem object. The clones are temporary filesystem objects that exist only within the context of an active file descriptor; users don't know they exist, users cannot discover their existence, and they get cleaned up automatically by the filesystem when they are no longer useful.

Yes, there is some persistent state needed to implement the required garbage collection semantics of the ephemeral object (just like O_TMPFILE!), but that doesn't change the fact that users don't know (or care) that the internal filesystem objects even exist.

Really, I can't think of a better example of an ephemeral object than this, regardless of whether the paper's authors used that term or not.

> hidden files were not "ephemeral" in the sense of a file in a DRAM-backed
> file system (/tmp/ or /dev/shm/ or whatnot). AdvFS crash tolerance survived
> real power failures. But this is a side issue of historical interest only.
>
> I mainly want to emphasize that nobody is asking for the behavior of AdvFS
> in that FAST 2015 paper.

OK, so what are you asking us to do, then?

[....]

> > > https://dl.acm.org/doi/pdf/10.1145/3400899.3400902
> >
> > Heh. You're still using hardware to do filesystem power fail testing? We
> > moved away from needing hardware to do power fail testing of filesystems
> > several years ago.
> >
> > Using functionality like dm-logwrites, we can simulate the effect of
> > several hundred different power fail cases with write-by-write replay
> > and recovery in the space of a couple of minutes.
>
> Cool.
> I assume you're familiar with a paper on a similar technique that my
> HP Labs colleagues wrote circa 2013 or 2014: "Torturing Databases for Fun
> and Profit."

Nope, but it's not a new or revolutionary technique so I'm not surprised that other people have done similar things. There's been plenty of research based on model checking over the past 2-3 decades - the series of Iron Filesystem papers is a good example of this. What we have in fstests is just a version of these concepts that simplifies discovering and debugging previously undiscovered write ordering issues...

> > Not only that, failures are fully replayable and so we can actually
> > debug every single individual failure without having to guess at the
> > runtime context that created the failure or the recovery context that
> > exposed the failure.
> >
> > This infrastructure has provided us with a massive step forward for
> > improving crash resilience and recovery capability in ext4, btrfs and
> > XFS. These tests are built into automated test suites (e.g. fstests)
> > that pretty much all linux fs engineers and distro QE teams run these
> > days.
>
> If you think the world would benefit from reading about this technique
> and using it more widely, I might be able to help. My column in _Queue_
> magazine reaches thousands of readers, sometimes tens of thousands. It's
> about teaching better techniques to working programmers.

You're welcome to do so - the source code is all there, there's a mailing list for fstests where you can ask questions about it, etc. If you think it's valuable for people outside the core linux fs developer community, then you don't need to ask our permission to write an article on it....

> > > I'm surprised that in XFS, cloning alone *without* fsync() pushes
> > > data down to storage. I would have expected that the implementation
> > > of cloning would always operate upon memory alone, and that an
> > > explicit fsync() would be required to force data down to durable
> > > media.
> > > Analogy: write() doesn't modify storage; write() plus
> > > fsync() does. Is there a reason why copying via ioctl(FICLONE) isn't
> > > similar?
> >
> > Because FICLONE provides a persistent named clone that is a fully
> > functioning file in its own right. That means it has to be completely
> > independent of the source file by the time the FICLONE operation
> > completes. This implies that there is a certain order to the operations
> > the clone performs - the data has to be on disk before the clone is
> > made persistent and recoverable so that both files are guaranteed to
> > have identical contents if we crash immediately after the clone
> > completes.
>
> I thought the rule was that if an application doesn't call fsync() or
> msync(), no durability of any kind is guaranteed.

No durability of any kind is guaranteed, but that doesn't preclude the OS and/or filesystem actually performing an operation in a way that guarantees persistence....

That said, the FICLONE API doesn't guarantee persistence. The application still has to call fdatasync() to ensure that all the metadata changes that FICLONE makes are persisted all the way down to stable storage.

> I thought modern file
> systems did all their work in DRAM until an explicit fsync/msync or other
> necessity compelled them to push data down to durable media (in the right
> order etc.).

Largely, they do. But some operations have dependencies and require data/metadata update synchronisation, and at that point we have ordering constraints. To an outside observer, that may look like the filesystem is trying to provide durability, but in fact it is doing nothing of the sort...

I suspect you've seen the data writeback in FICLONE and thought this is because it needs to provide a durability guarantee. For XFS, this is an ordering constraint - we have to ensure the right thing happens with delayed allocation and resolve pending COW operations on a file before we clone the extent map to a new file.
We do this by running writeback to process these pending extent map operations we deferred at write() time. Once those deferred operations have been resolved, we can run the transactions to clone the extent map.

However, if FICLONE is acting on files containing only data at rest, then it can run without doing a single data IO, and the whole clone can be lost on crash if fdatasync() is not run once it is complete.

IOWs, the FICLONE API provides no persistence guarantees. fdatasync/O_DSYNC is still required.

> Also, we might be using terminology differently:
>
> I use "persistent" in the sense of "outlives processes". Files in /tmp/
> and /dev/shm/ are persistent, but not durable.

Yeah, different terminology - you seem to have different frames of reference for the terms you are using. The frame of reference I'm using for terminology is filesystem objects rather than processes or storage.

Stuff that exists purely in memory (such as tmpfs or shm files) is always considered "volatile" - it is lost if the system crashes or shuts down. Volatile storage also includes caches like dirty data in the page cache and storage devices with DRAM-based caches.

Persistent refers to ensuring filesystem objects are not volatile; they do not get lost during shutdown or abnormal termination because they have been guaranteed to exist on stable, permanent storage media.

> I use "durable" to mean "written to non-volatile media (HDD or SSD) in
> such a way as to guarantee that it will survive power cycling."

Sure. We typically refer to non-volatile storage media as "stable storage" because the hardware can be durable in the short term but volatile in the long term. e.g. battery backed RAM is considered "stable" if the battery backup lasts longer than 72 hours, but over long periods it will not retain its contents. Hence calling it "non-volatile media" isn't really correct - the contents are only stable over a fixed timeframe.
Regardless of terminology, "persisting objects to stable storage" is effectively the same thing as "making durable".

> I expect *persistence* from ioctl(FICLONE) but I didn't expect a
> *durability* guarantee without fsync(). If I'm understanding you
> correctly, cloning in XFS gives us durability whether we want it or not.

See above. We provide no guarantees about persistence, but in some cases we can't perform the FICLONE operation correctly without performing most of the operations needed to provide persistence of the source file.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
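The usage rule Dave spells out (FICLONE alone guarantees nothing durable; fdatasync() is still required) can be sketched in a few lines. This is an illustrative Python sketch, not code from the thread: the FICLONE constant value and the set of fallback errno codes are assumptions, and the copy fallback is for filesystems without reflink support.

```python
import errno
import fcntl
import os
import shutil

# _IOW(0x94, 9, int) from <linux/fs.h>; assumed value, verify on your system.
FICLONE = 0x40049409

def clone_and_persist(src_path, dst_path):
    """Reflink-clone src to dst, then fdatasync() the clone: FICLONE by
    itself gives no durability guarantee. Falls back to a full copy on
    filesystems without reflink support. Returns True if reflinked."""
    cloned = False
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # FICLONE is issued on the *destination* fd, source fd as arg.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            cloned = True
        except OSError as e:
            # e.g. ext4, tmpfs, or a cross-filesystem clone attempt
            if e.errno not in (errno.EOPNOTSUPP, errno.EXDEV,
                               errno.EINVAL, errno.ENOTTY):
                raise
            shutil.copyfileobj(src, dst)
        dst.flush()
        # Persist the clone's data and new extent-map metadata.
        os.fdatasync(dst.fileno())
    return cloned
```

On XFS or btrfs the ioctl succeeds and the fdatasync() is what makes the clone crash-safe; elsewhere the function degrades to an ordinary durable copy.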
* wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE))
  2022-12-20  2:16 ` XFS reflink overhead, ioctl(FICLONE) Dave Chinner
@ 2022-12-21 23:07 ` Terence Kelly
  0 siblings, 0 replies; 14+ messages in thread
From: Terence Kelly @ 2022-12-21 23:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: Suyash Mahar, Darrick J. Wong, linux-xfs, Suyash Mahar

Hi Dave,

To answer your question below: When we sent our observations about ioctl(FICLONE) performance recently, starting this e-mail thread, we were hoping for one of several outcomes. Perhaps we were misusing the feature, in which case guidance on how to obtain better performance would be helpful. Or, if we weren't doing anything wrong, an explanation of why ioctl(FICLONE) isn't as fast as we expected based on experience with the clone-based crash-tolerance mechanism in AdvFS. In recent days we've been getting the latter, for which we are grateful. We may try to pass along your explanations in a paper we're writing; if so we'll offer y'all the opportunity to review this paper and ask if you'd like to be acknowledged.

In the longer term, we're very interested in any developments related to crash tolerance. The details of interfaces are less important as long as user-level applications can, with reasonable convenience and performance, obtain a simple guarantee: Following a power failure or other crash, a file can always be restored to a state that the application deemed consistent (application-level invariants & correctness criteria hold).

Ideally the application would like a synchronous function call whose successful return provides the consistent-recoverability guarantee for the current state of the file. That's the guarantee that the original failure-atomic msync() of EuroSys 2013 provided.

Obtaining this guarantee with ioctl(FICLONE) is quite convenient: When the application knows that the file is in a consistent state, the application makes a clone and stashes the clone in a safe place.
Loosely speaking, the performance desired is that the work of cloning should be "O(delta) not O(data)", i.e., the time and effort required to make & stash a clone should be proportional to the amount of data in the file changed between consecutive clones, not to the logical size of the entire file. I gather from our recent correspondence that XFS cloning today requires O(data) time and effort, not O(delta). Which is progress; we have a much better understanding of what's going on under the hood.

We understand that you're volunteers and that you're busy with many important matters. We're not asking for any further work, though we'll surely applaud from the sidelines any improvements toward crash tolerance.

I've been thinking about alternative approaches to crash tolerance for over a decade. In practice today people use things like relational databases and transactional key-value stores to protect application data integrity from crashes. I'm interested in other approaches, including but not limited to failure-atomic msync() and the moral equivalents thereof implemented with help from file systems. I've worked on a half-dozen variants of this theme and I'd be happy to explain why I think this area is exciting to anyone willing to listen. In a nutshell, I look forward to the day when file systems render relational databases and transactional key-value stores obsolete for some (not all) use cases.

Thanks again for your extraordinary help clarifying matters, which goes above & beyond the call of duty, and happy holidays!

-- Terence

On Tue, 20 Dec 2022, Dave Chinner wrote:

>> I mainly want to emphasize that nobody is asking for the behavior of
>> AdvFS in that FAST 2015 paper.
>
> OK, so what are you asking us to do, then?
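The "clone and stash" recipe Terence describes can be sketched concretely. This is an illustrative Python sketch, not code from the thread: the FICLONE constant value, the fallback errno set, and the ".ckpt" naming are all assumptions, and production-grade crash tolerance would also fsync the containing directory and handle errors more carefully.

```python
import errno
import fcntl
import os
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>; assumed value

def checkpoint(path):
    """Stash a recovery checkpoint of `path` once the application deems
    its contents consistent. On a reflink filesystem the clone step is
    cheap; elsewhere we fall back to a full O(data) copy."""
    ckpt, tmp = path + ".ckpt", path + ".ckpt.tmp"
    # 1. The consistent contents must reach stable storage first.
    with open(path, "rb") as f:
        os.fsync(f.fileno())
    # 2. Clone (or copy) into a temporary name, persist it, then rename
    #    it into place so the old checkpoint is replaced atomically.
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError as e:
            if e.errno not in (errno.EOPNOTSUPP, errno.EXDEV,
                               errno.EINVAL, errno.ENOTTY):
                raise
            shutil.copyfileobj(src, dst)  # non-reflink fallback
        dst.flush()
        os.fsync(dst.fileno())
    os.replace(tmp, ckpt)
    # (A fully crash-safe version would also fsync the directory here.)

def restore(path):
    """After a crash, roll `path` back to its last checkpoint."""
    shutil.copyfile(path + ".ckpt", path)
```

On XFS the checkpoint() call above pays the per-extent metadata cost discussed in this thread, which is why its latency grows as the file's extent map fragments.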
end of thread, other threads:[~2022-12-21 23:08 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CACQnzjuhRzNruTm369wVQU3y091da2c+h+AfRED+AtA-dYqXNQ@mail.gmail.com>
2022-12-13 17:18 ` XFS reflink overhead, ioctl(FICLONE) Darrick J. Wong
2022-12-14 1:46 ` Terence Kelly
2022-12-14 4:47 ` Suyash Mahar
2022-12-15 0:19 ` Dave Chinner
2022-12-16 1:06 ` Terence Kelly
2022-12-17 17:30 ` Mike Fleetwood
2022-12-17 18:43 ` Terence Kelly
2022-12-18 1:46 ` Dave Chinner
2022-12-18 4:47 ` Suyash Mahar
2022-12-20 3:06 ` Darrick J. Wong
2022-12-21 22:34 ` atomic file commits (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
2022-12-18 23:40 ` XFS reflink overhead, ioctl(FICLONE) Terence Kelly
2022-12-20 2:16 ` Dave Chinner
2022-12-21 23:07 ` wish list for Santa (was: Re: XFS reflink overhead, ioctl(FICLONE)) Terence Kelly
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox