* Snapshots, Dirty Data, and Power Failure
@ 2020-11-24 16:03 Ellis H. Wilson III
  2020-11-25  4:24 ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-24 16:03 UTC (permalink / raw)
To: Btrfs BTRFS

Hi all,

Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.

I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:

1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?

2. If I snapshot subvol A and it has dirty data outstanding, what power failure guarantees do I have after the snapshot completes? Is everything that was written to subvol A prior to the snapshot guaranteed to be safely on-disk following the successful snapshot?

3. If I snapshot subvol A, and unrelated subvol B has dirty data outstanding in the buffer cache, does the snapshot of A first flush out the dirty data to subvol B prior to taking the snapshot? In other words, does a snapshot of a BTRFS subvolume require all dirty data for all other subvolumes in the filesystem to be sync'd, and if so, is all previously written data (even to subvol B) power-fail protected following the successful snapshot completion of A?

4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).

Thanks as always,

ellis

^ permalink raw reply [flat|nested] 5+ messages in thread
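[If the delay really is dominated by flushing dirty data, one practical mitigation is to force the flush explicitly just before snapshotting, so the snapshot call itself has little left to commit. A minimal sketch of that idea; the pool layout and snapshot naming are hypothetical, and `BTRFS` defaults to a dry-run echo so it is safe to try without root or a btrfs filesystem:]

```shell
# Sketch: flush dirty data before snapshotting so the snapshot ioctl itself
# has little left to commit.  Paths are hypothetical; BTRFS defaults to a
# dry-run echo, so this is safe to run without root or a btrfs mount.
BTRFS="${BTRFS:-echo btrfs}"

POOL=/mnt/pool                                  # hypothetical btrfs mount point
SRC="$POOL/subvolA"
DST="$POOL/.snapshots/subvolA.$(date +%Y%m%d-%H%M%S)"

sync                                            # flush dirty pages everywhere
$BTRFS filesystem sync "$POOL"                  # force a btrfs transaction commit
$BTRFS subvolume snapshot -r "$SRC" "$DST"      # should now return quickly
```

This only shifts the wait out of the snapshot command; the total flush work is the same, and a concurrent writer can still extend it.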
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-24 16:03 Snapshots, Dirty Data, and Power Failure Ellis H. Wilson III
@ 2020-11-25  4:24 ` Zygo Blaxell
  2020-11-25 15:16   ` Ellis H. Wilson III
  0 siblings, 1 reply; 5+ messages in thread

From: Zygo Blaxell @ 2020-11-25 4:24 UTC (permalink / raw)
To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
> Hi all,
>
> Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.
>
> I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:
>
> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?

As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk. I recently observed a process unpacking a compressed image file that made a 'btrfs sub snap' command run for 3.8 hours, stopping only when the decompress program ran out of image file to write.

While this is happening, only writes to existing files can proceed. All other write operations (e.g. unlink or mkdir) will block.

AIUI there have been attempts to create mechanisms in btrfs that either throttle writes or defer them to a later transaction to prevent snapshot creation (and other btrfs write operations) from running indefinitely. These latency control mechanisms are apparently incomplete or broken on current kernels. e.g. commit 3cd24c698004d2f7668e0eb9fc1f096f533c791b "btrfs: use tagged writepage to mitigate livelock of snapshot" marks writes to inodes within a subvol so that they are not flushed during the current commit if that subvol is the origin of a snapshot create and the write occurs after the snapshot create started; however, this only prevents writes in the snapshotted subvol from delaying the snapshot--it does nothing about writes to other subvols on the filesystem, which is what happened in my 3.8 hour case.

> 2. If I snapshot subvol A and it has dirty data outstanding, what power failure guarantees do I have after the snapshot completes? Is everything that was written to subvol A prior to the snapshot guaranteed to be safely on-disk following the successful snapshot?

Snapshot create implies transaction commit with delalloc flush, so it will all be complete on disk as of some point during the snapshot create call (closer to the function return than to function entry).

> 3. If I snapshot subvol A, and unrelated subvol B has dirty data outstanding in the buffer cache, does the snapshot of A first flush out the dirty data to subvol B prior to taking the snapshot? In other words, does a snapshot of a BTRFS subvolume require all dirty data for all other subvolumes in the filesystem to be sync'd, and if so, is all previously written data (even to subvol B) power-fail protected following the successful snapshot completion of A?

Snapshot create implies a transaction commit with delalloc flush, which puts all previously (or simultaneously) written data on disk. Since kernel 5.0 there is no mechanism to prevent delalloc flush from running until the disk fills up. There's an exception to this rule that excludes the origin subvol from delalloc flush if the extents are added after the snapshot create has begun; however, this isn't done for the rest of the filesystem.

> 4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).

If the snapshots are being created in the same directory, then each one will try to hold a VFS-level directory lock to create the new directory entry, so they can only execute sequentially.

If the snapshots are being created in different directories, then it should be possible to run the snapshot creates in parallel. They will likely all end at close to the same time, though, as they're all trying to complete a filesystem-wide flush, and none of them can proceed until that is done. An aggressive writer process could still add arbitrary delays.

> Thanks as always,
>
> ellis
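[The different-directories advice above can be sketched concretely. The subvolume names and layout are hypothetical, and `BTRFS` defaults to a dry-run echo so the sketch runs without root or a btrfs filesystem:]

```shell
# Sketch of parallel snapshot creation into distinct directories, as
# suggested above.  Names are hypothetical; BTRFS defaults to a dry-run
# echo, so no root or btrfs filesystem is needed to try it.
BTRFS="${BTRFS:-echo btrfs}"
POOL=/mnt/pool
STAMP=$(date +%Y%m%d-%H%M%S)

for sub in subvolA subvolB subvolC; do
    # one destination directory per subvolume avoids the shared VFS
    # directory lock, so the creates are not forced to serialize
    $BTRFS subvolume snapshot -r "$POOL/$sub" "$POOL/snaps-$sub/$STAMP" &
done
wait   # they still share one filesystem-wide flush, so they tend to finish together
```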
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-25  4:24 ` Zygo Blaxell
@ 2020-11-25 15:16   ` Ellis H. Wilson III
  2020-11-26 14:15     ` Ellis H. Wilson III
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-25 15:16 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 11/24/20 11:24 PM, Zygo Blaxell wrote:
> On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
>> Hi all,
>>
>> Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.
>>
>> I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:
>>
>> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
>
> As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.

Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.

>> 4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).
>
> If the snapshots are being created in the same directory, then each one will try to hold a VFS-level directory lock to create the new directory entry, so they can only execute sequentially.
>
> If the snapshots are being created in different directories, then it should be possible to run the snapshot creates in parallel. They will likely all end at close to the same time, though, as they're all trying to complete a filesystem-wide flush, and none of them can proceed until that is done. An aggressive writer process could still add arbitrary delays.

Very helpful, and yes, the snapshots in this case are being done to different subvolumes. I think if we can solve the writer problem (I have some ideas) on our side then we should be good to go.

Thank you very much for your time Zygo!

ellis
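[For scale, the upper bound discussed in this exchange (free space divided by sustained write speed) is easy to estimate. A back-of-envelope sketch with illustrative numbers, not figures from the thread's actual system:]

```python
# Back-of-envelope estimate of the bound discussed above: a snapshot create
# can be held open roughly as long as writers can keep generating dirty
# data, i.e. free space divided by sustained write speed.  The numbers
# below are illustrative only.

def worst_case_snapshot_seconds(free_bytes: float, write_bytes_per_sec: float) -> float:
    """Upper bound on how long writers can delay a snapshot create."""
    return free_bytes / write_bytes_per_sec

free = 2 * 1024**4          # 2 TiB of free space
rate = 150 * 1024**2        # a writer sustaining 150 MiB/s
hours = worst_case_snapshot_seconds(free, rate) / 3600
print(f"{hours:.1f} hours")  # prints "3.9 hours"
```

A multi-hour stall from a single sustained writer is therefore plausible, comparable to the 3.8-hour case reported above.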
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-25 15:16 ` Ellis H. Wilson III
@ 2020-11-26 14:15   ` Ellis H. Wilson III
  2020-11-26 15:10     ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-26 14:15 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 11/25/20 10:16 AM, Ellis H. Wilson III wrote:
> On 11/24/20 11:24 PM, Zygo Blaxell wrote:
>> On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
>>> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
>>
>> As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.
>
> Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.

One other potentially related question:

How does snapshot removal impact snapshot time? If I issue a non-commit snapshot deletion (which AFAIK proceeds in the background), and then a few seconds later I take a snapshot of that same subvolume, should I expect to have to wait for the snapshot removal to be processed before the snapshot I just took completes?

Best,

ellis
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-26 14:15 ` Ellis H. Wilson III
@ 2020-11-26 15:10   ` Zygo Blaxell
  0 siblings, 0 replies; 5+ messages in thread

From: Zygo Blaxell @ 2020-11-26 15:10 UTC (permalink / raw)
To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Thu, Nov 26, 2020 at 09:15:34AM -0500, Ellis H. Wilson III wrote:
> On 11/25/20 10:16 AM, Ellis H. Wilson III wrote:
> > On 11/24/20 11:24 PM, Zygo Blaxell wrote:
> > > On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
> > > > 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
> > >
> > > As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.
> >
> > Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.
>
> One other potentially related question:
>
> How does snapshot removal impact snapshot time? If I issue a non-commit snapshot deletion (which AFAIK proceeds in the background), and then a few seconds later I take a snapshot of that same subvolume, should I expect to have to wait for the snapshot removal to be processed before the snapshot I just took completes?

Snapshot deletion is a fast, unthrottled delayed ref generator, so it will have a significant impact on latency and performance over the entire filesystem while it runs (and even some time after). Snapshot delete is permitted to continue to create new delayed ref updates while transaction commit is trying to flush them to disk, so the situation is similar to the writer case--transaction commit can be indefinitely postponed as long as there is free disk space and deleted subvol data available to remove.

> Best,
>
> ellis
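[On the btrfs-progs side, `btrfs subvolume delete` by default only unlinks the subvolume and lets cleanup run in the background; the tool also offers `-c`/`--commit-after` to wait for the transaction commit, and `btrfs subvolume sync` to wait until deleted subvolumes are actually gone. A sketch of the three behaviors; paths are hypothetical and `BTRFS` defaults to a dry-run echo:]

```shell
# Sketch of the deletion behaviors discussed above.  Paths are hypothetical;
# BTRFS defaults to a dry-run echo, so this is safe without root.
BTRFS="${BTRFS:-echo btrfs}"
POOL=/mnt/pool

# default: unlink the subvol and return; cleanup proceeds in the background
$BTRFS subvolume delete "$POOL/.snapshots/old"

# -c / --commit-after: also wait for the transaction commit (this still does
# not wait for the background delayed-ref cleanup itself)
$BTRFS subvolume delete -c "$POOL/.snapshots/older"

# wait until the deleted subvolumes have been completely removed from disk
$BTRFS subvolume sync "$POOL"
```

Given the latency impact described above, waiting via `subvolume sync` before taking the next snapshot may be the safer sequencing when deletes and creates target the same filesystem.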