* Snapshots, Dirty Data, and Power Failure
@ 2020-11-24 16:03 Ellis H. Wilson III
  2020-11-25  4:24 ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-24 16:03 UTC (permalink / raw)
To: Btrfs BTRFS

Hi all,

Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.

I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:

1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?

2. If I snapshot subvol A and it has dirty data outstanding, what power failure guarantees do I have after the snapshot completes? Is everything that was written to subvol A prior to the snapshot guaranteed to be safely on-disk following the successful snapshot?

3. If I snapshot subvol A, and unrelated subvol B has dirty data outstanding in the buffer cache, does the snapshot of A first flush out the dirty data to subvol B prior to taking the snapshot? In other words, does a snapshot of a BTRFS subvolume require all dirty data for all other subvolumes in the filesystem to be sync'd, and if so, is all previously written data (even to subvol B) power-fail protected following the successful snapshot completion of A?

4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).

Thanks as always,

ellis

^ permalink raw reply [flat|nested] 5+ messages in thread
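[If the delay really is dominated by flushing dirty data, one practical mitigation is to force the flush explicitly just before snapshotting, so the snapshot call itself has little left to commit. A minimal sketch of that idea; the pool layout and snapshot naming are hypothetical, and `BTRFS` defaults to a dry-run echo so it is safe to try without root or a btrfs filesystem:]

```shell
# Sketch: flush dirty data before snapshotting so the snapshot ioctl itself
# has little left to commit.  Paths are hypothetical; BTRFS defaults to a
# dry-run echo, so this is safe to run without root or a btrfs mount.
BTRFS="${BTRFS:-echo btrfs}"

POOL=/mnt/pool                                  # hypothetical btrfs mount point
SRC="$POOL/subvolA"
DST="$POOL/.snapshots/subvolA.$(date +%Y%m%d-%H%M%S)"

sync                                            # flush dirty pages everywhere
$BTRFS filesystem sync "$POOL"                  # force a btrfs transaction commit
$BTRFS subvolume snapshot -r "$SRC" "$DST"      # should now return quickly
```

This only shifts the wait out of the snapshot command; the total flush work is the same, and a concurrent writer can still extend it.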
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-24 16:03 Snapshots, Dirty Data, and Power Failure Ellis H. Wilson III
@ 2020-11-25  4:24 ` Zygo Blaxell
  2020-11-25 15:16   ` Ellis H. Wilson III
  0 siblings, 1 reply; 5+ messages in thread

From: Zygo Blaxell @ 2020-11-25 4:24 UTC (permalink / raw)
To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
> Hi all,
>
> Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.
>
> I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:
>
> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?

As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk. I recently observed a process unpacking a compressed image file that made a 'btrfs sub snap' command run for 3.8 hours, stopping only when the decompress program ran out of image file to write.

While this is happening, only writes to existing files can proceed. All other write operations (e.g. unlink or mkdir) will block.

AIUI there have been attempts to create mechanisms in btrfs that either throttle writes or defer them to a later transaction to prevent snapshot creation (and other btrfs write operations) from running indefinitely. These latency control mechanisms are apparently incomplete or broken on current kernels. e.g. commit 3cd24c698004d2f7668e0eb9fc1f096f533c791b "btrfs: use tagged writepage to mitigate livelock of snapshot" marks writes to inodes within a subvol so that they are not flushed during the current commit if that subvol is the origin of a snapshot create and the write occurs after the snapshot create started; however, this only prevents writes in the snapshotted subvol from delaying the snapshot--it does nothing about writes to other subvols on the filesystem, which is what happened in my 3.8 hour case.

> 2. If I snapshot subvol A and it has dirty data outstanding, what power failure guarantees do I have after the snapshot completes? Is everything that was written to subvol A prior to the snapshot guaranteed to be safely on-disk following the successful snapshot?

Snapshot create implies transaction commit with delalloc flush, so it will all be complete on disk as of some point during the snapshot create call (closer to the function return than to function entry).

> 3. If I snapshot subvol A, and unrelated subvol B has dirty data outstanding in the buffer cache, does the snapshot of A first flush out the dirty data to subvol B prior to taking the snapshot? In other words, does a snapshot of a BTRFS subvolume require all dirty data for all other subvolumes in the filesystem to be sync'd, and if so, is all previously written data (even to subvol B) power-fail protected following the successful snapshot completion of A?

Snapshot create implies a transaction commit with delalloc flush, which puts all previously (or simultaneously) written data on disk. Since kernel 5.0 there is no mechanism to prevent delalloc flush from running until the disk fills up. There's an exception to this rule that excludes the origin subvol from delalloc flush if the extents are added after the snapshot create has begun; however, this isn't done for the rest of the filesystem.

> 4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).

If the snapshots are being created in the same directory, then each one will try to hold a VFS-level directory lock to create the new directory entry, so they can only execute sequentially.

If the snapshots are being created in different directories, then it should be possible to run the snapshot creates in parallel. They will likely all end at close to the same time, though, as they're all trying to complete a filesystem-wide flush, and none of them can proceed until that is done. An aggressive writer process could still add arbitrary delays.

> Thanks as always,
>
> ellis
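[The different-directories advice above can be sketched concretely. The subvolume names and layout are hypothetical, and `BTRFS` defaults to a dry-run echo so the sketch runs without root or a btrfs filesystem:]

```shell
# Sketch of parallel snapshot creation into distinct directories, as
# suggested above.  Names are hypothetical; BTRFS defaults to a dry-run
# echo, so no root or btrfs filesystem is needed to try it.
BTRFS="${BTRFS:-echo btrfs}"
POOL=/mnt/pool
STAMP=$(date +%Y%m%d-%H%M%S)

for sub in subvolA subvolB subvolC; do
    # one destination directory per subvolume avoids the shared VFS
    # directory lock, so the creates are not forced to serialize
    $BTRFS subvolume snapshot -r "$POOL/$sub" "$POOL/snaps-$sub/$STAMP" &
done
wait   # they still share one filesystem-wide flush, so they tend to finish together
```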
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-25  4:24 ` Zygo Blaxell
@ 2020-11-25 15:16   ` Ellis H. Wilson III
  2020-11-26 14:15     ` Ellis H. Wilson III
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-25 15:16 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 11/24/20 11:24 PM, Zygo Blaxell wrote:
> On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
>> Hi all,
>>
>> Back with more esoteric questions. We find that snapshots on an idle BTRFS subvolume are extremely fast, but if there is plenty of data in-flight (i.e., in the buffer cache and not yet sync'd down) it can take dozens of seconds to a minute or so for the snapshot to return successfully.
>>
>> I presume this delay is for the data that was accepted but not yet sync'd to disk to get flushed out prior to taking the snapshot. However, I don't have details to answer the following questions aside from spending a long time in the code:
>>
>> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
>
> As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.

Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.

>> 4. Is there any way to efficiently take a snapshot of a bunch of subvolumes at once? If the answer to #2 is that all dirty data is sync'd for all subvolumes for a snapshot of any subvolume, we're liable to have significantly less to do on the consecutive subvolumes that are getting snapshotted right afterwards, but I believe these still imply a BTRFS root commit, and as such can be expensive in terms of disk I/O (at least linear with the number of 'simultaneous' snapshots).
>
> If the snapshots are being created in the same directory, then each one will try to hold a VFS-level directory lock to create the new directory entry, so they can only execute sequentially.
>
> If the snapshots are being created in different directories, then it should be possible to run the snapshot creates in parallel. They will likely all end at close to the same time, though, as they're all trying to complete a filesystem-wide flush, and none of them can proceed until that is done. An aggressive writer process could still add arbitrary delays.

Very helpful, and yes, the snapshots in this case are being done to different subvolumes. I think if we can solve the writer problem (I have some ideas) on our side then we should be good to go.

Thank you very much for your time Zygo!

ellis
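[For scale, the upper bound discussed in this exchange (free space divided by sustained write speed) is easy to estimate. A back-of-envelope sketch with illustrative numbers, not figures from the thread's actual system:]

```python
# Back-of-envelope estimate of the bound discussed above: a snapshot create
# can be held open roughly as long as writers can keep generating dirty
# data, i.e. free space divided by sustained write speed.  The numbers
# below are illustrative only.

def worst_case_snapshot_seconds(free_bytes: float, write_bytes_per_sec: float) -> float:
    """Upper bound on how long writers can delay a snapshot create."""
    return free_bytes / write_bytes_per_sec

free = 2 * 1024**4          # 2 TiB of free space
rate = 150 * 1024**2        # a writer sustaining 150 MiB/s
hours = worst_case_snapshot_seconds(free, rate) / 3600
print(f"{hours:.1f} hours")  # prints "3.9 hours"
```

A multi-hour stall from a single sustained writer is therefore plausible, comparable to the 3.8-hour case reported above.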
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-25 15:16 ` Ellis H. Wilson III
@ 2020-11-26 14:15   ` Ellis H. Wilson III
  2020-11-26 15:10     ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread

From: Ellis H. Wilson III @ 2020-11-26 14:15 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Btrfs BTRFS

On 11/25/20 10:16 AM, Ellis H. Wilson III wrote:
> On 11/24/20 11:24 PM, Zygo Blaxell wrote:
>> On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
>>> 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
>>
>> As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.
>
> Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.

One other potentially related question:

How does snapshot removal impact snapshot time? If I issue a non-commit snapshot deletion (which AFAIK proceeds in the background), and then a few seconds later I take a snapshot of that same subvolume, should I expect to have to wait for the snapshot removal to be processed before the snapshot I just took completes?

Best,

ellis
* Re: Snapshots, Dirty Data, and Power Failure
  2020-11-26 14:15 ` Ellis H. Wilson III
@ 2020-11-26 15:10   ` Zygo Blaxell
  0 siblings, 0 replies; 5+ messages in thread

From: Zygo Blaxell @ 2020-11-26 15:10 UTC (permalink / raw)
To: Ellis H. Wilson III; +Cc: Btrfs BTRFS

On Thu, Nov 26, 2020 at 09:15:34AM -0500, Ellis H. Wilson III wrote:
> On 11/25/20 10:16 AM, Ellis H. Wilson III wrote:
> > On 11/24/20 11:24 PM, Zygo Blaxell wrote:
> > > On Tue, Nov 24, 2020 at 11:03:15AM -0500, Ellis H. Wilson III wrote:
> > > > 1. Is my presumption just incorrect, and are there some other time-consuming mechanics taking place during a snapshot that would cause these longer times for it to return successfully?
> > >
> > > As far as I can tell, the upper limit of snapshot creation time is bounded only by the size of the filesystem divided by the average write speed, i.e. it's possible to keep 'btrfs sub snapshot' running for as long as it takes to fill the disk.
> >
> > Ahhh. That is extremely enlightening, and exactly what we're seeing. I presumed there was some form of quiescence when a snapshot was taken such that writes that were inbound would block until it was complete, but I couldn't reason about why it was taking SO long to get everything flushed out. This exactly explains it, as we only block out incoming writes to the subvolume being snapshotted -- not other volumes.
>
> One other potentially related question:
>
> How does snapshot removal impact snapshot time? If I issue a non-commit snapshot deletion (which AFAIK proceeds in the background), and then a few seconds later I take a snapshot of that same subvolume, should I expect to have to wait for the snapshot removal to be processed before the snapshot I just took completes?

Snapshot deletion is a fast, unthrottled delayed ref generator, so it will have a significant impact on latency and performance over the entire filesystem while it runs (and even some time after). Snapshot delete is permitted to continue to create new delayed ref updates while transaction commit is trying to flush them to disk, so the situation is similar to the writer case--transaction commit can be indefinitely postponed as long as there is free disk space and deleted subvol data available to remove.

> Best,
>
> ellis
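[On the btrfs-progs side, `btrfs subvolume delete` by default only unlinks the subvolume and lets cleanup run in the background; the tool also offers `-c`/`--commit-after` to wait for the transaction commit, and `btrfs subvolume sync` to wait until deleted subvolumes are actually gone. A sketch of the three behaviors; paths are hypothetical and `BTRFS` defaults to a dry-run echo:]

```shell
# Sketch of the deletion behaviors discussed above.  Paths are hypothetical;
# BTRFS defaults to a dry-run echo, so this is safe without root.
BTRFS="${BTRFS:-echo btrfs}"
POOL=/mnt/pool

# default: unlink the subvol and return; cleanup proceeds in the background
$BTRFS subvolume delete "$POOL/.snapshots/old"

# -c / --commit-after: also wait for the transaction commit (this still does
# not wait for the background delayed-ref cleanup itself)
$BTRFS subvolume delete -c "$POOL/.snapshots/older"

# wait until the deleted subvolumes have been completely removed from disk
$BTRFS subvolume sync "$POOL"
```

Given the latency impact described above, waiting via `subvolume sync` before taking the next snapshot may be the safer sequencing when deletes and creates target the same filesystem.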